True Story Follows
So I’m on the grind, writing code like it’s 1995, and at work I need some autocompletion functionality. Fortunately, I’d written an autocompletion module before on a side project, so I just busted that out and opened a pull request. So within a matter of minutes, I’d graced my coworkers with hundreds of lines of beautiful code to review. The response: “uuhh…is there no open source library that already does this?”
So I double checked, and there was not. So the new plan was to release an open source project for autocompletion in Python so that I could then say to my coworkers, “psshh, of course there’s an open source library that does this, so it must be good. Let’s use that!”
Also, I’ve been on a quest to do more than just be a code monkey and to do work that has leverage and impacts more than just my own immediate code base.
It’s on the internets right here: Finisher: An Autocompletion Library for Python
How it Works
There are in essence two things going on in this library: autocompletion and spellcheck. Autocompletion assumes the input tokens are intended but possibly incomplete, so each input token is treated as a substring of some complete token in the training model (a list of strings). Every valid token in the model gets broken down into its “n grams,” and a dictionary maps each n gram to the set of tokens it appears in. Each of those tokens in turn maps to the full length strings it appears in. So for any set of input n grams, we can look up the tokens they correspond to, then all of the full strings those tokens appear in, and finally score those full strings by which ones match best.
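To make that concrete, here’s a minimal sketch of the idea in plain Python. This is not Finisher’s actual code, and the function names (`build_index`, `autocomplete`) are my own placeholders; it just shows the two-level dictionary (n gram → tokens → full strings) and a naive count-the-matching-n-grams scoring pass:

```python
from collections import defaultdict

def ngrams(token, n=3):
    """All character n-grams of a token (short tokens yield themselves)."""
    if len(token) <= n:
        return {token}
    return {token[i:i + n] for i in range(len(token) - n + 1)}

def build_index(phrases, n=3):
    """Build the two dictionaries: n-gram -> tokens, token -> full phrases."""
    gram_to_tokens = defaultdict(set)
    token_to_phrases = defaultdict(set)
    for phrase in phrases:
        for token in phrase.lower().split():
            token_to_phrases[token].add(phrase)
            for gram in ngrams(token, n):
                gram_to_tokens[gram].add(token)
    return gram_to_tokens, token_to_phrases

def autocomplete(query, gram_to_tokens, token_to_phrases, n=3):
    """Score candidate phrases by how many query n-grams they share."""
    scores = defaultdict(int)
    for token in query.lower().split():
        for gram in ngrams(token, n):
            for candidate in gram_to_tokens.get(gram, ()):
                for phrase in token_to_phrases[candidate]:
                    scores[phrase] += 1
    # best-scoring full strings first
    return sorted(scores, key=scores.get, reverse=True)
```

The payoff of the index is that a partial token like `worl` still hits the n grams of `world`, so the lookup works on incomplete input without any fuzzy matching at query time.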
Spellcheck is a little more intense. Given a token, we can generate all possible single-character typos of that token. We can go a dimension deeper and generate the typos of each of those typos, since a user’s typo of a typo can still land on a real token. So for every input token, we can use the token itself if it’s valid, a one-typo variant if that exists in the model, or a typo of a typo if that exists.
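Here’s a sketch of that two-level typo idea using the classic edit-distance trick (deletes, transposes, replaces, inserts). Again, this is an illustration rather than Finisher’s implementation; in particular, the library may well precompute typo dictionaries ahead of time rather than generating edits at query time as this toy version does:

```python
import string

def edits1(token):
    """All strings one typo away: deletes, transposes, replaces, inserts."""
    letters = string.ascii_lowercase
    splits = [(token[:i], token[i:]) for i in range(len(token) + 1)]
    deletes = {L + R[1:] for L, R in splits if R}
    transposes = {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1}
    replaces = {L + c + R[1:] for L, R in splits if R for c in letters}
    inserts = {L + c + R for L, R in splits for c in letters}
    return deletes | transposes | replaces | inserts

def correct(token, vocabulary):
    """The token itself, else a known token one typo away, else two typos away."""
    if token in vocabulary:
        return token
    one = edits1(token)
    known_one = one & vocabulary
    if known_one:
        return min(known_one)  # deterministic pick; a real model would rank
    known_two = {e2 for e1 in one for e2 in edits1(e1)} & vocabulary
    if known_two:
        return min(known_two)
    return token
```

The “typo of a typo” set grows fast (tens of thousands of candidates even for short tokens), which is exactly why this part is the performance-sensitive one.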
So we can take an input blob of text, tokenize it, try to convert all of those tokens to valid tokens, and then find the best match from those tokens. From there, the only real concern is making things run as fast as possible. I’m pretty sure there’s still room for improvement on that front, so I’m hoping to add some updates.
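Wired together, the whole pipeline is only a few lines. This self-contained sketch uses `difflib` as a stand-in for the real typo handling, and the phrases and function name are hypothetical, but the tokenize → snap-to-valid-token → rank shape is the same:

```python
import difflib

PHRASES = ["open source library", "open a pull request"]  # toy training data
VOCAB = {t for p in PHRASES for t in p.split()}

def complete(blob):
    """Tokenize, snap each token to the nearest known token, rank phrases."""
    tokens = []
    for tok in blob.lower().split():
        # difflib stands in for real typo correction here
        match = difflib.get_close_matches(tok, VOCAB, n=1)
        tokens.append(match[0] if match else tok)
    # rank phrases by how many corrected tokens they contain
    scored = [(sum(t in p.split() for t in tokens), p) for p in PHRASES]
    return [p for score, p in sorted(scored, reverse=True) if score > 0]
```

Even with two typos in the input, `complete("opne sorce")` snaps the tokens to `open` and `source` and ranks “open source library” first.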
Releasing a pip installable library
This was surprisingly pretty easy. It took a while to find the right guide, but I ended up finding the best information here.
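For anyone who hasn’t done it before, the core of it is a single `setup.py` at the project root. The metadata below is placeholder values, not Finisher’s actual setup file, but this is roughly the minimal shape:

```python
# setup.py -- a minimal, hypothetical example; names and metadata are placeholders
from setuptools import setup

setup(
    name="finisher",
    version="0.1.0",
    description="An autocompletion library for Python",
    author="Your Name",
    packages=["finisher"],  # the package directory to install
)
```

With that in place, building a source distribution and uploading it to PyPI is a couple of commands, after which `pip install finisher` just works.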