Writing Your Own Autocomplete Engine from Scratch in Python (with SpellCheck)

True Story Follows

So I’m on the grind with Exercise Library, and to make a long story short, I wanted an autocomplete search feature. If I were really lazy and otherwise a terrible person, I could save a bunch of strings to SQL and write queries something like:

SELECT exercise_name FROM exercise_table WHERE exercise_name LIKE 'bar%';

But I’m better than that. And so are you.

So I have experience with Solr and Haystack, which have autocomplete. I’ve also recently played around with Amazon CloudSearch, which has it as well.

But the reality is that this is a really simple application: any sort of third-party search service would be overkill, and I didn’t want to create additional dependencies.

Spell Check

I found a really thorough and interesting blog post on this, complete with code samples: Peter Norvig’s “How to Write a Spelling Corrector.”

The only real difference is that my use case is far simpler. The blog post above corrects spelling against all the words in the English dictionary; I’m just spell checking against a finite data set of about 700 exercise names.
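The code below leans on one small helper of mine, read_file_as_json. Its implementation isn’t the interesting part, but so the examples are self-contained, here’s a minimal sketch of what it presumably does (the real version lives in exercise_library.utils):

import json

def read_file_as_json(file_name):
    # Load a JSON file; exercises.json is a list of {"name": ...} objects
    with open(file_name) as f:
        return json.load(f)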

The resulting code for my case looks like this:

import re
from collections import defaultdict

from exercise_library.utils import read_file_as_json


def words(text):
    # Tokenize text into lowercase alphabetic words
    return re.findall('[a-z]+', text.lower())


def train(features):
    # Build a word -> frequency map from the training corpus
    model = defaultdict(int)
    for f in features:
        model[f] += 1
    return model


class SpellChecker(object):

    # The word frequencies are trained on exercise names only,
    # not the full English dictionary
    NWORDS = train(words(
        " ".join(
            [dict_obj["name"] for dict_obj in read_file_as_json("exercises.json")]
        )
    ))

    alphabet = 'abcdefghijklmnopqrstuvwxyz'

    def _edits1(self, word):
        # All strings one edit (delete, transpose, replace, insert) away
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [a + b[1:] for a, b in splits if b]
        transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
        replaces = [a + c + b[1:] for a, b in splits for c in self.alphabet if b]
        inserts = [a + c + b for a, b in splits for c in self.alphabet]
        return set(deletes + transposes + replaces + inserts)

    def _known_edits2(self, word):
        # Strings two edits away that actually appear in the corpus
        return set(e2 for e1 in self._edits1(word) for e2 in self._edits1(e1) if e2 in self.NWORDS)

    def _known(self, words):
        return set(w for w in words if w in self.NWORDS)

    def correct_token(self, token):
        # Prefer an exact hit, then edit distance 1, then 2, then give up
        candidates = self._known([token]) or self._known(self._edits1(token)) or self._known_edits2(token) or [token]
        return max(candidates, key=self.NWORDS.get)

    def correct_phrase(self, text):
        tokens = text.split()
        return [self.correct_token(token) for token in tokens]

So now if we run the program:

spellchecker = SpellChecker()
tokens = spellchecker.correct_phrase("dumbel sid bat babell shug pank")
print tokens

We get the output:

['dumbbell', 'side', 'bar', 'barbell', 'shrug', 'plank']

You’ll notice that all of the corrections are drawn from my exercise library rather than the full English dictionary.
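As a quick sanity check (the exact contents depend on what’s in exercises.json):

spellchecker = SpellChecker()
print "barbell" in spellchecker.NWORDS   # True -- it appears in exercise names
print "zebra" in spellchecker.NWORDS     # False -- a real word, but not an exercise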

Autocomplete

This was actually a pretty straightforward problem, especially since the data I’m working with is finite. The idea is that for every exercise name, I can map each of its tokens back to that exercise name. For example:

token_to_exercise_name = {
    "no": [
        "no feet push up"
    ], 
    "vacuums": [
        "vacuums"
    ], 
    "holding": [
        "standing lat stretch  holding object "
    ], 
    "roll": [
        "swiss ball roll outs", 
        "hip roll", 
        "straight arm roll outs"
    ], 
    "flexion": [
        "ankle flexion with weight plate"
    ], 
    "sphinx": [
        "sphinx push up"
    ], 
    "together": [
        "standing hamstring stretch, feet together", 
        "seated hamstring stretch, feet together"
    ], 
    "depth": [
        "depth jump"
    ]
}

And then from there, I can build edge n-grams so that prefixes of a token map to the complete tokens (which subsequently map to exercise names). So for example:

n_gram_to_tokens = {
    "wei": [
        u'weighted', u'weight'
    ],
    "weig": [
        u'weighted', u'weight'
    ],
    "weigh": [
        u'weighted', u'weight'
    ],
    "weight": [
        u'weighted', u'weight'
    ],
    "weighte": [
        u'weighted'
    ],
    "weighted": [
        u'weighted'
    ]
}

And we can build out those data structures like so:

class AutoCompleter(object):

    MIN_N_GRAM_SIZE = 3

    # Build the lookup tables once, at class-definition time
    exercise_names = [dict_obj["name"] for dict_obj in read_file_as_json("exercises.json")]
    token_to_exercise_name = defaultdict(list)
    n_gram_to_tokens = defaultdict(set)
    for exercise_name in exercise_names:
        # Normalize: lowercase, and turn punctuation into spaces
        exercise_name = exercise_name.lower().replace("-", " ").replace("(", " ").replace(")", " ").replace("'", " ")
        tokens = exercise_name.split()
        for token in tokens:
            token_to_exercise_name[token].append(exercise_name)
            # Map every prefix of the token (the edge n-grams) to the token
            for string_size in xrange(MIN_N_GRAM_SIZE, len(token) + 1):
                n_gram = token[:string_size]
                n_gram_to_tokens[n_gram].add(token)
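With those tables built, we can poke at them directly; the exact output will vary with the contents of exercises.json:

print AutoCompleter.n_gram_to_tokens["wei"]
# e.g. set([u'weighted', u'weight'])
print AutoCompleter.token_to_exercise_name["vacuums"]
# e.g. ['vacuums']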

Then I filled out the rest of the class, which follows some pretty simple steps:

  • Take the input tokens, treat them as potential n-grams, and expand them into real tokens (which then map to exercises)
  • From those cleaned tokens, get all of the possible exercises they map to
  • Score how well the tokens cover each exercise name (see the worked example after this list)
  • Return a subset of the exercises (i.e. don’t return 100 results)
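To make the scoring concrete, here’s the arithmetic for one hypothetical match (the token and exercise name here are just for illustration):

# Hypothetical example: the token "press" matching "barbell bench press"
token = "press"
exercise_name = "barbell bench press"
score = float(len(token)) / len(exercise_name.replace(" ", ""))
print score  # 5 / 17 ~= 0.294

# Per-exercise scores are then summed across all matched tokens and
# weighted by match_count / float(num_query_tokens), so exercises that
# match more of the query float to the top.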

So the entire class ends up looking like this:

class AutoCompleter(object):

    MIN_N_GRAM_SIZE = 3

    # Build the lookup tables once, at class-definition time
    exercise_names = [dict_obj["name"] for dict_obj in read_file_as_json("exercises.json")]
    token_to_exercise_name = defaultdict(list)
    n_gram_to_tokens = defaultdict(set)
    for exercise_name in exercise_names:
        # Normalize: lowercase, and turn punctuation into spaces
        exercise_name = exercise_name.lower().replace("-", " ").replace("(", " ").replace(")", " ").replace("'", " ")
        tokens = exercise_name.split()
        for token in tokens:
            token_to_exercise_name[token].append(exercise_name)
            # Map every prefix of the token (the edge n-grams) to the token
            for string_size in xrange(MIN_N_GRAM_SIZE, len(token) + 1):
                n_gram = token[:string_size]
                n_gram_to_tokens[n_gram].add(token)

    def _get_real_tokens_from_possible_n_grams(self, tokens):
        # Expand each query token (a possible prefix) into the full
        # tokens it could complete
        real_tokens = []
        for token in tokens:
            token_set = self.n_gram_to_tokens.get(token, set())
            real_tokens.extend(list(token_set))
        return real_tokens

    def _get_scored_exercises_uncollapsed(self, real_tokens):
        # Score each (exercise, token) pair by how much of the exercise
        # name the token covers, ignoring spaces
        exercises__scores = []
        for token in real_tokens:
            possible_exercises = self.token_to_exercise_name.get(token, [])
            for exercise_name in possible_exercises:
                score = float(len(token)) / len(exercise_name.replace(" ", ""))
                exercises__scores.append((exercise_name, score))
        return exercises__scores

    def _combined_exercise_scores(self, exercises__scores, num_tokens):
        # Sum scores per exercise, then weight by how many of the
        # query's tokens matched that exercise
        collapsed_exercise_to_score = defaultdict(int)
        collapsed_exercise_to_occurrence = defaultdict(int)
        for exercise, score in exercises__scores:
            collapsed_exercise_to_score[exercise] += score
            collapsed_exercise_to_occurrence[exercise] += 1
        for exercise in collapsed_exercise_to_score.keys():
            collapsed_exercise_to_score[exercise] *= collapsed_exercise_to_occurrence[exercise] / float(num_tokens)
        return collapsed_exercise_to_score

    def _filtered_results(self, exercises__scores):
        min_results = 3
        max_results = 10
        score_threshold = 0.4
        # A perfect score means an exact match, so return just that one
        if exercises__scores and exercises__scores[0][1] == 1.0:
            return [exercises__scores[0][0]]

        # Keep everything above the threshold (capped at max_results);
        # fall back to the top few results if too little clears the bar
        max_possibles = exercises__scores[:max_results]
        possibles_within_thresh = [tuple_obj for tuple_obj in max_possibles if tuple_obj[1] >= score_threshold]
        min_possibles = possibles_within_thresh if len(possibles_within_thresh) > min_results else max_possibles[:min_results]
        return [tuple_obj[0] for tuple_obj in min_possibles]

    def guess_exercises(self, tokens):
        real_tokens = self._get_real_tokens_from_possible_n_grams(tokens)
        exercises__scores = self._get_scored_exercises_uncollapsed(real_tokens)
        collapsed_exercise_to_score = self._combined_exercise_scores(exercises__scores, len(tokens))
        exercises__scores = collapsed_exercise_to_score.items()
        exercises__scores.sort(key=lambda t: t[1], reverse=True)
        return self._filtered_results(exercises__scores)

So for the following program:

spellchecker = SpellChecker()
tokens = spellchecker.correct_phrase("dumbel sid bat babell shug pank")
print AutoCompleter().guess_exercises(tokens)

tokens = spellchecker.correct_phrase("babell bech press")
print AutoCompleter().guess_exercises(tokens)

tokens = spellchecker.correct_phrase("side step")
print AutoCompleter().guess_exercises(tokens)

The output becomes:

[u'barbell shoulder shrug', u'barbell row', u'side plank']
[u'barbell bench press']
[u'side to side push up', u'step up ', u'side plank']

The JavaScript

Skip a few steps and assume we now have an API that returns the data we just built out. So now we have something like:

[Screenshot: the raw JSON response from the autocomplete API endpoint]
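For illustration, here’s a minimal sketch of what that endpoint could look like; I’m using Flask purely as an assumption for the example, with the route mirroring the URL the front end queries below:

import json
from flask import Flask, Response, request

app = Flask(__name__)
spellchecker = SpellChecker()
completer = AutoCompleter()

@app.route("/api/autocomplete/")
def autocomplete():
    # Spell-correct the raw query, then guess exercises from the tokens
    query = request.args.get("q", "")
    tokens = spellchecker.correct_phrase(query)
    results = completer.guess_exercises(tokens)
    # Return a bare JSON array of names, which is what the typeahead expects
    return Response(json.dumps(results), mimetype="application/json")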

Then with JavaScript and Bootstrap, it’s really easy to build out an autocomplete front end:

<script>
$(document).ready(function(){
    $(".typeahead").typeahead({
        // Debounce the source function so we don't hit the API on every keystroke
        source: _.debounce(function(query, process){
            var url = "/api/autocomplete/?q=" + encodeURIComponent(query);
            return $.get(url, function(data){
                // data is the raw response from the API
                return process(data);
            });
        }, 100),
        matcher: function(item){
            // Always match; typos are handled by the backend spell checker
            return true;
        },
        items: 10,
        minLength: 1
    });
});
</script>

I used Bootstrap Typeahead, and I modified the matcher function to always return true so that spelling mistakes are allowed (since those get corrected on the backend). Notice also that I used Underscore’s debounce function to throttle the number of API requests being made. The finished product looks like this:

[Screenshot: the finished autocomplete search box showing live suggestions]

The End