To paint the context for the scenarios and the problems I’ll outline, consider an app in which I’m classifying exercises for workouts. Each exercise has its own attributes such as which muscle group it works, which equipment it requires, whether it’s considered a strength, power, or stabilization exercise, and a few others.
For the given scenario, you’ll notice that for all practical purposes, the data is finite. Once we input our several hundred exercises, it’s not going to grow much more. A few more may be added, but the data isn’t suddenly going to take off and grow by several orders of magnitude. The total size of the data is in kilobytes, not megabytes, let alone gigabytes.
So the question is: Is it worth storing that data in a database?
The answer to the above question: No. Sort of. At least you don’t have to store the data in a database, and I’ve recently adjusted an app to avoid using a database for the exact problem described.
One of the things I found interesting when I first started my job as a software engineer was how enumerated data types were represented inside of a database. In a database course in college, you’d learn that every item or tangible object belonged in a row in a table, and rows in other tables could reference that item.
However, when a database exists primarily to be read by application code, bounded data (enumerated data types being the simplest example) doesn’t have to be represented inside the database at all. For example, a row in the database can reference an arbitrary enumerated data type through an ID that corresponds to something that lives in the code.
This is precisely what Django-RichEnum is for. It’s an open source Python library written at Hearsay that lets you create enumerated data types and have database rows reference those enums. In turn, when you’re loading objects from the database into a Django model, you can seamlessly reference the object’s attributes that represent an enum.
So for example, if I wanted to represent gender, I could do something like this:
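As a rough stand-in for the richenum version, the integer-backed core of the idea can be sketched with the stdlib enum module (the names here are mine; django-richenum’s enums additionally carry canonical and display names):

```python
from enum import IntEnum

class Gender(IntEnum):
    # The integer is what the database row stores;
    # the symbolic name is what the application code uses.
    MALE = 1
    FEMALE = 2
```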
You could represent gender a number of different ways, but doing so with an enumerated data type provides a consistent solution. Now if we have a Django model, we can reference gender with:
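django-richenum ships a Django model field that handles this translation; stripped of Django entirely, the mechanics reduce to something like the following sketch (class and attribute names are mine):

```python
from enum import IntEnum

class Gender(IntEnum):
    MALE = 1
    FEMALE = 2

class UserProfile:
    """Stand-in for a Django model: persists the integer, exposes the enum."""

    def __init__(self, gender_id):
        self.gender_id = gender_id  # the value the database column actually holds

    @property
    def gender(self):
        # Convert the stored integer back into the enum value on access.
        return Gender(self.gender_id)
```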
And inside the database, the “gender” field would use “gender_id” and would either be 1 or 2, which corresponds to the enumerated data type above.
For the exercises example, I’m applying the same concept, but instead of representing 2 different items, I’m representing 700+.
Bottom line: it’s faster, simpler to maintain, and this module by itself has no dependency on a database.
Databases are slow because they pull data from a hard drive. Accessing the hard drive compared to a single CPU cycle is like comparing a year to a second in terms of the difference in magnitude. While databases are optimized for querying, there’s still overhead involved with every single query. As long as keeping the data in memory isn’t costly, you can get real performance benefits by keeping it outside of a database.
In my exercises example, I had 5 different database tables just to represent exercises: the exercises table itself, plus 4 many-to-many tables joining against phases, muscle groups, equipment required, and exercise type. So I spent a few minutes joining all of the data and then just dumping it into a JSON file.
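That join-and-dump step might look something like this sketch, shrunk to one of the four many-to-many tables and using an in-memory SQLite database as a stand-in for the real schema (all table and column names are guesses):

```python
import json
import sqlite3

# Miniature, hypothetical version of the original schema.
conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE exercise (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE muscle_group (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE exercise_muscle_group (exercise_id INTEGER, muscle_group_id INTEGER);
    INSERT INTO exercise VALUES (1, 'Push-up'), (2, 'Squat');
    INSERT INTO muscle_group VALUES (1, 'chest'), (2, 'quads');
    INSERT INTO exercise_muscle_group VALUES (1, 1), (2, 2);
""")

blobs = []
for ex_id, name in conn.execute("SELECT id, name FROM exercise ORDER BY id"):
    # Collapse the many-to-many rows into a plain list per exercise.
    groups = [g for (g,) in conn.execute(
        "SELECT mg.name FROM muscle_group mg "
        "JOIN exercise_muscle_group emg ON mg.id = emg.muscle_group_id "
        "WHERE emg.exercise_id = ?", (ex_id,))]
    blobs.append({'id': ex_id, 'name': name, 'muscle_groups': groups})

payload = json.dumps(blobs, indent=2)  # in the real app: written out to a JSON file
```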
Now each exercise is represented as a JSON blob and looks something like this:
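Something like this, with field names guessed from the four join tables described above:

```json
{
  "id": 101,
  "name": "Barbell Bench Press",
  "muscle_groups": ["chest", "triceps"],
  "equipment": ["barbell", "bench"],
  "phases": ["strength"],
  "exercise_type": "strength"
}
```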
And I have a single JSON file with about 700 of these exercises. At this point there’s not really a compelling reason to represent the data in a different way.
Now I want to be able to query the data efficiently. I can do that by indexing the data with a simple dictionary based on every possible way I’d want to query it. To keep the example simple, let’s just say I only want to query by muscle group. So creating a class like this one below will fulfill those needs:
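The stripped-down class might look like the sketch below; method and field names are my reconstruction from the description that follows, and a few inlined blobs stand in for the real read of the JSON file:

```python
import json

# In the real app this is json.load() over the exercises file; inlining a few
# blobs keeps the sketch self-contained (field names are guesses).
_RAW_EXERCISES = json.loads("""[
  {"id": 1, "name": "Push-up",     "muscle_groups": ["chest", "triceps"]},
  {"id": 2, "name": "Bench press", "muscle_groups": ["chest"]},
  {"id": 3, "name": "Squat",       "muscle_groups": ["quads", "glutes"]}
]""")


class _Exercise:
    """Converts one JSON blob into a Python object."""

    def __init__(self, blob):
        self.id = blob['id']
        self.name = blob['name']
        self.muscle_groups = blob['muscle_groups']

    def __hash__(self):
        # Exercise IDs are unique, so the ID alone is a perfect, cheap hash.
        return self.id

    def __eq__(self, other):
        return isinstance(other, _Exercise) and self.id == other.id


class Exercise:
    # Class attributes: built exactly once per process and shared by every query.
    _exercises = [_Exercise(blob) for blob in _RAW_EXERCISES]

    _by_muscle_group = {}
    for _ex in _exercises:
        for _group in _ex.muscle_groups:
            _by_muscle_group.setdefault(_group, set()).add(_ex)

    def __init__(self):
        # Every query starts from the full set and narrows it down.
        self._results = set(self._exercises)

    def for_muscle_group(self, group):
        self._results &= self._by_muscle_group.get(group, set())
        return self  # returning self lets filters chain

    def results(self):
        return self._results
```

A query then reads like `Exercise().for_muscle_group('chest').results()`; the `for_equipment` filter mentioned later would be built by indexing the same way over an equipment field.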
As we step through the code, here’s why I made each decision:
This is just a simple class that will never be directly accessed by anything other than the outer “Exercise” class. Its purpose is to convert a JSON blob into a Python object. Since every exercise ID is unique (in case I didn’t mention it), we override the hash function so that it just returns the unique ID and doesn’t do anything wild and crazy that might slow us down.
Here I just read in the JSON file and converted each item to the “_Exercise” object previously described. You’ll notice that the “_exercises” variable is defined as a class attribute. This means the file is read exactly once, minimizing interaction with the hard drive. Every piece of code that references this class will reference the same static object, and an added plus for web applications is that this still runs exactly once until the application is restarted. So the program will hold an immortal in-memory list of exercises that never dies.
Now I’m indexing all of the exercises by muscle group using a dictionary that references a set. So for any given muscle group, you can immediately look up any corresponding exercises in O(1) time.
I’m using a set instead of a list because, in practice, additional filtering will happen. For example, I use a similar process to index all exercises by the required equipment. Now, if I query by equipment and also by muscle group, I get two distinct sets, and I want their intersection (I want to AND them together) to get all of the exercises that satisfy both requirements. That’s exactly what the query methods do.
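In plain Python, that AND is a single set intersection (the IDs here are made up):

```python
chest = {1, 2, 5}        # IDs of exercises that work the muscle group
barbell = {2, 5, 7}      # IDs of exercises that need the equipment
both = chest & barbell   # intersection == logical AND of the two filters
```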
To query items, I go through the process described above. I instantiate the Exercise class with its result set initialized to every possible exercise. For every additional condition added to the query, I intersect it with the possible results. By returning “self”, I can chain queries together.
Now when I query the data, I can do so with lines that look like the above. Also note that the “for_equipment” method wasn’t included in the stripped-down example, but here you can see how queries can be chained together with sets.
For a standard query it took on average 0.15 milliseconds.
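A number in that ballpark can be reproduced with a throwaway timeit harness like this one (my sketch, not the original measurement; absolute timings depend on hardware):

```python
import timeit

# Hypothetical indexes: ~700 exercise IDs spread across a few groups.
by_muscle_group = {g: set(range(g, 700, 7)) for g in range(7)}
by_equipment = {e: set(range(e, 700, 5)) for e in range(5)}

def query():
    # One chained query: two dict lookups and a set intersection.
    return by_muscle_group[3] & by_equipment[2]

runs = 10_000
per_call_ms = timeit.timeit(query, number=runs) / runs * 1000
print(f"{per_call_ms:.4f} ms per query")
```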
The queries are now fast as a mother falcon. Dictionary lookups are generally O(1), and set intersections are O(n) where n is the length of the smaller of the two sets. In every case, O(n) for n=~700 is drastically faster than a database query. The result is that I can query my data without worrying about any incurred cost.
Compare this to the alternative, where I had to query 5 different tables or make 4 different table joins. In a best-case scenario with optimal code, each table query would probably have taken around a millisecond; five of those against 0.15 ms is roughly a 30x increase in speed! Now consider a worst-case scenario where someone unknowingly left a database query inside of a loop. It would be a bummer.