True Story Follows
So I’m by no means a database expert and don’t consider myself as such, but some of my coworkers who I would label as database experts have educated me about best practices that would otherwise be difficult to stumble upon and self-learn. At this point I know enough to occasionally educate someone else about one or more of these alleged best practices, so I think it would be useful to document the practices I’ve adopted.
Basic Database Terms
If you’ve read this far down the page, I’m going to assume that you quite possibly know nothing about databases except that they’re a storage mechanism that you can query, and that they’re generally faster than some code you could write yourself to retrieve the same data. There are a few terms I’ll end up throwing around, so I’m going to define them here:
- Database Index: A database table contains rows. Some of the columns on those rows are needed to identify and find the data, while other columns simply carry information. In any case where you are going to search for a database row based on one of its attributes, that attribute should exist somewhere on a database index. The index means that the attribute is stored on the hard drive in sorted order so that it can be retrieved very quickly. Conceptually, this is the same as putting the word “Melon-Lion” in the dictionary in the M section right after the word “melon”. In practice, the index is stored in a B-tree for quick retrieval.
- Multi-Column Index: A multi-column index is an index that consists of multiple attributes, and order matters: an index on attributes (a, b, c) is completely different from one on (c, b, a). Reading the index from left to right, the leftmost attributes must exist in the query. With the index (a, b, c) you can query by (a), (a, b), and (a, b, c), but you cannot use the index to query by (b, c) alone.
- Table Scan: A table scan occurs when you’re querying by an attribute that isn’t indexed. Imagine a dictionary where the words were completely shuffled. The ONLY way to find the page for the word “Melon-Lion” would be to flip the pages one by one.
- Dependent Subquery: My explanation here might be off, but a dependent subquery is a query nested inside another query, where the inner query depends on the outer query’s rows and therefore must be re-evaluated for each row of the outer query.
- Temp Table: A temp table is a new table created on the fly by the database, often in memory, in order to execute certain operations like table joins.
- ORM: Object-Relational Mapper. ‘A programming technique for converting data between incompatible type systems in object-oriented programming languages. This creates, in effect, a “virtual object database” that can be used from within the programming language.’ In this case and in all my examples, I’m referring to Django’s ORM, which maps database rows to Python objects.
The “gotchas” in Django
Django can make writing code easy. A little too easy. So easy it can encourage some really sloppy practices, particularly with SQL.
Alex Gaynor gave an interesting talk to the SF Django Meetup where he made the very good point that Django’s sample code can be extremely misleading because it’s assumed that the sample code is good code. But it often isn’t.
Some mistakes can be made that are extremely subtle, and without knowing some of Django’s nuances it’s easy to remain ignorant of problems.
Understanding that Querysets are Lazily Evaluated
Querysets are lazily evaluated. This is important to understand from the get-go. Lazily evaluating queries is generally extremely powerful, but not knowing that queries are lazily evaluated will lead to huge problems, generally because you can code without realizing when a query gets executed, and this often leads to dozens or even hundreds of unnecessary and costly database queries. Let me illustrate that point with a simple example:
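Something along these lines, using a hypothetical Animal model:

```python
animals = Animal.objects.all()

# Check whether a few animal names exist in the fetched data
print('lion' in animals.values_list('name', flat=True))
print('tiger' in animals.values_list('name', flat=True))
print('bear' in animals.values_list('name', flat=True))
print('wolf' in animals.values_list('name', flat=True))
```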
When I first started using Django, I would have thought that this was one database query: we do an initial fetch, it loads all the data into some internal object, and then we check whether each animal name exists in the result. Wrong.
It’s 4 redundant database queries. Don’t assume Django is doing some magical caching for you. At the point at which you intend to have the information on hand, go ahead and save the data into your own data structure. In this case, it would be optimal to do something like:
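With the same hypothetical model:

```python
# One database query; the set then lives in our own process's memory
animal_names = set(Animal.objects.values_list('name', flat=True))

# Cheap in-memory membership checks, no further queries
print('lion' in animal_names)
print('tiger' in animal_names)
print('bear' in animal_names)
print('wolf' in animal_names)
```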
Later down in this post, I try to explain why dependent sub-queries are bad. But for the time being, just accept that dependent sub-queries are bad, and therefore we want to avoid doing them. Here’s an example of a dependent sub-query in Django:
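A sketch, assuming hypothetical Animal and Owner models where each animal row stores an owner_id:

```python
# Django never evaluates the inner QuerySet in Python; it becomes
# a nested subquery in the generated SQL
animals = Animal.objects.filter(
    owner__in=Owner.objects.filter(name='Scott Lobdell'),
)
```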
The intent here is to fetch all of the animals that have an owner by the name of “Scott Lobdell”. So let’s fetch the owners first and then run the query (the code below is still bad practice and contains a dependent sub-query):
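Roughly, with the same hypothetical models:

```python
owner_ids = Owner.objects.filter(
    name='Scott Lobdell',
).values_list('id', flat=True)
animals = Animal.objects.filter(owner_id__in=owner_ids)
```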
Can you spot the problem? It’s very subtle.
Remember that Django querysets are lazily evaluated, and another “gotcha” is that values_list doesn’t actually return a list, but another subclass of QuerySet that’s still lazily evaluated.
So the result is that we’re still issuing a dependent sub-query. We can fix that with:
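Same sketch:

```python
# Casting to a list forces the owner query to execute immediately,
# so the second query receives literal IDs instead of a nested subquery
owner_ids = list(
    Owner.objects.filter(name='Scott Lobdell').values_list('id', flat=True)
)
animals = Animal.objects.filter(owner_id__in=owner_ids)
```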
Lack of a Table Join
Here’s a simple example of what would surely be unintended behavior on the programmer’s part:
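For instance, with the same hypothetical models:

```python
# Suppose the animals table has 100 rows
for animal in Animal.objects.all():
    # Accessing animal.owner lazily fetches that owner: one query per animal
    print(animal.owner.name)
```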
This iteration would cause 100 database queries. You can get around this with:
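A sketch using select_related:

```python
# One query total: select_related('owner') joins the owner table up front,
# so the loop itself triggers no additional queries
for animal in Animal.objects.select_related('owner'):
    print(animal.owner.name)
```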
The above example is still sub-optimal; we’ll talk more about why table joins are bad further down.
Another example of unintended consequences:
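Another sketch:

```python
animals = list(Animal.objects.all())
first_animal = animals[0]
print(first_animal.owner.name)
```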
There’s an extra database query being made to fetch the first animal’s owner. This could be simply mitigated with:
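Again sketched with select_related:

```python
animals = list(Animal.objects.select_related('owner'))
first_animal = animals[0]
print(first_animal.owner.name)  # no extra query; the owner came back with the join
```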
Using table joins creates a temp table. For the sake of argument, assume these are bad; I explain more of why further down. If you were using purely SQL as your tool for doing whatever it is you were doing, it would indeed make a lot of sense to join tables together to query information.
But the point is: from the perspective of a programmer like myself who’s ALWAYS using an ORM, you’re always using another programming language with its own data structures in conjunction with SQL. So there are alternate ways of accomplishing the same thing without joining tables in SQL.
So consider again our simple example of iterating over animals and printing their owners’ respective names. Instead of doing one database query with a large table join (select_related), we can do two database queries and join them in memory:
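Here’s a sketch of the in-memory join, again with the hypothetical Animal and Owner models:

```python
# Query 1: fetch the animals and evaluate the QuerySet immediately
animals = list(Animal.objects.all())

# Query 2: fetch every referenced owner in one shot
owner_ids = set(animal.owner_id for animal in animals)
owner_by_id = {
    owner.id: owner
    for owner in Owner.objects.filter(id__in=list(owner_ids))
}

# "Join" the two tables in memory with a plain dictionary lookup
for animal in animals:
    print(owner_by_id[animal.owner_id].name)
```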
To explain the above code: instead of joining two tables together on owner ID, I’m fetching the owners by ID and then using a basic dictionary to map ID to object. Now I can use that dictionary to effectively join the two tables in memory.
Why Dependent Subqueries are Bad
This is a question I don’t claim to be expert enough to answer. Check out this StackOverflow question about the topic, where the top answer includes the quote:
“If you have a dependent subquery you might run into performance problems because a dependent subquery typically needs to be run once for each row in the outer query. So if your outer query has 1000 rows, the subquery will be run 1000 times.”
In what cases is a subquery dependent versus independent? I couldn’t tell you the details. But again, from the perspective of a programmer that’s just using an ORM, why even bother with subqueries at all when you never have to?
Why Temp Tables are Bad
Another really good answer to this topic is from a blog post. In short, doing a table join will result in:
- CREATE TABLE
- INSERT data
- SELECT data
- DROP table
You’re creating an unnecessary load on the database, which is a shared resource.
Meanwhile, your ORM is being executed from a worker process that can be scaled at will, so in many cases it’s better not to let the database do the heavy lifting.
Indexing in Django
The lowest hanging fruit when it comes to database optimizations is to ensure that the columns you’re querying against are indexed.
In Django, you can do that with db_index=True
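For example (a sketch; the model and field names are made up):

```python
from django.db import models


class AccessToken(models.Model):
    # db_index=True tells Django to create a database index on this column
    token_string = models.CharField(max_length=64, db_index=True)
    user_id = models.IntegerField()


# e.g. a token string parsed from the request's parameters
incoming_token_string = '8f14e45fceea'
access_token = AccessToken.objects.get(token_string=incoming_token_string)
```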
In the above example, let’s pretend I have an application that receives requests that include an access token string as one of the parameters. I need to be able to retrieve access tokens based on that string, so I index it to query quickly.
The index is stored in a B-tree, so the access token will be found in logarithmic time. So if I had 1 billion elements, searching for the access token by that index would be comparable to searching 30 elements linearly.
Without the index, querying for the Access Token would result in a table scan. So if we had a billion elements with no indexes, and we ran the above query, it might take a very long time to find the correct row(s).
Made Up Real World Example for Multi-Column Indexes
You can add indexes for multiple columns on a database table. Each index is stored in its own B-Tree, and as long as an index exists for the WHERE clause of your query, it will be pretty fast. But let’s consider an example where a multi-column index would be extremely useful.
Let’s pretend you’re building an application for a huge store, say Wal-Mart, and you want to store every single purchasable item they sell across the country in a database row by serial number (let’s pretend every single item at Wal-Mart has a unique serial number). So let’s assume that the number of purchasable items is on the order of several billion. You could index by serial number and call it a day, but once your database starts getting a load of requests, you’ll start to run into problems.
Namely, the amount of data that you have is going to take up some ridiculous amount of hard drive space.
Even if every item is indexed, the hard drive still has to seek to the appropriate location in order to retrieve the data.
In the real world, the limitation here is the drastically slow speed of the drive’s mechanical parts.
So in our made up scenario, if I have two physical items in a store, and I query for both of them in the database, they’re probably going to be spread out across the hard drive, and so the database is going to be slow in retrieving their data even with the index.
Enter Multi-Column Indexing.
In the above scenario, I was querying 2 items in the same store.
Maybe even the same department. Maybe even the same aisle. And that’s probably the common use case across the board.
So instead of just indexing by serial number, we could create a multi-column index like (store_id, department_id, aisle_id, serial_number). This way, data is stored on the hard drive in blocks that align with the index. All of our data is scoped, and if I were to query every adjacent item in a single aisle of a Wal-Mart store, every item would hit the index and sit adjacent on the hard drive, so the queries would be performant.
Note that with this example, if we query by serial number, we also need to know the store, department, and aisle for the query to hit the index. If there’s another use case that doesn’t have that information, a separate index needs to be created.
In Django, you can either use the index_together Meta attribute, or you can write your own SQL to create the index and simply reference that database table with Django.
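Here’s a sketch with a hypothetical StoreItem model; whether the index actually gets used is ultimately up to the database’s query planner:

```python
from django.db import models


class StoreItem(models.Model):
    store_id = models.IntegerField()
    department_id = models.IntegerField()
    aisle_id = models.IntegerField()
    serial_number = models.CharField(max_length=64)

    class Meta:
        # One multi-column index; remember that order matters
        index_together = [
            ('store_id', 'department_id', 'aisle_id', 'serial_number'),
        ]


# Can use the index: the WHERE clause is a left prefix of the index
items_in_aisle = StoreItem.objects.filter(
    store_id=42, department_id=7, aisle_id=3,
)

# Cannot use the index: serial_number alone is not a left prefix,
# so this degrades to a table scan
items = StoreItem.objects.filter(serial_number='A1B2C3')
```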
More Best Practices
From my experience using Django, here are some of the things that I’ve found to be extremely beneficial and practical:
Build an Abstraction Layer on Top of your Django Models
My favorite book on coding is Clean Code by Bob Martin. You should read it.
One of the small take-aways is that if you’re using a 3rd party library (in this case your ORM), it’s a good idea to build your own abstraction layer on top of it. That way, if you ever decide you want to change which 3rd party library you’re using, it’s relatively simple and straightforward.
Taking our example above with an Access Token, the way I do this was outlined in a previous blog post, but here’s an example of an abstraction layer built on top of Django’s ORM:
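Here’s a sketch of the idea (the class and field names are made up, not the exact code from that post):

```python
from django.db import models


class _AccessTokenModel(models.Model):
    # "Private" Django model: nothing outside this module should import it
    token_string = models.CharField(max_length=64, db_index=True)
    user_id = models.IntegerField()


class NullAccessToken(object):
    # Null Object: safe defaults so callers never have to check for None
    user_id = None

    def is_valid(self):
        return False


class AccessToken(object):
    """ The only access token API the rest of the application sees. """

    def __init__(self, user_id):
        self.user_id = user_id

    def is_valid(self):
        return True

    @classmethod
    def from_token_string(cls, token_string):
        try:
            django_token = _AccessTokenModel.objects.get(
                token_string=token_string,
            )
        except _AccessTokenModel.DoesNotExist:
            return NullAccessToken()
        return cls(user_id=django_token.user_id)
```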
(Hey! Check out the Null Object Pattern used)
In this way, the rest of the application knows nothing about Django. If I suddenly wanted to swap out ORMs, it would be easy to do. All of my use cases are encapsulated in one file. It’s easy to figure out what database indexes need to be created. Proper querying is baked into the code.
For my previous example of avoiding a temp table entirely with that arguably wacky dictionary logic, it’s easy to isolate the logic in one place and spare colleagues from having to mess with it.
Never Ever Use Foreign Keys
A foreign key constraint exists in order to enforce certain rules about the database setup.
But again, if you’re using SQL in conjunction with an ORM, your code’s logic should already enforce your constraints.
Foreign keys add an extra check that can slow things down, but more importantly, eliminating foreign keys will:
- Force you to understand what’s going on underneath the hood
- Give you a small performance boost
- Ease maintenance
The first two of those bullets are pretty self-explanatory, but the maintenance piece comes from the added flexibility of not having a foreign key constraint. Not having foreign keys enables you to do things like store tables in entirely different databases or entirely different storage engines. For example, perhaps you want to move an entire database table into a pure caching layer like Redis or Memcached. Or perhaps you could move a small, finite database table into a simple JSON file.
Don’t Use Django’s Many-to-Many Relation Manager
This is really an extension of the abstraction layer point above. If you don’t know how to represent a many-to-many relationship without Django’s ORM, that means you don’t know what’s happening under the hood, and the result will eventually be bad database queries (and temp tables). When you simply use Django’s many-to-many manager, you won’t even get to see the table that’s being used unless you explicitly create a through table.
The solution I like is a combination of everything written above. Create an abstraction layer. Create a many-to-many table that’s “private” along with your other private Django models. Use the abstraction layer to query the many-to-many table for the object IDs you need, then query those objects separately.
Quick example from a current side project:
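The models look roughly like this (names simplified for illustration):

```python
from django.db import models


class _DayFramework(models.Model):
    # "Private" model representing the framework for one workout day
    name = models.CharField(max_length=64)


class _DayFrameworkToWorkoutComponent(models.Model):
    # Hand-rolled many-to-many table: plain integers, no foreign keys.
    # Only the side we actually query by is indexed.
    day_framework_id = models.IntegerField(db_index=True)
    # References an enumerated type that doesn't live in this database
    workout_component_id = models.IntegerField()
```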
In this example, I have a program that generates workouts by creating the framework for a workout day, and each of those frameworks references some workout components. So I have the day framework table, a many-to-many table that references workout components, and then the workout components themselves, which live elsewhere and in this case are actually just enumerated data types (an example of the flexibility of not using foreign keys and not necessarily storing your data in the same database).
Then I have an abstraction layer that’s just a Python object, and nothing outside that one class needs to know about the many-to-many relationship or the querying involved.
Now in practice, behind my abstraction layer, I can query the many-to-many table like so:
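A sketch of the method:

```python
from collections import defaultdict


class DayFramework(object):
    """ Abstraction layer; callers never see the Django models above. """

    def _get_workout_component_ids_by_day_framework_id(self, day_framework_ids):
        # One query against the many-to-many table: no joins, no subqueries
        rows = _DayFrameworkToWorkoutComponent.objects.filter(
            day_framework_id__in=list(day_framework_ids),
        )
        workout_component_ids_by_day_framework_id = defaultdict(list)
        for row in rows:
            workout_component_ids_by_day_framework_id[
                row.day_framework_id
            ].append(row.workout_component_id)
        return workout_component_ids_by_day_framework_id
```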
First, there’s extra caching in the real class that I stripped out for the purposes of this example to keep the logic easy to understand. To explain the above code: a single day framework will have many workout components, so the purpose of the many-to-many table is to eventually give us data that represents one day framework and its many workout components. I represented this as a dictionary mapping day framework IDs to lists of workout component IDs.
So in the code, I query the many-to-many table, and each row has a day framework ID and a workout component ID. Then I simply append the workout component ID to the appropriate dictionary entry, using a defaultdict as a shortcut.
It can get kind of confusing, but notice that the logic is inside a small private method (“private” meaning it has an underscore denoting such), that method has test coverage, and the logic inside it never needs to be re-created. You could also look at this in the sense that there’s a general inverse relationship between flexibility and efficiency: here I’ve given up flexibility to maximize performance, and that performance is now baked into a particular use case associated with a Python class.
The last example really encapsulates most of what I’ve written about.
- There’s an index for one side of the many-to-many table, based on the query patterns of our use case.
- There’s no foreign key (I used an integer instead of a foreign key).
- There’s an abstraction layer for the use case.
- Database queries are made one table at a time with no table joins or dependent sub-queries.
- I’m casting QuerySets to lists as soon as I know I want to fetch data from the database.
- The “gotchas” that you might experience with Django are impossible to stumble upon for another programmer that’s using the new class’s API.
The last example doesn’t cover multi-column indexing though. Whatever.