A common complaint I’ve heard on the streets with Django is that it’s slow, like a Greenland Shark, but a reasonably acceptable level of speed can be achieved with basic database optimizations. Since database access requires reading or writing data from the hard drive, the amount of time required to interact with the database is magnitudes slower than a few processor cycles. When a single processor cycle is in the realm of 10 nanoseconds and reading from the a database can take milliseconds, the difference between the two is like comparing a second to an entire year.
This post is NOT about database optimizations themselves which would be another subject entirely, but more about navigating Django Querysets properly such to minimize the number of database queries made and avoiding redundant logic. After devoting a few dozen hours to optimizing Django queries, I wanted to document the most common cases I was able to find.
Below are the common pitfalls I found, in order of my perceived commonality:
Not Recognizing When a Queryset is Executed
A Django developer with a marginal level of experience should recognize that Django Querysets are lazily evaluated, meaning that the database isn’t accessed until the data inside of it is actually needed. So if we have a queryset:
No actual fetching to the database has occurred yet. Rather, any of the following logic will fetch from the database:
With that said, you’ll notice that the above logic is fetching the same data over and over, and we’ve now ended up making three queries to the database, but we’re working with the same set of data. Hence, the ideal scenario in this case would be:
So in a real world scenario, consider a case in which we’re making a point to cache results, and we have some python decorator that caches the results of a function. So let’s say we have a function:
Alas, the entire intent of the programmer here has been missed entirely. If we’re caching a queryset rather than a list, we’re not caching anything significant at all. If this function is called again and again,
we’re evaluating a queryset again and again because we’ve yet to cache the results of the queryset. EDIT: Django will cache the results of a queryset so I was incorrect about this particular case. Either way, we can deliberately cache the results ourselves of a queryset with:
Iterating Over Objects With a Many to Many Relationship
In general, if there’s any type of queryset inside of a loop, I’m willing to bet at least $7 that there’s a better way of querying the database. The only exception to this rule is that we’re making a deliberate effort of minimizing the memory footprint our application is creating.
For this example, let’s say we have many teachers that each have many students, thus creating a many to many relationship. If we were to iterate over all teachers and fetch their respective students, the basic case is something like this:
The number of queries here is equal to the number of teachers in the database (plus the initial query to fetch the teachers). But if you’ve ever been to any seriously cool college party in your life, then you know that all of this data could be fetched in a single database query. To do so, we need to add a through table in our models.py file to designate something to read from our many to many table:
Now, instead of fetching all teachers and subsequently fetching each of their students, we can just do:
Django actually has a built in method for this specific case with prefetch_related, but it only works for the exact case I just described because it caches the results of teacher.students.all(). However, if we did any type of filtering against teacher.students, prefetch_related would no longer work.
In the above example, you’ll notice that we did a JOIN between 3 different database tables, and the result is that our memory footprint has expanded quite a bit. In this example, if we know that there are fewer teachers than students, then we also know that the database join has returned redundant teacher information, and Django is parsing every single column of every single row and loading those into memory. So if we want to address that problem we can do something like this:
Note the changes made here:
- a dictionary was created in memory from teacher ID to its associated object
- we no longer select_related() against ‘teacher’, therefore we’re not doing a JOIN against the Teacher table
- inside the loop, we fetch the teacher from the dictionary in memory instead of the bloated database query. To do this, note that we’re still using teacher_id that’s present in the many to many table
Avoiding Extra Logic to Handle Many-to-One Relationships
A very similar case to the above example is one in which we have a many-to-one relationship instead of a many-to-many relationship. For this example, let’s create another arbitrary model where we have a person that has many pets. Each pet only has one owner.
Similar to the previous case, if we iterate over each owner, a subsequent query will be made to fetch each pet:
My preference for dealing with this case is to pre-fetch all of the pets we care about before iterating over Owners, and assigning the pets to the proper owner as intended without using a django queryset inside of a loop. To do so, let’s modify our Owner model ever so gently like so:
Now, we can iterate over a set of owners, fetch the pets that belong to that set of owners, and then amend those Pet objects to their respective Owner objects:
We’ve now gone from n number of queries where n is the number of owners down to exactly 2 queries. In Django, with the example model, this actually could have been done in a single query like so:
However, I brought up the initial example first because in a complex scenario, we don’t necessarily have the luxury of isolating all of the querysets we need in a single location. Additionally, while we’re making extra database queries, we’re also minimizing our memory footprint by avoiding massive JOINs in SQL.
Foreign Keys inside of an object
While the solution to this problem has already been addressed in the previously listed examples, another common mistake is failing to recognize that an object fetched via a Foreign Key has yet to be loaded into memory. Simple example:
Without adding a select_related(), a database query is being made for every reference to a pet’s owner. The simple solution is to add:
We now only make one database query.
However, in my time optimizing previously written code, this case is clearly known, but trouble surfaces when multiple objects reference the same object. If that object is already in memory, the programmer might assume that all foreign keys reference the object that already exists which intuitively makes sense. So let’s expand our model and consider something like so:
In this model, the case I’m trying to illustrate is that Pet and Collar both have the same owner, and a Pet has a one to one relationship with a collar. Now if we revisit our previous example:
We’ve run into the exact same problem. We’re now making n number of queries again where n is the number of pets because pet.collar.owner is making a database query even though it’s referencing the same object above. There’s not really a catch-all fix for this type of problem, but what’s important is that the programmer recognize that pet.owner and pet.collar.owner reference the same object. Hence, when referencing the owner here, the programmer should just use the owner object that’s already been fetched from the database.
Avoiding In-Memory Operations
One fairly good rule of thumb is to offload the heavy lifting to the database. Thus, it’s not necessarily bad practice doing something like so:
In this example, the queries are quite simple. But if we were making more complex queries that were each fairly costly, we could easily beat database speeds in python. In a particular case I was optimizing, I found that operator.attrgetter() was particularly handy at minimizing CPU time spent sorting data. To do so, we could do something like so:
Yes. This is how to get high-fives.
Not Taking Advantage of Django Functionality
Finally, it’s important to recognize the built in shortcuts Django has in place. For example:
is way faster than:
is way faster than:
These are the simplest examples, but it would be a massive tangent to divulge in all of the ways that django queries could be optimized. The important thing is to talk to Google about it for your particular case.