Optimizing Django Querysets

A common complaint I’ve heard on the streets about Django is that it’s slow, like a Greenland Shark, but a reasonably acceptable level of speed can be achieved with basic database optimizations.  Since database access requires reading or writing data from disk, the time required to interact with the database is orders of magnitude greater than a few processor cycles.  When a single processor cycle takes on the order of nanoseconds and reading from a database can take milliseconds, the difference between the two is like comparing a second to an entire year.

This post is NOT about database optimizations themselves, which would be another subject entirely, but about using Django querysets properly so as to minimize the number of database queries made and avoid redundant logic.  After devoting a few dozen hours to optimizing Django queries, I wanted to document the most common cases I was able to find.

Below are the common pitfalls I found, in order of my perceived commonality:

Not Recognizing When a Queryset is Executed

A Django developer with even a marginal level of experience should recognize that Django querysets are lazily evaluated, meaning that the database isn’t accessed until the data is actually needed.  So if we have a queryset:

ArbitraryModel.objects.all()

No actual fetch from the database has occurred yet.  Rather, any of the following will hit the database:

ArbitraryModel.objects.all()[0]
list(ArbitraryModel.objects.all())
len(ArbitraryModel.objects.all())

 

With that said, you’ll notice that the logic above fetches the same data over and over: we’ve made three queries to the database, yet we’re working with the same set of data.  Hence, the ideal approach in this case would be:

arbitrary_models = list(ArbitraryModel.objects.all())
first_element = arbitrary_models[0]
length = len(arbitrary_models)

 

So in a real-world scenario, consider a case in which we’re making a point to cache results, and we have some Python decorator that caches the return value of a function.  So let’s say we have a function:

@cached
def cached_models():
    return ArbitraryModel.objects.all()

 

Alas, the entire intent of the programmer here has been missed.  If we’re caching a queryset rather than a list, we’re not caching anything significant at all.  If this function is called again and again, we’re evaluating a queryset again and again, because we’ve yet to cache the evaluated results of the queryset.  EDIT: Django caches a queryset’s results after its first full evaluation, so I was incorrect about this particular case.  Either way, we can deliberately cache the evaluated results ourselves with:

@cached
def cached_models():
    return list(ArbitraryModel.objects.all())
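
The @cached decorator in these examples is hypothetical, not a real Django utility.  A minimal sketch of that kind of memoizing decorator (handling zero-argument functions only, to keep it short) might look like:

```python
import functools

def cached(func):
    # Hypothetical memoizing decorator: stores the first return value and
    # serves it on every later call.  Only handles zero-argument functions.
    sentinel = object()  # distinguishes "not computed yet" from a None result
    value = sentinel

    @functools.wraps(func)
    def wrapper():
        nonlocal value
        if value is sentinel:
            value = func()
        return value
    return wrapper

call_count = 0

@cached
def cached_models():
    global call_count
    call_count += 1
    return list(range(3))  # stand-in for list(ArbitraryModel.objects.all())

first = cached_models()
second = cached_models()  # served from the cache; the body does not run again
```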

 

Iterating Over Objects With a Many to Many Relationship

In general, if there’s any kind of queryset inside of a loop, I’m willing to bet at least $7 that there’s a better way of querying the database.  The only exception to this rule is when we’re making a deliberate effort to minimize our application’s memory footprint.

For this example, let’s say we have many teachers who each have many students, creating a many-to-many relationship.  If we iterate over all teachers and fetch their respective students, the naive case looks something like this:

all_teachers = Teacher.objects.all()
for teacher in all_teachers:
    students = teacher.students.all()

 

The number of queries here is equal to the number of teachers in the database (plus the initial query to fetch the teachers).  But if you’ve ever been to any seriously cool college party in your life, then you know that all of this data could be fetched in a single database query.  To do so, we add a through model in our models.py file to give us something to read from our many-to-many table:

class Student(models.Model):
    pass

class Teacher(models.Model):
    students = models.ManyToManyField(Student, through='TeacherToStudent')

class TeacherToStudent(models.Model):
    class Meta:
        db_table = "this_table_name_can_be_anything"

    student = models.ForeignKey(Student, on_delete=models.CASCADE)
    teacher = models.ForeignKey(Teacher, on_delete=models.CASCADE)

 

Now, instead of fetching all teachers and subsequently fetching each of their students, we can just do:

from collections import defaultdict
teacher_to_students = defaultdict(list)

m2m_qs = TeacherToStudent.objects.select_related('teacher', 'student').all()

for m2m_row in m2m_qs:
    teacher_to_students[m2m_row.teacher].append(m2m_row.student)

 

Django actually has a built-in method for this specific case, prefetch_related, but it only helps for the exact case I just described because it caches the results of teacher.students.all(). If we did any filtering against teacher.students, that prefetch cache would no longer be used (unless we set up a Prefetch object with the filtered queryset).

In the above example, you’ll notice that we did a JOIN across three database tables, and as a result our memory footprint has grown quite a bit.  In this example, if we know that there are fewer teachers than students, then we also know that the database join has returned redundant teacher information, and Django is parsing every single column of every single row and loading it into memory.  If we want to address that problem, we can do something like this:

from collections import defaultdict

teacher_id_to_obj = {teacher.id: teacher for teacher in Teacher.objects.all()}
teacher_to_students = defaultdict(list)

m2m_qs = TeacherToStudent.objects.select_related('student').all()

for m2m_row in m2m_qs:
    teacher = teacher_id_to_obj[m2m_row.teacher_id]
    teacher_to_students[teacher].append(m2m_row.student)

 

Note the changes made here:

  • a dictionary was created in memory mapping each teacher ID to its associated object
  • we no longer select_related() against 'teacher', so we’re not doing a JOIN against the Teacher table
  • inside the loop, we fetch the teacher from the in-memory dictionary instead of from the bloated database query, using the teacher_id that’s already present in the many-to-many table
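
The same id-to-object mapping can be sketched without Django at all; here plain (teacher_id, student) tuples stand in for TeacherToStudent rows:

```python
from collections import defaultdict, namedtuple

Teacher = namedtuple("Teacher", ["id", "name"])

# stand-ins for rows fetched from the database
teachers = [Teacher(1, "Ada"), Teacher(2, "Grace")]
m2m_rows = [(1, "alice"), (1, "bob"), (2, "carol")]  # (teacher_id, student)

teacher_id_to_obj = {t.id: t for t in teachers}
teacher_to_students = defaultdict(list)

for teacher_id, student in m2m_rows:
    # constant-time dictionary lookup replaces the JOIN against Teacher
    teacher_to_students[teacher_id_to_obj[teacher_id]].append(student)
```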

Avoiding Extra Logic to Handle Many-to-One Relationships

A very similar case to the above example is one in which we have a many-to-one relationship instead of a many-to-many relationship.  For this example, let’s create another arbitrary pair of models: an owner who has many pets, where each pet has only one owner.

class Owner(models.Model):

    def pets(self):
        return list(Pet.objects.filter(owner_id=self.id))

class Pet(models.Model):
    owner = models.ForeignKey(Owner, on_delete=models.CASCADE)

 

Similar to the previous case, if we iterate over each owner, a subsequent query will be made to fetch each pet:

for owner in Owner.objects.all():
    the_owners_pets = owner.pets()

 

My preference for dealing with this case is to prefetch all of the pets we care about before iterating over owners, then assign each pet to its proper owner without using a Django queryset inside of a loop.  To do so, let’s modify our Owner model ever so gently, like so:

class Owner(models.Model):
    age = models.IntegerField()  # just an example attribute for us to query against

    prefetched_pets = None

    def pets(self):
        if self.prefetched_pets is not None:
            return self.prefetched_pets
        return list(Pet.objects.filter(owner_id=self.id))

class Pet(models.Model):
    owner = models.ForeignKey(Owner, on_delete=models.CASCADE)

 

Now, we can iterate over a set of owners, fetch the pets that belong to that set of owners, and then amend those Pet objects to their respective Owner objects:

owners = list(Owner.objects.filter(age__gt=40))
owner_id_to_obj = {owner.id: owner for owner in owners}
pets = Pet.objects.filter(owner_id__in=owner_id_to_obj.keys())
for pet in pets:
    owner = owner_id_to_obj[pet.owner_id]
    if owner.prefetched_pets is None:
        owner.prefetched_pets = []
    owner.prefetched_pets.append(pet)
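
Stripped of the ORM, the attachment step is just a group-by over pet.owner_id; here FakeOwner and FakePet are plain illustrative stand-in classes:

```python
class FakeOwner:
    def __init__(self, id):
        self.id = id
        self.prefetched_pets = None

class FakePet:
    def __init__(self, owner_id, name):
        self.owner_id = owner_id
        self.name = name

# stand-ins for the two queries: one for owners, one for their pets
owners = [FakeOwner(1), FakeOwner(2)]
owner_id_to_obj = {o.id: o for o in owners}
pets = [FakePet(1, "rex"), FakePet(2, "mittens"), FakePet(1, "felix")]

for pet in pets:
    # attach each pet to its owner using the in-memory id map
    owner = owner_id_to_obj[pet.owner_id]
    if owner.prefetched_pets is None:
        owner.prefetched_pets = []
    owner.prefetched_pets.append(pet)
```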

 

We’ve now gone from n queries, where n is the number of owners (plus the initial query), down to exactly 2 queries.  In Django, with the example model, this actually could have been done in a single query, like so:

Pet.objects.select_related('owner').filter(owner__age__gt=40)

 

However, I brought up the two-query example first because in a complex scenario we don’t necessarily have the luxury of isolating all of the querysets we need in a single location.  Additionally, while we’re making an extra database query, we’re also minimizing our memory footprint by avoiding a massive JOIN in SQL.

Foreign Keys Inside of an Object

While the solution to this problem has already been covered by the previous examples, another common mistake is failing to recognize that an object referenced via a foreign key has yet to be loaded into memory.  A simple example:

for pet in Pet.objects.all():
    print(pet.owner.age)

 

Without a select_related(), a database query is made for every reference to a pet’s owner.  The simple solution is to add one:

for pet in Pet.objects.select_related('owner').all():
    print(pet.owner.age)

 

We now only make one database query.

However, in my time optimizing previously written code, this case is generally well understood; trouble surfaces when multiple objects reference the same object.  If that object is already in memory, the programmer might assume that all foreign keys to it resolve to the object that already exists, which intuitively makes sense.  So let’s expand our model and consider something like so:

class Owner(models.Model):
    age = models.IntegerField()  # the examples below read owner.age

class Collar(models.Model):
    owner = models.ForeignKey(Owner, on_delete=models.CASCADE)

class Pet(models.Model):
    owner = models.ForeignKey(Owner, on_delete=models.CASCADE)
    collar = models.ForeignKey(Collar, on_delete=models.CASCADE)

 

In this model, the case I’m trying to illustrate is that a Pet and its Collar share the same owner, and a Pet has a one-to-one relationship with its collar.  Now if we revisit our previous example:

for pet in Pet.objects.select_related('owner', 'collar').all():
    print(pet.owner.age)
    print(pet.collar.owner.age)

 

We’ve run into the exact same problem.  We’re again making n queries, where n is the number of pets, because pet.collar.owner triggers a database query even though it references the same row as pet.owner above.  (select_related('collar__owner') would eliminate the extra queries, but it would still materialize a second Owner instance.)  There’s not really a catch-all fix for this type of problem; what’s important is that the programmer recognize that pet.owner and pet.collar.owner reference the same object and reuse the owner that’s already been fetched from the database.
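
The underlying gotcha can be shown without Django: two separate “fetches” of the same row produce two distinct Python objects.  FakeOwner and fetch_owner here are illustrative stand-ins, not real APIs:

```python
class FakeOwner:
    def __init__(self, pk, age):
        self.pk = pk
        self.age = age

def fetch_owner(pk):
    # stand-in for a database query: every call builds a fresh object
    return FakeOwner(pk, age=40)

a = fetch_owner(1)  # e.g. what populated pet.owner
b = fetch_owner(1)  # e.g. what pet.collar.owner quietly fetched again

same_row = a.pk == b.pk  # both represent owner #1...
same_object = a is b     # ...but the second fetch built a new object
```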

Avoiding In-Memory Operations

One fairly good rule of thumb is to offload the heavy lifting to the database.  Thus, it’s not necessarily bad practice to do something like this:

class Owner(models.Model):
    age = models.IntegerField()
    date_joined_website = models.DateField()
    number_of_pets = models.IntegerField()

sorted_list1 = Owner.objects.all().order_by("age")
sorted_list2 = Owner.objects.all().order_by("date_joined_website")
sorted_list3 = Owner.objects.all().order_by("number_of_pets")

 

In this example, the queries are quite simple.  But if we were making more complex queries that were each fairly costly, we could easily beat the database by fetching once and sorting in Python.  In a particular case I was optimizing, I found that operator.attrgetter() was particularly handy for minimizing CPU time spent sorting data.  We could do something like so:

import operator

class Owner(models.Model):
    age = models.IntegerField()
    date_joined_website = models.DateField()
    number_of_pets = models.IntegerField()

sorted_list1 = list(Owner.objects.all().order_by("age"))
sorted_list2 = sorted(sorted_list1, key=operator.attrgetter("date_joined_website"))
sorted_list3 = sorted(sorted_list1, key=operator.attrgetter("number_of_pets"))

 

Yes.  This is how to get high-fives.
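
As a runnable plain-Python illustration of the one-query-plus-in-memory-sorts idea (FakeOwner is a stand-in for the Owner model above, and the sample data is made up):

```python
import operator
from dataclasses import dataclass
from datetime import date

@dataclass
class FakeOwner:  # stand-in for the Owner model above
    age: int
    date_joined_website: date
    number_of_pets: int

# pretend this single list came from one order_by("age") query
sorted_by_age = [
    FakeOwner(35, date(2021, 5, 1), 2),
    FakeOwner(42, date(2019, 3, 9), 0),
    FakeOwner(51, date(2022, 1, 15), 4),
]

# the other two orderings are computed in memory -- no extra queries
by_join_date = sorted(sorted_by_age, key=operator.attrgetter("date_joined_website"))
by_pet_count = sorted(sorted_by_age, key=operator.attrgetter("number_of_pets"))
```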

Not Taking Advantage of Django Functionality

Finally, it’s important to recognize the built-in shortcuts Django has in place.  For example:

Owner.objects.all().exists()

is way faster than:

Owner.objects.count() > 0

and

Owner.objects.count()

is way faster than:

len(Owner.objects.all())

These are the simplest examples, and it would be a massive tangent to delve into all of the ways that Django queries can be optimized.  The important thing is to talk to Google about it for your particular case.