SQL Database Best Practices with Django (or any ORM)

True Story Follows

So I’m by no means a database expert and don’t consider myself as such, but some of my coworkers who I would label as database experts have educated me enough to learn some best practices that would be difficult to otherwise stumble upon and self-learn. At this point I at least know enough to occasionally educate someone else about one or more of these alleged best practices, so I think it would be useful to document the practices I’ve adopted.

Basic Database Terms

If you’ve read this far down the page, I’m going to assume that you quite possibly know nothing about databases except that they’re a storage mechanism you can query, and that they’re generally faster than any code you could write yourself to retrieve the same data. There are a few terms I’ll end up throwing around, so I’m going to define them here:

  • Database Index: A database table contains rows. Some of the columns on those rows are needed in order to identify and find the data, while other columns simply give us information. In any case where you are going to search for a database row based on one of its attributes, that attribute should exist somewhere on a database index. The index means that the attribute is stored on disk in a structure that allows it to be retrieved very quickly. Conceptually, this is the same as putting the word “Melon-Lion” in the dictionary in the M section right after the word “melon”. In practice, the index is stored in a B-tree for quick retrieval.
  • Multi-Column Index: A multi-column index is an index that consists of multiple attributes. Order matters. An index on the attributes (a, b, c) is completely different from one on (c, b, a). Reading the index from left to right, the leftmost attributes must exist in the query. With the index (a, b, c) you can query by (a), (a, b), and (a, b, c), but you could not use the index to query by (b, c) alone (see the sketch just after these definitions).
  • Table Scan: A table scan occurs when you’re querying by an attribute that isn’t indexed. Imagine a dictionary where the words were completely shuffled. The ONLY way to find the page for the word “Melon-Lion” would be to flip the pages one by one.
  • Dependent Subquery: My explanation here might be off, but a dependent (or correlated) subquery is a query nested inside another query that references data from the outer query, which generally means it must be re-evaluated for each row the outer query considers.
  • Temp Table: A temp table is an intermediate table the database creates on the fly (in memory or on disk) in order to execute certain operations, such as some table joins.
  • ORM: Object-Relational Mapper. ‘A programming technique for converting data between incompatible type systems in object-oriented programming languages. This creates, in effect, a “virtual object database” that can be used from within the programming language.’ In this case and in all my examples, I’m referring to Django’s ORM that represents objects in the database.
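
To make the leftmost-prefix rule concrete, here’s a minimal sketch using a made-up Django model with a three-column index (the model and field names are mine, purely for illustration):

from django.db import models

class Event(models.Model):
    a = models.IntegerField()
    b = models.IntegerField()
    c = models.IntegerField()

    class Meta:
        index_together = [("a", "b", "c")]

# These queries constrain a leftmost prefix of (a, b, c),
# so they can all use the index:
Event.objects.filter(a=1)
Event.objects.filter(a=1, b=2)
Event.objects.filter(a=1, b=2, c=3)

# This query skips the leading column, so it cannot use the index
# and will fall back to a table scan (absent another index):
Event.objects.filter(b=2, c=3)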

The “gotchas” in Django

Django can make writing code easy. A little too easy. So easy it can encourage some really sloppy practices, particularly with SQL.

Alex Gaynor gave an interesting talk to the SF Django Meetup where he made the very good point that Django’s sample code can be extremely misleading because it’s assumed that the sample code is good code. But it often isn’t.

Some mistakes can be made that are extremely subtle, and without knowing some of Django’s nuances it’s easy to remain ignorant of problems.

Understanding that Querysets are Lazily Evaluated

Querysets are lazily evaluated, and this is important to understand from the get-go. Lazy evaluation is generally extremely powerful, but if you don’t know that queries are lazily evaluated you will run into huge problems, generally because you write code without realizing when a query actually gets executed, and this often leads to dozens or even hundreds of unnecessary and costly database queries. Let me illustrate that point with a simple example:

class Animal(models.Model):
    name = models.CharField(max_length=50)

all_animals = Animal.objects.all()
for animal_name in ['lizard', 'rat', 'dog', 'melon-lion']:
    print animal_name in all_animals.values_list('name', flat=True)

When I first started using Django, I would have thought that this was one database query: we did an initial fetch, it loaded all the data into some internal object, and then we checked whether each animal name existed in the result. Wrong.

It’s 4 redundant database queries, one per iteration of the loop. Don’t assume Django is doing some magical caching for you. At the point at which you intend to have the information on hand, go ahead and save the data into your own data structure. In this case, it would be optimal to do something like:

class Animal(models.Model):
    name = models.CharField(max_length=50)

all_animals = set(Animal.objects.all().values_list('name', flat=True))
for animal_name in ['lizard', 'rat', 'dog', 'melon-lion']:
    print animal_name in all_animals

Dependent Sub-Queries

Further down in this post, I try to explain why dependent sub-queries are bad. But for the time being, just accept that dependent sub-queries are bad, and therefore we want to avoid doing them. Here’s an example of a dependent sub-query in Django:

class Owner(models.Model):
    full_name = models.CharField(max_length=50)

class Animal(models.Model):
    name = models.CharField(max_length=50)
    owner = models.ForeignKey(Owner)

all_animals = Animal.objects.filter(owner__in=Owner.objects.filter(full_name="Scott Lobdell"))

The intent here is to fetch all of the animals who have an owner by the name of “Scott Lobdell”. So let’s fetch the owners first and then run the query (the code below is still bad practice and still contains a dependent sub-query):

possible_owner_ids = (Owner.objects.filter(full_name="Scott Lobdell").
                             values_list("id", flat=True))
all_animals = Animal.objects.filter(owner_id__in=possible_owner_ids)

Can you spot the problem? It’s very subtle.

Remember that Django querysets are lazily evaluated, and another “gotcha” is that a values_list doesn’t actually return a list, but another subclass of QuerySet that’s still lazily evaluated.
So the result is that we’re still passing a dependent sub-query. We can fix it with:

possible_owner_ids = list(Owner.objects.filter(full_name="Scott Lobdell").
                             values_list("id", flat=True))
all_animals = Animal.objects.filter(owner_id__in=possible_owner_ids)

Lack of a Table Join

Here’s a simple example of what would surely be unintended behavior on the programmer’s part:

class Owner(models.Model):
    full_name = models.CharField(max_length=50)

class Animal(models.Model):
    name = models.CharField(max_length=50)
    owner = models.ForeignKey(Owner)

all_animals = Animal.objects.all()[0: 100]
for animal in all_animals:
    print animal.owner.full_name

This iteration would cause 100 additional database queries, one for each animal’s owner. You can get around this with:

all_animals = Animal.objects.all().select_related("owner")[0: 100]
for animal in all_animals:
    print animal.owner.full_name

The above example is still sub-optimal; we’ll talk more about why table joins are bad further down.

Another example of unintended consequences:

first_animal = Animal.objects.all()[0]

new_animal = Animal.objects.create(
    name="melon-lion",
    owner_id=first_animal.owner.id
)

There’s an extra database query being made here: accessing first_animal.owner fetches the entire owner row just to read its ID, which is already stored on the animal row. This could be simply mitigated with:

first_animal = Animal.objects.all()[0]

new_animal = Animal.objects.create(
    name="melon-lion",
    # note the change from owner.id to owner_id
    owner_id=first_animal.owner_id
)

Table Join

Using table joins creates a temp table. For the sake of argument, assume these are bad, and I explain more of why further down. If you were using purely SQL as your tool for doing whatever it is what you were doing, it would indeed make a lot of sense to join tables together to query information.

But the point is: from the perspective of a programmer like myself who’s ALWAYS using an ORM, you’re always using another programming language with its own data structures in conjunction with SQL. So there are alternate ways of accomplishing the same thing without joining tables in SQL.

So consider again our simple example of iterating over animals and printing their owners’ respective names. Instead of doing one database query with a large table join (select_related), we can do two database queries and join them in memory:

all_animals = Animal.objects.all()[0: 100]
owner_ids = [a.owner_id for a in all_animals]

corresponding_owners = Owner.objects.filter(id__in=owner_ids)
owner_id_to_owner = {o.id: o for o in corresponding_owners}

for animal in all_animals:
    owner_id = animal.owner_id
    corresponding_owner = owner_id_to_owner[owner_id]
    print corresponding_owner.full_name

To explain the above code: instead of joining the two tables together on owner ID in SQL, I’m fetching the owners by their IDs and then using a basic dictionary to map each ID to its object. Now I can use that dictionary to effectively join the two tables in memory.
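
If you find yourself repeating this pattern, you could wrap it in a small helper. Here’s a minimal sketch (fetch_related_by_id is a name I made up for illustration, not a Django API):

def fetch_related_by_id(objects, fk_attr, related_manager):
    # In-memory "join": map each object's FK value to the related row.
    # fk_attr is the name of the integer FK column, e.g. "owner_id";
    # related_manager is the related model's manager, e.g. Owner.objects
    related_ids = set(getattr(obj, fk_attr) for obj in objects)
    related_rows = related_manager.filter(id__in=list(related_ids))
    return {row.id: row for row in related_rows}

# Usage:
animals = list(Animal.objects.all()[0: 100])
owner_id_to_owner = fetch_related_by_id(animals, "owner_id", Owner.objects)
for animal in animals:
    print owner_id_to_owner[animal.owner_id].full_name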

Why Dependent Subqueries are Bad

This is a question that I don’t claim to be enough of an expert to answer. Check out a StackOverflow question about the topic, where the top answer puts it this way:

“If you have a dependent subquery you might run into performance problems because a dependent subquery typically needs to be run once for each row in the outer query. So if your outer query has 1000 rows, the subquery will be run 1000 times.”

In what cases is a subquery dependent versus independent? I couldn’t tell you the details. But again, from the perspective of a programmer that’s just using an ORM, why even bother with subqueries at all when you never have to?

Why Temp Tables are Bad

Another really good answer on this topic comes from a blog post. In short, doing a table join will result in:

  • CREATE TABLE
  • INSERT data
  • SELECT data
  • DROP table

You’re creating an unnecessary load on the database, when the database is a shared resource.
Meanwhile, your ORM is being executed from a worker process that can be scaled at will, so in many cases it’s better not to let the database do the heavy lifting.

Indexing in Django

The lowest-hanging fruit when it comes to database optimization is to ensure that the columns you’re querying against are indexed.
In Django, you can do that with db_index=True:

class AccessToken(models.Model):
    token = models.CharField(max_length=255, db_index=True)
    user_id = models.IntegerField()
    permission_level = models.IntegerField()

In the above example, let’s pretend I have an application where I make requests where I include an access token string as one of the parameters. I need to be able to retrieve access tokens based on that string, so I can index it to query quickly.

AccessToken.objects.filter(token="UAEKASDKLJGQWJKHWESDFKJL")

The index is stored in a B-tree, so the access token will be found in logarithmic time: if I had 1 billion elements, searching for the access token by that index would take about as many comparisons as searching 30 elements linearly, since log₂(10⁹) ≈ 30.

Without the index, querying for the Access Token would result in a table scan. So if we had a billion elements with no indexes, and we ran the above query, it might take a very long time to find the correct row(s).
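
If you want to verify what SQL Django actually generates, and whether it hits an index, one approach is to print the query and run it through your database’s EXPLAIN. A minimal sketch, assuming DEBUG = True so that Django records executed queries:

from django.db import connection

token_qs = AccessToken.objects.filter(token="UAEKASDKLJGQWJKHWESDFKJL")

# str(qs.query) shows the SQL Django will send, without executing it;
# paste it into your database shell prefixed with EXPLAIN to see
# whether it uses an index or does a table scan
print str(token_qs.query)

# With DEBUG = True, Django records every executed query here, which
# makes redundant or unindexed queries easy to spot in development
list(token_qs)  # force evaluation
print connection.queries[-1]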

Made Up Real World Example for Multi-Column Indexes

You can add indexes for multiple columns on a database table. Each index is stored in its own B-tree, and as long as an index exists for the WHERE clause of your query, it will be pretty fast. But let’s consider an example where a multi-column index would be extremely useful.

Let’s pretend you’re building an application for a huge store, say Wal-Mart, and you want to store every single purchasable item it sells across the country in a database row, keyed by serial number (let’s pretend every single item at Wal-Mart has a unique serial number). So let’s assume that the number of purchasable items is on the order of several billion. You could index by serial number and call it a day, but once your database starts getting a load of requests, you’ll start to run into problems.

Namely, the amount of data that you have is going to take up some ridiculous amount of hard drive space.
Even if every item is indexed, the hard drive will still have to spin to the appropriate location in order to retrieve the data.
In the real world, the limitation here is the drastically slow speed of mechanical elements.

So in our made up scenario, if I have two physical items in a store, and I query for both of them in the database, they’re probably going to be spread out across the hard drive, and so the database is going to be slow in retrieving their data even with the index.

Enter Multi-Column Indexing.
In the above scenario, I was querying 2 items in the same store.
Maybe even the same department. Maybe even the same aisle. And that’s probably the common use case across the board.

So instead of just indexing by serial number, we could create a multi column index that was something like (store_id, department_id, aisle_id, serial_number). In this way, data is stored on the hard drive in blocks that align with the index. Therefore, all of our data is scoped, and now if I were to query every single adjacent item in a single aisle of a Wal-Mart store, every item would hit an index and would be adjacent on the hard drive, so the queries would be performant.

Note that with this example, if we want to query by serial number using this index, we also need to know the store, department, and aisle. If there’s another use case that doesn’t know that information, a separate index needs to be created.

In Django, you can either use the index_together Meta attribute, or you can write your own SQL and simply reference that database table with Django.
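
Here’s a minimal sketch of what that could look like for the made-up Wal-Mart scenario (the model and field names are invented for illustration):

from django.db import models

class StoreItem(models.Model):
    store_id = models.IntegerField()
    department_id = models.IntegerField()
    aisle_id = models.IntegerField()
    serial_number = models.CharField(max_length=50)

    class Meta:
        index_together = [
            ("store_id", "department_id", "aisle_id", "serial_number"),
        ]

# Every item in one aisle of one store: the filter constrains a
# leftmost prefix of the index, so the matching rows are adjacent
# in the index and cheap to retrieve together:
StoreItem.objects.filter(store_id=42, department_id=7, aisle_id=3)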

More Best Practices

From my experience using Django, here are some of the things that I’ve found to be extremely beneficial and practical:

Build an Abstraction Layer on Top of your Django Models

My favorite book on coding is Clean Code by Bob Martin. You should read it.
One of the small take-aways is that if you’re using a 3rd party library (in this case your ORM), it’s a good idea to build your own abstraction layer on top of it. That way, if you ever decide you want to change which 3rd party library you’re using, it’s relatively simple and straightforward.

Taking our example above with an Access Token, the way I do this was outlined in a previous blog post, but here’s an example of an abstraction layer built on top of Django’s ORM:

import base64
import uuid

from django.core.exceptions import ObjectDoesNotExist
from django.db import models

# PermissionLevel and InvalidAccessToken (the Null Object mentioned
# below) are assumed to be defined elsewhere in the project


class _AccessToken(models.Model):
    token = models.CharField(max_length=255, db_index=True)
    user_id = models.IntegerField()
    permission_level = models.IntegerField()


class AccessToken(object):

    def __init__(self, _access_token):
        self._access_token = _access_token

    def has_access_to_user(self, user):
        if self._access_token.permission_level == PermissionLevel.GOD:
            return True
        return self._access_token.user_id == user.id

    @property
    def token_data(self):
        return self._access_token.token

    @classmethod
    def create_for_user(cls, user):
        _access_token = _AccessToken.objects.create(
            token=base64.b64encode(str(uuid.uuid4())),
            user_id=user.id,
            permission_level=PermissionLevel.NORMAL
        )
        return AccessToken(_access_token)

    @classmethod
    def get_from_token_data(cls, token_data):
        try:
            _access_token = _AccessToken.objects.get(token=token_data)
        except ObjectDoesNotExist:
            return InvalidAccessToken()
        return AccessToken(_access_token)

(Hey! Check out the Null Object Pattern used)

In this way, the rest of the application knows nothing about Django. If I suddenly wanted to swap out ORMs, it would be easy to do. All of my use cases are encapsulated in one file. It’s easy to figure out which database indexes need to be created. Proper querying is baked into the code.

As for my earlier example of avoiding a temp table entirely with that arguably whacky dictionary logic, it’s easy to isolate it in one place so that other colleagues never have to mess with it.

Never Ever Use Foreign Keys

A foreign key constraint exists in order to enforce certain rules about the database setup.
But again, if you’re using SQL in conjunction with an ORM, your code’s logic should already enforce your constraints.
Foreign keys add an extra check that can slow things down, but more importantly, eliminating foreign keys will:

  • Force you to understand what’s going on underneath the hood
  • Give you a small performance boost
  • Ease maintenance

The first two of those bullets are pretty self-explanatory, but the maintenance piece comes from the added flexibility of not having a foreign key constraint. Not having foreign keys will enable you to do things like store tables in entirely different databases or entirely different storage engines. For example, perhaps you want to move an entire database table into a pure caching layer like Redis or Memcached. Or perhaps you could move a finite database table into a simple JSON file.
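
As a concrete sketch with the earlier Animal/Owner example (the application-side integrity check is my own illustration of where the enforcement logic could live, not a prescribed pattern):

from django.db import models

class Animal(models.Model):
    name = models.CharField(max_length=50)
    # A plain indexed integer instead of models.ForeignKey(Owner):
    # no constraint lives in the database, so the Owner table could
    # later move to another database, a cache, or a flat file
    owner_id = models.IntegerField(db_index=True)

def create_animal(name, owner_id):
    # The referential check the database would have performed now
    # lives in application code
    if not Owner.objects.filter(id=owner_id).exists():
        raise ValueError("unknown owner_id: %d" % owner_id)
    return Animal.objects.create(name=name, owner_id=owner_id)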

Don’t use Django’s Many to Many Relation Manager

This is really an extension of the first bullet. If you don’t know how else to represent a many to many relationship without Django’s ORM, then that means you don’t know what’s happening under the hood, and the result will eventually be bad database queries (and temp tables). When you simply use Django’s many to many manager, you won’t even get to see the table that’s being used unless you explicitly create a through table.

The solution I like is a combination of everything written above. Create an abstraction layer. Create a many to many table that’s “private” along with your other private Django models. Use the abstraction layer to query the many to many table to get the object IDs you need, then query those objects separately.

Quick example from a current side project:

class _DayFramework__WorkoutComponent(models.Model):
    day_framework_id = models.IntegerField(db_index=True)
    workout_component_id = models.IntegerField()


class _DayFramework(models.Model):
    pass


class DayFrameworkCollection(object):
    pass

In this example, I have a program that generates workouts by creating the framework for a workout, and each of those frameworks references some workout components. So I have the day framework table, a many to many table that references workout components, and then the workout components themselves live elsewhere and in this case are actually just enumerated data types (an example of the flexibility of not using foreign keys and not necessarily storing your data in the same database).

Then I have an abstraction layer that’s just a Python object, and nothing else except that one class needs to know about the many to many relationship and the querying involved.

Now in practice and behind my abstraction layer, I can query the many to many table like so:

    # Methods from DayFrameworkCollection (caching stripped, as noted
    # below); assumes "from collections import defaultdict" at the
    # top of the module
    def _get_m2m_workout_components(self):
        day_framework_ids = [d.id for d in self._sorted_day_frameworks]
        m2m_rows = list(_DayFramework__WorkoutComponent.objects.filter(
            day_framework_id__in=day_framework_ids))
        return m2m_rows

    def _create_day_framework_id_to_workout_component_list(self):
        day_framework_id_to_workout_component_list = defaultdict(list)
        for m2m_row in self._get_m2m_workout_components():
            day_framework_id = m2m_row.day_framework_id
            workout_component_id = m2m_row.workout_component_id
            day_framework_id_to_workout_component_list[day_framework_id].append(workout_component_id)
        return dict(day_framework_id_to_workout_component_list)

First, note that there’s extra caching in the real class that I stripped out for the purposes of this example to keep the logic easy to understand. To explain the above code: a single day framework will have many workout components, so the purpose of the many to many table is to eventually give us data that represents one day framework and its many workout components. I represented this as a dictionary mapping day framework IDs to lists of workout component IDs.

So in the code, I query the many to many table, and each row has a day framework ID and a workout component ID. Then I simply append the workout component ID to the right list in the dictionary, using a defaultdict as a shortcut.

It can get kind of confusing, but also notice that the logic is inside of a small private method (“private” meaning it has an underscore denoting such), that method has test coverage, and the logic inside the method never needs to be re-created again. You could also look at this in the sense that there’s a general inverse relationship between flexibility and efficiency. Here I’ve given up the flexibility to maximize performance, and that performance is now baked into a particular use case associated with a Python class.

The last example really encapsulates most of what I’ve written about.

  • There’s an index for one side of the many to many table based on the query patterns of our use case.
  • There’s no foreign key (I used an integer instead of a foreign key).
  • There’s an abstraction layer for the use case.
  • Database queries are made one table at a time with no table joins or dependent sub-queries.
  • I’m casting QuerySets to lists as soon as I know I want to fetch data from the database.
  • The “gotchas” that you might experience with Django are impossible to stumble upon for another programmer that’s using the new class’s API.

I didn’t work multi-column indexing into this last example, though. Whatever.

The End

  • I strongly suggest that you get yourself a copy of SQL Antipatterns. Heck, I’m even tempted to buy a copy and send it to you.

    You make so many bad assertions here that it really is quite scary.

    • Joshua Smeaton

      Agree. I really hope the author has learned a few more things since writing this. Nearly everything about joins is wrong. Foreign Keys should always be used. Subqueries are *usually* fine, depending on the backend in use, and the query being executed.

      Also, your subquery example should simply be:

      Animal.objects.filter(owner__full_name='Scott Lobdell')

      So much harmful advice here.

      • Scott Benedict Lobdell

        If anything I feel more strongly about it. Since writing this I’ve had more cases of migrating a database table to a different data store where the foreign keys can’t exist, so having the enforcement logic on the application side has made the migration painless and otherwise the application maintains the same SLA’s.

        And doesn’t your example query just do a simple table join?

        • Joshua Smeaton

          So because non-relational stores can’t have foreign keys, you should throw away foreign keys for relational stores? No. Non-relational databases force the application to handle things like referential integrity internally rather than as a native feature of the database. Same goes for indexes. It’s like saying write all of your statically typed code as dynamic code in case you need to port it to JavaScript (C# dynamic types as an example). If you’re writing it against a non-relational database that’s fine, but don’t hamstring a relational database just in case you’re going to move away from it. Very few people have this problem.

          Yes, my example query does a join, rather than a subquery or two separate queries. The suggestion that you’d need something other than a join is just as harmful and wrong. Your example doesn’t even feature a dependent (correlated) subquery. It’s just a simple, straightforward subquery. A correlated subquery uses data from the outer query, which these do not. Django ORM hardly ever uses correlated subqueries. Exists() might be an exception.

          Let’s step through this query, and roughly what the database is doing:

          Animal.objects.filter(owner__full_name='Scott Lobdell')

          SELECT a.* FROM ANIMAL a JOIN OWNER o ON a.owner_id = o.id WHERE o.name = 'Scott Lobdell';

          Assuming the owner table is smaller than animal and that no index exists on owner.name (the index should exist if this is a regular query..), the database will:

          1. Scan the entire owner table (or index if one exists) for the appropriate owner_ids (probably returning a single item) and fetch that into memory.

          2. Since a.owner is a foreign key, it’s indexed, so it’ll then scan the a.owner index for ids matching the previous step, returning the primary key (a.id) of each.

          3. With the very small list of primary keys from the Animal table, it’ll then do a seek to each row on disk, and return just that piece of data.

          4. Send the matching rows across the network.

          5. Django reassembles that data into Model objects (fairly expensive)

          Total disk hits == number of rows returned + number of rows in owners table if there was no index on the Name.

          Here’s your version:

          possible_owner_ids = list(Owner.objects.filter(full_name="Scott Lobdell").
          values_list("id", flat=True))
          all_animals = Animal.objects.filter(owner_id__in=possible_owner_ids)

          Your version does everything mine does with the following differences:

          – An extra network roundtrip to fetch all of the owner IDs. Extra data is returned on the wire here too.
          – A ValuesQuerySet construction in python for the owner IDs, including any pre- and post- database conversions. I mentioned above that queryset and model object construction is expensive. You’ve got two of them.

          When the list of owners named “Scott Lobdell” gets large (assume a different kind of query where that could be true), passing a huge list of IDs as an argument to IN() gets horribly inefficient. Joins are preferred even **more** in that case.

          Doing two queries to avoid a join or subquery is nearly always wrong. The only time you might have to do this with Django is when you’re querying lots of rows from the reverse side of a foreign key. But now we have prefetch_related() which does the second query for you and caches the result.

          Your premise that joins are bad is wrong and infuriating. Please link to some documentation that shows that temporary tables are created when joining data. For relatively small (within RAM..) queries, joining is all done in memory. See here for the various algorithms used when joining tables: http://www.madeiradata.com/loop-hash-and-merge-join-types/

          Joins on data that aren’t linked by foreign keys and without appropriate indexes are going to perform really badly. But guess what? Pulling all of the data from both tables and then doing either a nested join or a hash merge (which you’ve loosely proposed) in PYTHON (one of the slowest languages) is ALWAYS going to be slower than what the database is capable of doing, just from a CPU perspective. Then you’ve also got to account for network transfer of all of this data.

          Seriously, perform some benchmarks. Show some numbers. Do some reading. You admitted yourself that you aren’t much of a SQL person. You shouldn’t be dishing out best practice advice based on conversations you’ve had with co-workers without verifying particular behaviour. Most of the advice above is wrong.

          ORMs are great. They aren’t an excuse for not understanding SQL. Unfortunately they are far too often. I’m not suggesting that everyone be an expert in SQL, but understanding a few of the basics (how joins work, how indexes work, and what best practice is for ORM usage — select_related() being the main one) is absolutely required.

          • Scott Benedict Lobdell

            All of the expensive parts you mention are on the application side and easily scalable. That’s the point.

    • Scott Benedict Lobdell

      like what?

  • Sayem

    This article is very ill-informed and borderline dangerous. It came up on the first page of a Google search for one of my queries, so it’s important to point out the blatant mistakes.

    Losing the relational bit in a relational database (aka foreign keys) is insane. Why would you suggest anyone do such a thing? “Force you to learn what’s going on under the hood”? Would you jump out of an airplane without a parachute because it’d “force” you to learn how to fly? Utter madness.

    Oh, and your database must have one table. Why else would you ask people to avoid joins?

    • Scott Benedict Lobdell

      See above comments

  • Naveen Michaud-Agrawal

    This is satire, right?

  • Just tried some benchmarks –

    directavg: 0.17892849445343018s (a query doing join with double underscore)
    subavg: 0.18407721519470216s (a query doing join with a subquery – as you advise against?)
    listsubavg: 0.24150981903076174s (calling list() on the subquery first – seems significantly slower)

    gist here: https://gist.github.com/chozabu/6f41a359423c7ef07171279471d981d0

    so – seems like

    all_animals = Animal.objects.filter(owner__in=Owner.objects.filter(full_name="Scott Lobdell"))

    should be faster than

    all_animals = Animal.objects.filter(owner__in=list(Owner.objects.filter(full_name="Scott Lobdell")))

    but

    all_animals = Animal.objects.filter(owner__full_name="Scott Lobdell")

    would be even faster – as well as being more concise and understandable

    • Scott Benedict Lobdell

      Yeah, it is slower because you’re shifting the load from the database to the application, but that’s the point I’m trying to make because the database is a shared resource.

      • Ah, I see your point – though after another quick benchmark it seems incorrect

        subavg: 0.3328558921813965
        asubavg: 0.3521065473556519 plus 0.09574031829833984 (application work)
        directavg: 0.31947994232177734

        I timed the list evaluation and subtracted it from the total – gist here – https://gist.github.com/chozabu/9bbbbf8f1ba0508d19c6e286c3bafbea

        Result is it still seems slower just in terms of DB access! We are getting a massive list out of the database, evaluating it in python, and sending a massive list back into the database.

        If you have some benchmarks that can show otherwise, it’d be great to see

        • Scott Benedict Lobdell

          It seems I should, given the comments here 🙂 also never expected this to rank on Google

          • Well – I found it from a google search, though I don’t remember the search terms, it probably involved “subquery” and “django”
            I was searching due to this discussion: https://github.com/philipn/django-rest-framework-filters/issues/54
            The link to this page probably increases the search ranking 😉

            but really – if there is a case where it is verifiably quicker not to use subqueries it would be good to know – but it seems like normally they are much better than fully evaluating the first query and performing a second based on the result

          • If you do – would be fantastic to post a comment here and alert me. I love being shown to be incorrect – means I have learned something new 🙂

  • Jamie Strauss

    In addition to what Schinckel wrote, you should get a copy of “Joe Celko’s Thinking in Sets: Auxiliary, Temporal, and Virtual Tables in SQL”.

    Everything you have written is absolutely ludicrous.