Composite Indexing in SQL

The Problem

If you’re one of those wild and crazy coders who likes to shoot from the hip, a winning strategy when creating a table in SQL might be to add an index on every single column in the table. That way, no matter what query gets made, an index will be hit, because every column exists somewhere in its own independent B-tree. The query comes back relatively fast, you stand by the keg and high-five some bros, and then you measure your arms against the rest of your team.

But there is a better way. Putting some forethought into your use cases will let you create multi-column indexes that are, as I’ve found, more efficient in every way: storage space, indexing speed, and querying speed.

What is a Composite Index?

When I first started my professional coding career, I had never heard of a multi-column index. But every time I tried to merge SQL scripts, I was backhanded in the face until I rewrote my indexes as composite indexes.

An analogy that I like to use to explain multi-column indexes: pretend that we have a database that stores every single item sold by Wal-Mart in the entire United States. If you created a table where every single column was indexed independently, then you’re querying every item against every other item without scoping down the query at all. Every single query traverses a tree that includes all items in the entire US.

That setup makes especially little sense if you’re searching for an item in a given aisle of a given store, and in reality every use case is likely to be far more scoped than that. I might search for an item at Wal-Mart by state, then by city, then by store ID, then by department, then by aisle number, and then finally by some unique ID.

So a composite index is simply an index built across multiple columns where, reading from right to left, you never know a given column’s value without also knowing the value of every column to its left (i.e. if I know the aisle, I also know the department, and if I know the department, I also know which store I’m in, etc…).

So in technical terms: given a composite index on columns (a, b, c), I will hit the index if I…

  • SELECT * FROM table WHERE table.a = '…';
  • SELECT * FROM table WHERE table.a = '…' AND table.b = '…';
  • SELECT * FROM table WHERE table.a = '…' AND table.b = '…' AND table.c = '…';

But I will NOT hit the index if I…

  • SELECT * FROM table WHERE table.b = '…';
  • SELECT * FROM table WHERE table.c = '…' AND table.b = '…';
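
To make that concrete, here’s a minimal sketch of declaring a single composite index in SQLAlchemy (the ORM used later in this post); the table and index names here are my own invention:

from sqlalchemy import Column, Index, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class SomeTable(Base):
    __tablename__ = 'some_table'

    id = Column(Integer, primary_key=True)
    a = Column(String)
    b = Column(String)
    c = Column(String)

    # A single B-tree over (a, b, c). Queries filtering on a leading
    # prefix of the columns, i.e. (a), (a, b), or (a, b, c), can use it;
    # queries that skip `a` cannot.
    __table_args__ = (
        Index('ix_some_table_a_b_c', 'a', 'b', 'c'),
    )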

Setting Up an Experiment

My theory sounds excellent, and there is no way that you’d ever doubt me. Unless you just don’t trust some random guy on the internets. So I set up an experiment for myself.

I created 3 different databases in Postgres and populated each one with a single table of a million rows across 4 columns of UUIDs. Then I compared save and query performance. Here’s the Python code to walk through everything…

For the code, I set up a virtual environment with the following requirements:

SQLAlchemy==1.0.5
psycopg2==2.6
wsgiref==0.1.2

Save that text to a requirements.txt file and then you can install the requirements:

pip install -r requirements.txt

Here’s my generic Python code that we can use for general SQL operations:

import os

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker


# Credentials come from environment variables so they stay out of the code
USERNAME = os.environ['database_username']
DATABASE_NAME = 'minimal_index_test'
HOST = "localhost"
PORT = "5432"
PASSWORD = os.environ['database_password']

connect_url = 'postgresql://{username}:{password}@{host}:{port}/{database_name}'.format(
    username=USERNAME,
    host=HOST,
    port=PORT,
    database_name=DATABASE_NAME,
    password=PASSWORD,
)

# echo=True logs every SQL statement, which is handy for watching the tests run
engine = create_engine(connect_url, echo=True)

Session = sessionmaker(bind=engine)

db_session = Session()

For each test I created a different table. For simplicity’s sake, I’ve combined all 3 variants into one model below; keep exactly one __table_args__ block uncommented at a time (Python only honors the last assignment, so leaving all three in place would silently use the last one):

from sqlalchemy import Column, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.schema import PrimaryKeyConstraint
from sqlalchemy.schema import Index

Base = declarative_base()


class ArbitraryTableMinimalIndex(Base):
    __tablename__ = 'arbitrary_table'

    col1 = Column(String)
    col2 = Column(String)
    col3 = Column(String)
    col4 = Column(String)

    # COMPOSITE INDEX: a single four-column primary key, i.e. one
    # composite B-tree over (col1, col2, col3, col4)
    __table_args__ = (
        PrimaryKeyConstraint('col1', 'col2', 'col3', 'col4'),
    )

    # INDEX EVERYTHING: a separate single-column index per column
    # (uncomment this block, and comment out the one above, for test #2)
    # __table_args__ = (
    #     Index('index1', 'col1'),
    #     Index('index2', 'col2'),
    #     Index('index3', 'col3'),
    #     PrimaryKeyConstraint('col4'),
    # )

    # NO INDEX (in practical terms): only the mandatory primary key
    # (uncomment this block for test #3)
    # __table_args__ = (
    #     PrimaryKeyConstraint('col4'),
    # )

Then for each different database, I created the table with its respective indices:

from models.arbitrary_table import Base
Base.metadata.create_all(engine)
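
If you want to sanity-check which indexes each database actually ended up with, you can ask Postgres directly. A small sketch using the standard pg_indexes view and the db_session from earlier:

# List the indexes Postgres created for the table
result = db_session.execute(
    "SELECT indexname, indexdef FROM pg_indexes "
    "WHERE tablename = 'arbitrary_table'"
)
for name, definition in result:
    print "%s: %s" % (name, definition)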

And finally, a simple script I hacked together that will create data:

from models.arbitrary_table import ArbitraryTableMinimalIndex

import uuid
import datetime

start_time = datetime.datetime.utcnow()
# 1,000,000 total rows = 1,600 outer iterations x (5 * 5 * 5 * 5) rows each
rows_to_create = 1000000 / (5 * 5 * 5 * 5)
all_rows = []
for _ in xrange(rows_to_create):
    print "Adding batch %s" % _
    # Nesting the loops reuses values: each col1 value repeats 125 times,
    # each col2 value 25 times, each col3 value 5 times; col4 is unique
    for outer1 in xrange(5):
        col1 = str(uuid.uuid4())
        for outer2 in xrange(5):
            col2 = str(uuid.uuid4())
            for outer3 in xrange(5):
                col3 = str(uuid.uuid4())
                for outer4 in xrange(5):
                    col4 = str(uuid.uuid4())
                    row = ArbitraryTableMinimalIndex(
                        col1=col1,
                        col2=col2,
                        col3=col3,
                        col4=col4,
                    )
                    all_rows.append(row)
db_session.add_all(all_rows)
db_session.commit()
end_time = datetime.datetime.utcnow()
print "Finished in %s seconds" % (end_time - start_time).total_seconds()

Results

Thankfully for the purposes of writing this post, the results of the tests backed up the theory. Here’s the breakdown:

Indexing Time

No index (except for a mandatory primary key):

[screenshot: save timing with no index]

Compound index:

[screenshot: save timing with the compound index]

Index everything:

[screenshot: save timing with one index per column]

So here we can see that indexing (a good thing) does in fact incur a cost when saving items to the database. But between a composite index and multiple single-column indexes, the former is noticeably faster.

Database Size

I can compare the size of each database with the following command:

SELECT t1.datname AS db_name,
       pg_size_pretty(pg_database_size(t1.datname)) AS db_size
FROM pg_database t1
ORDER BY pg_database_size(t1.datname) DESC;

And when I run that, here’s what I can see:

[screenshot: database sizes for the three variants]

Again, we can see that indexing incurs a cost, but the composite index outperforms the index-all-the-things index.
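
If you want to break those numbers down into table vs. index overhead, Postgres’s built-in size functions can do that too. A small sketch, run through the same db_session:

# How much of the database is the table itself vs. its indexes?
result = db_session.execute(
    "SELECT pg_size_pretty(pg_relation_size('arbitrary_table')), "
    "       pg_size_pretty(pg_indexes_size('arbitrary_table'))"
)
table_size, index_size = result.fetchone()
print "table: %s, indexes: %s" % (table_size, index_size)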

Querying Time

In order to test querying, I simply grabbed an item from the database at an offset of 723,034 or so, then queried the database with that row’s values to see how quickly it would be retrieved. I ran EXPLAIN ANALYZE in addition to the regular query to get all of the information.
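
In code, that measurement looks roughly like this (a sketch of the approach, not the exact script):

# Grab one row deep in the table, then time an exact-match lookup on it
row = db_session.query(ArbitraryTableMinimalIndex).offset(723034).first()

plan = db_session.execute(
    "EXPLAIN ANALYZE "
    "SELECT * FROM arbitrary_table "
    "WHERE col1 = :c1 AND col2 = :c2 AND col3 = :c3 AND col4 = :c4",
    {'c1': row.col1, 'c2': row.col2, 'c3': row.col3, 'c4': row.col4},
)
for line in plan:
    print line[0]

Results are below: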

No index (except for a mandatory primary key):

[screenshot: query plan and timing with no index]

Compound index:

[screenshot: query plan and timing with the compound index]

Index everything:

[screenshot: query plan and timing with one index per column]

Now we can really see the pay-off of a composite index! Sort of. The difference between no indexing and indexing in this table was 125 ms versus about 2 ms, and that ratio would only grow with more rows in the database.

It’s also important to note why the gap between the compound and the non-compound indexing was so small: the extreme cardinality of the data. You’ll notice that when the query runs against multiple single-column indexes, Postgres performs a bitmap AND, which is set arithmetic that intersects the sets of rows matched by each index. This operation happens in O(n) time, where n is the smaller of the two sets. In this case, because I populated the table with random UUIDs, the smaller set held only 25 rows, so the cost of the intersection was negligible.
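
To make the set arithmetic concrete, here’s a toy illustration in plain Python (a model of the idea, not Postgres internals): each single-column index lookup yields a set of matching row ids, and the database must intersect them.

# Each index lookup produces a set of matching row ids
rows_matching_col1 = set(xrange(0, 1000000, 8))   # 125,000 "matches"
rows_matching_col2 = set(xrange(0, 1000000, 40))  # 25,000 "matches"

# Intersecting them costs time proportional to the smaller set
matches = rows_matching_col1 & rows_matching_col2
print "%d rows match both predicates" % len(matches)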

Had this been a query against our imaginary Wal-Mart database, and we wanted to query dairy items in Store #123, two sets would have been created (all dairy items in the United States, and all items in Store #123), and the intersection of those sets would have been taken. At that database size, combining the indexes would NOT have been fast.

In the case of a compound index, a query for dairy items in Store #123 would be a query over dairy items that had already been scoped by store, and the rows would already be laid out on the hard drive that way.
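
In SQLAlchemy terms, the scoped layout for the analogy might look like the sketch below (the model, table, and column names are invented for illustration):

from sqlalchemy import Column, Index, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class StoreItem(Base):
    __tablename__ = 'store_items'

    id = Column(String, primary_key=True)
    state = Column(String)
    city = Column(String)
    store_id = Column(Integer)
    department = Column(String)
    aisle = Column(Integer)

    # One composite index mirroring how we actually search. A query
    # that drills down through the leading columns (state, city,
    # store_id, department) walks a single B-tree instead of
    # intersecting several huge independent indexes.
    __table_args__ = (
        Index('ix_store_items_scoped', 'state', 'city', 'store_id',
              'department', 'aisle'),
    )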

Conclusion

Indexing your database costs hard drive space (which is cheap) and save time. But for querying, the pay-off is several orders of magnitude, depending on your database size. And compared to creating a separate index per column, a composite index is a strict improvement: database size decreases, save speed increases, and querying speed increases. The only cost, if you want to call it that, is that you’ll need to think through your use cases in advance for all of your queries. This requires…thinking. And a little bit of typing.

And if we wanted to take this a step further (which I did not), there’s another dimension. For a very large data set, when queries are scoped (by store number, for instance), it’s quite probable that queries happening in succession will be related (i.e. query this item inside the store, then that item inside the store, etc.). The hard drive is the slowest bottleneck in a computer, and when related rows sit close together on the physical disk, seeking them one after another takes less time.

The End

  • Great post, Scott. The examples and analogy you gave were very helpful.

  • Sam Vilain

    It’s worth noting that the bitmap AND between the indexes you’re talking about is not available in all databases. They can’t combine them; if you’re lucky they’ll do some estimates of the number of rows likely to be returned by each one, and pick the one with lower cost.