Determining Your Primary Key for SQL

True Story Follows

So I’m on the grind chasing that paper and writing code like it’s 1995, when I come across an interesting case. I’m creating a new database table, and it makes sense to uniquely identify a single row with a composite primary key across multiple columns. My concern, though, is two-fold:

  • Each column is a UUID rather than an integer ID, so the composite primary key is large, which may be a concern.
  • One of the attributes that make up the primary key will change frequently, and common lore says you should not build a primary key out of anything that changes frequently, because updating it forces the database to physically move the record around on disk.

But this leads me to some other questions that I could not find answers for on the internets.

  • If I’m not supposed to make a frequently changing attribute part of a primary key, does adding it to an index carry the same adverse consequences?
  • Is querying by primary key vs. an index any faster?
  • Is the size of the primary key a real concern?

The Experiment

I had previously run an experiment on composite indices that proved more insightful than any random quip of theory on the internets, and since I couldn’t find answers to the questions above either, these were things I would have to find out on my own.

I ran an experiment that tested the above questions. To do so:

  • Create two SQL tables, one keyed by a composite primary key of 4 UUID columns and the other by a composite primary key of 4 integer columns. See if the integer table is noticeably faster for writes.
  • Create two SQL tables where one has a composite primary key of 4 UUIDs and the other has that same key represented as a composite index instead, with a single UUID primary key. Determine if updating records is more costly against the more complex primary key.
  • For those same two tables, see if querying directly against the primary key is any faster than querying against an otherwise identical index. (The timing pattern shared by all three tests is sketched just below.)
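
Every test boils down to the same measurement: time a loop of ORM operations and log the elapsed seconds. The actual scripts further down just inline the timing, but as a rough sketch of the pattern (the helper name and shape here are mine, not from the original scripts):

import contextlib
import datetime


@contextlib.contextmanager
def timed(label, output_file):
    # Time the wrapped block and append the elapsed seconds to the output file.
    start_time = datetime.datetime.utcnow()
    yield
    elapsed = (datetime.datetime.utcnow() - start_time).total_seconds()
    output_file.write("%s finished in %s seconds\n" % (label, elapsed))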

The Results

Before explaining how the experiment was run, here’s what I found:

  • A primary key composed of integers instead of UUIDs was roughly 16% to 19% faster for writing, depending on which time you treat as the baseline (58.251757 seconds vs. 48.749904 seconds to insert 50,000 records, one commit per record).
  • Query time on an index of integers vs. UUIDs was pretty much the same; there was about a 1% difference in speed in favor of the integer index.
  • Updating a primary key was about 10% more costly than updating an index (54.885245 seconds vs. 49.724557 seconds to update 10,000 records twice each).
  • Querying directly against a primary key vs. querying against an index was pretty much the same (10.290781 seconds vs. 10.549278 seconds to query 10,000 unique records).

The takeaway, then, is that indices should absolutely be created to match your query patterns, but making that index your primary key is not necessarily a good decision. Based on the first experiment, it makes sense to keep the primary key short, and based on the second experiment, it makes sense to decouple an index from the primary key.

Given that, my conclusion is that in an environment where we use UUIDs instead of auto-incrementing ID fields, I want my primary key to be a single UUID. For the benefit of laying records out on disk in a way that matches my query patterns, I might go ahead and make that primary key composite, scoped by something else that is static; what my experiment did not test for was the case in which the database has grown absurdly large, and in a real-world scenario it’s likely that queries happening in succession would be related. But I will not make my primary key anything that will require updating its value. Anything else that we want to query for can be stored in a separate index, roughly as sketched below. That will be faster to update, and the trade-off will likely be hard drive space, which I don’t care about.
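
To make that concrete, here is roughly the shape I would reach for. This is a sketch only: the table and column names are mine and were not part of the experiment.

from sqlalchemy import Column, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.schema import Index

Base = declarative_base()


class EventTable(Base):
    """Sketch: short, stable primary key; query pattern served by a separate index."""
    __tablename__ = 'event_table'

    # A single UUID primary key that never needs to be updated.
    id = Column(String, primary_key=True)

    # Attributes we query by, including ones that change frequently,
    # live in a composite index rather than in the primary key.
    account_id = Column(String)
    status = Column(String)

    __table_args__ = (
        Index('ix_event_table_account_status', 'account_id', 'status'),
    )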

How the Experiment Was Run

If you don’t trust a random guy on the internets, here’s some of the code I used to reach my conclusions. I used the same setup from my other database experiment (a placeholder version of that setup is sketched after the table definitions below), and from there these are the 3 tables I ended up using:

from sqlalchemy import Integer, Column, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.schema import PrimaryKeyConstraint
from sqlalchemy.schema import Index

Base = declarative_base()


class UUIDCompositeIndexTable(Base):
    """Table whose primary key is a composite of four UUID strings."""
    __tablename__ = 'uuid_composite_index_table'

    col1 = Column(String)
    col2 = Column(String)
    col3 = Column(String)
    col4 = Column(String)
    __table_args__ = (
        PrimaryKeyConstraint('col1', 'col2', 'col3', 'col4'),
    )


class UUIDShortPkCompositeIndexTable(Base):
    """Table with a single UUID primary key plus a separate composite index."""
    __tablename__ = 'uuid_table2'
    col1 = Column(String)
    col2 = Column(String)
    col3 = Column(String)
    col4 = Column(String)
    __table_args__ = (
        PrimaryKeyConstraint('col1'),
        # Index() takes the index *name* as its first argument, so the composite
        # index needs an explicit name in order to actually cover all four columns.
        Index('ix_uuid_table2_composite', 'col1', 'col2', 'col3', 'col4'),
    )


class IntegerCompositeIndexTable(Base):
    """Table whose primary key is a composite of four integers."""
    __tablename__ = 'integer_composite_index_table'

    col1 = Column(Integer)
    col2 = Column(Integer)
    col3 = Column(Integer)
    col4 = Column(Integer)
    __table_args__ = (
        PrimaryKeyConstraint('col1', 'col2', 'col3', 'col4'),
    )
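
The db_session used below comes from the same setup as my other database experiment and isn’t reproduced here. If you want to run this yourself, something along these lines works; the connection URL is just a placeholder, so swap in whatever database you’re actually testing against.

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# Placeholder engine; the original experiment reused its own existing setup.
engine = create_engine('sqlite:///experiment.db')
Base.metadata.create_all(engine)  # create the three tables defined above
db_session = sessionmaker(bind=engine)()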

For the first experiment, comparing the write and query speed of UUIDs vs. integers:

import datetime
import random
import uuid

rows_to_create = 50000  # matches the 50,000-record figure in the results above
table1_indexes = []
table2_indexes = []
with open("output.txt", "w+") as f:
    f.write("STARTING UUID TEST")
    f.write("\n")
    start_time = datetime.datetime.utcnow()
    for _ in xrange(rows_to_create):
        row = UUIDCompositeIndexTable(
            col1=str(uuid.uuid4()),
            col2=str(uuid.uuid4()),
            col3=str(uuid.uuid4()),
            col4=str(uuid.uuid4()),
        )
        print "Adding %s" % _
        db_session.add(row)
        db_session.commit()
        table1_indexes.append((row.col1, row.col2, row.col3, row.col4))
    end_time = datetime.datetime.utcnow()
    f.write("Finished in %s seconds" % (end_time - start_time).total_seconds())
    f.write("\n")

    f.write("STARTING INTEGER TEST")
    f.write("\n")
    start_time = datetime.datetime.utcnow()
    for _ in xrange(rows_to_create):
        row = IntegerCompositeIndexTable(
            col1=random.randint(0, 10000),
            col2=random.randint(0, 10000),
            col3=random.randint(0, 10000),
            col4=random.randint(0, 10000),
        )
        print "Adding %s" % _
        db_session.add(row)
        db_session.commit()
        table2_indexes.append((row.col1, row.col2, row.col3, row.col4))
    end_time = datetime.datetime.utcnow()
    f.write("Finished in %s seconds" % (end_time - start_time).total_seconds())
    f.write("\n")

    print "QUERYING"
    f.write("Starting query of UUID PK")
    f.write("\n")
    start_time = datetime.datetime.utcnow()
    for col1, col2, col3, col4 in table1_indexes:
        db_session.query(UUIDCompositeIndexTable).filter_by(
            col1=col1,
            col2=col2,
            col3=col3,
            col4=col4,
        ).all()
    end_time = datetime.datetime.utcnow()
    f.write("Finished in %s seconds" % (end_time - start_time).total_seconds())
    f.write("\n")

    f.write("Starting query of IntegerCompositeIndexTable")
    f.write("\n")
    start_time = datetime.datetime.utcnow()
    for col1, col2, col3, col4 in table2_indexes:
        db_session.query(IntegerCompositeIndexTable).filter_by(
            col1=col1,
            col2=col2,
            col3=col3,
            col4=col4,
        ).all()
    end_time = datetime.datetime.utcnow()
    f.write("Finished in %s seconds" % (end_time - start_time).total_seconds())
    f.write("\n")

And for the second experiment:

# Same imports and db_session setup as the first experiment.
table1_indexes = []
table2_indexes = []

rows_to_create = 10000
with open("output.txt", "w+") as f:
    f.write("STARTING PK TEST")
    f.write("\n")
    start_time = datetime.datetime.utcnow()
    for _ in xrange(rows_to_create):
        row = UUIDCompositeIndexTable(
            col1=str(uuid.uuid4()),
            col2=str(uuid.uuid4()),
            col3=str(uuid.uuid4()),
            col4=str(uuid.uuid4()),
        )
        print "Adding %s" % _
        db_session.add(row)
        db_session.commit()
        row.col2 = str(uuid.uuid4())
        db_session.add(row)
        db_session.commit()
        row.col3 = str(uuid.uuid4())
        db_session.add(row)
        db_session.commit()
        table1_indexes.append((row.col1, row.col2, row.col3, row.col4))
    end_time = datetime.datetime.utcnow()
    f.write("Finished in %s seconds" % (end_time - start_time).total_seconds())
    f.write("\n")

    f.write("STARTING INDEX TEST")
    f.write("\n")
    start_time = datetime.datetime.utcnow()
    for _ in xrange(rows_to_create):
        row = UUIDShortPkCompositeIndexTable(
            col1=str(uuid.uuid4()),
            col2=str(uuid.uuid4()),
            col3=str(uuid.uuid4()),
            col4=str(uuid.uuid4()),
        )
        print "Adding %s" % _
        db_session.add(row)
        db_session.commit()
        row.col2 = str(uuid.uuid4())
        db_session.add(row)
        db_session.commit()
        row.col3 = str(uuid.uuid4())
        db_session.add(row)
        db_session.commit()
        table2_indexes.append((row.col1, row.col2, row.col3, row.col4))
    end_time = datetime.datetime.utcnow()
    f.write("Finished in %s seconds" % (end_time - start_time).total_seconds())
    f.write("\n")

    print "QUERYING"
    f.write("Starting query of UUID PK")
    f.write("\n")
    start_time = datetime.datetime.utcnow()
    for col1, col2, col3, col4 in table1_indexes:
        db_session.query(UUIDCompositeIndexTable).filter_by(
            col1=col1,
            col2=col2,
            col3=col3,
            col4=col4,
        ).all()
    end_time = datetime.datetime.utcnow()
    f.write("Finished in %s seconds" % (end_time - start_time).total_seconds())
    f.write("\n")

    f.write("Starting query of UUID Index")
    f.write("\n")
    start_time = datetime.datetime.utcnow()
    for col1, col2, col3, col4 in table2_indexes:
        db_session.query(UUIDShortPkCompositeIndexTable).filter_by(
            col1=col1,
            col2=col2,
            col3=col3,
            col4=col4,
        ).all()
    end_time = datetime.datetime.utcnow()
    f.write("Finished in %s seconds" % (end_time - start_time).total_seconds())
    f.write("\n")
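
One extra sanity check that wasn’t part of the timed runs above, but that I’d suggest if you reproduce this: ask the database for its query plan and confirm the composite index on the second table is actually being used rather than a full scan. The snippet below assumes SQLite; on Postgres or MySQL the statement is plain EXPLAIN.

from sqlalchemy import text

# Not part of the timed runs; assumes SQLite ("EXPLAIN QUERY PLAN").
sample_col1, sample_col2, sample_col3, sample_col4 = table2_indexes[0]
plan = db_session.execute(
    text(
        "EXPLAIN QUERY PLAN "
        "SELECT * FROM uuid_table2 "
        "WHERE col1 = :c1 AND col2 = :c2 AND col3 = :c3 AND col4 = :c4"
    ),
    {"c1": sample_col1, "c2": sample_col2, "c3": sample_col3, "c4": sample_col4},
)
for row in plan:
    print(row)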

Conclusion

There you have it. Don’t ever doubt me.