Representing polymorphic objects in a Database

The Problem

This post is really about type safety in general, something near and dear to my heart since it gets me so amped. While I normally develop in Python, it feels extremely luxurious to program in a language like Swift, where the type system is strict. If we have an object and we know its type and we know that it's not None or null, we're free to act upon it without examining any of its data. This benefit extends to polymorphic entities, where objects are subclasses of a parent class: since all objects inherit from a shared class, all operations supported by the parent class are supported by each object.

Now, in some cases, this concept makes sense to apply to models that are saved into some persistence layer (a database). But then things just get weird.

SQLAlchemy, which I've been using at work, has its own method of storing polymorphic entities: writing each child entity to a separate table. Without weighing the merits of that implementation too deeply, it seems overly complex, with too much happening at the persistence layer.

Recently I implemented my own version of polymorphic representation in a database, and I really liked it, so I wanted to go ahead and share…

But First: Representing Types in General

Before jumping ahead: one library I've been using a lot lately is Schematics. In essence, you define a reasonably complex object structure, throw some data into it, and from there you can immediately validate that the input data matches the structure's type definitions. This is useful when reading or writing data against an external API or database. A quick example:

import datetime

from schematics.models import Model
from schematics.types import (
    DateTimeType,
    EmailType,
    IntType,
    StringType,
    UUIDType,
)
from schematics.types.compound import ListType, ModelType


class SomeOtherEntity(Model):
    uuid = UUIDType()
    description = StringType(min_length=1, required=True)


class ArbitraryEntity(Model):
    uuid = UUIDType()
    some_int = IntType()
    child_entities = ListType(ModelType(SomeOtherEntity), required=True)
    created_by_email = EmailType()
    created_at = DateTimeType(default=datetime.datetime.utcnow)

So now, with an object structure defined, we can either create the object from input data, or we can construct an object ourselves and serialize it so that it can be written to a database or returned in an API:

assert isinstance(input_primitive_data, dict)
arbitrary_entity = ArbitraryEntity(input_primitive_data)

arbitrary_entity.validate()  # will throw an exception if data is invalid

serializable_data = arbitrary_entity.to_primitive()
assert serializable_data == input_primitive_data
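For readers without schematics installed, the same validate-then-serialize contract can be sketched with a plain dataclass. This is a library-free analogue, not schematics itself; the class name and validation rule below are illustrative, mirroring the `StringType(min_length=1, required=True)` field above.

```python
from dataclasses import dataclass, asdict


@dataclass
class SomeOtherEntityLite:
    """Library-free analogue of SomeOtherEntity (name is illustrative)."""
    uuid: str
    description: str

    def validate(self):
        # Mirrors StringType(min_length=1, required=True)
        if not isinstance(self.description, str) or len(self.description) < 1:
            raise ValueError("description must be a non-empty string")

    def to_primitive(self):
        # Serialize back to a plain dict, like schematics' to_primitive()
        return asdict(self)


input_primitive_data = {"uuid": "1234", "description": "a duck"}
entity = SomeOtherEntityLite(**input_primitive_data)
entity.validate()  # raises on invalid data
assert entity.to_primitive() == input_primitive_data
```

The round trip is the important part: data in, validation, and the same primitive shape back out.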

Abstraction Layer in Front of Database

If you follow my writings vigorously, which you probably don't, then you know that it's a good idea to add an abstraction layer between your application and whatever persistence layer it uses. Your application shouldn't know which version of SQL you're using, or whether you're using SQL at all, and if you want to change your persistence layer you should only need to make isolated changes. The same concept applies to third-party libraries and interfaces to external APIs. So the strict types above can be used as return types in these abstraction layers.

The point is simply: whatever module interfaces with services not directly tied to your application should return these objects or something like them. They are pure objects that know nothing about anything else and can be serialized into primitive types, which are universally understood.
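One way to sketch such an abstraction layer (everything here is hypothetical, including the class and method names): a repository whose public methods accept and return only primitive data, so the storage client behind it can be swapped without touching the application.

```python
class EntityRepository:
    """Hypothetical abstraction layer: the application sees only
    primitive dicts, never the storage client behind this class."""

    def __init__(self, storage_client):
        # storage_client could be a SQL wrapper, a document store,
        # or (as in this sketch) a plain in-memory dict
        self._storage = storage_client

    def save(self, entity_uuid, primitive_dict):
        # With a schematics model you would call to_primitive() first;
        # this sketch assumes the caller already passes primitives
        self._storage[entity_uuid] = dict(primitive_dict)

    def get(self, entity_uuid):
        # Copy on the way out so callers can't mutate stored state
        return dict(self._storage[entity_uuid])


repo = EntityRepository(storage_client={})
repo.save("abc-123", {"uuid": "abc-123", "description": "a duck"})
assert repo.get("abc-123")["description"] == "a duck"
```

Swapping the in-memory dict for a real database client changes only this class, which is the whole point.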

Polymorphism in the Database

Now, to expand upon the previous ideas with a basic modeling concept: if an object looks like a duck, quacks like a duck, and does everything else like a duck, it's a duck. The flip side is that two distinct types never have exactly the same shape, which means a type can be inferred from its data. So it's possible to create a single table whose columns are the union of the parent class's and all of its subclasses' attributes. If each subclass has a distinct set of non-nullable fields, then given a database row we can determine its type from which fields are populated, or throw an exception if the row doesn't conform to any one type.
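To make the union-of-columns idea concrete, here is a minimal library-free sketch; the type names and required-column sets are hypothetical stand-ins for the real subclasses. Each "row" carries the union of all subtypes' columns, and the set of populated columns determines the type.

```python
# Hypothetical subtypes: each requires a distinct set of columns.
# The table's columns are the union of all of these sets.
REQUIRED_BY_TYPE = {
    "TypeA": {"uuid", "description", "arbitrary_attribute"},
    "TypeB": {"uuid", "description", "arbitrary_int_array"},
}


def infer_type(row):
    """Return the unique type whose required columns are exactly
    the populated (non-None) columns of the row."""
    populated = {col for col, val in row.items() if val is not None}
    matches = [name for name, req in REQUIRED_BY_TYPE.items() if req == populated]
    if len(matches) != 1:
        raise ValueError("row does not match exactly one type")
    return matches[0]


row = {"uuid": "1", "description": "d",
       "arbitrary_attribute": 7, "arbitrary_int_array": None}
assert infer_type(row) == "TypeA"
```

The requirement that each subclass's non-nullable fields form a distinct set is what makes `matches` contain exactly one entry for any valid row.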

Now, combine everything I've said and accept it as gospel. We can express those concepts in the code below, adding a mapper class that converts a row retrieved from the database into a particular type. The type is inferred by returning the first entity that correctly validates against the input row (no more than one entity type should validate for any given row). I should point out that this inference is kind of slow, but at least the slow operations happen in memory and are comparatively much faster than a database query.

from schematics.exceptions import ModelConversionError, ModelValidationError


def get_subclasses(cls):
    """ Returns all direct and indirect subclasses of an input class. """
    subclasses = list(cls.__subclasses__())
    # Recurse over a snapshot of the direct subclasses; extending the
    # list we're iterating over would revisit descendants and produce
    # duplicates for hierarchies more than two levels deep.
    for subclass in cls.__subclasses__():
        subclasses.extend(get_subclasses(subclass))
    return subclasses


class _IncompleteParentEntity(Model):
    uuid = UUIDType(required=True)
    description = StringType(min_length=1, required=True)
    created_at = DateTimeType(default=datetime.datetime.utcnow)


class LobbdawgEntity(object):

    def __init__(self):
        raise Exception("Don't instantiate this class")

    @staticmethod
    def inferred_from_data(primitive_dict):
        for sub_cls in get_subclasses(_IncompleteParentEntity):
            try:
                instance = sub_cls(primitive_dict)
                instance.validate()
                break
            except (ModelConversionError, ModelValidationError):
                continue
        else:
            # The loop never hit `break`, meaning no subclass validated
            raise ModelValidationError("No entity type matches the input data")
        return instance


class LobbdawgEntity1(_IncompleteParentEntity):
    arbitrary_attribute = IntType(required=True)


class LobbdawgEntity2(_IncompleteParentEntity):
    arbitrary_int_array = ListType(IntType, required=True)


class DbRowToEntityMapper(object):

    attrs = (
        'created_at',
        'description',
        'uuid',
        'arbitrary_attribute',
        'arbitrary_int_array',
    )

    @classmethod
    def to_entity(cls, row):
        primitive_dict = {}
        for attr in cls.attrs:
            value = getattr(row, attr, None)
            # Explicit None check so falsy-but-valid values like 0 or [] survive
            if value is not None:
                primitive_dict[attr] = value
        return LobbdawgEntity.inferred_from_data(primitive_dict)

Some Moar Thoughts on Inheritance

Somewhat tangential to the idea of representing multiple subclasses in a database, a few other thoughts:

Prefixing a parent class's name with "Base" is stupid naming. This was one of my takeaways from Jeremiah "Boardgame Bruce" Lee. If you have a base class that's not abstract, there's nothing that inherently makes it "base," because any class can be inherited from (one of the beauties of object-oriented programming). If a class is abstract, however, that actually means something: it denotes that a child class will work as intended if it implements the abstract methods and properties. This is similar to an interface or a protocol, but can additionally provide default behavior.

In this case, schematics uses metaclasses, so without some rigorous work I can't actually make the parent class abstract. I find it expressive to name it "Incomplete" and use a leading underscore to denote that it's not meant to be a public-facing class.