I can already hear the trolls coming for me. Why on earth would I write a purely functional data structure in quite possibly the slowest language for the task? It all came about because of a real world problem: Someone looked me in the eye and told me I wasn’t crazy enough to write a purely functional red black tree in Python.
Now did this actually happen? No. But it could happen any minute. It’s like getting lottery winner’s insurance for employers. It might not happen, but when one of your best and most productive employees quits your company as a result of winning the lottery, you’ll laugh and say to yourself: “Man, I’m really glad I got that lottery winner’s insurance.”
Back at the Academy, one of my professors was Dr. Chris “I’m a Celebrity in the Functional Programming Game but I’m so Humble so I won’t Mention It Or Wear Tie-Dye Shirts” Okasaki. Just recently I revisited his book Purely Functional Data Structures, but when it was first introduced to me 10 years ago, I didn’t see the practical benefits of his ideas. In the same way, sitting in a college classroom and working through mind-bendy recursive problems on a blackboard day after day felt more like a programmer’s playground than practical exercises that you’d find yourself using out on the real world streets of white collar San Francisco, struggling to survive.
But after a few years on the streets of the developer game, where talk about scalability is whispered on every corner, those wacky ideas about a complete absence of assignment statements suddenly become plausibly good ones.
Aside from being ready to put that random passerby in the hallway in their place when they try to call me out, functional programming concepts have become increasingly appealing to me. Functional programming can be defined in many different ways, but you don’t necessarily need a purely functional language to implement a purely or predominantly functional program. In particular, the ideas of referential transparency and immutability are simple but powerful. Referential transparency means that a function called with the same parameters will evaluate to the same result in any context. Stateless functions have no side effects, and therefore the possibility of a bug is diminished significantly. Couple that with consistent typing, minimal parameters, and a solid test suite, and pretty soon your coworkers will be calling you “The Exterminator” because your code will have no bugs.
Immutable objects are completely thread safe and open the possibility of completely parallelizing a program. Taken together with referentially transparent functions, caching becomes a trivially simple problem.
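To make the caching point concrete, here is a minimal sketch using only the standard library. The `Point` class and `distance_squared` function are hypothetical names invented for illustration; the only assumption is that the function is pure and its arguments are immutable (and therefore hashable).

```python
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)  # frozen=True makes instances immutable and hashable
class Point:
    x: float
    y: float


@lru_cache(maxsize=None)
def distance_squared(a: Point, b: Point) -> float:
    # Pure function: the result depends only on the arguments,
    # so a cached answer can never go stale.
    return (a.x - b.x) ** 2 + (a.y - b.y) ** 2


origin = Point(0.0, 0.0)
corner = Point(3.0, 4.0)

first = distance_squared(origin, corner)   # computed
second = distance_squared(origin, corner)  # served from the cache
cache_hits = distance_squared.cache_info().hits
```

Because a `Point` can never change after construction, there is no invalidation logic to write: the cache key fully determines the answer.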
So now imagine a real world context: You have a few dozen Celery workers that all share access to a Redis server. If you implemented your own data structure made up of nodes, where each node could be fetched individually by a distinct ID and pointers to other nodes simply became the IDs of those nodes (each corresponding to a Redis key), you could create a distributed data structure in Redis. And if you have a proxy in front of Redis where your key space is sharded, you can have a single data structure spread out in memory across multiple machines.
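Here is a rough sketch of the node-per-key idea. To keep it runnable without a server, a hypothetical `DictStore` class stands in for Redis; with the redis-py client, `get` and `set` on a `redis.Redis` instance would play the same role. The key format, the JSON layout, and the function names are all assumptions made for illustration, and the tree here is a plain unbalanced binary search tree rather than a red-black tree.

```python
import json
import uuid


class DictStore:
    """In-memory stand-in for a key-value server such as Redis."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def put_node(store, value, left_id=None, right_id=None):
    # Every node gets a fresh key; child "pointers" are just other keys.
    node_id = "node:" + uuid.uuid4().hex
    store.set(node_id, json.dumps({"value": value, "left": left_id, "right": right_id}))
    return node_id


def get_node(store, node_id):
    return json.loads(store.get(node_id))


def store_insert(store, root_id, value):
    """Return the ID of a new root; existing nodes are never overwritten."""
    if root_id is None:
        return put_node(store, value)
    node = get_node(store, root_id)
    if value < node["value"]:
        return put_node(store, node["value"],
                        store_insert(store, node["left"], value), node["right"])
    if value > node["value"]:
        return put_node(store, node["value"],
                        node["left"], store_insert(store, node["right"], value))
    return root_id  # value already present, reuse the whole subtree


def in_order(store, root_id):
    if root_id is None:
        return []
    node = get_node(store, root_id)
    return in_order(store, node["left"]) + [node["value"]] + in_order(store, node["right"])


store = DictStore()
old_root = None
for v in [5, 3, 8]:
    old_root = store_insert(store, old_root, v)
new_root = store_insert(store, old_root, 4)
```

Both `old_root` and `new_root` remain readable afterwards: the old ID still describes the tree without 4, and the new ID describes the tree with it.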
Now take it one step further: you have caching logic that backfills entities from Redis into a least recently used cache in memory. Now a copy of the data structure is maintained in memory across multiple machines, and sections of the data structure are updated from Redis as needed.
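A read-through cache of that kind might look something like the sketch below. The `ReadThroughLRU` class is a hypothetical helper, and `fetch` is assumed to be whatever function loads a node from the backing store (a plain dict lookup here, a Redis read in the scenario above). Since a node never changes once written, a cached copy does not need invalidation, only eviction when the cache is full.

```python
from collections import OrderedDict


class ReadThroughLRU:
    """Small least-recently-used cache that backfills from `fetch` on a miss."""

    def __init__(self, fetch, capacity=1024):
        self._fetch = fetch
        self._capacity = capacity
        self._entries = OrderedDict()
        self.misses = 0

    def get(self, key):
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        self.misses += 1
        value = self._fetch(key)            # backfill from the backing store
        self._entries[key] = value
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return value


# A dict plays the part of the remote store in this sketch.
backing = {"node:1": "a", "node:2": "b", "node:3": "c"}
cache = ReadThroughLRU(backing.__getitem__, capacity=2)

cache.get("node:1")  # miss
cache.get("node:2")  # miss
cache.get("node:1")  # hit
cache.get("node:3")  # miss, evicts node:2
```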
Since the data structure is immutable, locks aren’t necessary. At any given time, it would be guaranteed that at least one process was able to make progress in updating the data structure with no risk of deadlock.
Now before the trolls start coming for me, I should point out that the above idea comes with many drawbacks. Since Redis is not fully durable by default, it’s possible for the entire structure to become corrupted at any given time. Also, after we go through some code samples, you might notice that inserts are fairly slow. And furthermore, a purely functional data structure generally relies on a language with garbage collection in order to remove nodes that no longer have anything referencing them. Hence, if the above example was carried through, you would also have to ensure that either a time to live was associated with every node or you manually cleaned up after yourself when a node was no longer referenced. In the former case, your data structure goes from not reliably durable to definitely not durable after a certain amount of time. In the latter case, your interactions with the tree will be slower and no longer purely functional. In all cases, a purely functional data structure will be fairly expensive on the memory front, but the falling cost of memory correlates with the rise in popularity of functional programming.
Purely Functional Red Black Trees
The exercise I went through was to implement a purely functional red black tree (but in a non-purely functional language). A red black tree is a variant of a standard binary search tree that re-balances itself as needed every time you insert or delete a node. Therefore the variance between best and worst case scenarios is minimized, and the worst-case time complexity of inserting and reading is O(log n). There are very informative technical writings about how to maintain a red black tree, but the idea is essentially that the tree is assumed to be balanced, and upon inserting a new node, the new node is marked as red, which flags it as potentially unbalanced. If that leaves a red node with a red parent, the recently inserted node triggers a rebalancing at the local parent, which will propagate up the tree with further rebalancing as necessary.
The purely functional aspect of things means that rather than inserting a new node into the tree, we create immutable copies of existing nodes all the way down the tree to the new node, and all of the unchanged nodes from the operation can still be referenced from the new copy, resulting in no destructive updates of any variable.
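To illustrate the path-copying idea, here is a minimal sketch of an insert-only red black tree in the style of Okasaki's book. It is not the original listing from this post: the `Node` tuple, the `balance` helper, and the function names are my own choices for illustration, and deletion is left out entirely.

```python
from typing import Any, NamedTuple, Optional

RED, BLACK = "red", "black"


class Node(NamedTuple):  # tuples are immutable, so nodes can never be updated in place
    color: str
    left: Optional["Node"]
    value: Any
    right: Optional["Node"]


def is_red(node):
    return node is not None and node.color == RED


def _rebuilt(a, x, b, y, c, z, d):
    # Shared result of all four cases: red root with two black children.
    return Node(RED, Node(BLACK, a, x, b), y, Node(BLACK, c, z, d))


def balance(color, left, value, right):
    """Fix a black node that has a red child with a red child of its own."""
    if color == BLACK:
        if is_red(left) and is_red(left.left):
            ll = left.left
            return _rebuilt(ll.left, ll.value, ll.right, left.value, left.right, value, right)
        if is_red(left) and is_red(left.right):
            lr = left.right
            return _rebuilt(left.left, left.value, lr.left, lr.value, lr.right, value, right)
        if is_red(right) and is_red(right.left):
            rl = right.left
            return _rebuilt(left, value, rl.left, rl.value, rl.right, right.value, right.right)
        if is_red(right) and is_red(right.right):
            rr = right.right
            return _rebuilt(left, value, right.left, right.value, rr.left, rr.value, rr.right)
    return Node(color, left, value, right)


def _ins(node, value):
    if node is None:
        return Node(RED, None, value, None)  # new nodes start out red
    if value < node.value:
        return balance(node.color, _ins(node.left, value), node.value, node.right)
    if value > node.value:
        return balance(node.color, node.left, node.value, _ins(node.right, value))
    return node  # already present: reuse the existing subtree


def insert(root, value):
    """Return a new tree containing value; the old tree is left untouched."""
    return _ins(root, value)._replace(color=BLACK)  # the root is always black


def to_list(node):
    return [] if node is None else to_list(node.left) + [node.value] + to_list(node.right)


t1 = None
for v in [5, 3, 8, 1, 4]:
    t1 = insert(t1, v)
t2 = insert(t1, 7)  # t1 still describes the tree without 7
```

Only the nodes on the path from the root to the insertion point are rebuilt. In this particular run the left subtree of `t2` is the very same object as the left subtree of `t1`, which is the structural sharing that keeps each insert at O(log n) new nodes.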
You’ll notice that assignment statements only exist in the constructor and are otherwise nowhere present. All methods called upon the tree return new copies of the data rather than updates to existing data. Now I’ll break my self-imposed rule of no assignment statements to demonstrate usage:
I’ll try to avoid going into detail about the algorithm because there are lots of other sites and videos that explain in great detail how to implement a red black tree. Hopefully the code sample above can be coupled with some of those great explanations to try and outline a working code sample. Again, the point of writing this in Python was not efficiency, but just to learn how to implement a red black tree in a readable manner.
I took this a step further and put together a sloppier subclass of the above RedBlackTree in the form of a tree meant to be distributed across Redis keys. In order to serialize everything, I used the Schematics Python package to create type safe objects that could be serialized and deserialized at will. Importantly, the class also implements overrides for less than and greater than, along with a primary key that determines which attributes to compare against other elements, in a manner conceptually similar to what SQL engines do upon storing a row.
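The primary key comparison can be sketched roughly as follows. To stay dependency-free, this uses a standard library dataclass in place of Schematics, so it is not the class described above: the `Event` name, its fields, and the `PRIMARY_KEY` attribute are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from functools import total_ordering


@total_ordering  # derives >, <=, >= from __eq__ and __lt__
@dataclass(frozen=True, eq=False)
class Event:
    # Attributes that decide ordering and equality, like a composite SQL key.
    PRIMARY_KEY = ("timestamp", "event_id")

    timestamp: int
    event_id: str
    payload: str = ""

    def _key(self):
        return tuple(getattr(self, name) for name in self.PRIMARY_KEY)

    def __eq__(self, other):
        if not isinstance(other, Event):
            return NotImplemented
        return self._key() == other._key()

    def __lt__(self, other):
        if not isinstance(other, Event):
            return NotImplemented
        return self._key() < other._key()

    def __hash__(self):
        return hash(self._key())

    def to_json(self):
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw):
        return cls(**json.loads(raw))


early = Event(1, "a", "first")
late = Event(2, "a", "second")
restored = Event.from_json(early.to_json())
```

With comparisons defined this way, a tree can order elements by their key alone, and two records with the same key but different payloads count as the same element.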
So you might find this sort of idea practical in the gray area between the throw-away data structures in memory provided by the standard library of your programming language of choice and the highly optimized data structures that are used in persistence layers. This might make itself useful in a context where we want to quickly and easily examine a large collection of very recent historical data, for example. Or in some case where you need to be able to query Redis.