True Story Follows
Recently I was working on a problem in which I needed to process 2 terabytes of data. This doesn’t sound like all that much to me unless I refer to it as 2,000 gigabytes, or about 1.7 million copies of “Moby Dick”. That’s enough copies of “Moby Dick” for every man, woman, and child in the city of Philadelphia, PA. And it’s also 375 times more copies than were printed during the author’s lifetime.
I was processing the data on a large EC2 instance which had 32 processors.
To process all the data in a reasonable amount of time, I needed to use every processor on the machine. In practical terms, this was the first time parallelizing was an absolute necessity for me. And sure, you’re parallelizing with asynchronous background tasks in web applications, but that’s generally for scaling as your user base grows, not for an individual task that will yield some finite output.
Furthermore, I could go install a task queue and message broker like Celery and RabbitMQ, but using all of a single machine’s processors should be within my immediate grasp without them.
At this point it also made me wish I’d paid better attention and grasped more of the functional programming concepts back in Dr. Chris “Purely Functional Badass” Okasaki’s class. Pure functional code has no shared mutable state and therefore no side effects, so independent computations don’t depend on each other, and that makes spreading them across multiple processors far more natural in functional languages.
But alas, I’m using Python. And I’m not going to learn Haskell just so I can read Moby Dick 1.7 million times.
The problem I was solving had a specific use case, so the code was tied to that case. I broke out the generic parts of it to demonstrate here. So for a simple example, let’s just say I want to execute this really simple Python script multiple times in parallel across multiple processors:
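A minimal stand-in for such a script might look like the sketch below. The filename `worker.py`, the `work` function, and the short sleep are all assumptions for illustration, not the script from the real job; the point is only that it takes an argument, does something slow, and exits.

```python
# worker.py (hypothetical filename): a trivial task to run many times over.
import os
import sys
import time


def work(label, delay=0.1):
    """Pretend to do something expensive, then report which process did it."""
    time.sleep(delay)  # stand-in for the real processing
    return "task %s handled by pid %d" % (label, os.getpid())


if __name__ == "__main__":
    # The first command-line argument tells this process what to work on.
    label = sys.argv[1] if len(sys.argv) > 1 else "none"
    print(work(label))
```

Running `python worker.py 7` from a shell executes one task; the rest of the post is about launching many of these at once.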
In hindsight, this kind of parallelization is really worth the investment for things like video processing, where each task takes minutes or hours to complete.
Anyway, I have two examples. In one, I want to create a pipeline where I parallelize a batch of tasks and then do something else once they all finish. In the other, I’m just mapping a whole bunch of data and need to keep launching new processes as earlier ones finish.
Both examples take advantage of the following function definitions, so you can just read these and then accept them as fact:
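As a rough sketch of what such helpers could look like, the functions below wrap the standard library’s `subprocess.Popen` and `multiprocessing.cpu_count`. The names `launch`, `reap`, and `failed` are hypothetical, chosen for this illustration rather than taken from the original code.

```python
import multiprocessing
import subprocess
import sys


def cpu_count():
    """Number of logical processors visible to Python."""
    return multiprocessing.cpu_count()


def launch(args):
    """Start a child process without waiting for it (args is a list of strings)."""
    return subprocess.Popen(args)


def reap(running):
    """Split a list of processes into (still running, finished)."""
    alive, done = [], []
    for proc in running:
        # poll() returns None while the child is still running.
        if proc.poll() is None:
            alive.append(proc)
        else:
            done.append(proc)
    return alive, done


def failed(finished):
    """Return the finished processes that exited with a non-zero status."""
    return [proc for proc in finished if proc.returncode != 0]
```

The key design choice is that `launch` never blocks: the parent stays free to start more children and only checks on them with `poll()`.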
In the real-world case, I needed to process about 900 files. So rather than block until 32 processors had each finished their respective task, I wanted to keep throwing whichever processors were available at the remaining files. So this code is for that purpose:
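One way to sketch that idea, assuming a simple polling loop is acceptable: keep a list of running children, and whenever one exits, start the next command. The block is self-contained so it can be read on its own. `run_throttled` and `worker_commands` are hypothetical names, and the inline `-c` snippet stands in for a real worker script.

```python
import multiprocessing
import subprocess
import sys
import time


def run_throttled(commands, max_procs=None, poll_interval=0.05):
    """Run every command, keeping at most max_procs children alive at once.

    commands is any iterable (a generator works) of argument lists.
    Returns the exit codes in the order the children were seen to finish.
    """
    max_procs = max_procs or multiprocessing.cpu_count()
    running, exit_codes = [], []
    pending = iter(commands)
    exhausted = False
    while running or not exhausted:
        # Top up: start new children while there is a free slot.
        while not exhausted and len(running) < max_procs:
            try:
                running.append(subprocess.Popen(next(pending)))
            except StopIteration:
                exhausted = True
        # Collect children that have finished; keep the rest.
        still_running = []
        for proc in running:
            if proc.poll() is None:
                still_running.append(proc)
            else:
                exit_codes.append(proc.returncode)
        running = still_running
        if running:
            time.sleep(poll_interval)  # avoid spinning while children work
    return exit_codes


def worker_commands(n):
    """Hypothetical generator: one command per task, counting from 1 to n."""
    for i in range(1, n + 1):
        yield [sys.executable, "-c", "import sys; print(sys.argv[1])", str(i)]


if __name__ == "__main__":
    codes = run_throttled(worker_commands(8), max_procs=4)
    print("%d tasks finished, %d failed" % (len(codes), sum(c != 0 for c in codes)))
```

Because the commands come from a generator, nothing is built up front: a list of 900 filenames or a counter to 1000 costs the same, and a slow task never holds up the other slots.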
In the second step of the process, something akin to a fold or reduce, I’m taking the data that was previously mapped and merging the results. In this case, I wanted to take hundreds of output files, have each processor load two distinct files and merge them, then repeat the process on the merged output. Mind you, this was still dozens of gigabytes of data.
So in this case, I need a pipeline because the data is changing as I’m working with it. Once two files are merged, there’s a new file.
So in this code sample, I can just create a list of tasks and wait until they complete, then repeat with the new data:
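A hedged sketch of that batch-then-repeat pattern follows. The merge step here is an assumption: it simply concatenates two text files via an inline `-c` snippet, where the real job would run its own merge logic. `run_batch`, `merge_all`, and the `round%d_%d.txt` naming are hypothetical.

```python
import os
import subprocess
import sys
import tempfile

# Stand-in merge step (an assumption): concatenate two files into a third.
MERGE_SNIPPET = (
    "import sys\n"
    "a, b, out = sys.argv[1:4]\n"
    "with open(out, 'w') as dst:\n"
    "    for path in (a, b):\n"
    "        with open(path) as src:\n"
    "            dst.write(src.read())\n"
)


def run_batch(commands, max_procs):
    """Run commands in groups of max_procs, blocking until each group exits."""
    codes = []
    for start in range(0, len(commands), max_procs):
        procs = [subprocess.Popen(cmd) for cmd in commands[start:start + max_procs]]
        codes.extend(proc.wait() for proc in procs)
    return codes


def merge_all(paths, workdir, max_procs=4):
    """Merge files pairwise, one round at a time, until a single file is left.

    Expects at least one path; intermediate files are written into workdir.
    """
    round_no = 0
    while len(paths) > 1:
        round_no += 1
        commands, next_paths = [], []
        for i in range(0, len(paths) - 1, 2):
            out = os.path.join(workdir, "round%d_%d.txt" % (round_no, i // 2))
            commands.append([sys.executable, "-c", MERGE_SNIPPET,
                             paths[i], paths[i + 1], out])
            next_paths.append(out)
        if len(paths) % 2:
            next_paths.append(paths[-1])  # the odd file out waits for the next round
        if any(run_batch(commands, max_procs)):
            raise RuntimeError("a merge task failed in round %d" % round_no)
        paths = next_paths  # the outputs of this round are the inputs of the next
    return paths[0]
```

Each round has to finish completely before the next begins, because the next round’s inputs don’t exist until then. That is the reason for blocking here instead of continuously topping up processes.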
In both examples, all that’s happening is that Python is launching a new process with certain arguments. In this case, the arguments don’t do anything, but they’re how you could pass instructions to your subprocesses. I chose to supply them with a generator argument (which in my example just counts from 1 to 1000).
In conclusion, I never actually read Moby Dick. Not once.