Getting Started with DynamoDB (NoSQL) in Python with Boto

Way Too Long; Did not Read

DynamoDB is a cloud key-value store hosted by Amazon that’s an option for big data problems. Code samples in Python with Boto are shown below.

True Story Follows

So I’m on the grind with Workout Generator, and things are going fairly swell. Sort of. People are using the site, and more people continue to sign up, but as this happens, I’m increasingly accountable for things that aren’t perfect. This is also still a work in progress to a degree, and in particular, I noticed that some of the workouts being produced kind of…sucked. Like this one:

[Screenshot: one of the bad generated workouts]

So anyway, the point is that the site is free of bugs in the sense that I could run the program millions of times and it would never barf, but it is problematic in that there’s fuzzy logic in there that’s meant to produce non-deterministic results, and the results in some cases are not that good.

Long story short, I really don’t know what’s going on, and I want to make a logging service that lets me troubleshoot exactly what’s happening with workouts that aren’t good so I can address the problems.

Enter Amazon’s DynamoDB

This post was meant to be a technical one about Dynamo in particular, but I just wanted to provide enough context so that I could:

  • Make the post marginally more interesting
  • Provide some possible reasons for using Dynamo
  • Eventually make a leap to connect the dots to the hit 1987 film “Running Man” starring Arnold Schwarzenegger and featuring the late Erland van Lidth as Dynamo

So in my case, I want to make a logging service, and to do that I’ll need to store the data somehow. DynamoDB is a good fit for this case. You can find it by logging into the Amazon Console here:
[Screenshot: finding DynamoDB in the Amazon Console]

What is DynamoDB?

In my own words, DynamoDB is a NoSQL key-value cloud-hosted storage engine that charges based on throughput and has a pretty good free tier. It’s an alternative to the open-source HBase, which was itself spawned in reaction to Google’s BigTable.

The History of DynamoDB

Check out the short video below of Amazon CTO Werner Vogels announcing DynamoDB back in 2012:

As a small tangent, you can also see some short videos below that show California’s governor visiting Amazon’s San Francisco office and talking to an audience about Dynamo’s exciting sister services, Buzzsaw and Sub-Zero:

Why a Key-Value Store?

Just the other day on the internets, someone shared a REALLY interesting article that I’ll probably continue to reference about Disambiguating Databases. In it, the author makes the following observation about key-value stores:

“Key-value stores have been around since the beginning of persistent storage. They are used when the complexity and overhead of relational systems are not required. Because of their simplicity, efficient storage models, and low runtime overhead, they can usually manage orders of magnitude more operations per second than relational databases. Lately, they are being used as event-log collectors”

For me, I’d never really set up my own key-value store from scratch, making a logging service sounded fun, and it would be a really useful tool for my own projects later on down the line.

Setting up a Table

If you’re reading this post for the technical aspect even more than the entertainment aspect, chances are that all that really matters is getting started. You can either create tables programmatically or you can just use Amazon’s interface to create a table.

For this simple setup, I chose the latter:

[Screenshot: creating the table in Amazon’s console]

For this service, there’s no need to define indexes or attributes or columns. You just need to define the primary key, and you can throw whatever you want as a value into the corresponding row. You’re actually passing bytes, but in doing so you can represent anything.

The general use case here is that you’re going to scan a bunch of rows, so you’ll probably want to define your key such that it could be captured in some sort of range. A common use case is probably a user ID and a timestamp (so capture all data for a user during a certain time period). This is in fact the use case that I’m trying to encapsulate.

There are a few more steps to click through, but simply setting up the key is all you need to do.

[Screenshot: the remaining table-creation steps]
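To make the hash/range key idea concrete, here’s a tiny pure-Python sketch (no AWS involved) of the kind of scan a (user_id, timestamp) key pair buys you — `scan_user_range` and `fake_rows` are made-up names for illustration only:

```python
def scan_user_range(rows, user_id, start_ms, end_ms):
    """Return rows for one user whose timestamp falls in [start_ms, end_ms].

    Mimics what a hash-key equality plus range-key BETWEEN query gives you.
    """
    return [
        row for row in rows
        if row["user_id"] == user_id and start_ms <= row["timestamp"] <= end_ms
    ]


fake_rows = [
    {"user_id": 11, "timestamp": 1000, "quest": "seek_holy_grail"},
    {"user_id": 11, "timestamp": 2000, "quest": "seek_holy_grail"},
    {"user_id": 12, "timestamp": 1500, "quest": "other"},
]

print(scan_user_range(fake_rows, user_id=11, start_ms=900, end_ms=1500))
```

Dynamo does this filtering server-side against the sorted range key, which is why picking a key that captures your access pattern up front matters.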

The Code

In the sample code, I didn’t actually create a logging service. I’ll probably do a follow-on post that outlines that specific case in more detail, but the code samples below should be enough to get you started using Dynamo.

To connect to a Dynamo table that you can subsequently write to or read from:

import os

from boto import dynamodb2
from boto.dynamodb2.table import Table


TABLE_NAME = "lobbdawg_test"
REGION = "us-east-1"

conn = dynamodb2.connect_to_region(
    REGION,
    aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
    aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'],
)
table = Table(
    TABLE_NAME,
    connection=conn
)

Then to write to that table:

import datetime
import time


def datetime_to_timestamp_ms(dt):
    # convert a datetime to an integer Unix timestamp in milliseconds
    # (this is what I use as the table's range key)
    epoch = datetime.datetime(1970, 1, 1)
    return int((dt - epoch).total_seconds() * 1000)


absolute_junk = {
    "favorite_color": "blue",
    "quest": "seek_holy_grail",
}


with table.batch_write() as table_batch:
    for example_counter in xrange(10):

        required_hash_data = {
            "user_id": 11,
            "timestamp": datetime_to_timestamp_ms(datetime.datetime.utcnow()),
        }

        # merge the junk payload with the table's required key attributes
        final_dynamo_data = dict(absolute_junk, **required_hash_data)
        table_batch.put_item(data=final_dynamo_data)
        time.sleep(0.001)  # in this example time is used to uniquely identify a key

A few things to note:

  • Don’t actually ship the time.sleep call in real code; I added it only to guarantee unique keys in this example
  • If you write to an existing key, you’ll get an exception. You instead need to fetch the row by its key and then update it (not shown in my example)
  • “absolute_junk” can be populated with any serializable dictionary data, but the final dictionary passed to AWS must contain your table’s keys (note how “user_id” and “timestamp” map to the table I created in Amazon’s console)
  • You don’t have to use a context manager with the “as table_batch,” but batching writes and reads is pretty much the whole reason for using a key-value store
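Since a missing key attribute only blows up at write time, a tiny guard can catch the problem earlier. This is a hypothetical helper of my own, not part of boto — `REQUIRED_KEYS` mirrors the hash and range keys of the table I created:

```python
REQUIRED_KEYS = ("user_id", "timestamp")  # my table's hash and range keys


def validate_item(item, required_keys=REQUIRED_KEYS):
    """Raise ValueError if the item is missing any of the table's key attributes."""
    missing = [key for key in required_keys if key not in item]
    if missing:
        raise ValueError("item is missing required key(s): %s" % ", ".join(missing))
    return item


validate_item({"user_id": 11, "timestamp": 123, "favorite_color": "blue"})  # passes
```

Calling this right before put_item turns a confusing AWS-side error into an immediate, readable one.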

Then to read from the table by scanning some rows you can do something like:

results = table.query_2(
    user_id__eq=11,
    timestamp__between=[
        datetime_to_timestamp_ms(datetime.datetime.utcnow() - datetime.timedelta(days=7)),
        datetime_to_timestamp_ms(datetime.datetime.utcnow())
    ]
)
for dynamo_item in results:
    print dict(dynamo_item.items())

And in my case the output becomes:

{u'timestamp': Decimal('1426714155355'), u'favorite_color': u'blue', u'quest': u'seek_holy_grail', u'user_id': Decimal('11')}
{u'timestamp': Decimal('1426714155356'), u'favorite_color': u'blue', u'quest': u'seek_holy_grail', u'user_id': Decimal('11')}
{u'timestamp': Decimal('1426714155357'), u'favorite_color': u'blue', u'quest': u'seek_holy_grail', u'user_id': Decimal('11')}
{u'timestamp': Decimal('1426714155358'), u'favorite_color': u'blue', u'quest': u'seek_holy_grail', u'user_id': Decimal('11')}
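Note that Dynamo hands numbers back as Decimals. If you want plain ints, and a real datetime out of the millisecond timestamp, a small post-processing step works — `normalize_item` is a made-up helper name of mine, not a boto function:

```python
import datetime
from decimal import Decimal


def normalize_item(item):
    """Convert Decimal values to int and expand the ms timestamp to a datetime."""
    out = {}
    for key, value in item.items():
        out[key] = int(value) if isinstance(value, Decimal) else value
    out["datetime"] = datetime.datetime.utcfromtimestamp(out["timestamp"] / 1000.0)
    return out


row = {u"timestamp": Decimal("1426714155355"), u"user_id": Decimal("11"), u"quest": u"seek_holy_grail"}
print(normalize_item(row))
```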

What Happens Next
