Getting Started with Amazon CloudSearch (NoSQL) in Python

Too Long; Didn’t Read

Check out the Python code samples below to interface with Amazon CloudSearch. They use the Python boto library.

What is CloudSearch?

CloudSearch is a service hosted by Amazon that allows you to index and search documents. Like a lot of other Amazon services, you pay only for what you use, so you can scale easily, and the costs are about as low as you could ask for. You could use it in conjunction with a database as a caching layer for faster searching, or in some cases there’s nothing stopping you from using it as your storage engine entirely.

The most fitting use case is probably full-text search. For that workload, a SQL database is not optimal for substring searches, and it definitely doesn’t support stemming (i.e., a search for “runner” will also match documents containing “running”).

Why CloudSearch?

Right now at work, one of the projects is to move off of our own SOLR servers and instead use Amazon CloudSearch. The advantages:

  • We don’t have to manage our own SOLR instances. This might be relatively trivial, but it’s still something.
  • It’s faster than SOLR. In one instance it was allegedly 50x faster.
  • In order to interface with SOLR, we’re using Haystack, which we’d like to move away from.
  • After doing the implementation, I found that the learning curve for CloudSearch is really low.

Initial Setup

Log in to your Amazon Console (sign up if you haven’t; it’s free). Then navigate to CloudSearch:

[Screenshot: the CloudSearch service in the AWS Console]

The next steps are fairly self-explanatory, and you can just follow the wizard. You’ll create a new domain identified by a unique string, which you’ll later use in your Python code along with whatever Amazon region you chose:

[Screenshot: creating a new CloudSearch domain]

Any number of attributes can be added to a document. You can see some examples below:

[Screenshot: example index fields for a domain]

The Code

You could make raw HTTP requests, but you can save yourself a lot of trouble by just installing boto:

pip install boto==2.35.1

From here, we start making some web requests to initialize a client. This is costly because of the nature of network requests, and we also want to avoid getting throttled by Amazon for excessive, unnecessary requests. Therefore, you want to cache your initialized domain client.

This can be done with something like memcached in order to share an instance across multiple processes, but the poor man’s method is to just cache the domain instance in a mutable object stored as a class attribute. That way, each process or worker you run will initialize the client exactly once, and the client will live in memory from then on.

Since a CloudSearch domain is used for both querying and indexing, my choice was to create a base class to inherit from. To ensure each class had only one responsibility, I made a separate class for each of those two cases, and both of them inherit from the class below:

Base Amazon Client

import boto
from django.conf import settings


class AmazonClient(object):
    REGION = settings.AWS_CLOUDSEARCH_REGION

    # Mutable class attribute shared across all instances (and subclasses),
    # so each process pays the domain lookup cost exactly once
    _cls_domain_cache = {}

    def get_domain(self, domain_index):
        try:
            return self._cls_domain_cache[domain_index]
        except KeyError:
            # Both connecting and looking up the domain involve HTTP
            # requests, which is why the resulting object is cached
            self._cls_domain_cache[domain_index] = boto.connect_cloudsearch2(
                region=self.REGION,
                sign_request=True).lookup(domain_index)
            return self._cls_domain_cache[domain_index]

The above class has the sole responsibility of caching a domain instance. The call that connects to CloudSearch involves an HTTP request, which is exactly what we want to avoid repeating.

The next step is to index our documents. I wrote a simple context-manager class that handles batching requests to Amazon. Context managers are useful when you have a case that always requires setup, some action, and then teardown; in this case, the teardown is the actual POST to Amazon with a batch of data.

Therefore, the usage of this class is either a simple call to “add_document” or “delete_document.”

Indexer

from .amazon_client import AmazonClient
DEFAULT_BATCH_SIZE = 500


class CloudSearchIndexer(AmazonClient):

    def __init__(self, domain_index, batch_size=DEFAULT_BATCH_SIZE):
        self.domain = self.get_domain(domain_index)
        self.document_service_connection = self.domain.get_document_service()
        self.batch_size = batch_size
        self.items_in_batch = 0

    @classmethod
    def for_domain_index(cls, domain_index):
        return cls(domain_index)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # If an exception was raised inside the "with" block, skip the
        # commit and let it propagate (returning a falsy value re-raises)
        if exc_type is None:
            self._commit_to_amazon()

    def _commit_to_amazon(self):
        # Guard against POSTing an empty batch (e.g. when the last add
        # landed exactly on the batch size)
        if self.items_in_batch:
            self.document_service_connection.commit()
            self.document_service_connection.clear_sdf()
            self.items_in_batch = 0

    def add_document(self, cloud_search_document):
        cloud_search_json = cloud_search_document.to_cloud_search_json()
        cloud_search_json = self._nullify_falsy_values(cloud_search_json)
        self.document_service_connection.add(
            cloud_search_document.cloud_search_id,
            cloud_search_json
        )
        self._update_batch()

    def _nullify_falsy_values(self, json_dict):
        return {k: v for k, v in json_dict.items() if v}

    def delete_document(self, cloud_search_document):
        self.document_service_connection.delete(cloud_search_document.cloud_search_id)
        self._update_batch()

    def _update_batch(self):
        self.items_in_batch += 1
        if self.items_in_batch == self.batch_size:
            self._commit_to_amazon()

You can see a few additions beyond just a single POST to Amazon: data is chunked out 500 items at a time, and falsy values are stripped from documents before a POST. Of note, that might not be the best decision. As far as I can tell, you can’t set a field to “null” or “None” with CloudSearch, so you should probably be explicit about how you represent empty data (see the sketch below).
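
For example, if your schema expects empty fields to be present rather than omitted, a variant of the cleaning step could map them to an explicit sentinel instead. This is only a sketch; the sentinel value here is an assumption, and you should use whatever “empty” representation your index schema defines:

EMPTY_TEXT_SENTINEL = ""  # hypothetical stand-in for "no value"


def normalize_falsy_values(json_dict):
    # Keep every key, but replace falsy values with an explicit sentinel
    # instead of silently dropping them from the document
    return {k: (v if v else EMPTY_TEXT_SENTINEL) for k, v in json_dict.items()}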

From here, you might notice that my code is just passing around a “cloud_search_document” that I haven’t defined yet. In reality, the only things you need to pass to Amazon are a unique identifier (a string) and a serialized JSON blob. I made this explicit by creating an abstract cloud search document that all other documents should inherit from, thus guaranteeing they can be indexed:

Abstract Amazon Document

from abc import ABCMeta
from abc import abstractmethod
from abc import abstractproperty


class AbstractCloudSearchDocument(object):

    __metaclass__ = ABCMeta

    @abstractproperty
    def cloud_search_id(self):
        ''' A string that represents a unique identifier; 
           should mimic the primary key of a model '''
        pass

    @abstractmethod
    def to_cloud_search_json(self):
        ''' A JSON representation of the document
            that should match up with the index schema in Amazon '''
        pass
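
For concreteness, here’s a minimal sketch of what an implementation might look like. The model and its fields are hypothetical; your keys should mirror whatever index fields you configured in the Amazon console:

class BlogPostCloudSearchDocument(AbstractCloudSearchDocument):
    ''' Hypothetical document wrapping a "blog post" model instance '''

    def __init__(self, blog_post):
        self.blog_post = blog_post

    @property
    def cloud_search_id(self):
        # Mimics the primary key of the model, as a string
        return str(self.blog_post.pk)

    def to_cloud_search_json(self):
        # Keys must match the index fields set up in the Amazon console
        return {
            'title': self.blog_post.title,
            'body': self.blog_post.body,
            'created_at': self.blog_post.created_at.isoformat(),
        }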

With the above two classes defined, it’s very simple to index documents. Just ensure that the JSON representation of your document corresponds to what you set up in Amazon. Here’s an example:

Sample Usage

with CloudSearchIndexer.for_domain_index("my_domain_index_string") as cloud_search_indexer:

    # ConcreteCloudSearchDocument is some implementation of the abstract cloud
    # search document
    cloud_search_document = ConcreteCloudSearchDocument(some_data)

    cloud_search_indexer.add_document(cloud_search_document)

# because of the context manager, data will be committed to Amazon in a
# batch after the above block
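
Deletion works the same way. Assuming the same hypothetical document class as above:

with CloudSearchIndexer.for_domain_index("my_domain_index_string") as cloud_search_indexer:
    cloud_search_indexer.delete_document(ConcreteCloudSearchDocument(some_data))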

In order to search for documents, you’ll need to write your own queries. Alternatively, you could use something like Haystack, where query generation is abstracted away across multiple backends. The problem with that approach is that Haystack is decent at everything but excels at nothing (sorry, I hope there are no hardcore Haystack fans reading this).

I also found that it’s easier to learn Amazon’s straightforward query language than to learn all the quirks and boilerplate of a third-party library. The class below is a stripped-down version of what I’d use to query a domain:

Searcher / Queryer

from abc import ABCMeta

from .amazon_client import AmazonClient


class AbstractCloudSearchSearcher(AmazonClient):

    __metaclass__ = ABCMeta

    DEFAULT_PARSER = "structured"

    def __init__(self, domain_index):
        self.domain = self.get_domain(domain_index)
        self.search_connection = self.domain.get_search_service()

    def execute_query_string(self, query_string):
        amazon_query = self.search_connection.build_query(
            q=query_string, parser=self.DEFAULT_PARSER)
        # get_all_hits transparently pages through the full result set
        return [hit['fields'] for hit
                in self.search_connection.get_all_hits(amazon_query)]

From here, you just need to pass in query strings, and this class will query Amazon and return results as a list of dictionaries.

You can learn how to write queries from Amazon’s documentation. Note that Amazon’s example queries assume you pass in the “structured” parser, as I did in the sample code (you can see the differences between the parsers here).
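
To make that concrete, here’s a sketch of how the searcher might be used with structured queries. The subclass, domain string, and field names (“title” and “year”) are all assumptions; substitute the fields from your own index schema:

class MovieSearcher(AbstractCloudSearchSearcher):
    ''' Hypothetical searcher for an example "movies" domain '''
    pass


searcher = MovieSearcher("my_domain_index_string")

# Structured query matching a term in a single field
results = searcher.execute_query_string("title:'star'")

# Compound structured query: a term match combined with a year range
results = searcher.execute_query_string(
    "(and title:'star' (range field=year [2000,2010]))")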

The End

  • Troy Grosfield

    I’m going through a very similar thought process (almost identical) to the one you wrote about. However, you mention wanting to move away from django-haystack; why? The ability to plug in different backends is super beneficial when it comes to local development. When developing locally, I don’t want to be calling a boto search backend. I’d rather call a pure Python implementation such as Whoosh [1] so I’m not incurring any expenses. django-haystack creates a nice API that abstracts away the backend preferences and lets me use it seamlessly regardless of search engine preference.

    Why not create a CloudSearch django-haystack backend [2]? A similar solution can be seen here [3]. That way, the only change to your system is the django-haystack search setting, with far fewer code changes.

    Thoughts?

    [1] https://pypi.python.org/pypi/Whoosh/
    [2] http://django-haystack.readthedocs.org/en/latest/creating_new_backends.html
    [3] https://github.com/pbs/haystack-cloudsearch

    • Scott Benedict Lobdell

      Hey Troy,
      Thanks for reading! All of your points regarding Haystack are totally valid. I guess the short answer from my perspective is that Haystack is meant to be generic, and that decision alone incurs trade-offs. So if you know you’re only going to use CloudSearch, you can build for that specific case. From there, you have a choice: do you learn how to use CloudSearch, or do you learn how to use Haystack? Haystack does have its own quirks, but the learning curve is eased by making things Django-like in behavior. But if you find yourself disagreeing with even a small part of how Haystack is implemented, you’re still stuck with it. Moreover, some querying functionality isn’t built in, so you end up having to write your own queries anyway, and at that point the generality is tainted.

      That was my experience, but my intent wasn’t to bash Haystack. Certainly I could address my qualms by opening pull requests, but at the end of the day what it came down to was that the code in question was company related code, and we had the resources to dedicate to maintaining our own CloudSearch client, and we could maintain it as we saw fit.

      • Troy Grosfield

        Makes sense. I’ve gone through those same rounds of thought as well. However, when thinking about just using a simple boto client, I can’t seem to get beyond how to create a development ecosystem that plays nicely from local through the other dev environments all the way to prod.

        The boto solution, as you mentioned, is great for prod and I don’t have any problems with that, but if you follow that route, how do you guys do local development? Surely you’re not still calling boto directly, are you?

        • Scott Benedict Lobdell

          No…for our particular use case, CloudSearch replicated what was in a SQL database, so for tests and local dev I wrote a quick mock class that just mimicked the behavior with SQL. It was much slower, but fine for those two cases.

          Further on, I think the goal is to set up VPNs from local dev so we have IAM rights to Amazon.

          • Scott Benedict Lobdell

            Also, another problem was that Whoosh slowed down our tests a lot, so getting rid of it helped speed things up.