Too Long; Did not Read
Check out the Python code samples below to interface with Amazon CloudSearch. They use the Python boto library.
What is CloudSearch?
CloudSearch is a service hosted by Amazon that allows you to index documents and search them. Like a lot of other Amazon services, you pay only for what you use, so you can scale easily, and the costs are about as low as you could ask for. You could use it alongside a database as a caching layer for faster searching, or in some cases there’s nothing stopping you from using it as your storage engine entirely.
The most fitting use case probably involves full text searching. In this case, a SQL database is not optimal for substring searches, and plain SQL matching doesn’t handle stemming (i.e. making a search for “runner” also return a result containing “running”).
Right now at work, one of the projects is to move off of our own SOLR servers and instead use Amazon CloudSearch. The advantages:
- We don’t have to manage our own SOLR instances. This might be relatively trivial, but it’s something.
- It’s faster than SOLR. In one instance it was allegedly 50x faster.
- In order to interface with SOLR, we’re using Haystack, which we’d like to move away from.
- After doing the implementation, I found that the learning curve for CloudSearch is really low.
Log in to your Amazon Console. Sign up if you haven’t already; it’s free. Then navigate to CloudSearch:
The next steps are fairly self-explanatory, and you can just follow the wizard. You’ll create a new domain that’s identified by a unique string, which you’ll later use in your Python code together with whatever Amazon region you chose:
Any number of attributes can be added to a document. You can see some examples below:
You could make raw HTTP requests, but you can save yourself a lot of trouble if you just install boto:
From here, we start making some web requests in order to initialize a client. This is costly because of the nature of a network request, but we also want to avoid getting throttled by Amazon because of excessive and unnecessary requests. Therefore, you want to cache your initialized domain client.
This can be done with something like memcached in order to share an instance across multiple processes, but the poor man’s method is to just cache your domain instance in a mutable object stored in a class attribute. In this way, each process or worker that you have will initialize the client exactly once, and the client will then live in memory for the life of that process.
My choice was to create a base class to inherit from, since a CloudSearch domain will be used both for querying and indexing. To ensure every class had only one responsibility, I chose to make a class for each of those two cases, and both of those classes would inherit from the class below:
Base Amazon Client
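A minimal sketch of such a base class, not necessarily the original implementation. The class name and the `connect` callable are hypothetical: `connect` stands in for whatever boto call you use to look up your domain, and it is the only part that would touch the network.

```python
class CachedDomainClient(object):
    """Base class that looks up a CloudSearch domain at most once per process."""

    # Mutable class attribute: one dict shared by every instance and subclass
    # living in the same process.
    _domain_cache = {}

    def __init__(self, domain_name, region, connect):
        # `connect` is a hypothetical callable (region, domain_name) -> domain.
        self.domain_name = domain_name
        self.region = region
        self._connect = connect

    @property
    def domain(self):
        key = (self.region, self.domain_name)
        if key not in CachedDomainClient._domain_cache:
            # First access in this process: pay for the network request once.
            CachedDomainClient._domain_cache[key] = self._connect(
                self.region, self.domain_name
            )
        return CachedDomainClient._domain_cache[key]
```

An indexer and a searcher that both subclass this would share the same cached domain object within a process. The sketch does no locking, so heavily threaded workers could still connect more than once on startup.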
The above class has the sole responsibility of caching a domain instance. The line that connects to CloudSearch involves an HTTP request.
The next step is to index our documents. I wrote a simple class that’s a context manager that manages the batching of requests to Amazon. Context managers can be useful when you have a case that always requires setup, some action, and then teardown. In this case, the teardown is the actual POST to Amazon with a batch of data.
Using this class therefore comes down to simple calls to “add_document” or “delete_document.”
You can see a few other additions besides just a single post to Amazon. Data is chunked out into 500 items at a time, and data is cleaned of null values before a POST. Of note, that might not be the absolute best decision. As far as I can tell, you can’t set something to “null” or “None” with CloudSearch, so you should probably be explicit about how you’re representing empty data.
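The batching behavior described above can be sketched roughly as follows. This is an illustration under assumptions, not the original class: the `document_service` object is assumed to expose `add`, `delete`, `commit`, and `clear_sdf` methods (modeled on boto's document service, so check your boto version), and documents are assumed to expose `document_id` and `to_dict()`.

```python
class BatchIndexer(object):
    """Context manager that queues document changes and sends them on exit."""

    CHUNK_SIZE = 500  # documents per batch, as described above

    def __init__(self, document_service, chunk_size=CHUNK_SIZE):
        # `document_service` is assumed to expose add(id, fields), delete(id),
        # commit() and clear_sdf(); verify this against your boto version.
        self._service = document_service
        self._chunk_size = chunk_size
        self._pending = []

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Teardown is the actual upload. Skip it if the block raised, so a
        # half-built batch is never sent (a design assumption of this sketch).
        if exc_type is None:
            self.flush()
        return False

    def add_document(self, document):
        # Drop null values before upload.
        fields = {k: v for k, v in document.to_dict().items() if v is not None}
        self._pending.append(("add", document.document_id, fields))

    def delete_document(self, document):
        self._pending.append(("delete", document.document_id, None))

    def flush(self):
        while self._pending:
            chunk = self._pending[:self._chunk_size]
            self._pending = self._pending[self._chunk_size:]
            for operation, document_id, fields in chunk:
                if operation == "add":
                    self._service.add(document_id, fields)
                else:
                    self._service.delete(document_id)
            self._service.commit()     # one upload per chunk
            self._service.clear_sdf()  # reset the buffer for the next chunk
```

Typical usage would be `with BatchIndexer(service) as indexer:` followed by calls to `indexer.add_document(...)` inside the block; nothing is sent until the block exits cleanly.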
From here, you might notice that my code is just passing a “cloud_search_document” which I haven’t defined so far. In reality, the only thing you need to pass to Amazon is a unique identifier which is a string, and a serialized JSON blob. I made this explicit by creating an abstract cloud search document that all other documents should inherit from, thus guaranteeing they can be indexed:
Abstract Amazon Document
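One way to express that contract, as a sketch. The class and method names here (`AbstractCloudSearchDocument`, `document_id`, `to_dict`, `to_json`) are illustrative assumptions rather than the original code.

```python
import abc
import json


class AbstractCloudSearchDocument(abc.ABC):
    """Anything that can be indexed: a string id plus JSON-serializable fields."""

    @property
    @abc.abstractmethod
    def document_id(self):
        """Unique identifier for this document, as a string."""

    @abc.abstractmethod
    def to_dict(self):
        """Field names mapped to values, matching the domain's index fields."""

    def to_json(self):
        # Serialized form of the fields; sort_keys keeps the output stable.
        return json.dumps(self.to_dict(), sort_keys=True)
```

Because both members are abstract, Python refuses to instantiate a subclass that forgets either one, which is what guarantees that every document can be indexed.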
With the above two classes defined, it’s very simple to index documents. Just ensure that the JSON representation of your document corresponds to what you set up in Amazon. Here’s an example:
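As an illustration, here is a hypothetical concrete document along with a small helper that flags fields the domain does not define. `MovieDocument`, its fields, and `unknown_fields` are all made up for this sketch; the list of index fields would come from whatever you configured in the Amazon console.

```python
class MovieDocument(object):
    """Hypothetical document; in the setup above it would inherit from the
    abstract document class."""

    def __init__(self, movie_id, title, year):
        self.movie_id = movie_id
        self.title = title
        self.year = year

    @property
    def document_id(self):
        return "movie-%s" % self.movie_id

    def to_dict(self):
        return {"title": self.title, "year": self.year}


def unknown_fields(document, index_fields):
    """Return field names the document sends that the domain does not define."""
    return sorted(set(document.to_dict()) - set(index_fields))


movie = MovieDocument(42, "Some Title", 1999)
print(unknown_fields(movie, ["title"]))  # fields missing from the domain setup
```

Running a check like this before indexing is optional, but it can catch a mismatch between your documents and your domain before any request is sent.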
In order to search for documents, you’ll need to write your own queries. For a comparable service, you could use something like Haystack where the generation of queries is abstracted away between multiple backends. The problem with that approach is that Haystack is decent at everything, but excels at nothing (sorry, I hope there are no hardcore Haystack fans reading this).
I also found that it’s easier to learn Amazon’s straightforward query language than it is to learn all the quirks and boilerplate code of a third-party library. The class below is a stripped-down version of what I’d use to query a document:
Searcher / Queryer
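A stripped-down sketch of what such a class could look like. The `search_service` object is an assumption: it is expected to expose a `search` method that accepts `q`, `parser`, `size`, and `start` keyword arguments and returns an iterable of hits, which is how boto's search service is modeled here. Verify the exact signature against your boto version.

```python
class Searcher(object):
    """Runs structured queries and returns hits as plain dictionaries."""

    def __init__(self, search_service):
        # `search_service` is assumed to expose
        # search(q=..., parser=..., size=..., start=...) and to return an
        # iterable of hit mappings.
        self._service = search_service

    def search(self, query, size=10, start=0):
        results = self._service.search(
            q=query, parser="structured", size=size, start=start
        )
        # Copy each hit into a plain dict so callers never hold on to
        # library-specific result objects.
        return [dict(hit) for hit in results]
```

In the layout described earlier, this class would inherit from the caching base class and obtain its search service from the cached domain.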
From here, you would just need to pass in strings, and this class will query Amazon and return results in the form of a list of dictionaries.
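Since queries are plain strings, a small helper can keep them tidy. The sketch below builds an `(and ...)` expression for the structured parser; `quote_term` and `and_query` are hypothetical helpers, and the escaping rule (a backslash before single quotes and backslashes) is my reading of Amazon's structured query syntax, so check it against their documentation.

```python
def quote_term(value):
    """Wrap a string in single quotes, escaping backslashes and single quotes."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return "'%s'" % escaped


def and_query(**field_values):
    """Build an (and ...) expression matching every given field value."""
    parts = []
    for field in sorted(field_values):
        value = field_values[field]
        if isinstance(value, str):
            parts.append("%s:%s" % (field, quote_term(value)))
        else:
            parts.append("%s:%s" % (field, value))  # numbers are left unquoted
    return "(and %s)" % " ".join(parts)


print(and_query(title="star", year=2012))  # (and title:'star' year:2012)
```

The resulting string is what you would hand to the search class.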
You can learn how to write queries from Amazon’s Documentation. Note that all of the example queries in Amazon’s documentation require that you pass in the “structured” parser, as I did in the sample code. (You can see the differences between the parsers here.)