How to Use Mechanical Turk with Python and Boto to Crowdsource Tasks

Too Long; Did Not Read

Skip to the code samples below to see how to use Amazon’s Mechanical Turk to crowdsource small tasks very cheaply. You can automate as much as possible and outsource the remaining human tasks to a large crowd.

Source code is below. Also, a ridiculous sample is shown.

True Story Follows

So, as has been the case with my last few blog posts, I’m on the grind with Exercise Library. There’s a whole back story to it and a number of things where I’ve been using code to automate some of the tasks that need to be done.

Long story short, I re-filmed several hundred exercises, each from two different angles. I need to edit each “scene” which is composed of two shots, and each shot needs to be trimmed of dead space in the beginning and end. Moreover, not every video is framed very well, so I need to crop each video.

However, this is extremely tedious. Aside from the tasks above, I still need to manually load and render each set of videos. The rate at which I was editing was only about 10 videos per hour. For the 700 total exercises I needed to get done, this was going to take me 70 hours, or basically a week straight of just editing videos, assuming I had enough free time to begin with.

I looked online at sites like Fiverr to find people that might be willing to edit videos for really really cheap, but, alas, they basically don’t exist. However, I suddenly had an idea while my mind was wandering as I walked my two dogs, Bonethug and Snuggles.

I bet there was a site that allowed me to crowdsource really small tasks (Ah, this does exist. Amazon Mechanical Turk, among others). For each video, I can divide up all of the actions I need to take on the video into separate tasks: frame the video properly, find the actual start point, find the actual end point. I could pay a few cents for each action, then write a script that will actually execute all of those instructions on all of my videos using OpenCV.
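
As a rough illustration of that plan, here is a minimal sketch of how the three crowdsourced answers for one shot could be stored and turned into something a video script can act on. The names `ShotEdit`, `frame_range`, and `crop_slices`, and the 30 fps figure, are hypothetical placeholders, not part of the actual pipeline, and the OpenCV step itself is left out.

```python
from collections import namedtuple

# Hypothetical record of the three crowdsourced answers for one shot:
# a crop box in pixels plus the real start and end times in seconds.
ShotEdit = namedtuple("ShotEdit", "crop_x crop_y crop_w crop_h start_sec end_sec")


def frame_range(edit, fps):
    """Convert the start/end times into (first_frame, last_frame) indices."""
    if edit.end_sec <= edit.start_sec:
        raise ValueError("end point must come after start point")
    first = int(round(edit.start_sec * fps))
    last = int(round(edit.end_sec * fps))
    return first, last


def crop_slices(edit):
    """Row and column slices that would crop a frame stored as a 2D array."""
    rows = slice(edit.crop_y, edit.crop_y + edit.crop_h)
    cols = slice(edit.crop_x, edit.crop_x + edit.crop_w)
    return rows, cols


# Made-up values for illustration only.
edit = ShotEdit(crop_x=100, crop_y=50, crop_w=1280, crop_h=720,
                start_sec=2.5, end_sec=14.0)
print(frame_range(edit, fps=30))
```

Keeping the answers as plain data like this means the human part (collecting the numbers) and the automated part (applying them) stay independent of each other.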

But Scott, You’re Not Crazy Enough

I am crazy enough.

What is Mechanical Turk?

Mechanical Turk is a crowdsourcing marketplace hosted by Amazon. You can use it to have people take surveys, maybe describe a picture… basically anything that isn’t morally questionable. And by morally questionable, I mean driving traffic to your site or something. Labeling pornographic images is fine. Machine Learning. Read about it.

Without looking at other forums, the site would lead you to believe that you primarily use Amazon’s user interface to create tasks, but if you’re reading this blog, then you, like me, are far more interested in their API and doing as much as possible with programming.

What the Docs Don’t Tell You

Before I go off on wild tangents while explaining code, and to help anyone who stumbled here in the troubleshooting process: Amazon’s documentation is not very thorough, and it took some trial and error to get things working. Some helpful things to know if you’re using an external website to perform tasks:

  • When POSTing back to Amazon to complete a Human Intelligence Task (HIT), you MUST make the request straight from the client. Server side POSTs will be rejected.
  • Amazon’s documentation tells you that you just need to post the “assignmentId”. This is false; you need the “hitId” and the “workerId” as well.
  • Limiting a user to only one task is not a built in feature. You need to add your own logic to enforce that rule.
  • I believe that when you have a group of HITs, all HITs in that group will be locked as long as one person is viewing them. I could be wrong, but if true, that might slow down task execution.

Sample Walk-Through

I could break this out into code and a demo separately, but you sort of lose context, so let’s do it all at once.

Suppose I want to farm out some arbitrary task. It turns out that this task can be pretty much anything, but if you want that sort of power and flexibility, the way to do it is to create an “external question,” which basically means that an iFrame will load your separate website, and your website will POST back to Amazon.
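
When Amazon loads your site in that iFrame, the IDs you need arrive as query parameters on the URL. Here is a framework-free sketch of pulling them out; the function name `parse_mturk_params` and the example IDs are made up for illustration, and only the three parameter names used elsewhere in this post are assumed.

```python
try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs  # Python 2


def parse_mturk_params(url):
    """Pull the MTurk query parameters out of the URL the iFrame was loaded with."""
    query = parse_qs(urlparse(url).query)
    names = ("assignmentId", "hitId", "workerId")
    # parse_qs returns a list per key; keep the first value, or "" if missing.
    return {name: query.get(name, [""])[0] for name in names}


# Hypothetical example URL; the IDs are invented.
example = ("https://mturk-demonstration.herokuapp.com/"
           "?assignmentId=ABC123&hitId=HIT456&workerId=W789")
params = parse_mturk_params(example)
print(params["workerId"])
```

Missing parameters come back as empty strings, so the same function also works when someone opens the page directly instead of through Mechanical Turk.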

So let’s set up a really simple demo. There’s a development sandbox where you have an infinite budget, and then you can easily switch over to a production environment. I tested everything initially, but I’ll walk through the example in production.

Screen Shot 2015-03-09 at 11.59.49 PM

Sign up on Mechanical Turk as a requester. There are tons of resources on the internets about making money online and signing up as a worker. But we’re the producers, the big ballers that drop mad nickels on the economy and make things happen, paying out $2 in just a few minutes distributed across dozens of people. So we’ll talk about that angle. You pretty much just authenticate with Amazon, ensure you have an Amazon Developer Console, and add some funds to your account. The same can be done in the Sandbox.

To paint a picture of what I’m doing, I basically crowdsourced an arbitrary task. In this case, I took some awkward and ridiculous pictures of my roommate from back at the Academy, and asked workers to describe the man. See below:
Screen Shot 2015-03-09 at 9.30.06 PM

Let’s go ahead and get additional context by scrolling down further:

Screen Shot 2015-03-10 at 12.09.18 AM

For the code, there are really two parts to this. For the sake of a simple demonstration, below is a quick script that will generate some tasks. This is only usable because I’ve already written the server side code (bear with me).

import os
from boto.mturk.connection import MTurkConnection
from boto.mturk.question import ExternalQuestion
from boto.mturk.price import Price

AWS_ACCESS_KEY_ID = os.environ['AWS_ACCESS_KEY_ID']
AWS_SECRET_ACCESS_KEY = os.environ['AWS_SECRET_ACCESS_KEY']

if os.environ.get("I_AM_IN_DEV_ENV"):
    HOST = 'mechanicalturk.sandbox.amazonaws.com'
else:
    HOST = 'mechanicalturk.amazonaws.com'

connection = MTurkConnection(aws_access_key_id=AWS_ACCESS_KEY_ID,
                             aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
                             host=HOST)

url = "https://mturk-demonstration.herokuapp.com/"
title = "Describe a picture in your own words (COMPLETE THIS TASK ONLY ONCE!)"
description = "COMPLETE THIS TASK ONLY ONCE! All submissions after the first will be rejected"
keywords = ["easy"]
frame_height = 800
amount = 0.05

questionform = ExternalQuestion(url, frame_height)


for _ in range(60):  # range works on both Python 2 and 3
    create_hit_result = connection.create_hit(
        title=title,
        description=description,
        keywords=keywords,
        max_assignments=1,
        question=questionform,
        reward=Price(amount=amount),
        response_groups=('Minimal', 'HITDetail'),  # selects how much detail Amazon returns about the new HIT
    )

Credits to this blog for the code sample that got me started.

All that’s happening here is that we’re creating Human Intelligence Tasks that simply load my website: https://mturk-demonstration.herokuapp.com/. Also, there’s no guarantee that the link I just posted will live forever; I just threw together a quick Heroku App.
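
For a sense of what a batch like this costs, here is a small budgeting sketch. The function `batch_cost` and its `fee_rate` parameter are hypothetical; the fee defaults to zero because Amazon’s commission is not covered in this post, so plug in whatever rate currently applies.

```python
from decimal import Decimal


def batch_cost(num_hits, reward, max_assignments=1, fee_rate=Decimal("0")):
    """Total spend for a batch: rewards plus a proportional fee.

    fee_rate is an assumption: pass whatever commission Amazon currently
    charges, e.g. Decimal("0.10") for 10%. It defaults to zero here.
    """
    rewards = Decimal(num_hits) * Decimal(max_assignments) * reward
    return rewards + rewards * fee_rate


# The numbers from the script above: 60 HITs, 1 assignment each, $0.05 reward.
print(batch_cost(60, Decimal("0.05")))  # 3.00
```

`Decimal` is used instead of floats so that sums of cents come out exact.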

Server Side Code

Here’s my server side (Django) code:

import os

from django.shortcuts import render_to_response

if os.environ.get("I_AM_IN_DEV_ENV"):
    AMAZON_HOST = "https://workersandbox.mturk.com/mturk/externalSubmit"
else:
    AMAZON_HOST = "https://www.mturk.com/mturk/externalSubmit"


def home(request):
    if request.GET.get("assignmentId") == "ASSIGNMENT_ID_NOT_AVAILABLE":
        # worker hasn't accepted the HIT (task) yet
        pass
    else:
        # worker accepted the task
        pass

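    worker_id = request.GET.get("workerId", "")
    # get_worker_ids_past_tasks() is not shown here; it stands for your own
    # lookup of workers who have already submitted this task
    if worker_id in get_worker_ids_past_tasks():
        # you might want to guard against this case somehow
        pass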

    render_data = {
        "worker_id": request.GET.get("workerId", ""),
        "assignment_id": request.GET.get("assignmentId", ""),
        "amazon_host": AMAZON_HOST,
        "hit_id": request.GET.get("hitId", ""),
    }

    response = render_to_response("base.html", render_data)
    # without this header, your iFrame will not render in Amazon
    response['x-frame-options'] = 'this_can_be_anything'
    return response

A few things to note that I have commented:

  • Host endpoints vary based on production/sandbox
  • You’ll need assignment ID, worker ID, and hit ID in order to POST back to Amazon (despite what docs say)
  • Assignment ID will have the value “ASSIGNMENT_ID_NOT_AVAILABLE” if the worker is just viewing the task (it would be a good idea to hide the submit button or something so they don’t do unnecessary work and get frustrated)
  • Your page won’t render unless you modify the “x-frame-options” header. Every value I tried gave JavaScript errors, but the errors caused the browser to ignore the header, which is pretty much what I wanted instead of “DENY”
  • If your task happens to be like the one I described where it’s the same for everyone, you’ll want to guard against the same worker executing the same task repeatedly.
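
The two empty branches in the view (preview versus accepted, and the repeat-worker check) can be folded into one small helper. This is only a sketch: `page_state` and its return values are hypothetical names, and treating a missing assignment ID as a preview is an assumption, not documented Amazon behavior.

```python
PREVIEW_ID = "ASSIGNMENT_ID_NOT_AVAILABLE"


def page_state(assignment_id, worker_id, past_worker_ids):
    """Classify a request so the template can decide what to show.

    Returns "preview" when the worker is only looking at the HIT,
    "repeat" when this worker already submitted one, and "work" otherwise.
    """
    # Assumption: an empty assignment ID (someone opening the page directly)
    # is handled like a preview, so no submit button is shown.
    if not assignment_id or assignment_id == PREVIEW_ID:
        return "preview"
    if worker_id in past_worker_ids:
        return "repeat"
    return "work"


# Made-up IDs for illustration.
print(page_state("ABC123", "W1", {"W1", "W2"}))
```

The template could then hide the submit button for "preview" and show a polite “you have already done this one” message for "repeat".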

Client Side Code

Initially, I built out a more extensive template with some JavaScript and AJAX that posted to my server to collect some data, which then posted to Amazon. The reason was that even though this example is simple, I was going to do some far more extensive stuff down the line. However, I found through some forums and verified through trial and error that Amazon will only accept requests directly from the client. So your best bet is to just write a simple template. Here’s mine:

<html>
    <head>
        {% include "_js_scripts.html" %}
    </head>
    <body>
    <h3>Using your own words, describe this man however you want.</h3>
        <form action="{{ amazon_host }}" method="POST">
            <textarea rows="4" cols="50" name="user-input"></textarea>
            <input type="hidden" id="assignmentId" value="{{ assignment_id }}" name="assignmentId"/>
            <br/>
            <input type="hidden" id="workerId" value="{{ worker_id }}" name="workerId"/>
            <input type="hidden" id="hitId" value="{{ hit_id }}" name="hitId"/>
            <input type="submit">
        </form>
        <div>
            <img src="https://s3.amazonaws.com/lobbdawg/mturk/broome1.jpg" style="width: 30%;">
            <img src="https://s3.amazonaws.com/lobbdawg/mturk/broome2.jpg" style="width: 30%;">
            <img src="https://s3.amazonaws.com/lobbdawg/mturk/broome3.jpg" style="width: 30%;">
            <img src="https://s3.amazonaws.com/lobbdawg/mturk/broome4.jpg" style="width: 30%;">
            <img src="https://s3.amazonaws.com/lobbdawg/mturk/broome5.jpg" style="width: 30%;">
            <img src="https://s3.amazonaws.com/lobbdawg/mturk/broome6.jpg" style="width: 30%;">
        </div>
    </body>
</html>

Putting It All Together

When I post a task as a requester, I can see it has been created:

Screen Shot 2015-03-09 at 9.30.20 PM

From Amazon, I can use their frustrating interface to see all of my batches, which are counter-intuitively at 0, and then click the not so obvious “Manage HITs Individually” to get some actual data:

Screen Shot 2015-03-09 at 9.17.44 PM

Then for whatever reason, I can only review tasks individually, which kind of defeats the purpose of crowdsourcing:

Screen Shot 2015-03-09 at 9.18.23 PM

So we need some more code. One more sample for how to manipulate HITs with Python:

# Reuses the `connection` object from the HIT-creation script above.
all_hits = list(connection.get_all_hits())

for hit in all_hits:
    assignments = connection.get_assignments(hit.HITId)
    for assignment in assignments:
        # don't ask me why this is a 2D list
        question_form_answers = assignment.answers[0]
        for question_form_answer in question_form_answers:
            # "user-input" is the field I created and the only one I care about
            if question_form_answer.qid == "user-input":
                user_response = question_form_answer.fields[0]
                print(user_response)
                print("\n")
        # uncomment to approve (and pay for) the assignment
        # connection.approve_assignment(assignment.AssignmentId)
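
Since the HIT description promises that repeat submissions get rejected, the review step needs a rule for which assignments to approve. Below is one way to sketch it in plain Python; `split_first_per_worker` is a hypothetical helper, and the tuple layout (including where the submit time comes from) is an assumption about how you collect the data, not something boto hands you in this shape.

```python
def split_first_per_worker(submissions):
    """Split submissions into (approve, reject) lists of assignment IDs.

    submissions is an iterable of (worker_id, assignment_id, submit_time)
    tuples; submit_time only needs to be sortable. The earliest submission
    per worker is approved, later ones are rejected.
    """
    approve, reject, seen = [], [], set()
    for worker_id, assignment_id, _ in sorted(submissions, key=lambda s: s[2]):
        if worker_id in seen:
            reject.append(assignment_id)
        else:
            seen.add(worker_id)
            approve.append(assignment_id)
    return approve, reject


# Made-up IDs and timestamps for illustration.
demo = [("W1", "A1", 10), ("W2", "A2", 11), ("W1", "A3", 12)]
print(split_first_per_worker(demo))
```

The IDs in the first list could then be fed to `connection.approve_assignment`, as in the commented-out line above; how you reject the rest depends on what your boto version offers.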

Results

This post, of course, would not be complete without the results of the crowdsourcing. All in all, I spent $3. If that. I’m not sure if all the tasks got completed. But here are the responses that I got from all the workers for the above task:

  • Eclectic. Mustachioed. Interesting. ‘Murican.
  • The man in the picture is of average height and weight but muscular build from working out. He has dark hair and sometimes facial hair but not always. He has lived a life of adventure. He is not afraid to take risks and can be the life of the party but tires at the end of the day just like the rest of us. He interacts well with different people including children. He has many varied interests from cars to vacationing. He is confident and proud.
  • Diverse. Looks like he knows how to have fun but also how to be serious. Soldier. Silly. Strong.
  • amazing
  • Sociable, outgoing, and always busy/occupied
  • This man is jolly type and hard worker. he enjoys the nature everthing.Children like him somuch because always like as freind to children.
  • Outgoing and fun.
  • 1. Suburb Perv
    2. Phelps Wannabe
    3. Cool Vet and Prince look-a-like
    4. Career Day Vet
    5. Leprechaun Pimp
    6. 4 hrs 3 minutes and 30 seconds clean Guy
  • This man knows how to live life. He is at one with himself and knows what it means to be a real man and a hero.
  • A fun guy that enjoys having a good time.
  • It’s like if the “Most Interesting Man in the World” became a hipster that tries too hard
  • This man is looking handsome. He is brave and honest. He looks like an army soldier.
  • wove
  • nice
  • Active and sense of humor.
  • Looks like he wants to promote bikini, so he approaches a manufacturer to give him a chance to promote their bikini but he refuses and instead offer him to promote his company in school and mall with different type of costumes.End of the day he is exhausted and taking a nap on the table.
  • nice
  • Good handsome guy serving his nation with pride.
  • Hardworking solider who enjoys his free time.
  • The man is athletic, outgoing, a patroit, and enjoys vintage automobiles.
  • outgoing, a soldier, doesn’t take himself to seriously, good sense of humor
  • Goofy, oddball, patriot, soldier, friendly, off beat, unique, strange, funny.
  • good look
  • freeness


  • Adventurous extrovert with well rounded social life.
  • This person appears to be someone who likes to have fun and doesn’t take himself too seriously. However, he also seems to have a serious side that is dedicated to the causes in which he believes.
  • he is a man of many disguises, seems to have a sense of humor and likes to stand out in a crowd.
  • Sleepy. Furry. Half-naked. Touching a sweet car.
  • he seems to be a military man who wears what he wants to wear regardless of what people will think of him
  • goodreads
  • An active, social, healthy and handsome man
  • Soldier. Silly. Strong. Knows how to be serious but how to have fun.
  • super style
  • CRAZY, JOKESTER WHO ENJOYS MAKING OTHERS LAUGH, WHILE STILL HAVING A SERIOUS SIDE AND SERVING OUR COUNTRY
  • Soldier from the south in some interesting getup.
  • Adventurous, funny, caring, selfless
  • Outgoing
  • extrovert, social, funny, helpful, service oriented, stylish, hard working
  • He is a sexy and versatile man!
  • He is the all american type of guy.
  • This dude is a soldier who makes time to talk to kids but definitely knows how to enjoy his down time to the fullest. Also, he gets sleepy sometimes.
  • Very outgoing. Not afraid about what others might think of him.
  • This man is part of the military. He is in good physical shape. These photos are not of the same man. The man in the green fuzzy suit is not the same man on the beach.
  • He is a veteran of the armed forces. When he’s not serving our country he likes to have fun by restoring vintage cars, hanging out at the beach and going to parties. He is a charitable man who likes to help out with children at their school and help out at fundraisers. Sometimes he pushes himself too hard and gets tired.
  • adventurous, fun
  • Party animal, fun-loving, good person. Bumpkin, hard worker, fun.
  • Outgoing
  • He seems like a ridiculous man. He does silly things and wears weird clothes.
  • He seems fun,with a great personality and a humanitarian.
  • funny, energetic, fun, happy, exciting, positive, partier
  • This is Tim. He’s not afraid to take risks and enjoy life by posing in bizarre pictures. He stays out late and hits up the nightlife and is known for wearing odd clothes on occasion. Tim has an above-average respect for the armed forces. His active social life allows him to be in constant contact with the public.
  • This man is responsible yet outgoing. He loves to have fun but he knows how to control himself.

Comments

  • Michael Sverdlin

    Nice article. Just one point – you are running the above server (https://mturk-demonstration.herokuapp.com/) with debug=True. This might be ok with you, but consider someone malicious spamming you with 1000s of wrong crop times making the 3$ you spent be for naught 🙂

    • Scott Benedict Lobdell

      Ah, thank you! I believe I left that on troubleshooting a production only issue, it’s gone now.

  • Kamiel Choi

    Cool idea :-

  • Tejas Khot

    I launched a few hits using boto in sandbox mode but now I don’t know how to view them in AMT; there is no url output after launching. I want to verify if things are working fine. Can you please help. How does one get url for those hits and how to see in sandbox for external question mode? Thanks!

    • gunnerliao

      After launching HITs, you can view them via requestersandbox.mturk.com (log in as requester) and click manage, then you will see “manage HITs individually” on the right, by clicking which you will see the them.

      But I got confused as well, since I can view them, how can I launch them as external HITs instead of aws hits.

      • Tejas Khot

        Hey
        If I am using ExternalQuestion (AMT loads a different/external site within its frame to show the task details), then I am parsing the url from AMT to find out hitId, assignmentId, workerId etc. so I can make unique assignments to all of them from my backend server. However, if I visit AMT site, I cannot figure out how to see the data visible there in sandbox because I guess it does not append the stuff to url at that time so I am not able to check my tasks due to lack of data loading. How do you suggest I handle this?

  • Burch Kealey

    This was very helpful today – let me add something to it. I believe the reason assignments is a list of lists relates to the capability to specify max_assignments. In your case you specified 1 – each assignment was completed by one worker. I am setting up a data entry task and I want to use the results to self-validate the workers – if they match then accept as good, if no match then human intervention. So in my case max_assignments = 2

    Each item in assignments is associated with one WorkerId and has the answers submitted by that worker.

    It did seem cumbersome as you noted until I realized that.