Batch Video Editing with a Web Application

Too Long; Did not Read

In a failed attempt to crowdsource video editing with Mechanical Turk, I found that the tooling I’d put in place dramatically increased the rate at which I could edit hundreds of videos. Code samples and screenshots below.

True Story Follows

So in my last post about Mechanical Turk and Boto, I outlined the context for writing code to use and understand the crowdsourcing service. Check out that post for details on how to use those services.

I needed to edit hundreds of videos by framing the person in the video (cropping), trimming the dead space at the beginning, trimming the dead space at the end, and then combining two shots to create a single scene. Doing this manually through video editing software was extremely tedious and slow going, so I wanted to see about crowdsourcing the problem. The idea was pretty awesome, if I do say so myself, but as Dr. Jay “Miles Dyson” Buckingham had warned, using Mechanical Turk had some problems.

Why Mechanical Turk Didn’t Work

To paint some context for what sort of tasks I gave to workers, here’s an example GIF of a sample response from my web server:

[GIF: the cropping task interface]

The task was to frame the demonstrator so that the pixels not used at all for the demonstration would be cropped out. I wrote some JavaScript that allowed the user to crop the frame on the left, which corresponded to the video on the right. The problems I had might be predictable, but here’s what it came down to:

  • Even if 90% of the workers did exactly as I expected, I still had to sift through 100% of the results to catch the 10% that were no good
  • The workers did not (necessarily) share my motivation to produce the perfect video with the awesome results
  • The directions I gave were as concise as I could really think to make them, but they still didn’t communicate things clearly

What actually ended up happening is that only about 10 to 20 percent of the submitted results were what I’d consider good. The other workers probably didn’t understand exactly what I was trying to accomplish. In some cases the video was cropped to just the person’s upper body; in other cases the worker had cropped the picture on the left but clearly hadn’t watched the motion on the right.

So, there are a number of things I could do to mitigate the above problems:

  • Make workers pass a “test image” with a pre-defined answer before I let them continue to work
  • As a similar idea, throw in test images throughout the working set with pre-defined answers that the worker must get close to in order to accept results.
  • Make instructions more clear by giving sample good and bad answers
  • Allow multiple workers to execute tasks on the same videos, and only accept the results if they’re close to each other
  • Repeatedly reject bad work (this has drawbacks: you get sassed over email, and you’re backing out on a nickel’s worth of work, which makes me feel like a cheapskate)

I figured that the work I had to do to mitigate the above problems wasn’t worth it, and even then I’d still need to review all results in order to ensure 100% quality instead of, say, 90%.

The Silver Lining

In order to review all of the work that people had done, I had to write yet another web application endpoint to view it. From Amazon’s data I could get the raw key-value pairs, but I still had to extract their meaning by seeing how people had actually cropped the videos. As I went through this process, I quickly realized that the tooling I’d put in place vastly increased how many videos I could work through in a short amount of time.

So I basically wrote a web application that I ran locally that would read videos from the local hard drive, render them using HTML5, allow me to edit them with some JavaScript, and store the results in a simple JSON file.

For each of the different tasks involved in the edit, I wrote a separate endpoint. Once complete, I would have a single set of instructions that I could execute on all of the videos exactly once, in a single batch.
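
The endpoints themselves were nothing fancy. As a rough sketch of the pattern, here’s what the save side of one might look like, assuming Django (the templates below use its {% static %} tag); the view name, field names, and results path are illustrative rather than the exact originals:

import json
import os

from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt

RESULTS_FILE = "crop_instructions.json"  # illustrative path


@csrf_exempt
def save_crop(request):
    # Merge the posted crop/trim data for one video into the local JSON file
    posted = json.loads(request.body)
    results = {}
    if os.path.exists(RESULTS_FILE):
        with open(RESULTS_FILE) as f:
            results = json.load(f)
    results[posted["video_name"]] = posted["instructions"]
    with open(RESULTS_FILE, "w") as f:
        json.dump(results, f, indent=4)
    return JsonResponse({"saved": posted["video_name"]})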

Step 0: Convert all the videos to webm

I’m starting at step 0 for two reasons:

  • Indexing in programming generally starts at 0
  • I’d already written all of the other steps, and I wanted to come back to provide additional context

In order to make videos as convenient as possible to render immediately using just HTML, they needed to be converted to WebM format first. I used Miro Video Converter to do this.
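
Miro is a point-and-click tool, so this step doesn’t batch especially well. If you’d rather script it, the same conversion can be done by shelling out to ffmpeg; here’s a rough sketch, assuming ffmpeg with VP8/Vorbis support is installed (the directory name is illustrative):

import os
import subprocess

input_directory = "./consolidated/"
for filename in os.listdir(input_directory):
    if not filename.endswith(".mp4"):
        continue
    source = os.path.join(input_directory, filename)
    destination = source.replace(".mp4", ".webm")
    # VP8 video + Vorbis audio is the combination HTML5 <video> expects for WebM
    subprocess.check_call([
        "ffmpeg", "-i", source,
        "-c:v", "libvpx", "-c:a", "libvorbis",
        destination,
    ])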

Step 1: Crop the Videos

[GIF: the cropping interface]

As described above, getting accurate cropping through crowdsourcing did not work. I went ahead and reviewed each result and just used the workers’ input as a starting point, which turned out to be only a marginal productivity gain.

I used Cropper to easily create the framing for each video. As you can see in the screenshot, it’s slightly counter-intuitive in that you’re not cropping the video itself, but a single frame of the video; this is because Cropper only works on images, not on video. Using the resulting crop data, I just updated the CSS on the video container to hide the corresponding pixels. Here’s sample usage for the cropper:

<link rel="stylesheet" href="{% static 'js/cropper/cropper.css' %}">
<link rel="stylesheet" href="{% static 'css/bootstrap-2.3.2.min.css' %}">
<script type="text/javascript" src="{% static 'js/jquery.min.js' %}"></script>

<script type="text/javascript" src="{% static 'js/bootstrap.min.js' %}"></script>

<script type="text/javascript" src="{% static 'js/cropper/cropper.js' %}"></script>
<script>
    var initCropper = function(){
        var targetEl = $("#same-frame-img");
        var aspectRatio = targetEl.width() / targetEl.height();
        targetEl.cropper({
            aspectRatio: aspectRatio,
            background: true,
            crop: function(data) {
                var leftOffset = data.x;
                var topOffset = data.y;
                var width = data.width;
                var height = data.height;
                // my own functions defined elsewhere that do stuff
                adjustVideo(leftOffset, topOffset, width, height);
                adjustInputs(leftOffset, topOffset, width, height);
            },
            autoCropArea: 0.8,
            guides: false,
            zoomable: false,
            highlight: false,
            dragCrop: true,
            movable: true,
            resizable: true
        });
    };
</script>

In order to more easily see the results of the cropping across the entire video, I increased the playback rate of the HTML5 video to 3x. This made it easy to see if the person went off screen at any point and saved me from sitting through the entire playback. It was also way faster than conventional video editing software because I could immediately see the effect on the entire shot without re-rendering or opening and closing tooling options. Here’s some sample code to create the video thumbnail and increase the video playback rate:

<script type="text/javascript" src="{% static 'js/popcorn-complete.min.js' %}"></script>
<script type="text/javascript" src="{% static 'js/popcorn.capture.js' %}"></script>
<script>
    var createThumbnail = function(){
        var video = Popcorn("#main-video");
        var midPoint = video.duration() / 2;
        video.listen("canplayall", function() {
            this.currentTime(midPoint).capture({
                target: "#img-placeholder",
                media: true
            });
            initCropper();
        });
    };

    var increasePlayback = function(){
        document.getElementById("main-video").playbackRate = 3.0;
        document.getElementById("main-video").defaultPlaybackRate = 3.0;
    };

    $(document).ready(function(){
        $("video").one("loadeddata", function(){
            increasePlayback();
            createThumbnail();
        });
    });
</script>

[Screenshots of the cropping interface]

Step 2: Trim the beginning dead space

For the separate task of trimming the dead space at the beginning of each video, I used the Powerange jQuery slider for an easy slider interface to manipulate the video. I put two HTML5 videos side by side, where one was permanently paused. Adjusting the slider would set the current time of both videos to the corresponding second. The still video let me see exactly what position the video would start from, and the playing video let me see what the next second or so would look like.
[GIF: the trimming interface]

Here’s some sample code for how to use Powerange:

<link rel="stylesheet" href="{% static 'css/powerange.css' %}">
<script type="text/javascript" src="{% static 'js/powerange.min.js' %}"></script>

<input type="text" class="slider-input" />

<script>
    var createSlider = function(){
        var previewVideo = $("#preview_video").get(0);

        var duration = Popcorn("#preview_video").duration();
        new Powerange($(".slider-input")[0], {
            step: 1,
            min: 0,
            start: 0,
            hideRange: true,
            max: 1000,
            callback: function(){
                var value = parseFloat($(".slider-input").val());
                var percent = value / 1000.0;
                var seconds = percent * duration;
                previewVideo.currentTime = seconds;
                previewVideo.play();
            }
        });
    };
</script>

Step 3: Trim the ending dead space

With conventional software, trimming the dead space at the end was far more tedious than trimming the dead space at the beginning, because you have to sit through the entire video just to find where the action ends. To make this much easier, I took the exact same process from above (trimming the beginning of the video) and applied it to a reversed version of each video.

Here’s a code sample for reversing a video using OpenCV:

import cv2


def create_video_writer(original_filename, height, width, frames_per_sec):
    # MJPG in an .avi container is easy for OpenCV 2.x to write
    codec = cv2.cv.FOURCC('M', 'J', 'P', 'G')
    original_filename = original_filename.replace(".mp4", "")
    new_filename = "reversed_" + original_filename + ".avi"
    video_writer = cv2.VideoWriter(new_filename, codec, frames_per_sec, (width, height))
    return video_writer


def generate_valid_frames(capture):
    # Yield frames until the capture is exhausted;
    # resized_frame is a helper defined elsewhere
    while True:
        success, frame = capture.read()
        if not success:
            break
        frame = resized_frame(frame)
        yield frame


if __name__ == "__main__":
    # capture_path and filename come from the surrounding batch loop (omitted)
    capture = cv2.VideoCapture(capture_path)
    frames_per_sec = capture.get(5)  # 5 == CV_CAP_PROP_FPS in OpenCV 2.x
    reversed_frames = list(reversed(list(generate_valid_frames(capture))))
    height, width = reversed_frames[0].shape[0:2]
    video_writer = create_video_writer(filename, height, width, frames_per_sec)
    for frame in reversed_frames:
        video_writer.write(frame)
    video_writer.release()

You can see in the sample below how the video is actually in reverse and defies gravity:

[GIF: the reversed video being trimmed]

If you consider that this let me avoid watching hundreds of 15-second clips, and assume zero overhead, this alone saved me several hours of time.

Step 4: Execute instructions on each shot

After I’d completed the cropping and trimming, all of the instructions were saved to a JSON file that looks something like this:

{
    "batch1_MVI_0002": {
        "start": "2.1",
        "height": "159.39312500000003",
        "width": "281.8",
        "offset_x": "22.19999999999999",
        "offset_y": "1",
        "from_end": "0.31110800000000005"
    },
    "batch1_MVI_0003": {
        "start": "0",
        "height": "169.8",
        "width": "300.1988950276243",
        "offset_x": "19.801104972375697",
        "offset_y": "0",
        "from_end": "0"
    }
}

Then I wrote a really simple script using OpenCV that executes all of the instructions on each video:

import json
import os

import cv2

FRAMES_PER_SEC_KEY = 5  # CV_CAP_PROP_FPS in OpenCV 2.x
INSTRUCTIONS_FILE = "combined_instructions.json"
with open(INSTRUCTIONS_FILE, "rb") as f:
    all_instructions = json.loads(f.read())


input_directory = './consolidated/'
for filename in os.listdir(input_directory):
    video_name = filename.replace(".mp4", "")
    instructions = all_instructions[video_name]
    capture_path = input_directory + filename
    capture = cv2.VideoCapture(capture_path)
    frames_per_sec = capture.get(FRAMES_PER_SEC_KEY)

    # generate_valid_frames handles the trimming, changed_frame handles the crop
    all_frames = list(generate_valid_frames(capture, instructions))
    edited_frames = [changed_frame(frame, instructions) for frame in all_frames]
    height, width = edited_frames[0].shape[0:2]
    video_writer = create_video_writer(filename, height, width, frames_per_sec)
    for frame in edited_frames:
        video_writer.write(frame)
    video_writer.release()
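
The helpers generate_valid_frames and changed_frame (along with create_video_writer from the reversing script) are defined elsewhere in my project, but they’re short. Here’s roughly what they look like, as a simplified sketch: it assumes the JSON crop values map directly to frame pixels (in practice they may need rescaling from the browser’s coordinate space) and that trimming is done by dropping frames at the start and end of the clip:

def changed_frame(frame, instructions):
    # Crop the frame to the rectangle chosen in the web interface
    x = int(float(instructions["offset_x"]))
    y = int(float(instructions["offset_y"]))
    width = int(float(instructions["width"]))
    height = int(float(instructions["height"]))
    return frame[y:y + height, x:x + width]


def generate_valid_frames(capture, instructions):
    # Skip the trimmed frames at the beginning and end, yield the rest
    frames_per_sec = capture.get(5)     # CV_CAP_PROP_FPS
    total_frames = int(capture.get(7))  # CV_CAP_PROP_FRAME_COUNT
    skip_start = int(float(instructions["start"]) * frames_per_sec)
    skip_end = int(float(instructions["from_end"]) * frames_per_sec)
    for index in range(total_frames):
        success, frame = capture.read()
        if not success:
            break
        if index < skip_start or index >= total_frames - skip_end:
            continue
        yield frame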

The rendering on each video took about an hour to process. Again, this was a huge time saver. The alternative with video editing software would have been the same render time, but I would have been physically present for each render of each video, and a huge degree of overhead would be added (open a file, crop the video, see the results of the crop, trim the beginning, see results, trim the end, see results, save to a filename, wait for video to render, clear the video editing project, sigh with angst, repeat).

Step 5: Join Videos

In this case, the videos are obviously exercise demonstrations. In 90% of the cases I had filmed from two different angles, and I still needed to combine each pair into a single scene. Another 10% of the videos were actually outtakes, so I had to identify those as well. So again, I wrote another endpoint that made it really easy to classify each case. Here’s a sample screenshot:

[Screenshot: the video-pairing endpoint]

I could check two videos to combine them, or I could click a button to label them as individual videos (either junk or a single-shot video). After any action, the page would refresh with 6 more videos. It probably took about 10 minutes to associate all of the videos. Again, the results were saved to a JSON file:

{
    "single_shot": [
        "batch0_MVI_0148",
        "batch2_MVI_0119",
        "batch2_MVI_0119"
    ],
    "junk": [
        "batch0_MVI_0151",
        "batch1_MVI_0036",
    ],
    "needs_edit": [
        "batch1_MVI_0043",
        "batch2_MVI_0071"
    ],
    "pairs": [
        [
            "batch0_MVI_0122",
            "batch0_MVI_0123"
        ],
        [
            "batch0_MVI_0120",
            "batch0_MVI_0121"
        ]
    ]
}

I executed another script with OpenCV to join all of the videos together and let the program run for an hour. Since I didn’t have to be physically present, I enjoyed some of the finer things in life: I lifted weights, drank some whiskey, and did all the other things you might expect during an hour of dead time.
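
The joining script itself is nothing exotic; here’s a rough sketch of the core of it, assuming both clips in a pair were already rendered at the same size and frame rate (each entry in the pairs list above would get one call):

import cv2


def join_pair(first_path, second_path, output_path):
    # Write out the first clip's frames followed by the second clip's frames
    first = cv2.VideoCapture(first_path)
    second = cv2.VideoCapture(second_path)

    frames_per_sec = first.get(5)  # CV_CAP_PROP_FPS
    width = int(first.get(3))      # CV_CAP_PROP_FRAME_WIDTH
    height = int(first.get(4))     # CV_CAP_PROP_FRAME_HEIGHT

    codec = cv2.cv.FOURCC('M', 'J', 'P', 'G')
    video_writer = cv2.VideoWriter(output_path, codec, frames_per_sec, (width, height))

    for capture in (first, second):
        while True:
            success, frame = capture.read()
            if not success:
                break
            video_writer.write(frame)
        capture.release()
    video_writer.release()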

Step 6: Profit

After completing all of the video editing, I used Miro Video Converter again to resize each video to 360p, converted each one to mp4 and webm, and uploaded everything to Amazon S3 so that the videos could be used in web applications. It’s my conclusion that all of the time invested in writing code saved drastically more time than would otherwise have been spent editing videos.
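
The S3 upload side is just a short loop with Boto (the library from the earlier post); here’s a minimal sketch, assuming classic boto 2, where the credentials, bucket name, and directory are placeholders:

import os

from boto.s3.connection import S3Connection
from boto.s3.key import Key

# Placeholder values -- swap in real credentials, bucket, and directory
connection = S3Connection("AWS_ACCESS_KEY", "AWS_SECRET_KEY")
bucket = connection.get_bucket("my-video-bucket")

output_directory = "./final_renders/"
for filename in os.listdir(output_directory):
    if not filename.endswith((".mp4", ".webm")):
        continue
    key = Key(bucket)
    key.key = "videos/" + filename
    key.set_contents_from_filename(os.path.join(output_directory, filename))
    key.set_acl("public-read")  # so web applications can serve the files directly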