Last week I submitted some proposals for PyCon 2K15, one of which was to estimate a one rep max for squats / bench press / deadlifts using computer vision. The proof of concept is demo’ed in the video above. The idea is simple:
- Force = mass * acceleration
- Therefore, if force is known and acceleration is set to the acceleration due to gravity, solving for mass gives the ceiling on the number of pounds that can be lifted.
- However, force is not known up front: the mass on the bar is known, but the bar's acceleration is not.
- Acceleration can be calculated in terms of pixels.
- If the center of the bar can be found in every frame of a video, acceleration could be calculated in pixels per second per second.
- Pixels can be converted to meters (or feet or whatever) if an object with known real-world dimensions can be detected in the image.
- Olympic Barbells have standard dimensions and are ubiquitous.
- Detecting a barbell coincides with detecting the barbell’s center in the first place.
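To make the physics above concrete, here is a minimal Python sketch. It is not the code from the demo. It assumes the bar's height has already been extracted for every frame, that a men's Olympic bar is 2.2 m long, and that the force on the bar can be modeled as m * (g + a). The function names are made up for illustration.

```python
G = 9.81  # standard gravity in m/s^2

# Assumption: a men's Olympic barbell is 2.2 m long end to end.
BAR_LENGTH_M = 2.2


def pixels_per_meter(bar_length_px, bar_length_m=BAR_LENGTH_M):
    """Scale factor derived from the detected bar length in pixels."""
    return bar_length_px / bar_length_m


def accelerations(heights_px, fps, px_per_m):
    """Second finite difference of the bar height, in m/s^2.

    heights_px must grow upward. Image rows grow downward, so flip them
    first (for example: frame_height - y).
    """
    dt = 1.0 / fps
    h = [y / px_per_m for y in heights_px]
    return [(h[i + 1] - 2 * h[i] + h[i - 1]) / dt ** 2
            for i in range(1, len(h) - 1)]


def estimated_max_mass(mass_kg, peak_accel):
    """Mass that the same peak force could just hold against gravity."""
    peak_force = mass_kg * (G + peak_accel)  # assumed model: F = m * (g + a)
    return peak_force / G


# Made-up numbers: a bar detected as 220 px wide, filmed at 10 fps.
scale = pixels_per_meter(220)
acc = accelerations([0, 1, 4, 9, 16], fps=10, px_per_m=scale)
print(max(acc), estimated_max_mass(100, max(acc)))
```

Real detections are noisy, and a second difference amplifies that noise, so the per-frame positions would probably need smoothing before the peak acceleration means anything.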
So I set out to use OpenCV and accurately detect the barbell in every frame. The code itself currently stands at about 1,000 or so lines, so it’s not really worth sharing code samples (unless you want them…leave some comments if you want to know how I did anything). What might be more interesting is to learn what DIDN’T work:
Failure 1: Template Matching
When I first started working on this problem, I thought it would end up being really easy. If we took advantage of knowing that a barbell existed in the image on a horizontal plane, it would be likely that a picture of a barbell with a transparent background with 135 pounds loaded would match pretty well in the image. I also followed someone else’s advice on StackOverflow to handle transparent template matching by converting the transparent portion of the image to random noise. In this way the transparent portion of the image would become statistically insignificant and would not affect the template matching one way or another.
However, the clever noise trick proved moot. I found that template matching only worked when the template was the same size as the bar in the image. It does no scaling of its own, and the length of the bar in pixels is an initial unknown, so this method was virtually useless.
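For reference, the noise trick can be sketched in a few lines of NumPy. This is my reading of the StackOverflow advice, not the code I actually ran: it assumes the template is an RGBA array of `uint8` values, and `fill_transparent_with_noise` is a made-up name.

```python
import numpy as np


def fill_transparent_with_noise(template_rgba, seed=None):
    """Return a 3-channel copy where fully transparent pixels are random noise."""
    rng = np.random.default_rng(seed)
    color = template_rgba[..., :3].copy()      # drop the alpha channel
    transparent = template_rgba[..., 3] == 0   # (H, W) boolean mask
    noise = rng.integers(0, 256, size=color.shape, dtype=np.uint8)
    color[transparent] = noise[transparent]    # opaque pixels stay untouched
    return color


# Toy template: a 2-pixel-tall opaque gray bar with transparent rows around it.
template = np.zeros((4, 6, 4), dtype=np.uint8)
template[1:3, :, :] = 200
noisy = fill_transparent_with_noise(template, seed=0)
print(noisy.shape)
```

The resulting 3-channel image is what would be handed to the template matcher. It still only matches at the template's own size, which is the limitation that sank this approach.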
Failure 2: Contour Matching
Moving along, I tried what I thought would be a similar method with Contour matching. Contour matching accounts for image rotations and scaling, so if I could take the above approach but use contours instead, I should be able to find a match. It’s fairly simple to get the thresholds just right so that we can convert the first template (the “needle”, if you will (and I will)) I had displayed into a single contour. However, the challenge I found was to accurately detect the appropriate contour in the “haystack” image.
The sample image above shows an attempt at contour detection (not even matching yet) where every contour is a different color. The barbell in the image has detected contours along all of its edges, but unfortunately, this does not create a single contour that we can try matching against. It may be possible to threshold a single input image just right, but to do so across multiple frames with unknown lighting and background settings with different image qualities did not seem feasible.
Failure 3: Feature Matching
This straight up didn’t work, and I didn’t try anything else. I saw feature matching suggested as a viable option somewhere on StackOverflow, but the above image shows the features matched from the template barbell to the image. The results don’t seem like anything I can even start to work with.
Failure 4: Machine Learning and Neural Networks
Did not attempt. Fail.
However, it would be worth noting that creating a machine learning model would be a viable option, but this would require me to provide thousands of pieces of training data in order to make a reliable model. Going down this path didn’t seem like a good idea because I didn’t even know if I could necessarily get this working to match exact points.
Sort of Getting Somewhere: Motion Detection
After the failure of the methods above, I realized that the only way to make any progress on detecting a barbell in the image was to take advantage of the constraints of our use case. As I just gave away in the header here, motion detection turned out to be an important constraint to take advantage of. However, there are a number of things that we know about our use case:
- Movement of the barbell will take place in the video
- The barbell will be moving in an up and down direction
- The barbell will be the widest object in the image
- The barbell will be an Olympic barbell
- There will be limits to the possible acceleration of the bar (i.e. a reasonable constraint may be the force generated by world record numbers)
- There will be similar data across multiple frames
But to walk through the path from failure to success in order, motion detection turned out to be the most important constraint and a great starting point. Motion detection in OpenCV is an incredibly simple process:
- Get the absolute difference between two matrices representing adjacent frames
- Repeat above step for the following frame
- Get the bitwise exclusive OR of the above two resultant frames
The resulting frame looks like the image above. Now we have something to work with.
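The three steps above can be sketched with plain NumPy as follows. This is a minimal illustration rather than the code behind the video: it assumes grayscale `uint8` frames, and the `threshold` that turns each difference into a binary mask is my own addition. In OpenCV the same operations are `cv2.absdiff` and `cv2.bitwise_xor`.

```python
import numpy as np


def abs_diff(a, b):
    """Per-pixel absolute difference of two uint8 grayscale frames."""
    # Widen to int16 first so the subtraction cannot wrap around.
    return np.abs(a.astype(np.int16) - b.astype(np.int16)).astype(np.uint8)


def motion_mask(prev_frame, frame, next_frame, threshold=25):
    """Binary motion mask (0 or 255) from three consecutive frames."""
    d1 = abs_diff(prev_frame, frame) > threshold   # assumed threshold
    d2 = abs_diff(frame, next_frame) > threshold
    return (np.bitwise_xor(d1, d2) * 255).astype(np.uint8)


# Toy example: one bright pixel moving diagonally across three frames.
frames = [np.zeros((5, 5), dtype=np.uint8) for _ in range(3)]
frames[0][1, 1] = 200
frames[1][2, 2] = 200
frames[2][3, 3] = 200
print(motion_mask(*frames))
```

A common variant uses a bitwise AND instead of the XOR, which keeps only the pixels that changed in both differences.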
Better, but still Failing: Line Detection
My first attempt at taking advantage of motion detection is pictured here. I took the motion detection frame and applied line detection to the frame, and at first glance you might think we had something that was completely reliable. Basically, if we know that the bar is a straight line, and motion detection will pick up long, horizontal lines, we might be able to assume that our best instance of a lengthy line is the bar. However, this approach will also output anomalies:
It becomes clear that line detection by itself will not produce reliable results. However, it did end up being useful as a basic threshold to even attempt barbell detection.
Winning: Using Motion Detection and Line Detection! And Brute force…And adding some more constraints
The basic evolving solution has been to repeatedly constrain the problem based on what we know about each frame. Another known constraint is that an Olympic bar will be used in the image. Therefore, if we can constrain the pixels to examine to only a subset using motion detection, it then becomes reasonable to brute force pixel matching. So given a reasonable subset of pixels, we can try every possible bar width with every possible x and y offset and find the best position and bar size combination. So here’s where I stand:
- Apply motion detection to the raw frames
- Apply line detection to the motion detection
- If the number of detected lines passes a threshold, try to detect the bar
- Brute force across the entire image takes forever (maybe 20 minutes per frame), so constrain some more:
- It is a reasonable assumption that the pixel row with the most motion detection pixels contains the bar, so establish a Y constraint here
- Another reasonable assumption is that the bar will at least pass through the center of the image. The bar itself doesn’t need to be centered, but some part of it will pass through the center. So establish an (x, y) coordinate that the bar must pass through based on this rule and the previous one.
- It is also a reasonable assumption that left and right limits can be set based on the first and last occurring instances of motion detection across columns, so establish a left and right X in this way
- Ignore comparing when the examined block hardly has any motion detection pixels
- We can find a best match by:
- Creating a matrix of absolute values of the pixel difference between the before and after versions of the frame when applying the attempted overlay and summing the matrix. This represents the total impact on the image in pixels.
- Divide that number by the total number of non-transparent pixels in the overlay so as to not let a smaller bar get better results.
- Now we have bar matches in most of our frames. Now we can try and filter out bad results with some fairly safe filters:
- The y values for the bars across the frames should be normally distributed. So discard detected barbells with y values outside of 3 standard deviations from the mean (three sigma rule)
- Detected bar widths should be extremely close together, so again discard barbells based on the three sigma rule based on bar widths
- Finally, take the average X of the detected bars and the average bar widths across all of the frames, and you’ll find that we have pretty good matches
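A few of the steps above can be sketched in NumPy. This is an illustration under assumptions, not my actual implementation: the motion frame is taken to be a 2D mask that is nonzero where motion was detected, the overlay is a single-channel image with a boolean `opaque` mask, and all three function names are hypothetical. I also treat a lower score as a better match, since the normalization exists to stop a smaller bar from winning.

```python
import numpy as np


def search_window(motion):
    """Row with the most motion pixels plus left/right column limits."""
    active = motion > 0
    if not active.any():
        return None                                # nothing moved in this frame
    y = int(np.argmax(active.sum(axis=1)))         # busiest row
    cols = np.flatnonzero(active.any(axis=0))      # columns with any motion
    return y, int(cols[0]), int(cols[-1])


def overlay_score(frame, overlay, opaque, x, y):
    """Mean absolute change per opaque pixel from pasting overlay at (x, y).

    The overlay must fit inside the frame at that offset.
    """
    h, w = overlay.shape
    before = frame[y:y + h, x:x + w].astype(np.int16)
    after = np.where(opaque, overlay.astype(np.int16), before)
    return np.abs(after - before).sum() / max(int(opaque.sum()), 1)


def three_sigma_keep(values):
    """Boolean mask of values within 3 standard deviations of the mean."""
    v = np.asarray(values, dtype=float)
    std = v.std()
    if std == 0:
        return np.ones(v.shape, dtype=bool)
    return np.abs(v - v.mean()) <= 3 * std


# Toy motion frame with a 12-pixel-wide "bar" on row 5.
motion = np.zeros((10, 20), dtype=np.uint8)
motion[5, 4:16] = 255
bar = np.full((1, 12), 255, dtype=np.uint8)
solid = np.ones((1, 12), dtype=bool)
print(search_window(motion))
print(overlay_score(motion, bar, solid, x=4, y=5))
```

The brute force search would call `overlay_score` for every candidate width and offset inside the window, and `three_sigma_keep` would then be applied once to the y values and once to the bar widths across frames. Note that the three sigma rule needs a fair number of frames before a single outlier can fall outside the limit at all.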
The result of said process can be found in the video at the beginning of this post. This works beautifully in a sterile environment where almost all of the detected motion is from the bar, and in my sample video I’m using a camera on a tripod. With a hand-held camera, the results are still pretty good, but there are also anomalies all over the place, so they are not good enough to apply physics equations to.
If you have some ideas, please leave them in the comments, but I still have a lot more I want to try, and I wanted to document my progress thus far before I forget all of the steps that I ended up taking. Here’s what I still want to try:
- Given that most of our frames are pretty good matches at this point, get the average pixel representation from the original RGB image at the points inside the barbell overlay and retry template matching.
- This would give us the bar location in all frames and not just the frames in which movement is happening
- Given all of the (x, y) coordinates we now have representing the barbell, discard points that don’t make any sense (except, if the transition between two points doesn’t make sense, which do you discard and which do you keep?)
- Along these lines, I was thinking of an n choose k style approach where I tried all possible combinations of point representations and discarded representations that physically didn’t make sense. However, I think this is beyond the processing power of my machine
- Somehow take advantage of a valid assumption that the original pixels in the image at the overlay points should be relatively symmetrical.
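To back up the hunch that the n choose k idea is too expensive, here is a quick count of how many subsets there would be to check. The frame count and the number of bad points are made-up numbers, and `subsets_to_check` is a hypothetical helper.

```python
from math import comb


def subsets_to_check(n_points, max_discarded):
    """Number of ways to discard up to max_discarded of n_points detections."""
    return sum(comb(n_points, k) for k in range(max_discarded + 1))


# Hypothetical clip: 300 detected points, up to 20 of them bad.
print(subsets_to_check(300, 20))
```

Even for a short clip the count is far more than could be checked one by one, so some smarter pruning would be needed.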
I’ll follow up with more, but let me know if you have any thoughts!