At Clarifai, we have an internal Hack Day every month where everyone works on a pet project they don't normally have time for. My Hack Day project this month was to attempt to answer a question that bored kids everywhere have asked when they found themselves stuck indoors, be it from a rainy day, a doctor's visit, or a cross-country road trip in the back of a minivan: Where's Waldo?
With scenes filled with animals, aliens, mermaids, and people of all sorts (including several Waldo imposters or "friends" as they call them), I wanted to see just how well our custom training model could perform on such a challenging and popular data set. Here's how I did it!
Step 1: Find the Data
Luckily for me, some kind and curious soul had already created and shared a Where's Waldo? dataset online, where he took 19 Where's Waldo maps, split them into grids/tiles, then labeled the tiles accordingly ("waldo" or "not waldo").
I decided to follow that method and do the same with other Waldo maps not included in his set. Maps were split into several different grids (4x5, 4x6, 5x8, etc.) to increase the sample size. These tiles were then uploaded into our app and labeled accordingly.
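The tiling step above can be sketched in a few lines. The grid-box math below is plain Python; the commented lines show how the actual cropping might look with Pillow (whose `Image.crop` takes a `(left, upper, right, lower)` box). The function name `grid_boxes` and the file names are my own, not from the original project.

```python
def grid_boxes(width, height, rows, cols):
    """Return (left, upper, right, lower) boxes covering the image in a rows x cols grid."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            # Integer division keeps boxes flush even when the image
            # size isn't evenly divisible by the grid.
            left = c * width // cols
            upper = r * height // rows
            right = (c + 1) * width // cols
            lower = (r + 1) * height // rows
            boxes.append((left, upper, right, lower))
    return boxes

# Hypothetical usage with Pillow (not run here):
# from PIL import Image
# img = Image.open("waldo_map.jpg")          # img.size is (width, height)
# for i, box in enumerate(grid_boxes(*img.size, rows=4, cols=5)):
#     img.crop(box).save(f"tile_{i}.png")
```

Varying `rows` and `cols` per map (4x5 one time, 5x8 the next) is what multiplies the same 19 maps into a much larger set of labeled tiles.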
Step 2: Train, Test, More Data
After the initial training, I tested the model against a map not in the training set, featuring Waldo on a moon colony, naturally. Against the entire map, our model found... nothing. Well... a 1% chance of Waldo being present. Not an unexpected result, since so much of the image simply isn't Waldo — that's what makes playing these so fun/frustrating.
I decided to test it again using the same grid system as the training. Going from tile to tile, results started making more sense. 1-3% chance of Waldo on relatively empty tiles. 10-20% chance on busier ones. 30-40% chance on Waldo imposters (shakes fist). Then, a tile with 50% chance of Waldo! I scanned the tile anxiously. Boom. Waldo, hiding behind a crowd of people on the second floor of a biodome.
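The tile-by-tile scan amounts to scoring each tile and surfacing anything above a cutoff. A minimal sketch, where `predict_waldo` is a stand-in for whatever call returns the model's "waldo" probability for a tile (for me, a Clarifai custom-model predict request); it's passed in as a parameter here so the scanning logic stays self-contained:

```python
def scan_tiles(tiles, predict_waldo, threshold=0.5):
    """Score every tile and return (tile, probability) pairs at or
    above the threshold, highest probability first."""
    hits = []
    for tile in tiles:
        p = predict_waldo(tile)
        if p >= threshold:
            hits.append((tile, p))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

With the numbers above, empty tiles (1-3%) and imposters (30-40%) fall below the 50% cutoff, and the biodome tile is the lone hit.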
Step 3: More Data, Fewer Imposters
My next goal was to lower the percentages of the Waldo look-alikes being flagged as Waldo. I went back to the original maps and manually cropped out several hundred (somewhere between 500 and 1,000) people: some look-alikes, others with similar color schemes, others just random people. All to tell the model who wasn't Waldo.
The results: higher percentages on Waldo-positive tiles, lower percentages on Waldo-negative tiles, and fewer false positives on look-alikes.
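Curating negatives this way is essentially hard-negative mining: find the examples the model is most fooled by and feed them back as "not waldo" training data. A sketch under the same assumption as before, that some `predict_waldo` callable returns a probability per tile (the function name and thresholds are mine):

```python
def mine_hard_negatives(negative_tiles, predict_waldo, min_score=0.3, limit=None):
    """From tiles known NOT to contain Waldo, pick the ones the model
    scores highest, so they can be labeled 'not waldo' and retrained on."""
    scored = [(t, predict_waldo(t)) for t in negative_tiles]
    # Sort worst offenders first, keep only those above the cutoff.
    hard = [t for t, p in sorted(scored, key=lambda s: s[1], reverse=True)
            if p >= min_score]
    return hard[:limit]
```

Rather than cropping people at random, this targets exactly the 30-40% imposter tiles that were causing the false positives.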
Overall, though, the numbers still aren't great. The busier tiles still throw the model for a loop, Waldo predictions still aren't high enough, and false positives are still present. Oddly enough, it isn't the characters who would normally trick human eyes (the look-alikes, characters wearing stripes or glasses, etc.) that are producing the false positives, but something else I haven't been able to discern yet.
As I revisit this project, I believe my next step would be to find more Waldo-positive examples to counteract the inherent class imbalance of the data. That's the beauty of machine learning: the more examples you show, the better the results.
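Until more real positives turn up, one stopgap for the imbalance is to oversample the rare "waldo" tiles by duplication so each training pass sees them more often. A minimal sketch (function name and ratio are mine; this rebalances existing examples, it doesn't replace collecting new ones):

```python
import random

def oversample(positives, negatives, target_ratio=0.5, seed=0):
    """Duplicate positive examples (with replacement) until they make up
    target_ratio of the combined training set."""
    rng = random.Random(seed)  # fixed seed for reproducible resampling
    # Number of positives needed so positives / (positives + negatives) == target_ratio.
    need = int(target_ratio * len(negatives) / (1 - target_ratio))
    extra = [rng.choice(positives) for _ in range(max(0, need - len(positives)))]
    return positives + extra, negatives
```

With one Waldo tile per map and dozens of non-Waldo tiles, even this crude duplication shifts the ratio the model trains on, though more distinct positives would generalize better.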