A Harder Problem
At Famigo, we run many, many distinct, complex queries when it comes to recommending apps for families (e.g., give me the top 1000 puzzle games for young adults that are free on the Amazon App Store, sorted by user rating). Doing all these queries on demand proved to be a little bit slow (average query time was about a second), so I decided to cache the results of each distinct query for 8 hours. That's slightly more complex, but it wasn't like I was writing an Erlang compiler in Visual Basic.
Take 1 of cache implementation: Use memcached for the cache. Ten minutes later, curse memcached for not having an ordered datatype.
Take 2 of cache implementation: Use MongoDB for the cache. Many minutes later, celebrate success (prematurely).
What did our cached query results look like in MongoDB? Each document in the cache had a cache key (e.g., most-popular-puzzle-games-for-young-adults), an expiration date, an ordinal, and a reference to the application document that we wanted to render.
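To make that document shape concrete, here is a minimal sketch of how one cached result set might be turned into cache documents. The field names (cache_key, expires_at, ordinal, app_id) and the build_cache_docs helper are hypothetical, chosen for illustration; they are not the actual Famigo schema.

```python
from datetime import datetime, timedelta, timezone

# Results are cached for 8 hours, as described in the post.
CACHE_TTL = timedelta(hours=8)


def build_cache_docs(cache_key, app_ids, now=None):
    """Build one cache document per app in an ordered query result.

    Field names are assumptions for illustration only.
    """
    now = now or datetime.now(timezone.utc)
    expires_at = now + CACHE_TTL
    return [
        {
            "cache_key": cache_key,    # identifies the distinct query
            "expires_at": expires_at,  # expiration has to be managed by hand
            "ordinal": position,       # position of the app in the result
            "app_id": app_id,          # reference to the application document
        }
        for position, app_id in enumerate(app_ids)
    ]
```

Note that a single cached query becomes many documents, one per app, which is part of why reads needed both a filter and a sort.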
There were already hints that I was doing it wrong. Case in point: I had to manage all of the cache expiration myself. MongoDB's capped collections let you specify a maximum number of documents that a collection can store (which I was doing; I specified a max of 500k docs), but that's not at all the same thing as caching these results for exactly 8 hours. Speaking of which: hey 10gen, we want TTL collections!
Another sign I was doing it wrong: I had to do a lot of index tuning to make my interactions with the MongoDB cache fast. Every time I checked the cache, I had to specify the cache key, expiration date, and sort by the ordinal; for that to be fast, all of those needed to be covered by an index. While the index sped up my finds, it slowed down my inserts. I had a hell of a time finding the right balance.
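A rough sketch of what such a cache read and its supporting index might look like with pymongo-style calls follows. The field names, CACHE_INDEX, and fetch_cached are hypothetical. The key order in the index (equality field, then sort field, then range field) is one plausible choice based on MongoDB's general guidance for compound indexes, not necessarily what was used here. The collection argument is assumed to behave like a pymongo Collection.

```python
from datetime import datetime, timezone

ASCENDING = 1  # same value as pymongo.ASCENDING

# Hypothetical compound index; it would be created with
# collection.create_index(CACHE_INDEX) on a real pymongo collection.
CACHE_INDEX = [
    ("cache_key", ASCENDING),   # equality match
    ("ordinal", ASCENDING),     # sort order
    ("expires_at", ASCENDING),  # range match on expiration
]


def cache_lookup_filter(cache_key, now=None):
    """Filter for unexpired entries of one cached query."""
    now = now or datetime.now(timezone.utc)
    return {"cache_key": cache_key, "expires_at": {"$gt": now}}


def fetch_cached(collection, cache_key, now=None):
    """Return the unexpired cache documents for a query, in ordinal order."""
    cursor = collection.find(cache_lookup_filter(cache_key, now))
    return list(cursor.sort("ordinal", ASCENDING))
```

Every insert into the cache also has to update this index, which is the read/write trade-off described above.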
Unfortunately, I'm not yet done listing the signs that I wasn't doing it right. You can't delete from a MongoDB capped collection. That's no problem if you're just collecting logs, but from time to time, we must invalidate our cache. Since I couldn't delete these documents from the cache, I had to add another field that stored an Active status, which also required an index, since we had to query by it every time.
How Did It Work? (Spoiler Alert: Not So Great.)
We ended up running in production on my MongoDB app query cache for a month or two. It was definitely faster than performing all of the complex queries in real time (~300ms instead of 1s), but there was a new delay when we had to add results to the cache (~200ms). As both app data and users scaled up by an order of magnitude, it was clear that this would just burst into flames at some point.
A New Solution Emerges!
I decided to try something new. I knew that Redis had a sorted set datatype, so I started to play with that. Rather than cache these app query results in MongoDB, I created a sorted set of app ids for each query. I let Redis handle all of the cache expiration business by setting a TTL value for each key. When I wanted to pull from the cache, I fetched the app ids from Redis, then did a find in MongoDB using the $in operator with all of those ids. Since $in doesn't return documents in the order of the ids you pass it, I then reordered the results in Python based on the app ordering in Redis. I knew it wasn't as pretty, but was it effective?
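The flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the production code: cache_query_result and get_cached_apps are hypothetical names, redis_client is assumed to behave like a redis-py client (zadd with a mapping, expire, zrange, delete), apps_collection like a pymongo Collection, and app ids are assumed to be stored as strings in the _id field.

```python
CACHE_TTL_SECONDS = 8 * 60 * 60  # 8 hours, as in the post


def cache_query_result(redis_client, cache_key, app_ids):
    """Store an ordered list of app ids as a sorted set with a TTL."""
    if not app_ids:
        return
    # The score is the app's position, so the set preserves result order.
    scores = {app_id: rank for rank, app_id in enumerate(app_ids)}
    redis_client.delete(cache_key)  # drop any stale version of this key
    redis_client.zadd(cache_key, scores)
    redis_client.expire(cache_key, CACHE_TTL_SECONDS)  # Redis handles expiry


def get_cached_apps(redis_client, apps_collection, cache_key):
    """Return app documents in cached order, or None on a cache miss."""
    raw_ids = redis_client.zrange(cache_key, 0, -1)  # ascending by score
    if not raw_ids:
        return None
    # Redis clients may return bytes; normalize to str (assumed id type).
    app_ids = [i.decode() if isinstance(i, bytes) else i for i in raw_ids]
    # $in does not guarantee ordering, so index the documents by id...
    docs_by_id = {
        doc["_id"]: doc
        for doc in apps_collection.find({"_id": {"$in": app_ids}})
    }
    # ...and rebuild the list in the order Redis gave us.
    return [docs_by_id[i] for i in app_ids if i in docs_by_id]
```

Invalidation in this sketch is just deleting the key, so there is no Active flag to maintain and no extra index to tune.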
For my first test, I merely timed how long it took to add a few hundred results to my Redis-backed cache. With the MongoDB cache, that regularly took around 200ms; with Redis, it was down to 1 or 2ms. Impressive... but then that should be fast. I refused to be impressed until I started pulling from the cache.
Was it faster to pull the app ids from Redis, use them to pull the documents from MongoDB, then use Python to reorder everything? Actually, yes. Thus far, reading from the cache takes about a third of the time that it did before. Meanwhile, adding to the cache is essentially free.
How Not to Do MongoDB, or Any Other Datastore
It turns out that, technically, I was correct. I could use MongoDB as a key-value store for caching, much like I could use my Mazda 3 as an amphibious assault vehicle. In practice, neither would be optimized for those use cases.
A key part of determining your architecture is understanding the strengths and weaknesses of your technology choices. The primary strength of MongoDB is how it allows you to simplify and decouple your data modeling via document-orientation. What about Redis? Its primary strength is how it enables very fast access to a few key data structures, like sets and dictionaries. With both of those stated, it becomes clear in which situations you can combine MongoDB and Redis to build delightful software.