Milk my Caching for all it’s worth
One of our big challenges at Yipit as an aggregator has been weaning ourselves off of all of those dastardly MySQL Joins. As Uncle Ben once warned Peter Parker, “With great power comes interminably long queries”.
Fortunately, because our workload skews heavily towards reads, we’ve had success implementing various caching strategies across the stack. On the application level, we’ve leveraged Django’s built-in caching framework to enable site-wide caching for anonymous users, view level caching for authenticated pages (especially useful for our API), and ad-hoc caching for aggregate queries.
Lately, we’ve begun to dive more deeply into the holy grail of Django data retrieval: cached QuerySets. Conveniently, there are a number of quality libraries available to facilitate this sort of thing, including Johnny Cache, Django Cache-Bot, and Django Cache Machine. We’ve decided to go with Cache Machine for the foreseeable future thanks to its dead-simple integration, its commonsense handling of invalidation (including through Foreign Key relationships), useful ancillary features such as caching RawQuerySets and QuerySet counts, and its easy extensibility.
A Quick Recap of How Cache Machine Works
Cache Machine stores full model QuerySets as key-value pairs in one of three backends: Memcached, Locmem, or Redis. The key is generated by hashing the underlying raw SQL for a given query, while the value is built by iterating through the entire QuerySet and extracting field values for each object. On a storage level, Cache Machine extends the built-in Django caching backend to enable infinite cache timeouts. While generally an awesome feature, this makes intelligent invalidation critical.
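As a rough illustration of that key/value scheme (not Cache Machine's actual code), the sketch below hashes a SQL string to build a key and stores the evaluated rows under it. The `make_query_key` and `cache_rows` helpers, the `qs:` prefix, and the plain dict standing in for the cache backend are all assumptions made for the example.

```python
import hashlib

# A plain dict stands in for Memcached/Locmem/Redis in this sketch.
cache = {}


def make_query_key(sql):
    """Build a cache key by hashing the raw SQL text (hypothetical helper)."""
    digest = hashlib.md5(sql.encode("utf-8")).hexdigest()
    return "qs:" + digest


def cache_rows(sql, rows):
    """Store the fully evaluated rows under the key derived from the SQL."""
    key = make_query_key(sql)
    cache[key] = list(rows)
    return key


sql = "SELECT id, key_data FROM cacheit WHERE id < 3"
key = cache_rows(sql, [(1, "a"), (2, "b")])
```

Because the key depends only on the SQL text, two QuerySets that compile to the same statement share one cache entry, while any change to the statement produces a different key.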
To ensure that cached QuerySets represent (mostly) consistent views of the underlying model data, Cache Machine ties each cache key to a flush list of associated objects, including Foreign Key relations. For any given object, post-save and post-delete Django signals (hooked in through the Manager class) are responsible for invalidating all related cache keys via their respective flush lists.
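The flush-list idea can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not the library's implementation: the `object_key` format, the `cache_query` and `invalidate` helpers, and the in-memory dicts are all hypothetical, and `invalidate` stands in for what a post-save or post-delete signal handler would do.

```python
from collections import defaultdict

cache = {}                      # query key -> cached result
flush_lists = defaultdict(set)  # object key -> query keys to drop on change


def object_key(model_name, pk):
    """Name one model instance (hypothetical key format)."""
    return "o:%s:%s" % (model_name, pk)


def cache_query(query_key, objects, related=()):
    """Cache a result and register it on each object's flush list."""
    cache[query_key] = list(objects)
    for model_name, pk in list(objects) + list(related):
        flush_lists[object_key(model_name, pk)].add(query_key)


def invalidate(model_name, pk):
    """What a post-save or post-delete handler would do for one object."""
    for query_key in flush_lists.pop(object_key(model_name, pk), set()):
        cache.pop(query_key, None)


cache_query("q1", [("cacheit", 1), ("cacheit", 2)],
            related=[("relatedstuff", 7)])
cache_query("q2", [("cacheit", 2)])
invalidate("relatedstuff", 7)  # drops q1 through the FK relation, leaves q2
```

Registering the related object on the same flush list is what lets a change on the far side of a Foreign Key clear the query that depended on it.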
Setting up Cache Machine in your Django Project
Adding Cache Machine to your app is ridiculously easy. Just subclass a Mixin in your model definition and set the default manager to the library’s CachingManager.
    from django.db import models

    from caching.base import CachingManager, CachingMixin


    class CacheIt(CachingMixin, models.Model):
        key_data = models.CharField(max_length=30)
        related_stuff = models.ForeignKey('related.RelatedStuff')

        objects = CachingManager()

Yep, it’s that wonderfully simple. Under the hood, the CachingManager returns a custom QuerySet (CachingQuerySet) which wraps the caching functionality around the core Django QuerySet. It does so by overriding the iterator() method.
Rather than simply iterating through the QuerySet, the CachingQuerySet iterator() method wraps the work in a generator (via the CacheMachine class). That generator either yields objects straight from the cache or, on a miss, pulls objects from the SQL cursor and writes them to the cache once the iterable is completely exhausted and StopIteration is raised.
For best performance, the library recommends that the CachingManager be set as the default model manager. This enables caching for related models reached through an instance (e.g. CacheIt.objects.all()[0].related_stuff). However, if you so choose, you can add a non-default manager so long as its get_query_set() method returns a CachingQuerySet object. All things being equal, it’s obviously desirable to allow for caching FK objects.
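The read-through behavior can be approximated with an ordinary generator. The sketch below is a simplified stand-in, not the CacheMachine class itself: `CachedIterator`, the dict-based `cache`, and the `fetch_rows` function are names invented for the example.

```python
cache = {}  # stand-in for the real cache backend


class CachedIterator(object):
    """Yield from the cache on a hit; otherwise yield from the source and
    store the results only after the source has been fully consumed."""

    def __init__(self, query_key, iter_function):
        self.query_key = query_key
        self.iter_function = iter_function

    def __iter__(self):
        cached = cache.get(self.query_key)
        if cached is not None:
            for obj in cached:
                yield obj
            return
        to_cache = []
        for obj in self.iter_function():
            to_cache.append(obj)
            yield obj
        # Reached only when the source iterator is exhausted.
        cache[self.query_key] = to_cache


calls = []


def fetch_rows():
    """Pretend database cursor; records how often it is opened."""
    calls.append(1)
    return iter([1, 2, 3])


first = list(CachedIterator("k", fetch_rows))   # miss: reads the "cursor"
second = list(CachedIterator("k", fetch_rows))  # hit: served from the cache

partial = iter(CachedIterator("k2", fetch_rows))
next(partial)  # consumer stops early, so nothing is written for "k2"
```

Deferring the write until exhaustion means a partially consumed result is never cached as if it were complete.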
Extending Cache Machine for Yipit
We love that Cache Machine just works right out of the box. There are, however, a couple of major issues that we had to account for prior to pushing this library live. Our biggest concern was that cache invalidation only applies to objects already present in the original QuerySet. Saving or deleting existing instances will invalidate a given query key; however, creating a new model instance will not. Calling update() on a QuerySet also fails to invalidate the appropriate cache key.
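To make the gap concrete, here is a toy reproduction in plain Python. It is only a sketch of the behavior described above: the `table` dict plays the database, and `all_rows` and `save` are hypothetical helpers rather than anything from Cache Machine.

```python
cache = {}        # query key -> list of primary keys in the cached result
flush_lists = {}  # primary key -> query keys to drop when that row changes
table = {1: "a", 2: "b"}  # toy "database": pk -> value


def all_rows():
    """Return every pk, reading from the cache when possible."""
    if "all" not in cache:
        cache["all"] = sorted(table)
        for pk in cache["all"]:
            flush_lists.setdefault(pk, set()).add("all")
    return cache["all"]


def save(pk, value):
    """Write a row, then invalidate queries that already contained this pk."""
    table[pk] = value
    for query_key in flush_lists.pop(pk, set()):
        cache.pop(query_key, None)


first = all_rows()  # caches the result for pks 1 and 2
save(3, "c")        # new row: no flush list mentions pk 3
stale = all_rows()  # still served from the cache, so pk 3 is missing
save(1, "A")        # existing row: its flush list drops the "all" key
fresh = all_rows()  # re-read from the table, now including pk 3
```

The new row only shows up once something already in the cached result is saved, which is the "eventually consistent" behavior in miniature.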
This was an intentional choice by the library author and, in many cases, it promotes acceptable behavior. The idea here is that data will, for the most part, become eventually consistent through either active model saving or through culling of data on the storage level as cache memory becomes saturated.
In certain cases, though, this sort of behavior is less palatable. At Yipit, our data has variable time sensitivity and expense of retrieval. We wanted the flexibility to pick and choose which models to cache (as well as the duration for each). With that in mind, we decided to stick to the theme of a single default manager which returns a custom QuerySet. The big difference is that our QuerySet class only conditionally hits the cache. Our code looks like the following:
    from django.db import models
    from django.db.models import query

    # Assumption: CacheMachine and FETCH_BY_ID are importable from
    # caching.base; adjust for your Cache Machine version.
    from caching.base import (CacheMachine, CachingManager, CachingMixin,
                              CachingQuerySet, FETCH_BY_ID)


    class CachedQuerySet(CachingQuerySet):

        def __init__(self, *args, **kwargs):
            new_timeout = kwargs.pop('timeout', None)
            self._retrieve_from_cache = kwargs.pop('default_from_cache', False)
            super(CachedQuerySet, self).__init__(*args, **kwargs)
            self.timeout = new_timeout

        def _clone(self, *args, **kw):
            # Call the parent's _clone (not its grandparent's) so that
            # CachingQuerySet's own clone logic still runs.
            qs = super(CachedQuerySet, self)._clone(*args, **kw)
            qs._retrieve_from_cache = self._retrieve_from_cache
            return qs

        def from_cache(self, **kwargs):
            self._retrieve_from_cache = True
            return self

        def skip_cache(self, **kwargs):
            self._retrieve_from_cache = False
            return self

        def iterator(self):
            from_cache = self._retrieve_from_cache
            # Deliberately skip CachingQuerySet.iterator to reach Django's own.
            iterator = super(CachingQuerySet, self).iterator
            if not from_cache:
                return iter(iterator())
            else:
                try:
                    # Work-around for Django #12717.
                    query_string = self.query_key()
                except query.EmptyResultSet:
                    return iterator()
                if FETCH_BY_ID:
                    iterator = self.fetch_by_id
                return iter(CacheMachine(query_string, iterator, self.timeout))


    class CachedManager(CachingManager):

        def __init__(self, *args, **kwargs):
            self.timeout = kwargs.pop('cache_timeout', None)
            self.default_from_cache = kwargs.pop('default_from_cache', False)
            super(CachedManager, self).__init__(*args, **kwargs)

        def __getattr__(self, name):
            # Forward unknown attributes to the QuerySet to allow chaining.
            return getattr(self.get_query_set(), name)

        def get_query_set(self):
            return CachedQuerySet(self.model, timeout=self.timeout,
                                  default_from_cache=self.default_from_cache)


    # Defined last so that CachedManager already exists when it is used.
    class OurModel(CachingMixin, models.Model):
        data = models.IntegerField()

        objects = CachedManager(default_from_cache=True, cache_timeout=1200)
The CachedQuerySet class overrides the CachingQuerySet iterator() method to check a flag (“from_cache”) that determines whether the given query should hit the cache. This flag comes from the private QuerySet attribute _retrieve_from_cache, which is first set in the __init__() magic method and can later be overridden by the from_cache() and skip_cache() methods. Finally, it is copied in the private _clone() method (_clone() is called whenever a QuerySet method such as filter() returns a new QuerySet, so you’ll need to carry the attribute over here as well).
Hitting the cache can be set as the default behavior for a given QuerySet by setting the “default_from_cache” keyword argument to True when initializing the QuerySet. This initialization occurs in the get_query_set() method of the CachedManager. You may also set the default timeout for the QuerySet in this method, which is something that we have also taken advantage of on a per-model basis.
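The reason the flag has to be copied on clone is easier to see outside of Django. The following is a minimal sketch with a made-up `ToyQuerySet` class; it assumes only that every chained call builds a fresh object, which is how Django QuerySets behave. Unlike our real from_cache() and skip_cache(), which flip the flag in place, the toy versions return a clone so the example can show the original staying untouched.

```python
class ToyQuerySet(object):
    """Minimal stand-in for a QuerySet; not Django's class."""

    def __init__(self, rows, retrieve_from_cache=False):
        self.rows = list(rows)
        self._retrieve_from_cache = retrieve_from_cache

    def _clone(self):
        # Every chained call builds a new object, so the flag must be
        # copied here or it silently resets to the default.
        return ToyQuerySet(self.rows, self._retrieve_from_cache)

    def filter(self, predicate):
        clone = self._clone()
        clone.rows = [row for row in clone.rows if predicate(row)]
        return clone

    def from_cache(self):
        clone = self._clone()
        clone._retrieve_from_cache = True
        return clone

    def skip_cache(self):
        clone = self._clone()
        clone._retrieve_from_cache = False
        return clone


base = ToyQuerySet([1, 2, 3, 4])
cached_evens = base.from_cache().filter(lambda row: row % 2 == 0)
uncached = cached_evens.skip_cache()
```

If `_clone` dropped the flag, `cached_evens` would quietly fall back to the non-caching default after the `filter` call.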
At the end of the day, we can decide whether we want all QuerySet methods cached for a particular model within a single line:
objects = CachedManager(default_from_cache=True, cache_timeout=1200)
Alternatively, we could have created a separate manager here for caching; however, handling it in the QuerySet propagates the caching more quickly throughout the existing code base and, more importantly, offers the nice advantage of chaining. By defining the __getattr__() magic method in the CachedManager, you can effectively handle all your lazy chaining needs (see this post by Zach Smith on this awesome Django tip).
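The delegation trick is plain Python and can be tried without Django. In this sketch `ToyManager` and `ToyQuerySet` are invented stand-ins, and the guard on underscore-prefixed names is an extra precaution we are assuming here (it is not in our manager above) to keep copy and pickle lookups from being forwarded.

```python
class ToyQuerySet(object):
    """Stand-in QuerySet whose custom methods each return a new QuerySet."""

    def __init__(self, rows):
        self.rows = list(rows)

    def evens(self):
        return ToyQuerySet(row for row in self.rows if row % 2 == 0)

    def above(self, limit):
        return ToyQuerySet(row for row in self.rows if row > limit)


class ToyManager(object):
    """Stand-in manager that forwards unknown attributes to its QuerySet."""

    def __init__(self, rows):
        self._rows = rows

    def get_query_set(self):
        return ToyQuerySet(self._rows)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self.get_query_set(), name)


manager = ToyManager([1, 2, 3, 4, 5, 6])
result = manager.evens().above(2)  # evens() is found on the QuerySet
```

The manager itself defines neither `evens` nor `above`, yet both chain naturally because the first call is forwarded and every later call already lands on a QuerySet.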
Remember to Select_Related
The big downside to this method is that QuerySets with non-caching defaults will not allow for FK object caching. To get around this issue, make sure to explicitly call the select_related() QuerySet method for models with FK relationships which you wish to traverse. Django will force potentially evil (time-wise) Joins here to collect the related data. Fortunately, you’ll be able to cache this result set for lightning-fast future access.
Future Plans for QuerySet Caching
While we think that this is a good start for our internal QuerySet caching needs, there’s still a lot for us to do. Rather than conditionally caching certain queries and models, we plan to explore invalidation techniques for updated and newly created object instances. We hope you’ll tune in for those future updates!