A couple of weeks ago I wrote about the dangers of bad cache design. Today I've been troubleshooting a production-down case which had a fair number of issues related to how the cache was used.
The situation was as follows. An update to the codebase was deployed and caused performance issues, so it was rolled back, yet the problem remained. This is a very common case: the customer tells you everything is the same as it was yesterday… but it does not work today.
When I hear these words I like to tell people that computers are state machines and they work in a predictable way. If the system does not work the same today as it worked yesterday, something was changed… it is just that you may not recognize WHAT was changed. It may be something as subtle as a change in a query plan or an increase in search engine bot activity. It may be the RAID writeback cache being disabled due to battery learning, but there must be something. This is actually where Trending often comes in handy – graphs will often expose which metrics became different; they just need to be looked at.
So back to this case… MySQL was getting overloaded with thousands of identical queries, which corresponded to a cache miss storm. But why was it not a problem before? The answer lies in caching as well. When the software is deployed, memcache is cleared to avoid potential issues with mismatched cache content, so the system has to start with a cold cache, which overloads it, and it never recovers. With an expiration-based cache you increase the chance of a condition in which the system will not gradually recover by populating the cache: if cache misses make performance so bad that new items are populated into the cache more slowly than existing items expire, you may never get the system warmed up.
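One common mitigation for this kind of miss storm is to make sure only one client recomputes an expired item while the others back off. Below is a minimal sketch of that idea in Python, assuming the pymemcache client; load_from_database() is a hypothetical stand-in for the expensive MySQL query behind the cached item.

```python
# Minimal dogpile (cache miss storm) prevention sketch using a short-lived
# lock key in memcache. Assumes pymemcache; load_from_database() is a
# hypothetical placeholder for the real query.
import time
from pymemcache.client.base import Client

cache = Client(("127.0.0.1", 11211))

def load_from_database(key):
    return b"recomputed value"  # stand-in for the expensive MySQL query

def get_with_lock(key, ttl=300, lock_ttl=10):
    value = cache.get(key)
    if value is not None:
        return value
    # add() succeeds only for the first client to miss the key, so a single
    # request recomputes the item instead of thousands hitting MySQL at once.
    if cache.add("lock:" + key, b"1", expire=lock_ttl, noreply=False):
        value = load_from_database(key)
        cache.set(key, value, expire=ttl)
        cache.delete("lock:" + key)
        return value
    # Lost the race: wait briefly and re-check the cache before falling
    # back to the database.
    time.sleep(0.1)
    return cache.get(key) or load_from_database(key)
```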
But wait again… was this really the first change? Had the code never been updated before? Of course it had. As is often the case with serious failures, there is more than one cause pushing the system over the top. During a normal deployment the code change is done at night when the traffic is low, so even if the system has higher load and worse response time for several minutes after the code is updated, the traffic is not high enough to push it into a state it cannot recover from. This time the code update was not successful, and by the time the rollback was completed the traffic was already high enough to cause problems.
So the immediate solution to bring the system up was surprisingly simple. We just had to put traffic on the system in stages, allowing Memcache to warm up. There was no code on the application side that would allow us to do this, so we did it on the MySQL side instead: “SET GLOBAL max_connections=20” limits the number of connections to MySQL and lets the application error out when it tries to put too much load on the database. As the MySQL load stabilizes, you raise the number of connections step by step until you can finally serve all traffic without problems.
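For illustration, here is roughly what that staged ramp-up could look like as a script instead of manual statements. This is a sketch assuming the mysql-connector-python driver; the connection limits and wait times are illustrative, not what we actually ran.

```python
# Staged traffic ramp-up by raising max_connections, a sketch assuming
# mysql-connector-python; limits and intervals are illustrative.
import time
import mysql.connector

conn = mysql.connector.connect(host="127.0.0.1", user="root", password="secret")
cur = conn.cursor()

for limit in (20, 50, 100, 250, 500, 1000):
    cur.execute("SET GLOBAL max_connections = %d" % limit)
    print("max_connections raised to", limit)
    # Give Memcache time to warm up at this level; in the real incident we
    # watched load and response times before taking the next step.
    time.sleep(300)
    cur.execute("SHOW GLOBAL STATUS LIKE 'Threads_running'")
    print(cur.fetchone())  # eyeball MySQL concurrency before continuing
```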
So what can we learn from this, besides the cache design issues I mentioned in the previous post?
Include Rollback in the Maintenance Window Ensure you plan the maintenance window long enough that you can do a rollback inside this window, and do not hesitate to do that rollback if you're running out of time. Know how long the rollback takes and have it well prepared. Way too often I see people trying to make things work until the time allocated for the operation is up, and then the rollback has to be done outside of the allowed time window.
Know your Cold Cache Performance and Behavior Know how your application behaves with a cold cache. Does it recover, or does it just die under high traffic? How high is the response time penalty, and how long does it take to reach normal performance?
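One way to learn this before an incident is to measure it in a test environment: flush memcache, replay some traffic, and watch how long response times take to recover. A rough sketch, assuming pymemcache and the requests library; the URLs are hypothetical.

```python
# Observe cold-cache recovery: flush memcache, then time page loads as the
# cache warms back up. Assumes pymemcache and requests; URLs are hypothetical.
import time
import requests
from pymemcache.client.base import Client

Client(("127.0.0.1", 11211)).flush_all()  # start from a fully cold cache

urls = ["http://test-app.local/", "http://test-app.local/popular-page"]
for minute in range(10):
    for url in urls:
        start = time.time()
        requests.get(url)
        print("minute %d %s %.3fs" % (minute, url, time.time() - start))
    time.sleep(60)
```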
Have a way to increase traffic gradually There are many reasons beyond caching why you may want to slowly ramp up the traffic on a system, so make sure you have some means to do that. I'd recommend doing it per user session, so some users are admitted and can use the system completely while others have to wait for their turn to get in. That is a lot better than doing it on a per-page basis, where users randomly get error messages on some pages. In some cases you can also ramp up feature by feature.
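For illustration, per-session admission can be as simple as hashing the session id into a bucket and comparing it against the current ramp percentage; a stable hash means a given user is either fully in or fully out. A minimal sketch, with the handler integration left out.

```python
# Session-based admission control sketch: a stable hash of the session id
# decides whether a user is let in, so admitted users see the whole site.
import zlib

RAMP_PERCENT = 10  # raise in stages (10, 25, 50, 100) as the cache warms up

def is_admitted(session_id: str) -> bool:
    # crc32 gives the same bucket for the same session every time, unlike
    # random(), so users are not bounced in and out between requests.
    bucket = zlib.crc32(session_id.encode()) % 100
    return bucket < RAMP_PERCENT
```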
Consider Pre-Priming Caches In some cases, when cold-cache performance means response times are too bad, you may want to prime the caches by running or replaying some production workload on the system before it is put online. That way all the ramping up and suffering through bad response times is done by a script… which does not care.
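Such a primer can be as simple as replaying the most popular requests from an access log against the system before it takes live traffic. A sketch assuming the requests library and a common-log-format access log; the file path and hostname are hypothetical.

```python
# Cache pre-priming sketch: replay the most popular URLs from an access log
# so the script, not the users, absorbs the slow cold-cache responses.
# Assumes a common-log-format file; path and hostname are hypothetical.
from collections import Counter
import requests

counts = Counter()
with open("access.log") as f:
    for line in f:
        parts = line.split()
        if len(parts) > 6:
            counts[parts[6]] += 1  # request path field in common log format

for path, _ in counts.most_common(1000):
    try:
        requests.get("http://standby-app.local" + path, timeout=30)
    except requests.RequestException:
        pass  # a failed priming request is not fatal
```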