In the nearly two decades that I’ve been pushing code into production systems, I’ve made a lot of mistakes. More interestingly, I’ve seen many people around me avoiding those same mistakes, sharing their learnings, and in some cases, putting their ideas and practices together with others to create trends that have become well-known and popular.
One of those trends is Extreme Programming. We chuckle now, but compared to what I was doing before, XP was amazing. Writing tests first, pair programming on everything: it was not only better than the slow, lonely, frustrating practices I’d used before, but also a lot more fun. More recently, of course, there are lots of paradigms designed to help us manage the imperfection of our craft in various ways: DevOps, Continuous Delivery, Continuous Integration, Blue-Green Deploys, Canaries, Microservices, you name it.
One of the enduring practices I’ve loved over the years, and which is more feasible than ever today due to the easy availability of metrics collection systems, is "See What Changes When You Deploy." It’s not a catchy name, but it sure works.
In this non-radical and unsurprising method, you write your code, do all the usual testing and verification and so on, and then you deploy it. Then, you wait a minute or so (when you have one-second metrics, you can see results really quickly) and see what’s changed on your systems.
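As a rough illustration, the "deploy, wait a minute, look" step can be sketched in a few lines of Python. Everything here is an assumption for the sake of the example: the function name `deploy_delta`, the shape of the samples (a list of timestamp and value pairs, one per second), and the 60-second window. It is not how any particular metrics system does it.

```python
from statistics import mean


def deploy_delta(samples, deploy_ts, window=60):
    """Compare a metric's mean in the `window` seconds before and after a deploy.

    `samples` holds (unix_timestamp, value) pairs, e.g. one per second.
    All names and defaults are hypothetical, chosen for illustration.
    """
    samples = list(samples)  # allow generators; we scan the data twice
    before = [v for t, v in samples if deploy_ts - window <= t < deploy_ts]
    after = [v for t, v in samples if deploy_ts <= t < deploy_ts + window]
    if not before or not after:
        raise ValueError("need samples on both sides of the deploy")

    b, a = mean(before), mean(after)
    if b == 0:
        # Avoid dividing by zero when the metric was flat at zero before.
        change = 0.0 if a == 0 else float("inf")
    else:
        change = (a - b) / b
    return {"before": b, "after": a, "relative_change": change}
```

Feeding it one minute of samples at 10.0 followed by one minute at 15.0, with the deploy at the boundary, reports a relative change of 0.5, which is the kind of shift you would want to go and look at.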
It seems too simple to be any good, but it works like magic. The reason this simple approach is still valuable and needed (despite all the other things you’re also doing that are more formal and “feel more rigorous”) is that old truism: code is complex, and it’s hard to predict what’ll change and how. It’s the same reason we’ve invented debuggers: reasoning about the effects of your change is great, but nothing beats actually observing it. It’s also in the spirit of the old Knuth quote that premature optimization is the root of all evil: programs are sufficiently complex that predicting what’s important to optimize is hard.
Surely, there should be a way to automate this, right? Okay, maybe. Anomaly detection, mean shift detection, those types of things. Sure. But speaking from experience, most systems (especially at one-second resolution) are very noisy, and there is a ton of normal complexity happening all the time. If you look at everything that appears abnormal around the time you deploy, you’ll get a lot of false positives. Instead, look for changes in the most important portions of the system’s workload.
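One way to picture "look only at the important part of the workload" is the sketch below. It assumes you already have total execution time per query for a window before and a window after the deploy. The function name, the `top_n` cutoff, and the 25% threshold are all hypothetical choices, not recommendations.

```python
def heavy_hitter_changes(before, after, top_n=10, min_change=0.25):
    """Report large shifts only among the queries that dominate the workload.

    `before` and `after` map a query name to its total execution time
    (seconds) in the window before and after the deploy. Names and
    thresholds are illustrative assumptions.
    """
    names = set(before) | set(after)
    combined = {q: before.get(q, 0.0) + after.get(q, 0.0) for q in names}
    # Rank by combined total time; break ties by name for a stable order.
    top = sorted(names, key=lambda q: (-combined[q], q))[:top_n]

    flagged = []
    for q in top:
        b, a = before.get(q, 0.0), after.get(q, 0.0)
        if b == 0.0:
            # A query with no time before the deploy is new (or still idle).
            change = float("inf") if a > 0.0 else 0.0
        else:
            change = (a - b) / b
        if abs(change) >= min_change:
            flagged.append((q, b, a, change))
    return flagged
```

The point of the design is what it ignores: a rarely run query whose time quintuples never makes the list unless it is big enough to rank among the heavy hitters, which cuts out much of the noise.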
At VividCortex we do this constantly when we change things: we deploy to staging, open up the key performance dashboards, and look at the heavy hitters: Top Queries by total time, frequency, error count, and so on. Anything significant will show up here as an instantly recognizable visual shift in a sparkline.
It helps that we built this capability into VividCortex itself, but you don’t need VividCortex specifically for this; you can do it with other systems too, as long as they capture the metrics you’re interested in. Most general-purpose time-series graphing systems these days support some way to time-shift a metric, so you can overlay “the last one hour” with “one hour, starting two hours ago.” That’s all you need to get started with the basics. (If your metrics system doesn’t support this, you should do something about that.)
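If you are curious what a time shift amounts to, here is a minimal sketch, assuming the metric is held in memory as a plain dictionary keyed by integer Unix timestamp. The function name `overlay` and its parameters are made up for this example; a real graphing system would expose its own time-shift function.

```python
def overlay(series, now, window=3600, shift=3600):
    """Pair each point in the latest `window` with the point `shift` seconds earlier.

    `series` maps a unix timestamp (int seconds) to a metric value.
    Returns (offset_into_window, current_value, shifted_value) tuples;
    a value is None when no sample exists for that second.
    The data layout and names are assumptions for illustration.
    """
    start = now - window
    rows = []
    for t in range(start, now):
        rows.append((t - start, series.get(t), series.get(t - shift)))
    return rows
```

With the defaults, each row lines up a second from the last hour with the same second from the hour before it, so plotting the second and third columns against the first gives the overlay described above.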
This capability is hugely popular with our customers. It was originally a separate feature called Compare Queries that essentially showed two views of Top Queries for two time ranges, one above the other. We’ve recently broadened support for this to “anything you can rank and sort in the Profiler,” which is a much larger set of metrics.
The most important things to look at in the database tier are heavy-hitter queries, so to get the benefit of this approach in the database tier specifically, you’ll need to capture per-query metrics at high resolution. The most useful are total time, latency, frequency, and error rate (all at a per-query granularity). This is especially important if you’re using any type of ORM or other intermediate layer between your code and the database because ORMs and other types of action at a distance can be really difficult to anticipate; they may write really bad queries that you’re not aware of as a programmer.
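To make those per-query metrics concrete, here is a small sketch that rolls individual query executions up by query. It assumes each event arrives as a tuple of query fingerprint (the query text with literals replaced by placeholders), latency in seconds, and an error flag. The names `QueryStats` and `profile` and the event format are hypothetical and do not describe how VividCortex or any other tool captures this data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class QueryStats:
    count: int = 0           # executions in the window (divide by window length for frequency)
    total_time: float = 0.0  # summed latency, in seconds
    errors: int = 0

    @property
    def avg_latency(self):
        return self.total_time / self.count if self.count else 0.0

    @property
    def error_rate(self):
        return self.errors / self.count if self.count else 0.0


def profile(events):
    """Aggregate (query_fingerprint, latency_seconds, had_error) events per query."""
    stats = defaultdict(QueryStats)
    for fingerprint, latency, had_error in events:
        s = stats[fingerprint]
        s.count += 1
        s.total_time += latency
        if had_error:
            s.errors += 1
    # Rank by total time so the heavy hitters come first.
    return sorted(stats.items(), key=lambda kv: kv[1].total_time, reverse=True)
```

Running `profile` once over a window before the deploy and once over a window after it gives two ranked lists you can compare side by side, including any queries an ORM generated that nobody wrote by hand.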
Deploying to non-production systems first and examining before-and-after there (before deploying to production) is a powerful technique that prevents a ton of problems from ever making it into production, helps engineers more deeply understand their systems and the consequences of changing them, and results in faster, more confident, higher-performance deploys. In other words: shipping more code to production, more often, faster, and with better results. What’s not to love?
“How do people push code without graphs? So grateful for all of the infrastructure wizards at this company that make my life less terrifying” (dan, @dxna, August 9, 2016)
You can request a free trial today to see VividCortex's time comparison feature in action.
P.S. Tangentially, another “best practice” we espouse is giving all developers access to performance insights in both production and non-production environments. Developers are smart. They don’t need DBAs to interpret what’s going on with database performance if they have convenient access to do it themselves.