“At the end of each sprint.” “4 times a day.” “Every 6 minutes.” Pick a handful of blog posts or presentations, and you could easily come to the conclusion that Continuous Delivery is all about speed, speed, speed. Sure, if you are releasing twice a year and your competitor is pushing new features every week, you will almost certainly be under pressure to figure out how you can deliver faster. By focusing too heavily on acceleration, however, we risk forgetting that there is no point dumping features into production in double-quick time if the system ends up breaking on a regular basis. Instead of talking about releasing software faster, we should be aiming to release software better…and time-to-delivery is only one of the metrics that we should be taking into account.
Of course, whenever the phrase “get features into production faster” is mentioned, there is usually an implicit assumption that we are making sure that the feature is actually working. It’s true that testing is still often seen as the “poor cousin” of development, and QA as “the department of ‘no’” rather than as essential members of the team. But pretty much every Continuous Delivery initiative that gets off the ground quickly realizes that automated testing—and, above and beyond that, early automated testing—is essential to building trust and confidence in the delivery pipeline.
In many cases, however, automated testing means code-level unit and integration testing, with only limited functional testing being carried out against a fully deployed, running instance of the system. Functional testing is obviously important to validate that whatever functionality is being added actually behaves in the intended fashion, and that related use cases “still work.” However, as teams progress along the Continuous Delivery path of making smaller, more frequent releases with fewer functional changes, the emphasis of the pipeline’s “quality control” activities needs to shift to follow suit: building confidence that the system will continue to behave and scale acceptably after the change is just as important as, and perhaps more important than, demonstrating that a small handful of new features work as planned.
It’s at this point that automated functional testing proves insufficient, and we need to start adding automated stress and performance testing to the pipeline. More specifically: given the nature of today’s highly distributed systems, we need to see a fully deployed, running instance of the system in action to see how its behavior has (or has not) changed in somewhat representative multi-user scenarios. As long as the system doesn’t completely fall over, functional test results alone will not uncover problematic patterns that should stop our pipeline in its tracks.
To give but one real-world example: as part of a relatively minor feature change to only show the contents of a shopping cart if requested by the user, the data access logic was changed to load the item details lazily, and modified in a way that resulted in one database call per item in the cart, rather than a single call for the whole cart. All functional tests still passed, since the impact of the additional database calls was not significant in the single- or low-user scenarios exercised by the functional tests. Without automated stress or performance testing in the pipeline, the change was released to production and quickly caused an expensive production outage.
Of course, adding automated stress and performance tests to a Continuous Delivery pipeline remains highly non-trivial. Even with automated provisioning and deployment, spinning up an environment that is somewhat production-like, and getting a realistic configuration of your application deployed to it, is much more complicated than booting up a single instance of your app on a VM or in a container. Perhaps your organization maintains a dedicated performance lab, in which case the idea of “spinning up your own environment” is probably out of bounds right from the get-go. And we haven’t even started talking about getting hold of sufficiently realistic data for your stress test environment yet. Given all that, how likely is it that we’ll ever get this idea off the ground?
Here’s the trick: we don’t need almost-production-like environments for performance and stress tests to be useful in a Continuous Delivery pipeline. We don’t even need the absolute response times of the multi-user tests we will be running, at least not when the pipeline run is for a relatively minor feature change to a system that’s already in production. (Of course, we’ll want reasonably realistic numbers for the first launch of a system, or before a change so major that the current production instance is no longer a broadly accurate predictor.)
What we need to do is run multi-user scenarios against a fully deployed instance of the system that is configured for scalable, multi-user use and compare the results against the previous pipeline runs. Whether a particular call takes 4s or 8s in our stress test environment is not that interesting—what matters is when a call whose 90th percentile has averaged 7.8s for the past couple of pipeline runs suddenly clocks in at 15.2s. In other words, deviations rather than absolute values are the canary in the coal mine that we are looking for.
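To make the comparison concrete, here is a minimal Python sketch of checking one run's 90th percentile against the average of earlier runs. The function names, the nearest-rank percentile method, and the sample numbers are illustrative assumptions, not part of any particular tool; a real pipeline would read these values from its stored test results.

```python
import math
import statistics


def percentile(samples, pct):
    """Return the nearest-rank percentile (0 < pct <= 100) of a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def deviation_ratio(current_samples, previous_p90s):
    """Ratio of this run's 90th percentile to the mean p90 of earlier runs."""
    baseline = statistics.mean(previous_p90s)
    return percentile(current_samples, 90) / baseline


# Hypothetical data: p90 values (seconds) from the last three pipeline runs,
# and the raw response times (seconds) of one call in the current run.
previous_p90s = [7.6, 7.9, 7.9]
current_samples = [9.0, 10.1, 11.3, 12.0, 12.8, 13.5, 14.1, 14.9, 15.2, 16.0]

ratio = deviation_ratio(current_samples, previous_p90s)
print(f"p90 is {ratio:.2f}x the recent baseline")
```

With these made-up numbers the current 90th percentile is 15.2s against a baseline of about 7.8s, so the ratio comes out at roughly 1.95. The absolute values never enter the decision; only the ratio does.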
How significant does the deviation have to be before we care? How do we avoid being flooded by false positives? The answers to these questions still depend largely on the specifics of each individual system, and on the team’s choice. There are, however, a couple of statistical techniques we can apply here, such as setting the threshold for acceptable deviations in relation to the standard deviation, rather than allowing only a maximum percentage increase. We can also make use of outlier detection algorithms to flag values that are “way off the charts” and are more likely caused by a broken test environment than by the current change going through the pipeline. In addition, we can allow for re-calibration of our thresholds if we already know that the current change will impact the results for a specific scenario.
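One way these two ideas might be combined is sketched below in Python: a result fails the pipeline if it sits more than a few standard deviations above the historical mean, and is treated as a likely environment problem if it is absurdly far out. The function name, the three labels, and all three default parameters (`fail_sigmas`, `outlier_sigmas`, `min_stdev`) are assumptions chosen for illustration; suitable values are system-specific.

```python
import statistics


def classify(current, history, fail_sigmas=3.0, outlier_sigmas=25.0,
             min_stdev=0.1):
    """Classify a metric value against the values from earlier pipeline runs.

    Only increases are flagged. `history` needs at least two values.
    """
    mean = statistics.mean(history)
    # Floor the sample standard deviation so that a very flat history
    # does not turn tiny fluctuations into failures (assumed safeguard).
    stdev = max(statistics.stdev(history), min_stdev)
    sigmas = (current - mean) / stdev
    if sigmas >= outlier_sigmas:
        # Far beyond anything a code change plausibly causes: check the
        # test environment before blaming the change in the pipeline.
        return "suspect-environment"
    if sigmas >= fail_sigmas:
        return "fail"
    return "pass"


# Hypothetical p90 response times (seconds) from five earlier runs.
history = [7.0, 8.5, 7.4, 8.3, 7.8]

for value in (8.4, 15.2, 60.0):
    print(value, classify(value, history))
```

With this history (mean 7.8s, standard deviation about 0.62s), 8.4s is within normal variation, 15.2s is roughly 12 standard deviations out and fails, and 60s is so extreme that the sketch labels it as an environment problem instead. Re-calibration after a known, intended change could be as simple as resetting `history` for the affected scenario.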
The classic metrics returned by stress or performance tests—response time and number of successful/failed calls—can be reasonably good at identifying changes to the system that may have a problematic impact. However, response time is influenced by a number of factors external to the system under test, which can be hard to control for, especially with on-demand testing environments: “noisy neighbor” VMs can influence performance; network latency can be impacted by other VMs in the same virtual segment; disk I/O can be variable; and so on.
Dedicated hardware is a possible solution, but it is expensive. One alternative is to run a small benchmarking suite on your on-demand environment and factor the underlying fluctuations in performance into the thresholds that trigger a pipeline failure. However, this is time-consuming and adds a complicated step to the processing of the test results. Happily, there is another option we can consider instead: measuring key architectural metrics directly.
This approach is based on the observation that many instances of poor performance and failure of large-scale systems are linked to a change in one of a relatively small number of metrics: number of calls to external systems (including databases); response time and size of those external calls; number of processing exceptions thrown (e.g. when an internal buffer or queue is full); CPU time taken; and other related metrics.
Rather than keeping track only of performance testing results, we can, with suitable instrumentation and/or data collection tools, measure these architectural metrics directly. We can monitor their values over time and watch out for deviations in the same way as for response times. In fact, we can even consider collecting a subset of these metrics right at the beginning of our pipeline, when running code-level unit or integration tests! Some metrics, such as the amount of data returned from database calls, will not be of much use with the stub databases and sample datasets typically used for code-level testing. But we will still be able to spot a change that causes the system to jump from 1 to N+1 requests to the databases, as in our shopping cart scenario earlier.
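As an illustration of catching such a jump at the code-level testing stage, the following Python sketch counts the SELECT statements issued while loading a cart, using the standard library's `sqlite3` module and its `set_trace_callback` hook. The schema, the two loader functions, and the helper names are hypothetical stand-ins for the shopping cart scenario, not the code from the original incident.

```python
import sqlite3


def make_cart_db(n_items):
    """Build an in-memory stub database holding one cart with n_items items."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE cart_items (cart_id INTEGER, item_id INTEGER)")
    for i in range(n_items):
        conn.execute("INSERT INTO items VALUES (?, ?)", (i, f"item-{i}"))
        conn.execute("INSERT INTO cart_items VALUES (1, ?)", (i,))
    conn.commit()
    return conn


def load_cart_joined(conn):
    """Fetch all item names for cart 1 in a single query."""
    return conn.execute(
        "SELECT i.name FROM cart_items c JOIN items i ON i.id = c.item_id "
        "WHERE c.cart_id = 1 ORDER BY i.id"
    ).fetchall()


def load_cart_lazily(conn):
    """Fetch the item ids first, then issue one extra query per item."""
    ids = conn.execute(
        "SELECT item_id FROM cart_items WHERE cart_id = 1 ORDER BY item_id"
    ).fetchall()
    return [
        conn.execute("SELECT name FROM items WHERE id = ?", (item_id,)).fetchone()
        for (item_id,) in ids
    ]


def count_selects(conn, action):
    """Run action(conn) and return how many SELECT statements it executed."""
    statements = []
    conn.set_trace_callback(statements.append)
    try:
        action(conn)
    finally:
        conn.set_trace_callback(None)  # stop tracing
    return sum(1 for s in statements if s.lstrip().upper().startswith("SELECT"))


conn = make_cart_db(5)
print("joined:", count_selects(conn, load_cart_joined))
print("lazy:  ", count_selects(conn, load_cart_lazily))
```

Both loaders return the same five rows, so a functional assertion on the result cannot tell them apart, but the joined version issues one SELECT while the lazy version issues six (one for the ids plus one per item). A unit test that asserts an upper bound on this count, or compares it with the count recorded in earlier pipeline runs, would flag the change long before a stress test runs.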
In short, thinking of the purpose of our Continuous Delivery pipeline as delivering software better makes us focus on adding automated testing and metrics. The main purpose of these metrics is to give us confidence that the system will still continue to behave well, at scale, when the changes currently in the pipeline are added. An effective way to gather such measurements is to run performance or stress tests that simulate multi-user scenarios against a fully deployed, running instance of the system configured for scalable, multi-user use… without needing to incur the expense and complexity of building “fully production-like” environments.
Alongside the response time numbers returned by the performance tests, we can use log aggregators, APM tools, or custom instrumentation to track key architectural metrics inside the system. By looking for deviations in these key parameters, we can flag up pipeline runs that risk causing major system instability, allowing us to accelerate our delivery pipeline without putting quality at risk.