How Shutterfly Masters Continuous Performance Testing by "Adjusting the Belt Tight"
This article was originally written by Federico Toledo for the Abstracta blog.
Picture this: you own an e-commerce website, and you want to be sure that its excellent customer experience doesn’t deteriorate as new functionality is introduced and that your customer retention rate stays healthy. For you, one critical pillar for achieving this is having reliable, highly performant web pages. As they say, your competitor is always just one click away.
The same goes for Shutterfly, the platform for creating custom photo products and gifts (which recently passed the $1 billion mark in revenue), whose performance engineering team tests and monitors its web and application performance on a daily basis. Making sure that each user sails smoothly through the entire process, from uploading photos to designing their creation to checkout, is of the utmost importance for the company to remain the market leader.
Above is a screenshot of a page on the Shutterfly website that lets me preview a custom mug. As you can see, its powerful tool allows you to visualize almost any product you can think of with your photo(s) on it. Imagine all the work that goes into making sure this system works perfectly, all the time.
I have been fortunate enough to work with Shutterfly’s performance engineering team, assisting them over the past few months while learning about their cutting-edge continuous performance testing method. I’m thoroughly impressed by how this client has devised a way to test so that they can discover performance degradations almost immediately!
Today I'd like to discuss how Shutterfly's continuous performance testing methodology works so that you may use it to your advantage.
The methodology takes a very different approach from the classic load simulation used for acceptance testing, in which we simulate the whole load scenario on a test infrastructure similar to the production environment. We are not substituting one for the other; this is a way to complement those tests and discover degradations before the acceptance test. This way, we reduce the risk of having to implement big solutions (aka solve big problems) just a couple of days before the date we wanted to deliver the new version of the system.
Essentially, the goal is to detect the exact moment when someone commits a line of code that impairs the performance of the system. This methodology provides a great way to identify that automatically and as soon as possible, leaving a record that gives us a clue as to when the change was introduced so that we can track it down more easily.
So, how does it work?
First off, a precondition that must be met is to run the tests frequently. Continuous Integration works best, running different sets of tests at different times: basic regression tests before each commit and a full regression at least two or three times a week. You must also have load scenarios (acceptance testing) before putting a new version into production, and tests ready for each fix. In Shutterfly’s case, engineers check in code around the clock and the software in the test environment is updated daily.
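To make that cadence concrete, here is a minimal sketch of a CI dispatcher that picks the right suite per trigger. The suite names, paths, and commands are hypothetical, not Shutterfly’s actual setup; each CI job (a pre-commit hook, a weekly cron, the nightly build) would invoke it with its own trigger.

```python
# run_suite.py: illustrative only; suite names and commands are hypothetical.
import subprocess
import sys

# Which tests run on which trigger, mirroring the cadence described above:
# a basic regression before each commit, a full regression a few times a
# week, and a performance run against the daily test-environment build.
SUITES = {
    "pre-commit": ["pytest", "tests/regression/basic"],
    "full-regression": ["pytest", "tests/regression"],   # e.g. Mon/Wed/Fri cron
    "nightly-perf": ["python", "run_perf_tests.py"],     # after the daily deploy
}

if __name__ == "__main__":
    trigger = sys.argv[1]                        # supplied by the CI job that fired
    subprocess.run(SUITES[trigger], check=True)  # non-zero exit fails the build
```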
Secondly, we must make sure we have an exclusive environment for the tests. With a dedicated test environment, the results will be more or less predictable and won’t be affected if, for example, someone else runs something at the same time, causing response times to soar, generating false positives, and wasting a whole lot of time.
Another key aspect is that the tests should have acceptance criteria (assertions) set as close as possible to the observed baseline, so that at the slightest system regression, before it has any real impact, some validation fails and flags the problem. This should be done in terms of both response times and throughput.
I like to call this “adjusting the belt tight.”
Now I’ll explain how to create, maintain, and use this “belt” in performance tests.
Imagine we have a new test up and ready for stress testing the login. What we will do is see just how much we can “tighten the belt”: up to what point can we increase the load on the login while keeping an acceptable service level?
Say we run the first test with 100 concurrent users and, as a result, there are no crashes, the response times are below 100 ms (at least for the 95th percentile), and the throughput is 50 TPS. Then we run the test with 200 virtual users and again there are zero crashes; the times are at 115 ms and the throughput is at 75 TPS.
Great, it’s scaling.
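The article doesn’t tie the method to any particular tool, but to make this step tangible, here is what a minimal login stress test could look like in Locust, a Python load-testing framework; the /login endpoint and the credentials are invented for the example.

```python
# locustfile.py: a minimal sketch using Locust (not necessarily the tool
# Shutterfly uses). The /login endpoint and credentials are made up.
from locust import HttpUser, constant, task

class LoginUser(HttpUser):
    wait_time = constant(1)  # roughly one login attempt per user per second

    @task
    def login(self):
        # catch_response lets us mark the request failed ourselves, so the
        # error-rate statistic reflects real server failures.
        with self.client.post(
            "/login",
            json={"user": "loadtest", "password": "secret"},
            catch_response=True,
        ) as resp:
            if resp.status_code != 200:
                resp.failure(f"login returned {resp.status_code}")
```

You would run the same script at each load level, for instance `locust --headless -u 100 -r 10 --run-time 5m --host https://test.example.com`, then again with `-u 200`, and compare errors, response-time percentiles, and TPS between runs.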
If we continue down this path, at some point we will reach a load at which the throughput no longer increases. We will also start getting errors (exceeding 1%, for example), which would indicate that we are saturating the server. It’s likely that response times will then begin to climb significantly because some process, connection pool, or other resource somewhere in the system’s architecture starts to become a bottleneck.
Following this scenario, imagine we get to 350 concurrent users with a throughput of 150 TPS, 130-millisecond response times, and 0% errors. When we push on to 400 virtual users, the throughput is still about 150 TPS, and at 450 users it drops even below 150 TPS.
There is a concept called the knee: the point at which performance “degrades ungracefully.” That is exactly what we encounter with this type of testing, as illustrated in the graphic below. The TPS is expected to increase as we increase the number of concurrent users; if it doesn’t, it’s because we are exceeding the system’s capacity.
This is the basic methodology for finding the knee in a stress test, when we want to know how far our servers can scale under their current configuration.
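To show how this could be automated, here is a rough sketch that flags the knee from the results of a ramp like the one above; the 10% minimum-gain threshold is an arbitrary choice for illustration.

```python
# knee.py: spot the knee from ramp-test results (a sketch).
def find_knee(results, min_gain=0.10):
    """results: list of (concurrent_users, tps) pairs in ramp order.
    Returns the last load level at which throughput was still scaling."""
    for (users, tps), (_, next_tps) in zip(results, results[1:]):
        # Less than min_gain relative TPS growth at the next step means
        # we have stopped scaling: the knee is right here.
        if (next_tps - tps) / tps < min_gain:
            return users, tps
    return results[-1]

# The numbers from the walkthrough above:
ramp = [(100, 50), (200, 75), (350, 150), (400, 150), (450, 140)]
print(find_knee(ramp))  # -> (350, 150)
```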
Then, the test that we schedule to keep running frequently is the one that executes 350 users, expecting less than 1% errors and response times below 130 × 1.2 = 156 ms (this way we give ourselves a 20% margin). Last but not least, we have to assert the throughput, verifying that we are still reaching 150 TPS. This is how we can detect, right on time, when something degrades the performance.
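Expressed as code, the “belt” for that scheduled test might look like the following sketch. How you extract the p95, error rate, and TPS from your tool’s report will vary, so that part is left out; the thresholds are the ones derived above.

```python
# belt_check.py: assertions for the scheduled 350-user test (a sketch).
BASELINE_P95_MS = 130   # observed p95 at 350 users
MARGIN = 1.2            # the 20% slack described above
MAX_ERROR_RATE = 0.01   # 1%
MIN_TPS = 150           # observed throughput at 350 users

def check_belt(p95_ms: float, error_rate: float, tps: float) -> None:
    """Raise AssertionError, failing the CI job, on the slightest regression."""
    assert error_rate < MAX_ERROR_RATE, f"error rate {error_rate:.2%} over 1%"
    assert p95_ms <= BASELINE_P95_MS * MARGIN, (
        f"p95 of {p95_ms} ms exceeds the {BASELINE_P95_MS * MARGIN:.0f} ms belt"
    )
    assert tps >= MIN_TPS, f"throughput {tps} TPS fell below {MIN_TPS} TPS"

check_belt(p95_ms=142, error_rate=0.004, tps=151)  # a healthy run passes
```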
It is a good idea to monitor the test for a couple of weeks to make sure this analysis holds and the results stabilize. It is also important to review these values from time to time, in case the performance of the service improves, in which case we’d need to adjust the belt even tighter! In other words, we’ll always have to keep adjusting the assertion values.
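One simple way to keep re-tightening the belt could be recomputing the thresholds from the last few stable runs, along these lines; the use of medians and the 20% margin are illustrative choices, not a prescription.

```python
# recalibrate.py: re-derive the belt from recent stable runs (a sketch).
import statistics

def new_thresholds(recent_p95s_ms, recent_tps, margin=1.2):
    """Median observed p95 plus a 20% margin becomes the new response-time
    ceiling; the median observed TPS becomes the new throughput floor."""
    return statistics.median(recent_p95s_ms) * margin, statistics.median(recent_tps)

# If the service improved and p95 now hovers around 110 ms at 165 TPS,
# the belt tightens from (156 ms, 150 TPS) to (132 ms, 165 TPS):
print(new_thresholds([108, 110, 112], [163, 165, 166]))  # -> (132.0, 165)
```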