Let’s be honest: your software has bugs. Sometimes, it is also slower than end users expect. This is the situation we all live in. The key is to admit that the software you create will end up disappointing the users at some point. After all, even the almighty Google itself cannot guarantee the availability of their services for more than 99.95% of the time.
The sad part is that when the inevitable happens and some service is not behaving as it should, most of us either are not capturing the right signals or cannot act on them because the impact of the issue is unclear. Whenever clearly defined goals and transparent reporting are missing, finger-pointing and blame games start happening.
This post is aimed at any product owner or operations team lead who wants a clear and meaningful measure of whether their IT assets behave as they should. After reading this post, you will understand how to set a really simple objective and how to track progress towards your performance and availability goals over time.
Every good team has a reasonable objective to attain when it comes to functional aspects of the software they are building:
- Sign-up to product activation ratio must exceed 35%.
- The recommended products share of the aggregate shopping cart size must exceed 10%.
- The nurturing e-mail campaigns targeted towards inactive users must result in 6% or more reactivation rates.
The very same teams also have objectives around the performance and availability of the software. However, these types of objectives are often poorly chosen. Let me give you an idea by looking at the following objectives drawn from some real-world teams:
- CPU utilization must not exceed 80%.
- 99% of the database queries must respond under 1 second.
- The application must support 1,200 concurrent users.
- Any application server node must not experience more than 2h downtime per month.
I could continue the list with similar examples, but the pattern is hopefully clear already. None of these requirements really focuses on the user experience aspect of the performance and availability of the software. As a result, you will find yourself again and again in situations like the one below:
We have all been present in rooms where you could cut the tension in the air with a knife. On one side of the table is the product team, claiming that they cannot tolerate the availability issues around the product offering. On the other side sits the operations team, pointing out that, from their perspective, the systems are working just fine.
So, let’s admit that we need a different goal to measure. Our real goal is not to make database queries fast, nor is it to keep CPUs idling. The real goal is to make sure the end users of the software are satisfied with the application.
How Can You Measure User Satisfaction?
How can user satisfaction with service availability and performance be expressed in a measurable way? As it turns out, the answer is simpler than you might expect. User satisfaction is built by monitoring every interaction end users perform with the application and tracking whether or not:
- The application performs the interaction the user wanted it to.
- The interaction completes with the expected outcome.
- The interaction completes within a reasonable timeframe.
A variety of tools on the market are capable of capturing these interactions and flagging each one based on whether or not the outcome completed successfully and/or fast enough from the end user’s point of view.
Using the interactions as the input, we could measure the satisfaction across your user base via the following (simplified) formula:
Satisfaction = successful interactions / total interactions.
Now, suppose your goal is a 99.9% satisfaction rate on each given day. On a particular day when 500,000 user interactions were performed, the goal would be met if no more than 500 non-successful interactions occurred during the day.
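The formula and the 500,000-interactions example above can be sketched in a few lines of code. This is a minimal illustration, not a real monitoring tool; the `Interaction` type and its fields are hypothetical stand-ins for whatever your monitoring solution records.

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    """A single end-user interaction, flagged by the monitoring tool.
    (Hypothetical structure for illustration purposes.)"""
    endpoint: str
    duration_ms: float
    succeeded: bool  # expected outcome AND acceptable duration

def satisfaction(interactions):
    """Satisfaction = successful interactions / total interactions."""
    if not interactions:
        return 1.0
    successful = sum(1 for i in interactions if i.succeeded)
    return successful / len(interactions)

# The example from the text: 500 failures out of 500,000
# interactions lands exactly on the 99.9% target.
day = ([Interaction("checkout", 120.0, True)] * 499_500
       + [Interaction("checkout", 120.0, False)] * 500)
print(f"{satisfaction(day):.4%}")  # 99.9000%
```

In practice, of course, the interactions would be streamed in from your monitoring tool rather than constructed by hand, but the arithmetic stays the same.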
If you have not monitored real user experience before, tracking your users using the formula above is a good starting point. You will learn and improve a lot while using this approach.
The Devil Is in the Details
Unfortunately, understanding and applying this simple formula is only the first step. There are many details to take into account when monitoring for performance and availability. For example:
- The availability of different services has a different impact on your business. What if the failures you experienced during the day all occurred during checkout of a shopping cart? In an e-commerce solution, this would indicate a serious issue and would likely require immediate action taken by the operations team.
- The performance of different services cannot be treated equally. Some operations might have to complete in a few hundred milliseconds, while for others it might be perfectly OK to experience 10+ second response times.
- Success is sometimes difficult to monitor. When the operation involves a complex calculation, checking whether the outcome of the calculation was correct might not be feasible. In such cases, the monitoring must rely on the metadata about the operation (duration, response codes, etc.) to decide whether or not the operation completed as expected.
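The points above can be combined into a simple classification sketch: per-endpoint response-time budgets, plus a metadata-only success check. The endpoints and threshold values below are entirely hypothetical; the real numbers depend on your own product and your users’ expectations.

```python
# Hypothetical per-endpoint performance budgets, in milliseconds.
# A business-critical checkout gets a tight budget, while a
# batch-like export may legitimately take many seconds.
THRESHOLDS_MS = {
    "/checkout": 500,
    "/search": 1_000,
    "/report/export": 10_000,
}
DEFAULT_THRESHOLD_MS = 2_000

def is_satisfactory(endpoint, duration_ms, http_status):
    """Flag an interaction as satisfactory using only its metadata
    (duration and response code), since validating the actual result
    of a complex operation is often not feasible."""
    if http_status >= 400:
        return False  # the outcome itself failed
    limit = THRESHOLDS_MS.get(endpoint, DEFAULT_THRESHOLD_MS)
    return duration_ms <= limit

print(is_satisfactory("/checkout", 350, 200))         # True
print(is_satisfactory("/checkout", 800, 200))         # False: too slow for checkout
print(is_satisfactory("/report/export", 8_000, 200))  # True: within the 10 s budget
print(is_satisfactory("/search", 450, 500))           # False: server error
```

Feeding the flags produced by a function like this into the satisfaction formula gives you a single number that already accounts for the fact that different services carry different expectations.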
There are many details to cover down the road, but rest assured, you will figure the nuances out along the way. I can only recommend adopting the mindset of “failures do happen” and starting to measure the real user experience, so that you stay on top of such failures.