Reduce Toil With Better Alerting Systems
Let's dive deeper into how to reduce toil by defining better alerting strategies within an alert management system.
Are you an SRE or on-call engineer struggling to manage toil?
Toil is any repetitive, monotonous activity that breeds frustration within an incident management team. At the business level, toil adds no functional value toward growth or productivity.
However, toil can be tackled with simple but effective automation strategies across every stage of the incident management process.
In this blog, we dig deeper into how to reduce toil by defining better alerting strategies within an alert management system.
Google’s SRE workbook defines toil as:
"The kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows."
To reduce toil, we first need to learn its characteristics (identify toil) and then calculate the human time spent resolving incidents manually (measure toil).
Ways to Identify and Measure Toil
Identifying toil is basically understanding the overall characteristics of a routine task. It can be done by evaluating a task on the basis of:
- what type of work is involved.
- who will be responsible for executing the work.
- how the work can be completed.
- whether the work is easy (less than an hour), medium (less than a few hours), or hard (less than a day) in terms of difficulty during execution.
Measuring toil is simply computing human time spent on each activity. It is done by analyzing certain trends:
- on-call incident response.
- through tickets.
- survey data.
With this analysis, we can prioritize toil to create a balance between production tasks and routine operational tasks.
Note: A common goal, popularized by Google's SRE organization, is to ensure that toil does not occupy more than 50% of SREs' time. This keeps the team focused on engineering work that adds production value.
Before we look into the causes of toil in detail, let's review its after-effects.
Effects of Toil
Whether it is an incident management task or any other activity, doing the same work repeatedly often breeds discontent with the job.
In some cases, toil even causes an increased attrition rate due to burnout, boredom, and/or alert fatigue among SREs, which may eventually slow down the overall development process.
Let's find out ways to reduce toil by first looking into the various causes that contribute to toil.
Causes of Toil Across an Alerting System
Lack of Automation in Alert Management Systems
If alerts are repetitive and must be resolved manually, managing them becomes a tiring task. Suppose your system notifies you that web requests at 6 AM are 3x higher than usual: this indicates a good amount of traffic to your website, but it poses no threat to the architecture. Such alerts merely report on system performance and need no manual intervention. Time spent suppressing these trivial alerts by hand can cause you to miss the important alerts that genuinely need attention, and the manual suppression itself adds up to toil.
Automation is key to reducing toil at every stage of alert configuration. If an alert response can be automated, it should be automated on a priority basis; this goes a long way toward reducing alert noise.
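As a sketch of what that automation can look like, here is a minimal Alertmanager routing configuration that keeps informational alerts out of the paging path so nobody has to suppress them by hand. The receiver names and the `severity` label convention are assumptions for illustration, not from any real setup.

```yaml
# Minimal Alertmanager routing sketch (receiver names and the
# "severity" label convention are hypothetical).
route:
  receiver: pagerduty-oncall   # default: page the on-call engineer
  routes:
    - matchers:
        - severity = "info"    # e.g., "web requests 3x higher than usual"
      receiver: slack-fyi      # low-noise channel; nobody is paged

receivers:
  - name: pagerduty-oncall     # pagerduty_configs omitted in this sketch
  - name: slack-fyi            # slack_configs omitted in this sketch
```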
Poorly Designed Alert Configuration
A poorly configured alert system will generate either too many alerts or no alerts at all. Both problems stem from sensitivity issues within the architecture.
The sensitivity is of two types: over-sensitivity (marginal sensitivity) and under-sensitivity. Over-sensitivity is a condition in which the system sends too many alerts; it occurs when alert conditions sit marginally at threshold levels.
For example, when the response time degradation threshold for a database service is set to an exact absolute value of 100ms, even the slightest fluctuation generates a flood of alerts. Rather than setting such marginal conditions, we can use relative values, like an alert that fires only when response time degrades by more than 50% relative to its recent baseline.
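To make this concrete, here is a hedged Prometheus alerting rule that compares current latency against the service's own one-hour rolling average instead of a fixed 100ms line; the metric names are illustrative.

```yaml
groups:
  - name: db-latency
    rules:
      - alert: DbResponseTimeDegraded
        # Fire only when current latency exceeds its own 1-hour rolling
        # average by more than 50%, not on a marginal absolute threshold.
        expr: |
          (
            sum(rate(db_query_duration_seconds_sum[5m]))
              / sum(rate(db_query_duration_seconds_count[5m]))
          )
          > 1.5 * avg_over_time(
            (
              sum(rate(db_query_duration_seconds_sum[5m]))
                / sum(rate(db_query_duration_seconds_count[5m]))
            )[1h:5m]
          )
        for: 10m
        labels:
          severity: warning
```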
On the other hand, under-sensitivity is a condition where the system does not send any alerts at all, which poses a bigger problem. It happens when an issue in the system goes undetected: there is a risk of running into a major outage with no means of getting to the root cause. In this case, the system might require re-engineering to root out such sensitivity issues.
Ignoring SRE Golden Signals While Configuring Alerts
Latency, traffic, errors, and saturation are the golden signals of SRE that help in monitoring a system. Variations such as USE (Utilization, Saturation, and Errors) and RED (Rate, Errors, and Duration) can also be used to measure key performance aspects of the architecture.
While setting up alerts, the utilization of the database, CPU, and memory has to be estimated and optimized following these vital SRE signals.
For example, say the average load experienced by the infrastructure is consistently 1.5x the machine's CPU core count; the system would then trigger an unusual volume of alerts because the proper optimizations are not in place. Ignoring such basic saturation signals generates abnormalities that can ultimately result in outages.
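A hedged sketch of such a saturation check using node_exporter metrics (the 1.5x multiplier mirrors the example above, and the label matching is simplified for illustration):

```yaml
- alert: CpuLoadSaturated
  # Fire when the 5-minute load average stays above 1.5x the machine's
  # CPU core count -- a basic saturation signal from the golden signals.
  expr: |
    node_load5
      > on(instance) 1.5 * count by (instance) (node_cpu_seconds_total{mode="idle"})
  for: 15m
  labels:
    severity: warning
```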
Insufficient Information on Alerts
Insufficient information on alerts means that the system is having difficulty processing a particular set of instructions and is not alerting specifically about the ongoing situation. This can create significant toil when figuring out where the problem exists and what is contributing to an outage.
Let's say you have received an alert stating "instance i-dk3sldfjsd CPU utilization high." This alert does not convey sufficient information about the incident, such as the IP address or the hostname. With such minimal information, the on-call engineer cannot respond to the incident directly; they might have to open the AWS console just to find the actual IP address of the server before troubleshooting can begin. In this scenario, the time taken to locate the server, log on, and resolve the issue would be substantially high.
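A hedged sketch of the same alert carrying enough context to act on immediately; the metric, the extra labels (such as `hostname`), and the runbook URL are illustrative and would come from your own relabeling setup:

```yaml
- alert: HighCpuUtilization
  expr: |
    100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "CPU utilization high on {{ $labels.instance }}"
    description: >-
      Instance {{ $labels.instance }} (host {{ $labels.hostname }},
      IP {{ $labels.private_ip }}) has had CPU above 90% for 10 minutes.
    runbook_url: "https://wiki.example.com/runbooks/high-cpu"   # hypothetical
```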
Ways to Reduce Toil With a Better Alerting System
Set Alert Rules Based on Historic Performance of the System
While configuring alerts, instead of setting tight thresholds, take a look at the "Trend/Historical Rolling Number" of system performance. This can be done by calculating the rate of change in system performance, which gives you a clear idea of where to set the right thresholds. Almost all modern monitoring systems can record this rate of change.
For example, consider instances where CPU utilization is consistently greater than 70-80%, server response time exceeds 4-6ms, or the log query count stands greater than 100-125. Here, alerts can be tuned to the performance range of the system by expressing thresholds as percentile values, like the 95th percentile. This reduces alert volume drastically and helps your system stay reliable.
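As an illustration, a percentile-based rule might look like the following sketch; the histogram metric name is assumed, and the 6ms target is taken loosely from the example above:

```yaml
- alert: SlowResponses
  # Alert on the 95th-percentile response time rather than a tight
  # absolute threshold, so occasional outliers don't page anyone.
  expr: |
    histogram_quantile(0.95,
      sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.006
  for: 10m
  labels:
    severity: warning
```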
Create Proactive Alert Checks
With their predictive characteristics, proactive alerts play a vital role in understanding system performance.
Before we expand further on proactive alerts, here’s a quick look at the different kinds of alerts and their implications.
Investigative Alerts, Proactive Alerts, and Reactive Alerts
In an alert management system, the foremost step in alerting is to categorize alerts so that we can monitor the system’s health in a strategic order. There are three types of alert categories: investigative, proactive, and reactive.
Investigative alerts flag conditions that can harm system health in the long run.
Whenever user behavior changes in a way that falls beyond the scope of the defined SLO, a service failure can follow. For example, suppose an SRE configures conditions in an incident management tool using regex and logical constraints alone, while developers express the same parameters differently across programming languages. The conditions then deviate from the configured patterns, the system silently fails to match the specified instructions, and an outage may build up over the long run.
It has to be noted that investigative alerts are also referred to as “cause-based alerts” that can turn into toil if not properly aligned with other alerting strategies.
Proactive alerts are those that indicate a future threat to the organization.
For instance, if an alert for storage utilization is configured to fire at 100%, an engineer will be notified only when the storage capacity has already run out, and the situation might soon turn into an outage. To avoid such incidents, the alert has to be configured at 70% utilization or above. That way, the system alerts the team when less than 30% of capacity remains free, leaving some buffer time to resolve the issue.
This way of predicting system performances and configuring alerts accordingly is called proactive alerts.
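A hedged sketch of such a proactive check using node_exporter filesystem metrics; `predict_linear` is a common companion for projecting whether the disk is on course to fill:

```yaml
- alert: DiskSpaceLow
  # Proactive: warn at 70% used, while roughly 30% of capacity (and
  # therefore buffer time) still remains.
  expr: |
    (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 70
  for: 30m
  labels:
    severity: warning
- alert: DiskFullIn24h
  # Companion check: project the 6-hour trend and warn if the
  # filesystem would fill within the next 24 hours.
  expr: |
    predict_linear(node_filesystem_avail_bytes[6h], 24 * 3600) < 0
  for: 1h
  labels:
    severity: warning
```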
Reactive alerts are those that indicate an immediate threat to business goals.
This kind of alert arises when the system or service breaches its defined SLOs. These alerts notify the team only once an outage is already occurring, and the team has to respond reactively. An example would be an unexpected blackout of a payment portal or any feature of a product. In cases like these, users cannot access the affected service at all, making it a major incident for the team to handle. This is a reactive alert.
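A sketch of a reactive alert tied directly to an SLO breach; the job name and the 5% error-rate threshold are illustrative:

```yaml
- alert: PaymentPortalSloBreach
  # Reactive: fires while the service is actively breaching its SLO,
  # e.g., more than 5% of payment requests returning server errors.
  expr: |
    sum(rate(http_requests_total{job="payment-portal", code=~"5.."}[5m]))
      / sum(rate(http_requests_total{job="payment-portal"}[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
```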
It is the prime responsibility of an incident response team to segregate, prioritize, and categorize alerts so that the alert response procedure stays structured.
Therefore, setting up well-defined alert rules based on reliability targets, and automating them, is one of the most dependable ways to reduce toil.
Ways Proactive Alerts Help in Reducing Toil
- Since they are predictive, proactive alerts help an incident management team gather all the required tools beforehand (prepare) for response activities.
- They help in reducing user-reported incidents.
- They drastically reduce incident response time.
- With all the response plans in hand, the team can easily automate resolution through runbooks or execute the necessary steps to resolve an incident. Proactive alerts thus considerably increase the overall productivity of teams and the business.
- They play an important role in increasing the velocity of innovation.
Define Alerting Policies With Alert-as-Code
In SRE practices, an alerting policy is a set of rules or conditions we define for a monitoring system. This set of rules notifies the engineering team when there is a system abnormality. Alerting policies play a vital role in maintaining the performance and health of the system architecture.
Alert-as-code is an emerging technique in which all the system alerts, or the entire set of alerting policies, are defined in the form of code. This helps a monitoring tool pinpoint incidents more specifically.
This alert-as-code configuration can be done while building the system with an infrastructure-as-code architecture.
For example, consider an infrastructure team with an alert-as-code configuration that uses Kube-Prometheus to deploy Prometheus across their architecture and defines all the alerting rules for the infrastructure in that configuration. The use case here is that every change made to the monitoring setup is version controlled with Git and stored in GitHub.
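With Kube-Prometheus, such rules typically live in a `PrometheusRule` custom resource that can be committed to Git like any other manifest; a minimal sketch follows, with the names, namespace, and example rule all illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payment-service-alerts       # hypothetical name
  namespace: monitoring
  labels:
    prometheus: k8s                  # selector labels the kube-prometheus
    role: alert-rules                # stack uses to pick up rule files
spec:
  groups:
    - name: payment-service
      rules:
        - alert: PaymentErrorRateHigh
          expr: |
            sum(rate(http_requests_total{job="payment", code=~"5.."}[5m]))
              / sum(rate(http_requests_total{job="payment"}[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
```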
Alert-as-code also helps with predictive analysis and root cause analysis when scrutinizing the underlying reason for an incident. Some of the other use cases are:
- This offers a way to automate routine tasks and gain more control over infrastructure with version control platforms.
- It saves lots of time by standardizing all those complex and dynamic systems throughout the infrastructure.
- It also supports documentation processes for future citations.
- Alerts can also be managed through cloud monitoring APIs, which automate the process of creating, editing, and managing alert policies (see the sketch after this list).
- Alerting APIs are helpful for real-time monitoring of system health and for identifying event triggers when categorizing alerts.
- They support the team by flagging potential issues within the system architecture.
Note: While detecting anomalies, the programmatic alerting policy creates alerts only when there is a deviation from the historical performance of the system.
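As one concrete example of an API-managed policy, here is a sketch of a Google Cloud Monitoring AlertPolicy, which can be created through the Monitoring API or applied from a file with `gcloud`; the metric filter and threshold values are illustrative:

```yaml
# AlertPolicy sketch for Google Cloud Monitoring, e.g. applied with:
#   gcloud alpha monitoring policies create --policy-from-file=policy.yaml
displayName: "VM CPU utilization above 90%"
combiner: OR
conditions:
  - displayName: "CPU > 90% for 5 minutes"
    conditionThreshold:
      filter: >-
        metric.type="compute.googleapis.com/instance/cpu/utilization"
        AND resource.type="gce_instance"
      comparison: COMPARISON_GT
      thresholdValue: 0.9
      duration: 300s
      aggregations:
        - alignmentPeriod: 60s
          perSeriesAligner: ALIGN_MEAN
```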
Less Toil, More Productivity!
Proper alerts, backed by the necessary automation strategies, pave the way for a more effective, toil-free incident management ecosystem. These practices greatly help in reducing operational toil and can ultimately enhance the productivity of the team.