Superheroes Cannot Scale to Meet Application SLAs, But They Can Enhance Their Superpowers
Site Reliability Engineers (SREs) are support superheroes. But what happens when their work needs to scale?
Everywhere one turns, at least in the movie theater or on the small screen, there is a new superhero with some new power or focus. Given how important application or site uptime is, it’s a wonder that there isn’t yet a superhero for troubleshooting software problems. In a way, though, such superheroes have existed for some time and have been the primary means of keeping applications running smoothly.
These Site Reliability Engineers (SREs), or support superheroes, are highly skilled at finding a solution to an issue, if not its actual root cause, based on their deep experience, knowledge, and intuition. But these software superheroes cannot scale to meet the new reality created by ever-increasing complexity combined with an accelerating rate of application change. That combination is driving steady growth in incident numbers, and observability tools are not the problem. The limitation is the amount of information the human mind must quickly absorb and analyze to resolve these incidents.
Tackling this human limitation requires shifting the problem-solving approach to give superheroes, and mere mortals, greater leverage through automated root cause analysis. A perfect storm makes resolving an incident increasingly difficult: “building while flying” development practices; the growing complexity of software (such as the trend toward microservices), cloud components, and open-source software; ever-increasing uptime expectations; and high pressure and expectations for “perfection.” We will still need the skill and intuition of these experienced technical professionals. But we need to help them overcome innate human scalability limitations by drastically reducing the amount of information they need to absorb in their daily decisions.
The technology that can help is generally called automated root cause analysis (RCA). Automated RCA is still not widely known. It is designed to mimic the way a skilled engineer troubleshoots. Without automation, the engineer would need to look at all the available clues about what has happened. These are usually found in telemetry from the application in the form of metrics, traces, and logs. Metrics are useful for indicating when a problem occurred, and traces help narrow down where it occurred or is occurring. Logs, however, are one of the best sources for why it happened (the root cause). Sometimes logs are eschewed because of their limitations: a large volume of events, a high level of “noise,” and non-standard, ever-changing “free-form” formats. Some superheroes may try to implement a fix based simply on their intuition, without knowing the complete cause. But logs are one of the best ways to discover the actual root cause, especially for new, never-before-seen problems.
Manually reviewing logs takes time. Engineers, even of the superhero variety, need to examine individual log lines, out of the millions available, to find meaningful anomalies and clues to why a problem occurred. Often the answer is spread across several logs rather than contained in a single one. Software superheroes are good at quickly eliminating or ignoring logs that are irrelevant to their problem, but finding the unusual log lines that may explain the root cause is harder. Engineers can get more out of their time and expertise by leaving this vast review to an automated system. In effect, the expert's superpowers are augmented rather than replaced.
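To make the idea of surfacing unusual log lines concrete, here is a minimal Python sketch. It is not how any particular automated RCA product works; it only illustrates one simple assumption, that lines whose overall shape is rare are worth a first look. The function names (to_template, rare_lines), the masking rules, and the max_count threshold are all hypothetical choices made for this illustration.

```python
import re
from collections import Counter

# Hypothetical masking rules (an assumption of this sketch): replace variable
# fields so that lines differing only in IDs or numbers share one template.
# Hex values are masked first so their digits are not split by the number rule.
_MASKS = [
    (re.compile(r"0x[0-9a-fA-F]+"), "<HEX>"),
    (re.compile(r"\d+"), "<NUM>"),
]


def to_template(line):
    """Collapse a raw log line into a coarse template by masking variable parts."""
    template = line.strip()
    for pattern, token in _MASKS:
        template = pattern.sub(token, template)
    return template


def rare_lines(lines, max_count=1):
    """Return (line_number, line) pairs whose template occurs at most max_count times.

    Blank lines are ignored. Line numbers are 1-based positions in the input.
    """
    lines = list(lines)
    templates = [to_template(line) for line in lines]
    counts = Counter(t for t in templates if t)
    return [
        (number, line)
        for number, (line, template) in enumerate(zip(lines, templates), start=1)
        if template and counts[template] <= max_count
    ]
```

Given four request lines such as "GET /items/1 200 12ms" and one line reading "connection pool exhausted after 30 retries", the request lines collapse to a single template and only the connection pool line is returned. Real systems have to cope with far noisier data, and a rare line is only a candidate clue, not a confirmed root cause.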
New technology can provide critical assistance, but it also requires and benefits from some new practices. First, log review should be embraced rather than avoided, despite the natural aversion to poring through logs. Automated RCA is key: it cuts the massive troubleshooting task down to a manageable size and can even summarize key log events using modern NLP techniques. Now log review can be exhaustive, and the precise cause can be determined and addressed. More importantly, the process can scale beyond what human superheroes can manage.
Second, engineers need to shift from digging through all the logs to looking at and interpreting the results from the automated RCA platform. Software superheroes will become particularly adept at using these findings to quickly remedy an incident. Admittedly, this does require some level of trust. Use of automated RCA has shown spectacular results in problem-solving across many kinds of companies. A wide swath of Cisco product teams recently incorporated automated RCA and reported that it was able to find the correct root cause indicators in logs in 95.8% of incidents.
Third, rather than using or procuring automated RCA as a standalone tool, the capability should ideally be integrated with the primary observability tools and dashboards already in use. The combination makes log review a more intuitive and natural part of the familiar workflows and processes that already exist.
Fourth, once automated RCA is part of normal monitoring and observability, practitioners should consider how the capability could be used proactively to address problems before they become substantial.
These changes may seem obvious, but they can have a transformative effect on operational effectiveness and on meeting or beating SLAs. Managers may sometimes quip, “If only I could clone my star performers.” By enhancing and extending the superpowers of software superheroes, rather than cloning them, organizations can make fuller use of their skills. In this way, superheroes can indeed scale to meet the critical challenges.