Change is the only constant in life. There are many definitions of change, but for the purpose of this blog, I will define change as any deviation or variation between two or more instances. Just as we as humans experience changes in our mood on a regular basis, applications often change from release to release. The problem with change is we usually don’t know whether it will have positive or negative consequences until after it’s occurred. When deciding to implement a change you can weigh the pros and cons and collect all kinds of data, but you can never be 100% sure of what the outcome will be until the process is complete.
Further complicating matters, not all changes are planned. Unplanned changes can be due to natural and random variation, factors unintentionally introduced, or they can be the result of an error. Whatever the case changes can lead to deviations from the norm and produce unexpected results and in the worse case an outage. More important than knowing that change happens is being able to identify and understand the impact change has in an attempt to predict when issues may occur.
Throw anomalies into the mix, and you can end up losing your mind, trying to determine if a change is due to an anomaly or if a change wasn’t picked up because of an anomaly. Anomalies don’t conform to normal patterns but they are critical to detect. In Anomaly Detection – Using Machine Learning to Detect Abnormalities in Time Series Data” Applications the need for applications “to detect abnormal behavior which can be an indication of systems failure or malicious activities, and they need to be able to trigger the appropriate steps towards taking corrective action” is described. But anomalies can only be detected when there is agreement as to what defines normal vs. abnormal behavior, and how far can something deviate before it is considered an anomaly.
Applications can experience different performance levels at different times, but these variations aren’t always a cause for concern. For a B2B application, a decline in the number of connections to the API during non-business hours may not be a cause for alarm; fewer requests are made because fewer people are accessing the application. A decline during peak business hours would, however, be a cause for concern. The complexity of having multiple baselines and normals makes it harder to identify anomalies. Only once a baseline of “normal” has been defined can anomalies and change be measured.
In the monitoring world, change can be a fundamental symptom that something has gone wrong and may get worse. Sites experiencing a “hug of death” from a post going viral see a pattern. Traffic to the site starts to increase, then response times slowly start to creep up, and some eventually experience an outage. CodInGame shared their lessons learned from a Reddit hug of death where they went from rejoicing to crisis mode Receiving an alert that an unexpected change has occurred can lead to a flurry of activity and an all hands on deck situation to diagnose and remedy the problem and reduce the impact or avoid an outage.
We rely on machines to help us detect and recognize when something has changed or identify anomalies in a large dataset. Before an outage occurs, there may be indicators that something has changed such as what CodInGame experienced, identifying those changes quickly can reduce the impact of the outage. We set an alert when thresholds are exceeded and identify shifts in trends that can indicate something has gone wrong. Identifying a change has occurred helps us detect when something is amiss and identify what caused the change. Analyzing change is at the core of the troubleshooting process.