Managing the Modern IT Environment – Observability Do’s and Don’ts
Observability is no longer just a synonym for monitoring. Check out some of these best (and worst) practices.
Most people think observability is simply a fancier synonym for monitoring. But in the context of modern IT environments, “observability” takes on a much more relevant and distinct role to address new constructs like microservices and service mesh architectures, which have greatly complicated traditional management strategies.
It used to be easy. You’d run a client/server model, for example, and you could quickly determine when the server wasn’t responding, or the client wasn’t communicating with the server. You’d set up two specific conditions, monitor for those, and be done.
Remember those days?
Now, with service meshes, microservices, and the growth of SaaS-enabled environments in general, the services themselves, rather than technology professionals, are making decisions about which processes to run first, how to run them, what the background processing will look like, and more. The challenge is that all of these decisions affect application performance, but because these are often shared services (as in, you don’t control them), there isn’t an easy way to see what’s actually happening behind the scenes.
In sum, observability is an answer to the blind spots exposed by standard monitoring in modern environments. Certainly, all the elements of traditional monitoring should still be present as part of the observability strategy, but to successfully mitigate failures and performance issues, you must be hyperaware of the type of environment you’re building for, its different endpoints, and the complexities it brings.
With that in mind, the following Do’s and Don’ts provide a guide to the basic tenets of a successful observability strategy that can be implemented in your environment.
DO Focus on Symptoms Rather Than A Specific Problem
Someone very famous once said, “You want to monitor for symptoms of a problem, not for the problem itself.” That’s the essence of observability. In other words, the aim is to be able to say, “I have a fever,” rather than jumping straight to “I have pneumonia.” Setting very specific conditions that indicate a problem is brewing gives tech pros the health and performance data they need to troubleshoot before the problem impacts end users.
You should also ensure you’re creating a high-level view of an application or service by thinking about how best to curate your alerting conditions. As you or your team write the application or architect a service, think about the metrics that will deliver this type of behind-the-scenes view. More data will become available as the application or service moves through build, test, and production, which will, in turn, help you not only refine your alerting metrics but also establish a performance baseline (what “good” looks like) that you can reference in the future.
This baseline is especially helpful when it comes to building in anomaly detection, which monitors for errors that exceed a specified “normal” amount. The catch is that tech professionals must remember to architect applications to include these types of capabilities from the get-go.
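As a rough illustration of the baseline idea, the Python sketch below summarizes "good" behavior from a handful of samples and flags values that sit well above it. The sample error counts, the three-standard-deviation cutoff, and the names `build_baseline` and `is_anomalous` are assumptions made for this example, not something prescribed by a particular monitoring tool.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarize 'good' behavior as a mean and a sample standard deviation."""
    return {"mean": mean(samples), "stdev": stdev(samples)}


def is_anomalous(value, baseline, max_sigma=3.0):
    """Return True when value is more than max_sigma deviations above the mean."""
    threshold = baseline["mean"] + max_sigma * baseline["stdev"]
    return value > threshold


# Hypothetical errors-per-minute counts collected while the service was healthy.
normal_error_counts = [2, 3, 1, 4, 2, 3, 2, 3]
baseline = build_baseline(normal_error_counts)

print(is_anomalous(3, baseline))   # False: within the normal range
print(is_anomalous(15, baseline))  # True: far above the baseline
```

A mean-plus-deviation cutoff is only one simple way to define "normal"; the right definition depends on how the metric actually behaves in your environment.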
DON’T Try to Retroactively Implement Observability
This is so important it bears repeating. For observability to be truly effective, you must consider every layer of the application—even at the architecture level—and where it would be most effective to add tracing and alerting capabilities for more streamlined visualization as you’re building that application. Observability needs to be continually top of mind during the development process.
Consider your performance and resiliency needs, and use tools that will allow you to start measuring the associated metrics as you develop the application. Once you’ve run the application through its first dev test, where do you potentially need to refine the observability functions? Are the metrics initially designed to catch errors A, B, and C (for example) actually triggering an alert when needed? If not, where do you need to adjust alerts or thresholds to more accurately flag impending failures?
This is how you optimize: by tuning performance proactively, before an application is pushed out.
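One way to check whether alerts actually fire when needed is to replay metrics captured during known failures against the alert rules. The sketch below assumes simple "greater than threshold" rules; the metric names, thresholds, and failure data are all made up for illustration.

```python
# Hypothetical alert rules: metric name -> threshold above which an alert fires.
ALERT_RULES = {
    "p95_latency_ms": 800,
    "error_rate_pct": 2.0,
    "queue_depth": 500,
}


def fired_alerts(metrics, rules=ALERT_RULES):
    """Return the sorted names of the rules whose thresholds the metrics exceed."""
    return sorted(
        name for name, limit in rules.items() if metrics.get(name, 0) > limit
    )


# Made-up metrics recorded during known failures in a dev test run.
KNOWN_FAILURES = {
    "error A": {"p95_latency_ms": 1200, "error_rate_pct": 0.5, "queue_depth": 40},
    "error B": {"p95_latency_ms": 300, "error_rate_pct": 1.5, "queue_depth": 90},
}


def silent_failures(failures):
    """List the failure scenarios that did not trigger any alert."""
    return [name for name, metrics in failures.items() if not fired_alerts(metrics)]


print(silent_failures(KNOWN_FAILURES))  # ['error B']
```

Here "error B" slips past every rule, which suggests that a threshold (for example, the error rate) needs to be tightened or a new metric added.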
DO Build in Anomaly Detection and Tracing Functionality
As we know, one of the key challenges in this brave new world of IT is the inability to add monitoring alerts to a shared service. Observability offers the opportunity to instead instrument an application or service to execute a trace that tracks the entire lifecycle of a failed request once a broader conditional alert is triggered.
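The idea of starting a trace only after a broader alert fires can be sketched in a few lines of Python. This is a toy, in-process illustration: `ConditionalTracer`, its methods, and the request and step names are hypothetical, and a real system would likely rely on a dedicated tracing library rather than hand-rolled code.

```python
import time
from contextlib import contextmanager


class ConditionalTracer:
    """Records spans only after a broader alert has switched tracing on."""

    def __init__(self):
        self.enabled = False
        self.spans = []

    def on_alert(self):
        # Called by whatever raises the broad, symptom-level alert.
        self.enabled = True

    @contextmanager
    def span(self, request_id, step):
        start = time.perf_counter()
        error = None
        try:
            yield
        except Exception as exc:
            error = repr(exc)
            raise
        finally:
            # Record the step only while tracing is enabled.
            if self.enabled:
                self.spans.append({
                    "request_id": request_id,
                    "step": step,
                    "duration_ms": (time.perf_counter() - start) * 1000,
                    "error": error,
                })


tracer = ConditionalTracer()

with tracer.span("req-1", "checkout"):
    pass  # tracing is off, so nothing is recorded

tracer.on_alert()  # a broad alert fired, so start tracing

try:
    with tracer.span("req-2", "call_service_b"):
        raise TimeoutError("service B did not respond")
except TimeoutError:
    pass

print(len(tracer.spans))        # 1
print(tracer.spans[0]["step"])  # call_service_b
```

Keeping tracing off until a symptom appears limits overhead, at the cost of missing the requests that failed before the alert fired.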
Similarly, you should plan to build in alerts for significant anomalies. Remember—you’re planning for failure. It’s not if something will fail, it’s when. With different technologies and service meshes in play, it’s also nearly impossible to predict all the different types of error conditions that might arise. So, if your application fails, how is it failing? Did the response time slow to a particular level? Maybe it failed four requests in a row, in a certain geography, moving from service A to service B, when those services were hosted on different platforms or through a different service backbone.
There are so many different parameters associated with application performance and failures that it’s critical to capture all of this contextual information. The richness of that data is what allows you to be more intelligent about setting up symptom alerts, and identify and get ahead of bigger issues.
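To make the "four failed requests in a row" example concrete, the sketch below keeps the context of each request (source, target, region, latency) and hands it back once a run of consecutive failures is seen. The `RequestRecord` fields, the `FailureWatcher` class, and the sample values are assumptions chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RequestRecord:
    source: str
    target: str
    region: str
    latency_ms: float
    ok: bool


class FailureWatcher:
    """Keeps the most recent requests and reports a run of consecutive failures."""

    def __init__(self, max_consecutive=4):
        self.max_consecutive = max_consecutive
        self.recent = deque(maxlen=max_consecutive)

    def observe(self, record):
        """Store the record; return the failing records if the last N all failed."""
        self.recent.append(record)
        window_full = len(self.recent) == self.max_consecutive
        if window_full and all(not r.ok for r in self.recent):
            return list(self.recent)
        return None


watcher = FailureWatcher(max_consecutive=4)
context = None
for latency in (950.0, 1200.0, 1100.0, 1500.0):
    record = RequestRecord("service-a", "service-b", "eu-west", latency, ok=False)
    context = watcher.observe(record)

if context:
    regions = {r.region for r in context}
    print(len(context), "consecutive failures in", regions)
```

Because the returned records carry region, route, and latency, the alert itself can say where and how the failure happened, not just that it happened.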
DON’T Set It and Forget It
Just as observability shouldn’t be a retrofitted solution, neither should it be considered a one-time activity. Your approach to observability should be continuously evaluated and iterated on in response to the nearly endless number of ways an application can fail. At the same time, as applications evolve, whether they’re integrated with other applications or migrated to new hosted platforms, the metrics associated with identifying issues will need to evolve, too.
It’s also incredibly important to remember that although you or your organization may use a public cloud provider like AWS, observability is not a standard, default setting within that offering. You should certainly leverage whatever metrics are offered, but remember that the platform is not designed to create observability metrics specific to the functionality of your application.
The complex cloud environments of today fail in intricate ways, and the reason isn’t always easily identifiable. Additionally, the stakes for meeting high digital expectations for speed and availability are higher. Using the basic do’s and don’ts above, today’s tech professionals and developers can begin to integrate the tenets of observability alongside traditional monitoring with greater confidence, and ensure failures and performance issues are detected before they reach the end user.