Moving to Observability Driven Development
An outline of how to move to observability driven development, and the different components thereof.
Microservices architectures and cloud-native and serverless applications bring many benefits, but they also add considerable complexity from an operations point of view. To support and manage these distributed applications successfully, it has become very important to have full visibility into them. Application monitoring practices as we currently use them are no longer sufficient. Today, visibility is typically not a consideration during design or development; it is strapped on in the form of an agent just before the production release. As you might guess, this does not provide full visibility, because it only works when services fail, or are about to fail, in predictable ways.
While traditional monitoring is used to find out whether the system is functioning properly, observability goes a step further and helps answer the question, "Why is it not functioning the way it should?" Wikipedia defines observability as "the measure of how well internal states of a system can be inferred from knowledge of its external outputs." Observability brings better visibility into systems. It can help identify deviations from expected behavior proactively, before they become issues that disrupt services.
Observability-Driven Development (ODD) is a term that has recently come into use to highlight the need to incorporate observability throughout the software development life cycle. ODD encourages a shift left of the activities required for observability, starting from the earliest stages. The common key dimensions of monitoring are logs, metrics, and traces. In addition, observability attempts to embrace the uncertainty of modern systems by allowing open-ended queries to be made in real time, yielding insights beyond the known unknowns. Below, we explore the high-level considerations for adopting ODD across the various life cycle stages.
Design Time Considerations
During the design phase, the questions that the system needs to answer about its behavior while in operation have to be identified. These could be platform-level or business-KPI-related concerns. A good starting point is the quality of service that the system needs to provide: work out what can go wrong and how that can be detected in advance. Appropriate instrumentation then has to be planned based on the behavior that requires monitoring.
If the places where instrumentation will be added are planned during design, the code can be structured so that the number of touchpoints is reduced while still providing maximum coverage. For example, if tracing or event logging is required at all the entry and exit points of a component, the component can be designed in such a way that all entry or exit points go through some common code before diverging to handle specific functionality. This makes it easier to add instrumentation in one place, either explicitly or through aspect-oriented programming (AOP), that covers all scenarios. It can also help avoid refactoring later. If many such components are to be instrumented, it is better to create a common design and implement the instrumentation at the framework level.
A decision also has to be made on whether to use a pull-based or push-based approach for observability. In the push-based approach, the business applications generate instrumentation data as asynchronous events, which are captured externally and used for immediate or delayed processing by an appropriate observability solution. In pull-based observability, the business applications expose APIs describing their behavior, which can be queried by an external observability solution. While pull is easier to use in any environment because it has fewer dependencies, both approaches have pros and cons. A combination of both patterns may also be used for different types of observability and tool choices.
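As a rough illustration of routing all entry and exit points through common code, the following sketch uses a Python decorator, which plays a role similar to an AOP aspect. The names `traced`, `EVENTS`, and `place_order` are hypothetical and exist only for this example; a real system would forward the events to a logging or tracing backend rather than an in-memory list.

```python
import functools
import time

# Hypothetical in-memory sink used only for illustration.
EVENTS = []


def traced(component):
    """Record an entry event and an exit event around every wrapped call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            EVENTS.append(
                {"component": component, "op": func.__name__, "phase": "entry"}
            )
            start = time.perf_counter()
            outcome = "ok"
            try:
                return func(*args, **kwargs)
            except Exception:
                outcome = "error"
                raise
            finally:
                # Runs on both success and failure, so exits are never missed.
                EVENTS.append({
                    "component": component,
                    "op": func.__name__,
                    "phase": "exit",
                    "outcome": outcome,
                    "duration_s": time.perf_counter() - start,
                })
        return wrapper
    return decorator


@traced("orders")
def place_order(order_id):
    # Business logic stays free of instrumentation code.
    return f"placed {order_id}"
```

Because the instrumentation lives in one place, changing what is recorded at entry and exit does not require touching each business function.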
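The difference between the two styles can be sketched in a few lines. The `Instrumentation` class below is a hypothetical helper, not part of any particular tool: `record` pushes an event onto an outbox that an external collector would drain, while `snapshot` returns the state that a pull-based endpoint would expose when queried.

```python
import queue
from collections import Counter


class Instrumentation:
    """Hypothetical helper supporting both push and pull styles."""

    def __init__(self):
        self.outbox = queue.Queue()  # push: events awaiting a collector
        self.counters = Counter()    # pull: state exposed on request

    def record(self, name):
        self.counters[name] += 1
        # Push: emit the event without waiting for anyone to consume it.
        self.outbox.put({"event": name})

    def snapshot(self):
        # Pull: what a handler behind a metrics endpoint could return.
        return dict(self.counters)

    def drain(self):
        # Stand-in for the external collector that consumes pushed events.
        events = []
        while True:
            try:
                events.append(self.outbox.get_nowait())
            except queue.Empty:
                return events
```

In the pull case the application only keeps current state; in the push case it must also cope with events accumulating if the collector falls behind.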
Development Time Considerations
For observability to be meaningful, it is important to add sufficient context to the instrumentation data. The context should be standardized and included consistently across all instrumentation data, wherever it is published. It is better to set the context at the framework level so that it does not depend on the discretion of individual developers.
The developer who builds the system knows it far better than anyone else. It is much more effective if the same developer observes the system's behavior in the development environment itself, so that any anomalies can be spotted and fixed at an early stage.
Too much instrumentation can become a challenge: it can degrade performance and overwhelm the analysis. The level of instrumentation has to strike the right balance.
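One way to attach context at the framework level, sketched here with Python's standard `logging` and `contextvars` modules, is a logging filter that copies the current request context onto every record. The field names `request_id` and `tenant` and the helper `build_logger` are assumptions chosen for illustration.

```python
import contextvars
import io
import logging

# Hypothetical context fields; choose the ones that matter for the system.
request_id = contextvars.ContextVar("request_id", default="-")
tenant = contextvars.ContextVar("tenant", default="-")


class ContextFilter(logging.Filter):
    """Copy the current context onto every log record."""

    def filter(self, record):
        record.request_id = request_id.get()
        record.tenant = tenant.get()
        return True  # never drop the record, only enrich it


def build_logger(stream):
    logger = logging.getLogger("app")
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.addFilter(ContextFilter())
    handler.setFormatter(logging.Formatter(
        "%(levelname)s request_id=%(request_id)s tenant=%(tenant)s %(message)s"
    ))
    logger.addHandler(handler)
    return logger
```

With this in place, the framework sets `request_id` and `tenant` once per request, and application code simply calls `logger.info(...)` without repeating the context.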
Build and Deployment Time Considerations
One way to enforce observability during deployment is to follow the ‘Observability as code’ practice, in which the observability configurations are version controlled and applied automatically through the Continuous Deployment pipelines. This ensures that they are repeatable and not missed in any environment or deployment.
For containerized applications, the container platform's monitoring probes, such as the health and readiness probes, have to be leveraged effectively. If properly exposed, these probes allow the container orchestration platform to identify an unhealthy container instance, stop routing new workload to it, and restart it as part of self-healing. Implementing these probes to simply return static content may not be effective in determining whether the application instance is healthy and ready for use. For the probes to be effective, their implementation should perform a meaningful check or dummy operation involving the application's dependencies. Probe invocations should not have any side effects.
For push-based observability, it is important to plan the capacity to handle the huge volume of instrumentation data that might be published. Appropriate archival, sampling, and aggregation strategies have to be put in place. Similarly, an appropriate security mechanism has to be set up, depending on whether a pull-based API or push-based eventing is used.
Standard best practices like proactive monitoring and alerting should be in place. A feedback loop from the observations made during operations back to the development team is important for continuously improving observability. These inputs have to be prioritized for implementation alongside business functionality.
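A readiness check that goes beyond static content might look like the sketch below. The function `readiness` and the two dependency checks are hypothetical; in practice each check would perform a cheap, read-only operation against the real dependency, and the result would be returned from whatever endpoint the orchestration platform is configured to probe.

```python
from typing import Callable, Dict, Tuple


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run every dependency check; return an HTTP-style status and details."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:
            # A failing dependency must not crash the probe itself.
            results[name] = f"error: {exc}"
    status = 200 if all(v == "ok" for v in results.values()) else 503
    return status, results


# Hypothetical read-only checks with no side effects.
def database_reachable() -> bool:
    return True  # placeholder for a cheap read-only query


def cache_reachable() -> bool:
    raise ConnectionError("cache timeout")  # simulated outage
```

Returning per-dependency details alongside the status makes it easier to see why an instance was reported as not ready.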
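Two common ways to reduce the volume of pushed data are sampling and aggregation. The sketch below shows one possible form of each; `keep_trace` and `aggregate` are hypothetical names, and hashing the trace identifier is just one assumption about how to make the sampling decision consistent for all events of the same trace.

```python
import zlib
from collections import defaultdict


def keep_trace(trace_id: str, rate: float) -> bool:
    """Deterministically keep roughly `rate` of traces, keyed on the trace id."""
    bucket = zlib.crc32(trace_id.encode("utf-8")) % 10_000
    return bucket < rate * 10_000


def aggregate(samples):
    """Collapse (name, value) samples into count, sum, and max per name."""
    summary = defaultdict(
        lambda: {"count": 0, "sum": 0.0, "max": float("-inf")}
    )
    for name, value in samples:
        entry = summary[name]
        entry["count"] += 1
        entry["sum"] += value
        entry["max"] = max(entry["max"], value)
    return dict(summary)
```

Sampling discards whole traces, while aggregation keeps a compact summary of every sample; which trade-off is acceptable depends on the questions the data has to answer.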
Standards and Tooling Considerations
Internet-scale companies like Google and Facebook have mature observability practices and have built their own tools to get the required insights. Some of their learnings are being shared and implemented through open-source initiatives.
Google released OpenCensus, an open-source library for metric collection and tracing, aimed especially at microservice-based architectures. OpenTracing is another open-source library for distributed tracing, incubated by the Linux Foundation’s CNCF (Cloud Native Computing Foundation). Recently, OpenCensus and OpenTracing agreed to merge to form OpenTelemetry, which is now a CNCF sandbox project. OpenTelemetry is an effort to combine tracing, metrics, and logging into a single set of system components and language-specific telemetry libraries. Tracing and metrics will be supported initially, and logging support is planned for the future.
Traditional monitoring tools capture a predetermined set of metrics, which helps with the known unknowns. Different tools may be required for different types of monitoring, such as logs, metrics, and tracing. Both open-source and commercial tools are available for each type. Commonly used APM (Application Performance Monitoring) tools typically support both metrics and tracing, while log monitoring is generally separate. These tools can either be consumed as a cloud SaaS offering or deployed alongside the applications, on-premises or in the cloud.
Unified observability platforms have started to become available that handle all three types of monitoring and help in querying for unknown unknowns.
As a practice, ODD is going to become more significant and more widely adopted in the development of new distributed architectures.