Observability and GitOps
Observability and GitOps
Observability is vital; it allows us to manage a system properly, determine if it is working correctly, and decide what needs to be fixed, changed, or improved.
Join the DZone community and get the full member experience.Join For Free
Observability is a vital component of our architectures, as it allows us to manage a system properly, determine if it is working correctly, and decide what needs to be fixed, changed, or improved. Observability has been an area of interest in the cloud-native community for a long time, supported by the Cloud Native Computing Foundation (CNCF). Many projects and products are being developed to allow observability of systems and applications. The creation of a Special Interest Group for Observability (SIG) and the development of an observability framework (i.e., the OpenTelemetry project) prove the importance of this concept and the desire to standardize it.
This article deals with observability in the context of GitOps. As mentioned in the previous article, the key functionality of GitOps is comparing the desired state of the system, stored in Git, to the current state of the system and applying the required changes to make the two converge. GitOps relies on a controller (an operator or a process) that manages and controls the remote resources. This means GitOps relies on the observability of the controller to identify the actions that need to be performed. But GitOps is also a system that must provide observability.
What Is Observability?
The old supervision methods have reached their limits in the supervision of the new standards of application architecture. The management of highly scalable and portable microservices requires the adaptation of tools in order to facilitate debugging and diagnosis at all times, thus, requiring the observability of systems.
Often, monitoring and observability are confused. Basically, the idea of a monitoring system is to get a state of the system based on a predefined set of metrics to detect a known set of issues. According to the SRE book by Google, a monitoring system needs to answer two simple questions: “What’s broken, and why?” Analyzing an application over the long term makes it possible to profile it in order to better understand its behavior regarding external events and, thus, be proactive in its management.
Observability, on the other hand, aims to measure the understanding of a system state based on multiple outputs. This means observability is a system capability, like reliability, scalability, or security, that must be designed and implemented during system design, coding, and testing.
The purpose of this concept is to control what is happening to a system to answer questions like:
- What version of an application was most recently deployed and by whom?
- Can we decrease the number of replicas without impacting performance?
- How do microservices interact with one another?
- Can it be a bottleneck?
Observability completes monitoring projects based on three pillars: logs, metrics, and traces. Monitoring tells us when the state is wrong while observability tells us why. Monitoring is a subset of and key action for observability.
What To Observe and Why
Observability has no limit, it depends on the context (the monitoring metrics, log management, etc), the time allocated to the implementation, the target (is it a dashboard for a developer, a salesperson, a manager?), etc. But there are some overall parameters that have to be observed.
The Desired State vs the Current State
The first and probably most important aspect to observe is the difference between the desired state stored in a Git repository and the real, current state of the resources.
It is important to oversee this link between the desired state and the current state to ensure that the last updates are correctly applied before asking for another chance. This is a best practice for every automation tool result, as debugging is easier the sooner you do it.
The idea behind this aspect of observability is to answer questions like:
- Was my last commit correctly applied to my system?
- If not, what happened exactly?
- Should I roll back to the last version of my repository?
- Is the system working as intended?
The Deployment Frequency
This is a crucial aspect to measure the impact of the GitOps methodology on our internal processes. The purpose is to move to a reliable system that can handle multiple deployments per day without (or with limited) impact on the business.
Automating internal processes like continuous integration, continuous testing, or continuous deployment require continuous observation to extract insights from it and improve each process based on the requirements of each team/project.
The deployment frequency is a good metric to understand if the internal processes are optimized or not. For example, the frequency is a good indicator of the quality of the development lifecycle. Having a big difference between the number of deployments in the staging and production environment can reveal multiple issues in the SDLC like a poor automated tests process, bad code quality, a bad development lifecycle, etc.
The idea of this aspect of observability is to answer questions like:
- How many deployments are done per week? Per month? Per environment?
- How many deployments finished successfully?
- How many deployments failed or are out of sync?
- How many deployments introduced a bug or a regression?
The Deployment Velocity
An improvement of the deployment frequency is usually followed by an improvement of the deployment velocity. Indeed, deploying more frequently usually means that the updates are small and thus easily manageable. Velocity is the result of a tendency to deploy micro changes more frequently. allowing for better control of the infrastructure in order to minimize impact. Also, increasing velocity can help with meeting RTO/RPO requirements in case of an outage.
The idea of this aspect of observability is to answer:
How long does it take to deploy a new feature? A fix?
The previous topics deal a lot with the human-process part of GitOps. Latency-related metrics help understand the behavior of the controller and, more precisely, what amount of time the controller takes to really apply changes. Understanding how the controller works and the different stages that are implemented in the process can help reduce the latency and increase the velocity even more by making the current state converge faster to the desired state.
The idea behind this aspect of observability is to answer questions like:
- How long does it take for the current state to match the desired state?
- How many pending tasks are in the queue?
- Are there out-of-sync deployments?
How To Observe
The how-to depends on multiple factors. The idea obviously is not to introduce a bunch of new tools that will also need observability. The first thing to do is probably to rely on the existing monitoring and logging tools to grab and present the data in a nice dashboard.
But as the GitOps community grows, interesting projects that deserve special attention will continue to emerge.
Keptn is a control plan for DevOps automation of cloud-native applications. It's an SLO-driven multi-stage delivery tool that automates observability and alerting. This project was initially created by Dynatrace and it’s now managed by the Linux Foundation as a sandbox project of the CNCF.
Keptn has the ability to be integrated into multiple tools, like Argo, Gitlab, Jenkins, Prometheus, Locust, Slack, etc.
Keptn, with his Quality Gate, adds another layer to the quality evaluations of any pipelines. Thanks to the Quality Gate, Keptn can prevent shipping bad code to production without any human qualification.
Open Telemetry tries to fix one big problem in observability, rather than having to manage agent software for individual IT management tools, it will soon be possible to deploy a single set of agent software to generate data that can be consumed by multiple platforms. The main purpose is to not focus on the tool used to send data to multiple vendors but on the data itself.
This kind of framework is definitely the right way to standardize the collection of metrics, logs, or traces to quickly get insights from applications, systems, pipelines, etc., and in the same way, improve the observability of an entire infrastructure.
As previously said, the first thing to do is probably to rely on existing tools and processes and improve them. You might already have a pipeline that pushes the changes into your environment after merging a pull request; or, in GitOps words, a controller that is responsible for synchronizing the desired state with the current state. In this existing pipeline, we could send metrics and/or messages that might improve observability.
For instance, notify an external system (NewRelic, Prometheus, Signoz, etc) every time a deployment succeeds or fails; we could even notify when the synchronization has been initiated and when the system is actually synchronized, or not. This could help to measure the latency, as well as calculating a ratio between successes and failures of deployments, and therefore, improve deployment frequency. This is how observability can improve overall SDLC quality.
Even though new tools are flooding the market, it is also possible to implement GitOps observability with existing tools and processes.
GitOps is a huge improvement for the management of infrastructure and requires the right level of monitoring and observability to be efficient.
The purpose of this article was to provide an overview of the observability with GitOps and some interesting projects that should help to improve the collection of data to improve the GitOps workflow. Obviously, one article is not enough to explain all the aspects of the observability with GitOps, that’s why in the next article, we will talk about how the GitOps methodology and the observability can improve some security aspects of the infrastructure management. Stay connected!
For more information on the observability in the GitOps methodology:
Opinions expressed by DZone contributors are their own.