Observability: Let Your IDE Debug for You
Building observability into applications allows teams to automatically collect and analyze data to optimize applications and resolve issues before they impact users.
Current events have pushed many enterprises even harder to scale operations across cloud-native and distributed environments. To survive and thrive, companies must now seriously look at cloud-native technologies—such as API management and integration solutions, cloud-native products, integration platform as a service (iPaaS), and low-code platforms—that are easy to use, accelerate time to market, and enable reuse and sharing. However, due to their distributed nature, these cloud-native applications have a higher level of management complexity, which increases as they scale.
Building observability into applications allows teams to automatically collect and analyze data about applications. Such analysis allows us to optimize applications and resolve issues before they impact users. Furthermore, it significantly reduces the debugging time of issues that occur in applications at runtime. This allows developers to focus more on productive tasks, such as implementing high-impact features.
Observability has three key components: logs, metrics, and traces. To get a complete understanding of a system’s behavior, it is necessary to collect all three components. Having one or two is not sufficient to debug complex behaviors of applications.
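As a concrete illustration of the three pillars, they can be thought of as three distinct record shapes. This is a hypothetical sketch, not Choreo's or Ballerina's actual data schema:

```python
from dataclasses import dataclass

@dataclass
class LogRecord:      # logs: discrete, timestamped events with a message
    timestamp: float
    level: str
    message: str

@dataclass
class MetricSample:   # metrics: numeric measurements aggregated over time
    name: str
    value: float
    timestamp: float

@dataclass
class Span:           # traces: timed operations linked by a shared trace ID
    trace_id: str
    name: str
    start: float
    end: float

    def duration_ms(self) -> float:
        return (self.end - self.start) * 1000.0
```

A log tells you what happened, a metric tells you how much or how often, and a span tells you where time went on a single request path; debugging complex behavior typically requires all three.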
In this article, we discuss the importance of observability. We also look at how Choreo—our new integration platform as a service (iPaaS) for cloud-native engineering—enables developers of all skill levels to observe performance, identify anomalies, and troubleshoot issues using deep observability functionality.
Building observability into applications requires significant effort if we want to collect tracing, logging, and metrics data in production without impacting application performance. Ensuring that observability has minimal impact on performance is one of the main challenges faced while developing observability frameworks. Tracing, in particular, adds significant performance overhead if every request is traced.
To make an application observable, developers need to write performance-optimized observability code and use algorithms, such as adaptive sampling, which can dynamically control performance overhead under varying traffic conditions. Unfortunately, many application developers do not implement code that collects all three pillars of observability.
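As a sketch of the adaptive sampling idea (not Ballerina's actual algorithm), a sampler can adjust its sampling probability at the end of each window so that the number of traces kept stays near a target rate regardless of traffic volume:

```python
import random

class AdaptiveSampler:
    """Keeps sampled traces near `target_per_sec` by rescaling the
    sampling probability after each window. A simplified sketch."""

    def __init__(self, target_per_sec: float, initial_probability: float = 1.0):
        self.target = target_per_sec
        self.probability = initial_probability
        self._seen = 0  # requests observed in the current window

    def should_sample(self) -> bool:
        """Decide, per request, whether to record a trace."""
        self._seen += 1
        return random.random() < self.probability

    def end_window(self, window_sec: float = 1.0) -> None:
        """Rescale probability toward the target for the observed traffic."""
        observed_rate = self._seen / window_sec
        if observed_rate > 0:
            self.probability = min(1.0, self.target / observed_rate)
        self._seen = 0
```

Under low traffic the probability stays at 1.0 (everything is traced); when traffic spikes, it drops proportionally, keeping tracing overhead roughly constant.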
There are two reasons for this. First, many developers do not have the level of expertise required. Second, the effort required is significant, making it an expensive exercise.
Integrating observability into applications requires multiple rounds of optimizations and extensive testing to ensure there is no significant performance overhead.
How Choreo Helps
Choreo is designed in such a way that developers do not have to make the application observable themselves. Ballerina, the underlying programming language of Choreo, has a powerful observability framework that makes programs written in Ballerina fully observable. Choreo uses Ballerina’s observability framework to collect the observability data for its services.
Since Ballerina’s observability framework has been carefully designed to ensure minimal performance overhead, Choreo applications can run in production with observability turned on. Observability data is presented to users in various forms of visualizations. Users can then use these visualizations to optimize applications and then identify and resolve issues that could have an adverse impact on the user experience. The following figure shows the main observability page of a Choreo application.
Figure 1: Observability in Choreo: the main page
Debugging High Response Times Using Average Values
Response time is a critical performance metric that has a direct impact on the user experience. High response times can result in an unpleasant user experience, which may lead to customer churn. Unfortunately, developers often pay little attention to the response time (latency) of applications during development and only discover such issues after deploying to production.
Even in the case where developers optimize response times, workload conditions can change over time, resulting in anomalies in response times. Choreo collects the latency of applications at different points, and this allows users to immediately identify bottlenecks in their code. A high response time could be due to an issue in the code, or it can be related to a delay in a call to an external endpoint. In Choreo, developers can identify and resolve these issues without having to spend much time debugging them. The following figure shows the latency breakdown of an application developed in Choreo. Note how the different external (connector) calls contribute to total latency.
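The kind of breakdown shown in Figure 2 comes from timing each segment of the request path. A minimal sketch of such instrumentation follows; the segment names and sleeps are stand-ins for real database and connector calls:

```python
import time
from contextlib import contextmanager

# Accumulated time per named segment of the request path.
breakdown: dict[str, float] = {}

@contextmanager
def timed(segment: str):
    """Record how long a code segment takes, keyed by segment name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        breakdown[segment] = breakdown.get(segment, 0.0) + (
            time.perf_counter() - start
        )

def handle_request():
    """Hypothetical handler instrumented at each external call."""
    with timed("db_query"):
        time.sleep(0.005)   # stand-in for a database call
    with timed("payment_api"):
        time.sleep(0.05)    # stand-in for a slow external connector call

handle_request()
slowest = max(breakdown, key=breakdown.get)  # the segment to optimize first
```

The per-segment totals immediately show whether high latency comes from the service's own code or from a slow external endpoint.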
Figure 2: The latency breakdown of an application developed in Choreo
Debugging Individual Request Latencies
While we can debug certain latency behaviors using metrics, there are more complex latency-related issues that we cannot debug using the average latency values shown in the above diagram. Examples include a gradual increase in response times, sudden drops or jumps in the average response time (called level shifts), and latency spikes that occur at random points in time.
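A small synthetic example shows why averages hide such behavior: a handful of 2-second spikes barely moves the mean, while a high percentile exposes them clearly.

```python
import math

# 95 fast requests and 5 spikes of 2 seconds each (synthetic values).
latencies_ms = [20.0] * 95 + [2000.0] * 5

mean = sum(latencies_ms) / len(latencies_ms)          # 119 ms: looks tolerable
ordered = sorted(latencies_ms)
# Nearest-rank 99th percentile: the value below which 99% of samples fall.
p99 = ordered[math.ceil(0.99 * len(ordered)) - 1]     # 2000 ms: spike exposed
```

This is why per-request tracing data, rather than averages alone, is needed to debug spikes.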
Choreo allows the developer to dig into the data at a more granular level to detect such issues. For example, the tracing data collected by Choreo observability allows developers to debug latency spikes by investigating the latency breakdown of individual requests. The following figure shows the throughput and latency behavior of an application developed in Choreo.
Figure 3: Throughput and latency behavior of an application developed in Choreo
Let us assume that we want to understand the latency of a particular request. We can do this by clicking on the corresponding point in the graph. Clicking a point reveals the latencies of individual requests and their latency breakdowns (Choreo shows where a request spends the most time in the request path). This is illustrated in the diagram below.
Figure 4: Viewing where a request spends the most time in the request path
Debugging Complex Issues
While we can resolve a large number of issues (e.g., slow backend) by analyzing latency data, there are certain issues that require more detailed analysis. Such analysis requires us to look at multiple metrics as well as logs on a consolidated view. These metrics include both system metrics, such as CPU and memory, and application metrics, such as throughput, latency, and error rates. Choreo’s diagnostic view facilitates this. It allows developers to drill down and debug behaviors such as high CPU usage and relate high CPU usage to a change in another performance metric (e.g., increase in latency). The following figure shows the diagnostic view of Choreo Observability.
Figure 5: The diagnostic view of Choreo observability
Debugging in Lower Environments
It is often the case that developers have to write new code or modify existing code due to various reasons such as the introduction of new features and bug fixes. When this happens, there is a likelihood of new performance-related bugs being introduced to applications. Choreo allows the early detection of such issues in lower environments. Developers can test their applications in lower environments and compare their performance behaviors with those of the previous versions. If issues are found, they can be addressed before the new version is deployed in production. Even if it is a completely new application, developers can still test the application in lower environments by providing sample test data/cases.
Alerting on Performance Anomalies
It is not practical to continuously monitor observability dashboards to identify anomalous behavior in applications. Therefore, many systems have ways to automatically detect performance anomalies and alert relevant parties. Many alerting systems use threshold-based methods to send out alerts to the users. For example, if CPU utilization is greater than 80%, an alert is generated and sent out. Threshold-based methods have known limitations. First, they require manual configuration and expert knowledge when determining threshold values. Second, threshold-based methods are known to have lower accuracy due to their inability to detect complex anomaly patterns in applications.
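A minimal sketch of the limitation: a gradual climb that never crosses the configured value generates no alert, even though behavior has clearly changed. (The 80% threshold mirrors the example above.)

```python
def threshold_alert(samples: list[float], threshold: float = 80.0) -> list[int]:
    """Return the indices of samples that exceed a fixed threshold."""
    return [i for i, v in enumerate(samples) if v > threshold]

# CPU climbs steadily from a 20% baseline to 75% but never crosses 80%,
# so the threshold-based alert stays silent despite the clear anomaly.
cpu_pct = [20.0, 22.0, 21.0, 40.0, 55.0, 68.0, 75.0]
alerts = threshold_alert(cpu_pct)  # empty: anomaly missed
```

Detecting the climb itself requires comparing against learned baseline behavior rather than a static value.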
Choreo has a sophisticated anomaly detection framework that uses advanced machine learning and anomaly detection algorithms to detect performance anomalies in Choreo user applications. It uses observability data collected from previous applications to train these machine learning models.
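The article does not describe Choreo's models, but a simple statistical stand-in illustrates the underlying idea: learn a baseline from historical data and flag new points that deviate too far from it, with no manually configured threshold.

```python
import statistics

def zscore_anomalies(
    history: list[float], new_points: list[float], k: float = 3.0
) -> list[float]:
    """Flag points more than k standard deviations from the historical mean.
    A simple statistical sketch, not Choreo's actual ML approach."""
    mean = statistics.mean(history)
    std = statistics.stdev(history)
    return [p for p in new_points if abs(p - mean) > k * std]

# Latency baseline hovers around 100 ms; a 250 ms sample stands out.
history_ms = [100.0, 102.0, 98.0, 101.0, 99.0, 103.0, 97.0, 100.0]
anomalies = zscore_anomalies(history_ms, [101.0, 250.0, 99.0])  # flags 250.0
```

Real systems replace this with models that also capture seasonality and trends, but the principle—baseline from past data, alert on deviation—is the same.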
While service meshes attempt to provide observability through sidecars that collect data about services, the data these sidecars collect by default is not sufficient for deep debugging of cloud-native applications. Such debugging requires collecting data at more granular levels within the services (e.g., the breakdown of a particular request's latency at different points within the service).
Choreo has a powerful observability framework that addresses the issues in service meshes. Choreo collects all three pillars of observability of its applications with minimal impact on the application’s performance. It presents this data to its users in a form where they can easily debug and detect performance anomalies in applications. In addition, Choreo has advanced machine learning-based anomaly detection techniques that accurately detect the performance anomalies of Choreo applications.
Published at DZone with permission of Malith Jayasinghe. See the original article here.