Refcard #364

Full-Stack Observability Essentials

Using OpenTelemetry for Flexibility

Observability and telemetry work together to correlate the health of individual systems with the overall health of the business, highlighting what’s going on within the complex systems, processes, and microservices of an entire tech stack and/or application, purely from the existing data streams collected. In this Refcard, explore full-stack observability essentials and how to adopt OpenTelemetry for increased flexibility.


Brought to You By

Sumo Logic

Written By

Joana Carvalho
Performance Engineer, Postman
Section 1

What Is Observability?

The concept of observability (often abbreviated “o11y”) was first introduced in control theory by the engineer Rudolf E. Kálmán, who defined observability as “the ability to understand the current state of a system using only its external outputs.” If a system is a black box, its internal state cannot be understood, and it therefore becomes harder to maintain and predict. That’s where high-quality telemetry and observability come in.

Observability and telemetry signals — logs, metrics, traces, events, and metadata — work together to correlate the health of individual systems with the overall health of the business, highlighting what’s going on within the complex systems, processes, and microservices of an entire tech stack and/or application, purely from the existing data streams collected. Ultimately, observability gives developers and operations teams a greater level of insight into the health of their systems.

In order to understand and manage a service, we need to measure, observe, and quantify its behavior. For that, we rely on a set of KPIs that make sense for our domain and that define what’s expected of our service. Defining these KPIs is paramount to understanding what actions need to be taken if the appropriate level of service is not delivered, ideally before the consumer of the service notices degradation. This helps build a relationship of trust that can be maintained and quantified using abstractions such as service-level indicators (SLIs), service-level objectives (SLOs), and service-level agreements (SLAs).

Difference Between Observability and Monitoring 

Observability is not meant to replace monitoring. On the contrary, observability is meant to amplify monitoring’s potential. Monitoring is an action; it can be done by a human or an automated process, and it requires knowing in advance what signals to look for. Monitoring can be used to generate alerts, provide insights, suggest actions, measure traffic or real-user activity, and warn when issues occur. 

Observability, on the other hand, lets you understand why an issue occurred. It is an approach that enables teams to ask questions about the holistic state of a system. A system can be monitored, but if its telemetry does not provide sufficient or accurate data to infer its status, then observability cannot be achieved.

Section 2

Essentials to Full-Stack Observability

Full-stack observability is the ability to understand at any time what is happening across a technology stack. By collecting, correlating, and aggregating all telemetry from its components, it provides insight into the behavior, performance, and health of a system. Through full-stack observability, teams can deeply understand the dependencies across domains and system topology. In contrast to traditional monitoring systems, full-stack observability enables IT teams to predict, prevent, and react to problems proactively using artificial intelligence and machine learning, which is all but a requirement given that the amount of data collected would be nearly impossible to analyze manually. Presenting this data in the form of unified analytics and dashboards gives the observer a complete picture of the health of the system, for example, where issues occurred and which solutions were introduced.

Telemetry (MELT)

To achieve observability of a system, we rely on the aggregation of telemetry data from four categories. This is the raw data that feeds the system — Metrics, Events, Logs, and Traces (or MELT) — and a brief sketch of each signal's shape follows the list:

  • Metrics: A numerical representation of measurements over a period of time. They can be used to report on the health of a system. Examples include the amount of memory in use at a given time, the queue size of a message broker, or the number of errors per second. 
  • Events: Immutable, time-stamped records of discrete occurrences, usually emitted by a service as a result of actions in the code. 
  • Logs: Lines of written text describing an event that occurred at a given time, usually the result of a block of code being executed. Logs can be represented as plain text, structured text (like JSON), or a binary format. They are especially useful for troubleshooting systems that are less amenable to instrumentation, such as databases or load balancers. The basic “Hello World” is usually the first log any developer writes, and logs tend to be more granular than events. 
  • Traces: Representations of the flow of a single transaction or request as it goes through a system. Traces should show the path that is followed, the latency each component adds along the way, and the relevant information associated with each component that might indicate bottlenecks or issues. These data streams should be able to answer questions easily and clearly about the availability and performance of a system. 
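
To make these categories concrete, here is a minimal, illustrative Python sketch of what each signal might look like as data. The names, attributes, and values are invented for illustration and are not tied to any specific backend or SDK.

Python
 
import json
import logging
import time
import uuid

# Metric: a numeric measurement with a timestamp and descriptive attributes.
metric = {"name": "queue.size", "value": 42, "unit": "messages",
          "timestamp": time.time(), "attributes": {"broker": "orders"}}

# Event: an immutable, time-stamped record of something that happened once.
event = {"name": "order.placed", "timestamp": time.time(),
         "attributes": {"order_id": "A-1001", "total": 59.90}}

# Log: a (preferably structured) line of text describing what the code did.
logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s", level=logging.INFO)
logging.getLogger(__name__).info(json.dumps({"msg": "order persisted", "order_id": "A-1001"}))

# Trace span: one step of a request, linked to its trace and its parent span.
span = {"trace_id": uuid.uuid4().hex, "span_id": uuid.uuid4().hex[:16],
        "parent_span_id": None, "name": "POST /orders",
        "start": time.time(), "duration_ms": 12.5}

print(json.dumps({"metric": metric, "event": event, "span": span}, indent=2))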

Microservices 

The single responsibility principle was first coined by Robert C. Martin and became the basis of the microservices philosophy:

“Gather together the things that change for the same reasons. Separate those things that change for different reasons.” 

A microservice architecture follows this approach by arranging an application into smaller, loosely coupled services that can be developed, deployed, and maintained independently. The services communicate with one another through APIs as building blocks of the overall system. This enables a level of development agility that monoliths do not offer: Microservices can be developed and fixed separately, isolating the rest of the system from issues in any single component. They also give a development team the freedom to select the best technology for the problem at hand, and each service can be altered, enhanced, and scaled without crossing other services’ borders.

Microservices, however, are not the holy grail that solves all problems, and companies can easily reach hundreds or thousands of microservices, making security, observability, and network policies more complex to address. That’s where service meshes come in, enabling managed, observable, and secure communication between services. Service meshes move the logic governing inter-service communication out of the services themselves and into an infrastructure layer.

 

Although not every service mesh implementation works this way, in most of them requests are routed between microservices through proxies that live in their own infrastructure, decoupling business logic from network functions. The individual proxies sit beside the services and are for that reason often called “sidecars.” Together, these decoupled proxies form a service mesh network. Without a service mesh, each microservice must govern service-to-service communication itself, which not only becomes a hidden point of failure but is also more costly to maintain.

Service meshes can help bring visibility into the application layer without code changes, making it easier to configure the collection of metrics; because all traffic goes through the proxies in the mesh, the observer gains greater visibility into all service interactions. Each proxy reports on its portion of the interaction, providing metrics such as inbound and outbound proxy traffic and service-level metrics, as well as access logs with service calls. Distributed traces, which the mesh generates from information on every service within it, make it easier to follow a request across multiple services and proxies.
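
As a rough illustration of the sidecar idea, the following toy Python proxy forwards GET requests to a co-located service and records latency and status for each call. The upstream address, port, and metric format are invented for illustration; a production mesh proxy does far more than this sketch.

Python
 
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = "http://localhost:8080"  # the co-located service this sidecar fronts (illustrative)

class SidecarProxy(BaseHTTPRequestHandler):
    """Forwards each request to the local service and records basic telemetry."""

    def do_GET(self):
        start = time.time()
        try:
            with urllib.request.urlopen(UPSTREAM + self.path) as upstream:
                body = upstream.read()
                status = upstream.status
        except Exception:
            body, status = b"bad gateway", 502
        self.send_response(status)
        self.end_headers()
        self.wfile.write(body)
        # A real proxy would export these as metrics and access logs; here we just print them.
        print(f"outbound path={self.path} status={status} "
              f"duration_ms={(time.time() - start) * 1000:.1f}")

if __name__ == "__main__":
    # The proxy listens on its own port; callers talk to it instead of the service directly.
    HTTPServer(("localhost", 9090), SidecarProxy).serve_forever()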

Since microservices are distributed by nature, they can be challenging to monitor and observe. Correlating logs from many different services and going through them to figure out where issues can occur is one of the main pain points in such an architecture.   

Distributed Tracing 

As mentioned earlier, distributed tracing enables development and DevOps teams to debug and monitor complex distributed systems such as microservices by tracking transactions from service to service, throughout the stack. This holistic approach empowers teams to make informed decisions when needed by combining server-side and client-side monitoring together with the power of visualization tools. This is end-to-end observability.   

A distributed tracing system collects data as requests go from service to service, recording each segment as a span (or step) that contains the details; the spans are then combined into one trace. Once completed, a trace represents the whole journey of a request as it flows through the system. The following image shows a flame chart, one of the preferred visualizations for traces; it shows the execution path through the system represented in Figure 5.

A span is, for example, the Search Courier action in the Dispatch microservice, which has three child spans. By combining spans into a trace and exposing the parent-child relationships between them, it becomes possible to visualize the granularity of the trace, its dependencies, and how long each dependency takes.

In practice, distributed tracing starts with a single request or transaction. Each request is marked with a unique ID, often known as a trace or correlation ID, that identifies that specific transaction from then on; the ID is passed along in the request headers to propagate the trace context between subsequent requests.
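
As a hand-rolled sketch of what that propagation looks like on the wire, the snippet below builds headers in the W3C Trace Context (traceparent) format. In practice, an instrumentation library’s propagator does this automatically; the helper names here are invented for illustration.

Python
 
import secrets

def new_traceparent() -> str:
    """Start a new trace: version 00, 128-bit trace ID, 64-bit span ID, sampled flag."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"

def child_traceparent(parent: str) -> str:
    """Keep the trace (correlation) ID but mint a new span ID for the outgoing call."""
    version, trace_id, _parent_span_id, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

# The caller starts a trace; each downstream HTTP call carries the propagated context.
incoming = new_traceparent()
outgoing_headers = {"traceparent": child_traceparent(incoming)}
print(incoming)
print(outgoing_headers)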

Visualization and Reporting 

Gathering and correlating all the data in a system is challenging, especially without the appropriate tools and techniques to visualize and process the telemetry in a meaningful manner. Being able to understand dependencies throughout the stack, as well as across services, processes, and hosts, gives observers and SRE teams the ability to predict issues and understand the topology of their service.

Creating a baseline for every system component, automating node discovery, and adding instrumentation to applications can help shift the focus from manual configuration and setup to proactive alerting and a reporting system based on the correlation of the data collected. In multi-cloud environments, applications can reach millions of components, which makes it especially challenging to understand the context between them. One way to visualize the relationships between these components is through topology maps. 

Analyzing distributed traces is often one of the more effective ways to perform root cause analysis, and several visualization aids help with profiling.

Flame or Icicle Charts

Flame or icicle charts help developers see the relationship between service calls and the impact on the overall single trace, like unusually high latency or errors and how they impact other calls and services. They represent each call on a distributed trace as a horizontal (see Figure 6) or vertical (see Figure 7) line that is time-stamped and contains details for each span.  

Trace Map 

A trace map shows the connection between spans and allows one to quickly understand the relation between services. 

 

Trace Tree 

A trace tree is similar to a trace map but is represented vertically with the parent span as the root node. 

Sunburst Diagram 

A sunburst diagram represents a trace using a set of rings and semi-rings. The inner circle is the parent span, and the semi-rings fanned out around it represent the parent-child relationships between spans.

Telemetry Querying 

Storing all telemetry from a system in one place is very powerful; however, IT teams need to be able to interrogate this data in order to answer meaningful questions. Luckily, teams are able to rely on query languages (like SQL, Kusto, etc.) for analysis. The intention of those query languages is to provide a read-only query request that will process the data and return a result. That result can then be used to create visualizations, conduct troubleshooting, perform business analysis, etc. 

This is an example of a SQL query that returns the ten slowest operations (by 95th-percentile duration) within the last hour; the span table and the approx_percentile/percentile_agg functions are assumed to be provided by the telemetry backend storing the traces:

SQL
 
SELECT service, 
   span_name, 
   ROUND( 
       -- approximate 95th percentile of span duration, in ms
       approx_percentile(0.95, percentile_agg(duration_ms))::numeric, 
       3 
   ) as duration_p95, 
   ROUND(avg(duration_ms)::numeric, 3) as duration_avg 
FROM span 
WHERE start_time > NOW() - INTERVAL '1 hour'  -- only spans started in the last hour
   AND parent_span_id = 0                     -- root spans, i.e., whole operations
GROUP BY service, 
   span_name 
ORDER BY duration_p95 DESC                    -- slowest operations first
LIMIT 10 

 

And the result would be: 

service     span_name              duration_p95 (ms)   duration_avg (ms)
frontend    /cart/checkout         25000               500
dispatch    /calculateVat          10000               9000
auth        /login                 1700                100
frontend    /                      1100                270
frontend    /product/{id}          750                 110
frontend    /cart                  600                 90
checkout    /generateInvoice       200                 30
frontend    /setCurrency           8                   0.5
payment     /getPaymentPage        1.5                 0.45
dispatch    /getAvailableCourier   1.2                 0.39

 

Based on this information, we can immediately see that there is a performance issue in the frontend service for the /cart/checkout endpoint: Although the average can be considered good (500 ms), at least 5% of users will have a poor experience (~25 seconds). Having the ability to use query languages to crunch and correlate this data is very powerful. In the example above, we could cross-reference the slowest operations on the problematic endpoint with metrics like CPU usage, memory consumption, etc. Combining OpenTelemetry with the flexibility of a powerful query language allows users to gain more value from the telemetry stored.

Section 3

OpenTelemetry

Inspired by OpenTracing and OpenCensus, OpenTelemetry has one goal: to give developers a vendor-agnostic specification for telemetry, thus enabling teams to trace a request from start to finish by instrumenting each transaction. Since the introduction of these tools, the industry has recognized the need for collaboration in order to lower the shared cost of the software instrumentation required to gain this visibility. OpenTracing and OpenCensus have led the way in that effort.   

OpenTracing became a CNCF project in 2016, and the OpenCensus project was made open source by Google in 2018. This meant that there were two competing tracing frameworks that shared the same goal but weren’t mutually compatible. Although competition usually means better software, in the open-source world this is not necessarily true, as it can lead to poor adoption, contribution, and support (a period often called “the Tracing Wars”). To avoid this, it was announced at KubeCon 2019 in Barcelona that the projects would converge into OpenTelemetry and join the CNCF. Hence OpenTelemetry, or OTel for short, was born.

Purpose and Audience

With the ultimate goal of providing a unified set of vendor-neutral standards, libraries, integrations, APIs, and SDKs that make robust, portable telemetry a built-in feature of services, OpenTelemetry has become the de facto standard for adding flexible full-stack observability to cloud-native applications. This open standard enables any company with any technology stack to gather observability data (including distributed traces, logs, and metrics) from all of their systems.

Today, if every microservice had to be instrumented by hand, teams would likely spend almost as much time building and maintaining telemetry as building and maintaining the software itself. That’s where auto-instrumentation steps in, making it possible to collect application-level telemetry without manual changes to the code. Auto-instrumentation can trace the path of a transaction as it navigates different components, including:

  • Application frameworks 
  • Communication protocols 
  • Data stores 

Tracing all operations involved in a transaction provides an end-to-end view of how the service functions. We can then visualize, aggregate, and inspect what has been collected to understand the experience of individual users, identify bottlenecks, and map out dependencies between services.  
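
For instance, here is a minimal sketch of framework-level auto-instrumentation for a Flask application. It assumes the opentelemetry-sdk and opentelemetry-instrumentation-flask packages are installed, and the route and console exporter are illustrative choices only.

Python
 
from flask import Flask
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor

# Configure the SDK with a simple console exporter (swap in an OTLP exporter in practice).
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)  # every incoming HTTP request now produces a span

@app.route("/ping")
def ping():
    return "pong"

if __name__ == "__main__":
    app.run(port=5000)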

Key Components 

When integrating OpenTelemetry into your application stack, the telemetry delivery pipeline consists of: 

  • APIs (application programming interfaces): APIs define how applications speak to one another and are used to instrument an application or service. These APIs are generally available for developers to use across popular programming languages (e.g., Ruby, Java, Python, etc.). And because these APIs are part of the OpenTelemetry standard, they will work with any OpenTelemetry-compatible backend system moving forward — eliminating the need to re-instrument in the future. 
  • SDK (Software Development Kit): The SDK is also language-specific, providing the bridge between APIs and the exporter. 
  • Exporter: The exporter sends telemetry to a configured backend. It separates the instrumentation from the backend configuration. This allows users to easily switch backends without re-instrumenting the code. 
  • Collector: The collector is used for data collection, filtering, aggregation, and batching. It allows greater flexibility for receiving and sending the service data to the backends. The collector has two primary deployment models: 
    • A collector instance running as an agent that lives within the application or the same host as the application (by default, OpenTelemetry assumes a locally running collector is available) 
    • One or more collector instances running as a standalone service 

Image source: Schema based on OpenTelemetry: beyond getting started 

As OpenTelemetry is a library framework for receiving, processing, and exporting telemetry, it requires a backend to receive and store the data.  
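
To make the delivery pipeline above concrete, the following is a minimal sketch of manual instrumentation with the OpenTelemetry Python SDK, exporting spans over OTLP to a collector assumed to be listening on its default gRPC port (localhost:4317). The service name, span names, and attribute are illustrative, and package paths may vary across SDK versions.

Python
 
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# SDK: the tracer provider ties the resource, processor, and exporter together.
provider = TracerProvider(resource=Resource.create({"service.name": "dispatch"}))
# Exporter: ships batches of finished spans to the collector over OTLP/gRPC.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

# API: application code only talks to the tracer, never to the backend directly.
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("search-courier") as span:
    span.set_attribute("courier.zone", "lisbon")  # illustrative attribute
    with tracer.start_as_current_span("query-couriers"):
        pass  # the actual work would happen inside the child span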

A Standard for the Future 

Correlating and analyzing data can be cumbersome for development teams trying to gain insight into their applications, especially nowadays with highly ephemeral infrastructure and a vast number of services to manage. OpenTelemetry aims to simplify data collection so that the focus can be on data analysis and processing, while creating a standard that replaces the proprietary and in-house implementations used so far.

With almost 2,000 contributors and more than 300 companies backing and maintaining the project over the last year, OpenTelemetry provides access to an extensive set of telemetry collection frameworks. This also means that whenever new technologies arise, the community responds and helps with support rather than waiting for a vendor to provide it.

By standardizing instrumentation across different stacks, technologies, and providers, OpenTelemetry helps close the gaps between services, which means fewer blind spots in the system and faster detection and resolution of issues.

Section 4

Conclusion

In order to reach full-stack observability of modern distributed application stacks, data from the entire system must be collected, processed, and correlated in a visually meaningful way. This can be challenging for development teams to achieve, as they have to support all stacks, frameworks, and providers.

Using only a commercial solution bolted onto the application in production to collect telemetry can leave teams locked into a vendor and make it hard to change course in the future. OpenTelemetry has become the go-to solution for instrumenting applications without being bound to a vendor, and in some cases this involves simply adding a few lines of code to the solution. By standardizing how frameworks and applications collect and send observability data, OpenTelemetry aims to solve some of the challenges created by the heterogeneity of stacks, giving teams a vendor-neutral, portable, and pluggable solution that is easily configured with open-source and commercial backends alike.
