Observability 101: Terminology and Concepts
Welcome to Observability 101! This post helps orient folks who want to learn more about observability but feel intimidated by the vocabulary list.
When I first started following Charity on Twitter back in early 2019, I was quickly overwhelmed by the new words and concepts she was discussing. I liked the results she described: faster debugging, less alert fatigue, happier users. Those are all things I wanted for my team! But I was hung up on these big polysyllabic words, which stopped me from taking those first steps toward improving our own observability.
This post is my attempt to help orient folks who want to learn more about observability but maybe feel overwhelmed or intimidated by the vocabulary list, like I did. My goal is to get everyone on the same page about what these words mean so that we can focus on leveraging the tools and ideas to build better software and deliver more value!
Welcome to Observability 101!
In software, observability (abbreviated as “o11y”) is the ability to ask new questions of the health of your running services without deploying new instrumentation. The term comes from control theory, which defines observability as the ability to understand the internal state of a system from its external outputs.
Telemetry consists of those “outputs”—it’s the data generated by your system that documents its state. Telemetry gets generated because of instrumentation: code or tooling that captures data about the state of your running system and stores it in various formats. Some examples of software telemetry include: metrics, logs, traces, and structured events.
Modern software teams have gotten good at accounting for failures that can be caught by tests and continuous integration tooling. We use retries and autoscaling and failovers to make our systems more resilient in the wild world of production. Since we catch and respond to known variables so well, what we’re left with are the unknown-unknowns. The types of issues we often see in modern production software systems are emergent failure modes, which happen when a bunch of unlikely events line up to degrade or take down your system. They’re really interesting but difficult to debug, which is why we need observability.
In order to have good observability into your software services, you need both good instrumentation generating high-context telemetry data, and you need sophisticated ways of interacting with that data that enable asking novel questions—questions you couldn’t have thought of when you wrote the code. Put more simply, software observability requires good data and good tooling. Let’s start by discussing the data.
To reiterate, telemetry is data that your system generates that tells you about the system’s health. The term comes from the Greek tele-, meaning “remote”, and -metry, meaning “measure.” You probably already generate telemetry from your programs, even if you’re not paying for monitoring or logging services. In fact, even the output from a `console.log()` statement is a form of telemetry!
Let’s discuss the most common forms of telemetry generated from production software services.
One commonly used form of telemetry data in software is metrics. A metric consists of a single numeric value tracked over time. Traditional monitoring uses system-level metrics to track things like CPU, memory, and disk performance. This data is important for choosing among virtual machine instance types, with options for processor speed, RAM, and hard disk storage. But it doesn’t tell you about user experience, or how to improve the performance of your code.
Modern monitoring services also provide application performance monitoring (APM) features, which track application-level metrics like average page load times, requests per second, and error rates. Each metric only tracks one variable, which makes it cheap to send and store. Values are pre-aggregated at write-time, however, so you need to deploy a code change if you want to track metrics for a new intersection of data, e.g. error rates for a specific user.
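To make the write-time aggregation tradeoff concrete, here is a minimal sketch (the counter names and endpoints are illustrative, not any real metrics API): each metric is a single number, so every new breakdown you want later requires a new counter decided up front.

```python
from collections import defaultdict

# A minimal sketch of write-time aggregation: each metric is a single
# numeric value, so every new breakdown needs its own pre-declared counter.
counters = defaultdict(int)

def record_request(endpoint, status):
    counters["requests_total"] += 1      # one number, cheap to send and store
    if status >= 500:
        counters["errors_total"] += 1
        # Tracking error rates per endpoint means a counter per endpoint,
        # chosen at write-time. Wanting errors per *user* later would
        # require a code change and a redeploy.
        counters[f"errors.{endpoint}"] += 1

record_request("/home", 200)
record_request("/checkout", 503)
print(counters["errors_total"])
print(counters["errors./checkout"])
```

The cheapness of metrics comes from throwing away context at write-time; the cost is that any question you didn’t anticipate needs new instrumentation.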
Logs are text strings written to the terminal or to a file (often referred to as a “flat” log file). Logs can be any arbitrary string, but programming languages and frameworks have libraries to generate logs from your running code with relevant data at different levels of specificity (e.g. `DEBUG` mode). There’s no standardization among programming communities about what should get included at each log level.
Log aggregation services allow you to send the content of your logs and store them for later retrieval. Querying flat logs (as opposed to structured logs) is slow because of the computational complexity of indexing and parsing strings, which makes these tools impractical for debugging and investigating your production code in near-realtime.
A trace is a visualization of the events in your system showing the calling relationship between parent and child events as well as timing data for each event. The individual events that form a trace are called spans. Each span stores the start time, duration, and `parent_id`. Spans without a `parent_id` are rendered as root spans.
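The parent/child structure above can be sketched with plain dictionaries (the span names and IDs here are invented for illustration): the span with no `parent_id` is the root, and the rest nest under it by following `parent_id` links.

```python
# A minimal sketch of spans forming a trace: each span records its start
# time, duration, and parent_id; the span with parent_id=None is the root.
spans = [
    {"id": "a", "parent_id": None, "name": "GET /checkout", "duration_ms": 120},
    {"id": "b", "parent_id": "a",  "name": "auth",          "duration_ms": 30},
    {"id": "c", "parent_id": "a",  "name": "db.query",      "duration_ms": 60},
]

def render(spans, parent_id=None, depth=0):
    """Walk parent_id links to render the trace as an indented tree."""
    lines = []
    for s in spans:
        if s["parent_id"] == parent_id:
            lines.append("  " * depth + f'{s["name"]} ({s["duration_ms"]}ms)')
            lines.extend(render(spans, s["id"], depth + 1))
    return lines

print("\n".join(render(spans)))
```

A trace visualization tool is essentially doing this walk, then drawing each span as a bar positioned by its start time and sized by its duration.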
A distributed trace connects calling relationships among events in distributed services. For example, Service A calls Service B, which makes a database query and then hits a third-party API.
A structured event is a data format that allows you to store key-value pairs for an arbitrary number of fields or dimensions, often in JSON format. At a minimum, structured events usually have a timestamp and a `name` field. Instrumentation libraries can automatically add other relevant data like the request endpoint, the user-agent, or the database query.
When we talk about wide events, we mean structured events with lots of fields. You can add whatever fields you want, and because they’re key-value pairs, it’s easy to query for them later on.
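Here is what a wide event might look like as a sketch (every field name below is illustrative, not a required schema): one flat bag of key-value pairs per unit of work, serialized as JSON.

```python
import json

# A sketch of one wide structured event: a flat bag of key-value pairs
# describing a single request. Add whatever fields might be useful later.
event = {
    "timestamp": "2023-04-01T12:00:00Z",
    "name": "http_request",
    "endpoint": "/checkout",
    "user_agent": "Mozilla/5.0",
    "db_query": "SELECT * FROM carts WHERE user_id = ?",
    "duration_ms": 212,
    "user_id": "1234",
}

# Serializes losslessly, so every field is queryable later.
print(json.dumps(event))
```

Because nothing is aggregated away, any of those fields can become a query dimension after the fact.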
Events can be built into traces by pointing each event to its parent event with the `parent_id` field.
These are some more terms commonly used when discussing observability.
Context refers to the collection of additional dimensions, fields, or tags on a piece of data that tells you more about the internal state of your system. Context can include metadata, system-level metrics, and domain-specific information about the running program.
The dimensionality of data refers to how many fields or attributes a piece of data has attached to it. Examples of low-dimensionality data include metrics and flat logs. High-dimensionality data includes anything that can support many fields or attributes, including JSON blobs, structs, and objects.
The term cardinality refers to how many possible values a dimension can have. Some examples of low-cardinality data include booleans (true/false) and days of the week. High-cardinality examples are first name, social security number, UUID, or build ID.
Cardinality becomes especially relevant when you start considering multiple dimensions at the same time. For a REST API, for example, endpoints can number in the dozens or hundreds (not counting unique IDs). If you want to look at the behavior of each endpoint across the over 43 million user agents in the world, you get what’s called a cardinality explosion. Imagine trying to track the intersection of endpoint × user agent with application-level metrics: that’s billions of distinct metric series. Thus when you have questions that require high-cardinality data to answer, you need telemetry and tools that support it.
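The back-of-the-envelope math behind that explosion is just multiplication (the endpoint count below is an assumed mid-sized API, not a figure from the text):

```python
# Cardinality explosion: the number of distinct (endpoint, user_agent)
# combinations is the product of each dimension's cardinality, and a
# pre-aggregated metrics system needs one series per combination.
endpoints = 200              # assumption: a mid-sized REST API
user_agents = 43_000_000     # distinct user agents in the wild
combinations = endpoints * user_agents
print(f"{combinations:,} metric series")
```

Every additional dimension multiplies that number again, which is why write-time aggregation collapses under high-cardinality questions.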
Instrumentation With Honeycomb
Telemetry gets generated from instrumentation, which is code or tooling that gathers data from your system in real-time. Whatever observability tool you’re using, there’s some setup and configuration to do in order to generate telemetry data and then send it off to your vendor or service.
For brevity and clarity, I’m limiting this section to Honeycomb-specific instrumentation.
Instrumenting Code for Structured Data
At Honeycomb, we feel that you get the best observability into your systems when you instrument your code. There are a number of ways you can do that.
You can roll your own structured data by generating structured logs using your language or framework’s existing logging libraries. To get your structured logs into Honeycomb, this approach requires first instrumenting your code and then deploying an agent that’s configured to send the data to Honeycomb.
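A rolled-your-own structured log can be as simple as one JSON object per line emitted through your language’s standard logging library. This sketch uses Python’s `logging` module; the `log_event` helper and its field names are illustrative, not part of any Honeycomb SDK.

```python
import json
import logging
import sys

# Emit one JSON object per line via the standard logging module.
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter("%(message)s"))
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def log_event(**fields):
    """Serialize arbitrary key-value pairs as one structured log line."""
    line = json.dumps(fields)
    logger.info(line)
    return line

log_event(name="http_request", endpoint="/checkout", status=200, duration_ms=212)
```

An agent tailing this output can then forward each JSON line to your telemetry backend without any string parsing.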
Under the hood, Honeycomb’s Beeline integrations use Libhoney, our low-level event-handling SDK, in each of the supported languages. Libhoney does not support tracing out of the box, and you should rarely need to interface with it directly. If you’re just getting started with Honeycomb, we recommend using a Beeline or OpenTelemetry. The Get Your Data In docs provide more guidance on choosing an instrumentation library.
What Observability Tooling Needs
Observability goes beyond application-level monitoring. It goes beyond tracing. It goes beyond “three pillars.”
Observability tooling needs to support structured events, unaggregated data blobs containing whatever key-value pairs you decide to send. It needs to empower you to flow seamlessly between broad aggregates (time-series graphs) and deep context (trace visualizations) and then broad aggregates again, using your structured events’ context fields to ensure that the answers you get are relevant and informative. Observability means aggregating at query-time, not at write-time, because you never know what questions you might want to ask later on.
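Query-time aggregation can be sketched in a few lines (the events and field names are invented for illustration): because raw events keep all their fields, you can group by any dimension after the fact, with no code change or redeploy.

```python
from collections import defaultdict

# Query-time aggregation sketch: raw events retain every field, so any
# field can become a group-by dimension when you ask the question.
events = [
    {"endpoint": "/home",     "user_id": "1", "duration_ms": 80},
    {"endpoint": "/checkout", "user_id": "2", "duration_ms": 900},
    {"endpoint": "/checkout", "user_id": "2", "duration_ms": 700},
]

def p50_by(events, field):
    """Median duration grouped by an arbitrary field, computed at query-time."""
    groups = defaultdict(list)
    for e in events:
        groups[e[field]].append(e["duration_ms"])
    return {k: sorted(v)[len(v) // 2] for k, v in groups.items()}

print(p50_by(events, "endpoint"))   # group by endpoint...
print(p50_by(events, "user_id"))    # ...or by user, from the same raw data
```

A write-time-aggregated system would have had to pre-compute both breakdowns; here the choice of dimension is deferred until the question is asked.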
With observability, you can deeply interrogate your code behavior in production because you have more than just isolated time-series graphs and some trace visualizations. You can send whatever fields might be relevant inside your structured events, and then ask new questions that your vendor (and you!) could never anticipate.
Have questions? Notice something I missed? Reach out to me on Twitter.
Published at DZone with permission of Shelby Spees. See the original article here.
Opinions expressed by DZone contributors are their own.