The Five Tenets of Observability

What makes good observability?

By Greg Leffler · Mar. 08, 22 · Opinion

A new year is a chance for a new start, and it’s a great opportunity to rethink the monitoring and observability platform you’re using for your applications. If you’ve been using a legacy monitoring system, you’ve probably heard about observability all over the ‘net and want to figure out whether it’s really something you need to care about.

In this post, I’ll briefly explain what observability is, what a system needs to actually provide you with true observability, and how you can start the observability journey yourself.

Observability is a mindset that lets you answer questions about your business — from the user’s experience, through the application itself, and beyond to the business metrics and processes that the application enables. It’s an evolution of monitoring that greatly expands the volume of ingested data and radically expands the number and type of questions you can answer. It’s not just “metrics, traces, and logs” – observability is really about instrumenting everything and using this data to make better decisions. I wrote more about this in a different post, Observability: It's Not What You Think, that I’d encourage you to check out for an observability deep-dive.

Before I came to work at Splunk, I was an SRE (well, a systems admin at one of my jobs, but I’m old). I know first-hand how important enterprise-grade observability is, because I solved plenty of problems in the past that I wish I could have dug into with an observability system like the one we sell at Splunk. In the rest of this post, I’ll discuss five things that an observability system must do to be worth your investment, along with some examples from my experience in operations that show why they are critical.

What Differentiates One Observability Product from Another?

Every vendor will tell you that by buying their product and installing it you instantly ‘get’ observability, and in every case, including buying the product from us, this isn’t true. What you get out of the box varies a lot, however. When you’re thinking about what an observability solution will get you, you need to think of a few things that aren’t necessarily going to be published on the website or discussed in reviews. In the next section, I’ll discuss what I’ve found to be the five key tenets of an observability system. These apply to any system – commercial or homegrown – and make a real difference in how you can get value from an observability migration.

The Five Key Tenets of Observability

When evaluating an observability system, look for these five key tenets: full-stack, end-to-end visibility; real-time answers; analytics-powered insight; enterprise-grade scale and features; and open standards. Let’s dive into each of these in more detail:

Full Stack and End-to-End 

Adopting an observability platform that can’t give you 100% visibility into all your transactions, from the user’s browser, through your application, to the underlying business platform, is setting yourself up to miss something critical. This includes support for things like real user monitoring (RUM) to understand behavior in the user’s browser, but it also includes avoiding sampling, which is an antipattern in observability (read this post to learn why). In addition to the user’s experience, you’ll also need insight into backend performance, including things like database query performance or code profiling.
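
To make the cost of sampling concrete, here is a back-of-the-envelope sketch in Python. It assumes simple head-based sampling in which each transaction is kept independently with probability `sample_rate`; the function name and the numbers are illustrative assumptions, not figures from any particular product.

```python
def chance_of_capturing(sample_rate: float, affected_transactions: int) -> float:
    """Probability that at least one affected transaction is retained.

    Assumes each transaction is kept independently with probability
    `sample_rate` (simple head-based sampling, an assumption of this sketch).
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    if affected_transactions < 0:
        raise ValueError("affected_transactions must be non-negative")
    # P(at least one kept) = 1 - P(all dropped)
    return 1.0 - (1.0 - sample_rate) ** affected_transactions


if __name__ == "__main__":
    # A bug report about one user's single failed request, at 1% sampling.
    print(f"{chance_of_capturing(0.01, 1):.1%}")
    # Ten failed requests for the same user, still at 1% sampling.
    print(f"{chance_of_capturing(0.01, 10):.1%}")
    # No sampling: every transaction is retained.
    print(f"{chance_of_capturing(1.0, 1):.1%}")
```

Under these assumptions, a single failed request has a 1% chance of being kept at a 1% sample rate, and even ten failed requests leave roughly a 90% chance that none of them was retained.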

I can’t count the number of issues I had to troubleshoot at LinkedIn brought on by someone important firing off a bug report to the sre@ email list – at that point, you simply have to figure out what happened and fix it. If our tools at LinkedIn hadn’t been able to see the end-to-end history for all our users, I may not have been able to fix those issues at all, or it would have taken much longer than necessary.

Real-Time

A good observability platform must give you insights and data in real time. If you have to wait for a periodic alert rollup to find out about a problem, you’re likely to hear about it first from an angry tweet or an unhappy customer. Additionally, in a serverless world, the lifetime of a function can be in the hundreds of milliseconds (or less), so it’s critical that your platform is able to show you issues as quickly as possible.

In one of my early tech jobs, we found out about a problem via phone call from the CTO before any of our alerting told us it was a problem. While he was explaining the issue, alerting started to fire, but by that point, the issue had already been happening for close to 15 minutes. We hit bad timing when the problem happened, but this could easily happen to anyone.
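
A rough way to reason about this kind of delay is to bound how long a periodically evaluated alert can take to fire. The sketch below is a simplified model, not a description of the alerting system in the story: it assumes checks run every `interval_s` seconds and that an alert fires only after `consecutive_breaches` failed checks in a row. Both parameter names are hypothetical.

```python
from typing import Tuple


def alert_delay_bounds(interval_s: float, consecutive_breaches: int = 1) -> Tuple[float, float]:
    """Return (best_case, worst_case) seconds from problem start to alert.

    Simplified model (an assumption of this sketch): the check runs every
    `interval_s` seconds, and the alert fires once `consecutive_breaches`
    checks in a row have failed.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    if consecutive_breaches < 1:
        raise ValueError("consecutive_breaches must be at least 1")
    # Best case: the problem starts just before a check runs.
    best_case = (consecutive_breaches - 1) * interval_s
    # Worst case: the problem starts just after a check has run.
    worst_case = consecutive_breaches * interval_s
    return best_case, worst_case


if __name__ == "__main__":
    best, worst = alert_delay_bounds(interval_s=300, consecutive_breaches=3)
    print(f"best case: {best / 60:.0f} min, worst case: {worst / 60:.0f} min")
```

With a hypothetical five-minute rollup and three consecutive breaches, the alert fires somewhere between 10 and 15 minutes after the problem starts. Evaluating the same rule every 10 seconds brings that down to between 20 and 30 seconds.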

Analytics-Powered

The volume of data generated by an observability system is astronomical. There’s no way around it – you need something to help you make sense of this data and to suggest things that matter. An observability platform has to make problems easier to solve, not more difficult. Just instrumenting and adding tons of data into a system with no way for it to surface important things is going to make your problems worse.

Adding more information to an observability system can backfire on you if you have no way to analyze it. In one of my past jobs, nearly every service ran in a JVM, so of course it made sense to collect JVM memory statistics and then alert on excessive memory usage, GC pause time, and things like that. What we didn’t anticipate when adding these metrics was how many events would be generated by small problems in one application. The alerting tool had no deduplication, and there were thousands of events to clear manually every time the workload changed enough to alter memory allocation patterns in one app. These patterns didn’t have any user impact; the app was just behaving differently from what we were used to. A good analytics tool would have at least deduplicated these events, and at best would have indicated that they weren’t affecting any customer-facing metrics and so weren’t worth a real-time investigation.
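
As a minimal sketch of what deduplication means here, the Python below collapses repeated events that share a fingerprint into a single alert with a count. The `AlertEvent` fields and the choice of `(service, check)` as the fingerprint are assumptions made for illustration; a real tool would likely use richer grouping rules and a time window.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class AlertEvent:
    service: str
    check: str
    host: str
    message: str


@dataclass
class DedupedAlert:
    service: str
    check: str
    count: int
    hosts: List[str]


def deduplicate(events: Iterable[AlertEvent]) -> List[DedupedAlert]:
    """Collapse events sharing a (service, check) fingerprint into one alert."""
    grouped: Dict[Tuple[str, str], DedupedAlert] = {}
    for event in events:
        key = (event.service, event.check)
        alert = grouped.get(key)
        if alert is None:
            alert = DedupedAlert(event.service, event.check, 0, [])
            grouped[key] = alert
        alert.count += 1
        if event.host not in alert.hosts:
            alert.hosts.append(event.host)
    # Dicts preserve insertion order, so alerts come back in first-seen order.
    return list(grouped.values())


if __name__ == "__main__":
    # Hypothetical burst: 1,000 GC-pause events from one app on three hosts.
    burst = [
        AlertEvent("checkout", "gc_pause", f"host{i % 3}", "GC pause high")
        for i in range(1000)
    ]
    for alert in deduplicate(burst):
        print(alert.service, alert.check, alert.count, alert.hosts)
```

In this toy example, a thousand events that would each need manual clearing become one alert to look at. Deciding whether that alert matters to customers is the harder, analytics-driven step that this sketch does not attempt.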

Enterprise-Grade

Yes, I know that we’re dealing with buzzword city whenever anyone says “enterprise”, but a robust observability system has to do many things that go beyond simple monitoring. Your system will probably need to operate across multiple clouds eventually (and probably a few on-premises systems). You’ll start to rely on it, so it needs to keep running no matter how much you grow and no matter how many services you have. As you get even larger, true ‘enterprise’ features like RBAC, access tokens, and accounting will be needed. The worst outcome would be needing these features and finding that they aren’t available, forcing an unnecessary, time-consuming switch of observability tools.

Open Standards

OpenTelemetry is the future of observability. This is primarily because instrumentation is challenging work. To get the benefits of observability, you have to instrument all of your applications, but ideally you want to instrument once and then observe from anywhere. OpenTelemetry enables this. Without an open standard, the time spent instrumenting your environment is time and effort spent on work that you’ll almost certainly have to do again at some point in the future. With OpenTelemetry, you can easily change observability platforms if the need arises. You also have full control over what data is sent where, for enhanced customer privacy and possibly enhanced performance of your observability system.
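
The idea of instrumenting once and choosing destinations separately can be illustrated with a toy, standard-library-only sketch. This is deliberately not the OpenTelemetry API: `Sink`, `InMemorySink`, `RedactingSink`, `ToyTracer`, and `handle_checkout` are hypothetical names invented for this example. The point is only that the application code does not change when destinations are added, removed, or wrapped with a filter.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterable, Iterator, List, Protocol, Set


class Sink(Protocol):
    """Anything that can receive a finished span record (hypothetical interface)."""

    def export(self, span: Dict[str, object]) -> None: ...


class InMemorySink:
    """Stands in for a real backend; it just remembers what it was sent."""

    def __init__(self) -> None:
        self.spans: List[Dict[str, object]] = []

    def export(self, span: Dict[str, object]) -> None:
        self.spans.append(span)


class RedactingSink:
    """Wraps another sink and drops attributes that should not leave the service."""

    def __init__(self, inner: Sink, blocked_keys: Set[str]) -> None:
        self._inner = inner
        self._blocked = set(blocked_keys)

    def export(self, span: Dict[str, object]) -> None:
        cleaned = {k: v for k, v in span.items() if k not in self._blocked}
        self._inner.export(cleaned)


class ToyTracer:
    """Times a block of code and sends one record to every configured sink."""

    def __init__(self, sinks: Iterable[Sink]) -> None:
        self._sinks = list(sinks)

    @contextmanager
    def span(self, name: str, **attributes: object) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            record = {"name": name, "duration_s": time.perf_counter() - start, **attributes}
            for sink in self._sinks:
                sink.export(record)


def handle_checkout(tracer: ToyTracer, user_email: str) -> str:
    # Application code is instrumented once and never names a backend.
    with tracer.span("checkout", user_email=user_email):
        return "ok"


if __name__ == "__main__":
    internal, vendor = InMemorySink(), InMemorySink()
    tracer = ToyTracer([internal, RedactingSink(vendor, {"user_email"})])
    handle_checkout(tracer, "user@example.com")
    print(sorted(internal.spans[0]))  # includes user_email
    print(sorted(vendor.spans[0]))    # user_email has been removed
```

Swapping the vendor destination here means changing one line where the tracer is constructed, while `handle_checkout` stays untouched. The redacting wrapper is one small example of controlling which data is sent where.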

Published at DZone with permission of Greg Leffler. See the original article here.

Opinions expressed by DZone contributors are their own.
