The Golden Signals of Monitoring

Learn how Google’s Golden Signals — latency, traffic, errors, and saturation — help monitor service health and detect issues before they escalate.

By Krishna Vinnakota · Jul. 23, 25 · Analysis

This article describes the "Golden Signals" and how they provide a high-level overview of the health and performance of your service. These signals help you understand the state of any service and identify potential issues, and they make a good starting point for a monitoring strategy specific to your workload. If any of these signals is out of the norm, it is a strong indicator that something needs attention.

Here's a breakdown of each signal and how it can be used as a starting point for monitoring your services:

What Are The Golden Signals?

The term "Golden Signals" comes from the Google SRE book and refers to four critical metrics:

  1. Latency: The time it takes to serve a request
  2. Traffic: The demand on your system, typically measured by requests per second
  3. Errors: The rate of requests that fail
  4. Saturation: How "full" your service is

Now, let's look into each of these signals individually:

1. Latency: How Fast Is Your Service Responding?

Latency measures the time taken to complete a request, from the moment it's received to the moment a response is sent back. It's a direct indicator of user experience. High latency means slow interactions, leading to user dissatisfaction.

What to Monitor

  • Average latency: It provides a high-level overview, but a few very slow requests can hide behind a healthy average, so it may not reflect what individual users experience.
  • Percentiles: Percentiles provide better insight into user experience. While the average might look good, the 99th percentile (P99) shows how slow the slowest 1% of requests are. This is often where real problems hide.
  • Latency by endpoint/service: Performance can vary between the endpoints of your application. Measuring latency per endpoint helps pinpoint bottlenecks in your system.
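To make the difference between the average and a high percentile concrete, here is a minimal sketch. The latency values are made up for illustration, and `percentile` is a hypothetical helper that uses the simple nearest-rank definition; monitoring systems often use other estimation methods.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in milliseconds: 98 fast requests and 2 very slow ones.
latencies_ms = [10.0] * 98 + [900.0] * 2

print(f"mean = {mean(latencies_ms):.1f} ms")            # 27.8 ms
print(f"p50  = {percentile(latencies_ms, 50):.1f} ms")  # 10.0 ms
print(f"p99  = {percentile(latencies_ms, 99):.1f} ms")  # 900.0 ms
```

In this made-up data set the mean looks acceptable, while the P99 exposes the slow requests that some users actually hit.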

How to Use as a Starting Point

You can establish acceptable latency thresholds for your critical operations. For example, a database service might aim for P99 read latency under 20ms. If these thresholds are crossed, an alert can be triggered, signaling that users are experiencing slowdowns that need further investigation.
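A threshold check of this kind could be sketched as follows. The endpoint names, the 50 ms write threshold, and the sample data are assumptions made for illustration; only the 20 ms read target comes from the example above. In practice the alerting rule would usually live in your monitoring system rather than in application code.

```python
from statistics import quantiles

# Hypothetical per-endpoint P99 thresholds in milliseconds.
P99_THRESHOLDS_MS = {"read": 20.0, "write": 50.0}


def p99(samples):
    """99th percentile, interpolated between observed samples."""
    return quantiles(samples, n=100, method="inclusive")[98]


def breached_endpoints(latencies_by_endpoint, thresholds):
    """Return {endpoint: observed_p99} for endpoints whose P99 exceeds its threshold."""
    breaches = {}
    for endpoint, samples in latencies_by_endpoint.items():
        observed = p99(samples)
        if observed > thresholds[endpoint]:
            breaches[endpoint] = observed
    return breaches


# Made-up latency samples in milliseconds.
observed_latencies = {
    "read": [5.0] * 190 + [15.0] * 10,
    "write": [5.0] * 190 + [80.0] * 10,
}

breaches = breached_endpoints(observed_latencies, P99_THRESHOLDS_MS)
for endpoint, value in breaches.items():
    limit = P99_THRESHOLDS_MS[endpoint]
    print(f"ALERT {endpoint}: p99={value:.1f} ms exceeds {limit:.1f} ms")
```

Only the `write` endpoint is reported here, because its slowest requests push the P99 past the assumed limit.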

2. Traffic: What's the Demand on Your Service?

Traffic is a measure of how much demand is being placed on your system, expressed in a high-level, system-specific metric. For a web service, this is usually HTTP requests per second, perhaps broken out by the nature of the requests (e.g., static versus dynamic content). For an audio streaming system, it might be the network I/O rate or the number of concurrent sessions. For a database system, it might be the number of transactions or queries processed per second.

Examples of metrics to monitor:

  • Requests per second (RPS): The most common measure for web services
  • Concurrent users: For applications with long-lived sessions
  • Throughput (e.g., MB/s): For services focused on data transfer
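As a rough illustration of the first metric, requests per second can be derived from request arrival times by counting requests in one-second buckets. The timestamps below are invented, and `requests_per_second` is a hypothetical helper, not part of any library.

```python
from collections import Counter


def requests_per_second(timestamps):
    """Count requests in each one-second bucket, keyed by the bucket's start second."""
    counts = Counter(int(ts) for ts in timestamps)
    return dict(sorted(counts.items()))


# Hypothetical request arrival times, in seconds.
arrivals = [100.1, 100.4, 100.9, 101.2, 101.3, 103.0]

rps = requests_per_second(arrivals)
print(rps)                          # {100: 3, 101: 2, 103: 1}
print("peak RPS:", max(rps.values()))  # peak RPS: 3
```

Seconds without any requests (102 in this data) simply do not appear in the result, so a real implementation would need to decide whether to report them as zero.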

How to Use as a Starting Point

Monitoring traffic helps you understand normal operating patterns and detect anomalies. A sudden drop in traffic might indicate an issue with upstream dependencies or a problem with your service itself (e.g., it's not responding, so requests are failing). A sudden spike could signal rapid adoption of a new feature, a denial-of-service attack, or some other unexpected load, which in turn might impact other signals such as latency or errors.
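One simple way to flag such drops and spikes is to compare the current request rate with a baseline taken from recent history. The sketch below assumes the baseline is a plain average and uses arbitrary ratios (half and double the baseline); both choices are illustrative, and real traffic with daily or weekly cycles usually needs a seasonality-aware baseline.

```python
from statistics import mean


def classify_traffic(history_rps, current_rps, drop_ratio=0.5, spike_ratio=2.0):
    """Compare the current request rate with the average of recent history."""
    baseline = mean(history_rps)
    if current_rps < baseline * drop_ratio:
        return "drop"
    if current_rps > baseline * spike_ratio:
        return "spike"
    return "normal"


recent_rps = [980, 1010, 1000, 1020, 990]  # hypothetical requests per second

for current in (1005, 300, 2600):
    print(current, classify_traffic(recent_rps, current))
```

With a baseline of 1000 requests per second, 1005 is classified as normal, 300 as a drop, and 2600 as a spike.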

3. Errors: Are Things Breaking?

The error signal is the rate of requests that fail. Failures can be explicit, such as error codes (e.g., the HTTP 5xx series), or implicit, such as timeouts, dropped connections, or invalid responses. High error rates directly impact service availability and user satisfaction.

What to Monitor

  • HTTP error codes (e.g., 500 rate, 429 rate)
  • Application-specific error messages/codes
  • Service-level errors (e.g., database connection failures)
  • Client-side errors: Are users encountering issues in their browsers/apps?

How to Use as a Starting Point

Set clear error budget thresholds. For example, your goal might be to keep HTTP 5xx responses below 0.1% of requests, with any breach of that rate lasting less than 10 minutes. A deviation beyond this threshold should immediately trigger an alert, as it directly impacts the reliability of your service. Errors often provide the clearest signal of a service problem.
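A minimal sketch of one way to read such a threshold: alert when the error rate stays above 0.1% for 10 consecutive minutes. The per-minute counts are invented, and `sustained_breach` is a hypothetical helper. Real alerting rules are normally written in the monitoring system's own rule language.

```python
def sustained_breach(per_minute, max_error_rate=0.001, min_minutes=10):
    """Return True if the error rate exceeded max_error_rate for min_minutes consecutive minutes.

    per_minute is a sequence of (total_requests, failed_requests) pairs, one per minute.
    """
    streak = 0
    for total, errors in per_minute:
        rate = errors / total if total else 0.0
        streak = streak + 1 if rate > max_error_rate else 0
        if streak >= min_minutes:
            return True
    return False


# Hypothetical per-minute counts: (total requests, HTTP 5xx responses).
healthy = [(10_000, 5)] * 30                                       # 0.05% errors throughout
short_blip = [(10_000, 5)] * 5 + [(10_000, 30)] * 9 + [(10_000, 5)]  # 0.3% for 9 minutes
incident = [(10_000, 5)] * 5 + [(10_000, 30)] * 10                 # 0.3% for 10 minutes

print(sustained_breach(healthy))     # False
print(sustained_breach(short_blip))  # False
print(sustained_breach(incident))    # True
```

Requiring a sustained breach avoids paging on a single bad minute, at the cost of reacting a few minutes later to a real incident.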

4. Saturation: How Full Is Your Service?

Saturation is a measure of how "full" your service is, that is, how much of its available resources it is using. It helps you predict capacity issues before they occur: a highly saturated service can suffer performance degradation or failures.

What to Monitor

  • CPU utilization: What proportion of the available processor cycles is your service consuming?
  • Memory usage: Is your service using too much RAM?
  • Disk I/O: Are disk operations becoming a bottleneck?
  • Network bandwidth: Is your network interface saturated?
  • Queue lengths: Are requests piling up in internal queues (e.g., message queues, thread pools)? This is a particularly strong indicator of impending saturation.

How to Use as a Starting Point

Monitor the utilization of your key resources. For example, on Linux, a load average that stays above the number of available CPU cores (say, above 10 on an 8-core host) may indicate that your service is saturated: more processes are running or waiting for CPU time than there are cores to run them. The Linux load average also counts tasks blocked in uninterruptible I/O, so confirm with CPU utilization before concluding that the CPU is the bottleneck. You can set up an alert on the load average, allowing you to proactively investigate potential bottlenecks in your system.
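As a sketch of such a check, the standard library exposes both the load average and the core count on Unix-like systems. The limit of 1.0 load per core is an assumption for illustration, and `is_cpu_saturated` is a hypothetical helper; choose a limit that matches your own baseline.

```python
import os


def is_cpu_saturated(load_average, cores, max_load_per_core=1.0):
    """Return True when the load average exceeds the given load per core."""
    return load_average / cores > max_load_per_core


# os.getloadavg() is only available on Unix-like systems.
if hasattr(os, "getloadavg"):
    try:
        one_min, five_min, fifteen_min = os.getloadavg()
    except OSError:
        print("load average is not available on this host")
    else:
        cores = os.cpu_count() or 1  # cpu_count() can return None
        state = "saturated" if is_cpu_saturated(five_min, cores) else "ok"
        print(f"5-minute load average {five_min:.2f} on {cores} cores: {state}")
```

Using the 5-minute value rather than the 1-minute value makes the check less sensitive to short bursts.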

Golden Signals as a Starting Point for Monitoring Your Service

The Golden Signals provide a good framework to start your monitoring journey:

  1. Instrument for the Golden Signals: Make sure that your service emits metrics for the four golden signals: latency, traffic, errors, and saturation. This might involve using a common telemetry library such as OpenTelemetry or leveraging built-in metrics from your cloud provider.
  2. Establish baselines and thresholds: Understand what "normal" looks like for each signal. Then, define clear Service Level Indicators (SLIs) and Service Level Objectives (SLOs) based on these signals. For example, P99 latency must be below 20ms for a read query.
  3. Set up alerts: Configure alerts for when any of the Golden Signals cross your predefined thresholds. Alerts should be actionable and have clear troubleshooting guides.
  4. Build dashboards: Create simple, high-level dashboards that display the four signals along with other application-specific metrics. This gives the teams a quick overview of the service's health at a glance.
  5. Drill down from Golden Signals: When a Golden Signal indicates a problem, use it as your starting point for deeper investigation. For instance, if latency is high, you'd then look at more granular metrics like database query times, external API calls, or specific code path durations.
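The threshold and alerting steps above can be sketched as a single check over a snapshot of the four signals. The `GoldenSignals` type, the field names, and the traffic and CPU limits are assumptions made for illustration; the 20 ms latency target and the 0.1% error rate come from the examples in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenSignals:
    p99_latency_ms: float
    requests_per_second: float
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    cpu_utilization: float  # fraction of CPU in use, 0.0 to 1.0


def breached_signals(current, baseline_rps):
    """Return the names of the signals that cross the illustrative thresholds."""
    breached = []
    if current.p99_latency_ms > 20.0:                     # P99 target from the article
        breached.append("latency")
    if current.requests_per_second < 0.5 * baseline_rps:  # assumed: below half of normal
        breached.append("traffic")
    if current.error_rate > 0.001:                        # 0.1% of requests
        breached.append("errors")
    if current.cpu_utilization > 0.8:                     # assumed saturation limit
        breached.append("saturation")
    return breached


healthy = GoldenSignals(12.0, 980.0, 0.0002, 0.45)
degraded = GoldenSignals(35.0, 400.0, 0.004, 0.92)

print(breached_signals(healthy, baseline_rps=1000.0))   # []
print(breached_signals(degraded, baseline_rps=1000.0))  # all four signals
```

The returned list tells you which signal to drill down from first, which matches the last step above.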

Whether you are an SRE or a service owner, these four signals give you a solid starting point for building your monitoring strategy.

References

  1. Site Reliability Engineering: How Google Runs Production Systems
  2. The Site Reliability Workbook: Practical Ways to Implement SRE