The Golden Signals of Monitoring
Learn how Google’s Golden Signals — latency, traffic, errors, and saturation — help monitor service health and detect issues before they escalate.
This article describes the "Golden Signals" and how they provide a high-level overview of your service's health and performance. These signals help you understand the state of any service and identify potential issues, and they are a good starting point for a monitoring strategy specific to your workload. If any of these signals is out of the norm, it is a strong indicator that something needs attention.
Here's a breakdown of each signal and how it can be used as a starting point for monitoring your services:
What Are The Golden Signals?
The term "Golden Signals" was introduced in Google's SRE book and refers to four critical metrics:
- Latency: The time it takes to serve a request
- Traffic: The demand on your system, typically measured by requests per second
- Errors: The rate of requests that fail
- Saturation: How "full" your service is
Now, let's look into each of these signals individually:
1. Latency: How Fast Is Your Service Responding?
Latency measures the time taken to complete a request, from the moment it's received to the moment a response is sent back. It's a direct indicator of user experience. High latency means slow interactions, leading to user dissatisfaction.
What to Monitor
- Average latency: It gives a high-level overview, but a few very slow requests can hide behind a healthy-looking average, so it is a weak indicator of user experience.
- Percentiles: Percentiles give better insight into user experience. While the average might look good, the 99th percentile shows how slow the slowest 1% of requests are. This is often where real problems hide.
- Latency by endpoint/service: Performance can vary between the endpoints of your application. Monitoring latency per endpoint helps pinpoint bottlenecks in your system.
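To make the difference between averages and percentiles concrete, here is a minimal sketch in Python. The sample data and the `percentile` helper are illustrative assumptions, and the nearest-rank method shown is only one of several common percentile definitions.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical window: 98 fast requests and 2 very slow ones (milliseconds).
latencies_ms = [10.0] * 98 + [900.0] * 2

print(f"mean = {mean(latencies_ms):.1f} ms")
print(f"p50  = {percentile(latencies_ms, 50):.1f} ms")
print(f"p99  = {percentile(latencies_ms, 99):.1f} ms")
```

With this sample, the mean is 27.8 ms and the median is 10 ms, while the 99th percentile is 900 ms: the tail is invisible unless you look at percentiles.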
How to Use as a Starting Point
You can establish acceptable latency thresholds for your critical operations. For example, a database service might aim for P99 read latency under 20ms. If these thresholds are crossed, an alert can be triggered, signaling that users are experiencing slowdowns that need further investigation.
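A threshold check of this kind might look like the following sketch. The operation names, the thresholds in `P99_THRESHOLDS_MS`, and the sample window are all hypothetical; in practice the check would usually live in your monitoring system's alert rules rather than in application code.

```python
import math

# Hypothetical P99 thresholds in milliseconds, keyed by operation name.
P99_THRESHOLDS_MS = {"db.read": 20.0, "db.write": 50.0}


def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(99 * len(ordered) / 100))
    return ordered[rank - 1]


def latency_breaches(samples_by_operation, thresholds=P99_THRESHOLDS_MS):
    """Return {operation: observed_p99} for operations whose P99 exceeds its threshold."""
    breaches = {}
    for operation, samples in samples_by_operation.items():
        limit = thresholds.get(operation)
        if limit is None or not samples:
            continue  # no threshold defined, or nothing observed in this window
        observed = p99(samples)
        if observed > limit:
            breaches[operation] = observed
    return breaches


# Hypothetical one-window sample: 5% of reads are slow, writes are healthy.
window = {
    "db.read": [5.0] * 95 + [35.0] * 5,
    "db.write": [12.0] * 100,
}
print(latency_breaches(window))
```

Here only `db.read` is reported, because its P99 of 35 ms exceeds the assumed 20 ms threshold.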
2. Traffic: What's the Demand on Your Service?
Traffic is a measure of how much demand is being placed on your system, expressed in a high-level, system-specific metric. For a web service, this is usually HTTP requests per second, perhaps broken out by the nature of the requests (e.g., static versus dynamic content). For an audio streaming system, it might be network I/O rate or concurrent sessions. For a database system, it might be transactions or queries per second.
Examples of metrics to monitor:
- Requests per second (RPS): The most common measure for web services
- Concurrent users: For applications with long-lived sessions
- Throughput (e.g., MB/s): For services focused on data transfer
How to Use as a Starting Point
Monitoring traffic helps you understand normal operating patterns and detect anomalies. A sudden drop in traffic might indicate an issue with upstream dependencies or a problem with your service itself (e.g., it's not responding, so requests are failing). A sudden spike could signal user adoption of a new feature, a denial-of-service attack, or other unexpected load, which in turn might affect other signals such as latency or errors.
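As a rough illustration of comparing current traffic with a baseline, consider the sketch below. The function names and the `drop_ratio` and `spike_ratio` defaults are assumptions chosen for the example; real baselines usually also account for daily and weekly seasonality.

```python
from collections import Counter
from statistics import median


def requests_per_second(timestamps):
    """Bucket request arrival times (seconds since epoch) into per-second counts."""
    return Counter(int(ts) for ts in timestamps)


def classify_traffic(current_rps, baseline_rps, drop_ratio=0.5, spike_ratio=2.0):
    """Compare the current rate with a baseline; the ratios are illustrative defaults."""
    if baseline_rps <= 0:
        return "unknown"
    ratio = current_rps / baseline_rps
    if ratio <= drop_ratio:
        return "drop"
    if ratio >= spike_ratio:
        return "spike"
    return "normal"


# Hypothetical recent history of per-second request rates.
history_rps = [100, 104, 98, 101, 97, 103]
baseline = median(history_rps)

for current in (102, 40, 260):
    print(current, classify_traffic(current, baseline))
```

Using the median as the baseline keeps a single unusual second from skewing the comparison; the three sample rates are classified as normal, drop, and spike respectively.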
3. Errors: Are Things Breaking?
The error signal is the rate of requests that fail, whether explicitly (e.g., HTTP 5xx responses) or implicitly, through timeouts, dropped connections, or invalid responses. High error rates directly affect service availability and user satisfaction.
What to Monitor
- HTTP error codes (e.g., 500 rate, 429 rate)
- Application-specific error messages/codes
- Service-level errors (e.g., database connection failures)
- Client-side errors: Are users encountering issues in their browsers/apps?
How to Use as a Starting Point
Set clear error budget thresholds. For example, your goal might be to keep HTTP 5xx responses below 0.1% of requests and to tolerate a breach of that rate for no more than 10 minutes. Any deviation beyond this threshold should immediately trigger an alert, as it directly impacts the reliability of your service. Errors often provide the clearest signal of a service problem.
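One way to read the example threshold is "alert once the 5xx rate has stayed above 0.1% for 10 consecutive minutes." The sketch below implements that reading; the `MinuteBucket` type, the one-minute bucketing, and the sample counts are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class MinuteBucket:
    total: int   # requests observed in this one-minute bucket
    errors: int  # requests in the bucket that failed (e.g., HTTP 5xx, timeouts)


def error_rate(bucket):
    """Fraction of failed requests; an empty bucket counts as 0.0."""
    return bucket.errors / bucket.total if bucket.total else 0.0


def sustained_breach(buckets, max_rate=0.001, max_minutes=10):
    """True if the error rate exceeded max_rate for max_minutes consecutive buckets."""
    streak = 0
    for bucket in buckets:
        streak = streak + 1 if error_rate(bucket) > max_rate else 0
        if streak >= max_minutes:
            return True
    return False


# Hypothetical data: 3 healthy minutes (0.05% errors), then 10 bad ones (0.3%).
healthy = [MinuteBucket(total=10_000, errors=5)] * 3
bad = [MinuteBucket(total=10_000, errors=30)] * 10

print(sustained_breach(healthy + bad))      # breach lasted the full 10 minutes
print(sustained_breach(healthy + bad[:9]))  # breach ended after 9 minutes
```

Requiring a sustained breach trades some detection delay for fewer alerts on short blips; whether that trade is right depends on your error budget.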
4. Saturation: How Full Is Your Service?
Saturation is a measure of how "full" your service is, that is, how much of its constrained resources it is using. It helps you predict capacity issues before they occur. A highly saturated service can suffer performance degradation or failures.
What to Monitor
- CPU utilization: What proportion of the available processor cycles is your service consuming?
- Memory usage: Is your service using too much RAM?
- Disk I/O: Are disk operations becoming a bottleneck?
- Network bandwidth: Is your network interface saturated?
- Queue lengths: Are requests piling up in internal queues (e.g., message queues, thread pools)? This is a particularly strong indicator of impending saturation.
How to Use as a Starting Point
Monitor the utilization of your key resources. For example, on Linux, a load average that stays above the number of CPU cores (say, above 10 on an 8-core host) may indicate that your service is saturated: more tasks are runnable or waiting than there are cores to run them. Note that the Linux load average also counts tasks blocked in uninterruptible I/O, so a high value is not always CPU contention. You can set up an alert on the load average, allowing you to proactively investigate potential bottlenecks in your system.
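Because a raw load average is only meaningful relative to the number of cores, a check might normalize it first. The following is a minimal sketch; `load_per_core`, `is_saturated`, and the threshold of 1.0 are illustrative assumptions, while `os.getloadavg()` and `os.cpu_count()` are standard-library calls (the former is unavailable on Windows).

```python
import os


def load_per_core(load_average, cores):
    """Normalize a load average by core count; values above 1.0 suggest queued work."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_average / cores


def is_saturated(load_average, cores, threshold=1.0):
    """True if the per-core load exceeds the (assumed) threshold."""
    return load_per_core(load_average, cores) > threshold


# A load average of 10 means different things on different hosts.
print(is_saturated(10.0, cores=8))   # more waiting tasks than cores
print(is_saturated(10.0, cores=16))  # still has headroom

# Read the live values where the platform supports it.
if hasattr(os, "getloadavg"):
    try:
        one_min, five_min, fifteen_min = os.getloadavg()
        cores = os.cpu_count() or 1
        print(f"5-minute load per core: {load_per_core(five_min, cores):.2f}")
    except OSError:
        print("load average is not available on this system")
```

Alerting on the 5-minute or 15-minute value rather than the 1-minute value is a common way to avoid reacting to short bursts.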
Golden Signals as a Starting Point for Monitoring Your Service
The Golden Signals provide a good framework to start your monitoring journey:
- Instrument for the Golden Signals: Make sure that your service emits metrics for the four golden signals: latency, traffic, errors, and saturation. This might involve using a common telemetry library such as OpenTelemetry or leveraging built-in metrics from your cloud provider.
- Establish baselines and thresholds: Understand what "normal" looks like for each signal. Then, define clear Service Level Indicators (SLIs) and Service Level Objectives (SLOs) based on these signals. For example, P99 latency must be below 20ms for a read query.
- Set up alerts: Configure alerts for when any of the Golden Signals cross your predefined thresholds. Alerts should be actionable and have clear troubleshooting guides.
- Build dashboards: Create simple, high-level dashboards that display the four signals along with other application-specific metrics. This gives the teams a quick overview of the service's health at a glance.
- Drill down from Golden Signals: When a Golden Signal indicates a problem, use it as your starting point for deeper investigation. For instance, if latency is high, you'd then look at more granular metrics like database query times, external API calls, or specific code path durations.
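To show how the four signals fit together, here is a small in-memory sketch in plain Python. It deliberately avoids any telemetry library, so the `GoldenSignals` class, its field names, and the use of peak in-flight requests over an assumed `capacity` as the saturation measure are all hypothetical; a real service would more likely emit these values through a library such as OpenTelemetry or its platform's built-in metrics.

```python
import math
from dataclasses import dataclass, field


@dataclass
class GoldenSignals:
    """In-memory recorder for one reporting window (hypothetical helper)."""
    window_seconds: float
    capacity: int  # e.g., worker pool size, assumed to be known
    latencies_ms: list = field(default_factory=list)
    errors: int = 0
    peak_in_flight: int = 0

    def record(self, latency_ms, ok, in_flight):
        """Record one finished request and the concurrency seen while serving it."""
        self.latencies_ms.append(latency_ms)
        if not ok:
            self.errors += 1
        self.peak_in_flight = max(self.peak_in_flight, in_flight)

    def snapshot(self):
        """Summarize the window as one value per golden signal."""
        count = len(self.latencies_ms)
        ordered = sorted(self.latencies_ms)
        p99 = ordered[max(1, math.ceil(99 * count / 100)) - 1] if count else None
        return {
            "latency_p99_ms": p99,
            "traffic_rps": count / self.window_seconds,
            "error_rate": self.errors / count if count else 0.0,
            "saturation": self.peak_in_flight / self.capacity,
        }


# Hypothetical 10-second window with 200 requests.
signals = GoldenSignals(window_seconds=10, capacity=8)
for i in range(200):
    signals.record(
        latency_ms=12.0 if i % 50 else 80.0,  # every 50th request is slow
        ok=i % 100 != 0,                      # every 100th request fails
        in_flight=1 + i % 6,                  # concurrency cycles from 1 to 6
    )
print(signals.snapshot())
```

A snapshot like this is the kind of per-window summary that dashboards and alert rules can be built on; each value can then be compared against the baseline or SLO you defined for it.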
Whether you are an SRE or a service owner, these four signals give you a solid foundation on which to build the rest of your monitoring strategy.
References
- Site Reliability Engineering: How Google Runs Production Systems
- The Site Reliability Workbook: Practical Ways to Implement SRE