
The Four Golden Signals of Monitoring: A Complete Guide (2026)

Latency, traffic, errors, and saturation. These are the four signals Google's SRE book tells you to watch if you can only watch four, because together they catch the large majority of user-facing problems with the smallest possible set of metrics. This is the complete 2026 guide: where the signals come from, how each one works, how they compare to the RED and USE frameworks, how to turn them into SLOs that drive action, a 90-day instrumentation plan, and a 10-point checklist.

By Dr. Samson Tanimawo, Nova AI Ops. Published May 2026. 15 min read.

What the golden signals are and where they come from

The four golden signals of monitoring are latency, traffic, errors, and saturation. They come from the "Monitoring Distributed Systems" chapter of Google's Site Reliability Engineering book, published in 2016, which states it plainly: if you can measure only four metrics of your user-facing system, focus on these four. The framing was deliberately minimal. Rob Ewaschuk and his co-authors wanted a small set of signals that any team could instrument on any service, regardless of language or stack, that would surface most outages before users opened tickets.

Why these four and not some other set? Because together they answer the question that actually matters during an incident: are users being harmed, and if so, in what way? Latency tells you whether requests are slow. Errors tell you whether requests are failing. Traffic tells you how much demand is hitting the system, which is the context that makes the other numbers meaningful. Saturation tells you how close the system is to running out of headroom, which is the early warning before the other three go bad. A team watching all four catches the large majority of user-facing problems with a small, stable, cheap-to-collect set of metrics.

The idea spread because it is simple enough to adopt in an afternoon and general enough to apply to almost any request-driven service. It is also the antidote to a common failure mode: teams that instrument hundreds of low-level metrics, page on dozens of them, and still miss the outage because no single host-level metric mapped to user pain. The golden signals push you to monitor at the service boundary, where the user actually lives. Think of them as the top layer of a broader observability strategy, with detailed telemetry underneath for when you need to diagnose why.

The rest of this guide walks each signal in turn, then compares the golden signals to the two other minimal frameworks you will hear about, RED and USE, and finishes with how to turn the signals into SLOs that drive action, a 90-day instrumentation plan, and a checklist.

The one-sentence version. Monitor latency, traffic, errors, and saturation at the boundary of every user-facing service. If you do nothing else, do that. Everything below is detail on how to do each one well and how to turn the readings into alerts and SLOs that respect the on-call engineer's attention.

Signal 1: Latency

Latency is the time it takes to serve a request. It is usually the first signal users feel, because a service that has gone slow feels broken long before it starts returning errors. But there is a subtlety that trips up most teams: you must distinguish the latency of successful requests from the latency of failed requests.

The reason is that errors often fail fast. A request that hits a misconfigured route and returns a 500 in two milliseconds will pull your average latency down, making a service that is failing half its requests look faster than a perfectly healthy one. If you mix successful and failed latency into one number, a spike in fast errors can hide a real problem behind a reassuring graph. Track the two separately: a sudden drop in overall latency that coincides with rising errors is a classic signature of a service failing fast.
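A minimal sketch of the masking effect, using made-up numbers: a hypothetical service that fails half its requests in 2 milliseconds shows a lower overall mean than a healthy one, and only the split view exposes it. The sample data and the `latency_summary` helper are illustrative assumptions, not part of any particular monitoring library.

```python
from statistics import mean

# Hypothetical samples: (latency in milliseconds, request succeeded?)
healthy = [(40.0, True)] * 100
failing_fast = [(40.0, True)] * 50 + [(2.0, False)] * 50


def latency_summary(requests):
    """Return mean latency overall, for successes, and for failures."""
    ok = [ms for ms, succeeded in requests if succeeded]
    failed = [ms for ms, succeeded in requests if not succeeded]
    return {
        "overall_ms": mean(ms for ms, _ in requests),
        "success_ms": mean(ok) if ok else None,
        "failure_ms": mean(failed) if failed else None,
        "error_rate": len(failed) / len(requests),
    }


healthy_summary = latency_summary(healthy)
failing_summary = latency_summary(failing_fast)
print(healthy_summary)
print(failing_summary)
```

The combined number for the failing service looks better than the healthy one, while the success-only latency is unchanged and the error rate is 50 percent.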

The second rule of latency is that you measure distributions, not averages. An average collapses the whole experience into one number and throws away exactly the information you care about. A service can show a healthy 40 millisecond average while one request in a hundred takes two seconds, and those slow requests are the ones users notice and abandon. Percentiles describe the shape of the distribution:

  • p50 (the median) is the typical experience. Half of requests are faster, half are slower. Good for a baseline sense of normal.
  • p95 is where the slow tail starts to matter. One request in twenty is at least this slow, which at any real traffic volume is a lot of unhappy users.
  • p99 is the tail that drives complaints, churn, and the worst-case experience. At high traffic, the slowest 1 percent is thousands of requests per hour, and the users who make the most requests, often your most valuable ones, are the most likely to land in it.

A practical rule: set your latency target on a high percentile, not the average. "99 percent of requests complete in under 300 milliseconds" is a target users can feel. "Average latency under 150 milliseconds" is a target that can be met while a meaningful slice of traffic times out. Latency is also the golden signal that maps most directly onto a service level objective, which we return to below.
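To make the percentile rule concrete, here is a small sketch that computes nearest-rank percentiles over a hypothetical sample and compares them with the average. The `percentile` helper and the sample data are assumptions for illustration; a production system would normally read percentiles from histograms rather than raw samples.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least
    p percent of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample of 1,000 successful requests:
# most are fast, a few are slow, a handful are very slow.
latencies_ms = [20.0] * 940 + [300.0] * 40 + [2000.0] * 20

average = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

# The SLI form of the same data: share of requests served in under 300 ms.
fraction_under_300 = sum(1 for ms in latencies_ms if ms < 300) / len(latencies_ms)

print(average, p50, p95, p99, fraction_under_300)
```

In this sample the average sits well under 100 milliseconds while p99 is two seconds, which is the gap a percentile-based target is meant to expose.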

Signal 2: Traffic

Traffic measures the demand on your system. For a web service it is typically requests per second. For an API it might be transactions per second; for a streaming service, concurrent sessions or bandwidth; for a database, queries or transactions per second. Whatever the unit, traffic is the question: how much is the system being asked to do right now?

Traffic is the golden signal teams are most tempted to skip, because on its own it does not tell you whether anything is wrong. A high traffic number is not a problem; it might be a great day for the business. The reason traffic earns a place among the four is that it is the denominator and the context for everything else.

It is the denominator for errors: fifty failures a minute means something completely different at 100 requests a minute than at 100,000. It is the context for latency: latency that climbs as traffic climbs points at a capacity limit, while latency that climbs with flat traffic points at a regression or a degraded dependency. And it is the context for saturation: saturation rising because traffic doubled is a scaling conversation, while saturation rising on flat traffic is a leak or a runaway process. Without the traffic number next to them, the other three signals are ambiguous.
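The reasoning above can be sketched as a small decision helper. The function name, the fractional-change inputs, and the 20 percent threshold are all hypothetical; the point is only that the same rise in latency or saturation suggests a different first hypothesis depending on what traffic did over the same window.

```python
def interpret_rise(signal, signal_change, traffic_change, threshold=0.2):
    """Suggest a first hypothesis for a rising latency or saturation signal.

    signal_change and traffic_change are fractional changes over the same
    window (0.5 means +50 percent). The threshold is an assumed cutoff.
    """
    if signal not in ("latency", "saturation"):
        raise ValueError("signal must be 'latency' or 'saturation'")
    if signal_change < threshold:
        return "no significant rise"
    traffic_rose = traffic_change >= threshold
    if signal == "latency":
        return "capacity limit" if traffic_rose else "regression or degraded dependency"
    return "scaling needed" if traffic_rose else "leak or runaway process"


print(interpret_rise("latency", 0.6, 0.8))
print(interpret_rise("latency", 0.6, 0.0))
print(interpret_rise("saturation", 0.4, 0.02))
```

A helper like this is a triage hint, not a diagnosis; it only encodes which question to ask first.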

Traffic is also your sanity check that requests are reaching the service at all. A traffic number that suddenly drops to zero is not good news that the errors stopped; it usually means requests are no longer arriving because of a load balancer, DNS, or upstream failure. A sharp drop in traffic to a normally busy service deserves a page just as much as a spike in errors does.
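One way to express a traffic-drop check is to compare the current request rate with a recent baseline. This is a sketch under stated assumptions: the `traffic_dropped` name, the median baseline, the 50 percent drop fraction, and the minimum baseline of 10 requests per second are illustrative choices, not recommended values.

```python
from statistics import median


def traffic_dropped(recent_rps, current_rps, min_baseline_rps=10.0, drop_fraction=0.5):
    """Return True when current traffic falls well below the recent baseline.

    recent_rps: requests-per-second samples from a trailing window.
    min_baseline_rps: skip services too quiet for a drop to be meaningful.
    drop_fraction: flag when current traffic is below this share of baseline.
    """
    if not recent_rps:
        return False
    baseline = median(recent_rps)
    if baseline < min_baseline_rps:
        return False
    return current_rps < baseline * drop_fraction


busy_service = [200.0, 210.0, 190.0, 205.0]
print(traffic_dropped(busy_service, 0.0))    # traffic vanished
print(traffic_dropped(busy_service, 180.0))  # normal variation
```

The minimum-baseline guard keeps a naturally quiet service from paging every time it sees an idle minute.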

Signal 3: Errors

Errors measure the rate of requests that fail. The critical word is rate, not count. Error count rises and falls with traffic, so a raw count of failures tells you almost nothing on its own: fifty errors a minute is a crisis at 100 requests a minute and a rounding error at 100,000. Error rate, the fraction of requests that fail, normalizes against traffic so the same threshold means the same thing at any volume. Always express and alert on the percentage of total requests, not the absolute number.
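The arithmetic is simple enough to show directly. In this sketch the `error_rate` helper and the 1 percent alert threshold are assumptions for illustration; the same 50 failures per minute produce very different rates at the two traffic levels used in the text.

```python
ALERT_THRESHOLD = 0.01  # hypothetical: alert above 1 percent of requests failing


def error_rate(failed, total):
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    if failed < 0 or total < 0 or failed > total:
        raise ValueError("counts must satisfy 0 <= failed <= total")
    return failed / total if total else 0.0


low_volume = error_rate(50, 100)        # 50 failures at 100 requests a minute
high_volume = error_rate(50, 100_000)   # 50 failures at 100,000 requests a minute

print(low_volume, low_volume > ALERT_THRESHOLD)
print(high_volume, high_volume > ALERT_THRESHOLD)
```

Returning 0.0 for zero traffic is a deliberate simplification here; a drop to zero traffic is better caught by the traffic signal than by the error rate.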

The harder part of the errors signal is defining what counts as a failure. There are two kinds, and the second is where teams get burned:

  • Explicit failures are the obvious ones: HTTP 500s, RPC failures, connection refused, timeouts. Your infrastructure already knows these happened. They are the easy half.
  • Implicit failures are requests that return a success status code but the wrong result: a 200 response with malformed JSON, a search that returns no results when it should, a checkout that confirms but never charges, content that violates a policy or correctness rule. The status code says success; the user experience says failure.

Implicit failures are dangerous precisely because they do not show up in the explicit error count. A deploy that breaks your serialization can return 200 on every request while serving garbage, and a monitor that only counts non-2xx responses will show a perfectly green dashboard through the entire outage. Catching implicit failures means defining correctness checks: schema validation on responses, smoke tests of critical user journeys, content checks for known-bad output. The more of your error definition you can push from "did it return 500" to "did it return the right thing," the more outages your errors signal will actually catch.
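A correctness check of this kind can be sketched as a function that inspects a 2xx response body. Everything specific here is hypothetical: the `REQUIRED_FIELDS` schema, the field names, and the rule that a confirmed checkout must have charged are stand-ins for whatever "the right thing" means for your own service.

```python
import json

# Hypothetical response schema for a checkout endpoint.
REQUIRED_FIELDS = {"order_id": str, "charged": bool, "total_cents": int}


def is_implicit_failure(status_code, body):
    """Return True when a 2xx response carries a wrong or unusable payload."""
    if not 200 <= status_code < 300:
        return False  # explicit failure, already counted by status code
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return True  # success status with malformed JSON
    if not isinstance(payload, dict):
        return True
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(payload.get(field), expected_type):
            return True  # missing field or wrong type
    # Assumed correctness rule: a confirmed checkout must have charged.
    return payload["charged"] is False


good = '{"order_id": "A1", "charged": true, "total_cents": 1999}'
never_charged = '{"order_id": "A1", "charged": false, "total_cents": 1999}'
print(is_implicit_failure(200, good))
print(is_implicit_failure(200, never_charged))
print(is_implicit_failure(200, '{"order_id": '))
```

Counting the responses this function flags alongside explicit failures gives an error rate that reflects what the user received rather than only what the status code claimed.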

The errors signal, like latency, maps cleanly onto an SLO: an availability target such as "99.9 percent of requests succeed over 28 days" is just your error-rate signal turned into a goal with an error budget attached. When errors do fire, the next questions are how fast you detect, diagnose, and recover, which is the domain of MTTR and AI incident response.

Signal 4: Saturation

Saturation measures how full your most constrained resource is. Every service has one resource that will run out before the others: CPU, memory, disk space, disk I/O, a connection pool, a thread pool, a queue, or network bandwidth. Saturation is the utilization of whichever of these is closest to its limit. It is the golden signal that takes the most thought to instrument, because it requires knowing your system well enough to identify the bottleneck, and the bottleneck can move as the workload changes.

Saturation earns its place because it is the leading indicator. Latency, traffic, and errors tell you about problems that are happening now; saturation tells you about problems that are about to happen. The reason is that resources degrade nonlinearly. A system running at 80 percent utilization is usually fine, with comfortable headroom. The same system at 99 percent has fallen off a cliff: queues back up, latency spikes, and errors begin, often all at once. The last few percent of utilization is where everything goes wrong, and it goes wrong fast.
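One way to see the nonlinearity is a textbook single-server queue. The sketch below assumes an M/M/1 model (random arrivals, one server, exponentially distributed service times), where mean time in the system is the service time divided by one minus utilization. Real services rarely match this model exactly, so treat the numbers as an illustration of the curve's shape, not a prediction.

```python
def mm1_mean_latency_ms(service_time_ms, utilization):
    """Mean time in system for an M/M/1 queue: S / (1 - rho).

    This queueing model is an assumption chosen for illustration.
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1 - utilization)


SERVICE_TIME_MS = 10.0  # hypothetical time to serve one request on an idle system

for rho in (0.50, 0.80, 0.90, 0.95, 0.99):
    print(f"{rho:.0%} utilized: {mm1_mean_latency_ms(SERVICE_TIME_MS, rho):.0f} ms")
```

Under this model, going from 80 to 99 percent utilization multiplies mean latency by roughly twenty, which is why the last few percent matter so much.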

This is why saturation predicts the other three signals. A memory utilization climbing steadily toward 100 percent is a latency spike and an error wave that has not happened yet. A connection pool at 95 percent is the next outage. Watching saturation lets you act, by scaling out, shedding load, or fixing a leak, in the window before users feel anything. That early-warning property is also what makes saturation the natural input to capacity planning and to anomaly detection: a saturation trend line is a forecast.
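Treating a saturation trend as a forecast can be sketched with a least-squares line fitted to recent utilization samples. The `hours_until_full` helper and the sample data are hypothetical, and a straight line is a simplifying assumption; real growth is often bursty or seasonal.

```python
def hours_until_full(samples, limit=100.0):
    """Estimate hours until utilization reaches `limit` from a linear trend.

    samples: list of (hour, utilization_percent) pairs.
    Returns None when the fitted trend is flat or falling.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    last_t = max(t for t, _ in samples)
    # Time at which the fitted line crosses the limit, relative to the last sample.
    return max(0.0, (limit - intercept) / slope - last_t)


# Hypothetical memory utilization climbing 2 points per hour.
memory = [(0, 70.0), (1, 72.0), (2, 74.0), (3, 76.0)]
print(hours_until_full(memory))
```

An estimate like this is most useful as a ranking tool: the resources with the shortest time-to-full are the ones to look at first.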

The 99 percent cliff. Treat any resource crossing roughly 80 to 90 percent sustained utilization as an alert, not a celebration of efficiency. The instinct to run resources hot to save money is exactly how teams end up living on the edge of the cliff, where a small traffic bump tips the system over. Saturation headroom is not waste; it is the buffer that absorbs the spike you did not forecast.

Golden signals vs RED vs USE

You will hear two other three-letter frameworks alongside the golden signals: RED and USE. They are not competitors so much as different lenses, and they overlap heavily. Understanding where each fits keeps you from arguing about which is "right" when the honest answer is that mature teams use more than one.

Framework      | Signals                              | Best for
Golden signals | Latency, Traffic, Errors, Saturation | Any user-facing service; the broadest umbrella
RED            | Rate, Errors, Duration               | Request-driven services and microservices
USE            | Utilization, Saturation, Errors      | Hardware and infrastructure resources

RED, popularized by Tom Wilkie, stands for Rate, Errors, and Duration. Rate is traffic, Duration is latency, and Errors is errors. In other words, RED is the golden signals with saturation removed. That makes it a clean fit for request-driven services and microservices, where you care about the request flow and you measure each service from the outside. RED is easy to apply uniformly across a fleet of services because every service has a request rate, an error rate, and a duration, and you can build one dashboard template that works for all of them.

USE, from Brendan Gregg, stands for Utilization, Saturation, and Errors, and it is resource-centric rather than request-centric. You apply it to a physical or logical resource (a CPU, a disk, a network interface, a memory bus) and ask three questions: how utilized is it, how saturated is it (how much queued work is waiting on it), and is it throwing errors? USE is the framework for the layer underneath your services, the infrastructure that the services run on.

The overlap is the point. Errors appears in all three. Saturation appears in the golden signals and in USE. The golden signals are the broadest framing, the umbrella that contains the others: RED is the request-side view of the golden signals, and USE is the resource-side view. In practice the clean division of labor is RED for your services and USE for the resources underneath them, with the golden signals as the mental model that ties the two together. You do not have to choose one; you choose RED at the service layer and USE at the resource layer, and you have covered all four golden signals across both.

From signals to SLOs and action

Signals on a dashboard do not improve reliability by themselves. The value comes from turning them into SLOs that set targets, alerts that fire only when users are affected, and an incident response loop that acts on what fires. Here is how the four golden signals flow into each.

Signals become SLIs and SLOs

A service level indicator, an SLI, is a precise measurement of one aspect of service health, and an SLO sets a target for that indicator over a window. Two of the four golden signals map onto SLOs almost directly. A latency signal becomes a latency SLI like "the proportion of requests served in under 300 milliseconds," with an SLO target such as 99 percent over 28 days. An error signal becomes an availability SLI with a target like 99.9 percent of requests succeeding. The gap between your target and 100 percent is your error budget, the amount of unreliability you are allowed to spend, which governs how aggressively you can ship. Traffic and saturation are usually inputs and context rather than SLOs themselves, but they shape the targets you can credibly promise.
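The error budget arithmetic can be sketched as follows. The `error_budget` helper, the 10 million request volume, and the 2,500 failures are made-up inputs; only the 99.9 percent target comes from the text.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Return the failures the SLO allows in a window and how much is spent.

    slo_target is a fraction, for example 0.999 for 99.9 percent.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed = (1 - slo_target) * total_requests
    return {
        "allowed_failures": allowed,
        "remaining_failures": allowed - failed_requests,
        "consumed_fraction": failed_requests / allowed if allowed else float("inf"),
    }


# Hypothetical 28-day window: 10 million requests, 2,500 of them failed.
budget = error_budget(0.999, 10_000_000, 2_500)
print(budget)
```

With these inputs the SLO allows about 10,000 failed requests in the window, and a quarter of that budget has been spent.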

Signals drive alerting without noise

The most common monitoring mistake is alerting on causes instead of symptoms: a page for high CPU on one host, another for a single slow query, another for one failed health check. The on-call engineer drowns in pages that do not map to user impact, and the real signal gets lost in the noise. Alerting on the four golden signals at the service boundary, ideally tied to SLO burn rate so a page means "you are spending your error budget too fast," is the core defense against alert fatigue. When a page fires, it means users are actually being affected, which is the only thing worth waking someone for.
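Burn rate is the observed error rate divided by the error rate the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO window. The sketch below pages only when both a long and a short window burn fast, which is one common way to avoid paging on brief blips. The function names and the default threshold of 14.4 are assumptions to tune for your own window and budget policy.

```python
def burn_rate(observed_error_rate, slo_target):
    """How many times faster than allowed the error budget is being spent."""
    allowed_error_rate = 1 - slo_target
    if allowed_error_rate <= 0:
        raise ValueError("slo_target must be below 1")
    return observed_error_rate / allowed_error_rate


def should_page(long_window_error_rate, short_window_error_rate,
                slo_target, threshold=14.4):
    """Page only when both windows exceed the burn-rate threshold.

    The long window shows the burn is significant; the short window
    shows it is still happening right now.
    """
    return (burn_rate(long_window_error_rate, slo_target) >= threshold
            and burn_rate(short_window_error_rate, slo_target) >= threshold)


SLO = 0.999  # 99.9 percent of requests succeed
print(should_page(0.02, 0.03, SLO))     # sustained and ongoing
print(should_page(0.02, 0.0005, SLO))   # already recovered
```

Requiring the short window as well means the page clears soon after the service recovers, instead of lingering until the long window drains.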

Signals drive incident response

Once a golden signal crosses a threshold, the clock starts on detection, diagnosis, and recovery, the components of MTTR. The signals tell you that something is wrong and roughly where; getting to why still needs traces, logs, and deeper telemetry. This is where an agentic platform earns its keep. Nova AI Ops watches latency, traffic, errors, and saturation across AWS, GCP, Azure, Linux, and Windows, then correlates a signal anomaly on one service into a single incident rather than a storm of disconnected alerts. Its agents reason over the four signals alongside logs, traces, and recent deploys to find the likely root cause, and where the diagnosis matches a known pattern they auto-resolve within a policy envelope you control. The golden signals are the detection layer; AI incident response turns a signal anomaly into a diagnosis and, where it is safe, a fix.

A 90-day plan and a 10-point checklist

Instrumenting the golden signals on every service is not a one-week project on a real fleet, but it is very achievable in a quarter if you sequence it. Here is a plan that delivers value early and avoids the trap of trying to boil the ocean.

Days 1–30: Instrument the two easy signals everywhere

Start with traffic and errors, because almost every service already emits them or can with a thin middleware layer. Add request-rate and error-rate metrics at the boundary of every user-facing service, labeled by status class so you can separate explicit failures. Stand up one dashboard template (RED-style) that you can apply uniformly to every service. By day 30 you should be able to answer, for any service, "how much traffic and what error rate, right now." Do not build alerts yet; just get clean, trusted data flowing.
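A thin middleware layer of the kind described above can be sketched without any framework. The `RequestMetrics` class, the `instrument` wrapper, and the handler signature (a function that takes a request and returns a status code) are all hypothetical; in practice you would emit these counts through your metrics library of choice.

```python
from collections import Counter


class RequestMetrics:
    """Counts requests by status class (2xx, 4xx, 5xx, ...)."""

    def __init__(self):
        self.by_status_class = Counter()

    def observe(self, status_code):
        self.by_status_class[f"{status_code // 100}xx"] += 1

    @property
    def total(self):
        return sum(self.by_status_class.values())

    def error_rate(self):
        """Share of requests that ended in a 5xx status."""
        return self.by_status_class["5xx"] / self.total if self.total else 0.0


def instrument(handler, metrics):
    """Wrap handler(request) -> status_code so every call is counted."""
    def wrapped(request):
        try:
            status = handler(request)
        except Exception:
            metrics.observe(500)  # an unhandled exception counts as a server error
            raise
        metrics.observe(status)
        return status
    return wrapped


def demo_handler(request):
    # Hypothetical handler: one path fails, everything else succeeds.
    return 500 if request == "/boom" else 200


metrics = RequestMetrics()
handle = instrument(demo_handler, metrics)
for path in ["/ok", "/ok", "/ok", "/boom"]:
    handle(path)
print(metrics.total, dict(metrics.by_status_class), metrics.error_rate())
```

Dividing `total` by the length of the collection window gives the traffic signal, and the per-class counts give the explicit error rate from the same instrumentation point.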

Days 31–60: Add latency distributions and define correctness

Layer in latency histograms so you can read p50, p95, and p99, and split successful-request latency from failed-request latency. In parallel, define implicit-failure checks for your most critical user journeys so your error signal catches wrong-content failures, not just 500s. By day 60 every important service has all three request-side signals (rate, errors, duration) with percentiles and a real definition of "failure."
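A latency histogram keeps a count per bucket instead of every raw sample, and percentiles are then read as the upper bound of the bucket where the cumulative count crosses the target. The sketch below is a simplified, assumed implementation; the bucket boundaries are hypothetical, and the estimate is only as precise as the buckets are narrow.

```python
from bisect import bisect_left


class LatencyHistogram:
    """Fixed-bucket latency histogram with an overflow bucket."""

    def __init__(self, upper_bounds_ms):
        self.bounds = sorted(upper_bounds_ms)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is overflow

    def observe(self, latency_ms):
        # Index of the first bucket whose upper bound is >= the sample.
        self.counts[bisect_left(self.bounds, latency_ms)] += 1

    def percentile_upper_bound(self, p):
        """Upper bound of the bucket containing the p-th percentile."""
        total = sum(self.counts)
        if total == 0:
            return None
        target = p * total / 100
        running = 0
        for index, count in enumerate(self.counts):
            running += count
            if running >= target:
                if index < len(self.bounds):
                    return self.bounds[index]
                return float("inf")  # percentile fell in the overflow bucket
        return float("inf")


# Keep one histogram per outcome so fast failures cannot mask slow successes.
success_latency = LatencyHistogram([50, 100, 300, 1000])
for ms in [20.0] * 940 + [250.0] * 40 + [2000.0] * 20:
    success_latency.observe(ms)

print(success_latency.percentile_upper_bound(50))
print(success_latency.percentile_upper_bound(95))
print(success_latency.percentile_upper_bound(99))
```

A result of infinity is a useful hint in itself: it means the highest bucket boundary is too low to describe the tail and the buckets need extending.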

Days 61–90: Add saturation, set SLOs, and wire alerts

Identify the constraining resource for each service and instrument its utilization and saturation (USE-style) on the infrastructure underneath. Then turn the latency and error signals into SLIs, set conservative SLOs with error budgets, and wire symptom-based, burn-rate alerts tied to those budgets. By day 90 every service has all four golden signals, every critical service has an SLO, and on-call pages fire on user impact rather than on low-level causes.

The 10-point checklist:

  1. Every user-facing service emits all four golden signals at its boundary: latency, traffic, errors, and saturation.
  2. Latency is measured as a distribution, not an average, with p50, p95, and p99 tracked.
  3. Successful-request latency is separated from failed-request latency so fast errors cannot mask a degraded service.
  4. Error rate is expressed as a percentage of traffic, never as a raw count.
  5. The error definition includes implicit failures (wrong content, policy violations), not just explicit 5xx and timeouts.
  6. Traffic is monitored for sudden drops, not only spikes; a drop to zero pages the on-call engineer.
  7. The constraining resource for each service is identified and its saturation is tracked as a leading indicator.
  8. Saturation alerts fire at 80 to 90 percent sustained utilization, before the 99 percent cliff, not after.
  9. Latency and error signals are turned into SLIs with SLOs and error budgets for every critical service.
  10. Alerts are symptom-based and tied to SLO burn rate, so a page always maps to real user impact.

A service that passes all ten of these is monitored the way the SRE book intended: a small, stable set of signals, measured at the boundary, turned into targets and pages that respect the on-call engineer's attention. From there you can layer deeper observability for diagnosis, automate the routine response with DevOps automation, and let an agentic platform watch the signals around the clock so a human only sees what truly needs a human.

Frequently asked questions

What are the four golden signals of monitoring?
The four golden signals are latency, traffic, errors, and saturation. They come from the Monitoring Distributed Systems chapter of Google's Site Reliability Engineering book, which states that if you can measure only four metrics of your user-facing system, focus on these four. Together they catch the large majority of user-facing problems with a small, stable set of metrics, which is why they have become the default starting point for service monitoring.
Where do the golden signals come from?
The golden signals were popularized by Google's Site Reliability Engineering book, published in 2016, in the chapter on monitoring distributed systems written by Rob Ewaschuk and colleagues. The framing was deliberately minimal: a small set of signals that any team could instrument on any service, regardless of language or stack, that would surface most outages before users opened tickets. The idea spread because it is simple enough to adopt in an afternoon and general enough to apply to almost any request-driven service.
Why measure latency percentiles instead of averages?
Averages hide the tail. A service can show a healthy 40 millisecond average while one request in a hundred takes two seconds, and those slow requests are exactly the ones users notice and abandon. Percentiles describe the distribution: p50 is the typical experience, p95 and p99 describe the slow tail where pain concentrates. You should also separate the latency of successful requests from the latency of failed ones, because a fast error can make a broken service look healthier than it is.
Why is error rate better than error count?
Error count rises and falls with traffic, so a raw count of failures tells you almost nothing on its own. Fifty errors a minute is a crisis at 100 requests a minute and a rounding error at 100,000. Error rate, the fraction of requests that fail, normalizes against traffic, so the same threshold means the same thing at any volume. That is why error rate, expressed as a percentage of total requests, is the signal you alert on and report against, not the absolute count.
What is saturation in the golden signals?
Saturation measures how full your most constrained resource is, the CPU, memory, disk, connection pool, or queue that will run out first. It is the leading indicator among the four signals because resources degrade nonlinearly: a system at 80 percent utilization is usually fine, but the same system at 99 percent falls off a performance cliff where latency spikes and errors begin. Watching saturation lets you act before users feel the latency and errors that saturation predicts.
What is the difference between golden signals, RED, and USE?
All three are minimal monitoring frameworks that overlap heavily. The golden signals (latency, traffic, errors, saturation) are the broadest and apply to any user-facing system. RED (rate, errors, duration) is a request-centric subset best suited to microservices and request-driven services; it is essentially the golden signals minus saturation. USE (utilization, saturation, errors) is resource-centric and best suited to hardware and infrastructure components like CPUs, disks, and network interfaces. In practice teams use RED for services and USE for the resources underneath them, and the golden signals are the umbrella that contains both.
How do the golden signals become SLOs?
The golden signals become SLOs by way of service level indicators: an SLI is a precise measurement of one aspect of service health, and an SLO sets a target for that indicator over a window. A latency signal becomes an SLI like the proportion of requests served under 300 milliseconds, with an SLO target such as 99 percent over 28 days. An error signal becomes an availability SLI with a target like 99.9 percent successful requests. The gap between your target and 100 percent is your error budget, which governs how aggressively you can ship. Latency and errors are the two golden signals that map most cleanly onto SLOs; traffic and saturation usually serve as inputs and context.
Do the golden signals reduce alert noise?
Yes, when you alert on the signals rather than on every underlying cause. The classic mistake is alerting on dozens of low-level conditions, high CPU on one host, a single slow query, one failed health check, which buries the on-call engineer in pages that do not map to user impact. Alerting on the four golden signals at the service boundary, ideally tied to SLO burn rate, means a page fires when users are actually affected, which is the core defense against alert fatigue.
Are the golden signals enough on their own?
They are the right place to start, not the whole story. The golden signals tell you that something is wrong and roughly where, but they do not always tell you why; for root cause you still need traces, logs, and deeper telemetry. They also assume a request-driven service, so batch jobs, data pipelines, and event-streaming systems need adapted signals such as queue depth, lag, and freshness. Treat the golden signals as the top of a layered observability strategy, with detailed telemetry underneath for diagnosis.
How does Nova AI Ops use the golden signals?
Nova AI Ops watches latency, traffic, errors, and saturation across AWS, GCP, Azure, Linux, and Windows, then correlates a signal anomaly on one service into a single incident rather than a storm of disconnected alerts. Its agents reason over the four signals alongside logs, traces, and recent deploys to find the likely root cause, and where the diagnosis matches a known pattern they auto-resolve within a policy envelope you control. The golden signals are the detection layer; the agents turn a signal anomaly into a diagnosis and, where it is safe, a fix.

