The Four Golden Signals of Monitoring, Finally Explained Clearly
Latency, traffic, errors, saturation. You have seen the list. Here is what each one actually measures, how teams get them wrong, and the dashboard every service should have.
The four signals
Google's SRE book crystallised them: latency, traffic, errors, saturation. They are not the only metrics worth watching, but together they catch the vast majority of production issues before users notice.
Latency
How long your service takes to answer. Always measured as a distribution (p50, p95, p99, p99.9), never as a mean. Means hide the slow tail where your problems live.
Separate successful-request latency from failed-request latency. A slow 500 is categorically different from a slow 200 and should alert differently.
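As a minimal sketch of both points, the helper below computes a nearest-rank percentile and keeps success and failure latencies in separate pools. The function and field names are illustrative, not from any particular monitoring library.

```python
import math
from collections import defaultdict

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

# Bucket latencies by outcome so a slow 500 never hides inside the 200s.
latencies = defaultdict(list)

def record(status, latency_ms):
    outcome = "success" if 200 <= status < 300 else "failure"
    latencies[outcome].append(latency_ms)
```

Charting `percentile(latencies["success"], 99)` and `percentile(latencies["failure"], 99)` as separate series lets each one carry its own alert threshold.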
Traffic
How much demand is hitting your service. Usually requests per second, but can be sessions, queue depth, or bytes/second depending on the workload.
Traffic is important because it contextualises everything else. A 2% error rate at 10 req/s is noise. The same rate at 10,000 req/s is a fire. Always chart errors as a rate, not a raw count.
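One way to encode that rule is to alert on the error *rate* and gate the alert behind a minimum traffic level, so low-volume noise never pages anyone. The thresholds here are placeholder values, not recommendations:

```python
def error_rate(errors, total):
    """Errors as a fraction of total requests; 0 when there is no traffic."""
    return errors / total if total else 0.0

def should_alert(errors, total, rate_threshold=0.02, min_traffic=100):
    # Gate on traffic first: 2 errors out of 10 requests is noise,
    # the same rate at high volume is a fire.
    return total >= min_traffic and error_rate(errors, total) > rate_threshold
```

With these defaults, 3 errors in 100 requests alerts, but 2 errors in 10 requests does not, even though the raw rate is ten times higher.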
Errors
Requests that fail, explicitly (HTTP 5xx, uncaught exceptions) or implicitly (an HTTP 200 with a wrong response body, slow requests that timed out upstream).
The common mistake is counting only explicit errors. Most modern outages include a period where the service returns 200s but the bodies are wrong. Have a success definition beyond “2xx status”.
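A success definition beyond "2xx status" can be as simple as validating the body alongside the status code. This sketch assumes a hypothetical JSON response with an `items` field the caller depends on; substitute whatever invariant your service actually guarantees:

```python
import json

def is_success(status, body):
    """A request succeeds only if the status AND the body are right."""
    # Explicit failure: non-2xx status.
    if not (200 <= status < 300):
        return False
    # Implicit failure: a 200 whose body is empty or malformed.
    try:
        payload = json.loads(body)
    except (json.JSONDecodeError, TypeError):
        return False
    # Hypothetical contract: the response must carry an "items" field.
    return isinstance(payload, dict) and "items" in payload
```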
Saturation
How close your service is to running out of something. CPU, memory, file descriptors, connection-pool slots, queue depth. Saturation is the signal that tells you when latency is about to degrade.
Saturation metrics are a diagnostic tool more than an alerting tool. Alerting on 80% CPU paged teams for years without any correlation to user-visible problems; tune saturation alerts against actual latency impact instead.
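One hedged way to express "tune against latency impact": page only when a saturation signal and a latency signal cross their thresholds together. The thresholds below are placeholders you would derive from your own SLO:

```python
def saturation_page(cpu_util, p95_latency_ms,
                    cpu_threshold=0.8, latency_slo_ms=300):
    """Saturation alone is diagnostic; page only when latency confirms
    user-visible impact."""
    return cpu_util >= cpu_threshold and p95_latency_ms > latency_slo_ms
```

High CPU with healthy latency stays on the dashboard for capacity planning; high CPU with degraded latency pages someone.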
The starter dashboard
For any new service, the default dashboard has six panels:
- Request rate (traffic)
- Latency p50/p95/p99 (latency)
- Error rate, success vs explicit failure (errors)
- CPU and memory (saturation)
- Queue depth or connection pool utilisation (saturation)
- Error-budget burn over 28 days (SLO-derived)
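For the last panel, error-budget burn is usually plotted as a burn rate: how fast the window's budget is being consumed relative to plan. A minimal sketch, assuming a 99.9% availability SLO over the 28-day window:

```python
def burn_rate(observed_error_rate, slo_target=0.999):
    """Burn rate 1.0 means the budget is consumed exactly over the SLO
    window; 10.0 means it would be gone in a tenth of the window."""
    budget = 1.0 - slo_target  # allowed error fraction, e.g. 0.1%
    return observed_error_rate / budget
```

A 0.1% error rate against a 99.9% SLO burns at exactly 1.0; a 1% error rate burns at 10.0 and exhausts the 28-day budget in under three days.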
If your dashboard has 30 panels, someone is using it wrong. Four signals, six panels, one glance, that is the target.
Dashboard audit template
For each of your top five services, open the default dashboard. Time how long it takes to answer, in one glance: is the service healthy, and if not, in which of the four signals is the problem?
If the answer takes more than five seconds, the dashboard is too busy. Delete panels until it fits on one screen and answers the question in three seconds.
Review dashboards once a quarter. Every panel should justify its existence by something concrete a human would do if it moved.