The Four Golden Signals, Revisited for 2026
Latency, traffic, errors, saturation. The original four still hold; the way you measure them has evolved. This is the 2026 update, with concrete metric definitions.
Latency: percentiles, not averages
Latency is the user-experience signal. Averages lie about tails; percentiles tell the truth.
- p50, p95, p99. The working percentile triple per endpoint; the mean plays no part in SLO definitions.
- Averages hide tails. The mean blends fast and slow requests, so a tail regression can vanish in the aggregate.
- Per-endpoint, per-method. A service-level latency figure hides the bad endpoints behind the good ones; split by route.
- Per-region cut. Latency by geography catches PoP-specific regressions before customers complain.
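A minimal sketch of the per-route percentile triple described above. It assumes request records arrive as (method, route, latency_ms) tuples and uses the nearest-rank percentile definition; both the record shape and the function names are illustrative assumptions, not a prescribed implementation. Production systems typically compute percentiles from histograms rather than raw samples.

```python
from collections import defaultdict


def percentile(samples, q):
    """Nearest-rank percentile for an integer q in 1..100.

    Returns the smallest sample such that at least q% of samples
    are less than or equal to it.
    """
    if not samples:
        raise ValueError("percentile of an empty sample set is undefined")
    ordered = sorted(samples)
    # Integer ceiling of q * n / 100, which avoids floating-point rounding.
    rank = max(1, (q * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def latency_by_route(requests):
    """Group latencies by (method, route) and compute p50, p95, p99.

    `requests` is an iterable of (method, route, latency_ms) tuples;
    this record shape is an assumption made for the example.
    """
    grouped = defaultdict(list)
    for method, route, latency_ms in requests:
        grouped[(method, route)].append(latency_ms)
    return {
        key: {q: percentile(values, q) for q in (50, 95, 99)}
        for key, values in grouped.items()
    }


# 98 fast requests and 2 very slow ones on the same route.
sample = [("GET", "/checkout", 10)] * 98 + [("GET", "/checkout", 2000)] * 2
report = latency_by_route(sample)
```

For this sample the mean is 49.8 ms, which looks healthy, while the p99 is 2000 ms: the tail only shows up in the percentile. Adding a region field to the grouping key would give the per-region cut in the same way.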
Errors: rate, not count
Errors are the correctness signal. Rate makes comparison across services possible; raw counts mislead.
- Errors per request. Divide failed requests by total requests over the same window; the ratio is comparable across services and traffic levels, and a raw count is not.
- 4xx vs 5xx. Client errors and server errors warrant different responses; do not collapse the categories.
- Error budget. Error rate feeds the SLO budget calculation; the budget drives ship-or-stop decisions.
- Named owner per class. Each error class has a responsible team, which avoids alerts that belong to everyone and therefore to no one.
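The rate and budget points above can be sketched as follows. The function names are hypothetical, and the example assumes that only server errors (5xx) are charged against the budget; whether 4xx responses count is a policy decision the SLO itself has to state.

```python
def error_rates(status_codes):
    """Return 4xx and 5xx rates as fractions of all requests, kept separate."""
    total = len(status_codes)
    if total == 0:
        return {"client": 0.0, "server": 0.0}
    client = sum(1 for code in status_codes if 400 <= code < 500)
    server = sum(1 for code in status_codes if 500 <= code < 600)
    return {"client": client / total, "server": server / total}


def budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent in the window.

    slo_target is a success ratio such as 0.999. A negative result
    means the budget is overspent.
    """
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        # A 100% target (or an empty window) leaves no budget to spend.
        return 0.0 if failed_requests == 0 else float("-inf")
    return 1 - failed_requests / allowed_failures


rates = error_rates([200] * 90 + [404] * 8 + [500] * 2)
remaining = budget_remaining(0.999, 1_000_000, 250)
```

Here the client error rate is 8% and the server error rate is 2%. With a 99.9% target over one million requests, roughly 1000 failures are allowed, so 250 failures leave about three quarters of the budget.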
Saturation: the leading indicator
Saturation is the capacity signal. It fires before the user-visible failure, yet most teams under-instrument it.
- CPU, memory, connection pool. Utilisation gauges per resource; these move first when the system is stressed.
- Preventive action. Acting on saturation before symptoms appear is the difference between an investigation and an incident.
- Under-instrumented default. Most teams track latency and errors but miss saturation; the cost is missed early warnings.
- Watch threshold. 70 to 80% utilisation is a common starting watch level; tune per resource based on burst behaviour.
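A minimal sketch of a per-resource watch-level check, assuming utilisation is reported as (used, capacity) pairs. The resource names and the specific thresholds below are illustrative values inside the 70 to 80% band, not recommendations for any particular system.

```python
# Assumed per-resource watch levels; tune these to each resource's burst behaviour.
WATCH_LEVELS = {"cpu": 0.80, "memory": 0.80, "connection_pool": 0.70}


def utilisation(used, capacity):
    """Utilisation as a fraction of capacity."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return used / capacity


def over_watch_level(readings, watch_levels=WATCH_LEVELS, default_level=0.75):
    """Return the resources whose utilisation is at or above their watch level.

    `readings` maps a resource name to a (used, capacity) pair. Resources
    without an explicit level fall back to `default_level`.
    """
    flagged = {}
    for name, (used, capacity) in readings.items():
        level = utilisation(used, capacity)
        if level >= watch_levels.get(name, default_level):
            flagged[name] = level
    return flagged


flagged = over_watch_level({
    "cpu": (6, 8),                # 75%, below the 80% level
    "memory": (13, 16),           # 81.25%, above the 80% level
    "connection_pool": (45, 50),  # 90%, above the 70% level
})
```

In this example memory and the connection pool are flagged while CPU is not. A lower level for the connection pool reflects the assumption that pools exhaust abruptly under bursts; a resource that degrades gradually could tolerate a higher one.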