Symptom-Based vs Cause-Based Alerts: Which Wins
The default modern advice is ‘alert on symptoms, not causes.’ The advice is right roughly 80% of the time. The other 20% matters.
What each means
Symptom alerts: “users see slow responses.” Cause alerts: “CPU is at 95%.”
Symptoms are user-perceived; causes are infrastructure-measured. Most teams alert on causes; users care about symptoms.
Why symptoms win the default
- Symptoms catch what matters and ignore what does not. CPU at 95% on a workload that handles it fine is not an incident; pages on it are noise.
- Symptoms also catch cause combinations. A cause-only alert misses outages where two unhealthy systems, each still below its own threshold, combine to cause user pain.
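Both points can be seen in a small sketch. The thresholds, field names, and sample numbers below are illustrative assumptions, not recommended values: one snapshot has hot CPU but healthy latency, the other has two causes that each sit under their limit while users still see slow responses.

```python
from dataclasses import dataclass

# All thresholds and field names here are hypothetical, chosen for illustration.
CPU_LIMIT_PCT = 90.0
POOL_LIMIT_PCT = 90.0
LATENCY_SLO_MS = 500.0


@dataclass
class Snapshot:
    cpu_pct: float           # cause: infrastructure-measured
    db_pool_used_pct: float  # cause: infrastructure-measured
    p99_latency_ms: float    # symptom: what users experience


def cause_alert(s: Snapshot) -> bool:
    """Fires when any single infrastructure metric crosses its own limit."""
    return s.cpu_pct > CPU_LIMIT_PCT or s.db_pool_used_pct > POOL_LIMIT_PCT


def symptom_alert(s: Snapshot) -> bool:
    """Fires when user-perceived latency breaches the target."""
    return s.p99_latency_ms > LATENCY_SLO_MS


# CPU is hot, but the workload handles it: the cause alert is noise.
busy_but_fine = Snapshot(cpu_pct=95.0, db_pool_used_pct=40.0, p99_latency_ms=180.0)

# Neither cause crosses its limit, yet together they hurt users.
combined_degradation = Snapshot(cpu_pct=85.0, db_pool_used_pct=85.0, p99_latency_ms=900.0)

for name, snap in [("busy_but_fine", busy_but_fine),
                   ("combined_degradation", combined_degradation)]:
    print(name, "cause:", cause_alert(snap), "symptom:", symptom_alert(snap))
```

In the first snapshot only the cause alert fires; in the second only the symptom alert does.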
When causes win
Causes win for early warning. CPU climbing slowly is a leading indicator; user-perceived slowness is the lagging indicator.
Causes also win when you cannot measure the symptom directly, as with backend services that have no user-facing metric.
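One way to turn a slowly climbing cause metric into an early warning is to extrapolate its trend. The sketch below fits a straight line to recent samples and estimates how long until the metric reaches a limit. The function name, the 90% limit, and the sample data are assumptions for illustration, and a linear fit is only one of several reasonable trend models.

```python
from typing import Optional, Sequence, Tuple


def minutes_until_limit(samples: Sequence[Tuple[float, float]],
                        limit: float) -> Optional[float]:
    """Estimate minutes from the last sample until `limit` is reached.

    `samples` is a sequence of (minute, value) pairs. Returns None when
    there is too little data or the metric is flat or falling.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Ordinary least-squares slope and intercept.
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_v - slope * mean_t
    time_at_limit = (limit - intercept) / slope
    return max(0.0, time_at_limit - samples[-1][0])


# Hypothetical CPU readings: climbing 2 points every 10 minutes.
cpu_samples = [(0, 60.0), (10, 62.0), (20, 64.0), (30, 66.0)]
eta = minutes_until_limit(cpu_samples, limit=90.0)
print(f"estimated minutes until 90% CPU: {eta}")
```

With these samples the estimate comes out at about two hours, long before users would notice anything, which is the kind of signal that suits a ticket more than a page.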
The hybrid pattern
Page on symptoms (high signal, user-visible). Ticket on causes (early warning, internal). The two-tier pattern keeps pages clean and signals trends early.
Most teams that go symptom-only lose their early warning. The hybrid is the realistic posture.
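The two-tier rule is simple enough to write down as a routing function. This is a minimal sketch, assuming each alert is already labeled as a symptom or a cause; the `Alert`, `Kind`, and `Route` names are hypothetical and not tied to any particular alerting tool.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    SYMPTOM = "symptom"  # user-visible, e.g. slow responses
    CAUSE = "cause"      # infrastructure-measured, e.g. CPU at 95%


class Route(Enum):
    PAGE = "page"        # wakes someone up
    TICKET = "ticket"    # handled during working hours


@dataclass
class Alert:
    name: str
    kind: Kind


def route(alert: Alert) -> Route:
    """Hybrid pattern: page on symptoms, ticket on causes."""
    return Route.PAGE if alert.kind is Kind.SYMPTOM else Route.TICKET


# Hypothetical alerts for illustration.
alerts = [
    Alert("checkout p99 latency above target", Kind.SYMPTOM),
    Alert("CPU trending toward saturation", Kind.CAUSE),
]
for a in alerts:
    print(f"{a.name}: {route(a).value}")
```

Keeping the decision in one place makes it easy to review which alerts are allowed to page.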
Antipatterns
- Cause-only alerts. They page on noise and miss real impact.
- Symptom-only alerts. No early warning.
- Both at page-tier. Doubles pager noise.
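These three antipatterns can be checked mechanically against an alert inventory. The sketch below assumes a hypothetical inventory format, a list of (kind, tier) pairs, and reports which antipatterns it matches.

```python
from typing import List, Tuple

# Hypothetical inventory format: (kind, tier) per alert rule,
# with kind in {"symptom", "cause"} and tier in {"page", "ticket"}.
Rule = Tuple[str, str]


def audit(rules: List[Rule]) -> List[str]:
    """Return the antipatterns present in a set of alert rules."""
    kinds = {kind for kind, _ in rules}
    paged_kinds = {kind for kind, tier in rules if tier == "page"}
    findings = []
    if kinds == {"cause"}:
        findings.append("cause-only")
    if kinds == {"symptom"}:
        findings.append("symptom-only")
    if {"symptom", "cause"} <= paged_kinds:
        findings.append("both at page tier")
    return findings


print(audit([("cause", "page"), ("cause", "ticket")]))
print(audit([("symptom", "page")]))
print(audit([("symptom", "page"), ("cause", "page")]))
print(audit([("symptom", "page"), ("cause", "ticket")]))
```

The last inventory follows the hybrid pattern and produces no findings.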
What to do this week
Three moves. (1) Apply the hybrid pattern to your noisiest alert: page on its symptom, ticket on its cause. (2) Measure pages-per-shift for one week before and one week after the change. (3) Schedule a quarterly alert review so the discipline survives team turnover.
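For the measurement step, pages-per-shift can be computed from a list of page timestamps. This is a sketch under assumptions: the 12-hour shift length, the function name, and the sample timestamps are all illustrative, and in practice the timestamps would come from your paging tool's export.

```python
from datetime import datetime, timedelta
from typing import List, Sequence


def pages_per_shift(page_times: Sequence[datetime],
                    start: datetime,
                    end: datetime,
                    shift: timedelta = timedelta(hours=12)) -> List[int]:
    """Count pages in each shift-sized bucket of the window [start, end)."""
    # Ceiling division, so a trailing partial shift gets its own bucket.
    n_shifts = -((start - end) // shift)
    counts = [0] * n_shifts
    for t in page_times:
        if start <= t < end:
            counts[(t - start) // shift] += 1
    return counts


# Hypothetical one-day window with two 12-hour shifts.
window_start = datetime(2024, 1, 1)
window_end = window_start + timedelta(days=1)
pages = [
    datetime(2024, 1, 1, 1, 0),
    datetime(2024, 1, 1, 3, 0),
    datetime(2024, 1, 1, 13, 0),
]
counts = pages_per_shift(pages, window_start, window_end)
mean_pages = sum(counts) / len(counts)
print(counts, mean_pages)
```

Running the same function over the week before and the week after the change gives two comparable series.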