Observability Beginner | By Samson Tanimawo, PhD | Published Oct 2, 2026

Signals vs Symptoms: What Your Monitoring Should Actually Watch

Most teams alert on symptoms (CPU high, queue depth growing) and react too late. Alerting on signals (user-visible behaviour) catches problems before they become incidents.

The two terms

A signal is something the user experiences: latency, errors, missing data. A symptom is something the system experiences: CPU saturation, queue depth, GC pauses. Both are real. They tell you different things.

The vocabulary matters because conflating the two is what produces noisy alerting. Teams treat symptoms as if they were signals, paging on CPU above 80% when no user is affected. The page wakes someone up; investigation finds nothing customer-facing broken; the alert burns trust. Multiplied across services, this pattern is what makes on-call hated.

The right framing changes how you choose what to monitor. Signals belong on customer-facing dashboards; they drive incident severity. Symptoms belong on engineering dashboards; they drive capacity planning and proactive maintenance. Mixing them is what produces dashboards nobody can read.
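
To make the routing concrete, here's a minimal sketch in Python. The Alert class and route_alert function are hypothetical names for illustration, not any particular tool's API; the point is that the signal/symptom tag decides where an alert lives and whether it can page.

```python
# Minimal sketch: the signal/symptom tag decides routing.
# Alert and route_alert are illustrative names, not a vendor API.
from dataclasses import dataclass
from typing import Literal

@dataclass
class Alert:
    name: str
    kind: Literal["signal", "symptom"]  # what the user vs. the system experiences

def route_alert(alert: Alert) -> dict:
    """Signals page on-call and drive severity; symptoms stay on engineering dashboards."""
    if alert.kind == "signal":
        return {"dashboard": "customer-facing", "pages_oncall": True, "drives_severity": True}
    return {"dashboard": "engineering", "pages_oncall": False, "drives_severity": False}

print(route_alert(Alert("checkout p95 latency > 2s", "signal")))
print(route_alert(Alert("CPU > 80% on worker pool", "symptom")))
```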

Why it matters

Symptom-based alerts wake you up for "CPU is at 92%", a state that may or may not affect users. Signal-based alerts wake you up for "checkout latency is over 2s", which absolutely affects users. Tune toward signals and on-call gets calmer; tune toward symptoms and the team chases ghosts.

The math at scale. A typical service has 5-10 symptom alerts (resource utilisation, queue depths, GC pauses) and 2-3 signal alerts (user-facing latency, availability, correctness). Pages from symptoms outnumber pages from signals 5:1 by default. Most of those symptom pages are noise: the system was unhappy, but customers weren't affected.

The shift's effect on culture. Teams that move to signal-based alerting report dramatic drops in pager fatigue. Engineers stop dreading on-call because the pages they get are real. Recruiting improves because the role becomes sustainable. The signal/symptom distinction sounds academic; the operational benefit is large.

Examples of symptom-based alerting

Symptoms are useful for capacity alerts (you should add nodes before users feel anything) and predictive alerts (disk filling up, certs expiring). Beyond that, symptoms create noise.

The legitimate use cases. CPU at 90% sustained for 30 minutes: actionable for capacity planning, even if no user is yet affected. Disk at 85% on a database: predictive; you have hours to act before it bites. A certificate expiring in 14 days: predictive; you have days to renew. Each is a leading indicator with enough lead time for non-urgent action.

The illegitimate use cases. A momentary CPU spike to 70%: meaningless without context. Queue depth above 100: meaningless if the queue drains; meaningful if it persists. Memory above 80%: meaningless on the JVM (which grabs whatever heap it's given); meaningful on services with memory leaks. The distinction is subtle; teams tend to over-alert on symptoms because the alerts are easy to write.
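
Here's a small sketch of the "sustained, not momentary" rule the legitimate cases rely on. The one-sample-per-minute format and the thresholds are assumptions for illustration; the shape of the check is what matters.

```python
# Sketch: a symptom only becomes actionable when it persists.
# Assumes one metric sample per minute; thresholds are illustrative.
def sustained_above(samples: list[float], threshold: float, minutes: int) -> bool:
    """True only if the last `minutes` samples are all above `threshold`."""
    if len(samples) < minutes:
        return False
    return all(value > threshold for value in samples[-minutes:])

cpu = [72, 74] + [92] * 30                                  # climbs and stays high
queue_depth = [40, 400, 35, 20, 15, 10, 8, 5, 3, 2, 1, 0]   # spike that drained

print(sustained_above(cpu, 90, 30))           # True: worth a capacity page
print(sustained_above(queue_depth, 100, 10))  # False: momentary, no page
```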

Examples of signal-based alerting

p95 latency, error rate, and success rate for the critical user journeys. These map directly to user-visible problems. When they fire, on-call knows something is wrong, not just unusual.

The critical-user-journeys principle. Most services have 3-5 user journeys that matter (signup, login, search, checkout, and the like). Alert on the latency and availability of each. A 7-service stack with 4 critical journeys yields 28 signal alerts; that's manageable. Compare that to symptom-based alerting, which easily produces hundreds.
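
A sketch of what enumerating those alerts looks like. The service names, journey names, and thresholds are placeholders; the arithmetic is the point.

```python
# Sketch: one signal alert per (service, journey) pair.
# Service names, journeys, and thresholds are placeholders.
services = [f"service-{n}" for n in range(1, 8)]      # hypothetical 7-service stack
journeys = ["signup", "login", "search", "checkout"]

signal_alerts = [
    {"service": s, "journey": j, "p95_latency_ms": 2000, "availability_pct": 99.9}
    for s in services
    for j in journeys
]
print(len(signal_alerts))  # 28: the manageable set described above
```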

The implementation. Run synthetic checks against the critical paths every minute. Real-user monitoring (RUM) data complements but doesn't replace synthetics: synthetic checks give a constant signal even when traffic is low; RUM shows the actual user experience. Both feed into signal-based alerts.
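
A minimal synthetic-check sketch using only the Python standard library. The URL and the 2-second threshold are placeholders, and a real setup would run this on a schedule (e.g. every minute) and feed the results into whatever alerting pipeline you already use.

```python
# Sketch: probe one critical path and turn the result into a signal.
# URL, journey name, and 2s threshold are placeholders.
import time
import urllib.request

def probe(url: str, timeout_s: float = 5.0) -> dict:
    """Hit one critical path; record latency and whether it succeeded."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            ok = 200 <= resp.status < 300
    except Exception:
        ok = False
    return {"latency_s": time.monotonic() - start, "ok": ok}

result = probe("https://example.com/")  # stand-in for a checkout endpoint
if not result["ok"] or result["latency_s"] > 2.0:
    print("signal alert: checkout journey degraded", result)
```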

The four-question filter

For any alert ask: (1) Is a user affected? (2) Will a user be affected within an hour? (3) Can the team act on the metric? (4) Does the team know what to do about it? If yes to all four, keep the alert. If no to any, demote or delete.

Each question targets a different alert antipattern. Question 1 catches alerts that fire when nothing's actually wrong. Question 2 catches alerts that fire too late or too early. Question 3 catches alerts that produce information the team can't act on. Question 4 catches alerts that produce information without runbooks.

The application discipline. Once a year, audit every alert against the four questions. Each alert that fails any question gets demoted (P1 → P2) or deleted. Most teams find 30-50% of their alerts fail at least one question; that's the noise reduction the team has been looking for.
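
A sketch of the filter as an audit helper. The fields and the yes/no answers are stand-ins for the judgment calls the team makes while reviewing each alert; the code just applies the keep-only-if-all-four-are-yes rule mechanically.

```python
# Sketch: apply the four-question filter during an alert audit.
# Field names and the example answers are illustrative.
from dataclasses import dataclass

@dataclass
class AlertReview:
    name: str
    user_affected: bool            # Q1: is a user affected?
    affected_within_hour: bool     # Q2: will a user be affected within an hour?
    actionable: bool               # Q3: can the team act on the metric?
    has_runbook: bool              # Q4: does the team know what to do?

def verdict(r: AlertReview) -> str:
    keep = r.user_affected and r.affected_within_hour and r.actionable and r.has_runbook
    return "keep" if keep else "demote or delete"

reviews = [
    AlertReview("checkout p95 latency > 2s", True, True, True, True),
    AlertReview("CPU > 70% momentary", False, False, True, False),
]
for r in reviews:
    print(f"{r.name}: {verdict(r)}")
```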

Common antipatterns

"Better safe than sorry" alerting. Team adds an alert because "we should know if X happens." X happens monthly; the alert fires; nobody acts because there's no runbook. The alert exists for emotional comfort, not operational benefit. Delete it.

Symptoms aliased as signals. Team treats "queue depth above 1000" as user-facing because "users will eventually be affected." Eventually isn't now. Demote it to a symptom alert (capacity dashboard); reserve user-facing severity for actual user impact.

Inherited alerts that nobody owns. Service was inherited from another team; the original alerts are still firing; nobody knows whether they're meaningful. Audit; delete the unowned ones; the remaining set has clear ownership.

The "leading indicator" excuse. Team defends a symptom alert as "leading indicator." Leading indicators are useful; they go on dashboards, not on the pager. The pager is for "act now" not "watch this trend."

What to do this week

Three moves. (1) For your most-paged service, classify each alert as signal or symptom. Most teams find 70%+ are symptoms. (2) Apply the four-question filter to the symptom alerts. Demote the failures to dashboard or delete them. (3) Identify your top 4 user journeys and confirm you have signal-based alerts for each. Most teams discover 1-2 critical journeys aren't actively monitored; that's the gap to fill.
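
For move (3), a tiny sketch of the coverage check. The journey names and the existing-alert list are placeholders for whatever your alerting system exports; matching by substring is crude but enough to surface the gaps.

```python
# Sketch: find critical journeys with no signal alert covering them.
# Journey names and the alert list are placeholders.
critical_journeys = {"signup", "login", "search", "checkout"}
existing_signal_alerts = {
    "login p95 latency > 2s",
    "search availability < 99.9%",
    "checkout error rate > 1%",
}

covered = {j for j in critical_journeys if any(j in alert for alert in existing_signal_alerts)}
missing = critical_journeys - covered
print("journeys without signal alerts:", sorted(missing))  # e.g. ['signup']
```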