Distinguishing Real Incidents from Noise
Half the things that page your team aren't real incidents. The discipline of telling them apart, fast, is a more productive investment than any new tool.
The noise flood
Most teams' page volumes are 60-80% noise: a flap, a known-flaky check, a vendor blip, a deploy-related transient. None of these is a real customer-facing incident. The 20-40% that is real gets the same response as the 60-80% that isn't, which is why on-call is exhausting.
The cost of noise compounds. Each false page costs roughly 30 minutes of engineering attention: the page itself, the recovery time, and the context loss in whatever the engineer was doing. At 10 noisy pages per engineer per week, that's 5 hours of erosion per engineer per week, most of a working day. Across a team of 8, that's 40 hours: the team is losing the equivalent of one full-time engineer to noise.
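As a back-of-envelope script (the figures are the ones above; substitute your own page volume):

```python
# Back-of-envelope cost of noise, using the figures above.
minutes_per_false_page = 30          # page + recovery + lost context
noisy_pages_per_engineer_week = 10
team_size = 8

hours_per_engineer = noisy_pages_per_engineer_week * minutes_per_false_page / 60
team_hours = hours_per_engineer * team_size
print(f"{hours_per_engineer:.0f} h/engineer/week; "
      f"{team_hours:.0f} h/team/week (one full-time engineer is ~40 h)")
# -> 5 h/engineer/week; 40 h/team/week (one full-time engineer is ~40 h)
```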
The other cost: noise erodes alertness. After the 8th false page, engineers stop reading the alert details closely. The 9th page, which IS real, gets handled with the same casualness as the previous 8 false ones. The miss probability on real incidents rises with every noisy page that came before.
The three-question filter
- Is a customer affected right now? If no, it might be tomorrow's incident, not today's.
- Is the metric still bad? If it self-recovered before you ack'd, it was a flap.
- Is it the SAME flap as last week? If yes, the alert is wrong; the flap is the bug.
30 seconds, three questions, the page is classified. Real incidents survive all three; noise dies on at least one. The filter is what separates "we need to investigate" from "we need to delete this alert."
Each question targets a different noise pattern. Question 1 catches alerts that fired on internal symptoms with no customer impact. Question 2 catches transient flaps. Question 3 catches the chronic-noise pattern where the same alert misfires repeatedly. Together they catch most of the 60-80% noise.
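A minimal sketch of the filter as code, assuming you already have answers to the three questions in hand when the page arrives. The `Page` fields and `classify` function are hypothetical names; the return labels match the audit-trail taxonomy used later (real / flap / known-flaky / customer-irrelevant):

```python
from dataclasses import dataclass

@dataclass
class Page:
    alert_name: str
    customers_affected: bool    # Q1: from the customer-SLI dashboard
    still_firing: bool          # Q2: is the metric still bad right now?
    flaps_this_quarter: int     # Q3: prior misfires of this same alert

def classify(page: Page) -> str:
    """Run the three-question filter; noise dies on the first failed question."""
    if not page.customers_affected:
        return "customer-irrelevant"    # tomorrow's incident, not today's
    if not page.still_firing:
        return "flap"                   # self-recovered before the ack
    if page.flaps_this_quarter >= 3:
        return "known-flaky"            # the alert itself is the bug
    return "real"                       # survived all three: activate response
```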
Question 1: customer-affected right now
The most important question. An alert that fires because "CPU is at 90%" is a symptom alert, not a customer alert. If customers aren't experiencing slowness, errors, or unavailability, the symptom alert may or may not be useful, but it's not a SEV-anything incident. It's an early-warning signal.
How to check fast. The dashboard for customer-facing SLIs (latency, error rate, success rate) lives at a known URL. The on-call's first action when paged is to open that dashboard. If the SLIs are healthy, the page is informational; investigate during business hours. If the SLIs are degraded, this is a real incident; activate the response.
The classification mistake. Treating the symptom alert as if it were a customer alert. CPU at 90% is rarely a customer-affecting fact; it's a leading indicator that COULD become customer-affecting. Treating it like an active incident burns engineering time on a future-tense problem; the right response is to investigate during business hours and add capacity if the trend warrants it.
Question 2: still bad, or self-recovered
A flap is an alert that fires and clears within a few minutes without intervention. It is almost always noise: either the threshold was too tight, the metric was bouncing within its normal range, or the underlying transient was real but not worth waking up for.
The check. Look at the metric over the last 15 minutes. If the metric returned to normal before the on-call could meaningfully respond, the alert is fundamentally too sensitive. The only outcome was an engineer woken up to no purpose.
What to do with flaps. Tag them. Track them. After a flap, post in the on-call channel: "page X was a flap, will tune the alert." That triggers either a threshold change, a debounce (require N minutes of bad before paging), or a deletion if the alert flaps weekly. The discipline of after-action tagging is what stops flaps from becoming permanent.
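A minimal debounce sketch, assuming a metric sampled at a fixed check interval; the class and parameter names are hypothetical. Most alerting systems offer this natively (Prometheus's `for:` duration, for example), so prefer the built-in where one exists:

```python
class DebouncedAlert:
    """Page only after the metric has been bad for N consecutive checks;
    a flap that clears inside the window never pages anyone."""

    def __init__(self, threshold: float, bad_checks_required: int = 5):
        self.threshold = threshold
        self.bad_checks_required = bad_checks_required
        self._consecutive_bad = 0

    def observe(self, value: float) -> bool:
        """Feed one sample; return True exactly once, when the page should fire."""
        if value > self.threshold:
            self._consecutive_bad += 1
        else:
            self._consecutive_bad = 0   # any recovery resets the window
        return self._consecutive_bad == self.bad_checks_required
```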
Question 3: same flap as last week
If a specific alert has flapped 3+ times in a quarter, the alert itself is the bug. Either it's mis-tuned, the underlying system has a known transient that the alert keeps catching, or the metric is fundamentally not aligned with customer impact.
The fix scales with frequency. 3 flaps a quarter: tighten threshold or add debounce. 5 flaps: re-evaluate whether the alert is the right alert at all. 10+ flaps: delete the alert and replace it with one based on a different metric (preferably a customer SLI).
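The same policy as a lookup, with the thresholds from the paragraph above (the function name is illustrative):

```python
def flap_remedy(flaps_this_quarter: int) -> str:
    """Map chronic-flap frequency to the fix above."""
    if flaps_this_quarter >= 10:
        return "delete the alert; rebuild it on a customer SLI"
    if flaps_this_quarter >= 5:
        return "re-evaluate whether this is the right alert at all"
    if flaps_this_quarter >= 3:
        return "tighten the threshold or add a debounce"
    return "tag it and keep watching"
```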
The mistake. Continuing to investigate the same chronic-flap alert as if each instance were a new incident. Engineers debug it from scratch each time, find the same transient cause, mark it as resolved, and the cycle repeats. The pattern is the bug; the alert is its symptom.
Keep the audit trail
For every page, log the classification: real / flap / known-flaky / customer-irrelevant. The log is read at the next paging-policy review. Without the log, the team's memory blurs flaps with incidents and the noise filter never improves.
The log doesn't need to be elaborate. A Slack thread per page where the on-call posts the classification at end-of-shift. A spreadsheet with date / alert / classification / 1-line reason. A field in your incident-management tool. Whatever the team will actually do; rigour beats elegance.
The review. Monthly, the EM (or a senior engineer) reads through the log. Patterns emerge: "alert X has been classified as flap 8 times in 6 weeks, let's kill it." "Alert Y has been classified as real every time, let's tighten the threshold to fire earlier." The review is what turns the log into action.
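A sketch of what the review script might look like, assuming the spreadsheet variant of the log saved as CSV with date / alert / classification / reason columns; the threshold and names are illustrative:

```python
import csv
from collections import Counter

def monthly_review(log_path: str, kill_threshold: int = 3) -> None:
    """Surface the patterns the review looks for: chronic flaps to kill,
    always-real alerts whose thresholds could fire earlier."""
    noise, real = Counter(), Counter()
    with open(log_path, newline="") as f:
        for row in csv.DictReader(f):   # columns: date, alert, classification, reason
            if row["classification"] in ("flap", "known-flaky"):
                noise[row["alert"]] += 1
            elif row["classification"] == "real":
                real[row["alert"]] += 1

    for alert, n in noise.most_common():
        if n >= kill_threshold:
            print(f"kill or tune: {alert} classified as noise {n} times")
    for alert, n in real.items():
        if alert not in noise:
            print(f"trust it: {alert} real on all {n} pages; consider firing earlier")
```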
Don't leave it to the on-call alone
The on-call at 3am is the worst person to decide if something is noise or not. They are biased toward calling it noise (so they can go back to sleep). Disagreements get reviewed by the team during business hours; the on-call's classification is provisional.
The two-person rule. Any classification of "flap" gets a second-set-of-eyes review during the next standup. Most of the time, the second engineer agrees and the classification stands. Sometimes the second engineer disagrees ("wait, that wasn't a flap, customers reported issues") and the team learns something. The second pair of eyes catches the bias.
The classifications that especially need review. The "flap, no customer impact" classifications are the dangerous ones: they could be honestly noisy, or genuinely impactful but unnoticed. Real incidents that get misclassified as flaps are the ones that come back at 1.5x severity within 24 hours.
Improving over time
Every page logged as flap gets a follow-up: tighten the threshold, add a debounce, remove the alert if it has flapped 5+ times in a quarter. The noise rate is a lever, not a constant. Most teams that take this seriously cut noise 50% in their first quarter.
The compounding effect. Lower noise means the on-call is more alert; more alert means real incidents get faster response; faster response means lower MTTR; lower MTTR means happier customers; happier customers means less churn. The chain starts with disciplined noise reduction.
Track the noise rate as a team metric. Pages-classified-as-flap divided by total pages, weekly. The trend matters more than the level. A team going from 40% noise to 25% in a quarter is winning; a team plateaued at 60% needs structural attention.
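A sketch of the metric under the same CSV-log assumption:

```python
import csv
from collections import defaultdict
from datetime import date

def weekly_noise_rate(log_path: str) -> dict:
    """Pages-classified-as-flap over total pages, keyed by ISO week."""
    flaps, totals = defaultdict(int), defaultdict(int)
    with open(log_path, newline="") as f:
        for row in csv.DictReader(f):   # date column as YYYY-MM-DD
            year, week, _ = date.fromisoformat(row["date"]).isocalendar()
            key = f"{year}-W{week:02d}"
            totals[key] += 1
            if row["classification"] == "flap":   # widen to known-flaky if your definition of noise is broader
                flaps[key] += 1
    return {week: flaps[week] / totals[week] for week in sorted(totals)}
```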
What to do this week
Three moves. (1) Pin the three-question filter in your on-call channel. New on-callers reach for it; veterans use it as a sanity check. (2) Start the classification log this Friday; it's a five-minute habit at the end of each shift. The log starts small; trust the discipline. (3) Schedule the monthly review on the EM's calendar. The review is what turns the log into noise reduction.