Alert vs Dashboard Decision
Some signals belong on dashboards, not in alerts.
The decision rule
The decision is action-driven. Alert when customer impact is happening or imminent, an action exists, and someone needs to take it now. Use a dashboard for trending data, aggregate metrics, situational awareness, and post-hoc analysis: data that informs decisions but doesn’t demand immediate action. Mixing the two creates fatigue: dashboards full of pageable signals get ignored, and pages that should have been dashboards burn out the on-call.
- Alert: customer impact + action + urgency. All three required; the test for paging.
- Dashboard: trends, aggregates, awareness. Informs decisions; doesn’t demand action now.
- Mixing creates fatigue. Dashboards full of pageable signals get ignored; over-paging burns out the on-call.
- Per-signal placement decision. Each signal lands in alert or dashboard; supports clear ownership.
Strict criteria for alerts
Three criteria must all hold for a signal to be an alert. Customer impact: real or imminent; signals with no customer connection, such as CPU at 80%, are dashboards, not pages. Action exists: an alert without a runbook is a notification of helplessness, so find an action or move the signal to a dashboard. Time-sensitive: if the action can wait until business hours, the alert can wait.
- Customer impact required. No customer connection means dashboard, not page.
- Action exists. Alert without runbook is notification of helplessness; find an action or move it to a dashboard.
- Time-sensitive. If action can wait until business hours, alert can wait.
- Per-criterion check. All three required; the discipline lives in the trio.
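The three-criteria test above can be sketched as a small predicate. This is a minimal illustration, not a prescribed implementation: the `Signal` type, its field names, and the example signals are hypothetical, and in practice each flag would come from a human judgment recorded alongside the alert definition.

```python
from dataclasses import dataclass
from enum import Enum


class Placement(Enum):
    ALERT = "alert"
    DASHBOARD = "dashboard"


@dataclass(frozen=True)
class Signal:
    # All field names are illustrative assumptions.
    name: str
    customer_impact: bool  # real or imminent customer impact
    has_runbook: bool      # a concrete action exists
    time_sensitive: bool   # the action cannot wait until business hours


def place(signal: Signal) -> Placement:
    """Return ALERT only when all three criteria hold; otherwise DASHBOARD."""
    if signal.customer_impact and signal.has_runbook and signal.time_sensitive:
        return Placement.ALERT
    return Placement.DASHBOARD


# Hypothetical signals for illustration.
cpu_80 = Signal("cpu_at_80_percent", customer_impact=False,
                has_runbook=True, time_sensitive=False)
checkout_errors = Signal("checkout_error_rate", customer_impact=True,
                         has_runbook=True, time_sensitive=True)
no_runbook = Signal("checkout_latency", customer_impact=True,
                    has_runbook=False, time_sensitive=True)
```

Here `place(cpu_80)` and `place(no_runbook)` both land on the dashboard: the first has no customer connection, and the second has no action attached yet.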
Dashboard criteria
Three categories belong on dashboards. Trends and aggregates: week-over-week and month-over-month metrics, capacity planning, SLO burn-down. Operational awareness: the on-call checks dashboards at the start of a shift, and during incidents dashboards inform the response but don’t drive paging. Stakeholder reports: business metrics, customer counts, and revenue, where the audience is decision makers.
- Trends and aggregates. WoW, MoM metrics; capacity planning; SLO burn-down.
- Operational awareness. Start-of-shift check; during incidents, dashboards inform but don’t page.
- Stakeholder reports. Business metrics, customer counts, revenue; audience is decision makers.
- Per-dashboard owner. Each dashboard has an owner team; supports continued curation.
Converting between them
Conversion goes both ways. Frequently firing alerts that operators dismiss without action are dashboard candidates: track the per-alert action rate, and treat a rate below 50% as a sign the signal is not an alert. Dashboard panels that surface real problems people only notice in postmortems are alert candidates: convert when the pattern repeats. Review both directions quarterly; each conversion is a small win, and the cumulative effect is a significant improvement in alert quality.
- Below 50% action rate. Demote to dashboard; the alert isn’t earning its keep.
- Postmortem pattern repeats. Dashboard panel becomes alert candidate.
- Quarterly review both directions. Each conversion is a small win; the gains in alert quality accumulate.
- Per-conversion documented rationale. Records why the move happened; supports investigation.
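The action-rate check can be sketched as follows. This is a minimal example under stated assumptions: the `AlertStats` type and the alert names are hypothetical, and the counts of firings and acted-on firings are assumed to come from whatever paging or incident tool is in use. Only the 50% threshold is taken from the text.

```python
from dataclasses import dataclass
from typing import List, Optional

ACTION_RATE_THRESHOLD = 0.5  # below 50% means not an alert


@dataclass(frozen=True)
class AlertStats:
    # Hypothetical per-alert counts for one review period.
    name: str
    fired: int     # times the alert fired
    acted_on: int  # firings where the operator took an action


def action_rate(stats: AlertStats) -> Optional[float]:
    """Fraction of firings that led to action; None if the alert never fired."""
    if stats.fired == 0:
        return None
    return stats.acted_on / stats.fired


def demotion_candidates(all_stats: List[AlertStats]) -> List[str]:
    """Names of alerts whose action rate falls below the threshold."""
    candidates = []
    for stats in all_stats:
        rate = action_rate(stats)
        if rate is not None and rate < ACTION_RATE_THRESHOLD:
            candidates.append(stats.name)
    return candidates


quarter = [
    AlertStats("disk_latency_high", fired=10, acted_on=4),
    AlertStats("checkout_error_rate", fired=10, acted_on=5),
    AlertStats("never_fired", fired=0, acted_on=0),
]
```

An alert that never fired is skipped rather than demoted, since it has no action rate to judge; whether such an alert should stay is a separate review question.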
Anti-patterns
Three anti-patterns survive too long. Dashboards full of red panels nobody investigates: dashboards are not alerts, and visual urgency creates anxiety without action. Alerts that exist for reassurance (“alert if too quiet”) without clear meaning: define what “too quiet” means and what to do about it, or remove the alert. Both an alert and a dashboard for the same signal: pick one based on the action, or ensure the two have different audiences and clear ownership.
- Red dashboard panels. Dashboards aren’t alerts; visual urgency creates anxiety without action.
- Reassurance alerts. “Alert if too quiet”; define meaning or remove.
- Same signal both places. Pick one based on action; or differentiate audiences and ownership.
- Per-anti-pattern lint. CI catches the common cases; the discipline lives in the linter.
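A lint for the statically checkable cases might look like the sketch below. It is illustrative only: the `SignalPlacement` record and its fields are hypothetical stand-ins for whatever format the alert and dashboard definitions are stored in. It flags alerts with no runbook, placements with no owner, and signals that appear as both alert and dashboard for a single audience. Red panels that nobody investigates are a behavioral problem and are not covered here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class SignalPlacement:
    # Hypothetical record describing where a signal is surfaced.
    signal: str
    kind: str                      # "alert" or "dashboard"
    audience: str
    owner: Optional[str] = None
    runbook: Optional[str] = None  # only meaningful for alerts


def lint(placements: List[SignalPlacement]) -> List[str]:
    """Return human-readable findings for common anti-patterns."""
    findings: List[str] = []
    by_signal: Dict[str, List[SignalPlacement]] = {}

    for p in placements:
        by_signal.setdefault(p.signal, []).append(p)
        if p.owner is None:
            findings.append(f"{p.signal}: {p.kind} has no owner")
        if p.kind == "alert" and not p.runbook:
            findings.append(f"{p.signal}: alert has no runbook")

    for signal, group in by_signal.items():
        kinds = {p.kind for p in group}
        audiences = {p.audience for p in group}
        # Same signal in both places is acceptable only with distinct audiences.
        if kinds == {"alert", "dashboard"} and len(audiences) < 2:
            findings.append(f"{signal}: alert and dashboard share one audience")

    return findings


example = [
    SignalPlacement("checkout_errors", "alert", "on-call",
                    owner="payments", runbook="rb-1"),
    SignalPlacement("checkout_errors", "dashboard", "on-call",
                    owner="payments"),
    SignalPlacement("queue_too_quiet", "alert", "on-call", owner="ingest"),
]
```

Running `lint(example)` in CI and failing the build on a non-empty result is one way to enforce the rules; the reassurance alert `queue_too_quiet` is caught here because nobody wrote down what to do when it fires.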