Alerts Depending on Other Incidents
Some alerts shouldn't fire during specific incidents.
The cascade problem
When the primary database goes down, a hundred services alert at once. Most of those alerts are downstream symptoms; the human only needs the root cause. Without dependency-aware suppression, the on-call drowns in pages that all describe the same incident, and MTTA (mean time to acknowledge) is slow to recover even when the root cause is identified within a minute. PagerDuty, Opsgenie, and Nova AI Ops support dependency rules; few teams configure them properly.
- Cascade pattern. The primary DB goes down; 100 services alert; 99 of them are downstream symptoms.
- Drowning in pages. Without suppression, the on-call sees same-cause pages from many services.
- Slow MTTA recovery. Even when the root cause is identified in a minute, the on-call is still digging through pages.
- Tools support it, teams underuse it. PagerDuty, Opsgenie, and Nova AI Ops offer dependency rules; most teams underconfigure them.
Building the dependency graph
The graph starts from the service catalog and stays accurate via traces. Backstage, OpsLevel, or homegrown YAML all work as the catalog. Map each service to its critical upstream dependencies, then use OpenTelemetry traces to validate the graph: manual catalogs drift, while trace-derived graphs stay accurate within hours. Enforce the graph in CI so that new service deploys fail if their dependencies are not declared.
- Service catalog as base. Backstage, OpsLevel, or homegrown YAML; pick one.
- Map per-service upstreams. Critical upstream dependencies per service.
- OTel traces validate. Manual catalogs drift; trace-derived graphs stay accurate within hours.
- CI enforcement. New deploys fail if dependencies not declared; the only way to stay fresh.
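The validation step above can be sketched as a set comparison between declared and observed edges. This is a minimal illustration, not a Backstage or OpenTelemetry API: the `declared` mapping, the `observed` edge list, and the function name `undeclared_edges` are all hypothetical, and extracting caller/callee pairs from real trace data is assumed to happen elsewhere.

```python
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]  # (service, upstream it calls)


def undeclared_edges(
    declared: Dict[str, Set[str]], observed: Iterable[Edge]
) -> Set[Edge]:
    """Return edges seen in traces that the catalog does not declare.

    Self-calls are ignored because they are not upstream dependencies.
    """
    return {
        (svc, up)
        for svc, up in observed
        if svc != up and up not in declared.get(svc, set())
    }


# Hypothetical catalog contents and trace-derived edges.
declared = {"checkout": {"payments-db"}, "search": set()}
observed = [
    ("checkout", "payments-db"),  # declared, fine
    ("checkout", "inventory"),    # seen in traces, missing from catalog
    ("search", "search"),         # self-call, ignored
]

missing = undeclared_edges(declared, observed)
if missing:
    # In CI this would be a non-zero exit; here we only report.
    print("undeclared dependencies:", sorted(missing))
```

A CI job could run a check like this on every deploy and fail the build when `missing` is non-empty.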
Suppression rules
Suppression rules are conservative by design. If service A is in an incident state and service B depends on A, suppress B’s alerts for the duration of A’s incident plus a 5-minute cooldown. Always log the suppression so the on-call can query “what was suppressed during incident X?” for the postmortem. Never suppress security alerts or data-loss alerts, because the cost of a missed signal outweighs the noise reduction.
- Suppress B if A is in incident. Plus 5-minute cooldown after A resolves.
- Always log suppression. Postmortem queries “what was suppressed during X?”.
- Never suppress security or data-loss. Missed signal cost outweighs noise reduction.
- Per-rule documented exception. The security and data-loss exception lives in the rule config, so the exclusion is explicit in every rule.
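The rules above can be sketched as a single decision function. This is an illustrative sketch under stated assumptions, not the rule syntax of PagerDuty or any other tool: the `Incident` type, the category names in `NEVER_SUPPRESS`, and the function `should_suppress` are hypothetical, and only direct upstreams are checked.

```python
import logging
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional, Set

log = logging.getLogger("suppression")

COOLDOWN = timedelta(minutes=5)            # applied after the upstream resolves
NEVER_SUPPRESS = {"security", "data-loss"}  # assumed category labels


@dataclass
class Incident:
    service: str
    started: datetime
    resolved: Optional[datetime] = None  # None while the incident is open


def in_incident_window(inc: Incident, now: datetime) -> bool:
    """True while the incident is open or within the cooldown after it."""
    if now < inc.started:
        return False
    return inc.resolved is None or now <= inc.resolved + COOLDOWN


def should_suppress(
    service: str,
    category: str,
    now: datetime,
    upstreams: Dict[str, Set[str]],
    incidents: Dict[str, Incident],
) -> Optional[str]:
    """Return the upstream that justifies suppression, or None to page."""
    if category in NEVER_SUPPRESS:
        return None
    for up in sorted(upstreams.get(service, set())):
        inc = incidents.get(up)
        if inc is not None and in_incident_window(inc, now):
            # Log every suppression so it can be queried in the postmortem.
            log.info(
                "suppressed service=%s category=%s cause=%s at=%s",
                service, category, up, now.isoformat(),
            )
            return up
    return None
```

Returning the causing upstream rather than a bare boolean keeps the answer to “what was suppressed during incident X?” attached to each decision.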
When suppression backfires
Three failure modes deserve mitigation. First, stale dependency graphs suppress real alerts, so always include a kill switch to disable suppression during a major incident. Second, bidirectional dependencies (rare but real) confuse simple rule engines; map them explicitly or use a graph-aware engine. Third, cross-team dependencies need cross-team postmortems, because suppression that hides another team’s incident from them is worse than no suppression.
- Stale graphs suppress real alerts. Keep a kill switch to disable suppression during a major incident.
- Bidirectional dependencies. Rare but real; explicit map or graph-aware engine.
- Cross-team incident hiding. Worse than no suppression; cross-team postmortems required.
- Per-failure mitigation. Each failure mode has a documented response, so suppression can stay enabled safely.
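One way to map bidirectional dependencies explicitly is to flag them before any rule is generated. The sketch below is a hypothetical helper, not part of any alerting tool: it finds only direct two-way pairs (A depends on B and B depends on A) and does not detect longer cycles, which would need a graph-aware engine.

```python
from typing import Dict, Set, Tuple


def mutual_dependencies(upstreams: Dict[str, Set[str]]) -> Set[Tuple[str, str]]:
    """Return sorted (a, b) pairs where each service lists the other as upstream."""
    pairs: Set[Tuple[str, str]] = set()
    for svc, ups in upstreams.items():
        for up in ups:
            if svc != up and svc in upstreams.get(up, set()):
                # Sort so (a, b) and (b, a) collapse into one entry.
                a, b = sorted((svc, up))
                pairs.add((a, b))
    return pairs


# Hypothetical graph: auth and sessions depend on each other.
graph = {
    "auth": {"sessions", "users-db"},
    "sessions": {"auth"},
    "users-db": set(),
}
flagged = mutual_dependencies(graph)
print(sorted(flagged))
```

Pairs flagged this way can be excluded from automatic suppression until someone decides which direction, if either, should suppress the other.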
Get started
Start small and iterate. Pick the top 5 services by page volume and map each to its 3 most critical upstreams. Configure dependency suppression in PagerDuty Event Orchestration or your alerting tool. Run for one month, then review every suppressed alert in a postmortem and adjust until the false suppression rate is under 1%.
- Top 5 services first. By page volume; the highest-leverage start.
- 3 critical upstreams each. Bounded mapping; not the full graph.
- One-month review. Every suppressed alert reviewed in postmortem.
- 1% false suppression target. Tune until the false suppression rate is under the threshold.
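The two numbers in this plan are simple to compute. The following is a rough sketch with hypothetical inputs: `pages` is assumed to be a list of service names, one entry per page, and `reviews` a list of booleans from the monthly review, where `True` means the suppressed alert should have paged.

```python
from collections import Counter
from typing import List


def top_services(pages: List[str], n: int = 5) -> List[str]:
    """Return the n services with the most pages, highest volume first."""
    return [svc for svc, _count in Counter(pages).most_common(n)]


def false_suppression_rate(reviews: List[bool]) -> float:
    """Fraction of reviewed suppressions judged wrong (0.0 if none reviewed)."""
    if not reviews:
        return 0.0
    return sum(reviews) / len(reviews)


# Hypothetical month of data: 200 suppressed alerts, 1 judged wrong.
reviews = [True] + [False] * 199
rate = false_suppression_rate(reviews)
print(f"false suppression rate: {rate:.2%}, under target: {rate < 0.01}")
```

Tracking the rate per upstream rather than globally would show which mapping needs adjusting, at the cost of smaller sample sizes.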