Alerts Depending on Other Incidents

Some alerts should not page while an upstream incident that already explains them is open.

The cascade problem

When the primary database goes down, a hundred services alert at once. Most are downstream symptoms; the on-call engineer only needs the root cause. Without dependency-aware suppression, the on-call drowns in pages that all describe the same incident, and mean time to acknowledge (MTTA) stays high even when the root cause is identified within a minute. PagerDuty, Opsgenie, and Nova AI Ops support dependency rules; few teams configure them properly.

Building the dependency graph

The graph starts from the service catalog and stays accurate via traces. Backstage, OpsLevel, or homegrown YAML all work as the catalog. Map each service to its critical upstream dependencies. Use OpenTelemetry traces to validate the graph: manual catalogs drift, while trace-derived graphs stay accurate to within hours. Check the graph in CI so that a new service deploy fails if its dependencies are not declared.
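The CI check described above can be sketched as follows. This is a minimal illustration, not any catalog tool's API: the function name `check_catalog`, the dictionary shape of the declared dependencies, and the `(caller, callee)` edge format for trace-derived data are all assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple


def check_catalog(
    declared: Dict[str, Set[str]],
    deployed: Set[str],
    observed_edges: Set[Tuple[str, str]],
) -> List[str]:
    """Return a list of problems; an empty list means the check passes.

    declared:       service -> upstreams listed in the catalog (hypothetical shape)
    deployed:       services being deployed
    observed_edges: (caller, callee) pairs derived from traces (hypothetical shape)
    """
    errors: List[str] = []

    # A deployed service must have a catalog entry, even if it is an empty set.
    for service in sorted(deployed):
        if service not in declared:
            errors.append(f"{service}: no dependencies declared")

    # Every call seen in traces should match a declared upstream.
    for caller, callee in sorted(observed_edges):
        if callee not in declared.get(caller, set()):
            errors.append(f"{caller}: undeclared dependency on {callee}")

    return errors


if __name__ == "__main__":
    catalog = {"checkout": {"payments-db"}, "payments-db": set()}
    problems = check_catalog(
        catalog,
        deployed={"checkout", "payments-db", "search"},
        observed_edges={("checkout", "payments-db"), ("checkout", "auth")},
    )
    for problem in problems:
        print(problem)
```

A CI job could fail the build whenever the returned list is non-empty. Treating a trace-observed but undeclared edge as an error is one possible policy; a team might prefer a warning for edges that are not critical.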

Suppression rules

Suppression rules are conservative by design. If service A is in incident state and service B depends on A, suppress B’s alerts for the duration of A’s incident plus a 5-minute cooldown. Always log the suppression so the on-call can query “what was suppressed during incident X?” for the postmortem. Never suppress security alerts or data-loss alerts, because the cost of a missed signal outweighs the noise reduction.
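The rule above can be sketched in a few lines. This is an illustration under stated assumptions rather than how any particular alerting product implements it: the class names `Incident` and `Suppressor`, the category strings, and the in-memory log are hypothetical, and only direct upstreams are considered.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Set

COOLDOWN = timedelta(minutes=5)            # cooldown after the upstream incident resolves
NEVER_SUPPRESS = {"security", "data-loss"}  # hypothetical category labels


@dataclass
class Incident:
    service: str
    incident_id: str
    started: datetime
    resolved: Optional[datetime] = None  # None while the incident is still open

    def covers(self, at: datetime) -> bool:
        """True while the incident is open or still within the cooldown."""
        if at < self.started:
            return False
        return self.resolved is None or at <= self.resolved + COOLDOWN


@dataclass
class Suppressor:
    upstreams: Dict[str, Set[str]]  # service -> direct upstream dependencies
    incidents: List[Incident] = field(default_factory=list)
    log: List[dict] = field(default_factory=list)

    def should_suppress(self, service: str, category: str, at: datetime) -> bool:
        # Security and data-loss alerts always page.
        if category in NEVER_SUPPRESS:
            return False
        for incident in self.incidents:
            depends_on_it = incident.service in self.upstreams.get(service, set())
            if depends_on_it and incident.covers(at):
                # Record every suppression so it can be reviewed in the postmortem.
                self.log.append({
                    "incident_id": incident.incident_id,
                    "service": service,
                    "category": category,
                    "at": at,
                })
                return True
        return False

    def suppressed_during(self, incident_id: str) -> List[dict]:
        """Answer 'what was suppressed during incident X?'."""
        return [entry for entry in self.log if entry["incident_id"] == incident_id]
```

In a real system the log would go to durable storage instead of a list, so that the postmortem query still works after a restart.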

When suppression backfires

Three failure modes deserve mitigation. First, a stale dependency graph suppresses real alerts, so always include a kill switch that disables suppression during a major incident. Second, bidirectional dependencies (rare but real) confuse simple rule engines; map them explicitly or use a graph-aware engine. Third, cross-team dependencies need cross-team postmortems: suppression that hides another team’s incident from them is worse than no suppression.
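Two of these mitigations lend themselves to a short sketch: a kill switch and a scan for bidirectional dependencies. The environment variable name `ALERT_SUPPRESSION_DISABLED` and both function names are hypothetical, chosen only for this example.

```python
import os
from typing import Dict, List, Mapping, Optional, Set, Tuple

# Hypothetical variable name; any flag store the team already uses would do.
KILL_SWITCH_ENV = "ALERT_SUPPRESSION_DISABLED"


def suppression_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return False when the kill switch is set, so every alert pages again."""
    env = os.environ if env is None else env
    return env.get(KILL_SWITCH_ENV, "").strip().lower() not in {"1", "true", "yes"}


def bidirectional_pairs(upstreams: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Find pairs of services that each list the other as an upstream."""
    pairs: Set[Tuple[str, str]] = set()
    for service, deps in upstreams.items():
        for dep in deps:
            if dep != service and service in upstreams.get(dep, set()):
                # Sort the pair so (a, b) and (b, a) collapse into one entry.
                first, second = sorted((service, dep))
                pairs.add((first, second))
    return sorted(pairs)


if __name__ == "__main__":
    graph = {"a": {"b"}, "b": {"a", "c"}, "c": set()}
    print(bidirectional_pairs(graph))
    print(suppression_enabled({KILL_SWITCH_ENV: "true"}))
```

Pairs flagged by the scan are candidates for explicit handling, since a naive rule could otherwise let each service suppress the other. The scan covers only two-service loops; longer cycles would need a full cycle detection pass over the graph.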

Get started

Start small and iterate. Pick the top 5 services by page volume and map each to its 3 most critical upstreams. Configure dependency suppression in PagerDuty Event Orchestration or your alerting tool. Run for one month, then review every suppressed alert in a postmortem and adjust until the false suppression rate is under 1%.
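The monthly review reduces to simple arithmetic. The sketch below assumes the false suppression rate means the share of suppressed alerts that reviewers decide should have paged; that definition and the function names are assumptions for illustration.

```python
from typing import Sequence

TARGET_RATE = 0.01  # the 1% threshold from the text


def false_suppression_rate(should_have_paged: Sequence[bool]) -> float:
    """Share of reviewed suppressed alerts that should have paged.

    Each element is one reviewed suppressed alert: True if the review
    concluded that suppressing it was a mistake.
    """
    if not should_have_paged:
        return 0.0  # nothing was suppressed, so nothing was falsely suppressed
    return sum(should_have_paged) / len(should_have_paged)


def meets_target(should_have_paged: Sequence[bool]) -> bool:
    """True when the rate is strictly under the 1% target."""
    return false_suppression_rate(should_have_paged) < TARGET_RATE


if __name__ == "__main__":
    # Example month: 200 suppressed alerts reviewed, 1 judged a false suppression.
    reviews = [False] * 199 + [True]
    print(false_suppression_rate(reviews), meets_target(reviews))
```

With small samples the rate is noisy: a single false suppression out of 50 reviewed alerts is already 2%, so the first month may say more about sample size than rule quality.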