Alerts Depending on Other Incidents

Some alerts should stay silent while an upstream incident is already open.

The cascade problem

When the primary database goes down, a hundred services alert at once. Most are downstream symptoms; the human only needs the root cause. Without dependency-aware suppression, the on-call drowns in pages that all describe the same incident, and MTTA stays elevated even when the root cause is identified within a minute. PagerDuty, Opsgenie, and Nova AI Ops all support dependency rules; few teams configure them properly.

Cascade pattern. Primary DB down; 100 services alert; 99 downstream symptoms.
Drown in pages. Without suppression, on-call sees identical-cause pages from many services.
Slow MTTA recovery. Even when the root cause is identified in a minute, the on-call is still digging through pages.
Tools support, teams underuse. PagerDuty, Opsgenie, Nova AI Ops; dependency rules are underconfigured.

Building the dependency graph

The graph starts from the service catalog and stays accurate via traces. Backstage, OpsLevel, or homegrown YAML all work. Map each service to its critical upstream dependencies, then use OpenTelemetry traces to validate the graph: manual catalogs drift, while trace-derived graphs stay accurate within hours. Update the graph in CI so that new service deploys fail if dependencies are not declared.

Service catalog as base. Backstage, OpsLevel, or homegrown YAML; pick one.
Map per-service upstreams. Critical upstream dependencies per service.
OTel traces validate. Manual catalogs drift; trace-derived graphs stay accurate within hours.
CI enforcement. New deploys fail if dependencies are not declared, which keeps the catalog fresh.
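
The catalog-versus-traces check can be sketched as a small CI step. This is a minimal illustration, not any tool's real API: the function names, the service names, and the assumption that trace data has already been reduced to (caller, callee) pairs are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (caller, callee), assumed to be extracted from trace spans


def undeclared_dependencies(
    declared: Dict[str, Set[str]], observed_edges: Set[Edge]
) -> List[Edge]:
    """Return edges seen in traces that the catalog does not declare."""
    missing = []
    for caller, callee in sorted(observed_edges):
        if callee not in declared.get(caller, set()):
            missing.append((caller, callee))
    return missing


def ci_exit_code(missing: List[Edge]) -> int:
    """Non-zero exit code fails the deploy when any dependency is undeclared."""
    for caller, callee in missing:
        print(f"undeclared dependency: {caller} -> {callee}")
    return 1 if missing else 0


# Hypothetical catalog and trace-derived edges.
declared = {"checkout": {"payments-db"}, "search": set()}
observed = {
    ("checkout", "payments-db"),
    ("checkout", "inventory"),
    ("search", "catalog-db"),
}
missing = undeclared_dependencies(declared, observed)
status = ci_exit_code(missing)
```

In a real pipeline the CI job would pass `status` to `sys.exit`, so the deploy fails until the two missing edges are added to the catalog.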

Suppression rules

Suppression rules are conservative by design. If service A is in an incident state and service B depends on A, suppress B’s alerts for the duration of A’s incident plus a 5-minute cooldown. Always log the suppression so the on-call can query “what was suppressed during incident X?” for the postmortem. Never suppress security alerts or data-loss alerts, because the cost of a missed signal outweighs the noise reduction.

Suppress B if A is in incident. Plus 5-minute cooldown after A resolves.
Always log suppression. Postmortem queries “what was suppressed during X?”.
Never suppress security or data-loss. Missed signal cost outweighs noise reduction.
Per-rule documented exception. The security and data-loss exception lives in the rule config, so the exclusion is explicit.
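
The rules above can be sketched as a single decision function. This is an illustrative sketch under stated assumptions: the `Incident` and `Alert` types, the category labels, and the in-memory audit log are hypothetical stand-ins for whatever the alerting tool provides.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Set

COOLDOWN = timedelta(minutes=5)             # cooldown after the upstream resolves
NEVER_SUPPRESS = {"security", "data-loss"}  # assumed category labels


@dataclass
class Incident:
    service: str
    started: datetime
    resolved: Optional[datetime] = None  # None while the incident is open


@dataclass
class Alert:
    service: str
    category: str
    fired_at: datetime


def should_suppress(
    alert: Alert,
    upstreams: Dict[str, Set[str]],
    incidents: List[Incident],
    audit_log: List[dict],
) -> bool:
    """Suppress only if an upstream of the alerting service is in incident."""
    if alert.category in NEVER_SUPPRESS:
        return False
    for inc in incidents:
        if inc.service not in upstreams.get(alert.service, set()):
            continue
        window_end = (inc.resolved + COOLDOWN) if inc.resolved else None
        in_window = alert.fired_at >= inc.started and (
            window_end is None or alert.fired_at <= window_end
        )
        if in_window:
            # Log every suppression so the postmortem can query it later.
            audit_log.append(
                {
                    "suppressed": alert.service,
                    "because_of": inc.service,
                    "at": alert.fired_at,
                }
            )
            return True
    return False
```

Checking the never-suppress categories first means no dependency rule can override them, whatever the graph says.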

When suppression backfires

Three failure modes deserve mitigation. First, stale dependency graphs suppress real alerts, so always include a kill switch that disables suppression during a major incident. Second, bidirectional dependencies (rare but real) confuse simple rule engines; map them explicitly or use a graph-aware engine. Third, cross-team dependencies need cross-team postmortems: suppression that hides another team’s incident from them is worse than no suppression.

Stale graphs suppress real alerts. Kill switch to disable suppression during a major incident.
Bidirectional dependencies. Rare but real; explicit map or graph-aware engine.
Cross-team incident hiding. Worse than no suppression; cross-team postmortems required.
Per-failure mitigation. Each failure mode has a documented response, so suppression can stay enabled safely.
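
Two of these mitigations are easy to sketch: a kill switch that overrides every rule, and a scan that flags bidirectional dependencies for explicit mapping. The function names and the example graph are hypothetical, and the scan only finds direct two-way pairs, not longer cycles.

```python
from typing import Dict, List, Set, Tuple


def effective_suppress(rule_says_suppress: bool, kill_switch_on: bool) -> bool:
    """The kill switch wins: when it is on, nothing is suppressed."""
    return rule_says_suppress and not kill_switch_on


def bidirectional_pairs(upstreams: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Find service pairs that each list the other as an upstream."""
    pairs = set()
    for svc, deps in upstreams.items():
        for dep in deps:
            if dep != svc and svc in upstreams.get(dep, set()):
                pairs.add(tuple(sorted((svc, dep))))
    return sorted(pairs)


# Hypothetical graph: auth and users depend on each other.
graph = {"auth": {"users"}, "users": {"auth", "db"}, "db": set()}
flagged = bidirectional_pairs(graph)  # pairs needing an explicit rule
```

Flagged pairs are candidates for an explicit rule, since a naive engine could otherwise have each service suppress the other.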

Get started

Start small and iterate. Pick the top 5 services by page volume and map each to its 3 most critical upstreams. Configure dependency suppression in PagerDuty event orchestration or your alerting tool. Run for one month, then review every suppressed alert in a postmortem and adjust until the false-suppression rate is under 1%.

Top 5 services first. By page volume; the highest-leverage start.
3 critical upstreams each. Bounded mapping; not the full graph.
One-month review. Every suppressed alert reviewed in postmortem.
1% false-suppression target. Tune until the rate is under the threshold.
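
The starting numbers can be computed with a few lines, assuming page history is available as a list of service names and each reviewed suppression has been marked correct or not. Both input shapes, and the function names, are assumptions for illustration.

```python
from collections import Counter
from typing import List

TARGET_RATE = 0.01  # the 1% false-suppression target


def top_services_by_pages(pages: List[str], n: int = 5) -> List[str]:
    """Rank services by how many pages they generated; keep the top n."""
    return [service for service, _ in Counter(pages).most_common(n)]


def false_suppression_rate(was_false: List[bool]) -> float:
    """Fraction of reviewed suppressions judged to have hidden a real alert."""
    if not was_false:
        return 0.0
    return sum(was_false) / len(was_false)


def meets_target(was_false: List[bool]) -> bool:
    return false_suppression_rate(was_false) < TARGET_RATE


# Hypothetical month: 200 suppressions reviewed, 1 judged wrong.
review = [True] + [False] * 199
rate = false_suppression_rate(review)
ok = meets_target(review)
```

A review set this small gives a noisy estimate, so it may be worth tracking the rate over several months before relaxing the rules.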