Pre-Prod Alert Noise

Pre-prod alerts shouldn't page production on-call.

Where pre-prod noise comes from

Pre-prod noise has predictable sources. Staging clusters reuse production alert configs and fire on every test, chaos run, and flaky deploy. Pre-prod has fewer humans, so the page rate per engineer is often higher than in production. Pre-prod alerts are also often misrouted to the production rotation, paging the on-call for a staging issue at 2am.

Reused production configs. Staging fires on every test, chaos run, and flaky deploy; the configs were not tuned for staging.
Fewer humans. The per-engineer page rate is often higher than in production; the noise burden is concentrated.
Misrouted to production rotation. Staging issues page the on-call at 2am; the routing was never updated.
Per-environment alert configs. The fix is environment-aware configs; staging is not production.
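
One way to picture environment-aware configs is a shared base rule with per-environment overrides. The sketch below is illustrative only: the rule name, thresholds, durations, and the `rule_for` helper are all assumptions, not values from any particular alerting system.

```python
# Hypothetical base rule, tuned for production.
BASE_RULE = {
    "name": "HighErrorRate",
    "threshold": 0.01,   # fraction of failed requests (assumed value)
    "for_minutes": 5,    # how long the condition must hold (assumed value)
    "severity": 1,
}

# Hypothetical overrides: staging tolerates more errors for longer,
# and the same condition is downgraded from sev1 to sev2.
OVERRIDES = {
    "staging": {"threshold": 0.05, "for_minutes": 15, "severity": 2},
}


def rule_for(environment: str) -> dict:
    """Return the base rule with any overrides for this environment applied."""
    rule = dict(BASE_RULE)                       # copy so the base stays untouched
    rule.update(OVERRIDES.get(environment, {}))  # no entry means no change
    rule["environment"] = environment            # tag the rule for routing
    return rule
```

Keeping the overrides in one place makes it visible which production rules have actually been tuned for staging and which are still running unmodified.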

Separate paging for pre-prod

Pre-prod paging needs to be separated from production. Alerts go to a dedicated channel rather than the production on-call rotation, and sev2 and below are Slack-only. A pre-prod sev1 still pages, but it goes to the team’s business-hours rotation, not the 24/7 on-call. Every alert is tagged with its environment so routing rules can act on the tag.

Dedicated channel. Pre-prod alerts go to a dedicated channel, not the production on-call rotation.
Slack-only for sev2 and below. Pre-prod degradation is not a 2am page; the channel suffices.
Pre-prod sev1. Still pages but to the team’s business-hours rotation, not the 24/7 on-call.
Environment label everywhere. Every alert tagged; routing rules use the tag.
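
The routing rules above can be sketched as a small function keyed on the environment tag. This is a minimal illustration, not a real alerting integration: the destination names, the set of pre-prod environment names, and the fallback for untagged alerts are assumptions, and production routing is simplified to a single destination.

```python
from dataclasses import dataclass

# Hypothetical environment names; adjust to whatever tags are actually in use.
PRE_PROD_ENVS = {"staging", "dev", "qa"}


@dataclass(frozen=True)
class Alert:
    name: str
    environment: str  # the environment tag carried by every alert
    severity: int     # 1 = sev1 (most severe), 2 = sev2, and so on


def route(alert: Alert) -> str:
    """Pick a destination from the alert's environment tag and severity."""
    if alert.environment == "production":
        return "prod-oncall-24x7"
    if alert.environment in PRE_PROD_ENVS:
        if alert.severity == 1:
            # Pre-prod sev1 still pages, but only the business-hours rotation.
            return "team-business-hours-rotation"
        # Sev2 and below never page; they go to the dedicated channel.
        return "slack:#preprod-alerts"
    # Assumption: an unknown or missing tag goes to a triage channel
    # instead of defaulting to the production rotation.
    return "slack:#alerts-untagged"
```

For example, `route(Alert("DiskFull", "staging", 2))` returns the Slack channel, while the same alert tagged `production` goes to the 24/7 rotation.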

Mute during known events

Known events should mute pre-prod alerts. CI runs, chaos drills, and performance tests should mute alerts on affected services for the duration; a maintenance-mode API gives CI a hook to start a mute before a destructive test and end it afterwards. Without muting, the team learns to ignore alerts, and that habit carries to production.

Mute during test runs. CI runs, chaos drills, and performance tests mute alerts on affected services for the duration.
Maintenance-mode API. CI starts maintenance mode before a destructive test and ends it after; the muting is automatic.
Habit transfer risk. Without muting, the team learns to ignore alerts; that habit carries to production.
Per-test mute scope. Mute the affected service, not all alerts; preserve unrelated signal.
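
A CI-side muting hook might look like the context manager below. It is a sketch under stated assumptions: `MuteRegistry` is an in-memory stand-in for a maintenance-mode API (a real one would be a network call), and the `max_seconds` expiry is an added safeguard so a crashed CI job cannot leave a service muted indefinitely.

```python
import time
from contextlib import contextmanager


class MuteRegistry:
    """In-memory stand-in for a maintenance-mode API (hypothetical)."""

    def __init__(self):
        self._muted = {}  # service name -> expiry timestamp in seconds

    def start(self, service, max_seconds, now=None):
        now = time.time() if now is None else now
        self._muted[service] = now + max_seconds

    def end(self, service):
        self._muted.pop(service, None)

    def is_muted(self, service, now=None):
        now = time.time() if now is None else now
        expiry = self._muted.get(service)
        return expiry is not None and now < expiry


@contextmanager
def maintenance_window(registry, service, max_seconds=1800):
    """Mute one service for the duration of a destructive test."""
    registry.start(service, max_seconds)
    try:
        yield
    finally:
        # Always unmute, even if the test raised.
        registry.end(service)
```

A chaos test would then run inside `with maintenance_window(registry, "checkout"):`. Only the named service is muted, so alerts from unrelated services still fire during the test.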

Pre-prod gets a noise budget too

Pre-prod alert volume deserves a budget. The target is 10-20% of production volume; more than that means the configs are too noisy or staging itself is broken. A pre-prod page count above the production count is a red flag worth investigating the same week. Review pre-prod alerts on the same quarterly cadence as production.

10-20% of production. Target volume; higher means over-noisy configs or broken staging.
Pre-prod above production. Red flag; investigate same week.
Quarterly review cadence. Pre-prod alerts are reviewed on the same cycle as production, so both environments get the same discipline.
Per-environment alert KPIs. Volume, false-positive rate, signal quality tracked per environment.
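
The budget check is simple arithmetic and could be automated along these lines. The function name, the status strings, and the choice of the upper bound (20%) as the default threshold are assumptions for illustration; the counts would come from whatever alert history is available.

```python
def noise_budget_status(preprod_alerts: int, prod_alerts: int,
                        preprod_pages: int, prod_pages: int,
                        budget_ratio: float = 0.20) -> str:
    """Classify pre-prod noise against production over the same period."""
    # More pre-prod pages than production pages is the red flag
    # to investigate the same week.
    if preprod_pages > prod_pages:
        return "red-flag"
    # With no production alerts there is no baseline, so any
    # pre-prod alert counts as over budget.
    if prod_alerts == 0:
        return "over-budget" if preprod_alerts > 0 else "within-budget"
    # Pre-prod volume should stay at or under the budgeted share.
    if preprod_alerts / prod_alerts > budget_ratio:
        return "over-budget"
    return "within-budget"
```

For instance, 15 pre-prod alerts against 100 production alerts is a 15% ratio and stays within budget, while 30 against 100 exceeds it.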

How to fix pre-prod noise

Fixing pre-prod noise is concrete work: put an environment label on every alert and route on it so pre-prod doesn’t page production on-call; add muting hooks to CI for chaos and load tests; and remove or downgrade pre-prod-only alerts, because production alerts should not run in staging without modification.

Environment label on every alert. Routing adjusts so pre-prod doesn’t page production on-call.
Muting hooks in CI. Chaos and load tests trigger automatic muting; the noise window is bounded.
Remove or downgrade pre-prod-only alerts. Production alerts shouldn’t run in staging without modification.
Per-fix verification. Measure the volume drop after each change to confirm the fix worked.
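
Per-fix verification can be as small as comparing alert counts over equal windows before and after a change. In this sketch the helper names and the 25% minimum drop are assumptions; pick a threshold that suits the change being verified.

```python
def volume_drop(before: int, after: int) -> float:
    """Fractional drop in alert volume; positive means fewer alerts after."""
    if before <= 0:
        raise ValueError("need a non-zero baseline to measure a drop")
    return (before - after) / before


def fix_worked(before: int, after: int, min_drop: float = 0.25) -> bool:
    """True if volume fell by at least min_drop (hypothetical threshold)."""
    return volume_drop(before, after) >= min_drop
```

Comparing windows of equal length (say, the week before and the week after) keeps the two counts comparable.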