Pre-Prod Alert Noise
Pre-prod alerts shouldn't page production on-call.
Where pre-prod noise comes from
Pre-prod noise has predictable sources. Staging clusters reuse production alert configs, so they fire on every test, chaos run, and flaky deploy. Pre-prod has fewer humans watching it, so the page rate per engineer is often higher than in production. And pre-prod alerts are often misrouted to the production rotation, paging the on-call for a staging issue at 2am.
- Reused production configs. Staging fires on every test, chaos run, and flaky deploy; the configs were never tuned for staging.
- Fewer humans. The per-engineer page rate is often higher than in production; the noise burden is concentrated.
- Misrouted to production rotation. Staging issues page the on-call at 2am because the routing was never updated.
- Per-environment alert configs. The fix is environment-aware configs; staging is not production.
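One way to make alert configs environment-aware is to keep a shared base rule and layer per-environment overrides on top. The sketch below is illustrative only: the rule name, thresholds, and field names are invented for the example and do not correspond to any particular alerting tool.

```python
# Hypothetical base rule, tuned for production. All values are illustrative.
BASE_RULE = {
    "name": "HighErrorRate",
    "error_rate_threshold": 0.01,  # fire above 1% errors
    "for_minutes": 5,              # condition must hold this long before firing
    "severity": 2,
}

# Hypothetical per-environment overrides: staging tolerates more errors for longer.
OVERRIDES = {
    "staging": {"error_rate_threshold": 0.05, "for_minutes": 15, "severity": 3},
}


def rule_for(environment: str) -> dict:
    """Return a copy of the base rule with the environment's overrides applied."""
    rule = {**BASE_RULE, **OVERRIDES.get(environment, {})}
    rule["environment"] = environment  # tag the rule so routing can use it
    return rule
```

Environments without an entry in `OVERRIDES` fall back to the base rule, so production behavior stays unchanged while staging gets looser settings.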
Separate paging for pre-prod
Pre-prod paging needs separation from production. Pre-prod alerts go to a dedicated channel rather than the production on-call rotation. Sev2 and below are Slack-only. A pre-prod sev1 still pages, but to the team’s business-hours rotation, not the 24/7 on-call. Every alert is tagged with its environment so routing rules can act on it.
- Dedicated channel. Pre-prod alerts go to a dedicated channel, not the production on-call rotation.
- Slack-only for sev2 and below. Pre-prod degradation is not a 2am page; the channel suffices.
- Pre-prod sev1. Still pages but to the team’s business-hours rotation, not the 24/7 on-call.
- Environment label everywhere. Every alert tagged; routing rules use the tag.
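The routing rules above can be sketched as a small function keyed on the environment and severity labels. The destination names and the `Alert` fields are hypothetical placeholders, and production routing is reduced to a single value because it is meant to stay as it is today.

```python
from dataclasses import dataclass

# Hypothetical destination names; substitute whatever your paging tool uses.
PRODUCTION_ROUTING = "production-routing-unchanged"
TEAM_BUSINESS_HOURS = "team-business-hours-rotation"
PREPROD_CHANNEL = "slack:#preprod-alerts"


@dataclass(frozen=True)
class Alert:
    name: str
    environment: str  # e.g. "production" or "staging"
    severity: int     # 1 = sev1 (most severe), 2 = sev2, and so on


def route(alert: Alert) -> str:
    """Pick a destination from the alert's environment and severity labels."""
    if not alert.environment:
        # An untagged alert cannot be routed safely, so fail loudly.
        raise ValueError(f"alert {alert.name!r} has no environment label")
    if alert.environment == "production":
        return PRODUCTION_ROUTING
    if alert.severity == 1:
        # Pre-prod sev1 still pages, but only the business-hours rotation.
        return TEAM_BUSINESS_HOURS
    # Pre-prod sev2 and below are channel-only.
    return PREPROD_CHANNEL
```

Rejecting untagged alerts is a design choice for this sketch: it surfaces missing labels early instead of letting them fall through to the production rotation.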
Mute during known events
Known events should mute pre-prod alerts. CI runs, chaos drills, and performance tests should mute alerts on the affected services for their duration; a maintenance-mode API gives CI a hook to start a mute before a destructive test and end it afterward. Without muting, the team learns to ignore alerts, and that habit carries over to production.
- Mute during known events. CI runs, chaos drills, and performance tests mute alerts on affected services for the duration.
- Maintenance-mode API. CI starts maintenance mode before a destructive test and ends it after; the muting is automatic.
- Habit transfer risk. Without muting, the team learns to ignore alerts; that habit carries to production.
- Per-test mute scope. Mute the affected service, not all alerts; preserve unrelated signal.
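A CI muting hook along these lines could be written as a context manager, so the mute always ends even when the test fails. This is a minimal sketch: `InMemoryMuteClient` is a hypothetical stand-in for a maintenance-mode API, and its `start` and `end` methods are assumed names, not a real service's interface.

```python
from contextlib import contextmanager


class InMemoryMuteClient:
    """Hypothetical stand-in for a maintenance-mode API (a real one would call a service)."""

    def __init__(self):
        self.muted = set()

    def start(self, service: str, reason: str) -> None:
        self.muted.add(service)

    def end(self, service: str) -> None:
        self.muted.discard(service)


@contextmanager
def muted(client, services, reason):
    """Mute only the listed services and always unmute them afterwards."""
    started = []
    try:
        for service in services:
            client.start(service, reason)
            started.append(service)
        yield
    finally:
        # Runs on success and on failure, so the noise window stays bounded.
        for service in started:
            client.end(service)


def run_destructive_test(client, services, test):
    """Run a test callable with alerts muted for the affected services only."""
    with muted(client, services, reason="destructive CI test"):
        return test()
```

Passing an explicit list of services keeps the mute scoped to what the test touches, which preserves alerts from unrelated services.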
Pre-prod gets a noise budget too
Pre-prod alert volume deserves a budget. 10-20% of production volume is the target; higher means configs are over-noisy or staging itself is broken. Pre-prod page count above production count is a red flag worth investigating the same week. Review pre-prod alerts on the same quarterly cadence as production.
- 10-20% of production. Target volume; higher means over-noisy configs or broken staging.
- Pre-prod above production. Red flag; investigate same week.
- Quarterly review cadence. Pre-prod alerts are reviewed on the same cycle as production, so both environments get the same discipline.
- Per-environment alert KPIs. Volume, false-positive rate, signal quality tracked per environment.
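The budget checks above reduce to two comparisons, sketched here under the assumption that alert and page counts per environment are already available for the same time window. The function name and finding strings are hypothetical; the 20% default is the upper end of the 10-20% target.

```python
def check_noise_budget(
    preprod_alerts: int,
    prod_alerts: int,
    preprod_pages: int,
    prod_pages: int,
    max_ratio: float = 0.20,
) -> list:
    """Return budget findings for one review window; an empty list means within budget."""
    findings = []
    # Alert volume above the target share of production volume:
    # configs are over-noisy or staging itself is broken.
    if preprod_alerts > max_ratio * prod_alerts:
        findings.append("over-budget")
    # More pre-prod pages than production pages: investigate the same week.
    if preprod_pages > prod_pages:
        findings.append("pages-above-production")
    return findings
```

For example, 15 pre-prod alerts against 100 production alerts is within budget, while 30 against 100 is flagged as over budget.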
How to fix pre-prod noise
Fixing pre-prod noise is concrete work. Put an environment label on every alert and route on it so pre-prod doesn’t page production on-call. Add muting hooks in CI for chaos and load tests. Remove or downgrade alerts in pre-prod, because production alerts should not run in staging without modification.
- Environment label on every alert. Routing adjusts so pre-prod doesn’t page production on-call.
- Muting hooks in CI. Chaos and load tests trigger automatic muting; the noise window is bounded.
- Remove or downgrade in pre-prod. Production alerts shouldn’t run in staging without modification.
- Per-fix verification. Measure the volume drop after each change to confirm the fix worked.
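Per-fix verification can be as simple as comparing daily alert counts from before and after a change. The sketch below assumes the daily counts are already exported as lists; the function name is hypothetical.

```python
from statistics import mean
from typing import Sequence


def relative_drop(before: Sequence[int], after: Sequence[int]) -> float:
    """Fractional drop in mean daily alert count; a negative value means volume rose."""
    baseline = mean(before)
    if baseline == 0:
        raise ValueError("no alerts in the baseline window; nothing to compare")
    return (baseline - mean(after)) / baseline


# Illustrative numbers only: daily pre-prod alert counts around one config change.
drop = relative_drop(before=[40, 50, 60], after=[20, 25, 30])
print(f"alert volume changed by {-drop:+.0%}")
```

Using comparable windows on both sides (for example, the same weekdays) keeps a quiet weekend from being mistaken for a successful fix.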