On-Call Reachability Testing
Verify that the on-call engineer actually receives pages. Test the paging chain.
Test
On-call reachability testing catches paging-chain failures before a real incident does. Send synthetic pages across every channel each quarter, and run a reachability test for each new hire on day one. The team should never discover a broken chain for the first time during a real incident.
- Quarterly test page. Synthetic page per rotation. Verify the on-call actually receives it.
- Catches DND and app issues. Phone Do-Not-Disturb, app permissions, network problems per engineer. All real failure modes.
- All channels covered. Phone, app push, SMS, voice per rotation. Each channel can fail independently.
- New-hire reachability test. Day-one test per onboard. Catches misconfigured new accounts before the first shift.
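The test plan above can be sketched as a small fan-out: one synthetic page per rotation per channel, with delivery recorded for each. This is a minimal sketch, not a vendor integration. The names `SyntheticPage`, `plan_test_pages`, `run_test`, and the `send` callable are hypothetical, and the channel labels simply mirror the list above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Channel labels taken from the list above; adjust to what the paging vendor supports.
CHANNELS: Tuple[str, ...] = ("phone", "push", "sms", "voice")


@dataclass(frozen=True)
class SyntheticPage:
    """One test page aimed at one engineer over one channel (hypothetical type)."""
    rotation: str
    engineer: str
    channel: str


def plan_test_pages(
    rotations: Dict[str, str], channels: Tuple[str, ...] = CHANNELS
) -> List[SyntheticPage]:
    """Build one synthetic page per rotation per channel.

    `rotations` maps a rotation name to the engineer currently on call.
    """
    return [
        SyntheticPage(rotation, engineer, channel)
        for rotation, engineer in sorted(rotations.items())
        for channel in channels
    ]


def run_test(
    pages: List[SyntheticPage], send: Callable[[SyntheticPage], bool]
) -> Dict[SyntheticPage, bool]:
    """Send every page and record whether the engineer confirmed receipt.

    `send` is an assumed hook: it should trigger the page and return True
    only if the engineer confirmed it arrived.
    """
    return {page: send(page) for page in pages}
```

Because each channel can fail independently, the result is keyed by page rather than by engineer, so a missed SMS does not hide behind a successful push notification. The same plan works for a day-one test by passing a single-entry `rotations` mapping for the new hire.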
Review
Reviewing the test results closes the loop. Latency, missed channels, and engineer-side issues all surface here. Without review, the test is just performative.
- Arrival time. Arrival versus target latency per test. Did it arrive within the SLO?
- Latency over 30 seconds. Investigation trigger per test. 30s+ delay points to vendor or carrier issues.
- Per-channel success rate. Per-quarter delivery rate per channel. Drops surface vendor degradation.
- Per-engineer follow-up. “What blocked the page” debrief per failure. Supports targeted fixes.
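The review checks above reduce to two small calculations: flag any page that arrived late or not at all, and compute a delivery rate per channel. The sketch below assumes results are recorded as a latency in seconds, with `None` meaning the page never arrived. `PageResult`, `needs_investigation`, and `success_rate_by_channel` are illustrative names, and treating a missed page as an investigation trigger is an assumption layered on the 30-second rule.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional

# Investigation threshold from the text: latency over 30 seconds.
INVESTIGATE_AFTER_SECONDS = 30.0


@dataclass(frozen=True)
class PageResult:
    """Outcome of one synthetic page (hypothetical record shape)."""
    engineer: str
    channel: str
    latency_seconds: Optional[float]  # None means the page never arrived


def needs_investigation(
    result: PageResult, threshold: float = INVESTIGATE_AFTER_SECONDS
) -> bool:
    """True if the page was missed or arrived later than the threshold."""
    if result.latency_seconds is None:
        return True  # assumption: a missed page always warrants follow-up
    return result.latency_seconds > threshold


def success_rate_by_channel(results: Iterable[PageResult]) -> Dict[str, float]:
    """Fraction of pages delivered on each channel, regardless of latency."""
    sent: Dict[str, int] = defaultdict(int)
    delivered: Dict[str, int] = defaultdict(int)
    for result in results:
        sent[result.channel] += 1
        if result.latency_seconds is not None:
            delivered[result.channel] += 1
    return {channel: delivered[channel] / sent[channel] for channel in sent}
```

Comparing `success_rate_by_channel` across quarters is what surfaces vendor degradation; a single quarter's number says little on its own.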
Avoid
The recurring failure mode is skipping the test because nothing seems wrong. A real incident is too late to discover the first failure. Build the test into a recurring habit rather than an afterthought.
- Skipping the test. No-skip rule per rotation. A skipped test hides a broken chain until a real incident exposes it.
- Build the discipline. Recurring test cadence per org. Cron-driven, calendar-blocked, results posted.
- Published result. Visible success or failure per test. Accountability comes from publication.
- Named owner per rotation. Responsible engineer for the test. Prevents the bystander effect.
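The habits above (no skipped tests, a named owner, visible results) can be enforced with a small audit that a scheduled job runs and posts. This is a sketch under assumptions: `RotationTestRecord` and `audit` are hypothetical names, and the 92-day window is only an approximation of "quarterly".

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterable, List, Optional

# Assumption: roughly one quarter. Tune to the organization's actual cadence.
MAX_TEST_AGE = timedelta(days=92)


@dataclass(frozen=True)
class RotationTestRecord:
    """Latest reachability-test state for one rotation (hypothetical shape)."""
    rotation: str
    owner: Optional[str]         # named engineer responsible for the test
    last_tested: Optional[date]  # None if the rotation was never tested
    passed: Optional[bool]       # result of the most recent test


def audit(records: Iterable[RotationTestRecord], today: date) -> List[str]:
    """Return one human-readable finding per problem, ready to publish."""
    findings: List[str] = []
    for record in records:
        if not record.owner:
            findings.append(f"{record.rotation}: no named owner")
        if record.last_tested is None or today - record.last_tested > MAX_TEST_AGE:
            findings.append(f"{record.rotation}: test overdue")
        elif record.passed is False:
            findings.append(f"{record.rotation}: last test FAILED")
    return findings
```

Posting the returned list to a shared channel, including an explicit "no findings" message when it is empty, keeps both successes and failures visible.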