On-Call Reachability Testing

Verify that the on-call engineer actually receives pages. Test it rather than assume it.

Test

On-call reachability testing catches paging-chain failures before a real incident does. Send a synthetic page across every channel each quarter, and run a reachability test for each new hire on day one. The team should never first discover a broken chain during a real incident.

Quarterly test page. Synthetic page per rotation. Verify the on-call actually receives it.
Catches DND and app issues. Phone Do-Not-Disturb, app permissions, network problems per engineer. All real failure modes.
All channels covered. Phone, app push, SMS, voice per rotation. Each channel can fail independently.
New-hire reachability test. Day-one test per onboard. Catches misconfigured new accounts before the first shift.
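
A minimal sketch of the test plan described above, assuming a simple in-memory model. The names SyntheticPage, build_test_plan, and missing_acks are hypothetical, as are the rotation and engineer names; a real paging vendor's API would replace the data structures here. The sketch expands each rotation into one synthetic page per engineer per channel, then lists the pages that were never acknowledged.

```python
from dataclasses import dataclass
from itertools import product

# Channel names taken from the list above; the identifiers are illustrative.
CHANNELS = ("phone", "app_push", "sms", "voice")


@dataclass(frozen=True)
class SyntheticPage:
    """One test page: a single engineer on a single channel (hypothetical model)."""
    rotation: str
    engineer: str
    channel: str


def build_test_plan(rotations, channels=CHANNELS):
    """Expand {rotation: [engineers]} into one page per engineer per channel."""
    plan = []
    for rotation, engineers in rotations.items():
        for engineer, channel in product(engineers, channels):
            plan.append(SyntheticPage(rotation, engineer, channel))
    return plan


def missing_acks(plan, acked):
    """Return planned pages that were never acknowledged, in plan order."""
    return [page for page in plan if page not in acked]


if __name__ == "__main__":
    plan = build_test_plan({"payments": ["ana", "raj"]})
    # Pretend every page was acknowledged except raj's SMS.
    acked = set(plan) - {SyntheticPage("payments", "raj", "sms")}
    for page in missing_acks(plan, acked):
        print(f"NOT RECEIVED: {page.rotation} / {page.engineer} / {page.channel}")
```

Treating each channel as its own page keeps independent channel failures visible: an engineer who receives the push notification but not the SMS still shows up in the missing list.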

Review

Reviewing the test results closes the loop. Latency, missed channels, and engineer-side issues all surface here. Without review, the test is just performative.

Arrival time. Arrival versus target latency per test. Did it arrive within the SLO?
Latency over 30 seconds. Investigation trigger per test. A delay beyond 30 seconds suggests vendor or carrier issues.
Per-channel success rate. Per-quarter delivery rate per channel. Drops surface vendor degradation.
Per-engineer follow-up. “What blocked the page” debrief per failure. Supports targeted fixes.
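
The review checks above can be sketched as a few small functions. This is an illustration under stated assumptions: PageResult and the function names are hypothetical, and a latency of None is used here to mean the page never arrived. The 30-second threshold comes from the text; the SLO target is left as a parameter because the text does not give one.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

INVESTIGATE_AFTER_S = 30.0  # investigation trigger named in the text


@dataclass
class PageResult:
    """Outcome of one synthetic page (hypothetical record shape)."""
    engineer: str
    channel: str
    latency_s: Optional[float]  # None means the page never arrived


def within_slo(result, slo_s):
    """True if the page arrived at or under the caller-supplied SLO target."""
    return result.latency_s is not None and result.latency_s <= slo_s


def needs_investigation(result):
    """True if the page arrived but took longer than the 30-second trigger."""
    return result.latency_s is not None and result.latency_s > INVESTIGATE_AFTER_S


def channel_success_rates(results):
    """Fraction of pages delivered per channel."""
    sent = defaultdict(int)
    delivered = defaultdict(int)
    for r in results:
        sent[r.channel] += 1
        if r.latency_s is not None:
            delivered[r.channel] += 1
    return {channel: delivered[channel] / sent[channel] for channel in sent}


def follow_ups(results):
    """Engineers with at least one undelivered page, for a debrief."""
    return sorted({r.engineer for r in results if r.latency_s is None})


if __name__ == "__main__":
    results = [
        PageResult("ana", "sms", 4.0),
        PageResult("ana", "voice", 41.5),
        PageResult("raj", "sms", None),
        PageResult("raj", "voice", 6.2),
    ]
    print(channel_success_rates(results))
    print([r for r in results if needs_investigation(r)])
    print(follow_ups(results))
```

Comparing the per-channel rates quarter over quarter is what surfaces vendor degradation; a single quarter's numbers mostly surface engineer-side issues.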

Avoid

The recurring failure mode is skipping the test because nothing seems wrong. A broken chain first discovered during a real incident is discovered too late. Build the discipline as a recurring habit, not an afterthought.

Skipping the test. No-skip rule per rotation. A failure first seen during a real incident is found too late.
Build the discipline. Recurring test cadence per org. Cron-driven, calendar-blocked, results posted.
Published result. Visible success or failure per test. Accountability comes from publication.
Named owner per rotation. Responsible engineer for the test. Prevents the bystander effect, where everyone assumes someone else will run it.
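
One way to enforce the no-skip rule and the named-owner rule is a small audit that a scheduled job could run and post. The sketch below is an assumption-laden illustration: RotationStatus, audit, and the 92-day limit (roughly one quarter) are hypothetical choices, not values from the text.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Assumed definition of "quarterly": no more than about 92 days between tests.
MAX_AGE = timedelta(days=92)


@dataclass
class RotationStatus:
    """Test bookkeeping for one rotation (hypothetical record shape)."""
    name: str
    owner: Optional[str]        # named engineer responsible for the test
    last_test: Optional[date]   # None if the rotation was never tested


def audit(rotations, today):
    """Return {rotation name: [problems]} for rotations that break a rule."""
    problems = {}
    for rotation in rotations:
        issues = []
        if not rotation.owner:
            issues.append("no named owner")
        if rotation.last_test is None:
            issues.append("never tested")
        elif today - rotation.last_test > MAX_AGE:
            issues.append("test overdue")
        if issues:
            problems[rotation.name] = issues
    return problems


if __name__ == "__main__":
    rotations = [
        RotationStatus("payments", "ana", date(2024, 5, 1)),
        RotationStatus("search", None, date(2024, 1, 15)),
        RotationStatus("infra", "raj", None),
    ]
    for name, issues in audit(rotations, today=date(2024, 7, 1)).items():
        print(f"{name}: {', '.join(issues)}")
```

Posting the audit output where the whole team can see it covers the published-result rule: a skipped test becomes a visible failure rather than a silent gap.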