Monitoring the On-Call

The on-call rotation is itself a system that needs monitoring. Three groups of metrics cover it: page volume, response time, and rotation health.

Page volume

Page volume is the first signal. Per-shift, per-engineer, per-service cuts each surface different patterns; together they show whether the rotation is sustainable.

Per-shift, per-engineer, per-service. View each rotation's page volume along all three cuts; trends and outliers surface in one cut even when the others look flat.
Healthy means bounded and predictable. Set a per-team volume target; whether the team stays within it determines whether the rotation is sustainable.
Per-quarter trend chart. Chart the volume trajectory each quarter; it catches degrading rotation health before it leads to incidents.
Per-service noise share. Track each service's contribution to total volume; it identifies the noisy systems driving the page count.
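The cuts and the noise share above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the Page record and its field names are hypothetical, and a real paging tool will export a different shape.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    # Hypothetical record shape; adapt the fields to your paging tool's export.
    shift: str
    engineer: str
    service: str


def volume_by(pages, field):
    """Count pages along one cut: 'shift', 'engineer', or 'service'."""
    return Counter(getattr(p, field) for p in pages)


def noise_share(pages):
    """Fraction of total page volume contributed by each service."""
    counts = volume_by(pages, "service")
    total = sum(counts.values())
    if total == 0:
        return {}
    return {service: n / total for service, n in counts.items()}


# Made-up sample data, for illustration only.
pages = [
    Page("2024-W01", "alice", "checkout"),
    Page("2024-W01", "alice", "checkout"),
    Page("2024-W01", "alice", "search"),
    Page("2024-W02", "bob", "checkout"),
]

per_shift = volume_by(pages, "shift")
per_engineer = volume_by(pages, "engineer")
shares = noise_share(pages)
```

Running the same counts once per quarter gives the data points for the trend chart; comparing per_shift against the team's volume target is the "bounded and predictable" check.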

Response time

Response time has two halves. Page-to-ack measures reachability; ack-to-action measures effectiveness. Both deserve their own metric and their own degradation alert.

Page to acknowledgement. Time each incident from page to ack (the input to MTTA); this is a reachability and tooling check: did the page actually reach the engineer's phone?
Acknowledgement to action. Time each incident from ack to first action; this separates real engagement from an over-eager ack followed by silence.
Slowing means burnout or tooling. Watch for either timer trending up quarter over quarter; it is a leading indicator of rotation degradation.
Per-quarter cause investigation. Name a driver for any degradation each quarter; this guards against "the metric just slipped" complacency.
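A minimal sketch of the two timers and the slowing check follows. The function names, the timestamps, and the 10% tolerance are assumptions chosen for illustration; the text above does not prescribe a threshold.

```python
from datetime import datetime, timedelta
from statistics import mean


def response_halves(paged_at, acked_at, first_action_at):
    """Split one incident into (page-to-ack, ack-to-action), both in seconds."""
    return (
        (acked_at - paged_at).total_seconds(),
        (first_action_at - acked_at).total_seconds(),
    )


def is_slowing(quarterly_means, tolerance=0.10):
    """True if the latest quarterly mean exceeds the previous one by more than
    `tolerance` (an assumed 10% by default)."""
    if len(quarterly_means) < 2:
        return False
    previous, latest = quarterly_means[-2], quarterly_means[-1]
    return latest > previous * (1 + tolerance)


# Made-up incident timestamps, for illustration only.
paged = datetime(2024, 1, 1, 3, 0, 0)
acked = paged + timedelta(minutes=4)
acted = acked + timedelta(minutes=11)

to_ack, to_action = response_halves(paged, acked, acted)

# MTTA for a quarter is the mean of the per-incident page-to-ack times.
mtta_this_quarter = mean([to_ack, 180.0, 300.0])
```

Keeping the two halves as separate series means each can carry its own degradation alert; a flagged quarter is the trigger for the cause investigation, not a conclusion in itself.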

Rotation health

Rotation health is the structural metric. Headcount, tenure, and departures all signal whether the system that produces on-call is healthy or degrading.

Engineers per rotation. Headcount per rotation drives shift frequency; below 6 engineers the schedule becomes punishing.
Tenure on rotation. Each engineer's time on the rotation drives the experience distribution; uniformly low tenure is a turnover signal.
Voluntary departures. The per-quarter departure rate is a leading indicator of staffing problems; it matters more than absolute headcount.
Per-rotation exit-interview signal. Record whether each departing engineer mentions on-call in the exit interview; this catches systemic on-call problems before they cause cascading turnover.
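The structural checks above can be combined into one small report. This is a sketch under stated assumptions: the 6-engineer floor comes from the text, but one-week single-engineer shifts, the 6-month low-tenure cutoff, and the departure-rate denominator (current headcount plus this quarter's departures) are all illustrative choices.

```python
from dataclasses import dataclass

MIN_ENGINEERS = 6  # from the text: below 6 engineers the rotation becomes punishing


@dataclass
class RotationHealth:
    understaffed: bool
    weeks_between_shifts: int
    uniformly_low_tenure: bool
    departure_rate: float


def rotation_health(tenures_months, departures, low_tenure_months=6):
    """Summarize one rotation for one quarter.

    tenures_months: months on rotation for each current engineer.
    departures: voluntary departures this quarter.
    low_tenure_months: assumed cutoff for "low tenure" (not from the text).
    """
    headcount = len(tenures_months)
    start_headcount = headcount + departures  # assumes no joiners this quarter
    return RotationHealth(
        understaffed=headcount < MIN_ENGINEERS,
        # Assumes one-week shifts with a single engineer on call at a time.
        weeks_between_shifts=headcount,
        uniformly_low_tenure=headcount > 0
        and all(t < low_tenure_months for t in tenures_months),
        departure_rate=departures / start_headcount if start_headcount else 0.0,
    )


# Made-up example: five engineers, all under six months, one departure.
report = rotation_health([3, 4, 2, 5, 1], departures=1)
```

The exit-interview signal is qualitative and is left out of the report on purpose; a simple count of departures that mention on-call, kept alongside the departure rate, is usually enough to spot a pattern.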