Pre-Paging Context Loading
Context loaded before the on-call sees the page.
The idea
By the time a human is paged, the system already has 90 seconds of context: what changed recently, which alerts also fired, which dashboards are relevant.
Pre-paging context loading attaches that data to the alert payload before the page goes out.
This can save 2 to 5 minutes of triage per incident. Summed over a year of pages, that can add up to days of on-call time recovered.
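To sanity-check the "days" figure, here is the arithmetic as a small Python sketch. The 300 pages per year is a hypothetical volume, not a number from this guide; substitute your own.

```python
# Hypothetical inputs: replace with your team's real page volume.
PAGES_PER_YEAR = 300              # assumed annual page count
MINUTES_SAVED_PER_PAGE = (2, 5)   # the 2-to-5-minute range from the text

def hours_saved(pages: int, minutes_per_page: float) -> float:
    """Total triage time recovered per year, in hours."""
    return pages * minutes_per_page / 60

low, high = (hours_saved(PAGES_PER_YEAR, m) for m in MINUTES_SAVED_PER_PAGE)
print(f"{low:.0f} to {high:.0f} hours per year")
```

At 300 pages, that is 10 to 25 hours, or roughly one to three eight-hour working days. Higher page volumes scale linearly.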
What to pre-load
Recent deploys for the affected service (Argo CD events, GitHub Actions runs). Roughly 80% of incidents follow a deploy, so this is usually the most useful fact on the page.
Related alerts within the last 15 minutes. Cluster the firing signals so the on-call sees the full picture.
Top affected endpoints, top affected customers, current load. All derivable from APM data.
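One way to hold this context is a small record attached to each page, plus a helper that picks related alerts out of the last 15 minutes. This is a Python sketch under stated assumptions: `Alert`, `PageContext`, and `related_alerts` are hypothetical names, and "clustering" is simplified to grouping by the same service or a dependency you pass in (for example, from the service catalog).

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional, Set

@dataclass
class Alert:
    name: str
    service: str
    fired_at: datetime

@dataclass
class PageContext:
    """Hypothetical shape for the context attached to a page."""
    recent_deploys: List[str] = field(default_factory=list)
    related_alerts: List[str] = field(default_factory=list)
    top_endpoints: List[str] = field(default_factory=list)
    top_customers: List[str] = field(default_factory=list)
    current_rps: Optional[float] = None  # current load, from APM data

def related_alerts(trigger: Alert, recent: List[Alert],
                   window: timedelta = timedelta(minutes=15),
                   related_services: Optional[Set[str]] = None) -> List[Alert]:
    """Alerts on the same or a related service that fired in the window before the trigger."""
    services = {trigger.service} | (related_services or set())
    return [a for a in recent
            if a is not trigger
            and a.service in services
            and trigger.fired_at - window <= a.fired_at <= trigger.fired_at]
```

A real implementation would likely use dependency edges from the service catalog rather than a hand-passed set; the time-window filter is the part that carries over.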
How to load
Route a PagerDuty webhook to an AWS Lambda function or Cloud Run job. The job queries Datadog, Argo CD, and the service catalog, then attaches the results to the alert.
Latency target: 30 seconds. Slower than that and the human reaches the alert before the context arrives.
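The fan-out step can be sketched in Python as follows. The source functions are hypothetical stand-ins for your Datadog, Argo CD, and service catalog clients; no real vendor API is called, and the Lambda or Cloud Run handler around it is not shown. Each source is queried in parallel, and anything slow or failing is dropped rather than awaited, so one bad vendor cannot push the page past the latency target.

```python
import concurrent.futures as cf
import time
from typing import Callable, Dict, List

# Assumed budget for the fetch step: leaves headroom under the 30-second target
# for posting results back to the alert.
FETCH_BUDGET_SECONDS = 20

def gather_context(service: str,
                   sources: Dict[str, Callable[[str], List[str]]],
                   deadline: float = FETCH_BUDGET_SECONDS) -> Dict[str, List[str]]:
    """Query every source in parallel; keep only what returns, without error, before the deadline."""
    pool = cf.ThreadPoolExecutor(max_workers=max(1, len(sources)))
    futures = {pool.submit(fn, service): name for name, fn in sources.items()}
    done, _not_done = cf.wait(futures, timeout=deadline)
    # Return without waiting for stragglers (Python 3.9+ for cancel_futures).
    pool.shutdown(wait=False, cancel_futures=True)
    results = {}
    for fut in done:
        if fut.exception() is None:  # a source that raised is skipped, not fatal
            results[futures[fut]] = fut.result()
    return results
```

Sources that miss the deadline simply do not appear in the result, which keeps the page sendable when a vendor is slow or down.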
Cache aggressively. Alerts that fire within the same 5-minute window usually share context, so one query per service per minute is enough.
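A per-service time-to-live cache enforces "one query per service per minute". This Python sketch is an assumption about how you might do it, not a specific library; the clock is injectable so the expiry can be tested without sleeping.

```python
import time
from typing import Any, Callable, Dict, Tuple

class TTLCache:
    """Per-key cache: at most one upstream query per key every `ttl` seconds.

    Not thread-safe; wrap get() in a lock if handlers run concurrently.
    """

    def __init__(self, ttl: float = 60.0, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl
        self.clock = clock  # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str, fetch: Callable[[str], Any]) -> Any:
        now = self.clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # fresh enough: skip the upstream query
        value = fetch(key)  # miss or expired: query the vendor once
        self._entries[key] = (now, value)
        return value
```

A 60-second TTL also bounds how stale cached context can get, which matters for the failure modes that follow.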
When it fails
Stale data. If the deploy lookup is 30 minutes behind, it is worse than no data, because the on-call will act on wrong information.
Too much data. A page with 40 lines of context is unreadable on a phone. Cap at 5 facts.
Vendor outages. If Datadog is down, pre-loading fails. Fall back to a basic page; don't block on context.
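The three guards above (drop stale facts, cap at 5, never block on missing context) can live in the function that renders the page. This is a Python sketch with hypothetical names; the 5-minute staleness threshold and the "(context unavailable)" marker are assumptions, and facts are assumed to arrive already sorted by priority.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

MAX_FACTS = 5
MAX_STALENESS = timedelta(minutes=5)  # assumed threshold; tune per source

@dataclass
class Fact:
    text: str
    source_synced_at: datetime  # when the source was last known to be current

def build_page_body(title: str, facts: Optional[List[Fact]],
                    now: Optional[datetime] = None) -> str:
    """Always return a sendable page; context is best-effort."""
    now = now or datetime.now(timezone.utc)
    if facts is None:  # context job failed or timed out, e.g. a vendor outage
        return f"{title}\n(context unavailable)"
    # Stale facts are dropped, not shown: wrong context is worse than none.
    fresh = [f for f in facts if now - f.source_synced_at <= MAX_STALENESS]
    return "\n".join([title] + [f"- {f.text}" for f in fresh[:MAX_FACTS]])
```

Marking the fallback explicitly tells the on-call to gather context by hand instead of assuming there was none.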
Get started
Pick your top 3 services. Build a simple webhook that adds "recent deploys" to the alert payload.
Measure MTTA (mean time to acknowledge) and MTTR (mean time to resolve) before and after. Target a 30-second drop in median triage time.
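Checking the 30-second target is a comparison of medians. In this Python sketch the durations are made-up sample values, and "triage time" is assumed to mean seconds from acknowledgement to first diagnosis; use whatever definition your incident tooling records.

```python
from statistics import median
from typing import List

TARGET_DROP_SECONDS = 30

def median_drop_seconds(before: List[float], after: List[float]) -> float:
    """Positive result means median triage time went down."""
    return median(before) - median(after)

# Hypothetical triage durations in seconds, one per incident.
before = [420, 300, 610, 380, 290]
after = [250, 310, 270, 400, 240]
drop = median_drop_seconds(before, after)
print(f"median triage time dropped by {drop:.0f} s; target met: {drop >= TARGET_DROP_SECONDS}")
```

The median resists the occasional multi-hour outlier better than the mean, but with only a handful of incidents per service the result is still noisy.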
Iterate per service. Adding context to all services at once is over-investment; pick by page volume.
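Picking services by page volume can be as simple as counting. This Python sketch assumes a hypothetical export of page history reduced to one service name per page.

```python
from collections import Counter
from typing import List

def services_by_page_volume(pages: List[str], top_n: int = 3) -> List[str]:
    """Rank services by how often they paged; ties keep first-seen order."""
    return [svc for svc, _count in Counter(pages).most_common(top_n)]
```

For example, a history of `["checkout", "search", "checkout", "payments", "search", "checkout", "auth"]` ranks `checkout`, then `search`, then `payments`.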