Pre-Paging Context Loading
Context loaded before the on-call sees the page.
The idea
By the time a human is paged, the system already has 90 seconds of context: what changed recently, which other alerts fired, and which dashboards are relevant. Pre-paging context loading attaches that data to the alert payload before the page goes out, saving 2-5 minutes of triage per incident. Compounded over a year, that’s days of on-call time recovered.
- 90 seconds of latent context. Recent changes, related alerts, dashboards; available before the page.
- Attach before page-out. The alert payload carries the context the human needs.
- 2-5 minutes saved per incident. Triage starts with data, not investigation.
- Days per year recovered. Compounded; the cost of building pays back fast.
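As a rough worked example of the compounding claim, the arithmetic can be sketched as follows. The volume of 1,000 pages per year is a hypothetical figure chosen for illustration, not a number from this article; substitute your own page count.

```python
def hours_recovered(pages_per_year: int, minutes_saved_per_page: float) -> float:
    """Total on-call hours recovered per year."""
    return pages_per_year * minutes_saved_per_page / 60


# Hypothetical volume (an assumption, not a measured figure).
PAGES_PER_YEAR = 1000

low = hours_recovered(PAGES_PER_YEAR, 2)   # lower bound: 2 minutes saved per page
high = hours_recovered(PAGES_PER_YEAR, 5)  # upper bound: 5 minutes saved per page

print(f"{low:.1f} to {high:.1f} hours per year")
print(f"{low / 24:.1f} to {high / 24:.1f} days per year")
```

At that assumed volume the savings come to roughly 33 to 83 hours, or about 1.4 to 3.5 days of on-call time per year.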
What to pre-load
Three categories of context cover most incidents: recent deploys for the affected service (Argo CD events, GitHub Actions runs; 80% of incidents follow a deploy); related alerts from the last 15 minutes (cluster the firing signals so the on-call sees the full picture); and impact data, meaning top affected endpoints, top affected customers, and current load (all derivable from APM data).
- Recent deploys. Argo CD events, GitHub Actions runs; 80% of incidents follow a deploy.
- Related alerts (15 min). Cluster firing signals; the on-call sees the full picture.
- Top affected endpoints, customers, load. All derivable from APM data; pre-render into the payload.
- Per-service context plan. Each service has its own context list; supports targeted enrichment.
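One way to picture a per-service context plan and the 15-minute related-alert window is the minimal sketch below. The `Alert` and `ContextPlan` types, the category names, and the `checkout` service are all hypothetical, invented here for illustration; a real setup would map them onto whatever the alerting tool and service catalog expose.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

RELATED_WINDOW = timedelta(minutes=15)


@dataclass(frozen=True)
class Alert:
    name: str
    service: str
    fired_at: datetime


@dataclass
class ContextPlan:
    """Which context categories to pre-load for one service (hypothetical schema)."""
    service: str
    categories: list[str] = field(
        default_factory=lambda: ["recent_deploys", "related_alerts"]
    )


# Hypothetical plans: a busy service gets the full list, others get the default.
PLANS = {
    "checkout": ContextPlan(
        "checkout",
        ["recent_deploys", "related_alerts",
         "top_endpoints", "top_customers", "current_load"],
    ),
}


def plan_for(service: str) -> ContextPlan:
    """Look up the service's plan, falling back to the default categories."""
    return PLANS.get(service, ContextPlan(service))


def related_alerts(trigger: Alert, recent: list[Alert],
                   window: timedelta = RELATED_WINDOW) -> list[Alert]:
    """Other alerts that fired within `window` before the trigger, newest first."""
    earliest = trigger.fired_at - window
    hits = [
        a for a in recent
        if a != trigger and earliest <= a.fired_at <= trigger.fired_at
    ]
    return sorted(hits, key=lambda a: a.fired_at, reverse=True)
```

Keeping the plan as plain data means a new service can opt into more context by editing one entry rather than changing the enrichment job.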
How to load
The mechanics are simple. A PagerDuty webhook triggers a Lambda or Cloud Run job. The job queries Datadog, Argo CD, and the service catalog, then posts the results back to the alert payload. The latency target is 30 seconds, because anything slower means the human reaches the alert before the context arrives. Cache aggressively, because most incidents share context within a 5-minute window.
- PagerDuty webhook into Lambda. Or Cloud Run; the enrichment job runs out-of-band.
- Multi-source query. Datadog, Argo CD, service catalog; the job assembles the context.
- 30-second latency target. Slower means the human reaches the alert before the context arrives.
- Aggressive caching. Incidents within a 5-minute window share context; one query per service per minute is enough.
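The deadline and caching ideas above can be sketched with the standard library alone. This is a sketch under stated assumptions, not a definitive implementation: the fetcher callables stand in for real Datadog, Argo CD, and service catalog clients (none of which are called here), and `gather_context` and `ContextCache` are hypothetical names. It needs Python 3.9 or later for `cancel_futures`.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait
from typing import Callable

Fetcher = Callable[[str], object]  # stand-in for a real vendor client call
CACHE_TTL_SECONDS = 60             # one upstream query per service per minute


def gather_context(service: str, sources: dict[str, Fetcher],
                   deadline_seconds: float = 30.0) -> dict[str, object]:
    """Query every source in parallel; keep whatever succeeds before the deadline."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(sources)))
    futures = {pool.submit(fetch, service): name for name, fetch in sources.items()}
    done, _late = wait(futures, timeout=deadline_seconds)
    pool.shutdown(wait=False, cancel_futures=True)  # do not block on slow sources
    results: dict[str, object] = {}
    for fut in done:
        if fut.exception() is None:  # a failed source is simply left out
            results[futures[fut]] = fut.result()
    return results


class ContextCache:
    """Reuses assembled context per service until the TTL expires."""

    def __init__(self, fetch: Callable[[str], dict],
                 ttl: float = CACHE_TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        self._entries: dict[str, tuple[float, dict]] = {}

    def get(self, service: str) -> dict:
        now = self._clock()
        cached = self._entries.get(service)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # fresh enough: skip the upstream queries
        context = self._fetch(service)
        self._entries[service] = (now, context)
        return context
```

The clock is injectable so the expiry logic can be tested without sleeping. Sources that miss the deadline or raise are dropped from the result rather than delaying the page.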
When it fails
Three failure modes deserve mitigation. Stale data: a deploy lookup that is 30 minutes behind is worse than no data, because the on-call trusts wrong information. Too much data: 40 lines of context is unreadable on a phone, so cap it at 5 facts. Vendor outages: if Datadog is down, pre-loading fails, so fall back to a basic page rather than blocking.
- Stale data worse than none. 30-minute-behind deploy lookup; on-call trusts wrong info.
- Cap at 5 facts. 40 lines is unreadable on a phone; the cap protects the page.
- Vendor outage fallback. If Datadog is down, fall back to basic page; don’t block on context.
- Per-failure mitigation policy. Each failure mode has a documented response, so the page still goes out when enrichment breaks.
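The three mitigations can be combined in one small function, sketched below. The `Fact` type, the `build_page` name, and the 5-minute freshness threshold are assumptions made for illustration; the article only says that a 30-minute lag is too stale, so pick a threshold that matches your own data sources.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable

MAX_FACTS = 5                   # cap from the article: 5 facts per page
MAX_AGE = timedelta(minutes=5)  # hypothetical freshness threshold


@dataclass(frozen=True)
class Fact:
    text: str
    observed_at: datetime


def build_page(summary: str, load_facts: Callable[[], list[Fact]],
               now: datetime) -> dict:
    """Return the page payload, degrading to a basic page if enrichment fails."""
    try:
        facts = load_facts()
    except Exception:
        # Vendor outage or any other enrichment error: send the basic page.
        return {"summary": summary, "context": [], "enriched": False}

    # Drop stale facts rather than show data the on-call might wrongly trust.
    fresh = [f for f in facts if now - f.observed_at <= MAX_AGE]

    # Keep the page readable on a phone.
    context = [f.text for f in fresh[:MAX_FACTS]]
    return {"summary": summary, "context": context, "enriched": bool(context)}
```

Ordering matters here: the caller should pass facts sorted by importance, since anything after the fifth fresh fact is cut.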
Get started
The starter ramp is concrete. Pick your top 3 services and build a simple webhook that adds “recent deploys” to the alert payload. Measure MTTA and MTTR before and after, targeting a 30-second drop in median triage time. Iterate per service, because adding context to all services at once is over-investment.
- Top 3 services first. Highest leverage; the best starting point.
- Simple webhook. “Recent deploys” in the payload; the smallest useful enrichment.
- 30-second median triage drop. The measurable target; before-and-after MTTA and MTTR show the effect.
- Per-service iteration. Pick by page volume; add context where it pays back.
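Checking the 30-second target takes only a median comparison, as in the sketch below. The triage times are made-up sample values in seconds, included only to show the calculation; real numbers would come from your incident tooling.

```python
from statistics import median

TARGET_DROP_SECONDS = 30


def median_drop(before: list[float], after: list[float]) -> float:
    """Reduction in median triage time, in seconds (positive means faster)."""
    return median(before) - median(after)


def met_target(before: list[float], after: list[float],
               target: float = TARGET_DROP_SECONDS) -> bool:
    """True when the median dropped by at least `target` seconds."""
    return median_drop(before, after) >= target


# Hypothetical triage times per incident, in seconds.
before = [240, 300, 180, 420, 360]
after = [200, 260, 150, 400, 270]

print(median_drop(before, after))  # 300 - 260 = 40
print(met_target(before, after))   # True
```

The median is used rather than the mean so that one unusually long incident does not hide or exaggerate the effect.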