Real Outage: A CDN Edge Collapse
A single VCL config change touched every edge node at once: 51 minutes of cascading failures, a rollback that didn’t actually roll back, and the staged-config rule that prevents it from happening again.
Timeline
Anonymised composite of the global-VCL-push pattern at large CDN providers. Times in UTC.
10:47: A customer ships a self-service config change adding a new request-routing rule. The change passes the customer-facing validator. It’s pushed to all edge POPs in 38 seconds.
10:48: The new VCL contains a regex with catastrophic backtracking on certain user-agent strings. Edge nodes start spending 600ms+ per request in regex evaluation. CPU saturates within 90 seconds across the global fleet.
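For a sense of the mechanics, here is a minimal sketch of catastrophic backtracking in Python. The pattern is the textbook nested-quantifier example, not the regex from the incident, and the input is a stand-in for a long, non-matching User-Agent value.

```python
# Illustrative only: a nested-quantifier pattern of the kind that backtracks
# catastrophically. This is NOT the actual routing regex from the incident.
import re
import time

BAD = re.compile(r'(a+)+$')          # a quantified group whose body is itself quantified

for n in (16, 18, 20, 22):
    header = 'a' * n + '!'           # the trailing '!' forces the engine to try every split
    start = time.perf_counter()
    BAD.match(header)                # fails, but only after exponential backtracking
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f'{n} chars: {elapsed_ms:8.2f} ms')

# Every extra character roughly doubles the evaluation time, which is how a
# long header value turns a sub-millisecond check into hundreds of milliseconds.
```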
10:50: Origin-shielding requests start timing out. Customer error rates climb from 0.04% to 18%. p99 edge response latency goes from 80ms to 12 seconds.
10:53: First page. Detection: 6 minutes. The synthetic monitor that should have caught this in 30 seconds was running on a sample set that didn’t include the affected user-agent pattern.
11:01: On-call identifies the customer config push as the trigger and issues a rollback command via the control plane.
11:04: Rollback “completes” according to the control plane. Edge nodes are still saturated. Investigation reveals the rollback updated the source of truth, but the pushed-config sync interval is 5 minutes.
11:09: Force-sync triggered manually. Edge nodes load the previous config; CPU drops within 20 seconds.
11:38: Recovery confirmed across all POPs. The long tail was edges in three Asian regions, where the force-sync took longer because the control plane is hosted primarily in US-East. Total customer impact: 51 minutes.
The detection lag
Six minutes from push to alert. The detection failure was specifically about coverage: synthetic monitors hit ~150 URLs across ~12 user-agent strings, and the regex-backtracking bug only triggered on a long User-Agent header pattern that wasn’t in the synthetic set. The monitors stayed green while real customer traffic was burning.
The deeper failure: there was no fleet-wide CPU alarm with the right threshold. CPU was monitored per edge and alarmed at 95% sustained for 5 minutes. The actual saturation was 88-92% across most edges, just below the threshold, and the per-edge alarm was tuned to tolerate noisy nodes rather than to recognise a fleet-wide pattern.
What would have caught it: a fleet aggregate alarm on “p99 edge response latency > 500ms across more than 30% of POPs simultaneously”. That signature is only ever a global-config issue or a network event, both of which need a 30-second response.
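A minimal sketch of that aggregate check, assuming a per-POP p99 feed; the data source and names here are hypothetical, and only the 500ms / 30%-of-POPs signature comes from the alarm described above.

```python
# Sketch of the fleet-aggregate check. The metric feed and names are
# hypothetical; the 500ms / 30%-of-POPs signature comes from the text.
from typing import Mapping

P99_THRESHOLD_MS = 500.0
POP_FRACTION = 0.30

def fleet_latency_alarm(p99_by_pop: Mapping[str, float]) -> bool:
    """Fire when more than 30% of POPs report p99 latency above 500ms.

    In practice this would be evaluated over a sustain window (e.g. 60 seconds)
    rather than on a single sample.
    """
    if not p99_by_pop:
        return False
    breaching = sum(1 for p99 in p99_by_pop.values() if p99 > P99_THRESHOLD_MS)
    return breaching / len(p99_by_pop) > POP_FRACTION

# Per-edge thresholds miss a uniform squeeze just below 95%; the aggregate view does not.
print(fleet_latency_alarm({'lhr': 620.0, 'fra': 580.0, 'iad': 95.0, 'nrt': 610.0}))  # True
print(fleet_latency_alarm({'lhr': 120.0, 'fra': 95.0, 'iad': 88.0, 'nrt': 110.0}))   # False
```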
The cascade
Three layers. Layer one: the regex itself. A pathological pattern that took 600ms-2s per request to evaluate, bound to a single CPU core with no way to optimise around it.
Layer two: the request queue. Edge nodes had a default limit of 1,000 in-flight requests before shedding load. With each request taking 600ms instead of 80ms, the queue filled in seconds. Once full, the edge started returning 503s on the queue-overflow path, even for requests that would have hit cache and returned in 5ms. Healthy traffic was rejected because of the saturation.
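The arithmetic of that fill is worth making explicit. A back-of-the-envelope sketch, assuming a hypothetical 64-worker edge running at 70% utilisation; only the 80ms/600ms service times and the 1,000-request limit come from the incident.

```python
# Back-of-the-envelope queue-fill arithmetic. Worker count and utilisation are
# illustrative assumptions; the service times and queue limit are from the text.
workers = 64                 # assumed concurrent request workers per edge
healthy_service_s = 0.080    # normal per-request service time
degraded_service_s = 0.600   # with the pathological regex
queue_limit = 1_000          # in-flight requests before shedding

healthy_capacity = workers / healthy_service_s        # ~800 req/s
degraded_capacity = workers / degraded_service_s      # ~107 req/s
arrival_rate = 0.7 * healthy_capacity                 # assume the edge was at 70% load

backlog_growth = arrival_rate - degraded_capacity     # requests queued per second
print(f'queue fills in ~{queue_limit / backlog_growth:.0f} s')   # roughly 2 s with these assumptions
```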
Layer three: customer retries. Mobile clients with aggressive retry logic saw the 503 wave and re-requested with no backoff. Effective request rate doubled within 90 seconds. The CDN was now both slower (regex) and seeing more traffic (retries) at the same time.
The cascade pattern was: bad config → CPU saturation → queue overflow → 503s → client retries → more traffic. Each layer was correctly designed in isolation. Together they produced 51 minutes of pain.
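The retry layer is easy to under-appreciate. A rough model of the amplification, with the client retry count and failure rate as assumptions; the only fixed point from the timeline is that effective traffic roughly doubled within 90 seconds.

```python
# Rough model of layer three. The failure rate and retry count are assumptions;
# the incident only established that effective traffic roughly doubled.
base_rate = 100_000            # hypothetical aggregate client request rate (req/s)
failure_rate = 0.9             # fraction of requests hitting the 503 path
immediate_retries = 1          # each failed request is re-sent once, with no backoff

effective_rate = base_rate
for _ in range(immediate_retries):
    effective_rate += effective_rate * failure_rate   # failures are re-sent immediately

print(f'effective load: {effective_rate / base_rate:.1f}x baseline')   # ~1.9x
# More retries compound this further while the regex keeps per-request cost
# high: the fleet gets slower and busier at the same time.
```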
What the runbook said
The CDN runbook had a “global VCL rollback” procedure. The procedure was documented, tested in staging, and worked. The problem was step 3, which read: “Issue rollback via control plane. Recovery should be visible within 60 seconds.”
That sentence was wrong. The control plane processed rollbacks in under a second; the edge sync interval was 5 minutes. The on-call followed the runbook, saw “rollback complete” in the control-plane UI, and waited for recovery. Recovery didn’t come because the edges hadn’t actually pulled the new config yet.
This cost about 5 minutes of the incident. The on-call eventually thought to check edge logs directly, found they were still serving the bad config, and triggered the manual force-sync. The runbook didn’t mention force-sync; it didn’t even acknowledge the sync interval existed.
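The missing runbook step is mechanical to automate: verify what the edges are actually serving instead of trusting the control-plane status. A sketch, with the status endpoint, field name, and version identifier all hypothetical:

```python
# Sketch of the verification step the runbook lacked: confirm the config version
# each edge is actually serving rather than trusting "rollback complete".
# The endpoint, field names, and version identifier are hypothetical.
import json
import urllib.request

def edge_config_version(edge_host: str) -> str:
    """Ask an edge node which config generation it is currently serving."""
    with urllib.request.urlopen(f'https://{edge_host}/_status/config', timeout=2) as resp:
        return json.load(resp)['active_config_version']

def verify_rollback(edges: list[str], expected_version: str) -> list[str]:
    """Return the edges still serving something other than the rolled-back version."""
    stale = []
    for edge in edges:
        try:
            if edge_config_version(edge) != expected_version:
                stale.append(edge)
        except OSError:
            stale.append(edge)    # unreachable edges count as unconfirmed
    return stale

# After the control plane reports success:
#   stale = verify_rollback(all_edges, expected_version='v2041')
#   if stale: trigger a force-sync instead of waiting out the 5-minute interval.
```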
What actually fixed it
The actual recovery was three steps: roll back the customer config, trigger a manual edge-config force-sync, and watch CPU drop. The total work was maybe 90 seconds of typing. The 51-minute incident was almost entirely detection lag plus a runbook that was wrong about how rollback worked.
The customer who pushed the config was not at fault in any meaningful sense. The validator accepted the config. The push was authorised. The system was supposed to handle pathological VCL gracefully and didn’t. The postmortem explicitly avoided naming the customer; the failure was structural.
Action items
- Regex complexity limit in the validator. The customer-facing config validator now rejects regexes that fail a backtracking-complexity check (a sketch of the kind of check involved follows this list). Required two weeks of work and prevents an entire class of bugs.
- Per-request CPU budget. Edge nodes now kill any single request that exceeds 250ms of CPU time and return a 503 immediately. Caps the blast radius of any future pathological config.
- Force-sync as default rollback. The control-plane rollback now triggers a force-sync as part of the same operation. The 5-minute sync interval is no longer in the rollback path.
- Fleet-wide latency alarm. New alarm: p99 edge latency > 500ms across >30% of POPs for >60 seconds. Detection floor is now under 90 seconds.
- Synthetic UA expansion. The synthetic monitor sample now includes the long-tail user-agent patterns from the previous 30 days of real traffic.
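For the first item, here is a crude sketch of the kind of static check a validator can run. Real backtracking analysis is more involved; this heuristic just rejects a quantified group whose body already contains a quantifier, the shape that bit here. None of the names below are from the actual validator.

```python
# Crude heuristic sketch, not the real validator: flag a quantified group whose
# body itself contains a quantifier, e.g. (a+)+ or (\w+\s?)*.
import re

def has_nested_quantifier(pattern: str) -> bool:
    r"""Return True if a quantified group's body already contains a quantifier."""
    stack = []                     # per open group: does its body contain a quantifier?
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == '\\':             # skip escaped characters
            i += 2
            continue
        if ch == '(':
            stack.append(False)
            if pattern[i + 1:i + 2] == '?':
                i += 2             # skip group modifiers such as ?: ?= ?!
                continue
        elif ch == ')' and stack:
            body_quantified = stack.pop()
            if body_quantified and pattern[i + 1:i + 2] in ('*', '+', '{', '?'):
                return True
            if body_quantified and stack:
                stack[-1] = True   # the enclosing group now contains a quantifier
        elif ch in '*+{?' and stack:
            stack[-1] = True
        i += 1
    return False

def validate_routing_regex(pattern: str) -> None:
    re.compile(pattern)            # must at least be syntactically valid
    if has_nested_quantifier(pattern):
        raise ValueError(f'rejected: likely catastrophic backtracking in {pattern!r}')

validate_routing_regex(r'^Mozilla/5\.0 .*Safari')        # accepted
try:
    validate_routing_regex(r'^([A-Za-z0-9]+\s?)*$')
except ValueError as exc:
    print(exc)                                           # rejected by the new check
```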
The architectural change
The architectural answer was the staged-config pattern. The new rule is that any customer-driven or platform config change rolls out in three stages: one POP, then one region, then global. The single-POP stage runs for 5 minutes; the regional stage runs for 10. Each stage has automated abort criteria (CPU spike, latency spike, error-rate spike) that pull the config back without human intervention.
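A minimal sketch of that rollout loop, assuming hypothetical push_config, revert_config, and stage_metrics primitives; the stage shape and soak times come from the rule above, and the abort thresholds shown are placeholders.

```python
# Sketch of a staged rollout with automated abort. push_config, revert_config,
# and stage_metrics are hypothetical callables; abort thresholds are placeholders.
import time
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    scope: str          # e.g. one POP id, a region name, or 'global'
    soak_seconds: int

STAGES = [
    Stage('canary', 'pop:ams-01', soak_seconds=5 * 60),
    Stage('regional', 'region:eu-west', soak_seconds=10 * 60),
    Stage('global', 'global', soak_seconds=0),
]

def healthy(metrics: dict) -> bool:
    """Abort criteria named in the text: CPU spike, latency spike, error-rate spike."""
    return (metrics['cpu_pct'] < 85
            and metrics['p99_latency_ms'] < 500
            and metrics['error_rate'] < 0.01)

def staged_rollout(config_id: str, push_config, revert_config, stage_metrics) -> bool:
    for stage in STAGES:
        push_config(config_id, stage.scope)
        deadline = time.monotonic() + stage.soak_seconds
        while time.monotonic() < deadline:
            if not healthy(stage_metrics(stage.scope)):
                revert_config(config_id, stage.scope)   # automatic abort, no human in the loop
                return False
            time.sleep(10)                              # poll interval
    return True                                         # reached global with every stage healthy
```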
The pushback was “customers expect their changes to be live in seconds”. The response was: customers expect their site not to break. The new SLA is “15 minutes to global” with the understanding that any change can abort itself in stage one and never go further. The customer experience for normal traffic is identical; the worst case is 15 minutes of waiting; the protection is that one bad config can’t take down the entire CDN.
Six months in, the staged-config pattern has caught 11 customer pushes that would have caused fleet-wide degradation. None of them turned into incidents because they aborted at stage one. That number is the postmortem’s closing argument every time someone asks why the CDN is “slower to deploy now”.