Postmortem · Advanced
By Samson Tanimawo, PhD · Published Sep 17, 2026 · 11 min read

Real Outage: A BGP Withdrawal Cascade

A global CDN withdrew its prefixes from peer routers for 27 minutes after a config push. Peer dampening then hid the recovery for another 8. Here’s what happened and the change-control rule that came out of it.

Timeline

All times UTC. Anonymised composite of the BGP-withdrawal pattern at large CDN networks. The numbers and config values are representative.

09:14, Network engineering pushes a routing-policy change to the global edge fleet. Intent: tighten an inbound prefix-list to drop a small set of bogons. Actual change: a stray ! at the start of one prefix-list line negated the entire ACL (see the sketch after the timeline).

09:15, All edge routers re-evaluate export policy. The negated ACL matches everything. Outbound BGP announcements for ~1,400 customer prefixes are withdrawn from 47 peer relationships.

09:16, Anycast reachability for the affected customers drops to 11% globally. Traffic that still resolves goes to the surviving regions, which immediately saturate.

09:18, Internal monitoring shows a synthetic-prober alert. The router-level metrics still look healthy because the routers themselves are fine; they’re correctly applying a wrong policy.

09:23, On-call identifies the change in the deploy log. Initial revert is pushed at 09:25.

09:31, Revert applied to all routers. BGP re-announces. But three of the largest peers run aggressive route-flap dampening: the rapid withdraw-and-re-announce pushed the flap penalty for these prefixes past their suppress thresholds, and the penalty takes roughly 15 minutes to decay back under the reuse limit.

09:41, Two peers come out of dampening; reachability climbs to 73%.

09:49, Last peer un-dampens. Full reachability restored. Total impact window: 35 minutes (27 of withdrawal, 8 of partial recovery).
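For the curious, here is the failure mode as a minimal Python model rather than any vendor's config syntax. It assumes, as is common, that the same prefix-list object is referenced by both the import and export policy, which is how an "inbound" tightening ends up withdrawing outbound announcements:

```python
import ipaddress

# Hypothetical model of the prefix-list: the intent is "drop these few bogons".
BOGONS = [ipaddress.ip_network(p) for p in ("0.0.0.0/8", "10.0.0.0/8", "192.168.0.0/16")]

def policy_drops(prefix, negated=False):
    """Return True if the shared policy drops this prefix."""
    net = ipaddress.ip_network(prefix)
    in_list = any(net.subnet_of(b) for b in BOGONS)
    # The stray "!" behaves like a negation of the whole list.
    return (not in_list) if negated else in_list

# Intended behaviour: bogons dropped, customer prefixes untouched.
assert policy_drops("10.1.0.0/16") is True
assert policy_drops("203.0.113.0/24") is False

# With the stray negation, every customer prefix is dropped, i.e. withdrawn from peers.
assert policy_drops("203.0.113.0/24", negated=True) is True
```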

The detection lag

Four minutes from change to first alert sounds reasonable until you realise the change was a global config push. Anything that breaks the entire edge fleet should fire in under 30 seconds. The team’s synthetic monitors ran every 60 seconds and required two consecutive failures to alarm, so detection was gated at one to two minutes plus alert delay, on a problem affecting 1,400 customer-facing prefixes.

The deeper issue: there was no alert tied to BGP announcement count. The team monitored peer session state (still up), router CPU (fine), forwarding-plane drops (zero). Nothing watched the number of prefixes being announced. A change that withdrew 1,400 prefixes simultaneously didn’t trip a single network-layer alarm.
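A sketch of the missing guard, hedged because the collection path differs everywhere: compare the count of currently announced prefixes against a recent baseline and page on a sharp drop.

```python
# Minimal sketch of an announced-prefix-count guard. How the counts are
# collected (BMP, streaming telemetry, a CLI scrape) is left out, and the
# names here are illustrative rather than any real collector's API.

def announcement_count_healthy(current, baseline, max_drop_ratio=0.05):
    """True if the announced-prefix count is within tolerance of the baseline."""
    if baseline <= 0:
        return False  # no baseline yet: treat as unhealthy and investigate
    drop = (baseline - current) / baseline
    return drop <= max_drop_ratio

# During the incident: ~1,400 prefixes expected, 0 announced after the push.
if not announcement_count_healthy(current=0, baseline=1400):
    print("ALERT: announced prefix count fell more than 5% below baseline")
```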

The cascade

Two cascades, both nasty. First, the surviving regions saturated within 90 seconds. Anycast routing concentrated all global traffic onto the 11% of the network that was still announcing; CPU on those edges hit 94%; TLS handshakes started timing out; latency for customers who were technically reachable spiked to 4-6 seconds.
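The concentration factor is the whole story here: if only 11% of the edge keeps announcing, that slice absorbs roughly nine times its normal load (ignoring the traffic that simply fails). A back-of-the-envelope check:

```python
# Back-of-the-envelope load concentration on the surviving edges.
surviving_fraction = 0.11                      # share of the edge still announcing
load_multiplier = 1 / surviving_fraction       # ignoring traffic that simply fails
print(f"surviving edges absorb ~{load_multiplier:.1f}x their normal traffic")  # ~9.1x
```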

Second, and more insidious: peer-side route-flap dampening. When BGP prefixes are withdrawn and re-announced quickly, peer routers treat them as unstable, charge a flap penalty, and suppress them until that penalty decays below a reuse threshold. Three large peers had aggressive dampening tuned to roughly a 15-minute suppression for this flap pattern. Even after the team reverted the bad config at 09:31, those peers refused to re-accept the announcements until the penalty drained: 10 more minutes for the first two peers, 18 for the third. The fix was correct; the network just refused to believe it.
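To see why waiting was the only move, here is the dampening arithmetic in the RFC 2439 style, with representative numbers rather than any particular peer's configuration: each flap charges a penalty, the penalty decays with a half-life, the route is suppressed above one threshold and readvertised only once the penalty falls below a lower reuse threshold.

```python
import math

# Representative dampening parameters, not any particular vendor's defaults;
# every peer tunes these for themselves.
HALF_LIFE_MIN = 15.0        # penalty halves every 15 minutes
SUPPRESS_THRESHOLD = 2000   # suppress the route once penalty reaches this
REUSE_THRESHOLD = 1000      # readvertise once penalty decays below this
PENALTY_PER_FLAP = 1000     # charged for each flap the peer observes

def minutes_until_reuse(penalty):
    """How long a suppressed route stays hidden, seen from the peer's side."""
    if penalty <= REUSE_THRESHOLD:
        return 0.0
    # Solve penalty * 0.5 ** (t / HALF_LIFE_MIN) == REUSE_THRESHOLD for t.
    return HALF_LIFE_MIN * math.log2(penalty / REUSE_THRESHOLD)

# Assume the withdrawal and the later re-announcement each charged one flap.
penalty = 2 * PENALTY_PER_FLAP
print(f"suppressed: {penalty >= SUPPRESS_THRESHOLD}, "
      f"hidden for ~{minutes_until_reuse(penalty):.0f} more minutes")   # ~15
```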

You can’t fix peer-side dampening from your side. You can only avoid triggering it.

What the runbook said

The BGP runbook was excellent, for the wrong scenario. It assumed a peer or hardware failure: check session state, check interface, check route-server log, escalate to network engineering. None of those steps catch a self-inflicted policy bug. The on-call ran the entire runbook in the first six minutes and concluded “all upstream looks healthy”, which was true in the narrow sense and useless in the operational sense.

The runbook had no “recent change” step. The deploy log existed; nothing pointed the responder at it. That cost roughly 4 minutes of the incident.
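The fix on the runbook side is a “what changed recently?” step at the very top. A minimal sketch against a hypothetical deploy-log structure; in practice the entries come from whatever change-management system records the pushes:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical deploy-log entries; in practice these come from whatever
# change-management system records config pushes.
DEPLOY_LOG = [
    {"pushed_at": datetime(2026, 9, 17, 9, 14, tzinfo=timezone.utc),
     "scope": "global edge fleet",
     "summary": "tighten inbound bogon prefix-list"},
]

def recent_changes(now, window=timedelta(minutes=60)):
    """First runbook step: list every change pushed inside the window, newest first."""
    cutoff = now - window
    return sorted((d for d in DEPLOY_LOG if d["pushed_at"] >= cutoff),
                  key=lambda d: d["pushed_at"], reverse=True)

# At 09:18, when the synthetic-prober alert fires:
alert_time = datetime(2026, 9, 17, 9, 18, tzinfo=timezone.utc)
for change in recent_changes(alert_time):
    print(f'{change["pushed_at"]:%H:%M} {change["scope"]}: {change["summary"]}')
```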

What actually fixed it

Revert the offending config commit and wait. There was no shortcut around peer dampening. The team did what they could from their side: pushed the revert at 09:25, watched announcements come back at 09:31, then sat on their hands for 18 minutes while three peers’ suppress timers drained.

The team did call those three peers’ NOCs to ask for a manual dampening clear (clear ip bgp dampening) on the affected prefixes. Two NOCs answered within 4 minutes; one didn’t. The two that responded shortened their dampening from ~15 minutes to ~6. The one that didn’t answer became the long pole in the recovery.

Action items

Six action items came out of this one; five of them mattered. The biggest was architectural.

The architectural change

The big one: no global config push without a canary stage. The team had been doing global pushes for routing policy because “BGP changes are atomic anyway, you can’t partially apply them”. That’s technically true on a single router and operationally false across a fleet. You can absolutely apply a change to one router first and see whether the world breaks before the rest of the fleet gets it.

The new rule: any change that touches BGP, prefix-lists, route-maps, or peer policy goes to one canary edge first. The canary runs for 15 minutes with synthetic probes from three continents. If the probes pass, the change rolls out to a region. If the region passes, fleet-wide.
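In orchestration terms the rule looks roughly like the sketch below. push_config, run_synthetic_probes, and the router names are hypothetical stand-ins for the team’s actual tooling:

```python
import time

# Hypothetical hooks: push_config and run_synthetic_probes stand in for the
# team's actual deployment and probing tooling, and the router names are made up.
def push_config(targets):
    """Placeholder: apply the pending routing-policy change to these routers."""
    print(f"pushing change to {targets}")

def run_synthetic_probes(targets):
    """Placeholder: probe the targets from three continents; True means all healthy."""
    return True

STAGES = [
    ["edge-canary-01"],                       # stage 1: a single canary edge
    ["edge-ams1", "edge-fra1", "edge-lhr1"],  # stage 2: one region
    ["<rest of fleet>"],                      # stage 3: everything else
]
SOAK_SECONDS = 15 * 60                        # 15-minute soak per stage
PROBE_INTERVAL = 60                           # re-probe once a minute

def staged_rollout():
    for stage in STAGES:
        push_config(stage)
        deadline = time.monotonic() + SOAK_SECONDS
        while time.monotonic() < deadline:
            if not run_synthetic_probes(stage):
                print(f"probes failed on {stage}; halting rollout for rollback")
                return False
            time.sleep(PROBE_INTERVAL)
    return True
```

The loop itself is boring; the point is that the fleet stage is unreachable until the canary and the region have both soaked clean.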

The pushback in the design review was “this slows down emergency changes”. The response was: emergency changes are exactly when you need a canary. The 35-minute outage was a fleet-wide push without one.

The line that ended up on the wiki: “Global config pushes are global incidents waiting for the wrong character to appear in a config file.”