The Slack Notification Storm: When Retry Logic Fights Retry Logic
Retry logic is rarely the bug. Retry logic interacting with another layer's retry logic is almost always the bug.
The setup
Service A sends a Slack message via service B’s notification API. Service B forwards to Slack. Both retry on failure. Both have backoff. Both have a max-retries cap.
Slack throttles temporarily for 30 seconds. Service B retries; throttle continues; service B exhausts its retries and returns an error to service A. Service A retries; service B starts the same loop.
The collision
- Service B’s retries take 90 seconds total (3 retries with backoff). Service A’s retries take 4 minutes total (5 retries with backoff). Within those 4 minutes, service A makes six attempts (the original call plus five retries), and each attempt triggers a fresh service B retry chain.
- Net result: an original 30-second throttle generates up to six service B retry chains of 90 seconds each. By the time Slack’s throttle clears, the queue holds four minutes of pending notifications. They all fire at once when service B finally succeeds.
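The multiplication is easier to see as arithmetic. The following is a minimal sketch, assuming each layer makes its retry cap plus one original call and that every layer exhausts its retries; the helper name `worst_case_calls` is hypothetical.

```python
from math import prod

def worst_case_calls(retries_per_layer):
    """Worst-case calls reaching the final dependency when every layer
    exhausts its retries: the product of (retries + 1) across layers."""
    return prod(retries + 1 for retries in retries_per_layer)

# Figures from the scenario: service A has 5 retries, service B has 3.
a_attempts = 5 + 1   # original call plus five retries
b_attempts = 3 + 1   # original call plus three retries
slack_calls = worst_case_calls([5, 3])

print(a_attempts, b_attempts, slack_calls)  # 6 4 24
```

Under these assumptions, one user-visible message can produce up to 24 calls to Slack during a single throttle window.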
Why it lasts so long
The notification flood reaches users 4 minutes after the original message. To users, this looks like a 4-minute delay followed by a burst of duplicate notifications.
The next message follows the same retry-on-throttle path. The system does not stabilize; it oscillates. The 90-minute incident was exactly this oscillation, sustained.
Three decoupling changes
1. Idempotency keys at every layer, so that duplicate retries do not duplicate user-facing actions.
2. Retry budgets, not retry counts. Cap the total number of retries per service per minute.
3. Don’t retry through retrying systems. If service B retries internally, service A should not retry on service B errors.
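The first two changes can be sketched in a few lines. This is an illustration under stated assumptions, not a production design: the class names `RetryBudget` and `IdempotentSender` are hypothetical, state is kept in process memory (a real deployment would likely need a shared store with expiry), and the budget is a simple rolling window.

```python
import time

class RetryBudget:
    """Caps total retries per rolling window for a whole service,
    regardless of which request is asking."""

    def __init__(self, max_retries, window_seconds=60.0, clock=time.monotonic):
        self.max_retries = max_retries
        self.window_seconds = window_seconds
        self.clock = clock
        self._spent = []  # timestamps of retries granted inside the window

    def try_acquire(self):
        now = self.clock()
        cutoff = now - self.window_seconds
        # Forget retries that have aged out of the window.
        self._spent = [t for t in self._spent if t > cutoff]
        if len(self._spent) >= self.max_retries:
            return False  # budget exhausted: fail fast instead of retrying
        self._spent.append(now)
        return True


class IdempotentSender:
    """Delivers each idempotency key at most once, however often it is retried."""

    def __init__(self, deliver):
        self.deliver = deliver
        self._done = set()  # assumption: in-memory; keys never expire here

    def send(self, key, message):
        if key in self._done:
            return False  # duplicate retry: acknowledge, do not notify again
        self.deliver(message)
        self._done.add(key)
        return True


delivered = []
sender = IdempotentSender(delivered.append)
sender.send("msg-1", "deploy finished")
sender.send("msg-1", "deploy finished")  # retry of the same logical message
print(delivered)  # ['deploy finished']
```

A caller would check `try_acquire()` before each retry and give up when it returns `False`, which bounds retry traffic even when many requests fail at once.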
Antipatterns
- Retry chains nobody owns. Each layer adds ‘just one retry’; the attempt counts multiply across layers.
- No jitter. Without it, every caller backs off on the same schedule and retries at the same instant, which synchronizes the storm.
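One common way to add jitter is "full jitter" exponential backoff, where each delay is drawn uniformly between zero and an exponentially growing ceiling. The sketch below is illustrative only; the function name `full_jitter_delay` and the `base` and `cap` values are assumptions, not settings from the incident.

```python
import random

def full_jitter_delay(attempt, base=1.0, cap=30.0, rng=random):
    """Delay in seconds before retry number `attempt` (0-based):
    uniform in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)

# Seeded generator only so the example is repeatable.
rng = random.Random(7)
delays = [full_jitter_delay(n, rng=rng) for n in range(5)]
for n, delay in enumerate(delays):
    print(f"retry {n}: wait {delay:.2f}s")
```

Because each caller draws its own random delay, retries that started together spread out over the window instead of arriving at the throttled dependency in one wave.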
What to do this week
Three moves. (1) Map the full retry topology of one user-visible action. (2) Add idempotency keys to one downstream call. (3) Define a retry budget per service in your service catalog.