Postmortems Beginner By Samson Tanimawo, PhD Published Dec 14, 2026 8 min read

The Slack Notification Storm: When Retry Logic Fights Retry Logic

Retry logic is rarely the bug. Retry logic interacting with another layer’s retry logic is almost always the bug.

The setup

Service A sends a Slack message via service B’s notification API. Service B forwards to Slack. Both retry on failure. Both have backoff. Both have a max-retries cap.

Slack throttles temporarily for 30 seconds. Service B retries; throttle continues; service B exhausts its retries and returns an error to service A. Service A retries; service B starts the same loop.

The collision

Service B’s retries take 90 seconds total (3 retries with backoff). Service A’s retries take 4 minutes total (5 retries with backoff). Within those 4 minutes, service A makes six attempts (the original call plus five retries), and each attempt triggers a fresh service B retry chain.

Net result: an original 30-second throttle generates up to 6 × 90 seconds of retry traffic. By the time Slack’s throttle clears, the queue has built up four minutes of pending notifications. They all fire at once when service B succeeds.
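The multiplication is easier to see as arithmetic. The sketch below counts worst-case calls to Slack for a single notification. It assumes "3 retries" and "5 retries" mean retries on top of one initial attempt; the function name `total_downstream_calls` is illustrative and does not come from either service.

```python
def total_downstream_calls(retries_per_layer):
    """Worst-case calls to the final dependency when every layer exhausts its retries.

    Each layer makes (1 + retries) attempts, and every attempt at an outer
    layer triggers a full retry chain at the layer beneath it.
    """
    total = 1
    for retries in retries_per_layer:
        total *= 1 + retries
    return total


# Figures from the incident: service A retries 5 times, service B retries 3 times.
# Assumption: each figure excludes the initial attempt.
service_a_retries = 5
service_b_retries = 3

slack_calls = total_downstream_calls([service_a_retries, service_b_retries])
print(slack_calls)  # 24 calls to Slack for one notification
```

Under those assumptions, one user-visible message can turn into 24 requests against an API that is already throttling.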

Why it lasts so long

The notification flood reaches users 4 minutes after the original message. To users, this looks like a 4-minute delay followed by a burst of duplicate notifications.

The next message follows the same retry-on-throttle path. The system does not stabilize; it oscillates. The 90-minute incident was exactly this oscillation, sustained.

Three decoupling changes

1. Idempotency keys at every layer, so that duplicate retries do not duplicate user-facing actions.
2. Retry budgets, not retry counts. Cap the total number of retries each service may make per minute, rather than the number of retries per request.
3. Don’t retry through retrying systems. If service B retries internally, service A should not retry on service B errors.
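A retry budget, the second change above, can be sketched as a rolling-window counter shared by every request in a service. This is a minimal in-process illustration, not the implementation used in the incident: `RetryBudget`, `call_with_budget`, and their parameters are hypothetical names, and a real deployment would likely need thread safety and a backoff delay between attempts.

```python
import time
from collections import deque


class RetryBudget:
    """Allow at most `max_retries` retries in any rolling `window_seconds` window."""

    def __init__(self, max_retries, window_seconds=60.0, clock=time.monotonic):
        self.max_retries = max_retries
        self.window_seconds = window_seconds
        self._clock = clock
        self._spent = deque()  # timestamps of retries already granted

    def try_spend(self):
        """Return True and record a retry if budget remains, else return False."""
        now = self._clock()
        # Drop retries that have aged out of the window.
        while self._spent and now - self._spent[0] >= self.window_seconds:
            self._spent.popleft()
        if len(self._spent) >= self.max_retries:
            return False
        self._spent.append(now)
        return True


def call_with_budget(send, budget, max_attempts=3):
    """Call `send()`; retry on exception only while the shared budget allows it."""
    for attempt in range(max_attempts):
        try:
            return send()
        except Exception:
            last_attempt = attempt == max_attempts - 1
            # Give up when attempts run out or the service-wide budget is spent.
            if last_attempt or not budget.try_spend():
                raise
```

Because the budget is shared across requests, a throttled dependency sees a bounded number of retries per minute no matter how many callers are failing at once.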

Antipatterns

Retry chains nobody owns. Each layer adds ‘just one retry’; the product is multiplicative.
No jitter. Without randomized delays, every client retries on the same schedule, which synchronizes the storm.
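One common way to add jitter is "full jitter": pick each delay uniformly at random between zero and the exponential backoff ceiling. The sketch below assumes a 1-second base and a 30-second cap purely for illustration; the function name and defaults are not taken from either service.

```python
import random


def backoff_with_full_jitter(attempt, base_seconds=1.0, cap_seconds=30.0, rng=random):
    """Return a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


# Delays for the first five retry attempts; values differ on every run.
delays = [backoff_with_full_jitter(attempt) for attempt in range(5)]
```

Because each caller draws its own delay, retries that started together spread out over the window instead of landing on Slack at the same instant.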

What to do this week

Three moves. (1) Map the full retry topology of one user-visible action. (2) Add idempotency keys to one downstream call. (3) Define a retry budget per service in your service catalog.
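For move (2), a minimal sketch of an idempotency key on one downstream call might look like the following. Everything here is hypothetical: `NotificationSender` stands in for whichever service receives the call, the in-memory set would need to be a persistent store with a TTL in practice, and the class is not thread-safe.

```python
import uuid


class NotificationSender:
    """Hypothetical downstream sender that drops repeats of a key it has already delivered."""

    def __init__(self, deliver):
        self._deliver = deliver        # callable that actually posts the message
        self._delivered_keys = set()   # in-memory only; assumption for illustration

    def send(self, idempotency_key, message):
        if idempotency_key in self._delivered_keys:
            return "duplicate"
        self._deliver(message)
        # Record the key only after delivery succeeds, so a failed attempt can be retried.
        self._delivered_keys.add(idempotency_key)
        return "delivered"


def new_idempotency_key():
    """Generate the key once per logical notification and reuse it on every retry."""
    return str(uuid.uuid4())


outbox = []
sender = NotificationSender(outbox.append)
key = new_idempotency_key()
first = sender.send(key, "deploy finished")
second = sender.send(key, "deploy finished")  # a retry of the same notification
print(first, second, len(outbox))  # delivered duplicate 1
```

The caller generates the key once, before the first attempt, so every retry at every layer carries the same key and the user sees one message.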