By Samson Tanimawo, PhD · Published Dec 14, 2026

The Slack Notification Storm: When Retry Logic Fights Retry Logic

Retry logic is rarely the bug. Retry logic interacting with other retry logic almost always is.

The setup

Service A sends a Slack message via service B’s notification API. Service B forwards to Slack. Both retry on failure. Both have backoff. Both have a max-retries cap.
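To make the retry topology concrete, here is a minimal Python sketch of the two layers. The function names, retry counts, backoff values, and the throttling stub are illustrative assumptions, not the real services' code or configuration.

```python
import random
import time


class SlackThrottled(Exception):
    """Stand-in for a 429 (rate-limited) response from Slack."""


def with_retries(call, max_retries=3, base_delay=1.0):
    """Retry `call` with exponential backoff, up to `max_retries` attempts."""
    for attempt in range(max_retries):
        try:
            return call()
        except SlackThrottled:
            if attempt == max_retries - 1:
                raise  # cap reached: surface the failure to the caller
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))


def post_to_slack(message):
    """Stub for the real Slack call; raises while Slack is throttling."""
    raise SlackThrottled()


# Service B: forwards to Slack and retries on throttle.
def service_b_send(message):
    return with_retries(lambda: post_to_slack(message))


# Service A: calls B's notification API and retries on its errors,
# which restarts B's entire retry loop on every single attempt.
def service_a_notify(message):
    return with_retries(lambda: service_b_send(message))
```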

The collision

Slack throttles temporarily for 30 seconds. Service B retries; the throttle continues; service B exhausts its retries and returns an error to service A. Service A retries, and service B starts the same loop from the top. Every attempt at the outer layer restarts the entire inner loop, so the two retry caps multiply rather than add.
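A back-of-the-envelope count shows why this compounds. The retry caps below are assumptions, since the incident does not state the real limits.

```python
# Assumed retry caps; the real values are not stated in the incident.
SERVICE_A_ATTEMPTS = 3   # attempts A makes against B's notification API
SERVICE_B_ATTEMPTS = 3   # attempts B makes against Slack per request from A

# Each of A's attempts drives a full inner loop in B, so the caps multiply.
slack_calls_per_message = SERVICE_A_ATTEMPTS * SERVICE_B_ATTEMPTS
print(slack_calls_per_message)  # 9 Slack calls for a single notification
```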

Why it lasts so long

The notification flood reaches users 4 minutes after the original message. To users, this looks like a 4-minute delay followed by a burst of duplicate notifications. Each retry that eventually gets through is a separate, non-idempotent Slack call, which is where the duplicates come from.

The next message follows the same retry-on-throttle path. The system does not stabilize; it oscillates. The 90-minute incident is exactly this oscillation, sustained.
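A toy calculation makes "does not stabilize" concrete. Every number here is an assumption (message rate, amplification factor, and Slack's per-window allowance are all guesses): the point is that the retry traffic alone keeps the budget exhausted, so each new message arrives at an already-throttled Slack.

```python
# All numbers below are assumptions for illustration only.
MESSAGES_PER_MINUTE = 2        # new user-visible notifications
CALLS_PER_MESSAGE = 9          # the 3 x 3 retry amplification from above
SLACK_BUDGET_PER_WINDOW = 5    # assumed allowance per 30-second window
WINDOWS_PER_MINUTE = 2         # two 30-second windows per minute

calls_per_window = MESSAGES_PER_MINUTE * CALLS_PER_MESSAGE / WINDOWS_PER_MINUTE

# 9 calls per window against a budget of 5: every window re-triggers the
# throttle, so the next message starts the same retry cascade again.
print(calls_per_window, calls_per_window > SLACK_BUDGET_PER_WINDOW)  # 9.0 True
```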

Three decoupling changes

Antipatterns

What to do this week

Three moves. (1) Map the full retry topology of one user-visible action. (2) Add idempotency keys to one downstream call. (3) Define a retry budget per service in your service catalog.
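For move (2), here is a minimal sketch of an idempotency key on the downstream call. The names, the in-memory store, and the stubbed Slack call are assumptions for illustration; a real service would need a shared, durable store for the seen keys.

```python
import uuid

# Illustrative only: an in-memory dedup store stands in for a shared,
# durable one (database, cache) that a real service would need.
_seen_keys: set[str] = set()


def post_to_slack(message: str) -> None:
    """Stub for the real downstream Slack call."""
    print(f"slack <- {message}")


def send_notification(message: str, idempotency_key: str | None = None) -> str:
    """Send a notification at most once per idempotency key.

    A retry that reuses the same key is acknowledged without producing
    another Slack message, so retries can no longer become duplicates.
    """
    key = idempotency_key or str(uuid.uuid4())
    if key in _seen_keys:
        return key  # duplicate retry: acknowledge, do not re-send
    _seen_keys.add(key)
    post_to_slack(message)
    return key


# The caller generates the key once and reuses it on every retry.
send_notification("deploy finished", idempotency_key="deploy-123")
send_notification("deploy finished", idempotency_key="deploy-123")  # deduped
```

The important design choice is that the key is minted by the original caller, not by the retry layer, so every retry of the same logical message carries the same key no matter which layer resends it.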