A cascading failure where every client retries simultaneously, amplifying load on a downstream that's already struggling.
A retry storm is the failure pattern where a transient downstream issue triggers every upstream client to retry at once, the synchronized retries amplify load far beyond the original traffic, and the downstream gets crushed by the very clients trying to reach it. The classic shape: downstream blips for 5s, every client retries at exactly t+1s, the downstream sees 10x normal load, can't recover, takes 30 minutes to come back.
Most cascading failures in distributed systems are retry storms. The fix is well-known and well-engineered: exponential backoff with jitter, circuit breakers that stop hammering a known-broken downstream, retry budgets that cap how much capacity a service can spend on retries. Building these in everywhere is unsexy work that prevents the worst outage shape in production.
See the part of the platform that handles retry storm in production.