Stuck Pipeline Recovery
Pipelines hang without failing. Detect the hang, kill the run, debug the pattern.
Detect
A stuck pipeline is the worst kind of failure: it does not error out, it does not page anyone, and the engineer who pushed the change just sits there refreshing the CI tab waiting for something that will never finish. The first move is to make stuck-ness a measurable, alertable signal instead of a vibe.
The rules that work in practice (combined into a sketch after the list):
- Static multiplier on the rolling median: If the median pipeline duration is 10 minutes, anything over 30 minutes (3x median) is stuck, not slow. The 3x multiplier is the sweet spot: lower and you false-alert on legitimately long-tail runs; higher and you let real hangs sit for an hour.
- Per-stage timeouts: Each stage of the pipeline (lint, unit, integration, build, deploy) gets its own max-duration cap, set at roughly 2x its 99th-percentile observed runtime. A 4-minute integration stage that crosses 12 minutes is stuck. Killing it at the stage level is faster than waiting for the global pipeline timeout.
- Heartbeat from the runner: If the runner stops emitting log lines for 5 minutes mid-stage, that is stuck even if the wall clock has not exceeded the cap. Real work produces output. Silence means the process is wedged on something (network, lock, deadlock) that will not resolve on its own.
- Page on stuck, not on slow: Slow pipelines are a backlog problem; stuck pipelines are a now problem. Wire the alert so that a stuck pipeline opens an incident, not just a Slack ping that gets buried by lunch.
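A minimal sketch of how the three timing rules can compose into a single check. The names and thresholds here are illustrative, not taken from any particular CI system:

```python
import time
from dataclasses import dataclass
from statistics import median

# Illustrative thresholds mirroring the rules above (tune per pipeline).
MEDIAN_MULTIPLIER = 3          # > 3x rolling median => stuck, not slow
STAGE_P99_MULTIPLIER = 2       # per-stage cap = 2x observed p99 runtime
HEARTBEAT_SILENCE_SECS = 300   # 5 minutes without a log line => wedged

@dataclass
class StageState:
    started_at: float   # epoch seconds when the stage began
    last_log_at: float  # epoch seconds of the last log line emitted
    p99_runtime: float  # observed 99th-percentile runtime for this stage, seconds

def stuck_reason(pipeline_elapsed: float,
                 recent_durations: list[float],
                 stage: StageState,
                 now: float | None = None) -> str | None:
    """Return a reason string if the run should be treated as stuck, else None."""
    now = now if now is not None else time.time()
    if recent_durations and pipeline_elapsed > MEDIAN_MULTIPLIER * median(recent_durations):
        return "pipeline exceeded 3x rolling median duration"
    if now - stage.started_at > STAGE_P99_MULTIPLIER * stage.p99_runtime:
        return "stage exceeded 2x its p99 runtime"
    if now - stage.last_log_at > HEARTBEAT_SILENCE_SECS:
        return "runner silent for 5+ minutes mid-stage"
    return None
```

Whichever rule fires first wins; any non-None reason should open the incident, and the reason string tells the on-call which symptom to chase.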
The detection layer is what turns stuck pipelines from a recurring source of lost engineering hours into a known signal you can route and act on.
Kill
Once a pipeline is identified as stuck, the right first action is almost always to kill it and retry. This sounds wasteful, but the math is brutal: a stuck pipeline at 30 minutes has zero probability of producing useful output, and a clean retry is statistically very likely to succeed.
- Cancel-and-retry as the default response: Cancel the stuck run, spawn a fresh runner, restart from the merge commit. Most stuck pipelines are transient (DNS hiccup, runner host pressure, a one-time network blip) and will pass on the second try.
- Cap the retries: Two retries with backoff is the right ceiling. After three stuck attempts, do not keep trying; there is a real bug and another retry is just burning runner-hours and engineer attention. Escalate to debug.
- Auto-retry, not manual: The cancel-and-retry should fire from the alert, not from a human clicking buttons (see the sketch after this list). Engineers should find a healthy passing pipeline by the time they look, with a note saying "auto-retried after 32-min stuck." That is the difference between detecting the problem and solving it.
- Preserve the failed run's logs: Even when the retry passes, keep the stuck run's logs and runner state for at least 7 days. The pattern only emerges across multiple stuck runs, and you cannot find the pattern if you delete the evidence on success.
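A sketch of the auto-retry flow under those caps. Every helper here (cancel_run, start_run, run_passed, and the rest) is a hypothetical stand-in for your CI system's API, not a real client:

```python
import logging
import time

log = logging.getLogger("pipeline-recovery")

BACKOFF_SECS = (60, 300)  # two retries with increasing backoff, then escalate

# Hypothetical CI API stubs -- replace with your system's real calls.
def cancel_run(run_id: str) -> None: ...
def archive_logs(run_id: str, retention_days: int) -> None: ...
def start_run(merge_commit: str) -> str: ...
def run_passed(run_id: str) -> bool: ...
def annotate_run(run_id: str, note: str) -> None: ...
def open_incident(run_id: str, severity: str) -> None: ...

def recover_stuck_run(run_id: str, merge_commit: str, stuck_minutes: int) -> bool:
    """Cancel a stuck run, retry from the merge commit, escalate after the cap."""
    cancel_run(run_id)
    archive_logs(run_id, retention_days=7)  # keep evidence even if the retry passes
    for attempt, backoff in enumerate(BACKOFF_SECS, start=1):
        time.sleep(backoff)  # back off before each fresh attempt
        new_id = start_run(merge_commit)
        if run_passed(new_id):
            annotate_run(new_id, f"auto-retried after {stuck_minutes}-min stuck run {run_id}")
            return True
        cancel_run(new_id)
        archive_logs(new_id, retention_days=7)
        log.warning("retry %d for stuck run %s also hung", attempt, run_id)
    open_incident(run_id, severity="P2")  # three stuck attempts: hand off to debug
    return False
```

Note that the logs are archived on every attempt, passing or not, so the evidence survives for the pattern analysis below.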
Cancel-and-retry is not a fix; it is a release valve. It buys time to investigate without making engineers wait.
Debug
If a stuck pipeline happens once, it is noise. If it happens twice in the same week to the same stage, it is a systemic issue that retries will only paper over. Treat it like an incident: investigate the pattern, find the root cause, fix it.
- Group stuck runs by stage and runner: Most patterns surface immediately. "Every stuck run is on runner pool X" points at the runner pool. "Every stuck run is in the migration stage" points at the database. "Every stuck run is in the e2e tests after lunch" points at a downstream service that is overloaded during a known time window. (A sketch of this grouping follows the list.)
- Profile a hung process: When you catch one stuck in real time, take a thread dump or pstack snapshot of the running process before killing it. The stack trace usually points directly at the lock, the network call, or the loop that wedged. This is the fastest path from symptom to root cause.
- Check resource pressure on runners: Memory, disk, file descriptors, network sockets. Stuck pipelines are often the first symptom of a runner host quietly running out of something. The host metrics around the time of the hang tell that story.
- Treat repeated hangs as a P2 incident: Open a ticket, assign an owner, set a deadline. The fix is usually a timeout that was missing, a connection pool that was not sized, a flaky test that depends on a real network call, or a runner image that needs an upgrade. None of these get fixed by themselves.
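A sketch of the grouping and escalation logic, assuming stuck-run events arrive tagged with a stage name and runner pool; the one-week window and more-than-twice threshold mirror the rules in this section:

```python
import time
from collections import Counter, deque

ROLLING_WINDOW_SECS = 7 * 24 * 3600  # one week, per the "twice in the same week" rule
ESCALATION_COUNT = 2                 # more than twice in the window => open a P2

class HangTracker:
    """Group stuck-run events by (stage, runner pool) and flag repeat offenders."""

    def __init__(self) -> None:
        self._events: deque[tuple[float, str, str]] = deque()

    def record(self, stage: str, runner_pool: str, now: float | None = None) -> bool:
        """Record one stuck run; return True when this (stage, pool) should escalate."""
        now = now if now is not None else time.time()
        self._events.append((now, stage, runner_pool))
        while self._events and self._events[0][0] < now - ROLLING_WINDOW_SECS:
            self._events.popleft()  # drop events outside the rolling window
        counts = Counter((s, p) for _, s, p in self._events)
        return counts[(stage, runner_pool)] > ESCALATION_COUNT
```

On the third hang of the same (stage, pool) pair inside a week, record returns True: that is the signal to open the P2 instead of retrying again.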
The hardest stuck pipelines to fix are the ones nobody owns because they recover on their own through retries. Nova AI Ops watches pipeline duration and stage timeouts as first-class signals, groups stuck runs by stage and runner pool to surface the pattern, and pages the on-call when the same stage has hung more than twice in a rolling window so the systemic fix gets made before the team adapts to the noise.