By Samson Tanimawo, PhD. Published Dec 18, 2025.

Stuck Pipeline Recovery

Pipelines hang. Recovery comes in three steps: detect the hang, kill the run, and debug the cause.

Detect

A stuck pipeline is the worst kind of failure: it does not error out, it does not page anyone, and the engineer who pushed the change just sits there refreshing the CI tab waiting for something that will never finish. The first move is to make stuck-ness a measurable, alertable signal instead of a vibe.

The rule that works in practice:

The detection layer is what turns stuck pipelines from a recurring source of lost engineering hours into a known signal you can route and act on.
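One way to make stuck-ness measurable is to compare how long a running stage has been going against a time budget. The sketch below is an illustration under stated assumptions, not a rule taken from any particular CI system: the `StageRun` record, the `typical_by_stage` lookup, the 3x multiplier, and the 10-minute default are all hypothetical, and the 30-minute hard cap is borrowed from the figure used later in this post.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Optional


@dataclass
class StageRun:
    """Hypothetical record of one pipeline stage execution."""
    pipeline_id: str
    stage: str
    started_at: datetime
    finished_at: Optional[datetime] = None  # None while the stage is still running


def is_stuck(run: StageRun,
             now: datetime,
             typical: timedelta,
             multiplier: float = 3.0,                       # assumed threshold
             hard_cap: timedelta = timedelta(minutes=30)) -> bool:
    """Return True when a still-running stage has exceeded its time budget.

    The budget is the smaller of (typical duration * multiplier) and hard_cap.
    """
    if run.finished_at is not None:
        return False  # finished stages are never stuck
    budget = min(typical * multiplier, hard_cap)
    return now - run.started_at > budget


def find_stuck(runs: Iterable[StageRun],
               now: datetime,
               typical_by_stage: Dict[str, timedelta],
               default_typical: timedelta = timedelta(minutes=10)) -> List[StageRun]:
    """Return the runs that should raise a 'stuck pipeline' alert."""
    return [
        run for run in runs
        if is_stuck(run, now, typical_by_stage.get(run.stage, default_typical))
    ]
```

Running `find_stuck` on a schedule and routing its result to an alert channel is what turns the hang into a signal. The multiplier and the cap are tuning knobs to calibrate against your own stage durations.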

Kill

Once a pipeline is identified as stuck, the right first action is almost always to kill it and retry. This sounds wasteful, but the math is brutal: a pipeline that has been stuck for 30 minutes has almost no chance of producing useful output, while a clean retry is very likely to succeed.

Cancel-and-retry is not a fix; it is a release valve. It buys time to investigate without making engineers wait.
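A minimal sketch of cancel-and-retry with a retry budget follows. It assumes the CI system exposes some way to cancel and re-run a pipeline; the `cancel`, `retry`, and `escalate` callables are hypothetical stand-ins for whatever your platform provides, and the one-retry default is an assumption rather than a recommendation from this post.

```python
from typing import Callable, Optional


def cancel_and_retry(pipeline_id: str,
                     retries_so_far: int,
                     cancel: Callable[[str], None],
                     retry: Callable[[str], str],
                     escalate: Callable[[str, int], None],
                     max_retries: int = 1) -> Optional[str]:
    """Kill a stuck pipeline and retry it, within a retry budget.

    cancel, retry, and escalate are hypothetical hooks supplied by the caller.
    Returns the new run id when a retry was started, or None when the budget
    is exhausted and the hang was escalated to a human instead.
    """
    # Always free the runner first; a stuck run is not going to finish.
    cancel(pipeline_id)

    if retries_so_far < max_retries:
        return retry(pipeline_id)

    # Repeated hangs mean retries are only hiding a real problem.
    escalate(pipeline_id, retries_so_far)
    return None
```

Capping the automatic retries keeps the release valve from hiding a recurring hang: after the budget is spent, a person sees the failure instead of another silent re-run.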

Debug

If a stuck pipeline happens once, it is noise. If it happens twice in the same week to the same stage, it is a systemic issue that retries will only paper over. Treat it like an incident: investigate the pattern, find the root cause, fix it.
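The "twice in the same week to the same stage" test can be sketched as a count of hangs per stage inside a rolling window. This is an illustration only: the `(stage, hung_at)` event shape and the function name are hypothetical, and the 7-day window and threshold of 2 simply restate the rule of thumb above.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Dict, Iterable, Tuple


def systemic_stages(hangs: Iterable[Tuple[str, datetime]],
                    now: datetime,
                    window: timedelta = timedelta(days=7),
                    threshold: int = 2) -> Dict[str, int]:
    """Return {stage: hang_count} for stages that hung at least `threshold`
    times within `window` before `now`.

    `hangs` is an iterable of (stage_name, hung_at) pairs; this shape is an
    assumption made for the example.
    """
    counts: Counter = Counter()
    for stage, hung_at in hangs:
        age = now - hung_at
        # Count only events inside the rolling window (and not in the future).
        if timedelta(0) <= age <= window:
            counts[stage] += 1
    return {stage: n for stage, n in counts.items() if n >= threshold}
```

Any stage this returns is a candidate for incident-style investigation rather than another retry. Grouping by an extra key such as runner pool is a natural extension when the hang follows the infrastructure rather than the stage.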

The hardest stuck pipelines to fix are the ones nobody owns because they recover on their own through retries. Nova AI Ops watches pipeline duration and stage timeouts as first-class signals, groups stuck runs by stage and runner pool to surface the pattern, and pages the on-call when the same stage has hung more than twice in a rolling window so the systemic fix gets made before the team adapts to the noise.