Pipeline Fail-Fast Patterns
Fail fast, signal early.
Why fail fast
Fail-fast saves engineer time and CI compute. Cheap checks run first, expensive checks are gated on the cheaper ones passing, and the pipeline aborts the moment any stage fails so nothing downstream burns time on a known-bad commit.
- Thirty-minute pipeline failing at minute 28. Twenty-eight minutes of wasted CI compute and engineer wait time per failed run. Fail-fast moves that failure toward minute one whenever a cheap check can catch it.
- Cheap first, expensive last. Cost-ordered stages. The pipeline aborts as soon as a check fails, so heavy stages only run when the lightweight ones already passed.
- Reduces feedback latency, compute, and wait time together. All three move in the right direction with the same change. The savings compound across thousands of builds.
- Documented stage cost. Per-stage timing data published in the pipeline README. Optimisation targets are obvious rather than guessed.
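The cost argument above can be worked through as arithmetic. The following is a minimal sketch, assuming a strictly sequential pipeline; the stage names and durations are illustrative values picked from within the ranges this document mentions, and `minutes_until_failure` is a hypothetical helper, not part of any CI tool.

```python
from typing import List, Tuple

Stage = Tuple[str, float]  # (name, duration in minutes)


def minutes_until_failure(stages: List[Stage], failing: str) -> float:
    """Minutes a sequential pipeline spends before it reports `failing`."""
    elapsed = 0.0
    for name, minutes in stages:
        elapsed += minutes
        if name == failing:
            return elapsed
    raise ValueError(f"unknown stage: {failing}")


# Illustrative durations only (assumed, not measured).
EXPENSIVE_FIRST: List[Stage] = [
    ("integration", 20.0),
    ("unit", 8.0),
    ("typecheck", 2.0),
    ("lint", 0.5),
]
# Same stages, reordered so the cheapest runs first.
CHEAP_FIRST: List[Stage] = sorted(EXPENSIVE_FIRST, key=lambda s: s[1])

# A commit with a lint error, measured under both orderings.
expensive_first_cost = minutes_until_failure(EXPENSIVE_FIRST, "lint")
cheap_first_cost = minutes_until_failure(CHEAP_FIRST, "lint")
print(expensive_first_cost, cheap_first_cost)
```

With these assumed numbers, the same lint error costs 30.5 minutes when lint runs last and 0.5 minutes when it runs first.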
Stage ordering
The order is cost-ascending. Lint and format first, integration last, deploy gated on everything above. Exact times depend on the codebase, but the relative ordering holds for most pipelines.
- Lint and format: thirty seconds. The cheapest gate. Catches syntactic issues and style violations without running any code.
- Type check: one to three minutes. Static-analysis gate. Catches type-safety issues before any tests run, which often eliminates whole classes of test failures at the source.
- Unit tests: three to ten minutes. Cheap-to-run test layer. Catches per-module bugs while the runner is still warm and feedback is interactive.
- Integration and beyond. Ten to thirty minutes of integration tests, then security scans, image builds, and deploy. Runs only if everything above passed, so the most expensive stages never burn time on a commit that already failed a cheaper check.
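The gating described above can be sketched as a loop that stops at the first failing stage. This is an illustration under stated assumptions, not a real CI runner: each stage is modelled as a hypothetical callable returning True on pass, and `run_fail_fast` is a name invented for this example.

```python
from typing import Callable, List, Optional, Tuple

Check = Callable[[], bool]  # returns True when the stage passes


def run_fail_fast(stages: List[Tuple[str, Check]]) -> Tuple[Optional[str], List[str]]:
    """Run stages in order and stop at the first failure.

    Returns (name of the failed stage or None, names of stages that ran).
    """
    ran: List[str] = []
    for name, check in stages:
        ran.append(name)
        if not check():
            return name, ran  # abort: later stages never start
    return None, ran


# Stages listed in cost-ascending order; the lambdas simulate results.
pipeline: List[Tuple[str, Check]] = [
    ("lint", lambda: True),
    ("typecheck", lambda: False),  # simulated failure
    ("unit", lambda: True),
    ("integration", lambda: True),
]

failed, ran = run_fail_fast(pipeline)
print(failed, ran)
```

Here the unit and integration stages never start, because the type check failed first.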
Parallel where it helps
Parallelism trades clarity for speed. Use it where the speedup is real, cap it to avoid runner queueing chaos, and document the structure so on-call engineers can read the failure log without guessing.
- Lint, type check, and unit tests in parallel. The cheap stages run concurrently across modules. Total wall time drops to the slowest single stage rather than their sum.
- Do not sacrifice clarity. A fifty-job matrix nobody understands is worse than five sequential jobs everyone can read. Optimise for the on-call engineer, not the benchmark.
- Cap parallelism. CI runners are not infinite. Over-parallelising hits queueing limits, and a single noisy team can starve everyone else's pipelines.
- Documented parallel groups. Named concurrency policy in the pipeline config. Operators understand which jobs share runners and which do not.
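A capped parallel group can be sketched with Python's standard `concurrent.futures` module. This is a minimal sketch, assuming each job is a hypothetical callable returning True on pass; `run_parallel_group` and the job names are invented for illustration, and a real CI system would express the cap in its own configuration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Optional


def run_parallel_group(
    jobs: Dict[str, Callable[[], bool]], max_workers: int = 2
) -> Optional[str]:
    """Run jobs concurrently under a worker cap.

    Returns the name of the first job seen to fail, or None if all passed.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fn): name for name, fn in jobs.items()}
        for done in as_completed(futures):
            if not done.result():
                # Cancel jobs still waiting in the queue. Jobs that are
                # already running cannot be cancelled and will finish.
                for other in futures:
                    other.cancel()
                return futures[done]
    return None


# Simulated cheap stages; only the type check fails.
cheap_group = {
    "lint": lambda: True,
    "typecheck": lambda: False,
    "unit": lambda: True,
}
first_failure = run_parallel_group(cheap_group, max_workers=2)
print(first_failure)
```

The `max_workers` value plays the role of the parallelism cap: the third job waits in the queue instead of claiming another runner.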
Fast feedback to author
The PR author gets immediate failure notice: a Slack ping, a PR annotation, and a deep link to the failing log line. The gap between "test failed" and "engineer is reading the log" should be a single click.
- Notify on failure immediately. PR-author ping fires the moment any stage fails. Do not wait for the rest of the pipeline to finish before sending it.
- Slack with stage name and log link. The named stage plus a deep link to the failing line. The author goes straight to the broken assertion without scrolling.
- Annotate the PR. GitHub Checks or GitLab merge-request comment. Visible in the review surface for anyone looking at the PR later.
- Suggested fix link. Linked runbook or doc per known failure mode. Resolution accelerates because common failures have written remedies.
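The notification content described above can be sketched as a plain message builder. This sketch only composes text and sends nothing: the `StageFailure` type, the `RUNBOOKS` index, and the example.com URLs are all hypothetical, and delivery to Slack or a PR annotation is left to whatever integration the team already uses.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class StageFailure:
    stage: str    # name of the stage that failed
    log_url: str  # deep link to the failing log line
    author: str   # PR author to notify


# Hypothetical runbook index, keyed by stage name.
RUNBOOKS: Dict[str, str] = {
    "lint": "https://example.com/runbooks/lint",
}


def failure_message(failure: StageFailure, runbooks: Dict[str, str]) -> str:
    """Build a failure notice: stage name, log link, optional runbook link."""
    lines = [
        f"@{failure.author}: stage '{failure.stage}' failed",
        f"Log: {failure.log_url}",
    ]
    runbook = runbooks.get(failure.stage)
    if runbook:
        lines.append(f"Suggested fix: {runbook}")
    return "\n".join(lines)


message = failure_message(
    StageFailure("lint", "https://example.com/ci/run/1#L42", "alice"),
    RUNBOOKS,
)
print(message)
```

Stages without a runbook entry simply omit the last line, so the message stays short for unknown failure modes.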
How to roll this out
Roll out by audit, target, metric. Measure the current pipeline, set the goal, and track time-to-failure as the headline number that captures whether the discipline is working.
- Audit current pipeline. List every stage that has ever failed late. The candidates for moving earlier are obvious from the data.
- Set a target. Ninety percent of failures within the first five minutes is a concrete, measurable bar. Below that, the team has objective evidence that the discipline is missing.
- Track time-to-failure. Pipeline metric exported to the team's dashboard. Improvements compound across thousands of builds and become visible in the trend.
- Named pipeline owner. One engineer responsible for the pipeline's health. Prevents the "everyone-and-no-one" maintenance pattern that kills CI hygiene.
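The ninety-percent target can be checked from time-to-failure data with a few lines. This is a minimal sketch, assuming the team can export one number per failed run (minutes from pipeline start to the first failure); the sample values are made up for illustration and `share_failing_within` is a hypothetical helper.

```python
from typing import Sequence


def share_failing_within(
    failure_minutes: Sequence[float], threshold: float = 5.0
) -> float:
    """Fraction of failed runs whose failure surfaced within `threshold` minutes."""
    if not failure_minutes:
        raise ValueError("no failed runs to measure")
    early = sum(1 for minutes in failure_minutes if minutes <= threshold)
    return early / len(failure_minutes)


# Made-up time-to-failure samples, in minutes, one per failed run.
samples = [0.4, 1.5, 2.0, 3.2, 4.8, 0.6, 1.1, 2.7, 28.0, 12.0]

share = share_failing_within(samples)
meets_target = share >= 0.90  # the target stated above
print(share, meets_target)
```

In this sample, eight of ten failures land within five minutes, so the share is 0.8 and the target is missed; the two late failures are the candidates to move earlier.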