Deploy Postmortem When It Fails
Failed deploy → postmortem.
Trigger
Most engineering organizations have a postmortem process for incidents. Fewer have one for failed deploys. The two are related but different: an incident postmortem covers the full operational story; a deploy postmortem focuses specifically on what got past the deploy gates and why. Adding deploy postmortems to that practice is one of the highest-leverage CI/CD improvements a team can make.
What should trigger a deploy postmortem:
- Deploy failed and was rolled back. Any deploy that triggered automatic rollback or that the on-call manually rolled back gets a postmortem. The fact that the rollback worked is good; the fact that the rollback was needed is the lesson. The postmortem captures both.
- Or caused customer impact. Any deploy whose regression reached customers gets a postmortem, even if the rollback fired in time and the impact was small. Customer-impacting deploy failures are the highest-priority category because they are evidence that the safety net is too loose.
- Or required out-of-band intervention. A deploy that succeeded but required emergency configuration changes, manual database tweaks, or last-minute approvals also warrants a postmortem. Those interventions point to gaps in the standard deploy process that the team should close.
- Trigger is automatic. When the deploy pipeline detects a failure or rollback, it opens the postmortem ticket automatically. The ticket carries the deploy ID, the artifact, the time of rollback, and the contributing metrics; a sketch of such a hook follows below. The deploy team picks up the ticket; the postmortem starts.
- Cadence: every failure, no exceptions. "Small" failures get postmortems too. The discipline is to learn from every failure, not just the embarrassing ones. A team that postmortems only the customer-visible incidents misses the lessons from the ones that almost became visible.
The trigger is what makes the practice routine. Without an automatic trigger, deploy postmortems happen only when someone remembers to ask for one, which is rare.
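As a minimal sketch of what that automatic trigger might look like, the hook below assumes the deploy pipeline can call a small script on rollback and that the ticket tracker exposes a plain REST endpoint. The URL, payload fields, and `RollbackEvent` shape are illustrative assumptions, not any particular product's API.

```python
# Hypothetical rollback hook: the deploy pipeline calls this when a rollback fires.
# The tracker endpoint and payload fields are illustrative assumptions.
import json
import urllib.request
from dataclasses import dataclass

@dataclass
class RollbackEvent:
    deploy_id: str                   # identifier of the failed deploy
    artifact: str                    # artifact (image tag, build number) that was rolled back
    rolled_back_at: str              # ISO-8601 timestamp of the rollback
    contributing_metrics: list[str]  # metrics that breached their gates

TICKET_API = "https://tracker.example.com/api/tickets"  # hypothetical tracker endpoint

def open_deploy_postmortem(event: RollbackEvent) -> None:
    """Open a postmortem ticket pre-filled with the rollback context."""
    payload = {
        "title": f"Deploy postmortem: {event.deploy_id} rolled back",
        "labels": ["deploy-postmortem"],
        "body": (
            f"Artifact: {event.artifact}\n"
            f"Rolled back at: {event.rolled_back_at}\n"
            f"Contributing metrics: {', '.join(event.contributing_metrics)}\n\n"
            "Template: what broke / why CI passed / why canary passed / why soak passed."
        ),
    }
    req = urllib.request.Request(
        TICKET_API,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)  # fire-and-forget here; real code would handle errors and retries
```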
Focus
The deploy postmortem has a specific focus that incident postmortems often miss: the gap between the deploy gates that should have caught the regression and the regression itself. Why did CI pass? Why did canary not catch it? Why did the deploy make it through the soak window cleanly? Answering each "why" closes a specific gap in the safety net.
- What broke. The technical regression that caused the failure. Specific code change, specific input, specific dependency interaction. The detail level is high; the goal is understanding, not blame.
- Why CI did not catch it. The most important question. A test coverage gap? A test that exists but was flaky and got retried to green? A test that runs against synthetic data that did not include the failure shape? The CI gap is the lesson that prevents the next similar failure.
- Why canary did not catch it. If the team uses canary, why did the metric gates pass during canary? Was the threshold too loose? Was the soak window too short? Was the metric not measuring the right thing? Each answer points to a tuning improvement; a sketch of such a gate check follows below.
- Why the soak window did not catch it. If the deploy soaked for some period before promotion, why did the regression not appear during soak? Was the soak too short? Was the traffic during soak not representative? Was the regression intermittent in a way that needed longer observation?
- Closes the gap. Each "why" identifies a specific gap, and the action items address each gap with a specific fix. The deploy postmortem is mechanical: identify the gap, document it, schedule the fix.
The focus on gates makes deploy postmortems different from incident postmortems. The incident postmortem asks "what happened?"; the deploy postmortem asks "why did our deploy safety net fail?"
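To make those questions concrete, here is a minimal sketch of the kind of canary gate the postmortem interrogates, assuming ratio-to-baseline thresholds on a couple of metrics. The metric names, thresholds, and soak length are illustrative; a postmortem action item typically tightens exactly these values.

```python
# Illustrative canary gate: compare canary metrics against baseline with fixed thresholds.
# Metric names, thresholds, and the soak duration are assumptions for the example.
CANARY_GATES = {
    "error_rate":  {"max_ratio_vs_baseline": 1.10},  # canary may be at most 10% worse
    "p99_latency": {"max_ratio_vs_baseline": 1.20},
}
SOAK_MINUTES = 30  # a postmortem might conclude this window was too short

def canary_passes(canary: dict[str, float], baseline: dict[str, float]) -> bool:
    """Return True if every gated metric stays within its allowed ratio to baseline."""
    for metric, gate in CANARY_GATES.items():
        if baseline[metric] == 0:
            continue  # avoid division by zero; a loose choice a postmortem might revisit
        ratio = canary[metric] / baseline[metric]
        if ratio > gate["max_ratio_vs_baseline"]:
            return False
    return True
```

Each "why did canary pass" answer maps to one of these knobs: a threshold that was too loose, a metric missing from the gate entirely, or a soak window too short to observe an intermittent regression.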
Action items
The output of every deploy postmortem is action items. Concrete, owned, time-bound. Without action items, the postmortem is a debrief; with them, it is a real safety improvement.
- Test added or improved. The most common action item: add the test that would have caught this. Sometimes it is unit, sometimes integration, sometimes end-to-end. The new test is committed alongside the action item; the test runs on every future PR; the regression cannot recur silently. A sketch of such a test follows below.
- Process improved. Sometimes the gap is procedural rather than test-shaped. Tighter canary metric thresholds. Longer soak windows for high-risk changes. Required approval for a class of changes that previously did not need it. The process change is documented and rolled out.
- Tooling added. Sometimes the gap is in the deploy tooling itself. Automated rollback that did not fire because the threshold was wrong. A burn-rate alert that took too long to trigger. A configuration drift detector that missed the change. The tooling fix lands on the platform team's roadmap; a burn-rate sketch follows below.
- Documentation updated. The runbook, the deploy gate documentation, the canary configuration, the on-call playbook. Each of these may have a gap that contributed to the failure. Update them while the lesson is fresh.
- Compound effect over time. Each deploy postmortem closes one or two gaps. After a year of disciplined practice, the team has closed dozens. The deploy safety net has been tightened from many directions; the rate of customer-impacting deploy failures drops correspondingly. The compounding is the real value.
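For the test action item, the fix usually lands as a small, targeted test committed with the postmortem. A hedged sketch, assuming a pytest-style suite and a hypothetical `parse_order` function whose handling of a blank input line was the regression:

```python
# Hypothetical regression tests added as a postmortem action item.
# `parse_order`, the `orders` module, and the failure shapes are illustrative assumptions.
import pytest
from orders import parse_order  # hypothetical module under test

def test_parse_order_handles_blank_line():
    """Regression test: blank input lines crashed the parser in production."""
    assert parse_order("") is None  # previously raised an unhandled exception

def test_parse_order_rejects_unknown_currency():
    """Second failure shape documented in the same postmortem."""
    with pytest.raises(ValueError):
        parse_order("42,XYZ")
```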
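For the tooling action item, the fix is often just a threshold or window change. Below is a minimal sketch of a multi-window burn-rate check of the kind that might have fired too slowly; the SLO target, window sizes, and 14.4x factor are assumptions borrowed from common SRE practice, not values from any specific system.

```python
# Illustrative burn-rate check: page when the error budget is being consumed too fast
# over both a long and a short window. All numbers are assumptions for the example.
SLO_TARGET = 0.999             # 99.9% success objective
ERROR_BUDGET = 1 - SLO_TARGET  # 0.1% allowed error rate

def burn_rate(error_rate: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    return error_rate / ERROR_BUDGET

def should_page(error_rate_1h: float, error_rate_5m: float) -> bool:
    # Require both windows to burn fast, so a brief blip does not page but a
    # sustained regression does. A 14.4x burn rate spends about 2% of a
    # 30-day budget in one hour, a commonly used paging threshold.
    return burn_rate(error_rate_1h) > 14.4 and burn_rate(error_rate_5m) > 14.4
```

A postmortem that found the alert "took too long to trigger" would typically shorten the windows or lower the factor, and record the new values in the same action item.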
Deploy postmortems are the practice that turns deploy failures from operational pain into systematic improvements. Nova AI Ops auto-opens postmortem tickets on rollback events, links them to the deploy log and the contributing metrics, and tracks the action items so the lessons from each failure produce a permanent improvement to the safety net.