Reliability Engineering

The change has not shipped yet,
and Nova has already imagined three ways it breaks

Pre-Mortem runs an adversarial critic against a planned change before deploy. Given the diff and the affected services, the critic enumerates the top three failure modes, ranks them by likelihood and severity, and proposes mitigations. The output is one paragraph per failure. Useful for changes that scare you (and changes that should but do not).


app.novaaiops.com / pre-mortem


Pre-mortem · payments@a3f291

1. connection pool exhaustion under retry storm · likely · severe

2. silent partial-write on tenant_id mismatch · unlikely · severe

3. cache stampede on cold hit · likely · medium

How It Works

Adversarial critic, scoped to your change

Pre-Mortem takes the diff (or the runbook), the affected services, the recent incident history, and the agent's own service knowledge. It then plays an adversarial role, asking "what could go wrong?" The output is the top 3 failure modes, each with a one-paragraph explanation, a likelihood class, a severity class, and a one-line mitigation suggestion.

✓ Diff + service context: reads the actual change, not just the title; understands what files moved
✓ Adversarial role: the agent is told to look for failures; it is not optimizing for "ship this"
✓ Concrete mitigations: each failure mode comes with a one-line "do this to reduce the risk"
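To make the output shape concrete, here is a minimal sketch of a failure-mode record and a ranking step. It is an illustration only, not Nova's implementation: the `FailureMode` class, the `rank` function, and the numeric weights are hypothetical. The ordering (severity first, then likelihood) is an assumption chosen because it reproduces the order shown in the sample panel.

```python
from dataclasses import dataclass

# Hypothetical ordinal weights; the page names the classes but not how they are scored.
SEVERITY = {"medium": 1, "severe": 2}
LIKELIHOOD = {"unlikely": 1, "likely": 2}


@dataclass(frozen=True)
class FailureMode:
    title: str
    likelihood: str  # "likely" or "unlikely"
    severity: str    # "severe" or "medium"
    mitigation: str  # one-line suggestion


def rank(modes, top_n=3):
    """Order by severity, then likelihood (both descending), and keep top_n."""
    return sorted(
        modes,
        key=lambda m: (SEVERITY[m.severity], LIKELIHOOD[m.likelihood]),
        reverse=True,
    )[:top_n]


# Candidates taken from the sample panels, deliberately out of order.
candidates = [
    FailureMode("cache stampede on cold hit", "likely", "medium",
                "warmer prefetch on deploy"),
    FailureMode("silent partial-write on tenant_id mismatch", "unlikely", "severe",
                "add tenant_id check"),
    FailureMode("connection pool exhaustion under retry storm", "likely", "severe",
                "enable PgBouncer txn-pooling for this path"),
]

ranked = rank(candidates)
for position, mode in enumerate(ranked, start=1):
    print(f"{position}. {mode.title} · {mode.likelihood} · {mode.severity}")
```

A different weighting (for example, likelihood first) would swap modes 2 and 3, so the sort key is the part most worth adjusting to your own risk appetite.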

app.novaaiops.com / pre-mortem · how

Inputs

diff: payments@a3f291 (24 files)

services: payments, cart (downstream)

recent incidents: last 30d, payments cluster

service knowledge: postgres-doctor + cache-warmer context

Pre-Deploy Gate

Optional, but recommended on tier-0

Configure Pre-Mortem as a CI gate: every PR that touches a tier-0 service runs it before merge. The author sees the failure modes in the PR comment and can either address them or note why they are accepting the risk. The gate is optional for lower-tier services.

✓ CI integration: GitHub, GitLab, Buildkite, Bitbucket; same plugin as Agent Fitness
✓ PR comment with the 3 modes: authors see the analysis directly in the PR; risks are explicit, not hidden
✓ Reviewer can require addressing: reviewers can require a response (addressed, accepted, or dismissed as a false positive) before approving
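The gate logic described above can be sketched roughly as follows. This is an assumption-laden illustration rather than the actual plugin: the function names, the use of integer tiers, and the idea that responses arrive as a simple title-to-status mapping are all hypothetical. It also assumes the reviewer has opted in to requiring a response for every mode.

```python
# Response labels follow the three decisions named on this page.
VALID_RESPONSES = {"addressed", "accepted", "dismissed"}


def gate_required(touched_tiers):
    """Assumption: the gate is enforced only when a tier-0 service is touched."""
    return 0 in touched_tiers


def unanswered(mode_titles, responses):
    """Failure modes that have no valid response from the author yet."""
    return [t for t in mode_titles if responses.get(t) not in VALID_RESPONSES]


def can_approve(touched_tiers, mode_titles, responses):
    """Allow approval if the gate does not apply or every mode has a response."""
    if not gate_required(touched_tiers):
        return True
    return not unanswered(mode_titles, responses)


titles = ["connection pool exhaustion", "silent partial-write", "cache stampede"]
responses = {
    "connection pool exhaustion": "addressed",
    "silent partial-write": "accepted",
}

print(can_approve({0, 1}, titles, responses))  # tier-0 touched, one mode unanswered
print(unanswered(titles, responses))
```

Keeping `unanswered` separate from `can_approve` lets the PR comment list exactly which modes still need a response instead of reporting a bare pass or fail.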

app.novaaiops.com / pre-mortem · gate

PR comment · sample

# Pre-mortem · top 3 failure modes
1. connection pool exhaustion (likely)
   mitigation: enable PgBouncer txn-pooling for this path
2. silent partial-write (unlikely, severe)
   mitigation: add tenant_id check
3. cache stampede (likely, medium)
   mitigation: warmer prefetch on deploy

Mitigation Tracking

Record what you addressed, what you did not

For each failure mode, the author marks it as addressed (with a link to the mitigation in the PR), accepted (with a written justification), or dismissed (false-positive, with a reason). The decisions are stored with the deploy and surfaced in postmortems if the deploy turns out to cause an incident.

✓ Three decisions per mode: addressed, accepted, or dismissed; each requires evidence (link, justification, or reason)
✓ Stored with deploy: the decisions ship with the deploy record; postmortem can pull them if needed
✓ Pattern detection: if you keep "dismissing" the same failure mode, the system flags it
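One way to picture the decision record and the repeated-dismissal flag is the sketch below. The `Decision` class, the `repeated_dismissals` helper, and the threshold of three are hypothetical; the page does not say how many dismissals trigger a flag or how failure modes are matched across deploys (exact label match is assumed here).

```python
from collections import Counter
from dataclasses import dataclass

STATUSES = {"addressed", "accepted", "dismissed"}


@dataclass(frozen=True)
class Decision:
    failure_mode: str
    status: str    # one of STATUSES
    evidence: str  # link, written justification, or dismissal reason

    def __post_init__(self):
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")
        if not self.evidence.strip():
            raise ValueError("every decision requires evidence")


def repeated_dismissals(history, threshold=3):
    """Failure modes dismissed at least `threshold` times (threshold is an assumption)."""
    counts = Counter(d.failure_mode for d in history if d.status == "dismissed")
    return sorted(mode for mode, n in counts.items() if n >= threshold)


history = [
    Decision("pool exhaustion", "addressed", "#1422 enables pgbouncer"),
    Decision("partial-write", "accepted", "rare path, alert covers it"),
    Decision("cache stampede", "dismissed", "cache is pre-warmed"),
    Decision("cache stampede", "dismissed", "cache is pre-warmed"),
    Decision("cache stampede", "dismissed", "low traffic window"),
]

print(repeated_dismissals(history))
```

Rejecting empty evidence at construction time means a decision without a link, justification, or reason can never reach the deploy record in the first place.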

app.novaaiops.com / pre-mortem · mitigation

Decisions · payments@a3f291

1. pool exhaustion: addressed (#1422 enables pgbouncer)

2. partial-write: accepted · "rare path, alert covers it"

3. cache stampede: addressed (#1423 prefetch)

Postmortem Use

When pre-mortem was right, the postmortem cites it

If a deploy turns out to cause an incident and the pre-mortem predicted the failure mode, the postmortem builder includes the prediction directly. "We were warned." This is not punitive; it is feedback for the gate. Patterns where pre-mortem predicts correctly raise the gate's influence; patterns where it consistently misses get prompt-tuned.

✓ Auto-cited when right: postmortem auto-pulls the matching pre-mortem prediction so the connection is explicit
✓ Calibration tracking: pre-mortem precision is tracked (how often did it predict the actual failure?)
✓ Tuning loop: low-precision pre-mortems trigger prompt review; high-precision ones get more weight in CI
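The calibration idea can be illustrated with a small calculation: of the deploys that caused an incident, what fraction had the actual failure among the pre-mortem's predictions? The sketch below is hypothetical throughout. The `IncidentRecord` shape, the exact-string matching of failure labels, and the 0.5 review floor are assumptions, not documented behaviour; inc-4821 comes from the sample panel, while the second record is invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentRecord:
    incident: str
    predicted: tuple  # failure-mode labels from the pre-mortem
    actual: str       # root-cause label from the postmortem


def hit_rate(records):
    """Share of incidents whose actual failure was among the predictions."""
    if not records:
        return None  # nothing to calibrate against yet
    hits = sum(1 for r in records if r.actual in r.predicted)
    return hits / len(records)


def needs_prompt_review(records, floor=0.5):
    """Flag for prompt review when the hit rate falls below `floor` (an assumed value)."""
    rate = hit_rate(records)
    return rate is not None and rate < floor


records = [
    IncidentRecord("inc-4821",
                   ("pool exhaustion", "partial-write", "cache stampede"),
                   "pool exhaustion"),
    # Invented example of a miss: the incident cause was not predicted.
    IncidentRecord("example-miss",
                   ("pool exhaustion", "cache stampede"),
                   "certificate expiry"),
]

print(hit_rate(records))
print(needs_prompt_review(records))
```

In practice, matching a postmortem root cause to a predicted failure mode is likely fuzzier than string equality, so the comparison inside `hit_rate` is the piece most in need of replacement.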

app.novaaiops.com / pre-mortem · cite

Cite · postmortem inc-4821

# root cause section
deploy payments@a3f291 caused connection pool exhaustion under retry storm.
Pre-mortem warned us · failure mode 1.
Mitigation was scheduled but not yet shipped at deploy time.
