Multi-Env CD
Continuous Deployment across environments.
Auto-promote
Continuous Deployment is not just CI plus a deploy step at the end. It is a pipeline that moves a single immutable artifact through every environment automatically, with each environment acting as both a verification stage and an approval gate. The key word is automatic. If a human has to manually trigger the staging deploy after merge, or hit a button to promote to prod, you do not have CD. You have CI plus a button.
The promotion model that holds up at scale:
- Dev to staging on merge: Every PR that lands on the main branch builds an artifact and deploys it to staging within minutes. No queue, no batch, no waiting for the next planned release. The merge IS the deploy. This forces every change to be small enough to ship on its own and removes the temptation to bundle.
- Staging to prod with gates: Once an artifact has soaked in staging long enough to clear all gates (typically 15 to 60 minutes for read-heavy services, several hours for stateful ones), it auto-promotes to prod. The promotion is not a redeploy. It is the same artifact, same hash, same environment variables (modulo prod-only secrets), now serving production traffic.
- Same artifact, every environment: If your dev binary differs from your prod binary, you do not have CD. The exact same container image, jar, or wasm module that ran in staging at 3:14 PM is what runs in prod at 3:46 PM. Configuration changes per environment, code does not.
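The promotion model above can be sketched in a few lines. This is a minimal illustration, not a real deploy tool: the `Artifact`, `ENV_CONFIG`, and `promote` names are hypothetical, and the point is only that promotion pins an immutable digest while configuration varies per environment.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Artifact:
    digest: str  # immutable content hash, e.g. "sha256:ab12..." -- never a mutable tag

# Per-environment configuration; code never differs, config does.
ENV_CONFIG = {
    "staging": {"DB_URL": "postgres://staging-db/app", "LOG_LEVEL": "debug"},
    "prod":    {"DB_URL": "postgres://prod-db/app",    "LOG_LEVEL": "info"},
}

def promote(artifact: Artifact, env: str) -> dict:
    """Deploy the exact same artifact to `env`; only the config overlay changes."""
    return {"digest": artifact.digest, "env": env, "config": ENV_CONFIG[env]}

art = Artifact(digest="sha256:ab12cd34")
staging = promote(art, "staging")
prod = promote(art, "prod")
assert staging["digest"] == prod["digest"]  # same artifact, every environment
```

Pinning by digest rather than by tag is what makes "same artifact" verifiable: a tag like `latest` can silently move between the staging deploy and the prod promotion; a digest cannot.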
This pattern only works if every commit is independently shippable, which forces a discipline that pays for itself: feature flags for incomplete work, schema migrations that are forward-compatible across at least one release, and tests that actually catch regressions. The pipeline is the forcing function.
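The feature-flag half of that discipline fits in a few lines. A hedged sketch with a hypothetical in-process flag store (real systems use a flag service): incomplete work merges and ships dark, so the commit stays independently shippable.

```python
# Hypothetical flag store; merged-but-incomplete work defaults to off everywhere.
FLAGS = {"new_checkout_flow": False}

def is_enabled(flag: str) -> bool:
    # Unknown flags are off: the safe default for a pipeline that ships every merge.
    return FLAGS.get(flag, False)

def legacy_checkout(cart: list) -> dict:
    return {"path": "legacy", "items": cart}

def new_checkout(cart: list) -> dict:
    # Incomplete path: deployed to prod, dark until the flag flips.
    return {"path": "new", "items": cart}

def checkout(cart: list) -> dict:
    if is_enabled("new_checkout_flow"):
        return new_checkout(cart)
    return legacy_checkout(cart)
```

Flipping the flag later is a config change, not a deploy, which keeps the rollback story for the new path independent of the pipeline entirely.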
Gates
An auto-promote pipeline without gates is a foot-gun on a timer. The gates are what make the system safe enough to run unattended. Layer them so that no single failure mode can let a bad artifact through.
- Test pass: Unit, integration, and contract tests must be green on the artifact before it leaves the merge stage. End-to-end tests run against staging after the deploy completes. A failed e2e at staging blocks the prod promotion automatically and pages the on-call.
- Health check soak: After the staging deploy, the artifact must pass health checks for a minimum window (commonly 15 minutes for stateless services, longer for those with warm caches). If error rate, latency, or saturation breaches a threshold during the soak, promotion is blocked.
- Time window: Production promotions are gated to safe windows (no Friday afternoon deploys, no ramps during the on-call handoff hour, freeze during black-out periods like Black Friday). The pipeline respects the calendar.
- Approval (only for high-risk changes): Schema migrations, IAM changes, and anything touching the payment path require a synchronous human approval before promotion. Everything else flows through unattended.
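The fail-closed layering described above can be sketched as a gate chain. The gate functions here are hypothetical stubs standing in for the real checks; the part that matters is that an erroring gate blocks promotion exactly like a failing one.

```python
from typing import Callable, Iterable

def gates_pass(gates: Iterable[Callable[[], bool]]) -> bool:
    """Every gate must return True; any exception counts as a block (fail closed)."""
    for gate in gates:
        try:
            if not gate():
                return False   # explicit block
        except Exception:
            return False       # an erroring gate must never let the artifact through
    return True

# Stubs standing in for the real checks.
def tests_green() -> bool: return True
def soak_healthy() -> bool: return True
def in_deploy_window() -> bool: return False  # e.g. it's Friday afternoon

# All gates are evaluated as a single chain; one closed gate blocks promotion.
assert not gates_pass([tests_green, soak_healthy, in_deploy_window])
```

Note the asymmetry: a gate that times out, throws, or returns garbage is treated as a "no". Defaulting an unreachable gate to "yes" is exactly the failure mode that lets a flaky check promote a bad artifact.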
Each gate is independent and each one fails closed. That is the difference between CD that runs while everyone sleeps and CD that wakes the on-call at 3 AM because a flaky e2e nudged a bad artifact into prod.
Rollback
The rollback is what makes auto-promote tolerable. If a regression makes it past every gate (and eventually one will), the pipeline must be able to revert it as fast as it shipped it.
- Per-environment rollback: Production can roll back without touching dev or staging. The previous prod artifact is held warm and ready to take traffic on a single command, with no rebuild required. A rollback is the inverse of a promotion, not a separate code path.
- Forward-compatible migrations: Database migrations are written so that the previous version of the application can still run against the new schema. This is the only way to make rollback safe in the presence of schema changes. If your migration drops a column the previous version reads, your rollback will break worse than your forward fix.
- Auto-rollback on burn rate: If the new artifact pushes the error budget burn rate above a threshold for more than a few minutes, the pipeline rolls back automatically. The on-call confirms or overrides, but the default is revert. Mean time to recovery should be measured in minutes, not hours.
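The burn-rate trigger above reduces to a small decision function. A sketch under stated assumptions: a 2% error budget, a 10x burn threshold, and a five-minute sustained window are all illustrative numbers, and the per-minute error-rate feed is hypothetical.

```python
SLO_ERROR_BUDGET = 0.02   # assumed 98% availability SLO
BURN_THRESHOLD = 10.0     # multiples of the sustainable burn rate
SUSTAINED_MINUTES = 5     # breach must persist this long before we revert

def burn_rate(error_rate: float) -> float:
    """How many multiples of the error budget the current error rate consumes."""
    return error_rate / SLO_ERROR_BUDGET

def should_rollback(per_minute_error_rates: list[float]) -> bool:
    """Revert only when the burn rate stays above threshold for the full window,
    so a one-minute blip does not trigger an automatic rollback."""
    window = per_minute_error_rates[-SUSTAINED_MINUTES:]
    return (len(window) == SUSTAINED_MINUTES
            and all(burn_rate(r) > BURN_THRESHOLD for r in window))

# A brief spike does not revert; a sustained breach does.
assert not should_rollback([0.5, 0.01, 0.01, 0.01, 0.01])
assert should_rollback([0.5, 0.5, 0.5, 0.5, 0.5])
```

Requiring the breach to be sustained is the knob that trades a few minutes of MTTR for not flapping between versions on transient noise; production systems typically layer a fast short-window check on top of a slower long-window one.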
This combination is what turns weekly releases into hourly ones: every commit is shippable, every promotion is gated, every rollback is independent. Nova AI Ops watches the pipeline end to end (test status, soak health, burn rate, deploy traceability) and pages on-call when the auto-rollback fires, so the human only sees the cases that need a brain.