Environment Promotion: Dev → Staging → Prod
Promotion gates between environments.
Dev
The dev environment is where engineers build, break, and rebuild. The fundamental design principle is that dev is for the people working on the code, not for the customers consuming the product. Friction here punishes velocity; safety here is mostly about not blowing up your teammates' work, not about protecting customers.
What dev should look like:
- Free-form by default: Any engineer can deploy any branch at any time. No approval workflow, no scheduled deploy windows, no blocking pre-checks beyond the cheapest ones. The dev environment exists to make iteration fast.
- Low gates, fast feedback: Lint and unit tests must pass before deploy because they are cheap. Integration and e2e suites do not have to pass; the dev environment IS where you run them when in doubt. A failing integration or e2e run should never block a dev deploy; it should produce visible diagnostic output for the developer (a sketch of this split follows the list).
- Reset-able and disposable: Dev databases get wiped weekly or on demand. Dev caches are small and short-lived. Dev queues are drained nightly. Nothing in dev should persist long enough to develop hidden state that leaks into the next change.
- Per-engineer or per-feature namespaces: Where infrastructure permits, each engineer or each PR gets its own ephemeral dev slice (a namespace, a pod, a preview environment); see the namespace sketch after this list. This eliminates the "I broke dev" problem because dev is not a shared singleton.
- Cheap, not production-faithful: Dev runs on smaller instances, smaller datasets, fewer regions. The point is to exercise the code path, not to reproduce production load. Confusing the two is how teams end up with a dev environment that costs as much as production but reproduces fewer real bugs.
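A minimal sketch of that blocking/advisory split, assuming the repo exposes `make lint`, `make unit`, `make integration`, and `make e2e` targets (the target names are illustrative):

```python
#!/usr/bin/env python3
"""Dev deploy gate: cheap checks block, expensive checks only report."""
import subprocess
import sys

# Cheap enough that blocking on them costs nothing. (Illustrative targets.)
BLOCKING = ["make lint", "make unit"]
# Expensive suites: run them, surface the output, never veto a dev deploy.
ADVISORY = ["make integration", "make e2e"]

def run(cmd: str) -> bool:
    print(f"--- {cmd}")
    return subprocess.run(cmd, shell=True).returncode == 0

def main() -> int:
    for cmd in BLOCKING:
        if not run(cmd):
            print(f"BLOCKED: '{cmd}' failed; fix it before deploying to dev.")
            return 1
    for cmd in ADVISORY:
        if not run(cmd):
            # Visible diagnostics, no veto: dev is where you debug these.
            print(f"ADVISORY: '{cmd}' failed; deploy proceeds, go look at it.")
    print("dev deploy may proceed")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```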
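And a sketch of the per-PR slice on Kubernetes, shelling out to kubectl. The `dev-pr-<n>` naming and the `disposable` label are assumptions; most teams get this from a preview-environment operator rather than a hand-rolled script:

```python
#!/usr/bin/env python3
"""Give each PR its own namespace so dev stops being a shared singleton."""
import subprocess
import sys

def ensure_pr_namespace(pr_number: int) -> str:
    ns = f"dev-pr-{pr_number}"  # hypothetical naming convention
    # Idempotent create: fine if the namespace already exists.
    subprocess.run(["kubectl", "create", "namespace", ns], check=False)
    # Label it so a nightly janitor job knows it is safe to delete.
    subprocess.run(
        ["kubectl", "label", "namespace", ns,
         "env=dev", "disposable=true", "--overwrite"],
        check=True,
    )
    return ns

if __name__ == "__main__":
    print(ensure_pr_namespace(int(sys.argv[1])))
```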
The dev environment is the team's workshop. Optimize it for iteration speed, not for the appearance of caution.
Staging
Staging is where the change has to look like the production version of itself before customers see it. The fundamental design principle here is parity: staging exists to expose the failure modes that a production-grade environment will hit, before they actually hit production.
- Production-like, not production-equal: Same dependencies, same network topology, same data shapes, same configuration system. Reduced scale is fine; missing components are not. If a service exists in prod, it exists in staging.
- Real artifact, not rebuilt: The exact same container image, jar, or binary that will run in prod runs in staging. Different builds hide differences that surface only in production. The same artifact, deployed to two environments differing only in config, is non-negotiable.
- Mid-tier gates: Unit and integration tests must pass. End-to-end suites must pass. The artifact must soak in staging for a minimum window (15 to 60 minutes for stateless services, longer for stateful) before promoting to prod. The gate is not a meeting; it is automatic (a sketch of this gate follows the list).
- Synthetic traffic: Staging without traffic does not exercise the code paths that matter. Run synthetic load that approximates the shape of production requests: read/write ratios, tenant distribution, time-of-day patterns. The most common deploy-time bug is one that only shows up under load, and a quiet staging cannot find those; a load-shaping sketch also follows the list.
- Real customer data is forbidden: Production data does not flow to staging. Synthesized data, scrubbed copies, or smaller test datasets only. This is both a privacy boundary and a discipline that forces tests to work against synthetic shapes rather than memorize specific real users.
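A minimal sketch of the gate that makes the same-artifact and soak rules executable. The shape of the deploy record, and the stateful soak minimum, are assumptions about what your pipeline tracks, not a real API:

```python
"""Staging-to-prod promotion gate: same digest, e2e green, soak served."""
from datetime import datetime, timedelta, timezone

# Stateless minimum is from the text; the stateful figure is an assumption.
SOAK = {"stateless": timedelta(minutes=30), "stateful": timedelta(hours=4)}

def can_promote(staging: dict, candidate_digest: str) -> tuple[bool, str]:
    """`staging` is an assumed deploy record, e.g.
    {"digest": "sha256:...", "e2e_passed": True,
     "deployed_at": <aware datetime>, "kind": "stateless"}."""
    if staging["digest"] != candidate_digest:
        return False, "artifact mismatch: prod must run the staging-tested build"
    if not staging["e2e_passed"]:
        return False, "e2e suite has not passed against this artifact"
    soaked = datetime.now(timezone.utc) - staging["deployed_at"]
    if soaked < SOAK[staging["kind"]]:
        return False, f"soak window not met: {soaked} of {SOAK[staging['kind']]}"
    return True, "ok to promote"
```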
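And a sketch of shaped synthetic load. The base URL and endpoints are stand-ins for your own, and the 90/10 read/write split, the tenant skew, and the arrival rate are illustrative parameters, not measured ones:

```python
"""Shaped synthetic traffic for staging: read/write mix, skewed tenants."""
import json
import random
import time
import urllib.request

BASE = "https://staging.example.internal"  # assumed staging endpoint
TENANTS = [f"tenant-{i}" for i in range(50)]
# Heavier weight on the first tenants roughly mimics production skew.
WEIGHTS = [1.0 / (i + 1) for i in range(len(TENANTS))]

def one_request() -> None:
    tenant = random.choices(TENANTS, weights=WEIGHTS)[0]
    if random.random() < 0.9:  # ~90% reads
        req = urllib.request.Request(f"{BASE}/api/items?tenant={tenant}")
    else:                      # ~10% writes
        body = json.dumps({"tenant": tenant, "name": "synthetic"}).encode()
        req = urllib.request.Request(
            f"{BASE}/api/items", data=body,
            headers={"Content-Type": "application/json"}, method="POST")
    try:
        urllib.request.urlopen(req, timeout=5).read()
    except Exception as exc:  # failures here are the signal we came for
        print(f"{req.get_method()} {req.full_url}: {exc}")

if __name__ == "__main__":
    while True:
        one_request()
        time.sleep(random.expovariate(20))  # ~20 req/s, Poisson arrivals
```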
Staging done right is the gate that separates "we think it works" from "we have evidence it works." The team that invests here ships hot patches in minutes, not hours, because they can promote to prod with real confidence.
Prod
Production is where the customer is. Every gate, every mechanism, every overridable protection is in service of the same principle: the cost of a bad change in prod is at least an order of magnitude higher than a bad change in any other environment. Prod earns the strict gates by virtue of who is on the other side.
- Customer-facing by definition: If a customer can hit it, it is prod. Internal staging URLs that have been opened up to a beta program are prod for those users. Treat them accordingly.
- Required code review: Every change to prod is reviewed. Not as a checkbox but as a real review by an engineer who has reasoned about the change. Self-merge to prod is forbidden except for explicitly trivial cases (a doc typo) under a fast-path policy.
- Automated canary or blue-green deploys: The change reaches a small percentage of traffic first. Health gates measure error rate, latency, and saturation against the rest of production for a soak window. Promotion to full traffic is gated on canary success; a sketch of the canary loop and its rollback trigger follows the list.
- Continuous monitoring with auto-rollback: The deploy pipeline watches the SLO burn rate during and after each change. A burn-rate spike triggers automatic rollback; the on-call confirms or overrides. The default response to an emerging incident is revert, not investigate.
- Approval for high-risk changes: Schema migrations, IAM changes, payment-path code, and anything touching production data require a synchronous human approval before promotion. Everything else flows automatically. The approval is reserved for the changes whose blast radius cannot be contained by a canary; the routing sketch after this list shows the split.
- Deploy windows for high-traffic periods: Black Friday, end-of-quarter, big-event days. Prod respects the calendar. The same change that ships fine on Tuesday afternoon is a worse idea at 5 PM Friday before a long weekend.
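A sketch of the canary loop with the rollback trigger built in. The `deploy` and `metrics` objects stand in for whatever your rollout and observability systems expose, and the thresholds are placeholders to make the logic concrete, not recommendations:

```python
"""Canary gate: compare the canary to baseline, revert on sustained burn."""
import time

MAX_ERROR_RATIO = 2.0    # canary error rate vs baseline (placeholder)
MAX_P99_RATIO = 1.5      # canary p99 latency vs baseline (placeholder)
SOAK_SECONDS = 15 * 60
CHECK_EVERY = 30

def unhealthy(metrics) -> str | None:
    """Returns a failure reason, or None while the canary looks healthy.
    `metrics` is assumed to expose error_rate(group) and p99(group)."""
    baseline_err = max(metrics.error_rate("baseline"), 1e-6)  # avoid /0
    if metrics.error_rate("canary") / baseline_err > MAX_ERROR_RATIO:
        return "error rate"
    if metrics.p99("canary") / metrics.p99("baseline") > MAX_P99_RATIO:
        return "p99 latency"
    return None

def run_canary(deploy, metrics) -> bool:
    deploy.shift_traffic(percent=5)  # small slice first
    deadline = time.monotonic() + SOAK_SECONDS
    while time.monotonic() < deadline:
        reason = unhealthy(metrics)
        if reason:
            deploy.rollback()        # revert first, investigate after
            deploy.page_oncall(f"auto-rollback: canary failed on {reason}")
            return False
        time.sleep(CHECK_EVERY)
    deploy.shift_traffic(percent=100)  # soaked clean: promote fully
    return True
```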
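And a sketch of the approval routing plus the freeze calendar. The path patterns and freeze dates are examples, not policy; a real pipeline would read both from config:

```python
"""Route a change: freezes block, high-risk paths need a human, rest flows."""
from datetime import date
from fnmatch import fnmatch

# Illustrative patterns; match these to your own repo layout.
HIGH_RISK = ["migrations/*", "iam/*", "payments/*", "*.sql"]
# Illustrative freeze windows: Black Friday week, end of quarter.
FREEZES = [(date(2025, 11, 24), date(2025, 12, 1)),
           (date(2025, 12, 29), date(2026, 1, 2))]

def gate(changed_files: list[str], today: date) -> str:
    if any(start <= today <= end for start, end in FREEZES):
        return "frozen: window closed, wait it out or invoke break-glass"
    if any(fnmatch(f, pat) for f in changed_files for pat in HIGH_RISK):
        return "hold: high-risk paths touched, synchronous approval required"
    return "auto: proceed straight to the canary pipeline"

if __name__ == "__main__":
    print(gate(["payments/charge.py"], date(2025, 6, 3)))   # -> hold
    print(gate(["api/handlers.py"], date(2025, 6, 3)))      # -> auto
    print(gate(["api/handlers.py"], date(2025, 11, 28)))    # -> frozen
```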
The promotion path from dev through staging to prod is what turns continuous development into continuous delivery. Each environment has a different job and the gates are tuned to its job. Nova AI Ops watches every promotion event, gates each transition on the right SLO and burn-rate signals for that environment, and pages the on-call only when the auto-rollback fires, so promotion is silent when it works and loud only when it has to be.