Deploy Window Discipline
Deploy during business hours.
Rule
Deploy windows are calendar-based rules about when production deployments are allowed. The argument for them: incident response capability varies by time of day, so deploys should happen when that capability is highest. The argument against: a service that cannot be deployed safely at an arbitrary hour has a brittle architecture, and the window hides the brittleness rather than fixing it. Both arguments have merit; the right answer depends on the service's risk profile.
What a sensible deploy window rule looks like:
- Deploy 9 AM to 4 PM on business days: The default window for production deploys. Most engineers are at work; on-call coverage is full; customer support is available. A deploy that goes wrong has the maximum response capability available to fix it quickly.
- Stagger by team: If multiple teams deploy daily to overlapping services, stagger their windows so they do not collide. Team A: 9-11 AM. Team B: 11 AM-1 PM. Team C: 2-4 PM. Staggering prevents the case where two teams' regressions land simultaneously and the on-call can't tell which deploy is causing the issue.
- Most engineers are awake during the window: The fundamental property the rule preserves is the team's collective response capability. A deploy at 2 PM on Tuesday has 50+ engineers available; the same deploy at 2 AM on Sunday has the on-call alone.
- Different windows per service tier: Tier 0 (revenue-path) services have tighter windows. Tier 1 (customer-affecting but non-critical) services have broader windows. Tier 2 (internal) services can deploy anytime. The rule scales with the consequence of failure.
- Window respected by automation: The deploy pipeline checks the current time against the window rule. Outside the window, the deploy is blocked or queued for the next window. The pipeline enforces the rule; human discipline defines the rule but is not what enforces it.
The rule is a compromise between operational safety and engineering velocity. Tight windows reduce risk and slow shipping; broad windows increase risk and speed shipping. The right balance depends on the service.
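The tiered windows and the pipeline check described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the tier names, the `WINDOWS` table, and the helpers `is_allowed` and `next_window_start` are hypothetical, the Tier 0 hours are an assumed example of a "tighter" window, and timestamps are assumed to be in the team's local time.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta
from typing import Optional


@dataclass(frozen=True)
class Window:
    start: time           # inclusive
    end: time             # exclusive
    weekdays: frozenset   # Monday is 0, Sunday is 6


BUSINESS_DAYS = frozenset(range(5))

# Hypothetical tier table. None means "no restriction" (Tier 2 in the text).
WINDOWS = {
    "tier0": Window(time(10), time(15), BUSINESS_DAYS),  # assumed tighter hours
    "tier1": Window(time(9), time(16), BUSINESS_DAYS),   # the 9 AM to 4 PM default
    "tier2": None,
}


def is_allowed(tier: str, now: datetime) -> bool:
    """Return True if a deploy for this tier may start at `now`."""
    window: Optional[Window] = WINDOWS[tier]
    if window is None:
        return True
    in_day = now.weekday() in window.weekdays
    return in_day and window.start <= now.time() < window.end


def next_window_start(tier: str, now: datetime) -> datetime:
    """Return `now` if allowed, else the start of the next open window."""
    if is_allowed(tier, now):
        return now
    window = WINDOWS[tier]
    # A weekly rule repeats within 7 days, so 8 candidate days is enough.
    for offset in range(8):
        day = (now + timedelta(days=offset)).date()
        candidate = datetime.combine(day, window.start, tzinfo=now.tzinfo)
        if candidate >= now and candidate.weekday() in window.weekdays:
            return candidate
    raise RuntimeError("window has no open weekdays")
```

A pipeline step could call `is_allowed` before starting a rollout and, when it returns False, either fail the job or schedule it for `next_window_start`. Whether to block or queue is a policy choice the text leaves open.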
Emergency
Some changes legitimately need to ship outside the window: security fixes for actively exploited vulnerabilities, hot patches for production incidents, and critical customer-impacting bugs. The rule has to permit these without the window becoming theater, a restriction that exists on paper but is bypassed whenever it is inconvenient.
- Critical fixes are exempt: The deploy pipeline supports an emergency override. The on-call can mark a deploy as an emergency; the override skips the window check. The override is recorded in the deploy log; the audit trail captures who used it and why.
- Documented and logged: Every emergency deploy generates a record: the trigger (incident reference, security advisory), the change, the deployer, the time, and the post-deploy verification. The record is reviewable; auditors can verify that emergency deploys actually were emergencies.
- Approval still required: Emergency deploys still require a second approver, typically the on-call manager, or a security leader for security-related changes. The approval is captured in the incident channel; nobody deploys to production alone at midnight.
- Audit trail extends to retrospective: Emergency deploys feed into the next quarterly review. Patterns surface: a team that emergency-deploys frequently is either facing real urgent issues or treating the override as routine. Both warrant investigation.
- Reset to normal as soon as possible: An emergency deploy is followed by a normal deploy cycle to verify that the fix works under the standard process. The emergency is the exception; the team returns to standard operating mode once the immediate need has passed.
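One way to picture the override is as a function that refuses to proceed unless the record is complete and a second person has approved. The sketch below is illustrative only: `EmergencyDeploy`, `OverrideRejected`, and `request_override` are hypothetical names, and the field list simply mirrors the record described in the bullets above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


class OverrideRejected(Exception):
    """Raised when an emergency override request is incomplete."""


@dataclass(frozen=True)
class EmergencyDeploy:
    trigger: str          # incident reference or security advisory
    change: str           # identifier of the change being shipped
    deployer: str
    approver: str
    reason: str
    deployed_at: datetime
    verified: bool = False  # set once post-deploy verification is done


def request_override(trigger: str, change: str, deployer: str,
                     approver: str, reason: str,
                     now: Optional[datetime] = None) -> EmergencyDeploy:
    """Validate an override request and return the audit record."""
    if not trigger.strip():
        raise OverrideRejected("a trigger reference is required")
    if not reason.strip():
        raise OverrideRejected("a reason is required")
    # The second approver must be a different person from the deployer.
    if not approver.strip() or approver == deployer:
        raise OverrideRejected("a second approver is required")
    return EmergencyDeploy(
        trigger=trigger, change=change, deployer=deployer,
        approver=approver, reason=reason,
        deployed_at=now or datetime.now(timezone.utc),
    )
```

In a real pipeline the returned record would be appended to the deploy log before the window check is skipped, so the audit entry exists even if the deploy itself fails.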
The emergency path is necessary; the discipline is making sure it is rare and well-documented when used. A team that uses the emergency path weekly has either real reliability issues or a broken normal process.
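Whether the override is rare can be checked mechanically from the deploy log. The sketch below counts emergency deploys per team over a lookback period and flags teams above a threshold. The function name `teams_over_threshold`, the 90-day lookback, and the threshold of 3 are assumptions chosen for illustration; the right numbers depend on the team.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Dict, Iterable, Tuple


def teams_over_threshold(records: Iterable[Tuple[str, datetime]],
                         now: datetime,
                         lookback_days: int = 90,
                         max_per_period: int = 3) -> Dict[str, int]:
    """Return {team: count} for teams exceeding the emergency-deploy budget.

    `records` is an iterable of (team, deployed_at) pairs taken from the
    emergency-deploy log.
    """
    cutoff = now - timedelta(days=lookback_days)
    counts = Counter(team for team, when in records if cutoff <= when <= now)
    return {team: n for team, n in counts.items() if n > max_per_period}
```

A flagged team is a prompt for a conversation in the quarterly review, not an automatic verdict: the count cannot distinguish real urgent issues from routine misuse.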
Avoid
The times when deploys should not happen are well understood. The discipline is honoring them consistently rather than making exceptions for "just this one change."
- Friday afternoon: The classic pattern. A deploy at 4 PM on Friday is followed by an entire weekend of reduced staffing. If the deploy regresses, the on-call's weekend is consumed; if the regression is severe, customer impact extends across the weekend.
- Holiday windows: The week before Christmas. The week of Thanksgiving. End-of-year code freeze in many enterprises. Black Friday for retail. Tax season for accounting platforms. Each industry has its own peak periods where deploys are riskier than usual; the rule extends to match.
- Reduced staffing matters: The fundamental reason for these restrictions is that when something goes wrong, the response capability is reduced. The rule is not arbitrary; it matches deploy timing to operational capability.
- Daylight saving time transitions: The hours around DST transitions break time-dependent code in subtle ways. Deploys around the transition reveal these bugs at the worst possible moment. The rule extends to skip the immediate transition window.
- Major event launches: Product launches, major marketing campaigns, scheduled high-profile demos. Deploys near these events are higher-stakes than usual. The rule treats them as additional restricted windows.
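The restricted periods listed above can be kept as a small calendar that the pipeline consults alongside the normal window. The following is a sketch under stated assumptions: `Blackout`, `BLACKOUTS`, and `restriction_for` are hypothetical names, the Friday cutoff of noon is an assumed value, and the calendar entries are illustrative examples (a US DST transition and a Thanksgiving week in 2024), not a recommended list.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import List, Optional


@dataclass(frozen=True)
class Blackout:
    name: str
    start: datetime  # inclusive, team-local time
    end: datetime    # exclusive


FRIDAY = 4                    # datetime.weekday() value for Friday
FRIDAY_CUTOFF = time(12, 0)   # assumed cutoff; the text says "afternoon"

# Illustrative entries only; each team maintains its own calendar.
BLACKOUTS: List[Blackout] = [
    Blackout("DST transition", datetime(2024, 3, 9, 12), datetime(2024, 3, 11, 12)),
    Blackout("Thanksgiving week", datetime(2024, 11, 25), datetime(2024, 12, 2)),
]


def restriction_for(now: datetime,
                    blackouts: Optional[List[Blackout]] = None) -> Optional[str]:
    """Return the name of the restriction in effect at `now`, or None."""
    if now.weekday() == FRIDAY and now.time() >= FRIDAY_CUTOFF:
        return "Friday afternoon"
    for blackout in (BLACKOUTS if blackouts is None else blackouts):
        if blackout.start <= now < blackout.end:
            return blackout.name
    return None
```

Returning the name rather than a bare boolean lets the pipeline tell the engineer why the deploy was refused, which makes "just this one change" exceptions easier to decline.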
Deploy window discipline is the operational rule that aligns deploy timing with response capability. Nova AI Ops respects deploy window configuration in the pipeline, surfaces the calendar of restricted windows, and tracks emergency-deploy usage so the team can see whether the override is being used appropriately or has become routine.