Deploy Blast Radius Mapping
A map of what this deploy could affect.
Services affected
Every deploy carries risk; the question is how much. Most teams answer that question with intuition, which is correct most of the time and catastrophically wrong occasionally. Blast radius mapping replaces intuition with a structured analysis of what the deploy could affect if it goes wrong. Done before the deploy, it informs the gating decisions; done after an incident, it informs the retro.
What service-level blast radius covers:
- Direct: services with code changes. The list starts with the services whose code is in the change set. A multi-service PR that touches 3 services has 3 services on the direct list. Each of these is at primary risk if the deploy regresses.
- Indirect: services that depend on the changed services. Every service that calls one of the directly-affected services is on the indirect list. If service A regresses, services B and C that depend on A are at risk too. The dependency graph determines the indirect blast radius.
- Transitive: dependents of dependents. If service B depends on A, and service D depends on B, then D is on the transitive list. Blast radius cascades through the dependency tree. Most teams stop at one or two levels because beyond that the risk usually attenuates.
- Service criticality matters. A change that touches a Tier 0 service has higher blast radius than the same change to a Tier 2 service, even with identical dependency depth. The map weights services by their criticality, not just by their position in the graph.
- Auto-generated from the dependency graph. Modern observability tooling (service mesh, distributed tracing, dependency catalogs) produces the dependency graph automatically. The blast radius for a given PR can be computed by intersecting the changed services with the graph. Manual mapping is error-prone; automation is what makes the discipline routine.
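The intersect-and-traverse step can be sketched in a few lines of Python. Everything here is illustrative: the reverse-dependency map, the tier weights, and the depth attenuation are assumptions standing in for whatever your dependency catalog and criticality model actually provide.

```python
from collections import deque

# Hypothetical criticality weights: Tier 0 is most critical.
CRITICALITY_WEIGHT = {0: 10, 1: 3, 2: 1}

def blast_radius(changed, reverse_deps, tiers, max_depth=2):
    """Walk the reverse dependency graph outward from the changed services.

    changed:      set of services with code changes (the direct list)
    reverse_deps: service -> set of services that call it
    tiers:        service -> criticality tier (0 = most critical)
    max_depth:    how many dependency levels to traverse (risk attenuates)
    """
    seen = {svc: 0 for svc in changed}              # service -> depth reached
    queue = deque((svc, 0) for svc in changed)
    while queue:
        svc, depth = queue.popleft()
        if depth == max_depth:
            continue                                # stop cascading past max_depth
        for dependent in reverse_deps.get(svc, ()):
            if dependent not in seen:
                seen[dependent] = depth + 1
                queue.append((dependent, depth + 1))
    # Score each affected service by criticality, halved per dependency level.
    score = sum(CRITICALITY_WEIGHT[tiers[s]] / (2 ** d) for s, d in seen.items())
    return seen, score
```

For example, if Tier 0 service A is changed, B and C call A, and D calls B, the map comes back with A at depth 0, B and C at depth 1, and D at depth 2, plus a single weighted score the gating policy can threshold on.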
The service-level map answers "what could go wrong if this deploy regresses?" The answer informs the gating: more dependents and higher criticality justify more careful canary or longer soak.
Customers
Service-level blast radius is one input. Customer-level blast radius is another. The same change can affect 100% of customers (a change to a shared core service) or 0.01% of customers (a change to a feature only a few have enabled). The customer view is what most stakeholders actually care about.
- Estimated impact if rollback is needed. The map estimates how many customers, what tier of customers, and what percentage of revenue is exposed if the deploy regresses and a rollback is needed. The estimate uses the same per-tenant traffic data the SLO calculations use; it is approximate but bounded.
- Inform the deploy decision. A change with a 100-customer blast radius gets shipped on a Wednesday afternoon. The same change with a 100,000-customer blast radius gets shipped earlier in the week with longer soak windows and tighter rollback gates. The decision is calibrated to the impact.
- Inform the rollback decision. When a deploy is in flight and metrics start drifting, the customer blast radius informs how aggressively to roll back. A small blast radius can tolerate a few minutes of investigation; a large blast radius rolls back immediately and investigates after.
- Per-tier impact. Some customers tolerate degradation more than others. Free-tier impact is different from enterprise impact. The customer blast radius is broken down by tier so the team can see which segments are exposed.
- Geographic and regional concentration. Some changes affect specific regions. Some affect specific customer cohorts more than others. The map captures this so the decision-makers know if the impact is concentrated rather than distributed.
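The per-tier breakdown is a straightforward aggregation once the service-level map exists. A minimal sketch, assuming a hypothetical per-tenant record shape (tier, monthly recurring revenue, and the set of services each tenant touches); a real system would derive the service set from per-tenant traffic data:

```python
def customer_blast_radius(affected_services, tenants):
    """Aggregate tenant exposure for the services in the blast radius.

    affected_services: set of service names at risk from the deploy
    tenants: iterable of dicts with assumed fields:
        {"tier": "enterprise", "mrr": 5000, "services": {"checkout", ...}}
    Returns {tier: {"customers": count, "mrr": exposed revenue}}.
    """
    by_tier = {}
    for tenant in tenants:
        if tenant["services"] & affected_services:   # tenant uses a service at risk
            bucket = by_tier.setdefault(tenant["tier"], {"customers": 0, "mrr": 0})
            bucket["customers"] += 1
            bucket["mrr"] += tenant["mrr"]
    return by_tier
```

The output is the table stakeholders actually read: how many enterprise versus free customers are exposed, and how much revenue sits behind them.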
The customer view is what makes the technical analysis useful to non-engineers. Sales, customer success, and leadership care about customer impact more than about service dependency depth.
Revenue
The third dimension is the financial blast radius. Every deploy that could affect revenue-generating services has a per-minute cost if it causes an outage. Putting a number on that cost is the discipline that makes the deploy decision economically grounded.
- Per-minute cost if outage. For revenue-path services, the team computes the per-minute cost of a full outage: aggregate revenue divided by the minutes in the period. A service that processes $10M/month costs roughly $230 per minute of downtime ($10M ÷ 43,200 minutes). For a service handling 5x that, the cost is correspondingly higher.
- Sizes the risk in dollars. "This deploy has a 5% chance of a 10-minute outage on a service that costs $2,000 per minute when down" yields a concrete risk number: 0.05 × 10 × $2,000 = $1,000 expected cost. The deploy decision compares that against the value of shipping the change.
- Different services have different per-minute costs. Internal admin tools have near-zero per-minute revenue cost. Payment processing has very high per-minute cost. The blast radius analysis must use the right cost number for the affected services.
- Time of day matters. A revenue-path service has higher per-minute cost during peak business hours than at 3 AM on a Sunday. The financial map factors in time-of-day to reflect when the deploy is happening. Late-night deploys to revenue services have lower expected cost than mid-day deploys.
- Drives investment in protective tooling. When the team can quantify the financial cost of bad deploys, the investment case for canary, blue-green, and automated rollback becomes clear. A canary controller that costs 2 engineer-months pays back in a single avoided incident on a high-cost service.
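The whole financial model fits in two functions. This is a sketch under the section's own assumptions (a 30-day month and a flat revenue rate); the `peak_factor` multiplier is a hypothetical stand-in for whatever time-of-day curve the team actually fits:

```python
MINUTES_PER_MONTH = 30 * 24 * 60        # 43,200 minutes in a 30-day month

def per_minute_cost(monthly_revenue):
    """Flat-average cost per minute of full outage."""
    return monthly_revenue / MINUTES_PER_MONTH

def expected_outage_cost(monthly_revenue, p_outage, outage_minutes, peak_factor=1.0):
    """Expected dollar cost of the deploy going wrong.

    p_outage:       estimated probability the deploy causes an outage
    outage_minutes: estimated duration of that outage
    peak_factor:    assumed time-of-day multiplier relative to the flat
                    average (e.g. 1.5 at mid-day peak, 0.3 at 3 AM Sunday)
    """
    return p_outage * outage_minutes * per_minute_cost(monthly_revenue) * peak_factor
```

Running the section's own numbers: $10M/month works out to about $231 per minute, and a 5% chance of a 10-minute outage on a $2,000-per-minute service gives $1,000 of expected cost, before any time-of-day adjustment.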
Blast radius mapping turns deploy risk from intuition into a measured input to the deploy decision. Nova AI Ops auto-generates the per-deploy blast radius from the dependency graph, the per-tenant traffic data, and the per-service revenue model, so each deploy event is annotated with the impact estimate the team needs to gate it appropriately.