Dark Launches and Shadow Traffic: Testing in Production Safely
A dark launch runs your new code on real traffic but throws away the result. You learn how it behaves under production load, before any user sees an answer that depends on it.
What a dark launch is
You write the new code, deploy it, and route real production traffic to it, but the response is thrown away (or compared to the old response and logged). Users see the old behaviour. The new code learns under real load.
The mental model. A dark launch decouples deployment from rollout. The code is in production; the traffic flows through it; the user just doesn't see the result yet. This means you can stress-test the new code against real load, real data shapes, real concurrency, before any user is affected by its output.
The high-leverage uses. Rewrites of core services (payment processing, recommendation engines). New algorithms whose correctness needs production-data validation. Performance changes whose impact only shows at scale. Each is a case where staging environments lie because they don't have production's traffic shape.
Why staging is not enough
Staging traffic is curated. It does not have the long tail of weird inputs, the cross-tenant interactions, the cache patterns, or the request distribution of production. A dark launch gives the new code the actual conditions it will run in.
The specific gaps. Staging usually holds 1-10% of production data volume. Staging traffic is generated by automated tests or a small QA team — predictable, with a narrow distribution. Real traffic has an orders-of-magnitude heavier tail of edge cases, unusual user behaviour, and traffic spikes that synthetic tests don't reproduce.
The bugs that escape staging are the bugs dark launches catch. Race conditions that only fire under certain concurrency patterns. Performance cliffs at specific data sizes. Memory leaks that take days to manifest. Each of these passes staging tests and then breaks production within hours of full launch — unless a dark launch caught them first.
Pattern 1: parallel call
The request handler calls both the old and new code paths, returns the old response, and discards (or compares) the new one. Cheap to add, narrow blast radius. Works for read-heavy paths. Watch CPU; you are doubling the work for one path.
The implementation. The handler does its normal work; in parallel (or sequentially after returning), it calls the new code with the same inputs. The new code runs, produces a result; the result is logged or compared, then dropped. The user sees only the old code's output.
The metric to watch. Latency on the OLD path. The new code can starve resources (connection pool, CPU, memory); if it does, the old path slows down. Many dark-launch failures are latency regressions on the old path that nobody noticed because nobody was monitoring the old path's latency.
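A minimal sketch of the parallel-call pattern, assuming a synchronous Python handler; `old_path`, `new_path`, and the request shape are illustrative placeholders. The key design choice is a dedicated, bounded executor for the shadow work, so a slow new path queues up in its own pool instead of starving the old path's resources:

```python
import concurrent.futures
import logging

log = logging.getLogger("dark_launch")

# Dedicated, bounded pool for the shadow path: if the new code is slow,
# work backs up here instead of competing with the old path.
_shadow_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def old_path(request):
    # Existing production logic (placeholder).
    return {"total": sum(request["items"])}

def new_path(request):
    # Rewritten logic under test (placeholder).
    return {"total": sum(sorted(request["items"]))}

def _shadow(request, canonical):
    try:
        candidate = new_path(request)
        if candidate != canonical:
            log.warning("discrepancy: old=%r new=%r", canonical, candidate)
    except Exception:
        # A crash in the new code must never reach the user.
        log.exception("shadow path raised")

def handle(request):
    response = old_path(request)                      # user-visible result
    _shadow_pool.submit(_shadow, request, response)   # fire and forget
    return response
```

The user-visible response is computed and returned before the shadow result is even looked at; the new path's output only ever lands in a log line.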
Pattern 2: shadow consumer
The new code subscribes to the production message stream as a passive consumer. It processes every event the old service does, but writes to a sandbox store. Best for stream-processing rewrites. Cleanest isolation; takes the longest to set up.
The pattern works because message-driven systems are inherently parallel-friendly. Adding another consumer group to a Kafka topic doesn't affect existing consumers. The new consumer processes everything the old one does, in parallel; comparing results is straightforward because both paths write to known stores.
The discipline. The shadow consumer must be SEPARATE from the production consumer in every dimension: separate consumer group, separate database, separate metrics, separate alerting. Mixing them creates contamination — the shadow's lag becomes correlated with production lag, and you can't tell which is the source of truth.
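A toy illustration of the shadow-consumer shape, with a plain list standing in for the Kafka topic; the consumer logic, stores, and event shapes are all made up for the sketch. Each consumer reads the full stream independently, as two consumer groups would on a real broker, and each writes only to its own store:

```python
# Simulated event stream standing in for a production topic.
topic = [
    {"user": "a", "amount": 10},
    {"user": "b", "amount": 5},
    {"user": "a", "amount": 7},
]

prod_store = {}    # canonical database
shadow_store = {}  # sandbox database: separate in every dimension

def production_consumer(event):
    # Existing aggregation logic (placeholder).
    prod_store[event["user"]] = prod_store.get(event["user"], 0) + event["amount"]

def shadow_consumer(event):
    # Rewritten logic under test; writes only to the sandbox.
    shadow_store[event["user"]] = shadow_store.get(event["user"], 0) + event["amount"]

# Independent reads of the same stream, like separate consumer groups.
for event in topic:
    production_consumer(event)
for event in topic:
    shadow_consumer(event)

# Comparison is trivial because each path wrote to a known store.
discrepancies = {k for k in prod_store if prod_store[k] != shadow_store.get(k)}
```

In a real deployment the separation extends to metrics and alerting too, so shadow lag can never masquerade as production lag.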
Pattern 3: async double-write
For write-heavy paths. The old code writes to the canonical store; an async job replays the same writes against the new store. Lets you validate that the new database accepts the production write pattern at production volume.
The async-replay pattern. Every write to the old store also queues a job to write to the new store. The async worker processes the queue with whatever lag it can sustain. The new store eventually has the same data; comparison queries verify equivalence.
The gotcha: ordering. If the queue isn't strictly ordered, writes can apply out of order to the new store, producing a different state than the old. Either use a queue that preserves per-entity order (Kafka partitioned by entity key, FIFO SQS with message groups) or include a version or idempotency token that lets the new store's logic handle out-of-order applies.
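A sketch of the version-token approach, with dictionaries standing in for the two stores; the names and the write API are invented for illustration. Each queued write carries a monotonically increasing version per key, and the replay worker drops any write older than what the new store already holds:

```python
old_store = {}
new_store = {}     # maps key -> (version, value)
replay_queue = []

def write(key, value, version):
    old_store[key] = value                       # canonical, ordered write
    replay_queue.append((key, value, version))   # queued for async replay

def replay_worker(jobs):
    for key, value, version in jobs:
        current = new_store.get(key)
        # Apply only if this write is newer than what the new store holds;
        # stale (out-of-order) deliveries are safely ignored.
        if current is None or version > current[0]:
            new_store[key] = (version, value)

write("acct-1", 100, version=1)
write("acct-1", 250, version=2)

# Simulate the queue delivering out of order.
replay_worker(reversed(replay_queue))
```

Despite the reversed delivery, the new store converges on the same final value as the old store, because the stale version-1 write is rejected.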
The first metric to watch
Not correctness. Latency on the old path. A common dark-launch failure: the new code path holds a shared resource (a connection pool, a CPU core, a file descriptor) that starves the old path. Latency on the canonical path goes up; users notice. Watch it before you watch any other metric.
The metrics dashboard for a dark launch should look like this. Top of the dashboard: old-path latency p95 (must NOT regress). Old-path error rate (must NOT regress). Then: new-path metrics for monitoring the new code. Putting the user-facing metrics first reflects the priority — the user shouldn't notice the dark launch at all.
The trap. Engineers monitor the new code's correctness obsessively (because that's the new code, that's what they wrote). Meanwhile, latency on the old path slowly creeps up because the new code is competing for resources. The new code is "correct" but the user is frustrated. Always lead with the old-path metrics.
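The old-path guard can be reduced to one check, sketched here with a nearest-rank p95 over recent latency samples; the baseline and the 10% tolerance are illustrative numbers you would tune for your own service:

```python
def p95(samples):
    # 95th percentile by nearest-rank; good enough for a dashboard tile.
    ordered = sorted(samples)
    return ordered[max(0, int(0.95 * len(ordered)) - 1)]

BASELINE_P95_MS = 120.0       # old-path p95 measured before the dark launch
REGRESSION_TOLERANCE = 1.10   # alert if p95 grows more than 10%

def old_path_regressed(latencies_ms):
    """True when the canonical path's p95 has drifted past tolerance."""
    return p95(latencies_ms) > BASELINE_P95_MS * REGRESSION_TOLERANCE
```

This check belongs at the top of the dashboard and in the alerting config: it fires on the user-facing path, regardless of how well the new code is doing.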
How long to dark launch
At least one full traffic cycle (typically 24 hours). Ideally a full week to capture weekly patterns. Several weeks for systems with monthly billing or quarterly cycles.
The reason for the duration. Dark launches catch issues that appear under specific traffic patterns. A 24-hour test catches the daily peak; a 7-day test catches the weekly batch jobs and weekend traffic; a 30-day test catches monthly cycles. Stopping early misses the patterns the dark launch was meant to catch.
The dwell time also lets you collect comparison data. After 7 days at even modest traffic (say, ten requests per second), you have millions of data points where the new code processed the same input as the old. Discrepancies (where new and old disagreed) cluster around interesting bug classes. With less than 24 hours of data, the discrepancies are too sparse to reveal patterns.
Common antipatterns
Dark launching forever. Team starts a dark launch, confirms it works, never moves to actual rollout. Six months later the dark launch is its own infrastructure to maintain. Always set a "promote or abandon" deadline — typically 4 weeks.
The dark launch with logs nobody reads. Discrepancies are logged at INFO level into a system the team doesn't monitor. Weeks pass; the team thinks the new code is correct because nothing has screamed. Set up alerting on the discrepancy rate, not just logging.
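Discrepancy-rate alerting can be as simple as a rolling window over recent comparisons; the window size and the 1% threshold below are illustrative, not recommendations:

```python
from collections import deque

WINDOW = 1000       # most recent old-vs-new comparisons to consider
ALERT_RATE = 0.01   # page someone above 1% disagreement

_window = deque(maxlen=WINDOW)

def record_comparison(matched: bool) -> bool:
    """Record one old-vs-new comparison; return True if the alert fires."""
    _window.append(0 if matched else 1)
    rate = sum(_window) / len(_window)
    return rate > ALERT_RATE
```

The point is that a mismatch becomes a page, not an unread log line: the check runs on every comparison, so a regression surfaces within one window of traffic.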
Dark launches that mutate state. The "new code path" performs side effects (sends emails, posts to webhooks) that you can't easily undo. Now the dark launch is no longer dark; users are receiving things. Strict isolation: if it has side effects, it can't be a dark launch.
Dark launches without a comparison harness. The new code runs in parallel; nobody compares the outputs. Six weeks in, the team is confident the new code "is fine" — but they have no quantitative basis. Comparison harnesses log the discrepancies; without them, you're flying blind.
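A comparison harness can be small; the hard part is normalization, so benign differences (key order, volatile fields) don't count as discrepancies. This sketch assumes JSON-like dict responses, and the volatile field names are placeholders:

```python
import json

VOLATILE_FIELDS = {"request_id", "timestamp"}  # differ legitimately per call

def normalize(response: dict) -> str:
    """Canonicalize a response so only meaningful differences remain."""
    cleaned = {k: v for k, v in response.items() if k not in VOLATILE_FIELDS}
    return json.dumps(cleaned, sort_keys=True)

def compare(old: dict, new: dict, tally: dict) -> None:
    """Count one comparison, and one mismatch if the paths disagree."""
    tally["total"] = tally.get("total", 0) + 1
    if normalize(old) != normalize(new):
        tally["mismatch"] = tally.get("mismatch", 0) + 1

tally = {}
compare({"total": 42, "request_id": "r1"}, {"request_id": "r9", "total": 42}, tally)
compare({"total": 42}, {"total": 43}, tally)
```

The tally is the quantitative basis the antipattern lacks: after six weeks you can say "N comparisons, M mismatches" instead of "it seems fine."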
What to do this week
Three moves. (1) Pick a service rewrite that's been parked because "we need to be sure it works at scale." Start a dark launch this sprint — even a basic parallel-call pattern unlocks the rewrite. (2) Add an old-path latency dashboard tile to your dark-launch monitoring; lead with it. (3) Set a 4-week deadline for any active dark launch. The deadline forces a decision: promote, extend with reason, or abandon.