Benchmarking Discipline
Reproducible benchmarks.
Setup discipline
Benchmarking is the discipline of producing numbers that hold up to scrutiny. Setup is where most benchmarks already break: inconsistent environment, no warm-up, single run, missing config. Get the setup right and the numbers become trustworthy.
- Reproducible environment. Same hardware, OS, dependency versions, and data per run; differences invalidate the comparison silently.
- Warm-up period. Let JIT compilation, the OS page cache, and connection pools warm up before each measured run; measure steady state rather than cold-start artifacts.
- Repeat the test. Five runs minimum per comparison; a single run cannot be told apart from noise. Report the mean and standard deviation across runs, not the best number.
- Captured config per run. Saved settings per run support later reproduction; "I forgot what I tested" wastes the benchmark.
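The setup points above can be sketched as a small harness. This is a minimal illustration rather than a prescribed tool: the function name `run_benchmark`, its parameters, and the fields of the returned record are all assumptions made for the example. It discards warm-up iterations, times a fixed number of runs, and stores the configuration and environment next to the numbers.

```python
import json
import platform
import statistics
import sys
import time


def run_benchmark(workload, warmup_iters=3, runs=5, config=None):
    """Time `workload` after a warm-up phase and summarise the measured runs.

    `workload` is any zero-argument callable (hypothetical stand-in for the
    system under test). `config` is whatever settings were used for this run.
    """
    if runs < 2:
        raise ValueError("need at least two runs to compute a standard deviation")

    # Warm-up iterations are executed but never recorded.
    for _ in range(warmup_iters):
        workload()

    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        durations.append(time.perf_counter() - start)

    return {
        # Saved with the result so the run can be reproduced later.
        "config": config or {},
        "environment": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
            "machine": platform.machine(),
        },
        "warmup_iters": warmup_iters,
        "runs": runs,
        "durations_s": durations,
        "mean_s": statistics.mean(durations),
        "stdev_s": statistics.stdev(durations),
    }


if __name__ == "__main__":
    result = run_benchmark(
        lambda: sorted(range(100_000, 0, -1)),
        config={"n": 100_000},
    )
    print(json.dumps(result, indent=2))
```

Keeping the raw durations in the record, not only the mean and standard deviation, lets a later reader recompute any statistic they prefer.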
What to measure
What you measure shapes the conclusions. Latency tails, sustained throughput, and resource utilisation each tell a different part of the story; missing any one produces benchmarks that mislead.
- Latency percentiles. p50, p95, p99, p99.9 per run; mean is misleading because user experience is shaped by the tails.
- Throughput. Sustained requests per second over the whole test window, not the peak; a peak figure usually reflects a short burst the system cannot hold.
- Resource utilisation. CPU, memory, IO, network per run; helps identify bottlenecks beyond raw latency numbers.
- Error rate per run. Failed-request percentage; catches benchmarks that "scaled" by silently dropping work.
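As a rough sketch of how these figures might be derived from one run, the helper below computes nearest-rank percentiles, sustained throughput, and error rate. The names `percentile` and `summarise`, and the input shape (a list of successful-request latencies in milliseconds plus a failure count), are assumptions for illustration; nearest-rank is only one of several percentile definitions in use.

```python
import math
import statistics


def percentile(sorted_values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not sorted_values:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarise(latencies_ms, failures, window_s):
    """Summarise one run.

    latencies_ms: latencies of successful requests only
    failures:     number of failed requests in the same window
    window_s:     length of the measurement window in seconds
    """
    ordered = sorted(latencies_ms)
    total = len(ordered) + failures
    return {
        "mean_ms": statistics.mean(ordered),  # reported for contrast with the tails
        "p50_ms": percentile(ordered, 50),
        "p95_ms": percentile(ordered, 95),
        "p99_ms": percentile(ordered, 99),
        "p99.9_ms": percentile(ordered, 99.9),
        # Sustained throughput: successful requests over the whole window.
        "throughput_rps": len(ordered) / window_s,
        "error_rate": failures / total,
    }
```

Counting only successful requests toward throughput means a system that drops work shows up as both a lower throughput and a higher error rate, instead of appearing to scale.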
Realistic load patterns
Realistic load patterns are where benchmarks earn or lose credibility. Pure-write benchmarks rarely reflect reality; single-thread load misses connection-level effects; synthetic payloads hide real-payload artifacts.
- Match production traffic shape. Read/write ratio per benchmark; pure-write benchmarks rarely reflect production behaviour.
- Burst and steady-state both. Run each benchmark in both modes; benchmarks that only run at a steady rate miss the burst behaviour where most production issues hide.
- Multi-client load. Multiple machines and threads per benchmark; single-threaded clients miss connection-level effects that production hits.
- Realistic payload per benchmark. Production-shape payload size; synthetic-payload artifacts produce numbers that do not transfer.
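One way to make a load shape explicit and replayable is to generate the request schedule up front. The sketch below is illustrative only: `build_schedule` and its parameters are hypothetical names, and the payload sizes are assumed to be sampled from production traffic by the caller. It produces a steady phase followed by a burst phase, with a configurable read/write ratio.

```python
import random


def build_schedule(steady_rps, burst_rps, steady_s, burst_s,
                   read_ratio, payload_sizes, seed=0):
    """Return a list of (second, operation, payload_bytes) tuples.

    The first `steady_s` seconds run at `steady_rps`; the following
    `burst_s` seconds run at `burst_rps`. `payload_sizes` is a list of
    sizes to draw from, assumed to reflect production payloads.
    """
    rng = random.Random(seed)  # fixed seed so the same schedule can be replayed
    schedule = []
    for second in range(steady_s + burst_s):
        rate = steady_rps if second < steady_s else burst_rps
        for _ in range(rate):
            op = "read" if rng.random() < read_ratio else "write"
            size = rng.choice(payload_sizes)
            schedule.append((second, op, size))
    return schedule


if __name__ == "__main__":
    plan = build_schedule(steady_rps=5, burst_rps=20, steady_s=3, burst_s=1,
                          read_ratio=0.8, payload_sizes=[512, 2048, 16384])
    reads = sum(1 for _, op, _ in plan if op == "read")
    print(f"{len(plan)} requests, {reads} reads")
```

To cover the multi-client point, the same schedule could be partitioned across several machines or worker threads; that part is left out here.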
Comparing variants
Comparing variants is its own discipline. One variable per comparison, same conditions, statistical significance check; without those, the comparison number is folklore.
- Single variable change. One thing at a time per comparison; changing two things means you cannot attribute the difference to either.
- Same load generator. Same generator, duration, hardware per comparison; anything different is a confounding variable.
- Statistical significance. Noise-floor check per comparison; a 2% change measured against a 5% run-to-run standard deviation is noise, not signal, at five runs per variant.
- Documented hypothesis per comparison. Write down the expected result before the run; a named expectation guards against confirmation bias.
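The noise-floor check can be done in several ways. The sketch below uses an exact two-sided permutation test on the difference of means, chosen here only because it needs nothing beyond the standard library and makes no distributional assumption; it is not the only valid choice, and the function name is hypothetical. With five runs per variant it enumerates all 252 possible splits of the pooled measurements.

```python
import itertools
import statistics


def permutation_p_value(baseline, candidate):
    """Exact two-sided permutation test on the difference of means.

    Returns the fraction of all regroupings of the pooled measurements
    whose mean difference is at least as large as the observed one.
    Suitable for small samples only: the number of splits grows quickly.
    """
    observed = abs(statistics.mean(candidate) - statistics.mean(baseline))
    pooled = list(baseline) + list(candidate)
    n = len(baseline)
    m = len(pooled) - n
    total_sum = sum(pooled)

    extreme = 0
    count = 0
    for idx in itertools.combinations(range(len(pooled)), n):
        sum_a = sum(pooled[i] for i in idx)
        mean_a = sum_a / n
        mean_b = (total_sum - sum_a) / m
        # Small tolerance so ties are not lost to floating-point rounding.
        if abs(mean_b - mean_a) >= observed - 1e-12:
            extreme += 1
        count += 1
    return extreme / count


if __name__ == "__main__":
    baseline = [100, 101, 99, 100, 102]   # made-up latencies, for illustration
    candidate = [90, 91, 89, 90, 92]
    print(permutation_p_value(baseline, candidate))
```

A small p-value says the observed difference would be unusual if the variant label made no difference; it says nothing about whether the difference is large enough to matter.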
Reporting results
Reporting is where the work lands. Configuration, load shape, and caveats together let future readers reproduce or argue with the numbers; without them the benchmark becomes a marketing claim.
- Include configuration. Hardware, software versions, settings per report; future readers can reproduce the result only if they have this.
- Include load shape. RPS, request size distribution, payload type per report; numbers are meaningless without context.
- Include caveats. Not-measured and hard-to-control items per report; the honest report builds more trust than the polished one.
- Raw data link per report. Published raw runs support independent verification; benchmarks that only publish summaries do not survive scrutiny.
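A report builder can enforce the checklist above by refusing to emit a report with a section missing. This is a minimal sketch under assumed names: `build_report`, `REQUIRED_FIELDS`, and the JSON layout are illustrative, not an established format.

```python
import json

# Sections the report must carry, mirroring the checklist above.
REQUIRED_FIELDS = ("configuration", "load_shape", "caveats", "raw_data_link")


def build_report(summary, **sections):
    """Return a JSON report, or raise if any required section is missing or empty."""
    missing = [name for name in REQUIRED_FIELDS if not sections.get(name)]
    if missing:
        raise ValueError(f"report is missing: {', '.join(missing)}")
    report = {"summary": summary}
    for name in REQUIRED_FIELDS:
        report[name] = sections[name]
    return json.dumps(report, indent=2, sort_keys=True)


if __name__ == "__main__":
    # All values below are placeholders for illustration.
    print(build_report(
        {"p99_ms": 42.0, "throughput_rps": 1200.0},
        configuration={"cpu": "8 cores", "runtime": "example 1.0"},
        load_shape={"rps": 1200, "read_ratio": 0.8, "payload_bytes": [512, 2048]},
        caveats=["network latency between hosts not measured"],
        raw_data_link="raw/runs.csv",
    ))
```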