Benchmarking Discipline
Notes on making benchmark results reproducible and worth trusting.
Setup discipline
Reproducible environment. Same hardware, same OS, same dependency versions, same data. Any difference between runs invalidates the comparison.
Warm-up period before measurement. JIT compilation, OS page caches, and connection pools all need time to stabilise. Measure steady-state, not cold-start.
Repeat the test. A single run is noise. Five runs minimum; report mean and standard deviation.
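A minimal harness sketch in Python, assuming a hypothetical run_workload() stands in for the system under test: warm up first, then time several measured runs and report mean and standard deviation.

import statistics
import time


def run_workload() -> None:
    # Hypothetical placeholder; replace with a call into the system under test.
    sum(i * i for i in range(100_000))


def benchmark(warmup_runs: int = 3, measured_runs: int = 5) -> tuple[float, float]:
    # Warm-up: let JIT compilers, caches, and connection pools reach steady state.
    for _ in range(warmup_runs):
        run_workload()

    # Measured runs: never trust a single sample.
    durations = []
    for _ in range(measured_runs):
        start = time.perf_counter()
        run_workload()
        durations.append(time.perf_counter() - start)

    return statistics.mean(durations), statistics.stdev(durations)


mean, stdev = benchmark()
print(f"mean={mean:.4f}s stdev={stdev:.4f}s")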
What to measure
Latency percentiles: p50, p95, p99, p99.9. The mean is misleading because the tail is what users actually experience.
Throughput: requests per second sustained over the test window. Not peak; sustained.
Resource utilisation: CPU, memory, IO, network. Helps identify bottlenecks beyond raw latency.
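A sketch of how these summaries might be computed from raw samples, assuming the load generator has already collected one latency per completed request (latencies_s) and the length of the test window (window_s); both names are illustrative.

def percentile(sorted_samples: list[float], p: float) -> float:
    # Nearest-rank style percentile; adequate for benchmark reporting.
    idx = min(len(sorted_samples) - 1, round(p / 100 * (len(sorted_samples) - 1)))
    return sorted_samples[idx]


def summarise(latencies_s: list[float], window_s: float) -> dict[str, float]:
    samples = sorted(latencies_s)
    return {
        "p50": percentile(samples, 50),
        "p95": percentile(samples, 95),
        "p99": percentile(samples, 99),
        "p99.9": percentile(samples, 99.9),
        "throughput_rps": len(samples) / window_s,  # sustained over the window, not peak
    }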
Realistic load patterns
Match production traffic shape. If production is 80% reads / 20% writes, the benchmark must match. Pure-write benchmarks rarely reflect reality.
Burst and steady-state both. Benchmarks that run only at a steady rate miss the burst behaviour where most production issues hide.
Multi-client load. Single-threaded clients miss connection-level effects. Use multiple clients across multiple machines for realism.
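A sketch of such a load shape, assuming an 80/20 read/write mix with alternating steady and burst phases; do_read and do_write are hypothetical hooks into the system under test, and a real generator (wrk, k6, Locust) would handle pacing and multi-machine coordination properly.

import random
import time


def do_read() -> None: ...   # hypothetical read request
def do_write() -> None: ...  # hypothetical write request


def run_phase(duration_s: float, target_rps: float, read_ratio: float = 0.8) -> None:
    interval = 1.0 / target_rps
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        (do_read if random.random() < read_ratio else do_write)()
        time.sleep(interval)  # crude pacing; real generators schedule independently of response time


def run_load() -> None:
    for _ in range(3):
        run_phase(duration_s=60, target_rps=200)   # steady state
        run_phase(duration_s=10, target_rps=1000)  # burst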
Comparing variants
Single variable change between runs. Changing two things at once means you can't attribute the difference.
Same load generator, same test duration, same hardware. Anything different is a confounding variable.
Statistical significance. The difference must exceed the noise floor across runs. A 2% change with 5% standard deviation is noise.
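One way to apply that rule of thumb, with illustrative per-run numbers; a proper significance test (e.g. Welch's t-test) is the more rigorous option.

import statistics


def is_significant(baseline: list[float], candidate: list[float]) -> bool:
    # Treat a difference as real only if it clears the run-to-run noise floor.
    diff = abs(statistics.mean(candidate) - statistics.mean(baseline))
    noise = max(statistics.stdev(baseline), statistics.stdev(candidate))
    return diff > noise


baseline_runs = [101.0, 99.5, 100.2, 100.8, 99.9]    # illustrative p95 latencies per run, ms
candidate_runs = [100.4, 99.8, 100.9, 100.1, 100.6]
print(is_significant(baseline_runs, candidate_runs))  # False: the change is within noise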
Reporting results
Include configuration. Hardware, software versions, settings. Future readers can reproduce the result only if they have this.
Include load shape. RPS, request size distribution, payload type. The numbers are meaningless without context.
Include caveats. What was not measured, and what was hard to control. An honest report builds more trust than a polished one.
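A sketch of a result record that carries this context alongside the numbers so the run can be reproduced later; the field names are illustrative, not a standard schema.

import json
import platform
import sys


def build_report(results: dict, settings: dict, load_shape: dict, caveats: list[str]) -> str:
    report = {
        "environment": {
            "machine": platform.machine(),
            "os": platform.platform(),
            "runtime": sys.version.split()[0],
        },
        "settings": settings,      # every knob that could affect the result
        "load_shape": load_shape,  # RPS, read/write mix, payload sizes
        "results": results,        # percentiles, throughput, utilisation
        "caveats": caveats,        # what was not measured or controlled
    }
    return json.dumps(report, indent=2)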