Performance & Capacity Practical By Samson Tanimawo, PhD Published Nov 13, 2025 4 min read

Benchmarking Discipline

Benchmarks only mean something when they are reproducible.

Setup discipline

Reproducible environment. Same hardware, same OS, same dependency versions, same data. Differences invalidate comparisons.

Warm-up period before measurement. JVM JIT compilation, the OS page cache, and connection pools all need time to settle. Measure steady-state, not cold-start.

Repeat the test. A single run is noise. Five runs minimum; report the mean and standard deviation.
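The warm-up and repeat rules can be sketched as a small harness. This is an illustrative helper, not a standard library; the run counts are the defaults suggested above.

```python
import statistics
import time

def benchmark(fn, *, warmup_runs=2, measured_runs=5):
    """Time fn over repeated runs: discard warm-up runs (JIT, caches,
    pools settling), then return mean and standard deviation of
    wall-clock seconds across the measured runs."""
    for _ in range(warmup_runs):
        fn()
    samples = []
    for _ in range(measured_runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)
```

Reporting both numbers, not just the mean, is what makes the later significance check possible.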

What to measure

Latency percentiles: p50, p95, p99, p99.9. The mean is misleading because tail latency is what users actually feel.
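Percentiles are cheap to compute from raw samples. A minimal sketch using the nearest-rank method (function names are illustrative):

```python
def percentile(sorted_samples, p):
    """Nearest-rank percentile: the smallest sample that is greater
    than or equal to p percent of all samples."""
    if not sorted_samples:
        raise ValueError("no samples")
    k = max(0, -(-len(sorted_samples) * p // 100) - 1)  # ceil(n*p/100) - 1
    return sorted_samples[int(k)]

def latency_report(latencies_ms):
    """Summarise raw latency samples at the tail percentiles."""
    s = sorted(latencies_ms)
    return {f"p{p}": percentile(s, p) for p in (50, 95, 99, 99.9)}
```

Keep the raw samples around, too; percentiles computed per-run cannot be averaged across runs meaningfully.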

Throughput: requests per second sustained over the test window. Not peak; sustained.

Resource utilisation: CPU, memory, IO, network. Helps identify bottlenecks beyond raw latency.
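A quick way to tell CPU-bound from IO-bound without external tooling is to compare process CPU time against wall-clock time. A rough sketch; the function name is illustrative and this only sees the benchmarking process itself, not system-wide load:

```python
import time

def cpu_utilisation(fn):
    """Rough CPU utilisation of fn: process CPU time divided by
    wall-clock time. Near 1.0 suggests CPU-bound; much lower
    suggests the time went to IO, network, or waiting."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return cpu / wall if wall > 0 else 0.0
```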

Realistic load patterns

Match production traffic shape. If production is 80% reads / 20% writes, the benchmark must match. Pure-write benchmarks rarely reflect reality.
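The 80/20 mix above can be generated reproducibly with a seeded RNG, so the same operation sequence replays across runs. A minimal sketch (the fractions and function name are illustrative):

```python
import random

def traffic_mix(read_fraction=0.8, seed=42):
    """Yield an endless stream of 'read'/'write' operations matching
    a production-like mix. Seeded so every benchmark run sees the
    identical sequence."""
    rng = random.Random(seed)
    while True:
        yield "read" if rng.random() < read_fraction else "write"
```

Seeding matters: an unseeded mix adds its own run-to-run noise on top of the system's.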

Burst and steady-state both. Steady-only benchmarks miss the burst behaviour where most production issues hide.
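A burst profile can be expressed as a per-second target-rate schedule that the load generator follows. A sketch with illustrative parameters:

```python
def burst_schedule(base_rps=100, burst_rps=500,
                   burst_every_s=60, burst_len_s=5):
    """Yield (second, target_rps) pairs: steady base load with a
    short burst at the start of every interval."""
    second = 0
    while True:
        in_burst = (second % burst_every_s) < burst_len_s
        yield second, burst_rps if in_burst else base_rps
        second += 1
```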

Multi-client load. Single-threaded clients miss connection-level effects. Use multiple clients across multiple machines for realism.
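Within a single machine, concurrent clients can at least be approximated with a thread pool; this does not replace multiple load-generator machines, but it does exercise connection-level contention that a single-threaded client misses. An illustrative sketch:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_clients(request_fn, clients=8, requests_per_client=100):
    """Drive request_fn from several concurrent client threads and
    collect every per-request latency in seconds."""
    def client():
        samples = []
        for _ in range(requests_per_client):
            start = time.perf_counter()
            request_fn()
            samples.append(time.perf_counter() - start)
        return samples
    with ThreadPoolExecutor(max_workers=clients) as pool:
        futures = [pool.submit(client) for _ in range(clients)]
        return [lat for f in futures for lat in f.result()]
```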

Comparing variants

Change a single variable between runs. Changing two things at once means you can't attribute the difference.

Same load generator, same test duration, same hardware. Anything different is a confounding variable.

Statistical significance. The difference must exceed the noise floor across runs. A 2% change with 5% standard deviation is noise.
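The noise-floor rule can be made concrete with a crude standard-error check. This is not a proper t-test, just a sketch that flags the "2% change with 5% standard deviation" case as noise (the function name and the factor k are illustrative):

```python
import math
import statistics

def exceeds_noise(samples_a, samples_b, k=2.0):
    """Return True if the difference in means between two benchmark
    variants exceeds k standard errors of that difference."""
    mean_a = statistics.mean(samples_a)
    mean_b = statistics.mean(samples_b)
    var_a = statistics.variance(samples_a)
    var_b = statistics.variance(samples_b)
    se = math.sqrt(var_a / len(samples_a) + var_b / len(samples_b))
    return abs(mean_a - mean_b) > k * se
```

For anything that will drive a real decision, reach for a proper statistical test rather than this rule of thumb.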

Reporting results

Include configuration. Hardware, software versions, settings. Future readers can reproduce the run only if they have all of this.

Include load shape. RPS, request size distribution, payload type. The numbers are meaningless without context.

Include caveats. What's not measured. What was hard to control. The honest report builds more trust than the polished one.
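The three reporting rules above fit naturally into a single structured artifact checked in next to the results. A sketch; the field names are illustrative, not a standard schema:

```python
import json
import platform
import sys

def benchmark_report(results, load_shape, caveats):
    """Bundle benchmark results with the environment, load shape,
    and caveats needed to reproduce and trust them."""
    return json.dumps({
        "environment": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
        "load_shape": load_shape,   # e.g. {"rps": 500, "read_pct": 80}
        "results": results,         # e.g. {"p50_ms": 12, "p99_ms": 85}
        "caveats": caveats,         # what wasn't measured or controlled
    }, indent=2)
```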