CPU Bottleneck Diagnosis with Flame Graphs

Flame graphs are the single most useful CPU debugging tool. Reading them is teachable; capturing them is one command.

Why flame graphs

A flame graph visualises CPU time per call stack, so the hot path is visible at a glance.

Visualisation. Bar width is the share of CPU samples that include a frame; stack height is call depth; the picture conveys what a table of numbers cannot.
Pattern recognition. Skim the graph; the widest top-of-stack bar is almost always where to start optimising.
Cross-language. The format is universal; tools differ per language but the reading skill transfers.
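Production-safe. Modern sampling profilers are designed to run in production, and overhead at modest sampling rates is often quoted at around 1% or less, though it varies with sampling rate and stack depth; measure it on your own workload, and do not save profiling for incidents.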
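
To make "width" concrete, here is a minimal sketch of how bar widths fall out of sampled stacks. It assumes the folded-stack text format ("frame;frame;leaf count") that flame graph tooling commonly consumes; the sample data and function names are invented for illustration, and frames are aggregated by name rather than by position in the tree.

```python
from collections import Counter

# Hypothetical folded-stack samples: "root;...;leaf count".
FOLDED = """\
main;handle_request;parse_json 30
main;handle_request;render;escape_html 50
main;handle_request;render 10
main;gc_collect 10
"""

def frame_widths(folded: str):
    """Return (inclusive, self_time) sample counts per frame name."""
    inclusive, self_time = Counter(), Counter()
    for line in folded.splitlines():
        if not line.strip():
            continue
        stack, _, count = line.rpartition(" ")
        frames = stack.split(";")
        n = int(count)
        # A frame's bar width covers every sample whose stack contains it.
        for name in set(frames):
            inclusive[name] += n
        # Only the leaf frame was actually running on the CPU.
        self_time[frames[-1]] += n
    return inclusive, self_time

inclusive, self_time = frame_widths(FOLDED)
total = sum(self_time.values())
for name, n in self_time.most_common(3):
    print(f"{name}: {100 * n / total:.0f}% of samples on CPU")
```

The inclusive count corresponds to how wide a bar is drawn; the self count is what the "widest top-of-stack bar" rule is really ranking.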

Four-step workflow

1. Capture profile during representative load.
2. Generate flame graph.
3. Identify widest bars at top of stack.
4. Optimise and re-capture.
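
Step 4 can be sketched as a comparison of two captures. The example below assumes both profiles are available as folded-stack text ("frame;frame;leaf count"); the data and the helper names self_share and compare are hypothetical, not part of any profiler's API.

```python
from collections import Counter

def self_share(folded: str) -> dict[str, float]:
    """Fraction of samples in which each frame was the leaf (on-CPU) frame."""
    counts = Counter()
    for line in folded.splitlines():
        if not line.strip():
            continue
        stack, _, count = line.rpartition(" ")
        counts[stack.split(";")[-1]] += int(count)
    total = sum(counts.values()) or 1  # avoid dividing by zero on an empty profile
    return {name: n / total for name, n in counts.items()}

def compare(before: str, after: str) -> list[tuple[str, float]]:
    """Change in self-time share per frame, in percentage points, biggest drop first."""
    b, a = self_share(before), self_share(after)
    deltas = {
        name: 100 * (a.get(name, 0.0) - b.get(name, 0.0))
        for name in b.keys() | a.keys()
    }
    return sorted(deltas.items(), key=lambda kv: kv[1])

# Hypothetical captures before and after caching the HTML escaping step.
BEFORE = "main;render;escape_html 60\nmain;parse_json 40\n"
AFTER = (
    "main;render;escape_html 20\n"
    "main;parse_json 40\n"
    "main;render;cache_lookup 20\n"
)

for name, delta in compare(BEFORE, AFTER):
    print(f"{name:15s} {delta:+.1f} pp")
```

Shares are relative: in this made-up data parse_json has the same sample count in both captures, yet its share rises because the total shrank. Compare absolute sample counts or throughput alongside the shares.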

Language tools

Each language has its preferred profiler. The ergonomics differ; the output converges on a flame graph in every case.

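Linux perf. perf record -g, then perf script piped through stackcollapse-perf.pl and flamegraph.pl; cross-language; profiles native code directly, though readable stacks need symbols and frame pointers or other unwind information.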
async-profiler. JVM-specific; low overhead; produces flame graphs directly.
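Go. go test -cpuprofile plus go tool pprof; the flame graph view is in the browser UI started with go tool pprof -http (the interactive web command renders a call graph, not a flame graph).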
Python. py-spy for sampling; runs against a live PID without modifying the target process.
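
As a rough illustration of how little is involved in capture, the sketch below builds the command lines for two of the tools above. The helper capture_commands, the output file name profile.svg, and the 99 Hz sampling rate are choices made for this example; it also assumes stackcollapse-perf.pl and flamegraph.pl from the FlameGraph scripts are on PATH. The function only returns strings and runs nothing.

```python
def capture_commands(tool: str, pid: int, seconds: int = 30) -> list[str]:
    """Return shell command lines that profile `pid` and write profile.svg.

    Nothing is executed here; run the returned lines in a shell with
    sufficient privileges to attach to the target process.
    """
    if tool == "py-spy":
        # py-spy samples a live Python process and writes an SVG flame graph.
        return [
            f"py-spy record --pid {pid} --duration {seconds} --output profile.svg"
        ]
    if tool == "perf":
        return [
            # Sample call stacks (-g) at 99 Hz for the given duration.
            f"perf record -F 99 -g -p {pid} -- sleep {seconds}",
            # Fold the stacks and render them (assumes the scripts are on PATH).
            "perf script | stackcollapse-perf.pl | flamegraph.pl > profile.svg",
        ]
    raise ValueError(f"unsupported tool: {tool}")

for line in capture_commands("perf", pid=1234, seconds=10):
    print(line)
```

Check each flag against the installed version of the tool before relying on it in a runbook.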

False-positive checks

Not every wide bar is the bug. Three false-positive shapes recur, along with one blind spot; recognising them avoids optimising the wrong thing.

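Wide system call. The time is spent in the kernel on the application's behalf, so the call itself is rarely the bug; check whether the application issues it more often, or with smaller buffers, than it needs to.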
Wide GC frame. GC pressure, not GC bug; tune memory or reduce allocation rate, do not 'fix' GC.
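Wide framework call. Possibly expected; verify against a baseline before optimising, because framework time is often legitimate work done on your behalf.
Lock contention. This is the blind spot rather than a false positive: threads waiting on locks are off-CPU, so a standard CPU profile misses them; use an off-CPU flame graph.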
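
A first pass over the checks above can be sketched as a triage of on-CPU samples by leaf frame. The name patterns below are placeholders invented for this example; real frame names depend on the kernel, runtime, and framework in use, so adapt them before drawing conclusions. Input is assumed to be folded-stack text ("frame;frame;leaf count").

```python
from collections import Counter

# Hypothetical substrings matched against the lowercased leaf frame name.
# Order matters: the first category with a match wins.
CATEGORIES = {
    "syscall": ("sys_",),
    "gc": ("gc",),
    "framework": ("framework.",),
}

def triage(folded: str, categories=CATEGORIES) -> dict[str, float]:
    """Share of samples per category, judged by the leaf frame only."""
    totals = Counter()
    for line in folded.splitlines():
        if not line.strip():
            continue
        stack, _, count = line.rpartition(" ")
        leaf = stack.split(";")[-1].lower()
        label = "application"
        for category, needles in categories.items():
            if any(needle in leaf for needle in needles):
                label = category
                break
        totals[label] += int(count)
    grand_total = sum(totals.values()) or 1
    return {label: n / grand_total for label, n in totals.items()}

# Invented sample data for illustration.
SAMPLES = """\
main;handler;json_decode 40
main;handler;read;__x64_sys_read 25
main;runtime.gcBgMarkWorker 20
main;framework.dispatch 15
"""

for label, share in sorted(triage(SAMPLES).items(), key=lambda kv: -kv[1]):
    print(f"{label:12s} {share:.0%}")
```

A large syscall, GC, or framework share is a prompt to apply the matching check, not a verdict. Lock waits will not appear here at all, since the input is on-CPU samples only.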

Antipatterns

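One-shot capture without warmup. Misleading: startup costs such as JIT compilation, cold caches, and lazy initialisation dominate the samples.
Profile in dev with no load. Wrong picture: the hot path of an idle process is rarely the hot path under production traffic.
Optimise without re-profiling. The change may have made things worse, or only moved the hotspot elsewhere.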

What to do this week

Three moves. (1) Apply the four-step workflow to your slowest production endpoint. (2) Measure p99 latency before and after. (3) Document the win and ship the runbook so the team can reproduce it.
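
For move (2), a minimal sketch of a p99 comparison using the nearest-rank definition of a percentile. The latency samples are invented; in practice they would come from your metrics system, which may use a different percentile definition, and you would want far more than ten samples.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical request latencies in milliseconds.
before_ms = [12, 15, 14, 13, 250, 16, 14, 13, 15, 400]
after_ms = [12, 14, 13, 13, 90, 15, 14, 12, 14, 120]

p99_before = percentile(before_ms, 99)
p99_after = percentile(after_ms, 99)
print(f"p99 before: {p99_before} ms, after: {p99_after} ms")
```

With so few samples the p99 is simply the maximum, which is why a real comparison needs a sample count well above one hundred per capture.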