Linux Perf Cheat Sheet
Brendan Gregg's USE method, condensed into the commands you actually run when a Linux box is misbehaving.
First-minute scan
Brendan Gregg's classic 60-second checklist. Run these in order and you'll have a working hypothesis before you finish typing the last one.
- `uptime`: load averages (1, 5, 15 min); if the 1 min average is far above the 15 min one, load is climbing
- `dmesg | tail`: kernel messages; OOM kills, hardware errors, and NIC resets land here
- `vmstat 1 5`: five 1-second samples; watch `r`, `b`, `si`, `so`, `us`, `sy`, `id`, `wa`
- `mpstat -P ALL 1 3`: per-CPU breakdown; uneven load means a single hot core
- `pidstat 1 3`: per-process CPU over 3 seconds
- `iostat -xz 1 3`: per-device IO; watch `%util` and `await`
- `free -m`: memory in MB; `available` is what matters, not `free`
- `sar -n DEV 1 3`: network throughput per interface
- `sar -n TCP,ETCP 1 3`: TCP retransmits and errors
- `top` or `htop`: interactive view to confirm
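The whole scan can be scripted. A minimal sketch (the wrapper and its `run` helper are mine, not part of the canonical checklist) that runs each command in order and skips tools that aren't installed; sysstat provides `mpstat`, `pidstat`, `iostat`, and `sar`:

```shell
#!/bin/sh
# first-minute.sh: run the 60-second checklist top to bottom.
# Tools from sysstat (mpstat, pidstat, iostat, sar) may be missing,
# so each command is skipped gracefully when not installed.
run() {
  tool=${1%% *}                    # first word of the command, e.g. "vmstat"
  if command -v "$tool" >/dev/null 2>&1; then
    printf '\n=== %s ===\n' "$1"
    sh -c "$1"
  else
    printf '\n=== %s: not installed, skipped ===\n' "$tool"
  fi
}
run 'uptime'
run 'dmesg | tail'                 # may need root on locked-down kernels
run 'vmstat 1 5'
run 'mpstat -P ALL 1 3'
run 'pidstat 1 3'
run 'iostat -xz 1 3'
run 'free -m'
run 'sar -n DEV 1 3'
run 'sar -n TCP,ETCP 1 3'
```

The interactive `top`/`htop` step is left out on purpose; it doesn't belong in a batch script.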
CPU
Which core, which process, which kind of work: user, system, IO wait, steal.
- `top`: sort by CPU with `P`, by memory with `M`
- `htop`: color, scrolling, tree view; install it
- `mpstat -P ALL 1`: per-CPU; `%steal` > 0 means the hypervisor is taking your time
- `pidstat 1`: per-process per-second CPU
- `uptime`: load average = runnable + uninterruptible processes, not just CPU
- `cat /proc/loadavg`: same numbers, scriptable
- `perf top`: live profile of where kernel and user code spend CPU
- `perf record -F 99 -p <pid> -g -- sleep 30 && perf report`: flamegraph-ready stack samples
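The counters behind `mpstat` live in `/proc/stat` as cumulative jiffies, so overall busy time is just the delta between two samples. A sketch (the `cpu_pct` helper name is mine) that computes busy percent from two `cpu` lines:

```shell
# cpu_pct: given two "cpu ..." lines sampled from /proc/stat, print the
# percentage of time the CPUs were busy over the interval. Fields after
# "cpu" are cumulative jiffies; fields 5 and 6 are idle and iowait.
cpu_pct() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    { total = 0
      for (i = 2; i <= NF; i++) total += $i
      idle = $5 + $6
      if (NR == 1) { t0 = total; i0 = idle }
      else printf "%.0f\n", 100 * ((total - t0) - (idle - i0)) / (total - t0)
    }'
}

# live usage (one 1-second sample):
#   s0=$(grep '^cpu ' /proc/stat); sleep 1; s1=$(grep '^cpu ' /proc/stat)
#   cpu_pct "$s0" "$s1"
```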
Memory
The `free` output confuses everyone at least once: page cache is not "used" in any meaningful sense.
- `free -h`: human-readable; `available` = what apps can actually claim
- `vmstat 1`: `si`/`so` > 0 means swapping (bad)
- `cat /proc/meminfo`: every counter; `MemAvailable`, `SwapFree`, `Dirty`, `Slab`
- `ps aux --sort=-rss | head`: top RSS consumers
- `smem -tk`: proportional set size; better than RSS for shared-memory apps
- `slabtop`: kernel slab allocations; useful when "memory is gone but no process owns it"
- `dmesg | grep -i kill`: recent OOM kills with the killed PID
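The `available` column in `free` comes from `MemAvailable` in `/proc/meminfo`. A sketch (the helper name is mine) that turns it into a percentage you can alert on:

```shell
# mem_avail_pct: read /proc/meminfo-format text on stdin and print
# MemAvailable as a percentage of MemTotal, i.e. roughly how much RAM
# applications could still claim before the kernel reclaims hard.
mem_avail_pct() {
  awk '/^MemTotal:/ { t = $2 } /^MemAvailable:/ { a = $2 }
       END { printf "%.0f\n", 100 * a / t }'
}

# live usage:
#   mem_avail_pct < /proc/meminfo
```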
Disk IO
`%util` at 100% doesn't mean "saturated" on SSDs; it means the device had at least one request in flight at every sample. Look at `await` and queue depth instead.
- `iostat -xz 1`: extended per-device stats, hiding idle devices
- `iotop`: per-process IO (needs root)
- `biotop` (bcc-tools): per-process block IO with latency
- `biolatency` (bcc-tools): block IO latency histogram
- `df -h`: filesystem usage
- `df -i`: inode usage; "disk full" while `df -h` shows free space is usually inodes
- `du -sh /var/log/*`: find what's eating disk
- `lsof | head`: open files; massive output, pipe through `grep`
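The `du -sh /var/log/*` pattern generalizes into a small helper (the name `biggest` is mine) that ranks the largest entries under any directory:

```shell
# biggest: print the N largest entries directly under a directory
# (defaults: current dir, top 5). Sizes come from du in KB and are
# shown in MB; paths containing spaces get truncated by awk.
biggest() {
  dir=${1:-.}; n=${2:-5}
  du -sk -- "$dir"/* 2>/dev/null | sort -rn | head -n "$n" |
    awk '{ printf "%10.1f MB  %s\n", $1 / 1024, $2 }'
}

# usage: biggest /var/log 10
```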
Network
Replace `netstat` with `ss`. It's faster, the flags are saner, and it ships on anything modern.
- `ss -tunap`: TCP and UDP, all states, with PIDs
- `ss -ltn`: listening TCP sockets, numeric
- `ss -s`: summary counts by state
- `ss -tn state established '( dport = :443 or sport = :443 )'`: filter by state and port
- `sar -n DEV 1`: per-interface throughput
- `sar -n TCP,ETCP 1`: retransmits and resets
- `tcpdump -i any -nn -s 0 port 443`: raw packets; add `-w file.pcap` to save a capture for Wireshark
- `mtr <host>`: traceroute + ping over time
- `ethtool -S eth0 | grep -i drop`: NIC-level drops
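`ss -s` gives the summary, but parsing `ss -tan` yourself lets you watch one state in particular. A sketch (the function name `conn_states` is mine):

```shell
# conn_states: read `ss -tan` output on stdin and count TCP sockets by
# state (ESTAB, TIME-WAIT, CLOSE-WAIT, ...), most common first.
# A growing pile of CLOSE-WAIT usually means an app is leaking sockets.
conn_states() {
  awk 'NR > 1 { c[$1]++ } END { for (s in c) print c[s], s }' | sort -rn
}

# live usage:
#   ss -tan | conn_states
```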
Per-process
When you've narrowed to one PID, these are the deep-dive tools.
- `cat /proc/<pid>/status`: VmRSS, VmSize, threads, state
- `cat /proc/<pid>/limits`: the ulimits the process is actually running under
- `ls -l /proc/<pid>/fd | wc -l`: open file descriptors
- `strace -p <pid> -f -e trace=open,read,write -o trace.out`: syscalls; expensive, use briefly
- `ltrace -p <pid>`: library calls
- `lsof -p <pid>`: files, sockets, and pipes the process holds
- `gdb -p <pid>` then `thread apply all bt`: stack traces of every thread
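The fd-count and limits checks combine naturally into one command. A sketch (the helper name `fd_pressure` is mine) showing open descriptors against the soft limit:

```shell
# fd_pressure: print a process's open file-descriptor count next to its
# soft "Max open files" limit, read from /proc/<pid>/fd and
# /proc/<pid>/limits (field 4 of that line is the soft limit).
fd_pressure() {
  pid=$1
  open=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
  limit=$(awk '/^Max open files/ { print $4 }' "/proc/$pid/limits" 2>/dev/null)
  printf '%s open of %s allowed\n' "$open" "${limit:-?}"
}

# usage: fd_pressure 1234
```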
Deeper tools
When the basics don't answer, eBPF tools usually do.
- `execsnoop`: every new process system-wide
- `opensnoop`: every `open()` syscall
- `tcplife`: connection lifetimes with throughput
- `runqlat`: scheduler run-queue latency histogram
- `profile` (bcc-tools): CPU profiling without the perf overhead
- `perf sched record -- sleep 10` then `perf sched latency`: scheduling latencies per task
- `bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); }'`: syscall counts by probe, a rough heatmap
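The bpftrace one-liner needs root and an installed bpftrace, so it's worth wrapping. A sketch (the wrapper and its name `syscall_top` are mine; only the one-liner itself is from the sheet):

```shell
# syscall_top: run the bpftrace syscall-count one-liner for N seconds,
# bailing out cleanly when bpftrace or root privileges are missing.
syscall_top() {
  secs=${1:-5}
  command -v bpftrace >/dev/null 2>&1 || { echo 'bpftrace not installed' >&2; return 1; }
  [ "$(id -u)" -eq 0 ] || { echo 'needs root' >&2; return 1; }
  # SIGINT makes bpftrace print its maps before exiting
  timeout -s INT "$secs" \
    bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); }'
}

# usage: syscall_top 10
```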