PromQL Essentials Cheat Sheet
Twenty patterns cover 90% of dashboards. Memorise these and you'll write better queries than most teams' Grafana folders.
Rates and counters
Counters only go up. You almost always want a per-second rate, not the raw value. Pick the function that matches your use case, and a range window that fits your scrape interval.
- `rate(http_requests_total[5m])`, average per-second rate over 5 minutes; use this 95% of the time
- `irate(http_requests_total[5m])`, instant rate from the last two samples; spikier, useful for fast-moving signals
- `increase(http_requests_total[1h])`, total events in the last hour (rate × window)
- `rate(...[5m]) * 60`, per-minute rate when stakeholders prefer "requests/minute"
- Range must be at least 4× the scrape interval, or the rate gets noisy
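For concreteness, here are the three counter functions side by side on the same metric; the `job="api"` selector is an assumption for illustration:

```promql
# Smooth per-second rate over the last 5 minutes: the default for dashboards.
rate(http_requests_total{job="api"}[5m])

# Rate from only the last two samples in the window: reacts fast, looks spiky.
irate(http_requests_total{job="api"}[5m])

# Total requests over the hour, extrapolated to the window edges
# (roughly rate(...[1h]) * 3600).
increase(http_requests_total{job="api"}[1h])
```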
Aggregation
`by` keeps the listed labels and drops everything else; `without` drops the listed labels and keeps everything else. Always be explicit: the default of dropping all labels is rarely what you want.
- `sum(rate(http_requests_total[5m])) by (service)`, total RPS per service
- `sum(rate(http_requests_total[5m])) by (service, status)`, RPS broken down by status code
- `avg by (service) (rate(http_requests_total[5m]))`, average across instances
- `max by (pod) (container_memory_working_set_bytes)`, heaviest pod's memory
- `topk(5, sum by (service) (rate(http_requests_total[5m])))`, top 5 noisiest services
- `count by (job) (up == 1)`, how many targets are healthy per job
- `sum without (instance) (rate(...))`, collapse instances, keep everything else
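A sketch of `by` versus `without` on the same query; the label names being dropped (`instance`, `pod`) are assumptions about what your series carry:

```promql
# Keep only service and status; every other label is aggregated away.
sum by (service, status) (rate(http_requests_total[5m]))

# Drop only instance and pod; any other labels (env, region, ...) survive.
sum without (instance, pod) (rate(http_requests_total[5m]))
```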
Histograms and quantiles
Histograms are pre-bucketed. `histogram_quantile` turns buckets into percentiles. Forgetting `le` in the `by` clause is the most common PromQL bug.
- `histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))`, global p95 latency
- `histogram_quantile(0.99, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))`, p99 per service
- `rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])`, average latency (cheaper than quantile, less informative)
- `le` must be in the `by` clause or the result is meaningless
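To make the `le` bug concrete, here is the correct form next to the broken one, using the same bucket metric:

```promql
# Correct: le survives the sum, so the bucket boundaries stay intact.
histogram_quantile(
  0.95,
  sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
)

# Broken: summing le away merges every bucket into one series per service,
# and histogram_quantile silently returns NaN instead of erroring.
# histogram_quantile(
#   0.95,
#   sum by (service) (rate(http_request_duration_seconds_bucket[5m]))
# )
```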
Label tricks
The label functions feel obscure until the day you need them, and then they're the only thing that works.
label_replace(up, "host", "$1", "instance", "([^:]+):.*"), derive a new label from regex on an existing onelabel_join(up, "id", "/", "job", "instance"), concatenate labels into a new one{job="api", env=~"prod|staging"}, regex match with=~{job!="canary"}, exclusion with!=group_left/group_right, many-to-one joins; the side withgroup_*is the "many"on (instance) group_left(version) build_info, enrich a metric with labels from another
Time shifts
Compare now to a week ago without writing two queries.
- `rate(http_requests_total[5m] offset 1w)`, same query, one week ago
- `rate(http_requests_total[5m]) / rate(http_requests_total[5m] offset 1w)`, week-over-week ratio
- `predict_linear(node_filesystem_free_bytes[6h], 86400)`, extrapolate disk free 24h forward; the classic disk-will-fill alert
- `deriv(node_load1[10m])`, slope of a gauge; useful for "is this trending up?"
- `changes(up[1h])`, flap counter
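Combining `offset` with the aggregation patterns above gives a per-service week-over-week ratio; the `service` label is an assumption about your series:

```promql
# Week-over-week traffic ratio per service.
# > 1.0 means growth, < 1.0 means decline.
  sum by (service) (rate(http_requests_total[5m]))
/
  sum by (service) (rate(http_requests_total[5m] offset 1w))
```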
Alert-shaped queries
Patterns that translate cleanly into Prometheus alerting rules.
- Error ratio: `sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01`
- Saturation: `avg by (pod) (rate(container_cpu_usage_seconds_total[5m])) / on (pod) group_left kube_pod_container_resource_limits{resource="cpu"} > 0.9`
- Burn rate (1h): `(1 - sum(rate(slo_good_total[1h])) / sum(rate(slo_total[1h]))) / (1 - 0.999) > 14.4` (see the sketch after this list)
- Target down: `up == 0`, held with `for: 5m` in the rule
- Disk fills in 24h: `predict_linear(node_filesystem_free_bytes{mountpoint="/"}[6h], 86400) < 0`
- Replicas missing: `kube_deployment_status_replicas_available < kube_deployment_spec_replicas`
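The 14.4 in the burn-rate item deserves a comment: for a 99.9% SLO over a 30-day window, burning at 14.4× the allowed rate for one hour consumes 2% of the error budget (14.4 × 1 h / 720 h = 2%). A sketch, assuming `slo_good_total` and `slo_total` are counters of good and total events:

```promql
# Error-budget burn rate over the last hour for a 99.9% SLO.
# Numerator: observed error ratio. Denominator: allowed error ratio (0.001).
# 14.4x sustained for one hour burns 2% of a 30-day budget (14.4 / 720 = 2%).
(
  1 - sum(rate(slo_good_total[1h])) / sum(rate(slo_total[1h]))
)
/ (1 - 0.999)
> 14.4
```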