PromQL Essentials Cheat Sheet
Twenty patterns cover 90% of dashboards. Memorise these and you'll write better queries than most teams' Grafana folders.
Rates and counters
Counters only go up. You almost always want a per-second rate, not the raw value. Pick the function that matches your use case, and a range window that fits your scrape interval.
- `rate(http_requests_total[5m])`, average per-second rate over 5 minutes; use this 95% of the time
- `irate(http_requests_total[5m])`, instant rate from the last two samples; spikier, useful for fast-moving signals
- `increase(http_requests_total[1h])`, total events in the last hour (rate × window)
- `rate(...[5m]) * 60`, per-minute rate when stakeholders prefer "requests/minute"
- Range must be at least 4× the scrape interval, or the rate gets noisy
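For concreteness, here are the three counter functions side by side on the same metric; the `job="api"` selector is an assumption for illustration:

```promql
# Smooth per-second rate over the last 5 minutes: the default for dashboards.
rate(http_requests_total{job="api"}[5m])

# Rate from only the last two samples in the window: reacts fast, looks spiky.
irate(http_requests_total{job="api"}[5m])

# Total requests over the hour, extrapolated to the window edges
# (roughly rate(...[1h]) * 3600).
increase(http_requests_total{job="api"}[1h])
```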
Aggregation
`by` keeps the listed labels and drops everything else; `without` drops the listed labels and keeps everything else. Always be explicit: the default of dropping all labels is rarely what you want.
- `sum(rate(http_requests_total[5m])) by (service)`, total RPS per service
- `sum(rate(http_requests_total[5m])) by (service, status)`, RPS broken down by status code
- `avg by (service) (rate(http_requests_total[5m]))`, average across instances
- `max by (pod) (container_memory_working_set_bytes)`, heaviest pod's memory
- `topk(5, sum by (service) (rate(http_requests_total[5m])))`, top 5 noisiest services
- `count by (job) (up == 1)`, how many targets are healthy per job
- `sum without (instance) (rate(...))`, collapse instances, keep everything else
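A sketch of `by` versus `without` on the same query; the label names being dropped (`instance`, `pod`) are assumptions about what your series carry:

```promql
# Keep only service and status; every other label is aggregated away.
sum by (service, status) (rate(http_requests_total[5m]))

# Drop only instance and pod; any other labels (env, region, ...) survive.
sum without (instance, pod) (rate(http_requests_total[5m]))
```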
Histograms and quantiles
Histograms are pre-bucketed. `histogram_quantile` turns buckets into percentiles. Forgetting `le` in the `by` clause is the most common PromQL bug.
- `histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))`, global p95 latency
- `histogram_quantile(0.99, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))`, p99 per service
- `rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])`, average latency (cheaper than quantile, less informative)
- `le` must be in the `by` clause or the result is meaningless
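To make the `le` bug concrete, here is the correct form next to the broken one, using the same bucket metric:

```promql
# Correct: le survives the sum, so the bucket boundaries stay intact.
histogram_quantile(
  0.95,
  sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
)

# Broken: summing le away merges every bucket into one series per service,
# and histogram_quantile silently returns NaN instead of erroring.
# histogram_quantile(
#   0.95,
#   sum by (service) (rate(http_request_duration_seconds_bucket[5m]))
# )
```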
Label tricks
The label functions feel obscure until the day you need them, and then they're the only thing that works.
label_replace(up, "host", "$1", "instance", "([^:]+):.*"), derive a new label from regex on an existing onelabel_join(up, "id", "/", "job", "instance"), concatenate labels into a new one{job="api", env=~"prod|staging"}, regex match with=~{job!="canary"}, exclusion with!=group_left/group_right, many-to-one joins; the side withgroup_*is the "many"on (instance) group_left(version) build_info, enrich a metric with labels from another
Time shifts
Compare now to a week ago without writing two queries.
- `rate(http_requests_total[5m] offset 1w)`, same query, one week ago
- `rate(http_requests_total[5m]) / rate(http_requests_total[5m] offset 1w)`, week-over-week ratio
- `predict_linear(node_filesystem_free_bytes[6h], 86400)`, extrapolate disk free 24h forward; the classic disk-will-fill alert
- `deriv(node_load1[10m])`, slope of a gauge; useful for "is this trending up?"
- `changes(up[1h])`, flap counter
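Combining `offset` with the aggregation patterns above gives a per-service week-over-week ratio; the `service` label is an assumption about your series:

```promql
# Week-over-week traffic ratio per service.
# > 1.0 means growth, < 1.0 means decline.
  sum by (service) (rate(http_requests_total[5m]))
/
  sum by (service) (rate(http_requests_total[5m] offset 1w))
```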
Alert-shaped queries
Patterns that translate cleanly into Prometheus alerting rules.
- Error ratio: `sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01`
- Saturation: `avg by (pod) (rate(container_cpu_usage_seconds_total[5m])) / on (pod) group_left kube_pod_container_resource_limits{resource="cpu"} > 0.9`
- Burn rate (1h): `(1 - sum(rate(slo_good_total[1h])) / sum(rate(slo_total[1h]))) / (1 - 0.999) > 14.4` (see the sketch after this list)
- Target down: `up == 0`, held with `for: 5m` in the rule
- Disk fills in 24h: `predict_linear(node_filesystem_free_bytes{mountpoint="/"}[6h], 86400) < 0`
- Replicas missing: `kube_deployment_status_replicas_available < kube_deployment_spec_replicas`
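The 14.4 in the burn-rate item deserves a comment: for a 99.9% SLO over a 30-day window, burning at 14.4× the allowed rate for one hour consumes 2% of the error budget (14.4 × 1 h / 720 h = 2%). A sketch, assuming `slo_good_total` and `slo_total` are counters of good and total events:

```promql
# Error-budget burn rate over the last hour for a 99.9% SLO.
# Numerator: observed error ratio. Denominator: allowed error ratio (0.001).
# 14.4x sustained for one hour burns 2% of a 30-day budget (14.4 / 720 = 2%).
(
  1 - sum(rate(slo_good_total[1h])) / sum(rate(slo_total[1h]))
)
/ (1 - 0.999)
> 14.4
```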