Rookie Mistakes in Prometheus Recording Rules
Five rookie mistakes in recording rules and how they show up. Each costs cardinality, performance, or signal.
Recursive rules
Prometheus recording rules and alerting rules have specific failure modes. The mistakes are predictable; recognizing them prevents the failure modes from emerging in production. Each mistake is recoverable; the discipline is catching them early.
What recursive rules look like:
- Rule A computes from rule B; rule B computes from rule A: The recursion usually emerges by accident. Rule A is built on rule B; later, rule B is changed to depend on something that itself depends on rule A. The cycle now exists, and the team may not notice.
- The evaluation engine handles it: Prometheus does not crash on recursive rules. It evaluates each rule at its group's configured interval, so the recursion produces stale data rather than errors.
- The result is stale: Each rule reads whatever its dependencies wrote on an earlier evaluation, so the data lags by roughly one evaluation interval per hop in the cycle. For a short cycle the lag is small and bounded; for a cycle that spans many rules it becomes significant.
- Avoid: Recursion is a bug, not a feature. Detect and fix recursive dependencies before they cause incidents.
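- Compose rules in a DAG: Rules should form a directed acyclic graph in which each rule depends only on rules earlier in the order and no cycles exist. Document the DAG and verify it.
- With explicit ordering: Make the evaluation order explicit. By default, Prometheus evaluates the rules inside a group sequentially, in the order they are listed, while separate groups run independently with no ordering guarantee between them. Keeping dependent rules in the same group, with dependencies listed first, prevents the implicit recursion that produces staleness.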
Recursion is an obvious-once-you-see-it mistake. The discipline is verifying that the rule structure forms a clean DAG.
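One way to verify the DAG is to build a dependency graph from the rule expressions and sort it topologically. The sketch below is a minimal illustration, not a PromQL parser: it assumes a rule depends on another rule whenever the other rule's recorded name appears as a token in its expression. The rule names and expressions in RULES are hypothetical.

```python
import re
from graphlib import CycleError, TopologicalSorter  # Python 3.9+

# Hypothetical rule set: recorded series name -> PromQL expression.
RULES = {
    "job:http_requests:rate5m": "sum by (job) (rate(http_requests_total[5m]))",
    "job:http_errors:rate5m": 'sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))',
    "job:http_errors:ratio5m": "job:http_errors:rate5m / job:http_requests:rate5m",
}

# Metric names may contain letters, digits, underscores, and colons.
NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def dependencies(rules):
    """Map each rule to the recorded names that its expression mentions."""
    return {
        name: {token for token in NAME.findall(expr) if token in rules}
        for name, expr in rules.items()
    }


def evaluation_order(rules):
    """Return rule names with dependencies first; raise CycleError on a cycle."""
    return list(TopologicalSorter(dependencies(rules)).static_order())


if __name__ == "__main__":
    print(evaluation_order(RULES))
    cyclic = {"team:a": "team:b * 2", "team:b": "team:a + 1"}
    try:
        evaluation_order(cyclic)
    except CycleError as err:
        # The second element of args holds the nodes that form the cycle.
        print("cycle:", err.args[1])
```

The order returned for an acyclic rule set is also a reasonable order in which to list the rules inside one group, so that each rule is evaluated after the rules it reads.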
Over-using rules
Recording rules are valuable but not free. Over-using them produces cardinality explosion and storage pressure. The discipline is using rules where they pay off.
- Every dashboard query becomes a rule: The temptation is to convert every dashboard query into a recording rule "for performance". The pattern produces too many rules, and their cumulative cardinality adds up.
- Cardinality explodes: Each rule writes new time series, one per distinct label combination in its output. Many rules produce many series, and the cardinality grows past the point where it is useful. The metric database struggles.
- Storage suffers: Rule output is stored like any other series. High cardinality means high storage; the team's metric storage bill grows, and the operational characteristics of the database degrade.
- Rules are for queries used by many dashboards: The right candidates are queries that run on many dashboards or in many alerts. The value is the amortization: compute once, read many times. Without it, the rule is overhead.
- Single-use queries do not need pre-computation: A query used in one panel on one dashboard does not justify a recording rule. The query runs when the dashboard loads and its cost is bounded; a rule would only add overhead.
Over-use is a real failure mode. The discipline is selectivity: use rules where they pay off, not everywhere.
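The selectivity check can be sketched as two small helpers: one counts how many dashboards or alerts share a query, the other gives an upper bound on the series a rule would write. Everything here is illustrative: the CONSUMERS inventory, the query strings, and the default cutoff of two consumers are assumptions, not values from any real system.

```python
from collections import Counter
from math import prod

# Hypothetical inventory: dashboard or alert name -> PromQL queries it runs.
CONSUMERS = {
    "api-overview": ["sum by (job) (rate(http_requests_total[5m]))", "up"],
    "api-latency": ["sum by (job) (rate(http_requests_total[5m]))"],
    "alert:HighErrorRate": ["sum by (job) (rate(http_requests_total[5m]))"],
    "db-capacity": ["sum by (instance) (node_filesystem_avail_bytes)"],
}


def shared_queries(consumers, min_consumers=2):
    """Return queries used by at least `min_consumers` distinct consumers.

    The cutoff of 2 is an assumed default; tune it to your environment.
    """
    counts = Counter(
        query for queries in consumers.values() for query in set(queries)
    )
    return {query: n for query, n in counts.items() if n >= min_consumers}


def series_upper_bound(distinct_values_per_label):
    """Upper bound on a rule's output series.

    The output has at most one series per combination of output label
    values, so the bound is the product of the distinct value counts.
    A rule with no output labels writes a single series.
    """
    return prod(distinct_values_per_label.values())


if __name__ == "__main__":
    print(shared_queries(CONSUMERS))
    print(series_upper_bound({"job": 20, "code": 8}))
```

A query that appears in `shared_queries` and has a modest `series_upper_bound` is a plausible rule candidate; a single-use query or one with a very large bound is probably not.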
Under-using rules
The opposite mistake is under-use. Heavy queries that run frequently are candidates for rules; without rules, every query execution is expensive.
- Heavy queries that run on every dashboard load: A query that takes seconds to run executes every time someone views the dashboard. Across many viewers and many loads, the cumulative cost is large.
- Slow rendering: Dashboards render slowly when their queries are heavy. The user waits; the experience degrades; the dashboard's value is reduced.
- High CPU on the storage: Heavy queries consume significant CPU on the metric database. The cumulative load can affect other queries; the database becomes a bottleneck.
- If a query runs more than 100 times per day and takes more than 1 second, make it a recording rule: The threshold is approximate but useful. Above the threshold, the cumulative cost justifies the rule; below it, the rule is overhead.
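- Track query patterns: The team's metric database can report which queries run most often (Prometheus, for example, can write a query log when query_log_file is set in its global configuration). The reports identify candidates for recording rules; the data drives the decisions.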
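The rule of thumb above can be applied mechanically once query executions are collected. The sketch below assumes the input has already been reduced to (query, duration in seconds) pairs covering one day; that input shape is hypothetical, and using the mean duration per query is a simplifying choice made here, not something the threshold prescribes.

```python
from collections import defaultdict

# Thresholds from the rule of thumb above; both are approximate.
MIN_RUNS_PER_DAY = 100
MIN_SECONDS = 1.0


def rule_candidates(executions):
    """Return (query, runs, mean_seconds) for queries over both thresholds.

    `executions` is an iterable of (query, duration_seconds) pairs that
    covers one day (an assumed input shape for this sketch).
    """
    durations = defaultdict(list)
    for query, seconds in executions:
        durations[query].append(seconds)

    candidates = []
    for query, times in durations.items():
        mean = sum(times) / len(times)
        if len(times) > MIN_RUNS_PER_DAY and mean > MIN_SECONDS:
            candidates.append((query, len(times), mean))

    # Largest cumulative time (runs * mean duration) first.
    return sorted(candidates, key=lambda c: c[1] * c[2], reverse=True)


if __name__ == "__main__":
    sample = [("heavy", 2.0)] * 150 + [("rare", 5.0)] * 10 + [("cheap", 0.05)] * 1000
    for query, runs, mean in rule_candidates(sample):
        print(f"{query}: {runs} runs/day, {mean:.2f}s mean")
```

Sorting by cumulative time puts the queries whose pre-computation would save the most work at the top of the list.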
Rookie mistakes in Prometheus rules are predictable; understanding the patterns prevents them. Nova AI Ops integrates with Prometheus and other PromQL backends, surfaces rule patterns and anti-patterns, and produces the audit that catches mistakes before they become operational issues.