The Slow Query Observability Stack
Slow queries hide. Finding them takes three things: the instrumentation that surfaces them, the dashboard that ranks them, and the metric that proves you are gaining ground.
Instrument the path
A slow query observability stack produces the data that drives database optimization. Without instrumentation, slow queries go unnoticed; with it, the worst queries become visible and fixable. The discipline is consistent instrumentation across every database access path.
What instrumentation looks like:
- Wrap every database call.: All database calls flow through an instrumented wrapper. The wrapper captures timing and metadata; no database call is invisible to the observability stack.
- Capture the query template, not the raw query.: The instrumentation records the query template (with parameter placeholders), not the raw query text. Templates aggregate well; raw queries embed literal values, so nearly every execution looks unique and cardinality explodes.
- Duration.: Each execution's duration is captured. Duration is the primary signal for identifying slow queries.
- Rows returned.: Some queries are slow because they return many rows. Capturing rows returned distinguishes "slow because complex" from "slow because returns much data".
- Aggregate by template.: Many query executions of the same template aggregate into one statistic. The aggregation is the unit the team operates on; per-execution data is too granular.
- Top N slowest templates by total time spent.: The team's focus is on the templates that consume the most database time. The top 10 by total time is the optimization queue.
Instrumentation is the foundation. Without it, the rest of the discipline cannot operate.
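The points above can be sketched as a thin wrapper around a database connection. This is a minimal illustration, not a definitive implementation: it assumes Python's standard sqlite3 module as the database, and the names InstrumentedConnection and TemplateStats are hypothetical. The parameterized SQL text serves as the template, so parameter values never enter the statistics.

```python
import sqlite3
import time
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TemplateStats:
    """Running totals for one query template (hypothetical structure)."""
    calls: int = 0
    total_seconds: float = 0.0
    total_rows: int = 0


class InstrumentedConnection:
    """Hypothetical wrapper: every database call goes through execute()."""

    def __init__(self, conn: sqlite3.Connection):
        self._conn = conn
        # Aggregation is keyed by template, not by individual execution.
        self.stats = defaultdict(TemplateStats)

    def execute(self, template: str, params: tuple = ()) -> list:
        # The SQL text with "?" placeholders is the template; the
        # parameter values are passed through but never recorded.
        start = time.perf_counter()
        rows = self._conn.execute(template, params).fetchall()
        elapsed = time.perf_counter() - start

        entry = self.stats[template]
        entry.calls += 1
        entry.total_seconds += elapsed
        entry.total_rows += len(rows)
        return rows


conn = sqlite3.connect(":memory:")
db = InstrumentedConnection(conn)
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
for i in range(3):
    db.execute("INSERT INTO users (name) VALUES (?)", (f"user{i}",))
db.execute("SELECT * FROM users WHERE id = ?", (1,))
db.execute("SELECT * FROM users WHERE id = ?", (2,))

for template, s in db.stats.items():
    print(f"{s.calls} calls, {s.total_rows} rows, {s.total_seconds:.6f}s  {template}")
```

The two SELECT executions use different parameter values but collapse into a single entry, which is the property that keeps cardinality manageable.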
The ranking
The ranking determines optimization priority. Total time consumed (duration times frequency) is the right metric; ranking by per-query duration alone misleads.
- Total time equals duration times frequency.: A query that takes 1 second and runs 1000 times per minute consumes 1000 seconds of database time per minute, or 60,000 seconds per hour. A query that takes 10 seconds and runs once per hour consumes 10 seconds per hour. The first consumes 6,000 times more database time than the second.
- The slow, rare query matters less.: A query that is slow but rarely executed is a smaller optimization target. The total time it consumes is small, so the savings from optimizing it are bounded.
- The moderately fast, frequent query matters more.: A query that is moderately fast but very frequent consumes much more total database time. Optimizing it reduces database load significantly.
- Top 10 by total time is the actionable list.: The optimization queue is the top 10 by total time consumed. Each entry on the list represents real, measurable database load; optimizing it produces measurable savings.
- Below that is noise.: Beyond the top 10, per-template impact usually drops to the point where the optimization effort is not worth it. The team focuses where the leverage is.
The ranking is what produces the priority. Without it, optimization happens in the wrong order.
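The ranking rule can be sketched in a few lines. The example below reuses the two queries from the arithmetic above; the template names session_lookup and monthly_report, the TemplateLoad class, and the top_by_total_time helper are all hypothetical labels chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class TemplateLoad:
    template: str
    avg_seconds: float      # mean duration of one execution
    calls_per_hour: float   # execution frequency

    @property
    def seconds_per_hour(self) -> float:
        # Total time consumed = duration times frequency.
        return self.avg_seconds * self.calls_per_hour


def top_by_total_time(loads, n=10):
    """Return the n templates that consume the most database time."""
    return sorted(loads, key=lambda t: t.seconds_per_hour, reverse=True)[:n]


loads = [
    # 10 seconds, once per hour: slow but rare.
    TemplateLoad("monthly_report", avg_seconds=10.0, calls_per_hour=1),
    # 1 second, 1000 times per minute (60,000 per hour): frequent.
    TemplateLoad("session_lookup", avg_seconds=1.0, calls_per_hour=60_000),
]

queue = top_by_total_time(loads)
by_duration = sorted(loads, key=lambda t: t.avg_seconds, reverse=True)

for t in queue:
    print(f"{t.template}: {t.seconds_per_hour:.0f} s of database time per hour")
```

Ranking by duration alone puts monthly_report first, while ranking by total time puts session_lookup first. That difference is the reason the queue is ordered by total time.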
Track wins
The metric tracks whether optimizations are working. Median and p99 query time over time produce the trend; improving trends validate the work; flat trends suggest different optimizations are needed.
- Per-week: median query time, p99 query time.: The two metrics together capture both typical and tail behavior. Median catches general improvement; p99 catches tail improvement; the combination is comprehensive.
- Both should trend down as the team optimizes.: Optimizations should produce visible improvement. Median drops as common queries get faster; p99 drops as outliers get fixed.
- If both are flat, the optimizations are not landing where they matter.: Flat metrics indicate the optimization effort is not producing aggregate improvement. Either the wrong queries are being optimized, or the optimizations are not actually helping.
- Investigate flat trends.: When the metrics are not improving, the team investigates. Are we optimizing the right queries? Are the optimizations actually being applied? Is the workload changing in ways that mask the improvements?
- The metric drives the program.: Without the metric, the team cannot tell if their work is paying off. The metric is the feedback loop; optimization continues based on whether the metric shows progress.
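A weekly summary of this kind can be sketched as follows. The sample durations are made up for illustration, the function names are hypothetical, and the p99 here uses the nearest-rank definition, which is one of several common percentile definitions; a monitoring system may compute it differently.

```python
import math
import statistics
from collections import defaultdict
from datetime import date


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def weekly_summary(samples):
    """Group (date, duration_seconds) pairs by ISO week and summarize each."""
    by_week = defaultdict(list)
    for day, seconds in samples:
        iso = day.isocalendar()          # (ISO year, ISO week, weekday)
        by_week[(iso[0], iso[1])].append(seconds)
    return {
        week: {"median": statistics.median(durs), "p99": percentile(durs, 99)}
        for week, durs in sorted(by_week.items())
    }


# Illustrative data only: five executions in each of two consecutive weeks.
samples = [
    (date(2024, 1, 1), 0.10), (date(2024, 1, 2), 0.20), (date(2024, 1, 3), 0.30),
    (date(2024, 1, 4), 0.40), (date(2024, 1, 5), 2.00),
    (date(2024, 1, 8), 0.05), (date(2024, 1, 9), 0.10), (date(2024, 1, 10), 0.15),
    (date(2024, 1, 11), 0.20), (date(2024, 1, 12), 1.00),
]

summary = weekly_summary(samples)
for (year, week), stats in summary.items():
    print(f"{year}-W{week:02d}: median={stats['median']:.2f}s p99={stats['p99']:.2f}s")
```

In this made-up data both numbers fall from the first week to the second, which is the pattern the team hopes to see. If neither moved, that would be the cue to investigate.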
The slow query observability stack is one of those database operations disciplines that pays off in proportion to database load. Nova AI Ops integrates with database telemetry, surfaces the query templates that consume the most time, and produces the per-query optimization queue that drives the team's database performance work.