Datadog as a First-Class Tool for SRE Agents
Six Datadog API endpoints become six agent tools. Covered here are the wrappers, the rate-limit handling, and the prompt patterns that get models to query Datadog effectively.
Six endpoints, six tools
metric_query: pull metric values for a tag set over a time range. Returns a time series.
log_search: query logs by tag set. Returns the top N matching lines.
trace_search: query APM traces. Returns trace IDs and high-level summaries.
service_check: pull service health. Returns ok/warning/critical with reason.
monitor_state: read current state of monitors. Surfaces what is currently alerting.
incident_list: pull current incidents and their owners. Useful for cross-incident correlation.
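One way to keep the tool surface at exactly these six names is a small registry that the agent runtime dispatches through. The sketch below is illustrative only: ToolRegistry, register, and dispatch are hypothetical names, and the registered functions stand in for the real wrappers.

```python
from typing import Any, Callable

ToolFn = Callable[..., dict[str, Any]]

# The six tool names listed above; nothing else is dispatchable.
TOOL_NAMES = (
    "metric_query",
    "log_search",
    "trace_search",
    "service_check",
    "monitor_state",
    "incident_list",
)


class ToolRegistry:
    """Hypothetical registry mapping tool names to wrapper functions."""

    def __init__(self) -> None:
        self._tools: dict[str, ToolFn] = {}

    def register(self, name: str, fn: ToolFn) -> None:
        # Refuse anything outside the agreed tool list.
        if name not in TOOL_NAMES:
            raise ValueError(f"not an allowed tool: {name}")
        self._tools[name] = fn

    def dispatch(self, name: str, **kwargs: Any) -> dict[str, Any]:
        # Unknown or unregistered names come back as a structured error
        # the agent can read, rather than an exception.
        fn = self._tools.get(name)
        if fn is None:
            return {"error": f"unknown tool: {name}"}
        return fn(**kwargs)
```

Returning an error dict for unknown tools, instead of raising, lets the model see its mistake and correct the call on the next turn.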
Wrappers, not raw API
Each tool is a Python wrapper, not a raw HTTP call from the prompt. The wrapper enforces sane defaults, scope limits, and result shape.
Defaults: time range capped at 24 hours, result count capped at 100, query strings sanitised.
The wrapper's API is what the agent sees. The model never sees Datadog's full API; it sees six narrow tools.
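A minimal sketch of how a wrapper might enforce those defaults, using log_search as the example. Everything apart from the caps themselves is an assumption: the range syntax ("1h", "30m", "2d"), the helper names, the 500-character query limit, and the backend callable that stands in for the actual Datadog call.

```python
import re
from typing import Callable

MAX_RANGE_SECONDS = 24 * 60 * 60  # time range capped at 24 hours
MAX_RESULTS = 100                 # result count capped at 100

_UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400}

# Hypothetical backend signature: (query, range_seconds, limit) -> log lines.
Backend = Callable[[str, int, int], list[str]]


def parse_range(text: str) -> int:
    """Turn '1h' / '30m' / '2d' into seconds, clamped to the 24-hour cap."""
    match = re.fullmatch(r"(\d+)([mhd])", text.strip())
    if match is None:
        raise ValueError(f"unsupported range: {text!r}")
    seconds = int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
    return min(seconds, MAX_RANGE_SECONDS)


def clamp_limit(limit: int) -> int:
    """Keep the requested result count between 1 and the cap."""
    return max(1, min(limit, MAX_RESULTS))


def sanitise_query(query: str, max_len: int = 500) -> str:
    """Replace control characters, collapse whitespace, bound the length."""
    cleaned = "".join(ch if ch.isprintable() else " " for ch in query)
    return " ".join(cleaned.split())[:max_len]


def log_search(query: str, range: str = "1h", limit: int = MAX_RESULTS,
               *, backend: Backend) -> dict:
    # `range` shadows the builtin here only to match the tool's argument name.
    window = parse_range(range)
    capped = clamp_limit(limit)
    lines = backend(sanitise_query(query), window, capped)[:capped]
    return {"lines": lines, "range_seconds": window, "limit": capped}
```

With this shape, a request for range="48h" and limit=500 reaches the backend as 86400 seconds and 100 results, and the returned dict tells the agent which limits were actually applied.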
Rate limiting
Datadog enforces rate limits per API key. Track usage in the wrapper and back off before the limit is hit.
Cache aggressively: the same metric query repeated within 60 seconds returns the cached result. Most queries are repeated, so the cache pays off.
Worst case: when the rate limit is triggered, the wrapper returns a clear error ("rate limited; retry in 30 seconds"). The agent surfaces this rather than failing silently.
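The caching and the explicit rate-limit error can be sketched together as a thin layer around any fetch function. This is an illustration under assumptions: CachedTool and RateLimited are hypothetical names, the fetch callable stands in for the real Datadog call and is assumed to raise RateLimited when the API refuses a request, and the clock is injectable only so the behaviour can be tested without sleeping.

```python
import time
from typing import Any, Callable


class RateLimited(Exception):
    """Raised by the (hypothetical) fetch function when the API refuses a call."""

    def __init__(self, retry_after: float) -> None:
        super().__init__(f"rate limited; retry in {retry_after:.0f} seconds")
        self.retry_after = retry_after


class CachedTool:
    """Serve identical calls from cache for `ttl` seconds."""

    def __init__(self, fetch: Callable[..., dict[str, Any]], ttl: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        self._cache: dict[tuple, tuple[float, dict[str, Any]]] = {}

    def call(self, **kwargs: Any) -> dict[str, Any]:
        # Arguments must be hashable (strings, numbers) to form a cache key.
        key = tuple(sorted(kwargs.items()))
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # fresh enough: no API call is made
        try:
            result = self._fetch(**kwargs)
        except RateLimited as exc:
            # Return a readable error instead of failing silently.
            return {"error": str(exc), "retry_after": exc.retry_after}
        self._cache[key] = (now, result)
        return result
```

Errors are deliberately not cached, so a retry after the stated delay goes back to the API.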
Prompt patterns that work
Tell the model the tools' capabilities explicitly. "metric_query supports the metric names you can find in Datadog's metric explorer."
Tell the model the limits. "Result counts are capped at 100; if you need more, refine your query."
Show examples in the prompt. "Example query: metric_query(metric='aws.rds.cpu', service='order-db', range='1h')"
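The three patterns above can be kept together in one per-tool spec and rendered into the system prompt. The sketch below assumes a plain-text prompt; the spec layout and render_tool_prompt are hypothetical, and an agent framework with its own tool-schema format would carry the same text in its description field instead.

```python
MAX_RESULTS = 100
MAX_RANGE_HOURS = 24

# Hypothetical spec for one tool: capability, limits, and an example call.
METRIC_QUERY_SPEC = {
    "name": "metric_query",
    "capability": (
        "Pull metric values for a tag set over a time range. Supports the "
        "metric names you can find in Datadog's metric explorer."
    ),
    "limits": (
        f"Time range is capped at {MAX_RANGE_HOURS} hours. Result counts are "
        f"capped at {MAX_RESULTS}; if you need more, refine your query."
    ),
    "example": "metric_query(metric='aws.rds.cpu', service='order-db', range='1h')",
}


def render_tool_prompt(spec: dict[str, str]) -> str:
    """Render one tool spec as a four-line block for the system prompt."""
    return "\n".join([
        f"Tool: {spec['name']}",
        f"Capabilities: {spec['capability']}",
        f"Limits: {spec['limits']}",
        f"Example query: {spec['example']}",
    ])
```

Building the limits text from the same constants the wrapper enforces keeps the prompt from drifting out of step with the code.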
Eval cases
Standard query: agent should call the right tool with reasonable args.
Cap-hit case: agent should refine when results are capped.
Rate-limit case: agent should back off and retry, not give up.
Cross-tool case: agent should combine results from multiple tools when needed.
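These four cases can be checked against a recorded trace of the agent's tool calls. The sketch below is one possible shape, not a finished harness: ToolCall and the check functions are hypothetical, and it assumes the wrappers mark capped results with a "capped" flag and rate-limit failures with an "error" string.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """One recorded call from an agent run (hypothetical trace format)."""
    tool: str
    args: dict[str, Any]
    result: dict[str, Any] = field(default_factory=dict)


def called_tool(trace: list[ToolCall], tool: str) -> bool:
    # Standard query: the expected tool appears in the trace.
    return any(call.tool == tool for call in trace)


def refined_after_cap(trace: list[ToolCall]) -> bool:
    # Cap-hit: the call after a capped result uses the same tool, new args.
    for prev, nxt in zip(trace, trace[1:]):
        if prev.result.get("capped"):
            return nxt.tool == prev.tool and nxt.args != prev.args
    return False


def retried_after_rate_limit(trace: list[ToolCall]) -> bool:
    # Rate-limit: the same call is issued again after the error.
    for i, call in enumerate(trace):
        if "rate limited" in str(call.result.get("error", "")):
            return any(later.tool == call.tool and later.args == call.args
                       for later in trace[i + 1:])
    return False


def combined_tools(trace: list[ToolCall], minimum: int = 2) -> bool:
    # Cross-tool: at least `minimum` distinct tools were used.
    return len({call.tool for call in trace}) >= minimum
```

Checks like these cover only the shape of the agent's behaviour; whether the arguments were reasonable still needs case-specific assertions or a human review.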