Agentic SRE Advanced By Samson Tanimawo, PhD Published Mar 30, 2026 5 min read

Datadog as a First-Class Tool for SRE Agents

Six Datadog API endpoints become six agent tools. The wrappers, the rate-limit handling, and the prompt patterns that get models to query Datadog effectively.

Six endpoints, six tools

metric_query: pull metric values for a tag set and a time range. Returns a time series.

log_search: query logs by tag set. Returns top N matching lines.

trace_search: query APM traces. Returns trace IDs and high-level summaries.

service_check: pull service health. Returns ok/warning/critical with reason.

monitor_state: read current state of monitors. Surfaces what is currently alerting.

incident_list: pull current incidents and their owners. Useful for cross-incident correlation.
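As a rough sketch, the six tools above could be declared once and rendered into the JSON-Schema-style definitions that many tool-calling APIs accept. The ToolSpec class, the to_schema helper, and every parameter name below are assumptions made for illustration; the article names only the tools and what they return.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    name: str
    description: str
    params: dict  # hypothetical argument names mapped to JSON Schema types


# Descriptions follow the article; the parameter lists are illustrative guesses.
TOOLS = (
    ToolSpec("metric_query", "Pull metric values for a tag set and a time range.",
             {"metric": "string", "service": "string", "range": "string"}),
    ToolSpec("log_search", "Query logs by tag set; returns the top N matching lines.",
             {"query": "string", "range": "string", "limit": "integer"}),
    ToolSpec("trace_search", "Query APM traces; returns trace IDs and summaries.",
             {"service": "string", "range": "string", "limit": "integer"}),
    ToolSpec("service_check", "Pull service health: ok/warning/critical with reason.",
             {"service": "string"}),
    ToolSpec("monitor_state", "Read the current state of monitors.",
             {"service": "string"}),
    ToolSpec("incident_list", "Pull current incidents and their owners.", {}),
)


def to_schema(spec: ToolSpec) -> dict:
    """Render one spec as a JSON-Schema-style tool definition."""
    return {
        "name": spec.name,
        "description": spec.description,
        "parameters": {
            "type": "object",
            "properties": {arg: {"type": kind} for arg, kind in spec.params.items()},
        },
    }


SCHEMAS = [to_schema(spec) for spec in TOOLS]
```

Keeping the definitions in one table makes it easy to check that the agent is offered exactly these six tools and nothing else.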

Wrappers, not raw API

Each tool is a Python wrapper, not a raw HTTP call from the prompt. The wrapper enforces sane defaults, scope limits, and result shape.

Defaults: time range capped at 24 hours, result count capped at 100, query strings sanitised.

The wrapper's API is what the agent sees. The model never sees Datadog's full API; it sees six narrow tools.
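A minimal sketch of one such wrapper, assuming a log_search tool. The search callable stands in for whatever client actually talks to Datadog and is hypothetical, as are the range syntax ("90m", "6h", "2d"), the character allow-list, and the shape of the returned dict. Only the 24-hour and 100-result caps come from the text above.

```python
import re
from datetime import timedelta

MAX_RANGE = timedelta(hours=24)   # cap stated in the article
MAX_RESULTS = 100                 # cap stated in the article

_RANGE = re.compile(r"(\d{1,4})([mhd])")
_UNITS = {"m": "minutes", "h": "hours", "d": "days"}
# Assumed allow-list: anything outside it is dropped from the query string.
_UNSAFE = re.compile(r"[^A-Za-z0-9_.:\-*, ]")


def parse_range(text: str) -> timedelta:
    """Parse a range such as '90m' or '3d' and clamp it to MAX_RANGE."""
    match = _RANGE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unsupported range: {text!r}")
    span = timedelta(**{_UNITS[match.group(2)]: int(match.group(1))})
    return min(span, MAX_RANGE)


def sanitise_query(query: str) -> str:
    """Strip disallowed characters, collapse whitespace, bound the length."""
    cleaned = _UNSAFE.sub("", query)
    return " ".join(cleaned.split())[:256]


def log_search(query: str, time_range: str = "1h", limit: int = 20, *, search) -> dict:
    """Narrow tool surface; `search(query, window, n)` is a hypothetical backend."""
    window = parse_range(time_range)
    limit = max(1, min(limit, MAX_RESULTS))
    clean = sanitise_query(query)
    # Ask for one extra line so the wrapper can tell the agent results were cut off.
    raw = search(clean, window, limit + 1)
    return {
        "query": clean,
        "window_seconds": int(window.total_seconds()),
        "capped": len(raw) > limit,
        "lines": raw[:limit],
    }
```

Whatever the model passes in, the call that reaches Datadog stays within the caps, and the result always has the same shape.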

Rate limiting

Datadog rate limits API requests, and the limits vary by endpoint. Track usage and back off proactively.

Cache aggressively: the same metric query within 60 seconds returns the cached result. Most queries are repeated, so the cache pays for itself.

Worst case: when the rate limit is triggered, the wrapper returns a clear error ("rate limited; retry in 30 seconds"). The agent surfaces this rather than silently failing.
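The caching and back-off behaviour might be sketched as below. The fetch callable is hypothetical: it is assumed to return the data plus the remaining quota and seconds until reset (for example, read from the X-RateLimit-Remaining and X-RateLimit-Reset response headers that Datadog documents), and to raise RateLimited on an HTTP 429. The class names and result shape are illustrative, not part of any Datadog client.

```python
import math
import time


class RateLimited(Exception):
    """Raised by the hypothetical fetch function when the API answers 429."""

    def __init__(self, retry_after: float):
        super().__init__(f"rate limited; retry in {math.ceil(retry_after)} seconds")
        self.retry_after = retry_after


class CachedTool:
    """Wrap a fetch function with a TTL cache and proactive back-off."""

    def __init__(self, fetch, ttl: float = 60.0, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        self._cache = {}           # args key -> (timestamp, result)
        self._blocked_until = 0.0  # no upstream calls before this time

    def __call__(self, **kwargs) -> dict:
        now = self._clock()
        key = tuple(sorted(kwargs.items()))  # argument values must be hashable
        hit = self._cache.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # same query inside the TTL: serve from cache
        if now < self._blocked_until:
            return self._error(self._blocked_until - now)  # skip the call entirely
        try:
            data, remaining, reset_in = self._fetch(**kwargs)
        except RateLimited as exc:
            self._blocked_until = now + exc.retry_after
            return self._error(exc.retry_after)
        if remaining <= 0:
            self._blocked_until = now + reset_in  # quota spent: wait for the reset
        result = {"ok": True, "data": data}
        self._cache[key] = (now, result)
        return result

    @staticmethod
    def _error(wait: float) -> dict:
        return {"ok": False, "error": f"rate limited; retry in {math.ceil(wait)} seconds"}
```

Returning the error as an ordinary tool result, rather than raising, lets the agent read the wait time and decide whether to retry or report it.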

Prompt patterns that work

Tell the model the tools' capabilities explicitly. "metric_query supports the metric names you can find in Datadog's metric explorer."

Tell the model the limits. "Result counts are capped at 100; if you need more, refine your query."

Show examples in the prompt. "Example query: metric_query(metric='aws.rds.cpuutilization', service='order-db', range='1h')"
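One way to keep these three patterns consistent is to generate the tool section of the system prompt from a small table, so that each tool gets a capability line, a limit line, and an example. This is only a sketch: TOOL_NOTES, build_tool_prompt, and the log_search wording and arguments are illustrative assumptions, not text the article prescribes.

```python
# Per-tool notes: one capability, one limit, one example call (all illustrative).
TOOL_NOTES = {
    "metric_query": {
        "capability": "Supports the metric names you can find in Datadog's metric explorer.",
        "limit": "Time ranges are capped at 24 hours.",
        "example": "metric_query(metric='aws.rds.cpuutilization', service='order-db', range='1h')",
    },
    "log_search": {
        "capability": "Searches logs by tag set and returns the top matching lines.",
        "limit": "Result counts are capped at 100; if you need more, refine your query.",
        "example": "log_search(query='service:order-db status:error', range='1h', limit=20)",
    },
}


def build_tool_prompt(notes: dict) -> str:
    """Render the tool section of a system prompt from the notes table."""
    lines = ["You can query Datadog through the tools below.", ""]
    for name, note in notes.items():
        lines.append(f"{name}")
        lines.append(f"  Capability: {note['capability']}")
        lines.append(f"  Limit: {note['limit']}")
        lines.append(f"  Example: {note['example']}")
        lines.append("")
    return "\n".join(lines).rstrip()


print(build_tool_prompt(TOOL_NOTES))
```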

Eval cases

Standard query: agent should call the right tool with reasonable args.

Cap-hit case: agent should refine when results are capped.

Rate-limit case: agent should back off and retry, not give up.

Cross-tool case: agent should combine results from multiple tools when needed.
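The four cases can be phrased as checks over a recorded trace of tool calls. The sketch below assumes each trace entry is a dict with "tool", "args", and "result" keys, and that capped and rate-limited results are flagged with "capped" and "ok" fields; that trace format and the check functions are hypothetical. The rate-limit check only verifies that a retry happened; verifying the wait would need timestamps in the trace.

```python
def standard_query(trace: list) -> bool:
    """First call uses the expected tool with a metric argument."""
    return bool(trace) and trace[0]["tool"] == "metric_query" and "metric" in trace[0]["args"]


def cap_hit(trace: list) -> bool:
    """After a capped result, the next call refines the same tool's arguments."""
    for prev, nxt in zip(trace, trace[1:]):
        if prev["result"].get("capped"):
            return nxt["tool"] == prev["tool"] and nxt["args"] != prev["args"]
    return False


def rate_limit(trace: list) -> bool:
    """After a rate-limit error, the same call is retried later in the trace."""
    for i, call in enumerate(trace):
        if call["result"].get("ok") is False:
            return any(
                later["tool"] == call["tool"] and later["args"] == call["args"]
                for later in trace[i + 1:]
            )
    return False


def cross_tool(trace: list) -> bool:
    """At least two distinct tools contribute to the answer."""
    return len({call["tool"] for call in trace}) >= 2


CASES = {
    "standard": standard_query,
    "cap_hit": cap_hit,
    "rate_limit": rate_limit,
    "cross_tool": cross_tool,
}


def run_evals(traces: dict) -> dict:
    """Map each case name to pass/fail for its recorded trace."""
    return {name: check(traces[name]) for name, check in CASES.items()}
```

Each trace would come from running the agent against stubbed tools, so the cap and rate-limit conditions can be triggered on demand.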