Datadog as a First-Class Tool for SRE Agents

Six Datadog API endpoints become six agent tools. The wrappers, the rate-limit handling, and the prompt patterns that get models to query Datadog effectively.

Six endpoints, six tools

The model never sees Datadog’s full API. It sees six narrow tools, each scoped to a specific question the agent needs to answer during triage. Six is the cap because the prompt budget for tool descriptions is finite and a thin tool surface keeps eval cost manageable.

metric_query. Pull metric values for a tag set across a time range. Returns a series with timestamps and units.
log_search and trace_search. Query logs by tag set and APM traces by service. Both return top-N matches with high-level summaries.
service_check and monitor_state. Surface current health and which monitors are alerting now. The agent uses these to confirm the failure is still active.
incident_list. Pull active incidents and their owners. Drives cross-incident correlation when multiple alerts may share a root cause.

Wrappers, not raw API

Every tool is a Python wrapper, not a raw HTTP call assembled from the prompt. The wrapper is where defaults, scope limits, and result shape are enforced, which is what makes the agent’s behaviour predictable across thousands of runs.

Defaults. Time range capped at 24 hours, result count capped at 100, query strings sanitised before they leave the wrapper.
Scope limits. Service tag is required on every call. The agent cannot accidentally pull data from an unrelated team.
Result shape. Every wrapper returns the same envelope: status, payload, citation. The agent learns one parse contract.
API surface. What the agent sees is the wrapper signature, not Datadog’s sprawling SDK. Six narrow tools, not sixty.

Rate limiting

Datadog rate limits per API key. An agent that retries on every error will exhaust the budget in minutes, so the wrapper layer absorbs back-pressure before it ever reaches the model.

Track usage. Each wrapper increments a counter scoped to the API key. Dashboards show consumption per agent role.
Cache aggressively. Same metric query within 60 seconds returns the cached result. Most queries during a single incident repeat.
Proactive back-off. When usage crosses 70 percent of the per-minute cap, the wrapper inserts pauses before issuing more calls.
Clear error path. Rate-limit triggers return “rate limited; retry in 30 seconds”, not a generic 429. The agent surfaces this to the operator instead of silently failing.

Prompt patterns that work

Tool descriptions inside the system prompt earn their tokens by reducing wrong-tool calls. The patterns below cut tool-selection errors by more than half in our internal evals.

Capabilities, explicit. “metric_query supports the metric names you find in Datadog’s metric explorer” beats “metric_query queries metrics”.
Limits, named. “Result counts are capped at 100; if you need more, refine your query” teaches the agent to narrow rather than retry.
Examples, concrete. Show one full call: metric_query(metric='aws.rds.cpu', service='order-db', range='1h'). The model copies the shape.
Failure modes, listed. Tell the agent what a rate-limit response looks like and how to react. Surprises during incidents are expensive.

Eval cases

The eval set is what tells you whether a prompt or tool change shipped a regression. Run it on every change to the wrapper layer or the tool descriptions.

Standard query. Agent picks the right tool with reasonable args for a routine triage prompt.
Cap-hit case. Result count hits 100; the agent refines the query rather than truncating.
Rate-limit case. Wrapper returns the rate-limit envelope; the agent backs off and retries cleanly.
Cross-tool case. The question requires combining metric, log, and trace data; the agent stitches the answer instead of stopping after one call.