Agentic SRE Advanced · By Samson Tanimawo, PhD · Published Jun 2, 2026 · 5 min read

Tracking Tool-Call Failures: A Dashboard That Matters

Tool failures cause more agent regressions than model regressions. The five panels, the alert thresholds, and the runbook entry that brings the on-call up to speed.

Five panels

Total failure rate over time: a single number with a 30-day trend.

Failure rate by tool: which tools are flakiest. The list is your remediation backlog.

Failure rate by error class: 4xx vs 5xx vs timeout. Different failure classes have different fixes: a 4xx usually means the agent's wrapper built a bad request, while 5xx and timeouts point at the tool itself.

Per-tenant failure rate: catches an integration that has gone bad for a single customer while the aggregate rate still looks healthy.

Agent runs blocked by tool failure: the downstream impact metric. This is what the on-call cares about.
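The five panels above can be computed from a stream of tool-call records. A minimal sketch, with a hypothetical `ToolCall` record type; the "runs blocked" count is approximated here as runs containing at least one failure, which a real pipeline would refine:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolCall:
    tool: str
    tenant: str
    run_id: str
    error_class: Optional[str]  # None on success; else "4xx", "5xx", or "timeout"


def panel_metrics(calls: list) -> dict:
    """Aggregate raw tool-call records into the five dashboard panels."""
    def rate(failed, total):
        return failed / total if total else 0.0

    by_tool = defaultdict(lambda: [0, 0])    # tool -> [failures, total]
    by_tenant = defaultdict(lambda: [0, 0])  # tenant -> [failures, total]
    by_class = defaultdict(int)              # error class -> failure count
    blocked_runs = set()
    failures = 0

    for c in calls:
        by_tool[c.tool][1] += 1
        by_tenant[c.tenant][1] += 1
        if c.error_class is not None:
            failures += 1
            by_tool[c.tool][0] += 1
            by_tenant[c.tenant][0] += 1
            by_class[c.error_class] += 1
            blocked_runs.add(c.run_id)  # approximation: any failure marks the run

    return {
        "total_failure_rate": rate(failures, len(calls)),
        "failure_rate_by_tool": {t: rate(f, n) for t, (f, n) in by_tool.items()},
        "failures_by_error_class": dict(by_class),
        "failure_rate_by_tenant": {t: rate(f, n) for t, (f, n) in by_tenant.items()},
        "runs_blocked": len(blocked_runs),
    }
```

In production these aggregations would live in your metrics backend rather than application code, but the groupings (tool, error class, tenant, run) are the same.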

Alerts

Per-tool failure rate > 5% for 5 minutes: page the team that owns the tool.

Aggregate failure rate > 1% for 1 minute: page the agent platform team.

Per-tenant failure rate > 20% for 1 minute: page the integrations team.
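The three alert rules share one shape: a failure rate over a sliding window, compared against a threshold. A simplified sketch (real alerting systems evaluate "sustained above threshold for N minutes" more precisely; here the window itself stands in for the duration):

```python
from collections import deque


class FailureRateAlert:
    """Fires when the failure rate over the last `window_s` seconds
    exceeds `threshold`. A simplification of sustained-duration alerting."""

    def __init__(self, threshold: float, window_s: float):
        self.threshold = threshold
        self.window_s = window_s
        self.events = deque()  # (timestamp, failed: bool)

    def record(self, failed: bool, now: float) -> None:
        self.events.append((now, failed))
        cutoff = now - self.window_s
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def firing(self) -> bool:
        if not self.events:
            return False
        failures = sum(1 for _, f in self.events if f)
        return failures / len(self.events) > self.threshold


# One instance per rule; per-tool and per-tenant alerts get one instance per key.
per_tool_alert = FailureRateAlert(threshold=0.05, window_s=300)  # page tool owner
aggregate_alert = FailureRateAlert(threshold=0.01, window_s=60)  # page platform team
per_tenant_alert = FailureRateAlert(threshold=0.20, window_s=60)  # page integrations
```

Each alert routes to a different team, so keeping the instances separate (rather than one global rate) is the point, not an implementation detail.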

Runbook for the on-call

Step 1: check the dashboard. Which panel is red? That tells you which team to pull in.

Step 2: check the tool's own dashboard (linked from the agent dashboard). Is the tool itself broken, or is the agent's wrapper at fault?

Step 3: throttle the agent's calls to the failing tool while diagnostics happen. This prevents the failure from cascading into retry storms and blocked runs.
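The throttle in step 3 can be as simple as a rate cap the on-call flips on for the failing tool. A minimal sketch (the class name and rate are illustrative):

```python
class ToolThrottle:
    """Caps the call rate to a failing tool while diagnostics happen."""

    def __init__(self, max_calls_per_s: float):
        self.min_interval = 1.0 / max_calls_per_s
        self.last_call = float("-inf")

    def allow(self, now: float) -> bool:
        """Return True and consume a slot if enough time has passed."""
        if now - self.last_call >= self.min_interval:
            self.last_call = now
            return True
        return False
```

Calls that are not allowed should fail fast with a clear "throttled" error so the agent can report it, rather than queue up behind the broken tool.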

Retry vs hard fail

Idempotent tools: retry up to 3 times with backoff. Most read tools are idempotent.

Non-idempotent tools: do not retry. A retry on a write tool can cause duplicate effects.

Tag each tool's retry policy in the wrapper. The retry decision belongs at the wrapper level, not in the agent prompt.
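A sketch of the wrapper-level retry policy described above, with the idempotency tag deciding whether retries happen at all (names and defaults are illustrative):

```python
import random
import time
from typing import Any, Callable


class ToolWrapper:
    """Wraps a tool call. Retry policy is a property of the tool,
    declared here, never left to the agent prompt."""

    def __init__(self, fn: Callable[..., Any], idempotent: bool,
                 max_retries: int = 3, base_delay_s: float = 0.5):
        self.fn = fn
        # Non-idempotent (write) tools never retry: a retry can duplicate effects.
        self.max_retries = max_retries if idempotent else 0
        self.base_delay_s = base_delay_s

    def call(self, *args, **kwargs):
        attempt = 0
        while True:
            try:
                return self.fn(*args, **kwargs)
            except Exception:
                if attempt >= self.max_retries:
                    raise  # hard fail after the retry budget is spent
                # exponential backoff with jitter before the next attempt
                time.sleep(self.base_delay_s * (2 ** attempt) * random.uniform(0.5, 1.5))
                attempt += 1
```

Usage: read tools get `ToolWrapper(fetch_doc, idempotent=True)`; write tools get `ToolWrapper(create_ticket, idempotent=False)` and fail on the first error.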

Why this dashboard matters most

Tool failures cause more agent regressions than model regressions. The model is mostly stable; the tools change constantly.

Catching tool failures early prevents a class of agent failures that look like model failures: the agent reasons plausibly over wrong or missing tool output.

Most teams underinvest in this dashboard. The investment pays back in fewer mysterious agent regressions.