Product Updates · Intermediate · By Samson Tanimawo, PhD · Published Oct 3, 2026 · 9 min read

Introducing the 100-Agent Platform

100 specialised AI agents across 12 functional teams. Diagnose, remediate, audit, learn, predict, communicate, plan, verify, investigate, score, document, and detect. The capability map and the rollout plan.

Why 100 agents

The naive approach to agentic SRE is one giant agent that does everything: it sees all the signals, decides what to do, and fires the action. We tried this. It performs worse than a team of specialised agents at every measurable task, for the same reason a single engineer can't be deeply expert in databases, networking, Kubernetes, and IAM at the same time: context capacity is finite, and depth eats breadth.

Specialisation works because each agent has a small, well-defined input space, a small tool palette, and a small evaluation set we can regression-test against. The DB Latency Diagnose agent reads exactly the metrics that diagnose database latency, calls exactly the tools that query database state, and is benchmarked against a corpus of historical database-latency incidents. It's good at its one job because it doesn't try to do anyone else's.

100 is the count we landed on because the underlying capability surface, the union of all the things SREs do, is roughly that big. Pick fewer and you get agents that try to span too much. Pick more and you get fragmentation that's hard to compose. 100 fits the work.

The 12 teams

Diagnose (16 agents). Database latency, memory leaks, cascading timeouts, SSL handshake failures, DNS resolution, certificate expiry, rate-limit collisions, deadlocks, leader-election flapping, network partition, disk-fill, log-flooding, GC pauses, container restart loops, IAM permission errors, mTLS misconfiguration. One agent per root-cause family.

Remediate (12 agents). Pool restart, scale up/down, drain node, rotate credentials, flush cache, fail over leader, evict pod, replay queue, roll back deploy, restart service, increase quota, force-kill stuck process. Each remediation is a small, well-tested action with explicit safety gates.

Detect (10 agents). Anomaly detection per signal type: latency, error rate, traffic, saturation, queue depth, customer-impact metrics. Each runs continuously across the customer's service map.

Audit (4 agents). Action ledger writer, compliance checker, RBAC verifier, change-tracker.

Learn (8 agents). Runbook drafter, runbook updater, action-item extractor, post-mortem trend analyser, knowledge-base curator, weak-signal pattern miner, similar-incident finder, lesson-encoder.

Communicate (8 agents). Slack updater, status-page author, X broadcaster, LinkedIn broadcaster, Threads broadcaster, customer-email drafter, executive-update writer, war-room facilitator.

Plan, Verify, Investigate, Score, Predict, Document. The remaining 42 agents fill out the rest of the capability map: pre-incident planning, change verification, deep-dive investigation, customer-impact scoring, predictive regression detection, and documentation generation.

The agent runtime

Each agent is a small program with a typed input schema, a typed output schema, a tool palette, and a system prompt. The runtime executes the agent in a sandboxed environment with rate limits, token budgets, and per-tool authorisation checks.
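To make that concrete, here is a minimal sketch of what an agent definition could look like. The class and field names (AgentDefinition, ToolGrant, the db_latency_diagnose example) are illustrative assumptions, not the platform's actual API.

```python
from dataclasses import dataclass

# Illustrative sketch only: names and fields are assumptions, not the real API.

@dataclass(frozen=True)
class ToolGrant:
    name: str                  # tool the agent may call, e.g. "query_db_metrics"
    max_calls: int             # per-invocation rate limit
    scopes: tuple[str, ...]    # per-tool authorisation scopes

@dataclass(frozen=True)
class AgentDefinition:
    agent_id: str
    team: str                          # one of the 12 functional teams
    input_schema: dict[str, type]      # typed input contract
    output_schema: dict[str, type]     # typed output contract
    tools: tuple[ToolGrant, ...]       # small, explicit tool palette
    system_prompt: str                 # static, so it caches well
    model: str                         # model tier, fixed per agent
    max_input_tokens: int = 8_000      # token budget: input
    max_output_tokens: int = 2_000     # token budget: output

# Hypothetical definition for the DB Latency Diagnose agent.
db_latency_diagnose = AgentDefinition(
    agent_id="diagnose.db_latency",
    team="diagnose",
    input_schema={"service": str, "window_minutes": int},
    output_schema={"root_cause": str, "confidence": float},
    tools=(ToolGrant("query_db_metrics", max_calls=5, scopes=("read:metrics",)),),
    system_prompt="You diagnose database latency incidents...",
    model="sonnet",
)
```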

Token budgets matter. An agent that runs unbounded can spend $50 in LLM API costs on a single incident; one that's budget-capped at 8k input + 2k output tokens spends a few cents. We aggressively cache static system prompts and few-shot examples; the cache hit rate across the fleet is 87%, which is the difference between economically viable and not.
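Building on the hypothetical AgentDefinition sketch above, the budget check can be as simple as refusing a call before it is ever made. approx_tokens and the injected call_model parameter are stand-ins, not real SDK functions.

```python
from typing import Callable

class BudgetExceeded(Exception):
    pass

def approx_tokens(text: str) -> int:
    # Rough heuristic (~4 characters per token); a real runtime would use
    # the model's own tokenizer.
    return len(text) // 4

def run_with_budget(agent, prompt: str, call_model: Callable[..., str]) -> str:
    # Refuse the call outright if the input budget is blown, and hard-cap
    # the output so a single run can't overspend.
    if approx_tokens(agent.system_prompt) + approx_tokens(prompt) > agent.max_input_tokens:
        raise BudgetExceeded(f"{agent.agent_id}: input exceeds {agent.max_input_tokens} tokens")
    return call_model(
        model=agent.model,
        system=agent.system_prompt,                # static and reused, hence cacheable
        prompt=prompt,
        max_output_tokens=agent.max_output_tokens,
    )
```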

Model routing happens per agent. Simple agents (action-item extraction, log-line classification) run on Haiku. Medium agents (diagnosis, remediation planning) run on Sonnet. Heavy agents (post-mortem authoring) run on Opus. The routing decision is encoded in the agent definition; we don't make it dynamically per request.
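Resolving that static route at registration time might look like the following; the tier keys match the agent definitions sketched above, and the model identifiers are placeholders rather than pinned API model names.

```python
# The model tier is part of the agent definition, chosen once at registration
# and never per request. Identifiers here are placeholders, not pinned versions.
MODEL_TIERS = {
    "haiku":  "claude-haiku",    # simple agents: action-item extraction, log classification
    "sonnet": "claude-sonnet",   # medium agents: diagnosis, remediation planning
    "opus":   "claude-opus",     # heavy agents: post-mortem authoring
}

def resolve_model(agent) -> str:
    # Fail loudly at registration if a definition names an unknown tier,
    # rather than silently falling back at request time.
    return MODEL_TIERS[agent.model]
```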

How agents collaborate

The agents don't call each other directly. The runtime is event-driven: an agent emits a structured event ("DB latency diagnosed: connection pool exhaustion"), and other agents subscribed to that event class wake up. The Remediate agent subscribed to "diagnosis emitted" reads the diagnosis, picks the right remediation, and executes it.

This decoupling matters because it means we can add agents without rewriting the others. The new agent declares what events it consumes and emits; it slots into the topology without modifying anyone else's code. This is how we go from 100 agents today to 200 next year without the whole thing collapsing under integration cost.
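A minimal sketch of that pattern, assuming an in-process pub/sub bus; the real runtime presumably sits on a durable event stream, and the event names and handlers here are illustrative.

```python
from collections import defaultdict
from typing import Any, Callable

Event = dict[str, Any]
Handler = Callable[[Event], None]

class EventBus:
    # Agents never call each other; they only emit events and subscribe to event classes.
    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, event_class: str, handler: Handler) -> None:
        self._subscribers[event_class].append(handler)

    def emit(self, event_class: str, payload: Event) -> None:
        for handler in self._subscribers[event_class]:
            handler(payload)

bus = EventBus()

# A Remediate agent declares what it consumes; it never imports the Diagnose agent.
def on_diagnosis(event: Event) -> None:
    if event["root_cause"] == "connection_pool_exhaustion":
        bus.emit("remediation.requested",
                 {"action": "pool_restart", "service": event["service"]})

bus.subscribe("diagnosis.emitted", on_diagnosis)

# The Diagnose agent only emits; adding new consumers later requires no change here.
bus.emit("diagnosis.emitted",
         {"root_cause": "connection_pool_exhaustion", "service": "checkout-db"})
```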

The audit ledger is the source of truth for the agent topology. Every event, every action, every decision is logged. When something goes wrong (and it does), the post-incident debugging starts from the ledger, not from log spelunking across 100 agent instances.
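As a sketch, the ledger can be as simple as an append-only JSON Lines file with one structured entry per event, action, or decision; the field names here are illustrative, not the actual schema.

```python
import json
import time
import uuid

def ledger_append(path: str, kind: str, agent_id: str, detail: dict) -> str:
    # Append-only, structured entries make replay and post-incident debugging
    # a matter of filtering one file rather than spelunking 100 agents' logs.
    entry = {
        "entry_id": str(uuid.uuid4()),
        "ts": time.time(),
        "kind": kind,            # "event", "action", or "decision"
        "agent_id": agent_id,
        "detail": detail,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry["entry_id"]

# e.g. ledger_append("audit.jsonl", "action", "remediate.pool_restart",
#                    {"service": "checkout-db", "result": "success"})
```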

Rollout plan

The 100 agents are not all live for everyone. The default tier ships with 50 agents enabled: the diagnose, remediate, detect, audit, and learn families that handle the majority of incident types. The communicate agents activate when the customer authorises the relevant channel (X, LinkedIn, Slack, etc.). The predict agents activate after 30 days of telemetry, because they need a baseline.

The remaining agents are gated behind tier or capability: single-tenant deployments unlock the compliance checker and the regulated-industry-specific Diagnose agents (PCI scope analysis, HIPAA boundary checking), and the Enterprise tier unlocks the cross-tenant trend agents.
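A sketch of how that gating could be evaluated per tenant; the rules mirror the ones above, but the tier names, agent ids, and fields are assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Tenant:
    tier: str                                    # e.g. "default", "single_tenant", "enterprise"
    authorised_channels: set[str] = field(default_factory=set)   # e.g. {"slack", "x"}
    telemetry_days: int = 0

DEFAULT_FAMILIES = {"diagnose", "remediate", "detect", "audit", "learn"}
SINGLE_TENANT_ONLY = {"audit.compliance_checker", "diagnose.pci_scope", "diagnose.hipaa_boundary"}
ENTERPRISE_ONLY = {"learn.cross_tenant_trends"}

def is_agent_enabled(team: str, agent_id: str, tenant: Tenant) -> bool:
    # Specific capability gates are checked before the family defaults.
    if agent_id in SINGLE_TENANT_ONLY:
        return tenant.tier == "single_tenant"
    if agent_id in ENTERPRISE_ONLY:
        return tenant.tier == "enterprise"
    if team == "communicate":
        channel = agent_id.split(".")[-1]        # e.g. "communicate.slack" -> "slack"
        return channel in tenant.authorised_channels
    if team == "predict":
        return tenant.telemetry_days >= 30       # predict agents need a baseline first
    return team in DEFAULT_FAMILIES
```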

The agent fleet itself is a living thing. We add 1-3 new agents every release based on what customers' incidents reveal as missing capability. The capability gap between Nova and a perfect SRE platform is closing at a measurable rate; it will never reach zero, but it keeps shrinking.