Feature: Multi-Agent

Specialist agents.

Multi-agent capability

Multi-agent orchestration lets specialist agents coordinate on complex incidents instead of forcing a single generalist agent to know everything. Triage classifies the alert, specialists go deep on their domain, and the orchestrator tracks shared state and surfaces conflicts to the human on-call.

Specialist agents per incident class. Different incident shapes (database, Kubernetes, cloud, application) need different domain knowledge; one specialist per class deepens coverage.
Triage, remediate, orchestrate. Triage classifies and routes; specialists handle their domain; orchestrator coordinates the flow and shared state.
Faster MTTR on cross-system incidents. Single-agent approaches struggle when an incident spans database, Kubernetes, and the application all at once; multi-agent splits the investigation across specialists.
Documented rollout path. Beta access with a planned adoption path; the feature set is being shaped with early users rather than launched at full breadth.

Specialists in design

Four specialists cover the canonical incident classes. Each one is trained on the runbooks and patterns that match its domain rather than trying to cover every signal.

Database specialist. Connection-pool exhaustion, slow queries, replication lag, lock contention; trained on database-specific runbooks rather than generic logs.
Kubernetes specialist. Pod crash loops, node pressure, network policy denials, autoscaler decisions; cluster-aware patterns the agent can interpret without external context.
Cloud infrastructure specialist. IAM, networking, storage, cost anomalies; cloud-vendor-aware patterns trained on the failure modes of the major clouds.
Application specialist. Service errors, dependency issues, deployment-induced regressions; application-aware patterns that read traces and exception logs.
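
As a rough illustration of how the four domains above could map to routing, here is a minimal sketch. The signal names and the `route` function are hypothetical, derived only from the failure modes listed above; they are not the product's actual classifier or API.

```python
# Hypothetical signal names per specialist, taken from the failure modes
# listed above. The real triage logic is not documented here.
SPECIALIST_SIGNALS: dict[str, frozenset[str]] = {
    "database": frozenset(
        {"connection_pool_exhaustion", "slow_query", "replication_lag", "lock_contention"}
    ),
    "kubernetes": frozenset(
        {"pod_crash_loop", "node_pressure", "network_policy_denial", "autoscaler_decision"}
    ),
    "cloud": frozenset({"iam", "networking", "storage", "cost_anomaly"}),
    "application": frozenset(
        {"service_error", "dependency_issue", "deployment_regression"}
    ),
}


def route(alert_signals: set[str]) -> list[str]:
    """Return every specialist whose domain overlaps the alert's signals.

    More than one specialist may be returned for a cross-cutting incident;
    an empty list means no specialist matched.
    """
    return sorted(
        name
        for name, signals in SPECIALIST_SIGNALS.items()
        if signals & alert_signals
    )


if __name__ == "__main__":
    # An incident that spans the database and the cluster engages both.
    print(route({"replication_lag", "pod_crash_loop"}))
```

A set-overlap lookup is the simplest stand-in for classification; it is used here only to show that one alert can engage several specialists at once.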

Orchestration

The orchestrator is the conductor across specialists. Triage routes; a shared scratchpad keeps state; conflicts between specialists surface to the human on-call rather than getting silently resolved.

Triage routes the alert. Classify-and-route flow per alert; multiple specialists may engage on cross-cutting incidents.
Shared scratchpad. Cross-specialist state means specialists read each other's findings and avoid redundant investigation paths.
Conflict surfaces to the human. When two specialists disagree, both hypotheses surface to the on-call incident commander (IC); the orchestrator does not silently pick a winner.
Audit log per orchestration. Captured agent-decision history per incident supports postmortem reconstruction and trust-building over time.
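
The orchestration behaviours above (shared scratchpad, conflict surfacing, audit log) can be sketched as a small data structure. This is an assumption-laden illustration: `Finding`, `Orchestration`, and every method name are hypothetical, and treating "disagreement" as "more than one distinct hypothesis" is a simplification chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Finding:
    """One specialist's hypothesis about the incident (hypothetical shape)."""
    specialist: str
    hypothesis: str


@dataclass
class Orchestration:
    """Per-incident shared state: scratchpad plus an append-only audit log."""
    incident_id: str
    scratchpad: list[Finding] = field(default_factory=list)
    audit_log: list[str] = field(default_factory=list)

    def record(self, finding: Finding) -> None:
        # Every finding goes to the shared scratchpad and to the audit log.
        self.scratchpad.append(finding)
        self.audit_log.append(f"{finding.specialist} proposed: {finding.hypothesis}")

    def findings_from_others(self, specialist: str) -> list[Finding]:
        # Lets a specialist read what the others have already found,
        # so it can skip redundant investigation paths.
        return [f for f in self.scratchpad if f.specialist != specialist]

    def hypotheses_for_on_call(self) -> list[Finding]:
        # If specialists disagree, return ALL findings for the human to judge.
        # The orchestrator never picks a winner itself.
        distinct = {f.hypothesis for f in self.scratchpad}
        if len(distinct) > 1:
            self.audit_log.append("conflict surfaced to on-call")
            return list(self.scratchpad)
        return []


if __name__ == "__main__":
    run = Orchestration(incident_id="example-incident")
    run.record(Finding("database", "connection pool exhausted"))
    run.record(Finding("kubernetes", "node memory pressure"))
    for finding in run.hypotheses_for_on_call():
        print(finding.specialist, "->", finding.hypothesis)
```

Returning every competing finding, rather than a ranked or merged result, mirrors the rule that conflicts go to the human; the audit log keeps the order of events for later postmortem reconstruction.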

Integration with existing workflows

Multi-agent fits into the on-call surface that already exists. Slack, paging, and postmortem all stay in their normal places; the agents work alongside the human flow rather than replacing it.

Slack-native interaction. A human can interject at any point; the on-call can redirect or stop the agents from the same channel where the incident is being managed.
Paging integration. Agent acknowledgement is recorded the same way human acknowledgement is; MTTA reporting stays meaningful.
Postmortem integration. Agent actions and findings auto-populate the timeline; the postmortem owner reviews the timeline rather than reconstructing it.
Documented behaviour per integration. Named SLAs and limits per integration so on-call has no surprises mid-incident.
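
To illustrate why recording agent and human acknowledgements in the same shape keeps MTTA (mean time to acknowledge) reporting meaningful, here is a minimal sketch. The `Ack` record and its `actor` labels are assumptions made for the example, not the paging integration's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Ack:
    """One acknowledgement record (hypothetical schema)."""
    paged_at: datetime
    acked_at: datetime
    actor: str  # "agent" or "human"; labels assumed for illustration


def mtta(acks: list[Ack]) -> timedelta:
    """Mean time to acknowledge across all records, regardless of actor."""
    if not acks:
        raise ValueError("no acknowledgements to average")
    total = sum((a.acked_at - a.paged_at for a in acks), timedelta())
    return total / len(acks)


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 12, 0, 0)
    acks = [
        Ack(start, start + timedelta(seconds=30), "agent"),
        Ack(start, start + timedelta(seconds=90), "human"),
    ]
    print(mtta(acks))  # one figure covering both agents and humans
    print(mtta([a for a in acks if a.actor == "human"]))  # humans only
```

Because both kinds of acknowledgement share one record shape, the same function yields a combined figure or a per-actor figure by filtering on `actor`.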

Availability and rollout

Multi-agent is in staged release with early users. The roadmap is designed to harden each specialist with real incident data before broadening access; documentation ships alongside the rollout.

Early-access programme. Per-team enablement for selected workloads; early users help shape the rollout before broader release.
Tiering plan. Pricing structure designed so multi-agent fits both standard and higher tiers; details firm up as the roadmap progresses.
Documentation alongside rollout. Setup guide, expected behaviour, and limits documented so adoption does not require a sales call.
Named owner per early-access team. A responsible champion on the customer side ensures the rollout produces useful feedback rather than drifting silently.