Nova Shell: Conversational Infrastructure Control

A shell that takes plain English: it generates the kubectl, terraform, or AWS CLI command, shows the diff, and waits for confirmation. The LLM is the planner; the human is the safety net.

Why a conversational shell

Engineers who run infrastructure spend a lot of their day translating intent to syntax. "I need to scale that deployment" becomes `kubectl scale deployment payments-api --replicas=8 -n production`. "Show me the recent log spike on the orders service" becomes a Datadog query. The translation layer is the cost; the actual decision (scale to 8, query that service) was already made.

Nova Shell removes the translation layer. Type "scale payments-api to 8 replicas" and the shell figures out the namespace from your service map, generates the kubectl command, and shows you the resulting diff. Read-only queries run immediately; writes wait until you click confirm.

This is not a chatbot. The shell is deterministic about what it can do: there is a fixed tool palette, the LLM picks from that palette, and the human confirms. The model never invents a command outside the palette; it can fail to find a match (and tell you), but it can't quietly execute something nobody asked for.

The tool palette

The shell has 80+ vetted tools across the major operational surfaces: kubectl, terraform plan/apply, AWS CLI, Azure CLI, gcloud, Datadog query, Prometheus query, Slack message, GitHub create-PR, Jira create-issue, ServiceNow create-ticket, and so on. Each tool has a typed input schema, a description string the LLM uses for selection, and an authorisation check that runs before execution.

The model is given the user's input, the tool descriptions, and the user's recent context (the last 5 actions, the current incident if any, the current namespace if relevant). It produces a structured plan: which tool, what arguments, and what to do with the output. The plan is rendered in the UI as "I'm going to run X, which will do Y. Confirm?"
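To make the palette idea concrete, here is a minimal sketch of how a structured plan could be checked against a fixed set of tools before anything is shown to the user. The names (`Tool`, `Plan`, `validate`, `PALETTE`, the `kubectl_scale` entry and its `authorise` rule) are hypothetical illustrations, not Nova Shell's actual types.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Tool:
    name: str
    description: str                  # text the model sees when selecting a tool
    schema: dict[str, type]           # argument name -> expected Python type
    is_write: bool                    # writes need human confirmation
    authorise: Callable[[str], bool]  # hypothetical per-user authorisation check


@dataclass(frozen=True)
class Plan:
    tool: str
    args: dict[str, Any]


class PlanRejected(Exception):
    """Raised when a model-produced plan cannot be accepted."""


def validate(plan: Plan, palette: dict[str, Tool], user: str) -> Tool:
    """Accept a plan only if it names a palette tool, matches the schema,
    and the user is authorised to run that tool."""
    tool = palette.get(plan.tool)
    if tool is None:
        raise PlanRejected(f"unknown tool: {plan.tool}")
    if set(plan.args) != set(tool.schema):
        raise PlanRejected("arguments do not match the tool schema")
    for key, expected in tool.schema.items():
        if not isinstance(plan.args[key], expected):
            raise PlanRejected(f"{key} must be {expected.__name__}")
    if not tool.authorise(user):
        raise PlanRejected(f"{user} may not run {tool.name}")
    return tool


# Illustrative single-entry palette; the authorised user set is made up.
PALETTE = {
    "kubectl_scale": Tool(
        name="kubectl_scale",
        description="Scale a Kubernetes deployment to N replicas",
        schema={"deployment": str, "namespace": str, "replicas": int},
        is_write=True,
        authorise=lambda user: user in {"alice"},
    )
}
```

Under these assumptions, a plan that names a tool outside `PALETTE`, passes a mistyped argument, or comes from an unauthorised user is rejected before a diff is ever rendered.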

Tool selection accuracy is 96.4% on our internal evaluation harness, a benchmark of 800 representative SRE intents labelled with the correct tool. The remaining 3.6% are mostly ambiguous inputs where two tools could be reasonable; we surface both options when the model is uncertain.

Diff before run

For write operations, the shell shows what's about to change before running anything. For Kubernetes operations, that's the rendered manifest diff. For Terraform, that's the plan output. For AWS, that's the API call body and the resource it will modify. The diff is rendered with a one-click expand for the full payload.
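As an illustration of the Kubernetes case, the sketch below renders a unified diff between a live manifest and a proposed one using Python's standard `difflib`. The manifest text and the `render_diff` helper are assumptions made for the example; how Nova Shell actually obtains and renders manifests is not described here.

```python
import difflib


def render_diff(live: str, proposed: str, name: str) -> str:
    """Return a unified diff between the live and proposed manifest text."""
    lines = difflib.unified_diff(
        live.splitlines(),
        proposed.splitlines(),
        fromfile=f"{name} (live)",
        tofile=f"{name} (proposed)",
        lineterm="",
    )
    return "\n".join(lines)


# Hypothetical manifest fragment, for illustration only.
LIVE = """\
kind: Deployment
metadata:
  name: payments-api
  namespace: production
spec:
  replicas: 4
"""

PROPOSED = LIVE.replace("replicas: 4", "replicas: 8")

if __name__ == "__main__":
    print(render_diff(LIVE, PROPOSED, "payments-api"))
```

The changed line shows up as a removed `replicas: 4` and an added `replicas: 8`, with the namespace visible as context just above it, which is the detail a reviewer needs in order to catch a wrong target.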

This is the part that makes the shell safe enough to put in production hands. The LLM might hallucinate a wrong replica count or a wrong namespace; the human looks at the diff, sees that "production" should have been "staging," and clicks cancel. The cost of a wrong command is the time to read the diff, not a fire-shaped incident.

For irreversible operations (database deletions, IAM permission changes, certificate revocation), the shell requires double confirmation: a typed "I understand" followed by the confirm click. Reversible operations (scale up/down, restart pod) require only the single click. The threshold is configurable per tenant.
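The confirmation tiers can be sketched as a small gate function. Everything named here (`Risk`, `DEFAULT_RISK`, `classify`, `may_execute`, and the operation names) is hypothetical; the sketch only assumes the three tiers described above and a per-tenant override table.

```python
from enum import Enum
from typing import Optional


class Risk(Enum):
    READ = "read"
    REVERSIBLE = "reversible"
    IRREVERSIBLE = "irreversible"


ACK_PHRASE = "I understand"

# Hypothetical default classification of a few operations.
DEFAULT_RISK = {
    "get_pods": Risk.READ,
    "scale_deployment": Risk.REVERSIBLE,
    "drop_database": Risk.IRREVERSIBLE,
}


def classify(operation: str, tenant_overrides: Optional[dict] = None) -> Risk:
    """Look up the risk tier, letting a tenant override the default.
    Unknown operations fall back to the strictest tier."""
    overrides = tenant_overrides or {}
    return overrides.get(operation, DEFAULT_RISK.get(operation, Risk.IRREVERSIBLE))


def may_execute(risk: Risk, clicked_confirm: bool = False, typed_ack: str = "") -> bool:
    """Reads run immediately, reversible writes need one click,
    irreversible writes need the typed phrase and the click."""
    if risk is Risk.READ:
        return True
    if risk is Risk.REVERSIBLE:
        return clicked_confirm
    return clicked_confirm and typed_ack.strip() == ACK_PHRASE
```

Defaulting unknown operations to the strictest tier is a design choice of this sketch: a missing classification then costs an extra confirmation rather than an unguarded write.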

The audit trail

Every shell interaction is logged in the audit ledger: the natural-language input, the generated command, the diff that was shown, the confirmation timestamp, the user, the IP, and the result. This is the same ledger that captures dashboard edits, file transfers, and remediation actions, so there is one source of truth for who did what.

The ledger is queryable. "Show me every shell action by user X in the last week" is a single query. "Show me every Terraform apply against production this quarter" is another. Compliance audits go from "spelunk through three log sources" to "run this query."
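The ledger's real query interface is not shown here, but the shape of the first query can be sketched over in-memory records. The `AuditEntry` fields mirror a subset of the logged items listed above; the class and the `actions_by_user` helper are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AuditEntry:
    user: str
    nl_input: str          # the natural-language input
    command: str           # the generated command
    confirmed_at: datetime
    result: str


def actions_by_user(entries: list[AuditEntry], user: str, since: datetime) -> list[AuditEntry]:
    """Return the entries for one user confirmed at or after `since`."""
    return [e for e in entries if e.user == user and e.confirmed_at >= since]
```

The second query from the text would follow the same pattern, filtering on the command and target environment instead of the user.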

The ledger entries are content-addressed and tamper-evident. Each entry includes a hash of the previous entry, so modifying an old entry invalidates every hash after it; hiding the change would require regenerating the rest of the chain, which would be detected by the integrity check that runs nightly. This is the same property the cryptocurrency people get from blockchains, applied to a much smaller and more useful problem.
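The hash-chaining property can be shown in a few lines. This is a minimal sketch that assumes SHA-256 over a JSON serialisation; the entry layout and the function names are illustrative, not the ledger's actual format.

```python
import hashlib
import json
from typing import Optional

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(payload: dict, prev_hash: str) -> str:
    """Hash the payload together with the previous entry's hash."""
    body = json.dumps({"payload": payload, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append(ledger: list, payload: dict) -> None:
    """Append an entry linked to the current tail of the ledger."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS
    ledger.append(
        {"payload": payload, "prev": prev_hash, "hash": entry_hash(payload, prev_hash)}
    )


def first_broken_index(ledger: list) -> Optional[int]:
    """Walk the chain and return the index of the first entry whose link
    or hash does not check out, or None if the chain is intact."""
    prev_hash = GENESIS
    for i, entry in enumerate(ledger):
        if entry["prev"] != prev_hash:
            return i
        if entry["hash"] != entry_hash(entry["payload"], entry["prev"]):
            return i
        prev_hash = entry["hash"]
    return None
```

Editing the payload of an old entry makes its stored hash stop matching, so the walk reports that index. In a sketch like this, catching a fully regenerated chain would additionally require comparing the latest hash against a copy kept somewhere the editor cannot reach.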

What we learned about safety

Three things from the first 90 days of production. (1) The diff-before-run pattern is doing real work: users cancel about 4% of write operations after seeing the diff. Without that step, those commands would have executed in production. (2) The confidence threshold matters more than we thought. We initially had the shell auto-execute when model confidence was >95%; we backed off to never auto-executing writes after one misdiagnosed namespace. (3) Read operations don't need the same gate: auto-executing reads is fine because they don't change state.

The other thing we learned: users don't always want conversation. About 40% of the time, the user types a command directly (`kubectl get pods -n production`). The shell passes those through as-is, with the same auth and audit rules but without the LLM round-trip. The conversation is the option, not the requirement.
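One simple way to decide between pass-through and the LLM path is to check whether the input starts with a known CLI binary. The sketch below is an assumption about how such routing might work, not a description of Nova Shell's detection logic; `KNOWN_CLIS` and `route` are hypothetical names.

```python
import shlex

# Hypothetical set of binaries treated as direct commands.
KNOWN_CLIS = {"kubectl", "terraform", "aws", "az", "gcloud"}


def route(user_input: str) -> str:
    """Return "passthrough" for inputs that look like a direct CLI command,
    and "llm" for everything else."""
    try:
        tokens = shlex.split(user_input)
    except ValueError:
        # Unbalanced quotes: not a well-formed command line, so let the LLM handle it.
        return "llm"
    if tokens and tokens[0] in KNOWN_CLIS:
        return "passthrough"
    return "llm"
```

A first-token check is deliberately naive: a sentence that happens to begin with "aws" would be routed as a command, so a real implementation would likely need a stricter test. Either route would still go through the same auth and audit steps.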

Nova Shell is live in production for all tenants today. Cmd+K from anywhere in the app opens it. The Slack integration ships in v2.7.