Deployment Bot Safety
Slack-bot deploys are convenient. These are the safeguards they need.
What deployment bots do
Deployment bots make deploys self-service: any engineer can ship from chat or a portal without leaving the surface they already work in. The convenience is real; so is the risk of turning a compromised Slack account into production deploy access.
- Chat-driven and portal options. Slack-driven deploys (Hubot-style), GitHub Actions ChatOps, or in-house self-service portals; all share the same risk model.
- Convenience. Lower barrier to ship; engineers do not need CLI access or production credentials.
- Risk surface. Unbounded automation in chat means any compromised account becomes deploy access; the bot is the new SSH key.
- Documented scope per bot. Named permissions list, named owner, named approval flow; "the bot can do anything" is how you find out the hard way.
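One way to make the documented scope concrete is to keep it as data next to the bot. The sketch below is a minimal illustration, not a prescribed format: the `BotScope` class, its field names, and the example values (`deploy-bot`, `platform-oncall`, the permission strings) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BotScope:
    """Documented scope for one deployment bot (all names are illustrative)."""
    name: str
    owner_team: str              # named owner
    permissions: frozenset[str]  # named permissions list
    approval_flow: str           # named approval flow

    def allows(self, action: str) -> bool:
        # Anything not listed explicitly is denied.
        return action in self.permissions


# Hypothetical example values.
deploy_bot = BotScope(
    name="deploy-bot",
    owner_team="platform-oncall",
    permissions=frozenset({"deploy:staging", "deploy:production", "rollback:production"}),
    approval_flow="second-engineer-approval",
)
```

Because the permission set is an explicit allow-list, `deploy_bot.allows("delete:database")` is false without anyone having to remember to forbid it.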
Required controls
Three controls are non-negotiable: authentication, authorisation, and audit. Each bot also needs a named owner. Skip any one and the bot becomes the weakest link in your deploy chain.
- Authentication via SSO. Not Slack identity alone; Slack accounts get phished. Require the same SSO that gates the AWS console.
- Authorisation per service. Engineer A cannot deploy service B without approval; deploy permissions ride on the same RBAC model as code review.
- Audit per action. Actor, timestamp, command, output, exit code; the audit log is the postmortem source-of-truth.
- Named owner per bot. A maintaining team that pages on bot-related incidents; "everyone and no one" ownership is how you discover the bot has been broken for two weeks.
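The authorisation and audit controls above might look roughly like this in code. This is a sketch under assumptions: the `DEPLOY_RBAC` table, the `sso_verified` flag (standing in for whatever your SSO provider returns), and the JSON-lines audit format are hypothetical, not a specific product's API.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

# Hypothetical RBAC table: service -> engineers allowed to deploy it.
DEPLOY_RBAC = {"checkout": {"alice"}, "search": {"alice", "bob"}}


@dataclass
class AuditRecord:
    actor: str
    timestamp: str
    command: str
    output: str
    exit_code: int


def is_authorised(actor: str, service: str, sso_verified: bool) -> bool:
    # Slack identity alone is not enough: require a verified SSO session first.
    if not sso_verified:
        return False
    # Unknown services have no allowed actors, so they are denied by default.
    return actor in DEPLOY_RBAC.get(service, set())


def audit_line(actor: str, command: str, output: str, exit_code: int) -> str:
    # One JSON object per action, carrying the five fields named in the list above.
    record = AuditRecord(
        actor, datetime.now(timezone.utc).isoformat(), command, output, exit_code
    )
    return json.dumps(asdict(record))
```

Writing the audit line for denied requests as well as successful ones keeps the log useful as a postmortem source.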
Guardrails
Guardrails prevent the predictable failure modes: deploy loops, off-hours mistakes, and one-engineer rollouts of high-blast-radius changes.
- Rate limit per service. A cap of 10 deploys per hour stops runaway loops and pipeline-storm self-DoS.
- Business-hours restriction for tier-1 services. Payments, auth, and other critical services blocked outside business hours unless an explicit override fires.
- Multi-engineer approval for tier-1 services. A second engineer gates the deploy; pair-approval keeps high-blast changes from happening alone.
- Documented override path per guardrail. Named exception process so "we just bypassed it" never becomes the cultural default.
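The three guardrails can be combined into a single pre-deploy check. The sketch below assumes a sliding one-hour window for the rate limit, business hours of 09:00 to 16:59 on weekdays, and an `override` flag that only bypasses the business-hours rule; the class name, constants, and tier-1 service names are illustrative.

```python
from collections import deque
from datetime import datetime, timedelta

MAX_DEPLOYS_PER_HOUR = 10       # per-service cap
BUSINESS_HOURS = range(9, 17)   # assumption: 09:00-16:59 local time, Mon-Fri
TIER1_SERVICES = {"payments", "auth"}


class Guardrails:
    def __init__(self):
        # service -> timestamps of deploys allowed in the last hour
        self._history: dict[str, deque] = {}

    def check(self, service, now, approvers, requester, override=False):
        """Return (allowed, reason) and record the deploy if allowed."""
        recent = self._history.setdefault(service, deque())
        # Drop entries that have aged out of the one-hour window.
        while recent and now - recent[0] >= timedelta(hours=1):
            recent.popleft()
        if len(recent) >= MAX_DEPLOYS_PER_HOUR:
            return False, "rate limit"
        if service in TIER1_SERVICES:
            in_hours = now.weekday() < 5 and now.hour in BUSINESS_HOURS
            if not in_hours and not override:
                return False, "outside business hours"
            # The requester cannot count as their own second engineer.
            if not (set(approvers) - {requester}):
                return False, "needs second engineer"
        recent.append(now)
        return True, "ok"
```

Returning a reason string alongside the decision lets the bot tell the engineer which guardrail fired and which documented override path applies.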
Bot in incidents
Bot behaviour during incidents is its own design problem. The bot should make incidents better, not worse, and the team should know how to silence it when it does not.
- Disable during sev1. Block bot-driven deploys during severity-1 incidents; human-driven deploys force the slow path that produces fewer surprises.
- Logs in postmortem. Bot action history goes into every postmortem; the audit log becomes part of the timeline.
- Kill switch. Single-command org-wide bot disable, useful when the bot itself is compromised or misbehaving.
- Quarterly kill-switch drill. Simulated disable on a fixed cadence so the team confirms the switch still works before they need it for real.
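A kill switch and a sev1 block can share one gate that every bot-driven deploy consults. The following is an in-memory sketch only: the `DeployGate` class and its attribute names are hypothetical, and a real switch would need to live in shared, durable storage so that one command reaches every bot instance.

```python
class DeployGate:
    """In-memory sketch of the gate checked before any bot-driven deploy."""

    def __init__(self):
        self.kill_switch_on = False   # org-wide bot disable
        self.active_sev1 = False      # set while a severity-1 incident is open

    def bot_deploys_allowed(self) -> bool:
        return not (self.kill_switch_on or self.active_sev1)

    def drill(self) -> bool:
        """Flip the switch, confirm deploys are blocked, then restore it."""
        previous = self.kill_switch_on
        self.kill_switch_on = True
        try:
            return not self.bot_deploys_allowed()
        finally:
            # Restore the prior state even if the check raises.
            self.kill_switch_on = previous
```

A drill that returns `False` means the switch no longer blocks deploys, which is the failure the quarterly cadence is meant to catch early.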
How to deploy safely
Deploy the bot in stages. SSO and RBAC first, low-risk services first, quarterly audit always. The bot's surface area expands faster than its operational maturity if you do not gate it.
- SSO and RBAC before anything else. No chat-driven deploys without the auth foundation; every shortcut here becomes a future incident.
- Low-risk services first. Start with stateless, easily rolled-back services; expand to high-risk services only after the guardrails are battle-tested.
- Quarterly bot-log audit. Look for unusual patterns: night-time deploys, unfamiliar accounts, repeated rollbacks of the same service.
- Quarterly per-engineer access review. Stale deploy access is the lateral-movement path nobody catches until incident response.
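The quarterly log audit can start as a small script over the bot's audit entries. The sketch below assumes each entry is a dict with `actor`, `timestamp` (a `datetime`), `service`, and `action` keys; that shape, the 00:00 to 05:59 definition of night-time, and the threshold of three rollbacks are assumptions to tune, not recommendations.

```python
from collections import Counter
from datetime import datetime


def flag_unusual(entries, known_actors, night_hours=range(0, 6), rollback_threshold=3):
    """Return (finding, actor, service) tuples for the patterns worth a human look."""
    findings = []
    rollbacks = Counter()
    for e in entries:
        if e["timestamp"].hour in night_hours:
            findings.append(("night-time activity", e["actor"], e["service"]))
        if e["actor"] not in known_actors:
            findings.append(("unfamiliar account", e["actor"], e["service"]))
        if e["action"] == "rollback":
            rollbacks[e["service"]] += 1
    # Repeated rollbacks are counted per service across the whole audit period.
    for service, count in rollbacks.items():
        if count >= rollback_threshold:
            findings.append(("repeated rollbacks", None, service))
    return findings
```

The same `known_actors` set can double as input to the per-engineer access review: anyone in it who never appears in the log is a candidate for having stale deploy access removed.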