The SRE Toolchain Inventory: 12 Tool Categories Every Team Covers
Most SRE teams converge on a recognizable toolchain. Knowing the standard saves you from re-inventing categories nobody needed.
Categories 1-3: monitoring, logs, traces
Monitoring: Prometheus / Datadog. Logs: Loki / Elastic / Splunk. Traces: Tempo / Jaeger / Honeycomb.
Most teams run one or two tools from each category. The integration story is what matters.
Categories 4-6: alerts, on-call, runbooks
Alerts: Alertmanager / PagerDuty. On-call: PagerDuty / Opsgenie. Runbooks: Confluence / Notion / Backstage.
These three define the on-call experience. Friction here costs morale daily.
Categories 7-9: deploy, IaC, secrets
Deploy: Argo CD / Flux / Spinnaker. IaC: Terraform / Crossplane. Secrets: Vault / Secrets Manager.
Foundational. Hard to swap. Pick carefully on day 1.
Categories 10-12: chaos, security, postmortem
Chaos: Chaos Mesh / Litmus. Security: Falco / Tetragon + Trivy / Snyk. Postmortem: Notion / FireHydrant / Jeli.
Tier-2 in maturity. Most teams add these as the platform stabilizes.
Antipatterns
- One vendor for everything. Cheap on paper; expensive at contract renewal.
- 12 vendors with no integration story. Dashboard sprawl; tab-switching tax.
- No documented ‘our SRE stack.’ New hires re-discover it; old hires forget which tool does what.
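One low-effort way to document the stack is a small machine-readable inventory kept next to the runbooks. The sketch below is only an illustration: the category names come from this article, while the `STACK` contents and the helper names (`missing_categories`, `tool_coverage`, `render`) are hypothetical and should be replaced with whatever your team actually runs.

```python
from collections import defaultdict

# The twelve categories named in this article, in order.
CATEGORIES = [
    "monitoring", "logs", "traces",
    "alerts", "on-call", "runbooks",
    "deploy", "iac", "secrets",
    "chaos", "security", "postmortem",
]

# Hypothetical example stack (an assumption, not a recommendation).
STACK = {
    "monitoring": ["Prometheus"],
    "logs": ["Loki"],
    "traces": ["Tempo"],
    "alerts": ["Alertmanager", "PagerDuty"],
    "on-call": ["PagerDuty"],
    "runbooks": ["Backstage"],
    "deploy": ["Argo CD"],
    "iac": ["Terraform"],
    "secrets": ["Vault"],
    "postmortem": ["Notion"],
}


def missing_categories(stack):
    """Return categories with no tool listed, in article order."""
    return [c for c in CATEGORIES if not stack.get(c)]


def tool_coverage(stack):
    """Map each tool to the sorted list of categories it covers."""
    coverage = defaultdict(list)
    for category, tools in stack.items():
        for tool in tools:
            coverage[tool].append(category)
    return {tool: sorted(cats) for tool, cats in coverage.items()}


def render(stack):
    """Render a one-line-per-category summary suitable for a wiki page."""
    lines = []
    for category in CATEGORIES:
        tools = stack.get(category) or []
        label = " / ".join(tools) if tools else "(none yet)"
        lines.append(f"{category}: {label}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(STACK))
    print("Missing:", missing_categories(STACK))
```

Keeping the inventory as data rather than prose makes gaps visible (here, chaos and security have no tool yet) and shows which tools span more than one category.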
What to do this week
Three moves. (1) Start a two-week trial of one candidate tool against a single workload. (2) Compare it against your current tool on criteria you write down before the trial starts. (3) Plan the migration only if the trial shows real wins, not theoretical ones.
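The comparison in step (2) and the "real wins" test in step (3) can be sketched as a small function. Everything here is an assumption for illustration: the criterion names, the placeholder numbers, the 10% threshold, and the function name `compare_trial` are all made up, and every criterion is assumed to be a positive number where lower is better.

```python
def compare_trial(current, candidate, min_gain=0.10):
    """Split criteria into real wins and regressions.

    Both dicts map criterion name -> measured value (positive, lower is better).
    A win needs a relative improvement of at least min_gain; anything worse
    than the current tool counts as a regression. Small gains are ignored.
    """
    wins, regressions = {}, {}
    for name, cur in current.items():
        gain = (cur - candidate[name]) / cur  # relative improvement
        if gain >= min_gain:
            wins[name] = round(gain, 3)
        elif gain < 0:
            regressions[name] = round(gain, 3)
    return wins, regressions


# Placeholder measurements, not real data.
current = {
    "minutes_to_diagnose": 30.0,
    "monthly_cost_usd": 4000.0,
    "alerts_per_week": 50.0,
}
candidate = {
    "minutes_to_diagnose": 20.0,
    "monthly_cost_usd": 3900.0,
    "alerts_per_week": 60.0,
}

if __name__ == "__main__":
    wins, regressions = compare_trial(current, candidate)
    print("Wins:", wins)
    print("Regressions:", regressions)
```

Fixing the threshold before the trial starts is the point of the design: it keeps a 2.5% cost saving from being counted as a win after the fact.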