What site reliability engineering is
Site reliability engineering is the discipline of applying software engineering to operations problems. Instead of staffing a team of operators who manually keep systems alive, you staff a team of software engineers and give them a mandate to automate that work away. The reliability of the service is their product, and code is how they deliver it. That single inversion, treating ops as a software problem rather than a labor problem, is the whole idea, and everything else in SRE follows from it.
The discipline was created at Google in the early 2000s. Ben Treynor Sloss, who is usually credited as the founder, was asked to run a production operations team and made the deliberate choice to build it out of software engineers, who would get bored doing manual operations and would therefore automate them away. His own definition is blunt: SRE is what you get when you ask a software engineer to design an operations function. The 2016 book Site Reliability Engineering codified the practices and turned an internal Google approach into an industry-wide movement.
The reason the idea spread is that traditional operations does not scale. If every new server, service, and customer means more manual work, then growth requires linear headcount growth, and eventually the operations burden consumes the team. SRE breaks that link. By writing software to do the operational work, an SRE function can support a system that grows far faster than the team does. The goal is not zero operations; it is operations that scale sublinearly with the system, so a small team can run something large and keep it reliable.
SRE vs DevOps vs platform engineering
These three terms are constantly conflated, and the confusion is understandable because they overlap heavily and share the same goal: ship software fast without breaking reliability. But they are not the same thing, and knowing where each one starts and stops is what separates using them well from using them as buzzwords.
| Discipline | What it is | Primary concern | Owns |
|---|---|---|---|
| DevOps | A culture and set of principles | Breaking the dev/ops wall | Shared ownership and faster delivery |
| SRE | A concrete implementation of those principles | Reliability of running services | SLOs, incidents, toil, error budgets |
| Platform engineering | Building internal self-service tooling | Developer experience | Paved roads and golden paths |
DevOps is the culture; SRE is one opinionated way to implement it. DevOps says development and operations should share ownership, automate their pipelines, and shorten feedback loops, but it deliberately does not prescribe the mechanics. SRE supplies the mechanics: service level objectives, error budgets, a hard cap on toil, blameless postmortems, and a software-engineering approach to operational work. The often-quoted framing is "class SRE implements DevOps": DevOps is the interface, and SRE is one concrete class that satisfies it. You can do DevOps without SRE, but SRE gives DevOps teeth.
Platform engineering is the newer arrival and the one most often confused with SRE today. Where SRE owns the reliability outcome, platform engineering owns the tooling that makes that outcome cheap to achieve: the internal developer platform, the paved roads, the self-service infrastructure that lets product teams ship and operate their own services without filing tickets. The two are deeply complementary, and in practice many platform teams are staffed by engineers with SRE backgrounds, because a good internal platform is one of the most effective ways to deliver reliability at scale. For how AI is now reshaping the operational side of all three, see agentic SRE and AI SRE.
The core SRE principles
SRE is defined less by any single tool than by a small set of principles that, taken together, change how a team relates to reliability. These are the ones that matter.
Embrace risk and define reliability with SLOs
Perfect reliability is the wrong target: it is impossible, it is ruinously expensive, and beyond a point users cannot even tell the difference. SRE replaces the vague demand for "high availability" with a service level objective, a precise number such as 99.9 percent availability over a rolling month. The SLO makes reliability a measurable, negotiable quantity rather than an absolute, and it gives the team an honest answer to the question of how reliable a service actually needs to be.
Error budgets to arbitrate speed versus reliability
The flip side of an SLO is an error budget: the unreliability you are allowed to spend. A 99.9 percent SLO grants roughly 43 minutes of downtime a month, and that budget is yours to spend on shipping fast and taking risks. While the budget has room, the team moves quickly; when it runs out, releases freeze and the team works on reliability until the budget recovers. The error budget turns the eternal argument between product velocity and stability into an objective, data-driven decision.
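The arithmetic behind that figure is short enough to sketch. The function names below are illustrative, and the 30-day window is an assumption standing in for "a month"; a real implementation would read downtime from whatever monitoring system records it.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget_minutes(slo_percent, window_days)
    if budget == 0:
        raise ValueError("a 100 percent SLO leaves no error budget")
    return 1 - downtime_minutes / budget


if __name__ == "__main__":
    # 99.9 percent over 30 days: 43,200 minutes * 0.001, about 43.2 minutes.
    print(round(error_budget_minutes(99.9), 1))
    # After 10 minutes of downtime, roughly 77 percent of the budget is left.
    print(round(budget_remaining(99.9, 10.0), 2))
```

A release gate can then be as simple as refusing to ship while `budget_remaining` is at or below zero, though where exactly to draw that line is a policy choice for each team.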
Eliminate toil
Toil is manual, repetitive, automatable work that scales with the size of the system and produces nothing lasting. SRE treats it as the enemy, because work that grows linearly with the service eventually eats the whole team. The mandate is to engineer it away, and the famous mechanism is a hard cap: an SRE should spend no more than half their time on toil, so the rest goes to building the automation that removes future toil.
Automate everything that should be
Automation is the lever that makes the rest possible. The known, safe, repetitive operational actions, restarting a stuck worker, rolling back a bad deploy, clearing a full disk, scaling out a saturated tier, should be code and not a hand-typed sequence executed at 3 a.m. This is the bridge to self-healing infrastructure, where the system remediates the common failure classes without waking anyone.
Release gradually
Big-bang releases turn a small bug into a large outage. SRE favors progressive delivery: canaries, blue/green deploys, and percentage rollouts that expose a change to a sliver of traffic first, watch the SLO, and roll forward only if it holds. Gradual rollout means failures are caught while they are still small and cheap, which is one of the most direct ways to protect the error budget.
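As a rough illustration of "watch the SLO, and roll forward only if it holds", the sketch below decides what to do after one rollout stage. Everything here is hypothetical: the `StageResult` type, the `min_requests` threshold, and the rule of comparing the stage's error rate directly against the SLO are simplifying assumptions, not a description of any particular deployment tool.

```python
from dataclasses import dataclass


@dataclass
class StageResult:
    """Traffic observed while a change was exposed to one rollout stage."""
    percent: int    # share of traffic routed to the new version
    requests: int   # requests served by the new version
    errors: int     # failed requests among them


def next_action(result: StageResult, slo_percent: float,
                min_requests: int = 1000) -> str:
    """Return 'wait', 'promote', or 'rollback' for a finished stage."""
    if result.requests < min_requests:
        # Too little traffic to judge; keep the stage running.
        return "wait"
    allowed_error_rate = 1 - slo_percent / 100
    observed = result.errors / result.requests
    return "promote" if observed <= allowed_error_rate else "rollback"


if __name__ == "__main__":
    print(next_action(StageResult(percent=5, requests=2000, errors=1), 99.9))
    print(next_action(StageResult(percent=5, requests=2000, errors=10), 99.9))
```

The minimum-traffic check matters in practice: a canary that has served a handful of requests can look perfect or catastrophic purely by chance.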
Blameless culture
The only sustainable way to get more reliable is to learn from every failure, and people only share what really happened when they are not going to be punished for it. A blameless postmortem treats an incident as a property of the system, not a fault of the person who was on call, and converts it into a detection improvement, a new runbook, or a new automation. Blame, by contrast, drives the truth underground and guarantees the same incident recurs.
What an SRE actually does day to day
The job is a deliberate split between keeping the system running today and engineering it so it runs better tomorrow, governed by the rule that the first half should never crowd out the second.
On-call and incident command
SREs carry the pager for the services they own. When an incident fires they respond, and for larger incidents one of them acts as incident commander, coordinating the response so the question "who is in charge?" never costs time. The discipline around this lives in incident management and on-call practice, and the speed of recovery is measured by MTTR.
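MTTR is a plain average, which a short sketch makes concrete. The version below assumes each incident is recorded as a (detected, resolved) pair of timestamps; teams differ on whether the clock starts at onset, detection, or acknowledgement, so treat that choice as an assumption.

```python
from datetime import datetime, timedelta


def mttr(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Mean time to recovery over (detected_at, resolved_at) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = timedelta()
    for detected_at, resolved_at in incidents:
        if resolved_at < detected_at:
            raise ValueError("incident resolved before it was detected")
        total += resolved_at - detected_at
    return total / len(incidents)


if __name__ == "__main__":
    day = datetime(2026, 1, 5)
    history = [
        (day.replace(hour=3, minute=0), day.replace(hour=3, minute=30)),
        (day.replace(hour=14, minute=0), day.replace(hour=15, minute=30)),
    ]
    print(mttr(history))  # mean of 30 and 90 minutes
```

Because a single long incident can dominate the mean, it is often worth looking at the distribution of recovery times alongside the average.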
Capacity, performance, and reliability work
Between incidents, SREs forecast capacity so the system does not fall over under growth, tune performance, track SLOs and how fast the error budget is being spent, and decide when the budget says it is time to slow feature work and harden the service. This is the steady-state engineering that keeps the next incident from happening at all.
The 50% toil cap
The single most defining feature of the SRE role is the cap on toil at half the job. If operational load grows past 50 percent of an SRE's time, that is not treated as "we need to work harder"; it is treated as a signal that something must be automated or pushed back to the development team. The cap is what protects the engineering time that makes the whole model work. Without it, an SRE team quietly degrades into a traditional ops team that happens to have the SRE title.
Building automation and runbooks
The other half of the job is software: writing the automation that removes toil, codifying tribal knowledge into runbooks anyone on-call can execute, designing safe-rollout and self-healing systems, and running the blameless postmortems that feed all of it. The diagnostic discipline that lives inside an incident is covered in root cause analysis, and reducing the constant pages is the subject of alert fatigue.
Toil: the enemy of SRE
Toil deserves its own section because it is the concept that, once understood, reframes the entire job. If you only remember one thing about what SREs fight against, remember toil.
What counts as toil
Toil is work that meets most of these tests: it is manual, it is repetitive, it could be automated, it is tactical rather than strategic, it has no enduring value, and it scales linearly with the size of the service. Restarting a hung process by hand every shift, manually applying the same configuration to a hundred machines, copy-pasting commands out of a runbook for a known failure: all toil. Note what is not toil: responding to a novel failure, designing a new system, and writing the automation that removes toil are all real engineering, even when they are hard and unpleasant.
Why toil is the enemy
Toil is dangerous for two reasons. The first is opportunity cost: every hour spent on toil is an hour not spent making the system better, so a team drowning in toil never gets ahead. The second is that toil scales with the system. As the service grows, the toil grows with it, and because it grows linearly while a team is fixed in size, unchecked toil eventually consumes everyone. A team at 100 percent toil has no path out, because it has no time left to build the automation that would reduce it.
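The scaling argument can be made concrete with a toy model. The numbers used here (two hours of toil per service per week, a five-person team, 40-hour weeks) are invented purely for illustration, and the assumption that toil is exactly proportional to service count is the simplest possible one.

```python
def toil_fraction(services: int, toil_hours_per_service: float,
                  engineers: int, hours_per_engineer: float = 40.0) -> float:
    """Share of weekly team capacity consumed by toil under linear scaling."""
    capacity = engineers * hours_per_engineer
    return services * toil_hours_per_service / capacity


def services_at_cap(cap: float, toil_hours_per_service: float,
                    engineers: int, hours_per_engineer: float = 40.0) -> int:
    """Largest service count that keeps toil at or under the given cap."""
    capacity = engineers * hours_per_engineer
    return int(cap * capacity // toil_hours_per_service)


if __name__ == "__main__":
    # A fixed team of 5 with 2 toil-hours per service per week.
    for n in (25, 50, 100):
        print(n, toil_fraction(n, 2.0, 5))
    print(services_at_cap(0.5, 2.0, 5))
```

Under these made-up numbers the team reaches the 50 percent cap at 50 services and is fully consumed at 100, which is the "no path out" state: the only ways to move the line are more people or less toil per service.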
How to measure and cut it
You cannot manage toil you do not measure, so start by tracking it: have SREs categorize their time and surface what fraction is toil. Then attack the largest, most repetitive sources first, because those have the best automation payoff. Codify the manual task into a runbook, then promote the runbook into an automated action, then promote the automated action into something the system does for itself. Hold the line at the 50 percent cap: when toil crosses it, that is the trigger to spend engineering time on removing it, or to renegotiate ownership with the development team. Modern DevOps automation and agentic remediation are the most powerful tools for collapsing the toil that used to be permanent.
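A minimal sketch of that measurement step, assuming time is logged as (task, hours, is_toil) entries. The entry format and task names are hypothetical; the point is only that the toil fraction and a ranked list of its sources fall out of very little data.

```python
from collections import defaultdict

TOIL_CAP = 0.5  # the 50 percent cap


def toil_report(entries):
    """Summarize time entries given as (task, hours, is_toil) tuples.

    Returns (toil_fraction, over_cap, toil_sources), where toil_sources is a
    list of (task, hours) sorted from largest to smallest.
    """
    total_hours = 0.0
    toil_by_task = defaultdict(float)
    for task, hours, is_toil in entries:
        total_hours += hours
        if is_toil:
            toil_by_task[task] += hours
    toil_hours = sum(toil_by_task.values())
    fraction = toil_hours / total_hours if total_hours else 0.0
    ranked = sorted(toil_by_task.items(), key=lambda kv: kv[1], reverse=True)
    return fraction, fraction > TOIL_CAP, ranked


if __name__ == "__main__":
    week = [
        ("restart stuck worker", 12, True),
        ("capacity model", 10, False),
        ("manual config push", 10, True),
        ("postmortem write-up", 8, False),
    ]
    fraction, over_cap, sources = toil_report(week)
    print(fraction, over_cap)   # 22 of 40 hours is toil, so the cap is crossed
    print(sources[0])           # the largest source is the first automation target
```

The ranked list is what drives the "attack the largest sources first" step: the top entry is the candidate to codify into a runbook and then automate.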
The SRE practice stack
The principles are realized through a stack of concrete practices. Each one is a discipline in its own right; together they are what an SRE function actually operates.
| Practice | What it does | Why it matters |
|---|---|---|
| SLOs & SLIs | Define and measure reliability targets | Makes reliability a number you can manage |
| Observability | Metrics, logs, and traces of the system | You cannot operate what you cannot see |
| Incident management | Coordinated response to outages | Bounds the cost of every failure |
| Postmortems | Blameless analysis after incidents | Turns failures into permanent improvements |
| Automation | Code that removes toil and self-heals | Lets a small team operate at scale |
SLOs and SLIs are the foundation: a service level indicator is the raw measurement (request success rate, latency), and the SLO is the target you hold it to. Observability is the sensory system, the metrics, logs, and traces that let you see what the system is doing; for the AI-augmented version see AI observability. Incident management bounds the cost of failure when it happens. Postmortems convert each failure into a lasting improvement, and automation is what lets the whole thing scale. The broader operational category that ties detection, correlation, and remediation together is AIOps.
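To make the SLI and SLO distinction concrete, here is a small sketch that computes two common indicators from raw counts and compares them to a target. The function names and the 500 ms latency threshold are illustrative assumptions, not a standard.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Availability SLI: percentage of requests that succeeded."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 100.0 * good_requests / total_requests


def latency_sli(latencies_ms: list[float], threshold_ms: float) -> float:
    """Latency SLI: percentage of requests served within the threshold."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    fast = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return 100.0 * fast / len(latencies_ms)


def meets_slo(sli_percent: float, slo_percent: float) -> bool:
    """The SLO is simply the target the measured SLI is held to."""
    return sli_percent >= slo_percent


if __name__ == "__main__":
    sli = availability_sli(good_requests=99_950, total_requests=100_000)
    print(sli, meets_slo(sli, 99.9))
    print(latency_sli([100, 200, 300, 900], threshold_ms=500))
```

The indicator is a measurement and can be computed without any opinion about what is acceptable; the objective is where the opinion lives.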
Building an SRE function
You do not build an SRE function by hiring a few people and renaming the ops team. You build it by establishing measurement and culture first, then choosing a model that fits your size, then maturing deliberately.
The three team models
There are three common shapes. Embedded SREs sit inside product teams and own reliability for specific services; this gives deep context but can fragment practice across the organization. Centralized SRE is a shared team that product teams consult or hand services to; this standardizes practice but risks becoming a bottleneck or an ops dumping ground. Platform SRE has the team build paved-road tooling and self-service infrastructure that product teams use to run their own services; this is the model that scales reliability without scaling SRE headcount linearly. Most organizations begin centralized and evolve toward a platform model as they grow.
Hiring
Hire software engineers who care about production, not operators who can also script. The whole premise of SRE is that the people running operations will engineer the manual work away, which requires genuine software-engineering ability plus the temperament to be bothered by repetitive work. Coding ability, systems thinking, and calm under incident pressure matter more than years spent administering servers.
Maturity
An SRE function matures along a predictable path. It starts reactive, firefighting incidents as they come. It becomes measured, once SLOs and honest observability are in place. It becomes proactive, once postmortems and capacity planning prevent incidents rather than just reacting to them. And it becomes automated, once the safe and repetitive remediations run without a human. The frontier in 2026 is agentic, where AI agents handle the operational load and SREs supervise. Where your AI engineers fit into this for the systems they ship is covered in the AI engineer's guide to production reliability and LLMOps.
The 2026 shift to agentic SRE
The original promise of SRE was operations that scale sublinearly with the system. In 2026 that promise is being taken to its logical conclusion by AI. The two spans where SREs lose the most time, diagnosing what broke and remediating it, are exactly where agentic systems have the most leverage, and as they take over those spans the shape of the job changes.
An agentic layer correlates a flood of alerts into a single incident with a ranked root cause in seconds, then auto-resolves the known-safe class of incidents within a policy envelope before a human finishes reading the page. The SRE does not disappear; they move up the stack. Instead of typing the fix, they define the policy that says what an agent may do unsupervised, write the runbooks the agents execute, set the guardrails, and own the genuinely novel incidents that no agent has seen before. This is the difference between a function whose cost grows with the system and one that does not.
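One way to picture a policy envelope is as an explicit allow-list with limits, checked before any automated action runs. The sketch below is a generic illustration under assumed names (`Policy`, `ProposedAction`, `decide`, and the example environments); it does not describe how Nova or any other product implements its policies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """What an agent may do without a human in the loop (hypothetical shape)."""
    allowed_actions: frozenset[str]
    max_hosts: int                  # blast-radius limit per action
    protected_envs: frozenset[str]  # environments that always need a human


@dataclass(frozen=True)
class ProposedAction:
    name: str
    hosts: int
    env: str


def decide(policy: Policy, action: ProposedAction) -> str:
    """Return 'auto_resolve' if the action is inside the envelope, else 'escalate'."""
    if action.name not in policy.allowed_actions:
        return "escalate"
    if action.env in policy.protected_envs:
        return "escalate"
    if action.hosts > policy.max_hosts:
        return "escalate"
    return "auto_resolve"


if __name__ == "__main__":
    policy = Policy(
        allowed_actions=frozenset({"restart_worker", "rollback_deploy"}),
        max_hosts=5,
        protected_envs=frozenset({"payments-prod"}),
    )
    print(decide(policy, ProposedAction("restart_worker", hosts=2, env="web-prod")))
    print(decide(policy, ProposedAction("drop_table", hosts=1, env="web-prod")))
```

The design choice worth noting is that the default is escalation: anything not explicitly permitted goes to a human, which is what keeps novel incidents in the SRE's hands.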
This is where Nova AI Ops fits, and it is the answer to how a modern SRE function scales without linear headcount. Nova is the Multi-Agent OS for SRE and DevOps: 100 specialized AI agents across 12 teams that correlate signals across AWS, GCP, Azure, Linux, and Windows, rank root cause, and auto-resolve the safe class of incidents within a policy envelope, collapsing exactly the toil-heavy operational work that the 50 percent cap was always meant to eliminate. For the broader pattern, see agentic SRE, AI SRE, and AI incident response.
SRE-maturity checklist and 90-day plan
Two practical tools. First, a ten-point checklist to honestly assess where your SRE practice stands; if you cannot check a box, it is your next piece of work. Second, a 90-day plan to stand up or level up an SRE function.
- SLOs exist for your most important services, expressed as concrete numbers, not aspirations.
- Error budgets are computed from those SLOs and actually gate release decisions.
- Observability gives you honest metrics, logs, and traces, so you can see onset, not just hear about it from customers.
- Toil is measured and held under the 50 percent cap, with a trigger to act when it crosses.
- On-call is humane: sustainable rotations, sane alert volume, and clear ownership of every service.
- Incident response has a defined process, severity levels, and a single incident commander for big incidents.
- Blameless postmortems run after every significant incident and produce tracked action items.
- Runbooks exist for your top failure classes and are good enough for anyone on-call to execute.
- Safe remediations are automated, so the known and repetitive fixes do not require a human at 3 a.m.
- Learning compounds: postmortem findings feed back into detection, runbooks, and automation so the same failure resolves faster or stops recurring.
Days 1-30: Measure and establish culture
Before anything else, make reliability a number and make failure safe to discuss. Define SLOs and SLIs for your most important services, instrument them so you can see the truth, and adopt blameless postmortems immediately, because culture is the hardest thing to retrofit later. Start tracking toil so you know how much of your team's time is being consumed by manual work. The deliverable is a trustworthy baseline: you know your reliability, your error budget, and your toil load.
Days 31-60: Establish the practices
Stand up the operating disciplines on top of the baseline. Put a real incident management process in place with severity levels and incident command. Make on-call sustainable and assign a named owner to every service. Write runbooks for your top failure classes, and start enforcing the error budget so release speed responds to reliability data. Pick a team model that fits your size; for most teams that means centralized to start.
Days 61-90: Automate and scale
Turn the runbooks into automation. Promote the known-safe, repetitive remediations into automated actions bounded by a policy envelope, so the boring 80 percent of operational fixes happen without a human. Layer in agentic auto-resolution for the classes you trust, and close the loop so every postmortem feeds detection, runbooks, and automation. The goal at the end of the quarter is an SRE function whose operational load no longer grows linearly with the system, because the toil is being engineered, and increasingly automated, away. This is exactly where Nova AI Ops slots in on top of the measurement and culture work from the first two phases.
The classic mistake is skipping phase one and buying automation before measurement and culture exist. Automation on top of an unmeasured, blame-driven team just makes the chaos faster. Measure first, make failure safe, then automate; every gain downstream depends on that foundation.
Related guides
Go deeper into the reliability stack: AI SRE and agentic SRE for how AI agents now operate systems; AIOps for the broader operational category; incident management for the lifecycle SREs run; AI incident response for how agents compress diagnosis; self-healing infrastructure for automating remediation; root cause analysis for the diagnose span; MTTR for measuring recovery; on-call and alert fatigue for sustainable operations; DevOps automation for cutting toil. For teams shipping AI systems: the AI engineer's guide to production reliability and LLMOps. See the Nova AI Ops feature set across detection, diagnosis, and auto-resolution.
Scale your SRE function without scaling headcount.
Nova AI Ops is the Multi-Agent OS for SRE & DevOps. 100 specialized AI agents across 12 teams correlate signals, rank root cause, and auto-resolve the safe class of incidents within a policy envelope across AWS, GCP, Azure, Linux, and Windows, collapsing exactly the toil the 50% cap was meant to eliminate. Free tier available for small teams.