Levelling SRE Engineers: A Concrete Ladder
An SRE ladder that copies the software-engineer ladder loses what is special about the role. Build one that names the skills SREs actually grow into.
Why SREs need a separate ladder
Software engineers grow toward writing harder code and influencing more code. SREs grow toward operating larger systems and reducing the work of operating them. The skills overlap; the trajectories do not.
The mismatch's cost. Using the SWE ladder for SREs penalises the right SRE behaviours. SWE L5 is "I shipped a complex feature." SRE L5 should be "I designed a system that needed no on-call intervention for two quarters." The work differs and so does the signal. A ladder that ignores this pushes SREs into SWE-style work to get promoted.
The upside of a separate ladder. Recognises operational design as a senior skill. Rewards reliability investments alongside feature work. Lets SREs grow within their craft rather than crossing into SWE work to get promoted.
Five levels
Operator → Owner → Designer → Leader → Principal. Each level adds scope, not just skill. Pay bands track scope. Titles are ones you would actually say at a conference.
The scope dimension matters. Each level expands what the engineer is responsible for: known runbooks → owned services → designed systems → org-wide practice → industry contribution. The technical skill grows alongside, but scope is the primary axis.
The five-level structure's discipline. Five is enough to differentiate without being too many to maintain. Each level has clear differentiators from the levels above and below; calibration meetings can produce consistent decisions.
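One way to keep scope explicit is to write the ladder down as data rather than prose. The sketch below is a minimal illustration, not a prescribed format: the names `Level`, `LadderEntry`, `LADDER`, and `levels_missing_scope` are hypothetical, the level numbers follow the section headings, and the scope strings are copied from the progression described above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Ladder levels, numbered as in the section headings (Principal is L7+)."""
    OPERATOR = 3
    OWNER = 4
    DESIGNER = 5
    LEADER = 6
    PRINCIPAL = 7


@dataclass(frozen=True)
class LadderEntry:
    level: Level
    scope: str  # what the engineer is responsible for at this level


# Scope wording taken from the progression described in the text.
LADDER = [
    LadderEntry(Level.OPERATOR, "known runbooks"),
    LadderEntry(Level.OWNER, "owned services"),
    LadderEntry(Level.DESIGNER, "designed systems"),
    LadderEntry(Level.LEADER, "org-wide practice"),
    LadderEntry(Level.PRINCIPAL, "industry contribution"),
]


def levels_missing_scope(ladder):
    """Return levels whose scope is blank or repeats an earlier level's scope."""
    seen = set()
    problems = []
    for entry in ladder:
        scope = entry.scope.strip().lower()
        if not scope or scope in seen:
            problems.append(entry.level)
        seen.add(scope)
    return problems
```

A check like `levels_missing_scope(LADDER)` returning an empty list is only a weak signal (it cannot judge whether the scopes are well chosen), but it does catch a level that was added without saying what it is responsible for.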
Operator (L3)
Handles on-call for known services. Executes runbooks. Closes alerts. Asks for help on novel incidents. Output: incidents resolved cleanly within established procedures.
The Operator role's value. New SREs need a level where they can contribute meaningfully without being expected to design systems. Operator is that level. Healthy teams have 1-2 Operators at any time; they're learning by doing.
The growth path. Operators graduate to Owner when they can handle unknown incidents (where no runbook exists), drive postmortems, and start maintaining the runbooks they execute. Typical timeline: 1-2 years.
Owner (L4)
Owns one or more services end-to-end. Improves runbooks. Writes incident-response automation. Drives postmortems. Output: service reliability improving quarter over quarter.
The Owner level's signal. The services an Owner holds show improving reliability metrics over time: page volume drops, MTTR drops, and postmortems produce action items that ship. The Owner isn't just running the service; they're making it better.
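The "improving quarter over quarter" signal can be checked mechanically once the numbers are recorded. The following is a rough sketch under stated assumptions: `QuarterStats` and `is_improving` are hypothetical names, "improving" is taken to mean that both page volume and MTTR fall strictly in every consecutive quarter (a team may reasonably choose a looser rule), and the figures in the usage lines are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuarterStats:
    quarter: str          # label only, e.g. "Q1"
    pages: int            # pages received for the service in the quarter
    mttr_minutes: float   # mean time to restore, in minutes


def is_improving(history):
    """True if both page volume and MTTR fall in every consecutive quarter.

    `history` must be ordered oldest to newest and hold at least two quarters.
    """
    if len(history) < 2:
        raise ValueError("need at least two quarters to see a trend")
    pairs = zip(history, history[1:])
    return all(
        later.pages < earlier.pages and later.mttr_minutes < earlier.mttr_minutes
        for earlier, later in pairs
    )


# Illustrative numbers only, not real data.
history = [
    QuarterStats("Q1", 120, 48.0),
    QuarterStats("Q2", 90, 41.0),
    QuarterStats("Q3", 70, 35.5),
]
print(is_improving(history))  # True
```

The strict rule is deliberately harsh: one flat quarter fails it. That makes it a prompt for a conversation about what happened that quarter rather than a verdict on the engineer.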
The graduation. Owner graduates to Designer when they can architect new services that ship reliable from day one, mentor Operators, and contribute to org-wide reliability practice. Typical timeline: 2-3 years as Owner.
Designer (L5)
Designs new services with reliability built in. Reviews architectures across teams. Defines SLOs that hold. Output: new services that ship reliable on day one because the design caught the failure modes early.
The Designer's distinctive contribution. Architecture decisions that prevent classes of incidents. The Designer doesn't just respond to incidents; they design systems where the incident type can't occur. In many teams this is the most senior IC level.
The cross-team influence. Designers review architectures across teams; their SLO definitions become organisational standards. The job is no longer one service; it's the patterns that other engineers reuse. Output is leverage, not direct work.
Leader (L6)
Sets reliability strategy across an org. Mentors Operators and Owners. Influences hiring and ladder calibration. Output: organisation-wide reliability metrics improving and the next generation of SREs being grown internally.
The Leader's scope. Multiple services and teams. The org's incident response maturity. The hiring bar for SREs. Each is org-level; the Leader is responsible for the trend, not for any single incident.
The IC vs. management split. Leader can be IC track (still hands-on, but at architecture/strategy level) or management track (people leadership). Both are legitimate; the ladder accommodates both.
Principal (L7+)
Sets reliability practice for the company. Speaks for the discipline externally. Output: company is recognisable as one with mature reliability practice; alumni become senior SREs elsewhere.
The Principal's external influence. Conference talks, books, open-source contributions. The Principal's work shapes the industry. The company benefits from being associated with the Principal's external reputation.
The rarity. Most companies have 0-2 Principals. The level isn't a quota; it's earned through sustained external impact. Many strong Leaders never become Principals; that's fine.
Promotion evidence
Each level has an artefact that evidences it. Operators have records of clean on-call shifts and prompt ack times. Owners have service-level reliability metrics they own. Designers have architecture documents and review records. Leaders have org-wide reliability outcomes. Principals have published practice. The artefact is the evidence; opinions follow.
The discipline of artefacts. "Sara is at L5 because everyone agrees" is unreliable. "Sara is at L5 because she designed the new payments system that has had 0 SEV1 incidents since launch and her SLO definitions are used by 4 other teams" is concrete. Always require artefacts.
The collection during the year. Artefacts collect naturally if the engineer is doing the level's work. Promotion conversations gather them. Without continuous collection, promotion becomes a scramble; with it, the case writes itself.
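Continuous collection is easier when the expected artefacts per level are listed somewhere a script can read. The sketch below is one possible shape, assuming a simple mapping from level name to artefact types; `REQUIRED_ARTEFACTS` and `missing_artefacts` are hypothetical names, and the artefact labels paraphrase the list in this section.

```python
# Artefact types per level, paraphrased from the promotion-evidence list.
REQUIRED_ARTEFACTS = {
    "Operator": {"on-call shift records", "ack times"},
    "Owner": {"service reliability metrics"},
    "Designer": {"architecture documents", "review records"},
    "Leader": {"org-wide reliability outcomes"},
    "Principal": {"published practice"},
}


def missing_artefacts(target_level, collected):
    """Return the artefact types still missing for a promotion case, sorted."""
    required = REQUIRED_ARTEFACTS[target_level]
    return sorted(required - set(collected))


# An engineer aiming at Designer who has design docs but no review records yet.
print(missing_artefacts("Designer", ["architecture documents"]))
# ['review records']
```

Running a check like this a few times a year shows the gap early, which is the point of continuous collection: the missing item can be worked on long before the promotion conversation.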
Common antipatterns
The SWE ladder forced onto SREs. SREs are evaluated against software-engineering criteria, so they over-invest in code and under-invest in operational work, and promotion stalls for those who keep doing the operational work. Use a separate ladder.
Levels without scope. "L4 is more senior than L3" without specifying what additional scope L4 has. Levels become subjective. Always specify scope explicitly.
Promotion without ladder grounding. "Sara has been here 3 years, time to promote." Tenure isn't level. The promotion case needs to show the engineer already doing the next level's work; without that, levels become tenure-based and the bar drops over time.
The ladder that's published but not used. Ladder document exists in the wiki; promotion decisions are made in EM 1:1s without reference to the ladder. The published ladder is theatre. Use it; reference specific levels in promotion cases.
What to do this week
Three moves. (1) If your team uses the SWE ladder for SREs, draft a separate SRE ladder. The first version doesn't need to be perfect; iterate. (2) Document the artefacts that evidence each level. The list helps engineers know what to invest in. (3) For your most senior SRE, write down what level they're at and why. The exercise is calibration; doing it for one engineer reveals whether the ladder is precise enough.