Interviewing SREs Without Trivia
The classic SRE interview tests trivia: command flags, system internals, syscall behaviour. Most senior SREs forget those between incidents and look them up. Interview for what you actually need.
Why trivia is the wrong test
"Which signal number does SIGTERM correspond to?" An engineer who does not remember can find out in 10 seconds. An engineer who does remember might still freeze in an actual incident. Trivia rewards the wrong skill.
The mismatch. SRE interviews historically test memorisation: signal numbers, syscall behaviour, kernel internals, exact command flags. The skill these test is reference-memory. The skill SREs need is investigation under pressure. The two are different.
The selection bias. Trivia-heavy interviews select for engineers who memorise documentation. Real on-call work rewards engineers who form hypotheses, run experiments, and communicate clearly. Interviews biased toward memorisation produce hires who struggle with the actual job.
Four areas that actually matter
Incident reasoning, system design, observability strategy, communication. Test all four. Pass-bar is "this person would help on Day 1," not "this person knows everything."
The four areas cover the SRE skill space. Incident reasoning: how the candidate thinks under pressure. System design: how they architect for reliability. Observability: how they instrument for visibility. Communication: how they coordinate during and after incidents.
The discipline of "all four." Skipping one produces blind spots. A team that only tests system design ends up with hires who can architect but can't debug. A team that only tests incident reasoning ends up with strong responders who can't design new systems. Cover all four; weight by team need.
Incident reasoning
"Here are four facts about an unfolding incident. Talk me through your hypothesis tree." You are watching for: do they form hypotheses or jump to conclusions; do they consider what would falsify each one; do they track which hypotheses they have eliminated.
The setup. Present a realistic incident scenario: "Your service's p99 latency just jumped from 200ms to 4 seconds. CPU is at 60%, normal. Recent deploy was 6 hours ago. Database query latency hasn't changed." Watch how they approach it.
The signals. Do they ask clarifying questions before guessing? Do they suggest multiple hypotheses ("could be cache miss, could be connection pool exhaustion, could be a downstream dependency")? Do they propose how to test each hypothesis? Each is a sign of mature incident reasoning.
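The hypothesis discipline the interviewer is watching for can be sketched as data: each candidate cause paired with the observation that would falsify it, and a triage step that tracks what has been eliminated. The hypotheses and checks below are illustrative, not a canonical list:

```python
# Hypothesis tree for the latency-spike scenario: each candidate cause is
# paired with the check that would falsify it. All names and checks are
# illustrative, not from any real system.
hypotheses = {
    "cache miss storm": "cache hit-rate dashboard: rate unchanged -> eliminated",
    "connection pool exhaustion": "pool wait-time metric: no queueing -> eliminated",
    "slow downstream dependency": "dependency p99 latency: flat -> eliminated",
    "GC pauses / memory pressure": "GC pause histogram: no long pauses -> eliminated",
}

def triage(observations):
    """Return hypotheses not yet falsified.

    `observations` maps a hypothesis name to True when its falsifying
    check came back clean (i.e. the hypothesis is eliminated).
    """
    return [h for h in hypotheses if not observations.get(h, False)]

# After checking the cache and the downstream dependency, two remain live.
remaining = triage({"cache miss storm": True, "slow downstream dependency": True})
print(remaining)
```

The point of the structure is the elimination bookkeeping: a candidate who mentally maintains something like this never re-investigates a falsified hypothesis mid-incident.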
System design
"Design a service that does X with these reliability targets." Standard. The SRE-flavoured version emphasises capacity, failure modes, and operational interfaces, not just functional decomposition.
The differentiation. A software-engineer system design focuses on data flow and APIs. The SRE version asks: what's the failure mode if the database is down? What's the operational interface — how do you upgrade this without downtime? What's the capacity story — how does it scale to 10x traffic? Those questions are what SRE work is actually about.
The depth probe. After the candidate describes their design, ask: "your service handles 1k QPS at 50ms latency; we're going to 100k QPS. What changes?" The question reveals whether they understand horizontal scaling, whether they think about bottlenecks, and whether they consider operational complexity.
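There is concrete arithmetic behind the probe. Little's law (average concurrency = arrival rate × time in system) turns the question's numbers into in-flight request counts; the per-instance capacity below is an assumed figure for illustration, not part of the question:

```python
import math

def concurrent_requests(qps: float, latency_ms: float) -> float:
    """Little's law: average concurrency L = arrival rate λ × time in system W."""
    return qps * latency_ms / 1000

# At 1k QPS and 50 ms, the service holds ~50 requests in flight;
# at 100k QPS with unchanged latency, ~5,000.
print(concurrent_requests(1_000, 50))    # 50.0
print(concurrent_requests(100_000, 50))  # 5000.0

# If one instance comfortably handles ~100 concurrent requests (an assumed
# figure), the 100x traffic jump needs on the order of 50 instances.
instances = math.ceil(concurrent_requests(100_000, 50) / 100)
print(instances)  # 50
```

A candidate who can do this back-of-envelope step has a head start on the real answer, which also covers connection limits, load-balancer fan-out, and what breaks first.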
Observability strategy
"You inherit a service with no observability. What do you add first, second, third?" Watching for: do they reach for metrics, logs, and traces in a sensible order; do they prioritise user-visible signals; do they reason from SLOs and SLIs down to raw metrics rather than the other way round.
The right answer's shape. First: SLIs and SLOs at the user-facing boundary (latency, availability, errors). Second: metrics on the critical paths inside the service. Third: structured logs at boundaries (incoming requests, outgoing calls). Fourth: traces if multi-service. Most candidates jump straight to "add Prometheus" without articulating why.
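One way to show why the SLO comes first: an SLO implies an error budget, and that budget is the number the rest of the instrumentation exists to defend. A minimal sketch, assuming a 30-day availability SLO (the 99.9% target is an example, not a recommendation):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full unavailability allowed per window at a given availability SLO."""
    return (1 - slo) * window_days * 24 * 60

# A 99.9% 30-day SLO allows roughly 43 minutes of total downtime.
print(round(error_budget_minutes(0.999), 1))  # 43.2
```

Everything added afterwards — critical-path metrics, boundary logs, traces — is there to explain where those 43 minutes went.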
The senior signal. Senior candidates discuss cardinality before diving into metric collection. They mention costs before recommending tools. They distinguish leading indicators (saturation) from lagging (errors). Each is a sign of someone who has actually run observability at scale.
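The cardinality point is easy to make concrete: one metric produces a number of time series equal to the product of its label cardinalities, so a single high-cardinality label dominates the cost. The label sizes below are assumed for illustration:

```python
import math

def series_count(label_cardinalities):
    """Time series emitted by one metric = product of its label cardinalities."""
    return math.prod(label_cardinalities.values())

# A request metric with modest labels stays cheap...
print(series_count({"method": 5, "status": 8, "region": 4}))  # 160

# ...until someone adds user_id, and the same metric explodes.
print(series_count({"method": 5, "status": 8, "region": 4,
                    "user_id": 100_000}))  # 16000000
```

This is why senior candidates raise cardinality before tool choice: the label schema, not the collector, determines whether the system is affordable.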
Communication
"Walk me through your worst incident, as if I am the VP of engineering." Watching for: timeline clarity, blameless framing, action-item ownership, calibration.
The signals. Can they describe the timeline cleanly? Do they use blameless language ("the deploy pipeline let through invalid config" vs "Sara pushed bad config")? Do they own the team's action items? Do they calibrate their description (e.g., "this affected ~10% of users for 30 minutes" rather than "it was a major outage")?
The story-telling vs. analysis distinction. Some candidates tell the incident as a story; others as an analysis. Both are fine; SRE work needs both. Probe for the analysis side: "what did you learn about the system?" — the answer reveals whether they extract structural insights from incidents.
The grading trap
Senior interviewers grade against themselves. They ask the question they would have aced and reject anyone who would not have aced it. Calibrate against the bar, not against your own background.
The pathology. A senior engineer with 15 years of experience interviews a candidate with 5 and rejects them because the candidate doesn't know what the senior knows. The bar rises every year as the senior engineers grow, until the team can't hire because nobody meets it.
The corrective. Bar is "this person would help on day 1," not "this person knows what I know." Calibrate by asking: would this candidate have been a fine hire 3 years ago? If yes, they meet today's bar.
Common antipatterns
Whiteboard syscall trivia. Tests memorisation; selects for the wrong skill. Replace with reasoning questions.
Single-interview hire/no-hire decisions. One bad interview kills a strong candidate. Use multiple interviewers across the four areas; require consensus.
The "culture fit" rejection. Vague rejection that's often disguised bias. Replace with specific concerns ("communication wasn't clear in the incident exercise"); concerns can be probed by another interviewer.
The "we don't have time to interview properly" shortcut. Team's hiring bar drops; weak hires create future on-call burden. Always invest the interview time; the hiring decision compounds for years.
What to do this week
Three moves. (1) Review your interview questions against the four areas. Most teams find they over-test one area and under-test another. (2) Replace any pure-trivia questions with reasoning questions. The shift in question style changes the candidate signal dramatically. (3) Calibrate the bar with peer teams. Compare hire/no-hire decisions; look for systematic bias (e.g., always rejecting on system design); adjust.