By Samson Tanimawo, PhD. Published Oct 14, 2025.

Embedded SRE vs Platform SRE: Which Org Shape Wins?

Embedded SREs sit in product teams. Platform SREs build shared reliability services. Each model produces a different culture, and most companies need both, in proportions that vary by company size.

Two models

Embedded SREs report into product teams, work next to product engineers, and own the reliability of that team's services. Platform SREs report into a central infra org, build shared reliability tooling, and serve product teams as customers. Both work; both fail; both fail differently.

The structural difference. Embedded: deep team-specific context, weak cross-team leverage. Platform: high cross-team leverage, shallow team-specific context. The trade-off is real; choosing requires understanding the org's needs.

The hybrid most teams converge on. After 3-5 years, most growing companies end up with both: a small platform team for shared infrastructure plus embedded SREs (or reliability-fluent product engineers) in product teams. The combination produces both leverage and context.

Embedded SRE

Closer to the product. Faster to ship product-specific reliability work. Becomes a force multiplier for one team. Knows the team's services intimately.

The embedded SRE's strengths. Deep service knowledge: they know the failure modes, the dependencies, the customer impact patterns. Fast feedback loop with product engineers: same team, same standup, same priorities. Aligned incentives: the SRE's success is tied to the product team's reliability.

The embedded SRE's structural risk. Isolated from peers. The only SRE on the team has nobody to learn from. Each embedded SRE invents best practices separately, so convergence is slow. Hiring SREs is harder when each is one-of-one.

Platform SRE

Reusable. Builds tools every team uses (observability, on-call infrastructure, deploy systems). Runs on leverage maths: one engineer's work helps 50 product engineers.

The platform SRE's strengths. Build once, deploy everywhere: the leverage is what makes platform SRE economical. Best practices baked into tools: when the deploy system enforces canary rollout, every team gets canary rollout. Shared on-call expertise: platform SREs see incidents across all services, learning patterns no single embedded SRE would.

The platform SRE's structural risk. Distance from product context. Tools work in theory but don't fit product teams' specific workflows. The team becomes a help-desk for its own tools rather than an investment in product reliability.

How embedded fails

Each embedded SRE rebuilds the same wheel. Best practices stay siloed. The on-call culture diverges by team. Hiring SREs becomes harder because the role is fragmented.

The wheel-rebuilding pattern. Team A's embedded SRE builds a custom alerting setup. Team B's builds another. Six months later, two teams have different alerting platforms; cross-team incidents can't share runbooks; new SRE hires have to learn a different setup per team.

The fragmentation cost. SRE hiring becomes per-team rather than per-company. Each team needs to grow its own SRE bench; transfer between teams requires retraining; the SRE community across the company is diffuse. The talent investment is harder to amortise.

How platform fails

Distance from the product. The platform team builds elegant tools nobody uses because they did not solve the product team's actual problem. Becomes a help-desk for the things they did build.

The "tools nobody uses" pattern. Platform team identifies a problem ("teams should be using observability"); builds a tool; ships it; teams keep using their old tool because the new one doesn't fit their workflow. The platform team doubles down on the tool's quality; teams still don't use it. Two quarters of work; minimal adoption.

The help-desk pattern. Platform team's job becomes answering questions about the tools they built. Engineering work decreases; support work increases. Senior platform engineers leave because the work isn't growing them.

The hybrid

One platform team building shared reliability tooling. Embedded SREs (or "reliability-fluent product engineers") sitting in product teams to apply it. The platform sets the bar; the embedded engineers meet it on their own services. Most mature orgs converge here.

The hybrid's leverage. The platform team's tools provide the foundation; embedded engineers adapt them to per-team needs. Each layer does what it's best at; the combination produces both leverage and context.

The hybrid's discipline. Clear interfaces between platform and embedded. Platform owns: observability infrastructure, deploy systems, on-call tooling. Embedded owns: per-service runbooks, SLOs, capacity planning. Each respects the other's scope; without clear interfaces, conflict over ownership produces friction.
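One way to make that interface explicit is to write the ownership split down as data and check it for overlaps. The sketch below is illustrative only: `OWNERSHIP`, `owner_of`, and `overlapping_areas` are hypothetical names, and the area strings simply restate the split described above.

```python
# Hypothetical ownership table restating the platform/embedded split above.
OWNERSHIP: dict[str, set[str]] = {
    "platform": {"observability infrastructure", "deploy systems", "on-call tooling"},
    "embedded": {"per-service runbooks", "SLOs", "capacity planning"},
}


def owner_of(area: str, ownership: dict[str, set[str]] = OWNERSHIP) -> str:
    """Return the single side that owns `area`; raise if unowned or contested."""
    owners = sorted(side for side, areas in ownership.items() if area in areas)
    if len(owners) != 1:
        raise LookupError(f"{area!r} has {len(owners)} owners: {owners}")
    return owners[0]


def overlapping_areas(ownership: dict[str, set[str]] = OWNERSHIP) -> set[str]:
    """Return areas claimed by more than one side (likely sources of friction)."""
    seen: set[str] = set()
    contested: set[str] = set()
    for areas in ownership.values():
        contested |= seen & areas  # anything already claimed by an earlier side
        seen |= areas
    return contested
```

An empty result from `overlapping_areas()` means no area has two owners. An area that raises in `owner_of` has no owner or more than one, and is worth settling before an incident forces the question.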

By company size

Up to 50 engineers: everyone does SRE work. Embedded by default. 50-200: a small platform team makes sense, with embedded SREs in the largest product teams. 200+: the hybrid model becomes natural; the platform team scales with infra complexity, the embedded team scales with product surface.
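The size bands above can be sketched as a small lookup. This is a rough heuristic rather than a rule: the function name and return labels are hypothetical, and because the bands overlap at their edges, the sketch assumes that exactly 50 engineers falls in the first band and exactly 200 in the hybrid band.

```python
def recommended_sre_shape(engineer_count: int) -> str:
    """Map engineering headcount to the org shape suggested by the size bands.

    Boundary handling is an assumption: 50 counts as "up to 50",
    and 200 counts as "200+".
    """
    if engineer_count < 1:
        raise ValueError("engineer_count must be at least 1")
    if engineer_count <= 50:
        return "embedded by default"
    if engineer_count < 200:
        return "small platform team plus embedded SREs in the largest teams"
    return "hybrid"


for size in (30, 120, 400):
    print(size, "->", recommended_sre_shape(size))
```

Treat the thresholds as starting points; infra complexity and product surface matter as much as headcount.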

The transitions. Each transition is painful and worth doing on purpose. 50 engineers: hire the first dedicated SRE; embed them in the highest-risk team. 100 engineers: hire 2-3 more; some embedded, plus a platform engineer or two. 200 engineers: split formally into platform and embedded; build the interface between them.

The "we don't need SREs yet" delay. Companies skip the first SRE hire to save budget. By the time they hire one, on-call is broken, runbooks don't exist, and the new SRE spends 6 months on remediation work. Hire earlier than feels necessary.

Common antipatterns

Pure platform with no embedded influence. Platform team builds tools; product teams ignore them; platform team's work isn't impactful. Need at least informal embedded relationships.

Pure embedded at scale. 30 embedded SREs all building their own version of similar tools. The duplication is expensive and the divergence is risky. Need a small platform team for the shared work.

The platform team that does everything. Platform takes on per-team work because the embedded model wasn't established. The team gets overloaded and produces neither a great platform nor great team-specific work. Maintain the boundary.

SRE hires not yet justified. Engineering team is 30 people, no dedicated SRE. Founders argue "everyone's a generalist." Reality: nobody owns reliability; on-call is hated; reliability work loses to feature work. Hire your first SRE before you think you need one.

What to do this week

Three moves. (1) Identify your current model honestly: pure embedded, pure platform, or hybrid. Most teams are accidentally one or the other; choosing is the move. (2) For the model you're in, identify the single biggest gap: missing platform tooling, or missing embedded influence in a critical product team. (3) Make a small concrete move toward the gap: hire one engineer for the side that's missing, or formalise an interface between platform and embedded teams.