The Multi-Agent OS for SRE & DevOps

DevOps: Principles, Practices, and the 2026 State of Play

DevOps is a culture and a set of practices that unify development and operations so teams ship better software faster and more reliably. This is the foundational guide: what DevOps actually is, the CALMS principles and the Three Ways, the lifecycle, the core practices, the four DORA metrics, how it relates to SRE and platform engineering, where AI now fits, and a 90-day plan to adopt or level up.

18 min read · Published May 2026 · By Dr. Samson Tanimawo, Nova AI Ops
[Figure: DevOps lifecycle and culture diagram showing development and operations unified across plan, code, build, test, release, deploy, operate, and monitor]

What DevOps actually is

DevOps is a culture and a set of practices that unify software development and IT operations so a team can ship better software faster and more reliably. It removes the old wall between the people who write the code and the people who keep it running in production, replacing slow handoffs with shared ownership, automation, and fast feedback. The most important thing to understand up front is that DevOps is a way of working before it is any particular tool. You can buy every tool on the market and still not be doing DevOps, and a small team with shell scripts and discipline can be doing it well.

To see why it exists, picture the world it replaced. For decades, development and operations were separate departments with opposing incentives. Developers were rewarded for shipping change. Operators were rewarded for stability, which mostly meant resisting change. Code was thrown over a wall from one team to the other. When something broke in production, the two teams blamed each other across that wall: "it worked on my machine" met "your deploy took down the site." Releases were big, rare, and terrifying, because rare releases bundle huge batches of risky change into one event. DevOps emerged around 2009 as a direct reaction to that dysfunction, and the core move was to tear down the wall and make one team responsible for the whole life of a service.

That single structural change cascades into everything else. If the same team builds and runs a service, it has every incentive to make running it painless, which means automating the deploy, instrumenting the system so problems are visible, and writing code that fails gracefully. The famous phrase "you build it, you run it" captures the whole philosophy in six words: the team that ships the software also carries the pager for it. When the people writing the code are the same people woken up at 3 a.m. when it breaks, software quality and operability stop being someone else's problem.

It is worth being precise about what DevOps is not. It is not a job title, even though the industry now hires people called DevOps engineers. It is not a single tool or vendor. It is not just CI/CD, though continuous delivery is its engine. And it is not the same thing as automation, although automation is one of its pillars. DevOps is the operating model that ties culture, automation, measurement, and sharing into a system that continuously and safely delivers value.

The core principles: CALMS and the Three Ways

Two mental models capture the principles of DevOps better than any tool list. The first, CALMS, is a maturity framework for assessing whether a team is actually doing DevOps or just adopting its tools. The second, the Three Ways from The Phoenix Project and The DevOps Handbook, describes the flow of value, feedback, and learning that the practices are designed to produce.

CALMS: the five dimensions

C: Culture

Shared ownership over silos. Blameless attitudes when things break. Collaboration between dev, ops, security, and the business. Culture is first because it is both the hardest dimension to change and the one that makes the rest possible. A team strong on tooling but weak on culture is not doing DevOps.

A: Automation

Remove manual toil from the path to production. Automate builds, tests, infrastructure provisioning, and deployments so that shipping is a non-event. Automation is what makes small, frequent, low-risk releases possible instead of large, rare, scary ones.

L: Lean

Work in small batches, limit work in progress, and optimize the whole end-to-end flow rather than any one stage. Borrowed from lean manufacturing, this is the principle that says a small change shipped today beats a big change shipped next quarter.

M: Measurement

Instrument everything and make decisions from data, not opinion. Track lead time, deployment frequency, failure rate, and recovery time (the DORA metrics, covered below). You cannot improve what you do not measure, and measurement is what keeps a DevOps transformation honest.

S: Sharing

Open knowledge, blameless learning, and visible information across teams. Share runbooks, dashboards, postmortems, and on-call wisdom. Sharing is the dimension that turns a single incident into an organization-wide lesson instead of a private scar.

The Three Ways: flow, feedback, and continual learning

The Three Ways are the underlying physics that the practices serve. The First Way is flow: optimize the movement of work from development through operations to the customer. Make the path to production fast, visible, and one-directional, with small batches and no large queues piling up between stages. CI/CD pipelines are the First Way made concrete.

The Second Way is feedback: create fast, constant feedback loops running right to left, from operations back to development. The sooner a developer learns that a change caused a problem, the cheaper it is to fix. Monitoring, observability, automated tests, and alerting are all about shortening that feedback loop so problems are caught in minutes, not in next quarter's incident review.

The Third Way is continual learning and experimentation: build a culture that rewards experiments, treats failure as a source of learning rather than blame, and continuously improves daily work. Blameless postmortems, error budgets that permit calculated risk, and dedicated time to pay down technical debt all live here. The Third Way is what keeps the loop tightening over years instead of plateauing after the first wins.

The honest caveat. Most failed DevOps transformations buy the tools and skip the culture. They install a CI server, hire a "DevOps team," and wonder why releases are still slow and on-call is still miserable. The culture and sharing dimensions of CALMS are unglamorous and slow to change, but they are where the leverage actually is. Tools without shared ownership just automate a broken handoff.

The DevOps lifecycle: the infinite loop

The DevOps lifecycle is conventionally drawn as an infinite loop, a sideways figure-eight, with eight phases. The shape is the point: there is no finish line, only a loop that feeds back into itself. The left lobe is the development half, the right lobe is the operations half, and they meet in the middle where new code crosses from being built to being run. Here is what each phase means.

Phase | What it means | Typical work
Plan | Decide what to build and define the work | Backlog, requirements, design, sprint planning
Code | Write the software and store it in version control | Development, peer review, branching, commits
Build | Compile and package the code into an artifact | Compilation, dependency resolution, containerization
Test | Verify the artifact behaves correctly and safely | Unit, integration, security, and performance tests
Release | Approve and stage a validated build for production | Release gating, versioning, change approval
Deploy | Roll the release out to production environments | Blue/green, canary, rolling deploys, feature flags
Operate | Keep the running service healthy in production | Scaling, incident response, on-call, remediation
Monitor | Observe the system and feed insight back to planning | Metrics, logs, traces, alerting, dashboards

The development half (plan, code, build, test, release, deploy) is where most teams start their DevOps journey, because the path to production is the most visible source of pain. But the right half of the loop, operate and monitor, is where the value of a service is actually realized and where DevOps is hardest to sustain. Monitoring does not just watch the system; it closes the loop by feeding what production teaches you straight back into the next plan. A regression spotted in monitoring becomes a backlog item; a recurring incident becomes an architecture decision. When the loop is healthy, you are never not improving.

One reason teams stall is that they automate the left half beautifully and then run the right half by hand, paging humans for every blip. That asymmetry is exactly where modern AI and agentic operations have the most to offer, which we return to in the 2026 frontier section below.

Core practices: CI/CD, IaC, testing, and more

If CALMS and the Three Ways are the principles, the following are the concrete practices that put them into effect. Each one either shortens a feedback loop or removes a manual handoff. None of them is sufficient alone; DevOps is the system they form together.

Version control for everything

Not just application code, but infrastructure definitions, pipeline configuration, database migrations, and documentation all live in version control. Version control is the substrate the rest of DevOps is built on: it gives you a single source of truth, an audit trail, the ability to review changes before they ship, and a reliable way to roll back when something goes wrong.

Continuous integration (CI)

Developers merge their work into a shared mainline frequently, and every merge automatically triggers a build and a test run. CI catches integration problems within minutes of the change that caused them, when they are cheap to fix, instead of letting them accumulate into a painful "merge week" before a release. The discipline of keeping the mainline always green is the foundation everything else depends on.
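
The fail-fast behavior of a CI run can be sketched in a few lines. This is a minimal illustration, not how any particular CI system works: real pipelines are defined in each tool's own configuration format, and the names `Stage`, `run_pipeline`, and the stage lists here are hypothetical stand-ins for real build and test commands.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    name: str
    run: Callable[[], bool]  # returns True when the stage passes


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failure (fail fast)."""
    executed: List[str] = []
    for stage in stages:
        executed.append(stage.name)
        if not stage.run():
            return False, executed  # mainline is red; later stages never run
    return True, executed


# Hypothetical stages: lambdas stand in for real build and test commands.
green = [Stage("build", lambda: True), Stage("unit-tests", lambda: True)]
red = [
    Stage("build", lambda: True),
    Stage("unit-tests", lambda: False),
    Stage("integration-tests", lambda: True),
]

print(run_pipeline(green))  # (True, ['build', 'unit-tests'])
print(run_pipeline(red))    # (False, ['build', 'unit-tests'])
```

Stopping at the first red stage is what keeps the feedback loop short: the developer hears about the broken unit test without waiting for the slower stages behind it.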

Continuous delivery and deployment (CD)

Continuous delivery means every change that passes the pipeline is automatically built, tested, and made ready to release at the push of a button. Continuous deployment goes one step further and ships every passing change to production automatically, with no human gate. CI/CD is the engine of DevOps: it turns the path to production into an automated, repeatable, low-risk pipeline, which is what makes small, frequent releases safe. The automation surface around this pipeline is deep enough to deserve its own treatment, which we cover in the dedicated guide to DevOps automation.
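
The difference between the two flavors of CD comes down to a single gate. The sketch below assumes a hypothetical `should_release` function and `Mode` enum, invented for illustration; it shows only the decision logic, not a real deployment tool.

```python
from enum import Enum


class Mode(Enum):
    CONTINUOUS_DELIVERY = "delivery"      # a human pushes the button
    CONTINUOUS_DEPLOYMENT = "deployment"  # no human gate


def should_release(pipeline_passed: bool, mode: Mode, human_approved: bool = False) -> bool:
    """Decide whether a change goes to production under the given mode."""
    if not pipeline_passed:
        return False  # a failing pipeline never ships in either mode
    if mode is Mode.CONTINUOUS_DEPLOYMENT:
        return True   # every passing change ships automatically
    return human_approved  # continuous delivery: ready, but waits for approval


print(should_release(True, Mode.CONTINUOUS_DELIVERY))    # False
print(should_release(True, Mode.CONTINUOUS_DEPLOYMENT))  # True
```

In both modes the pipeline result is the precondition; the only thing that changes is whether a person makes the final call.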

Infrastructure as code (IaC)

Define servers, networks, clusters, and cloud resources in declarative code rather than configuring them by hand through a console. IaC makes infrastructure versionable, reviewable, and reproducible: you can stand up an identical environment on demand, recover from a disaster by replaying the code, and eliminate the configuration drift that causes "works in staging, breaks in prod" mysteries. Tools like Terraform, Pulumi, and the cloud-native equivalents are the common implementations.
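
The core idea behind declarative infrastructure is a comparison between desired state (what the code says) and actual state (what exists). The sketch below is a deliberately simplified illustration of that comparison, not how Terraform or Pulumi are implemented; the `plan` function and the resource dictionaries are hypothetical.

```python
from typing import Dict, List, Tuple

Resource = Dict[str, object]


def plan(desired: Dict[str, Resource], actual: Dict[str, Resource]) -> List[Tuple[str, str]]:
    """Return (action, resource_name) pairs that would make actual match desired."""
    actions: List[Tuple[str, str]] = []
    for name in sorted(desired):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != desired[name]:
            actions.append(("update", name))  # configuration drift
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))  # exists but is not in code
    return actions


# Hypothetical environment: the code wants a web tier and a database.
desired = {"web": {"size": "small", "count": 3}, "db": {"size": "large", "count": 1}}
# What is actually running: web has drifted, db is missing, cache is unmanaged.
actual = {"web": {"size": "small", "count": 2}, "cache": {"size": "small", "count": 1}}

print(plan(desired, actual))
# [('create', 'db'), ('update', 'web'), ('delete', 'cache')]
```

Because the plan is computed from code in version control, it can be reviewed like any other change before anything is applied.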

Automated testing

A layered suite of unit, integration, security, and performance tests runs automatically in the pipeline so that quality is verified continuously rather than in a manual gate at the end. Good test automation is what gives a team the confidence to deploy many times a day; without it, every release is a leap of faith and the pipeline grinds to a halt behind manual QA.

Continuous monitoring and observability

Instrument production with metrics, logs, and traces so the team can see what the system is actually doing and detect problems fast. This is the Second Way (feedback) in practice, and it is the bridge between the operate phase and the next plan. For the deeper treatment of how to instrument a system so it can be understood from the outside, see the guide to observability, and for the discipline of watching the right signals, see monitoring and the four golden signals.
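
As a rough illustration of alerting on signals rather than noise, the sketch below evaluates two of the golden signals (errors and latency) over one window of requests. The function names, the thresholds, and the nearest-rank percentile are all assumptions chosen for the example, not recommended values.

```python
import math
from typing import List, Tuple


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def should_alert(
    requests: List[Tuple[float, bool]],
    max_error_rate: float = 0.01,   # assumed threshold: 1% of requests failing
    max_p95_ms: float = 500.0,      # assumed threshold: p95 latency in ms
) -> bool:
    """requests holds (latency_ms, is_error) pairs from one evaluation window."""
    if not requests:
        return False  # no traffic in the window, nothing to judge
    error_rate = sum(1 for _, is_error in requests if is_error) / len(requests)
    latency_p95 = p95([latency for latency, _ in requests])
    return error_rate > max_error_rate or latency_p95 > max_p95_ms


healthy = [(100.0, False)] * 100
failing = [(100.0, False)] * 95 + [(100.0, True)] * 5
slow = [(100.0, False)] * 90 + [(900.0, False)] * 10

print(should_alert(healthy), should_alert(failing), should_alert(slow))
# False True True
```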

Blameless postmortems

When an incident happens, the team writes up what occurred, why, and what will change, focusing on the systemic causes rather than on whom to blame. Blameless postmortems are the Third Way (learning) in practice: they turn a single failure into a durable organizational improvement and make people safe to surface problems early. The full discipline is covered in the guide to blameless postmortems.

Measuring DevOps: the four DORA metrics

For years, DevOps maturity was argued about with anecdotes. Then the DORA research program (DevOps Research and Assessment, which surveyed tens of thousands of engineers over the better part of a decade) found that four metrics reliably distinguish high-performing software teams from low-performing ones. These four are now the industry-standard way to measure whether DevOps is working. They split into two pairs: two measure throughput, two measure stability, and the central finding is that elite teams are strong on both at once. Speed and reliability turn out not to be a trade-off.

Metric | What it measures | Elite vs low performers
Deployment frequency | How often you ship to production | Elite: on demand, multiple times a day. Low: between once a month and once every six months.
Lead time for changes | Time from commit to running in production | Elite: less than a day. Low: between one and six months.
Change failure rate | Share of deployments that cause a failure | Elite: 0 to 15%. Low: substantially higher and harder to predict.
Time to restore service | How fast you recover from a failure | Elite: less than an hour. Low: between a week and a month.

The first two metrics, deployment frequency and lead time for changes, measure throughput: how fast value moves from idea to customer. The last two, change failure rate and time to restore service, measure stability: how often you break production and how quickly you recover when you do. Time to restore service is closely related to mean time to recovery; the practice of driving it down is covered in the guide to MTTR.

The counterintuitive lesson from the DORA data is that throughput and stability rise together. Teams that deploy more frequently tend to have lower failure rates and faster recovery, not higher and slower, because frequent small deployments are inherently less risky than rare large ones and because the discipline required to ship often (good tests, good automation, good observability) is the same discipline that makes recovery fast. If your team is trading speed against stability, that is usually a sign the underlying practices are weak, not that the trade-off is real.

A practical warning: DORA metrics are powerful precisely because they measure outcomes, not activity. Resist the urge to pad them with vanity metrics like lines of code or number of commits, which measure motion rather than value. The four DORA metrics, watched honestly over time, are enough to tell you whether your DevOps practice is actually improving.
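
All four metrics can be derived from a plain log of deployments. The sketch below assumes a hypothetical `Deployment` record with commit, deploy, and restore timestamps; summarizing with the median is a choice made for this example, and real teams would pull these fields from their own pipeline and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set once a failed deploy is recovered


def dora_metrics(deploys: List[Deployment], window_days: int) -> dict:
    """Compute the four DORA metrics for deployments inside one time window."""
    if not deploys:
        raise ValueError("need at least one deployment")
    failures = [d for d in deploys if d.failed]
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deploys),
        "change_failure_rate": len(failures) / len(deploys),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }


# Hypothetical four-day window: one deploy per day, one of which failed.
t0 = datetime(2026, 5, 1, 9, 0)
day, hour = timedelta(days=1), timedelta(hours=1)
deploys = [
    Deployment(t0, t0 + 2 * hour),
    Deployment(t0 + day, t0 + day + 4 * hour),
    Deployment(t0 + 2 * day, t0 + 2 * day + 6 * hour, failed=True,
               restored_at=t0 + 2 * day + 6 * hour + timedelta(minutes=30)),
    Deployment(t0 + 3 * day, t0 + 3 * day + 8 * hour),
]

metrics = dora_metrics(deploys, window_days=4)
print(metrics["deploys_per_day"], metrics["change_failure_rate"])  # 1.0 0.25
```

Note that every input is an outcome (something shipped, something broke, something recovered); nothing in the record counts commits or lines of code.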

DevOps vs SRE vs platform engineering

Three terms get tangled together constantly: DevOps, SRE, and platform engineering. They are related but distinct, and the cleanest way to understand them is by their scope and their level of prescription.

DevOps is the philosophy

DevOps tells you what to aim for (shared ownership, automation, fast feedback, continuous learning) but deliberately leaves the implementation open. It is a culture and a direction, not a recipe. That openness is its strength (it adapts to any team) and its weakness (two teams can both claim to do DevOps and look nothing alike).

SRE is one concrete implementation of DevOps

Site reliability engineering is the discipline Google formalized to run large-scale systems, and it is best understood as a specific, prescriptive way to implement DevOps. Where DevOps says "create fast feedback and balance speed with stability," SRE says exactly how: define service level objectives, spend an explicit error budget that quantifies how much unreliability you can tolerate, cap the time engineers spend on toil, and treat operations as a software engineering problem to be automated away. The often-quoted formulation is "class SRE implements interface DevOps": SRE is a concrete class that fulfills the DevOps contract. If you want the full treatment, see the guide to site reliability engineering, and for the AI-native evolution of the role, the guide to AI SRE.
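
The error budget is simple arithmetic once the SLO is fixed. As a minimal sketch, assuming an availability SLO measured in minutes over a rolling window (the function names are hypothetical):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


# A 99.9% SLO over 30 days leaves 0.1% of 43,200 minutes.
print(round(error_budget_minutes(0.999), 1))    # 43.2
# After 21.6 minutes of downtime, half the budget is left.
print(round(budget_remaining(0.999, 21.6), 2))  # 0.5
```

While budget remains, the team can spend it on risky changes; once it is gone, the usual SRE response is to slow releases and prioritize reliability work.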

Platform engineering is how DevOps scales

Platform engineering is a more recent response to a failure mode of DevOps at scale. When "you build it, you run it" is applied across dozens of product teams, each one ends up reinventing the same pipelines, the same infrastructure modules, and the same monitoring setup, and the cognitive load on every developer becomes crushing. Platform engineering solves this by building an internal developer platform: a paved-road, self-service layer of tooling that gives product teams DevOps outcomes (fast safe deploys, good observability, reproducible infrastructure) without each team having to assemble it from scratch. It does not replace DevOps; it is how large organizations make DevOps sustainable. The platform team applies DevOps to the building of the platform itself.

The one-line summary. DevOps is the goal, SRE is a proven way to reach it, and platform engineering is how you reach it across many teams at once without burning everyone out. They are layers of the same idea, not competitors. A mature organization usually has all three operating together.

The 2026 frontier: AI and agentic DevOps

For most of its history, DevOps automation stopped at the deploy. The pipeline could build, test, and ship code without a human touching it, but the right half of the lifecycle loop, operate and monitor, still ran on human attention. Someone had to read the alert, open the dashboards, form a hypothesis, and execute the fix. That asymmetry is the single biggest reason DevOps does not scale cleanly: the development half automates, the operations half hires.

By 2026, AI sits across the whole lifecycle. In the development half it suggests code as you write, generates tests, reviews pull requests, and triages pipeline failures. But the larger shift is in the operations half, where agentic systems now do the work that used to require a human at the keyboard. AI-aware detection understands context instead of just statistical outliers, so it pages on a genuine 3 a.m. anomaly but stays quiet on an expected post-deploy spike. AI diagnosis reads the same logs, metrics, traces, and recent deploys an engineer would, in parallel, in seconds, and produces a ranked set of likely causes with the evidence for each. And AI remediation executes the fix within a policy envelope the team defines in advance, so routine pages close themselves and only genuine escalations reach a person.
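
A policy envelope can be pictured as an allowlist plus a limit on blast radius. The sketch below is purely illustrative and is not the API of Nova or any other product; `PolicyEnvelope`, `decide`, and the action names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class PolicyEnvelope:
    allowed_actions: FrozenSet[str] = frozenset()
    max_blast_radius: int = 1  # most hosts a single automated action may touch


def decide(action: str, hosts_affected: int, policy: PolicyEnvelope) -> str:
    """Auto-remediate only when the action stays inside the envelope."""
    if action in policy.allowed_actions and hosts_affected <= policy.max_blast_radius:
        return "auto-remediate"
    return "escalate-to-human"


# Hypothetical policy the team agreed on in advance.
policy = PolicyEnvelope(
    allowed_actions=frozenset({"restart-service", "rollback-deploy"}),
    max_blast_radius=5,
)

print(decide("restart-service", 1, policy))   # auto-remediate
print(decide("drop-database", 1, policy))     # escalate-to-human
print(decide("restart-service", 50, policy))  # escalate-to-human
```

The default is escalation: anything the team did not explicitly allow in advance still reaches a person.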

This is where Nova AI Ops fits in the DevOps picture: it is the autonomous operations layer that closes the lifecycle loop. Nova detects, diagnoses, and remediates incidents across AWS, GCP, Azure, Linux, and Windows, all within a policy envelope and an immutable audit ledger, so the operate and monitor half of DevOps finally scales without linear headcount. The point is not to remove humans from operations; it is to let the same team run far more of the loop. Engineers move up the stack to policy, architecture, and the novel failures that genuinely need judgment, while the agents handle the routine, repeatable work that used to define on-call misery. In CALMS terms, this is the automation pillar finally reaching the right half of the lifecycle. In Three Ways terms, it is the feedback loop tightened to seconds and the operations queue drained continuously rather than by a paged human.

The practical effect on DORA metrics is direct: time to restore service drops because the diagnosis-and-fix loop runs at machine speed, change failure rate drops because regressions are caught and reverted automatically, and the team's appetite for higher deployment frequency rises because the operations cost of each deploy is no longer linear. Agentic DevOps does not change the principles of DevOps; it removes the headcount ceiling that used to cap how far they could take you.

A 90-day plan and a 10-point maturity checklist

DevOps has no end state, but a focused team can build real, measurable momentum in about 90 days. The sequence below front-loads culture and version control (the foundations everything else depends on), then layers the pipeline, then the operations half. Skipping ahead, for example automating deploys before the team shares ownership, tends to automate a broken process faster rather than fixing it.

Days 1-30: Culture and version control foundations

Start with the unglamorous foundations. Get the whole team, dev and ops together, agreeing that they share ownership of the service end to end. Move everything into version control: application code, infrastructure definitions, pipeline config, and runbooks. Establish a trunk-based or short-lived-branch workflow with mandatory peer review. Run your first blameless postmortem on a recent incident to set the cultural tone. By the end of the month, no change reaches any environment except through version control, and "you build it, you run it" is the explicit operating agreement.

Days 31-60: Continuous integration, delivery, and infrastructure as code

Stand up a CI pipeline that builds and tests every merge automatically, and hold the mainline always green. Add a continuous delivery stage so a validated build can reach production at the push of a button. Begin defining your environments as infrastructure as code so they are reproducible and reviewable. Aim to make deployment a non-event: small, frequent, and reversible. By the end of this month the team should be deploying at least weekly, ideally daily, with automated rollback.

Days 61-90: Observability, learning, and measurement

Instrument production with metrics, logs, and traces, and wire up alerting on the signals that actually matter rather than on everything. Make blameless postmortems a standing habit after every significant incident, and start tracking the four DORA metrics so improvement is visible and arguable from data. This is also the natural point to evaluate AI-assisted operations for the operate and monitor half of the loop. By the end of the quarter you have a closed lifecycle loop: monitoring feeds learning, learning feeds planning, and the team improves continuously rather than in fits and starts.

After 90 days, DevOps is continuous. You keep tightening feedback loops, paying down toil, and raising the DORA numbers indefinitely. Use the checklist below as a recurring self-assessment.

  1. Shared ownership is real. One team owns each service across its whole lifecycle, and the people who build it also carry the pager for it.
  2. Everything is in version control. Code, infrastructure, pipeline config, and runbooks all live in a single source of truth with peer review and a clean rollback path.
  3. Continuous integration is green. Every merge triggers an automated build and test run, and the mainline is kept always releasable.
  4. Delivery is automated and frequent. A validated change can reach production at the push of a button, and the team deploys in small batches at least weekly.
  5. Infrastructure is code. Environments are declarative, reproducible, and free of manual configuration drift.
  6. Testing is automated and layered. Unit, integration, security, and performance tests run in the pipeline so quality is verified continuously, not in a manual gate.
  7. Production is observable. Metrics, logs, and traces make system behavior visible, and alerts fire on signals that matter rather than on noise.
  8. Postmortems are blameless and acted on. Incidents produce systemic learning and concrete follow-up work, not blame.
  9. The four DORA metrics are tracked. Deployment frequency, lead time, change failure rate, and time to restore service are measured and trending the right way.
  10. The operate half is automated, not just staffed. Routine detection, diagnosis, and remediation are handled by automation or agents within a policy envelope, so operations scales without linear headcount.

Frequently asked questions

What is DevOps?
DevOps is a culture and a set of practices that unify software development and IT operations so a team can ship better software faster and more reliably. It removes the old wall between the people who build software and the people who run it, replacing handoffs with shared ownership, automation, and fast feedback. It is a way of working before it is any particular tool.
Is DevOps a job title or a culture?
DevOps is primarily a culture and an operating model, not a role. The industry does hire DevOps engineers, but the original intent was to dissolve the dev versus ops divide across a whole team, not to create a new silo between them. When DevOps becomes one person's job, you have usually recreated the wall it was meant to remove.
What does CALMS stand for in DevOps?
CALMS is a framework for assessing DevOps maturity across five dimensions: Culture (shared ownership over silos), Automation (remove manual toil from the path to production), Lean (small batches, limit work in progress, optimize end-to-end flow), Measurement (instrument everything and decide with data), and Sharing (open knowledge, blameless learning, visible information). A team strong on tooling but weak on culture is not doing DevOps.
What is the DevOps lifecycle?
The DevOps lifecycle is usually drawn as an infinite loop with eight phases: plan, code, build, test, release, deploy, operate, and monitor. The development half runs plan through deploy, the operations half runs operate and monitor, and monitoring feeds straight back into planning so the loop never ends. The shape matters: there is no finish line, only continuous improvement.
What are the four DORA metrics?
The four DORA metrics are deployment frequency (how often you ship to production), lead time for changes (how long from commit to running in production), change failure rate (what percentage of deployments cause a failure needing remediation), and time to restore service (how fast you recover from a failure). The first two measure throughput, the last two measure stability, and elite teams are strong on both at once.
What is the difference between DevOps and SRE?
DevOps is the philosophy and SRE is one concrete way to implement it. DevOps says what to aim for (shared ownership, automation, fast feedback) but leaves the how open. SRE, the discipline Google formalized, prescribes a specific how: service level objectives, error budgets, toil budgets, and treating operations as a software problem. A common phrase is that "class SRE implements interface DevOps."
Where does platform engineering fit with DevOps?
Platform engineering is a response to a failure mode of DevOps at scale: when "you build it, you run it" pushes too much cognitive load onto every product team. A platform team builds an internal developer platform with paved-road self-service tooling so product engineers get DevOps outcomes without each reinventing the pipeline. It is not a replacement for DevOps; it is how large organizations make DevOps sustainable.
What are the core practices of DevOps?
The core technical practices are version control for everything, continuous integration, continuous delivery or deployment, infrastructure as code, automated testing, continuous monitoring and observability, and blameless postmortems. Each one shortens a feedback loop or removes a manual handoff. None of them is sufficient alone; DevOps is the system they form together.
Does DevOps require CI/CD?
In practice, yes. Continuous integration and continuous delivery are the engine that makes the rest of DevOps work, because they turn the path to production into an automated, repeatable, low-risk pipeline. You can hold DevOps values without a mature pipeline, but you cannot ship small batches frequently and safely without one, and frequent safe delivery is the point.
How does AI change DevOps in 2026?
AI now assists across the whole lifecycle: code suggestions while you write, test generation, pipeline triage, and especially the operate and monitor phases, where agentic systems detect, diagnose, and remediate incidents within a policy envelope. The biggest shift is that the operations half of DevOps no longer has to scale linearly with headcount. AI closes the loop that human on-call could not staff cheaply.

Go deeper on the topics this page touches. On the automation surface and the reliability disciplines closest to DevOps: DevOps automation (the deep-dive on the tooling and the full automation surface), site reliability engineering (the concrete implementation of DevOps), AI SRE, Agentic SRE, and AIOps. On the operate and monitor half of the lifecycle: incident management, AI incident response, on-call management, self-healing infrastructure, and root cause analysis. On measurement and reliability targets: MTTR, SLOs and error budgets, and capacity planning. On telemetry and signal: observability, monitoring, the golden signals, anomaly detection, and fighting alert fatigue. On practices and learning: runbooks, blameless postmortems, chaos engineering, and eliminating toil. For teams shipping AI systems: LLMOps, the AI engineer's guide to production reliability, and AI observability. See it all working together on the Nova AI Ops features page.

Close the DevOps loop on your real production telemetry.

Nova AI Ops is the Multi-Agent OS for SRE & DevOps. 100 specialized AI agents across 12 teams automate the operate and monitor half of the lifecycle across AWS, GCP, Azure, Linux, and Windows. Free tier available for small teams.