What anomaly detection is in IT operations
Anomaly detection is the practice of automatically flagging behavior in your systems that deviates from what is normal, so you catch problems that static thresholds miss. Instead of a human deciding in advance that CPU above 90 percent is bad, the detector learns the normal shape of each metric, log stream, or trace, and raises a signal when reality drifts away from that learned baseline. In an IT operations context it runs continuously across the telemetry coming off your cloud and host fleet, watching for the moment a signal stops looking like itself.
The reason this matters more now than it did a decade ago is scale, cardinality, and architecture. A monolith on a dozen servers produced a few dozen metrics a human could reason about. A modern microservices estate produces millions of time series, with new pods, new tags, and new label combinations appearing every minute. Nobody can hand-set a threshold on every one of those signals, and even if they could, the thresholds would be stale by the next deploy. Anomaly detection exists because the volume and churn of signals have outrun the human ability to define normal in advance.
The honest framing for the whole guide is this: detection is the first stage of a pipeline, not the finish line. A good detector tells you something moved. Turning that into a resolved incident still requires correlation, root-cause analysis, and action. We will get to that in section six. For now, the working definition is simple: anomaly detection is the automatic, continuous business of noticing deviation from a learned baseline across observability data, so the rest of your operations stack has something real to act on.
Anomaly detection vs static thresholds
The thing anomaly detection replaces is the fixed threshold: a line a human sets once, such as "alert if error rate is above 1 percent," and rarely revisits. Static thresholds are not wrong. They are simple, explainable, and exactly right for hard contractual limits like a disk filling to 100 percent. They break the moment normal itself is not constant, which in production is almost always.
| Dimension | Static thresholds | Dynamic baselining |
|---|---|---|
| Seasonality | Blind to daily and weekly cycles | Learns expected value per hour and weekday |
| Traffic growth | Line goes stale as volume climbs | Baseline tracks the trend automatically |
| False positives | Too tight at peak, fires on normal load | Band widens where variance is high |
| False negatives | Too loose at trough, misses real drops | Flags deviation relative to local normal |
| Maintenance | Hand-tuned per signal, never enough hours | Self-updates from recent history |
| Explainability | Trivially obvious why it fired | Needs a clear "expected vs actual" view |
| Hard limits | Perfect for contractual ceilings | Overkill for a fixed legal cap |
The core failure of a fixed line is that it assumes a flat world. Real signals have shape. Checkout traffic peaks at lunch and dies overnight, batch jobs spike memory every night at 2 a.m., and a marketing email triples sign-ups for an hour. A single threshold cannot be simultaneously tight enough to catch a real outage at 3 a.m. and loose enough not to page on the normal noon peak. You end up either drowning in false positives or sleeping through real incidents, and usually both on different signals.
Dynamic baselining replaces the fixed line with a learned, moving expectation: the normal range for this metric, at this hour, on this day of week, given recent trend and variance. The detector flags points and sequences that fall outside that band rather than outside an arbitrary number. That is what lets it follow seasonality and growth without a human re-tuning thresholds every week. The pragmatic answer is not "throw away every threshold." It is to keep static limits for true hard ceilings and hand the shaped, seasonal, growing signals to dynamic detection.
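A minimal sketch of that idea, assuming the simplest possible baseline: one mean and standard deviation per (weekday, hour) slot. The function names, the `k` multiplier, and the `min_std` floor are illustrative choices, not taken from any particular product, and the history is synthetic.

```python
from collections import defaultdict
from statistics import mean, pstdev

def build_baseline(history):
    """history: iterable of (weekday, hour, value) -> {(weekday, hour): (mean, std)}."""
    buckets = defaultdict(list)
    for weekday, hour, value in history:
        buckets[(weekday, hour)].append(value)
    return {slot: (mean(vals), pstdev(vals)) for slot, vals in buckets.items()}

def is_anomalous(baseline, weekday, hour, value, k=3.0, min_std=1.0):
    """Flag a value outside mean +/- k*std for its own slot. Unknown slots are not flagged."""
    if (weekday, hour) not in baseline:
        return False
    mu, sigma = baseline[(weekday, hour)]
    band = k * max(sigma, min_std)  # floor the band so near-constant slots are not hair-triggers
    return abs(value - mu) > band

# Synthetic history: four weeks, busy at noon (~1000 rps), quiet at 3 a.m. (~50 rps).
history = []
for week in range(4):
    for weekday in range(7):
        history.append((weekday, 12, 1000 + 10 * week))
        history.append((weekday, 3, 50 + week))

baseline = build_baseline(history)
noon_flag = is_anomalous(baseline, 2, 12, 1020)   # ordinary noon load
night_flag = is_anomalous(baseline, 2, 3, 1020)   # the same value at 3 a.m.
print(noon_flag, night_flag)
```

The same reading of 1020 requests per second is accepted at noon and flagged at 3 a.m., which is the behavior a single fixed line cannot express. A production baseline would also track trend and use a more robust spread estimate; this sketch leaves both out.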
The trap most teams fall into: they turn on dynamic detection everywhere on day one, get buried in false positives because the baselines have not warmed up and the sensitivity is untuned, and conclude "anomaly detection does not work." It works. What does not work is deploying it without a warm-up window, without seasonality awareness, and without correlation. Those three omissions are the whole difference between a useful detector and a noise machine.
The types of anomalies
"Anomaly" is not one thing. Detectors that only know how to catch one shape will silently miss the others. There are three shapes worth naming, plus two orthogonal distinctions that change which technique you reach for.
1. Point anomalies
A single value far outside the normal range. One request that takes 40 seconds, one node that reports 100x its usual error count, a lone spike in queue depth. These are the easiest to catch and the kind simple statistics handle well. They are also the kind a static threshold sometimes catches by luck, which is why teams overestimate how covered they are.
2. Contextual anomalies
A value that is only abnormal in context. Heavy traffic is normal at noon and suspicious at 3 a.m. A CPU at 80 percent is fine during a batch window and alarming on an idle service. No single static line can express "normal here depends on when." This is precisely the case where dynamic, seasonality-aware baselining earns its keep.
3. Collective anomalies
A sequence that is abnormal as a group even though no individual point looks wrong. A slow, steady climb in memory that is a leak. A gradual rise in p99 latency over an hour. Each reading is within range; the pattern is the problem. Catching these needs methods that look at windows and trends, not isolated values.
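One way to look at windows rather than isolated values is to fit a straight line to each sliding window and alert on its slope. The sketch below assumes a memory-percentage series sampled at a fixed interval; `window`, `max_slope`, and the 90 percent ceiling mentioned in the comments are illustrative values, not recommendations.

```python
def window_slope(values):
    """Least-squares slope of values against their index (units per sample)."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den

def sustained_climb(values, window=12, max_slope=0.5):
    """Return the start index of the first window whose slope exceeds max_slope, else None."""
    for start in range(len(values) - window + 1):
        if window_slope(values[start:start + window]) > max_slope:
            return start
    return None

# Synthetic data: 30 stable samples, then a slow leak of 0.8 points per sample.
stable = [60 + (i % 3) for i in range(30)]
leak = [60 + 0.8 * i for i in range(30)]
memory_pct = stable + leak

print(max(memory_pct) < 90)           # no reading ever crosses a static 90 percent ceiling
print(sustained_climb(stable))        # the stable stretch alone is never flagged
print(sustained_climb(memory_pct))    # the climbing stretch is
```

Every individual reading stays under the fixed ceiling, yet the window-level slope exposes the leak well before the ceiling would.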
4. Univariate vs multivariate
Univariate detection watches one signal in isolation. Multivariate detection watches many signals together and flags problems that only appear in the relationship between them, such as latency rising while throughput falls. Many real failures are invisible in any single metric and obvious in the joint behavior, which is why multivariate methods matter for complex services.
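A small illustration of joint behavior, using the squared Mahalanobis distance on two signals. This is one common multivariate technique chosen for the sketch, not the only option, and it assumes the normal data is roughly elliptical. The throughput and latency numbers are synthetic, and the 13.8 cutoff is the 99.9th percentile of a chi-square distribution with two degrees of freedom.

```python
from statistics import mean

def fit_2d(points):
    """Means and (population) covariance terms for a list of (x, y) pairs."""
    xs, ys = zip(*points)
    mx, my = mean(xs), mean(ys)
    n = len(points)
    sxx = sum((x - mx) ** 2 for x in xs) / n
    syy = sum((y - my) ** 2 for y in ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    return mx, my, sxx, syy, sxy

def mahalanobis_sq(model, point):
    """Squared Mahalanobis distance of point from the fitted 2-D distribution."""
    mx, my, sxx, syy, sxy = model
    dx, dy = point[0] - mx, point[1] - my
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

# Synthetic normal behavior: latency (ms) rises with throughput (rps), plus small jitter.
normal = [(t, 50 + 0.1 * t + (5 if i % 2 else -5))
          for i, t in enumerate(range(100, 1001, 50))]
model = fit_2d(normal)

THRESHOLD = 13.8  # chi-square, 2 degrees of freedom, p = 0.001
busy_and_slow = mahalanobis_sq(model, (900, 140))   # high latency under high load
quiet_but_slow = mahalanobis_sq(model, (200, 140))  # same latency, low throughput
print(busy_and_slow > THRESHOLD, quiet_but_slow > THRESHOLD)
```

A latency of 140 ms sits inside the historical range in both cases, so a univariate check passes both. Only the joint view separates the point that fits the usual latency-throughput relationship from the one that breaks it.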
5. Metric anomalies
Deviations in numeric time series: latency, error rate, saturation, throughput, the golden signals. This is the most common form because metrics are already numeric and regularly sampled, which makes them the natural home for the statistical and forecasting techniques in the next section.
6. Log and trace anomalies
A brand-new error template or a sudden surge in the rate of a known one is a log anomaly, often the first sign of a failure that has not yet moved a top-line metric. A latency or error pattern that deviates from the normal shape of a request path is a trace anomaly, which is how you localize a regression to a specific service or dependency.
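For the log side, a rough sketch of both signals: templates never seen in the baseline window, and known templates whose count has surged. The masking rules here are deliberately crude (real log parsers are far more careful), the two windows are assumed to cover equal time spans, and `surge_factor` is an arbitrary illustrative value.

```python
import re
from collections import Counter

def to_template(line):
    """Mask variable parts so similar messages collapse into one template."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)
    line = re.sub(r"\d+", "<NUM>", line)
    return line

def log_anomalies(baseline_lines, current_lines, surge_factor=5.0):
    """Return (new templates, surging templates) for two equal-length time windows."""
    base = Counter(to_template(line) for line in baseline_lines)
    cur = Counter(to_template(line) for line in current_lines)
    new_templates = [t for t in cur if t not in base]
    surges = [t for t, c in cur.items() if t in base and c >= surge_factor * base[t]]
    return new_templates, surges

baseline_window = (["timeout after 30 ms calling svc-7"] * 2
                   + ["user 41 logged in"] * 10)
current_window = (["timeout after 31 ms calling svc-9"] * 12
                  + ["user 7 logged in"] * 9
                  + ["null pointer at 0x7ffab3 in order 991"] * 3)

new_templates, surges = log_anomalies(baseline_window, current_window)
print(new_templates)
print(surges)
```

The login message is ignored because its rate is steady, the timeout template is reported as a surge, and the null-pointer message is reported as new.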
The takeaway: a real anomaly detection capability has to span point, contextual, and collective shapes, work both univariate and multivariate, and reach across metrics, logs, and traces. A tool that only does univariate point detection on metrics will tell you a server caught fire but never warn you about the leak that lit the match.
The techniques: statistical and machine learning
The methods fall into two families. Neither is universally better. The right system layers them, using cheap statistics where they suffice and reserving heavier machine learning for the signals that need it.
Statistical methods
The workhorses, fast and explainable. Z-score flags points more than a few standard deviations from the mean, and its robust cousin MAD (median absolute deviation) does the same without being thrown off by outliers in the history itself. EWMA (exponentially weighted moving average) smooths a noisy series and tracks slow drift, making it good at catching gradual change. For signals with trend and seasonality, seasonal decomposition (STL) separates a series into trend, seasonal, and residual components, while ARIMA models a series from its own past values and forecasts the next one; in both cases you alert on a residual that is too large to be ordinary noise. Statistical methods cover a large share of real metrics, cost almost nothing to run, and produce a baseline a human can actually look at and trust. Their limit is that they assume a fairly well-behaved single series.
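To make the z-score versus MAD difference concrete, here is a short sketch using only the standard library. The 0.6745 constant rescales MAD so the score is comparable to a standard z-score on normally distributed data; the latency history, which deliberately contains one old outlier, is made up.

```python
from statistics import mean, median, pstdev

def plain_z(history, value):
    """Classic z-score: distance from the mean in standard deviations."""
    return (value - mean(history)) / pstdev(history)

def robust_z(history, value):
    """Modified z-score built on the median and MAD instead of mean and std."""
    med = median(history)
    mad = median([abs(x - med) for x in history])
    if mad == 0:
        return 0.0 if value == med else float("inf")
    return 0.6745 * (value - med) / mad

def ewma(series, alpha=0.3):
    """Exponentially weighted moving average; higher alpha reacts faster."""
    smoothed, out = series[0], []
    for x in series:
        smoothed = alpha * x + (1 - alpha) * smoothed
        out.append(smoothed)
    return out

# Latency history (ms) that still contains one past outlier of 2000 ms.
latency_history = [100, 102, 98, 101, 99, 103, 97, 100, 2000, 101]
z_plain = plain_z(latency_history, 130)
z_robust = robust_z(latency_history, 130)
print(round(z_plain, 2), round(z_robust, 2))
```

The single 2000 ms reading inflates the mean and standard deviation so much that the plain z-score treats 130 ms as unremarkable, while the MAD-based score flags it clearly. That is the practical argument for preferring robust statistics on histories you have not cleaned.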
Machine-learning methods
Where statistics run out of road, machine learning takes over: high cardinality, many correlated signals, and nonlinear relationships. Isolation forest isolates outliers by how few random splits it takes to separate them, and works well on multivariate data without assuming any distribution. Clustering (such as DBSCAN) groups normal behavior and flags whatever does not belong to a cluster. Autoencoders learn to compress and reconstruct normal data, then flag inputs they reconstruct badly, which is a strong fit for high-dimensional multivariate detection. Forecasting models predict the next value of a series and alert when the actual value diverges from the prediction by more than the expected error. These methods earn their cost on exactly the signals where simple statistics fall short.
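The forecasting approach can be shown without a heavy model. The sketch below swaps in a seasonal-naive forecast (each point is predicted to equal the value one season earlier) as a stand-in for a learned forecaster; the alerting logic, comparing the forecast error to the typical past error, is the same shape either way. The series, `season`, and `k` are illustrative.

```python
from statistics import pstdev

def seasonal_naive_alerts(series, season, k=4.0):
    """Forecast each point as the value one season earlier; flag unusually large errors."""
    residuals, alerts = [], []
    for t in range(season, len(series)):
        error = series[t] - series[t - season]
        if len(residuals) >= season:               # need one season of errors first
            scale = max(pstdev(residuals), 1e-9)   # typical size of a forecast error
            if abs(error) > k * scale:
                alerts.append(t)
                continue                           # keep the outlier out of the error history
        residuals.append(error)
    return alerts

# Four synthetic days of hourly data: higher during business hours, small jitter.
season = 24
series = []
for t in range(96):
    hour = t % season
    base = 150 if 9 <= hour < 18 else 100
    series.append(base + (t * 7) % 5 - 2)
series[90] += 200   # inject a spike on the last day

alerts = seasonal_naive_alerts(series, season)
print(alerts)
```

The daily step between 100 and 150 never fires because the forecast already expects it; only the injected spike does. One known weakness of the seasonal-naive stand-in is that an anomaly becomes the forecast one season later, which a real forecasting model or an outlier-replacement step would need to handle.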
Supervised vs unsupervised
Almost all production anomaly detection is unsupervised, and for a blunt reason: labeled anomalies are rare. You usually do not have a clean, trustworthy history of "this window was an incident, this one was not," so methods that need labels have nothing to learn from. Unsupervised methods learn the shape of normal from unlabeled history and call out deviation, which is why isolation forest, clustering, autoencoders, and the statistical methods dominate real deployments. Supervised detection becomes worthwhile only when you have accumulated a clean labeled corpus of past incidents, at which point a classifier can learn the specific signatures that preceded them. Most teams should plan to live in the unsupervised world and treat any labels they collect as a bonus that improves correlation and ranking later, not as a prerequisite.
The hard parts in production
Anomaly detection is easy in a notebook and hard in production. The gap is entirely in the operational realities a demo never shows you. These are the five that bite.
Seasonality and trend
The single biggest source of false positives is a detector that never learned your cycles. If the model does not know that sign-ups triple every Monday morning and that memory climbs every nightly batch, it will page on both as if they were incidents. Every serious detector has to model daily and weekly seasonality and the underlying growth trend, or it will cry wolf on schedule.
The cold-start problem
A detector knows nothing on its first day. With no history it cannot tell normal from abnormal, so a brand-new service, a new metric, or a freshly deployed pod has no baseline to deviate from. Push a cold detector straight to paging and it either fires on everything or nothing. The fix is a warm-up window: collect enough history to learn normal before the detector is allowed to wake anyone up.
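A warm-up gate can be as simple as refusing to return a verdict until enough history exists. This is a sketch under that assumption; the class name, `min_samples`, and the three verdict strings are hypothetical.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev

@dataclass
class WarmupDetector:
    min_samples: int = 50
    k: float = 3.0
    history: list = field(default_factory=list)

    def observe(self, value):
        """Record a value and return 'warming_up', 'ok', or 'anomaly'."""
        if len(self.history) < self.min_samples:
            self.history.append(value)
            return "warming_up"            # never eligible to page
        mu, sigma = mean(self.history), pstdev(self.history)
        if abs(value - mu) > self.k * max(sigma, 1e-9):
            return "anomaly"               # not added to history, so it cannot skew normal
        self.history.append(value)
        return "ok"

detector = WarmupDetector(min_samples=5)
verdicts = [detector.observe(v) for v in [10, 11, 9, 10, 12, 10, 500]]
print(verdicts)
```

The first five observations only build the baseline. After that the detector accepts 10 and flags 500. In practice the minimum history would be sized to cover at least one full seasonal cycle rather than a fixed sample count.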
Concept drift
Normal is a moving target. A deploy changes the latency profile, a traffic shift changes the daily shape, a new feature changes resource usage. A baseline learned last month is wrong this month, and a stale baseline produces a flood of false anomalies that are really just the new normal. The defense is continuous re-baselining and, critically, automatic re-baselining on known change events like deploys, so the model resets its expectations at exactly the moments normal is most likely to have shifted.
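Both defenses can be sketched together: a bounded window so old points age out, and a hook that discards the baseline when a change event arrives. The `on_change_event` method is a hypothetical integration point for a deploy notification, and the latency values are synthetic.

```python
from collections import deque
from statistics import mean, pstdev

class DriftAwareBaseline:
    def __init__(self, window=100, min_samples=10, k=3.0):
        self.values = deque(maxlen=window)   # continuous re-baselining: old points age out
        self.min_samples = min_samples
        self.k = k

    def on_change_event(self):
        """Hypothetical hook for a deploy notification: forget the old normal."""
        self.values.clear()

    def check(self, value):
        """Return True if value is flagged; unflagged values update the baseline."""
        if len(self.values) < self.min_samples:
            self.values.append(value)
            return False                      # still learning, never flags
        data = list(self.values)
        mu, sigma = mean(data), pstdev(data)
        flagged = abs(value - mu) > self.k * max(sigma, 1e-9)
        if not flagged:
            self.values.append(value)
        return flagged

before = [99, 101] * 10     # latency around 100 ms
after = [139, 141] * 10     # a deploy shifts normal to around 140 ms

stale = DriftAwareBaseline()
for v in before:
    stale.check(v)
stale_flags = sum(stale.check(v) for v in after)

fresh = DriftAwareBaseline()
for v in before:
    fresh.check(v)
fresh.on_change_event()     # the deploy pipeline tells the detector normal may have moved
fresh_flags = sum(fresh.check(v) for v in after)

print(stale_flags, fresh_flags)
```

Without the reset, every post-deploy reading is flagged against last week's normal. With it, the detector relearns quietly. The trade-off is a short blind spot right after each deploy, so a real system would pair the reset with static guardrails during the relearning period.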
Alert fatigue from noisy detectors
A detector that fires on every blip is worse than no detector, because the real signal drowns in noise and on-call learns to ignore it. This is the failure mode that gives anomaly detection a bad name, and it is squarely a case of alert fatigue. The antidote is not just better tuning; it is correlation, collapsing the many anomalies that one failure throws off into a single incident rather than paging on each raw deviation. A detector without correlation in front of it will train your team to mute it.
Explainability and tuning sensitivity
An anomaly with no explanation is hard to trust and hard to act on. Engineers need to see the expected range, the actual value, and why this counts as a deviation, or they will not believe the detector under pressure. Tightly bound to this is the sensitivity-versus-specificity tradeoff: turn sensitivity up and you catch more real incidents but generate more false positives; turn it down and you cut the noise but risk missing genuine problems. There is no setting that is right for every signal, which is why detection has to be tunable per signal and paired with a clear view of why each anomaly fired.
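The tradeoff is easy to see by sweeping a threshold over scored points. The sketch assumes you have a small set of anomaly scores with after-the-fact labels, for example from postmortems; both lists below are invented for illustration.

```python
def precision_recall(scores, is_incident, threshold):
    """Precision and recall when everything scoring >= threshold is flagged."""
    flagged = [s >= threshold for s in scores]
    tp = sum(f and y for f, y in zip(flagged, is_incident))
    fp = sum(f and not y for f, y in zip(flagged, is_incident))
    fn = sum((not f) and y for f, y in zip(flagged, is_incident))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

scores      = [0.5,   1.2,   2.1,   2.8,  3.3,   3.9,  4.5,  6.0]
is_incident = [False, False, False, True, False, True, True, True]

for threshold in (2.0, 3.0, 4.0):
    p, r = precision_recall(scores, is_incident, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

At the loosest setting every incident is caught but a third of the flags are noise; at the strictest, every flag is real but half the incidents are missed. Where to sit on that curve depends on the cost of a missed incident versus a wasted page for that particular signal.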
From detection to correlation to action
Here is the most important idea in this guide, and the one most teams get wrong: a raw anomaly is not an incident. A raw anomaly is a statistical signal that one thing deviated. An incident is a customer-affecting problem with a cause and a fix. Conflating the two is exactly how you build an alerting system your team learns to ignore.
Consider what one real failure looks like in the telemetry. A bad deploy goes out, and within seconds latency rises on three services, error rates spike on a dozen endpoints, a new error template floods the logs, retry storms push queue depth up, and saturation climbs on the pods absorbing the retries. A naive detector sees dozens of independent anomalies and fires dozens of pages. The on-call engineer now has to mentally re-assemble those scattered signals back into the single thing that actually happened, under time pressure, at 3 a.m. That reassembly is the slow, error-prone work that dominates time-to-resolution.
The pipeline that works has four stages. Detect the deviations across metrics, logs, and traces. Correlate the related anomalies into one incident, so the operator sees a single problem instead of a storm. Run root-cause analysis to find the actual driver instead of a list of symptoms. Then act, either by paging a human with the diagnosis already attached or, for known-safe classes of problem, by remediating automatically. Detection is the trigger for this pipeline, not the verdict it produces.
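The correlate stage can be illustrated with the crudest possible rule: anomalies that arrive close together in time belong to the same incident. Real correlation layers also use service topology, deploy events, and shared labels, so treat this as a sketch of the shape of the step only. The `Anomaly` fields and `max_gap` are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Anomaly:
    ts: float       # seconds since some reference point
    service: str
    signal: str

def correlate(anomalies, max_gap=120.0):
    """Group anomalies into incidents: a new incident starts after a quiet gap > max_gap."""
    incidents = []
    for a in sorted(anomalies, key=lambda a: a.ts):
        if incidents and a.ts - incidents[-1][-1].ts <= max_gap:
            incidents[-1].append(a)
        else:
            incidents.append([a])
    return incidents

raw = [
    Anomaly(0, "checkout", "latency"),
    Anomaly(10, "checkout", "error_rate"),
    Anomaly(25, "payments", "latency"),
    Anomaly(40, "queue", "depth"),
    Anomaly(55, "checkout", "new_log_template"),
    Anomaly(4000, "search", "latency"),
]

incidents = correlate(raw)
print(len(raw), "raw anomalies ->", len(incidents), "incidents")
```

Six raw anomalies become two incidents, so the on-call engineer receives two pages instead of six, each carrying the full set of related signals.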
This is exactly where Nova AI Ops positions itself. Nova runs dynamic, seasonality-aware anomaly detection across AWS, GCP, Azure, Linux, and Windows, then correlates the flood of related anomalies into a single incident instead of paging on every blip. From there it runs root-cause analysis to find the actual driver and, for the known-safe class of issues, auto-resolves within a policy envelope before a human finishes reading the page. The contrast that matters: a bare detector pages a human for every deviation; Nova turns deviations into a correlated, diagnosed, and where-safe resolved incident, and leaves humans only the genuinely novel cases. If you want the broader picture of how this fits an autonomous reliability stack, see self-healing infrastructure and AI incident response.
Where it sits in the 2026 AIOps landscape
Anomaly detection is not a product category on its own anymore. It is a capability inside a larger system, and where it lives tells you a lot about how useful it will be. In a modern AIOps platform, detection is the sensing layer that feeds everything downstream: correlation, root-cause, and action all depend on it, but none of them are it.
The first decision is open source versus commercial. Open-source options, from Prometheus recording rules with simple statistical alerts to libraries like Prophet for forecasting and frameworks for isolation forests and autoencoders, give you full control and no per-signal licensing cost. The price is that you own the hard parts yourself: seasonality modeling, cold-start handling, drift management, and especially the correlation layer that turns raw anomalies into incidents. Commercial platforms ship those operational pieces and the integrations, which is most of the real work, in exchange for cost and less control over the internals.
That leads to the familiar build-versus-buy question, and the honest answer is that the algorithm is the easy part. You can stand up an isolation forest on a metric stream in an afternoon. What takes quarters is everything around it: maintaining seasonality-aware baselines across thousands of churning signals, handling cold start on every new service, re-baselining on drift, and building the correlation and root-cause layers that make detection actionable rather than noisy. Teams that "build anomaly detection" usually build the easy 20 percent and then spend a year discovering the hard 80 percent. The pragmatic split is to build when detection is a core differentiator of your own product, and buy when you want detection that feeds a working incident pipeline without funding a multi-year platform effort. For how this connects to the rest of the practice, see AI observability and MTTR.
A 90-day rollout plan and readiness checklist
The fastest way to kill an anomaly detection rollout is to turn it on everywhere at once and bury your team in false positives on week one. The discipline below earns trust on a few signals before expanding, so the team comes to rely on the detector instead of muting it.
Days 1–14: Shadow mode on your top SLIs
Run detectors on your most important service-level indicators with paging turned off. The only goal is measurement: how often does the detector fire, and how often does a firing line up with a real incident? You are establishing precision before anyone gets woken up. Pick a small set of high-value signals, golden-signal metrics for your most critical services, not the whole estate.
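A shadow-mode report boils down to matching detector firings against the incidents you actually had. The sketch below assumes firings are timestamps and incidents are (start, end) pairs in the same units; the `slack` allowance for early or late firings and all the data are illustrative.

```python
def shadow_report(firings, incidents, slack=300):
    """Summarize how detector firings line up with known incident windows."""
    def overlaps(ts, start, end):
        return start - slack <= ts <= end + slack

    true_fires = sum(1 for ts in firings
                     if any(overlaps(ts, s, e) for s, e in incidents))
    caught = sum(1 for s, e in incidents
                 if any(overlaps(ts, s, e) for ts in firings))
    return {
        "firings": len(firings),
        "precision": true_fires / len(firings) if firings else None,
        "incidents_caught": caught,
        "incidents_total": len(incidents),
    }

firings = [1000, 1100, 5000, 9000, 20000]
incidents = [(900, 1500), (8800, 9500), (30000, 30600)]

report = shadow_report(firings, incidents)
print(report)
```

Here three of five firings matched a real incident and one incident was missed entirely. Numbers like these, collected per detector over the two weeks, are what the tuning phase works from.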
Days 15–45: Tune sensitivity and add seasonality awareness
Using the shadow-mode data, tune each detector. Add daily and weekly seasonality, set a warm-up window so cold detectors cannot page, and adjust sensitivity per signal until precision on that small high-value set is genuinely good. The deliverable at the end of this phase is a handful of detectors you would trust to wake you up, not a hundred you would mute.
Days 46–75: Promote to paging with correlation in front
Turn paging on for only the tuned, trusted detectors, and route them through a correlation layer so related anomalies collapse into one incident instead of a storm of separate pages. Watch the page volume and the false-positive rate closely. If a detector starts crying wolf, demote it back to shadow mode and re-tune rather than letting the team learn to ignore it.
Days 76–90: Expand coverage and automate re-baselining
Now grow the footprint deliberately, one service tier at a time, always adding correlation as you go. Wire automatic re-baselining to your deploy pipeline and known traffic events so concept drift does not reintroduce the false positives you worked to eliminate. By day 90 you should have a trusted core of paging detectors, correlation collapsing storms into incidents, and a repeatable process for onboarding the next service without restarting the noise problem.
The 10-point readiness checklist
Before you promote any detector to paging, confirm it clears these ten. A detector that fails several of them is a false-positive generator waiting to happen.
- Seasonality modeled. Does the baseline know your daily and weekly cycles, or will it page on every Monday-morning peak?
- Trend tracked. Does the baseline follow traffic growth, so it does not go stale and drift into false alarms as volume climbs?
- Warm-up window. Is there a minimum history requirement before a detector is allowed to page, so cold start does not produce noise?
- Drift handling. Does the system re-baseline automatically on deploys and known change events instead of relying on a model from last month?
- Per-signal tuning. Can you set sensitivity independently per signal, rather than one global knob for everything?
- Multivariate coverage. Can it catch problems that only appear in the relationship between signals, not just univariate point spikes?
- Logs and traces, not just metrics. Does detection extend to new error templates and abnormal request paths, or is it metrics-only?
- Correlation in front. Do related anomalies collapse into one incident, so a single failure does not generate a storm of pages?
- Explainability. Does each anomaly show the expected range, the actual value, and why it counts as a deviation, so an engineer can trust it under pressure?
- Action path. Does a confirmed incident feed root-cause analysis and, where safe, automated remediation, or does it just add another page to the pile?
Score honestly. Most off-the-shelf "anomaly detection" features clear the first few and fail the last three, which is why so many teams have a detector switched on and a team that ignores it. The last three, correlation, explainability, and a real action path, are what separate a signal you act on from noise you mute.
Related guides
Anomaly detection feeds the rest of the operations stack. Start with the platform it lives inside, AIOps, and the data it reads: observability and AI observability. It is the cure for alert fatigue when paired with correlation, and it triggers AI incident response, root cause analysis, and self-healing infrastructure. On the broader autonomous stack: AI SRE, Agentic SRE, and incident management. On operational metrics and practice: MTTR, on-call management, SLOs and error budgets, site reliability engineering, blameless postmortems, chaos engineering, toil, and DevOps automation. For teams shipping AI systems: the AI engineer's guide to production reliability and LLMOps. Or see all Nova features.
Turn raw anomalies into resolved incidents.
Nova AI Ops detects anomalies across metrics, logs, and traces, correlates them into a single incident, finds the cause, and auto-resolves the known-safe class within a policy envelope. 100 specialized AI agents across 12 teams, running on AWS, GCP, Azure, Linux, and Windows. Free tier available for small teams.