Golden Signals, Auto-Enabled on Every Service
Latency, traffic, errors, saturation. The four signals every service should expose, and which almost nobody has actually wired up consistently. Nova now turns them on by default.
Why golden signals
Google's SRE book named them in 2016, and ten years later most production services still don't expose all four. The reason isn't disagreement, it's friction. Wiring up latency histograms takes ten minutes; wiring up saturation properly takes an afternoon, and it's the third item on a sprint that's already overcommitted. So saturation never ships.
The four signals exist because together they answer the only question that matters during an incident: is the service healthy, and if not, where is the pressure? Latency catches slowness, traffic catches load shifts, errors catch failure modes, saturation catches the looming bottleneck. Drop any one and incident response starts guessing.
On by default
When you register a service in Nova (through the agent, the SDK, or our auto-discovery on Kubernetes), we now wire up all four signals automatically. The agent watches for the standard runtime exporters (Prometheus client, OpenTelemetry, JVM JMX) and synthesises the missing ones from request metadata it can already see.
For HTTP services, latency comes from request span duration, traffic from request count, errors from status code, saturation from concurrent in-flight count plus CPU and memory headroom. For gRPC, the RPC status code in the response metadata takes the place of the HTTP status. For queue workers, traffic is messages-in, errors is dead-letter rate, and saturation is queue depth divided by consumer capacity.
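Under the hood, the synthesis for an HTTP service amounts to something like the sketch below. It's illustrative rather than the agent's actual code: the field names, the percentile index math, and the way CPU headroom is blended into saturation are our assumptions about one reasonable implementation (memory headroom would be blended the same way).

```python
import math
from dataclasses import dataclass

@dataclass
class RequestSample:
    duration_ms: float   # span duration for the request
    status_code: int     # HTTP (or mapped gRPC) status
    in_flight: int       # concurrent requests observed when this one arrived

def golden_signals(window: list[RequestSample], cpu_used: float,
                   cpu_limit: float, max_concurrency: int) -> dict:
    """Synthesise the four signals for one scrape window of an HTTP service."""
    durations = sorted(s.duration_ms for s in window)
    idx = max(0, math.ceil(len(durations) * 0.99) - 1)
    latency_p99 = durations[idx] if durations else 0.0

    traffic = len(window)  # requests seen in this window
    errors = sum(1 for s in window if s.status_code >= 500)
    error_rate = errors / traffic if traffic else 0.0

    # Saturation: worst of in-flight pressure and CPU headroom.
    max_in_flight = max((s.in_flight for s in window), default=0)
    saturation = max(cpu_used / cpu_limit, max_in_flight / max_concurrency)

    return {
        "latency_p99_ms": latency_p99,
        "traffic": traffic,
        "error_rate": error_rate,
        "saturation": saturation,
    }
```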
None of this requires you to add code. If you've already exported richer signals, such as RED-style histograms or custom error taxonomies, Nova prefers your data over the synthesised version. The defaults exist to fill gaps, not to override deliberate instrumentation.
Sensible thresholds
Defaults that are too loose wake nobody up; defaults that are too tight wake everybody up the first day. We picked the middle by analysing 18 months of incident data across our beta tenants and finding the alert thresholds that correlated with real customer-visible incidents.
The numbers we ship: latency p99 above 4× the trailing 7-day median; errors above 1% of total traffic over 5 minutes; saturation above 80% sustained for 10 minutes; traffic dropping below 30% of expected (catches the silent dead service nobody noticed). These aren't sacred, they're a starting point that catches the obvious failures.
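Written out as configuration, the shipped defaults look roughly like this. The structure and key names are ours, not Nova's actual config schema:

```python
# Illustrative only: the shape and keys are assumptions, not Nova's schema.
DEFAULT_THRESHOLDS = {
    "latency":    {"fire_when": "p99 > 4 * trailing_median(days=7)"},
    "errors":     {"fire_when": "error_rate > 0.01", "window_minutes": 5},
    "saturation": {"fire_when": "saturation > 0.80", "sustained_minutes": 10},
    "traffic":    {"fire_when": "traffic < 0.30 * expected"},  # the silent dead service
}
```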
The threshold engine uses adaptive baselines, not static numbers. A service that normally runs at 2ms p99 and a service that normally runs at 200ms p99 get different alerts off the same rule, because the rule is "4× the median," not "100ms." Static thresholds were the noisiest part of the old system; we removed them.
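A minimal sketch of the latency rule under that model, assuming the engine keeps a trailing window of per-service p99 samples; the function and parameter names are ours, not Nova's:

```python
from statistics import median

def latency_alert(p99_history_ms: list[float], current_p99_ms: float,
                  multiplier: float = 4.0) -> bool:
    """Fire when the current p99 exceeds `multiplier` times the trailing median.

    `p99_history_ms` stands in for the last 7 days of p99 samples for one
    service, so each service gets a baseline scaled to its own behaviour.
    """
    if not p99_history_ms:
        return False  # no baseline yet: stay quiet rather than guess
    return current_p99_ms > multiplier * median(p99_history_ms)
```

The same rule puts the alert at 8ms for a service that idles at 2ms and at 800ms for one that idles at 200ms.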
When to override
Defaults are wrong for about 20% of services, and the override path matters as much as the defaults. From the service detail page, every signal has an "edit" link that drops you into the threshold editor with the current values, your trailing data, and a preview of which historical alerts the new threshold would fire or suppress.
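The preview boils down to replaying your trailing data against both thresholds and diffing the alerts. A rough sketch, with hypothetical names:

```python
def preview_threshold(history: list[float], old_threshold: float,
                      new_threshold: float) -> dict:
    """Replay trailing samples against the old and new thresholds and diff them."""
    old_fires = {i for i, v in enumerate(history) if v > old_threshold}
    new_fires = {i for i, v in enumerate(history) if v > new_threshold}
    return {
        "newly_fired": len(new_fires - old_fires),   # alerts the change would add
        "suppressed": len(old_fires - new_fires),    # alerts the change would drop
    }
```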
Common overrides we see: batch jobs run their saturation closer to 100% by design, so the saturation alert gets disabled or moved to a longer window. Latency-tolerant background services bump the latency multiplier from 4× to 10×. High-volume edge services with low error tolerance tighten the error-rate alert from 1% to 0.1%.
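As overrides, those three patterns might look roughly like this; the service names and keys are hypothetical, not a real tenant's config:

```python
# Hypothetical services and keys; your tenant's override syntax may differ.
OVERRIDES = {
    "nightly-batch-loader": {
        "saturation": {"enabled": False},     # runs near 100% by design
    },
    "report-renderer": {
        "latency": {"multiplier": 10.0},      # latency-tolerant background work
    },
    "edge-api": {
        "errors": {"max_error_rate": 0.001},  # tightened from 1% to 0.1%
    },
}
```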
The edits are versioned in your tenant's config repository: every change is a commit, recorded with the engineer who made it and the reasoning they typed in. We've seen this turn into a useful audit trail when alerts go stale six months later and nobody remembers why.
What we observed
Two months in, here's what the data says. Tenants that adopted the auto-enabled signals had 31% more services covered by alerts than they did before, meaning a third of their services had been running with no alerting at all. Mean time to detection on incidents dropped 22% across the cohort. False-positive rate held steady, because the adaptive baselines were tighter than the static thresholds those teams had previously written by hand.
The ugliest finding: a non-trivial number of teams had no error alert on their primary customer-facing service. Not "wrong threshold", no alert at all. They'd been relying on customer reports as their detection mechanism. The auto-enabled defaults caught this for them and we got more "thank you" tickets in February than the rest of the year combined.
If you've already got the four signals on every service with thresholds you trust, the change won't affect you: Nova respects existing alerts and only adds the missing ones. If you're like most teams and have partial coverage, the auto-enabled signals close the gap without you having to schedule the work.