Prometheus Security Alerts
Security signals as Prometheus alerts.
Auth failures
Most engineering teams already run Prometheus for operational metrics. The same Prometheus instance can serve security monitoring with a small set of additional rules. Building security alerts on top of existing observability infrastructure produces fast, low-cost security signal that the on-call team can integrate with their existing workflow.
What auth failure alerting catches:
- Spike in failed authentications.: A sustained increase in 401/403 responses, login failures, or token validation errors is the signature of brute force, credential stuffing, or a compromised credential being tested. The Prometheus alert fires on rate-of-change rather than absolute count.
- Brute force pattern.: Many failed attempts from a small number of source IPs targeting many usernames. The pattern produces a specific shape in the metrics: high failure rate with low source diversity. The alert is tuned to this shape.
- Token leak pattern.: A specific token being used from many different source IPs in a short window. Legitimate use of a token concentrates around a small number of sources; widely-distributed use indicates the token has leaked. The alert catches this pattern.
- Alert and investigate.: The alert fires; the on-call investigates within minutes. Block the offending source IPs at the WAF; rotate the suspected leaked credentials; review the audit log for what the attacker accessed during the window. The response is structured.
- Per-tenant alerting.: In multi-tenant SaaS, the alert can fire per-tenant rather than globally. A specific tenant experiencing brute force gets its own alert; the security team responds for that tenant specifically. Cross-tenant patterns produce different signals.
Auth failure alerting is the cheapest and highest-value security alert most teams can add. The metrics are already being collected; adding the alerting rule is a few minutes of configuration.
Privilege escalation
The second category is unexpected privilege changes. An attacker who compromises an account often tries to expand their access by modifying IAM, adding admin roles, or creating new privileged users. These changes are observable; alerting on them catches the attacker mid-attack.
- Unexpected role changes.: CloudTrail (AWS), Cloud Audit Logs (GCP), Activity Log (Azure) record every IAM change. Prometheus or Loki ingests these logs and produces metrics on change rate. Sudden spikes in role attachments, policy changes, or user creations are alerted.
- Outside business hours.: Most legitimate IAM changes happen during business hours by known engineers. Changes at 3am, on weekends, or by accounts that do not normally make IAM changes are anomalous. The alert combines time-of-day with actor identity.
- Specific high-risk operations.: Creating administrative roles. Disabling logging. Adding cross-account trust relationships. Each is a specific operation with a specific log signature; alerts target the operation rather than IAM activity in aggregate.
- CloudTrail or audit log integration.: The alerts read from the audit log. A CloudTrail event matching the pattern produces a metric increment; the metric crosses the threshold; the alert fires. The plumbing connects existing logging infrastructure to existing alerting infrastructure.
- Suspicious by default; investigation determines truth.: The alert fires on the pattern. The investigation determines whether the change was legitimate (the engineer documented it; the change was approved) or suspicious (no approval, no engineer claiming it, fits an attack pattern). Most alerts are legitimate; the few that are not are caught early.
Privilege escalation alerts catch the lateral movement step that attackers usually take after initial compromise. The pattern is well-known; the alerting is straightforward; the response is structured.
Anomalous egress
The third category is unusual outbound traffic patterns. Data exfiltration shows up as outbound traffic to external destinations the system does not normally reach. Prometheus alerts on the network-flow metrics catch the pattern early.
- Sudden traffic to new external IPs.: A workload that normally talks to a fixed set of external services starts contacting new ones. The new destinations might be a new legitimate dependency, or might be an attacker's command-and-control server. The alert flags the change for investigation.
- Volume anomalies.: A workload that usually sends 100KB per minute to an external destination starts sending 100MB per minute. The volume spike could be a legitimate large export or a bulk exfiltration. The alert catches the change; investigation determines the cause.
- Possible exfiltration pattern.: Combined signals (new destinations plus high volume plus odd timing) are stronger evidence than any single one. The alerting rule combines them; the alert fires on the combined pattern rather than each signal separately.
- VPC flow log integration.: Cloud-provider VPC flow logs feed into Prometheus or a similar metric store. Per-workload egress patterns are computed; anomalies surface against the baseline. The infrastructure for collecting flow logs is usually already in place; the alerting layer is added on top.
- Investigation playbook.: When the alert fires, the on-call follows the playbook. Pull the flow log for the workload; identify the destination; check whether it is a legitimate dependency that was missed in the baseline; verify that no compromise has occurred. The playbook is documented; the response is structured.
Prometheus-based security alerts give the team fast, low-cost security signal on top of their existing observability infrastructure. Nova AI Ops integrates with Prometheus and similar metric stores, surfaces the security-relevant metrics alongside operational ones, and produces the alerting and investigation workflow that turns security signals into structured response.