Reliability Engineering

Cron jobs fail silently, until something downstream breaks

Job Failures tracks every scheduled job: cron, Airflow DAG, Kubernetes CronJob, batch ETL, scheduled report. When a job fails or runs over its expected duration, the page surfaces it. Job-level monitoring is the missing layer between "the service is up" and "the data is fresh." Most teams discover at least one job that has been failing for weeks.

Get Started · Talk to Sales
app.novaaiops.com / job-failures
  • Auto-discovered from runtimes
  • Per-job expected duration
  • Pages on streak failure
  • Owner always identified
What's Tracked

Status, duration, owner, downstream

For each job: last run status, last run duration vs expected, owner team, downstream consumers (which dashboards or services depend on this job's output), and the status streak across the last N runs. A streak failure (3+ consecutive failures) triggers a page; a one-off failure goes to the team channel (the routing rule is sketched after the list below).

  • Last run status + duration: success/fail and how long it took compared to baseline
  • Owner team: pulled from job metadata or container labels; team-routed pages
  • Downstream consumers: which dashboards / reports / services depend on this job's freshness
  • Streak alerting: 3+ consecutive failures pages on-call; one-offs notify the team channel
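A minimal sketch of that routing rule, assuming run statuses arrive newest-last. The thresholds come straight from this page; the function and status names are illustrative, not Nova AI Ops' actual API.

```python
from typing import List


def route_alert(statuses: List[str]) -> str:
    """statuses holds recent run results ("success"/"fail"), newest last."""
    if not statuses or statuses[-1] == "success":
        return "none"  # nothing to route when the latest run succeeded
    # Count consecutive failures ending at the most recent run.
    streak = 0
    for status in reversed(statuses):
        if status != "fail":
            break
        streak += 1
    # 3+ consecutive failures page on-call; anything less goes to the team channel.
    return "page_oncall" if streak >= 3 else "team_channel"


route_alert(["success", "fail", "fail", "fail"])  # -> "page_oncall"
route_alert(["success", "success", "fail"])       # -> "team_channel"
```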
app.novaaiops.com / job-failures · tracked
Auto-Discovery

Jobs find the page, not the other way around

The page does not require manually registering jobs. It auto-discovers from runtimes: Kubernetes CronJobs, Airflow DAGs, GitHub Actions schedules, AWS EventBridge rules, and Cloud Scheduler. Any new scheduled job appears within minutes. Manual registration is supported for cron entries on individual hosts (see the host-agent sketch after the list).

  • Five runtimes auto-detected: k8s CronJob, Airflow, GitHub Actions, EventBridge, Cloud Scheduler
  • Manual register for host cron: a small agent reports per-host cron status for legacy environments
  • Auto-tagged ownership: job ownership pulled from labels / annotations / repo CODEOWNERS
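For the host-cron case, a wrapper along these lines is roughly what a per-host agent would do: run the command, time it, and report the outcome. The endpoint URL and payload fields here are hypothetical placeholders, not Nova AI Ops' documented API.

```python
import json
import subprocess
import time
import urllib.request

REPORT_URL = "https://app.novaaiops.com/api/job-runs"  # hypothetical endpoint


def run_and_report(job_name: str, command: list, owner_team: str) -> int:
    """Run a cron command, then report status, duration, and owner."""
    start = time.monotonic()
    result = subprocess.run(command)
    payload = {
        "job": job_name,
        "owner": owner_team,
        "status": "success" if result.returncode == 0 else "fail",
        "duration_s": round(time.monotonic() - start, 2),
    }
    req = urllib.request.Request(
        REPORT_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # fire-and-forget status report
    return result.returncode
```

A crontab entry would then invoke the wrapper instead of the raw command, so every scheduled run reports itself.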
app.novaaiops.com / job-failures · discovery
Long-Tail Findings

Catch the silent-fail jobs

The most useful finding is usually the job that has been failing for weeks. The page surfaces "high streak failure" jobs: those whose latest 10 runs include 5+ failures. These rarely page anyone (single-failure thresholds let them through), but they eventually degrade something downstream. Surfacing the long tail is a one-time win for most teams (the flag logic is sketched below).

  • High streak failure flag: 5+ failures in last 10 runs surfaces the job for review
  • Cumulative downtime view: how many hours the job's downstream has been stale this month
  • Resurface for owner team: jobs in this state appear in the owner team's weekly report
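The flag itself is simple. A sketch under the thresholds stated above (5+ failures in the last 10 runs), with illustrative names:

```python
from typing import List


def is_high_streak_failure(last_runs: List[str], window: int = 10,
                           threshold: int = 5) -> bool:
    """last_runs holds statuses, newest last; True means surface the job for review."""
    recent = last_runs[-window:]
    return sum(1 for status in recent if status == "fail") >= threshold
```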
app.novaaiops.com / job-failures · long-tail
Auto-Remediation

For known-fail patterns, the agents can retry

Some failures are well-understood transients (a rate limit, a locked table, a network blip). For these, the agent fleet can auto-retry with backoff. Auto-retry is opt-in per job class and bounded (max 3 retries). Persistent failures still page on-call. Every auto-retry is logged in the Agent Ledger so the rerun chain is auditable (a sketch of the retry loop follows the list).

  • Per-class opt-in: retry policies are per job class, not global; teams choose where it makes sense
  • Bounded: max 3 retries with exponential backoff; persistent failure still pages
  • Auditable: every retry shows up in the agent ledger; nothing happens silently
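A sketch of the bounded retry policy with the limits given above. The class registry, ledger, and paging hooks are illustrative stand-ins for the agent fleet's internals, not its real interfaces.

```python
import time
from typing import Callable

RETRYABLE_CLASSES = {"rate_limit", "locked_table", "network_blip"}  # opt-in per job class
MAX_RETRIES = 3


def run_with_retry(job_class: str, run_job: Callable[[], bool],
                   ledger_log: Callable[[str], None],
                   page_oncall: Callable[[str], None]) -> bool:
    """Retry known-transient failures with backoff; page on persistent failure."""
    if run_job():
        return True
    if job_class not in RETRYABLE_CLASSES:
        page_oncall(f"{job_class}: failed; class not opted in to auto-retry")
        return False
    for attempt in range(1, MAX_RETRIES + 1):
        time.sleep(2 ** attempt)                          # exponential backoff: 2s, 4s, 8s
        ledger_log(f"retry {attempt} for {job_class}")    # every rerun is auditable
        if run_job():
            return True
    page_oncall(f"{job_class}: still failing after {MAX_RETRIES} retries")
    return False
```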
app.novaaiops.com / job-failures · retry
Video walkthrough coming soon

Subscribe to Nova AI Ops on YouTube for demos, tutorials, and feature deep-dives.

Find the silent-fail jobs

Job Failures surfaces what dashboards miss: the cron that has been failing every night for two weeks while everyone assumed the data was fresh.

Get Started · Request a Demo