Reliability Engineering

Cron jobs fail silently, until something downstream breaks

Job Failures tracks every scheduled job: cron, Airflow DAG, Kubernetes CronJob, batch ETL, scheduled report. When a job fails or runs over its expected duration, the page surfaces it. Job-level monitoring is the missing layer between "the service is up" and "the data is fresh." Most teams discover at least one job that has been failing for weeks.

Get Started · Talk to Sales
app.novaaiops.com / job-failures
  • Auto-discovered from runtimes
  • Per-job expected duration
  • Pages on streak failure
  • Owner always identified
What's Tracked

Status, duration, owner, downstream

For each job: last run status, last run duration vs expected, owner team, downstream consumers (which dashboards or services depend on this job's output), and the status streak across the last N runs. A streak failure (3+ consecutive failures) triggers a page; a one-off failure goes to the team channel (the routing rule is sketched after the list below).

  • Last run status + duration: success/fail and how long it took compared to baseline
  • Owner team: pulled from job metadata or container labels; team-routed pages
  • Downstream consumers: which dashboards / reports / services depend on this job's freshness
  • Streak alerting: 3+ consecutive failures pages on-call; one-offs notify the team channel
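A minimal sketch of that routing rule, assuming run statuses arrive newest-last. The thresholds come straight from this page; the function and status names are illustrative, not Nova AI Ops' actual API.

```python
from typing import List


def route_alert(statuses: List[str]) -> str:
    """statuses holds recent run results ("success"/"fail"), newest last."""
    if not statuses or statuses[-1] == "success":
        return "none"  # nothing to route when the latest run succeeded
    # Count consecutive failures ending at the most recent run.
    streak = 0
    for status in reversed(statuses):
        if status != "fail":
            break
        streak += 1
    # 3+ consecutive failures page on-call; anything less goes to the team channel.
    return "page_oncall" if streak >= 3 else "team_channel"


route_alert(["success", "fail", "fail", "fail"])  # -> "page_oncall"
route_alert(["success", "success", "fail"])       # -> "team_channel"
```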
app.novaaiops.com / job-failures · tracked
Auto-Discovery

Jobs find the page, not the other way around

The page does not require manually registering jobs. It auto-discovers from runtimes: Kubernetes CronJobs, Airflow DAGs, GitHub Actions schedules, AWS EventBridge rules, and Cloud Scheduler. Any new scheduled job appears within minutes. Manual registration is supported for cron entries on individual hosts (see the host-agent sketch after the list).

  • Five runtimes auto-detected: k8s CronJob, Airflow, GitHub Actions, EventBridge, Cloud Scheduler
  • Manual register for host cron: a small agent reports per-host cron status for legacy environments
  • Auto-tagged ownership: job ownership pulled from labels / annotations / repo CODEOWNERS
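For the host-cron case, a wrapper along these lines is roughly what a per-host agent would do: run the command, time it, and report the outcome. The endpoint URL and payload fields here are hypothetical placeholders, not Nova AI Ops' documented API.

```python
import json
import subprocess
import time
import urllib.request

REPORT_URL = "https://app.novaaiops.com/api/job-runs"  # hypothetical endpoint


def run_and_report(job_name: str, command: list, owner_team: str) -> int:
    """Run a cron command, then report status, duration, and owner."""
    start = time.monotonic()
    result = subprocess.run(command)
    payload = {
        "job": job_name,
        "owner": owner_team,
        "status": "success" if result.returncode == 0 else "fail",
        "duration_s": round(time.monotonic() - start, 2),
    }
    req = urllib.request.Request(
        REPORT_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # fire-and-forget status report
    return result.returncode
```

A crontab entry would then invoke the wrapper instead of the raw command, so every scheduled run reports itself.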
app.novaaiops.com / job-failures · discovery
Long-Tail Findings

Catch the silent-fail jobs

The most useful finding is usually the job that has been failing for weeks. The page surfaces "high streak failure" jobs: those whose latest 10 runs include 5+ failures. These rarely page anyone (single-failure thresholds let them through), but they eventually degrade something downstream. Surfacing the long tail is a one-time win for most teams (the flag logic is sketched below).

  • High streak failure flag: 5+ failures in last 10 runs surfaces the job for review
  • Cumulative downtime view: how many hours the job's downstream has been stale this month
  • Resurface for owner team: jobs in this state appear in the owner team's weekly report
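The flag itself is simple. A sketch under the thresholds stated above (5+ failures in the last 10 runs), with illustrative names:

```python
from typing import List


def is_high_streak_failure(last_runs: List[str], window: int = 10,
                           threshold: int = 5) -> bool:
    """last_runs holds statuses, newest last; True means surface the job for review."""
    recent = last_runs[-window:]
    return sum(1 for status in recent if status == "fail") >= threshold
```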
app.novaaiops.com / job-failures · long-tail
Auto-Remediation

For known-fail patterns, the agents can retry

Some failures are well-understood transients (a rate limit, a locked table, a network blip). For these, the agent fleet can auto-retry with backoff. Auto-retry is opt-in per job class and bounded (max 3 retries). Persistent failures still page on-call. Every auto-retry is logged in the Agent Ledger so the rerun chain is auditable (a sketch of the retry loop follows the list).

  • Per-class opt-in: retry policies are per job class, not global; teams choose where it makes sense
  • Bounded: max 3 retries with exponential backoff; persistent failure still pages
  • Auditable: every retry shows up in the agent ledger; nothing happens silently
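A sketch of the bounded retry policy with the limits given above. The class registry, ledger, and paging hooks are illustrative stand-ins for the agent fleet's internals, not its real interfaces.

```python
import time
from typing import Callable

RETRYABLE_CLASSES = {"rate_limit", "locked_table", "network_blip"}  # opt-in per job class
MAX_RETRIES = 3


def run_with_retry(job_class: str, run_job: Callable[[], bool],
                   ledger_log: Callable[[str], None],
                   page_oncall: Callable[[str], None]) -> bool:
    """Retry known-transient failures with backoff; page on persistent failure."""
    if run_job():
        return True
    if job_class not in RETRYABLE_CLASSES:
        page_oncall(f"{job_class}: failed; class not opted in to auto-retry")
        return False
    for attempt in range(1, MAX_RETRIES + 1):
        time.sleep(2 ** attempt)                          # exponential backoff: 2s, 4s, 8s
        ledger_log(f"retry {attempt} for {job_class}")    # every rerun is auditable
        if run_job():
            return True
    page_oncall(f"{job_class}: still failing after {MAX_RETRIES} retries")
    return False
```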
app.novaaiops.com / job-failures · retry
Video walkthrough coming soon

Subscribe to Nova AI Ops on YouTube for demos, tutorials, and feature deep-dives.

Find the silent-fail jobs

Job Failures surfaces what dashboards miss: the cron that has been failing every night for two weeks while everyone assumed the data was fresh.

Get Started · Request a Demo