Job Failures tracks every scheduled job: cron, Airflow DAG, Kubernetes CronJob, batch ETL, scheduled report. When a job fails or runs over its expected duration, the page surfaces it. Job-level monitoring is the missing layer between "the service is up" and "the data is fresh." Most teams discover at least one job that has been failing for weeks.
For each job, the page shows: last run status, last run duration versus expected, owning team, downstream consumers (which dashboards or services depend on the job's output), and a last-N-runs status streak. A streak failure (3+ consecutive fails) triggers a page; a one-off failure goes to the team channel.
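A minimal sketch of that routing rule, under the thresholds named above. The run-history shape and the return-string convention are assumptions for illustration, not the product's actual API:

```python
from dataclasses import dataclass

STREAK_THRESHOLD = 3  # consecutive failures before paging on-call

@dataclass
class JobRun:
    job_id: str
    succeeded: bool

def consecutive_failures(runs: list[JobRun]) -> int:
    """Count failures at the tail of the run history (most recent run last)."""
    streak = 0
    for run in reversed(runs):
        if run.succeeded:
            break
        streak += 1
    return streak

def route_alert(job_id: str, runs: list[JobRun]) -> str:
    """Page on 3+ consecutive failures; send a one-off failure to the team channel."""
    streak = consecutive_failures(runs)
    if streak >= STREAK_THRESHOLD:
        return f"PAGE on-call: {job_id} has failed {streak} consecutive runs"
    if streak >= 1:
        return f"NOTIFY team channel: {job_id} failed its latest run"
    return "OK"
```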
The page does not require manually registering jobs. It auto-discovers from runtimes: Kubernetes CronJobs, Airflow DAGs, GitHub Actions schedules, AWS EventBridge rules, Cloud Scheduler. Any new scheduled job appears within minutes. Manual registration is supported for cron entries on individual hosts.
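As a sketch of what discovery from one of those runtimes might look like, here is a cluster-wide CronJob enumeration using the official `kubernetes` Python client. The dict shape is an assumption; the product's actual collector isn't shown in this section:

```python
from kubernetes import client, config

def discover_cronjobs() -> list[dict]:
    """Enumerate every Kubernetes CronJob across all namespaces."""
    config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
    batch = client.BatchV1Api()
    jobs = []
    for cj in batch.list_cron_job_for_all_namespaces().items:
        jobs.append({
            "name": cj.metadata.name,
            "namespace": cj.metadata.namespace,
            "schedule": cj.spec.schedule,
            "suspended": bool(cj.spec.suspend),
            "last_scheduled": cj.status.last_schedule_time,
        })
    return jobs
```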
The most useful finding is usually the job that has been failing for weeks. The page surfaces "high streak failure" jobs: jobs whose latest 10 runs include 5+ failures. These rarely page anyone, because intermittent failures never trip a consecutive-failure threshold, but they degrade something downstream eventually. Surfacing this long tail is a one-time win for most teams.
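The flagging rule, sketched under the thresholds stated above (5+ failures in the latest 10 runs):

```python
RECENT_WINDOW = 10   # runs examined
FLAG_THRESHOLD = 5   # failures within the window that mark a job as chronically failing

def is_high_streak_failure(run_history: list[bool]) -> bool:
    """run_history holds one success/failure flag per run, most recent last.

    Flags jobs whose latest 10 runs include 5+ failures -- the intermittent
    failers that never trip a consecutive-failure page.
    """
    recent = run_history[-RECENT_WINDOW:]
    failures = sum(1 for succeeded in recent if not succeeded)
    return failures >= FLAG_THRESHOLD

# A job alternating pass/fail never pages, but gets flagged here:
assert is_high_streak_failure([True, False] * 5)
```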
Some failures are well-understood transients (rate limit, locked table, network blip). For these, the agent fleet can auto-retry with backoff. Auto-retry is opt-in per job class and bounded (max 3 retries). Persistent failures still page on-call. Auto-retry is logged in Agent Ledger so the rerun chain is auditable.
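A sketch of the bounded retry loop, assuming a hypothetical `ledger_log()` hook standing in for the Agent Ledger write (the real sink and retry scheduler aren't shown in this section):

```python
import time
from typing import Callable

MAX_RETRIES = 3  # hard bound; persistent failures still page on-call

def ledger_log(event: str) -> None:
    """Hypothetical stand-in for the Agent Ledger write."""
    print(event)

def run_with_retry(job_id: str, run: Callable[[], None], base_delay: float = 30.0) -> bool:
    """Retry a well-understood transient failure with exponential backoff.

    Returns True on success; False once MAX_RETRIES are exhausted, at which
    point the failure escalates to on-call as usual.
    """
    for attempt in range(1 + MAX_RETRIES):  # initial run plus up to 3 retries
        try:
            run()
            ledger_log(f"{job_id}: succeeded on attempt {attempt + 1}")
            return True
        except Exception as exc:
            ledger_log(f"{job_id}: attempt {attempt + 1} failed: {exc}")
            if attempt < MAX_RETRIES:
                time.sleep(base_delay * (2 ** attempt))  # 30s, 60s, 120s
    ledger_log(f"{job_id}: exhausted {MAX_RETRIES} retries; paging on-call")
    return False
```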
Job Failures surfaces what dashboards miss: the cron that has been failing every night for two weeks while everyone assumed the data was fresh.