Build Agent Rotation
Long-running agents accumulate state.
Ephemeral
The single biggest reliability win in CI is making every build runner ephemeral. A runner that exists only for the duration of one job, then disappears, cannot accumulate state, cannot be poisoned by a previous build, cannot drift from its image, and cannot serve as the host for a quiet credential leak. Every problem caused by long-lived build infrastructure goes away when the runner's lifetime is shorter than its blast radius.
What ephemeral runners actually buy you:
- Reproducibility by default: A fresh runner starts from a known-good image with no leftover artifacts, no cached secrets, no half-installed dependencies. The same job run on Monday and on Friday hits the same starting state, which is what makes "works on CI" mean something.
- Security by isolation: A compromised job cannot poison the next job because there is no next job on the same host. Token theft, malicious dependencies, and supply-chain probes all hit a wall the moment the runner shuts down.
- No mystery flakes: Most flaky tests on persistent runners are not flaky tests at all. They are real tests catching real state pollution from a previous job. Ephemeral runners eliminate that whole class of false signal.
- Capacity that tracks demand: Spin up runners only when there is work, and tear them down the second the job finishes. Your bill tracks real usage instead of covering an idle pool 22 hours a day.
The cost of ephemeral runners is roughly 30 to 90 seconds of cold start per job, which is real but bounded. The benefit is a CI system that does not require a dedicated SRE to keep it healthy.
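The lifecycle above, provision fresh, run exactly one job, destroy, can be sketched in a few lines. This is an illustrative model, not a provider API; the names (`Runner`, `provision`, `run_job`) are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Runner:
    image: str
    state: dict = field(default_factory=dict)  # starts empty: no leftover artifacts
    alive: bool = True

def provision(image: str) -> Runner:
    # Every job gets a fresh runner built from the known-good image.
    return Runner(image=image)

def run_job(job, image: str = "runner-image:pinned") -> object:
    runner = provision(image)      # cold start: the 30-90 second cost
    try:
        return job(runner)         # the job may dirty runner.state freely...
    finally:
        runner.alive = False       # ...because the runner never outlives it
        runner.state.clear()

# Two consecutive jobs cannot observe each other's state:
first = run_job(lambda r: r.state.setdefault("marker", "job-1"))
second = run_job(lambda r: r.state.setdefault("marker", "job-2"))
assert first == "job-1" and second == "job-2"
```

The point of the sketch is the `finally` block: teardown is unconditional, so a failed or malicious job still leaves nothing behind for the next one.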
Hosted runners
If you use GitHub Actions hosted runners, GitLab SaaS shared runners, or CircleCI cloud executors, your runners are already ephemeral by design. Each job lands on a fresh container or VM, runs to completion, and is destroyed. There is nothing to rotate, nothing to patch, nothing to keep clean. The provider does it.
- No rotation needed: The platform reaps and recreates runners on every job, so there is no long-lived runner state to rotate. The work is already done for you.
- Image management is theirs: Patching the OS, updating language toolchains, and rotating dependencies are all handled upstream in the runner image. Your only job is to pin which image version you use, so you control when you take an upgrade.
- Trust boundary is the platform: The downside is that you have to trust the hosted infrastructure with your secrets, your code, and your build-time dependencies. For most teams that trade-off is fine. For regulated workloads (HIPAA, SOC 2 with strict data residency, classified systems) it usually is not.
If you can use hosted runners and your security team signs off, do it. The operational overhead is essentially zero compared to self-hosted.
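Image pinning is the one piece of the hosted story you still own. A minimal sketch, assuming GitHub Actions: name a specific runner image label instead of a floating one, so upgrades happen when you choose.

```yaml
jobs:
  build:
    # Pinned image version: you decide when to take the next OS/toolchain upgrade.
    runs-on: ubuntu-24.04   # not ubuntu-latest, which moves underneath you
```

The same idea applies on GitLab SaaS and CircleCI cloud: reference an explicit image or executor version rather than whatever "latest" currently resolves to.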
Self-hosted
Self-hosted runners are necessary when you need network access to private resources (a VPC, an on-prem database, a regulated data plane) or when the cost math at high build volume tips against hosted. The price is that you are now responsible for all the runner hygiene the hosted provider gave you for free.
- Rotate on a schedule: Give every runner host a maximum lifetime, ideally measured in jobs (e.g., reap after 50 jobs) or hours (reap after 8 hours), whichever comes first. Past that, state has had time to drift far enough from the baseline that test results become unreliable.
- Replace with a fresh image, don't patch in place: When you reap a runner, do not restart it. Spin up a new VM or container from the latest known-good image. Patching long-lived hosts in place is how you end up with snowflake runners that nobody can reproduce.
- Watch for the canary symptoms: Stale state announces itself as tests that pass on a fresh checkout but fail in CI, builds that get slower over time, intermittent permission errors, mysteriously full disks, and certificate trust failures. Each of these is a signal that the rotation cadence is too slow.
- Pre-warm the pool, not the host: Keep N runners warm and ready to take work, but rotate them out of the pool after each job. The pool size handles latency; the rotation handles cleanliness. Conflating the two is the most common self-hosted mistake.
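The reap policy above is small enough to show in full. A minimal sketch, using the example thresholds from the text (50 jobs or 8 hours, whichever comes first); the `RunnerRecord` name and the thresholds are illustrative, not defaults from any particular runner manager.

```python
MAX_JOBS = 50                 # example job cap from the text
MAX_AGE_SECONDS = 8 * 3600    # example age cap from the text

class RunnerRecord:
    """Tracks one self-hosted runner's age and job count for reaping."""

    def __init__(self, started_at: float):
        self.started_at = started_at
        self.jobs_run = 0

    def should_reap(self, now: float) -> bool:
        # Reap on whichever limit is hit first: job count or wall-clock age.
        return (self.jobs_run >= MAX_JOBS
                or now - self.started_at >= MAX_AGE_SECONDS)

# Hit the job cap after only one hour of life:
busy = RunnerRecord(started_at=0.0)
busy.jobs_run = 50
assert busy.should_reap(now=3600)

# Aged out despite running very few jobs:
idle = RunnerRecord(started_at=0.0)
idle.jobs_run = 3
assert idle.should_reap(now=9 * 3600)
assert not idle.should_reap(now=3600)   # young and under the cap: keep it
```

When `should_reap` fires, the replacement step is the part that matters: terminate the host and provision a new one from the image, never patch the old one back into the pool.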
Treat self-hosted runners like cattle, not pets. Nova AI Ops watches build runner health (job duration drift, error rate by host, age in pool) and pages when a self-hosted runner shows the canary symptoms, before it starts producing false test failures and burning engineer-hours on the resulting flake hunts.
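One of those canary symptoms, builds getting slower over time, reduces to a simple check you can run yourself against per-host job durations. A minimal sketch; the window sizes and the 1.5x threshold are assumptions for illustration, not tuned recommendations.

```python
from statistics import median

def duration_drifting(durations: list[float],
                      baseline_n: int = 20,
                      recent_n: int = 5,
                      factor: float = 1.5) -> bool:
    """Flag a host whose recent job durations have drifted above baseline."""
    if len(durations) < baseline_n + recent_n:
        return False                       # not enough history to judge
    baseline = median(durations[:baseline_n])   # first jobs on the fresh host
    recent = median(durations[-recent_n:])      # latest jobs on the same host
    return recent > factor * baseline           # drifting: rotate before flakes start

steady = [60.0] * 30
slowing = [60.0] * 20 + [70, 80, 95, 110, 120, 130, 140, 150, 160, 170]
assert not duration_drifting(steady)
assert duration_drifting(slowing)
```

A host that trips this check is a rotation candidate even if its reap thresholds have not fired yet; the drift is the state pollution announcing itself early.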