AIOps Migration Guide
Migrating off Datadog, Splunk, or any incumbent AIOps platform: the dual-run pattern, the data-portability checklist, and a cutover script that survives contact with reality.
Why migrations fail
Most AIOps migrations fail in one of three ways. The first is data portability: dashboards, alert rules, and runbooks were built in the old platform's proprietary format and can't move. The team rebuilds from scratch, runs out of budget, and ends up running both platforms forever.
The second is the cutover gap. The team flips DNS or re-points agents on a Friday afternoon, discovers gaps in coverage on Monday, and spends a week fighting fires. The classic example is "we forgot the synthetic checks were only in the old platform": three weeks of customer-facing degradation that nobody noticed because the new platform didn't know to look.
The third is the contract overlap. The old vendor's annual contract has 7 months left and auto-renews. The team is paying for both platforms simultaneously for the next 14 months because the rebuild took longer than budget assumed. The combined cost destroys the migration ROI.
Each of these failures is preventable with discipline: the dual-run pattern, written portability requirements, and a cutover script that has been rehearsed at least once. None of it is glamorous; all of it is what separates a successful migration from a crisis.
The data-portability checklist
Before you sign the new vendor, audit what you'll lose if the old platform vanishes tomorrow. Specifically:
- Dashboards. Can you export them? Datadog exports as JSON; Grafana is portable; Splunk dashboards are mostly stuck. Estimate rebuild time at 30-60 minutes per dashboard (see the export sketch after this list).
- Alert rules. Can the rule definitions move? Most platforms support some form of YAML or JSON export, but the alerting semantics differ. Plan to re-validate every alert against historical data.
- Runbooks. Are these in the platform or in your wiki? If they're in the platform, you're rewriting them. Move them to a wiki anyway as part of the migration.
- Historical data. Most platforms don't let you export raw historical telemetry. Accept the loss; archive what you need; commit to building new history in the new platform.
- Custom metrics. Datadog's custom metrics use proprietary tags; OpenTelemetry standardises this. Migrate to OTel during or before the platform migration.
- Integrations. Every webhook, every API integration, every Slack channel routing. List them. Each one is a separate migration task.
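As a concrete starting point for the dashboards line item, here is a minimal export sketch against Datadog's v1 dashboards API. The output directory and the environment-variable names for the keys are placeholders to adapt; Splunk and other platforms will need their own variant.

```python
# Minimal sketch: dump every Datadog dashboard definition to JSON for the
# portability audit. Assumes Datadog's v1 dashboards API and API/app keys
# in environment variables (variable names are placeholders).
import json
import os
import pathlib

import requests

API = "https://api.datadoghq.com/api/v1"
HEADERS = {
    "DD-API-KEY": os.environ["DD_API_KEY"],
    "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
}

out_dir = pathlib.Path("exports/dashboards")  # hypothetical output directory
out_dir.mkdir(parents=True, exist_ok=True)

# List dashboards, then fetch and save each full definition.
listing = requests.get(f"{API}/dashboard", headers=HEADERS, timeout=30)
listing.raise_for_status()
dashboards = listing.json().get("dashboards", [])

for item in dashboards:
    detail = requests.get(f"{API}/dashboard/{item['id']}", headers=HEADERS, timeout=30)
    detail.raise_for_status()
    path = out_dir / f"{item['id']}.json"
    path.write_text(json.dumps(detail.json(), indent=2))
    print(f"exported {item.get('title', item['id'])} -> {path}")

print(f"{len(dashboards)} dashboards exported")
```

The JSON you get back is still Datadog's schema, so this is an audit artifact and a rebuild reference, not something you can import into the new platform as-is.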
The audit is uncomfortable because it reveals how much the team has invested in proprietary configuration. That's the lock-in tax made visible. The right time to do this audit is before the new contract is signed, while the old vendor still has reasons to be helpful.
The dual-run pattern
The single most important migration discipline. Run both platforms in parallel for at least 60 days, with both receiving full production telemetry, both alerting the same on-call team, both executing the same runbooks where possible.
The on-call team treats the new platform's alerts as primary and the old platform's alerts as a check. Any alert that fires on the old platform but not the new one is a coverage gap that must be fixed before cutover. Any alert that fires on the new but not the old one is either an improvement or a false positive, investigate either way.
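One way to make that comparison systematic is to diff the two platforms' fired-alert histories each week of the dual-run. The sketch below assumes each platform's alerts have been exported to JSON as simple (service, alert) records, which is a hypothetical shape; adapt the loaders to whatever your real exports look like.

```python
# Sketch of a weekly dual-run gap report. Assumes each platform's fired
# alerts are exported to JSON as a list of {"service": ..., "alert": ...}
# records (hypothetical shape and file paths).
import json
from pathlib import Path


def load_alert_keys(path: str) -> set[tuple[str, str]]:
    """Return the set of (service, alert) pairs that fired in the export."""
    records = json.loads(Path(path).read_text())
    return {(r["service"], r["alert"]) for r in records}


old_alerts = load_alert_keys("exports/old_platform_alerts.json")
new_alerts = load_alert_keys("exports/new_platform_alerts.json")

coverage_gaps = old_alerts - new_alerts   # fired on old only: must fix before cutover
new_only = new_alerts - old_alerts        # improvement or false positive: investigate

print(f"{len(coverage_gaps)} coverage gaps (old fired, new stayed silent):")
for service, alert in sorted(coverage_gaps):
    print(f"  GAP  {service}: {alert}")

print(f"{len(new_only)} alerts unique to the new platform (review each):")
for service, alert in sorted(new_only):
    print(f"  NEW  {service}: {alert}")
```

Run it before the weekly migration review so the gap list, not anecdotes, drives the agenda.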
Sixty days is the minimum because monthly incident patterns matter. A migration that runs only two weeks in dual-run misses the monthly batch jobs, the quarterly close, the seasonal traffic patterns. Customers who skip the dual-run period and cut over after a "clean two weeks" almost always discover a gap in week three.
The cost of dual-run is real: you're paying for both platforms during the overlap. Budget for it. The cost of a botched cutover is invariably higher.
Migration timeline
Realistic timeline for a 200-engineer organisation migrating between AIOps platforms.
Months 1-2: planning. Run the data-portability audit. Negotiate the new contract with overlap allowance. Get the new platform deployed in shadow mode (receiving telemetry, no alerting). Identify the migration squad: 1-2 SREs full-time plus a steering owner.
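If services already emit OpenTelemetry (see the custom-metrics item in the audit), shadow mode can be as simple as registering a second exporter so both platforms receive identical telemetry. A minimal sketch for traces is below; the endpoints are placeholders, and the same fan-out works in an OpenTelemetry Collector if you ship through one.

```python
# Minimal sketch: fan out the same traces to the old and new platforms during
# shadow mode by registering two OTLP exporters. Endpoints are placeholders;
# metrics and logs follow the same pattern.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))

# Existing pipeline: the incumbent platform keeps receiving everything.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="https://otlp.old-vendor.example:4317"))
)

# Shadow pipeline: the new platform receives the same spans, alerting disabled.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="https://otlp.new-vendor.example:4317"))
)

trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("shadow-mode-smoke-test"):
    pass  # both platforms should show this span
```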
Months 3-4: configuration migration. Rebuild dashboards, alert rules, and runbooks in the new platform. This is the bulk of the engineering work. Critical: rebuild from scratch using the new platform's idioms; don't try to port verbatim. The temptation to translate Datadog queries to PromQL line by line creates fragile, ugly configs.
Months 5-6: dual-run. Both platforms receive full production telemetry. The on-call team uses the new platform for triage and the old platform as a safety net. All gaps surfaced get fixed; no shortcuts. A weekly migration review covers progress, gaps, and regressions.
Month 7: cutover. The old platform becomes secondary; the new platform is the primary surface for all on-call activities. The old platform stays running for one more month as fallback.
Month 8: decommission. Cancel the old contract (timing the contract end-date to month 8 if possible). Remove agents. Keep the old historical data archived in cold storage if compliance requires.
The cutover script
The cutover itself is a 30-minute exercise if planned well, a 30-hour fire if not. Run it on a Tuesday or Wednesday morning, not Friday afternoon. Have both vendors' support engineers on-call during the window.
The cutover script has six steps.
1. Confirm the new platform's alerting is configured for all production services, with sign-off in writing from each service team.
2. Switch the on-call rotation in PagerDuty/Incident.io to route from the new platform. The old platform's alerts now go to a non-paging Slack channel for awareness only.
3. Update runbooks in the team wiki to reference the new platform's URLs.
4. Update the team's bookmarks, dashboards, and chatops integrations.
5. Run a synthetic incident drill: page someone, confirm they get the new alert and not the old one, and confirm the new platform's runbook executed.
6. Document the cutover completion with timestamps and sign-offs.
The "synthetic drill" step is the one most teams skip. Don't. The drill catches misconfigured routing in 20 minutes; the alternative is finding it during a real incident at 3 a.m.
When to roll back
Rollback authority belongs to the on-call lead, not the migration owner. The on-call lead's job is reliability; the migration owner's job is the migration. The roles conflict. The on-call lead must have explicit authority to revert to the old platform if reliability is degraded, no questions asked.
The rollback trigger is simple. If the new platform produces a customer-impacting incident the old platform would have caught, or the new platform misses a real incident due to a coverage gap, revert. Investigate, fix, retry. The migration loses a week; the customer relationship doesn't.
The temptation to push through a botched cutover because "the contract is signed" is what turns migrations into multi-quarter sagas. The right discipline is fast rollback, slow sign-off.
The old contract problem
Most migrations are kneecapped by the old vendor's contract structure. Annual contracts auto-renew; multi-year contracts have substantial cancellation fees. The right time to start migration planning is 9-12 months before the renewal date, not after the renewal letter arrives.
The leverage move. Tell the incumbent vendor 9 months before renewal that you're evaluating alternatives. Get their counter-offer in writing. Use the counter-offer as the floor for the new vendor's pricing. The incumbent will either drop their price (in which case you've saved money on the old platform) or fail to compete (in which case you have a clear migration runway).
The trap to avoid. Don't sign the new contract while still locked into the old one's auto-renewal. The legal teams will tell you the old contract terminates "at the end of the current term"; what they mean is "you'll pay for the next 12 months whether you use it or not." Time the new contract start to within 30 days of the old contract end-date, not before.