Causal Inference in ML
Correlation is what ML learns. Causation is what business decisions need. Bridging the two requires causal inference techniques most ML engineers never learned.
Why correlation isn't enough
ML is excellent at finding correlations. Sometimes the correlations are causal; sometimes not. For decisions where you'll act on the prediction (change pricing, intervene clinically, recommend a product), causal estimates are more useful than correlational ones. The "if we do X, what happens" question is causal; "X is associated with Y" is correlational. Many ML deployments confuse the two.
The classic confusion. Ice cream sales correlate with shark attacks because both rise in summer. A model that concludes "sell more ice cream and more shark attacks follow" fits the correlation and is causally absurd. The example is silly; the same pattern in production decisions is common and consequential.
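A small simulation makes the pattern concrete. The sketch below uses invented numbers, not real data: a hypothetical `temperature` variable drives both `ice_cream` sales and shark `attacks`, and neither affects the other. The raw correlation comes out clearly positive, and it drops to roughly zero once temperature is regressed out of both series.

```python
import random
from statistics import fmean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def residuals(ys, xs):
    """What is left of ys after a one-variable least-squares fit on xs."""
    mx, my = fmean(xs), fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]

def simulate(n=5000, seed=0):
    """Invented data: temperature is the only common cause."""
    rng = random.Random(seed)
    temperature = [rng.gauss(20, 8) for _ in range(n)]
    ice_cream = [2.0 * t + rng.gauss(0, 10) for t in temperature]
    attacks = [0.5 * t + rng.gauss(0, 5) for t in temperature]
    return temperature, ice_cream, attacks

temperature, ice_cream, attacks = simulate()
raw = pearson(ice_cream, attacks)
adjusted = pearson(residuals(ice_cream, temperature),
                   residuals(attacks, temperature))
print(f"raw correlation: {raw:.2f}")
print(f"after adjusting for temperature: {adjusted:.2f}")
```

The adjustment only works here because the confounder is observed and the relationships are linear by construction; production data rarely offers either guarantee.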
The decision framing. When you'll act on the prediction, you need causal estimates. "If we offer this customer a discount, will their LTV improve?" is causal; "customers who got discounts have higher LTV" is correlational and may reflect that you only offered discounts to good customers.
The selection-bias trap. ML models trained on observational data often learn selection patterns: "doctors order this test for sicker patients; the test is associated with worse outcomes; therefore the test causes worse outcomes". The conclusion is wrong: the test was ordered because the patient already showed bad indicators. Without causal inference methods, the model perpetuates the confusion.
The downstream impact. Decisions made on correlational ML often produce surprises in production. The intervention doesn't work as predicted; A/B tests reveal smaller effects than the model predicted. Causal inference produces estimates that hold up under intervention.
Causal techniques
The toolbox includes:
- Randomised controlled trials (A/B testing): the gold standard. Random assignment breaks the selection bias.
- Propensity score matching: adjust for observed confounders by matching treated and control units with similar predicted treatment probabilities.
- Instrumental variables: find variables that affect treatment and influence the outcome only through treatment; use them to identify causal effects.
- Regression discontinuity: exploit threshold-based assignment (eligibility cutoffs) to identify local causal effects.
- Difference-in-differences: compare changes over time between treated and untreated groups.
- Causal forests / orthogonal ML: modern methods that use ML to estimate heterogeneous treatment effects.
The A/B testing reality. The cleanest causal inference; the most expensive in time and traffic. When you can run an A/B test, do. When you can't (regulatory, ethical, scale), use the others.
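To see why random assignment matters, here is a minimal simulation under invented assumptions: a hypothetical `quality` score drives customer LTV, and the true discount effect is fixed at `TRUE_EFFECT`. When discounts go to better customers, the simple difference in means is badly inflated; when a coin flip decides, the same calculation lands near the true effect.

```python
import random
from statistics import fmean

TRUE_EFFECT = 5.0  # assumed LTV lift from a discount, chosen for the simulation

def simulate(n, randomised, seed=0):
    """Return (treated, ltv) pairs under targeted or randomised assignment."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        quality = rng.gauss(0, 1)  # hypothetical customer quality, drives LTV
        if randomised:
            treated = rng.random() < 0.5               # coin flip, as in an A/B test
        else:
            treated = quality + rng.gauss(0, 1) > 0    # good customers get discounts
        ltv = 100 + 20 * quality + TRUE_EFFECT * treated + rng.gauss(0, 5)
        rows.append((treated, ltv))
    return rows

def diff_in_means(rows):
    """Mean outcome of treated units minus mean outcome of untreated units."""
    treated = [y for t, y in rows if t]
    control = [y for t, y in rows if not t]
    return fmean(treated) - fmean(control)

observational = diff_in_means(simulate(20000, randomised=False))
ab_test = diff_in_means(simulate(20000, randomised=True))
print(f"targeted assignment: {observational:.1f}")
print(f"randomised assignment: {ab_test:.1f} (true effect {TRUE_EFFECT})")
```

The estimator is identical in both runs; only the assignment mechanism changes, which is the point of the A/B test.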
The propensity score case. Score units by treatment likelihood; match treated to similar-propensity untreated; compare outcomes. Adjusts for observed confounders. Vulnerable to unobserved confounders; check robustness.
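The score-match-compare steps can be sketched with the standard library alone. This is a simplified illustration, not a production recipe: the data is simulated, `severity` is a hypothetical discrete confounder, the propensity is estimated as the treated share within each covariate cell (real projects usually fit a logistic regression or similar), and matching is exact on the rounded score, which amounts to stratification.

```python
import random
from collections import defaultdict
from statistics import fmean

TRUE_EFFECT = 2.0  # assumed for the simulation

def simulate(n=20000, seed=1):
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        severity = rng.choice([0, 1, 2])          # observed confounder
        p_treat = [0.2, 0.5, 0.8][severity]       # sicker units are treated more often
        treated = rng.random() < p_treat
        outcome = 10 - 3 * severity + TRUE_EFFECT * treated + rng.gauss(0, 1)
        rows.append((severity, treated, outcome))
    return rows

def naive_difference(rows):
    treated = [y for _, t, y in rows if t]
    control = [y for _, t, y in rows if not t]
    return fmean(treated) - fmean(control)

def estimate_propensity(rows):
    """Estimated P(treated | severity): the treated share in each cell."""
    cells = defaultdict(list)
    for severity, treated, _ in rows:
        cells[severity].append(1.0 if treated else 0.0)
    return {severity: fmean(flags) for severity, flags in cells.items()}

def matched_att(rows, propensity, digits=2):
    """Effect on the treated: compare treated units with controls that
    share the same rounded propensity score."""
    strata = defaultdict(lambda: {True: [], False: []})
    for severity, treated, outcome in rows:
        strata[round(propensity[severity], digits)][treated].append(outcome)
    total, n_treated = 0.0, 0
    for groups in strata.values():
        if groups[True] and groups[False]:  # skip strata without overlap
            gap = fmean(groups[True]) - fmean(groups[False])
            total += len(groups[True]) * gap
            n_treated += len(groups[True])
    return total / n_treated

rows = simulate()
print(f"naive difference: {naive_difference(rows):.2f}")
print(f"matched estimate: {matched_att(rows, estimate_propensity(rows)):.2f}")
```

In this setup the naive difference even has the wrong sign, because treatment is concentrated among units with worse baseline outcomes. The matched estimate recovers the effect only because `severity` is the sole confounder and it is observed.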
The instrumental variable case. Find an "instrument": a variable that shifts treatment, affects the outcome only through treatment, and is unrelated to the unobserved confounders. Use the instrument to identify the causal effect even with unobserved confounders. Powerful when valid instruments exist; they are hard to find, and the validity assumptions cannot be fully tested from data.
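With a binary instrument, the simplest estimator is the Wald ratio: the instrument's effect on the outcome divided by its effect on treatment. The sketch below is a simulation under assumed numbers, where `z` stands for a hypothetical random encouragement and `u` for a confounder the analyst never sees.

```python
import random
from statistics import fmean

TRUE_EFFECT = 1.5  # assumed constant treatment effect for the simulation

def simulate(n=50000, seed=2):
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.random() < 0.5        # instrument: random, unrelated to u
        u = rng.gauss(0, 1)           # unobserved confounder
        treated = (1.5 * z + u + rng.gauss(0, 1)) > 0.5
        y = TRUE_EFFECT * treated + 2.0 * u + rng.gauss(0, 1)
        rows.append((z, treated, y))
    return rows

def naive_difference(rows):
    treated = [y for _, t, y in rows if t]
    control = [y for _, t, y in rows if not t]
    return fmean(treated) - fmean(control)

def wald_estimate(rows):
    """Instrument's effect on the outcome over its effect on treatment."""
    y_on = fmean([y for z, _, y in rows if z])
    y_off = fmean([y for z, _, y in rows if not z])
    t_on = fmean([float(t) for z, t, _ in rows if z])
    t_off = fmean([float(t) for z, t, _ in rows if not z])
    return (y_on - y_off) / (t_on - t_off)

rows = simulate()
print(f"naive difference: {naive_difference(rows):.2f}")
print(f"Wald IV estimate: {wald_estimate(rows):.2f} (true effect {TRUE_EFFECT})")
```

The estimate is only as good as the instrument. If `z` affected `y` through any path other than treatment, or if it barely moved treatment, the ratio would be biased or very noisy.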
The regression discontinuity case. Eligibility for treatment based on a threshold (test score > 80, BMI > 30). Compare units just above and just below threshold. Identifies the causal effect at the threshold; doesn't extrapolate. Clean when applicable.
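One common way to compare units just above and just below the threshold is to fit a line on each side within a bandwidth and measure the gap between the two fits at the cutoff. The sketch below assumes simulated data, a cutoff of 80 as in the test-score example, and an arbitrary bandwidth of 5; bandwidth choice is a real modelling decision that this example does not address.

```python
import random
from statistics import fmean

CUTOFF = 80.0
TRUE_JUMP = 4.0  # assumed effect at the threshold, for the simulation

def fit_line(xs, ys):
    """Least-squares (intercept, slope) for a single regressor."""
    mx, my = fmean(xs), fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

def simulate(n=20000, seed=3):
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        score = rng.uniform(50, 100)
        treated = score > CUTOFF            # assignment is fully set by the threshold
        outcome = 0.3 * score + TRUE_JUMP * treated + rng.gauss(0, 2)
        rows.append((score, outcome))
    return rows

def rd_estimate(rows, bandwidth=5.0):
    """Gap at the cutoff between local linear fits on each side."""
    left = [(s - CUTOFF, y) for s, y in rows if CUTOFF - bandwidth <= s <= CUTOFF]
    right = [(s - CUTOFF, y) for s, y in rows if CUTOFF < s <= CUTOFF + bandwidth]
    left_at_cutoff, _ = fit_line(*zip(*left))     # intercept = fitted value at cutoff
    right_at_cutoff, _ = fit_line(*zip(*right))
    return right_at_cutoff - left_at_cutoff

print(f"estimated jump at the cutoff: {rd_estimate(simulate()):.2f}")
```

Centring the score at the cutoff makes each intercept the fitted value at the threshold, so the estimate is local by construction and says nothing about units far from 80.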
The difference-in-differences case. Pre/post comparison plus treated/untreated comparison. The "double difference" controls for time trends and group differences. Requires the parallel-trends assumption: absent treatment, both groups would have followed the same trend. Check it against pre-treatment data.
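The double difference is simple arithmetic on four group means. The numbers below are hypothetical (say, weekly orders per store before and after a change), and `average_change` is an illustrative helper for a crude pre-period trend comparison, not a formal test of parallel trends.

```python
def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the control group."""
    return (treated_post - treated_pre) - (control_post - control_pre)

def average_change(series):
    """Mean period-to-period change across a series of group means."""
    return (series[-1] - series[0]) / (len(series) - 1)

# Hypothetical group means.
estimate = diff_in_diff(treated_pre=100.0, treated_post=130.0,
                        control_pre=90.0, control_post=105.0)
print(estimate)  # 15.0: treated rose by 30, control by 15

# Crude parallel-trends check on hypothetical pre-treatment periods.
treated_pre_periods = [96.0, 98.0, 100.0]
control_pre_periods = [86.0, 88.0, 90.0]
trend_gap = average_change(treated_pre_periods) - average_change(control_pre_periods)
print(trend_gap)  # 0.0: both groups were rising by 2 per period
```

The groups start at different levels (100 versus 90), which is fine; the method needs similar trends, not similar levels.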
The causal-ML case. Machine-learning-based estimation of heterogeneous treatment effects. Find which units benefit most from treatment. Production-grade tools exist (DoWhy, EconML), and tech companies increasingly apply them for personalisation.
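The idea of a heterogeneous effect can be shown with a much simpler stand-in than a causal forest: a linear "T-learner", which fits one outcome model per arm and takes the gap between their predictions. Everything here is an assumption for illustration: the data is simulated with randomised treatment, `tenure` is a hypothetical covariate, and the effect is built to shrink as tenure grows.

```python
import random
from statistics import fmean

def fit_line(xs, ys):
    """Least-squares (intercept, slope) for a single regressor."""
    mx, my = fmean(xs), fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

def simulate(n=20000, seed=4):
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        tenure = rng.uniform(0, 10)       # hypothetical covariate, in years
        treated = rng.random() < 0.5      # randomised, so no confounding here
        effect = 3.0 - 0.4 * tenure       # assumed: newer customers benefit more
        outcome = 20 + 1.0 * tenure + effect * treated + rng.gauss(0, 2)
        rows.append((tenure, treated, outcome))
    return rows

def t_learner(rows):
    """Fit one model per arm; the estimated effect at x is the gap
    between the treated-arm and control-arm predictions."""
    treated = [(x, y) for x, t, y in rows if t]
    control = [(x, y) for x, t, y in rows if not t]
    a1, b1 = fit_line(*zip(*treated))
    a0, b0 = fit_line(*zip(*control))
    return lambda x: (a1 + b1 * x) - (a0 + b0 * x)

cate = t_learner(simulate())
print(f"estimated effect at tenure 1: {cate(1.0):.2f}")
print(f"estimated effect at tenure 9: {cate(9.0):.2f}")
```

In the simulation the effect changes sign with tenure, so an average effect would hide who gains and who loses. Libraries such as EconML replace the two linear fits with flexible learners and add safeguards for observational data.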
Tools
Modern ecosystem:
- DoWhy (started at Microsoft, now maintained under PyWhy): causal inference framework with explicit identification and estimation phases. Pythonic; pedagogical.
- EconML (started at Microsoft, now maintained under PyWhy): econometrics-flavoured ML for treatment-effect estimation. Strong heterogeneous-effect support.
- CausalML (Uber): uplift modelling and personalisation. Production-grade.
- Pyro / NumPyro: probabilistic programming; build custom causal models.
- Stata, R packages: economics-flavoured tools; mature; sometimes preferred for academic-quality work.
The DoWhy case. Forces you to articulate causal assumptions: graph the causal model; pick identification strategy; estimate; refute. The discipline catches mistakes that ad-hoc analysis misses.
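The four-step discipline can be imitated in plain Python to show what each step contributes. The sketch below does not call DoWhy and does not mirror its API; the graph, the variable names (`segment`, `discount`, `ltv`) and the data are all invented, and the identification step is a shortcut that is only valid for a graph this simple.

```python
import random
from collections import defaultdict
from statistics import fmean

# Step 1, model: the assumed causal graph, written as child -> parents.
GRAPH = {"segment": [], "discount": ["segment"], "ltv": ["segment", "discount"]}

def common_causes(graph, treatment, outcome):
    """Step 2, identify: shared direct parents of treatment and outcome.
    A sufficient adjustment set only for simple graphs like this one."""
    return sorted(set(graph[treatment]) & set(graph[outcome]))

def simulate(n=20000, seed=5):
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        segment = rng.choice([0, 1])
        discount = rng.random() < (0.7 if segment else 0.3)
        ltv = 50 + 30 * segment + 4.0 * discount + rng.gauss(0, 5)
        rows.append({"segment": segment, "discount": discount, "ltv": ltv})
    return rows

def adjusted_effect(rows, treatment, outcome, adjust_for):
    """Step 3, estimate: within-stratum differences, weighted by stratum size."""
    strata = defaultdict(lambda: {True: [], False: []})
    for r in rows:
        key = tuple(r[name] for name in adjust_for)
        strata[key][bool(r[treatment])].append(r[outcome])
    total = 0.0
    for groups in strata.values():
        size = len(groups[True]) + len(groups[False])
        total += size * (fmean(groups[True]) - fmean(groups[False]))
    return total / len(rows)

def placebo_refutation(rows, treatment, outcome, adjust_for, seed=0):
    """Step 4, refute: swap in a random placebo treatment; a sound
    pipeline should then report an effect near zero."""
    rng = random.Random(seed)
    placebo = [dict(r, **{treatment: rng.random() < 0.5}) for r in rows]
    return adjusted_effect(placebo, treatment, outcome, adjust_for)

rows = simulate()
adjust_for = common_causes(GRAPH, "discount", "ltv")
print("adjust for:", adjust_for)
print(f"estimate: {adjusted_effect(rows, 'discount', 'ltv', adjust_for):.2f}")
print(f"placebo:  {placebo_refutation(rows, 'discount', 'ltv', adjust_for):.2f}")
```

Writing the graph down first is what makes the adjustment set an explicit, reviewable assumption rather than an accident of which columns were available.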
The EconML case. Strong heterogeneous treatment effect estimation. Useful for personalisation: which customers benefit from this intervention. Combined with DoWhy, it forms a production-grade stack.
The CausalML case. Uber's production stack. Uplift modelling: predict which units will respond to treatment. Used for personalisation, retention campaigns, intervention targeting. The engineering is polished.
The probabilistic-programming case. Pyro and NumPyro let you express any custom causal model. Maximum flexibility; steepest learning curve. Reserved for the cases where standard frameworks don't fit.
The discipline-mixing reality. Causal inference is part economics, part statistics, part ML. Tools reflect this. Pick by your team's existing language fluency; tools matter less than methodology.
Where it pays
Decisions where you'll act on the model's recommendation. Medical interventions. Pricing decisions. Marketing campaigns. Personalisation. Public-policy applications. Any case where "if we do X, Y will happen" is the question. The ROI of causal inference is in better decisions, not better predictions.
The medical case. "Should we administer drug X to patient Y?" Both the question and the decision are causal; correlational ML can produce confidently wrong answers. Causal inference is essential.
The pricing case. "If we raise prices 10%, what happens to revenue?" The price-elasticity question is causal. Observational ML often gets it wrong because price changes correlate with other factors (competitor changes, product improvements). Causal methods produce more trustworthy elasticities, provided their assumptions hold.
The marketing case. "If we send this email, will the customer convert?" Correlational answer: customers who received emails converted more (selection bias: you sent emails to good leads). Causal answer: conversion lift from sending. The two can differ enormously.
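A back-of-the-envelope comparison shows how far the two answers can drift apart. The counts below are hypothetical: the first call compares emailed good leads with everyone else, the second compares a randomised send group with a randomised holdout.

```python
def conversion_lift(sent_conversions, sent_total, other_conversions, other_total):
    """Conversion rate of the emailed group minus that of the comparison group."""
    return sent_conversions / sent_total - other_conversions / other_total

# Hypothetical counts: emails went to hand-picked good leads.
targeted = conversion_lift(600, 5000, 200, 5000)
# Hypothetical counts: a coin flip decided who got the email.
randomised = conversion_lift(330, 5000, 300, 5000)

print(f"targeted send vs. everyone else: {targeted:.1%}")    # 8.0%
print(f"randomised send vs. holdout:     {randomised:.1%}")  # 0.6%
```

Only the second number estimates what sending the email does; the first mostly measures how well the leads were picked.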
The personalisation case. "Which customers respond best to feature X?" Heterogeneous treatment effect estimation. Production tools (CausalML, EconML) make this tractable. Used for retention, recommendation, intervention targeting.
The public-policy case. Effectiveness of interventions: education programs, training, regulations. Causal inference is the standard methodology in policy research. ML extensions make analysis more powerful.
Common antipatterns
Predicting when the question is about intervention. Predictions answer "what will happen"; interventions answer "if we do X, what happens". Different questions, different methods.
Skipping the causal graph. Articulating assumptions is the discipline that catches mistakes. Draw the DAG.
Hidden confounders without robustness checks. Always test sensitivity to potential unobserved confounders.
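One widely used sensitivity summary, offered here as an option rather than the only approach, is the E-value of VanderWeele and Ding. For an observed risk ratio, it gives the minimum strength of association (on the risk-ratio scale) that an unobserved confounder would need with both treatment and outcome to fully explain the estimate away. The function name and the example value are illustrative.

```python
import math

def e_value(risk_ratio):
    """E-value for an observed risk ratio: RR + sqrt(RR * (RR - 1)),
    after inverting ratios below 1 so the formula applies."""
    if risk_ratio <= 0:
        raise ValueError("risk ratio must be positive")
    rr = risk_ratio if risk_ratio >= 1 else 1 / risk_ratio
    return rr + math.sqrt(rr * (rr - 1))

# Hypothetical observed risk ratio of 2.0.
print(f"{e_value(2.0):.2f}")  # 3.41
```

A large E-value means only a strong hidden confounder could overturn the finding; a value close to 1 means a weak one would be enough.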
Using correlational ML predictions for treatment decisions. Bias is likely; its size and direction are unknown without causal estimation.
What to do this week
Three moves. (1) For one decision your ML model influences, articulate whether the underlying question is causal or correlational. The answer guides methodology. (2) For one important causal question, plan an A/B test if feasible; if not, use observational methods explicitly designed for causal estimation. (3) Train your team on basic causal inference. The skill transfers to many problems; current ML training rarely covers it.