Differential Privacy in ML
Differential privacy gives a mathematical guarantee that bounds how much any one individual's data can affect the model's output. The cost is accuracy; the benefit is provable privacy.
What DP guarantees
Differential privacy is a mathematical framework that bounds how much any individual training example can affect the trained model. Formally: a training procedure that satisfies DP at level epsilon produces nearly the same distribution over models whether or not any specific example was in the training data. The "nearly" is quantified by epsilon.
The intuition. If your data is in the training set, DP guarantees that the model couldn't have learned much about you specifically. The model learned about the population; any one individual's data has bounded influence. Privacy as bounded influence, not "no information used".
The epsilon parameter (more below). Smaller epsilon = stronger privacy. Epsilon = 1 is "strong privacy"; epsilon = 10 is "moderate"; epsilon = 100 is essentially no protection. The choice of epsilon trades privacy for accuracy.
The mathematical form. Pr[M(D₁) ∈ S] ≤ e^ε × Pr[M(D₂) ∈ S] + δ, where M is the randomized training algorithm, D₁ and D₂ are any two datasets differing in one example, and S is any set of possible outputs. The output of model training is a probability distribution; DP bounds how much the distribution can shift when one example changes.
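To make the inequality concrete, here is a minimal sketch using one-bit randomized response, a textbook mechanism chosen for illustration because its output distribution can be written down exactly (it is not DP-SGD). The function names are hypothetical, and δ is taken to be 0. The two "datasets" are simply the two possible values of one person's bit.

```python
import math

def randomized_response_probs(true_bit: int, epsilon: float) -> dict:
    """Output distribution of one-bit randomized response.

    The true bit is reported with probability e^eps / (1 + e^eps)
    and flipped otherwise.
    """
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return {true_bit: p_truth, 1 - true_bit: 1.0 - p_truth}

def worst_case_ratio(epsilon: float) -> float:
    """Largest ratio Pr[M(D1) = s] / Pr[M(D2) = s] over both outputs s
    and both orderings of the two neighbouring inputs."""
    p0 = randomized_response_probs(0, epsilon)
    p1 = randomized_response_probs(1, epsilon)
    return max(max(p0[s] / p1[s], p1[s] / p0[s]) for s in (0, 1))

def satisfies_pure_dp(epsilon: float, tol: float = 1e-9) -> bool:
    """Check the DP inequality with delta = 0."""
    return worst_case_ratio(epsilon) <= math.exp(epsilon) + tol
```

For this mechanism the worst-case ratio equals e^ε exactly, so the bound is tight: the same mechanism would fail the check for any smaller epsilon.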
The composability property. Multiple DP queries compose: under basic composition, epsilon_total is at most the sum of the individual epsilons. This lets you bound the total privacy budget across many operations, but it also means complex pipelines exhaust budget fast. Budget management is core to applied DP.
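As a small worked example of the arithmetic, the sketch below sums the cost of a hypothetical pipeline under basic composition. The pipeline values are invented for illustration. Basic composition is an upper bound; tighter accounting methods exist and usually report a smaller total for long training runs.

```python
def compose_basic(steps):
    """Basic (sequential) composition: epsilons add, and so do deltas.

    `steps` is an iterable of (epsilon, delta) pairs, one per DP
    operation run on the same data.
    """
    total_eps = 0.0
    total_delta = 0.0
    for eps, delta in steps:
        total_eps += eps
        total_delta += delta
    return total_eps, total_delta

# Hypothetical pipeline: two exploratory queries, then one training run.
pipeline = [(0.5, 0.0), (0.5, 0.0), (2.0, 1e-6)]
total_eps, total_delta = compose_basic(pipeline)
```

Here the two cheap exploratory queries already account for a third of the total epsilon, which is why exploratory work on private data needs to be budgeted up front.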
The epsilon parameter
The privacy-accuracy dial. Smaller epsilon = stronger privacy = more noise = lower accuracy. Larger epsilon = weaker privacy = less noise = higher accuracy. Choosing epsilon is a policy decision, not a technical one: it's about how much privacy you're willing to trade for utility.
The numeric scales, as rough rules of thumb. Epsilon 0.1: very strong privacy, often a substantial accuracy hit. Epsilon 1.0: strong privacy, moderate accuracy hit (usually 5-20% on classification). Epsilon 10: moderate privacy, small accuracy hit (1-5%). Epsilon 100: little practical privacy, near-zero accuracy hit.
The choice principle. Pick the smallest epsilon that produces acceptable utility. Document the epsilon in any disclosure about model training. The numeric value is the real privacy claim; vague "we use DP" without epsilon is meaningless.
The delta parameter. DP also has a delta: loosely, the probability that the epsilon guarantee fails. It should be cryptographically small (1e-6 to 1e-8 is typical) and well below 1/n for a dataset of n examples. A larger delta is essentially no guarantee; pushing it much smaller adds overhead with little real benefit.
The "privacy budget" framing. Treat epsilon as a finite budget. Each query against the data spends some budget. Once it is spent, the dataset is "burned": additional queries can't be made privately. Budget management discipline is critical.
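One way to enforce that discipline is a ledger that refuses any request exceeding the remaining budget. The sketch below is a minimal illustration assuming basic composition (epsilons simply add); the class and method names are hypothetical and not taken from any DP library.

```python
class BudgetExhausted(Exception):
    """Raised when a request would exceed the remaining privacy budget."""

class PrivacyBudget:
    """Tracks epsilon spent on one dataset under basic composition."""

    def __init__(self, total_epsilon: float):
        self.total_epsilon = total_epsilon
        self.spent = 0.0
        self.ledger = []  # (label, epsilon) entries, kept for audit

    @property
    def remaining(self) -> float:
        return self.total_epsilon - self.spent

    def spend(self, label: str, epsilon: float) -> None:
        """Record a DP operation, or raise if it does not fit the budget."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining + 1e-12:
            raise BudgetExhausted(
                f"{label!r} needs {epsilon}, only {self.remaining} left"
            )
        self.spent += epsilon
        self.ledger.append((label, epsilon))
```

Calling `spend` before running each DP operation, rather than after, means an over-budget query fails before it touches the data.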
How DP-SGD works
DP-SGD is the workhorse algorithm for differentially private deep learning. The mechanics:
- Compute per-example gradients. Standard SGD averages gradients across a batch; DP-SGD computes them separately first.
- Clip per-example gradients. Bound each example's gradient norm to a fixed value C. Limits how much any one example can influence updates.
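- Add Gaussian noise. Add noise to the summed clipped gradients, with standard deviation proportional to C; the proportionality constant (the noise multiplier) is set by the desired epsilon.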
- Update weights. Use the clipped+noised average gradient as the update direction.
The clipping and noise together provide the DP guarantee. Clipping bounds individual influence; noise hides which individuals were in the batch.
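The four steps can be sketched in plain Python for a tiny logistic-regression model. This is an illustration under stated assumptions, not a production implementation: the function names and toy data are hypothetical, the whole dataset is used as the batch (real DP-SGD samples batches randomly), and no privacy accounting is done, so the code does not tell you which epsilon a given noise multiplier achieves.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def per_example_grad(w, x, y):
    """Gradient of the logistic loss for one example (y is 0 or 1)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(p - y) * xi for xi in x]

def clip(grad, clip_norm):
    """Scale grad down so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / (norm + 1e-12))
    return [g * scale for g in grad]

def dp_sgd_step(w, batch, lr, clip_norm, noise_multiplier, rng):
    # Steps 1 and 2: compute each example's gradient, then clip it.
    clipped = [clip(per_example_grad(w, x, y), clip_norm) for x, y in batch]
    # Sum the clipped gradients coordinate by coordinate.
    summed = [sum(col) for col in zip(*clipped)]
    # Step 3: add Gaussian noise with std = noise_multiplier * clip_norm.
    std = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, std) for s in summed]
    # Step 4: average over the batch and take a gradient step.
    return [wi - lr * g / len(batch) for wi, g in zip(w, noisy)]

# Toy data (made up for illustration): label 1 for positive-leaning inputs.
rng = random.Random(0)
data = [([1.0, 0.5], 1), ([-1.0, -0.3], 0), ([0.8, 1.2], 1), ([-0.7, -1.1], 0)]
weights = [0.0, 0.0]
for _ in range(50):
    weights = dp_sgd_step(weights, data, lr=0.5, clip_norm=1.0,
                          noise_multiplier=1.0, rng=rng)
```

Note that the noise is added to the sum before averaging, so its effect on the update shrinks as the batch grows, while the clipping threshold caps each example's contribution regardless of batch size.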
The compute cost. Per-example gradients are 5-20x more expensive than batched gradients in naive implementations. Specialised libraries (Opacus for PyTorch, JAX/Flax with grad transformations) bring overhead down to 1.5-3x. Plan for 2-3x compute increase vs non-DP training.
The hyperparameter sensitivity. DP-SGD has more hyperparameters than SGD: clip norm C, noise multiplier, batch size, sampling probability. Each interacts with the privacy budget. Tuning is harder than for non-private training and requires explicit experiments.
The implementation libraries. Opacus (PyTorch), TensorFlow Privacy, JAX with custom gradient transformations. All mature; pick based on your existing framework. Production deployments converge on Opacus or TF Privacy.
Accuracy cost
For modest datasets and modest privacy (epsilon ~1-10), DP-SGD typically loses 5-20% accuracy vs non-private training. For very strong privacy (epsilon < 1), accuracy losses can exceed 50% on small datasets. Larger datasets and pre-trained models reduce the cost: fine-tuning a pre-trained model with DP often loses only 1-5%.
The dataset-size effect. Large datasets give DP-SGD more signal; the per-example noise averages out across more examples. With 1M training examples, DP-SGD often loses <5%. With 10K examples, losses can be 30%+. The "more data dilutes the noise" effect is real and substantial.
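A back-of-the-envelope calculation shows where the dilution comes from. The sketch assumes the sampling rate, clip norm and noise multiplier are held fixed, so only the expected batch size changes with the dataset. The function name and the parameter values are hypothetical.

```python
def averaged_noise_std(dataset_size, sampling_rate, clip_norm, noise_multiplier):
    """Per-coordinate std of the Gaussian noise in the averaged gradient.

    Noise with std noise_multiplier * clip_norm is added to the summed
    clipped gradients, which are then divided by the expected batch size.
    """
    expected_batch = sampling_rate * dataset_size
    return noise_multiplier * clip_norm / expected_batch

small = averaged_noise_std(10_000, sampling_rate=0.01, clip_norm=1.0,
                           noise_multiplier=1.0)
large = averaged_noise_std(1_000_000, sampling_rate=0.01, clip_norm=1.0,
                           noise_multiplier=1.0)
```

Under these assumptions the noise in each update is about 100 times smaller with 1M examples than with 10K, while each clipped gradient still carries a signal of norm at most C.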
The pre-training effect. Models pre-trained without DP and fine-tuned with DP fare much better than models trained from scratch with DP. The pre-trained model has good representations; DP fine-tuning just adapts. Most production DP systems use this approach.
The model-architecture effect. Some architectures handle DP better than others. Larger models tend to have higher capacity to absorb noise. Specific architectural choices (group normalisation instead of batch norm) help DP training. Existing literature has accumulated specific recommendations.
The accuracy-privacy frontier. For a fixed task, you can plot accuracy vs epsilon. The frontier curve is task-specific and sensitive to dataset size. Build your own curve before deploying; literature numbers don't generalise.
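Building the curve amounts to a sweep over epsilon with several seeds per point, since DP training is noisy. The harness below is a minimal sketch: `train_and_eval` is a hypothetical callable you would supply, taking an epsilon and a seed and returning a metric such as validation accuracy.

```python
import statistics

def sweep_frontier(train_and_eval, epsilons, seeds=(0, 1, 2)):
    """Evaluate train_and_eval(epsilon, seed) over a grid.

    Returns a list of (epsilon, mean_metric, stdev_metric) tuples,
    sorted by epsilon, ready to plot as the accuracy-privacy frontier.
    """
    rows = []
    for eps in sorted(epsilons):
        scores = [train_and_eval(eps, seed) for seed in seeds]
        spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
        rows.append((eps, statistics.mean(scores), spread))
    return rows
```

Every run in the sweep is itself a training run, so if the sweep uses the private dataset it spends privacy budget too; running it on public or otherwise non-private data avoids that.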
Real uses
Apple's keyboard usage statistics (epsilon ~4 daily). Google's Chrome telemetry (epsilon in the low single digits). US Census 2020 (epsilon = 19 for some statistics). Increasingly, healthcare research datasets. DP is moving from research to production; the policy alignment (regulators favour mathematically proven privacy) drives adoption.
The Apple case. iPhone telemetry uses local DP (each device adds noise before sending). Apple sees privacy-preserved aggregates and can't recover individual device data. The privacy budget resets daily, so the cumulative epsilon over longer periods is larger than the daily figure. Has been in production for several years across iOS.
The Google case. Chrome reports usage statistics with DP. Google has used several flavours of differential privacy (RAPPOR, Prochlo). Production system; significant engineering investment.
The census case. US Census Bureau deployed DP for the 2020 Census. It was substantially controversial: the privacy benefit is real, but the accuracy cost was visible in the data. Sets precedent; future censuses likely follow.
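The local model can be illustrated with one-bit randomized response: each device randomizes its own bit, and the server debiases the aggregate. This is a textbook sketch of the local-DP idea, not Apple's actual mechanism, and all names and numbers are hypothetical.

```python
import math
import random

def randomize_bit(bit, epsilon, rng):
    """On-device step: keep the true bit with probability
    e^eps / (1 + e^eps), otherwise flip it."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if rng.random() < p_truth else 1 - bit

def estimate_fraction(reports, epsilon):
    """Server step: debias the observed fraction of 1s.

    With true fraction f and truth probability p, the expected observed
    fraction is (1 - p) + f * (2p - 1); solve that for f.
    """
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    observed = sum(reports) / len(reports)
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)

# Simulated population: 30% of devices hold a 1 (made-up numbers).
rng = random.Random(0)
true_bits = [1] * 3_000 + [0] * 7_000
reports = [randomize_bit(b, epsilon=1.0, rng=rng) for b in true_bits]
estimated = estimate_fraction(reports, epsilon=1.0)
```

The server never sees a trustworthy individual bit, yet the population estimate converges on the true fraction as the number of devices grows.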
The healthcare research direction. Multi-hospital research consortiums increasingly require DP for shared datasets. NIH and similar funders push for DP-protected research data. Expect this to be standard for healthcare AI by 2027.
The implementation maturity. DP libraries are production-ready. The remaining barriers are policy (deciding epsilon), engineering (rebuilding pipelines), and accuracy budgets. Teams that commit have working systems within months.
Common antipatterns
Claiming "we use DP" without disclosing epsilon. Without epsilon, the claim is meaningless. Always state the value.
Spending privacy budget on hyperparameter tuning. Each tuning run uses budget. Use synthetic data or held-out non-private subsets for tuning; reserve real DP budget for the final training.
DP training from scratch on small datasets. Accuracy is destroyed. Pre-train without DP; fine-tune with DP.
Ignoring the composition of DP queries. Privacy budget runs out; downstream applications break. Track total budget across all uses.
What to do this week
Three moves. (1) For any privacy claim about your ML system, write down the specific guarantee. If it's not DP-based, it's likely informal. (2) Run a DP-SGD experiment on a representative task. The accuracy vs epsilon curve for YOUR task is essential information. (3) If you're collecting data subject to privacy regulation, evaluate DP-protected reporting (epsilon ~5-10 typical) as a path to publishing aggregates without privacy review headaches.