AI & ML Advanced By Samson Tanimawo, PhD Published Sep 25, 2026 6 min read

Federated Learning: Training Without Data Movement

Send the model to the data instead of the data to the model. Federated learning is the architecture for training when data can’t leave its origin.

The core idea

Train models on data without moving the data. Each device (or organisation) trains locally on its own data; only model updates (gradients or weight deltas) are sent to a central aggregator. The aggregator combines updates into a global model. This is privacy through architecture: the raw data never leaves where it lives.

The structural promise. Sensitive data stays in place: medical records at the hospital, mobile keystrokes on the phone, internal documents at the enterprise. The model that emerges learned from all the data; no party ever saw the others' raw inputs.

The "no data movement" claim. This is not entirely true: gradients move, and gradients leak some information about the training data. The privacy improvement over centralised training is real but not absolute. Federated learning is a privacy improvement, not a privacy guarantee.

The application surface. Mobile keyboards (predictive text trained on your typing), healthcare (models trained across hospitals without sharing patient records), cross-bank fraud detection (models trained across institutions). Each domain has structural reasons centralised data collection is impossible.

The state in 2026. Federated learning is in production at Google (Gboard predictive text), Apple (some on-device personalisation), and several healthcare consortiums. It is mature for narrow use cases and still emerging for general LLM training.

Mechanics

The basic loop:

  1. Distribute: the global model is sent to participating clients.
  2. Local train: each client trains on its local data for a few steps.
  3. Upload: clients send their gradient (or weight delta) to the aggregator.
  4. Aggregate: the server combines updates (typically a weighted average) into a new global model.
  5. Repeat.

FedAvg is the standard aggregation algorithm. Variants address specific issues: FedProx for heterogeneous clients, FedAdam for adaptive learning rates on the server side.

The local-train step. Clients run a few SGD steps locally. Too few steps = high communication cost (lots of rounds). Too many = client drift (the local model wanders far from the global one). The sweet spot is often 1-5 local epochs per round of communication.

The aggregation step. Weighted average of client updates, weighted by data size at each client. Larger clients have more influence; smaller clients still contribute. Variants weight by trust, recency, or other signals depending on use case.
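
The loop and the size-weighted average described above can be sketched in a few lines. This is a toy illustration, not a production FedAvg: the model is a one-feature linear regression, the three clients and their data are synthetic, and the names local_train, fed_avg, and run_rounds are hypothetical helpers chosen for this example.

```python
import random

def local_train(weights, data, lr=0.05, epochs=2):
    """Run a few epochs of SGD on one client's data for the model y = w*x + b."""
    w, b = weights
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y   # prediction error under squared loss
            w -= lr * err * x       # gradient of 0.5*err**2 with respect to w
            b -= lr * err           # gradient of 0.5*err**2 with respect to b
    return [w, b]

def fed_avg(client_weights, client_sizes):
    """Average client models, weighted by each client's number of examples."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(cw[i] * n for cw, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]

def run_rounds(clients, rounds=50):
    """Repeat distribute -> local train -> upload -> aggregate."""
    global_w = [0.0, 0.0]
    for _ in range(rounds):
        # Each client starts from the current global model and trains locally.
        updates = [local_train(global_w, data) for data in clients]
        # The server only ever sees the trained weights, never the (x, y) pairs.
        global_w = fed_avg(updates, [len(data) for data in clients])
    return global_w

rng = random.Random(0)

def make_client(n, lo, hi):
    """Synthetic client whose inputs cover only part of the input range."""
    return [(x, 3.0 * x + 1.0) for x in (rng.uniform(lo, hi) for _ in range(n))]

# Three clients of different sizes with differently distributed inputs.
clients = [make_client(20, -1, 0), make_client(50, 0, 1), make_client(5, -1, 1)]
global_model = run_rounds(clients)
print(global_model)  # should approach the true parameters w=3, b=1
```

Raising epochs makes each round do more local work per message sent, which is the communication-versus-drift trade-off from the local-train step.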

The communication cost. Each round transmits model-sized messages between server and clients. For a 100M-parameter model stored as 32-bit floats, that's ~400MB per client per round in each direction. For mobile clients on cellular networks, this is non-trivial; communication-efficient variants (sparse updates, quantised gradients) reduce it by 10-100x.
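
The arithmetic behind that figure, plus one way to shrink it, can be worked through directly. The sketch assumes 32-bit floats, 1 MB = 10**6 bytes, and a top-k sparsifier that keeps 1% of entries and sends each as a 4-byte index plus a 4-byte value; top-k is only one of several sparse-update schemes, and the function names are hypothetical.

```python
def update_size_mb(num_values, bytes_per_value=4):
    """Size of an update in megabytes, taking 1 MB as 10**6 bytes."""
    return num_values * bytes_per_value / 1e6

def top_k_sparsify(update, k):
    """Keep the k largest-magnitude entries as (index, value) pairs."""
    ranked = sorted(range(len(update)), key=lambda i: abs(update[i]), reverse=True)
    return sorted((i, update[i]) for i in ranked[:k])

def densify(pairs, length):
    """Rebuild a dense update on the server, filling dropped entries with zero."""
    out = [0.0] * length
    for i, v in pairs:
        out[i] = v
    return out

# Dense update for a 100M-parameter model in float32.
dense_mb = update_size_mb(100_000_000)

# Sparse update: 1% of entries, each sent as index + value (8 bytes).
sparse_mb = update_size_mb(1_000_000, bytes_per_value=8)

print(dense_mb, sparse_mb, dense_mb / sparse_mb)

# Tiny demonstration of the sparsifier itself.
pairs = top_k_sparsify([0.1, -2.0, 0.05, 1.5], k=2)
print(pairs, densify(pairs, 4))
```

Under these assumptions the saving is 50x, inside the 10-100x range quoted above; the dropped entries are simply lost here, whereas practical schemes often accumulate them locally for later rounds.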

The straggler problem. Some clients are slow, offline, or drop out mid-round. Synchronous FedAvg waits for every selected client; asynchronous variants don't. Production systems use partial participation: each round picks a random subset of available clients.
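
Partial participation is simple to sketch. The example below assumes a hypothetical train_fn that returns None when a client drops out mid-round; sample_clients and run_round are illustrative names, not part of any particular framework.

```python
import random

def sample_clients(available, fraction, rng, min_clients=1):
    """Pick a random subset of the currently available clients."""
    k = max(min_clients, int(len(available) * fraction))
    k = min(k, len(available))
    return rng.sample(available, k)

def run_round(available, fraction, train_fn, rng):
    """Select clients, then keep only the updates that actually came back."""
    selected = sample_clients(available, fraction, rng)
    updates = {}
    for cid in selected:
        result = train_fn(cid)   # assumed to return None if the client dropped out
        if result is not None:
            updates[cid] = result
    return selected, updates

rng = random.Random(1)
available = list(range(100))

# Pretend that odd-numbered clients always drop out.
selected, updates = run_round(
    available, 0.1, lambda cid: None if cid % 2 else [0.0], rng
)
print(len(selected), sorted(updates))
```

A real system would also put a deadline on each round and often over-select, so that enough updates arrive even when some clients vanish.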

Privacy reality

Federated learning is more private than centralised training but not magic. Gradients carry information about training data: gradient inversion attacks have shown that gradients can sometimes be inverted to reconstruct training inputs. Combine federated learning with differential privacy and secure aggregation for stronger guarantees.

The gradient leakage. Research (DLG, iDLG attacks) shows gradients can leak training-data details for small batches. The attack succeeds best when batches are small and the model is small. Production federated learning with large batches and large models is harder to attack but not immune.
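
To see why small batches are the dangerous case, consider the simplest possible setting. This is not the DLG attack itself (which optimises dummy inputs to match observed gradients) but an analytic special case: a linear model with a bias term, squared loss, and a batch of one. There the weight gradient is the input scaled by the error, and the bias gradient is the error, so dividing one by the other returns the input exactly. Function names are hypothetical.

```python
def gradients_single_example(w, b, x, y):
    """Gradients of 0.5*(w.x + b - y)**2 for one training example."""
    pred = sum(wi * xi for wi, xi in zip(w, x)) + b
    err = pred - y
    grad_w = [err * xi for xi in x]   # dL/dw = err * x
    grad_b = err                      # dL/db = err
    return grad_w, grad_b

def reconstruct_input(grad_w, grad_b):
    """Recover x from the gradients alone (requires a non-zero bias gradient)."""
    if grad_b == 0:
        raise ValueError("bias gradient is zero; input cannot be recovered")
    return [g / grad_b for g in grad_w]

secret_x = [0.2, -1.5, 3.0]           # private input held by the client
grad_w, grad_b = gradients_single_example([0.1, 0.1, 0.1], 0.0, secret_x, 1.0)

# The server sees only grad_w and grad_b, yet gets the input back.
recovered = reconstruct_input(grad_w, grad_b)
print(recovered)
```

With larger batches the gradients of many examples are summed, so this exact division no longer works, which is consistent with attacks being harder but not impossible at scale.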

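The DP layer. Clip each client update to a fixed norm, then add Gaussian noise before sending. The clipping bounds how much any one client can contribute, which is what lets the noise be calibrated; differential privacy then bounds the per-client information leakage. It costs accuracy (the noise hurts learning) but adds a mathematical privacy guarantee on top of the architectural one.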
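
A minimal sketch of the client-side DP step. It assumes the common recipe of clipping each update to a fixed L2 norm before adding Gaussian noise, because noise can only be calibrated when each client's contribution is bounded. The parameter names clip_norm and noise_multiplier are illustrative, and the sketch does not compute the resulting privacy budget; that requires a separate privacy accountant.

```python
import math
import random

def clip_update(update, clip_norm):
    """Scale the update down so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(v * v for v in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in update]

def privatize(update, clip_norm, noise_multiplier, rng):
    """Clip, then add Gaussian noise with std = noise_multiplier * clip_norm."""
    clipped = clip_update(update, clip_norm)
    sigma = noise_multiplier * clip_norm
    return [v + rng.gauss(0.0, sigma) for v in clipped]

rng = random.Random(42)
raw = [3.0, 4.0]                       # L2 norm is 5
clipped = clip_update(raw, clip_norm=1.0)
noisy = privatize(raw, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
print(clipped, noisy)
```

A larger noise_multiplier gives stronger privacy and a noisier global model, which is the accuracy cost described above.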

The secure-aggregation layer. Cryptographic protocols (homomorphic encryption, secure multi-party computation) let the server compute the average without seeing individual client updates. Even an honest-but-curious server can't extract per-client gradients. Adds significant compute overhead.
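
One way to get the flavour of secure aggregation is pairwise additive masking: each pair of clients shares a random mask that one adds and the other subtracts, so the masks cancel in the sum. The toy below assumes integer (quantised) updates, pair seeds agreed out of band, and no client dropouts; it uses Python's random module, which is not cryptographically secure, so treat it as an illustration of the cancellation idea only.

```python
import random

MODULUS = 2**31 - 1   # all arithmetic is done modulo this value

def pairwise_mask(client_id, all_ids, length, seeds):
    """Sum of masks shared with every other client, signed so that pairs cancel."""
    mask = [0] * length
    for other in all_ids:
        if other == client_id:
            continue
        pair = (min(client_id, other), max(client_id, other))
        rng = random.Random(seeds[pair])            # both clients derive the same stream
        shared = [rng.randrange(MODULUS) for _ in range(length)]
        sign = 1 if client_id < other else -1       # lower id adds, higher id subtracts
        mask = [(m + sign * s) % MODULUS for m, s in zip(mask, shared)]
    return mask

def mask_update(update, mask):
    """What the client actually uploads."""
    return [(u + m) % MODULUS for u, m in zip(update, mask)]

def aggregate(masked_updates):
    """Server-side sum; the pairwise masks cancel and only the total remains."""
    return [sum(col) % MODULUS for col in zip(*masked_updates)]

ids = [0, 1, 2]
seeds = {(0, 1): 11, (0, 2): 22, (1, 2): 33}        # hypothetical pre-shared seeds
updates = {0: [1, 2], 1: [10, 20], 2: [100, 200]}

masked = [mask_update(updates[i], pairwise_mask(i, ids, 2, seeds)) for i in ids]
total = aggregate(masked)
print(masked[0], total)
```

The server sees only values that look random per client, yet recovers the exact sum. Handling clients that drop out after masking is where real protocols spend most of their complexity.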

The threat-model framing. Federated learning protects against centralised data collection. It does NOT, on its own, protect against a compromised aggregator, against participating clients colluding, or against gradient inversion. List the threats you care about; ensure your protocol addresses them; don't over-claim.

Where it fits

Privacy-regulated industries (healthcare, finance) where data movement isn't permitted. Cross-organisational training where organisations won't share raw data with each other. Mobile-device training (Gboard, Apple) where the data is naturally distributed. For pure LLM pretraining, federated learning isn't yet competitive: the communication overhead is too high.

The healthcare case. Hospitals can't share patient records under HIPAA/GDPR. Federated learning lets them collaboratively train diagnostic models. Real production examples: NVIDIA Clara Federated Learning for tumour detection across hospitals.

The mobile keyboard case. Gboard (Google) and QuickType (Apple) train next-word-prediction models on user typing without sending typing to servers. Each phone trains locally; aggregated updates improve the global model. The privacy story is structural, not just policy.

The cross-org fraud detection case. Banks have fraud patterns that look similar across institutions. Sharing transaction data is typically blocked by regulation. Federated learning lets banks build joint fraud models without sharing transactions; consortia like SWIFT have explored this.

The poor fit cases. Pure LLM pretraining at frontier scale needs centralised compute and data. Federated learning's communication overhead makes it 100-1000x slower than centralised for these scales. Federated approaches for LLMs work for fine-tuning, not pretraining.

Common antipatterns

Treating federated learning as automatic privacy. Gradients leak. Add DP and secure aggregation if privacy guarantees are part of the value proposition.

Synchronous training with high-variance clients. Stragglers dominate latency. Use partial participation or asynchronous aggregation.

Naive aggregation with adversarial clients. One malicious client can poison the global model with bad gradients. Use robust aggregation (median, trimmed mean) when client trust isn't perfect.
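
The two robust aggregators named above can be sketched as coordinate-wise operations. This is a minimal illustration with made-up updates; the function names are hypothetical, and neither aggregator defends against every poisoning strategy.

```python
import statistics

def coordinate_median(updates):
    """Take the median of each coordinate across clients."""
    return [statistics.median(col) for col in zip(*updates)]

def trimmed_mean(updates, trim):
    """Drop the `trim` smallest and largest values per coordinate, then average."""
    if 2 * trim >= len(updates):
        raise ValueError("trim removes every client")
    out = []
    for col in zip(*updates):
        kept = sorted(col)[trim:len(col) - trim]
        out.append(sum(kept) / len(kept))
    return out

honest = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [1.1, 0.9]]
poisoned = honest + [[100.0, -100.0]]      # one malicious client

plain_mean = [sum(col) / len(col) for col in zip(*poisoned)]
print(plain_mean)                          # dragged far from the honest values
print(coordinate_median(poisoned))
print(trimmed_mean(poisoned, trim=1))
```

The plain mean is pulled far away by the single outlier, while both robust variants stay near the honest clients' values.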

Ignoring communication cost. Federated learning's bottleneck is often network, not compute. Use compression and sparse updates.

What to do this week

Three moves. (1) For any privacy-regulated training scenario, evaluate whether federated learning is the right architecture. The alternative, getting data out of regulated environments, is often impossible. (2) If considering federated learning, model the communication cost. Production systems often need 10-1000x communication efficiency improvements; budget for that engineering. (3) Identify your specific threat model. "Privacy" is too vague; "what specifically must we prevent" determines the protocol.