DPO vs PPO vs SPIN for Alignment
Three algorithms for fine-tuning a chat model toward preferred behaviour. PPO trains a separate reward model; DPO skips it; SPIN needs no preference data at all.
The three algorithms
All three start with a base model and a goal: make it behave more like preferred responses. They differ in the training procedure.
PPO (Proximal Policy Optimization)
The classic RLHF algorithm, in two steps:
- Train a reward model on (prompt, chosen, rejected) triples. The reward model scores any response.
- Use PPO to train the policy (the chat model) to maximise reward, while staying close to the original model (KL constraint).
Strengths: well-understood, flexible (can handle complex reward signals), the reference algorithm.
Weaknesses: complex and unstable, and requires four models in memory simultaneously (policy, reference, reward, value). Expensive.
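The objective PPO optimises combines the reward model's score with a KL penalty that keeps the policy near the reference model. A minimal sketch of that shaped reward (function name and `beta` value are illustrative, not from any particular library):

```python
import math

def kl_shaped_reward(reward_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward that RLHF-PPO maximises: task reward minus a KL penalty.

    reward_score: scalar from the reward model for the full response.
    policy_logprobs / ref_logprobs: per-token log-probs of the sampled
    response under the policy and the frozen reference model.
    """
    # Monte Carlo KL estimate on the sampled tokens:
    # sum of log pi(token) - log pi_ref(token).
    kl = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    # Penalise drifting away from the reference model.
    return reward_score - beta * kl
```

The `beta` coefficient trades off reward-seeking against staying close to the original model; too low and the policy reward-hacks, too high and it barely moves.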
DPO (Direct Preference Optimization)
A 2023 algorithm that showed you can skip the reward model entirely. DPO reparameterises the reward in terms of the policy itself, yielding a simple classification loss computed directly on preference pairs: the model acts as its own implicit reward function.
Strengths: simpler than PPO, more stable, fewer models to coordinate, often matches or beats PPO on benchmarks.
Weaknesses: less flexible (can’t easily incorporate non-preference signals), more sensitive to data quality.
DPO has become the default for fine-tuning open-weight models. It’s simpler to set up and almost always works.
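The whole algorithm reduces to one loss per preference pair. A minimal sketch using summed sequence log-probs (argument names are illustrative; a real implementation would batch this over a dataset):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    pi_*:  log-prob of the chosen/rejected response under the policy.
    ref_*: log-prob of the same responses under the frozen reference model.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Logistic loss pushing the margin positive.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At a margin of zero the loss is log 2; as the policy learns to favour the chosen response (relative to the reference), the loss falls toward zero. No reward model, no sampling loop, no value network.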
SPIN (Self-Play Fine-Tuning)
A 2024 algorithm that doesn't require preference pairs at all. The model plays against itself: a frozen copy from the previous round generates responses, and the model is trained to prefer the gold-standard SFT responses over its own self-generated ones.
Strengths: requires only the SFT dataset (no separate preference data). Iterative: each round improves on the last.
Weaknesses: its ceiling is the SFT data; it can't teach behaviours the demonstrations don't contain. Less explored in production.
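Mechanically, each SPIN round looks like DPO where the "chosen" response is the SFT gold answer, the "rejected" response is self-generated, and the previous round's model plays the reference. A schematic sketch (all names hypothetical; `train_step` and `generate` stand in for real training and sampling code):

```python
import math

def spin_pair_loss(gold_lp, self_lp, prev_gold_lp, prev_self_lp, beta=0.1):
    """One SPIN training pair: prefer the SFT gold response over the
    model's own generation, relative to the previous round's model.

    gold_lp / self_lp: log-probs of the gold and self-generated responses
    under the model being trained.
    prev_*: the same log-probs under the previous round's frozen model.
    """
    margin = beta * ((gold_lp - prev_gold_lp) - (self_lp - prev_self_lp))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def spin_round(train_step, generate, sft_data):
    # One self-play round: the frozen previous-round model generates a
    # "rejected" response for each SFT prompt, then we train against gold.
    for prompt, gold in sft_data:
        rejected = generate(prompt)  # previous round's model output
        train_step(prompt, chosen=gold, rejected=rejected)
```

Each round the roles reset: the freshly trained model becomes the next round's generator and reference, so the "rejected" responses keep getting harder to distinguish from gold.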
Picking one
- Have preference pairs, want simplicity: DPO. The default.
- Have a complex reward signal (multi-objective, online): PPO. Worth the operational cost.
- Only have SFT data, no preferences: SPIN. Especially good for fine-tuning when you can’t afford preference labelling.
- At frontier scale: hybrid. SFT, then DPO, then targeted RLHF on specific failure modes.
For most teams: DPO is the right starting point in 2025. PPO is for specialised cases. SPIN is interesting but less battle-tested.