DPO vs PPO vs SPIN for Alignment
Three algorithms for fine-tuning a chat model toward preferred behaviour. PPO trains a separate reward model; DPO skips it; SPIN needs no preference data at all.
The three algorithms
All three start with a base model and a goal: make it behave more like preferred responses. They differ in the training procedure.
PPO (Proximal Policy Optimization)
The classic RLHF algorithm, in two steps:
- Train a reward model on (prompt, chosen, rejected) triples. The reward model scores any response.
- Use PPO to train the policy (the chat model) to maximise reward, while staying close to the original model (KL constraint).
Strengths: well-understood, flexible (can handle complex reward signals), the reference algorithm.
Weaknesses: complex and unstable, and requires four models in memory simultaneously (policy, reference, reward, value). Expensive.
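The objective PPO optimises combines the reward model's score with a KL penalty that keeps the policy near the reference model. A minimal sketch of that shaped reward (function name and `beta` value are illustrative, not from any particular library):

```python
import math

def kl_shaped_reward(reward_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward that RLHF-PPO maximises: task reward minus a KL penalty.

    reward_score: scalar from the reward model for the full response.
    policy_logprobs / ref_logprobs: per-token log-probs of the sampled
    response under the policy and the frozen reference model.
    """
    # Monte Carlo KL estimate on the sampled tokens:
    # sum of log pi(token) - log pi_ref(token).
    kl = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    # Penalise drifting away from the reference model.
    return reward_score - beta * kl
```

The `beta` coefficient trades off reward-seeking against staying close to the original model; too low and the policy reward-hacks, too high and it barely moves.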
DPO (Direct Preference Optimization)
A 2023 algorithm that showed you can skip the reward model entirely. DPO reparameterises the reward in terms of the policy itself, yielding a simple classification loss computed directly on preference pairs: the model acts as its own implicit reward function.
Strengths: simpler than PPO, more stable, fewer models to coordinate, often matches or beats PPO on benchmarks.
Weaknesses: less flexible (can’t easily incorporate non-preference signals), more sensitive to data quality.
DPO has become the default for fine-tuning open-weight models. It’s simpler to set up and almost always works.
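The whole algorithm reduces to one loss per preference pair. A minimal sketch using summed sequence log-probs (argument names are illustrative; a real implementation would batch this over a dataset):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair.

    pi_*:  log-prob of the chosen/rejected response under the policy.
    ref_*: log-prob of the same responses under the frozen reference model.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Logistic loss pushing the margin positive.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At a margin of zero the loss is log 2; as the policy learns to favour the chosen response (relative to the reference), the loss falls toward zero. No reward model, no sampling loop, no value network.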
SPIN (Self-Play Fine-Tuning)
A 2024 algorithm that doesn't require preference pairs at all. The model plays against itself: a frozen copy from the previous round generates responses, and the model is trained to prefer the gold-standard SFT responses over its own self-generated ones.
Strengths: requires only the SFT dataset (no separate preference data). Iterative: each round improves on the last.
Weaknesses: its ceiling is the SFT data; it can't teach behaviours the demonstrations don't contain. Less explored in production.
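Mechanically, each SPIN round looks like DPO where the "chosen" response is the SFT gold answer, the "rejected" response is self-generated, and the previous round's model plays the reference. A schematic sketch (all names hypothetical; `train_step` and `generate` stand in for real training and sampling code):

```python
import math

def spin_pair_loss(gold_lp, self_lp, prev_gold_lp, prev_self_lp, beta=0.1):
    """One SPIN training pair: prefer the SFT gold response over the
    model's own generation, relative to the previous round's model.

    gold_lp / self_lp: log-probs of the gold and self-generated responses
    under the model being trained.
    prev_*: the same log-probs under the previous round's frozen model.
    """
    margin = beta * ((gold_lp - prev_gold_lp) - (self_lp - prev_self_lp))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def spin_round(train_step, generate, sft_data):
    # One self-play round: the frozen previous-round model generates a
    # "rejected" response for each SFT prompt, then we train against gold.
    for prompt, gold in sft_data:
        rejected = generate(prompt)  # previous round's model output
        train_step(prompt, chosen=gold, rejected=rejected)
```

Each round the roles reset: the freshly trained model becomes the next round's generator and reference, so the "rejected" responses keep getting harder to distinguish from gold.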
Picking one
- Have preference pairs, want simplicity: DPO. The default.
- Have a complex reward signal (multi-objective, online): PPO. Worth the operational cost.
- Only have SFT data, no preferences: SPIN. Especially good for fine-tuning when you can’t afford preference labelling.
- At frontier scale: hybrid. SFT, then DPO, then targeted RLHF on specific failure modes.
For most teams: DPO is the right starting point in 2025. PPO is for specialised cases. SPIN is interesting but less battle-tested.