By Samson Tanimawo, PhD · Jul 21, 2026

RLHF Deep Dive

RLHF is what turned GPT-3 into ChatGPT. The core mechanics are straightforward; the operational details are where most teams stumble.

The pipeline

Three stages. (1) Supervised fine-tuning on human-written examples. (2) Train a reward model on (prompt, chosen, rejected) preference pairs. (3) Use PPO to train the SFT model to maximise predicted reward while staying close to the SFT distribution.
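
Written out, the stage-3 objective is the standard KL-regularised reward maximisation (notation is mine: r_φ is the reward model, π_SFT the frozen SFT reference, β the KL coefficient):

```latex
\max_{\pi_\theta}\;
  \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
  \big[\, r_\phi(x, y) \,\big]
  \;-\;
  \beta\, \mathrm{D}_{\mathrm{KL}}\!\left(
    \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{SFT}}(\cdot \mid x)
  \right)
```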

The reward model

Typically the same architecture as the policy, just smaller, with the language-modelling head swapped for a single scalar output. Trained with a pairwise loss so the chosen response scores higher than the rejected one. The quality of the RM caps the quality of the final policy.
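
A minimal sketch of that pairwise (Bradley-Terry) loss in PyTorch; the function name and toy tensors are mine, and the reward model is assumed to have already produced one scalar score per response:

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_scores: torch.Tensor,
                         rejected_scores: torch.Tensor) -> torch.Tensor:
    """Push the chosen response's scalar score above the rejected one's
    for every (prompt, chosen, rejected) preference pair."""
    # chosen_scores, rejected_scores: shape (batch,)
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy usage with random scores standing in for reward-model outputs.
chosen = torch.randn(8, requires_grad=True)
rejected = torch.randn(8, requires_grad=True)
loss = pairwise_reward_loss(chosen, rejected)
loss.backward()
```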

PPO step

A clipped-objective policy gradient that prevents the policy from drifting too far in a single update. Stable in theory; in practice it needs careful learning rates, KL penalties, and reward normalisation. Four models sit in memory simultaneously: the trainable policy, a frozen reference copy of the SFT model, the reward model, and the value model.
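
A sketch of the clipped surrogate loss with a KL penalty to the reference model, assuming per-token log-probabilities and advantages have already been computed; the function name, tensor shapes, and the clip_eps/kl_coef defaults are illustrative, not any specific library's API:

```python
import torch

def ppo_policy_loss(logprobs: torch.Tensor,      # current policy, (batch, seq)
                    old_logprobs: torch.Tensor,  # policy at rollout time
                    ref_logprobs: torch.Tensor,  # frozen SFT reference
                    advantages: torch.Tensor,    # from the value model
                    clip_eps: float = 0.2,
                    kl_coef: float = 0.05) -> torch.Tensor:
    # Probability ratio between the updated policy and the rollout-time policy.
    ratio = torch.exp(logprobs - old_logprobs)

    # Clipped surrogate objective: take the pessimistic (min) side so a
    # single update cannot move the policy too far.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Approximate KL to the frozen reference, added as a penalty so the
    # policy stays near the SFT distribution.
    kl = (logprobs - ref_logprobs).mean()
    return policy_loss + kl_coef * kl
```

Many implementations fold the KL term into the per-token reward rather than the loss; either way, it is what anchors the policy to the SFT distribution.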

Cost picture

Preference labelling: hundreds of thousands of (prompt, chosen, rejected) triples. Reward model training: cheap relative to the rest. PPO training: expensive (4 models in memory, lots of inference), measured in tens of thousands of GPU-hours for a frontier model.

Why the field is moving past it

DPO (Direct Preference Optimisation) replaces the reward model and the PPO loop with a single closed-form loss on preference pairs. Constitutional AI replaces most human labelling with model self-critique against a written set of principles. Together they cut the cost of alignment by 10-100x with comparable or better results.
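
For concreteness, the DPO loss in PyTorch, assuming you already have the summed log-probability of each chosen and rejected response under the policy and under the frozen reference; the names and the β default are illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """The policy's implicit reward is beta times its log-prob ratio to the
    reference; train it so the chosen response out-scores the rejected one."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

No reward model, no sampling loop, no value model: just the policy and a frozen reference, trained directly on the same preference pairs the RM would have used.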