By Samson Tanimawo, PhD. Published Dec 25, 2026.

RLHF Deep Dive

RLHF is what turned GPT-3 into ChatGPT. The core mechanics are straightforward; the operational details are where most teams stumble.

The pipeline

RLHF (Reinforcement Learning from Human Feedback) takes a pretrained language model and aligns it with human preferences. The pipeline has three stages: SFT (supervised fine-tuning on demonstration data), reward model training (humans rank model outputs; train a model to predict the rankings), and RL fine-tuning (use the reward model as a signal to fine-tune the language model toward higher-rated outputs).

The SFT stage. Start with a pretrained model (output is essentially next-token prediction, no real alignment). Fine-tune on supervised demonstrations of desired behaviour: helpful, harmless, honest responses. SFT alone produces a much-improved model; RLHF builds on top.

The reward model stage. Generate pairs of model responses; humans rank them; train a separate model (the reward model) to predict the rankings. The reward model embeds human preferences; it can be queried at inference time to score new outputs.

The RL stage. Fine-tune the SFT model using the reward model as the reward signal, typically with PPO (Proximal Policy Optimization). The model learns to produce outputs the reward model rates highly. Simpler alternatives such as DPO (Direct Preference Optimization) skip the explicit reward model and train on the comparison data directly.

The end-to-end story. Pretrained model → SFT → RM → RL → aligned model. Each stage adds something the previous stages lacked. RLHF is what made models such as ChatGPT and Claude production-grade.

The reward model

The reward model is the most labor-intensive part. Humans compare pairs of outputs and choose the better one. Production RM training typically uses 50K-500K comparisons. Quality is everything: bad comparisons produce a bad RM, which in turn produces a misaligned final model.

The labeler workforce. Annotators are usually contractors trained for the task. Quality control is critical: inter-annotator agreement, calibration training, periodic audits. Top-tier RLHF programs have dedicated labeler operations teams.
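Inter-annotator agreement can be measured in several ways. One common choice is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below assumes two annotators labelled the same comparisons with "A" or "B"; the function name and the sample labels are illustrative, not taken from any particular labelling platform.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Fraction of items where the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical picks from two annotators on the same eight comparisons.
annotator_1 = ["A", "A", "B", "B", "A", "B", "A", "A"]
annotator_2 = ["A", "B", "B", "B", "A", "A", "A", "A"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

In this sample the annotators agree on 6 of 8 items (75%), but kappa comes out near 0.47 because both pick "A" most of the time. What threshold to enforce is a program-level decision.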

The pairwise format. Show two outputs (A and B) for the same prompt. The annotator picks the better one. The pairwise format is more reliable than absolute ratings (which suffer from anchoring effects). Some programs use Likert scales for fine-grained signal; pairwise is the dominant approach.

The RM architecture. Usually a similar architecture to the language model itself, with a regression head that outputs a scalar reward. Sometimes a smaller model (faster, cheaper); sometimes the same size as the policy model. Smaller is more common for compute-bound RL stages.
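To connect the scalar head to the pairwise labels: a widely used training objective is a Bradley-Terry style loss, which pushes the reward of the chosen response above the reward of the rejected one. The article does not name a specific loss, so treat this as an assumption. The sketch below works on plain floats (no model), and the function names are illustrative.

```python
import math


def pairwise_rm_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)) for one comparison."""
    margin = reward_chosen - reward_rejected
    # Numerically stable softplus(-margin), equal to -log(sigmoid(margin)).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_pairwise_loss(pairs):
    """Average loss over (reward_chosen, reward_rejected) tuples."""
    return sum(pairwise_rm_loss(c, r) for c, r in pairs) / len(pairs)


# Hypothetical scalar rewards for three labelled comparisons.
batch = [(1.2, -0.3), (0.1, 0.4), (2.0, 1.9)]
print(f"mean loss = {mean_pairwise_loss(batch):.4f}")
```

The loss is log 2 when the two rewards are equal and shrinks toward zero as the chosen response is scored further above the rejected one. Only the difference matters, which is why reward model outputs have no meaningful absolute scale.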

The data quality. Bad annotations propagate to the final model. Annotation guidelines, quality reviews, and ongoing calibration are essential. The "labelling team" is half of RLHF investment; the model side is the other half.

PPO step

The RL training step. The policy (language model) generates an output for a prompt. The reward model scores it. PPO updates the policy toward higher-reward outputs while constraining KL divergence from the SFT initialisation (so the model doesn't degenerate). The KL penalty is what keeps RLHF from collapsing into degenerate "high reward" outputs that lose general capability.
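The "update toward higher-reward outputs" part is usually implemented with PPO's clipped surrogate objective. The sketch below shows that objective for a single action, assuming the advantage has already been computed elsewhere (for example, from the KL-penalised reward minus a baseline). The function name and the default `clip_eps=0.2` are illustrative assumptions, not values from the article.

```python
import math


def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-Clip surrogate for one action (to be maximised)."""
    # Probability ratio between the updated policy and the sampling policy.
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes the incentive to move the ratio far from 1.
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: the gain is capped once the ratio exceeds 1 + clip_eps.
print(ppo_clipped_objective(logp_new=-1.0, logp_old=-1.5, advantage=1.0))
# Negative advantage: the unclipped (more pessimistic) term is kept.
print(ppo_clipped_objective(logp_new=-1.0, logp_old=-1.5, advantage=-1.0))
```

Clipping limits how far one update can move the policy from the version that generated the samples. It is a separate mechanism from the KL penalty against the SFT model, and the two are normally used together.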

The sampling. Generate K candidates per prompt, score each with the reward model, and use the scores as the RL signal. K = 4-16 is typical. Larger K costs more compute but gives a more reliable signal. Compute is one of the dominant costs of RLHF.

The KL constraint. Without it, the policy drifts to high-reward but degenerate text (the model finds reward-model exploits). The constraint penalises divergence from the SFT model. The constraint strength is a key hyperparameter: too tight and the policy doesn't improve; too loose and it collapses.
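One common way to apply the constraint is to subtract a scaled KL estimate from the reward model score before the RL update. The sketch below assumes per-token log-probabilities of the sampled response are available from both the policy and the frozen SFT model. The coefficient name `beta` and the sample numbers are illustrative.

```python
def kl_shaped_reward(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward model score minus beta times a sample-based KL estimate."""
    # Summed log-ratio over the sampled tokens estimates KL(policy || SFT).
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate


# Hypothetical three-token response that the policy now prefers over the SFT model.
policy_lp = [-1.0, -0.5, -0.2]
sft_lp = [-1.5, -1.0, -0.7]

loose = kl_shaped_reward(2.0, policy_lp, sft_lp, beta=0.1)
tight = kl_shaped_reward(2.0, policy_lp, sft_lp, beta=1.0)
print(f"beta=0.1 -> {loose:.2f}, beta=1.0 -> {tight:.2f}")
```

With the same reward model score, the larger `beta` leaves much less net reward for drifting away from the SFT model. This is the too-tight versus too-loose trade-off in numeric form.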

The reward hacking. The policy learns to exploit the reward model's biases and generates outputs the RM rates highly but humans wouldn't actually prefer. Detection: track held-out human evaluations during training. When held-out human ratings stop improving while RM rewards keep climbing, the model is hacking the RM.
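The detection rule above can be turned into a simple checkpoint monitor. The sketch below assumes a mean RM reward and a held-out human win rate are logged at each checkpoint. The function name, the window size, and the `min_gain` threshold are hypothetical and would need tuning for a real program.

```python
def detect_reward_hacking(rm_rewards, human_win_rates, window=3, min_gain=0.005):
    """Return the first checkpoint index where RM reward rose over the last
    `window` checkpoints but human win rate gained less than `min_gain`.
    Returns None if no such checkpoint exists."""
    if len(rm_rewards) != len(human_win_rates):
        raise ValueError("both series need one value per checkpoint")
    for i in range(window, len(rm_rewards)):
        rm_gain = rm_rewards[i] - rm_rewards[i - window]
        human_gain = human_win_rates[i] - human_win_rates[i - window]
        if rm_gain > 0 and human_gain < min_gain:
            return i
    return None


# Hypothetical run: RM reward keeps climbing while human win rate plateaus.
rm_curve = [0.10, 0.30, 0.50, 0.70, 0.90, 1.10, 1.30]
human_curve = [0.50, 0.55, 0.60, 0.62, 0.62, 0.61, 0.60]
flagged = detect_reward_hacking(rm_curve, human_curve)
print(f"first suspicious checkpoint: {flagged}")
```

A flag here is a prompt to inspect samples by hand, not proof of hacking. Human evaluations are noisy, so a wider window or a significance test may be more appropriate in practice.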

The DPO alternative. DPO (Direct Preference Optimization) skips PPO and the explicit RM, using the comparison data directly. It is simpler and more stable, but sometimes gives lower quality on the hardest tasks. Many production systems have moved from PPO to DPO; some still use PPO for the highest-stakes work.
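For a sense of what "using the comparison data directly" means, here is a sketch of the DPO loss for a single comparison. It assumes the summed log-probability of each full response is available under both the policy being trained and a frozen reference model (typically the SFT model). The argument names and the default `beta=0.1` are illustrative.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair."""
    # How much more the policy likes each response than the reference does.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable -log(sigmoid(logits)).
    if logits > 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Policy identical to the reference: the loss starts at log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Policy has shifted probability toward the chosen response: the loss drops.
print(dpo_loss(-10.0, -17.0, -12.0, -15.0))
```

The reference log-probabilities play the role that the KL penalty plays in PPO: the policy is rewarded for preferring the chosen response more than the reference does, not for raising its probability without limit.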

Cost picture

RLHF is expensive. Annotation: 50K-500K comparisons × $0.50-$5 each = $25K-$2.5M just for labels. Compute: RL is 5-20x base SFT compute. Engineering: months of work building the RLHF infrastructure. Frontier RLHF programs cost millions. The capability gain justifies the cost; the barrier is real for new teams.

The annotation cost breakdown. Per-comparison cost varies: simple ranking ($0.30-$1), nuanced multi-attribute ($1-$5), expert domain ($5-$25). For a complete RLHF pipeline, annotation budget is typically $50K-$500K per major iteration.
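The per-tier prices above translate into a budget range once a mix of comparison types is chosen. The sketch below uses the price bands quoted in this section. The mix of 60K simple, 30K nuanced, and 10K expert comparisons is a hypothetical example, not a recommendation.

```python
# Per-comparison price bands in USD (low, high), as quoted above.
PRICE_BANDS = {
    "simple": (0.30, 1.00),
    "nuanced": (1.00, 5.00),
    "expert": (5.00, 25.00),
}


def annotation_budget(mix):
    """Return (low, high) USD totals for a dict of tier -> comparison count."""
    low = sum(count * PRICE_BANDS[tier][0] for tier, count in mix.items())
    high = sum(count * PRICE_BANDS[tier][1] for tier, count in mix.items())
    return low, high


# Hypothetical mix for one iteration: 100K comparisons in total.
example_mix = {"simple": 60_000, "nuanced": 30_000, "expert": 10_000}
low, high = annotation_budget(example_mix)
print(f"${low:,.0f} to ${high:,.0f}")
```

This mix lands at roughly $98K to $460K, inside the $50K-$500K per-iteration range. Note how the 10% of expert comparisons accounts for about half of the total at either end.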

The compute cost. RL training requires many forward passes per prompt (sampling multiple completions). Per-step cost is 5-10x SFT. Total RLHF compute is 5-20x SFT compute depending on iteration count. For frontier-scale models, this can represent tens of millions of dollars in cloud compute.

The engineering cost. Building RLHF infrastructure is non-trivial: distributed RL training, reward model serving, labeling pipelines, evaluation frameworks. Initial buildout: 6-12 engineer-months. Ongoing operations: 2-4 engineers fully dedicated for active programs.

The lower-cost alternatives. RLAIF (replace human feedback with AI feedback) substantially reduces annotation cost. Constitutional AI replaces much of the human labelling with AI-generated comparisons guided by a written set of principles. Smaller open-source programs (Alpaca-style, which relies on supervised instruction tuning rather than a full RLHF loop) achieve much of the benefit at a fraction of the cost. The trade-off curve is real; not every team needs frontier-scale RLHF.

Why move past it

RLHF has known limitations. Reward hacking is real. Annotation costs scale poorly. Constitutional AI, DPO, and synthetic preference methods are emerging as cheaper alternatives. The post-RLHF era is in motion; pure RLHF still works but is being replaced or augmented by these methods.

The limitation: reward hacking. Models find subtle reward-model exploits. Each generation of RLHF training has caught new exploits. The pattern continues; RLHF is an ongoing arms race against your own reward model.

The limitation: cost scaling. Doubling RLHF data quality requires roughly tripling annotation cost. The marginal returns diminish faster than costs. Frontier RLHF programs run into budget constraints at the scale they need.

The limitation: subjective scope. RLHF works best for verifiable preferences (factuality, safety, basic helpfulness). For deeply subjective domains (creative writing, opinion expression), human comparisons are noisier; RLHF gains are smaller.

The Constitutional AI alternative. Use a written constitution (a set of principles) plus AI-generated comparisons to replace much of the human labelling. This is Anthropic's approach. It reduces annotation by 50-90%; capability is competitive with pure RLHF on most evals.

The DPO alternative. Skip the reward model entirely; train directly on comparison pairs. Simpler pipeline; fewer hyperparameters; often comparable quality to PPO. Production-ready and increasingly the default for new programs.

Common antipatterns

Skipping the SFT stage. RLHF on pretrained-only is unstable. SFT is the foundation; don't skip it.

Training the reward model on misaligned annotations. Bad in, bad out. Annotation quality is half the work; budget accordingly.

Pure RL with no KL constraint. Model collapses to degenerate high-reward outputs. The KL penalty is essential.

Skipping held-out human eval during RL. Reward hacking is invisible without human eval. Reserve a held-out human evaluation set and run it periodically during training.

What to do this week

Three moves. (1) If you're building an RLHF program, start with DPO (not PPO). The infrastructure is simpler; the quality is competitive; the bug surface is smaller. (2) For your annotation budget, calculate cost per quality unit (annotation spend per percentage point of held-out evaluation gain). The number guides where to invest more: more annotations, better annotation quality, or a different methodology. (3) Reserve held-out human evaluation throughout training. Catching reward hacking early is much cheaper than recovering after a contaminated training run.
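The cost-per-quality-unit calculation in move (2) is simple enough to script. The sketch below expresses it in dollars per percentage point of held-out evaluation gain. The function name and all of the figures are hypothetical.

```python
def cost_per_point(annotation_spend_usd, eval_before_pct, eval_after_pct):
    """USD spent per percentage point of held-out evaluation gain."""
    gain = eval_after_pct - eval_before_pct
    if gain <= 0:
        return float("inf")  # money spent with no measurable improvement
    return annotation_spend_usd / gain


# Hypothetical comparison of two investments from the same 60% baseline.
more_labels = cost_per_point(100_000, 60.0, 65.0)
better_guidelines = cost_per_point(30_000, 60.0, 63.0)
print(f"more labels: ${more_labels:,.0f} per point")
print(f"better guidelines: ${better_guidelines:,.0f} per point")
```

In this made-up case the guideline work is half the price per point, which would argue for investing in annotation quality before buying more volume. The comparison is only as good as the held-out evaluation behind it.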