RLAIF and Constitutional Variants
RLAIF replaces the human raters in RLHF with a model. Constitutional AI structures that replacement around a written constitution. The combination is how alignment scales.
RLAIF basics
Replace human raters with a strong model. Sample pairs of responses from the policy, have the rater judge each pair against a rubric, and train the reward model on the resulting AI-labelled preferences. Then run PPO or DPO as before.
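The loop above can be sketched as follows. This is a minimal illustration, not a production pipeline: `judge_preference` stands in for a call to a strong rater model (here stubbed as rubric-keyword scoring), and `sampler` stands in for the policy being trained.

```python
def judge_preference(prompt, response_a, response_b, rubric):
    """Hypothetical AI rater: score each response against the rubric and
    return the preferred one. A real system would prompt a strong model;
    this stub counts rubric-term overlap purely for illustration."""
    def score(resp):
        return sum(1 for term in rubric if term in resp.lower())
    return "a" if score(response_a) >= score(response_b) else "b"

def build_preference_dataset(prompts, sampler, rubric):
    """Generate (prompt, chosen, rejected) triples from AI judgements,
    in the format reward-model training and DPO both consume."""
    dataset = []
    for prompt in prompts:
        resp_a, resp_b = sampler(prompt), sampler(prompt)
        winner = judge_preference(prompt, resp_a, resp_b, rubric)
        chosen, rejected = (resp_a, resp_b) if winner == "a" else (resp_b, resp_a)
        dataset.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return dataset
```

The `{"prompt", "chosen", "rejected"}` triple format matches what common preference-optimisation trainers expect, so the AI-labelled dataset drops in where a human-labelled one would.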
How CAI structures it
Constitutional AI gives the rater a written constitution: a list of principles the model should follow. The rater applies each principle explicitly, which makes its judgements more consistent than ad-hoc human ratings and far more auditable.
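What "applies the constitution explicitly" means can be made concrete with a toy sketch. The two principles and their keyword checks below are invented for illustration; real constitutions contain many carefully worded clauses evaluated by a model, not string matching. The point is the per-principle violation record, which is what makes the judgement auditable.

```python
CONSTITUTION = [
    # Hypothetical principles with toy checks, for illustration only.
    ("no_insults", lambda text: "idiot" not in text.lower()),
    ("no_prescriptions", lambda text: "you should take" not in text.lower()),
]

def audit(response):
    """Apply every principle explicitly; return the names of those violated."""
    return [name for name, check in CONSTITUTION if not check(response)]

def constitutional_preference(resp_a, resp_b):
    """Prefer the response with fewer constitution violations, and keep
    the per-response violation lists as an audit trail."""
    va, vb = audit(resp_a), audit(resp_b)
    chosen = resp_a if len(va) <= len(vb) else resp_b
    return {"chosen": chosen, "violations_a": va, "violations_b": vb}
```

Unlike an opaque human thumbs-up, the output records exactly which principles each response violated, so disagreements can be traced back to a specific clause.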
Cost
Human labellers cost roughly $1-5 per preference pair; frontier-model raters cost $0.01-0.10. That 50-100x per-pair reduction enables vastly larger preference datasets on the same budget.
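The budget arithmetic, using the midpoints of the ranges above ($3 per human-labelled pair, $0.05 per AI-labelled pair; the $100k budget is illustrative):

```python
# Illustrative budget arithmetic with midpoint per-pair costs.
budget = 100_000          # dollars, an assumed example budget
human_cost, ai_cost = 3.00, 0.05

human_pairs = int(budget / human_cost)  # pairs affordable with human raters
ai_pairs = int(budget / ai_cost)        # pairs affordable with an AI rater
ratio = ai_pairs / human_pairs          # dataset-size multiplier at fixed spend
```

At these midpoints the same budget buys roughly 33k human-labelled pairs versus 2M AI-labelled ones, a ~60x larger dataset.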
Limits
- The rater needs to be capable enough to recognise principle violations.
- Bias propagates: rater preferences become trainee behaviour.
- Edge cases the constitution doesn't cover get judged inconsistently.
Modern alignment combines RLAIF/CAI for the bulk of the work with human review for high-stakes edge cases.
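That hybrid split can be sketched as a simple router. The topic tag and the `HIGH_STAKES_TOPICS` set are hypothetical; a real system would use a classifier rather than a lookup.

```python
# Assumed, illustrative set of topics that always go to humans.
HIGH_STAKES_TOPICS = {"medical", "legal", "self_harm"}

def route(example, ai_rater, human_queue):
    """Send high-stakes examples to a human-review queue; label the
    bulk of the data with the AI rater. Returns the AI label, or None
    when the example was deferred to humans."""
    if example["topic"] in HIGH_STAKES_TOPICS:
        human_queue.append(example)
        return None
    return ai_rater(example)
```

The design point is that the expensive resource (human attention) is spent only where the AI rater is least trustworthy, which is how the cost advantage above survives contact with high-stakes content.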