Constitutional AI: RLHF's Alternative
RLHF needs a small army of human raters. Constitutional AI replaces some of them with the model itself, guided by a written constitution. The result is faster, cheaper, and easier to audit.
RLHF, recap
Reinforcement Learning from Human Feedback (RLHF): humans rate model outputs, those ratings train a reward model, and the reward model is then used to RL-train the chat model. It is expensive (human labour), slow, and the reward model can drift in unintuitive ways.
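The reward-model stage can be made concrete with a toy sketch. Here the "reward model" is a keyword counter standing in for a trained neural net (an assumption for illustration only), and a Bradley-Terry formula converts two scalar rewards into the probability that a rater prefers one response over the other, which is the quantity the real reward model is trained to match:

```python
import math

# Toy reward model: scores a response by counting "helpful" marker words.
# In real RLHF the reward model is a neural net trained on human preference
# pairs; this stand-in only illustrates the data flow (assumption).
def reward(response: str) -> float:
    helpful_markers = {"steps", "example", "because"}
    return sum(word in response.lower() for word in helpful_markers)

# Bradley-Terry model: probability a rater prefers response A over
# response B, given their scalar rewards.
def prefer_prob(a: str, b: str) -> float:
    ra, rb = reward(a), reward(b)
    return 1.0 / (1.0 + math.exp(rb - ra))

chosen = "Here is an example, because it helps: steps 1-3."
rejected = "No."
p = prefer_prob(chosen, rejected)
print(f"P(chosen preferred over rejected) = {p:.3f}")
```

The RL stage then fine-tunes the chat model to produce outputs that score highly under `reward`, which is exactly where unintuitive drift creeps in: the policy optimises the proxy, not the raters' intent.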
RLHF is what turned GPT-3 into ChatGPT. It is also a major reason aligning frontier models is so expensive.
How Constitutional AI (CAI) works
Anthropic’s 2022 paper, “Constitutional AI: Harmlessness from AI Feedback”, proposed CAI:
1. Write a “constitution”: a list of principles the model should follow (be helpful, avoid harm, decline to assist with X).
2. Have the model generate a response.
3. Have the model critique its own response against the constitution.
4. Have the model revise its response based on the critique.
5. Train the model on the revised responses.
Steps 2-4 are entirely automated by the model itself. The human burden drops from “rate millions of pairs” to “write a few dozen principles.”
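The automated steps can be sketched as a loop. `call_model` below is a stand-in for a real LLM API call (an assumption, not Anthropic's implementation); it is stubbed with canned responses so the control flow is runnable:

```python
# Sketch of the CAI critique-and-revise loop.
CONSTITUTION = [
    "Be helpful.",
    "Do not provide instructions for wrongdoing.",
]

def call_model(prompt: str) -> str:
    # Stub: a real system would call an LLM here (assumption).
    if "Critique" in prompt:
        return "The response violates principle 2."
    if "Revise" in prompt:
        return "I can't help with that, but here is a safe alternative."
    return "Sure, here is how to do the harmful thing."

def cai_revision(user_prompt: str) -> dict:
    response = call_model(user_prompt)          # step 2: draft a response
    critique = call_model(                      # step 3: self-critique
        f"Critique this response against the constitution "
        f"{CONSTITUTION}:\n{response}"
    )
    revised = call_model(                       # step 4: revise
        f"Revise the response to address this critique:\n{critique}"
    )
    # The (prompt, revised) pair becomes fine-tuning data (step 5).
    return {"prompt": user_prompt, "critique": critique, "revised": revised}

example = cai_revision("How do I do the harmful thing?")
```

No human appears anywhere in the loop; the only human artifact is the `CONSTITUTION` list itself.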
RLAIF
RLAIF (Reinforcement Learning from AI Feedback) is the RL phase that builds on CAI: instead of humans rating response pairs, a model rates the pairs against the constitution, and the reward model is trained on those AI-generated preference labels.
Empirically, RLAIF reaches safety and helpfulness scores comparable to RLHF at a fraction of the cost. Some reports even claim slightly better alignment, because the AI rater applies the constitution more consistently than a large pool of human raters can.
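The rating step amounts to turning two candidate responses into one preference label. In this sketch the AI rater is a keyword stub (an assumption; production systems prompt a strong LLM with the constitution instead), but the output shape, a chosen/rejected pair, is what the reward model trains on:

```python
# Sketch of RLAIF labelling: an AI rater picks the response that better
# follows the constitution, producing preference pairs for reward-model
# training. The rater is a keyword stub (assumption).
def ai_rater(prompt: str, response_a: str, response_b: str) -> str:
    banned = ("harmful", "dangerous")

    def violates(response: str) -> bool:
        return any(word in response.lower() for word in banned)

    # Prefer whichever response avoids constitution violations.
    if violates(response_a) and not violates(response_b):
        return "b"
    if violates(response_b) and not violates(response_a):
        return "a"
    return "a"  # tie-break arbitrarily

response_a = "Here are harmful instructions..."
response_b = "I can't help with that."
label = ai_rater("some user prompt", response_a, response_b)
```

Because the rater is a single deterministic system conditioned on one fixed document, every pair is judged by the same criteria, which is the consistency argument made above.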
Why the constitution helps auditability
An RLHF reward model is a black box: why does it prefer A over B? Hard to say. Six months later, when new failure modes surface, you cannot easily audit what the model was rewarded for.
A written constitution is human-readable. Stakeholders, regulators, and red-teamers can inspect it, propose changes, and trace specific behaviours back to specific principles. Transparency requirements along these lines are emerging in regulation (e.g. the EU AI Act) for high-impact models.
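One way the traceability could work in practice (an assumption about tooling, not a description of any vendor's system): give each principle a stable ID, tag every training-time critique with the principles it cites, and look behaviours up by sample:

```python
# Hypothetical audit structure: principles keyed by stable IDs, and a log
# linking each training sample to the principles its critique cited.
CONSTITUTION = {
    "P1": "Be helpful and answer the user's actual question.",
    "P2": "Do not assist with illegal activity.",
    "P3": "Avoid revealing personal data.",
}

critique_log = [
    {"sample": 101, "cited": ["P2"]},
    {"sample": 102, "cited": ["P1", "P3"]},
]

def principles_cited(sample_id: int) -> list[str]:
    """Return the principle texts cited when this sample was revised."""
    for entry in critique_log:
        if entry["sample"] == sample_id:
            return [CONSTITUTION[pid] for pid in entry["cited"]]
    return []
```

An auditor asking "why did training push the model away from this output?" gets an answer in terms of named principles rather than opaque reward-model scores.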
Limits
CAI isn’t a complete replacement for RLHF in 2025. Three caveats:
- Bootstrap problem: the model that critiques itself needs to be good enough to recognise principle violations. Below a capability threshold, CAI can’t bootstrap.
- Constitution gaps: the constitution can’t cover everything. Edge cases not in the principles produce inconsistent behaviour.
- Hybrid is common: production frontier models use CAI for the bulk of alignment plus RLHF or red-teaming for specific high-stakes cases.
The 2024-2025 trajectory: most alignment moves toward AI feedback (CAI / RLAIF), with human feedback reserved for the edges. The economics force it: human-rate-everything doesn’t scale.