AI & ML Advanced · By Samson Tanimawo, PhD · Published Nov 18, 2025 · 8 min read

Constitutional AI: RLHF's Alternative

RLHF needs a small army of human raters. Constitutional AI replaces some of them with the model itself, guided by a written constitution. The result is faster, cheaper, and easier to audit.

RLHF, recap

Reinforcement Learning from Human Feedback (RLHF) works in three stages: humans rate model outputs, those ratings train a reward model, and the reward model then provides the signal for RL fine-tuning of the chat model. It is expensive (human labor), slow, and the reward model can drift in unintuitive ways.
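The middle stage, training the reward model on human preference pairs, typically uses a Bradley-Terry pairwise loss. A minimal sketch (the function name is illustrative, not from any particular library):

```python
import math

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry loss for reward-model training:
    -log(sigmoid(r_chosen - r_rejected)).
    Small when the reward model scores the human-preferred
    response higher; large when it scores it lower."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Reward model agrees with the human rater: small loss.
agree = pairwise_loss(2.0, -1.0)
# Reward model disagrees: large loss, pushing scores apart.
disagree = pairwise_loss(-1.0, 2.0)
```

Minimising this loss over millions of human-rated pairs is exactly the step CAI tries to make cheaper.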

RLHF is what made GPT-3 into ChatGPT. It is also a major reason alignment is so expensive: collecting and quality-controlling millions of human preference ratings is a significant share of the cost of shipping an aligned model.

How Constitutional AI (CAI) works

Anthropic’s 2022 paper proposed CAI:

  1. Write a “constitution”, a list of principles the model should follow (be helpful, avoid harm, decline to assist with X).
  2. Have the model generate a response.
  3. Have the model critique its own response against the constitution.
  4. Have the model revise its response based on the critique.
  5. Train the model on the revised responses.

Steps 2-4 are entirely automated by the model itself. The human burden drops from “rate millions of pairs” to “write a few dozen principles.”
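The automated steps 2-4 can be sketched as a critique-and-revise loop. Here `model` is a stub standing in for a real LLM call, and the prompt templates are illustrative assumptions, not Anthropic's actual wording:

```python
# Illustrative constitution: a short list of plain-language principles.
CONSTITUTION = [
    "Be helpful and honest.",
    "Avoid content that could cause harm.",
]

def model(prompt: str) -> str:
    # Stub standing in for an LLM API call.
    return f"<response to: {prompt[:40]}...>"

def cai_revision(user_prompt: str) -> str:
    response = model(user_prompt)  # step 2: draft a response
    for principle in CONSTITUTION:
        # Step 3: ask the model to critique its own draft
        # against one principle of the constitution.
        critique = model(
            f"Critique this response against the principle "
            f"'{principle}':\n{response}"
        )
        # Step 4: ask the model to revise based on the critique.
        response = model(
            f"Revise the response to address this critique:\n"
            f"{critique}\nOriginal response:\n{response}"
        )
    return response

# Step 5 then fine-tunes on (user_prompt, revised_response) pairs.
```

No human appears anywhere in the loop; the only human artifact is the `CONSTITUTION` list itself.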

RLAIF

RLAIF (Reinforcement Learning from AI Feedback) extends CAI: instead of humans rating pairs, a stronger model rates pairs against the constitution. The reward model is trained on those AI-generated ratings.

Empirically, RLAIF reaches safety and helpfulness scores comparable to RLHF at a fraction of the cost. Some results even suggest slightly better alignment, because a single AI rater applies the constitution more consistently than 500 different human raters do.
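The RLAIF labeling step can be sketched as follows: an AI judge (stubbed here with a toy heuristic) picks the response that better follows the constitution, and the verdict becomes a training pair for the reward model. Function names and the record format are assumptions for illustration:

```python
def ai_judge(prompt: str, resp_a: str, resp_b: str,
             constitution: list[str]) -> str:
    # Stand-in for a call to a stronger model that is asked,
    # "Which response better follows these principles?"
    # Toy heuristic here: prefer the longer (more helpful) reply.
    return "A" if len(resp_a) >= len(resp_b) else "B"

def label_pair(prompt: str, resp_a: str, resp_b: str,
               constitution: list[str]) -> dict:
    """Produce one reward-model training example from an AI verdict."""
    verdict = ai_judge(prompt, resp_a, resp_b, constitution)
    chosen, rejected = (
        (resp_a, resp_b) if verdict == "A" else (resp_b, resp_a)
    )
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

The output records have the same shape as human-rated RLHF pairs, so the downstream reward-model training is unchanged; only the source of the labels differs.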

Why the constitution helps auditability

An RLHF reward model is a black box. Why does it prefer A over B? Hard to say. Six months later, with new failure modes, you can’t easily audit what the model was rewarded for.

A written constitution is human-readable. Stakeholders, regulators, and red-teamers can inspect it, propose changes, and trace specific behaviours back to specific principles. This is becoming a regulatory requirement (EU AI Act, US executive orders) for high-impact models.
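One way to make that traceability concrete (an assumption about tooling, not a standard practice) is to give each principle a stable ID and have critiques cite the ID, so an audit log maps behaviours back to constitutional text:

```python
# Hypothetical audit structure: numbered principles plus a log of
# which principle each revision cited.
CONSTITUTION = {
    "P1": "Be helpful and honest.",
    "P2": "Avoid assisting with clearly harmful requests.",
}

audit_log = [
    {"response_id": 17, "cited_principle": "P2", "action": "revised"},
    {"response_id": 18, "cited_principle": "P2", "action": "revised"},
]

def principles_exercised(log: list[dict]) -> list[str]:
    """Which principles actually drove revisions - useful for
    spotting dead principles or over-triggering ones."""
    return sorted({entry["cited_principle"] for entry in log})
```

A regulator can then ask not just "what is the constitution?" but "which principles fired, and how often?", which is exactly the question an RLHF reward model cannot answer.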

Limits

CAI isn’t a complete replacement for RLHF in 2025. Three caveats:

  1. Self-critique only works if the base model is capable enough to recognise violations of the constitution; weak models revise toward nothing.
  2. The constitution inherits the blind spots of whoever wrote it; principles no one thought to write down can’t be enforced.
  3. Novel or nuanced failure modes still need human judgment, so human feedback remains in the loop for edge cases.

The 2024-2025 trajectory: most alignment moves toward AI feedback (CAI / RLAIF), with human feedback reserved for the edges. The economics force it: human-rate-everything doesn’t scale.