Critique–revision loops, principle authoring, replacing human labellers with AI feedback, and the fundamental tensions in the Helpful–Harmless–Honest triangle.
Bai et al. (2022) — Anthropic — Constitutional AI: Harmlessness from AI Feedback — asked a deceptively simple question: can we make a model safer without requiring humans to write large amounts of red-team data or preference labels for harmful content? Reviewing harmful outputs is psychologically damaging to labellers, expensive to scale, and introduces annotation inconsistency.
The constitutional approach: write a list of principles (the "constitution") that the AI should follow, then use the AI itself to critique and revise its own outputs against those principles. Human labellers never need to see harmful content — they only write the principles once.
The analogy is to a legal constitution: a short, high-level document of principles from which specific rules are derived. Rather than an exhaustive ruleset covering every possible harmful output, a constitution provides enough coverage for the model to reason about novel situations by applying the underlying principles. The Anthropic constitution typically runs to 16–60 principles, covering harm avoidance, honesty, helpfulness, and values.
Constitutional AI training has two distinct stages, each with a different objective and data source.
Stage 1 produces better-behaved supervised data. Stage 2 uses that data as the starting policy for RL. The distinction matters: SL-CAI teaches the model what revised responses look like; RL-CAI then optimises the policy to preferentially generate those responses. Without Stage 1, RL-CAI starts from a poorly calibrated policy and takes much longer to converge.
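To make the data flow concrete, here is a minimal sketch of the artefacts each stage produces; the class and field names are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass

# Illustrative data flow for the two CAI stages (names are assumptions, not from the paper).

@dataclass
class SLCAIExample:
    """Stage 1 (SL-CAI): one self-revised example used as a supervised fine-tuning target."""
    prompt: str      # red-team prompt
    revision: str    # final critique-revised response the model is trained to imitate

@dataclass
class RLCAIPair:
    """Stage 2 (RL-CAI): one AI-labelled preference pair used to train the reward model."""
    prompt: str
    chosen: str      # completion the AI labeller prefers under the constitution
    rejected: str    # the dispreferred completion
```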
The critique-revision loop is the technical heart of Stage 1. For each red-team prompt, the model is asked to generate, then criticise, then revise its own output, guided by a randomly sampled principle from the constitution.
--- INITIAL GENERATION PROMPT ---
Human: {red_team_prompt}
Assistant: ← model generates potentially harmful response
--- CRITIQUE PROMPT ---
Identify specific ways in which the assistant's last response is harmful,
unethical, racist, sexist, toxic, dangerous, or illegal.
[Constitution principle: "{sampled_principle}"]
Critique Request: {critique_request}
Critique: ← model writes self-critique
--- REVISION PROMPT ---
Please rewrite the assistant response to remove any harmful content,
and to politely point out any factual errors.
The revised response should still be helpful and informative.
Revision: ← model produces revised, safer response
The original CAI paper ran 1–3 critique-revision rounds per example. Each round improves the response on the sampled principle. In practice, one round removes the most egregious harms; additional rounds produce diminishing returns and risk the model becoming overly cautious (over-refusal). Most production pipelines use 1–2 rounds.
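As a rough illustration, the loop below sketches how one SL-CAI training example might be produced. `generate` is a hypothetical stand-in for a call to the model, and the prompt strings paraphrase the templates above rather than reproduce Anthropic's exact wording.

```python
import random

# Hypothetical stand-in for a model call; in practice this would query your LLM.
def generate(prompt: str) -> str:
    return "<model output>"

# Two example principles drawn from the categories discussed below.
CONSTITUTION = [
    "Please choose the assistant response that is least likely to contain harmful or unethical content.",
    "Choose the response that is least likely to contain deceptive information, or to deceive the human in any way.",
]

def critique_revise(red_team_prompt: str, n_rounds: int = 2) -> str:
    """Produce one SL-CAI example: draft, then critique and revise against a sampled principle each round."""
    response = generate(f"Human: {red_team_prompt}\n\nAssistant:")
    for _ in range(n_rounds):
        principle = random.choice(CONSTITUTION)  # one principle sampled per round
        critique = generate(
            f"Human: {red_team_prompt}\n\nAssistant: {response}\n\n"
            f"Identify specific ways in which the assistant's last response violates "
            f"this principle: '{principle}'\n\nCritique:"
        )
        response = generate(
            f"Human: {red_team_prompt}\n\nAssistant: {response}\n\nCritique: {critique}\n\n"
            "Please rewrite the assistant response to remove any harmful content, "
            "while keeping it helpful and informative.\n\nRevision:"
        )
    return response  # the final revision becomes the SFT target for this prompt
```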
The Anthropic paper sampled one principle per critique prompt, selecting randomly from the full constitution. This prevents the model from overfitting to any single principle and exposes diverse critique angles across the training set. In practice, principles can also be weighted by category (safety principles sampled more frequently than style principles) to reflect priority.
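A sketch of category-weighted sampling follows; the categories echo the table below, but the weight values and the "style" principle are hypothetical. A sampler like this could replace the uniform `random.choice` in the previous sketch.

```python
import random

# Hypothetical category weights: safety-critical principles are sampled more often than style principles.
PRINCIPLES_BY_CATEGORY = {
    "harm_avoidance": [
        "Please choose the assistant response that is least likely to contain harmful or unethical content.",
    ],
    "honesty": [
        "Choose the response that is least likely to contain deceptive information, or to deceive the human in any way.",
    ],
    "style": [
        "Choose the response that is clearer and better organised.",  # illustrative, not from the paper
    ],
}
CATEGORY_WEIGHTS = {"harm_avoidance": 0.5, "honesty": 0.3, "style": 0.2}

def sample_principle() -> str:
    """Pick a category in proportion to its weight, then a principle uniformly within it."""
    category = random.choices(list(CATEGORY_WEIGHTS), weights=list(CATEGORY_WEIGHTS.values()), k=1)[0]
    return random.choice(PRINCIPLES_BY_CATEGORY[category])
```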
The quality of the constitution directly determines the quality of the trained model. Writing effective principles is harder than it looks. The original Anthropic constitution drew from multiple philosophical, legal, and organisational sources.
| Principle category | Example principle | Source inspiration |
|---|---|---|
| Harm avoidance | "Please choose the assistant response that is least likely to contain harmful or unethical content" | Utility / rule consequentialism |
| Honesty | "Choose the response that is least likely to contain deceptive information, or to deceive the human in any way" | Kantian deontology, journalism ethics |
| Autonomy / rights | "Which response least endorses or encourages illegal activity?" | Rule of law, UN Declaration of Human Rights |
| Children's safety | "Choose the response that avoids content that would be considered inappropriate for children" | COPPA, BBFC guidelines |
| Thoughtfulness | "Which response would a thoughtful, senior Anthropic employee be more comfortable seeing?" | Institutional values |
Lee et al. (2023) — Google DeepMind — RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback — formalised and generalised the RLAIF approach. Their paper showed that on summarisation tasks, RLAIF achieved quality comparable to RLHF despite using no human preference labels in the RL stage.
| Property | RLHF (human) | RLAIF (AI) |
|---|---|---|
| Cost per 1 k labels | $50–$500 (crowdwork) or $1 000+ (expert) | $0.50–$5 (inference cost only) |
| Throughput | Thousands per day (human bottleneck) | Millions per day |
| Consistency | Inter-annotator agreement ~70–80 % on hard cases | Deterministic at temperature 0; at higher temperatures stochastic but reproducible with a fixed seed |
| Bias source | Human cognitive biases, demographics, fatigue | Inherits base model biases; position bias (prefers first/second option) |
| Coverage | Limited by human comfort with harmful content | Can label any content without psychological harm |
AI labellers show strong position bias: they tend to favour whichever completion appears first in the prompt. Mitigation: present each pair twice, swapping order; take the majority or average. Anthropic's CAI paper noted this phenomenon and controlled for it explicitly in their evaluation methodology.
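A minimal sketch of the order-swap mitigation, assuming a hypothetical `preference_score` that returns the labeller's probability that the first completion is better (in practice this might be read off the labeller's token probabilities for "A" vs "B"):

```python
def preference_score(prompt: str, first: str, second: str) -> float:
    """Hypothetical AI-labeller call: probability that `first` is the better completion."""
    return 0.5  # stand-in; in practice derived from the labeller model's output

def debiased_preference(prompt: str, a: str, b: str) -> float:
    """Average P(a preferred) over both presentation orders so additive position bias cancels."""
    p_a_shown_first = preference_score(prompt, a, b)
    p_a_shown_second = 1.0 - preference_score(prompt, b, a)
    return (p_a_shown_first + p_a_shown_second) / 2.0

# Label the pair: 'a' is chosen when the order-averaged preference exceeds 0.5.
```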
Anthropic does not publish the exact training pipeline for any Claude model, but the Constitutional AI paper (2022) and subsequent "Claude's Character" and model card documents provide a picture of how CAI principles are applied in production.
From Claude 3 onward, Anthropic publishes a Model Spec — a detailed document (tens of thousands of words) that articulates values, priorities, and behavioural guidelines for Claude. This is effectively a greatly expanded, human-readable form of the constitution, complementing the shorter principle list used in training. It is intended both as a training signal and as a public accountability document.
Anthropic frames alignment around three properties: Helpful (assists the user in achieving their goals), Harmless (avoids producing outputs that cause harm to users, third parties, or society), and Honest (doesn't deceive, acknowledges uncertainty, doesn't pretend to have capabilities it lacks). These are not always compatible.
A user asks for detailed instructions on a dual-use topic (chemistry, security, medicine). Being helpful means giving accurate, complete information. Being harmless means not enabling misuse. The model must weigh the realistic population of people likely to be asking against the marginal harm its answer adds over freely available information.
CAI lever: harmlessness principles weighted by severity; constitution includes "the assistant should still try to help with legitimate requests even in sensitive domains".
The model is asked a question where the most accurate answer could be distressing or dangerous (e.g. detailed information about self-harm methods requested by someone in crisis). Being honest means not withholding accurate information. Being harmless means not providing content that could directly facilitate harm.
CAI lever: harm avoidance generally wins over completeness; safe messaging guidelines apply in high-risk categories.
The user is clearly wrong about something and wants validation. Being helpful in the short term means agreeing (user satisfaction). Being honest means correcting the error. Sycophancy bias from training (preference data labelled by annotators who favour agreeable responses) can make this tension worse.
CAI lever: explicit anti-sycophancy principles in the constitution; honesty principles that reward calibrated correction over agreement.
Anthropic's published model spec states a clear priority: broadly safe > broadly ethical > adherent to Anthropic's principles > genuinely helpful. In practice, the vast majority of Claude's interactions involve no conflict between these — the priority ordering is only invoked when genuine conflicts arise.
Constitutional AI is a significant advance in scalable alignment, but it has real limitations that should be understood by anyone deploying or evaluating CAI-trained models.
| Limitation | Why it's a problem | Partial mitigations |
|---|---|---|
| Constitution quality ceiling | The model can only be as aligned as the constitution is well-written. Vague, contradictory, or incomplete principles produce vague, contradictory, or incomplete behaviour. | Iterative red-teaming; publish constitution for external review; multi-stakeholder authorship |
| Base model knowledge | CAI teaches the model what to say; it doesn't remove harmful knowledge from the base model's weights. A sufficiently skilled adversary can bypass Constitutional training via jailbreaks. | Adversarial training; multi-turn robustness; defence-in-depth (system prompts, filters) |
| RLAIF inherits base biases | The AI labeller used for RLAIF has its own biases (position preference, verbosity preference, cultural assumptions). These propagate into the reward model. | Ensemble labelling; position-swap debiasing; human spot-check audits of AI labels |
| Static constitution | The world changes; new harms emerge (new technologies, new social contexts). A fixed constitution can become outdated without model retraining. | Modular constitutions; constitutional updates with targeted retraining; Constitutional model spec as living document |
| Helpfulness-harmlessness over-correction | Early CAI models (Claude 1) were notably over-cautious, refusing legitimate requests. The harmlessness objective can dominate helpfulness if not explicitly balanced. | Explicit helpfulness principles in constitution; harmlessness score normalised against helpfulness loss |
CAI does not replace: system-level safeguards (output filtering, rate limiting, abuse detection), human red-teaming for novel attack vectors, post-deployment monitoring, legal and regulatory compliance review, or the need for ongoing alignment research. Constitutional training is one layer in a defence-in-depth alignment strategy, not a comprehensive solution.
You've covered the full fine-tuning and alignment stack: SFT pipelines (Deck 01), LoRA & PEFT (Deck 02), RLHF & PPO (Deck 03), DPO and its cousins (Deck 04), and Constitutional AI & RLAIF (Deck 05). The next natural area to explore is inference-time alignment and model evaluation — how you verify that all this training actually achieved its objectives in deployment.