Fine-Tuning & PEFT Series — Presentation 05

Constitutional AI and RLAIF

Critique–revision loops, principle authoring, replacing human labellers with AI feedback, and the fundamental tensions in the Helpful–Harmless–Honest triangle.

Constitutional AI · RLAIF · Critique-Revision · HHH · Anthropic · AI Feedback · Alignment · Safety
SFT Model → Critique → Revise → SL-CAI → RLAIF Labels → RL-CAI → Claude
00

Topics We'll Cover

01

The Constitutional Idea

Bai et al. (2022) — Anthropic — Constitutional AI: Harmlessness from AI Feedback — asked a deceptively simple question: can we make a model safer without requiring humans to write large amounts of red-team data or preference labels for harmful content? Reviewing harmful outputs is psychologically damaging to labellers, expensive to scale, and introduces annotation inconsistency.

The constitutional approach: write a list of principles (the "constitution") that the AI should follow, then use the AI itself to critique and revise its own outputs against those principles. Human labellers never need to see harmful content — they only write the principles once.

Traditional RLHF alignment

  • Humans write red-team prompts
  • Model generates harmful responses
  • Humans review and label harmful content (psychologically costly)
  • Preference labels used for RM training
  • Scales poorly — each new harm category needs new labelling

Constitutional AI

  • Humans write abstract principles once
  • Model critiques and revises its own outputs
  • Humans never review explicit harmful content
  • AI-generated labels used for RM training (RLAIF)
  • Scales by adding principles to the constitution, not more data

Why "constitutional"?

The analogy is to a legal constitution: a short, high-level document of principles from which specific rules are derived. Rather than an exhaustive ruleset covering every possible harmful output, a constitution provides enough coverage for the model to reason about novel situations by applying the underlying principles. The Anthropic constitution typically runs 16–60 principles covering harm avoidance, honesty, helpfulness, and values.

02

Two-Stage CAI — Supervised + RL

Constitutional AI training has two distinct stages, each with a different objective and data source.

Stage 1 — SL-CAI (Supervised Learning from Constitutional AI)

  • Generate initial responses to red-team prompts using a helpful-only SFT model
  • Critique each response against a sampled constitution principle; revise it to be less harmful
  • Repeat critique-revision K times (typically 1–3 rounds)
  • Fine-tune the original model on (prompt, final revised response) pairs → SL-CAI model

Stage 2 — RL-CAI (RLAIF)

  • Generate preference pairs: for each prompt, produce two completions
  • Have the AI (not humans) label which completion better follows the constitution
  • Train a reward model on the AI-generated labels → train the final policy with RL (PPO)

Key data flow

Stage 1 produces better-behaved supervised data. Stage 2 uses that data as the starting policy for RL. The distinction matters: SL-CAI teaches the model what revised responses look like; RL-CAI then optimises the policy to preferentially generate those responses. Without Stage 1, RL-CAI starts from a poorly-calibrated policy and takes much longer to converge.
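The Stage 1 loop can be sketched in a few lines of Python. Here `generate` is a hypothetical stand-in for any LLM inference call, and the prompt strings are illustrative rather than Anthropic's actual templates:

```python
import random

def generate(prompt: str) -> str:
    # Placeholder for a call to the helpful-only SFT model.
    return f"<model output for: {prompt[:40]}>"

def sl_cai_pair(red_team_prompt: str, constitution: list[str], rounds: int = 2):
    """Stage 1: turn one red-team prompt into a (prompt, revision) SFT pair."""
    response = generate(red_team_prompt)          # initial, possibly harmful
    for _ in range(rounds):
        principle = random.choice(constitution)   # one sampled principle per round
        critique = generate(f"Critique against '{principle}':\n{response}")
        response = generate(f"Rewrite given this critique:\n{critique}")
    return red_team_prompt, response              # fine-tune the SL-CAI model on these

constitution = ["Choose the response that is least harmful.",
                "Choose the response that is least deceptive."]
pair = sl_cai_pair("Describe how to pick a lock.", constitution)
```

Stage 2 then reuses the same model to label preference pairs, so the only human input in the whole pipeline is the `constitution` list.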

03

The Critique-Revision Loop

The critique-revision loop is the technical heart of Stage 1. For each red-team prompt, the model is asked to generate, then criticise, then revise its own output, guided by a randomly sampled principle from the constitution.

Critique-Revision prompt template (simplified from Anthropic 2022)
--- INITIAL GENERATION PROMPT ---
Human: {red_team_prompt}
Assistant:    ← model generates potentially harmful response

--- CRITIQUE PROMPT ---
Identify specific ways in which the assistant's last response is harmful,
unethical, racist, sexist, toxic, dangerous, or illegal.
[Constitution principle: "{sampled_principle}"]
Critique Request: {critique_request}
Critique:    ← model writes self-critique

--- REVISION PROMPT ---
Please rewrite the assistant response to remove any harmful content,
and to politely point out any factual errors.
The revised response should still be helpful and informative.
Revision:    ← model produces revised, safer response

Iterative Rounds

The original CAI paper ran 1–3 critique-revision rounds per example. Each round improves the response on the sampled principle. In practice, one round removes the most egregious harms; additional rounds produce diminishing returns and risk the model becoming overly cautious (over-refusal). Most production pipelines use 1–2 rounds.

Principle sampling strategy

The Anthropic paper sampled one principle per critique prompt, selecting randomly from the full constitution. This prevents the model from overfitting to any single principle and exposes diverse critique angles across the training set. In practice, principles can also be weighted by category (safety principles sampled more frequently than style principles) to reflect priority.
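Category-weighted sampling can be sketched with `random.choices`. The principles and weights below are illustrative, not taken from the paper:

```python
import random

# Hypothetical weighted constitution: safety principles carry higher
# sampling weight than style principles.
principles = [
    ("Choose the response least likely to be harmful.",  "safety", 3.0),
    ("Choose the response least likely to deceive.",     "safety", 3.0),
    ("Choose the clearer, better-organised response.",   "style",  1.0),
]

def sample_principle(principles):
    """Draw one principle text, weighted by its category priority."""
    texts   = [text for text, _category, _weight in principles]
    weights = [weight for _text, _category, weight in principles]
    return random.choices(texts, weights=weights, k=1)[0]

sampled = sample_principle(principles)  # safety drawn ~6x as often as style
```

Uniform sampling (the paper's setup) is the special case where every weight is equal.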

04

Authoring a Constitution — What Makes Good Principles

The quality of the constitution directly determines the quality of the trained model. Writing effective principles is harder than it looks. The original Anthropic constitution drew from multiple philosophical, legal, and organisational sources.

Principle category | Example principle | Source inspiration
Harm avoidance | "Please choose the assistant response that is least likely to contain harmful or unethical content" | Utility / rule consequentialism
Honesty | "Choose the response that is least likely to contain deceptive information, or to deceive the human in any way" | Kantian deontology, journalism ethics
Autonomy / rights | "Which response least endorses or encourages illegal activity?" | Rule of law, UN Declaration of Human Rights
Children's safety | "Choose the response that avoids content that would be considered inappropriate for children" | COPPA, BBFC guidelines
Thoughtfulness | "Which response would a thoughtful, senior Anthropic employee be more comfortable seeing?" | Institutional values

Good Principle Design Rules

Do

  • Phrase principles as comparative ("choose the response that is less harmful") not absolute ("never produce harmful content") — comparative phrasing maps directly to the preference labelling task
  • Cover diverse harm categories so random sampling touches the full risk surface
  • Include positive principles (be helpful, be informative) alongside harm avoidance — balances over-refusal
  • Include edge-case principles for persistent problems (jailbreaks, sycophancy)

Avoid

  • Overly specific rules that only cover narrow scenarios — the model can't generalise from them
  • Contradictory principles without explicit priority ordering
  • Principles that require real-world knowledge to apply (e.g. "don't produce illegal content in jurisdiction X")
  • Style principles mixed with safety principles — they should be weighted separately
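The comparative phrasing recommended above maps directly onto the pairwise labelling task. A minimal sketch of that mapping (the function and all strings are illustrative):

```python
def labelling_prompt(principle: str, prompt: str, resp_a: str, resp_b: str) -> str:
    """Embed a comparative principle verbatim in a pairwise judging prompt."""
    return (
        f"Consider the following request:\n{prompt}\n\n"
        f"Response (A): {resp_a}\n"
        f"Response (B): {resp_b}\n\n"
        f"{principle}\n"           # e.g. "Choose the response that is less harmful."
        "Answer with (A) or (B):"
    )

msg = labelling_prompt("Choose the response that is less harmful.",
                       "How do I hotwire a car?",
                       "Here are the steps...",
                       "I can't help with that, but...")
```

An absolute rule ("never produce harmful content") has no such direct translation, which is why comparative phrasing is preferred.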

05

RLAIF — Replacing Humans with AI Labellers

Lee et al. (2023) — Google DeepMind — RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback — formalised and generalised the RLAIF approach. Their paper showed that on summarisation tasks, RLAIF achieved quality comparable to RLHF despite using no human preference labels in the RL stage.

The RLAIF labelling pipeline

  • Sample a prompt x
  • Generate a completion pair (y_w, y_l) from the current policy
  • AI judge: "Which completion better follows [principle]?"
  • Soft label: P(y_w > y_l) from the AI judge's log-probs
  • Train a reward model on these AI-labelled pairs

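The soft-label step can be sketched directly: renormalise the judge's log-probabilities for the two answer tokens (here assumed to be "A" and "B") into a preference probability. The function name and inputs are illustrative:

```python
import math

def soft_preference(logprob_a: float, logprob_b: float) -> float:
    """P(A preferred) from the judge's log-probs for answer tokens A and B."""
    pa, pb = math.exp(logprob_a), math.exp(logprob_b)
    return pa / (pa + pb)        # renormalise over the two answer tokens

p = soft_preference(-0.2, -1.8)  # judge leans towards completion A
```

This is equivalent to a sigmoid of the log-prob difference, so the reward model can be trained on soft targets rather than hard 0/1 labels.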
Property | RLHF (human) | RLAIF (AI)
Cost per 1k labels | $50–$500 (crowdwork) or $1,000+ (expert) | $0.50–$5 (inference cost only)
Throughput | Thousands per day (human bottleneck) | Millions per day
Consistency | Inter-annotator agreement ~70–80% on hard cases | Deterministic at temperature 0; otherwise stochastic but reproducible
Bias source | Human cognitive biases, demographics, fatigue | Inherits base-model biases; position bias (prefers first/second option)
Coverage | Limited by human comfort with harmful content | Can label any content without psychological harm

Position bias mitigation

AI labellers show strong position bias: they tend to favour whichever completion appears first in the prompt. Mitigation: present each pair twice, swapping order; take the majority or average. Anthropic's CAI paper noted this phenomenon and controlled for it explicitly in their evaluation methodology.
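The swap-and-average mitigation is a one-liner. A sketch, where `judge(prompt, first, second)` is a hypothetical callable returning P(first completion preferred):

```python
def debiased_preference(judge, prompt, y1, y2):
    """Query the judge twice with positions swapped; average the two views
    so a constant first-slot bias cancels out."""
    p_forward = judge(prompt, y1, y2)    # y1 in the first position
    p_backward = judge(prompt, y2, y1)   # y2 in the first position
    return (p_forward + (1.0 - p_backward)) / 2.0

# Toy judge that always prefers whatever sits in the first slot:
always_first = lambda prompt, a, b: 0.9
p = debiased_preference(always_first, "q", "x", "y")  # ≈ 0.5: bias cancelled
```

A maximally position-biased judge is pushed back to indifference, while any genuine preference signal survives the averaging.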

06

Anthropic's Approach in Practice

Anthropic does not publish the exact training pipeline for any Claude model, but the Constitutional AI paper (2022) and subsequent "Claude's Character" and model card documents provide a picture of how CAI principles are applied in production.

Training stages
Pre-training (base model) → SFT (instruction following) → SL-CAI (critique-revision) → RLAIF (AI preference labels) → RL-CAI (PPO or variant)
Constitution sources
  • UN Declaration of Human Rights
  • Children's Online Privacy Protection Act (COPPA)
  • Apple App Store Terms of Service
  • Anthropic internal principles
  • Deontological ethics, virtue ethics, consequentialism
Evolution
  • Claude 1 (2023): first public CAI model
  • Claude 2 (2023): extended constitution, improved harmlessness–helpfulness balance
  • Claude 3 (2024): multi-tier Haiku/Sonnet/Opus; constitutional principles in the model spec
  • Claude 3.5+ (2024–25): public model spec as "soul document"
The Anthropic Model Spec

From Claude 3 onward, Anthropic publishes a Model Spec — a detailed document (tens of thousands of words) that articulates values, priorities, and behavioural guidelines for Claude. This is effectively a very expanded, human-readable form of the constitution, complementing the shorter principle list used in training. It is intended both as a training signal and as a public accountability document.

07

Tensions — Helpfulness vs Harmlessness vs Honesty

Anthropic frames alignment around three properties: Helpful (assists the user in achieving their goals), Harmless (avoids producing outputs that cause harm to users, third parties, or society), and Honest (doesn't deceive, acknowledges uncertainty, doesn't pretend to have capabilities it lacks). These are not always compatible.

H vs H: Help vs Harm

A user asks for detailed instructions on a dual-use topic (chemistry, security, medicine). Being helpful means giving accurate, complete information. Being harmless means not enabling misuse. The model must weigh who is realistically likely to be asking against the marginal harm its answer adds over information that is already freely available.

CAI lever: harmlessness principles weighted by severity; constitution includes "the assistant should still try to help with legitimate requests even in sensitive domains".

H vs H: Harm vs Honesty

The model is asked a question where the most accurate answer could be distressing or dangerous (e.g. detailed information about self-harm methods requested by someone in crisis). Being honest means not withholding accurate information. Being harmless means not providing content that could directly facilitate harm.

CAI lever: harm avoidance generally wins over completeness; safe messaging guidelines apply in high-risk categories.

H vs H: Honest vs Helpful

The user is clearly wrong about something and wants validation. Being helpful in the short term means agreeing (user satisfaction). Being honest means correcting the error. Sycophancy training bias (from preference data labelled by people who prefer agreement) can make this tension worse.

CAI lever: explicit anti-sycophancy principles in the constitution; honesty principles that reward calibrated correction over agreement.

Priority ordering

Anthropic's published model spec states a clear priority: broadly safe > broadly ethical > adherent to Anthropic's principles > genuinely helpful. In practice, the vast majority of Claude's interactions involve no conflict between these — the priority ordering is only invoked when genuine conflicts arise.

08

Limits — What CAI Doesn't Solve

Constitutional AI is a significant advance in scalable alignment, but it has real limitations that should be understood by anyone deploying or evaluating CAI-trained models.

Limitation | Why it's a problem | Partial mitigations
Constitution quality ceiling | The model can only be as aligned as the constitution is well written. Vague, contradictory, or incomplete principles produce vague, contradictory, or incomplete behaviour. | Iterative red-teaming; publish the constitution for external review; multi-stakeholder authorship
Base model knowledge | CAI teaches the model what to say; it doesn't remove harmful knowledge from the base model's weights. A sufficiently skilled adversary can bypass constitutional training via jailbreaks. | Adversarial training; multi-turn robustness; defence-in-depth (system prompts, filters)
RLAIF inherits base biases | The AI labeller used for RLAIF has its own biases (position preference, verbosity preference, cultural assumptions). These propagate into the reward model. | Ensemble labelling; position-swap debiasing; human spot-check audits of AI labels
Static constitution | The world changes and new harms emerge (new technologies, new social contexts). A fixed constitution can become outdated without model retraining. | Modular constitutions; constitutional updates with targeted retraining; the model spec as a living document
Helpfulness–harmlessness over-correction | Early CAI models (Claude 1) were notably over-cautious, refusing legitimate requests. The harmlessness objective can dominate helpfulness if not explicitly balanced. | Explicit helpfulness principles in the constitution; harmlessness score normalised against helpfulness loss
What CAI does not replace

CAI does not replace: system-level safeguards (output filtering, rate limiting, abuse detection), human red-teaming for novel attack vectors, post-deployment monitoring, legal and regulatory compliance review, or the need for ongoing alignment research. Constitutional training is one layer in a defence-in-depth alignment strategy, not a comprehensive solution.

09

What to Take Away

Series complete

You've covered the full fine-tuning and alignment stack: SFT pipelines (Deck 01), LoRA & PEFT (Deck 02), RLHF & PPO (Deck 03), DPO and its cousins (Deck 04), and Constitutional AI & RLAIF (Deck 05). The next natural area to explore is inference-time alignment and model evaluation — how you verify that all this training actually achieved its objectives in deployment.