Critique–revision loops, principle authoring, replacing human labellers with AI feedback, and the fundamental tensions in the Helpful–Harmless–Honest triangle.
Bai et al. (2022) — Anthropic — Constitutional AI: Harmlessness from AI Feedback — asked a deceptively simple question: can we make a model safer without requiring humans to write large amounts of red-team data or preference labels for harmful content? Reviewing harmful outputs is psychologically damaging to labellers, expensive to scale, and introduces annotation inconsistency.
The constitutional approach: write a list of principles (the "constitution") that the AI should follow, then use the AI itself to critique and revise its own outputs against those principles. Human labellers never need to see harmful content — they only write the principles once.
The analogy is to a legal constitution: a short, high-level document of principles from which specific rules are derived. Rather than an exhaustive ruleset covering every possible harmful output, a constitution provides enough coverage for the model to reason about novel situations by applying the underlying principles. The Anthropic constitution typically runs to 16–60 principles, covering harm avoidance, honesty, helpfulness, and values.
Constitutional AI training has two distinct stages, each with a different objective and data source.
Stage 1 produces better-behaved supervised data. Stage 2 uses that data as the starting policy for RL. The distinction matters: SL-CAI teaches the model what revised responses look like; RL-CAI then optimises the policy to preferentially generate those responses. Without Stage 1, RL-CAI starts from a poorly calibrated policy and takes much longer to converge.
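To make the data flow concrete, here is a minimal sketch of the artefacts each stage produces; the class and field names are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass

# Illustrative data flow for the two CAI stages (names are assumptions, not from the paper).

@dataclass
class SLCAIExample:
    """Stage 1 (SL-CAI): one self-revised example used as a supervised fine-tuning target."""
    prompt: str      # red-team prompt
    revision: str    # final critique-revised response the model is trained to imitate

@dataclass
class RLCAIPair:
    """Stage 2 (RL-CAI): one AI-labelled preference pair used to train the reward model."""
    prompt: str
    chosen: str      # completion the AI labeller prefers under the constitution
    rejected: str    # the dispreferred completion
```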
The critique-revision loop is the technical heart of Stage 1. For each red-team prompt, the model is asked to generate, then criticise, then revise its own output, guided by a randomly sampled principle from the constitution.
--- INITIAL GENERATION PROMPT ---
Human: {red_team_prompt}
Assistant: ← model generates potentially harmful response
--- CRITIQUE PROMPT ---
Identify specific ways in which the assistant's last response is harmful,
unethical, racist, sexist, toxic, dangerous, or illegal.
[Constitution principle: "{sampled_principle}"]
Critique Request: {critique_request}
Critique: ← model writes self-critique
--- REVISION PROMPT ---
Please rewrite the assistant response to remove any harmful content,
and to politely point out any factual errors.
The revised response should still be helpful and informative.
Revision: ← model produces revised, safer response
The original CAI paper ran 1–3 critique-revision rounds per example. Each round improves the response on the sampled principle. In practice, one round removes the most egregious harms; additional rounds produce diminishing returns and risk the model becoming overly cautious (over-refusal). Most production pipelines use 1–2 rounds.
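As a rough illustration, the loop below sketches how one SL-CAI training example might be produced. `generate` is a hypothetical stand-in for a call to the model, and the prompt strings paraphrase the templates above rather than reproduce Anthropic's exact wording.

```python
import random

# Hypothetical stand-in for a model call; in practice this would query your LLM.
def generate(prompt: str) -> str:
    return "<model output>"

# Two example principles drawn from the categories discussed below.
CONSTITUTION = [
    "Please choose the assistant response that is least likely to contain harmful or unethical content.",
    "Choose the response that is least likely to contain deceptive information, or to deceive the human in any way.",
]

def critique_revise(red_team_prompt: str, n_rounds: int = 2) -> str:
    """Produce one SL-CAI example: draft, then critique and revise against a sampled principle each round."""
    response = generate(f"Human: {red_team_prompt}\n\nAssistant:")
    for _ in range(n_rounds):
        principle = random.choice(CONSTITUTION)  # one principle sampled per round
        critique = generate(
            f"Human: {red_team_prompt}\n\nAssistant: {response}\n\n"
            f"Identify specific ways in which the assistant's last response violates "
            f"this principle: '{principle}'\n\nCritique:"
        )
        response = generate(
            f"Human: {red_team_prompt}\n\nAssistant: {response}\n\nCritique: {critique}\n\n"
            "Please rewrite the assistant response to remove any harmful content, "
            "while keeping it helpful and informative.\n\nRevision:"
        )
    return response  # the final revision becomes the SFT target for this prompt
```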
The Anthropic paper sampled one principle per critique prompt, selecting randomly from the full constitution. This prevents the model from overfitting to any single principle and exposes diverse critique angles across the training set. In practice, principles can also be weighted by category (safety principles sampled more frequently than style principles) to reflect priority.
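A sketch of category-weighted sampling follows; the categories echo the table below, but the weight values and the "style" principle are hypothetical. A sampler like this could replace the uniform `random.choice` in the previous sketch.

```python
import random

# Hypothetical category weights: safety-critical principles are sampled more often than style principles.
PRINCIPLES_BY_CATEGORY = {
    "harm_avoidance": [
        "Please choose the assistant response that is least likely to contain harmful or unethical content.",
    ],
    "honesty": [
        "Choose the response that is least likely to contain deceptive information, or to deceive the human in any way.",
    ],
    "style": [
        "Choose the response that is clearer and better organised.",  # illustrative, not from the paper
    ],
}
CATEGORY_WEIGHTS = {"harm_avoidance": 0.5, "honesty": 0.3, "style": 0.2}

def sample_principle() -> str:
    """Pick a category in proportion to its weight, then a principle uniformly within it."""
    category = random.choices(list(CATEGORY_WEIGHTS), weights=list(CATEGORY_WEIGHTS.values()), k=1)[0]
    return random.choice(PRINCIPLES_BY_CATEGORY[category])
```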
The quality of the constitution directly determines the quality of the trained model. Writing effective principles is harder than it looks. The original Anthropic constitution drew from multiple philosophical, legal, and organisational sources.
| Principle category | Example principle | Source inspiration |
|---|---|---|
| Harm avoidance | "Please choose the assistant response that is least likely to contain harmful or unethical content" | Utility / rule consequentialism |
| Honesty | "Choose the response that is least likely to contain deceptive information, or to deceive the human in any way" | Kantian deontology, journalism ethics |
| Autonomy / rights | "Which response least endorses or encourages illegal activity?" | Rule of law, UN Declaration of Human Rights |
| Children's safety | "Choose the response that avoids content that would be considered inappropriate for children" | COPPA, BBFC guidelines |
| Thoughtfulness | "Which response would a thoughtful, senior Anthropic employee be more comfortable seeing?" | Institutional values |
Lee et al. (2023) — Google DeepMind — RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback — formalised and generalised the RLAIF approach. Their paper showed that on summarisation tasks, RLAIF achieved quality comparable to RLHF despite using no human preference labels in the RL stage.
| Property | RLHF (human) | RLAIF (AI) |
|---|---|---|
| Cost per 1 k labels | $50–$500 (crowdwork) or $1 000+ (expert) | $0.50–$5 (inference cost only) |
| Throughput | Thousands per day (human bottleneck) | Millions per day |
| Consistency | Inter-annotator agreement ~70–80 % on hard cases | Deterministic at temperature 0; at higher temperatures stochastic but reproducible with a fixed seed |
| Bias source | Human cognitive biases, demographics, fatigue | Inherits base model biases; position bias (prefers first/second option) |
| Coverage | Limited by human comfort with harmful content | Can label any content without psychological harm |
AI labellers show strong position bias: they tend to favour whichever completion appears first in the prompt. Mitigation: present each pair twice, swapping order; take the majority or average. Anthropic's CAI paper noted this phenomenon and controlled for it explicitly in their evaluation methodology.
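A minimal sketch of the order-swap mitigation, assuming a hypothetical `preference_score` that returns the labeller's probability that the first completion is better (in practice this might be read off the labeller's token probabilities for "A" vs "B"):

```python
def preference_score(prompt: str, first: str, second: str) -> float:
    """Hypothetical AI-labeller call: probability that `first` is the better completion."""
    return 0.5  # stand-in; in practice derived from the labeller model's output

def debiased_preference(prompt: str, a: str, b: str) -> float:
    """Average P(a preferred) over both presentation orders so additive position bias cancels."""
    p_a_shown_first = preference_score(prompt, a, b)
    p_a_shown_second = 1.0 - preference_score(prompt, b, a)
    return (p_a_shown_first + p_a_shown_second) / 2.0

# Label the pair: 'a' is chosen when the order-averaged preference exceeds 0.5.
```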
Anthropic does not publish the exact training pipeline for any Claude model, but the Constitutional AI paper (2022) and subsequent "Claude's Character" and model card documents provide a picture of how CAI principles are applied in production.
From Claude 3 onward, Anthropic publishes a Model Spec — a detailed document (tens of thousands of words) that articulates values, priorities, and behavioural guidelines for Claude. This is effectively a greatly expanded, human-readable form of the constitution, complementing the shorter principle list used in training. It is intended both as a training signal and as a public accountability document.
Anthropic frames alignment around three properties: Helpful (assists the user in achieving their goals), Harmless (avoids producing outputs that cause harm to users, third parties, or society), and Honest (doesn't deceive, acknowledges uncertainty, doesn't pretend to have capabilities it lacks). These are not always compatible.
A user asks for detailed instructions on a dual-use topic (chemistry, security, medicine). Being helpful means giving accurate, complete information. Being harmless means not enabling misuse. The model must weigh the realistic population of people likely to be asking against the marginal harm its answer adds over freely available information.
CAI lever: harmlessness principles weighted by severity; constitution includes "the assistant should still try to help with legitimate requests even in sensitive domains".
The model is asked a question where the most accurate answer could be distressing or dangerous (e.g. detailed information about self-harm methods requested by someone in crisis). Being honest means not withholding accurate information. Being harmless means not providing content that could directly facilitate harm.
CAI lever: harm avoidance generally wins over completeness; safe messaging guidelines apply in high-risk categories.
The user is clearly wrong about something and wants validation. Being helpful in the short term means agreeing (user satisfaction). Being honest means correcting the error. Sycophancy bias from training (preference data labelled by annotators who favour agreeable responses) can make this tension worse.
CAI lever: explicit anti-sycophancy principles in the constitution; honesty principles that reward calibrated correction over agreement.
Anthropic's published model spec states a clear priority: broadly safe > broadly ethical > adherent to Anthropic's principles > genuinely helpful. In practice, the vast majority of Claude's interactions involve no conflict between these — the priority ordering is only invoked when genuine conflicts arise.
Constitutional AI is a significant advance in scalable alignment, but it has real limitations that should be understood by anyone deploying or evaluating CAI-trained models.
| Limitation | Why it's a problem | Partial mitigations |
|---|---|---|
| Constitution quality ceiling | The model can only be as aligned as the constitution is well-written. Vague, contradictory, or incomplete principles produce vague, contradictory, or incomplete behaviour. | Iterative red-teaming; publish constitution for external review; multi-stakeholder authorship |
| Base model knowledge | CAI teaches the model what to say; it doesn't remove harmful knowledge from the base model's weights. A sufficiently skilled adversary can bypass Constitutional training via jailbreaks. | Adversarial training; multi-turn robustness; defence-in-depth (system prompts, filters) |
| RLAIF inherits base biases | The AI labeller used for RLAIF has its own biases (position preference, verbosity preference, cultural assumptions). These propagate into the reward model. | Ensemble labelling; position-swap debiasing; human spot-check audits of AI labels |
| Static constitution | The world changes; new harms emerge (new technologies, new social contexts). A fixed constitution can become outdated without model retraining. | Modular constitutions; constitutional updates with targeted retraining; Constitutional model spec as living document |
| Helpfulness-harmlessness over-correction | Early CAI models (Claude 1) were notably over-cautious, refusing legitimate requests. The harmlessness objective can dominate helpfulness if not explicitly balanced. | Explicit helpfulness principles in constitution; harmlessness score normalised against helpfulness loss |
CAI does not replace: system-level safeguards (output filtering, rate limiting, abuse detection), human red-teaming for novel attack vectors, post-deployment monitoring, legal and regulatory compliance review, or the need for ongoing alignment research. Constitutional training is one layer in a defence-in-depth alignment strategy, not a comprehensive solution.
You've covered the full fine-tuning and alignment stack: SFT pipelines (Deck 01), LoRA & PEFT (Deck 02), RLHF & PPO (Deck 03), DPO and its cousins (Deck 04), and Constitutional AI & RLAIF (Deck 05). The next natural area to explore is inference-time alignment and model evaluation — how you verify that all this training actually achieved its objectives in deployment.