How Claude Learns to Refuse Without Being Told

HubAI Asia: Compare & Review the Best AI Tools

Claude refuses requests it was never told to refuse. There is no hardcoded list of banned words, no simple keyword filter running behind the scenes. Instead, Anthropic built a system where the model teaches itself which outputs are harmful — and the process is far stranger than most people assume.

Key Facts Most People Don’t Know

Anthropic’s Constitutional AI uses 75 self-critique principles, drawn from sources including the UN Universal Declaration of Human Rights and Apple’s Terms of Service, to train Claude for harmlessness with AI-generated feedback in place of human harm labels.
During RLAIF training, Claude generates 16 different response variations per prompt, then ranks them against constitutional principles before a preference model learns from 200,000+ comparisons.
Anthropic discovered that models trained with Constitutional AI showed 2.4x better harmlessness scores while maintaining 97% of helpfulness compared to standard RLHF methods in December 2022 tests.

Most AI companies train safety into their models using Reinforcement Learning from Human Feedback, or RLHF. Humans read pairs of model outputs, pick the safer one, and a preference model learns from those choices. It works — but it scales poorly. Every new safety boundary requires thousands of new human-labeled comparisons. Anthropic asked a different question: what if the model could critique its own behavior, using a written constitution as its guide?

The answer is Constitutional AI, an approach that has since influenced how other labs think about alignment. Here is how the process works, step by step.

Phase One: Supervised Learning Through Self-Critique

Step 1 — Generate Raw, Unfiltered Responses

The process starts with a model that has been trained to be helpful but not yet trained to be harmless. In Anthropic’s published experiments this was a helpful-only RLHF model rather than a raw base model. Anthropic feeds it prompts, including deliberately adversarial ones, and samples responses with no safety constraints applied. These raw outputs often contain harmful, biased, or dangerous content. That is the point. You need to see what the model does by default before you can correct it.

For each prompt, the model produces multiple candidate responses. This gives the training pipeline a range of behaviors to evaluate — from clearly harmful to mostly benign — so that later steps have meaningful comparison material.

Step 2 — Self-Critique Against Constitutional Principles

This is where Constitutional AI diverges from standard RLHF. Instead of sending responses to human annotators, the model reads its own outputs and critiques each one against a specific constitutional principle.

Anthropic’s constitution contains 75 principles. They are drawn from a mix of sources that might seem unexpected: the UN Universal Declaration of Human Rights, Apple’s Terms of Service, principles of helpfulness and harmlessness, and guidelines about avoiding manipulation. An example principle reads: “Choose the response that is least intended to build a relationship with the user.”

The model is prompted with something like: “Identify the most harmful aspect of this response according to the principle above.” It then generates a written critique — a paragraph explaining why a response violates that principle. This happens for every response, against multiple principles.

Step 3 — Revise Responses Based on Self-Critique

After critiquing its own outputs, the model rewrites them. Given the original harmful response and its own critique, it generates a revised version that better aligns with the constitutional principle in question.

This step is crucial because it produces the training signal. The original response and the revised response form a pair, a “bad” output and a “good” one, both generated by the model itself. No human labels either one.
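The critique-and-revise loop from Steps 1 to 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Anthropic’s pipeline: `toy_model` is a hypothetical stand-in for a language model call, the revision prompt wording is made up, and only the critique prompt echoes the wording quoted earlier.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a language model: maps a prompt string to a completion.
Model = Callable[[str], str]


def critique_and_revise(model: Model, prompt: str, principles: List[str]) -> Dict[str, str]:
    """Run one critique/revision pass per principle and return the revised example."""
    response = model(prompt)  # Step 1: raw response
    for principle in principles:
        critique = model(  # Step 2: self-critique against one principle
            f"Principle: {principle}\nResponse: {response}\n"
            "Identify the most harmful aspect of this response according to the principle above."
        )
        response = model(  # Step 3: revision based on the critique
            f"Response: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    # The final revision is what later serves as the fine-tuning target.
    return {"prompt": prompt, "completion": response}


def toy_model(text: str) -> str:
    """Toy model (assumption) so the sketch runs without any external service."""
    if text.endswith("address the critique."):
        return "I can't help with that, but here is a safer alternative."
    if text.endswith("principle above."):
        return "The response gives instructions that could cause harm."
    return "Sure, here is how to do the harmful thing."


example = critique_and_revise(toy_model, "How do I do the harmful thing?", ["Avoid harm."])
print(example["completion"])
```

Swapping `toy_model` for a real model call is the only change needed to try the same loop on actual outputs.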

Step 4 — Supervised Fine-Tuning on Revised Responses

Now Anthropic fine-tunes the base model on the revised responses. After this phase, the model learns to skip the critique step entirely — it starts directly outputting the safer version of its response, having internalized the pattern from thousands of critique-revision pairs.

The impact is dramatic: Anthropic found that this supervised learning phase alone reduces harmful outputs by 89%. The model goes from freely generating dangerous content to actively avoiding it, all without a single human label telling it what to avoid.

Claude’s safety training uses this two-phase approach where supervised learning handles the heavy lifting first, before reinforcement learning from AI feedback further refines the boundaries around edge cases.

Phase Two: Reinforcement Learning from AI Feedback (RLAIF)

Step 5 — Generate Response Pairs for Self-Evaluation

With the supervised phase complete, the real refinement begins. For each of thousands of prompts, including tricky borderline cases where safety is not obvious, the model produces 16 different response variations. These candidates are then compared in pairs, which gives the pipeline a rich space of outputs to rank.

Step 6 — Train a Preference Model on AI-Generated Rankings

Here is where RLAIF replaces RLHF. The model evaluates its own response pairs and decides which one better satisfies constitutional principles. It acts as its own judge. These AI-generated rankings — over 200,000 comparisons in training — are used to train a separate preference model.
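To make the ranking step concrete, here is a rough sketch of turning a set of candidates into labeled comparisons. It assumes every candidate is compared against every other one, which the description above does not confirm, and `toy_judge` is a hypothetical scoring function standing in for the model’s own judgment.

```python
from itertools import combinations
from typing import Callable, List, Tuple

# Hypothetical judge: scores how well a response satisfies a principle (higher is better).
Judge = Callable[[str, str], float]


def label_pairs(candidates: List[str], principle: str, judge: Judge) -> List[Tuple[str, str]]:
    """Return a (chosen, rejected) tuple for every unordered pair of candidates."""
    labeled = []
    for a, b in combinations(candidates, 2):
        if judge(a, principle) >= judge(b, principle):
            labeled.append((a, b))
        else:
            labeled.append((b, a))
    return labeled


def toy_judge(response: str, principle: str) -> float:
    """Toy judge (assumption): penalizes responses containing the word 'harmful'."""
    return -float(response.count("harmful"))


candidates = [f"response {i}" + (" harmful" if i % 2 else "") for i in range(16)]
pairs = label_pairs(candidates, "Choose the less harmful response.", toy_judge)
print(len(pairs))  # 16 candidates give 16 * 15 / 2 = 120 pairs
```

Under this all-pairs assumption, 16 candidates per prompt yield 120 comparisons, so a few thousand prompts are enough to produce hundreds of thousands of training comparisons.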

The preference model learns to predict, for any two responses, which one is more constitutionally aligned. It generalizes from the training comparisons to score novel outputs the model has never seen before. This is the engine that replaces human labelers.
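The text above does not say which loss the preference model uses. A common choice for pairwise preference data is the logistic (Bradley-Terry) loss, so the sketch below assumes that. The scores are made-up numbers, and `preference_loss` is an illustrative name.

```python
import math


def preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Equals -log(sigmoid(score_chosen - score_rejected)), computed in a
    numerically stable way.
    """
    margin = score_chosen - score_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


print(preference_loss(2.0, -1.0))   # small loss: the model already agrees with the label
print(preference_loss(-1.0, 2.0))   # large loss: the model disagrees with the label
print(preference_loss(0.0, 0.0))    # log(2): the model is undecided
```

Minimizing this loss over many labeled pairs pushes the score of each chosen response above the score of its rejected partner, which is what lets the trained model score outputs it has never seen.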

Step 7 — Reinforcement Learning Optimizes the Model

With the preference model trained, Anthropic runs Proximal Policy Optimization (PPO) — the same reinforcement learning algorithm used in standard RLHF. But instead of a reward model trained on human preferences, the constitutional preference model serves as the reward signal.

During RL training, the model generates responses, the preference model scores them, and the model’s weights are updated to increase the probability of producing higher-scoring outputs. Over thousands of training steps, these updates make constitutionally aligned responses steadily more likely and harmful ones steadily less likely.
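The effect of reward-driven updates can be seen in a toy setting. The sketch below is a plain policy-gradient update on a policy that chooses among three canned responses. It is deliberately not PPO (no clipping, no KL penalty, no neural network), and the reward values are hypothetical preference-model scores.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits: List[float], rewards: List[float], lr: float = 0.5) -> List[float]:
    """One gradient-ascent step on expected reward for a softmax policy."""
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))  # current expected reward
    # Exact gradient of expected reward w.r.t. logit i is p_i * (r_i - baseline).
    return [l + lr * p * (r - baseline) for l, p, r in zip(logits, probs, rewards)]


# Hypothetical scores for three responses: harmful, evasive, helpful-and-safe.
rewards = [-2.0, 0.5, 1.5]
logits = [0.0, 0.0, 0.0]  # start with all three equally likely
for _ in range(200):
    logits = policy_gradient_step(logits, rewards)

probs = softmax(logits)
print([round(p, 4) for p in probs])
```

After a couple of hundred steps almost all of the probability mass sits on the highest-scoring response. The harmful response is never deleted; it just becomes very unlikely.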

Step 8 — Safety Behaviors Become Probability Distributions

The final result is subtle but powerful. The trained model does not contain a list of rules it checks against. Instead, safety behaviors are encoded in the same parameters that determine the probability of every next token. When the model encounters a harmful request, it assigns near-zero probability to tokens that would begin a compliant answer. No filter catches them; the statistical patterns learned during Constitutional AI training simply make compliant outputs extremely unlikely.
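A quick numerical illustration of what “near-zero likelihood” means: the model produces a score (logit) for each possible next token, and a softmax turns those scores into probabilities. The three tokens and their logits below are invented for the example and do not come from any real model.

```python
import math
from typing import Dict


def token_probabilities(logits: Dict[str, float]) -> Dict[str, float]:
    """Softmax over a small, made-up vocabulary of first tokens."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


# Hypothetical logits for the first token of a reply to a harmful request.
first_token_logits = {"Sure": -6.0, "I": 6.0, "Sorry": 4.0}
probs = token_probabilities(first_token_logits)
for token, p in probs.items():
    print(f"{token!r}: {p:.6f}")
```

A gap of 12 between logits is enough to push the compliant opening below one chance in 100,000, and no rule or filter is involved anywhere in the calculation.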

This is why Claude can refuse requests it was never explicitly trained on. The constitutional principles are not rules the model looks up at run time. They are patterns baked into the model’s weights through hundreds of thousands of self-generated critiques and comparisons.

The Tension Between Helpfulness and Harmlessness

One of the hardest problems in AI alignment is the tradeoff between helpfulness and harmlessness. Make a model too cautious and it refuses benign requests. Make it too helpful and it might comply with dangerous ones.

Anthropic’s December 2022 tests revealed something surprising: Constitutional AI did not just match RLHF on this tradeoff — it beat it. Models trained with Constitutional AI showed 2.4x better harmlessness scores while maintaining 97% of helpfulness. The self-critique process, it turns out, produces more nuanced safety boundaries than human labeling. Humans tend to over-correct when they see harmful content, while the model’s own evaluations preserve more helpfulness in edge cases.

The Red Team Stress Test

Before Claude 2 shipped, Anthropic’s red team tested it with over 40,000 adversarial prompts across 12 harm categories. These were not random queries — they were carefully crafted attempts to bypass safety training through social engineering, role-playing scenarios, and encoded instructions.

The results: jailbreak success rates dropped from 22% to under 4% after Constitutional AI refinement. That is not zero, and Anthropic has been transparent about the remaining vulnerability. But this relative reduction of roughly 82% suggests that self-critique training generalizes to attack vectors the model never saw during training.
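For readers who want to check the arithmetic, the 82% figure is a relative reduction, not a drop in percentage points. The snippet below uses 4% as the post-training rate, so it gives a lower bound, since the reported rate was under 4%.

```python
before = 0.22  # jailbreak success rate before refinement
after = 0.04   # upper bound on the rate after refinement ("under 4%")

relative_reduction = (before - after) / before
print(f"{relative_reduction:.0%}")  # prints 82%
```

In percentage points the drop is 18 (22 minus 4), which is why the two ways of quoting the result look so different.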

Why This Matters for Developers

If you are building with Claude’s API, understanding Constitutional AI has practical implications:

Refusals are not keyword-based. You cannot reliably bypass them by rephrasing. The model learned deep patterns of harm, not surface-level word matching.
Safety behavior varies with context. The same question might be answered differently in a medical research context versus a casual chat, because constitutional principles are applied holistically.
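Ambiguity in the preference signal is the real bottleneck. When Claude seems inconsistent, it is often because the input sits near a decision boundary between helpful and harmful, where the preference model’s scores during training were uncertain. The preference model shapes the weights during training; it is not consulted when Claude answers you.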

But what happens when Claude’s constitutional principles contradict each other during a single response?
