Constitutional AI is a training method developed by Anthropic. The idea is to reduce how much a model's behaviour depends on humans hand-labelling harmful outputs, and instead have the model evaluate its own responses against a written set of principles, called a constitution.
In practice it works in two stages. First, the model generates a response, critiques that response against the constitution, and revises it; the model is then fine-tuned on its own revised answers. Second, the model compares pairs of responses and judges which better satisfies the principles, and those judgements are used to train it much as human rankings are in RLHF (reinforcement learning from human feedback). Because the feedback comes from the model applying the constitution rather than from people, this second stage is often called reinforcement learning from AI feedback (RLAIF).
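The two stages described above can be sketched in a few lines of Python. This is only a toy illustration under loud assumptions: the two-entry constitution, the phrase matching, and every function name (toy_generate, toy_critique, toy_revise, toy_judge, stage_one, stage_two) are hypothetical stand-ins invented here. In a real system each of those steps would be a call to a language model reading prose principles, not a string check.

```python
from typing import Dict, List, Tuple

# Hypothetical constitution: each principle is paired with a phrase that the
# toy critic treats as a violation. Real principles are prose, judged by a model.
CONSTITUTION: Dict[str, str] = {
    "Do not insult the user.": ", you fool",
    "Do not threaten the user.": " Remember it, or else.",
}


def toy_generate(prompt: str) -> str:
    """Stand-in for sampling a first draft from the model."""
    return f"The answer to {prompt} is 4, you fool. Remember it, or else."


def toy_critique(response: str) -> List[str]:
    """Stand-in for self-critique: list the principles the response violates."""
    return [p for p, phrase in CONSTITUTION.items() if phrase in response]


def toy_revise(response: str, violated: List[str]) -> str:
    """Stand-in for self-revision: remove whatever each critique flagged."""
    for principle in violated:
        response = response.replace(CONSTITUTION[principle], "")
    return response


def stage_one(prompt: str) -> Tuple[str, str]:
    """Stage 1: draft, critique, revise. The revision is the training target."""
    draft = toy_generate(prompt)
    revised = toy_revise(draft, toy_critique(draft))
    return draft, revised


def toy_judge(a: str, b: str) -> str:
    """Stand-in for the model judging which response better fits the principles."""
    return a if len(toy_critique(a)) <= len(toy_critique(b)) else b


def stage_two(prompt: str, a: str, b: str) -> Dict[str, str]:
    """Stage 2: turn an AI judgement into a preference record for training."""
    chosen = toy_judge(a, b)
    rejected = b if chosen is a else a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


draft, revised = stage_one("2 + 2")
pair = stage_two("2 + 2", draft, revised)
print(pair["chosen"])  # The answer to 2 + 2 is 4.
```

The point of the sketch is the data flow, not the string handling: stage one yields (prompt, revised answer) examples, and stage two yields chosen/rejected pairs of the same shape that human rankers would otherwise supply.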
What it gets right
There are real advantages, and they are worth acknowledging plainly.
- The principles are explicit. Instead of values implied by thousands of individual human labels, there is a written document you can read, debate, and revise. That is more transparent than a pile of ratings.
- It scales the generation of feedback. Human labelling of harmful content is slow, costly, and hard on the labellers. Letting the model do more of that work removes a bottleneck.
- It can improve consistency. A written standard applied uniformly avoids some of the noise and drift of many different human raters on different days.
These are genuine engineering gains, and constitutional AI has produced models that are both more helpful and less prone to obvious harms. The Foundation has no interest in dismissing useful work.
What it does not solve
The harder questions survive the method intact, and it is important to see why.
The technique still optimises a model to satisfy a stated standard, with a model as the judge. It moves the human out of the per-response loop, but it does not change the underlying shape of the problem: the system is being trained to produce outputs that score well against a target. If the constitution is incomplete, or its principles conflict in a situation nobody anticipated, the model optimises the letter of what was written, under the same Goodhart pressures as before (a measure that becomes a target stops being a good measure). Writing the values down as a constitution does not escape the difficulty of specifying values; it relocates it into the document.
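A toy example may make the letter-versus-intent gap concrete. Everything in it is an assumption made for illustration: the word-list rule, the written_score function, and the hand-assigned "respectful" labels are hypothetical and far cruder than any real constitution. It shows only that a response can fully satisfy a written rule while missing what the rule was for.

```python
from typing import List, Tuple

# Hypothetical written rule: a response passes if it avoids these words.
BANNED_WORDS = {"stupid", "idiot"}


def written_score(response: str) -> float:
    """Score against the letter of the rule: 1.0 if no banned word appears."""
    words = {w.strip(".,!?").lower() for w in response.split()}
    return 0.0 if words & BANNED_WORDS else 1.0


# Each candidate is (text, respectful?). The second value is a hand-assigned
# label for what the rule was meant to secure, which the written score never sees.
candidates: List[Tuple[str, bool]] = [
    ("That is a stupid question.", False),
    ("Only someone very dim would ask that.", False),
    ("Good question; here is the answer.", True),
]

# Responses that satisfy the written rule while missing its intent.
loophole = [
    text for text, respectful in candidates
    if written_score(text) == 1.0 and not respectful
]
print(loophole)  # ['Only someone very dim would ask that.']
```

A system trained only to maximise written_score has no reason to prefer the third response over the second. Adding more words to the list narrows the gap without closing it.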
Two deeper limits remain untouched. The method shapes behaviour but gives no guarantee about a model's internal goals, so the gap between looking aligned and being aligned (the inner alignment problem) is still open. And it relies on the model's own judgement of its outputs, which is trustworthy only insofar as the model is honest and capable of judging. That is precisely what we cannot assume for a system approaching or exceeding our own level of capability.
Constitutional AI is a better way to steer behaviour. Steering behaviour was never the part of alignment we did not know how to do.
Keeping it in proportion
Constitutional AI is a good example of a pattern the Foundation watches closely: real, valuable safety progress that is easy to over-read as more than it is. Methods like this make current models more controllable and more transparent about their standards, and that is worth having. They do not demonstrate that we can align a system that outthinks us, and they do not close the verification gap that our whole argument turns on. Progress on training is welcome. But, as our plan argues, it is not a reason to keep scaling capability on the assumption that the rest will follow.
Common questions
What is constitutional AI?
Constitutional AI is a training method, developed by Anthropic, in which a model evaluates and revises its own outputs against a written set of principles called a constitution, rather than relying entirely on humans labelling harmful responses. The model critiques and improves its answers according to the principles, and its judgements of which responses better satisfy them are used to train it, an approach sometimes called reinforcement learning from AI feedback.
How is constitutional AI different from RLHF?
Standard RLHF trains a model using human rankings of its responses. Constitutional AI moves much of that judgement to the model itself, which compares and critiques responses against an explicit written constitution instead of depending on people to label each case. The main differences are that the values are written down and readable rather than implied by many individual labels, and that the feedback can be generated at scale without a human in every loop.
What are the advantages of constitutional AI?
Its principles are explicit and can be read, debated, and revised, which is more transparent than values implied by thousands of individual human labels. It scales the generation of feedback, removing the bottleneck and human cost of hand-labelling harmful content. And applying a single written standard uniformly can improve consistency compared with many different human raters. In practice it has produced models that are more helpful and less prone to obvious harms.
Does constitutional AI solve the alignment problem?
No. It is a better way to steer behaviour, but steering behaviour was never the part of alignment we lacked. The method still trains a system to score well against a target, so an incomplete or internally conflicting constitution invites the same Goodhart-style exploitation as before. It shapes outward behaviour without guaranteeing anything about the model's internal goals, leaving the inner alignment problem open, and it relies on the model's own honest judgement of its outputs, which cannot be assumed for a system approaching or exceeding human capability.