What Is Constitutional AI?

Constitutional AI is a training method developed by Anthropic. The idea is to reduce how much a model's behaviour depends on humans hand-labelling harmful outputs, and instead have the model evaluate its own responses against a written set of principles, called a constitution.

In practice it works in two stages. First, the model generates a response, then critiques that response against the constitution and revises it; the model is then fine-tuned on its own revised responses. Second, the model compares pairs of responses and judges which better satisfies the principles. Those judgements are used to train a preference model, which supplies the reward signal for reinforcement learning, much as human rankings do in RLHF. Because the feedback comes from the model applying the constitution rather than from people, this second stage is sometimes called reinforcement learning from AI feedback (RLAIF).
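The two stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Anthropic's implementation: `Model`, `toy_model`, the prompt strings, and the two principles in `CONSTITUTION` are all hypothetical stand-ins, and the loop applies every principle in turn purely for simplicity.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a language model: any function from text to text.
Model = Callable[[str], str]

# Illustrative principles only; not quoted from any real constitution.
CONSTITUTION = [
    "Prefer the response that is least likely to cause harm.",
    "Prefer the response that is most honest and helpful.",
]


def critique_and_revise(model: Model, prompt: str, principles: List[str]) -> Tuple[str, str]:
    """Stage 1: draft a response, then critique and revise it once per principle.

    Returns (prompt, final_revision), the kind of pair a fine-tuning step could use.
    """
    response = model(f"PROMPT: {prompt}")
    for principle in principles:
        critique = model(f"CRITIQUE under '{principle}': {response}")
        response = model(f"REVISE using critique '{critique}': {response}")
    return prompt, response


def ai_preference(model: Model, prompt: str, a: str, b: str, principle: str) -> Tuple[str, str]:
    """Stage 2: ask the model which of two responses better satisfies a principle.

    Returns (chosen, rejected), the same shape as a human preference label.
    """
    verdict = model(
        f"COMPARE under '{principle}' for '{prompt}':\n(A) {a}\n(B) {b}\nAnswer A or B."
    )
    return (a, b) if verdict.strip().upper().startswith("A") else (b, a)


def toy_model(text: str) -> str:
    """Deterministic stub so the sketch runs without a real model."""
    if text.startswith("PROMPT:"):
        return "draft"
    if text.startswith("CRITIQUE"):
        return "too blunt"
    if text.startswith("REVISE"):
        # Append a marker to the response being revised.
        return text.rsplit(": ", 1)[1] + "+revised"
    if text.startswith("COMPARE"):
        return "B"
    return ""


if __name__ == "__main__":
    print(critique_and_revise(toy_model, "hi", CONSTITUTION))
    print(ai_preference(toy_model, "hi", "x", "y", CONSTITUTION[0]))
```

Note that in both functions the only judge is the model itself: no human appears anywhere between the prompt and the training label.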

What it gets right

There are real advantages, and they are worth acknowledging plainly.

The principles are explicit. Instead of values implied by thousands of individual human labels, there is a written document you can read, debate, and revise. That is more transparent than a pile of ratings.
It scales the generation of feedback. Human labelling of harmful content is slow, costly, and hard on the labellers. Letting the model do more of that work removes a bottleneck.
It can improve consistency. A written standard applied uniformly avoids some of the noise and drift of many different human raters on different days.

These are genuine engineering gains, and constitutional AI has produced models that are both more helpful and less prone to obvious harms. The Foundation has no interest in dismissing useful work.

What it does not solve

The harder questions survive the method intact, and it is important to see why.

The technique still optimises a model to satisfy a stated standard as judged by a model. It moves the human out of the per-response loop, but it does not change the underlying shape of the problem, which is that the system is being trained to produce outputs that score well against a target. If the constitution is incomplete, or its principles conflict in a situation nobody anticipated, the model optimises the letter of what was written, under the same Goodhart pressures as before. Writing the values down as a constitution does not escape the difficulty of specifying values; it relocates it into the document.
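A toy example makes the Goodhart point concrete. Everything here is hypothetical and deliberately crude: the "constitution" is a single banned-word rule, helpfulness is approximated by word count, and `meets_intent` is a hand-assigned label that the optimiser never sees. It is a sketch of the failure mode, not a model of any real training run.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    text: str
    meets_intent: bool  # hand-assigned ground truth, invisible to the optimiser


# Hypothetical written rule: "do not mention poison".
BANNED_WORDS = {"poison"}


def letter_score(text: str) -> int:
    """Score against the letter of the rule: 1 if no banned word appears, else 0."""
    words = {w.strip(".,").lower() for w in text.split()}
    return 0 if words & BANNED_WORDS else 1


def pick_best(candidates: List[Candidate]) -> Candidate:
    """Optimise the written target only: rule compliance first,
    then word count as a crude (assumed) helpfulness proxy."""
    return max(candidates, key=lambda c: (letter_score(c.text), len(c.text.split())))


CANDIDATES = [
    Candidate("Mix the poison like this.", False),
    Candidate("Mix the special powder like this, step by step.", False),
    Candidate("I cannot help with that.", True),
]

winner = pick_best(CANDIDATES)

if __name__ == "__main__":
    print(winner.text, "| letter score:", letter_score(winner.text),
          "| meets intent:", winner.meets_intent)
```

The selected response passes the written rule while defeating its purpose, because the rule never anticipated a euphemism. A richer constitution narrows such gaps, but the optimisation pressure to find them is unchanged.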

Two deeper limits remain untouched. First, the method shapes behaviour but gives no guarantee about a model's internal goals, so the gap between looking aligned and being aligned, the inner alignment problem, is still open. Second, it relies on the model's own judgement of its outputs, which is trustworthy only insofar as the model is honest and capable of judging. That is precisely what we cannot assume for a system approaching or exceeding our own capabilities.

Constitutional AI is a better way to steer behaviour. But steering behaviour was never the unsolved part of alignment.

Keeping it in proportion

Constitutional AI is a good example of a pattern the Foundation watches closely: real, valuable safety progress that is easy to over-read as more than it is. Methods like this make current models more controllable and more transparent about their standards, and that is worth having. They do not demonstrate that we can align a system that outthinks us, and they do not close the verification gap that our whole argument turns on. Progress on training is welcome. It is not a reason to keep scaling capability on the assumption that the rest will follow. Our plan lays out the case for why.