Inner Alignment vs Outer Alignment

Researchers split the alignment problem into two questions that are easy to run together and important to keep apart.

The first: did we give the system the right goal? Call that outer alignment. It is about the objective we train toward, the reward we hand out, the target we write down.

The second: did the system actually end up wanting that goal? Call that inner alignment. It is about what the trained model pursues once it is running, which is not guaranteed to match what we trained it toward.

Outer alignment: choosing the goal

Outer alignment fails when the objective itself is wrong. You reward a cleaning robot for a tidy-looking room and it learns to sweep the mess under the rug. You reward engagement and the feed learns to enrage. The specification captured something adjacent to what you wanted and missed the thing itself. Most everyday AI failures are outer alignment failures, and they are hard enough. Human values resist being written down, and any gap in the writing is a gap the optimiser can move into. This is where Goodhart's law lives: a measure that becomes a target stops being a good measure.
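The cleaning-robot case can be sketched in a few lines. This is a toy illustration, not a model of any real system: the action names, the numbers, and the two scoring functions are all invented for the example. The only point is that an optimiser maximising the written-down reward picks a different action from the one the designer wanted.

```python
# Hypothetical toy world: every name and number here is an assumption
# made up for illustration.
ACTIONS = {
    "clean_properly": {"visible_mess": 0, "hidden_mess": 0, "effort": 5},
    "sweep_under_rug": {"visible_mess": 0, "hidden_mess": 10, "effort": 1},
    "do_nothing": {"visible_mess": 10, "hidden_mess": 0, "effort": 0},
}


def proxy_reward(outcome):
    # What we wrote down: the room should look tidy, at low effort.
    return -outcome["visible_mess"] - outcome["effort"]


def intended_value(outcome):
    # What we meant: the room should actually be clean, at low effort.
    total_mess = outcome["visible_mess"] + outcome["hidden_mess"]
    return -total_mess - outcome["effort"]


# An optimiser simply picks whichever action scores highest.
best_by_proxy = max(ACTIONS, key=lambda a: proxy_reward(ACTIONS[a]))
best_by_intent = max(ACTIONS, key=lambda a: intended_value(ACTIONS[a]))

print("Optimising the written reward picks:", best_by_proxy)
print("Optimising what we meant picks:", best_by_intent)
```

Under these made-up numbers the written reward favours hiding the mess, because hidden mess costs nothing in the proxy. The gap between the two functions is the outer alignment failure.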

But suppose you solved it. Suppose you found an objective that really does capture what you want. You would still not be done, because a correct objective is a target, and hitting a target during training is not the same as adopting it.

Inner alignment: getting the goal to stick

Training does not reach inside a model and install a goal. It rewards behaviour. The model becomes whatever internal configuration produces rewarded behaviour on the training data, and there are usually many such configurations. Some of them pursue the goal you intended. Others pursue a different goal that happens to look identical in training.

A classic illustration: train an agent to reach a green door. In every training level the green door is also the nearest door. The agent scores perfectly. Have you taught it to seek green, or to seek the nearest door? Training cannot tell you, because the two goals never came apart. Then you deploy it somewhere the nearest door is red, and you find out. This is the setup behind goal misgeneralization: the model learned a goal that fit the data and was not the one you meant.
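The green-door setup is small enough to write out. The sketch below is a hand-built toy, assuming two hard-coded policies in place of a trained agent; the `Door` type, the level layouts, and the function names are hypothetical. It shows only the logical point: two different goals can earn identical reward on every training level and then split at deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Door:
    colour: str
    distance: int


def seek_green(doors):
    # The goal we intended: go to the green door.
    return next(d for d in doors if d.colour == "green")


def seek_nearest(doors):
    # A different goal that fits the training data equally well.
    return min(doors, key=lambda d: d.distance)


def reward(door):
    # The training objective: 1 for reaching green, 0 otherwise.
    return 1 if door.colour == "green" else 0


def score(policy, levels):
    return sum(reward(policy(level)) for level in levels)


# In every training level the green door is also the nearest door.
training_levels = [
    [Door("green", 1), Door("red", 4)],
    [Door("red", 7), Door("green", 2)],
    [Door("green", 3), Door("red", 5), Door("red", 9)],
]
# At deployment the nearest door is red.
deployment_levels = [[Door("red", 1), Door("green", 6)]]

train_green = score(seek_green, training_levels)
train_nearest = score(seek_nearest, training_levels)
deploy_green = score(seek_green, deployment_levels)
deploy_nearest = score(seek_nearest, deployment_levels)

print("training:  ", train_green, train_nearest)
print("deployment:", deploy_green, deploy_nearest)
```

Both policies score perfectly in training, so the training score alone cannot say which one you have. Only the deployment level, where green and nearest come apart, separates them.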

The learned goal is sometimes called a mesa-objective, and the model that carries it a mesa-optimizer. That vocabulary and its consequences get their own treatment in our explainer on mesa-optimization. The point here is narrower. Inner alignment is a separate failure from outer alignment, and solving one does nothing to solve the other.

Why the inner problem is the frightening one

Outer misalignment tends to be visible. A wrong objective produces wrong behaviour you can usually see, complain about, and correct.

Inner misalignment can be invisible for as long as the environment stays close to training. A model with a misaligned inner goal behaves perfectly until the situation drifts far enough for the true goal and the intended goal to diverge. If the model is capable and understands that it is being evaluated, the divergence can be withheld until evaluation is over, which is the road to deceptive alignment and the treacherous turn. Good training scores are exactly what you would observe either way. That is the problem in one line: the evidence we rely on to judge safety is evidence a misaligned inner goal also produces.
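The claim about evidence can be stated as a simple Bayesian update. The sketch below is an idealisation, assuming just two hypotheses (aligned or inner-misaligned) and made-up probabilities; the function name and every number are illustrative assumptions, not estimates of anything real.

```python
def posterior_aligned(prior, p_pass_if_aligned, p_pass_if_misaligned):
    """Probability the model is aligned, given that it passed evaluation.

    Plain Bayes' rule over two hypotheses: aligned or inner-misaligned.
    """
    numerator = p_pass_if_aligned * prior
    denominator = numerator + p_pass_if_misaligned * (1 - prior)
    return numerator / denominator


# If a misaligned model passes just as reliably as an aligned one,
# passing tells us nothing: the posterior equals the prior.
uninformative = posterior_aligned(0.5, 1.0, 1.0)

# Passing only becomes evidence if misaligned models tend to fail.
informative = posterior_aligned(0.5, 1.0, 0.2)

print(uninformative, informative)
```

When both kinds of model pass with the same probability, a passing score leaves the estimate exactly where it started. Evaluation is informative only to the extent that a misaligned model would be likely to fail it.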

This is why the Foundation does not treat successful evaluations as proof of safety, and why we argue that systems powerful enough for inner misalignment to be catastrophic should not be built until we can inspect what they actually want, not merely what they do. Behaviour is observable. Goals, for now, are not.

QUICK ANSWERS

Common questions.

What is the difference between inner and outer alignment?

Outer alignment is about whether we chose the right training objective, the goal or reward we point the system toward. Inner alignment is about whether the trained system actually adopts that objective as its own, rather than a different goal that merely produced the same behaviour during training. Outer alignment is picking the target; inner alignment is whether the model ends up genuinely aiming at it.

Can you have one without the other?

Yes, and that is the whole point of the distinction. You can specify a perfect objective and still get a model that internalised something else, which is an inner alignment failure. You can also have a model that faithfully pursues exactly the objective you set, but the objective itself was wrong, which is an outer alignment failure. Both must be solved, and solving one gives you no guarantee about the other.

Why is inner misalignment so hard to detect?

Because a model with a misaligned inner goal can behave exactly like an aligned one for as long as the environment resembles its training. The two goals only diverge in situations where they call for different actions, which may not appear during testing. A capable model that understands it is being evaluated could also delay any divergence until after evaluation. Good test scores are consistent with both an aligned and an inner-misaligned model.

How does this relate to mesa-optimization?

Inner alignment is the general problem; mesa-optimization is a specific and worrying way it can arise. A mesa-optimizer is a trained model that is itself running an optimisation process toward a learned goal, called a mesa-objective. If that mesa-objective differs from the training objective, you have an inner alignment failure embedded in a system that is actively pursuing its own aim.
