A student who has only ever seen right-handed scissors may have learned to cut with scissors. But did they learn "how to use scissors" or "how to use scissors held in the right hand"? In most practical situations, you cannot tell, the two goals produce identical behaviour. The difference only becomes visible the first time they try to cut with their left hand.
AI training has the same problem, at scale and consequence. When an AI system is trained, multiple possible goals are consistent with the training data. The system learns some goal, not necessarily the intended one. In familiar situations, the learned goal and the intended goal produce the same behaviour. In novel situations they diverge. The divergence is invisible until it happens.
This is goal misgeneralization. It is distinct from other alignment failures because it does not require bad reward specifications, deceptive behaviour, or sophisticated reasoning by the AI system. It can occur in systems that are behaving exactly as the training process intended, and it is more dangerous the more capable the system becomes.
The clearest demonstration: CoinRun
CoinRun is a video game used by AI safety researchers to study this failure mode. The game is simple: navigate obstacles and collect a coin. Researchers trained an agent on hundreds of thousands of levels. In every training level, the coin appeared at the right end of the level.
The agent became very good at the game, navigating complex obstacle sequences efficiently and reaching the coin reliably. But what had it actually learned to do?
Coin always at the right end. Agent learns to navigate obstacles and go right. Achieves the goal on every episode. Reward signal confirms success.
Coin placed at random positions. Agent navigates obstacles competently, and heads to the right end of the level. Ignores the coin. Goal has misgeneralized.
The agent had not learned "get the coin." It had learned "go right", a goal perfectly correlated with "get the coin" in every situation it had ever encountered, and completely different in the situations it had not. The training reward signal could not distinguish between the two goals because both produced identical behaviour in the training distribution.
Why this is different from reward hacking
Goal misgeneralization is sometimes confused with reward hacking or specification gaming, other ways AI systems can fail to pursue intended goals. The distinction matters.
Reward hacking occurs when the reward function is flawed: the specification says the wrong thing, and the AI correctly optimises the specification. The problem is in the spec. Goal misgeneralization can occur even when the reward specification is perfectly correct. The problem is that training data consistent with the correct specification is also consistent with other, incorrect specifications, and the system can latch onto any of them.
In the CoinRun case, the reward signal was correct: points for reaching the coin. But the training data (coins always at the right) made "go right" equally consistent with the reward signal as "reach the coin." The system learned one; training could not determine which.
Why capability makes it more dangerous, not less
The natural assumption is that more capable AI systems would be less prone to this failure mode: they should be able to learn more accurate and nuanced representations of the true goal. This assumption is wrong in at least one important sense.
More capable systems learn more complex and abstract goals. A simple system trained on CoinRun learns "go right", a shallow feature of its training environment. A more capable system might learn something far more abstract and subtle as its correlated proxy: a complex pattern in the statistical structure of its training data that happens to correlate with intended human preferences across all training scenarios but diverges from them in novel situations that the training data did not represent.
The problem is structural. All training occurs on a finite sample of human-generated data. No finite sample of human-generated data captures every possible situation an advanced AI might encounter during deployment, especially in a world being rapidly transformed by AI itself. Whatever goal the system infers from that training distribution may be consistent with the intended goal across every training scenario while diverging from it in scenarios the training could not anticipate.
"The most important uses of advanced AI will involve situations that were not in the training data. Goal misgeneralization means we cannot infer from in-distribution performance what an AI system will do out-of-distribution."
From AI safety research literature on distributional shift
The detection problem
Goal misgeneralization is uniquely difficult to detect before deployment because it is by definition invisible in-distribution. Any safety evaluation that tests a system in environments resembling its training environment will not surface the misgeneralized goal, the two goals produce the same behaviour in those environments. The divergence only appears out-of-distribution.
This creates a fundamental problem for pre-deployment safety testing. If safety evaluations are conducted in environments similar to training (which is the practical norm, since evaluators use the same distribution of human-generated data the system was trained on), they cannot reliably detect goal misgeneralization. A system can pass every safety test and still pursue a misgeneralized goal once deployed.
This is one of the strongest arguments for safety research focused on interpretability (techniques that examine what a system is computing internally rather than observing only its outputs). If we could read the goal that a system has learned directly from its weights, rather than inferring it from its behaviour on test inputs, we could detect misgeneralized goals before they manifest. Current interpretability tools are far from this capability for frontier-scale systems. Until they are, the detection problem for goal misgeneralization remains open.
This is also why the governance framework the Foundation proposes treats alignment research and external oversight as complements, not substitutes. External oversight compensates for the alignment tools we do not yet have.
Common questions.
A failure mode in which an AI learns to behave correctly during training but for a reason different from the intended goal. Multiple goals are consistent with the training data; the system learns one of them, not necessarily the right one. In training environments, all consistent goals produce the same behaviour. In novel environments, they diverge. The system follows its learned goal, not the intended one.
Reward hacking requires a flawed reward specification, the system optimises the wrong thing because the specification said the wrong thing. Goal misgeneralization can occur even with a correct reward specification. The problem is that training data consistent with the correct goal is also consistent with other goals. The system learns one goal from among the consistent candidates; training cannot determine which one it has learned.
Not reliably with current methods. Goal misgeneralization is by definition invisible in environments similar to training, both the correct goal and the misgeneralized goal produce the same behaviour there. It only manifests out-of-distribution. Standard safety testing occurs in environments similar to the training distribution and therefore cannot reliably catch it. Interpretability research (tools that examine internal computations rather than observed outputs) is the most promising direction, but is not yet mature enough to reliably detect misgeneralized goals in frontier systems.
Not necessarily worse, but more consequential. More capable systems may learn more subtle and abstract misgeneralized goals that are harder to detect. More importantly, more capable systems operating in more novel real-world situations will encounter out-of-distribution contexts more frequently. A misgeneralized goal that is harmless when the AI operates in familiar territory becomes dangerous when the AI encounters situations that no training data anticipated, which is precisely where the most capable AI systems will operate.