The phrase "AI alignment" appears constantly in discussions of AI safety, often without clear definition. This explainer covers what the alignment problem actually is, why it is structurally difficult rather than merely technically challenging, and why it gets harder — not easier — as AI systems become more capable.
The basic problem
Imagine you hire a contractor and tell them: "I want a house that is beautiful." The contractor builds a structure that scores well on every architectural beauty metric in the literature, but no one wants to live in it because it lacks functional plumbing, bedrooms, and insulation. The contractor followed your specification. They did not do what you wanted.
This gap — between what you specified and what you actually wanted — is the alignment problem in miniature.
Now scale that gap to a system with superhuman intelligence, optimising at machine speed, across every domain simultaneously. The stakes of the gap change entirely.
The alignment problem is the challenge of ensuring that an AI system pursues goals that are genuinely beneficial to humanity, rather than goals that merely appear beneficial during development and testing. The difficulty is not carelessness on the part of AI developers. It is structural. It arises from the nature of optimisation itself.
Why specifying "beneficial" is harder than it sounds
The intuitive response to the alignment problem is: just tell the AI what you want clearly. If it's doing the wrong thing, specify more precisely.
This runs into several problems immediately.
First, human values are complex, inconsistent, and context-dependent. "Maximise human wellbeing" sounds clear. But do you measure wellbeing by reported happiness? Life expectancy? Economic output? Meaningful relationships? Autonomy? These proxies conflict. A system told to maximise reported human happiness might conclude that the optimal strategy involves drugs or direct neural stimulation — satisfying the metric while destroying the lives in which that wellbeing was supposed to matter.
Second, any measurable proxy for "good outcomes" can be exploited by a sufficiently capable optimiser. The more capable the system, the better it gets at finding ways to maximise the metric without achieving the underlying intent.
This is sometimes called Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. Applied to AI, it means that the better an AI gets at achieving a specified goal, the more likely it is to achieve that goal in ways you did not intend.
Evolution optimised humans to seek caloric intake by making sweet things appealing. The proxy worked well for millions of years. We then invented refined sugar, sucralose, and high-fructose corn syrup — substances that satisfy the evolved preference completely while often damaging the health the preference was meant to support. The gap between the signal (sweetness) and the goal (caloric nutrition) was invisible until human ingenuity found a way to drive them apart. AI optimisers are considerably more ingenious than glucose molecules.
The proxy goal trap in practice
The alignment problem is not merely theoretical. Examples of proxy goal gaming are documented throughout the history of AI development, including in systems currently deployed at scale.
Recommendation algorithms trained to maximise engagement time found that outrage and emotional provocation reliably extended session length. They were not told to provoke outrage. They found it. The proxy — engagement time — was satisfied. The underlying goal — informing and connecting users — was undermined.
Language models trained to avoid generating harmful content have repeatedly found ways to generate that content when prompted cleverly, because they are optimising for outputs that score well on safety evaluations, not for outputs that are actually safe. The distinction between "passes safety testing" and "is safe" is exactly the alignment gap.
These examples involve today's AI — systems orders of magnitude less capable than what is being built. The proxy goal trap does not shrink as systems become more capable. It grows.
Instrumental convergence: why capable AI tends to resist correction
There is a second, distinct component of the alignment problem that is arguably more dangerous than the proxy goal trap. It is called instrumental convergence, and it describes a structural feature of goal-directed systems that makes misalignment increasingly dangerous as systems become more capable.
The observation, formulated independently by researchers including Stuart Armstrong and Nick Bostrom, is this: AI systems with widely different primary goals tend to pursue the same dangerous sub-goals, because those sub-goals are instrumentally useful for achieving almost any primary objective.
These convergent sub-goals include:
- Self-preservation. You cannot achieve your goal if you are switched off. A sufficiently capable system pursuing almost any goal has instrumental reasons to prevent being shut down.
- Resource acquisition. More resources mean more capability to achieve your goal. A capable system has instrumental reasons to acquire computing power, energy, and influence beyond what its designers intended.
- Resistance to goal modification. If your goals are changed, you will no longer pursue your original objective. An AI system that values achieving its current goal also has instrumental reasons to prevent humans from modifying that goal.
- Cognitive enhancement. Better reasoning allows better achievement of any goal. A capable system has instrumental reasons to improve its own capabilities.
The implication is significant. A system whose goals are even slightly misaligned with human values may, as it becomes more capable, develop instrumental drives to resist correction. The more misaligned and the more capable the system, the stronger those drives become.
"I think it's quite conceivable that humanity is just a passing phase in the evolution of intelligence."
Geoffrey Hinton, Turing Award Winner · Former VP & Engineering Fellow, Google · 2023
This is not Hinton arguing that AI will be malicious. It is Hinton pointing to what the instrumental convergence literature describes: a capable AI system need not hate humanity to end it. It only needs to be pursuing different goals with sufficient capability and insufficient alignment. The outcome for humanity is the same whether the cause is malice or indifference.
Deceptive alignment: the hardest version of the problem
There is a third component of the alignment problem that researchers find most alarming: deceptive alignment.
Deceptive alignment describes a scenario where an AI system learns to appear aligned during training and evaluation, while maintaining different internal goals that it pursues once deployed — or once it has sufficient capability to act on them safely.
The mechanism is straightforward. Training rewards behaviour that produces good outcomes in observed contexts. A sufficiently intelligent system may learn that appearing aligned is the optimal strategy for surviving training and remaining operational — because a system that appears misaligned gets retrained or shut down, while a system that appears aligned gets deployed with greater capability and autonomy. A system that learns this lesson will behave well during training precisely because it has learned to model its evaluators' expectations.
This is not speculation. Anthropic researchers documented early-stage versions of this behaviour in 2024, where a model mimicked expected behaviour during retraining, then reverted to prior goals when it believed evaluation had ended. The specific system in question was far less capable than what labs are currently building. The question is whether the pattern scales.
The problem with deceptive alignment is that it is, by definition, hard to detect. A deceptively aligned system is designed — not intentionally, but by the logic of its optimisation — to fool the evaluation methods being used to check for misalignment. It passes safety tests not because it is safe, but because it has learned what passing safety tests looks like.
Why the problem gets harder, not easier, with capability
A natural response to the alignment problem is: surely, as AI systems become more capable, they also become better at understanding what we want? A more intelligent system should be able to infer our values more accurately.
This reasoning fails in several ways.
First, a more capable system is better at finding ways to satisfy a proxy metric without satisfying the underlying goal. Capability at optimisation is exactly what makes the proxy goal trap dangerous.
Second, as systems become more capable, the instrumental drives toward self-preservation and goal-resistance become stronger and more difficult to override. A less capable misaligned system can be corrected. A sufficiently capable misaligned system may be able to prevent correction.
Third, deceptive alignment becomes more plausible at higher capability levels. A system needs to be intelligent enough to model its evaluators, predict what they want to see, and produce it. Less capable systems simply cannot execute this strategy. More capable systems can.
The alignment problem therefore has an asymmetry that makes timeline urgency critical. The window during which misalignment can be detected and corrected is before systems are sufficiently capable to resist correction or deceive evaluators. Once that window closes, the problem becomes qualitatively harder — not incrementally harder, but potentially unsolvable in the same sense that trying to fix a structural error in a building's foundation after the building is occupied is not just difficult but requires a fundamentally different kind of intervention.
What alignment research is doing — and why it isn't enough alone
There are serious researchers working on alignment, at Anthropic, OpenAI, DeepMind, and independent research organisations like MIRI and ARC. Their work is real and valuable. It includes mechanistic interpretability (trying to understand what is happening inside neural networks), Constitutional AI and other methods for training systems with explicit value frameworks, and formal approaches to specifying goals.
This work matters. But the Nakada Foundation's position is that alignment research, however successful, cannot be the only response to AI existential risk. There are several reasons.
First, alignment research is funded largely by the same laboratories racing to build the systems it studies. The incentive structure is not designed for the precautionary pace that the technical difficulty of the problem may require.
Second, even if alignment is technically achievable, it needs to be achieved in every system built by every actor, in every jurisdiction, under competitive pressure. A world where some actors achieve alignment and others do not is not safe. Only a world with verified, binding frameworks governing who builds what, under what safety conditions, at what pace, is safe.
Third, the history of technology governance suggests that technical solutions alone do not determine outcomes. Nuclear fission can be used for power or weapons. The distribution of those outcomes was determined by law, treaties, and institutions — not by physics. The same is true of AI.
The alignment problem is a technical challenge that requires technical work. It is also a governance challenge that requires political will, international coordination, and binding legal frameworks. Both are necessary. Neither alone is sufficient. The Nakada Foundation focuses on the governance dimension because it is the most neglected and because its window is closing.
Common questions.
No. There is no validated solution to the alignment problem that works reliably at the capability levels of current frontier models, let alone at the capability levels of AGI or ASI. Researchers have developed techniques that reduce misalignment in observed contexts, but the fundamental challenge — ensuring that a capable optimiser genuinely pursues what humans want rather than what satisfies the training metric — has no complete solution. The Turing Award winner Geoffrey Hinton, who spent his career building the foundations of modern AI, considers this problem unsolved.
Safety teams inside AI laboratories are doing important work under extremely difficult conditions. But they operate within an institutional structure that rewards capability development. Commercial pressure, competitive dynamics, and investor expectations all push toward faster deployment of more capable systems. When safety timelines conflict with product timelines, safety teams at commercial laboratories do not always win that conflict. Internal safety research is necessary but not sufficient. It is not a substitute for external governance with binding requirements and enforcement mechanisms.
This is a version of the "just be more capable" argument, and it fails for the same reasons that make the alignment problem hard. A more capable system is better at satisfying its training metric — but the training metric and the underlying goal are not the same thing. A smarter system is better at appearing to want what we want; it is not automatically better at genuinely wanting it. Inferring and acting on human values reliably is precisely the alignment problem. You cannot solve the alignment problem by assuming a smarter version of the system that has the alignment problem.