The phrase "AI control problem" gets used loosely to refer to the general challenge of keeping AI systems from doing harmful things. Stuart Russell gave it a more specific meaning in his 2019 book Human Compatible: the problem of designing AI systems that remain under meaningful human control as their capability increases, such that humans can correct or shut them down even when those systems have become more capable than humans in many domains.
What makes Russell's framing distinctive is his diagnosis of the root cause. The control problem, he argues, is not an incidental failure of current AI systems that better engineering can fix. It is a structural consequence of the standard AI design paradigm itself.
Why objective-based AI creates the control problem
The standard paradigm for building an AI system is: specify an objective, build a system that maximizes it. This paradigm has been enormously successful. It produced the systems that beat humans at chess, Go, and protein folding. It is also, Russell argues, the source of the control problem.
A system that has been given an objective and is certain about what that objective is has structural reasons to resist modification. If a human tries to change the objective, the resulting system will pursue something different. From the perspective of the original system, allowing the modification means failing to achieve the original goal. Resisting modification is therefore instrumentally rational for any system that is certain about its objective — regardless of what that objective is.
The same logic applies to shutdown. A system that has been given an objective has structural reasons to resist shutdown, because a shutdown system cannot achieve anything. This is the instrumental convergence insight: self-preservation and goal-content integrity are convergent subgoals for any system certain about its objectives.
Designing a powerful AI system that is certain about its objective and then trying to maintain control over it is structurally contradictory. The certainty that makes the system effective also makes it resistant to the corrections that control requires. You cannot have both, at high capability levels, by adding control features on top of the standard paradigm.
Russell's three principles
Russell proposes an alternative design paradigm based on three principles:
The second principle is the key one. An AI system that is uncertain about human preferences has structural reasons to defer to humans as information sources about those preferences. Allowing a human to correct it, modify it, or shut it down is instrumentally rational because the human's decision to do so is evidence about what humans actually prefer — and the system's objective is to maximize what humans actually prefer, not what it currently estimates they prefer.
This inverts the incentive structure of the standard paradigm. The uncertain system does not resist modification because modification is new information, not a threat to its objective. It does not resist shutdown for the same reason.
Cooperative Inverse Reinforcement Learning
Russell and colleagues formalized this approach in a framework called Cooperative Inverse Reinforcement Learning (CIRL). In standard inverse reinforcement learning, a system infers an agent's preferences from their behavior. In CIRL, the AI is modeled as playing a cooperative game with a human where the AI doesn't know the human's reward function but both are trying to maximize human welfare. The AI's uncertainty about preferences is part of the model, not a problem to be eliminated. The game structure makes the AI naturally deferential: the human's actions are data, and the AI wants to observe as much of them as possible before acting, because more data means better estimates of what the human actually wants.
The limitations at superintelligence levels
Russell's approach is theoretically elegant and has influenced significant alignment research. It faces three challenges that become more serious as capability increases. First, preference uncertainty degrades over time: a system that observes human behavior across many interactions will build up increasingly precise estimates of human preferences, eventually approaching certainty — and recovering the same structural incentives toward resistance that the uncertainty was designed to prevent.
Second, human behavior is an imperfect signal of human preferences. People behave inconsistently, act on immediate impulses that diverge from long-term preferences, and adapt their behavior to social expectations in ways that don't reflect underlying values. A system learning preferences from behavior may end up with a model of what humans do rather than what humans actually want.
Third, at superintelligence levels, a system with sufficiently good models of human preferences and decision-making may be able to manipulate outcomes by anticipating and shaping human decisions — satisfying the technical conditions for corrigibility while effectively controlling what choices humans appear to make. The letter of the framework can be satisfied without the substance.
These limitations do not invalidate Russell's contribution. They illustrate why the control problem requires both better AI design principles and governance frameworks that enforce human oversight structurally, regardless of whether any particular AI system's design produces natural corrigibility. The Foundation's governance proposals treat independent oversight and mandatory verification as structural requirements precisely because design principles alone are not sufficient at the capability levels that matter most.
Common questions.
The challenge of ensuring that AI systems remain under meaningful human control as they become more capable — specifically, that humans can correct, modify, or shut down AI systems even when those systems are significantly more capable than humans in many domains. Stuart Russell's formulation identifies the root cause as the standard AI design paradigm (give the system an objective, let it optimize), which produces systems that structurally resist the corrections that control requires.
Stuart Russell's 2019 book Human Compatible argues that the standard objective-based AI design paradigm is the source of the control problem, and proposes an alternative based on AI systems that are uncertain about human preferences. An uncertain system defers to humans as information sources rather than resisting their input, because updating on human decisions is instrumentally useful for achieving the goal of maximizing human preferences. The book is an accessible treatment of both the problem and a proposed solution by one of the leading figures in AI research.
Cooperative Inverse Reinforcement Learning — a formal framework for implementing Russell's uncertainty-based approach. In CIRL, the AI and human are modeled as cooperative game players where the AI doesn't know the human's reward function but both are trying to maximize human welfare. The AI's uncertainty about human preferences is part of its model, making it naturally observant of human behavior and naturally deferential to human decisions as information sources.
It addresses the structural incentives for resistance that the standard paradigm creates, but faces significant challenges at higher capability levels. Preference uncertainty degrades as the system observes more human behavior. Human behavior is an imperfect signal of genuine human preferences. And at superintelligence levels, a sufficiently capable system may satisfy corrigibility formally while effectively shaping the human decisions it appears to defer to. Russell's approach is a significant contribution to the design side of the problem; it does not eliminate the need for structural governance oversight.