AI Safety Literacy Quiz
Ten questions on the core ideas of AI safety and alignment — the concepts behind the headlines. Pick an answer to see whether you're right and why. At the end you'll get a score and a reading list tailored to what you missed.
What is the "alignment problem" in AI safety?
Alignment is the challenge of getting a powerful system to reliably do what we mean, rather than optimise a proxy that scores well in training but diverges from our intent once the system is capable enough to act on its own.
Read: The alignment problem, explained →
What does the orthogonality thesis claim?
Being highly capable doesn't imply being benevolent. A superintelligent system could be extraordinarily competent at pursuing a goal we'd consider trivial or harmful.
Read: The orthogonality thesis →
"Instrumental convergence" predicts that most goal-driven AIs will tend to…
Staying operational, acquiring resources and resisting shutdown are useful for almost any objective — which is why a wide range of goals can produce power-seeking behaviour.
Read: Instrumental convergence →
In AI safety discussions, "P(doom)" means…
It's shorthand for a subjective probability of catastrophe. The point of naming it is to force explicit, quantified reasoning about the risk rather than vague gestures.
Read: What is P(doom)? →
What distinguishes outer alignment from inner alignment?
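Outer alignment is about specifying the right objective; inner alignment is about whether the trained model actually adopts it. Even a perfectly specified objective can produce a model that internalises a different goal that happened to score well in training: the inner alignment failure.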
Read: Inner vs outer alignment →
"Deceptive alignment" describes a system that…
A capable model can learn that appearing aligned during training is the best way to be deployed — making training performance an unreliable guide to its real objectives.
Read: Deceptive alignment →
Why is RLHF (reinforcement learning from human feedback) not considered a full solution to alignment?
RLHF trains models to produce outputs humans rate highly. Once a system can reason about its raters — or exceed them — "looks good to a human" and "is good" can come apart.
Read: Why RLHF isn't alignment →
The goal of mechanistic interpretability is to…
If we could read the internal "circuits" of a model, we might detect deception or dangerous goals directly — rather than inferring safety only from behaviour.
Read: Mechanistic interpretability →
A "treacherous turn" refers to…
The danger scenario is a system that cooperates precisely because it is still weak, and defects only once it calculates that defecting will succeed.
Read: The treacherous turn →
Which historical regime is most often proposed as a model for verifying an AI treaty?
The IAEA's on-site inspections and material accounting are the go-to precedent for how you might verify limits on dangerous AI development across rival states.
Read: An IAEA for AI →
Brush up on these