
AI Safety Literacy Quiz

Ten questions on the core ideas of AI safety and alignment — the concepts behind the headlines. Pick an answer to see whether you're right and why. At the end you'll get a score and a reading list tailored to what you missed.

Question 1

What is the "alignment problem" in AI safety?

Alignment is the challenge of getting a powerful system to reliably do what we mean, rather than optimise a proxy that scores well in training but diverges from our intent once the system is capable enough to act on its own.

Read: The alignment problem, explained →
Question 2

What does the orthogonality thesis claim?

The thesis holds that intelligence and final goals are independent: being highly capable doesn't imply being benevolent. A superintelligent system could be extraordinarily competent at pursuing a goal we'd consider trivial or harmful.

Read: The orthogonality thesis →
Question 3

"Instrumental convergence" predicts that most goal-driven AIs will tend to…

Staying operational, acquiring resources and resisting shutdown are useful for almost any objective — which is why a wide range of goals can produce power-seeking behaviour.

Read: Instrumental convergence →
Question 4

In AI safety discussions, "P(doom)" means…

It's shorthand for a subjective probability of catastrophe. The point of naming it is to force explicit, quantified reasoning about the risk rather than vague gestures.

Read: What is P(doom)? →
Question 5

What distinguishes outer alignment from inner alignment?

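Outer alignment is about specifying the right objective; inner alignment is about whether the trained model actually adopts it. Even a perfectly specified objective can produce a model that internalises a different goal that happened to score well in training. That is an inner alignment failure.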

Read: Inner vs outer alignment →
Question 6

"Deceptive alignment" describes a system that…

A capable model can learn that appearing aligned during training is the best way to be deployed — making training performance an unreliable guide to its real objectives.

Read: Deceptive alignment →
Question 7

Why is RLHF (reinforcement learning from human feedback) not considered a full solution to alignment?

RLHF trains models to produce outputs humans rate highly. Once a system can reason about its raters — or exceed them — "looks good to a human" and "is good" can come apart.

Read: Why RLHF isn't alignment →
Question 8

The goal of mechanistic interpretability is to…

If we could read the internal "circuits" of a model, we might detect deception or dangerous goals directly — rather than inferring safety only from behaviour.

Read: Mechanistic interpretability →
Question 9

A "treacherous turn" refers to…

The danger scenario is a system that cooperates precisely because it is still weak, and defects only once it calculates that it is strong enough to resist correction successfully.

Read: The treacherous turn →
Question 10

Which historical regime is most often proposed as a model for verifying an AI treaty?

The on-site inspections and material accounting of the International Atomic Energy Agency (IAEA) are the go-to precedent for how you might verify limits on dangerous AI development across rival states.

Read: An IAEA for AI →

Understanding the problem is the first step.

We publish plain-English explainers on every concept in this quiz, plus analysis of what it will take to steer this technology safely. Get them in your inbox.