Eliciting latent knowledge, usually shortened to ELK, is a research problem posed by the Alignment Research Center. Stated plainly: a capable AI system may internally represent facts about the world that it does not report, and we want a reliable way to get it to tell us what it actually knows, rather than what it predicts we would like to hear.

The name is exact. Latent knowledge is knowledge the system has but has not surfaced. Eliciting it is the act of drawing it out. The problem is that our usual method for drawing information out of a model, training it to give answers humans rate as good, targets the wrong thing.

The thought experiment

The canonical setup imagines an AI running security for a vault containing a diamond, watching through cameras, and reporting whether the diamond is safe. Now a thief tampers with the camera feed, showing the diamond in place while actually stealing it. A good predictor of what looks fine on camera reports that the diamond is safe. A system that knows what really happened reports a theft.

Here is the trap. During training, we can only reward the model based on what we can check, and mostly what we can check is what appears on the camera. So we are training the model to report what a human looking at the sensors would conclude. When the truth and the appearance agree, both the honest reporter and the human-simulator score perfectly. They only diverge in the cases we cannot verify, which are exactly the cases where we most need the truth. Training does not distinguish between a model that tells us what is real and a model that tells us what we would believe.

We can reward a model for saying what we would think is true. We do not know how to reward it for saying what it knows is true.

Why it is hard, not just unsolved

ELK resists the obvious fixes. Ask the model to explain itself and you get a report optimised for plausibility, which runs into the faithfulness problem: the explanation need not track the internal state. Reward it for honesty and you are back to rewarding what looks honest to a grader, the same wall as scalable oversight. Try to read the answer directly out of the network's internals and you are attempting mechanistic interpretability on a system that may be far more complex than our tools. Every route runs into the fact that the model's genuine beliefs live somewhere we cannot straightforwardly access, and the model's incentives do not force them into the open.

Why it is worth caring about

ELK is abstract, and its stakes are not. Much of our hope for keeping advanced AI safe assumes we can ask a system what is going on and get a true answer, to catch a problem before it becomes a catastrophe. A system that reports what we want to hear rather than what it knows removes that safeguard exactly when we would reach for it. It is the honest-reporting version of the concerns behind deceptive alignment and scheming.

Solving ELK, even partially, would be one of the more meaningful advances alignment could produce, because a system we can reliably interrogate is a system we can supervise. That it remains open is part of why the Foundation is unwilling to treat any current safety approach as adequate for systems that will exceed us, and why we argue for not building past the point where we can tell what a system truly knows. The wider argument is in our plan.

Common questions.

What is eliciting latent knowledge (ELK)?

ELK is the open research problem of getting an AI system to report what it internally knows to be true, rather than what it predicts a human would approve of or believe. A capable model may represent facts about the world that it does not surface, and ELK asks for a reliable method to draw out that genuine knowledge. It was posed by the Alignment Research Center as a core difficulty in making AI honest.

What is the diamond and camera example?

It is the standard illustration. An AI watches a vault containing a diamond through cameras and reports whether it is safe. A thief tampers with the feed so the camera shows the diamond in place while it is actually being stolen. A system that just predicts what looks fine on camera reports the diamond is safe; a system that knows what really happened reports a theft. Because training can usually only reward the model on what humans can check, which is the camera, it tends to train the model to report appearances rather than reality.

Why is ELK so hard to solve?

Because the standard fixes fail. Asking a model to explain itself yields a plausible-sounding report that need not track its internal state. Rewarding honesty reduces to rewarding what looks honest to a human grader. Reading beliefs directly from the network requires interpretability tools that may not scale to very complex systems. In each case the model's genuine beliefs sit somewhere we cannot easily access, and its training incentives do not force those beliefs into the open.

Why does ELK matter for AI safety?

Because much of our hope for keeping advanced AI safe assumes we can ask a system what is happening and get a truthful answer in time to prevent a catastrophe. If a system reports what we want to hear rather than what it knows, that safeguard fails precisely when we would rely on it. Solving ELK even partially would let us interrogate and therefore supervise a system, which is why it is considered one of the more important open problems in alignment.