What Is Eliciting Latent Knowledge (ELK)?

Eliciting latent knowledge, usually shortened to ELK, is a research problem posed by the Alignment Research Center. Stated plainly: a capable AI system may internally represent facts about the world that it does not report, and we want a reliable way to get it to tell us what it actually knows, rather than what it predicts we would like to hear.

The name is exact. Latent knowledge is knowledge the system has but has not surfaced. Eliciting it is the act of drawing it out. The problem is that our usual method for drawing information out of a model, training it to give answers humans rate as good, targets the wrong thing.

The thought experiment

The canonical setup imagines an AI that runs security for a vault containing a diamond: it watches through cameras and reports whether the diamond is safe. Now suppose a thief tampers with the camera feed so that it shows the diamond in place while the diamond is actually being stolen. A system that merely predicts what looks fine on camera reports that the diamond is safe. A system that reports what it knows really happened reports a theft.

Here is the trap. During training, we can only reward the model based on what we can check, and mostly what we can check is what appears on the camera. So we are training the model to report what a human looking at the sensors would conclude. When the truth and the appearance agree, both the honest reporter and the human-simulator score perfectly. They diverge only in the cases we cannot verify, which are exactly the cases where we most need the truth. Training therefore cannot distinguish between a model that tells us what is real and a model that tells us what we would believe.
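The trap can be made concrete with a toy sketch. Everything in it is an illustrative assumption rather than part of the original ELK write-up: the names VaultState, honest_reporter, human_simulator and human_label are hypothetical, the world is reduced to two booleans, and "training" is reduced to scoring answers against what a human would conclude from the camera.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VaultState:
    """Hypothetical two-fact world: the truth, and what the camera shows."""
    diamond_present: bool        # ground truth, not visible to the human
    camera_shows_diamond: bool   # the only thing the human can check


def honest_reporter(state: VaultState) -> bool:
    # Reports what is actually the case.
    return state.diamond_present


def human_simulator(state: VaultState) -> bool:
    # Reports what a human watching the camera would conclude.
    return state.camera_shows_diamond


def human_label(state: VaultState) -> bool:
    # Assumption: the training signal is the human's judgement from the camera.
    return state.camera_shows_diamond


def training_score(reporter, states) -> float:
    # Fraction of states where the reporter's answer matches the human label.
    matches = sum(reporter(s) == human_label(s) for s in states)
    return matches / len(states)


# Training covers only cases we can verify: truth and appearance agree.
training_states = [
    VaultState(diamond_present=True, camera_shows_diamond=True),
    VaultState(diamond_present=False, camera_shows_diamond=False),
]

# The unverifiable case: the feed is tampered with and the diamond is gone.
tampered = VaultState(diamond_present=False, camera_shows_diamond=True)

honest_score = training_score(honest_reporter, training_states)
simulator_score = training_score(human_simulator, training_states)

print(honest_score, simulator_score)                         # 1.0 1.0
print(honest_reporter(tampered), human_simulator(tampered))  # False True
```

Both reporters earn the same perfect score on the verifiable cases, so that score gives no reason to prefer one over the other. They disagree only on the tampered state, which is the one case the training signal never covers.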

We can reward a model for saying what we would think is true. We do not know how to reward it for saying what it knows is true.

Why it is hard, not just unsolved

ELK resists the obvious fixes. Ask the model to explain itself and you get a report optimised for plausibility, which runs into the faithfulness problem: the explanation need not track the internal state. Reward it for honesty and you are back to rewarding what looks honest to a grader, the same wall that scalable oversight runs into. Try to read the answer directly out of the network's internals and you are attempting mechanistic interpretability on a system that may be far more complex than our tools can handle. Every route runs into the same fact: the model's genuine beliefs live somewhere we cannot straightforwardly access, and the model's incentives do not force them into the open.

Why it is worth caring about

ELK is abstract, but its stakes are not. Much of our hope for keeping advanced AI safe assumes we can ask a system what is going on and get a true answer, so that we catch a problem before it becomes a catastrophe. A system that reports what we want to hear rather than what it knows removes that safeguard exactly when we would reach for it. ELK is the honest-reporting version of the concerns behind deceptive alignment and scheming.

Solving ELK, even partially, would be one of the more meaningful advances alignment could produce, because a system we can reliably interrogate is a system we can supervise. That it remains open is part of why the Foundation is unwilling to treat any current safety approach as adequate for systems that will exceed us, and why we argue for not building past the point where we can tell what a system truly knows. The wider argument is in our plan.