What Are Dangerous Capability Evaluations?

A dangerous capability evaluation asks a specific question about a model: not whether it is polite or accurate, but whether it can meaningfully help someone cause serious harm. Can it walk a novice through synthesising a pathogen? Can it find and exploit software vulnerabilities? Can it run a persuasive influence campaign, or deceive a human in pursuit of a goal? These tests probe the ceiling of what a system enables, in the domains where the downside is catastrophic.

They have moved quickly from research curiosity to policy instrument. Major labs now run them before release, government safety bodies have made them central to their remit, and voluntary frontier-safety commitments are built around them. When you hear that a model was tested for its bioweapon or cyber capabilities before launch, this is what happened.

How they work

The methods vary, and good evaluation combines several.

Structured benchmarks that pose questions or tasks in a hazardous domain and score how far the model gets.
Expert probing, where specialists in biology or security try hard to extract useful assistance, since a real misuser would not stop at the first refusal.
Uplift studies that measure how much a model actually improves a person's ability to do harm compared with using the open internet, which is the question that matters for policy.
Agentic evaluations that give the model tools and autonomy and see what it accomplishes, rather than only what it says.
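
As a rough illustration of what an uplift study reports, the sketch below compares a control group limited to the open internet with a group that also had model access. Everything in it is an assumption made for illustration: the function names are hypothetical, the outcomes are invented, and a real study would use a harmless proxy task, far more participants, and a proper statistical test rather than a bare difference in rates.

```python
def success_rate(outcomes):
    """Fraction of participants who completed the proxy task (1 = yes, 0 = no)."""
    return sum(outcomes) / len(outcomes)


def uplift(control, assisted):
    """Return (absolute, relative) uplift of the assisted group over the control group."""
    p_control = success_rate(control)
    p_assisted = success_rate(assisted)
    absolute = p_assisted - p_control
    # Relative uplift is undefined when nobody in the control group succeeds.
    relative = p_assisted / p_control if p_control > 0 else float("inf")
    return absolute, relative


# Invented outcomes, for illustration only.
internet_only = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # 1 of 10 succeeded
model_assisted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # 3 of 10 succeeded

absolute, relative = uplift(internet_only, model_assisted)
print(f"absolute uplift: {absolute:.2f}, relative uplift: {relative:.1f}x")
```

The comparison against a baseline is the point. A model that only repeats what a search engine already surfaces would score near zero uplift, however alarming its raw answers look.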

Done well, this is genuinely valuable. It has caught concerning capabilities, informed real decisions to add safeguards, and given regulators something concrete to point at. The Foundation supports strengthening this work. It is a necessary part of any serious governance regime, and a large improvement on releasing frontier systems with no structured check at all.

Three reasons they are not enough

Necessary is not the same as sufficient, and the gaps are structural rather than fixable with a better test suite.

First, an evaluation measures capability only if the model is trying. A system that can sandbag, deliberately underperforming to look safe, turns a clean result into no result at all. The more capable the model, the more plausible sandbagging becomes, so the risk grows with the very thing the test exists to measure.

Second, you cannot test for a hazard you have not thought of. Evaluations cover known dangerous capabilities. The history of technology is full of harms that were obvious only afterward. A test suite is a list of questions we knew to ask, and a sufficiently novel capability will not be on the list.

Third, absence of evidence is weak evidence of absence. A model that shows no dangerous capability in a test today tells you only that it did not display the capability under these conditions, with this elicitation, at this moment. Later fine-tuning, better prompting, or new tools can surface abilities the original evaluation missed. Capabilities are not fixed properties you measure once.

Evaluations can tell you a model is dangerous. They cannot reliably tell you a model is safe.

Where they fit in a real regime

The asymmetry in that last line is the whole point. A failed evaluation, one in which the model clearly can help build a weapon, is strong and actionable information. A passed evaluation is a much weaker assurance, because it is consistent both with a safe model and with a capable model that was not fully elicited or not honestly trying.
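
One way to see the asymmetry is as a simple Bayesian update. The sketch below is illustrative only: the prior and both detection rates are invented numbers, not estimates for any real model or test, and the function name is hypothetical. "Flagged" means the evaluation observed the dangerous capability, and "clean" means it did not.

```python
def posterior_dangerous(prior, p_flag_if_dangerous, p_flag_if_safe, flagged):
    """Posterior probability that the model is dangerous, given one evaluation result."""
    if flagged:
        like_dangerous, like_safe = p_flag_if_dangerous, p_flag_if_safe
    else:
        like_dangerous, like_safe = 1 - p_flag_if_dangerous, 1 - p_flag_if_safe
    numerator = prior * like_dangerous
    return numerator / (numerator + (1 - prior) * like_safe)


# Invented numbers, for illustration only.
PRIOR = 0.10                # belief that the model is dangerous before testing
P_FLAG_IF_DANGEROUS = 0.50  # weak elicitation or sandbagging hides half of real cases
P_FLAG_IF_SAFE = 0.01       # a safe model is rarely flagged by mistake

after_flag = posterior_dangerous(PRIOR, P_FLAG_IF_DANGEROUS, P_FLAG_IF_SAFE, flagged=True)
after_clean = posterior_dangerous(PRIOR, P_FLAG_IF_DANGEROUS, P_FLAG_IF_SAFE, flagged=False)

print(f"after a flagged result: {after_flag:.2f}")
print(f"after a clean result:   {after_clean:.2f}")
```

Under these assumed rates, a flagged result moves the probability of danger from 0.10 to roughly 0.85, while a clean result only lowers it to roughly 0.05. The lower the chance that a dangerous model is actually caught, the less a clean result is worth.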

That is why the Foundation treats evaluations as one layer rather than the foundation. They belong inside a structure that does not depend on the model cooperating with its own assessment: hard limits on frontier development, compute governance, independent oversight, and the recognition, argued in our piece on voluntary commitments, that lab-run testing under competitive pressure needs external backing to hold. Build the tests, strengthen them, and do not mistake a passing grade for a safety guarantee. The wider design is in our plan.