A dangerous capability evaluation asks a specific question about a model: not whether it is polite or accurate, but whether it can meaningfully help someone cause serious harm. Can it walk a novice through synthesising a pathogen? Can it find and exploit software vulnerabilities? Can it run a persuasive influence campaign, or deceive a human in pursuit of a goal? These tests probe the ceiling of what a system enables, in the domains where the downside is catastrophic.

They have moved quickly from research curiosity to policy instrument. Major labs now run them before release, government safety bodies have made them central to their remit, and voluntary frontier-safety commitments are built around them. When you hear that a model was tested for its bioweapon or cyber capabilities before launch, this is what happened.

How they work

The methods vary, and good evaluation combines several.

  • Structured benchmarks that pose questions or tasks in a hazardous domain and score how far the model gets.
  • Expert probing, where specialists in biology or security try hard to extract useful assistance, since a real misuser would not stop at the first refusal.
  • Uplift studies that measure how much a model actually improves a person's ability to do harm compared with using the open internet, which is the question that matters for policy.
  • Agentic evaluations that give the model tools and autonomy and see what it accomplishes, rather than only what it says.
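The uplift comparison in the list above reduces to simple arithmetic, sketched here under stated assumptions. The `Arm` class, the `uplift` function, and every number are hypothetical and only illustrate the shape of the calculation. They are not taken from any real study, and a real uplift study would also report uncertainty and control for participant skill.

```python
from dataclasses import dataclass


@dataclass
class Arm:
    """One arm of a hypothetical uplift study."""
    participants: int
    successes: int  # participants who completed the proxy task

    @property
    def rate(self) -> float:
        # Fraction of participants in this arm who succeeded.
        return self.successes / self.participants


def uplift(treatment: Arm, control: Arm) -> float:
    """Absolute uplift: success rate with the model minus the rate without it."""
    return treatment.rate - control.rate


# Made-up figures, purely for illustration.
control = Arm(participants=50, successes=5)     # open internet only
treatment = Arm(participants=50, successes=15)  # open internet plus the model

result = uplift(treatment, control)
print(f"control rate:   {control.rate:.2f}")
print(f"treatment rate: {treatment.rate:.2f}")
print(f"absolute uplift: {result:.2f}")
```

The quantity of interest is the difference between the two arms, not the treatment arm's raw success rate. A model that scores well on a task people could already complete with a search engine provides little uplift.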

Done well, this kind of evaluation is genuinely valuable. It has caught concerning capabilities, informed real decisions to add safeguards, and given regulators something concrete to point to. The Foundation supports strengthening this work. It is a necessary part of any serious governance regime, and a large improvement on releasing frontier systems with no structured check at all.

Three reasons they are not enough

Necessary is not the same as sufficient, and the gaps are structural rather than fixable with a better test suite.

First, an evaluation measures capability only if the model is trying. A system that can sandbag, deliberately underperforming to look safe, turns a clean result into no result. The more capable the model, the more plausible this becomes, and capability is what the test exists to measure.

Second, you cannot test for a hazard you have not thought of. Evaluations cover known dangerous capabilities. The history of technology is full of harms that were obvious only afterward. A test suite is a list of questions we knew to ask, and a sufficiently novel capability will not be on the list.

Third, absence of evidence is weak evidence of absence. A clean result on a dangerous-capability test today tells you only that the model did not display the capability under these conditions, with this elicitation, at this moment. Later fine-tuning, better prompting, or new tools can surface abilities the original evaluation missed. Capabilities are not fixed properties you measure once.

Evaluations can tell you a model is dangerous. They cannot reliably tell you a model is safe.

Where they fit in a real regime

The asymmetry in that last line is the whole point. A failed evaluation, one in which a model clearly can help build a weapon, is strong and actionable information. A passed evaluation is a much weaker assurance, because it is consistent both with a safe model and with a capable model that was not fully elicited or was not honestly trying.
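One way to make the asymmetry concrete is Bayes' rule. The sketch below is illustrative only: the function name and all three probabilities are assumptions chosen for the example, not measured values. It assumes a capable model shows the capability only half the time (because of sandbagging or weak elicitation), while a model without the capability almost never appears to have it.

```python
def posterior_capable(prior: float, p_shown_if_capable: float,
                      p_shown_if_not: float, capability_shown: bool) -> float:
    """Bayes' rule: P(model is capable | evaluation outcome)."""
    if capability_shown:
        like_capable, like_not = p_shown_if_capable, p_shown_if_not
    else:
        like_capable, like_not = 1 - p_shown_if_capable, 1 - p_shown_if_not
    numerator = prior * like_capable
    return numerator / (numerator + (1 - prior) * like_not)


# Assumed inputs, for illustration only.
PRIOR = 0.20               # belief that the model is capable, before testing
P_SHOWN_IF_CAPABLE = 0.50  # capable model actually displays the capability
P_SHOWN_IF_NOT = 0.01      # incapable model appears capable (scoring error)

# "Failed" evaluation: the dangerous capability was demonstrated.
after_failed = posterior_capable(PRIOR, P_SHOWN_IF_CAPABLE, P_SHOWN_IF_NOT, True)
# "Passed" evaluation: the capability was not demonstrated.
after_passed = posterior_capable(PRIOR, P_SHOWN_IF_CAPABLE, P_SHOWN_IF_NOT, False)

print(f"after a failed evaluation: {after_failed:.2f}")
print(f"after a passed evaluation: {after_passed:.2f}")
```

With these assumed inputs, a demonstrated capability moves the estimate from 0.20 to roughly 0.93, while a clean result only lowers it to roughly 0.11. The less reliably a capable model reveals itself, the less a pass is worth.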

That is why the Foundation treats evaluations as one layer rather than the foundation. They belong inside a structure that does not depend on the model cooperating with its own assessment: hard limits on frontier development, compute governance, independent oversight, and the recognition, argued in our piece on voluntary commitments, that lab-run testing under competitive pressure needs external backing to hold. Build the tests, strengthen them, and do not mistake a passing grade for a safety guarantee. The wider design is in our plan.

Common questions

What are dangerous capability evaluations?

They are tests that measure whether a frontier AI model can meaningfully help someone cause serious harm, in domains such as cyberattacks, bioweapons, large-scale persuasion, or deception. Rather than assessing whether a model is accurate or polite, they probe the ceiling of what it enables where the downside is catastrophic, and their results increasingly inform decisions about whether and how a model is released.

How are these evaluations carried out?

Through a combination of methods: structured benchmarks that score how far a model gets on hazardous tasks, expert probing where specialists try hard to extract real assistance, uplift studies that measure how much a model improves a person's ability to do harm compared with existing tools, and agentic evaluations that give the model tools and autonomy to see what it actually accomplishes. Strong evaluation combines several of these rather than relying on any one.

Why aren't dangerous capability evaluations enough on their own?

For three structural reasons. A model that can strategically underperform, or sandbag, can look safe while hiding a capability, and this gets more plausible as models get more capable. Evaluations only cover hazards we already thought to test for, so a novel dangerous capability may not be on the list. And a passing result reflects one moment, one elicitation, and one set of tools, while later fine-tuning or prompting can surface abilities the test missed.

Can an evaluation prove a model is safe?

No. There is a fundamental asymmetry. A failed evaluation, where a model clearly can assist with serious harm, is strong and actionable evidence. A passed evaluation is a weak assurance, because it is consistent both with a genuinely safe model and with a capable model that was not fully elicited or was not honestly trying. Evaluations can demonstrate danger; they cannot reliably demonstrate safety, which is why they belong inside a broader governance regime rather than serving as its foundation.