Red teaming is the practice of deliberately attacking a system to find its weaknesses before someone else does. The term comes from military exercises and cybersecurity, where a red team plays the adversary against the defending blue team. Applied to AI, it means people, and increasingly other AI systems, working hard to make a model do things it should not: produce harmful instructions, leak private data, be manipulated past its guardrails, or reveal a dangerous capability.

Before a frontier model ships, red teams probe it for exactly these failures. They try jailbreaks, adversarial prompts, and creative misuse, and what they find gets fixed or fenced off. Red teaming is now a standard and genuinely useful part of frontier development, and it is required or encouraged in most emerging AI governance frameworks.

What red teaming does well

Its strengths are concrete. Adversarial pressure surfaces failures that ordinary testing misses, because ordinary testing exercises the cases you expected and red teaming hunts for the ones you did not. It finds specific, fixable problems, and it stress-tests safeguards against the kind of determined effort a real misuser would apply rather than the polite queries of a normal user. A model that has survived serious red teaming is more robust than one that has not, and that is worth having.

It also feeds the rest of the safety apparatus. Red-team findings inform capability evaluations, sharpen safety training, and provide evidence for the kind of argument a safety case has to make.

The limit that matters

Red teaming shares a deep asymmetry with all testing, and it is the single most important thing to understand about it. Finding a flaw is proof the flaw exists. Not finding one is not proof that none does.

When a red team breaks a model, you have learned something certain and actionable. When a red team fails to break it, you have learned that this group, with this much time, using the techniques they thought of, did not succeed. A more capable attacker, a novel method, or simply more time may still succeed. The absence of a discovered failure is weak evidence of the absence of failures, and it gets weaker as the space of possible attacks grows larger than any team can explore.
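The weakening of that evidence can be made concrete with a toy Bayesian calculation. The sketch below is illustrative only and rests on assumptions that are not part of the argument above: a single possible flaw, a hypothetical `prior` probability that it exists, and a hypothetical `coverage` fraction of the attack space that the red team explored, with any flaw inside the explored fraction always being found.

```python
def posterior_flaw_given_clean(prior: float, coverage: float) -> float:
    """P(flaw exists | red team found nothing) under a toy model.

    Assumptions (illustrative, not from the article):
      - prior:    probability a flaw exists before any testing
      - coverage: fraction of the attack space the red team explored;
                  a flaw inside that fraction is always found
    """
    if not (0.0 <= prior <= 1.0 and 0.0 <= coverage <= 1.0):
        raise ValueError("prior and coverage must be in [0, 1]")
    missed = prior * (1.0 - coverage)  # flaw exists but lies outside what was tested
    no_flaw = 1.0 - prior              # nothing to find in the first place
    total = missed + no_flaw           # probability of a clean result
    if total == 0.0:
        raise ValueError("a clean result is impossible under these inputs")
    return missed / total


if __name__ == "__main__":
    # Same prior, shrinking coverage: the clean result reassures less and less.
    for coverage in (0.9, 0.5, 0.1, 0.01):
        print(coverage, round(posterior_flaw_given_clean(0.5, coverage), 4))
```

Under these assumptions, a clean result from a team that covered most of the attack space lowers the probability of a flaw substantially, while a clean result from a team that covered a small fraction leaves it close to the prior. Real attack spaces are not uniform and real coverage is not measurable this neatly, so the numbers carry no weight beyond showing the direction of the effect.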

Red teaming can show a system is unsafe. It cannot show a system is safe. The two are not symmetric.

Two further limits sharpen the point. First, against a highly capable model the red team may be outmatched, and a system able to recognise a test and underperform on it could pass red teaming while retaining the capability being probed. Second, red teaming addresses misuse and known failure modes far better than it addresses the deeper worry of a model with misaligned goals of its own, which will not obligingly reveal them to an adversary looking for jailbreaks.

Where it belongs

The Foundation supports red teaming and wants more of it, done independently rather than only in-house, with results disclosed rather than buried. It is a strong tool for catching the harms it is built to catch. It is not a certificate of safety, and treating a clean red-team result as one repeats the error that runs through so much of current practice: mistaking the absence of proof of danger for proof of safety. Red teaming belongs inside a governance regime that does not depend on the stress test being exhaustive, the design of which is set out in our plan.

Common questions

What is red teaming in AI?

Red teaming is the practice of deliberately attacking an AI model to uncover its weaknesses before others do. Borrowed from military and cybersecurity exercises, it involves people, and increasingly other AI systems, trying hard to make a model produce harmful content, leak data, be manipulated past its guardrails, or reveal a dangerous capability. Frontier models are red-teamed before release, and the findings are used to fix or fence off the failures discovered.

What is red teaming good at finding?

It surfaces failures that ordinary testing misses, because ordinary testing exercises expected cases while red teaming actively hunts for unexpected ones. It finds specific, fixable problems and stress-tests safeguards against the sort of determined effort a real misuser would apply. Its results also strengthen the wider safety apparatus, informing capability evaluations, improving safety training, and providing evidence for safety arguments.

What are the limits of red teaming?

The central limit is an asymmetry shared by all testing: finding a flaw proves the flaw exists, but failing to find one does not prove none exists. A clean result only shows that this team, with this much time and these techniques, did not break the model, while a more capable attacker or a novel method might. Against a highly capable model the red team may be outmatched, and a model that can recognise a test could underperform in order to pass it. Red teaming also addresses misuse far better than it addresses a model with genuinely misaligned goals.

Does passing red teaming mean an AI model is safe?

No. Passing red teaming means known attack techniques applied within a limited time did not succeed, which is useful but far from a safety guarantee. The space of possible attacks is larger than any team can explore, so the absence of a discovered failure is weak evidence that no failure exists, and that evidence weakens as models become more capable. Treating a clean red-team result as a certificate of safety mistakes the absence of proof of danger for proof of safety.