What Is Red Teaming in AI?

Red teaming is the practice of deliberately attacking a system to find its weaknesses before someone else does. The term comes from military exercises and cybersecurity, where a red team plays the adversary against the defending blue team. Applied to AI, it means people, and increasingly other AI systems, working hard to make a model do things it should not: produce harmful instructions, leak private data, give way to manipulation that gets past its guardrails, or reveal a dangerous capability.

Before a frontier model ships, red teams probe it for exactly these failures. They try jailbreaks, adversarial prompts, and creative misuse. What they find gets fixed or fenced off. Red teaming is now a standard and genuinely useful part of frontier development, and it is required or encouraged in most emerging AI governance frameworks.

What red teaming does well

Its strengths are concrete. Adversarial pressure surfaces failures that ordinary testing misses, because ordinary testing exercises the cases you expected and red teaming hunts for the ones you did not. It finds specific, fixable problems, and it stress-tests safeguards against the kind of determined effort a real misuser would apply rather than the polite queries of a normal user. A model that has survived serious red teaming is more robust than one that has not, and that is worth having.

It also feeds the rest of the safety apparatus. Red-team findings inform capability evaluations, sharpen safety training, and provide evidence for the kind of argument a safety case has to make.

The limit that matters

Red teaming shares a deep asymmetry with all testing, and it is the single most important thing to understand about it. Finding a flaw is proof the flaw exists. Not finding one is not proof that none does.

When a red team breaks a model, you have learned something certain and actionable. When a red team fails to break it, you have learned that this group, with this much time, using the techniques they thought of, did not succeed. A more capable attacker, a novel method, or simply more time may still succeed. The absence of a discovered failure is weak evidence of the absence of failures, and it gets weaker as the space of possible attacks grows larger than any team can explore.
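The weakness of a clean result can be made concrete with a toy Bayesian calculation. The sketch below is illustrative only and rests on assumptions that are not part of the argument above: a single prior probability that an exploitable flaw exists (`prior_flaw`), and a `coverage` figure standing for the chance that the red team would have found the flaw if it were there. Both names and the function itself are hypothetical, and real attack spaces are far messier than a single number.

```python
def posterior_flaw_given_clean(prior_flaw: float, coverage: float) -> float:
    """P(flaw exists | red team found nothing), by Bayes' rule.

    Toy model, assumptions only:
      prior_flaw -- belief before testing that an exploitable flaw exists
      coverage   -- chance the team finds the flaw, given that it exists
    A flaw that does not exist is never "found" (no false positives).
    """
    if not (0.0 <= prior_flaw <= 1.0 and 0.0 <= coverage <= 1.0):
        raise ValueError("prior_flaw and coverage must lie in [0, 1]")

    missed = prior_flaw * (1.0 - coverage)  # flaw exists, team missed it
    no_flaw = 1.0 - prior_flaw              # nothing there to find
    evidence = missed + no_flaw             # total probability of a clean result
    if evidence == 0.0:
        raise ValueError("a clean result is impossible under these assumptions")
    return missed / evidence


if __name__ == "__main__":
    # As coverage shrinks, a clean result moves the estimate less and less.
    for coverage in (0.9, 0.5, 0.1, 0.01):
        risk = posterior_flaw_given_clean(prior_flaw=0.5, coverage=coverage)
        print(f"coverage={coverage:<5} residual P(flaw)={risk:.3f}")
```

Under these assumptions, a successful attack settles the question at once, while a clean result only lowers the estimate in proportion to coverage. At zero coverage the posterior equals the prior: the test has taught you nothing.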

Red teaming can show a system is unsafe. It cannot show a system is safe. The two are not symmetric.

Two further limits sharpen the point. First, against a highly capable model the red team may be outmatched, and a system able to recognise a test and underperform on it could pass red teaming while retaining the capability being probed. Second, red teaming addresses misuse and known failure modes far better than it addresses the deeper worry of a model with misaligned goals of its own, which will not obligingly reveal them to an adversary looking for jailbreaks.

Where it belongs

The Foundation supports red teaming and wants more of it, done independently rather than only in-house, with results disclosed rather than buried. It is a strong tool for catching the harms it is built to catch. It is not a certificate of safety, and treating a clean red-team result as one repeats the error that runs through so much of current practice: mistaking the absence of proof of danger for proof of safety. Red teaming belongs inside a governance regime that does not depend on the stress test being exhaustive, the design of which is set out in our plan.