AIRiskAware
Explainer

What Is AI Red Teaming?

AI red teaming is the practice of systematically testing AI systems by deliberately attempting to make them produce harmful, biased, inaccurate, or policy-violating outputs. It adapts the cybersecurity concept of red teaming — where security professionals simulate attacks to find vulnerabilities — to AI systems. AI red teaming includes adversarial prompting (trying to bypass safety guardrails), bias probing (testing for discriminatory outputs across demographic groups), capability testing (evaluating what the model can do that it should not), and robustness testing (testing behaviour under unusual or edge-case inputs).

Definition

AI Red Teamingstructured adversarial testing of an AI system — by humans or other AI systems — to identify vulnerabilities, failure modes, harmful outputs, and ways the system can be misused.

AI red teaming extends traditional cybersecurity red teaming to AI-specific risks: jailbreaks, prompt injection, training data extraction, model evasion, and emergent capability surfacing. It is mandated for systemic-risk GPAI under EU AI Act Article 55. NIST AI RMF MEASURE function references red teaming, and the US, UK, and Japanese AI Safety Institutes all run formal red-team programmes for frontier models.

Source: EU AI Act, Article 55; NIST AI RMF MEASURE function; US AISI red team

Why it matters for governance

The EU AI Act requires providers of GPAI models with systemic risk to conduct adversarial testing. The White House AI commitments (July 2023) include voluntary red teaming commitments from major AI companies. NIST AI RMF includes red teaming as a key practice in the MEASURE function. For organisations deploying AI, red teaming provides evidence that the system behaves as intended and that safety controls are effective — evidence that regulators and auditors increasingly expect to see.