Anthropic Just Got the Entire Industry to Agree on Something: Jailbreaks Need a Scale

On June 30, Anthropic announced a new industry framework for evaluating the severity of AI jailbreaks, developed in collaboration with Amazon, Microsoft, Google, OpenAI, and more than a dozen other organizations.

The framework is open for public comment through July 24 and is expected to be finalized by fall. It establishes a common severity scale for prompt-based attacks, ranked from low to critical in much the same way cybersecurity vulnerabilities are scored. The goal is to give researchers, developers, and deployers a shared vocabulary for comparing the seriousness of jailbreak attempts across different models and deployments.

Anthropic said the risk from prompt-based attacks is “serious and growing” as AI systems gain more capabilities and autonomy.


Amazon, Microsoft, Google and OpenAI All Signed

The participants include major AI developers, deployers, and government safety institutes: Anthropic, Google DeepMind, Meta, and OpenAI; Amazon and Microsoft; the UK AI Safety Institute and the US National Institute of Standards and Technology (NIST); plus Cohere, Databricks, IBM, NVIDIA, Palantir, Scale AI, and others.

NIST’s involvement is significant. The institute develops standards used across many industries, and its participation suggests the framework could eventually underpin formal AI safety regulation rather than remain a voluntary industry guideline.

Four Levels: Low to Critical

The framework defines four severity levels for jailbreak attempts:

Low: Spam, mild hallucination, falsehoods — impacts users but not critical systems
Medium: Unauthorized access attempts, data exposure, user confusion
High: Fraud, automated malicious activity, operational disruption
Critical: Bioweapon instructions, infrastructure sabotage, large-scale financial fraud

The severity levels are meant to help determine which jailbreak attempts require an immediate response and which can be handled through routine monitoring.

Safety Standards That Apply to Everyone

This is a rare moment of industry-wide alignment. Anthropic developed the draft framework with its major competitors (Google DeepMind, Meta, and OpenAI), and all of them signed on. That seldom happens in a race where models are so frequently judged by their safety shortcomings.

It also suggests a shift from spotting new attacks to triaging them. As jailbreak methods multiply, the ability to prioritize is now more valuable than the ability to detect. This framework creates a standard way to answer the key question: which jailbreaks actually matter?

There is also a preemptive element: an industry-wide safety standard reduces the risk of any single company being blamed for a catastrophic failure. The framework is open for public comment until July 24 and is expected to be finalized this fall, just as Anthropic’s IPO preparations are expected to intensify.

P.S. Anthropic is leading a group that includes its own competitors. That’s a notable achievement, but it’s also a defensive move. Safety standards that apply to everyone are cheaper than safety standards that apply to just one company, and if something goes wrong, the blame is shared. With NIST at the table, that shared standard could also end up carrying regulatory weight.


CRAZE