When "Jailbreakable" Becomes a Recall Standard

The US government issued a directive yesterday requiring Anthropic to suspend all access to Fable 5 and Mythos 5, its two most capable models, for every foreign national, anywhere. Because compliance with that definition is effectively impossible to verify at scale, Anthropic disabled the models for everyone.

The stated reason: a method of bypassing the models' safeguards had been identified. A potential jailbreak. A narrow, non-universal one, Anthropic is careful to note: the kind that can elicit some information under specific circumstances, not the kind that broadly unlocks dangerous capabilities.

Anthropic has published their disagreement clearly. It is worth reading. But the statement raises a question the company, understandably, cannot ask too loudly: if this standard were applied consistently, would any frontier model currently deployed survive it?

The answer, by Anthropic's own account, is no. They write that the capability demonstrated by the alleged jailbreak — roughly, asking the model to review a codebase and identify vulnerabilities — is "widely available from other models (including OpenAI's GPT-5.5)" and is "used every day by the defenders who keep systems safe."

Sit with that for a moment. The threshold being applied here is not "does this model provide meaningful uplift over what adversaries could already access?" It appears to be closer to "has anyone demonstrated any method of eliciting any capability we find concerning?" Those are very different standards. The first is a reasonable basis for oversight. The second would logically require pulling every capable model already in the market.

Anthropic's defence-in-depth approach is sound security thinking: make jailbreaks narrow or expensive, then layer on monitoring and data retention. It is how serious organisations approach security problems that cannot be solved by a single perfect control. Perfect jailbreak resistance is not achievable. Pretending otherwise does not make it so.

What troubles me here is less the underlying disagreement about risk tolerance, which is a legitimate conversation, and more the process.

Anthropic received a letter at 5:21pm with no specific details of the national security concern. The evidence shared was, by their account, verbal. The directive required immediate compliance, affecting hundreds of millions of users with no notice. There was no prior disclosure, no opportunity to demonstrate the safeguards, no framework for what standard would need to be met for access to be restored.

Whatever one thinks of Fable 5's risk profile, this is not how sound governance of a critical technology looks. The concern with AI safety is, in part, that decisions with large consequences are being made without adequate process. That critique applies in both directions.

Anthropic has stated publicly that governments should have the ability to block unsafe deployments — as part of a process that is "transparent, fair, clear, and grounded in technical facts." Yesterday met none of those descriptors.

The honest version of this situation is that frontier AI governance does not yet have the institutions it needs. There is no agreed technical framework for what level of capability, under what circumstances, constitutes grounds for recall. There is no clear process for disclosure, assessment, or appeal. There is no shared definition of "unsafe" that could be applied consistently across providers.

Into that vacuum, decisions get made informally, urgently, and without the kind of evidence base that would allow anyone — companies, regulators, or the public — to evaluate whether they were right.

The precedent being set here matters more than the specific models being suspended. If any demonstrated narrow jailbreak is sufficient grounds for removing a commercial model from deployment, the industry needs to understand that now. So do the organisations that have built workflows, products, and services on top of these models.

Anthropic says it believes this is a misunderstanding and is working to restore access. Perhaps it will be resolved quickly and cleanly. But the conditions that produced it have not changed: no clear standards, no transparent process, and no shared technical language between regulators and labs. The next incident will find the same gaps waiting.

The capability to ask a model to review code for vulnerabilities is not, in itself, a safety failure. The failure is building the oversight frameworks after the models are deployed, not before.

Sources: Anthropic statement on Fable 5 and Mythos 5 access suspension · Anthropic's launch post for Fable 5 and Mythos 5 · Anthropic's policy on the AI Exponential

More from the blog

Stay current weekly