
Fable 5’s new classifier: >99% block rate, and the trade-offs

The Fable 5 that came back today is not quite the Fable 5 that went dark on June 12. The model returns with a new safety classifier aimed squarely at the cyber jailbreak that triggered the suspension — the technique Amazon researchers reported, which got the model to flag software vulnerabilities and write exploit-demonstration code. Anthropic says the classifier blocks that technique in more than 99% of cases. This story covers what that costs, who vouched for it, and why a lot of paying users are angry anyway.

The trade-offs Anthropic admits

Anthropic is unusually direct about the cost of the new net. The classifier's "expanded safety margin" deliberately fires on some benign requests, which means more false positives during routine coding and debugging — the bread-and-butter work most Fable 5 users actually do. And in some flows, rather than refusing outright, the system hands the task to Opus 4.8 and notifies the user. You ask the frontier model; a different model answers, with a note explaining why.

That handoff is the friction point. Opus 4.8 is a strong model, but nobody paying frontier prices — or spending half-rate weekly limits during the July 1–7 window — wants a coin-flip on which model handles their debugging session. Early complaints center exactly there: routine coding tasks bounced to Opus by a classifier tuned for exploit development (MakeUseOf's rundown of the caveats).

Who vouched for it

The safeguards did not go back into production on Anthropic's word alone. Commerce's Center for AI Standards and Innovation (CAISI) tested both the prior and the new safeguards and called them "extraordinarily strong" — the independent verdict the export-control reversal leaned on. Anthropic also stood up a HackerOne vulnerability-disclosure program specifically for cyber jailbreaks in Fable 5, plus a 24/7 monitoring team (The Hacker News has the security detail).

A framework for the next jailbreak

Anthropic's most interesting move is an admission: "It is probably impossible to make any AI model fully robust (that is, impervious) to jailbreaks." And: "We expect that some jailbreaks will be found for our models, and that they will vary in severity." Rather than promising perfection, it proposes scoring jailbreak severity on four criteria: capability gain over existing tools; the breadth of offensive tasks enabled; ease of weaponization (a single-prompt exploit is worse than a fragile multi-step one); and discoverability. The most severe findings would trigger immediate preliminary mitigations. Anthropic wants this "codified in strong regulation and applied equally across frontier model developers." Notably, under that standard the June incident might not have warranted a worldwide shutdown: Anthropic maintains that every model tested could produce the same demonstration, which would imply little capability gain over existing tools.

The backlash

None of this engineering has mollified subscribers, because the return terms shrank. The original launch promise was fourteen free days on paid plans, through June 22; the suspension cut that to three actual days. The restoration offers seven days at half usage. PCWorld rounds up the reaction, and the Reddit threads are blunt. "Not a good look to bring Fable back and then both half the usage and take away days," wrote one user. Another did the arithmetic: "We got to use it for like 3 days out of the 14 we were told, and now we get it for just 7 days at half usage?"
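The subscriber arithmetic can be made explicit with a short sketch. It uses only the figures quoted above (14 days promised, 3 delivered, 7 returning at half usage). It assumes, as a simplification that is not Anthropic's accounting, that a day at half usage counts as half a "full-usage day"; the function and variable names are illustrative.

```python
# Figures from the article: 14 days promised, 3 delivered before the
# suspension, then 7 days at half usage on return.
PROMISED_DAYS = 14
DAYS_BEFORE_SUSPENSION = 3
RETURN_DAYS = 7
RETURN_USAGE_RATE = 0.5  # "half usage"; treated here as half a full day (assumption)


def full_usage_equivalent(days: float, rate: float = 1.0) -> float:
    """Convert calendar days at a given usage rate into full-usage days."""
    return days * rate


delivered = (
    full_usage_equivalent(DAYS_BEFORE_SUSPENSION)
    + full_usage_equivalent(RETURN_DAYS, RETURN_USAGE_RATE)
)
shortfall = PROMISED_DAYS - delivered
share = delivered / PROMISED_DAYS

print(f"Delivered: {delivered} full-usage days of {PROMISED_DAYS}")
print(f"Shortfall: {shortfall} days ({share:.0%} of the promise met)")
```

On that reading, ten calendar days of access work out to 6.5 full-usage days, a little under half of what was promised.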

Add the classifier bouncing coding tasks to Opus 4.8, and the complaint compounds: less time, less usage, and less certainty the frontier model even handles your request. Anthropic has not directly addressed the complaints. The suspension was the government's doing; the return terms are Anthropic's, and that distinction is not lost on the people posting.

What to do about it

Expect some false positives this week, especially on security-adjacent coding — debugging, vulnerability triage, anything that smells like exploit work. When a task hands off to Opus 4.8, you'll get a notification; treat it as a signal to rephrase or split the request rather than a dead end. If you hit a block you believe is a genuine classifier flaw, the HackerOne program is the sanctioned channel. And budget your window time assuming a few bounces — seven days at half limits leaves little room for wasted runs.
