What the evidence shows so far
- Frontier physics research — the most concrete cross-vendor data point. Notation Capital: Fable was "the strongest model on frontier physics research while using a third of the reasoning tokens. In 36 hours it got nearly to where GPT-5.5 landed after four days."
- Anthropic's benchmark table — the announcement's comparison chart puts Fable at state-of-the-art on nearly all tested benchmarks against other leading models. Vendor-published — weigh accordingly.
- Hands-on community tests — early side-by-sides on hard coding prompts lean Fable ("makes GPT 5.5 feel like a toy", 6-surprising-uses comparison) — enthusiast-grade evidence, ranked on our charts.
What doesn't exist yet
Independent, methodical cross-vendor evals — LMArena-style rankings with Fable included, third-party SWE-bench runs, cost-normalized agentic comparisons. It's been one day. The evidence wall adds them as they publish; claims without receipts don't go up, and that cuts in both directions.
The structural differences you can compare today
| Fable 5 | GPT-5.5 | |
|---|---|---|
| Positioning | Highest-capability tier, safeguarded Mythos-class | OpenAI's frontier flagship |
| Sampling controls | None — prompting only | Vendor-specific |
| Reasoning control | Adaptive thinking + 5 effort levels | Vendor-specific |
| Context | 1M tokens, no long-context premium | See OpenAI's current docs |
| Safety architecture | Runtime classifiers + Opus 4.8 fallback, published red-team numbers | Different approach; see OpenAI's system card |
We deliberately don't quote GPT-5.5 specs from memory — vendor docs change. Check OpenAI's current documentation for their side of the table.
Run your own bake-off (an afternoon, ~$20)
- Pick your three hardest real tasks from the last month — not puzzles, your actual work.
- Write each as one complete brief: goal, constraints, what "done" looks like. (This favors no one — both frontier models reward specification.)
- Run each on both models. For Fable use our recommended defaults: adaptive thinking, effort
high. - Score on: did it finish without intervention; correctness; total cost from the usage fields — not tokens, completed-task cost.
Moral: benchmarks are other people's tasks. The only chart that matters is yours.