1. Give it your hardest real ticket
Not a puzzle — the ticket your team has bounced twice. Every's internal senior-engineer exam put Fable at 91/100 where Opus 4.8 scored 63; your backlog is the version of that exam that matters to you.
Here is a bug we have failed to fix twice. [paste the issue, the relevant
code, and what was tried]. Reproduce the reasoning behind the failure,
identify the actual root cause, and propose a fix with the tradeoffs stated.
Expect: the difference shows in the diagnosis, not the patch — watch whether it finds the cause your team missed.
2. Run a real migration or refactor
The Stripe story (50M lines, one day) scales down honestly. Pick a real chore: a framework upgrade, a deprecated-API sweep, a test-suite modernization. Give the whole brief in one turn — goal, constraints, what "done" looks like:
Migrate this module from X to Y. Constraints: no behavior changes, keep the
public API stable, tests must pass. Done means: all call sites updated, tests
green, a summary of every judgment call you made.
Recipe: the whole-task-brief pattern. In Claude Code, this is where Fable's long-horizon strength shows most.
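One way to check the "all call sites updated" part of the brief yourself is to scan the tree for leftover references to the old API. The sketch below is a minimal version under stated assumptions: `find_remaining_call_sites` and `old_fetch` are hypothetical names, the scan is a plain whole-word text match rather than a real parse, and only `.py` files are checked by default.

```python
import re
from pathlib import Path


def find_remaining_call_sites(root, old_name, suffixes=(".py",)):
    """Return (path, line_number, line) for every line still mentioning old_name."""
    # Whole-word match, so "old_fetch" does not match inside "my_old_fetch_v2".
    pattern = re.compile(r"\b" + re.escape(old_name) + r"\b")
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((str(path), number, line.strip()))
    return hits


if __name__ == "__main__":
    # "old_fetch" is a placeholder; substitute the API you asked to have migrated.
    for path, number, line in find_remaining_call_sites(".", "old_fetch"):
        print(f"{path}:{number}: {line}")
```

An empty result does not prove the migration is correct, since comments and strings also match and dynamic lookups do not. It is a cheap cross-check against the summary the model hands back.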
3. Build something playable from one prompt
The one-prompt Mario game wasn't a fluke — single-shot building is a real capability tier now. Make the brief vivid and complete:
Build a complete, playable browser game in a single HTML file: a lighthouse
keeper rowing supplies between islands as storms roll in. Canvas rendering,
keyboard controls, rising difficulty, a score, and a title screen. Make it
feel finished — sound optional, polish mandatory.
Expect: a working game on the first try; judge it on the polish you didn't ask for.
4. Rebuild an interface from a screenshot
The announcement's vision claim, that Fable rebuilds web-app source from screenshots, is easy to test. Screenshot any app you use, then:
[attach screenshot] Rebuild this interface as a single HTML file: faithful
layout, spacing, and typography, with working interactions where visible.
Note anything you can see but cannot infer.
Expect: startling fidelity on layout; the "note what you can't infer" line shows you how carefully it reads pixels.
5. Drop in something enormous
The 1M-token window is a feature most people never actually exercise. Feed it a whole repo, a year of meeting notes, or a full contract set:
[attach everything] Here is the entire codebase. Map the architecture:
the real module boundaries, the load-bearing files, the dead code, and the
three refactors that would pay off most. Cite file paths for every claim.
Expect: the citations are the test — Fable stays accurate across hundreds of thousands of tokens where smaller-window models lose the thread.
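Since the citations are the test, it helps to check them mechanically. The sketch below pulls path-like strings out of the model's answer and checks whether each exists in your checkout. It rests on assumptions: `check_cited_paths` is a hypothetical helper, cited paths are taken to be repo-relative, to contain at least one slash, and to end in a file extension. The regex will also pick up things like version strings, so eyeball the "missing" list before drawing conclusions.

```python
import re
from pathlib import Path

# Assumption: cited paths are repo-relative, contain a slash, and end in an extension.
PATH_PATTERN = re.compile(r"[\w./-]+\.[A-Za-z0-9]+")


def check_cited_paths(answer_text, repo_root):
    """Split path-like strings in the answer into those that exist on disk and those that don't."""
    root = Path(repo_root)
    candidates = {
        match.group(0)
        for match in PATH_PATTERN.finditer(answer_text)
        # Skip slash-free tokens ("e.g") and absolute-looking ones (URL remnants).
        if "/" in match.group(0) and not match.group(0).startswith("/")
    }
    found, missing = [], []
    for candidate in sorted(candidates):
        (found if (root / candidate).is_file() else missing).append(candidate)
    return {"found": found, "missing": missing}
```

Save the answer to a file, call `check_cited_paths(Path("answer.txt").read_text(), ".")` from the repo root, and read the `missing` list first: a cited file that does not exist is the clearest sign the model lost the thread.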
6. Run an overnight agent with a memory file
The Slay-the-Spire result (memory helped Fable 3× more than Opus) is reproducible in Claude Code. Give it a long task plus one instruction:
Work through this list overnight. Keep a NOTES.md as you go: decisions made,
dead ends hit, what's left. Before each new item, re-read your notes.
Recipe: memory files + task budgets. GitLab reports multi-day runs staying coherent; your overnight run is the small version.
7. Run the bake-off
Before June 23 forces the question, answer it with data: run tasks 1, 2, and 5 on both Fable and your current model. Score each run on three things: whether it finished without intervention, whether the result was correct, and what it cost in total. The protocol is written up in our bake-off guide. It takes an afternoon, and then you'll know.
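The three scoring criteria above fit in a very small scorecard. This is a minimal sketch, not the protocol from the bake-off guide: `RunResult`, `summarize`, and the field names are assumptions made for illustration, and you fill in the records by hand after each run.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RunResult:
    model: str              # label for the model under test
    task: str               # e.g. "task 1", "task 2", "task 5"
    finished_unaided: bool  # completed without a human stepping in
    correct: bool           # your own judgment of the final result
    cost_usd: float         # total spend for the run


def summarize(results):
    """Aggregate per-model completion rate, correctness rate, and total cost."""
    by_model = defaultdict(list)
    for result in results:
        by_model[result.model].append(result)
    summary = {}
    for model, runs in by_model.items():
        count = len(runs)
        summary[model] = {
            "runs": count,
            "finished_unaided_rate": sum(r.finished_unaided for r in runs) / count,
            "correct_rate": sum(r.correct for r in runs) / count,
            "total_cost_usd": round(sum(r.cost_usd for r in runs), 2),
        }
    return summary
```

Record one `RunResult` per model per task, then print `summarize(results)`. With only three tasks per model the rates are coarse, so read them alongside your notes on where each run needed help.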
Moral: a free window is a question with a deadline — "what would I do with the best model there is?" Answer it with real work.