The shape
Beta header task-budgets-2026-03-13; the budget covers the whole agentic loop — thinking, tool calls, and final output combined:
response = client.beta.messages.create(
betas=["task-budgets-2026-03-13"],
model="claude-fable-5",
max_tokens=64000,
thinking={"type": "adaptive"},
output_config={
"effort": "high",
"task_budget": {"type": "tokens", "total": 128000},
},
messages=[{"role": "user", "content": long_agentic_task}],
)
Budget vs. ceiling
task_budget | max_tokens | |
|---|---|---|
| Model aware of it | Yes — sees a running countdown | No |
| Enforcement | Suggestion — model self-moderates | Hard per-response cap |
| Scope | The whole task loop | One response |
| Behavior at the limit | Prioritizes and wraps up gracefully | Truncates mid-thought |
Use both: a generous task_budget so Fable paces itself, and max_tokens as the hard backstop.
Choosing the number
- Minimum is 20,000 — below that the request errors.
- Generous for open-ended work, tighter for latency-sensitive runs. Too tight and the model completes the task less thoroughly, citing the budget as its constraint.
- Don't guess in production: run the workload once unbudgeted, measure with
response.usage, then set the budget from data. - Per-step depth is still
effort— the budget shapes the run's total spend, not how hard each step thinks.
Moral: tell the traveler how far the inn is, and they'll pace the horse themselves.