Recipe 04

Prompt caching

At $10 per million input tokens with a million-token window, caching is the difference between affordable and not. One rule governs everything.

The one rule

Caching is a prefix match. The cache key is the exact bytes of the rendered prompt — tools, then system, then messages. A single byte changed anywhere invalidates everything after it. Get the ordering right (stable content first, volatile content last) and caching mostly works for free.
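When two requests that "should" share a cache don't, it helps to diff them in render order and see where they first diverge. The sketch below assumes requests are plain dicts with `tools`, `system` and `messages` keys. `first_divergence` is a hypothetical helper, and it serializes each section with `json.dumps` only as an approximation: the API's real internal rendering is not exposed, so treat the offset as a pointer, not an exact byte position.

```python
import json

# Render order described in the recipe: tools, then system, then messages.
SECTIONS = ("tools", "system", "messages")


def first_divergence(req_a: dict, req_b: dict):
    """Return (section, char_offset) of the first difference, or None.

    Approximation only: each section is serialized locally with sorted keys,
    which is not guaranteed to match how the API renders the prompt.
    """
    for section in SECTIONS:
        a = json.dumps(req_a.get(section), sort_keys=True)
        b = json.dumps(req_b.get(section), sort_keys=True)
        if a != b:
            # First mismatching character; if one string is a prefix of the
            # other, the divergence is where the shorter one ends.
            offset = next(
                (i for i, (x, y) in enumerate(zip(a, b)) if x != y),
                min(len(a), len(b)),
            )
            return section, offset
    return None
```

Because sections are checked in render order, the first section reported is where the cache stops matching; everything after it is a miss regardless of whether it also changed.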

The basic move

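import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-fable-5",
    max_tokens=16000,
    system=[{
        "type": "text",
        "text": LARGE_STABLE_SYSTEM_PROMPT,  # your long, unchanging instructions
        "cache_control": {"type": "ephemeral"},  # 5-min TTL; add "ttl": "1h" for one hour
    }],
    messages=[{"role": "user", "content": question}],  # volatile content goes last
)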

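For multi-turn agents, also mark the last content block of the newest turn on every request. Entries cached on earlier turns stay readable, so each request reads the prefix cached by the previous one and writes only the new tail; hits accrue as the conversation grows. A single request may carry at most 4 breakpoints.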
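A minimal sketch of that multi-turn move, assuming messages are plain dicts in the Messages API shape. `mark_newest_turn` is a hypothetical helper, not part of the SDK. It marks the newest turn and, to respect the 4-breakpoint limit, drops the oldest message breakpoints, reserving one slot for the system prompt. It assumes, as the recipe says, that prefixes cached on earlier turns remain readable after their marker moves on.

```python
import copy

MAX_BREAKPOINTS = 4  # per-request limit stated in the recipe


def mark_newest_turn(messages: list[dict], reserved: int = 1) -> list[dict]:
    """Return a copy of `messages` with a cache breakpoint on the newest turn.

    `reserved` counts breakpoints used outside `messages` (e.g. one on the
    system prompt). Older message breakpoints are removed, oldest first, so
    the request stays within MAX_BREAKPOINTS.
    """
    msgs = copy.deepcopy(messages)  # never mutate the caller's history
    last = msgs[-1]
    # Plain-string content has nowhere to hang cache_control; wrap it in a text block.
    if isinstance(last["content"], str):
        last["content"] = [{"type": "text", "text": last["content"]}]
    last["content"][-1]["cache_control"] = {"type": "ephemeral"}

    # Every block currently carrying a breakpoint, in prompt order.
    marked = [
        block
        for m in msgs
        if isinstance(m["content"], list)
        for block in m["content"]
        if "cache_control" in block
    ]
    budget = MAX_BREAKPOINTS - reserved
    for block in marked[: max(0, len(marked) - budget)]:
        del block["cache_control"]
    return msgs
```

In an agent loop you would call it just before each request, e.g. `messages=mark_newest_turn(history)`, leaving the breakpoint on the system prompt as in the basic move.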

Verify it's working

print(response.usage.cache_creation_input_tokens)  # wrote cache (~1.25x)
print(response.usage.cache_read_input_tokens)      # read cache (~0.1x)
print(response.usage.input_tokens)                 # full price
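The three counters can be folded into a quick per-response check. This is a sketch with hypothetical helper names (`cache_status`, `cached_share`); it assumes, as with Anthropic's API, that `input_tokens` counts only the uncached tokens, so the three fields add up to the request's total input, and it treats a missing (`None`) cache field as zero.

```python
def cache_status(usage) -> str:
    """Classify one response's usage as 'hit', 'write', or 'miss'."""
    read = usage.cache_read_input_tokens or 0
    written = usage.cache_creation_input_tokens or 0
    if read:
        return "hit"    # part of the prefix was served from cache (~0.1x)
    if written:
        return "write"  # prefix stored for later requests (~1.25x)
    return "miss"       # nothing cached: look for a silent invalidator


def cached_share(usage) -> float:
    """Fraction of this request's input tokens billed at the cache-read rate."""
    read = usage.cache_read_input_tokens or 0
    written = usage.cache_creation_input_tokens or 0
    total = usage.input_tokens + written + read
    return read / total if total else 0.0
```

The expected pattern for repeated requests with the same prefix is one `"write"` followed by `"hit"`s; a run of `"miss"` is the signal to start hunting.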

If cache_read_input_tokens stays zero across requests that should share a prefix, hunt for a silent invalidator:

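Pattern | Why it kills the cache
datetime.now() in the system prompt | The prefix changes on every request
json.dumps(d) without sort_keys=True | Key order follows insertion order, so equal dicts can serialize to different bytes
Per-user IDs interpolated early | Everything after the ID is unique per user, so nothing is shared across users
Tool set varies per request | Tools render first, so any change misses the whole prompt
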
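The fixes for the first three rows share one shape: serialize deterministically and push per-request data to the end. A minimal sketch, assuming a hypothetical `STABLE_SYSTEM` string and a hypothetical `build_request` helper that returns keyword arguments you could pass to `client.messages.create(model=..., max_tokens=..., **req)`. The timestamp and per-user context go into the final user message, so the system prompt bytes never change.

```python
import json
from datetime import datetime

# Hypothetical stand-in for a long, frozen system prompt.
STABLE_SYSTEM = "You are a careful support assistant."


def build_request(context: dict, question: str, now: datetime) -> dict:
    """Keep the front of the prompt frozen; put volatile data last."""
    # Canonical JSON: sorted keys and fixed separators, so equal dicts give equal bytes.
    context_json = json.dumps(context, sort_keys=True, separators=(",", ":"))
    return {
        "system": [{
            "type": "text",
            "text": STABLE_SYSTEM,  # no timestamps or user IDs up here
            "cache_control": {"type": "ephemeral"},
        }],
        "messages": [{
            "role": "user",
            "content": (
                f"Current time: {now.isoformat(timespec='minutes')}\n"
                f"Context: {context_json}\n\n{question}"
            ),
        }],
    }
```

Two calls made hours apart, with the context dict built in a different key order, still produce an identical `system` block, which is exactly the part the breakpoint caches.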

Fable-specific: the minimum cacheable prefix on claude-fable-5 is 2,048 tokens. Shorter prefixes silently don't cache: there is no error, just cache_creation_input_tokens: 0.
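  • Reads cost ~10% of base input; writes cost 1.25× base with the 5-minute TTL (2× with the 1-hour TTL). With the 5-minute TTL, two requests already come out ahead: 1.25 + 0.1 = 1.35× versus 2× uncached.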
  • Switching models invalidates the cache (it's model-scoped) — expect one fresh write after migrating.
  • Don't edit the system prompt mid-session; inject dynamic context later in messages instead.
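The break-even arithmetic in the first bullet, as a small calculator. It is a sketch under stated assumptions: `prefix_cost` is a hypothetical helper, the price is the $10 per million input tokens quoted at the top of this recipe, every request is assumed to land inside the TTL (so only the first one writes), and only the shared prefix is counted.

```python
BASE_USD_PER_MTOK = 10.00  # input price quoted at the top of the recipe
WRITE_MULT = 1.25          # 5-minute-TTL cache write
READ_MULT = 0.10           # cache read


def prefix_cost(prefix_tokens: int, n_requests: int, cached: bool) -> float:
    """Dollar cost of sending the same prefix n_requests times.

    Assumes all requests fall within the TTL, so only the first one writes.
    """
    per_token = BASE_USD_PER_MTOK / 1_000_000
    if not cached:
        return n_requests * prefix_tokens * per_token
    writes = prefix_tokens * WRITE_MULT
    reads = (n_requests - 1) * prefix_tokens * READ_MULT
    return (writes + reads) * per_token
```

For a full million-token prefix that is $10 uncached versus $12.50 for the first cached call, then $1 per call after that: one request loses 25%, two requests cost $13.50 instead of $20. Setting `WRITE_MULT = 2.0` models the 1-hour TTL.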

Moral: the cache doesn't reward cleverness, it rewards stillness — freeze the front of your prompt.
