Prompt Caching: Cutting LLM Costs Without Quality Loss
Prompt caching is the single most under-used cost lever in production LLM applications. Teams that adopt it consistently see 50–90% reductions in input-token cost on cache-heavy workloads, and more importantly, meaningful latency improvements on long contexts.
How Prompt Caching Works
The mechanism is the same in spirit across providers: when you send a request that shares a prefix with a previous request, the provider serves the shared prefix from a server-side cache instead of reprocessing those tokens. Cache hits are billed at a fraction of the normal input rate (Anthropic: 10% of the input-token rate on hits, with a 25% premium on writes; OpenAI and Azure OpenAI offer discounted hit rates with provider-specific details). Implementations differ:
- Anthropic Claude. Explicit cache_control breakpoints you place on specific message blocks. Up to four breakpoints. Default five-minute TTL; a one-hour extended TTL is available.
- Azure OpenAI / OpenAI. Automatic caching of prefixes above a minimum length threshold. No explicit breakpoints; the service decides where the cacheable prefix ends.
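The billing maths is worth making concrete. A minimal sketch, assuming Anthropic's published multipliers (1.25x on cache writes, 0.1x on cache reads) and an illustrative base price; the function name and inputs are mine, not any provider's API:

```typescript
// Sketch of the billing math. Assumes cache writes cost 1.25x the base
// input rate and cache reads cost 0.10x, per the multipliers above.
function cachedRequestCost(
  basePricePerMTok: number, // illustrative base price, $ per million tokens
  cacheWriteTokens: number,
  cacheReadTokens: number,
  freshTokens: number,
): number {
  const perToken = basePricePerMTok / 1_000_000;
  return (
    cacheWriteTokens * perToken * 1.25 + // write premium, first request only
    cacheReadTokens * perToken * 0.1 + // discounted hits on later requests
    freshTokens * perToken // volatile suffix, full price
  );
}

// First request writes a 10k-token prefix; later requests read it.
const write = cachedRequestCost(3, 10_000, 0, 500);
const hit = cachedRequestCost(3, 0, 10_000, 500);
const uncached = cachedRequestCost(3, 0, 0, 10_500);
```

The first request costs slightly more than an uncached one; every subsequent hit costs a small fraction of it, which is where the 50–90% figure comes from.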
What to Cache
- Long, stable system prompts. Instructions, style guides, safety policies. Anything that changes across deploys, not per request.
- Tool definitions. If you pass tool schemas, they are stable and expensive to re-tokenise. Cache them immediately after the system prompt.
- Reference documents. A policy document or codebase excerpt that every user query in a session references.
- Few-shot examples. Long few-shot blocks are the textbook caching win.
- Conversation history. Sometimes. Cache the conversation up to the most recent turn; the newest user message is fresh.
What Not to Cache
- User-specific data that changes every request. No hit, pure overhead.
- Very short contexts. Providers enforce minimum prefix lengths (Claude: 1,024 tokens on most models; Azure OpenAI: similar thresholds). Below that, caching is a no-op.
- Contexts that mutate in the middle. If you inject something between the cached prefix and the user message, the hit evaporates.
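One way to respect the minimum-length rule is to gate breakpoints on a rough size estimate. A sketch, assuming Claude's 1,024-token floor and a crude chars-over-four heuristic in place of a real tokenizer; both the helper names and the heuristic are mine:

```typescript
// Hedged sketch: only attach a cache breakpoint when the block is plausibly
// above the minimum cacheable prefix length (assumed 1,024 tokens here).
const MIN_CACHEABLE_TOKENS = 1024;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4); // rough heuristic, not a real tokenizer
}

type SystemBlock = {
  type: 'text';
  text: string;
  cache_control?: { type: 'ephemeral' };
};

function systemBlock(text: string): SystemBlock {
  const block: SystemBlock = { type: 'text', text };
  if (estimateTokens(text) >= MIN_CACHEABLE_TOKENS) {
    block.cache_control = { type: 'ephemeral' }; // long enough to be worth caching
  }
  return block;
}
```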
Cache Breakpoints in Claude
```typescript
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const response = await anthropic.messages.create({
  model: 'claude-opus-4-7',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: INSTRUCTIONS, // stable, long
      cache_control: { type: 'ephemeral' },
    },
    {
      type: 'text',
      text: REFERENCE_DOCS, // stable, long
      cache_control: { type: 'ephemeral' },
    },
  ],
  tools, // stable; cached by position
  messages: [
    ...conversationHistory, // cached up to the previous turn
    { role: 'user', content: newUserMessage }, // fresh — not cached
  ],
});
```
Key rules: everything up to a breakpoint is cached together as a single unit, so place breakpoints at stability boundaries: after the system prompt, after the tool definitions, after the reference docs. Everything after the last breakpoint is the volatile suffix.
TTL Realities
The default Anthropic TTL is five minutes. If your users return after a coffee break, you pay the write premium again. The one-hour extended TTL is better for interactive workflows where sessions pause; it costs more on write but amortises across hours of idle gaps. The rule of thumb: if you expect the same prefix to be used more than twice within an hour, pay the extended TTL write premium once; it will win.
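That break-even can be sketched numerically. The multipliers below assume Anthropic's published pricing (1.25x writes for the five-minute TTL, 2x writes for the one-hour TTL, 0.1x reads for both); costs are in base-rate token units, and the function names are mine:

```typescript
// Rough break-even sketch for the extended TTL, assuming 1.25x writes for
// the 5-minute TTL, 2x writes for the 1-hour TTL, and 0.1x reads for both.
// `rewrites` counts how many times the prefix expires and must be rewritten
// under the short TTL. Results are in base-rate token units.
function shortTtlCost(prefixTokens: number, rewrites: number, reads: number): number {
  return prefixTokens * (1.25 * rewrites + 0.1 * reads);
}

function longTtlCost(prefixTokens: number, reads: number): number {
  return prefixTokens * (2.0 + 0.1 * reads); // one expensive write, then hits
}
```

With one write per hour the short TTL wins; by the third rewrite within the hour, the extended TTL is already cheaper, which matches the more-than-twice rule of thumb.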
Measuring Actual Impact
Every provider returns cache metrics in the response (Anthropic: cache_creation_input_tokens, cache_read_input_tokens). Ship these into your LLM observability stack immediately. If you cannot answer "what is my cache hit rate by user segment over the last week" you are shipping the feature blind.
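The hit-rate calculation itself is trivial once the fields are flowing. A sketch, with a Usage type mirroring the Anthropic field names above; wiring the result into your metrics pipeline is left out:

```typescript
// Hit rate = cached reads as a share of all input tokens processed.
// The field names mirror Anthropic's usage object described above.
type Usage = {
  input_tokens: number; // uncached input tokens
  cache_creation_input_tokens: number;
  cache_read_input_tokens: number;
};

function cacheHitRate(usage: Usage): number {
  const total =
    usage.input_tokens +
    usage.cache_creation_input_tokens +
    usage.cache_read_input_tokens;
  return total === 0 ? 0 : usage.cache_read_input_tokens / total;
}
```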
Anti-Patterns That Erase the Savings
- Dynamic content inside the cached prefix. A timestamp, a session ID, or a per-user token injected into the system prompt makes every request a miss. Move them into the user message.
- Reshuffling tool order per request. Tool definitions are cached in the order they are sent. Sort them deterministically.
- Rotating model versions. Cache entries are keyed by model. A/B tests across model variants mean separate caches for each.
- Over-caching. The 25% write premium matters. Do not wrap every ten-token prompt; you will pay more than you save.
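The tool-ordering fix is one line. A sketch, assuming a plain name sort is an acceptable canonical order for your schemas (the Tool type here is simplified):

```typescript
// Deterministic tool ordering: sorting by name keeps the serialized prefix
// identical across requests, so the cache prefix matches every time.
type Tool = { name: string; description: string };

function stableToolOrder(tools: Tool[]): Tool[] {
  return [...tools].sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));
}
```

The copy-then-sort avoids mutating the caller's array, so the same input always serializes the same way regardless of where it came from.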
The Practical Target
For a mature LLM application with stable system prompts and moderate conversation lengths, cache hit rates of 70–85% are achievable. That translates to a 50–70% reduction in input-token cost. The engineering investment is a few hours of breakpoint placement and a dashboard. It is the highest-ROI change most teams have not made.