Prompt Caching in 2026: Anthropic, OpenAI, Azure Compared
Prompt caching is the highest-ROI cost lever on long-context LLM workloads in 2026. Done well, it cuts input-token cost by 30 to 50% on agent loops and RAG pipelines, with no quality change. Done poorly, it silently does nothing. The difference is placement, breakpoint discipline, and a measurement habit.
How the Three Providers Price Caching
Anthropic, OpenAI, and Azure OpenAI all offer prompt caching, with different mechanics. The numbers below reflect public pricing in mid-2026; check the live pricing pages before architecting around them.
- Anthropic. Explicit cache_control breakpoints. Writes cost 1.25x normal input rate; reads cost 10% of normal input rate (a 90% discount on cached input). Cache TTL is 5 minutes by default, with a 1-hour option at a higher write rate.
- OpenAI. Automatic caching above a token threshold (1,024 tokens of stable prefix). Cached prefix is billed at 50% of normal input rate. No explicit breakpoints; the system detects the longest matching prefix.
- Azure OpenAI. Mirrors the OpenAI behaviour for OpenAI models; same automatic prefix matching, same pricing structure. Regional caches per deployment.
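The Anthropic model is more controllable. The OpenAI model is less work to wire up. For long, stable system prompts and tool definitions, both produce substantial discounts; at the rates above, Anthropic's cached reads are cheaper (10% against 50% of the input rate) once the 25% write premium is amortised over a few calls.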
The Cache Breakpoint Pattern
Caches hit on prefixes. The placement principle: put everything stable first, everything variable last. The order that produces the best hit rate:
- System prompt (most stable).
- Tool definitions (stable per agent version).
- Long static context (e.g. corpus excerpts that recur across queries).
- Slowly changing context (conversation history older than a few turns).
- The current user message (most variable).
Anything ordered after a variable element does not cache. A timestamp injected into the system prompt at the top of the request invalidates the entire cache. A request ID dropped into the tool list breaks the tool-list cache and everything cached after it; on Anthropic, tool definitions sit ahead of the system prompt in the cached prefix, so that includes the system prompt. The single most common mistake is putting a variable string before the long stable content.
// Anthropic: explicit breakpoints on the long stable parts
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const response = await anthropic.messages.create({
  model: 'claude-opus-4-7',
  max_tokens: 1024,
  system: [
    { type: 'text',
      text: SYSTEM_PROMPT,
      cache_control: { type: 'ephemeral' } }, // breakpoint 1
    { type: 'text',
      text: KNOWLEDGE_BASE_EXCERPT,
      cache_control: { type: 'ephemeral' } }, // breakpoint 2
  ],
  tools, // tools cache too: they precede the system blocks in the cached prefix
  messages: conversationHistory.concat([
    { role: 'user', content: userMessage } // variable, not cached
  ]),
});
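One way to catch a variable string creeping into the stable prefix is to fingerprint the stable parts on every request and alert when the fingerprint changes between deploys. The sketch below is a minimal illustration, not part of either provider's SDK: `prefixFingerprint`, `fnv1a`, and the sample strings are hypothetical, and the FNV-1a-style hash over UTF-16 code units is only a cheap stand-in (a production system would more likely use SHA-256).

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units. Non-cryptographic;
// good enough to notice that a prefix changed, not to prove it did not.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, '0');
}

// Fingerprint the stable parts in the order they are sent.
// Each part is length-prefixed so that part boundaries affect the result.
function prefixFingerprint(stableParts: string[]): string {
  return fnv1a(stableParts.map((part) => `${part.length}:${part}`).join(''));
}

// Hypothetical sample content.
const systemPrompt = 'You are an incident triage assistant.';
const toolsJson = JSON.stringify([{ name: 'lookup_runbook' }]);

const first = prefixFingerprint([systemPrompt, toolsJson]);
const second = prefixFingerprint([systemPrompt, toolsJson]);
const withTimestamp = prefixFingerprint([
  `Now: 2026-01-01T00:00:00Z\n${systemPrompt}`,
  toolsJson,
]);

console.log(first === second);        // identical stable parts, same fingerprint
console.log(first === withTimestamp); // a timestamp in the prefix changes it
```

Logging the fingerprint next to the cache metrics makes it easy to see whether a drop in hit rate lines up with a prefix change.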
Where the Cost Wins Are Largest
The economics work hardest when three conditions hold:
- The cacheable prefix is large. 5,000+ tokens of stable content makes a real difference. 500 tokens does not move the bill noticeably.
- The same prefix is reused many times. Agent loops (5 to 15 calls with the same system prompt and tools), RAG with a stable instruction template, batched evaluation runs.
- The variable part is comparatively small. If the user message is 50 tokens and the cached prefix is 8,000, the cache discount applies to more than 99% of the input tokens.
The opposite case: short prompts with mostly-variable content. A 200-token classification request has almost nothing to cache. Caching adds complexity here for almost no benefit.
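The third condition can be written as a one-line formula: on a request whose prefix is read from cache, the fraction of input cost saved is the cached share of the input multiplied by the read discount. The sketch below assumes the read rates quoted in the pricing section (0.10 for Anthropic, 0.50 for OpenAI) and ignores the one-off write cost; `steadyStateSaving` is a hypothetical helper name.

```typescript
// Fraction of input cost saved on a request whose prefix is served from cache.
// readRate is the cached-read price as a fraction of the normal input rate.
function steadyStateSaving(
  prefixTokens: number,
  variableTokens: number,
  readRate: number,
): number {
  const total = prefixTokens + variableTokens;
  if (total === 0) return 0;
  const cachedShare = prefixTokens / total;
  return cachedShare * (1 - readRate);
}

// 8,000-token cached prefix with a 50-token user message.
const longPrefixAt10 = steadyStateSaving(8000, 50, 0.10);
const longPrefixAt50 = steadyStateSaving(8000, 50, 0.50);
// 200-token classification request with nothing cacheable.
const shortPrompt = steadyStateSaving(0, 200, 0.10);

console.log((longPrefixAt10 * 100).toFixed(1)); // "89.4"
console.log((longPrefixAt50 * 100).toFixed(1)); // "49.7"
console.log(shortPrompt);                       // 0
```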
Measuring Hit Rate
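Both Anthropic and OpenAI return cache statistics in the response. Track them. The dashboard you actually want has three lines per workload: total input tokens, cached input tokens, and cache hit rate (cached / total input). On Anthropic, usage.input_tokens counts only the uncached tokens, so total input is input_tokens + cache_creation_input_tokens + cache_read_input_tokens; dividing by input_tokens alone can produce a "hit rate" above 100%. A hit rate below 50% on a long-context workload is a sign the breakpoint is in the wrong place.
// Recording cache metrics from each response
const usage = response.usage;
const cacheCreation = usage.cache_creation_input_tokens ?? 0;
const cacheRead = usage.cache_read_input_tokens ?? 0;
// Anthropic's input_tokens excludes cached tokens, so add them back for the total
const totalInput = usage.input_tokens + cacheCreation + cacheRead;
metrics.record({
  workload: 'incident_triage_agent',
  input_tokens: totalInput, // uncached + cache writes + cache reads
  cache_creation_tokens: cacheCreation,
  cache_read_tokens: cacheRead,
  output_tokens: usage.output_tokens,
  hit_rate: cacheRead / Math.max(1, totalInput),
});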
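The two providers report cache usage in different shapes, so a shared dashboard needs a small normalisation step. The sketch below assumes Anthropic's usage fields as used in this article (where `input_tokens` excludes cache reads and writes) and OpenAI's Chat Completions usage object, where `prompt_tokens` already includes the cached tokens reported in `prompt_tokens_details.cached_tokens`. The interface and function names are hypothetical, and the sample numbers are made up for illustration.

```typescript
interface AnthropicUsage {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
}

interface OpenAIUsage {
  prompt_tokens: number;
  completion_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
}

interface CacheMetrics {
  totalInput: number;
  cachedInput: number;
  hitRate: number;
}

// Anthropic: total input is the sum of uncached, cache-write and cache-read tokens.
function fromAnthropic(u: AnthropicUsage): CacheMetrics {
  const read = u.cache_read_input_tokens ?? 0;
  const written = u.cache_creation_input_tokens ?? 0;
  const totalInput = u.input_tokens + written + read;
  return { totalInput, cachedInput: read, hitRate: read / Math.max(1, totalInput) };
}

// OpenAI: prompt_tokens is already the total; cached_tokens is a subset of it.
function fromOpenAI(u: OpenAIUsage): CacheMetrics {
  const cached = u.prompt_tokens_details?.cached_tokens ?? 0;
  return {
    totalInput: u.prompt_tokens,
    cachedInput: cached,
    hitRate: cached / Math.max(1, u.prompt_tokens),
  };
}

// Illustrative values only.
const anthropicSample = fromAnthropic({
  input_tokens: 500,
  output_tokens: 200,
  cache_creation_input_tokens: 0,
  cache_read_input_tokens: 6000,
});
const openaiSample = fromOpenAI({
  prompt_tokens: 6500,
  completion_tokens: 200,
  prompt_tokens_details: { cached_tokens: 5888 },
});

console.log(anthropicSample.totalInput, anthropicSample.hitRate.toFixed(3));
console.log(openaiSample.totalInput, openaiSample.hitRate.toFixed(3));
```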
Where the Cache Silently Goes Cold
Four failure modes worth specific attention:
- TTL expiry on bursty traffic. Anthropic's 5-minute default TTL is generous for steady traffic and tight for sporadic traffic. Each cache hit resets the timer, but if requests arrive once every 7 minutes, the entry has always expired by the time the next request arrives: every request pays the write cost and never reads. The 1-hour TTL option is worth the write premium for these patterns.
- Regional or model split. A workload that load-balances across regions or across model versions does not share caches across them. A single deployment per workload caches better than a multi-region round-robin.
- Tool definition churn. Editing a single tool description invalidates the entire tool-list cache, and everything cached after it in the prefix, for everyone using it. A versioned tool schema with rare changes caches; one edited weekly does not.
- Conversation history reorganisation. If the application periodically rewrites or compacts conversation history, the cache for that prefix is destroyed. Compact infrequently and at predictable boundaries.
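The TTL failure mode is easy to check against real arrival times before choosing a TTL. The sketch below is a simplified model, not provider behaviour verified against a live API: it assumes a single cached prefix, that every write or hit restarts the TTL window, and that a request arriving exactly at expiry misses. `simulateTtl` and the arrival times are hypothetical.

```typescript
// Count cache writes and reads for a series of request times (in minutes).
function simulateTtl(
  arrivalMinutes: number[],
  ttlMinutes: number,
): { writes: number; reads: number } {
  let writes = 0;
  let reads = 0;
  let expiresAt = -Infinity; // nothing cached before the first request
  for (const t of arrivalMinutes) {
    if (t < expiresAt) {
      reads++;  // entry still warm: cache read
    } else {
      writes++; // entry missing or expired: pay the write cost again
    }
    expiresAt = t + ttlMinutes; // both outcomes restart the window
  }
  return { writes, reads };
}

// One request every 7 minutes.
const everySevenMinutes = [0, 7, 14, 21, 28, 35];

const fiveMinuteTtl = simulateTtl(everySevenMinutes, 5);
const oneHourTtl = simulateTtl(everySevenMinutes, 60);

console.log(fiveMinuteTtl); // { writes: 6, reads: 0 }
console.log(oneHourTtl);    // { writes: 1, reads: 5 }
```

Feeding a day of production request timestamps through the same function gives a quick estimate of how many writes each TTL option would incur.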
A Worked Example
An agent with a 4,000-token system prompt and 2,000-token tool definitions, running 8 model calls per session, with average per-call user content of 500 tokens.
Without caching, input cost per session is roughly: 8 × (4,000 + 2,000 + accumulated context). The accumulated context grows; round it to an average of 2,000 tokens per call, giving 8 × 8,000 = 64,000 input tokens.
With caching on the system prompt and tools (6,000 tokens, written on the first call and read from cache from the second call onward, at Anthropic rates): 6,000 × 1.25 (write) + 7 × 6,000 × 0.10 (reads at 10% of input rate) + 8 × 2,000 (variable, full price) = 7,500 + 4,200 + 16,000. Total effective input cost equivalent: about 27,700 tokens. A 57% reduction in input-token billable equivalents.
The exact numbers vary with workload and provider, but the order of magnitude is consistent. Long stable prefix, many repeated calls, no variable content above the prefix: caching cuts input cost dramatically.
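The session arithmetic can be reproduced with a small cost function, which also makes it easy to substitute a different provider's rates. This is a sketch under the example's assumptions (one write on the first call, reads on every later call, a fixed 2,000 variable tokens per call); the names are hypothetical. Charging the first call at the plain input rate gives 26,200 token-equivalents, and applying the 1.25x write rate from the pricing section raises that to 27,700.

```typescript
interface SessionShape {
  calls: number;
  prefixTokens: number;          // stable, cacheable part
  variableTokensPerCall: number; // billed at full price on every call
}

// Input cost in token-equivalents with no caching at all.
function uncachedCost(s: SessionShape): number {
  return s.calls * (s.prefixTokens + s.variableTokensPerCall);
}

// Input cost in token-equivalents with the prefix written once and read afterwards.
function cachedCost(s: SessionShape, writeMultiplier: number, readMultiplier: number): number {
  const write = s.prefixTokens * writeMultiplier;
  const reads = (s.calls - 1) * s.prefixTokens * readMultiplier;
  const variable = s.calls * s.variableTokensPerCall;
  return write + reads + variable;
}

const session: SessionShape = { calls: 8, prefixTokens: 6000, variableTokensPerCall: 2000 };

const baseline = uncachedCost(session);               // 64,000
const plainWrite = cachedCost(session, 1.0, 0.10);    // about 26,200
const premiumWrite = cachedCost(session, 1.25, 0.10); // about 27,700

const reduction = (cost: number): string => ((1 - cost / baseline) * 100).toFixed(1);

console.log(baseline, Math.round(plainWrite), reduction(plainWrite));     // 64000 26200 "59.1"
console.log(baseline, Math.round(premiumWrite), reduction(premiumWrite)); // 64000 27700 "56.7"
```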
When Not to Bother
Three workloads where the caching investment does not pay back:
- Short, mostly-variable requests. A 300-token classification prompt has nothing meaningful to cache.
- Very low volume. A workload that runs ten times a day will pay the cache write cost each time and rarely read.
- Output-bound workloads. Long-form generation where the output dominates the cost (a 50-token prompt producing 5,000 tokens of output) is unaffected by input caching.
The 30-Minute Audit
A pragmatic order of investigation for an existing LLM workload:
- Pull the usage data for a week. Find the top three workloads by input-token volume.
- For each, identify the stable prefix length. If it is over 2,000 tokens, caching is worth a try.
- Verify the prefix order: stable first, variable last. Move anything out of order.
- Add explicit cache breakpoints (Anthropic) or verify automatic caching is engaged (OpenAI / Azure OpenAI).
- Log cache_read_input_tokens (Anthropic) or prompt_tokens_details.cached_tokens (OpenAI / Azure OpenAI) for a week. Plot hit rate by workload.
- Adjust breakpoint placement until hit rate stabilises above 60% on long-context workloads.
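The first and last steps of the audit reduce to a ranking and a threshold check over a week of logged metrics. The sketch below assumes per-workload weekly totals have already been aggregated; `WorkloadWeek`, `auditCandidates`, and every number in the sample data are hypothetical.

```typescript
interface WorkloadWeek {
  workload: string;
  totalInputTokens: number; // all input tokens, cached or not
  cacheReadTokens: number;  // input tokens served from cache
}

interface AuditRow {
  workload: string;
  hitRate: number;
  needsWork: boolean;
}

// Take the top workloads by input volume and flag those below the target hit rate.
function auditCandidates(rows: WorkloadWeek[], topN = 3, targetHitRate = 0.6): AuditRow[] {
  return [...rows]
    .sort((a, b) => b.totalInputTokens - a.totalInputTokens)
    .slice(0, topN)
    .map((r) => {
      const hitRate = r.cacheReadTokens / Math.max(1, r.totalInputTokens);
      return { workload: r.workload, hitRate, needsWork: hitRate < targetHitRate };
    });
}

// Made-up weekly totals for illustration.
const week: WorkloadWeek[] = [
  { workload: 'classifier', totalInputTokens: 400_000, cacheReadTokens: 0 },
  { workload: 'rag_answers', totalInputTokens: 8_000_000, cacheReadTokens: 2_000_000 },
  { workload: 'incident_triage_agent', totalInputTokens: 12_000_000, cacheReadTokens: 9_000_000 },
  { workload: 'eval_batch', totalInputTokens: 5_000_000, cacheReadTokens: 3_500_000 },
];

const audit = auditCandidates(week);
for (const row of audit) {
  console.log(row.workload, row.hitRate.toFixed(2), row.needsWork ? 'below target' : 'ok');
}
```

With this sample data, only rag_answers falls below the 60% target, so that is where breakpoint placement would be reviewed first.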
This is one of the rare optimisations where the code change is small and the bill change is large. The cost-conscious teams will already have done this audit. The teams that have not are leaving a quarter to a half of their LLM input bill on the table.