Cost-Optimizing Azure OpenAI: PTUs, Batch, Caching in 2026
Azure OpenAI bills scaled faster than revenue for most teams that shipped real LLM features in 2025. The levers to flatten that curve are better understood in 2026, but only a minority of teams use more than one or two of them. This walk-through covers the six that matter in combination: Provisioned Throughput Units, prompt caching, the Batch API, Foundry IQ, tiered model routing, and token-level telemetry. None of them is exotic. Most of them are underused.
The Three Pricing Shapes
Azure OpenAI sells inference three ways. Each has its own unit economics, its own commit shape, and its own operational profile.
- Pay-as-you-go. Per-token billing at list price. No commitment, linear cost, highest per-token rate. The right default for unpredictable or bursty traffic, and the only realistic choice for small workloads.
- Provisioned Throughput Units (PTUs). Reserved capacity measured in PTUs, each worth a defined tokens-per-second slice of a specific model. Fixed monthly cost, no per-token billing, hard cap on concurrency. Wins at sustained high-throughput workloads.
- Batch API. Asynchronous submissions processed within a 24-hour window at 50 percent of the pay-as-you-go rate. Won't help interactive features; transforms the economics of bulk workloads (document processing, offline enrichment, evals).
PTUs: The Math That Actually Decides
PTUs are confusing on first reading because the unit has no direct meaning. A PTU is a slice of reserved capacity; how many tokens-per-second you actually get depends on the model, the prompt length, the output length, and whether you're measuring peak or sustained throughput. Microsoft publishes model-specific conversion tables (tokens-per-minute per PTU) but the practical break-even is easier to reason about using a simpler calculation.
Treat PTUs as a fixed monthly cost and compare that against what you would pay under pay-as-you-go for the same token volume. The break-even point is the utilisation rate at which the two lines cross. In practice, for most 2026 frontier models, PTUs become cheaper than pay-as-you-go somewhere between 40 and 60 percent sustained utilisation.
// Rough PTU break-even calculator. Replace constants with your deal pricing.
type Calc = {
  ptuCount: number;
  ptuMonthlyCostUsd: number;    // reservation cost
  maxTokensPerMinute: number;   // per Microsoft's per-model table
  paygInputPer1M: number;       // $/1M input tokens
  paygOutputPer1M: number;      // $/1M output tokens
  avgInputTokensPerCall: number;
  avgOutputTokensPerCall: number;
};

export function breakEvenUtilisation(c: Calc): number {
  const minutesPerMonth = 60 * 24 * 30;
  const tokensPerCall = c.avgInputTokensPerCall + c.avgOutputTokensPerCall;
  const maxCallsPerMonth = (c.maxTokensPerMinute * minutesPerMonth) / tokensPerCall;
  const paygCostPerCall =
    (c.avgInputTokensPerCall * c.paygInputPer1M) / 1_000_000 +
    (c.avgOutputTokensPerCall * c.paygOutputPer1M) / 1_000_000;
  const paygCostAtFullUse = maxCallsPerMonth * paygCostPerCall;
  return c.ptuMonthlyCostUsd / paygCostAtFullUse; // 0..1: fraction of capacity needed to break even
}
Three operational realities to check before committing to PTUs. First, the minimum deployment for production PTU is usually small enough to be accessible (check the current minimum for the model you want), but still a real monthly line item. Second, PTU capacity is region- and model-specific; switching models or regions mid-reservation is not trivial. Third, PTUs do not provide burst capacity above their rated throughput; requests queue or reject once saturated. A hybrid model where steady-state runs on PTUs and burst overflow goes to pay-as-you-go is the common production shape.
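The hybrid shape is mostly a thin routing wrapper: send each request to the PTU deployment first, and spill to a pay-as-you-go deployment when the PTU capacity throttles. A minimal sketch under stated assumptions; the `ThrottledError` type stands in for however your client surfaces a 429, and in production both callables would be `client.chat.completions.create` against two deployments of the same model.

```typescript
// Spillover: try the PTU deployment first; on throttling, retry on pay-as-you-go.
// ThrottledError is a stand-in for your HTTP client's 429 error type.
type ChatCall<T> = () => Promise<T>;

class ThrottledError extends Error {
  constructor(public status: number) {
    super(`HTTP ${status}`);
  }
}

async function withSpillover<T>(ptu: ChatCall<T>, payg: ChatCall<T>): Promise<T> {
  try {
    return await ptu();
  } catch (err) {
    // 429 means the PTU deployment is saturated; anything else is a real failure.
    if (err instanceof ThrottledError && err.status === 429) {
      return await payg();
    }
    throw err;
  }
}
```

The key design point is that the fallback path is only taken on throttling, so the PTU reservation stays fully utilised before any pay-as-you-go tokens are billed.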
Prompt Caching: The Highest-ROI Change Most Teams Have Not Made
Azure OpenAI automatically caches prefixes above a model-specific minimum length. Cache hits are billed at a significantly reduced rate (check current Microsoft documentation for the exact discount by model tier, typically in the 50 percent range for input tokens). The implementation work on the client side is zero; the architectural work is making sure your prompts are cache-friendly.
- Keep the stable parts stable. System prompts, tool definitions, and long reference documents must appear in the same position and byte-identical on every request for caching to hit. Move timestamps, session IDs, and per-user values to the end of the prompt, after the cacheable prefix.
- Sort tool definitions deterministically. Tools loaded from a Map or an unordered source will serialise in different orders; every order change is a cache miss.
- Track the cache hit rate. The API returns cache usage telemetry in the response. Track it per feature and per user cohort. Target hit rates above 70 percent on stable-prompt workloads.
// Azure OpenAI returns cache telemetry in the usage object
const response = await client.chat.completions.create({
  model: 'gpt-5',
  messages: [
    { role: 'system', content: LONG_STABLE_SYSTEM_PROMPT }, // cacheable
    { role: 'system', content: REFERENCE_DOCS },            // cacheable
    ...conversationHistory,                                 // cached up to penultimate turn
    { role: 'user', content: newUserMessage },              // volatile suffix
  ],
});

// Log cache telemetry as OpenTelemetry span attributes
span.setAttributes({
  'gen_ai.usage.input_tokens': response.usage.prompt_tokens,
  'gen_ai.usage.output_tokens': response.usage.completion_tokens,
  'gen_ai.usage.cached_tokens': response.usage.prompt_tokens_details?.cached_tokens ?? 0,
});
Batch API: Fifty Percent Off, With Patience
The Batch API accepts a JSONL file of requests and processes them asynchronously within a 24-hour service-level window. Real-world completion is usually much faster, but 24 hours is the commitment. The pricing discount is 50 percent against pay-as-you-go for the same model. For any workload where a multi-hour turnaround is acceptable, it is the cheapest line in the spreadsheet.
Scenarios where Batch carries real weight in 2026:
- Overnight ingestion of documents into a RAG store, where embeddings and summaries can be regenerated in bulk.
- Evaluation runs against a standing test suite (DeepEval, Promptfoo) where batching cuts eval cost in half.
- Bulk classification and tagging jobs, such as ticket triage or product categorisation.
- Backfilling historical data when introducing a new model or prompt version.
// Submit a batch job — JSONL file, one request per line
// Each line: { "custom_id": "...", "method": "POST", "url": "/v1/chat/completions", "body": { ... } }
const batchFile = await client.files.create({
  file: fs.createReadStream('batch-input.jsonl'),
  purpose: 'batch',
});

const batch = await client.batches.create({
  input_file_id: batchFile.id,
  endpoint: '/v1/chat/completions',
  completion_window: '24h',
});

// Poll until completed; download the output JSONL
// Results include the custom_id on each line, correlating back to your input
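The poll-and-download step left as comments above can be sketched with the SDK's `batches.retrieve` and `files.content` calls. A sketch, not production retry logic: the 60-second interval and the terminal-status list are assumptions, and the structural `BatchClient` type stands in for the real SDK client so the logic is self-contained.

```typescript
// Minimal structural type for the pieces of the SDK client this sketch uses.
type BatchClient = {
  batches: { retrieve(id: string): Promise<{ status: string; output_file_id?: string }> };
  files: { content(id: string): Promise<{ text(): Promise<string> }> };
};

// Terminal states and the 60s poll interval are illustrative choices.
const TERMINAL = new Set(['completed', 'failed', 'expired', 'cancelled']);

async function waitForBatch(client: BatchClient, batchId: string) {
  let b = await client.batches.retrieve(batchId);
  while (!TERMINAL.has(b.status)) {
    await new Promise((r) => setTimeout(r, 60_000));
    b = await client.batches.retrieve(batchId);
  }
  if (b.status !== 'completed' || !b.output_file_id) {
    throw new Error(`batch ${batchId} ended in state ${b.status}`);
  }
  const content = await client.files.content(b.output_file_id);
  // One JSON object per line, each carrying the custom_id from the input file.
  return (await content.text()).trim().split('\n').map((line) => JSON.parse(line));
}
```

Correlating results back through `custom_id` is what makes batch output safe to merge into your store out of order.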
Foundry IQ: The Knowledge Layer That Keeps RAG Cheaper
Foundry IQ is the knowledge layer Microsoft positions underneath agent and RAG workloads. It aggregates retrieval targets (documents, databases, enterprise SaaS sources) behind a unified interface, handling access control and query planning so your agents and chat applications do not each reimplement retrieval glue. Its cost angle is twofold. First, it consolidates retrieval infrastructure that would otherwise be duplicated across features. Second, the agentic-retrieval model it enables lets the LLM do a smaller amount of smarter retrieval rather than indiscriminately stuffing context.
The practical cost question for a mid-sized B2B company is whether Foundry IQ replaces a tangle of Azure AI Search (formerly Cognitive Search) indexes and per-team retrieval services, or duplicates them. The right answer is to consolidate: treat Foundry IQ as the retrieval plane, and decommission bespoke per-feature retrieval code.
Tiered Model Routing
Most user queries do not require a frontier model. A classifier in front of the LLM layer routes simple queries to a cheaper model, harder queries to the expensive one, and reasoning-heavy queries to the thinking-capable tier. The savings are substantial: a tiered architecture where 70 percent of traffic hits a small model, 25 percent hits a medium model, and 5 percent hits a frontier model typically spends 20 to 30 percent of what a flat frontier-model deployment would cost.
// Routing classifier — small model decides which tier handles the user query
const TIER = { SMALL: 'gpt-5-mini', MED: 'gpt-5', LARGE: 'o-series' };

async function routeQuery(q: string): Promise<keyof typeof TIER> {
  const cls = await client.chat.completions.create({
    model: 'gpt-5-mini',
    messages: [
      { role: 'system', content: ROUTING_RUBRIC }, // stable, cached
      { role: 'user', content: q },
    ],
    max_tokens: 20, // enough for {"tier":"SMALL"}; a budget of 5 would truncate the JSON
    response_format: {
      type: 'json_schema',
      json_schema: {
        name: 'tier_choice',
        strict: true,
        schema: {
          type: 'object',
          properties: { tier: { type: 'string', enum: ['SMALL', 'MED', 'LARGE'] } },
          required: ['tier'],
          additionalProperties: false,
        },
      },
    },
  });
  return JSON.parse(cls.choices[0].message.content!).tier as keyof typeof TIER;
}

async function answer(q: string) {
  const tier = await routeQuery(q);
  return client.chat.completions.create({
    model: TIER[tier],
    messages: [{ role: 'user', content: q }],
  });
}
Two guardrails make routing safe. First, evaluate accuracy on the routing classifier itself against a held-out set of queries with known correct tiers; misrouting to a too-small model produces a quality regression that can be worse than the cost it saved. Second, allow the small-model tier to escalate. If the small model returns a low-confidence answer or explicit refusal, re-run on the medium tier automatically. The escalation cost is minor, and it catches the misrouted queries before the user does.
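The escalation guardrail can be sketched as a wrapper around the small-model call. The refusal heuristic below is a placeholder assumption; real deployments would use a logprob threshold or an explicit self-rating field in the output schema instead of pattern-matching the answer text.

```typescript
// Escalation: run the cheap tier first, re-run on the medium tier when the
// answer looks low-confidence. The heuristic here is a deliberate placeholder.
type Answerer = (q: string) => Promise<string>;

const looksLowConfidence = (answer: string): boolean =>
  answer.trim().length === 0 ||
  /\b(i can't|i cannot|i'm not sure|unable to)\b/i.test(answer);

async function answerWithEscalation(
  q: string,
  small: Answerer,
  medium: Answerer,
  shouldEscalate: (a: string) => boolean = looksLowConfidence,
): Promise<{ answer: string; escalated: boolean }> {
  const first = await small(q);
  if (!shouldEscalate(first)) return { answer: first, escalated: false };
  // Misrouted query: pay once more on the medium tier rather than ship a bad answer.
  return { answer: await medium(q), escalated: true };
}
```

Tracking the `escalated` flag per feature also gives you a direct measure of router misroutes, which feeds back into the held-out evaluation set.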
Request-Level Optimisations
- Trim the prompt. Long few-shot examples and verbose system prompts make every request more expensive. Keep only what measurably improves quality on an eval.
- Use structured outputs when applicable. Structured JSON responses avoid the retry loops that occur when a model returns malformed output and the client asks it again.
- Cap max_tokens aggressively. Set the output budget to the shortest realistic answer; long generations can silently multiply cost for negligible quality gain.
- Stream for UI, buffer for pipelines. Streaming has no direct cost effect but improves perceived performance; buffered consumption simplifies logging and cost attribution.
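Output caps are easiest to keep deliberate when they live in one place instead of scattered per call site. A small sketch; the feature names and token budgets are illustrative assumptions, not recommendations.

```typescript
// Centralise per-feature output budgets so every max_tokens value is a
// deliberate choice. Feature names and budgets here are illustrative.
const OUTPUT_BUDGET: Record<string, number> = {
  'ticket-triage': 64,    // one label plus a short rationale
  'doc-summary': 256,     // a few paragraphs at most
  'help-assistant': 512,  // full conversational answer
};

function maxTokensFor(feature: string, fallback = 256): number {
  return OUTPUT_BUDGET[feature] ?? fallback;
}
```

A call site then reads `max_tokens: maxTokensFor('ticket-triage')`, and tightening a budget is a one-line change that can be evaluated against the eval suite.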
Observability: You Cannot Optimise What You Cannot See
Every optimisation described here depends on per-call telemetry. Instrument every LLM call with OpenTelemetry GenAI semantic attributes, export to Application Insights, and build three dashboards that every engineering manager can open on demand.
// KQL — monthly spend attribution by tenant and feature
// The three price tables must be defined first; values here are placeholders,
// not list prices — substitute your contract rates in $/1M tokens.
let input_price_per_mtok  = dynamic({ "gpt-5": 0.0, "gpt-5-mini": 0.0 });
let cached_price_per_mtok = dynamic({ "gpt-5": 0.0, "gpt-5-mini": 0.0 });
let output_price_per_mtok = dynamic({ "gpt-5": 0.0, "gpt-5-mini": 0.0 });
dependencies
| where timestamp > startofmonth(now()) and name == "llm.chat"
| extend
    tenant  = tostring(customDimensions["tenant.id"]),
    feature = tostring(customDimensions["app.feature"]),
    model   = tostring(customDimensions["gen_ai.response.model"]),
    inTok   = tolong(customDimensions["gen_ai.usage.input_tokens"]),
    outTok  = tolong(customDimensions["gen_ai.usage.output_tokens"]),
    cached  = tolong(customDimensions["gen_ai.usage.cached_tokens"])
| extend estUsd = (
    (inTok - cached) * todouble(input_price_per_mtok[model]) +
    cached * todouble(cached_price_per_mtok[model]) +
    outTok * todouble(output_price_per_mtok[model])
  ) / 1e6
| summarize spend = sum(estUsd), calls = count() by tenant, feature, model
| order by spend desc
- Spend by tenant and feature. The monthly breakdown answers "which customer is driving growth" and "which feature needs attention first."
- Cache hit rate over time. A dashboard that lights up when caching regresses — after a deploy, a prompt change, or a client refactor.
- Tier distribution. What percent of traffic landed on each model tier this week; drift here is usually the early signal for a quality or routing issue.
An Example Architecture With Numbers
A B2B SaaS with roughly 10,000 daily active users and an LLM-backed help assistant, without any of the optimisations above, might run a bill in the range of 15,000 to 30,000 USD per month on pay-as-you-go alone. The same workload after five changes — caching system prompts, routing 70 percent of traffic to a small model with escalation, batching the nightly document-ingest pipeline, moving the sustained-steady-state component to PTUs, and trimming system prompts by 40 percent — typically lands in the 5,000 to 10,000 USD range. The numbers vary with traffic shape and model mix, but the ratio is consistent: a 2x to 3x reduction is realistic within two engineering sprints of focused work.
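The tiered-routing share of that reduction can be sanity-checked with quick arithmetic. The relative per-token prices below are illustrative assumptions, not Azure list prices; only the 70/25/5 traffic mix comes from the routing section above.

```typescript
// Back-of-envelope check on the tiered-routing claim: with illustrative
// relative per-token prices (frontier = 1.0) and a 70/25/5 traffic mix,
// what fraction of a flat frontier-only deployment does the tiered mix cost?
const REL_PRICE = { small: 0.1, medium: 0.3, frontier: 1.0 }; // assumptions
const MIX = { small: 0.7, medium: 0.25, frontier: 0.05 };

const tieredFraction =
  MIX.small * REL_PRICE.small +
  MIX.medium * REL_PRICE.medium +
  MIX.frontier * REL_PRICE.frontier;

console.log(tieredFraction.toFixed(3)); // 0.195: about 20% of flat-frontier spend
```

Under these assumed prices the tiered mix lands at roughly 20 percent of flat-frontier cost, consistent with the 20 to 30 percent range quoted earlier; caching and batch discounts then apply on top of whatever mix your traffic actually produces.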
Priority Order for a Team Starting From Zero
- Instrument GenAI telemetry and build the three dashboards. Without them the rest is guesswork.
- Restructure prompts so stable parts come first. Measure cache hit rate as it climbs.
- Cap
max_tokensand trim system prompts against an evaluation suite. - Route any bulk or offline workload through the Batch API.
- Introduce a routing classifier for tiered model selection, with escalation.
- Evaluate PTU reservations once the sustained load pattern is stable and measured.
- Consolidate retrieval on Foundry IQ instead of per-feature search infrastructure.
The order matters. PTU reservations made without cache hit rate data will be mis-sized; tiered routing introduced without telemetry will silently degrade quality. Teams that work through this list in order compound the savings. Teams that skip ahead to PTUs because they sound like the big lever often end up with a reservation they cannot fully use and a bill that did not move.