Small Models in Production: When Phi-4 and 8B Llama Win
Frontier models are the default, and defaults are how teams overpay on LLM bills. Three workloads where small models (Phi-4, Llama 3.x 8B, Mistral Small) win on cost-per-decision without a meaningful loss in quality, three workloads where they do not, and the two-tier production pattern that cost-conscious teams converge on after a quarter of evaluation work.
Prompt Caching in 2026: Anthropic, OpenAI, Azure Compared
Prompt caching is the highest-ROI cost lever on long-context LLM workloads in 2026. Anthropic, OpenAI, and Azure OpenAI all offer it, each with different pricing and breakpoint semantics. A worked comparison of the three providers: the placement patterns that actually hit cache, where the cache silently goes cold, and a 30-minute audit that pays for itself.
AI Agent Cost Economics: Why 100x and How to Cut It
Agent loops cost 10x to 50x what a chatbot interaction costs; multi-agent systems add another order of magnitude. The compounding is structural, not a bug, and so is the fix. A breakdown of where the tokens go and how to bring agent economics back from runaway to acceptable.
Cost-Optimizing Azure OpenAI: PTUs, Batch, Caching in 2026
A concrete playbook for reducing Azure OpenAI bills in 2026. Break-even math for Provisioned Throughput Units, prompt-cache economics, the Batch API 50 percent discount, Foundry IQ for retrieval, tiered model routing, and the telemetry that keeps the wins honest.
Microsoft Foundry: The AI Platform for the Agentic Era - Ignite 2025
From scientific research to enterprise AI transformation, discover how Microsoft Foundry unifies models from OpenAI, Anthropic, Cohere, Meta, and more into one secure platform. Learn intelligent model routing, cost optimization, and the game-changing Claude integration.
Fine-Tuning in Microsoft Foundry: Building Production-Ready AI Agents - Microsoft Ignite 2025
Microsoft Ignite BRK188: Fine-tuning in Microsoft Foundry transforms generic models into production-ready agents. Synthetic data generation, supervised + reinforcement fine-tuning, 40-90% cost reduction, 95%+ accuracy. Real-world results: 2M docs/day, $27M savings.
Running Open-Source AI Models at Scale: Azure Container Apps, AKS, and On-Premise Deployments - Microsoft Ignite 2025
Microsoft Ignite BRK117: Deploy open-source AI models (Llama 3.3, Mistral) with Azure Container Apps serverless GPUs, AKS with KAITO workflows, and on-premise infrastructure. Cost reduction 60-85%, data sovereignty, and hybrid architectures with Azure Arc.