Small Language Models On-Prem: The Phi-4 and Llama 3.3 ROI Math
Small language models got good enough in 2025 that running them on-prem is now a defensible architectural choice, not a hobbyist project. Phi-4 at the Microsoft end, Llama 3.3 at the Meta end, and a handful of strong fine-tunes give you capable assistants you can run behind a firewall, inside Azure Sweden Central, or on your own hardware. But the cost math is counterintuitive. On-prem is cheaper than hosted inference in narrower circumstances than most architecture decks admit.
The Models Worth Considering in 2026
- Phi-4 (14B). Microsoft's small model with strong reasoning on technical tasks. Fits in 28 GB of VRAM at bf16, or ~12 GB quantised.
- Llama 3.3 70B. Meta's flagship open-weights mid-size model. Flexible fine-tuning, but needs 140 GB of VRAM at bf16, or ~48 GB quantised with some quality loss.
- Llama 3.2 3B / 1B. Edge-class, usable for classification and lightweight agents.
- Specialist fine-tunes. Many domain-specific fine-tunes (medical, legal, code) built on these backbones deliver frontier-grade quality on narrow tasks.
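The VRAM figures above follow a simple rule of thumb: weights alone need parameter count times bytes per parameter, with KV cache and runtime overhead on top. A quick weights-only calculator (the figures in the list include serving overhead, so expect the real footprint to be higher than this):

```shell
# Rule-of-thumb GPU memory for model weights ONLY. Excludes KV cache
# and activation overhead, which real serving footprints include.
vram_gb() { awk -v p="$1" -v b="$2" 'BEGIN { printf "%.0f GB\n", p * b }'; }

vram_gb 14 2    # Phi-4, bf16 (2 bytes/param): 28 GB
vram_gb 70 2    # Llama 3.3 70B, bf16: 140 GB
vram_gb 70 0.5  # Llama 3.3 70B, 4-bit (0.5 bytes/param): 35 GB of weights
```

The 4-bit figure explains why ~48 GB is quoted for quantised 70B serving: roughly 35 GB of weights plus cache and runtime overhead.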
When On-Prem Wins
- Data cannot leave a tenancy. Regulated workloads (Finansinspektionen-supervised, healthcare, defence) where third-party inference is off the table. The ROI question changes from "cheaper?" to "allowed at all?"
- High sustained throughput on a narrow task. If a single workload churns tens of millions of tokens per day with predictable prompts, running a tuned model on dedicated GPUs can beat per-token hosted pricing.
- Latency-critical on-device. Sub-100ms inference is realistic with small models on a local GPU; hosted inference over the public internet cannot reliably match it.
- Cost predictability. Fixed GPU cost per month versus variable token bills. Useful for budget certainty even when average cost is higher.
When On-Prem Loses
- Bursty or low-volume traffic. GPUs bill continuously; your workload does not.
- Need for frontier reasoning quality. Phi-4 and Llama 3.3 close much of the gap on narrow tasks, but frontier models still lead on complex reasoning and long-context synthesis.
- You do not have the operational capacity. Running GPUs, managing drivers, monitoring inference servers, staying current on model updates. This is a role.
The Cost Math, Honestly
A rough 2026 Azure reservation for an NC24ads A100 v4 (one A100 80GB) sits around $2.30/hour on a 1-year reservation, roughly $1,650/month fixed. An H100 instance is roughly 2–3x that. At 200 tokens/second sustained output on a quantised 70B model (reasonable for a single A100 with vLLM), you produce ~17 million output tokens per day if the GPU runs continuously. Compared to hosted inference at prevailing 2026 rates, on-prem wins the math only if you genuinely utilise the GPU at a high duty cycle, typically 40%+ sustained. Below that, hosted wins.
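The break-even logic can be made concrete. A sketch that computes the effective on-prem cost per million output tokens at a given duty cycle, using the figures above ($1,650/month fixed, 200 tok/s sustained); compare the result against your hosted provider's per-token rate, which is your number to supply:

```shell
# Effective on-prem cost per million output tokens at a given duty cycle.
# Fixed inputs match the article: $1,650/month, 200 tok/s sustained output.
cost_per_mtok() {
  awk -v duty="$1" 'BEGIN {
    monthly = 1650                         # USD, 1-yr A100 reservation
    mtok = 200 * 86400 * 30 / 1e6 * duty   # million output tokens/month
    printf "duty %.0f%%: $%.2f per M output tokens\n", duty * 100, monthly / mtok
  }'
}

cost_per_mtok 1.0   # duty 100%: $3.18 per M output tokens
cost_per_mtok 0.4   # duty 40%:  $7.96 per M output tokens
cost_per_mtok 0.1   # duty 10%:  $31.83 per M output tokens
```

The shape of the curve is the point: at low duty cycles the fixed GPU cost dominates and the per-token price balloons, which is why bursty workloads belong on hosted inference.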
Deployment: Ollama or vLLM
- Ollama. Easy setup, great for development and internal tools. Not optimised for high-throughput production serving.
- vLLM. Production-grade serving with paged attention and continuous batching. Higher operational complexity, significantly better throughput.
- TensorRT-LLM. The fastest path on NVIDIA hardware for teams willing to invest in the build pipeline.
# vLLM serving Llama 3.3 70B quantised on an A100
# Note: --quantization awq expects an AWQ-quantised checkpoint;
# point --model at an AWQ repo rather than the bf16 weights shown here.
docker run --rm --gpus all -p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.3-70B-Instruct \
--quantization awq \
--tensor-parallel-size 1 \
--max-model-len 8192 \
--served-model-name llama-3.3-70b
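For comparison, Ollama's "easy setup" really is a couple of commands. A quick-start sketch for local development (the model tag and prompt are illustrative; check the Ollama model library for current tags):

```shell
# Pull and run Phi-4 locally; Ollama serves a quantised build by default
ollama pull phi4
ollama run phi4 "Summarise this incident report in three bullet points."

# Ollama also exposes a local REST API on port 11434
curl http://localhost:11434/api/generate \
  -d '{"model": "phi4", "prompt": "hello", "stream": false}'
```

This is the development-loop counterpart to the vLLM command above, not a production serving setup.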
The Azure Sweden Central Angle
For Swedish data-residency workloads, Sweden Central offers GPU VMs in the A100 and H100 families, and it is the realistic home for a Swedish "on-prem" SLM deployment: data stays in-region under Azure's compliance boundary, without the operational overhead of physical hardware in your own datacentre. The main caveat is GPU capacity variability; reserved instances are the way to secure availability.
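A hedged sketch of provisioning such a VM with the Azure CLI (resource group, VM name, and admin username are placeholders; quota for the NC A100 v4 family must be approved in the subscription first):

```shell
# Single-GPU A100 80GB VM in Sweden Central (names are placeholders)
az vm create \
  --resource-group slm-rg \
  --name slm-a100 \
  --location swedencentral \
  --size Standard_NC24ads_A100_v4 \
  --image Ubuntu2204 \
  --admin-username azureuser \
  --generate-ssh-keys
```

Pair this with a 1-year reservation on the same size to get the fixed monthly cost the ROI math above assumes.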
Quality Calibration
Before committing to on-prem, run your evaluation suite against both the small model and a frontier hosted alternative. Phi-4 and Llama 3.3 are genuinely good on many tasks and mediocre on others. The wrong answer is assuming a 70B model is universally competitive; the right answer is knowing task-by-task where it is.
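A minimal sketch of what task-by-task calibration looks like mechanically, assuming you have saved each model's answers alongside an expected keyword per task. File names, format, and the keyword-match scoring are all placeholder simplifications; a real suite would use graded evals per task:

```shell
# Placeholder eval fixtures: task,expected-keyword and task,model-answer
cat > expected.csv <<'EOF'
invoice_extraction,total_amount
contract_clause,termination
EOF
cat > small_model.csv <<'EOF'
invoice_extraction,The total_amount field is 4200 SEK
contract_clause,This clause covers renewal terms only
EOF

# Crude keyword-match scorer: counts tasks whose answer contains the keyword
score() {
  hits=0; total=0
  while IFS=, read -r task keyword; do
    total=$((total + 1))
    answer=$(grep "^$task," "$1" | cut -d, -f2-)
    case "$answer" in *"$keyword"*) hits=$((hits + 1)) ;; esac
  done < expected.csv
  echo "$1 score: $hits/$total"
}

score small_model.csv
```

Run the same scorer against answers from a frontier hosted model and compare per task; the per-task breakdown, not the aggregate, is what tells you where the small model is safe to deploy.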
The Short Decision Rule
- If data residency or tenancy rules forbid hosted inference. On-prem wins by necessity.
- If you have sustained, predictable volume on a narrow task. On-prem wins on cost.
- Otherwise, hosted inference is cheaper, more capable, and does not require operating a GPU fleet.