Fine-Tuning in Microsoft Foundry: Building Production-Ready AI Agents - Microsoft Ignite 2025
Baseline Performance Assessment (1-2 weeks)
- Identify use case requiring fine-tuning (tool calling, data extraction, workflow execution)
- Measure baseline with best-effort prompt engineering (accuracy, latency, cost); see the harness sketch after this list
- Define success criteria (target accuracy, latency, cost reduction)
- Estimate ROI (cost of fine-tuning vs. expected savings/value)
- Validate data availability (need 1,000+ high-quality examples)
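To make the baseline measurement concrete, here is a minimal harness sketch. It assumes an Azure OpenAI-compatible deployment reachable through your Foundry project; the deployment name, environment variables, system prompt, and the `is_correct` scorer are placeholders to adapt to your use case.

```python
# Minimal baseline-measurement sketch (deployment name, env vars, and scorer are placeholders).
import json
import os
import time

from openai import AzureOpenAI  # pip install openai

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

SYSTEM_PROMPT = "Extract the requested fields as JSON."  # best-effort prompt engineering
DEPLOYMENT = "gpt-4o"  # hypothetical baseline deployment name


def is_correct(predicted: str, expected: str) -> bool:
    """Placeholder scorer; replace with tool-call or field-level comparison."""
    return predicted.strip() == expected.strip()


def measure_baseline(sample_path: str) -> dict:
    correct, latencies, prompt_toks, completion_toks = 0, [], 0, 0
    examples = [json.loads(line) for line in open(sample_path, encoding="utf-8")]
    for ex in examples:
        start = time.perf_counter()
        resp = client.chat.completions.create(
            model=DEPLOYMENT,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": ex["input"]},
            ],
            temperature=0,
        )
        latencies.append(time.perf_counter() - start)
        prompt_toks += resp.usage.prompt_tokens
        completion_toks += resp.usage.completion_tokens
        correct += is_correct(resp.choices[0].message.content, ex["expected_output"])
    return {
        "accuracy": correct / len(examples),
        "p50_latency_s": sorted(latencies)[len(latencies) // 2],
        "total_prompt_tokens": prompt_toks,        # multiply by your price sheet for cost
        "total_completion_tokens": completion_toks,
    }
```

The numbers this returns become the baseline that the success criteria and ROI estimate below are measured against.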
Training Data Preparation (3-4 weeks)
- Collect real examples (historical data with known-good outputs)
- Annotate data with expert labels (correct tool calls, extracted fields, classifications)
- Use synthetic data generation to expand the dataset (10× multiplier)
- Split data: 80% training, 10% validation, 10% test
- Format as JSONL (input-output pairs); see the sketch after this list
- Quality assurance: review samples, ensure consistency
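A minimal sketch of the split and JSONL formatting step, assuming each collected example is a dict with `input` and `expected_output` fields (those names are placeholders). The chat-style `messages` layout is the format the fine-tuning APIs expect for chat models.

```python
# Sketch: 80/10/10 split and chat-format JSONL export.
# Assumes each raw example is {"input": ..., "expected_output": ...}; adapt field names.
import json
import random

SYSTEM_PROMPT = "Extract the requested fields as JSON."  # keep identical to the production prompt


def to_chat_record(example: dict) -> dict:
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["input"]},
            {"role": "assistant", "content": example["expected_output"]},
        ]
    }


def split_and_write(examples: list[dict], seed: int = 42) -> None:
    random.Random(seed).shuffle(examples)
    n = len(examples)
    splits = {
        "train.jsonl": examples[: int(0.8 * n)],
        "validation.jsonl": examples[int(0.8 * n) : int(0.9 * n)],
        "test.jsonl": examples[int(0.9 * n) :],
    }
    for path, rows in splits.items():
        with open(path, "w", encoding="utf-8") as f:
            for ex in rows:
                f.write(json.dumps(to_chat_record(ex), ensure_ascii=False) + "\n")
```

Fixing the shuffle seed keeps the test split stable across retraining runs, so later model versions are compared on the same held-out examples.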
Model Selection and Training (2-3 weeks)
- Choose base model (GPT-4o for accuracy, GPT-4o-mini for cost, Llama-3 for control)
- Run fine-tuning in Foundry (developer tier for experimentation); see the job-submission sketch after this list
- Hyperparameter tuning (learning rate, epochs, batch size)
- Monitor training metrics (loss curves, validation accuracy)
- Test multiple model versions (compare accuracy vs. cost trade-offs)
- Select best performer for production
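A sketch of submitting the job through the Azure OpenAI-compatible API surface that Foundry exposes. The base-model snapshot name, hyperparameter values, and API version are illustrative starting points, not recommendations; check what your Foundry project actually supports.

```python
# Sketch: upload training data and launch a fine-tuning job.
# Model snapshot, hyperparameters, and api_version are assumptions; verify in your project.
import os
import time

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-10-21",
)

train_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
val_file = client.files.create(file=open("validation.jsonl", "rb"), purpose="fine-tune")

job = client.fine_tuning.jobs.create(
    model="gpt-4o-mini-2024-07-18",   # cost-oriented base model; snapshot name is illustrative
    training_file=train_file.id,
    validation_file=val_file.id,
    hyperparameters={                 # starting points to sweep, not recommendations
        "n_epochs": 3,
        "batch_size": 8,
        "learning_rate_multiplier": 1.0,
    },
)

# Poll until the job finishes, then record the resulting model name for deployment.
while True:
    job = client.fine_tuning.jobs.retrieve(job.id)
    if job.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(60)
print(job.status, job.fine_tuned_model)
```

Running this loop over a small grid of hyperparameters gives the multiple model versions the list above says to compare on accuracy vs. cost.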
Validation and Testing (2-3 weeks)
- Test on held-out test set (measure accuracy, latency, cost)
- Compare to baseline (is the fine-tuned model significantly better? see the sketch after this list)
- Edge case testing (adversarial inputs, unusual formats, error conditions)
- User acceptance testing (domain experts validate quality)
- Performance benchmarking (throughput, concurrency, scaling behavior)
- Document evaluation results and model limitations
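One way to answer "significantly better" is a paired bootstrap over per-example correctness on the held-out test set. This sketch assumes you have already run both models and recorded a 1/0 correctness flag per test example; the promotion thresholds at the end are examples, not prescriptions.

```python
# Sketch: paired bootstrap test of whether the fine-tuned model beats the baseline
# on the held-out test set. Assumes per-example correctness flags are already recorded.
import random


def bootstrap_accuracy_gain(baseline_correct: list[int],
                            finetuned_correct: list[int],
                            iterations: int = 10_000,
                            seed: int = 0) -> tuple[float, float]:
    """Return (mean accuracy gain, fraction of resamples where the gain is <= 0)."""
    assert len(baseline_correct) == len(finetuned_correct)
    rng = random.Random(seed)
    n = len(baseline_correct)
    deltas = []
    for _ in range(iterations):
        idx = [rng.randrange(n) for _ in range(n)]  # resample test examples with replacement
        gain = sum(finetuned_correct[i] - baseline_correct[i] for i in idx) / n
        deltas.append(gain)
    mean_gain = sum(deltas) / iterations
    p_not_better = sum(d <= 0 for d in deltas) / iterations
    return mean_gain, p_not_better


# Example decision rule (thresholds are placeholders):
# gain, p = bootstrap_accuracy_gain(baseline_flags, finetuned_flags)
# promote = gain > 0.02 and p < 0.05
```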
Production Deployment (2-3 weeks)
- Deploy fine-tuned model to Foundry inference endpoint
- Canary rollout (5% → 25% → 100% of traffic); see the routing sketch after this list
- Monitor production metrics (accuracy, latency, error rates)
- Set up alerting for degradation (accuracy drops, latency spikes)
- Implement fallback to baseline model if issues detected
- Track business metrics (cost savings, throughput, user satisfaction)
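The canary-plus-fallback logic can be sketched as a thin routing layer in front of the two deployments. Deployment names, the traffic fraction, and the failure handling here are placeholders; in practice the percentage would come from configuration, and per-request routing decisions and errors would feed the monitoring and alerting above.

```python
# Sketch: canary routing with automatic fallback to the baseline deployment.
# Deployment names and thresholds are placeholders; drive CANARY_FRACTION from config.
import os
import random

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

BASELINE_DEPLOYMENT = "gpt-4o"              # prompt-engineered baseline
FINETUNED_DEPLOYMENT = "gpt-4o-mini-ft-v1"  # hypothetical fine-tuned deployment name
CANARY_FRACTION = 0.05                      # raise to 0.25, then 1.0 as metrics hold


def complete(messages: list[dict]) -> tuple[str, str]:
    """Route a request to the canary or the baseline, falling back on failure."""
    use_canary = random.random() < CANARY_FRACTION
    deployment = FINETUNED_DEPLOYMENT if use_canary else BASELINE_DEPLOYMENT
    try:
        resp = client.chat.completions.create(
            model=deployment, messages=messages, temperature=0, timeout=30
        )
        return deployment, resp.choices[0].message.content
    except Exception:
        if deployment == FINETUNED_DEPLOYMENT:
            # Fall back to the proven baseline rather than failing the request.
            resp = client.chat.completions.create(
                model=BASELINE_DEPLOYMENT, messages=messages, temperature=0, timeout=30
            )
            return BASELINE_DEPLOYMENT, resp.choices[0].message.content
        raise
```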
Continuous Improvement (Ongoing)
- Collect production data (new examples, including failure cases to learn from); see the sketch after this list
- Periodic retraining (monthly or quarterly with updated data)
- A/B testing (compare new model versions vs. current production)
- Explore reinforcement fine-tuning (if complex reasoning is needed)
- Model distillation (once the large model is proven, distill to a smaller one for cost)
- Measure ROI continuously (track savings vs. training investment)
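To close the loop, a sketch of turning flagged production failures into a retraining queue. The log schema used here (`input`, `flagged`, `corrected_output`) is an assumption about what your review workflow captures, not a Foundry convention.

```python
# Sketch: harvest flagged production failures into a retraining file.
# The log schema (input / flagged / corrected_output) is an assumed convention
# for whatever review workflow captures expert corrections.
import json

SYSTEM_PROMPT = "Extract the requested fields as JSON."


def build_retraining_set(production_log: str, out_path: str) -> int:
    added = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for line in open(production_log, encoding="utf-8"):
            record = json.loads(line)
            # Keep only reviewed failures where an expert supplied the correct answer.
            if not record.get("flagged") or "corrected_output" not in record:
                continue
            out.write(json.dumps({
                "messages": [
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": record["input"]},
                    {"role": "assistant", "content": record["corrected_output"]},
                ]
            }, ensure_ascii=False) + "\n")
            added += 1
    return added
```

Merging these examples into the training set before each monthly or quarterly run, and re-running the same split and evaluation pipeline, keeps the retrained model comparable to the version currently in production.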