How does model distillation work?
Model distillation works by using a large, expensive model as a teacher and training a smaller model to imitate it. The process has three stages.
First, you collect a dataset of real inputs and capture the teacher model's responses to them — ideally from production traffic, so the examples reflect how the system is actually used. Second, you fine-tune a smaller student model on those input–output pairs, teaching it to reproduce the teacher's behaviour on your specific task. Third, you validate the student against the teacher with a scoring rubric and roll it out gradually, comparing quality at each step.
The key insight is that a general-purpose frontier model is far larger than any single task requires. By narrowing the target to one domain — say, a voice agent's conversational responses — a much smaller model can match the teacher's quality where it matters, while costing a fraction as much to run.
Distillation vs fine-tuning vs RAG
Distillation, fine-tuning, and RAG are often confused because all three customise an LLM. They solve different problems.
| Technique | What it changes | Best for |
|---|---|---|
| Distillation | Trains a smaller model to match a larger one | Cutting cost and latency at scale |
| Fine-tuning | Adjusts a model's weights on task data | Teaching a specific style or behaviour |
| RAG | Adds external knowledge at query time | Keeping answers current and grounded |
Distillation is fundamentally about efficiency — same task, smaller model. Fine-tuning is about behaviour — and is in fact the mechanism used to create the distilled student. RAG is about knowledge and is orthogonal: you can run RAG on a distilled model. Many production systems combine all three.
When is distillation worth it?
Distillation pays off when inference volume is high enough that the cost of running a frontier model dominates, and the task is narrow enough that a smaller model can cover it. A low-traffic feature rarely justifies the engineering effort. A system handling thousands of calls a day almost always does.
The economics are striking at scale. Prodinit distilled a GPT-4.1 teacher into a fine-tuned GPT-4o-mini student for a voice AI platform handling 10,000+ calls per day, cutting inference cost by 70% with no quality regression — running the student in a 90/10 hybrid with progressive A/B rollout and quality gates at every stage.