How does fine-tuning work?
Fine-tuning takes a model that already understands language and trains it further on examples of the specific behaviour you want. The model has seen the world in pre-training; fine-tuning teaches it your task.
In practice you assemble a dataset of input–output pairs that demonstrate the target behaviour — questions and ideal answers, prompts and correctly formatted responses. You then run additional training so the model's weights shift toward producing those outputs. The result is a model that handles your task more reliably and consistently than the base model with prompting alone, because the behaviour is now baked into its weights rather than coaxed out at runtime.
Modern fine-tuning often uses efficient methods like LoRA, which adjust a small set of added parameters instead of the full model — cutting the compute and cost of customisation substantially.
Fine-tuning vs RAG vs prompting
These three are the main ways to customise an LLM, and they are not mutually exclusive.
| Method | Changes | Best for |
|---|---|---|
| Prompting | Nothing — instructions only | Quick iteration, general tasks |
| RAG | Knowledge available at query time | Current, factual, changing data |
| Fine-tuning | The model's weights | Consistent style, format, domain behaviour |
The rule of thumb: start with prompting, add RAG when you need external knowledge, and fine-tune when you need consistent behaviour that prompting can't reliably produce. A production system often uses all three — a fine-tuned model, grounded with RAG, steered by a good prompt.
When is fine-tuning worth it?
Fine-tuning is worth the effort when prompting has hit its ceiling: the model can do the task but not consistently, or the prompts have grown long and brittle. It's also the right tool when you need a smaller, cheaper model to match a larger one on a narrow task — the basis of model distillation.
Prodinit used this at scale for a high-volume voice AI platform, fine-tuning GPT-4o-mini on production examples so it could replace the much larger GPT-4.1 on the platform's conversational task — cutting inference cost 70% while holding quality steady across 10,000+ calls per day.