Model Finetuning & Optimization

We improve model performance for your specific use case through fine-tuning, prompt engineering, RAG optimization, and evaluation frameworks.

Fine-TuningOpenAILoRAPEFTEvaluationDataset Curation

General-purpose models give general-purpose results. When your use case demands consistent, domain-specific accuracy — legal document analysis, clinical note extraction, sales persona simulation, technical support classification — a fine-tuned model will outperform a prompted base model and cost less to run.

We design and run complete fine-tuning pipelines: from dataset audit and curation through training, evaluation, and production deployment. We also build the evaluation infrastructure that tells you whether the model is actually better — before it goes live.

What We Deliver

Dataset Curation and Annotation Pipelines

Fine-tuning quality is constrained by data quality. We start with a systematic audit of your existing data, identify gaps, design annotation guidelines, and build tooling to scale high-quality labelling. For domain-specific tasks where ground truth is expensive, we design active learning workflows that prioritise the highest-value examples to annotate.

Typical dataset sizes for effective fine-tuning: 500–2,000 high-quality examples for task-specific adaptation; 5,000–50,000 for broader behavioural alignment.

Fine-Tuning Pipelines

End-to-end training pipelines for:

  • OpenAI fine-tuning (GPT-4o, GPT-4o-mini) — best for teams already using the OpenAI API who want consistent output format, improved task accuracy, or reduced prompt length
  • LoRA and QLoRA for open-source models (Llama 3, Mistral, Phi-3, Qwen) — preferred when inference cost, data privacy, or on-premise deployment requirements rule out managed APIs
  • Full fine-tuning on A100/H100 clusters for cases requiring deep behavioural change

We handle hyperparameter optimisation, training monitoring, checkpoint management, and infrastructure provisioning — the full pipeline, not just a training script.

Automated Evaluation Frameworks

A model is only better if you can measure that it's better. We build evaluation frameworks with:

  • Held-out test sets with annotated ground truth
  • Automated scoring (exact match, F1, LLM-as-judge, task-specific metrics)
  • Regression suites that run on every model version before promotion
  • Human evaluation panels for subjective quality dimensions (tone, accuracy, helpfulness)

Without a rigorous evaluation framework, you can't tell whether you've improved performance or just overfit to a limited test set.

Prompt Optimisation Alongside Fine-Tuning

Fine-tuning and prompt engineering are complementary, not alternatives. We run structured prompt optimisation in parallel — systematically testing prompt variations against your evaluation set to find the highest-performing combination of prompts and model weights.

LLMOps for Fine-Tuned Models

Production deployment of fine-tuned models with: model versioning and promotion gates, canary deployment for gradual traffic shifting, continuous evaluation to catch drift, and cost monitoring. Fine-tuning without a deployment pipeline means you can't safely iterate.

When to Fine-Tune

Fine-tuning is the right choice when:

  • Prompt engineering alone isn't getting consistent results — you've iterated extensively on prompts but accuracy on your domain-specific task is still below acceptable thresholds
  • Output format consistency is critical — you need the model to reliably produce structured outputs (JSON, specific classifications, constrained responses) without complex parsing logic
  • Inference cost is a constraint — a fine-tuned smaller model (GPT-4o-mini, Llama 3 8B) can match the accuracy of a prompted larger model at a fraction of the cost
  • Data privacy requires on-premise hosting — you can't send sensitive data to managed APIs; fine-tuned open-source models running in your own infrastructure are the answer
  • You have proprietary domain knowledge that public models lack — clinical terminology, legal concepts, internal product knowledge, industry-specific jargon

Related Work

We fine-tuned a custom LLM for a voice roleplay platform, using a dataset of 8,000+ annotated sales conversations. The fine-tuned model achieved a 35% improvement in persona consistency over the base GPT-4 model, measured by a human evaluation panel. Prompt engineering time was eliminated entirely. The model now handles 2,000+ sessions per week in production.

Frequently Asked Questions

For task-specific fine-tuning (output format, classification, constrained generation): 500–2,000 high-quality examples is often sufficient. For broader behavioural alignment or knowledge injection: 5,000–50,000. The quality of examples matters far more than the quantity — 500 carefully curated examples outperform 5,000 noisy ones.
Different tools for different problems. RAG is better for dynamic knowledge that changes frequently or needs sourcing/citation — it retrieves current information at inference time. Fine-tuning is better for consistent task behaviour, output format, tone, and domain-specific accuracy on stable knowledge. Many production systems use both: fine-tuned model + RAG retrieval.
A focused fine-tuning engagement — dataset curation, training, evaluation, deployment — typically takes 4–8 weeks. The largest time investment is data curation if you don't have clean annotated data. Training itself is fast; evaluation framework development and iteration take more time than teams expect.
Catastrophic forgetting is a real risk with aggressive full fine-tuning. We mitigate it through LoRA (which adds task-specific parameters without overwriting base weights), careful dataset design that includes general examples alongside domain-specific ones, and evaluation that explicitly tests for capability regression.
Significant. A fine-tuned GPT-4o-mini model typically costs 10–20× less per token than GPT-4o, and can match GPT-4o's accuracy on specific tasks it's been trained for. For high-volume production workloads, this compounds to meaningful savings — often $10,000–$100,000+ per month at scale.

Stay ahead in AI engineering.

Get the latest insights on building production AI systems, be the first to explore approaches that actually work beyond the demo.

Start a Project →