Fine-tuning: when it makes sense to customise a language model - Xap.es

When people start using language models seriously, they tend to reach the same conclusion before long: the model doesn’t talk like our company, doesn’t know our internal processes, doesn’t use our terminology. The instinctive answer is to train the model on our own data. Fine-tuning seems like the natural solution. Sometimes it is. But more often than people admit, it isn’t.

Understanding the difference helps you make better decisions about when to invest in customisation and when better instructions will do.

The difference between using a model and training it

When we use a language model through an interface or API, we’re drawing on the knowledge it acquired during its original training on a very large body of text: books, code, conversations. That training is over. We simply give it instructions (the prompt) to guide how it responds.

Fine-tuning is something different: it means continuing that training, but with a smaller, more specific dataset that we provide. The model adjusts its internal parameters to learn new patterns. It’s not like uploading a document; it literally changes how the model processes and generates text.

This distinction matters because each option has different costs and limits. Using a model is cheap and immediate. Training it requires data, time and technical judgement.

When prompting is not enough

There’s a temptation to think of fine-tuning as simply a more powerful version of prompting. It isn’t. A well-written prompt can do a lot: set a tone, establish constraints, give examples of how to respond, provide context. For most everyday use cases, a good system prompt is enough.

Prompting starts to show its limits in specific situations:

When the output format has a very specific, non-standard structure. If you need the model to consistently produce output in a proprietary schema that doesn’t exist in its training data, examples in the prompt don’t always deliver reliable consistency.

When the style or voice is extremely particular. A distinctive editorial tone — short sentences, restricted vocabulary, fixed structure — can be hard to maintain through instructions alone.

When the behaviour you want differs significantly from the model’s default. If the model tends to reason in one way and you need it to reason differently on a consistent basis, a prompt can correct it in specific cases but not always in a stable way.

Outside these situations, fine-tuning rarely solves something that a better prompt couldn’t address.
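One way to judge whether the output-format limit described above really applies is to measure it. The sketch below counts how many sampled outputs follow a schema. The schema itself (the keys `ticket_id`, `severity` and `summary`) is invented for illustration, and the sample outputs would come from whatever model and prompt you are testing.

```python
import json

# Hypothetical proprietary schema, made up for this example:
# every output must be a JSON object with exactly these keys.
REQUIRED_KEYS = {"ticket_id", "severity", "summary"}
ALLOWED_SEVERITIES = {"low", "medium", "high"}


def is_valid(output: str) -> bool:
    """Return True if one model output follows the assumed schema."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or set(data) != REQUIRED_KEYS:
        return False
    severity = data["severity"]
    return isinstance(severity, str) and severity in ALLOWED_SEVERITIES


def compliance_rate(outputs: list[str]) -> float:
    """Fraction of outputs that pass the schema check (0.0 for no outputs)."""
    if not outputs:
        return 0.0
    return sum(is_valid(o) for o in outputs) / len(outputs)
```

If the rate stays well below what you need even after improving the prompt and its examples, that is some evidence the format problem is one of the cases where training may help.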

What happens during fine-tuning

The technical process begins with a training dataset: pairs of inputs and outputs that represent the behaviour you’re after. The model processes these and adjusts its internal weights so that, when faced with similar inputs, it produces outputs closer to the examples.
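As a rough illustration of what such a dataset can look like on disk, the sketch below stores input-output pairs as JSON Lines, one example per line. The field names `input` and `output` and the sample texts are assumptions made for this example. Each training service or library defines its own expected format, so check its documentation before preparing real data.

```python
import json

# Hypothetical training pairs: each one shows the behaviour we want.
examples = [
    {"input": "Summarise: the server was down for two hours on Monday.",
     "output": "Outage. Monday. Two hours."},
    {"input": "Summarise: the invoice was sent late because of a system error.",
     "output": "Late invoice. System error."},
]


def to_jsonl(pairs: list[dict]) -> str:
    """Serialise pairs as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(p, ensure_ascii=False) for p in pairs)


def from_jsonl(text: str) -> list[dict]:
    """Parse JSON Lines back into a list of pairs, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


jsonl_text = to_jsonl(examples)
```

Note how both outputs follow the same clipped style. What the model picks up from pairs like these is the pattern, which is the point of the examples.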

This is not about feeding documents into the model. Fine-tuning is not a way to expand the model’s memory or reliably add factual information. If you give it your internal documents as training data, it will learn the style and structure of those documents, but it won’t accurately memorise the facts or retrieve them like a database.

This is one of the most common misunderstandings: using fine-tuning to make the model “know” specific things. For that, there is RAG — retrieval-augmented generation — which connects the model to documents in real time, during inference, rather than modifying the model itself.
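To make the contrast concrete, here is a toy sketch of the retrieval idea: pick the most relevant document at question time and place it in the prompt, leaving the model untouched. It uses simple word-count similarity for readability. Real systems typically use embedding models and a vector index, and the documents and function names below are invented for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical internal documents.
DOCUMENTS = [
    "Expense reports must be submitted by the 5th of each month.",
    "New laptops are requested through the IT service desk.",
    "Annual leave requires approval from your line manager.",
]


def vectorize(text: str) -> Counter:
    """Bag-of-words vector: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the question."""
    q = vectorize(question)
    ranked = sorted(documents, key=lambda d: cosine(q, vectorize(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, documents: list[str], k: int = 1) -> str:
    """Assemble a prompt that carries the retrieved context with the question."""
    context = "\n".join(f"- {d}" for d in retrieve(question, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

Updating what the model can answer then means editing `DOCUMENTS`, not retraining anything.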

Fine-tuning adjusts how the model thinks and responds. Not what it knows in a factual sense.

The real data requirements

Effective fine-tuning requires quality data, not just data. This means several things:

Representative examples. Each input-output pair must demonstrate exactly the behaviour you want. If the examples are inconsistent — sometimes one style, sometimes another — the model will learn that inconsistency.

Sufficient volume. It depends on the case, but you generally need hundreds of examples for simple behaviours and thousands for complex ones. With fewer than a hundred examples, results tend to be disappointing.

Clean data. Errors in the training data transfer to the model. A dataset with 30% incorrect examples produces a model that fails regularly.

Preparing a good dataset requires time, human judgement and careful review. It is, in most cases, the most expensive part of the whole process — and the one most underestimated before starting.
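Some of these checks can be partly automated before any human review. The sketch below is a minimal audit under stated assumptions: pairs are dictionaries with `input` and `output` keys (names chosen for this example), and the threshold of 100 is the rough floor mentioned above, not a universal rule. It cannot tell whether an example is correct, only whether it is empty or contradicts another example.

```python
from collections import defaultdict

MIN_EXAMPLES = 100  # rough floor taken from the text; adjust per use case


def audit(pairs: list[dict]) -> dict:
    """Report basic volume and consistency problems in a training set."""
    # Examples with a missing or blank input or output.
    empty = [
        i for i, p in enumerate(pairs)
        if not p.get("input", "").strip() or not p.get("output", "").strip()
    ]

    # The same input mapped to different outputs teaches inconsistency.
    outputs_by_input = defaultdict(set)
    for p in pairs:
        outputs_by_input[p.get("input", "").strip()].add(p.get("output", "").strip())
    conflicting = sorted(
        key for key, outs in outputs_by_input.items() if key and len(outs) > 1
    )

    return {
        "count": len(pairs),
        "enough_examples": len(pairs) >= MIN_EXAMPLES,
        "empty_indices": empty,
        "conflicting_inputs": conflicting,
    }
```

A clean report from a script like this is only a starting point. Judging whether each output really shows the desired behaviour still needs a person.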

Alternatives before training

Before investing in fine-tuning, there are options worth exhausting first:

Improve the system prompt. A long, detailed prompt with concrete examples can reproduce many behaviours that intuitively seem to require training. Including three or four input-output examples directly in the prompt — known as few-shot prompting — is cheap and often enough.

RAG for specific knowledge. If the problem is that the model doesn’t know your internal documents, connecting it to those documents through retrieval is more flexible, more updatable and more transparent than fine-tuning.

Available specialised models. For specific domains — medicine, law, code — there are already models trained on relevant corpora. Using one of them is simpler than training your own.

Prompt chains. Breaking a complex task into steps and chaining different prompts for each can produce more controllable results than trying to encode all the desired behaviour at once.
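The few-shot option from the list above is simple enough to sketch. The function below assembles instructions, a handful of example pairs and the new input into one prompt string. The `Input:` and `Output:` labels and the function name are arbitrary choices for this illustration, and the resulting string would be sent to whichever model API you use.

```python
def build_few_shot_prompt(
    instructions: str,
    examples: list[tuple[str, str]],
    new_input: str,
) -> str:
    """Build a prompt with worked examples followed by the new input."""
    parts = [instructions.strip(), ""]
    for example_input, example_output in examples:
        parts.append(f"Input: {example_input}")
        parts.append(f"Output: {example_output}")
        parts.append("")  # blank line between examples
    # End on an open "Output:" so the model completes the pattern.
    parts.append(f"Input: {new_input}")
    parts.append("Output:")
    return "\n".join(parts)


prompt = build_few_shot_prompt(
    "Rewrite each sentence in our house style: short and direct.",
    [
        ("We would like to inform you that the meeting is postponed.",
         "The meeting is postponed."),
        ("It has come to our attention that the report contains errors.",
         "The report contains errors."),
        ("Please be advised that the office will be closed on Friday.",
         "The office is closed on Friday."),
    ],
    "We regret to announce that the launch has been delayed.",
)
```

Changing the behaviour here means editing three examples in a string, which is much cheaper to iterate on than a training run.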

When it actually makes sense

With all the above caveats, there are scenarios where fine-tuning is the right tool:

When you need the model to adopt a very specific editorial style and maintain it consistently at scale: thousands of documents, without human review of each one.

When the task has a strict output format that examples in the prompt can’t stabilise.

When latency is critical and the long prompt needed to provide sufficient context slows responses too much.

When usage volume is so high that the cost of a long prompt becomes a real expense.

In these cases, fine-tuning can deliver a faster, more consistent and more cost-effective model over time. But the initial investment — in data, training time and evaluation — must be justified.
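Whether the investment is justified can be estimated with simple arithmetic. The sketch below computes how many requests it takes for the tokens saved by a shorter prompt to pay back the upfront cost. All figures are hypothetical, and the calculation assumes the fine-tuned model is billed at the same per-token price as the base model, which is often not the case in practice.

```python
import math
from typing import Optional


def break_even_requests(
    upfront_cost: float,
    long_prompt_tokens: int,
    short_prompt_tokens: int,
    price_per_1k_tokens: float,
) -> Optional[int]:
    """Requests needed before prompt savings cover the upfront cost.

    Returns None when the shorter prompt saves nothing.
    """
    saved_tokens = long_prompt_tokens - short_prompt_tokens
    saving_per_request = saved_tokens / 1000 * price_per_1k_tokens
    if saving_per_request <= 0:
        return None
    return math.ceil(upfront_cost / saving_per_request)


# Hypothetical figures: 5000 spent on data, training and evaluation;
# the prompt shrinks from 3000 to 200 tokens at 0.01 per 1000 tokens.
requests_needed = break_even_requests(5000, 3000, 200, 0.01)
```

With these made-up numbers the saving is 0.028 per request, so the break-even point is a little under 180,000 requests. At low volume that point may never be reached, which is the article's argument in numeric form.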

The right question isn’t “could this be improved with fine-tuning?” It’s “can I get there in a cheaper way first?” Almost always, the answer is yes.
