When we say a neural network “learns,” we use a powerful metaphor that can mislead. Machines do not learn the way humans learn: they do not reflect, do not generalise with intuition, do not have curiosity. But what they do — adjusting millions of mathematical parameters to minimise error on a task — produces results that closely resemble learning. Understanding how that process works is the foundation for everything else.
Learning without anyone explaining
Imagine you need to teach someone to distinguish spam from legitimate email, but without giving them rules. Instead of saying “if it contains the word ‘free’ and an external link, it is spam,” you show them thousands of pre-classified examples: this is spam, this is not, this one is, this one is not.
The person, after enough examples, starts detecting patterns. Not necessarily the same ones you would have programmed, but patterns that work. That is, in essence, machine learning: inferring rules from examples, not from explicit instructions.
Supervised learning — the most common type — requires three things: training data (the examples), labels (the correct answers for those examples), and a model capable of adjusting itself to approximate those answers.
Parameters: the knots of knowledge
A neural network is a mathematical graph of interconnected nodes, loosely inspired by the structure of the brain (though the analogy is very superficial). What connects the nodes are weights: numbers that determine how much the signal from one node contributes to the output of the next.
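To make the role of a weight concrete, here is a minimal sketch of a single node in Python. The input values, the weights, the bias term and the sigmoid squashing function are all illustrative assumptions, not something the text above prescribes.

```python
import math

def neuron_output(inputs, weights, bias):
    """Weighted sum of the incoming signals plus a bias, squashed to (0, 1)."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))  # sigmoid: an assumed activation

# Hypothetical signals arriving from three upstream nodes.
inputs = [1.0, 0.0, 1.0]
# One weight per connection: a larger magnitude means more influence.
weights = [2.0, -1.0, 0.5]
bias = -1.0

activation = neuron_output(inputs, weights, bias)
print(round(activation, 4))
```

Changing any single weight changes how strongly that connection pushes the node's output up or down, which is exactly what training adjusts.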
Those weights are the model’s parameters. A small model might have millions. GPT-4’s size has not been officially disclosed; unofficial estimates put it in the hundreds of billions or more. Each parameter is a single adjustable number.
At the start of training, those parameters are initialised with random values. The model knows nothing. Its initial predictions are no better than random guessing.
What happens during training is a sequence of iterative adjustments to those parameters so that the model’s predictions get progressively closer to the correct answers in the training set.
The loss function: measuring error
To adjust the parameters, you first need to know how wrong the model is. That is the job of the loss function: a mathematical measure of the distance between what the model predicts and what it should have predicted.
If the model predicts that an email has a 30% chance of being spam, but it actually was spam, the loss function returns a high number. If it predicts 92% and it was spam, the number is low. The goal of training is to minimise that number averaged over all examples.
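The two predictions above can be scored with binary cross-entropy, one common loss for yes/no classification. This is a sketch under that assumption; the text does not say which loss function it has in mind, and the function name is made up for illustration.

```python
import math

def binary_cross_entropy(predicted_prob, is_spam):
    """Loss for one example: small when the model gave the true class a high probability."""
    prob_of_truth = predicted_prob if is_spam else 1.0 - predicted_prob
    return -math.log(prob_of_truth)

loss_unsure = binary_cross_entropy(0.30, True)     # roughly 1.20
loss_confident = binary_cross_entropy(0.92, True)  # roughly 0.08

# Training minimises the average over all examples (here, just these two).
mean_loss = (loss_unsure + loss_confident) / 2
print(round(loss_unsure, 2), round(loss_confident, 2), round(mean_loss, 2))
```

The 30% prediction is penalised about fourteen times more heavily than the 92% prediction, even though both emails were spam.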
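Different tasks use different loss functions. Classification, regression, text generation: each defines in its own way what it means to “be less wrong.” Cross-entropy is a common choice for classification and text generation, mean squared error for regression.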
Gradient descent: learning from failure
Once you have a measure of error, you need to know how to adjust the parameters to reduce it. This is where gradient descent comes in.
The gradient is the mathematical equivalent of slope on a terrain. If error is the terrain and the parameters are your position, the gradient points in the direction of steepest ascent, so its negative tells you which way to move to descend fastest. Gradient descent is simply moving in that downhill direction, one small step at a time.
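The terrain picture can be sketched with a single parameter. The error surface below, a parabola with its lowest point at w = 3, and the step size are invented purely for illustration; a real model has millions of parameters and a far more irregular surface.

```python
def error(w):
    """Hypothetical error surface: lowest at w = 3."""
    return (w - 3.0) ** 2

def gradient(w):
    """Slope of the error surface at w (derivative of error)."""
    return 2.0 * (w - 3.0)

w = 0.0               # arbitrary starting position
learning_rate = 0.1   # assumed step size

for step in range(100):
    # Step against the slope, i.e. downhill.
    w -= learning_rate * gradient(w)

print(round(w, 6), round(error(w), 6))
```

Each step shrinks the distance to the minimum by a fixed fraction, so after 100 steps w sits very close to 3. A step size that is too large would overshoot the valley instead.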
In a neural network, the gradient is calculated by backpropagation: propagating the error from the model’s output backwards through all the layers to work out how much each parameter contributed to the total error. Gradient descent then uses those contributions to adjust each parameter.
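As a rough illustration, here is backpropagation on a deliberately tiny "network" of two weights in a chain, with a squared-error loss. The network shape, the numbers and the function names are assumptions made for the sketch. The last lines compare the result with a brute-force numerical estimate as a sanity check.

```python
def forward(w1, w2, x):
    """Two-layer chain: x -> h -> y, with one weight per layer."""
    h = w1 * x
    y = w2 * h
    return h, y

def loss(w1, w2, x, target):
    _, y = forward(w1, w2, x)
    return (y - target) ** 2

def backward(w1, w2, x, target):
    """Chain rule applied from the output back towards the input."""
    h, y = forward(w1, w2, x)
    d_loss_d_y = 2.0 * (y - target)   # error at the output
    d_loss_d_w2 = d_loss_d_y * h      # contribution of the last weight
    d_loss_d_h = d_loss_d_y * w2      # error passed back one layer
    d_loss_d_w1 = d_loss_d_h * x      # contribution of the first weight
    return d_loss_d_w1, d_loss_d_w2

w1, w2, x, target = 0.5, -1.5, 2.0, 1.0
grad_w1, grad_w2 = backward(w1, w2, x, target)

# Sanity check: nudge w1 slightly and measure how the loss changes.
eps = 1e-6
numeric_w1 = (loss(w1 + eps, w2, x, target) - loss(w1 - eps, w2, x, target)) / (2 * eps)
print(grad_w1, grad_w2, round(numeric_w1, 4))
```

The backward pass reuses the error computed at the output for every earlier layer, which is what makes the method practical for networks with many layers.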
This process is repeated millions or billions of times, with different subsets of data, until the error on the training data reaches an acceptable level.
Training cycle:
1. The model receives an input example
2. It makes a prediction (forward pass)
3. The error is calculated (loss function)
4. The error is propagated backwards (backpropagation)
5. The parameters are adjusted (gradient descent)
6. Repeat with the next example → millions of times
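The six steps above can be sketched end to end on a toy problem. Everything specific here is an assumption for illustration: the four-example dataset (spam only when an email both contains “free” and has an external link), the single-node model, the step size and the number of passes.

```python
import math
import random

# Hypothetical data: [contains "free", has external link] -> 1 if spam.
examples = [
    ([0.0, 0.0], 0),
    ([0.0, 1.0], 0),
    ([1.0, 0.0], 0),
    ([1.0, 1.0], 1),
]

def predict(weights, bias, features):
    """Forward pass: weighted sum squashed into a probability of spam."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def mean_loss(weights, bias):
    """Cross-entropy averaged over all examples."""
    total = 0.0
    for features, label in examples:
        p = predict(weights, bias, features)
        total -= math.log(p if label == 1 else 1.0 - p)
    return total / len(examples)

random.seed(0)
weights = [random.uniform(-0.5, 0.5) for _ in range(2)]  # random start
bias = 0.0
learning_rate = 0.5  # assumed step size
initial_loss = mean_loss(weights, bias)

for epoch in range(1000):
    for features, label in examples:            # 1. receive an example
        p = predict(weights, bias, features)    # 2. forward pass
        # 3-4. For a sigmoid output with cross-entropy loss, the derivative
        # of the loss with respect to the weighted sum is (p - label).
        error_signal = p - label
        # 5. Gradient descent: move each parameter against its gradient.
        weights = [w - learning_rate * error_signal * x
                   for w, x in zip(weights, features)]
        bias -= learning_rate * error_signal
    # 6. The outer loop repeats the whole cycle.

final_loss = mean_loss(weights, bias)
predictions = [1 if predict(weights, bias, f) > 0.5 else 0 for f, _ in examples]
print(round(initial_loss, 3), round(final_loss, 3), predictions)
```

With only one layer, the backward pass collapses to a single line. In a deep network, the same error signal would be propagated through every layer before the update.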
What training cannot do
Understanding training also means understanding its limits — which are the limits of the AI we use today.
The model learns from the past, not the future. The parameters are fixed at the time of training. If the world changes — new events, new information — the model does not know unless it is retrained. This is why models have a “knowledge cutoff date.”
The model learns what is in the data. If the training data contains biases — over-representation of certain voices, under-representation of others, systematic errors — those biases get encoded in the parameters. The model cannot learn what is not in its data.
The model may memorise rather than generalise. If trained too long on the same data, the model learns those specific examples rather than the underlying patterns. This is called overfitting, and produces models that work well on training data but poorly on new data.
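The symptom can be reproduced in a few lines. Rather than training a network for too long, this sketch takes a shortcut and compares two hypothetical models on invented data: a "memoriser" that copies the label of the nearest training example, and a simple model that learns a single threshold. Two training labels are deliberately wrong, standing in for noise.

```python
def true_label(x):
    """The underlying pattern the models should discover."""
    return 1 if x >= 0.5 else 0

train_x = [i / 20 for i in range(20)]
train_y = [true_label(x) for x in train_x]
for noisy_index in (3, 16):               # two mislabelled examples
    train_y[noisy_index] = 1 - train_y[noisy_index]

test_x = [x + 0.02 for x in train_x]      # new, unseen inputs
test_y = [true_label(x) for x in test_x]

def memoriser(x):
    """Copies the label of the closest training example, noise included."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def training_errors(threshold):
    return sum((1 if x >= threshold else 0) != y
               for x, y in zip(train_x, train_y))

# The simple model: the single cut-off with the fewest training errors.
best_threshold = min(train_x, key=training_errors)

def threshold_model(x):
    return 1 if x >= best_threshold else 0

def accuracy(model, xs, ys):
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(xs)

print("memoriser:", accuracy(memoriser, train_x, train_y),
      accuracy(memoriser, test_x, test_y))
print("threshold:", accuracy(threshold_model, train_x, train_y),
      accuracy(threshold_model, test_x, test_y))
```

The memoriser is perfect on the training data and worse on new data, because it has learned the two mislabelled examples as if they were the pattern. The threshold model cannot fit the noise, so it scores lower in training and higher on the unseen inputs.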
Size does not solve everything. More parameters and more data improve performance up to a point, but do not eliminate these structural limitations. A larger model has the same conceptual problems as a smaller one, just at greater scale.
Training is the most important process in modern AI. Everything models know — their capabilities and their limitations — is a consequence of how and with what they were trained. What follows — the interactions, the prompts, the outputs — is the expression of what training encoded.