Alignment: How AI Is Taught to Be Helpful and Safe - Xap.es

When you use ChatGPT, Claude or Gemini, you are not interacting with a base language model. You are interacting with a model that has gone through an additional process, called alignment, designed to make it helpful, honest and safe. Understanding that process explains many things about how these systems behave: why they refuse to do certain things, why they are sometimes overly cautious, and why their values are not neutral.

The problem with the base model

A language model pre-trained on large volumes of text has an astonishing capability: predicting what text comes after any given fragment. But that capability is not the same as being useful as an assistant.

If you ask “how can I improve my CV?”, the base model might respond with more questions (because many training documents have that structure), with an academic analysis of the concept of a CV, or even with a response in the style of a 2000s internet forum. It predicts what is most likely given the context. It does not understand that you are expecting practical advice.
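The "predict what is most likely" behaviour can be illustrated at a very small scale. The sketch below is not how a real language model works internally. It is a word-pair (bigram) counter trained on a made-up four-line corpus, and the names `CORPUS`, `train_bigram` and `complete` are hypothetical. It only shows that a pure next-word predictor continues the text it has seen rather than answering the question.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus standing in for pre-training text (illustrative only).
CORPUS = [
    "how can i improve my cv ?",
    "how can i improve my cv ? anyone know ?",
    "how can i improve my english ?",
    "improve my cv ? anyone know ?",
]

def train_bigram(corpus):
    """Count, for every word, which words followed it in the corpus."""
    counts = defaultdict(Counter)
    for line in corpus:
        words = line.split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts

def complete(counts, prompt, max_new_words=3):
    """Greedily append the most frequent next word, one word at a time."""
    words = prompt.split()
    for _ in range(max_new_words):
        followers = counts.get(words[-1])
        if not followers:
            break  # the last word never had a successor in the corpus
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)

model = train_bigram(CORPUS)
completion = complete(model, "how can i improve my cv ?")
print(completion)  # how can i improve my cv ? anyone know ?
```

The toy model answers a question with another question because that is the pattern in its data, which is the same failure described above, only in miniature.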

The problem is that “following instructions” and “being helpful” are not properties that emerge automatically from pre-training on general text. They require an additional process.

Instruction fine-tuning

The first step of alignment is instruction fine-tuning (or SFT, Supervised Fine-Tuning).

The process is conceptually simple: a dataset of (instruction, ideal response) pairs is created and the model continues training on that dataset. “Instruction: summarise this text. Response: [quality summary].” “Instruction: explain what photosynthesis is to a 10-year-old. Response: [clear, age-appropriate explanation].”

After thousands or millions of these examples, the model learns to follow instructions rather than simply complete text. This step turns the base model into a functional assistant.
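To make the (instruction, ideal response) format concrete, here is a minimal sketch of how one pair might be turned into a training example. It rests on several assumptions that the article does not state: a whitespace "tokenizer", a prompt template of the form `Instruction: ... Response:`, and the common practice of masking the prompt so that only the response tokens are scored. The value `-100` mirrors a widely used "ignore this position" convention, and `build_example` is a hypothetical helper.

```python
IGNORE_INDEX = -100  # assumed convention: positions with this label are skipped by the loss

def build_example(instruction, response, vocab):
    """Turn one (instruction, response) pair into input ids and training labels."""
    prompt_tokens = f"Instruction: {instruction} Response:".split()
    response_tokens = response.split() + ["<eos>"]
    tokens = prompt_tokens + response_tokens
    # Toy tokenizer: each distinct whitespace-separated word gets the next free id.
    input_ids = [vocab.setdefault(tok, len(vocab)) for tok in tokens]
    # Only response tokens carry a real label; prompt positions are masked out.
    labels = [IGNORE_INDEX] * len(prompt_tokens) + input_ids[len(prompt_tokens):]
    return input_ids, labels

vocab = {}
input_ids, labels = build_example("summarise this text.", "A short summary.", vocab)
print(input_ids)
print(labels)
```

With this masking the model is still trained by next-token prediction, but it is only graded on producing the ideal response, not on reproducing the instruction.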

But a problem remains: “correct” responses are subjective. What is a good explanation? When is a response too long or too short? What level of detail is appropriate? To capture human preferences in a more nuanced way, something more is needed.

RLHF: learning from human preference

RLHF stands for Reinforcement Learning from Human Feedback. It is the technique that helped turn OpenAI's GPT-3 line of models into ChatGPT, and it underlies many of the most capable assistant models.

The process has three phases:

1. Collecting comparisons. Multiple model responses are generated for the same instruction and human evaluators are asked to rank them from best to worst. Instead of saying “this response is correct” (costly and subjective), they say “this response is better than that one” (easier and more consistent).

2. Training the reward model. With those comparisons, a separate model — the reward model — is trained to predict which responses humans prefer. This model acts as an automatic quality evaluator.

3. Reinforcement optimisation. The language model is optimised to maximise the reward model’s score using a reinforcement learning algorithm, most commonly PPO (Proximal Policy Optimization). The result is a model that produces responses humans tend to prefer.
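Phases 1 and 2 can be sketched in a few lines. The example assumes that a ranking is expanded into every (preferred, rejected) pair and that the reward model is trained with a Bradley-Terry style pairwise loss, a common choice in published RLHF work but not one this article names. The response labels and scores are made up, and `ranking_to_pairs` and `pairwise_loss` are hypothetical helper names.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs."""
    return list(combinations(ranked_responses, 2))

def pairwise_loss(score_preferred, score_rejected):
    """Bradley-Terry style loss: -log(sigmoid(score_preferred - score_rejected))."""
    margin = score_preferred - score_rejected
    return math.log1p(math.exp(-margin))

# Hypothetical ranking from one evaluator, best first.
ranking = ["response_a", "response_b", "response_c"]
pairs = ranking_to_pairs(ranking)

# Hypothetical scores a partly trained reward model might assign.
scores = {"response_a": 2.0, "response_b": 0.5, "response_c": 1.0}
losses = {pair: pairwise_loss(scores[pair[0]], scores[pair[1]]) for pair in pairs}
for (preferred, rejected), loss in losses.items():
    print(f"{preferred} > {rejected}: loss = {loss:.3f}")
```

The loss is small when the reward model already scores the preferred response higher, and large when it gets the order wrong, as it does here for `response_b` versus `response_c`. Training adjusts the reward model to reduce that loss across all collected comparisons.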

RLHF captures preferences that are difficult to specify explicitly: clarity, conciseness, appropriate tone, practical utility. Human evaluators do not need to articulate why they prefer a response — they just indicate which one is better.
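The optimisation phase can also be illustrated on a toy scale. The sketch below is not PPO. It assumes the commonly used objective of maximising expected reward minus a penalty `beta` times the KL divergence from the original model, and it uses the fact that, for a small finite set of candidate responses, this objective has a closed-form solution. The KL penalty, the candidate responses, their probabilities and their rewards are all assumptions made for illustration.

```python
import math

def kl_regularised_policy(reference_probs, rewards, beta):
    """Distribution maximising E[reward] - beta * KL(policy || reference).

    For a finite set of candidates the optimum has a closed form:
    policy(y) is proportional to reference(y) * exp(reward(y) / beta).
    """
    weights = {y: p * math.exp(rewards[y] / beta) for y, p in reference_probs.items()}
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}

# Hypothetical candidates for one prompt, with made-up probabilities and rewards.
reference = {"practical advice": 0.2, "more questions": 0.5, "forum banter": 0.3}
reward = {"practical advice": 2.0, "more questions": 0.0, "forum banter": -1.0}

for beta in (10.0, 1.0, 0.1):
    policy = kl_regularised_policy(reference, reward, beta)
    print(beta, {y: round(p, 3) for y, p in policy.items()})
```

A large `beta` keeps the optimised model close to the original distribution, while a small `beta` concentrates almost all probability on the highest-reward response. In practice the same trade-off is approached through gradient updates on a neural network rather than an exact formula.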

What values the model learns

Alignment is not neutral. The values the model learns depend on:

The human evaluators. Their preferences, cultures, biases and criteria get encoded in the reward model. If evaluators value certain types of responses, the model learns to produce them.
The evaluation instructions. The criteria given to evaluators — what counts as a good response, what is considered harmful — are defined by the companies developing the models.
Content policies. Restrictions on what the model should not do — generate harmful content, help with illegal activities, provide dangerous information — are design decisions incorporated into the alignment process.

This has an important implication: when a model refuses to do something or responds with excessive caution, the decision does not originate with the model, which has no agency of its own. It is a consequence of the preferences and criteria encoded during alignment.

The limits of alignment

Alignment substantially improves model behaviour, but it has real limits.

The generalisation problem. The model learns to behave well in situations similar to those in training. In genuinely new or unusually formulated situations, it can fail in unexpected ways.

Over-caution. Strongly aligned models tend to be more cautious than necessary in many situations. If the evaluation criterion penalises potentially harmful errors heavily, the model learns to also avoid situations that only seem dangerous.
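A small piece of arithmetic shows why a heavy penalty leads to over-caution. The reward scheme below is a deliberately simplified assumption, not a description of any real evaluation rubric: answering earns `benefit` when the request was harmless and loses `penalty` when it was harmful, while refusing always scores zero. The function names are hypothetical.

```python
def answer_threshold(benefit, penalty):
    """Risk level above which refusing scores higher than answering.

    Assumed toy reward: answering earns +benefit if the request was harmless
    and -penalty if it was harmful; refusing always earns 0.
    Expected reward of answering = (1 - p) * benefit - p * penalty,
    which is positive only while p < benefit / (benefit + penalty).
    """
    return benefit / (benefit + penalty)

def best_action(p_harmful, benefit, penalty):
    """Pick whichever action has the higher expected reward."""
    expected_answer = (1 - p_harmful) * benefit - p_harmful * penalty
    return "answer" if expected_answer > 0 else "refuse"

balanced = answer_threshold(benefit=1.0, penalty=1.0)   # 0.5
harsh = answer_threshold(benefit=1.0, penalty=19.0)     # 0.05

# A request that merely looks 10% likely to be harmful:
print(best_action(0.10, benefit=1.0, penalty=1.0))   # answer
print(best_action(0.10, benefit=1.0, penalty=19.0))  # refuse
```

Under the harsh penalty, refusing becomes the higher-scoring choice as soon as a request looks more than 5% likely to be harmful, so a request that is harmless nine times out of ten is still refused.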

Jailbreaking. The alignment restrictions are learned behaviours layered on top of the base model, not fundamental properties. With the right instructions, known as “jailbreaks”, it is sometimes possible to circumvent those restrictions. This suggests that alignment, as currently practised, adjusts behaviour rather than changing values at a deeper level.

Continuous evolution. As models are used in more contexts and users discover problematic behaviours, developers refine the alignment process. Current models are significantly more capable and better aligned than earlier versions, but the process is ongoing.

Understanding alignment means understanding that the AI models we use are the result of human design decisions that go far beyond technical training. Their behaviours — their capabilities, their limits, their biases — are the product of those decisions.
