When you interact with a language model, what you see on screen — a fluid conversation that seems to remember what you said before — can create an incorrect impression of how the system actually works. There are two concepts that, when misunderstood, lead to frequent mistakes when working with AI: tokens and context. Understanding them well changes how you write your prompts and how you interpret responses.
What is a token
Language models do not process text letter by letter or word by word. They process it in units called tokens — text fragments of variable length defined by a process called tokenisation.
A token can be:
- A complete word: “house,” “work,” “AI”
- Part of a word: “intel” + “ligence” = two tokens
- A punctuation mark: “.” or “,” = one token each
- A space + word: “ house” (with the space) = one token
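The cases in the list above can be reproduced with a toy tokenizer. This is only a sketch: the vocabulary `TOY_VOCAB` and the function `toy_tokenize` are invented for illustration, and greedy longest-match is a simplification. Production tokenizers learn their vocabularies from data and generally use more elaborate algorithms.

```python
# Toy vocabulary, invented for illustration only. Real vocabularies hold
# tens of thousands of entries learned from large text collections.
TOY_VOCAB = {"the", " house", "house", "intel", "ligence", ".", ","}


def toy_tokenize(text: str, vocab: set[str] = TOY_VOCAB) -> list[str]:
    """Split text greedily, always taking the longest vocabulary match."""
    max_len = max(len(entry) for entry in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # No vocabulary entry matched: emit the single character as a token.
            tokens.append(text[i])
            i += 1
    return tokens


print(toy_tokenize("intelligence"))  # ['intel', 'ligence']
print(toy_tokenize("the house."))    # ['the', ' house', '.']
```

Note how the space travels with the word that follows it, and how a word missing from the vocabulary is built from smaller pieces.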
As a rough general rule, 1 token ≈ 0.75 words in English.
Why it matters: models have limits expressed in tokens, not words. A model with a context window of 128,000 tokens can process approximately 100,000 words. APIs charge per token consumed, not per word. And the number of tokens in a prompt affects response speed.
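The arithmetic behind these figures can be sketched in a few lines. The function names are hypothetical, the 0.75 ratio is only the rule of thumb quoted above, and the price in the usage example is a made-up number, not any provider's actual rate.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb for English text, not an exact ratio


def estimate_tokens(text: str) -> int:
    """Estimate a token count from a whitespace word count."""
    word_count = len(text.split())
    return round(word_count / WORDS_PER_TOKEN)


def words_that_fit(context_window_tokens: int) -> int:
    """Estimate how many English words fit in a window of the given size."""
    return int(context_window_tokens * WORDS_PER_TOKEN)


def estimate_cost(tokens: int, price_per_million_tokens: float) -> float:
    """Cost of processing `tokens` at a (hypothetical) per-million price."""
    return tokens / 1_000_000 * price_per_million_tokens


print(words_that_fit(128_000))           # 96000
print(estimate_tokens("one two three"))  # 4
print(estimate_cost(500_000, 2.0))       # 1.0
```

For real billing or limit checks, use the token count reported by the provider's own tokenizer rather than an estimate like this.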
The context window
The context window is the maximum number of tokens the model can process in a single interaction. Everything the model reads or writes occupies space in that window: the system instructions, the conversation history, any documents you have provided, your question, and the model's own responses.
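One way to picture this is as a budget in which every component draws from the same pool. The sketch below assumes a hypothetical `ContextBudget` class and invented token counts. It also assumes the model's answer counts against the same window, which is common but worth checking for the specific model you use.

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    """Tracks how many tokens each part of a request consumes."""
    window: int                    # total tokens the model accepts
    system: int = 0                # system instructions
    history: int = 0               # earlier turns of the conversation
    documents: int = 0             # pasted or attached documents
    question: int = 0              # the current user message
    reserved_for_answer: int = 0   # space kept free for the reply

    def used(self) -> int:
        return (self.system + self.history + self.documents
                + self.question + self.reserved_for_answer)

    def remaining(self) -> int:
        return self.window - self.used()

    def fits(self) -> bool:
        return self.remaining() >= 0


budget = ContextBudget(window=8192, system=300, history=4000,
                       documents=3000, question=200, reserved_for_answer=1000)
print(budget.used())       # 8500
print(budget.remaining())  # -308
print(budget.fits())       # False
```

In this invented example the request overshoots by 308 tokens, so something has to be shortened or dropped before the model can process it.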
Context windows have grown enormously in recent years:
- GPT-3 (2020): 2,048 tokens (~1,500 words)
- GPT-4 (2023): 8,192 tokens at launch, rising to 128,000 with GPT-4 Turbo
- Claude 3 (2024): up to 200,000 tokens (~150,000 words)
- Gemini 1.5 (2024): up to 1,000,000 tokens
A large window allows working with complete documents, long conversations and complex contexts. But it has a real computational cost: processing a long context is slower and more expensive than a short one.
Why AI forgets
Here comes the critical distinction: the model has no memory between separate conversations.
When you close a conversation and open a new one, the model remembers nothing from the previous one. There is no persistent record of who you are, what you have discussed before, or what preferences you have expressed. Every conversation starts from scratch.
Within the same conversation, the model can access everything that is in the context window. But if the conversation grows so long that the history exceeds the window, something has to go, and most chat applications drop or summarise the oldest parts. Once a message has been dropped, the model literally cannot see it and acts as if it did not exist.
This behaviour surprises many users who assume the model remembers the name they mentioned twenty messages ago, or an instruction given at the start of a very long conversation. Not necessarily — it may have already fallen outside the window.
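A minimal sketch of how an application might trim a conversation to fit the window follows. Both `count_tokens` and `trim_history` are hypothetical helpers. The token counter is a crude word-based estimate standing in for a real tokenizer, and dropping the oldest messages is only one of several strategies (summarising them is another).

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: word count divided by 0.75."""
    return round(len(text.split()) / 0.75)


def trim_history(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent messages whose combined size fits in max_tokens."""
    kept = []
    total = 0
    # Walk backwards from the newest message and stop at the first overflow.
    for message in reversed(messages):
        cost = count_tokens(message)
        if total + cost > max_tokens:
            break
        kept.append(message)
        total += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    "My name is Ana",               # estimated 5 tokens
    "I like short answers please",  # estimated 7 tokens
    "What is a token exactly",      # estimated 7 tokens
]
print(trim_history(history, max_tokens=14))
# ['I like short answers please', 'What is a token exactly']
```

With a budget of 14 tokens, the first message no longer fits, so the model never sees the user's name.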
Memory vs. context: the key distinction
CONTEXT                          MEMORY
─────────────────────────────────────────────────────────
Within a session                 Between sessions
Temporary                        Persistent
Automatic (it's in the           Requires external system
window or it is not)             (databases, RAG, etc.)
Limited by tokens                Potentially unlimited
Built in (billed per token)      Requires infrastructure
What appears as “memory” in tools like ChatGPT with memory enabled is not part of the model itself: it is an external system that saves summaries or fragments of past conversations and injects them into the context at the start of each new conversation. The model reads them as if they were part of the prompt, not as internal memories.
This distinction matters if you build systems on top of language models: “memory” is always external and always consumes tokens from the context.
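The mechanism can be sketched as a small store that lives outside the model and is pasted into the prompt when a new conversation starts. Everything here is hypothetical: the JSON file, the function names, and the prompt layout are illustrative choices, not how any particular product implements memory.

```python
import json
import tempfile
from pathlib import Path


def save_memory(path: Path, facts: list[str]) -> None:
    """Persist remembered facts outside the model."""
    path.write_text(json.dumps(facts), encoding="utf-8")


def load_memory(path: Path) -> list[str]:
    """Load remembered facts, or nothing if no memory exists yet."""
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))


def build_prompt(memory: list[str], user_message: str) -> str:
    """Inject stored facts into the prompt ahead of the user's message."""
    lines = []
    if memory:
        lines.append("Known facts about the user:")
        lines.extend(f"- {fact}" for fact in memory)
        lines.append("")
    lines.append(f"User: {user_message}")
    return "\n".join(lines)


with tempfile.TemporaryDirectory() as folder:
    memory_file = Path(folder) / "memory.json"
    # End of an earlier conversation: facts are written to disk.
    save_memory(memory_file, ["Name is Ana", "Prefers short answers"])
    # Start of a new conversation: facts are read back and injected.
    prompt = build_prompt(load_memory(memory_file), "What is a token?")

print(prompt)
```

The injected lines are ordinary prompt text, so they count against the context window like everything else.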
Practical implications
Place important information at the beginning and end. Models tend to pay more attention to tokens at the start and end of the context window than to those in the middle (a phenomenon known as “lost in the middle”). If you have critical instructions, do not bury them in the centre of a long document.
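As a sketch of this advice, a prompt can be assembled so that the critical instruction appears before the long material and is repeated after it. The function `assemble_prompt` and its labels are hypothetical, and whether repetition helps in a given case is something to test rather than assume.

```python
def assemble_prompt(critical: str, document: str, question: str) -> str:
    """Place the critical instruction at both edges of the prompt."""
    parts = [
        f"Instructions: {critical}",  # start of the context
        "Document:",
        document,                     # long material sits in the middle
        f"Question: {question}",
        f"Reminder: {critical}",      # repeated at the end
    ]
    return "\n\n".join(parts)


prompt = assemble_prompt(
    critical="Answer in one sentence.",
    document="(long document text goes here)",
    question="What is the refund policy?",
)
print(prompt)
```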
Do not assume the model “remembers.” If you have a long conversation and the model seems to have forgotten something you said earlier, it may have fallen outside the context window or be buried in the middle of a long history. Repeat it.
Long documents have a cost. Pasting a 200-page book into the context consumes valuable tokens, slows the response and can dilute the model’s focus. Extract only what is relevant when possible.
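A minimal sketch of extracting only what is relevant: split the document into passages, score each by word overlap with the question, and keep the best few. The helper names are hypothetical, and plain word overlap is a deliberately simple stand-in for the embedding-based retrieval that real systems typically use.

```python
import re


def word_set(text: str) -> set[str]:
    """Lowercase the text and return its distinct words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_passages(passages: list[str], question: str, limit: int = 2) -> list[str]:
    """Return up to `limit` passages sharing the most words with the question."""
    query = word_set(question)
    scored = [(len(word_set(p) & query), index, p)
              for index, p in enumerate(passages)]
    relevant = [item for item in scored if item[0] > 0]
    # Highest overlap first; ties keep the earlier passage.
    relevant.sort(key=lambda item: (-item[0], item[1]))
    # Restore document order among the selected passages.
    best = sorted(relevant[:limit], key=lambda item: item[1])
    return [passage for _, _, passage in best]


passages = [
    "Chapter 1 covers the history of the company.",
    "Refunds are issued within 14 days of a return request.",
    "Chapter 3 lists office locations.",
    "A return request must include the order number.",
]
selected = top_passages(passages, "How do I request a refund for my return?")
print(selected)
```

Here only the second and fourth passages share words with the question, so only they would be sent to the model.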
The context window is not the same as actual attention. Although technically the model can “see” everything in the window, in practice its ability to reason coherently over very long contexts degrades. 200,000 tokens available does not mean 200,000 tokens perfectly utilised.
Understanding tokens and context is not technical trivia. It is what allows you to design your interactions with AI to get the best results and anticipate where the system may fail.