How LLMs Work — A Deep Dive

Tokenization: How Text Becomes Numbers

Before a model reads a single word, it breaks everything into fragments called tokens. This shapes cost, context limits, and some of the model's strangest failure modes.

Tokens are not words — they are fragments

A language model doesn't read text the way you do. Before anything else happens, your input is broken into tokens — chunks that are somewhere between a character and a word. Common words like "the" or "cat" are usually one token. Longer or rarer words get split: "tokenization" might become "token", "iz", "ation". A single emoji can be multiple tokens.

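The algorithm most commonly used to decide how to split text is Byte-Pair Encoding (BPE). It starts from individual characters or bytes and repeatedly merges the most frequent adjacent pair in the training data into a new vocabulary entry. The learned merges are then applied to new text, so frequent sequences end up as single tokens and rare ones are split into several pieces. The result is a vocabulary that typically ranges from roughly 50,000 to 200,000 possible tokens, depending on the model.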
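The merge idea behind BPE can be sketched in a few lines of Python. This is a toy illustration, not any production tokenizer: it works on characters rather than bytes, skips the pre-splitting step real tokenizers use, and the function names `train_bpe`, `merge_pair`, and `encode` are made up for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each distinct word starts as a tuple of single characters, with its frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge to the whole corpus before counting again.
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[merge_pair(symbols, best)] += freq
        corpus = merged
    return merges


def encode(word, merges):
    """Split `word` into tokens by applying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = train_bpe(["abab", "abab", "ab"], num_merges=2)
print(merges)                  # [('a', 'b'), ('ab', 'ab')]
print(encode("abab", merges))  # ['abab']
print(encode("ba", merges))    # ['b', 'a']
```

The sequence "abab" was frequent in the toy training data, so it collapses into one token, while the unseen "ba" stays as two single-character tokens. The same effect, at scale, is why common words cost one token and rare words cost several.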

Tokenization explains cost, limits, and some failure modes

Understanding tokens is practically useful for three reasons.

Cost: most hosted model APIs charge per token. As a rule of thumb, 750 words of ordinary English text is roughly 1,000 tokens. If you are processing thousands of documents, the token count directly drives your bill.
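As a rough sketch of the arithmetic, the rule of thumb above can be turned into a back-of-the-envelope cost estimate. The price used in the example is a made-up placeholder, not any provider's actual rate, and the function names are hypothetical.

```python
def estimate_tokens(word_count):
    """Estimate tokens for English text using the 750 words ~ 1,000 tokens rule of thumb."""
    return round(word_count * 1000 / 750)


def estimate_cost(words_per_document, num_documents, price_per_million_tokens):
    """Estimate the total cost of sending `num_documents` documents as input."""
    total_tokens = estimate_tokens(words_per_document) * num_documents
    return total_tokens * price_per_million_tokens / 1_000_000


# 5,000 documents of 750 words each, at a hypothetical 2.00 per million tokens.
print(estimate_tokens(750))           # 1000
print(estimate_cost(750, 5000, 2.0))  # 10.0
```

The estimate only covers input tokens and only holds for typical English prose; code, numbers, and other languages can tokenize much less efficiently.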

Context limits: every model has a maximum context window measured in tokens — not words, not pages. A 128,000-token context window sounds huge until you realise a dense technical document can consume it quickly.
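To get a feel for the numbers, the sketch below converts a token budget into an approximate word budget. It assumes the rough ratio of 0.75 English words per token, and it assumes the window is shared between your input and the model's reply, so some tokens are set aside for the output. Both the function name and the 4,000-token reserve are illustrative choices.

```python
def max_input_words(context_window_tokens, reserved_output_tokens=0, words_per_token=0.75):
    """Approximate how many English words fit in the input part of a context window."""
    available = context_window_tokens - reserved_output_tokens
    if available < 0:
        raise ValueError("reserved output exceeds the context window")
    return int(available * words_per_token)


print(max_input_words(128_000))                               # 96000
print(max_input_words(128_000, reserved_output_tokens=4_000)) # 93000
```

For dense technical material the effective ratio is usually lower than 0.75 words per token, so treat these figures as an upper bound.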

Failure modes: because the model processes tokens, not characters, it never directly "sees" the individual letters inside a word. Ask a model to count the letter 'r' in "strawberry" and it may say 2 when the correct answer is 3, because "strawberry" reaches the model as a few multi-character fragments rather than as ten separate letters. This is largely a consequence of tokenization, not a failure of reasoning.
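The contrast is easy to see in code. A program that works on characters counts the letters trivially, while a model receives only a short list of token IDs. The split and the IDs below are invented for illustration; a real tokenizer may divide the word differently.

```python
def count_letter(word, letter):
    """Count occurrences of `letter` by looking at each character directly."""
    return sum(1 for ch in word if ch == letter)


word = "strawberry"
print(count_letter(word, "r"))  # 3

# Hypothetical token split and made-up IDs, for illustration only.
pieces = ["str", "aw", "berry"]
token_ids = [101, 202, 303]

# The pieces still spell the word, but the model is given only the IDs,
# which carry no direct information about the letters inside each piece.
print("".join(pieces) == word)  # True
print(token_ids)                # [101, 202, 303]
```

To answer the letter-counting question correctly, the model has to have learned which letters each token contains, which is a much less direct route than reading the characters.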

Try it — instruct the agent

You ask an AI assistant: 'How many times does the letter r appear in the word strawberry?' It answers: '2'. What is the most accurate explanation for this error?

Agent behavior: Capable and fluent — but working entirely with tokens, never with individual characters.

Check yourself

A model's context window is measured in tokens. Roughly how many tokens is 750 words of standard English text?

Why does the same content in Arabic or Chinese typically cost more tokens than in English?

Which of these tasks is most likely to trip up an LLM due to tokenization?

Your turn

Use the Tiktokenizer tool to watch how different text gets broken into tokens in real time — then try to find inputs that reveal tokenization's quirks.

Try in Tiktokenizer (tiktokenizer.vercel.app)

Open Tiktokenizer and tokenize a few common English sentences — note how efficient it is.
Paste the same content in another language and compare the token count.
Try a number like 1,234,567 and observe how it is tokenized.
Try the word 'strawberry' and confirm how many tokens it produces.
Find one input where you would have predicted fewer tokens than you actually got.
