How LLMs Work — A Deep Dive

The Architecture: What a Transformer Does

The neural network design that made modern AI possible — explained through what it computes, not how it's built. Attention, layers, and why scale changes everything.

Attention: every token looks at every other token

The core innovation of the transformer is attention: a mechanism that lets each token in the context consider the other tokens and decide which ones are relevant to its current meaning.

Consider the sentence: "The contract was signed by Alice. It became binding immediately." To understand what "It" refers to, you need to connect it back to "contract." Attention does this by computing a relevance score between pairs of tokens in the context. "It" is scored against "contract", "signed", "Alice", and every other token, and a trained model weights "contract" most heavily. This happens for every token in parallel, across the whole context window. (In GPT-style models that generate text left to right, each token attends only to itself and the tokens before it.)
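The relevance scoring described above can be sketched in a few lines of plain Python. This is a minimal illustration of scaled dot-product attention for a single query, not how a production model is implemented. The two-dimensional vectors are made up by hand so that "It" resembles "contract"; in a real model the query, key, and value vectors come from learned projections, and the names `attend`, `it_query`, `keys`, and `values` are assumptions for this example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query token."""
    scale = math.sqrt(len(query))
    # One relevance score per token in the context.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # The output is a weighted mix of the value vectors.
    output = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, output

# Hypothetical hand-picked vectors, standing in for learned ones.
tokens = ["contract", "signed", "Alice", "It"]
keys = [[4.0, 0.0], [0.0, 1.0], [0.0, 2.0], [1.0, 1.0]]
values = keys  # assumption: reuse the keys as values to keep the sketch short
it_query = [2.0, 0.0]

it_weights, it_output = attend(it_query, keys, values)
for token, weight in zip(tokens, it_weights):
    print(f"{token:>8}: {weight:.3f}")
```

With these made-up vectors, "contract" receives the largest weight, so the output for "It" is dominated by the "contract" value vector. A full model repeats this for every token and for many attention heads at once.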

Layers build up understanding; scale changes what emerges

A transformer is not just one attention operation; it is many, stacked in sequence. Each layer takes the output of the previous layer and refines it. Early layers tend to pick up surface-level patterns (which words appear together). Deeper layers tend to build more abstract representations (meaning, reasoning, world knowledge). Within each layer, the output of attention is then passed through a feed-forward network that processes each token's representation further.
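To make the stacking concrete, here is a rough sketch of a few layers applied in sequence, each one running attention and then a feed-forward network. It rests on several simplifying assumptions: the weights are random rather than trained, attention uses the token vectors directly instead of learned query, key, and value projections, and the layer normalisation that real transformers use is left out. The residual additions (adding each sub-step's output back onto its input) are standard in transformers but are not described in the text above. All names and sizes (`DIM`, `HIDDEN`, `LAYERS`) are hypothetical.

```python
import math
import random

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Every row of x attends to all rows (no learned projections here)."""
    scale = math.sqrt(len(x[0]))
    mixed = []
    for query in x:
        weights = softmax(
            [sum(q * k for q, k in zip(query, key)) / scale for key in x]
        )
        mixed.append(
            [sum(w * row[i] for w, row in zip(weights, x)) for i in range(len(query))]
        )
    return mixed

def feed_forward(vec, w_in, w_out):
    """Two-layer network with a ReLU, applied to one token at a time."""
    hidden = [max(0.0, sum(v * w for v, w in zip(vec, row))) for row in w_in]
    return [sum(h * w for h, w in zip(hidden, row)) for row in w_out]

def layer(x, w_in, w_out):
    """One layer: attention, then feed-forward, each added back to its input."""
    attended = self_attention(x)
    x = [[a + b for a, b in zip(row, att)] for row, att in zip(x, attended)]
    return [
        [a + b for a, b in zip(row, feed_forward(row, w_in, w_out))] for row in x
    ]

def random_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
DIM, HIDDEN, LAYERS = 4, 8, 3  # hypothetical toy sizes
weights_per_layer = [
    (random_matrix(HIDDEN, DIM), random_matrix(DIM, HIDDEN)) for _ in range(LAYERS)
]

states = random_matrix(5, DIM)  # 5 tokens, each a DIM-sized vector
for w_in, w_out in weights_per_layer:
    states = layer(states, w_in, w_out)  # each layer refines the previous output
```

The shape of `states` never changes: five tokens go in and five refined vectors of the same size come out of every layer, which is what allows layers to be stacked as deep as needed.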

What makes modern LLMs remarkable is that this simple design, run at enormous scale (billions of parameters trained on trillions of tokens), produces capabilities that nobody explicitly programmed. The ability to follow instructions, reason step by step, write code, and explain concepts was not taught task by task during pretraining. These abilities emerged largely from the training objective of predicting the next token.
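The next-token objective itself is small enough to write down. The sketch below computes the usual cross-entropy loss for a single position: the model assigns a score (a logit) to every token in its vocabulary, and the loss is low when the token that actually comes next received a high probability. The four-word vocabulary and the logit values are invented for illustration, and `next_token_loss` is a hypothetical helper name.

```python
import math

def next_token_loss(logits, target_index):
    """Cross-entropy at one position: -log(softmax(logits)[target_index])."""
    peak = max(logits)  # subtract the max for numerical stability
    log_total = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_total - logits[target_index]

# Made-up scores for the token that follows "It became".
vocab = ["contract", "Alice", "binding", "banana"]
logits = [0.5, 0.1, 3.0, -2.0]

loss_if_binding = next_token_loss(logits, vocab.index("binding"))
loss_if_banana = next_token_loss(logits, vocab.index("banana"))
print(f"loss if the next token is 'binding': {loss_if_binding:.3f}")
print(f"loss if the next token is 'banana':  {loss_if_banana:.3f}")
```

Training nudges the parameters to lower this loss, averaged over every position in the training text. Nothing in the objective mentions instructions, reasoning, or code.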

Try it — instruct the agent

Consider the sentence: 'The merger agreement was drafted by the legal team. It was executed on 15 March.' When the model generates a response about what 'It' refers to, how does the transformer figure this out?

Agent behavior: A transformer that processes all tokens in parallel, attending across the full context.

Check yourself

What does the attention mechanism allow each token to do?

What do transformer layers do as they go deeper?

What is an 'emergent capability' in the context of large language models?

Your turn

Explore the transformer architecture visually to build an intuition for how tokens flow through layers and how attention connects them.

Try in Transformer Neural Net 3D Visualiser (bbycroft.net/llm)

Open the 3D visualiser and step through a token being processed.
Watch how attention weights change depending on the input tokens.
Identify the feed-forward layer and note where it sits relative to attention.
Increase the input length and observe how the attention matrix grows.
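For the last step, it may help to know what growth to expect before opening the visualiser. Attention computes one score for each pair of query token and key token, so the matrix for a single head has one row and one column per token. The helper below is a hypothetical back-of-the-envelope count, not something taken from the visualiser.

```python
def attention_matrix_entries(num_tokens):
    """Number of scores in one head's attention matrix (rows x columns)."""
    return num_tokens * num_tokens

for n in (8, 16, 32):
    print(f"{n} tokens -> {attention_matrix_entries(n)} attention scores")
```

Doubling the input length quadruples the number of scores, which is why long contexts are expensive.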
