How LLMs Work — A Deep Dive
The Architecture: What a Transformer Does
The neural network design that made modern AI possible — explained through what it computes, not how it's built. Attention, layers, and why scale changes everything.
Attention: every token looks at every other token
The core innovation of the transformer is attention: a mechanism that lets each token in the context consider the other tokens and decide which ones are relevant to its current meaning.
Consider the sentence: "The contract was signed by Alice. It became binding immediately." To understand what "it" refers to, you need to connect it back to "contract." Attention does this by computing a relevance score between each pair of tokens in the context. "It" looks at "contract", "signed", "Alice", and every other token, and a trained model weights "contract" most heavily. The scoring itself is learned during training; at inference time it is simply applied. This happens for every token, simultaneously, across the whole context window.
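The relevance scoring described above can be sketched in a few lines of plain Python. This is a minimal illustration of scaled dot-product attention for a single token, not how any particular model is implemented: the two-dimensional vectors are made up by hand for the example (a real model learns its query, key, and value vectors, and they have hundreds or thousands of dimensions), and the function names are hypothetical.

```python
import math

def softmax(scores):
    # Turn raw scores into positive weights that sum to 1.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    # Relevance score = dot product of the query with each key,
    # scaled by the square root of the vector dimension.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    return softmax(scores)

def attend(query, keys, values):
    # The output is a weighted average of the value vectors.
    weights = attention_weights(query, keys)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Hand-picked toy vectors (assumption): "contract" is placed close to the query for "it".
tokens = ["contract", "signed", "Alice", "it"]
keys = [[2.0, 0.0], [0.0, 1.0], [0.0, 2.0], [1.0, 0.0]]
values = keys  # reuse the keys as values to keep the example small
query_it = [2.0, 0.0]

weights = attention_weights(query_it, keys)
most_relevant = tokens[weights.index(max(weights))]
blended = attend(query_it, keys, values)

for token, weight in zip(tokens, weights):
    print(f"{token:>8}: {weight:.2f}")
```

With these toy vectors, "contract" receives the largest weight, so the new representation of "it" is pulled mostly toward "contract". In a trained model the same arithmetic runs for every token at once, using learned vectors.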
Layers build up understanding; scale changes what emerges
A transformer is not just one attention operation: it is many, stacked in sequence. Each layer takes the output of the previous layer and refines it. Early layers tend to pick up surface-level patterns (which words tend to appear together). Deeper layers tend to build more abstract representations (meaning, reasoning, world knowledge). Within each layer, the output of the attention step is then passed through a feed-forward network that processes each token's representation further.
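The stacking described above can be sketched as a loop: each layer runs attention, then a feed-forward network, and hands its output to the next layer. This is a deliberately simplified sketch under several assumptions. It uses the token vectors directly as queries, keys, and values instead of learned projections, it adds each step's output back onto its input (a residual connection, which real transformers use but which is not covered in the text above), and it omits multiple attention heads and normalisation. The weights are arbitrary placeholder numbers, and all names are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(xs):
    # Every token scores every token, then takes a weighted average.
    d = len(xs[0])
    out = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        w = softmax(scores)
        out.append([sum(wj * v[i] for wj, v in zip(w, xs)) for i in range(d)])
    return out

def feed_forward(x, w1, w2):
    # Two small matrix multiplications with a ReLU in between,
    # applied to one token's vector at a time.
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, row))) for row in w1]
    return [sum(h * w for h, w in zip(hidden, row)) for row in w2]

def layer(xs, w1, w2):
    attended = self_attention(xs)
    xs = [[a + b for a, b in zip(x, att)] for x, att in zip(xs, attended)]
    return [[a + b for a, b in zip(x, feed_forward(x, w1, w2))] for x in xs]

def transformer(xs, layers):
    # The output of each layer becomes the input of the next.
    for w1, w2 in layers:
        xs = layer(xs, w1, w2)
    return xs

# Placeholder weights (assumption): 2-dimensional tokens, 3 hidden units.
w1 = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
w2 = [[0.2, 0.1, -0.3], [-0.1, 0.4, 0.2]]
token_vectors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

output = transformer(token_vectors, [(w1, w2)] * 3)  # three stacked layers
```

Note that the output has the same shape as the input: one vector per token. That is what makes stacking possible, since every layer can consume exactly what the previous one produced.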
What makes modern LLMs remarkable is that this simple design, run at enormous scale (billions of parameters, trillions of tokens), produces capabilities that it was never explicitly trained for. The ability to follow instructions, reason step-by-step, write code, explain concepts: none of these were directly taught. They emerged from the training objective of predicting the next token.
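To make "predicting the next token" concrete, here is a minimal sketch of the quantity such training typically minimises: the cross-entropy between the model's predicted distribution and the token that actually came next. The four-word vocabulary and the scores are invented for illustration, and the function name is hypothetical.

```python
import math

def next_token_loss(logits, target_index):
    # Cross-entropy: negative log of the probability the model
    # assigned to the token that actually came next.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[target_index]

# Invented example: the text so far is "It became binding", the true next token is "immediately".
vocab = ["immediately", "Alice", "contract", "signed"]
target = vocab.index("immediately")

unsure_logits = [0.0, 0.0, 0.0, 0.0]      # model has no preference
confident_logits = [4.0, 0.0, 0.0, 0.0]   # model strongly favours the right token

loss_unsure = next_token_loss(unsure_logits, target)
loss_confident = next_token_loss(confident_logits, target)
```

The loss is lower when the model puts more probability on the correct next token. Training nudges the parameters to reduce this number across the whole training corpus; nothing in the objective names instruction following or reasoning.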
Try it — instruct the agent
Consider the sentence: 'The merger agreement was drafted by the legal team. It was executed on 15 March.' When the model generates a response about what 'It' refers to, how does the transformer figure this out?
Agent behavior: A transformer that processes all tokens in parallel, attending across the full context.
Check yourself
What does the attention mechanism allow each token to do?
What do transformer layers do as they go deeper?
What is an 'emergent capability' in the context of large language models?
Your turn
Explore the transformer architecture visually to build an intuition for how tokens flow through layers and how attention connects them.