How LLMs Work — A Deep Dive

Pretraining: Teaching a Model to Predict

How a blank neural network becomes a world model by predicting the next token, billions of times. What a base model is — and crucially, what it isn't.

Next-token prediction is a deceptively simple objective

The model starts as a randomly initialised neural network: billions of numbers set to random values, so its early predictions are no better than chance. Training has one objective: predict the next token. The network sees a sequence of tokens, predicts what comes next, is shown the correct answer, and adjusts its weights to do slightly better next time. This repeats billions of times across trillions of tokens from the training corpus.
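
As a minimal sketch of what "predict the next token" looks like as data, the snippet below turns one token sequence into (context, target) training pairs and scores a single prediction with cross-entropy, a standard loss for this objective. The whitespace tokenizer, the example sentence, and the probabilities are illustrative assumptions, not output from any real model.

```python
import math

def next_token_pairs(tokens):
    """Pair every prefix of the sequence with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def cross_entropy(predicted_probs, correct_token):
    """Loss for one prediction: negative log of the probability given to the right answer."""
    return -math.log(predicted_probs[correct_token])

# Toy whitespace "tokenizer" (an assumption; real models use subword tokens).
tokens = "the cat sat on the mat".split()
pairs = next_token_pairs(tokens)

for context, target in pairs:
    print(context, "->", target)

# Hypothetical model output for the context ["the", "cat"].
probs = {"sat": 0.6, "ran": 0.3, "mat": 0.1}
confident_loss = cross_entropy(probs, "sat")  # low loss: the right answer was likely
surprised_loss = cross_entropy(probs, "mat")  # high loss: the right answer was unlikely
```

One sentence yields several training examples, one per position, which is part of why a large corpus provides so many prediction problems. The loss is small when the model put high probability on the token that actually came next, and large when it did not.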

What makes this remarkable is the side effect. To predict the next word in a medical textbook, you need to understand medicine. To predict the next line in a Python file, you need to understand programming. To predict the next sentence in a philosophical essay, you need to understand philosophy. The model can't predict well without absorbing a model of the world — and that world model is what emerges from training.

A base model predicts text — it does not assist you

After pretraining, what you have is a base model. It is extraordinarily capable at predicting what text comes next — but that is all it does. If you prompt a base model with a question, it is likely to generate more questions, or the kind of text that typically follows questions in its training data (which might be a quiz, a forum thread, or a textbook problem set). It is not trying to help you; it is completing the document.
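
To make "completing the document" concrete, here is a deliberately tiny stand-in for a base model: it only counts which word followed which in a made-up quiz sheet, then greedily continues a prompt. The corpus, the function names, and the bigram counting are all assumptions chosen for illustration; a real base model is vastly more capable and may well produce an answer, but the mechanism of continuing the most likely document is the same.

```python
from collections import Counter, defaultdict

def fit_bigram_counts(tokens):
    """Count which token follows which in the training text."""
    followers = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        followers[prev][nxt] += 1
    return followers

def complete(followers, prompt_tokens, max_new_tokens=5):
    """Greedily append the most frequent follower of the last token."""
    output = list(prompt_tokens)
    for _ in range(max_new_tokens):
        options = followers.get(output[-1])
        if not options:
            break  # the last token was never followed by anything in training
        output.append(options.most_common(1)[0][0])
    return output[len(prompt_tokens):]

# Hypothetical training text: a quiz sheet where questions follow questions.
corpus = (
    "What is the capital of France ? "
    "What is the capital of Spain ? "
    "What is the capital of Italy ? "
).split()

followers = fit_bigram_counts(corpus)
prompt = "What is the capital of France ?".split()
completion = complete(followers, prompt)
print(" ".join(completion))  # prints: What is the capital of
```

In this toy corpus a question mark was only ever followed by the start of another question, so the completion begins a new question rather than answering the one in the prompt.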

This distinction matters for understanding agents. The helpful, instruction-following AI assistant you interact with is not the base model — it is the base model after a second training stage that teaches it to behave as an assistant. That second stage is the subject of the next lesson.

From the field

Andrej Karpathy

Former Director of AI at Tesla; OpenAI founding member; creator of llm.c, a from-scratch reproduction of GPT-2 pretraining

The base model is a token simulator. It's trying to complete a document — it's not trying to help you. You ask it a question and it might give you more questions, because on the internet, questions are often followed by more questions. The assistant behaviour comes later.

1. Initialise a transformer with random weights. It knows nothing and predicts randomly.
2. Present sequences from the training corpus and ask the model to predict the next token.
3. Compute the loss (how wrong the prediction was) and update the weights via backpropagation.
4. Repeat across billions of sequences; the model progressively improves at prediction.
5. After training, the result is a base model: a powerful next-token predictor that has absorbed world knowledge as a side effect, but has no instruction-following behaviour.
6. To build an assistant, a second training stage is required; the base model is only the foundation.
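
The first four steps can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not how a production model is trained: a single table of scores stands in for the transformer, the corpus is one sentence, and the gradient is written out by hand instead of being computed by automatic backpropagation. The names `train`, `weights`, and `history` are hypothetical.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(token_ids, vocab_size, epochs=200, lr=0.5, seed=0):
    rng = random.Random(seed)
    # Step 1: random initialisation. weights[prev][nxt] scores nxt following prev.
    weights = [[rng.gauss(0.0, 0.1) for _ in range(vocab_size)]
               for _ in range(vocab_size)]
    history = []
    for _ in range(epochs):
        total_loss = 0.0
        # Step 2: present each (current token, next token) pair from the corpus.
        for prev, target in zip(token_ids, token_ids[1:]):
            probs = softmax(weights[prev])
            # Step 3a: the loss is the negative log-probability of the true next token.
            total_loss += -math.log(probs[target])
            # Step 3b: the gradient of that loss for each score is (prob - one_hot).
            for j in range(vocab_size):
                grad = probs[j] - (1.0 if j == target else 0.0)
                weights[prev][j] -= lr * grad
        # Step 4: repeat, recording the average loss per prediction.
        history.append(total_loss / (len(token_ids) - 1))
    return weights, history

corpus = "the cat sat on the mat".split()
vocab = sorted(set(corpus))
ids = [vocab.index(word) for word in corpus]

weights, history = train(ids, len(vocab))
print(f"average loss: {history[0]:.3f} -> {history[-1]:.3f}")
```

The average loss falls as the loop repeats, but it cannot reach zero here: "the" is followed by both "cat" and "mat" in the corpus, so the best the model can do is split its probability between them.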

Talk — source

Try it — instruct the agent

You have access to a raw base model — no fine-tuning, no instruction-following. You send it the prompt: 'What is the capital of France?' What is the most likely response?

Agent behavior: A pure next-token predictor — it completes text, it does not answer questions.

Check yourself

What is the training objective during pretraining?

If you prompt a raw base model with a question, what is the most likely behaviour?

What do scaling laws describe in the context of pretraining?

Your turn

Experience a base model directly — before any instruction fine-tuning — and observe how it differs from the assistant models you are used to.

Try it with Hyperbolic's base model inference (app.hyperbolic.xyz), or take the technical path with llm.c on GitHub.

Access a base model (Llama 3 Base or similar) via Hyperbolic or another provider.
Ask it a direct question — e.g. 'What is the capital of France?' — and observe what it actually generates.
Try a prompt that implies a task (e.g. 'Summarise the following:') and see whether it follows the instruction.
Compare the same prompt against an instruction-tuned version of the same model.
Note at least two concrete differences between the base and instruction-tuned responses.
