How LLMs Work — A Deep Dive

The Raw Material: Pretraining Data

What LLMs are actually trained on, how quality filtering works, and why the training data shapes everything the model knows — and doesn't know.


LLMs are trained on a compressed snapshot of human writing

Before a model learns anything, someone has to collect the training data. The primary source is the web — billions of pages crawled and archived — supplemented by books, code repositories, scientific papers, and Wikipedia. But the raw internet is noisy: spam, duplicates, low-quality text, and harmful content. So the data goes through quality filtering pipelines that discard the worst material and keep the best.
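To make "quality filtering" concrete, here is a minimal sketch of the kind of cheap heuristic a pipeline might run before anything more sophisticated. The function names and every threshold are illustrative assumptions, not values taken from any real pipeline.

```python
def looks_low_quality(text: str) -> bool:
    """Toy heuristics. All thresholds are illustrative assumptions."""
    words = text.split()
    if len(words) < 5:
        # Too short to carry much information (also catches empty pages).
        return True
    unique_ratio = len({w.lower() for w in words}) / len(words)
    if unique_ratio < 0.5:
        # Highly repetitive text, typical of keyword spam.
        return True
    alpha_ratio = sum(ch.isalpha() for ch in text) / len(text)
    if alpha_ratio < 0.6:
        # Mostly symbols, digits, or leftover markup.
        return True
    return False


def filter_corpus(pages: list[str]) -> list[str]:
    """Keep only the pages that pass every heuristic."""
    return [p for p in pages if not looks_low_quality(p)]


pages = [
    "Photosynthesis converts light energy into chemical energy stored in glucose.",
    "buy now buy now buy now buy now buy now buy now",
    "$$$ 100% !!! >>> 777 <<< ### 000",
    "Click here",
]
kept = filter_corpus(pages)
print(kept)  # only the first page survives
```

Rules like these are cheap enough to run over billions of pages, which is why they tend to sit at the front of a pipeline, ahead of slower model-based filters.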

What remains is still enormous: trillions of tokens (words and word fragments) covering a large share of publicly available text. The model is trained on this filtered snapshot, and its weights end up as a lossy compression of it. That's it. The model itself has no live connection to the internet, no lookup table of facts, no separate memory. Everything the model "knows" came from this data, collected up to a specific date: the training cutoff.

The model's 'knowledge' is pattern-matching over what it read

It's tempting to think of an LLM as a database you can query. It isn't. The model has no fact store, no internal encyclopedia, no list of verified truths. During training, it learned statistical patterns: which tokens (words and word fragments) tend to follow which others, across an enormous range of contexts. When you ask it a question, it generates a statistically likely continuation of your prompt, one token at a time.
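The idea of "which words follow which" can be shown with a toy bigram model. This is a deliberately tiny sketch: real LLMs use neural networks conditioned on long contexts rather than raw counts over the previous word, and the corpus and function names here are made up for illustration.

```python
from collections import Counter, defaultdict

# A hypothetical miniature "training set".
corpus = (
    "the cat sat on the mat . "
    "the cat sat on the sofa . "
    "the cat chased the mouse ."
).split()

# Count how often each word follows each other word.
follow_counts: dict[str, Counter] = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follow_counts[prev][nxt] += 1


def next_word_probs(prev: str) -> dict[str, float]:
    """Relative frequency of each word seen after `prev` in the corpus."""
    counts = follow_counts.get(prev, Counter())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def continue_greedy(start: str, steps: int) -> list[str]:
    """Extend `start` by repeatedly picking the most frequent next word."""
    out = [start]
    for _ in range(steps):
        counts = follow_counts.get(out[-1])
        if not counts:
            break  # this word was never followed by anything in training
        out.append(counts.most_common(1)[0][0])
    return out


print(next_word_probs("the"))
print(continue_greedy("the", 3))
```

Nothing in this model stores a fact about cats or mats. It only stores counts, and the "answer" is whatever those counts make most likely.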

Usually, that continuation is correct. But when the training data contained wrong information, or when correct information was rare, the model can generate confident-sounding text that is simply wrong. It cannot tell the difference from the inside. This is what people call hallucination, and it is a direct consequence of how training data works rather than a bug that can simply be patched out.
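A small sketch shows why confidence tracks frequency rather than truth. The substance "examplium" and both figures are invented for this example; assume the 450-degree claim is the correct one and the 300-degree claim is a widely copied error.

```python
from collections import Counter

# Hypothetical training snippets: the wrong claim appears three times,
# the (assumed) correct claim only once.
training_snippets = [
    "examplium melts at 300 degrees",
    "examplium melts at 300 degrees",
    "examplium melts at 300 degrees",
    "examplium melts at 450 degrees",
]

PROMPT = "examplium melts at"

# Tally every continuation of the prompt seen in the training data.
completions = Counter(
    snippet[len(PROMPT):].strip()
    for snippet in training_snippets
    if snippet.startswith(PROMPT)
)

total = sum(completions.values())
answer, count = completions.most_common(1)[0]
confidence = count / total

print(f"{PROMPT} {answer} (share of training data: {confidence:.0%})")
```

The counting model returns the repeated error with a 75% share, and nothing in the counts marks it as wrong. A real model is far more complex, but it inherits the same blind spot from its data.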

From the field

Andrej Karpathy

Former Director of AI at Tesla; OpenAI founding member; creator of the Neural Networks: Zero to Hero lecture series


The model has read a large chunk of the internet. It doesn't remember facts the way you do — it has learned statistical patterns across trillions of words. The training data is everything: what it knows, what it doesn't, and where it will confidently confabulate.

1. Start with the raw internet: Common Crawl gives a snapshot of billions of web pages per month.
2. Apply quality filtering: train a classifier to prefer text resembling high-quality sources like Wikipedia and textbooks.
3. Deduplicate aggressively to prevent the model from memorising repeated content.
4. Mix domains deliberately: adjust the ratio of web, code, books, and scientific text to shape the model's downstream capabilities.
5. Accept that the resulting model reflects only what was in this data: knowledge gaps and hallucinations trace directly back to data gaps.
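The deduplication and domain-mixing steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions: it removes only exact duplicates after whitespace and case normalisation, whereas production pipelines commonly add near-duplicate detection such as MinHash, and the mixture ratios below are hypothetical.

```python
import hashlib
import random


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())


def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first copy of each document, comparing normalised hashes."""
    seen: set[str] = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def sample_mixture(
    pools: dict[str, list[str]],
    weights: dict[str, float],
    n: int,
    seed: int = 0,
) -> list[tuple[str, str]]:
    """Draw n (domain, document) pairs, choosing domains by weight."""
    rng = random.Random(seed)
    domains = list(pools)
    picks = rng.choices(domains, weights=[weights[d] for d in domains], k=n)
    return [(d, rng.choice(pools[d])) for d in picks]


docs = ["Hello  World", "hello world", "Other page"]
unique_docs = deduplicate(docs)

pools = {"web": ["web doc"], "code": ["def f(): pass"], "books": ["chapter one"]}
weights = {"web": 0.6, "code": 0.3, "books": 0.1}  # hypothetical ratios
batch = sample_mixture(pools, weights, n=10)
print(unique_docs, batch[:3])
```

Changing `weights` changes what the model sees most often during training, which is the lever the mixing step relies on.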

Talk — source

Try it — instruct the agent


Your AI assistant confidently tells you that a listed company's CEO salary is €4.2M. You're about to include this figure in a board report. What do you do?

Agent behavior: Confident and fluent — it states figures without flagging uncertainty, because it has no way to know whether its training data was current or accurate.

Check yourself

What does a model's 'training cutoff' mean?

Why does a model state incorrect facts with confidence?

What is the primary source of most LLM training data?

Your turn

Browse a sample of real LLM training data to understand what models actually read — and what that means for their knowledge gaps.

Try in FineWeb on HuggingFace (huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1)

1. Open the FineWeb demo and browse at least 10 random training examples.
2. Note the range of quality: what's surprisingly good, what's surprisingly bad.
3. Find at least one example that could contain an outdated or incorrect fact.
4. Reflect on what topics are likely over- or under-represented in web-scraped data.

