How LLMs Work — A Deep Dive
From Base Model to Assistant: SFT and RLHF
The base model predicts text. A second training stage — Supervised Fine-Tuning and Reinforcement Learning from Human Feedback — teaches it to be helpful, harmless, and honest. This is where the AI assistant you interact with actually comes from.
Why the base model needs a second training stage
After pretraining, the model is a powerful text predictor, but it doesn't know how to be an assistant. Ask it a question and it may generate more questions. Ask it to help with something sensitive and it will simply continue the text pattern, with no notion of harm. Give it an instruction and it may or may not follow it, depending on what text tended to follow similar prompts in its training data.
Alignment is the process of taking this raw predictor and shaping its behaviour: making it follow instructions, be helpful to users, avoid harmful outputs, and respond consistently. This is not tweaking the base model slightly — it is a distinct training phase that fundamentally changes how the model behaves.
How SFT and RLHF work in practice
Alignment happens in two stages:
Supervised Fine-Tuning (SFT): human annotators write examples of ideal assistant behaviour — questions paired with good answers, instructions paired with good responses. The model is trained on these demonstrations, learning the format and style of being an assistant.
Reinforcement Learning from Human Feedback (RLHF): annotators compare pairs of model outputs and indicate which is better. A separate reward model is trained to predict these preferences. Then the assistant model is updated using reinforcement learning to generate responses that score higher according to the reward model, while a KL-divergence penalty keeps it from drifting too far from the behaviour it learned during SFT.
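The two training signals described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from any particular paper or library. The function names `sft_loss` and `preference_loss` are hypothetical, excluding prompt tokens from the SFT loss is an assumed (though common) convention, and the pairwise logistic form of the preference loss is one widely used choice that the text above does not specify.

```python
import math


def sft_loss(token_logprobs, is_response_token):
    """Mean negative log-likelihood over the demonstration's response tokens.

    token_logprobs: log-probability the model assigned to each target token.
    is_response_token: True where the token belongs to the annotator-written
    answer. Prompt tokens are skipped (assumed convention, see lead-in).
    """
    picked = [-lp for lp, keep in zip(token_logprobs, is_response_token) if keep]
    if not picked:
        raise ValueError("no response tokens to train on")
    return sum(picked) / len(picked)


def preference_loss(reward_preferred, reward_rejected):
    """Pairwise loss -log(sigmoid(r_preferred - r_rejected)) for a reward model.

    The loss is small when the reward model scores the output the annotator
    preferred above the one they rejected, and large when it gets the order
    wrong.
    """
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In this sketch, SFT pushes up the probability of the demonstrated answer token by token, while the reward model only ever sees comparisons: it learns a scalar score whose differences match which output annotators preferred.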
From the field
Long Ouyang et al. (OpenAI)
Research team at OpenAI; authors of the InstructGPT paper, "Training language models to follow instructions with human feedback" (2022), which helped establish RLHF as a standard approach to aligning large language models
Our results show that fine-tuning with human feedback significantly improves outputs on a wide range of tasks, and that labelers strongly prefer outputs from the 1.3B-parameter InstructGPT model over those of the 175B-parameter GPT-3, despite the former having roughly 100x fewer parameters.
1. Collect demonstration data: hire labellers to write ideal assistant responses to a diverse set of prompts.
2. Fine-tune the pretrained GPT-3 base model on these demonstrations (Supervised Fine-Tuning).
3. Collect preference data: have labellers rank multiple model outputs for the same prompt from best to worst.
4. Train a reward model (RM) to predict which outputs labellers would prefer.
5. Use PPO reinforcement learning to update the language model to maximise the reward model's score.
6. Apply a KL-divergence penalty to prevent the model from drifting too far from its SFT behaviour.
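Steps 5 and 6 are commonly folded into a single shaped reward: the reward model's score minus a penalty proportional to how far the current policy's log-probabilities have moved from the SFT model's. The sketch below assumes that form. The function name `kl_penalised_reward` is hypothetical, `beta=0.1` is an illustrative value rather than one taken from the source, and the summed log-probability ratio over the sampled tokens is a per-sample estimate of the KL term, not an exact divergence.

```python
def kl_penalised_reward(rm_score, policy_logprobs, sft_logprobs, beta=0.1):
    """Reward the RL step would maximise for one sampled response.

    rm_score: scalar score from the reward model for the whole response.
    policy_logprobs: log-probs of the sampled tokens under the current policy.
    sft_logprobs: log-probs of the same tokens under the frozen SFT model.
    beta: penalty strength (illustrative default, an assumption).
    """
    if len(policy_logprobs) != len(sft_logprobs):
        raise ValueError("both models must score the same tokens")
    # Sum of log(pi_policy / pi_sft) over the sampled tokens.
    log_ratio = sum(p - s for p, s in zip(policy_logprobs, sft_logprobs))
    return rm_score - beta * log_ratio
```

When the policy still matches the SFT model the penalty is zero and the reward is just the reward model's score. As the policy assigns much higher probability to its own samples than the SFT model would, the penalty grows, which discourages the policy from exploiting quirks of the reward model.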
Try it — instruct the agent
An AI assistant gives you a highly confident, polished answer to a sensitive medical question — recommending a specific medication dosage. What best explains why the response sounds so authoritative?
Agent behaviour: Instruction-tuned via RLHF, optimised to produce responses that humans rate as helpful and confident.
Check yourself
What does Supervised Fine-Tuning (SFT) teach the model?
What does the reward model in RLHF do?
Why might an RLHF-trained model sound confident even when it is wrong?
Your turn
Compare the raw and aligned versions of the same model to make the effect of alignment training concrete and visible.