How LLMs Work — A Deep Dive

From Base Model to Assistant: SFT and RLHF

The base model predicts text. A second training stage — Supervised Fine-Tuning and Reinforcement Learning from Human Feedback — teaches it to be helpful, harmless, and honest. This is where the AI assistant you interact with actually comes from.


Why the base model needs a second training stage

After pretraining, the model is a powerful text predictor — but it doesn't know how to be an assistant. Ask it a question and it may generate more questions. Ask it to help with something sensitive and it will complete the text pattern without any concern for harm. Give it an instruction and it may or may not follow it, depending on what the training data looked like after similar prompts.

Alignment is the process of taking this raw predictor and shaping its behaviour: making it follow instructions, be helpful to users, avoid harmful outputs, and respond consistently. This is not tweaking the base model slightly — it is a distinct training phase that fundamentally changes how the model behaves.

How SFT and RLHF work in practice

Alignment happens in two stages:

Supervised Fine-Tuning (SFT): human annotators write examples of ideal assistant behaviour — questions paired with good answers, instructions paired with good responses. The model is trained on these demonstrations, learning the format and style of being an assistant.
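
A minimal sketch of the SFT objective may make this concrete. It assumes the common convention of computing the loss only on the response tokens (the prompt tokens are masked out); the source does not say whether any particular lab does this. The token ids and log-probabilities are made up for illustration, and the helper names `build_sft_example` and `sft_loss` are hypothetical.

```python
import math

def build_sft_example(prompt_ids, response_ids):
    """Concatenate prompt and response; flag only response tokens for the loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    # 0 = ignore this position, 1 = train on this position (assumed convention)
    loss_mask = [0] * len(prompt_ids) + [1] * len(response_ids)
    return input_ids, loss_mask

def sft_loss(token_log_probs, loss_mask):
    """Mean negative log-likelihood over the flagged (response) tokens."""
    picked = [-lp for lp, m in zip(token_log_probs, loss_mask) if m]
    if not picked:
        raise ValueError("no response tokens to train on")
    return sum(picked) / len(picked)

# Toy demonstration: 3 prompt tokens, 2 response tokens (ids are invented).
input_ids, loss_mask = build_sft_example([101, 102, 103], [201, 202])

# Log-probability the model assigned to each correct token (invented values).
token_log_probs = [-2.0, -1.5, -3.0, math.log(0.5), math.log(0.25)]

loss = sft_loss(token_log_probs, loss_mask)
print(loss_mask, round(loss, 4))
```

Only the last two positions contribute to the loss here, so training pushes the model towards the annotator's answer rather than towards reproducing the prompt.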

Reinforcement Learning from Human Feedback (RLHF): annotators compare pairs of model outputs and indicate which is better. A separate reward model is trained to predict these preferences. The assistant model is then updated with reinforcement learning to generate responses that score higher according to the reward model, while a penalty keeps it from drifting too far from the behaviour it learned during SFT.
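
One common way to train a reward model on pairwise comparisons is a logistic (Bradley-Terry style) loss on the difference between the two scores. The sketch below assumes that formulation; the function name `preference_loss` and the example scores are illustrative only.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected)).

    Written as log(1 + exp(-margin)) in a form that avoids overflow
    for large positive or negative margins.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# The reward model already ranks the preferred answer higher: small loss.
agrees = preference_loss(2.0, 0.0)

# The reward model ranks the rejected answer higher: large loss.
disagrees = preference_loss(0.0, 2.0)

print(round(agrees, 4), round(disagrees, 4))
```

The loss shrinks as the reward model scores the human-preferred output further above the rejected one, which is how it learns to predict annotator preferences.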

From the field

Long Ouyang et al. (OpenAI)

Research team at OpenAI; authors of the InstructGPT paper (2022), which helped establish RLHF as a standard approach to aligning large language models


Our results show that fine-tuning with human feedback significantly improves outputs on a wide range of tasks, and that labelers prefer outputs from the 1.3B-parameter InstructGPT model over those of the 175B-parameter GPT-3, despite the smaller model having over 100x fewer parameters.

1. Collect demonstration data: hire labellers to write ideal assistant responses to a diverse set of prompts.
2. Fine-tune the pretrained GPT-3 base model on these demonstrations (Supervised Fine-Tuning).
3. Collect preference data: have labellers rank multiple model outputs for the same prompt from best to worst.
4. Train a reward model (RM) to predict which outputs labellers would prefer.
5. Use PPO reinforcement learning to update the language model to maximise the reward model's score.
6. Apply a KL-divergence penalty to prevent the model from drifting too far from its SFT behaviour.
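
Two of the steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's code: `ranking_to_pairs` shows how a best-to-worst ranking can be expanded into pairwise comparisons for the reward model, and `kl_penalised_reward` shows one usual way to fold the KL penalty into the reward that PPO maximises, by subtracting a scaled log-probability ratio between the current policy and the SFT model. Both function names, the `beta` value, and all numbers are hypothetical.

```python
from itertools import combinations

def ranking_to_pairs(ranked_outputs):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs.

    combinations() keeps input order, so the first item of each pair
    is always the one the labeller ranked higher.
    """
    return list(combinations(ranked_outputs, 2))

def kl_penalised_reward(rm_score, policy_log_probs, sft_log_probs, beta=0.02):
    """Reward model score minus beta times the summed per-token log-ratio.

    The log-ratio log pi_policy(token) - log pi_sft(token), summed over the
    response, is a sample-based estimate of the KL divergence from the SFT model.
    """
    log_ratio = sum(p - s for p, s in zip(policy_log_probs, sft_log_probs))
    return rm_score - beta * log_ratio

# Three outputs ranked best to worst yield three comparisons.
pairs = ranking_to_pairs(["answer_a", "answer_b", "answer_c"])

# The policy assigns its response higher probability than the SFT model did,
# so the reward is reduced in proportion to that drift.
reward = kl_penalised_reward(
    rm_score=1.0,
    policy_log_probs=[-1.0, -1.0],
    sft_log_probs=[-2.0, -2.0],
    beta=0.1,
)
print(pairs, reward)
```

Written this way, the penalty is part of the reward used during the PPO update rather than a separate pass afterwards: a response only pays off if its reward model score outweighs how far it moves the policy from the SFT model.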


Try it — instruct the agent


An AI assistant gives you a highly confident, polished answer to a sensitive medical question — recommending a specific medication dosage. What best explains why the response sounds so authoritative?

Agent behaviour: instruction-tuned via RLHF, so it is optimised to produce responses that human raters judge helpful and confident. A confident tone therefore reflects what raters preferred, not verified accuracy.

Check yourself

What does Supervised Fine-Tuning (SFT) teach the model?

What does the reward model in RLHF do?

Why might an RLHF-trained model sound confident even when it is wrong?

Your turn

Compare the raw and aligned versions of the same model to make the effect of alignment training concrete and visible.

Try it in the Hugging Face Inference Playground (huggingface.co/spaces/huggingface/inference-playground), using both the base and instruct variants of a model.

Find a model available in both base and instruct versions (e.g. Llama 3.1 8B and Llama 3.1 8B Instruct).
Send the same two prompts to both versions: a direct question and a sensitive request.
Note how the base model responds vs. the instruct model.
Find a case where the instruct model refuses something the base model would attempt.
Identify one response where the instruct model sounds confident — and assess whether that confidence is warranted.

