How LLMs Work — A Deep Dive

Reasoning Models: Thinking Before Answering

A newer training approach gives models a scratchpad — and dramatically changes what they can do. The AlphaGo connection, chain-of-thought, and when to reach for a reasoning model.

Giving the model room to think changes what it can do

Standard language models answer immediately — input goes in, output comes out. For simple questions, this works well. For hard multi-step problems — a complex legal analysis, a logic puzzle, a difficult coding challenge — producing the answer directly is often beyond what the model can do reliably.

Chain-of-thought (CoT) prompting builds on the observation that asking the model to show its work before giving the final answer often improves accuracy substantially on complex, multi-step tasks. Because the model generates one token at a time, each written step becomes context for the next, so the intermediate steps act as scratch space that supports more reliable conclusions.
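
The difference between a direct prompt and a chain-of-thought prompt can be sketched in a few lines. This is a minimal illustration, not a recommended template: the function names, the prompt wording, and the 'Answer:' marker are all assumptions made for the example, and the sample completion is hand-written rather than produced by a model.

```python
def direct_prompt(question: str) -> str:
    """Ask for the answer only, with no visible working."""
    return f"{question}\nAnswer with the final result only."


def cot_prompt(question: str) -> str:
    """Ask the model to reason step by step before committing to an answer."""
    return (
        f"{question}\n"
        "Think through the problem step by step, showing your work.\n"
        "Then give the final result on a new line starting with 'Answer:'."
    )


def extract_answer(completion: str) -> str:
    """Pull the final answer out of a step-by-step completion."""
    # Search from the end so the last 'Answer:' line wins.
    for line in reversed(completion.splitlines()):
        if line.strip().startswith("Answer:"):
            return line.split("Answer:", 1)[1].strip()
    return completion.strip()  # no marker found: fall back to the whole text


question = "A train leaves at 14:10 and arrives at 16:45. How long is the journey?"
print(cot_prompt(question))

# Hand-written stand-in for a model completion, used only to show the parsing.
sample_completion = (
    "From 14:10 to 16:10 is 2 hours.\n"
    "From 16:10 to 16:45 is 35 minutes.\n"
    "Answer: 2 hours 35 minutes"
)
print(extract_answer(sample_completion))
```

Asking for a fixed marker on the last line is one simple way to keep the working and the final answer separable when the output is read by code.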

Reasoning models take this further: they are trained specifically to generate extended reasoning before answering, producing thinking tokens that shape the final output and that many products hide or summarise rather than show in full. The thinking step is not an afterthought added at prompt time; it is a behaviour instilled by training, typically on the same kind of underlying architecture as a standard model.
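
To make the idea of thinking tokens concrete, here is a small sketch that separates the reasoning from the final answer in a raw completion. It assumes the model wraps its reasoning in <think>...</think> tags, as DeepSeek-R1 does in its raw output; other models expose reasoning differently or not at all. The helper name split_thinking and the sample text are made up for illustration.

```python
import re

# Assumption: reasoning is wrapped in <think>...</think> tags.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(raw: str) -> tuple[str, str]:
    """Return (thinking, answer) from a raw completion."""
    match = THINK_RE.search(raw)
    if match is None:
        return "", raw.strip()  # no thinking block: everything is the answer
    thinking = match.group(1).strip()
    # The answer is whatever remains once the thinking block is cut out.
    answer = (raw[:match.start()] + raw[match.end():]).strip()
    return thinking, answer


# Hand-written stand-in for a raw model completion.
raw_output = (
    "<think>17 * 6 = 102. Check: 10 * 6 = 60, 7 * 6 = 42, 60 + 42 = 102.</think>\n"
    "17 multiplied by 6 is 102."
)
thinking, answer = split_thinking(raw_output)
print("Shown to the user:", answer)
print("Thinking length in characters:", len(thinking))
```

An application would typically display only the answer part, while logging or optionally revealing the thinking part.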

The AlphaGo connection: RL discovers strategies humans didn't teach

To understand why RL-trained reasoning models work, it helps to look at AlphaGo — DeepMind's system that mastered the board game Go in 2016.

AlphaGo was not hand-programmed with strategies. It first learned from records of human expert games, then improved through reinforcement learning: it played millions of games against itself, updated its weights based on what led to wins, and gradually discovered strategies that human players had not devised. Move 37 in Game 2 of the 2016 match against Lee Sedol, which commentators initially took for a mistake, was later recognised as a stroke of genius. The system had discovered something beyond the known canon of the game. Its successor, AlphaGo Zero, dropped the human games entirely and learned from self-play alone.

The same principle applies to reasoning models. Given a reward signal for getting the right answer, models trained with RL discover reasoning strategies — not because they were shown those strategies, but because those strategies are what produce correct answers.
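
The principle that an outcome reward alone can surface a good strategy can be shown with a toy example. The sketch below is an epsilon-greedy bandit, a much simpler technique than the policy-optimisation methods used to train real reasoning models. The three strategy names and their success rates are invented numbers for illustration; the learner never sees them and only observes whether each attempt was rewarded.

```python
import random


def train_bandit(success_rates, steps=5000, epsilon=0.1, seed=0):
    """Learn which strategy earns the most reward, from outcomes alone."""
    rng = random.Random(seed)
    n = len(success_rates)
    counts = [0] * n      # how often each strategy was tried
    values = [0.0] * n    # running average reward per strategy
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n)                         # explore
        else:
            arm = max(range(n), key=lambda i: values[i])   # exploit
        # Reward is 1 for a correct answer, 0 otherwise.
        reward = 1.0 if rng.random() < success_rates[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
    return values, counts


# Hypothetical strategies and made-up success rates (hidden from the learner).
STRATEGIES = ["guess immediately", "work step by step", "work step by step and verify"]
SUCCESS_RATES = [0.2, 0.5, 0.9]

values, counts = train_bandit(SUCCESS_RATES)
best = max(range(len(STRATEGIES)), key=lambda i: values[i])
print("Preferred strategy:", STRATEGIES[best])
print("Times each strategy was tried:", counts)
```

Nothing in the loop tells the learner that verifying is a good idea. The preference emerges because that strategy is rewarded more often, which is the point the AlphaGo comparison makes at far larger scale.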

From the field

DeepSeek AI Research Team

Chinese AI research lab; authors of DeepSeek-R1, an open-weight reasoning model that reported performance comparable to frontier closed models, using RL training on verifiable tasks

We find that through purely reinforcement learning training, without any supervised chain-of-thought demonstrations, the model spontaneously develops sophisticated reasoning behaviours — including self-verification, backtracking, and extended deliberation before answering.

1. Start with a strong pretrained base model (DeepSeek-V3-Base).
2. Apply a small cold-start phase with curated reasoning demonstrations to establish the output format.
3. Train with reinforcement learning using two reward signals: correctness of the final answer and adherence to the required output format.
4. Observe emergent reasoning behaviours (chain-of-thought, self-checking, and revision) arising without being explicitly trained.
5. Apply rejection sampling to collect high-quality reasoning traces from the RL model, then use them for a further SFT stage.
6. Release model weights publicly, enabling the research community to study and build on the approach.
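
The two reward signals in step 3 and the rejection sampling in step 5 can be sketched as simple rule-based checks. This is an illustration under stated assumptions, not DeepSeek's implementation: the <think> and <answer> tag layout, the exact-match comparison, the format weight of 0.2, and all function names are choices made for the example.

```python
import re

# Assumed output format: <think>reasoning</think><answer>final answer</answer>
FORMAT_RE = re.compile(r"^<think>.+?</think>\s*<answer>(.+?)</answer>$", re.DOTALL)


def format_reward(completion: str) -> float:
    """1.0 if the completion follows the required tag layout, else 0.0."""
    return 1.0 if FORMAT_RE.match(completion.strip()) else 0.0


def correctness_reward(completion: str, reference: str) -> float:
    """1.0 if the tagged final answer exactly matches the reference, else 0.0."""
    match = FORMAT_RE.match(completion.strip())
    if match is None:
        return 0.0  # no parseable answer, so it cannot be verified
    return 1.0 if match.group(1).strip() == reference.strip() else 0.0


def total_reward(completion: str, reference: str, format_weight: float = 0.2) -> float:
    """Combine both signals; the weighting here is an arbitrary illustration."""
    return correctness_reward(completion, reference) + format_weight * format_reward(completion)


def rejection_sample(completions, reference):
    """Keep only well-formatted completions whose final answer is correct."""
    return [c for c in completions if correctness_reward(c, reference) == 1.0]


# Hand-written candidate completions for the question "What is 12 * 12 - 44?"
candidates = [
    "<think>12 * 12 = 144, and 144 - 44 = 100.</think><answer>100</answer>",
    "<think>12 * 12 = 124, and 124 - 44 = 80.</think><answer>80</answer>",
    "The answer is 100.",
]
rewards = [total_reward(c, "100") for c in candidates]
kept = rejection_sample(candidates, "100")
print(rewards)
print(len(kept), "trace(s) kept for the later SFT stage")
```

Checks like these only work on tasks where the answer can be verified automatically, such as maths with a known result or code with tests, which is why the approach is described as RL on verifiable tasks.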

Paper — source

Try it — instruct the agent

You have two tasks today. Task A: draft a short summary of a 3-paragraph news article. Task B: identify whether the indemnification clause in a contract is broader than standard market practice and explain why. Which model type is most appropriate for each?

Agent behavior: A tool-selection advisor helping you choose between a standard assistant model and a reasoning model.

Check yourself

Why does asking a model to 'think step by step' improve accuracy on complex tasks?

What does AlphaGo's training demonstrate that is directly relevant to reasoning models?

Which task type benefits most from using a reasoning model?

Your turn

Compare a standard model and a reasoning model on the same complex analytical task to make the performance difference concrete.

Try this in the TogetherAI Playground (api.together.xyz/playground) for side-by-side model comparison, or in LM Arena (lmarena.ai).

Pick a hard analytical task: a logic puzzle, a multi-step maths problem, or a contract clause analysis.
Run the task on a standard assistant model and a reasoning model (e.g. DeepSeek-R1 or similar).
Read the reasoning model's thinking trace if it is visible — note what strategies it uses.
Identify at least one case where the reasoning model caught something the standard model missed.
Note the difference in latency: the reasoning model generates thinking tokens before it answers, so it takes longer.
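
If you prefer to run the comparison from code rather than a playground, the steps above can be sketched as a small timing harness. The two model functions here are stand-in stubs that return canned text, and the short sleep only simulates extra thinking time; in practice you would replace them with calls to whichever provider's client you use. All names in the sketch are hypothetical.

```python
import time
from typing import Callable


def compare(models: dict[str, Callable[[str], str]], prompt: str) -> dict[str, dict]:
    """Run the same prompt through each model and record output and latency."""
    results = {}
    for name, generate in models.items():
        start = time.perf_counter()
        output = generate(prompt)
        elapsed = time.perf_counter() - start
        results[name] = {"output": output, "seconds": elapsed}
    return results


# Stand-in stubs: swap these for real API calls to the models you are comparing.
def standard_stub(prompt: str) -> str:
    return "The clause looks standard."


def reasoning_stub(prompt: str) -> str:
    time.sleep(0.05)  # simulates the time spent generating thinking tokens
    return "The clause covers third-party claims and has no cap, so it is broader than usual."


task = "Is this indemnification clause broader than standard market practice?"
results = compare({"standard": standard_stub, "reasoning": reasoning_stub}, task)
for name, result in results.items():
    print(f"{name}: {result['seconds']:.3f}s -> {result['output']}")
```

Keeping the prompt identical across models is the important design choice: it means any difference in the answers or the latency comes from the model, not from the wording of the task.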
