How LLMs Work — A Deep Dive

Running Models: The Inference Ecosystem

Closed vs. open models, running locally, evaluating quality, and staying current. A practical map of the landscape for anyone using or building with LLMs.

6 min read

Closed, open, and local: a practical map

When you use an AI model, you have three broad options:

Closed models (GPT-4, Claude, Gemini) — you send your data to a company's API and get a response back. Frontier capability, no infrastructure required, but your data leaves your control and you pay per token.

Open models (Llama 3, Mistral, DeepSeek, Qwen) — the weights are published. You can run them yourself via a cloud provider (HuggingFace, TogetherAI, Hyperbolic) or on your own hardware. You control the data; you manage the infrastructure.

Local models — you run an open model on your own laptop or server using a tool like LMStudio. Nothing leaves your machine. Capability is limited by your hardware, but privacy is complete.

The right choice depends on your task, your privacy requirements, your latency tolerance, and your budget.

Evaluating models and staying current in a fast-moving field

With dozens of models available, how do you know which to use? Two approaches:

LM Arena (lmarena.ai) — a human preference benchmark. Real users submit prompts, two anonymous models respond, users vote on which is better. The model that wins more votes ranks higher. This captures genuine human preference better than academic benchmarks, but has weaknesses: users can bias it toward verbose, confident responses.

Task-specific testing — the most reliable method. Run your actual task on candidate models with representative inputs. Real-world performance on your specific use case beats any benchmark.

The field moves extremely fast. A model that is frontier today may be significantly outperformed within months. Staying current means following a few good sources — the AI News Newsletter (buttondown.com/ainews) provides daily summaries of significant releases and research — and periodically re-evaluating your model choices.

Try it — instruct the agent

Agent console

Your law firm wants to deploy an AI assistant to help associates draft contract summaries. The firm has strict requirements: client data must not leave the EU, the solution must be auditable, and the firm cannot depend on a third-party API being available. Which approach best meets these requirements?

Agent behavior: A solutions architect advising on model deployment strategy.

Check yourself

What is the primary advantage of self-hosting an open model over using a managed API?

What does quantisation allow you to do with a large open model?

Why might a model that ranks highly on academic benchmarks perform poorly on your specific use case?

Your turn

Run a language model entirely on your own machine — no API key, no data leaving your device — and compare it to a frontier model on the same task.

Try in LMStudio (lmstudio.ai) for local inference; LM Arena (lmarena.ai) for comparative evaluation

Download and install LMStudio.
Download a model appropriate for your hardware (7B Q4 for 8GB VRAM, 13B Q4 for 16GB VRAM).
Run the model locally and ask it a practical question from your work.
Run the same question through a frontier model (Claude, GPT-4) and compare.
Note: latency, quality gaps, and anything the local model did surprisingly well.

Reflection

PreviousReasoning Models: Thinking Before Answering