What Is a Large Language Model?
A large language model is a neural network with a staggering number of parameters, trained on most of the internet, doing one thing: predicting the next token. Scale is the feature.
The core: predict the next token
At the centre of every LLM is a simple objective: given a sequence of tokens, predict the next one. That’s it. Pretraining optimises for this single objective, and every downstream ability (translation, summarisation, code generation, reasoning) is an emergent consequence of doing it very well.
A token is typically a common word or a word-piece. The model has a vocabulary (usually 50k-250k tokens) and assigns a probability to each token being the next one. Training nudges those probabilities so the observed next token in the training data gets higher probability.
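The probability assignment can be sketched in a few lines. This is a toy: the vocabulary and logit values are invented, and a real model computes logits with a transformer over a 50k-250k-token vocabulary, but the final softmax step is the same.

```python
import math

# Toy vocabulary and made-up logits for some context like "the cat sat on the".
vocab = ["mat", "dog", "moon", "the", "ran"]
logits = [4.0, 1.5, 0.5, -1.0, 0.0]

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Convert logits into a probability distribution over the vocabulary,
# then pick the most likely next token.
probs = softmax(logits)
best = max(range(len(vocab)), key=lambda i: probs[i])
print(vocab[best])
```

Sampling from `probs` instead of taking the argmax is what makes generation non-deterministic; temperature scaling just divides the logits before the softmax.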
What “large” actually means in 2025
The “large” in LLM refers to two numbers: parameters and training tokens.
- Parameters: the adjustable numbers inside the model. Modern frontier LLMs range from 7B (small, runs on a laptop) to 400B+ (frontier, runs on a GPU cluster).
- Training tokens: the amount of text the model saw during pretraining. Frontier models see 10-30 trillion tokens. A trillion tokens is roughly 750 billion English words, more text than any human reads in a thousand lifetimes.
The “Chinchilla scaling law” (Hoffmann et al., 2022) established that for a given compute budget, training a smaller model on more data usually beats training a bigger model on less data. Modern LLMs are built around this: not as big as possible, but matched to the training data budget.
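The widely cited rules of thumb from that line of work are roughly 20 training tokens per parameter for compute-optimal training, and about 6 × N × D training FLOPs for N parameters and D tokens. Both constants are approximations from fitted scaling laws, not exact values, but they make the tradeoff concrete:

```python
# Rough Chinchilla-style rules of thumb (approximate constants, not exact):
# compute-optimal token count ~ 20 tokens per parameter,
# training compute ~ 6 * N * D FLOPs for N parameters and D tokens.

def compute_optimal_tokens(params, tokens_per_param=20):
    return params * tokens_per_param

def training_flops(params, tokens):
    return 6 * params * tokens

n = 70e9                       # a 70B-parameter model
d = compute_optimal_tokens(n)  # ~1.4 trillion tokens
print(f"{d / 1e12:.1f}T tokens, {training_flops(n, d):.2e} FLOPs")
```

Under these rules a 70B model wants about 1.4 trillion tokens; halving the parameter count and doubling the data spends the same compute, which is exactly the tradeoff the paper explores.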
Capabilities that emerged with scale
The striking thing about LLMs is that several capabilities weren’t trained for explicitly; they appeared when the model got big enough. Some examples, ordered roughly by the scale at which they emerged:
- Fluent text generation: emerged around 100M-1B parameters.
- Translation between languages (without being trained as a translator): around 1B-10B.
- Basic arithmetic and reasoning chains: around 10B-70B.
- Following complex multi-step instructions: around 70B+.
- Writing working code from requirements: around 70B+.
- Self-correction when asked to review its own work: larger still, and not fully emergent in all models.
Whether “emergent” is the right word is debated; some researchers argue the capabilities were always present but just weren’t measurable until the model got good enough. Either way, the observable fact is that adding parameters and data unlocks new behaviours in a way that doesn’t happen with smaller models.
Hard limits that haven’t moved
Scale hasn’t fixed everything. Several limits are stubborn:
- Factual reliability on long-tail knowledge: asking about a niche topic, a recent event, or a specific person can produce confidently wrong answers (hallucinations). Retrieval-augmented generation (RAG) and tool use are the practical workarounds.
- Long-horizon planning: multi-step plans that need to be correct for dozens of steps are still fragile. Errors compound.
- True counting / exhaustive search: ask for “all the ways…” and the model will happily list some and stop.
- Self-awareness about uncertainty: models express confidence in about the same register whether they’re reciting a fact or confabulating one. Calibration is improving but not solved.
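The retrieval-augmented generation workaround mentioned for long-tail knowledge can be sketched in outline: retrieve relevant passages first, then ground the answer in them. The corpus and keyword scorer below are toy stand-ins; real systems use vector search and send the assembled prompt to an actual model.

```python
# Minimal RAG sketch: retrieve relevant passages, then build a prompt that
# constrains the model to answer from those passages. Everything here is a
# toy stand-in for a real retriever and a real LLM call.

corpus = [
    "The Chinchilla paper was published by Hoffmann et al. in 2022.",
    "Tokenisers typically use vocabularies of 50k-250k tokens.",
    "Batch APIs trade latency for lower per-token cost.",
]

def retrieve(query, docs, k=2):
    # Score documents by word overlap with the query (toy keyword retriever).
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: -len(q & set(d.lower().split())))
    return scored[:k]

def build_prompt(query, docs):
    context = "\n".join(f"- {d}" for d in docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

query = "When was the Chinchilla paper published?"
prompt = build_prompt(query, retrieve(query, corpus))
print(prompt)
```

The key property is that the facts arrive in the prompt at query time, so the model recites retrieved text rather than relying on what it may have memorised about a long-tail topic.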
Open-weight vs closed models
In 2025 there are two serious ecosystems:
Closed-weight frontier: Claude (Anthropic), GPT (OpenAI), Gemini (Google). You access these via API. They’re typically the most capable and the safest out of the box, with the best tool use and reasoning. You don’t have the weights; you pay per token.
Open-weight: Llama (Meta), Mistral, DeepSeek, Qwen. You can download the weights and run the models on your own hardware. Capabilities trail the frontier by 6-12 months but the best open-weight models (70B+) are genuinely useful for most tasks.
Picking one is a capability-vs-control tradeoff. If you need the absolute best reasoning, closed frontier. If you need to keep data on-premises, or you’re optimising for cost at scale, open-weight. Many production systems mix: closed for the hard turns, open for the cheap volume.
A practical way to pick one
When you’re building with LLMs, start with a closed frontier model to figure out whether the task is tractable. Ship. Then, once you have real usage and real cost numbers, evaluate whether a smaller or open-weight model can handle a portion of the traffic at lower cost.
Concrete starter rules:
- Classification and simple extraction: Haiku-class small models are often enough and 10-20× cheaper.
- Writing, analysis, moderate reasoning: Sonnet-class mid-tier models are the workhorse.
- Complex multi-step reasoning, difficult refactors, research-level analysis: Opus-class frontier models.
- Bulk non-real-time workloads: use batch APIs or open-weight models; they cut cost dramatically.
Routing requests to the right tier by complexity is the single biggest cost lever in production LLM systems.
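A complexity router can be sketched as a small classifier in front of the model tiers. The tier names below mirror the Haiku/Sonnet/Opus classes above; the keyword-and-length heuristics are illustrative stand-ins for whatever classifier (often itself a small model) a real system would use.

```python
# Illustrative complexity router: pick a model tier per request.
# The heuristics are toy stand-ins; production routers often use a small
# classifier model rather than keyword matching.

def route(request: str) -> str:
    text = request.lower()
    simple = ["classify", "extract", "label"]
    hard = ["refactor", "prove", "multi-step"]
    if any(k in text for k in simple) and len(text) < 200:
        return "small"     # Haiku-class: classification, simple extraction
    if any(k in text for k in hard):
        return "frontier"  # Opus-class: hard reasoning, difficult refactors
    return "mid"           # Sonnet-class workhorse for everything else

print(route("Classify this ticket as bug or feature request"))
print(route("Refactor this module to remove the circular import"))
print(route("Summarise this article in three paragraphs"))
```

Even a crude router pays off because the cost gap between tiers is an order of magnitude; misrouting a few percent of traffic upward costs far less than sending everything to the frontier model.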