AI & ML · Beginner · By Samson Tanimawo, PhD · Published Feb 18, 2025 · 9 min read

What Is a Large Language Model?

A large language model is a neural network with a staggering number of parameters, trained on most of the internet, doing one thing: predicting the next token. Scale is the feature.

The core: predict the next token

At the centre of every LLM is a simple objective: given a sequence of tokens, predict the next one. That’s it. Pretraining optimises for this single objective, and every downstream ability (translation, summarisation, code generation, reasoning) is an emergent consequence of doing it very well.

A token is typically a common word or a word-piece. The model has a vocabulary (usually 50k-250k tokens) and assigns a probability to each token being the next one. Training nudges those probabilities so that the observed next token in the training data gets higher probability.
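To make the "probability over a vocabulary" idea concrete, here is a minimal sketch. The vocabulary and the logits (raw scores) are made up for illustration; a real model produces one logit per token in its 50k-250k entry vocabulary.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary and invented logits for the next token after
# a prompt like "The cat sat on the". Real vocabularies are huge.
vocab = ["mat", "dog", "roof", "quantum"]
logits = [4.0, 1.0, 2.5, -3.0]

probs = softmax(logits)
prediction = vocab[probs.index(max(probs))]
# Training nudges the logit of the observed next token upward,
# which raises its probability relative to everything else.
```

The entire model exists to produce good logits; everything downstream (sampling, temperature, beam search) is just different ways of reading this distribution.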

What “large” actually means in 2025

The “large” in LLM refers to two numbers: parameters and training tokens.

The “Chinchilla scaling law” (Hoffmann et al., 2022) established that for a given compute budget, training a smaller model on more data usually beats training a bigger model on less data. Modern LLMs are built around this: not as big as possible, but matched to the training data budget.
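The Chinchilla result is often summarised as a rule of thumb: train on roughly 20 tokens per parameter, with training compute approximated as C ≈ 6 · N · D FLOPs (N parameters, D tokens). A rough sketch of the compute-optimal split under those two assumptions:

```python
def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Compute-optimal model/data split under the common rule of thumb:
    training compute C ≈ 6 * N * D FLOPs, with D ≈ 20 * N tokens.
    Solving 6 * N * (20 * N) = C gives N = sqrt(C / 120)."""
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: the compute of a 70B-parameter model trained on 1.4T tokens
# (the Chinchilla configuration itself) recovers roughly those numbers.
n, d = chinchilla_optimal(6 * 70e9 * 1.4e12)
```

This is why "large" is a two-dimensional claim: doubling parameters without doubling the data budget moves you off the compute-optimal line.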

Capabilities that emerged with scale

The striking thing about LLMs is that several capabilities weren’t trained for explicitly; they appeared when the model got big enough. Well-known examples, ordered roughly by the scale at which they emerged, include in-context (few-shot) learning, multi-step arithmetic, and chain-of-thought reasoning.

Whether “emergent” is the right word is debated; some researchers argue the capabilities were always present but simply weren’t measurable until the model got good enough. Either way, the observable fact is that adding parameters and data unlocks new behaviours in a way that doesn’t happen with smaller models.

Hard limits that haven’t moved

Scale hasn’t fixed everything. Several limits are stubborn: models still hallucinate (confidently stating things that are false), their knowledge stops at a training cutoff, and they have no persistent memory between conversations.

Open-weight vs closed models

In 2025 there are two serious ecosystems:

Closed-weight frontier: Claude (Anthropic), GPT (OpenAI), Gemini (Google). You access these via API. They’re typically the most capable and the safest out of the box, with the best tool use and reasoning. You don’t have the weights; you pay per token.

Open-weight: Llama (Meta), Mistral, DeepSeek, Qwen. You can download the weights and run the models on your own hardware. Capabilities trail the frontier by 6-12 months, but the best open-weight models (70B+) are genuinely useful for most tasks.

Picking one is a capability-vs-control tradeoff. If you need the absolute best reasoning, closed frontier. If you need to keep data on-premises, or you’re optimising for cost at scale, open-weight. Many production systems mix: closed for the hard turns, open for the cheap volume.

A practical way to pick one

When you’re building with LLMs, start with a closed frontier model to figure out whether the task is tractable. Ship. Then, once you have real usage and real cost numbers, evaluate whether a smaller or open-weight model can handle a portion of the traffic at lower cost.

A concrete starter rule: route requests to the right tier by complexity. That routing is the single biggest cost lever in production LLM systems.