How Does ChatGPT Actually Work?
At bottom, ChatGPT predicts the next token. Everything else (the training pipeline, the alignment, the instruction-following) is scaffolding around that core. Here is the whole picture.
The 30-second answer
ChatGPT is a neural network that was trained to predict the next token (a sub-word unit) given the preceding tokens. Given a chat turn, it predicts one token, appends it to the conversation, and repeats until it predicts a special stop token. That generation loop is the whole runtime behaviour.
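The generation loop above can be sketched in a few lines of Python. Here `predict_next_token` is a toy stand-in for the real network (a hard-coded lookup, purely illustrative); the loop structure, predict, check for stop, append, repeat, is the real runtime behaviour.

```python
# Toy sketch of the autoregressive generation loop. predict_next_token
# is a stand-in for the neural network; the loop itself is the point.
STOP = "<eos>"

def predict_next_token(tokens):
    # Stand-in for the model: a hard-coded lookup keyed on the last
    # token. A real model scores the entire preceding context.
    table = {"<bos>": "Paris", "Paris": "is", "is": "lovely", "lovely": STOP}
    return table[tokens[-1]]

def generate(prompt_tokens):
    tokens = list(prompt_tokens)
    while True:
        nxt = predict_next_token(tokens)  # one forward pass per token
        if nxt == STOP:                   # special stop token ends the turn
            return tokens
        tokens.append(nxt)                # append and repeat

print(generate(["<bos>"]))  # ['<bos>', 'Paris', 'is', 'lovely']
```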
The reason the output feels intelligent is the scale of the training: trillions of tokens of text from the internet, plus tens of thousands of human-written example conversations, plus preference data from millions of user ratings. All of that gets compressed into the model’s weights.
Training: three distinct phases
The chatbot you use is the final artefact of a three-phase pipeline. Each phase turns a less-useful version into a more-useful one.
Phase 1: Pretraining. The model is trained on a huge corpus of internet text (books, Wikipedia, code repositories, forums). Objective: predict the next token. No labels beyond the text itself. After this phase you have a “base model” that can autocomplete text with surprising coherence but has no idea how to be helpful in a conversation: ask it a question and it might produce ten more questions instead of an answer.
Phase 2: Supervised fine-tuning (SFT). Human contractors write thousands of example conversations showing what helpful, instruction-following responses look like. The model is fine-tuned on these. After SFT, the base model has learned the shape of a chatbot response. It’s now an assistant, but still prone to over-confident errors and inconsistent behaviour.
Phase 3: Reinforcement learning from human feedback (RLHF). Human raters compare pairs of model outputs and say which is better. A reward model learns from these preferences, and the chat model is then trained against it with a reinforcement-learning algorithm (typically PPO; a popular alternative, DPO, skips the separate reward model and optimises directly on the preference pairs). After RLHF, the model is more helpful, more cautious, and better at following nuanced instructions.
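The reward model in Phase 3 is typically trained with a Bradley-Terry-style pairwise loss: the probability that the preferred response beats the rejected one is a sigmoid of the difference in their scores. A minimal sketch in plain Python (in practice the scores come from the reward model and the loss is minimised by gradient descent):

```python
import math

def preference_loss(score_chosen, score_rejected):
    # Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    # It is small when the reward model already rates the human-preferred
    # response higher than the rejected one.
    diff = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# The loss shrinks as the margin between chosen and rejected grows:
print(preference_loss(2.0, 0.0) < preference_loss(0.5, 0.0))  # True
```

Minimising this loss over millions of rating pairs is what turns raw human preferences into a single scalar reward the RL step can optimise against.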
Current models add a fourth phase: constitutional AI or RLAIF, where some of the preference data is generated by stronger models following a written “constitution” of rules, reducing the human-labelling cost.
Inference: what happens when you hit send
You type “What’s the capital of France?” and press enter. Inside the model:
- Tokenisation. Your message is broken into tokens. “What’s the capital of France?” becomes something like ['What', "'s", ' the', ' capital', ' of', ' France', '?'].
- Embedding. Each token is mapped to a vector of numbers (typically 4,096 or 12,288 dimensions in modern models).
- Forward pass through attention layers. These vectors flow through 60-120 transformer layers. Each layer mixes information between positions (attention) and applies a per-position neural-network transformation.
- Next-token prediction. The final layer produces a probability distribution over the 50,000-200,000 possible next tokens.
- Sampling. One token is sampled from that distribution (often with temperature to control randomness).
- Append and repeat. The sampled token is appended to the conversation, and the forward-pass, prediction, and sampling steps repeat. One token per iteration.
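The sampling step deserves a closer look, since temperature is the knob most users encounter. A sketch in plain Python: scale the logits by the temperature, softmax them into probabilities, then draw one index. Lower temperature sharpens the distribution toward the top token; higher temperature flattens it.

```python
import math, random

def sample_with_temperature(logits, temperature=1.0, rng=random):
    # Softmax over temperature-scaled logits. Subtracting the max
    # before exponentiating is a standard numerical-stability trick.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one token index from the resulting distribution.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0], probs

# Near-zero temperature is effectively greedy decoding:
idx, probs = sample_with_temperature([2.0, 1.0, 0.1], temperature=0.01)
print(idx)  # 0
```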
This is why streaming responses appear word by word: the server is generating one token, sending it to you, and starting on the next. Each token typically takes 5-50 milliseconds on modern hardware.
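Streaming maps naturally onto a generator: each token is handed to the client as soon as it exists, rather than after the whole reply is done. A toy sketch (the `delay_s` parameter stands in for the per-token forward pass):

```python
import time

def stream_tokens(tokens, delay_s=0.0):
    # Mimics server-side streaming: yield each token as soon as it is
    # "generated", so the client can render the reply incrementally.
    for tok in tokens:
        time.sleep(delay_s)  # stands in for the 5-50 ms forward pass
        yield tok

reply = "".join(stream_tokens(["The", " capital", " is", " Paris", "."]))
print(reply)  # The capital is Paris.
```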
What it can’t do, and why
The next-token prediction framing explains both the capabilities and the limits.
- It doesn’t actually “know” facts the way a database does. It has compressed patterns from training text. Facts that appeared many times are reliably recalled; obscure facts are approximated or confabulated.
- It can’t learn from your conversation long-term. The context window (typically 128K-2M tokens in 2025) is its short-term memory. Nothing you say persists across sessions unless the app stores it.
- It doesn’t know what it doesn’t know. The probability distribution on each token looks the same whether the model is reciting a well-known fact or inventing something plausible. Without external grounding, the model can’t reliably tell you when it’s uncertain.
- Arithmetic is hit-or-miss. Multi-digit multiplication was famously weak until recently; even now, models use tool-calling (a calculator function) for anything beyond small numbers.
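The tool-calling workaround for arithmetic looks roughly like this: the model emits a structured request naming a tool, and the runtime, not the model, computes the answer. The JSON message format and the `call_tool` dispatcher below are illustrative assumptions, not any particular vendor's API:

```python
import json

def call_tool(message):
    # Hypothetical tool-call format: the model emits JSON naming a tool
    # and its arguments instead of guessing the answer in text.
    request = json.loads(message)
    if request["tool"] == "calculator":
        a, b = request["args"]["a"], request["args"]["b"]
        return a * b  # the runtime does the arithmetic, exactly
    raise ValueError(f"unknown tool: {request['tool']}")

# Asked for 123456 * 789, a tool-using model emits something like:
model_output = '{"tool": "calculator", "args": {"a": 123456, "b": 789}}'
print(call_tool(model_output))  # 97406784
```

The result is then fed back into the context, and the model continues generating with a correct number in hand instead of a plausible-looking one.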
The stochastic-parrot debate, fairly
A 2021 paper (“On the Dangers of Stochastic Parrots” by Bender, Gebru, and colleagues) argued that large language models are “stochastic parrots”, producing fluent text without understanding. The phrase became a political flashpoint; the underlying question is genuinely important.
The strong case for “just a parrot”: the model has no sensory grounding, no embodied experience, no goal beyond predicting the next token. Its “understanding” is statistical pattern-matching that happens to look like reasoning.
The strong case against: modern models solve novel problems they couldn’t have memorised, generalise to new languages they weren’t explicitly trained on, and produce code that works on requirements they’ve never seen verbatim. That’s not what a parrot does.
The honest answer in 2025: there’s meaningful cognitive-like behaviour that emerges from statistical learning at scale. Call it “real understanding” or not; the output is useful and the field is still figuring out where the boundary sits.
Where this is all heading
Three trends are reshaping what “ChatGPT” even means in 2025:
- Tool use. Models are trained to call functions, calculators, search engines, code interpreters, APIs. The chat interface is increasingly a thin layer over a planning-and-tool-calling loop.
- Reasoning models. Separate “thinking” tokens during inference let the model do multi-step reasoning before emitting an answer. This adds compute and latency but unlocks harder problems.
- Multimodality. Models ingest and produce images, audio, video, and structured data as naturally as text. The “language” model framing is becoming outdated.
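Reasoning models make the thinking-versus-answer split concrete in the output itself. Some open reasoning models wrap the hidden reasoning in delimiter tags such as `<think>...</think>` (the exact convention varies by model; this tag is an assumption here), and the serving layer separates it from the visible answer:

```python
import re

def split_thinking(raw_output):
    # Assumed delimiter convention: reasoning wrapped in <think>...</think>
    # before the visible answer. The exact format varies by model.
    match = re.match(r"(?s)\s*<think>(.*?)</think>\s*(.*)", raw_output)
    if match:
        return match.group(1).strip(), match.group(2).strip()
    return "", raw_output.strip()  # no thinking span found

raw = "<think>12 * 12 = 144, so the square is 144.</think>The answer is 144."
thinking, answer = split_thinking(raw)
print(answer)  # The answer is 144.
```

The user pays for the thinking tokens in latency and compute but usually only sees the answer portion.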
The next-token-prediction core is still there. It’s just doing increasingly sophisticated things with the tokens.