Gradient Descent Explained (Without the Calculus)
Imagine you’re lost in fog on a hill. You can’t see the bottom, but you can feel the slope under your feet. You take one step downhill. Then another. That’s gradient descent.
Rolling downhill
The model has anywhere from thousands to billions of weights. The loss function evaluates how wrong the model is, given those weights. Plot the loss as a surface where the model’s weights are the coordinates and the height is the loss value, and you have a high-dimensional landscape.
Training is the process of finding low points in that landscape. Gradient descent does it by always stepping in the direction the surface descends most steeply at the current location.
The model can’t see the whole surface. It can only feel the slope at its feet. So it takes a small step downhill, recomputes the slope, takes another step, and so on. Eventually it lands somewhere flat. That somewhere isn’t guaranteed to be the lowest point on the entire surface, but in practice for big neural networks the “somewhere flat” is good enough.
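The loop above can be sketched in a few lines. This is a toy sketch on a one-dimensional loss, loss(w) = (w − 3)², whose slope at w is 2(w − 3); the numbers are illustrative, but the step-recompute-step structure is exactly what training frameworks run over billions of weights.

```python
# Minimal gradient descent on a 1-D toy loss: loss(w) = (w - 3)**2.
# The slope at w is 2 * (w - 3); the minimum is at w = 3.

def gradient_descent(start, learning_rate, steps):
    w = start
    for _ in range(steps):
        slope = 2 * (w - 3)            # "feel the slope at our feet"
        w = w - learning_rate * slope  # take a small step downhill
    return w

w = gradient_descent(start=0.0, learning_rate=0.1, steps=100)
# w ends up very close to 3, the bottom of this valley
```

The only thing that changes at scale is how the slope is computed (backpropagation instead of a hand-derived formula); the loop itself is the same.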
Learning rate: too big vs too small
The size of each step is called the learning rate. It’s the most important hyperparameter you’ll tune.
- Too big: you overshoot the valley. Each step takes you over the bottom and up the other side. The loss bounces around or diverges (goes to infinity).
- Too small: you crawl. Training takes 10× longer than necessary, and you might not reach a good minimum within your compute budget.
- Just right: the loss decreases steadily, with mild oscillation. You can train to convergence in a reasonable time.
Practical starting points: 1e-3 for most architectures from scratch, 1e-4 to 1e-5 for fine-tuning a pretrained model, 5e-4 to 1e-3 for transformer training (with warm-up).
Tune by trying 1e-2, 1e-3, 1e-4, 1e-5 and watching which one produces a steady downward loss curve. Anything that diverges or stays flat is wrong by an order of magnitude.
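You can see all three regimes on the same toy quadratic loss from before (loss(w) = (w − 3)², minimum at w = 3). The specific learning rates here are chosen to make the regimes obvious on this toy problem, not as recommendations:

```python
# Three learning-rate regimes on loss(w) = (w - 3)**2, starting at w = 0.

def run(lr, steps=50, start=0.0):
    w = start
    for _ in range(steps):
        w -= lr * 2 * (w - 3)
    return abs(w - 3)  # final distance from the minimum

print(run(1.1))    # too big: overshoots further each step, distance explodes
print(run(1e-4))   # too small: barely moves in 50 steps
print(run(0.1))    # about right: converges to the minimum quickly
```

The same qualitative picture holds for real networks, which is why an order-of-magnitude sweep is usually enough to find the right regime.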
Batch sizes: SGD, mini-batch, full-batch
How much data does the model see between weight updates?
- Full-batch: compute the gradient over every training example before each update, so there’s one update per epoch. Most accurate gradient. Slowest. Only used for tiny problems.
- Stochastic gradient descent (SGD): one example at a time. Fast updates but noisy. In this strict one-example sense, mostly historical interest now, though the name survives: mini-batch training is still routinely called SGD.
- Mini-batch: 32 to 512 examples per update. The default for almost everything. Balances accurate gradients with fast updates and good GPU utilisation.
Larger batches give cleaner gradients and better GPU utilisation but require more memory. Smaller batches are noisier (which can actually help generalisation) and use less memory.
If you’re doing transformer training and your GPU memory allows it, batch size 256-1024 is the typical range. For fine-tuning on a single GPU, 8-32 is realistic.
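The mini-batch loop looks like this in miniature. A toy sketch fitting a one-weight linear model y = w·x with mean-squared error; the batch size and learning rate here are illustrative, and a real framework would compute the per-batch gradient by backpropagation rather than by hand:

```python
import numpy as np

# Mini-batch gradient descent on a toy linear model y = w * x.
# The data is generated with true weight 2.0, so w should approach 2.0.

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2.0 * x

w, lr, batch_size = 0.0, 0.1, 32
for epoch in range(5):
    order = rng.permutation(len(x))          # shuffle each epoch
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = x[idx], y[idx]              # one mini-batch
        grad = np.mean(2 * (w * xb - yb) * xb)  # d(MSE)/dw on this batch only
        w -= lr * grad                       # one update per mini-batch
```

Note the structure: one gradient, one update, per batch; the shuffle each epoch is what makes the per-batch gradients noisy estimates of the full-batch gradient.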
Adam, SGD, RMSprop in plain language
The base recipe (vanilla SGD with mini-batches) was the standard for decades. Modern optimisers add tricks:
- Momentum: instead of stepping purely in the current direction, accumulate a running average of recent step directions. The training trajectory rolls smoothly past minor bumps in the loss surface.
- RMSprop: scale each weight’s step size by the recent magnitude of that weight’s gradients. Weights with small, steady gradients take relatively big steps; weights with large or wildly swinging gradients take small ones.
- Adam: combines momentum and RMSprop. The default for almost every modern neural network.
- AdamW: Adam plus a corrected version of weight decay. Modern transformer training defaults to this.
If you’re unsure, use AdamW. It’s the no-regret choice in 2025.
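The Adam update itself is only a few lines. A sketch for a single weight, using the standard defaults (beta1=0.9, beta2=0.999, eps=1e-8); in practice you’d call `torch.optim.AdamW` rather than write this, and the learning rate and step count below are tuned only to this toy quadratic:

```python
# The Adam update for one weight: momentum (m) plus RMSprop-style
# per-weight scaling (v), with bias correction for the early steps.

def adam_step(w, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad       # momentum: running average of gradients
    v = b2 * v + (1 - b2) * grad ** 2  # running average of squared gradients
    m_hat = m / (1 - b1 ** t)          # bias correction (m and v start at 0)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (v_hat ** 0.5 + eps)
    return w, m, v

# Minimise the same toy loss as before: loss(w) = (w - 3)**2.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    grad = 2 * (w - 3)
    w, m, v = adam_step(w, grad, m, v, t)
```

AdamW is the same update with weight decay applied directly to the weights instead of being mixed into the gradient, which is the “corrected” part.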
Schedules and warm-up
The learning rate doesn’t have to stay constant. Two common patterns:
- Warm-up: start at a tiny learning rate and ramp up to the target over the first few thousand steps. Prevents early-training divergence in transformers, which have very large gradients in their first updates.
- Decay: gradually decrease the learning rate as training progresses. Cosine decay (smooth ramp-down following a cosine curve) is the modern default.
The combined warm-up + cosine decay schedule is the recipe used in most transformer pretraining today. PyTorch and Hugging Face have one-line implementations.
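Written out, the schedule is just a multiplier on the base learning rate. A sketch, with illustrative step counts; library helpers such as `transformers`’ `get_cosine_schedule_with_warmup` implement the same shape:

```python
import math

# Linear warm-up followed by cosine decay, as a multiplier on the base
# learning rate: ramps 0 -> 1 over warmup_steps, then decays 1 -> 0.

def lr_multiplier(step, warmup_steps, total_steps):
    if step < warmup_steps:
        return step / warmup_steps                   # linear ramp up
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))  # smooth cosine ramp down
```

Multiply the base learning rate by `lr_multiplier(step, ...)` at each step and you get the warm-up + cosine curve seen in most pretraining logs.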
When training stalls, the diagnosis
The loss isn’t going down. Three things to check, in order:
- Is your data loaded correctly? The single most common cause of stalled training is a bug in the data loader producing garbage. Print a batch. Look at it.
- Is the learning rate sensible? Plot loss for the first 100 steps. It should decrease. If it’s flat, learning rate too small. If it’s exploding, too large. Move by 10× and re-test.
- Is the model learning at all? Train on a tiny subset (10 examples) for a few hundred steps. The loss should approach zero. If it doesn’t, the model is broken, not the data.
The third check is underused. If you can’t overfit a 10-example subset, no amount of data will fix the model. Fix the architecture or the loss before adding data.
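The overfit check looks like this on a toy model. A sketch with a two-parameter linear model standing in for the network; with a real model you’d do exactly the same thing: freeze a handful of examples, train for a few hundred steps, and watch whether the loss collapses:

```python
import numpy as np

# The "overfit a tiny subset" sanity check: 10 examples, a few hundred
# steps of plain gradient descent on a model that CAN fit them.
# If the loss doesn't approach zero, the model or loss code is broken.

rng = np.random.default_rng(1)
x = rng.normal(size=10)      # 10 examples only
y = 2.0 * x + 1.0            # a pattern the model can represent exactly

w, b, lr = 0.0, 0.0, 0.1
for step in range(500):
    err = (w * x + b) - y
    loss = np.mean(err ** 2)
    w -= lr * np.mean(2 * err * x)
    b -= lr * np.mean(2 * err)
# loss is now effectively zero: the model memorised the 10 examples
```

If this check passes but full-scale training still stalls, the problem is back in the data or the learning rate, not the model.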