AI & ML · Beginner · By Samson Tanimawo, PhD · Published May 13, 2025 · 9 min read

Gradient Descent Explained (Without the Calculus)

Imagine you’re lost in fog on a hill. You can’t see the bottom, but you can feel the slope under your feet. You take one step downhill. Then another. That’s gradient descent.

Rolling downhill

The model has thousands or billions of weights. The loss function evaluates how wrong the model is, given those weights. Plot the loss as a surface where the model’s weights are the coordinates and the height is the loss value, and you have a high-dimensional landscape.

Training is the process of finding low points in that landscape. Gradient descent does this by always stepping in the direction in which the surface descends most steeply at the current location.

The model can’t see the whole surface. It can only feel the slope at its feet. So it takes a small step downhill, recomputes the slope, takes another step, and so on. Eventually it lands somewhere flat. That somewhere isn’t guaranteed to be the lowest point on the entire surface, but in practice for big neural networks the “somewhere flat” is good enough.
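That loop can be sketched in a few lines of plain Python, using a one-variable toy loss so the slope is easy to compute (the loss function, starting point, and learning rate here are illustrative, not from a real model):

```python
# Gradient descent on a one-variable toy loss f(w) = (w - 3)^2.
# The "slope under your feet" is the derivative f'(w) = 2 * (w - 3).

def gradient_descent(lr=0.1, steps=100):
    w = 0.0                    # start somewhere on the surface
    for _ in range(steps):
        grad = 2 * (w - 3)     # feel the slope at the current point
        w -= lr * grad         # take one small step downhill
    return w

print(gradient_descent())  # lands very close to the minimum at w = 3
```

Real training is the same loop, just with billions of coordinates and a slope computed by backpropagation instead of a hand-written derivative.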

Learning rate: too big vs too small

The size of each step is called the learning rate. It’s the most important hyperparameter you’ll tune.

Practical starting points:

- 1e-3 for most architectures trained from scratch
- 1e-4 to 1e-5 for fine-tuning a pretrained model
- 5e-4 to 1e-3 for transformer pretraining (with warm-up)

Tune by trying 1e-2, 1e-3, 1e-4, 1e-5 and watching which one produces a steady downward loss curve. Anything that diverges or stays flat is wrong by an order of magnitude.
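The sweep is easy to see on the same kind of one-variable toy loss (an illustrative sketch, not a real training run): the too-big rate blows up, the too-small rates barely move, and one in the middle descends steadily.

```python
# Order-of-magnitude learning-rate sweep on the toy loss f(w) = (w - 3)^2,
# always starting from w = 0 and taking the same number of steps.

def final_loss(lr, steps=50):
    w = 0.0
    for _ in range(steps):
        w -= lr * 2 * (w - 3)
    return (w - 3) ** 2

for lr in [1.5, 0.1, 1e-3, 1e-5]:
    print(f"lr={lr:g}  final loss={final_loss(lr):.3g}")
# 1.5 diverges (the loss explodes), 1e-5 stays nearly flat,
# and 0.1 drives the loss steadily toward zero
```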

Batch sizes: SGD, mini-batch, full-batch

How much data does the model see between weight updates? The three regimes in the heading answer that differently: SGD (stochastic gradient descent) updates after every single example, mini-batch updates after a small group of examples (the default in practice), and full-batch updates once per pass over the entire dataset.

Larger batches give cleaner gradients and better GPU utilisation but require more memory. Smaller batches are noisier (which can actually help generalisation) and use less memory.

If you’re doing transformer training and your GPU memory allows it, batch size 256-1024 is the typical range. For fine-tuning on a single GPU, 8-32 is realistic.
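The practical difference comes down to how many updates one pass over the data yields. A minimal sketch in plain Python (the dataset size and batch sizes are illustrative):

```python
import random

# Count weight updates per epoch on a 1,000-example dataset
# under each batching regime.

data = list(range(1000))

def batches(dataset, batch_size):
    shuffled = random.sample(dataset, len(dataset))  # reshuffle each epoch
    for i in range(0, len(shuffled), batch_size):
        yield shuffled[i:i + batch_size]

for name, bs in [("SGD", 1), ("mini-batch", 32), ("full-batch", 1000)]:
    n_updates = sum(1 for _ in batches(data, bs))
    print(f"{name}: {n_updates} updates per epoch")
```

SGD gets 1,000 noisy updates per epoch, full-batch gets exactly one clean update, and mini-batch sits in between.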

Adam, SGD, RMSprop in plain language

The base recipe (vanilla SGD with mini-batches) was the standard for decades. Modern optimisers add tricks:

- SGD with momentum: keeps a running average of past gradients, so steps build up speed along directions that stay consistent from batch to batch.
- RMSprop: scales each weight's step by a running average of its recent squared gradients, so weights with steep, noisy gradients get smaller steps.
- Adam: combines both tricks (momentum plus per-weight step scaling).
- AdamW: Adam with weight decay applied directly to the weights instead of being folded into the gradient.

If you’re unsure, use AdamW. It’s the no-regret choice in 2025.
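To make the tricks concrete, here is a sketch of a single Adam-style update for one weight, written in plain Python. This follows the textbook update rule, not any particular library's internals; the learning rate and toy loss are illustrative.

```python
import math

# One Adam-style update for a single weight: a running average of gradients
# (momentum) plus a running average of squared gradients (per-weight scaling).
# AdamW differs only in applying weight decay directly to the weight.

def adam_step(w, grad, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad          # momentum: smoothed gradient
    v = b2 * v + (1 - b2) * grad ** 2     # smoothed squared gradient
    m_hat = m / (1 - b1 ** t)             # bias corrections for early steps
    v_hat = v / (1 - b2 ** t)
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Minimise the toy loss (w - 3)^2 with it:
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 201):
    w, m, v = adam_step(w, 2 * (w - 3), m, v, t)
print(w)  # steps toward the minimum at w = 3
```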

Schedules and warm-up

The learning rate doesn’t have to stay constant. Two common patterns:

- Warm-up: start near zero and ramp up to the target rate over the first few hundred or thousand steps, so early noisy gradients don’t wreck the freshly initialised weights.
- Decay: gradually lower the rate as training progresses (cosine decay is the most common shape), so the model settles into a minimum instead of bouncing around it.

The combined warm-up + cosine decay schedule is the recipe used in most transformer pretraining today. PyTorch and Hugging Face have one-line implementations.
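As a sketch, the combined schedule is just a function of the step number: a linear ramp up to the peak rate, then a cosine curve down toward zero. The peak rate, warm-up length, and total steps below are assumed values, not from any particular recipe.

```python
import math

# Warm-up + cosine decay as a plain function of the training step.

def lr_at(step, peak_lr=5e-4, warmup_steps=1000, total_steps=10000):
    if step < warmup_steps:
        return peak_lr * step / warmup_steps         # linear warm-up
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay

print(lr_at(500))    # halfway through warm-up: half the peak rate
print(lr_at(1000))   # end of warm-up: the full peak rate
print(lr_at(10000))  # end of training: decayed to zero
```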

When training stalls: the diagnosis

The loss isn’t going down. Three things to check, in order:

1. The learning rate: drop it by an order of magnitude (and try one higher) and watch whether the curve changes shape.
2. The data: inspect a few batches by hand and confirm the inputs and labels actually line up.
3. Overfitting a tiny subset: train on just 10 examples and confirm the loss drops to near zero.

The third check is underused. If you can’t overfit a 10-example subset, no amount of data will fix the model. Fix the architecture or the loss before adding data.
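A sketch of the third check on a toy problem: a linear model trained by gradient descent on a 10-example subset it should easily memorise. The model, data, and hyperparameters are illustrative; the point is only that a healthy setup drives the subset loss to near zero.

```python
# Overfit-a-tiny-subset check with a linear model y = a*x + b.

data = [(x, 2 * x + 1) for x in range(10)]   # 10 examples, easily memorised

a, b, lr = 0.0, 0.0, 0.01
for _ in range(2000):
    grad_a = sum(2 * (a * x + b - y) * x for x, y in data) / len(data)
    grad_b = sum(2 * (a * x + b - y) for x, y in data) / len(data)
    a, b = a - lr * grad_a, b - lr * grad_b

loss = sum((a * x + b - y) ** 2 for x, y in data) / len(data)
print(f"loss on the 10-example subset: {loss:.2e}")  # should be near zero
```

If the equivalent check on your real model cannot drive the loss down, the problem is in the architecture, the loss, or the data pipeline, not in the amount of data.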