Loss Functions: MSE, Cross-Entropy, and When to Use Each
The loss function is the answer to ‘how wrong is the model right now?’. Pick the wrong one and your model can’t learn the right thing. Here is how to match the loss to the job.
What a loss function does
A loss function takes the model’s prediction and the true target and outputs a single number: the higher, the worse. Training is the process of minimising this number across the dataset.
Two design constraints. First, the loss must be differentiable, because gradient descent works by computing the gradient of the loss with respect to the model’s parameters. Second, the loss should align with what you actually care about: if you minimise the wrong proxy, you’ll get a model that’s great at the proxy and bad at the goal.
MSE for regression
Mean squared error: average over examples of (prediction - target) squared. Used when the model outputs a continuous number (price, temperature, rating).
Why squared, not absolute? Two reasons. The square heavily penalises large errors, which is usually what you want. And it’s smooth everywhere, which makes the optimisation better-behaved than with mean absolute error (MAE), whose derivative isn’t defined at zero.
When MAE wins: when outliers in your data are rare bad data points you don’t want the model to chase. MSE’s squared term means a single 100×-too-far prediction can dominate the loss. MAE is more robust there.
If you’re unsure: start with MSE. Move to MAE if the model is clearly chasing outliers.
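The outlier effect is easy to see with toy numbers. A minimal plain-Python sketch (no framework; the data is made up for illustration):

```python
def mse(preds, targets):
    # Mean squared error: average of squared differences.
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def mae(preds, targets):
    # Mean absolute error: average of absolute differences.
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

targets = [1.0, 2.0, 3.0, 4.0]
good    = [1.1, 2.1, 2.9, 4.1]    # small errors everywhere
outlier = [1.1, 2.1, 2.9, 104.0]  # one wild prediction, 100 off

print(mse(good, targets), mae(good, targets))        # both small
print(mse(outlier, targets), mae(outlier, targets))  # MSE explodes
```

One 100-off prediction moves MSE from roughly 0.01 to roughly 2500, while MAE only moves from 0.1 to about 25: the squared term lets a single bad point dominate.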
Cross-entropy for classification
For classification, where the model outputs a probability distribution over classes, cross-entropy is the dominant loss. The maths:
loss = -log(predicted probability of the correct class)
If the model’s probability for the right class is 1.0, loss is 0. If it’s 0.5, loss is about 0.69. If it’s 0.01, loss is 4.6. The lower the probability the model assigned to the truth, the higher the loss, and steeply so.
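Those numbers fall straight out of the negative log. A one-line sketch:

```python
import math

def cross_entropy(p_correct):
    # Negative log-likelihood of the true class.
    return -math.log(p_correct)

for p in (1.0, 0.5, 0.1, 0.01):
    print(f"p={p}: loss={cross_entropy(p):.2f}")
```

Note how the loss grows slowly from 1.0 down to 0.5 but steeply below 0.1: the gradient, and therefore the learning pressure, concentrates on the confidently wrong predictions.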
This shape is exactly what you want. The model is rewarded for being confident and right, lightly punished for being uncertain, and severely punished for being confidently wrong. Most of the headline classification results of the past decade were trained with cross-entropy as the loss.
Binary vs multi-class
Two flavours of cross-entropy depending on the output shape:
- Binary cross-entropy (BCE): two classes. Output is a single sigmoid-activated probability between 0 and 1. Use for spam/not-spam, churn/not-churn, fraud/not-fraud.
- Categorical cross-entropy: K classes. Output is a softmax-normalised probability vector summing to 1. Use for digit classification, image labelling, language modelling.
For multi-label tasks (an example can have multiple correct labels at once, tags on a blog post), use binary cross-entropy applied independently to each label. Don’t use softmax there; it forces the labels to compete.
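The sigmoid-versus-softmax distinction is concrete in code. A stdlib-only sketch with made-up logits for three blog-post tags, two of which are correct:

```python
import math

def sigmoid(x):
    # Each label gets its own independent probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    # Normalises the K scores so they compete and sum to 1.
    exps = [math.exp(z - max(logits)) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def bce(p, y):
    # Binary cross-entropy for one label (y is 0 or 1).
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

logits = [2.0, 2.0, -1.0]  # raw scores for three tags
targets = [1, 1, 0]        # multi-label: the first two tags both apply

# Multi-label: sigmoid per tag, BCE summed independently per tag.
probs = [sigmoid(z) for z in logits]
loss = sum(bce(p, y) for p, y in zip(probs, targets))

# Softmax would force the two correct tags to split probability mass:
print(softmax(logits))  # each 2.0-score tag ends up well under 0.5
```

With sigmoid, both correct tags can sit near probability 0.88 at once; under softmax they are forced to split the mass and neither can exceed 0.5.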
When to write a custom loss
Most projects don’t need to. The standard losses are well-tuned, well-supported, and rarely the bottleneck.
Three cases where a custom loss makes sense:
- Asymmetric costs. False positives and false negatives have very different real-world consequences (e.g., medical screening). A weighted loss reflects that asymmetry.
- Ordinal output. The classes have a natural order (1-star to 5-star reviews). Standard cross-entropy treats them as unrelated. Ordinal regression losses respect the ordering.
- Composite objectives. You want the model to be accurate AND calibrated AND sparse. Sum or weighted average of three losses, each enforcing one property.
If you’re writing a custom loss because the standard one feels “not quite right,” build the eval first. Often the standard loss is fine and the perceived gap was elsewhere (data, architecture, learning rate).
Class imbalance: weighted losses
If 99% of examples are negative and 1% positive, plain cross-entropy lets the model achieve near-zero loss by predicting always-negative.
Two fixes:
- Weighted cross-entropy: weight each class’s contribution to the loss by the inverse of its frequency. Positive examples get 99× the weight of negatives. The model can’t ignore them.
- Focal loss: down-weights easy examples (correctly classified with high confidence) so the model focuses on the hard ones. Originally designed for object detection, broadly useful for imbalanced classification.
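The down-weighting in focal loss comes from one extra factor on top of BCE. A sketch of the standard formulation (Lin et al.) with the paper’s default γ=2 and α=0.25:

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    # Scales BCE by (1 - p_t)^gamma so confidently-correct (easy)
    # examples contribute almost nothing; gamma=2, alpha=0.25 are
    # the defaults from the original object-detection paper.
    p_t = p if y == 1 else 1 - p
    a_t = alpha if y == 1 else 1 - alpha
    return -a_t * (1 - p_t) ** gamma * math.log(p_t)

# Easy example (correct, confident) vs hard example (wrong, confident):
easy = focal_loss(0.95, 1)  # (1 - 0.95)^2 = 1/400 shrinks it hard
hard = focal_loss(0.05, 1)  # nearly full-strength, keeps the gradient
print(easy, hard)
```

On an imbalanced dataset the abundant, easily classified majority examples collapse toward zero loss, so the gradient is dominated by the rare, hard minority cases.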
For tabular tasks, oversampling the minority class (or undersampling the majority) often works as well as a weighted loss and is simpler. For image and text tasks, weighted loss generalises better.
The two failure modes to avoid: an accuracy number that hides the imbalance (report precision and recall instead), and a model trained on a balanced sample but evaluated on imbalanced production data. Both produce surprises in the postmortem.