Activation Functions: ReLU, Sigmoid, and Why They Matter
Without activation functions, a 100-layer neural network is mathematically equivalent to a 1-layer one. The activation is what gives depth its power. Pick the wrong one and the network can’t learn.
Why activations exist at all
Without activation functions, a neural network is just a stack of matrix multiplications. And it turns out that stacking matrix multiplications is mathematically equivalent to one matrix multiplication. A 100-layer linear-only network is exactly as expressive as a 1-layer one.
The activation function is a non-linear operation applied between layers. It breaks the linearity. Once you stack non-linear operations, the network can represent functions that linear math cannot.
This single fact, non-linearity makes depth meaningful, is what made deep learning possible at all. Pick the wrong activation, and the network can’t learn complex patterns no matter how deep you make it.
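The collapse of stacked linear layers is easy to verify directly. A minimal NumPy sketch (the matrix shapes here are arbitrary, chosen just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" of a linear-only network: just weight matrices, no activation.
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((2, 4))
x = rng.standard_normal(3)

# Applying the layers one after the other...
two_layer = W2 @ (W1 @ x)

# ...is identical to applying a single pre-multiplied matrix.
one_layer = (W2 @ W1) @ x

assert np.allclose(two_layer, one_layer)
```

Inserting any non-linear function between the two matrix multiplications breaks this equivalence, which is the whole point.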
ReLU: the workhorse
Rectified Linear Unit. The math: f(x) = max(0, x). If the input is positive, pass it through; if negative, output zero.
It’s the default activation in 95% of modern neural networks. Three reasons:
- Simple. A single elementwise comparison with zero, no exponentials or divisions. Fast on GPUs.
- No saturation on the positive side. The gradient is 1 for any positive input, which means gradients don’t vanish as they propagate back through deep networks.
- It works. ResNet, VGG, and most CNNs you’ve heard of use ReLU or a close variant.
The downside: the “dying ReLU” problem. If a neuron’s pre-activation ends up negative for every input in the data, its gradient is zero everywhere and the neuron is stuck outputting zero. Variants like Leaky ReLU (a small slope on the negative side) and PReLU (a learnable slope) address this. They give modest improvements in some settings but plain ReLU is fine for most.
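Both ReLU and its leaky variant are one-liners. A sketch in NumPy (the 0.01 slope is the conventional default for Leaky ReLU, not a tuned value):

```python
import numpy as np

def relu(x):
    # max(0, x), elementwise: negatives become zero, positives pass through.
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Small slope alpha on the negative side keeps a nonzero gradient there,
    # which is what prevents the dying-ReLU problem.
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
r = relu(x)        # negatives clipped to 0, positives unchanged
lr = leaky_relu(x) # negatives scaled by 0.01 instead of clipped
```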
Sigmoid and Tanh: legacy roles
Two functions that were standard before ReLU and are now reserved for specific roles.
Sigmoid squashes any input to the range (0, 1). The S-shaped curve looks pretty in textbooks. The problem: it saturates. Inputs above 5 or below -5 produce gradients near zero, so deep networks using sigmoid as a hidden activation fail to train (this was a bottleneck for deep learning until ReLU).
Sigmoid’s remaining role: as the output activation for binary classification, where you want a probability between 0 and 1.
Tanh is sigmoid’s cousin, ranging from -1 to 1. Better than sigmoid for hidden layers because the output is zero-centred. Still suffers from saturation. Used in some recurrent network designs (LSTM, GRU) where its specific shape matters; rarely used in modern feedforward networks.
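The saturation problem is visible in the sigmoid’s gradient, which is s(x)·(1 − s(x)). A quick sketch showing how fast it collapses away from zero:

```python
import numpy as np

def sigmoid(x):
    # Squashes any input into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s(x) * (1 - s(x)).
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))    # 0.25, the maximum possible value
print(sigmoid_grad(5.0))    # ~0.0066, already tiny
print(sigmoid_grad(-10.0))  # ~0.000045, effectively zero
```

Multiply a handful of these small gradients together across layers during backpropagation and the signal reaching early layers is essentially zero, which is why sigmoid failed as a hidden activation in deep networks.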
GELU and Swish: modern choices
Two activations that have edged out ReLU in transformers and other recent architectures.
GELU (Gaussian Error Linear Unit). f(x) = x · Φ(x), where Φ is the standard normal CDF. Mathematically smoother than ReLU near zero. Empirically slightly better in transformers. Used in BERT, GPT-2, and many recent LLMs.
Swish / SiLU. f(x) = x · sigmoid(x). Discovered by automated search; works well across many architectures. Used in EfficientNet, Llama, and others.
Both add a tiny amount of compute over ReLU and give a small accuracy boost in modern models. The boost is real but marginal, sometimes 0.5-1% on hard benchmarks, often nothing on easier ones. They’re defaults in transformer training largely because the marginal gain compounds at scale.
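Both are cheap to write down. A NumPy sketch using the tanh approximation of GELU (the form popularised by BERT and GPT-2; the exact form uses the normal CDF instead):

```python
import math
import numpy as np

def gelu(x):
    # Tanh approximation of x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + np.tanh(math.sqrt(2.0 / math.pi)
                                    * (x + 0.044715 * x ** 3)))

def silu(x):
    # Swish / SiLU: x * sigmoid(x), written out directly.
    return x / (1.0 + np.exp(-x))
```

Both behave like ReLU in the limits (identity for large positive x, zero for large negative x) but are smooth near zero rather than having a kink, which is where the empirical gains are thought to come from.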
Output activations: softmax and friends
The output layer’s activation is determined by the task, not by general-purpose tradeoffs:
- Softmax: K outputs that sum to 1. Standard for multi-class classification. Pair with cross-entropy loss.
- Sigmoid: a single output between 0 and 1. Standard for binary classification, or for multi-label tasks where each label is independent.
- Linear (no activation): continuous outputs without bound. Standard for regression. Pair with MSE loss.
The output activation and the loss function are tightly paired. Mismatching them (e.g., softmax + MSE) usually trains poorly or not at all. The pairings above are the no-think defaults.
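The softmax-plus-cross-entropy pairing looks like this in practice. A minimal NumPy sketch (the logits here are made-up example values; the max-subtraction is the standard trick to avoid overflow in the exponentials):

```python
import numpy as np

def softmax(z):
    # Subtracting the max doesn't change the result but keeps exp() stable.
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # raw scores from the output layer
p = softmax(logits)                 # K probabilities summing to 1

# Cross-entropy loss: negative log of the probability given to the true class.
true_class = 0
loss = -np.log(p[true_class])
```

In real frameworks the two are usually fused into a single op (e.g. a “softmax cross-entropy with logits” function) for numerical stability, which is another reason the pairing is treated as a unit.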
The rule of thumb
For 95% of new architectures: ReLU on hidden layers, softmax (or sigmoid) on the output, cross-entropy as loss for classification, MSE for regression.
For transformers and modern LLM-flavoured architectures: GELU or SwiGLU on hidden layers, the rest unchanged.
Don’t spend time picking activation functions early. Get the architecture and data right first. Activation choice is a 1% problem; data quality and learning rate are 50% problems. Solve those first.