Adversarial Examples and Defences
An adversarial example is an input crafted to fool a model. Tiny perturbations, invisible to humans, can flip a classifier. The defences are partial; the attacks evolve fast.
What adversarial examples are
An adversarial example is an input perturbed with carefully computed noise so that a model misclassifies it. In the canonical demonstration, a panda image plus tiny calibrated noise looks identical to a human but is classified as “gibbon” with 99% confidence.
The original 2014 demonstrations on image classifiers spawned a thousand papers. The phenomenon generalises to text, audio, and now LLMs.
Why models are vulnerable
Neural networks are continuous functions in very high-dimensional space. Decision boundaries are complex, and small movements across a nearby boundary can flip predictions. The training data covers only a tiny submanifold of input space; off that manifold, the model's behaviour is unconstrained.
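The dimensionality point can be made concrete with the linear argument from Goodfellow et al.: for a linear score w·x, a perturbation η = ε·sign(w) shifts the score by ε·Σ|wᵢ|, which grows with dimension even though no single coordinate moves by more than ε. A pure-Python sketch (the Gaussian weights and ε = 0.01 budget are illustrative, not from the original paper):

```python
import random

def adversarial_shift(w, eps):
    # Worst-case change in a linear score w . x when each input
    # coordinate may move by at most eps: achieved by eta = eps * sign(w),
    # giving a shift of eps * sum(|w_i|).
    return eps * sum(abs(wi) for wi in w)

random.seed(0)
for dim in (10, 1_000, 100_000):
    w = [random.gauss(0.0, 1.0) for _ in range(dim)]
    # A per-coordinate budget of 0.01 is imperceptible, yet the total
    # score shift grows roughly linearly with dimension.
    print(dim, round(adversarial_shift(w, 0.01), 3))
```

This is why a perturbation that is invisible per pixel can still move the logit by a large amount: the budget compounds across dimensions.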
This isn’t a bug to be fixed. It’s a property of how high-dimensional learned functions behave.
Attack categories
- White-box: attacker has model weights. Compute gradients, optimise the perturbation. Maximally effective.
- Black-box: attacker only has API access. Use queries to estimate gradients (zeroth-order) or exploit transfer.
- Transfer attacks: craft adversarial examples on a substitute model; they often work on the target model too.
- Universal perturbations: a single noise pattern that fools the model on most inputs.
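A concrete white-box instance is the fast gradient sign method (FGSM): step each input coordinate by ε in the direction of the loss gradient. A minimal sketch on a hand-rolled logistic classifier; the weights, input, and ε below are made-up toy values, and real attacks use autodiff on the target network:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    # Probability of class 1 under a logistic model.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def sign(g):
    return (g > 0) - (g < 0)

def fgsm(w, x, y, eps):
    # Gradient of the negative log-likelihood w.r.t. the input:
    # d/dx [-log p(y | x)] = (p - y) * w   for y in {0, 1}.
    p = predict(w, x)
    return [xi + eps * sign((p - y) * wi) for xi, wi in zip(x, w)]

w = [2.0, -1.0, 3.0, -0.5]   # toy model weights (assumed)
x = [0.2, -0.1, 0.1, 0.2]    # clean input, true label y = 1
x_adv = fgsm(w, x, y=1, eps=0.5)
print("clean p(y=1):      ", round(predict(w, x), 3))      # above 0.5: correct
print("adversarial p(y=1):", round(predict(w, x_adv), 3))  # below 0.5: flipped
```

The perturbation is bounded by ε in every coordinate, yet it flips the prediction: each coordinate moves in whichever direction increases the loss, and the effects add up across dimensions.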
Defences
- Adversarial training: include adversarial examples in training. Improves robustness, hurts accuracy on clean data.
- Input transformations: rescale, denoise, randomise inputs before classification. Partial defence.
- Detection: a separate classifier flags “suspicious” inputs. Cat-and-mouse.
None is bulletproof. The field has converged on adversarial training as the strongest practical defence, with the accuracy hit accepted.
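Adversarial training can be sketched in the same toy logistic setting: at every SGD step, attack the current model's input with FGSM before computing the weight update, so the model is fit to the perturbed points rather than the clean ones. The data, ε, learning rate, and function names here are all illustrative:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sign(g):
    return (g > 0) - (g < 0)

def fgsm(w, x, y, eps):
    # Inner maximisation: perturb x to increase the current model's loss.
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [xi + eps * sign((p - y) * wi) for xi, wi in zip(x, w)]

def adversarial_train(data, eps=0.3, lr=0.1, steps=500):
    w = [0.0, 0.0]
    for _ in range(steps):
        x, y = random.choice(data)
        x = fgsm(w, x, y, eps)  # train on the attacked input, not the clean one
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        # Outer minimisation: ordinary gradient step on the perturbed point.
        w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x)]
    return w

random.seed(1)
data = [([1.0, 1.0], 1), ([1.2, 0.8], 1),
        ([-1.0, -1.0], 0), ([-0.8, -1.2], 0)]
w = adversarial_train(data)
print("trained weights:", [round(wi, 2) for wi in w])
```

The resulting model classifies clean points correctly and also resists the ε-bounded attack it was trained against; the clean-accuracy cost shows up on harder datasets where robust and accurate decision boundaries diverge.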
LLM-specific cases
For LLMs, adversarial examples often look like jailbreak prompts: oddly-formatted strings that get the model to bypass safety training. Prompt injection (covered separately) is a related concern. The attack surface is text, not pixels, but the underlying mathematics rhymes.