AI & ML Advanced · By Samson Tanimawo, PhD · Published Apr 28, 2026 · 6 min read

Adversarial Examples and Defense

An adversarial example is an input crafted to fool a model. Tiny perturbations, invisible to humans, can flip a classifier. The defences are partial; the attacks evolve fast.

What adversarial examples are

An adversarial example is an input perturbed with carefully computed noise that causes a model to misclassify it. In the canonical demonstration, a panda image plus tiny calibrated noise looks identical to the original to a human, but the classifier labels it “gibbon” with 99% confidence.

The original 2014 demonstrations on image classifiers spawned a thousand papers. The phenomenon generalises to text, audio, and now LLMs.

Why models are vulnerable

Neural networks are continuous functions in very high-dimensional space. Decision boundaries are complex; small movements perpendicular to a boundary can flip predictions. The training data covers a tiny submanifold of input space; off-manifold inputs are unconstrained.
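The high-dimensional intuition can be made concrete with a toy linear model (this is the linear view of the phenomenon, not a claim about any specific network; all numbers below are hypothetical). A perturbation of size ε per feature, aligned with the signs of the weights, shifts the output by ε times the L1 norm of the weight vector, which grows with dimension even though each individual change stays tiny:

```python
import numpy as np

# Sketch: in d dimensions, an epsilon-per-feature perturbation aligned
# with sign(w) shifts a linear model's logit by epsilon * ||w||_1,
# which scales with d even though no single feature moves much.
rng = np.random.default_rng(0)
d = 10_000                       # input dimensionality (hypothetical)
w = rng.normal(0, 1, size=d)     # weights of a toy linear classifier
x = rng.normal(0, 1, size=d)     # a clean input
eps = 0.01                       # imperceptibly small per-feature change

delta = eps * np.sign(w)         # worst-case L-infinity perturbation
shift = w @ (x + delta) - w @ x  # resulting change in the logit

print(f"per-feature change: {eps}")
print(f"logit shift: {shift:.1f}")  # equals eps * ||w||_1, large here
```

With these numbers the logit moves by roughly 80 while no feature changes by more than 0.01, which is why a visually invisible perturbation can cross a decision boundary.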

This isn’t a bug to be fixed. It’s a property of how high-dimensional learned functions behave.

Attack categories

Attacks are usually sorted along two axes. Access: white-box attacks see the model’s weights and gradients (FGSM, PGD); black-box attacks only query outputs, or craft examples against a substitute model and rely on transfer. Goal: untargeted attacks want any misclassification; targeted attacks force a specific wrong label. Physical-world variants (adversarial patches, perturbed stop signs) survive printing and camera capture.
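The simplest white-box attack, the fast gradient sign method (FGSM), can be sketched on a toy logistic regression, where the gradient of the loss with respect to the input is analytic. Everything here is a hypothetical illustration, not a recipe for a real system:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    # FGSM: step in the sign of the loss gradient w.r.t. the input.
    # For logistic regression this gradient is analytic: dL/dx = (p - y) * w.
    p = sigmoid(w @ x + b)
    return x + eps * np.sign((p - y) * w)

# Toy setup (hypothetical numbers): a confidently correct prediction.
rng = np.random.default_rng(1)
d = 500
w = rng.normal(0, 1, d)
b = 0.0
y = 1.0                                   # true label
x = 0.05 * np.sign(w) + 0.01 * rng.normal(0, 1, d)

clean_p = sigmoid(w @ x + b)              # near 1: right and confident
x_adv = fgsm(x, y, w, b, eps=0.1)         # per-feature change of only 0.1
adv_p = sigmoid(w @ x_adv + b)            # collapses toward 0
print(f"clean P(y=1) = {clean_p:.3f}")
print(f"adv   P(y=1) = {adv_p:.3f}")
```

The perturbation is bounded by 0.1 per feature, yet it drags the model from near-certain correct to near-certain wrong. Stronger attacks like PGD iterate this step with projection back into the ε-ball.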

Defences

The main lines of defence: adversarial training (generating adversarial examples during training and fitting them), input preprocessing (denoising, compression, randomised smoothing for certified bounds), and detection (flagging inputs that look off-manifold). Approaches that merely mask gradients give a false sense of security; adaptive attacks break them.

None is bulletproof. The field has converged on adversarial training as the strongest practical defence, with the accuracy hit accepted.
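Adversarial training can be sketched on the same toy linear model: the inner loop perturbs each training batch with FGSM, the outer loop fits the perturbed batch. The data below is a deliberately contrived, hypothetical construction with one robust feature and many weak ones, chosen so the effect is visible; real adversarial training uses multi-step PGD on deep networks:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(X, y, w, eps):
    # Batch FGSM for a linear (logistic) model: perturb each input in the
    # sign of the loss gradient w.r.t. that input.
    p = sigmoid(X @ w)
    return X + eps * np.sign(np.outer(p - y, w))

def train(X, y, eps=0.0, steps=400, lr=0.5):
    # Gradient descent on logistic loss; with eps > 0 each step fits the
    # FGSM-perturbed batch instead (adversarial training, min over max).
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        Xt = fgsm(X, y, w, eps) if eps > 0 else X
        p = sigmoid(Xt @ w)
        w -= lr * Xt.T @ (p - y) / len(y)
    return w

def adv_accuracy(X, y, w, eps):
    p = sigmoid(fgsm(X, y, w, eps) @ w)
    return float(np.mean((p > 0.5) == (y == 1)))

# Hypothetical data: one robust feature (margin 1.0) plus 20 weak
# features (margin 0.1 each) that an eps = 0.5 attack can flip.
n = 200
s = rng.choice([-1.0, 1.0], size=n)
strong = s[:, None] * 1.0
weak = s[:, None] * 0.1 * np.ones((n, 20))
X = np.hstack([strong, weak]) + 0.05 * rng.normal(0, 1, (n, 21))
y = (s > 0).astype(float)

eps = 0.5
w_std = train(X, y)            # standard training leans on weak features
w_adv = train(X, y, eps=eps)   # adversarial training learns to drop them
print("adv. accuracy, standard:   ", adv_accuracy(X, y, w_std, eps))
print("adv. accuracy, adversarial:", adv_accuracy(X, y, w_adv, eps))
```

The adversarially trained weights concentrate on the robust feature and hold up under attack; the standard model spreads weight over the fragile features and collapses. The clean-accuracy cost mentioned above shows up in richer settings where the dropped features carried genuine signal.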

LLM-specific cases

For LLMs, adversarial examples often look like jailbreak prompts: oddly formatted strings that get the model to bypass safety training. Prompt injection (covered separately) is a related concern. The attack surface is text, not pixels, but the underlying mathematics rhymes.