Model Theft and Extraction Attacks
An attacker queries your model API and reconstructs your model from the responses. Model extraction is real, demonstrated, and harder to prevent than most teams expect.
What extraction is
An attacker queries your deployed model with carefully chosen inputs, observes the outputs, and trains a replica. With enough queries, the replica matches the original closely enough to use commercially.
Demonstrations in 2024-2025 have shown GPT-4-class capability extracted for roughly $50K in API calls. The replica isn't identical, but it's capability-comparable for most tasks.
How attackers do it
- Distillation-style: use the API as a teacher. Train a student on its outputs.
- Functional extraction: probe specific capabilities (reasoning, code, multilingual) and replicate each separately.
- Embedding extraction: for embedding APIs, query a corpus and rebuild the embedding model itself.
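The distillation-style attack is worth dwelling on because it requires no special tooling. A minimal sketch of the collection step, assuming a hypothetical `query_teacher` function standing in for the victim's API:

```python
import json
import random

# Hypothetical stand-in for the victim API -- in a real attack this
# would be an HTTPS call to the hosted model endpoint.
def query_teacher(prompt: str) -> str:
    return f"teacher answer to: {prompt}"

def build_distillation_set(prompts, budget):
    """Collect (prompt, completion) pairs up to a query budget.

    The result is exactly the format most fine-tuning pipelines accept,
    which is why distillation-style extraction is cheap: the attack
    pipeline is an ordinary fine-tuning pipeline pointed at someone
    else's model.
    """
    pairs = []
    for prompt in random.sample(prompts, min(budget, len(prompts))):
        pairs.append({"prompt": prompt, "completion": query_teacher(prompt)})
    return pairs

dataset = build_distillation_set(
    ["Summarise this contract", "Write a bash one-liner", "Translate to French"],
    budget=2,
)
print(json.dumps(dataset[0]))
```

From here the attacker fine-tunes any open-weights student on the collected pairs; the query budget, not the training, is the dominant cost.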
Cost depends on what the attacker wants: a fully-working clone is expensive; a model that matches on a specific task is cheap.
Defences
- Rate limiting: cap query volume per account. Slows extraction, but determined attackers rotate across many accounts.
- Output watermarking: subtle statistical signatures in outputs that prove a model is derived from yours. Detection after the fact, not prevention.
- Output noise: add randomness to outputs. Degrades the extracted replica, but also degrades answers for legitimate users.
- Capability-targeting: serve cheaper models for queries that look like extraction probes; reserve flagship for trusted users.
Legal landscape
Model extraction sits in unsettled IP territory. Terms of service typically prohibit it, but enforcement is hard. Several major lawsuits in 2024-2025 are testing whether trained model weights count as protectable trade secrets, copyrightable works, or neither.
For now, the practical defence is layered: rate limit, watermark, monitor, sue when you catch a bad actor. The combination doesn’t prevent extraction; it raises the cost enough that most attackers go elsewhere.
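The watermark-then-monitor layers can be sketched together. This is a toy version of keyed green-list watermark detection, not any production scheme: the key, the 50/50 green split, and whitespace tokenisation are all assumptions for illustration.

```python
import hashlib

KEY = b"secret-watermark-key"  # hypothetical detection key held by the model owner

def is_green(prev_token: str, token: str) -> bool:
    """A keyed hash of the preceding token marks ~half the vocabulary 'green'."""
    digest = hashlib.sha256(KEY + prev_token.encode() + token.encode()).digest()
    return digest[0] % 2 == 0

def green_fraction(text: str) -> float:
    """Fraction of tokens that land on the green list for their context.

    Unwatermarked text scores near 0.5. A watermarked model is steered to
    over-select green tokens, so its outputs -- and a replica distilled
    from them -- score significantly higher, which is the evidence used
    after the fact.
    """
    tokens = text.split()
    if len(tokens) < 2:
        return 0.0
    hits = sum(is_green(a, b) for a, b in zip(tokens, tokens[1:]))
    return hits / (len(tokens) - 1)

score = green_fraction("the quick brown fox jumps over the lazy dog")
print(f"green fraction: {score:.2f}")
```

Note the asymmetry this illustrates: detection tells you a suspect model was distilled from yours, but only after the replica already exists, which is why it pairs with rate limiting rather than replacing it.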