Model Theft and Extraction Attacks
An attacker queries your model API and reconstructs your model from the responses. Model extraction is real, demonstrated, and harder to prevent than most teams expect.
What extraction is
An attacker queries your deployed model with carefully chosen inputs, observes the outputs, and trains a replica. With enough queries, the replica matches the original closely enough to use commercially.
Demonstrations in 2024-2025 have shown GPT-4-class capability extracted for roughly $50K in API calls. The replica isn't identical, but it's capability-comparable for most tasks.
How attackers do it
- Distillation-style: use the API as a teacher. Train a student on its outputs.
- Functional extraction: probe specific capabilities (reasoning, code, multilingual) and replicate each separately.
- Embedding extraction: for embedding APIs, query a corpus and rebuild the embedding model itself.
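The distillation-style attack is worth dwelling on because it requires no special tooling. A minimal sketch of the collection step, assuming a hypothetical `query_teacher` function standing in for the victim's API:

```python
import json
import random

# Hypothetical stand-in for the victim API -- in a real attack this
# would be an HTTPS call to the hosted model endpoint.
def query_teacher(prompt: str) -> str:
    return f"teacher answer to: {prompt}"

def build_distillation_set(prompts, budget):
    """Collect (prompt, completion) pairs up to a query budget.

    The result is exactly the format most fine-tuning pipelines accept,
    which is why distillation-style extraction is cheap: the attack
    pipeline is an ordinary fine-tuning pipeline pointed at someone
    else's model.
    """
    pairs = []
    for prompt in random.sample(prompts, min(budget, len(prompts))):
        pairs.append({"prompt": prompt, "completion": query_teacher(prompt)})
    return pairs

dataset = build_distillation_set(
    ["Summarise this contract", "Write a bash one-liner", "Translate to French"],
    budget=2,
)
print(json.dumps(dataset[0]))
```

From here the attacker fine-tunes any open-weights student on the collected pairs; the query budget, not the training, is the dominant cost.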
Cost depends on what the attacker wants: a fully-working clone is expensive; a model that matches on a specific task is cheap.
Defences
- Rate limiting: cap query volume per account. Slows extraction, but determined attackers rotate across many accounts.
- Output watermarking: subtle statistical signatures in outputs that prove a model is derived from yours. Detection after the fact, not prevention.
- Output noise: add randomness to outputs. Degrades the extracted replica, but also degrades answers for legitimate users.
- Capability-targeting: serve cheaper models for queries that look like extraction probes; reserve flagship for trusted users.
Legal landscape
Model extraction sits in unsettled IP territory. Terms of service typically prohibit it, but enforcement is hard. Several major lawsuits in 2024-2025 are testing whether trained model weights count as protectable trade secrets, copyrightable works, or neither.
For now, the practical defence is layered: rate limit, watermark, monitor, sue when you catch a bad actor. The combination doesn’t prevent extraction; it raises the cost enough that most attackers go elsewhere.
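The watermark-then-monitor layers can be sketched together. This is a toy version of keyed green-list watermark detection, not any production scheme: the key, the 50/50 green split, and whitespace tokenisation are all assumptions for illustration.

```python
import hashlib

KEY = b"secret-watermark-key"  # hypothetical detection key held by the model owner

def is_green(prev_token: str, token: str) -> bool:
    """A keyed hash of the preceding token marks ~half the vocabulary 'green'."""
    digest = hashlib.sha256(KEY + prev_token.encode() + token.encode()).digest()
    return digest[0] % 2 == 0

def green_fraction(text: str) -> float:
    """Fraction of tokens that land on the green list for their context.

    Unwatermarked text scores near 0.5. A watermarked model is steered to
    over-select green tokens, so its outputs -- and a replica distilled
    from them -- score significantly higher, which is the evidence used
    after the fact.
    """
    tokens = text.split()
    if len(tokens) < 2:
        return 0.0
    hits = sum(is_green(a, b) for a, b in zip(tokens, tokens[1:]))
    return hits / (len(tokens) - 1)

score = green_fraction("the quick brown fox jumps over the lazy dog")
print(f"green fraction: {score:.2f}")
```

Note the asymmetry this illustrates: detection tells you a suspect model was distilled from yours, but only after the replica already exists, which is why it pairs with rate limiting rather than replacing it.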