Prompt Version Control: The Discipline That Pays Off

Prompts are code. Version them, review them, test them. This piece covers the git workflow for prompts and the eval gate that protects every change.

Prompts in git

Treating prompts like code starts with version control. Each prompt lives in its own .md or .txt file in the repo, so it is versioned, reviewable, and diffable. PRs that change prompts get the same review as PRs that change code. Tag each prompt version with a release identifier and log that identifier with every model invocation, so any output can be traced to the exact prompt text that produced it and debugging is reproducible.

Per-prompt file in repo. One .md or .txt file per prompt: versioned, reviewable, diffable; no more “we changed something but can’t remember what”.
Same review as code. A second pair of eyes catches subtle regressions; the discipline matches engineering norms.
Versioned with release id. Model invocation logs the version; debugging is reproducible.
Per-prompt change history. Git history is the audit trail; supports investigation when behaviour shifts.
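
As a minimal sketch of logging the prompt version with each invocation, the snippet below loads a prompt from its file and builds a log record for one request. The directory layout, the field names, and the functions load_prompt and log_invocation are assumptions made for illustration, not part of any particular library. The content hash is an extra safeguard that is also an assumption here: it flags a prompt file that was edited without a new release tag.

```python
import hashlib
import json
import logging
from dataclasses import dataclass
from pathlib import Path

logger = logging.getLogger("prompt_calls")


@dataclass(frozen=True)
class Prompt:
    name: str
    release: str  # release identifier, e.g. a git tag (naming is hypothetical)
    text: str
    sha256: str   # short content hash of the prompt text


def load_prompt(prompts_dir: Path, name: str, release: str) -> Prompt:
    """Read <prompts_dir>/<name>.md and attach the release id and a content hash."""
    text = (prompts_dir / f"{name}.md").read_text(encoding="utf-8")
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    return Prompt(name=name, release=release, text=text, sha256=digest)


def log_invocation(prompt: Prompt, request_id: str) -> dict:
    """Emit one structured log line per model call and return the record."""
    record = {
        "request_id": request_id,
        "prompt": prompt.name,
        "prompt_release": prompt.release,
        "prompt_sha256": prompt.sha256,
    }
    logger.info(json.dumps(record, sort_keys=True))
    return record
```

Calling log_invocation(prompt, request_id) next to the model call is enough to answer "which prompt text produced this output?" later, provided the release identifier maps back to a commit in the repo.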

Eval gate on every PR

An eval gate is what makes prompt quality compound. Every prompt PR runs the eval suite: on a pass, the merge proceeds; on a fail, the PR stays open until the prompt or the eval is fixed. Overrides are allowed, but each one must be written down (“Accepting eval regression on case-12 because new prompt fixes case-37 which is more important”) and logged. Without the gate, prompts drift; with it, they compound.

Eval suite per PR. Pass: merge proceeds; fail: PR stays open until prompt or eval is fixed.
Written overrides. “Accepting case-12 regression for case-37 fix”; the override is documented.
Logged overrides. Override entries are persisted, so the audit trail stays intact.
Per-PR quality compounding. Without the gate prompts drift; with it, they compound.
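
The gate logic described above can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: CaseResult, Override, GateDecision, and evaluate_gate are hypothetical names, and the eval runner that produces the per-case results is assumed to exist elsewhere. A failing case blocks the merge unless it has an override with a non-empty written reason, and every accepted override is logged.

```python
import logging
from dataclasses import dataclass
from typing import List

logger = logging.getLogger("eval_gate")


@dataclass(frozen=True)
class CaseResult:
    case_id: str
    passed: bool


@dataclass(frozen=True)
class Override:
    case_id: str
    reason: str  # the written justification; blank reasons are rejected


@dataclass(frozen=True)
class GateDecision:
    merge_allowed: bool
    blocking: List[str]       # failing cases with no valid override
    accepted: List[Override]  # failing cases waved through, with reasons


def evaluate_gate(results: List[CaseResult], overrides: List[Override]) -> GateDecision:
    reasons = {o.case_id: o.reason.strip() for o in overrides}
    blocking: List[str] = []
    accepted: List[Override] = []
    for result in results:
        if result.passed:
            continue
        reason = reasons.get(result.case_id, "")
        if reason:
            # Override is written down, so accept it and log it for the audit trail.
            accepted.append(Override(result.case_id, reason))
            logger.warning("eval override: %s: %s", result.case_id, reason)
        else:
            blocking.append(result.case_id)
    return GateDecision(merge_allowed=not blocking, blocking=blocking, accepted=accepted)
```

In CI, the job would exit non-zero whenever merge_allowed is False, which keeps the PR open until the prompt, the eval, or a written override resolves each blocking case.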

What to put in the prompt vs in code

The split between prompt and code is opinionated. The prompt holds reasoning steps, output format, and constraints that can be expressed in language. Code holds routing, validation, deterministic logic, and tool calls. When in doubt, push toward code: code can be tested deterministically, while the model output a prompt produces is stochastic.

Prompt holds reasoning. Reasoning steps, format, constraints expressible in language.
Code holds determinism. Routing, validation, deterministic logic, tool calls.
Push toward code. When in doubt, move the logic into code; code is testable, prompts are stochastic.
Per-team scope rule. Document the split in the engineering handbook so prompt design stays consistent across the team.
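
One way to picture the split, as a sketch with made-up categories and queue names: the prompt asks the model to classify a ticket and reply as JSON, and code owns validation and routing. Everything named here (ALLOWED, ROUTES, parse_category, route) is hypothetical. The point is that the routing table and the fallback behaviour can be unit tested without calling a model.

```python
import json

# Hypothetical categories the prompt asks the model to choose from.
ALLOWED = {"billing", "technical", "other"}

# Deterministic routing lives in code, not in the prompt.
ROUTES = {
    "billing": "billing-queue",
    "technical": "support-queue",
    "other": "triage-queue",
}


def parse_category(model_output: str) -> str:
    """Validate the model's JSON reply; fall back to 'other' on anything unexpected."""
    try:
        data = json.loads(model_output)
    except json.JSONDecodeError:
        return "other"
    if not isinstance(data, dict):
        return "other"
    category = data.get("category")
    if isinstance(category, str) and category in ALLOWED:
        return category
    return "other"


def route(model_output: str) -> str:
    """Map a raw model reply to a queue name."""
    return ROUTES[parse_category(model_output)]
```

Because malformed or unexpected model output always lands in a known fallback, the stochastic part of the system is confined to choosing the category.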

Rollback when something regresses

Rollback discipline closes the loop. Production logs the prompt version per request, so a regression is traceable to the prompt change that caused it. Rollback is a single PR that reverts the offending change, which keeps it fast and reversible. After the rollback, write the eval case that would have caught the regression, so the same failure is loud the next time it appears.

Per-request prompt version logged. Regression traceable to the prompt change that caused it.
Single-PR rollback. Reverts the offending prompt change; fast, reversible.
Post-rollback eval case. Write the case that would have caught the regression; lands in the suite.
Per-rollback learning. Each rollback grows the eval suite; future regressions are loud.
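
To show how per-request version logging makes a regression traceable, here is a small sketch that groups request logs by prompt release and compares failure rates. The record fields prompt_release and failed are assumptions about the log schema, and what counts as a failure (a validation error, a user thumbs-down, and so on) is left to the team.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def failure_rate_by_release(records: Iterable[Mapping]) -> Dict[str, float]:
    """Return the fraction of failed requests for each prompt release."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for rec in records:
        release = rec["prompt_release"]
        totals[release] += 1
        if rec["failed"]:
            failures[release] += 1
    return {release: failures[release] / totals[release] for release in totals}


def worst_release(records: Iterable[Mapping]) -> str:
    """Name the release with the highest failure rate: the rollback candidate."""
    rates = failure_rate_by_release(records)
    return max(rates, key=rates.get)
```

A jump in failure rate that lines up with one release points at the PR to revert. The failing requests from that release are then natural raw material for the new eval case.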