If prompts can break your product, they deserve the same engineering rigor as the code that calls them.
Most engineering teams version their code obsessively and their prompts not at all. Prompts live in environment variables, database rows, or copy-pasted strings scattered across the codebase. Nobody knows what changed, nobody can diff a prompt update, and regressions go undetected until users complain.
This is the same mistake teams made with configuration before infrastructure-as-code. The fix is the same: treat each prompt as a first-class artifact with its own file, format, and review process.
Freeform prompt strings have four specific failure modes in production:
Inline strings such as prompt = f"Review this diff: {diff_content}" mix prompt logic with application logic and make both harder to test.
A prompt template is a YAML file with a fixed schema: name, description, variables (with types and defaults), and the body with variable placeholders. Here's a real example for a PR review workflow:
name: pr-review
description: Review a pull request diff for issues
version: "1.2"
model: claude-sonnet-4-5
variables:
  - name: diff
    description: The git diff output
    required: true
  - name: language
    description: Primary language of the changed files
    default: auto-detect
  - name: focus
    description: What to focus on (security, perf, style, all)
    default: all
body: |
  ## Role
  You are an expert {{language}} engineer reviewing a pull request.
  ## Task
  Analyze the diff and identify issues in: {{focus}}.
  Be concise. Only flag real problems. Format as a numbered list.
  Skip minor style nits unless they indicate a larger pattern.
  ## Diff
  {{diff}}
This file lives in your repo at prompts/pr-review.yaml. It gets reviewed in PRs, versioned in git, and you can send it from anywhere:
promptctl send pr-review \
--var diff="$(git diff main...HEAD)" \
--var language=TypeScript \
--var focus=security
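To make the schema concrete, here is a minimal Python sketch of how a renderer might fill the placeholders, apply defaults, and reject a call that omits a required variable. It is not promptctl's implementation: the TEMPLATE dict is a hand-written stand-in for the parsed YAML file, and render and PLACEHOLDER are hypothetical names chosen for illustration.

```python
import re

# Hypothetical in-memory form of prompts/pr-review.yaml after parsing
# (hand-written here so the sketch needs no YAML library).
TEMPLATE = {
    "name": "pr-review",
    "variables": [
        {"name": "diff", "required": True},
        {"name": "language", "default": "auto-detect"},
        {"name": "focus", "default": "all"},
    ],
    "body": (
        "You are an expert {{language}} engineer reviewing a pull request.\n"
        "Analyze the diff and identify issues in: {{focus}}.\n"
        "{{diff}}\n"
    ),
}

# Matches {{name}} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: dict, supplied: dict) -> str:
    """Fill {{name}} placeholders, applying defaults and enforcing required variables."""
    declared = {v["name"] for v in template["variables"]}
    unknown = set(supplied) - declared
    if unknown:
        raise ValueError(f"unknown variables: {sorted(unknown)}")

    values = {}
    for var in template["variables"]:
        name = var["name"]
        if name in supplied:
            values[name] = str(supplied[name])
        elif "default" in var:
            values[name] = str(var["default"])
        elif var.get("required"):
            raise ValueError(f"missing required variable: {name}")

    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise ValueError(f"placeholder has no value: {name}")
        return values[name]

    # Single pass with a function: text inside a supplied value is never re-expanded.
    return PLACEHOLDER.sub(substitute, template["body"])


if __name__ == "__main__":
    print(render(TEMPLATE, {"diff": "+x = 1", "language": "TypeScript"}))
```

Because the variables are declared in the template, a caller that forgets diff fails loudly at render time instead of sending a half-filled prompt to the model.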
You don't have to write templates from scratch. If you already have prompts that work, promptctl create converts them to structured YAML automatically:
# From a description
promptctl create "code review for TypeScript PRs, focus on security and performance"
# From an existing prompt string
promptctl create "$(cat my-prompt.txt)"
# Interactive — asks clarifying questions
promptctl create
The generated template extracts implicit variables (anything that would change between calls), sets up the role/task/input structure, and saves it to your local prompt library.
Once prompts are files, they fit naturally into your existing git workflow:
# List all prompts in your library
promptctl list
# See the diff between two versions
git diff HEAD~1 prompts/pr-review.yaml
# Update and commit
promptctl edit pr-review
git add prompts/pr-review.yaml
git commit -m "pr-review: tighten security focus, reduce false positives"
Anyone on the team can open a PR to change a prompt. The change is visible, reviewable, and reversible. You can add a CODEOWNERS rule so prompt changes require sign-off from whoever owns the LLM workflow.
This is where prompts-as-code pays off most. Every time you change a template, you can run it against a set of known inputs and compare outputs to a baseline:
# Run benchmark with default inputs
promptctl benchmark pr-review
# Compare two versions head-to-head
promptctl benchmark pr-review --compare pr-review-v2
# Record a baseline (save current outputs as the reference)
promptctl benchmark pr-review --record
The benchmark runs both versions against the same inputs, shows you token counts side-by-side, and flags semantic differences. If the new version produces meaningfully different output on your test cases, you know before it ships.
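The comparison step can be pictured with a small Python sketch. This is an illustration under stated assumptions, not how promptctl computes its results: whitespace splitting stands in for a real tokenizer, difflib's character-level similarity stands in for a semantic comparison, and the compare function and its 0.8 threshold are hypothetical.

```python
import difflib


def token_count(text: str) -> int:
    # Whitespace split: a rough stand-in for a real tokenizer.
    return len(text.split())


def compare(baseline: dict, candidate: dict, threshold: float = 0.8) -> list:
    """Compare outputs case by case; flag a case when similarity falls below threshold."""
    rows = []
    for case_id in sorted(baseline):
        old = baseline[case_id]
        new = candidate.get(case_id, "")  # a missing case counts as empty output
        similarity = difflib.SequenceMatcher(None, old, new).ratio()
        rows.append({
            "case": case_id,
            "baseline_tokens": token_count(old),
            "candidate_tokens": token_count(new),
            "similarity": round(similarity, 2),
            "flagged": similarity < threshold,
        })
    return rows


if __name__ == "__main__":
    # Hypothetical recorded outputs, keyed by test case id.
    baseline = {"sql-injection": "1. Unsanitized input reaches the query builder."}
    candidate = {"sql-injection": "No issues found."}
    for row in compare(baseline, candidate):
        print(row)
```

The design point is the per-case row: a reviewer sees which input changed behavior and by how much, rather than a single pass/fail for the whole prompt.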
Regression testing belongs in CI, not just local development. Add a step to your pipeline:
# .github/workflows/prompt-regression.yml
- name: Run prompt regression tests
  run: |
    promptctl benchmark pr-review --record-if-missing
    promptctl benchmark support-triage --record-if-missing
  env:
    GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
Now every PR that touches a prompt file automatically runs the benchmark. The CI output shows whether the change improved, degraded, or had no meaningful effect on the prompt's behavior.
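The step shown above names its prompts explicitly. If a pipeline should instead benchmark only the templates a PR actually touched, one possible approach is to map changed file paths to prompt names. The Python sketch below assumes templates sit directly under prompts/ with a .yaml or .yml extension; prompts_to_benchmark is a hypothetical helper, and the list of changed paths would come from something like git diff --name-only.

```python
from pathlib import PurePosixPath


def prompts_to_benchmark(changed_paths, prompt_dir="prompts"):
    """Return names of prompt templates among the changed files, sorted and de-duplicated."""
    names = set()
    for raw in changed_paths:
        path = PurePosixPath(raw)
        # Only files directly inside the prompt directory count as templates.
        if path.parent == PurePosixPath(prompt_dir) and path.suffix in {".yaml", ".yml"}:
            names.add(path.stem)
    return sorted(names)


if __name__ == "__main__":
    changed = [
        "prompts/pr-review.yaml",
        "src/app.py",
        "prompts/support-triage.yaml",
        "prompts/README.md",
    ]
    for name in prompts_to_benchmark(changed):
        print(name)
```

Each returned name could then be passed to promptctl benchmark in a loop, so unrelated PRs skip the benchmark entirely.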
Teams that adopt prompts-as-code typically report three things:
The discipline isn't the hard part. The hard part is building the habit when your current process "works fine." It works fine until one prompt change silently degrades a workflow and you spend three days debugging the wrong thing.
Run promptctl create on whatever prompt is causing the most maintenance headaches and see what the structured version looks like.
brew tap prompt-ctl/tap
brew install --cask prompt-ctl/tap/promptctl
promptctl create "your existing prompt text here"