API design for a stochastic function.
"Prompt engineering" sounds like cargo-cult incantation. It isn't. It's the discipline of writing clear, specific, testable instructions for a model that will otherwise improvise. Treat it like you would treat a fuzzy endpoint — with a contract, examples, schema, and measurement.
In plain English. A prompt is a spec for a function whose body you can't see. The clearer the spec, the more reliable the output. Vague specs produce vague software; vague prompts produce vague answers.
```mermaid
flowchart TB
    A["Zero-shot<br/>just ask"] --> B["Clear instructions<br/>role, format, constraints"]
    B --> C["Few-shot<br/>show examples"]
    C --> D["Chain of thought<br/>think step by step"]
    D --> E["Self-consistency<br/>sample N, vote"]
    E --> F["Tree of thoughts<br/>explore branches"]
    F --> G["Tool use<br/>delegate to code"]
    G --> H["Agentic loop<br/>plan, act, verify, repeat"]
    style A fill:#e8f4ff
    style B fill:#cfe7ff
    style C fill:#a8d4ff
    style D fill:#84c0ff
    style E fill:#5fa9ff
    style F fill:#3b91ff
    style G fill:#1f7aef
    style H fill:#0a5fcf,color:#ffffff
```
Each level adds reliability and cost. Climb only as high as you need.
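One rung is cheaper to demystify with code than with prose: self-consistency is just sampling and voting. A minimal sketch, assuming an Anthropic client, a placeholder model name, and a task whose final answer fits on one line:

```python
from collections import Counter

from anthropic import Anthropic

client = Anthropic()

def self_consistent_answer(question: str, n: int = 5) -> str:
    """Sample n independent answers at high temperature, return the majority vote."""
    answers = []
    for _ in range(n):
        resp = client.messages.create(
            model="claude-sonnet-4-5",  # placeholder model name
            max_tokens=512,
            temperature=1.0,  # diversity is the point: identical samples can't vote
            messages=[{
                "role": "user",
                "content": question + "\nEnd with 'ANSWER: <answer>' on its own line.",
            }],
        )
        text = resp.content[0].text
        # Keep only the final answer line so the votes are comparable.
        answers.append(text.rsplit("ANSWER:", 1)[-1].strip())
    return Counter(answers).most_common(1)[0][0]
```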
```mermaid
flowchart TB
    subgraph System
        S1[Role + identity]
        S2[Capabilities + constraints]
        S3[Tone + style]
        S4[Output format / schema]
        S5[Safety rules]
    end
    subgraph User
        U1["Context block<br/>retrieved docs, state"]
        U2[The actual task]
        U3[Few-shot examples]
        U4[Restated goal]
    end
    System --> M[Model]
    User --> M
    M --> O[Structured output]
```
A real example, Java-flavored:
```
SYSTEM:
You are an experienced Spring Boot engineer. You are concise.
You will receive a failing test and the relevant production code.
Propose a minimal fix. Never change unrelated code.
Output JSON matching:
{
  "root_cause": string,
  "file_to_edit": string,
  "patch": string  // unified diff
}
If you need more info, output {"need": string}.

USER:
<context>
  <file path="OrderService.java">...</file>
  <file path="OrderServiceTest.java">...</file>
</context>
<error>
AssertionError at line 42: expected 200, got 500
</error>
Fix the test failure. Minimal change only.
```
Notice: role, constraint, exact schema, escape hatch, context block, and focused task. That is the shape of 95% of production prompts.
| Technique | When to use | Leverage |
|---|---|---|
| Clear, specific instructions | Always | Huge |
| Structured output (JSON / XML / regex-guided) | Anything code will parse | Huge |
| Few-shot examples | Novel or fuzzy task | Large |
| Chain-of-thought ("think step by step") | Reasoning, math, planning | Large on non-reasoning models |
| Role assignment ("you are a senior SRE") | Tone, perspective | Medium |
| XML tags for sections | Multi-part prompts | Medium |
| Pre-filled assistant turn | Force output format | Medium |
| Self-consistency (sample N, vote) | High-stakes, cost-tolerant | Medium |
| Prompt chaining | Complex task; easier debugging | Large |
| Retrieval (RAG) | Factual grounding | Huge (see Ch. 7) |
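One row from the table deserves a concrete sketch because it is less widely known: the pre-filled assistant turn. With the Anthropic API you can write the start of the assistant's reply yourself, and the model must continue it (model name is a placeholder):

```python
from anthropic import Anthropic

client = Anthropic()

resp = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model name
    max_tokens=256,
    messages=[
        {"role": "user", "content": "Classify this ticket: 'My card keeps getting declined'"},
        # Pre-fill the assistant turn: the model can only continue the JSON we started.
        {"role": "assistant", "content": '{"category": "'},
    ],
)
# The completion continues from the pre-fill, e.g. 'payments.failure"}'
print('{"category": "' + resp.content[0].text)
```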
Instead of free-form prose, force the model to emit machine-readable output. Then parse it and fail loudly.

Three levels of "force":

1. Ask in the prompt ("Output JSON matching: ...", as in the example above). Cheap, but the model can still drift.
2. Provider-native structured output: OpenAI's `response_format={"type": "json_schema", ...}`, Gemini response schemas. These use constrained decoding, so the model cannot produce invalid JSON.
3. Libraries such as `instructor` (Python) or LangChain4j (Java), which layer Pydantic/Jackson schemas on top, with automatic retry-on-parse-error.

Level 3 looks like this:

```python
from pydantic import BaseModel
import instructor
from anthropic import Anthropic

class Fix(BaseModel):
    root_cause: str
    file_to_edit: str
    patch: str

client = instructor.from_anthropic(Anthropic())

prompt = "..."  # the failing-test prompt from the example above

fix: Fix = client.messages.create(
    model="claude-opus-4-7",
    response_model=Fix,
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}],
)
# fix.root_cause is typed. No json.loads. No KeyError.
```
This one change removes entire categories of production bugs.
For non-reasoning models (Haiku, Gemini Flash, most open models), adding a "think step by step" prefix or asking the model to show its work before the answer produces measurable gains on math, logic, and code tasks. The model uses its own output as scratch space.
```mermaid
flowchart LR
    A[Question] --> B["Think step by step<br/>visible reasoning"]
    B --> C[Intermediate steps]
    C --> D[Final answer]
```
A useful pattern:
```
Think through this carefully. First, write your reasoning inside
<scratch>...</scratch> tags. Then give the final answer in <answer>
tags.
```
Then parse <answer> and ignore <scratch>. For reasoning models (o3, Claude with extended thinking, Gemini Thinking) this is built in.
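Parsing that shape is a one-liner worth getting right; a minimal sketch, using the tag names above:

```python
import re

def extract_answer(completion: str) -> str | None:
    """Return the <answer> body, ignoring everything in <scratch>."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    # None means the model skipped the tags: fail loudly upstream, don't guess.
    return match.group(1).strip() if match else None
```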
For any task that isn't standard (e.g., "classify a support ticket into one of our 17 weirdly-named categories"), three to five examples embedded in the prompt outperform most of the fancier techniques — and require no infrastructure.
```
Classify the ticket. Examples:

Ticket: "My card keeps getting declined"
Category: payments.failure

Ticket: "I can't log in on mobile"
Category: auth.mobile

Ticket: "Where is my refund?"
Category: refunds.status

Ticket: "{new ticket text}"
Category:
```
Keep the example count bounded (3–10 is usually right), balance the class distribution, and rotate in the occasional hard case.
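Assembling that prompt from labeled data is mechanical; a minimal sketch, where the example pool and function name are hypothetical:

```python
import random

# Hypothetical labeled pool: (ticket text, category) pairs.
EXAMPLES = [
    ("My card keeps getting declined", "payments.failure"),
    ("I can't log in on mobile", "auth.mobile"),
    ("Where is my refund?", "refunds.status"),
]

def few_shot_prompt(ticket: str, k: int = 3) -> str:
    """Build a classification prompt from k sampled examples plus the new ticket."""
    shots = random.sample(EXAMPLES, k=min(k, len(EXAMPLES)))
    blocks = [f'Ticket: "{text}"\nCategory: {label}' for text, label in shots]
    blocks.append(f'Ticket: "{ticket}"\nCategory:')
    return "Classify the ticket. Examples:\n\n" + "\n\n".join(blocks)
```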
"If you are unsure, return {"need": "..."}". Otherwise, it will fabricate.A test suite for your prompts is the single biggest maturity jump for an AI team. It lets you:
```mermaid
flowchart LR
    A["Test set<br/>50-500 inputs"] --> B[Run prompt]
    B --> C{Grader}
    C -->|exact match| D1[pass/fail]
    C -->|regex / schema| D1
    C -->|LLM-as-judge| D1
    C -->|human| D1
    D1 --> E[Score]
    E --> F[Regression report]
    F --> G["Block merge<br/>if score drops"]
```
The stack you want is whatever your team will actually run: even a plain `tests/evals/` folder with `@pytest.mark.eval` beats no evals.

Rule of thumb: never tune a prompt without at least 30 test cases. Never change models without re-running the suite.
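A minimal sketch of that folder, assuming a hypothetical `run_prompt` helper that calls your model, and reusing the ticket categories from earlier:

```python
# tests/evals/test_ticket_classifier.py
import pytest

from myapp.llm import run_prompt  # hypothetical helper that calls the model

CASES = [
    ("My card keeps getting declined", "payments.failure"),
    ("I can't log in on mobile", "auth.mobile"),
    ("Where is my refund?", "refunds.status"),
    # ... grow this toward 30+ cases before tuning anything
]

@pytest.mark.eval
@pytest.mark.parametrize("ticket,expected", CASES)
def test_ticket_classification(ticket: str, expected: str):
    # Exact-match grader: the cheapest rung of the grader ladder above.
    assert run_prompt(ticket).strip() == expected
```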
For tasks where "correct" is subjective (good summary, friendly tone), a stronger LLM can grade outputs. This is surprisingly reliable if you give the judge a concrete rubric, force a structured verdict rather than a free-form opinion, and spot-check its grades against human judgment.
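A sketch of such a judge along those lines, assuming an Anthropic client; the rubric, score scale, and model name are illustrative:

```python
import json

from anthropic import Anthropic

client = Anthropic()

JUDGE_PROMPT = """You are grading a support-ticket summary.
Rubric: 1 = misses the point, 3 = usable, 5 = accurate and concise.
Output JSON only: {{"score": 1-5, "reason": string}}

Summary to grade:
{summary}"""

def judge(summary: str) -> dict:
    resp = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder: pick a stronger model than the one judged
        max_tokens=256,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(summary=summary)}],
    )
    return json.loads(resp.content[0].text)  # fail loudly on a malformed verdict
```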
Modern providers (Anthropic, Gemini, OpenAI, DeepSeek) support prompt caching: keep a long, stable prefix (system prompt + docs + few-shots) cached on the server, and only pay full price for the changing suffix.
Typical savings: 50–90% on cost, 30–70% on latency for workloads with long fixed prefixes. If your prompts are > 1k tokens of stable content, turn this on.
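On the Anthropic API this is a `cache_control` field on the stable block; a minimal sketch, where the file name and model name are placeholders:

```python
from anthropic import Anthropic

client = Anthropic()

long_stable_prefix = open("system_and_docs.txt").read()  # >1k tokens of stable content
changing_suffix = "Fix the test failure. Minimal change only."

resp = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model name
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_stable_prefix,  # system prompt + docs + few-shots
            "cache_control": {"type": "ephemeral"},  # cache everything up to here
        }
    ],
    messages=[{"role": "user", "content": changing_suffix}],
)
```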
{"class": "A"|"B"|"C", "confidence": 0-1}.[{step, tool, input}] as JSON.You will assemble all of these from the same few prompt shapes.