Chapter 6 · Prompt Engineering, Properly

API design for a stochastic function.


"Prompt engineering" sounds like cargo-cult incantation. It isn't. It's the discipline of writing clear, specific, testable instructions for a model that will otherwise improvise. Treat it like you would treat a fuzzy endpoint — with a contract, examples, schema, and measurement.

In plain English. A prompt is a spec for a function whose body you can't see. The clearer the spec, the more reliable the output. Vague specs produce vague software; vague prompts produce vague answers.

The hierarchy of prompting techniques

flowchart TB
    A[Zero-shot<br/>just ask] --> B[Clear instructions<br/>role, format, constraints]
    B --> C[Few-shot<br/>show examples]
    C --> D[Chain of thought<br/>think step by step]
    D --> E[Self-consistency<br/>sample N, vote]
    E --> F[Tree of thoughts<br/>explore branches]
    F --> G[Tool use<br/>delegate to code]
    G --> H[Agentic loop<br/>plan, act, verify, repeat]
    style A fill:#e8f4ff
    style B fill:#cfe7ff
    style C fill:#a8d4ff
    style D fill:#84c0ff
    style E fill:#5fa9ff
    style F fill:#3b91ff
    style G fill:#1f7aef
    style H fill:#0a5fcf,color:#ffffff

Each level adds reliability and cost. Climb only as high as you need.

6.1 The anatomy of a production prompt

flowchart TB
    subgraph System
    S1[Role + identity]
    S2[Capabilities + constraints]
    S3[Tone + style]
    S4[Output format / schema]
    S5[Safety rules]
    end
    subgraph User
    U1[Context block<br/>retrieved docs, state]
    U2[The actual task]
    U3[Few-shot examples]
    U4[Restated goal]
    end
    System --> M[Model]
    User --> M
    M --> O[Structured output]

A real example, Java-flavored:

SYSTEM:
You are an experienced Spring Boot engineer. You are concise.
You will receive a failing test and the relevant production code.
Propose a minimal fix. Never change unrelated code.
Output JSON matching:
  {
    "root_cause": string,
    "file_to_edit": string,
    "patch": string  // unified diff
  }
If you need more info, output {"need": string}.

USER:
<context>
  <file path="OrderService.java">...</file>
  <file path="OrderServiceTest.java">...</file>
</context>

<error>
  AssertionError at line 42: expected 200, got 500
</error>

Fix the test failure. Minimal change only.

Notice: role, constraints, exact schema, escape hatch, context block, and focused task. That is the shape of 95% of production prompts.

6.2 Techniques, ranked by leverage

Technique | When to use | Leverage
Clear, specific instructions | Always | Huge
Structured output (JSON / XML / regex-guided) | Anything code will parse | Huge
Few-shot examples | Novel or fuzzy task | Large
Chain-of-thought ("think step by step") | Reasoning, math, planning | Large on non-reasoning models
Role assignment ("you are a senior SRE") | Tone, perspective | Medium
XML tags for sections | Multi-part prompts | Medium
Pre-filled assistant turn | Force output format | Medium
Self-consistency (sample N, vote) | High-stakes, cost-tolerant | Medium
Prompt chaining | Complex task; easier debugging | Large
Retrieval (RAG) | Factual grounding | Huge (see Ch. 7)

6.3 Structured output is the backend engineer's superpower

Instead of free-form prose, force the model to emit machine-readable output. Then parse it and fail loudly.

Three levels of "force":

  1. Ask nicely. "Respond only with valid JSON matching this schema."
  2. Use the provider's structured output feature. Anthropic tool-use, OpenAI response_format={"type": "json_schema", ...}, Gemini response schemas. These use constrained decoding — the model cannot produce invalid JSON.
  3. Use a library. instructor (Python) or LangChain4j (Java) layer Pydantic/Jackson schemas on top, with automatic retry-on-parse-error.
from pydantic import BaseModel
import instructor
from anthropic import Anthropic

class Fix(BaseModel):
    root_cause: str
    file_to_edit: str
    patch: str

client = instructor.from_anthropic(Anthropic())
fix: Fix = client.messages.create(
    model="claude-opus-4-7",
    response_model=Fix,
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}],
)
# fix.root_cause is typed. No json.loads. No KeyError.

This one change removes entire categories of production bugs.

6.4 Chain-of-thought, explained

For non-reasoning models (Haiku, Gemini Flash, most open models), adding a "think step by step" prefix or asking the model to show its work before the answer produces measurable gains on math, logic, and code tasks. The model uses its own output as scratch space.

flowchart LR
    A[Question] --> B[Think step by step<br/>visible reasoning]
    B --> C[Intermediate steps]
    C --> D[Final answer]

A useful pattern:

Think through this carefully. First, write your reasoning inside
<scratch>...</scratch> tags. Then give the final answer in <answer>
tags.

Then parse <answer> and ignore <scratch>. For reasoning models (o3, Claude with extended thinking, Gemini Thinking) this is built in.
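Extracting the answer is a one-liner with a regex. A minimal sketch, assuming the model actually emits both tags (extract_answer is a hypothetical helper, not a library call):

```python
import re

def extract_answer(text: str) -> str:
    """Pull the final answer out of <answer> tags, ignoring <scratch>."""
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    if match is None:
        raise ValueError("no <answer> tag in model output")  # fail loudly
    return match.group(1).strip()
```

If the tag is missing, raising beats silently returning the scratchpad: the caller can retry with a reminder about the format.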

6.5 Few-shot: the underrated lever

For any task that isn't standard (e.g., "classify a support ticket into one of our 17 weirdly-named categories"), three to five examples embedded in the prompt outperform most of the fancier techniques — and require no infrastructure.

Classify the ticket. Examples:

Ticket: "My card keeps getting declined"
Category: payments.failure

Ticket: "I can't log in on mobile"
Category: auth.mobile

Ticket: "Where is my refund?"
Category: refunds.status

Ticket: "{new ticket text}"
Category:

Keep the example count bounded (3–10 is usually right), balance the class distribution, and rotate hard cases in occasionally.
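Assembling that shape from labeled examples is mechanical. A sketch under the assumption that examples live in code as (ticket, category) pairs (build_fewshot_prompt is an illustrative name, not a library function):

```python
def build_fewshot_prompt(examples: list[tuple[str, str]], new_ticket: str) -> str:
    """Render labeled (ticket, category) pairs into the few-shot shape above."""
    parts = ["Classify the ticket. Examples:", ""]
    for ticket, category in examples:
        parts += [f'Ticket: "{ticket}"', f"Category: {category}", ""]
    # End with the new ticket and a dangling "Category:" for the model to complete.
    parts += [f'Ticket: "{new_ticket}"', "Category:"]
    return "\n".join(parts)
```

Keeping the examples in data rather than hard-coded in the prompt string makes balancing and rotating them trivial.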

6.6 Anti-patterns

6.7 Evals — where most teams stop being amateurs

A test suite for your prompts is the single biggest maturity jump for an AI team. It lets you change prompts and models without fear. The loop:

flowchart LR
    A[Test set<br/>50-500 inputs] --> B[Run prompt]
    B --> C{Grader}
    C -->|exact match| D1[pass/fail]
    C -->|regex / schema| D1
    C -->|LLM-as-judge| D1
    C -->|human| D1
    D1 --> E[Score]
    E --> F[Regression report]
    F --> G[Block merge<br/>if score drops]

The stack you want:

Rule of thumb: never tune a prompt without at least 30 test cases, and never change models without re-running the suite.
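The core of that loop fits in a few lines. A minimal sketch with an exact-match grader and a merge-blocking threshold (run_eval and the 0.9 default are illustrative assumptions, not a specific framework):

```python
from typing import Callable

def run_eval(predict: Callable[[str], str],
             cases: list[tuple[str, str]],
             threshold: float = 0.9) -> tuple[float, bool]:
    """Run every (input, expected) case, score by exact match, gate on threshold."""
    passed = sum(1 for inp, expected in cases if predict(inp) == expected)
    score = passed / len(cases)
    return score, score >= threshold  # (score, ok-to-merge)
```

In practice, predict wraps your prompt plus a model call, and the exact-match comparison is swapped for a regex, schema check, or LLM-as-judge grader; the shape of the loop stays the same.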

6.8 LLM-as-judge

For tasks where "correct" is subjective (good summary, friendly tone), a stronger LLM can grade outputs. This is surprisingly reliable if you:

  1. Define a rubric (5 criteria, each 1–5).
  2. Use a judge model that is different from, and stronger than, the model being evaluated.
  3. Use pairwise comparisons rather than absolute scores when possible.
  4. Sanity-check judges against human labels for your 50 hardest cases.
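The pairwise version reduces to a prompt that forces a one-token verdict plus a majority vote. A sketch (pairwise_prompt and tally are hypothetical helpers; the verdict format is an assumption):

```python
def pairwise_prompt(task: str, output_a: str, output_b: str) -> str:
    """Judge prompt that forces a single-letter verdict, trivial to parse."""
    return (
        f"Task: {task}\n\n"
        f"Response A:\n{output_a}\n\n"
        f"Response B:\n{output_b}\n\n"
        "Which response is better? Answer with exactly one letter: A or B."
    )

def tally(verdicts: list[str]) -> str:
    """Majority vote over judge verdicts; 'tie' when the counts are even."""
    a, b = verdicts.count("A"), verdicts.count("B")
    return "A" if a > b else "B" if b > a else "tie"
```

One caveat worth baking in: LLM judges tend to favor whichever response appears first, so run each pair twice with A and B swapped and only count agreements as a win.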

6.9 Prompt caching — free speed and cost

Modern providers (Anthropic, Gemini, OpenAI, DeepSeek) support prompt caching: keep a long, stable prefix (system prompt + docs + few-shots) cached on the server, and only pay full price for the changing suffix.

Typical savings: 50–90% on cost, 30–70% on latency for workloads with long fixed prefixes. If your prompts are > 1k tokens of stable content, turn this on.
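With Anthropic's API, for example, caching is opt-in per content block via a cache_control marker (the field names follow Anthropic's documented API; the helper itself is a hypothetical sketch):

```python
def cached_system(stable_prefix: str) -> list[dict]:
    """Wrap the long, stable system prompt so the server caches it."""
    return [{
        "type": "text",
        "text": stable_prefix,
        "cache_control": {"type": "ephemeral"},  # Anthropic's cache marker
    }]

# Passed as: client.messages.create(..., system=cached_system(big_prompt), ...)
# Everything before the marker is cached; only the changing suffix bills at full price.
```

The key discipline is ordering: stable content (system prompt, docs, few-shots) first, volatile content (the user's task) last, so the cached prefix actually stays identical across calls.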

6.10 A small library of patterns you'll reuse forever

You will assemble all of these from the same few prompt shapes.

Further reading & watching