Chapter 6 · Prompt Engineering, Properly

API design for a stochastic function.


"Prompt engineering" sounds like cargo-cult incantation. It isn't. It's the discipline of writing clear, specific, testable instructions for a model that will otherwise improvise. Treat it like you would treat a fuzzy endpoint — with a contract, examples, schema, and measurement.

In plain English. A prompt is a spec for a function whose body you can't see. The clearer the spec, the more reliable the output. Vague specs produce vague software; vague prompts produce vague answers.

The hierarchy of prompting techniques

flowchart TB
    A[Zero-shot<br/>just ask] --> B[Clear instructions<br/>role, format, constraints]
    B --> C[Few-shot<br/>show examples]
    C --> D[Chain of thought<br/>think step by step]
    D --> E[Self-consistency<br/>sample N, vote]
    E --> F[Tree of thoughts<br/>explore branches]
    F --> G[Tool use<br/>delegate to code]
    G --> H[Agentic loop<br/>plan, act, verify, repeat]
    style A fill:#e8f4ff
    style B fill:#cfe7ff
    style C fill:#a8d4ff
    style D fill:#84c0ff
    style E fill:#5fa9ff
    style F fill:#3b91ff
    style G fill:#1f7aef
    style H fill:#0a5fcf,color:#ffffff

Each level adds reliability and cost. Climb only as high as you need.

6.1 The anatomy of a production prompt

flowchart TB
    subgraph System
    S1[Role + identity]
    S2[Capabilities + constraints]
    S3[Tone + style]
    S4[Output format / schema]
    S5[Safety rules]
    end
    subgraph User
    U1[Context block<br/>retrieved docs, state]
    U2[The actual task]
    U3[Few-shot examples]
    U4[Restated goal]
    end
    System --> M[Model]
    User --> M
    M --> O[Structured output]

A real example, Java-flavored:

SYSTEM:
You are an experienced Spring Boot engineer. You are concise.
You will receive a failing test and the relevant production code.
Propose a minimal fix. Never change unrelated code.
Output JSON matching:
  {
    "root_cause": string,
    "file_to_edit": string,
    "patch": string  // unified diff
  }
If you need more info, output {"need": string}.

USER:
<context>
  <file path="OrderService.java">...</file>
  <file path="OrderServiceTest.java">...</file>
</context>

<error>
  AssertionError at line 42: expected 200, got 500
</error>

Fix the test failure. Minimal change only.

Notice: role, constraints, exact schema, escape hatch, context block, and focused task. That is the shape of 95% of production prompts.

6.2 Techniques, ranked by leverage

Technique | When to use | Leverage
Clear, specific instructions | Always | Huge
Structured output (JSON / XML / regex-guided) | Anything code will parse | Huge
Few-shot examples | Novel or fuzzy task | Large
Chain-of-thought ("think step by step") | Reasoning, math, planning | Large on non-reasoning models
Role assignment ("you are a senior SRE") | Tone, perspective | Medium
XML tags for sections | Multi-part prompts | Medium
Pre-filled assistant turn | Force output format | Medium
Self-consistency (sample N, vote) | High-stakes, cost-tolerant | Medium
Prompt chaining | Complex task; easier debugging | Large
Retrieval (RAG) | Factual grounding | Huge (see Ch. 7)

6.3 Structured output is the backend engineer's superpower

Instead of free-form prose, force the model to emit machine-readable output. Then parse it and fail loudly.

Three levels of "force":

  1. Ask nicely. "Respond only with valid JSON matching this schema."
  2. Use the provider's structured output feature. Anthropic tool-use, OpenAI response_format={"type": "json_schema", ...}, Gemini response schemas. These use constrained decoding — the model cannot produce invalid JSON.
  3. Use a library. instructor (Python) or LangChain4j (Java) layer Pydantic/Jackson schemas on top, with automatic retry-on-parse-error.
from pydantic import BaseModel
import instructor
from anthropic import Anthropic

class Fix(BaseModel):
    root_cause: str
    file_to_edit: str
    patch: str

client = instructor.from_anthropic(Anthropic())
fix: Fix = client.messages.create(
    model="claude-opus-4-7",
    response_model=Fix,
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}],
)
# fix.root_cause is typed. No json.loads. No KeyError.

This one change removes entire categories of production bugs.

6.4 Chain-of-thought, explained

For non-reasoning models (Haiku, Gemini Flash, most open models), adding a "think step by step" prefix or asking the model to show its work before the answer produces measurable gains on math, logic, and code tasks. The model uses its own output as scratch space.

flowchart LR
    A[Question] --> B[Think step by step<br/>visible reasoning]
    B --> C[Intermediate steps]
    C --> D[Final answer]

A useful pattern:

Think through this carefully. First, write your reasoning inside
<scratch>...</scratch> tags. Then give the final answer in <answer>
tags.

Then parse <answer> and ignore <scratch>. For reasoning models (o3, Claude with extended thinking, Gemini Thinking) this is built in.
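Extracting the answer is a one-liner with a regex. A minimal sketch, assuming the model actually emits both tags (extract_answer is a hypothetical helper, not a library call):

```python
import re

def extract_answer(text: str) -> str:
    """Pull the final answer out of <answer> tags, ignoring <scratch>."""
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    if match is None:
        raise ValueError("no <answer> tag in model output")  # fail loudly
    return match.group(1).strip()
```

If the tag is missing, raising beats silently returning the scratchpad: the caller can retry with a reminder about the format.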

6.5 Few-shot: the underrated lever

For any task that isn't standard (e.g., "classify a support ticket into one of our 17 weirdly-named categories"), three to five examples embedded in the prompt outperform most of the fancier techniques — and require no infrastructure.

Classify the ticket. Examples:

Ticket: "My card keeps getting declined"
Category: payments.failure

Ticket: "I can't log in on mobile"
Category: auth.mobile

Ticket: "Where is my refund?"
Category: refunds.status

Ticket: "{new ticket text}"
Category:

Keep the example count bounded (3–10 is usually right), balance the class distribution, and rotate hard cases in occasionally.
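Assembling that shape from labeled examples is mechanical. A sketch under the assumption that examples live in code as (ticket, category) pairs (build_fewshot_prompt is an illustrative name, not a library function):

```python
def build_fewshot_prompt(examples: list[tuple[str, str]], new_ticket: str) -> str:
    """Render labeled (ticket, category) pairs into the few-shot shape above."""
    parts = ["Classify the ticket. Examples:", ""]
    for ticket, category in examples:
        parts += [f'Ticket: "{ticket}"', f"Category: {category}", ""]
    # End with the new ticket and a dangling "Category:" for the model to complete.
    parts += [f'Ticket: "{new_ticket}"', "Category:"]
    return "\n".join(parts)
```

Keeping the examples in data rather than hard-coded in the prompt string makes balancing and rotating them trivial.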

6.6 Anti-patterns

6.7 Evals — where most teams stop being amateurs

A test suite for your prompts is the single biggest maturity jump for an AI team. It lets you change prompts and models without fear. The loop:

flowchart LR
    A[Test set<br/>50-500 inputs] --> B[Run prompt]
    B --> C{Grader}
    C -->|exact match| D1[pass/fail]
    C -->|regex / schema| D1
    C -->|LLM-as-judge| D1
    C -->|human| D1
    D1 --> E[Score]
    E --> F[Regression report]
    F --> G[Block merge<br/>if score drops]

The stack you want:

Rule of thumb: never tune a prompt without at least 30 test cases, and never change models without re-running the suite.
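The core of that loop fits in a few lines. A minimal sketch with an exact-match grader and a merge-blocking threshold (run_eval and the 0.9 default are illustrative assumptions, not a specific framework):

```python
from typing import Callable

def run_eval(predict: Callable[[str], str],
             cases: list[tuple[str, str]],
             threshold: float = 0.9) -> tuple[float, bool]:
    """Run every (input, expected) case, score by exact match, gate on threshold."""
    passed = sum(1 for inp, expected in cases if predict(inp) == expected)
    score = passed / len(cases)
    return score, score >= threshold  # (score, ok-to-merge)
```

In practice, predict wraps your prompt plus a model call, and the exact-match comparison is swapped for a regex, schema check, or LLM-as-judge grader; the shape of the loop stays the same.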

6.8 LLM-as-judge

For tasks where "correct" is subjective (good summary, friendly tone), a stronger LLM can grade outputs. This is surprisingly reliable if you:

  1. Define a rubric (5 criteria, each 1–5).
  2. Use a judge model that is different from, and stronger than, the model being evaluated.
  3. Use pairwise comparisons rather than absolute scores when possible.
  4. Sanity-check judges against human labels for your 50 hardest cases.
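The pairwise version reduces to a prompt that forces a one-token verdict plus a majority vote. A sketch (pairwise_prompt and tally are hypothetical helpers; the verdict format is an assumption):

```python
def pairwise_prompt(task: str, output_a: str, output_b: str) -> str:
    """Judge prompt that forces a single-letter verdict, trivial to parse."""
    return (
        f"Task: {task}\n\n"
        f"Response A:\n{output_a}\n\n"
        f"Response B:\n{output_b}\n\n"
        "Which response is better? Answer with exactly one letter: A or B."
    )

def tally(verdicts: list[str]) -> str:
    """Majority vote over judge verdicts; 'tie' when the counts are even."""
    a, b = verdicts.count("A"), verdicts.count("B")
    return "A" if a > b else "B" if b > a else "tie"
```

One caveat worth baking in: LLM judges tend to favor whichever response appears first, so run each pair twice with A and B swapped and only count agreements as a win.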

6.9 Prompt caching — free speed and cost

Modern providers (Anthropic, Gemini, OpenAI, DeepSeek) support prompt caching: keep a long, stable prefix (system prompt + docs + few-shots) cached on the server, and only pay full price for the changing suffix.

Typical savings: 50–90% on cost, 30–70% on latency for workloads with long fixed prefixes. If your prompts are > 1k tokens of stable content, turn this on.
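With Anthropic's API, for example, caching is opt-in per content block via a cache_control marker (the field names follow Anthropic's documented API; the helper itself is a hypothetical sketch):

```python
def cached_system(stable_prefix: str) -> list[dict]:
    """Wrap the long, stable system prompt so the server caches it."""
    return [{
        "type": "text",
        "text": stable_prefix,
        "cache_control": {"type": "ephemeral"},  # Anthropic's cache marker
    }]

# Passed as: client.messages.create(..., system=cached_system(big_prompt), ...)
# Everything before the marker is cached; only the changing suffix bills at full price.
```

The key discipline is ordering: stable content (system prompt, docs, few-shots) first, volatile content (the user's task) last, so the cached prefix actually stays identical across calls.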

6.10 A small library of patterns you'll reuse forever

You will assemble all of these from the same few prompt shapes.

Further reading & watching