Appendix E · One-Page Cheat Sheet

Everything worth remembering fits on one page if you print at reasonable margins. Stick it on the wall above your desk.


Mental model

An LLM is a stateless function f(text, knobs) -> text. You provide the memory, the retrieval, the tools, the evals, and the humans in the loop.

flowchart LR
    A[LLM] -->|+ memory| B[Chatbot]
    B -->|+ retrieval| C[RAG]
    C -->|+ tools| D[Agent]
    D -->|+ other agents| E[Multi-agent]
    E -->|+ standard interface| F[MCP platform]
    F -->|+ autonomy + evals| G[Agentic software]
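
The function view is literal enough to code. Here is a minimal sketch with the Anthropic Python SDK, using the book's default fast model; the shape of `memory` is an assumption for illustration. The point is that the model holds no state: you thread the memory through yourself, turn by turn.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def f(text: str, memory: list[dict], max_tokens: int = 1024) -> str:
    # Stateless: everything the model "knows" rides in on this one request.
    response = client.messages.create(
        model="claude-haiku-4-5",  # the fast default from the stack table
        max_tokens=max_tokens,
        messages=memory + [{"role": "user", "content": text}],
    )
    return response.content[0].text

# You provide the memory: append each turn yourself.
memory: list[dict] = []
question = "Summarize RAG in one sentence."
answer = f(question, memory)
memory += [{"role": "user", "content": question},
           {"role": "assistant", "content": answer}]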

Default stack (April 2026)

| Layer         | Default                                | Alternate                         |
|---------------|----------------------------------------|-----------------------------------|
| Heavy model   | Claude Opus 4.7                        | GPT-5                             |
| Fast model    | Claude Haiku 4.5                       | Gemini Flash 3                    |
| Embeddings    | voyage-3 or text-embedding-3-large     | bge-large-v2 (self-hosted)        |
| Vector DB     | pgvector on Postgres                   | Qdrant / Weaviate / Pinecone      |
| Orchestration | LangGraph (Python) / Spring AI (Java)  | Temporal + LLM calls              |
| Observability | Langfuse                               | LangSmith / Helicone / OTel       |
| Evals         | promptfoo + pytest                     | LangSmith / Braintrust            |
| Coding        | Claude Code + Cursor + Copilot         | Zed AI / GitHub Copilot Workspace |
| Local model   | ollama with qwen2.5-coder              | LM Studio                         |
| Protocol      | MCP for tools and resources            | OpenAPI for classical APIs        |

Decision tree

flowchart TB
    A[Problem] --> B{Needs fresh facts?}
    B -- yes --> C[RAG]
    B -- no --> D{Needs tone / format?}
    D -- yes --> E{Few-shot fixes it?}
    E -- yes --> F[Few-shot prompting]
    E -- no --> G[Fine-tune LoRA]
    D -- no --> H{Needs deep reasoning?}
    H -- yes --> I[Reasoning model + CoT]
    H -- no --> J{Needs to act?}
    J -- yes --> K[Tool use / agent]
    J -- no --> L[Just call the model]
    K --> M{Irreversible?}
    M -- yes --> N[Human checkpoint]
    M -- no --> O[Full auto]

The five things you always do

  1. Set max_tokens. Runaway generations cost money.
  2. Read every diff. Never merge code you don't understand.
  3. Measure with evals. Don't change a prompt without a test set.
  4. Log tokens, latency, cost. On every call, from day one.
  5. Gate destructive actions. Human in the loop on writes, deploys, money. (Rules 1, 4, and 5 are sketched below.)
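
Rules 1, 4, and 5 collapse into one wrapper around every model call. A minimal sketch against the Anthropic SDK; the print-based logger, the interactive checkpoint, and the model id are placeholder choices to adapt, not a prescribed implementation.

import time
import anthropic

client = anthropic.Anthropic()

def guarded_call(messages: list[dict], destructive: bool = False,
                 max_tokens: int = 1024) -> str:
    if destructive:  # rule 5: human in the loop before anything irreversible
        if input("Destructive action ahead. Proceed? [y/N] ").strip().lower() != "y":
            raise RuntimeError("blocked at human checkpoint")
    start = time.monotonic()
    resp = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=max_tokens,  # rule 1: every generation is bounded
        messages=messages,
    )
    print({  # rule 4: log tokens and latency on every call, from day one
        "input_tokens": resp.usage.input_tokens,
        "output_tokens": resp.usage.output_tokens,
        "latency_s": round(time.monotonic() - start, 2),
    })
    return resp.content[0].text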

Prompting heuristics


RAG checklist


Agent checklist


Cost levers (largest first)

  1. Use a smaller model where it works (Haiku / Flash / Mini).
  2. Cache long system prompts (Anthropic, Gemini prompt caching; sketched after this list).
  3. Shorten inputs: better RAG beats bigger context.
  4. Parallelize tool calls: it cuts wall-clock latency, not cost, but users feel the difference.
  5. Batch non-interactive jobs overnight at discounted rates.
  6. Distill to a smaller model for very high-volume workflows.
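
For lever 2, this is what Anthropic-style prompt caching looks like: mark the long system prompt as a cacheable prefix so repeat calls reuse it at a discount. A hedged sketch; the prompt text is a placeholder, and Gemini's equivalent (cached content) has a different API.

import anthropic

client = anthropic.Anthropic()

LONG_SYSTEM_PROMPT = "<several thousand tokens of instructions and examples>"

resp = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=512,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},  # mark this prefix as cacheable
    }],
    messages=[{"role": "user", "content": "First question against the cached prompt"}],
)
# usage reports cache_creation_input_tokens on the first call and
# cache_read_input_tokens (billed at a discount) on subsequent ones
print(resp.usage)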

Daily practice

flowchart LR
    M["09:00 Triage<br/>agent summarizes"] --> P["09:30 Plan<br/>agent + you"]
    P --> D["10-12 Deep work<br/>pair coding"]
    D --> R["14-16 Review diffs<br/>draft PR text"]
    R --> U["16-17 Unblock + learn<br/>20 min paper"]
    U --> S["17:30 Shutdown<br/>tomorrow's top-3"]

Skills that compound

Skills that decay


Emergency commands

# Install the Claude Code CLI, then launch it
npm install -g @anthropic-ai/claude-code
claude

# Run a local model offline
ollama run qwen2.5-coder:14b

# Test a prompt with evals
promptfoo eval

# Tokenize quickly
python -c "import tiktoken; enc=tiktoken.encoding_for_model('gpt-4'); print(len(enc.encode(open('file.txt').read())))"

# Cheapest smart model for a one-off question
claude -p "your question here" --model claude-haiku-4-5

The one rule

Ship something this week.

The saga is still being written. Go write a chapter.

