Appendix E · One-Page Cheat Sheet

Everything worth remembering fits on one page if you print at reasonable margins. Stick it on the wall above your desk.


Mental model

An LLM is a stateless function f(text, knobs) -> text. You provide the memory, the retrieval, the tools, the evals, and the humans in the loop.

flowchart LR
    A[LLM] -->|+ memory| B[Chatbot]
    B -->|+ retrieval| C[RAG]
    C -->|+ tools| D[Agent]
    D -->|+ other agents| E[Multi-agent]
    E -->|+ standard interface| F[MCP platform]
    F -->|+ autonomy + evals| G[Agentic software]
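
The function view is literal enough to code. Here is a minimal sketch with the Anthropic Python SDK, using the book's default fast model; the shape of `memory` is an assumption for illustration. The point is that the model holds no state: you thread the memory through yourself, turn by turn.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def f(text: str, memory: list[dict], max_tokens: int = 1024) -> str:
    # Stateless: everything the model "knows" rides in on this one request.
    response = client.messages.create(
        model="claude-haiku-4-5",  # the fast default from the stack table
        max_tokens=max_tokens,
        messages=memory + [{"role": "user", "content": text}],
    )
    return response.content[0].text

# You provide the memory: append each turn yourself.
memory: list[dict] = []
question = "Summarize RAG in one sentence."
answer = f(question, memory)
memory += [{"role": "user", "content": question},
           {"role": "assistant", "content": answer}]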

Default stack (April 2026)

| Layer         | Default                                | Alternate                         |
|---------------|----------------------------------------|-----------------------------------|
| Heavy model   | Claude Opus 4.7                        | GPT-5                             |
| Fast model    | Claude Haiku 4.5                       | Gemini Flash 3                    |
| Embeddings    | voyage-3 or text-embedding-3-large     | bge-large-v2 (self-hosted)        |
| Vector DB     | pgvector on Postgres                   | Qdrant / Weaviate / Pinecone      |
| Orchestration | LangGraph (Python) / Spring AI (Java)  | Temporal + LLM calls              |
| Observability | Langfuse                               | LangSmith / Helicone / OTel       |
| Evals         | promptfoo + pytest                     | LangSmith / Braintrust            |
| Coding        | Claude Code + Cursor + Copilot         | Zed AI / GitHub Copilot Workspace |
| Local model   | ollama with qwen2.5-coder              | LM Studio                         |
| Protocol      | MCP for tools and resources            | OpenAPI for classical APIs        |

Decision tree

flowchart TB
    A[Problem] --> B{Needs fresh facts?}
    B -- yes --> C[RAG]
    B -- no --> D{Needs tone / format?}
    D -- yes --> E{Few-shot fixes it?}
    E -- yes --> F[Few-shot prompting]
    E -- no --> G[Fine-tune LoRA]
    D -- no --> H{Needs deep reasoning?}
    H -- yes --> I[Reasoning model + CoT]
    H -- no --> J{Needs to act?}
    J -- yes --> K[Tool use / agent]
    J -- no --> L[Just call the model]
    K --> M{Irreversible?}
    M -- yes --> N[Human checkpoint]
    M -- no --> O[Full auto]

The five things you always do

  1. Set max_tokens. Runaway generations cost money.
  2. Read every diff. Never merge code you don't understand.
  3. Measure with evals. Don't change a prompt without a test set.
  4. Log tokens, latency, cost. On every call, from day one.
  5. Gate destructive actions. Human in the loop on writes, deploys, money. (Rules 1, 4, and 5 are sketched below.)
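
Rules 1, 4, and 5 collapse into one wrapper around every model call. A minimal sketch against the Anthropic SDK; the print-based logger, the interactive checkpoint, and the model id are placeholder choices to adapt, not a prescribed implementation.

import time
import anthropic

client = anthropic.Anthropic()

def guarded_call(messages: list[dict], destructive: bool = False,
                 max_tokens: int = 1024) -> str:
    if destructive:  # rule 5: human in the loop before anything irreversible
        if input("Destructive action ahead. Proceed? [y/N] ").strip().lower() != "y":
            raise RuntimeError("blocked at human checkpoint")
    start = time.monotonic()
    resp = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=max_tokens,  # rule 1: every generation is bounded
        messages=messages,
    )
    print({  # rule 4: log tokens and latency on every call, from day one
        "input_tokens": resp.usage.input_tokens,
        "output_tokens": resp.usage.output_tokens,
        "latency_s": round(time.monotonic() - start, 2),
    })
    return resp.content[0].text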

Prompting heuristics


RAG checklist


Agent checklist


Cost levers (largest first)

  1. Use a smaller model where it works (Haiku / Flash / Mini).
  2. Cache long system prompts (Anthropic, Gemini prompt caching; sketched after this list).
  3. Shorten inputs: better RAG beats bigger context.
  4. Parallelize tool calls: it cuts wall-clock latency, not cost, but users feel the difference.
  5. Batch non-interactive jobs overnight at discounted rates.
  6. Distill to a smaller model for very high-volume workflows.
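
For lever 2, this is what Anthropic-style prompt caching looks like: mark the long system prompt as a cacheable prefix so repeat calls reuse it at a discount. A hedged sketch; the prompt text is a placeholder, and Gemini's equivalent (cached content) has a different API.

import anthropic

client = anthropic.Anthropic()

LONG_SYSTEM_PROMPT = "<several thousand tokens of instructions and examples>"

resp = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=512,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},  # mark this prefix as cacheable
    }],
    messages=[{"role": "user", "content": "First question against the cached prompt"}],
)
# usage reports cache_creation_input_tokens on the first call and
# cache_read_input_tokens (billed at a discount) on subsequent ones
print(resp.usage)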

Daily practice

flowchart LR
    M["09:00 Triage<br/>agent summarizes"] --> P["09:30 Plan<br/>agent + you"]
    P --> D["10-12 Deep work<br/>pair coding"]
    D --> R["14-16 Review diffs<br/>draft PR text"]
    R --> U["16-17 Unblock + learn<br/>20 min paper"]
    U --> S["17:30 Shutdown<br/>tomorrow's top-3"]

Skills that compound

Skills that decay


Emergency commands

# Install the Claude Code CLI, then launch it
npm install -g @anthropic-ai/claude-code
claude

# Run a local model offline
ollama run qwen2.5-coder:14b

# Test a prompt with evals
promptfoo eval

# Tokenize quickly
python -c "import tiktoken; enc=tiktoken.encoding_for_model('gpt-4'); print(len(enc.encode(open('file.txt').read())))"

# Cheapest smart model for a one-off question
claude -p "your question here" --model claude-haiku-4-5

The one rule

Ship something this week.

The saga is still being written. Go write a chapter.

