Everything worth remembering fits on one page if you print it at reasonable margins. Stick it on the wall above your desk.
An LLM is a stateless function: `f(text, knobs) -> text`. You provide the memory, the retrieval, the tools, the evals, and the humans in the loop.
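A minimal sketch of what "you provide the memory" means in practice, assuming the official `anthropic` Python SDK and an `ANTHROPIC_API_KEY` in the environment; the model name is the fast default from the table below.

```python
# The model never remembers anything: every call resends the full
# conversation. The "memory" is just a list in your process.
import anthropic

client = anthropic.Anthropic()
history = []  # the conversation lives here, not in the model

def ask(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    reply = client.messages.create(
        model="claude-haiku-4-5",  # fast default from the table below
        max_tokens=512,            # always cap output
        messages=history,          # the FULL history goes in on every call
    )
    text = reply.content[0].text
    history.append({"role": "assistant", "content": text})
    return text
```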
```mermaid
flowchart LR
    A[LLM] -->|+ memory| B[Chatbot]
    B -->|+ retrieval| C[RAG]
    C -->|+ tools| D[Agent]
    D -->|+ other agents| E[Multi-agent]
    E -->|+ standard interface| F[MCP platform]
    F -->|+ autonomy + evals| G[Agentic software]
```
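The "+ retrieval" hop is just ranking: embed the query, rank stored chunks by similarity, stuff the winners into the prompt. Here is a toy version with hand-made 3-d vectors so it runs offline; a real system swaps in an embedding model and vector DB from the table below.

```python
# Toy retrieval: cosine similarity over a tiny in-memory "index".
import math

docs = {
    "invoices are due in 30 days": [0.9, 0.1, 0.0],
    "the API rate limit is 60 rpm": [0.1, 0.9, 0.2],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec, k=1):
    # rank every chunk by similarity to the query, keep the top k
    return sorted(docs, key=lambda d: cosine(docs[d], query_vec), reverse=True)[:k]

print(retrieve([0.8, 0.2, 0.1]))  # ['invoices are due in 30 days']
```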
| Layer | Default | Alternate |
|---|---|---|
| Heavy model | Claude Opus 4.7 | GPT-5 |
| Fast model | Claude Haiku 4.5 | Gemini Flash 3 |
| Embeddings | voyage-3 or text-embedding-3-large | bge-large-v2 (self-hosted) |
| Vector DB | pgvector on Postgres | Qdrant / Weaviate / Pinecone |
| Orchestration | LangGraph (Python) / Spring AI (Java) | Temporal + LLM calls |
| Observability | Langfuse | LangSmith / Helicone / OTel |
| Evals | promptfoo + pytest | LangSmith / Braintrust |
| Coding | Claude Code + Cursor + Copilot | Zed AI / GitHub Copilot Workspace |
| Local model | ollama with qwen2.5-coder | LM Studio |
| Protocol | MCP for tools and resources | OpenAPI for classical APIs |
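The vector DB default is just a Postgres extension. A hypothetical round trip with the `psycopg` (v3) driver against a local database; the connection string, table name, and 3-d vectors are illustrative only.

```python
# Sketch of pgvector usage, assuming Postgres with the vector
# extension installed and psycopg v3 as the driver.
import psycopg

with psycopg.connect("dbname=app") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docs "
        "(id bigserial PRIMARY KEY, body text, embedding vector(3))"
    )
    conn.execute(
        "INSERT INTO docs (body, embedding) VALUES (%s, %s::vector)",
        ("hello world", "[0.1,0.2,0.3]"),
    )
    # <=> is pgvector's cosine-distance operator
    row = conn.execute(
        "SELECT body FROM docs ORDER BY embedding <=> %s::vector LIMIT 1",
        ("[0.1,0.2,0.3]",),
    ).fetchone()
    print(row[0])
```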
```mermaid
flowchart TB
    A[Problem] --> B{Needs fresh facts?}
    B -- yes --> C[RAG]
    B -- no --> D{Needs tone / format?}
    D -- yes --> E{Few-shot fixes it?}
    E -- yes --> F[Few-shot prompting]
    E -- no --> G[Fine-tune LoRA]
    D -- no --> H{Needs deep reasoning?}
    H -- yes --> I[Reasoning model + CoT]
    H -- no --> J{Needs to act?}
    J -- yes --> K[Tool use / agent]
    J -- no --> L[Just call the model]
    K --> M{Irreversible?}
    M -- yes --> N[Human checkpoint]
    M -- no --> O[Full auto]
```
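The bottom branch of the tree in code: gate irreversible tools behind a human checkpoint, let everything else run on full auto. The tool names and registry here are illustrative, not a real agent-framework API.

```python
# Human checkpoint before anything you can't undo.
def delete_branch(name: str) -> str:
    return f"deleted {name}"  # stand-in for a real, irreversible action

def summarize(text: str) -> str:
    return text[:80]  # stand-in for a harmless, reversible action

TOOLS = {"delete_branch": delete_branch, "summarize": summarize}
IRREVERSIBLE = {"delete_branch"}

def run_tool(tool: str, arg: str) -> str:
    if tool in IRREVERSIBLE:
        answer = input(f"Agent wants {tool}({arg!r}). Proceed? [y/N] ")
        if answer.strip().lower() != "y":
            return "skipped by human"
    return TOOLS[tool](arg)  # everything else runs on full auto
```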
Set `max_tokens` on every call; runaway generations cost money. Delimit prompt sections with XML tags such as `<context>...</context>` and `<question>...</question>`.
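Those tag names come from the tip above; the prompt builder itself is an illustrative sketch, so the model can tell retrieved data apart from the actual question.

```python
# Wrap context and question in explicit tags before sending to the model.
def build_prompt(context: str, question: str) -> str:
    return (
        f"<context>\n{context}\n</context>\n"
        f"<question>\n{question}\n</question>\n"
        "Answer using only the context above."
    )
```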
```mermaid
flowchart LR
    M["09:00 Triage<br/>agent summarizes"] --> P["09:30 Plan<br/>agent + you"]
    P --> D["10-12 Deep work<br/>pair coding"]
    D --> R["14-16 Review diffs<br/>draft PR text"]
    R --> U["16-17 Unblock + learn<br/>20 min paper"]
    U --> S["17:30 Shutdown<br/>tomorrow's top-3"]
```
```bash
# Install the CLI
npm install -g @anthropic-ai/claude-code
claude

# Run a local model offline
ollama run qwen2.5-coder:14b

# Test a prompt with evals
promptfoo eval

# Tokenize quickly
python -c "import tiktoken; enc=tiktoken.encoding_for_model('gpt-4'); print(len(enc.encode(open('file.txt').read())))"

# Cheapest smart model for a one-off question
claude -p "your question here" --model claude-haiku-4-5
```
Ship something this week.
The saga is still being written. Go write a chapter.