A concrete, installable stack for 2026.
Rather than abstract advice, this chapter is a spec sheet. Install these, learn these, pay for these; you'll be in the top 10% of backend engineers on the AI axis within a quarter.
In plain English. A modern AI stack has six layers: a model, a place to call it from (an IDE or app), a way to give it your data (RAG/vectors), a way to give it tools (MCP/function calling), a way to measure quality (evals), and a way to see what it did (observability). You need all six.
```mermaid
flowchart TB
    L1["1. Surfaces<br/>IDE, CLI, desktop, browser, app"] --> L2
    L2["2. Orchestration<br/>LangGraph, Spring AI, your code"] --> L3
    L3["3. Models<br/>Opus 4.7, Sonnet, Haiku, Gemini, GPT, local"] --> L4
    L4["4. Tools and data<br/>MCP servers, vector DBs, function calls"] --> L5
    L5["5. Observability<br/>Langfuse, Helicone, OpenTelemetry"] --> L6
    L6["6. Evals and CI<br/>promptfoo, pytest, custom scorers"]
```
Where most teams fail: they buy layers 1 and 3, ignore layers 5 and 6, and then can't tell when their AI ships regressions.
```mermaid
flowchart TB
    subgraph c[Clients]
        C1["IDE: Cursor / VS Code + Copilot"]
        C2["Claude Code CLI"]
        C3["Claude Desktop / Cowork"]
        C4["Browser: Claude in Chrome"]
    end
    subgraph f[Frontier APIs]
        F1["Anthropic: Opus 4.7, Haiku 4.5"]
        F2["OpenAI: GPT-5, embeddings"]
        F3["Google: Gemini 3"]
    end
    subgraph sh[Self-host]
        S1[Ollama local]
        S2[vLLM cluster]
    end
    subgraph d[Data layer]
        D1[pgvector on Postgres]
        D2["Qdrant / Pinecone"]
        D3["OpenSearch / Elastic"]
    end
    subgraph fr[Frameworks]
        FR1["Python: LangChain/LangGraph, LlamaIndex, instructor"]
        FR2["Java: Spring AI, LangChain4j"]
        FR3["JS/TS: Vercel AI SDK, LangChain.js"]
    end
    subgraph o[Observability]
        O1["Langfuse / LangSmith"]
        O2["OpenTelemetry + your APM"]
    end
    subgraph e[Evals]
        E1[promptfoo]
        E2["pytest + LLM-as-judge"]
    end
    c --> f
    c --> sh
    fr --> f
    fr --> sh
    fr --> d
    f --> o
    fr --> o
    fr --> e
```
```bash
# macOS / Linux

# 1. CLI coding agent
brew install anthropic/tap/claude-code   # or: npm install -g @anthropic-ai/claude-code

# 2. Local model runtime
brew install ollama
ollama pull qwen2.5-coder:14b            # coding
ollama pull llama3.3:70b-instruct-q4_K_M # general (~40 GB)

# 3. API keys
cat >> ~/.zshrc <<'EOF'
export ANTHROPIC_API_KEY="..."
export OPENAI_API_KEY="..."
export GOOGLE_API_KEY="..."
EOF

# 4. Python env (promptfoo is a Node tool, hence npm)
pipx install uv
npm install -g promptfoo
uv pip install --system anthropic openai google-genai \
  langchain-core langgraph llama-index-core pydantic instructor

# 5. IDE
# Install Cursor or keep VS Code + GitHub Copilot
# Install the Continue.dev extension as a free alternative
```
Within an hour you can call three frontier APIs and one local model from your own code, and run an eval suite.
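A minimal smoke test of that claim, assuming the keys from step 3 and the Ollama model from step 2 (the Anthropic model name is illustrative):

```python
# Smoke test: one prompt against a frontier API and a local model.
from anthropic import Anthropic
from openai import OpenAI

prompt = "Reply with the single word: ok"

# Frontier API: the SDK reads ANTHROPIC_API_KEY from the environment.
claude = Anthropic()
msg = claude.messages.create(
    model="claude-sonnet-4-5",  # illustrative; use whatever tier you default to
    max_tokens=16,
    messages=[{"role": "user", "content": prompt}],
)
print("anthropic:", msg.content[0].text)

# Local model: Ollama exposes an OpenAI-compatible endpoint.
local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
out = local.chat.completions.create(
    model="qwen2.5-coder:14b",
    messages=[{"role": "user", "content": prompt}],
)
print("ollama:", out.choices[0].message.content)
```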
A small rubric:
Never pick one model and marry it. Pricing and capability shift every 3–6 months. Build your code so swapping models is a config change.
```python
# The shape you want: provider-neutral
client = llm("claude-opus-4-7")   # or "gpt-5", "gemini-3-pro"
resp = client.chat(messages, tools=..., response_schema=...)
```
LangChain, LlamaIndex, instructor, Vercel AI SDK, and Spring AI all give you this shape.
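A runnable sketch of that shape, using litellm as the neutral client (one of the gateway options discussed below); the model strings are illustrative:

```python
# One call signature for any provider; swapping models is a config change.
from litellm import completion

def llm_chat(model: str, messages: list[dict], **kwargs):
    # litellm translates this to the right provider SDK under the hood.
    return completion(model=model, messages=messages, **kwargs)

resp = llm_chat(
    "anthropic/claude-opus-4-7",  # or "openai/gpt-5", "gemini/gemini-3-pro"
    [{"role": "user", "content": "Say hello."}],
)
print(resp.choices[0].message.content)
```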
Default: Claude Sonnet 4.5 or GPT-5 mid-tier for everyday calls; Claude Haiku 4.5 or Gemini Flash when cost or latency dominates.
Escalate to the heavy tier (Claude Opus 4.7 with extended thinking) only when you hit real limits, as in the sketch below.
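A hypothetical tier map makes that escalation rule a one-line config change rather than model strings scattered through the codebase:

```python
# Illustrative tier names matching the rubric above.
TIERS = {
    "heavy":   "anthropic/claude-opus-4-7",   # extended thinking
    "default": "anthropic/claude-sonnet-4-5",
    "cheap":   "anthropic/claude-haiku-4-5",
}

def pick_model(needs_deep_reasoning: bool = False, latency_sensitive: bool = False) -> str:
    """Route a request to a tier; callers never hard-code a provider."""
    if needs_deep_reasoning:
        return TIERS["heavy"]
    return TIERS["cheap"] if latency_sensitive else TIERS["default"]
```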
Python: LangGraph + instructor, with LlamaIndex for RAG.
Java: Spring AI (LangChain4j as the alternative).
JS/TS: Vercel AI SDK or LangChain.js.
Pick one observability tool and install it on day one: Langfuse or LangSmith, with OpenTelemetry feeding your existing APM.
What to emit on every LLM call:
```json
{
  "trace_id": "...",
  "model": "claude-opus-4-7",
  "prompt_version": "pr_017",
  "input_tokens": 1240,
  "output_tokens": 340,
  "latency_ms": 2200,
  "cost_usd": 0.017,
  "tools_called": ["search", "get_user"],
  "user_id": "...",
  "tenant_id": "...",
  "outcome": "ok|error|timeout"
}
```
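A minimal sketch of wrapping each call to emit that record; the stdout sink stands in for a Langfuse or OTel exporter, and the usage-field names vary by provider SDK:

```python
import json
import time
import uuid

def traced_call(model: str, prompt_version: str, fn, **ids):
    """Run fn() (the actual LLM call) and emit one structured record."""
    start = time.monotonic()
    outcome, resp = "ok", None
    try:
        resp = fn()
    except TimeoutError:
        outcome = "timeout"
    except Exception:
        outcome = "error"
    record = {
        "trace_id": str(uuid.uuid4()),
        "model": model,
        "prompt_version": prompt_version,
        "latency_ms": int((time.monotonic() - start) * 1000),
        "outcome": outcome,
        **ids,  # user_id, tenant_id, ...
    }
    usage = getattr(resp, "usage", None)
    if usage is not None:
        record["input_tokens"] = getattr(usage, "input_tokens", None)
        record["output_tokens"] = getattr(usage, "output_tokens", None)
    print(json.dumps(record))  # swap for your real exporter
    return resp
```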
Minimum viable: an evals/ folder with 50–200 YAML test cases (promptfoo) or pytest cases. Graduate to LLM-as-judge scoring for subjective quality; a minimal sketch follows.
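A minimal pytest sketch with an LLM-as-judge assertion, assuming litellm; the models and the golden question are illustrative:

```python
import pytest
from litellm import completion

def ask(model: str, question: str) -> str:
    resp = completion(model=model, messages=[{"role": "user", "content": question}])
    return resp.choices[0].message.content

@pytest.mark.parametrize("question", [
    "Explain what pgvector is in one sentence.",
])
def test_answer_passes_judge(question):
    answer = ask("anthropic/claude-sonnet-4-5", question)
    # Cheap model as the judge, per the tiering rubric above.
    verdict = ask(
        "anthropic/claude-haiku-4-5",
        f"Question: {question}\nAnswer: {answer}\n"
        "Reply PASS if the answer is accurate, concise, and on-topic; otherwise reply FAIL.",
    )
    assert "PASS" in verdict
```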
Add an LLM gateway (litellm, Portkey, or home-grown) for: auth, per-service quotas, central logging, and model routing.
```mermaid
flowchart LR
    subgraph yapps[Your apps]
        A1[App 1]
        A2[App 2]
        A3[App 3]
    end
    A1 --> GW["LLM Gateway<br/>auth, quota, logs, routing"]
    A2 --> GW
    A3 --> GW
    GW --> P1[Anthropic]
    GW --> P2[OpenAI]
    GW --> P3[Google]
    GW --> P4[Self-hosted]
```
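litellm's Router is the client-side version of this picture; its proxy server (or Portkey) is the same idea as a standalone gateway. Model names here are illustrative:

```python
from litellm import Router

# Two deployments behind one logical name: the router load-balances
# across them and fails over when one provider errors out.
router = Router(model_list=[
    {"model_name": "default", "litellm_params": {"model": "anthropic/claude-sonnet-4-5"}},
    {"model_name": "default", "litellm_params": {"model": "openai/gpt-5"}},
])

resp = router.completion(
    model="default",  # apps name the tier, not the provider
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```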
| Need | Buy (first) | Build (later, if real) |
|---|---|---|
| Chat UI | Copy an open template | Custom UI |
| RAG | Bedrock KB / Vertex RAG | pgvector + your pipeline |
| Eval suite | promptfoo | Custom with LLM-as-judge |
| Agent framework | LangGraph / Temporal + LLMs | Custom if needs are truly novel |
| Observability | Langfuse / LangSmith | Only if you have scale + a team |
| Gateway | litellm / Portkey | Home-grown if you need custom policy |
| Fine-tuning infra | Axolotl / Vertex / Bedrock | A team owning GPUs |
Buy the standard; invest build-effort where you have unique requirements.
```text
MODELS
  Heavy     : Claude Opus 4.7 (extended thinking)
  Default   : Claude Sonnet 4.5 or GPT-5 mid
  Cheap     : Claude Haiku 4.5 or Gemini Flash
  Embeddings: voyage-3-large
  Self-host : Qwen 2.5 Coder / LLaMA 4

DATA
  pgvector on Postgres (default)
  Redis for cache
  OpenSearch for heavy keyword search

FRAMEWORKS
  Python: LangGraph + instructor + LlamaIndex (RAG)
  Java  : Spring AI
  JS/TS : Vercel AI SDK

OBSERVABILITY
  Langfuse (or LangSmith)
  OTel into your existing APM
  Per-call fields: model, tokens, cost, latency, tool, tenant

EVALS
  promptfoo + pytest
  50-200 cases, LLM-as-judge for subjective

SECURITY
  LLM gateway (litellm / Portkey)
  Scrub PII, per-service quotas, secrets manager

AGENTS
  Start with 1 agent + tools
  LangGraph or Temporal for durable flows
  Budgets: steps, tokens, dollars, wall clock

MCP
  Write one server for your top-5 internal systems (see the sketch below)
```
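To make that last item concrete, here is roughly the smallest useful MCP server, using the official Python SDK's FastMCP helper; get_ticket is a hypothetical internal lookup stubbed with a literal:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("internal-tickets")

@mcp.tool()
def get_ticket(ticket_id: str) -> str:
    """Fetch a support ticket by id (stub; wire to your real system)."""
    return f"ticket {ticket_id}: status=open"

if __name__ == "__main__":
    mcp.run()  # speaks MCP over stdio by default
```

Register it with Claude Desktop or Claude Code and every client in layer 1 can reach that internal system through the same tool interface.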