Chapter 15 · The Backend Engineer's AI Toolkit

A concrete, installable stack for 2026.


Rather than abstract advice, this chapter is a spec sheet. Install these, learn these, pay for these; you'll be in the top 10% of backend engineers on the AI axis within a quarter.

In plain English. A modern AI stack has six layers: a model, a place to call it from (an IDE or app), a way to give it your data (RAG/vectors), a way to give it tools (MCP/function calling), a way to measure quality (evals), and a way to see what it did (observability). You need all six.

The toolkit, as a layered stack

flowchart TB
    L1[1. Surfaces<br/>IDE, CLI, desktop, browser, app] --> L2
    L2[2. Orchestration<br/>LangGraph, Spring AI, your code] --> L3
    L3[3. Models<br/>Opus 4.7, Sonnet, Haiku, Gemini, GPT, local] --> L4
    L4[4. Tools and data<br/>MCP servers, vector DBs, function calls] --> L5
    L5[5. Observability<br/>Langfuse, Helicone, OpenTelemetry] --> L6
    L6[6. Evals and CI<br/>promptfoo, pytest, custom scorers]

Where most teams fail: they buy layers 1 and 3, ignore layers 5 and 6, and then can't tell when their AI ships regressions.

15.1 The big picture

flowchart TB
    subgraph c[Clients]
    C1[IDE: Cursor / VS Code + Copilot]
    C2[Claude Code CLI]
    C3[Claude Desktop / Cowork]
    C4[Browser: Claude in Chrome]
    end
    subgraph f[Frontier APIs]
    F1[Anthropic: Opus 4.7, Haiku 4.5]
    F2[OpenAI: GPT-5, embeddings]
    F3[Google: Gemini 3]
    end
    subgraph sh[Self-host]
    S1[Ollama local]
    S2[vLLM cluster]
    end
    subgraph d[Data layer]
    D1[pgvector on Postgres]
    D2[Qdrant / Pinecone]
    D3[OpenSearch / Elastic]
    end
    subgraph fr[Frameworks]
    FR1[Python: LangChain/LangGraph, LlamaIndex, instructor]
    FR2[Java: Spring AI, LangChain4j]
    FR3[JS/TS: Vercel AI SDK, LangChain.js]
    end
    subgraph o[Observability]
    O1[Langfuse / LangSmith]
    O2[OpenTelemetry + your APM]
    end
    subgraph e[Evals]
    E1[promptfoo]
    E2[pytest + LLM-as-judge]
    end
    c --> f
    c --> sh
    fr --> f
    fr --> sh
    fr --> d
    f --> o
    fr --> o
    fr --> e

15.2 Your local setup (one evening)

# macOS / Linux
# 1. CLI coding agent
brew install anthropic/tap/claude-code   # or: npm install -g @anthropic-ai/claude-code

# 2. Local model runtime
brew install ollama
ollama pull qwen2.5-coder:14b            # coding
ollama pull llama3.3:70b                 # general (q4 quant by default, ~43 GB)

# 3. API keys
cat >> ~/.zshrc <<'EOF'
export ANTHROPIC_API_KEY="..."
export OPENAI_API_KEY="..."
export GOOGLE_API_KEY="..."
EOF

# 4. Python env
pipx install uv
npm install -g promptfoo                 # eval runner (promptfoo is an npm package, not PyPI)
uv pip install --system anthropic openai google-genai \
  langchain-core langgraph llama-index-core pydantic instructor

# 5. IDE
#    Install Cursor or keep VS Code + GitHub Copilot
#    Install the Continue.dev extension as a free alternative

By the end of the evening you can call three frontier APIs and one local model from your own code, and run an eval suite.
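
A quick smoke test for the setup above; a minimal sketch assuming ANTHROPIC_API_KEY is exported, `ollama serve` is running, and the model IDs are placeholders for whatever you actually pulled:

# smoke_test.py: one frontier call plus one local call.
from anthropic import Anthropic
from openai import OpenAI   # Ollama exposes an OpenAI-compatible endpoint

def frontier() -> str:
    client = Anthropic()    # picks up ANTHROPIC_API_KEY from the environment
    msg = client.messages.create(
        model="claude-sonnet-4-5",   # placeholder ID; use your default model
        max_tokens=64,
        messages=[{"role": "user", "content": "Reply with the word: frontier-ok"}],
    )
    return msg.content[0].text

def local() -> str:
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
    resp = client.chat.completions.create(
        model="qwen2.5-coder:14b",
        messages=[{"role": "user", "content": "Reply with the word: local-ok"}],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    print(frontier())
    print(local())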

15.3 Model choices in 2026

A small rubric:

- Heavy reasoning and agentic work: Claude Opus 4.7 with extended thinking.
- Everyday default: Claude Sonnet 4.5 or a mid-tier GPT-5.
- Cheap and high-volume: Claude Haiku 4.5 or Gemini Flash.
- Embeddings: voyage-3-large.
- Self-hosted: Qwen 2.5 Coder or Llama 4 via Ollama / vLLM.

Never pick one model and marry it. Pricing and capability shift every 3–6 months. Build your code so swapping models is a config change.

# Shape you want — provider-neutral
client = llm("claude-opus-4-7")   # or "gpt-5", "gemini-3-pro"
resp = client.chat(messages, tools=..., response_schema=...)

LangChain, LlamaIndex, instructor, Vercel AI SDK, and Spring AI all give you this shape.
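
One concrete way to get that shape today is litellm (also named later as a gateway option); a minimal sketch, with the chapter's 2026 model IDs as placeholders:

# provider_neutral.py: the model string is configuration, so swapping providers is a one-line change.
import litellm

MODEL = "claude-opus-4-7"   # or "gpt-5", "gemini-3-pro"; pulled from config in real code

def chat(messages: list[dict]) -> str:
    resp = litellm.completion(model=MODEL, messages=messages)
    return resp.choices[0].message.content   # OpenAI-shaped response for every provider

if __name__ == "__main__":
    print(chat([{"role": "user", "content": "One sentence on pgvector."}]))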

15.4 Data layer

Default:

- pgvector on Postgres for embeddings (query sketch below), Redis for caching, OpenSearch / Elastic for heavy keyword search.

Escalate only when you hit real limits:

- A dedicated vector store (Qdrant or Pinecone) once pgvector's scale, latency, or filtering genuinely stops being enough.
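
A minimal pgvector sketch, assuming psycopg 3 and a hypothetical docs table whose dimension matches your embedding model:

# pgvector_sketch.py: nearest-neighbour lookup on plain Postgres.
# Schema, run once in a migration:
#   CREATE EXTENSION IF NOT EXISTS vector;
#   CREATE TABLE docs (id bigserial PRIMARY KEY, content text, embedding vector(1024));
#   CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops);
import psycopg

def top_k(conn: psycopg.Connection, query_embedding: list[float], k: int = 5) -> list[str]:
    # <=> is pgvector's cosine-distance operator; smallest distance first.
    vec = "[" + ",".join(str(x) for x in query_embedding) + "]"
    sql = "SELECT content FROM docs ORDER BY embedding <=> %s::vector LIMIT %s"
    with conn.cursor() as cur:
        cur.execute(sql, (vec, k))
        return [row[0] for row in cur.fetchall()]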

15.5 Orchestration frameworks

Python:

- LangChain / LangGraph for orchestration and agent graphs, LlamaIndex for RAG pipelines, instructor for typed structured output (sketch below).

Java:

- Spring AI as the default on existing Spring stacks; LangChain4j as the alternative.

JS/TS:

- Vercel AI SDK; LangChain.js if you want parity with the Python stack.
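
A minimal structured-output sketch with instructor on top of the Anthropic SDK; the Triage model and the model ID are illustrative, not from this chapter:

# structured_output.py: typed LLM output with instructor + Pydantic.
import instructor
from anthropic import Anthropic
from pydantic import BaseModel

class Triage(BaseModel):
    severity: str        # e.g. "low", "medium", "high"
    summary: str

client = instructor.from_anthropic(Anthropic())

def triage(ticket_text: str) -> Triage:
    return client.messages.create(
        model="claude-sonnet-4-5",   # placeholder ID
        max_tokens=512,
        response_model=Triage,       # instructor validates against the schema and retries
        messages=[{"role": "user", "content": f"Triage this support ticket:\n{ticket_text}"}],
    )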

15.6 Observability — not optional

Pick one, install day one:

- Langfuse (self-hostable) or LangSmith for LLM-specific traces, plus OpenTelemetry into whatever APM you already run.

What to emit on every LLM call:

{
  "trace_id": "...",
  "model": "claude-opus-4-7",
  "prompt_version": "pr_017",
  "input_tokens": 1240,
  "output_tokens": 340,
  "latency_ms": 2200,
  "cost_usd": 0.017,
  "tools_called": ["search", "get_user"],
  "user_id": "...",
  "tenant_id": "...",
  "outcome": "ok|error|timeout"
}
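
If you adopt nothing else on day one, even a hand-rolled wrapper that emits that record on every call beats flying blind; a minimal sketch, with placeholder per-token prices and the chapter's model IDs:

# llm_logging.py: one structured log record per LLM call.
import json, time, uuid
from anthropic import Anthropic

client = Anthropic()
PRICE_IN, PRICE_OUT = 5 / 1_000_000, 25 / 1_000_000   # USD per token, placeholders

def logged_call(messages, model="claude-opus-4-7", prompt_version="pr_017", tenant_id=None):
    start = time.monotonic()
    resp, outcome = None, "error"
    try:
        resp = client.messages.create(model=model, max_tokens=1024, messages=messages)
        outcome = "ok"
        return resp
    finally:
        in_tok = resp.usage.input_tokens if resp else 0
        out_tok = resp.usage.output_tokens if resp else 0
        record = {
            "trace_id": str(uuid.uuid4()),
            "model": model,
            "prompt_version": prompt_version,
            "input_tokens": in_tok,
            "output_tokens": out_tok,
            "latency_ms": int((time.monotonic() - start) * 1000),
            "cost_usd": round(in_tok * PRICE_IN + out_tok * PRICE_OUT, 5),
            "tenant_id": tenant_id,
            "outcome": outcome,
        }
        print(json.dumps(record))   # or ship it to your log pipeline / OTel exporter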

15.7 Evals

Minimum viable:

- promptfoo plus pytest, with 50-200 representative cases run in CI on every prompt or model change.

Graduate to:

- LLM-as-judge scorers for subjective quality and regression gates that block a deploy when scores drop (judge sketch below).
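
A minimal sketch of the LLM-as-judge pattern in pytest; the summarize function under test, the judge prompt, and the model ID are assumptions:

# test_summaries.py: deterministic checks first, then a cheap model as judge.
import pytest
from anthropic import Anthropic

from myapp.summarize import summarize   # hypothetical function under test

judge = Anthropic()

def judge_score(source: str, summary: str) -> int:
    """Ask a cheap model to grade faithfulness 1-5 and parse the single digit."""
    resp = judge.messages.create(
        model="claude-haiku-4-5",   # placeholder ID; a cheap model is fine for judging
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": (
                "Rate 1-5 how faithful this summary is to the source. "
                f"Reply with one digit only.\n\nSOURCE:\n{source}\n\nSUMMARY:\n{summary}"
            ),
        }],
    )
    return int(resp.content[0].text.strip()[0])

CASES = [
    "Order 1042 shipped on Monday and was refunded on Friday after a damage claim.",
]

@pytest.mark.parametrize("source", CASES)
def test_summary_quality(source):
    summary = summarize(source)
    assert len(summary) < len(source)          # cheap deterministic check first
    assert judge_score(source, summary) >= 4   # subjective quality via the judge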

15.8 Security and secrets

flowchart LR
    subgraph yapps[Your apps]
    A1[App 1]
    A2[App 2]
    A3[App 3]
    end
    A1 --> GW[LLM Gateway<br/>auth, quota, logs, routing]
    A2 --> GW
    A3 --> GW
    GW --> P1[Anthropic]
    GW --> P2[OpenAI]
    GW --> P3[Google]
    GW --> P4[Self-hosted]
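
From the application side a gateway is mostly a base-URL change; a minimal sketch assuming the gateway (a litellm proxy, say) exposes an OpenAI-compatible endpoint at a hypothetical internal URL:

# via_gateway.py: every call routed through the internal LLM gateway.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://llm-gateway.internal/v1",   # hypothetical internal URL
    api_key=os.environ["LLM_GATEWAY_KEY"],        # per-service key, never a raw provider key
)

resp = client.chat.completions.create(
    model="claude-opus-4-7",   # the gateway maps logical model names to providers
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)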

15.9 Cloud-native defaults

On GCP

- Vertex AI for Gemini models and managed fine-tuning, Vertex RAG for managed retrieval.

On AWS

- Bedrock for Claude and other models plus fine-tuning, Bedrock Knowledge Bases for managed RAG.

15.10 The "build vs buy" chart for 2026

Need                Buy (first)                   Build (later, if real)
Chat UI             Copy an open template         Custom UI
RAG                 Bedrock KB / Vertex RAG       pgvector + your pipeline
Eval suite          promptfoo                     Custom with LLM-as-judge
Agent framework     LangGraph / Temporal + LLMs   Custom if needs are truly novel
Observability       Langfuse / LangSmith          Only if you have scale + a team
Gateway             litellm / Portkey             Home-grown if you need custom policy
Fine-tuning infra   Axolotl / Vertex / Bedrock    A team owning GPUs

Buy the standard; spend build effort only where your requirements are genuinely unique.

15.11 The one-page toolkit summary

MODELS
  Heavy     : Claude Opus 4.7 (extended thinking)
  Default   : Claude Sonnet 4.5 or GPT-5 mid
  Cheap     : Claude Haiku 4.5 or Gemini Flash
  Embeddings: voyage-3-large
  Self-host : Qwen 2.5 Coder / LLaMA 4

DATA
  pgvector on Postgres (default)
  Redis for cache
  OpenSearch for heavy keyword search

FRAMEWORKS
  Python: LangGraph + instructor + LlamaIndex (RAG)
  Java  : Spring AI
  JS/TS : Vercel AI SDK

OBSERVABILITY
  Langfuse (or LangSmith)
  OTel into your existing APM
  Per-call fields: model, tokens, cost, latency, tool, tenant

EVALS
  promptfoo + pytest
  50-200 cases, LLM-as-judge for subjective

SECURITY
  LLM gateway (litellm / Portkey)
  Scrub PII, per-service quotas, secrets manager

AGENTS
  Start with 1 agent + tools
  LangGraph or Temporal for durable flows
  Budgets: steps, tokens, dollars, wall clock

MCP
  Write one server for your top-5 internal systems
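
For that MCP item, a minimal server sketch using the official Python SDK's FastMCP helper; the orders system and its single tool are hypothetical:

# orders_mcp.py: expose one internal system as an MCP server.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("orders")

@mcp.tool()
def get_order(order_id: str) -> str:
    """Return the status of an order from the (hypothetical) internal orders API."""
    # A real server would call your service here; hard-coded for the sketch.
    return f"order {order_id}: shipped"

if __name__ == "__main__":
    mcp.run()   # stdio transport by default; register it with Claude Desktop / Claude Code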

Further reading & watching