Chapter 15 · The Backend Engineer's AI Toolkit

A concrete, installable stack for 2026.


Rather than abstract advice, this chapter is a spec sheet. Install these, learn these, pay for these; you'll be in the top 10% of backend engineers on the AI axis within a quarter.

In plain English. A modern AI stack has six layers: a model, a place to call it from (an IDE or app), a way to give it your data (RAG/vectors), a way to give it tools (MCP/function calling), a way to measure quality (evals), and a way to see what it did (observability). You need all six.

The toolkit, as a layered stack

flowchart TB
    L1[1. Surfaces<br/>IDE, CLI, desktop, browser, app] --> L2
    L2[2. Orchestration<br/>LangGraph, Spring AI, your code] --> L3
    L3[3. Models<br/>Opus 4.7, Sonnet, Haiku, Gemini, GPT, local] --> L4
    L4[4. Tools and data<br/>MCP servers, vector DBs, function calls] --> L5
    L5[5. Observability<br/>Langfuse, Helicone, OpenTelemetry] --> L6
    L6[6. Evals and CI<br/>promptfoo, pytest, custom scorers]

Where most teams fail: they buy layers 1 and 3, ignore layers 5 and 6, and then can't tell when their AI ships regressions.

15.1 The big picture

flowchart TB
    subgraph c[Clients]
    C1[IDE: Cursor / VS Code + Copilot]
    C2[Claude Code CLI]
    C3[Claude Desktop / Cowork]
    C4[Browser: Claude in Chrome]
    end
    subgraph f[Frontier APIs]
    F1[Anthropic: Opus 4.7, Haiku 4.5]
    F2[OpenAI: GPT-5, embeddings]
    F3[Google: Gemini 3]
    end
    subgraph sh[Self-host]
    S1[Ollama local]
    S2[vLLM cluster]
    end
    subgraph d[Data layer]
    D1[pgvector on Postgres]
    D2[Qdrant / Pinecone]
    D3[OpenSearch / Elastic]
    end
    subgraph fr[Frameworks]
    FR1[Python: LangChain/LangGraph, LlamaIndex, instructor]
    FR2[Java: Spring AI, LangChain4j]
    FR3[JS/TS: Vercel AI SDK, LangChain.js]
    end
    subgraph o[Observability]
    O1[Langfuse / LangSmith]
    O2[OpenTelemetry + your APM]
    end
    subgraph e[Evals]
    E1[promptfoo]
    E2[pytest + LLM-as-judge]
    end
    c --> f
    c --> sh
    fr --> f
    fr --> sh
    fr --> d
    f --> o
    fr --> o
    fr --> e

15.2 Your local setup (one evening)

# macOS / Linux
# 1. CLI coding agent
brew install anthropic/tap/claude-code   # or: npm install -g @anthropic-ai/claude-code

# 2. Local model runtime
brew install ollama
ollama pull qwen2.5-coder:14b            # coding
ollama pull llama3.3:70b                 # general (q4 quant by default, ~43 GB)

# 3. API keys
cat >> ~/.zshrc <<'EOF'
export ANTHROPIC_API_KEY="..."
export OPENAI_API_KEY="..."
export GOOGLE_API_KEY="..."
EOF

# 4. Python env
pipx install uv
npm install -g promptfoo                 # eval runner (promptfoo is an npm package, not PyPI)
uv pip install --system anthropic openai google-genai \
  langchain-core langgraph llama-index-core pydantic instructor

# 5. IDE
#    Install Cursor or keep VS Code + GitHub Copilot
#    Install the Continue.dev extension as a free alternative

By the end of the evening you can call three frontier APIs and one local model from your own code, and run an eval suite.
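
A quick smoke test for the setup above; a minimal sketch assuming ANTHROPIC_API_KEY is exported, `ollama serve` is running, and the model IDs are placeholders for whatever you actually pulled:

# smoke_test.py: one frontier call plus one local call.
from anthropic import Anthropic
from openai import OpenAI   # Ollama exposes an OpenAI-compatible endpoint

def frontier() -> str:
    client = Anthropic()    # picks up ANTHROPIC_API_KEY from the environment
    msg = client.messages.create(
        model="claude-sonnet-4-5",   # placeholder ID; use your default model
        max_tokens=64,
        messages=[{"role": "user", "content": "Reply with the word: frontier-ok"}],
    )
    return msg.content[0].text

def local() -> str:
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
    resp = client.chat.completions.create(
        model="qwen2.5-coder:14b",
        messages=[{"role": "user", "content": "Reply with the word: local-ok"}],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    print(frontier())
    print(local())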

15.3 Model choices in 2026

A small rubric:

- Heavy reasoning and agentic work: Claude Opus 4.7 with extended thinking.
- Everyday default: Claude Sonnet 4.5 or a mid-tier GPT-5.
- Cheap and high-volume: Claude Haiku 4.5 or Gemini Flash.
- Embeddings: voyage-3-large.
- Self-hosted: Qwen 2.5 Coder or Llama 4 via Ollama / vLLM.

Never pick one model and marry it. Pricing and capability shift every 3–6 months. Build your code so swapping models is a config change.

# Shape you want — provider-neutral
client = llm("claude-opus-4-7")   # or "gpt-5", "gemini-3-pro"
resp = client.chat(messages, tools=..., response_schema=...)

LangChain, LlamaIndex, instructor, Vercel AI SDK, and Spring AI all give you this shape.
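
One concrete way to get that shape today is litellm (also named later as a gateway option); a minimal sketch, with the chapter's 2026 model IDs as placeholders:

# provider_neutral.py: the model string is configuration, so swapping providers is a one-line change.
import litellm

MODEL = "claude-opus-4-7"   # or "gpt-5", "gemini-3-pro"; pulled from config in real code

def chat(messages: list[dict]) -> str:
    resp = litellm.completion(model=MODEL, messages=messages)
    return resp.choices[0].message.content   # OpenAI-shaped response for every provider

if __name__ == "__main__":
    print(chat([{"role": "user", "content": "One sentence on pgvector."}]))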

15.4 Data layer

Default:

- pgvector on Postgres for embeddings (query sketch below), Redis for caching, OpenSearch / Elastic for heavy keyword search.

Escalate only when you hit real limits:

- A dedicated vector store (Qdrant or Pinecone) once pgvector's scale, latency, or filtering genuinely stops being enough.
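
A minimal pgvector sketch, assuming psycopg 3 and a hypothetical docs table whose dimension matches your embedding model:

# pgvector_sketch.py: nearest-neighbour lookup on plain Postgres.
# Schema, run once in a migration:
#   CREATE EXTENSION IF NOT EXISTS vector;
#   CREATE TABLE docs (id bigserial PRIMARY KEY, content text, embedding vector(1024));
#   CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops);
import psycopg

def top_k(conn: psycopg.Connection, query_embedding: list[float], k: int = 5) -> list[str]:
    # <=> is pgvector's cosine-distance operator; smallest distance first.
    vec = "[" + ",".join(str(x) for x in query_embedding) + "]"
    sql = "SELECT content FROM docs ORDER BY embedding <=> %s::vector LIMIT %s"
    with conn.cursor() as cur:
        cur.execute(sql, (vec, k))
        return [row[0] for row in cur.fetchall()]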

15.5 Orchestration frameworks

Python:

- LangChain / LangGraph for orchestration and agent graphs, LlamaIndex for RAG pipelines, instructor for typed structured output (sketch below).

Java:

- Spring AI as the default on existing Spring stacks; LangChain4j as the alternative.

JS/TS:

- Vercel AI SDK; LangChain.js if you want parity with the Python stack.
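
A minimal structured-output sketch with instructor on top of the Anthropic SDK; the Triage model and the model ID are illustrative, not from this chapter:

# structured_output.py: typed LLM output with instructor + Pydantic.
import instructor
from anthropic import Anthropic
from pydantic import BaseModel

class Triage(BaseModel):
    severity: str        # e.g. "low", "medium", "high"
    summary: str

client = instructor.from_anthropic(Anthropic())

def triage(ticket_text: str) -> Triage:
    return client.messages.create(
        model="claude-sonnet-4-5",   # placeholder ID
        max_tokens=512,
        response_model=Triage,       # instructor validates against the schema and retries
        messages=[{"role": "user", "content": f"Triage this support ticket:\n{ticket_text}"}],
    )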

15.6 Observability — not optional

Pick one, install day one:

- Langfuse (self-hostable) or LangSmith for LLM-specific traces, plus OpenTelemetry into whatever APM you already run.

What to emit on every LLM call:

{
  "trace_id": "...",
  "model": "claude-opus-4-7",
  "prompt_version": "pr_017",
  "input_tokens": 1240,
  "output_tokens": 340,
  "latency_ms": 2200,
  "cost_usd": 0.017,
  "tools_called": ["search", "get_user"],
  "user_id": "...",
  "tenant_id": "...",
  "outcome": "ok|error|timeout"
}
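
If you adopt nothing else on day one, even a hand-rolled wrapper that emits that record on every call beats flying blind; a minimal sketch, with placeholder per-token prices and the chapter's model IDs:

# llm_logging.py: one structured log record per LLM call.
import json, time, uuid
from anthropic import Anthropic

client = Anthropic()
PRICE_IN, PRICE_OUT = 5 / 1_000_000, 25 / 1_000_000   # USD per token, placeholders

def logged_call(messages, model="claude-opus-4-7", prompt_version="pr_017", tenant_id=None):
    start = time.monotonic()
    resp, outcome = None, "error"
    try:
        resp = client.messages.create(model=model, max_tokens=1024, messages=messages)
        outcome = "ok"
        return resp
    finally:
        in_tok = resp.usage.input_tokens if resp else 0
        out_tok = resp.usage.output_tokens if resp else 0
        record = {
            "trace_id": str(uuid.uuid4()),
            "model": model,
            "prompt_version": prompt_version,
            "input_tokens": in_tok,
            "output_tokens": out_tok,
            "latency_ms": int((time.monotonic() - start) * 1000),
            "cost_usd": round(in_tok * PRICE_IN + out_tok * PRICE_OUT, 5),
            "tenant_id": tenant_id,
            "outcome": outcome,
        }
        print(json.dumps(record))   # or ship it to your log pipeline / OTel exporter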

15.7 Evals

Minimum viable:

- promptfoo plus pytest, with 50-200 representative cases run in CI on every prompt or model change.

Graduate to:

- LLM-as-judge scorers for subjective quality and regression gates that block a deploy when scores drop (judge sketch below).
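
A minimal sketch of the LLM-as-judge pattern in pytest; the summarize function under test, the judge prompt, and the model ID are assumptions:

# test_summaries.py: deterministic checks first, then a cheap model as judge.
import pytest
from anthropic import Anthropic

from myapp.summarize import summarize   # hypothetical function under test

judge = Anthropic()

def judge_score(source: str, summary: str) -> int:
    """Ask a cheap model to grade faithfulness 1-5 and parse the single digit."""
    resp = judge.messages.create(
        model="claude-haiku-4-5",   # placeholder ID; a cheap model is fine for judging
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": (
                "Rate 1-5 how faithful this summary is to the source. "
                f"Reply with one digit only.\n\nSOURCE:\n{source}\n\nSUMMARY:\n{summary}"
            ),
        }],
    )
    return int(resp.content[0].text.strip()[0])

CASES = [
    "Order 1042 shipped on Monday and was refunded on Friday after a damage claim.",
]

@pytest.mark.parametrize("source", CASES)
def test_summary_quality(source):
    summary = summarize(source)
    assert len(summary) < len(source)          # cheap deterministic check first
    assert judge_score(source, summary) >= 4   # subjective quality via the judge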

15.8 Security and secrets

flowchart LR
    subgraph yapps[Your apps]
    A1[App 1]
    A2[App 2]
    A3[App 3]
    end
    A1 --> GW[LLM Gateway<br/>auth, quota, logs, routing]
    A2 --> GW
    A3 --> GW
    GW --> P1[Anthropic]
    GW --> P2[OpenAI]
    GW --> P3[Google]
    GW --> P4[Self-hosted]
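
From the application side a gateway is mostly a base-URL change; a minimal sketch assuming the gateway (a litellm proxy, say) exposes an OpenAI-compatible endpoint at a hypothetical internal URL:

# via_gateway.py: every call routed through the internal LLM gateway.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://llm-gateway.internal/v1",   # hypothetical internal URL
    api_key=os.environ["LLM_GATEWAY_KEY"],        # per-service key, never a raw provider key
)

resp = client.chat.completions.create(
    model="claude-opus-4-7",   # the gateway maps logical model names to providers
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)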

15.9 Cloud-native defaults

On GCP

- Vertex AI for Gemini models and managed fine-tuning, Vertex RAG for managed retrieval.

On AWS

- Bedrock for Claude and other models plus fine-tuning, Bedrock Knowledge Bases for managed RAG.

15.10 The "build vs buy" chart for 2026

Need                Buy (first)                   Build (later, if real)
Chat UI             Copy an open template         Custom UI
RAG                 Bedrock KB / Vertex RAG       pgvector + your pipeline
Eval suite          promptfoo                     Custom with LLM-as-judge
Agent framework     LangGraph / Temporal + LLMs   Custom if needs are truly novel
Observability       Langfuse / LangSmith          Only if you have scale + a team
Gateway             litellm / Portkey             Home-grown if you need custom policy
Fine-tuning infra   Axolotl / Vertex / Bedrock    A team owning GPUs

Buy the standard; spend build effort only where your requirements are genuinely unique.

15.11 The one-page toolkit summary

MODELS
  Heavy     : Claude Opus 4.7 (extended thinking)
  Default   : Claude Sonnet 4.5 or GPT-5 mid
  Cheap     : Claude Haiku 4.5 or Gemini Flash
  Embeddings: voyage-3-large
  Self-host : Qwen 2.5 Coder / LLaMA 4

DATA
  pgvector on Postgres (default)
  Redis for cache
  OpenSearch for heavy keyword search

FRAMEWORKS
  Python: LangGraph + instructor + LlamaIndex (RAG)
  Java  : Spring AI
  JS/TS : Vercel AI SDK

OBSERVABILITY
  Langfuse (or LangSmith)
  OTel into your existing APM
  Per-call fields: model, tokens, cost, latency, tool, tenant

EVALS
  promptfoo + pytest
  50-200 cases, LLM-as-judge for subjective

SECURITY
  LLM gateway (litellm / Portkey)
  Scrub PII, per-service quotas, secrets manager

AGENTS
  Start with 1 agent + tools
  LangGraph or Temporal for durable flows
  Budgets: steps, tokens, dollars, wall clock

MCP
  Write one server for your top-5 internal systems
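
For that MCP item, a minimal server sketch using the official Python SDK's FastMCP helper; the orders system and its single tool are hypothetical:

# orders_mcp.py: expose one internal system as an MCP server.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("orders")

@mcp.tool()
def get_order(order_id: str) -> str:
    """Return the status of an order from the (hypothetical) internal orders API."""
    # A real server would call your service here; hard-coded for the sketch.
    return f"order {order_id}: shipped"

if __name__ == "__main__":
    mcp.run()   # stdio transport by default; register it with Claude Desktop / Claude Code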

Further reading & watching