Chapter 7 · RAG — Retrieval-Augmented Generation

The single most-shipped pattern in enterprise AI. Learn this thoroughly.


If a team ships one AI feature, it is almost always a RAG system. Customer-support copilot, "ask our docs," legal research, internal search, code-aware chat — these are all RAG.

In plain English: RAG is "open-book exam mode" for an LLM. Instead of trusting the model's memory, you hand it the relevant passages from your corpus at the moment it answers.

The RAG mental model

flowchart LR
    subgraph Library["Your knowledge (books, docs, tickets)"]
        B1[Book 1]
        B2[Book 2]
        B3[Book N]
    end
    subgraph Librarian["Retriever"]
        R["Embedding + vector search + keyword + rerank"]
    end
    Library --> R
    Q[User question] --> R
    R --> P[Top-k passages]
    P --> M[LLM]
    Q --> M
    M --> A["Answer with citations"]

Three roles: a library (your data), a librarian (retrieval), and a reader (the LLM). Most production failures live in the librarian.

The idea is simple: an LLM can't know your private data, can't know anything after its training cutoff, and hallucinates when it doesn't know. So don't rely on its memory. At query time, retrieve the relevant chunks and inject them into the prompt.

7.1 Why RAG beats "just put it in the prompt"

Even with 1M-token windows, RAG usually wins: you pay, in money and latency, for every token on every request; model attention degrades when the answer is buried deep in a huge context; and retrieval lets you enforce per-user permissions and serve documents that changed five minutes ago. Long context complements retrieval; it doesn't replace it.

7.2 The canonical pipeline

flowchart TB
    subgraph off[Offline indexing]
        A["Source docs: PDFs, wiki, tickets, code"] --> B["Parsers: extract text + metadata"]
        B --> C["Chunker: split into passages"]
        C --> D[Embedder]
        D --> E[("Vector DB: pgvector / Pinecone / Qdrant")]
        B --> F[("Keyword index: BM25 / OpenSearch")]
    end
    subgraph onl[Online query]
        Q[User query] --> QE[Embed query]
        QE --> VS[Vector search]
        Q --> KS[Keyword search]
        E --> VS
        F --> KS
        VS --> RRF[Fusion / RRF]
        KS --> RRF
        RRF --> RR[Cross-encoder rerank]
        RR --> TK[Top-k passages]
        TK --> P[Prompt assembler]
        Q --> P
        P --> LLM
        LLM --> ANS[Answer + citations]
    end

Every arrow is a design decision worth sweating.

7.3 Document ingestion

This is the least glamorous and most decisive step. A RAG system is only as good as what it ingested.

7.4 Chunking — get this right

Chunk size trades off recall (small chunks = more targeted) vs context (big chunks = more coherent).

Defaults that work: roughly 300–800 tokens per chunk, 10–15% overlap, and splitting on document structure (headings, paragraphs, sentences) before falling back to fixed sizes; then tune against your own retrieval evals.

flowchart LR
    subgraph Naive
        A1[Fixed 500 chars] --> A2[Breaks sentences]
    end
    subgraph Recursive
        B1["Split on paragraphs, then sentences, then words"]
    end
    subgraph Semantic
        C1[Embed sentences] --> C2["Merge where sim > threshold"]
    end
    subgraph Hierarchical
        D1[Small chunks for retrieval] --> D2[Big parent chunks for LLM]
    end
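
A minimal sketch of the recursive strategy in plain Python; the separators and the 2,000-character limit are illustrative defaults, not a library API:

def recursive_chunks(text: str, max_chars: int = 2000,
                     seps: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator that works, recursing into oversized pieces."""
    if len(text) <= max_chars:
        return [text]
    for sep in seps:
        parts = text.split(sep)
        if len(parts) == 1:
            continue  # separator not present; try a finer one
        chunks, current = [], ""
        for part in parts:
            piece = part + sep
            if current and len(current) + len(piece) > max_chars:
                chunks.append(current.strip())
                current = ""
            current += piece
        if current.strip():
            chunks.append(current.strip())
        # Any chunk still too large falls through to finer separators on recursion.
        return [c for chunk in chunks for c in recursive_chunks(chunk, max_chars, seps)]
    # No separator left: hard-split as a last resort.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

In production you would count tokens rather than characters and add the overlap between neighbouring chunks, but the control flow is the same.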

7.5 Embeddings

Your choice of embedding model matters, and the options shift quickly; treat the specific model as a swappable component rather than a permanent decision.

Check your embedder on a domain evaluation. Generic leaderboards (MTEB) help but don't substitute for your own test set.

Cost perspective: embedding a million average documents costs roughly $10–$50 depending on the model. Re-embedding is cheap; changing embedding models later is the expensive part, because you must re-index.
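
The arithmetic behind that range, with assumed numbers (average document length and per-token price are the two things to measure for your own corpus and provider):

docs = 1_000_000               # corpus size
tokens_per_doc = 800           # assumed average length; measure yours
usd_per_million_tokens = 0.02  # assumed embedding price; check your provider

total_tokens = docs * tokens_per_doc                       # 800M tokens
cost = total_tokens / 1_000_000 * usd_per_million_tokens   # $16 at these assumptions
print(f"{total_tokens:,} tokens -> ${cost:,.2f}")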

7.6 Vector databases

You have three families:

| Family | Examples | When it fits |
| --- | --- | --- |
| Postgres extension | pgvector on RDS / AlloyDB / Cloud SQL | Small-to-mid scale (millions of vectors); you already run Postgres and want joins, filters, and transactions. |
| Dedicated cloud | Pinecone, Weaviate Cloud, Qdrant Cloud, Milvus | Ops-light, scales horizontally, hybrid search built in. |
| Self-hosted | Qdrant, Weaviate, Milvus, LanceDB, Vespa | Huge scale or strict data residency. |

Default for most teams: pgvector. You already have Postgres; vectors are "just another index." You get filters, joins, role-based access, and point-in-time recovery for free.

-- pgvector, real-world shape
CREATE TABLE docs (
  id          uuid PRIMARY KEY,
  tenant_id   uuid NOT NULL,
  title       text,
  body        text,
  embedding   vector(1024),
  metadata    jsonb,
  updated_at  timestamptz
);

CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops);
CREATE INDEX ON docs (tenant_id);

-- Query: vector + tenant filter
SELECT id, title, body,
       1 - (embedding <=> $query_emb) AS score
FROM docs
WHERE tenant_id = $tenant
ORDER BY embedding <=> $query_emb
LIMIT 20;
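
A sketch of calling that query from Python with psycopg 3; passing the embedding as a bracketed text literal cast to vector avoids extra type registration (the DSN and the embedding source are assumptions):

import psycopg  # psycopg 3

def vector_search(conn, tenant_id: str, query_emb: list[float], k: int = 20):
    # pgvector accepts a '[0.1,0.2,...]' text literal cast to vector.
    emb = "[" + ",".join(str(x) for x in query_emb) + "]"
    sql = """
        SELECT id, title, body, 1 - (embedding <=> %s::vector) AS score
        FROM docs
        WHERE tenant_id = %s
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """
    with conn.cursor() as cur:
        cur.execute(sql, (emb, tenant_id, emb, k))
        return cur.fetchall()

conn = psycopg.connect("postgresql://localhost/ragdb")  # assumed DSN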

7.7 Hybrid search

Pure vector search misses exact-match intent ("error code E_1007"). Pure keyword search misses paraphrase ("payment failed" vs "declined transaction"). Combining both — with Reciprocal Rank Fusion (RRF) or learned fusion — reliably adds 10–25% on retrieval quality.

flowchart LR
    Q[Query] --> VS[Vector search]
    Q --> KS[BM25 / keyword]
    VS --> V[Vector top-50]
    KS --> K[Keyword top-50]
    V --> RRF["RRF: score = sum of 1/(rank + k)"]
    K --> RRF
    RRF --> T[Top-20]

Postgres has both: pgvector + tsvector full-text search. OpenSearch gives you both. Weaviate and Qdrant have hybrid modes natively.
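
RRF itself is tiny: each document's fused score is the sum of 1/(k + rank) over every result list it appears in, with k typically around 60. A minimal sketch:

def rrf(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked id lists from vector and keyword search with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_top = ["d3", "d1", "d7"]    # ids from vector search, best first
keyword_top = ["d1", "d9", "d3"]   # ids from BM25, best first
print(rrf([vector_top, keyword_top])[:20])   # d1 and d3 float to the top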

7.8 Reranking

The retriever returns, say, 20 candidates. A cross-encoder reranker scores each (query, passage) pair directly, using a small model, and picks the best 3–5. This is often the single biggest quality lever after you've set up the basics.

Options: a hosted reranking API (Cohere Rerank is the best-known), an open-weight cross-encoder you run yourself (the BGE reranker family is a common choice), or prompting a small, cheap LLM to score each passage.

Add a reranker once your retrieval top-10 has the right answer but not always at rank 1. A $0.001 rerank often saves $0.01 of LLM confusion.
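
A sketch using the CrossEncoder class from sentence-transformers; the checkpoint named here is one common public reranker, and any cross-encoder you have evaluated slots in the same way:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # small public reranker

def rerank(query: str, passages: list[str], top_n: int = 5) -> list[str]:
    # Score every (query, passage) pair jointly, then keep the highest-scoring passages.
    scores = reranker.predict([(query, p) for p in passages])
    ranked = sorted(zip(scores, passages), key=lambda pair: pair[0], reverse=True)
    return [p for _, p in ranked[:top_n]]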

7.9 Contextual retrieval (Anthropic, 2024)

A striking result: prepend each chunk with a one- or two-sentence LLM-generated summary of the chunk's context within its parent document before embedding and indexing. In Anthropic's benchmarks, contextual embeddings alone cut the top-20 retrieval failure rate by roughly 35%, with further gains when combined with contextual BM25 and reranking.

[Chunk original text]
------
"In Q2 2024, Acme Corp reported revenue of $4.2B, up 12% YoY..."

[After contextual retrieval]
------
This chunk is from Acme Corp's 10-Q filing for Q2 2024, in the Revenue
section of the MD&A. In Q2 2024, Acme Corp reported revenue of $4.2B,
up 12% YoY...

The extra indexing cost is real (one LLM call per chunk) but prompt caching on the full document makes it tractable. For important corpora, do this.
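
A sketch of the indexing step, assuming the Anthropic Python SDK; the prompt wording and model name are placeholders, and the cache_control block is what makes re-sending the full document for every chunk affordable:

import anthropic

client = anthropic.Anthropic()

def contextualize(full_doc: str, chunk: str) -> str:
    """Prefix a chunk with a short LLM-written description of where it sits in its document."""
    resp = client.messages.create(
        model="claude-haiku-4-5",   # placeholder; use a cheap, fast model
        max_tokens=150,
        system=[{
            "type": "text",
            "text": "Here is a document:\n\n" + full_doc,
            "cache_control": {"type": "ephemeral"},   # cache the document across chunk calls
        }],
        messages=[{
            "role": "user",
            "content": "Write one or two sentences situating the following chunk within "
                       "the document above, so the chunk can be understood on its own:\n\n" + chunk,
        }],
    )
    return resp.content[0].text.strip() + " " + chunk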

7.10 The prompt

Once you have top-k passages, assemble the prompt. A clean pattern:

SYSTEM:
Answer the user's question strictly using the provided sources.
If the sources don't contain the answer, say "I don't know."
Cite sources by id, e.g. [src-3].

USER:
<sources>
<src id="src-1" url="...">...passage...</src>
<src id="src-2" url="...">...passage...</src>
</sources>

Question: {user question}

Enforce schema if you need structured output. Return the citations as metadata for the UI.
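
A sketch of assembling that prompt from retrieved passages; the passage dict keys are whatever your retriever returns:

def build_prompt(question: str, passages: list[dict]) -> tuple[str, str]:
    """Return (system, user) messages in the sources-then-question shape above."""
    system = (
        "Answer the user's question strictly using the provided sources.\n"
        "If the sources don't contain the answer, say \"I don't know.\"\n"
        "Cite sources by id, e.g. [src-3]."
    )
    sources = "\n".join(
        f'<src id="src-{i}" url="{p["url"]}">{p["text"]}</src>'
        for i, p in enumerate(passages, start=1)
    )
    user = f"<sources>\n{sources}\n</sources>\n\nQuestion: {question}"
    return system, user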

7.11 Evaluation

Two axes, measure both:

  1. Retrieval quality — does the top-k contain the right chunks? Metrics: recall@k, MRR, NDCG.
  2. End-to-end answer quality — does the LLM give the right answer? Metrics: exact match, LLM-as-judge, human labels.

Tools: Ragas, TruLens, Langfuse evals. Build your own "golden set" of 100 question/answer pairs from domain experts; this is the asset your AI team compounds on.
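
A sketch of the retrieval-side metrics over a golden set; retrieved ids come from your retriever, relevant ids from your expert labels:

def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    # Fraction of the gold chunks that appear in the top-k results.
    return len(set(retrieved[:k]) & relevant) / len(relevant) if relevant else 0.0

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    # Reciprocal rank of the first gold chunk that shows up.
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["chunk-007", "chunk-112", "chunk-501"]   # ranked ids from the retriever
relevant = {"chunk-112", "chunk-113"}                 # expert-labelled gold chunks
print(recall_at_k(retrieved, relevant), mrr(retrieved, relevant))   # 0.5 0.5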

7.12 Beyond naive RAG

Patterns worth knowing: the biggest step up from naive retrieve-then-answer is agentic RAG, where the model plans its own retrievals, reflects on what came back, and loops until it has enough to answer (sketched after the diagram).

flowchart LR
    subgraph naive["Naive RAG"]
    A1[Query] --> A2[Retrieve] --> A3[Answer]
    end
    subgraph agentic["Agentic RAG"]
    B1[Query] --> B2[LLM plan]
    B2 -->|retrieve?| B3[Retrieve]
    B3 --> B4[LLM reflect]
    B4 -->|need more| B2
    B4 -->|done| B5[Answer]
    end
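
A sketch of the agentic loop; llm and retrieve stand in for your model call and retriever, and the SEARCH/ANSWER protocol is just one simple way to let the model drive:

def agentic_rag(question: str, llm, retrieve, max_rounds: int = 3) -> str:
    """Let the model decide what to search for, inspect the results, and loop until done."""
    context: list[str] = []
    for _ in range(max_rounds):
        plan = llm(
            "Question: " + question
            + "\n\nContext so far:\n" + "\n".join(context)
            + "\n\nReply with SEARCH: <query> if you need more information, "
              "or ANSWER: <answer> if you can answer now."
        )
        if plan.startswith("ANSWER:"):
            return plan.removeprefix("ANSWER:").strip()
        context.extend(retrieve(plan.removeprefix("SEARCH:").strip()))
    # Out of rounds: answer with whatever has been gathered.
    return llm("Answer using only this context:\n" + "\n".join(context)
               + "\n\nQuestion: " + question)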

7.13 Production checklist

  1. Filter every retrieval by tenant and permission; the index must never leak across users.
  2. Keep the index fresh: re-ingest when source documents change, and track what is indexed.
  3. Use hybrid search plus a reranker before reaching for a bigger model.
  4. Maintain the golden set of expert question/answer pairs and run retrieval and answer evals on every change.
  5. Require citations in answers and an explicit "I don't know" path when retrieval comes back empty.
  6. Monitor recall@k, latency, and cost per query in production, not just offline.

Further reading & watching