Chapter 7 · RAG — Retrieval-Augmented Generation

The single most-shipped pattern in enterprise AI. Learn this thoroughly.


If a team ships one AI feature, it is almost always a RAG system. Customer-support copilot, "ask our docs," legal research, internal search, code-aware chat — these are all RAG.

In plain English: RAG is "open-book exam mode" for an LLM. Instead of trusting the model's memory, you hand it the relevant passages from your corpus at the moment it answers.

The RAG mental model

flowchart LR
    subgraph Library["Your knowledge (books, docs, tickets)"]
        B1[Book 1]
        B2[Book 2]
        B3[Book N]
    end
    subgraph Librarian["Retriever"]
        R["Embedding + vector search + keyword + rerank"]
    end
    Library --> R
    Q[User question] --> R
    R --> P[Top-k passages]
    P --> M[LLM]
    Q --> M
    M --> A["Answer with citations"]

Three roles: a library (your data), a librarian (retrieval), and a reader (the LLM). Most production failures live in the librarian.

The idea is simple: an LLM can't know your private data, can't know anything after its training cutoff, and hallucinates when it doesn't know. So don't rely on its memory. At query time, retrieve the relevant chunks and inject them into the prompt.

7.1 Why RAG beats "just put it in the prompt"

Even with 1M-token windows, RAG usually wins: you pay, in money and latency, for every token on every request; model attention degrades when the answer is buried deep in a huge context; and retrieval lets you enforce per-user permissions and serve documents that changed five minutes ago. Long context complements retrieval; it doesn't replace it.

7.2 The canonical pipeline

flowchart TB
    subgraph off[Offline indexing]
        A["Source docs: PDFs, wiki, tickets, code"] --> B["Parsers: extract text + metadata"]
        B --> C["Chunker: split into passages"]
        C --> D[Embedder]
        D --> E[("Vector DB: pgvector / Pinecone / Qdrant")]
        B --> F[("Keyword index: BM25 / OpenSearch")]
    end
    subgraph onl[Online query]
        Q[User query] --> QE[Embed query]
        QE --> VS[Vector search]
        Q --> KS[Keyword search]
        E --> VS
        F --> KS
        VS --> RRF[Fusion / RRF]
        KS --> RRF
        RRF --> RR[Cross-encoder rerank]
        RR --> TK[Top-k passages]
        TK --> P[Prompt assembler]
        Q --> P
        P --> LLM
        LLM --> ANS[Answer + citations]
    end

Every arrow is a design decision worth sweating.

7.3 Document ingestion

This is the least glamorous and most decisive step. A RAG system is only as good as what it ingested.

7.4 Chunking — get this right

Chunk size trades off recall (small chunks = more targeted) vs context (big chunks = more coherent).

Defaults that work: roughly 300–800 tokens per chunk, 10–15% overlap, and splitting on document structure (headings, paragraphs, sentences) before falling back to fixed sizes; then tune against your own retrieval evals.

flowchart LR
    subgraph Naive
        A1[Fixed 500 chars] --> A2[Breaks sentences]
    end
    subgraph Recursive
        B1["Split on paragraphs, then sentences, then words"]
    end
    subgraph Semantic
        C1[Embed sentences] --> C2["Merge where sim > threshold"]
    end
    subgraph Hierarchical
        D1[Small chunks for retrieval] --> D2[Big parent chunks for LLM]
    end
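
A minimal sketch of the recursive strategy in plain Python; the separators and the 2,000-character limit are illustrative defaults, not a library API:

def recursive_chunks(text: str, max_chars: int = 2000,
                     seps: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator that works, recursing into oversized pieces."""
    if len(text) <= max_chars:
        return [text]
    for sep in seps:
        parts = text.split(sep)
        if len(parts) == 1:
            continue  # separator not present; try a finer one
        chunks, current = [], ""
        for part in parts:
            piece = part + sep
            if current and len(current) + len(piece) > max_chars:
                chunks.append(current.strip())
                current = ""
            current += piece
        if current.strip():
            chunks.append(current.strip())
        # Any chunk still too large falls through to finer separators on recursion.
        return [c for chunk in chunks for c in recursive_chunks(chunk, max_chars, seps)]
    # No separator left: hard-split as a last resort.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

In production you would count tokens rather than characters and add the overlap between neighbouring chunks, but the control flow is the same.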

7.5 Embeddings

Your choice of embedding model matters, and the options shift quickly; treat the specific model as a swappable component rather than a permanent decision.

Check your embedder on a domain evaluation. Generic leaderboards (MTEB) help but don't substitute for your own test set.

Cost perspective: embedding a million average documents costs roughly $10–$50 depending on the model. Re-embedding is cheap; changing embedding models later is the expensive part, because you must re-index.
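
The arithmetic behind that range, with assumed numbers (average document length and per-token price are the two things to measure for your own corpus and provider):

docs = 1_000_000               # corpus size
tokens_per_doc = 800           # assumed average length; measure yours
usd_per_million_tokens = 0.02  # assumed embedding price; check your provider

total_tokens = docs * tokens_per_doc                       # 800M tokens
cost = total_tokens / 1_000_000 * usd_per_million_tokens   # $16 at these assumptions
print(f"{total_tokens:,} tokens -> ${cost:,.2f}")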

7.6 Vector databases

You have three families:

| Family | Examples | When it fits |
| --- | --- | --- |
| Postgres extension | pgvector on RDS / AlloyDB / Cloud SQL | Small-to-mid scale (millions of vectors); you already run Postgres and want joins, filters, and transactions. |
| Dedicated cloud | Pinecone, Weaviate Cloud, Qdrant Cloud, Milvus | Ops-light, scales horizontally, hybrid search built in. |
| Self-hosted | Qdrant, Weaviate, Milvus, LanceDB, Vespa | Huge scale or strict data residency. |

Default for most teams: pgvector. You already have Postgres; vectors are "just another index." You get filters, joins, role-based access, and point-in-time recovery for free.

-- pgvector, real-world shape
CREATE TABLE docs (
  id          uuid PRIMARY KEY,
  tenant_id   uuid NOT NULL,
  title       text,
  body        text,
  embedding   vector(1024),
  metadata    jsonb,
  updated_at  timestamptz
);

CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops);
CREATE INDEX ON docs (tenant_id);

-- Query: vector + tenant filter
SELECT id, title, body,
       1 - (embedding <=> $query_emb) AS score
FROM docs
WHERE tenant_id = $tenant
ORDER BY embedding <=> $query_emb
LIMIT 20;
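
A sketch of calling that query from Python with psycopg 3; passing the embedding as a bracketed text literal cast to vector avoids extra type registration (the DSN and the embedding source are assumptions):

import psycopg  # psycopg 3

def vector_search(conn, tenant_id: str, query_emb: list[float], k: int = 20):
    # pgvector accepts a '[0.1,0.2,...]' text literal cast to vector.
    emb = "[" + ",".join(str(x) for x in query_emb) + "]"
    sql = """
        SELECT id, title, body, 1 - (embedding <=> %s::vector) AS score
        FROM docs
        WHERE tenant_id = %s
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """
    with conn.cursor() as cur:
        cur.execute(sql, (emb, tenant_id, emb, k))
        return cur.fetchall()

conn = psycopg.connect("postgresql://localhost/ragdb")  # assumed DSN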

7.7 Hybrid search

Pure vector search misses exact-match intent ("error code E_1007"). Pure keyword search misses paraphrase ("payment failed" vs "declined transaction"). Combining both — with Reciprocal Rank Fusion (RRF) or learned fusion — reliably adds 10–25% on retrieval quality.

flowchart LR
    Q[Query] --> VS[Vector search]
    Q --> KS[BM25 / keyword]
    VS --> V[Vector top-50]
    KS --> K[Keyword top-50]
    V --> RRF["RRF: score = sum of 1/(rank + k)"]
    K --> RRF
    RRF --> T[Top-20]

Postgres has both: pgvector + tsvector full-text search. OpenSearch gives you both. Weaviate and Qdrant have hybrid modes natively.
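
RRF itself is tiny: each document's fused score is the sum of 1/(k + rank) over every result list it appears in, with k typically around 60. A minimal sketch:

def rrf(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked id lists from vector and keyword search with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_top = ["d3", "d1", "d7"]    # ids from vector search, best first
keyword_top = ["d1", "d9", "d3"]   # ids from BM25, best first
print(rrf([vector_top, keyword_top])[:20])   # d1 and d3 float to the top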

7.8 Reranking

The retriever returns, say, 20 candidates. A cross-encoder reranker scores each (query, passage) pair directly, using a small model, and picks the best 3–5. This is often the single biggest quality lever after you've set up the basics.

Options: a hosted reranking API (Cohere Rerank is the best-known), an open-weight cross-encoder you run yourself (the BGE reranker family is a common choice), or prompting a small, cheap LLM to score each passage.

Add a reranker once your retrieval top-10 has the right answer but not always at rank 1. A $0.001 rerank often saves $0.01 of LLM confusion.
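
A sketch using the CrossEncoder class from sentence-transformers; the checkpoint named here is one common public reranker, and any cross-encoder you have evaluated slots in the same way:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # small public reranker

def rerank(query: str, passages: list[str], top_n: int = 5) -> list[str]:
    # Score every (query, passage) pair jointly, then keep the highest-scoring passages.
    scores = reranker.predict([(query, p) for p in passages])
    ranked = sorted(zip(scores, passages), key=lambda pair: pair[0], reverse=True)
    return [p for _, p in ranked[:top_n]]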

7.9 Contextual retrieval (Anthropic, 2024)

A striking result: prepend each chunk with a one- or two-sentence LLM-generated summary of the chunk's context within its parent document before embedding and indexing. In Anthropic's benchmarks, contextual embeddings alone cut the top-20 retrieval failure rate by roughly 35%, with further gains when combined with contextual BM25 and reranking.

[Chunk original text]
------
"In Q2 2024, Acme Corp reported revenue of $4.2B, up 12% YoY..."

[After contextual retrieval]
------
This chunk is from Acme Corp's 10-Q filing for Q2 2024, in the Revenue
section of the MD&A. In Q2 2024, Acme Corp reported revenue of $4.2B,
up 12% YoY...

The extra indexing cost is real (one LLM call per chunk) but prompt caching on the full document makes it tractable. For important corpora, do this.
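
A sketch of the indexing step, assuming the Anthropic Python SDK; the prompt wording and model name are placeholders, and the cache_control block is what makes re-sending the full document for every chunk affordable:

import anthropic

client = anthropic.Anthropic()

def contextualize(full_doc: str, chunk: str) -> str:
    """Prefix a chunk with a short LLM-written description of where it sits in its document."""
    resp = client.messages.create(
        model="claude-haiku-4-5",   # placeholder; use a cheap, fast model
        max_tokens=150,
        system=[{
            "type": "text",
            "text": "Here is a document:\n\n" + full_doc,
            "cache_control": {"type": "ephemeral"},   # cache the document across chunk calls
        }],
        messages=[{
            "role": "user",
            "content": "Write one or two sentences situating the following chunk within "
                       "the document above, so the chunk can be understood on its own:\n\n" + chunk,
        }],
    )
    return resp.content[0].text.strip() + " " + chunk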

7.10 The prompt

Once you have top-k passages, assemble the prompt. A clean pattern:

SYSTEM:
Answer the user's question strictly using the provided sources.
If the sources don't contain the answer, say "I don't know."
Cite sources by id, e.g. [src-3].

USER:
<sources>
<src id="src-1" url="...">...passage...</src>
<src id="src-2" url="...">...passage...</src>
</sources>

Question: {user question}

Enforce schema if you need structured output. Return the citations as metadata for the UI.
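
A sketch of assembling that prompt from retrieved passages; the passage dict keys are whatever your retriever returns:

def build_prompt(question: str, passages: list[dict]) -> tuple[str, str]:
    """Return (system, user) messages in the sources-then-question shape above."""
    system = (
        "Answer the user's question strictly using the provided sources.\n"
        "If the sources don't contain the answer, say \"I don't know.\"\n"
        "Cite sources by id, e.g. [src-3]."
    )
    sources = "\n".join(
        f'<src id="src-{i}" url="{p["url"]}">{p["text"]}</src>'
        for i, p in enumerate(passages, start=1)
    )
    user = f"<sources>\n{sources}\n</sources>\n\nQuestion: {question}"
    return system, user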

7.11 Evaluation

Two axes, measure both:

  1. Retrieval quality — does the top-k contain the right chunks? Metrics: recall@k, MRR, NDCG.
  2. End-to-end answer quality — does the LLM give the right answer? Metrics: exact match, LLM-as-judge, human labels.

Tools: Ragas, TruLens, Langfuse evals. Build your own "golden set" of 100 question/answer pairs from domain experts; this is the asset your AI team compounds on.
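
A sketch of the retrieval-side metrics over a golden set; retrieved ids come from your retriever, relevant ids from your expert labels:

def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    # Fraction of the gold chunks that appear in the top-k results.
    return len(set(retrieved[:k]) & relevant) / len(relevant) if relevant else 0.0

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    # Reciprocal rank of the first gold chunk that shows up.
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["chunk-007", "chunk-112", "chunk-501"]   # ranked ids from the retriever
relevant = {"chunk-112", "chunk-113"}                 # expert-labelled gold chunks
print(recall_at_k(retrieved, relevant), mrr(retrieved, relevant))   # 0.5 0.5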

7.12 Beyond naive RAG

Patterns worth knowing: the biggest step up from naive retrieve-then-answer is agentic RAG, where the model plans its own retrievals, reflects on what came back, and loops until it has enough to answer (sketched after the diagram).

flowchart LR
    subgraph naive["Naive RAG"]
    A1[Query] --> A2[Retrieve] --> A3[Answer]
    end
    subgraph agentic["Agentic RAG"]
    B1[Query] --> B2[LLM plan]
    B2 -->|retrieve?| B3[Retrieve]
    B3 --> B4[LLM reflect]
    B4 -->|need more| B2
    B4 -->|done| B5[Answer]
    end
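
A sketch of the agentic loop; llm and retrieve stand in for your model call and retriever, and the SEARCH/ANSWER protocol is just one simple way to let the model drive:

def agentic_rag(question: str, llm, retrieve, max_rounds: int = 3) -> str:
    """Let the model decide what to search for, inspect the results, and loop until done."""
    context: list[str] = []
    for _ in range(max_rounds):
        plan = llm(
            "Question: " + question
            + "\n\nContext so far:\n" + "\n".join(context)
            + "\n\nReply with SEARCH: <query> if you need more information, "
              "or ANSWER: <answer> if you can answer now."
        )
        if plan.startswith("ANSWER:"):
            return plan.removeprefix("ANSWER:").strip()
        context.extend(retrieve(plan.removeprefix("SEARCH:").strip()))
    # Out of rounds: answer with whatever has been gathered.
    return llm("Answer using only this context:\n" + "\n".join(context)
               + "\n\nQuestion: " + question)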

7.13 Production checklist

  1. Filter every retrieval by tenant and permission; the index must never leak across users.
  2. Keep the index fresh: re-ingest when source documents change, and track what is indexed.
  3. Use hybrid search plus a reranker before reaching for a bigger model.
  4. Maintain the golden set of expert question/answer pairs and run retrieval and answer evals on every change.
  5. Require citations in answers and an explicit "I don't know" path when retrieval comes back empty.
  6. Monitor recall@k, latency, and cost per query in production, not just offline.

Further reading & watching