The single most-shipped pattern in enterprise AI. Learn this thoroughly.
If a team ships one AI feature, it is almost always a RAG system. Customer-support copilot, "ask our docs," legal research, internal search, code-aware chat — these are all RAG.
In plain English. RAG is "open-book exam mode" for an LLM. Instead of trusting the model's memory, you hand it the relevant passages from your corpus at the moment it answers.
```mermaid
flowchart LR
    subgraph Library["Your knowledge<br/>(books, docs, tickets)"]
        B1[Book 1]
        B2[Book 2]
        B3[Book N]
    end
    subgraph Librarian["Retriever"]
        R["Embedding<br/>+ vector search<br/>+ keyword<br/>+ rerank"]
    end
    Library --> R
    Q[User question] --> R
    R --> P[Top-k passages]
    P --> M[LLM]
    Q --> M
    M --> A["Answer<br/>with citations"]
```
Three roles: a library (your data), a librarian (retrieval), and a reader (the LLM). Most production failures live in the librarian.
The idea is simple: an LLM can't know your private data, can't know anything after its training cutoff, and hallucinates when it doesn't know. So don't rely on its memory. At query time, retrieve the relevant chunks and inject them into the prompt.
Even with 1M-token context windows, RAG usually wins: you pay for every token on every request, latency grows with input size, and models still lose precision in very long contexts. The full architecture:
```mermaid
flowchart TB
    subgraph off[Offline indexing]
        A["Source docs<br/>PDFs, wiki, tickets, code"] --> B["Parsers<br/>extract text + metadata"]
        B --> C["Chunker<br/>split into passages"]
        C --> D[Embedder]
        D --> E[("Vector DB<br/>pgvector / Pinecone / Qdrant")]
        B --> F[("Keyword index<br/>BM25 / OpenSearch")]
    end
    subgraph onl[Online query]
        Q[User query] --> QE[Embed query]
        QE --> VS[Vector search]
        Q --> KS[Keyword search]
        E --> VS
        F --> KS
        VS --> RRF[Fusion / RRF]
        KS --> RRF
        RRF --> RR[Cross-encoder rerank]
        RR --> TK[Top-k passages]
        TK --> P[Prompt assembler]
        Q --> P
        P --> LLM[LLM]
        LLM --> ANS[Answer + citations]
    end
```
Every arrow is a design decision worth sweating.
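To make the online path concrete, here is a minimal sketch of query → retrieve → assemble → prompt. Everything is a toy stand-in (a bag-of-words "embedding", brute-force search over an in-memory corpus, no actual LLM call), so the wiring is visible; `DOCS`, `retrieve`, and `build_prompt` are illustrative names, not a real library's API.

```python
import math
import re
from collections import Counter

DOCS = {
    "src-1": "Payments can be declined when the card has expired.",
    "src-2": "Error code E_1007 means the webhook signature was invalid.",
    "src-3": "Refunds are processed within 5 business days.",
}

def embed(text: str) -> Counter:
    # Toy "embedding": lowercase word counts, standing in for a real model.
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, k: int = 2) -> list[tuple[str, str]]:
    # Brute-force "vector search" over the toy corpus.
    q = embed(query)
    ranked = sorted(DOCS.items(), key=lambda kv: cosine(q, embed(kv[1])),
                    reverse=True)
    return ranked[:k]

def build_prompt(query: str, passages: list[tuple[str, str]]) -> str:
    sources = "\n".join(f'<src id="{sid}">{text}</src>' for sid, text in passages)
    return f"<sources>\n{sources}\n</sources>\nQuestion: {query}"

question = "Why was my payment declined?"
prompt = build_prompt(question, retrieve(question))
```

In production each stub becomes a real component: `embed` calls your embedding model, `retrieve` hits the vector DB, and `prompt` goes to the LLM.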
This is the least glamorous and most decisive step. A RAG system is only as good as what it ingested.
pypdf, pdfplumber, Unstructured.io, or Claude / Gemini document understanding as a fallback — modern models are very good at reading weird PDFs, and it's often worth paying them once during ingestion.

Chunk size trades off recall (small chunks = more targeted) against context (big chunks = more coherent). Defaults that work:
```mermaid
flowchart LR
    subgraph Naive
        A1[Fixed 500 chars] --> A2[Breaks sentences]
    end
    subgraph Recursive
        B1["Split on paragraphs<br/>then sentences<br/>then words"]
    end
    subgraph Semantic
        C1[Embed sentences]
        C1 --> C2["Merge where<br/>sim > threshold"]
    end
    subgraph Hierarchical
        D1[Small chunks for retrieval]
        D1 --> D2[Big parent chunks for LLM]
    end
```
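The recursive strategy is worth seeing in code. A minimal sketch, with illustrative separators and sizes: split on paragraphs first, and only descend to sentences, then words, when a piece still exceeds the limit.

```python
def recursive_split(text: str, max_chars: int = 500,
                    seps: tuple[str, ...] = ("\n\n", ". ", " ")) -> list[str]:
    """Split on paragraphs, then sentences, then words, descending only
    when a piece is still over the limit. Sizes are illustrative."""
    if len(text) <= max_chars or not seps:
        return [text]
    sep, rest = seps[0], seps[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        # This separator never occurs: fall through to the next finer one.
        return recursive_split(text, max_chars, rest)
    chunks: list[str] = []
    buf = ""
    for part in parts:
        candidate = f"{buf}{sep}{part}" if buf else part
        if len(candidate) <= max_chars:
            buf = candidate  # keep packing pieces into the current chunk
        else:
            if buf:
                chunks.append(buf)
            if len(part) > max_chars:
                # A single piece is still too big: recurse on finer separators.
                chunks.extend(recursive_split(part, max_chars, rest))
                buf = ""
            else:
                buf = part
    if buf:
        chunks.append(buf)
    return chunks

text = ("word " * 120).strip()        # ~600 chars, no paragraph breaks
chunks = recursive_split(text)        # falls through to word-level splitting
```

Libraries like LangChain ship a similar `RecursiveCharacterTextSplitter`; the point here is that the logic is simple enough to own yourself.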
Your choice of embedding model matters. In 2026:
Check your embedder on a domain evaluation. Generic leaderboards (MTEB) help but don't substitute for your own test set.
Cost perspective: embedding a million average documents costs roughly $10–$50 depending on the model. Re-embedding is cheap; changing embedding models later is the expensive part, because you must re-index.
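A back-of-envelope check of that cost range, assuming ~500 tokens per document and $0.02–$0.10 per million tokens (illustrative prices, not any vendor's quote):

```python
# Back-of-envelope embedding cost. All numbers are assumptions.
docs = 1_000_000
tokens_per_doc = 500                 # assumed average document length
price_per_m_tokens = (0.02, 0.10)    # assumed $/1M tokens, cheap vs premium

total_tokens = docs * tokens_per_doc               # 500M tokens
low, high = (total_tokens / 1_000_000 * p for p in price_per_m_tokens)
print(f"${low:.0f} - ${high:.0f}")                 # roughly $10 - $50
```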
You have three families:
| Family | Examples | When it fits |
|---|---|---|
| Postgres extension | pgvector on RDS/AlloyDB/CloudSQL | Small-to-mid scale (millions of vectors), you already run Postgres. Joins + filters + transactions. |
| Dedicated cloud | Pinecone, Weaviate Cloud, Qdrant Cloud, Milvus | Ops-light, scales horizontally, hybrid search built-in. |
| Self-hosted | Qdrant, Weaviate, Milvus, LanceDB, Vespa | Huge scale or strict data residency. |
Default for most teams: pgvector. You already have Postgres; vectors are "just another index." You get filters, joins, role-based access, and point-in-time recovery for free.
```sql
-- pgvector, real-world shape
CREATE TABLE docs (
    id         uuid PRIMARY KEY,
    tenant_id  uuid NOT NULL,
    title      text,
    body       text,
    embedding  vector(1024),
    metadata   jsonb,
    updated_at timestamptz
);

CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops);
CREATE INDEX ON docs (tenant_id);

-- Query: vector + tenant filter
SELECT id, title, body,
       1 - (embedding <=> $query_emb) AS score
FROM docs
WHERE tenant_id = $tenant
ORDER BY embedding <=> $query_emb
LIMIT 20;
```
Pure vector search misses exact-match intent ("error code E_1007"). Pure keyword search misses paraphrase ("payment failed" vs "declined transaction"). Combining both — with Reciprocal Rank Fusion (RRF) or learned fusion — reliably adds 10–25% on retrieval quality.
```mermaid
flowchart LR
    Q[Query] --> VS[Vector search]
    Q --> KS[BM25 / keyword]
    VS --> V[Vector top-50]
    KS --> K[Keyword top-50]
    V --> RRF["RRF: score = sum of 1/(rank + k)"]
    K --> RRF
    RRF --> T[Top-20]
```
Postgres has both: pgvector + tsvector full-text search. OpenSearch gives you both. Weaviate and Qdrant have hybrid modes natively.
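RRF itself fits in a few lines: each ranked list contributes 1/(rank + k) per document, and the sums decide the fused order. k = 60 is the constant from the original RRF paper; the document IDs below are toy data.

```python
def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: sum 1/(rank + k) across all input rankings."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d3", "d1", "d7"]            # vector top-k (toy data)
keyword_hits = ["d1", "d9", "d3"]           # BM25 top-k (toy data)
fused = rrf([vector_hits, keyword_hits])
print(fused)                                # -> ['d1', 'd3', 'd9', 'd7']
```

Note that d1 and d3, which appear in both lists, rise above documents that ranked high in only one list — that is the whole point of fusion.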
The retriever returns, say, 20 candidates. A cross-encoder reranker scores each (query, passage) pair directly, using a small model, and picks the best 3–5. This is often the single biggest quality lever after you've set up the basics.
Options:
Add a reranker once your retrieval top-10 has the right answer but not always at rank 1. A $0.001 rerank often saves $0.01 of LLM confusion.
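The reranking step is just "score every pair, keep the best". A sketch of the wiring, where `overlap_score` is a deliberately dumb stand-in — in production you would call a real cross-encoder (a hosted rerank API or a local model) in its place:

```python
def rerank(query: str, passages: list[str], score_fn,
           top_n: int = 3) -> list[str]:
    # Score each (query, passage) pair directly and keep the best few.
    return sorted(passages, key=lambda p: score_fn(query, p),
                  reverse=True)[:top_n]

def overlap_score(query: str, passage: str) -> float:
    # Toy scorer: fraction of query words present in the passage.
    # A real cross-encoder replaces this with a learned relevance score.
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / max(len(q), 1)

candidates = [
    "Shipping times vary by region.",
    "Declined payments: check card expiry and billing address.",
    "Our office hours are 9-5.",
]
best = rerank("why was my payment declined", candidates,
              overlap_score, top_n=1)
```

The signature is the useful part: reranking is a pure function of (query, candidates) that you can swap scorers into without touching the rest of the pipeline.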
A striking result: prepend each chunk with a one- or two-sentence LLM-generated summary of the chunk's context within its parent document before embedding and indexing. In Anthropic's "contextual retrieval" experiments, this cut the top-20 retrieval failure rate by roughly 35%.
```text
[Chunk, original text]
------
"In Q2 2024, Acme Corp reported revenue of $4.2B, up 12% YoY..."

[Chunk, after contextual retrieval]
------
"This chunk is from Acme Corp's 10-Q filing for Q2 2024, in the Revenue
section of the MD&A. In Q2 2024, Acme Corp reported revenue of $4.2B,
up 12% YoY..."
```
The extra indexing cost is real (one LLM call per chunk) but prompt caching on the full document makes it tractable. For important corpora, do this.
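The indexing-time flow looks like this. `situate` is a stub for the one-LLM-call-per-chunk step (the real call would pass the full parent document, with prompt caching making repeated passes cheap); all names here are illustrative.

```python
def situate(document_title: str, chunk: str) -> str:
    # Stub: a real implementation asks an LLM to describe, in one or two
    # sentences, where this chunk sits within the parent document.
    return f"This chunk is from {document_title}."

def contextualize(document_title: str, chunks: list[str]) -> list[str]:
    # Prepend the generated context to each chunk before embedding.
    return [f"{situate(document_title, c)} {c}" for c in chunks]

indexed = contextualize(
    "Acme Corp's 10-Q filing for Q2 2024",
    ["In Q2 2024, Acme Corp reported revenue of $4.2B, up 12% YoY..."],
)
# Each item is now "context sentence + original text", ready to embed.
```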
Once you have top-k passages, assemble the prompt. A clean pattern:
```text
SYSTEM:
Answer the user's question strictly using the provided sources.
If the sources don't contain the answer, say "I don't know."
Cite sources by id, e.g. [src-3].

USER:
<sources>
  <src id="src-1" url="...">...passage...</src>
  <src id="src-2" url="...">...passage...</src>
</sources>

Question: {user question}
```
Enforce schema if you need structured output. Return the citations as metadata for the UI.
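A small assembler for that pattern, a sketch under the assumption that passages arrive as dicts with `id`, `url`, and `text` keys; real passages may need stricter XML escaping than `html.escape` provides.

```python
from html import escape

SYSTEM = (
    "Answer the user's question strictly using the provided sources.\n"
    'If the sources don\'t contain the answer, say "I don\'t know."\n'
    "Cite sources by id, e.g. [src-3]."
)

def build_user_prompt(question: str, passages: list[dict]) -> str:
    # Wrap each passage in a <src> tag the model can cite by id.
    srcs = "\n".join(
        f'<src id="{p["id"]}" url="{p["url"]}">{escape(p["text"])}</src>'
        for p in passages
    )
    return f"<sources>\n{srcs}\n</sources>\n\nQuestion: {question}"

prompt = build_user_prompt(
    "When are refunds processed?",
    [{"id": "src-1", "url": "https://example.com/refunds",
      "text": "Refunds are processed within 5 business days."}],
)
```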
Two axes, measure both: retrieval quality (does the top-k contain the passage the answer needs?) and answer quality (is the generated answer faithful to the sources and correct?).
Tools: Ragas, TruLens, Langfuse evals. Build your own "golden set" of 100 question/answer pairs from domain experts; this is the asset your AI team compounds on.
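The retrieval side of that golden set reduces to a recall@k loop: for each expert-written question, did the retriever's top-k include the document they marked as the answer's source? A sketch with toy data and a stand-in retriever:

```python
def recall_at_k(golden: list[dict], retrieve_fn, k: int = 5) -> float:
    # Fraction of golden questions whose gold document appears in top-k.
    hits = sum(
        1 for ex in golden
        if ex["gold_doc_id"] in retrieve_fn(ex["question"])[:k]
    )
    return hits / len(golden)

golden = [
    {"question": "refund timing?", "gold_doc_id": "d2"},
    {"question": "payment declined?", "gold_doc_id": "d1"},
]

def toy_retriever(question: str) -> list[str]:
    return ["d2", "d1", "d3"]   # stand-in for your real retriever

recall = recall_at_k(golden, toy_retriever, k=2)
```

Run this in CI on every retrieval change; a drop in recall@k catches regressions before any user does.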
Patterns worth knowing:
```mermaid
flowchart LR
    subgraph naive["Naive RAG"]
        A1[Query] --> A2[Retrieve] --> A3[Answer]
    end
    subgraph agentic["Agentic RAG"]
        B1[Query] --> B2[LLM plan]
        B2 -->|retrieve?| B3[Retrieve]
        B3 --> B4[LLM reflect]
        B4 -->|need more| B2
        B4 -->|done| B5[Answer]
    end
```
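The agentic loop on the right is just plan → retrieve → reflect → repeat with a round limit. In this sketch, `enough` and `refine` are rule-based stubs standing in for the LLM's reflect and plan calls, so the control flow is runnable:

```python
def agentic_rag(question: str, retrieve_fn, max_rounds: int = 3) -> list[str]:
    evidence: list[str] = []
    query = question
    for _ in range(max_rounds):
        evidence.extend(retrieve_fn(query))
        if enough(evidence):                    # reflect: done yet?
            break
        query = refine(question, evidence)      # plan: reformulate, retry
    return evidence

def enough(evidence: list[str]) -> bool:
    # Stub: a real system asks the LLM whether the evidence answers
    # the question.
    return len(evidence) >= 2

def refine(question: str, evidence: list[str]) -> str:
    # Stub query reformulation; a real system lets the LLM rewrite it.
    return question + " details"

docs = agentic_rag("How do refunds work?", lambda q: [f"passage for: {q}"])
```

The round limit matters: without it, an LLM that never declares "done" loops forever, burning tokens.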