A running list of terms you'll hit in this handbook, in papers, and in the wild. Definitions are short and practical, not exhaustive.
Agent. An LLM in a loop with tools and a goal. Executes multiple steps autonomously until a condition is met.
Alignment. The discipline of making AI systems pursue the goals their operators actually intend, not just the goals they were literally given.
ANN (Approximate Nearest Neighbor). Algorithm for fast vector similarity search (HNSW, IVF, ScaNN). The "search" in vector search.
Attention. The mechanism by which each token in a sequence weighs the relevance of every other token when forming its own new representation.
Autoregressive. A model that generates output one token at a time, each one conditioned on all prior tokens. GPT-style.
BERT. Early encoder-only Transformer (2018). Strong for classification and embeddings. Ancestor of most retrieval models.
BPE (Byte-Pair Encoding). Common subword tokenization scheme. Splits rare words into more frequent fragments.
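A toy illustration of the core BPE training step — count adjacent symbol pairs and merge the most frequent one — on a made-up three-word corpus. Real tokenizers repeat this thousands of times and handle bytes, not characters.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of symbol sequences."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, each word as a tuple of characters.
corpus = {tuple("lower"): 2, tuple("lowest"): 1, tuple("low"): 5}
pair = most_frequent_pair(corpus)      # ('l', 'o') appears 8 times
corpus = merge_pair(corpus, pair)      # 'l' + 'o' become one symbol 'lo'
```

After enough merges, frequent words become single tokens while rare words stay split into reusable fragments.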
Chain-of-thought (CoT). Prompting technique where the model writes intermediate reasoning before its final answer. Often improves accuracy on reasoning tasks.
Chinchilla. 2022 DeepMind paper showing that most LLMs at the time were trained on too little data for their parameter count. Influenced open-model training recipes.
Constitutional AI (CAI). Anthropic's alignment method: a written set of principles the model learns to critique and revise its own outputs against.
Context window. Maximum number of tokens (input + output) the model can process in a single call.
Contextual retrieval. Prepending each chunk with a short LLM-generated context summary before embedding, to improve retrieval quality.
Cross-encoder. A transformer that takes a (query, passage) pair and outputs a relevance score. Slower than embeddings but more accurate — used as a reranker.
Decoder-only. GPT-style architecture. One-directional attention, optimized for generation.
Distillation. Training a small model to mimic a larger one. Big cost reductions, often small quality drops.
DPO (Direct Preference Optimization). Preference-based fine-tuning method that avoids RL. Simpler than RLHF.
Embedding. A dense vector representing the meaning of a piece of text (or an image, etc.).
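"Meaning" is operationalized as geometry: similar texts map to nearby vectors, usually compared by cosine similarity. A sketch with made-up 4-dimensional vectors (real models emit hundreds to thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical embeddings, invented for illustration.
cat = [0.9, 0.1, 0.0, 0.2]
kitten = [0.85, 0.15, 0.05, 0.25]
invoice = [0.0, 0.9, 0.8, 0.1]

related = cosine_similarity(cat, kitten)      # close to 1.0
unrelated = cosine_similarity(cat, invoice)   # close to 0.0
```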
Encoder-only. BERT-style architecture. Bidirectional attention, used for understanding tasks rather than generation.
Evals. Automated tests for prompts and models. The backbone of serious AI engineering.
FlashAttention. Memory-efficient attention implementation. Standard in modern transformer inference.
Fine-tuning. Updating model weights on your data to change its behavior. LoRA/QLoRA are cheap variants.
Function calling. Protocol where the model emits structured JSON to invoke a named tool, rather than free-form text.
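The application side of the protocol, sketched with a hypothetical tool registry and a stub weather function — real APIs (OpenAI, Anthropic) each define their own request/response schemas:

```python
import json

# Hypothetical tool registry; the function bodies here are stubs.
TOOLS = {
    "get_weather": lambda city: {"city": city, "temp_c": 21},
}

def dispatch(model_output: str):
    """Parse the model's JSON tool call and invoke the named function."""
    call = json.loads(model_output)
    fn = TOOLS[call["name"]]            # KeyError -> unknown tool, report back to model
    return fn(**call["arguments"])

# Prompted with the tool's schema, the model emits structured JSON like this:
result = dispatch('{"name": "get_weather", "arguments": {"city": "Oslo"}}')
```

The tool's return value is then fed back into the conversation so the model can compose a final answer.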
Gemini. Google DeepMind's frontier model family (2023+). Known for multimodality and long context.
GGUF. Efficient on-disk format for quantized LLMs used by llama.cpp.
GPT (Generative Pre-trained Transformer). OpenAI's decoder-only family. Ancestor of the modern chat era.
Gradient descent. The core optimization algorithm used to train neural networks.
Grounding. Supplying retrieved or structured facts to the model so its output is anchored to reality rather than invented.
Hallucination. Confidently generating false information. The defining failure mode of LLMs.
HNSW. Hierarchical Navigable Small World — a common ANN algorithm used by pgvector, Qdrant, and others.
Hybrid search. Combining vector similarity and keyword search (e.g., BM25) for better retrieval.
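One common way to combine the two result lists is Reciprocal Rank Fusion (RRF), which needs only ranks, not comparable scores. A sketch with invented document IDs:

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: merge ranked lists from different retrievers.
    Each doc scores sum(1 / (k + rank)) over every list it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists from BM25 keyword search and vector search.
bm25 = ["doc3", "doc1", "doc7"]
vector = ["doc1", "doc5", "doc3"]
fused = rrf([bm25, vector])   # docs appearing high in both lists rise to the top
```

The constant k (60 is the value from the original RRF paper) damps the influence of top ranks so a single retriever can't dominate.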
In-context learning. A model "learning" from examples provided in the prompt (few-shot) without any weight updates.
Instruction tuning (SFT). Supervised fine-tuning on (instruction, ideal response) pairs. The first stage of the RLHF pipeline.
Jailbreak. A user's attempt to bypass a model's safety training or policies, typically via crafted prompts.
Knowledge cutoff. The approximate date beyond which the model has no training data.
LangChain / LangGraph. Python (and JS) frameworks for building LLM apps and stateful agent graphs.
LlamaIndex. Python framework specialized for RAG pipelines.
LLM. Large Language Model. The general term for this handbook's subject.
LoRA. Low-Rank Adaptation. Cheap fine-tuning method that trains a small additive delta.
MCP. Model Context Protocol. An open standard for exposing tools, resources, and prompts to LLM clients.
Mixture of Experts (MoE). Architecture where only a subset of parameters is active per token, enabling very large total parameter counts with tractable inference cost.
Multi-agent system. Multiple LLM-powered agents coordinating on a task.
Multimodal. A model that accepts and/or produces more than one data modality (text, image, audio, video).
Next-token prediction. The objective most LLMs are trained on. Everything else (chat, code, reasoning) is emergent from it at scale.
Ollama. Popular CLI + local HTTP server for running open-weights LLMs on your laptop.
Parameter. A trainable weight in the neural network. Modern LLMs have tens of billions to trillions.
PEFT. Parameter-Efficient Fine-Tuning. Umbrella for LoRA, QLoRA, prefix tuning, adapters.
pgvector. PostgreSQL extension for storing and searching vector embeddings.
Prompt engineering. The discipline of writing clear, specific, testable instructions for a model.
Prompt injection. Attack where untrusted content included in the model's context hijacks its behavior.
Quantization. Reducing numeric precision of model weights (e.g., to 4 or 8 bits) to shrink memory and speed inference.
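The core idea, sketched as symmetric round-to-nearest quantization on a handful of made-up weights (production schemes like GPTQ or AWQ are considerably more sophisticated):

```python
def quantize(weights, bits=4):
    """Map floats to integers in [-(2**(bits-1)-1), 2**(bits-1)-1] via one scale."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

weights = [0.12, -0.70, 0.33, 0.05]               # invented weight values
q, scale = quantize(weights)                      # small ints + one float scale
restored = dequantize(q, scale)
# Each restored value is within half a quantization step (scale / 2) of the original.
```

A 4-bit integer plus a shared scale stores a weight in a fraction of the 16 or 32 bits it occupied before, at the cost of small rounding error.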
RAG. Retrieval-Augmented Generation. Retrieve relevant chunks, stuff them in the prompt, generate an answer.
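The whole pattern in skeleton form. Word-overlap scoring stands in for embedding + ANN search here, and the final LLM call is omitted; only the shape of the pipeline is real:

```python
# Toy document store, invented for illustration.
DOCS = [
    "The warranty period for the X200 is 24 months.",
    "Refunds are processed within 5 business days.",
]

def retrieve(query, docs, k=1):
    """Rank docs by word overlap; a real pipeline embeds and runs ANN search."""
    q = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return ranked[:k]

def build_prompt(query, chunks):
    """Stuff retrieved chunks into the prompt ahead of the question."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

query = "How long is the X200 warranty?"
prompt = build_prompt(query, retrieve(query, DOCS))
# `prompt` now goes to the model; its answer is grounded in the retrieved chunk.
```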
ReAct. Prompting pattern alternating Reasoning and Acting. Ancestor of modern agent loops.
Reasoning model. A model trained to generate extensive hidden chain-of-thought before answering (e.g., o1, o3, Claude with extended thinking).
Reranker. Cross-encoder model scoring (query, passage) pairs directly, used after initial retrieval to improve top-k ordering.
RLHF. Reinforcement Learning from Human Feedback. The three-stage alignment pipeline behind ChatGPT and its peers.
Safetensors. A safe, portable format for storing neural network weights; replacement for Python pickle.
Scaling laws. Empirical finding that model loss decreases predictably with compute, data, and parameters.
Self-attention. Attention applied within a single sequence — how tokens attend to each other.
Seq2Seq. Sequence-to-sequence: mapping an input sequence to an output sequence (e.g., translation), classically with an encoder-decoder architecture. Predates Transformers; the original Transformer was itself a seq2seq model.
SGLang. High-throughput LLM serving engine, peer of vLLM.
SOTA. State of the art.
Structured output. Constraining the model to emit output matching a schema (JSON, XML, regex-guided).
System prompt. Instructions sent to the model on each call, usually invisible to end users, that set role, rules, and output format.
Temperature. Sampling knob controlling output randomness. 0 = greedy (near-deterministic); higher = more varied and creative.
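Mechanically, temperature divides the logits before the softmax. A self-contained sketch on made-up logits:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Lower temperature sharpens the distribution; higher flattens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                              # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                         # hypothetical next-token logits
cold = softmax_with_temperature(logits, 0.5)     # mass concentrates on the top token
hot = softmax_with_temperature(logits, 2.0)      # distribution flattens out
```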
Token. A subword unit the model actually processes. ~4 English characters each.
Tokenizer. The algorithm + vocabulary that converts text ↔ token IDs.
Tool use. See function calling.
Top-k / Top-p. Sampling parameters limiting the candidate tokens to the most likely ones.
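Top-k simply keeps the k most likely tokens; top-p (nucleus sampling) keeps the smallest set whose cumulative probability reaches p. A sketch of the top-p cutoff over an invented five-token distribution:

```python
def top_p_filter(probs, p=0.9):
    """Return the indices of the smallest set of tokens whose cumulative
    probability reaches p; sampling then happens within this 'nucleus'."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for idx, prob in ranked:
        kept.append(idx)
        total += prob
        if total >= p:
            break
    return kept

# Hypothetical next-token distribution over a 5-token vocabulary.
probs = [0.5, 0.3, 0.1, 0.06, 0.04]
nucleus = top_p_filter(probs, p=0.85)   # keeps the first three tokens
```

Unlike top-k, the nucleus shrinks when the model is confident and grows when it is uncertain.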
Transformer. The neural-network architecture underlying all modern LLMs. 2017 paper, Attention Is All You Need.
vLLM. High-throughput LLM serving engine with paged attention and continuous batching.
Vector DB. Database with ANN indexes over embedding vectors.
Zero-shot / few-shot. Performing a task with no examples (zero-shot) or a few examples (few-shot) in the prompt.