You don't need to train one. You do need the mental model.
An LLM is, at its core, a function.
f(text_in, knobs) -> text_out
That's it. It is stateless. It doesn't remember anything between calls. It doesn't know what time it is. It doesn't know who you are. It takes tokens in, predicts the next token, and repeats.
Everything you will later build — chatbots, RAG systems, agents, multi-agent platforms — is an engineering elaboration of that one stateless function. Grasp this chapter and every later chapter falls into place.
flowchart LR
A[Raw text
hello world] --> B[Tokenizer
BPE / SentencePiece]
B --> C[Token IDs
15496 995]
C --> D[Embedding lookup
ID -> vector]
D --> E[N Transformer blocks
self-attention + FFN]
E --> F[Final layer norm]
F --> G[LM head
project to vocab]
G --> H[Logits over vocabulary]
H --> I[Softmax + sampling]
I --> J[Next token: !]
J -.feed back.-> A
Every step is worth a section.
An LLM doesn't see characters or words. It sees tokens — subword units produced by a tokenizer, usually byte-pair encoding (BPE) or SentencePiece.
"unbelievable" tokenizes into something like ["un", "believ", "able"]"hello" is one token" hello" (with four spaces) is two tokens: " " and "hello"Rule of thumb: 1 English token ≈ 4 characters ≈ 0.75 words. A 1,000-word email ≈ 1,300 tokens. A 300-page novel ≈ 100,000 tokens.
Why you care:

- Prices are per token. Input and output are priced separately.
- Context windows are measured in tokens.
- Rate limits are in tokens per minute.
- Pathological inputs (random unicode, weird whitespace) can balloon token counts and costs.
A handy habit: before shipping anything, tokenize your worst-case input and check the count. OpenAI's tiktoken and Anthropic's equivalent libraries make this a three-line check.
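For example, with tiktoken (the file name here is a placeholder for your own worst-case input):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")        # tokenizer family used by recent OpenAI models
worst_case = open("worst_case_input.txt").read()  # placeholder: your ugliest real input
print(len(enc.encode(worst_case)), "tokens")
```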
Each token ID becomes a vector of, say, 4,096 real numbers. This is its embedding — a point in high-dimensional space.
Two properties make this useful:

- Distance means similarity: "king" lands near "queen" and "prince", and far from "car".
- Directions capture relationships: king - man + woman ≈ queen.
flowchart LR
subgraph Embedding space
A((king)) --- B((queen))
A --- C((prince))
B --- D((princess))
E((car)) --- F((truck))
E --- G((bicycle))
end
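A quick sketch of the "distance means similarity" idea, using made-up 4-dimensional vectors in place of real embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means 'points the same way'."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 4-dimensional vectors; real embeddings have thousands of dimensions.
king  = np.array([0.90, 0.80, 0.10, 0.05])
queen = np.array([0.85, 0.90, 0.15, 0.05])
car   = np.array([0.05, 0.10, 0.90, 0.80])

print(cosine_similarity(king, queen))  # high: related concepts
print(cosine_similarity(king, car))    # low: unrelated concepts
```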
Inside the model, embeddings evolve layer by layer — they start as crude "what word is this" and end as rich "what does this word mean in this context." Embeddings from the last layer of an encoder model (or the second-to-last layer of a decoder model) are what we use for RAG (Chapter 7).
Self-attention is the part of a Transformer that mixes information between tokens. It's the heart of the architecture, and it's more intuitive than it looks.
For each token, the model computes three vectors:

- Query: what this token is looking for
- Key: what this token offers to others
- Value: the information this token carries

Each token's output is then a weighted sum of every token's Value, where the weights come from a softmax over the dot products between this token's Query and every other token's Key.
flowchart TB
T[Token: 'it'] --> Q[Query vector]
subgraph Context
C1[The cat] --> K1[Key]
C1 --> V1[Value]
C2[sat on the mat] --> K2[Key]
C2 --> V2[Value]
C3[because] --> K3[Key]
C3 --> V3[Value]
end
Q --> D1[Q · K1 -> score]
Q --> D2[Q · K2 -> score]
Q --> D3[Q · K3 -> score]
D1 --> S[Softmax
normalize to weights]
D2 --> S
D3 --> S
S --> W[Weighted sum of V1, V2, V3]
W --> O["New representation of 'it'"]
Multi-head attention runs many such operations in parallel — think of each head as a different "aspect" the token wants to look at (syntax, coreference, semantics, etc.).
Positional encodings (added to the embeddings) are what tell the model that "the cat sat" is different from "sat cat the" — otherwise attention is order-invariant.
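If you want to see the arithmetic once, here is a minimal single-head version in NumPy, with random toy matrices, no masking, and no multiple heads:

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over a toy sequence.

    x: (seq_len, d_model) token embeddings; Wq/Wk/Wv: (d_model, d_head) projections.
    """
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # every Query dotted with every Key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax: scores -> attention weights
    return weights @ V                                 # weighted sum of Values

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8
x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)  # (5, 8): one new representation per token
```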
You don't need to implement this to use LLMs. You do need to know that:

- Attention is how every token in the context can influence every other token, which is why what you put in the context (and where) matters.
- Its cost grows roughly with the square of the sequence length, which is where most context-window limits and long-context costs come from.
After the stack of Transformer blocks, the final output is a vector of logits — one number per vocabulary token. A softmax turns these into probabilities. A sampler picks one. That token is appended to the input. The whole thing runs again.
flowchart LR
A[Context: 'The capital of France is'] --> M[Model]
M --> L[Logits over vocab]
L --> S[Softmax]
S --> P["Probabilities:
Paris 0.87
London 0.02
..."]
P --> X[Sample]
X --> N[Paris]
N -.append.-> A
This simple loop produces paragraphs, code, poetry, and plans. The "intelligence" is entirely emergent from the distribution the model has learned.
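A toy version of that loop, with a random matrix standing in for the real network (greedy decoding, nothing learned; it only shows the shape of the mechanism):

```python
import numpy as np

VOCAB = ["<eos>", "the", "cat", "sat", "on", "mat", "!"]
rng = np.random.default_rng(42)
W = rng.normal(size=(len(VOCAB), len(VOCAB)))    # stand-in for billions of trained parameters

def toy_forward_pass(token_ids):
    """Pretend model: returns logits for the next token given the whole context."""
    context = np.zeros(len(VOCAB))
    context[token_ids] = 1.0
    return W @ context

def generate(prompt_ids, max_tokens=8):
    ids = list(prompt_ids)
    for _ in range(max_tokens):
        logits = toy_forward_pass(ids)           # one full pass per new token
        next_id = int(np.argmax(logits))         # greedy: always take the most likely token
        ids.append(next_id)
        if VOCAB[next_id] == "<eos>":            # stop conditions: EOS or token budget
            break
    return [VOCAB[i] for i in ids]

print(generate([1, 2]))  # starts from "the cat", appends toy tokens one at a time
```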
You control the sampling step, not the model.
| Knob | What it does | Typical values |
|---|---|---|
| temperature | Scales logits before softmax. 0 = argmax (deterministic). Higher = more random. | 0 for code/SQL, 0.3–0.7 for chat, 0.8+ for creative |
| top_p | Nucleus sampling — keep smallest set of tokens whose probabilities sum to ≥ p, sample from those. | 0.9–1.0 typical |
| top_k | Keep the k highest-probability tokens, zero the rest. | 40–100 if used |
| max_tokens | Hard cap on output length. | Always set. Budget is a feature. |
| stop | One or more strings that halt generation. | Useful for structured outputs ("</json>", "Thought:") |
| presence_penalty / frequency_penalty | Discourage repetition. | Usually 0; bump for long-form prose. |
| seed | Makes outputs mostly reproducible given the same input. | Set in tests, not in prod. |
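A minimal sketch of how the two most-used knobs reshape the distribution, on toy logits for a four-token vocabulary (illustrative only, not any provider's actual sampler):

```python
import numpy as np

def sample_next_token(logits, temperature=0.7, top_p=0.9, rng=np.random.default_rng()):
    """Toy sampler: temperature scaling, then nucleus (top_p) filtering, then one draw."""
    scaled = logits / max(temperature, 1e-6)         # near-zero temperature approaches argmax
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()                             # softmax

    order = np.argsort(probs)[::-1]                  # highest-probability tokens first
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, top_p) + 1]  # smallest set with mass >= top_p

    kept = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept))

logits = np.array([4.0, 2.0, 1.0, -1.0])             # toy logits for a 4-token vocabulary
print(sample_next_token(logits, temperature=0.2))    # almost always token 0
print(sample_next_token(logits, temperature=1.5))    # noticeably more varied
```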
The context window is the maximum number of tokens (input + output) the model can handle in one call. It is the single most important capability axis in practice.
flowchart LR
A[GPT-3
2k tokens
3 pages] --> B[GPT-4
8k-32k
~50 pages]
B --> C[Claude 2
100k
a novel]
C --> D[Claude 3.5
200k]
D --> E[Gemini 1.5
1M-2M]
E --> F[Claude 4.7
ultra-long + reliable]
Long context is useful but not magic: you pay for every token you send, latency grows with input size, and models recall details buried in the middle of a very long context less reliably than details near the start or end.
Rule: if a RAG system works, use it. Don't stuff megabytes into context just because you can.
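A cheap guardrail is to budget before you call. Here is a sketch using the same tiktoken counting trick as above; the window size, output budget, and file name are placeholders, so check your model's documented limits:

```python
import tiktoken

CONTEXT_WINDOW = 200_000    # placeholder: use your model's documented limit
MAX_OUTPUT = 1_024          # the output budget you plan to request

enc = tiktoken.get_encoding("cl100k_base")
prompt = open("assembled_prompt.txt").read()    # placeholder for your assembled prompt
prompt_tokens = len(enc.encode(prompt))

if prompt_tokens + MAX_OUTPUT > CONTEXT_WINDOW:
    raise ValueError(f"Prompt is {prompt_tokens} tokens; it won't fit with the output budget.")
```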
This list is the other half of the handbook. Every one of these has a standard solution.
| Limitation | Workaround |
|---|---|
| Hallucinate facts | RAG; tool use; citations; evals |
| Bad at math | Give it a calculator tool or a code sandbox |
| No memory across calls | You carry the history; use a memory system |
| Stale knowledge | Retrieval; fresh fine-tunes; live web search |
| Can't do actions | Function calling; MCP; agent loops |
| Inconsistent outputs | Structured outputs; schemas; retries |
| Opaque reasoning | Chain-of-thought prompting; reasoning models |
| Prompt injection | Defense-in-depth: classifiers, allowlists, scoped tools |
flowchart LR
subgraph yapp[Your app]
H[History]
T[Tools]
K[Knowledge base]
end
H --> P[Prompt assembler]
T --> P
K --> P
P --> M[LLM: stateless function]
M --> O[Output]
O --> H
O --> E[Effects in the world]
An LLM is a stateless function. Your application is the memory, the retrieval, the tools, the guardrails, the logging, the evals, and the humans in the loop. That is what you are building.
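Here is a minimal sketch of "you carry the history," reusing the same SDK and placeholder model name as the call example below. The model only "remembers" Dana because the whole conversation is resent on every call:

```python
import os
from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
history = []  # the memory lives here, in your application, not in the model

def chat(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    resp = client.messages.create(
        model="claude-opus-4-7",          # placeholder name, as in the example below
        max_tokens=512,
        system="You are a helpful assistant. Be concise.",
        messages=history,                 # the full conversation goes out on every call
    )
    reply = resp.content[0].text
    history.append({"role": "assistant", "content": reply})
    return reply

print(chat("My name is Dana."))
print(chat("What's my name?"))  # only answerable because we resent the history
```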
It helps some readers to see generation as a tiny state machine:
stateDiagram-v2
[*] --> ReadPrompt
ReadPrompt --> ForwardPass: tokenize, embed
ForwardPass --> Sample: logits
Sample --> Append: chosen token
Append --> StopCheck
StopCheck --> ForwardPass: not done
StopCheck --> [*]: EOS, max_tokens, or stop sequence
Every token is a separate forward pass through billions of parameters. A 500-token answer is 500 trips through the whole network. This is why inference optimization — KV caches, speculative decoding, batching — matters so much.
Here is about the simplest production-ready Python call you can make. This is the shape of nearly every AI codebase.
import os
from anthropic import Anthropic
client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
resp = client.messages.create(
model="claude-opus-4-7",
max_tokens=1024,
temperature=0.2,
system="You are a senior backend engineer. Be concise.",
messages=[
{"role": "user", "content": "Why would I pick pgvector over Pinecone?"},
],
)
print(resp.content[0].text)
print("tokens:", resp.usage.input_tokens, "->", resp.usage.output_tokens)
Four things to notice:
- The system prompt is where you set role, rules, and output format.
- max_tokens is always set. No "let it run."
- Token usage (input and output) comes back on every response; log it, because that is what you are billed on.
- Spring AI or LangChain4j give you the Java equivalent with a nearly identical shape.