Chapter 3 · Anatomy of an LLM

You don't need to train one. You do need the mental model.


An LLM is, at its core, a function.

f(text_in, knobs) -> text_out

That's it. It is stateless. It doesn't remember anything between calls. It doesn't know what time it is. It doesn't know who you are. It takes tokens in, predicts the next token, and repeats.

Everything you will later build — chatbots, RAG systems, agents, multi-agent platforms — is an engineering elaboration of that one stateless function. Grasp this chapter and every later chapter falls into place.

3.1 The pipeline, end to end

flowchart LR
    A["Raw text<br/>hello world"] --> B["Tokenizer<br/>BPE / SentencePiece"]
    B --> C["Token IDs<br/>15496 995"]
    C --> D["Embedding lookup<br/>ID -> vector"]
    D --> E["N Transformer blocks<br/>self-attention + FFN"]
    E --> F[Final layer norm]
    F --> G["LM head<br/>project to vocab"]
    G --> H[Logits over vocabulary]
    H --> I[Softmax + sampling]
    I --> J["Next token: !"]
    J -.feed back.-> A

Every step is worth a section.

3.2 Tokens — the atoms of the system

An LLM doesn't see characters or words. It sees tokens — subword units produced by a tokenizer, usually byte-pair encoding (BPE) or SentencePiece.

Rule of thumb: 1 English token ≈ 4 characters ≈ 0.75 words. A 1,000-word email ≈ 1,300 tokens. A 300-page novel ≈ 100,000 tokens.

Why you care:

  - Prices are per token. Input and output are priced separately.
  - Context windows are measured in tokens.
  - Rate limits are in tokens per minute.
  - Pathological inputs (random unicode, weird whitespace) can balloon token counts and costs.

A handy habit: before shipping anything, tokenize your worst-case input and check the count. OpenAI's tiktoken and Anthropic's equivalent libraries make this a three-line check.
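A sketch of that habit in plain Python, using the chapter's ~4-characters-per-token rule of thumb; swap in tiktoken (or your provider's tokenizer) for exact counts:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate via the ~4 characters/token rule of thumb.

    For exact counts, use a real tokenizer, e.g.:
        import tiktoken
        len(tiktoken.get_encoding("cl100k_base").encode(text))
    """
    return max(1, round(len(text) / 4))

worst_case = "word " * 1000  # a ~1,000-word input
print(estimate_tokens(worst_case))  # lands near the ~1,300-token rule of thumb
```

Run this against your worst-case input before shipping, not after the invoice arrives.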

3.3 Embeddings — the geometry of meaning

Each token ID becomes a vector of, say, 4,096 real numbers. This is its embedding — a point in high-dimensional space.

Two properties make this useful:

  1. Semantic similarity = geometric proximity. "king" and "queen" are closer to each other than to "bicycle."
  2. Arithmetic has meaning. The classic: king - man + woman ≈ queen.

flowchart LR
    subgraph ES["Embedding space"]
    A((king)) --- B((queen))
    A --- C((prince))
    B --- D((princess))
    E((car)) --- F((truck))
    E --- G((bicycle))
    end

Inside the model, embeddings evolve layer by layer — they start as a crude "which word is this" and end as a rich "what does this word mean in this context." Embeddings from the last layer of an encoder model (or the second-to-last layer of a decoder model) are what we use for RAG (Chapter 7).
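A toy illustration of "similarity = proximity," with hand-picked 3-d vectors standing in for real embeddings (which have thousands of dimensions and are learned, not chosen):

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 = same direction, 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical 3-d vectors, hand-picked so royalty words cluster together.
emb = {
    "king":    [0.9, 0.8, 0.1],
    "queen":   [0.9, 0.7, 0.2],
    "bicycle": [0.1, 0.2, 0.9],
}

print(cosine(emb["king"], emb["queen"]))    # high: close in space
print(cosine(emb["king"], emb["bicycle"]))  # low: far apart
```

The same comparison, run over real embedding vectors, is the core operation of every vector database.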

3.4 Attention — the engine

Self-attention is the part of a Transformer that mixes information between tokens. It's the heart of the architecture, and it's more intuitive than it looks.

For each token, the model computes three vectors:

  1. Query: what this token is looking for.
  2. Key: what this token offers to be matched against.
  3. Value: the information this token passes along if attended to.

Then each token's output is a weighted sum of every token's Value (its own included), where the weights are the softmaxed dot products between this token's Query and every token's Key.

flowchart TB
    T[Token: 'it'] --> Q[Query vector]
    subgraph Context
    C1[The cat] --> K1[Key]
    C1 --> V1[Value]
    C2[sat on the mat] --> K2[Key]
    C2 --> V2[Value]
    C3[because] --> K3[Key]
    C3 --> V3[Value]
    end
    Q --> D1[Q · K1 -> score]
    Q --> D2[Q · K2 -> score]
    Q --> D3[Q · K3 -> score]
    D1 --> S["Softmax<br/>normalize to weights"]
    D2 --> S
    D3 --> S
    S --> W["Weighted sum of V1, V2, V3"]
    W --> O["New representation of 'it'"]

Multi-head attention runs many such operations in parallel — think of each head as a different "aspect" the token wants to look at (syntax, coreference, semantics, etc.).

Positional encodings (added to the embeddings) are what tell the model that "the cat sat" is different from "sat cat the" — otherwise attention is order-invariant.
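The score-softmax-sum recipe fits in a few lines. A toy single-head sketch in 2-d (real models use learned Q/K/V projections, scaling, and many heads in parallel):

```python
import math

def softmax(xs):
    """Numerically stable softmax: exponentiate, then normalize to sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(query, keys, values):
    """One attention step: score = Q·K, softmax the scores, weight the Values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# The query points strongly at the first key, so the output
# lands near the first value.
out = attend(query=[1.0, 0.0],
             keys=[[4.0, 0.0], [0.0, 4.0]],
             values=[[1.0, 1.0], [-1.0, -1.0]])
print(out)
```

Change the query to [0.0, 1.0] and the output flips toward the second value: attention is soft lookup, nothing more.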

You don't need to implement this to use LLMs. You do need to know that:

  - Attention compares every token with every other token, so compute and memory grow roughly quadratically with context length; this is why long context is expensive.
  - The Keys and Values of earlier tokens don't change during generation, so they can be cached (the KV cache) instead of recomputed at every step.
  - Attention over the prompt is the mechanism behind in-context learning: the model "reads" your examples at inference time; nothing is stored afterward.

3.5 Next-token prediction — the only thing it does

After the stack of Transformer blocks, the final output is a vector of logits — one number per vocabulary token. A softmax turns these into probabilities. A sampler picks one. That token is appended to the input. The whole thing runs again.

flowchart LR
    A[Context: 'The capital of France is'] --> M[Model]
    M --> L[Logits over vocab]
    L --> S[Softmax]
    S --> P["Probabilities:<br/>Paris 0.87<br/>London 0.02<br/>..."]
    P --> X[Sample]
    X --> N[Paris]
    N -.append.-> A

This simple loop produces paragraphs, code, poetry, and plans. The "intelligence" is entirely emergent from the distribution the model has learned.

3.6 The sampling knobs

You control the sampling step, not the model.

| Knob | What it does | Typical values |
|---|---|---|
| temperature | Scales logits before softmax. 0 = argmax (deterministic); higher = more random. | 0 for code/SQL, 0.3–0.7 for chat, 0.8+ for creative |
| top_p | Nucleus sampling: keep the smallest set of tokens whose probabilities sum to ≥ p, sample from those. | 0.9–1.0 typical |
| top_k | Keep the k highest-probability tokens, zero out the rest. | 40–100 if used |
| max_tokens | Hard cap on output length. | Always set; budget is a feature |
| stop | One or more strings that halt generation. | Useful for structured outputs ("</json>", "Thought:") |
| presence_penalty / frequency_penalty | Discourage repetition. | Usually 0; bump for long-form prose |
| seed | Makes outputs mostly reproducible given the same input. | Set in tests, not in prod |
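A minimal sketch of how temperature and top_p interact, on a made-up three-token vocabulary (real samplers run over the full vocabulary on the GPU, but the arithmetic is the same):

```python
import math, random

def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    """Temperature scaling followed by nucleus (top_p) sampling.

    logits: {token: raw score}. temperature=0 degenerates to argmax.
    """
    if temperature == 0:
        return max(logits, key=logits.get)
    # Temperature: divide logits before the softmax.
    scaled = {t: l / temperature for t, l in logits.items()}
    m = max(scaled.values())
    exps = {t: math.exp(l - m) for t, l in scaled.items()}
    z = sum(exps.values())
    probs = sorted(((t, e / z) for t, e in exps.items()),
                   key=lambda tp: tp[1], reverse=True)
    # top_p: keep the smallest high-probability prefix whose mass reaches p.
    kept, mass = [], 0.0
    for t, p in probs:
        kept.append((t, p))
        mass += p
        if mass >= top_p:
            break
    # Sample from the renormalized kept set.
    r = rng.random() * mass
    for t, p in kept:
        r -= p
        if r <= 0:
            return t
    return kept[-1][0]

logits = {"Paris": 5.0, "London": 1.0, "Rome": 0.5}
print(sample(logits, temperature=0))  # argmax -> "Paris"
```

At temperature 0 you always get "Paris"; raise it toward 1.5 and "London" starts to appear. That is the entire knob.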

3.7 The context window

The context window is the maximum number of tokens (input + output) the model can handle in one call. It is the single most important capability axis in practice.

flowchart LR
    A["GPT-3<br/>2k tokens<br/>3 pages"] --> B["GPT-4<br/>8k-32k<br/>~50 pages"]
    B --> C["Claude 2<br/>100k<br/>a novel"]
    C --> D["Claude 3.5<br/>200k"]
    D --> E["Gemini 1.5<br/>1M-2M"]
    E --> F["Claude 4.7<br/>ultra-long + reliable"]

Long context is useful but not magic:

  - Recall is uneven: models tend to retrieve facts near the start and end of the window better than those buried in the middle ("lost in the middle").
  - You pay for every input token on every call, so a stuffed context is expensive and slow even when most of it is irrelevant.
  - More context means more distractors; irrelevant passages can pull the model off target.

Rule: if a RAG system works, use it. Don't stuff megabytes into context just because you can.
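One habit worth automating: check that input plus output budget fits the window before every call. A minimal sketch (window sizes here are illustrative):

```python
def fits(context_window: int, input_tokens: int, max_tokens: int) -> bool:
    """Input and the budgeted output must both fit in a single call."""
    return input_tokens + max_tokens <= context_window

print(fits(200_000, 150_000, 4_096))  # room to spare
print(fits(8_192, 7_000, 2_048))      # over budget: trim or retrieve instead
```

Run this check at prompt-assembly time; the alternative is a mid-request error or a silently truncated prompt.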

3.8 What LLMs cannot do (natively)

This list is the other half of the handbook. Every one of these has a standard solution.

| Limitation | Workaround |
|---|---|
| Hallucinates facts | RAG; tool use; citations; evals |
| Bad at math | Give it a calculator tool or a code sandbox |
| No memory across calls | You carry the history; use a memory system |
| Stale knowledge | Retrieval; fresh fine-tunes; live web search |
| Can't take actions | Function calling; MCP; agent loops |
| Inconsistent outputs | Structured outputs; schemas; retries |
| Opaque reasoning | Chain-of-thought prompting; reasoning models |
| Prompt injection | Defense-in-depth: classifiers, allowlists, scoped tools |

3.9 A mental model you can carry

flowchart LR
    subgraph yapp[Your app]
    H[History]
    T[Tools]
    K[Knowledge base]
    end
    H --> P[Prompt assembler]
    T --> P
    K --> P
    P --> M[LLM: stateless function]
    M --> O[Output]
    O --> H
    O --> E[Effects in the world]

An LLM is a stateless function. Your application is the memory, the retrieval, the tools, the guardrails, the logging, the evals, and the humans in the loop. That is what you are building.
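The diagram above, reduced to code. The model call is stubbed out (fake_llm is a stand-in for a real client); the point is that history lives in your application and is resent in full on every call:

```python
def fake_llm(messages):
    """Stand-in for a real API call; a real client would receive
    the same full message list on every call."""
    return f"(reply to {len(messages)} messages)"

history = []  # your app owns this; the model remembers nothing

def chat(user_text):
    history.append({"role": "user", "content": user_text})
    reply = fake_llm(history)  # the entire history, every single call
    history.append({"role": "assistant", "content": reply})
    return reply

chat("Hi")
chat("What did I just say?")
print(len(history))  # 4: two user turns + two assistant turns
```

Every chat product you have used is this loop with better plumbing: the "memory" is a list your app maintains and replays.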

3.10 One picture: the sampling loop as a state machine

It helps some readers to see generation as a tiny state machine:

stateDiagram-v2
    [*] --> ReadPrompt
    ReadPrompt --> ForwardPass: tokenize, embed
    ForwardPass --> Sample: logits
    Sample --> Append: chosen token
    Append --> StopCheck
    StopCheck --> ForwardPass: not done
    StopCheck --> [*]: EOS, max_tokens, or stop sequence

Every token is a separate forward pass through billions of parameters. A 500-token answer is 500 trips through the whole network. This is why inference optimization — KV caches, speculative decoding, batching — matters so much.
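The state machine above as a sketch, with a stubbed forward pass (next_token stands in for a full trip through the network):

```python
def generate(prompt_tokens, next_token, max_tokens, eos="<eos>", stop=()):
    """The sampling loop as code: forward pass -> sample -> append -> stop check.

    next_token is a stand-in for the model's forward pass plus sampling.
    """
    out = []
    while len(out) < max_tokens:                  # StopCheck: max_tokens
        tok = next_token(prompt_tokens + out)     # ForwardPass + Sample
        if tok == eos:                            # StopCheck: EOS
            break
        out.append(tok)                           # Append
        if any(s in "".join(out) for s in stop):  # StopCheck: stop sequence
            break
    return out

# Dummy "model": emits the letters of a fixed answer, then EOS.
answer = list("Paris") + ["<eos>"]
toks = generate(["The", "capital"], lambda ctx: answer[len(ctx) - 2], max_tokens=10)
print("".join(toks))  # Paris
```

Note that next_token receives the full, growing context each iteration: that is the per-token forward pass the paragraph above is pricing out.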

3.11 Code: a minimal end-to-end call

Here is about as simple as a production-ready Python call gets. This is the shape of nearly every AI codebase.

import os
from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    temperature=0.2,
    system="You are a senior backend engineer. Be concise.",
    messages=[
        {"role": "user", "content": "Why would I pick pgvector over Pinecone?"},
    ],
)

print(resp.content[0].text)
print("tokens:", resp.usage.input_tokens, "->", resp.usage.output_tokens)

Four things to notice:

  1. The system prompt is where you set role, rules, and output format.
  2. max_tokens is always set. No "let it run."
  3. Usage is tracked — you want cost observability from day one.
  4. The client is stateless. You pass the entire history every call.

Spring AI or LangChain4j give you the Java equivalent with a nearly identical shape.

Further reading & watching