You don't need to train one. You do need the mental model.
An LLM is, at its core, a function.
f(text_in, knobs) -> text_out
That's it. It is stateless. It doesn't remember anything between calls. It doesn't know what time it is. It doesn't know who you are. It takes tokens in, predicts the next token, and repeats.
Everything you will later build — chatbots, RAG systems, agents, multi-agent platforms — is an engineering elaboration of that one stateless function. Grasp this chapter and every later chapter falls into place.
flowchart LR
A[Raw text
hello world] --> B[Tokenizer
BPE / SentencePiece]
B --> C[Token IDs
15496 995]
C --> D[Embedding lookup
ID -> vector]
D --> E[N Transformer blocks
self-attention + FFN]
E --> F[Final layer norm]
F --> G[LM head
project to vocab]
G --> H[Logits over vocabulary]
H --> I[Softmax + sampling]
I --> J[Next token: !]
J -.feed back.-> A
Every step is worth a section.
An LLM doesn't see characters or words. It sees tokens — subword units produced by a tokenizer, usually byte-pair encoding (BPE) or SentencePiece.
"unbelievable" tokenizes into something like ["un", "believ", "able"]"hello" is one token" hello" (with four spaces) is two tokens: " " and "hello"Rule of thumb: 1 English token ≈ 4 characters ≈ 0.75 words. A 1,000-word email ≈ 1,300 tokens. A 300-page novel ≈ 100,000 tokens.
Why you care:

- Prices are per token. Input and output are priced separately.
- Context windows are measured in tokens.
- Rate limits are in tokens per minute.
- Pathological inputs (random unicode, weird whitespace) can balloon token counts and costs.
A handy habit: before shipping anything, tokenize your worst-case input and check the count. OpenAI's tiktoken and Anthropic's equivalent libraries make this a three-line check.
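For example, with tiktoken (the file name here is a placeholder for your own worst-case input):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")        # tokenizer family used by recent OpenAI models
worst_case = open("worst_case_input.txt").read()  # placeholder: your ugliest real input
print(len(enc.encode(worst_case)), "tokens")
```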
Each token ID becomes a vector of, say, 4,096 real numbers. This is its embedding — a point in high-dimensional space.
Two properties make this useful:

- Distance means similarity: "king" lands near "queen" and "prince", and far from "car".
- Directions capture relationships: king - man + woman ≈ queen.
flowchart LR
subgraph Embedding space
A((king)) --- B((queen))
A --- C((prince))
B --- D((princess))
E((car)) --- F((truck))
E --- G((bicycle))
end
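A quick sketch of the "distance means similarity" idea, using made-up 4-dimensional vectors in place of real embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means 'points the same way'."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 4-dimensional vectors; real embeddings have thousands of dimensions.
king  = np.array([0.90, 0.80, 0.10, 0.05])
queen = np.array([0.85, 0.90, 0.15, 0.05])
car   = np.array([0.05, 0.10, 0.90, 0.80])

print(cosine_similarity(king, queen))  # high: related concepts
print(cosine_similarity(king, car))    # low: unrelated concepts
```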
Inside the model, embeddings evolve layer by layer — they start as crude "what word is this" and end as rich "what does this word mean in this context." Embeddings from the last layer of an encoder model (or the second-to-last layer of a decoder model) are what we use for RAG (Chapter 7).
Self-attention is the part of a Transformer that mixes information between tokens. It's the heart of the architecture, and it's more intuitive than it looks.
For each token, the model computes three vectors:

- Query: what this token is looking for
- Key: what this token offers to others
- Value: the information this token carries

Each token's output is then a weighted sum of every token's Value, where the weights come from a softmax over the dot products between this token's Query and every other token's Key.
flowchart TB
T[Token: 'it'] --> Q[Query vector]
subgraph Context
C1[The cat] --> K1[Key]
C1 --> V1[Value]
C2[sat on the mat] --> K2[Key]
C2 --> V2[Value]
C3[because] --> K3[Key]
C3 --> V3[Value]
end
Q --> D1[Q · K1 -> score]
Q --> D2[Q · K2 -> score]
Q --> D3[Q · K3 -> score]
D1 --> S[Softmax
normalize to weights]
D2 --> S
D3 --> S
S --> W[Weighted sum of V1, V2, V3]
W --> O["New representation of 'it'"]
Multi-head attention runs many such operations in parallel — think of each head as a different "aspect" the token wants to look at (syntax, coreference, semantics, etc.).
Positional encodings (added to the embeddings) are what tell the model that "the cat sat" is different from "sat cat the" — otherwise attention is order-invariant.
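If you want to see the arithmetic once, here is a minimal single-head version in NumPy, with random toy matrices, no masking, and no multiple heads:

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over a toy sequence.

    x: (seq_len, d_model) token embeddings; Wq/Wk/Wv: (d_model, d_head) projections.
    """
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # every Query dotted with every Key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax: scores -> attention weights
    return weights @ V                                 # weighted sum of Values

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8
x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)  # (5, 8): one new representation per token
```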
You don't need to implement this to use LLMs. You do need to know that:

- Attention is how every token in the context can influence every other token, which is why what you put in the context (and where) matters.
- Its cost grows roughly with the square of the sequence length, which is where most context-window limits and long-context costs come from.
After the stack of Transformer blocks, the final output is a vector of logits — one number per vocabulary token. A softmax turns these into probabilities. A sampler picks one. That token is appended to the input. The whole thing runs again.
flowchart LR
A[Context: 'The capital of France is'] --> M[Model]
M --> L[Logits over vocab]
L --> S[Softmax]
S --> P["Probabilities:
Paris 0.87
London 0.02
..."]
P --> X[Sample]
X --> N[Paris]
N -.append.-> A
This simple loop produces paragraphs, code, poetry, and plans. The "intelligence" is entirely emergent from the distribution the model has learned.
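A toy version of that loop, with a random matrix standing in for the real network (greedy decoding, nothing learned; it only shows the shape of the mechanism):

```python
import numpy as np

VOCAB = ["<eos>", "the", "cat", "sat", "on", "mat", "!"]
rng = np.random.default_rng(42)
W = rng.normal(size=(len(VOCAB), len(VOCAB)))    # stand-in for billions of trained parameters

def toy_forward_pass(token_ids):
    """Pretend model: returns logits for the next token given the whole context."""
    context = np.zeros(len(VOCAB))
    context[token_ids] = 1.0
    return W @ context

def generate(prompt_ids, max_tokens=8):
    ids = list(prompt_ids)
    for _ in range(max_tokens):
        logits = toy_forward_pass(ids)           # one full pass per new token
        next_id = int(np.argmax(logits))         # greedy: always take the most likely token
        ids.append(next_id)
        if VOCAB[next_id] == "<eos>":            # stop conditions: EOS or token budget
            break
    return [VOCAB[i] for i in ids]

print(generate([1, 2]))  # starts from "the cat", appends toy tokens one at a time
```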
You control the sampling step, not the model.
| Knob | What it does | Typical values |
|---|---|---|
| temperature | Scales logits before softmax. 0 = argmax (deterministic). Higher = more random. | 0 for code/SQL, 0.3–0.7 for chat, 0.8+ for creative |
| top_p | Nucleus sampling — keep smallest set of tokens whose probabilities sum to ≥ p, sample from those. | 0.9–1.0 typical |
| top_k | Keep the k highest-probability tokens, zero the rest. | 40–100 if used |
| max_tokens | Hard cap on output length. | Always set. Budget is a feature. |
| stop | One or more strings that halt generation. | Useful for structured outputs ("</json>", "Thought:") |
| presence_penalty / frequency_penalty | Discourage repetition. | Usually 0; bump for long-form prose. |
| seed | Makes outputs mostly reproducible given the same input. | Set in tests, not in prod. |
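A minimal sketch of how the two most-used knobs reshape the distribution, on toy logits for a four-token vocabulary (illustrative only, not any provider's actual sampler):

```python
import numpy as np

def sample_next_token(logits, temperature=0.7, top_p=0.9, rng=np.random.default_rng()):
    """Toy sampler: temperature scaling, then nucleus (top_p) filtering, then one draw."""
    scaled = logits / max(temperature, 1e-6)         # near-zero temperature approaches argmax
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()                             # softmax

    order = np.argsort(probs)[::-1]                  # highest-probability tokens first
    cumulative = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cumulative, top_p) + 1]  # smallest set with mass >= top_p

    kept = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept))

logits = np.array([4.0, 2.0, 1.0, -1.0])             # toy logits for a 4-token vocabulary
print(sample_next_token(logits, temperature=0.2))    # almost always token 0
print(sample_next_token(logits, temperature=1.5))    # noticeably more varied
```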
The context window is the maximum number of tokens (input + output) the model can handle in one call. It is the single most important capability axis in practice.
flowchart LR
A[GPT-3
2k tokens
3 pages] --> B[GPT-4
8k-32k
~50 pages]
B --> C[Claude 2
100k
a novel]
C --> D[Claude 3.5
200k]
D --> E[Gemini 1.5
1M-2M]
E --> F[Claude 4.7
ultra-long + reliable]
Long context is useful but not magic: you pay for every token you send, latency grows with input size, and models recall details buried in the middle of a very long context less reliably than details near the start or end.
Rule: if a RAG system works, use it. Don't stuff megabytes into context just because you can.
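A cheap guardrail is to budget before you call. Here is a sketch using the same tiktoken counting trick as above; the window size, output budget, and file name are placeholders, so check your model's documented limits:

```python
import tiktoken

CONTEXT_WINDOW = 200_000    # placeholder: use your model's documented limit
MAX_OUTPUT = 1_024          # the output budget you plan to request

enc = tiktoken.get_encoding("cl100k_base")
prompt = open("assembled_prompt.txt").read()    # placeholder for your assembled prompt
prompt_tokens = len(enc.encode(prompt))

if prompt_tokens + MAX_OUTPUT > CONTEXT_WINDOW:
    raise ValueError(f"Prompt is {prompt_tokens} tokens; it won't fit with the output budget.")
```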
This list is the other half of the handbook. Every one of these has a standard solution.
| Limitation | Workaround |
|---|---|
| Hallucinate facts | RAG; tool use; citations; evals |
| Bad at math | Give it a calculator tool or a code sandbox |
| No memory across calls | You carry the history; use a memory system |
| Stale knowledge | Retrieval; fresh fine-tunes; live web search |
| Can't do actions | Function calling; MCP; agent loops |
| Inconsistent outputs | Structured outputs; schemas; retries |
| Opaque reasoning | Chain-of-thought prompting; reasoning models |
| Prompt injection | Defense-in-depth: classifiers, allowlists, scoped tools |
flowchart LR
subgraph yapp[Your app]
H[History]
T[Tools]
K[Knowledge base]
end
H --> P[Prompt assembler]
T --> P
K --> P
P --> M[LLM: stateless function]
M --> O[Output]
O --> H
O --> E[Effects in the world]
An LLM is a stateless function. Your application is the memory, the retrieval, the tools, the guardrails, the logging, the evals, and the humans in the loop. That is what you are building.
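Here is a minimal sketch of "you carry the history," reusing the same SDK and placeholder model name as the call example below. The model only "remembers" Dana because the whole conversation is resent on every call:

```python
import os
from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
history = []  # the memory lives here, in your application, not in the model

def chat(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    resp = client.messages.create(
        model="claude-opus-4-7",          # placeholder name, as in the example below
        max_tokens=512,
        system="You are a helpful assistant. Be concise.",
        messages=history,                 # the full conversation goes out on every call
    )
    reply = resp.content[0].text
    history.append({"role": "assistant", "content": reply})
    return reply

print(chat("My name is Dana."))
print(chat("What's my name?"))  # only answerable because we resent the history
```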
It helps some readers to see generation as a tiny state machine:
stateDiagram-v2
[*] --> ReadPrompt
ReadPrompt --> ForwardPass: tokenize, embed
ForwardPass --> Sample: logits
Sample --> Append: chosen token
Append --> StopCheck
StopCheck --> ForwardPass: not done
StopCheck --> [*]: EOS, max_tokens, or stop sequence
Every token is a separate forward pass through billions of parameters. A 500-token answer is 500 trips through the whole network. This is why inference optimization — KV caches, speculative decoding, batching — matters so much.
Here is about the simplest production-ready Python call you can make. This is the shape of nearly every AI codebase.
import os
from anthropic import Anthropic
client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
resp = client.messages.create(
model="claude-opus-4-7",
max_tokens=1024,
temperature=0.2,
system="You are a senior backend engineer. Be concise.",
messages=[
{"role": "user", "content": "Why would I pick pgvector over Pinecone?"},
],
)
print(resp.content[0].text)
print("tokens:", resp.usage.input_tokens, "->", resp.usage.output_tokens)
Four things to notice:
- The system prompt is where you set role, rules, and output format.
- max_tokens is always set. No "let it run."
- Token usage (input and output) comes back on every response; log it, because that is what you are billed on.
- Spring AI or LangChain4j give you the Java equivalent with a nearly identical shape.