# Multimodality, reasoning models, and the five-lab skyline
"Compute is the new oil. We are in the middle of a gold rush, and the picks and shovels are GPUs." — a Silicon Valley VC, circa 2023
The 104 days between ChatGPT's launch and GPT-4's release were the most intense research-to-product compression in tech history. Then it got faster.
```mermaid
quadrantChart
    title Frontier labs: openness vs capability, April 2026
    x-axis Closed weights --> Open weights
    y-axis Niche / Small --> Flagship / Large
    quadrant-1 Open frontier
    quadrant-2 Closed frontier
    quadrant-3 Closed niche
    quadrant-4 Open niche
    OpenAI GPT-5: [0.10, 0.95]
    Anthropic Opus 4.7: [0.12, 0.93]
    Google Gemini 3: [0.15, 0.90]
    Meta LLaMA 4: [0.85, 0.80]
    Mistral Large: [0.70, 0.65]
    DeepSeek V3-R1: [0.88, 0.75]
    Qwen 3: [0.80, 0.78]
    xAI Grok 4: [0.35, 0.70]
```
OpenAI released GPT-4 with a roughly 100-page technical report that pointedly declined to disclose the architecture, parameter count, or training data. What it did disclose was benchmarks.
The underlying dynamic had changed: GPT-4 wasn't an incremental improvement; it was good enough to automate real work. Legal associates, junior analysts, tutors, translators — every white-collar task suddenly had a GPT-4 question mark hanging over it.
```mermaid
flowchart LR
    A["GPT-3.5<br/>bar exam: bottom 10%"] --> B["GPT-4<br/>bar exam: top 10%"]
    B --> C["GPT-4 Turbo<br/>128k ctx, cheaper"]
    C --> D["GPT-4o<br/>text+image+audio native"]
    D --> E["o1 / o3<br/>reasoning models"]
    E --> F["GPT-5<br/>2025 frontier"]
```
For the next three years, five labs traded the crown every few months:
| Lab | Flagships (2023–2026) | Superpower |
|---|---|---|
| OpenAI | GPT-4 → 4 Turbo → 4o → o1 → o3 → GPT-5 | Product velocity, multimodality, first reasoning models |
| Anthropic | Claude 2 → 3 → 3.5 → 4 → 4.5 → Opus 4.7 | Safety, long context, coding, agentic reliability |
| Google DeepMind | Bard → Gemini 1 → 1.5 → 2 → 3 | Ultra-long context (1M+), deep product integration |
| Meta | LLaMA → 2 → 3 → 4 | Open weights, community catalyst |
| xAI / Mistral / DeepSeek | Grok, Mistral Large, DeepSeek V3 / R1 | Efficient architectures, open reasoning |
```mermaid
timeline
    title Frontier model releases
    2023 Mar : GPT-4
    2023 Jul : Claude 2 (100k ctx)
    2023 Dec : Gemini 1.0 Ultra
    2024 Feb : Gemini 1.5 (1M ctx)
    2024 Mar : Claude 3 Opus
    2024 May : GPT-4o (omni)
    2024 Jun : Claude 3.5 Sonnet
    2024 Sep : o1 (first reasoning model)
    2024 Dec : o3 preview, Gemini 2
    2025 Feb : Claude 3.7 (extended thinking)
    2025 May : Claude 4 family
    2025 Q3 : GPT-5
    2025 Q4 : Claude 4.5, Opus 4.6, Gemini 3
    2026 Q2 : Claude Opus 4.7
```
By late 2024, the model and the input modality had decoupled. Any frontier model could consume and produce across all of these:
```mermaid
flowchart TB
    subgraph Inputs
        I1[Text]
        I2[Image]
        I3[Audio]
        I4[Video]
        I5[PDF/docs]
    end
    subgraph Model["Frontier model (2026)"]
        M[Multimodal Transformer]
    end
    subgraph Outputs
        O1[Text]
        O2[Structured data]
        O3[Audio / speech]
        O4[Image generation]
        O5[Tool calls]
    end
    I1 --> M
    I2 --> M
    I3 --> M
    I4 --> M
    I5 --> M
    M --> O1
    M --> O2
    M --> O3
    M --> O4
    M --> O5
```
For backend engineers, this had a concrete consequence: you stopped building separate OCR / vision / ASR pipelines. One API call could take a photo of a receipt and return structured JSON.
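The receipt example can be sketched as a single request. This is a minimal sketch using the OpenAI-style chat-completions payload shape; the model name, schema hint, and prompt wording are all illustrative, not prescriptive:

```python
import base64

def receipt_to_messages(image_bytes: bytes, schema_hint: str) -> list:
    """Build an OpenAI-style chat message list that sends a receipt
    photo and an extraction instruction in one request."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return [{
        "role": "user",
        "content": [
            {"type": "text",
             "text": f"Extract this receipt as JSON matching: {schema_hint}"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }]

# The actual call (requires a configured client and API key;
# "gpt-4o" is just an example model):
# resp = client.chat.completions.create(
#     model="gpt-4o",
#     messages=receipt_to_messages(photo, '{"merchant": str, "total": float}'),
#     response_format={"type": "json_object"},
# )
```

The point is architectural: the vision pipeline, the OCR step, and the structured-output parser collapse into one message payload.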
The next paradigm shift came in September 2024 when OpenAI released o1. Instead of answering immediately, o1 would generate thousands of hidden "thinking" tokens — chains of reasoning, self-critique, retries — before producing the final answer.
On math and competitive-programming benchmarks, o1 didn't just beat GPT-4; it won by margins that suggested a separate paradigm. The loop, roughly:
```mermaid
flowchart LR
    A[Question] --> B["Hidden reasoning<br/>thousands of tokens"]
    B --> C[Self-check]
    C --> D[Retry / branch]
    D --> B
    C --> E["Final answer<br/>concise"]
```
By 2025, every major lab had a reasoning mode: o3, Claude with extended thinking, Gemini 2 Thinking, DeepSeek R1 (fully open). The trade-off: latency goes from seconds to minutes, cost goes up 5–20×. For hard problems, it's worth it; for routine chat, you'd skip it.
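That trade-off is a routing decision you make per request. A minimal sketch, assuming a keyword heuristic and made-up model names and cost figures (a real system would use a classifier or a user-facing "thinking" toggle):

```python
from dataclasses import dataclass

@dataclass
class Route:
    model: str
    cost_multiplier: float  # relative to the fast model; illustrative
    latency: str

def pick_model(task: str,
               hard_keywords=("prove", "optimize", "debug")) -> Route:
    # Crude difficulty heuristic: hard-looking tasks go to the slow,
    # expensive reasoning model; everything else stays fast and cheap.
    if any(k in task.lower() for k in hard_keywords):
        return Route("reasoning-model", cost_multiplier=10.0,
                     latency="tens of seconds to minutes")
    return Route("fast-model", cost_multiplier=1.0,
                 latency="sub-second to seconds")
```

The names `reasoning-model` and `fast-model` are placeholders; the structure, not the heuristic, is the takeaway.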
A less glamorous but equally important axis was context length.
```mermaid
flowchart LR
    A["GPT-3<br/>2k tokens"] --> B["GPT-4<br/>8k-32k"]
    B --> C["Claude 2<br/>100k"]
    C --> D["GPT-4 Turbo<br/>128k"]
    D --> E["Claude 3.5<br/>200k"]
    E --> F["Gemini 1.5 Pro<br/>1M-2M"]
    F --> G["Claude 4.x<br/>ultra-long + reliable"]
```
With 1M tokens, you can drop an entire codebase, a long book, or hours of meeting transcripts into a single prompt.
This partially threatens RAG (Chapter 7) — why retrieve if you can fit everything? — but cost, latency, and reliable retrieval inside long contexts mean RAG is still the right default for most real workloads in 2026.
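The "stuff it all in or retrieve?" decision can be made mechanically. A minimal sketch, assuming the common ~4-characters-per-token rule of thumb and an illustrative 50% budget threshold (use the model's real tokenizer in production):

```python
def rough_token_count(text: str) -> int:
    # Rule of thumb for English text: ~4 characters per token.
    # Replace with the model's actual tokenizer for real decisions.
    return len(text) // 4

def should_stuff_context(docs: list[str], context_window: int,
                         budget_fraction: float = 0.5) -> bool:
    """Put everything in the prompt only if it fits comfortably,
    leaving room for the answer; otherwise fall back to retrieval."""
    total = sum(rough_token_count(d) for d in docs)
    return total <= context_window * budget_fraction
```

A corpus that fits in half a 1M-token window can skip retrieval entirely; anything larger, or anything latency- or cost-sensitive, still wants RAG.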
Less sexy but more consequential: models got 100× cheaper between 2023 and 2026 at constant capability.
```mermaid
flowchart LR
    A["GPT-4, Mar 2023<br/>$30 / $60 per 1M tokens in/out"] --> B["GPT-4 Turbo, Nov 2023<br/>$10 / $30"]
    B --> C["GPT-4o, May 2024<br/>$5 / $15"]
    C --> D["GPT-4o mini<br/>$0.15 / $0.60"]
    D --> E["Haiku / Flash, 2026<br/>pennies per million"]
```
The macro story: inference cost at constant capability dropped roughly 10× per year. Any workflow that was "too expensive" in 2023 is a trivial line item in 2026. Plan for this: assume the model you choose today will be commodity-priced within twelve months.
In plain English: inference (the cost of using the model) is collapsing while training cost is exploding. Net effect: only the labs that win the training race get to set the frontier, but everyone else gets to use near-frontier models for almost nothing. Plan your products on the assumption that the smart bit is free in two years.
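The cost arithmetic is worth internalizing. A short worked example using the per-million-token prices from the chart above (a 2k-token-in / 500-token-out request; the token counts are illustrative):

```python
def request_cost(tokens_in: int, tokens_out: int,
                 price_in_per_m: float, price_out_per_m: float) -> float:
    """Dollar cost of one request, given prices per 1M tokens."""
    return (tokens_in * price_in_per_m
            + tokens_out * price_out_per_m) / 1_000_000

# Same request, March-2023 GPT-4 prices vs a small 2024-class model:
old = request_cost(2000, 500, 30.0, 60.0)    # 0.09   -> 9 cents
new = request_cost(2000, 500, 0.15, 0.60)    # 0.0006 -> 0.06 cents
```

Nine cents per request is a real budget line at scale; six hundredths of a cent is not. That 150× gap is what turned "too expensive" workflows into trivial ones.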
Benchmarks drove the race, and then Goodhart's Law took over: every benchmark got saturated or gamed. Trust the most-watched leaderboards for direction, not truth, and write your own evals (Chapter 6) for anything you actually ship.
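An in-repo eval does not need a framework to start paying off. A minimal sketch of the idea; `call_model` is a placeholder stub standing in for your real inference call, and the single case is illustrative:

```python
def call_model(prompt: str) -> str:
    # Placeholder: swap in your actual API call. The stub returns a
    # fixed answer so the harness itself is runnable and testable.
    return "PARIS"

# Your own cases, drawn from real traffic and real failures.
CASES = [
    ("Capital of France? Answer with one word, uppercase.", "PARIS"),
]

def run_evals() -> float:
    """Fraction of cases passed; gate deploys on this number."""
    passed = sum(call_model(p).strip() == want for p, want in CASES)
    return passed / len(CASES)
```

Run it on every model swap and every prompt change; a dozen cases you wrote beats any leaderboard position for predicting how the model behaves on your workload.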
As a backend engineer, you do not have to track every release. You have to track three things:
The correct defaults for most production work in 2026 look like: