Chapter 9 · From Chatbots to Agents

A chatbot answers. An agent acts.


The leap from "a model that answers questions" to "a model that accomplishes tasks" is, arguably, the defining arc of 2023–2026. It is also where most production AI engineering effort now goes.

An agent is, in its simplest form, an LLM in a loop with tools and a goal. Nothing more.

In plain English. A chatbot is a pen pal. An agent is an intern with a credit card and a laptop.

The agent inside, as a state machine

stateDiagram-v2
    [*] --> Goal
    Goal --> Plan: read goal + memory
    Plan --> Act: choose tool
    Act --> Observe: execute
    Observe --> Reflect: read result
    Reflect --> Plan: not done
    Reflect --> Verify: looks done
    Verify --> Plan: failed checks
    Verify --> [*]: passed
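
To make the diagram concrete, here is a minimal sketch of the same state machine as a transition table. The state and event names are copied from the diagram; the `TRANSITIONS` table, the `step` helper, and the `"Done"` label (standing in for the diagram's final `[*]`) are illustrative assumptions, not part of any framework.

```python
# Transition table mirroring the state diagram: (state, event) -> next state.
# "Done" is a hypothetical name for the diagram's terminal [*] state.
TRANSITIONS = {
    ("Goal", "read goal + memory"): "Plan",
    ("Plan", "choose tool"): "Act",
    ("Act", "execute"): "Observe",
    ("Observe", "read result"): "Reflect",
    ("Reflect", "not done"): "Plan",
    ("Reflect", "looks done"): "Verify",
    ("Verify", "failed checks"): "Plan",
    ("Verify", "passed"): "Done",
}


def step(state: str, event: str) -> str:
    """Return the next state, or raise if the diagram has no such edge."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state!r} on {event!r}") from None


# Walk the shortest successful path through the diagram.
state = "Goal"
for event in ["read goal + memory", "choose tool", "execute",
              "read result", "looks done", "passed"]:
    state = step(state, event)
```

In a real agent the events would come from the model and the harness (for example, a verifier emitting "failed checks"), but the table makes it easy to see which moves are legal.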

Real systems add three things to that loop: memory (persistent state across turns), safety checks (don't email the customer, don't drop the table), and budgets (no more than N tool calls, no more than M dollars).
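
As a rough illustration of the budget idea, the sketch below counts tool calls and dollars in ordinary code that sits outside the model. The `Budget` and `BudgetExceeded` names, and the per-call cost figures, are assumptions made up for this example.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised by the harness (not the model) when a limit is crossed."""


@dataclass
class Budget:
    max_tool_calls: int      # the "N" in "no more than N tool calls"
    max_dollars: float       # the "M" in "no more than M dollars"
    tool_calls: int = 0
    dollars: float = 0.0

    def charge(self, dollars: float = 0.0) -> None:
        """Record one tool call and its cost, then enforce both limits."""
        self.tool_calls += 1
        self.dollars += dollars
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded(
                f"tool calls: {self.tool_calls} > {self.max_tool_calls}")
        if self.dollars > self.max_dollars:
            raise BudgetExceeded(
                f"spend: ${self.dollars:.2f} > ${self.max_dollars:.2f}")


# Usage: call charge() once per tool call, before or after executing it.
budget = Budget(max_tool_calls=2, max_dollars=1.00)
budget.charge(0.10)
budget.charge(0.10)
# A third charge() would raise BudgetExceeded and stop the loop.
```

The point of the design is that the check lives in the harness: the model cannot talk its way past a limit it never gets to evaluate.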

flowchart LR
    G[Goal] --> P[Plan / think]
    P --> A[Act: call a tool]
    A --> O[Observe result]
    O --> D{Done?}
    D -- no --> P
    D -- yes --> R[Return result]

That loop is old. It echoes Boyd's OODA loop (Observe, Orient, Decide, Act), the feedback loop of classical control theory, and the agent-environment loop of reinforcement learning. The Transformer-era twist is that the "Decide" node can be an LLM reading natural language and writing natural language.

9.1 The paper that started it all: ReAct (2022)

Yao et al.'s ReAct paper (Oct 2022) showed that interleaving reasoning and action outperformed both pure reasoning (CoT) and pure action (tool use). The model alternates:

Thought: I need the current price of AAPL.
Action: get_quote("AAPL")
Observation: 198.42
Thought: The user asked in euros. I should convert.
Action: get_fx_rate("USD", "EUR")
Observation: 0.92
Thought: 198.42 * 0.92 = 182.55
Final Answer: AAPL is trading at roughly €182.55.

This four-part format (Thought, Action, Observation, Final Answer) is the ancestor of every agent framework that followed. Modern agents dress it up (JSON instead of text, parallel tool calls, typed schemas) but the core is unchanged.
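
To show what the harness does between an Action line and the next Observation line, here is a small text-parsing sketch in the early string-parsed style. The tool functions are hypothetical stubs that return the values from the trace above, and the regular expression only handles double-quoted string arguments; it is an illustration, not the ReAct authors' code.

```python
import re


# Hypothetical stub tools returning the numbers used in the trace above.
def get_quote(symbol: str) -> float:
    return {"AAPL": 198.42}[symbol]


def get_fx_rate(base: str, quote: str) -> float:
    return {("USD", "EUR"): 0.92}[(base, quote)]


TOOLS = {"get_quote": get_quote, "get_fx_rate": get_fx_rate}

# Matches lines such as: Action: get_quote("AAPL")
ACTION_RE = re.compile(r'^Action:\s*(\w+)\((.*)\)\s*$')


def run_action(line: str) -> str:
    """Parse one Action line, run the tool, and return an Observation line."""
    match = ACTION_RE.match(line)
    if match is None:
        return "Observation: error: could not parse action"
    name, raw_args = match.groups()
    if name not in TOOLS:
        return f"Observation: error: unknown tool {name}"
    # Only double-quoted string arguments are supported in this sketch.
    args = re.findall(r'"([^"]*)"', raw_args)
    try:
        return f"Observation: {TOOLS[name](*args)}"
    except (KeyError, TypeError) as exc:
        return f"Observation: error: {type(exc).__name__}"


print(run_action('Action: get_quote("AAPL")'))
print(run_action('Action: get_fx_rate("USD", "EUR")'))
```

The harness appends each Observation line to the prompt and asks the model to continue. The regular expression is exactly the kind of brittle glue that typed JSON tool calls later replaced.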

9.2 The 2023 Cambrian explosion

By spring 2023, agent demos were everywhere.

They were chaotic because:

  1. Context windows were tiny — agents forgot the goal.
  2. Tool use was string-parsed — fragile.
  3. No evals — no way to iterate.
  4. Loops without budgets — agents burned dollars chasing their tails.

But they proved the shape.

9.3 What modern agents actually look like (2024–2026)

flowchart TB
    subgraph Runtime
    L[LLM: reasoning core]
    ST[Short-term state<br/>conversation + scratch]
    MEM[Long-term memory<br/>vector + summaries]
    end
    subgraph Tools
    T1[Retrieval]
    T2[Code execution]
    T3[HTTP / API]
    T4[DB query]
    T5[Filesystem]
    T6[Browser]
    T7[Domain tools]
    end
    G[Goal] --> L
    L <--> ST
    L <--> MEM
    L --> T1
    L --> T2
    L --> T3
    L --> T4
    L --> T5
    L --> T6
    L --> T7
    T1 --> L
    T2 --> L
    T3 --> L
    T4 --> L
    T5 --> L
    T6 --> L
    T7 --> L
    L --> R[Result]

Common patterns:

9.4 Agents in the wild

Categories that matter in 2026:

flowchart LR
    subgraph Coding
    A1[Claude Code]
    A2[Cursor agent]
    A3[Devin]
    end
    subgraph Computer-use
    B1[Claude CU]
    B2[OpenAI Operator]
    B3[Project Mariner]
    end
    subgraph Research
    C1[Gemini DR]
    C2[ChatGPT DR]
    C3[Perplexity]
    end
    subgraph Ops
    D1[Resolve AI]
    D2[Runbooks]
    end
    subgraph Business
    E1[Sierra]
    E2[Decagon]
    E3[11x]
    end

9.5 The "just build an agent" temptation

Don't. The Anthropic guidance (Building Effective Agents, 2024) is the single best piece of advice here:

Use the simplest thing that works.

Often you don't need an agent — you need a well-structured workflow with an LLM call or two. Full agentic autonomy (the model decides the number of steps and the order) is the most expensive, least predictable, and hardest to debug pattern. Use it only when:

When a deterministic workflow suffices, write the workflow. When an LLM routes between workflows, call it a "router." Reserve "agent" for the cases where the model really does need to make decisions inside a loop.

flowchart TD
    A[Task] --> B{Bounded steps<br/>enumerable?}
    B -- yes --> C[Workflow<br/>with LLM calls]
    B -- no --> D{Safe to explore?}
    D -- yes --> E[Agent]
    D -- no --> F[Workflow with<br/>human checkpoints]
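
The decision tree is small enough to write down as a function, which can be handy as a design-review checklist. This is only a sketch: `choose_pattern` and its two boolean parameters are names invented here, and answering the two questions honestly is the hard part that no code does for you.

```python
def choose_pattern(steps_enumerable: bool, safe_to_explore: bool) -> str:
    """Mirror the flowchart: prefer a workflow, fall back to an agent only
    when the steps cannot be enumerated and exploration is safe."""
    if steps_enumerable:
        return "workflow with LLM calls"
    if safe_to_explore:
        return "agent"
    return "workflow with human checkpoints"


# Example: an open-ended research task in a read-only sandbox.
pattern = choose_pattern(steps_enumerable=False, safe_to_explore=True)
```

Note that the first question short-circuits the second: if the steps can be enumerated, safety to explore never enters the decision.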

9.6 What makes an agent reliable

The production art of agents, as learned the hard way:

  1. Clear, narrow tools. 5–15 well-described tools. Not 100.
  2. Structured errors. Tools return {ok: false, error: "...", suggestion: "..."}; the model self-corrects.
  3. Budgets. Max steps, max tokens, max wall clock, max dollars. All enforced outside the model.
  4. Observability. Every step, tool call, and token logged with a trace ID.
  5. Reflection. Periodic "what have I learned, what's next?" calls keep long runs coherent.
  6. Checkpoints. Long-running agents persist state (DB / filesystem) so you can resume after failure.
  7. Evals. A golden set of tasks with expected outcomes; run on every prompt or model change.
  8. Human gates. For any action that spends money, touches prod, or sends a customer message.
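
Item 2 is the easiest of these to show in code. The sketch below wraps any tool so that it always returns the `{ok, error, suggestion}` shape described above instead of raising. The `safe_call` wrapper and the `lookup_order` tool are hypothetical examples, not part of any SDK.

```python
from typing import Any, Callable


def safe_call(tool: Callable[..., Any], args: dict,
              suggestion: str = "check the arguments and retry") -> dict:
    """Run a tool and always return a structured result the model can read."""
    try:
        return {"ok": True, "result": tool(**args)}
    except Exception as exc:
        # Turn the exception into data so the model can see it and self-correct.
        return {
            "ok": False,
            "error": f"{type(exc).__name__}: {exc}",
            "suggestion": suggestion,
        }


# Hypothetical domain tool used only for this illustration.
def lookup_order(order_id: str) -> dict:
    orders = {"A100": {"status": "shipped"}}
    if order_id not in orders:
        raise LookupError(f"no order {order_id}")
    return orders[order_id]


good = safe_call(lookup_order, {"order_id": "A100"})
bad = safe_call(lookup_order, {"order_id": "Z9"},
                suggestion="order IDs look like 'A100'; ask the user to confirm")
```

The `suggestion` field is where most of the value sits: a hint written by the tool author is usually more useful to the model than a raw stack trace.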

9.7 A minimal agent in Python

Here is the shape. In production you'd use LangGraph, OpenAI Agents SDK, or Claude Agent SDK — but seeing it bare is worth a page.

from anthropic import Anthropic

client = Anthropic()

TOOLS = [
    {
        "name": "search",
        "description": "Search the company wiki.",
        "input_schema": {
            "type": "object",
            "properties": {"q": {"type": "string"}},
            "required": ["q"],
        },
    },
    {
        "name": "finish",
        "description": "Return the final answer to the user.",
        "input_schema": {
            "type": "object",
            "properties": {"answer": {"type": "string"}},
            "required": ["answer"],
        },
    },
]

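def wiki_search(q):
    # Placeholder so the listing runs: replace with a real wiki search.
    return []

def run_tool(name, args):
    # Dispatch a tool call by name. Unknown names come back as data,
    # not as an exception, so the model can see the error and recover.
    if name == "search":
        return wiki_search(args["q"])
    if name == "finish":
        return {"done": True, "answer": args["answer"]}
    return {"error": f"unknown tool {name}"}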

def agent(goal, max_steps=10):
    messages = [{"role": "user", "content": goal}]
    for step in range(max_steps):
        resp = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=1024,
            tools=TOOLS,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": resp.content})

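        tool_uses = [b for b in resp.content if b.type == "tool_use"]
        if not tool_uses:
            # Model answered in plain text without calling a tool.
            # Return the text so the caller always gets a string.
            return "".join(b.text for b in resp.content if b.type == "text")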

        results = []
        for tu in tool_uses:
            result = run_tool(tu.name, tu.input)
            if tu.name == "finish":
                return result["answer"]
            results.append({
                "type": "tool_result",
                "tool_use_id": tu.id,
                "content": str(result),
            })
        messages.append({"role": "user", "content": results})

    raise RuntimeError("agent exceeded step budget")

Everything else — multi-agent, MCP, long-running workflows — is a variation of this loop.

9.8 The honest state of agents in 2026

If you're building an agent, the question to keep asking is: "would a junior engineer with this toolkit and these instructions succeed?" If the answer is "probably not," your agent won't either.

Further reading & watching