Visual Technical Guide

How LLM Agents Work

A comprehensive visual breakdown of every operation inside an LLM-powered agent — from tokenization to tool execution to memory management.


What is an LLM Agent?

An LLM agent is a system that uses a large language model as its core reasoning engine, augmented with the ability to perceive its environment, make decisions, execute actions via external tools, and learn from observations — all within an autonomous loop.

Unlike a simple chatbot that produces one response per input, an agent can take multiple steps to accomplish complex goals. It decides what to do, calls external APIs or tools, reads the results, adjusts its plan, and continues until the task is complete.

HIGH-LEVEL ARCHITECTURE
👤 User Input (natural language task or query) → 🧠 LLM Core (reasoning & decision engine) → Action Selection (choose tool & parameters) → 🔧 Tool Execution (APIs, code, search, etc.) → 👁 Observation (parse & interpret results) → Response (final answer or next loop)
Capability     | Plain LLM                     | LLM Agent
Interaction    | Single turn: input → output   | Multi-step autonomous loop
World Access   | None — text in, text out      | Tools, APIs, file systems, databases
Memory         | Context window only           | Short-term + long-term retrieval stores
Planning       | Implicit in single generation | Explicit multi-step reasoning & replanning
Error Recovery | Cannot self-correct           | Observes failures, retries, adjusts approach

The Five Core Components

Every LLM agent, regardless of framework, is built from five fundamental components working in concert.

🧠

LLM Core (The Brain)

The foundation model (GPT-4, Claude, Llama, etc.) that generates text, reasons over context, and produces structured outputs. All agent intelligence flows through this component. The model processes a carefully constructed prompt and returns either a final answer or a structured action request.

🔧

Tools (The Hands)

External functions the agent can invoke: web search, code execution, database queries, API calls, file operations. Each tool has a name, description, and parameter schema that the LLM uses to decide when and how to call it.

📋

Planner (The Strategist)

The reasoning mechanism that decomposes complex tasks into ordered sub-steps. This might be explicit (a generated task list) or implicit (chain-of-thought reasoning within the LLM's generation). The planner also decides when to re-plan based on new observations.

💾

Memory (The Recall)

State management across agent steps. Short-term memory is the conversation/context window. Long-term memory uses vector databases or key-value stores for persistent knowledge retrieval. Working memory holds the current plan and intermediate results.

🔄

Orchestrator (The Loop)

The runtime that wires everything together. It manages the observe-think-act cycle, routes LLM outputs to tool execution, feeds results back, handles errors, enforces token budgets, and decides when to terminate. This is the "agentic scaffold."

The Core Execution Cycle

Every agent step follows the same fundamental cycle. The loop runs until the agent decides to stop (produces a final answer) or hits a configured maximum iteration limit.

👁 Perceive (parse input) → 💭 Reason (think step) → 📋 Plan (pick action) → Act (execute tool) → 🔍 Observe (read result) → 🔄 Update (revise state) → back to Perceive (the agent loop)
1

Perceive — Parse the Input

The orchestrator assembles the full context: system prompt, tool definitions, conversation history, and the current user message (or previous tool result). This entire payload is tokenized and sent to the LLM. The perception step is where prompt engineering happens — the quality of what the agent "sees" determines everything downstream.
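The assembly step can be sketched in a few lines. This is an illustrative example, not any framework's real API; the names `buildMessages` and the tool-definition shape are hypothetical.

```javascript
// Hypothetical sketch: assembling the full context for one LLM call.
// The function and tool-definition shape are illustrative, not a real API.
function buildMessages(systemPrompt, toolDefs, history, currentInput) {
  // Render tool definitions into the system prompt so the model can see them
  const toolSection = toolDefs
    .map(t => `- ${t.name}(${t.params}) → ${t.description}`)
    .join("\n");
  return [
    { role: "system", content: `${systemPrompt}\n\nAvailable tools:\n${toolSection}` },
    ...history,                              // prior Thought/Action/Observation turns
    { role: "user", content: currentInput }  // new user message or tool result
  ];
}

const messages = buildMessages(
  "You are a helpful research assistant.",
  [{ name: "web_search", params: "query: string", description: "Search the web" }],
  [],
  "What was Apple's revenue last quarter?"
);
```

Note that the history argument grows on every loop iteration, which is why perception cost rises as the agent works.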

2

Reason — Think About What to Do

The LLM generates internal reasoning (often called "chain-of-thought" or a "scratchpad"). In ReAct-style agents, this is an explicit Thought: field. The model considers what it knows, what it needs, and what tools are available. This is pure text generation — the model is literally writing out its thinking process.

3

Plan — Select an Action

Based on its reasoning, the model outputs a structured action: a tool name and its parameters (typically JSON). If no tool is needed, it outputs a finish action with the final answer. The action format is defined in the system prompt and the model must follow the schema exactly for the orchestrator to parse it.

4

Act — Execute the Tool

The orchestrator parses the LLM's output, validates the action, and dispatches it to the appropriate tool. This is real code execution — an HTTP request, a database query, a file read, a calculator. The agent's "hands" touch the real world here. Errors are caught and formatted as results.

5

Observe — Read the Result

The tool's output is formatted and appended to the conversation history as an "observation." The agent now sees both its action and the result. This creates the feedback loop — the agent can evaluate whether its action succeeded, failed, or returned unexpected data.

6

Update — Revise State & Loop

The orchestrator updates working memory, checks termination conditions (max iterations, final answer detected, error budget exhausted), and either loops back to step 1 with the expanded context or returns the final answer to the user. The conversation history grows with each loop iteration.
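The six steps above can be condensed into a minimal orchestrator sketch. This assumes a `callLLM` function that returns either a tool call or a final answer, and a `tools` map of async functions; both are stand-ins for real model and tool integrations.

```javascript
// Minimal sketch of the observe-think-act loop. callLLM and tools are
// stand-ins (assumptions); a real orchestrator would call a model API here.
async function runAgent(callLLM, tools, userInput, maxIterations = 10) {
  const history = [{ role: "user", content: userInput }];
  for (let i = 0; i < maxIterations; i++) {
    const output = await callLLM(history);             // Perceive + Reason + Plan
    history.push({ role: "assistant", content: JSON.stringify(output) });
    if (output.finalAnswer !== undefined) {
      return output.finalAnswer;                       // termination: final answer
    }
    let observation;
    try {
      observation = await tools[output.tool](output.params);  // Act
    } catch (err) {
      observation = `Error: ${err.message}`;           // errors become observations
    }
    history.push({ role: "tool", content: String(observation) });  // Observe + Update
  }
  return "Stopped: max iterations reached";            // termination: iteration budget
}
```

The iteration cap and the error-to-observation conversion are the two safety valves that keep the loop from running away.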

How the Prompt is Built

Before every LLM call, the orchestrator assembles a structured prompt. The exact content of this prompt is what gives the agent its personality, capabilities, and constraints. Here's the anatomy of a typical agent prompt.

SYSTEM
Identity, personality, rules, and instructions for the agent's behavior and constraints.
TOOLS
JSON schema definitions of every available tool — name, description, parameter types.
FORMAT
Output format instructions: how to emit thoughts, actions, and final answers.
EXAMPLES
Few-shot examples demonstrating the Thought → Action → Observation pattern.
USER
"What was Apple's revenue last quarter?"
ASSISTANT
Thought: I need to search for Apple's latest earnings report...
TOOL RESULT
{"revenue": "$94.9B", "quarter": "Q1 2025", "source": "Apple IR"}
ASSISTANT
Thought: I have the data. I can now answer the user. Final Answer: ...
Key insight: The conversation history grows with every loop iteration. Each Thought, Action, Observation triple is appended. This is why agents consume tokens rapidly — after 5 tool calls, the context may contain thousands of tokens of accumulated reasoning and results.
System Prompt (Simplified)
// System prompt structure for a ReAct agent

You are a helpful research assistant with access to tools.

Available tools:
- web_search(query: string) → Search the web
- calculator(expression: string) → Evaluate math
- get_weather(city: string) → Current weather data

For each step, output your reasoning then an action:

Thought: [your reasoning about what to do next]
Action: {"tool": "tool_name", "params": {...}}

After receiving a tool result, reason about it and either
call another tool or provide your final answer:

Thought: [reasoning about the result]
Final Answer: [your response to the user]

Tool definitions are critical. The LLM reads the tool descriptions and parameter schemas to understand what each tool does and when to use it. Poorly written tool descriptions lead to misuse — the model might call the wrong tool or pass incorrect parameters. Each tool definition typically includes the tool name, a natural language description of its purpose, a JSON Schema for parameters (with types, required fields, and descriptions), and a description of what the return value looks like.
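A concrete definition in the JSON-Schema style used by function-calling APIs might look like the following. The exact field names vary by provider; this shape is illustrative.

```javascript
// Illustrative tool definition; exact field names vary by provider.
const getWeatherTool = {
  name: "get_weather",
  description: "Get current weather conditions for a city. " +
               "Use when the user asks about weather, temperature, or rain.",
  input_schema: {
    type: "object",
    properties: {
      city:  { type: "string", description: "City name, e.g. 'Paris'" },
      units: { type: "string", enum: ["celsius", "fahrenheit"],
               description: "Temperature units (default: celsius)" }
    },
    required: ["city"]
  }
  // Return shape (documented for the model): { temp: number, conditions: string }
};
```

The description fields do double duty: they are documentation for humans and the only signal the model has for deciding when this tool applies.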

How Agents Think

The reasoning step is where intelligence happens. The LLM generates text that represents its internal deliberation. Different agent frameworks structure this differently.

🔗

Chain-of-Thought (CoT)

The model is prompted to "think step by step." It generates sequential reasoning before producing an answer. This improves accuracy on complex tasks by forcing intermediate computation steps. In agents, CoT happens within the "Thought" field at each loop iteration.

ReAct (Reason + Act)

The canonical agent pattern. Each step alternates between Reasoning (a thought explaining what to do and why) and Acting (a tool call). The observation from the action feeds into the next reasoning step. This interleaving grounds reasoning in real data.

🌳

Tree-of-Thought (ToT)

For especially hard problems, the agent explores multiple reasoning paths in parallel (branching), evaluates each path's promise, and prunes unpromising branches. This is more expensive (multiple LLM calls per step) but dramatically improves performance on tasks requiring search or backtracking.

🔄

Reflexion

After completing a task (or failing), the agent generates a self-critique: what went wrong, what could be improved, and what to try differently. This reflection is stored in long-term memory and retrieved on similar future tasks, creating a learning loop across episodes.

Why reasoning matters: Without explicit reasoning steps, models tend to "jump to answers" — often incorrectly. The Thought field forces the model to articulate its plan before acting, dramatically reducing errors. Research shows that ReAct agents outperform pure CoT and pure action-only approaches on knowledge-intensive tasks.

Calling External Tools

Tools are what transform an LLM from a text generator into an agent that can affect the real world. Here's exactly how a tool call works, step by step.

A

LLM Generates Structured Output

The model outputs a JSON object (or specially formatted text) specifying which tool to call and what parameters to pass. Modern APIs (OpenAI function calling, Anthropic tool use) provide native structured output modes where the model's generation is constrained to valid JSON matching a tool's schema.

// LLM output (structured tool call)
{
  "tool": "web_search",
  "parameters": {
    "query": "Apple Q1 2025 revenue earnings report"
  }
}
B

Orchestrator Parses & Validates

The agent framework extracts the tool name and parameters from the LLM's output. It validates against the tool's schema: are required parameters present? Are types correct? Is this tool actually available? If validation fails, the error is formatted as an observation and sent back to the LLM so it can try again.
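A simplified validation pass might look like this. Full JSON Schema validation is reduced here to required-field checks; `validateAction` and the registry shape are illustrative names, not a real framework API.

```javascript
// Sketch of validation before dispatch. Schema checking is simplified to
// required-field checks; real frameworks validate full JSON Schema.
function validateAction(action, registry) {
  const tool = registry[action.tool];
  if (!tool) {
    return { ok: false, error: `Unknown tool: ${action.tool}` };
  }
  for (const param of tool.required) {
    if (!(param in action.parameters)) {
      return { ok: false, error: `Missing required parameter: ${param}` };
    }
  }
  return { ok: true };
}

const registry = { web_search: { required: ["query"] } };
const good = validateAction(
  { tool: "web_search", parameters: { query: "Apple Q1 2025 revenue" } }, registry);
const bad = validateAction({ tool: "send_email", parameters: {} }, registry);
// On failure, bad.error is fed back to the LLM as an observation so it can retry
```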

C

Tool Function Executes

The validated parameters are passed to the actual tool implementation — a Python function, an HTTP endpoint, a shell command. This is real-world execution: network requests fire, databases are queried, code runs. The tool returns a result (string, JSON, or error).

// Tool execution (server-side)
async function web_search(params) {
  const results = await searchAPI(params.query);
  return {
    results: results.map(r => ({
      title: r.title,
      snippet: r.snippet,
      url: r.url
    }))
  };
}
D

Result Formatted as Observation

The tool's return value is serialized to text and injected into the conversation as an Observation: or tool_result message. The LLM will see this on its next generation step. Results are often truncated or summarized to fit within token budgets.
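A truncation helper might look like this sketch. A real system would count tokens with the model's tokenizer; character counts stand in here, and the function name is illustrative.

```javascript
// Sketch: truncating a tool result to fit a budget. Characters stand in
// for tokens; a real system would use the model's tokenizer.
function formatObservation(result, maxChars = 2000) {
  let text = typeof result === "string" ? result : JSON.stringify(result);
  if (text.length > maxChars) {
    text = text.slice(0, maxChars) + `\n…[truncated ${text.length - maxChars} chars]`;
  }
  return `Observation: ${text}`;
}
```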

E

LLM Reasons Over Result

The context now contains the original question, the tool call, and the observation. The LLM generates a new Thought — interpreting the result, deciding if it's sufficient, and either producing a final answer or selecting another tool call. The loop continues.

Common Tool Categories

🔍 Information Retrieval

Web search, document search (RAG), database queries, API data fetching. These tools extend the agent's knowledge beyond its training data and provide real-time information.

💻 Code Execution

Python/JS interpreters, shell commands, sandboxed compute environments. Allow the agent to run calculations, process data, generate files, and perform operations that require exact computation.

📝 Content Creation

File writers, image generators (DALL-E, Stable Diffusion), document formatters. These tools let agents produce artifacts beyond text — images, PDFs, spreadsheets, code files.

🌐 External Actions

Email sending, calendar management, Slack messaging, GitHub operations. These "actuator" tools let agents perform real-world actions on behalf of users, requiring careful permission models.

How Agents Remember

Memory is what allows agents to maintain coherence across steps and sessions. There are three distinct memory systems, each solving a different temporal problem.

Working Memory (Context Window)

The current conversation: system prompt + message history + tool results. This is the LLM's "RAM" — everything it can attend to right now. Limited by context window size (4K–200K tokens). As the agent loops, this grows and may require summarization or truncation to stay within limits.

📓

Short-Term Memory (Scratchpad)

Structured state maintained by the orchestrator outside the LLM: the current plan, completed sub-tasks, intermediate results, and error counts. This is typically a data structure in the agent framework that gets serialized into the context when needed. It allows the agent to track progress without consuming context for bookkeeping.

🗄

Long-Term Memory (Vector Store)

Persistent knowledge stored in external databases (Pinecone, Chroma, pgvector). Text is embedded into vectors and retrieved via semantic similarity search. This allows agents to recall information from past conversations, reference documents, or retrieve their own past reflections. Retrieval-Augmented Generation (RAG) is the core pattern here.

MEMORY RETRIEVAL FLOW (RAG)
💬 User Query ("What did we discuss about pricing?") → 🔢 Embed (query → vector [0.12, -0.8, ...]) → 🔎 Search (cosine similarity over stored vectors) → 📄 Retrieve (top-K relevant chunks returned) → 🧠 Augment (chunks injected into LLM prompt)
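The search-and-retrieve steps reduce to cosine similarity over stored vectors, sketched below with toy 3-dimensional embeddings. Real systems use an embedding model and a vector database; the data here is fabricated for illustration.

```javascript
// Toy sketch of vector retrieval: cosine similarity over stored embeddings.
// Real systems use an embedding model and a vector DB; 3-d vectors here.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function retrieve(queryVec, store, k = 2) {
  return store
    .map(item => ({ ...item, score: cosine(queryVec, item.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);                        // top-K most similar chunks
}

const store = [
  { text: "Pricing discussion: tier at $49/mo", vector: [0.9, 0.1, 0.0] },
  { text: "Logo color feedback",                vector: [0.0, 0.2, 0.9] }
];
const hits = retrieve([0.8, 0.2, 0.1], store, 1);
```

The retrieved chunk texts are then concatenated into the prompt, which is the "Augment" step.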
Context window management is one of the hardest engineering problems in agent systems. After several tool calls, the context can easily exceed limits. Strategies include: sliding window (drop oldest messages), summarization (LLM-generated summaries of past turns), selective retrieval (only inject relevant memories), and hierarchical memory (summarize at multiple levels of granularity).
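The sliding-window and summarization strategies can be combined, as in this sketch. In practice the summary would itself be generated by an LLM call; here a placeholder stands in, and `compactHistory` is an illustrative name.

```javascript
// Sketch of sliding window + summarization: keep the system prompt and the
// most recent turns, collapse older turns into a summary placeholder.
// In practice the summary text would come from an LLM call.
function compactHistory(messages, keepRecent = 4) {
  const [system, ...rest] = messages;
  if (rest.length <= keepRecent) return messages;
  const dropped = rest.slice(0, rest.length - keepRecent);
  const summary = {
    role: "system",
    content: `[Summary of ${dropped.length} earlier turns]`
  };
  return [system, summary, ...rest.slice(-keepRecent)];
}
```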

Decomposing Complex Tasks

When an agent receives a complex request, it must break it down into manageable sub-tasks. Planning is the mechanism that makes agents capable of handling multi-step goals.

1

Task Decomposition

The agent receives a high-level goal like "Research competitor pricing, create a comparison spreadsheet, and draft an email summary for the team." The planner breaks this into discrete, ordered sub-tasks: (1) Search for competitor pricing data, (2) structure data into a table, (3) create the spreadsheet file, (4) draft the email. Each sub-task can be mapped to specific tool calls.

2

Dependency Analysis

The planner identifies which sub-tasks depend on others. The spreadsheet can't be created before data is collected. The email can't be drafted before the spreadsheet exists. Independent tasks (e.g., looking up two different competitors) can potentially execute in parallel.

3

Execution & Monitoring

The agent executes sub-tasks in order, checking off completed items and noting results. After each sub-task, it evaluates: did this succeed? Is the result sufficient? Do I need to adjust the plan? This is dynamic replanning — the plan evolves based on real observations.

4

Adaptive Replanning

If a sub-task fails (search returns no results, API errors, unexpected data format), the planner generates an alternative approach: try a different search query, use a different tool, break the sub-task further, or adjust the overall strategy. This resilience is what makes agents useful in unpredictable environments.

Generated Plan
// Agent-generated plan (emitted as structured output)
{
  "goal": "Compare competitor pricing and brief the team",
  "steps": [
    {"id": 1, "task": "Search competitor A pricing", "tool": "web_search", "status": "done"},
    {"id": 2, "task": "Search competitor B pricing", "tool": "web_search", "status": "done"},
    {"id": 3, "task": "Create comparison spreadsheet", "tool": "create_file", "status": "in_progress"},
    {"id": 4, "task": "Draft summary email", "tool": "compose_email", "status": "pending"}
  ]
}

Common Architectures

Over the past two years, several distinct agent patterns have emerged. Each makes different tradeoffs between simplicity, reliability, cost, and capability.

🔗

ReAct Agent

The foundational pattern. Alternates Thought → Action → Observation in a loop. Simple, reliable, and the most widely used. Each step is a single LLM call that produces reasoning and an action. Works well for tasks requiring 1–10 tool calls. Frameworks: LangChain, LlamaIndex.

📊

Plan-and-Execute

Separates planning from execution. A "planner" LLM generates a full task list upfront, then an "executor" LLM handles each step. The planner reviews and revises the plan between steps. Better for long-horizon tasks (10+ steps) because the plan provides structure. More expensive (2 LLM calls per step).

🔄

LLM Compiler

Generates a task DAG (directed acyclic graph) with dependencies, then executes independent tasks in parallel. Dramatically faster for tasks with parallelizable sub-tasks (e.g., searching multiple sources simultaneously). Complex orchestration logic but significant latency reduction.
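The wave-by-wave execution can be sketched as follows: tasks whose dependencies are all satisfied run concurrently via Promise.all. The task shape and `runDag` name are illustrative, not from any specific framework.

```javascript
// Sketch of DAG-style execution: tasks with no unmet dependencies run
// concurrently in "waves". Task shape and names are illustrative.
async function runDag(tasks, execute) {
  const results = {};
  let pending = [...tasks];
  while (pending.length > 0) {
    const ready = pending.filter(t => t.deps.every(d => d in results));
    if (ready.length === 0) throw new Error("Cycle or unmet dependency");
    // Independent tasks execute in parallel within one wave
    const outs = await Promise.all(ready.map(t => execute(t, results)));
    ready.forEach((t, i) => { results[t.id] = outs[i]; });
    pending = pending.filter(t => !(t.id in results));
  }
  return results;
}

const tasks = [
  { id: "searchA", deps: [] },
  { id: "searchB", deps: [] },                       // parallel with searchA
  { id: "compare", deps: ["searchA", "searchB"] }    // waits for both
];
```

With two independent searches, the first wave runs both at once, halving the latency of that stage compared to sequential execution.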

💬

Conversational Agent

Optimized for dialogue. The agent maintains conversation state and can ask clarifying questions, request confirmation before destructive actions, and adapt its approach based on user feedback mid-task. Used in customer support, personal assistants, and interactive workflows.

🔀

Router Agent

A lightweight "dispatcher" that classifies the user's intent and routes to specialized sub-agents. Each sub-agent is optimized for its domain (code agent, research agent, writing agent). The router's job is just classification — it's fast and cheap. Good for systems with distinct capability domains.
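A toy router is sketched below. In production the classification would itself be a cheap LLM call (or a fine-tuned classifier); keyword matching stands in here, and the agent names are hypothetical.

```javascript
// Toy router sketch: keyword matching stands in for what would be a cheap
// LLM classification call in production. Agent names are hypothetical.
function routeIntent(query) {
  const q = query.toLowerCase();
  if (/\b(code|bug|function|compile)\b/.test(q)) return "code_agent";
  if (/\b(research|find|search|look up)\b/.test(q)) return "research_agent";
  if (/\b(write|draft|edit|summarize)\b/.test(q)) return "writing_agent";
  return "general_agent";               // fallback when no domain matches
}
```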

🤔

Self-Reflective Agent

After producing a result, the agent evaluates its own output for quality, accuracy, and completeness. If it fails its own quality check, it revises and tries again. This "inner critic" pattern trades latency for reliability and is commonly used when output quality is critical.

Agents Working Together

For complex workflows, multiple specialized agents can collaborate. Each agent has a focused role, its own tools, and its own system prompt. A supervisor or messaging protocol coordinates their interactions.

MULTI-AGENT TOPOLOGY
Supervisor Agent (routes, delegates, aggregates)
├─ Research Agent: web_search, web_fetch (backed by Search APIs, Databases)
├─ Code Agent: exec_code, create_file (backed by Sandboxed Runtime)
└─ Writing Agent: draft, edit, format (backed by Templates, Style Guide)
All agents coordinate through a Shared Message Bus / State Store.

Communication patterns in multi-agent systems typically follow one of three models. In hierarchical systems, a supervisor delegates tasks and aggregates results — agents only talk to the supervisor. In peer-to-peer systems, agents communicate directly via a message bus, passing intermediate results without a central coordinator. In sequential pipeline systems, each agent's output becomes the next agent's input, like an assembly line.

Shared state is the coordination mechanism. A common pattern is a shared "blackboard" or state store where agents read and write intermediate results. The supervisor monitors this state to track progress and detect when the overall task is complete. Each agent reads the parts relevant to its sub-task and writes its results back.

When to use multi-agent? Multi-agent systems add complexity. Use them when: (1) the task genuinely requires different expertise or tool sets, (2) sub-tasks are parallelizable for latency gains, (3) you need separation of concerns for safety (e.g., a reviewer agent that checks a coder agent's output), or (4) the system is too complex for a single agent's context window.
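The shared-blackboard coordination pattern reduces to a small amount of state machinery, sketched below. The class, slot names, and data are all illustrative.

```javascript
// Minimal blackboard sketch: agents write named slots into shared state;
// the supervisor polls for completion. All names and data are illustrative.
class Blackboard {
  constructor() { this.state = {}; }
  write(agent, key, value) {
    this.state[key] = { value, author: agent, at: Date.now() };
  }
  read(key) { return this.state[key]?.value; }
  isComplete(requiredKeys) {
    return requiredKeys.every(k => k in this.state);
  }
}

const board = new Blackboard();
board.write("research_agent", "pricing_data", { acme: 49, globex: 59 });
board.write("code_agent", "spreadsheet_path", "/tmp/comparison.xlsx");
// Supervisor checks whether every sub-task has reported its result yet
const done = board.isComplete(["pricing_data", "spreadsheet_path", "email_draft"]);
```

Recording the author and timestamp alongside each value makes the trace auditable, which matters when a reviewer agent needs to check another agent's work.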

Watch an Agent Trace

Select a scenario below to see a step-by-step execution trace of how an agent would handle the task. Each line shows exactly what happens at each stage of the agent loop.