A comprehensive visual breakdown of every operation inside an LLM-powered agent — from tokenization to tool execution to memory management.
An LLM agent is a system that uses a large language model as its core reasoning engine, augmented with the ability to perceive its environment, make decisions, execute actions via external tools, and learn from observations — all within an autonomous loop.
Unlike a simple chatbot that produces one response per input, an agent can take multiple steps to accomplish complex goals. It decides what to do, calls external APIs or tools, reads the results, adjusts its plan, and continues until the task is complete.
| Capability | Plain LLM | LLM Agent |
|---|---|---|
| Interaction | Single turn: input → output | Multi-step autonomous loop |
| World Access | None — text in, text out | Tools, APIs, file systems, databases |
| Memory | Context window only | Short-term + long-term retrieval stores |
| Planning | Implicit in single generation | Explicit multi-step reasoning & replanning |
| Error Recovery | Cannot self-correct | Observes failures, retries, adjusts approach |
Every LLM agent, regardless of framework, is built from five fundamental components working in concert.
The foundation model (GPT-4, Claude, Llama, etc.) that generates text, reasons over context, and produces structured outputs. All agent intelligence flows through this component. The model processes a carefully constructed prompt and returns either a final answer or a structured action request.
External functions the agent can invoke: web search, code execution, database queries, API calls, file operations. Each tool has a name, description, and parameter schema that the LLM uses to decide when and how to call it.
The reasoning mechanism that decomposes complex tasks into ordered sub-steps. This might be explicit (a generated task list) or implicit (chain-of-thought reasoning within the LLM's generation). The planner also decides when to re-plan based on new observations.
State management across agent steps. Short-term memory is the conversation/context window. Long-term memory uses vector databases or key-value stores for persistent knowledge retrieval. Working memory holds the current plan and intermediate results.
The runtime that wires everything together. It manages the observe-think-act cycle, routes LLM outputs to tool execution, feeds results back, handles errors, enforces token budgets, and decides when to terminate. This is the "agentic scaffold."
Every agent step follows the same fundamental cycle. The loop runs until the agent decides to stop (produces a final answer) or hits a configured maximum iteration limit.
The orchestrator assembles the full context: system prompt, tool definitions, conversation history, and the current user message (or previous tool result). This entire payload is tokenized and sent to the LLM. The perception step is where prompt engineering happens — the quality of what the agent "sees" determines everything downstream.
The LLM generates internal reasoning (often called "chain-of-thought" or a "scratchpad"). In ReAct-style agents, this is an explicit Thought: field. The model considers what it knows, what it needs, and what tools are available. This is pure text generation — the model is literally writing out its thinking process.
Based on its reasoning, the model outputs a structured action: a tool name and its parameters (typically JSON). If no tool is needed, it outputs a finish action with the final answer. The action format is defined in the system prompt and the model must follow the schema exactly for the orchestrator to parse it.
The orchestrator parses the LLM's output, validates the action, and dispatches it to the appropriate tool. This is real code execution — an HTTP request, a database query, a file read, a calculator. The agent's "hands" touch the real world here. Errors are caught and formatted as results.
The tool's output is formatted and appended to the conversation history as an "observation." The agent now sees both its action and the result. This creates the feedback loop — the agent can evaluate whether its action succeeded, failed, or returned unexpected data.
The orchestrator updates working memory, checks termination conditions (max iterations, final answer detected, error budget exhausted), and either loops back to step 1 with the expanded context or returns the final answer to the user. The conversation history grows with each loop iteration.
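The six steps above can be sketched as a single loop. This is a minimal, illustrative sketch: `fakeModel` stands in for a real LLM API call, the only tool is a toy adder, and all names are hypothetical rather than any framework's actual API.

```javascript
// Minimal observe-think-act loop (all names illustrative).
const tools = {
  add: ({ a, b }) => String(a + b), // toy tool standing in for real tools
};

// Stub model: first call requests a tool, later calls finish.
// A real orchestrator would send the messages to an LLM API here.
function fakeModel(messages) {
  const observation = messages.find((m) => m.role === "tool");
  if (!observation) {
    return { action: "add", params: { a: 2, b: 3 } };
  }
  return { action: "finish", answer: `The result is ${observation.content}` };
}

function runAgent(userMessage, maxIterations = 5) {
  const messages = [{ role: "user", content: userMessage }];
  for (let i = 0; i < maxIterations; i++) {
    const step = fakeModel(messages);                  // perceive + reason + act
    if (step.action === "finish") return step.answer;  // termination condition
    const result = tools[step.action](step.params);    // execute the tool
    messages.push({ role: "tool", content: result });  // record the observation
  }
  return "Max iterations reached";                     // iteration budget exhausted
}
```

Note how termination is checked before execution on every pass: the loop ends either because the model emits a finish action or because the iteration budget runs out.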
Before every LLM call, the orchestrator assembles a structured prompt. The exact content of this prompt is what gives the agent its personality, capabilities, and constraints. Here's the anatomy of a typical agent prompt.
```
// System prompt structure for a ReAct agent

You are a helpful research assistant with access to tools.

Available tools:
- web_search(query: string) → Search the web
- calculator(expression: string) → Evaluate math
- get_weather(city: string) → Current weather data

For each step, output your reasoning then an action:

Thought: [your reasoning about what to do next]
Action: {"tool": "tool_name", "params": {...}}

After receiving a tool result, reason about it and either call another
tool or provide your final answer:

Thought: [reasoning about the result]
Final Answer: [your response to the user]
```
Tool definitions are critical. The LLM reads the tool descriptions and parameter schemas to understand what each tool does and when to use it. Poorly written tool descriptions lead to misuse — the model might call the wrong tool or pass incorrect parameters. Each tool definition typically includes the tool name, a natural language description of its purpose, a JSON Schema for parameters (with types, required fields, and descriptions), and a description of what the return value looks like.
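A tool definition might look like the following. This is an illustrative shape, roughly modeled on OpenAI-style function calling; the exact field names vary by provider, and the tool itself (`get_weather`) is hypothetical.

```javascript
// Illustrative tool definition: name, description, JSON Schema for
// parameters, and a note on the return shape. Field names vary by API.
const getWeatherTool = {
  name: "get_weather",
  description:
    "Get current weather conditions for a city. Use this when the user " +
    "asks about weather or temperature; do not use it for historical data.",
  parameters: {
    type: "object",
    properties: {
      city: { type: "string", description: "City name, e.g. 'Berlin'" },
      units: {
        type: "string",
        enum: ["celsius", "fahrenheit"],
        description: "Temperature units (defaults to celsius)",
      },
    },
    required: ["city"],
  },
  // Described in the prompt so the model knows what to expect back:
  returns: "JSON with temperature, conditions, and humidity",
};
```

The description does double duty: it tells the model both what the tool does and when (not) to use it, which is exactly where vague wording causes misuse.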
The reasoning step is where intelligence happens. The LLM generates text that represents its internal deliberation. Different agent frameworks structure this differently.
The model is prompted to "think step by step." It generates sequential reasoning before producing an answer. This improves accuracy on complex tasks by forcing intermediate computation steps. In agents, CoT happens within the "Thought" field at each loop iteration.
The canonical agent pattern. Each step alternates between Reasoning (a thought explaining what to do and why) and Acting (a tool call). The observation from the action feeds into the next reasoning step. This interleaving grounds reasoning in real data.
For especially hard problems, the agent explores multiple reasoning paths in parallel (branching), evaluates each path's promise, and prunes unpromising branches. This is more expensive (multiple LLM calls per step) but dramatically improves performance on tasks requiring search or backtracking.
After completing a task (or failing), the agent generates a self-critique: what went wrong, what could be improved, and what to try differently. This reflection is stored in long-term memory and retrieved on similar future tasks, creating a learning loop across episodes.
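The store-and-retrieve side of reflection can be sketched very simply. This is a toy version: retrieval here is naive word overlap, where a real system would use embedding similarity, and both function names are hypothetical.

```javascript
// Toy Reflexion-style episodic memory: store self-critiques keyed by
// the task text, retrieve them for similar future tasks by word overlap.
const reflections = [];

function storeReflection(task, critique) {
  reflections.push({
    task,
    critique,
    words: new Set(task.toLowerCase().split(/\s+/)),
  });
}

function retrieveReflections(task, topK = 1) {
  const words = new Set(task.toLowerCase().split(/\s+/));
  return reflections
    .map((r) => ({
      critique: r.critique,
      score: [...r.words].filter((w) => words.has(w)).length, // overlap count
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((r) => r.critique);
}
```

Retrieved critiques are injected into the prompt of the next similar task, which is what closes the cross-episode learning loop.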
Tools are what transform an LLM from a text generator into an agent that can affect the real world. Here's exactly how a tool call works, step by step.
The model outputs a JSON object (or specially formatted text) specifying which tool to call and what parameters to pass. Modern APIs (OpenAI function calling, Anthropic tool use) provide native structured output modes where the model's generation is constrained to valid JSON matching a tool's schema.
```
// LLM output (structured tool call)
{
  "tool": "web_search",
  "parameters": {
    "query": "Apple Q1 2025 revenue earnings report"
  }
}
```
The agent framework extracts the tool name and parameters from the LLM's output. It validates against the tool's schema: are required parameters present? Are types correct? Is this tool actually available? If validation fails, the error is formatted as an observation and sent back to the LLM so it can try again.
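A stripped-down validator might look like this. It is a sketch under simplifying assumptions: real frameworks validate against full JSON Schema with a library, whereas this checks only tool existence, required fields, and primitive types. The registry shape is illustrative.

```javascript
// Sketch of tool-call validation: unknown tools, missing required
// parameters, and type mismatches all become structured errors that
// can be fed back to the LLM as an observation.
function validateToolCall(call, registry) {
  const tool = registry[call.tool];
  if (!tool) return { ok: false, error: `Unknown tool: ${call.tool}` };
  for (const [name, spec] of Object.entries(tool.parameters)) {
    const value = call.params?.[name];
    if (value === undefined) {
      if (spec.required) {
        return { ok: false, error: `Missing required parameter: ${name}` };
      }
      continue; // optional parameter omitted
    }
    if (typeof value !== spec.type) {
      return { ok: false, error: `Parameter ${name} must be ${spec.type}` };
    }
  }
  return { ok: true };
}

// Hypothetical registry entry for a single tool:
const registry = {
  web_search: { parameters: { query: { type: "string", required: true } } },
};
```

On failure, the `error` string is exactly what gets formatted as the observation, so writing it as an actionable instruction ("Missing required parameter: query") helps the model self-correct on the next step.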
The validated parameters are passed to the actual tool implementation — a Python function, an HTTP endpoint, a shell command. This is real-world execution: network requests fire, databases are queried, code runs. The tool returns a result (string, JSON, or error).
```javascript
// Tool execution (server-side)
async function web_search(params) {
  const results = await searchAPI(params.query);
  return {
    results: results.map(r => ({
      title: r.title,
      snippet: r.snippet,
      url: r.url
    }))
  };
}
```
The tool's return value is serialized to text and injected into the conversation as an Observation: or tool_result message. The LLM will see this on its next generation step. Results are often truncated or summarized to fit within token budgets.
The context now contains the original question, the tool call, and the observation. The LLM generates a new Thought — interpreting the result, deciding if it's sufficient, and either producing a final answer or selecting another tool call. The loop continues.
Web search, document search (RAG), database queries, API data fetching. These tools extend the agent's knowledge beyond its training data and provide real-time information.
Python/JS interpreters, shell commands, sandboxed compute environments. Allow the agent to run calculations, process data, generate files, and perform operations that require exact computation.
File writers, image generators (DALL-E, Stable Diffusion), document formatters. These tools let agents produce artifacts beyond text — images, PDFs, spreadsheets, code files.
Email sending, calendar management, Slack messaging, GitHub operations. These "actuator" tools let agents perform real-world actions on behalf of users, requiring careful permission models.
Memory is what allows agents to maintain coherence across steps and sessions. There are three distinct memory systems, each solving a different temporal problem.
The current conversation: system prompt + message history + tool results. This is the LLM's "RAM" — everything it can attend to right now. Limited by context window size (4K–200K tokens). As the agent loops, this grows and may require summarization or truncation to stay within limits.
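A common way to keep the growing history inside the budget is to drop the oldest non-system messages first. The sketch below uses the crude 4-characters-per-token heuristic purely for illustration; real systems count tokens with the model's tokenizer, and often summarize dropped messages instead of discarding them.

```javascript
// Sketch of short-term memory pruning: keep the system prompt, drop the
// oldest non-system messages until the estimated size fits the budget.
function pruneHistory(messages, tokenBudget) {
  const estimate = (m) => Math.ceil(m.content.length / 4); // rough heuristic
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  let total = messages.reduce((n, m) => n + estimate(m), 0);
  while (rest.length > 1 && total > tokenBudget) {
    total -= estimate(rest.shift()); // evict oldest first
  }
  return [...system, ...rest];
}
```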
Structured state maintained by the orchestrator outside the LLM: the current plan, completed sub-tasks, intermediate results, and error counts. This is typically a data structure in the agent framework that gets serialized into the context when needed. It allows the agent to track progress without consuming context for bookkeeping.
Persistent knowledge stored in external databases (Pinecone, Chroma, pgvector). Text is embedded into vectors and retrieved via semantic similarity search. This allows agents to recall information from past conversations, reference documents, or retrieve their own past reflections. Retrieval-Augmented Generation (RAG) is the core pattern here.
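The retrieval math underneath every vector store is cosine similarity. The sketch below uses tiny hand-made vectors for illustration; in practice the vectors come from an embedding model and the search runs inside the database, but the ranking logic is the same.

```javascript
// Cosine similarity between two equal-length vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank stored memories by similarity to the query vector, keep top K.
function retrieve(queryVec, store, topK = 2) {
  return store
    .map((e) => ({ text: e.text, score: cosine(queryVec, e.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```

The retrieved `text` entries are then pasted into the prompt, which is the whole of the RAG pattern: retrieval decides what the model gets to see.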
When an agent receives a complex request, it must break it down into manageable sub-tasks. Planning is the mechanism that makes agents capable of handling multi-step goals.
The agent receives a high-level goal like "Research competitor pricing, create a comparison spreadsheet, and draft an email summary for the team." The planner breaks this into discrete, ordered sub-tasks: (1) Search for competitor pricing data, (2) structure data into a table, (3) create the spreadsheet file, (4) draft the email. Each sub-task can be mapped to specific tool calls.
The planner identifies which sub-tasks depend on others. The spreadsheet can't be created before data is collected. The email can't be drafted before the spreadsheet exists. Independent tasks (e.g., looking up two different competitors) can potentially execute in parallel.
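Dependency tracking reduces to a simple readiness check: a step is runnable once all the steps it depends on are done. The plan shape below (with an explicit `deps` array) is an illustrative assumption, not a fixed format.

```javascript
// A step is runnable when it is pending and every dependency is done.
// Steps that surface together are independent and could run in parallel.
function runnableSteps(plan) {
  const done = new Set(plan.filter((s) => s.status === "done").map((s) => s.id));
  return plan.filter(
    (s) => s.status === "pending" && s.deps.every((d) => done.has(d))
  );
}

const plan = [
  { id: 1, task: "Search competitor A pricing", deps: [], status: "done" },
  { id: 2, task: "Search competitor B pricing", deps: [], status: "pending" },
  { id: 3, task: "Create comparison spreadsheet", deps: [1, 2], status: "pending" },
  { id: 4, task: "Draft summary email", deps: [3], status: "pending" },
];
```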
The agent executes sub-tasks in order, checking off completed items and noting results. After each sub-task, it evaluates: did this succeed? Is the result sufficient? Do I need to adjust the plan? This is dynamic replanning — the plan evolves based on real observations.
If a sub-task fails (search returns no results, API errors, unexpected data format), the planner generates an alternative approach: try a different search query, use a different tool, break the sub-task further, or adjust the overall strategy. This resilience is what makes agents useful in unpredictable environments.
```
// Agent-generated plan (emitted as structured output)
{
  "goal": "Compare competitor pricing and brief the team",
  "steps": [
    {"id": 1, "task": "Search competitor A pricing", "tool": "web_search", "status": "done"},
    {"id": 2, "task": "Search competitor B pricing", "tool": "web_search", "status": "done"},
    {"id": 3, "task": "Create comparison spreadsheet", "tool": "create_file", "status": "in_progress"},
    {"id": 4, "task": "Draft summary email", "tool": "compose_email", "status": "pending"}
  ]
}
```
Over the past two years, several distinct agent patterns have emerged. Each makes different tradeoffs between simplicity, reliability, cost, and capability.
The foundational pattern. Alternates Thought → Action → Observation in a loop. Simple, reliable, and the most widely used. Each step is a single LLM call that produces reasoning and an action. Works well for tasks requiring 1–10 tool calls. Frameworks: LangChain, LlamaIndex.
Separates planning from execution. A "planner" LLM generates a full task list upfront, then an "executor" LLM handles each step. The planner reviews and revises the plan between steps. Better for long-horizon tasks (10+ steps) because the plan provides structure. More expensive: roughly two LLM calls per executed step when the planner reviews between every step.
Generates a task DAG (directed acyclic graph) with dependencies, then executes independent tasks in parallel. Dramatically faster for tasks with parallelizable sub-tasks (e.g., searching multiple sources simultaneously). Complex orchestration logic but significant latency reduction.
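Wave-based DAG execution can be sketched in a few lines: at each pass, every step whose dependencies are satisfied fires concurrently via `Promise.all`. The `execute` callback is a hypothetical per-step tool runner; real orchestrators add timeouts, retries, and failure handling.

```javascript
// Execute a plan as waves of parallel steps. Each wave contains every
// not-yet-done step whose dependencies are all complete.
async function runDag(plan, execute) {
  const done = new Set();
  while (done.size < plan.length) {
    const wave = plan.filter(
      (s) => !done.has(s.id) && s.deps.every((d) => done.has(d))
    );
    if (wave.length === 0) {
      throw new Error("Cycle or unsatisfiable dependency in plan");
    }
    await Promise.all(wave.map((s) => execute(s))); // independent steps in parallel
    wave.forEach((s) => done.add(s.id));
  }
}
```

With two independent searches and one dependent aggregation step, the searches run in the same wave, which is where the latency reduction comes from.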
Optimized for dialogue. The agent maintains conversation state and can ask clarifying questions, request confirmation before destructive actions, and adapt its approach based on user feedback mid-task. Used in customer support, personal assistants, and interactive workflows.
A lightweight "dispatcher" that classifies the user's intent and routes to specialized sub-agents. Each sub-agent is optimized for its domain (code agent, research agent, writing agent). The router's job is just classification — it's fast and cheap. Good for systems with distinct capability domains.
After producing a result, the agent evaluates its own output for quality, accuracy, and completeness. If it fails its own quality check, it revises and tries again. This "inner critic" pattern trades latency for reliability and is commonly used when output quality is critical.
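The inner-critic control flow is a small loop around two model calls. In this sketch, `generate` and `critique` are hypothetical stand-ins for LLM calls: one produces a draft (optionally conditioned on feedback), the other returns a pass/fail verdict with notes.

```javascript
// Generate-critique-revise loop: retry until the agent's own quality
// check passes or the revision budget runs out.
function withSelfCheck(generate, critique, maxRevisions = 3) {
  let draft = generate(null); // initial attempt, no feedback yet
  for (let i = 0; i < maxRevisions; i++) {
    const feedback = critique(draft);
    if (feedback.ok) return draft;        // passed the quality check
    draft = generate(feedback.notes);     // revise using the critique
  }
  return draft; // best effort after exhausting the budget
}
```

Each revision costs at least two extra LLM calls, which is the latency-for-reliability tradeoff the pattern accepts.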
For complex workflows, multiple specialized agents can collaborate. Each agent has a focused role, its own tools, and its own system prompt. A supervisor or messaging protocol coordinates their interactions.
Communication patterns in multi-agent systems typically follow one of three models. In hierarchical systems, a supervisor delegates tasks and aggregates results — agents only talk to the supervisor. In peer-to-peer systems, agents communicate directly via a message bus, passing intermediate results without a central coordinator. In sequential pipeline systems, each agent's output becomes the next agent's input, like an assembly line.
Shared state is the coordination mechanism. A common pattern is a shared "blackboard" or state store where agents read and write intermediate results. The supervisor monitors this state to track progress and detect when the overall task is complete. Each agent reads the parts relevant to its sub-task and writes its results back.
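A blackboard can be as simple as a keyed store plus a completion predicate. The class below is an illustrative sketch with hypothetical names; production systems would back this with a database and add locking or versioning for concurrent writers.

```javascript
// Minimal shared blackboard: agents write results under agreed keys,
// and the supervisor checks completion against a required-key list.
class Blackboard {
  constructor(requiredKeys) {
    this.state = new Map();
    this.requiredKeys = requiredKeys;
  }
  write(agent, key, value) {
    this.state.set(key, { agent, value, at: Date.now() }); // who wrote what, when
  }
  read(key) {
    return this.state.get(key)?.value;
  }
  isComplete() {
    return this.requiredKeys.every((k) => this.state.has(k));
  }
}
```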