Coding Agents Internals Series — Presentation 01

Signal in a Million-Line Codebase

ripgrep heuristics, tree-sitter AST parsing, ctags & language servers, code embeddings, dependency graphs, caching, and the concrete repo-navigation strategies used by Aider, Cursor, and Claude Code.

ripgrep · tree-sitter · ctags · LSP · code embeddings · call graphs · Aider repo-map · Cursor · Claude Code
[Pipeline diagram: Query → ripgrep → tree-sitter → Symbol Index → Embeddings → Graph → Context Window]
00

Topics We’ll Cover

01

The Problem — Signal in a Million-Line Codebase

A coding agent receiving the instruction “fix the memory leak in the connection pool” must do something a human senior engineer does instinctively: locate the relevant 200–400 lines in a repository that might be 10–50 MB of source. Getting that wrong — either missing relevant context or flooding the LLM with irrelevant noise — determines whether the subsequent edit is correct.

The context-window budget

Even a 200 k-token model can hold roughly 1–1.5 MB of source at once. A 10 MLOC C++ codebase is 200–500 MB. Retrieval precision matters more than recall here: one wrong file that dominates context can confuse the model more than omitting a helper file.
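A back-of-envelope check makes the mismatch concrete. A minimal sketch, assuming 4–7 bytes of source per token (a rough, tokenizer- and language-dependent figure):

# Back-of-envelope: how much source fits in one context window?
# 4-7 bytes/token is an assumed range; real ratios vary by tokenizer.
CONTEXT_TOKENS = 200_000
for bytes_per_token in (4, 7):
    mb = CONTEXT_TOKENS * bytes_per_token / 1e6
    print(f'{bytes_per_token} B/token -> {mb:.1f} MB of source')
# -> roughly 0.8-1.4 MB, against a 200-500 MB repo: under 1% fits.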

Three retrieval regimes

  • Lexical — exact string match (ripgrep / grep)
  • Structural — AST / symbol queries (tree-sitter, ctags, LSP)
  • Semantic — dense vector nearest-neighbour (code embeddings)

Most agents layer all three; the trick is knowing which to invoke first.

The Goldilocks problem

Load too little and the model hallucinates — it invents a function signature it has never seen. Load too much and the model loses the thread of the actual task in a sea of tangentially related code. The optimal context for a single edit is typically 3–8 files, which means the retrieval step must achieve >80% precision even before the model reads a line.

Benchmarks that quantify the gap

Benchmark | Task | Key signal
SWE-bench Verified | Fix real GitHub issues drawn from 12 popular Python repos | Only 26–50% of agent attempts succeed; most failures are context retrieval, not model reasoning
RepoBench | Next-line completion given a cross-file dependency | Retrieval quality alone accounts for ~15% absolute accuracy delta across methods
RepoQA | Long-context function-level comprehension | Models degrade sharply when the relevant function is buried beyond 64 k tokens of context
02

ripgrep Heuristics That Beat Embeddings

Andrew Gallant’s ripgrep (first released 2016, current stable 14.x) processes 1–3 GB/s of source text on modern hardware thanks to SIMD-accelerated regex matching, memory-mapped I/O, and aggressive gitignore-aware file filtering. For exact identifier lookup it outperforms semantic search by a wide margin on latency, cost, and precision.

ripgrep — key flags an agent runtime actually sets
# Find all callers of `connection_pool_acquire` (word-bounded, multiline context)
rg 'connection_pool_acquire' \
   --type rust \
   --word-regexp \
   --context 4 \
   --max-count 30 \
   --json            # machine-readable output consumed by the agent

# Locate the definition (function/method signature) only
rg '^\s*(pub )?(async )?fn connection_pool_acquire' \
   --type rust \
   --line-number \
   --no-heading

Why lexical search wins on identifiers

Cold-cache latency

ripgrep answers a cold query on a 5 MLOC repo in ~0.3 s with zero prior setup. Dense vector search answers in 10–80 ms, but only after warm-up: building the vector index requires a multi-minute GPU or API pass upfront. ripgrep requires nothing.

Precision on identifier names

Identifiers like pg_conn_pool are essentially unique tokens. Embedding models sometimes cluster semantically similar but differently-named functions, returning false positives.

Multi-file diffusion

A call-graph query like “all files that import this module” is a pure grep pattern (^import.*mymodule). Semantic search has no concept of import topology.

Heuristic ordering used by Claude Code (observed behaviour)

1. Try ripgrep for the exact symbol name. If ≥1 hit, load those files.
2. Try ripgrep for adjacent identifiers extracted from the user query.
3. Fall back to semantic search only if lexical search returns zero results or the query is clearly conceptual (“anything related to authentication”).
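A minimal sketch of that ordering, assuming ripgrep is on PATH; semantic_search stands in for a hypothetical embedding fallback, not a real API:

import re
import subprocess

def retrieve(symbols: list[str], query: str) -> list[str]:
    # 1. Exact symbol names first: cheap, precise.
    for pattern in symbols:
        hits = subprocess.run(
            ['rg', '--word-regexp', '--files-with-matches', pattern],
            capture_output=True, text=True).stdout.splitlines()
        if hits:
            return hits
    # 2. Identifier-like tokens pulled from the raw user query.
    for token in re.findall(r'[A-Za-z_][A-Za-z0-9_]{3,}', query):
        hits = subprocess.run(
            ['rg', '--word-regexp', '--files-with-matches', token],
            capture_output=True, text=True).stdout.splitlines()
        if hits:
            return hits
    # 3. Zero lexical hits: only now pay for semantic search.
    return semantic_search(query)  # hypothetical embedding fallback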

Pathological cases for lexical search

Conceptual queries with no distinctive identifier (“anything related to authentication”), symbols renamed since the user last saw them, and identifiers constructed dynamically at runtime (getattr, string concatenation) all return zero or misleading hits. These are exactly the cases step 3 above hands off to semantic search.

03

AST Parsing — Tree-Sitter

tree-sitter (Max Brunsfeld, GitHub, 2018 — v0.22 as of 2024) is an incremental, error-tolerant GLR parser generator. It compiles a grammar to a C shared library and exposes a uniform C API (with bindings for Rust, Python, Node, Go, etc.). The key property for agents: it re-parses only the changed subtree after an edit, so incremental updates take <1 ms even on files up to ~100 k lines.

Python binding — extracting all function names and docstrings
from tree_sitter import Language, Parser
import tree_sitter_python as tspython

PY_LANG = Language(tspython.language())
parser  = Parser(PY_LANG)

def extract_functions(source: bytes) -> list[dict]:
    tree  = parser.parse(source)
    query = PY_LANG.query("""
        (function_definition
          name: (identifier) @fn.name
          body: (block . (expression_statement
                  (string) @fn.docstring)?))
    """)
    results = []
    # py-tree-sitter 0.22+: matches() yields (pattern_index, captures)
    # where captures maps each capture name to a *list* of nodes.
    for _idx, captures in query.matches(tree.root_node):
        doc_nodes = captures.get('fn.docstring', [])
        for name_node in captures.get('fn.name', []):
            results.append({
                'name': name_node.text.decode(),
                'row':  name_node.start_point[0],
                'doc':  doc_nodes[0].text.decode() if doc_nodes else None,
            })
    return results
Why tree-sitter over regex for structure

A regex for Python function definitions breaks on decorators, nested functions, type annotations with complex generics, and multiline signatures. tree-sitter’s concrete syntax tree is correct by construction for all of these. The error recovery property is equally important: even in syntactically broken files (mid-edit state), tree-sitter produces a partial tree with ERROR nodes rather than aborting — an agent can still extract the 95% of symbols that are intact.
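A quick illustration of that error tolerance, reusing the parser and extract_functions from the snippet above on a deliberately broken file:

# Mid-edit state: the second def is syntactically invalid.
broken = b'def good():\n    """Intact."""\n    return 1\n\ndef broken(:\n'
tree = parser.parse(broken)
print(tree.root_node.has_error)                        # True, but not fatal
print([f['name'] for f in extract_functions(broken)])  # still finds 'good'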

Grammar coverage as of 2025

Tier 1: Python · Rust · TypeScript · JavaScript · Go · C / C++ · Java · C#
Tier 2: Ruby · Swift · Kotlin · Scala · Elixir · Haskell · OCaml · Zig
Config: JSON · YAML · TOML · Dockerfile · SQL · Terraform HCL
04

Symbol Indexing — ctags & Language Servers

After lexical search and AST parsing, the next layer is a persistent symbol index that maps identifier names to file, line, and kind (function, class, variable, etc.) across the entire repository without re-parsing on every query.

Universal Ctags

Universal Ctags (fork of Exuberant Ctags, actively maintained since 2014, v6.1 in 2024) supports 150+ languages via hand-written and user-definable regex parsers. Its output is a plain-text tags file: sorted for binary search, pipeable through fzf, or loaded directly into agent memory.

Generating a tags file and querying it from a Python agent
# Generate (takes ~8 s on a 1 MLOC Python/Rust mixed repo).
# --fields=+nKZ adds line number, kind, and scope;
# --extras=+q adds qualified names (module.ClassName.method).
ctags \
  --recurse \
  --sort=yes \
  --fields=+nKZ \
  --extras=+q \
  --output-format=json \
  --exclude='.git' --exclude='node_modules' \
  -f '.ctags_cache/tags.json' .

# Query: find the definition of `ConnectionPool` (Python)
import json
with open('.ctags_cache/tags.json') as f:
    tags = [json.loads(line) for line in f]
hits = [t for t in tags
        if t.get('_type') == 'tag'
        and t['name'] == 'ConnectionPool' and t['kind'] == 'class']

Language Server Protocol — go-to-definition at scale

LSP (Language Server Protocol, Microsoft, 2016 — spec 3.17 current) turns a language compiler into a query server. For agents that run long sessions on a single repository, spinning up rust-analyzer, pyright, or clangd unlocks type-resolved symbol queries unavailable from ctags.

LSP request | Agent use-case | Key servers
textDocument/definition | Jump to the true definition even through type aliases, imports, macros | rust-analyzer, pyright, clangd
textDocument/references | Find all call sites of a function before renaming or deleting it | any LSP server
textDocument/documentSymbol | List all symbols in one file — faster than re-parsing with tree-sitter | any LSP server
workspace/symbol | Cross-file symbol search by name — replaces ctags for live sessions | rust-analyzer, gopls
textDocument/callHierarchy | Incoming and outgoing call edges for a given function | clangd 12+, rust-analyzer
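For orientation, this is what the wire format looks like: JSON-RPC 2.0 framed with a Content-Length header over the server’s stdin/stdout. A minimal sketch (paths are placeholders; the same framing works for any stdio LSP server):

import json
import subprocess

def lsp_message(method: str, params: dict, msg_id: int) -> bytes:
    # LSP framing: Content-Length header, blank line, JSON-RPC body.
    body = json.dumps({'jsonrpc': '2.0', 'id': msg_id,
                       'method': method, 'params': params}).encode()
    return b'Content-Length: %d\r\n\r\n' % len(body) + body

server = subprocess.Popen(['rust-analyzer'],
                          stdin=subprocess.PIPE, stdout=subprocess.PIPE)
server.stdin.write(lsp_message('initialize', {
    'processId': None,
    'rootUri': 'file:///path/to/repo',    # placeholder
    'capabilities': {},
}, msg_id=1))
server.stdin.flush()
# After the initialize handshake, a definition lookup is one message:
# lsp_message('textDocument/definition', {
#     'textDocument': {'uri': 'file:///path/to/repo/src/pool.rs'},
#     'position': {'line': 41, 'character': 17}}, msg_id=2)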
When NOT to use an LSP server

Starting rust-analyzer on a large workspace takes 30–120 s of initial indexing and >500 MB RAM. For one-shot CI tasks or ephemeral containers, ctags is faster to initialise and uses <10 MB. LSP shines for long-running interactive agent sessions (Cursor, Claude Code in a persistent session) where the startup cost is amortised.

05

Code Embeddings — When They Help, When They Don’t

Dense vector retrieval over code embeddings is the third layer. Unlike lexical or structural search, it can answer conceptual queries: “where is retry logic implemented?”, “show me all error-handling patterns”. The cost is a preprocessing step and a vector database.
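A minimal sketch of the query side, assuming chunks were already embedded in a preprocessing pass into a chunk_vecs matrix (a placeholder here) and using OpenAI’s text-embedding-3-large from the table below:

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(
        model='text-embedding-3-large', input=text)
    return np.array(resp.data[0].embedding)

q = embed('where is retry logic implemented?')
# chunk_vecs: (n_chunks, dim) matrix built in the preprocessing pass
scores = chunk_vecs @ q / (
    np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
top5 = scores.argsort()[-5:][::-1]    # indices of the most similar chunks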

Model landscape (2024–2025)

Model | Dim | Max tokens | Strength
voyage-code-3 (Voyage AI, 2024) | 1 024 / 2 048 | 16 000 | Best on the CoIR benchmark; multilingual, long-context; recommended for production
text-embedding-3-large (OpenAI) | 3 072 | 8 191 | Strong general + code; Matryoshka — can truncate to 256 d with minor loss
CodeBERT (Microsoft, 2020) | 768 | 512 | Historically important, now outperformed; trained on NL+code bimodal pairs
UniXcoder (Microsoft, 2022) | 768 | 1 024 | Unified encoder/decoder; strong on code summarisation and clone detection
cohere-embed-v3 (code mode) | 1 024 | 512 | Input-type parameter distinguishes code vs. prose; useful for mixed corpora

CoIR benchmark results (December 2024)

CoIR average NDCG@10 (higher = better), selected models: voyage-code-3 77.4 · text-embedding-3-large 67.6 · cohere-embed-v3 61.9 · CodeBERT 38.7
Where embeddings fail for agents

  • Exact-name lookup — asking “find the parse_jwt function” via embeddings returns the ten most semantically similar functions; many will not be parse_jwt. ripgrep finds it in 0.1 s with 100% precision.
  • Config & schema files — YAML and JSON embed poorly; ripgrep patterns like ^ port: are far more reliable.
  • Cold indexing cost — embedding a 1 MLOC repo (batched at ~0.5 M tokens per API call) costs $0.50–$2.00 and takes 5–20 minutes. Budget carefully in CI pipelines.

06

Dependency & Call Graphs

When an agent needs to understand the blast radius of a change — which modules will break if I rename this function? which callers must be updated? — a call graph or dependency graph is the right data structure. Pure lexical and semantic search miss transitive effects.

Build-time dependency graphs

Python — importlib / AST import scan

Walk the AST for import and from X import Y nodes, resolving absolute imports against sys.path and relative imports against the package structure. Tools: modulegraph, pydeps (generates a DOT graph), pyproject-deps. Works without executing the code — important for agent safety.
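A minimal sketch of such a scan; the dotted-module-name resolution is deliberately simplified:

import ast
import pathlib

def import_edges(root: str) -> list[tuple[str, str]]:
    edges = []
    for path in pathlib.Path(root).rglob('*.py'):
        module = '.'.join(path.with_suffix('').parts)  # crude dotted name
        for node in ast.walk(ast.parse(path.read_text())):
            if isinstance(node, ast.Import):
                edges += [(module, alias.name) for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                edges.append((module, node.module))
    return edges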

Rust — cargo metadata

cargo metadata --format-version 1 | jq '.packages[].dependencies' gives the full crate dependency tree including dev and build dependencies, with resolved versions. The inter-crate call graph requires cargo-call-stack or MIR analysis.

Runtime call graphs

Static Python call-graph extraction with ast + networkx (runtime alternatives: pycallgraph2, py-spy)
# Static analysis approach (no execution needed) using ast + networkx
import ast
import networkx as nx

def build_call_graph(files: list[str]) -> nx.DiGraph:
    G = nx.DiGraph()                                 # edge: caller -> callee
    for path in files:
        with open(path) as f:
            tree = ast.parse(f.read())
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                caller = node.name
                G.add_node(caller, file=path, line=node.lineno)
                for child in ast.walk(node):
                    # Direct-name calls only; attribute calls (obj.method)
                    # would need type resolution and are skipped here.
                    if isinstance(child, ast.Call) and isinstance(child.func, ast.Name):
                        G.add_edge(caller, child.func.id)
    return G

# Agent use: find all transitive callers of `acquire_connection`
G = build_call_graph(python_files)                   # python_files: list of paths
callers = list(nx.ancestors(G, 'acquire_connection'))  # reverse reachability
Practical limit: dynamic dispatch

Static call graphs are incomplete for languages with dynamic dispatch (Python duck-typing, virtual methods, callbacks, closures, eval). For Rust (monomorphised generics) and Go (explicit interfaces), static analysis gets you 90%+. For Python you typically need a type-annotated codebase plus pyright’s call-hierarchy analysis, or a runtime profiler (py-spy) to complete the graph. Most agent frameworks accept this limitation and use static graphs for “safe to rename?” checks, not correctness guarantees.

07

Caching Strategies for Large Repos

Retrieval speed on large repos is a function of what is pre-computed and how stale that cache is. There are four distinct caches in a mature agent system:

File-mtime index — which files changed since last run
Incremental tree-sitter parse cache (per-file SHA256 keyed)
ctags / symbol database (SQLite, re-indexed only for changed files)
Embedding vector store (Chroma / Qdrant / FAISS — updated per-chunk on change)
Cache layer | Invalidation key | Regen cost (1 MLOC) | Staleness risk
File-mtime scan | inotify / FSEvents / polling | <0.1 s | None — always accurate
tree-sitter AST | file SHA-256 | ~2 s (incremental) | Low — per-file
ctags / LSP index | file SHA-256 + language version | 8–30 s (delta) | Low with inotify trigger
Embeddings (code chunks) | chunk hash (sliding-window content) | 5–20 min (full); seconds (delta) | Medium — embedding model upgrades invalidate all
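A sketch of the two cheapest layers working together: mtime as a fast filter, SHA-256 as the real invalidation key. The cache path and file layout are illustrative assumptions:

import hashlib, json, pathlib

CACHE = pathlib.Path('.agent_cache/file_index.json')

def changed_files(root: str) -> list[str]:
    old = json.loads(CACHE.read_text()) if CACHE.exists() else {}
    new, dirty = {}, []
    for p in pathlib.Path(root).rglob('*.py'):
        mtime = p.stat().st_mtime
        entry = old.get(str(p))
        if entry and entry['mtime'] == mtime:
            new[str(p)] = entry              # mtime unchanged: skip hashing
            continue
        sha = hashlib.sha256(p.read_bytes()).hexdigest()
        new[str(p)] = {'mtime': mtime, 'sha': sha}
        if not entry or entry['sha'] != sha:
            dirty.append(str(p))             # content actually changed
    CACHE.parent.mkdir(exist_ok=True)
    CACHE.write_text(json.dumps(new))
    return dirty                             # feed into parser / ctags / embeddings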

Cache storage patterns

SQLite for symbol caches

Universal Ctags --output-format=json piped into SQLite gives a queryable index with full-text search. A 1 MLOC repo produces ~50 MB. Incremental updates patch only the rows for modified files. WAL mode gives concurrent read access during re-index.
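A minimal sketch of that pattern; the schema is an assumption for illustration, not a standard:

import json, sqlite3

db = sqlite3.connect('.ctags_cache/symbols.db')
db.execute('PRAGMA journal_mode=WAL')        # readers stay live during re-index
db.execute('''CREATE TABLE IF NOT EXISTS symbols
              (name TEXT, kind TEXT, path TEXT, line INTEGER)''')
db.execute('CREATE INDEX IF NOT EXISTS idx_name ON symbols(name)')
with open('.ctags_cache/tags.json') as f:
    rows = [(t['name'], t.get('kind'), t['path'], t.get('line'))
            for t in map(json.loads, f) if t.get('_type') == 'tag']
db.executemany('INSERT INTO symbols VALUES (?, ?, ?, ?)', rows)
db.commit()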

Chroma / Qdrant for embeddings

Both support upsert-by-id — hash the chunk content, use that as the vector ID, and upserts become no-ops for unchanged chunks. Qdrant’s HNSW index (M=16, ef=100) gives 1–5 ms query latency on 10 M vectors with 512–1 024 d embeddings.
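A sketch of content-hash upserts with Chroma (Chroma embeds documents with its default model unless vectors are supplied explicitly):

import hashlib
import chromadb

client = chromadb.PersistentClient(path='.agent_cache/vectors')
chunks = client.get_or_create_collection('code_chunks')

def upsert_chunk(text: str, path: str) -> None:
    # Content hash as the vector ID: re-upserting an unchanged
    # chunk is a no-op, so incremental re-indexing is nearly free.
    cid = hashlib.sha256(text.encode()).hexdigest()
    chunks.upsert(ids=[cid], documents=[text], metadatas=[{'path': path}])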

The prompt-cache tier (Anthropic & OpenAI)

Both Anthropic and OpenAI now offer server-side prompt caching: if the first N tokens of a request match a prior request, the KV state is reused and the cached tokens are billed at a steep discount (roughly 10% of the normal input price on Anthropic, 50% on OpenAI). For agents that load a fixed repo map or system prompt on every turn, this alone can reduce inference cost by 70–80%. The practical trick: always put stable content (repo map, file index) first in the context, variable content (current conversation) last.
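A sketch of that ordering with Anthropic’s API; the model id is an example and REPO_MAP is a placeholder for the stable repo map string:

from anthropic import Anthropic

client = Anthropic()
response = client.messages.create(
    model='claude-sonnet-4-20250514',     # example model id
    max_tokens=1024,
    system=[{
        'type': 'text',
        'text': REPO_MAP,                 # stable content first...
        'cache_control': {'type': 'ephemeral'},   # ...and cached
    }],
    messages=[                            # variable content last
        {'role': 'user', 'content': 'fix the leak in the connection pool'},
    ],
)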

08

Claude Code, Aider Repo Map, Cursor @-References

Three production agents have made different architectural choices about how to build their context window from a codebase. All three are open to inspection — via source code, published papers, or observed behaviour.

Aider Repo Map (Paul Gauthier)

Aider generates a compact repo map (built with Universal Ctags in early versions, with tree-sitter queries since late 2023): a symbol inventory listing every file’s exports (functions, classes, constants) in ~5–15 lines per file. A PageRank-style weighting over the cross-reference graph promotes files that many others reference, pruning the map to fit the token budget. The map is regenerated each turn if files have changed. On a 100 k-line Python repo the map is typically 2–4 k tokens.

Aider repo map output (abridged example)

src/db/pool.py:
ConnectionPool.__init__(self, dsn, min_size, max_size)
ConnectionPool.acquire(self) -> Connection
ConnectionPool.release(self, conn)
ConnectionPool._evict_idle(self, max_idle_secs)

src/db/migrations.py:
run_migrations(engine, target_revision: str)
_check_alembic_version(engine) -> str

Cursor @-References

Cursor exposes explicit @-symbols to the user: @file, @folder, @codebase, @web, @docs, @git. Behind the scenes, @codebase triggers a hybrid retrieval pipeline: ripgrep for identifiers extracted from the query, then a code-embedding nearest-neighbour pass (Cursor runs its own embedding service), results re-ranked by a cross-encoder. The user sees the top N files; the agent sees their full text.
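Cursor’s exact fusion and re-ranking logic is not public; reciprocal-rank fusion is a common way to sketch the merge step between the lexical and embedding passes:

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal-rank fusion: each input list votes 1/(k + rank) per file.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, path in enumerate(ranking, start=1):
            scores[path] = scores.get(path, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# lexical_hits / semantic_hits: file paths from the two passes (placeholders)
merged = rrf([lexical_hits, semantic_hits])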

Claude Code heuristics

File-tree bootstrapping

On session start Claude Code reads CLAUDE.md (if present), then a directory listing to depth 2–3, plus any package.json / Cargo.toml / pyproject.toml it finds. This gives module structure without loading source.

Tool-driven retrieval

Rather than a pre-computed index, Claude Code issues explicit Read and Bash (ripgrep) tool calls — the retrieval is in-context reasoning. The model decides what to look up next based on what it has seen. This trades index-free simplicity for more tool-call round-trips.

Trade-off summary

Aider: pre-computed repo map — deterministic, cheap, works offline, misses semantic queries.
Cursor: hybrid index with user-facing @-references — most flexible, requires running services.
Claude Code: zero pre-computation, in-context retrieval via tool calls — no setup, more tokens per turn, model must reason correctly about what to fetch.

09

What to Take Away

Where to next

Deck 02 covers what happens after the context window is loaded: how agents actually make changes — search-replace blocks, unified diffs, multi-file atomic edits, and failure recovery when the model produces a bad patch.