Coding Agents Internals Series — Presentation 01

Signal in a Million-Line Codebase

ripgrep heuristics, tree-sitter AST parsing, ctags & language servers, code embeddings, dependency graphs, caching, and the concrete repo-navigation strategies used by Aider, Cursor, and Claude Code.

ripgrep · tree-sitter · ctags · LSP · code embeddings · call graphs · Aider repo-map · Cursor · Claude Code
[Pipeline diagram: Query → ripgrep → tree-sitter → Symbol Index → Embeddings → Graph → Context Window]
00

Topics We’ll Cover

01

The Problem — Signal in a Million-Line Codebase

A coding agent receiving the instruction “fix the memory leak in the connection pool” must do something a human senior engineer does instinctively: locate the relevant 200–400 lines in a repository that might be 10–50 MB of source. Getting that wrong — either missing relevant context or flooding the LLM with irrelevant noise — determines whether the subsequent edit is correct.

The context-window budget

Even a 200 k-token model can hold roughly 1–1.5 MB of source at once. A 10 MLOC C++ codebase is 200–500 MB. Retrieval precision matters more than recall here: one wrong file that dominates context can confuse the model more than omitting a helper file.
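A back-of-envelope check makes the mismatch concrete. A minimal sketch, assuming 4–7 bytes of source per token (a rough, tokenizer- and language-dependent figure):

# Back-of-envelope: how much source fits in one context window?
# 4-7 bytes/token is an assumed range; real ratios vary by tokenizer.
CONTEXT_TOKENS = 200_000
for bytes_per_token in (4, 7):
    mb = CONTEXT_TOKENS * bytes_per_token / 1e6
    print(f'{bytes_per_token} B/token -> {mb:.1f} MB of source')
# -> roughly 0.8-1.4 MB, against a 200-500 MB repo: under 1% fits.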

Three retrieval regimes

  • Lexical — exact string match (ripgrep / grep)
  • Structural — AST / symbol queries (tree-sitter, ctags, LSP)
  • Semantic — dense vector nearest-neighbour (code embeddings)

Most agents layer all three; the trick is knowing which to invoke first.

The Goldilocks problem

Load too little and the model hallucinates — it invents a function signature it has never seen. Load too much and the model loses the thread of the actual task in a sea of tangentially related code. The optimal context for a single edit is typically 3–8 files, which means the retrieval step must achieve >80% precision even before the model reads a line.

Benchmarks that quantify the gap

Benchmark | Task | Key signal
SWE-bench Verified | Fix real GitHub issues drawn from 12 popular Python repos | Only 26–50% of agent attempts succeed; most failures are context retrieval, not model reasoning
RepoBench | Next-line completion given a cross-file dependency | Retrieval quality alone accounts for ~15% absolute accuracy delta across methods
RepoQA | Long-context function-level comprehension | Models degrade sharply when the relevant function is buried beyond 64 k tokens of context
02

ripgrep Heuristics That Beat Embeddings

Andrew Gallant’s ripgrep (first released 2016, current stable 14.x) processes 1–3 GB/s of source text on modern hardware thanks to SIMD-accelerated regex matching, memory-mapped I/O, and aggressive gitignore-aware file filtering. For exact identifier lookup it outperforms semantic search by a wide margin on latency, cost, and precision.

ripgrep — key flags an agent runtime actually sets
# Find all callers of `connection_pool_acquire` (word-bounded, multiline context)
rg 'connection_pool_acquire' \
   --type rust \
   --word-regexp \
   --context 4 \
   --max-count 30 \
   --json            # machine-readable output consumed by the agent

# Locate the definition (function/method signature) only
rg '^\s*(pub )?(async )?fn connection_pool_acquire' \
   --type rust \
   --line-number \
   --no-heading

Why lexical search wins on identifiers

Cold-cache latency

ripgrep answers a cold query on a 5 MLOC repo in ~0.3 s with zero prior setup. Dense vector search answers in 10–80 ms, but only after warm-up: building the vector index requires a multi-minute GPU or API pass upfront. ripgrep requires nothing.

Precision on identifier names

Identifiers like pg_conn_pool are essentially unique tokens. Embedding models sometimes cluster semantically similar but differently-named functions, returning false positives.

Multi-file diffusion

A call-graph query like “all files that import this module” is a pure grep pattern (^import.*mymodule). Semantic search has no concept of import topology.

Heuristic ordering used by Claude Code (observed behaviour)

1. Try ripgrep for the exact symbol name. If ≥1 hit, load those files.
2. Try ripgrep for adjacent identifiers extracted from the user query.
3. Fall back to semantic search only if lexical search returns zero results or the query is clearly conceptual (“anything related to authentication”).
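A minimal sketch of that ordering, assuming ripgrep is on PATH; semantic_search stands in for a hypothetical embedding fallback, not a real API:

import re
import subprocess

def retrieve(symbols: list[str], query: str) -> list[str]:
    # 1. Exact symbol names first: cheap, precise.
    for pattern in symbols:
        hits = subprocess.run(
            ['rg', '--word-regexp', '--files-with-matches', pattern],
            capture_output=True, text=True).stdout.splitlines()
        if hits:
            return hits
    # 2. Identifier-like tokens pulled from the raw user query.
    for token in re.findall(r'[A-Za-z_][A-Za-z0-9_]{3,}', query):
        hits = subprocess.run(
            ['rg', '--word-regexp', '--files-with-matches', token],
            capture_output=True, text=True).stdout.splitlines()
        if hits:
            return hits
    # 3. Zero lexical hits: only now pay for semantic search.
    return semantic_search(query)  # hypothetical embedding fallback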

Pathological cases for lexical search

Conceptual queries with no distinctive identifier (“anything related to authentication”), symbols renamed since the user last saw them, and identifiers constructed dynamically at runtime (getattr, string concatenation) all return zero or misleading hits. These are exactly the cases step 3 above hands off to semantic search.

03

AST Parsing — Tree-Sitter

tree-sitter (Max Brunsfeld, GitHub, 2018 — v0.22 as of 2024) is an incremental, error-tolerant GLR parser generator. It compiles a grammar to a C shared library and exposes a uniform C API (with bindings for Rust, Python, Node, Go, etc.). The key property for agents: it re-parses only the changed subtree after an edit, so incremental updates take <1 ms even on files up to ~100 k lines.

Python binding — extracting all function names and docstrings
from tree_sitter import Language, Parser
import tree_sitter_python as tspython

PY_LANG = Language(tspython.language())
parser  = Parser(PY_LANG)

def extract_functions(source: bytes) -> list[dict]:
    tree  = parser.parse(source)
    query = PY_LANG.query("""
        (function_definition
          name: (identifier) @fn.name
          body: (block . (expression_statement
                  (string) @fn.docstring)?))
    """)
    results = []
    # py-tree-sitter 0.22+: matches() yields (pattern_index, captures)
    # where captures maps each capture name to a *list* of nodes.
    for _idx, captures in query.matches(tree.root_node):
        doc_nodes = captures.get('fn.docstring', [])
        for name_node in captures.get('fn.name', []):
            results.append({
                'name': name_node.text.decode(),
                'row':  name_node.start_point[0],
                'doc':  doc_nodes[0].text.decode() if doc_nodes else None,
            })
    return results
Why tree-sitter over regex for structure

A regex for Python function definitions breaks on decorators, nested functions, type annotations with complex generics, and multiline signatures. tree-sitter’s concrete syntax tree is correct by construction for all of these. The error recovery property is equally important: even in syntactically broken files (mid-edit state), tree-sitter produces a partial tree with ERROR nodes rather than aborting — an agent can still extract the 95% of symbols that are intact.
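A quick illustration of that error tolerance, reusing the parser and extract_functions from the snippet above on a deliberately broken file:

# Mid-edit state: the second def is syntactically invalid.
broken = b'def good():\n    """Intact."""\n    return 1\n\ndef broken(:\n'
tree = parser.parse(broken)
print(tree.root_node.has_error)                        # True, but not fatal
print([f['name'] for f in extract_functions(broken)])  # still finds 'good'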

Grammar coverage as of 2025

Tier 1: Python · Rust · TypeScript · JavaScript · Go · C / C++ · Java · C#
Tier 2: Ruby · Swift · Kotlin · Scala · Elixir · Haskell · OCaml · Zig
Config: JSON · YAML · TOML · Dockerfile · SQL · Terraform HCL
04

Symbol Indexing — ctags & Language Servers

After lexical search and AST parsing, the next layer is a persistent symbol index that maps identifier names to file, line, and kind (function, class, variable, etc.) across the entire repository without re-parsing on every query.

Universal Ctags

Universal Ctags (fork of Exuberant Ctags, actively maintained since 2014, v6.1 in 2024) supports 150+ languages via hand-written and user-definable regex parsers. Its output is a plain-text tags file: sorted for binary search, pipeable through fzf, or loaded directly into agent memory.

Generating a tags file and querying it from a Python agent
# Generate (takes ~8 s on a 1 MLOC Python/Rust mixed repo).
# --fields=+nKZ adds line number, kind, and scope;
# --extras=+q adds qualified names (module.ClassName.method).
ctags \
  --recurse \
  --sort=yes \
  --fields=+nKZ \
  --extras=+q \
  --output-format=json \
  --exclude='.git' --exclude='node_modules' \
  -f '.ctags_cache/tags.json' .

# Query: find the definition of `ConnectionPool` (Python)
import json
with open('.ctags_cache/tags.json') as f:
    tags = [json.loads(line) for line in f]
hits = [t for t in tags
        if t.get('_type') == 'tag'
        and t['name'] == 'ConnectionPool' and t['kind'] == 'class']

Language Server Protocol — go-to-definition at scale

LSP (Language Server Protocol, Microsoft, 2016 — spec 3.17 current) turns a language compiler into a query server. For agents that run long sessions on a single repository, spinning up rust-analyzer, pyright, or clangd unlocks type-resolved symbol queries unavailable from ctags.

LSP request | Agent use-case | Key servers
textDocument/definition | Jump to the true definition even through type aliases, imports, macros | rust-analyzer, pyright, clangd
textDocument/references | Find all call sites of a function before renaming or deleting it | any LSP server
textDocument/documentSymbol | List all symbols in one file — faster than re-parsing with tree-sitter | any LSP server
workspace/symbol | Cross-file symbol search by name — replaces ctags for live sessions | rust-analyzer, gopls
textDocument/callHierarchy | Incoming and outgoing call edges for a given function | clangd 12+, rust-analyzer
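For orientation, this is what the wire format looks like: JSON-RPC 2.0 framed with a Content-Length header over the server’s stdin/stdout. A minimal sketch (paths are placeholders; the same framing works for any stdio LSP server):

import json
import subprocess

def lsp_message(method: str, params: dict, msg_id: int) -> bytes:
    # LSP framing: Content-Length header, blank line, JSON-RPC body.
    body = json.dumps({'jsonrpc': '2.0', 'id': msg_id,
                       'method': method, 'params': params}).encode()
    return b'Content-Length: %d\r\n\r\n' % len(body) + body

server = subprocess.Popen(['rust-analyzer'],
                          stdin=subprocess.PIPE, stdout=subprocess.PIPE)
server.stdin.write(lsp_message('initialize', {
    'processId': None,
    'rootUri': 'file:///path/to/repo',    # placeholder
    'capabilities': {},
}, msg_id=1))
server.stdin.flush()
# After the initialize handshake, a definition lookup is one message:
# lsp_message('textDocument/definition', {
#     'textDocument': {'uri': 'file:///path/to/repo/src/pool.rs'},
#     'position': {'line': 41, 'character': 17}}, msg_id=2)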
When NOT to use an LSP server

Starting rust-analyzer on a large workspace takes 30–120 s of initial indexing and >500 MB RAM. For one-shot CI tasks or ephemeral containers, ctags is faster to initialise and uses <10 MB. LSP shines for long-running interactive agent sessions (Cursor, Claude Code in a persistent session) where the startup cost is amortised.

05

Code Embeddings — When They Help, When They Don’t

Dense vector retrieval over code embeddings is the third layer. Unlike lexical or structural search, it can answer conceptual queries: “where is retry logic implemented?”, “show me all error-handling patterns”. The cost is a preprocessing step and a vector database.
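A minimal sketch of the query side, assuming chunks were already embedded in a preprocessing pass into a chunk_vecs matrix (a placeholder here) and using OpenAI’s text-embedding-3-large from the table below:

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(
        model='text-embedding-3-large', input=text)
    return np.array(resp.data[0].embedding)

q = embed('where is retry logic implemented?')
# chunk_vecs: (n_chunks, dim) matrix built in the preprocessing pass
scores = chunk_vecs @ q / (
    np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
top5 = scores.argsort()[-5:][::-1]    # indices of the most similar chunks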

Model landscape (2024–2025)

Model | Dim | Max tokens | Strength
voyage-code-3 (Voyage AI, 2024) | 1 024 / 2 048 | 16 000 | Best on the CoIR benchmark; multilingual, long-context; recommended for production
text-embedding-3-large (OpenAI) | 3 072 | 8 191 | Strong general + code; Matryoshka — can truncate to 256 d with minor loss
CodeBERT (Microsoft, 2020) | 768 | 512 | Historically important, now outperformed; trained on NL+code bimodal pairs
UniXcoder (Microsoft, 2022) | 768 | 1 024 | Unified encoder/decoder; strong on code summarisation and clone detection
cohere-embed-v3 (code mode) | 1 024 | 512 | Input-type parameter distinguishes code vs. prose; useful for mixed corpora

CoIR benchmark results (December 2024)

CoIR average NDCG@10 (higher = better), selected models: voyage-code-3 77.4 · text-embedding-3-large 67.6 · cohere-embed-v3 61.9 · CodeBERT 38.7
Where embeddings fail for agents

  • Exact-name lookup — asking “find the parse_jwt function” via embeddings returns the ten most semantically similar functions; many will not be parse_jwt. ripgrep finds it in 0.1 s with 100% precision.
  • Config & schema files — YAML and JSON embed poorly; ripgrep patterns like ^ port: are far more reliable.
  • Cold indexing cost — embedding a 1 MLOC repo (batched at ~0.5 M tokens per API call) costs $0.50–$2.00 and takes 5–20 minutes. Budget carefully in CI pipelines.

06

Dependency & Call Graphs

When an agent needs to understand the blast radius of a change — which modules will break if I rename this function? which callers must be updated? — a call graph or dependency graph is the right data structure. Pure lexical and semantic search miss transitive effects.

Build-time dependency graphs

Python — importlib / AST import scan

Walk the AST for import and from X import Y nodes, resolving absolute imports against sys.path and relative imports against the package structure. Tools: modulegraph, pydeps (generates a DOT graph), pyproject-deps. Works without executing the code — important for agent safety.
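A minimal sketch of such a scan; the dotted-module-name resolution is deliberately simplified:

import ast
import pathlib

def import_edges(root: str) -> list[tuple[str, str]]:
    edges = []
    for path in pathlib.Path(root).rglob('*.py'):
        module = '.'.join(path.with_suffix('').parts)  # crude dotted name
        for node in ast.walk(ast.parse(path.read_text())):
            if isinstance(node, ast.Import):
                edges += [(module, alias.name) for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                edges.append((module, node.module))
    return edges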

Rust — cargo metadata

cargo metadata --format-version 1 | jq '.packages[].dependencies' gives the full crate dependency tree including dev and build dependencies, with resolved versions. The inter-crate call graph requires cargo-call-stack or MIR analysis.

Runtime call graphs

Static Python call-graph extraction with ast + networkx (runtime alternatives: pycallgraph2, py-spy)
# Static analysis approach (no execution needed) using ast + networkx
import ast
import networkx as nx

def build_call_graph(files: list[str]) -> nx.DiGraph:
    G = nx.DiGraph()                                 # edge: caller -> callee
    for path in files:
        with open(path) as f:
            tree = ast.parse(f.read())
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                caller = node.name
                G.add_node(caller, file=path, line=node.lineno)
                for child in ast.walk(node):
                    # Direct-name calls only; attribute calls (obj.method)
                    # would need type resolution and are skipped here.
                    if isinstance(child, ast.Call) and isinstance(child.func, ast.Name):
                        G.add_edge(caller, child.func.id)
    return G

# Agent use: find all transitive callers of `acquire_connection`
G = build_call_graph(python_files)                   # python_files: list of paths
callers = list(nx.ancestors(G, 'acquire_connection'))  # reverse reachability
Practical limit: dynamic dispatch

Static call graphs are incomplete for languages with dynamic dispatch (Python duck-typing, virtual methods, callbacks, closures, eval). For Rust (monomorphised generics) and Go (explicit interfaces), static analysis gets you 90%+. For Python you typically need a type-annotated codebase plus pyright’s call-hierarchy analysis, or a runtime profiler (py-spy) to complete the graph. Most agent frameworks accept this limitation and use static graphs for “safe to rename?” checks, not correctness guarantees.

07

Caching Strategies for Large Repos

Retrieval speed on large repos is a function of what is pre-computed and how stale that cache is. There are four distinct caches in a mature agent system:

File-mtime index — which files changed since last run
Incremental tree-sitter parse cache (per-file SHA256 keyed)
ctags / symbol database (SQLite, re-indexed only for changed files)
Embedding vector store (Chroma / Qdrant / FAISS — updated per-chunk on change)
Cache layer | Invalidation key | Regen cost (1 MLOC) | Staleness risk
File-mtime scan | inotify / FSEvents / polling | <0.1 s | None — always accurate
tree-sitter AST | file SHA-256 | ~2 s (incremental) | Low — per-file
ctags / LSP index | file SHA-256 + language version | 8–30 s (delta) | Low with inotify trigger
Embeddings (code chunks) | chunk hash (sliding-window content) | 5–20 min (full); seconds (delta) | Medium — embedding model upgrades invalidate all
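A sketch of the two cheapest layers working together: mtime as a fast filter, SHA-256 as the real invalidation key. The cache path and file layout are illustrative assumptions:

import hashlib, json, pathlib

CACHE = pathlib.Path('.agent_cache/file_index.json')

def changed_files(root: str) -> list[str]:
    old = json.loads(CACHE.read_text()) if CACHE.exists() else {}
    new, dirty = {}, []
    for p in pathlib.Path(root).rglob('*.py'):
        mtime = p.stat().st_mtime
        entry = old.get(str(p))
        if entry and entry['mtime'] == mtime:
            new[str(p)] = entry              # mtime unchanged: skip hashing
            continue
        sha = hashlib.sha256(p.read_bytes()).hexdigest()
        new[str(p)] = {'mtime': mtime, 'sha': sha}
        if not entry or entry['sha'] != sha:
            dirty.append(str(p))             # content actually changed
    CACHE.parent.mkdir(exist_ok=True)
    CACHE.write_text(json.dumps(new))
    return dirty                             # feed into parser / ctags / embeddings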

Cache storage patterns

SQLite for symbol caches

Universal Ctags --output-format=json piped into SQLite gives a queryable index with full-text search. A 1 MLOC repo produces ~50 MB. Incremental updates patch only the rows for modified files. WAL mode gives concurrent read access during re-index.
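A minimal sketch of that pattern; the schema is an assumption for illustration, not a standard:

import json, sqlite3

db = sqlite3.connect('.ctags_cache/symbols.db')
db.execute('PRAGMA journal_mode=WAL')        # readers stay live during re-index
db.execute('''CREATE TABLE IF NOT EXISTS symbols
              (name TEXT, kind TEXT, path TEXT, line INTEGER)''')
db.execute('CREATE INDEX IF NOT EXISTS idx_name ON symbols(name)')
with open('.ctags_cache/tags.json') as f:
    rows = [(t['name'], t.get('kind'), t['path'], t.get('line'))
            for t in map(json.loads, f) if t.get('_type') == 'tag']
db.executemany('INSERT INTO symbols VALUES (?, ?, ?, ?)', rows)
db.commit()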

Chroma / Qdrant for embeddings

Both support upsert-by-id — hash the chunk content, use that as the vector ID, and upserts become no-ops for unchanged chunks. Qdrant’s HNSW index (M=16, ef=100) gives 1–5 ms query latency on 10 M vectors with 512–1 024 d embeddings.
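A sketch of content-hash upserts with Chroma (Chroma embeds documents with its default model unless vectors are supplied explicitly):

import hashlib
import chromadb

client = chromadb.PersistentClient(path='.agent_cache/vectors')
chunks = client.get_or_create_collection('code_chunks')

def upsert_chunk(text: str, path: str) -> None:
    # Content hash as the vector ID: re-upserting an unchanged
    # chunk is a no-op, so incremental re-indexing is nearly free.
    cid = hashlib.sha256(text.encode()).hexdigest()
    chunks.upsert(ids=[cid], documents=[text], metadatas=[{'path': path}])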

The prompt-cache tier (Anthropic & OpenAI)

Both Anthropic and OpenAI now offer server-side prompt caching: if the first N tokens of a request match a prior request, the KV state is reused and the cached tokens are billed at a steep discount (roughly 10% of the normal input price on Anthropic, 50% on OpenAI). For agents that load a fixed repo map or system prompt on every turn, this alone can reduce inference cost by 70–80%. The practical trick: always put stable content (repo map, file index) first in the context, variable content (current conversation) last.
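A sketch of that ordering with Anthropic’s API; the model id is an example and REPO_MAP is a placeholder for the stable repo map string:

from anthropic import Anthropic

client = Anthropic()
response = client.messages.create(
    model='claude-sonnet-4-20250514',     # example model id
    max_tokens=1024,
    system=[{
        'type': 'text',
        'text': REPO_MAP,                 # stable content first...
        'cache_control': {'type': 'ephemeral'},   # ...and cached
    }],
    messages=[                            # variable content last
        {'role': 'user', 'content': 'fix the leak in the connection pool'},
    ],
)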

08

Claude Code, Aider Repo Map, Cursor @-References

Three production agents have made different architectural choices about how to build their context window from a codebase. All three are open to inspection — via source code, published papers, or observed behaviour.

Aider Repo Map (Paul Gauthier)

Aider generates a compact repo map (built with Universal Ctags in early versions, with tree-sitter queries since late 2023): a symbol inventory listing every file’s exports (functions, classes, constants) in ~5–15 lines per file. A PageRank-style weighting over the cross-reference graph promotes files that many others reference, pruning the map to fit the token budget. The map is regenerated each turn if files have changed. On a 100 k-line Python repo the map is typically 2–4 k tokens.

Aider repo map output (abridged example)

src/db/pool.py:
ConnectionPool.__init__(self, dsn, min_size, max_size)
ConnectionPool.acquire(self) -> Connection
ConnectionPool.release(self, conn)
ConnectionPool._evict_idle(self, max_idle_secs)

src/db/migrations.py:
run_migrations(engine, target_revision: str)
_check_alembic_version(engine) -> str

Cursor @-References

Cursor exposes explicit @-symbols to the user: @file, @folder, @codebase, @web, @docs, @git. Behind the scenes, @codebase triggers a hybrid retrieval pipeline: ripgrep for identifiers extracted from the query, then a code-embedding nearest-neighbour pass (Cursor runs its own embedding service), results re-ranked by a cross-encoder. The user sees the top N files; the agent sees their full text.
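Cursor’s exact fusion and re-ranking logic is not public; reciprocal-rank fusion is a common way to sketch the merge step between the lexical and embedding passes:

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal-rank fusion: each input list votes 1/(k + rank) per file.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, path in enumerate(ranking, start=1):
            scores[path] = scores.get(path, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# lexical_hits / semantic_hits: file paths from the two passes (placeholders)
merged = rrf([lexical_hits, semantic_hits])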

Claude Code heuristics

File-tree bootstrapping

On session start Claude Code reads CLAUDE.md (if present), then a directory listing to depth 2–3, plus any package.json / Cargo.toml / pyproject.toml it finds. This gives module structure without loading source.

Tool-driven retrieval

Rather than a pre-computed index, Claude Code issues explicit Read and Bash (ripgrep) tool calls — the retrieval is in-context reasoning. The model decides what to look up next based on what it has seen. This trades index-free simplicity for more tool-call round-trips.

Trade-off summary

Aider: pre-computed repo map — deterministic, cheap, works offline, misses semantic queries.
Cursor: hybrid index with user-facing @-references — most flexible, requires running services.
Claude Code: zero pre-computation, in-context retrieval via tool calls — no setup, more tokens per turn, model must reason correctly about what to fetch.

09

What to Take Away

Where to next

Deck 02 covers what happens after the context window is loaded: how agents actually make changes — search-replace blocks, unified diffs, multi-file atomic edits, and failure recovery when the model produces a bad patch.