ripgrep heuristics, tree-sitter AST parsing, ctags & language servers, code embeddings, dependency graphs, caching, and the concrete repo-navigation strategies used by Aider, Cursor, and Claude Code.
A coding agent receiving the instruction “fix the memory leak in the connection pool” must do something a human senior engineer does instinctively: locate the relevant 200–400 lines in a repository that might be 10–50 MB of source. Getting that wrong — either missing relevant context or flooding the LLM with irrelevant noise — determines whether the subsequent edit is correct.
Even a 200 k-token model can hold only about 0.8 MB of source at once (code averages roughly 4 characters per token). A 10 MLOC C++ codebase is 200–500 MB. Retrieval precision matters more than recall here: one wrong file that dominates context can confuse the model more than omitting a helper file.
Most agents layer all three retrieval strategies (lexical, structural, and semantic); the trick is knowing which to invoke first.
Load too little and the model hallucinates — it invents a function signature it has never seen. Load too much and the model loses the thread of the actual task in a sea of tangentially related code. The optimal context for a single edit is typically 3–8 files, which means the retrieval step must achieve >80% precision even before the model reads a line.
| Benchmark | Task | Key signal |
|---|---|---|
| SWE-bench Verified | Fix real GitHub issues drawn from 12 popular Python repos (500 verified tasks) | Only 26–50% of agent attempts succeed; most failures are context-retrieval, not model reasoning |
| RepoBench | Next-line completion given a cross-file dependency | Retrieval quality alone accounts for ~15% absolute accuracy delta across methods |
| RepoQA | Long-context function-level comprehension | Models degrade sharply when the relevant function is buried beyond 64 k tokens of context |
Andrew Gallant’s ripgrep (first released 2016, current stable 14.x) processes 1–3 GB/s of source text on modern hardware thanks to SIMD-accelerated regex, mmap I/O, and aggressive gitignore-aware file filtering. For exact identifier lookup it outperforms semantic search by a wide margin on latency, cost, and precision.
# Find all callers of `connection_pool_acquire` (word-bounded, 4 lines of context)
rg 'connection_pool_acquire' \
    --type rust \
    --word-regexp \
    --context 4 \
    --max-count 30 \
    --json   # machine-readable output consumed by the agent
# Locate the definition (function/method signature) only
rg '^(pub )?(async )?fn connection_pool_acquire' \
    --type rust \
    --line-number \
    --no-heading
ripgrep on a 5 MLOC repo: ~0.3 s. Dense vector search after index warm-up: 10–80 ms. But building the vector index requires a multi-minute GPU or API pass upfront — ripgrep requires nothing.
Identifiers like pg_conn_pool are essentially unique tokens. Embedding models sometimes cluster semantically similar but differently-named functions, returning false positives.
A call-graph query like “all files that import this module” is a pure grep pattern (^import.*mymodule). Semantic search has no concept of import topology.
1. Try ripgrep for the exact symbol name. If ≥1 hit, load those files.
2. Try ripgrep for adjacent identifiers extracted from the user query.
3. Fall back to semantic search only if lexical search returns zero results or the query is clearly conceptual (“anything related to authentication”), as sketched below.
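In code, the cascade is a few lines. A minimal sketch, assuming ripgrep is on PATH; semantic_search is a placeholder for whatever vector index the agent runs:

import subprocess

def lexical_hits(symbol: str, repo: str) -> list[str]:
    # --files-with-matches prints only the matching paths, one per line
    out = subprocess.run(
        ['rg', '--word-regexp', '--files-with-matches', symbol, repo],
        capture_output=True, text=True)
    return out.stdout.splitlines()

def retrieve(query: str, symbols: list[str], repo: str) -> list[str]:
    for sym in symbols:                        # steps 1–2: exact identifiers first
        files = lexical_hits(sym, repo)
        if files:
            return files[:8]                   # respect the 3–8 file budget
    return semantic_search(query, top_k=8)     # step 3: conceptual fallback (placeholder)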
A generic name like get() returns thousands of hits in any large codebase; ripgrep alone is useless without a structural filter.

tree-sitter (Max Brunsfeld, GitHub, 2018 — v0.22 as of 2024) is an incremental, error-tolerant GLR parser generator. It compiles a grammar to a C shared library and exposes a uniform C API (with bindings for Rust, Python, Node, Go, etc.). The key property for agents: it re-parses only the changed subtree after an edit, making incremental updates <1 ms on files up to ~100 k lines.
from tree_sitter import Language, Parser   # py-tree-sitter >= 0.22 API
import tree_sitter_python as tspython

PY_LANG = Language(tspython.language())
parser = Parser(PY_LANG)

def extract_functions(source: bytes) -> list[dict]:
    tree = parser.parse(source)
    query = PY_LANG.query("""
    (function_definition
      name: (identifier) @fn.name
      body: (block . (expression_statement
        (string) @fn.docstring)?))
    """)
    results = []
    # each match is (pattern_index, {capture_name: node});
    # note that py-tree-sitter 0.23+ returns lists of nodes per capture instead
    for match in query.matches(tree.root_node):
        name_node = match[1].get('fn.name')
        doc_node = match[1].get('fn.docstring')
        if name_node:
            results.append({
                'name': name_node.text.decode(),
                'row': name_node.start_point[0],
                'doc': doc_node.text.decode() if doc_node else None,
            })
    return results
A regex for Python function definitions breaks on decorators, nested functions, type annotations with complex generics, and multiline signatures. tree-sitter’s concrete syntax tree is correct by construction for all of these. The error recovery property is equally important: even in syntactically broken files (mid-edit state), tree-sitter produces a partial tree with ERROR nodes rather than aborting — an agent can still extract the 95% of symbols that are intact.
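The error-recovery claim is easy to demonstrate. A small example reusing the parser from the block above; the broken second function is deliberate:

# Parse a file in a mid-edit, syntactically broken state
broken = b"def good():\n    return 1\n\ndef bad(:\n"
tree = parser.parse(broken)
print(tree.root_node.has_error)          # True: the tree contains ERROR nodes
for child in tree.root_node.children:
    if child.type == 'function_definition':
        name = child.child_by_field_name('name')
        print(name.text.decode())        # 'good' is still extracted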
After lexical search and AST parsing, the next layer is a persistent symbol index that maps identifier names to file, line, and kind (function, class, variable, etc.) across the entire repository without re-parsing on every query.
Universal Ctags (fork of Exuberant Ctags, actively maintained since 2014, v6.1 in 2024) supports 150+ languages via hand-written C parsers plus user-defined regex (optlib) parsers. Its output is a sorted plain-text tags file, binary-searchable with readtags, pipeable into fzf, or loaded directly into agent memory.
# Generate (takes ~8 s on a 1 MLOC Python/Rust mixed repo).
# --fields=+nKsZ adds line number, full kind name, and scope;
# --extras=+q adds qualified names (module.ClassName.method).
ctags \
    --recurse \
    --sort=yes \
    --fields=+nKsZ \
    --extras=+q \
    --output-format=json \
    --exclude='.git' --exclude='node_modules' \
    -f '.ctags_cache/tags.json' .
# Query: find definition of `ConnectionPool` (Python)
import json

with open('.ctags_cache/tags.json') as f:
    tags = [json.loads(line) for line in f]
hits = [t for t in tags
        if t.get('name') == 'ConnectionPool' and t.get('kind') == 'class']
LSP (Language Server Protocol, Microsoft, 2016 — spec 3.17 current) turns a language compiler into a query server. For agents that run long sessions on a single repository, spinning up rust-analyzer, pyright, or clangd unlocks type-resolved symbol queries unavailable from ctags.
| LSP Request | Agent use-case | Key server |
|---|---|---|
| textDocument/definition | Jump to the true definition even through type aliases, imports, macros | rust-analyzer, pyright, clangd |
| textDocument/references | Find all call sites of a function before renaming or deleting it | Any LSP server |
| textDocument/documentSymbol | List all symbols in one file — faster than re-parsing with tree-sitter | Any LSP server |
| workspace/symbol | Cross-file symbol search by name — replaces ctags for live sessions | rust-analyzer, gopls |
| textDocument/callHierarchy | Incoming and outgoing call edges for a given function | clangd 12+, rust-analyzer |
Starting rust-analyzer on a large workspace takes 30–120 s of initial indexing and >500 MB RAM. For one-shot CI tasks or ephemeral containers, ctags is faster to initialise and uses <10 MB. LSP shines for long-running interactive agent sessions (Cursor, Claude Code in a persistent session) where the startup cost is amortised.
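Under the hood every LSP request is JSON-RPC with Content-Length framing over stdio. A minimal sketch, assuming pyright-langserver is installed; reading the responses (same framing in reverse) is omitted:

import json, subprocess

def lsp_message(msg: dict) -> bytes:
    body = json.dumps(msg).encode()
    return f'Content-Length: {len(body)}\r\n\r\n'.encode() + body

proc = subprocess.Popen(['pyright-langserver', '--stdio'],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE)

# The initialize handshake precedes all other requests (a real client waits
# for its response and sends the 'initialized' notification before continuing)
proc.stdin.write(lsp_message({
    'jsonrpc': '2.0', 'id': 1, 'method': 'initialize',
    'params': {'processId': None, 'rootUri': 'file:///path/to/repo',
               'capabilities': {}}}))
proc.stdin.flush()

# Cross-file symbol search: the live replacement for a ctags lookup
proc.stdin.write(lsp_message({
    'jsonrpc': '2.0', 'id': 2, 'method': 'workspace/symbol',
    'params': {'query': 'ConnectionPool'}}))
proc.stdin.flush()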
Dense vector retrieval over code embeddings is the third layer. Unlike lexical or structural search, it can answer conceptual queries: “where is retry logic implemented?”, “show me all error-handling patterns”. The cost is a preprocessing step and a vector database.
| Model | Dim | Max tokens | Strength |
|---|---|---|---|
| voyage-code-3 (Voyage AI, 2024) | 1 024 / 2 048 | 16 000 | Best on CoIR benchmark; multilingual, long-context; recommended for production |
| text-embedding-3-large (OpenAI) | 3 072 | 8 191 | Strong general + code; Matryoshka — can truncate to 256 d with minor loss |
| CodeBERT (Microsoft, 2020) | 768 | 512 | Historically important, now outperformed; trained on NL+code bimodal pairs |
| UniXcoder (Microsoft, 2022) | 768 | 1 024 | Encoder-decoder; strong on code summarisation and clone detection |
| cohere-embed-v3 (code mode) | 1 024 | 512 | Input-type parameter distinguishes code vs. prose; useful for mixed corpora |
- Exact-name lookup: asking “find the parse_jwt function” via embedding returns the 10 most semantically similar functions — many will not be parse_jwt. ripgrep finds it in 0.1 s with 100% precision.
- Config & schema files: YAML and JSON embed poorly; ripgrep patterns like ^ port: are far more reliable.
- Cold indexing cost: embedding a 1 MLOC repo at 0.5 M tokens per API call costs $0.50–$2.00 and takes 5–20 minutes. Budget carefully in CI pipelines.
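When the query genuinely is conceptual, the retrieval pipeline itself is short. A hedged sketch using OpenAI embeddings and brute-force cosine ranking; the chunking step and file list are illustrative placeholders:

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model='text-embedding-3-large', input=texts)
    return np.array([d.embedding for d in resp.data])

chunks = chunk_repo(repo_files)      # placeholder: e.g. one function per chunk
index = embed(chunks)                # shape (n_chunks, 3072)
index /= np.linalg.norm(index, axis=1, keepdims=True)

q = embed(['where is retry logic implemented?'])[0]
q /= np.linalg.norm(q)
top5 = np.argsort(index @ q)[::-1][:5]   # indices of the five nearest chunks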
When an agent needs to understand the blast radius of a change — which modules will break if I rename this function? which callers must be updated? — a call graph or dependency graph is the right data structure. Pure lexical and semantic search miss transitive effects.
Walk the AST for import and from X import Y nodes, then resolve them against the package layout and sys.path. Tools: modulegraph, pydeps (generates a DOT graph), pyproject-deps. Works without executing the code — important for agent safety.
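A minimal sketch of that AST walk with the stdlib ast module; full relative-import resolution against the package layout is left out:

import ast

def imports_of(path: str) -> set[str]:
    tree = ast.parse(open(path, encoding='utf-8').read())
    mods = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            mods.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            mods.add(node.module)   # 'from x.y import z' contributes 'x.y'
    return mods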
cargo metadata --format-version 1 | jq '.packages[].dependencies' gives the full crate dependency tree including dev and build dependencies, with resolved versions. The inter-crate call graph requires cargo-call-stack or MIR analysis.
# Static analysis approach (no execution needed) using ast + networkx
import ast
import networkx as nx

def build_call_graph(files: list[str]) -> nx.DiGraph:
    G = nx.DiGraph()
    for path in files:
        tree = ast.parse(open(path, encoding='utf-8').read())
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef):
                caller = node.name
                G.add_node(caller, file=path, line=node.lineno)
                for child in ast.walk(node):
                    # direct calls by name only; attribute calls (obj.method())
                    # and aliased imports are not resolved by this simple pass
                    if isinstance(child, ast.Call) and isinstance(child.func, ast.Name):
                        G.add_edge(caller, child.func.id)
    return G

# Agent use: find all transitive callers of `acquire_connection`
G = build_call_graph(python_files)                      # python_files: list of .py paths
callers = list(nx.ancestors(G, 'acquire_connection'))   # reverse reachability
Static call graphs are incomplete for languages with dynamic dispatch (Python duck-typing, virtual methods, callbacks, closures, eval). For Rust (monomorphised generics) and Go (explicit interfaces), static analysis gets you 90%+. For Python you typically need a type-annotated codebase plus pyright’s call-hierarchy analysis, or a runtime profiler (py-spy) to complete the graph. Most agent frameworks accept this limitation and use static graphs for “safe to rename?” checks, not correctness guarantees.
Retrieval speed on large repos is a function of what is pre-computed and how stale that cache is. There are four distinct caches in a mature agent system:
| Cache layer | Invalidation key | Regen cost (1 MLOC) | Staleness risk |
|---|---|---|---|
| File-mtime scan | inotify / FSEvents / polling | <0.1 s | None — always accurate |
| tree-sitter AST | file SHA-256 | ~2 s (incremental) | Low — per-file |
| ctags / LSP index | file SHA-256 + language version | 8–30 s (delta) | Low with inotify trigger |
| Embeddings (code chunks) | chunk hash (sliding window content) | 5–20 min (full); seconds (delta) | Medium — embedding model upgrades invalidate all |
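The second and third rows reduce to the same primitive: hash the file content, skip work on a match. A minimal sketch of that check:

import hashlib

def needs_reindex(path: str, cache: dict[str, str]) -> bool:
    digest = hashlib.sha256(open(path, 'rb').read()).hexdigest()
    if cache.get(path) == digest:
        return False                 # unchanged: reuse the cached AST / tags rows
    cache[path] = digest             # changed: record the new hash, re-parse
    return True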
Universal Ctags --output-format=json piped into SQLite gives a queryable index with full-text search. A 1 MLOC repo produces ~50 MB. Incremental updates patch only the rows for modified files. WAL mode gives concurrent read access during re-index.
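A sketch of that pipeline; table and column names are illustrative:

import json, sqlite3

db = sqlite3.connect('.ctags_cache/tags.db')
db.execute('PRAGMA journal_mode=WAL')    # readers stay unblocked during re-index
db.execute("""CREATE VIRTUAL TABLE IF NOT EXISTS tags
              USING fts5(name, path, kind, line UNINDEXED)""")
with open('.ctags_cache/tags.json') as f:
    rows = [(t['name'], t['path'], t.get('kind', ''), t.get('line', 0))
            for t in map(json.loads, f) if t.get('_type') == 'tag']
db.executemany('INSERT INTO tags VALUES (?, ?, ?, ?)', rows)
db.commit()

# Full-text query over symbol names
hits = db.execute('SELECT path, line FROM tags WHERE tags MATCH ?',
                  ('ConnectionPool',)).fetchall()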
Vector stores such as Qdrant and Chroma support upsert-by-id — hash the chunk content, use that as the vector ID, and upserts become no-ops for unchanged chunks. Qdrant’s HNSW index (M=16, ef=100) gives 1–5 ms query latency on 10 M vectors with 512–1 024 d embeddings.
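A sketch of the deterministic-ID trick with qdrant-client; Qdrant point IDs must be integers or UUIDs, so the content hash is folded into a UUID. The collection name and payload are illustrative:

import uuid
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

client = QdrantClient('localhost', port=6333)

def upsert_chunk(text: str, vector: list[float], path: str) -> None:
    # Same chunk text always maps to the same ID, so re-upserts are no-ops
    point_id = str(uuid.uuid5(uuid.NAMESPACE_URL, text))
    client.upsert(collection_name='code_chunks',          # assumed to exist
                  points=[PointStruct(id=point_id, vector=vector,
                                      payload={'path': path})])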
Both Anthropic and OpenAI now offer server-side prompt caching: if the first N tokens of a request match a prior request, the KV state is reused and cached tokens are billed at a steep discount (10% of the normal input price for Anthropic cache reads, 50% for OpenAI). For agents that load a fixed repo-map or system prompt on every turn, this alone can reduce inference cost by 70–80%. The practical trick: always put stable content (repo map, file index) first in the context, variable content (current conversation) last.
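With Anthropic’s API the stable prefix is marked explicitly. A sketch of that ordering; repo_map and user_turn are placeholders:

import anthropic

client = anthropic.Anthropic()
resp = client.messages.create(
    model='claude-3-5-sonnet-20241022',
    max_tokens=1024,
    system=[{'type': 'text',
             'text': repo_map,                          # stable: cached across turns
             'cache_control': {'type': 'ephemeral'}}],
    messages=[{'role': 'user', 'content': user_turn}],  # variable content last
)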
Three production agents have made different architectural choices about how to build their context window from a codebase. All three are open to inspection — via source code, published papers, or observed behaviour.
Aider generates a compact repo map, originally via Universal Ctags and in current releases via tree-sitter: a symbol inventory listing every file’s exports (functions, classes, constants) in ~5–15 lines per file. A PageRank-style weighting based on the cross-reference graph promotes files that are referenced by many others, pruning the map to fit within the token budget. The map is regenerated every turn if files have changed. On a 100 k-line Python repo, the map is typically 2–4 k tokens.
⋮
src/db/pool.py:
ConnectionPool.__init__(self, dsn, min_size, max_size)
ConnectionPool.acquire(self) -> Connection
ConnectionPool.release(self, conn)
ConnectionPool._evict_idle(self, max_idle_secs)
⋮
src/db/migrations.py:
run_migrations(engine, target_revision: str)
_check_alembic_version(engine) -> str
⋮
Cursor exposes explicit @-symbols to the user: @file, @folder, @codebase, @web, @docs, @git. Behind the scenes, @codebase triggers a hybrid retrieval pipeline: ripgrep for identifiers extracted from the query, then a code-embedding nearest-neighbour pass (Cursor runs its own embedding service), results re-ranked by a cross-encoder. The user sees the top N files; the agent sees their full text.
On session start Claude Code reads CLAUDE.md (if present), then a directory listing to depth 2–3, plus any package.json / Cargo.toml / pyproject.toml it finds. This gives module structure without loading source.
Rather than consulting a pre-computed index, Claude Code issues explicit Read and Bash (ripgrep) tool calls — the retrieval is in-context reasoning. The model decides what to look up next based on what it has seen. This buys index-free simplicity at the cost of extra tool-call round-trips.
- Aider: pre-computed repo map — deterministic, cheap, works offline, misses semantic queries.
- Cursor: hybrid index with user-facing @ references — most flexible, requires running services.
- Claude Code: zero pre-computation, in-context retrieval via tool calls — no setup, more tokens per turn, model must reason correctly about what to fetch.
Deck 02 covers what happens after the context window is loaded: how agents actually make changes — search-replace blocks, unified diffs, multi-file atomic edits, and failure recovery when the model produces a bad patch.