Coding Agents Internals Series — Presentation 04

Live Repo Context & the Prompt Cache

Companion deck for Components 1 + 2 of Sebastian Raschka’s essay. What goes into a workspace summary, why the prompt is split into a stable prefix and a changing tail, and how the Anthropic / OpenAI prompt cache turns thousands of repeated tokens into a near-zero recurring cost.

Workspace summary · Prompt prefix · Anthropic cache · 5-min TTL · CLAUDE.md · AGENTS.md
Repo → Workspace summary → Stable prefix → Cache breakpoint → Session state → User turn → LLM
00

Topics We’ll Cover

  • The “Fix the Tests” Problem
  • Anatomy of a Workspace Summary
  • Stable Facts vs Changing Facts
  • Prompt-Shape Architecture
  • Interactive: Walk Through a Real Prompt
  • Prompt-Cache Mechanics — What You Actually Pay
  • Interactive: Cache-Hit Cost Calculator
  • Pitfalls — What Breaks the Cache
  • Practical Patterns You Can Steal
  • Things to Try Yourself

01

The “Fix the Tests” Problem

Raschka opens Component 1 with a small but devastating example. A user types “fix the tests”. To a competent human colleague this is enough: they know the language, the test runner, the recent changes, the project conventions. To an LLM with nothing but the message, it is unsolvable.

What the LLM is missing

  • Which tests are failing — or how to find them.
  • The test command — pytest? cargo test? npm test? something custom?
  • The repo layout — where is the source vs the tests?
  • Project conventions — lint rules, type checking, commit style.
  • The current Git branch and what just changed.

What the agent needs

A workspace summary — what Raschka calls “stable facts”. Built once, at session start. Cheap to keep around. Carries most of what a colleague would know without being told.

The user message then becomes a delta on top of a known context, not a complete brief in itself.

A telling phrase from the article

Raschka writes that the agent “collects info (‘stable facts’ as a workspace summary) upfront before doing any work” precisely so that it does not start “from zero, without context, on every prompt”. Most of the engineering effort in Component 1 is choosing what counts as a stable fact and how to surface it cheaply.

02

Anatomy of a Workspace Summary

What does a real coding agent put in its workspace summary? Let’s reverse-engineer it from observable behaviour. Here is roughly what Claude Code reads on session start in a typical Python repo (~2 kLOC, mixed src/tests).

workspace_summary.md — observed Claude Code session-start payload
# Workspace summary — built once at session start

project:
  name: connection-pool
  root:  /home/user/code/connection-pool
  vcs:   git, branch=feat/idle-eviction, ahead 3 commits of main

tooling:
  language: Python 3.12
  build:    uv (pyproject.toml)
  test:     pytest -q
  lint:     ruff check, ruff format
  types:    pyright (strict)

layout:
  - src/                 # 14 Python files, 1 821 LOC
  - src/db/pool.py       # core class, 412 LOC
  - src/db/migrations.py
  - tests/               # 23 test files, 2 104 LOC
  - docs/architecture.md # 180 LOC, last modified 8 days ago

conventions:                 # pulled from CLAUDE.md / AGENTS.md / README
  - "All public methods must have type hints and a 1-line docstring."
  - "PRs run pytest + ruff + pyright; CI fails on any error."
  - "Avoid mocking the database; use a real Postgres in test fixtures."

recent_change:                # last commit summary
  sha: 3a91f0c
  title: "add idle-connection eviction; tests pending"

What goes in, what stays out

In — cheap, stable, useful

  • Project name, language, build/test commands.
  • Top-level directory structure to depth 2–3.
  • Identified key files (largest, most-imported, recently changed).
  • Conventions from CLAUDE.md / AGENTS.md / CONTRIBUTING.md.
  • Branch and most-recent commit, succinctly.

Out — expensive, unstable, or noisy

  • Full file contents (load on demand via tools).
  • Long Git history (the harness can run git log later).
  • Build artefacts, node_modules, generated code.
  • Secrets, env vars, anything from .env or password managers.
  • Output of any command that takes >1 s — defer to a tool call.
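To make the in/out split concrete, here is a minimal sketch of a session-start summary builder. It is not Claude Code's actual implementation; the function name, the depth limit, and the choice of git commands are assumptions, but each field maps to one of the "in" bullets above.

Python — hypothetical session-start summary builder
# build_summary.py — sketch of a session-start workspace summary.
# Hypothetical helper, not Claude Code's actual implementation.
import subprocess
from pathlib import Path

MAX_DEPTH = 2                                   # only walk the top 2–3 levels
SKIP = {".git", "node_modules", "__pycache__", ".venv", "dist"}

def _git(*args: str, cwd: Path) -> str:
    """Run a fast git command; anything slow is deferred to a later tool call."""
    out = subprocess.run(["git", *args], cwd=cwd, capture_output=True, text=True)
    return out.stdout.strip()

def build_workspace_summary(root: Path) -> str:
    lines = [f"project: {root.name}", f"root: {root}"]

    # Branch and most recent commit, succinctly (cheap, stable for the session).
    branch = _git("rev-parse", "--abbrev-ref", "HEAD", cwd=root)
    last = _git("log", "-1", "--format=%h %s", cwd=root)
    lines += [f"branch: {branch}", f"last_commit: {last}"]

    # Top-level layout to a shallow depth; skip artefacts and vendored code.
    lines.append("layout:")
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if any(part in SKIP for part in rel.parts):
            continue
        if len(rel.parts) > MAX_DEPTH or not path.is_dir():
            continue                            # full file contents stay out
        lines.append(f"  - {rel}/")

    # Conventions files are inlined verbatim; source files are loaded on demand.
    for name in ("CLAUDE.md", "AGENTS.md", "CONTRIBUTING.md"):
        f = root / name
        if f.is_file():
            lines.append(f"conventions ({name}):")
            lines.append(f.read_text(encoding="utf-8").strip())

    return "\n".join(lines)

if __name__ == "__main__":
    print(build_workspace_summary(Path.cwd()))
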
Where harnesses differ

Aider precomputes a repo map (ctags + PageRank) and inlines it. Cursor builds an embedding index server-side and exposes @codebase. Claude Code does almost no precomputation — it relies on CLAUDE.md + a few directory listings + tool-call retrieval. Codex emphasises an AGENTS.md at the repo root. All four are different bets about how much to do at session start vs on demand.

03

Stable Facts vs Changing Facts

Component 2 in Raschka’s essay zooms in on a critical detail: not all of the prompt changes between turns. Some of it is set once and reused for the entire session; some of it accumulates with every new tool call; some of it is regenerated every turn. Sorting these into the right order is the single biggest cost-and-quality lever in a coding harness.

System & instructions — stable for the entire deployment
Tool definitions — stable until tools change (rarely)
Workspace summary — stable for the session
Compact transcript — grows with each turn
Working memory — updates each turn
Latest tool result — only this turn
User message — only this turn

Cache lifetimes for each layer

Layer | Lifetime | Typical size | Cacheable?
System & instructions | Days–months (rarely changes) | 500–2 000 tokens | Yes — place first (hot)
Tool definitions | Days–weeks | 1 000–4 000 tokens | Yes — place after system (hot)
Workspace summary | Per-session (~hours) | 500–2 000 tokens | Yes — place at end of prefix (hot)
Compact transcript | Grows each turn | 500–5 000 tokens | Partial — cached up to the last breakpoint (warm)
Working memory | Rewritten each turn | 200–800 tokens | No — changes every turn (cold)
Latest tool result | This turn only | Variable, often clipped to ~1 500 tokens | No (cold)
User message | This turn only | 20–500 tokens | No (cold)
The architectural rule

Sort by stability, descending. Stable layers go first; volatile layers go last. Place a cache breakpoint immediately after the last stable layer. Anything before the breakpoint is paid for once per session; anything after is paid for every turn. Get this ordering wrong and you pay full price on every turn for content the cache could have served.
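A sketch of how a harness can make that rule mechanical rather than hand-tuned: tag each layer with a stability flag and derive the breakpoint position from it. The Layer type and assemble_prompt helper below are invented for illustration; the invariant they encode (stable layers first, breakpoint right after the last one) is the point.

Python — toy layer-ordering sketch (hypothetical)
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    text: str
    stable: bool                      # True = unchanged across turns this session

def assemble_prompt(layers: list[Layer]) -> tuple[str, int]:
    """Return (prompt_text, breakpoint_index).

    Stable layers first, volatile layers last; the cache breakpoint sits
    immediately after the last stable layer, so everything before it is paid
    once per session and everything after is paid every turn.
    """
    ordered = sorted(layers, key=lambda l: not l.stable)   # stable layers sort first
    breakpoint_index = sum(1 for l in ordered if l.stable)
    return "\n\n".join(l.text for l in ordered), breakpoint_index

layers = [
    Layer("user message", "fix the tests", stable=False),
    Layer("system & instructions", "...", stable=True),
    Layer("latest tool result", "...", stable=False),
    Layer("tool definitions", "...", stable=True),
    Layer("workspace summary", "...", stable=True),
]
prompt, bp = assemble_prompt(layers)
print(f"cache breakpoint after layer #{bp}")               # after the 3 stable layers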

04

Prompt-Shape Architecture

This is Raschka’s Figure 7 redrawn. A single turn’s prompt has three regions, separated by where the cache breakpoint sits.

Stable prefix — cached
  system · tool defs · workspace summary · conventions
  ~3 000–8 000 tokens · written once, read every turn at 10% cost
  cache key = exact prefix bytes, including order
cache breakpoint ↑
Session state — partially cached
  compact transcript · working memory file
  grows or rewrites each turn · benefits from a secondary breakpoint
Turn-local — uncached
  latest tool result · user message · control directives
  paid at full price every turn — keep it small

How a real Anthropic API call expresses this

Python — using cache_control to mark the breakpoint
from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    system=[
        {"type": "text", "text": SYSTEM_INSTRUCTIONS},        # stable
        {"type": "text", "text": TOOL_DESCRIPTIONS},          # stable
        {"type": "text", "text": WORKSPACE_SUMMARY,           # stable for session
         "cache_control": {"type": "ephemeral"}},    # ← cache breakpoint
    ],
    messages=[
        *compact_transcript,                                 # warm
        {"role": "user",
         "content": working_memory_block + tool_result_block + user_msg},  # cold
    ],
    tools=TOOL_SCHEMAS,
)

# Subsequent turns hit the cache and pay 0.10x for the prefix tokens.
Why this is harness work, not user work

End users will never write cache_control directly. The harness arranges the prompt this way every turn so that the user just types “fix the tests” and the right architecture happens automatically. Raschka’s point in Component 2 is that this layout is not optional; it is what separates a usable harness from a wasteful one.

05

Interactive: Walk Through a Real Prompt

Click any layer below to see what it contains, why it sits where it does, and whether it benefits from the cache. The colours match the diagram on the previous slide: green = cache hot, orange = cache warm, red = cache cold.

SYSTEM · instructions, persona, refusals · ~1 200 tok
TOOLS · JSON schemas for read_file, write_file, bash, … · ~2 800 tok
WORKSPACE SUMMARY · layout, conventions, branch (CACHE BREAKPOINT after) · ~1 600 tok
COMPACT TRANSCRIPT · previous turns, summarised · ~2 100 tok
WORKING MEMORY · current task, important files, notes · ~600 tok
LATEST TOOL RESULT · output of last command · ~900 tok
USER MESSAGE · this turn’s instruction · ~80 tok
Click a layer above to see its role, lifetime, and cache status. Try them all — the difference between layers is the entire reason this architecture matters.
cache hot — paid once, reused for session · cache warm — partially reusable · cache cold — full price every turn

Token estimates are illustrative for a typical Claude Code session on a small Python repo. Real values depend on the harness, model, and project size.

06

Prompt-Cache Mechanics — What You Actually Pay

The economics behind Component 2 only make sense if you know how the cache actually bills. The Anthropic API uses ephemeral caching with a default 5-minute TTL (an extended 1-hour TTL is also available with a higher write multiplier). OpenAI’s mechanism is similar in spirit.

The price multipliers (Anthropic, default 5-min TTL)

Operation | Cost multiplier | What it means
Cache write — first time you submit a prefix | 1.25× standard input price | You pay a 25% premium to populate the cache.
Cache read — subsequent turns within TTL | 0.10× standard input price | Cached prefix tokens cost 10% of normal — a 10× reduction.
Cache miss — prefix changed by even 1 byte | 1.00× standard input price | Full re-tokenisation; no benefit.
Cache refresh — new request resets the TTL | Free if prefix matches | Active sessions stay hot; idle ones expire after 5 min.

Break-even analysis

The 25% write premium is paid up front; the 10× reduction kicks in on every subsequent turn. The break-even is therefore very low: a single cache read within the TTL already more than repays the write premium. Concretely:

Figure: cumulative cost of a 5 000-token stable prefix vs turn count, no cache vs with cache. With the 1.25×/0.10× multipliers the cached line is roughly 4–5× cheaper by turn 10.
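The same numbers fall out of a few lines of arithmetic. This is a simplified model (stable prefix only, cache never expiring mid-session); real bills also include the uncached tail and output tokens.

Python — cumulative prefix cost, cached vs uncached
WRITE, READ = 1.25, 0.10              # Anthropic default-TTL multipliers

def cumulative_cost(prefix_tokens: int, turns: int, cached: bool) -> float:
    """Cost in token-equivalents (multiply by the per-token input price)."""
    if not cached:
        return prefix_tokens * turns
    # Turn 1 pays the 25% write premium; later turns read at 10%.
    return prefix_tokens * (WRITE + READ * (turns - 1))

for n in (1, 2, 5, 10):
    no_cache = cumulative_cost(5_000, n, cached=False)
    with_cache = cumulative_cost(5_000, n, cached=True)
    print(f"turn {n:>2}: no cache {no_cache:>8.0f}  "
          f"cached {with_cache:>8.0f}  ratio {no_cache / with_cache:4.1f}x")
# turn 1: cached costs 1.25x (the only losing case)
# turn 2: already ~1.5x cheaper; by turn 10, roughly 4-5x cheaper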
A useful rule of thumb

For any system prompt + workspace summary >1 000 tokens that is reused 2 or more times, prompt caching is essentially free money. The interesting design question is not whether to cache — it is where to put the breakpoint and how to keep the prefix bit-stable across turns.

07

Interactive: Cache-Hit Cost Calculator

Slide the inputs to see how cache geometry shapes a typical session. The calculator uses Anthropic’s default multipliers (1.25× write, 0.10× read) on a 5-minute TTL.

Calculator outputs (without caching, with caching, saved per session) update live as you move the sliders.
What the calculator tells you

For a Claude Code-style session (5 000-tok prefix, ~12 turns, ~1 200 tok tail at $15/M input) caching saves roughly 70–80% of the input bill. Long sessions amortise the 25% write premium completely; short sessions still come out ahead on the second turn. Sessions of length 1 are the only case where caching costs slightly more — and even those typically don’t happen in practice for coding agents.

08

Pitfalls — What Breaks the Cache

The cache is content-addressed: the cache key is the byte-for-byte prefix. Any of the following silently turn cache reads back into cache writes — an order of magnitude price increase, with no warning.

Timestamps in the system prompt

“Today is 2026-04-28T09:35:12Z” injected at the top of every turn destroys the cache. If you need a date, put it after the cache breakpoint, or round to the day.

Random JSON key ordering

If your tool schema is regenerated each turn from a Python dict (which had its key order shuffled by a refactor), the prefix changes. Sort keys deterministically or freeze schemas at startup.
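Assuming the schemas are plain JSON-serialisable dicts, canonical serialisation is a one-liner; compute it once at startup and reuse the string every turn.

Python — byte-stable schema serialisation
import json

tool_schemas = {"read_file": {"type": "object",
                              "properties": {"path": {"type": "string"}}}}

# Canonical, byte-stable output: sorted keys, fixed separators, ASCII only.
# Compute once at startup and reuse the resulting string on every turn.
TOOL_SCHEMA_TEXT = json.dumps(tool_schemas, sort_keys=True,
                              separators=(",", ":"), ensure_ascii=True)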

User-message-dependent prefix

Tempting to do “if user mentions tests, add a tests-related instruction to the system prompt”. This branches the prefix per turn and defeats the cache. Push the conditional content into the cold tail.

Whitespace and Unicode drift

A trailing newline added by an editor, a tab vs spaces change, NFC/NFD normalisation differences — all change the byte stream. Diff your prefix between turns when caching mysteriously stops working.

5-minute TTL idle

If the user goes for coffee, the cache expires. The next turn pays the 1.25× write premium again. Long-form sessions benefit from the 1-hour TTL option even with its higher write multiplier.

Wrong breakpoint placement

Putting the breakpoint before the workspace summary means the summary sits in the cold tail and is billed at full price every turn. Putting it after session state buys nothing, because session state changes every turn and the prefix never matches. The right place is just after the last stable layer.

A diagnostic worth running

Anthropic’s API returns usage.cache_creation_input_tokens and usage.cache_read_input_tokens on every response. Log both per turn. If the read counter doesn’t climb during a session, something is invalidating your prefix — usually one of the bullets above.
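In the Python SDK those counters sit on response.usage, so the logging is a few lines. The logger name and the warning rule below are arbitrary choices, not part of the SDK.

Python — per-turn cache telemetry
import logging

log = logging.getLogger("harness.cache")

def log_cache_usage(turn: int, response) -> None:
    """Log per-turn cache telemetry from an Anthropic Messages API response."""
    usage = response.usage
    written = usage.cache_creation_input_tokens or 0
    read = usage.cache_read_input_tokens or 0
    log.info("turn %d: cache_write=%d cache_read=%d uncached=%d",
             turn, written, read, usage.input_tokens)
    if turn > 1 and read == 0:
        # After the first turn the prefix should be hot; a zero here usually
        # means something is mutating the prefix bytes (see the pitfalls above).
        log.warning("turn %d: no cache reads; prefix is being invalidated", turn)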

09

Practical Patterns You Can Steal

Translating Components 1 + 2 from concept to code. These are patterns I’ve seen work in real harnesses; each maps to a specific design decision the article calls out.

Pattern 1 — the CLAUDE.md / AGENTS.md root file

A markdown file at the repo root that the harness inlines into the workspace summary. Costs nothing to maintain (it’s just a doc), survives across sessions, and lives in version control where it can be reviewed alongside code changes. Treat it as the harness’s contract with the project: conventions, commands, do-and-don’ts.

CLAUDE.md — minimal but high-signal example
# CLAUDE.md

## Repo conventions
- Test:  pytest -q --maxfail=1
- Lint:  ruff check . && ruff format --check .
- Types: pyright

## Style
- Type-hint every public function.
- Docstrings: 1 line for simple, full Google style for complex.
- No mocking the database. Use the postgres fixture in tests/conftest.py.

## Where to look first
- Connection logic:  src/db/pool.py
- Migrations:        src/db/migrations.py
- Architecture doc:  docs/architecture.md

## Don’t
- Don’t edit generated/ — regenerated by make gen.
- Don’t add new top-level packages without proposing in a comment first.

Pattern 2 — deterministic prefix builder

A pure function build_prefix(workspace) -> bytes that always produces the same bytes for the same workspace. Sort dicts, normalise newlines, strip trailing whitespace. Run it through a SHA-256 and log the digest each turn — if it changes, your cache will too.
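A minimal sketch of that builder, with the normalisation choices as assumptions: the property that matters is that the same workspace dict always yields the same bytes, so the digest doubles as a regression check.

Python — deterministic prefix builder (sketch)
import hashlib
import json
import unicodedata

def build_prefix(workspace: dict) -> tuple[bytes, str]:
    """Pure function: same workspace dict in, same prefix bytes (and digest) out."""
    text = json.dumps(workspace, sort_keys=True, ensure_ascii=False, indent=2)
    text = unicodedata.normalize("NFC", text)                     # pin one Unicode form
    text = "\n".join(line.rstrip() for line in text.splitlines()) # strip trailing ws
    data = (text + "\n").encode("utf-8")                          # exactly one final newline
    return data, hashlib.sha256(data).hexdigest()

prefix, digest = build_prefix({"project": "connection-pool", "test": "pytest -q"})
print(digest)   # log this every turn; if it changes, your cache will too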

Pattern 3 — secondary breakpoint mid-session

Anthropic supports up to four cache breakpoints. Put the second one immediately after the compact transcript: that way, even though the transcript grows, the part of it that hasn’t changed since last turn is also cached. This is how long sessions stay cheap.
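In the Anthropic API that second breakpoint is just another cache_control marker, placed on the last content block of the already-summarised transcript; the first breakpoint stays at the end of the system prefix, as on slide 04. A rough sketch, with invented variable names and assuming the transcript messages carry plain string content:

Python — secondary breakpoint on the compact transcript (sketch)
def build_messages(compact_transcript: list[dict], turn_tail: str) -> list[dict]:
    messages = [dict(m) for m in compact_transcript]     # summarised prior turns
    if messages:
        last = messages[-1]
        # Mark the last transcript block so everything up to here is cached too.
        # Assumes last["content"] is a plain string, not already a block list.
        last["content"] = [{"type": "text", "text": last["content"],
                            "cache_control": {"type": "ephemeral"}}]
    # Turn-local tail stays after the breakpoint and is paid at full price.
    messages.append({"role": "user", "content": turn_tail})
    return messages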

Pattern 4 — round timestamps to coarse buckets

If your prompt genuinely needs a date (auditing, prompt-injection-aware grounding), round to the day. "today is 2026-04-28" survives caching for 24 hours; "now is 2026-04-28T09:35:12.741Z" survives for one millisecond.
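Day-level rounding is one line; assuming UTC is acceptable for the harness, the prefix then only changes once per day.

Python — day-granular date for the prefix
from datetime import datetime, timezone

# Day-granular: the prefix bytes only change once per day, so the cache survives.
today = datetime.now(timezone.utc).date().isoformat()    # e.g. "2026-04-28"
DATE_LINE = f"today is {today}"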

10

Things to Try Yourself

Audit one of your prompts

Take a real prompt your code already builds. Print it. Highlight in three colours: hot (truly stable), warm (reusable until session changes), cold (changes every turn). Most prompts have at least one cold thing wedged into a hot region.

Write a CLAUDE.md for a small project

Pick a side project. Write the smallest CLAUDE.md you would actually rely on. Use Claude Code or Cursor against it for a week. Notice which lines you find yourself adding — those are the actual conventions of the project.

Plot your cache hit rate over time

Log cache_creation_input_tokens and cache_read_input_tokens per turn. Plot the ratio. A healthy session reaches >80% reads within 2–3 turns. If yours doesn’t, something is breaking the prefix.

Build the deterministic prefix builder

Write a pure function in 30 lines that takes a workspace dict and returns the prefix bytes plus a SHA-256. Make it part of your test suite: same input ↠ same digest. This catches subtle regressions instantly.

Where this leads

Once you have a stable prefix and a working cache, the next bottleneck moves to what the model is allowed to do in each turn. That is Component 3 — tool access — covered in the next deck.

Next deck → Tool Access & Use