Coding Agents Internals Series — Presentation 06

Minimising Context Bloat & Structured Session Memory

Companion deck for Components 4 + 5 of Sebastian Raschka’s essay. The harness levers that turn a runaway 200k-token transcript back into a focused, useful prompt — clipping, dedup, recency, summarisation — plus the storage-time discipline of working memory vs full transcript.

Clipping · Dedup · Recency weighting · Compaction · Working memory · JSONL transcripts
Raw transcript → Clip → Dedupe → Summarise → Recency-weight → Working memory → Prompt
00

Topics We’ll Cover

01

Why Long Contexts Hurt

You might think that bigger context windows solved the bloat problem. They didn’t. They just moved it. Three independent forces still push back against feeding the model everything you have.

Cost grows linearly

At $15 per million input tokens, a 200k-token prompt costs $3 per turn. A 30-turn session at full context is $90. Caching helps the prefix, but tool outputs and growing transcripts are cache-cold. Every wasted token is real money.

Latency grows linearly

Time-to-first-token scales roughly with input length even with caching. A coding agent that takes 8 seconds per turn instead of 2 feels broken, regardless of correctness.

Accuracy degrades non-linearly

The “lost in the middle” effect: models reliably attend to the start and end of context, but content in the middle gets de-prioritised. A 150k-token prompt with the answer at position 75k may underperform a 20k-token prompt with the answer at position 18k.

Raschka’s lever

“A lot of apparent ‘model quality’ is really context quality.” When an agent stops working on a complex task — misremembers a decision, re-edits a file the user already corrected, asks a question it already had the answer to — the cause is rarely the model’s reasoning. It is what the harness chose to put in front of the model that turn.

[Chart: where the bloat comes from]

02

Strategy 1 — Clipping Tool Outputs

The cheapest, most effective compaction technique. Cap every individual tool result at a fixed budget — usually 1 000–3 000 tokens. Anything longer is truncated with an explicit marker the model can ask to expand.

A clipping function with re-expansion support
import hashlib
from dataclasses import dataclass

_CLIP_STORE: dict[str, str] = {}   # full outputs keyed by blob id, for an expand_clip tool

@dataclass
class ClipMeta:
    clipped: bool
    original_len: int = 0
    elided: int = 0
    full_blob_id: str = ""

def tokenize(text: str) -> list[str]: return text.split()         # stand-in tokenizer
def detokenize(tokens: list[str]) -> str: return " ".join(tokens)

def clip_output(text: str, max_tokens: int = 1500,
                head_frac: float = 0.6) -> tuple[str, ClipMeta]:
    # Token-aware clip; head + tail with elision marker.
    tokens = tokenize(text)
    if len(tokens) <= max_tokens:
        return text, ClipMeta(clipped=False)

    head_n = int(max_tokens * head_frac)
    tail_n = max_tokens - head_n - 12            # reserve space for the marker
    head   = detokenize(tokens[:head_n])
    tail   = detokenize(tokens[-tail_n:])

    # Content hash as a stable blob id; Python's id() would not survive across turns.
    blob_id = hashlib.sha256(text.encode()).hexdigest()[:12]
    _CLIP_STORE[blob_id] = text

    elided = len(tokens) - head_n - tail_n
    marker = (f"\n\n[... {elided} tokens elided "
              f"— ask 'expand_clip(id={blob_id})' to see full output ...]\n\n")

    return head + marker + tail, ClipMeta(
        clipped=True, original_len=len(tokens), elided=elided, full_blob_id=blob_id)

Where to clip and where not to

Always clip

  • File reads > 1 500 tokens (model can ask for specific line ranges).
  • Build / install / dependency-resolution output.
  • Test runner output with many failures — keep summary + first 3 failures.
  • Any web fetch result.

Never clip

  • Compiler errors that point to a single line. Model needs the whole error to fix it.
  • Diffs that the user is being asked to approve. Truncating the diff defeats the approval gate.
  • Search results with line-and-file context smaller than the budget.

An asymmetric clip pattern that works

For tool outputs, keep more head than tail — usually 60/40. The most-relevant context for the model is typically near the start (file imports, error type) with the actual error or value near the end. The 60/40 head/tail split outperforms a symmetric 50/50 in nearly every measurement we’ve seen.

03

Strategy 2 — Transcript Summarisation

Clipping handles individual outputs. Compaction handles the whole transcript: collapse old turns into a short summary, keeping recent ones verbatim. Raschka calls this “transcript reduction/summarisation” — an LLM-driven step that periodically rewrites the session history.

Turn 1: read pool.py
Turn 2: read tests/test_pool.py
Turn 3: ran pytest, 4 failures
Turn 4: read migrations.py
Turn 5: edited pool.py (added eviction guard)
↓  compaction trigger at turn 6
Summary: explored pool.py + tests + migrations. 4 tests failing.
Edited pool.py at line 88 to add eviction guard. Tests not yet rerun.
Turn 6: rerun pytest — verbatim
Turn 7: current turn — verbatim

When to trigger compaction

Trigger | How it works | Trade-off
Token threshold | When transcript size exceeds, e.g., 30% of the context window, compact the oldest 70% of turns. | Predictable; can fire mid-task and lose nuance.
Turn-count threshold | Every N turns (often 10), compact the older N−3 turns. | Simple; ignores actual size; can over-compact tiny turns.
User-driven | The user runs /compact (Claude Code) or similar before a long-running task. | Predictable but requires user awareness.
Idle compaction | Compact during idle waiting (user thinking) so the next turn starts clean. | Hides cost; complex to schedule reliably.
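
A minimal sketch of the first trigger (token threshold), assuming the harness can count the transcript's tokens; the 30% / 70% numbers mirror the illustrative values in the table.

def should_compact(transcript_tokens: int, context_window: int,
                   threshold_frac: float = 0.30) -> bool:
    # Fire once the transcript alone uses more than ~30% of the window.
    return transcript_tokens > threshold_frac * context_window

def split_for_compaction(turns: list, recent_frac: float = 0.30) -> tuple[list, list]:
    # Oldest ~70% of turns go to the summariser; the newest ~30% stay verbatim.
    cut = int(len(turns) * (1 - recent_frac))
    return turns[:cut], turns[cut:]
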
A summarisation prompt that survives

Summarising a coding session well is its own prompt-engineering problem. The minimum that works: “Summarise the following turns in 200 tokens or fewer. Preserve: file paths touched, decisions made, blocking issues, the next planned action. Discard: tool-result text, internal model thinking, redundant reads.” Models that get this prompt produce summaries an order of magnitude more useful than “summarise the conversation”.
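
As a sketch, here is that prompt wired into a compaction step. summarise_llm stands in for whatever model call the harness makes and is an assumption, not a real API.

COMPACTION_PROMPT = (
    "Summarise the following turns in 200 tokens or fewer. "
    "Preserve: file paths touched, decisions made, blocking issues, "
    "the next planned action. "
    "Discard: tool-result text, internal model thinking, redundant reads.\n\n{turns}"
)

def compact_turns(old_turns: list[str], summarise_llm) -> str:
    # summarise_llm: any callable that takes a prompt string and returns text.
    prompt = COMPACTION_PROMPT.format(turns="\n".join(old_turns))
    return summarise_llm(prompt)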

04

Strategy 3 — Recency Weighting

Raschka’s phrasing: “keep recent events richer because they are more likely to matter for the current step”. Concretely, this is a sliding policy that gets more aggressive about compression the further back a turn is.

[Chart: compression aggressiveness vs turn age. Turns 1–5 summary only, turns 6–10 heavily compressed, turns 11–15 moderate clip, turns 16–19 light clip, turn 20 (now) verbatim.]

The policy is a function of turn age, not of content importance — importance gets folded in via the working-memory layer (covered on slide 06). The age-only rule is what makes the policy cheap to implement and predictable for users.
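
A sketch of that age-only rule, with tiers matching the chart above; the per-tier token budgets are illustrative assumptions.

def compression_budget(turn_age: int) -> int | None:
    # Per-turn token budget by age only. None means keep the turn verbatim.
    if turn_age < 1:        # current turn: verbatim
        return None
    if turn_age <= 4:       # light clip
        return 2000
    if turn_age <= 9:       # moderate clip
        return 800
    if turn_age <= 14:      # heavily compressed
        return 250
    return 0                # summary only: drop the turn body, rely on the summary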

A counter-intuitive consequence

Heavily compressing old turns means the model often forgets that it tried something three turns ago. That’s usually fine (it can re-discover) and sometimes bad (it loops). The fix is not to keep the old turn verbatim; it is to surface the relevant fact (“tried X, didn’t work because Y”) into the working memory, where it stays even after the turn that produced it has been compacted.

05

Strategy 4 — Deduplication

The fourth lever is the easiest to forget and one of the most effective. When a coding session reads src/db/pool.py on turns 3, 7, and 14, the prompt should not contain three copies of that file. Most agents do this badly out of the box.

What to dedupe

  • Repeated file reads — if turn N reads file F and turn N+5 reads F unchanged, the older read can be elided.
  • Repeated search hits — same query, same results, drop the older.
  • Repeated tool errors — the same “file not found” thrice tells the model nothing the first occurrence didn’t.

What NOT to dedupe

  • File reads of files that changed — the model needs to see the new state.
  • Test runs — even the same command may produce different results.
  • User messages — never silently drop user input.

Implementing dedup

Content-hash dedup over the transcript
import hashlib
import json

# Turn and replaced_with_pointer come from the harness's transcript model.
def dedupe_transcript(turns: list[Turn]) -> list[Turn]:
    seen: dict[str, int] = {}
    out: list[Turn] = []
    for i, t in enumerate(turns):
        if t.kind == "tool_result":
            # Hash the call AND its result content: a re-read of a file that
            # changed in between must not be deduplicated.
            key = hashlib.sha256(
                (t.tool_name + json.dumps(t.tool_args, sort_keys=True) + t.content).encode()
            ).hexdigest()
            if key in seen:
                # Mark the older occurrence as deduped, keeping only a pointer forward.
                older = seen[key]
                out[older] = out[older].replaced_with_pointer(turn_index=i)
            seen[key] = i
        out.append(t)
    return out

A subtle case the article calls out

Dedup must happen before the cache breakpoint logic decides what to send. If you dedupe in the prompt builder but the underlying transcript on disk still has duplicates, the cache prefix will keep mismatching across turns. Treat dedup as a normalisation step that happens once per turn, with deterministic output.

06

Two Layers of Memory

Component 5 introduces a separation that often gets missed in casual descriptions of agents. There are two distinct memories in a healthy harness, with different jobs.

Full transcript — storage

// session-9f7a.jsonl
{"role":"user","content":"fix the failing tests"}
{"role":"assistant","tool_use":"read_file",
 "args":{"path":"src/db/pool.py"}}
{"role":"tool_result","content":"<1421 bytes>"}
{"role":"assistant","tool_use":"bash",
 "args":{"command":"pytest -q"}}
{"role":"tool_result","content":"<6800 bytes>"}
... (50 more turns)
JSONL on disk · complete fidelity · survives restart · basis for replay, audit, summarisation

Working memory — what stays in the prompt

{
  "task": "fix idle-eviction tests",
  "important_files": [
    "src/db/pool.py",
    "tests/test_pool.py"
  ],
  "decisions": [
    "use monotonic clock not wall clock",
    "evict only when pool is at capacity"
  ],
  "open_question": "should min_size be enforced?",
  "next": "rerun pytest after edit"
}
Small JSON · rewritten each turn · sits in the prompt · basis for cross-turn continuity

The article’s exact distinction

Compact transcript = prompt-time

Component 4. Reconstructs what the model sees this turn, by clipping, deduping, summarising, and recency-weighting the full transcript. Throwaway. Re-built every turn from the storage layer.

Working memory = storage-time

Component 5. A small structured file the harness writes deliberately as the session progresses. Survives compaction. Carries decisions and open threads forward across many turns or even across sessions.

Why two layers and not one?

The compaction system lossily compresses old turns to fit the prompt budget. If a critical decision lives only in turn 4 of a 30-turn session, by turn 25 it’s gone — even though it’s the most load-bearing piece of context. Working memory is the harness’s mechanism for committing facts to a layer that doesn’t lossy-compress. It’s persistence with intent.

07

What Actually Lives in Working Memory

Working memory should be small — 200–800 tokens is typical — and tightly structured. Bloated working memory is a contradiction in terms.

Field | Holds | Updated when
task | One-line statement of what the user is trying to accomplish. | Set at session start; rewritten if the user pivots.
important_files | 3–7 paths the agent has identified as load-bearing for this task. | Each time a new file is touched non-trivially.
decisions | Short, bullet-style record of choices made — algorithms, data structures, naming. The thing summarisation tends to drop. | Each time the agent commits to an approach.
open_questions | Things the agent is unsure about and may need to ask. | Each time uncertainty is raised — cleared when answered.
next | The single next action. The agent's plan from this turn. | Every turn.
blockers | Tests still failing, missing dependencies, build errors. | As they appear or are resolved.
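
The same fields as a structure the harness might serialise each turn; a minimal sketch, with field names taken from the table and assumed defaults.

import json
from dataclasses import dataclass, field, asdict

@dataclass
class WorkingMemory:
    task: str = ""
    important_files: list[str] = field(default_factory=list)   # 3-7 load-bearing paths
    decisions: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)
    next: str = ""
    blockers: list[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        # Rendered into the prompt every turn; keep it in the 200-800 token range.
        return json.dumps(asdict(self), indent=2)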

How working memory gets updated

Two patterns coexist in the wild:

Tool-mediated update

The harness exposes an update_working_memory tool. The model emits a structured edit and the harness applies it. Pro: explicit, auditable, the model has to articulate its reasoning. Con: extra tool call per turn.

Harness-side derivation

After each turn, the harness runs a small summariser that updates working memory automatically based on what the turn touched. Pro: no extra tool call. Con: hidden machinery; harder to debug when working memory drifts.
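
A sketch of the tool-mediated pattern above, assuming the WorkingMemory structure from the previous table; the edit payload shape is illustrative, not a real tool schema.

def apply_memory_edit(memory: WorkingMemory, edit: dict) -> WorkingMemory:
    # `edit` is the structured payload the model passes to update_working_memory,
    # e.g. {"decisions": {"add": "use monotonic clock not wall clock"},
    #       "next": "rerun pytest after edit"}.
    for field_name, change in edit.items():
        current = getattr(memory, field_name)
        if isinstance(current, list):
            if "add" in change:
                current.append(change["add"])
            if "remove" in change and change["remove"] in current:
                current.remove(change["remove"])
        else:
            setattr(memory, field_name, change)   # scalar fields are overwritten
    return memory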

Cross-session memory

Some harnesses (Claude Code with /memory commands, or its automatic memory feature) persist a subset of working memory across sessions. This is more controversial: cross-session memory can encode helpful user preferences (“always run pytest with -q”) or terrible stale beliefs (“the test framework is unittest” — six months later it’s pytest). Treat it as a useful tool with a non-trivial maintenance cost: review and prune.

08

Interactive: Context-Budget Visualizer

Slide the inputs to see how each compaction strategy reshapes a turn’s prompt. The bar shows how the context window is allocated; the warning lights up when the budget overflows the model’s window.

[Interactive widget: a stacked bar allocating the context window across stable prefix, compacted transcript, working memory, tool result, user message, and headroom, with a total-prompt readout and a warning when the prompt exceeds the model's context window and the harness must compact harder or truncate the oldest turns.]

What to play with

Try the same raw transcript size at compaction ratios 1.00 (no compaction), 0.30, and 0.10. Notice how aggressively a small ratio can recover headroom — and how a too-aggressive ratio leaves no room for the actual session content. The art of Component 4 is staying in the comfortable middle of this knob.
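
The arithmetic behind the bar, as a rough sketch; every component size here is an illustrative assumption except the compaction ratio, which is the knob being discussed.

def prompt_tokens(raw_transcript: int, compaction_ratio: float,
                  stable_prefix: int = 12_000, working_memory: int = 600,
                  tool_result: int = 1_500, user_msg: int = 200) -> int:
    # Every component except the transcript is roughly constant per turn.
    return (stable_prefix + int(raw_transcript * compaction_ratio)
            + working_memory + tool_result + user_msg)

window = 200_000
for ratio in (1.00, 0.30, 0.10):
    total = prompt_tokens(raw_transcript=250_000, compaction_ratio=ratio)
    print(f"ratio {ratio:.2f}: {total:>7} tokens, headroom {window - total:>8}")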

09

Pitfalls — Where Compaction Bites Back

Summary loses the diff

The compactor summarises “edited pool.py to add eviction guard” without keeping the actual code. Three turns later the model writes a different guard. Mitigate: store the most-recent edit’s diff in working memory, not just the description.

Dedup hides a real change

The agent reads config.yaml on turn 5 and again on turn 12, but between them the user edited it. If dedup elides the second read, the model uses stale config. Mitigate: hash by content, not by request.

Working memory drifts from reality

The model thinks the test framework is pytest because that was true 30 turns ago, but the user switched to unittest. Mitigate: refresh working memory from observed tool results, not just from prior model assertions.

Compaction during a multi-turn task

The user is mid-debug, the agent has explored four files and proposed a plan. Compaction fires, summary loses the plan’s structure, agent restarts the exploration. Mitigate: don’t compact while a planned action is in flight; gate on a quiet point.

Lost-in-the-middle of working memory

If working memory grows past ~1 500 tokens, the “lost in the middle” effect bites it too. Keep it small and structured; if you have more state than fits, that state belongs in named files in the workspace, not in working memory.

Summarising secrets into the prompt

A tool result accidentally contained an API key; the summariser dutifully includes it in the compact transcript. Now it’s in every subsequent prompt. Mitigate: redact at clipping time, never at summarisation time.
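
A minimal redaction pass applied before clipping; the patterns are illustrative stand-ins, not an exhaustive secret scanner.

import re

SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),                   # OpenAI-style API keys
    re.compile(r"AKIA[0-9A-Z]{16}"),                      # AWS access key ids
    re.compile(r"(?i)(api[_-]?key|token)\s*[:=]\s*\S+"),  # key=value style leaks
]

def redact(text: str) -> str:
    # Applied to raw tool output BEFORE clipping or summarisation ever sees it.
    for pat in SECRET_PATTERNS:
        text = pat.sub("[REDACTED]", text)
    return text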

10

Things to Try Yourself

Plot context vs turns

Instrument an agent run: log the prompt size at every turn. Plot it. Healthy sessions stay roughly flat after turn 5; unhealthy ones grow linearly until they hit the window. The slope is the bug.
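
A sketch of that instrumentation, assuming the harness appends one JSON line with the prompt size per turn; matplotlib is used only for the plot.

import json
import matplotlib.pyplot as plt

# One object per turn, e.g. {"turn": 7, "prompt_tokens": 41250}, written by the
# harness wherever it finishes building the prompt.
with open("prompt_sizes.jsonl") as f:
    sizes = [json.loads(line)["prompt_tokens"] for line in f]

plt.plot(range(1, len(sizes) + 1), sizes)
plt.xlabel("turn")
plt.ylabel("prompt tokens")
plt.title("Context size per turn (flat after ~turn 5 is healthy)")
plt.show()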

Diff your working memory across turns

Save working memory at every turn. Run diff between adjacent turns. Are most updates additive (good)? Or does the same field churn every turn (bad — suggests your update logic is unstable)?

Replay with different compaction ratios

Take a recorded session. Replay it with compaction ratios 1.00, 0.50, 0.20. Does the agent reach the same answer? Where does aggressive compaction break correctness? That tells you where your important state lives.

Build a clipping function with re-expansion

Implement clip_output + an expand_clip(id) tool. Watch how often the model uses it. Models that have a re-expansion mechanism stop being scared of the clip marker; they ask back when they need to.
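
A sketch of the re-expansion half, pairing with the _CLIP_STORE used in the clip_output example earlier; how the tool gets registered with the model is left out.

def expand_clip(blob_id: str, max_tokens: int = 8_000) -> str:
    # Tool handler: return the full output stored by clip_output, itself re-clipped
    # at a larger budget so a single expansion cannot blow the prompt.
    full = _CLIP_STORE.get(blob_id)
    if full is None:
        return f"[no stored output for id {blob_id}; it may have expired]"
    expanded, _ = clip_output(full, max_tokens=max_tokens)
    return expanded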

Where this leads

Components 4 and 5 keep one agent’s context healthy. Component 6 — subagents — is the next lever: when the task is bigger than one context can hold, you spin up a bounded child to handle a piece of it. That’s the next deck.

Next deck → Subagents & Synthesis