Companion deck for Components 4 + 5 of Sebastian Raschka’s essay. The harness levers that turn a runaway 200k-token transcript back into a focused, useful prompt — clipping, dedup, recency, summarisation — plus the storage-time discipline of working memory vs full transcript.
You might think that bigger context windows solved the bloat problem. They didn’t. They just moved it. Three independent forces still push back against feeding the model everything you have.
At $15 per million input tokens, a 200k-token prompt costs $3 per turn. A 30-turn session at full context is $90. Caching helps the prefix, but tool outputs and growing transcripts are cache-cold. Every wasted token is real money.
Time-to-first-token scales roughly with input length even with caching. A coding agent that takes 8 seconds per turn instead of 2 feels broken, regardless of correctness.
The “lost in the middle” effect: models reliably attend to the start and end of context, but content in the middle gets de-prioritised. A 150k-token prompt with the answer at position 75k may underperform a 20k-token prompt with the answer at position 18k.
“A lot of apparent ‘model quality’ is really context quality.” When an agent stops working on a complex task — misremembers a decision, re-edits a file the user already corrected, asks a question it already had the answer to — the cause is rarely the model’s reasoning. It is what the harness chose to put in front of the model that turn.
npm install can produce 8 000 lines of warnings. Clipping is the cheapest, most effective compaction technique: cap every individual tool result at a fixed budget, usually 1 000–3 000 tokens. Anything longer is truncated with an explicit marker the model can ask to expand.
from dataclasses import dataclass

@dataclass
class ClipMeta:
    clipped: bool
    original_len: int = 0
    elided: int = 0
    full_blob_id: int | None = None

# Tokenizer stand-ins; a real harness would use its model's tokenizer (e.g. tiktoken).
def tokenize(text: str) -> list[str]: return text.split()
def detokenize(tokens: list[str]) -> str: return " ".join(tokens)

def clip_output(text: str, max_tokens: int = 1500,
                head_frac: float = 0.6) -> tuple[str, ClipMeta]:
    # Token-aware clip; head + tail with elision marker.
    tokens = tokenize(text)
    if len(tokens) <= max_tokens:
        return text, ClipMeta(clipped=False)
    head_n = int(max_tokens * head_frac)
    tail_n = max_tokens - head_n - 12  # reserve space for the marker
    head = detokenize(tokens[:head_n])
    tail = detokenize(tokens[-tail_n:])
    elided = len(tokens) - head_n - tail_n
    marker = (f"\n\n[... {elided} tokens elided "
              f"— ask 'expand_clip(id={id(text)})' to see full output ...]\n\n")
    return head + marker + tail, ClipMeta(
        clipped=True, original_len=len(tokens), elided=elided,
        full_blob_id=id(text))  # id() stands in for a real blob-store key
For tool outputs, keep more head than tail — usually 60/40. The most-relevant context for the model is typically near the start (file imports, error type) with the actual error or value near the end. The 60/40 head/tail split outperforms a symmetric 50/50 in nearly every measurement we’ve seen.
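A quick illustration of the clipper on a noisy install log; the numbers follow from the defaults above.

noisy = "\n".join(f"npm WARN deprecated pkg-{i}" for i in range(8000))
clipped, meta = clip_output(noisy, max_tokens=1500)
assert meta.clipped and meta.elided > 0
# head_n = 900 tokens, tail_n = 588: the model sees the start of the log,
# the end of the log, and an explicit marker it can ask to expand.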
Clipping handles individual outputs. Compaction handles the whole transcript: collapse old turns into a short summary, keeping recent ones verbatim. Raschka calls this “transcript reduction/summarisation” — an LLM-driven step that periodically rewrites the session history.
| Trigger | How it works | Trade-off |
|---|---|---|
| Token threshold | When transcript size exceeds, e.g., 30% of context window, compact the oldest 70% of turns. | Predictable; can fire mid-task and lose nuance. |
| Turn-count threshold | Every N turns (often 10), compact the older N−3 turns. | Simple; ignores actual size; can over-compact tiny turns. |
| User-driven | The user runs /compact (Claude Code) or similar before a long-running task. | Predictable but requires user awareness. |
| Idle compaction | Compact during idle waiting (user thinking) so the next turn starts clean. | Hides cost; complex to schedule reliably. |
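As a concrete sketch of the token-threshold row: a minimal trigger, assuming Turn objects carry a token_count field and using the summarise_turns helper sketched just below (both names are illustrative, not from the essay).

def maybe_compact(turns: list["Turn"], window_tokens: int = 200_000,
                  threshold: float = 0.30, compact_frac: float = 0.70) -> list["Turn"]:
    # Fire once the transcript crosses 30% of the window;
    # collapse the oldest 70% of turns into one summary turn.
    total = sum(t.token_count for t in turns)  # token_count: assumed per-turn field
    if total <= threshold * window_tokens:
        return turns
    cut = int(len(turns) * compact_frac)
    return [summarise_turns(turns[:cut])] + turns[cut:]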
Summarising a coding session well is its own prompt-engineering problem. The minimum that works: “Summarise the following turns in 200 tokens or fewer. Preserve: file paths touched, decisions made, blocking issues, the next planned action. Discard: tool-result text, internal model thinking, redundant reads.” Models that get this prompt produce summaries an order of magnitude more useful than “summarise the conversation”.
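What that looks like wired up: a sketch of summarise_turns with the prompt baked in. llm_complete is a stand-in for whatever model client the harness uses, and Turn is the transcript type sketched on the dedup slide.

SUMMARY_PROMPT = (
    "Summarise the following turns in 200 tokens or fewer. "
    "Preserve: file paths touched, decisions made, blocking issues, "
    "the next planned action. Discard: tool-result text, internal "
    "model thinking, redundant reads."
)

def summarise_turns(turns: list["Turn"]) -> "Turn":
    # Collapse a span of old turns into one synthetic summary turn.
    transcript = "\n".join(t.content for t in turns)
    summary = llm_complete(SUMMARY_PROMPT + "\n\n" + transcript)  # stand-in client call
    return Turn(kind="summary", content=summary)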
Raschka’s phrasing: “keep recent events richer because they are more likely to matter for the current step”. Concretely, this is a sliding policy that gets more aggressive about compression the further back a turn is.
The policy is a function of turn age, not of content importance — importance gets folded in via the working-memory layer (covered next slide). The age-only rule is what makes the policy cheap to implement and predictable for users.
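One way to express the sliding policy in code; the budgets are illustrative, not prescriptive.

def budget_for(age_in_turns: int) -> int:
    # Age-only rule: older turns get smaller token budgets, regardless of content.
    if age_in_turns <= 3:
        return 3000  # recent turns: near-verbatim
    if age_in_turns <= 10:
        return 800   # mid-range turns: clipped hard
    return 150       # old turns: one-line-summary territory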
Heavily compressing old turns means the model often forgets that it tried something three turns ago. That’s usually fine (it can re-discover) and sometimes bad (it loops). The fix is not to keep the old turn verbatim; it is to surface the relevant fact (“tried X, didn’t work because Y”) into the working memory, where it stays even after the turn that produced it has been compacted.
The fourth lever is the easiest to forget and one of the most effective. When a coding session reads src/db/pool.py on turns 3, 7, and 14, the prompt should not contain three copies of that file. Most agents do this badly out of the box.
from dataclasses import dataclass, replace
from hashlib import sha256

@dataclass(frozen=True)
class Turn:
    kind: str            # "user" | "assistant" | "tool_result" | "summary"
    tool_name: str = ""
    tool_args: str = ""  # canonical JSON string of the call's arguments
    content: str = ""

    def replaced_with_pointer(self, turn_index: int) -> "Turn":
        # Swap the stale body for a pointer to the newer copy.
        return replace(self, content=f"[deduped: same result as turn {turn_index}]")

def dedupe_transcript(turns: list[Turn]) -> list[Turn]:
    seen: dict[str, int] = {}  # call-signature hash -> index in `out`
    out: list[Turn] = []
    for i, t in enumerate(turns):
        if t.kind == "tool_result":
            key = sha256((t.tool_name + t.tool_args).encode()).hexdigest()
            if key in seen:
                # Mark the older copy as deduped, keep only a pointer to this one.
                older = seen[key]
                out[older] = out[older].replaced_with_pointer(turn_index=i)
            seen[key] = i
        out.append(t)
    return out
Dedup must happen before the cache breakpoint logic decides what to send. If you dedupe in the prompt builder but the underlying transcript on disk still has duplicates, the cache prefix will keep mismatching across turns. Treat dedup as a normalisation step that happens once per turn, with deterministic output.
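In pipeline form, the ordering constraint looks like this; clip_turn and assemble are hypothetical helpers standing in for the harness's own steps.

def build_prompt(transcript: list[Turn]) -> str:
    turns = dedupe_transcript(transcript)  # normalise first, deterministically
    turns = [clip_turn(t) for t in turns]  # then per-result clipping
    turns = maybe_compact(turns)           # then whole-transcript compaction
    return assemble(turns)                 # cache breakpoints now see stable input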
Component 5 introduces a separation that often gets missed in casual descriptions of agents. There are two distinct memories in a healthy harness, with different jobs.
// session-9f7a.jsonl
{"role":"user","content":"fix the failing tests"}
{"role":"assistant","tool_use":"read_file",
"args":{"path":"src/db/pool.py"}}
{"role":"tool_result","content":"<1421 bytes>"}
{"role":"assistant","tool_use":"bash",
"args":{"command":"pytest -q"}}
{"role":"tool_result","content":"<6800 bytes>"}
... (50 more turns)
{
"task": "fix idle-eviction tests",
"important_files": [
"src/db/pool.py",
"tests/test_pool.py"
],
"decisions": [
"use monotonic clock not wall clock",
"evict only when pool is at capacity"
],
"open_question": "should min_size be enforced?",
"next": "rerun pytest after edit"
}
Component 4. Reconstructs what the model sees this turn by clipping, deduping, summarising, and recency-weighting the full transcript. Throwaway: rebuilt every turn from the storage layer.
Component 5. A small structured file the harness writes deliberately as the session progresses. Survives compaction. Carries decisions and open threads forward across many turns or even across sessions.
The compaction system lossily compresses old turns to fit the prompt budget. If a critical decision lives only in turn 4 of a 30-turn session, by turn 25 it’s gone — even though it’s the most load-bearing piece of context. Working memory is the harness’s mechanism for committing facts to a layer that doesn’t lossy-compress. It’s persistence with intent.
Working memory should be small — 200–800 tokens is typical — and tightly structured. Bloated working memory is a contradiction in terms.
| Field | Holds | Updated when |
|---|---|---|
| task | One-line statement of what the user is trying to accomplish. | Set at session start; rewritten if the user pivots. |
| important_files | 3–7 paths the agent has identified as load-bearing for this task. | Each time a new file is touched non-trivially. |
| decisions | Short, bullet-style record of choices made: algorithms, data structures, naming. The thing summarisation tends to drop. | Each time the agent commits to an approach. |
| open_questions | Things the agent is unsure about and may need to ask. | Each time uncertainty is raised; cleared when answered. |
| next | The single next action. The agent's plan from this turn. | Every turn. |
| blockers | Tests still failing, missing dependencies, build errors. | As they appear or are resolved. |
Two patterns coexist in the wild:
Model-driven. The harness exposes an update_working_memory tool; the model emits a structured edit and the harness applies it (a minimal sketch follows below). Pro: explicit, auditable, the model has to articulate its reasoning. Con: an extra tool call per turn.
Harness-driven. After each turn, the harness runs a small summariser that updates working memory automatically based on what the turn touched. Pro: no extra tool call. Con: hidden machinery; harder to debug when working memory drifts.
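The promised sketch of the explicit-tool pattern, assuming a simple set/append edit schema (the schema is hypothetical, not from any particular harness).

def update_working_memory(memory: dict, edit: dict) -> dict:
    # The model emits a structured edit; the harness applies it and persists the file.
    for field, value in edit.get("set", {}).items():
        memory[field] = value  # e.g. {"set": {"next": "rerun pytest"}}
    for field, value in edit.get("append", {}).items():
        memory.setdefault(field, []).append(value)  # e.g. decisions, blockers
    return memory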
Some harnesses (Claude Code with /memory commands, or its automatic memory feature) persist a subset of working memory across sessions. This is more controversial: cross-session memory can encode helpful user preferences (“always run pytest with -q”) or terrible stale beliefs (“the test framework is unittest” — six months later it’s pytest). Treat it as a useful tool with a non-trivial maintenance cost: review and prune.
Slide the inputs to see how each compaction strategy reshapes a turn’s prompt. The bar shows how the context window is allocated; the warning lights up when the budget overflows the model’s window.
Try the same raw transcript size at compaction ratios 1.00 (no compaction), 0.30, and 0.10. Notice how aggressively a small ratio can recover headroom — and how a too-aggressive ratio leaves no room for the actual session content. The art of Component 4 is staying in the comfortable middle of this knob.
The compactor summarises “edited pool.py to add eviction guard” without keeping the actual code. Three turns later the model writes a different guard. Mitigate: store the most-recent edit’s diff in working memory, not just the description.
The agent reads config.yaml on turn 5 and again on turn 12, but between them the user edited it. If dedup elides the second read, the model uses stale config. Mitigate: hash by content, not by request.
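In the dedup sketch earlier, that means keying on the result body rather than the call signature:

# Content-addressed key: identical requests with different results no longer collide.
key = sha256(t.content.encode()).hexdigest()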
The model thinks the test framework is pytest because that was true 30 turns ago, but the user switched to unittest. Mitigate: refresh working memory from observed tool results, not just from prior model assertions.
The user is mid-debug, the agent has explored four files and proposed a plan. Compaction fires, summary loses the plan’s structure, agent restarts the exploration. Mitigate: don’t compact while a planned action is in flight; gate on a quiet point.
If working memory grows past ~1 500 tokens, the “lost in the middle” effect bites it too. Keep it small and structured; if you have more state than fits, that state belongs in named files in the workspace, not in working memory.
A tool result accidentally contained an API key; the summariser dutifully includes it in the compact transcript. Now it’s in every subsequent prompt. Mitigate: redact at clipping time, never at summarisation time.
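A minimal redaction pass, run on raw tool output before it is clipped or stored; the pattern list is illustrative, not exhaustive.

import re

SECRET_RE = re.compile(r"(?i)\b(api[_-]?key|token|secret|password)\b\s*[:=]\s*\S+")

def redact(text: str) -> str:
    # Applied at ingestion: nothing downstream (clipper, summariser, cache) ever sees the secret.
    return SECRET_RE.sub(lambda m: m.group(1) + "=[REDACTED]", text)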
Instrument an agent run: log the prompt size at every turn. Plot it. Healthy sessions stay roughly flat after turn 5; unhealthy ones grow linearly until they hit the window. The slope is the bug.
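The instrumentation is a couple of lines in the turn loop, reusing whatever tokenizer the harness already has.

import logging
log = logging.getLogger("harness")

# Inside the turn loop:
log.info("turn=%d prompt_tokens=%d", turn_no, len(tokenize(prompt)))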
Save working memory at every turn. Run diff between adjacent turns. Are most updates additive (good)? Or does the same field churn every turn (bad — suggests your update logic is unstable)?
Take a recorded session. Replay it with compaction ratios 1.00, 0.50, 0.20. Does the agent reach the same answer? Where does aggressive compaction break correctness? That tells you where your important state lives.
Implement clip_output + an expand_clip(id) tool. Watch how often the model uses it. Models that have a re-expansion mechanism stop being scared of the clip marker; they ask back when they need to.
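The re-expansion tool is almost trivial once clip_output records blob ids; blob_store here is a hypothetical persistent map from full_blob_id to the unclipped text.

blob_store: dict[int, str] = {}  # full_blob_id -> original, unclipped tool output

def expand_clip(blob_id: int) -> str:
    # Lets the model recover anything the clipper elided.
    return blob_store.get(blob_id, "[blob expired or unknown id]")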
Components 4 and 5 keep one agent’s context healthy. Component 6 — subagents — is the next lever: when the task is bigger than one context can hold, you spin up a bounded child to handle a piece of it. That’s the next deck.