Beyond Standard LLMs — Companion to Sebastian Raschka — Presentation 04

Small Recursive Transformers

7 million parameters that beat 70 billion at the task they care about. HRM and TRM solve ARC-AGI by iterating an answer rather than emitting it. The most counter-intuitive result in the article: attention may not be required.

(Diagram: input grid → latent z₀ → refine answer y → update latent z → halt? → final y)


00

Topics We’ll Cover

Where this deck sits in the series

The first three decks were about scaling-with-twists. This one is the opposite bet: scaling down. A 7M-parameter recursive model beats a 70B LLM on a benchmark designed to measure reasoning. Whether that’s a one-off (puzzle-shaped tasks favour iterative solvers) or a general principle (small specialised reasoners as components in larger systems) is the open question.

01

A Reasonable Question — What If Depth Is Just Iteration?

A standard transformer is a fixed stack: 24, 80, 100 layers deep. Each forward pass runs through every layer once, and the answer is whatever falls out the bottom. If the task needs more reasoning than your stack can do in one pass, the standard answer has been: train a bigger stack. Or: let the model emit chain-of-thought tokens so it can re-enter itself textually.

Recursive transformers ask a different question. What if a small model ran itself many times in a learned loop? The depth-equivalent comes from iteration count, not from layer count. A 4-block model run 16 times has, in some sense, 64 layers of computation — but with 16× fewer parameters and a learned halting behaviour that decides when to stop.

Standard transformer

  • Fixed 24–100 layers, run once per forward pass.
  • Width and depth are training-time choices, frozen at deploy.
  • Computation per token is a constant, regardless of difficulty.
  • Reasoning depth is bounded by physical layer count.

Recursive transformer

  • Small core (e.g. 2–4 layers), run K times in a loop.
  • K is variable per input; the model learns when to halt.
  • Easy inputs use few iterations; hard ones use many.
  • Total compute is adaptive; parameter count is tiny.
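The contrast above can be made concrete with a toy sketch (illustrative only, not the paper's code): one small weight-tied block applied in a loop, with a crude stand-in for a learned halt signal. Depth comes from the iteration count, not from stacking distinct layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# One shared block, reused every iteration. A real recursive transformer
# would use 2-4 transformer layers here; a Linear + tanh stands in.
W = rng.normal(0, 0.05, (16, 16))

def core(state, x):
    # The same parameters are applied at every step: a 1-block core run
    # 16 times gives 16 "layers" of computation with 1 block of weights.
    return np.tanh(state @ W + x)

def recursive_forward(x, max_iters=16, tol=1e-3):
    state = np.zeros(16)
    for k in range(1, max_iters + 1):
        new_state = core(state, x)
        # Stand-in for a learned halt head: stop once the state stabilises,
        # so easy inputs use few iterations and hard ones use the budget.
        if np.linalg.norm(new_state - state) < tol:
            return new_state, k
        state = new_state
    return state, max_iters

out, iters = recursive_forward(rng.normal(0, 1, 16))
print("iterations used:", iters)
```

The key property is that `max_iters` is an inference-time knob, not a training-time architecture choice.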
A historical echo

This is conceptually the “adaptive computation time” idea from 2016 (Graves), and the “universal transformer” from 2018 — but with a 2025 training recipe and a domain (ARC-AGI) where the iterative-refinement bias matches the problem structure. The architectural ideas are old; the result is new.

02

HRM — Hierarchical Reasoning Model

HRM is the model that re-validated this approach in 2025. Two small transformer modules (each 4 blocks deep), arranged hierarchically: an outer module that updates a high-level latent state, an inner module that operates on the answer conditioned on that state. The outer module is the “planner”; the inner is the “solver”.

(Diagram: high-level module, 4 blocks, updates latent z, the “planner”; low-level module, 4 blocks, updates answer y, the “solver”. The latent z carries the plan across iterations; the answer y is the grid being filled in, and it feeds back into the next iteration.)

The headline result

HRM topped the ARC-AGI leaderboard at the time of release — the benchmark explicitly designed to test out-of-distribution reasoning that LLMs struggle with. Two small modules iterating against a benchmark designed to break large frozen models. The architectural choice was the lever, not parameter count.

A useful comparison

Frontier LLMs running on ARC-AGI typically score in the single digits to low tens. HRM, with orders of magnitude fewer parameters, scored high enough to start a small industry of follow-up work. The next slide is the most important entry in that follow-up work.

03

TRM — Tiny Recursive Model, 7M Parameters

TRM is the simplification that ate HRM. The hierarchical structure turned out not to be necessary: a single 2-layer transformer, run recursively with the same alternating-update pattern, beats HRM on ARC at a quarter of the parameter count. Seven million parameters — about a hundredth of a percent of a 70B LLM — and a $500 training run.

HRM vs TRM at a glance:

  • Parameters: HRM ~27M; TRM ~7M (4× smaller)
  • Architecture: HRM two 4-block modules (hierarchical); TRM a single 2-layer transformer
  • Attention: HRM standard self-attention; TRM bidirectional self-attention, or no attention at all (see slide 05)
  • Update rule: HRM alternating high-/low-level updates; TRM alternating refinement of latent z and answer y
  • Backprop: HRM through the final few iterations; TRM through all recursive steps
  • Halting: HRM heuristic; TRM binary cross-entropy on a halt head
  • Training cost: HRM not published as a headline figure; TRM ~$500 on 4×H100, ~2 days
  • ARC-AGI score: HRM top at release; TRM beats HRM

The two design choices that mattered

  1. Backprop through all recursive steps. HRM only differentiated through a few of its iterations to keep memory tractable. TRM differentiates through every step, accepting the memory cost in exchange for cleaner gradients. The improvement is large enough to absorb the simplification of the architecture.
  2. Learned halting via BCE loss. A binary head decides per-input whether to run another iteration. Easy puzzles halt early (saving compute); hard ones get more iterations (spending compute where it matters). The model is doing test-time compute scaling adaptively.
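Both design choices show up directly in a training step. The sketch below (illustrative names and shapes, not the paper's code; the halt supervision here is a simplified stand-in) unrolls a shared core for every step so gradients reach all of them, and adds a BCE loss on a halt head.

```python
import torch
import torch.nn as nn

class TinyRecursive(nn.Module):
    """Toy TRM-style model: one shared core, fully unrolled."""
    def __init__(self, d=32, k_max=8):
        super().__init__()
        self.k_max = k_max
        self.core_z = nn.Linear(3 * d, d)   # updates latent z from (x, y, z)
        self.core_y = nn.Linear(3 * d, d)   # updates answer y from (x, y, z)
        self.halt = nn.Linear(d, 1)         # per-step halt logit, read off z

    def forward(self, x):
        y = torch.zeros_like(x)
        z = torch.zeros_like(x)
        halt_logits = []
        # Full unroll with no detach between steps: backprop reaches
        # every iteration, at the cost of storing all activations.
        for _ in range(self.k_max):
            z = torch.tanh(self.core_z(torch.cat([x, y, z], -1)))
            y = torch.tanh(self.core_y(torch.cat([x, y, z], -1)))
            halt_logits.append(self.halt(z))
        return y, torch.cat(halt_logits, -1)

model = TinyRecursive()
x = torch.randn(4, 32)
target = torch.randn(4, 32)
y, halt_logits = model(x)

# Answer loss plus BCE on the halt head. Here the halt target just says
# "halt at the last step"; the real recipe supervises halting differently.
halt_target = torch.zeros_like(halt_logits)
halt_target[:, -1] = 1.0
loss = nn.functional.mse_loss(y, target) \
     + nn.functional.binary_cross_entropy_with_logits(halt_logits, halt_target)
loss.backward()   # gradients flow through all unrolled steps
```

At inference, the halt logits are thresholded per input, which is where the adaptive test-time compute comes from.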
Raschka’s framing

“TRM uses a single 2-layer transformer with bidirectional attention. Two alternating updates: (1) compute latent reasoning state, (2) update answer. Backpropagates through all recursive steps. Binary cross-entropy loss determines iteration halting.” That is the entire architecture. If a description that short produces the result on the next slide, something interesting is going on in the loop, not in the layers.

04

How Recursion Replaces Depth

The mechanism is conceptually simple. The model maintains two state vectors: an answer y (the grid being filled in) and a latent z (the reasoning state). On each iteration, both get updated, conditioned on each other and on the input.

Update step

  1. Update z using (input, current y, current z). The latent absorbs new evidence about the partial answer.
  2. Update y using (input, current y, new z). The answer is refined given the updated reasoning state.
  3. Run halt head on z. If above threshold, return y. Otherwise loop.
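The three steps above map directly onto a short loop. A minimal sketch with frozen random weights standing in for the trained update functions (all names and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
# Illustrative stand-ins for the trained update functions and halt head.
Wz = rng.normal(0, 0.2, (3 * d, d))
Wy = rng.normal(0, 0.2, (3 * d, d))
wh = rng.normal(0, 0.2, d)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def solve(x, max_iters=16, threshold=0.9):
    y = np.zeros(d)   # answer being refined
    z = np.zeros(d)   # latent reasoning state
    for k in range(1, max_iters + 1):
        # 1. Update z from (input, current y, current z).
        z = np.tanh(np.concatenate([x, y, z]) @ Wz)
        # 2. Update y from (input, current y, new z).
        y = np.tanh(np.concatenate([x, y, z]) @ Wy)
        # 3. Halt head on z: return y once confident enough.
        if sigmoid(z @ wh) > threshold:
            return y, k
    return y, max_iters

answer, iters = solve(rng.normal(0, 1, d))
```

Note the ordering: y is refined against the already-updated z, so each iteration is one full plan-then-act cycle.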

What this resembles

An iterative solver doing belief propagation: each step accumulates evidence, refines a candidate, and decides whether to do more work. The model is not just “deeper by repetition” — the recursive structure aligns the architecture with the iterative nature of the puzzles being solved.

Why grid puzzles are the right target

A general lesson

Recursion-as-depth is most powerful when the task itself has an iterative structure that classical algorithms would also exploit. Recursive transformers don’t replace next-token prediction for general text — they replace fixed-depth attention for tasks that look like constraint satisfaction. Match the architectural inductive bias to the problem.

05

The Attention-Not-Required Ablation

The most shocking single finding in the article: “Attention is not required... Replacing self-attention with a pure MLP layer also improved accuracy (74.7% to 87.4%)”. On the recursive transformer’s grid-puzzle benchmark, removing attention entirely — replacing it with a feedforward MLP — not only worked but worked better.

Why this is so striking

For seven years the load-bearing claim of the field has been “attention is the mechanism that makes transformers work”. On this task, on this architecture, in this scale regime, that claim is empirically false. Whatever is doing the work, it isn’t the attention mechanism per se.

What is plausibly doing the work

The recursion — iterating any nonlinear transformation on a structured state — appears to be the load-bearing primitive. Attention was a useful enough mixing operation that nobody noticed it could be replaced with an MLP at this scale. On a fixed grid where positional structure is given by the input itself, the all-pairs token mixing attention provides may be redundant.
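The ablation is easy to picture in code: the per-step mixing operation is a plug-in module inside the recursive loop, so self-attention can be swapped for a token-mixing MLP over the fixed grid. This sketch is illustrative (not the paper's code) and only shows that both variants are drop-in compatible:

```python
import torch
import torch.nn as nn

def make_mixer(kind, n_tokens=16, d=32):
    if kind == "attention":
        return nn.MultiheadAttention(d, num_heads=4, batch_first=True)
    # On a fixed-size grid, an MLP applied across the token axis can mix
    # positions without computing any pairwise attention scores.
    return nn.Sequential(nn.Linear(n_tokens, n_tokens), nn.GELU(),
                         nn.Linear(n_tokens, n_tokens))

def step(mixer, kind, h):
    if kind == "attention":
        out, _ = mixer(h, h, h)
        return out
    # Transpose so the Linear layers act on the token dimension.
    return mixer(h.transpose(1, 2)).transpose(1, 2)

h = torch.randn(2, 16, 32)          # (batch, grid cells, width)
for kind in ("attention", "mlp"):
    mixer = make_mixer(kind)
    out = h
    for _ in range(8):              # identical recursive loop either way
        out = out + step(mixer, kind, out)
    print(kind, out.shape)
```

The MLP mixer only works because the grid size is fixed; variable-length text is exactly where this substitution stops being available.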

How seriously to take this

Read this one carefully

This is the kind of result that gets repeated badly. Do not let yourself or anyone else summarise it as “attention isn’t actually needed”. The right summary is: at small scale, on tasks with iterative structure, the recursive loop carries the capability and the per-step mixing operation can be much simpler than full attention. That is a much more interesting and less reckless claim.

06

Interactive: Recursive Trace Viewer

A 4×4 puzzle. Three panels: the input (with given cells), the current best answer (refined per iteration), and the target. The model starts with a wrong answer; each iteration refines it. Watch the latent state (bars below) evolve and the answer converge. The halt head fires when the latent stabilises.

What to notice

The answer doesn’t fill in a fixed order — the model attacks the most-constrained cells first, then the rest fall out. The latent bars start jagged (high-uncertainty reasoning state) and smooth out as the answer converges. The halt probability climbs as the latent stabilises; once it crosses 0.9, the model returns. Easy puzzles halt in 4–6 iterations; hard ones take 12–16. Press New puzzle to see the cadence change.

07

Scope — Grid Puzzles Now, Components Later

The honest scope is narrow. HRM and TRM solve grid-shaped puzzles: Sudoku, mazes, ARC. The input and output are fixed-size grids. The technique does not yet apply to general text, to long-form generation, or to reasoning tasks with variable-length outputs.

What it’s good at today

Constraint-satisfaction puzzles with structured input/output. Tasks where iterative refinement matches the problem. Anywhere a classical solver (Z3, A*, belief propagation) would also be a reasonable choice.

What it’s not

A general-purpose LLM. A chat assistant. A code generator. A document summariser. Treating it as such will produce disappointment. The article is explicit about this.

What it might become

A tool for an agent. The same way an LLM uses a calculator API, a future agent could use a recursive-puzzle module to handle constraint problems — faster, cheaper, and more reliable than asking the main LLM.

Raschka’s speculative reading

The article’s closing speculation about recursive transformers is worth quoting: “tiny reasoning models as ‘tools’ for tool-calling LLMs, similar to how systems use Python/calculator APIs”. The interesting future is composition. A frontier model that knows when to spawn a tiny specialised reasoner is more capable, in aggregate, than either alone — and substantially cheaper to deploy.

A 2026 question to track

What other narrow domains can be packaged as a 7M-parameter recursive solver? Theorem checking? Game playing? Constraint solving for scheduling, layout, query optimisation? The TRM result strongly suggests the answer is “more than we’ve looked for”, but the sociological barriers (papers want frontier-tier results, not 7M-parameter components) are real.

08

Companion Decks — Where to Go Next

  • Deck 01 · Linear-Attention Hybrids → MiniMax-M1, Qwen3-Next, DeepSeek V3.2, Kimi Linear. Gated DeltaNet, the 3:1 hybrid pattern. KV-cache calculator widget.
  • Deck 02 · Text Diffusion Models → LLaDA, Gemini Diffusion. Iterative denoising, parallel decoding, conditional-dependency limits. Diffusion-vs-autoregressive visualiser.
  • Deck 03 · Code World Models → CWM 32B. Mid-training that makes the model predict program state evolution. SWE-bench parity at 4× smaller. World-model rollout stepper.
  • Deck 05 · When to Reach for Non-Transformer → A practical decision tree synthesising all four families. Walk your workload through the tree and see which architecture fits.
09

Things to Try Yourself

15 minutes — play with the trace viewer

Try multiple puzzles. Notice how the iteration count and halting threshold differ between easy and hard ones. The model running fewer iterations on easy puzzles is the “adaptive compute” behaviour visible without instrumentation.

1 hour — read the TRM repo

The code is <1000 lines. Find the recursive update rule, the halt head, and the BCE loss. The whole thing is shorter than most production LLM training scripts.

An afternoon — train a TRM-clone on Sudoku

Sudoku is a small enough domain to fit on a laptop. Implement the alternating-update rule, the latent + answer buffer, the halt head. Watch it converge. The training loop is unusual because gradients flow through the full unrolled recursion.

A weekend — replace attention with an MLP

Reproduce the article’s ablation. On your own toy version, sweep: attention vs MLP, iteration count, model width. Confirm that the recursion, not the mixing operation, is doing the work. The result is one of the most informative experiments you can run on a transformer.

Read the source

Start with the TRM paper as the cleanest exposition. Then HRM for context on what was being simplified. Then the original 2018 Universal Transformer paper to see the architectural lineage.

Watch for —

The first credible “recursive solver as a tool inside a chat agent” deployment. Whichever team ships that first — LLM hands off a constraint problem to a 7M solver, gets the answer back, continues — will have demonstrated the composition story. None has, as of early 2026.

A diagnostic for your team

Next time someone proposes “use the bigger LLM” for a task that looks like constraint solving, ask: could a 7M-parameter recursive model do this faster, cheaper and better? The answer is sometimes yes — and recognising it is the value the recursive-transformer line of work has added to the toolbox, regardless of whether the technique becomes general-purpose.