7 million parameters that beat 70 billion at the task they care about. HRM and TRM solve ARC-AGI by iterating an answer rather than emitting it. The most counter-intuitive result in the article: attention may not be required.
The first three decks were about scaling-with-twists. This one is the opposite bet: scaling down. A 7M-parameter recursive model beats a 70B LLM on a benchmark designed to measure reasoning. Whether that’s a one-off (puzzle-shaped tasks favour iterative solvers) or a general principle (small specialised reasoners as components in larger systems) is the open question.
A standard transformer is a fixed stack: 24, 80, 100 layers deep. Each forward pass runs through every layer once, and the answer is whatever falls out the bottom. If the task needs more reasoning than your stack can do in one pass, the standard answer has been: train a bigger stack. Or: let the model emit chain-of-thought tokens so it can re-enter itself textually.
Recursive transformers ask a different question. What if a small model ran itself many times in a learned loop? The depth-equivalent comes from iteration count, not from layer count. A 4-block model run 16 times has, in some sense, 64 layers of computation — but with 16× fewer parameters and a learned halting behaviour that decides when to stop.
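A minimal sketch of the shape of the idea in PyTorch (all names and hyperparameters here are mine, not any paper's): one small weight-tied block applied repeatedly, with a learned halt head deciding when to stop.

```python
import torch
import torch.nn as nn

class RecursiveSolver(nn.Module):
    """One small block applied up to max_steps times: depth comes from iteration."""
    def __init__(self, d_model=128, max_steps=16):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model, nhead=4, dim_feedforward=256, batch_first=True)
        self.halt_head = nn.Linear(d_model, 1)   # learned "stop now?" signal
        self.max_steps = max_steps

    def forward(self, x):                        # x: (1, seq_len, d_model)
        state = x
        for step in range(self.max_steps):
            state = self.block(state)            # same weights on every pass
            p_halt = torch.sigmoid(self.halt_head(state.mean(dim=1)))
            if p_halt.item() > 0.9:              # batch size 1 for clarity
                break
        return state, step + 1                   # final state + compute spent
```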
This is conceptually the “adaptive computation time” idea from 2016 (Graves), and the “universal transformer” from 2018 — but with a 2025 training recipe and a domain (ARC-AGI) where the iterative-refinement bias matches the problem structure. The architectural ideas are old; the result is new.
HRM is the model that re-validated this approach in 2025. Two small transformer modules (each 4 blocks deep), arranged hierarchically: an outer module that updates a high-level latent state, an inner module that operates on the answer conditioned on that state. The outer module is the “planner”; the inner is the “solver”.
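Schematically, the nested loop looks like this (a sketch with hypothetical names, not the paper's exact schedule):

```python
def hrm_forward(planner, solver, x, z_high, z_low, outer_steps=8, inner_steps=4):
    # The inner "solver" refines the low-level answer state several times
    # while the high-level plan is held fixed; the outer "planner" then
    # revises the plan from the refined result.
    for _ in range(outer_steps):
        for _ in range(inner_steps):
            z_low = solver(z_low, z_high, x)   # fast refinement under the plan
        z_high = planner(z_high, z_low)        # slow, high-level update
    return z_high, z_low
```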
HRM topped the ARC-AGI leaderboard at the time of release — the benchmark explicitly designed to test out-of-distribution reasoning that LLMs struggle with. Two small modules iterating against a benchmark designed to break large frozen models. The architectural choice was the lever, not parameter count.
Frontier LLMs running on ARC-AGI typically score in the single digits to low tens. HRM, at orders-of-magnitude fewer parameters, scored in the range that started a small industry of follow-up work. The next slide is the most important entry in that follow-up work.
TRM is the simplification that ate HRM. The hierarchical structure turned out not to be necessary: a single 2-layer transformer, run recursively with the same alternating-update pattern, beats HRM on ARC at a quarter of the parameter count. Seven million parameters — about a hundredth of a percent of a 70B LLM — and a $500 training run.
| Property | HRM | TRM |
|---|---|---|
| Parameters | ~27M | ~7M (4× smaller) |
| Architecture | Two 4-block modules (hierarchical) | Single 2-layer transformer |
| Attention | Standard self-attention | Bidirectional self-attention — or no attention at all (see slide 05) |
| Update rule | Alternating high/low-level updates | Alternating: refine latent z, refine answer y |
| Backprop | Final step only (one-step gradient approximation) | Through all recursive steps |
| Halting | Q-learning head (adaptive computation time) | Binary cross-entropy on a halt head |
| Training cost | (not published as headline) | ~$500 on 4×H100, ~2 days |
| ARC-AGI score | (top at release) | Beats HRM |
“TRM uses a single 2-layer transformer with bidirectional attention. Two alternating updates: (1) compute latent reasoning state, (2) update answer. Backpropagates through all recursive steps. Binary cross-entropy loss determines iteration halting.” That is the entire architecture. If a description that short produces the result on the next slide, something interesting is going on in the loop, not in the layers.
The mechanism is conceptually simple. The model maintains two state vectors: an answer y (the grid being filled in) and a latent z (the reasoning state). On each iteration, both get updated, conditioned on each other and on the input.
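In code, the alternating update is a small loop. A hedged sketch (names are mine; the paper's answer update conditions on y and z only, folded together here for brevity):

```python
import torch
import torch.nn as nn

class TRMSketch(nn.Module):
    """Alternating-update loop, simplified from the TRM recipe."""
    def __init__(self, d=64, n_latent=6, max_outer=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3 * d, d), nn.GELU(), nn.Linear(d, d))
        self.halt_head = nn.Linear(d, 1)
        self.n_latent, self.max_outer = n_latent, max_outer

    def forward(self, x, y, z):                  # x, y, z: (batch, d)
        for step in range(self.max_outer):
            for _ in range(self.n_latent):       # (1) refine the latent z
                z = z + self.net(torch.cat([x, y, z], dim=-1))
            y = y + self.net(torch.cat([x, y, z], dim=-1))  # (2) refine the answer y
            if torch.sigmoid(self.halt_head(z)).mean() > 0.9:
                break                            # halt head says "done"
        return y, z, step + 1
```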
Think of it as an iterative solver doing belief propagation: each step accumulates evidence, refines a candidate, and decides whether to do more work. The model is not just "deeper by repetition": the recursive structure aligns the architecture with the iterative nature of the puzzles being solved.
Recursion-as-depth is most powerful when the task itself has an iterative structure that classical algorithms would also exploit. Recursive transformers don’t replace next-token prediction for general text — they replace fixed-depth attention for tasks that look like constraint satisfaction. Match the architectural inductive bias to the problem.
The most shocking single finding in the article: “Attention is not required... Replacing self-attention with a pure MLP layer also improved accuracy (74.7% to 87.4%)”. On the recursive transformer’s grid-puzzle benchmark, removing attention entirely — replacing it with a feedforward MLP — not only worked but worked better.
Since 2017 the load-bearing claim of the field has been "attention is the mechanism that makes transformers work". On this task, on this architecture, in this scale regime, that claim is empirically false. Whatever is doing the work, it isn't the attention mechanism per se.
The recursion — iterating any nonlinear transformation on a structured state — appears to be the load-bearing primitive. Attention was a useful enough mixing operation that nobody noticed it could be replaced with an MLP at this scale. On a fixed grid where positional structure is given by the input itself, the all-pairs token mixing attention provides may be redundant.
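Why can an MLP stand in for attention here? The grid never changes size, so the sequence length is fixed, and cross-position mixing can be a learned linear map over positions, in the style of MLP-Mixer. A sketch of the swap (naming mine):

```python
import torch.nn as nn

class MLPMixingBlock(nn.Module):
    """Drop-in replacement for self-attention on a fixed-length sequence.

    Only works because seq_len is constant: positional structure is baked
    into the learned mixing weights rather than computed per input.
    """
    def __init__(self, seq_len, d_model, hidden=256):
        super().__init__()
        self.token_mix = nn.Sequential(          # mixes across positions
            nn.Linear(seq_len, hidden), nn.GELU(), nn.Linear(hidden, seq_len))
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        h = self.norm(x).transpose(1, 2)         # (batch, d_model, seq_len)
        return x + self.token_mix(h).transpose(1, 2)   # residual token mixing
```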
This is the kind of result that gets repeated badly. Do not let yourself or anyone else summarise it as “attention isn’t actually needed”. The right summary is: at small scale, on tasks with iterative structure, the recursive loop carries the capability and the per-step mixing operation can be much simpler than full attention. That is a much more interesting and less reckless claim.
A 4×4 puzzle. Three panels: the input (with given cells), the current best answer (refined per iteration), and the target. The model starts with a wrong answer; each iteration refines it. Watch the latent state (bars below) evolve and the answer converge. The halt head fires when the latent stabilises.
The answer doesn’t fill in a fixed order — the model attacks the most-constrained cells first, then the rest fall out. The latent bars start jagged (high-uncertainty reasoning state) and smooth out as the answer converges. The halt probability climbs as the latent stabilises; once it crosses 0.9, the model returns. Easy puzzles halt in 4–6 iterations; hard ones take 12–16. Press New puzzle to see the cadence change.
The honest scope is narrow. HRM and TRM solve grid-shaped puzzles: Sudoku, mazes, ARC. The input and output are fixed-size grids. The technique does not yet apply to general text, to long-form generation, or to reasoning tasks with variable-length outputs.
Good fit: constraint-satisfaction puzzles with structured input/output. Tasks where iterative refinement matches the problem. Anywhere a classical solver (Z3, A*, belief propagation) would also be a reasonable choice.
Not a fit: a general-purpose LLM, a chat assistant, a code generator, a document summariser. Treating it as any of these will produce disappointment. The article is explicit about this.
The promising frame: a tool for an agent. The same way an LLM uses a calculator API, a future agent could use a recursive-puzzle module to handle constraint problems faster, cheaper, and more reliably than asking the main LLM.
The article’s closing speculation about recursive transformers is worth quoting: “tiny reasoning models as ‘tools’ for tool-calling LLMs, similar to how systems use Python/calculator APIs”. The interesting future is composition. A frontier model that knows when to spawn a tiny specialised reasoner is more capable, in aggregate, than either alone — and substantially cheaper to deploy.
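Nobody has shipped this yet (see the prediction below), so any sketch is purely hypothetical: none of the names here correspond to a real API.

```python
# Hypothetical composition: a frontier LLM delegates grid-shaped subproblems
# to a 7M-parameter recursive solver. Illustrative only; no real API depicted.
def grid_solver_tool(grid):
    """Run the tiny recursive solver to convergence; return the completed grid."""
    y, z, steps = tiny_solver(encode(grid))      # milliseconds, CPU-friendly
    return decode(y)

TOOLS = {"grid_solver": grid_solver_tool}

def agent_turn(llm, message):
    reply = llm(message, tools=list(TOOLS))      # the frontier model plans...
    if reply.tool_call is not None:              # ...delegates the puzzle...
        result = TOOLS[reply.tool_call.name](reply.tool_call.args["grid"])
        reply = llm(message, tool_result=result) # ...then continues, grounded
    return reply
```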
What other narrow domains can be packaged as a 7M-parameter recursive solver? Theorem checking? Game playing? Constraint solving for scheduling, layout, query optimisation? The TRM result strongly suggests the answer is “more than we’ve looked for”, but the sociological barriers (papers want frontier-tier results, not 7M-parameter components) are real.
Try multiple puzzles. Notice how the iteration count and halting threshold differ between easy and hard ones. The model running fewer iterations on easy puzzles is the “adaptive compute” behaviour visible without instrumentation.
The code is <1000 lines. Find the recursive update rule, the halt head, and the BCE loss. The whole thing is shorter than most production LLM training scripts.
Sudoku is a small enough domain to fit on a laptop. Implement the alternating-update rule, the latent + answer buffer, the halt head. Watch it converge. The training loop is unusual because gradients flow through the full unrolled recursion.
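A hedged skeleton of that training step (shapes and helper names are mine; `model` and its `halt_logit` head are whatever you built above):

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, x, target, n_steps=16):
    """One batch of Sudoku grids: unroll fully, backprop through every iteration.

    x:      (batch, 81) given cells, 0 = blank
    target: (batch, 81) solution digits 1..9
    """
    batch = x.shape[0]
    y = torch.zeros(batch, 81, 10)               # per-cell digit logits
    z = torch.zeros(batch, model.d_latent)       # latent reasoning state
    losses = []
    for _ in range(n_steps):
        y, z = model(x, y, z)                    # no .detach(): gradients flow back
        answer_loss = F.cross_entropy(y.reshape(-1, 10), target.reshape(-1))
        solved = (y.argmax(-1) == target).all(-1).float()
        halt_loss = F.binary_cross_entropy_with_logits(
            model.halt_logit(z).squeeze(-1), solved)   # BCE target: "is it solved?"
        losses.append(answer_loss + halt_loss)
    loss = torch.stack(losses).mean()            # supervise every iteration
    optimizer.zero_grad()
    loss.backward()                              # through the full unrolled recursion
    optimizer.step()
    return loss.item()
```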
Reproduce the article's ablation. On your own toy version, sweep: attention vs MLP, iteration count, model width. Confirm that the recursion, not the mixing operation, is doing the work. It is one of the most informative ablations you can run on a transformer.
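A harness for that sweep might look like this (assuming you supply `build_model` and `evaluate` for your toy setup):

```python
from itertools import product

GRID = {
    "mixing":  ["attention", "mlp"],   # the headline ablation
    "n_steps": [4, 8, 16],             # recursion depth
    "width":   [32, 64, 128],          # model width
}

results = {}
for mixing, n_steps, width in product(*GRID.values()):
    model = build_model(mixing=mixing, n_steps=n_steps, width=width)
    results[(mixing, n_steps, width)] = evaluate(model)

# If the recursion carries the capability, accuracy should track n_steps far
# more strongly than it tracks the choice of mixing operation.
for cfg, acc in sorted(results.items(), key=lambda kv: -kv[1]):
    print(cfg, f"{acc:.3f}")
```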
Start with the TRM paper as the cleanest exposition. Then HRM for context on what was being simplified. Then the original 2018 Universal Transformer paper to see the architectural lineage.
The first credible “recursive solver as a tool inside a chat agent” deployment. Whichever team ships that first — LLM hands off a constraint problem to a 7M solver, gets the answer back, continues — will have demonstrated the composition story. None has, as of early 2026.
Next time someone proposes “use the bigger LLM” for a task that looks like constraint solving, ask: could a 7M-parameter recursive model do this faster, cheaper and better? The answer is sometimes yes — and recognising it is the value the recursive-transformer line of work has added to the toolbox, regardless of whether the technique becomes general-purpose.