Beyond Standard LLMs — Companion to Sebastian Raschka — Presentation 01

Linear-Attention Hybrids

When attention costs more than the brain it serves. How MiniMax-M1, Qwen3-Next, DeepSeek V3.2 and Kimi Linear trade a sliver of accuracy for an order-of-magnitude reduction in long-context cost — and why MiniMax-M2 reverted.


Read the original article →

00

Topics We’ll Cover

How this series maps onto Raschka’s article

Raschka’s essay surveys four post-transformer families and argues each one occupies a niche the standard decoder doesn’t fill. This deck takes the first family — linear-attention hybrids — and unpacks the engineering reality: what the math is, what the savings look like, and where the still-open accuracy questions sit. Decks 02–05 cover the other three families plus a decision-tree for when to reach for any of them.

01

The O(n²) Wall — Why Standard Attention Hurts

Scaled-dot-product attention is the one piece of the transformer that does not scale gracefully. Every token attends to every prior token, so both compute and memory grow quadratically with sequence length. At 4 k tokens nobody notices; at 200 k tokens it dominates the inference bill; at 1 M tokens it is the only thing that matters.

Compute — the obvious wall

The attention matrix has shape (n×n). Doubling the context quadruples the FLOPs in the attention layers. FlashAttention reorganises the work to avoid materialising the full matrix, but it does not change the asymptotic O(n²) cost.

KV cache — the quieter wall

During autoregressive decoding the model caches Keys and Values for every prior token, in every layer, in every head. The cache grows linearly with n: each new token adds one K and one V vector per layer per head, so total size scales with n · n_layers · n_heads · d_head. At long context the cache, not the weights, dominates GPU memory.
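To make that concrete with illustrative dimensions (not any specific model's): 48 layers × 32 heads × d_head = 128 in FP16 gives 48 · 32 · 128 · 2 (K and V) · 2 bytes ≈ 768 KiB of cache per token — roughly 96 GiB at a 128 k-token context, before the weights are even loaded.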

Raschka’s framing

From the article: “Traditional scaled-dot-product attention scales O(n²) with sequence length; linear variants aim for O(n) complexity.” Linear attention has been studied since 2020 (Linear Transformers, Performer, Linformer). What is new in 2025–2026 is that production-grade open-weights labs are now shipping it — not as research artefacts, but as flagship general-purpose models: MiniMax-M1, Qwen3-Next, DeepSeek V3.2, Kimi Linear.

A useful intuition

Standard attention is a set comparison — every query token compares itself with every key. Linear attention is closer to a running summary — each token folds into a small fixed-size state, and queries read from that state. The set comparison is more expressive; the running summary is dramatically cheaper. The 2025 generation says: combine them.

02

From Softmax to Linear Attention — The Math Trick

The softmax attention computation looks innocent: O = softmax(QKᵀ/√d) V. The problem is the softmax: it forces you to materialise the full n×n matrix because softmax couples every key to every query. Replace the softmax with a kernel that factorises — φ(Q) φ(K)ᵀ — and you can re-associate the multiplication.

Softmax attention — O(n²)
  • O = softmax(QKᵀ/√d) V — must materialise the (n×n) matrix
  • memory: O(n²); compute: O(n²·d)
  • expressive but expensive

Linear attention — O(n·d²), drop the softmax
  • O = φ(Q) · (φ(K)ᵀ V) — re-associate: KV first, then Q
  • memory: O(d²) state, no n anywhere; compute: O(n·d²)
  • cheaper, less expressive


The trade-off in one sentence

Standard attention scales like n² but learns whatever it needs. Linear attention scales like n, but its fixed-size state is an information bottleneck. The 2025 wave is essentially a search for the best way to make that fixed-size state useful.
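A minimal sketch of the re-association, in the non-causal setting and assuming the elu(x)+1 feature map from the 2020 linear-transformer line of work (causal decoding keeps a running prefix state instead):

```python
import torch

def phi(x):
    # Positive feature map; elu(x) + 1 is one common choice for φ.
    return torch.nn.functional.elu(x) + 1

n, d = 1024, 64
Q, K, V = (torch.randn(n, d, dtype=torch.double) for _ in range(3))

# Quadratic order: materialise the (n × n) similarity matrix, then apply V.
A = phi(Q) @ phi(K).T                               # (n, n)
out_quadratic = (A / A.sum(-1, keepdim=True)) @ V   # row-normalised, like softmax

# Linear order: fold K and V into a (d × d) state first, then read with Q.
S = phi(K).T @ V                                    # (d, d) — no n anywhere
z = phi(K).sum(0)                                   # (d,) normaliser
out_linear = (phi(Q) @ S) / (phi(Q) @ z).unsqueeze(-1)

assert torch.allclose(out_quadratic, out_linear)    # same output, different cost
```

The assert is the whole point: both orders compute the same kernelised attention, but the second never builds an n×n object.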

03

Gated DeltaNet — Memory With Decay and Update

Plain linear attention has no mechanism to forget. Every key/value pair is added to the state and never removed; new information overwrites old only by accident. Gated DeltaNet, the dominant design in Qwen3-Next and Kimi Linear, fixes this with two learned gates and a delta-rule update.

In symbols: Sₜ = αₜ Sₜ₋₁ + βₜ (vₜ − αₜ Sₜ₋₁ kₜ) kₜᵀ
  • hidden state Sₜ (d_head × d_head) — the only memory carried forward
  • αₜ (decay) — sigmoid gate, scales Sₜ₋₁
  • delta rule vₜ − (αₜ Sₜ₋₁) kₜ — predicted vs actual
  • βₜ (update) — sigmoid gate, scales the delta
  • outer product with kₜᵀ — addresses where in the state to write

The update rule, in plain language

  1. Decay the old state. Multiply Sₜ₋₁ by the decay gate αₜ — per-token, learned from the input. High α means “keep remembering”; low α means “forget what came before”.
  2. Predict, then correct. The current key kₜ is run against the decayed state to produce a predicted value. Compare it to the actual value vₜ. The delta is the surprise.
  3. Write the surprise. Scale the delta by the update gate βₜ and write it into the state via an outer product with kₜᵀ — which addresses where, in the state matrix, this information should live.

The result: a fixed-size memory that updates differentially rather than just accumulating. New facts overwrite stale ones in the same address. This is the missing ingredient that makes pure linear attention almost competitive on real workloads — up from “research curiosity” to “production candidate”.
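A minimal single-head sketch of this update, assuming scalar gates (real layers derive αₜ and βₜ from sigmoids over the token's hidden state, and add batch and head dimensions on top):

```python
import torch

def gated_deltanet_step(S, k, v, alpha, beta):
    """One Gated DeltaNet recurrence step for a single head.

    S: (d, d) state matrix; k, v: (d,) key/value; alpha, beta: scalar gates in (0, 1).
    """
    S = alpha * S                             # 1. decay the old state
    delta = v - S @ k                         # 2. surprise: actual minus predicted value
    return S + beta * torch.outer(delta, k)   # 3. write the surprise at address k

d = 64
S = torch.zeros(d, d)
for t in range(1000):                         # any sequence length...
    k, v = torch.randn(d), torch.randn(d)
    alpha, beta = torch.rand(()), torch.rand(())  # stand-ins for learned gates
    S = gated_deltanet_step(S, k, v, alpha, beta)
print(S.shape)                                # ...state stays (d, d): torch.Size([64, 64])
```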

Connection to classic ideas

The delta rule is a classic of connectionist learning — it dates back to Widrow–Hoff in 1960 and was a staple of 1980s connectionism: change weights in the direction of the prediction error. Gated DeltaNet is essentially a transformer layer that performs these error-corrective updates in-context, on a per-token basis — explicit recurrent state-space behaviour without giving up the parallelism of attention-style training.

04

The Hybrid Pattern — Why Pure Linear Doesn’t Ship

None of the four flagship 2025 models is purely linear. They interleave linear and standard layers in a fixed ratio, recognising that linear attention sacrifices the exact-recall capability the standard mechanism handles natively. The interleaving ratio is the central design knob.

Model | Total / active params | Linear mechanism | Hybrid pattern | Headline number
MiniMax-M1 | 456B / 46B (MoE) | Lightning attention | Mostly linear, periodic full-attention layers | Frontier-tier reasoning at a fraction of dense cost
Qwen3-Next | 235B / ~22B (MoE) | Gated DeltaNet + Gated Attention | 3 linear : 1 attention | 262k native context vs 32k in Qwen3
DeepSeek V3.2 | ~671B / ~37B (MoE) | Subquadratic sparse attention | Routes most queries to a small set of keys | Long-context bills cut without architecture change
Kimi Linear | 48B | Gated DeltaNet + MLA | 3 linear : 1 attention; MLA on the attention layers | 75% KV-cache reduction, up to 6× decoding throughput

Why “3 linear : 1 attention” recurs

Three linear layers compress sequence information into a running state cheaply; the fourth-layer standard attention then has the option to do exact recall against the full token stream when it needs to. Three-quarters of the layers run cheaply, while the model retains the lookup-precision of softmax attention where it actually matters. Both Qwen3-Next and Kimi Linear converged on this ratio independently; the design is becoming a folk-standard.
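As a purely structural sketch — the block classes below are stand-ins, not the real Qwen3-Next or Kimi Linear layers — the 3:1 interleave looks like this:

```python
import torch.nn as nn

class LinearBlock(nn.Module):
    """Placeholder for a Gated DeltaNet layer (constant-size state)."""
    def __init__(self, d):
        super().__init__()
        self.mix = nn.Linear(d, d)
    def forward(self, x):
        return self.mix(x)

class FullAttnBlock(nn.Module):
    """Placeholder for a standard softmax-attention layer (full KV cache)."""
    def __init__(self, d):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)
    def forward(self, x):
        return self.attn(x, x, x)[0]

def hybrid_stack(n_layers: int = 48, d: int = 512) -> nn.ModuleList:
    # 3:1 interleave: every 4th layer is full attention, the rest are linear.
    return nn.ModuleList(
        FullAttnBlock(d) if (i + 1) % 4 == 0 else LinearBlock(d)
        for i in range(n_layers)
    )
```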

What the linear layers contribute

Cheap context aggregation. Carrying the gist of a 200 k-token document forward without accumulating per-token KV cost. Most layers in a deep transformer are doing summarisation; linear is fine for them.

What the standard layers preserve

Exact-recall and copy operations — “quote line 47 verbatim”, “find the variable named user_id in the imports”. Exactly the operations linear attention degrades on. One in four layers is enough to keep these working.

05

KV Cache — Where the Money Actually Is

The headline savings from linear attention are not in compute — they are in KV-cache memory at inference time. Standard MHA caches grow linearly with sequence length, multiplied by depth and width. Linear-attention layers do not cache per-token at all: their state is a single fixed-size matrix.

Multi-Head Attention (per layer)

Cache size = batch × n_tokens × n_heads × d_head × 2 (K and V) × 2 bytes (FP16). Doubling the context doubles the cache. At 128 k tokens this typically dwarfs the model weights themselves on a single GPU.

DeltaNet (per layer)

State size = batch × n_heads × d_head × d_head × 2 bytes (FP16). Independent of sequence length. A 200 k-token context costs the same memory as a 200-token one. The crossover point is small — somewhere between 1 k and 4 k tokens.


A useful framing

If you’ve internalised “long context is expensive” as a hard rule, half of that intuition is now wrong. The compute for the prefill pass is still expensive, but the memory for the decoding cache is suddenly an order of magnitude cheaper on a linear-heavy model. The two costs decouple in 2025 in a way they didn’t in 2023.

06

Interactive: KV-Cache Memory Calculator

Slide the inputs to see how the KV-cache footprint differs between a pure-MHA model and a 3:1 hybrid (Qwen3-Next / Kimi Linear style). Numbers are FP16 (2 bytes per scalar). The hybrid uses standard attention on 25% of layers and DeltaNet state on the rest.

[Interactive calculator — outputs: pure MHA cache · 3:1 hybrid cache · % reduction]

What the calculator tells you

At short context the two are similar: the fixed DeltaNet state is sized to be useful, not negligible. The interesting region is 32 k tokens and beyond, where the MHA cache scales linearly with context but the hybrid only scales on its 25% of standard layers. At 200 k tokens the reduction is roughly the headline 75% Kimi Linear advertised — and the absolute saving is several gigabytes per active session, which is what unlocks longer context on the same GPU.
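For readers without the interactive version, a rough sketch of the same arithmetic in Python — the defaults are illustrative dimensions, not any specific model's:

```python
def kv_cache_gib(n_tokens, n_layers=48, n_heads=32, d_head=128,
                 batch=1, bytes_per_scalar=2, linear_ratio=0.75):
    """Compare KV-cache footprints: pure MHA vs a 3:1 linear/attention hybrid."""
    per_layer_mha = batch * n_tokens * n_heads * d_head * 2 * bytes_per_scalar  # K and V
    per_layer_state = batch * n_heads * d_head * d_head * bytes_per_scalar      # DeltaNet state
    pure = n_layers * per_layer_mha
    n_attn = int(n_layers * (1 - linear_ratio))        # layers that keep a full KV cache
    hybrid = n_attn * per_layer_mha + (n_layers - n_attn) * per_layer_state
    gib = 2 ** 30
    return pure / gib, hybrid / gib, 1 - hybrid / pure

# With these defaults at 200k tokens, the reduction prints at roughly 75%.
pure, hybrid, cut = kv_cache_gib(n_tokens=200_000)
print(f"pure MHA: {pure:.1f} GiB  hybrid: {hybrid:.1f} GiB  reduction: {cut:.0%}")
```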

07

When Linear Hurts — The MiniMax-M2 Reversal

The most informative single data point in the article is not a success story — it is a reversal. MiniMax-M1 shipped with linear attention (Lightning). MiniMax-M2, the successor, reverted to full attention, citing “poor accuracy in reasoning and multi-turn tasks” with the linear variants. This is one team being publicly honest about a trade-off the rest of the industry is still measuring.

Where linear seems to underperform

  • Multi-step reasoning — chains where the model has to revisit and exactly reuse intermediate facts.
  • Long multi-turn dialogue — recalling specific user statements from earlier turns.
  • Code editing with long-range references (“the function defined 800 lines ago”).
  • Structured output with strict positional constraints.

Where the wins are real

  • Single-pass long-context summarisation — gist over precision.
  • Streaming generation at long context — the throughput win dominates.
  • Cost-sensitive deployment at very long inputs — the 75% memory cut is real money.
  • Tasks where pre-RAG retrieval has done the precision work already.

Read this one carefully

The MiniMax-M2 reversal does not mean linear attention is wrong. It means at MiniMax’s 2025 capability target on their workload mix, the trade-off didn’t pencil out. Qwen3-Next, Kimi Linear and DeepSeek V3.2 made different bets and shipped. The reasonable inference is that the design space is genuinely contested — the answer depends on your workload, not on a universal verdict.

Open questions worth tracking

  1. Does the 3:1 ratio scale up? Bigger models may need a different ratio — the published ones are largely below 100B active params. Test-time evidence at 500B+ is thin.
  2. Do reasoning models need more standard attention? Long chains of thought do exact recall constantly. Linear-heavy stacks may interact badly with the o1/R1 paradigm.
  3. How sensitive is the gating? If α and β are mistuned at training time, the state under-decays and capacity collapses. Robust training recipes are still being established.

08

Companion Decks — Where to Go Next

This is the first of five companion decks to Raschka’s Beyond Standard LLMs. The next three take the remaining architecture families in turn; the fifth is a decision-tree for choosing among them in your own work.

  1. 02
    Text Diffusion Models →
    LLaDA, Gemini Diffusion. Iterative denoising instead of next-token prediction. The streaming and conditional-dependency caveats. Interactive: parallel-vs-autoregressive token visualiser.
  2. 03
    Code World Models →
    CWM 32B. Mid-training that makes the model predict program state evolution, not just code text. SWE-bench parity with 4× smaller parameter counts. Interactive: world-model rollout stepper.
  3. 04
    Small Recursive Transformers →
    HRM, TRM (7M params). Iterative self-loops for ARC-AGI. The surprising attention-not-required ablation. Interactive: recursive trace viewer.
  4. 05
    When to Reach for Non-Transformer →
    A practical decision-tree synthesising all four families. Interactive: walk your workload through the tree and see which architecture fits.

09

Things to Try Yourself

This material lands much harder if you do something with it. In rough order of effort:

20 minutes — play with the calculator

Plug in the dimensions of a model you actually deploy. At your context length, where does the crossover sit? Would a 3:1 hybrid pay off in your own serving stack at 50% utilisation?

1 hour — pull a Kimi Linear checkpoint

Kimi Linear weights are open. Run a 100 k-token summarisation and a 100 k-token exact-quote task on it and on a same-size MHA baseline. The gap between the two is the practical cost of the linear trade-off.

An afternoon — implement DeltaNet

The update rule is short: thirty lines of PyTorch. Build it as a single layer and verify that the state is bounded by d_head × d_head, not by n_tokens. Train on a toy copy task and watch it fail in characteristic ways.

A weekend — profile a real long-context inference

Run vLLM or sglang against a 100 k-token prompt on Qwen3-Next vs Qwen3 and measure: peak GPU memory, time-to-first-token, decode tokens/s. Then compare the gaps against this deck’s claims.

Ongoing — watch the M3 release

MiniMax made a public reversal and is presumably preparing M3. Whether it goes back to linear, stays full-attention, or tries a different hybrid will be the most informative single data point of 2026 on this architecture family.

Read the source

Raschka’s article links every model release with primary sources. Start with flash-linear-attention for an open-source DeltaNet implementation; then read the Qwen3-Next and Kimi Linear technical reports back-to-back — the design choices are remarkably similar.

A diagnostic prompt for your team

Next time someone asks “should we cap context at 32 k to save cost?”, ask the harder question: is our model architecture even paying the n² tax for context beyond that? If you’re running on a linear-heavy hybrid the answer is approximately no, and the cap is leaving capability on the table.