When attention costs more than the brain it serves. How MiniMax-M1, Qwen3-Next, DeepSeek V3.2 and Kimi Linear trade a sliver of accuracy for an order-of-magnitude reduction in long-context cost — and why MiniMax-M2 reverted.
Raschka’s essay surveys four post-transformer families and argues each one occupies a niche the standard decoder doesn’t fill. This deck takes the first family — linear-attention hybrids — and unpacks the engineering reality: what the math is, what the savings look like, and where the still-open accuracy questions sit. Decks 02–05 cover the other three families plus a decision-tree for when to reach for any of them.
Scaled-dot-product attention is the one piece of the transformer that does not scale gracefully. Every token attends to every prior token, so both compute and memory grow quadratically with sequence length. At 4 k tokens nobody notices; at 200 k tokens it dominates the inference bill; at 1 M tokens it is the only thing that matters.
The attention matrix has shape (n×n). Doubling the context quadruples the FLOPs in the attention layers. FlashAttention reorganises the work to avoid materialising the full matrix, but it does not change the asymptotic O(n²) cost.
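A back-of-envelope sketch of the quadratic growth (the head dimension here is illustrative, not any particular model's):

```python
def attn_flops(n_tokens, d_head):
    # QK^T costs n*n*d multiply-adds; the attention-weights @ V product costs
    # the same again. Softmax and constant factors omitted -- only the
    # asymptotics matter for this comparison.
    return 2 * n_tokens * n_tokens * d_head

d = 128  # illustrative head dimension
base = attn_flops(4_096, d)
print(attn_flops(8_192, d) / base)     # 4.0 -- doubling context quadruples FLOPs
print(attn_flops(200_000, d) / base)   # roughly 2400x the 4k-token cost
```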
During autoregressive decoding the model caches Keys and Values for every prior token, in every layer, in every head. That cache grows linearly with n per token, multiplied by n_layers · n_heads · d_head. At long context the cache, not the weights, dominates GPU memory.
From the article: “Traditional scaled-dot-product attention scales O(n²) with sequence length; linear variants aim for O(n) complexity.” Linear attention has been studied since 2020 (Linear Transformers, Performer, Linformer). What is new in 2025–2026 is that production-grade open-weights labs are now shipping it — not as research artefacts, but as flagship general-purpose models: MiniMax-M1, Qwen3-Next, DeepSeek V3.2, Kimi Linear.
Standard attention is a set comparison — every query token compares itself with every key. Linear attention is closer to a running summary — each token folds into a small fixed-size state, and queries read from that state. The set comparison is more expressive; the running summary is dramatically cheaper. The 2025 generation says: combine them.
The softmax attention computation looks innocent: O = softmax(QKᵀ/√d)·V. The problem is the softmax: it couples every key to every query, forcing you to materialise the full n×n matrix. Replace the softmax with a kernel that factorises — φ(Q)φ(K)ᵀ — and you can re-associate the multiplication: compute φ(K)ᵀV first, then apply φ(Q) to the result.
The re-associated product φ(K)ᵀV is a fixed-size (d_head × d_head) matrix, independent of n, instead of growing linearly with sequence length. Standard attention scales like n² but learns whatever it needs. Linear attention scales like n but its fixed-size state is a memory bottleneck. The 2025 wave is essentially a search for the best way to make that fixed-size state useful.
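The re-association can be checked numerically. A minimal sketch with plain Python lists — φ is assumed to have been applied to Q and K already, and normalisation and causal masking are ignored:

```python
import random

def matmul(A, B):
    # Naive matrix multiply over nested lists; zip(*B) iterates columns of B.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

random.seed(0)
n, d = 6, 3  # sequence length, head dimension (tiny, for illustration)
phiQ = [[random.random() for _ in range(d)] for _ in range(n)]
phiK = [[random.random() for _ in range(d)] for _ in range(n)]
V    = [[random.random() for _ in range(d)] for _ in range(n)]

# Quadratic order: build the n x n matrix first, then apply it to V.
out_quadratic = matmul(matmul(phiQ, transpose(phiK)), V)

# Linear order: fold keys/values into a d x d state, then read with queries.
state = matmul(transpose(phiK), V)   # d x d, independent of n
out_linear = matmul(phiQ, state)

max_err = max(abs(a - b) for ra, rb in zip(out_quadratic, out_linear)
              for a, b in zip(ra, rb))
print(max_err < 1e-9)  # True: same result, very different cost profile
```

The two orderings are equal by associativity; only the intermediate shapes differ, and that is the entire trick.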
Plain linear attention has no mechanism to forget. Every key/value pair is added to the state and never removed; new information overwrites old only by accident. Gated DeltaNet, the dominant design in Qwen3-Next and Kimi Linear, fixes this with two learned gates and a delta-rule update.
The update has three steps.
1. Decay: multiply the previous state S_{t−1} by the decay gate α_t — per-token, learned from the input. High α means “keep remembering”; low α means “flush this forward”.
2. Predict: the key k_t is run against the decayed state to produce a predicted value. Compare it to the actual value v_t. The delta is the surprise.
3. Write: scale the delta by β_t and write it into the state via an outer product with k_t — the outer product addresses where, in the state matrix, this information should live.
The result: a fixed-size memory that updates differentially rather than just accumulating. New facts overwrite stale ones in the same address. This is the missing ingredient that makes pure linear attention almost competitive on real workloads — up from “research curiosity” to “production candidate”.
The delta rule is the same one used in Hebbian learning since the 1980s: change weights in the direction of the prediction error. Gated DeltaNet is essentially a transformer layer that does in-context Hebbian updates on a per-token basis — explicit recurrent state-space behaviour without giving up the parallelism of attention training.
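The update rule in code — a pure-Python sketch with scalar gates for readability (real models learn α_t and β_t per token from the input, and run this per head, in parallel, with normalised keys):

```python
import random

d = 4  # head dimension; the state is d x d regardless of sequence length

def step(S, k, v, alpha, beta):
    # 1. Decay the old memory.
    S = [[alpha * s for s in row] for row in S]
    # 2. Predict the value for this key and measure the surprise.
    v_pred = [sum(S[i][j] * k[j] for j in range(d)) for i in range(d)]
    delta = [v[i] - v_pred[i] for i in range(d)]
    # 3. Write the scaled surprise at the address given by k (outer product).
    return [[S[i][j] + beta * delta[i] * k[j] for j in range(d)] for i in range(d)]

random.seed(0)
S = [[0.0] * d for _ in range(d)]
for _ in range(1000):  # fold a 1000-token "sequence" into the state
    k = [random.gauss(0, 1) for _ in range(d)]
    v = [random.gauss(0, 1) for _ in range(d)]
    S = step(S, k, v, alpha=0.9, beta=0.1)

# Memory footprint is d*d scalars no matter how many tokens were folded in.
print(len(S), len(S[0]))  # 4 4
```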
None of the four flagship 2025 models is purely linear. They interleave linear and standard layers in a fixed ratio, in the recognition that linear attention sacrifices exact-recall capabilities the standard mechanism handles natively. The interleaving ratio is the central design knob.
| Model | Total / active params | Linear mechanism | Hybrid pattern | Headline number |
|---|---|---|---|---|
| MiniMax-M1 | 456B / 46B (MoE) | Lightning attention | Mostly linear, periodic full-attention layers | Frontier-tier reasoning at fraction of dense cost |
| Qwen3-Next | 80B / ~3B (MoE) | Gated DeltaNet + Gated Attention | 3 linear : 1 attention (3:1 ratio) | 262k native context vs 32k in Qwen3 |
| DeepSeek V3.2 | ~671B / ~37B (MoE) | Subquadratic sparse attention | Routes most queries to a small set of keys | Long-context bills cut without architecture change |
| Kimi Linear | 48B / ~3B (MoE) | Gated DeltaNet + MLA | 3 linear : 1 attention; MLA on the attention layers | 75% KV-cache reduction, up to 6× decoding throughput |
Three linear layers compress sequence information into a running state cheaply; the fourth-layer standard attention then has the option to do exact recall against the full token stream when it needs to. Three-quarters of the layers run cheaply, while the model retains the lookup-precision of softmax attention where it actually matters. Both Qwen3-Next and Kimi Linear converged on this ratio independently; the design is becoming a folk-standard.
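As a sketch, the 3:1 interleave is just a repeating layer plan (the names here are illustrative, not any model's actual config keys):

```python
def layer_plan(n_layers, linear_per_block=3):
    # Repeating block of [linear x3, full_attention x1],
    # Qwen3-Next / Kimi Linear style.
    block = ["linear"] * linear_per_block + ["full_attention"]
    return [block[i % len(block)] for i in range(n_layers)]

print(layer_plan(8))
# ['linear', 'linear', 'linear', 'full_attention',
#  'linear', 'linear', 'linear', 'full_attention']
```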
Cheap context aggregation. Carrying the gist of a 200 k-token document forward without accumulating per-token KV cost. Most layers in a deep transformer are doing summarisation; linear is fine for them.
Exact-recall and copy operations — “quote line 47 verbatim”, “find the variable named user_id in the imports”. Exactly the operations linear attention degrades on. One in four layers is enough to keep these working.
The headline savings from linear attention are not in compute — they are in KV-cache memory at inference time. Standard MHA caches grow linearly with sequence length, multiplied by depth and width. Linear-attention layers do not cache per-token at all: their state is a single fixed-size matrix.
Cache size = batch × n_tokens × n_layers × n_heads × d_head × 2 (K and V) × 2 bytes (FP16). Doubling the context doubles the cache. At 128 k tokens this typically dwarfs the model weights themselves on a single GPU.
State size = batch × n_layers × n_heads × d_head × d_head × 2 bytes. Independent of sequence length: a 200 k-token context costs the same memory as a 200-token one. The crossover point is small — somewhere between 1 k and 4 k tokens.
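Putting the two formulas side by side for a hypothetical model (48 layers, 32 heads, d_head = 128 at FP16 — illustrative dimensions, not any shipped config):

```python
def mha_cache_bytes(n_tokens, n_layers=48, n_heads=32, d_head=128,
                    bytes_per_scalar=2, batch=1):
    # K and V cached for every token, in every layer, in every head (FP16).
    return batch * n_tokens * n_layers * n_heads * d_head * 2 * bytes_per_scalar

def hybrid_cache_bytes(n_tokens, n_layers=48, n_heads=32, d_head=128,
                       bytes_per_scalar=2, batch=1, attn_fraction=0.25):
    attn_layers = int(n_layers * attn_fraction)   # the 1-in-4 full-attention layers
    linear_layers = n_layers - attn_layers
    kv = batch * n_tokens * attn_layers * n_heads * d_head * 2 * bytes_per_scalar
    # Linear layers hold one d_head x d_head state per head, independent of n_tokens.
    state = batch * linear_layers * n_heads * d_head * d_head * bytes_per_scalar
    return kv + state

for n in (4_096, 32_768, 200_000):
    mha, hyb = mha_cache_bytes(n), hybrid_cache_bytes(n)
    print(f"{n:>7} tok  MHA {mha / 1e9:7.2f} GB  hybrid {hyb / 1e9:6.2f} GB  "
          f"saving {1 - hyb / mha:.0%}")
```

With these dimensions the fixed DeltaNet state is negligible even at short context, so the saving sits near 75% throughout; designs with larger expanded states shift the crossover toward the 1–4 k-token range quoted above.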
If you’ve internalised “long context is expensive” as a hard rule, half of that intuition is now wrong. The compute for the prefill pass is still expensive, but the memory for the decoding cache is suddenly an order of magnitude cheaper on a linear-heavy model. The two costs decouple in 2025 in a way they didn’t in 2023.
Slide the inputs to see how the KV-cache footprint differs between a pure-MHA model and a 3:1 hybrid (Qwen3-Next / Kimi Linear style). Numbers are FP16 (2 bytes per scalar). The hybrid uses standard attention on 25% of layers and DeltaNet state on the rest.
At short context the two are similar: the fixed DeltaNet state is sized to be useful, not negligible. The interesting region is 32 k tokens and beyond, where the MHA cache scales linearly with context but the hybrid only scales on its 25% of standard layers. At 200 k tokens the reduction is roughly the headline 75% Kimi Linear advertised — and the absolute saving is several gigabytes per active session, which is what unlocks longer context on the same GPU.
The most informative single data point in the article is not a success story — it is a reversal. MiniMax-M1 shipped with linear attention (Lightning). MiniMax-M2, the successor, reverted to full attention, citing “poor accuracy in reasoning and multi-turn tasks” with the linear variants. This is one team being publicly honest about a trade-off the rest of the industry is still measuring.
The MiniMax-M2 reversal does not mean linear attention is wrong. It means at MiniMax’s 2025 capability target on their workload mix, the trade-off didn’t pencil out. Qwen3-Next, Kimi Linear and DeepSeek V3.2 made different bets and shipped. The reasonable inference is that the design space is genuinely contested — the answer depends on your workload, not on a universal verdict.
If the gates α and β are mistuned at training time, the state under-decays and capacity collapses. Robust training recipes are still being established.
This is the first of five companion decks to Raschka’s Beyond Standard LLMs. The next three take the remaining architecture families in turn; the fifth is a decision-tree for choosing among them in your own work.
This material lands much harder if you do something with it. In rough order of effort:
Plug in the dimensions of a model you actually deploy. At your context length, where does the crossover sit? Would a 3:1 hybrid pay off in your own serving stack at 50% utilisation?
Kimi Linear weights are open. Run a 100 k-token summarisation and a 100 k-token exact-quote task on it and on a same-size MHA baseline. The gap between the two is the practical cost of the linear trade-off.
The update rule is short: thirty lines of PyTorch. Build it as a single layer and verify that the state is bounded by d_head × d_head, not by n_tokens. Train on a toy copy task and watch it fail in characteristic ways.
Run vLLM or sglang against a 100 k token prompt on Qwen3-Next vs Qwen3 and measure: peak GPU memory, time-to-first-token, decode tokens/s. Then read the gap against this deck’s claims.
MiniMax made a public reversal and is presumably preparing M3. Whether it goes back to linear, stays full-attention, or tries a different hybrid will be the most informative single data point of 2026 on this architecture family.
Raschka’s article links every model release with primary sources. Start with flash-linear-attention for an open-source DeltaNet implementation; then read the Qwen3-Next and Kimi Linear technical reports back-to-back — the design choices are remarkably similar.
Next time someone asks “should we cap context at 32 k to save cost?”, ask the harder question: is our model architecture even paying the n² tax for context beyond that? If you’re running on a linear-heavy hybrid the answer is approximately no, and the cap is leaving capability on the table.