The forecast deck. Reading the technical and institutional patterns from decks 02–09 and projecting them forward. Will date faster than the rest of the series, and that is the point — the value is in the framework, not the predictions.
The forward view. Less narrative, more structural — reading the patterns from the previous nine decks and projecting forward where they continue, where they break, and where the genuine open questions are.
It is worth saying plainly: nobody who has tried to forecast LLM capability over the last five years has gotten it right. Not the lab CEOs, not the scaling-laws authors, not the journalists, not the academic critics. The 2020 GPT-3 paper was a surprise to most of the field; ChatGPT was a surprise to most of OpenAI; the o-series was a surprise to many people inside the lab that built it. The base rate for confident frontier-AI predictions surviving twelve months intact is unimpressive.
The point of this deck is therefore not to predict specific things. It is to lay out the axes along which the field could move and, where evidence supports it, to indicate which way the wind is blowing on each axis. A reader returning to this deck in two years should be able to score the framework, not the predictions.
Take any two-year-old confident forecast about AI capability. The accurate predictions tend to be ones that named a structural pattern (scaling continues, agents will arrive, frontier labs will consolidate). The inaccurate ones tend to be ones that named specific capability thresholds and timelines. This deck mostly does the first kind.
Five capability axes are visibly active in 2026. Each is in a different stage of maturity.
| Axis | State 2026 | What is plausibly next |
|---|---|---|
| Reasoning depth | o-series-style RL on chains of thought is mature; PhD-level performance on hard math/science benchmarks at the top tier. | Reasoning extends to longer-horizon scientific problems: math research, codebase-scale engineering, drug-discovery pipelines. |
| Agentic action | Computer Use / Operator / Mariner work for ~tens of minutes of useful action; reliability still lossy. | Day-long agent tasks; multi-step economic transactions; agent-to-agent protocols. |
| Multimodality | Native image, audio, video in/out; Sora-class generation; real-time voice agents. | Embodied real-world action (robotics-foundation-model work). Better cross-modal reasoning. |
| Long context | 1–2M tokens at frontier; 4M+ at MiniMax. Quality degrades past ~200k. | Memory-style architectures (state-space, retrieval, learned memory). Quality at 10M+ tokens. |
| Specialised expertise | Better-than-PhD on individual benchmarks; uneven by domain. | Reliable expert-level help across most knowledge domains. Failure modes localised and predictable. |
Each of these axes has gone through a 2–3 year period from "researchers prove the concept" to "shipped at frontier scale". Reasoning was 2022–2024. Agents were 2023–2025. Long context was 2023–2025. Robotics foundation models look mid-cycle: 2024–2026. The next 18 months look like the period in which the agentic-action axis goes from interesting to economically transformational.
For most of 2020–2024 the working assumption at the frontier was: more compute, more parameters, more data, lower loss, better capability. By late 2024 several senior researchers — including Ilya Sutskever in his post-OpenAI public statements and Dario Amodei in interviews — had publicly observed that pretraining-only scaling appeared to be running into diminishing returns on raw next-token loss.
The honest read is: both sides are partly right. Pretraining-only scaling has clearly diminished in marginal value. That is settled. Total scaling — pretraining plus post-training plus inference-time compute plus multimodal plus tool use — is still climbing. The right framing for the next five years is probably "scaling on multiple new axes simultaneously", rather than "is the scaling-laws thesis correct or not".
The o-series, R1, Gemini Thinking, Claude extended-thinking and various other 2024–25 reasoning models have established that inference-time compute is a real second axis of capability: a model's effective ability now depends on how much compute it is allowed to spend at answer time, not just on how it was trained.
The most likely next step on this axis is reasoning across very long horizons by chaining many smaller RL'd reasoning passes, with intermediate caching, search, and verification, as in the sketch below. The technique's home base is math and code; it will probably extend into more open-ended scientific research within two to three years. Whether it generalises to fully open-ended creative or strategic tasks is genuinely contested.
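A minimal sketch of that chained shape, in Python, under loud assumptions: `generate_step` stands in for one bounded reasoning pass and `verify_step` for a domain verifier (unit tests, a proof checker, a numeric consistency check); neither is any lab's actual API. The point is the control flow (generate, verify, cache, repeat), not the stubs.

```python
from dataclasses import dataclass, field


@dataclass
class ReasoningState:
    goal: str
    steps: list[str] = field(default_factory=list)  # cache of verified intermediate results


def generate_step(state: ReasoningState) -> str:
    """Hypothetical stand-in for one bounded reasoning pass over the goal plus cached steps."""
    return f"partial result {len(state.steps) + 1} toward: {state.goal}"


def verify_step(step: str) -> bool:
    """Hypothetical verifier: unit tests, a proof checker, or a numeric consistency check."""
    return True


def solve(goal: str, max_passes: int = 8, retries: int = 3) -> ReasoningState:
    """Chain many short, individually verified passes instead of one monolithic chain of thought."""
    state = ReasoningState(goal)
    for _ in range(max_passes):
        for _ in range(retries):  # light search: sample a few candidates per step
            candidate = generate_step(state)
            if verify_step(candidate):
                state.steps.append(candidate)  # only verified results enter the cache
                break
        else:
            return state  # no verifiable progress; stop rather than compound unverified steps
    return state


if __name__ == "__main__":
    print(solve("bound the error term").steps)
```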
Through 2025 the agent-product story moved from "interesting demos" to "shipped frontier products with real revenue lines": Cursor (and Anysphere generally), Claude Code, Operator, Project Mariner, the Devin/Cognition line, the Replit agent, and the agent SDKs at every frontier lab.
The most-shipped agent products are ones that work in environments with strong feedback signal: code (compilers, tests, linters), structured forms (web pages with predictable DOM), defined workflows (data ETL, scheduling). Open-ended agent tasks where no easy verifier exists remain hard. Most of the next two years of agent capability gains will come from labs figuring out how to engineer good feedback signals for harder domains, not from raw model capability.
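To make the "strong feedback signal" point concrete, here is a toy agent loop for the easy case (code), sketched in Python. `propose_patch` and `apply_patch` are hypothetical stand-ins for a model call and a workspace edit; the one real component is the verifier, a test run whose exit code tells the agent whether its last action worked.

```python
import subprocess


def propose_patch(task: str, feedback: str) -> str:
    """Hypothetical model call: return a candidate change for the task, given prior test output."""
    return f"# candidate patch for: {task}\n"


def apply_patch(patch: str) -> None:
    """Hypothetical workspace edit: write the candidate change into the working tree."""
    ...


def run_tests() -> tuple[bool, str]:
    """The verifier the environment supplies for free: exit code 0 means the change is acceptable."""
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


def agent_loop(task: str, max_iterations: int = 5) -> bool:
    """Propose, apply, verify, feed the failure back in; stop when the verifier accepts."""
    feedback = ""
    for _ in range(max_iterations):
        apply_patch(propose_patch(task, feedback))
        ok, feedback = run_tests()
        if ok:
            return True
    return False
```

Swap `run_tests` for a DOM assertion or a schema check and the same loop covers structured web forms and ETL; for open-ended tasks there is no cheap equivalent of `run_tests`, which is the whole problem.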
Yann LeCun's JEPA programme, Demis Hassabis's emphasis on "world models" as the next frontier, the Pi/Physical Intelligence robotics-foundation-model work, Sergey Levine's Berkeley group, and a host of related research lines all share a thesis: language models are the prior; embodied/world-model agents are the next research target.
An LLM trained only on text knows about the world only secondhand. To act usefully in the real world (robotics, autonomous vehicles, scientific experimentation), models need to learn the dynamics of physical reality — not just descriptions of it. World-model approaches train on video and sensor streams, predict future states, and use that prediction as the substrate for planning and action. Whether they fully replace LLMs or extend them is contested.
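A compressed sketch of that recipe, with the learned parts stubbed out: `predict_next_state` plays the role of a dynamics model trained on video and sensor streams, and `score_state` a task reward. Both are hypothetical toy functions; the structure to notice is that planning happens entirely inside the model's imagined rollouts.

```python
import random


def predict_next_state(state: list[float], action: float) -> list[float]:
    """Hypothetical learned dynamics model; a toy linear update stands in for it here."""
    return [s + 0.1 * action for s in state]


def score_state(state: list[float]) -> float:
    """Hypothetical task reward: prefer predicted states close to the origin."""
    return -sum(abs(s) for s in state)


def plan(state: list[float], horizon: int = 5, candidates: int = 64) -> list[float]:
    """Random-shooting planner: roll candidate action sequences through the model, keep the best."""
    best_score, best_seq = float("-inf"), [0.0] * horizon
    for _ in range(candidates):
        seq = [random.uniform(-1.0, 1.0) for _ in range(horizon)]
        imagined = state
        for action in seq:
            imagined = predict_next_state(imagined, action)  # no real-world steps while planning
        score = score_state(imagined)
        if score > best_score:
            best_score, best_seq = score, seq
    return best_seq


if __name__ == "__main__":
    print([round(a, 2) for a in plan([1.0, -2.0])[:3]])
```

The contested part (next paragraph) is what the predictive model should be, not this outer loop.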
Robotics foundation models will produce a step change in what humanoid and manipulator robots can do over the next 2–5 years; the question is whether that step change comes from "world models in LeCun's sense" or from "transformer-style policies trained on enough video and demonstration data". The technical bet behind world-model purism is that the latter approach hits a ceiling that the former does not. The empirical evidence is genuinely mixed. The most important new datasets and reward-shaping experiments are still ahead.
Through 2025 every western frontier lab and the leading Chinese labs have released models that are multimodal natively rather than as bolt-ons. Image, audio, video, and code all flow through the same architecture. The "single model that does everything" arc is largely complete on the input side.
Each new modality has gone through roughly the same arc: novelty → bolt-on adapter → dedicated subnetwork → native integration in pretraining. Image was 2021–2024. Audio was 2022–2024. Video is 2023–2026. Action (robotics) is 2024–2027 on present trajectory. The future is not more modalities; it is better-integrated existing ones, with action being the main new addition.
The open-vs-closed debate is one of the field's persistent arguments, covered repeatedly in the previous decks. By 2026 it has settled into a rough equilibrium.
The probable equilibrium for 2027 is: two parallel frontiers, one closed and one open, with a 6–12 month gap that does not widen but does not close either. The most strategically active question is whether any of the regulatory regimes treats the open-weight frontier as itself a regulated artefact — the EU AI Act has provisions that could be read this way, and the political pressure depends on how the field's safety story unfolds.
Alignment research in 2026 is in a strange state. The capability-relevant techniques (RLHF, constitutional AI, RLAIF, deliberative alignment, debate, scalable oversight) all work to varying degrees on present models. The deeper question — do these techniques generalise to a model substantially smarter than the humans designing them? — has not been tested empirically, because no such model yet exists (or has been detected).
The regulatory side of LLM history in 2026 is structured by four jurisdictions. Each has settled into a distinct posture.
The European Union: the AI Act, passed in 2024 and fully in force by 2026, is a risk-tiered framework with the strictest rules on "high-risk" applications and on general-purpose AI models above a compute threshold. It includes pre-deployment evaluation, transparency on training data, copyright provisions, and post-market monitoring. The most prescriptive of the three western regimes; the most stable.
The United States: no comprehensive federal AI law as of 2026. The 2023 Biden Executive Order set safety-evaluation requirements and was rescinded in 2025. The US AI Safety Institute (housed at NIST) does voluntary frontier-model testing, and sectoral rules (FDA, FTC, SEC) cover specific use cases. The most flexible regime; the least stable.
The United Kingdom: the AI Safety Institute (founded Nov 2023) does pre-deployment frontier-model evaluations under voluntary agreements with major labs and aims to be the most technically credible evaluation body. No comprehensive AI law; instead, existing regulators (Ofcom, ICO, FCA, MHRA) are empowered to address AI in their sectors.
China: the 2023 generative-AI rules require pre-deployment safety review and content alignment with national-security guidelines. Elaborations through 2024–2025 added labelling requirements for AI-generated content on top of the earlier algorithmic-recommendation and deep-synthesis rules. The domestic regime is restrictive on outputs and consumer-facing deployment, permissive on architecture and base models.
Each of the three western regimes has chosen a distinctive lane (EU prescriptive, US flexible-sectoral, UK technical-evaluation). All three have settled enough by 2026 that the regulatory uncertainty discount on AI investment is meaningfully smaller than it was in 2023. The next significant change will probably be precipitated by an incident — a high-profile autonomous-action failure, a frontier-capability surprise, a security breach — rather than by orderly process.
The AGI-timeline question is the deck's most-asked-about and least-confidently-answered. The honest version is that senior people in the field disagree wildly, often along lines that correlate with their incentives, and the disagreement persists because the relevant evidence does not exist.
The deepest reason: there is no agreed operational definition of AGI. "As capable as a median human at most economically valuable tasks", "can do all human cognitive labour", "matches or exceeds a top human expert in any domain", "recursively self-improves" — these are radically different bars, and forecasts keyed to them diverge accordingly. Anyone offering a single date for "AGI" without a specific definition is mostly signalling disposition, not predicting.
Capability gains in narrow domains (math, code, science research) are likely to continue at a pace that produces visible economic effects within 3–5 years. Whether those effects converge on something we should call "AGI" in the next 5–10 years depends on how the agentic-action and world-model axes mature, on whether frontier alignment work scales, and on regulatory and supply-chain factors that no one fully controls. The right posture for an engineer in this field is to plan as if capability will keep improving substantially, while being honest about what is and is not known.