LLM History Series — Presentation 10

Future Directions — Where the Frontier is Heading

The forecast deck. Reading the technical and institutional patterns from decks 02–09 and projecting them forward. Will date faster than the rest of the series, and that is the point — the value is in the framework, not the predictions.

Scaling laws debate · Test-time compute · Agents · World models · Alignment · Regulation · AGI timelines
2026 baseline · Capability axes · Structural changes · Open questions · Honest forecast
00

What This Deck Covers

The forward view. Less narrative, more structural — reading the patterns from the previous nine decks and projecting forward where they continue, where they break, and where the genuine open questions are.

01

The Caveat — Forecasting in a Field This Fast

It is worth saying plainly: nobody who has tried to forecast LLM capability over the last five years has gotten it right. Not the lab CEOs, not the scaling-laws authors, not the journalists, not the academic critics. The 2020 GPT-3 paper was a surprise to most of the field; ChatGPT was a surprise to most of OpenAI; the o-series was a surprise to many people inside the lab that built it. The base rate for confident frontier-AI predictions surviving twelve months intact is unimpressive.

The point of this deck is therefore not to predict specific things. It is to lay out the axes along which the field could move and, where evidence supports it, to indicate which way the wind is blowing on each axis. A reader returning to this deck in two years should be able to score the framework, not the predictions.

A useful prior

Take any two-year-old confident forecast about AI capability. The accurate predictions tend to be ones that named a structural pattern (scaling continues, agents will arrive, frontier labs will consolidate). The inaccurate ones tend to be ones that named specific capability thresholds and timelines. This deck mostly does the first kind.

02

Where Capability is Heading

Five capability axes are visibly active in 2026. Each is in a different stage of maturity.

  • Reasoning depth. 2026: o-series-style RL on chains of thought is mature; PhD-level on hard math/science benchmarks at the top tier. Next: reasoning extends to longer-horizon scientific problems such as math research, codebase-scale engineering, and drug-discovery pipelines.
  • Agentic action. 2026: Computer Use / Operator / Mariner work for roughly tens of minutes of useful action; reliability is still lossy. Next: day-long agent tasks, multi-step economic transactions, agent-to-agent protocols.
  • Multimodality. 2026: native image, audio, and video in and out; Sora-class generation; real-time voice agents. Next: embodied real-world action (the robotics-foundation-model work) and better cross-modal reasoning.
  • Long context. 2026: 1–2 M tokens at the frontier, 4 M+ at MiniMax; quality degrades past ~200k. Next: memory-style architectures (state-space, retrieval, learned memory) and quality at 10 M+ tokens.
  • Specialised expertise. 2026: better-than-PhD on individual benchmarks, but uneven by domain. Next: reliable expert-level help across most knowledge domains, with failure modes localised and predictable.

A pattern that the historical decks support

Each of these axes has gone through a 2–3 year period from "researchers prove the concept" to "shipped at frontier scale". Reasoning was 2022–2024. Agents were 2023–2025. Long context was 2023–2025. Robotics-foundation-models look like they are mid-cycle: 2024–2026. The next 18 months are the period in which the agentic-action axis goes from interesting to economically transformational.

03

The Scaling Laws Debate — Continue or Plateau?

For most of 2020–2024 the working assumption at the frontier was: more compute, more parameters, more data, lower loss, better capability. By late 2024 several senior researchers — including Ilya Sutskever in his post-OpenAI public statements and Dario Amodei in interviews — had publicly observed that pretraining-only scaling appears to be running into diminishing returns on raw next-token loss.

The "scaling continues" argument

  • Loss has continued to fall, even if the marginal gains per dollar have shrunk.
  • Capability metrics often jump non-linearly when scale crosses a threshold.
  • Test-time compute and post-training (RLHF, RLAIF, RL on CoT) provide new scaling axes.
  • Multimodal and synthetic-data extensions provide new data axes.

The "plateau" argument

  • Pretraining-loss reduction per training-compute dollar looks sub-power-law in the 2024 data points.
  • High-quality natural-text data is finite and approaching exhaustion.
  • The gains from GPT-4 to GPT-5 are large in capability but smaller per dollar than the 3 → 4 jump.
  • The next wave of capability looks like it comes from RL/agents/test-time compute, not pretraining.

Where this lands

The honest read is: both sides are partly right. Pretraining-only scaling has clearly diminished in marginal value. That is settled. Total scaling — pretraining plus post-training plus inference-time compute plus multimodal plus tool use — is still climbing. The right framing for the next five years is probably "scaling on multiple new axes simultaneously", rather than "is the scaling-laws thesis correct or not".
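
For readers who want the shape of the argument concretely: the canonical parametric form here is the Chinchilla-style loss curve, L(N, D) = E + A/N^α + B/D^β. A minimal sketch of fitting it and reading off the marginal return of pretraining-only scaling; the functional form and starting coefficients follow the published Hoffmann et al. (2022) fit, while the data points are made up for illustration:

```python
import numpy as np
from scipy.optimize import curve_fit

# Chinchilla-style parametric loss (Hoffmann et al., 2022): predicted
# loss as a function of parameter count N and training tokens D.
def loss(ND, E, A, B, alpha, beta):
    N, D = ND
    return E + A / N**alpha + B / D**beta

# Illustrative (made-up) observations: (params, tokens) -> eval loss.
N_obs = np.array([1e9, 7e9, 16e9, 70e9, 175e9, 400e9])
D_obs = np.array([2e10, 1.4e11, 3.2e11, 1.4e12, 3e11, 8e12])
L_obs = np.array([2.58, 2.18, 2.07, 1.94, 2.00, 1.84])

popt, _ = curve_fit(
    loss, (N_obs, D_obs), L_obs,
    p0=[1.69, 406.0, 410.0, 0.34, 0.28],  # start near the published fit
    maxfev=20000,
)

# The debate in one number: the loss drop from 10x'ing both N and D at
# the frontier. The "plateau" claim is about this derivative shrinking,
# not about capability per se.
frontier = (N_obs[-1], D_obs[-1])
tenx = (N_obs[-1] * 10, D_obs[-1] * 10)
print(loss(frontier, *popt) - loss(tenx, *popt))
```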

04

Test-Time Compute and Reasoning Models

The o-series, R1, Gemini Thinking, Claude extended-thinking and various other 2024–25 reasoning models have established that inference-time compute is a real second axis of capability. This changes several things:

What the technique gives you

  • State-of-the-art on math, science, coding, and planning, at per-query inference costs that can run 2–100x a standard single pass.
  • The ability to spend $0.01 on simple queries and $5 on hard ones with the same model.
  • RL on verifiable rewards (correctness on math/code) generalises in ways most people had not expected.
  • A research direction with most of its payoff still to come; smaller labs can compete here on technique rather than on scale.

What it pushes

  • Pricing models reorganise around "compute spent per query" not "tokens per request".
  • Specialised reasoning sub-models for specific verifiable domains (math, code, formal verification).
  • Long-horizon agents that can spend hours on a single task.
  • The performance of small open-weight models distilled from large reasoning models (the R1-distill family).

A specific forecast

The most likely next step on this axis is reasoning across very long horizons by chaining many smaller RL'd reasoning passes, with intermediate caching, search, and verification. The technique's home base will probably extend from math/code into more open-ended scientific research within two to three years. Whether it generalises to fully open-ended creative or strategic tasks is genuinely contested.
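
A minimal sketch of the loop this forecast describes: sample, verify, escalate. The `sample_solution` and `verify` interfaces are hypothetical stand-ins for a reasoning-model call and a domain checker (unit tests, a proof assistant, numeric comparison); no real API is implied:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Attempt:
    answer: str
    tokens_spent: int

def solve_with_budget(
    problem: str,
    sample_solution: Callable[[str, int], Attempt],  # hypothetical model call
    verify: Callable[[str, str], bool],              # hypothetical domain checker
    budgets=(1_000, 8_000, 64_000),  # escalating thinking-token tiers
    samples_per_budget: int = 4,
) -> Optional[Attempt]:
    """Escalating test-time compute: cheap first, expensive only if needed.

    This is why per-query cost can span cents to dollars on the same
    weights: easy problems exit at the first tier."""
    total = 0
    for budget in budgets:
        for _ in range(samples_per_budget):
            attempt = sample_solution(problem, budget)
            total += attempt.tokens_spent
            if verify(problem, attempt.answer):
                return Attempt(attempt.answer, total)
    return None  # out of budget: escalate to a human or fail loudly
```

The design point is that the verifier, not the sampler, carries the guarantee; this only works in domains where `verify` is cheap and trustworthy, which is the same constraint the agents section returns to.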

05

Agents and Tool Use

Through 2025 the agent-product story moved from "interesting demos" to "shipped frontier products with real revenue lines": Cursor (and Anysphere generally), Claude Code, Operator, Project Mariner, the Devin/Cognition line, the Replit agent, and the agent SDKs at every frontier lab.

What works in 2026

  • Coding agents at the function-and-file level (Claude Code, Cursor, Devin).
  • Browser agents for tens of minutes of structured tasks.
  • Tool-augmented assistants in enterprise contexts (Salesforce, Microsoft, Atlassian).
  • Voice and Slack-bot agents in operations contexts.

What is still hard

  • Multi-hour autonomous tasks in unstructured environments.
  • Long-running open-ended research projects.
  • Fully autonomous agent-to-agent commerce.
  • Robust handling of unexpected failure modes mid-task.

What is plausibly next

  • Day-scale autonomous tasks with checkpointing and human review at strategic decisions.
  • MCP-style protocol consolidation (a few standards win).
  • Agent-marketplaces and agent-to-agent payments.
  • Operator licensing / agent-action insurance as commercial categories.

A useful pattern

The most-shipped agent products are ones that work in environments with strong feedback signal: code (compilers, tests, linters), structured forms (web pages with predictable DOM), defined workflows (data ETL, scheduling). Open-ended agent tasks where no easy verifier exists remain hard. Most of the next two years of agent capability gains will come from labs figuring out how to engineer good feedback signals for harder domains, not from raw model capability.
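
The code-agent case makes the point concrete. A sketch of the loop shape, with the test suite as the verifier; `propose_patch` and `apply_patch` are hypothetical stand-ins for a model call and a workspace edit, not any particular SDK:

```python
import subprocess
from typing import Callable

def run_tests() -> tuple[bool, str]:
    # The verifier. Code is agent-friendly because this signal is cheap,
    # objective, and dense.
    proc = subprocess.run(["pytest", "-q"], capture_output=True,
                          text=True, timeout=600)
    return proc.returncode == 0, proc.stdout + proc.stderr

def coding_agent_loop(
    task: str,
    propose_patch: Callable[[str, str], str],  # hypothetical model call
    apply_patch: Callable[[str], None],        # hypothetical workspace edit
    max_iters: int = 8,
) -> bool:
    feedback = ""
    for _ in range(max_iters):
        patch = propose_patch(task, feedback)  # model proposes an edit
        apply_patch(patch)                     # checkpoint first in practice
        ok, feedback = run_tests()             # objective signal, fed back in
        if ok:
            return True
    return False  # stuck: escalate to human review rather than loop forever
```

"Engineering good feedback signals for harder domains" means building a `run_tests` equivalent for domains that do not ship with one.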

06

World Models and Embodied AI

Yann LeCun's JEPA programme, Demis Hassabis's emphasis on "world models" as the next frontier, the Pi/Physical Intelligence robotics-foundation-model work, Sergey Levine's Berkeley group, and a host of related research lines all share a thesis: language models are the prior; embodied/world-model agents are the next research target.

The thesis in one paragraph

An LLM trained only on text knows about the world only secondhand. To act usefully in the real world (robotics, autonomous vehicles, scientific experimentation), models need to learn the dynamics of physical reality — not just descriptions of it. World-model approaches train on video and sensor streams, predict future states, and use that prediction as the substrate for planning and action. Whether they fully replace LLMs or extend them is contested.
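
The generic shape of the planning side of this thesis is model-predictive control over a learned dynamics model: imagine candidate futures inside the model, score them against a goal, act on the best first step. A sketch under that reading, with `WorldModel` as an illustrative interface (the trivial linear dynamics stand in for a trained network) rather than any lab's actual architecture:

```python
import numpy as np

class WorldModel:
    """Stand-in for a trained latent dynamics model. The interface, not
    the placeholder arithmetic below, is the point."""
    def encode(self, observation):        # observation -> latent state
        return np.asarray(observation, dtype=float)
    def predict(self, state, action):     # latent step: s' = f(s, a)
        return state + 0.1 * action       # trivial placeholder dynamics
    def score(self, state, goal):         # higher = closer to the goal
        return -float(np.linalg.norm(state - goal))

def plan(model, obs, goal, horizon=10, n_candidates=256, action_dim=None):
    """Random-shooting MPC: imagine futures in the model, act on the best."""
    state0 = model.encode(obs)
    action_dim = action_dim or state0.shape[0]
    best_score, best_first = -np.inf, None
    for _ in range(n_candidates):
        actions = np.random.uniform(-1, 1, size=(horizon, action_dim))
        state = state0
        for a in actions:                     # roll the future out in the
            state = model.predict(state, a)   # model, never the real robot
        s = model.score(state, goal)
        if s > best_score:
            best_score, best_first = s, actions[0]
    return best_first  # execute one step in the world, observe, re-plan

# Usage: nudge a 3-D state toward a goal, one re-planned step at a time.
first_action = plan(WorldModel(), obs=[0.0, 0.0, 0.0], goal=np.ones(3))
```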

Where the work is happening

  • Meta's V-JEPA / I-JEPA line (LeCun's architecture).
  • Pi/Physical Intelligence (Berkeley/Google diaspora).
  • Google DeepMind robotics (RT-1/2, Gemini Robotics).
  • NVIDIA's GR00T humanoid programme.
  • Various Tesla / Waymo / Wayve autonomous-driving lines.
  • Sora and Veo as side-effects (video models that learn physics implicitly).

An honest forecast

Robotics foundation models will produce a step change in what humanoid and manipulator robots can do over the next 2–5 years; the question is whether that step change comes from "world models in LeCun's sense" or from "transformer-style policies trained on enough video and demonstration data". The technical bet behind world-model purism is that the latter approach hits a ceiling that the former does not. The empirical evidence is genuinely mixed. The most important new datasets and reward-shaping experiments are still ahead.

07

Multimodality and the Single-Model Future

Through 2025 every western frontier lab and the leading Chinese labs have released models that are multimodal natively rather than as bolt-ons. Image, audio, video, and code all flow through the same architecture. The "single model that does everything" arc is largely complete on the input side.

What is consolidated

  • Native vision input is table stakes (GPT-4o, Claude, Gemini, Llama 4, Qwen-VL).
  • Real-time voice in / voice out at near-human latency.
  • Long-context document and image understanding in a single inference.
  • Video understanding (frame-by-frame reasoning over hours of input).

What is still moving

  • Native generation across all modalities (Sora-style is current best, still imperfect).
  • Cross-modal reasoning fidelity ("the audio says X, the video says Y, what can you infer").
  • Multimodal agents (action over multimodal observation streams).
  • 3D / spatial understanding (a much harder modality than 2D image).

A pattern

Each new modality has gone through roughly the same arc: novelty → bolt-on adapter → dedicated subnetwork → native integration in pretraining. Image was 2021–2024. Audio was 2022–2024. Video is 2023–2026. Action (robotics) is 2024–2027 on present trajectory. The future is not more modalities; it is better-integrated existing ones, with action being the main new addition.
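
The "bolt-on adapter" stage of that arc has a concrete, well-known form: a small trained projector that maps a frozen vision encoder's outputs into the language model's embedding space (the LLaVA-style recipe). A sketch with illustrative dimensions:

```python
import torch
import torch.nn as nn

class VisionAdapter(nn.Module):
    """Maps frozen vision-encoder patch embeddings into the LLM's
    token-embedding space. Dimensions are illustrative."""
    def __init__(self, d_vision: int = 1024, d_model: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_vision, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # Output is consumed by the LLM exactly as if it were text-token
        # embeddings; in the simplest version, only this module trains.
        return self.proj(patch_embeddings)

# 576 image patches -> 576 pseudo-token embeddings for the LLM.
tokens = VisionAdapter()(torch.randn(1, 576, 1024))
```

Native integration differs in kind: the modality enters pretraining itself rather than being grafted on afterwards, which is why the arc ends there.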

08

Open Weights vs Closed — the Live Argument

The open-vs-closed debate is one of the field's persistent arguments, covered repeatedly in the previous decks. In 2026 the equilibrium looks roughly like:

The closed-weight frontier

  • OpenAI, Anthropic, Google DeepMind keep flagship weights closed.
  • API access plus cloud-marketplace distribution.
  • Strong commercial moats; clear safety-research story.
  • ~6–12 month lead on raw frontier capability.

The open-weight frontier

  • Meta Llama, Mistral, DeepSeek, Qwen, plus an active long tail.
  • Hugging Face / Together / Ollama distribution.
  • Stronger researcher/community footprint; weaker enterprise sales motion.
  • ~6–12 months behind the closed frontier on raw capability, often ahead on specific axes.

The questions for the next two years

  • Does the 6–12 month gap widen, hold, or close?
  • Does any regulator treat open weights themselves as a regulated artefact?
  • Do open-weight models keep pace on the reasoning and agentic axes, or only on raw next-token capability?

An honest forecast

The probable equilibrium for 2027 is: two parallel frontiers, one closed and one open, with a 6–12 month gap that does not widen but does not close either. The most strategically active question is whether any of the regulatory regimes treats the open-weight frontier as itself a regulated artefact — the EU AI Act has provisions that could be read this way, and the political pressure depends on how the field's safety story unfolds.

09

The Alignment Question

Alignment research in 2026 is in a strange state. The capability-relevant techniques (RLHF, constitutional AI, RLAIF, deliberative alignment, debate, scalable oversight) all work to varying degrees on present models. The deeper question — do these techniques generalise to a model substantially smarter than the humans designing them? — has not been tested empirically, because such a model does not yet exist (or has not yet been detected).

What is going well

  • Mechanistic interpretability has moved from interesting toy results to extracting interpretable features from frontier models. The Anthropic 2024–25 work is genuinely promising (see the sketch after this list).
  • Public responsible-scaling policies (RSPs) and frontier safety frameworks (FSFs) at top labs commit specific capability thresholds to specific safeguards.
  • Frontier red-teaming infrastructure exists (UK AISI, US AISI, METR, Apollo Research).
  • Deceptive-alignment research has moved from speculation to empirical work: demonstrations on toy models exist; the question is whether they scale.
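
The sketch promised above: the features-on-frontier-models results rest on dictionary learning, sparse autoencoders in particular. A minimal PyTorch version of the core object, with illustrative sizes; real systems add refinements (decoder-norm constraints, resampling of dead features) omitted here:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Reconstruct residual-stream activations through an overcomplete,
    sparsely-activating feature basis. Sizes are illustrative."""
    def __init__(self, d_model: int = 4096, n_features: int = 65_536):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # mostly zeros
        return self.decoder(features), features

def sae_loss(x, reconstruction, features, l1_coeff: float = 5e-3):
    # Reconstruction fidelity plus an L1 penalty that keeps most features
    # off for any given input; the features that survive are the
    # candidates for human-interpretable concepts.
    return ((x - reconstruction) ** 2).mean() + l1_coeff * features.abs().mean()
```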

What is still alarming

  • None of the techniques have been validated on models substantially smarter than their designers, because such models do not yet exist (we think).
  • Alignment teams at the top labs have lost senior staff repeatedly (Leike, Christiano, Sutskever).
  • Commercial pressure compresses safety timelines.
  • The institutions designed to enforce safety commitments are still nascent and untested.

"We have an unusual amount of empirical evidence that the techniques work on the systems we currently have. We have basically no empirical evidence about whether they will work on the systems we are about to build."

— A common framing among senior alignment researchers in late 2025. Variants of this thought have been articulated by Bengio, Christiano, Olah and others.

10

The Regulatory Landscape

The regulatory side of LLM history in 2026 is structured by three western jurisdictions plus China. Each has settled into a distinct posture.

European Union — EU AI Act

Passed 2024, fully in force 2026. Risk-tiered framework with the strictest rules on "high-risk" applications and general-purpose AI models above a compute threshold. Includes pre-deployment evaluation, transparency on training data, copyright provisions, and post-market monitoring. Most prescriptive of the three regimes; most stable.
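
One concretely computable piece of the Act: the systemic-risk presumption for general-purpose models keys off cumulative training compute above 10^25 FLOPs. A sketch using the standard 6ND back-of-envelope for dense transformers (Kaplan et al.); the parameter and token counts below are illustrative, not claims about any particular model:

```python
EU_SYSTEMIC_RISK_FLOPS = 1e25  # the Act's GPAI compute threshold

def training_flops(n_params: float, n_tokens: float) -> float:
    # Standard dense-transformer estimate: ~6 FLOPs per parameter per token.
    return 6 * n_params * n_tokens

for name, n, d in [
    ("7B on 2T tokens",    7e9,   2e12),
    ("70B on 15T tokens",  70e9,  15e12),
    ("400B on 30T tokens", 400e9, 30e12),
]:
    f = training_flops(n, d)
    flag = "above" if f > EU_SYSTEMIC_RISK_FLOPS else "below"
    print(f"{name}: {f:.1e} FLOPs ({flag} the 1e25 threshold)")
```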

United States — sectoral plus executive

No comprehensive federal AI law as of 2026. The 2023 Biden Executive Order set safety-evaluation requirements and was partially rolled back in 2025. The US AI Safety Institute (housed at NIST) does voluntary frontier-model testing. Sectoral rules (FDA, FTC, SEC) cover specific use cases. Most flexible regime; least stable.

United Kingdom — AISI plus light-touch

UK AI Safety Institute (founded Nov 2023) does pre-deployment frontier-model evaluations under voluntary agreements with major labs. Aims to be the most technically credible evaluation body. No comprehensive AI law; instead empowers existing regulators (Ofcom, ICO, FCA, MHRA) to address AI in their sectors.

The China side

The 2023 Generative AI rules require pre-deployment safety review and content alignment with national-security guidelines. The 2024–2025 elaborations have added algorithmic-recommendation rules and deep-synthesis labelling. Domestic regime is restrictive on outputs and consumer-facing deployment, permissive on architecture and base models.

A pattern that probably continues

Each of the three western regimes has chosen a distinctive lane (EU prescriptive, US flexible-sectoral, UK technical-evaluation). All three have settled enough by 2026 that the regulatory uncertainty discount on AI investment is meaningfully smaller than it was in 2023. The next significant change will probably be precipitated by an incident — a high-profile autonomous-action failure, a frontier-capability surprise, a security breach — rather than by orderly process.

11

AGI Timelines — Honest Disagreement

The AGI-timeline question is the deck's most-asked-about and least-confidently-answered. The honest version is that senior people in the field disagree wildly, often along lines that correlate with their incentives, and the disagreement persists because the relevant evidence does not exist.

Short timelines (2027–2030)

  • Sam Altman has said publicly that "AGI" by his definition is achievable within this window.
  • Dario Amodei's Machines of Loving Grace implies roughly this range.
  • Shane Legg has held this position consistently for two decades.
  • Many AI 2027-style scenario authors (Daniel Kokotajlo et al.).

Longer timelines (2035–2050)

  • Yann LeCun argues current LLMs are not on the path to AGI; world-model architectures will take longer.
  • Many academic ML researchers and AI critics (Gary Marcus, Melanie Mitchell, others).
  • Most economists who have looked at the productivity-impact data carefully.
  • Demis Hassabis takes a middle position: serious about transformative impact in 5–10 years, sceptical of "AGI by 2027" specifics.

Why the disagreement persists

The deepest reason: there is no agreed operational definition of AGI. "As capable as a median human at most economically valuable tasks", "can do all human cognitive labour", "matches or exceeds a top human expert in any domain", "recursively self-improves" — these are radically different bars, and forecasts differ accordingly. Anyone offering a single date for "AGI" without a specific definition is mostly signalling disposition, not predicting.

What can be said with more confidence

Capability gains in narrow domains (math, code, science research) are likely to continue at a pace that produces visible economic effects within 3–5 years. Whether those effects converge on something we should call "AGI" in the next 5–10 years depends on how the agentic-action and world-model axes mature, on whether frontier alignment work scales, and on regulatory and supply-chain factors that no one fully controls. The right posture for an engineer in this field is to plan as if capability continues to improve substantially, while being honest about what is and is not known.

12

Cheat Sheet

The five axes to watch

  • Reasoning depth. RL on chains of thought; mature.
  • Agentic action. Mid-cycle; biggest near-term lever.
  • World models / embodiment. Early-cycle; high uncertainty.
  • Multimodality. Mostly consolidated; integration improvements.
  • Specialised expertise. Domain-by-domain; uneven.

The structural questions

  • Does pretraining-only scaling really plateau, or just shift axes?
  • Open weights vs closed: equilibrium or shift?
  • Does alignment scale to substantially-smarter models?
  • Which regulatory regime sets the global default?
  • How does the multipolar (US-China-Europe) frontier evolve?

Things to flag in re-reads

  • The "AGI by 2027" predictions — did they age well or badly?
  • The "open weights catches up by 2026" claim — how close is the gap actually?
  • The "test-time compute is the next axis" thesis — was it the dominant driver, or was something else?
  • Did robotics-foundation-models actually transform anything by 2028?

End of the series

  • This is the last deck. The history part of the LLM History series ends with the present.
  • The technical content of LLMs continues in the rest of the LLMs hub.
  • Two years from now, this deck should be re-read for what it got right about structure, not for what it predicted about specifics.