o1, o3, DeepSeek-R1 — scaling inference compute rather than parameters, GRPO reinforcement learning, long thinking traces, and where the eval lift actually lands.
From 2017 to 2023, the dominant scaling law was simple: bigger pre-training runs → smarter models. GPT-4 used roughly 10× the compute of GPT-3.5. Then the curve flattened — not because the law broke, but because frontier training runs hit tens of millions of GPU-hours and ran up against data scarcity.
OpenAI's September 2024 release of o1-preview introduced a second axis: how much compute you spend at inference time. The model is trained to think before answering — emitting a long private chain-of-thought (“thinking trace”) before producing its visible reply. The trace is hidden from the user but consumes tokens.
Pre-training scaling (the old axis): quality scales with training FLOPs. Larger model, more data, more tokens seen. Inference is cheap: one to a few sampled completions per user turn. RLHF aligns tone, not reasoning.
Test-time scaling (the new axis): quality scales with inference tokens. The same parameter count produces better answers by spending 100× more tokens on private reasoning. The RL reward is correctness of the final answer, not human preference.
Correctness on maths and coding is verifiable. A reward signal can be computed without human raters — just check the answer or run the tests. That makes large-scale RL tractable. Models trained this way learn to self-correct mid-trace, explore dead ends, and backtrack, mimicking what a human expert does on paper.
Snell et al. (Google DeepMind, 2024) — “Scaling LLM Test-Time Compute Optimally” — showed that for a fixed total FLOP budget, reallocating compute from training to inference can match or beat a larger model on hard problems. The crossover depends on problem difficulty: easy problems don't benefit, hard ones benefit enormously.
Sequential refinement (used by o1-style models): the model elaborates its answer in a single stream, revising mid-trace. Compute scales linearly with trace length but benefits compound non-linearly.
Parallel sampling / best-of-N: sample K independent answers, pick the best via a verifier or majority vote. Embarrassingly parallel; benefits plateau around N≈64 for most tasks.
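A minimal sketch of the parallel variant with majority vote as the verifier; the model name, temperature, and the "ANSWER:" extraction convention are placeholder choices, not anything prescribed by the papers above.

```python
from collections import Counter
import openai

client = openai.OpenAI()

def best_of_n(prompt: str, n: int = 16, model: str = "gpt-4o-mini") -> str:
    """Sample n independent completions and return the majority-vote answer."""
    resp = client.chat.completions.create(
        model=model,
        n=n,                      # n independent samples in one request
        temperature=0.8,          # >0 so the samples actually differ
        messages=[{"role": "user",
                   "content": prompt + "\nFinish with a line 'ANSWER: <value>'."}],
    )
    answers = []
    for choice in resp.choices:
        text = choice.message.content or ""
        if "ANSWER:" in text:
            answers.append(text.rsplit("ANSWER:", 1)[-1].strip())
    # Majority vote = self-consistency; a trained verifier could rank instead.
    return Counter(answers).most_common(1)[0][0] if answers else ""
```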
OpenAI has published very little about o1's internals. The September 2024 system card and the December 2024 o1 technical report are the primary sources. Everything else is inference from behaviour and leaks.
| Fact / datum | Source | Confidence |
|---|---|---|
| Trained with large-scale RL; reward is answer correctness | o1 system card, 2024-09 | confirmed |
| Thinking trace is hidden from users but counted in context | API docs — reasoning_tokens field | confirmed |
| reasoning_effort param: low / medium / high | OpenAI API reference, 2025-01 | confirmed |
| o3 uses Monte-Carlo Tree Search over reasoning steps | Bloomberg / The Information, 2024-12 | unconfirmed |
| o1 is a fine-tuned GPT-4o base, not a new architecture | widespread inference; no official statement | unconfirmed |
| o3 ARC-AGI score: 87.5% (high-compute) vs 75.7% (low) | OpenAI ARC-AGI eval post, 2024-12-20 | confirmed |
| o3-mini matches o1 on AIME 2024 at ¼ the cost | OpenAI technical report, 2025-01 | confirmed |
The Chat Completions response for o-series models includes a usage.completion_tokens_details object with a reasoning_tokens integer. On a hard AIME problem this can be 8,000–30,000 tokens. The visible content reply might be 200 tokens. You pay for both at the same per-token rate.
```python
import openai

client = openai.OpenAI()
resp = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",          # low | medium | high
    max_completion_tokens=16000,
    messages=[{"role": "user", "content": "Solve AIME 2024 II Problem 15."}],
)
usage = resp.usage
print(usage.prompt_tokens,
      usage.completion_tokens,
      usage.completion_tokens_details.reasoning_tokens)
# e.g. 42 14830 14620 (most tokens were thinking)
```
DeepSeek-R1 (January 2025) is the first high-quality open-weight reasoning model. Its technical report is unusually detailed. The model family sits on top of DeepSeek-V3 (671B MoE, 37B active parameters per token), and the reasoning capability is RL-induced — the R1-Zero variant uses no supervised long-CoT imitation at all.
Standard PPO requires a value (critic) network, doubling memory. GRPO (DeepSeek-Math, 2024) eliminates the critic by estimating the baseline from a group of K sampled responses to the same prompt:
For a group of K outputs o_1…o_K with rewards r_1…r_K, the advantage of output i is A_i = (r_i − mean(r)) / std(r). The policy update is then clipped PPO-style. No critic; the baseline comes from within-group statistics.
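A minimal sketch of that baseline computation and the clipped update, assuming per-response log-probs; the real objective in the paper also adds a KL penalty to a reference policy, omitted here.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantage for K responses to one prompt: (r_i - mean) / std."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_loss(logp_new: torch.Tensor,
              logp_old: torch.Tensor,
              advantages: torch.Tensor,
              clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate, with group statistics replacing a learned critic."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# e.g. 2 of 4 sampled answers were correct:
adv = grpo_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0]))  # ≈ [ 0.87, -0.87, -0.87, 0.87]
```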
DeepSeek-R1-Zero: GRPO applied directly to the DeepSeek-V3 base model, with no SFT cold-start. Emergent self-verification and aha-moment behaviours appear at ≈8k RL steps. Traces are readable but sometimes switch languages mid-trace.
DeepSeek-R1: a small set (∼thousands) of human-curated long-CoT demonstrations bootstraps the policy before GRPO. Language stays consistent; readability improves. Published weights: 1.5B, 7B, 8B, 14B, 32B, and 70B distillations from the 671B model.
Format reward: trace must contain <think>...</think> tags; answer must be boxed in \boxed{} for maths or wrapped in fenced code for coding. Pure rule-based. Correctness reward: exact match or unit-test pass. No neural judge for the primary RL phase.
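A sketch of what such a rule-based reward can look like for a maths prompt; the regexes, the 0/1 weighting, and the helper names are illustrative, not DeepSeek's exact implementation.

```python
import re

def format_reward(response: str) -> float:
    """1.0 if the trace sits in <think>...</think> and the final answer is boxed."""
    has_think = bool(re.search(r"<think>.*?</think>", response, re.DOTALL))
    answer_part = response.split("</think>")[-1]
    return 1.0 if has_think and "\\boxed{" in answer_part else 0.0

def correctness_reward(response: str, gold: str) -> float:
    """1.0 on exact match of the boxed answer against the reference; no neural judge."""
    m = re.search(r"\\boxed\{([^}]*)\}", response.split("</think>")[-1])
    return 1.0 if m and m.group(1).strip() == gold.strip() else 0.0

def reward(response: str, gold: str) -> float:
    return format_reward(response) + correctness_reward(response, gold)
```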
The thinking trace is the model's scratchpad. It is generated autoregressively before the final answer, typically enclosed in <think>…</think> tags (R1 convention) or as hidden reasoning_tokens (o-series). The model learns to use this space for hypothesis generation, arithmetic, checking sub-results, and recovering from errors.
| Model | Trace visibility | Max thinking tokens | How budget is set |
|---|---|---|---|
| o1-mini | Hidden (reasoning_tokens) | ≈32k | reasoning_effort: low/med/high maps to ~1k/4k/32k |
| o3 (API) | Hidden | ≈100k | reasoning_effort or max_completion_tokens |
| DeepSeek-R1 | Visible in <think> block | 32k (context window) | max_tokens; model self-terminates </think> |
| Claude 3.7 Sonnet (extended thinking) | Visible in thinking content block | Up to 128k | thinking.budget_tokens (1,024–128,000) |
| Qwen QwQ-32B | Visible; <think> tags | 32k | max_tokens on inference |
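For the visible-trace models in the table (R1, QwQ), separating the scratchpad from the answer is straightforward; a sketch assuming the <think> tag convention:

```python
import re

def split_trace(completion: str) -> tuple[str, str]:
    """Split an R1/QwQ-style completion into (thinking_trace, visible_answer)."""
    m = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    if not m:
        return "", completion.strip()       # no trace emitted (or budget cut it off)
    return m.group(1).strip(), completion[m.end():].strip()
```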
Forcing a very short budget can hurt more than greedy decoding — the model learns to meta-plan its trace, so cutting it short mid-thought produces worse answers than no trace at all. The safe minimum for hard maths is ≈2,000 thinking tokens; for typical coding tasks ≈500.
```python
import anthropic

client = anthropic.Anthropic()
resp = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[{"role": "user",
               "content": "Prove that sqrt(2) is irrational."}],
)
for block in resp.content:
    if block.type == "thinking":
        print("trace:", block.thinking[:200])
    elif block.type == "text":
        print("answer:", block.text)
```
Reasoning tokens are priced identically to output tokens. A single hard AIME problem with o3 at high effort can consume 20,000–60,000 reasoning tokens. At $40/M output tokens (o3 pricing in the table below) that is $0.80–$2.40 per problem. For a 30-question AIME set: $24–$72. Batch API discounts (50%) bring this to roughly $12–$36.
| Model | Input $/M tok | Output $/M tok | Avg reasoning tok (hard maths) | Cost / hard problem |
|---|---|---|---|---|
| o3-mini (medium effort) | $1.10 | $4.40 | ≈4,000 | ≈$0.018 |
| o3-mini (high effort) | $1.10 | $4.40 | ≈16,000 | ≈$0.070 |
| o3 (high effort) | $10 | $40 | ≈30,000 | ≈$1.20 |
| DeepSeek-R1 (API) | $0.55 | $2.19 | ≈8,000 | ≈$0.018 |
| Claude 3.7 Sonnet (10k budget) | $3 | $15 | ≈8,000 | ≈$0.12 |
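A quick way to sanity-check the last column, treating the table's prices and reasoning-token counts as inputs; the visible and prompt token counts below are assumed round numbers.

```python
def cost_per_problem(output_price_per_m: float,
                     input_price_per_m: float,
                     reasoning_tokens: int,
                     visible_tokens: int = 200,
                     prompt_tokens: int = 100) -> float:
    """Reasoning tokens are billed as output tokens, so they dominate the bill."""
    output_cost = (reasoning_tokens + visible_tokens) * output_price_per_m / 1e6
    input_cost = prompt_tokens * input_price_per_m / 1e6
    return output_cost + input_cost

# o3 at high effort, ~30k reasoning tokens -> ≈ $1.21 per hard problem
print(round(cost_per_problem(40.0, 10.0, 30_000), 2))
```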
Reasoning tokens are generated sequentially. At 60 tok/s output throughput, 30,000 thinking tokens ≈ 8 minutes time-to-first-visible-token. This makes o3 at high effort unsuitable for interactive pipelines. o3-mini at medium effort ≈ 60–90 seconds. Streaming helps UX but doesn't reduce total latency.
The gains from reasoning models are highly domain-specific. AIME and competitive programming see dramatic lifts; conversational tasks see flat or negative results due to latency and over-thinking.
| Benchmark | GPT-4o (greedy) | o1 | o3-mini (high) | DeepSeek-R1 |
|---|---|---|---|---|
| AIME 2024 (pass@1) | 9.3% | 74.4% | 87.3% | 72.6% |
| MATH-500 | 76.6% | 94.8% | 97.2% | 97.3% |
| SWE-bench Verified | 38.8% | 48.9% | 49.3% | 49.2% |
| Codeforces Percentile | 11th | 89th | 93rd | 96th |
| GPQA Diamond | 53.6% | 77.3% | 79.7% | 71.5% |
| MT-Bench (chat quality) | 9.0 | 8.7 | 8.5 | 7.9 |
Reasoning models are RL-trained on verifiable rewards — which means no reward signal for tone, style, concision, or helpfulness in open-ended conversation. The model can over-plan, produce terse answers, or ignore formatting preferences. For creative writing or simple QA, a well-tuned chat model consistently wins.
When a reasoning model is the planner in a multi-step agent, the interaction pattern changes significantly. The model's long thinking trace can incorporate tool results, plan sub-tasks, and revise the plan — all within a single generation. This collapses several turns of a classic ReAct loop into one.
A typical single-generation plan: read_file → run_tests → edit_file → run_tests.
Context accumulation: reasoning tokens accumulate in the conversation window. A 128k-context model with 20k thinking tokens per turn exhausts context in ∼6 agent steps, before tool results are even counted. Summarisation or sliding-window strategies are essential; a sketch follows below.
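A sketch of the sliding-window idea, assuming R1-style visible <think> tags in the stored assistant messages; the function name and keep_last policy are ours.

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def trim_history(messages: list[dict], keep_last: int = 1) -> list[dict]:
    """Drop <think> spans from all but the last `keep_last` assistant turns.

    Visible answers and tool results stay intact; only the old scratchpads,
    which occupy most of the window, are removed before the next call.
    """
    assistant_idx = [i for i, m in enumerate(messages) if m["role"] == "assistant"]
    keep = set(assistant_idx[-keep_last:]) if keep_last else set()
    trimmed = []
    for i, m in enumerate(messages):
        if m["role"] == "assistant" and i not in keep and isinstance(m.get("content"), str):
            m = {**m, "content": THINK_RE.sub("", m["content"])}
        trimmed.append(m)
    return trimmed
```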
Interleaved thinking + tools: Claude 3.7 Sonnet and R1 support thinking blocks between tool calls in the same turn. The model can think → call tool → think more → call tool → answer, all in one generation when the API permits it.
Budget control: default to reasoning_effort: low or set a budget_tokens cap in production.
Deck 02 examines the reward machinery in depth — PRMs vs ORMs, how MCTS-guided decoding works, and when verifier-based sampling beats self-consistency. Deck 03 addresses the deployment question: when do you reach for a reasoning model vs a frontier chat model plus a scaffold?