Reasoning Models Series — Presentation 01

The Test-Time Compute Paradigm Shift

o1, o3, DeepSeek-R1 — scaling inference compute rather than parameters, GRPO reinforcement learning, long thinking traces, and where the eval lift actually lands.

o1 / o3 DeepSeek-R1 GRPO Test-Time Compute Long CoT AIME SWE-bench
Flow: Prompt → Think (hidden) → GRPO reward → Policy update → Longer trace → Answer
00

Topics We'll Cover

01

The o1/R1 Paradigm Shift

From 2017 to 2023 the dominant scaling law was simple: bigger pre-training runs yield smarter models. GPT-4 used roughly 10× the training compute of GPT-3.5. The curve then flattened, not because the law broke, but because frontier training runs hit hundreds of millions of GPU-hours and ran up against data scarcity.

OpenAI's September 2024 release of o1-preview introduced a second axis: how much compute you spend at inference time. The model is trained to think before answering — emitting a long private chain-of-thought (“thinking trace”) before producing its visible reply. The trace is hidden from the user but consumes tokens.

Pre-2024 paradigm

Quality scales with training FLOPs: larger model, more data, more tokens seen. Inference is cheap: one greedy decode of the reply, no extra sampling. RLHF aligns tone, not reasoning.

Post-o1 paradigm

Quality scales with inference tokens. The same parameter count produces better answers by spending 100× more tokens on private reasoning. RL reward is correctness of the final answer, not human preference.

The key insight

Correctness on maths and coding is verifiable. A reward signal can be computed without human raters — just check the answer or run the tests. That makes large-scale RL tractable. Models trained this way learn to self-correct mid-trace, explore dead ends, and backtrack, mimicking what a human expert does on paper.

02

Scaling Test-Time Compute vs Train-Time

Snell et al. (Google DeepMind, 2024) — “Scaling LLM Test-Time Compute Optimally” — showed that for a fixed total FLOP budget, reallocating compute from training to inference can match or beat a larger model on hard problems. The crossover depends on problem difficulty: easy problems don't benefit, hard ones benefit enormously.

Figure (schematic): accuracy vs tokens spent at inference, MATH-500 regime. Accuracy axis 55–95%; token axis 256 tok to 16k tok. Series: reasoning model (o1-style), large base model (greedy), small model + best-of-N.
Two distinct mechanisms

Sequential refinement (used by o1-style models): the model elaborates its answer in a single stream, revising mid-trace. Compute scales linearly with trace length but benefits compound non-linearly.

Parallel sampling / best-of-N: sample K independent answers, pick the best via a verifier or majority vote. Embarrassingly parallel; benefits plateau around N≈64 for most tasks.
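Best-of-N is easy to sketch. A minimal majority-vote version in Python, where `sample_fn` is a hypothetical stand-in for one model call returning a final answer string:

```python
from collections import Counter
from itertools import cycle

def best_of_n(sample_fn, prompt, n=16):
    """Parallel sampling: draw n independent answers and majority-vote.

    sample_fn stands in for one model call; in practice the n calls run
    concurrently, since they are fully independent."""
    answers = [sample_fn(prompt) for _ in range(n)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n   # winning answer and its vote share

# Deterministic toy sampler: answers "42" three times out of four.
src = cycle(["42", "42", "42", "41"])
answer, share = best_of_n(lambda prompt: next(src), "What is 6*7?", n=64)
```

A verifier-based variant replaces the majority vote with `max(answers, key=verifier_score)`; both plateau as N grows.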

03

o1 / o3 — What's Public, What's Conjecture

OpenAI has published very little about o1's internals. The September 2024 system card and the December 2024 o1 technical report are the primary sources. Everything else is inference from behaviour and leaks.

Fact / datum | Source | Confidence
Trained with large-scale RL; reward is answer correctness | o1 system card, 2024-09 | confirmed
Thinking trace is hidden from users but counted in context | API docs — reasoning_tokens field | confirmed
reasoning_effort param: low / medium / high | OpenAI API reference, 2025-01 | confirmed
o3 uses Monte-Carlo Tree Search over reasoning steps | Bloomberg / The Information, 2024-12 | unconfirmed
o1 is a fine-tuned GPT-4o base, not a new architecture | widespread inference; no official statement | unconfirmed
o3 ARC-AGI score: 87.5% (high-compute) vs 75.7% (low) | OpenAI ARC-AGI eval post, 2024-12-20 | confirmed
o3-mini matches o1 on AIME 2024 at ¼ the cost | OpenAI technical report, 2025-01 | confirmed
What the API exposes

The Chat Completions response for o-series models includes a usage.completion_tokens_details object with a reasoning_tokens integer. On a hard AIME problem this can be 8,000–30,000 tokens. The visible content reply might be 200 tokens. You pay for both at the same per-token rate.

OpenAI API — reasoning effort and token breakdown (Python SDK)
import openai
client = openai.OpenAI()
resp = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",          # low | medium | high
    max_completion_tokens=16000,
    messages=[{"role": "user", "content": "Solve AIME 2024 II Problem 15."}]
)
usage = resp.usage
print(usage.prompt_tokens,
      usage.completion_tokens,
      usage.completion_tokens_details.reasoning_tokens)
# e.g.  42   14830   14620  (most tokens were thinking)
04

DeepSeek-R1 — Open Architecture & GRPO Training

DeepSeek-R1 (January 2025) is the first high-quality open-weight reasoning model, and its technical report is unusually detailed. The model family sits on top of DeepSeek-V3 (671B MoE, 37B active parameters per token). The reasoning capability is RL-induced: R1-Zero uses no supervised long-CoT imitation at all, and R1 adds only a small cold-start SFT set before RL.

GRPO — Group Relative Policy Optimisation

Standard PPO requires a value (critic) network, doubling memory. GRPO (DeepSeek-Math, 2024) eliminates the critic by estimating the baseline from a group of K sampled responses to the same prompt:

GRPO advantage formula

For a group of K outputs o_1 … o_K with rewards r_1 … r_K, the advantage of output i is A_i = (r_i − mean(r)) / std(r). The policy gradient is then clipped PPO-style. There is no critic; the baseline comes from within-group statistics.
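In code, the group-relative advantage is only a few lines; a minimal sketch in plain Python, no RL framework assumed:

```python
import math

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: (r_i - mean(r)) / std(r) over the K
    rollouts sampled for one prompt. eps guards against a zero-variance
    group (all rollouts equally rewarded)."""
    k = len(rewards)
    mean = sum(rewards) / k
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / k)
    return [(r - mean) / (std + eps) for r in rewards]

# K=8 rollouts for one prompt: 1.0 = correct and well-formatted, else 0.0
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
```

Correct rollouts get positive advantage, incorrect ones negative, and because the group mean is subtracted the advantages sum to zero.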

Prompt batch → Sample K rollouts (K≈8–16) → Score each rollout (rule/format + correctness) → Compute group advantage → PPO-clip update (ε=0.2, KL penalty β)

R1-Zero vs R1

R1-Zero (pure RL)

GRPO directly on DeepSeek-V3 base. No SFT cold-start. Emergent self-verification and aha-moment behaviours appear at ≈8k RL steps. Readable but sometimes switches languages mid-trace.

R1 (cold-start SFT + RL)

A small set (∼thousands) of human-curated long-CoT demonstrations bootstraps the policy before GRPO. Language stays consistent; readability improves. Published weights: 1.5B, 7B, 8B, 14B, 32B, 70B distillations from 671B.

Reward signals used in R1 training

  • Format reward: trace must contain <think>...</think> tags; answer must be boxed in \boxed{} for maths or wrapped in fenced code for coding. Pure rule-based.
  • Correctness reward: exact match or unit-test pass. No neural judge for the primary RL phase.
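Rule-based rewards of this shape are a few lines of Python. The regexes below are illustrative approximations, not the exact rules from the R1 report:

```python
import re

def format_reward(output: str) -> float:
    """1.0 iff the output has a <think>...</think> trace and a
    \\boxed{...} final answer (the maths case)."""
    has_think = re.search(r"<think>.*?</think>", output, re.DOTALL) is not None
    has_boxed = re.search(r"\\boxed\{[^}]+\}", output) is not None
    return float(has_think and has_boxed)

def correctness_reward(output: str, gold: str) -> float:
    """Exact match on the boxed answer; no neural judge involved."""
    m = re.search(r"\\boxed\{([^}]+)\}", output)
    return float(bool(m) and m.group(1).strip() == gold)

sample = "<think>4 = 2 + 2, check: yes</think> The answer is \\boxed{4}."
```

Both signals are cheap to compute at scale, which is what makes GRPO over millions of rollouts tractable.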

05

Long Thinking Traces & Max-Tokens Budget

The thinking trace is the model's scratchpad. It is generated autoregressively before the final answer, typically enclosed in <think></think> tags (R1 convention) or as hidden reasoning_tokens (o-series). The model learns to use this space for hypothesis generation, arithmetic, checking sub-results, and recovering from errors.

Model | Trace visibility | Max thinking tokens | How budget is set
o1-mini | Hidden (reasoning_tokens) | ≈32k | reasoning_effort: low/med/high maps to ~1k/4k/32k
o3 (API) | Hidden | ≈100k | reasoning_effort or max_completion_tokens
DeepSeek-R1 | Visible in <think> block | 32k (context window) | max_tokens; model self-terminates </think>
Claude 3.7 Sonnet (extended thinking) | Visible in thinking content block | Up to 128k | thinking.budget_tokens (1,024–128,000)
Qwen QwQ-32B | Visible; <think> tags | 32k | max_tokens on inference
Budget vs quality trade-off

Forcing a very short budget can hurt more than greedy decoding — the model learns to meta-plan its trace, so cutting it short mid-thought produces worse answers than no trace at all. The safe minimum for hard maths is ≈2,000 thinking tokens; for typical coding tasks ≈500.

Anthropic Python SDK — extended thinking (Claude 3.7 Sonnet)
import anthropic
client = anthropic.Anthropic()
resp = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[{"role": "user",
               "content": "Prove that sqrt(2) is irrational."}]
)
for block in resp.content:
    if block.type == "thinking":
        print("trace:", block.thinking[:200])
    else:
        print("answer:", block.text)
06

Cost & Latency Reality

Reasoning tokens are priced identically to output tokens. A single hard AIME problem with o3 at high effort can consume 20,000–60,000 reasoning tokens. At $40/M output tokens (o3 pricing as of 2025-Q1, per the table below) that is $0.80–$2.40 per problem. For a 30-question AIME set: $24–$72. Batch API discounts (50%) bring this to $12–$36.

Model | Input $/M tok | Output $/M tok | Avg reasoning tok (hard maths) | Cost / hard problem
o3-mini (medium effort) | $1.10 | $4.40 | ≈4,000 | ≈$0.018
o3-mini (high effort) | $1.10 | $4.40 | ≈16,000 | ≈$0.070
o3 (high effort) | $10 | $40 | ≈30,000 | ≈$1.20
DeepSeek-R1 (API) | $0.55 | $2.19 | ≈8,000 | ≈$0.018
Claude 3.7 Sonnet (10k budget) | $3 | $15 | ≈8,000 | ≈$0.12
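Because reasoning tokens bill at the output rate, per-problem cost is a one-line estimate. A small helper, using the o3-mini high-effort figures from the table above as an example (the 300-token answer and 200-token prompt are illustrative):

```python
def cost_per_problem(reasoning_tok, answer_tok, prompt_tok,
                     in_price_per_m, out_price_per_m):
    """USD cost of one problem: reasoning and visible answer tokens bill
    at the output rate; the prompt bills at the input rate."""
    output_cost = (reasoning_tok + answer_tok) * out_price_per_m / 1e6
    input_cost = prompt_tok * in_price_per_m / 1e6
    return input_cost + output_cost

# o3-mini at high effort on a hard maths problem
c = cost_per_problem(reasoning_tok=16_000, answer_tok=300, prompt_tok=200,
                     in_price_per_m=1.10, out_price_per_m=4.40)
```

Note that the reasoning tokens dominate: the visible answer and prompt together contribute under 3% of the total.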
Latency is the harder constraint

Reasoning tokens are generated sequentially. At 60 tok/s output throughput, 30,000 thinking tokens mean roughly 8 minutes before the first visible token appears. This makes o3 at high effort unsuitable for interactive pipelines; o3-mini at medium effort lands around 60–90 seconds. Streaming helps UX but doesn't reduce total latency.

User sends prompt → Reasoning phase (hidden; 10 s–8 min) → Visible answer streams (seconds) → User sees result
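The latency arithmetic is worth making explicit; a minimal helper:

```python
def time_to_first_visible_token(reasoning_tokens, tok_per_s=60):
    """Sequential decoding means the entire hidden trace is generated
    before any visible output, so waiting time scales linearly with
    trace length."""
    return reasoning_tokens / tok_per_s

secs = time_to_first_visible_token(30_000)   # an o3-style high-effort trace
```

At 60 tok/s this gives 500 seconds, the roughly-8-minute figure quoted above; halving the trace budget halves the wait.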
07

Eval Lift — Maths, Coding vs Chat Performance

The gains from reasoning models are highly domain-specific. AIME and competitive programming see dramatic lifts; conversational tasks see flat or negative results due to latency and over-thinking.

Benchmark | GPT-4o (greedy) | o1 | o3-mini (high) | DeepSeek-R1
AIME 2024 (pass@1) | 9.3% | 74.4% | 87.3% | 72.6%
MATH-500 | 76.6% | 94.8% | 97.2% | 97.3%
SWE-bench Verified | 38.8% | 48.9% | 49.3% | 49.2%
Codeforces percentile | 11th | 89th | 93rd | 96th
GPQA Diamond | 53.6% | 77.3% | 79.7% | 71.5%
MT-Bench (chat quality) | 9.0 | 8.7 | 8.5 | 7.9
Why chat quality can regress

Reasoning models are RL-trained on verifiable rewards — which means no reward signal for tone, style, concision, or helpfulness in open-ended conversation. The model can over-plan, produce terse answers, or ignore formatting preferences. For creative writing or simple QA, a well-tuned chat model consistently wins.

High-gain tasks

  • Competition maths (AIME, AMC, Olympiad)
  • Competitive programming (Codeforces Div1-D)
  • Formal verification & proof writing
  • Hard-constraint optimisation
  • Multi-step reasoning under uncertainty

Low-gain or negative tasks

  • Conversational chat, customer support
  • Simple code-autocomplete
  • Creative writing, poetry
  • Factual QA from knowledge (RAG tasks)
  • Latency-sensitive interactive loops
08

Reasoning Models in Agentic Loops

When a reasoning model is the planner in a multi-step agent, the interaction pattern changes significantly. The model's long thinking trace can incorporate tool results, plan sub-tasks, and revise the plan — all within a single generation. This collapses several turns of a classic ReAct loop into one.

User goal: “Debug and fix the failing CI job”
→ Reasoning model receives full context + tool schemas
→ Long thinking trace (hidden): reads logs → hypothesises root cause → checks fix → revises → emits tool-call sequence
→ Tool calls emitted: read_file, run_tests, edit_file, run_tests
→ Results injected back; model may generate another thinking trace
→ Final answer: diff + confirmation
Practical considerations in agentic use

Context accumulation: reasoning tokens accumulate in the conversation window. A 128k-context model with 20k thinking tokens per turn exhausts context in ∼6 agent steps before tool results are counted. Summarisation or sliding-window strategies are essential.
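One mitigation is a sliding window over the transcript. A minimal sketch, assuming OpenAI-style message dicts and a hypothetical `count_tokens` tokenizer callback:

```python
def trim_history(messages, count_tokens, budget=100_000):
    """Sliding-window trim for an agent loop: keep the system prompt and
    evict the oldest non-system turns until the transcript fits the
    token budget."""
    system, rest = messages[0], list(messages[1:])

    def total():
        return sum(count_tokens(m["content"]) for m in [system] + rest)

    while rest and total() > budget:
        rest.pop(0)   # evict the oldest non-system turn first
    return [system] + rest

# Toy example: len() as the tokenizer, 10-char system prompt plus three
# turns of 50/60/30 "tokens", trimmed to a 100-token budget.
history = [{"role": "system", "content": "s" * 10},
           {"role": "assistant", "content": "a" * 50},
           {"role": "tool", "content": "t" * 60},
           {"role": "user", "content": "u" * 30}]
trimmed = trim_history(history, len, budget=100)
```

Production agents usually summarise evicted turns rather than drop them outright, so earlier tool results remain recoverable.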

Interleaved thinking + tools: Claude 3.7 Sonnet and R1 support thinking blocks between tool calls in the same turn. The model can think → call tool → think more → call tool → answer, all in one generation when the API permits it.
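Abstracted away from any single vendor API, the loop looks like the sketch below; `call_model` and `run_tool` are hypothetical callbacks, with thinking assumed to happen inside `call_model`:

```python
def agent_loop(call_model, run_tool, goal, max_steps=6):
    """Reasoning-agent skeleton: each generation ends in either a tool
    call or a final answer; tool results are appended to the transcript
    and fed to the next generation."""
    transcript = [("user", goal)]
    for _ in range(max_steps):
        kind, payload = call_model(transcript)
        if kind == "answer":
            return payload
        name, args = payload            # kind == "tool"
        transcript.append(("tool_result", run_tool(name, args)))
    return None  # step budget exhausted without a final answer

# Toy model: calls run_tests once, then answers on seeing the result.
def toy_model(transcript):
    if any(role == "tool_result" for role, _ in transcript):
        return ("answer", "tests pass")
    return ("tool", ("run_tests", {}))

final = agent_loop(toy_model, lambda name, args: "ok", "fix CI")
```

The `max_steps` cap matters precisely because of the context-accumulation problem described above: each step may add tens of thousands of reasoning tokens.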

09

What to Take Away

Where to next

Deck 02 examines the reward machinery in depth — PRMs vs ORMs, how MCTS-guided decoding works, and when verifier-based sampling beats self-consistency. Deck 03 addresses the deployment question: when do you reach for a reasoning model vs a frontier chat model plus a scaffold?