o1, o3, DeepSeek-R1 — scaling inference compute rather than parameters, GRPO reinforcement learning, long thinking traces, and where the eval lift actually lands.
From 2017 to 2023, the dominant scaling law was simple: bigger pre-training runs → smarter models. GPT-4 used roughly 10× the compute of GPT-3.5. Then the curve flattened — not because the law broke, but because frontier training runs hit tens of millions of GPU-hours and ran up against data scarcity.
OpenAI's September 2024 release of o1-preview introduced a second axis: how much compute you spend at inference time. The model is trained to think before answering — emitting a long private chain-of-thought (“thinking trace”) before producing its visible reply. The trace is hidden from the user but consumes tokens.
Pre-training scaling (the old axis): quality scales with training FLOPs. Larger model, more data, more tokens seen. Inference is cheap: one to a few sampled completions per user turn. RLHF aligns tone, not reasoning.
Test-time scaling (the new axis): quality scales with inference tokens. The same parameter count produces better answers by spending 100× more tokens on private reasoning. The RL reward is correctness of the final answer, not human preference.
Correctness on maths and coding is verifiable. A reward signal can be computed without human raters — just check the answer or run the tests. That makes large-scale RL tractable. Models trained this way learn to self-correct mid-trace, explore dead ends, and backtrack, mimicking what a human expert does on paper.
Snell et al. (Google DeepMind, 2024) — “Scaling LLM Test-Time Compute Optimally” — showed that for a fixed total FLOP budget, reallocating compute from training to inference can match or beat a larger model on hard problems. The crossover depends on problem difficulty: easy problems don't benefit, hard ones benefit enormously.
Sequential refinement (used by o1-style models): the model elaborates its answer in a single stream, revising mid-trace. Compute scales linearly with trace length but benefits compound non-linearly.
Parallel sampling / best-of-N: sample K independent answers, pick the best via a verifier or majority vote. Embarrassingly parallel; benefits plateau around N≈64 for most tasks.
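A minimal sketch of the parallel variant with majority vote as the verifier; the model name, temperature, and the "ANSWER:" extraction convention are placeholder choices, not anything prescribed by the papers above.

```python
from collections import Counter
import openai

client = openai.OpenAI()

def best_of_n(prompt: str, n: int = 16, model: str = "gpt-4o-mini") -> str:
    """Sample n independent completions and return the majority-vote answer."""
    resp = client.chat.completions.create(
        model=model,
        n=n,                      # n independent samples in one request
        temperature=0.8,          # >0 so the samples actually differ
        messages=[{"role": "user",
                   "content": prompt + "\nFinish with a line 'ANSWER: <value>'."}],
    )
    answers = []
    for choice in resp.choices:
        text = choice.message.content or ""
        if "ANSWER:" in text:
            answers.append(text.rsplit("ANSWER:", 1)[-1].strip())
    # Majority vote = self-consistency; a trained verifier could rank instead.
    return Counter(answers).most_common(1)[0][0] if answers else ""
```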
OpenAI has published very little about o1's internals. The September 2024 system card and the December 2024 o1 technical report are the primary sources. Everything else is inference from behaviour and leaks.
| Fact / datum | Source | Confidence |
|---|---|---|
| Trained with large-scale RL; reward is answer correctness | o1 system card, 2024-09 | confirmed |
| Thinking trace is hidden from users but counted in context | API docs — reasoning_tokens field | confirmed |
| reasoning_effort param: low / medium / high | OpenAI API reference, 2025-01 | confirmed |
| o3 uses Monte-Carlo Tree Search over reasoning steps | Bloomberg / The Information, 2024-12 | unconfirmed |
| o1 is a fine-tuned GPT-4o base, not a new architecture | widespread inference; no official statement | unconfirmed |
| o3 ARC-AGI score: 87.5% (high-compute) vs 75.7% (low) | OpenAI ARC-AGI eval post, 2024-12-20 | confirmed |
| o3-mini matches o1 on AIME 2024 at ¼ the cost | OpenAI technical report, 2025-01 | confirmed |
The Chat Completions response for o-series models includes a usage.completion_tokens_details object with a reasoning_tokens integer. On a hard AIME problem this can be 8,000–30,000 tokens. The visible content reply might be 200 tokens. You pay for both at the same per-token rate.
```python
import openai

client = openai.OpenAI()
resp = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",          # low | medium | high
    max_completion_tokens=16000,
    messages=[{"role": "user", "content": "Solve AIME 2024 II Problem 15."}],
)
usage = resp.usage
print(usage.prompt_tokens,
      usage.completion_tokens,
      usage.completion_tokens_details.reasoning_tokens)
# e.g. 42 14830 14620 (most tokens were thinking)
```
DeepSeek-R1 (January 2025) is the first high-quality open-weight reasoning model. Its technical report is unusually detailed. The model family sits on top of DeepSeek-V3 (671B MoE, 37B active parameters per token), and the reasoning capability is RL-induced — the R1-Zero variant uses no supervised long-CoT imitation at all.
Standard PPO requires a value (critic) network, doubling memory. GRPO (DeepSeek-Math, 2024) eliminates the critic by estimating the baseline from a group of K sampled responses to the same prompt:
For a group of K outputs o_1…o_K with rewards r_1…r_K, the advantage of output i is A_i = (r_i − mean(r)) / std(r). The policy update is then clipped PPO-style. No critic; the baseline comes from within-group statistics.
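A minimal sketch of that baseline computation and the clipped update, assuming per-response log-probs; the real objective in the paper also adds a KL penalty to a reference policy, omitted here.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantage for K responses to one prompt: (r_i - mean) / std."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_loss(logp_new: torch.Tensor,
              logp_old: torch.Tensor,
              advantages: torch.Tensor,
              clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate, with group statistics replacing a learned critic."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# e.g. 2 of 4 sampled answers were correct:
adv = grpo_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0]))  # ≈ [ 0.87, -0.87, -0.87, 0.87]
```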
DeepSeek-R1-Zero: GRPO applied directly to the DeepSeek-V3 base model, with no SFT cold-start. Emergent self-verification and aha-moment behaviours appear at ≈8k RL steps. Traces are readable but sometimes switch languages mid-trace.
DeepSeek-R1: a small set (∼thousands) of human-curated long-CoT demonstrations bootstraps the policy before GRPO. Language stays consistent; readability improves. Published weights: 1.5B, 7B, 8B, 14B, 32B, and 70B distillations from the 671B model.
Format reward: trace must contain <think>...</think> tags; answer must be boxed in \boxed{} for maths or wrapped in fenced code for coding. Pure rule-based. Correctness reward: exact match or unit-test pass. No neural judge for the primary RL phase.
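A sketch of what such a rule-based reward can look like for a maths prompt; the regexes, the 0/1 weighting, and the helper names are illustrative, not DeepSeek's exact implementation.

```python
import re

def format_reward(response: str) -> float:
    """1.0 if the trace sits in <think>...</think> and the final answer is boxed."""
    has_think = bool(re.search(r"<think>.*?</think>", response, re.DOTALL))
    answer_part = response.split("</think>")[-1]
    return 1.0 if has_think and "\\boxed{" in answer_part else 0.0

def correctness_reward(response: str, gold: str) -> float:
    """1.0 on exact match of the boxed answer against the reference; no neural judge."""
    m = re.search(r"\\boxed\{([^}]*)\}", response.split("</think>")[-1])
    return 1.0 if m and m.group(1).strip() == gold.strip() else 0.0

def reward(response: str, gold: str) -> float:
    return format_reward(response) + correctness_reward(response, gold)
```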
The thinking trace is the model's scratchpad. It is generated autoregressively before the final answer, typically enclosed in <think>…</think> tags (R1 convention) or as hidden reasoning_tokens (o-series). The model learns to use this space for hypothesis generation, arithmetic, checking sub-results, and recovering from errors.
| Model | Trace visibility | Max thinking tokens | How budget is set |
|---|---|---|---|
| o1-mini | Hidden (reasoning_tokens) | ≈32k | reasoning_effort: low/med/high maps to ~1k/4k/32k |
| o3 (API) | Hidden | ≈100k | reasoning_effort or max_completion_tokens |
| DeepSeek-R1 | Visible in <think> block | 32k (context window) | max_tokens; model self-terminates </think> |
| Claude 3.7 Sonnet (extended thinking) | Visible in thinking content block | Up to 128k | thinking.budget_tokens (1,024–128,000) |
| Qwen QwQ-32B | Visible; <think> tags | 32k | max_tokens on inference |
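For the visible-trace models in the table (R1, QwQ), separating the scratchpad from the answer is straightforward; a sketch assuming the <think> tag convention:

```python
import re

def split_trace(completion: str) -> tuple[str, str]:
    """Split an R1/QwQ-style completion into (thinking_trace, visible_answer)."""
    m = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    if not m:
        return "", completion.strip()       # no trace emitted (or budget cut it off)
    return m.group(1).strip(), completion[m.end():].strip()
```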
Forcing a very short budget can hurt more than greedy decoding — the model learns to meta-plan its trace, so cutting it short mid-thought produces worse answers than no trace at all. The safe minimum for hard maths is ≈2,000 thinking tokens; for typical coding tasks ≈500.
```python
import anthropic

client = anthropic.Anthropic()
resp = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[{"role": "user",
               "content": "Prove that sqrt(2) is irrational."}],
)
for block in resp.content:
    if block.type == "thinking":
        print("trace:", block.thinking[:200])
    elif block.type == "text":
        print("answer:", block.text)
```
Reasoning tokens are priced identically to output tokens. A single hard AIME problem with o3 at high effort can consume 20,000–60,000 reasoning tokens. At $40/M output tokens (o3 pricing in the table below) that is $0.80–$2.40 per problem. For a 30-question AIME set: $24–$72. Batch API discounts (50%) bring this to roughly $12–$36.
| Model | Input $/M tok | Output $/M tok | Avg reasoning tok (hard maths) | Cost / hard problem |
|---|---|---|---|---|
| o3-mini (medium effort) | $1.10 | $4.40 | ≈4,000 | ≈$0.018 |
| o3-mini (high effort) | $1.10 | $4.40 | ≈16,000 | ≈$0.070 |
| o3 (high effort) | $10 | $40 | ≈30,000 | ≈$1.20 |
| DeepSeek-R1 (API) | $0.55 | $2.19 | ≈8,000 | ≈$0.018 |
| Claude 3.7 Sonnet (10k budget) | $3 | $15 | ≈8,000 | ≈$0.12 |
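A quick way to sanity-check the last column, treating the table's prices and reasoning-token counts as inputs; the visible and prompt token counts below are assumed round numbers.

```python
def cost_per_problem(output_price_per_m: float,
                     input_price_per_m: float,
                     reasoning_tokens: int,
                     visible_tokens: int = 200,
                     prompt_tokens: int = 100) -> float:
    """Reasoning tokens are billed as output tokens, so they dominate the bill."""
    output_cost = (reasoning_tokens + visible_tokens) * output_price_per_m / 1e6
    input_cost = prompt_tokens * input_price_per_m / 1e6
    return output_cost + input_cost

# o3 at high effort, ~30k reasoning tokens -> ≈ $1.21 per hard problem
print(round(cost_per_problem(40.0, 10.0, 30_000), 2))
```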
Reasoning tokens are generated sequentially. At 60 tok/s output throughput, 30,000 thinking tokens ≈ 8 minutes time-to-first-visible-token. This makes o3 at high effort unsuitable for interactive pipelines. o3-mini at medium effort ≈ 60–90 seconds. Streaming helps UX but doesn't reduce total latency.
The gains from reasoning models are highly domain-specific. AIME and competitive programming see dramatic lifts; conversational tasks see flat or negative results due to latency and over-thinking.
| Benchmark | GPT-4o (greedy) | o1 | o3-mini (high) | DeepSeek-R1 |
|---|---|---|---|---|
| AIME 2024 (pass@1) | 9.3% | 74.4% | 87.3% | 72.6% |
| MATH-500 | 76.6% | 94.8% | 97.2% | 97.3% |
| SWE-bench Verified | 38.8% | 48.9% | 49.3% | 49.2% |
| Codeforces Percentile | 11th | 89th | 93rd | 96th |
| GPQA Diamond | 53.6% | 77.3% | 79.7% | 71.5% |
| MT-Bench (chat quality) | 9.0 | 8.7 | 8.5 | 7.9 |
Reasoning models are RL-trained on verifiable rewards — which means no reward signal for tone, style, concision, or helpfulness in open-ended conversation. The model can over-plan, produce terse answers, or ignore formatting preferences. For creative writing or simple QA, a well-tuned chat model consistently wins.
When a reasoning model is the planner in a multi-step agent, the interaction pattern changes significantly. The model's long thinking trace can incorporate tool results, plan sub-tasks, and revise the plan — all within a single generation. This collapses several turns of a classic ReAct loop into one.
A typical single-generation plan: read_file → run_tests → edit_file → run_tests.
Context accumulation: reasoning tokens accumulate in the conversation window. A 128k-context model with 20k thinking tokens per turn exhausts context in ∼6 agent steps, before tool results are even counted. Summarisation or sliding-window strategies are essential; a sketch follows below.
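A sketch of the sliding-window idea, assuming R1-style visible <think> tags in the stored assistant messages; the function name and keep_last policy are ours.

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def trim_history(messages: list[dict], keep_last: int = 1) -> list[dict]:
    """Drop <think> spans from all but the last `keep_last` assistant turns.

    Visible answers and tool results stay intact; only the old scratchpads,
    which occupy most of the window, are removed before the next call.
    """
    assistant_idx = [i for i, m in enumerate(messages) if m["role"] == "assistant"]
    keep = set(assistant_idx[-keep_last:]) if keep_last else set()
    trimmed = []
    for i, m in enumerate(messages):
        if m["role"] == "assistant" and i not in keep and isinstance(m.get("content"), str):
            m = {**m, "content": THINK_RE.sub("", m["content"])}
        trimmed.append(m)
    return trimmed
```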
Interleaved thinking + tools: Claude 3.7 Sonnet and R1 support thinking blocks between tool calls in the same turn. The model can think → call tool → think more → call tool → answer, all in one generation when the API permits it.
Budget control: default to reasoning_effort: low or set a budget_tokens cap in production.
Deck 02 examines the reward machinery in depth — PRMs vs ORMs, how MCTS-guided decoding works, and when verifier-based sampling beats self-consistency. Deck 03 addresses the deployment question: when do you reach for a reasoning model vs a frontier chat model plus a scaffold?