DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Authors: DeepSeek-AI (research@deepseek.com)
Published: January 2025 | arXiv:2501.12948 | Nature, vol. 645, pp. 633-638 (2025)
Source: arxiv.org/abs/2501.12948
The Big Idea
Can an LLM learn to reason without ever being shown how humans reason? DeepSeek-R1 proves the answer is yes. By applying pure reinforcement learning to a base model with nothing but outcome-based reward signals, the model spontaneously develops sophisticated reasoning behaviors — self-reflection, verification, backtracking, and dynamic strategy adaptation — that were never explicitly taught. This is not fine-tuning on human chain-of-thought examples. The model discovers its own reasoning patterns from scratch.
Why This Paper Matters
This paper fundamentally reframes how we think about reasoning in LLMs. The prevailing approach (SFT on human-annotated reasoning traces) has two critical limitations: it caps the model’s reasoning at human-level patterns, and it requires expensive annotation. DeepSeek-R1 shows that RL with verifiable rewards can bypass both constraints. The model develops reasoning strategies that are often alien to human approaches but demonstrably effective — and it does so at a fraction of the cost of conventional methods.
The total training cost for DeepSeek-R1 was approximately $294K in GPU hours. For context, this produced a model that matches OpenAI o1-1217 on math and code benchmarks and reached #1 on ChatBot Arena’s style-controlled leaderboard alongside o1 and Gemini-Exp-1206.
Technical Architecture
Base Model
DeepSeek-R1 builds on DeepSeek-V3-Base, a 671B parameter Mixture-of-Experts (MoE) model with 37B active parameters per token. V3 features Multi-head Latent Attention (MLA), auxiliary-loss-free load balancing, and Multi-Token Prediction (MTP).
Two Models, One Story
The paper presents two models that tell a sequential story:
DeepSeek-R1-Zero — the pure RL experiment. No supervised fine-tuning whatsoever. The base model is trained using GRPO with only accuracy and format rewards. This model proves the concept: reasoning emerges from reward signals alone.
DeepSeek-R1 — the production model. A multi-stage pipeline that starts from R1-Zero’s insights but adds cold-start SFT data, rejection sampling, and a second RL stage to fix R1-Zero’s practical issues (language mixing, poor readability) while preserving its reasoning power.
GRPO: The RL Algorithm
Group Relative Policy Optimization (GRPO) is the key enabler. Unlike PPO, which requires training a separate value model (often as large as the policy model itself), GRPO estimates advantages using Monte Carlo rollouts within groups:
- For each question q, sample G outputs from the current policy
- Score each output with the reward function
- Compute advantage as: A_i = (r_i - mean(r_1…r_G)) / std(r_1…r_G)
- Update the policy using clipped surrogate objective with KL penalty
This eliminates the value model entirely, cutting memory and compute requirements roughly in half compared to PPO. GRPO also periodically updates the reference policy (every 400 steps), allowing the trained policy to diverge significantly — critical for long chain-of-thought training where the model’s behavior shifts dramatically from initialization.
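The group-relative advantage step above is simple enough to show directly. Here is a minimal sketch (function name and the epsilon smoothing term are illustrative, not from the paper):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Compute GRPO advantages for one group of G sampled outputs.

    Each output's advantage is its reward normalized by the group's own
    mean and standard deviation, so no learned value model is needed.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: a group of G=4 rollouts for one question with binary
# accuracy rewards. Correct answers receive positive advantage,
# incorrect ones negative, and the advantages sum to zero.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline is the group mean rather than a value-model estimate, an answer is only reinforced relative to its siblings for the same question, which is what makes the binary rewards informative.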
Reward Design
R1-Zero (Pure RL)
Two reward types, deliberately kept simple:
- Accuracy rewards: Binary (correct/incorrect). Math answers checked against ground truth. Code solutions executed against test cases. No partial credit.
- Format rewards: Enforces the `<think>...</think>` and `<answer>...</answer>` structure. Nothing more.
Crucially, they reject neural reward models for reasoning tasks. Their observation: neural reward models are susceptible to reward hacking during large-scale RL, and retraining them adds complexity. Rule-based verifiable rewards are harder to game.
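A rule-based reward of this kind can be sketched in a few lines. The regex, the exact-match check, and the reward weights below are illustrative; real verifiers do math-expression equivalence checking or execute code against test cases, but the binary no-partial-credit principle is the same:

```python
import re

# Require the full <think>...</think><answer>...</answer> structure.
THINK_ANSWER = re.compile(
    r"^<think>.+</think>\s*<answer>.+</answer>\s*$", re.DOTALL
)

def rule_based_reward(output: str, ground_truth: str) -> float:
    """Sketch of an R1-Zero-style reward: format check plus binary accuracy.

    Weights (1.0 for accuracy, 0.5 for format) are illustrative.
    """
    format_ok = bool(THINK_ANSWER.match(output.strip()))
    answer = ""
    m = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    if m:
        answer = m.group(1).strip()
    accuracy = 1.0 if answer == ground_truth.strip() else 0.0
    return accuracy + (0.5 if format_ok else 0.0)
```

Because both signals are deterministic functions of the output, there is no reward model to drift or be exploited, which is exactly the robustness argument made above.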
R1 (Production)
Adds two more signals:
- Model-based rewards: A helpful reward model (trained on 66K preference pairs) and a safety reward model (trained on 106K safe/unsafe annotations) for general non-reasoning tasks.
- Language consistency reward: Ratio of target-language words to total words in the CoT. Fixes the language-mixing problem where R1-Zero would switch between English and Chinese mid-reasoning.
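The word-ratio signal is straightforward to compute. A minimal sketch, assuming the caller supplies a language-ID predicate (the crude ASCII check used in the example stands in for a real language detector):

```python
def language_consistency_reward(cot_tokens, is_target_language):
    """Fraction of CoT words in the target language, as described above.

    `is_target_language` is a caller-supplied predicate; the paper's
    actual tokenization and language check may differ.
    """
    if not cot_tokens:
        return 0.0
    hits = sum(1 for tok in cot_tokens if is_target_language(tok))
    return hits / len(cot_tokens)

# Example: English as target language. The ASCII check is a toy
# stand-in for a real language-ID function.
is_english = lambda w: all(ord(c) < 128 for c in w)
score = language_consistency_reward(
    ["wait", "let", "me", "check", "等一下"], is_english
)  # 4 of 5 words are English
```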
The “Aha Moment”
The paper’s most striking finding: during R1-Zero training, the model exhibits an “aha moment” around step 8,200. The model spontaneously begins using phrases like “Wait, wait. Wait. That’s an aha moment. I can flag here.” and starts re-evaluating its own reasoning mid-stream. This self-reflective behavior was never demonstrated in any training data — it emerged purely from the reward signal.
Quantitatively:
- Reflective words (“wait”, “mistake”, “verify”, “wrong”, “check”) increase 5-7x over the course of training
- The word “wait” specifically shows near-zero usage until step 4,000, occasional use between 4,000-7,000, then dramatic spikes after step 8,000
- Response length grows steadily throughout training (from ~500 to ~15,000 tokens), indicating the model autonomously learns to “think longer” on harder problems
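The reflective-word trend above comes down to simple frequency counting over training transcripts. A sketch of that measurement (the word list is taken from the examples above; the paper's exact methodology may differ):

```python
from collections import Counter
import re

# Reflective vocabulary tracked across training, per the examples above.
REFLECTIVE = {"wait", "mistake", "verify", "wrong", "check"}

def reflective_word_counts(transcript: str) -> Counter:
    """Count reflective-word occurrences in one reasoning transcript.

    Applying this per checkpoint and plotting the totals yields the
    kind of frequency curve used to quantify the 'aha moment'.
    """
    words = re.findall(r"[a-z]+", transcript.lower())
    return Counter(w for w in words if w in REFLECTIVE)
```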
The Multi-Stage Pipeline (DeepSeek-R1)
R1-Zero proves the concept but has practical problems: poor readability, language mixing, and narrow focus on reasoning tasks. The production R1 model uses a four-stage pipeline:
Stage 1: Cold-Start SFT
Collect thousands of long CoT examples in a conversational, first-person style. Human annotators convert R1-Zero outputs into natural-sounding reasoning traces. These become few-shot examples for DeepSeek-V3 to generate more training data at scale.
Stage 2: First RL Stage
Train on reasoning prompts (math, code, STEM, logic) using rule-based rewards plus language consistency reward. 10,400 steps total. Key hyperparameters: learning rate 3e-6, KL coefficient 0.001, GRPO clip ratio 10, 16 samples per question, max length 32,768 tokens.
Stage 3: Rejection Sampling + SFT
Sample reasoning trajectories from the Stage 2 checkpoint. Filter for correct answers and readable format. Combine ~600K reasoning samples with ~200K non-reasoning samples (writing, factual QA, translation). Fine-tune for 2-3 epochs on this 800K-sample dataset.
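The filtering step can be sketched as below. The helper names and the readability predicate are illustrative (the actual pipeline's readability criteria, e.g. for language mixing and tag structure, are not fully specified in this summary):

```python
def rejection_sample(candidates, ground_truth, is_readable):
    """Stage 3 filter sketch: keep only trajectories whose final answer
    matches the ground truth and whose reasoning passes a readability
    check. `candidates` is a list of (cot, answer) pairs.
    """
    kept = []
    for cot, answer in candidates:
        if answer.strip() == ground_truth.strip() and is_readable(cot):
            kept.append((cot, answer))
    return kept

# Example: a toy readability check that rejects language-mixed text.
ascii_only = lambda text: all(ord(c) < 128 for c in text)
```

Run over the Stage 2 checkpoint's samples, a filter like this is what distills the raw RL outputs into the ~600K clean reasoning examples.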
Stage 4: Second RL Stage
Final alignment using mixed reward signals (rule-based for reasoning, model-based for general helpfulness/harmlessness). Only 1,700 steps at reduced temperature (0.7). Preference-based rewards limited to the final 400 steps to avoid reward hacking.
Benchmark Results
vs. Frontier Models (Table 8)
| Benchmark | DeepSeek-R1 | OpenAI o1-1217 | Claude 3.5 Sonnet | GPT-4o |
|---|---|---|---|---|
| AIME 2024 (Pass@1) | 79.8 | 79.2 | 16.0 | 9.3 |
| MATH-500 (Pass@1) | 97.3 | 96.4 | 78.3 | 74.6 |
| Codeforces (Percentile) | 96.3 | 96.6 | 20.3 | 23.6 |
| Codeforces (Rating) | 2029 | 2061 | 717 | 759 |
| MMLU (EM) | 90.8 | 91.8 | 88.3 | 87.2 |
| LiveCodeBench (Pass@1) | 65.9 | 63.4 | 38.9 | 32.9 |
| SWE-Bench Verified | 49.2 | 48.9 | 50.8 | 38.8 |
| AlpacaEval 2.0 | 87.6 | - | 52.0 | 51.1 |
| ArenaHard | 92.3 | - | 85.2 | 80.4 |
R1 matches or exceeds o1-1217 on math and code. It surpasses human competitors on AIME (79.8% vs. 50% average) and outperforms 96.3% of human participants on Codeforces.
vs. Humans (Figure 10)
- AIME 2024: R1 scores 79.8%, R1-Zero scores 77.9%, average human competitor scores ~50%
- Codeforces: R1 reaches 96.3rd percentile
- GPQA Diamond: Humans with Ph.D.-level qualifications and web access still beat R1 (81.2% vs. 75.8%)
Distillation: Reasoning for Smaller Models
One of R1’s most impactful contributions: the reasoning capability transfers to much smaller models through distillation. Using the 800K SFT dataset generated from R1’s reasoning trajectories:
| Distilled Model | Base | AIME 2024 | MATH-500 | LiveCodeBench |
|---|---|---|---|---|
| R1-Distill-Qwen-1.5B | Qwen2.5-Math-1.5B | 28.9 | 83.9 | 16.9 |
| R1-Distill-Qwen-7B | Qwen2.5-Math-7B | 55.5 | 92.8 | 37.6 |
| R1-Distill-Qwen-14B | Qwen2.5-14B | 69.7 | 93.9 | 53.1 |
| R1-Distill-Qwen-32B | Qwen2.5-32B | 72.6 | 94.3 | 57.2 |
| R1-Distill-Llama-8B | Llama-3.1-8B | 50.4 | 89.1 | 39.6 |
| R1-Distill-Llama-70B | Llama-3.3-70B-Instruct | 70.0 | 94.5 | 57.5 |
The 14B distilled model outperforms QwQ-32B-Preview on most benchmarks. The 32B and 70B distilled models outperform OpenAI o1-mini. These are open-weight models under the MIT License.
Training Cost
| Component | H800 GPU Hours | Cost ($2/GPU-hr) |
|---|---|---|
| R1-Zero | 101K | $202K |
| SFT data creation | 5K | $10K |
| R1 training | 41K | $82K |
| Total | 147K | $294K |
This is remarkably cheap for a frontier reasoning model. For comparison, DeepSeek-V3’s pre-training alone cost $5.576M.
Limitations (Acknowledged)
- Language mixing: Still occurs in non-English/Chinese languages. The V3 base model was primarily trained on English and Chinese.
- Prompt sensitivity: Few-shot prompting degrades performance. Zero-shot with explicit format instructions works best.
- Software engineering: Limited improvement over V3 on SWE benchmarks. RL hasn’t been extensively applied to engineering tasks yet.
- Reward hacking: Observed when model-based rewards are used too long — reward scores increase while actual task performance degrades (Figure 6). Mitigated by limiting preference-reward training to 400 steps.
- Overthinking: The model sometimes generates excessive reasoning for simple problems. Token efficiency remains suboptimal.
Key Takeaways
- Reasoning can emerge from pure RL. You don't need to show a model how to reason. Given the right reward structure (verifiable, rule-based) and sufficient compute, sophisticated reasoning behaviors — including self-reflection and error correction — emerge spontaneously.
- GRPO > PPO for reasoning. By eliminating the value model and using group-relative advantages, GRPO is both cheaper and better suited to the long, variable-length outputs of reasoning tasks.
- Verifiable rewards beat neural rewards for reasoning. Neural reward models introduce reward hacking risk. Binary correctness signals from compilers and answer checkers are robust and scalable.
- SFT may constrain reasoning. Human-authored reasoning traces can limit the model's exploration of novel strategies. Pure RL allows the model to discover reasoning patterns humans wouldn't use.
- Reasoning capability transfers via distillation. The 800K-sample dataset from R1 makes even 7B models strong reasoners, democratizing access to reasoning capabilities.
- The "aha moment" is real and measurable. Emergent self-reflection during RL training represents a qualitative phase transition in model behavior, not a gradual improvement.
Summary prepared by Darrell Thomas, 2026