2022 Summary

Training Compute-Optimal Large Language Models (Chinchilla)

Darrell Thomas

Authors: Jordan Hoffmann*, Sebastian Borgeaud*, Arthur Mensch*, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, Laurent Sifre*
Affiliation: DeepMind
Published: March 2022 | arXiv:2203.15556
Source: arxiv.org/abs/2203.15556


The Big Idea

Everyone was building bigger models. DeepMind asked a simpler question: what if the models are the wrong size?

By training over 400 language models ranging from 70M to 16B parameters on 5B to 500B tokens, they discovered a straightforward rule: for every doubling of your compute budget, you should double both the model size and the training data equally. The entire industry had been ignoring half of this equation — building huge models but starving them of data.

To prove it, they trained Chinchilla: a 70B parameter model on 1.4 trillion tokens, using essentially the same compute budget as Gopher (280B parameters, 300B tokens). Chinchilla is 4x smaller but trained on roughly 4.7x more data. The result? Chinchilla beat Gopher on virtually every benchmark, including a 7.6 percentage point improvement on MMLU. It also beat GPT-3 (175B) and Megatron-Turing NLG (530B) — models 2.5x to 7.5x its size.

The punchline: the biggest models in AI were not too small. They were too big and undertrained.


Why This Paper Matters

The Mistake Everyone Was Making

Before Chinchilla, the AI industry was in an arms race to build the biggest model possible. The playbook came from Kaplan et al. (2020), OpenAI’s scaling laws paper, which concluded: when you get more compute, spend most of it on making the model bigger and only a little on more data. Specifically, they said a 10x increase in compute should go to a 5.5x bigger model but only 1.8x more data.
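Kaplan et al.'s split can be sanity-checked against their fitted exponents (a ≈ 0.73 for parameters, b ≈ 0.27 for data, quoted later in this summary): 10^0.73 ≈ 5.4 and 10^0.27 ≈ 1.9, which matches the 5.5x/1.8x rule of thumb up to rounding. A quick check:

```python
# Sanity-check Kaplan et al.'s allocation rule from their scaling exponents
# (a for parameters, b for data), as quoted in this summary.
a_kaplan, b_kaplan = 0.73, 0.27

compute_increase = 10.0
model_scale = compute_increase ** a_kaplan   # how much bigger the model gets
data_scale = compute_increase ** b_kaplan    # how much more data it sees

print(f"10x compute -> {model_scale:.1f}x model, {data_scale:.1f}x data")
# Because a + b = 1, the two factors multiply back to the full compute increase.
print(f"check: {model_scale * data_scale:.1f}x total")
```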

As a result, nearly every major model in 2020-2022 was trained on roughly 300 billion tokens regardless of size:

| Model | Parameters | Training Tokens |
|---|---|---|
| LaMDA | 137B | 168B |
| GPT-3 | 175B | 300B |
| Jurassic-1 | 178B | 300B |
| Gopher | 280B | 300B |
| MT-NLG 530B | 530B | 270B |

530 billion parameters trained on 270 billion tokens. That’s like building a Formula 1 engine and putting it in a car with a 2-gallon fuel tank.

What Chinchilla Proved

DeepMind’s finding was that Kaplan et al. had a methodological flaw: they used a fixed learning rate schedule and fixed number of training tokens, which biased their analysis toward larger models. When you correct for this — varying both model size and training duration properly — the optimal scaling exponents are almost exactly equal: a ≈ 0.50 for parameters, b ≈ 0.50 for data.

Equal scaling. That’s the whole result. And it meant every large model in existence was dramatically oversized for its data budget.

Why It Matters Beyond Benchmarks

The practical implications were enormous:

  1. Inference cost drops immediately. A 70B model is 4x cheaper to run than a 280B model at every single forward pass. For anyone deploying LLMs in production, this isn’t a marginal improvement — it’s the difference between viable and not viable. Chinchilla serves the same quality at a quarter of the memory, a quarter of the GPU cost, and roughly 4x the throughput.

  2. Fine-tuning becomes accessible. You can fine-tune a 70B model on hardware that can’t even load a 280B model. This democratized adaptation — suddenly research labs and startups could work with frontier-quality models without needing a supercomputer cluster.

  3. The bottleneck shifted to data. This paper told the industry: stop obsessing over parameter counts and start obsessing over training data. High-quality, deduplicated, diverse datasets became the new competitive moat. The compute-optimal frontier for a 175B model requires 3.7 trillion tokens. For a 1T model, you need 21.2 trillion. This is far beyond what existed at the time, and it kicked off the data-centric AI movement.

  4. It rewrote the scaling roadmap. Every lab recalculated. Llama (65B, 1.4T tokens), Llama 2 (70B, 2T tokens), and most subsequent open models followed Chinchilla-optimal or over-trained ratios. The “Chinchilla optimal” became standard vocabulary in ML engineering. The paper didn’t just improve one model — it changed how an entire industry allocates billions of dollars of compute.


Technical Method

The Core Question

Given a fixed FLOPs budget C, what is the optimal split between model parameters N and training tokens D to minimize loss?

The authors model this as: N_opt(C), D_opt(C) = argmin_{N, D} L(N, D) subject to FLOPs(N, D) = C.

They use three independent approaches to estimate the answer, and all three converge on the same conclusion.

Approach 1: Fix Model Sizes, Vary Training Duration

Train a family of models (70M to 10B parameters), each at 4 different training durations. For each FLOP count, identify which (model size, tokens) combination achieves the lowest loss. Fit power laws to the optimal frontier.

Result: a = 0.50, b = 0.50
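The fitting step can be sketched in a few lines: a power law N_opt = k · C^a is a straight line in log-log space, so the exponent is just the slope of a linear fit on the logs. The frontier points below are illustrative values generated to follow N_opt ∝ C^0.5, not the paper's measurements:

```python
import numpy as np

# Illustrative (compute, optimal model size) frontier points following
# N_opt ∝ C^0.5 — placeholder values, not the paper's data.
flops = np.array([1e18, 1e19, 1e20, 1e21, 1e22])
n_opt = np.array([2.9e6, 9.2e6, 2.9e7, 9.2e7, 2.9e8])

# A power law is linear in log-log space; the slope is the exponent a.
a, log_k = np.polyfit(np.log(flops), np.log(n_opt), deg=1)
print(f"fitted exponent a ≈ {a:.2f}")
```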

Approach 2: IsoFLOP Profiles

For 9 fixed FLOP budgets (6 × 10^18 to 3 × 10^21 FLOPs), vary the model size (up to 16B) and measure final loss. Each FLOP budget shows a clear U-shaped curve with a single optimal model size. Fit power laws to these optima.

Result: a = 0.49, b = 0.51
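Extracting the optimum from one isoFLOP profile can be sketched by fitting a parabola to loss as a function of log model size and reading off the vertex. The losses below are illustrative, not the paper's measurements:

```python
import numpy as np

# One illustrative isoFLOP profile (placeholder values, not the paper's data):
# final loss for several model sizes trained under the same FLOP budget.
model_sizes = np.array([0.25e9, 0.5e9, 1e9, 2e9, 4e9])   # parameters
losses = np.array([2.45, 2.38, 2.36, 2.38, 2.45])        # U-shape in log(N)

# Fit a parabola in log(model size); its vertex estimates N_opt for this budget.
c2, c1, c0 = np.polyfit(np.log(model_sizes), losses, deg=2)
n_opt = np.exp(-c1 / (2 * c2))
print(f"estimated optimal size ≈ {n_opt / 1e9:.1f}B parameters")
```

Repeating this for each of the 9 budgets yields the (C, N_opt) pairs that the power-law fit runs over.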

Approach 3: Parametric Loss Function

Fit a closed-form parametric function to all observed losses:

L_hat(N, D) = E + A/N^alpha + B/D^beta

Where E is the entropy of natural text (the irreducible loss), A/N^alpha captures finite-model error, and B/D^beta captures finite-data error. Minimize this function analytically to derive the efficient frontier.

Result: a = 0.46, b = 0.54

Consensus

All three approaches agree: parameters and data should scale roughly equally. This directly contradicts Kaplan et al. (2020), who found a = 0.73 and b = 0.27 — heavily favoring bigger models over more data. The discrepancy arises because Kaplan et al. used a fixed cosine schedule length and did not properly account for the interaction between learning rate schedule and training duration.


The Chinchilla Model

To validate the theory, DeepMind trained Chinchilla with the same total FLOPs as Gopher but with a radically different allocation:

| | Gopher | Chinchilla |
|---|---|---|
| Parameters | 280B | 70B |
| Training tokens | 300B | 1.4T |
| Layers | 80 | 80 |
| d_model | 16,384 | 8,192 |
| Heads | 128 | 64 |
| Optimizer | Adam | AdamW |
| Precision | bfloat16 | bfloat16 |
| Hardware | TPUv3/v4 | TPUv3/v4 |

Same compute. 4x smaller model. 4.7x more data. Trained on MassiveText with a modified SentencePiece tokenizer.
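The "same compute" claim can be checked with the standard approximation C ≈ 6ND (six FLOPs per parameter per token); with the rounded headline numbers the two budgets land within about 17% of each other, i.e. the same compute class:

```python
def train_flops(n_params, n_tokens):
    """Approximate training compute via the common rule C ≈ 6 * N * D."""
    return 6 * n_params * n_tokens

gopher = train_flops(280e9, 300e9)       # 280B params, 300B tokens
chinchilla = train_flops(70e9, 1.4e12)   # 70B params, 1.4T tokens

print(f"Gopher:     {gopher:.2e} FLOPs")
print(f"Chinchilla: {chinchilla:.2e} FLOPs")
print(f"ratio: {chinchilla / gopher:.2f}")
```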


Key Results

Language Modeling (The Pile)

Chinchilla outperforms Gopher on all 20 evaluation subsets of The Pile. Not some. All of them. Perplexity drops from 7.75 (Gopher) to 7.16 (Chinchilla) on Wikitext103.

MMLU (57-task knowledge benchmark)

| Model | Accuracy |
|---|---|
| Random | 25.0% |
| Average human rater | 34.5% |
| GPT-3 (5-shot) | 43.9% |
| Gopher (5-shot) | 60.0% |
| Chinchilla (5-shot) | 67.6% |
| Average human expert | 89.8% |

Chinchilla improves on Gopher by 7.6 percentage points, surpassing the 63.4% accuracy that expert forecasters had predicted for June 2023 more than a year early.

Reading Comprehension

| Benchmark | Chinchilla | Gopher | GPT-3 | MT-NLG 530B |
|---|---|---|---|---|
| LAMBADA (zero-shot) | 77.4 | 74.5 | 76.2 | 76.6 |
| RACE-m (few-shot) | 86.8 | 75.1 | 58.1 | - |
| RACE-h (few-shot) | 82.3 | 71.6 | 46.8 | 47.9 |

Common Sense

| Benchmark | Chinchilla | Gopher | GPT-3 | MT-NLG 530B |
|---|---|---|---|---|
| HellaSWAG | 80.8% | 79.2% | 78.9% | 80.2% |
| PIQA | 81.8% | 81.8% | 81.0% | 82.0% |
| Winogrande | 74.9% | 70.1% | 70.2% | 73.0% |
| BoolQ | 83.7% | 79.3% | 60.5% | 78.2% |

Chinchilla matches or outperforms Gopher and GPT-3 on all tasks, and beats the 530B MT-NLG on all but one.

BIG-bench (62 tasks)

Chinchilla outperforms Gopher on 58 of 62 tasks. Average accuracy jumps from 54.4% (Gopher) to 65.1% (Chinchilla) — a 10.7 percentage point improvement. Chinchilla underperforms on only 4 tasks: crash_blossom, dark_humor_detection, mathematical_induction, and logical_args.

Closed-Book Question Answering

Chinchilla achieves new closed-book SOTA on Natural Questions: 31.5% (5-shot), versus 24.5% for Gopher. On TriviaQA, Chinchilla reaches 73.2% versus 63.6% for Gopher.


The Compute-Optimal Scaling Table

This is the paper’s most referenced table — a lookup for how big your model should be given your compute budget:

| Parameters | FLOPs | Tokens |
|---|---|---|
| 400M | 1.92e+19 | 8.0B |
| 1B | 1.21e+20 | 20.2B |
| 10B | 1.23e+22 | 205.1B |
| 67B | 5.76e+23 | 1.5T |
| 175B | 3.85e+24 | 3.7T |
| 280B | 9.90e+24 | 5.9T |
| 520B | 3.43e+25 | 11.0T |
| 1T | 1.27e+26 | 21.2T |

The implication: a 175B model (GPT-3 scale) should have been trained on 3.7 trillion tokens to be compute-optimal, not 300 billion. It was trained on roughly 1/12th of the data it needed.
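The table implies roughly 20 training tokens per parameter. Combined with C ≈ 6ND, that rule of thumb reproduces the rows from the FLOP budget alone; this is a back-of-envelope sketch, not the paper's actual fitting procedure:

```python
import math

TOKENS_PER_PARAM = 20  # approximate ratio implied by the table above

def chinchilla_optimal(flops):
    """Solve C = 6*N*D with D = 20*N, i.e. C = 120*N^2."""
    n_params = math.sqrt(flops / (6 * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

for budget in (1.21e20, 1.23e22, 1.27e26):
    n, d = chinchilla_optimal(budget)
    print(f"{budget:.2e} FLOPs -> {n / 1e9:.1f}B params, {d / 1e9:.0f}B tokens")
```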


Limitations

  1. Only two data points at scale. The core validation is Chinchilla vs. Gopher — a single pair. The 400+ smaller models establish the trend, but there are no additional large-scale confirmations at intermediate sizes.

  2. Single-epoch training only. All training runs see each token at most once. Multi-epoch training (repeating data) may shift the optimal ratio, and this is not explored. Later work (e.g., Llama) showed that over-training smaller models on repeated data can be worthwhile for inference efficiency.

  3. Power-law assumption. The analysis assumes the efficient frontier follows a power law. The authors observe some concavity at high compute budgets (the log curve bends), which could mean they’re still overestimating optimal model size at the largest scales.

  4. Data quality not modeled. The scaling laws treat all tokens equally. In practice, a token from a well-written textbook is worth more than a token from a web crawl. The paper doesn’t account for data quality, filtering, or deduplication effects on the optimal ratio.

  5. Architecture-specific. These results are for dense autoregressive transformers. MoE models, non-transformer architectures, and models with retrieval augmentation may have different optimal scaling ratios.

  6. Inference cost not part of the optimization. The objective is purely minimizing training loss for fixed compute. In practice, you might deliberately over-train a smaller model (spending more FLOPs than “optimal”) because the inference savings over the model’s lifetime far exceed the extra training cost. This “inference-aware” scaling was not considered.


Key Takeaways

  1. Scale data and model size equally. For every doubling of compute, double both parameters and training tokens. The exponents are approximately a = 0.50 for parameters and b = 0.50 for data. This directly overturned the Kaplan et al. (2020) scaling laws that favored larger models.

  2. Most LLMs were (and many still are) undertrained. GPT-3, Gopher, Jurassic-1, and MT-NLG 530B were all trained on roughly 300B tokens regardless of size. Chinchilla showed they needed 4-40x more data to be compute-optimal.

  3. Smaller, well-trained models beat bigger, undertrained ones. Chinchilla (70B) uniformly outperforms Gopher (280B), GPT-3 (175B), and MT-NLG (530B) — models up to 7.5x its size — while being 4x cheaper to serve.

  4. The scaling laws corrected the field’s trajectory. Post-Chinchilla, virtually every major model adopted data-proportional scaling: Llama 65B on 1.4T tokens, Llama 2 70B on 2T tokens, and so on. “Chinchilla-optimal” became the baseline for compute allocation decisions across the industry.

  5. Data is the new bottleneck. The compute-optimal frontier for large models requires trillions of high-quality tokens. This shifted focus from “how do we train bigger models?” to “where do we find more data?” — catalyzing investments in data curation, synthetic data generation, and multi-epoch training strategies.

  6. Inference economics favor smaller models. A model that’s 4x smaller runs 4x faster at inference with 4x less memory. Training compute is a one-time cost; inference cost is paid on every query, forever. Chinchilla made the economic case for right-sizing models undeniable.


Summary prepared by Darrell Thomas, 2026