
DeepSeek-V3 Technical Report

Darrell Thomas


Authors: DeepSeek-AI (research@deepseek.com)
Published: December 2024 (updated February 2025) | arXiv:2412.19437v2
Source: arxiv.org/abs/2412.19437


The Big Idea

DeepSeek-V3 is a 671B parameter Mixture-of-Experts language model — with only 37B parameters activated per token — that matches or beats GPT-4o and Claude-3.5-Sonnet across most benchmarks while costing only $5.576M to train. The entire training run completed in under two months on 2,048 NVIDIA H800 GPUs with zero irrecoverable loss spikes and zero rollbacks. It achieves this through a relentless co-design of architecture, training framework, and hardware utilization: FP8 mixed-precision training, a novel DualPipe algorithm for pipeline parallelism, Multi-head Latent Attention (MLA) for KV-cache compression, auxiliary-loss-free load balancing for its MoE routing, and a Multi-Token Prediction (MTP) training objective. The result is the strongest open-source model at the time of release, trained for a fraction of the cost of its competitors.


Why This Paper Matters

Necessity as the Mother of Optimization

There is a story inside this paper that goes beyond the architecture diagrams. DeepSeek is a Chinese AI lab, and due to U.S. export controls, they cannot purchase NVIDIA’s top-tier data center GPUs. No H100s. No A100s. They trained DeepSeek-V3 on H800s — a different SKU with significantly reduced interconnect bandwidth (the NVLink and InfiniBand capabilities are constrained compared to the H100).

Most Western labs with unlimited access to H100 clusters have little incentive to squeeze every last FLOP out of their hardware. When you can just requisition another thousand GPUs, brute force is the path of least resistance. DeepSeek didn’t have that option. They had to make what they had work, and the engineering they produced under that constraint is extraordinary.

The FP8 training framework they built is a case in point. FP8 mixed-precision training at the 671B scale had never been done before. The H800’s Tensor Cores have limited FP8 accumulation precision — roughly 14 bits, well short of full FP32. Rather than abandon FP8, DeepSeek engineered around the limitation: fine-grained tile-wise and block-wise quantization to handle outliers, promotion to CUDA Cores for high-precision accumulation every 128 elements, and online quantization to avoid stale scaling factors. These are not trivial optimizations — they required deep understanding of the hardware at the register and warp-scheduler level.

The DualPipe scheduling algorithm tells the same story. Cross-node expert parallelism in MoE models creates a brutal communication bottleneck — roughly a 1:1 compute-to-communication ratio. DualPipe overlaps forward and backward computation-communication phases from both ends of the pipeline simultaneously, achieving near-zero all-to-all communication overhead. They wrote custom PTX instructions for their cross-node communication kernels. They partitioned GPU Streaming Multiprocessors between compute and communication, manually tuning the ratio. They even published concrete suggestions for what future GPU architectures should change to make this kind of optimization easier.

The lesson extends well beyond AI hardware. Optimization born from constraint is a different animal than optimization born from comfort. When you are forced to be efficient, you discover techniques that the unconstrained never bother to look for. DeepSeek’s $5.576M training cost vs. the estimated tens to hundreds of millions for comparable closed-source models isn’t just a cost savings — it’s a proof that the field has been leaving enormous efficiency gains on the table.


Technical Architecture

Model Structure

DeepSeek-V3 is a transformer with 61 layers, hidden dimension 7168, and two key architectural innovations carried forward from DeepSeek-V2:

Multi-head Latent Attention (MLA): Instead of caching full key-value pairs for every attention head (the standard KV-cache that dominates inference memory), MLA compresses keys and values into a low-rank latent vector. Only this compressed vector plus a small decoupled positional component (carrying RoPE) need to be cached during generation. This dramatically reduces KV-cache memory while maintaining performance comparable to standard Multi-Head Attention.
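The memory win is easy to estimate. A back-of-envelope sketch using the published DeepSeek-V3 attention dimensions (128 heads of dimension 128, a 512-dim compressed KV latent plus a 64-dim decoupled RoPE key) — treat the exact figures as illustrative, not a deployment calculation:

```python
# Back-of-envelope KV-cache comparison: standard MHA vs. MLA.
# Dimensions follow the published DeepSeek-V3 attention config; the
# comparison ignores dtype and implementation overheads.

N_HEADS, HEAD_DIM = 128, 128   # attention heads and per-head dimension
D_LATENT, D_ROPE = 512, 64     # MLA compressed KV latent + decoupled RoPE key

# Elements cached per token, per layer:
mha_cache = 2 * N_HEADS * HEAD_DIM   # full keys and values for every head
mla_cache = D_LATENT + D_ROPE        # one latent vector + one shared RoPE key

print(f"MHA: {mha_cache} elements/token/layer")     # 32768
print(f"MLA: {mla_cache} elements/token/layer")     # 576
print(f"reduction: ~{mha_cache / mla_cache:.0f}x")  # ~57x
```

At 128K context, that roughly 57x per-layer reduction is the difference between a KV cache that fits on a serving node and one that does not.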

DeepSeekMoE with Auxiliary-Loss-Free Load Balancing: All FFN layers except the first three use MoE. Each MoE layer has 1 shared expert (always activated) and 256 routed experts, of which 8 are activated per token. The routing uses sigmoid gating with top-K selection and normalization.

The load balancing innovation is significant. Traditional MoE models add an auxiliary loss to the training objective that penalizes uneven expert utilization. The problem: this auxiliary loss fights with the primary training objective and degrades model quality. DeepSeek-V3 instead uses a bias term added to each expert’s affinity score during routing. After each training step, the bias is adjusted up for underloaded experts and down for overloaded ones (controlled by a hyperparameter gamma). The actual gating value used for computation still comes from the unbiased affinity score — the bias only influences routing decisions. This achieves balanced load without corrupting the training signal. Ablations confirm it outperforms auxiliary-loss-based approaches on nearly every benchmark.
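The routing-vs-gating split can be sketched in a few lines. This is a toy illustration, not DeepSeek's implementation: the expert count, batch size, and the exact form of the bias update (the paper adjusts each bias by a fixed step controlled by gamma based on per-step load) are simplified here.

```python
import numpy as np

# Toy sketch of auxiliary-loss-free load balancing. The bias steers which
# experts are SELECTED, but the gate weights used for computation come from
# the UNBIASED affinity scores, so the training signal is untouched.

rng = np.random.default_rng(0)
N_EXPERTS, TOP_K, GAMMA = 8, 2, 0.001  # toy sizes; V3 uses 256 experts, top-8

bias = np.zeros(N_EXPERTS)

def route(affinity_logits, bias):
    scores = 1 / (1 + np.exp(-affinity_logits))  # sigmoid affinities
    topk = np.argsort(scores + bias)[-TOP_K:]    # bias influences selection only
    gates = scores[topk] / scores[topk].sum()    # gates from unbiased scores
    return topk, gates

def update_bias(bias, expert_load):
    # After each step: nudge underloaded experts up, overloaded experts down.
    return bias - GAMMA * np.sign(expert_load - expert_load.mean())

# One toy "training step" over a batch of 64 tokens:
load = np.zeros(N_EXPERTS)
for _ in range(64):
    experts, gates = route(rng.normal(size=N_EXPERTS), bias)
    load[experts] += 1
bias = update_bias(bias, load)
```

The key design point survives even in this sketch: because `gates` never sees `bias`, load balancing exerts no gradient pressure on the model, unlike an auxiliary loss.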

Multi-Token Prediction (MTP): During training, an additional sequential prediction head predicts the second-next token at each position (prediction depth D=1, i.e., one extra token beyond the standard next-token target). This densifies the training signal and is shown to improve performance across benchmarks. During inference, the MTP module can be discarded (no inference cost) or repurposed for speculative decoding, achieving a 1.8x TPS improvement with 85-90% acceptance rates for the drafted token.
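The reported speedup is consistent with simple speculative-decoding arithmetic: if the MTP head's drafted token is accepted with probability p, each decoding step emits the verified token plus, with probability p, the draft — so about 1+p tokens per step.

```python
# Expected tokens per decoding step with one-token speculative drafting:
# the verified token always lands; the drafted token lands when accepted.
for p in (0.85, 0.90):
    tokens_per_step = 1 + p
    print(f"acceptance {p:.0%} -> ~{tokens_per_step:.2f} tokens/step")
# 85-90% acceptance gives ~1.85-1.90 tokens/step, in line with the reported
# ~1.8x TPS gain (the real speedup is slightly lower because the extra head
# itself costs some compute per step).
```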

Scale

  • 671B total parameters, 37B activated per token
  • 128K vocabulary (Byte-level BPE)
  • 128K context length (extended from 4K to 32K to 128K via two-stage YaRN)
  • 14.8T training tokens

Infrastructure and Training Optimizations

Compute Cluster

2,048 NVIDIA H800 GPUs across 256 nodes. 8 GPUs per node connected via NVLink/NVSwitch. Cross-node communication via InfiniBand.

DualPipe

The signature infrastructure contribution. In MoE training with expert parallelism, tokens must be dispatched to remote GPUs hosting the relevant experts and results combined back — all-to-all communication that creates roughly 1:1 compute-to-communication overhead.

DualPipe splits each forward and backward chunk into four components (attention, all-to-all dispatch, MLP, all-to-all combine) and interleaves them across a pair of forward and backward micro-batches fed from both ends of the pipeline. GPU SMs are manually partitioned between compute and communication tasks. The result: both all-to-all and pipeline-parallel communication are fully hidden behind computation.

Compared to the 1F1B and ZB1P pipeline schedules, DualPipe significantly reduces pipeline bubbles while increasing peak activation memory only by a factor of (PP+1)/PP; the main trade-off is that it keeps two copies of the model parameters.

FP8 Mixed-Precision Training

First successful FP8 training at the 671B scale. Key techniques:

  • Fine-grained quantization: Activations quantized per 1x128 tile, weights per 128x128 block. Handles outliers far better than per-tensor quantization.
  • High-precision accumulation: Partial results promoted from Tensor Cores to CUDA Cores every 128 elements for FP32 accumulation, working around the H800’s limited (roughly 14-bit) Tensor Core accumulation precision.
  • E4M3 format everywhere: Unlike prior work using mixed E4M3/E5M2, DeepSeek uses E4M3 for all GEMM operations, enabled by fine-grained quantization.
  • Online quantization: Scaling factors computed from current values rather than delayed from prior iterations.
  • Selective precision: Embedding, output head, MoE gating, normalization, and attention remain in BF16/FP32 for stability.

Relative loss error compared to BF16 baseline stays consistently below 0.25%.
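The tile-wise idea can be sketched with plain NumPy. This models only the per-tile scale selection — the actual E4M3 rounding and Tensor Core accumulation are omitted; the 128-element tile and the 448 max finite E4M3 value follow the paper and the FP8 format, everything else is illustrative.

```python
import numpy as np

TILE = 128
E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def quantize_tiles(x):
    """Per-tile scaling: each 1x128 tile gets its own scale, so a single
    outlier only degrades its own tile instead of the whole tensor (which
    is what happens with per-tensor quantization)."""
    tiles = x.reshape(-1, TILE)
    scales = np.abs(tiles).max(axis=1, keepdims=True) / E4M3_MAX
    scales = np.maximum(scales, 1e-12)              # avoid divide-by-zero
    q = np.clip(tiles / scales, -E4M3_MAX, E4M3_MAX)
    return q, scales

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 512))
x[0, 7] = 1000.0                # inject an activation outlier

q, scales = quantize_tiles(x)
dequant = (q * scales).reshape(x.shape)
# With per-tile scales, only the outlier's own 128-element tile would lose
# resolution under E4M3 rounding; the remaining tiles keep small scales.
```

Online quantization means `scales` is computed from the current `x`, exactly as above, rather than reused from a previous iteration's statistics.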

Memory Optimizations

  • RMSNorm and MLA up-projections recomputed during backprop instead of cached
  • EMA parameters stored in CPU memory
  • Embedding and output head co-located on same pipeline rank for physical parameter sharing with MTP module
  • BF16 optimizer states (first/second moments in AdamW) instead of FP32, with master weights kept in FP32

Key Results

Base Model (Table 3)

| Benchmark | DeepSeek-V2 Base (MoE, 21B act / 236B total) | Qwen2.5 72B Base (dense) | LLaMA-3.1 405B Base (dense) | DeepSeek-V3 Base (MoE, 37B act / 671B total) |
|---|---|---|---|---|
| MMLU | 78.4 | 85.0 | 84.4 | 87.1 |
| MATH | 43.4 | 54.4 | 49.0 | 61.6 |
| HumanEval | 43.3 | 53.0 | 54.9 | 65.2 |
| MMLU-Pro | 51.4 | 58.3 | 52.8 | 64.4 |

DeepSeek-V3-Base outperforms all open-source base models on most benchmarks, including LLaMA-3.1 405B (a dense model with 11x the activated parameters).

Chat Model (Table 6)

| Benchmark | Qwen2.5 72B-Inst | LLaMA-3.1 405B-Inst | Claude-3.5-Sonnet-1022 | GPT-4o-0513 | DeepSeek-V3 |
|---|---|---|---|---|---|
| MMLU | 85.3 | 88.6 | 88.3 | 87.2 | 88.5 |
| MATH-500 | 80.0 | 73.8 | 78.3 | 74.6 | 90.2 |
| AIME 2024 | 23.3 | 23.3 | 16.0 | 9.3 | 39.2 |
| Codeforces | 24.8 | 25.3 | 20.3 | 23.6 | 51.6 |
| SWE-bench Verified | 23.8 | 24.5 | 50.8 | 38.8 | 42.0 |
| LiveCodeBench | 28.7 | 30.1 | 34.2 | 33.4 | 37.6 |
| Arena-Hard | — | — | 85.2 | 80.4 | 85.5 |
| AlpacaEval 2.0 | — | — | 52.0 | 51.1 | 70.0 |

DeepSeek-V3 is the strongest open-source model across the board. It matches or exceeds GPT-4o and Claude-3.5-Sonnet on most benchmarks, and dominates in math (MATH-500: 90.2, AIME: 39.2) and algorithmic coding (Codeforces: 51.6 percentile, LiveCodeBench: 37.6). It trails Claude-3.5-Sonnet only on engineering-focused tasks like SWE-bench.

Training Cost (Table 1)

| Phase | H800 GPU Hours | Cost ($2/GPU-hr) |
|---|---|---|
| Pre-training | 2,664K | $5.328M |
| Context extension | 119K | $0.238M |
| Post-training | 5K | $0.01M |
| Total | 2,788K | $5.576M |

Pre-training throughput: 180K H800 GPU hours per trillion tokens (3.7 days on the 2,048-GPU cluster). The entire pre-training completed in under two months.
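The headline figures are internally consistent and easy to sanity-check with back-of-envelope arithmetic:

```python
# Sanity-check of the reported Table 1 cost and throughput figures.
gpu_hours = {
    "pre-training": 2_664_000,
    "context extension": 119_000,
    "post-training": 5_000,
}
total_hours = sum(gpu_hours.values())  # 2,788K GPU-hours
total_cost = total_hours * 2           # at the assumed $2 per H800 GPU-hour
print(f"total: {total_hours/1000:.0f}K GPU-hours, ${total_cost/1e6:.3f}M")
# -> total: 2788K GPU-hours, $5.576M

# Throughput: 180K GPU-hours per trillion tokens, spread over 2,048 GPUs.
days_per_trillion = 180_000 / 2048 / 24
print(f"{days_per_trillion:.1f} days per trillion tokens")  # -> 3.7
```

As the paper notes, this covers only the final training run — not prior research, ablations, or data costs.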


Limitations

  1. Deployment footprint: The minimum deployment unit for the decoding stage is 40 nodes (320 GPUs). This is manageable for a cloud service but prohibitive for smaller teams.
  2. Inference speed: While 2x faster than DeepSeek-V2 end-to-end, there is still room for improvement, expected to come with better hardware.
  3. English factual knowledge: On SimpleQA, DeepSeek-V3 trails GPT-4o and Claude-3.5-Sonnet, likely because more training tokens were allocated to Chinese knowledge.
  4. Engineering tasks: SWE-bench Verified and Aider performance lags behind Claude-3.5-Sonnet, indicating room for improvement on real-world software engineering.
  5. Hardware constraints: Many of the training optimizations are workarounds for H800 limitations. The paper explicitly calls for future GPU architectures to natively support fine-grained quantization, fused FP8 cast with TMA access, and higher accumulation precision in Tensor Cores.

Key Takeaways

  1. Constraint breeds innovation. Without access to top-tier H100 GPUs, DeepSeek engineered FP8 training, DualPipe, and a suite of memory optimizations that collectively made a $5.576M training run competitive with models that cost orders of magnitude more. The best optimization work in this field is coming from teams that can’t just throw more hardware at the problem.

  2. MoE efficiency is real and proven at scale. DeepSeek-V3 activates only 37B of its 671B parameters per token. It outperforms LLaMA-3.1 405B (a dense model that activates all 405B parameters) on most benchmarks. The efficiency advantage is not theoretical — it’s measured and massive.

  3. Auxiliary-loss-free load balancing is strictly better. The bias-adjustment approach to MoE routing achieves balanced expert load without degrading the training signal. Ablations show consistent improvements over auxiliary-loss-based methods. This should become the default for MoE training.

  4. FP8 training works at extreme scale. DeepSeek-V3 is the first model to validate FP8 mixed-precision training at 671B parameters on 14.8T tokens with negligible quality loss (<0.25% relative error). The key is fine-grained quantization and high-precision accumulation — per-tensor quantization is not sufficient.

  5. Multi-Token Prediction improves training at zero inference cost. MTP densifies the training signal, improving benchmark performance, and the additional prediction head can be discarded during inference (no cost) or used for speculative decoding (1.8x speedup).

  6. Training stability is an engineering achievement. Zero loss spikes, zero rollbacks across the entire 14.8T token pre-training run. This is not luck — it’s the result of careful precision management, gradient clipping, and batch size scheduling.

  7. Knowledge distillation from reasoning models works. DeepSeek-V3’s post-training pipeline distills reasoning capabilities from DeepSeek-R1, improving MATH-500 from 74.6 to 83.2 and LiveCodeBench from 31.1 to 37.4. This cross-pollination between reasoning-specialized and general-purpose models is a powerful paradigm.


Summary prepared by Darrell Thomas, 2026