
Why GPUs? A Primer on the Hardware That Makes AI Possible

Darrell Thomas

Author: Darrell Thomas | Published: 2026 | Type: Original primer — redshed.ai


The Big Idea

Every neural network, from a 7B model running on your desktop to a 671B model training on 2,048 GPUs in a Chinese data center, ultimately reduces to the same operation: multiply a lot of numbers together and add them up. This operation — matrix multiplication — is embarrassingly parallel. Every element of the output matrix can be computed independently. A CPU, designed to run a single complex thread of logic very fast, is architecturally the wrong tool for this job. A GPU, designed to run thousands of simple threads simultaneously, is architecturally the right one. That’s the entire argument. Everything else is detail.

Here’s the analogy I use: imagine you need to mow your lawn. A CPU is a person with the best riding mower money can buy — fast, precise, and sophisticated. It starts in one corner and mows row by row, line by line, beautifully straight. It’s very good at what it does. But it’s one mower.

A GPU takes a different approach. It requires some setup time — you have to draw a grid on your lawn and assign positions. But then you release a thousand rabbits, each one positioned on its own square of the grid, and they all chomp their square of grass at the same time. One coordinated bite and the entire lawn is done. The individual rabbit is far less sophisticated than the riding mower. It can’t navigate, it can’t handle obstacles, it doesn’t have a cup holder. But the lawn doesn’t need any of that. The lawn just needs a thousand patches of grass eaten simultaneously. And that is matrix multiplication.

The detail beyond this analogy matters, because understanding why GPUs work for AI is the key to understanding why models are built the way they are, why training costs what it costs, why inference has the performance characteristics it does, and why the engineering decisions in papers like DeepSeek-V3 make sense. If you don’t understand the hardware, the software is magic. If you understand the hardware, the software is inevitable.


Part 1: What Neural Networks Actually Compute

Strip away the abstractions and a neural network is a sequence of matrix multiplications interleaved with simple nonlinear functions.

A single layer of a transformer processes an input matrix X (shape: sequence_length x hidden_dim) through three steps:

  1. Linear projections: Multiply X by weight matrices to produce Queries, Keys, and Values. Three large matrix multiplications.
  2. Attention: Compute QK^T (matrix multiply), apply softmax (elementwise), multiply by V (matrix multiply).
  3. Feed-forward network: Two more large matrix multiplications with a nonlinearity (SiLU, GELU) in between.

A 30-layer transformer repeats this pattern 30 times. A single forward pass through DeepSeek-V3 involves hundreds of large matrix multiplications. Training involves doing this forward pass, then computing gradients (more matrix multiplications), then updating weights (elementwise operations). Repeat billions of times across trillions of tokens.
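
The three steps above can be sketched directly as matrix operations. This is a minimal single-head sketch in NumPy with toy dimensions; it omits residual connections, normalization, and multi-head splitting, so it illustrates the GEMM structure rather than any real model's architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def transformer_layer(X, Wq, Wk, Wv, W1, W2):
    """One simplified single-head layer: projections, attention, feed-forward."""
    # Step 1: linear projections, i.e. three GEMMs.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Step 2: attention, i.e. GEMM, elementwise softmax, GEMM.
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
    # Step 3: feed-forward, i.e. two GEMMs with SiLU (x * sigmoid(x)) between.
    h = attn @ W1
    h = h / (1.0 + np.exp(-h))
    return h @ W2

rng = np.random.default_rng(0)
seq_len, hidden, ff = 8, 16, 32   # toy sizes; real models use thousands
X = rng.standard_normal((seq_len, hidden))
Wq, Wk, Wv = (rng.standard_normal((hidden, hidden)) * 0.1 for _ in range(3))
W1 = rng.standard_normal((hidden, ff)) * 0.1
W2 = rng.standard_normal((ff, hidden)) * 0.1
out = transformer_layer(X, Wq, Wk, Wv, W1, W2)   # shape: (seq_len, hidden)
```

Count the matrix multiplies: three projections, two in attention, two in the feed-forward network. Seven GEMMs per layer, before any of the elementwise work.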

The fundamental computational primitive is GEMM — General Matrix Multiply. Everything else (softmax, layer norm, activation functions) is elementwise or reduction operations that represent a small fraction of total compute. If you can do GEMM fast, you can do AI fast.


Part 2: Why CPUs Are the Wrong Tool

A modern CPU — say an Intel i9 or AMD Ryzen 9 — is a marvel of engineering. It has:

  • 8-24 cores, each capable of independent complex computation
  • Deep out-of-order execution pipelines (20+ stages) that reorder instructions for maximum throughput
  • Branch prediction that speculates on which path a conditional will take
  • Large caches (L1/L2/L3, often 30-70MB total) designed to hide memory latency
  • High single-thread clock speeds (5+ GHz)

All of this engineering is designed around one assumption: the workload is fundamentally serial, with occasional parallelism. The CPU is optimized for a single thread of execution that branches, loops, and accesses memory in unpredictable patterns. The massive caches exist because the CPU can’t predict what data it will need next. The branch predictor exists because the CPU can’t predict which instruction it will execute next. The out-of-order engine exists because the CPU needs to find parallelism within a serial instruction stream.

Now consider multiplying two 4096x4096 matrices. This produces 16.7 million output elements, each requiring 4096 multiply-accumulate operations — about 137 billion FLOPs total. The computation is:

  • Perfectly predictable: You know every operation and every memory access before you start.
  • Massively parallel: Every output element can be computed independently.
  • Regular: The same operation (multiply-accumulate) repeated billions of times.

The CPU’s branch predictor is useless — there are no branches. The out-of-order engine is useless — the operations are already independent. The deep pipeline is useless — you don’t need speculation when you know exactly what comes next. Most of the transistors on the die are solving problems that don’t exist for this workload.

A high-end CPU can deliver roughly 1-2 TFLOPS (trillions of floating-point operations per second) on dense matrix multiplication. This is not because the CPU is slow — it’s because most of its transistors are doing something other than math.
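
To make the gap concrete, here is the arithmetic for the 4096x4096 example above, using the round numbers from this primer (1.5 TFLOPS as the midpoint of the CPU estimate, 3,958 TFLOPS for H100 FP8 peak):

```python
N = 4096
flops = 2 * N**3          # one multiply + one add per MAC: ~137 billion FLOPs

cpu_tflops = 1.5          # midpoint of the 1-2 TFLOPS CPU estimate
gpu_tflops = 3958         # H100 FP8 Tensor Core peak

cpu_seconds = flops / (cpu_tflops * 1e12)
gpu_seconds = flops / (gpu_tflops * 1e12)

print(f"{flops / 1e9:.0f} GFLOPs")                  # 137 GFLOPs
print(f"CPU:  {cpu_seconds * 1e3:.1f} ms")          # 91.6 ms
print(f"H100: {gpu_seconds * 1e6:.1f} us")          # 34.7 us
print(f"speedup: {cpu_seconds / gpu_seconds:.0f}x") # 2639x
```

These are theoretical peaks on both sides; real kernels reach a fraction of peak, but the ratio between the two machines is what matters.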


Part 3: Why GPUs Are the Right Tool

A GPU takes the opposite design bet. An NVIDIA H100 has:

  • 132 Streaming Multiprocessors (SMs), each containing multiple execution units
  • 16,896 CUDA Cores for general floating-point math
  • 528 Tensor Cores specialized for matrix multiply-accumulate
  • 80GB of HBM3 memory with 3.35 TB/s bandwidth (vs. ~50-80 GB/s for CPU DDR5)
  • No branch prediction, no out-of-order execution, no speculation

The GPU design philosophy: dedicate almost every transistor to doing math. Instead of one complex core that can handle any workload, build thousands of simple cores that can only handle regular, parallel workloads — and handle them at staggering throughput.

The SIMT Execution Model

GPUs execute in SIMT — Single Instruction, Multiple Thread. Groups of 32 threads (called a warp) execute the same instruction at the same time on different data. This is not a limitation for matrix multiplication; it’s a perfect match. Every thread in the warp is doing the same thing: load two numbers, multiply them, accumulate the result. Just at different positions in the matrix.

An H100 can have thousands of warps in flight simultaneously. When one warp stalls waiting for memory, the SM instantly switches to another warp — no pipeline flush, no context switch overhead, just a register file pointer swap. This is how GPUs hide memory latency: not with caches (though they have some), but with massive parallelism. There are always more warps ready to execute.
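
The mapping from threads to output elements can be sketched in plain Python. This is a serial simulation of the SIMT idea, not CUDA: each "thread" derives its output position from a flat thread id (the same index arithmetic a real kernel computes from block and thread indices) and then runs an identical multiply-accumulate loop on its own data:

```python
def thread_to_element(thread_id, n_cols):
    # Each thread owns one output element; row/col come from the flat id.
    return thread_id // n_cols, thread_id % n_cols

def matmul_simt(A, B):
    """Serial simulation of SIMT: one 'thread' per output element, all
    executing the same instruction stream on different data."""
    n, k = len(A), len(B)
    m = len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for tid in range(n * m):           # on a GPU, these run concurrently
        row, col = thread_to_element(tid, m)
        acc = 0.0
        for i in range(k):             # identical loop in every thread
            acc += A[row][i] * B[i][col]
        C[row][col] = acc
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul_simt(A, B)   # [[19, 22], [43, 50]]
```

There is no branching in the per-thread loop and no dependence between threads, which is exactly why lockstep warp execution costs nothing here.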

Tensor Cores: Hardware Matrix Multiply

Starting with the Volta architecture (2017), NVIDIA added Tensor Cores — hardware units that perform small matrix multiplications (originally 4x4, now larger) in a single operation. Instead of breaking a matrix multiply into individual multiply-accumulate instructions, a Tensor Core consumes an entire matrix tile per instruction.

An H100’s Tensor Cores deliver:

Precision    TFLOPS
FP64         67
FP32         134
TF32         989
FP16/BF16    1,979
FP8          3,958
INT8         3,958

Compare that 3,958 TFLOPS at FP8 to the CPU’s 1-2 TFLOPS. That’s roughly a 2,000x throughput advantage on the operation that dominates neural network workloads.

This is why DeepSeek-V3 went to extraordinary lengths to make FP8 training work. Every step down in precision roughly doubles throughput on Tensor Cores. The engineering challenge of fine-grained quantization, high-precision accumulation, and careful numerical stability management is worth it because the hardware reward is a 2x speedup on every GEMM in the network.
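
Fine-grained quantization can be sketched in a few lines. This simulates per-block scaling in NumPy rather than using real FP8 hardware: each block of values gets its own scale, so one outlier can't destroy the precision of its neighbors. The block size of 128 and the E4M3 maximum of 448 follow common FP8 convention; the integer rounding here is a crude stand-in for the actual FP8 encoding:

```python
import numpy as np

FP8_E4M3_MAX = 448.0   # largest finite value in the FP8 E4M3 format
BLOCK = 128            # per-block scaling granularity

def quantize_blockwise(x):
    """Scale each block so its max maps into the FP8 range, then round.
    Returns 'quantized' values plus the per-block scales needed to undo it."""
    blocks = x.reshape(-1, BLOCK)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / FP8_E4M3_MAX
    scales = np.where(scales == 0, 1.0, scales)   # avoid divide-by-zero
    q = np.round(blocks / scales)                 # stand-in for the FP8 cast
    return q, scales

def dequantize_blockwise(q, scales, shape):
    return (q * scales).reshape(shape)

x = np.random.default_rng(0).standard_normal(512).astype(np.float32)
x[7] = 100.0                          # one outlier, confined to a single block
q, s = quantize_blockwise(x)
x_hat = dequantize_blockwise(q, s, x.shape)
rel_err = np.abs(x_hat - x).max() / np.abs(x).max()
```

With one global scale, the outlier at index 7 would force every other value into a handful of representable levels; with per-block scales, only its own block pays the cost.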

Memory Bandwidth: The Other Bottleneck

Raw compute throughput is only half the story. You can multiply numbers as fast as you want, but if you can’t feed the data to the multipliers fast enough, they sit idle.

This is measured by arithmetic intensity — the ratio of FLOPs performed to bytes transferred. If your operation has high arithmetic intensity (many FLOPs per byte loaded), you’re compute-bound and the TFLOPS number matters. If it has low arithmetic intensity (few FLOPs per byte loaded), you’re memory-bandwidth-bound and the GB/s number matters.

Large matrix multiplications are typically compute-bound — you load each element once and use it in many multiply-accumulate operations. But many other operations in a neural network are memory-bandwidth-bound:

  • Attention (small batch): The QK^T multiplication during autoregressive decoding has low arithmetic intensity because the matrices are thin.
  • Elementwise operations: Softmax, LayerNorm, activation functions load each element, do one or two operations, and write it back.
  • Embedding lookups: Pure memory access with zero compute.

This is why GPU memory bandwidth matters so much. The H100’s 3.35 TB/s of HBM3 bandwidth is ~50x a CPU’s DDR5 bandwidth. During inference, when you’re generating one token at a time with small batch sizes, memory bandwidth is often the dominant bottleneck. The model weights must be loaded from memory for every single token generated.
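
The compute-bound vs. bandwidth-bound distinction falls out of simple arithmetic. A device's balance point is peak FLOPs divided by memory bandwidth; any operation with lower arithmetic intensity than that is bandwidth-bound. A sketch at FP16, using the H100 figures quoted in this primer and an idealized traffic model (each matrix read or written exactly once):

```python
# H100 figures used in this primer.
peak_fp16_tflops = 1979
hbm_tb_s = 3.35
machine_balance = peak_fp16_tflops / hbm_tb_s   # ~591 FLOPs/byte to saturate compute

def gemm_intensity(n, bytes_per_elt=2):
    flops = 2 * n**3                         # multiply + add per MAC
    bytes_moved = 3 * n**2 * bytes_per_elt   # read A, read B, write C (ideal)
    return flops / bytes_moved

def elementwise_intensity(flops_per_elt=1, bytes_per_elt=2):
    return flops_per_elt / (2 * bytes_per_elt)   # one read + one write per element

big_gemm = gemm_intensity(4096)         # ~1365 FLOPs/byte: compute-bound
softmax_like = elementwise_intensity()  # 0.25 FLOPs/byte: bandwidth-bound
```

The large GEMM sits well above the balance point, so the TFLOPS number governs it; the elementwise op sits three to four orders of magnitude below, so only bandwidth matters.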

This has direct implications for model architecture. Multi-head Latent Attention (MLA), used in DeepSeek-V2 and V3, compresses the KV cache into a low-rank latent vector specifically to reduce the memory bandwidth consumed by attention during inference. The engineering isn’t arbitrary — it’s a direct response to the hardware constraint.


Part 4: The Memory Hierarchy

Understanding GPU performance requires understanding where data lives and how fast you can get it:

Memory Tier                   Capacity   Bandwidth     Latency
Registers (per SM)            ~256 KB    ~TB/s         ~1 cycle
Shared Memory / L1 (per SM)   228 KB     ~10-20 TB/s   ~30 cycles
L2 Cache                      50 MB      ~6 TB/s       ~200 cycles
HBM (Global Memory)           80 GB      3.35 TB/s     ~400 cycles
Host DRAM (via PCIe 5.0)      512+ GB    ~64 GB/s      ~10,000+ cycles
NVMe SSD                      TB+        ~7 GB/s       ~100,000+ cycles

There is a 50x bandwidth cliff between HBM and host DRAM. This is why “just offload to system RAM” doesn’t work for most operations — a memory-bandwidth-bound operation that saturates HBM at 3.35 TB/s will crawl at PCIe’s 64 GB/s.

But this cliff is exactly what makes Engram’s architecture clever. Engram’s embedding lookups are latency-sensitive but bandwidth-light: you need a few specific embedding vectors (small data volume), and you need them before a specific layer executes (latency matters), but you don’t need to stream terabytes through the bus. By prefetching asynchronously while the GPU computes the preceding transformer block, you hide the latency entirely. The 100B-parameter offloading experiment in the Engram paper (2.8% throughput penalty) works because the data volume per lookup is tiny relative to PCIe bandwidth, and the prefetch window is sufficient to hide the latency.

For a home setup, this hierarchy is the entire game:

  • Your GPU’s VRAM (e.g., 24GB on a 4090): This is where hot computation happens. Model weights for active layers, KV cache, activations.
  • Your system RAM (e.g., 128GB DDR5): This is where you put parameters that are accessed predictably and infrequently — Engram embedding tables, offloaded MoE experts with good prefetching, quantized weight shards for layer-by-layer loading.
  • Your NVMe SSD (e.g., 2TB): The long tail of rarely-accessed data. Zipfian distributions make this viable for Engram’s embedding tables.

The engineering challenge is making sure data is in the right tier at the right time. Every paper in this series — DeepSeek-V3’s DualPipe overlapping communication with computation, Engram’s deterministic prefetching, MLA’s KV-cache compression — is fundamentally about navigating this memory hierarchy.


Part 5: Parallelism Strategies for Training

Training a model like DeepSeek-V3 (671B parameters, 14.8T tokens) on 2,048 GPUs requires distributing the work across all of them. There are four axes of parallelism, and understanding them requires understanding the hardware:

Data Parallelism (DP)

Each GPU holds a complete copy of the model. Different GPUs process different batches of data. Gradients are averaged across GPUs after each step. Scales easily but requires each GPU to hold the full model — impossible for a 671B-parameter model.
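
A toy NumPy sketch of the data-parallel step: each replica computes gradients on its own shard of the batch, and averaging them reproduces the full-batch gradient exactly. Linear least-squares is used here only because the equivalence can be checked in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 4))   # full batch of 64 examples
y = rng.standard_normal(64)
w = rng.standard_normal(4)         # every 'GPU' holds an identical copy of w

def grad(Xb, yb, w):
    # Gradient of mean squared error for a linear model on one mini-batch.
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

# Four 'GPUs', each with a quarter of the batch (equal shard sizes).
shards = [(X[i::4], y[i::4]) for i in range(4)]
local_grads = [grad(Xb, yb, w) for Xb, yb in shards]

# The all-reduce step: average gradients across replicas.
avg_grad = np.mean(local_grads, axis=0)
```

The averaged gradient matches `grad(X, y, w)` computed on the whole batch, which is why data parallelism changes nothing about the optimization, only about where the work runs.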

Tensor Parallelism (TP)

A single matrix multiplication is split across multiple GPUs. Each GPU holds a shard of the weight matrix and computes a partial result. Requires fast inter-GPU communication (NVLink at 900 GB/s) because partial results must be exchanged within a single layer’s computation. Low latency but limited scaling — typically 4-8 GPUs within a node.
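
Column-sharded tensor parallelism in NumPy, as a sketch of the idea rather than a real multi-GPU implementation: each "GPU" holds a slice of the weight matrix's columns and computes its partial output independently; gathering the shards back together is the step that needs the fast NVLink exchange in practice:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 16))    # activations, replicated on every GPU
W = rng.standard_normal((16, 32))   # full weight matrix, too big for one GPU

# Split W's columns across 4 'GPUs'; each computes its shard of XW locally.
shards = np.split(W, 4, axis=1)
partials = [X @ Ws for Ws in shards]   # these GEMMs need no communication

# The gather step: concatenating shards reconstructs the full output.
Y = np.concatenate(partials, axis=1)
```

The per-shard GEMMs are embarrassingly parallel; all the cost is in the gather, which is why tensor parallelism is confined to GPUs sharing a fast interconnect.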

Pipeline Parallelism (PP)

Different layers of the model run on different GPUs. GPU 1 computes layers 1-4, GPU 2 computes layers 5-8, and so on. Creates pipeline bubbles — GPU 2 is idle while GPU 1 computes. DeepSeek-V3’s DualPipe algorithm minimizes these bubbles by feeding micro-batches from both ends of the pipeline simultaneously and overlapping forward and backward computation.
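
The bubble cost has a standard back-of-the-envelope estimate for a naive (non-DualPipe) schedule: with p pipeline stages and m micro-batches, the idle fraction is (p - 1) / (m + p - 1). The stage and micro-batch counts below are illustrative, not DeepSeek-V3's actual schedule:

```python
def bubble_fraction(stages, micro_batches):
    # A naive pipeline takes (micro_batches + stages - 1) time slots to fill,
    # run, and drain, but each stage does useful work in only micro_batches
    # of them; the rest is bubble.
    return (stages - 1) / (micro_batches + stages - 1)

few = bubble_fraction(16, 16)     # ~0.48: nearly half the hardware sits idle
many = bubble_fraction(16, 256)   # ~0.06: more micro-batches amortize fill/drain
```

The formula shows why simply adding micro-batches helps but never reaches zero, and why DualPipe attacks the problem structurally instead, by overlapping forward and backward work within the bubble.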

Expert Parallelism (EP)

In MoE models, different experts live on different GPUs. When a token is routed to an expert on a remote GPU, it must be dispatched across the network (InfiniBand at 50 GB/s for the H800). DeepSeek-V3 uses 64-way EP across 8 nodes. The all-to-all communication pattern is the dominant bottleneck, which is why they invested so heavily in custom communication kernels and DualPipe’s computation-communication overlap.

DeepSeek-V3 uses all four simultaneously: 16-way PP, 64-way EP across 8 nodes, and ZeRO-1 DP. Understanding why requires understanding the bandwidth tiers: NVLink (intra-node, fast) handles TP, InfiniBand (inter-node, slower) handles EP and PP communication. The parallelism strategy is dictated by the hardware topology.


Part 6: A Brief History

The GPU’s journey from pixel shader to AI engine:

Year   Hardware           Milestone
2001   GeForce 3          First programmable shaders. Researchers start abusing pixel shaders for math.
2002   —                  Mark Harris coins “GPGPU,” founds GPGPU.org.
2003   —                  Harris (UNC) and Buck (Stanford) complete PhD work on general-purpose GPU computing.
2004   —                  Ian Buck joins NVIDIA, begins developing CUDA based on his Brook research.
2006   GeForce 8800 GTX   CUDA 1.0 released. First unified shader architecture (Tesla). GPUs become programmable in C.
2007   —                  Kirk publishes “NVIDIA CUDA Software and GPU Parallel Computing Architecture.”
2008   —                  Nickolls et al. publish “Scalable Parallel Programming with CUDA.”
2009   —                  Raina, Madhavan, and Ng demonstrate GPU-accelerated deep learning at Stanford.
2012   GeForce GTX 580    AlexNet. Krizhevsky trains a CNN on two GTX 580s and destroys ImageNet. The deep learning revolution begins.
2014   —                  cuDNN released. Optimized GPU primitives for convolution, pooling, normalization.
2016   Tesla P100         First GPU with HBM2 memory. NVLink interconnect. 21 TFLOPS FP16.
2017   Tesla V100         Tensor Cores introduced. Dedicated matrix multiply-accumulate hardware. 125 TFLOPS.
2020   A100               TF32 format, 3rd-gen Tensor Cores, structural sparsity support. 624 TFLOPS.
2022   H100               FP8 support, 4th-gen Tensor Cores, Transformer Engine. 3,958 TFLOPS FP8.
2024   B200 (Blackwell)   5th-gen Tensor Cores, native fine-grained quantization. Addresses limitations DeepSeek identified.

The progression tells a story: from general-purpose programmability (CUDA), to optimized libraries (cuDNN), to dedicated hardware (Tensor Cores), to precision specialization (FP8, microscaling formats). Each step is a response to how AI workloads actually use the hardware.

Ian Buck’s PhD thesis proved GPUs could do general-purpose compute. AlexNet proved they should. Tensor Cores proved NVIDIA was willing to burn transistors on the bet. FP8 on H100 proved the bet was right.


Part 7: What This Means for You

If you’re building a home AI rig, here’s what actually matters:

For inference (running models):

  • VRAM is the primary constraint. A model must fit in VRAM (or be cleverly offloaded) to run. A 70B model in FP16 needs ~140GB. In 4-bit quantization, it needs ~35GB. A 4090 has 24GB. A 3090 has 24GB. Two 3090s have 48GB. This is the hard math.
  • Memory bandwidth determines tokens/second. During autoregressive generation, you’re memory-bandwidth-bound. A 4090’s 1 TB/s will push tokens faster than a 3090’s 936 GB/s. This matters more than TFLOPS for single-user inference.
  • System RAM enables offloading if done right. Not all offloading is equal. Naive layer-by-layer offloading is killed by PCIe latency. Architectures like Engram that enable deterministic prefetching make system RAM genuinely useful.
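
The hard math in the bullets above is a one-liner each way. These are rough ceilings, since real decoding adds KV-cache traffic and kernel overhead, using the figures quoted in this primer:

```python
def model_bytes_gb(params_b, bits):
    """Weight footprint in GB for a model with params_b billion parameters."""
    return params_b * 1e9 * bits / 8 / 1e9

def max_tokens_per_s(weight_gb, bandwidth_gb_s):
    # Autoregressive decoding reads every weight once per token, so
    # bandwidth / model size bounds single-stream generation speed.
    return bandwidth_gb_s / weight_gb

fp16_70b = model_bytes_gb(70, 16)   # 140.0 GB: far beyond any consumer GPU
q4_70b = model_bytes_gb(70, 4)      # 35.0 GB: fits across two 24GB cards

print(max_tokens_per_s(q4_70b, 1000))  # ~28.6 tok/s ceiling on a 4090 (1 TB/s)
print(max_tokens_per_s(q4_70b, 936))   # ~26.7 tok/s ceiling on a 3090
```

Note that the ceiling depends only on bandwidth and model size, not on TFLOPS, which is the single-user inference point made above.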

For fine-tuning:

  • VRAM is even more critical. Training requires storing activations, gradients, and optimizer states alongside the model weights. LoRA and QLoRA exist specifically to reduce this memory footprint.
  • Tensor Core utilization matters. Use batch sizes and sequence lengths that are multiples of 8 (for FP16) or 16 (for FP8) to keep Tensor Cores fed. Odd dimensions waste hardware.
  • Multi-GPU scaling is real but not free. Two GPUs connected via PCIe (not NVLink) will not give you 2x performance for tensor-parallel training. The PCIe bottleneck limits scaling. Data parallelism scales better over PCIe but requires each GPU to hold the full model.
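
The alignment rule in the bullets above is easy to enforce with a small helper (the multiples of 8 and 16 follow the guidance in this primer; the function name is illustrative):

```python
def pad_to_multiple(n, multiple):
    """Round n up to the next multiple so Tensor Core tiles are fully used."""
    return ((n + multiple - 1) // multiple) * multiple

print(pad_to_multiple(1000, 16))  # 1008: pad a 1000-token batch for FP8 tiles
print(pad_to_multiple(1024, 16))  # 1024: already aligned, nothing wasted
```

Padding a dimension from 1000 to 1008 wastes under 1% of the work; leaving it at 1000 can leave entire Tensor Core tiles partially empty on every GEMM.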

The reason we need GPUs is not complicated: neural networks are matrix multiplications, matrix multiplications are parallel, and GPUs are parallel processors. Everything else — Tensor Cores, memory bandwidth, NVLink, CUDA, cuDNN, the entire NVIDIA ecosystem — is optimization on top of that fundamental alignment between workload and hardware. And as the DeepSeek papers on this site demonstrate, that optimization is where the real engineering lives.


If you want to go deeper on the topics covered here, these three books will take you from the linear algebra foundations through GPU architecture to hands-on CUDA programming.

No Bullshit Guide to Linear Algebra — Ivan Savov (Minireference Co., 2nd Edition, 2020) If matrix multiplication is the fundamental operation of AI, then linear algebra is the language it’s written in. Savov’s book strips away the academic ceremony and teaches vectors, matrices, transformations, and eigenvalues the way a working engineer needs to understand them. Start here if the math in this primer felt shaky. You can’t optimize what you don’t understand. minireference.com

Programming Massively Parallel Processors: A Hands-on Approach — Wen-mei W. Hwu, David B. Kirk, Izzat El Hajj (Morgan Kaufmann, 4th Edition, 2022) The definitive textbook on GPU computing. Kirk was NVIDIA’s Chief Scientist; Hwu is one of the foremost computer architects alive. The 4th edition covers CUDA programming from first principles through advanced parallel patterns (reduction, prefix scan, merge sort, stencil, histogram), tiling and memory optimization, and dedicates a full chapter to convolutional neural networks on GPUs. If you want to understand why DeepSeek-V3’s DualPipe algorithm works, or why FP8 training requires the precision management it does, this is the book that gives you the foundation. Elsevier (4th Edition) | Amazon

CUDA by Example: An Introduction to General-Purpose GPU Programming — Jason Sanders, Edward Kandrot (Addison-Wesley, 2010) Where Hwu and Kirk give you the theory and architecture, Sanders and Kandrot give you the keyboard. This book walks you through writing actual CUDA code — thread management, shared memory, atomics, streams — with working examples you can compile and run. It’s older (written against CUDA 3.x), but the core programming model hasn’t changed. The mental model of warps, blocks, and grids that this book builds is exactly the mental model you need to understand why GPU workloads behave the way they do.

A word of caution, though: coding agents like Claude Code are becoming remarkably good at writing CUDA kernels, managing thread hierarchies, and handling the low-level details for you. It’s tempting to let the agent do the work and skip the fundamentals. Don’t.

The agent is a force multiplier, not a substitute for understanding. You need to know what a warp divergence is to recognize when the agent’s code has one. You need to understand shared memory bank conflicts to know whether the generated tiling strategy is any good. Learn the fundamentals. Then let the tools make you dangerous.



Darrell Thomas, 2026