2023 Summary

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Darrell Thomas


Authors: Albert Gu (Carnegie Mellon University) and Tri Dao (Princeton University)
Published: December 2023 (updated May 2024) | arXiv:2312.00752v2
Source: arxiv.org/abs/2312.00752
Code: github.com/state-spaces/mamba


The Big Idea

Mamba asks a simple question: what if the model didn’t need to look back at everything?

Every modern AI language model — ChatGPT, Claude, Gemini, all of them — is built on the Transformer architecture we covered in “Attention Is All You Need.” The Transformer’s core trick is attention: when processing a word, it looks back at every other word in the conversation to decide what matters. This works incredibly well, but it has a brutal scaling problem. If you double the length of the text, the cost of that “look at everything” step quadruples. This is the fundamental reason context windows have hard limits, and why feeding an AI an entire book or codebase gets expensive fast.

Instead of re-examining the entire history at each step, Mamba reads through the text once, maintaining a set of running notes — a compressed summary of what it’s seen so far. When a new word arrives, it updates its notes and moves on. It never looks back. This makes the cost grow linearly with text length instead of quadratically. Twice as much text, twice the cost — not four times.

The catch is that the quality of the output depends entirely on how good those notes are. Previous models that worked this way (called state space models, or SSMs) kept notes using a fixed strategy — the same information got recorded no matter what the input was. They worked fine for audio and signal data, but they couldn’t handle language, because language requires judgment about what’s important. The word “the” doesn’t deserve the same treatment as the word “guilty.”

Mamba’s breakthrough is selectivity: the model decides what to write down and what to skip based on what it’s actually reading. Important tokens get absorbed into the running state. Filler gets ignored. This one change turns SSMs from a niche technique into a genuine Transformer competitor. Mamba-3B matches Transformers twice its size on language tasks, generates text at 5x the speed, and scales cleanly to million-token inputs.


Why This Paper Matters

The Home Lab Angle

If you run models on your own hardware, here’s why you should care about Mamba.

When a Transformer generates text, it has to remember every previous token in the conversation. It stores this memory in something called a KV cache, and that cache grows with every token of context. At 128K context on a 7B model, you might burn 8-16GB of GPU memory on the cache alone, on top of the model weights. That’s often the real bottleneck on consumer GPUs — not the model size, but the context length.

Mamba has no KV cache. Its state is a small, fixed-size tensor. Whether you’re processing a short prompt or an entire novel, memory usage during generation doesn’t change. This makes long-context inference on a single consumer GPU practical in a way that Transformers can’t easily match. (DeepSeek-V3’s MLA, which we covered earlier, tackles this same problem from the Transformer side by compressing the KV cache — a different solution to the same constraint.)
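To make the memory arithmetic concrete, here is a back-of-the-envelope sketch in Python. The configurations are illustrative assumptions (a GQA Transformer with 32 layers, 8 KV heads, and head dimension 128; a Mamba stack with 64 layers, inner dimension 5120, state size 16), not the exact numbers of any specific released model:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elt=2):
    """Transformer KV cache: a K and a V vector per past token, per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elt

def ssm_state_bytes(n_layers, d_inner, d_state, bytes_per_elt=2):
    """Mamba recurrent state: fixed size, independent of sequence length."""
    return n_layers * d_inner * d_state * bytes_per_elt

# Hypothetical 7B-class Transformer with GQA, fp16, at 128K context
print(kv_cache_bytes(32, 8, 128, 128_000) / 1e9)   # ~16.8 GB, grows with context

# Hypothetical Mamba config, fp16
print(ssm_state_bytes(64, 5120, 16) / 1e6)         # ~10.5 MB, at any context
```

The cache grows linearly with `seq_len`; the SSM state does not appear as a function of it at all, which is the whole point.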

The Quadratic Wall

For seven years, nothing could touch the Transformer. Dozens of research groups tried to build something faster that didn’t sacrifice quality. All of them failed on language. Mamba is the first model to break through.

The problem it solves is concrete. When a Transformer processes text, its attention mechanism compares every word against every other word. For a 1,000-word input, that’s 1 million comparisons. For 10,000 words, it’s 100 million. For 128,000 words (a long conversation or document), it’s over 16 billion — and that’s per layer, per pass through the network. The cost grows with the square of the input length. This is why you hit context limits, why long conversations get slow, and why processing a full book is a serious engineering challenge even on expensive hardware.

Many researchers tried to build cheaper versions of attention that scale better. None of them matched the original Transformer’s quality on language. The shortcuts kept losing something essential.

Mamba doesn’t try to build a cheaper version of attention. It replaces it with an entirely different mechanism that is linear by design — double the input, double the cost, not quadruple. And it matches the Transformer’s quality.

Why Previous Alternatives Failed (and What Mamba Fixed)

Before Mamba, the most promising Transformer alternatives were called state space models (SSMs). Think of them like a reader with a notepad. As each word comes in, the reader updates their notes according to a fixed set of rules and moves on. The notes are a compressed summary of everything seen so far.

The problem was that the rules were always the same, no matter what the input was. Every word got processed through identical update logic. This is like taking notes with a rule that says “write down every third word” — it’s consistent and fast, but it has no judgment. It doesn’t know that “guilty” matters more than “the.”

This fixed-rule approach worked well for continuous signals like audio waveforms, where the patterns are relatively uniform. But language is full of structure that requires judgment: some words are critical, some are filler, and you can’t know which is which without looking at the content.

Mamba’s fix is to make the rules content-dependent. When each word arrives, the model decides:

  • How much to remember vs. forget — should I update my notes heavily, or keep them mostly as they are?
  • What to write down — how much of this input should enter my running summary?
  • What to output — what’s relevant from my notes for predicting what comes next?

This is what the paper calls selectivity. The model can now ignore filler words entirely (preserving its existing notes unchanged), focus on key terms (heavily updating its state), and even wipe its notes clean at the boundary between unrelated passages. It’s the difference between a note-taker who mindlessly transcribes everything and one who actively listens and writes down what matters.

The Hardware Problem (and Tri Dao’s Signature Move)

Making the rules content-dependent created an engineering problem. The reason prior SSMs were fast is that their fixed rules could be computed as a single batch operation (a convolution) that GPUs handle efficiently. Once the rules change at every step, you lose that shortcut. Naively, you’d be back to processing one word at a time — the exact bottleneck that killed RNNs in the first place.

This is where Tri Dao comes in. Dao is the same researcher behind FlashAttention, and he applies the same philosophy here: the speed of an algorithm depends less on how many math operations it performs and more on how it moves data through the GPU’s memory system.

Modern GPUs have two relevant tiers of memory: a large but slow main memory (HBM) and a small but fast scratchpad (SRAM). Most of the time spent “computing” is actually spent moving numbers between these two tiers. Mamba’s implementation loads everything it needs into the fast scratchpad, does all the work there, and writes only the final answer back to slow memory. The large intermediate calculations never touch the slow tier at all.

As a side note, here’s how: custom fused CUDA kernels. Normally, a deep learning framework like PyTorch runs each operation as a separate GPU kernel — each one reads from HBM, does a small computation, writes back to HBM, and the next kernel repeats the cycle. Dao bypasses this by writing a single CUDA kernel that loads the inputs into SRAM once, runs the entire selective scan there, and only writes the final result back to HBM. The math is identical either way, but the fused version avoids materializing large intermediate tensors in slow memory. The speedup comes entirely from smarter data movement, not fewer operations. If you want to understand GPU memory hierarchies and kernel programming at this level, Programming Massively Parallel Processors by Wen-mei Hwu and David Kirk is the definitive reference.

The result: Mamba runs 3x faster than previous SSMs despite being more complex, and generates text at 5x the speed of a Transformer of similar size. The speed advantage comes not just from the linear-time math, but from the absence of the KV cache. Mamba’s running state is a fixed size. It uses the same amount of memory whether it’s processing 1,000 tokens or 1,000,000.


Technical Architecture

State Space Models: The Foundation

To understand Mamba, you need to understand what a state space model is. If you studied control engineering or signal processing in college, you already know this — you just didn’t expect to see it powering a language model.

The state space representation comes straight from classical control theory. This was a book from my college days, and it remains one of the best books on the subject: Katsuhiko Ogata’s Modern Control Engineering (5th Edition, Pearson). In that world, state space models describe physical systems — a motor, a circuit, an aircraft’s pitch dynamics. You have a hidden state (the internal variables you can’t directly observe), inputs (forces, voltages, control signals), and outputs (what your sensors measure). Two equations govern everything:

  1. State update: The hidden state at time t depends on the previous state and the current input.
  2. Output: The observation at time t is some function of the hidden state.

In math, the continuous form is:

h'(t) = \mathbf{A}\,h(t) + \mathbf{B}\,x(t) \quad \text{(state evolution)}

y(t) = \mathbf{C}\,h(t) \quad \text{(output)}

Where \mathbf{A} is the state transition matrix (how the state evolves on its own), \mathbf{B} is the input matrix (how the input affects the state), and \mathbf{C} is the output matrix (how the state maps to what you observe).

If you’ve seen these equations in Ogata, you’re looking at exactly the same math. The A, B, C matrices mean the same thing structurally. The discretization step — converting from continuous time to discrete time steps — is the same procedure you’d use to implement a digital controller. Albert Gu’s insight was that this framework, originally designed to model physical dynamics, could be repurposed to model sequences of tokens. Instead of tracking the state of an aircraft, you’re tracking the state of a conversation.

To use this in a discrete sequence model (where inputs arrive one at a time, like tokens), you discretize:

h_t = \bar{\mathbf{A}}\,h_{t-1} + \bar{\mathbf{B}}\,x_t

y_t = \mathbf{C}\,h_t

Where \bar{\mathbf{A}} and \bar{\mathbf{B}} are the discretized versions of \mathbf{A} and \mathbf{B}, controlled by a step size parameter \Delta.

This is a recurrence — each step depends on the previous — but it has a critical computational dual: the entire operation can also be expressed as a global convolution when the parameters are constant across time. This duality is what made S4 fast: train using the convolution form (parallelizable on GPUs), infer using the recurrence form (constant memory per step).
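To see the duality concretely, here is a toy scalar LTI SSM in Python (the constants a, b, c are made up for illustration). The recurrent form and the convolutional form produce identical outputs:

```python
import numpy as np

# Scalar LTI SSM (N=1): h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.
a, b, c = 0.9, 0.5, 1.2
x = np.array([1.0, -2.0, 0.5, 3.0])

# Recurrent form: one step per token, constant memory (inference mode)
h, y_rec = 0.0, []
for xt in x:
    h = a * h + b * xt
    y_rec.append(c * h)

# Convolutional form: kernel K_k = c * a^k * b, applied all at once (training mode)
K = c * (a ** np.arange(len(x))) * b
y_conv = [sum(K[j] * x[t - j] for j in range(t + 1)) for t in range(len(x))]

assert np.allclose(y_rec, y_conv)
```

The equivalence holds only because a, b, c are constant across time steps, which is exactly the LTI assumption Mamba goes on to break.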

Where Mamba diverges from anything in Ogata’s book is what comes next. In classical control theory, the A, B, C matrices are fixed properties of the system — they describe the physics, and the physics don’t change based on the input signal. A control theorist would call this linear time-invariant (LTI). Mamba breaks that assumption deliberately, making A, B, C functions of the input. In control terms, you’ve just made the system nonlinear and time-varying — which is exactly the point, and exactly what language requires.

Selectivity: The Key Innovation

Mamba breaks the time-invariance assumption. In Algorithm 2 of the paper, the parameters \mathbf{B}, \mathbf{C}, and \Delta become functions of the input:

  • \mathbf{B}_t = \text{Linear}_N(x_t) — input projection to state dimension N
  • \mathbf{C}_t = \text{Linear}_N(x_t) — input projection to state dimension N
  • \Delta_t = \text{softplus}(\text{Linear}_1(x_t) + \text{Parameter}) — input-dependent step size

This means the model’s dynamics change at every time step based on what it’s looking at. The paper connects this directly to classical RNN gating mechanisms. When N = 1 and A = -1, the selective SSM recurrence reduces to:

g_t = \sigma(\text{Linear}(x_t))

h_t = (1 - g_t) \cdot h_{t-1} + g_t \cdot x_t

This is exactly the gating equation from an LSTM or GRU, the workhorses of pre-Transformer sequence modeling. Mamba generalizes this: instead of a scalar gate, it uses a structured state space with a higher-dimensional state (N = 16 by default), giving it far more capacity to selectively store and retrieve information.
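A naive sequential sketch of the selective recurrence, with input-dependent B, C, and step sizes as described above (shapes simplified to a single channel; this is an illustration of the math, not the paper's fused kernel, and the variable names are ours):

```python
import numpy as np

def selective_scan(x, A, B, C, delta):
    # Reference loop for the selective SSM. Shapes: x (L,), A (N,) diagonal,
    # B and C (L, N) input-dependent, delta (L,) input-dependent step sizes.
    L, N = B.shape
    h = np.zeros(N)
    y = np.zeros(L)
    for t in range(L):
        Abar = np.exp(delta[t] * A)   # ZOH discretization of diagonal A
        Bbar = delta[t] * B[t]        # simplified (Euler) discretization of B
        h = Abar * h + Bbar * x[t]
        y[t] = C[t] @ h
    return y

# Selectivity in action: delta = 0 means "ignore this token, keep the state"
rng = np.random.default_rng(0)
L, N = 8, 16
x = rng.normal(size=L)
A = -np.arange(1, N + 1, dtype=float)   # negative reals, HiPPO-flavored init
B, C = rng.normal(size=(L, N)), rng.normal(size=(L, N))
assert np.allclose(selective_scan(x, A, B, C, np.zeros(L)), 0.0)
```

With all step sizes at zero, the state (here starting at zero) is carried through untouched and the input never enters it; large step sizes do the opposite.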

Wait — How Is This Stable?

If you have any background in control theory or dynamical systems, the previous section should make you nervous. Making system matrices input-dependent is textbook nonlinear, time-varying behavior — exactly the kind of thing Ogata’s framework warns you about. In classical control, this is a recipe for instability. So how does Mamba get away with it?

The answer is stability by design, not stability by proof:

  1. The \mathbf{A} matrix is constrained. It’s initialized as a diagonal matrix with negative real values (the HiPPO initialization). Negative eigenvalues mean the state naturally decays toward zero — the system is inherently dissipative. Even when \Delta_t modulates the effective discretized \bar{\mathbf{A}}, the entries stay between 0 and 1 after discretization. The state can never grow unboundedly.

  2. \Delta_t passes through softplus. The softplus function (\log(1 + e^x)) guarantees the step size is always positive. You can’t get a negative step size that would flip the dynamics. Small \Delta means “ignore this input, keep the state.” Large \Delta means “forget the state, absorb this input.” Both are stable operations.

  3. The gating connection is the key insight. As the paper proves, the selective SSM reduces to LSTM-style gating when N = 1. LSTMs have been empirically stable for decades — the forget gate (which is exactly what \Delta becomes) keeps values bounded between 0 and 1. Mamba is doing the same thing with a higher-dimensional state.

  4. LayerNorm after every block. Even if individual activations drift, the normalization recenters everything. This is the same trick that makes deep Transformers trainable.

So the nonlinearity isn’t tamed by mathematical proof — it’s tamed by architectural constraints that limit the reachable dynamics to a safe subset of all possible nonlinear behaviors. A control theorist would call this “stability by design.” The system could be unstable in the general case, but the specific parameterization Mamba uses makes instability unreachable.
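A few lines of Python make constraints 1 and 2 concrete (a = -1 is an illustrative eigenvalue, not a trained value):

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

a = -1.0  # a negative real eigenvalue of the diagonal A (HiPPO-style init)
for z in [-5.0, 0.0, 5.0]:           # whatever the linear projection outputs
    delta = softplus(z)               # constraint 2: step size always positive
    abar = float(np.exp(delta * a))   # constraint 1: discretized transition
    assert delta > 0 and 0 < abar < 1
# Small delta -> abar near 1 (keep the state); large delta -> abar near 0 (forget)
```

No matter what the input-dependent projection produces, the discretized transition stays strictly inside (0, 1), so the state cannot blow up.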

The mechanistic effects of selectivity are:

  • Variable spacing: The model can filter out noise tokens (filler words, padding) by setting their \Delta close to zero, preserving the previous state unchanged.
  • Content-based filtering: Unlike LTI models, which aggregate all information regardless of relevance, selective SSMs can choose what enters the state based on content.
  • Boundary resetting: At sequence boundaries (e.g., when packing multiple documents), the model can reset its state by driving \Delta to infinity, preventing information from leaking across unrelated sequences.

The Mamba Block

Prior SSM architectures (H3, Hyena) interleaved an SSM block with a standard MLP block, mimicking the Transformer’s alternation between attention and feed-forward layers. Mamba simplifies this by combining both into a single homogeneous block:

  1. Input projection: The input is projected to an expanded dimension (expansion factor E=2) through two parallel linear projections, one for the main branch and one for the gate branch.
  2. Short convolution: A 1D causal convolution (kernel size 4) over the main branch, providing local context before the SSM.
  3. Selective SSM: The core selective state space layer processes the convolved representation.
  4. Gated output: The SSM output is element-wise multiplied with a SiLU-activated version of the gate branch.
  5. Output projection: A linear projection maps back to the model dimension.

The entire Mamba architecture is just this block repeated with residual connections and LayerNorm. No attention layers. No separate MLP blocks. Two stacks of the Mamba block are used to match the parameter count of a Transformer’s interleaved MHA + MLP blocks (roughly 12D^2 parameters per pair in both cases).
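The five steps above can be sketched structurally in PyTorch. This is a shape-level illustration only: the selective SSM is replaced by an identity placeholder, and all names are ours rather than the official implementation's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaBlockSketch(nn.Module):
    # Structural sketch: E=2 expansion, causal depthwise conv (kernel 4),
    # SiLU-gated output. Not the real selective SSM.
    def __init__(self, d_model, expand=2, d_conv=4):
        super().__init__()
        d_inner = expand * d_model
        self.in_proj = nn.Linear(d_model, 2 * d_inner)   # main + gate branches
        self.conv = nn.Conv1d(d_inner, d_inner, d_conv,
                              groups=d_inner, padding=d_conv - 1)
        self.ssm = nn.Identity()        # placeholder for the selective SSM
        self.out_proj = nn.Linear(d_inner, d_model)

    def forward(self, x):               # x: (batch, length, d_model)
        main, gate = self.in_proj(x).chunk(2, dim=-1)
        # causal conv: trim the right-side padding back to the input length
        main = self.conv(main.transpose(1, 2))[..., :x.shape[1]].transpose(1, 2)
        main = self.ssm(F.silu(main))
        return self.out_proj(main * F.silu(gate))
```

Stacking this block with residual connections and LayerNorm, and swapping the placeholder for the selective scan, gives the full architecture.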

The Hardware-Aware Scan

The efficient implementation is critical to Mamba’s practicality. The selective scan algorithm:

  1. Loads SSM parameters (\mathbf{A}, \mathbf{B}, \mathbf{C}, \Delta) and input from HBM to SRAM
  2. Performs discretization (computing \bar{\mathbf{A}}, \bar{\mathbf{B}}) entirely in SRAM
  3. Executes the recurrence using a work-efficient parallel scan algorithm in SRAM
  4. Writes only the output y back to HBM
  5. During backpropagation, recomputes intermediate states from saved inputs rather than storing them

This avoids materializing the expanded state (of size B * L * D * N) in HBM. For context: with batch size 16, sequence length 8192, model dimension 2048, and state dimension 16, that’s 4GB of intermediate state per layer that never needs to touch slow memory.
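The reason step 3 can use a parallel scan at all is that a recurrence of the form h_t = a_t * h_{t-1} + b_t is associative under the right combine operator. A toy check in Python (applied sequentially here for clarity; the associativity is what permits a tree-shaped parallel reduction on the GPU):

```python
import numpy as np

# Combine rule: (a1, b1) then (a2, b2) composes to (a1*a2, a2*b1 + b2)
def combine(p, q):
    return (p[0] * q[0], q[0] * p[1] + q[1])

rng = np.random.default_rng(0)
a, b = rng.uniform(0, 1, 8), rng.normal(size=8)

# Plain sequential recurrence (state starts at 0)
h, seq = 0.0, []
for t in range(8):
    h = a[t] * h + b[t]
    seq.append(h)

# Same values via the associative combine, starting from the identity (1, 0)
acc, par = (1.0, 0.0), []
for t in range(8):
    acc = combine(acc, (a[t], b[t]))
    par.append(acc[1])

assert np.allclose(seq, par)
```

Because `combine` is associative, pairs can be merged in any grouping, so the scan runs in O(log L) parallel steps instead of L sequential ones.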


Key Results

Language Modeling — Scaling Laws (Figure 4)

Mamba is trained on The Pile dataset following GPT-3 specifications, from 125M to 1.3B parameters, using the Chinchilla compute-optimal protocol:

Mamba is the first attention-free model to match the performance of Transformer++ (the strongest known Transformer recipe, based on PaLM/LLaMA best practices) across the full scaling curve. All prior subquadratic models (H3, Hyena, RWKV, RetNet) fall below both Transformer and Transformer++ baselines. Mamba matches or exceeds Transformer++ at every scale tested, and the gap widens at longer context lengths.

Zero-Shot Downstream Evaluations (Table 3)

At each model size, Mamba is compared against Pythia (same tokenizer, dataset, training length of 300B tokens):

| Model  | Size | Avg Accuracy (8 tasks) |
| ------ | ---- | ---------------------- |
| Pythia | 160M | 40.6 |
| Mamba  | 130M | 40.1 |
| Pythia | 410M | 48.2 |
| Mamba  | 370M | 50.0 |
| Pythia | 1B   | 53.5 |
| Mamba  | 790M | 57.1 |
| Pythia | 1.4B | 55.2 |
| Mamba  | 1.4B | 59.7 |
| Pythia | 2.8B | 57.2 |
| Mamba  | 2.8B | 63.3 |

At every scale, Mamba outperforms Pythia. Mamba-3B’s quality matches Transformers at roughly twice the parameter count — Mamba-1.4B already exceeds Pythia-2.8B on average accuracy (59.7 vs 57.2).

Inference Throughput (Figure 8)

On an A100 80GB GPU with prompt length 2048:

  • Mamba-1.4B achieves up to ~1,500 tokens/second at batch size 128
  • Transformer-1.3B achieves up to ~300 tokens/second at the same batch size

5x higher generation throughput, and the gap grows with batch size because Mamba has no KV cache competing for GPU memory. The Transformer runs out of memory at large batch sizes while Mamba keeps scaling.

The selective scan itself is up to 40x faster than a standard PyTorch scan implementation, and up to 3x faster than optimized convolution-based SSMs at sequence length 2K+.

Synthetic Tasks — Selective Copying and Induction Heads

These diagnostic tasks reveal the mechanism:

  • Selective Copying: Mamba with S6 (selective SSM) achieves 99.8% accuracy. All non-selective variants (S4, Hyena with S4 inner layer) fail, achieving 18-57% accuracy. Selection is the difference.
  • Induction Heads: Mamba solves the task perfectly and generalizes to sequences 4000x longer than training (trained on length 256, evaluated up to length 1M). No other method generalizes beyond 2x the training length.

DNA and Audio

Beyond language, Mamba demonstrates broad applicability:

  • DNA modeling: Mamba matches Transformer++ and HyenaDNA with 3-4x fewer parameters on genomic pretraining. On species classification of great apes (99% DNA similarity), Mamba is the only model whose accuracy improves with longer context, reaching ~75% accuracy at 1M sequence length.
  • Audio generation: A small 6.1M-parameter Mamba model outperforms SampleRNN (35M), WaveNet (4.2M), and SaShiMi (5.8M) on unconditional speech generation, reducing FID from 0.74 to 0.52. A larger 24.3M Mamba model achieves FID of 0.36, cutting the gap to real speech by more than half.

Ablations (Tables 6-10)

The ablation study confirms several architectural decisions:

  • Selectivity is the single biggest factor. Replacing non-selective S4 with selective S6 drops perplexity from 10.34 to 8.95 in the H3 architecture, and from 10.56 to 8.69 in Mamba.
  • \Delta is the most important selective parameter. Making only \Delta input-dependent (while keeping \mathbf{B} and \mathbf{C} fixed) gets most of the improvement, consistent with its connection to RNN gating.
  • Increasing state dimension N helps dramatically when selective. With constant B and C, increasing N from 1 to 16 barely changes perplexity (9.88 -> 9.81). With selective B and C, the same increase drops perplexity from 9.73 to 8.71. Selectivity unlocks the value of larger state.
  • The Mamba architecture performs equivalently to H3 while being simpler (one block type instead of two).

Limitations

  1. Scale ceiling is untested. All experiments are at 1.3B parameters or below. The paper explicitly acknowledges that scaling to 7B+ may require “further engineering challenges and adjustments.” Subsequent work (Mamba-2, Jamba) has partially addressed this, but Mamba has never been scaled to the 70B+ regime where Transformers dominate.

  2. No in-context learning analysis. The paper doesn’t evaluate few-shot prompting, instruction following, or other emergent abilities that appear at scale in Transformers. These capabilities depend heavily on the kind of content-based retrieval that attention excels at, and it’s unclear whether selectivity provides a sufficient substitute.

  3. Finite state means lossy compression. The fixed-size state must compress the entire history into N * D numbers (where N=16 and D is the model dimension). For a 1.4B Mamba model with D=2048, that’s about 32K floating-point values summarizing potentially millions of tokens. Attention’s KV cache is expensive precisely because it avoids this compression — every past token’s representation is preserved exactly. For tasks requiring precise recall of specific details from deep in the context, this compression may be a fundamental limitation.

  4. Hybrid architectures may be necessary. Subsequent work like Jamba (paper #11 on our list) suggests that the best practical architecture combines Mamba layers with some attention layers, getting linear-time scaling for most of the network while retaining attention’s precise retrieval where it matters most. This pattern — Mamba for bulk processing, attention for precision — is becoming the pragmatic consensus.

  5. Training is not actually faster than Transformers. Mamba’s scan-based training, while much faster than naive recurrence, is not faster than highly optimized attention (FlashAttention-2). The speed advantage is primarily at inference time, where the constant-size state eliminates the KV cache bottleneck. During training, the parallelizable convolution form of attention is hard to beat.


Key Takeaways

  1. Selectivity is the missing ingredient for SSMs. The gap between LTI state space models and Transformers on language wasn’t about architecture or scale — it was about the ability to decide what matters based on content. Making the SSM parameters input-dependent closes this gap entirely. The insight is almost embarrassingly simple in retrospect: a model that treats every input the same can’t compete with one that distinguishes signal from noise.

  2. Linear-time sequence processing is now real, not theoretical. Prior “efficient” alternatives to attention either degraded quality (linear attention, Performer) or only worked on narrow domains (S4 on audio). Mamba is the first to match Transformer quality on language while maintaining true linear scaling. This is not an approximation of attention — it’s a different mechanism that happens to work as well.

  3. The KV cache is the Transformer’s Achilles heel for inference, and Mamba eliminates it entirely. During generation, a Transformer must store and access the Key and Value representations for every token in the context. This cache grows linearly with sequence length and consumes GPU memory that could otherwise be used for larger batch sizes. Mamba’s fixed-size state means inference memory is constant regardless of context length — which is why it achieves 5x throughput at the same model size.

  4. Hardware-aware algorithm design is not optional. The mathematically “obvious” implementation of selective SSMs (sequential recurrence) would be catastrophically slow. The actual speed comes from Tri Dao’s fused kernel that exploits the GPU memory hierarchy — keeping computation in SRAM, avoiding HBM materialization of intermediate states, and using parallel scan for the recurrence. This is the same lesson as FlashAttention: the algorithm and its hardware implementation are inseparable. A good idea with a bad implementation is a bad idea.

  5. The Transformer’s dominance is architectural, not fundamental. For seven years after “Attention Is All You Need,” no alternative architecture could match the Transformer on language. Mamba breaks that streak. This doesn’t mean the Transformer is dead — it means the design space is larger than we thought. The future likely involves hybrid architectures (Jamba, on our reading list, is an early example) that combine the strengths of both: attention’s precise retrieval with Mamba’s efficient bulk processing.

  6. Mamba connects back to RNNs in a deep way. The paper proves (Theorem 1) that selective SSMs generalize classical RNN gating. The LSTM’s forget gate, introduced in 1997, was an early form of selectivity. Mamba takes the same idea — let the model decide what to remember and what to forget — and implements it with modern hardware awareness and structured state spaces. The field didn’t abandon recurrence because the idea was wrong. It abandoned it because RNNs couldn’t be parallelized efficiently. Mamba shows that with the right implementation, recurrence can be both parallelizable (during training, via scan) and memory-efficient (during inference, via constant-size state).

  7. This paper is a two-author tour de force. Albert Gu built the mathematical foundation of structured state space models (S4, HIPPO, H3). Tri Dao built the hardware-aware implementation framework (FlashAttention). Mamba is what happens when these two threads converge. The selection mechanism is Gu’s contribution; the efficient scan kernel is Dao’s. The result is greater than the sum of its parts — neither the math nor the implementation alone would have been sufficient to challenge the Transformer.


Summary prepared by Darrell Thomas, 2026