
Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models (Engram)

Darrell Thomas

Authors: Xin Cheng, Wangding Zeng, Damai Dai, Qinyu Chen, Bingxuan Wang, et al. (Peking University & DeepSeek-AI)
Published: January 12, 2026 | arXiv:2601.07372v1
Source: arxiv.org/abs/2601.07372
Code: github.com/deepseek-ai/Engram


The Big Idea

Transformers are forced to waste expensive computation simulating what is essentially a lookup operation. When your model encounters “Diana, Princess of Wales,” it burns through multiple layers of attention and feed-forward networks just to reconstruct a static fact that could have been retrieved from a table. Engram fixes this by adding a second axis of sparsity alongside Mixture-of-Experts: a massive N-gram embedding table that provides O(1) constant-time knowledge retrieval via deterministic hashing. The local context (the last 2-3 tokens) is hashed to an index, and a pre-trained embedding vector is retrieved and fused with the model’s hidden state. This offloads static pattern recognition from the neural network, freeing the transformer’s depth and attention capacity for what it’s actually good at: reasoning, long-range context, and compositional logic. Scaled to 27B parameters, Engram outperforms an iso-parameter, iso-FLOPs MoE baseline across the board — and the gains are largest not on knowledge retrieval (where you’d expect them) but on reasoning and code.


Why This Paper Matters

The Home Lab Angle

If you are running models on your own hardware — a workstation with a decent GPU and a healthy amount of system RAM — this paper should immediately grab your attention.

The fundamental problem with running large models locally is VRAM. A 70B dense model needs roughly 140GB in FP16, far beyond any consumer GPU. MoE models like DeepSeek-V3 help by only activating a fraction of parameters per token, but even the activated experts and the routing logic need to live on the GPU. The inactive experts still consume memory somewhere, and MoE’s dynamic routing means you can’t predict which experts you’ll need next — making offloading to system RAM painful because you don’t know what to prefetch.

Engram changes this equation entirely. Its memory lookups are deterministic. The moment you see the input tokens, you know exactly which embedding table entries you’ll need, because the lookup is a hash function over the local N-gram context. There is no learned router, no dynamic decision. This means:

  1. The entire embedding table can live in host DRAM (your system RAM), not on the GPU. The paper demonstrates offloading a 100B-parameter embedding table to CPU memory with only a 2.8% throughput penalty on an 8B backbone. That is 100 billion parameters of model capacity accessed through your PCIe bus with near-zero cost.

  2. Prefetching works perfectly. Because the lookup indices are known as soon as the input tokens arrive — before the transformer layers even begin computing — the system can asynchronously fetch the needed embeddings from host memory while the GPU is busy computing the preceding transformer block. The PCIe transfer is completely hidden behind useful computation.

  3. Zipf’s law is on your side. Natural language N-grams follow a Zipfian distribution: a small fraction of patterns (“the”, “of the”, “in the”) account for the vast majority of lookups. This means a multi-level cache hierarchy works beautifully. Keep the hot embeddings in GPU HBM, the warm ones in host DRAM, and the long tail on NVMe SSD. The paper explicitly describes this strategy and notes that it allows Engram to scale to massive memory capacities with minimal impact on effective latency.
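
The tiering strategy in point 3 can be sketched with a toy two-level store (a hypothetical class of my own, not the paper's code): a small LRU “hot” tier standing in for GPU HBM, backed by a large “cold” table standing in for host DRAM or NVMe. Under a Zipfian access stream, nearly all lookups hit the hot tier.

```python
# Illustrative sketch, not the paper's implementation: a tiny hot cache
# absorbs a Zipf-skewed stream of embedding lookups.
from collections import Counter, OrderedDict

class TieredEmbeddingStore:
    def __init__(self, cold_store, hot_capacity):
        self.cold = cold_store          # full table (host DRAM / NVMe)
        self.hot = OrderedDict()        # small LRU tier (GPU HBM)
        self.hot_capacity = hot_capacity
        self.stats = Counter()

    def lookup(self, index):
        if index in self.hot:           # hot hit: no "PCIe" transfer needed
            self.hot.move_to_end(index)
            self.stats["hot"] += 1
        else:                           # miss: fetch from cold tier, then cache
            self.stats["cold"] += 1
            self.hot[index] = self.cold[index]
            if len(self.hot) > self.hot_capacity:
                self.hot.popitem(last=False)   # evict least recently used
        return self.hot[index]

# Zipfian access stream: three slots dominate, plus a long tail.
cold = {i: f"emb_{i}" for i in range(1000)}
store = TieredEmbeddingStore(cold, hot_capacity=8)
stream = [0, 1, 2] * 100 + list(range(100))
for idx in stream:
    store.lookup(idx)
assert store.stats["hot"] > store.stats["cold"]   # hot tier absorbs most traffic
```

Even with a hot tier of only 8 slots, 300 of the 400 lookups in this toy stream are served without touching the cold store — the shape of the win the paper attributes to Zipf's law.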

For a home user with, say, a 24GB GPU and 128GB of system RAM, this architecture opens up a fundamentally different design space. You could run a model where 24GB of active computation lives on the GPU, but hundreds of gigabytes of knowledge capacity sits in your system RAM, getting prefetched on demand with almost no speed penalty. Your RAM becomes a first-class participant in the model, not just a staging area.

This is the same theme as DeepSeek-V3: the best engineering comes from working with constraints rather than ignoring them. If you can’t fit everything in VRAM, don’t just quantize until the model barely works. Instead, separate the parts of the model that need GPU compute (attention, MoE routing, nonlinear reasoning) from the parts that are fundamentally just storage (static knowledge patterns), and put each where it belongs.


Technical Architecture

How Engram Works

The Engram module operates in two phases at each position: retrieval and fusion.

Phase 1 — Retrieval via Hashed N-grams:

  1. Tokenizer Compression: Raw token IDs are mapped to canonical IDs via a surjective function that collapses surface variants (“Apple” and “ apple”) into the same identifier. This achieves a 23% reduction in effective vocabulary size, increasing semantic density.

  2. N-gram Formation: For each position t, extract suffix N-grams of orders 2 and 3 from the compressed token IDs. So for the sequence “Alexander the Great”, the 2-gram at “Great” is (the, Great) and the 3-gram is (Alexander, the, Great).

  3. Multi-Head Hashing: Each N-gram is hashed using K independent hash functions (multiplicative-XOR hashing) to index into K separate embedding tables. Multiple hash heads mitigate collisions — different heads will collide on different entries, and the downstream gating mechanism learns to suppress the noise.

  4. Concatenation: All retrieved embeddings (across N-gram orders and hash heads) are concatenated into a single memory vector e_t.

This entire retrieval process is O(1) per token. No attention, no matrix multiplication, no learned routing. Just hashing and table lookup.
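
The four retrieval steps above can be sketched in a few lines of Python (hash constants, table sizes, and function names are illustrative, not the paper's exact scheme):

```python
# Sketch of Phase 1: compression -> N-gram formation -> multi-head hashing
# -> concatenation. All constants here are toy values.
TABLE_SIZE = 1 << 12   # slots per table (illustrative)
EMB_DIM = 4            # tiny embedding dim for the sketch
NUM_HEADS = 2          # K independent hash heads per N-gram order

def compress(token_id, canonical_map):
    """Step 1: surjective map collapsing surface variants to one canonical ID."""
    return canonical_map.get(token_id, token_id)

def mult_xor_hash(gram, seed):
    """Step 3: multiplicative-XOR hash of an N-gram into a table slot."""
    h = seed
    for t in gram:
        h = ((h ^ t) * 0x9E3779B1) & 0xFFFFFFFF
    return h % TABLE_SIZE

def retrieve(token_ids, t, tables, canonical_map):
    """Steps 2-4: form suffix 2-/3-grams at position t, hash with each head,
    look up each table, and concatenate into one memory vector e_t."""
    ids = [compress(x, canonical_map) for x in token_ids]
    e_t = []
    for n in (2, 3):
        if t + 1 < n:
            continue                       # not enough context for this order
        gram = tuple(ids[t + 1 - n : t + 1])
        for head in range(NUM_HEADS):
            slot = mult_xor_hash(gram, seed=head + 1)
            e_t.extend(tables[(n, head)][slot])   # O(1) lookup, no matmul
    return e_t

# Toy tables: one embedding per slot, per (order, head) pair.
tables = {(n, h): [[0.0] * EMB_DIM for _ in range(TABLE_SIZE)]
          for n in (2, 3) for h in range(NUM_HEADS)}
e = retrieve([7, 3, 3, 11], t=3, tables=tables, canonical_map={})
assert len(e) == 2 * NUM_HEADS * EMB_DIM   # both orders available at t=3
```

Note the property the offloading section relies on: `retrieve` is a pure function of the input token IDs, so every slot it will touch is known before any transformer layer runs.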

Phase 2 — Context-Aware Fusion:

The retrieved embedding is static — it doesn’t know the broader context. The fusion step makes it context-aware:

  1. Gating: The current hidden state h_t (which has already passed through at least one attention layer and carries global context) serves as a Query. The retrieved memory e_t is projected into Key and Value vectors. A scalar gate α_t is computed via scaled dot product between the normalized hidden state and the normalized Key. If the retrieved memory doesn’t match the current context (e.g., hash collision, or the N-gram is irrelevant in this context), the gate approaches zero.

  2. Convolution: A lightweight depthwise causal convolution (kernel size 4, dilation matching the max N-gram order) smooths and mixes the gated values across adjacent positions, expanding the receptive field.

  3. Residual Addition: The output is added to the hidden state via a residual connection, exactly like any other transformer sub-layer.
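
A toy sketch of the gating and residual steps (the summary above doesn't pin down the exact gate nonlinearity, so a sigmoid over the scaled dot product stands in; the depthwise convolution is elided and all weights are illustrative):

```python
# Sketch of Phase 2 fusion: gate the retrieved memory by its agreement with
# the context-bearing hidden state, then add it residually.
import math

def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def fuse(h_t, e_t, W_k, W_v):
    """h_t: current hidden state (Query, carries global context).
    e_t: retrieved static memory vector. Returns the updated hidden state."""
    k = matvec(W_k, e_t)                          # project memory -> Key
    v = matvec(W_v, e_t)                          # project memory -> Value
    q_hat, k_hat = l2_normalize(h_t), l2_normalize(k)
    score = sum(a * b for a, b in zip(q_hat, k_hat)) / math.sqrt(len(h_t))
    alpha = 1.0 / (1.0 + math.exp(-score))        # scalar gate (illustrative sigmoid)
    # Residual addition: a low gate means irrelevant memory barely perturbs h_t.
    return [h + alpha * x for h, x in zip(h_t, v)]

d = 4
identity = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
h = [0.5, -0.2, 0.1, 0.9]
e = [1.0, 0.0, 0.0, 0.0]
out = fuse(h, e, identity, identity)
assert len(out) == d
```

The key design point survives the simplification: the hidden state decides how much of the static lookup to admit, which is how hash-collision noise gets suppressed downstream.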

Where Engram Goes in the Network

This is a co-design problem between modeling quality and system latency:

  • Modeling preference: Inject early (Layer 2) to offload static pattern reconstruction before the backbone wastes depth on it.
  • System preference: Inject deeper so that the preceding transformer blocks provide a computation window to hide the PCIe memory transfer latency.
  • Solution: Split the Engram budget across two insertion points (Layers 2 and 15 in the 27B model). Early injection handles local pattern offloading; late injection benefits from richer contextual gating.

The Sparsity Allocation Law

The paper formalizes a key question: given a fixed parameter and compute budget, how should you split sparse capacity between MoE experts and Engram memory?

They define an allocation ratio ρ, where ρ = 1 means all sparse capacity goes to MoE experts and ρ = 0 means all goes to Engram. Sweeping ρ across two compute regimes reveals a U-shaped curve: pure MoE (ρ = 1) and pure Engram (ρ = 0) are both suboptimal. The sweet spot is around ρ = 0.75–0.80 — roughly 75-80% of sparse capacity in MoE experts, 20-25% in Engram memory.

This isn’t just an empirical finding. It reflects a structural truth about language: some processing requires dynamic computation (reasoning, composition, inference), and some is fundamentally pattern retrieval (named entities, formulaic phrases, local co-occurrence statistics). MoE handles the first. Engram handles the second. Allocating everything to one axis wastes capacity on work the other axis does better.
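
As a sanity check on the arithmetic, the allocation ratio amounts to a simple budget split (the function name is mine, for illustration):

```python
# Splitting a fixed sparse-parameter budget by the allocation ratio rho.
def split_sparse_budget(total_sparse_params, rho):
    """rho = 1.0 -> all sparse capacity to MoE experts; rho = 0.0 -> all to Engram."""
    assert 0.0 <= rho <= 1.0
    moe_params = total_sparse_params * rho
    engram_params = total_sparse_params * (1.0 - rho)
    return moe_params, engram_params

# At the paper's sweet spot, a 20B sparse budget puts ~15B in experts
# and ~5B in the Engram table.
moe, engram = split_sparse_budget(20e9, rho=0.75)
assert (moe, engram) == (15e9, 5e9)
```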


Key Results

Pre-training Benchmarks (Table 1)

All models trained on 262B tokens, matched on activated parameters (3.8B):

Benchmark     | Dense-4B | MoE-27B | Engram-27B | Engram-40B
--------------|----------|---------|------------|-----------
MMLU          | 48.6     | 57.4    | 60.4       | 60.6
CMMLU         | 47.9     | 57.9    | 61.9       | 63.4
BBH           | 42.8     | 50.9    | 55.9       | 57.5
MATH          | 15.2     | 28.3    | 30.7       | 30.6
HumanEval     | 26.8     | 37.8    | 40.8       | 38.4
DROP          | 41.6     | 55.7    | 59.0       | 60.7
ARC-Challenge | 59.3     | 70.1    | 73.8       | 76.4
TriviaQA      | 33.0     | 48.8    | 50.7       | 51.8

Engram-27B outperforms MoE-27B on every single benchmark despite having the same total parameters, the same activated parameters, and the same training compute. The gains are not just on knowledge-retrieval tasks — the biggest improvements are in reasoning (BBH: +5.0) and code (HumanEval: +3.0).

Long-Context Performance (Table 2)

After context extension training to 32K tokens:

Task               | MoE-27B | Engram-27B
-------------------|---------|-----------
Multi-Query NIAH   | 84.2    | 97.0
Variable Tracking  | 77.0    | 87.2
Multi-Value NIAH   | 92.7    | 95.5
Question Answering | 34.5    | 40.5

Engram dramatically improves long-context performance. By handling local pattern recognition through lookups, it frees attention heads to focus on genuinely long-range dependencies rather than wasting capacity on local co-occurrence patterns.

Offloading Throughput (Table 4)

100B-parameter Engram table offloaded entirely to host DRAM:

Backbone | Baseline (tok/s) | + 100B Engram Offload (tok/s) | Overhead
---------|------------------|-------------------------------|---------
Dense-4B | 9,031            | 8,858                         | 1.9%
Dense-8B | 6,315            | 6,140                         | 2.8%

100 billion parameters of knowledge capacity, accessed from CPU memory over PCIe, with under 3% throughput loss. And this is the conservative baseline — the experiment forces all lookups through PCIe without any caching. A production implementation with Zipfian-aware caching would be even faster.

What Happens When You Remove Engram at Inference (Figure 6)

A revealing ablation: train with Engram, then disable it at inference. Factual knowledge collapses (TriviaQA retains only 29% of performance), but reading comprehension barely moves (C3 retains 93%). This confirms the architectural hypothesis: Engram acts as the model’s primary knowledge store, while the transformer backbone handles contextual reasoning. The two subsystems are cleanly separated.


Mechanistic Insight: Effective Depth

The paper’s most elegant finding is what Engram does to the network’s internal representations. Using LogitLens (projecting each layer’s hidden state to the output vocabulary), they show that Engram models reach “prediction-ready” representations much earlier in the network. Using CKA (Centered Kernel Alignment), they demonstrate that layer 5 of Engram-27B produces representations equivalent to approximately layer 12 of MoE-27B.

In plain terms: Engram effectively doubles the model’s useful depth by relieving early layers from the tedious work of reconstructing static knowledge. Those layers can instead begin higher-level feature composition immediately. The model doesn’t get more layers — it gets more value out of the layers it has.
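
The LogitLens probe itself is easy to sketch (toy weights, not the paper's model): project each layer's hidden state through the output unembedding matrix and see at which depth the eventual prediction first tops the ranking.

```python
# Illustrative LogitLens: per-layer argmax over the vocabulary.
def logit_lens(hidden_states, W_unembed):
    """hidden_states: one vector per layer; W_unembed: vocab_size x d matrix.
    Returns the argmax vocabulary id at each layer."""
    preds = []
    for h in hidden_states:
        logits = [sum(w * x for w, x in zip(row, h)) for row in W_unembed]
        preds.append(max(range(len(logits)), key=logits.__getitem__))
    return preds

# Toy trajectory over a 2-word vocab: the representation becomes
# "prediction-ready" (argmax = token 1, the final answer) from layer 1 onward.
W = [[1.0, 0.0], [0.0, 1.0]]
states = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]
assert logit_lens(states, W) == [0, 1, 1]
```

The paper's claim, in these terms, is that Engram models lock in the correct argmax many layers earlier than the MoE baseline.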


Limitations

  1. Training complexity: Engram tables are sharded across GPUs during training using all-to-all communication, similar to expert parallelism in MoE. This adds infrastructure complexity.
  2. Hash collisions: Multi-head hashing mitigates but doesn’t eliminate collisions. The context-aware gating learns to suppress noise from collisions, but some information loss is inherent.
  3. N-gram order: The paper uses 2-grams and 3-grams. Allocating capacity to 4-grams was slightly suboptimal at the tested scale (it dilutes capacity from more frequent lower-order patterns), though higher-order patterns may become beneficial with larger memory budgets.
  4. Under-training at scale: Engram-40B doesn’t uniformly dominate Engram-27B, likely because 262B training tokens isn’t enough to fully utilize the expanded memory capacity. The training loss gap was still widening at the end of training.
  5. Inference-only validation of offloading: The 100B offloading experiment uses a simplified setup (nano-vLLM). Production deployment with full MoE + Engram + offloading hasn’t been demonstrated end-to-end.

Key Takeaways

  1. Not all parameters need to compute. Engram introduces a clean separation: parameters that store static knowledge (embedding tables) vs. parameters that perform dynamic computation (attention, MoE experts). Storage parameters don’t need FLOPs — they just need to be addressable. This is a fundamentally different scaling axis from making the network wider or deeper.

  2. Deterministic addressing enables offloading that dynamic routing cannot. Because Engram’s lookups are hash-based and depend only on input tokens (not on learned hidden states), the system knows what memory it needs before the transformer even starts computing. This makes prefetching from host RAM trivially effective. MoE experts, by contrast, require the router’s hidden-state-dependent decision, which means you can’t prefetch until the computation is already underway.

  3. Your system RAM can become a first-class model component. A 100B embedding table in host DRAM costs under 3% throughput. For a home user with a single consumer GPU and abundant system RAM, this changes the calculus of what models you can run. VRAM handles the hot compute path; RAM handles the knowledge store; NVMe handles the long tail. Each tier of your memory hierarchy contributes meaningfully.

  4. Offloading static knowledge improves reasoning, not just recall. The counterintuitive result: Engram’s biggest benchmark gains are on BBH (+5.0), ARC-Challenge (+3.7), HumanEval (+3.0), and MATH (+2.4) — tasks that require reasoning, not pattern matching. By relieving early layers from reconstructing facts, Engram gives the network more effective depth for compositional thinking.

  5. MoE and Engram are complementary, not competing. The U-shaped sparsity allocation curve proves that the optimal model uses both. Roughly 75-80% of sparse capacity in MoE experts for dynamic computation, 20-25% in Engram for static retrieval. Neither axis alone is sufficient.

  6. This is the natural evolution of the DeepSeek architecture. DeepSeek-V3 built the MoE foundation with MLA and auxiliary-loss-free load balancing. DeepSeek-R1 showed that reasoning emerges from RL on that foundation. Engram adds the missing piece: a memory primitive that handles what MoE was never designed for — cheap, massive-scale knowledge storage — freeing the compute pathway to focus entirely on reasoning. The trajectory is clear: separate what needs to think from what just needs to be remembered.


Where to Start

Where would I start with all this? Well first, just try to understand how the LLM works. Raschka’s Build a Large Language Model (From Scratch) scratches that itch. You really get a good look under the hood. You aren’t going to build a racecar — you are going to build a go-kart. But an engine is an engine. Once you understand how that engine works, ideas for improving it will come to you naturally.


Summary prepared by Darrell Thomas, 2026