@karminski3: Magic! DeepSeekV4 context memory compressed to 1/10! Everyone knows DeepSeekV4 supports 1M context and is heavily optimized. To actually use 1M context, VRAM usage is only about 10GB (compared to DeepSeek-V3.2 which needs about…


Summary

FlashMemory-DeepSeek-V4 proposes a novel inference paradigm called Lookahead Sparse Attention (LSA), which uses a neural memory indexer to actively predict future context needs, compressing physical KV cache usage to 13.5% of full context baseline while improving average accuracy by 0.6%. This method adopts a decoupled training strategy that allows independent training of the indexer without loading the base model, significantly reducing training cost.


Cached at: 06/12/26, 07:01 PM

Magic! DeepSeekV4 Context Memory Compressed to 1/10!

Everyone knows that DeepSeekV4 supports 1M context and has been heavily optimized. Actually using the full 1M context takes only about 10 GB of GPU memory (DeepSeek-V3.2 needs about 84 GB). Then I saw the FlashMemory paper, which compresses memory usage directly to 1.3 GB! And the output quality actually improves instead of dropping!

Dude, you can fool your buddies, but fooling yourself is pointless. Really? Performance increases after compression? I quickly checked the paper details:

Let’s review the traditional approach: Every time the model generates a token, it has to re-read hundreds of thousands of previous tokens (that’s global attention).

FlashMemory’s approach: It predicts what will be needed in the future. It has a built-in Neural Memory Indexer (essentially a small model) that can actively predict which segments of the historical text will be needed for the upcoming generation. Then it pre-fetches these segments. As long as the hit rate is extremely high, this improvement is definitely effective. In other words, its assumption is that not everything in the KV cache is needed for generating every token; you only need to load on demand in advance.

It’s like when doing homework, you spread reference materials all over the table, and then optimize by taking photos of the parts you need and just looking at the photos when needed.

That sounds simple, but the real difficulty is that training a dedicated small indexer model would ordinarily require loading the DeepSeek-V4 model into GPU memory and training the two together, which is quite computationally expensive.

Then comes the second highlight of this paper: they introduced decoupled training. They treat this indexer as a standard dual-encoder (similar to models used for search/retrieval) and train it independently. In this process, there is no need to load the massive DeepSeek-V4 base model into GPU memory. This dramatically reduces training costs and is compatible with standard retrieval training frameworks. (Simply put, it’s trained using a general method, predicting which long sentences to retrieve based on the query. So it’s actually a general model.)

Sounds plausible, but that only reduces memory usage; why does performance improve? The answer is attention denoising. Because only the memory chunks most relevant to the current generation are fetched into GPU memory, the model doesn’t see irrelevant, redundant information during computation. This naturally acts as a ‘denoising’ effect, which is why accuracy slightly improves while memory usage decreases. Official tests on long-text benchmarks (such as LongBench-v2) show an average absolute accuracy improvement of 0.6%.

(Actually, there are also details about how to evict data from GPU memory and how to predict data for preloading. These parts are also great and insightful. I suggest reading the original paper, but I can’t write more due to space constraints.)

Paper: http://arxiv.org/abs/2606.09079 Project: http://github.com/libertywing/FlashMemory-Deepseek-V4…

#FlashMemory #DeepSeekV4 #FlashMemoryDeepseekV4


FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

Source: https://arxiv.org/html/2606.09079

Affiliations: 1. Independent Researchers; 2. Tencent; 3. The Hong Kong University of Science and Technology (Guangzhou); 4. Tsinghua University. (* Equal contribution; † Project Lead)

Qifan Zhang, Jiachen Yu, Tian Liang, Dongyang Ma, Xiang Hu, Zibo Lin, Chunyang Li, Zhichao Wang, Miao Peng, Nuo Chen, Jia Li, Yujiu Yang, Haitao Mi, Dong Yu

Abstract

Conventional LLMs keep the full KV cache loaded during decoding, causing a severe GPU memory bottleneck for ultra-long context serving. In this report, we propose Lookahead Sparse Attention (LSA), a novel inference paradigm powered by a Neural Memory Indexer built upon the DeepSeek-V4 architecture. Rather than passively attending to all historical tokens, LSA proactively predicts future context demands and preserves only the query-critical KV chunks in the GPU memory. Crucially, we instantiate this architecture via a backbone-free decoupled training strategy. By formulating the indexer as a standard dual-encoder architecture, we train it independently using standard retrieval training frameworks without ever loading the massive backbone model into GPU memory.

We demonstrate that this “less is more” paradigm substantially improves serving efficiency while acting as an effective attention denoiser in tasks that rely on long-term global memory. Across primary long-context evaluation suites (e.g., LongBench-v2, LongMemEval, and RULER), FlashMemory-DeepSeek-V4 (FM-DS-V4) compresses the average physical KV cache footprint down to merely 13.5% of the full-context baseline, while consistently preserving or slightly elevating downstream accuracy (+0.6% absolute margin on average). Crucially, at extreme 500K scales, FlashMemory suppresses the physical KV cache overhead by over 90% without destabilizing the backbone’s core reasoning capacities.

Project Status: Due to organizational realignments, the Project Lead has parted ways with Tencent, and this project has been suspended. This technical report documents our preliminary breakthroughs and verified checkpoints. We firmly believe in the potential of the FlashMemory paradigm for infinite long-context intelligence. If you or your organization are interested in supporting or collaborating on the next phase (e.g., compute sponsorship, scaling tests, or research integration), please contact the Project Lead.

Figure 1: Performance and hardware efficiency of FlashMemory-DeepSeek-V4. On LongBench-v2 and RULER, FM-DS-V4 consistently matches or exceeds DS-V4-Flash, while reducing KV cache overhead to merely 13.5% on average. KV cache memory footprints are measured via sglang deployment logs on an 8×H20 GPU server.

1 Introduction

The extension of Large Language Models (LLMs) toward ultra-long context windows is fundamentally bottlenecked by memory capacity. While modern sparse attention mechanisms successfully reduce the computational FLOPs per decoding step to a near-constant level, the GPU memory footprint of the Key-Value (KV) cache still scales linearly with the sequence length. Recent foundation models like DeepSeek-V4 (https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro) and Qwen3.5 (https://huggingface.co/Qwen/Qwen3.5-397B-A17B) attempt to slow down this memory explosion by incorporating heavily compressed attention (HCA) or linear attention layers [deepseekv4, qwen35blog]. However, to preserve fine-grained factual recall, these models must still retain a significant portion of low-compression or full-attention layers [deepseekv4]. Consequently, they only mitigate the rate of memory growth rather than eliminating the linear scaling bottleneck itself.

This work stems from a simple yet striking observation of resource waste during inference: conventional LLMs fully load and carry the entire KV cache in GPU memory even when the active decoding step is completely independent of the historical context. Our empirical analysis of real-world inference logs reveals that over 90% of user requests with contexts longer than 64K tokens can be accurately resolved using only the last 8K tokens. This indicates that an overwhelming majority of GPU memory is squandered on inactive context that contributes nothing to the current token prediction. Conversely, simply discarding history via standard sliding-window attention fails entirely on the remaining tasks that genuinely require global context synthesis. This hard contradiction (supporting deep global reasoning without paying the full GPU memory tax for local generation steps) is the root cause behind the prohibitive cost of long-context serving.

To resolve this dilemma, we present Lookahead Sparse Attention (LSA). Following the structural compression spirit of DeepSeek-V4 [deepseekv4], our architecture retains all highly condensed HCA chunks (128:1 compression ratio) to maintain global context awareness. However, we fundamentally upgrade the conventional Compressed Sparse Attention (CSA) layers into our predictive LSA paradigm. With LSA, the model no longer needs to keep all fine-grained context resident; instead, driven by a highly efficient Neural Memory Indexer, the system triggers periodically at a fixed decoding interval of $\tau$ steps (e.g., $\tau = 64$) to evaluate current hidden states and proactively fetch only the critical CSA chunks into GPU memory. Crucially, we formulate the indexer as a standalone dual-encoder architecture. This decoupled design allows us to train the indexer independently on pre-computed hidden states and labels, completely bypassing the prohibitive memory and computational overhead of full-model fine-tuning or joint distillation.
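The periodic trigger described above can be pictured with a toy decoding loop. This is only a sketch under stated assumptions: `simulate_lookahead_decode` and `fetch_fn` are hypothetical names, and `fetch_fn` stands in for the Memory Indexer plus the transfer of chunks into GPU memory, neither of which is modelled here.

```python
def simulate_lookahead_decode(num_steps, tau, fetch_fn):
    """Toy decoding loop that refreshes the GPU-resident chunk set every tau steps.

    fetch_fn(t) is a hypothetical stand-in for "run the Memory Indexer at step t
    and return the ids of the chunks to keep in GPU memory".
    """
    resident = set()
    trigger_steps = []
    for t in range(num_steps):
        if t % tau == 0:                  # indexer fires only at multiples of tau
            resident = set(fetch_fn(t))   # chunks prefetched for steps [t, t + tau - 1]
            trigger_steps.append(t)
        # Per-token attention over `resident` plus the sliding window would run here.
    return resident, trigger_steps


# Hypothetical fetch rule, used only to make the loop observable.
resident, trigger_steps = simulate_lookahead_decode(
    num_steps=200, tau=64, fetch_fn=lambda t: [0, t // 64]
)
print(trigger_steps)  # [0, 64, 128, 192]
```

With $\tau = 64$, the indexer runs four times over 200 decoding steps rather than once per token, which is the amortization the fixed interval is meant to buy.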

Experimental results across three distinct long-context benchmarks confirm the robustness and striking efficiency of LSA. In scenarios requiring long-term memory and deep understanding, LSA acts as an effective attention denoiser. Specifically, averaged across LongBench-v2, LongMemEval, and RULER, LSA reduces GPU memory consumption to merely 13.5% of the baseline (an 86.5% reduction) while outperforming the standard DeepSeek-V4-Flash by +0.6% absolute accuracy. At 500K context lengths, the memory reduction reaches up to 90%.

In summary, our core contributions are threefold:

  • Lookahead Sparse Attention (LSA) Paradigm: We propose LSA, a novel inference paradigm that eliminates the hard contradiction between long-context modeling capabilities and hardware efficiency by proactively predicting and fetching query-critical KV chunks on demand.
  • Backbone-Free Decoupled Training: We introduce an ultra-lightweight training strategy that physically isolates the indexer from the host LLM. Formulated as a standalone dual-encoder trained on pre-computed representations, the indexer can be optimized independently in just a single H20 GPU hour without ever loading the massive backbone model.
  • Breakthrough in Efficiency: Extensive evaluations show that LSA reduces GPU memory to merely 13.5% of the baseline (up to 90% reduction at 500K) while maintaining comparable accuracy to the full-attention baseline.

Figure 2: Architectural overview of LSA vs. CSA. The black lines denote the standard, step-by-step CSA pipelines. The red lines highlight our proposed LSA mechanism, which decouples the GPU memory footprint by leveraging a Memory Indexer to fetch historical KV chunks dynamically every $\tau$ steps.

2 Methodology

In this section, we present the technical details of Lookahead Sparse Attention (LSA), including its architectural formulation, data curation pipeline, optimization strategy, and optimal configuration. Specifically, Section 2.1 (https://arxiv.org/html/2606.09079#S2.SS1) introduces how we architect LSA on top of the DeepSeek-V4 framework to achieve predictive context selection. Section 2.2 (https://arxiv.org/html/2606.09079#S2.SS2) introduces our lookahead data formats and the automated gathering pipeline. Section 2.3 (https://arxiv.org/html/2606.09079#S2.SS3) details our decoupled training strategy that physically isolates indexer optimization from the massive LLM backbone. Finally, Section 2.4 (https://arxiv.org/html/2606.09079#S2.SS4) presents our systematic exploration of the optimal layer configuration and training recipe for the production model.

2.1 Memory Indexer for Lookahead Selection

The core design principle of LSA is to minimize modifications to the DeepSeek-V4 architecture, thereby maximizing the preservation of its established capabilities. Therefore, our Memory Indexer mirrors the exact architecture of the native Lightning Indexer used in DeepSeek-V4, reusing the compressed indexer keys $K^{\text{IComp}}$ as the dense representation of historical context. The definitive departure is that we introduce a Sigmoid function as the final activation layer to scale the indexer scores into the $(0,1)$ range, and we replace the rigid Top-$k$ selector with a threshold-based mechanism to recall a dynamic number of historical entries.

During the autoregressive decoding stage, the Memory Indexer triggers periodically at a fixed decoding step interval $\tau$ (e.g., $\tau = 64$) to perform lookahead block prediction. As illustrated in Figure 2 (https://arxiv.org/html/2606.09079#S1.F2), at decoding step $t$ (where $t \bmod \tau = 0$), given the current input hidden state of the query token $\mathbf{h}_{t} \in \mathbb{R}^{d}$, we map it into low-rank indexer queries across $n_{h}^{l}$ indexer heads:

$$\mathbf{c}_{t}^{Q} = \mathbf{h}_{t} \cdot W^{DQ}, \tag{1}$$

$$[\mathbf{q}_{t,1}^{l}; \mathbf{q}_{t,2}^{l}; \dots; \mathbf{q}_{t,n_{h}^{l}}^{l}] = \mathbf{q}_{t}^{l} = \mathbf{c}_{t}^{Q} \cdot W^{IUQ}, \tag{2}$$

where $W^{DQ} \in \mathbb{R}^{d \times d_{c}}$ and $W^{IUQ} \in \mathbb{R}^{d_{c} \times c^{l} n_{h}^{l}}$ represent the down-projection and up-projection matrices for the lookahead query representation, respectively. Concurrently, we dynamically project $\mathbf{h}_{t}$ to compute the routing head weights $\mathbf{w}_{t}^{l}$:

$$[\mathbf{w}_{t,1}^{l}; \mathbf{w}_{t,2}^{l}; \dots; \mathbf{w}_{t,n_{h}^{l}}^{l}] = \mathbf{w}_{t}^{l} = \mathbf{h}_{t} \cdot W^{w}, \tag{3}$$

where $W^{w} \in \mathbb{R}^{d \times n_{h}^{l}}$ is a learnable matrix, and $\mathbf{w}_{t,h}^{l}$ dynamically scales the importance of the $h$-th indexer head.
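As a rough illustration of Eqs. (1) to (3), the sketch below runs the three projections on tiny hand-picked matrices in plain Python. The function names and the toy dimensions are assumptions made for readability; a real indexer would use batched tensor kernels and learned weights.

```python
def matvec(vec, mat):
    """Row vector (length d) times matrix (d x k); returns a list of length k."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def indexer_queries(h_t, W_dq, W_iuq, W_w, n_heads):
    """Compute per-head indexer queries and head weights for one hidden state."""
    c_q = matvec(h_t, W_dq)                      # Eq. (1): down-projection
    q_flat = matvec(c_q, W_iuq)                  # Eq. (2): up-projection, heads concatenated
    head_dim = len(q_flat) // n_heads
    q_heads = [q_flat[i * head_dim:(i + 1) * head_dim] for i in range(n_heads)]
    w_heads = matvec(h_t, W_w)                   # Eq. (3): one routing weight per head
    return q_heads, w_heads


# Toy sizes (assumed): d = 2, d_c = 2, two heads of dimension 2.
h_t = [1.0, 2.0]
W_dq = [[1.0, 0.0], [0.0, 1.0]]
W_iuq = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
W_w = [[0.5, 0.0], [0.0, 0.25]]

q_heads, w_heads = indexer_queries(h_t, W_dq, W_iuq, W_w, n_heads=2)
print(q_heads)  # [[1.0, 2.0], [1.0, 2.0]]
print(w_heads)  # [0.5, 0.5]
```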

To determine which historical compressed KV entries are strictly critical for the upcoming window $[t, t+\tau-1]$, the lookahead index score $I_{t,s}$ between the query token $t$ and a preceding compressed entry $s$ ($s < \lfloor \frac{t}{m} \rfloor$) is formulated as a head-fused gated matching score with a Sigmoid activation:

$$I_{t,s} = \sigma\left(\sum_{h=1}^{n_{h}^{l}} \mathbf{w}_{t,h}^{l} \cdot \text{ReLU}\left(\mathbf{q}_{t,h}^{l} \cdot \left(K_{s}^{\text{IComp}}\right)^{T}\right)\right), \tag{4}$$

where $\sigma(\cdot)$ denotes the standard Sigmoid function.
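A minimal sketch of Eq. (4) for a single compressed key, written in plain Python so it runs without dependencies. The function name `lookahead_score` and the toy inputs are illustrative assumptions, not the authors' code.

```python
import math


def lookahead_score(q_heads, w_heads, k_s):
    """Eq. (4): sigmoid of the head-weighted sum of ReLU(q_h . k_s)."""
    total = 0.0
    for q_h, w_h in zip(q_heads, w_heads):
        dot = sum(a * b for a, b in zip(q_h, k_s))
        total += w_h * max(dot, 0.0)        # ReLU gate, then per-head weight
    return 1.0 / (1.0 + math.exp(-total))   # Sigmoid maps the sum into (0, 1)


q_heads = [[1.0, 2.0], [1.0, 2.0]]

aligned = lookahead_score(q_heads, [0.5, 0.5], k_s=[1.0, 0.0])
opposed = lookahead_score(q_heads, [0.5, 0.5], k_s=[-1.0, 0.0])
suppressed = lookahead_score(q_heads, [-1.0, 0.5], k_s=[1.0, 0.0])
```

Here `aligned` is $\sigma(1) \approx 0.731$ and `opposed` is exactly 0.5, because every ReLU term is zero. Since ReLU outputs are non-negative and Eq. (4) as written has no bias term, a score below 0.5 (as in `suppressed`) requires at least one negative head weight.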

This Sigmoid activation stands as the only architectural departure from the native Lightning Indexer. While the original one applies a ReLU boundary for raw attention scoring, LSA introduces Sigmoid normalization to align the Memory Indexer’s scalar outputs explicitly with discrete binary targets $y \in \{0,1\}$. For a query token $t$, rather than a rigid Top-$k$ selection strategy, we fetch all preceding compressed KV entries whose lookahead scores meet or exceed a specific classification threshold (i.e., $I_{t,s} \geq 0.5$) from the CPU Cold Pool into the GPU memory for subsequent core attention:

$$C_{t}^{\text{MemComp}} = \left\{ C_{s}^{\text{Comp}} \;\middle|\; I_{t,s} \geq 0.5 \right\}, \tag{5}$$

where $C^{\text{Comp}}$ denotes the pre-computed compressed KV entries. Once the query-critical context subset $C_{t}^{\text{MemComp}}$ is successfully resident in the GPU memory, the native Lightning Indexer calculates the token-level matching scores within this restricted $C_{t}^{\text{MemComp}}$ boundary instead of scanning the full context. It applies the native ReLU-based Multi-Query Attention scoring over the fetched subset to select the final fine-grained Top-$k$ core compressed entries:

$$C_{i}^{\text{CoreComp}} = \left\{ C_{s}^{\text{Comp}} \in C_{t}^{\text{MemComp}} \;\middle|\; \text{Score}_{\text{native}}(i,s) \in \text{Top-}k \right\}. \tag{6}$$

The selected $C_{i}^{\text{CoreComp}}$ entries are then concatenated with the non-offloadable sliding window KV cache to participate in the final core attention computation. This tiered selection mechanism guarantees that the underlying FlashInfer or FlashAttention kernels operate exclusively on a highly condensed, hardware-resident active sequence footprint.
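The two-tier selection of Eqs. (5) and (6) can be sketched as a threshold filter followed by a Top-$k$ pick restricted to the fetched subset. The function name, the score lists, and the collapsing of the trigger step $t$ and the decoding step $i$ into a single call are simplifying assumptions for illustration.

```python
def select_chunks(lookahead_scores, native_scores, k, threshold=0.5):
    """Return (fetched ids, core ids) for one decoding step.

    lookahead_scores[s] plays the role of I_{t,s}; native_scores[s] plays the
    role of the native Lightning Indexer score for entry s.
    """
    # Eq. (5): fetch every entry whose lookahead score meets the threshold.
    mem = [s for s, score in enumerate(lookahead_scores) if score >= threshold]
    # Eq. (6): Top-k by native score, but only among the fetched entries.
    core = sorted(mem, key=lambda s: native_scores[s], reverse=True)[:k]
    return mem, sorted(core)


lookahead = [0.9, 0.2, 0.5, 0.7, 0.1]
native = [0.3, 9.0, 0.8, 0.6, 5.0]

mem, core = select_chunks(lookahead, native, k=2)
print(mem)   # [0, 2, 3]
print(core)  # [2, 3]
```

Entry 1 has the highest native score in this toy example but is never a candidate, because its lookahead score fell below 0.5 and it was not fetched. This is the behaviour that makes the indexer's recall matter.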

2.2 Lookahead Dataset Construction

The cornerstone of optimizing our Memory Indexer is pinning down exactly which historical compressed KV entries a decoding token needs to look ahead to. A naive approach would define the positive label set for token $t$ as the simple union of all Top-$k$ entries recalled by the native Lightning Indexer across the future window $[t, t+\tau-1]$. However, empirical analysis reveals a massive inflation problem with this strategy, resulting in nearly 10,000 positive samples per token window before filtering (reduced to approximately 100–1,000 after our pipeline). The root cause is that a rigid Top-$k$ selector forces the model to recall a fixed number of preceding entries regardless of their actual relevance, causing low-probability noise entries from different attention layers to heavily pollute the ground-truth dataset.
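To see why the naive union inflates, consider the sketch below. It assumes the Top-$k$ recalls are available as one list of entry ids per decoding step and per attention layer; that data layout and the function name are hypothetical, chosen only to make the counting concrete.

```python
def naive_positive_labels(topk_per_step_per_layer):
    """Union of every Top-k entry id recalled in the window, over steps and layers."""
    positives = set()
    for per_layer in topk_per_step_per_layer:   # one item per decoding step in the window
        for entry_ids in per_layer:             # one item per attention layer
            positives.update(entry_ids)
    return positives


# Toy window: 2 decoding steps, 2 layers, k = 2 entries recalled each time.
window = [
    [[1, 2], [1, 7]],   # step t
    [[1, 3], [2, 9]],   # step t + 1
]

labels = naive_positive_labels(window)
print(sorted(labels))  # [1, 2, 3, 7, 9]
```

Even with $k = 2$, the union already holds five entries, and only entry 1 is recalled in most lists. In general the union can grow up to (steps in the window) × (layers) × $k$ entries, which is consistent with the inflation the text reports.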

To eliminate this noise, we propose a golden-label filtering pipeline that uses a Cross-Layer Majority Voting mechanism to identify the true “golden entries.” The data generation pass runs completely offline on the frozen
