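@karminski3: Magic! DeepSeekV4 context memory compressed to 1/10! Everyone knows DeepSeekV4 supports a 1M context and has been heavily optimized: actually using the full 1M context takes only about 10 GB of VRAM (by comparison, DeepSeek-V3.2 needs roughly…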
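Summary
FlashMemory-DeepSeek-V4 proposes a new inference paradigm called Lookahead Sparse Attention (LSA). A Neural Memory Indexer proactively predicts future context needs, compressing the physical KV cache footprint to 13.5% of the full-context baseline while raising average accuracy by 0.6%. The method uses a decoupled training strategy: the indexer is trained independently, without loading the base model, which significantly lowers training cost.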
Cached at: 2026/06/12 19:01
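Magic! DeepSeekV4 context memory compressed to 1/10!
Everyone knows DeepSeekV4 supports a 1M context and has been heavily optimized: actually using the full 1M context takes only about 10 GB of VRAM (by comparison, DeepSeek-V3.2 needs roughly 84 GB). Then I came across the FlashMemory paper, which pushes VRAM usage down to 1.3 GB! And output quality goes up rather than down!
Fooling your buddies is one thing, but fooling yourself is pointless. Really? Performance goes up after compression? I rushed to read the details of the paper:
Let's first review the conventional approach: every time the model emits a token, it has to re-read the preceding hundreds of thousands of tokens (this is global attention).
FlashMemory's approach is to predict what will be needed in the future. It has a built-in Neural Memory Indexer (really just a small model) that proactively anticipates which pieces of the historical text the upcoming generation will need, and prepares those pieces in advance. As long as the hit rate is very high, the gain holds. In other words, its assumption is that not everything in the KV cache is needed for every generated token; loading on demand, ahead of time, is enough.
It is a lot like doing homework with reference materials spread all over the desk: the optimization is to photograph just the parts of the references you will need and look at the photos when you need them.
That sounds simple, but the real difficulty is that training a dedicated small indexer model requires loading the DeepSeek-V4 model into VRAM and training the two together, which is quite compute-hungry.
So here comes the paper's second highlight: decoupled training. They treat the indexer as a standard "dual-encoder" (similar to the models used in search and recommendation) and train it separately. In this process there is no need to load the huge DeepSeek-V4 base model into VRAM at all. Training cost drops off a cliff, and the setup is compatible with standard retrieval training frameworks. (Simply put, it is trained with a general method: given a query, predict which long passages to retrieve. So it is really a general-purpose model.)
Sounds plausible, but that only reduces VRAM usage; how does performance go up? The answer is attention denoising. Because only the memory chunks most relevant to the current generation are pulled into VRAM each time, the model cannot see the irrelevant, redundant information during computation. This naturally acts as "denoising", which is why accuracy rises slightly even as VRAM usage falls. In the official tests, accuracy on long-context benchmarks (such as LongBench-v2) improved by 0.6% on average.
(There is also the question of how data is evicted from VRAM and how it is predicted for preloading. That part is great too and very inspiring. I recommend reading the original paper; there is no room to cover it here.)
Paper: http://arxiv.org/abs/2606.09079 Project: http://github.com/libertywing/FlashMemory-Deepseek-V4…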
#FlashMemory #DeepSeekV4 #FlashMemoryDeepseekV4
FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention
Source: https://arxiv.org/html/2606.09079
Affiliations: 1 Independent Researchers; 2 Tencent; 3 The Hong Kong University of Science and Technology (Guangzhou); 4 Tsinghua University. * Equal contribution. † Project Lead.
Qifan Zhang, Jiachen Yu, Tian Liang, Dongyang Ma, Xiang Hu, Zibo Lin, Chunyang Li, Zhichao Wang, Miao Peng, Nuo Chen, Jia Li, Yujiu Yang, Haitao Mi, Dong Yu
[email protected]
Abstract
Conventional LLMs keep the full KV cache loaded during decoding, causing a severe GPU memory bottleneck for ultra-long context serving. In this report, we propose Lookahead Sparse Attention (LSA), a novel inference paradigm powered by a Neural Memory Indexer built upon the DeepSeek-V4 architecture. Rather than passively attending to all historical tokens, LSA proactively predicts future context demands and preserves only the query-critical KV chunks in the GPU memory. Crucially, we instantiate this architecture via a backbone-free decoupled training strategy. By formulating the indexer as a standard dual-encoder architecture, we train it independently using standard retrieval training frameworks without ever loading the massive backbone model into GPU memory.
We demonstrate that this “less is more” paradigm significantly maximizes serving efficiency while acting as an effective attention denoiser in tasks that rely on long-term global memory. Across primary long-context evaluation suites (e.g., LongBench-v2, LongMemEval, and RULER), FM-DS-V4 compresses the average physical KV cache footprint down to merely 13.5% of the full-context baseline, while consistently preserving or slightly elevating downstream accuracy (+0.6% absolute margin on average). Crucially, at extreme 500K scales, FlashMemory suppresses the physical KV cache overhead by over 90% without destabilizing the backbone’s core reasoning capacities.
Project Status: Due to organizational realignments, the Project Lead has parted ways with Tencent, and this project has been suspended. This technical report documents our preliminary breakthroughs and verified checkpoints. We firmly believe in the potential of the FlashMemory paradigm for infinite long-context intelligence. If you or your organization are interested in supporting or collaborating on the next phase (e.g., compute sponsorship, scaling tests, or research integration), please contact the Project Lead at [email protected].
Figure 1: Performance and hardware efficiency of FlashMemory-DeepSeek-V4. On LongBench-v2 and RULER, FM-DS-V4 consistently matches or exceeds DS-V4-Flash, while reducing KV cache overhead to merely 13.5% on average. KV cache memory footprints are measured via sglang deployment logs on an 8×H20 GPU server.
1 Introduction
The extension of Large Language Models (LLMs) toward ultra-long context windows is fundamentally bottlenecked by memory capacity. While modern sparse attention mechanisms successfully reduce the computational FLOPs per decoding step to a near-constant level, the GPU memory footprint of the Key-Value (KV) cache still scales linearly with the sequence length. Recent foundation models like DeepSeek-V4 (https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro) and Qwen3.5 (https://huggingface.co/Qwen/Qwen3.5-397B-A17B) attempt to slow down this memory explosion by incorporating heavily compressed attention (HCA) or linear attention layers [deepseekv4, qwen35blog]. However, to preserve fine-grained factual recall, these models must still retain a significant portion of low-compression or full-attention layers [deepseekv4]. Consequently, they only mitigate the rate of memory growth rather than eliminating the linear scaling bottleneck itself.
This work stems from a simple yet striking observation of resource waste during inference: conventional LLMs fully load and carry the entire KV cache in GPU memory even when the active decoding step is completely independent of the historical context. Our empirical analysis of real-world inference logs reveals that over 90% of user requests with contexts longer than 64K tokens can be accurately resolved using only the last 8K tokens. This indicates that an overwhelming majority of GPU memory is squandered on inactive context that contributes nothing to the current token prediction. Conversely, simply discarding history via standard sliding-window attention fails entirely on the remaining tasks that genuinely require global context synthesis. This hard contradiction (supporting deep global reasoning without paying the full GPU memory tax for local generation steps) is the root cause behind the prohibitive cost of long-context serving.
To resolve this dilemma, we present Lookahead Sparse Attention (LSA). Following the structural compression spirit of DeepSeek-V4 [deepseekv4], our architecture retains all highly condensed HCA chunks (128:1 compression ratio) to maintain global context awareness. However, we fundamentally upgrade the conventional Compressed Sparse Attention (CSA) layers into our predictive LSA paradigm. LSA frees the model from recalling so much fine-grained context; instead, driven by a highly efficient Neural Memory Indexer, the system triggers periodically at a fixed decoding interval of $\tau$ steps (e.g., $\tau=64$) to evaluate current hidden states and proactively fetch only the critical CSA chunks into the GPU memory. Crucially, we formulate the indexer as a standalone dual-encoder architecture. This decoupled design allows us to train the indexer independently on pre-computed hidden states and labels, completely bypassing the prohibitive memory and computational overhead of full-model fine-tuning or joint distillation.
Experimental results across three distinct long-context benchmarks confirm the robustness and striking efficiency of LSA. In scenarios requiring long-term memory and deep understanding, LSA acts as an effective attention denoiser. Specifically, averaged across LongBench-v2, LongMemEval, and RULER, LSA reduces GPU memory consumption to merely 13.5% of the baseline (an 86.5% reduction) while outperforming the standard DeepSeek-V4-Flash by +0.6% absolute accuracy. At 500K context lengths, the memory reduction reaches up to 90%.
In summary, our core contributions are threefold:
- Lookahead Sparse Attention (LSA) Paradigm: We propose LSA, a novel inference paradigm that eliminates the hard contradiction between long-context modeling capabilities and hardware efficiency by proactively predicting and fetching query-critical KV chunks on demand.
- Backbone-Free Decoupled Training: We introduce an ultra-lightweight training strategy that physically isolates the indexer from the host LLM. Formulated as a standalone dual-encoder trained on pre-computed representations, the indexer can be optimized independently in just a single H20 GPU hour without ever loading the massive backbone model.
- Breakthrough in Efficiency: Extensive evaluations show that LSA reduces GPU memory to merely 13.5% of the baseline (up to 90% reduction at 500K) while maintaining comparable accuracy to the full-attention baseline.
Figure 2: Architectural overview of LSA vs. CSA. The black lines denote the standard, step-by-step CSA pipelines. The red lines highlight our proposed LSA mechanism, which decouples the GPU memory footprint by leveraging a Memory Indexer to fetch historical KV chunks dynamically every $\tau$ steps.
2 Methodology
In this section, we present the technical details of Lookahead Sparse Attention (LSA), including its architectural formulation, data curation pipeline, optimization strategy, and optimal configuration. Specifically, Section 2.1 introduces how we architect LSA on top of the DeepSeek-V4 framework to achieve predictive context selection. Section 2.2 introduces our lookahead data formats and the automated gathering pipeline. Section 2.3 details our decoupled training strategy that physically isolates indexer optimization from the massive LLM backbone. Finally, Section 2.4 presents our systematic exploration of the optimal layer configuration and training recipe for the production model.
2.1 Memory Indexer for Lookahead Selection
The core design principle of LSA is to minimize modifications to the DeepSeek-V4 architecture, thereby maximizing the preservation of its established capabilities. Therefore, our Memory Indexer mirrors the exact architecture of the native Lightning Indexer used in DeepSeek-V4, reusing the compressed indexer keys $K^{\text{IComp}}$ as the dense representation of historical context. The definitive departure is that we introduce a Sigmoid function as the final activation layer to scale the indexer scores into the $(0,1)$ range, and we replace the rigid Top-$k$ selector with a threshold-based mechanism to recall a dynamic number of historical entries.
During the autoregressive decoding stage, the Memory Indexer triggers periodically at a fixed decoding step interval $\tau$ (e.g., $\tau=64$) to perform lookahead block prediction. As illustrated in Figure 2, at decoding step $t$ (where $t \bmod \tau = 0$), given the current input hidden state of the query token $\mathbf{h}_{t}\in\mathbb{R}^{d}$, we map it into low-rank indexer queries across $n_{h}^{l}$ indexer heads:
$$\mathbf{c}_{t}^{Q}=\mathbf{h}_{t}\cdot W^{DQ}, \tag{1}$$
$$[\mathbf{q}_{t,1}^{l};\mathbf{q}_{t,2}^{l};\dots;\mathbf{q}_{t,n_{h}^{l}}^{l}]=\mathbf{q}_{t}^{l}=\mathbf{c}_{t}^{Q}\cdot W^{IUQ}, \tag{2}$$
where $W^{DQ}\in\mathbb{R}^{d\times d_{c}}$ and $W^{IUQ}\in\mathbb{R}^{d_{c}\times c^{l}n_{h}^{l}}$ represent the down-projection and up-projection matrices for the lookahead query representation, respectively. Concurrently, we dynamically project $\mathbf{h}_{t}$ to compute the routing head weights $\mathbf{w}_{t}^{l}$:
$$[\mathbf{w}_{t,1}^{l};\mathbf{w}_{t,2}^{l};\dots;\mathbf{w}_{t,n_{h}^{l}}^{l}]=\mathbf{w}_{t}^{l}=\mathbf{h}_{t}\cdot W^{w}, \tag{3}$$
where $W^{w}\in\mathbb{R}^{d\times n_{h}^{l}}$ is a learnable matrix, and $\mathbf{w}_{t,h}^{l}$ dynamically scales the importance of the $h$-th indexer head.
To determine which historical compressed KV entries are strictly critical for the upcoming window $[t,t+\tau-1]$, the lookahead index score $I_{t,s}$ between the query token $t$ and a preceding compressed entry $s$ ($s<\lfloor\frac{t}{m}\rfloor$) is formulated as a head-fused gated matching score with a Sigmoid activation:
$$I_{t,s}=\sigma\left(\sum_{h=1}^{n_{h}^{l}}\mathbf{w}_{t,h}^{l}\cdot\text{ReLU}\left(\mathbf{q}_{t,h}^{l}\cdot\left(K_{s}^{\text{IComp}}\right)^{T}\right)\right), \tag{4}$$
where $\sigma(\cdot)$ denotes the standard Sigmoid function.
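Equations (1) to (4) can be traced by hand on a toy input. The following is a minimal sketch in plain Python, not the authors' implementation: the helper names (`matvec`, `lookahead_scores`), the tiny dimensions, and all numeric values are hypothetical, and each compressed key is assumed to be a single vector shared by all indexer heads.

```python
import math

def matvec(vec, mat):
    """Row vector (length n) times an n x m matrix, returning a length-m list."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lookahead_scores(h, W_dq, W_iuq, W_w, keys, n_heads):
    """Score every compressed key against hidden state h, following Eqs. (1)-(4)."""
    c_q = matvec(h, W_dq)                    # Eq. (1): down-projection
    q_flat = matvec(c_q, W_iuq)              # Eq. (2): up-projection, heads concatenated
    head_dim = len(q_flat) // n_heads
    q_heads = [q_flat[i * head_dim:(i + 1) * head_dim] for i in range(n_heads)]
    w = matvec(h, W_w)                       # Eq. (3): one routing weight per head
    scores = []
    for key in keys:                         # assumption: one key vector shared by all heads
        fused = sum(
            w[i] * max(0.0, sum(a * b for a, b in zip(q_heads[i], key)))
            for i in range(n_heads)
        )
        scores.append(sigmoid(fused))        # Eq. (4): Sigmoid over the head-fused score
    return scores

# Toy values (hypothetical): d = 2, d_c = 2, two heads of width 2.
h = [1.0, 0.0]
W_dq = [[1.0, 0.0], [0.0, 1.0]]
W_iuq = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
W_w = [[1.0, -1.0], [0.0, 0.0]]
keys = [[2.0, 0.0], [0.0, 3.0], [-1.0, -1.0]]
scores = lookahead_scores(h, W_dq, W_iuq, W_w, keys, n_heads=2)
print([round(s, 3) for s in scores])
```

One property visible in this sketch: because the ReLU output is non-negative, a score can fall below 0.5 only when at least one routing weight $\mathbf{w}_{t,h}^{l}$ is negative, so the sign of the head weights matters for any threshold placed at 0.5.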
This Sigmoid activation stands as the only architectural departure from the native Lightning Indexer. While the original one applies a ReLU boundary for raw attention scoring, LSA introduces Sigmoid normalization to align the Memory Indexer’s scalar outputs explicitly with discrete binary targets $y\in\{0,1\}$. For a query token $t$, rather than a rigid Top-$k$ selection strategy, we fetch all preceding compressed KV entries whose lookahead scores meet or exceed a specific classification threshold (i.e., $I_{t,s}\geq 0.5$) from the CPU Cold Pool into the GPU memory for subsequent core attention:
$$C_{t}^{\text{MemComp}}=\left\{C_{s}^{\text{Comp}}\;\middle|\;I_{t,s}\geq 0.5\right\}, \tag{5}$$
where $C^{\text{Comp}}$ denotes the pre-computed compressed KV entries. Once the query-critical context subset $C_{t}^{\text{MemComp}}$ is successfully resident in the GPU memory, the native Lightning Indexer calculates the token-level matching scores within this restricted $C_{t}^{\text{MemComp}}$ boundary instead of scanning the full context. It applies the native ReLU-based Multi-Query Attention scoring over the fetched subset to select the final fine-grained Top-$k$ core compressed entries:
$$C_{i}^{\text{CoreComp}}=\left\{C_{s}^{\text{Comp}}\in C_{t}^{\text{MemComp}}\;\middle|\;\text{Score}_{\text{native}}(i,s)\in\text{Top-}k\right\}. \tag{6}$$
The selected $C_{i}^{\text{CoreComp}}$ entries are then concatenated with the non-offloadable sliding window KV cache to participate in the final core attention computation. This tiered selection mechanism guarantees that the underlying FlashInfer or FlashAttention kernels operate exclusively on a highly condensed, hardware-resident active sequence footprint.
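The two-tier selection of Eqs. (5) and (6) can be illustrated with a short sketch. The function name `fetch_then_topk` and the score values are hypothetical; the sketch only assumes that both scores are available as per-entry lists.

```python
def fetch_then_topk(lookahead_scores, native_scores, k, threshold=0.5):
    """Tier 1 (Eq. 5): fetch entries whose lookahead score meets the threshold.
    Tier 2 (Eq. 6): among fetched entries only, keep the top-k by native score."""
    fetched = [s for s, score in enumerate(lookahead_scores) if score >= threshold]
    ranked = sorted(fetched, key=lambda s: native_scores[s], reverse=True)
    core = sorted(ranked[:k])
    return fetched, core

# Hypothetical scores for five compressed entries.
lookahead = [0.9, 0.2, 0.7, 0.5, 0.1]
native = [1.0, 9.0, 3.0, 2.0, 8.0]
fetched, core = fetch_then_topk(lookahead, native, k=2)
print(fetched, core)
```

In this toy case entries 1 and 4 have the highest native scores but are never considered in the second tier, because they were not fetched in the first. This is the mechanism by which a lookahead miss becomes invisible to the native indexer until the next trigger.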
2.2 Lookahead Dataset Construction
The cornerstone of optimizing our Memory Indexer is pinning down exactly which historical compressed KV entries a decoding token needs to look ahead to. A naive approach would define the positive label set for token $t$ as the simple union of all Top-$k$ entries recalled by the native Lightning Indexer across the future window $[t,t+\tau-1]$. However, empirical analysis reveals a massive inflation problem with this strategy, resulting in nearly 10,000 positive samples per token window before filtering (reduced to approximately 100–1,000 after our pipeline). The root cause is that a rigid Top-$k$ selector forces the model to recall a fixed number of preceding entries regardless of their actual relevance, causing low-probability noise entries from different attention layers to heavily pollute the ground-truth dataset.
To eliminate this noise, we propose a golden-label filtering pipeline that uses a Cross-Layer Majority Voting mechanism to identify the true “golden entries.” The data generation pass runs completely offline on the frozen DeepSeek-V4-Flash backbone model. For each decoding token $i\in[t,t+\tau-1]$ and across all $L$ CSA layers (where $L=21$ for DeepSeek-V4-Flash [deepseekv4]), we extract the raw indexer logit scores $S_{i,l,s}$ for every preceding compressed entry $s$. We then filter these scores through a three-step denoising pipeline:
- Step 1: Softmax Normalization. We convert the raw logit scores into a valid probability distribution via a Softmax operation over all historical entries:
  $$P_{i,l,s}=\frac{\exp(S_{i,l,s})}{\sum_{j}\exp(S_{i,l,j})}. \tag{7}$$
- Step 2: Top-$p$ Thresholding. Instead of using a fixed Top-$k$ count, we dynamically retain only the high-confidence entries using a nucleus threshold $p$ (we empirically set $p=0.6$). An entry $s$ is marked as selected by layer $l$ if it falls within the minimum set of entries that cumulatively account for the top $60\%$ of the probability mass:
  $$\mathcal{M}_{i,l}=\left\{s\;\middle|\;\sum_{j\in\text{Sorted}(P_{i,l,:})}P_{i,l,j}\leq p\right\}. \tag{8}$$
- Step 3: Cross-Layer Majority Voting. We aggregate the selection hits across all $L$ layers. The voting score $V_{i,s}$ for entry $s$ at token step $i$ is calculated by counting how many layers independently voted for it:
  $$V_{i,s}=\sum_{l=1}^{L}\mathbb{I}(s\in\mathcal{M}_{i,l}), \tag{9}$$
  where $\mathbb{I}(\cdot)$ is the indicator function. An entry is officially recognized as a core active entry $\mathcal{A}_{i}^{\text{golden}}$ if and only if it secures consensus backing from at least $\theta$ layers (we set $\theta=3$):
  $$\mathcal{A}_{i}^{\text{golden}}=\left\{s\;\middle|\;V_{i,s}\geq 3\right\}. \tag{10}$$
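The three filtering steps can be sketched for a single decoding token as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function names and logits are hypothetical, and the Top-$p$ step follows the textual description (the smallest set whose cumulative mass reaches $p$, so the entry that crosses the threshold is included), which can differ at the boundary from a strict reading of Eq. (8).

```python
import math

def softmax(logits):
    """Step 1 (Eq. 7): normalize raw indexer logits into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_set(probs, p=0.6):
    """Step 2: smallest set of entries whose cumulative mass reaches p.
    Assumption: the entry that crosses the threshold is kept."""
    order = sorted(range(len(probs)), key=lambda s: probs[s], reverse=True)
    selected, mass = set(), 0.0
    for s in order:
        selected.add(s)
        mass += probs[s]
        if mass >= p:
            break
    return selected

def golden_entries(layer_logits, p=0.6, theta=3):
    """Step 3 (Eqs. 9-10): keep entries selected by at least theta layers."""
    votes = {}
    for logits in layer_logits:
        for s in top_p_set(softmax(logits), p):
            votes[s] = votes.get(s, 0) + 1
    return {s for s, v in votes.items() if v >= theta}

# Hypothetical logits: 4 layers scoring 4 historical entries for one token.
layer_logits = [
    [5.0, 0.0, 0.0, 0.0],   # selects entry 0
    [5.0, 0.0, 0.0, 0.0],   # selects entry 0
    [4.0, 4.0, 0.0, 0.0],   # selects entries 0 and 1
    [0.0, 5.0, 0.0, 0.0],   # selects entry 1
]
golden = golden_entries(layer_logits)
print(golden)
```

Entry 0 collects three votes and passes $\theta=3$; entry 1 collects only two and is dropped, which is the kind of single-layer noise the voting step is meant to remove.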
Finally, for each lookahead evaluation window triggered at decoding step $t$, the positive ground-truth label set $\mathcal{Y}_{t}^{+}$ is established by taking the union of these denoised golden entries across the entire future temporal window of $\tau$ steps:
$$\mathcal{Y}_{t}^{+}=\bigcup_{i=t}^{t+\tau-1}\mathcal{A}_{i}^{\text{golden}}. \tag{11}$$
By shifting from an arbitrary Top-$k$ lookup to a consensus-driven density estimation, our pipeline isolates the true contextual backbone of the long sequence, discarding irrelevant background noise. In total, our training set comprises approximately 10,000 long documents with context lengths ranging from 16K to 512K tokens.
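Equation (11) then turns per-step golden sets into one binary label vector per trigger. A minimal sketch, with a hypothetical function name and toy data, assuming the golden sets are stored in a dictionary keyed by decoding step:

```python
def window_positive_labels(golden_per_step, t, tau, num_entries):
    """Union the golden sets over steps t .. t+tau-1 (Eq. 11) and
    return a 0/1 label for every historical entry."""
    positives = set()
    for i in range(t, t + tau):
        positives |= golden_per_step.get(i, set())
    return [1 if s in positives else 0 for s in range(num_entries)]

# Hypothetical golden sets for five decoding steps; the window covers steps 0..3.
golden_per_step = {0: {1}, 1: {1, 3}, 2: set(), 3: {0}, 4: {2}}
labels = window_positive_labels(golden_per_step, t=0, tau=4, num_entries=4)
print(labels)
```

Entry 2 is golden only at step 4, which lies outside the window, so it stays a negative for the trigger at $t=0$.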
2.3 Optimization and Decoupled Training
Although our Memory Indexer shares a structural setup similar to the native Lightning Indexer, their underlying optimization paradigms are fundamentally different. Unlike the native Lightning Indexer, which relies on heavy end-to-end self-distillation, we treat the Memory Indexer as a standard retrieval model and optimize it via metric learning. The primary training objective is to perform distance-based contrastive optimization: maximizing the lookahead matching scores for query-critical historical entries while minimizing the scores for negative samples.
A key system insight of LSA is that the compressed indexer keys $K^{\text{IComp}}_{s}$ of historical entries are entirely pre-computed and strictly frozen during the training stage. Consequently, the optimization process simplifies into training only the query encoder of a standard dual-encoder retrieval architecture. Specifically, we only need to optimize the low-rank projection matrices ($W^{DQ}, W^{IUQ}, W^{w}$) to map the current input hidden state $\mathbf{h}_{t}$ to align with the fixed historical targets.
To achieve this objective, we minimize a standard element-wise Binary Cross-Entropy (BCE) loss function over the predicted lookahead scores. For a single sample with predicted probability $p$ and label $y\in\{0,1\}$, the per-sample BCE is defined as:
$$\ell_{\text{BCE}}(p,y)=-\bigl(y\log(p)+(1-y)\log(1-p)\bigr), \tag{12}$$
where $y_{t,s}=1$ if $s\in\mathcal{Y}_{t}^{+}$, and $y_{t,s}=0$ otherwise. The overall batch objective is then the average over all samples in the batch $\mathcal{S}$.
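As a small numeric check of Eq. (12), the sketch below evaluates the per-sample loss and its batch average. The clamp `eps` is an addition for numerical safety and is not part of the source formula; the sample values are hypothetical.

```python
import math

def bce(p, y, eps=1e-12):
    """Per-sample binary cross-entropy, Eq. (12). eps clamps p away from 0 and 1
    (an assumption added here to avoid log(0))."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def batch_bce(preds, labels):
    """Average of the per-sample losses over the batch."""
    return sum(bce(p, y) for p, y in zip(preds, labels)) / len(preds)

# Hypothetical scores: a confident correct positive, an undecided positive,
# and a confident but wrong negative.
preds = [0.9, 0.5, 0.9]
labels = [1, 1, 0]
print([round(bce(p, y), 4) for p, y in zip(preds, labels)])
print(round(batch_bce(preds, labels), 4))
```

The undecided score of 0.5 costs exactly $\log 2$, and the confident mistake (score 0.9 on a negative) costs far more than the confident correct prediction.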
Because the historical representations $K^{\text{IComp}}_{s}$, target labels $\mathcal{Y}_{t}^{+}$, and layer-specific query hidden states $\mathbf{h}_{t}$ are all pre-extracted and stored offline, the training pipeline achieves complete physical isolation from the host LLM. The thousand-billion-parameter backbone model is never loaded into GPU memory during the entire optimization loop. Since the trainable projection layers represent less than 0.1% of the full model’s parameter scale, the computational workload is remarkably small. As a result, the entire Memory Indexer converges elegantly within a single H20 GPU hour.
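The decoupled setup can be illustrated with a deliberately simplified toy: keys and labels are fixed inputs, and only a query projection is updated by gradient descent on the BCE loss. This is a sketch under strong assumptions, not the authors' training code: it uses a single head with one projection matrix `W`, omits the ReLU and the routing weights of Eq. (4), initializes `W` to zeros for determinism (the source uses random initialization), and all names and values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def project(h, W):
    """Query encoder: q = h . W (the only trainable part in this sketch)."""
    return [sum(h[a] * W[a][b] for a in range(len(h))) for b in range(len(W[0]))]

def predict(h, W, keys):
    q = project(h, W)
    return [sigmoid(sum(qb * kb for qb, kb in zip(q, k))) for k in keys]

def train_query_projection(h, frozen_keys, labels, steps=200, lr=0.5):
    """Gradient descent on mean BCE; frozen_keys are read but never modified."""
    d, c = len(h), len(frozen_keys[0])
    W = [[0.0] * c for _ in range(d)]
    for _ in range(steps):
        preds = predict(h, W, frozen_keys)
        for a in range(d):
            for b in range(c):
                # d(BCE)/dW[a][b] = mean over samples of (p - y) * h[a] * k[b]
                grad = sum((p - y) * h[a] * k[b]
                           for p, y, k in zip(preds, labels, frozen_keys)) / len(labels)
                W[a][b] -= lr * grad
    return W

# Pre-extracted (hypothetical) hidden state, frozen keys, and golden labels.
h = [1.0, 0.5]
frozen_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
labels = [1, 0, 0, 1]
keys_before = [list(k) for k in frozen_keys]
W = train_query_projection(h, frozen_keys, labels)
preds = predict(h, W, frozen_keys)
print([round(p, 3) for p in preds])
```

Nothing in the loop touches a backbone model: every input is a stored tensor, which is the property that makes the optimization cheap in the source's description.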
This decoupled design significantly accelerates our research cycle. Leveraging a single cluster of 8× NVIDIA H20 GPUs, we seamlessly executed approximately 500 distinct training runs within a single week to systematically map out the optimal architecture and training strategies, a feat that would be computationally prohibitive under traditional joint end-to-end distillation.
2.4 Architectural Optimal Configuration
A fundamental premise of designing LSA is that not every transformer layer is suited for contextual lookahead prediction. Our early-stage exploration revealed that deploying memory indexers on the initial shallow layers of the LLM yields exceptionally poor lookahead performance, as these early representations predominantly capture low-level token statistics rather than long-range semantic dependencies. Therefore, an efficient system routing paradigm must selectively place indexers only on layers that possess mature global context awareness.
However, scaling the number of joint training layers introduces a strict trade-off between performance and serving efficiency. While a single-layer retriever lacks the representative capacity to handle diverse long-context workloads, aggressively scaling to an 8-layer joint configuration (spanning layers 6 to 20) introduces severe hardware-side efficiency degradation. As verified in our full-system benchmarks, an 8-layer ensemble triggers an excessively loose context recall mask, fetching up to 30%–49% of historical compressed KV entries into the GPU memory, which defeats our primary goal of minimizing the memory tax.
Through extensive Pareto-frontier optimization, we established that placing independent Memory Indexers on exactly three strategic intermediate layers (layers 10, 12, and 20) delivers the best trade-off. During inference, our runtime system aggregates the scoring predictions from these three layers using a union strategy (OR-mode routing). Specifically, a preceding compressed KV entry is actively fetched into the GPU memory if at least one of the three layer indexers predicts its classification score $I_{t,s}\geq 0.5$:
$$C_{t}^{\text{MemComp}}=\bigcup_{l\in\{10,12,20\}}\left\{C_{s}^{\text{Comp}}\;\middle|\;I_{t,s}^{(l)}\geq 0.5\right\}. \tag{13}$$
This 3-layer union provides a robust fallback protection boundary: an entry missed by one indexer is still fetched if either of the other two selects it.
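The OR-mode aggregation of Eq. (13) reduces to a set union over per-layer threshold tests. A minimal sketch with hypothetical per-layer scores:

```python
def or_mode_fetch(scores_by_layer, threshold=0.5):
    """Fetch an entry if any layer's indexer scores it at or above the threshold (Eq. 13)."""
    fetched = set()
    for scores in scores_by_layer.values():
        fetched |= {s for s, score in enumerate(scores) if score >= threshold}
    return sorted(fetched)

# Hypothetical scores from the three indexer layers for four historical entries.
scores_by_layer = {
    10: [0.9, 0.1, 0.2, 0.4],
    12: [0.3, 0.6, 0.2, 0.4],
    20: [0.1, 0.1, 0.2, 0.5],
}
fetched = or_mode_fetch(scores_by_layer)
print(fetched)
```

Each of entries 0, 1, and 3 is selected by exactly one layer, yet all three are fetched. The union favors recall, and by the same logic it can only grow as more layers are added, which is consistent with the loose recall mask the source reports for the 8-layer configuration.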
Our final production model instantiation is built upon this optimal 3-layer geometry and optimized via a carefully curated combination of effective training strategies:
- Random Initialization: Rather than loading alignment-biased weights from a host checkpoint, we initialize the indexer’s projection matrices randomly, forcing the dual-encoder to learn unified representations from scratch.
- Query Low-Rank Conditioning: We leverage the native low-rank query projection geometry of the DeepSeek-V4 architecture. In DeepSeek’s MLA/MQA design, the query vector is projected through an internal low-rank bottleneck (officially designated q_lora_rank in the DeepSeek-V3 codebase, where the default is 1536). In our implementation, we set this internal projection dimension to $r=2048$ for the R-series configuration. This is not PEFT-style LoRA fine-tuning (which typically uses ranks of 8–64 to learn small perturbations on frozen weights); rather, it is a fixed architectural dimension of the model’s attention backbone that determines the representational capacity of the query encoder. Increasing this rank directly expands the spatial projection capacity of the lookahead indexer without introducing any adapter overhead.
- Focal Loss Denoising: To prevent easy negative samples from dominating the gradients, we replace standard BCE with a sample-weighted Focal Loss. Let $p_{t,s}\in[0,1]$ denote the Sigmoid-activated indexer score (i.e., $I_{t,s}$) and $y_{t,s}\in\{0,1\}$ the binary label. We first compute the predicted confidence on the correct class:
  $$p_{t,s}^{\text{(correct)}}=p_{t,s}\cdot y_{t,s}+(1-p_{t,s})\cdot(1-y_{t,s}). \tag{14}$$
  The Focal Loss over a batch $\mathcal{S}$ is then defined as:
  $$\mathcal{L}_{\text{FL}}=\frac{1}{|\mathcal{S}|}\sum_{s\in\mathcal{S}}w_{t,s}\,\bigl(1-p_{t,s}^{\text{(correct)}}\bigr)^{\gamma}\,\ell_{\text{BCE}}(I_{t,s},y_{t,s}), \tag{15}$$
  where $\ell_{\text{BCE}}$ is the standard binary cross-entropy of Eq. (12), $\gamma=2$ is the focusing parameter that down-weights well-classified samples, and $w_{t,s}$ is a per-sample weight. Notably, we do not use a separate class-balancing coefficient $\alpha$; instead, class imbalance is handled jointly by (i) a negative sampling ratio of 3:1 (three negatives per positive) and (ii) the per-sample weight $w_{t,s}$ computed by the --weighted-loss scheduler. This design forces the optimizer to concentrate on hard boundary tokens while keeping the hyperparameter surface minimal.
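The effect of the focusing term in Eqs. (14) and (15) is easy to see numerically. In the sketch below the per-sample weights default to 1.0, which is an assumption: the source does not describe how the --weighted-loss scheduler computes $w_{t,s}$. The clamp `eps` is likewise an addition for numerical safety, and the scores are hypothetical.

```python
import math

def focal_loss(scores, labels, weights=None, gamma=2.0, eps=1e-12):
    """Sample-weighted focal loss following Eqs. (14)-(15).
    weights defaults to all ones (assumption; the scheduler is not specified)."""
    weights = weights or [1.0] * len(scores)
    total = 0.0
    for p, y, w in zip(scores, labels, weights):
        p = min(max(p, eps), 1.0 - eps)
        p_correct = p * y + (1.0 - p) * (1.0 - y)                 # Eq. (14)
        bce = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))    # Eq. (12)
        total += w * (1.0 - p_correct) ** gamma * bce             # summand of Eq. (15)
    return total / len(scores)

# Hypothetical samples: an easy negative (score 0.1) and a hard positive (score 0.3).
easy_negative = focal_loss([0.1], [0])
hard_positive = focal_loss([0.3], [1])
print(round(easy_negative, 5), round(hard_positive, 5))
```

With $\gamma=2$, the easy negative's BCE is scaled by $(1-0.9)^2=0.01$ while the hard positive's is scaled by $(1-0.3)^2=0.49$, so the hard sample dominates the gradient. Setting `gamma=0` recovers the plain weighted BCE average.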
Conversely, multiple popular retrieval and contrastive learning tricks proved to be redundant or even detrimental during our 500-run sweep, and were systematically excluded from our final pipeline:
- Pairwise-to-Pointwise Chaining: Transitioning optimization from a pairwise ranking stage (BPR/Margin Loss) to a pointwise calibration stage yielded no statistical recall gains over a pure pointwise training loop.
- Strong Negative Mining: Utilizing LLM-annotated semantic chunks as a hard negative pool introduced severe secondary label noise into the contrastive format; random negative sampling within the non-voted historical repository proved significantly more robust.
- Weighted Loss Functions: Scaling the loss according to native layer matching counts increased raw precision slightly but degraded the absolute recall bound by discarding boundary context, shifting the model away from its safety-net objective.
Note on Hyperparameter Selection.
Due to the unexpected suspension of this project, we were unable to conduct systematic ablation studies on several key hyperparameters. Specifically, the decoding interval $\tau=64$ and the classification threshold of $0.5$ were selected based on initial exploratory runs but remain untested across alternative values. The 3-layer configuration (layers 10, 12, 20) was determined through the 500-run Pareto sweep described in Section 2.4; however, a more fine-grained layer-wise ablation would be desirable for future work.
3 Experiments
3.1 Experimental Setup
To ensure a rigorous and controlled evaluation of the FlashMemory paradigm, we benchmark our model against three structural variants. Crucially, to maintain architectural consistency, all evaluated configurations universally retain the full Heavily Compressed Attention (HCA) layers (at a 128:1 compression ratio), alongside the exact CSA chunks corresponding to both the last 8K tokens of the original prompt and all actively decoded tokens within the local window. The precise treatment of the remaining historical long-context CSA chunks differentiates the methods as follows:
- DS-V4-Flash: The standard, unaltered DeepSeek-V4-Flash model.
- FM-DS-V4 (Ours): The DS-V4-Flash backbone augmented with the Memory Indexer. The lookahead selection mechanism triggers periodically every $\tau=64$ decoding steps, dynamically evaluating and fetching query-critical historical CSA chunks from the CPU cold pool into the active GPU HBM.
- Recency Only: A sliding-window fallback control. While it shares the same base HCA layers and the local 8K/decoded CSA window to match the static local memory allocation budget, it completely discards all prior long-context historical CSA chunks and executes zero predictive lookahead retrieval.
- Random 10%: A naive sparse routing control. On top of the foundational HCA layers and the local 8K/decoded CSA window, it randomly selects and retains exactly 10% of the global historical context CSA chunks in the active KV cache, providing a non-predictive stochastic baseline.
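The four configurations differ only in which historical CSA chunks stay resident beyond the shared HCA layers and local window. The sketch below makes that difference explicit; the function name, the seed handling, and rounding 10% down to an integer count are assumptions made for illustration, not details from the source.

```python
import random

def retained_history_chunks(method, num_history_chunks, fetched=None, seed=0):
    """Indices of historical CSA chunks kept on the GPU, beyond the HCA layers
    and the local 8K/decoded window that every configuration shares."""
    all_chunks = list(range(num_history_chunks))
    if method == "DS-V4-Flash":
        return all_chunks                       # full KV cache, nothing pruned
    if method == "Recency Only":
        return []                               # all historical chunks discarded
    if method == "Random 10%":
        rng = random.Random(seed)               # seeded for reproducibility (assumption)
        return sorted(rng.sample(all_chunks, num_history_chunks // 10))
    if method == "FM-DS-V4":
        return sorted(fetched or [])            # whatever the Memory Indexer fetched
    raise ValueError(f"unknown method: {method}")

for name in ["DS-V4-Flash", "FM-DS-V4", "Recency Only", "Random 10%"]:
    kept = retained_history_chunks(name, 1000, fetched=[3, 17, 512])
    print(name, len(kept))
```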
3.2 Primary Results: Breaking the Capacity Wall
Table 1 highlights the performance and hardware footprint scaling across three major long-context benchmarks: LongBench-v2 [bai2025longbenchv2deeperunderstanding], LongMemEval [wu2025longmemevalbenchmarkingchatassistants], and RULER [hsieh2024rulerwhatsrealcontext].
Table 1: System performance and physical KV cache footprints (GPU memory overhead in gigabytes [GB] in parentheses) across primary long-context benchmarks. DS-V4-Flash operates at 100% full KV cache allocation without chunk pruning.

| Benchmark / Dataset | DS-V4-Flash | FM-DS-V4 | Recency Only | Random 10% |
| --- | --- | --- | --- | --- |
| LongBench-v2-S (46K) | 68.9 (0.17 GB) | 70.2 (0.04 GB) | 50.0 (0.03 GB) | 53.3 (0.04 GB) |
| LongBench-v2-M (179K) | 67.6 (0.65 GB) | 68.9 (0.08 GB) | 54.4 (0.03 GB) | 48.9 (0.09 GB) |
| LongBench-v2-L (493K) | 68.1 (1.80 GB) | 70.0 (0.18 GB) | 54.3 (0.04 GB) | 46.9 (0.22 GB) |
| LongMemEval-S (125K) | 80.6 (0.46 GB) | 82.0 (0.06 GB) | 19.2 (0.04 GB) | 20.1 (0.07 GB) |
| LongMemEval-M (500K) | 39.3 (1.82 GB) | 40.2 (0.17 GB) | 23.1 (0.04 GB) | 25.7 (0.22 GB) |
| RULER (64K) | 94.7 (0.23 GB) | 95.0 (0.04 GB) | 36.6 (0.03 GB) | 52.8 (0.05 GB) |
| RULER (128K) | 94.3 (0.47 GB) | 93.2 (0.06 GB) | 21.6 (0.03 GB) | 32.3 (0.08 GB) |
| RULER (256K) | 90.5 (0.94 GB) | 88.2 (0.09 GB) | 20.6 (0.04 GB) | 41.2 (0.12 GB) |
| RULER (512K) | 88.3 (1.87 GB) | 89.6 (0.18 GB) | 18.8 (0.04 GB) | 27.2 (0.22 GB) |
| Avg. | 76.9 (0.93 GB) | 77.5 (0.10 GB) | 33.3 (0.04 GB) | 38.7 (0.12 GB) |

The empirical findings deliver a striking victory for the FlashMemory paradigm. Averaged across all tasks, FM-DS-V4 consumes merely 13.5% of the baseline GPU memory footprint, an average 86.5% reduction in KV cache storage, while actually improving overall performance to 77.5% (+0.6% absolute margin over DS-V4-Flash). When the average context length reaches 500K, this reduction ratio further climbs to an astonishing 90%.
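The headline averages can be re-derived from the rounded values printed in Table 1. The sketch below assumes the reported memory ratio is the mean of per-benchmark ratios; because the table rounds footprints to two decimals, the recomputed ratio lands near, not exactly on, the reported 13.5%.

```python
# (accuracy, KV cache GB) for DS-V4-Flash and FM-DS-V4, copied from Table 1.
table1 = {
    "LongBench-v2-S (46K)":  ((68.9, 0.17), (70.2, 0.04)),
    "LongBench-v2-M (179K)": ((67.6, 0.65), (68.9, 0.08)),
    "LongBench-v2-L (493K)": ((68.1, 1.80), (70.0, 0.18)),
    "LongMemEval-S (125K)":  ((80.6, 0.46), (82.0, 0.06)),
    "LongMemEval-M (500K)":  ((39.3, 1.82), (40.2, 0.17)),
    "RULER (64K)":           ((94.7, 0.23), (95.0, 0.04)),
    "RULER (128K)":          ((94.3, 0.47), (93.2, 0.06)),
    "RULER (256K)":          ((90.5, 0.94), (88.2, 0.09)),
    "RULER (512K)":          ((88.3, 1.87), (89.6, 0.18)),
}

# Per-benchmark memory ratio FM-DS-V4 / DS-V4-Flash, then the mean of those ratios.
ratios = {name: fm[1] / base[1] for name, (base, fm) in table1.items()}
mean_ratio = sum(ratios.values()) / len(ratios)

# Mean accuracy of each system and the absolute gap between them.
mean_base_acc = sum(base[0] for base, _ in table1.values()) / len(table1)
mean_fm_acc = sum(fm[0] for _, fm in table1.values()) / len(table1)
acc_gain = mean_fm_acc - mean_base_acc

print(f"mean memory ratio: {mean_ratio:.1%}")
print(f"accuracy: {mean_base_acc:.1f} -> {mean_fm_acc:.1f} (gain {acc_gain:+.2f})")
```

The per-benchmark ratios also show the trend the text describes: the shortest setting (46K) retains the largest share of the cache, while the settings near 500K sit close to 10%.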
This counter-intuitive “less is more” phenomenon is especially pronounced in the ultra-long LongBench-v2-L (493K) setting, where our model beats DS-V4-Flash by +1.9% while running on a threadbare 10.0% memory budget. This strongly supports our core hypothesis: LSA acts as an attention denoiser, filtering out thousands of irrelevant historical chunks that would otherwise clutter the attention dot-products and cause factual hallucinations. Under the same memory restrictions, naive heuristic controls (Recency Only and Random 10%) completely collapse, failing to synthesize global context and confirming that our indexer has mastered complex predictive temporal routing.
One might naturally question why Recency Only and Random 10% can still maintain a reasonable performance baseline on specific datasets like LongBench-v2. It is critical to reiterate that in DeepSeek-V4’s hybrid design, the sparse CSA mechanism operates in parallel with the full Heavily Compressed Attention (HCA) layers (at a 128:1 compression ratio). For evaluation scenarios that primarily necessitate global semantic themes or coarse-grained synthesis rather than lossy, hyper-granular token retrieval, utilizing the global compressed HCA foundations alongside the local 8K cache proves sufficient to navigate basic context structures.
3.3 Limitations and Diagnostics
While FlashMemory achieves unprecedented efficiency gains on three standard long-context benchmarks, our stress-testing exposes critical boundaries of the current paradigm. Due to recent organizational realignments, active development has been suspended. We present these diagnostic findings and concrete failure cases to provide transparent insights for the open-source community.
3.3.1 Context-Independent Overhead
We originally hypothesized that for context-independent queries where historical long context is entirely irrelevant, the pointwise Sigmoid gating would naturally collapse to near-zero retrievals, yielding a strict $O(1)$ constant KV cache footprint. To test this adversarial boundary, we augmented LongMemEval-S and LongMemEval-M by explicitly appending queries that are strictly context-free or tightly bounded to the local 8K window only.
Table 2: System evaluation under adversarial context-independent tasks (No-Context).

| Context Independent Datasets | DS-V4-Flash | FM-DS-V4 (Ours) |
| --- | --- | --- |
| LongMemEval-S (No-Context) | 96.7 (0.46 GB) | 95.0 (0.06 GB) |
| LongMemEval-M (No-Context) | 91.2 (1.82 GB) | 92.5 (0.16 GB) |

As shown in Table 2, while the downstream accuracy gracefully matches the foundation baseline, the model fails to preserve a constant memory overhead. Moving from the 125K context to the 500K context, the lookahead memory allocation ratio does scale down to 8.4%, yet the physical absolute chunk retention volume inflates by approximately 2.5×. This indicates that the point-wise Sigmoid gater still leaks a marginal background probability across massive sequence lengths, accumulating false-positive retrievals when facing massive distraction pools.
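The scaling behaviour described above can be checked against the rounded footprints in Table 2. Because the table reports memory to two decimals, the recomputed figures differ slightly from the 8.4% and 2.5× quoted in the text, which presumably derive from unrounded measurements; the variable names are illustrative.

```python
# KV cache footprints in GB, copied from Table 2 (rounded to two decimals).
base_s, fm_s = 0.46, 0.06   # LongMemEval-S (No-Context), 125K context
base_m, fm_m = 1.82, 0.16   # LongMemEval-M (No-Context), 500K context

ratio_s = fm_s / base_s     # share of the full cache retained at 125K
ratio_m = fm_m / base_m     # share of the full cache retained at 500K
growth = fm_m / fm_s        # growth of the absolute footprint from 125K to 500K

print(f"retained share: {ratio_s:.1%} -> {ratio_m:.1%}")
print(f"absolute footprint growth: {growth:.2f}x")
```

The relative share shrinks while the absolute footprint grows, so the footprint is sub-linear in context length but not the constant $O(1)$ floor that was hypothesized.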
3.3.2Dense Global Memory Breakdown (The MRCR Failure Case)
Our model experiences a severe breakdown on the Multi-Round Co-reference Resolution (MRCR) benchmark (Vodrahalli et al., 2024), where accuracy plummets from the baseline’s 76.0% to a dismal 48.0%. To isolate the root cause of this regression, we conducted an oracle simulation: we pre-computed the global golden attention weights of DS-V4-Flash across the full decoding path for each sample, sorted the historical blocks by cumulative attention density, and loaded only the Top 50%, 25%, or 10% highest-weighted chunks into the core MQA layers.
Our diagnostic oracle sweeps revealed a fundamental difference between benchmarks: for LongBench-v2, LongMemEval, and RULER, retaining a mere 10% or 25% of golden CSA chunks alongside the global HCA layers fully preserves baseline accuracy. However, MRCR exhibits an aggressive global dense memory dependency: even when providing the indexer with 50% of the absolute true golden chunks, the accuracy still drops by about 2% compared to full-context cache execution.
These two empirical findings isolate the architectural limitations of our current Memory Indexer. We envisioned an ideal indexer capable of executing deterministic, context-adaptive retrieval: achieving near-zero recall on context-independent tasks to maintain a constant memory floor, while delivering near-perfect recall on memory-dense tasks to secure maximum contextual awareness.
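The oracle selection step can be sketched as follows. This is a minimal illustration that assumes per-step attention weights are already aggregated per chunk. The function name `oracle_top_chunks`, the rounding rule (ceiling, at least one chunk), and the toy weights are assumptions, not details from the released code.

```python
import math


def oracle_top_chunks(attn_per_step: list[list[float]],
                      keep_fraction: float) -> list[int]:
    """Pick the chunks with the highest cumulative attention.

    attn_per_step[t][i] is the attention mass that decoding step t placed
    on historical chunk i. Returns the kept chunk indices in ascending order.
    """
    num_chunks = len(attn_per_step[0])
    cumulative = [0.0] * num_chunks
    for step in attn_per_step:
        for i, weight in enumerate(step):
            cumulative[i] += weight

    # Assumed rounding rule: keep at least one chunk, round the budget up.
    keep = max(1, math.ceil(keep_fraction * num_chunks))
    ranked = sorted(range(num_chunks), key=lambda i: cumulative[i], reverse=True)
    return sorted(ranked[:keep])


# Toy attention trace: 3 decoding steps over 4 historical chunks.
trace = [
    [0.10, 0.60, 0.10, 0.20],
    [0.05, 0.70, 0.05, 0.20],
    [0.10, 0.50, 0.10, 0.30],
]
print(oracle_top_chunks(trace, 0.50))  # [1, 3]
print(oracle_top_chunks(trace, 0.25))  # [1]
```

Because the ranking uses attention from the full decoding path, this selection is an upper bound that a real indexer, which must predict ahead of time, cannot see.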
Unfortunately, by relying on a highly compressed, standalone Dual-Encoder framework, the model fundamentally lacks the capacity to balance such extreme operational boundaries of precision and recall. Consequently, the following three critical factors bound its performance:
- 1. Frozen Key Representation: Due to computational budget constraints, we never adjusted or optimized the native DeepSeek-V4 compressed indexer keys ($K^{\text{IComp}}$), fine-tuning only the query projection encoder.
- 2. Shallow Cross-Interaction: Operating purely via a 64-step coarse dot-product similarity, the indexer lacks multi-turn interaction capacity. Incorporating a Late-Interaction architecture (e.g., ColBERT-style token-level cross-matching) is essential to untangle complex dense retrieval patterns.
- 3. Decoupled Training Isolation: The lack of end-to-end joint optimization with the main backbone restricts the indexer to static pseudo-labels, ignoring live autoregressive shift dynamics.
Addressing these items remains our formal future roadmap.
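The gap between a single dot-product score and ColBERT-style late interaction (item 2 above) can be shown on a toy case. This sketch uses mean pooling as a stand-in for the single-vector dual-encoder representation, which is an assumption and not the indexer's actual scoring. All vectors and function names are hypothetical.

```python
Vector = list[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def mean_pool(tokens: list[Vector]) -> Vector:
    """Collapse token vectors into one vector (assumed pooling scheme)."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


def dual_encoder_score(query_tokens: list[Vector],
                       chunk_tokens: list[Vector]) -> float:
    """Single dot product between pooled query and pooled chunk."""
    return dot(mean_pool(query_tokens), mean_pool(chunk_tokens))


def late_interaction_score(query_tokens: list[Vector],
                           chunk_tokens: list[Vector]) -> float:
    """MaxSim: each query token takes its best-matching chunk token."""
    return sum(max(dot(q, c) for c in chunk_tokens) for q in query_tokens)


query = [[1.0, 0.0], [0.0, 1.0]]      # two distinct query aspects
chunk_a = [[1.0, 0.0], [0.0, 1.0]]    # matches each aspect exactly
chunk_b = [[0.5, 0.5], [0.5, 0.5]]    # blurry partial match to both

print(dual_encoder_score(query, chunk_a), dual_encoder_score(query, chunk_b))
print(late_interaction_score(query, chunk_a), late_interaction_score(query, chunk_b))
```

In this toy case the pooled scores tie at 0.5 for both chunks, while MaxSim scores chunk A at 2.0 and chunk B at 1.0, so only the token-level match separates the exact chunk from the blurry one.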
3.3.3 The Length Generalization Ceiling
Our initial design intent assumed that because our lookahead indexer operates via point-wise chunk matching, we could train the Dual-Encoder on relatively short context windows (e.g., 128K) and seamlessly scale zero-shot inference to 1M+ context fields, as candidate pool expansion theoretically shouldn’t distort point-wise scoring.
Our empirical evaluations completely dismantled this assumption. The indexer safely generalizes up to exactly $2\times$ its training context length. Inference beyond this hard boundary causes accuracy to collapse precipitously, with lookahead block selection degenerating into near-random sampling. We attribute this bottleneck to out-of-distribution positional embeddings, which constitute the primary architectural divergence between self-attention mechanisms and generic text retrieval systems. Consequently, our final released memory indexer was explicitly trained on context lengths up to 512K. Although behavior at greater scales remains untested, we hypothesize that its retrieval discriminability would decay irreversibly when deployed on sequences exceeding 1M tokens.
4 Conclusion
In this report, we have presented FlashMemory-DeepSeek-V4, an LLM augmented with Lookahead Sparse Attention (LSA). By introducing a Neural Memory Indexer into the DeepSeek-V4-Flash architecture, we enable the model to proactively predict and fetch only the query-critical KV chunks into GPU memory. Compared to DeepSeek-V4-Flash, our model achieves comparable or even superior performance across the majority of benchmarks, while consuming only about 13.5% of the GPU memory.
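The ceiling described above reduces to a simple bound, sketched here as a guard a caller might apply before inference. The helper names are hypothetical, and reading "512K" as 512 × 1024 tokens is an assumption.

```python
GENERALIZATION_FACTOR = 2  # observed limit: 2x the training context length


def max_safe_context(train_context_len: int) -> int:
    """Longest inference context the indexer is reported to handle."""
    return GENERALIZATION_FACTOR * train_context_len


def is_within_ceiling(seq_len: int, train_context_len: int) -> bool:
    """True if seq_len stays inside the observed generalization range."""
    return seq_len <= max_safe_context(train_context_len)


TRAIN_LEN = 512 * 1024  # assumption: "512K" read as 512 * 1024 tokens

print(max_safe_context(TRAIN_LEN))               # 1048576
print(is_within_ceiling(1_000_000, TRAIN_LEN))   # True
print(is_within_ceiling(1_500_000, TRAIN_LEN))   # False
```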
We emphasize that the architecture, training pipeline, and hyperparameters of FlashMemory-DeepSeek-V4 are severely constrained by computational resources and the unexpected suspension of the project. The indexer was trained with frozen key representations, shallow dot-product interaction, and no end-to-end joint optimization with the backbone—design choices dictated by resource availability rather than optimality. Nevertheless, the results achieved under these constraints make us highly confident in the vast potential for improvement that remains: FlashMemory-DeepSeek-V4, in its current form, is merely the first glimpse of what LSA can achieve for ultra-long-context intelligence.
References