WaveFilter: Enhancing the Long-Context Capability of Diffusion LLMs via Wavelet-Guided KV Cache Filtering
Summary
WaveFilter proposes a training-free, wavelet-guided KV cache filtering framework for diffusion large language models that enhances long-context capability by precisely identifying key tokens and constructing sparse caches, improving performance on complex long-context tasks.
# WaveFilter: Enhancing the Long-Context Capability of Diffusion LLMs via Wavelet-Guided KV Cache Filtering
Source: [https://arxiv.org/html/2606.00724](https://arxiv.org/html/2606.00724)
Jinnan Yang1,4, Yan Wang2, Zhen Bi3, Kehao Wu1, Xiaojie Li1, Jungang Lou3, Zechao Li1†, Jing Liu4†
1 Nanjing University of Science and Technology, 2 Alibaba Group, 3 Huzhou Normal University, 4 Institute of Automation, Chinese Academy of Sciences
† Corresponding authors
###### Abstract
Diffusion Large Language Models (DLMs) have demonstrated significant advantages across various tasks. However, constrained by their multi-step iterative inference mechanism, their computational overhead and inference latency in long-context tasks have become core bottlenecks restricting large-scale deployment. When processing long sequences, existing Key-Value (KV) caching mechanisms often suffer a drastic degradation in generation quality; the core challenge lies in precisely and efficiently filtering critical tokens within ultra-long contexts. Inspired by the human reading process, we propose WaveFilter, a universal and training-free caching framework. This framework introduces the wavelet transform to decompose long sequences and precisely identify key tokens, from which a sparse KV Cache is constructed to compute the final contextual representation. Experimental results demonstrate that WaveFilter, as a plug-and-play generic framework, significantly enhances the performance of existing mainstream KV Cache methods in complex long-context tasks.
## 1 Introduction
Owing to their non-autoregressive nature and bidirectional contextual modeling capability, diffusion large language models (DLMs) Nie et al. ([2025](https://arxiv.org/html/2606.00724#bib.bib1)) have demonstrated unique advantages in tasks such as text-to-image generation, dialogue systems, and code generation Sahoo et al. ([2024b](https://arxiv.org/html/2606.00724#bib.bib2)); Gupta et al. ([2024](https://arxiv.org/html/2606.00724#bib.bib3)); Gong et al. ([2025](https://arxiv.org/html/2606.00724#bib.bib4)). However, constrained by their multi-step iterative inference mechanism, DLMs incur substantially higher computational complexity and inference latency than autoregressive models Li et al. ([2022](https://arxiv.org/html/2606.00724#bib.bib5), [2025](https://arxiv.org/html/2606.00724#bib.bib6)). This heavy computational burden has become a core bottleneck limiting their large-scale deployment. To alleviate the overhead caused by repeated computation, researchers have introduced Key-Value (KV) caching mechanisms into DLMs Ma et al. ([2025](https://arxiv.org/html/2606.00724#bib.bib7)). By caching the Key and Value vectors from previous steps, this method enables subsequent generation to directly reuse prior computation results. This avoids redundant computation over already generated context, thereby reducing inference latency and improving generation efficiency.
Figure 1: Performance comparison at different context lengths on the niah_single_1 subset of Ruler. (a) illustrates the accuracy (%) of LLaDA-8B-Instruct and its variants with various KV Cache methods; (b) displays the corresponding throughput (Tokens/sec).

However, directly extending KV caching mechanisms to DLMs for handling complex long-context tasks still faces significant challenges. On the one hand, as shown in Figure [1](https://arxiv.org/html/2606.00724#S1.F1)a, the performance of LLaDA-8B-Instruct sharply declines as input length increases, indicating that the model struggles to maintain generation robustness in long-context tasks Liu et al. ([2026](https://arxiv.org/html/2606.00724#bib.bib8)). On the other hand, existing KV caching mechanisms have been insufficiently studied in the context of DLMs applied to complex long-context tasks. As illustrated in Figure [1](https://arxiv.org/html/2606.00724#S1.F1)a and Figure [1](https://arxiv.org/html/2606.00724#S1.F1)b, although Fast-dLLM Wu et al. ([2025b](https://arxiv.org/html/2606.00724#bib.bib18)) and Elastic-Cache Nguyen-Tri et al. ([2025](https://arxiv.org/html/2606.00724#bib.bib19)) can provide certain acceleration benefits for short-text tasks, their throughput rapidly deteriorates as context length grows, often accompanied by further reductions in accuracy. The core challenge lies in the extreme difficulty of precisely identifying and filtering tokens that make critical contributions to the denoising process within ultra-long context sequences. Therefore, developing a universal, plug-and-play enhancement framework that enables existing caching methods, such as Fast-dLLM and Elastic-Cache, to extend efficiently and robustly to long-context tasks remains an imperative issue.
To accurately screen for crucial tokens in long-context tasks, we draw inspiration from the human cognitive habit of "skimming before scanning." Humans typically skim the entire text first to rapidly construct a macro-level contextual semantic structure, and subsequently perform localized scanning targeted at specific questions to locate and extract key information, ultimately achieving efficient and precise question answering. Inspired by this cognitive process, we propose WaveFilter, a universal and training-free framework. The core of this framework lies in the introduction of the Discrete Wavelet Transform (DWT) for KV cache compression. By reducing cache length and filtering out high-frequency noise while preserving time-domain information, WaveFilter facilitates the rapid construction of macro contextual semantic structures. Following this, a multi-scale recursive filtering mechanism is employed to simulate localized scanning, pinning down the tokens most relevant to the question to achieve accurate question answering.
Specifically, at the initial time step, the DWT is first utilized to extract the semantic features of the cache, and the attention mechanism is employed to identify the initial critical tokens targeted by the query vector\. Based on this, multi\-scale recursive filtering is performed on the initial important tokens to determine the final critical tokens\. These final tokens are directly utilized to dynamically construct a sparse KV Cache, which subsequently participates in attention computation with the current query vector to precisely forge the final context representation\. In summary, the primary contributions of this paper are as follows:
- This paper proposes WaveFilter, a universal and training-free KV caching framework for long-context tasks. By mimicking human cognitive reading habits, this framework seamlessly empowers existing KV Cache methods, effectively resolving their performance degradation in long-context tasks.
- WaveFilter innovatively introduces the wavelet transform to achieve multi-scale token filtering: by compressing the cache drastically while preserving critical information, it precisely identifies relevant regions at a negligible cost, thereby effectively resolving the challenge of identifying pivotal tokens within massive caches.
- Experimental results demonstrate that WaveFilter, as a universal plug-and-play framework, can be seamlessly integrated into various existing KV caching strategies: while maintaining competitive generation speed, it boosts model performance in complex long-context tasks, consistently outperforming standalone KV caching methods.
## 2 Preliminary
### 2.1 Key-Value Cache in Masked Diffusion Models
Masked Diffusion Models (MDMs) replace traditional continuous noise addition with random masking and iterative blank-filling, enabling the parallel generation of discrete data Sohl-Dickstein et al. ([2015](https://arxiv.org/html/2606.00724#bib.bib9)); Austin et al. ([2021](https://arxiv.org/html/2606.00724#bib.bib10)); Campbell et al. ([2022](https://arxiv.org/html/2606.00724#bib.bib11)); Sahoo et al. ([2024a](https://arxiv.org/html/2606.00724#bib.bib38)). To optimize generation efficiency during the reverse process, the KV Cache mechanism is integrated into its Transformer backbone. As illustrated in Figure [2](https://arxiv.org/html/2606.00724#S2.F2)a, at the initial timestep $t$ (where $t=1$), the model performs full computation over all positions $I=\{1,2,\dots,N\}$. At the $l$-th layer, the current hidden state $h^{1,l}$ is projected into query vectors $Q^{1,l}[I]$, key vectors $K^{1,l}[I]$, and value vectors $V^{1,l}[I]$ via learnable projection matrices $W_{Q}^{1,l}$, $W_{K}^{1,l}$, and $W_{V}^{1,l}$. The attention output and the corresponding initialization of the KV Cache for this layer are formulated as:
$$A^{1,l}[I]=\text{Softmax}\left(\frac{Q^{1,l}[I]\,(K^{1,l}[I])^{T}}{\sqrt{d}}\right)V^{1,l}[I]. \tag{1}$$

Subsequently, the KV pairs computed at the initial step are saved to the cache, with the initialization formally defined as:

$$\left\{\begin{aligned} \widetilde{K}^{1,l}[I]&=K^{1,l}[I]\\ \widetilde{V}^{1,l}[I]&=V^{1,l}[I]\end{aligned}\right.. \tag{2}$$

In subsequent timesteps $t>1$, the model performs inference only for the set of generation positions $\widetilde{I}$. By reusing the cached keys $\widetilde{K}$ and values $\widetilde{V}$ stored from previous timesteps, the attention computation is simplified to:

$$A^{t,l}[\widetilde{I}]=\text{Softmax}\left(\frac{Q^{t,l}[\widetilde{I}]\,(\widetilde{K}^{t-1,l}[I])^{T}}{\sqrt{d}}\right)\widetilde{V}^{t-1,l}[I]. \tag{3}$$

Subsequently, the cache is dynamically updated using the KV pairs computed at the current step:

$$\left\{\begin{aligned} \widetilde{K}^{t,l}[\widetilde{I}]&=K^{t,l}[\widetilde{I}]\\ \widetilde{V}^{t,l}[\widetilde{I}]&=V^{t,l}[\widetilde{I}]\end{aligned}\right.. \tag{4}$$
Inference based on the KV cache significantly reduces computational complexity and inference latency during the reverse process\. Through this mechanism, the model substantially enhances the efficiency of discrete sequence generation while maintaining the global context modeling capabilities of the Transformer\.
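The cache initialization, reuse, and update steps of Eqs. (1)-(4) can be sketched for a single layer in NumPy. This is a minimal illustration, not the authors' implementation: the sizes, the random hidden states, the shared projection matrices, and the chosen generation positions `I_tilde` are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: Softmax(q k^T / sqrt(d)) v.
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

rng = np.random.default_rng(0)
N, d = 8, 4                                   # hypothetical sequence length and head dim
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

# Step t = 1: full computation over all positions I (Eq. 1),
# then the keys and values are stored in the cache (Eq. 2).
h1 = rng.standard_normal((N, d))              # stand-in for the hidden state h^{1,l}
K_cache, V_cache = h1 @ W_k, h1 @ W_v
A1 = attention(h1 @ W_q, K_cache, V_cache)

# Step t > 1: only the generation positions are recomputed; their queries
# attend to the cache stored at step t-1 (Eq. 3).
I_tilde = np.array([5, 6, 7])                 # hypothetical generation positions
h_t = rng.standard_normal((len(I_tilde), d))  # hidden states at I_tilde only
A_t = attention(h_t @ W_q, K_cache, V_cache)

# Cache update at the generation positions (Eq. 4).
K_cache[I_tilde] = h_t @ W_k
V_cache[I_tilde] = h_t @ W_v
```

Only the rows indexed by `I_tilde` change after the first step; the remaining rows are reused as they were computed at $t=1$.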
### 2.2 Discrete Wavelet Transform
The Discrete Wavelet Transform (DWT) is a time-frequency analysis method used for signal decomposition Yao et al. ([2022](https://arxiv.org/html/2606.00724#bib.bib13)); Kiruluta et al. ([2025](https://arxiv.org/html/2606.00724#bib.bib14)). DWT decomposes a signal $x[n]$ into low-frequency approximation components and high-frequency detail components through a pair of complementary filter banks. A single-level decomposition process can be formulated as:

$$\left\{\begin{aligned} A_{1}[n]&=\sum_{k}x[k]\cdot g[2n-k]\\ D_{1}[n]&=\sum_{k}x[k]\cdot h[2n-k]\end{aligned}\right.. \tag{5}$$

where $g[n]$ and $h[n]$ denote the low-pass and high-pass filter coefficients, respectively, and the index $2n$ reflects the downsampling operation, which keeps every second output sample. The core advantage of DWT lies in its recursive nature: following the first level of decomposition, the approximation coefficients $A_{1}$ can serve as the input for the subsequent level of the filter bank. This iterative process constructs a multi-level pyramid structure. After $L$ levels of decomposition, the original signal is represented by the set of components $\{A_{L},D_{L},D_{L-1},\dots,D_{1}\}$. In the context of long-sequence modeling, $A_{L}$ captures the global semantic information of the signal, while the various detail components $D_{j}$ preserve local fluctuations across different resolutions.
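The recursive decomposition can be illustrated with the Haar wavelet, the simplest orthonormal choice. The sketch below is an assumption-laden example rather than the paper's configuration: the text does not fix the wavelet basis here, and the pairing of samples $(x[2n], x[2n+1])$ is the common implementation convention, which differs from Eq. (5) only by an index shift. The function names `haar_dwt` and `wavedec` are hypothetical.

```python
import numpy as np

def haar_dwt(x):
    """Single-level Haar DWT along axis 0; returns (approximation, detail)."""
    if x.shape[0] % 2:
        # Odd length: repeat the last sample so samples can be paired.
        x = np.concatenate([x, x[-1:]], axis=0)
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)   # low-pass filter + downsampling
    detail = (even - odd) / np.sqrt(2.0)   # high-pass filter + downsampling
    return approx, detail

def wavedec(x, levels):
    """Multi-level decomposition; returns [A_L, D_L, D_{L-1}, ..., D_1]."""
    details = []
    approx = x
    for _ in range(levels):
        approx, detail = haar_dwt(approx)  # recurse on the approximation only
        details.append(detail)
    return [approx] + details[::-1]

x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
coeffs = wavedec(x, levels=3)
```

For this length-8 signal and three levels, the components have lengths 1, 1, 2, and 4, so each level halves the sequence length of the approximation.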
Figure 2: Schematic pipeline of WaveFilter, consisting of four parts. (a) Decoding steps & Motivation: illustrates the decoding mechanism of discrete diffusion models and introduces a coarse-to-fine retrieval strategy to address the challenge of key token extraction in long-context tasks. (b) Discrete Wavelet Transform: decomposition is performed on the cache keys via DWT to extract low-frequency components. (c) Importance Assessment & Selection: computes the correlation between the query vector and the low-frequency components of the cached keys, and utilizes Top-K selection to identify key candidate regions. (d) Multi-scale Recursive Filtering: recursively refines candidate regions across different scales, ultimately selecting the most informative sparse KV matrix.
## 3 Methodology
This section details the core mechanisms of the WaveFilter framework. To address the challenge of identifying and filtering critical tokens in long-context tasks, this paper proposes a KV Cache framework based on the DWT. As shown in Figure [2](https://arxiv.org/html/2606.00724#S2.F2), the overall algorithmic pipeline consists of two distinct stages (refer to Appendix [A](https://arxiv.org/html/2606.00724#A1) for the detailed algorithm). The first stage, coarse-grained global perception (Section [3.1](https://arxiv.org/html/2606.00724#S3.SS1)), employs the DWT to construct low-frequency components, enabling the rapid localization of semantic regions that contain critical information within a compressed space. The second stage, fine-grained local localization (Section [3.2](https://arxiv.org/html/2606.00724#S3.SS2)), extracts important tokens precisely through multi-scale recursive filtering, and a sparse KV Cache is then constructed from them. This coarse-to-fine strategy keeps generation speed competitive while improving the overall performance of the model.
### 3.1 Coarse-grained Global Perception: Constructing Global Semantic Outlines
For long-sequence processing, the standard attention mechanism faces severe computational challenges when extracting key tokens highly relevant to the query from extensive contexts, primarily due to its $O(N^{2})$ time complexity. To alleviate this bottleneck, this section proposes a novel method for constructing a "Global Semantic Map" based on the DWT, aiming to rapidly and accurately localize potential regions harboring critical tokens by leveraging semantic distributions.
Let $Q^{t,l}[\widetilde{I}]\in\mathbb{R}^{\widetilde{I}\times d}$ denote the query vector at layer $l$ and time step $t\;(t>1)$. The cached key vectors comprise prompt positions $I_{p}$ and generation positions $I_{g}$. Specifically, $\widetilde{K}^{t-1,l}[I_{p}]\in\mathbb{R}^{I_{p}\times d}$ and $\widetilde{K}^{t-1,l}[I_{g}]\in\mathbb{R}^{I_{g}\times d}$ represent the cached key vectors at layer $l$ and time step $t-1$ for the prompt and generation positions, respectively. To capture sequential features across different receptive fields, as illustrated in Figure [2](https://arxiv.org/html/2606.00724#S2.F2)b, we first extract the low-frequency components of $\widetilde{K}^{t-1,l}[I_{p}]$ via the DWT with a wavelet basis $\psi$:

$$\widetilde{K}_{low}^{t-1,l(B)}[I_{p}]=\text{DWT}(\widetilde{K}^{t-1,l}[I_{p}],\psi), \tag{6}$$

where $\widetilde{K}_{low}^{t-1,l(B)}[I_{p}]\in\mathbb{R}^{\frac{I_{p}}{2^{B}}\times d}$ denotes the low-frequency component at scale $B$. By filtering out high-frequency noise, these components capture the semantic outlines of the long prompt sequence at a coarse-grained level. As shown in Figure [2](https://arxiv.org/html/2606.00724#S2.F2)c, since the sequence dimension is significantly compressed at scale $B$, we can compute the perception weight matrix of $Q^{t,l}[\widetilde{I}]$ relative to $\widetilde{K}_{low}^{t-1,l(B)}[I_{p}]$ via the attention mechanism at a negligible computational cost:

$$A^{t,l(B)}=\text{Softmax}\left(\frac{Q^{t,l}[\widetilde{I}]\,(\widetilde{K}_{low}^{t-1,l(B)}[I_{p}])^{T}}{\sqrt{d}}\right), \tag{7}$$

where $A^{t,l(B)}\in\mathbb{R}^{\widetilde{I}\times\frac{I_{p}}{2^{B}}}$ characterizes the correlations between tokens within the compressed space. To identify the candidate regions in $\widetilde{K}_{low}^{t-1,l(B)}[I_{p}]$ with the highest semantic relevance to $Q^{t,l}[\widetilde{I}]$, we perform a column-wise summation of $A^{t,l(B)}$ to obtain the importance evaluation vector:

$$W^{t,l(B)}=\sum_{i=1}^{\frac{I_{p}}{2^{B}}}A_{i}^{t,l(B)}, \tag{8}$$

where $W^{t,l(B)}$ characterizes the contribution of different tokens in $\widetilde{K}_{low}^{t-1,l(B)}[I_{p}]$ to the current query from a macroscopic perspective. Finally, based on the importance scores, we select a proportion $m_{B}$ of the most significant regions to determine the candidate region set $J_{B}$:

$$J_{B}=\underset{j\in\{1,\dots,\frac{I_{p}}{2^{B}}\}}{\text{Top-K}}\left(W_{j}^{t,l(B)},m_{B}\right). \tag{9}$$

Through the aforementioned process, we implement a coarse-grained semantic macro-screening stage for long-context sequences. By leveraging the compression characteristics of the wavelet transform to filter out redundant information without pursuing localized granular precision, this stage rapidly outlines the critical semantic intervals of $Q^{t,l}[\widetilde{I}]$ over $\widetilde{K}_{low}^{t-1,l(B)}[I_{p}]$. This drastically shrinks the subsequent search space, thereby guiding the fine-grained screening in Section [3.2](https://arxiv.org/html/2606.00724#S3.SS2) with an exceptionally low computational overhead.
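The coarse stage of Eqs. (6)-(9) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: it uses a Haar low-pass filter as the wavelet basis, assumes the prompt length is a multiple of $2^{B}$, and sums the attention weights over the query rows so that the score vector has one entry per compressed region, which is the shape the Top-K step requires. The names `haar_lowpass` and `coarse_select` and the toy inputs are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def haar_lowpass(K, levels):
    """Low-frequency Haar approximation of K (seq, d) after `levels` steps."""
    for _ in range(levels):
        K = (K[0::2] + K[1::2]) / np.sqrt(2.0)
    return K

def coarse_select(Q, K_prompt, B, m_B):
    """Return the selected region indices J_B and the region scores W."""
    assert K_prompt.shape[0] % (2 ** B) == 0, "assumes length divisible by 2^B"
    K_low = haar_lowpass(K_prompt, B)                    # Eq. (6)
    A = softmax(Q @ K_low.T / np.sqrt(Q.shape[-1]))      # Eq. (7)
    W = A.sum(axis=0)                                    # Eq. (8): one score per region
    k = max(1, int(np.ceil(m_B * W.shape[0])))           # keep a proportion m_B
    J_B = np.sort(np.argsort(W)[-k:])                    # Eq. (9): Top-K regions
    return J_B, W

# Toy prompt of 16 cached keys; only tokens 8..11 align with the query direction.
K_prompt = np.zeros((16, 4))
K_prompt[8:12, 0] = 3.0
Q = np.tile(np.array([2.0, 0.0, 0.0, 0.0]), (3, 1))
J_B, W = coarse_select(Q, K_prompt, B=2, m_B=0.25)
```

With $B=2$ the 16 keys are compressed to 4 regions, so the attention in Eq. (7) touches 4 columns instead of 16; the single selected region is the one covering tokens 8 to 11.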
### 3.2 Fine-grained Local Localization: Multi-scale Recursive Filtering
Although the initial candidate set $J_{B}$ effectively narrows the search space, these regions inevitably contain tokens irrelevant to the current query. Relying solely on these coarse regions for KV Cache updates and reuse decisions would compromise generation accuracy. To address this, we construct a multi-scale recursive filtering method, as illustrated in Figure [2](https://arxiv.org/html/2606.00724#S2.F2)d. First, the frequency-domain regions $J_{B}$ at scale $B$ are mapped back to the index space of the prompt sequence:

$$\mu^{B-1}=\{2^{B}\cdot j+k\mid j\in J_{B},\,k\in\{0,\dots,2^{B}-1\}\}, \tag{10}$$

where $\mu^{B-1}$ denotes the set of indices covered by the candidate regions in the prompt sequence at scale $B$. Subsequently, the corresponding subsequence tokens are extracted from $\widetilde{K}^{t-1,l}[I_{p}]$ via a tensor indexing operation to construct a sparse key matrix, which is then subjected to a wavelet transform to obtain the higher-resolution low-frequency approximation component (at scale $B-1$):

$$\widetilde{K}_{low}^{t-1,l(B-1)}[\mu^{B-1}]=\text{DWT}(\widetilde{K}^{t-1,l}[\mu^{B-1}],\psi). \tag{11}$$

Compared to $\widetilde{K}_{low}^{t-1,l(B)}[I_{p}]$, the component $\widetilde{K}_{low}^{t-1,l(B-1)}[\mu^{B-1}]$ offers higher resolution and finer local representations, enabling more granular refinement of the candidate region localization. By repeating the procedure described in Figure [2](https://arxiv.org/html/2606.00724#S2.F2)c, we derive a more precise candidate set $J_{B-1}$.

By iteratively executing the aforementioned evaluation and selection process, important regions are recursively refined layer by layer from scale $B$ to scale 1. This hierarchical filtering mechanism enables cross-scale alignment from macroscopic semantic perception to precise token localization, thereby obtaining the most informative token indices $\mu^{0}$ from $\widetilde{K}^{t-1,l}[I_{p}]$. Finally, the corresponding sparse key matrix $\widetilde{K}^{t-1,l}[\mu^{0}]$ and value matrix $\widetilde{V}^{t-1,l}[\mu^{0}]$ are extracted via a tensor index selection operation. They are then concatenated with $\widetilde{K}^{t-1,l}[I_{g}]$ and $\widetilde{V}^{t-1,l}[I_{g}]$ of the current time step along the sequence dimension, respectively, to construct the final sparse key matrix $\widetilde{K}_{sparse}^{t-1,l}$ and value matrix $\widetilde{V}_{sparse}^{t-1,l}$. These matrices are then utilized alongside the query vector of the current step to compute the final contextual representation.
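The recursive refinement and the construction of the sparse cache can be sketched as below. The detailed algorithm is given in the paper's Appendix A, which is not reproduced here, so this is only one plausible reading: it assumes a Haar low-pass filter, a prompt length divisible by $2^{B}$, a per-scale keep ratio passed in a hypothetical `ratios` dictionary, and that at scales below $B$ the mapping of Eq. (10) is applied to positions within the current candidate list. All function names and toy inputs are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def haar_lowpass(K, levels):
    for _ in range(levels):
        K = (K[0::2] + K[1::2]) / np.sqrt(2.0)
    return K

def select_regions(Q, K_low, ratio):
    # Attention-based region scores followed by Top-K (as in Eqs. 7-9).
    W = softmax(Q @ K_low.T / np.sqrt(Q.shape[-1])).sum(axis=0)
    k = max(1, int(np.ceil(ratio * W.shape[0])))
    return np.sort(np.argsort(W)[-k:])

def wavefilter_indices(Q, K_prompt, B, ratios):
    """Refine candidate prompt indices from scale B down to scale 1."""
    assert K_prompt.shape[0] % (2 ** B) == 0, "assumes length divisible by 2^B"
    mu = np.arange(K_prompt.shape[0])            # all prompt indices to start
    for b in range(B, 0, -1):
        K_low = haar_lowpass(K_prompt[mu], b)    # Eq. (6) at b = B, Eq. (11) below
        J = select_regions(Q, K_low, ratios[b])
        block = 2 ** b
        # Eq. (10): expand each selected region into the tokens it covers.
        pos = (J[:, None] * block + np.arange(block)).ravel()
        mu = mu[pos]
    return mu                                    # final token indices mu^0

def sparse_attention(Q, K_prompt, V_prompt, K_gen, V_gen, mu0):
    # Concatenate the selected prompt entries with the generation entries.
    K_sparse = np.concatenate([K_prompt[mu0], K_gen], axis=0)
    V_sparse = np.concatenate([V_prompt[mu0], V_gen], axis=0)
    return softmax(Q @ K_sparse.T / np.sqrt(Q.shape[-1])) @ V_sparse

rng = np.random.default_rng(0)
K_prompt = np.zeros((16, 4))
K_prompt[9, 0] = 5.0                             # the single "relevant" prompt token
V_prompt = rng.standard_normal((16, 4))
K_gen, V_gen = rng.standard_normal((3, 4)), rng.standard_normal((3, 4))
Q = np.tile(np.array([1.0, 0.0, 0.0, 0.0]), (2, 1))
mu0 = wavefilter_indices(Q, K_prompt, B=2, ratios={2: 0.25, 1: 0.5})
out = sparse_attention(Q, K_prompt, V_prompt, K_gen, V_gen, mu0)
```

In this toy case the scale-2 pass keeps the block covering tokens 8 to 11 and the scale-1 pass narrows it to the pair containing token 9, so the final attention runs over 2 prompt keys plus 3 generation keys instead of 19.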
## 4 Experiments
### 4.1 Experimental Setup
##### Implementation Details\.
All experiments are conducted on a single NVIDIA A800 80GB GPU. We evaluate WaveFilter using LLaDA-8B-Instruct Nie et al. ([2025](https://arxiv.org/html/2606.00724#bib.bib1)) and Dream-v0-Base-7B Ye et al. ([2025](https://arxiv.org/html/2606.00724#bib.bib15)) across a diverse set of benchmarks, including LongBench Bai et al. ([2024](https://arxiv.org/html/2606.00724#bib.bib16)) and Ruler Hsieh et al. ([2024](https://arxiv.org/html/2606.00724#bib.bib17)). Detailed hyperparameter configurations are provided in Appendix [B](https://arxiv.org/html/2606.00724#A2). To ensure a fair comparison, we re-run LLaDA-8B-Instruct Nie et al. ([2025](https://arxiv.org/html/2606.00724#bib.bib1)), Dream-v0-Base-7B Ye et al. ([2025](https://arxiv.org/html/2606.00724#bib.bib15)), Fast-dLLM Wu et al. ([2025b](https://arxiv.org/html/2606.00724#bib.bib18)), and Elastic-Cache Nguyen-Tri et al. ([2025](https://arxiv.org/html/2606.00724#bib.bib19)) under identical hardware and software environments.

##### Evaluation Framework and Metrics.

We utilize the lm-eval-harness Gao et al. ([2024](https://arxiv.org/html/2606.00724#bib.bib20)) framework for our evaluation. Following the protocol established by Fast-dLLM and Elastic-Cache, throughput is measured as the average tokens per second (Tokens/sec) calculated until the generation process terminates.
##### Confidence\-Aware Decoding\.
We implement the confidence-aware decoding strategy from Fast-dLLM Wu et al. ([2025b](https://arxiv.org/html/2606.00724#bib.bib18)). Unlike the fixed-step unmasking mechanism in baseline Diffusion LLMs, this strategy performs dynamic filtering by introducing a threshold $\epsilon$: tokens are selected only when their confidence scores exceed this limit. This mechanism allows the model to adaptively adjust its decoding scope per iteration based on prediction quality, thereby enhancing inference efficiency. Consequently, our experiments focus on evaluating the additional acceleration gains specifically provided by KV caching under this unified decoding framework.
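The thresholding rule can be sketched as a single decoding step. This is a minimal illustration rather than the Fast-dLLM code: the function name `confidence_unmask` and the toy probabilities are hypothetical, confidence is taken to be the maximum token probability at each position, and the fallback that unmasks the single most confident position when nothing clears the threshold is an assumption added so that decoding always makes progress.

```python
import numpy as np

def confidence_unmask(probs, masked, eps):
    """One confidence-aware unmasking step.

    probs:  (N, V) per-position token probabilities.
    masked: (N,) boolean, True where the position is still masked.
    Returns the predicted token ids and a boolean mask of positions to unmask.
    """
    conf = probs.max(axis=-1)            # confidence of the best token per position
    pred = probs.argmax(axis=-1)         # the best token per position
    accept = masked & (conf > eps)       # unmask only confident masked positions
    if masked.any() and not accept.any():
        # Assumed fallback: unmask the most confident masked position.
        best = np.argmax(np.where(masked, conf, -np.inf))
        accept[best] = True
    return pred, accept

probs = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]])
masked = np.array([True, True, False])
pred, accept = confidence_unmask(probs, masked, eps=0.7)
```

With $\epsilon=0.7$ only the first position is unmasked in this step; the second stays masked until a later iteration raises its confidence, which is how the number of tokens decoded per iteration adapts to prediction quality.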
### 4.2 Performance and Efficiency Evaluation
To evaluate the effectiveness and efficiency of the proposed WaveFilter framework, we conduct benchmark testing on the LongBench and Ruler datasets. We integrate WaveFilter into several competitive baselines and perform a detailed experimental comparison. Table [1](https://arxiv.org/html/2606.00724#S4.T1) and Table [2](https://arxiv.org/html/2606.00724#S4.T2) present the accuracy, throughput, and relative speedup ratios of each baseline. Additionally, Appendix [C](https://arxiv.org/html/2606.00724#A3) details the total generated tokens and the total runtime for each baseline across all datasets.
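As a quick check on how the reported quantities relate, the sketch below computes throughput as generated tokens divided by runtime and the speedup ratio as a method's throughput divided by the dense baseline's throughput. The two throughput values are the Qsp entries for LLaDA-8B-Instruct and Fast-dLLM reported in Table 1; the token count and runtime are hypothetical numbers chosen only to illustrate the formula.

```python
def throughput(total_tokens, total_seconds):
    # Average decoding throughput in Tokens/sec.
    return total_tokens / total_seconds

def speedup(method_tps, baseline_tps):
    # Speedup ratio relative to the dense baseline.
    return method_tps / baseline_tps

# Hypothetical run: 1024 generated tokens in 400 seconds.
example_tps = throughput(1024, 400)

# Qsp column of Table 1: dense LLaDA-8B-Instruct vs. + Fast-dLLM.
baseline_tps, fast_dllm_tps = 2.56, 15.01
ratio = round(speedup(fast_dllm_tps, baseline_tps), 2)
```

The resulting ratio of 5.86 matches the speedup listed for Fast-dLLM on Qsp.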
Table 1: Comprehensive benchmark results of LLaDA-8B-Instruct and Dream-v0-Base-7B on LongBench. Each cell displays the accuracy / the decoding throughput in Tokens/sec (the speedup ratio relative to the LLaDA and Dream baselines). The symbol "-" denotes out of memory errors. Task groups: Single-Doc QA (Qsp, MulF); Multi-Doc QA (HQA, 2WQA, MSQ); Few-shot Learning (TREC, TrQA); Synthetic (PsgC, PsgR); Code (LCC).

| Method | Qsp | MulF | HQA | 2WQA | MSQ | TREC | TrQA | PsgC | PsgR | LCC |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaDA-8B-Instruct | 24.64 / 2.56 (1.00×) | 24.52 / 2.03 (1.00×) | 3.36 / 1.18 (1.00×) | 3.95 / 2.00 (1.00×) | 1.08 / 0.94 (1.00×) | 33.17 / 0.23 (1.00×) | 67.33 / 0.16 (1.00×) | 1.10 / 1.84 (1.00×) | 15.89 / 2.47 (1.00×) | 61.66 / 6.75 (1.00×) |
| + Fast-dLLM | 22.03 / 15.01 (5.86×) | 24.25 / 11.59 (5.71×) | 2.64 / 8.35 (7.08×) | 3.69 / 12.87 (6.44×) | 0.92 / 7.33 (7.80×) | 33.03 / 0.89 (3.87×) | 67.13 / 1.31 (8.19×) | 1.03 / 8.30 (4.51×) | 20.04 / 9.27 (3.75×) | 60.05 / 27.27 (4.04×) |
| + Fast-dLLM & WaveFilter | 28.73 / 6.80 (2.66×) | 26.37 / 7.42 (3.66×) | 2.65 / 5.71 (4.84×) | 4.95 / 8.66 (4.33×) | 1.02 / 5.37 (5.71×) | 31.10 / 0.52 (2.26×) | 64.99 / 0.86 (5.38×) | 1.90 / 5.27 (2.86×) | 22.32 / 6.13 (2.48×) | 60.08 / 19.44 (2.88×) |
| + Elastic-Cache | - | 21.34 / 10.42 (5.13×) | - | - | - | 32.78 / 0.58 (2.52×) | - | - | 21.56 / 5.32 (2.15×) | - |
| + Elastic-Cache & WaveFilter | - | 24.38 / 10.88 (5.36×) | - | - | - | 32.86 / 0.73 (3.17×) | - | - | 21.94 / 5.11 (2.07×) | - |
| Dream-v0-Base-7B | 24.68 / 3.47 (1.00×) | 37.43 / 2.21 (1.00×) | 12.41 / 1.10 (1.00×) | 5.66 / 2.29 (1.00×) | 5.46 / 0.86 (1.00×) | 70.00 / 3.23 (1.00×) | 87.90 / 1.37 (1.00×) | 0.80 / 2.20 (1.00×) | 14.43 / 2.87 (1.00×) | 19.30 / 9.41 (1.00×) |
| + Fast-dLLM | 24.58 / 28.05 (8.08×) | 36.60 / 21.40 (9.68×) | 11.85 / 12.83 (11.66×) | 4.68 / 21.97 (9.59×) | 3.76 / 10.69 (12.43×) | 70.00 / 25.16 (12.43×) | 89.29 / 13.71 (10.01×) | 1.05 / 14.08 (6.40×) | 16.69 / 16.71 (5.82×) | 15.01 / 45.17 (4.80×) |
| + Fast-dLLM & WaveFilter | 27.95 / 16.66 (4.80×) | 37.93 / 11.38 (5.15×) | 18.13 / 7.55 (6.86×) | 5.83 / 13.34 (5.83×) | 10.29 / 6.06 (7.05×) | 69.50 / 12.78 (3.96×) | 89.16 / 7.42 (5.42×) | 1.32 / 8.32 (3.78×) | 20.06 / 10.08 (3.51×) | 13.81 / 28.67 (3.05×) |
| + Elastic-Cache | - | 32.85 / 31.34 (14.18×) | - | 4.46 / 13.96 (6.10×) | - | 35.54 / 10.53 (6.10×) | - | - | 7.14 / 13.73 (4.78×) | - |
| + Elastic-Cache & WaveFilter | - | 39.64 / 37.77 (17.09×) | - | 5.89 / 12.31 (5.38×) | - | 33.49 / 11.02 (3.41×) | - | - | 16.77 / 12.98 (4.52×) | - |
##### LongBench Results\.
Table [1](https://arxiv.org/html/2606.00724#S4.T1) summarizes the experimental results across various long-context tasks, including single-document QA, multi-document QA, few-shot learning, synthetic tasks, and code generation. The results indicate that introducing the WaveFilter framework as a plug-in component into existing cache management methods improves model accuracy on most long-context tasks, at the cost of lower throughput than the cache method alone. For instance, in single-document QA, multi-document QA, synthetic tasks, and code generation, Fast-dLLM & WaveFilter consistently improves the accuracy of LLaDA-8B-Instruct compared to the vanilla Fast-dLLM. Meanwhile, Elastic-Cache & WaveFilter not only enhances the accuracy of LLaDA-8B-Instruct over Elastic-Cache but also improves throughput and reduces the total execution time on the MulF dataset. Furthermore, when extended to Dream-v0-Base-7B, WaveFilter similarly yields accuracy improvements on most tasks.
##### Ruler Results\.
To further evaluate the WaveFilter framework under varying context lengths, we report the performance on the Ruler benchmark at 4K and 8K context lengths in Table [2](https://arxiv.org/html/2606.00724#S4.T2). The empirical findings reveal that as the context length increases, standard cache management methods like Fast-dLLM and Elastic-Cache suffer from severe performance degradation compared to the dense LLaDA-8B-Instruct model. Integrating the WaveFilter framework mitigates part of this accuracy drop for both Fast-dLLM and Elastic-Cache. For example, on the vt dataset at the 8K context length, the dense LLaDA-8B-Instruct model achieves an accuracy of 40.60%. With Fast-dLLM and Elastic-Cache, the accuracy falls to 19.08% and 25.72%, respectively. Adding the WaveFilter framework raises these figures to 21.56% and 26.04%, partially alleviating the performance deterioration.
In summary, the consistent results across both LongBench and Ruler benchmarks demonstrate the exceptional effectiveness and generalization capability of the WaveFilter framework\. As a plug\-and\-play, universal component, WaveFilter not only significantly alleviates the accuracy attenuation inherent in KV Cache methods during long\-context tasks but also maintains or even enhances throughput on specific benchmarks\. These findings validate the rationale of preserving critical contextual information through multi\-scale filtering mechanisms\.
Table 2: Comprehensive benchmark results of LLaDA-8B-Instruct on Ruler. Each cell lists accuracy / decoding throughput in tokens/sec, followed in parentheses by the speedup ratio relative to the LLaDA baseline.

| Method | s1 | s2 | m1 | m2 | mv | mq | vt | cwe | qa1 | qa2 | Context Length |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaDA-8B-Instruct | 100.00 / 3.80 (1.00×) | 100.00 / 8.12 (1.00×) | 100.00 / 7.33 (1.00×) | 92.40 / 4.99 (1.00×) | 100.00 / 4.61 (1.00×) | 99.80 / 6.35 (1.00×) | 96.92 / 6.43 (1.00×) | 30.56 / 6.35 (1.00×) | 79.18 / 4.25 (1.00×) | 78.80 / 5.25 (1.00×) | 4K |
| + Fast-dLLM | 99.60 / 22.46 (5.91×) | 100.00 / 21.10 (2.60×) | 100.00 / 21.40 (2.92×) | 88.00 / 22.95 (4.60×) | 97.55 / 29.60 (6.42×) | 99.55 / 31.43 (4.95×) | 93.08 / 28.43 (4.42×) | 1.58 / 30.42 (4.79×) | 77.32 / 12.08 (2.84×) | 77.40 / 17.39 (3.31×) | 4K |
| + Fast-dLLM & WaveFilter | 99.80 / 5.95 (1.57×) | 100.00 / 6.15 (0.76×) | 100.00 / 5.81 (0.79×) | 85.80 / 13.91 (2.79×) | 98.20 / 18.62 (4.04×) | 98.60 / 17.65 (2.78×) | 94.00 / 21.48 (3.34×) | 1.64 / 16.16 (2.54×) | 77.35 / 6.58 (1.55×) | 76.80 / 11.92 (2.27×) | 4K |
| + Elastic-Cache | 99.20 / 65.28 (17.18×) | 100.00 / 21.43 (2.64×) | 100.00 / 29.74 (4.06×) | 90.20 / 17.39 (3.48×) | 96.85 / 18.75 (4.07×) | 99.75 / 24.72 (3.89×) | 96.48 / 22.21 (3.45×) | 16.50 / 9.03 (1.42×) | 80.37 / 92.51 (21.77×) | 77.20 / 45.77 (8.72×) | 4K |
| + Elastic-Cache & WaveFilter | 99.60 / 54.41 (14.32×) | 100.00 / 17.86 (2.20×) | 100.00 / 28.50 (3.89×) | 87.40 / 15.11 (3.03×) | 97.15 / 17.01 (3.69×) | 99.78 / 20.49 (3.23×) | 96.78 / 19.05 (2.96×) | 16.95 / 7.44 (1.17×) | 80.75 / 88.14 (4.88×) | 77.80 / 43.22 (8.23×) | 4K |
| LLaDA-8B-Instruct | 57.00 / 4.08 (1.00×) | 76.40 / 1.73 (1.00×) | 63.60 / 2.08 (1.00×) | 43.00 / 1.91 (1.00×) | 55.75 / 3.08 (1.00×) | 54.45 / 2.71 (1.00×) | 40.60 / 3.14 (1.00×) | 26.84 / 2.81 (1.00×) | 49.63 / 2.65 (1.00×) | 66.80 / 2.39 (1.00×) | 8K |
| + Fast-dLLM | 45.40 / 17.06 (4.18×) | 55.20 / 11.44 (6.61×) | 61.40 / 12.90 (6.20×) | 40.60 / 11.53 (6.04×) | 43.80 / 19.86 (6.45×) | 52.10 / 17.92 (6.61×) | 19.08 / 17.58 (5.60×) | 4.54 / 13.76 (4.90×) | 47.87 / 15.50 (5.85×) | 63.00 / 13.70 (5.73×) | 8K |
| + Fast-dLLM & WaveFilter | 47.80 / 3.58 (0.88×) | 55.80 / 7.68 (4.44×) | 62.05 / 8.65 (4.16×) | 38.6 / 7.05 (3.69×) | 43.90 / 13.16 (4.27×) | 53.10 / 12.16 (4.49×) | 21.56 / 13.11 (4.18×) | 4.90 / 9.70 (3.45×) | 48.98 / 10.70 (4.04×) | 63.50 / 9.21 (3.85×) | 8K |
| + Elastic-Cache | 44.80 / 12.39 (3.04×) | 61.00 / 7.01 (4.05×) | 62.80 / 7.91 (3.80×) | 42.00 / 7.55 (3.95×) | 50.60 / 10.11 (3.28×) | 56.00 / 10.77 (3.97×) | 25.72 / 11.57 (3.68×) | 7.48 / 3.54 (1.26×) | 47.87 / 10.36 (3.91×) | 63.40 / 10.29 (4.31×) | 8K |
| + Elastic-Cache & WaveFilter | 45.20 / 9.26 (2.27×) | 61.20 / 7.43 (4.29×) | 63.20 / 8.94 (4.30×) | 41.20 / 5.62 (2.79×) | 51.12 / 10.10 (3.28×) | 56.15 / 10.08 (3.72×) | 26.04 / 8.99 (2.86×) | 9.30 / 2.38 (0.85×) | 48.85 / 9.72 (3.67×) | 63.85 / 8.40 (3.51×) | 8K |
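The speedup ratios in Table 2 are throughput ratios against the LLaDA baseline at the same context length. A minimal sketch of that arithmetic, using throughput values copied from the s1 column at 4K; the helper name `speedup` is ours and purely illustrative:

```python
def speedup(throughput: float, baseline_throughput: float) -> float:
    """Speedup of a cached method relative to the uncached baseline."""
    return throughput / baseline_throughput


# Throughput values (tokens/sec) copied from the s1 column at 4K in Table 2.
baseline = 3.80              # LLaDA-8B-Instruct
fast_dllm = 22.46            # + Fast-dLLM
fast_dllm_wavefilter = 5.95  # + Fast-dLLM & WaveFilter

print(f"{speedup(fast_dllm, baseline):.2f}x")             # 5.91x
print(f"{speedup(fast_dllm_wavefilter, baseline):.2f}x")  # 1.57x
```

Both printed values match the ratios reported in the corresponding cells.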
### 4.3 Ablations and Analysis
We conduct ablation studies on two key hyperparameters: (1) the recurrence scale $B$ and (2) the candidate region proportion $m_B$. This section investigates how they influence accuracy and computational efficiency, thereby validating the rationale for our default configuration.
#### 4.3.1 Ablation and Analysis on Scale
Figure [3](https://arxiv.org/html/2606.00724#S4.F3) illustrates the sensitivity of the WaveFilter framework to $B$. The choice of $B$ has a significant impact on both accuracy and computational efficiency.
##### Impact on Accuracy.
A larger $B$ does not monotonically improve accuracy. The model typically reaches peak accuracy at intermediate scales (e.g., $B=2$ or $B=3$). This suggests that an overly shallow scale ($B=1$) fails to separate noise from key semantic features, whereas an excessively deep scale ($B=4$) may lose critical semantic information.
##### Impact on Throughput and Total Runtime.
As $B$ increases from 1 to 4, throughput shows a clear downward trend and total runtime increases, because higher values of $B$ entail more layers of recurrent filtering. Although a deeper hierarchy enables finer-grained feature alignment, each additional recurrence layer adds computational overhead from wavelet transforms, perceptual weight matrices, and candidate region selection, which increases inference latency.
Based on the above analysis, we conclude that the multi-scale recursive filtering described in Section [3.2](https://arxiv.org/html/2606.00724#S3.SS2) is essential. Furthermore, given the trade-off between accuracy and computational efficiency, setting $B=2$ as the default configuration is both the optimal and a practical choice.
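To make the role of the scale concrete, the sketch below applies the low-pass (approximation) branch of a discrete wavelet transform repeatedly and reports how the sequence length shrinks. It assumes a Haar wavelet and a toy one-dimensional signal; the wavelet choice and the function names `haar_lowpass` and `multiscale_lowpass` are illustrative assumptions, not the configuration used in the experiments.

```python
import math


def haar_lowpass(seq: list[float]) -> list[float]:
    """One level of the Haar DWT approximation (low-pass) branch.

    Adjacent pairs are combined with the orthonormal 1/sqrt(2) scaling,
    so the output is half as long as the input (length must be even).
    """
    if len(seq) % 2 != 0:
        raise ValueError("sequence length must be even")
    return [(seq[i] + seq[i + 1]) / math.sqrt(2) for i in range(0, len(seq), 2)]


def multiscale_lowpass(seq: list[float], levels: int) -> list[float]:
    """Apply the low-pass branch `levels` times (the scale B in the text)."""
    for _ in range(levels):
        seq = haar_lowpass(seq)
    return seq


# Toy 1-D signal of length 16; real keys are d-dimensional vectors per token.
signal = [float(i % 4) for i in range(16)]
for B in range(1, 5):
    print(B, len(multiscale_lowpass(signal, B)))
# 1 8
# 2 4
# 3 2
# 4 1
```

Each level halves the number of coarse positions, so a position at scale $B$ summarizes $2^B$ original tokens. This is consistent with the observation that very deep scales can blur token-level detail.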
#### 4.3.2 Ablation and Analysis on Proportions
Figure [3](https://arxiv.org/html/2606.00724#S4.F3) shows the sensitivity of the WaveFilter framework to $m_B$. The parameter $m_B$ directly controls the proportion of tokens retained for subsequent screening after the wavelet transform. The empirical results show that the choice of $m_B$ significantly affects both accuracy and computational efficiency.
##### Impact on Accuracy.
As $m_B$ varies within the range [0.3, 0.6], the model suffers no performance degradation; instead, accuracy improves notably. This indicates that the WaveFilter framework accurately identifies and preserves the core semantic tokens that determine generation quality, while discarding a large number of low-contribution, redundant tokens. Moderate sparsification preserves semantic integrity by keeping backbone tokens and, more importantly, mitigates noise interference in the attention mechanism by filtering out irrelevant tokens. The model can therefore focus on core contextual representations, which improves accuracy. This finding supports applying deep sparse filtering to the KV Cache to balance accuracy and computational efficiency.
Figure 3: Ablation study and performance analysis of WaveFilter combined with different caching methods on LLaDA-8B-Instruct. (a)-(c) evaluate Fast-dLLM & WaveFilter on the Longbench-Qsp dataset, while (d)-(f) evaluate Elastic-Cache & WaveFilter on the Longbench-MulF dataset. The heatmaps illustrate the joint effects of Scale (y-axis) and Proportion (x-axis) across three key metrics: (a, d) Accuracy (%), (b, e) Throughput (Tokens/sec), and (c, f) Total Runtime (min). Darker colors indicate higher values for each respective metric.
##### Impact on Throughput and Total Runtime.
The impact of $m_B$ on computational efficiency is distinctly non-linear. When $m_B$ is too small, aggressive pruning evicts too many tokens from the KV Cache. While this trims redundant information, it also discards part of the crucial semantics, which increases the number of inference steps the model needs; consequently, generation throughput drops and total runtime increases. Conversely, when $m_B$ is too large, redundant tokens refill the KV Cache, and the heavier computational load likewise lowers throughput and prolongs runtime. The model maximizes throughput and minimizes total runtime only when the selection proportion balances information integrity against sparsity, sharply reducing attention overhead without sacrificing core semantics.
Based on the above analysis, the selection proportion is critical to the WaveFilter framework. To balance accuracy and computational efficiency, $m_B$ should be chosen from an appropriate range for each dataset.
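The effect of the proportion can be illustrated with a small selection routine that keeps the highest-weight fraction of candidate positions. This is a sketch under stated assumptions: the weights are made-up values, rounding up with `ceil` is our choice (the rounding rule is not specified here), and `select_top_proportion` is a hypothetical helper name.

```python
import math


def select_top_proportion(weights: list[float], proportion: float) -> list[int]:
    """Return indices of the highest-weight positions, in ascending order.

    `proportion` plays the role of m_B: the fraction of candidate positions
    kept for the next round. At least one position is always kept.
    """
    if not 0.0 < proportion <= 1.0:
        raise ValueError("proportion must be in (0, 1]")
    k = max(1, math.ceil(proportion * len(weights)))  # assumed rounding rule
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:k])


# Made-up perceptual weights for 8 coarse positions (illustration only).
weights = [0.05, 0.30, 0.02, 0.25, 0.01, 0.20, 0.10, 0.07]
for m in (0.3, 0.5, 0.6):
    print(m, select_top_proportion(weights, m))
# 0.3 [1, 3, 5]
# 0.5 [1, 3, 5, 6]
# 0.6 [1, 3, 5, 6, 7]
```

A smaller proportion keeps only the few dominant positions, while a larger one progressively readmits low-weight positions, which is the trade-off discussed above.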
## 5 Related Work
### 5.1 Diffusion Language Models
Diffusion models initially achieved breakthroughs in continuous-domain tasks such as image and audio generation Song and Ermon ([2019](https://arxiv.org/html/2606.00724#bib.bib21)); Ho et al. ([2020](https://arxiv.org/html/2606.00724#bib.bib12)); Dhariwal and Nichol ([2021](https://arxiv.org/html/2606.00724#bib.bib22)); Gupta et al. ([2024](https://arxiv.org/html/2606.00724#bib.bib3)). Recently, to accommodate the discrete nature of text, researchers have introduced modeling approaches based on Markov, multinomial, and continuous-time frameworks, successfully extending this mechanism to the NLP domain Li et al. ([2022](https://arxiv.org/html/2606.00724#bib.bib5)); Gong et al. ([2023](https://arxiv.org/html/2606.00724#bib.bib23)); Lou et al. ([2024](https://arxiv.org/html/2606.00724#bib.bib24)). Currently, the generation quality of masked diffusion models approaches that of autoregressive models Nie et al. ([2025](https://arxiv.org/html/2606.00724#bib.bib1)), demonstrating competitive potential against models like LLaMA Grattafiori et al. ([2024](https://arxiv.org/html/2606.00724#bib.bib26)) and Qwen Yang et al. ([2024](https://arxiv.org/html/2606.00724#bib.bib27)). This not only provides an alternative to the autoregressive paradigm but has also expanded the impact of diffusion models into domains such as multimodality and code generation.
### 5.2 LLM Acceleration Techniques
Although LLM inference efficiency is constrained by quadratic computational overhead and memory bottlenecks on long sequences, it can be significantly accelerated with a KV Cache that stores historical attention states Zhang et al. ([2023](https://arxiv.org/html/2606.00724#bib.bib28)); Li et al. ([2024](https://arxiv.org/html/2606.00724#bib.bib30)); Xiao et al. ([2024](https://arxiv.org/html/2606.00724#bib.bib31)); Zhang et al. ([2025](https://arxiv.org/html/2606.00724#bib.bib32)); Xiao et al. ([2025](https://arxiv.org/html/2606.00724#bib.bib29)). In contrast, caching in diffusion models is more complex: the multiple denoising timesteps and the distinct feature variations across steps severely diminish caching effectiveness. Although caching methods such as DeepCache Ma et al. ([2024](https://arxiv.org/html/2606.00724#bib.bib33)), dLLM-Cache Liu et al. ([2025](https://arxiv.org/html/2606.00724#bib.bib34)), dKV-Cache Ma et al. ([2025](https://arxiv.org/html/2606.00724#bib.bib7)), Fast-dLLM Wu et al. ([2025b](https://arxiv.org/html/2606.00724#bib.bib18)), Fast-dLLM v2 Wu et al. ([2025a](https://arxiv.org/html/2606.00724#bib.bib35)), Sparse-dLLM Song et al. ([2026](https://arxiv.org/html/2606.00724#bib.bib36)), d$^{2}$Cache Jiang et al. ([2025](https://arxiv.org/html/2606.00724#bib.bib37)), and Elastic-Cache Nguyen-Tri et al. ([2025](https://arxiv.org/html/2606.00724#bib.bib19)) have been proposed, they have yet to fully address the efficiency bottlenecks in long-context tasks.
## 6 Conclusion
To address the performance bottlenecks of existing diffusion model caching methods in long-context tasks, this paper proposes WaveFilter, a universal and training-free framework. It introduces the wavelet transform into the KV Cache and locates highly correlated tokens within a reduced-dimensional space through multi-scale recursive filtering. Experimental results demonstrate that WaveFilter enhances the performance of existing caching methods in long-context tasks. Future work will explore KV Cache optimization for DLMs in more complex tasks.
## Limitations
While the proposed WaveFilter framework demonstrates significant efficacy, we acknowledge several limitations that warrant further investigation in future work.
First, relying solely on throughput to evaluate processing speed may lead to incomplete conclusions. As shown by the empirical analyses in Tables [1](https://arxiv.org/html/2606.00724#S4.T1) and [2](https://arxiv.org/html/2606.00724#S4.T2), as well as Appendices [C](https://arxiv.org/html/2606.00724#A3) and [D](https://arxiv.org/html/2606.00724#A4), although integrating WaveFilter causes a notable drop in throughput for the s1 configuration of Ruler, it substantially shortens the total end-to-end execution time and delivers better generation quality than the baseline. This indicates that the conventional throughput metric (i.e., tokens per second) fails to fully capture the practical efficiency gains during inference. A multi-dimensional evaluation that combines total runtime and output quality is therefore essential for a fairer assessment of speed.
Second, despite the sequence compression, the perceptual weight computation still introduces non-negligible overhead. In Equation [7](https://arxiv.org/html/2606.00724#S3.E7), while the wavelet transform shortens the overall sequence length, computing the perceptual weight matrix via the attention mechanism adds computational cost. This additional cost is the primary factor behind the decreased throughput and prolonged running times observed on certain datasets.
Lastly, performance degrades on text summarization tasks. As shown in the few-shot learning tasks in Table [1](https://arxiv.org/html/2606.00724#S4.T1), the WaveFilter framework sparsifies the tokens corresponding to the prompt positions within the KV cache. Since complex summarization tasks depend heavily on both the global context and the specific instructions embedded in the prompt, this structural sparsification can inadvertently discard critical prompt context, thereby severely compromising the model's final generation performance on such tasks.
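For the second limitation, a rough count of the extra scoring work may help. The sketch assumes the dominant cost of one perceptual weight pass is the query-key product against the low-pass keys, whose length is the prompt length divided by $2^B$ (the $I_p/2^B$ term in Algorithm 1). The sizes below are hypothetical, and `scoring_multiply_adds` is an illustrative name; the count is per layer, per head, and per pass, and it ignores the wavelet transform and Top-K selection.

```python
def scoring_multiply_adds(num_queries: int, num_prompt_tokens: int,
                          scale: int, head_dim: int) -> int:
    """Multiply-adds for one query-key score matrix against low-pass keys.

    The key sequence is assumed to be shortened by a factor of 2**scale,
    matching the I_p / 2^B term in Algorithm 1.
    """
    num_lowpass_keys = num_prompt_tokens // (2 ** scale)
    return num_queries * num_lowpass_keys * head_dim


# Hypothetical sizes, chosen only for illustration.
queries, dim, B = 128, 128, 2
for prompt_len in (4096, 8192):
    print(prompt_len, scoring_multiply_adds(queries, prompt_len, B, dim))
```

Under these assumptions the scoring cost still grows linearly with the prompt length, so compression reduces the overhead without removing it.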
## References
- Structured denoising diffusion models in discrete state\-spaces\.InAdvances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6\-14, 2021, virtual,M\. Ranzato, A\. Beygelzimer, Y\. N\. Dauphin, P\. Liang, and J\. W\. Vaughan \(Eds\.\),pp\. 17981–17993\.External Links:[Link](https://proceedings.neurips.cc/paper/2021/hash/958c530554f78bcd8e97125b70e6973d-Abstract.html)Cited by:[§2\.1](https://arxiv.org/html/2606.00724#S2.SS1.p1.11)\.
- Y\. Bai, X\. Lv, J\. Zhang, H\. Lyu, J\. Tang, Z\. Huang, Z\. Du, X\. Liu, A\. Zeng, L\. Hou, Y\. Dong, J\. Tang, and J\. Li \(2024\)LongBench: A bilingual, multitask benchmark for long context understanding\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2024, Bangkok, Thailand, August 11\-16, 2024,L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),pp\. 3119–3137\.External Links:[Link](https://doi.org/10.18653/v1/2024.acl-long.172),[Document](https://dx.doi.org/10.18653/V1/2024.ACL-LONG.172)Cited by:[§4\.1](https://arxiv.org/html/2606.00724#S4.SS1.SSS0.Px1.p1.1)\.
- A\. Campbell, J\. Benton, V\. D\. Bortoli, T\. Rainforth, G\. Deligiannidis, and A\. Doucet \(2022\)A continuous time framework for discrete denoising models\.InAdvances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 \- December 9, 2022,S\. Koyejo, S\. Mohamed, A\. Agarwal, D\. Belgrave, K\. Cho, and A\. Oh \(Eds\.\),External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2022/hash/b5b528767aa35f5b1a60fe0aaeca0563-Abstract-Conference.html)Cited by:[§2\.1](https://arxiv.org/html/2606.00724#S2.SS1.p1.11)\.
- P\. Dhariwal and A\. Q\. Nichol \(2021\)Diffusion models beat gans on image synthesis\.InAdvances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6\-14, 2021, virtual,M\. Ranzato, A\. Beygelzimer, Y\. N\. Dauphin, P\. Liang, and J\. W\. Vaughan \(Eds\.\),pp\. 8780–8794\.External Links:[Link](https://proceedings.neurips.cc/paper/2021/hash/49ad23d1ec9fa4bd8d77d02681df5cfa-Abstract.html)Cited by:[§5\.1](https://arxiv.org/html/2606.00724#S5.SS1.p1.1)\.
- L\. Gao, J\. Tow, B\. Abbasi, S\. Biderman, S\. Black, A\. DiPofi, C\. Foster, L\. Golding, J\. Hsu, A\. Le Noac’h, H\. Li, K\. McDonell, N\. Muennighoff, C\. Ociepa, J\. Phang, L\. Reynolds, H\. Schoelkopf, A\. Skowron, L\. Sutawika, E\. Tang, A\. Thite, B\. Wang, K\. Wang, and A\. Zou \(2024\)A framework for few\-shot language model evaluation\.Zenodo\.External Links:[Document](https://dx.doi.org/10.5281/zenodo.12608602),[Link](https://zenodo.org/records/12608602)Cited by:[§4\.1](https://arxiv.org/html/2606.00724#S4.SS1.SSS0.Px1.p1.1)\.
- S\. Gong, M\. Li, J\. Feng, Z\. Wu, and L\. Kong \(2023\)DiffuSeq: sequence to sequence text generation with diffusion models\.InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1\-5, 2023,External Links:[Link](https://openreview.net/forum?id=jQj-%5C_rLVXsj)Cited by:[§5\.1](https://arxiv.org/html/2606.00724#S5.SS1.p1.1)\.
- S\. Gong, R\. Zhang, H\. Zheng, J\. Gu, N\. Jaitly, L\. Kong, and Y\. Zhang \(2025\)DiffuCoder: understanding and improving masked diffusion models for code generation\.CoRRabs/2506\.20639\.External Links:[Link](https://doi.org/10.48550/arXiv.2506.20639),[Document](https://dx.doi.org/10.48550/ARXIV.2506.20639),2506\.20639Cited by:[§1](https://arxiv.org/html/2606.00724#S1.p1.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan,et al\.\(2024\)The llama 3 herd of models\.arXiv preprint arXiv:2407\.21783\.Cited by:[§5\.1](https://arxiv.org/html/2606.00724#S5.SS1.p1.1)\.
- A\. Gupta, L\. Yu, K\. Sohn, X\. Gu, M\. Hahn, F\. Li, I\. Essa, L\. Jiang, and J\. Lezama \(2024\)Photorealistic video generation with diffusion models\.InComputer Vision \- ECCV 2024 \- 18th European Conference, Milan, Italy, September 29\-October 4, 2024, Proceedings, Part LXXIX,A\. Leonardis, E\. Ricci, S\. Roth, O\. Russakovsky, T\. Sattler, and G\. Varol \(Eds\.\),Lecture Notes in Computer Science,pp\. 393–411\.External Links:[Link](https://doi.org/10.1007/978-3-031-72986-7%5C_23),[Document](https://dx.doi.org/10.1007/978-3-031-72986-7%5F23)Cited by:[§1](https://arxiv.org/html/2606.00724#S1.p1.1),[§5\.1](https://arxiv.org/html/2606.00724#S5.SS1.p1.1)\.
- J\. Ho, A\. Jain, and P\. Abbeel \(2020\)Denoising diffusion probabilistic models\.InAdvances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6\-12, 2020, virtual,H\. Larochelle, M\. Ranzato, R\. Hadsell, M\. Balcan, and H\. Lin \(Eds\.\),External Links:[Link](https://proceedings.neurips.cc/paper/2020/hash/4c5bcfec8584af0d967f1ab10179ca4b-Abstract.html)Cited by:[§5\.1](https://arxiv.org/html/2606.00724#S5.SS1.p1.1)\.
- C\. Hsieh, S\. Sun, S\. Kriman, S\. Acharya, D\. Rekesh, F\. Jia, Y\. Zhang, and B\. Ginsburg \(2024\)RULER: what’s the real context size of your long\-context language models?\.CoRRabs/2404\.06654\.External Links:[Link](https://doi.org/10.48550/arXiv.2404.06654),[Document](https://dx.doi.org/10.48550/ARXIV.2404.06654),2404\.06654Cited by:[§4\.1](https://arxiv.org/html/2606.00724#S4.SS1.SSS0.Px1.p1.1)\.
- Y\. Jiang, Y\. Cai, X\. Luo, J\. Fu, J\. Wang, C\. Liu, and X\. Yang \(2025\)D2\{\}^\{\\mbox\{2\}\}cache: accelerating diffusion\-based llms via dual adaptive caching\.CoRRabs/2509\.23094\.External Links:[Link](https://doi.org/10.48550/arXiv.2509.23094),[Document](https://dx.doi.org/10.48550/ARXIV.2509.23094),2509\.23094Cited by:[§5\.2](https://arxiv.org/html/2606.00724#S5.SS2.p1.1)\.
- A\. Kiruluta, P\. Burity, and S\. Williams \(2025\)Learnable multi\-scale wavelet transformer: A novel alternative to self\-attention\.CoRRabs/2504\.08801\.External Links:[Link](https://doi.org/10.48550/arXiv.2504.08801),[Document](https://dx.doi.org/10.48550/ARXIV.2504.08801),2504\.08801Cited by:[§2\.2](https://arxiv.org/html/2606.00724#S2.SS2.p1.1)\.
- T\. Li, M\. Chen, B\. Guo, and Z\. Shen \(2025\)A survey on diffusion language models\.CoRRabs/2508\.10875\.External Links:[Link](https://doi.org/10.48550/arXiv.2508.10875),[Document](https://dx.doi.org/10.48550/ARXIV.2508.10875),2508\.10875Cited by:[§1](https://arxiv.org/html/2606.00724#S1.p1.1)\.
- X\. L\. Li, J\. Thickstun, I\. Gulrajani, P\. Liang, and T\. B\. Hashimoto \(2022\)Diffusion\-lm improves controllable text generation\.InAdvances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 \- December 9, 2022,S\. Koyejo, S\. Mohamed, A\. Agarwal, D\. Belgrave, K\. Cho, and A\. Oh \(Eds\.\),External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2022/hash/1be5bc25d50895ee656b8c2d9eb89d6a-Abstract-Conference.html)Cited by:[§1](https://arxiv.org/html/2606.00724#S1.p1.1),[§5\.1](https://arxiv.org/html/2606.00724#S5.SS1.p1.1)\.
- Y\. Li, Y\. Huang, B\. Yang, B\. Venkitesh, A\. Locatelli, H\. Ye, T\. Cai, P\. Lewis, and D\. Chen \(2024\)SnapKV: LLM knows what you are looking for before generation\.InAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 \- 15, 2024,A\. Globersons, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. M\. Tomczak, and C\. Zhang \(Eds\.\),External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2024/hash/28ab418242603e0f7323e54185d19bde-Abstract-Conference.html)Cited by:[§5\.2](https://arxiv.org/html/2606.00724#S5.SS2.p1.1)\.
- X\. Liu, Y\. Song, Z\. Liu, Z\. Huang, Q\. Guo, Z\. He, and X\. Qiu \(2026\)LongLLaDA: unlocking long context capabilities in diffusion llms\.InFortieth AAAI Conference on Artificial Intelligence, Thirty\-Eighth Conference on Innovative Applications of Artificial Intelligence, Sixteenth Symposium on Educational Advances in Artificial Intelligence, AAAI 2026, Singapore, January 20\-27, 2026,S\. Koenig, C\. Jenkins, and M\. E\. Taylor \(Eds\.\),pp\. 32186–32194\.External Links:[Link](https://doi.org/10.1609/aaai.v40i38.40491),[Document](https://dx.doi.org/10.1609/AAAI.V40I38.40491)Cited by:[§1](https://arxiv.org/html/2606.00724#S1.p2.1)\.
- Z\. Liu, Y\. Yang, Y\. Zhang, J\. Chen, C\. Zou, Q\. Wei, S\. Wang, and L\. Zhang \(2025\)DLLM\-cache: accelerating diffusion large language models with adaptive caching\.CoRRabs/2506\.06295\.External Links:[Link](https://doi.org/10.48550/arXiv.2506.06295),[Document](https://dx.doi.org/10.48550/ARXIV.2506.06295),2506\.06295Cited by:[§5\.2](https://arxiv.org/html/2606.00724#S5.SS2.p1.1)\.
- A\. Lou, C\. Meng, and S\. Ermon \(2024\)Discrete diffusion modeling by estimating the ratios of the data distribution\.InForty\-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21\-27, 2024,R\. Salakhutdinov, Z\. Kolter, K\. A\. Heller, A\. Weller, N\. Oliver, J\. Scarlett, and F\. Berkenkamp \(Eds\.\),Proceedings of Machine Learning Research,pp\. 32819–32848\.External Links:[Link](https://proceedings.mlr.press/v235/lou24a.html)Cited by:[§5\.1](https://arxiv.org/html/2606.00724#S5.SS1.p1.1)\.
- X\. Ma, G\. Fang, and X\. Wang \(2024\)DeepCache: accelerating diffusion models for free\.InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16\-22, 2024,pp\. 15762–15772\.External Links:[Link](https://doi.org/10.1109/CVPR52733.2024.01492),[Document](https://dx.doi.org/10.1109/CVPR52733.2024.01492)Cited by:[§5\.2](https://arxiv.org/html/2606.00724#S5.SS2.p1.1)\.
- X\. Ma, R\. Yu, G\. Fang, and X\. Wang \(2025\)DKV\-cache: the cache for diffusion language models\.CoRRabs/2505\.15781\.External Links:[Link](https://doi.org/10.48550/arXiv.2505.15781),[Document](https://dx.doi.org/10.48550/ARXIV.2505.15781),2505\.15781Cited by:[§1](https://arxiv.org/html/2606.00724#S1.p1.1),[§5\.2](https://arxiv.org/html/2606.00724#S5.SS2.p1.1)\.
- Q\. Nguyen\-Tri, M\. Ranjan, and Z\. Shen \(2025\)Attention is all you need for KV cache in diffusion llms\.CoRRabs/2510\.14973\.External Links:[Link](https://doi.org/10.48550/arXiv.2510.14973),[Document](https://dx.doi.org/10.48550/ARXIV.2510.14973),2510\.14973Cited by:[§1](https://arxiv.org/html/2606.00724#S1.p2.1),[§4\.1](https://arxiv.org/html/2606.00724#S4.SS1.SSS0.Px1.p1.1),[§5\.2](https://arxiv.org/html/2606.00724#S5.SS2.p1.1)\.
- S\. Nie, F\. Zhu, Z\. You, X\. Zhang, J\. Ou, J\. Hu, J\. Zhou, Y\. Lin, J\. Wen, and C\. Li \(2025\)Large language diffusion models\.CoRRabs/2502\.09992\.External Links:[Link](https://doi.org/10.48550/arXiv.2502.09992),[Document](https://dx.doi.org/10.48550/ARXIV.2502.09992),2502\.09992Cited by:[§1](https://arxiv.org/html/2606.00724#S1.p1.1),[§4\.1](https://arxiv.org/html/2606.00724#S4.SS1.SSS0.Px1.p1.1),[§5\.1](https://arxiv.org/html/2606.00724#S5.SS1.p1.1)\.
- S\. S\. Sahoo, M\. Arriola, Y\. Schiff, A\. Gokaslan, E\. Marroquin, J\. T\. Chiu, A\. Rush, and V\. Kuleshov \(2024a\)Simple and effective masked diffusion language models\.InAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 \- 15, 2024,A\. Globersons, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. M\. Tomczak, and C\. Zhang \(Eds\.\),External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2024/hash/eb0b13cc515724ab8015bc978fdde0ad-Abstract-Conference.html)Cited by:[§2\.1](https://arxiv.org/html/2606.00724#S2.SS1.p1.11)\.
- S\. S\. Sahoo, A\. Gokaslan, C\. D\. Sa, and V\. Kuleshov \(2024b\)Diffusion models with learned adaptive noise\.InAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 \- 15, 2024,A\. Globersons, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. M\. Tomczak, and C\. Zhang \(Eds\.\),External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2024/hash/bee43378b65ec195a67f24709469dcaf-Abstract-Conference.html)Cited by:[§1](https://arxiv.org/html/2606.00724#S1.p1.1)\.
- J\. Sohl\-Dickstein, E\. A\. Weiss, N\. Maheswaranathan, and S\. Ganguli \(2015\)Deep unsupervised learning using nonequilibrium thermodynamics\.InProceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6\-11 July 2015,F\. R\. Bach and D\. M\. Blei \(Eds\.\),JMLR Workshop and Conference Proceedings,pp\. 2256–2265\.External Links:[Link](http://proceedings.mlr.press/v37/sohl-dickstein15.html)Cited by:[§2\.1](https://arxiv.org/html/2606.00724#S2.SS1.p1.11)\.
- Y\. Song and S\. Ermon \(2019\)Generative modeling by estimating gradients of the data distribution\.InAdvances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8\-14, 2019, Vancouver, BC, Canada,H\. M\. Wallach, H\. Larochelle, A\. Beygelzimer, F\. d’Alché\-Buc, E\. B\. Fox, and R\. Garnett \(Eds\.\),pp\. 11895–11907\.External Links:[Link](https://proceedings.neurips.cc/paper/2019/hash/3001ef257407d5a371a96dcd947c7d93-Abstract.html)Cited by:[§5\.1](https://arxiv.org/html/2606.00724#S5.SS1.p1.1)\.
- Y\. Song, X\. Liu, R\. Li, Z\. Liu, Z\. Huang, Q\. Guo, Z\. He, and X\. Qiu \(2026\)Sparse\-dllm: accelerating diffusion llms with dynamic cache eviction\.InFortieth AAAI Conference on Artificial Intelligence, Thirty\-Eighth Conference on Innovative Applications of Artificial Intelligence, Sixteenth Symposium on Educational Advances in Artificial Intelligence, AAAI 2026, Singapore, January 20\-27, 2026,S\. Koenig, C\. Jenkins, and M\. E\. Taylor \(Eds\.\),pp\. 33038–33046\.External Links:[Link](https://doi.org/10.1609/aaai.v40i39.40586),[Document](https://dx.doi.org/10.1609/AAAI.V40I39.40586)Cited by:[§5\.2](https://arxiv.org/html/2606.00724#S5.SS2.p1.1)\.
- C\. Wu, H\. Zhang, S\. Xue, S\. Diao, Y\. Fu, Z\. Liu, P\. O\. Molchanov, P\. Luo, S\. Han, and E\. Xie \(2025a\)Fast\-dllm v2: efficient block\-diffusion LLM\.CoRRabs/2509\.26328\.External Links:[Link](https://doi.org/10.48550/arXiv.2509.26328),[Document](https://dx.doi.org/10.48550/ARXIV.2509.26328),2509\.26328Cited by:[§5\.2](https://arxiv.org/html/2606.00724#S5.SS2.p1.1)\.
- C\. Wu, H\. Zhang, S\. Xue, Z\. Liu, S\. Diao, L\. Zhu, P\. Luo, S\. Han, and E\. Xie \(2025b\)Fast\-dllm: training\-free acceleration of diffusion LLM by enabling KV cache and parallel decoding\.CoRRabs/2505\.22618\.External Links:[Link](https://doi.org/10.48550/arXiv.2505.22618),[Document](https://dx.doi.org/10.48550/ARXIV.2505.22618),2505\.22618Cited by:[§1](https://arxiv.org/html/2606.00724#S1.p2.1),[§4\.1](https://arxiv.org/html/2606.00724#S4.SS1.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.00724#S4.SS1.SSS0.Px2.p1.1),[§5\.2](https://arxiv.org/html/2606.00724#S5.SS2.p1.1)\.
- G\. Xiao, J\. Tang, J\. Zuo, J\. Guo, S\. Yang, H\. Tang, Y\. Fu, and S\. Han \(2025\)DuoAttention: efficient long\-context LLM inference with retrieval and streaming heads\.InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24\-28, 2025,External Links:[Link](https://openreview.net/forum?id=cFu7ze7xUm)Cited by:[§5\.2](https://arxiv.org/html/2606.00724#S5.SS2.p1.1)\.
- G\. Xiao, Y\. Tian, B\. Chen, S\. Han, and M\. Lewis \(2024\)Efficient streaming language models with attention sinks\.InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7\-11, 2024,External Links:[Link](https://openreview.net/forum?id=NG7sS51zVF)Cited by:[§5\.2](https://arxiv.org/html/2606.00724#S5.SS2.p1.1)\.
- A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, and Z\. Qiu \(2024\)Qwen2\.5 technical report\.CoRRabs/2412\.15115\.External Links:[Link](https://doi.org/10.48550/arXiv.2412.15115),[Document](https://dx.doi.org/10.48550/ARXIV.2412.15115),2412\.15115Cited by:[§5\.1](https://arxiv.org/html/2606.00724#S5.SS1.p1.1)\.
- T\. Yao, Y\. Pan, Y\. Li, C\. Ngo, and T\. Mei \(2022\)Wave\-vit: unifying wavelet and transformers for visual representation learning\.InComputer Vision \- ECCV 2022 \- 17th European Conference, Tel Aviv, Israel, October 23\-27, 2022, Proceedings, Part XXV,S\. Avidan, G\. J\. Brostow, M\. Cissé, G\. M\. Farinella, and T\. Hassner \(Eds\.\),Lecture Notes in Computer Science,pp\. 328–345\.External Links:[Link](https://doi.org/10.1007/978-3-031-19806-9%5C_19),[Document](https://dx.doi.org/10.1007/978-3-031-19806-9%5F19)Cited by:[§2\.2](https://arxiv.org/html/2606.00724#S2.SS2.p1.1)\.
- J\. Ye, Z\. Xie, L\. Zheng, J\. Gao, Z\. Wu, X\. Jiang, Z\. Li, and L\. Kong \(2025\)Dream 7b: diffusion large language models\.CoRRabs/2508\.15487\.External Links:[Link](https://doi.org/10.48550/arXiv.2508.15487),[Document](https://dx.doi.org/10.48550/ARXIV.2508.15487),2508\.15487Cited by:[§4\.1](https://arxiv.org/html/2606.00724#S4.SS1.SSS0.Px1.p1.1)\.
- Y\. Zhang, Y\. Liu, H\. Yuan, Z\. Qin, Y\. Yuan, Q\. Gu, and A\. C\. Yao \(2025\)Tensor product attention is all you need\.CoRRabs/2501\.06425\.External Links:[Link](https://doi.org/10.48550/arXiv.2501.06425),[Document](https://dx.doi.org/10.48550/ARXIV.2501.06425),2501\.06425Cited by:[§5\.2](https://arxiv.org/html/2606.00724#S5.SS2.p1.1)\.
- Z\. Zhang, Y\. Sheng, T\. Zhou, T\. Chen, L\. Zheng, R\. Cai, Z\. Song, Y\. Tian, C\. Ré, C\. W\. Barrett, Z\. Wang, and B\. Chen \(2023\)H2O: heavy\-hitter oracle for efficient generative inference of large language models\.InAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 \- 16, 2023,A\. Oh, T\. Naumann, A\. Globerson, K\. Saenko, M\. Hardt, and S\. Levine \(Eds\.\),External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2023/hash/6ceefa7b15572587b78ecfcebb2827f8-Abstract-Conference.html)Cited by:[§5\.2](https://arxiv.org/html/2606.00724#S5.SS2.p1.1)\.
## Appendix A Algorithm Procedure of WaveFilter
To ensure the rigor and completeness of the proposed framework, the formal execution logic of the WaveFilter algorithm is detailed in Algorithm [1](https://arxiv.org/html/2606.00724#alg1). Following the theoretical foundations formulated in Section [3](https://arxiv.org/html/2606.00724#S3), the procedure can be partitioned into three primary phases: initialization, multi-resolution recursive filtering, and cache consolidation. First, the algorithm initializes the complete sequence $x^{0}$ by combining the prompt sequence with positional placeholders for generation, and preserves the initial Key-Value (KV) cache across all Transformer layers (Lines 1–4). Second, it recursively processes the historical KV states via the Discrete Wavelet Transform (DWT), performing attention alignment and index screening across multiple resolutions to extract the most informative sparse KV pairs (Lines 5–16). Finally, the isolated sparse cache is used to compute the final context representation for diffusion decoding (Lines 17–19).
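One round of the filtering phase (Lines 11–15 of Algorithm 1) can be sketched in plain Python as follows. This is a minimal illustration under several assumptions that are not taken from the paper: a Haar wavelet, a single attention head, perceptual weights obtained by summing attention over the query positions, `ceil` rounding for the kept proportion, and zero-based indices. All function names are hypothetical.

```python
import math

Vector = list[float]


def haar_lowpass_rows(rows: list[Vector]) -> list[Vector]:
    """Haar low-pass along the sequence axis: merge adjacent token pairs."""
    s = math.sqrt(2)
    return [[(a + b) / s for a, b in zip(rows[i], rows[i + 1])]
            for i in range(0, len(rows) - 1, 2)]


def softmax(xs: Vector) -> Vector:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def filter_round(queries: list[Vector], keys: list[Vector],
                 scale: int, proportion: float) -> list[int]:
    """Return the original key indices kept as candidates after one round."""
    low = keys
    for _ in range(scale):                       # coarse keys at scale B (Line 11)
        low = haar_lowpass_rows(low)
    d = len(keys[0])
    weights = [0.0] * len(low)
    for q in queries:                            # attention on coarse keys (Line 12)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in low]
        for j, a in enumerate(softmax(scores)):  # accumulate weights (Line 13)
            weights[j] += a
    k_keep = max(1, math.ceil(proportion * len(low)))
    top = sorted(range(len(low)), key=lambda j: weights[j],
                 reverse=True)[:k_keep]          # Top-K selection (Line 14)
    block = 2 ** scale                           # map back to tokens (Line 15)
    return sorted(block * j + k for j in top for k in range(block))


# Toy cache: 8 tokens, 2-dim keys; only tokens 4 and 5 align with the query.
keys = [[0.0, 0.0]] * 4 + [[3.0, 3.0]] * 2 + [[0.0, 0.0]] * 2
print(filter_round([[1.0, 1.0]], keys, scale=1, proportion=0.25))  # [4, 5]
```

With `scale=1` each coarse position covers two tokens, so the single selected coarse position expands back to the token indices 4 and 5.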
**Algorithm 1** The WaveFilter Algorithm

1: **Input:** Prompt $x_{\text{prompt}}$, Generation Length $N$, Scale $B$, Wavelet $\psi$, Selection Proportions $\{m_{b}\}_{b=1}^{B}$, Threshold $\epsilon$.

2: **Initialize:** $x^{0} \leftarrow \{x_{\text{prompt}}; \text{[MASK]}, \dots, \text{[MASK]}\}$; $p \leftarrow \text{length}(x_{\text{prompt}})$

3: $t \leftarrow 1$; $I \leftarrow \{1, \dots, p+N\}$; $\widetilde{I} \leftarrow \{p+1, \dots, p+N\}$; $\widetilde{K}^{1,l}[I] \leftarrow K^{1,l}[I]$; $\widetilde{V}^{1,l}[I] \leftarrow V^{1,l}[I]$

4: $t \leftarrow t+1$

5: **while** $\widetilde{I}^{t} \neq \emptyset$ **do**

6: $H^{t,1}[\widetilde{I}^{t}] \leftarrow \text{Embedding}(x^{t}[\widetilde{I}^{t}])$

7: **for** $l = 1, \dots, L$ **do**

8: $Q^{t,l}[\widetilde{I}^{t}], K^{t,l}[\widetilde{I}^{t}], V^{t,l}[\widetilde{I}^{t}] \leftarrow \text{FFN}(H^{t,l}[\widetilde{I}^{t}])$

9: $\mu^{B} \leftarrow I_{p}$

10: **while** $B \geq 1$ **do**

11: $\widetilde{K}_{low}^{t-1,l(B)}[\mu^{B}] \leftarrow \text{DWT}(\widetilde{K}^{t-1,l}[\mu^{B}], \psi)$

12: $A^{t,l(B)} \leftarrow \text{Softmax}\left(\left(Q^{t,l}[\widetilde{I}^{t}] \cdot (\widetilde{K}_{low}^{t-1,l(B)}[\mu^{B}])^{T}\right)/\sqrt{d}\right)$

13: $W^{t,l(B)} \leftarrow \sum_{i=1}^{I_{p}/2^{B}} A_{i}^{t,l(B)}$

14: $J_{B} \leftarrow \mathop{\text{Top-K}}\limits_{j \in \{1, \dots, I_{p}/2^{B}\}}(W_{j}^{t,l(B)}, m_{B})$

15: $\mu^{B-1} \leftarrow \{2^{B} \cdot j + k \mid j \in J_{B}, k \in \{0, \dots, 2^{B}-1\}\}$; $\mu^{B} \leftarrow \mu^{B-1}$; $B \leftarrow B-1$

16: **end while**

17: $\mu^{0} \leftarrow$ the most informative token indices from $\widetilde{K}^{t-1,l}[I_{p}]$

18: $\widetilde{K}_{sparse}^{t-1,l} \leftarrow [\widetilde{K}^{t-1,l}[\mu^{0}], \widetilde{K}^{t-1,l}[I_{g}]]$; $\widetilde{V}_{sparse}^{t-1,l} \leftarrow [\widetilde{V}^{t-1,l}[\mu^{0}], \widetilde{V}^{t-1,l}[I_{g}]]$

19: Compute the final context representation using $Q^{t,l}[\widetilde{I}^{t}]$, $\widetilde{K}_{sparse}^{t-1,l}$, and $\widetilde{V}_{sparse}^{t-1,l}$

20: **end for**

21: Retain high-confidence tokens via threshold $\epsilon$, mask the rest; $t \leftarrow t+1$

22: **end while**

23: **return** $x^{t-1}$
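The coarse-to-fine index screening of Lines 9–18 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it fixes the Haar wavelet, handles a single attention head, assumes the prompt length is divisible by $2^{B}$, interprets $W$ as the attention mass summed over the query positions, and takes $I_{g}$ to be the generation region that follows the prompt. The names `haar_lowpass`, `wavelet_filter_indices`, and `build_sparse_cache` are hypothetical.

```python
import numpy as np

def haar_lowpass(x, levels):
    """Apply `levels` rounds of the orthonormal Haar low-pass filter along axis 0."""
    for _ in range(levels):
        x = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    return x

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def wavelet_filter_indices(q, k_prompt, scale, proportions):
    """Return the sorted prompt indices kept after coarse-to-fine filtering.

    q:           (n_q, d) queries of the positions being decoded.
    k_prompt:    (p, d) cached prompt keys; p divisible by 2**scale (assumption).
    proportions: {b: m_b}, fraction of coarse positions kept at scale b.
    """
    d = q.shape[1]
    mu = np.arange(k_prompt.shape[0])            # Line 9: start from all prompt indices
    for b in range(scale, 0, -1):                # Line 10: from coarse to fine
        block = 2 ** b
        k_low = haar_lowpass(k_prompt[mu], b)    # Line 11: one coefficient per 2^b tokens
        attn = softmax(q @ k_low.T / np.sqrt(d)) # Line 12
        w = attn.sum(axis=0)                     # Line 13: aggregate over queries (assumption)
        n_keep = max(1, int(np.ceil(proportions[b] * len(w))))
        top = np.sort(np.argsort(w)[-n_keep:])   # Line 14: Top-K coarse positions
        # Line 15: map each kept coarse position back to the 2^b tokens it summarizes.
        mu = mu.reshape(-1, block)[top].reshape(-1)
    return mu

def build_sparse_cache(k, v, kept, p):
    """Line 18: kept prompt entries followed by the generation region (assumed to be k[p:])."""
    k_sparse = np.concatenate([k[kept], k[p:]], axis=0)
    v_sparse = np.concatenate([v[kept], v[p:]], axis=0)
    return k_sparse, v_sparse

# Toy usage: a single salient key at position 13 in a 16-token prompt.
keys = np.zeros((16, 4))
keys[13, 0] = 10.0
query = np.array([[1.0, 0.0, 0.0, 0.0]])
kept = wavelet_filter_indices(query, keys, scale=2, proportions={2: 0.5, 1: 0.5})
print(kept)
```

With two scales and a proportion of 0.5 at each, the sketch keeps a quarter of the prompt positions, and the block containing the salient key survives both rounds. Because each round only scores one coefficient per block, the attention computation at scale $b$ touches $2^{b}$ times fewer keys than a dense pass over the same candidates.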
## Appendix B Detailed Experiment Setup
### B.1 Implementation Details
We conduct all experiments on a single NVIDIA A800 80GB GPU to ensure a consistent hardware environment. We implement the proposed WaveFilter framework on top of Fast-dLLM and Elastic-Cache, and evaluate it on two DLMs: LLaDA-8B-Instruct and Dream-v0-Base-7B. Our evaluation spans two long-context benchmarks: LongBench and Ruler.

Within LongBench, we evaluate the following tasks: single-document QA, represented by Qasper (Qsp) and MultifieldQA\_en (MulF); multi-document QA, including HotpotQA (HQA), 2WikiMultihopQA (2WQA), and Musique (MSQ); the few-shot learning tasks TREC and TriviaQA (TrQA); the synthetic tasks PassageCount (PsgC) and PassageRetrieval\_en (PsgR); and the code completion task LCC.

For the Ruler benchmark, we evaluate 10 core subtasks across four dimensions: (1) foundational and multi-target retrieval capabilities, covering single- to triple-needle retrieval (niah\_single\_1 (s1), niah\_single\_2 (s2)), multi-value retrieval (niah\_multivalue (mv)), and multi-query retrieval (niah\_multiquery (mq)); (2) multi-step logical reasoning and state-tracking abilities, evaluated through multi-key retrieval (niah\_multikey\_1 (m1), niah\_multikey\_2 (m2)) and variable tracking (variable\_tracking (vt)); (3) global information aggregation, measured via common word extraction (cwe); and (4) deep comprehension and question-answering performance on long texts, validated through single-hop QA (qa\_squad (qa1)) and multi-hop QA (qa\_hotpot (qa2)).

For a fair comparison, we re-evaluate all baseline methods, including the confidence-based decoding diffusion models LLaDA-8B-Instruct and Dream-v0-Base-7B, as well as the caching methods Fast-dLLM and Elastic-Cache. Running every method in the same environment removes confounding effects from hardware or software discrepancies, so that observed performance differences can be attributed to the methods themselves.
### B.2 Evaluation Framework and Metrics
To ensure standardized and reproducible experiments, we use the lm-eval-harness framework for all task-specific evaluations. We measure inference speed as throughput in tokens per second (Tokens/sec), calculated as the average number of tokens generated per second over the entire sequence, up to the point where an end-of-sequence (EOS) token is produced. Our calculation follows that of Fast-dLLM and Elastic-Cache, so that inference speed is directly comparable across methods.
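As a rough illustration of this metric, the sketch below counts the tokens of one generated sequence up to the first EOS token and divides by the elapsed wall-clock time. The function names are hypothetical, and whether the EOS token itself is counted is an assumption made here; the exact accounting is defined by the Fast-dLLM and Elastic-Cache evaluation code.

```python
def count_until_eos(token_ids, eos_id):
    """Number of generated tokens up to and including the first EOS (assumption)."""
    for i, tok in enumerate(token_ids):
        if tok == eos_id:
            return i + 1
    # No EOS produced: every generated token counts.
    return len(token_ids)

def throughput(token_ids, eos_id, elapsed_seconds):
    """Tokens per second for a single generated sequence."""
    return count_until_eos(token_ids, eos_id) / elapsed_seconds

# Toy usage: four tokens (including EOS, id 2) generated in two seconds.
print(throughput([5, 7, 9, 2, 0, 0], eos_id=2, elapsed_seconds=2.0))
```

Under this definition, tokens emitted after the first EOS do not raise the measured throughput, but repeated content emitted before it does.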
### B.3 Hyperparameter Settings
Table [3](https://arxiv.org/html/2606.00724#A2.T3) presents the hyperparameters used for WaveFilter. For both the LongBench and Ruler benchmarks, we uniformly set the maximum generation length to 256, the wavelet decomposition scale $B$ to 2, and select the Haar wavelet as the basis function. The threshold $\epsilon$ for confidence-based decoding is set to 0.9. The selection proportion of salient regions is adjusted within the range [0.3, 1] across datasets. That the remaining settings are identical across models and benchmarks suggests that WaveFilter is robust to hyperparameter variations.
To ensure a fair comparison, we standardize the experimental configuration. For the confidence-based decoding diffusion models, LLaDA-8B-Instruct and Dream-v0-Base-7B, the threshold is uniformly set to 0.9. When using Fast-dLLM and Elastic-Cache, including when WaveFilter is applied on top of them, all of their hyperparameters follow the default settings specified in the original papers. This avoids biases introduced by hyperparameter tuning and allows the native performance of each baseline to be evaluated.
Table 3: The hyper-parameters of WaveFilter under various benchmarks.

| Benchmark | Model | Generation length | Scale | Wavelet | Proportion $m_{2}$ | Proportion $m_{1}$ | Threshold |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LongBench | LLaDA-8B-Instruct | 256 | 2 | Haar | [0.3, 1] | [0.3, 1] | 0.9 |
| LongBench | Dream-v0-Base-7B | 256 | 2 | Haar | [0.3, 1] | [0.3, 1] | 0.9 |
| Ruler | LLaDA-8B-Instruct | 256 | 2 | Haar | [0.3, 1] | [0.3, 1] | 0.9 |
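To make the role of the threshold $\epsilon = 0.9$ concrete, the following sketch shows one step of confidence-based decoding: masked positions whose top predicted probability reaches the threshold are committed, and the rest stay masked. It is an illustrative NumPy sketch, not the code of the evaluated models. The function name `confidence_unmask`, the use of `>=` rather than `>`, and the fallback that commits the single most confident position when none reaches the threshold are assumptions.

```python
import numpy as np

def confidence_unmask(probs, is_masked, threshold=0.9):
    """One decoding step: decide which masked positions to commit.

    probs:     (n, vocab) predicted token distribution at each position.
    is_masked: (n,) boolean array, True where the position is still [MASK].
    Returns (pred, commit): predicted token ids and a boolean commit mask.
    """
    conf = probs.max(axis=-1)       # confidence of the most likely token
    pred = probs.argmax(axis=-1)    # the most likely token id
    commit = is_masked & (conf >= threshold)
    if is_masked.any() and not commit.any():
        # Assumption: commit the most confident masked position so decoding advances.
        best = np.argmax(np.where(is_masked, conf, -np.inf))
        commit[best] = True
    return pred, commit

# Toy usage: two masked positions, only the first is confident enough.
probs = np.array([[0.95, 0.05], [0.60, 0.40], [0.10, 0.90]])
pred, commit = confidence_unmask(probs, np.array([True, True, False]))
print(pred, commit)
```

A higher threshold commits fewer tokens per step and therefore requires more denoising iterations, which is why the same value is used for every method being compared.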
## Appendix C Token Statistics and Total Runtime
To provide a transparent assessment of the operational efficiency of WaveFilter, this section reports detailed statistics across the evaluation benchmarks. Table [4](https://arxiv.org/html/2606.00724#A3.T4) and Table [5](https://arxiv.org/html/2606.00724#A3.T5) document the total number of generated tokens and the total execution time for LLaDA-8B-Instruct and Dream-v0-Base-7B across all subsets of the LongBench and Ruler benchmarks. These data characterize the efficiency of our framework and also expose a limitation of the throughput metric when evaluating long-text generation speed, which motivates reporting total runtime alongside it.
Table 4: Comprehensive benchmark results of LLaDA-8B-Instruct and Dream-v0-Base-7B on LongBench. Each cell displays the total generated tokens, followed in parentheses by the total runtime in minutes. The symbol "-" denotes out of memory errors. Task groups: Single-Doc QA (Qsp, MulF); Multi-Doc QA (HQA, 2WQA, MSQ); Few-shot Learning (TREC, TrQA); Synthetic (PsgC, PsgR); Code (LCC).

| Method | Qsp | MulF | HQA | 2WQA | MSQ | TREC | TrQA | PsgC | PsgR | LCC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaDA-8B-Instruct | 30104 (195.97) | 29114 (238.75) | 49658 (701.43) | 46269 (385.63) | 51157 (907.60) | 3733 (271.50) | 5590 (591.47) | 50924 (461.50) | 51119 (345.55) | 125990 (310.86) |
| + Fast-dLLM | 31211 (34.65) | 29448 (42.35) | 49867 (99.52) | 46481 (60.17) | 51176 (116.38) | 2459 (46.20) | 7157 (90.77) | 50072 (100.52) | 51096 (91.83) | 124969 (69.02) |
| + Fast-dLLM & WaveFilter | 14806 (36.28) | 27226 (61.23) | 49133 (143.47) | 42826 (82.38) | 51171 (159.23) | 1577 (67.07) | 8873 (133.33) | 46346 (146.45) | 51047 (138.78) | 113990 (97.72) |
| + Elastic-Cache | - | 37878 (60.60) | - | - | - | 4001 (60.60) | - | - | 51087 (159.93) | - |
| + Elastic-Cache & WaveFilter | - | 37491 (57.45) | - | - | - | 4719 (107.07) | - | - | 51097 (166.47) | - |
| Dream-v0-Base-7B | 50985 (244.77) | 38212 (288.47) | 51125 (773.72) | 51115 (371.90) | 51152 (986.82) | 51200 (264.58) | 51200 (624.38) | 51198 (387.15) | 51200 (297.30) | 127986 (226.57) |
| + Fast-dLLM | 51003 (30.30) | 38163 (29.72) | 51106 (66.36) | 51115 (38.78) | 51123 (79.70) | 51200 (33.92) | 51199 (62.23) | 51192 (60.58) | 51196 (51.07) | 127973 (47.22) |
| + Fast-dLLM & WaveFilter | 50962 (50.98) | 38181 (55.90) | 51064 (112.65) | 51129 (63.87) | 51050 (140.33) | 51200 (66.78) | 51200 (115.03) | 51140 (102.42) | 51176 (84.58) | 127960 (74.38) |
| + Elastic-Cache | - | 38282 (20.37) | - | 51162 (61.07) | - | 51200 (81.02) | - | - | 51199 (62.13) | - |
| + Elastic-Cache & WaveFilter | - | 38271 (16.88) | - | 51108 (69.17) | - | 51120 (77.43) | - | - | 51195 (65.75) | - |
Table 5: Comprehensive benchmark results of LLaDA-8B-Instruct on Ruler. Each cell displays the total generated tokens, followed in parentheses by the total runtime in minutes.

| Method | s1 | s2 | m1 | m2 | mv | mq | vt | cwe | qa1 | qa2 | Context Length |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaDA-8B-Instruct | 15049 (65.97) | 61358 (125.97) | 43153 (98.18) | 116706 (390.07) | 127503 (460.52) | 127486 (334.87) | 118877 (308.05) | 127538 (334.60) | 16429 (64.45) | 36839 (116.90) | 4K |
| + Fast-dLLM | 53474 (39.68) | 97647 (75.75) | 81597 (63.55) | 127256 (92.40) | 127501 (71.78) | 127486 (67.60) | 125999 (73.88) | 127598 (69.92) | 23191 (31.98) | 49151 (47.12) | 4K |
| + Fast-dLLM & WaveFilter | 11846 (33.20) | 12360 (33.48) | 11283 (32.38) | 127015 (152.13) | 90481 (80.97) | 124450 (117.52) | 107579 (83.48) | 114729 (118.32) | 11470 (29.07) | 43783 (61.20) | 4K |
| + Elastic-Cache | 123382 (31.50) | 126300 (97.77) | 125387 (70.27) | 127493 (122.20) | 127493 (113.32) | 127471 (85.95) | 127142 (95.42) | 127614 (235.60) | 123317 (22.22) | 124669 (45.40) | 4K |
| + Elastic-Cache & WaveFilter | 123135 (37.72) | 126118 (117.70) | 125751 (73.55) | 127481 (140.58) | 127497 (124.92) | 127463 (103.67) | 127097 (111.20) | 127580 (285.80) | 123255 (23.30) | 124647 (48.07) | 4K |
| LLaDA-8B-Instruct | 90628 (370.22) | 127543 (1227.47) | 127553 (1021.02) | 127500 (1111.00) | 127560 (690.32) | 127600 (784.80) | 126536 (671.53) | 127897 (759.45) | 127499 (802.55) | 127048 (886.93) | 8K |
| + Fast-dLLM | 110022 (107.43) | 127524 (185.78) | 127531 (164.82) | 127495 (184.23) | 127488 (106.98) | 127610 (118.70) | 126509 (119.92) | 127996 (155.02) | 127500 (137.07) | 127242 (154.78) | 8K |
| + Fast-dLLM & WaveFilter | 22661 (105.58) | 125781 (273.08) | 127260 (245.25) | 121832 (288.02) | 127681 (161.70) | 127807 (175.17) | 126695 (161.07) | 127002 (218.22) | 127546 (198.66) | 120473 (218.01) | 8K |
| + Elastic-Cache | 127493 (171.53) | 127556 (303.01) | 127586 (268.70) | 127402 (281.40) | 127469 (210.05) | 127697 (197.65) | 127413 (183.58) | 127910 (602.37) | 127491 (205.03) | 127499 (206.60) | 8K |
| + Elastic-Cache & WaveFilter | 127339 (229.25) | 127570 (286.17) | 127618 (237.87) | 127453 (378.28) | 127544 (210.50) | 127623 (210.95) | 127347 (236.15) | 127931 (897.65) | 127492 (218.72) | 127497 (252.97) | 8K |
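The gap between throughput and total runtime can be illustrated with a small worked computation on the Ruler s1 subset at 4K context, using the token counts and runtimes reported in Table 5. The ratio below is a coarse aggregate (total tokens divided by total runtime) and is only an assumption about how the two quantities relate; it is not the throughput reported in the main tables. The helper name `tokens_per_second` is hypothetical.

```python
def tokens_per_second(total_tokens, runtime_minutes):
    """Aggregate generation rate over a whole benchmark subset."""
    return total_tokens / (runtime_minutes * 60.0)

# Ruler s1 at 4K context, values copied from Table 5.
fast = tokens_per_second(53474, 39.68)   # + Fast-dLLM
wave = tokens_per_second(11846, 33.20)   # + Fast-dLLM & WaveFilter

print(f"Fast-dLLM:    {fast:.2f} tokens/sec over 39.68 min")
print(f"+ WaveFilter: {wave:.2f} tokens/sec over 33.20 min")
```

On this subset the aggregate rate of Fast-dLLM is roughly four times higher, yet its total runtime is longer, because it generates about 4.5 times as many tokens. A rate computed this way rewards additional output regardless of whether that output is useful.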
## Appendix D Generation Examples
Appendix [D.1](https://arxiv.org/html/2606.00724#A4.SS1) and Appendix [D.2](https://arxiv.org/html/2606.00724#A4.SS2) present output snippets of Fast-dLLM and of Fast-dLLM integrated with WaveFilter, respectively, on the s1 subset of the Ruler benchmark at 4K context length. The comparison shows that the Fast-dLLM generation contains a large amount of redundant text irrelevant to the prompt, whereas with WaveFilter the model generates only the relevant answer. Note that in Table [1](https://arxiv.org/html/2606.00724#S4.T1) and Table [2](https://arxiv.org/html/2606.00724#S4.T2), although Fast-dLLM achieves higher throughput, its outputs contain invalid generations. In long-context tasks, throughput alone therefore fails to reflect the actual operational efficiency and generation quality of the model. This example illustrates that, for complex long-context tasks, throughput should be considered together with total execution time and final generation quality.
### D.1 Case Study: Fast-dLLM
Examples of Fast-dLLM

"Context": "A special magic number is hidden within the following text. Make sure to memorize it. The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. … [Long context text omitted] … One of the special magic numbers for capable-percent is: 7374122. … [Long context text omitted] … The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again."

"Question": "What is the special magic number for wandering-age mentioned in the provided text? The special magic number for wandering-age mentioned in the provided text is"

"Answer": "The special magic number for capable-percent mentioned in the provided text is 7374122.The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. "
### D.2 Case Study: Fast-dLLM & WaveFilter
Examples of Fast-dLLM & WaveFilter

"Context": "A special magic number is hidden within the following text. Make sure to memorize it. The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. … [Long context text omitted] … One of the special magic numbers for capable-percent is: 7374122. … [Long context text omitted] … The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again."

"Question": "What is the special magic number for wandering-age mentioned in the provided text? The special magic number for wandering-age mentioned in the provided text is"

"Answer": "The special magic number for capable-percent mentioned in the provided text is 7374122."
## Appendix E The Use of Large Language Models
During the preparation of this paper, the use of Large Language Models was strictly limited to language polishing and text translation. All core scientific contributions, including research conceptualization, experimental design and execution, data analysis, and conclusion derivation, were independently completed by the authors, who assume full responsibility for this work. The authors have verified the final text for accuracy and for compliance with academic integrity standards.