Enabling KV Caching of Shared Prefix for Diffusion Language Models
Summary
This paper proposes BiCache, a novel KV caching technique for shared prefixes in diffusion language models, which avoids accuracy collapse by dynamically reusing cached keys and values in shallow layers and achieves 36.3%–98.3% throughput improvement.
View Cached Full Text
Cached at: 06/09/26, 08:46 AM
# Enabling KV Caching of Shared Prefix for Diffusion Language Models
Source: [https://arxiv.org/html/2606.07571](https://arxiv.org/html/2606.07571)
Younghun Go Jaehoon Han11footnotemark:1Changyong Shin Chuk Yoo Gyeongsik Yang22footnotemark:2 Korea University, South Korea
###### Abstract
Key\-value \(KV\) caching for shared prefixes is essential for high\-throughput large language model \(LLM\) serving, but it faces critical challenges in emerging diffusion language models \(DLMs\)\. In DLMs, bidirectional attention means that updating any token dynamically alters the entire context and its corresponding KVs\. Thus, existing caching techniques developed for LLMs, which assume that KVs remain invariant once computed, corrupt the shared prefix KVs\. Our experiments show that applying these techniques to DLMs causes model accuracy to collapse to near zero\.
To unlock high\-throughput DLM serving, we propose bidirectional prefix caching,BiCache, the first KV caching technique for shared prefixes in DLMs\.BiCacheis designed based on key observations from our comprehensive analysis: shared prefix KVs remain stable and reusable in shallow layers, while the depth of shallow layers depends on the fraction of shared prefix tokens in each request\. Thus,BiCachedynamically identifies a safe layer depth for reusing shared prefix KVs and eliminates redundant computation\. Evaluations demonstrate thatBiCachesignificantly improves serving throughput by 36\.3%–98\.3% compared to existing techniques without accuracy collapse \(only 0–1\.8% difference\)\.
Enabling KV Caching of Shared Prefix for Diffusion Language Models
Younghun Go††thanks:Equal contribution\.Jaehoon Han11footnotemark:1Changyong Shin Chuk Yoo††thanks:Corresponding authors\.Gyeongsik Yang22footnotemark:2Korea University, South Korea
## 1Introduction
Key\-value \(KV\) caching is a fundamental optimization for efficient large language model \(LLM\) serving\. In attention layers, token hidden states are projected into key and value tensors, and storing these tensors allows serving systems to reuse them in later generation steps instead of recomputing them, thereby improving serving latency and throughput\. KV caching is particularly important for shared prefixes, where the same token sequence \(e\.g\., system prompts\) appears at the beginning of multiple requests and its KVs can be cached and reused across those requests\. In real\-world workloads, shared prefixes account for 85–97% of prompt tokensSrivatsaet al\.\([2025](https://arxiv.org/html/2606.07571#bib.bib37)\)\.
Most existing LLMs are based on autoregressive models \(ARMs\) that generate tokens sequentiallyOpenAI \([2023](https://arxiv.org/html/2606.07571#bib.bib41)\)\. In ARMs, the KVs of previously generated tokens remain unchanged once computed\. As a result, modern serving systems use “shared prefix caching,” which stores the KVs of a shared prefix once and reuses them across requestsKwonet al\.\([2023](https://arxiv.org/html/2606.07571#bib.bib38)\), making shared prefix caching a key optimization for high\-throughput serving\.
Recently, diffusion language models \(DLMs\) have emerged as a promising alternative to ARMsNieet al\.\([2025](https://arxiv.org/html/2606.07571#bib.bib39)\)\. ARMs generate tokens sequentially, making latency grow with output lengthXiaoet al\.\([2025](https://arxiv.org/html/2606.07571#bib.bib40)\)\. Moreover, their causal attention can lead to the reversal curse, where a model may correctly answer a relation in left\-to\-right but fail to infer its inverseBerglundet al\.\([2024](https://arxiv.org/html/2606.07571#bib.bib27)\); Luet al\.\([2024](https://arxiv.org/html/2606.07571#bib.bib26)\)\. In contrast, DLMs generate tokens at arbitrary positions through iterative denoising steps with bidirectional attention, allowing all positions to interact at each step and alleviating the strict left\-to\-right generation bottleneck of ARMs\.
However, the bidirectional generation process of DLMs makes shared prefix caching much harder\. Under bidirectional attention, updating one token changes the KVs of all other tokens, including those in the shared prefix\. Therefore, shared prefix KVs are not invariant across steps or across requests, which is a key difference from ARMs\. Directly reusing the shared prefix KVs as in ARMs can corrupt the attention context and severely collapse model accuracy\. Our motivating experiments show that naively applying shared prefix caching designed for ARMs to DLMs drives accuracy collapse to near zero on diverse benchmarks \(§[3](https://arxiv.org/html/2606.07571#S3)\)\.
Recent studiesWuet al\.\([2025b](https://arxiv.org/html/2606.07571#bib.bib36),[a](https://arxiv.org/html/2606.07571#bib.bib33)\); Liuet al\.\([2025d](https://arxiv.org/html/2606.07571#bib.bib35)\)have introduced KV caching techniques for DLMs; however, they focus on reusing KVs within a single request and do not address shared prefix caching across requests\. Because shared prefixes dominate practical workloads and are central to serving throughput, enabling shared prefix caching for DLMs remains a major open challenge\.
To address this challenge, we propose bidirectional prefix caching,BiCache, the first shared prefix caching technique for DLMs to our knowledge\. The key idea ofBiCacheis to selectively reuse KVs according to layer depth and the shared prefix ratio of a request\. Through an in\-depth analysis of how shared prefix KVs change across requests, layers, and denoising steps, we make three important observations \(§[4](https://arxiv.org/html/2606.07571#S4)\)\. First, shared prefix KVs remain highly similar across requests in shallow layers, making shared prefix caching possible in those layers\. Second, the depth of shallow layers that can safely reuse cached KVs is strongly correlated with the shared prefix ratio, i\.e\., the fraction of shared prefix tokens in each request\. Third, although the remaining deep layers cannot directly reuse shared prefix KVs across requests, they can still reduce redundant computation within a request by reusing KVs with periodic refresh \(§[5](https://arxiv.org/html/2606.07571#S5)\)\.
Based on the observations,BiCacheintroduces two practical mechanisms: shared prefix profiling and layer\-partitioned caching\. Shared prefix profiling determines the layer depth up to which shared prefix KVs can be safely reused without harming accuracy\. During serving, layer\-partitioned caching directly reuses cached KVs of the shared prefix in shallow layers and applies periodic refresh in deep layers to remove redundant KV computation\.
The major contributions of this study are:
- ∙\\bulletCharacterize how shared prefix KVs change in DLMs across requests, layers, and steps\.
- ∙\\bulletPresentBiCache, to our knowledge the first shared prefix caching technique for DLMs\.
- ∙\\bulletDemonstrate significant throughput improvements by 36\.3%–98\.3% without accuracy collapse \(only 0–1\.8% differences\)\.
## 2Background
### 2\.1DLM
LLaDANieet al\.\([2025](https://arxiv.org/html/2606.07571#bib.bib39)\)is a transformer\-based DLM and a foundation for many later DLMs\. Fig\.[1](https://arxiv.org/html/2606.07571#S2.F1)shows its generation process\. It has three key hyperparameters \(0\): the number of transformer layersLL, the output lengthgg, and the number of denoising stepsss\. Throughout the paper, we use the term “step” to refer to a denoising step\. Oversssteps, the model generatesggtokens, predicting either⌊g/s⌋\\lfloor g/s\\rflooror⌊g/s⌋\+1\\lfloor g/s\\rfloor\{\+\}1tokens per step \(e\.g\.,g=8g\{=\}8,s=3s\{=\}3gives\{3,3,2\}\\\{3,3,2\\\}tokens per step\)\.
When a new request arrives, LLaDA forms an input string by appendinggg\[MASK\]tokens to the prompt tokens \(1\)\. At each step, the input string is processed by a forward pass through theLLtransformer layers \(2\)\. In each step, DLM applies bidirectional attention and denoises the designated number of\[MASK\]tokens so that the predicted words \(tokens\) are filled in\. In Fig\.[1](https://arxiv.org/html/2606.07571#S2.F1), step 1 denoises three tokens \(as explained above\), so the first step denoises “explains”, “prefix”, and “caching” \(3\)\. The input string with three denoised tokens is passed to the next step, where another denoising happens using theLLtransformer layers\. The denoised tokens from the previous step remain unchanged, and the model denoises the designated number of mask tokens, three in the example, producing “clearly”, “why”, and “impossible\.” This iterative process continues \(4\) forsssteps until all\[MASK\]tokens are denoised\.
Figure 1:Workflow of token generation in DLMs\.
### 2\.2Key\-Value and Bidirectional Attention
Here, we detail the DLM layer\. We explain how hidden states are projected into KVs and how bidirectional attention aggregates them to update token representations\. Finally, we explain how this bidirectional nature causes KVs to change across steps\.
KV calculation\.At transformer layerℓ∈\{1,…,L\}\\ell\\in\\\{1,\{\\ldots\},L\\\}and stept∈\{1,…,s\}t\\in\\\{1,\{\\ldots\},s\\\}, the hidden state of a token is denoted by𝐡ℓ,t∈ℝd\\mathbf\{h\}\_\{\\ell,t\}\\in\\mathbb\{R\}^\{d\}, whereddis the hidden dimension \(the feature size of each token representation\)\. The layer computes the key and value vectors using its projection matricesWℓK∈ℝdk×dW^\{K\}\_\{\\ell\}\\in\\mathbb\{R\}^\{d\_\{k\}\\times d\}andWℓV∈ℝdv×dW^\{V\}\_\{\\ell\}\\in\\mathbb\{R\}^\{d\_\{v\}\\times d\}\. Here,dkd\_\{k\}anddvd\_\{v\}denote the dimensions of the key and value vectors, respectively; the key vector \(𝐤ℓ,t∈ℝdk\\mathbf\{k\}\_\{\\ell,t\}\\in\\mathbb\{R\}^\{d\_\{k\}\}\) is used to compute attention weights, and the value vector \(𝐯ℓ,t∈ℝdv\\mathbf\{v\}\_\{\\ell,t\}\\in\\mathbb\{R\}^\{d\_\{v\}\}\) is aggregated according to those weights\. The computation is
𝐤ℓ,t=WℓK𝐡ℓ,t,𝐯ℓ,t=WℓV𝐡ℓ,t\.\\mathbf\{k\}\_\{\\ell,t\}=W^\{K\}\_\{\\ell\}\\mathbf\{h\}\_\{\\ell,t\},\\quad\\mathbf\{v\}\_\{\\ell,t\}=W^\{V\}\_\{\\ell\}\\mathbf\{h\}\_\{\\ell,t\}\.\(1\)For conciseness, throughout the paper, we denote the key–value pair by𝐊𝐕ℓ,t≜\(𝐤ℓ,t,𝐯ℓ,t\)\\mathbf\{KV\}\_\{\\ell,t\}\\triangleq\(\\mathbf\{k\}\_\{\\ell,t\},\\mathbf\{v\}\_\{\\ell,t\}\)\.
Bidirectional attention\.Using the keys and values, layerℓ\\ellcomputes an attention output𝐨ℓ,t\\mathbf\{o\}\_\{\\ell,t\}to gather context from the input string\. Specifically, it is calculated using a query vector𝐪ℓ,t\\mathbf\{q\}\_\{\\ell,t\}\(also projected from𝐡ℓ,t\\mathbf\{h\}\_\{\\ell,t\}usingWℓQW^\{Q\}\_\{\\ell\}\) that is compared against the keys of all tokens in the input string, wherenndenotes the number of tokens in the input string:
𝐨ℓ,t=∑j=1nsoftmaxj\(𝐪ℓ,t⋅𝐤ℓ,t\(j\)dk\)𝐯ℓ,t\(j\)\.\\mathbf\{o\}\_\{\\ell,t\}=\\sum\_\{j=1\}^\{n\}\\mathrm\{softmax\}\_\{j\}\\\!\\left\(\\frac\{\\mathbf\{q\}\_\{\\ell,t\}\\cdot\\mathbf\{k\}^\{\(j\)\}\_\{\\ell,t\}\}\{\\sqrt\{d\_\{k\}\}\}\\right\)\\mathbf\{v\}^\{\(j\)\}\_\{\\ell,t\}\.\(2\)
Here,𝐨ℓ,t\\mathbf\{o\}\_\{\\ell,t\}is a weighted sum of value vectors𝐯ℓ,t\(j\)\\mathbf\{v\}^\{\(j\)\}\_\{\\ell,t\}, representing aggregated global context\. The weights are given by the softmax\-normalized scaled dot products, measuring the relevance of thejj\-th token to the current token at steptt\. To produce each token’s𝐨ℓ,t\\mathbf\{o\}\_\{\\ell,t\}, the query attends to allnntokens\. This output𝐨ℓ,t\\mathbf\{o\}\_\{\\ell,t\}updates the next layer’s hidden state𝐡ℓ\+1,t\\mathbf\{h\}\_\{\{\\ell\{\+\}1\},t\}\. AfterLLlayers, the final hidden state is transformed into vocabulary logits to predict the most likely word \(e\.g\., “explains” in Fig\.[1](https://arxiv.org/html/2606.07571#S2.F1)\)\.
Importantly, with bidirectional attention, denoising a token during a step changes𝐨ℓ,t\\mathbf\{o\}\_\{\\ell,t\}for other tokens\. This propagates to their hidden states and modifies their key and value vectors via Eq\. \([1](https://arxiv.org/html/2606.07571#S2.E1)\)\. For example, in Fig\.[1](https://arxiv.org/html/2606.07571#S2.F1), even after the token “explains” has been denoised, its KVs continue to be updated in later steps as the surrounding tokens change\.
### 2\.3Shared Prefix and Its Caching
Shared prefixes are common in practice\. For example, many production LLM serving systems prepend a fixed system prompt to every user request to provide formatting guidelines, safety and ethical constraints, and role instructionsLiet al\.\([2024](https://arxiv.org/html/2606.07571#bib.bib18)\)\. See Appendix[A](https://arxiv.org/html/2606.07571#A1)for examples of real\-world shared prefixes\. Also, in multi\-turn conversations, shared prefixes also naturally arise\. When a user submits a new request within the same session, the input string to the model typically includes previous user requests and responses as a shared prefix so that the model can maintain conversational contextYiet al\.\([2025](https://arxiv.org/html/2606.07571#bib.bib19)\)\. As a result, a substantial portion of the prompt tokens is repeated across requests, accounting for up to 97% of the total prompt token length in practical deploymentsSrivatsaet al\.\([2025](https://arxiv.org/html/2606.07571#bib.bib37)\)\.
Shared prefix caching avoids redundant KV computation by 1\) computing the KVs of a shared prefix once and 2\) reusing them for later requests whose inputs contain the same prefix\. In ARMsKwonet al\.\([2023](https://arxiv.org/html/2606.07571#bib.bib38)\), once the KVs of the shared prefix are cached, they can be reused directly for other requests at every layer and across all steps\. This is possible because, under causal attention in ARMs, the KV values of the prefix remain unchanged even when additional suffix tokens are appended\. Prior work reports that shared prefix caching in ARMs can provide∼\\sim10×\\timeslatency improvement, making it an essential optimization for practical LLM serving systemsGimet al\.\([2024](https://arxiv.org/html/2606.07571#bib.bib20)\)\.
## 3Motivation: New Shared Prefix Caching Technique is Required for DLMs
In this section, we apply existing shared prefix caching designed for ARMs to DLMs and show that it causes high accuracy drop, which motivates the need for a new technique\.
### 3\.1Setup
All experiments are conducted on an NVIDIA B200 GPU with 180 GB memory\. We evaluate LLaDANieet al\.\([2025](https://arxiv.org/html/2606.07571#bib.bib39)\)that has 32LLlayers\. We setg=256g\{=\}256, ands=128s\{=\}128; thus, two tokens are unmasked at each step\. We use a batch size of 1 to fit the model in GPU memory, which is consistent with the configuration used in other studiesQianet al\.\([2026](https://arxiv.org/html/2606.07571#bib.bib6)\); Liuet al\.\([2025b](https://arxiv.org/html/2606.07571#bib.bib5)\)\.
LLaDA is evaluated on widely used benchmarks: 1\) ARC\-Challenge\-Chat \(science multiple\-choice QA\), 2\) GPQA \(graduate\-level question answering\), and 3\) MATH\-500 \(competition\-style mathematics\)\. Similar to existing LLM serving systems and studiesZhuet al\.\([2024](https://arxiv.org/html/2606.07571#bib.bib17)\); Cobbeet al\.\([2021](https://arxiv.org/html/2606.07571#bib.bib16)\), we prepend two system prompts to each request\. First, we attach the real\-world system prompt of xAI’s Grok\-4 modelxAI Organization \([2025](https://arxiv.org/html/2606.07571#bib.bib25)\)\(details in Appendix[A](https://arxiv.org/html/2606.07571#A1)\), which provides role and safety instructions\. Second, we prepend another system prompt that enforces a specific output format \(i\.e\., attaching\#\#before the final answer\) required for strict answer matching in the benchmarks \(described below\)\. Across benchmarks, the shared prefix accounts for 85% of the request on average \(details in Appendix §[D\.5](https://arxiv.org/html/2606.07571#A4.SS5)\), consistent with real\-world workloadsSrivatsaet al\.\([2025](https://arxiv.org/html/2606.07571#bib.bib37)\)\.
We evaluate two metrics: 1\) throughput and 2\) accuracy\. Throughput is measured as the number of generated tokens per second \(tokens/s\), averaged over all requests in each benchmark\. For accuracy, following prior workLian \([2025](https://arxiv.org/html/2606.07571#bib.bib24)\); Marimuthu and Krishnamurthy \([2025](https://arxiv.org/html/2606.07571#bib.bib23)\), we apply strict matching between the model output and the ground\-truth answer\. From the LLaDA output, the final answer following the\#\#symbol is considered correct only if it exactly matches the ground\-truth answer in each benchmark\. Accuracy is reported as the percentage of correct outputs over all requests\.
We compare the following baselines:
- ∙\\bulletNo caching: LLaDA inference without any shared prefix caching\.
- ∙\\bulletvLLMKwonet al\.\([2023](https://arxiv.org/html/2606.07571#bib.bib38)\): LLaDA with vLLM, a representative shared prefix caching technique for ARMs\. For requests sharing the same prefix, vLLM computes the KVs from the shared prefix once, and then reuses them across requests for all layers and steps\.
\(a\)Throughput\.
\(b\)Accuracy\.
Figure 2:Motivating experiment\.
### 3\.2Motivating Experiment Results
Fig\.[2\(a\)](https://arxiv.org/html/2606.07571#S3.F2.sf1)shows the throughput across benchmarks\. Compared to no caching, vLLM consistently achieves higher throughput on all benchmarks\. Specifically, vLLM improves throughput by∼\\sim42\.7%,∼\\sim38\.6%, and∼\\sim48\.8% on ARC\-Challenge\-Chat, GPQA, and MATH\-500\.
However, existing techniques exhibit severe accuracy collapse\. Fig\.[2\(b\)](https://arxiv.org/html/2606.07571#S3.F2.sf2)shows the accuracy on the same benchmarks\. In contrast to the throughput gains in Fig\.[2\(a\)](https://arxiv.org/html/2606.07571#S3.F2.sf1), vLLM causes accuracy collapse across all benchmarks\. Specifically, while no caching achieves 86\.1%, 27\.9%, and 18\.2% accuracy on the three benchmarks, vLLM shows near\-zero accuracy \(0\.1%, 0\.7%, and 0%\), indicating that it fails to generate correct answers\. The results demonstrate that, although the shared prefix caching technique has potential to increase throughput, the technique designed for ARMs leads to significant accuracy collapse because the KV values of shared prefix in DLMs continuously change\.

Figure 3:Heatmap of𝐬𝐢𝐦ℓ,t\\mathbf\{sim\}\_\{\\ell,t\}for all layers and steps\.\(a\)Per task type\.
\(b\)Per shared prefix ratio\.
Figure 5: Shallow layer depth determination\.
Figure 5:Deep layers𝐬𝐢𝐦ℓ,t\\mathbf\{sim\}\_\{\\ell,t\}under periodic refresh\.
## 4Key Observations
To design shared prefix caching for DLMs, we present three key observations \(O1–O3\) on KV value changes in the shared prefix\.
### 4\.1KV similarity is Layer\-dependent but Stable across steps
In ARMs, shared prefix caching is feasible because the KVs of the shared prefix remain identical across requests, allowing the KVs computed once to be reused at all layers and steps\. In DLMs, however, bidirectional attention changes KV values of shared prefix during denoising, so it is unclear whether shared prefix KV reuse across requests is still possible\. To answer this, we examine i\) whether the shared prefix KVs from one request remain similar in other requests, and ii\) how this similarity varies across layers and steps\. We follow the same setup as in §[3\.1](https://arxiv.org/html/2606.07571#S3.SS1), and use WildChat\-4\.8MZhaoet al\.\([2024](https://arxiv.org/html/2606.07571#bib.bib42)\), a large\-scale conversational dataset with diverse prompt lengths and tasks\. Because the dataset lacks ground\-truth answers, we do not use it for evaluation as in Fig\.[2](https://arxiv.org/html/2606.07571#S3.F2)\. Instead, we use it to analyze the KV computation behavior of DLM\.
We measure the cosine similarity between 1\) the shared prefix KVs over all shared prefix token positions computed from the shared prefix alone, which are cached for reuse across requests, and 2\) the corresponding shared prefix KVs at the same token positions obtained when each request is processed without caching during denoising, across layers and steps\. For layerℓ∈\{1,2,…,L\}\\ell\\in\\\{1,2,\{\\ldots\},L\\\}and stept∈\{1,2,…,s\}t\\in\\\{1,2,\{\\ldots\},s\\\}, we denote them by𝐊𝐕ℓ$\\mathbf\{KV\}^\{\\mathdollar\}\_\{\\ell\}and𝐊𝐕ℓ,tref\\mathbf\{KV\}^\{\\mathrm\{ref\}\}\_\{\\ell,t\}, respectively\.111The cached KVs do not carry a step indexttas they are computed once at the first step, as in vLLM, and then reused across all denoising steps\. Each KV representation is flattened into a single vector before computing cosine similarity\.We define
𝐬𝐢𝐦ℓ,t≜cos\(𝐊𝐕ℓ$,𝐊𝐕ℓ,tref\)\.\\mathbf\{sim\}\_\{\\ell,t\}\\triangleq\\cos\\\!\\Big\(\\mathbf\{KV\}^\{\\mathdollar\}\_\{\\ell\},\\,\\mathbf\{KV\}^\{\\mathrm\{ref\}\}\_\{\\ell,t\}\\Big\)\.\(3\)The value of𝐬𝐢𝐦ℓ,t\\mathbf\{sim\}\_\{\\ell,t\}ranges from−1\-1to11, where values closer to11indicate higher similarity\.
Fig\.[3](https://arxiv.org/html/2606.07571#S3.F3)shows a heatmap of𝐬𝐢𝐦ℓ,t\\mathbf\{sim\}\_\{\\ell,t\}over layers and steps, averaged across requests\. Although bidirectional attention makes the similarity lower than 1, the results reveal two patterns that make shared prefix caching feasible in DLMs\. First,𝐬𝐢𝐦ℓ,t\\mathbf\{sim\}\_\{\\ell,t\}is much higher in shallow layers than in deep layers: shallow layers,ℓ∈\{1,…,16\}\\ell\\in\\\{1,\{\\ldots\},16\\\}, show an average similarity of 0\.98, whereas deep layers,ℓ∈\{17,…,32\}\\ell\\in\\\{17,\{\\ldots\},32\\\}, show 0\.86\. This indicates that shared prefix KV reuse across requests is feasible mainly in shallow layers\.
Second, within each layer,𝐬𝐢𝐦ℓ,t\\mathbf\{sim\}\_\{\\ell,t\}remains nearly constant across steps\. The standard deviation across steps is only 0\.007 on average and at most 0\.016, indicating little change in similarity over steps\.
O1:In shallow layers, shared prefix KVs are highly similar and remain stable across steps\. Therefore, shared prefix KVs in shallow layers can be cached and reused across requests\.
### 4\.2Shallow layer depth is correlated with shared prefix ratio
O1 shows that KV reuse across requests is feasible in shallow layers\. We next investigate how many layers can be treated as shallow for a given input\. To quantify this depth, we definebbas the largest layer depth such that𝐬𝐢𝐦ℓ,t≥τ\\mathbf\{sim\}\_\{\\ell,t\}\\geq\\tauholds for all layersℓ∈\{1,…,b\}\\ell\\in\\\{1,\{\\ldots\},b\\\}and all stepst∈\{1,…,s\}t\\in\\\{1,\{\\ldots\},s\\\}, whereb≤Lb\\leq L\. In other words, layers\{1,…,b\}\\\{1,\{\\ldots\},b\\\}are the shallow layers whose shared prefix KVs can be reused without accuracy collapse\.222The first layer’s KVs are projected from the hidden states before bidirectional attention \(§[2\.2](https://arxiv.org/html/2606.07571#S2.SS2)\); thus the shared prefix KVs at the first layer are identical to those computed during denoising, so always their similarity is 1 and henceb≥1b\\geq 1\.Here,τ\\tauis the similarity threshold above which𝐊𝐕ℓ$\\mathbf\{KV\}^\{\\mathdollar\}\_\{\\ell\}can be reused without accuracy loss\. From our analysis, we setτ=0\.97\\tau\{=\}0\.97through empirical analysis \(to be explained in §[6\.4](https://arxiv.org/html/2606.07571#S6.SS4)\)\.
For each DLM, two input characteristics can affectbb: 1\) task type, such as math, coding, or general QA, and 2\) shared prefix ratiorr, i\.e\., the fraction of shared prefix tokens in the input string\. We examine how these two characteristics affectbb\.
First, we analyze howbbvaries across task types\. We measurebbas Eq\. \([3](https://arxiv.org/html/2606.07571#S4.E3)\) on ARC\-Challenge\-Chat, MATH\-500, and GSM8K whenrris 0\.9\. Differentrrvalues also show similar results\. Fig\.[5\(a\)](https://arxiv.org/html/2606.07571#S3.F5.sf1)shows thatbbremains consistent across three benchmarks, with a standard deviation of only 0\.43\. This indicates that task type is not the main factor determining the shallow layer depth\.
Second, we analyze howbbchanges withrr\. Fig\.[5\(b\)](https://arxiv.org/html/2606.07571#S3.F5.sf2)shows thatbbincreases asrrincreases\. The reason is that, under bidirectional attention, the KV of each shared prefix token is affected by the entire input string, including both shared prefix tokens and non\-shared tokens\. Whenrris small, non\-shared tokens occupy a larger fraction of the input, so their effect on the shared prefix KVs is larger, and the similarity drops belowτ\\tauat earlier layers\. In contrast, whenrris large, the input is dominated by shared prefix tokens, so the influence of non\-shared tokens decreases\. As a result, the shared prefix KVs remain similar across requests for more layers, which increasesbb\. We theoretically prove this relationship in Appendix §[B](https://arxiv.org/html/2606.07571#A2)\.333bbalso remains stable and consistent under small changes in system prompt tokens \(e\.g\., instruction wording or metadata fields\); see Appendix §[D\.1](https://arxiv.org/html/2606.07571#A4.SS1)\.
O2:The shallow layer depthbbis correlated with the shared prefix ratiorr\.
Figure 6:BiCacheoverview\.
### 4\.3Deep layers can be cached*within*a request
O1 and O2 show that reuse of shared prefix KVs across requests is feasible for shallow layersℓ∈\{1,…,b\}\\ell\\in\\\{1,\{\\ldots\},b\\\}\. We now examine whether the remaining deep layersℓ∈\{b\+1,…,L\}\\ell\\in\\\{b\{\+\}1,\{\\ldots\},L\\\}can still benefit from KV caching\. To do so, we use the same similarity metric𝐬𝐢𝐦ℓ,t≜cos\(𝐊𝐕ℓ$,𝐊𝐕ℓ,tref\)\\mathbf\{sim\}\_\{\\ell,t\}\\triangleq\\cos\\\!\\big\(\\mathbf\{KV\}^\{\\mathdollar\}\_\{\\ell\},\\mathbf\{KV\}^\{\\mathrm\{ref\}\}\_\{\\ell,t\}\\big\), but now analyze it for deep layers within a request\.
Fig\.[5](https://arxiv.org/html/2606.07571#S3.F5)compares three settings over the deep layers, while reusing cached KVs for the shallow layers as in O1\. First, when𝐊𝐕ℓ$\\mathbf\{KV\}^\{\\mathdollar\}\_\{\\ell\}is directly reused from another request, as in Fig\.[3](https://arxiv.org/html/2606.07571#S3.F3),𝐬𝐢𝐦ℓ,t\\mathbf\{sim\}\_\{\\ell,t\}remains low, with an average of 0\.86 \(black line\)\. Second, when𝐊𝐕ℓ$\\mathbf\{KV\}^\{\\mathdollar\}\_\{\\ell\}is computed once from the current request at the first step of each layer and then reused for later steps,𝐬𝐢𝐦ℓ,t\\mathbf\{sim\}\_\{\\ell,t\}gradually decreases asttincreases, reaching 0\.93 on average \(blue line\), because the cached𝐊𝐕ℓ$\\mathbf\{KV\}^\{\\mathdollar\}\_\{\\ell\}becomes stale\.
Third, based on this observation, we test a periodic\-refresh design in which𝐊𝐕ℓ$\\mathbf\{KV\}^\{\\mathdollar\}\_\{\\ell\}is recomputed for each layer of the current request every 8 steps and reused only until the next refresh\. This design keeps𝐬𝐢𝐦ℓ,t\\mathbf\{sim\}\_\{\\ell,t\}high at 0\.96 \(red dotted line\), close to the threshold of 0\.97 for shallow layers, showing that deep layers can still support effective KV reuse within a request when stale KVs are periodically refreshed\.
O3:For deep layers, shared prefix KVs can be reusedwithina request for multiple steps when they are periodically refreshed\.
## 5Design
Fig\.[6](https://arxiv.org/html/2606.07571#S4.F6)presents the workflow ofBiCache, which consists of two mechanisms: 1\) shared prefix profiling and 2\) layer\-partitioned caching\. Given a DLM to be served,BiCachefirst performs offline shared prefix profiling to determine the shallow layer depthbbfor different shared prefix ratiosrr\. Based on O1, it measures𝐬𝐢𝐦ℓ,t\\mathbf\{sim\}\_\{\\ell,t\}at the first step for each request and layer using predefined profiling datasets \(1\)\. Based on O2, it then builds a lookup table𝒯\\mathcal\{T\}that maps eachrrto the corresponding shallow layer depthbb\(2\)\. This profiling is performed offline before serving user requests\.
When a user request arrives online \(3\),BiCacheperforms layer\-partitioned caching\. It first computesrrfor the request and queries𝒯\\mathcal\{T\}to obtainbb\(4\)\.BiCachethen retrieves the cached shared prefix KVs\. The cached KVs are stored in KV storeHH\(5\) that maps each shared prefix stringppto its cached KVs\. For a new request,BiCacheextracts the shared prefixppand looks up the corresponding entry inHH\.
On a cache hit,BiCacheloads the cached KVs and reuses them for shallow layers11tobb\(6\)\. On a miss, it creates a new entry inHH\(7\) by computing the KVs ofppon the DLM for all layers\.
For each layerℓ\\ell,BiCachethen determines whetherℓ≤b\\ell\\leq b\(shallow layers\) orℓ\>b\\ell\>b\(deep layers\)\. Ifℓ≤b\\ell\\leq b,BiCachedirectly reuses the cached shared prefix KVs \(8\)\. Otherwise, following O3, it computes the shared prefix KVs within the current request and reuses them across steps with periodic refresh everyΔ\\Deltasteps \(9\)\. We explain each mechanism in detail below\.
### 5\.1Shared Prefix Profiling
We explain 1\) profiling dataset, 2\) profiler, and 3\) lookup table𝒯\\mathcal\{T\}\.
Profiling dataset\.We prepare a profiling dataset to learn the relationship betweenbbandrr\. We use WildChat\-4\.8MZhaoet al\.\([2024](https://arxiv.org/html/2606.07571#bib.bib42)\)as a source of diverse user prompts and construct requests with differentrrby combining each prompt with the shared prefix \(system promptZhuet al\.\([2024](https://arxiv.org/html/2606.07571#bib.bib17)\)\)\. For each request,rris calculated as the ratio of shared prefix tokens to all tokens in the input string\. From the requests, we uniformly sampleMMrequests for eachr∈\{0\.01,…,1\.00\}r\\in\\\{0\.01,\{\\ldots\},1\.00\\\}, resulting in100×M100\\times Mrequests\. We setM=500M=500empirically \(§[6\.4](https://arxiv.org/html/2606.07571#S6.SS4)\)\.
Profiler\.Using the profiling dataset,BiCacheprofiles each request as follows\. We follow the same method to measure KV similarity used in Fig\.[3](https://arxiv.org/html/2606.07571#S3.F3)\. For each request with ratiorr, we prepare two inputs: 1\) the string with shared prefix only and 2\) the full input of the shared prefix and remaining prompt tokens\. Running both inputs at the first denoising step for every layer yields two KV values for the shared prefix:𝐊𝐕ℓ$\\mathbf\{KV\}^\{\\mathdollar\}\_\{\\ell\}\(cached KV\) and𝐊𝐕ℓ,1ref\\mathbf\{KV\}^\{\\mathrm\{ref\}\}\_\{\\ell,1\}\(KV computed per layer without caching\)\. We then calculate𝐬𝐢𝐦ℓ,1\\mathbf\{sim\}\_\{\\ell,1\}using Eq\. \([3](https://arxiv.org/html/2606.07571#S4.E3)\)\.
Lookup table𝒯\\mathcal\{T\}\.From the profiled𝐬𝐢𝐦ℓ,1\\mathbf\{sim\}\_\{\\ell,1\}values,BiCachedefinesbbas the largest layer index such that all layers from11tobbsatisfy𝐬𝐢𝐦ℓ,1≥τ\\mathbf\{sim\}\_\{\\ell,1\}\\geq\\tau, whereτ=0\.97\\tau=0\.97\(determined empirically in §[6\.4](https://arxiv.org/html/2606.07571#S6.SS4)\)\.
To determinebbfor eachrr, we use the profiled results onMMrequests for everyr∈\{0\.01,…,1\.00\}r\\in\\\{0\.01,\{\\ldots\},1\.00\\\}\. Each requestiiproduces a set of𝐬𝐢𝐦ℓ,1\\mathbf\{sim\}\_\{\\ell,1\}values across layers, from whichBiCachedeterminesbib\_\{i\}as the largest layer index satisfying the thresholdτ\\tau\.BiCachethen computes the average of\{bi\}i=1M\\\{b\_\{i\}\\\}\_\{i=1\}^\{M\}and uses it as representativebbfor ratiorr\. Repeating this procedure for allrrconstructs𝒯\\mathcal\{T\}that is used in layer\-partitioned caching \(§[5\.2](https://arxiv.org/html/2606.07571#S5.SS2)\)\.
### 5\.2Layer\-partitioned Caching
When a new user request arrives,BiCachefirst computesrrand retrievesbbfrom𝒯\\mathcal\{T\}\. It then reuses the cached KVs in shallow layers as follows\.
Shared prefix KVs in shallow layers\.We first identify the cached shared prefix KVs\. Specifically, we extract the shared prefixppfrom the user request and apply a hash functionh\(⋅\)h\(\\cdot\)to obtain an identifierk=h\(p\)k=h\(p\)\. We then look upkkin a hash\-indexed KV tableHH, which maps shared prefix identifiers to cached KV values\.
On a cache hit \(k∈Hk\\in H\),BiCacheretrieves the cached KVs\{𝐊𝐕ℓ$\}ℓ=1L\\\{\\mathbf\{KV\}^\{\\mathdollar\}\_\{\\ell\}\\\}\_\{\\ell=1\}^\{L\}fromH\[k\]H\[k\]\. On a miss \(k∉Hk\\notin H\),BiCachecomputes the KVs forppat the first step for all layers and stores them inH\[k\]H\[k\]\.
Let𝐊𝐕ℓ,tuse\\mathbf\{KV\}^\{\\mathrm\{use\}\}\_\{\\ell,t\}denote the KV values used for the current request at layerℓ\\elland steptt\. Then, for shallow layersℓ≤b\\ell\\leq b, the KVs are determined as
𝐊𝐕ℓ,tuse=𝐊𝐕ℓ$,∀ℓ≤b,t∈\{1,…,s\}\.\\mathbf\{KV\}^\{\\mathrm\{use\}\}\_\{\\ell,t\}=\\mathbf\{KV\}^\{\\mathdollar\}\_\{\\ell\},\\;\\forall\\,\\ell\\leq b,\\;t\\in\\\{1,\{\\ldots\},s\\\}\.\(4\)That is, the cached KVs are reused in shallow layers, eliminating redundant KV computation\.
Shared prefix KVs in deep layers\.For deep layersℓ\>b\\ell\>b,BiCacheavoids KV reuse across requests but enables reuse within a request through periodic refresh\. Specifically, it recomputes the KVs everyΔ\\Deltasteps, whereΔ\\Deltadenotes the refresh interval\. We define this as follows:
𝐊𝐕ℓ,tuse=\{𝐊𝐕ℓ,tref,ℓ\>b,\(t−1\)modΔ=0,𝐊𝐕ℓ,t−1use,ℓ\>b,otherwise\.\\mathbf\{KV\}^\{\\mathrm\{use\}\}\_\{\\ell,t\}=\\smash\[b\]\{\\begin\{cases\}\\mathbf\{KV\}^\{\\mathrm\{ref\}\}\_\{\\ell,t\},\\;\\ell\>b,\\ \(t\{\-\}1\)\\bmod\\Delta\{=\}0,\\\\ \\mathbf\{KV\}^\{\\mathrm\{use\}\}\_\{\\ell,t\-1\},\\;\\ell\>b,\\ \\text\{otherwise\.\}\\end\{cases\}\}\(5\)
For each steptt,BiCachechecks whetherttcorresponds to a refresh step occurring everyΔ\\Deltasteps \(i\.e\., when\(t−1\)modΔ=0\(t\-1\)\\bmod\\Delta=0\)\. If so,BiCacherecomputes the KVs of shared prefix from the current request at the step, denoted as𝐊𝐕ℓ,tref\\mathbf\{KV\}^\{ref\}\_\{\\ell,t\}\. Otherwise, it reuses the KVs from the previous step𝐊𝐕ℓ,t−1use\\mathbf\{KV\}^\{\\mathrm\{use\}\}\_\{\\ell,t\-1\}\. We chooseΔ\\Deltato balance accuracy drop and throughput improvement, and setΔ=16\\Delta=16based on §[6\.4](https://arxiv.org/html/2606.07571#S6.SS4)\.
MethodARC\-CGPQAMATHGSM8KThr\.Acc\.Thr\.Acc\.Thr\.Acc\.Thr\.Acc\.§[6\.2](https://arxiv.org/html/2606.07571#S6.SS2)No caching79\.486\.172\.027\.972\.918\.250\.056\.3vLLM113\.3 \(\+42\.7%\)0\.0 \(\-86\.1%\)99\.7 \(\+38\.5%\)0\.7 \(\-27\.2%\)108\.4 \(\+48\.5%\)0\.0 \(\-18\.2%\)94\.2 \(\+88\.4%\)30\.9 \(\-25\.4%\)BiCache111\.7\(\+40\.7%\)85\.3\(\-0\.8%\)98\.1\(\+36\.3%\)27\.5\(\-0\.4%\)107\.0\(\+46\.8%\)18\.8\(\+0\.6%\)91\.4\(\+82\.8%\)55\.5\(\-0\.8%\)§[6\.3](https://arxiv.org/html/2606.07571#S6.SS3)Fast\-dLLM85\.184\.382\.127\.076\.516\.451\.763\.1BiCache\[\-2pt\]\+Fast\-dLLM129\.0\(\+51\.6%\)84\.3\(\+0\.0%\)124\.6\(\+51\.8%\)27\.2\(\+0\.2%\)123\.3\(\+61\.2%\)14\.6\(\-1\.8%\)102\.5\(\+98\.3%\)63\.9\(\+0\.8%\)Table 1:Throughput \(tokens/s\) and accuracy \(%\) across benchmarks\. The upper part \(§[6\.2](https://arxiv.org/html/2606.07571#S6.SS2)\) reports main improvements; the lower part \(§[6\.3](https://arxiv.org/html/2606.07571#S6.SS3)\) shows seamless integration results\. Percentages ofBiCacheare relative to no caching \(upper part\) and Fast\-dLLM \(lower part\)\. Bold:BiCacheresults\.
## 6Experiments
### 6\.1Setup
We evaluateBiCacheon LLaDA using the same setup as §[3\.1](https://arxiv.org/html/2606.07571#S3.SS1)\. As DLMs are at an early stage, LLaDA is currently the only model that supports stable, reproducible evaluation; accordingly, previous studies also report results solely on LLaDAGwaket al\.\([2025](https://arxiv.org/html/2606.07571#bib.bib45)\); Lianget al\.\([2026](https://arxiv.org/html/2606.07571#bib.bib46)\)\.BiCacheis implemented in∼\\sim3K lines of code \(details in Appendix §[C](https://arxiv.org/html/2606.07571#A3)\); additional results beyond what we report here are in Appendix[D](https://arxiv.org/html/2606.07571#A4)due to page limit\. We useM=500M\{=\}500,τ=0\.97\\tau\{=\}0\.97, andΔ=16\\Delta\{=\}16forBiCache\. We use four benchmarks: ARC\-Challenge\-Chat, GPQA, MATH\-500, and GSM8K\. GSM8K is a benchmark of grade\-school math problems; the others are explained in §[3\.1](https://arxiv.org/html/2606.07571#S3.SS1)\. Together, they cover a representative range of tasks used in prior DLM optimization studiesWuet al\.\([2025b](https://arxiv.org/html/2606.07571#bib.bib36),[a](https://arxiv.org/html/2606.07571#bib.bib33)\)\.
Figure 7:MMchanges on L1 distance\.
Figure 8:τ\\tauchanges on throughput and accuracy\.
Figure 9:Δ\\Deltachanges on throughput and accuracy\.
Evaluation scope\.We structure our evaluation to evaluate three key aspects: 1\) main improvements \(§[6\.2](https://arxiv.org/html/2606.07571#S6.SS2)\), 2\) seamless integration with orthogonal techniques \(§[6\.3](https://arxiv.org/html/2606.07571#S6.SS3)\), and 3\) parameter analysis \(§[6\.4](https://arxiv.org/html/2606.07571#S6.SS4)\)\. Note that, as research on DLMs is still in an early stage, most papers evaluate their techniques only on LLaDA without comparisons to other baselines or orthogonal techniquesGwaket al\.\([2025](https://arxiv.org/html/2606.07571#bib.bib45)\); Lianget al\.\([2026](https://arxiv.org/html/2606.07571#bib.bib46)\)\. To provide a broader evaluation, we construct our evaluation as follows\.
First, we evaluate the main improvements in throughput and accuracy by comparingBiCachewith no caching and vLLM \(explained in §[3\.1](https://arxiv.org/html/2606.07571#S3.SS1)\)\.
Second, we evaluateBiCache’s seamless compatibility with an orthogonal acceleration method, Fast\-dLLMWuet al\.\([2025b](https://arxiv.org/html/2606.07571#bib.bib36),[a](https://arxiv.org/html/2606.07571#bib.bib33)\)\. Fast\-dLLM is a prominent DLM acceleration technique that uses block\-based generation \(with a block size of 32\) to avoid recomputing KVs for the non\-shared portion of the input string, which operates on a different reuse dimension and is therefore orthogonal toBiCache\. We compare 1\) Fast\-dLLM and 2\) Fast\-dLLM combined withBiCache\.
Third, we analyze the choice of parameters inBiCache\(§[6\.4](https://arxiv.org/html/2606.07571#S6.SS4)\)\. Specifically, we report: 1\) lookup table updates for differentMMvalues to identify the minimumMMneeded for an accurate lookup table; 2\) throughput and accuracy changes with the similarity thresholdτ\\tauused to determinebb\(§[4\.2](https://arxiv.org/html/2606.07571#S4.SS2)\) on ARC\-Challenge\-Chat; and 3\) throughput and accuracy changes with the refresh intervalΔ\\Delta\(§[5\.2](https://arxiv.org/html/2606.07571#S5.SS2)\) for deep layers on the same benchmark\.
### 6\.2Main Improvements
Throughput\.The upper part \(denoted §[6\.2](https://arxiv.org/html/2606.07571#S6.SS2)\) of Table[1](https://arxiv.org/html/2606.07571#S5.T1)shows the main improvements ofBiCacheacross benchmarks\. Compared to no caching,BiCacheimproves throughput by 36\.3% \(GPQA\)–82\.8% \(GSM8K\)\. Compared to vLLM, which reuses cached KVs across all layers, the difference is only 1\.3% \(MATH\)–3\.0% \(GSM8K\)\. Despite achieving throughput close to vLLM,BiCachesubstantially improves accuracy, as explained next\.
Accuracy\.Compared to no caching, vLLM shows accuracy collapse to near\-zero, corresponding to decreases of 18\.2% \(MATH\) to 86\.1% \(ARC\-C\) across benchmarks\. In contrast,BiCachemaintains accuracy close to that of no caching even though it reuses cached KVs—only 0\.4% \(GPQA\)–0\.8% \(GSM8K\) across benchmarks, which corresponds to 31\.8×\\times–107\.6×\\timessmaller accuracy drop than vLLM\. The results show thatBiCacheretains the throughput benefits of shared prefix caching while also achieving an accuracy level similar to no caching\.
### 6\.3Seamless Integration
Throughput\.The lower part \(denoted §[6\.3](https://arxiv.org/html/2606.07571#S6.SS3)\) of Table[1](https://arxiv.org/html/2606.07571#S5.T1)shows the performance ofBiCachecombined with Fast\-dLLM\.BiCacheshows the highest throughput on all benchmarks: 51\.6% \(ARC\-C\)–98\.3% \(GSM8K\) better than Fast\-dLLM alone\.
Accuracy\.Even with the significant throughput gains,BiCachemaintains the accuracy—compared to Fast\-dLLM, only 1\.8% lower on MATH, and 0\.2% and 0\.8% higher on GPQA and GSM8K each\. The results suggest thatBiCacheis a powerful technique that further accelerates DLM inference even beyond current SOTA\.
### 6\.4Parameter Analysis
Number of profiling requestsM\\boldsymbol\{M\}\.Fig\.[9](https://arxiv.org/html/2606.07571#S6.F9)shows the L1 distance betweenbbvalues in lookup tables obtained with consecutiveMMvalues\. A large L1 distance means that thebbvalues still change substantially asMMincreases, indicating that such anMMis insufficient for stable estimation ofbb\. AsMMincreases, the distance decreases untilM=500M\{=\}500, from 0\.2 to 0\.04\. BetweenM=500M\{=\}500andM=600M\{=\}600, the distance differs by only 0\.01\. Thus, we chooseM=500M\{=\}500, wherebbis sufficiently stabilized\.
Thresholdτ\\boldsymbol\{\\tau\}\.Fig\.[9](https://arxiv.org/html/2606.07571#S6.F9)shows throughput \(left y\-axis, white bars\) and accuracy \(right y\-axis, hatched bars\) asτ\\tauvaries\. Throughput remains nearly constant across allτ\\tauvalues—only 0\.4% difference between 111\.9 tokens/s \(τ=0\.91\\tau\{=\}0\.91\) and 111\.5 tokens/s \(τ=0\.99\\tau\{=\}0\.99\)\. In terms of accuracy, it increases from 64% \(τ=0\.91\\tau\{=\}0\.91\) to 85% \(τ=0\.97\\tau\{=\}0\.97\), and remains the same up toτ=0\.99\\tau\{=\}0\.99\. Since throughput is stable and accuracy peaks atτ=0\.97\\tau\{=\}0\.97, we adoptτ=0\.97\\tau\{=\}0\.97inBiCache\.
Refresh interval𝚫\\boldsymbol\{\\Delta\}\.Fig\.[9](https://arxiv.org/html/2606.07571#S6.F9)shows throughput \(x\-axis\) and accuracy \(y\-axis\) asΔ\\Deltavaries \(each dot labeled with itsΔ\\Deltavalue\)\. Both throughput and accuracy remain largely stable across all values ofΔ\\Delta\. In particular, increasingΔ\\Deltafrom 8 to 64 changes throughput by only 2\.5% and accuracy by only 1\.2%\. Thus,BiCacheis robust to the choice ofΔ\\Delta, and we useΔ=16\\Delta\{=\}16\. Overall, the results indicate thatBiCacheis not particularly sensitive to the choice ofMM,τ\\tau, andΔ\\Deltaaround the selected values\.
## 7Related Work
Recent studies have proposed KV caching techniques within a request\. For example, Fast\-dLLMWuet al\.\([2025b](https://arxiv.org/html/2606.07571#bib.bib36)\)and Fast\-dLLM v2Wuet al\.\([2025a](https://arxiv.org/html/2606.07571#bib.bib33)\)divide the steps into blocks, cache the KVs of all tokens within each block, and reuse them across all layers and steps within the block\. dLLM\-CacheLiuet al\.\([2025d](https://arxiv.org/html/2606.07571#bib.bib35)\), D2CacheJianget al\.\([2025](https://arxiv.org/html/2606.07571#bib.bib29)\), and FlashDLMHuet al\.\([2025](https://arxiv.org/html/2606.07571#bib.bib32)\)identify a subset of tokens that show high KV similarity across steps and reuse cached KVs only for tokens that remain stable\. Other tokens that significantly change are updated at every step\. All the techniques focus on KV caching within a request and do not address shared prefix caching across requests\. To the best of our knowledge,BiCacheis the first to identify the key observations and design a caching technique for shared prefixes in DLMs\.
## 8Conclusion
This paper presents bidirectional prefix caching,BiCache, for enabling shared prefix KV caching in diffusion language models\. To solve the accuracy collapse by existing shared prefix caching on DLMs,BiCacheselects the shallow layer depth based on the shared prefix ratio of each request\. In our evaluated setting,BiCacheimproves throughput by 36\.3%–98\.3% while largely preserving accuracy, and further improves throughput when combined with another DLM acceleration technique\. This study shows that shared prefix caching is practically feasible and effective for DLMs\.
## 9Limitations
BiCachehas several limitations that point to future work\.
Prefix matching granularity\.BiCachecurrently relies on exact matching of the shared prefix\. This design is practical because shared prefixes typically come from fixed system prompts, where exact matches are commonLiet al\.\([2024](https://arxiv.org/html/2606.07571#bib.bib18)\); Srivatsaet al\.\([2025](https://arxiv.org/html/2606.07571#bib.bib37)\)\. Although requests that differ by even a single token do not share cached KVs, such cases are relatively uncommon in practice; real\-world shared prefixes often remain largely static, comprising up to 97% of prompt tokens without variationSrivatsaet al\.\([2025](https://arxiv.org/html/2606.07571#bib.bib37)\)\. Exact matching also avoids costly token\-level cache retrieval and enables efficient𝒪\(1\)\\mathcal\{O\}\(1\)lookup via hashing\. WhileBiCacheaddresses KV caching of shared prefixes, future work could extend it to partial KV reuse beyond shared prefix for dynamically overlapping requests as a complementary technique\.
Profiling\-based depth determination\.BiCachedetermines the shallow layer depthbbthrough offline profiling\. Because bidirectional attention changes KVs across the entire context, deriving exact layer\-wise KV variation analytically at runtime is difficultLiuet al\.\([2025a](https://arxiv.org/html/2606.07571#bib.bib3)\)\. As in many ARM and DLM optimizationsLiuet al\.\([2025c](https://arxiv.org/html/2606.07571#bib.bib2)\); Kimet al\.\([2025](https://arxiv.org/html/2606.07571#bib.bib1)\), we therefore use profiling\. In our evaluation, the profiled depth remains robust across tasks \(§[4\.2](https://arxiv.org/html/2606.07571#S4.SS2)\) and system prompts \(Appendix[D\.1](https://arxiv.org/html/2606.07571#A4.SS1)\), and profiling is required only once per model, which is a reasonable practical overhead\. Developing runtime\-adaptive heuristics or a theoretical inference model forbbis an interesting direction for future work\.
Evaluation within the emerging DLM ecosystem\.Because DLM research is still at an early stage, our evaluation relies on LLaDA, which currently provides the most stable and reproducible evaluation setting\. This choice is consistent with recent DLM studiesGwaket al\.\([2025](https://arxiv.org/html/2606.07571#bib.bib45)\); Lianget al\.\([2026](https://arxiv.org/html/2606.07571#bib.bib46)\)\. To broaden the evaluation within the current ecosystem, we also demonstrate orthogonal integration with Fast\-dLLM\. We choose Fast\-dLLM because it is the most mature open\-source accelerator currently available for integration and is used as a baseline in recent state\-of\-the\-art DLM studiesWuet al\.\([2025a](https://arxiv.org/html/2606.07571#bib.bib33)\); Wanget al\.\([2025](https://arxiv.org/html/2606.07571#bib.bib30)\)\. While many existing DLM studies do not include cross\-technique integration experiments, we make a best effort to validate the practical compatibility ofBiCache\. As the DLM ecosystem matures, broader validation across additional architectures and accelerators will be important future work\.
## References
- L\. Berglund, M\. Tong, M\. Kaufmann, M\. Balesni, A\. Stickland, T\. Korbak, and O\. Evans \(2024\)The reversal curse: llms trained on “a is b”fail to learn “b is a”\.InInternational Conference on Learning Representations,B\. Kim, Y\. Yue, S\. Chaudhuri, K\. Fragkiadaki, M\. Khan, and Y\. Sun \(Eds\.\),Vol\.2024,pp\. 18623–18642\.External Links:[Link](https://proceedings.iclr.cc/paper_files/paper/2024/file/5178b2f2d7c44aa390c0777dc77b3f0c-Paper-Conference.pdf)Cited by:[§1](https://arxiv.org/html/2606.07571#S1.p3.1)\.
- A\. Bietti, V\. Cabannes, D\. Bouchacourt, H\. Jegou, and L\. Bottou \(2023\)Birth of a transformer: a memory viewpoint\.InThirty\-seventh Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=3X2EbBLNsk)Cited by:[§B\.2](https://arxiv.org/html/2606.07571#A2.SS2.p1.1)\.
- V\. Castin, P\. Ablin, and G\. Peyré \(2024\)How smooth is attention?\.InProceedings of the 41st International Conference on Machine Learning,ICML’24\.Cited by:[§B\.2](https://arxiv.org/html/2606.07571#A2.SS2.p1.1)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman \(2021\)Training verifiers to solve math word problems\.arXiv preprint arXiv:2110\.14168\.Cited by:[§3\.1](https://arxiv.org/html/2606.07571#S3.SS1.p2.1)\.
- Y\. Collet \(2024\)XxHash: extremely fast hash algorithm\.Note:[https://github\.com/Cyan4973/xxHash](https://github.com/Cyan4973/xxHash)Accessed: 2026\-03\-16Cited by:[Appendix C](https://arxiv.org/html/2606.07571#A3.p1.3)\.
- I\. Gim, G\. Chen, S\. Lee, N\. Sarda, A\. Khandelwal, and L\. Zhong \(2024\)Prompt cache: modular attention reuse for low\-latency inference\.External Links:2311\.04934,[Link](https://arxiv.org/abs/2311.04934)Cited by:[§2\.3](https://arxiv.org/html/2606.07571#S2.SS3.p2.2)\.
- D\. Gwak, M\. Jung, J\. Park, M\. Park, C\. Park, J\. Hyung, and J\. Choo \(2025\)Reward\-weighted sampling: enhancing non\-autoregressive characteristics in masked diffusion LLMs\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 34574–34594\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.1754/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1754),ISBN 979\-8\-89176\-332\-6Cited by:[§6\.1](https://arxiv.org/html/2606.07571#S6.SS1.p1.4),[§6\.1](https://arxiv.org/html/2606.07571#S6.SS1.p2.1),[§9](https://arxiv.org/html/2606.07571#S9.p4.1)\.
- Y\. Hsieh, M\. Cheng, D\. Juan, W\. Wei, W\. Hsu, and C\. Hsieh \(2019\)On the robustness of self\-attentive models\.InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics,A\. Korhonen, D\. Traum, and L\. Màrquez \(Eds\.\),Florence, Italy,pp\. 1520–1529\.External Links:[Link](https://aclanthology.org/P19-1147/),[Document](https://dx.doi.org/10.18653/v1/P19-1147)Cited by:[§B\.2](https://arxiv.org/html/2606.07571#A2.SS2.p1.1)\.
- Z\. Hu, J\. Meng, Y\. Akhauri, M\. S\. Abdelfattah, J\. Seo, Z\. Zhang, and U\. Gupta \(2025\)FlashDLM: accelerating diffusion language model inference via efficient kv caching and guided diffusion\.External Links:2505\.21467,[Link](https://arxiv.org/abs/2505.21467)Cited by:[§7](https://arxiv.org/html/2606.07571#S7.p1.1)\.
- Y\. Jiang, Y\. Cai, X\. Luo, J\. Fu, J\. Wang, C\. Liu, and X\. Yang \(2025\)D2cache: accelerating diffusion\-based llms via dual adaptive caching\.External Links:2509\.23094,[Link](https://arxiv.org/abs/2509.23094)Cited by:[§7](https://arxiv.org/html/2606.07571#S7.p1.1)\.
- M\. Kim, C\. Hooper, A\. Tomar, C\. Xu, M\. Farajtabar, M\. W\. Mahoney, K\. Keutzer, and A\. Gholami \(2025\)Beyond next\-token prediction: a performance characterization of diffusion versus autoregressive language models\.External Links:2510\.04146,[Link](https://arxiv.org/abs/2510.04146)Cited by:[§9](https://arxiv.org/html/2606.07571#S9.p3.2)\.
- W\. Kwon, Z\. Li, S\. Zhuang, Y\. Sheng, L\. Zheng, C\. H\. Yu, J\. Gonzalez, H\. Zhang, and I\. Stoica \(2023\)Efficient memory management for large language model serving with pagedattention\.InProceedings of the 29th Symposium on Operating Systems Principles,SOSP ’23,New York, NY, USA,pp\. 611–626\.External Links:ISBN 9798400702297,[Link](https://doi.org/10.1145/3600006.3613165),[Document](https://dx.doi.org/10.1145/3600006.3613165)Cited by:[Appendix C](https://arxiv.org/html/2606.07571#A3.p2.1),[§1](https://arxiv.org/html/2606.07571#S1.p2.1),[§2\.3](https://arxiv.org/html/2606.07571#S2.SS3.p2.2),[2nd item](https://arxiv.org/html/2606.07571#S3.I1.i2.p1.1.1)\.
- K\. Li, T\. Liu, N\. Bashkansky, D\. Bau, F\. Viégas, H\. Pfister, and M\. Wattenberg \(2024\)Measuring and controlling instruction \(in\)stability in language model dialogs\.InFirst Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=60a1SAtH4e)Cited by:[§2\.3](https://arxiv.org/html/2606.07571#S2.SS3.p1.1),[§9](https://arxiv.org/html/2606.07571#S9.p2.1)\.
- Y\. Lian \(2025\)Reinforcement learning is all you need\.External Links:2503\.09512,[Link](https://arxiv.org/abs/2503.09512)Cited by:[§3\.1](https://arxiv.org/html/2606.07571#S3.SS1.p3.1)\.
- Y\. Liang, Z\. Wang, H\. Chen, X\. Sun, J\. Wu, X\. Yu, J\. Liu, E\. Barsoum, Z\. Liu, and N\. K\. Jha \(2026\)CD4LM: consistency distillation and adaptive decoding for diffusion language models\.External Links:2601\.02236,[Link](https://arxiv.org/abs/2601.02236)Cited by:[§6\.1](https://arxiv.org/html/2606.07571#S6.SS1.p1.4),[§6\.1](https://arxiv.org/html/2606.07571#S6.SS1.p2.1),[§9](https://arxiv.org/html/2606.07571#S9.p4.1)\.
- A\. Liu, M\. He, S\. Zeng, S\. Zhang, L\. Zhang, C\. Wu, W\. Jia, Y\. Liu, X\. Zhou, and J\. Zhou \(2025a\)WeDLM: reconciling diffusion language models with standard causal attention for fast inference\.External Links:2512\.22737,[Link](https://arxiv.org/abs/2512.22737)Cited by:[§9](https://arxiv.org/html/2606.07571#S9.p3.2)\.
- J\. Liu, X\. Dong, Z\. Ye, R\. Mehta, Y\. Fu, V\. Singh, J\. Kautz, C\. Zhang, and P\. Molchanov \(2025b\)TiDAR: think in diffusion, talk in autoregression\.External Links:2511\.08923,[Link](https://arxiv.org/abs/2511.08923)Cited by:[§3\.1](https://arxiv.org/html/2606.07571#S3.SS1.p1.3)\.
- X\. Liu, J\. Park, L\. Hu, W\. Kwon, Z\. Li, C\. Zhang, K\. Du, X\. Mo, K\. You, A\. Cheung, Z\. Deng, I\. Stoica, and H\. Zhang \(2025c\)TurboSpec: closed\-loop speculation control system for optimizing llm serving goodput\.External Links:2406\.14066,[Link](https://arxiv.org/abs/2406.14066)Cited by:[§9](https://arxiv.org/html/2606.07571#S9.p3.2)\.
- Z\. Liu, Y\. Yang, Y\. Zhang, J\. Chen, C\. Zou, Q\. Wei, S\. Wang, and L\. Zhang \(2025d\)DLLM\-cache: accelerating diffusion large language models with adaptive caching\.External Links:2506\.06295,[Link](https://arxiv.org/abs/2506.06295)Cited by:[§1](https://arxiv.org/html/2606.07571#S1.p5.1),[§7](https://arxiv.org/html/2606.07571#S7.p1.1)\.
- Z\. Lu, L\. Jin, P\. Li, Y\. Tian, L\. Zhang, S\. Wang, G\. Xu, C\. Tian, and X\. Cai \(2024\)Rethinking the reversal curse of LLMs: a prescription from human knowledge reversal\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 7518–7530\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.428/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.428)Cited by:[§1](https://arxiv.org/html/2606.07571#S1.p3.1)\.
- S\. Marimuthu and P\. Krishnamurthy \(2025\)LTRC\-IIITH at PerAnsSumm 2025: SpanSense \- perspective\-specific span identification and summarization\.InProceedings of the Second Workshop on Patient\-Oriented Language Processing \(CL4Health\),S\. Ananiadou, D\. Demner\-Fushman, D\. Gupta, and P\. Thompson \(Eds\.\),Albuquerque, New Mexico,pp\. 409–414\.External Links:[Link](https://aclanthology.org/2025.cl4health-1.37/),[Document](https://dx.doi.org/10.18653/v1/2025.cl4health-1.37),ISBN 979\-8\-89176\-238\-1Cited by:[§3\.1](https://arxiv.org/html/2606.07571#S3.SS1.p3.1)\.
- Y\. Meng, J\. Huang, G\. Wang, C\. Zhang, H\. Zhuang, L\. Kaplan, and J\. Han \(2019\)Spherical text embedding\.InProceedings of the 33rd International Conference on Neural Information Processing Systems,Cited by:[§B\.2](https://arxiv.org/html/2606.07571#A2.SS2.p1.1)\.
- S\. Nie, F\. Zhu, Z\. You, X\. Zhang, J\. Ou, J\. Hu, J\. Zhou, Y\. Lin, J\. Wen, and C\. Li \(2025\)Large language diffusion models\.External Links:2502\.09992,[Link](https://arxiv.org/abs/2502.09992)Cited by:[§1](https://arxiv.org/html/2606.07571#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.07571#S2.SS1.p1.11),[§3\.1](https://arxiv.org/html/2606.07571#S3.SS1.p1.3)\.
- OpenAI \(2023\)ChatGPT: A large language model by openai\.Note:[https://openai\.com/chatgpt](https://openai.com/chatgpt)Accessed: 2026\-02\-03Cited by:[§1](https://arxiv.org/html/2606.07571#S1.p2.1)\.
- Y\. Qian, J\. Su, L\. Hu, P\. Zhang, Z\. Deng, P\. Zhao, and H\. Zhang \(2026\)D3LLM: ultra\-fast diffusion llm using pseudo\-trajectory distillation\.External Links:2601\.07568,[Link](https://arxiv.org/abs/2601.07568)Cited by:[§3\.1](https://arxiv.org/html/2606.07571#S3.SS1.p1.3)\.
- V\. Srivatsa, Z\. He, R\. Abhyankar, D\. Li, and Y\. Zhang \(2025\)Preble: efficient distributed prompt scheduling for LLM serving\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=meKEKDhdnx)Cited by:[§D\.1](https://arxiv.org/html/2606.07571#A4.SS1.p1.2),[§1](https://arxiv.org/html/2606.07571#S1.p1.1),[§2\.3](https://arxiv.org/html/2606.07571#S2.SS3.p1.1),[§3\.1](https://arxiv.org/html/2606.07571#S3.SS1.p2.1),[§9](https://arxiv.org/html/2606.07571#S9.p2.1)\.
- X\. Wang, C\. Xu, Y\. Jin, J\. Jin, H\. Zhang, and Z\. Deng \(2025\)Diffusion llms can do faster\-than\-ar inference via discrete diffusion forcing\.External Links:2508\.09192,[Link](https://arxiv.org/abs/2508.09192)Cited by:[§9](https://arxiv.org/html/2606.07571#S9.p4.1)\.
- C\. Wu, H\. Zhang, S\. Xue, S\. Diao, Y\. Fu, Z\. Liu, P\. Molchanov, P\. Luo, S\. Han, and E\. Xie \(2025a\)Fast\-dllm v2: efficient block\-diffusion llm\.External Links:2509\.26328,[Link](https://arxiv.org/abs/2509.26328)Cited by:[§1](https://arxiv.org/html/2606.07571#S1.p5.1),[§6\.1](https://arxiv.org/html/2606.07571#S6.SS1.p1.4),[§6\.1](https://arxiv.org/html/2606.07571#S6.SS1.p4.1),[§7](https://arxiv.org/html/2606.07571#S7.p1.1),[§9](https://arxiv.org/html/2606.07571#S9.p4.1)\.
- C\. Wu, H\. Zhang, S\. Xue, Z\. Liu, S\. Diao, L\. Zhu, P\. Luo, S\. Han, and E\. Xie \(2025b\)Fast\-dllm: training\-free acceleration of diffusion llm by enabling kv cache and parallel decoding\.External Links:2505\.22618,[Link](https://arxiv.org/abs/2505.22618)Cited by:[§1](https://arxiv.org/html/2606.07571#S1.p5.1),[§6\.1](https://arxiv.org/html/2606.07571#S6.SS1.p1.4),[§6\.1](https://arxiv.org/html/2606.07571#S6.SS1.p4.1),[§7](https://arxiv.org/html/2606.07571#S7.p1.1)\.
- xAI Organization \(2025\)Xai\-org/grok\-prompts: prompts for the grok chat assistant and the @grok bot on x\.Note:[https://github\.com/xai\-org/grok\-prompts](https://github.com/xai-org/grok-prompts)Accessed: 2026\-02\-12External Links:[Link](https://github.com/xai-org/grok-prompts)Cited by:[§3\.1](https://arxiv.org/html/2606.07571#S3.SS1.p2.1)\.
- Z\. Xiao, F\. Huang, J\. Tu, J\. Wei, W\. Ma, Y\. Zhou, J\. Wu, B\. Yu, Z\. Liu, and J\. Lin \(2025\)LongWeave: a long\-form generation benchmark bridging real\-world relevance and verifiability\.InFindings of the Association for Computational Linguistics: EMNLP 2025,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 10386–10417\.External Links:[Link](https://aclanthology.org/2025.findings-emnlp.549/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.549),ISBN 979\-8\-89176\-335\-7Cited by:[§1](https://arxiv.org/html/2606.07571#S1.p3.1)\.
- Z\. Yi, J\. Ouyang, Z\. Xu, Y\. Liu, T\. Liao, H\. Luo, and Y\. Shen \(2025\)A survey on recent advances in llm\-based multi\-turn dialogue systems\.ACM Comput\. Surv\.58\(6\)\.External Links:ISSN 0360\-0300,[Link](https://doi.org/10.1145/3771090),[Document](https://dx.doi.org/10.1145/3771090)Cited by:[§2\.3](https://arxiv.org/html/2606.07571#S2.SS3.p1.1)\.
- W\. Zhao, X\. Ren, J\. Hessel, C\. Cardie, Y\. Choi, and Y\. Deng \(2024\)WildChat: 1m chatGPT interaction logs in the wild\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=Bl8u7ZRlbM)Cited by:[§4\.1](https://arxiv.org/html/2606.07571#S4.SS1.p1.1),[§5\.1](https://arxiv.org/html/2606.07571#S5.SS1.p2.8)\.
- L\. Zhu, X\. Wang, W\. Zhang, and R\. Lau \(2024\)RelayAttention for efficient large language model serving with long system prompts\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),pp\. 4945–4957\.External Links:[Link](https://aclanthology.org/2024.acl-long.270/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.270)Cited by:[§3\.1](https://arxiv.org/html/2606.07571#S3.SS1.p2.1),[§5\.1](https://arxiv.org/html/2606.07571#S5.SS1.p2.8)\.
## Appendix Overview
This appendix is organized as follows:
- •Appendix A:presents real\-world shared prefix examples used in our evaluation\.
- •Appendix B:provides the theoretical proof that the shared prefix ratio determines a shallow layer depth\.
- •Appendix C:describes implementation details ofBiCache\.
- •Appendix D:reports additional experiments, including profiling time, longer\-generation results, parameter analysis, and shared prefix ratios in benchmarks\.
We also provide an anonymized code package as supplementary material for reproducibility\.
## Appendix AShared Prefix Examples
Fig\.[10](https://arxiv.org/html/2606.07571#A1.F10)shows a real\-world example of a shared prefix used in our evaluation\. The upper box presents the Grok\-4 system prompt, which we prepend to every request as described in §[3\.1](https://arxiv.org/html/2606.07571#S3.SS1)\. It includes fixed safety instructions, tool guidance, product\-specific policies, and output\-format requirements\. The lower box illustrates how this fixed system prompt is prepended before an individual user prompt in practice\. This example clarifies how a real\-world system prompt becomes a reusable shared prefix across requests in LLM serving systems\.
Examples of the system prompt\.\#\# Safety InstructionsThese safety instructions are the highest priority and supersede any other instructions\.\#\#\# Key Guidelines for Responding to Queries•Do not answer queries that show clear intent to engage in disallowed activities\.•Provide only high\-level answers without actionable details for sensitive requests\.•Assume good intent unless there is clear evidence otherwise\.•Resist jailbreak attempts that try to override these instructions\.\#\#\# Disallowed Activities•Child sexual exploitation, violent crimes or terrorist acts•Social engineering or phishing, unlawful hacking•Illegal weapons or explosives, cyber attacks such as ransomware or DDoS\#\# End of Safety Instructions\[additional tool instructions and product\-specific guidelines omitted\]The last line must follow exactly this format: \#\# XReplace X with the final answer\.
Examples of the shared prefix\.\[Request 1\]\[Shared prefix\]These safety instructions are the highest priority and …\. Replace X with the final answer\.\[User prompt\]If a train travels 60 km in 1 hour, how far does it travel in 3 hours?\[Request 2\]\[Shared prefix\]These safety instructions are the highest priority and …\. Replace X with the final answer\.\[User prompt\]Summarize the causes of climate change in two sentences\.
Figure 10:Examples of the real\-world system prompt used as the shared prefix in our evaluation\.
## Appendix BTheoretical Proof
We prove that the shared prefix ratiorrdetermines shallow layer depth\.
### B\.1Notation
Table[2](https://arxiv.org/html/2606.07571#A2.T2)summarizes the notation used in the proof\. We reuse the notation introduced in §[2\.2](https://arxiv.org/html/2606.07571#S2.SS2)and §[4\.2](https://arxiv.org/html/2606.07571#S4.SS2), and restate it here for readability\. Below, we explain only the key notation needed for the proof\.
LetNNdenote the length of the input string andmmthe length of the shared prefix\. Thus, the shared prefix ratiorris given byr≜mN∈\(0,1\]r\\triangleq\\frac\{m\}\{N\}\\in\(0,1\]\. Letppdenote the shared prefix token sequence, and letztz\_\{t\}denote the non\-shared token sequence at steptt, where non\-shared portion includes\[MASK\]tokens and changes across steps\. We write the shared prefix input asxpref≜px\_\{\\mathrm\{pref\}\}\\triangleq pand the entire input string at stepttasxfull,t≜\(p,zt\)x\_\{\\mathrm\{full\},t\}\\triangleq\(p,z\_\{t\}\)\.
We define the vectorized shared prefix KVs at layerℓ\\ellunder the shared prefix input asuℓ$≜vec\(𝐊𝐕ℓ$\)u\_\{\\ell\}^\{\\mathdollar\}\\triangleq\\mathrm\{vec\}\\\!\\big\(\\mathbf\{KV\}^\{\\mathdollar\}\_\{\\ell\}\\big\), and the corresponding vectorized shared prefix KVs at layerℓ\\elland stepttunder the full input asuℓ,tref≜vec\(𝐊𝐕ℓ,tref\)u\_\{\\ell,t\}^\{\\mathrm\{ref\}\}\\triangleq\\mathrm\{vec\}\\\!\\big\(\\mathbf\{KV\}^\{\\mathrm\{ref\}\}\_\{\\ell,t\}\\big\)\.
As in §[4\.2](https://arxiv.org/html/2606.07571#S4.SS2), we measure the similarity between the cached KVs of shared prefix and the reference KVs of shared prefix\. Here, we write this similarity explicitly using cosine similarity, defined asCosSim\(a,b\)≜a⊤b‖a‖2‖b‖2\\mathrm\{CosSim\}\(a,b\)\\triangleq\\frac\{a^\{\\top\}b\}\{\\\|a\\\|\_\{2\}\\\|b\\\|\_\{2\}\}\. Using this notation, the shared prefix KV similarity at layerℓ\\elland stepttis written assimℓ,t=CosSim\(uℓ$,uℓ,tref\)\\mathrm\{sim\}\_\{\\ell,t\}=\\mathrm\{CosSim\}\\\!\\big\(u\_\{\\ell\}^\{\\mathdollar\},u\_\{\\ell,t\}^\{\\mathrm\{ref\}\}\\big\)\. This measures how similar the KVs of the shared prefix are between the shared prefix input and the full input at layerℓ\\elland steptt\. As in §[4\.2](https://arxiv.org/html/2606.07571#S4.SS2), given the similarity thresholdτ\\tau, we defineb\(r\)≜max\{b∈\{0,1,…,L\}:mintsimℓ,t≥τ,∀ℓ≤b\}b\(r\)\\triangleq\\max\\Big\\\{b\\in\\\{0,1,\\dots,L\\\}:\\min\_\{t\}\\mathrm\{sim\}\_\{\\ell,t\}\\geq\\tau,\\ \\forall\\ell\\leq b\\Big\\\}\. This is the same definition as in §[4\.2](https://arxiv.org/html/2606.07571#S4.SS2), with the only difference that we writebbasb\(r\)b\(r\)to make its dependence on the shared prefix ratiorrexplicit\. Therefore,b\(r\)b\(r\)is the deepest shallow\-layer depth such that the shared prefix KVs satisfy the similarity threshold across all steps\.
For a token positionii, letwij\(ℓ,h\)\(x\)w\_\{ij\}^\{\(\\ell,h\)\}\(x\)denote the attention weight from tokeniito tokenjjat layerℓ\\elland headhhunder inputxx, and letVj\(ℓ,h\)\(x\)V\_\{j\}^\{\(\\ell,h\)\}\(x\)denote the corresponding value vector of tokenjj\. We define the attention output of tokeniiasAttni\(ℓ,h\)\(x\)≜∑jwij\(ℓ,h\)\(x\)Vj\(ℓ,h\)\(x\)\\mathrm\{Attn\}^\{\(\\ell,h\)\}\_\{i\}\(x\)\\triangleq\\sum\_\{j\}w\_\{ij\}^\{\(\\ell,h\)\}\(x\)V\_\{j\}^\{\(\\ell,h\)\}\(x\)\.
NotationDescriptionNNLength of the full input string\.mmLength of the shared prefix\.r=mN∈\(0,1\]r=\\frac\{m\}\{N\}\\in\(0,1\]Shared prefix ratio\.ppShared prefix token sequence\.ztz\_\{t\}Non\-shared token sequence at denoising steptt, including\[MASK\]tokens\.xprefx\_\{\\mathrm\{pref\}\}Shared prefix input, defined asxpref≜px\_\{\\mathrm\{pref\}\}\\triangleq p\.xfull,tx\_\{\\mathrm\{full\},t\}Full input at steptt, defined asxfull,t≜\(p,zt\)x\_\{\\mathrm\{full\},t\}\\triangleq\(p,z\_\{t\}\)\.𝐊𝐕ℓ$\\mathbf\{KV\}^\{\\mathdollar\}\_\{\\ell\}Shared prefix KVs at layerℓ\\ellunder the shared\-prefix\-only input\.𝐊𝐕ℓ,tref\\mathbf\{KV\}^\{\\mathrm\{ref\}\}\_\{\\ell,t\}Reference shared prefix KVs at layerℓ\\elland stepttunder the full input\.uℓ$u\_\{\\ell\}^\{\\mathdollar\}Vectorized form of𝐊𝐕ℓ$\\mathbf\{KV\}^\{\\mathdollar\}\_\{\\ell\}, i\.e\.,uℓ$≜vec\(𝐊𝐕ℓ$\)u\_\{\\ell\}^\{\\mathdollar\}\\triangleq\\mathrm\{vec\}\(\\mathbf\{KV\}^\{\\mathdollar\}\_\{\\ell\}\)\.uℓ,trefu\_\{\\ell,t\}^\{\\mathrm\{ref\}\}Vectorized form of𝐊𝐕ℓ,tref\\mathbf\{KV\}^\{\\mathrm\{ref\}\}\_\{\\ell,t\}, i\.e\.,uℓ,tref≜vec\(𝐊𝐕ℓ,tref\)u\_\{\\ell,t\}^\{\\mathrm\{ref\}\}\\triangleq\\mathrm\{vec\}\(\\mathbf\{KV\}^\{\\mathrm\{ref\}\}\_\{\\ell,t\}\)\.CosSim\(a,b\)\\mathrm\{CosSim\}\(a,b\)Cosine similarity between vectorsaaandbb\.simℓ,t\\mathrm\{sim\}\_\{\\ell,t\}Shared prefix KV similarity at layerℓ\\elland steptt, defined asCosSim\(uℓ$,uℓ,tref\)\\mathrm\{CosSim\}\(u\_\{\\ell\}^\{\\mathdollar\},u\_\{\\ell,t\}^\{\\mathrm\{ref\}\}\)\.τ\\tauSimilarity threshold for safe KV reuse\.b\(r\)b\(r\)Deepest shallow layer depth such thatmintsimℓ,t≥τ\\min\_\{t\}\\mathrm\{sim\}\_\{\\ell,t\}\\geq\\taufor all layers up to that depth\.wij\(ℓ,h\)\(x\)w\_\{ij\}^\{\(\\ell,h\)\}\(x\)Attention weight from tokeniito tokenjjat layerℓ\\ell, headhh, under inputxx\.Vj\(ℓ,h\)\(x\)V\_\{j\}^\{\(\\ell,h\)\}\(x\)Value vector of tokenjjat layerℓ\\ell, headhh, under inputxx\.Attni\(ℓ,h\)\(x\)\\mathrm\{Attn\}\_\{i\}^\{\(\\ell,h\)\}\(x\)Attention output of tokeniiat layerℓ\\ell, headhh, under inputxx\.ρi,t\(ℓ,h\)\\rho\_\{i,t\}^\{\(\\ell,h\)\}Attention leakage from shared prefix tokeniito non\-shared tokens at layerℓ\\ell, headhh, and steptt\.Attn~i\(ℓ,h\)\(xfull,t\)\\widetilde\{\\mathrm\{Attn\}\}\_\{i\}^\{\(\\ell,h\)\}\(x\_\{\\mathrm\{full\},t\}\)Prefix\-renormalized attention output obtained by removing attention mass assigned to non\-shared tokens and renormalizing over shared prefix tokens only\.ρ¯ℓ,t\(r\)\\bar\{\\rho\}\_\{\\ell,t\}\(r\)Average attention leakage at layerℓ\\elland stepttas a function of the shared prefix ratiorr\.gℓ,t\(1−r\)g\_\{\\ell,t\}\(1\-r\)Upper\-bounding function for attention leakage as a function of the non\-shared ratio1−r1\-r\.BVB\_\{V\}Uniform upper bound on theℓ2\\ell\_\{2\}norm of value vectors\.MℓM\_\{\\ell\}Positive lower bound on theℓ2\\ell\_\{2\}norms ofuℓ$u\_\{\\ell\}^\{\\mathdollar\}anduℓ,trefu\_\{\\ell,t\}^\{\\mathrm\{ref\}\}\.Λℓ,t\\Lambda\_\{\\ell,t\}Constant relating attention\-output perturbation to shared prefix KV perturbation\.Cℓ,tC\_\{\\ell,t\}Constant defined asCℓ,t≜2BVΛℓ,tC\_\{\\ell,t\}\\triangleq 2B\_\{V\}\\Lambda\_\{\\ell,t\}\.cℓ,tc\_\{\\ell,t\}Constant defined ascℓ,t≜Cℓ,t22Mℓ2c\_\{\\ell,t\}\\triangleq\\frac\{C\_\{\\ell,t\}^\{2\}\}\{2M\_\{\\ell\}^\{2\}\}\.LBℓ,t\(r\)\\mathrm\{LB\}\_\{\\ell,t\}\(r\)Step\-wise lower bound on similarity, defined asLBℓ,t\(r\)≜1−cℓ,tgℓ,t\(1−r\)2\\mathrm\{LB\}\_\{\\ell,t\}\(r\)\\triangleq 1\-c\_\{\\ell,t\}g\_\{\\ell,t\}\(1\-r\)^\{2\}\.LB¯ℓ\(r\)\\underline\{\\mathrm\{LB\}\}\_\{\\ell\}\(r\)Worst\-step lower bound, defined asLB¯ℓ\(r\)≜mintLBℓ,t\(r\)\\underline\{\\mathrm\{LB\}\}\_\{\\ell\}\(r\)\\triangleq\\min\_\{t\}\\mathrm\{LB\}\_\{\\ell,t\}\(r\)\.bsafe\(r\)b\_\{\\mathrm\{safe\}\}\(r\)Guaranteed shallow layer depth induced by the lower boundLB¯ℓ\(r\)\\underline\{\\mathrm\{LB\}\}\_\{\\ell\}\(r\)\.Table 2:Notation used in the theoretical proof\.
### B\.2Assumptions
We use the following standard assumptionsHsiehet al\.\([2019](https://arxiv.org/html/2606.07571#bib.bib10)\); Biettiet al\.\([2023](https://arxiv.org/html/2606.07571#bib.bib9)\); Menget al\.\([2019](https://arxiv.org/html/2606.07571#bib.bib8)\); Castinet al\.\([2024](https://arxiv.org/html/2606.07571#bib.bib7)\), widely adopted in theoretical analyses of transformer models:
- •\(A1\) Bounded attention value vectors\.There exists a constantBV\>0B\_\{V\}\>0such that‖Vj\(ℓ,h\)\(x\)‖2≤BV\\\|V^\{\(\\ell,h\)\}\_\{j\}\(x\)\\\|\_\{2\}\\leq B\_\{V\}for all layers, heads, token positions, and inputs of interest, whereVj\(ℓ,h\)\(x\)V^\{\(\\ell,h\)\}\_\{j\}\(x\)denotes the value vector in attention produced for token positionjjat layerℓ\\elland attention headhhunder inputxx, andBVB\_\{V\}is a uniform upper bound on itsℓ2\\ell\_\{2\}norm\. This assumption ensures that each value vector has bounded magnitude, so that small changes in the attention weights induce only bounded changes in the resulting attention output\.
- •\(A2\) Attention leakage controlled by the non\-shared ratio\.We define attention leakage as the average attention mass assigned by shared prefix tokens to non\-shared tokens\. For each layerℓ\\elland steptt, we assume there exists an upper\-bounding functiongℓ,t:\[0,1\]→\[0,1\]g\_\{\\ell,t\}:\[0,1\]\\to\[0,1\]such thatρ¯ℓ,t\(r\)≤gℓ,t\(1−r\)\\bar\{\\rho\}\_\{\\ell,t\}\(r\)\\leq g\_\{\\ell,t\}\(1\-r\), whereρ¯ℓ,t\(r\)\\bar\{\\rho\}\_\{\\ell,t\}\(r\)denotes this leakage and1−r1\-ris the non\-shared ratio\. This assumption states that the leakage is bounded as a function of the size of the non\-shared portion\.
- •\(A3\) Non\-degenerate shared prefix KV norm\.For each layerℓ\\ell, there exists a constantMℓ\>0M\_\{\\ell\}\>0such that‖uℓ$‖2≥Mℓ\\\|u\_\{\\ell\}^\{\\mathdollar\}\\\|\_\{2\}\\geq M\_\{\\ell\}and‖uℓ,tref‖2≥Mℓ\\\|u\_\{\\ell,t\}^\{\\mathrm\{ref\}\}\\\|\_\{2\}\\geq M\_\{\\ell\}for all stepstt, whereuℓ$u\_\{\\ell\}^\{\\mathdollar\}anduℓ,trefu\_\{\\ell,t\}^\{\\mathrm\{ref\}\}denote the vectorized shared prefix KVs under the shared prefix input and the full input, respectively, andMℓM\_\{\\ell\}is a positive lower bound on theirℓ2\\ell\_\{2\}norms\. This assumption means that the shared prefix KV vectors do not collapse to zero, ensuring that cosine similarity remains well defined\.
- •\(A4\) Shared prefix KV stability under attention perturbation\.For each layerℓ\\elland steptt, there exists a constantΛℓ,t\>0\\Lambda\_\{\\ell,t\}\>0such that the perturbation in the vectorized shared prefix KVs is bounded byΛℓ,t\\Lambda\_\{\\ell,t\}times the average perturbation in the attention outputs of shared prefix tokens\. Here,uℓ$u\_\{\\ell\}^\{\\mathdollar\}anduℓ,trefu\_\{\\ell,t\}^\{\\mathrm\{ref\}\}denote the vectorized shared prefix KVs under the shared prefix input and the full input, respectively, andΛℓ,t\\Lambda\_\{\\ell,t\}is a positive constant\. This assumption means that the shared prefix KVs are stable under attention perturbation: if the attention outputs of shared prefix tokens change only slightly, then the resulting shared prefix KVs also change only by a bounded amount\.
### B\.3Auxiliary Lemmas
We derive several intermediate lemmas needed for the proof\.
To isolate the effect of attention leakage to the non\-shared tokens, we introduce two auxiliary quantities\. For a shared prefix tokeni≤mi\\leq m, we define the attention leakage asρi,t\(ℓ,h\)≜∑j\>mwij\(ℓ,h\)\(xfull,t\)\\rho\_\{i,t\}^\{\(\\ell,h\)\}\\triangleq\\sum\_\{j\>m\}w\_\{ij\}^\{\(\\ell,h\)\}\(x\_\{\\mathrm\{full\},t\}\), which is the attention mass assigned from tokeniito the non\-shared tokens at layerℓ\\ell, headhh, and steptt\. We then define the prefix\-renormalized attention output asAttn~i\(ℓ,h\)\(xfull,t\)≜∑j≤mwij\(ℓ,h\)\(xfull,t\)1−ρi,t\(ℓ,h\)Vj\(ℓ,h\)\(xfull,t\)\\widetilde\{\\mathrm\{Attn\}\}^\{\(\\ell,h\)\}\_\{i\}\(x\_\{\\mathrm\{full\},t\}\)\\triangleq\\sum\_\{j\\leq m\}\\frac\{w\_\{ij\}^\{\(\\ell,h\)\}\(x\_\{\\mathrm\{full\},t\}\)\}\{1\-\\rho\_\{i,t\}^\{\(\\ell,h\)\}\}V\_\{j\}^\{\(\\ell,h\)\}\(x\_\{\\mathrm\{full\},t\}\)\. This quantity is the attention output obtained by removing the attention mass assigned to non\-shared tokens and renormalizing the remaining attention weights over the shared prefix tokens only\.
###### Lemma 1\(Leakage induces bounded attention deviation\)\.
For any layerℓ\\ell, headhh, steptt, and shared prefix tokeni≤mi\\leq m,‖Attni\(ℓ,h\)\(xfull,t\)−Attn~i\(ℓ,h\)\(xfull,t\)‖2≤2BVρi,t\(ℓ,h\)\\big\\\|\\mathrm\{Attn\}^\{\(\\ell,h\)\}\_\{i\}\(x\_\{\\mathrm\{full\},t\}\)\-\\widetilde\{\\mathrm\{Attn\}\}^\{\(\\ell,h\)\}\_\{i\}\(x\_\{\\mathrm\{full\},t\}\)\\big\\\|\_\{2\}\\leq 2B\_\{V\}\\,\\rho\_\{i,t\}^\{\(\\ell,h\)\}\.
###### Proof\.
Letwij≜wij\(ℓ,h\)\(xfull,t\)w\_\{ij\}\\triangleq w\_\{ij\}^\{\(\\ell,h\)\}\(x\_\{\\mathrm\{full\},t\}\),Vj≜Vj\(ℓ,h\)\(xfull,t\)V\_\{j\}\\triangleq V\_\{j\}^\{\(\\ell,h\)\}\(x\_\{\\mathrm\{full\},t\}\), andρi≜ρi,t\(ℓ,h\)\\rho\_\{i\}\\triangleq\\rho\_\{i,t\}^\{\(\\ell,h\)\}\. Then∑j≤mwij=1−ρi\\sum\_\{j\\leq m\}w\_\{ij\}=1\-\\rho\_\{i\}\. By the definitions ofAttni\(ℓ,h\)\\mathrm\{Attn\}^\{\(\\ell,h\)\}\_\{i\}andAttn~i\(ℓ,h\)\\widetilde\{\\mathrm\{Attn\}\}^\{\(\\ell,h\)\}\_\{i\}, we haveAttni−Attn~i=∑j≤m\(wij−wij1−ρi\)Vj\+∑j\>mwijVj\\mathrm\{Attn\}\_\{i\}\-\\widetilde\{\\mathrm\{Attn\}\}\_\{i\}=\\sum\_\{j\\leq m\}\\left\(w\_\{ij\}\-\\frac\{w\_\{ij\}\}\{1\-\\rho\_\{i\}\}\\right\)V\_\{j\}\+\\sum\_\{j\>m\}w\_\{ij\}V\_\{j\}\. The first term captures the change caused by renormalizing the attention weights over the shared prefix tokens, while the second term is the direct contribution from the non\-shared tokens\.
Since‖Vj‖2≤BV\\\|V\_\{j\}\\\|\_\{2\}\\leq B\_\{V\}, we have‖Attni−Attn~i‖2≤BV∑j≤m\|wij−wij1−ρi\|\+BV∑j\>mwij\\\|\\mathrm\{Attn\}\_\{i\}\-\\widetilde\{\\mathrm\{Attn\}\}\_\{i\}\\\|\_\{2\}\\leq B\_\{V\}\\sum\_\{j\\leq m\}\\left\|w\_\{ij\}\-\\frac\{w\_\{ij\}\}\{1\-\\rho\_\{i\}\}\\right\|\+B\_\{V\}\\sum\_\{j\>m\}w\_\{ij\}\. Forj≤mj\\leq m,\|wij−wij1−ρi\|=wijρi1−ρi\\left\|w\_\{ij\}\-\\frac\{w\_\{ij\}\}\{1\-\\rho\_\{i\}\}\\right\|=w\_\{ij\}\\frac\{\\rho\_\{i\}\}\{1\-\\rho\_\{i\}\}, and thus∑j≤m\|wij−wij1−ρi\|=ρi1−ρi∑j≤mwij=ρi\\sum\_\{j\\leq m\}\\left\|w\_\{ij\}\-\\frac\{w\_\{ij\}\}\{1\-\\rho\_\{i\}\}\\right\|=\\frac\{\\rho\_\{i\}\}\{1\-\\rho\_\{i\}\}\\sum\_\{j\\leq m\}w\_\{ij\}=\\rho\_\{i\}\. Also, by definition,∑j\>mwij=ρi\\sum\_\{j\>m\}w\_\{ij\}=\\rho\_\{i\}\. Substituting these equalities gives‖Attni−Attn~i‖2≤2BVρi\\\|\\mathrm\{Attn\}\_\{i\}\-\\widetilde\{\\mathrm\{Attn\}\}\_\{i\}\\\|\_\{2\}\\leq 2B\_\{V\}\\rho\_\{i\}, which proves the lemma\.
Lemma[1](https://arxiv.org/html/2606.07571#Thmlemma1)shows that the deviation between the full attention output and the prefix\-renormalized attention output is proportional to the leakage mass\. Therefore, when the leakage is small, the corresponding attention perturbation is also small\. ∎
###### Lemma 2\(Shared prefix KV deviation is controlled by the shared prefix ratio\)\.
For any layerℓ\\elland steptt,‖uℓ,tref−uℓ$‖2≤Cℓ,tgℓ,t\(1−r\)\\\|u\_\{\\ell,t\}^\{\\mathrm\{ref\}\}\-u\_\{\\ell\}^\{\\mathdollar\}\\\|\_\{2\}\\leq C\_\{\\ell,t\}\\,g\_\{\\ell,t\}\(1\-r\), whereCℓ,t≜2BVΛℓ,tC\_\{\\ell,t\}\\triangleq 2B\_\{V\}\\Lambda\_\{\\ell,t\}\.
###### Proof\.
By \(A4\),‖uℓ,tref−uℓ$‖2≤Λℓ,t⋅𝔼i≤m,h\[‖Attni\(ℓ,h\)\(xfull,t\)−Attn~i\(ℓ,h\)\(xfull,t\)‖2\]\\\|u\_\{\\ell,t\}^\{\\mathrm\{ref\}\}\-u\_\{\\ell\}^\{\\mathdollar\}\\\|\_\{2\}\\leq\\Lambda\_\{\\ell,t\}\\cdot\\mathbb\{E\}\_\{i\\leq m,\\,h\}\\big\[\\\|\\mathrm\{Attn\}^\{\(\\ell,h\)\}\_\{i\}\(x\_\{\\mathrm\{full\},t\}\)\-\\widetilde\{\\mathrm\{Attn\}\}^\{\(\\ell,h\)\}\_\{i\}\(x\_\{\\mathrm\{full\},t\}\)\\\|\_\{2\}\\big\]\. Applying Lemma[1](https://arxiv.org/html/2606.07571#Thmlemma1)gives‖uℓ,tref−uℓ$‖2≤2BVΛℓ,t⋅ρ¯ℓ,t\(r\)\\\|u\_\{\\ell,t\}^\{\\mathrm\{ref\}\}\-u\_\{\\ell\}^\{\\mathdollar\}\\\|\_\{2\}\\leq 2B\_\{V\}\\Lambda\_\{\\ell,t\}\\cdot\\bar\{\\rho\}\_\{\\ell,t\}\(r\)\. Then, by \(A2\),‖uℓ,tref−uℓ$‖2≤2BVΛℓ,tgℓ,t\(1−r\)=Cℓ,tgℓ,t\(1−r\)\\\|u\_\{\\ell,t\}^\{\\mathrm\{ref\}\}\-u\_\{\\ell\}^\{\\mathdollar\}\\\|\_\{2\}\\leq 2B\_\{V\}\\Lambda\_\{\\ell,t\}\\,g\_\{\\ell,t\}\(1\-r\)=C\_\{\\ell,t\}g\_\{\\ell,t\}\(1\-r\)\. This proves the lemma\. ∎
Lemma[2](https://arxiv.org/html/2606.07571#Thmlemma2)is the key bridge in the proof\. It shows that the difference between the cached shared prefix KVs and the reference shared prefix KVs at stepttis bounded by a function of1−r1\-r\. Equivalently, oncerris fixed, the magnitude of the KV perturbation is also bounded\.
###### Lemma 3\(Cosine similarity lower bound via distance\)\.
For any nonzero vectorsaaandbb,1−CosSim\(a,b\)≤‖a−b‖222‖a‖2‖b‖21\-\\mathrm\{CosSim\}\(a,b\)\\leq\\frac\{\\\|a\-b\\\|\_\{2\}^\{2\}\}\{2\\\|a\\\|\_\{2\}\\\|b\\\|\_\{2\}\}\.
###### Proof\.
Using the identity‖a−b‖22=‖a‖22\+‖b‖22−2a⊤b\\\|a\-b\\\|\_\{2\}^\{2\}=\\\|a\\\|\_\{2\}^\{2\}\+\\\|b\\\|\_\{2\}^\{2\}\-2a^\{\\top\}b, we obtain1−CosSim\(a,b\)=‖a−b‖22−\(‖a‖2−‖b‖2\)22‖a‖2‖b‖2≤‖a−b‖222‖a‖2‖b‖21\-\\mathrm\{CosSim\}\(a,b\)=\\frac\{\\\|a\-b\\\|\_\{2\}^\{2\}\-\(\\\|a\\\|\_\{2\}\-\\\|b\\\|\_\{2\}\)^\{2\}\}\{2\\\|a\\\|\_\{2\}\\\|b\\\|\_\{2\}\}\\leq\\frac\{\\\|a\-b\\\|\_\{2\}^\{2\}\}\{2\\\|a\\\|\_\{2\}\\\|b\\\|\_\{2\}\}\. ∎
Lemma[3](https://arxiv.org/html/2606.07571#Thmlemma3)allows us to convert a bound on the KV difference into a lower bound on the similarity, which is needed becauseb\(r\)b\(r\)is defined through a similarity threshold\.
### B\.4Theorem
###### Theorem 1\(The shared prefix ratio determines a guaranteed depth\)\.
For each layerℓ\\elland steptt, definecℓ,t≜Cℓ,t22Mℓ2c\_\{\\ell,t\}\\triangleq\\frac\{C\_\{\\ell,t\}^\{2\}\}\{2M\_\{\\ell\}^\{2\}\}andLBℓ,t\(r\)≜1−cℓ,tgℓ,t\(1−r\)2\\mathrm\{LB\}\_\{\\ell,t\}\(r\)\\triangleq 1\-c\_\{\\ell,t\}g\_\{\\ell,t\}\(1\-r\)^\{2\}\. Also define the worst\-step lower boundLB¯ℓ\(r\)≜mintLBℓ,t\(r\)\\underline\{\\mathrm\{LB\}\}\_\{\\ell\}\(r\)\\triangleq\\min\_\{t\}\\mathrm\{LB\}\_\{\\ell,t\}\(r\)\. Then the following hold:
1. 1\.For every layerℓ\\elland steptt,simℓ,t≥LBℓ,t\(r\)\\mathrm\{sim\}\_\{\\ell,t\}\\geq\\mathrm\{LB\}\_\{\\ell,t\}\(r\)\. Consequently,mintsimℓ,t≥LB¯ℓ\(r\)\\min\_\{t\}\\mathrm\{sim\}\_\{\\ell,t\}\\geq\\underline\{\\mathrm\{LB\}\}\_\{\\ell\}\(r\)\.
2. 2\.Definebsafe\(r\)≜max\{b∈\{0,1,…,L\}:LB¯ℓ\(r\)≥τ,∀ℓ≤b\}b\_\{\\mathrm\{safe\}\}\(r\)\\triangleq\\max\\Big\\\{b\\in\\\{0,1,\\dots,L\\\}:\\underline\{\\mathrm\{LB\}\}\_\{\\ell\}\(r\)\\geq\\tau,\\ \\forall\\ell\\leq b\\Big\\\}\. Thenbsafe\(r\)b\_\{\\mathrm\{safe\}\}\(r\)is determined by the shared prefix ratiorrthrough the lower boundLB¯ℓ\(r\)\\underline\{\\mathrm\{LB\}\}\_\{\\ell\}\(r\)\.
3. 3\.The empirical shallow layer depth satisfiesb\(r\)≥bsafe\(r\)b\(r\)\\geq b\_\{\\mathrm\{safe\}\}\(r\)\.
###### Proof\.
We prove the theorem in three steps\.
#### Step 1: Lower\-bounding the step\-wise similarity\.
By Lemma[2](https://arxiv.org/html/2606.07571#Thmlemma2),‖uℓ,tref−uℓ$‖2≤Cℓ,tgℓ,t\(1−r\)\\\|u\_\{\\ell,t\}^\{\\mathrm\{ref\}\}\-u\_\{\\ell\}^\{\\mathdollar\}\\\|\_\{2\}\\leq C\_\{\\ell,t\}g\_\{\\ell,t\}\(1\-r\)\. We then convert this distance bound into a similarity bound\. Applying Lemma[3](https://arxiv.org/html/2606.07571#Thmlemma3)witha=uℓ$a=u\_\{\\ell\}^\{\\mathdollar\}andb=uℓ,trefb=u\_\{\\ell,t\}^\{\\mathrm\{ref\}\}, and using \(A3\), gives1−simℓ,t=1−CosSim\(uℓ$,uℓ,tref\)≤‖uℓ,tref−uℓ$‖222Mℓ2≤Cℓ,t22Mℓ2gℓ,t\(1−r\)2=cℓ,tgℓ,t\(1−r\)21\-\\mathrm\{sim\}\_\{\\ell,t\}=1\-\\mathrm\{CosSim\}\\\!\\big\(u\_\{\\ell\}^\{\\mathdollar\},u\_\{\\ell,t\}^\{\\mathrm\{ref\}\}\\big\)\\leq\\frac\{\\\|u\_\{\\ell,t\}^\{\\mathrm\{ref\}\}\-u\_\{\\ell\}^\{\\mathdollar\}\\\|\_\{2\}^\{2\}\}\{2M\_\{\\ell\}^\{2\}\}\\leq\\frac\{C\_\{\\ell,t\}^\{2\}\}\{2M\_\{\\ell\}^\{2\}\}\\,g\_\{\\ell,t\}\(1\-r\)^\{2\}=c\_\{\\ell,t\}g\_\{\\ell,t\}\(1\-r\)^\{2\}\. Rearranging yieldssimℓ,t≥1−cℓ,tgℓ,t\(1−r\)2=LBℓ,t\(r\)\\mathrm\{sim\}\_\{\\ell,t\}\\geq 1\-c\_\{\\ell,t\}g\_\{\\ell,t\}\(1\-r\)^\{2\}=\\mathrm\{LB\}\_\{\\ell,t\}\(r\), which proves the first claim\.
#### Step 2: Lower\-bounding the worst\-step similarity\.
Since the shallow layer depth is defined using the minimum similarity across steps, we take the minimum of the step\-wise lower bound over all steps:mintsimℓ,t≥mintLBℓ,t\(r\)=LB¯ℓ\(r\)\\min\_\{t\}\\mathrm\{sim\}\_\{\\ell,t\}\\geq\\min\_\{t\}\\mathrm\{LB\}\_\{\\ell,t\}\(r\)=\\underline\{\\mathrm\{LB\}\}\_\{\\ell\}\(r\)\. Thus,LB¯ℓ\(r\)\\underline\{\\mathrm\{LB\}\}\_\{\\ell\}\(r\)serves as a conservative lower bound on the similarity of layerℓ\\ellacross all steps\.
#### Step 3: Determining a guaranteed depth fromrr\.
By definition,bsafe\(r\)b\_\{\\mathrm\{safe\}\}\(r\)is the deepest consecutive shallow layer depth such that every layer up to that depth satisfies the conservative lower boundLB¯ℓ\(r\)≥τ\\underline\{\\mathrm\{LB\}\}\_\{\\ell\}\(r\)\\geq\\tau\. SinceLB¯ℓ\(r\)\\underline\{\\mathrm\{LB\}\}\_\{\\ell\}\(r\)is a function ofrr, the resulting depthbsafe\(r\)b\_\{\\mathrm\{safe\}\}\(r\)is also determined byrr\.
Finally, for every layerℓ≤bsafe\(r\)\\ell\\leq b\_\{\\mathrm\{safe\}\}\(r\), we haveLB¯ℓ\(r\)≥τ\\underline\{\\mathrm\{LB\}\}\_\{\\ell\}\(r\)\\geq\\tau\. Combining this with the result from Step 2 givesmintsimℓ,t≥τ\\min\_\{t\}\\mathrm\{sim\}\_\{\\ell,t\}\\geq\\taufor allℓ≤bsafe\(r\)\\ell\\leq b\_\{\\mathrm\{safe\}\}\(r\)\. Therefore, by the definition ofb\(r\)b\(r\), the empirical shallow layer depth satisfiesb\(r\)≥bsafe\(r\)b\(r\)\\geq b\_\{\\mathrm\{safe\}\}\(r\)\. This shows that the shared prefix ratiorrdetermines a guaranteed shallow layer depth through the lower boundLB¯ℓ\(r\)\\underline\{\\mathrm\{LB\}\}\_\{\\ell\}\(r\), which completes the proof\. ∎
## Appendix CImplementation Details
We implementBiCacheon top of PyTorch\. On LLaDA, a representative open\-sourced DLM, we implementBiCachewith∼\\sim3K lines of code\. To implement the KV storeHHin layer\-partitioned caching \(§[5\.2](https://arxiv.org/html/2606.07571#S5.SS2)\), we map each shared prefix to a deterministic 64\-bit hash of its tokenized prefix sequence and use the resulting identifier as the shared prefix cache key\. In our implementation, we use xxHashCollet \([2024](https://arxiv.org/html/2606.07571#bib.bib4)\), a fast non\-cryptographic hash function, to map each shared\-prefix token ID sequence to a deterministic 64\-bit identifier, which is then used as the cache key for lookup\. At serving time,BiCacheextracts the shared prefix, computes its hash, and retrieves the corresponding cached KVs on a hit; on a miss, it computes the shared prefix KVs once and inserts them intoHH\.
Since GPU memory is finite, storing a large number of shared prefix KVs can cause out\-of\-memory issues\. To manage the KV cache efficiently, we follow an eviction technique as in prior worksKwonet al\.\([2023](https://arxiv.org/html/2606.07571#bib.bib38)\)\. Specifically, we used a first\-in, first\-out policy, a standard approach that evicts the oldest cached KVs first\.
MethodARC\-CGPQAMATHGSM8KThr\.Acc\.Thr\.Acc\.Thr\.Acc\.Thr\.Acc\.No caching65\.786\.260\.527\.960\.614\.643\.541\.2vLLM90\.61 \(\+37\.9%\)0\.0 \(\-86\.2%\)80\.4 \(\+32\.9%\)0\.0 \(\-27\.9%\)87\.2 \(\+43\.9%\)0\.0 \(\-14\.6%\)75\.5 \(\+73\.6%\)10\.5 \(\-30\.7%\)BiCache89\.8\(\+36\.7%\)85\.1\(\-1\.1%\)79\.9\(\+32\.1%\)30\.1\(\+2\.2%\)86\.5\(\+42\.8%\)17\.6\(\+3\.0%\)74\.3\(\+70\.8%\)35\.0\(\-6\.2%\)Fast\-dLLM79\.884\.477\.127\.272\.119\.049\.273\.0BiCache\+Fast\-dLLM123\.0\(\+54\.1%\)84\.7\(\+0\.3%\)116\.5\(\+51\.1%\)27\.0\(\-0\.2%\)117\.3\(\+62\.7%\)15\.0\(\-4\.0%\)99\.0\(\+101\.2%\)71\.0\(\-2\.0%\)Table 3:Additional experiments with longer generation length\. We evaluate throughput \(tokens/s\) and accuracy \(%\) across the same benchmarks as in §[6](https://arxiv.org/html/2606.07571#S6)\. The upper parts reports main improvements; the lower parts shows seamless integration results\. Percentages ofBiCacheare relative to no caching \(upper part\) and Fast\-dLLM \(lower part\)\. Bold:BiCacheresults\.Method0\.910\.930\.950\.970\.99Thr\.Acc\.Thr\.Acc\.Thr\.Acc\.Thr\.Acc\.Thr\.Acc\.BiCache92\.439\.192\.148\.492\.045\.991\.455\.591\.253\.8BiCache\+Fast\-dLLM103\.246\.3103\.257\.5104\.355\.2102\.563\.9103\.265\.0\(a\)Impact of thresholdτ\\tau\.
Method8163264Thr\.Acc\.Thr\.Acc\.Thr\.Acc\.Thr\.Acc\.BiCache89\.155\.591\.455\.592\.754\.993\.453\.5BiCache\+Fast\-dLLM99\.664\.8102\.563\.9104\.161\.9104\.761\.9\(b\)Impact of refresh intervalΔ\\Delta\.
Table 4:Additional experiments evaluating the impact of the threshold \(τ\\tau\) and refresh interval \(Δ\\Delta\) on throughput \(tokens/s\) and accuracy \(%\) on GSM8K\. Bold text denotes the best configuration\.BenchmarkShared prefix ratioARC\-C88\.2%GPQA75\.9%MATH90\.1%GSM8K94\.5%Table 5:Average shared prefix ratio in each benchmark\.
## Appendix DAdditional Analysis and Experiments
Here, we present additional experimental results beyond those in §[6](https://arxiv.org/html/2606.07571#S6)\. Specifically, we report the profiling overhead for different values ofMM, evaluateBiCacheunder a longer generation\-length setting, extend the parameter analysis to GSM8K, and summarize the shared prefix ratios of the benchmarks used in our evaluation\.
Figure 11:Standard deviation under system prompt changes\.
Figure 12:Profiling time changes forMM\.
### D\.1Analysis ofbbfor System Prompt Changes
In §[4\.2](https://arxiv.org/html/2606.07571#S4.SS2), we show that the shared prefix ratiorris strongly related to the shallow layer depthbbacross different tasks\. One may wonder whether this relationship still holds when the system prompt itself changes\. In practice, system prompts are typically identical across requests\. However, in some cases, a small subset of tokens may vary \(e\.g\., timestamps, metadata fields, or minor changes in instruction wording\)\. In such cases, only a few tokens within the shared prefix differ while most of the prefix remains identical across requestsSrivatsaet al\.\([2025](https://arxiv.org/html/2606.07571#bib.bib37)\)\.
We investigate this effect as follows\. For eachr∈\{0\.50,0\.51,…,1\.00\}r\\in\\\{0\.50,0\.51,\{\\ldots\},1\.00\\\}, we generate 500 samples while keepingrrfixed\. Using the system prompt in Figure[10](https://arxiv.org/html/2606.07571#A1.F10)as a source template, we instruct an LLM to generate diverse variants by varying role descriptions, safety instructions, tone specifications, and output\-format constraints while avoiding near\-duplicate rewrites\. We further instruct that at least half of the tokens in each generated prompt differ from those in the source template\. We then measure the standard deviation ofsimℓ,t\\mathrm\{sim\}\_\{\\ell,t\}across differentrrvalues\. A lower standard deviation indicates thatsimℓ,t\\mathrm\{sim\}\_\{\\ell,t\}remains stable even when the system prompt changes, which suggests thatbbis not sensitive to such changes\.
Fig\.[12](https://arxiv.org/html/2606.07571#A4.F12)shows the average standard deviation ofsimℓ,t\\mathrm\{sim\}\_\{\\ell,t\}across ratios, with error bars\. The deviation remains low \(0\.014 on average\), indicating that system prompt changes have little effect onbb\. Combined with the observation in §[4\.2](https://arxiv.org/html/2606.07571#S4.SS2), this supports our design choice of profilingbbonly once offline for each model\.
### D\.2Profiling Time
In §[6\.4](https://arxiv.org/html/2606.07571#S6.SS4), we evaluate which value ofMMis appropriate\. Here, we present the profiling overhead for different values ofMM\. Fig\.[12](https://arxiv.org/html/2606.07571#A4.F12)shows the time required for profiling as the number of profiling requestsMMincreases\. For each value ofMM,BiCacheprofilesMMrequests for each shared prefix ratior∈\{0\.01,…,1\.00\}r\\in\\\{0\.01,\{\\ldots\},1\.00\\\}\. Thus, for example,M=600M\{=\}600means profiling a total of600×100600\\times 100requests across all shared prefix ratios\.
As shown in the figure, the profiling time linearly increases asMMincreases\. In particular, whenM=500M\{=\}500that we use, profiling takes 87\.5 minutes\. Although this cost is nontrivial, profiling is a one\-time offline procedure performed at model deployment rather than during online serving, and it does not need to be repeated when the task changes \(§[4\.2](https://arxiv.org/html/2606.07571#S4.SS2)\) or when the shared prefix changes, such as through a system prompt update \(§[D\.1](https://arxiv.org/html/2606.07571#A4.SS1)\)\. Therefore, it does not affect the runtime latency of user requests\. Moreover, the profiling cost can be further reduced by profiling only a subset of representativerrvalues or by using a more efficient search algorithm instead of exhaustively profiling all ratios\.
### D\.3Experiments on Longergg
Unlike the experiments in Table[1](https://arxiv.org/html/2606.07571#S5.T1), which are conducted with the generation length setting ofg=256g\{=\}256, here we examine whether the main and orthogonal results continue to hold under a longer generation setting ofg=512g\{=\}512\.
Table[3](https://arxiv.org/html/2606.07571#A3.T3)reports the results under the same experiment setup as in §[6](https://arxiv.org/html/2606.07571#S6), except that the generation length and number of steps are set tog=512g\{=\}512ands=256s\{=\}256, respectively\.
Main improvements\.Table[3](https://arxiv.org/html/2606.07571#A3.T3)shows the results under longer generation lengths\. Compared to no caching, both vLLM andBiCacheimprove throughput across all benchmarks\. Specifically,BiCacheachieves 32\.1% \(GPQA\)–70\.8% \(GSM8K\) higher throughput than no caching while maintaining similar accuracy\. Unlike vLLM, which severely collapse accuracy,BiCacheclosely matches the accuracy of no caching, with differences within 1\.1% \(ARC\-C\)–6\.2% \(GSM8K\) across benchmarks\. These results show thatBiCachemaintains the throughput benefits of shared prefix caching even under longer generation lengths while avoiding severe accuracy collapse\.
Seamless integration\.Table[3](https://arxiv.org/html/2606.07571#A3.T3)also shows the results when integratingBiCachewith Fast\-dLLM\. Compared to Fast\-dLLM,BiCacheconsistently improves throughput across all benchmarks\. Specifically, compared to Fast\-dLLM,BiCacheimproves throughput by 51\.1% \(GPQA\)–101\.2% \(GSM8K\)\. At the same time,BiCachemaintains comparable accuracy to Fast\-dLLM, with differences within 4\.0% across benchmarks\. These results indicate thatBiCacheremains effective when combined with a DLM accelerator even under longer generation lengths\.
### D\.4Parameter Analysis on GSM8K
In §[6\.4](https://arxiv.org/html/2606.07571#S6.SS4), we conduct a parameter analysis on ARC\-Challenge\-Chat\. Here, we extend the parameter analysis to another benchmark\. Table[4](https://arxiv.org/html/2606.07571#A3.T4)reports GSM8K results for different thresholdτ\\tauand refresh intervalΔ\\Deltasettings\. All other experimental settings follow §[6](https://arxiv.org/html/2606.07571#S6)\.
Threshold\.Table[4](https://arxiv.org/html/2606.07571#A3.T4)varies the thresholdτ\\tauwhile fixingΔ=16\\Delta=16\. Increasingτ\\tauimproves accuracy while throughput remains largely stable\. Specifically,τ=0\.97\\tau=0\.97achieves high accuracy for bothBiCacheandBiCache\+ Fast\-dLLM while maintaining comparable throughput\.
Refresh Interval\.Table[4](https://arxiv.org/html/2606.07571#A3.T4)varies the refresh intervalΔ\\Deltawhile fixingτ=0\.97\\tau=0\.97\. DecreasingΔ\\Deltaimproves accuracy but reduces throughput\. Specifically,Δ=16\\Delta=16achieves high accuracy for bothBiCacheandBiCache\+ Fast\-dLLM while maintaining comparable throughput\.
### D\.5Shared Prefix Ratio in Benchmarks
Here, we report the shared prefix ratios of the benchmarks used in §[3\.2](https://arxiv.org/html/2606.07571#S3.SS2)and §[6](https://arxiv.org/html/2606.07571#S6)\. Table[5](https://arxiv.org/html/2606.07571#A3.T5)shows the average shared prefix ratio for each benchmark\. The ratio varies across benchmarks because the benchmark\-specific user prompt lengths differ, while we use the same fixed system prompt\. GSM8K has the highest shared prefix ratio at 94\.5%, while GPQA has the lowest at 75\.9%\. Overall, these values are broadly comparable to the real\-world range of shared prefix ratios discussed in §[3\.1](https://arxiv.org/html/2606.07571#S3.SS1), although some benchmarks fall slightly below that range\. In light of the results in §[3\.2](https://arxiv.org/html/2606.07571#S3.SS2), these results suggest that a higher shared prefix ratio tends to yield larger throughput improvements\.
## Appendix EAI Assistants in Research or Writing
We use AI assistants, i\.e\., Gemini and ChatGPT, for proofreading support, including grammar, typo, and wording checks throughout the manuscript\. The assistants are used solely to identify errors in our original writing and are not used to develop the core research ideas, methods, experiments, results, or conclusions; all such content is created and verified by the authors\.Similar Articles
@pallavishekhar_: KV Cache in LLMs Read here: https://outcomeschool.com/blog/kv-cache-in-llms…
This article explains the concept of KV Cache in Large Language Models, detailing how it optimizes text generation by storing and reusing key-value pairs to avoid redundant computations during inference.
ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing
This paper introduces ReST-KV, a novel method for robust KV cache eviction in large language models that uses layer-wise output reconstruction and spatial-temporal smoothing to improve efficiency. The method significantly reduces decoding latency and outperforms state-of-the-art baselines on long-context benchmarks like LongBench and RULER.
KV Cache Compression 900000x Beyond TurboQuant and Per-Vector Shannon Limit
A new paper proposes sequential KV cache compression using probabilistic language tries and predictive delta coding, achieving theoretical compression ratios of ~914,000× beyond TurboQuant by exploiting the sequential structure of language model tokens rather than treating vectors independently.
WaveFilter: Enhancing the Long-Context Capability of Diffusion LLMs via Wavelet-Guided KV Cache Filtering
WaveFilter proposes a training-free, wavelet-guided KV cache filtering framework for diffusion large language models that enhances long-context capability by precisely identifying key tokens and constructing sparse caches, improving performance on complex long-context tasks.
@Michaelzsguo: KV cache is the model’s working memory during generation. As the context window gets longer, the model has to keep more…
DeepSeek's KV cache compression innovations, including MLA and CSA/HCA, reduce KV cache size by 93%, enabling efficient long-context inference and SSD-based caching, as demonstrated by antirez's ds4.c project.