Semantic Cache Distillation: Efficient State Transfer via Reuse and Selective Patching
Summary
This paper proposes Semantic Cache Distillation (SCD), a loss-constrained framework that replaces raw KV cache transmission with compact semantic codes, achieving up to 2.65x TTFT speedup while keeping generation quality within 5% F1 of the oracle.
# Semantic Cache Distillation: Efficient State Transfer via Reuse and Selective Patching
Source: [https://arxiv.org/html/2606.07684](https://arxiv.org/html/2606.07684)
Zhiqing Tang✉, Hanshuai Cui, Zhi Yao, Weijia Jia✉
###### Abstract
Disaggregated serving alleviates memory bottlenecks in Large Language Model (LLM) inference but creates a severe communication bottleneck: transmitting high-dimensional Key-Value (KV) caches often dominates time-to-first-token (TTFT). Moreover, reusing caches across heterogeneous models (e.g., base and fine-tuned variants) causes semantic misalignment that accumulates over layers, degrading generation quality. We propose Semantic Cache Distillation (SCD), a loss-constrained framework that replaces raw KV transmission with compact semantic codes. SCD addresses these challenges via two mechanisms: (1) Reuse, which reconstructs most layers from low-rank subspaces to minimize transfer cost, and (2) Patch, which predicts normalized inputs at sparse transition layers to truncate error propagation. Empirically, SCD delivers up to 2.65× TTFT speedup over the oracle consumer prefill and dominates quantization and selective recomputation baselines on the quality–latency Pareto frontier in bandwidth-constrained regimes, while keeping generation quality within 5% F1 of the oracle.
Keywords: Large Language Models, Disaggregated Inference, KV Cache Compression, Semantic Cache Distillation, Efficient Serving, Distributed Systems
## 1 Introduction
LLM serving is increasingly constrained by memory bandwidth and communication overheads rather than compute intensity. In decoder-only Transformers, autoregressive decoding reuses per-layer KV caches to avoid recomputing attention; however, these caches grow linearly with context length and depth. Modern systems therefore disaggregate inference into a compute-bound prefill stage and a memory-bound decode stage, placing them on separate devices (Qin et al., [2024](https://arxiv.org/html/2606.07684#bib.bib20); Zhong et al., [2024](https://arxiv.org/html/2606.07684#bib.bib30); Patel et al., [2024](https://arxiv.org/html/2606.07684#bib.bib19)). While advanced schedulers (Agrawal et al., [2024](https://arxiv.org/html/2606.07684#bib.bib1)) and memory managers (Kwon et al., [2023](https://arxiv.org/html/2606.07684#bib.bib12)) address local efficiency, disaggregation introduces a critical network bottleneck: data transfer between the prefill producer and the decode consumer can dominate TTFT, especially over slow interconnects.
Prior work primarily focuses on compressing the KV cache footprint. Techniques include quantization to compress states into low precision (Liu et al., [2024c](https://arxiv.org/html/2606.07684#bib.bib18); Hooper et al., [2024](https://arxiv.org/html/2606.07684#bib.bib10)), sparsification to evict less salient tokens (Zhang et al., [2023](https://arxiv.org/html/2606.07684#bib.bib28); Li et al., [2024a](https://arxiv.org/html/2606.07684#bib.bib14); Tang et al., [2024](https://arxiv.org/html/2606.07684#bib.bib22); Cai et al., [2024b](https://arxiv.org/html/2606.07684#bib.bib4)), and low-rank methods that exploit redundancy in Transformer representations. Systems like CacheGen (Liu et al., [2024b](https://arxiv.org/html/2606.07684#bib.bib17)) further optimize the streaming of these compressed states. However, these existing approaches typically assume the producer and consumer share the same latent space (i.e., identical weights). They rely on static compression policies that ignore the representation discrepancies inherent in cross-model adaptation.
SCD targets the narrower but practical case of producer-consumer pairs that share the same Transformer architecture while differing in weights. It is not intended to replace same-model KV reuse, where raw or quantized caches can already be reused directly. Instead, SCD addresses shared-architecture, weight-mismatched serving, such as a generic base model serving as the producer for fine-tuned specialists (Sheng et al., [2024](https://arxiv.org/html/2606.07684#bib.bib21); Chen et al., [2024](https://arxiv.org/html/2606.07684#bib.bib6)) or a draft/verifier pair (Cai et al., [2024a](https://arxiv.org/html/2606.07684#bib.bib3); Li et al., [2024b](https://arxiv.org/html/2606.07684#bib.bib15)).
A representative deployment is *shared-prefill, specialized-decode* serving. For example, an online service may run a shared producer on a long common prefix, such as user history, documents, instructions, or a task context, and then route the resulting state to task-specialized consumers for tutoring, code assistance, safety filtering, or domain-specific response generation. The producer and consumers are intentionally not identical: specialization is the purpose of the back-end models. Once their weights differ, however, the producer’s raw KV states are no longer natively compatible with the consumer. This setting presents two coupled challenges: (1) Transmission Overhead: Raw KV transfer is bandwidth-bound in disaggregated serving, so any practical system must shrink the per-request payload while preserving the consumer’s distribution. (2) Semantic Drift: Weight discrepancies cause representation mismatches that compound through layers; direct reuse degrades quality, while full recomputation negates the latency benefits of disaggregation. Existing approaches (Liu et al., [2024a](https://arxiv.org/html/2606.07684#bib.bib16)) attempt to mitigate this by selectively recomputing layers or sharing only specific pairs. However, these methods enforce a rigid trade-off: prioritizing reuse risks quality, while prioritizing recomputation sacrifices latency. A practical solution must simultaneously minimize transmission costs and correct drift under a fixed online execution path.
This deployment pattern is attractive precisely because the expensive prefix computation can be amortized across a family of specialized consumers\. Synchronizing all endpoints to identical weights would remove the specialization that motivates the system, while colocating every specialized model with the producer would duplicate memory and undermine disaggregation\. The relevant question is therefore not whether same\-model KV reuse can work, but how to transfer state when architectural compatibility exists while the consumer’s weight space is different\.
Figure 1: Overview of SCD. (a) Challenges in heterogeneous disaggregated serving: transmitting raw KV caches creates a communication bottleneck, and directly reusing caches from a base model (Producer) to a fine-tuned model (Consumer) causes semantic drift that degrades quality. (b) SCD framework: we replace raw KV transmission with compact semantic codes. The Consumer reconstructs states using Reuse (low-rank projection) for efficiency and applies Patch at sparse transition layers to rectify semantic misalignment, achieving low latency and high generation quality.
We propose Semantic Cache Distillation (SCD), a bandwidth-efficient state-transfer framework that replaces raw KV transmission with compact semantic codes for shared-architecture, weight-mismatched producer-consumer pairs. SCD distills high-dimensional states at the producer and reconstructs consumer-aligned states via lightweight, layer-aware translators. SCD integrates two complementary mechanisms: (1) Reuse: Leveraging the high cross-model compatibility of most layers, SCD reconstructs states via fast low-rank projections to minimize bandwidth usage. (2) Patch: For the few critical layers that induce significant drift, SCD predicts normalized pre-attention inputs to truncate error propagation without full recomputation. In summary, we make the following contributions:
- SCD Framework. We propose an end-to-end design that integrates Reuse for bandwidth efficiency and Patch for semantic alignment, enabling low-latency transfer between differing models.
- Heterogeneous State Transfer Formulation. We formalize the problem of cache reuse under representation mismatch, identifying why raw transfer and static compression fail in disaggregated settings.
- Selective Correction Mechanism. We introduce Patch, a lightweight module applied at sparse transition layers that truncates error propagation with minimal computational overhead.
- Empirical Effectiveness. We demonstrate that SCD achieves superior latency-quality trade-offs in bandwidth-limited environments, delivering up to 2.65× TTFT speedup over the oracle consumer prefill while maintaining generation quality and outperforming quantization and DroidSpeak-style recomputation baselines on the Pareto frontier.
##### Conflict of Interest Disclosure\.
The authors declare no financial conflicts of interest related to this work\.
## 2 Related Work
Our work intersects with disaggregated LLM serving, efficient KV cache management, and cross\-model alignment\.
##### Disaggregated Serving and the Transfer Bottleneck\.
Modern serving systems separate *prefill* and *decode* phases across distinct workers to maximize utilization (Zhong et al., [2024](https://arxiv.org/html/2606.07684#bib.bib30); Qin et al., [2024](https://arxiv.org/html/2606.07684#bib.bib20); Patel et al., [2024](https://arxiv.org/html/2606.07684#bib.bib19)). While this architecture improves throughput, it places KV-cache transfer on the critical path of time-to-first-token (TTFT). Advanced schedulers like Sarathi-Serve (Agrawal et al., [2024](https://arxiv.org/html/2606.07684#bib.bib1)) and FastServe (Wu et al., [2023](https://arxiv.org/html/2606.07684#bib.bib25)) use chunk-level scheduling to mitigate stalls, while engines like vLLM (Kwon et al., [2023](https://arxiv.org/html/2606.07684#bib.bib12)) and SGLang (Zheng et al., [2024](https://arxiv.org/html/2606.07684#bib.bib29)) optimize memory fragmentation. Recent transport-layer works, such as FlowKV (Li et al., [2025](https://arxiv.org/html/2606.07684#bib.bib13)), optimize low-latency KV-cache transfer and load-aware scheduling. However, these system-level optimizations largely treat transferred states as *system-level payloads* rather than learning cross-model semantic translators. Our work complements these efforts by optimizing the *payload*: we target the producer-to-consumer handoff, particularly when raw transfer between heterogeneous models becomes bandwidth-prohibitive.
##### Intra\-Model KV Compression\.
Standard techniques to reduce cache footprint include quantization methods like KIVI (Liu et al., [2024c](https://arxiv.org/html/2606.07684#bib.bib18)) and KVQuant (Hooper et al., [2024](https://arxiv.org/html/2606.07684#bib.bib10)). Beyond quantization, pruning strategies such as H2O (Zhang et al., [2023](https://arxiv.org/html/2606.07684#bib.bib28)), SnapKV (Li et al., [2024a](https://arxiv.org/html/2606.07684#bib.bib14)), Quest (Tang et al., [2024](https://arxiv.org/html/2606.07684#bib.bib22)), and PyramidKV (Cai et al., [2024b](https://arxiv.org/html/2606.07684#bib.bib4)) evict less salient tokens, while StreamingLLM (Xiao et al., [2023](https://arxiv.org/html/2606.07684#bib.bib26)) uses attention sinks for infinite-length inference. Systems like CacheGen (Liu et al., [2024b](https://arxiv.org/html/2606.07684#bib.bib17)) further optimize the streaming transmission of compressed states. Crucially, these approaches are inherently *intra-model*: they assume the producer and consumer share identical weights and feature spaces. Applying them directly to heterogeneous pairs fails to bridge the semantic gap caused by weight discrepancies. In contrast, SCD is designed for the *inter-model* setting, mapping producer states into the consumer’s native space via learnable codes.
##### Cross\-Model Cache Reuse\.
Serving heterogeneous models with a shared backbone is a growing challenge. Systems like S-LoRA (Sheng et al., [2024](https://arxiv.org/html/2606.07684#bib.bib21)) and Punica (Chen et al., [2024](https://arxiv.org/html/2606.07684#bib.bib6)) enable scalable serving of LoRA adapters, while Prompt Cache (Gim et al., [2024](https://arxiv.org/html/2606.07684#bib.bib8)) explores attention reuse across requests. For reuse between base and fine-tuned models, Liu et al. ([2024a](https://arxiv.org/html/2606.07684#bib.bib16)) show that naive sharing degrades generation quality. Their system, DroidSpeak, mitigates this by selectively recomputing critical layers while reusing raw caches for the rest. Unlike DroidSpeak, which makes a binary decision (reuse raw vs. recompute), SCD advances from *selection* to *transformation*. By employing loss-constrained reconstruction (Reuse) and targeted correction (Patch), SCD achieves higher compression than raw reuse and lower latency than recomputation, smoothing the trade-off frontier.
##### Orthogonal Acceleration Approaches\.
SCD focuses on state transfer and is orthogonal to decoding-stage optimizations. Techniques like CoDec (Wang et al., [2025b](https://arxiv.org/html/2606.07684#bib.bib24)) accelerate prefix-shared decoding kernels, while speculative decoding frameworks like Medusa (Cai et al., [2024a](https://arxiv.org/html/2606.07684#bib.bib3)) and EAGLE (Li et al., [2024b](https://arxiv.org/html/2606.07684#bib.bib15)) generate multiple tokens per step using draft models. SCD is compatible with these methods: once the cache is transferred and reconstructed, the consumer can employ speculative sampling or optimized kernels for generation.
##### Distinction from Semantic Communication\.
Finally, we distinguish SCD from Semantic Communication or Cache-to-Cache (C2C) frameworks (Fu et al., [2026](https://arxiv.org/html/2606.07684#bib.bib7)). C2C approaches typically use KV caches to fuse information from multiple agents to enhance collaborative generation. In such settings, the receiver integrates external signals into its own context. Conversely, our goal is acceleration: we enable the consumer to *skip* its prefill phase entirely. Adopting C2C-style fusion would require the consumer to first compute a local cache, negating the latency benefits of disaggregation. Thus, SCD serves as a specialized transfer protocol for latency-critical serving rather than a general collaboration mechanism.
## 3 Problem Formulation
##### Disaggregated Serving and the Bandwidth Bottleneck\.
Modern LLM systems disaggregate *prefill* and *decode* phases across devices to maximize utilization (Zhong et al., [2024](https://arxiv.org/html/2606.07684#bib.bib30); Qin et al., [2024](https://arxiv.org/html/2606.07684#bib.bib20)). We consider a setup with a *producer* model $\mathcal{M}_A$ (Device A) and a *consumer* model $\mathcal{M}_B$ (Device B), both processing a prefix $x_{1:T}$. To generate tokens, the consumer $\mathcal{M}_B$ requires layer-wise KV caches $\mathcal{C}_B = \{(K_B^{\ell}, V_B^{\ell})\}_{\ell=1}^{L}$. Device B faces a dilemma: it must either *recompute* these states locally (compute-bound) or *fetch* them from Device A (bandwidth-bound). Since the cache size $\|\mathcal{C}_B\|$ scales linearly with context length $T$ and depth $L$, raw transmission often dominates end-to-end latency under limited interconnect bandwidth.
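To make the linear scaling concrete, the sketch below estimates the size of a raw KV cache and the time needed to ship it. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements) and the 10 Gbps link are illustrative assumptions, not values reported in the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Raw KV cache size: one K and one V tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


def transfer_seconds(num_bytes, bandwidth_gbps):
    """Time to send num_bytes over a link rated in gigabits per second."""
    return num_bytes * 8 / (bandwidth_gbps * 1e9)


# Hypothetical model shape; doubling the prefix length doubles the payload.
size_8k = kv_cache_bytes(32, 8, 128, seq_len=8192)    # 2**30 bytes (1 GiB)
size_16k = kv_cache_bytes(32, 8, 128, seq_len=16384)  # 2**31 bytes (2 GiB)
raw_latency_8k = transfer_seconds(size_8k, bandwidth_gbps=10)
```

Under these assumed numbers the raw transfer alone takes roughly 0.86 s for an 8K-token prefix, which illustrates why the payload size, rather than compute, can set TTFT on slow links.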
##### Heterogeneity and Semantic Drift\.
Reuse becomes challenging when producer and consumer models are heterogeneous (e.g., base vs. fine-tuned, or draft vs. verifier). Because their weights differ, their internal representations lie in distinct feature spaces. Directly substituting the producer’s cache $\mathcal{C}_A$ for $\mathcal{C}_B$ causes *semantic drift*, a representation mismatch that compounds over layers. Existing methods typically mitigate this by partially recomputing specific layers, which forces a rigid trade-off between transfer cost and computation overhead (Liu et al., [2024a](https://arxiv.org/html/2606.07684#bib.bib16)).
##### Loss\-Constrained State Transfer\.
We formulate cross-model cache reuse as a rate-distortion problem. Our goal is to transfer compact *semantic codes* from $\mathcal{M}_A$ to reconstruct an *effective* cache for $\mathcal{M}_B$. We define a producer-side encoder $\phi$ and a consumer-side decoder $\psi$ operating on source states $\mathbf{S}_A$ (e.g., hidden states):
$$\mathbf{Z} = \phi(\mathbf{S}_A), \quad \mathbf{Z} \in \mathbb{R}^{T \times r}, \tag{1}$$
$$\hat{\mathbf{S}}_B = \psi(\mathbf{Z}), \tag{2}$$
where $r \ll d$ is the rank of the code, and $\hat{\mathbf{S}}_B$ is used to compute the approximated cache $\hat{\mathcal{C}}_B$.
We minimize the total transfer latency $\mathcal{T}_{\mathrm{transfer}}$, which sums transmission, reconstruction, and residual computation time:
$$\min_{\phi,\psi} \quad \mathcal{T}_{\mathrm{transfer}} = \frac{|\mathbf{Z}|}{\mathrm{BW}} + \mathcal{T}_{\mathrm{recon}}(\psi, \mathbf{Z}) + \mathcal{T}_{\mathrm{fill}}, \tag{3}$$
where $\mathrm{BW}$ is the channel bandwidth, $|\mathbf{Z}|$ is the code size, and $\mathcal{T}_{\mathrm{fill}}$ accounts for any layers explicitly recomputed rather than reconstructed (in SCD, $\mathcal{T}_{\mathrm{fill}} = 0$ since Reuse and Patch jointly cover all layers).
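The objective in Eq. (3) can be evaluated directly once the payload size and reconstruction time are known. The following is a minimal sketch; the helper names, the rank of 16, the 0.05 s reconstruction time, and the 1.25 GB/s link are hypothetical values chosen only to show how the terms combine, and the payload helper treats every code stream as having the same rank.

```python
def payload_bytes(seq_len, rank, num_streams, bytes_per_elem=2):
    """Size of num_streams code matrices, each of shape (seq_len, rank)."""
    return seq_len * rank * num_streams * bytes_per_elem


def transfer_latency(code_bytes, bandwidth_bytes_per_s, t_recon_s, t_fill_s=0.0):
    """Eq. (3): transmission time + reconstruction time + residual recompute time."""
    if not bandwidth_bytes_per_s > 0:
        raise ValueError("bandwidth must be positive")
    return code_bytes / bandwidth_bytes_per_s + t_recon_s + t_fill_s


# Hypothetical setting: 32 layers x 8 KV heads x (K and V) = 512 streams.
code_size = payload_bytes(seq_len=8192, rank=16, num_streams=512)   # 2**27 bytes
scd_latency = transfer_latency(code_size, 1.25e9, t_recon_s=0.05)
raw_latency = transfer_latency(2 ** 30, 1.25e9, t_recon_s=0.0)
```

With these assumed numbers the code payload is one eighth of the raw cache (rank 16 versus head dimension 128), so the reconstruction cost is worthwhile only while it stays below the transmission time it saves.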
Crucially, we constrain the consumer’s generation quality to remain within an $\epsilon$-margin of the oracle:
$$\mathbb{E}_{x_{1:T} \sim \mathcal{D}} \Big[ D_{\mathrm{KL}}\Big( P_{\mathcal{M}_B}(\cdot \mid \mathcal{C}_B) \,\Big\|\, P_{\mathcal{M}_B}(\cdot \mid \hat{\mathcal{C}}_B) \Big) \Big] \leq \epsilon, \tag{4}$$
where $P_{\mathcal{M}_B}$ is the next-token distribution. SCD learns layer-wise projections $(\phi, \psi)$ to minimize Eq. ([3](https://arxiv.org/html/2606.07684#S3.E3)) while satisfying Eq. ([4](https://arxiv.org/html/2606.07684#S3.E4)).
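One way to check the constraint in Eq. (4) empirically is to compare next-token logits obtained with the oracle cache against those obtained with the reconstructed cache and average the KL divergence over a held-out prefix set. The sketch below assumes the two logit arrays are already available; the function names are illustrative and not part of the paper.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the vocabulary axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def next_token_kl(oracle_logits, approx_logits):
    """KL(P_oracle || P_approx) per prefix; inputs have shape (num_prefixes, vocab)."""
    log_p = log_softmax(oracle_logits)
    log_q = log_softmax(approx_logits)
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)


def satisfies_constraint(oracle_logits, approx_logits, eps):
    """Empirical version of Eq. (4): mean KL over the sample must not exceed eps."""
    return eps >= float(next_token_kl(oracle_logits, approx_logits).mean())
```

The direction of the divergence follows Eq. (4): the oracle distribution is the reference, so probability mass that the reconstructed cache removes from tokens the oracle favors is what gets penalized.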
## 4 Method
We present SCD, a framework for split inference between a producer $\mathcal{M}_A$ and a consumer $\mathcal{M}_B$ that share a Transformer architecture but differ in weights. Rather than transmitting raw high-dimensional states, the producer sends compact *semantic codes*, from which the consumer reconstructs *native-space* states. SCD integrates two complementary mechanisms: (i) Reuse: fast low-rank reconstruction of KV caches for the majority of layers; and (ii) Patch: targeted semantic correction at sparse transition layers to truncate error propagation.
### 4.1 Notation and Preliminaries
We index Transformer layers by $\ell \in \{1, \dots, L\}$. Let $T$ denote the prefix length, $d_h$ the per-head dimension, and $d_{\text{model}}$ the model hidden size. We represent matrices in bold uppercase (e.g., $\mathbf{K}, \mathbf{V}$) and vectors/scalars in lowercase. The consumer’s cache for layer $\ell$ consists of multi-head tensors $(\mathbf{K}_B^{\ell}, \mathbf{V}_B^{\ell})$. When applying encoders/decoders, we flatten the $\text{batch} \times \text{heads} \times \text{tokens}$ dimensions into a sample axis $N$. Let $\mathcal{L}_{\mathrm{patch}} \subseteq \{1, \dots, L\}$ be the set of layers using Patch, and $\mathcal{L}_{\mathrm{reuse}} = \{1, \dots, L\} \setminus \mathcal{L}_{\mathrm{patch}}$ be the remaining layers that use Reuse. The patch budget is $k = |\mathcal{L}_{\mathrm{patch}}|$; unless otherwise specified, we use $k = 6$. We select $\mathcal{L}_{\mathrm{patch}}$ once during offline calibration and keep it fixed for all online requests. We employ branch-specific ranks $r_K, r_V, r_H \ll d_h, d_{\text{model}}$. Keys utilize Rotary Positional Embeddings (RoPE); we denote by $\operatorname{deRoPE}(\cdot)$ and $\operatorname{RoPE}(\cdot)$ the inverse and forward operations, respectively. Crucially, we compress Key, Value, and Hidden states independently to preserve channel-wise semantics. Offline calibration uses a small prefix set $\mathcal{D}_{\mathrm{cal}}$ (typically 100–500 representative prefixes). For each prefix $x_{1:T} \in \mathcal{D}_{\mathrm{cal}}$, we run full prefills on both $\mathcal{M}_A$ and $\mathcal{M}_B$ under identical tokenization and attention masks, then record paired KV tensors and normalized pre-attention hidden states for every layer.
These paired traces are used to fit Reuse translators via linear low\-rank regression and Patch aligners via supervised regression before deployment\. At deployment time, the learned artifacts are loaded as a static routing table keyed by layer index, and the serving path performs no per\-request layer search\.
Figure 2: SCD Overview. *Offline:* We collect paired traces on identical prefixes to learn per-layer low-rank translators for KV pairs (Reuse) and aligners for transition semantic states (Patch). *Online:* The producer executes prefill, encodes, and transmits low-dimensional codes $(\mathbf{Z}_K^{\ell}, \mathbf{Z}_V^{\ell})$ for $\ell \in \mathcal{L}_{\mathrm{reuse}}$ and $\mathbf{Z}_H^{\ell}$ for $\ell \in \mathcal{L}_{\mathrm{patch}}$. The consumer decodes these into its native-space states, bypassing local prefill.
### 4.2 Overview: Semantic Code Transport
As illustrated in Fig. [2](https://arxiv.org/html/2606.07684#S4.F2), SCD replaces raw state transmission with semantic code transport. During producer prefill, we transmit codes $\{\mathbf{Z}_K^{\ell}, \mathbf{Z}_V^{\ell}\}_{\ell \in \mathcal{L}_{\mathrm{reuse}}}$ for Reuse layers and $\{\mathbf{Z}_H^{\ell}\}_{\ell \in \mathcal{L}_{\mathrm{patch}}}$ for Patch layers. The consumer decodes $\mathbf{Z}_K / \mathbf{Z}_V$ into $\mathcal{M}_B$-native KV caches and maps $\mathbf{Z}_H$ into $\mathcal{M}_B$’s pre-attention normalized space to regenerate consistent KV pairs. Concretely, $\phi$ and $\psi$ in Eqs. (1)–(2) are instantiated as the layer-wise linear maps $\mathbf{W}_{\mathrm{enc}}, \mathbf{W}_{\mathrm{dec}}$ (Reuse) plus aligners $g_{\theta}^{\ell}$ (Patch).
### 4.3 Reuse: Cross-Model Low-Rank KV Reconstruction
Reuse leverages the empirical low-rank structure of KV tensors and the high subspace overlap between corresponding layers of $\mathcal{M}_A$ and $\mathcal{M}_B$. We learn a shared latent space and model-specific decoders per layer.
#### 4.3.1 Offline: Joint Subspace Learning
##### Paired Collection and deRoPE\.
We collect per-layer tensors from paired full prefills of $\mathcal{M}_A$ and $\mathcal{M}_B$ on the same calibration prefixes $\mathcal{D}_{\mathrm{cal}}$. The calibration stage is offline and produces frozen artifacts for a fixed model pair. For keys, we remove RoPE to operate in the content space:
$$\tilde{\mathbf{K}}_A^{\ell} = \operatorname{deRoPE}(\mathbf{K}_A^{\ell}), \qquad \tilde{\mathbf{K}}_B^{\ell} = \operatorname{deRoPE}(\mathbf{K}_B^{\ell}). \tag{5}$$
Values $\mathbf{V}$ are processed directly.
##### Joint Low\-Rank Factorization\.
Flattening samples into matrices $\mathbf{A}_K^{\ell}, \mathbf{B}_K^{\ell} \in \mathbb{R}^{N \times d_h}$, we construct the joint matrix $\mathbf{X}_K^{\ell} = [\mathbf{A}_K^{\ell} \; \mathbf{B}_K^{\ell}] \in \mathbb{R}^{N \times 2d_h}$. A rank-$r_K$ approximation yields shared latents $\mathbf{Z}_K^{\ell} \in \mathbb{R}^{N \times r_K}$ and decoders $\mathbf{W}_{\mathrm{dec},A}^{\ell}, \mathbf{W}_{\mathrm{dec},B}^{\ell} \in \mathbb{R}^{r_K \times d_h}$ such that:
$$\mathbf{A}_K^{\ell} \approx \mathbf{Z}_K^{\ell} \mathbf{W}_{\mathrm{dec},A}^{\ell}, \qquad \mathbf{B}_K^{\ell} \approx \mathbf{Z}_K^{\ell} \mathbf{W}_{\mathrm{dec},B}^{\ell}. \tag{6}$$
The same procedure applies independently to values to obtain $\mathbf{Z}_V^{\ell}$.
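The text does not state which algorithm produces the rank-$r_K$ approximation. One natural choice, assumed here purely for illustration, is a truncated SVD of the joint matrix, splitting the right singular vectors into a producer block and a consumer block. The function name `joint_low_rank` is hypothetical.

```python
import numpy as np


def joint_low_rank(A, B, rank):
    """Factor paired samples A, B of shape (N, d_h) through a shared latent.

    Returns Z of shape (N, rank) and decoders W_dec_A, W_dec_B of shape
    (rank, d_h) such that A ~ Z @ W_dec_A and B ~ Z @ W_dec_B, as in Eq. (6).
    Assumption: truncated SVD is used as the rank-constrained approximation.
    """
    d_h = A.shape[1]
    X = np.concatenate([A, B], axis=1)               # joint matrix, (N, 2 * d_h)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    Z = U[:, :rank] * S[:rank]                       # shared latents
    W = Vt[:rank]                                    # (rank, 2 * d_h)
    return Z, W[:, :d_h], W[:, d_h:]
```

Because both halves of the joint matrix are reconstructed from the same latents, directions that explain variance in only one model compete for rank with directions shared by both, which is what makes the learned subspace joint rather than producer-specific.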
##### Producer Encoders\.
To map new producer observations to the shared latent space, we learn a projection $\mathbf{W}_{\mathrm{enc}}^{\ell} \in \mathbb{R}^{d_h \times r_K}$ via ridge regression:
$$\mathbf{W}_{\mathrm{enc}}^{\ell} = \mathop{\arg\min}_{\mathbf{W}} \; \left\| \mathbf{A}_K^{\ell} \mathbf{W} - \mathbf{Z}_K^{\ell} \right\|_F^2 + \lambda \|\mathbf{W}\|_F^2. \tag{7}$$
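Eq. (7) has the standard closed-form ridge solution $\mathbf{W} = (\mathbf{A}^{\top}\mathbf{A} + \lambda \mathbf{I})^{-1}\mathbf{A}^{\top}\mathbf{Z}$. A minimal NumPy sketch follows; the function name and the default value of `lam` are illustrative assumptions, since the paper does not report the regularization strength here.

```python
import numpy as np


def fit_ridge_encoder(A, Z, lam=1e-3):
    """Closed-form minimizer of ||A W - Z||_F^2 + lam * ||W||_F^2.

    A: (N, d_h) producer samples, Z: (N, r) shared latents.
    Returns W_enc of shape (d_h, r).
    """
    d_h = A.shape[1]
    gram = A.T @ A + lam * np.eye(d_h)
    # Solving the linear system avoids forming an explicit matrix inverse.
    return np.linalg.solve(gram, A.T @ Z)
```

The encoder is fit on producer samples only, so at serving time the producer can compute codes without access to any consumer state.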
#### 4.3.2 Online: Encode and Decode
For each $\ell \in \mathcal{L}_{\mathrm{reuse}}$, the producer computes:
$$\mathbf{Z}_K^{\ell} = \tilde{\mathbf{K}}_A^{\ell} \mathbf{W}_{\mathrm{enc},K}^{\ell}, \qquad \mathbf{Z}_V^{\ell} = \mathbf{V}_A^{\ell} \mathbf{W}_{\mathrm{enc},V}^{\ell}. \tag{8}$$
The consumer reconstructs native-space caches via:
$$\tilde{\mathbf{K}}_B^{\ell} = \mathbf{Z}_K^{\ell} \mathbf{W}_{\mathrm{dec},B}^{\ell}, \qquad \mathbf{V}_B^{\ell} = \mathbf{Z}_V^{\ell} \mathbf{W}_{\mathrm{dec},B}^{\ell}, \tag{9}$$
followed by re-applying RoPE: $\mathbf{K}_B^{\ell} = \operatorname{RoPE}(\tilde{\mathbf{K}}_B^{\ell})$.
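The key path in Eqs. (8)–(9) can be sketched as below. RoPE is a position-dependent rotation of channel pairs, so its inverse is the same rotation with negated angles. The half-split channel pairing and the base of 10000 are assumptions made for this sketch (real models differ in layout), and all function names are hypothetical.

```python
import numpy as np


def rope_angles(seq_len, head_dim, base=10000.0):
    """Rotation angle for each (position, channel-pair), shape (T, head_dim // 2)."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    return np.outer(np.arange(seq_len), inv_freq)


def rotate(x, angles):
    """Rotate channel pairs (x[..., i], x[..., i + half]) by the given angles."""
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    cos, sin = np.cos(angles), np.sin(angles)
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)


def rope(x):
    """Forward RoPE on x of shape (..., T, head_dim)."""
    return rotate(x, rope_angles(x.shape[-2], x.shape[-1]))


def de_rope(x):
    """Inverse RoPE: rotate by the negated angles."""
    return rotate(x, -rope_angles(x.shape[-2], x.shape[-1]))


def producer_encode_keys(K_A, W_enc_K):
    """Eq. (8), key branch: strip positions, then project to the code space."""
    return de_rope(K_A) @ W_enc_K


def consumer_decode_keys(Z_K, W_dec_K):
    """Eq. (9), key branch: decode to the consumer space, then restore positions."""
    return rope(Z_K @ W_dec_K)
```

Working in the de-rotated content space keeps the learned linear maps position-independent, so a single encoder/decoder pair per layer can serve prefixes of any length.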
### 4.4 Patch: Semantic Correction at Transition Layers
Low-rank reconstruction is insufficient for critical layers where feature mismatches amplify. Patch intercepts the consumer’s *transition semantic state* to enforce alignment.
#### 4.4.1 Patch Signal: Pre-Attention Normalized Input
For $\ell \in \mathcal{L}_{\mathrm{patch}}$, we define the transition state as the post-norm input to the attention block:
$$\mathbf{H}_{\mathrm{in},A}^{\ell} \triangleq \mathbf{x}^{\ell}_{\mathrm{norm},A}, \qquad \mathbf{H}_{\mathrm{in},B}^{\ell} \triangleq \mathbf{x}^{\ell}_{\mathrm{norm},B}, \tag{10}$$
where $\mathbf{H} \in \mathbb{R}^{T \times d_{\text{model}}}$. The producer compresses $\mathbf{H}_{\mathrm{in},A}^{\ell}$ into a code $\mathbf{Z}_H^{\ell} = \mathbf{H}_{\mathrm{in},A}^{\ell} \, \mathbf{W}_{\mathrm{enc},H}^{\ell}$, where $\mathbf{W}_{\mathrm{enc},H}^{\ell} \in \mathbb{R}^{d_{\text{model}} \times r_H}$ is learned similarly to Eq. ([7](https://arxiv.org/html/2606.07684#S4.E7)).
#### 4.4.2 Consumer Aligner and Regeneration
The consumer applies a non-linear aligner $g_{\theta}^{\ell}$ (an MLP) to map the code to its native space:
$$\hat{\mathbf{H}}_{\mathrm{in},B}^{\ell} = g_{\theta}^{\ell}(\mathbf{Z}_H^{\ell}), \qquad \ell \in \mathcal{L}_{\mathrm{patch}}. \tag{11}$$
KV pairs are then regenerated using the consumer’s native weights:
$$\hat{\mathbf{K}}_B^{\ell} = \operatorname{RoPE}\Bigl( \hat{\mathbf{H}}_{\mathrm{in},B}^{\ell} \mathbf{W}_{k,B}^{\ell} \Bigr), \tag{12a}$$
$$\hat{\mathbf{V}}_B^{\ell} = \hat{\mathbf{H}}_{\mathrm{in},B}^{\ell} \mathbf{W}_{v,B}^{\ell}. \tag{12b}$$
We train $g_{\theta}^{\ell}$ to minimize the L2 reconstruction error $\| g_{\theta}^{\ell}(\mathbf{Z}_H^{\ell}) - \mathbf{H}_{\mathrm{in},B}^{\ell} \|_2^2$.
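The Patch path of Eqs. (11)–(12) can be sketched as a small forward pass. The paper specifies only that the aligner is an MLP; the two-layer depth, the tanh-approximated GELU activation, the parameter dictionary layout, and the `apply_rope` callable are all assumptions made for illustration.

```python
import numpy as np


def gelu(x):
    """Tanh approximation of GELU (activation choice is an assumption)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def aligner(Z_H, params):
    """Eq. (11): hypothetical two-layer MLP mapping (T, r_H) codes to (T, d_model)."""
    hidden = gelu(Z_H @ params["W1"] + params["b1"])
    return hidden @ params["W2"] + params["b2"]


def regenerate_kv(H_hat, W_k, W_v, apply_rope):
    """Eq. (12): project the aligned state with the consumer's own K/V weights."""
    K_hat = apply_rope(H_hat @ W_k)   # keys receive positional rotation
    V_hat = H_hat @ W_v               # values are used as-is
    return K_hat, V_hat


def patch_loss(Z_H, H_target, params):
    """Mean squared L2 error between the aligner output and the consumer state."""
    diff = aligner(Z_H, params) - H_target
    return float(np.mean(np.sum(diff ** 2, axis=-1)))
```

Because the keys and values are produced by the consumer's native projection weights, any residual error is confined to the aligned input state rather than being introduced separately in K and V.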
### 4.5 Online Procedure
Algorithm [1](https://arxiv.org/html/2606.07684#alg1) details the inference pipeline. The transmission complexity is $O(Tr)$, significantly lower than the raw transfer cost $O(Td)$ since $r \ll d$. The sets $\mathcal{L}_{\mathrm{patch}}$ and $\mathcal{L}_{\mathrm{reuse}}$ are fixed inputs from offline calibration; branch selection reduces to constant-time membership checks over layer indices, so latency variation comes from prefix length and model execution, not from an online policy search.
Algorithm 1: Split Inference with Reuse and Patch
Input: Prefix $x_{1:T}$, Producer $\mathcal{M}_A$, Consumer $\mathcal{M}_B$, fixed sets $\mathcal{L}_{\mathrm{patch}}, \mathcal{L}_{\mathrm{reuse}}$
Output: Consumer cache $\mathcal{C}_B$ ready for decoding
Producer (Device A):
- Compute full prefill on $\mathcal{M}_A$
- for $\ell \in \mathcal{L}_{\mathrm{reuse}}$ do
  - $\mathbf{Z}_K^{\ell} \leftarrow \operatorname{deRoPE}(\mathbf{K}_A^{\ell}) \mathbf{W}_{\mathrm{enc},K}^{\ell}$
  - $\mathbf{Z}_V^{\ell} \leftarrow \mathbf{V}_A^{\ell} \mathbf{W}_{\mathrm{enc},V}^{\ell}$
- end for
- for $\ell \in \mathcal{L}_{\mathrm{patch}}$ do
  - $\mathbf{Z}_H^{\ell} \leftarrow \mathbf{x}^{\ell}_{\mathrm{norm},A} \mathbf{W}_{\mathrm{enc},H}^{\ell}$
- end for
- Transmit $\{\mathbf{Z}_K^{\ell}, \mathbf{Z}_V^{\ell}\}_{\text{reuse}} \cup \{\mathbf{Z}_H^{\ell}\}_{\text{patch}}$ to Device B
Consumer (Device B):
- for $\ell \in \mathcal{L}_{\mathrm{reuse}}$ do
  - $\tilde{\mathbf{K}}_B^{\ell} \leftarrow \mathbf{Z}_K^{\ell} \mathbf{W}_{\mathrm{dec},B}^{\ell}$
  - $\mathbf{K}_B^{\ell} \leftarrow \operatorname{RoPE}(\tilde{\mathbf{K}}_B^{\ell})$
  - $\mathbf{V}_B^{\ell} \leftarrow \mathbf{Z}_V^{\ell} \mathbf{W}_{\mathrm{dec},B}^{\ell}$
- end for
- for $\ell \in \mathcal{L}_{\mathrm{patch}}$ do
  - $\hat{\mathbf{H}}_{\mathrm{in},B}^{\ell} \leftarrow g_{\theta}^{\ell}(\mathbf{Z}_H^{\ell})$
  - Regenerate $(\mathbf{K}_B^{\ell}, \mathbf{V}_B^{\ell})$ via Eq. ([12](https://arxiv.org/html/2606.07684#S4.E12))
- end for
- Initiate decoding with assembled cache $\mathcal{C}_B$
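The two halves of Algorithm 1 can be sketched as a pair of functions that dispatch on layer membership. This is a single-head simplification under stated assumptions: the dictionary layouts for `traces` and `artifacts`, the key names, and the `rope`/`de_rope` callables are hypothetical, and separate key and value decoders are used where Eq. (9) writes a single decoder symbol.

```python
import numpy as np


def producer_step(traces, artifacts, patch_layers, de_rope):
    """Device A: turn prefill traces into per-layer semantic codes.

    traces[l] holds "K", "V" of shape (T, d_h) and "x_norm" of shape (T, d_model).
    """
    codes = {}
    for layer, trace in traces.items():
        art = artifacts[layer]
        if layer in patch_layers:                       # Patch branch
            codes[layer] = {"Z_H": trace["x_norm"] @ art["W_enc_H"]}
        else:                                           # Reuse branch
            codes[layer] = {
                "Z_K": de_rope(trace["K"]) @ art["W_enc_K"],
                "Z_V": trace["V"] @ art["W_enc_V"],
            }
    return codes


def consumer_step(codes, artifacts, patch_layers, rope):
    """Device B: rebuild a native-space (K, V) pair for every layer."""
    cache = {}
    for layer, code in codes.items():
        art = artifacts[layer]
        if layer in patch_layers:
            H_hat = art["aligner"](code["Z_H"])         # Eq. (11)
            cache[layer] = (rope(H_hat @ art["W_k"]), H_hat @ art["W_v"])
        else:
            K_tilde = code["Z_K"] @ art["W_dec_K"]      # Eq. (9)
            cache[layer] = (rope(K_tilde), code["Z_V"] @ art["W_dec_V"])
    return cache
```

Passing `patch_layers` as a `frozenset` makes the branch choice a constant-time membership check, matching the fixed routing described for the online path.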
## 5 Theoretical Analysis
We analyze why Reuse alone leads to multiplicative error amplification across layers and how Patch acts as a spectral truncation operator. Our analysis connects layer-wise reconstruction errors to the divergence in the consumer’s output distribution. Proofs are detailed in Appendix [B](https://arxiv.org/html/2606.07684#A2).
##### Setup\.
Consider a decoding step $t$ given prefix $x_{1:T}$. Let $\mathbf{h}_{B}^{\ell} \in \mathbb{R}^{d}$ denote the oracle hidden state on $\mathcal{M}_{B}$ at layer $\ell$ for the final prefix position (the state from which the first decode step reads), and let $\hat{\mathbf{h}}_{B}^{\ell}$ denote the SCD approximation of the same state under the reconstructed cache. Let $\mathbf{o}_{B}, \hat{\mathbf{o}}_{B} \in \mathbb{R}^{V}$ be the corresponding output logits. We quantify the layer-local injection errors as:

$$\varepsilon_{\mathrm{reuse}}^{\ell} \triangleq \lVert (\tilde{\mathbf{K}}_{B}^{\ell}, \mathbf{V}_{B}^{\ell}) - (\tilde{\mathbf{K}}_{B,\mathrm{reuse}}^{\ell}, \mathbf{V}_{B,\mathrm{reuse}}^{\ell}) \rVert, \tag{13}$$

$$\varepsilon_{\mathrm{patch}}^{\ell} \triangleq \lVert \mathbf{h}_{B}^{\ell} - \hat{\mathbf{h}}_{B}^{\ell} \rVert, \quad \ell \in \mathcal{L}_{\mathrm{patch}}, \tag{14}$$

where $\lVert\cdot\rVert$ denotes the spectral norm.
##### Assumption 1: Layer\-wise Stability\.
We assume the Transformer layer map $\mathcal{F}_{\ell}$ is Lipschitz continuous with respect to both input states and cached KV pairs. For any reuse layer $\ell \in \mathcal{L}_{\mathrm{reuse}}$, there exist stability constants $\alpha_{\ell}, \beta_{\ell} \geq 0$ such that:

$$\lVert \hat{\mathbf{h}}_{B}^{\ell+1} - \mathbf{h}_{B}^{\ell+1} \rVert \leq \alpha_{\ell} \lVert \hat{\mathbf{h}}_{B}^{\ell} - \mathbf{h}_{B}^{\ell} \rVert + \beta_{\ell}\,\varepsilon_{\mathrm{reuse}}^{\ell}. \tag{15}$$

For patch layers $\ell \in \mathcal{L}_{\mathrm{patch}}$, the aligner explicitly resets the state, bounding the error solely by the patch reconstruction quality:

$$\lVert \hat{\mathbf{h}}_{B}^{\ell} - \mathbf{h}_{B}^{\ell} \rVert \leq \varepsilon_{\mathrm{patch}}^{\ell}. \tag{16}$$
##### Error Propagation and Truncation\.
Unrolling Eq\. \([15](https://arxiv.org/html/2606.07684#S5.E15)\) reveals that errors accumulate multiplicatively\. However, Eq\. \([16](https://arxiv.org/html/2606.07684#S5.E16)\) introduces a reset mechanism: a Patch layer effectively truncates the history, preventing errors from earlier layers from propagating further\.
###### Lemma 5.1 (Segmented Error Bound).
Let the patch indices be $s_{1} < s_{2} < \dots < s_{m}$. Define segments using $s_{0} \triangleq 0$ and $s_{m+1} \triangleq L$. For any target layer $\ell$ located in segment $(s_{j}, s_{j+1}]$, the accumulated representation error is bounded by:

$$\lVert \hat{\mathbf{h}}_{B}^{\ell} - \mathbf{h}_{B}^{\ell} \rVert \;\leq\; \left(\prod_{i=s_{j}}^{\ell-1} \alpha_{i}\right) \varepsilon_{\mathrm{patch}}^{s_{j}} \;+\; \sum_{k=s_{j}}^{\ell-1} \left(\prod_{m=k+1}^{\ell-1} \alpha_{m}\right) \beta_{k}\,\varepsilon_{\mathrm{reuse}}^{k}, \tag{17}$$

where $\varepsilon_{\mathrm{patch}}^{s_{0}} \triangleq 0$.
###### Proof Sketch\.
The proof follows by induction. The base case at $\ell = s_{j}$ is satisfied by Eq. ([16](https://arxiv.org/html/2606.07684#S5.E16)). For subsequent layers, we apply the recurrence in Eq. ([15](https://arxiv.org/html/2606.07684#S5.E15)) starting from the reset point $s_{j}$. ∎
##### Theorem 1 \(NLL Gap\)\.
Assuming the output head is $L_{\mathrm{out}}$-Lipschitz and the log-softmax function is locally $C_{\mathrm{sm}}$-Lipschitz, the divergence in NLL for the target token $y_{t}$ satisfies:

$$\left|\log p(y_{t}) - \log \hat{p}(y_{t})\right| \;\leq\; C_{\mathrm{sm}} L_{\mathrm{out}} \lVert \hat{\mathbf{h}}_{B}^{L} - \mathbf{h}_{B}^{L} \rVert, \tag{18}$$

where the RHS is bounded by Lemma [5.1](https://arxiv.org/html/2606.07684#S5.Thmtheorem1).
##### Theoretical Implications\.
This formalizes our Reuse-heavy, Patch-sparse strategy. Without Patch, the error term is dominated by $\prod_{i=1}^{L} \alpha_{i}$, which compounds multiplicatively with depth $L$; in practice $\alpha_{i} \gtrsim 1$ due to residual connections, so even small per-layer amplification accumulates into a large drift over deep stacks. By inserting a Patch at $s_{j}$, we replace the accumulated error term with a small local term $\varepsilon_{\mathrm{patch}}^{s_{j}}$, effectively truncating the drift.
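To make the truncation effect concrete, the sketch below iterates the recurrence of Eq. (15) and applies the reset of Eq. (16) at patch layers, which yields the bound of Lemma 5.1 at the final layer. The stability constants, error magnitudes, and patch positions are illustrative assumptions chosen for the example; they are not values measured in the paper.

```python
def error_bound(alphas, betas, eps_reuse, patch_eps=None):
    """Bound on the final-layer state error.

    alphas, betas, eps_reuse: per-layer sequences indexed 0..L-1.
    patch_eps: dict mapping a patch layer index s to eps_patch at s.
    """
    patch_eps = patch_eps or {}
    err = 0.0  # the input to the first layer is exact (eps_patch at s_0 is 0)
    for layer in range(len(alphas)):
        if layer in patch_eps:
            err = patch_eps[layer]  # Eq. (16): history is discarded at a patch
        # Eq. (15): propagate through this layer and add its injection error
        err = alphas[layer] * err + betas[layer] * eps_reuse[layer]
    return err

# Illustrative constants only (assumptions, not from the paper).
L = 32
alphas, betas, eps = [1.1] * L, [1.0] * L, [0.01] * L
no_patch = error_bound(alphas, betas, eps)
with_patch = error_bound(alphas, betas, eps, patch_eps={8: 0.02, 16: 0.02, 24: 0.02})
```

With every $\alpha_i$ above 1, the unpatched bound grows geometrically with depth, whereas the patched bound only accumulates over the final segment after the last patch.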
## 6 Experiments
We evaluate SCD in a split inference setting where a producer runs prefill and a consumer starts decoding without a full consumer-side prefill. This setting is motivated by disaggregated serving, where KV transfer can dominate latency and requires high bandwidth at scale (Zhong et al., [2024](https://arxiv.org/html/2606.07684#bib.bib30)). We focus on the practical scenario of a base model and its fine-tuned variant sharing the same architecture but different weights, for which cross-model cache reuse is known to be non-trivial (Liu et al., [2024a](https://arxiv.org/html/2606.07684#bib.bib16)).
### 6.1 Experimental Setup
##### Tasks and Datasets\.
We evaluate generation quality and distributional fidelity across diverse benchmarks: (i) Question Answering (QA): We report F1 scores on CMRC2018 (Chinese extractive QA) and HotpotQA (English multi-hop reasoning), covering both span-extraction and complex reasoning tasks; (ii) Language Modeling: We measure Perplexity (PPL) on WikiText-2 to quantify general representational consistency; and (iii) Distributional Fidelity: We compute token-level KL divergence and Total Variation (TV) distance between the Oracle consumer distribution and the approximated distribution.
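The two distributional fidelity metrics can be computed directly from per-token logits. The sketch below is one way to do so in NumPy; the function names are hypothetical, and the KL direction (Oracle relative to the approximation) is an assumption, since the text does not state which direction is used.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def token_kl_tv(oracle_logits, approx_logits):
    """Per-token KL(p_oracle || p_approx) in nats and TV distance.

    Both inputs have shape (num_tokens, vocab_size).
    """
    log_p = log_softmax(oracle_logits)
    log_q = log_softmax(approx_logits)
    p, q = np.exp(log_p), np.exp(log_q)
    kl = (p * (log_p - log_q)).sum(axis=-1)
    tv = 0.5 * np.abs(p - q).sum(axis=-1)
    return kl, tv

# Toy usage with random logits (assumption: stand-ins for real model outputs).
rng = np.random.default_rng(0)
oracle = rng.normal(size=(4, 50))
approx = oracle + 0.1 * rng.normal(size=(4, 50))
kl, tv = token_kl_tv(oracle, approx)
```

Both metrics are zero when the two distributions coincide, and TV is bounded by 1, which makes it easier to compare across vocabularies than KL.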
##### Split\-Inference Protocol\.
Given a prefix $x_{1:T}$, the producer runs a normal prefill and transmits either raw KV (baselines) or semantic codes (SCD). The consumer reconstructs caches (and optionally applies Patch) before initiating greedy decoding.
##### Calibration Protocol\.
SCD uses an offline calibration set $\mathcal{D}_{\mathrm{cal}}$ only to learn auxiliary translators for a fixed model pair. For the canonical MistralLite→Mistral-7B run, we use 200 CMRC2018 prefixes (mean prefix length 609.0 tokens; 121,805 total tokens). For each prefix, we run full prefills on both producer and consumer with identical tokenization and attention masks, then save paired per-layer KV tensors and normalized pre-attention hidden states. The Reuse module is fitted by low-rank subspace identification followed by ridge-regression encoders/decoders; the Patch module trains sparse MLP aligners on the selected transition layers. The learned artifacts are frozen and reused across online requests.
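One plausible reading of the Reuse fitting step is sketched below: the encoder is taken from the top singular directions of the producer-side calibration data, and the decoder is a ridge regression from the resulting codes to the paired consumer-side states. The function name, the SVD-based encoder, and the default ridge strength are assumptions for illustration; the paper's exact fitting procedure may differ (for example, it also describes ridge-fitted encoders).

```python
import numpy as np

def fit_reuse_translator(X_A, Y_B, rank, ridge=1e-3):
    """Fit a low-rank encoder/decoder pair from paired calibration states.

    X_A: (N, d) producer states (e.g. deRoPE'd keys stacked over tokens).
    Y_B: (N, d) consumer states for the same tokens.
    Returns W_enc with shape (d, rank) and W_dec with shape (rank, d).
    """
    # Subspace identification: top-`rank` right singular vectors of X_A.
    _, _, Vt = np.linalg.svd(X_A, full_matrices=False)
    W_enc = Vt[:rank].T
    Z = X_A @ W_enc
    # Ridge decoder: argmin_W ||Z W - Y_B||^2 + ridge * ||W||^2.
    W_dec = np.linalg.solve(Z.T @ Z + ridge * np.eye(rank), Z.T @ Y_B)
    return W_enc, W_dec
```

At serving time the producer would send `X_A @ W_enc` and the consumer would apply `W_dec`, so only `rank` values per token cross the network instead of `d`.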
##### Bandwidth Measurement\.
We measure TTFT and end-to-end latency under controlled effective bandwidth $B_{\mathrm{net}}$ (sweeping from 100 Gbps to 1 Tbps). We report the total transmitted payload size for each method.
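The relationship between payload size, bandwidth, and transfer time can be sketched with simple arithmetic. The helper names below are hypothetical, and the model shape used in the usage note (32 layers, KV width 1024, BF16) is an assumption about a Mistral-7B-like configuration rather than a figure taken from this section. The estimate covers serialization time only and ignores link latency and protocol overhead.

```python
def kv_payload_bytes(num_layers, num_tokens, kv_width, bytes_per_elem=2):
    """Raw KV payload: one K and one V tensor of shape (num_tokens, kv_width) per layer."""
    return 2 * num_layers * num_tokens * kv_width * bytes_per_elem

def code_payload_bytes(num_layers, num_tokens, rank_k, rank_v, bytes_per_elem=2):
    """Low-rank code payload: Z_K and Z_V with independent ranks per layer."""
    return num_layers * num_tokens * (rank_k + rank_v) * bytes_per_elem

def transfer_ms(payload_bytes, bandwidth_gbps):
    """Time to push the payload through a link of the given effective bandwidth."""
    return payload_bytes * 8 / (bandwidth_gbps * 1e9) * 1e3
```

For example, `transfer_ms(kv_payload_bytes(32, 609, 1024), 200)` estimates the raw BF16 transfer time for a 609-token prefix at 200 Gbps under the assumed shape, and swapping in `code_payload_bytes` with chosen ranks shows how the payload shrinks in proportion to `(rank_k + rank_v) / (2 * kv_width)`.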
##### Methods Compared.
We compare the following methods:
- Oracle: Full prefill on $\mathcal{M}_{B}$ (Upper bound on quality; Lower bound on latency).
- Raw KV: Transmits full BF16 KV caches from producer to consumer.
- Quantized KV: Transmits 4-bit quantized KV caches (group-wise INT4) and dequantizes on the consumer.
- DroidSpeak-style Selective Recompute (Liu et al., [2024a](https://arxiv.org/html/2606.07684#bib.bib16)): Reuses raw KV for most layers but fully recomputes a selected subset of critical layers on $\mathcal{M}_{B}$, following the selection policy proposed in DroidSpeak.
- SCD (Reuse-only): Applies low-rank reuse to all layers (i.e., $\mathcal{L}_{\mathrm{patch}} = \emptyset$); no recomputation is performed.
- SCD: The proposed full method, applying Patch at selected transition layers and Reuse elsewhere.
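For readers unfamiliar with the Quantized KV baseline, the following is a generic sketch of group-wise 4-bit quantization. It is not the baseline's actual implementation: the asymmetric min/max scheme, the group size of 64, and the unpacked one-code-per-byte storage are assumptions made to keep the example short.

```python
import numpy as np

def quantize_int4(x, group_size=64):
    """Asymmetric group-wise 4-bit quantization of a flattened tensor.

    Assumes x.size is divisible by group_size. Codes are kept one per uint8
    for clarity; a real transport would pack two codes per byte.
    """
    groups = x.reshape(-1, group_size).astype(np.float32)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 15.0   # 16 levels: codes 0..15
    codes = np.clip(np.round((groups - lo) / scale), 0, 15).astype(np.uint8)
    return codes, scale, lo

def dequantize_int4(codes, scale, lo, shape):
    """Map codes back to floating point and restore the original shape."""
    return (codes * scale + lo).reshape(shape)
```

Such a scheme keeps values close to the producer's tensors but does nothing to map them into the consumer's feature space, which is the gap the semantic codes are meant to address.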
##### Implementation Details\.
We use the same tokenization and exact attention masking on both sides. Keys are deRoPE'd before encoding and reRoPE'd after decoding. We keep $K/V/H$ separate with independent ranks as described in Section [4](https://arxiv.org/html/2606.07684#S4) and Appendix [F](https://arxiv.org/html/2606.07684#A6). Offline calibration cost and deployment footprint are reported in Section [6.6](https://arxiv.org/html/2606.07684#S6.SS6).
Table 1: Main Results across Model Scales. Mistral (7B) and Qwen (32B) pairs on the QA benchmarks of Section [6.1](https://arxiv.org/html/2606.07684#S6.SS1). SCD consistently recovers Oracle-level quality across scales while achieving ∼2.0× speedup over full prefill. Oracle TTFT differs across tables due to per-dataset prefix length distributions; underlying model pairs are unchanged.
### 6.2 Main Results
#### 6.2.1 Bandwidth–Latency Trade-off
Figure 3: Bandwidth–Latency Trade-off. Time-to-first-token (TTFT) versus effective bandwidth (log-scale). While Quantized KV (4-bit) is the fastest due to minimal payload, it suffers from catastrophic quality collapse (see Table [2](https://arxiv.org/html/2606.07684#S6.T2)). SCD achieves a favorable sweet spot: it is significantly faster than Raw KV and DroidSpeak-style Selective Recompute while maintaining Oracle-level quality.

##### Bandwidth–Latency Analysis.
Figure [3](https://arxiv.org/html/2606.07684#S6.F3) sweeps the link bandwidth from 100 Gbps to 1 Tbps. As bandwidth increases, TTFT decreases and asymptotically approaches the compute/synchronization floor. Quantized KV (4-bit) achieves the lowest TTFT, consistent with its minimal payload size. However, as shown in Table [2](https://arxiv.org/html/2606.07684#S6.T2), this comes at the cost of unusable generation quality. SCD (Reuse+Patch) improves significantly over DroidSpeak-style Selective Recompute, with its overhead dominated by local reconstruction (GEMMs) rather than data transfer. This confirms SCD's efficiency in bandwidth-constrained regimes typical of disaggregated serving.
#### 6.2.2 Quality and Fidelity
Table 2: Main Results: Quality vs. Efficiency. Evaluated on MistralLite→Mistral-7B at $B_{\mathrm{net}} = 200$ Gbps. SCD nearly matches Oracle quality (F1 0.78 vs 0.81) while achieving a 2.65× speedup. Note that 4-bit quantization fails to bridge the cross-model semantic gap (F1 0.13).

Table [1](https://arxiv.org/html/2606.07684#S6.T1) and Table [2](https://arxiv.org/html/2606.07684#S6.T2) highlight the necessity of semantic-aware transfer. Quantized KV (4-bit) collapses quality (F1 0.1289), indicating that standard compression cannot handle the weight mismatch between heterogeneous models. Reuse-only (all-layer reuse) improves F1 to 0.6421 but still lags behind Oracle. By adding Patch, SCD recovers nearly full Oracle performance (F1 0.7850) with only a modest latency increase (+7.0 ms), effectively bridging the semantic gap. Crucially, Table [1](https://arxiv.org/html/2606.07684#S6.T1) demonstrates that this advantage holds for the larger Qwen-32B pair, validating the scalability of our approach. Additional adapted compression baselines are reported in Appendix [E](https://arxiv.org/html/2606.07684#A5).
### 6.3 Ablations
#### 6.3.1 Rank Sensitivity and Pareto Frontier
Figure 4: Rank Sensitivity (Pareto Frontier). Quality (F1) versus transferred bytes across $(r_{K}, r_{V})$ configurations. Hollow markers denote non-dominated operating points. Reducing $r_{K}$ yields substantial compression with minimal quality loss, highlighting the asymmetric information density in K vs. V.

Figure [4](https://arxiv.org/html/2606.07684#S6.F4) plots the quality–cost trade-off. The Pareto frontier demonstrates that SCD offers tunable compression. We observe that $r_{K}$ can be reduced more aggressively than $r_{V}$ without hurting quality, suggesting that Value states carry more fine-grained semantic information required for generation.
#### 6.3.2 Patch Necessity and Budget
Table 3: Effect of Patch Budget ($k$) on WikiText-2. Increasing the number of patched layers ($k$) monotonically improves perplexity (PPL) at the cost of modest TTFT overhead. The default $k = 6$ used in the main experiments is chosen from the F1-based efficiency curve in Figure [5](https://arxiv.org/html/2606.07684#S6.F5) rather than this PPL profile, which is why $k = 6$ is not listed explicitly in this table.

Table [3](https://arxiv.org/html/2606.07684#S6.T3) confirms that Patch is not cosmetic. Starting from Reuse-only (PPL 6.788), adding just 5 patched layers reduces PPL to 5.017, close to the Oracle baseline. This supports our hypothesis that error propagation is concentrated in a few critical layers; correcting these via Patch stabilizes the entire generation process.

Figure 5: Patch Budget Selection. Cumulative efficiency $\Delta F(\mathcal{S}_{k})/(c \cdot k)$ as a function of the patch budget $k$, where $\Delta F$ is the F1 gain over the all-Reuse baseline and $c$ is the per-layer patch cost (Appendix [A](https://arxiv.org/html/2606.07684#A1)). We select $k = 6$ as the default budget at the argmax of this curve; smaller $k$ leaves quality on the table, while larger $k$ adds cost faster than it adds quality.
### 6.4 Analysis: Error Propagation
Figure 6: Layer-wise Error Propagation. Relative $\ell_{2}$ error of the pre-attention normalized input $x^{\ell}_{\mathrm{norm},B}$. Reuse-only exhibits compounding error accumulation. SCD (Patch) effectively truncates this error at transition layers (marked by dashed lines), preventing downstream drift.

Figure [6](https://arxiv.org/html/2606.07684#S6.F6) visualizes the truncation effect predicted by our theory. With Reuse-only, feature mismatch compounds with depth. In contrast, SCD resets the error at each patched layer, keeping the trajectory close to the Oracle manifold.
### 6.5 End-to-End Latency Breakdown
Table 4: End-to-end TTFT Breakdown (WikiText-2). SCD achieves a 2.19× speedup over Oracle (271.3 ms) with minimal reconstruction overhead. Column abbreviations: Prod. (Producer Prefill), Enc. (Encoding), Trans. (Network Transfer), Reb. (Rebuild/Decompression), Rec. (Recomputation/Patching), Dec1 (First Token Decode).

Table [4](https://arxiv.org/html/2606.07684#S6.T4) dissects the latency. Reuse-only eliminates the recomputation overhead (0.0 ms), achieving the fastest total time among reuse-based methods (112.9 ms), though at lower quality. SCD introduces a modest 11.7 ms overhead in the “Recompute” phase, which corresponds to the execution of the Patch Aligner networks and the regeneration of native KV pairs at transition layers. This small investment yields the quality jump observed in Table [2](https://arxiv.org/html/2606.07684#S6.T2).
### 6.6 Offline Calibration and Deployment Cost
SCD shifts cross-model adaptation into a bounded offline phase. For each producer–consumer pair, we collect paired traces, fit the Reuse translators, profile layer sensitivity, and train Patch aligners for the selected transition layers. For the canonical MistralLite→Mistral-7B run, the single-GPU pipeline takes 9.74 hours total, dominated by Patch aligner training; Reuse fitting is lightweight and neither stage retrains either backbone (per-stage breakdown in Appendix [F](https://arxiv.org/html/2606.07684#A6), Table [6](https://arxiv.org/html/2606.07684#A6.T6)). The learned artifacts are reusable across requests for the same model pair: 8.4 MB for Reuse, 1.58 GB for Patch, totaling 397.7M auxiliary parameters (5.49% of the 7.24B consumer). Thus calibration is a one-time per-pair expense, and serving loads these frozen artifacts and follows the offline-selected route.
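The reported footprint figures can be cross-checked with a few lines of arithmetic. The parameter counts and artifact sizes below are the ones quoted in the text; treating MB and GB as decimal units is an assumption, as is any inference about the storage dtype.

```python
aux_params = 397.7e6               # auxiliary parameters reported in the text
consumer_params = 7.24e9           # consumer model size reported in the text
artifact_bytes = 8.4e6 + 1.58e9    # Reuse + Patch artifacts, assuming decimal MB/GB

overhead_pct = 100 * aux_params / consumer_params   # share of the consumer's size
bytes_per_param = artifact_bytes / aux_params       # implied storage per parameter
```

The ratio reproduces the stated 5.49%, and the implied storage comes out near 4 bytes per parameter, which would be consistent with 32-bit storage, although the section does not state the dtype.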
## 7 Conclusion
We proposed Semantic Cache Distillation (SCD), a state transfer framework for disaggregated LLM serving that replaces opaque KV transfer with compact semantic codes. By combining Reuse (low-rank reconstruction) with Patch (sparse semantic correction), SCD bridges weight-mismatched feature spaces, truncates error propagation, and recovers Oracle-level quality with substantially lower communication overhead.
##### Limitations and Future Work\.
SCD targets shared\-architecture, weight\-mismatched model pairs; arbitrary cross\-architecture transfer, extreme long\-context regimes beyond 32K tokens, and large distribution shifts remain open\. Each pair also requires one\-time calibration; future work will address these boundaries with bandwidth\-adaptive ranks and broader transfer\.
## Impact Statement
This work focuses on the efficiency of LLM inference, particularly in disaggregated serving scenarios\. By mitigating bandwidth overheads for state transfer between heterogeneous models, our approach reduces the energy consumption and operational costs associated with large\-scale AI services\. This contributes to the sustainability and accessibility of AI infrastructure\. We do not foresee specific negative ethical or societal consequences intrinsic to this transport protocol beyond the general risks associated with LLM deployment\.
## Acknowledgements
This work was supported in part by National Natural Science Foundation of China \(NSFC\) under Grant 62272050 and Grant 62302048; in part by the Guangdong Key Lab of AI and Multi\-modal Data Processing, Beijing Normal\-Hong Kong Baptist University \(BNBU\), Zhuhai under 2023\-2024 Grants sponsored by Guangdong Provincial Department of Education; in part by Institute of Artificial Intelligence and Future Networks \(BNU\-Zhuhai\) and Engineering Center of AI and Future Education, Guangdong Provincial Department of Science and Technology, China; Zhuhai Science\-Tech Innovation Bureau under Grant No\. 2320004002772, and in part by the Interdisciplinary Intelligence SuperComputer Center of Beijing Normal University \(Zhuhai\)\.
## References
- Agrawal et al. (2024) Agrawal, A., Kedia, N., Panwar, A., Mohan, J., Kwatra, N., Gulavani, B., Tumanov, A., and Ramjee, R. Taming throughput-latency tradeoff in LLM inference with Sarathi-Serve. In *18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)*, pp. 117–134, 2024.
- Ashkboos et al. (2024) Ashkboos, S., Croci, M. L., Nascimento, M. G. d., Hoefler, T., and Hensman, J. SliceGPT: Compress large language models by deleting rows and columns. In *International Conference on Learning Representations*, 2024.
- Cai et al. (2024a) Cai, T., Li, Y., Geng, Z., Peng, H., Lee, J. D., Chen, D., and Dao, T. Medusa: Simple LLM inference acceleration framework with multiple decoding heads. *arXiv preprint arXiv:2401.10774*, 2024a.
- Cai et al. (2024b) Cai, Z., Zhang, Y., Gao, B., Liu, Y., Li, Y., Liu, T., Lu, K., Xiong, W., Dong, Y., Hu, J., et al. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling. *arXiv preprint arXiv:2406.02069*, 2024b.
- Castin et al. (2023) Castin, V., Ablin, P., and Peyré, G. How Smooth Is Attention? *arXiv preprint arXiv:2312.14820*, 2023.
- Chen et al. (2024) Chen, L., Ye, Z., Wu, Y., Zhuo, D., Ceze, L., and Krishnamurthy, A. Punica: Multi-tenant LoRA serving. *Proceedings of Machine Learning and Systems*, 6:1–13, 2024.
- Fu et al. (2026) Fu, T., Min, Z., Zhang, H., Yan, J., Dai, G., Ouyang, W., and Wang, Y. Cache-to-Cache: Direct semantic communication between large language models. In *International Conference on Learning Representations*, 2026.
- Gim et al. (2024) Gim, I., Chen, G., Lee, S.-s., Sarda, N., Khandelwal, A., and Zhong, L. Prompt Cache: Modular attention reuse for low-latency inference. *Proceedings of Machine Learning and Systems*, 6:325–338, 2024.
- Gu et al. (2025) Gu, Y., Zhou, W., Iacovides, G., and Mandic, D. TensorLLM: Tensorising multi-head attention for enhanced reasoning and compression in LLMs. In *2025 International Joint Conference on Neural Networks*, pp. 1–8, 2025.
- Hooper et al. (2024) Hooper, C., Kim, S., Mohammadzadeh, H., Mahoney, M. W., Shao, Y. S., Keutzer, K., and Gholami, A. KVQuant: Towards 10 million context length LLM inference with KV cache quantization. *Advances in Neural Information Processing Systems*, 37:1270–1303, 2024.
- Kim et al. (2021) Kim, H., Papamakarios, G., and Mnih, A. The Lipschitz constant of self-attention. In *International Conference on Machine Learning*, pp. 5562–5571. PMLR, 2021.
- Kwon et al. (2023) Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with PagedAttention. In *Proceedings of the 29th Symposium on Operating Systems Principles*, 2023.
- Li et al. (2025) Li, W., Jiang, G., Ding, X., Tao, Z., Hao, C., Xu, C., Zhang, Y., and Wang, H. FlowKV: A disaggregated inference framework with low-latency KV cache transfer and load-aware scheduling. *arXiv preprint arXiv:2504.03775*, 2025.
- Li et al. (2024a) Li, Y., Huang, Y., Yang, B., Venkitesh, B., Locatelli, A., Ye, H., Cai, T., Lewis, P., and Chen, D. SnapKV: LLM knows what you are looking for before generation. *Advances in Neural Information Processing Systems*, 37:22947–22970, 2024a.
- Li et al. (2024b) Li, Y., Wei, F., Zhang, C., and Zhang, H. EAGLE: Speculative sampling requires rethinking feature uncertainty. In *International Conference on Machine Learning*, 2024b.
- Liu et al. (2024a) Liu, Y., Huang, Y., Yao, J., Feng, S., Gu, Z., Du, K., Li, H., Cheng, Y., Jiang, J., Lu, S., et al. DroidSpeak: KV cache sharing for cross-LLM communication and multi-LLM serving. *arXiv preprint arXiv:2411.02820*, 2024a.
- Liu et al. (2024b) Liu, Y., Li, H., Cheng, Y., Ray, S., Huang, Y., Zhang, Q., Du, K., Yao, J., Lu, S., Ananthanarayanan, G., et al. CacheGen: KV cache compression and streaming for fast large language model serving. In *Proceedings of the ACM SIGCOMM 2024 Conference*, pp. 38–56, 2024b.
- Liu et al. (2024c) Liu, Z., Yuan, J., Jin, H., Zhong, S., Xu, Z., Braverman, V., Chen, B., and Hu, X. KIVI: A tuning-free asymmetric 2bit quantization for KV cache. In *International Conference on Machine Learning*, 2024c.
- Patel et al. (2024) Patel, P., Choukse, E., Zhang, C., Shah, A., Goiri, Í., Maleki, S., and Bianchini, R. Splitwise: Efficient generative LLM inference using phase splitting. In *2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA)*, pp. 118–132. IEEE, 2024.
- Qin et al. (2024) Qin, R., Li, Z., He, W., Zhang, M., Wu, Y., Zheng, W., and Xu, X. Mooncake: A KVCache-centric disaggregated architecture for LLM serving. *arXiv preprint arXiv:2407.00079*, 2024.
- Sheng et al. (2024) Sheng, Y., Cao, S., Li, D., Hooper, C., Lee, N., Yang, S., Chou, C., Zhu, B., Zheng, L., Keutzer, K., et al. S-LoRA: Scalable serving of thousands of LoRA adapters. *Proceedings of Machine Learning and Systems*, 6:296–311, 2024.
- Tang et al. (2024) Tang, J., Zhao, Y., Zhu, K., Xiao, G., Kasikci, B., and Han, S. Quest: Query-aware sparsity for efficient long-context LLM inference. In *International Conference on Machine Learning*, 2024.
- Wang et al. (2025a) Wang, X., Zheng, Y., Wan, Z., and Zhang, M. SVD-LLM: Truncation-aware singular value decomposition for large language model compression. In *International Conference on Learning Representations*, 2025a.
- Wang et al. (2025b) Wang, Z., Ning, R., Fang, C., Zhang, Z., Lin, X., Ma, S., Zhou, M., Li, X., Wang, Z., Huan, C., Gu, R., Yang, K., Chen, G., Zhong, S., and Tian, C. CoDec: Prefix-shared decoding kernel for LLMs. *arXiv preprint arXiv:2505.17694*, 2025b.
- Wu et al. (2023) Wu, B., Zhong, Y., Zhang, Z., Liu, S., Liu, F., Sun, Y., Huang, G., Liu, X., and Jin, X. Fast distributed inference serving for large language models. *arXiv preprint arXiv:2305.05920*, 2023.
- Xiao et al. (2023) Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient streaming language models with attention sinks. *arXiv preprint arXiv:2309.17453*, 2023.
- Yudin et al. (2025) Yudin, N., Gaponov, A., Kudriashov, S., and Rakhuba, M. Pay attention to attention distribution: A new local Lipschitz bound for transformers. *arXiv preprint arXiv:2507.07814*, 2025.
- Zhang et al. (2023) Zhang, Z., Sheng, Y., Zhou, T., Chen, T., Zheng, L., Cai, R., Song, Z., Tian, Y., Ré, C., Barrett, C., et al. H2O: Heavy-hitter oracle for efficient generative inference of large language models. *Advances in Neural Information Processing Systems*, 36:34661–34710, 2023.
- Zheng et al. (2024) Zheng, L., Yin, L., Xie, Z., Sun, C. L., Huang, J., Yu, C. H., Cao, S., Kozyrakis, C., Stoica, I., Gonzalez, J. E., et al. SGLang: Efficient execution of structured language model programs. *Advances in Neural Information Processing Systems*, 37:62557–62583, 2024.
- Zhong et al. (2024) Zhong, Y., Liu, S., Chen, J., Hu, J., Zhu, Y., Liu, X., Jin, X., and Zhang, H. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In *18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)*, pp. 193–210, 2024.
## Appendix A Layer Policy: Selecting Reuse vs. Patch
SCD applies Reuse to the majority of layers to minimize transfer overhead, while enabling Patch on a sparse subset to recover generation quality. Let $\mathcal{L} = \{0, \dots, L-1\}$ be the set of Transformer layers. Let $\mathcal{S} \subseteq \mathcal{L}$ denote the set of layers where the consumer applies Patch (during profiling, this corresponds to restoring oracle states), with the remaining layers $\mathcal{L} \setminus \mathcal{S}$ using Reuse. Let $F(\mathcal{S})$ be a task metric (e.g., F1 score or negative log-likelihood) measured under this configuration. The marginal gain over the all-Reuse baseline is:

$$\Delta F(\mathcal{S}) \triangleq F(\mathcal{S}) - F(\emptyset). \tag{19}$$

Assuming a constant per-layer cost $c$ (e.g., latency or parameters) for patching, we select $\mathcal{S}$ to maximize efficiency:

$$\mathrm{Eff}(\mathcal{S}) \triangleq \frac{\Delta F(\mathcal{S})}{c \cdot |\mathcal{S}|}. \tag{20}$$
##### Step 1: Restore\-one Sensitivity \(Candidate Discovery\)\.
Starting from the all-Reuse baseline ($\mathcal{S} = \emptyset$), we measure the individual marginal benefit of restoring each layer $\ell$:

$$\Delta F_{\ell} \triangleq F(\{\ell\}) - F(\emptyset), \qquad \ell \in \mathcal{L}. \tag{21}$$

We construct a candidate pool $\mathcal{C}$ by selecting the top-$M$ layers ranked by $\Delta F_{\ell}$.
##### Step 2: Interaction\-aware Greedy Selection\.
Single-layer sensitivity ignores inter-layer interactions. To address this, we construct $\mathcal{S}$ via forward greedy selection on the candidate pool $\mathcal{C}$:

$$\ell_{t} = \mathop{\arg\max}_{\ell \in \mathcal{C} \setminus \mathcal{S}_{t-1}} \Big( F(\mathcal{S}_{t-1} \cup \{\ell\}) - F(\mathcal{S}_{t-1}) \Big), \quad \mathcal{S}_{t} = \mathcal{S}_{t-1} \cup \{\ell_{t}\}. \tag{22}$$

Unlike approaches that restrict recomputation to contiguous segments, SCD allows $\mathcal{S}_{t}$ to be sparse. This flexible topology enables Patch to act as a semantic bridge at critical transitions without requiring dense support.
##### Step 3: Budgeted Cost\-aware Selection\.
Instead of fixing the budget a priori, we determine the optimal size $k^{\star} \in [k_{\min}, k_{\max}]$ that maximizes the efficiency metric:

$$k^{\star} = \mathop{\arg\max}_{k \in [k_{\min}, k_{\max}]} \; \frac{\Delta F(\mathcal{S}_{k})}{c \cdot k}. \tag{23}$$

The set $\mathcal{S}_{k^{\star}}$ defines the Patch policy; all other layers employ Reuse.
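The three steps above can be sketched as a single routine. This is an illustrative outline under stated assumptions: `F` is a hypothetical callable that evaluates the task metric (higher is better) for a given set of patched layers, the function name and arguments are invented for the example, and the budget bounds are assumed to satisfy `1 <= k_min <= min(k_max, pool_size)`.

```python
def select_patch_layers(F, layers, pool_size, k_min, k_max, cost=1.0):
    """Return the patch set S_{k*} chosen by Steps 1-3.

    F: callable taking a frozenset of layer indices and returning a metric.
    """
    base = F(frozenset())
    # Step 1: restore-one sensitivity, keep the top `pool_size` candidates.
    gains = {l: F(frozenset([l])) - base for l in layers}
    pool = sorted(layers, key=lambda l: gains[l], reverse=True)[:pool_size]
    # Step 2: forward greedy selection, recording each prefix S_1, S_2, ...
    chosen, prefixes = [], []
    for _ in range(min(k_max, len(pool))):
        current = F(frozenset(chosen))
        best = max((l for l in pool if l not in chosen),
                   key=lambda l: F(frozenset(chosen + [l])) - current)
        chosen.append(best)
        prefixes.append(list(chosen))
    # Step 3: pick the budget that maximizes gain per unit cost, Eq. (23).
    candidates = [s for s in prefixes if len(s) >= k_min]
    return max(candidates,
               key=lambda s: (F(frozenset(s)) - base) / (cost * len(s)))

# Toy metric (assumption): additive per-layer gains on top of a base score.
weights = {0: 0.05, 3: 0.30, 7: 0.20, 9: 0.01}
toy_F = lambda S: 0.5 + sum(weights[l] for l in S)
picked = select_patch_layers(toy_F, list(weights), pool_size=3, k_min=2, k_max=3)
```

Each call to `F` corresponds to a full evaluation pass in practice, so the candidate pool in Step 1 mainly serves to keep the number of greedy evaluations manageable.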
Figure 7: Visualization of the Layer Selection Policy. (a) Candidate Discovery: The restore-one sensitivity profile reveals that only a few critical layers (blue) yield significant quality gains. (b) Interaction Effects: Greedy selection shows diminishing returns, indicating that restoring dense contiguous segments incurs high redundancy. (c) Budget Optimization: The efficiency curve $\Delta F(\mathcal{S}_{k})/k$ peaks at a sparse budget $k^{\star}$ (green marker), representing the optimal trade-off between generation quality and computational cost.
##### Analysis of Layer Selection Dynamics\.
We validate our policy by analyzing the sensitivity and interactions of Patch layers, as shown in Figure [7](https://arxiv.org/html/2606.07684#A1.F7). (a) Sparsity: The sensitivity profile shows a heavy-tailed distribution. A small subset of critical layers drives most of the generation quality, while the majority have negligible impact. (b) Redundancy: Greedy selection shows diminishing returns. The cumulative gain plateaus as the budget increases, indicating significant redundancy among lower-ranked layers and suggesting that dense restoration is wasteful. (c) Efficiency: Maximizing the gain-per-cost ratio identifies an optimal operating point at a small $k$ (green marker). This confirms our design choice to apply Patch as a sparse, targeted intervention rather than a contiguous block.
## Appendix B Proofs for Section [5](https://arxiv.org/html/2606.07684#S5)
This appendix provides proofs for Lemma [5.1](https://arxiv.org/html/2606.07684#S5.Thmtheorem1) and Eq. ([18](https://arxiv.org/html/2606.07684#S5.E18)), using the notation from the main text.
### B.1 Useful Inequalities for Log-Softmax
Let $\operatorname{lse}(\mathbf{o}) \triangleq \log \sum_{i} \exp(o_{i})$ denote the LogSumExp function. The log-softmax vector is $\log \operatorname{softmax}(\mathbf{o})_{y} = o_{y} - \operatorname{lse}(\mathbf{o})$.
###### Lemma B.1 (Stability of Log-Softmax).
For any logits $\mathbf{o}, \hat{\mathbf{o}} \in \mathbb{R}^{V}$ and token index $y$,

$$\big| -\log \operatorname{softmax}(\hat{\mathbf{o}})_{y} + \log \operatorname{softmax}(\mathbf{o})_{y} \big| \;\leq\; 2\,\|\hat{\mathbf{o}} - \mathbf{o}\|_{\infty}. \tag{24}$$

###### Proof.
Decompose the difference as:

$$\Delta = (o_{y} - \hat{o}_{y}) + (\operatorname{lse}(\hat{\mathbf{o}}) - \operatorname{lse}(\mathbf{o})).$$

The first term is bounded by $\|\mathbf{o} - \hat{\mathbf{o}}\|_{\infty}$. For the second term, observe that $\operatorname{lse}(\cdot)$ is 1-Lipschitz with respect to the $\|\cdot\|_{\infty}$ norm, since its gradient is $\operatorname{softmax}(\cdot)$, which has an $\ell_{1}$ norm of exactly 1. Thus, $|\operatorname{lse}(\hat{\mathbf{o}}) - \operatorname{lse}(\mathbf{o})| \leq \|\hat{\mathbf{o}} - \mathbf{o}\|_{\infty}$. Summing the bounds yields Eq. ([24](https://arxiv.org/html/2606.07684#A2.E24)). ∎
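Lemma B.1 is easy to sanity-check numerically. The sketch below draws random logit pairs and compares the worst-case log-softmax gap over all token indices against the bound of Eq. (24); the helper names, vocabulary size, and perturbation scale are arbitrary choices for the illustration.

```python
import numpy as np

def log_softmax(o):
    """Numerically stable log-softmax of a 1-D logit vector."""
    shifted = o - o.max()
    return shifted - np.log(np.exp(shifted).sum())

def check_bound(o, o_hat):
    """Return (largest log-softmax gap over tokens, right-hand side of Eq. (24))."""
    gap = np.abs(log_softmax(o_hat) - log_softmax(o)).max()
    bound = 2 * np.abs(o_hat - o).max()
    return gap, bound

rng = np.random.default_rng(0)
results = []
for _ in range(100):
    o = rng.normal(size=1000)
    o_hat = o + rng.normal(scale=0.5, size=1000)
    results.append(check_bound(o, o_hat))
```

A numerical check of this kind does not replace the proof, but it is a quick way to catch sign or norm mistakes when adapting the bound.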
### B.2 Proof of Lemma [5.1](https://arxiv.org/html/2606.07684#S5.Thmtheorem1)
###### Proof.
Let the sorted patch layers be $\mathcal{L}_{\mathrm{patch}} = \{s_{1} < \dots < s_{m}\}$, with sentinels $s_{0}=0$ and $s_{m+1}=L$. Fix a target layer $\ell \in (s_{j}, s_{j+1}]$.
##### Initialization (Segment Start).
If $j=0$, we have $\hat{\mathbf{h}}^{s_{0}} = \mathbf{h}^{s_{0}}$, so $\|\hat{\mathbf{h}}^{s_{0}}-\mathbf{h}^{s_{0}}\| = 0$. If $j \geq 1$, by the Patch injection bound (Eq. ([16](https://arxiv.org/html/2606.07684#S5.E16))), the error is reset:

$$\|\hat{\mathbf{h}}^{s_{j}}-\mathbf{h}^{s_{j}}\| \leq \varepsilon_{\mathrm{patch}}^{s_{j}}.$$

##### Unrolling (Within Segment).
For any layer $u \in \{s_{j},\dots,\ell-1\}$ utilizing Reuse, we apply the stability condition (Eq. ([15](https://arxiv.org/html/2606.07684#S5.E15))):

$$\|\hat{\mathbf{h}}^{u+1}-\mathbf{h}^{u+1}\| \leq \alpha_{u}\|\hat{\mathbf{h}}^{u}-\mathbf{h}^{u}\| + \beta_{u}\,\varepsilon_{\mathrm{reuse}}^{u}.$$

Unrolling this recursion from $u=s_{j}$ to $u=\ell-1$ yields:

$$\|\hat{\mathbf{h}}^{\ell}-\mathbf{h}^{\ell}\| \leq \left(\prod_{i=s_{j}}^{\ell-1}\alpha_{i}\right)\|\hat{\mathbf{h}}^{s_{j}}-\mathbf{h}^{s_{j}}\| + \sum_{k=s_{j}}^{\ell-1}\left(\prod_{m=k+1}^{\ell-1}\alpha_{m}\right)\beta_{k}\,\varepsilon_{\mathrm{reuse}}^{k}.$$

Substituting the initialization bound $\varepsilon_{\mathrm{patch}}^{s_{j}}$ completes the proof. ∎
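The unrolled bound can be evaluated directly once per-layer constants are available. The sketch below computes it in closed form and by iterating the one-step recursion, so the two can be cross-checked. The function names and the convention that `alpha`, `beta`, and `eps_reuse` are plain sequences indexed by layer are assumptions made for illustration.

```python
def segment_error_bound(alpha, beta, eps_reuse, start_err, s_j, ell):
    """Closed-form bound on the hidden-state error at layer `ell`.

    alpha, beta, eps_reuse: per-layer sequences indexed by layer u.
    start_err: error at the segment start s_j (0 when j = 0, otherwise
               the patch error bound at s_j).
    """
    # Initial error amplified by every layer in the segment.
    amplification = 1.0
    for i in range(s_j, ell):
        amplification *= alpha[i]
    bound = amplification * start_err
    # Each reuse error is amplified only by the layers that follow it.
    for k in range(s_j, ell):
        tail = 1.0
        for m in range(k + 1, ell):
            tail *= alpha[m]
        bound += tail * beta[k] * eps_reuse[k]
    return bound

def recursive_error_bound(alpha, beta, eps_reuse, start_err, s_j, ell):
    """Same bound obtained by iterating the one-step stability condition."""
    err = start_err
    for u in range(s_j, ell):
        err = alpha[u] * err + beta[u] * eps_reuse[u]
    return err
```

Because every $\alpha_{u} \geq 1$ under the residual form used in Appendix C, the bound grows with segment length, which is why resetting the error at a patch layer shortens the product.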
### B.3 Proof of Eq. ([18](https://arxiv.org/html/2606.07684#S5.E18)) (NLL Gap)
###### Proof.
Assume the output head (LayerNorm + Linear) is $L_{\mathrm{out}}$-Lipschitz under the infinity norm:

$$\|\hat{\mathbf{o}}-\mathbf{o}\|_{\infty} \leq L_{\mathrm{out}}\|\hat{\mathbf{h}}^{L}-\mathbf{h}^{L}\|.$$

By Lemma [B.1](https://arxiv.org/html/2606.07684#A2.Thmtheorem1), with $\hat{\mathbf{o}}=\hat{\mathbf{o}}_{B}$ and $\mathbf{o}=\mathbf{o}_{B}$:

$$|-\log\hat{p}(y_{t})+\log p(y_{t})| \leq 2\|\hat{\mathbf{o}}_{B}-\mathbf{o}_{B}\|_{\infty} \leq 2L_{\mathrm{out}}\|\hat{\mathbf{h}}^{L}-\mathbf{h}^{L}\|.$$

This corresponds to Eq. ([18](https://arxiv.org/html/2606.07684#S5.E18)) (setting $C_{\mathrm{sm}}=2$). Substituting the bound for $\|\hat{\mathbf{h}}^{L}-\mathbf{h}^{L}\|$ from Lemma [5.1](https://arxiv.org/html/2606.07684#S5.Thmtheorem1) yields the final result. ∎
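As a worked instance, the two results can be chained explicitly. The display below is a sketch under the assumption that the final layer $L$ lies in the last segment, so that $s_{j}$ denotes the last patch layer (with the $\varepsilon_{\mathrm{patch}}$ term replaced by $0$ when no patch layer is used). The exact grouping of constants in Eq. (18) of the main text may differ.

```latex
\big|-\log\hat{p}(y_{t})+\log p(y_{t})\big|
\;\leq\; 2L_{\mathrm{out}}\left[
\left(\prod_{i=s_{j}}^{L-1}\alpha_{i}\right)\varepsilon_{\mathrm{patch}}^{s_{j}}
+\sum_{k=s_{j}}^{L-1}\left(\prod_{m=k+1}^{L-1}\alpha_{m}\right)
\beta_{k}\,\varepsilon_{\mathrm{reuse}}^{k}
\right]
```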
## Appendix C Estimating Stability Constants $\alpha_{\ell}$ and $\beta_{\ell}$
We provide a concrete instantiation of the stability coefficients $\alpha_{\ell}$ (state sensitivity) and $\beta_{\ell}$ (cache sensitivity). While standard attention is not globally Lipschitz, we utilize *local* Lipschitz constants estimated on a bounded domain defined by the calibration set, consistent with prior literature (Castin et al., [2023](https://arxiv.org/html/2606.07684#bib.bib5); Yudin et al., [2025](https://arxiv.org/html/2606.07684#bib.bib27)).
### C.1 Layer Map and Perturbation
Consider a pre-LN decoder layer (omitting the layer index $\ell$ for brevity). The forward pass is defined as:

$$\begin{aligned}
\mathbf{u} &= \operatorname{LN}(\mathbf{h}), && \text{(25)}\\
\mathbf{Q} &= \mathbf{u}\mathbf{W}_{Q}, \quad \mathbf{K},\mathbf{V}\in\mathbb{R}^{T\times d_{h}}, && \text{(26)}\\
\mathbf{A} &= \operatorname{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_{h}}}+\mathbf{M}\right), && \text{(27)}\\
\operatorname{Attn}(\mathbf{h};\mathbf{K},\mathbf{V}) &= (\mathbf{A}\mathbf{V})\mathbf{W}_{O}. && \text{(28)}
\end{aligned}$$

Our goal is to bound the output perturbation $\|\delta\mathbf{h}^{\ell+1}\|$ in terms of the input error $\|\delta\mathbf{h}^{\ell}\|$ and cache errors $(\|\delta\mathbf{K}\|, \|\delta\mathbf{V}\|)$.
### C.2 Component Bounds
##### Softmax Jacobian.
For the row-wise softmax function, the spectral norm of the Jacobian is bounded by $1/2$. Locally, we have:

$$\|\delta\mathbf{A}\|_{2} \leq \frac{1}{2}\|\delta\mathbf{S}\|_{2}, \tag{29}$$

where $\mathbf{S}$ denotes the pre-softmax scores.
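The $1/2$ constant for a single softmax row can be checked numerically. The sketch below builds the Jacobian $\operatorname{diag}(\mathbf{p})-\mathbf{p}\mathbf{p}^{\top}$ for one score vector and measures its spectral norm. It assumes NumPy is available; the function names and the sampling distribution are illustrative choices, and the check covers a single row rather than the full matrix statement in Eq. (29).

```python
import numpy as np

def softmax(s):
    """Numerically stable softmax of a 1-D score vector."""
    z = np.exp(s - np.max(s))
    return z / z.sum()

def softmax_jacobian(s):
    """Jacobian of softmax at s: diag(p) - p p^T with p = softmax(s)."""
    p = softmax(s)
    return np.diag(p) - np.outer(p, p)

def max_jacobian_spectral_norm(trials=200, dim=16, scale=4.0, seed=0):
    """Largest spectral norm of the Jacobian over random score vectors."""
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(trials):
        s = rng.normal(0.0, scale, size=dim)
        worst = max(worst, float(np.linalg.norm(softmax_jacobian(s), 2)))
    return worst
```

The bound is attained when the probability mass is split evenly between two entries, for example at `s = [0, 0]` in two dimensions.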
##### LayerNorm.
We denote the local Lipschitz constant of LayerNorm as $L_{\mathrm{LN}}^{\ell}$. We estimate this empirically based on the minimum per-token variance observed in the calibration traces.
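One way such an estimate could be computed is sketched below. It assumes the standard bound $\max_i|\gamma_i| / \sqrt{\sigma^{2}_{\min}+\epsilon}$ on the LayerNorm Jacobian norm, where $\gamma$ is the LayerNorm scale and $\sigma^{2}_{\min}$ is the smallest per-token variance in the traces. The paper does not state its exact formula, so this formula and the function name are assumptions made for illustration.

```python
import numpy as np

def layernorm_local_lipschitz(hidden, gamma, eps=1e-5):
    """Estimate a local Lipschitz constant of LayerNorm on calibration data.

    hidden: (T, d) array of pre-LayerNorm activations from calibration traces.
    gamma:  (d,) LayerNorm scale vector.
    eps:    LayerNorm epsilon added to the variance.
    Assumed bound: max|gamma| / sqrt(min per-token variance + eps).
    """
    per_token_var = hidden.var(axis=-1)          # population variance per token
    sigma_min = np.sqrt(per_token_var.min() + eps)
    return float(np.abs(gamma).max() / sigma_min)
```

Tokens with very small variance dominate this estimate, so a calibration set that covers low-variance positions matters more than its overall size.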
### C.3 Bounding Attention and MLP
By expanding the differential of the attention mechanism and applying triangle inequalities (following the derivation in Kim et al. ([2021](https://arxiv.org/html/2606.07684#bib.bib11))), we derive the aggregate stability coefficients:

$$\begin{aligned}
\alpha_{\ell} &:= 1 + L_{\mathrm{attn},h}^{\ell} + L_{\mathrm{mlp}}^{\ell}, && \text{(30)}\\
\beta_{\ell} &:= L_{\mathrm{attn},K}^{\ell} + L_{\mathrm{attn},V}^{\ell}. && \text{(31)}
\end{aligned}$$

The term $\alpha_{\ell}$ includes $1$ due to the residual connection. The cache sensitivity terms are defined as:
- $L_{\mathrm{attn},K}^{\ell}$: Captures sensitivity to Key perturbations, mediated by the query magnitude $\|\mathbf{Q}\|_{2}$.
- $L_{\mathrm{attn},V}^{\ell}$: Captures sensitivity to Value perturbations, mediated by the attention probability norm $\|\mathbf{A}\|_{2}$.
### C.4 Practical Estimation
We estimate these constants using statistics collected from the calibration set:
- Weights: We compute spectral norms $\|\mathbf{W}\|_{2}$ via power iteration.
- Activations: We use the empirical maxima of $\|\mathbf{K}\|_{2}, \|\mathbf{V}\|_{2}, \|\mathbf{Q}\|_{2}$ over the traces to bound the local domain.
- Softmax: We use the theoretical bound $L_{\mathrm{sm}}=1/2$.

This procedure yields tight layer-wise coefficients $(\alpha_{\ell},\beta_{\ell})$ that accurately predict the error amplification observed in our experiments.
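The weight-norm step and the aggregation in Eqs. (30) and (31) can be sketched as follows. This assumes NumPy; `spectral_norm_power_iteration` and `stability_coefficients` are hypothetical helper names, and the component constants passed to the second function are taken as already estimated.

```python
import numpy as np

def spectral_norm_power_iteration(W, iters=200, seed=0):
    """Estimate ||W||_2 by power iteration on W^T W."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        u = W @ v
        u_norm = np.linalg.norm(u)
        if u_norm == 0.0:
            return 0.0            # v lies in the null space (e.g. W = 0)
        u /= u_norm
        v = W.T @ u
        v_norm = np.linalg.norm(v)
        if v_norm == 0.0:
            return 0.0
        v /= v_norm
    return float(np.linalg.norm(W @ v))

def stability_coefficients(L_attn_h, L_mlp, L_attn_K, L_attn_V):
    """Combine component constants into (alpha, beta) per Eqs. (30)-(31)."""
    alpha = 1.0 + L_attn_h + L_mlp    # the 1 comes from the residual path
    beta = L_attn_K + L_attn_V
    return alpha, beta
```

Power iteration avoids a full SVD of each weight matrix, which keeps the estimate inexpensive when it is repeated for every layer.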
## Appendix D Additional Visualizations
##### Visualization of Representation Alignment.
Figure [8](https://arxiv.org/html/2606.07684#A4.F8) visualizes KV representations using t-SNE to assess alignment quality. We compare source (producer, blue), target (consumer, gray), and translated (orange) states. The raw source distribution is clearly separated from the target, indicating a significant feature gap. In contrast, the translated states align closely with the target clusters. This demonstrates that our method effectively maps producer states into the consumer’s semantic space.

Figure 8: t-SNE visualization of cross-model KV alignment. Blue, gray, and orange points represent source, target, and translated KV states, respectively. The translated states move closer to the target distribution, illustrating the reduction of the representational gap.
## Appendix E Neighboring Compression Baselines
We additionally adapt representative neighboring compression methods, including SVD-LLM (Wang et al., [2025a](https://arxiv.org/html/2606.07684#bib.bib23)), SliceGPT (Ashkboos et al., [2024](https://arxiv.org/html/2606.07684#bib.bib2)), and TensorLLM (Gu et al., [2025](https://arxiv.org/html/2606.07684#bib.bib9)), to the heterogeneous state-transfer setting. These methods primarily target model or tensor compression rather than producer-to-consumer KV transfer under weight mismatch, so we report them in the appendix instead of the main comparison. Table [5](https://arxiv.org/html/2606.07684#A5.T5) shows that generic compression alone does not resolve cross-model semantic misalignment.

Table 5: Adapted Neighboring Compression Baselines. Quality retention and TTFT speedup are measured under the same heterogeneous transfer setting. Values above $1.0\times$ indicate speedup over the oracle consumer prefill; values below $1.0\times$ indicate slowdown.

The results support the distinction between compression and semantic transfer. SVD-LLM-style and SliceGPT-style adaptations retain part of the consumer behavior but do not provide a latency benefit in this split-inference setting. TensorLLM-style adaptation is more aggressive but collapses quality under weight mismatch. SCD instead learns producer-to-consumer translators and applies Patch at sparse transition layers, preserving most oracle quality while improving TTFT.
## Appendix F Extended Implementation Details
We divide our implementation into two phases: an Offline Calibration Phase, where we learn layer-wise parameters to bridge model discrepancies; and an Online Execution Phase, where the target model ($\mathcal{M}_{B}$) uses these parameters to populate its KV cache, regenerating native KV pairs at selected patch layers. A core principle is to follow the native KV-cache semantics of the inference framework (e.g., Hugging Face Transformers), ensuring that reconstructed states are treated as standard cached history during generation.
### F.1 Offline Calibration and Parameter Training
##### Reuse: Layer-wise Stack-SVD with Ridge Regression.
We implement Reuse as a stack of independent linear translators. For each layer $\ell$, we execute full prefill passes on both the source ($\mathcal{M}_{A}$) and target ($\mathcal{M}_{B}$) models over a calibration dataset. We collect the internal representations at layer $\ell$ that serve as inputs to the KV cache. The optimization proceeds in two steps:
1. Subspace Identification via Truncated SVD: We first apply Truncated Singular Value Decomposition (SVD) to identify a rank-$r$ latent subspace. By the Eckart–Young–Mirsky theorem, this yields the optimal low-rank approximation of the feature map in the Frobenius norm, minimizing information loss.
2. Mapping Fit via Ridge Regression: We fit the linear encoder/decoder mappings using Ridge regression ($\ell_{2}$ regularization). We choose Ridge over unregularized least squares to improve numerical stability and prevent noise amplification from small singular values.
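The two steps above can be sketched for a single layer as follows. This is one plausible reading, not the authors' code: it assumes the encoder is the top-$r$ right singular subspace of the source features and the decoder is a ridge regression from the resulting codes to the target features. The names `fit_reuse_translator` and `translate`, the default `ridge` strength, and the absence of feature centering are all assumptions.

```python
import numpy as np

def fit_reuse_translator(X_src, X_tgt, rank, ridge=1e-3):
    """Fit a rank-`rank` linear translator for one layer.

    X_src: (N, d_src) calibration features from the source model.
    X_tgt: (N, d_tgt) matching features from the target model.
    Returns (encoder, decoder) with shapes (d_src, rank) and (rank, d_tgt).
    """
    # Step 1: truncated SVD gives the dominant rank-r source subspace.
    _, _, Vt = np.linalg.svd(X_src, full_matrices=False)
    encoder = Vt[:rank].T
    codes = X_src @ encoder                       # (N, rank) compact codes
    # Step 2: ridge regression from codes to target features,
    # solving (Z^T Z + ridge * I) D = Z^T X_tgt for the decoder D.
    gram = codes.T @ codes + ridge * np.eye(rank)
    decoder = np.linalg.solve(gram, codes.T @ X_tgt)
    return encoder, decoder

def translate(X_src, encoder, decoder):
    """Encode source features to codes, then decode into the target space."""
    return (X_src @ encoder) @ decoder
```

In this reading only the `rank`-dimensional codes would cross the network, and the decoder runs on the consumer side. The `ridge` term keeps the solve well conditioned when some retained singular values are small.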
##### Patch: Sparse Breakpoint Alignment.
We train the Patch mechanism only on a sparse subset of transition layers $\ell \in \mathcal{L}_{\text{patch}}$. For each layer, we collect the "pre-attention breakpoint state" (the state immediately preceding the self-attention projection). We train an aligner $g^{\ell}$ to map the low-dimensional code from $\mathcal{M}_{A}$ to an estimate of $\mathcal{M}_{B}$’s state. The training objective is a standard regression loss (e.g., MSE). To improve consistency, we can optionally include auxiliary terms matching intermediate representations, similar to feature matching in Knowledge Distillation. The patch set $\mathcal{L}_{\text{patch}}$ is selected offline under the budget $k$ and is loaded as a fixed route at serving time.

The final offline artifacts are:
- Reuse Artifacts: Linear encoder/decoder parameters for all layers.
- Patch Artifacts: Breakpoint aligners $g^{\ell}$ for the subset $\mathcal{L}_{\text{patch}}$.

These are serialized with minimal metadata for loading at runtime.

Table 6: Offline Calibration Cost. One-time cost for MistralLite $\to$ Mistral-7B using 200 CMRC2018 prefixes on a single GPU.
### F.2 Online Execution and Hybrid Cache Construction
##### Online Reuse: Transmission and Reconstruction.
During inference, $\mathcal{M}_{A}$ performs a standard prefill and transmits low-dimensional codes for each reused layer $\ell$. $\mathcal{M}_{B}$ uses the corresponding decoder to reconstruct the tensor in its own feature space and writes it directly into its cache container (e.g., `past_key_values`). Once the cache is populated, $\mathcal{M}_{B}$ skips its local prefill and enters the standard decoding loop. The framework handles cache indices (e.g., `cache_position`) natively, treating the reconstructed cache as valid local history.
##### Online Patch: Alignment and Recomputation.
When the Patch mechanism is applied at a selected transition layer $\ell$, the execution flow is:
1. State Alignment: $\mathcal{M}_{B}$ uses the aligner $g^{\ell}$ to map the incoming code to an estimated normalized pre-attention state in the consumer space.
2. Native KV Regeneration: $\mathcal{M}_{B}$ multiplies this aligned state by its own $W_{k,B}^{\ell}$ and $W_{v,B}^{\ell}$, and applies RoPE to the key, producing consumer-native KV pairs for layer $\ell$.
3. Cache Assembly: The final cache is a hybrid: layers in $\mathcal{L}_{\mathrm{reuse}}$ use reconstructed KV pairs, and layers in $\mathcal{L}_{\mathrm{patch}}$ use patch-regenerated native KV pairs.
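The three steps above can be sketched in a framework-free form. Everything in this example is a simplified assumption: a single attention head, NumPy arrays in place of a real cache container, a half-split RoPE layout, per-layer dictionaries for codes, decoders, aligners, and weights, and reuse decoders that directly emit keys in the form stored in the cache. The function names are hypothetical.

```python
import numpy as np

def apply_rope(x, positions, base=10000.0):
    """Rotary position embedding for x of shape (T, d), d even (half-split layout)."""
    half = x.shape[-1] // 2
    inv_freq = base ** (-np.arange(half) / half)
    angles = positions[:, None] * inv_freq[None, :]      # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def build_hybrid_cache(codes, num_layers, patch_layers, decoders, aligners, W_k, W_v):
    """Assemble a per-layer list of (K, V) pairs on the consumer side.

    codes:        dict layer -> (T, r) received code.
    patch_layers: set of layer indices handled by Patch.
    decoders:     dict layer -> {"k": (r, d), "v": (r, d)} Reuse decoders.
    aligners:     dict layer -> callable mapping a code to a (T, d_model) state.
    W_k, W_v:     dict layer -> consumer projection matrices.
    """
    seq_len = codes[0].shape[0]
    positions = np.arange(seq_len)
    cache = []
    for layer in range(num_layers):
        z = codes[layer]
        if layer in patch_layers:
            # Steps 1-2: align the code, then regenerate native K/V.
            u_hat = aligners[layer](z)
            k = apply_rope(u_hat @ W_k[layer], positions)
            v = u_hat @ W_v[layer]
        else:
            # Reuse: decode K/V directly in the consumer feature space.
            k = z @ decoders[layer]["k"]
            v = z @ decoders[layer]["v"]
        cache.append((k, v))                              # Step 3: hybrid cache
    return cache
```

In a real deployment the resulting pairs would be written into the framework's own cache object rather than a Python list, and multi-head layouts would add a head dimension to every tensor.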