# How to Compress KV Cache in RL Post-Training? Shadow Mask Distillation for Memory-Efficient Alignment
Source: [https://arxiv.org/html/2605.06850](https://arxiv.org/html/2605.06850)
Rui Zhu^1, Weiheng Bai^2, Qiushi Wu^2, Yang Ren^1, Haixu Tang^3, Yuchu Liu^1

^1 Yale University, ^2 University of Minnesota Twin Cities, ^3 Indiana University Bloomington

###### Abstract

Reinforcement Learning (RL) has emerged as a crucial paradigm for unlocking the advanced reasoning capabilities of Large Language Models (LLMs), encompassing frameworks like RLHF and RLAIF. Regardless of the specific optimization algorithm (e.g., PPO, GRPO, or Online DPO), online RL inherently requires an exploratory trajectory generation (rollout) phase. However, for long-context reasoning tasks, this rollout phase imposes a severe "memory wall" due to the exorbitant Key-Value (KV) cache footprint. While applying KV cache compression during rollouts mitigates this memory overhead, it induces a critical off-policy bias. Although modern KV compression is often nearly lossless during standard inference, even minuscule approximation errors are drastically amplified by the inherent instability of RL optimization. Specifically, the sampler generates responses under a sparse context, whereas the learner updates parameters using the full, dense context. Existing statistical solutions, such as importance reweighting, struggle to correct this magnified bias, suffering from high gradient variance and severe sample inefficiency.

In this paper, we propose Shadow Mask Distillation (SMD), an architectural framework that eradicates this structural mismatch. Instead of post-hoc statistical patching, SMD injects a "Shadow Mask", recorded during the sparse rollout, directly into the learner's attention layers, mathematically guaranteeing perfect on-policy alignment. Furthermore, we introduce a dual-track KL distillation mechanism to transfer global contextual knowledge from the dense policy to the masked policy. Extensive experiments on a 4B model validate SMD's efficacy. Under 50% KV cache compression, SMD achieves near-lossless compression, remaining highly competitive with uncompressed baselines (e.g., 73.6% vs. 74.5% on GSM8K). Moreover, it prevents the severe long-context degradation inherent in SOTA rejection-sampling baselines, and its mask simulation entirely eliminates native VRAM spikes, setting a robust, memory-efficient standard for long-context RL.

## 1 Introduction

Large Language Models (LLMs) (Brown et al., 2020; Touvron et al., 2023; Bai et al., 2023; Xiong et al., 2025) have achieved unprecedented success, largely driven by alignment techniques such as Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022; Bai et al., 2022; Li et al., 2025). Recent advancements, particularly Group Relative Policy Optimization (GRPO) (Shao et al., 2024) and Online Direct Preference Optimization (Online DPO) (Rafailov et al., 2024), have further streamlined this process. However, as LLMs are increasingly deployed for long-context applications (Ding et al., 2024), the rollout phase, where the model autoregressively generates multiple trajectories, hits a formidable "memory wall." The massive memory footprint required to store the Key-Value (KV) cache (Vaswani et al., 2017; Kwon et al., 2023; Shazeer, 2019) for long sequences heavily bounds the batch size and training throughput, rendering long-context RLHF computationally prohibitive on standard hardware.

A natural countermeasure is to apply KV cache compression algorithms, such as SnapKV (Li et al., 2024) or H2O (Zhang et al., 2023), to sparsify the context during the rollout phase. While this alleviates the memory bottleneck, it inadvertently introduces a structural dichotomy into the RL pipeline. During generation, the actor behaves as a "myopic" policy constrained by a sparse context ($\pi_{\text{sparse}}$). Yet, during the optimization phase, the learner evaluates these trajectories using the uncompressed, dense context ($\pi_{\text{dense}}$). This asymmetry severely violates the core assumption of policy gradient methods (Sutton and Barto, 2018; Wang et al., 2024), creating a massive off-policy bias that misguides the gradient updates and frequently culminates in irreversible policy collapse (Chen et al., 2024).

Current literature attempts to reconcile this divergence through statistical interventions. Recent state-of-the-art methods rely on Sparsity-Aware Rejection Sampling and Importance-based Reweighting (Luo et al., 2024). Unfortunately, these post-hoc statistical patches treat the symptom rather than the disease. Rejection sampling discards expensive rollout trajectories, resulting in poor sample efficiency; meanwhile, importance reweighting introduces extreme gradient variance, severely destabilizing the training dynamics.

To break this impasse, we propose Shadow Mask Distillation (SMD), an architectural rather than statistical solution. We observe that physical on-policy alignment can be restored if the learner undergoes the exact same informational bottleneck as the generator. SMD achieves this by recording a binary "Shadow Mask" during the sparse rollout and physically injecting it into the learner's causal attention matrix. This time-freezing mechanism guarantees strict $\pi_{\text{sparse}} \equiv \pi_{\text{shadow}}$ alignment, completely neutralizing off-policy variance. To prevent the model from overfitting to the truncated context, we simultaneously execute a dense forward pass, applying a Kullback-Leibler (KL) divergence penalty (Kullback and Leibler, 1951) to distill global contextual reasoning into the masked policy. Furthermore, we reveal a critical engineering insight: native tensor slicing for KV eviction in high-level frameworks induces a "Not-In-Place Allocation Spike." SMD bypasses this entirely, offering a framework-agnostic solution that avoids catastrophic out-of-memory (OOM) spikes without requiring low-level C++/CUDA modifications (Zheng et al., 2023; Kwon et al., 2023).

Our primary contributions are summarized as follows:

- We identify the structural off-policy mismatch in memory-efficient RLHF and propose Shadow Mask Distillation, a dual-track framework that guarantees exact gradient alignment with zero data waste.
- We empirically demonstrate that SMD eliminates the instantaneous memory fragmentation spikes inherent in physical KV eviction during the memory-constrained learner optimization phase, providing a robust execution environment for long-context generation.
- We reveal the implicit regularization effect of attention-sparsified rollouts. On the Reddit TL;DR benchmark, SMD outperforms the dense baseline in ROUGE-L (by +0.6% relative) and surpasses statistical reweighting baselines in both convergence speed and training stability.

## 2 Related Work

#### Memory-Efficient RLHF and GRPO.

Aligning LLMs via RL, such as Proximal Policy Optimization (PPO) (Schulman et al., 2017), typically requires maintaining multiple model copies (Actor, Critic, Reference, Reward), creating an immense memory overhead. GRPO (Shao et al., 2024) alleviates this by eliminating the Critic model and utilizing group-relative advantages. Despite this, the autoregressive rollout phase remains heavily bottlenecked by KV cache allocation (Kwon et al., 2023), especially as sequence lengths scale into the tens of thousands of tokens. Our work directly targets this rollout memory wall, proposing an orthogonal optimization that integrates seamlessly with modern RLHF pipelines (Ouyang et al., 2022; Stiennon et al., 2020).

#### KV Cache Compression in LLMs.

To address the linear memory scaling of Transformers, various KV cache compression strategies have been proposed. Eviction-based methods, such as H2O (Zhang et al., 2023) and SnapKV (Li et al., 2024), selectively retain "heavy hitter" tokens based on accumulated attention scores. Tuning-free quantization frameworks like KIVI (Liu et al., 2023) have further pushed compression limits. Alternative approaches like StreamingLLM (Xiao et al., 2024) leverage the attention-sink phenomenon or streaming mechanisms to maintain stable generation in infinite-length settings, while recent works such as FOCUS (Zhu et al., 2025) extend near-lossless compression to specialized domains such as ultra-long DNA sequences. While these methods are highly effective during standard inference, natively integrating them into the distributed RL training loop (e.g., Megatron-LM (Shoeybi et al., 2019) or Ray (Moritz et al., 2018; Fan et al., 2025)) triggers a profound off-policy bias (Zhu et al., 2023). Our Shadow Mask framework allows any eviction-based compression algorithm (e.g., SnapKV or random retention) to be safely integrated into the RLHF loop without corrupting the policy gradients.
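To make the eviction heuristic concrete, the following minimal sketch selects keys by accumulated attention mass, in the spirit of H2O/SnapKV; the function name, single-head layout, and shapes are our illustrative assumptions rather than the reference implementations.

```python
import torch

def heavy_hitter_mask(attn_weights: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Illustrative heavy-hitter selection (simplified H2O/SnapKV-style heuristic).

    attn_weights: [num_queries, num_keys] post-softmax attention from a recent
                  window of queries (single head, for simplicity).
    Returns a boolean mask over keys; True = retain, False = evict.
    """
    scores = attn_weights.sum(dim=0)                      # attention mass received per key
    num_keep = max(1, int(keep_ratio * scores.numel()))
    keep_idx = torch.topk(scores, num_keep).indices
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask[keep_idx] = True
    return mask
```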

#### The Failure of Naive KV Compression in RL.

Correcting off-policy bias is a classical problem in RL. Standard approaches rely on Importance Sampling (IS) (Sutton and Barto, 2018) to reweight the gradients, often stabilized via clipping mechanisms, as implemented in PPO (Schulman et al., 2017). However, while applying KV compression directly during standard inference works seamlessly, naively plugging it into an RL training loop to assist rollout generation fails catastrophically. The reason stems from the severe structural misalignment between the actor's sparse generation and the learner's dense evaluation. To illustrate this, when a standard PPO/GRPO pipeline is naively augmented with 50% SnapKV compression during rollouts, the training dynamics exhibit severe reward collapse, and the model's accuracy on the GSM8K benchmark plummets from 74.5% to an unusable 64.3% (as detailed in Section 4). This demonstrates that naive KV compression is fundamentally incompatible with online RL without rigorous off-policy correction.

#### Pioneering Efforts and Their Limitations.

To date, Sparse-RL (Luo et al., 2024) stands as the sole pioneering effort in this nascent domain, identifying this critical bottleneck and proposing the first viable framework for long-context, memory-efficient RL. By combining Sparsity-Aware Rejection Sampling with importance reweighting, Sparse-RL mitigates the divergence between dense and sparse contexts. However, its statistical nature inherently bounds its efficacy. In the context of LLMs, where the action space (vocabulary) and sequence lengths are massive, the importance ratio grows exponentially (Luo et al., 2024), leading to catastrophic gradient variance. Consequently, Sparse-RL is forced to aggressively discard a significant portion of expensive rollout trajectories (often >20%) to maintain stability.

Motivated by this data waste and persistent gradient variance, we argue that the off-policy dilemma should not be patched statistically, but resolved architecturally. This insight directly motivates our proposed Shadow Mask Distillation methodology, which we detail in the following section.

## 3 Methodology

In this section, we introduce Shadow Mask Distillation, a novel architectural framework designed to eliminate the off-policy bias induced by KV cache compression in RLHF, without relying on high-variance statistical patches. We first build intuition for the asymmetric context dilemma and then formalize our dual-track mechanism. Figure 1 provides an overview of the proposed methodology.

![Refer to caption](https://arxiv.org/html/2605.06850v1/img/framework.png)

Figure 1: The overall architecture of Shadow Mask Distillation (SMD). In Phase 1 (Rollout), the KV eviction algorithm dynamically drops tokens to save memory, recording the retention indices into a binary Shadow Mask ($M$). In Phase 2 (Learner), SMD conducts a dual-track forward pass: the Alignment track applies the Shadow Mask to perfectly reconstruct the sparse generation environment for strict on-policy GRPO parameter updates, while the Distillation track leverages the full dense context to implicitly regularize the myopic sparse policy via KL divergence.

### 3.1 Preliminaries and Intuition: The Asymmetric Context Dilemma

In standard RLHF algorithms like GRPO (Shao et al., 2024), the training pipeline consists of two decoupled phases: rollout (trajectory generation) and learner (parameter update). For a given prompt $x$, the model generates a set of outputs $\{y_1, y_2, \dots, y_K\}$ by sampling from the current policy $\pi_\theta$, and updates parameters using the estimated advantages $\hat{A}_i$.
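For concreteness, the sketch below computes the group-relative advantages for the $K$ responses sampled from one prompt; the reward values are placeholders, and the mean/std normalization is the commonly used GRPO form rather than the authors' exact implementation.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Advantages from within-group reward normalization (no critic model).

    rewards: [K] scalar rewards for the K responses sampled for one prompt.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example with K = 4 sampled responses for a single prompt.
advantages = group_relative_advantages(torch.tensor([0.2, 0.9, 0.4, 0.7]))
```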

To overcome the "memory wall" in long-context tasks, memory-efficient systems apply KV cache compression (e.g., SnapKV) during the rollout phase. This creates a constrained policy, denoted $\pi_{\text{sparse}}$. However, during the learner phase, standard frameworks recompute the log-probabilities using the full, uncompressed context, effectively evaluating the trajectories under an omniscient policy, $\pi_{\text{dense}}$.

Intuition: This discrepancy resembles an unfair review process in which a player navigates a maze blindfolded (sparse rollout) while the coach evaluates their moves using a full overhead map (dense learner). The coach penalizes the player for failing to use information they never observed. While recent works attempt to patch this via statistical importance reweighting, such methods suffer from extreme gradient variance and sample inefficiency. Instead, our method eradicates this bias physically: we simply place the exact same blindfold on the coach during the evaluation.

### 3.2 Track 1: Architectural On-Policy Alignment via Shadow Masking

To natively align the learner with the rollout generator, we introduce the Shadow Mask. During the rollout generation phase, as the KV eviction algorithm dynamically drops tokens, we record the exact retention state. While conceptually modeled as a 1-D sequence $M \in \{0,1\}^L$ for simplicity, the mask is implemented as a dynamic, per-head mask that exactly reproduces the behavior of the specific KV eviction algorithm (e.g., SnapKV's head-specific retention). Crucially, this mask is only applied to generation queries; the initial prompt prefill phase computes full, uncompressed attention to mirror the standard autoregressive generation process.
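A minimal sketch of how such a per-head retention record could be materialized into a binary shadow mask is shown below; `retained_indices` is assumed to be whatever index set the eviction algorithm (e.g., SnapKV) keeps for each head, and the tensor layout is our own choice for illustration.

```python
import torch

def record_shadow_mask(retained_indices: list, seq_len: int) -> torch.Tensor:
    """Build the binary shadow mask from per-head retained key positions
    collected during the sparse rollout.

    retained_indices[h]: 1-D LongTensor of key positions kept for head h.
    Returns a [num_heads, seq_len] float mask; 1 = key visible, 0 = evicted.
    """
    num_heads = len(retained_indices)
    mask = torch.zeros(num_heads, seq_len)
    for h, idx in enumerate(retained_indices):
        mask[h, idx] = 1.0
    return mask
```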

In the learner phase, instead of feeding the full context, we inject this recorded mask $M$ directly into the causal attention matrix of the model. The modified attention scores are computed as:

$$\text{Attention}(Q,K,V,M)=\text{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}+\mathcal{M}_{\text{causal}}+\mathcal{M}_{\text{shadow}}\right)V$$

where the additive penalty $\mathcal{M}_{\text{shadow}}$ sets the pre-softmax logits to $-\infty$ for any key token with $M_j = 0$, resulting in exact post-softmax attention weights of 0. By doing so, the forward pass of the learner is physically restricted to the exact information manifold experienced during generation. The resulting policy $\pi_\theta(y_i \mid x, M)$ perfectly reconstructs $\pi_{\text{sparse}}(y_i \mid x)$, achieving strict on-policy alignment:

$$\frac{\pi_{\text{sparse}}(y_i \mid x)}{\pi_\theta(y_i \mid x, M)} \equiv 1.0$$

This ensures the GRPO surrogate loss is computed with zero off-policy bias and eliminates the need for unstable importance ratios.
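The masked attention above can be sketched directly, assuming single-sequence Q/K/V tensors of shape [heads, length, dim]; this is an illustrative re-implementation of the equation, not the authors' kernel.

```python
import math
import torch

def shadow_masked_attention(q, k, v, shadow_mask):
    """Causal attention with the recorded shadow mask injected additively.

    q, k, v: [H, L, d] projections for one sequence.
    shadow_mask: [H, L], 1 = key visible during rollout, 0 = evicted key.
    Evicted keys receive a -inf pre-softmax penalty, i.e. exactly zero weight.
    """
    H, L, d = q.shape
    logits = torch.einsum("hqd,hkd->hqk", q, k) / math.sqrt(d)         # [H, L, L]
    causal = torch.full((L, L), float("-inf")).triu(diagonal=1)        # hide future keys
    shadow = torch.where(shadow_mask.bool(),
                         torch.zeros_like(shadow_mask),
                         torch.full_like(shadow_mask, float("-inf")))  # hide evicted keys
    logits = logits + causal + shadow[:, None, :]                      # broadcast over queries
    return torch.softmax(logits, dim=-1) @ v
```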

#### Theoretical Guarantee (Informal).

To rigorously understand the impact of this architectural alignment, we provide a formal mathematical proof in Appendix A. Informally, our theory states:

###### Theorem (Informal) (Variance Eradication).

While statistical patches like importance reweighting intrinsically suffer from gradient variance that scales exponentially with the compressed context length $L$ (i.e., $\text{Var}_{\text{IR}} \propto e^L$), Shadow Mask Distillation natively neutralizes this off-policy discrepancy. By enforcing absolute architectural parity between rollout generation and the learner's evaluation, SMD theoretically achieves strictly zero additional off-policy variance, regardless of $L$.

In plain terms, this theorem shows why traditional statistical methods are bound to fail on long documents. Because the dense and sparse policies evaluate information differently, the probability of them agreeing across thousands of generated tokens becomes vanishingly small, leading to wild, unusable gradients. By physically forcing the dense learner to wear the exact same "blindfold" (the shadow mask) as the sparse generator, SMD bypasses this statistical failure mode entirely, ensuring stable training regardless of context length.
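As a rough worked example, with an illustrative per-token weight variance of $\sigma_0^2 = 0.01$ and $L = 2000$ generated tokens (numbers chosen only to make the scaling concrete), the cumulative importance-ratio variance bound from Appendix A already gives

$$\text{Var}(\rho_{1:L}) \;\geq\; (1+\sigma_0^2)^{L} - 1 \;=\; 1.01^{2000} - 1 \;\approx\; 4.4 \times 10^{8},$$

whereas under SMD the cumulative importance ratio is identically 1 and contributes no additional variance.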

### 3.3 Track 2: Dense-to-Sparse Knowledge Distillation

While the Shadow Mask perfectly aligns the gradients, it fundamentally trains a "myopic" model restricted by sparse context. To inject global contextual reasoning back into the compressed policy, we introduce a secondary dense track.

After the masked forward pass, we sequentially perform a second forward pass using the full dense causal attention mechanism, yielding the logits of the dense policy $\pi_\theta(y_i \mid x)$. We treat this dense policy as an implicit teacher and apply a Kullback-Leibler (KL) divergence penalty to distill its global knowledge into the masked policy. This acts as a powerful implicit regularizer, teaching the sparse model to approximate the omniscient decision boundary even when its physical memory is truncated.

Intuition on Stop-Gradient and Shared Weights: Note that the dense teacher is detached via a stop-gradient operator. One might question why the dense model improves during final downstream inference if it receives no direct RL gradients. The fundamental reason is that the sparse and dense pathways share the exact same underlying physical weights $\theta$. By forcing the network to predict the complete dense information manifold using only a severely truncated context, we impose a strict information bottleneck (akin to sequence-level dropout). This forces the shared parameters $\theta$ to learn a highly robust and noise-resilient representation. This enhancement in foundational feature extraction naturally generalizes and translates into superior performance during standard dense inference.

### 3.4 Overall Training Objective

Combining the strictly aligned GRPO objective from Track 1 and the dense distillation regularizer from Track 2, our final Shadow Mask Distillation objective is formulated as:

$$\max_{\theta}\ \mathcal{L}_{\text{SMD}}(\theta)=\mathbb{E}_{x\sim\mathcal{D},\,y_{i}\sim\pi_{\text{sparse}}}\Big[\underbrace{\mathcal{L}_{\text{GRPO}}\big(\pi_{\theta}(y_{i}\mid x,M)\big)}_{\text{Strict On-Policy Alignment}}-\underbrace{\lambda\,\mathcal{D}_{\text{KL}}\big(\text{sg}[\pi_{\theta}(y_{i}\mid x)]\,\|\,\pi_{\theta}(y_{i}\mid x,M)\big)}_{\text{Dense Knowledge Distillation Penalty}}\Big]$$

where $\text{sg}[\cdot]$ denotes the stop-gradient operation applied to the dense teacher, $\lambda$ is the distillation coefficient, and $\mathcal{D}_{\text{KL}}$ is implemented in practice as the expected sum of token-wise KL divergences over the vocabulary distribution at each generation step. Note that $\mathcal{L}_{\text{GRPO}}$ uses the masked context for its reference policy as well, ensuring consistent gradient alignment. Our method achieves exact gradient alignment while maintaining a unified network architecture, maximizing both computational efficiency and training stability. For a detailed step-by-step description of the entire training pipeline, please refer to Algorithm 1 in Appendix C.
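A minimal PyTorch-style sketch of this dual-track objective is given below; it assumes the per-token vocabulary logits from the masked and dense passes are already available, and `grpo_surrogate` stands in for the standard clipped GRPO loss computed on the masked track (a placeholder, not the authors' implementation).

```python
import torch
import torch.nn.functional as F

def smd_loss(masked_logits: torch.Tensor,
             dense_logits: torch.Tensor,
             grpo_surrogate: torch.Tensor,
             lam: float = 0.1) -> torch.Tensor:
    """Dual-track SMD objective (illustrative sketch).

    masked_logits, dense_logits: [T, V] per-token vocabulary logits from the
    shadow-masked and dense forward passes of the same network.
    grpo_surrogate: scalar GRPO loss computed from the masked log-probs
    (strictly on-policy, so the importance ratio is identically 1).
    """
    # Stop-gradient on the dense teacher: gradients flow only through the masked track.
    teacher_logp = F.log_softmax(dense_logits.detach(), dim=-1)
    student_logp = F.log_softmax(masked_logits, dim=-1)
    # KL(teacher || student), summed over the vocabulary, averaged over tokens.
    kl = F.kl_div(student_logp, teacher_logp, reduction="batchmean", log_target=True)
    return grpo_surrogate + lam * kl
```

Minimizing this combined loss mirrors maximizing $\mathcal{L}_{\text{SMD}}$ above, with the GRPO term expressed as a loss rather than a reward-side objective.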

## 4 Experiments

To comprehensively evaluate the effectiveness of Shadow Mask Distillation (SMD), we design a series of experiments targeting multiple capabilities: short summarization, long-document summarization, mathematical reasoning, and multi-hop question answering. Our evaluation aims to answer the following research questions:

- RQ1: How does SMD compare against state-of-the-art RL baselines under identical memory constraints?
- RQ2: Does SMD preserve downstream task generalization while eliminating PyTorch's native VRAM spikes?
- RQ3: How do critical hyperparameters (compression ratio, distillation coefficient) and KV selection strategies affect final performance?

Unless otherwise specified, all core experiments use the Qwen3-4B-Instruct-2507 model as the primary baseline, leveraging its robust foundational capabilities to demonstrate SMD's zero-cost compression and regularization benefits. A comprehensive evaluation across diverse model families and scales (including Llama-3.2-1B and Qwen2.5 1.5B–7B) is provided in Appendix B.

### 4.1 Main Results and SOTA Comparison (RQ1)

#### Statistical Patches vs. Architectural Cures.

Current SOTA attempts reconcile sparse rollouts with dense learners using post-hoc statistical interventions. SMD abandons statistical patching in favor of native architectural alignment. Table 1 highlights these fundamental differences.

Table 1: Fundamental comparison between statistical interventions and SMD's architectural alignment.

| | Importance Reweighting | Rejection Sampling (Sparse-RL) | Shadow Masking (SMD) |
|---|---|---|---|
| Mechanism | Statistical scalar (weighting) | Statistical filter (discarding) | Physical matrix injection |
| Gradient Variance | Exponential explosion ($\Omega((1+\sigma_0^2)^L)$) | High | Mathematically zero |
| Data Waste | None | >20% discarded | Zero |
| Alignment | Approximate | Partial | Strict ($\pi_{\text{sparse}} \equiv \pi_{\text{shadow}}$) |
#### Dataset Statistics.

To ensure a robust evaluation across diverse reasoning modalities and context lengths, we carefully curated our datasets; their detailed statistics and characteristics are summarized in Table 2. All experiments were executed on a uniform hardware configuration of 8× NVIDIA H100 GPUs to ensure strict reproducibility and sufficient computational capacity.

Table 2: Overview of the four core evaluation datasets. "Avg. Length" indicates the approximate token count of the prompt context.

| Dataset | Task Type | Avg. Length | Train Size |
|---|---|---|---|
| TL;DR | Short Summarization | ~500 | 1.5K |
| GSM8K | Mathematical Reasoning | ~800 | 7.5K |
| HotpotQA | Multi-hop Question Answering | ~2.5K | 2.0K |
| GovReport | Long-document Summarization | ~8.0K | 1.0K |

We compare SMD with the standard Dense Baseline (100% KV cache) and with Sparse-RL (Luo et al., 2024), the current leading approach, which physically evicts 50% of the KV cache during generation and combines rejection sampling (discarding the 20% most deviated rollout trajectories) with importance weighting ($\rho$ clipped to $[0.8, 1.2]$). For SMD, we uniformly apply a 50% SnapKV compression ratio (i.e., retaining only 50% of the KV tokens per sequence) and a distillation coefficient of $\lambda = 0.1$. We train the models for 500 GRPO steps (with $K = 4$ generated responses per prompt and a learning rate of $1 \times 10^{-7}$) across the four core datasets: TL;DR, GovReport, GSM8K, and HotpotQA. Additionally, we evaluate the models' zero-shot generalization on the advanced MATH500 and AIME24 benchmarks (the latter using Pass@8 with temperature 0.7), with full results detailed in Appendix B.
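For reference, the training configuration described above can be collected into a single sketch; the values are those stated in the text, while the dictionary layout itself is illustrative and not the authors' configuration schema.

```python
# Hyperparameters for the main comparison, as stated in the text.
# The dict layout is illustrative, not the authors' config format.
smd_main_config = {
    "base_model": "Qwen3-4B-Instruct-2507",
    "kv_compression": {"algorithm": "SnapKV", "keep_ratio": 0.5},
    "distillation_coefficient": 0.1,                 # lambda
    "grpo": {"steps": 500, "responses_per_prompt": 4, "learning_rate": 1e-7},
    "sparse_rl_baseline": {
        "rejection_rate": 0.20,                      # discard 20% most deviated rollouts
        "importance_ratio_clip": (0.8, 1.2),
    },
    "hardware": "8x NVIDIA H100",
}
```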

![Refer to caption](https://arxiv.org/html/2605.06850v1/img/exp_02.png)

Figure 2: Rollout reward trajectories across four core datasets. SMD SnapKV (green) exhibits significantly lower variance and faster convergence than Sparse-RL (orange), matching or exceeding the Dense Baseline (blue) despite operating with only 50% of the KV cache.

Table 3: Comprehensive evaluation on four core datasets. We report ROUGE-L for TL;DR and GovReport, accuracy for GSM8K, and F1 for HotpotQA; all metrics are standard percentages (Stiennon et al., 2020; Huang et al., 2021; Cobbe et al., 2021; Yang et al., 2018).

| Method | TL;DR | GovReport | GSM8K | HotpotQA |
|---|---|---|---|---|
| Dense Baseline (100% KV) | 33.2 | 42.5 | 74.5 | 52.4 |
| Sparse-RL (50% KV drop) | 31.5 | 39.8 | 71.2 | 48.6 |
| SMD SnapKV (50% KV drop) | 33.4 | 42.4 | 73.6 | 52.6 |

As shown in Figure 2 and Table 3, SMD achieves near-lossless compression. Under an aggressive 50% KV cache reduction, SMD not only outperforms Sparse-RL but remains highly competitive with the uncompressed Dense Baseline. Remarkably, on HotpotQA and TL;DR, SMD even slightly surpasses the Dense Baseline, indicating that appropriate KV compression serves as an effective information bottleneck, regularizing the policy against overfitting to short-sighted RL rewards.

Furthermore, SMD overcomes the long-context bottleneck inherent in Sparse-RL. Tasks like GovReport and HotpotQA rely heavily on long-distance contextual dependencies. The rejection sampling in Sparse-RL severely penalizes long-sequence generation, causing significant performance drops (e.g., GovReport falls from 42.5 to 39.8). In contrast, the dense distillation track within SMD preserves critical global evidence, restoring the model's long-context comprehension despite the sparse generation. For a more comprehensive evaluation across diverse model scales (from 1B to 7B) and additional reasoning benchmarks (e.g., MATH500, AIME24), please refer to Appendix B.

### 4.2 Downstream Generalization and System Efficiency (RQ2)

A holistic evaluation of KV cache compression requires examining both model capabilities and hardware execution efficiency. To this end, we assess the models' intrinsic reasoning on the GSM8K test set (Cobbe et al., 2021) using an 8-shot prompt (Brown et al., 2020), while simultaneously micro-benchmarking the instantaneous memory footprint during the learner's optimization phase.

Table 4: Dual-perspective analysis of downstream generalization (GSM8K accuracy) and system efficiency (VRAM spikes). SMD not only improves downstream reasoning over standard RL baselines but also strictly prevents the memory fragmentation spikes inherent in physical KV eviction.

| Method | GSM8K Accuracy | Peak VRAM Ratio | Absolute VRAM Spike |
|---|---|---|---|
| SFT Baseline | 72.0% | N/A | N/A |
| Dense RL Baseline | 74.5% | 100% | Baseline |
| Sparse-RL (50% KV) | 71.2% | 150% | +1.24 GB (spike) |
| SMD SnapKV (50% KV) | 73.6% | 100% | +0.00 GB (no spike) |

The results in Table 4 reveal a twofold advantage for SMD. First, regarding downstream performance, SMD achieves 73.6% accuracy, outperforming the SFT (72.0%) and Sparse-RL (71.2%) baselines while remaining highly competitive with the Dense RL (74.5%) upper bound. In contrast, naive physical eviction of 50% of the KV cache leads to a severe accuracy collapse (64.3%). This empirically validates that the information bottleneck created by SMD acts as an effective regularizer against overfitting without losing critical task knowledge.

Second, regarding system efficiency, a critical yet often overlooked issue in implementing physical KV cache eviction (i.e., extracting and reallocating the subset of retained token tensors) is the instantaneous memory spike caused by non-in-place memory operations. Because PyTorch allocates the new tensor before freeing the old one, retaining 50% of the tokens momentarily peaks at 150% of the original cache size (a +1.24 GB spike). The impact of this spike is entirely context-dependent. During rollout, the absence of massive backpropagation activations keeps the base VRAM footprint low, so a momentary spike there is harmless and will not trigger an out-of-memory (OOM) error. However, during the learner's optimization phase, where stored activations, gradients, and optimizer states push VRAM to its limits, any physical slicing spike becomes catastrophic. Sparse-RL typically suffers from this because its standard implementations rely on physical slicing during the gradient optimization phase to evaluate the sparse trajectories. SMD circumvents this implementation artifact: we strictly use mask simulation instead of physical slicing during the learner's optimization phase, ensuring memory safety at the critical high-water mark with zero VRAM spikes.
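The contrast can be sketched in a few lines: physical slicing materializes a new tensor while the old cache is still referenced (the transient ~150% footprint described above), whereas mask simulation never reallocates the cache. The snippet is purely illustrative and does not reflect any particular framework's internals.

```python
import torch

def physical_eviction(kv: torch.Tensor, keep_idx: torch.Tensor) -> torch.Tensor:
    """Not-in-place slicing: the indexed copy is allocated while `kv` is still
    alive, so peak memory briefly holds old (100%) + new (50%) = ~150%."""
    return kv[:, :, keep_idx, :].contiguous()        # transient allocation spike

def mask_simulation(shadow_mask: torch.Tensor) -> torch.Tensor:
    """SMD-style alternative: leave the KV buffer untouched and turn the recorded
    mask into an additive attention penalty; no KV-sized reallocation occurs."""
    return torch.where(shadow_mask.bool(),
                       torch.zeros_like(shadow_mask),
                       torch.full_like(shadow_mask, float("-inf")))
```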

### 4.3 Ablation Studies (RQ3)

#### Impact of KV Compression Ratio.

We investigate the optimal KV compression ratio on the TL;DR dataset (Stiennon et al., 2020), holding $\lambda = 0.1$ and using SnapKV (Li et al., 2024). As depicted in Figure 3, the relationship between the compression ratio and ROUGE-L exhibits a distinct inverted-U curve, demonstrating that a moderate information bottleneck acts as an effective regularizer against overfitting. A 20% compression rate (i.e., dropping 20% of the KV tokens) emerges as the sweet spot (33.6), providing optimal regularization. As we push compression to the target 50% ratio, SMD degrades gracefully but still maintains a robust score (33.4) that outperforms the uncompressed Dense baseline (33.2). Even at an extreme 70% compression ratio, SMD maintains a usable score of 31.0. Furthermore, as the compression budget shrinks, Sparse-RL experiences a catastrophic accuracy collapse due to the exponentially exploding variance of its importance reweighting estimator. In stark contrast, SMD maintains stable performance even under severe constraints. This directly validates our theoretical claim that SMD's architectural alignment neutralizes the off-policy variance that plagues statistical baselines.

![Refer to caption](https://arxiv.org/html/2605.06850v1/img/exp_05.png)

Figure 3: Impact of the KV compression ratio on TL;DR ROUGE-L. SMD maintains strong performance even under extreme compression, while Sparse-RL suffers from catastrophic variance explosion.

![Refer to caption](https://arxiv.org/html/2605.06850v1/img/exp_06.png)

Figure 4: Impact of the distillation coefficient ($\lambda$). A subtle pull of $\lambda = 0.1$ yields optimal convergence.

#### Impact of the Distillation Coefficient ($\lambda$).

Figure 4 illustrates the effect of varying the distillation weight $\lambda$ in our dual-track loss. Relying solely on the shadow mask ($\lambda = 0.0$) already provides strong regularization, achieving a respectable score of 32.5. Introducing a subtle distillation guidance ($\lambda = 0.1$) rapidly aligns the policy, reaching the peak performance of 33.4. Conversely, excessive distillation (e.g., $\lambda = 0.8$) allows the KL divergence target to overpower the RL reward signal, leading to severe performance degradation (21.0).

#### KV Selection Strategy.

Finally, we examine the importance of the token selection mechanism by comparing SnapKV (attention-guided), Recent (keeping only the latest tokens), and Random selection under a 50% compression budget.

Table 5: Comparison of different KV selection strategies on TL;DR (all using 50% compression with SMD).

| Strategy | ROUGE-L | Characteristics |
|---|---|---|
| Random | 30.8 | High variance, destroys local semantic integrity |
| Recent | 31.2 | Loses critical early prompt facts (e.g., the title) |
| SnapKV | 33.4 | Builds a high-density semantic bottleneck |

Table 5 shows that SnapKV dominates with a score of 33.4 by preserving the most task-critical context features. The Recent strategy forgets essential instructions located at the beginning of the prompt. Intriguingly, SMD's robustness is evident even under the Random strategy: despite randomly destroying semantic structure, the off-policy correction ensures the model still converges to 30.8 instead of collapsing (as observed in naive setups without SMD).

#### Empirical Transferability of Ablated Hyperparameters.

Due to the significant computational cost of searching hyperparameters across extreme sequence lengths, our ablation studies were primarily conducted on the TL;DR dataset to enable rapid iteration. Crucially, the optimal hyperparameters discovered there (e.g., the 50% compression ratio and $\lambda = 0.1$) were transferred in a zero-shot manner to long-context benchmarks such as GovReport (8K tokens) and HotpotQA (2.5K tokens). The fact that SMD still achieved state-of-the-art results on these tasks (as shown in Table 3) empirically confirms the transferability of our settings across diverse context scales.

## 5 Limitations

While Shadow Mask Distillation (SMD) provides an elegant architectural solution to the off-policy dilemma in memory-efficient RLHF, it is not without limitations. First, SMD's reliance on a secondary dense forward pass during the distillation track inevitably introduces some computational overhead in the learner. However, in typical RLHF training, the autoregressive rollout generation dominates >90% of the wall-clock time; therefore, while SMD nominally doubles the FLOPs during the learner phase, the increase in end-to-end training time is marginal (<10%). Crucially, because the dense and masked forward passes are executed sequentially, the dual-track computation introduces minimal peak memory overhead during the learner phase (primarily the storage of the dense logits for the KL divergence penalty). The minor time overhead is also heavily offset by the ability to use much larger batch sizes and to eliminate the data waste caused by rollout rejection. Second, because SMD natively relies on a binary mask to simulate discrete token eviction, it is fundamentally incompatible with continuous KV cache compression techniques such as quantization. Future work could explore more advanced, inherently semantic-aware compression algorithms that go beyond simple attention-score heuristics.

## 6 Conclusion

In this work, we identified a critical structural bottleneck in modern long-context alignment pipelines: the severe off-policy bias and gradient variance induced by naively applying KV cache compression to rollout generation. To resolve this, we proposed Shadow Mask Distillation (SMD). Rather than relying on unstable statistical patching such as importance reweighting or rejection sampling, SMD eradicates the off-policy gradient variance architecturally, by physically aligning the dense learner's attention matrix with the sparse generator's exact information manifold. Coupled with a dense-to-sparse KL distillation track, SMD provides a powerful regularizing effect without losing global context.

## References

- J. Bai, S. Bai, Y. Chu, et al. (2023). Qwen technical report. arXiv preprint arXiv:2309.16609.
- Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, et al. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
- T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems 33, pp. 1877–1901.
- X. Chen, S. Tang, R. Zhu, S. Yan, L. Jin, Z. Wang, L. Su, H. Tang, and X. Wang (2024). The Janus interface: how fine-tuning in large language models amplifies the privacy risks. In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security (CCS).
- K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. (2021). Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- Y. Ding et al. (2024). LongRoPE: extending LLM context window beyond 2 million tokens. arXiv preprint arXiv:2402.13753.
- Y. Fan, R. Zhu, Z. Wang, C. Wang, H. Tang, Y. Dong, H. Cho, and L. Ohno-Machado (2025). ByzSFL: achieving Byzantine-robust secure federated learning with zero-knowledge proofs. arXiv preprint arXiv:2501.06953.
- L. Huang, S. Cao, N. N. Parulian, H. Ji, and L. Wang (2021). Efficient attentions for long document summarization. arXiv preprint arXiv:2104.02112.
- S. Kullback and R. A. Leibler (1951). On information and sufficiency. The Annals of Mathematical Statistics 22(1), pp. 79–86.
- W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and J. Jin (2023). Efficient memory management for large language model serving with PagedAttention. In Symposium on Operating Systems Principles.
- X. Li, K. Song, R. Zhu, P. Chen, and H. Tang (2025). Adversarial attack-defense co-evolution for LLM safety alignment via tree-group dual-aware search and optimization. arXiv preprint arXiv:2511.19218.
- Y. Li, Y. Dong, C. Gu, et al. (2024). SnapKV: LLM knows what you are looking for before generation. arXiv preprint arXiv:2404.14469.
- Z. Liu, J. Yuan, H. Jin, et al. (2023). KIVI: a tuning-free asymmetric 2bit quantization for KV cache. arXiv preprint arXiv:2402.02750.
- S. Luo, X. Zhang, Y. Hu, B. Zhang, K. Wang, J. Su, M. Sun, L. Liang, and J. Zhang (2024). Sparse-RL: breaking the memory wall in LLM reinforcement learning via stable sparse rollouts. arXiv preprint arXiv:2401.10079.
- P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W. Paul, M. I. Jordan, et al. (2018). Ray: a distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Implementation, pp. 561–577.
- L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, pp. 27730–27744.
- R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn (2024). Direct preference optimization: your language model is secretly a reward model. Advances in Neural Information Processing Systems.
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Z. Shao, P. Wang, Q. Zhu, R. Feng, M. Fang, et al. (2024). DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
- N. Shazeer (2019). Fast transformer decoding: one write-head is all you need. arXiv preprint arXiv:1911.02150.
- M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro (2019). Megatron-LM: training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053.
- N. Stiennon, L. Ouyang, J. Wu, D. Ziegler, R. Lowe, C. Voss, A. Radford, D. Amodei, and P. F. Christiano (2020). Learning to summarize from human feedback. In Advances in Neural Information Processing Systems, Vol. 33, pp. 3008–3021.
- R. S. Sutton and A. G. Barto (2018). Reinforcement learning: an introduction. MIT Press.
- H. Touvron, L. Martin, K. Stone, et al. (2023). Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017). Attention is all you need. In Advances in Neural Information Processing Systems, Vol. 30.
- Z. Wang, R. Zhu, D. Zhou, Z. Zhang, J. Mitchell, H. Tang, and X. Wang (2024). DPAdapter: improving differentially private deep learning through noise tolerance pre-training. In 33rd USENIX Security Symposium (USENIX Security 24).
- G. Xiao, Y. Tian, B. Chen, S. Song, et al. (2024). Efficient streaming language models with attention sinks. In International Conference on Learning Representations.
- C. Xiong, Z. Wang, R. Zhu, T. Ho, P. Chen, J. Xiong, H. Tang, and L. Ohno-Machado (2025). Hey, that's my data! Label-only dataset inference in large language models. arXiv preprint arXiv:2506.06057.
- Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning (2018). HotpotQA: a dataset for diverse, explainable multi-hop question answering. In Conference on Empirical Methods in Natural Language Processing.
- Z. Zhang, Y. Sheng, T. Zhou, et al. (2023). H2O: heavy-hitter oracle for efficient generative inference of large language models. In Advances in Neural Information Processing Systems, Vol. 36.
- L. Zheng, L. Yin, Z. Xie, J. Huang, et al. (2023). SGLang: efficient execution of structured language model programs. arXiv preprint arXiv:2312.07104.
- R. Zhu, D. Tang, S. Tang, X. Wang, and H. Tang (2023). Selective amnesia: on efficient, high-fidelity and blind suppression of backdoor effects in trojaned machine learning models. In 2023 IEEE Symposium on Security and Privacy (SP).
- R. Zhu, X. Zhou, H. Tang, S. W. Scherer, and L. Ohno-Machado (2025). Near-lossless model compression enables longer context inference in DNA large language models. arXiv preprint arXiv:2511.14694.

Appendix

## Appendix A Theoretical Analysis of Gradient Variance

In this section, we theoretically formalize why the statistical Importance Reweighting (IR) method, commonly used to correct off-policy bias in KV-compressed RLHF (e.g., Sparse-RL), is fundamentally unstable for long-context tasks, and how our Shadow Mask Distillation (SMD) method eradicates this instability.

Let $\tau=(s_1,a_1,s_2,a_2,\dots,s_L,a_L)$ be a trajectory of length $L$ generated by the compressed rollout policy $\pi_{\text{sparse}}$. To update the dense learner policy $\pi_{\text{dense}}$ parameterized by $\theta$, the IR gradient estimator is defined as:

$$\hat{g}_{\text{IR}}=\left(\prod_{t=1}^{L}\frac{\pi_{\text{dense}}(a_t \mid s_t)}{\pi_{\text{sparse}}(a_t \mid s_t)}\right)\nabla_{\theta}\log\pi_{\text{dense}}(\tau)\,A^{\pi_{\text{dense}}}(\tau)$$

where $w_t=\frac{\pi_{\text{dense}}(a_t \mid s_t)}{\pi_{\text{sparse}}(a_t \mid s_t)}$ is the token-level importance weight. To rigorously analyze the variance, we establish two mild assumptions that are standard in the RL literature.

###### Assumption 1 (Strictly Positive Gradient-Advantage Magnitude).

The squared magnitude of the gradient-advantage product is almost surely lower-bounded by a strictly positive constant for any valid trajectory: $\inf_{\tau}\|\nabla_{\theta}\log\pi_{\text{dense}}(\tau)\,A^{\pi_{\text{dense}}}(\tau)\|^2 \geq C_{\min} > 0$.

###### Assumption 2 (Non-zero Information Gap).

Due to the physical KV cache compression (e.g., dropping 50% of tokens), the rollout policy diverges from the dense policy. We assume the centered sequence $w_t - 1$ forms a martingale difference sequence with respect to the generation history $h_t$ (meaning $\mathbb{E}[w_t \mid h_t]=1$) and possesses a strictly positive lower-bounded conditional variance: $\text{Var}_{a_t \sim \pi_{\text{sparse}}}[w_t \mid h_t]=\sigma^2 \geq \sigma_0^2 > 0$.

Based on these assumptions, we introduce the following theorem to quantify the gradient variance.

###### Theorem 1 (Variance Explosion vs. Variance Eradication).

Under Assumptions 1 and 2, the variance of the off-policy Importance Reweighting estimator $\hat{g}_{\text{IR}}$ grows exponentially with the context length $L$:

$$\text{Var}(\hat{g}_{\text{IR}})=\Omega\big((1+\sigma_0^2)^L\big)$$

In contrast, the policy gradient estimator under the Shadow Mask Distillation (SMD) framework, denoted $\hat{g}_{\text{SMD}}$, achieves strictly zero off-policy variance, ensuring that its variance remains entirely independent of the context-length scaling caused by KV compression.

###### Proof.

First, we analyze the variance of the cumulative importance weight $\rho_{1:L}=\prod_{t=1}^{L}w_t$. Due to the martingale property of $w_t$, with $\mathbb{E}[w_t \mid h_t]=1$ and $\text{Var}[w_t \mid h_t] \geq \sigma_0^2$, the variance of the product satisfies:

$$\text{Var}(\rho_{1:L})=\mathbb{E}\left[\prod_{t=1}^{L}w_t^2\right]-\left(\mathbb{E}\left[\prod_{t=1}^{L}w_t\right]\right)^2 \geq (1+\sigma_0^2)^L - 1$$

Since Assumption 1 establishes a strict pointwise lower bound on the gradient-advantage product, we can directly lower-bound the second moment. The total variance of the IR gradient estimator is thus bounded below by $\text{Var}(\hat{g}_{\text{IR}}) \geq C_{\min}\big((1+\sigma_0^2)^L - 1\big)$. Therefore, $\text{Var}(\hat{g}_{\text{IR}})=\Omega\big((1+\sigma_0^2)^L\big)$, which grows exponentially as the generation length $L$ increases.

For our Shadow Mask method, the learner is structurally restricted via the injected mask $M$. By definition of the masked causal attention mechanism, the shadow policy evaluates the trajectory exactly as the sparse policy did:

$$\forall t,\quad \pi_{\text{shadow}}(a_t \mid s_t, M) \equiv \pi_{\text{sparse}}(a_t \mid s_t)$$

Consequently, the token-level importance weight for our method is deterministically $w_t^{\text{SMD}}=\frac{\pi_{\text{shadow}}(a_t \mid s_t, M)}{\pi_{\text{sparse}}(a_t \mid s_t)}=1$, and the cumulative importance ratio collapses to exactly 1.0. The gradient estimator becomes:

$$\hat{g}_{\text{SMD}}=1\cdot\nabla_{\theta}\log\pi_{\text{shadow}}(\tau)\,A^{\pi_{\text{shadow}}}(\tau)$$

Thus, the off-policy-induced variance is exactly zero ($\text{Var}_{\text{off-policy}}(\hat{g}_{\text{SMD}})=0$), and the total variance reduces to the standard on-policy variance, which does not suffer from the exponential explosion in $L$. This completes the proof. ∎

#### Implications.

Theorem 1 mathematically elucidates why statistical patching methods like Sparse-RL face catastrophic instability in long-context tasks. As $L$ reaches thousands of tokens, the exponential term $(1+\sigma^2)^L$ drives the gradient variance to infinity, forcing these methods to rely on aggressive clipping and rejection sampling (discarding >20% of data). Our Shadow Mask Distillation neutralizes this exponential term natively, maintaining optimal sample efficiency and rendering it theoretically superior for long-context, memory-efficient RLHF.
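A tiny Monte Carlo sketch of this effect is shown below; the per-token weight distribution is a stand-in chosen only to satisfy Assumption 2 (mean 1, small positive variance), so the numbers illustrate the exponential growth of $\text{Var}(\rho_{1:L})$ with $L$ rather than the behavior of actual LLM policies.

```python
import torch

def cumulative_ratio_variance(L: int, sigma: float = 0.05, trials: int = 10_000) -> float:
    """Empirical variance of the cumulative importance ratio rho_{1:L} = prod_t w_t,
    using i.i.d. stand-in weights w_t with mean 1 and standard deviation `sigma`."""
    w = 1.0 + sigma * torch.randn(trials, L)
    rho = w.prod(dim=1)
    return rho.var().item()

for L in (10, 100, 1000):
    # Grows roughly like (1 + sigma^2)^L - 1, i.e. exponentially in L.
    print(L, cumulative_ratio_variance(L))
```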

## Appendix BComprehensive Results Across Model Scales and Benchmarks

To further validate the superiority and broad applicability of Shadow Mask Distillation \(SMD\), we present a comprehensive evaluation in Table[6](https://arxiv.org/html/2605.06850#A2.T6)\. We expand our analysis across diverse model scales \(from 1B to 7B\) and include additional complex mathematical reasoning benchmarks \(MATH500 and AIME24\) beyond our core dataset suite\.

For existing model scales (Llama-3.2-1B, Qwen2.5-1.5B, Qwen2.5-3B, Qwen2.5-7B), we benchmark the standard baselines and Sparse-RL under a standard KV retention ratio, extending their evaluation to encompass our broader task suite. More importantly, we introduce our Qwen3-4B-Instruct evaluation block, where SMD is comprehensively compared against all strong baselines. As demonstrated, SMD not only strictly prevents accuracy collapse but also consistently achieves state-of-the-art performance across nearly all settings, frequently matching or surpassing even the uncompressed Dense baseline.

Table 6: Comprehensive results across 6 diverse benchmarks and multiple model scales. "Toks. saving" indicates the empirical reduction in KV cache storage during the rollout phase (deviations from the uniform compression target arise from varying prompt-to-generation length ratios across datasets). The top two performances within each model block are highlighted in bold and underlined in the original paper.

| Model | Rollout | Method | GSM8K | MATH500 | AIME24 | TL;DR | GovReport | HotpotQA | Toks. saving |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-4B-Instruct | Base | – | 72.0 | 40.2 | 2.5 | 30.1 | 38.0 | 45.2 | – |
| | GRPO | Dense | 74.5 | 55.1 | 5.8 | 33.2 | 42.5 | 52.4 | – |
| | GRPO | w/ SnapKV | 64.3 | 45.0 | 3.1 | 28.5 | 35.2 | 41.5 | 50.0% |
| | | ↪ + Sparse-RL | 71.2 | 52.8 | 4.6 | 31.5 | 39.8 | 48.6 | |
| | | ↪ + SMD (Ours) | 73.6 | 54.2 | 5.2 | 33.4 | 42.4 | 52.6 | |
| Llama-3.2-1B-Instruct | Base | – | 36.2 | 22.8 | 1.2 | 18.5 | 24.2 | 30.1 | – |
| | GRPO | Dense | 51.2 | 33.6 | 2.9 | 23.1 | 28.5 | 35.4 | – |
| | GRPO | w/ SnapKV | 45.4 | 26.5 | 1.5 | 19.4 | 24.1 | 29.8 | 45.0% |
| | | ↪ + Sparse-RL | 48.6 | 31.4 | 2.2 | 21.6 | 26.5 | 33.2 | |
| | | ↪ + SMD (Ours) | 50.4 | 32.8 | 2.6 | 23.4 | 28.8 | 36.1 | |
| Qwen2.5-1.5B | Base | – | 43.5 | 21.0 | 0.3 | 22.4 | 28.1 | 33.5 | – |
| | GRPO | Dense | 75.1 | 59.1 | 4.0 | 26.8 | 34.2 | 41.0 | – |
| | GRPO | w/ SnapKV | 66.3 | 37.6 | 3.1 | 24.1 | 30.0 | 36.5 | 43.3% |
| | | ↪ + Sparse-RL | 73.7 | 57.6 | 3.4 | 27.0 | 33.5 | 40.2 | |
| | | ↪ + SMD (Ours) | 74.2 | 58.2 | 3.6 | 27.4 | 34.8 | 41.6 | |
| Qwen2.5-3B | Base | – | 76.0 | 55.8 | 4.1 | 28.5 | 36.4 | 45.1 | – |
| | GRPO | Dense | 85.0 | 65.8 | 6.5 | 31.5 | 40.2 | 50.6 | – |
| | GRPO | w/ SnapKV | 79.0 | 54.2 | 5.2 | 27.2 | 34.8 | 46.5 | 42.0% |
| | | ↪ + Sparse-RL | 83.4 | 64.0 | 5.3 | 30.8 | 39.5 | 50.0 | |
| | | ↪ + SMD (Ours) | 84.1 | 64.6 | 5.8 | 32.2 | 41.0 | 51.5 | |
| Qwen2.5-7B | Base | – | 81.6 | 57.4 | 7.3 | 32.1 | 41.5 | 51.2 | – |
| | GRPO | Dense | 92.5 | 74.8 | 15.5 | 34.8 | 45.2 | 55.8 | – |
| | GRPO | w/ SnapKV | 73.4 | 54.6 | 2.6 | 29.5 | 38.6 | 48.0 | 39.4% |
| | | ↪ + Sparse-RL | 90.1 | 71.4 | 10.2 | 34.0 | 44.1 | 54.5 | |
| | | ↪ + SMD (Ours) | 91.2 | 73.1 | 13.5 | 35.5 | 46.0 | 56.2 | |

## Appendix C Pseudo Code for Shadow Mask Distillation

Algorithm [1](https://arxiv.org/html/2605.06850#alg1) outlines the complete training procedure for Shadow Mask Distillation (SMD). During the rollout phase, we physically execute KV cache compression and record the eviction indices into the Shadow Mask. During the RL training phase, this mask is injected into the attention computation to reconstruct the sparse behavior for the policy gradient update, completely eradicating off-policy bias. Simultaneously, a dense forward pass is maintained to compute the KL distillation penalty, preserving long-context reasoning capabilities.
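Before the full listing, the rollout-side bookkeeping of Phase 1 can be pictured as follows. This is a minimal sketch under our own assumptions (single unbatched sequence, hypothetical helper names; the compressor's eviction indices are taken as given), not the released implementation.

```python
# Illustrative sketch: turning the eviction decisions of a KV-compression step
# (e.g., SnapKV) into a boolean "shadow mask" and the corresponding additive
# attention bias that the learner can later inject to reproduce the sparse context.
import torch

def record_shadow_mask(evicted_indices: torch.Tensor, seq_len: int) -> torch.Tensor:
    """Build a [seq_len] boolean keep-mask: False at positions whose KV entries
    were evicted during the sparse rollout."""
    keep = torch.ones(seq_len, dtype=torch.bool)
    keep[evicted_indices] = False
    return keep

def shadow_attention_bias(keep_mask: torch.Tensor) -> torch.Tensor:
    """Convert the keep-mask into an additive attention bias: evicted key
    positions receive -inf so they are invisible to every query, matching
    what the sampler saw under compression."""
    bias = torch.zeros(keep_mask.shape[-1])
    bias[~keep_mask] = float("-inf")
    return bias  # broadcast over query positions when added to attention logits
```

In this picture, the bias is simply added to the learner's (causal) attention logits during Phase 2, which is how the structural restriction in the proof above would be realized in practice.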

**Algorithm 1** Shadow Mask Distillation (SMD) for PPO/GRPO

**Require:** pre-trained LLM policy $\pi_{\theta}$, reference policy $\pi_{\text{ref}}$, prompt dataset $\mathcal{D}$, learning rate $\alpha$, distillation coefficient $\lambda$, clipping parameter $\epsilon$, KL penalty coefficient $\beta$, compression algorithm (e.g., SnapKV).

1: while training do
2:   Sample a batch of prompts $x\sim\mathcal{D}$
3:   // Phase 1: Sparse Rollout and Mask Recording
4:   for each prompt $x$ do
5:     for $k=1$ to $K$ do
6:       Generate response $y_{k}\sim\pi_{\theta}(\cdot\mid x)$ using KV cache compression
7:       Record the evicted KV token indices to construct the Shadow Mask $M_{k}$
8:       Compute the environment reward $r(x,y_{k})$
9:     end for
10:    Compute group-relative advantages $\hat{A}_{k}$ by normalizing the rewards $r(x,y_{k})$
11:  end for
12:  // Phase 2: Dual-Track Optimization
13:  for each optimization epoch do
14:    // Track A: Shadow Policy Gradient (Bias-free RL)
15:    Compute action probabilities $\pi_{\theta}^{\text{shadow}}(y_{k}\mid x,M_{k})$ and $\pi_{\text{ref}}^{\text{shadow}}(y_{k}\mid x,M_{k})$
16:    Calculate the ratio $p_{k}=\frac{\pi_{\theta}^{\text{shadow}}(y_{k}\mid x,M_{k})}{\pi_{\theta_{\text{old}}}^{\text{shadow}}(y_{k}\mid x,M_{k})}$
17:    Calculate the surrogate loss $L_{\text{Surr}}=\min\big(p_{k}\hat{A}_{k},\ \text{clip}(p_{k},1-\epsilon,1+\epsilon)\hat{A}_{k}\big)$
18:    Calculate the policy gradient loss $L_{\text{PG}}(\theta)=-\mathbb{E}\big[L_{\text{Surr}}-\beta D_{\text{KL}}(\pi_{\theta}^{\text{shadow}}\parallel\pi_{\text{ref}}^{\text{shadow}})\big]$
19:    // Track B: Dense Distillation (Regularization)
20:    Compute dense action probabilities $\pi_{\theta}^{\text{dense}}(y_{k}\mid x)$ without masking
21:    Calculate the distillation penalty $L_{\text{Distill}}(\theta)=\mathbb{E}\big[D_{\text{KL}}\big(\text{sg}[\pi_{\theta}^{\text{dense}}(y_{k}\mid x)]\,\parallel\,\pi_{\theta}^{\text{shadow}}(y_{k}\mid x,M_{k})\big)\big]$
22:    // Total Update
23:    $L_{\text{total}}=L_{\text{PG}}+\lambda L_{\text{Distill}}$
24:    $\theta\leftarrow\theta-\alpha\nabla_{\theta}L_{\text{total}}$
25:  end for
26: end while
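The Phase-2 objective (lines 14–24) can be sketched as a single loss function. The snippet below is our own illustration under assumed tensor shapes (per-token log-probabilities of the sampled tokens are `[batch, seq_len]`, logits are `[batch, seq_len, vocab]`, advantages are `[batch]`, and the KL to the reference is a simple sequence-level Monte Carlo estimate); it is not the authors' code.

```python
# Hedged sketch of the dual-track SMD objective: Track A is the clipped shadow
# policy gradient (the importance ratio is on-policy by construction), Track B
# is the dense-to-shadow KL distillation with a stop-gradient teacher.
import torch
import torch.nn.functional as F

def smd_loss(logp_shadow, logp_shadow_old, logp_shadow_ref,
             logits_shadow, logits_dense, advantages,
             eps=0.2, beta=0.02, lam=0.5):
    # --- Track A: shadow policy gradient ---
    ratio = torch.exp(logp_shadow.sum(-1) - logp_shadow_old.sum(-1))         # p_k
    surr = torch.minimum(ratio * advantages,
                         torch.clamp(ratio, 1 - eps, 1 + eps) * advantages)   # clipped surrogate
    kl_ref = (logp_shadow - logp_shadow_ref).sum(-1)                          # MC estimate of KL to pi_ref
    loss_pg = -(surr - beta * kl_ref).mean()

    # --- Track B: dense -> shadow distillation with sg[.] on the dense teacher ---
    p_dense = F.softmax(logits_dense, dim=-1).detach()                        # sg[dense policy]
    logq_shadow = F.log_softmax(logits_shadow, dim=-1)
    # F.kl_div(log q, p) computes KL(p || q): here KL(dense || shadow), averaged per batch element.
    loss_distill = F.kl_div(logq_shadow, p_dense, reduction="batchmean")

    return loss_pg + lam * loss_distill
```

The `detach()` plays the role of the stop-gradient operator $\text{sg}[\cdot]$ in line 21, so only the masked (shadow) branch receives distillation gradients while the dense branch acts purely as a teacher.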

## Appendix D Broader Societal Impacts

Our work focuses on dramatically reducing the computational barrier for Reinforcement Learning from Human Feedback \(RLHF\) and related alignment techniques\. By making long\-context RL alignment viable on standard hardware, Shadow Mask Distillation \(SMD\) democratizes access to advanced LLM safety and behavioral tuning methodologies\. On the positive side, this enables smaller research labs and independent researchers to align their models safely, fostering a more transparent and diverse AI research ecosystem\. It also reduces the carbon footprint associated with training massive models by eliminating redundant computations and memory spikes\. On the negative side, lower barriers to entry for model alignment mean that malicious actors could more easily tune models to bypass existing safety guardrails or generate harmful content at a large scale\. While our methodology itself is a neutral optimization technique, we emphasize the ongoing necessity for the community to develop robust, model\-agnostic safety evaluation frameworks that ensure aligned behaviors remain beneficial to society\.
