Quantized Keys Steal Attention: Bias Correction for KV-Cache Compression in Video Diffusion
Summary
This paper identifies a bias in attention weights caused by quantizing keys in KV-cache compression for chunk-wise autoregressive video diffusion, and proposes a per-attention-score correction that removes the bias with negligible overhead, recovering near-BF16 video quality at INT2 quantization.
# Quantized Keys Steal Attention: Bias Correction for KV-Cache Compression in Video Diffusion
Source: [https://arxiv.org/html/2605.26266](https://arxiv.org/html/2605.26266)
Tuna Tuncer (1,2), Felix Becker (2,†), Thomas Pfeil (2,†). Affiliations: (1) Technical University of Munich; (2) Tensordyne. Contact: tuna.tuncer@tum.de, felix.becker@tensordyne.ai, thomas.pfeil@tensordyne.ai
###### Abstract
Chunk-wise autoregressive video diffusion models rely on a KV cache of previously generated chunks to avoid redundant computation, but this cache quickly becomes a memory bottleneck as videos grow longer. Methods that quantize the KV cache to low bitwidths reduce memory pressure but degrade video quality. We show that a key driver of this degradation is a systematic bias in attention weights: due to the convexity of the exponential in softmax attention, quantization noise inflates the contribution of cached keys, a phenomenon we call the *Jensen bias*. This effect causes quantized keys to steal attention mass from the unquantized current chunk. We derive a per-attention-score correction that removes this bias in expectation, computed on the fly from the quantization step sizes of the cached keys and the query norm. Using a second-order Taylor approximation, the additional computational overhead is negligible, and no additional memory is needed alongside the cache. Evaluated on MAGI-1, SkyReels-V2, and HY-WorldPlay at INT2 quantization, our correction recovers most of the quality lost to aggressive quantization, reaching near-BF16 video quality, and can outperform INT4 quantization while using 50% less memory.
## 1 Introduction
Video diffusion models have made remarkable progress in generating short, high-fidelity clips (Yang et al., [2025](https://arxiv.org/html/2605.26266#bib.bib33); Kong et al., [2025](https://arxiv.org/html/2605.26266#bib.bib32); Team Wan et al., [2025](https://arxiv.org/html/2605.26266#bib.bib34)). Recent work on video generation models has introduced chunk-wise autoregressive video diffusion, where each chunk of frames is denoised independently and attends to previously generated chunks (Chen et al., [2024](https://arxiv.org/html/2605.26266#bib.bib35); Yin et al., [2025](https://arxiv.org/html/2605.26266#bib.bib36); Sand.ai et al., [2025](https://arxiv.org/html/2605.26266#bib.bib12); Chen et al., [2025](https://arxiv.org/html/2605.26266#bib.bib43); Sun et al., [2025](https://arxiv.org/html/2605.26266#bib.bib13)). To avoid recomputing the key and value representations of past chunks at every denoising step, autoregressive models store them in a KV cache and reuse them across subsequent chunks. In this setting, the KV cache acts as the model’s temporal memory: it determines how much previously generated visual context remains available when simulating the next chunk of a video or world trajectory.
Figure 1: Qualitative comparison on MAGI-1 for two representative prompts. Columns show successive frames from the same generated video. From top to bottom: BF16 baseline; asymmetric INT2 (QuaRot+RTN) KV-cache quantization of both keys and values; the same quantized setting with our correction. INT2 quantization quickly destroys subject and scene structure, whereas our correction substantially recovers the BF16-like visual quality and temporal consistency.

Figure 2: Attention weights for MAGI-1 for the prompt “a person” under INT2 KV-cache quantization. The visualization is taken from a representative layer, time step, and attention head. Panel (b) shows that, relative to the BF16 baseline in (a), quantization increases attention weights in the cached block of tokens and decreases them in the current chunk. This effect is quantified by the *attention masses* $P_{\mathcal{S}}$ and $P_{\mathcal{R}}$ of the cached token blocks and current chunks. Panel (c) shows that our correction largely restores the original attention weights.

Figure 3: Illustration of the *Jensen bias* and its correction on a single attention score. Left: Quantization noise $\delta \sim \mathrm{Uniform}[-\Delta/2, \Delta/2]$ with zero mean produces a noisy score $\hat{s} = s + \delta$ centered at $s$. Center: After exponentiation the distribution becomes right-skewed: its mean $\mathbb{E}[e^{\hat{s}}]$ strictly exceeds $e^{s}$ by the so-called *Jensen bias*. Right: Subtracting a correction $b$ shifts the mean $\mathbb{E}[e^{\hat{s}-b}]$ closer to $e^{s}$, largely removing the systematic Jensen bias.

To further reduce the attention cost, MAGI-1 (Sand.ai et al., [2025](https://arxiv.org/html/2605.26266#bib.bib12)) attends to a sliding window of the last $n$ cached chunks, yielding linear instead of quadratic scaling in video length.
This design introduces a fundamental memory–context trade\-off: increasing the window size improves temporal consistency by providing more past context, but also increases the size of the KV cache proportionally\. Due to memory capacity, memory bandwidth, and latency constraints in practical systems, the window size must be limited, restricting the temporal information available to the model and degrading long\-range consistency\(Xiet al\.,[2026](https://arxiv.org/html/2605.26266#bib.bib17); Samuelet al\.,[2026](https://arxiv.org/html/2605.26266#bib.bib16)\)\.
KV-cache quantization directly targets the underlying memory bottleneck by compressing the cached keys and values to lower bitwidths, thereby relaxing this trade-off: the same memory budget can support a larger context window, or a fixed window can be stored more efficiently. Prior work on KV-cache quantization for LLM inference (Liu et al., [2024](https://arxiv.org/html/2605.26266#bib.bib1); Hooper et al., [2024](https://arxiv.org/html/2605.26266#bib.bib2); Ashkboos et al., [2024](https://arxiv.org/html/2605.26266#bib.bib4)) has established effective techniques down to 2-bit precision. For autoregressive video models, we find that INT4 KV-cache quantization preserves reasonable quality, whereas reducing to INT2 leads to severely distorted frames ([Fig. 1](https://arxiv.org/html/2605.26266#S1.F1), [Fig. A2](https://arxiv.org/html/2605.26266#A9.F2), and [Fig. A3](https://arxiv.org/html/2605.26266#A10.F3)).
We identify a shift of *attention mass* toward cached tokens under aggressive quantization as an important source of this degradation (see the example in [Fig. 2](https://arxiv.org/html/2605.26266#S1.F2) and the definition in [Section 4.1](https://arxiv.org/html/2605.26266#S4.SS1)). This shift is consistent across layers, heads, denoising steps, and prompts, and correlates with poor video quality ([Fig. 1](https://arxiv.org/html/2605.26266#S1.F1)). Integer quantization introduces approximately zero-mean noise into the cached keys, leaving pre-softmax attention scores unbiased in expectation. However, the exponential in softmax breaks this symmetry: due to its convexity, positive deviations are amplified more than equally large negative deviations are suppressed. As a result, a symmetric score-level noise distribution becomes right-skewed after exponentiation, with its mean systematically exceeding the exponential of the original unquantized score ([Fig. 3](https://arxiv.org/html/2605.26266#S1.F3)). We refer to this systematic, convexity-induced inflation as the *Jensen bias*, as it is an instance of the Jensen gap studied in probability theory (Gao et al., [2020](https://arxiv.org/html/2605.26266#bib.bib42)). In chunk-wise autoregressive video diffusion, this bias inflates the cached-token contribution to the softmax partition sum at the expense of the current chunk.
Our correction directly targets the Jensen bias. Because the bias is systematic, it can be estimated from quantities available at inference time and subtracted from the cached-key attention scores before the softmax. This restores the balance between cached and current tokens without retraining or modifying the quantized KV cache values ([Fig. 2](https://arxiv.org/html/2605.26266#S1.F2)).
Our contributions are as follows:
- We identify the Jensen bias, a systematic inflation induced by KV-cache quantization, in which zero-mean cached-key score perturbations inflate the expected cached-token softmax contribution and shift attention mass away from the unquantized current chunk.
- We derive a theoretically grounded per-attention-score correction and show that a simple second-order Taylor approximation yields an effective, practical formula with negligible overhead.
- We demonstrate consistent benchmark improvements across multiple models and quantization schemes, validating the proposed correction from attention-level diagnostics through to end-to-end video quality.
## 2 Related Work
#### KV\-cache quantization for LLMs\.
The KV cache is a well-known memory bottleneck in long-context LLM inference (Kwon and others, [2023](https://arxiv.org/html/2605.26266#bib.bib5)), and a growing body of work addresses it through quantization. KIVI (Liu et al., [2024](https://arxiv.org/html/2605.26266#bib.bib1)) provides an early systematic study of KV cache element distributions, observing that keys exhibit channel-wise outliers while values do not, and exploits this asymmetry to achieve tuning-free 2-bit KV quantization. KVQuant (Hooper et al., [2024](https://arxiv.org/html/2605.26266#bib.bib2)) combines per-channel key quantization with non-uniform datatypes calibrated to the empirical KV distribution and explicit isolation of outlier entries, pushing KV caches below 4 bits with minimal perplexity loss. QuaRot (Ashkboos et al., [2024](https://arxiv.org/html/2605.26266#bib.bib4)) applies Hadamard rotations to spread channel-wise outliers before quantization, enabling outlier-free 4-bit inference. TurboQuant (Zandieh et al., [2025](https://arxiv.org/html/2605.26266#bib.bib40)) similarly leverages random rotations, framing KV-cache compression as an online vector quantization problem and applying scalar quantization in the rotated space to achieve near-optimal distortion at low bitwidth. AsymKV (Tao et al., [2024](https://arxiv.org/html/2605.26266#bib.bib6)) observes that model loss is more sensitive to key quantization than value quantization and proposes layer-wise asymmetric bit allocation, supporting our focus on key cache quantization. Our work is orthogonal to the approaches above in that we do not improve the quantization scheme itself, but instead analytically correct the systematic bias in the attention weights introduced by any such scheme.
#### Attention sensitivity and correction\.
Several works have studied how quantization and other perturbations affect the attention mechanism. Pandey et al. ([2023](https://arxiv.org/html/2605.26266#bib.bib7)) show that quantizing the softmax computation introduces a large bias in the softmax output, degrading accuracy in generative models, and propose an offline correction that can be folded into the quantization parameters. Our work targets a different source of bias, focusing on KV-cache quantization rather than softmax quantization. KVLinC (Saxena and Roy, [2025](https://arxiv.org/html/2605.26266#bib.bib8)) is conceptually closest to our approach: it introduces trainable linear correction adapters to compensate for errors from quantized keys. In contrast, our correction is training-free and analytically derived. SageAttention (Zhang et al., [2025](https://arxiv.org/html/2605.26266#bib.bib9)) smooths queries by subtracting channel means and adds a correction term to the scores. However, this targets quantization-friendliness of the $QK^{\top}$ product rather than the systematic bias from exponentiation. Yao et al. ([2024](https://arxiv.org/html/2605.26266#bib.bib10)) propose time step-aware corrections for quantized diffusion models, demonstrating that structure-aware corrections can substantially reduce quantization degradation, a principle our per-attention-score correction shares.
#### Autoregressive video diffusion and efficient caching\.
Chunk-wise autoregressive video diffusion models generate videos by denoising successive chunks that attend to previously generated chunks through a KV cache (Chen et al., [2024](https://arxiv.org/html/2605.26266#bib.bib35); Yin et al., [2025](https://arxiv.org/html/2605.26266#bib.bib36); Sand.ai et al., [2025](https://arxiv.org/html/2605.26266#bib.bib12); Chen et al., [2025](https://arxiv.org/html/2605.26266#bib.bib43); Sun et al., [2025](https://arxiv.org/html/2605.26266#bib.bib13)). Because the cache grows with each new chunk, a growing body of work aims to reduce its cost through cache compression and eviction (Ma et al., [2026](https://arxiv.org/html/2605.26266#bib.bib14); Chen et al., [2026a](https://arxiv.org/html/2605.26266#bib.bib15); Samuel et al., [2026](https://arxiv.org/html/2605.26266#bib.bib16)), sparse attention (Lv et al., [2026](https://arxiv.org/html/2605.26266#bib.bib38)), or direct quantization of the cached states (Xi et al., [2026](https://arxiv.org/html/2605.26266#bib.bib17)). Among these, QuantVideoGen (Xi et al., [2026](https://arxiv.org/html/2605.26266#bib.bib17)) is most directly related to our approach: it applies training-free KV-cache quantization using semantic-aware smoothing and progressive residual quantization to reduce the quantization error itself. Our approach is complementary: rather than reducing the quantization error, we analytically correct the bias it introduces in softmax attention. We validate this complementarity empirically in [Table 1](https://arxiv.org/html/2605.26266#S4.T1), where composing the two methods on MAGI-1 yields the best overall results.
## 3 Preliminaries
#### Integer quantization\.
Integer quantization maps a floating-point value to a discrete grid defined by a *scale* $\Delta$, also known as the step size between adjacent grid levels, and a *zero-point* $z$. Given a $B$-bit quantization target, each element $x$ is mapped to

$$x_q = \mathrm{clamp}\bigl(\lfloor x/\Delta \rceil + z,\; 0,\; 2^{B}-1\bigr), \tag{1}$$

where $\lfloor \cdot \rceil$ denotes rounding to nearest (RTN), and is reconstructed as $\hat{x} = (x_q - z)\cdot\Delta$. The round-trip $x \mapsto x_q \mapsto \hat{x}$ introduces an additive error $\epsilon = \hat{x} - x$ that is bounded by $|\epsilon| \leq \Delta/2$. In practice, both $\Delta$ and $z$ are chosen to cover the full $[\min, \max]$ range of the value being quantized.
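The round-trip and its error bound can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a common min/max variant in which the zero-point is folded into a floating-point offset (`lo`), and the names `quantize_minmax` and `dequantize_minmax` are hypothetical.

```python
def quantize_minmax(values, bits):
    """Asymmetric round-to-nearest quantization over the [min, max] range.

    Assumption: the zero-point is represented as a floating-point offset `lo`
    rather than as the integer z of Eq. (1).
    """
    lo, hi = min(values), max(values)
    levels = 2 ** bits - 1                          # largest integer code
    delta = (hi - lo) / levels if hi > lo else 1.0  # step size between grid levels
    codes = [min(max(round((x - lo) / delta), 0), levels) for x in values]
    return codes, delta, lo


def dequantize_minmax(codes, delta, lo):
    """Reconstruct approximate values from the integer codes."""
    return [c * delta + lo for c in codes]


keys = [-1.30, -0.20, 0.05, 0.40, 0.90, 1.70]
codes, delta, lo = quantize_minmax(keys, bits=2)
recon = dequantize_minmax(codes, delta, lo)
errors = [r - x for r, x in zip(recon, keys)]      # each |error| is at most delta / 2
```

At 2 bits only four grid levels are available, so the step size equals one third of the value range and the per-element error can be as large as one sixth of that range.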
#### Quantization granularity\.
The scale and zero-point can be shared at different granularities. In per-tensor quantization, one $(\Delta, z)$ pair is shared across an entire tensor. In per-token quantization, each token has its own $(\Delta_i, z_i)$. Group-wise per-token quantization further divides each token’s $d$ channels into groups of size $g$, with an independent $(\Delta_{i,j}, z_{i,j})$ per group $j$. The smaller the group of values sharing $(\Delta, z)$, the smaller the quantization error, but the larger the overall memory footprint.
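The granularity trade-off can be made concrete with a small sketch that computes min/max step sizes for one token at two group sizes. The helper `group_step_sizes` and the toy token are illustrative assumptions, not part of the paper.

```python
def group_step_sizes(token, bits, group_size):
    """Return one min/max step size per contiguous group of channels."""
    levels = 2 ** bits - 1
    steps = []
    for start in range(0, len(token), group_size):
        group = token[start:start + group_size]
        steps.append((max(group) - min(group)) / levels)
    return steps


# One token with d = 8 channels; the value 6.0 is an outlier.
token = [0.1, -0.2, 0.3, 0.0, -0.1, 6.0, 0.2, -0.3]
per_token = group_step_sizes(token, bits=2, group_size=8)  # one step size for the whole token
per_group = group_step_sizes(token, bits=2, group_size=4)  # two groups of four channels
```

The group without the outlier gets a much smaller step size than the per-token setting, at the price of storing two sets of quantization parameters instead of one.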
#### Hadamard rotation\.
Key vectors in transformer models often exhibit channel-wise outliers, i.e. a few channels have much larger magnitudes than the rest (Dettmers et al., [2022](https://arxiv.org/html/2605.26266#bib.bib18); Ashkboos et al., [2024](https://arxiv.org/html/2605.26266#bib.bib4)). These outliers inflate the quantization step size $\Delta$, degrading precision for all other channels. QuaRot (Ashkboos et al., [2024](https://arxiv.org/html/2605.26266#bib.bib4)) spreads the outlier energy across all channels by applying a randomized Hadamard rotation $H \in \mathbb{R}^{d \times d}$ (with $H^{\top}H = I$) to both keys and queries. The resulting distribution is more uniform, allowing for lower quantization errors. Because $H$ is orthogonal, the attention scores are preserved: $(Hq)^{\top}(Hk) = q^{\top}k$. For all ablation studies, we use such a Hadamard rotation before quantization, since this results in overall best quantized video quality.
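The score-preservation property can be checked numerically. The sketch below assumes a Sylvester-type Hadamard matrix combined with random sign flips as one possible randomized Hadamard rotation; it is not taken from the QuaRot code base, and all names are hypothetical. With only $d = 8$ channels the reduction in value range is modest.

```python
import math
import random


def hadamard(n):
    """Orthonormal n x n Hadamard matrix via the Sylvester construction (n a power of two)."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in row] for row in h]


def rotate(matrix, vec):
    """Matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(0)
d = 8
signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]
# Randomized rotation H = Hadamard @ diag(signs); the product is still orthogonal.
H = [[h * s for h, s in zip(row, signs)] for row in hadamard(d)]

key = [0.1, -0.2, 0.3, 0.0, -0.1, 6.0, 0.2, -0.3]   # one outlier channel
query = [rng.gauss(0.0, 1.0) for _ in range(d)]

score_plain = dot(query, key)
score_rotated = dot(rotate(H, query), rotate(H, key))  # equal up to rounding

rotated_key = rotate(H, key)
range_before = max(key) - min(key)
range_after = max(rotated_key) - min(rotated_key)      # the outlier is spread out
```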
#### Token Structure and Attention Decomposition\.
In autoregressive video diffusion, each chunk of video frames is encoded into a latent representation and patchified into a grid of spatio\-temporal tokens before entering the transformer\. Depending on the model and resolution, this results in several thousand tokens per chunk\. At each denoising step, every query in the current chunk attends to two groups of keys: \(i\) the keys of the current chunk, which are computed in full precision at every step, and \(ii\) the keys of previously generated chunks, which were written to a KV cache once each chunk finished denoising and are reused without recomputation\. The attention score matrix therefore decomposes into two blocks: a*current*block of tokens \(current\-chunk queries×\\timescurrent\-chunk keys\) and a*cached*block of tokens \(current\-chunk queries×\\timescached keys\)\.
We now turn to the effect of quantization on this attention mechanism and derive a correction that compensates for the resulting bias in the softmax computation\.
## 4 Method
We analyze the effect of KV\-cache quantization on softmax attention and show that it introduces a systematic bias that inflates the contribution of cached keys\. Based on this analysis, we derive a correction term that removes this bias in expectation, and present a practical approximation suitable for efficient implementation\.
### 4.1 Quantization Bias in Softmax Attention
Consider a single attention head with dimension $d$. For a query vector $q \in \mathbb{R}^{d}$ and key vectors $k_i \in \mathbb{R}^{d}$, where $i$ is the token index, the attention score and attention weight for token $i$ are

$$s_i = \frac{q^{\top}k_i}{\sqrt{d}}, \qquad p_i = \frac{e^{s_i}}{\sum_{j=1}^{N} e^{s_j}}. \tag{2}$$

Recall from [Section 3](https://arxiv.org/html/2605.26266#S3) that in autoregressive video generation, tokens from previously generated chunks are quantized and stored in the KV cache, while tokens of the current chunk have not yet been quantized. Let $\mathcal{S}$ denote the set of quantized *cached* key indices and $\mathcal{R}$ the set of unquantized *current-chunk* key indices, so that $\{1,\dots,N\} = \mathcal{S} \cup \mathcal{R}$. We define the partition sums

$$Z_{\mathcal{S}} = \sum_{i \in \mathcal{S}} e^{s_i}, \qquad Z_{\mathcal{R}} = \sum_{i \in \mathcal{R}} e^{s_i}, \qquad Z = Z_{\mathcal{S}} + Z_{\mathcal{R}}. \tag{3}$$

We also define the total attention mass on the cached block,

$$P_{\mathcal{S}} = \sum_{i \in \mathcal{S}} p_i = \frac{Z_{\mathcal{S}}}{Z_{\mathcal{S}} + Z_{\mathcal{R}}}, \tag{4}$$

which measures how much attention mass is assigned to cached keys, and is what we ultimately care about when reasoning about attention stealing. For a representative example of attention stealing, compare the left and middle panels in [Fig. 2](https://arxiv.org/html/2605.26266#S1.F2).
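As a small illustration of Eqs. (3) and (4), the following sketch computes the attention mass on the cached and current blocks for a single query. The function name `attention_mass` and the toy scores are assumptions made for the example.

```python
import math


def attention_mass(cached_scores, current_scores):
    """Return (P_S, P_R): softmax mass on cached keys and on current-chunk keys."""
    m = max(cached_scores + current_scores)            # subtract the max for numerical stability
    z_s = sum(math.exp(s - m) for s in cached_scores)  # cached partition sum (rescaled)
    z_r = sum(math.exp(s - m) for s in current_scores) # current partition sum (rescaled)
    return z_s / (z_s + z_r), z_r / (z_s + z_r)


p_s, p_r = attention_mass([0.2, -0.5, 1.0], [1.5, 0.3])
balanced = attention_mass([0.0, 0.0], [0.0, 0.0])      # equal scores give equal mass
```

The common rescaling by $e^{-m}$ cancels in the ratio, so the masses are unchanged by the stabilization.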
#### Quantization noise model\.
Let $\Delta_{i,c}$ denote the quantization step size for channel $c$ of cached token $i$. The quantize-dequantize round-trip yields $\hat{k}_i = k_i + \epsilon_i$ for $i \in \mathcal{S}$. For the per-element error of integer quantization $\epsilon_i \in \mathbb{R}^{d}$, we assume that the components are independent across channels $c \in \{1,\dots,d\}$ and uniformly distributed (Widrow et al., [1996](https://arxiv.org/html/2605.26266#bib.bib39)):

$$\epsilon_{i,c} \sim \mathcal{U}\!\left(-\frac{\Delta_{i,c}}{2},\; +\frac{\Delta_{i,c}}{2}\right). \tag{5}$$

Note that this noise model depends only on the round-to-nearest quantization operation itself, not on any preprocessing applied to the keys before quantization (such as Hadamard rotations in QuaRot; see [Appendix F](https://arxiv.org/html/2605.26266#A6)).

The quantized attention score is then

$$\hat{s}_i = \frac{q^{\top}\hat{k}_i}{\sqrt{d}} = s_i + \delta_i, \qquad \delta_i = \frac{q^{\top}\epsilon_i}{\sqrt{d}}, \tag{6}$$

where $\delta_i$ is the attention-score noise for key $i$. Under the uniform noise model, $\delta_i$ has zero mean and, by channel independence, its variance is

$$\sigma_i^{2} = \operatorname{Var}(\delta_i) = \frac{1}{12\,d}\sum_{c=1}^{d} q_c^{2}\,\Delta_{i,c}^{2}. \tag{7}$$

For unquantized keys $i \in \mathcal{R}$, we have $\hat{s}_i = s_i$.
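Equation (7) can be sanity-checked with a short Monte Carlo simulation of the uniform noise model. This is a sketch under the stated independence assumption; the query, step sizes, and sample count are arbitrary illustrative choices.

```python
import math
import random


def score_noise_variance(query, steps):
    """Eq. (7): variance of the score noise for per-channel step sizes."""
    d = len(query)
    return sum(q * q * s * s for q, s in zip(query, steps)) / (12.0 * d)


rng = random.Random(0)
query = [1.5, -0.7, 2.0, 0.3]
steps = [0.8, 1.2, 0.5, 1.0]   # one step size per channel of a single cached key
d = len(query)

samples = []
for _ in range(50_000):
    # Independent uniform rounding errors, one per channel (Eq. 5).
    eps = [rng.uniform(-s / 2, s / 2) for s in steps]
    samples.append(sum(q * e for q, e in zip(query, eps)) / math.sqrt(d))

mean = sum(samples) / len(samples)
empirical = sum((x - mean) ** 2 for x in samples) / len(samples)
predicted = score_noise_variance(query, steps)
```

The empirical mean should be close to zero and the empirical variance close to the prediction, up to sampling error.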
#### Jensen bias and attention stealing\.
Consider the quantized cached partition sum $\hat{Z}_{\mathcal{S}} = \sum_{i \in \mathcal{S}} e^{s_i + \delta_i}$. By linearity of expectation:

$$\mathbb{E}\bigl[\hat{Z}_{\mathcal{S}}\bigr] = \sum_{i \in \mathcal{S}} e^{s_i} \cdot \mathbb{E}\bigl[e^{\delta_i}\bigr]. \tag{8}$$

For each term, Jensen’s inequality applied to the convex function $\exp(\cdot)$ gives $\mathbb{E}[e^{\delta_i}] \geq e^{\mathbb{E}[\delta_i]} = 1$, so that $\mathbb{E}[\hat{Z}_{\mathcal{S}}] \geq Z_{\mathcal{S}}$. We call this systematic inflation of $\hat{Z}_{\mathcal{S}}$ caused by $\delta_i$ the *Jensen bias*. See [Fig. 3](https://arxiv.org/html/2605.26266#S1.F3) for an illustration of this bias and its correction on a single attention score value.

Since $Z_{\mathcal{R}}$ is unaffected by key quantization, inflation of $\hat{Z}_{\mathcal{S}}$ can shift attention mass toward cached keys. We quantify this *attention stealing* as

$$\Delta P_{\mathcal{S}} = \hat{P}_{\mathcal{S}} - P_{\mathcal{S}}, \qquad \hat{P}_{\mathcal{S}} = \frac{\hat{Z}_{\mathcal{S}}}{\hat{Z}_{\mathcal{S}} + Z_{\mathcal{R}}}. \tag{9}$$

Positive values indicate excess attention on the cached block, as observed in [Section 5.3](https://arxiv.org/html/2605.26266#S5.SS3).
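A single-score sketch makes the inflation tangible. For $\delta \sim \mathrm{Uniform}[-a, a]$ the expectation $\mathbb{E}[e^{\delta}]$ has the closed form $\sinh(a)/a$, which the code cross-checks numerically. The attention-stealing part is a simplification: it replaces $\hat{Z}_{\mathcal{S}}$ by its expectation and assumes every cached term is inflated by the same factor. All names and numbers are illustrative.

```python
import math


def expected_exp_uniform(half_width):
    """E[exp(delta)] for delta ~ Uniform[-a, a]; equals sinh(a)/a (and 1 at a = 0)."""
    a = half_width
    return math.sinh(a) / a if a != 0.0 else 1.0


def midpoint_mean_exp(half_width, n=10_000):
    """Numerical cross-check: average of exp over a fine midpoint grid on [-a, a]."""
    a = half_width
    h = 2 * a / n
    return sum(math.exp(-a + (k + 0.5) * h) for k in range(n)) / n


a = 1.0
inflation = expected_exp_uniform(a)   # multiplicative Jensen bias on e^{s_i}
numeric = midpoint_mean_exp(a)

# Simplified attention stealing: inflate the cached partition sum by its expected factor.
z_s, z_r = 2.0, 3.0
p_clean = z_s / (z_s + z_r)
p_biased = inflation * z_s / (inflation * z_s + z_r)
```

Even though the noise has zero mean, the inflation factor is strictly greater than one, and the cached mass grows at the expense of the current chunk.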
### 4.2 Correction of the Jensen Bias
We derive a per-attention-score correction $b_i$ that counteracts the Jensen bias, applied only to cached scores ($i \in \mathcal{S}$) and leaving current-chunk scores $s_i$ ($i \in \mathcal{R}$) unchanged. As shown in [Section 4.1](https://arxiv.org/html/2605.26266#S4.SS1.SSS0.Px1), each cached token’s contribution to the partition sum is individually biased upward: $\mathbb{E}[e^{s_i+\delta_i}] = e^{s_i}\,\mathbb{E}[e^{\delta_i}] \geq e^{s_i}$. We correct each token individually by requiring its expected contribution to match the unquantized value:

$$e^{s_i - b_i} \cdot \mathbb{E}\bigl[e^{\delta_i}\bigr] \overset{!}{=} e^{s_i} \quad\Longrightarrow\quad \boxed{\;b_i = \log \mathbb{E}\bigl[e^{\delta_i}\bigr].\;} \tag{10}$$

Since every term is individually unbiased, the corrected cached partition sum is unbiased by linearity of expectation:
$$\mathbb{E}\bigl[\tilde{Z}_{\mathcal{S}}\bigr] = \sum_{i \in \mathcal{S}} e^{s_i - b_i} \cdot \mathbb{E}\bigl[e^{\delta_i}\bigr] = \sum_{i \in \mathcal{S}} e^{s_i} = Z_{\mathcal{S}}. \tag{11}$$

At inference time, we apply this correction by subtracting $b_i$ from each quantized cached attention score $\hat{s}_i$ prior to the softmax, leaving scores from the current (unquantized) keys unchanged. Note that $b_i \geq 0$ always (since $\mathbb{E}[e^{\delta_i}] \geq 1$ by Jensen’s inequality). Furthermore, $b_i$ increases with the score-space noise, i.e. with $\Delta_{i,c}$.
Since the noise components $\epsilon_{i,c}$ are independent across channels, the expectation $\mathbb{E}[e^{\delta_i}]$ factorizes across dimensions, leading to the exact correction term (for the full derivation, see [Appendix A](https://arxiv.org/html/2605.26266#A1)):

$$\boxed{\;b_i = \sum_{c=1}^{d} \log\!\left(\frac{\sinh\!\left(\dfrac{q_c\,\Delta_{i,c}}{2\sqrt{d}}\right)}{\dfrac{q_c\,\Delta_{i,c}}{2\sqrt{d}}}\right)\;} \tag{12}$$

Setting $\alpha_c = q_c\Delta_{i,c}/(2\sqrt{d})$ and using the second-order Taylor expansion $\log(\sinh(\alpha_c)/\alpha_c) \approx \alpha_c^{2}/6$ for small $|\alpha_c|$, this simplifies to:

$$\boxed{\;b_i \approx \frac{1}{24\,d}\sum_{c=1}^{d} q_c^{2}\,\Delta_{i,c}^{2}\;} \tag{13}$$

The Taylor approximation is simple, interpretable, and numerically stable. It shows that the bias scales with both the squared query magnitude and the squared quantization step size. We use this approximation in all experiments. For a representative example of this proposed correction, compare the middle and right panels in [Fig. 2](https://arxiv.org/html/2605.26266#S1.F2).
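To see how close the two formulas are in a small example, the sketch below evaluates the exact correction of Eq. (12) and the Taylor approximation of Eq. (13) for one cached key. The function names, the small-argument guard, and the toy inputs are assumptions for illustration.

```python
import math


def log_sinhc(alpha):
    """log(sinh(alpha) / alpha), with the alpha -> 0 limit handled explicitly."""
    x = abs(alpha)
    if x < 1e-6:
        return x * x / 6.0   # leading Taylor term; higher-order terms are negligible here
    return math.log(math.sinh(x) / x)


def exact_correction(query, steps):
    """Eq. (12): exact correction under the uniform noise model."""
    d = len(query)
    return sum(log_sinhc(q * s / (2.0 * math.sqrt(d))) for q, s in zip(query, steps))


def taylor_correction(query, steps):
    """Eq. (13): second-order Taylor approximation of the correction."""
    d = len(query)
    return sum(q * q * s * s for q, s in zip(query, steps)) / (24.0 * d)


query = [1.5, -0.7, 2.0, 0.3]
steps = [0.8, 1.2, 0.5, 1.0]   # per-channel step sizes of one cached key
b_exact = exact_correction(query, steps)
b_taylor = taylor_correction(query, steps)
```

Because $\log(\sinh(\alpha)/\alpha) \leq \alpha^{2}/6$, the Taylor value never falls below the exact one; for the moderate $\alpha_c$ values in this example the two agree to well within one percent.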
#### Connection to the noise variance\.
Comparing [Eq. 13](https://arxiv.org/html/2605.26266#S4.E13) with [Eq. 7](https://arxiv.org/html/2605.26266#S4.E7), the Taylor correction is exactly half the score-space noise variance: $b_i \approx \sigma_i^{2}/2$. This follows from the cumulant generating function (CGF). For any random variable $X$ with cumulants $\kappa_1, \kappa_2, \kappa_3, \ldots$, the CGF satisfies

$$\log \mathbb{E}[e^{X}] = \kappa_1 + \frac{\kappa_2}{2} + \frac{\kappa_3}{6} + \cdots. \tag{14}$$

For zero-mean noise ($\kappa_1 = 0$), the leading term is $\kappa_2/2 = \sigma^{2}/2$, which depends only on the variance and not on the specific noise distribution. The exact closed-form correction in [Eq. 12](https://arxiv.org/html/2605.26266#S4.E12) relies on the uniform noise model of integer quantization, but the second-order Taylor approximation requires only the score-space noise variance $\sigma_i^{2}$. This means that extending the correction to other quantization formats reduces to estimating $\sigma_i^{2}$ under the appropriate error model: for floating-point formats such as FP, MXFP, and NVFP, whose rounding error is proportional to the magnitude of the quantized value but can be described by approximate additive noise models (Widrow et al., [1996](https://arxiv.org/html/2605.26266#bib.bib39)), one substitutes the corresponding score-space variance into $b_i \approx \sigma_i^{2}/2$.
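As a worked instance of the cumulant expansion, consider a single channel whose score noise is uniform on $[-a, a]$ with half-width $a = |q_c|\,\Delta_{i,c}/(2\sqrt{d})$. The cumulants of the uniform distribution are standard results; the block simply substitutes them into Eq. (14) and is meant as a consistency check rather than a new derivation.

```latex
\begin{align*}
  \delta \sim \mathcal{U}(-a, a):\qquad
  &\kappa_1 = 0, \quad \kappa_2 = \frac{a^2}{3}, \quad \kappa_3 = 0, \quad \kappa_4 = -\frac{2a^4}{15}, \\
  \log \mathbb{E}\bigl[e^{\delta}\bigr] = \log\frac{\sinh a}{a}
  &= \frac{\kappa_2}{2!} + \frac{\kappa_4}{4!} + \cdots
   = \frac{a^2}{6} - \frac{a^4}{180} + \cdots, \\
  \frac{a^2}{6}\bigg|_{a = |q_c|\Delta_{i,c}/(2\sqrt{d})}
  &= \frac{q_c^2\,\Delta_{i,c}^2}{24\,d}.
\end{align*}
```

Summing the last line over channels recovers Eq. (13), and the negative fourth-order term shows that the second-order approximation slightly overestimates the exact correction for uniform noise.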
#### Specialization to grouped per\-token quantization\.
In our experimental setting, each token’s $d$ channels are divided into $G = d/g$ groups of size $g$, and all channels within group $j$ share the same step size $\Delta_{i,j}$. Grouping channels with shared step sizes, and writing $\|q_j\|^{2} = \sum_{c \in \text{group } j} q_c^{2}$ for the per-group squared query norm, we obtain

$$b_i \approx \frac{1}{24\,d}\sum_{j=1}^{G} \Delta_{i,j}^{2}\,\|q_j\|^{2}. \tag{15}$$

The same correction extends to QuaRot by replacing $q$ with the rotated query $Hq$, so that $\|q_j\|^{2}$ becomes $\|(Hq)_j\|^{2}$ in [Eq. 15](https://arxiv.org/html/2605.26266#S4.E15) (for the full derivation, see [Appendix F](https://arxiv.org/html/2605.26266#A6)).
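The grouped correction and its use before the softmax can be sketched for a single query as follows. This is a plain-Python illustration under the assumptions of Eq. (15), not the FlexAttention-based implementation used in the paper; `grouped_taylor_correction`, `corrected_attention`, and the toy values are hypothetical.

```python
import math


def grouped_taylor_correction(query, group_steps, group_size):
    """Eq. (15): b_i from per-group step sizes and per-group squared query norms."""
    d = len(query)
    q_norms = [sum(q * q for q in query[j:j + group_size])
               for j in range(0, d, group_size)]
    return sum(s * s * n for s, n in zip(group_steps, q_norms)) / (24.0 * d)


def corrected_attention(cached_scores, corrections, current_scores):
    """Softmax over [cached - b_i, current]; current-chunk scores are left untouched."""
    shifted = [s - b for s, b in zip(cached_scores, corrections)] + list(current_scores)
    m = max(shifted)
    exps = [math.exp(s - m) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


query = [1.5, -0.7, 2.0, 0.3]
# Two cached keys, each with G = 2 groups of g = 2 channels.
steps_per_key = [[0.8, 0.5], [1.2, 1.0]]
b = [grouped_taylor_correction(query, steps, group_size=2) for steps in steps_per_key]

cached, current = [0.4, 0.1], [0.9, -0.2]   # quantized cached scores, exact current scores
weights = corrected_attention(cached, b, current)
uncorrected = corrected_attention(cached, [0.0, 0.0], current)
```

The per-group query norms depend only on the query, so they can be computed once per query and reused for every cached key; only the squared step sizes vary per key. With QuaRot, the rotated query $Hq$ would be passed in place of `query`.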
### 4.3 Effective bitwidth and computational complexity
For group-wise quantization with group size $g$, the effective bitwidth is $B_{\mathrm{eff}} = B + \frac{24}{g}$, accounting for the per-group metadata: a scale stored in FP8 (8 bits) and a zero-point stored in BF16 (16 bits). Our correction introduces no additional storage, as it depends only on existing quantization parameters.

The Taylor correction adds an $O(QK \cdot d/g)$ term to attention computation, compared to the standard $O(QK \cdot d)$ cost of $QK^{\top}$. Thus, the additional work is smaller by a factor of $g$ and is negligible in practice. In our FlexAttention-based implementation (Dong et al., [2024](https://arxiv.org/html/2605.26266#bib.bib29)) on MAGI-1 with QuaRot+RTN and group size $g = 32$, the correction adds approximately $5\%$ end-to-end latency overhead relative to the quantized baseline. For more details about these storage and computation costs, see [Appendix B](https://arxiv.org/html/2605.26266#A2).
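The effective bitwidth formula is easy to evaluate directly. The helper below is an illustrative sketch; its name and default arguments are assumptions that mirror the FP8 scale and BF16 zero-point described above.

```python
def effective_bitwidth(bits, group_size, scale_bits=8, zero_point_bits=16):
    """Bits per stored element, including per-group scale and zero-point metadata."""
    return bits + (scale_bits + zero_point_bits) / group_size


int2_g32 = effective_bitwidth(2, 32)   # default setting of the paper: 2 + 24/32
int2_g64 = effective_bitwidth(2, 64)   # larger groups amortize the metadata further

# Relative cost of the correction compared with the QK^T product: one term per group
# instead of one per channel.
correction_cost_ratio = 1 / 32
```

For $B = 2$ and $g = 32$ this gives 2.75 bits per element, the effective bitwidth quoted for the RTN and QuaRot+RTN rows of Table 1.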
Table 1: Effect of the proposed correction for MAGI-1, SkyReels-V2, and HY-WorldPlay. The correction consistently improves fidelity (PSNR, SSIM, LPIPS) and perceptual quality (VBench), recovering much of the degradation introduced by quantization. RTN and QuaRot+RTN rows use an effective bitwidth of $2.75$ at INT2; QVG rows on MAGI-1 use the default QVG configuration, which yields an effective bitwidth of approximately $2.52$. Standard errors for all metrics are reported in [Tables 2](https://arxiv.org/html/2605.26266#A7.T2) and [5](https://arxiv.org/html/2605.26266#A8.T5).
## 5 Experiments
We evaluate the effectiveness of our proposed correction by measuring its impact both on attention behavior and on end\-to\-end video quality across multiple metrics and models\.
### 5.1 Experimental Setup
#### Models\.
We evaluate our method on three autoregressive video diffusion models: MAGI\-1\(Sand\.aiet al\.,[2025](https://arxiv.org/html/2605.26266#bib.bib12)\)\(4\.5B\), SkyReels\-V2\(Chenet al\.,[2025](https://arxiv.org/html/2605.26266#bib.bib43)\)\(1\.3B\), and HY\-WorldPlay\(Sunet al\.,[2025](https://arxiv.org/html/2605.26266#bib.bib13)\)\(8B\)\. All use chunk\-wise generation with KV caching over previously generated chunks\. MAGI\-1 uses 16 denoising steps with a sliding window annealed from 5 to 2 chunks, SkyReels\-V2 uses 50 steps with a 5\-chunk window, and HY\-WorldPlay uses 4 steps\. Unless otherwise noted, all other generation hyperparameters remain at default values\.
#### Quantization configuration\.
We adopt group-wise per-token asymmetric INT2 quantization of key and value states as the default KV-cache compression setting throughout the paper. Unless otherwise noted, we use group size $g = 32$, FP8 E4M3 scales, and BF16 zero-points. We evaluate two quantization schemes: QuaRot+RTN (Ashkboos et al., [2024](https://arxiv.org/html/2605.26266#bib.bib4)) and plain RTN without rotation. Additionally, on MAGI-1 we evaluate QuantVideoGen (QVG) (Xi et al., [2026](https://arxiv.org/html/2605.26266#bib.bib17)) using its default configuration ($S = 1$, $B = 64$, $K = 256$) to demonstrate that our correction composes with upstream video-aware cache compression. We apply the Taylor-approximated bias correction from [Section 4.2](https://arxiv.org/html/2605.26266#S4.SS2) to all quantization schemes. Unquantized BF16 results serve as the reference outputs for fidelity metrics.
#### Metrics\.
We report fidelity metrics \(PSNR, SSIM\(Wanget al\.,[2004](https://arxiv.org/html/2605.26266#bib.bib3)\), and LPIPS\(Zhanget al\.,[2018](https://arxiv.org/html/2605.26266#bib.bib19)\)\) to measure the similarity between quantized and BF16 outputs on identical inputs\. We further evaluate generated videos using the VBench evaluation framework\(Huanget al\.,[2023](https://arxiv.org/html/2605.26266#bib.bib20)\)in the VBench\-Long setting from VBench\+\+\(Huanget al\.,[2024](https://arxiv.org/html/2605.26266#bib.bib21)\), which adapts the benchmark to long\-form videos\.[Table˜1](https://arxiv.org/html/2605.26266#S4.T1)reports the aggregate VBench score; per\-dimension results and Quality/Semantic sub\-scores are provided in[Appendix˜H](https://arxiv.org/html/2605.26266#A8)\.
#### Evaluation data\.
For MAGI-1 and SkyReels-V2, we evaluate on the first 30% of prompts from each VBench-Long dimension, generating 10-second videos (240 frames) and 7-second videos (177 frames), respectively. We do not evaluate on the full prompt set, as this is computationally prohibitive across all models and quantization configurations. For HY-WorldPlay, we generate 10-second videos (253 frames) from the 10 image–prompt pairs released in the official repository (Sun et al., [2025](https://arxiv.org/html/2605.26266#bib.bib13)). We do not report VBench scores for this model, as its required inputs (image, text prompt, and per-frame keyboard actions) are not provided by any VBench suite.
### 5.2 Main Results
KV-cache quantization substantially degrades video quality ([Table 1](https://arxiv.org/html/2605.26266#S4.T1), [Fig. 1](https://arxiv.org/html/2605.26266#S1.F1), [Figs. A2](https://arxiv.org/html/2605.26266#A9.F2) and [A3](https://arxiv.org/html/2605.26266#A10.F3)). Our correction improves fidelity metrics (PSNR, SSIM, LPIPS) and VBench scores across all three models and both quantization schemes ([Table 1](https://arxiv.org/html/2605.26266#S4.T1)). Notably, the correction improves every reported metric in every evaluated configuration, without any model-specific tuning. On MAGI-1, composing our correction with QVG achieves the best results across all metrics, confirming that the two methods are complementary: QVG reduces the quantization error while our correction removes the residual Jensen bias.
On MAGI-1 and SkyReels-V2, our correction largely closes the quality gap between INT2 KV-cache quantization and the BF16 baseline ([Table 1](https://arxiv.org/html/2605.26266#S4.T1)). The MAGI-1 per-dimension breakdown in [Appendix H](https://arxiv.org/html/2605.26266#A8) shows that these gains are broad-based across the VBench dimensions. On HY-WorldPlay, where VBench is not applicable, the correction consistently improves fidelity metrics ([Table 1](https://arxiv.org/html/2605.26266#S4.T1)). [Section 5.3](https://arxiv.org/html/2605.26266#S5.SS3) links these end-to-end gains to attention-level improvements, including reduced quantization-induced attention shift toward cached tokens.
Figure 4: Shift in attention mass assigned to the cached block of tokens before (purple) and after (orange) our correction on MAGI-1 under INT2 QuaRot+RTN. Positive values indicate that the quantized cached tokens steal attention from the current unquantized chunk. The median bias is large under INT2 quantization, and our correction significantly reduces this bias toward zero.
### 5.3 Ablation studies
We validate our correction by showing that reducing the Jensen bias improves metrics throughout the attention pipeline: attention mass balance, attention weights (JSD; [Appendix L](https://arxiv.org/html/2605.26266#A12)), attention outputs (MSE; [Appendix M](https://arxiv.org/html/2605.26266#A13)), and end-to-end video quality (PSNR, VBench). Together, these evaluations link the score-level Jensen bias to quality degradation and support attention stealing as a key mechanism behind the gains in [Section 5.2](https://arxiv.org/html/2605.26266#S5.SS2). All results in this section use MAGI-1 with QuaRot+RTN quantization, the best VBench setting for this model, and are averaged across heads, layers, and denoising steps.
#### Attention mass shift\.
Attention stealing caused by the Jensen bias is illustrated in [Fig. 2](https://arxiv.org/html/2605.26266#S1.F2). We quantify this effect by measuring the shift in attention mass assigned to cached tokens, $\Delta P_{\mathcal{S}}=\hat{P}_{\mathcal{S}}-P_{\mathcal{S}}$, where $\hat{P}_{\mathcal{S}}$ and $P_{\mathcal{S}}$ are the cached-block attention mass with and without quantization, aggregated across all layers, denoising steps, and attention heads. [Figure 4](https://arxiv.org/html/2605.26266#S5.F4) shows that under INT2 quantization, $\Delta P_{\mathcal{S}}$ is strongly positive, confirming that cached tokens steal attention mass. Our correction shifts the distribution back toward zero, though it slightly over-corrects into negative values, consistent with the Taylor approximation's behavior at aggressive bitwidths ([Appendix A](https://arxiv.org/html/2605.26266#A1)). Corresponding INT4 results are provided in [Appendix K](https://arxiv.org/html/2605.26266#A11).
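The attention-mass shift can be reproduced on a toy example. The sketch below is illustrative only: the function names are hypothetical, and the perturbation is a symmetric two-point noise ($+a$ on one cached score, $-a$ on another) rather than the uniform noise model used in the paper. It shows that zero-mean score noise still inflates the cached partition mass, and that subtracting the log of the expected exponential (here $\log\cosh a$ for this toy noise) restores the balance.

```python
import math


def cached_mass(cached_scores, current_scores):
    """Fraction of softmax attention mass that lands on the cached block."""
    z_cached = sum(math.exp(s) for s in cached_scores)
    z_current = sum(math.exp(s) for s in current_scores)
    return z_cached / (z_cached + z_current)


def mass_shift(clean_cached, noisy_cached, current_scores):
    """Delta P_S: cached mass with perturbed scores minus cached mass with clean scores."""
    return (cached_mass(noisy_cached, current_scores)
            - cached_mass(clean_cached, current_scores))


a = 1.0
clean = [0.0, 0.0]        # clean cached scores
noisy = [+a, -a]          # zero-mean perturbation of the cached scores
current = [0.0, 0.0]      # unquantized current-chunk scores

shift = mass_shift(clean, noisy, current)

# Log of the expected exponential for this toy two-point noise (an assumption
# of this example, not the paper's uniform-noise formula).
b = math.log(math.cosh(a))
corrected = [s - b for s in noisy]
shift_corrected = mass_shift(clean, corrected, current)
```

Here `shift` is positive because $e^{a}+e^{-a}>2$, while `shift_corrected` vanishes up to rounding.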
#### Storage–quality trade\-off\.
Our method improves PSNR across all tested group sizes, including the most storage-efficient settings ([Fig. 5](https://arxiv.org/html/2605.26266#S5.F5)). The same trend holds for SSIM and LPIPS ([Appendix N](https://arxiv.org/html/2605.26266#A14)). Thus, it preserves the group-size-controlled storage–quality trade-off while uniformly shifting it toward higher quality.
Beyond quality gains, our approach also substantially reduces storage and bandwidth requirements at comparable visual fidelity\. For example, using 2\.19 effective bits with our method outperforms 4\.38 effective bits without correction, corresponding to a 50% reduction in memory cost\.
Figure 5: Trade-off between image quality, measured by PSNR, and memory footprint of the KV cache, measured as effective bitwidth per element, on MAGI-1 under quantization. Bitwidths correspond to group sizes $g=\{128,64,32\}$. Whiskers indicate standard error.
### 5.4 Cross-domain experiment: LLM partial prefill
Although our main experiments target chunk-wise video diffusion, chunked LLM prefill has a similar cached/current attention structure: a quantized cached prefix and a multi-token current prefill block appear in the same softmax. We therefore run a small-scale diagnostic study on three decoder-only LLMs using LongBench-Pro English prompts (Chen et al., [2026b](https://arxiv.org/html/2605.26266#bib.bib22)). We compare BF16, INT2 KV-cache quantization, and INT2 with our Taylor correction under teacher-forced negative log-likelihood (NLL), using paired model/chunk-size/prompt-length configurations.
Across the LLM experiments, INT2 generally increases NLL relative to BF16, while the Taylor correction reduces NLL relative to plain INT2. This is consistent with the mechanism studied in our video experiments, but we do not interpret it as a comprehensive LLM benchmark. Details and prompt-length breakdowns are provided in [Appendix O](https://arxiv.org/html/2605.26266#A15).
## 6 Discussion and Conclusion
We identify a systematic Jensen bias in softmax attention induced by KV-cache quantization: zero-mean key noise is amplified by the exponential, inflating cached partition mass and shifting attention away from the unquantized current chunk. We derive a per-attention-score correction that removes this bias in expectation and use a second-order Taylor approximation whose cost is negligible relative to the $QK^{\top}$ computation. Across MAGI-1, SkyReels-V2, and HY-WorldPlay, the correction consistently improves fidelity (PSNR, SSIM, LPIPS) and yields large VBench gains on MAGI-1 and SkyReels-V2, especially under INT2 quantization.
#### Limitations & future work\.
Our experiments focus on chunked autoregressive video diffusion, where a multi-token current chunk attends to a quantized cached context. This cached/current structure is central to the attention-mass shift studied here. Preliminary LLM results suggest that a similar bias can arise in quantized KV caches. Chunked prefill (where each prefill contains many current tokens) with KV-cache quantization (Gokhale et al., [2025](https://arxiv.org/html/2605.26266#bib.bib41)) is therefore a natural target for further exploration. Standard single-token decoding offers less headroom for the correction because many cached tokens compete with only one unquantized current token.
Our correction is unbiased only in expectation and relies on the assumed zero\-mean, approximately uniform quantization\-noise model\. It works best when cached attention is spread over enough tokens for score perturbations to average out\. When attention is concentrated on a few cached tokens, the effective sample size is small and individual noise realizations can dominate, limiting the correction’s gain\. Quantizers with nonuniform or biased error may likewise require a modified derivation\.
Because the correction acts only on attention scores, it is orthogonal to the upstream compression method\. Extending it to floating\-point formats such as FP, MXFP, and NVFP, whose non\-uniform grids produce a different noise distribution, remains an open direction\.
## Acknowledgments and Disclosure of Funding
This work was carried out as part of the first author’s Master’s thesis at the Technical University of Munich in collaboration with Tensordyne\. We thank Dr\.\-Ing\. Victor M\. van Santen for his advice and guidance throughout this project, and Prof\. Dr\.\-Ing\. Hussam Amrouch for his supervision at TUM\. We further thank Michael Truong Le and Thomas Elsken at Tensordyne for their helpful discussions during the course of this work\.
## References
- S\. Ashkboos, A\. Mohtashami, M\. L\. Croci, B\. Li, M\. Jaggi, D\. Alistarh, T\. Hoefler, and J\. Hensman \(2024\)QuaRot: outlier\-free 4\-bit inference in rotated llms\.arXiv preprint arXiv:2404\.00456\.Cited by:[§1](https://arxiv.org/html/2605.26266#S1.p3.1),[§2](https://arxiv.org/html/2605.26266#S2.SS0.SSS0.Px1.p1.1),[§3](https://arxiv.org/html/2605.26266#S3.SS0.SSS0.Px3.p1.5),[§5\.1](https://arxiv.org/html/2605.26266#S5.SS1.SSS0.Px2.p1.4)\.
- Diffusion forcing: next\-token prediction meets full\-sequence diffusion\.External Links:2407\.01392,[Link](https://arxiv.org/abs/2407.01392)Cited by:[§1](https://arxiv.org/html/2605.26266#S1.p1.1),[§2](https://arxiv.org/html/2605.26266#S2.SS0.SSS0.Px3.p1.1)\.
- G\. Chen, D\. Lin, J\. Yang, C\. Lin, J\. Zhu, M\. Fan, H\. Zhang, S\. Chen, Z\. Chen, C\. Ma, W\. Xiong, W\. Wang, N\. Pang, K\. Kang, Z\. Xu, Y\. Jin, Y\. Liang, Y\. Song, P\. Zhao, B\. Xu, D\. Qiu, D\. Li, Z\. Fei, Y\. Li, and Y\. Zhou \(2025\)SkyReels\-v2: infinite\-length film generative model\.External Links:2504\.13074,[Link](https://arxiv.org/abs/2504.13074)Cited by:[§1](https://arxiv.org/html/2605.26266#S1.p1.1),[§2](https://arxiv.org/html/2605.26266#S2.SS0.SSS0.Px3.p1.1),[§5\.1](https://arxiv.org/html/2605.26266#S5.SS1.SSS0.Px1.p1.1)\.
- S\. Chen, C\. Wei, S\. Sun, P\. Nie, K\. Zhou, G\. Zhang, M\. Yang, and W\. Chen \(2026a\)Context forcing: consistent autoregressive video generation with long context\.External Links:2602\.06028,[Link](https://arxiv.org/abs/2602.06028)Cited by:[§2](https://arxiv.org/html/2605.26266#S2.SS0.SSS0.Px3.p1.1)\.
- Z\. Chen, X\. Wu, J\. Jia, C\. Gao, Q\. Fu, D\. Zhang, and S\. Hu \(2026b\)LongBench pro: a more realistic and comprehensive bilingual long\-context evaluation benchmark\.External Links:2601\.02872,[Link](https://arxiv.org/abs/2601.02872)Cited by:[§O\.1](https://arxiv.org/html/2605.26266#A15.SS1.p1.2),[§5\.4](https://arxiv.org/html/2605.26266#S5.SS4.p1.1)\.
- T\. Dettmers, M\. Lewis, Y\. Belkada, and L\. Zettlemoyer \(2022\)LLM\.int8\(\): 8\-bit matrix multiplication for transformers at scale\.External Links:2208\.07339,[Link](https://arxiv.org/abs/2208.07339)Cited by:[§3](https://arxiv.org/html/2605.26266#S3.SS0.SSS0.Px3.p1.5)\.
- J\. Dong, B\. Feng, D\. Guessous, Y\. Liang, and H\. He \(2024\)Flex attention: a programming model for generating optimized attention kernels\.External Links:2412\.05496,[Link](https://arxiv.org/abs/2412.05496)Cited by:[Appendix C](https://arxiv.org/html/2605.26266#A3.p1.1),[§4\.3](https://arxiv.org/html/2605.26266#S4.SS3.p2.6)\.
- A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Yang, A\. Fan,et al\.\(2024\)The llama 3 herd of models\.arXiv preprint arXiv:2407\.21783\.External Links:[Link](https://arxiv.org/abs/2407.21783)Cited by:[§O\.1](https://arxiv.org/html/2605.26266#A15.SS1.p1.2)\.
- X\. Gao, M\. Sitharam, and A\. E\. Roitberg \(2020\)Bounds on the jensen gap, and implications for mean\-concentrated distributions\.arXiv preprint arXiv:1712\.05267\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.1712.05267),[Link](https://arxiv.org/abs/1712.05267)Cited by:[§1](https://arxiv.org/html/2605.26266#S1.p4.1)\.
- S\. Gokhale, D\. Das, R\. Patwari, A\. Sirasao, and E\. Delaye \(2025\)KV pareto: systems\-level optimization of kv cache and model compression for long context inference\.External Links:2512\.01953,[Link](https://arxiv.org/abs/2512.01953)Cited by:[§6](https://arxiv.org/html/2605.26266#S6.SS0.SSS0.Px1.p1.1)\.
- C\. Hooper, S\. Kim, H\. Mohammadzadeh, M\. W\. Mahoney, Y\. S\. Shao, K\. Keutzer, and A\. Gholami \(2024\)KVQuant: towards 10 million context length llm inference with kv cache quantization\.arXiv preprint arXiv:2401\.18079\.Cited by:[§1](https://arxiv.org/html/2605.26266#S1.p3.1),[§2](https://arxiv.org/html/2605.26266#S2.SS0.SSS0.Px1.p1.1)\.
- Z\. Huang, Y\. He, J\. Yu, F\. Zhang, C\. Si, Y\. Jiang, Y\. Zhang, T\. Wu, Q\. Jin, N\. Chanpaisit, Y\. Wang, X\. Chen, L\. Wang, D\. Lin, Y\. Qiao, and Z\. Liu \(2023\)VBench: comprehensive benchmark suite for video generative models\.External Links:2311\.17982,[Link](https://arxiv.org/abs/2311.17982)Cited by:[§5\.1](https://arxiv.org/html/2605.26266#S5.SS1.SSS0.Px3.p1.1)\.
- Z\. Huang, F\. Zhang, X\. Xu, Y\. He, J\. Yu, Z\. Dong, Q\. Ma, N\. Chanpaisit, C\. Si, Y\. Jiang, Y\. Wang, X\. Chen, Y\. Chen, L\. Wang, D\. Lin, Y\. Qiao, and Z\. Liu \(2024\)VBench\+\+: comprehensive and versatile benchmark suite for video generative models\.External Links:2411\.13503,[Link](https://arxiv.org/abs/2411.13503)Cited by:[Appendix H](https://arxiv.org/html/2605.26266#A8.p1.1),[§5\.1](https://arxiv.org/html/2605.26266#S5.SS1.SSS0.Px3.p1.1)\.
- A\. Q\. Jiang, A\. Sablayrolles, A\. Mensch, C\. Bamford, D\. S\. Chaplot, D\. de las Casas, F\. Bressand, G\. Lengyel, G\. Lample, L\. Saulnier, L\. R\. Lavaud, M\. Lachaux, P\. Stock, T\. Le Scao, T\. Lavril, T\. Wang, T\. Lacroix, and W\. El Sayed \(2023\)Mistral 7b\.arXiv preprint arXiv:2310\.06825\.External Links:[Link](https://arxiv.org/abs/2310.06825)Cited by:[§O\.1](https://arxiv.org/html/2605.26266#A15.SS1.p1.2)\.
- W\. Kong, Q\. Tian, Z\. Zhang, R\. Min, Z\. Dai, J\. Zhou, J\. Xiong, X\. Li, B\. Wu, J\. Zhang, K\. Wu, Q\. Lin, J\. Yuan, Y\. Long, A\. Wang, A\. Wang, C\. Li, D\. Huang, F\. Yang, H\. Tan, H\. Wang, J\. Song, J\. Bai, J\. Wu, J\. Xue, J\. Wang, K\. Wang, M\. Liu, P\. Li, S\. Li, W\. Wang, W\. Yu, X\. Deng, Y\. Li, Y\. Chen, Y\. Cui, Y\. Peng, Z\. Yu, Z\. He, Z\. Xu, Z\. Zhou, Z\. Xu, Y\. Tao, Q\. Lu, S\. Liu, D\. Zhou, H\. Wang, Y\. Yang, D\. Wang, Y\. Liu, J\. Jiang, and C\. Zhong \(2025\)HunyuanVideo: a systematic framework for large video generative models\.External Links:2412\.03603,[Link](https://arxiv.org/abs/2412.03603)Cited by:[§1](https://arxiv.org/html/2605.26266#S1.p1.1)\.
- W\. Kwonet al\.\(2023\)Efficient memory management for large language model serving with pagedattention\.arXiv preprint arXiv:2309\.06180\.Cited by:[§2](https://arxiv.org/html/2605.26266#S2.SS0.SSS0.Px1.p1.1)\.
- Z\. Liu, J\. Yuan, H\. Jin, S\. Zhong, Z\. Xu, V\. Braverman, B\. Chen, and X\. Hu \(2024\)KIVI: a tuning\-free asymmetric 2bit quantization for KV cache\.InForty\-first International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=L057s2Rq8O)Cited by:[§1](https://arxiv.org/html/2605.26266#S1.p3.1),[§2](https://arxiv.org/html/2605.26266#S2.SS0.SSS0.Px1.p1.1)\.
- C\. Lv, Y\. Shi, Y\. Huang, R\. Gong, S\. Ren, and W\. Wang \(2026\)Light forcing: accelerating autoregressive video diffusion via sparse attention\.External Links:2602\.04789,[Link](https://arxiv.org/abs/2602.04789)Cited by:[§2](https://arxiv.org/html/2605.26266#S2.SS0.SSS0.Px3.p1.1)\.
- Y\. Ma, X\. Zheng, J\. Xu, X\. Xu, F\. Ling, X\. Zheng, H\. Kuang, H\. Li, X\. Wang, X\. Xiao, F\. Chao, and R\. Ji \(2026\)Flow caching for autoregressive video generation\.External Links:2602\.10825,[Link](https://arxiv.org/abs/2602.10825)Cited by:[§2](https://arxiv.org/html/2605.26266#S2.SS0.SSS0.Px3.p1.1)\.
- Meta \(2024\)Meta Llama 3\.1 8B model card\.Note:[https://huggingface\.co/meta\-llama/Llama\-3\.1\-8B](https://huggingface.co/meta-llama/Llama-3.1-8B)Accessed: 2026\-05\-07Cited by:[§O\.1](https://arxiv.org/html/2605.26266#A15.SS1.p1.2)\.
- Mistral AI \(2024\)Mistral\-7B\-Instruct\-v0\.3 model card\.Note:[https://huggingface\.co/mistralai/Mistral\-7B\-Instruct\-v0\.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)Accessed: 2026\-05\-07Cited by:[§O\.1](https://arxiv.org/html/2605.26266#A15.SS1.p1.2)\.
- N\. P\. Pandey, M\. Fournarakis, C\. Patel, and M\. Nagel \(2023\)Softmax bias correction for quantized generative models\.External Links:2309\.01729,[Link](https://arxiv.org/abs/2309.01729)Cited by:[§2](https://arxiv.org/html/2605.26266#S2.SS0.SSS0.Px2.p1.1)\.
- Qwen, A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang,et al\.\(2024\)Qwen2\.5 technical report\.arXiv preprint arXiv:2412\.15115\.External Links:[Link](https://arxiv.org/abs/2412.15115)Cited by:[§O\.1](https://arxiv.org/html/2605.26266#A15.SS1.p1.2)\.
- Qwen \(2024\)Qwen2\.5\-32B\-Instruct model card\.Note:[https://huggingface\.co/Qwen/Qwen2\.5\-32B\-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct)Accessed: 2026\-05\-07Cited by:[§O\.1](https://arxiv.org/html/2605.26266#A15.SS1.p1.2)\.
- D\. Samuel, I\. Tzachor, M\. Levy, M\. Green, G\. Chechik, and R\. Ben\-Ari \(2026\)Fast autoregressive video diffusion and world models with temporal cache compression and sparse attention\.External Links:2602\.01801,[Link](https://arxiv.org/abs/2602.01801)Cited by:[§1](https://arxiv.org/html/2605.26266#S1.p2.1),[§2](https://arxiv.org/html/2605.26266#S2.SS0.SSS0.Px3.p1.1)\.
- Sand\.ai, H\. Teng, H\. Jia, L\. Sun, L\. Li, M\. Li, M\. Tang, S\. Han, T\. Zhang, W\. Q\. Zhang, W\. Luo, X\. Kang, Y\. Sun, Y\. Cao, Y\. Huang, Y\. Lin, Y\. Fang, Z\. Tao, Z\. Zhang, Z\. Wang, Z\. Liu, D\. Shi, G\. Su, H\. Sun, H\. Pan, J\. Wang, J\. Sheng, M\. Cui, M\. Hu, M\. Yan, S\. Yin, S\. Zhang, T\. Liu, X\. Yin, X\. Yang, X\. Song, X\. Hu, Y\. Zhang, and Y\. Li \(2025\)MAGI\-1: autoregressive video generation at scale\.External Links:2505\.13211,[Link](https://arxiv.org/abs/2505.13211)Cited by:[§1](https://arxiv.org/html/2605.26266#S1.p1.1),[§1](https://arxiv.org/html/2605.26266#S1.p2.1),[§2](https://arxiv.org/html/2605.26266#S2.SS0.SSS0.Px3.p1.1),[§5\.1](https://arxiv.org/html/2605.26266#S5.SS1.SSS0.Px1.p1.1)\.
- U\. Saxena and K\. Roy \(2025\)KVLinC : kv cache quantization with hadamard rotation and linear correction\.External Links:2510\.05373,[Link](https://arxiv.org/abs/2510.05373)Cited by:[§2](https://arxiv.org/html/2605.26266#S2.SS0.SSS0.Px2.p1.1)\.
- W\. Sun, H\. Zhang, H\. Wang, J\. Wu, Z\. Wang, Z\. Wang, Y\. Wang, J\. Zhang, T\. Wang, and C\. Guo \(2025\)WorldPlay: towards long\-term geometric consistency for real\-time interactive world modeling\.External Links:2512\.14614,[Link](https://arxiv.org/abs/2512.14614)Cited by:[§1](https://arxiv.org/html/2605.26266#S1.p1.1),[§2](https://arxiv.org/html/2605.26266#S2.SS0.SSS0.Px3.p1.1),[§5\.1](https://arxiv.org/html/2605.26266#S5.SS1.SSS0.Px1.p1.1),[§5\.1](https://arxiv.org/html/2605.26266#S5.SS1.SSS0.Px4.p1.1)\.
- Q\. Tao, W\. Yu, and J\. Zhou \(2024\)AsymKV: enabling 1\-bit quantization of kv cache with layer\-wise asymmetric quantization configurations\.External Links:2410\.13212,[Link](https://arxiv.org/abs/2410.13212)Cited by:[§2](https://arxiv.org/html/2605.26266#S2.SS0.SSS0.Px1.p1.1)\.
- Team Wan, A\. Wang, B\. Ai, B\. Wen, C\. Mao, C\. Xie, D\. Chen, F\. Yu, H\. Zhao, J\. Yang, J\. Zeng, J\. Wang, J\. Zhang, J\. Zhou, J\. Wang, J\. Chen, K\. Zhu, K\. Zhao, K\. Yan, L\. Huang, M\. Feng, N\. Zhang, P\. Li, P\. Wu, R\. Chu, R\. Feng, S\. Zhang, S\. Sun, T\. Fang, T\. Wang, T\. Gui, T\. Weng, T\. Shen, W\. Lin, W\. Wang, W\. Wang, W\. Zhou, W\. Wang, W\. Shen, W\. Yu, X\. Shi, X\. Huang, X\. Xu, Y\. Kou, Y\. Lv, Y\. Li, Y\. Liu, Y\. Wang, Y\. Zhang, Y\. Huang, Y\. Li, Y\. Wu, Y\. Liu, Y\. Pan, Y\. Zheng, Y\. Hong, Y\. Shi, Y\. Feng, Z\. Jiang, Z\. Han, Z\. Wu, and Z\. Liu \(2025\)Wan: open and advanced large\-scale video generative models\.External Links:2503\.20314,[Link](https://arxiv.org/abs/2503.20314)Cited by:[§1](https://arxiv.org/html/2605.26266#S1.p1.1)\.
- Z\. Wang, A\. C\. Bovik, H\. R\. Sheikh, and E\. P\. Simoncelli \(2004\)Image quality assessment: from error visibility to structural similarity\.IEEE Transactions on Image Processing13\(4\),pp\. 600–612\.Cited by:[§5\.1](https://arxiv.org/html/2605.26266#S5.SS1.SSS0.Px3.p1.1)\.
- B\. Widrow, I\. Kollar, and M\. Liu \(1996\)Statistical theory of quantization\.IEEE Transactions on Instrumentation and Measurement45\(2\),pp\. 353–361\.External Links:[Document](https://dx.doi.org/10.1109/19.492748)Cited by:[§4\.1](https://arxiv.org/html/2605.26266#S4.SS1.SSS0.Px1.p1.7),[§4\.2](https://arxiv.org/html/2605.26266#S4.SS2.SSS0.Px1.p1.8)\.
- H\. Xi, S\. Yang, Y\. Zhao, M\. Li, H\. Cai, X\. Li, Y\. Lin, Z\. Zhang, J\. Zhang, X\. Li, Z\. Xu, J\. Wu, C\. Xu, I\. Stoica, S\. Han, and K\. Keutzer \(2026\)Quant videogen: auto\-regressive long video generation via 2\-bit kv\-cache quantization\.External Links:2602\.02958,[Link](https://arxiv.org/abs/2602.02958)Cited by:[§1](https://arxiv.org/html/2605.26266#S1.p2.1),[§2](https://arxiv.org/html/2605.26266#S2.SS0.SSS0.Px3.p1.1),[§5\.1](https://arxiv.org/html/2605.26266#S5.SS1.SSS0.Px2.p1.4)\.
- Z\. Yang, J\. Teng, W\. Zheng, M\. Ding, S\. Huang, J\. Xu, Y\. Yang, W\. Hong, X\. Zhang, G\. Feng, D\. Yin, Y\. Zhang, W\. Wang, Y\. Cheng, B\. Xu, X\. Gu, Y\. Dong, and J\. Tang \(2025\)CogVideoX: text\-to\-video diffusion models with an expert transformer\.External Links:2408\.06072,[Link](https://arxiv.org/abs/2408.06072)Cited by:[§1](https://arxiv.org/html/2605.26266#S1.p1.1)\.
- Y\. Yao, F\. Tian, J\. Chen, H\. Lin, G\. Dai, Y\. Liu, and J\. Wang \(2024\)Timestep\-aware correction for quantized diffusion models\.External Links:2407\.03917,[Link](https://arxiv.org/abs/2407.03917)Cited by:[§2](https://arxiv.org/html/2605.26266#S2.SS0.SSS0.Px2.p1.1)\.
- T\. Yin, Q\. Zhang, R\. Zhang, W\. T\. Freeman, F\. Durand, E\. Shechtman, and X\. Huang \(2025\)From slow bidirectional to fast autoregressive video diffusion models\.External Links:2412\.07772,[Link](https://arxiv.org/abs/2412.07772)Cited by:[§1](https://arxiv.org/html/2605.26266#S1.p1.1),[§2](https://arxiv.org/html/2605.26266#S2.SS0.SSS0.Px3.p1.1)\.
- A\. Zandieh, M\. Daliri, M\. Hadian, and V\. Mirrokni \(2025\)TurboQuant: online vector quantization with near\-optimal distortion rate\.External Links:2504\.19874,[Link](https://arxiv.org/abs/2504.19874)Cited by:[§2](https://arxiv.org/html/2605.26266#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Zhang, H\. Huang, P\. Zhang, J\. Wei, J\. Zhu, and J\. Chen \(2025\)Sageattention2: efficient attention with thorough outlier smoothing and per\-thread int4 quantization\.InInternational Conference on Machine Learning \(ICML\),Cited by:[§2](https://arxiv.org/html/2605.26266#S2.SS0.SSS0.Px2.p1.1)\.
- R\. Zhang, P\. Isola, A\. A\. Efros, E\. Shechtman, and O\. Wang \(2018\)The unreasonable effectiveness of deep features as a perceptual metric\.External Links:1801\.03924,[Link](https://arxiv.org/abs/1801.03924)Cited by:[§5\.1](https://arxiv.org/html/2605.26266#S5.SS1.SSS0.Px3.p1.1)\.
## Appendix A Exact Correction: Full Derivation
We derive the exact formula for $b_i=\log\mathbb{E}[e^{\delta_i}\mid\{\Delta_{i,c}\}]$ under the uniform quantization noise model of [Section 4.1](https://arxiv.org/html/2605.26266#S4.SS1).

Recall that $\delta_i=\sum_{c=1}^{d}q_c\,\epsilon_{i,c}/\sqrt{d}$, where the $\epsilon_{i,c}$ are independent with $\epsilon_{i,c}\sim\mathcal{U}(-\Delta_{i,c}/2,+\Delta_{i,c}/2)$. By independence across channels, the moment generating function factorizes:

$$\mathbb{E}\bigl[e^{\delta_i}\bigr]=\prod_{c=1}^{d}\mathbb{E}\!\left[\exp\!\left(\frac{q_c\,\epsilon_{i,c}}{\sqrt{d}}\right)\right]. \tag{16}$$

For each channel $c$, we evaluate the scalar MGF. Let $t_c=q_c/\sqrt{d}$ for brevity. Since $\epsilon_{i,c}\sim\mathcal{U}(-\Delta_{i,c}/2,\;+\Delta_{i,c}/2)$:

$$\mathbb{E}\bigl[e^{t_c\,\epsilon_{i,c}}\bigr]=\frac{1}{\Delta_{i,c}}\int_{-\Delta_{i,c}/2}^{+\Delta_{i,c}/2}e^{t_c\,u}\,du=\frac{\sinh(t_c\,\Delta_{i,c}/2)}{t_c\,\Delta_{i,c}/2}. \tag{17}$$

Taking the product over all channels and then the logarithm yields the exact correction:

$$b_i=\sum_{c=1}^{d}\log\!\left(\frac{\sinh\!\left(\dfrac{q_c\,\Delta_{i,c}}{2\sqrt{d}}\right)}{\dfrac{q_c\,\Delta_{i,c}}{2\sqrt{d}}}\right). \tag{18}$$
A naive implementation of this formula is numerically unstable ($\sinh$ overflows for large arguments) and computationally expensive ($O(d)$ operations per score entry, matching the attention score computation itself). We therefore seek a cheaper approximation.
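For reference, the overflow itself can be avoided with a standard rewrite, $\sinh a = e^{a}(1-e^{-2a})/2$, so that $e^{a}$ is never formed. The sketch below is one possible stable evaluation of the exact per-channel term, not the authors' code; the function names and the small-argument threshold of `1e-2` are assumptions of this example. It does not remove the $O(d)$ cost per score entry.

```python
import math


def log_sinhc(alpha):
    """Numerically stable log(sinh(alpha) / alpha); the function is even in alpha."""
    a = abs(alpha)
    if a < 1e-2:
        # Series a^2/6 - a^4/180 avoids 0/0 and cancellation near zero.
        return a * a / 6.0 - a ** 4 / 180.0
    # sinh(a)/a = exp(a) * (1 - exp(-2a)) / (2a); take logs term by term.
    return a + math.log(-math.expm1(-2.0 * a) / (2.0 * a))


def exact_correction(q, steps):
    """Exact bias b_i for one query q and per-channel step sizes of one cached key."""
    d = len(q)
    return sum(log_sinhc(qc * dc / (2.0 * math.sqrt(d))) for qc, dc in zip(q, steps))
```

In double precision, a direct call to `math.sinh` fails for arguments beyond roughly 710, whereas `log_sinhc(1000.0)` returns approximately `1000 - log(2000)`.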
#### Taylor approximation\.
Let $\alpha_c=q_c\,\Delta_{i,c}/(2\sqrt{d})$. Using $\log(\sinh(\alpha)/\alpha)=\alpha^{2}/6+O(\alpha^{4})$, and summing over channels:

$$b_i\approx\sum_{c=1}^{d}\frac{\alpha_c^{2}}{6}=\frac{1}{24\,d}\sum_{c=1}^{d}q_c^{2}\,\Delta_{i,c}^{2}. \tag{19}$$

Under group-wise per-token quantization, where each token's $d$ channels are divided into $G=d/g$ groups sharing a common step size $\Delta_{i,j}$, this simplifies to $b_i\approx\frac{1}{24d}\sum_{j=1}^{G}\Delta_{i,j}^{2}\,\|q_j\|^{2}$ as in [Eq. 15](https://arxiv.org/html/2605.26266#S4.E15).
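The group-wise simplification can be checked numerically. The following sketch, with hypothetical function names and made-up toy values, evaluates the per-channel sum of Eq. 19 and the grouped form side by side; they agree whenever all channels in a group share one step size.

```python
def taylor_correction_per_channel(q, steps):
    """Eq. 19: (1 / 24d) * sum_c q_c^2 * Delta_{i,c}^2 for one query and one cached key."""
    d = len(q)
    return sum(qc * qc * dc * dc for qc, dc in zip(q, steps)) / (24.0 * d)


def taylor_correction_grouped(q, group_steps, group_size):
    """Grouped form: (1 / 24d) * sum_j Delta_{i,j}^2 * ||q_j||^2."""
    d = len(q)
    total = 0.0
    for j, step in enumerate(group_steps):
        group = q[j * group_size:(j + 1) * group_size]
        total += step * step * sum(qc * qc for qc in group)
    return total / (24.0 * d)


# Toy example: d = 8 channels, g = 4, so G = 2 groups.
q = [0.5, -1.0, 2.0, 0.25, 1.5, -0.5, 0.0, 1.0]
group_steps = [0.4, 0.8]
per_channel_steps = [0.4] * 4 + [0.8] * 4

b_channel = taylor_correction_per_channel(q, per_channel_steps)
b_grouped = taylor_correction_grouped(q, group_steps, group_size=4)
```

The grouped form needs only the $G$ squared group norms of the query and the $G$ squared step sizes of the key, which is what makes the per-score cost $O(G)$ rather than $O(d)$.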
[Figure A1](https://arxiv.org/html/2605.26266#A1.F1) compares the exact correction $\log(\sinh(\alpha)/\alpha)$ with its Taylor approximation $\alpha^{2}/6$ as a function of $\alpha_c=q_c\,\Delta_{i,c}/(2\sqrt{d})$. The two agree closely for small $|\alpha_c|$, but the Taylor term grows as $\alpha_c^{2}$ whereas the exact correction grows only as $|\alpha_c|$ for large arguments, so the approximation systematically overestimates the correction when the score-space noise is large.
Figure A1: Exact correction $\log(\sinh(\alpha)/\alpha)$ versus its second-order Taylor approximation $\alpha^{2}/6$. The approximation is tight for small $|\alpha|$ but overestimates the correction for large $|\alpha|$, explaining the mild overcorrection observed at aggressive bitwidths.

At aggressive bitwidths (e.g., INT2), the approximation may overcorrect, but we find empirically that this generally does not harm end-to-end video quality (see [Section 5](https://arxiv.org/html/2605.26266#S5)).
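The two growth regimes described above can be made explicit with standard series and identities. This short worked expansion is added for illustration and is not part of the original derivation. The first line is the power series of $\sinh$, the second follows from $\log(1+x)=x-x^{2}/2+\dots$, and the third is an exact rewrite valid for $\alpha\neq 0$.

```latex
\begin{align*}
\frac{\sinh\alpha}{\alpha} &= 1 + \frac{\alpha^{2}}{6} + \frac{\alpha^{4}}{120} + O(\alpha^{6}), \\
\log\frac{\sinh\alpha}{\alpha} &= \frac{\alpha^{2}}{6} + \left(\frac{1}{120} - \frac{1}{72}\right)\alpha^{4} + O(\alpha^{6})
  = \frac{\alpha^{2}}{6} - \frac{\alpha^{4}}{180} + O(\alpha^{6}), \\
\log\frac{\sinh\alpha}{\alpha} &= |\alpha| - \log\bigl(2|\alpha|\bigr) + \log\bigl(1 - e^{-2|\alpha|}\bigr).
\end{align*}
```

The negative fourth-order coefficient shows that the first neglected term already makes $\alpha^{2}/6$ an overestimate, and the last line shows the linear growth in $|\alpha|$ for large arguments.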
## Appendix B Detailed Cost Breakdown
We detail the per-query, per-key, per-score-entry, and total costs for the Taylor correction under group-wise per-token quantization with $G=d/g$ groups. Here $Q$ and $K$ denote the number of queries and cached keys, respectively.
- Per-query: Compute $\|q_j\|^{2}=\sum_{c\in\mathcal{G}_j}q_c^{2}$ for each group $j=1,\dots,G$, costing $O(d)$.
- Per-key: Compute $\Delta_{i,j}^{2}/(24d)$ for each group, costing $O(G)$ per key.
- Per score entry: Compute an inner product between the per-query vector $(\|q_j\|^{2})_{j=1}^{G}$ and the per-key vector $(\Delta_{i,j}^{2}/(24d))_{j=1}^{G}$, costing $O(G)$.
- Total:

  $$O(Q\cdot d+K\cdot G+Q\cdot K\cdot G). \tag{20}$$

  Since $G=d/g$ and $K\gg d$, the dominant term is $O(Q\cdot K\cdot d/g)$. Compared to the attention cost $O(Q\cdot K\cdot d)$, this is lower by a factor of $g$.
On storage, we note that a cached key of dimension $d$ quantized to $B$ bits per element with group size $g$ requires $d\cdot B$ bits for the quantized values, plus metadata per group: one scale stored in FP8 E4M3 (8 bits) and one zero-point stored in BF16 (16 bits), for a total of $24$ bits per group. With $G=d/g$ groups per token, the effective bitwidth is

$$B_{\mathrm{eff}}=\frac{d\cdot B+24\cdot G}{d}=B+\frac{24}{g}. \tag{21}$$

Our correction adds no storage beyond this ($\Delta_{i,j}$ is the scale itself). For our default configuration ($d=128$, $g=32$), this yields $B_{\mathrm{eff}}=2.75$ at INT2.
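As a quick arithmetic check of Eq. 21, the helper below (a hypothetical name, written for this illustration) evaluates the effective bitwidth for a few settings. Under this formula, INT2 with $g=128$ gives 2.1875 bits and INT4 with $g=64$ gives 4.375 bits, which appear to be the 2.19-bit and 4.38-bit operating points compared in Section 5.3; that pairing is inferred from the formula rather than stated explicitly in the text.

```python
def effective_bits(bits, group_size, scale_bits=8, zero_point_bits=16):
    """Effective bits per element: payload bits plus per-group metadata amortized over the group."""
    return bits + (scale_bits + zero_point_bits) / group_size


default_int2 = effective_bits(2, 32)    # default configuration, g = 32
small_int2 = effective_bits(2, 128)     # INT2 with the largest group size
int4_g64 = effective_bits(4, 64)        # INT4 with g = 64
memory_ratio = small_int2 / int4_g64
```

With these values, `memory_ratio` is exactly 0.5, matching the 50% memory reduction quoted in the main text.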
## Appendix C Implementation Note
In our implementation, the correction subtracts a per-attention-score value $b_i$ from cached scores before softmax. Materializing this correction for every score entry would require a dense tensor with the same shape as the full score matrix, an avoidable memory cost for long contexts. Instead, we apply the bias on the fly through a `score_mod` function in PyTorch's FlexAttention (Dong et al., [2024](https://arxiv.org/html/2605.26266#bib.bib29)), which lets the fused attention kernel incorporate the correction without materializing the full correction tensor.
All MAGI\-1 experiments were conducted on NVIDIA L4 GPUs, SkyReels\-V2 experiments on NVIDIA A100 GPUs, and HY\-WorldPlay experiments on NVIDIA A100 80GB GPUs\.
## Appendix D Pseudocode for Taylor-Corrected Attention
[Algorithm 1](https://arxiv.org/html/2605.26266#algorithm1) provides the full pseudocode for attention with the Taylor correction applied to quantized cached keys, as derived in [Section 4.2](https://arxiv.org/html/2605.26266#S4.SS2).
Input: Query matrix $Q\in\mathbb{R}^{M\times d}$; cached quantized keys $K_{\mathcal{S}}^{q}$ with per-group step sizes $\{\Delta_{i,j}\}$; cached values $V_{\mathcal{S}}$; current-chunk keys $K_{\mathcal{R}}$; current-chunk values $V_{\mathcal{R}}$; group size $g$, number of groups $G=d/g$

Output: Attention output $O\in\mathbb{R}^{M\times d_v}$

- $\hat{K}_{\mathcal{S}}\leftarrow\mathrm{dequant}(K_{\mathcal{S}}^{q})$;
- $S_{\mathcal{S}}\leftarrow Q\hat{K}_{\mathcal{S}}^{\top}/\sqrt{d}$;
- $S_{\mathcal{R}}\leftarrow QK_{\mathcal{R}}^{\top}/\sqrt{d}$;
- for $m=1$ to $M$ do
  - for $j=1$ to $G$ do
    - $\nu_{m,j}\leftarrow\sum_{c\in\mathcal{G}_j}Q_{m,c}^{2}$;
  - end for
  - forall $i\in\mathcal{S}$ do
    - $b_{m,i}\leftarrow\dfrac{1}{24\,d}\sum_{j=1}^{G}\Delta_{i,j}^{2}\,\nu_{m,j}$;
    - $S_{\mathcal{S}}[m,i]\leftarrow S_{\mathcal{S}}[m,i]-b_{m,i}$;
  - end forall
- end for
- $S\leftarrow\mathrm{concat}(S_{\mathcal{S}},S_{\mathcal{R}})$;
- $P\leftarrow\mathrm{softmax}(S)$;
- $V\leftarrow\mathrm{concat}(V_{\mathcal{S}},V_{\mathcal{R}})$;
- $O\leftarrow PV$;
- return $O$

Algorithm 1: Attention with Taylor correction for quantized cached keys (group-wise)
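A direct, unoptimized transcription of Algorithm 1 may help when checking a fused implementation. The sketch below uses plain Python lists and is meant only as a reference under stated assumptions: the function names are hypothetical, the cached keys are passed in already dequantized, and nothing here reflects how the fused kernel is actually organized.

```python
import math


def softmax(row):
    """Softmax over one row of scores, shifted by the max for stability."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def taylor_corrected_attention(Q, K_cached_hat, steps, V_cached, K_cur, V_cur, group_size):
    """Q: M x d queries; K_cached_hat: dequantized cached keys;
    steps[i][j]: step size of cached key i in group j."""
    d = len(Q[0])
    G = d // group_size
    scale = 1.0 / math.sqrt(d)
    K_all = K_cached_hat + K_cur
    V_all = V_cached + V_cur
    n_cached = len(K_cached_hat)
    out = []
    for q in Q:
        # Per-query squared group norms nu_j = sum over channels in group j of q_c^2.
        nu = [sum(x * x for x in q[j * group_size:(j + 1) * group_size]) for j in range(G)]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K_all]
        for i in range(n_cached):
            # Taylor bias for (query, cached key i); current-chunk scores stay untouched.
            bias = sum(steps[i][j] ** 2 * nu[j] for j in range(G)) / (24.0 * d)
            scores[i] -= bias
        p = softmax(scores)
        out.append([sum(p[t] * V_all[t][c] for t in range(len(V_all)))
                    for c in range(len(V_all[0]))])
    return out


# Toy check: cached values are 0 and the current value is 1, so the output
# equals the attention mass assigned to the current chunk.
Q = [[1.0, 0.0, 0.0, 0.0]]
K_cached = [[0.0] * 4, [0.0] * 4]
K_cur = [[0.0] * 4]
V_cached = [[0.0], [0.0]]
V_cur = [[1.0]]
plain = taylor_corrected_attention(Q, K_cached, [[0.0, 0.0]] * 2, V_cached, K_cur, V_cur, 2)
corrected = taylor_corrected_attention(Q, K_cached, [[2.0, 2.0]] * 2, V_cached, K_cur, V_cur, 2)
```

With zero step sizes the function reduces to ordinary softmax attention; with positive step sizes the cached scores are lowered and the current chunk regains attention mass.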
## Appendix E Per-Channel Quantization Correction
When quantization is performed per-channel (or group-wise per-channel), the step size $\Delta_c$ depends on channel $c$ but is shared across all tokens. The noise model becomes $\epsilon_{i,c}\sim\mathcal{U}(-\Delta_c/2,\;+\Delta_c/2)$, independent across channels and identically distributed across tokens for each fixed channel.

Since $\{\Delta_c\}$ do not depend on $i$, the distribution of $\delta_i=\sum_c q_c\,\epsilon_{i,c}/\sqrt{d}$ is the same for all cached keys $i\in\mathcal{S}$. For a given query, the correction reduces to a single scalar shared by all cached tokens:

$$b=\sum_{c=1}^{d}\log\!\left(\frac{\sinh\!\left(\dfrac{q_c\,\Delta_c}{2\sqrt{d}}\right)}{\dfrac{q_c\,\Delta_c}{2\sqrt{d}}}\right), \tag{22}$$

with the Taylor approximation

$$b\approx\frac{1}{24\,d}\sum_{c=1}^{d}q_c^{2}\,\Delta_c^{2}. \tag{23}$$
#### Per\-channel correction\.
Since $b$ is the same for all $i \in \mathcal{S}$, the corrected scores within the cached chunk are $\tilde{s}_i = \hat{s}_i - b$ for all $i \in \mathcal{S}$. Subtracting $b$ from all cached scores reduces $Z_{\mathcal{S}}$ relative to $Z_{\mathcal{R}}$, restoring the inter-chunk attention balance. Because the shift is uniform, the relative weights among cached keys are left unchanged.
Under per-token quantization, the correction $b_i$ varies across tokens, allowing it to differentially adjust each token's contribution. In our experiments, per-token quantization with the token-dependent correction consistently outperforms per-channel quantization with a shared correction.
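To make the relation between the exact per-channel correction (22) and its Taylor form (23) concrete, the following sketch evaluates both for a small hand-picked query. The function names and the example values are illustrative assumptions, not taken from the paper's code. Since $\log(\sinh(a)/a) \le a^2/6$ for every real $a$, the Taylor form never falls below the exact value, so any approximation error is on the side of slight overcorrection.

```python
import math

def per_channel_bias_exact(q, delta):
    """Eq. (22): sum over channels of log(sinh(a_c) / a_c), a_c = q_c * Delta_c / (2 sqrt(d))."""
    d = len(q)
    total = 0.0
    for qc, dc in zip(q, delta):
        a = qc * dc / (2.0 * math.sqrt(d))
        # sinh(a)/a tends to 1 as a -> 0, so a zero channel contributes log(1) = 0
        total += math.log(math.sinh(a) / a) if a != 0.0 else 0.0
    return total

def per_channel_bias_taylor(q, delta):
    """Eq. (23): second-order Taylor approximation of the same quantity."""
    d = len(q)
    return sum(qc ** 2 * dc ** 2 for qc, dc in zip(q, delta)) / (24.0 * d)

# Hypothetical query and per-channel step sizes (d = 4)
q = [1.0, -2.0, 0.5, 0.0]
delta = [0.5, 0.25, 1.0, 0.3]
exact = per_channel_bias_exact(q, delta)
approx = per_channel_bias_taylor(q, delta)
print(f"exact={exact:.7f}  taylor={approx:.7f}  gap={approx - exact:.2e}")
```

For step sizes of this magnitude the two values agree to several decimal places, which is consistent with the use of the cheaper Taylor form at inference time.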
## Appendix F Extension to QuaRot
The derivation in [Section 4](https://arxiv.org/html/2605.26266#S4) assumes the unrotated space. We now extend the correction to QuaRot (see [Section 3](https://arxiv.org/html/2605.26266#S3.SS0.SSS0.Px3)).
With the Hadamard matrix $H$ applied to both keys and queries, the quantized score becomes
$$\hat{s}_i = \frac{(Hq)^{\top}(Hk_i + \epsilon_i)}{\sqrt{d}} = s_i + \delta_i^{(H)}, \tag{24}$$
where $\delta_i^{(H)} = (Hq)^{\top}\epsilon_i/\sqrt{d}$, and the clean score $s_i = q^{\top}k_i/\sqrt{d}$ is unchanged because $H^{\top}H = I$. Our correction applies identically with $q$ replaced by $Hq$: $b_i^{(H)} = \log \mathbb{E}[e^{\delta_i^{(H)}}]$.
#### Taylor approximation under rotation.
The Taylor approximation replaces $\|q_j\|^2$ with $\|(Hq)_j\|^2$ (the per-group squared norms of the rotated query):
$$b_i^{(H)} \approx \frac{1}{24\,d} \sum_{j=1}^{G} \Delta_{i,j}^{2}\,\|(Hq)_j\|^{2}. \tag{25}$$
Note that while $\|Hq\|^2 = \|q\|^2$ by orthogonality, the per-group norms $\|(Hq)_j\|^2$ generally differ from $\|q_j\|^2$ because Hadamard rotation mixes channels across groups.
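The remark about per-group norms can be checked numerically. The sketch below is only an illustration: it builds an orthonormal Sylvester-type Hadamard matrix (the exact construction used by QuaRot implementations may differ), and the helper names and the toy query are hypothetical.

```python
import numpy as np

def hadamard(d):
    """Orthonormal Sylvester Hadamard matrix; d must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(d)

def group_sq_norms(x, g):
    """Squared norms of consecutive groups of g channels."""
    return (x.reshape(-1, g) ** 2).sum(axis=1)

# Toy query with all of its energy in a single channel (d = 8, g = 4)
q = np.array([4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
Hq = hadamard(8) @ q

print(group_sq_norms(q, 4))    # energy sits entirely in the first group
print(group_sq_norms(Hq, 4))   # rotation spreads it evenly over both groups
print(q @ q, Hq @ Hq)          # the total squared norm is preserved
```

Here the unrotated per-group norms are 16 and 0, while after rotation both groups carry about 8, so the correction in (25) must be computed from the rotated query.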
## Appendix G Fidelity Metric Standard Errors
[Table 1](https://arxiv.org/html/2605.26266#S4.T1) reports fidelity metrics (PSNR, SSIM, LPIPS) averaged across prompts. [Table 2](https://arxiv.org/html/2605.26266#A7.T2) reports the same values with standard errors computed across prompts (the independent sampling unit), using the evaluation data described in [Section 5.1](https://arxiv.org/html/2605.26266#S5.SS1).
Table 2: Fidelity metrics with standard errors for all configurations in [Table 1](https://arxiv.org/html/2605.26266#S4.T1). PSNR, SSIM, and LPIPS are computed relative to the BF16 reference; $\pm$ denotes standard error across prompts. Best quantized result per model is bolded.
## Appendix H Per-Dimension VBench Results
[Table 1](https://arxiv.org/html/2605.26266#S4.T1) reports the aggregate VBench Score in the VBench-Long setting from VBench++ [Huang et al., [2024](https://arxiv.org/html/2605.26266#bib.bib21)] on MAGI-1 and SkyReels-V2. For completeness, [Tables 3](https://arxiv.org/html/2605.26266#A8.T3) and [4](https://arxiv.org/html/2605.26266#A8.T4) break this score down across all 16 VBench dimensions, grouped by VBench's *Quality* (visual fidelity) and *Semantic* (prompt fidelity) categories, and [Table 5](https://arxiv.org/html/2605.26266#A8.T5) reports the corresponding sub-scores together with the Total VBench Score that already appears in [Table 1](https://arxiv.org/html/2605.26266#S4.T1). All scores are reported with standard errors across prompts ($\pm$ SE); within-prompt clips are averaged before computing the SE.

Table 3: Per-dimension VBench *Quality* results on MAGI-1 and SkyReels-V2 (subject consistency, background consistency, temporal flickering, motion smoothness, dynamic degree, aesthetic quality, imaging quality). Values are on the standard VBench 0–100 scale. $\pm$ denotes standard error across prompts. Best quantized result per model is bolded.

Table 4: Per-dimension VBench *Semantic* results on MAGI-1 and SkyReels-V2 (object class, multiple objects, human action, color, spatial relationship, scene, appearance style, temporal style, overall consistency). Values are on the standard VBench 0–100 scale. $\pm$ denotes standard error across prompts. Best quantized result per model is bolded; ties are bolded jointly.

Table 5: Aggregate VBench scores for MAGI-1 and SkyReels-V2: VBench's Quality and Semantic sub-scores and the total VBench Score (which already appears in [Table 1](https://arxiv.org/html/2605.26266#S4.T1)). Values are on the standard VBench 0–100 scale. Best quantized result per model is bolded. $\pm$ denotes standard error across prompts, propagated to aggregate scores via linear error propagation through VBench's normalization and weighting.

| Model | Quant. scheme | Prec. | With corr. | Quality ↑ | Semantic ↑ | Total ↑ |
|---|---|---|---|---|---|---|
| MAGI-1 | — | BF16 | | 80.10 ± 0.69 | 70.93 ± 1.97 | 78.27 ± 0.68 |
| MAGI-1 | RTN | INT2 | × | 78.46 ± 0.67 | 69.49 ± 2.00 | 76.67 ± 0.67 |
| MAGI-1 | RTN | INT2 | ✓ | 79.62 ± 0.67 | 70.83 ± 2.04 | 77.86 ± 0.67 |
| MAGI-1 | QuaRot+RTN | INT2 | × | 74.90 ± 0.55 | 51.62 ± 1.26 | 70.24 ± 0.50 |
| MAGI-1 | QuaRot+RTN | INT2 | ✓ | 79.69 ± 0.68 | 71.31 ± 1.94 | 78.02 ± 0.67 |
| MAGI-1 | QVG | INT2 | × | 79.57 ± 0.67 | 70.79 ± 1.96 | 77.81 ± 0.67 |
| MAGI-1 | QVG | INT2 | ✓ | **79.95 ± 0.69** | **71.35 ± 1.97** | **78.23 ± 0.68** |
| SkyReels-V2 | — | BF16 | | 83.60 ± 0.67 | 60.02 ± 2.27 | 78.89 ± 0.70 |
| SkyReels-V2 | RTN | INT2 | × | 73.97 ± 0.71 | 48.27 ± 1.74 | 68.83 ± 0.67 |
| SkyReels-V2 | RTN | INT2 | ✓ | **84.62 ± 0.61** | **61.71 ± 2.15** | **80.04 ± 0.65** |
| SkyReels-V2 | QuaRot+RTN | INT2 | × | 76.48 ± 0.79 | 51.28 ± 2.06 | 71.44 ± 0.75 |
| SkyReels-V2 | QuaRot+RTN | INT2 | ✓ | 83.25 ± 0.66 | 59.91 ± 2.24 | 78.58 ± 0.69 |
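The prompt-level standard errors reported in this appendix (clips averaged within each prompt first, then the SE taken across prompts) could be computed along the following lines. This is a sketch under assumptions: the function name and the example scores are hypothetical, and the SE is taken as the sample standard deviation (with $n-1$ in the denominator) divided by $\sqrt{n}$, which the text does not specify.

```python
import math
from statistics import mean, stdev

def prompt_level_se(scores_by_prompt):
    """Return (mean, standard error) with prompts as the independent unit.

    scores_by_prompt maps a prompt id to the scores of its clips.
    Clips are averaged within each prompt before the SE is computed.
    """
    prompt_means = [mean(clips) for clips in scores_by_prompt.values()]
    se = stdev(prompt_means) / math.sqrt(len(prompt_means))
    return mean(prompt_means), se

# Hypothetical scores: three prompts with a varying number of clips each
scores = {"prompt_a": [1.0, 3.0], "prompt_b": [4.0, 4.0], "prompt_c": [6.0, 6.0, 6.0]}
avg, se = prompt_level_se(scores)
print(f"{avg:.2f} ± {se:.2f}")
```

Averaging within prompts first keeps prompts with many clips from dominating the estimate and avoids treating correlated clips as independent samples.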
## Appendix I Qualitative Comparison on SkyReels-V2
[Figure 1](https://arxiv.org/html/2605.26266#S1.F1) in the main text shows the qualitative effect of INT2 KV-cache quantization and our correction on MAGI-1. [Figure A2](https://arxiv.org/html/2605.26266#A9.F2) reports the analogous comparison on SkyReels-V2 for two representative prompts from the VBench-Long suite.

Figure A2: Qualitative comparison on SkyReels-V2. Columns show successive frames from the same video. Rows show BF16; INT2 asymmetric QuaRot+RTN quantization of cached keys and values; and the same setting with our correction. As on MAGI-1 ([Fig. 1](https://arxiv.org/html/2605.26266#S1.F1)), INT2 introduces visible distortions, while our correction recovers much of the BF16-like visual quality and temporal consistency.
## Appendix J Qualitative Comparison on HY-WorldPlay
[Figure 1](https://arxiv.org/html/2605.26266#S1.F1) in the main text shows the qualitative effect of INT2 KV-cache quantization and our correction on MAGI-1. For completeness, [Fig. A3](https://arxiv.org/html/2605.26266#A10.F3) reports the analogous comparison on HY-WorldPlay for two representative image–prompt pairs from the original HY-WorldPlay repository.

Figure A3: Qualitative comparison on HY-WorldPlay. Columns show successive frames from the same video. Rows show BF16; INT2 asymmetric QuaRot+RTN KV-cache quantization of keys and values; and the same quantized setting with our correction. As on MAGI-1 ([Fig. 1](https://arxiv.org/html/2605.26266#S1.F1)), INT2 introduces visible distortions, while our correction recovers much of the BF16-like visual quality and temporal consistency.
## Appendix K Attention Mass Shift
[Figure 4](https://arxiv.org/html/2605.26266#S5.F4) in the main text reports the cached attention mass shift $\Delta P_{\mathcal{S}}$ at INT2. For completeness, we report here the same analysis at INT4 on MAGI-1 with the same quantization scheme and our correction.
Figure A4: Cached attention mass shift $\Delta P_{\mathcal{S}}$ on MAGI-1 at INT4 with QuaRot+RTN KV-cache quantization. The same qualitative pattern as at INT2 (cf. [Fig. 4](https://arxiv.org/html/2605.26266#S5.F4)) is visible, but the bias is much smaller. The correction centers the distribution near zero.
The INT4 results in [Fig. A4](https://arxiv.org/html/2605.26266#A11.F4) show the same qualitative pattern as at INT2: a right-skewed quantized distribution of $\Delta P_{\mathcal{S}}$ that the correction centers near zero. However, the magnitude of the bias is much smaller. Because the uncorrected bias is already small at INT4 and generated videos are visually close to the BF16 baseline, the correction's benefit is correspondingly mild, which is why we focus the main paper on INT2.
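The difference in bias magnitude between INT2 and INT4 can be illustrated with a toy Monte Carlo simulation. This is not a reproduction of Fig. A4: it uses random Gaussian queries and keys, injects uniform noise according to the paper's noise model instead of running a real quantizer, and assumes a single group with one shared step size $\Delta = \text{range}/(2^{\text{bits}} - 1)$. All names and default values are hypothetical. Under these assumptions the Taylor bias scales with $\Delta^2$, so the INT4 bias is 25 times smaller than the INT2 bias for the same key range.

```python
import numpy as np

def mean_mass_shift(bits, key_range=9.0, d=64, A=256, B=256, trials=200, seed=0):
    """Mean shift of the cached attention mass P_S, without and with correction."""
    rng = np.random.default_rng(seed)
    q = rng.normal(size=d)
    K_S = rng.normal(size=(A, d))            # cached keys (before noise)
    K_R = rng.normal(size=(B, d))            # current-chunk keys (full precision)
    delta = key_range / (2 ** bits - 1)      # uniform step size for this bit width
    b = delta ** 2 * (q @ q) / (24.0 * d)    # Taylor correction, single shared group
    s_S = K_S @ q / np.sqrt(d)
    s_R = K_R @ q / np.sqrt(d)
    Z_R = np.exp(s_R).sum()
    Z_S = np.exp(s_S).sum()
    p_ref = Z_S / (Z_S + Z_R)                # noise-free cached attention mass
    raw, corrected = [], []
    for _ in range(trials):
        eps = rng.uniform(-delta / 2, delta / 2, size=(A, d))  # simulated quantization noise
        s_hat = s_S + eps @ q / np.sqrt(d)
        Z_raw = np.exp(s_hat).sum()
        Z_cor = np.exp(s_hat - b).sum()
        raw.append(Z_raw / (Z_raw + Z_R) - p_ref)
        corrected.append(Z_cor / (Z_cor + Z_R) - p_ref)
    return float(np.mean(raw)), float(np.mean(corrected))

raw2, cor2 = mean_mass_shift(bits=2)
raw4, cor4 = mean_mass_shift(bits=4)
print(f"INT2: uncorrected {raw2:+.4f}, corrected {cor2:+.4f}")
print(f"INT4: uncorrected {raw4:+.4f}, corrected {cor4:+.4f}")
```

In this toy setting the uncorrected shift is positive and clearly larger at 2 bits than at 4 bits, while the corrected shift stays close to zero, mirroring the qualitative behaviour described above.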
## Appendix L Attention JSD Distributions
[Figure A5](https://arxiv.org/html/2605.26266#A12.F5) plots the distribution of Jensen-Shannon divergence (JSD) between the quantized (or corrected) and BF16 attention weights on MAGI-1 under QuaRot+RTN quantization, computed over all keys. At INT2, the correction consistently shifts the JSD distribution toward lower values, confirming that removing the partition sum bias improves the overall attention distribution. At INT4 the JSD is already low without correction, and the correction provides only a modest further reduction, mirroring the smaller probability-mass bias observed in [Appendix K](https://arxiv.org/html/2605.26266#A11).
(a) INT2 quantization
(b) INT4 quantization
Figure A5: Distribution of Jensen-Shannon divergence between quantized (or corrected) and BF16 attention weights on MAGI-1 under QuaRot+RTN. At INT2 the correction substantially reduces the JSD; at INT4 the baseline JSD is already low and the improvement is modest.
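For reference, the Jensen-Shannon divergence between two attention rows can be computed as in the sketch below. The function name is hypothetical and the natural logarithm is an assumption, since the text does not state the base; with natural logs the JSD lies between 0 and $\log 2$.

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence (in nats) between two discrete distributions."""
    def kl(a, b):
        # Terms with a_i = 0 contribute nothing to the KL divergence
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]   # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical attention rows over four keys
reference = [0.40, 0.30, 0.20, 0.10]
perturbed = [0.55, 0.25, 0.15, 0.05]
print(jsd(reference, perturbed))
```

Unlike the KL divergence, the JSD is symmetric and stays finite when one distribution assigns zero weight to a key, which makes it convenient for comparing attention rows.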
## Appendix M Attention Output MSE
[Figure A6](https://arxiv.org/html/2605.26266#A13.F6) reports the mean squared error (MSE) of the attention output $\mathrm{softmax}(S)\,V$ between the quantized (or corrected) and BF16 computations on MAGI-1 under QuaRot+RTN quantization. At INT2, the correction consistently reduces the attention output MSE, confirming that improvements at the score level propagate to the attention output. At INT4 the MSE follows the same trend as the JSD ([Appendix L](https://arxiv.org/html/2605.26266#A12)): already low without correction, with a modest further reduction after correction.
(a) INT2 quantization
(b) INT4 quantization
Figure A6: Attention output MSE between quantized (or corrected) and BF16 computations on MAGI-1 under QuaRot+RTN. The correction reduces MSE at INT2, confirming that score-level improvements propagate to the attention output. At INT4 the effect is smaller.
## Appendix N Storage–Quality Trade-Off: SSIM and LPIPS
[Figure 5](https://arxiv.org/html/2605.26266#S5.F5) in the main text reports the storage–quality trade-off in terms of PSNR. For completeness, [Figs. A7](https://arxiv.org/html/2605.26266#A14.F7) and [A8](https://arxiv.org/html/2605.26266#A14.F8) report the same analysis for SSIM and LPIPS, confirming that the correction uniformly improves the trade-off across all three fidelity metrics.
Figure A7: Trade-off between SSIM and effective bitwidth on MAGI-1. Same setting as [Fig. 5](https://arxiv.org/html/2605.26266#S5.F5).
Figure A8: Trade-off between LPIPS and effective bitwidth on MAGI-1. Same setting as [Fig. 5](https://arxiv.org/html/2605.26266#S5.F5).
## Appendix O LLM Partial-Prefill Experiments
Our main experiments focus on chunk-wise autoregressive video diffusion, where previously generated chunks are stored in a quantized KV cache and the current chunk remains in full precision. In this appendix, we evaluate whether our correction transfers to decoder-only language models under structurally analogous partial prefill.
Following the notation of [Section 4.1](https://arxiv.org/html/2605.26266#S4.SS1.SSS0.Px1), each prompt contains a quantized cached prefix $\mathcal{S}$ and a full-precision current prefill chunk $\mathcal{R}$, with lengths $|\mathcal{S}| = A$ and $|\mathcal{R}| = B$, where $B \gg 1$. This setup preserves the key structural feature of chunk-wise video generation: a quantized cached block $\mathcal{S}$ competes inside the same softmax with a multi-token full-precision current block $\mathcal{R}$.
These experiments provide a cross-domain validation of the bias-correction mechanism derived in Section [4](https://arxiv.org/html/2605.26266#S4), rather than a comprehensive LLM inference benchmark.
### O.1 Experimental setup
We evaluate three decoder-only LLMs: Llama-3.1-8B [Dubey et al., [2024](https://arxiv.org/html/2605.26266#bib.bib23), Meta, [2024](https://arxiv.org/html/2605.26266#bib.bib24)], Mistral-7B-Instruct-v0.3 [Jiang et al., [2023](https://arxiv.org/html/2605.26266#bib.bib25), Mistral AI, [2024](https://arxiv.org/html/2605.26266#bib.bib26)], and Qwen2.5-32B-Instruct [Qwen et al., [2024](https://arxiv.org/html/2605.26266#bib.bib27), Qwen, [2024](https://arxiv.org/html/2605.26266#bib.bib28)]. We use English prompts from LongBench-Pro [Chen et al., [2026b](https://arxiv.org/html/2605.26266#bib.bib22)]. We define retained prompt-length bins, e.g., $[256, 512)$, $[512, 1024)$, etc., then deterministically truncate prompts to retained lengths sampled uniformly from the corresponding bin. Each evaluation job uses one fixed current-chunk size across the resulting mixed prompt lengths.
For each model and chunk size, we use the same INT2 KV-cache quantization as in the main paper. We apply our Taylor-approximated score correction to cached-key attention scores before softmax, as described in [Section 4](https://arxiv.org/html/2605.26266#S4).
Completed runs cover current-chunk sizes from $128$ to $8192$; larger attempted configurations exceeded accelerator memory even on 80 GB GPUs. This is due to the quadratic workspace of partial-prefill attention, whose dense score tensor scales as $H B (A + B)$, where $H$ is the number of attention heads, $A = |\mathcal{S}|$ is the cached-prefix length, and $B = |\mathcal{R}|$ is the current-chunk length. To avoid artifacts from these missing configurations, all aggregate results are reported as paired comparisons: each difference is computed only within cells matched by model, current-chunk size, prompt-length bin, and evaluation examples.
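A quick back-of-the-envelope calculation shows how fast the $H B (A + B)$ score tensor grows. The helper below is a hypothetical sketch: it counts a single layer and a single sequence, assumes 2 bytes per element, and uses an illustrative head count of 32 that is not taken from the paper.

```python
def score_tensor_gib(num_heads, prefix_len, chunk_len, bytes_per_elem=2):
    """Size in GiB of a dense H x B x (A + B) attention-score tensor."""
    elems = num_heads * chunk_len * (prefix_len + chunk_len)
    return elems * bytes_per_elem / 2 ** 30

# Illustrative setting: 32 heads, cached prefix A = 8192, current chunk B = 8192
print(score_tensor_gib(32, 8192, 8192))   # 8.0 GiB for one layer's score tensor
# Doubling the chunk size with the same prefix triples the footprint
print(score_tensor_gib(32, 8192, 16384))  # 24.0 GiB
```

Memory-efficient attention kernels avoid materializing this tensor, which is in line with the optimized LLM kernels the paper leaves to future work.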
Our primary metric is teacher\-forced negative log\-likelihood \(NLL\)\. For a set of evaluation examples𝒟\\mathcal\{D\}, we aggregate at corpus level:
$$\mathrm{NLL} = \frac{\sum_{x \in \mathcal{D}} \sum_{t=1}^{T_x} -\log p_{\theta}(y_t \mid y_{<t}, x)}{\sum_{x \in \mathcal{D}} T_x}.$$
We use NLL as the main metric because it aggregates token-level likelihoods directly and avoids the heavy-tailed behavior of averaging per-example perplexities.
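The difference between corpus-level NLL and an average of per-example perplexities can be seen on a small made-up example. The function names and the token losses below are hypothetical; the first function follows the corpus-level formula above, and the second is the alternative it is contrasted with.

```python
import math

def corpus_nll(token_nlls_per_example):
    """Total token NLL divided by the total number of tokens."""
    total = sum(sum(ex) for ex in token_nlls_per_example)
    count = sum(len(ex) for ex in token_nlls_per_example)
    return total / count

def mean_example_perplexity(token_nlls_per_example):
    """Average of per-example perplexities exp(mean token NLL)."""
    ppls = [math.exp(sum(ex) / len(ex)) for ex in token_nlls_per_example]
    return sum(ppls) / len(ppls)

# Hypothetical data: one ordinary nine-token example and one very hard single-token example
examples = [[1.0] * 9, [10.0]]
print(corpus_nll(examples))               # (9 * 1.0 + 10.0) / 10 tokens
print(mean_example_perplexity(examples))  # dominated by exp(10) from the hard example
```

The single hard example moves the corpus-level NLL only moderately, whereas it dominates the averaged perplexity, which is the heavy-tailed behavior the text refers to.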
### O.2 LLM partial prefill results
Figure [A9](https://arxiv.org/html/2605.26266#A15.F9) summarizes our findings for the LLM ablation study. Plain INT2 KV-cache quantization consistently worsens teacher-forced NLL, while the Taylor correction improves over plain INT2 across the completed model and chunk-size settings. The corrected condition is sometimes below the BF16 NLL, although we interpret this conservatively as a partial-prefill rebalancing effect rather than as evidence that the method generally improves over full precision.
Figure A9: Teacher-forced NLL by partial-prefill chunk size in the LLM partial-prefill setting. Each panel corresponds to one model, and curves show BF16, plain INT2 KV-cache quantization, and INT2 with Taylor correction. Plain INT2 generally increases NLL, while the Taylor correction consistently reduces the degradation.
We observe substantial degradation from INT2 KV-cache quantization, especially at large chunk sizes for the smaller Mistral-7B-Instruct-v0.3 and Llama-3.1-8B models. The larger Qwen2.5 model shows smaller plain-INT2 degradation, but the correction still consistently improves NLL. This suggests that the correction is useful both in severe degradation regimes and in milder regimes where plain INT2 remains relatively stable.
### O.3 Prompt-length and chunk-size breakdown
To test whether the aggregate results are driven by a small subset of prompt lengths, we also analyze NLL by retained prompt-length bin. Figure [A10](https://arxiv.org/html/2605.26266#A15.F10) reports paired NLL differences grouped by prompt-length bin and current-chunk size.
Figure A10: Prompt-length and chunk-size breakdown for LLM partial-prefill experiments. The plotted value is $\mathrm{NLL}_{\mathrm{INT2+Taylor}} - \mathrm{NLL}_{\mathrm{INT2}}$, computed within matched model, chunk-size, prompt-bin, and evaluation-example cells. Negative values indicate that the Taylor correction reduces teacher-forced NLL relative to plain INT2 KV-cache quantization. Striped areas indicate no available matched data.
### O.4 Attention-mass diagnostic
The central mechanism studied in the main paper is that quantized cached keys receive inflated softmax mass because the exponential transforms zero-mean score noise into a positive partition-sum bias (Fig. [3](https://arxiv.org/html/2605.26266#S1.F3); see also Fig. [2](https://arxiv.org/html/2605.26266#S1.F2)). Figure [A11](https://arxiv.org/html/2605.26266#A15.F11) visualizes the corresponding attention-weight shift in an LLM partial-prefill setting.
For this diagnostic, we use Llama-3.2-1B as a lightweight model for attention visualization. This diagnostic model is separate from the three-model NLL benchmark above; it is used here because logging full attention weights across many layers, heads, prompts, and chunk sizes is memory intensive.
Figure A11: Attention weights for Llama-3.2-1B under INT2 KV-cache quantization. The visualized attention weights are averaged over representative prompts with lengths in $[1024, 2048)$, layers, and attention heads for chunk size $256$. The dashed vertical line separates cached-prefix tokens from current-chunk tokens. Panel (b) shows that, relative to the BF16 baseline in (a), quantization increases attention weights in the cached block of tokens and decreases them in the current chunk. This effect is quantified by the attention masses $P_{\mathcal{S}}$ and $P_{\mathcal{R}}$ of the cached token block and current chunk. Panel (c) shows that our correction largely restores the original attention weights, with slight overcorrection.
### O.5 Discussion
The LLM partial-prefill results provide additional evidence in a setting whose cached/current attention structure resembles that of the main experiments on chunk-wise autoregressive video diffusion in [Section 5](https://arxiv.org/html/2605.26266#S5). In the completed paired comparisons, plain INT2 KV-cache quantization generally worsens teacher-forced NLL, while the Taylor correction reduces NLL relative to plain INT2. This trend is consistent with our derivation and video-model experiments, but we interpret the LLM results as a diagnostic extension rather than as a comprehensive LLM KV-cache quantization benchmark. We therefore emphasize paired teacher-forced NLL comparisons and leave optimized LLM kernels, broader task-level evaluation, and attention-mass diagnostics across more LLMs and chunk sizes to future work.
In some configurations, the corrected condition obtains lower NLL than the BF16 baseline\. We treat this observation cautiously and do not interpret it as a general improvement over BF16\. It may depend on the partial\-prefill setup, the teacher\-forced NLL objective, or mild overcorrection from the Taylor approximation at aggressive bitwidths\. Our main conclusion from these experiments is limited to the paired comparison between plain INT2 and INT2 with correction: the correction reduces the NLL degradation introduced by INT2 KV\-cache quantization in the evaluated partial\-prefill settings\.
## Appendix P Broader Impact
This work proposes a training\-free correction for KV\-cache quantization in autoregressive video diffusion models\. The direct goal is to improve the efficiency and quality of long\-form video generation by reducing memory usage while preserving generation fidelity\. Potential positive impacts include lowering the computational cost of research on long\-video and world\-model generation, enabling longer context windows under fixed memory budgets, and improving accessibility of efficient inference methods for academic and resource\-constrained settings\.
At the same time, improvements in the efficiency and fidelity of video generation may also lower the cost of generating synthetic video content\. As with other advances in generative video modeling, this could indirectly facilitate misuse such as producing misleading synthetic media, impersonation, or disinformation\. Our work does not introduce a new generative model, dataset, or training procedure, and we do not release new model weights\. The method is an inference\-time numerical correction applied to existing models, so the primary risks are inherited from the underlying video generation systems on which it is used\. We encourage deployment only in settings that follow the safety policies, watermarking or provenance mechanisms, and misuse\-monitoring practices appropriate for the underlying generative model\.