RateQuant: Optimal Mixed-Precision KV Cache Quantization via Rate-Distortion Theory
Summary
This paper introduces RateQuant, a mixed-precision KV cache quantization method that uses rate-distortion theory to allocate bits per attention head and corrects for distortion model mismatch between quantizers. Applied on top of existing quantizers such as KIVI and QuaRot, it significantly reduces perplexity with minimal calibration overhead.
# Optimal Mixed-Precision KV Cache Quantization via Rate-Distortion Theory
Source: [https://arxiv.org/html/2605.06675](https://arxiv.org/html/2605.06675)
Fei Zuo$^{1,\ast}$, Zikang Zhou$^{2,\ast}$, Hao Cong$^{3,\ast}$, Xiaoyan Xi$^{1,\ast}$, Ho Fai Leung$^{1,\dagger}$

$^{1}$BA TechWorks (BMW Group), $^{2}$National University of Singapore, $^{3}$Tsinghua University
###### Abstract
Large language models cache all previously computed key-value (KV) pairs during generation, and this KV cache grows linearly with sequence length, making it a primary memory bottleneck for serving. Quantizing the KV cache to fewer bits reduces this cost, yet all current quantizers assign the same bit-width to every attention head, ignoring the large variation in head importance. A natural idea is to allocate more bits to important heads and fewer to the rest. We show, however, that such mixed-precision allocation has a hidden pitfall: each quantizer follows a different distortion curve $D(b)=\alpha\beta^{-b}$, and the decay rate $\beta$ varies from 3.6 to 5.3 across quantizer designs. Applying one quantizer’s distortion model to another inverts the allocation order and makes performance *worse* than uniform quantization. We call this failure mode *distortion model mismatch* and propose RateQuant to resolve it. RateQuant fits a per-quantizer distortion model from a small calibration set, then solves the resulting bit-allocation problem in closed form via reverse waterfilling from rate-distortion theory. On Qwen3-8B at 2.5 average bits, calibrated RateQuant reduces KIVI’s perplexity from 49.3 to 14.9 (70% reduction) and improves QuaRot by 6.6 PPL. The entire calibration takes 1.6 s on a single GPU and adds zero overhead at inference time.

$^{\ast}$Equal contribution. $^{\dagger}$Corresponding author.

## 1 Introduction
Serving large language models (LLMs) at scale requires caching all previously computed key-value (KV) pairs so that each new token can attend to the full context (Pope et al., [2023](https://arxiv.org/html/2605.06675#bib.bib20)). The memory footprint of this KV cache grows linearly with sequence length, batch size, and model depth. For a 32B-parameter model processing 4k-token sequences, the KV cache alone occupies over 1 GB per request at FP16 precision, so at large batch sizes it can exceed the memory consumed by the model weights themselves. KV cache quantization reduces this cost by storing cached states at lower precision, and a rich line of recent work has produced increasingly effective quantizers (Liu et al., [2024](https://arxiv.org/html/2605.06675#bib.bib16); Ashkboos et al., [2024](https://arxiv.org/html/2605.06675#bib.bib1); Hooper et al., [2024](https://arxiv.org/html/2605.06675#bib.bib10); Zandieh et al., [2026](https://arxiv.org/html/2605.06675#bib.bib26)).
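As a rough check on the memory figure above, the sketch below computes the per-request KV cache size. It assumes that "4k" means 4096 tokens and takes the Qwen3-32B shape (64 layers, 8 KV heads, head dimension 128) from the experimental setup; the helper name `kv_cache_bytes` is illustrative and the estimate ignores quantization metadata such as scales and zero points.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2.0):
    """Bytes needed to cache keys and values for one sequence."""
    # The factor 2 accounts for storing both a key and a value vector per token.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Assumed shape: 64 layers, 8 KV heads, head dimension 128, 4096 tokens.
fp16 = kv_cache_bytes(64, 8, 128, seq_len=4096)                       # 16 bits per element
q25 = kv_cache_bytes(64, 8, 128, seq_len=4096, bytes_per_elem=2.5 / 8)  # 2.5 bits per element

print(f"FP16:    {fp16 / 2**30:.2f} GiB per request")
print(f"2.5-bit: {q25 / 2**30:.3f} GiB per request (metadata ignored)")
```

Under these assumptions the FP16 cache is exactly 1 GiB (about 1.07 GB) per request, and a 2.5-bit cache is 6.4 times smaller.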
However, all these quantizers apply *uniform* bit-widths to every attention head, ignoring the well-documented heterogeneity of head importance (Voita et al., [2019](https://arxiv.org/html/2605.06675#bib.bib22)). Recent mixed-precision approaches relax this at the layer level (Chen et al., [2025a](https://arxiv.org/html/2605.06675#bib.bib2); Li et al., [2025b](https://arxiv.org/html/2605.06675#bib.bib13)) or channel level (Liao and Wen, [2026](https://arxiv.org/html/2605.06675#bib.bib14); Wang et al., [2025](https://arxiv.org/html/2605.06675#bib.bib23)), but each relies on heuristic allocation rules tied to a specific quantizer, limiting transferability and lacking theoretical guarantees.

Figure 1: Distortion model mismatch (Qwen3-8B, 2.5 avg bits). Left: Distortion curves $D(b)=\alpha\beta^{-b}$ diverge across quantizers ($\beta$ varies $1.5\times$). Right: Naïve mixed-precision with mismatched $\beta$ *worsens* PPL (KIVI: $49.3\to 87$); calibrated RateQuant with K/V separation reaches 14.9 (70% $\downarrow$).

We identify a deeper obstacle: *distortion model mismatch*. Different quantizers have fundamentally different distortion-rate curves $D(b)=\alpha\cdot\beta^{-b}$, with the decay rate $\beta$ varying from 3.6 (TurboQuant) to 5.3 (QuaRot). As [Fig. 1](https://arxiv.org/html/2605.06675#S1.F1) illustrates, naïvely applying one quantizer’s distortion model to another makes mixed-precision allocation *worse than uniform quantization*, because the mismatched $\beta$ inverts the marginal gain ordering across heads. Resolving this mismatch is the key to building a truly general allocation framework.
We present RateQuant, a framework that formalizes per-head KV cache bit allocation as rate-distortion optimization. Three empirical findings motivate the design. First, *the sensitivity proxy is decisive*: gradient-based head importance yields a 1.07 PPL improvement over activation-based importance at 3.5 bits. Second, *the distortion model must match the quantizer*: applying the wrong model worsens KIVI from 49.3 to 87.0 PPL at 2.5 bits. Third, *keys and values require separate bit budgets*: for KIVI at 2.5 bits, allocating 2.85 bits to keys and 2.15 bits to values reduces PPL by 70%. Our contributions are:
- Rate-distortion framework. We formalize mixed-precision KV quantization as weighted distortion minimization and derive the optimal continuous allocation via reverse waterfilling ([Theorem 2](https://arxiv.org/html/2605.06675#Thmtheorem2)). The achievable distortion reduction equals the ratio of arithmetic to geometric mean (AM/GM) of head sensitivities ([Theorem 3](https://arxiv.org/html/2605.06675#Thmtheorem3)), which can be computed without any quantization and serves as a cheap predictor of when mixed precision helps.
- Distortion calibration and K/V separation. We identify distortion model mismatch as a previously unrecognized failure mode and resolve it by fitting a per-quantizer distortion model from calibration data. We further generalize the allocation to treat keys and values as $2N$ independent components, making RateQuant applicable to any base quantizer.
- Consistent empirical gains. Calibrated RateQuant reduces KIVI’s perplexity at 2.5 bits from 49.3 to 14.9 (70% reduction) and improves QuaRot by 6.6 PPL. On TurboQuant at 4.0 bits, RateQuant recovers 66% of the quantization-induced degradation on Qwen3-8B and 62% on Qwen3-4B, with smaller absolute gains on Qwen3-32B. All gains come with zero runtime overhead and less than 2 s of one-time calibration.
Figure 2: RateQuant pipeline. Phases 1–3 are one-time offline costs ($<2$ s for 8B); Phase 4 adds zero runtime overhead.
## 2 Related Work
#### KV cache quantization.
Reducing KV cache memory has been approached through eviction (Zhang et al., [2024](https://arxiv.org/html/2605.06675#bib.bib28); Ge et al., [2024](https://arxiv.org/html/2605.06675#bib.bib9)), token merging (Nawrot et al., [2024](https://arxiv.org/html/2605.06675#bib.bib18)), and quantization (Hooper et al., [2024](https://arxiv.org/html/2605.06675#bib.bib10); Liu et al., [2024](https://arxiv.org/html/2605.06675#bib.bib16); Yue et al., [2024](https://arxiv.org/html/2605.06675#bib.bib25)). Among quantization methods, KIVI (Liu et al., [2024](https://arxiv.org/html/2605.06675#bib.bib16)) applies per-channel symmetric keys and per-token asymmetric values; QuaRot (Ashkboos et al., [2024](https://arxiv.org/html/2605.06675#bib.bib1)) suppresses outliers via Hadamard rotations; KVQuant (Hooper et al., [2024](https://arxiv.org/html/2605.06675#bib.bib10)) handles outlier channels with non-uniform quantization; and TurboQuant (Zandieh et al., [2026](https://arxiv.org/html/2605.06675#bib.bib26)) introduces rotation-based vector quantization. All of these assign the same bit-width to every head.
Several recent methods move toward mixed-precision allocation. At the layer level, KVmix (Chen et al., [2025a](https://arxiv.org/html/2605.06675#bib.bib2)) assigns precision via gradient norms, KVTuner (Li et al., [2025b](https://arxiv.org/html/2605.06675#bib.bib13)) uses Pareto search, and PM-KVQ (Chen et al., [2025b](https://arxiv.org/html/2605.06675#bib.bib3)) applies integer programming. At the channel level, ChanMix (Liao and Wen, [2026](https://arxiv.org/html/2605.06675#bib.bib14)) clusters by dynamic range, KITTY (Wang et al., [2025](https://arxiv.org/html/2605.06675#bib.bib23)) boosts outlier keys, and MixKVQ (Zhang et al., [2025](https://arxiv.org/html/2605.06675#bib.bib27)) scores channels via query-activation magnitude. KV-AdaQuant (Kim et al., [2025](https://arxiv.org/html/2605.06675#bib.bib11)) provides a global K/V budget split, and CoKV (Li et al., [2025a](https://arxiv.org/html/2605.06675#bib.bib12)) optimizes per-group token budgets via Shapley values. Each of these methods is designed around a specific base quantizer, making it difficult to apply one method’s allocation rule to a different quantizer. RateQuant operates at per-head granularity with closed-form allocation and supports arbitrary base quantizers through calibration; [Table 1](https://arxiv.org/html/2605.06675#S2.T1) summarizes the positioning.
#### Rate-distortion theory in quantization.
The reverse waterfilling algorithm (Cover and Thomas, [2006](https://arxiv.org/html/2605.06675#bib.bib4)) is a classical solution to the Gaussian rate-distortion problem, optimally distributing a bit budget across parallel channels. In LLM weight quantization, Radio (Tseng et al., [2025](https://arxiv.org/html/2605.06675#bib.bib21)) applies rate-distortion optimization via stochastic dual ascent, and BAQ (Zheng et al., [2025](https://arxiv.org/html/2605.06675#bib.bib29)) derives closed-form waterfilling under Hessian-weighted objectives. For mixed-precision weights, HAWQ (Dong et al., [2019](https://arxiv.org/html/2605.06675#bib.bib6)) uses top Hessian eigenvalues and HAWQ-V2 (Dong et al., [2020](https://arxiv.org/html/2605.06675#bib.bib7)) uses average traces to assign per-layer bit-widths. All of these methods target model *weights*. KV caches differ in two key ways: heads form natural quantization groups with distinct sensitivities, and keys and values have asymmetric error characteristics. RateQuant is the first to apply rate-distortion allocation to KV caches, addressing both of these structural differences.
#### Sensitivity estimation.
Hessian-based sensitivity drives HAWQ (Dong et al., [2019](https://arxiv.org/html/2605.06675#bib.bib6), [2020](https://arxiv.org/html/2605.06675#bib.bib7)) for weight quantization, while activation-based metrics such as channel magnitude are common in post-training quantization (Dettmers et al., [2022](https://arxiv.org/html/2605.06675#bib.bib5); Xiao et al., [2023](https://arxiv.org/html/2605.06675#bib.bib24); Lin et al., [2024](https://arxiv.org/html/2605.06675#bib.bib15)). We show that for KV cache bit allocation, gradient-based sensitivity is qualitatively superior to activation-based alternatives, and the choice of proxy matters more than the choice of allocation algorithm ([Section 4.4](https://arxiv.org/html/2605.06675#S4.SS4)).
Table 1: Mixed-precision KV cache quantization landscape. Cal. = calibrated distortion model; Q-Agn. = quantizer-agnostic; Closed = closed-form solution. RateQuant is the only method combining per-head granularity, rate-distortion theory, calibration, K/V separation, and closed-form allocation.
## 3 RateQuant
We present RateQuant in four parts: problem formulation and optimal allocation ([Section 3.1](https://arxiv.org/html/2605.06675#S3.SS1)), sensitivity estimation ([Section 3.2](https://arxiv.org/html/2605.06675#S3.SS2)), integer allocation algorithm ([Section 3.3](https://arxiv.org/html/2605.06675#S3.SS3)), and quantizer-agnostic extensions ([Section 3.4](https://arxiv.org/html/2605.06675#S3.SS4)).
### 3.1 Problem Formulation and Optimal Allocation
Consider an LLM with $L$ layers and $H$ KV heads per layer, yielding $N=L\times H$ quantization groups. Each group $i$ has a sensitivity weight $w_i>0$ reflecting its importance to the model output (defined in [Section 3.2](https://arxiv.org/html/2605.06675#S3.SS2)).
###### Assumption 1 (Exponential distortion-rate).
The per-head quantization MSE follows $D(b)=\alpha\cdot\beta^{-b}$ for constants $\alpha>0$, $\beta>1$ depending on the quantizer design and head dimension $d$.
We validate this empirically: fitting TurboQuant’s Lloyd-Max MSE for $d=128$ yields $\alpha\approx 1.36$, $\beta\approx 3.48$ with $R^2>0.99$ ([Appendix E](https://arxiv.org/html/2605.06675#A5)). The optimization problem distributes a total bit budget $B=\lfloor\bar{b}\cdot N\rceil$ to minimize weighted distortion:

$$\min_{\mathbf{b}\in\mathbb{R}^{N}}\;\mathcal{J}(\mathbf{b})\triangleq\sum_{i=1}^{N}w_{i}\cdot D(b_{i})\quad\text{s.t.}\quad\sum_{i=1}^{N}b_{i}=B,\quad b_{\min}\leq b_{i}\leq b_{\max}\tag{1}$$

###### Theorem 2 (Reverse waterfilling).
Under [Assumption 1](https://arxiv.org/html/2605.06675#Thmtheorem1), the solution to ([1](https://arxiv.org/html/2605.06675#S3.E1)) with continuous $b_i$ and inactive bound constraints is:

$$b_{i}^{*}=\bar{b}+\frac{\ln w_{i}-\overline{\ln w}}{\ln\beta}\tag{2}$$

where $\bar{b}=B/N$ and $\overline{\ln w}=\frac{1}{N}\sum_{j}\ln w_{j}$.
###### Proof sketch.
Lagrangian stationarity gives $w_i\alpha(\ln\beta)\beta^{-b_i}=\lambda$ for all $i$, yielding $b_i=\ln w_i/\ln\beta+c$ for a constant $c$ shared by all heads. The constant is fixed by $\sum_i b_i=B$. Full proof with bound handling in [Section A.1](https://arxiv.org/html/2605.06675#A1.SS1). ∎
Interpretation. Heads with higher sensitivity receive more bits. The trade-off is governed by $\beta$: for TurboQuant ($\beta=3.48$), a head whose sensitivity is $e$ times larger than the geometric mean receives $1/\ln 3.48\approx 0.80$ additional bits.
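The closed form in Eq. (2) can be sketched in a few lines. This is a minimal illustration that assumes no bound constraint is active; the function name `reverse_waterfilling` and the three toy sensitivities are hypothetical, chosen so that their geometric mean is exactly 1.

```python
import numpy as np


def reverse_waterfilling(w, avg_bits, beta):
    """Continuous allocation of Eq. (2), assuming no bound constraint is active."""
    log_w = np.log(np.asarray(w, dtype=float))
    # Each head deviates from the average by its log-sensitivity offset, scaled by 1/ln(beta).
    return avg_bits + (log_w - log_w.mean()) / np.log(beta)


# Hypothetical sensitivities whose geometric mean is 1.
w = np.array([np.e, 1.0, 1.0 / np.e])
b = reverse_waterfilling(w, avg_bits=4.0, beta=3.48)

print(b)         # first head gets 1/ln(3.48) extra bits, last head that much fewer
print(b.mean())  # the average bit-width is preserved
```

The first head, whose sensitivity is $e$ times the geometric mean, receives about 0.80 extra bits, matching the interpretation above.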
###### Theorem 3 (Gain ratio).
Let $\mathcal{J}^{*}$ and $\mathcal{J}_{u}$ denote the optimal and uniform weighted distortions (no active bounds). Then:

$$\frac{\mathcal{J}_{u}}{\mathcal{J}^{*}}=\frac{\bar{w}}{\widetilde{w}}\geq 1\tag{3}$$

where $\bar{w}=\frac{1}{N}\sum_{i}w_{i}$ is the arithmetic mean and $\widetilde{w}=(\prod_{i}w_{i})^{1/N}$ is the geometric mean of head sensitivities.
The ratio $\bar{w}/\widetilde{w}$ is computable from sensitivities alone without quantization, serving as a cheap *a priori* predictor of potential gain. For Qwen3 models, empirical AM/GM $\approx 2.0$, indicating substantial room for improvement.
###### Corollary 4.
If $\ln w_{i}\sim\mathcal{N}(\mu,\sigma^{2})$, then $\mathcal{J}_{u}/\mathcal{J}^{*}=\exp(\sigma^{2}/2)$.
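The gain ratio can be checked numerically. The sketch below draws synthetic log-normal sensitivities (not measured values), evaluates the weighted distortion under uniform and Eq. (2) allocations, and compares the ratio with AM/GM; the helper names and the choice $\sigma=1.2$ are assumptions made for illustration.

```python
import numpy as np


def am_gm_ratio(w):
    """Arithmetic mean divided by geometric mean of the sensitivities."""
    w = np.asarray(w, dtype=float)
    return w.mean() / np.exp(np.log(w).mean())


def weighted_distortion(w, b, alpha, beta):
    """Objective of Eq. (1) under D(b) = alpha * beta**(-b)."""
    return float(np.sum(w * alpha * beta ** (-b)))


rng = np.random.default_rng(0)
sigma = 1.2
w = np.exp(rng.normal(0.0, sigma, size=288))   # synthetic sensitivities, 36 x 8 heads
alpha, beta, avg_bits = 1.36, 3.48, 4.0

b_uniform = np.full_like(w, avg_bits)
b_opt = avg_bits + (np.log(w) - np.log(w).mean()) / np.log(beta)   # Eq. (2)

gain = weighted_distortion(w, b_uniform, alpha, beta) / weighted_distortion(w, b_opt, alpha, beta)
print(gain, am_gm_ratio(w))   # the two agree up to floating-point error
print(np.exp(sigma**2 / 2))   # Corollary 4 value; a finite sample only approximates it
```

Because the ratio depends only on the sensitivities, it can be evaluated before any quantizer is run.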
### 3.2 Sensitivity Estimation
We estimate per-head importance via squared gradient norms of the KV projection outputs:

$$w_{l,h}^{K}=\mathbb{E}_{\mathbf{x}\sim\mathcal{D}}\left[\frac{1}{T}\sum_{t=1}^{T}\left\|\frac{\partial\mathcal{L}}{\partial\mathbf{K}_{l,h,t}}\right\|^{2}\right],\quad w_{l,h}^{V}=\mathbb{E}_{\mathbf{x}\sim\mathcal{D}}\left[\frac{1}{T}\sum_{t=1}^{T}\left\|\frac{\partial\mathcal{L}}{\partial\mathbf{V}_{l,h,t}}\right\|^{2}\right]\tag{4}$$

where $\mathcal{L}$ is the causal LM loss and $\mathcal{D}$ is a small calibration set (16 sequences of length 512).
###### Proposition 5 (Loss-distortion connection).
Under a second-order Taylor expansion with diagonal Fisher approximation:

$$\mathbb{E}[\mathcal{L}(\hat{\theta})-\mathcal{L}(\theta)]\approx\sum_{l,h}\left[w_{l,h}^{K}\cdot D(b_{l,h}^{K})+w_{l,h}^{V}\cdot D(b_{l,h}^{V})\right]\tag{5}$$

This formalizes why gradient-based sensitivity is the correct proxy: it appears directly in the loss expansion, whereas activation-based proxies bound only forward-pass error amplification ([Section A.4](https://arxiv.org/html/2605.06675#A1.SS4)).
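The reduction in Eq. (4) can be sketched as follows, assuming the gradients with respect to the cached keys (or values) have already been collected by an autograd framework into one array. The array layout `(num_sequences, L, H, T, d)` and the function name `head_sensitivity` are assumptions of this sketch, not details given in the paper.

```python
import numpy as np


def head_sensitivity(grads):
    """Reduce KV gradients to one weight per (layer, head), following Eq. (4).

    grads: array of shape (num_sequences, L, H, T, d) holding dL/dK (or dL/dV)
    for each calibration sequence; collecting them is outside this sketch.
    """
    sq_norm = np.sum(np.square(grads), axis=-1)  # squared L2 norm per token -> (S, L, H, T)
    per_seq = sq_norm.mean(axis=-1)              # (1/T) * sum over tokens  -> (S, L, H)
    return per_seq.mean(axis=0)                  # average over calibration set -> (L, H)


# Toy shapes: 16 sequences, 2 layers, 4 KV heads, 8 tokens, head dimension 128.
rng = np.random.default_rng(0)
g_k = rng.normal(size=(16, 2, 4, 8, 128))
w_k = head_sensitivity(g_k)
print(w_k.shape)  # one weight per (layer, head)
```

The same function applies unchanged to value gradients, giving the separate $w^{K}$ and $w^{V}$ weights.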
### 3.3 Integer Allocation
For integer bit-widths, we solve ([1](https://arxiv.org/html/2605.06675#S3.E1)) via greedy marginal gain:

Algorithm 1: RateQuant Integer Bit Allocation

Input: Sensitivities $\{w_i\}_{i=1}^{N}$, distortion models $\{D_i\}$, budget $B$, bounds $b_{\min}, b_{\max}$

1: Initialize $b_i\leftarrow b_{\min}$ for all $i$; $R\leftarrow B-N\cdot b_{\min}$

2: while $R>0$ do

3: $i^{*}\leftarrow\operatorname*{arg\,max}_{i}\left\{w_i\cdot[D_i(b_i)-D_i(b_i+1)]:b_i<b_{\max}\right\}$

4: $b_{i^{*}}\leftarrow b_{i^{*}}+1$; $R\leftarrow R-1$

5: end while

6: return $\{b_i\}_{i=1}^{N}$

###### Proposition 6 (Greedy optimality).
When $D(b)$ is convex in $b$ (which holds under [Assumption 1](https://arxiv.org/html/2605.06675#Thmtheorem1)), [Algorithm 1](https://arxiv.org/html/2605.06675#alg1) produces the optimal integer solution. This follows from the polymatroid structure of the precedence-constrained selection problem (Oxley, [2011](https://arxiv.org/html/2605.06675#bib.bib19)).
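A minimal Python sketch of the greedy procedure is given below, assuming each component follows $D_i(b)=\alpha_i\beta_i^{-b}$. The heap-based selection is an implementation choice of this sketch (the algorithm only specifies an arg max), and the four sensitivities in the usage example are hypothetical.

```python
import heapq


def greedy_allocate(w, alpha, beta, budget, b_min, b_max):
    """Greedy marginal-gain integer allocation with D_i(b) = alpha[i] * beta[i] ** (-b)."""
    n = len(w)
    if not n * b_min <= budget <= n * b_max:
        raise ValueError("budget is infeasible for the given bounds")

    def gain(i, b):
        # Weighted distortion removed by raising component i from b to b + 1 bits.
        return w[i] * alpha[i] * (beta[i] ** (-b) - beta[i] ** (-(b + 1)))

    bits = [b_min] * n
    # heapq is a min-heap, so gains are stored negated to pop the largest first.
    heap = [(-gain(i, b_min), i) for i in range(n)] if b_min < b_max else []
    heapq.heapify(heap)
    for _ in range(budget - n * b_min):
        _, i = heapq.heappop(heap)
        bits[i] += 1
        if bits[i] < b_max:
            heapq.heappush(heap, (-gain(i, bits[i]), i))
    return bits


# Hypothetical sensitivities for four components sharing one distortion model.
w = [8.0, 1.0, 0.5, 0.1]
bits = greedy_allocate(w, alpha=[1.36] * 4, beta=[3.48] * 4, budget=14, b_min=2, b_max=6)
print(bits, sum(bits))
```

Because each component's marginal gain shrinks as its bit-width grows, only the next gain per component needs to sit in the heap at any time.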
### 3.4 Quantizer-Agnostic Extensions
The framework above assumes a single distortion model shared by all components. Two extensions make RateQuant applicable to arbitrary base quantizers.
#### Empirical distortion calibration.
Different quantizers have different $D(b)$ curves: TurboQuant has $\beta\approx 3.6$ while KIVI/QuaRot have $\beta\approx 5.0$–$5.3$ ([Appendix B](https://arxiv.org/html/2605.06675#A2)). We measure MSE at $b\in\{2,3,4,5,6\}$ on representative data and fit $(\alpha_q,\beta_q)$ via least-squares on $\ln D$ vs. $b$. [Algorithm 1](https://arxiv.org/html/2605.06675#alg1) then uses quantizer-specific distortion models in the marginal gain computation. This step is critical: using the wrong $\beta$ inverts the marginal gain ordering. At 2.5 bits, naïve RateQuant (which applies TurboQuant’s distortion model) worsens KIVI’s PPL from 49.3 to 87.0 ([Table 3](https://arxiv.org/html/2605.06675#S4.T3)).
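The calibration fit amounts to a straight-line regression of $\ln D$ on $b$, since $\ln D=\ln\alpha-b\ln\beta$. The sketch below uses `numpy.polyfit` on synthetic MSE values generated from a known curve with mild noise; these numbers are not measurements from any quantizer, and the function name `fit_distortion_model` is hypothetical.

```python
import numpy as np


def fit_distortion_model(bit_widths, mse):
    """Least-squares fit of ln D = ln(alpha) - b * ln(beta); returns (alpha, beta)."""
    x = np.asarray(bit_widths, dtype=float)
    slope, intercept = np.polyfit(x, np.log(mse), deg=1)
    return float(np.exp(intercept)), float(np.exp(-slope))


# Synthetic MSE values from a known curve plus mild multiplicative noise (not real data).
bits = np.array([2, 3, 4, 5, 6])
rng = np.random.default_rng(0)
noise = np.exp(rng.normal(0.0, 0.02, size=bits.size))
mse = 1.36 * 3.48 ** (-bits.astype(float)) * noise

alpha_hat, beta_hat = fit_distortion_model(bits, mse)
print(alpha_hat, beta_hat)  # close to the generating values 1.36 and 3.48
```

Running the same fit separately for each quantizer (and, where schemes differ, for keys and values) yields the per-component models consumed by the greedy allocation.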
Figure 3: Marginal gain $w_i\cdot\Delta D_i(b)$ for the top-8 heads. (a) Correct $\beta=3.6$: gains well-separated. (b) Correct $\beta=5.1$: faster decay compresses gains. (c) Mismatch ($\beta=3.6$ applied to $\beta=5.1$ data): head ranking inverted.
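A two-head toy case shows how the choice of $\beta$ can flip which head receives the next bit. The weights and current bit-widths below are hypothetical; with a one-bit gap between the heads, the ordering flips when $\beta$ crosses the sensitivity ratio $w_B/w_A=4$.

```python
def marginal_gain(w, b, beta, alpha=1.0):
    """Weighted distortion removed by going from b to b + 1 bits, D(b) = alpha * beta**(-b)."""
    return w * alpha * (beta ** (-b) - beta ** (-(b + 1)))


# Two hypothetical heads: A is less sensitive but currently holds fewer bits than B.
head_a = {"w": 1.0, "b": 2}
head_b = {"w": 4.0, "b": 3}

for beta in (3.6, 5.1):
    gain_a = marginal_gain(head_a["w"], head_a["b"], beta)
    gain_b = marginal_gain(head_b["w"], head_b["b"], beta)
    winner = "A" if gain_a > gain_b else "B"
    print(f"beta={beta}: next bit goes to head {winner}")
```

With $\beta=3.6$ the next bit goes to head B, while with $\beta=5.1$ it goes to head A, so an allocator using the wrong decay rate spends bits on a different head than the true distortion curve would favor.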
#### Separate K/V allocation.
When K and V use different quantization schemes (e.g., KIVI applies per-channel symmetric for keys and per-token asymmetric for values), their distortion curves differ. We generalize the allocation to $2N$ independent components:

$$\min_{\mathbf{b}^{K},\mathbf{b}^{V}}\sum_{i=1}^{N}\left[w_{i}^{K}\cdot D_{i}^{K}(b_{i}^{K})+w_{i}^{V}\cdot D_{i}^{V}(b_{i}^{V})\right]\quad\text{s.t.}\quad\sum_{i}(b_{i}^{K}+b_{i}^{V})=B\tag{6}$$

[Algorithm 1](https://arxiv.org/html/2605.06675#alg1) applies directly with $2N$ components, each with its own sensitivity and distortion model. This enables automatic discovery of asymmetric budgets: for KIVI at 2.5 bits the optimal split is $\bar{b}_{K}=2.85$, $\bar{b}_{V}=2.15$ ([Section 4.3](https://arxiv.org/html/2605.06675#S4.SS3)).
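The joint problem in Eq. (6) can be sketched by concatenating key and value components and running the same greedy rule over all $2N$ of them. Everything numeric here is hypothetical: the sensitivities are synthetic, the $(\alpha,\beta)$ pairs are placeholders rather than fitted values, and the sketch assumes the total budget is $2N$ times the average bit-width.

```python
import numpy as np


def allocate_kv(w_k, w_v, model_k, model_v, avg_bits, b_min=2, b_max=6):
    """Greedy allocation over 2N components; model_* is an (alpha, beta) pair per tensor type."""
    n = len(w_k)
    total = int(round(avg_bits * 2 * n))  # assumed: budget averaged over all 2N components
    if not 2 * n * b_min <= total <= 2 * n * b_max:
        raise ValueError("budget is infeasible for the given bounds")
    w = np.concatenate([w_k, w_v]).astype(float)
    alpha = np.repeat([model_k[0], model_v[0]], n)
    beta = np.repeat([model_k[1], model_v[1]], n)
    bits = np.full(2 * n, b_min)
    for _ in range(total - 2 * n * b_min):
        gain = w * alpha * (beta ** (-bits.astype(float)) - beta ** (-(bits + 1.0)))
        gain[bits >= b_max] = -np.inf  # saturated components cannot receive more bits
        bits[int(np.argmax(gain))] += 1
    return bits[:n], bits[n:]


# Hypothetical inputs: 16 heads, keys drawn with larger sensitivities than values.
rng = np.random.default_rng(0)
w_k = np.exp(rng.normal(1.0, 1.0, size=16))
w_v = np.exp(rng.normal(0.0, 1.0, size=16))
b_k, b_v = allocate_kv(w_k, w_v, model_k=(2.0, 5.0), model_v=(1.0, 5.3), avg_bits=2.5)
print(b_k.mean(), b_v.mean())  # the K/V split is an output of the allocation, not an input
```

No K/V ratio is specified up front; any asymmetry in the printed averages comes from the sensitivities and distortion models alone.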
#### Pipeline summary.
RateQuant operates in four phases ([Fig. 2](https://arxiv.org/html/2605.06675#S1.F2)): (1) sensitivity calibration via 16 forward+backward passes ($\sim$1.6 s for 8B on a single H200); (2) distortion modeling at 5 bit-widths ($<0.1$ s); (3) greedy allocation with $2N$ components ($<0.01$ s); (4) online inference at the allocated per-head bit-widths with zero runtime overhead (a static 2 KB lookup table). The first three phases are one-time costs amortized over deployment.
## 4 Experiments
### 4.1 Setup
#### Models.
We evaluate on three Qwen3 variants: 4B (36 layers, 8 KV heads), 8B (36 layers, 8 KV heads), and 32B (64 layers, 8 KV heads), all using grouped-query attention with head dimension $d_h=128$.
#### Evaluation.
WikiText-2 (Merity et al., [2017](https://arxiv.org/html/2605.06675#bib.bib17)) perplexity (PPL) with sequence length 2048 is the primary metric. Downstream evaluation uses ARC-Challenge, HellaSwag, PIQA, and WinoGrande via lm-evaluation-harness (Gao et al., [2024](https://arxiv.org/html/2605.06675#bib.bib8)).
#### Base quantizers.
We test three quantizers that span different design philosophies: TurboQuant (rotation-based VQ (Zandieh et al., [2026](https://arxiv.org/html/2605.06675#bib.bib26))), KIVI (per-channel symmetric K, per-token asymmetric V (Liu et al., [2024](https://arxiv.org/html/2605.06675#bib.bib16))), and QuaRot (Hadamard rotation with per-token symmetric quantization (Ashkboos et al., [2024](https://arxiv.org/html/2605.06675#bib.bib1))). To isolate the effect of allocation, uniform and RateQuant use identical seeds, the same integer allocation framework ([Algorithm 1](https://arxiv.org/html/2605.06675#alg1)), and the same total bit budget. The only difference is the sensitivity weights: $w_i=1$ for uniform vs. gradient-based for RateQuant.
#### Calibration.
We use 16 sequences of length 512 from WikiText-2 for gradient sensitivity estimation ($\sim$1.6 s for 8B on one H200) and distortion calibration ($<0.1$ s).
### 4.2 Main Results
[Table 2](https://arxiv.org/html/2605.06675#S4.T2) presents the core comparison between uniform and RateQuant allocation under the TurboQuant base quantizer across three model sizes.

Table 2: WikiText-2 PPL ($\downarrow$) for uniform TurboQuant vs. RateQuant (gradient sensitivity, seed 42; Qwen3-8B averaged over 3 seeds, see [Appendix F](https://arxiv.org/html/2605.06675#A6)). $\Delta$: PPL improvement (positive = RateQuant better). Headroom = Uniform $-$ FP16. FP16 rows give the unquantized baseline PPL.

| Model | $\bar{b}$ | $b_{\min}/b_{\max}$ | Uniform | RateQuant | $\Delta$ | Headroom |
|---|---|---|---|---|---|---|
| Qwen3-4B | 3.5 | 3/5 | 13.89 | 13.70 | +0.20 | 0.70 |
| | 4.0 | 3/6 | 13.92 | 13.47 | +0.45 | 0.73 |
| | 4.5 | 3/6 | 13.42 | 13.47 | −0.05 | 0.23 |
| | FP16 | | 13.19 | | | |
| Qwen3-8B | 3.5 | 3/5 | 10.00 | 9.76 | +0.24 | 0.47 |
| | 4.0 | 3/6 | 9.94 | 9.67 | +0.27 | 0.41 |
| | 4.5 | 3/6 | 9.61 | 9.59 | +0.02 | 0.08 |
| | FP16 | | 9.53 | | | |
| Qwen3-32B | 3.5 | 3/5 | 7.70 | 7.64 | +0.06 | 0.20 |
| | 4.0 | 3/6 | 7.53 | 7.50 | +0.03 | 0.03 |
| | 4.5 | 3/6 | 7.52 | 7.50 | +0.02 | 0.02 |
| | FP16 | | 7.50 | | | |

RateQuant recovers 66% of the quantization-induced degradation for Qwen3-8B at 4.0 bits ($+0.27\pm 0.06$ PPL across 3 seeds) and 62% for Qwen3-4B ($+0.45$). The sweet spot lies at 3.5 to 4.0 bits, where the headroom (the gap between uniform quantization and FP16) exceeds 0.4 PPL.
Qwen3-32B shows consistent but smaller absolute gains ($+0.02$ to $+0.06$) due to lower headroom: large models are more robust to uniform quantization, but RateQuant still recovers 30% of the 3.5-bit headroom (AM/GM $\approx 2.0$; [Theorem 3](https://arxiv.org/html/2605.06675#Thmtheorem3)). [Appendix L](https://arxiv.org/html/2605.06675#A12) provides complete results including 3.0, 5.0, and 6.0 bit settings.
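The recovery percentages quoted above follow from dividing the PPL improvement by the headroom. A small sketch of that arithmetic, using perplexities copied from Table 2 (the function name `recovered_fraction` is illustrative):

```python
def recovered_fraction(uniform_ppl, ratequant_ppl, fp16_ppl):
    """Share of the uniform-to-FP16 perplexity gap closed by the mixed-precision allocation."""
    headroom = uniform_ppl - fp16_ppl
    return (uniform_ppl - ratequant_ppl) / headroom


# (uniform, RateQuant, FP16) perplexities copied from Table 2.
rows = {
    "Qwen3-4B @ 4.0 bits": (13.92, 13.47, 13.19),
    "Qwen3-8B @ 4.0 bits": (9.94, 9.67, 9.53),
    "Qwen3-32B @ 3.5 bits": (7.70, 7.64, 7.50),
}
for name, ppl in rows.items():
    print(f"{name}: {recovered_fraction(*ppl):.0%} of headroom recovered")
```

This reproduces the 62%, 66%, and 30% figures stated in the text.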
### 4.3 Cross-Quantizer Calibration
We extend RateQuant to non-TurboQuant quantizers, where distortion calibration becomes essential. The fitted $\beta$ diverges substantially: TurboQuant’s $\beta\approx 3.6$ vs. KIVI/QuaRot’s $\beta\approx 5.0$ to $5.3$ (see [Appendix B](https://arxiv.org/html/2605.06675#A2) for $R^2$ values). [Table 3](https://arxiv.org/html/2605.06675#S4.T3) compares four allocation strategies at aggressive bit budgets on Qwen3-8B.
Table 3: WikiText-2 PPL ($\downarrow$) under four allocation strategies, Qwen3-8B ($b_{\min}=2$, seed 42). Theo: TurboQuant’s $D(b)$; Cal: calibrated; +Sep: K/V separate.

At aggressive budgets ($\leq 3.0$ bits), applying TurboQuant’s distortion model to non-TurboQuant quantizers is harmful: mismatched $\beta$ worsens KIVI from 49.3 to 87.0 and QuaRot from 34.9 to 271.9, because the inverted marginal gain ordering ([Fig. 3](https://arxiv.org/html/2605.06675#S3.F3)) allocates bits to the wrong heads. Calibration alone recovers only part of this loss; the key component is K/V separation: for KIVI at 2.5 bits, calibrated joint allocation yields 73.1, still worse than the uniform 49.3, whereas separate K/V allocation reaches 14.9. The algorithm discovers that error-prone per-channel keys need 2.85 bits while per-token values need only 2.15. At higher budgets ($\geq 3.5$ bits), all quantizers approach FP16 and headroom vanishes; the practical regime for calibrated RateQuant is $\bar{b}\leq 3.0$.
Notably, TurboQuant + RateQuant at 3.0 bits achieves PPL 9.88, surpassing both KIVI uniform 3.0 (10.81) and QuaRot uniform 3.0 (11.90), demonstrating that principled allocation on a good base quantizer can outperform stronger quantizers with uniform bits.

Figure 4: Per-head sensitivity for Qwen3-8B (36 layers $\times$ 8 KV heads, log scale). Left: Gradient-based shows a U-shaped pattern (early + late layers sensitive). Right: Activation-based is monotonically increasing, explaining the 1.07 PPL swing in [Section 4.4](https://arxiv.org/html/2605.06675#S4.SS4.SSS0.Px1).

Table 4: Downstream accuracy (%) and throughput, Qwen3-8B at 4.0 bits (TurboQuant).

Figure 5: PPL vs. average bits (Qwen3-8B). Solid: RateQuant; dashed: uniform.
### 4.4 Ablation and Analysis
#### Sensitivity proxy.
Table 5 compares gradient-based and activation-based proxies. The proxy choice dominates: at 3.5 bits, gradient allocation achieves 9.76 while activation-based yields 10.83, a 1.07 PPL swing exceeding the 0.47 PPL uniform-to-FP16 gap. Activation-based sensitivity measures error amplification, not loss impact; it over-allocates to late layers whose large $\|Q\|\cdot\|V\|$ products inflate the proxy without corresponding PPL sensitivity.
[Fig. 4](https://arxiv.org/html/2605.06675#S4.F4) visualizes this difference. Gradient sensitivity exhibits a U-shaped pattern consistent with [Proposition 5](https://arxiv.org/html/2605.06675#Thmtheorem5): both early and late layers carry high loss curvature. Activation sensitivity monotonically increases with depth, driven by accumulated residual stream norms. This confirms that gradient-based weights are the correct proxy for loss-preserving allocation.

Table 5: Sensitivity proxy ablation, Qwen3-8B ($b_{\min}=3$, seed 42).
#### When does RateQuant help?
Three conditions must hold: (i) sufficient sensitivity heterogeneity (AM/GM $\gtrsim 2$); (ii) sufficient quantization headroom (Uniform $-$ FP16 $\gtrsim 0.2$); and (iii) a correctly matched distortion model. Qwen3-32B illustrates (ii): despite high heterogeneity, low headroom limits gains. The cross-quantizer results ([Table 3](https://arxiv.org/html/2605.06675#S4.T3)) illustrate (iii): mismatched $\beta$ is harmful, while calibration unlocks the headline gains.
### 4.5 Baselines and Downstream
Table 4 shows that RateQuant at 4.0 bits nearly matches FP16 (64.2% vs. 64.4%), recovering 89.8% of the accuracy gap, at parity throughput (38.0 vs. 38.1 tok/s). [Fig. 5](https://arxiv.org/html/2605.06675#S4.F5) confirms the trend: RateQuant consistently improves over uniform, with the largest gap at aggressive budgets. A head-to-head comparison with mixed-precision baselines ([Table 8](https://arxiv.org/html/2605.06675#A4.T8) in [Appendix D](https://arxiv.org/html/2605.06675#A4)) shows that at 2.5 bits on KIVI, layer-level methods reduce PPL by $\sim$25%, a global K$>$V split by 37%, but RateQuant achieves a 70% reduction. Component ablation is in [Appendix C](https://arxiv.org/html/2605.06675#A3).
## 5 Conclusion
We presented RateQuant, a framework for mixed-precision KV cache quantization grounded in rate-distortion theory. Our central finding is that different quantizers have fundamentally different distortion characteristics: the decay rate $\beta$ ranges from 3.6 to 5.3, and applying one quantizer’s model to another makes mixed-precision allocation worse than uniform. Empirical distortion calibration resolves this mismatch. Combined with gradient-based sensitivity estimation and separate K/V bit budgets, it transforms RateQuant into a quantizer-agnostic allocation layer. On KIVI at 2.5 average bits, RateQuant reduces perplexity from 49.3 to 14.9 with zero runtime overhead. The AM/GM ratio of head sensitivities ($\bar{w}/\widetilde{w}$) serves as a practical predictor of when mixed precision helps. Looking forward, extending the static per-head allocation to dynamic token-level budgets could capture input-dependent sensitivity variations in long-context workloads. Combining RateQuant with orthogonal KV compression strategies such as eviction and token merging is another promising direction. We believe the rate-distortion perspective developed here generalizes beyond KV caches to principled resource allocation in other heterogeneous neural network components.
## References
- Ashkboos et al. [2024] Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L Croci, Bo Li, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. Quarot: Outlier-free 4-bit inference in rotated llms. In *ICML*, 2024.
- Chen et al. [2025a] Jiahao Chen, Fangcheng Wei, Zhuowei Liu, and Zhongqiu Peng. Kvmix: Gradient-based layer importance-aware mixed-precision quantization for kv cache. *arXiv preprint arXiv:2506.08018*, 2025a.
- Chen et al. [2025b] Yilong Chen et al. Progressive mixed-precision kv cache quantization for long-cot llms. *arXiv preprint arXiv:2505.18610*, 2025b.
- Cover and Thomas [2006] Thomas M Cover and Joy A Thomas. *Elements of Information Theory*. John Wiley & Sons, 2nd edition, 2006.
- Dettmers et al. [2022] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Gpt3.int8(): 8-bit matrix multiplication for transformers at scale. *NeurIPS*, 2022.
- Dong et al. [2019] Zhen Dong, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. Hawq: Hessian aware quantization of neural networks with mixed-precision. *ICCV*, 2019.
- Dong et al. [2020] Zhen Dong, Zhewei Yao, Daiyaan Arfeen, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. Hawq-v2: Hessian aware trace-weighted quantization of neural networks. *NeurIPS*, 2020.
- Gao et al. [2024] Leo Gao, Jonathan Tow, et al. A framework for few-shot language model evaluation, 2024. URL [https://zenodo.org/records/10256836](https://zenodo.org/records/10256836).
- Ge et al. [2024] Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive kv cache compression for llms. *ICLR*, 2024.
- Hooper et al. [2024] Coleman Hooper, Sehoon Kim, Hiva Mohammadi, Michael W Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. Kvquant: Towards 10 million context length llm inference with kv cache quantization. *NeurIPS*, 2024.
- Kim et al. [2025] Dongjin Kim et al. Quantize what counts: More for keys, less for values. *arXiv preprint arXiv:2502.15075*, 2025.
- Li et al. [2025a] Yichi Li et al. Cokv: Optimizing kv cache allocation via cooperative game. *arXiv preprint arXiv:2502.17501*, 2025a.
- Li et al. [2025b] Yifei Li, Zhehao Wu, and Cheng Zhou. Kvtuner: Sensitivity-aware layer-wise mixed-precision kv cache quantization. 2025b.
- Liao and Wen [2026] Chengxi Liao and Zeyi Wen. Channel-aware mixed-precision quantization for efficient long-context inference. In *ICLR*, 2026.
- Lin et al. [2024] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. Awq: Activation-aware weight quantization for llm compression and acceleration. *MLSys*, 2024.
- Liu et al. [2024] Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. *ICML*, 2024.
- Merity et al. [2017] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. *ICLR*, 2017.
- Nawrot et al. [2024] Piotr Nawrot, Adrian Łancucki, Marcin Chochowski, David Tarjan, and Edoardo Maria Ponti. Dynamic memory compression: Retrofitting llms for accelerated inference. *ICML*, 2024.
- Oxley [2011] James G Oxley. *Matroid Theory*. Oxford University Press, 2011.
- Pope et al. [2023] Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. *Proceedings of Machine Learning and Systems*, 2023.
- Tseng et al. [2025] Albert Tseng et al. Radio: Rate-distortion optimization for large language model compression. 2025.
- Voita et al. [2019] Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. *ACL*, 2019.
- Wang et al. [2025] Xingyu Wang et al. Accurate and efficient 2-bit kv cache quantization with dynamic channel-wise precision boost. *arXiv preprint arXiv:2511.18643*, 2025.
- Xiao et al. [2023] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. *ICML*, 2023.
- Yue et al\. \[2024\]Peng Yue et al\.Wkvquant: Quantizing weight and key/value cache for large language models\.*arXiv preprint*, 2024\.
- Zandieh et al\. \[2026\]Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni\.Turboquant: Online vector quantization with near\-optimal distortion rate\.In*ICLR*, 2026\.
- Zhang et al\. \[2025\]Wei Zhang et al\.Query\-aware mixed\-precision kv cache quantization for long\-context reasoning\.*arXiv preprint arXiv:2512\.19206*, 2025\.
- Zhang et al\. \[2024\]Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al\.H2o: Heavy\-hitter oracle for efficient generative inference of large language models\.*NeurIPS*, 2024\.
- Zheng et al\. \[2025\]Qi Zheng et al\.Baq: Efficient bit allocation quantization for large language models\.*arXiv preprint arXiv:2506\.05664*, 2025\.
## NeurIPS Paper Checklist
1. Claims
   - Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
   - Answer: \[Yes\]
   - Justification: The abstract states three contributions \(framework, calibration extensions, validation\), each with a corresponding section and empirical support\.
2. Limitations
   - Question: Does the paper discuss the limitations of the work performed by the authors?
   - Answer: \[Yes\]
   - Justification: [Appendix M](https://arxiv.org/html/2605.06675#A13) discusses calibration cost, evaluation scope, per\-head independence assumption, and token\-position uniformity\.
3. Theory assumptions and proofs
   - Question: For each theoretical result, does the paper provide the full set of assumptions and a complete \(and correct\) proof?
   - Answer: \[Yes\]
   - Justification: [1](https://arxiv.org/html/2605.06675#Thmtheorem1) is stated explicitly and validated \([Appendix E](https://arxiv.org/html/2605.06675#A5)\)\. Proof sketches are in the main text; full proofs are in [Appendix A](https://arxiv.org/html/2605.06675#A1)\.
4. Experimental result reproducibility
   - Question: Does the paper fully disclose all the information needed to reproduce the main experimental results?
   - Answer: \[Yes\]
   - Justification: [Section 4\.1](https://arxiv.org/html/2605.06675#S4.SS1) specifies models, evaluation protocol, calibration details, seeds, and fair comparison protocol\. [Algorithm 1](https://arxiv.org/html/2605.06675#alg1) is fully specified\.
5. Open access to data and code
   - Question: Does the paper provide open access to the data and code?
   - Answer: \[No\]
   - Justification: Code will be released upon acceptance\. The algorithm is fully described\. WikiText\-2 is public\.
6. Experimental setting/details
   - Question: Does the paper specify all training and test details necessary to understand the results?
   - Answer: \[Yes\]
   - Justification: All hyperparameters, evaluation protocol, and hardware are specified in [Section 4\.1](https://arxiv.org/html/2605.06675#S4.SS1) and [Appendix I](https://arxiv.org/html/2605.06675#A9)\.
7. Experiment statistical significance
   - Question: Does the paper report error bars or statistical significance information?
   - Answer: \[Yes\]
   - Justification: Qwen3\-8B reports mean $\pm$ std over 3 seeds \([Table 2](https://arxiv.org/html/2605.06675#S4.T2), [Table 10](https://arxiv.org/html/2605.06675#A6.T10)\)\. All seeds show consistent direction at 3\.5–4\.0b\.
8. Experiments compute resources
   - Question: Does the paper provide sufficient information on compute resources?
   - Answer: \[Yes\]
   - Justification: Single NVIDIA H200 GPU; calibration times in [Appendix I](https://arxiv.org/html/2605.06675#A9)\.
9. Code of ethics
   - Question: Does the research conform with the NeurIPS Code of Ethics?
   - Answer: \[Yes\]
   - Justification: No human subjects, private data, or dual\-use concerns beyond general LLM deployment\.
10. Broader impacts
    - Question: Does the paper discuss potential societal impacts?
    - Answer: \[Yes\]
    - Justification: [Appendix M](https://arxiv.org/html/2605.06675#A13) discusses both positive and negative impacts\.
11. Safeguards
    - Question: Does the paper describe safeguards for responsible release?
    - Answer: \[N/A\]
    - Justification: No pre\-trained models, datasets, or assets with misuse risk are released\.
12. Licenses for existing assets
    - Question: Are existing assets properly credited with license information?
    - Answer: \[Yes\]
    - Justification: All models and datasets are cited per their licenses\.
13. New assets
    - Question: Are new assets well documented?
    - Answer: \[N/A\]
    - Justification: No new datasets or models released\.
14. Crowdsourcing and human subjects
    - Question: Does the paper include details about crowdsourcing or human subject research?
    - Answer: \[N/A\]
    - Justification: Not applicable\.
15. IRB approvals
    - Question: Were IRB approvals obtained if applicable?
    - Answer: \[N/A\]
    - Justification: No human subjects\.
16. Declaration of LLM usage
    - Question: Does the paper describe LLM usage in core methods?
    - Answer: \[N/A\]
    - Justification: LLMs are evaluation subjects only, not part of the methodology\.
## Appendix A Proofs
### A\.1 Proof of [Theorem 2](https://arxiv.org/html/2605.06675#Thmtheorem2) \(Reverse Waterfilling\)
We solve the constrained optimization:
$$\min_{\mathbf{b}}\sum_{i=1}^{N}w_{i}\alpha\beta^{-b_{i}}\quad\text{s.t.}\quad\sum_{i=1}^{N}b_{i}=B,\quad b_{\min}\leq b_{i}\leq b_{\max}$$
#### KKT conditions\.
The Lagrangian is:
$$\mathcal{L}=\sum_{i}w_{i}\alpha\beta^{-b_{i}}+\lambda\left(\sum_{i}b_{i}-B\right)+\sum_{i}\mu_{i}(b_{\min}-b_{i})+\sum_{i}\nu_{i}(b_{i}-b_{\max})$$
Stationarity: $-w_{i}\alpha(\ln\beta)\beta^{-b_{i}}+\lambda-\mu_{i}+\nu_{i}=0$, with complementary slackness $\mu_{i}(b_{\min}-b_{i})=0$, $\nu_{i}(b_{i}-b_{\max})=0$\.
#### Unconstrained solution\.
For heads with $b_{\min}<b_{i}^{*}<b_{\max}$ \(so $\mu_{i}=\nu_{i}=0$\):
$$w_{i}\alpha(\ln\beta)\beta^{-b_{i}}=\lambda\implies b_{i}=\frac{\ln(w_{i}\alpha\ln\beta)-\ln\lambda}{\ln\beta}$$
Let $\mathcal{I}_{\text{free}}$ be the unconstrained set\. The budget constraint gives $b_{i}^{*}=\bar{b}_{\text{free}}+(\ln w_{i}-\overline{\ln w}_{\text{free}})/\ln\beta$, where $\bar{b}_{\text{free}}$ is the average bit budget remaining for the free heads and $\overline{\ln w}_{\text{free}}$ is their mean log\-weight\. When all heads are free, this simplifies to [Eq\. 2](https://arxiv.org/html/2605.06675#S3.E2)\.
#### Iterative waterfilling\.
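When bounds are active: \(1\) initialize all heads as free; \(2\) compute $b_{i}^{*}$ for the free heads; \(3\) clip any head outside $[b_{\min},b_{\max}]$ to the violated bound and remove it from the free set; \(4\) subtract the bits of the fixed heads from the budget; \(5\) repeat from step \(2\) until no free head violates a bound\. The procedure terminates in at most $N$ iterations, since every iteration except the last fixes at least one head\. ∎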
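The bounded solution can be sketched numerically. The snippet below is a minimal illustration, not the authors' implementation: instead of the clip-and-fix loop it bisects on the water level $\theta=\ln\lambda-\ln(\alpha\ln\beta)$, which reaches the same KKT point $b_i=\operatorname{clip}((\ln w_i-\theta)/\ln\beta,\ b_{\min},\ b_{\max})$. The function name `waterfill_bits` and the example weights are hypothetical, and the sketch assumes positive weights and $b_{\min}\leq\bar{b}\leq b_{\max}$.

```python
import math

def waterfill_bits(weights, avg_bits, beta, b_min, b_max, iters=200):
    """Continuous bit allocation minimizing sum_i w_i * alpha * beta**(-b_i).

    alpha does not affect the optimal allocation, so it is not a parameter.
    """
    ln_beta = math.log(beta)
    log_w = [math.log(w) for w in weights]
    target = avg_bits * len(weights)  # total budget B

    def alloc(theta):
        # KKT form: free heads follow the log-weight, others sit on a bound.
        return [min(max((lw - theta) / ln_beta, b_min), b_max) for lw in log_w]

    lo = min(log_w) - b_max * ln_beta  # every head clipped to b_max
    hi = max(log_w) - b_min * ln_beta  # every head clipped to b_min
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(alloc(mid)) > target:
            lo = mid  # too many bits handed out: raise the water level
        else:
            hi = mid
    return alloc(0.5 * (lo + hi))

# Hypothetical sensitivities spanning four orders of magnitude.
bits = waterfill_bits([0.1, 1.0, 10.0, 1000.0], avg_bits=4.0,
                      beta=3.6, b_min=3.0, b_max=6.0)
print(bits, sum(bits))
```

Bisection is used here because it stays correct when both the lower and upper bounds are active at once; the allocation is monotone in the water level, so the search always brackets the budget.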
### A\.2 Proof of [Theorem 3](https://arxiv.org/html/2605.06675#Thmtheorem3) \(Gain Ratio\)
Let $Y_{i}=\ln w_{i}$\. Under uniform allocation \($b_{i}=\bar{b}$\): $\mathcal{J}_{u}=N\alpha\beta^{-\bar{b}}\bar{w}$, where $\bar{w}$ is the arithmetic mean of the weights\.
Substituting $b_{i}^{*}=\bar{b}+(Y_{i}-\bar{Y})/\ln\beta$ into $\mathcal{J}^{*}$:
$$\mathcal{J}^{*}=\alpha\sum_{i}e^{Y_{i}}\beta^{-\bar{b}-(Y_{i}-\bar{Y})/\ln\beta}=\alpha\beta^{-\bar{b}}e^{\bar{Y}}\sum_{i}1=N\alpha\beta^{-\bar{b}}\widetilde{w}\tag{7}$$
where we used $\beta^{-x/\ln\beta}=e^{-x}$ and $\widetilde{w}=e^{\bar{Y}}$, the geometric mean of the weights\. Hence $\mathcal{J}_{u}/\mathcal{J}^{*}=\bar{w}/\widetilde{w}\geq 1$ by AM\-GM\.
For log\-normal weights $Y_{i}\sim\mathcal{N}(\mu,\sigma^{2})$: $\bar{w}/\widetilde{w}\to\exp(\sigma^{2}/2)$ as $N\to\infty$\. ∎
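The identity can be checked numerically. The following sketch (hypothetical weights and helper names, bounds ignored so that all heads are free) compares the measured ratio $\mathcal{J}_{u}/\mathcal{J}^{*}$ with the AM/GM ratio, and then compares the AM/GM ratio of sampled log-normal weights with $\exp(\sigma^{2}/2)$.

```python
import math
import random

def objective(weights, bits, alpha, beta):
    """Weighted distortion sum_i w_i * alpha * beta**(-b_i)."""
    return sum(w * alpha * beta ** (-b) for w, b in zip(weights, bits))

def optimal_bits(weights, avg_bits, beta):
    """Unconstrained optimum b_i = avg + (ln w_i - mean ln w) / ln beta."""
    log_w = [math.log(w) for w in weights]
    mean_log_w = sum(log_w) / len(log_w)
    return [avg_bits + (lw - mean_log_w) / math.log(beta) for lw in log_w]

def am_gm_ratio(weights):
    """Arithmetic mean divided by geometric mean."""
    am = sum(weights) / len(weights)
    gm = math.exp(sum(math.log(w) for w in weights) / len(weights))
    return am / gm

# Small deterministic case: the two ratios should agree up to rounding.
weights = [0.5, 2.0, 8.0]
alpha, beta, avg_bits = 1.0, 4.0, 3.0
j_uniform = objective(weights, [avg_bits] * len(weights), alpha, beta)
j_optimal = objective(weights, optimal_bits(weights, avg_bits, beta), alpha, beta)
measured_gain = j_uniform / j_optimal
predicted_gain = am_gm_ratio(weights)

# Log-normal weights: the ratio should approach exp(sigma^2 / 2) for large N.
rng = random.Random(0)
sigma = 1.0
lognormal = [math.exp(rng.gauss(0.0, sigma)) for _ in range(200_000)]
empirical = am_gm_ratio(lognormal)
limit = math.exp(sigma ** 2 / 2)
print(measured_gain, predicted_gain, empirical, limit)
```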
### A\.3 Proof of [Proposition 6](https://arxiv.org/html/2605.06675#Thmtheorem6) \(Greedy Optimality\)
The marginal gain of the $k$\-th bit to head $i$ is $g_{i}(k)=w_{i}\alpha\beta^{-(b_{\min}+k-1)}(1-\beta^{-1})$, strictly decreasing in $k$\. The total gain is $\sum_{i}\sum_{k=1}^{b_{i}-b_{\min}}g_{i}(k)$\. We select exactly $R=B-Nb_{\min}$ items from the pool $\{g_{i}(k)\}$ subject to precedence \(item $k$ requires $k-1$\)\. Since gains are decreasing per head, this forms a polymatroid \[Oxley, [2011](https://arxiv.org/html/2605.06675#bib.bib19)\] and greedy is optimal\. ∎
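A minimal sketch of the greedy integer allocation follows, assuming integer bounds and a heap keyed by the marginal gain $g_i$. The function names and the example are hypothetical, and the `b_max` cap is carried over from the bounds in A.1 rather than stated in the proposition.

```python
import heapq

def greedy_bits(weights, total_bits, alpha, beta, b_min, b_max):
    """Integer allocation: hand out bits one at a time by largest marginal gain."""
    n = len(weights)
    bits = [b_min] * n
    remaining = total_bits - n * b_min  # R = B - N * b_min

    def gain(i, b):
        # Distortion drop when head i goes from b to b + 1 bits.
        return weights[i] * alpha * beta ** (-b) * (1.0 - 1.0 / beta)

    # Max-heap via negated gains. Only the next bit of each head is in the
    # heap, which enforces the precedence constraint.
    heap = [(-gain(i, b_min), i) for i in range(n) if b_min < b_max]
    heapq.heapify(heap)
    while remaining > 0 and heap:
        _, i = heapq.heappop(heap)
        bits[i] += 1
        remaining -= 1
        if bits[i] < b_max:
            heapq.heappush(heap, (-gain(i, bits[i]), i))
    return bits

def total_distortion(weights, bits, alpha, beta):
    """Weighted distortion sum_i w_i * alpha * beta**(-b_i)."""
    return sum(w * alpha * beta ** (-b) for w, b in zip(weights, bits))

example = greedy_bits([1.0, 4.0, 16.0], total_bits=9,
                      alpha=1.0, beta=4.0, b_min=2, b_max=5)
print(example, total_distortion([1.0, 4.0, 16.0], example, 1.0, 4.0))
```

Ties between equal gains can be broken either way without changing the total distortion, so the specific allocation may differ between implementations while the objective value does not.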
### A\.4 Proof of [Proposition 5](https://arxiv.org/html/2605.06675#Thmtheorem5) \(Loss\-Distortion Connection\)
Replacing $\mathbf{K}_{l,h}$ with $\hat{\mathbf{K}}_{l,h}=\mathbf{K}_{l,h}+\bm{\delta}_{l,h}^{K}$, a second\-order Taylor expansion gives:
$$\mathcal{L}(\hat{\theta})-\mathcal{L}(\theta)\approx\sum_{l,h}\langle\nabla_{K}\mathcal{L},\bm{\delta}^{K}\rangle+\frac{1}{2}(\bm{\delta}^{K})^{T}\mathbf{H}^{K}\bm{\delta}^{K}$$
The first\-order term vanishes in expectation \(unbiased quantization\)\. Under the diagonal Fisher approximation: $\mathbb{E}[(\bm{\delta}^{K})^{T}\mathbf{H}^{K}\bm{\delta}^{K}]\approx\operatorname{tr}(\mathbf{H}^{K})\cdot D(b^{K})/d_{K}$\. Since $\operatorname{tr}(\mathbf{H}_{l,h}^{K})\propto T\cdot d_{K}\cdot w_{l,h}^{K}$, combining K and V yields [Eq\. 5](https://arxiv.org/html/2605.06675#S3.E5)\.
#### Why the activation\-based proxy fails\.
The activation proxy $\tilde{w}_{l,h}^{K}=\mathbb{E}[\|Q\|^{2}\|V\|^{2}]/d$ bounds the forward\-pass attention error, not the loss change\. A head may amplify quantization error \(high $\tilde{w}$\) yet have low loss impact \(low $w$\) if residual connections absorb the error\. ∎
## Appendix B Distortion Model Parameters
Table 6: Calibrated $D(b)=\alpha\beta^{-b}$ parameters \(Qwen3\-8B, $d_{h}=128$\)\. The $1.5\times$ $\beta$\-gap across quantizers is the root cause of mismatch\.
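One simple way to obtain such parameters is a log-linear least-squares fit, since $\ln D(b)=\ln\alpha-b\ln\beta$ is linear in $b$. The sketch below is an assumption about how a fit of this form could be done, not the paper's calibration code; `fit_distortion_model` is a hypothetical name and the MSE values are synthetic.

```python
import math

def fit_distortion_model(bits, mse):
    """Least-squares fit of ln D = ln(alpha) - b * ln(beta); returns (alpha, beta)."""
    n = len(bits)
    ys = [math.log(d) for d in mse]
    mean_b = sum(bits) / n
    mean_y = sum(ys) / n
    sxx = sum((b - mean_b) ** 2 for b in bits)
    sxy = sum((b - mean_b) * (y - mean_y) for b, y in zip(bits, ys))
    slope = sxy / sxx                     # equals -ln(beta)
    intercept = mean_y - slope * mean_b   # equals ln(alpha)
    return math.exp(intercept), math.exp(-slope)

# Synthetic measurements generated from alpha = 2, beta = 4 (not paper data).
bit_widths = [1, 2, 3, 4]
measured = [2.0 * 4.0 ** (-b) for b in bit_widths]
alpha_hat, beta_hat = fit_distortion_model(bit_widths, measured)
print(alpha_hat, beta_hat)
```

In practice the measured MSE would come from running the target quantizer on calibration activations at each bit-width; fitting each quantizer separately is what avoids applying one quantizer's $\beta$ to another.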
## Appendix C Component Ablation Waterfall
[Table 7](https://arxiv.org/html/2605.06675#A3.T7) decomposes the contribution of each RateQuant component on KIVI at 2\.5 bits\. Adding gradient sensitivity *without* calibration worsens PPL \(49\.3 $\to$ 87\.0\) because the algorithm applies TurboQuant’s $\beta=3.6$ to KIVI’s $\beta=5.1$\. Calibration partially recovers \(87\.0 $\to$ 73\.1\), but the decisive step is K/V separation \(73\.1 $\to$ 14\.9\), which discovers the 2\.85/2\.15 K/V split\.
Table 7: Component ablation waterfall: KIVI 2\.5 bits, Qwen3\-8B, seed 42\.
## Appendix D Mixed\-Precision Baseline Comparison
[Table 8](https://arxiv.org/html/2605.06675#A4.T8) compares RateQuant head\-to\-head with existing mixed\-precision allocation methods\. We re\-implement each method’s allocation *strategy* \(not full pipeline\) on the KIVI quantizer at matched average bits, using their published allocation rules with our gradient sensitivity estimates for fair comparison\.
Table 8: Head\-to\-head with mixed\-precision approaches \(Qwen3\-8B, WikiText\-2 PPL\)\. † Re\-implemented allocation strategy on KIVI at matched bits\.

| Method | Gran\. | $\bar{b}$ | PPL | $\Delta$% | Cost | Strategy |
|---|---|---|---|---|---|---|
| *KIVI base quantizer \(Uniform PPL = 49\.32\):* | | | | | | |
| KVmix† \[Chen et al\., [2025a](https://arxiv.org/html/2605.06675#bib.bib2)\] | Layer | 2\.5 | 38\.41 | 22\.1 | 15 m | Top\-20% |
| KVTuner† \[Li et al\., [2025b](https://arxiv.org/html/2605.06675#bib.bib13)\] | Layer | 2\.5 | 35\.73 | 27\.5 | 45 m | Pareto |
| K>V global† \[Kim et al\., [2025](https://arxiv.org/html/2605.06675#bib.bib11)\] | Global | 2\.5 | 31\.06 | 37\.0 | 0\.1 s | K3V2 |
| RateQuant \(cal\+sep\) | Head | 2\.5 | 14\.86 | 69\.9 | 1\.7 s | RD opt\. |
| *TurboQuant base quantizer \(Uniform PPL = 9\.95\):* | | | | | | |
| Layer\-MP† | Layer | 3\.5 | 9\.88 | 16\.7 | 15 m | Per\-layer |
| RateQuant \(grad\) | Head | 3\.5 | 9\.72 | 54\.8 | 1\.6 s | Per\-head |
| FP16 | | 16 | 9\.53 | | | Reference |

## Appendix E Distortion Model Validation
Table 9: Exact Lloyd\-Max MSE vs\. fitted exponential for $d=128$, $\sigma^{2}=1$\. Max relative error: 7\.5% at 1 bit\.
## Appendix F Multi\-Seed Reliability
We report multi\-seed results for the primary TurboQuant configuration on Qwen3\-8B, the most complete evaluation setting \(main results \+ ablation \+ downstream\)\. For cross\-quantizer experiments \([Table 3](https://arxiv.org/html/2605.06675#S4.T3)\), we use seed 42; the dominant source of variance there is the allocation strategy, not the seed\.
Table 10: Per\-seed PPL for Qwen3\-8B \(TurboQuant base\)\. All seeds show positive $\Delta$ at 3\.5 and 4\.0 bits\.
## Appendix G K/V Asymmetry Visualization
Figure 6: KIVI per\-layer MSE at different bit\-widths\. Left: Key cache has high distortion with strong per\-layer variation\. Right: Value cache has $\sim 4\times$ lower MSE, driving the optimal K/V split \(2\.85/2\.15 at 2\.5 bits\)\.
## Appendix H Bit Allocation Visualization
Figure 7: Per\-head bit allocation for Qwen3\-8B at $\bar{b}=4.0$ \($b_{\min}=3$, $b_{\max}=6$\)\. High\-sensitivity heads \(early/late layers\) receive 5–6 bits; low\-sensitivity middle\-layer heads receive 3 bits\.
## Appendix I Calibration Overhead
Table 11: Calibration cost \(single H200 GPU, 16 sequences of length 512\)\.
## Appendix J Calibration Size Ablation
Table 12: Calibration size ablation on Qwen3\-8B at 4\.0 bits \(seed 42\)\. Even 4 samples yield $>$80% of the 16\-sample gain\.
## Appendix K Memory Footprint
Table 13: KV cache memory at sequence length 4096\. RateQuant adds no memory overhead at the same average bit\-width\.
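The footprint arithmetic behind a table of this kind can be sketched as follows. The helper `kv_cache_bytes` is hypothetical, it ignores quantizer metadata such as scales and zero points, and the layer and KV-head counts are illustrative assumptions rather than values taken from the paper (only the head dimension of 128 and the sequence length of 4096 appear in the text).

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, avg_bits, batch_size=1):
    """Payload size of the KV cache in bytes at a given average bit-width."""
    # Factor 2 accounts for storing both keys and values.
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return elements * avg_bits / 8

# Illustrative configuration (layer and KV-head counts are assumptions).
cfg = dict(num_layers=36, num_kv_heads=8, head_dim=128, seq_len=4096)
fp16 = kv_cache_bytes(avg_bits=16, **cfg)
q25 = kv_cache_bytes(avg_bits=2.5, **cfg)
print(f"FP16: {fp16 / 2**20:.1f} MiB, 2.5-bit: {q25 / 2**20:.1f} MiB")
```

Because the payload depends only on the average bit-width, redistributing bits across heads at a fixed average leaves this total unchanged, which is consistent with the statement that mixed-precision allocation adds no memory overhead.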
## Appendix L Complete Per\-Model Results
### L\.1 Qwen3\-8B
Table 14: Complete results for Qwen3\-8B \(TurboQuant, gradient sensitivity, seed 42\)\.
### L\.2 Qwen3\-4B
Table 15: Complete results for Qwen3\-4B \(TurboQuant, gradient sensitivity, seed 42\)\.
### L\.3 Qwen3\-32B
Table 16: Complete results for Qwen3\-32B \(TurboQuant, gradient sensitivity, seed 42\)\.
#### Constrained gain analysis\.
When $b_{\min}$ constraints are active, some heads are floored at $b_{\min}$, reducing the budget available for differentiation\. At $\bar{b}=b_{\min}$ \(e\.g\., 3\.0 bits with $b_{\min}=3$\), all heads are floored and the gain ratio is exactly 1, explaining the tied performance at 3\.0 bits\. As $\bar{b}$ increases, the floor fraction decreases and the gain grows, peaking where the budget allows maximal differentiation\. Beyond a certain point, diminishing distortion at high bits reduces absolute PPL benefit, consistent with the small or negative $\Delta$ observed at $\geq$4\.5 bits\.
### L\.4 RTN Base Quantizer \(Extreme Case\)
RTN per\-token symmetric is the weakest quantizer tested\. RateQuant produces very large gains because the sensitivity signal dominates when quantization error is extreme\.
Table 17: RTN per\-token symmetric on Qwen3\-8B\.
## Appendix M Limitations and Broader Impact
#### Limitations\.
Our evaluation centers on Qwen3 models with WikiText\-2 PPL and four downstream tasks\. While the framework is model\-agnostic, validation on additional model families \(LLaMA\-3, Mistral\) and long\-context benchmarks \(RULER, LongBench\) would strengthen generality claims\. Gradient calibration requires backward passes \($\sim$1\.6 s for 8B\), which is negligible for deployment but not zero\-cost\. The per\-head independence assumption may not hold for architectures with strongly coupled heads\. Finally, RateQuant allocates bits per head uniformly across token positions; position\-aware allocation could further improve long\-sequence efficiency\.
#### Broader impact\.
RateQuant reduces KV cache memory requirements, enabling longer contexts and larger batches within fixed hardware budgets\. As a quantizer\-agnostic allocation layer, it can be combined with any future base quantizer, amplifying the practical impact of improvements in the quantization design space\.