RoPE-Aware Bit Allocation for KV-Cache Quantization
Summary
Proposes Block-GTQ, a RoPE-aware bit allocation method for key-value cache quantization that improves long-context performance and memory efficiency by allocating more bits to high-energy RoPE blocks.
Cached at: 06/24/26, 07:50 AM
# RoPE-Aware Bit Allocation for KV-Cache Quantization
Source: [https://arxiv.org/html/2606.24033](https://arxiv.org/html/2606.24033)
Fengfeng Liang$^{1}$, Yuechen Zhang$^{2,3}$, Jiaya Jia$^{1,*}$; $^{1}$Hong Kong University of Science and Technology; $^{2}$The Chinese University of Hong Kong; $^{3}$MiMo, Xiaomi Corporation; $^{*}$Corresponding author
###### Abstract
Existing low-bit KV-cache quantizers typically treat each cached key as a flat vector. Under RoPE, however, the contribution of a cached key to a future attention logit decomposes into a position-dependent sum over two-dimensional frequency blocks. This makes key-cache quantization a block-wise bit-allocation problem: high-energy RoPE blocks are more sensitive to quantization error and should therefore receive more bits. We introduce *Block-GTQ*, a RoPE-aware bit allocator for key-cache quantization built on TurboQuant-MSE (TQ-MSE). For each layer and KV head, Block-GTQ computes a label-free energy score for each RoPE block and greedily allocates integer bit widths using marginal gains. Under matched K/V bit budgets, Block-GTQ better preserves RoPE query-key logits on a diverse ten-model diagnostic panel: at both 2 and 3 b/dim K-only, it cuts per-layer MAE by 32–80% across models and wins all 367/367 layer comparisons at each budget against uniform TQ-MSE. These fidelity gains translate to stronger downstream long-context retrieval, understanding, and reasoning. At K2V2 on Llama-3.1-8B-Instruct, Block-GTQ raises the six-task NIAH average from 70.6 to 97.4, and the eight-task LongBench-EN average from 36.87 to 53.31, relative to the uniform-allocation TQ-MSE baseline. On AIME 2024/2025 with DeepSeek-R1-Distill-Qwen-7B, and without relying on an fp16 recent-key buffer, Block-GTQ at K3V2 scores 51.7/37.5, close to fp16’s 54.2/37.9, whereas uniform-allocation TQ-MSE collapses to 0.0/0.0.
We further implement a packed-cache serving path that avoids materializing an fp16 KV cache: on a single H800 GPU with Qwen2.5-3B-Instruct, the packed K3V3 path achieves 3.24× KV-cache compression with quality comparable to fp16, runs 1.34× faster than fp16 FlashAttention2 at 128K context, reduces peak memory from 56.31 GB to 19.85 GB, and remains feasible at 256K/512K, where fp16 runs out of memory. Our code is available at [https://github.com/JIA-Lab-research/blockgtq](https://github.com/JIA-Lab-research/blockgtq).
## 1 Introduction
Long\-context inference makes the KV cache the dominant sequence\-dependent memory cost in autoregressive decoding\. The cache stores one key and one value vector for every past token in every layer, and each decode step must access this growing state to attend over the context\. For example, in a GQA\-style 70B\-class model with 80 layers, 8 KV heads, and 128\-dimensional heads, an fp16 KV cache requires about 320 KiB per token, or roughly 40 GiB at a 128K\-token context\. This creates two coupled bottlenecks: capacity, because the resident cache must fit in memory, and bandwidth, because the attention kernel must stream the cached K/V at each decode step\[[27](https://arxiv.org/html/2606.24033#bib.bib4),[7](https://arxiv.org/html/2606.24033#bib.bib6),[8](https://arxiv.org/html/2606.24033#bib.bib7),[17](https://arxiv.org/html/2606.24033#bib.bib5),[28](https://arxiv.org/html/2606.24033#bib.bib8),[22](https://arxiv.org/html/2606.24033#bib.bib9)\]\.
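The memory figures in the paragraph above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name is hypothetical, and it assumes fp16 storage (2 bytes per element) and a 128K context of 128 × 1024 tokens.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of cached K and V per token, summed over all layers.

    The leading factor 2 counts one key and one value vector per KV head.
    bytes_per_elem=2 corresponds to fp16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


# GQA-style 70B-class configuration quoted in the text.
per_token = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
context_tokens = 128 * 1024          # assumed meaning of "128K tokens"
total = per_token * context_tokens

print(per_token / 1024)              # KiB per token -> 320.0
print(total / 1024**3)               # GiB at 128K tokens -> 40.0
```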
KV\-cache quantization mitigates this pressure by storing cached keys and values with fewer bits\[[24](https://arxiv.org/html/2606.24033#bib.bib11),[14](https://arxiv.org/html/2606.24033#bib.bib12),[13](https://arxiv.org/html/2606.24033#bib.bib13),[43](https://arxiv.org/html/2606.24033#bib.bib14),[40](https://arxiv.org/html/2606.24033#bib.bib16)\]\. Most methods cast the problem as vector compression, choosing a quantization granularity over heads, channels, groups, or tokens so that dequantized vectors remain close to the originals\. This view is natural for storage, but it does not capture how cached keys are used\. A value error affects the post\-softmax weighted sum, whereas a key error perturbs the pre\-softmax logits seen by future queries and can change the attention distribution\.
For RoPE attention\[[30](https://arxiv.org/html/2606.24033#bib.bib1)\], this key-logit computation is block structured. Let $\Delta$ be the relative position between a future query $\mathbf{q}\in\mathbb{R}^{d_h}$ and a cached key $\mathbf{k}\in\mathbb{R}^{d_h}$. Up to the usual attention scaling, their logit is $\mathcal{K}_\Delta(\mathbf{q},\mathbf{k})=\mathbf{q}^\top R_\Delta\mathbf{k}$, where $R_\Delta$ is block diagonal with $2\times 2$ RoPE rotations. Hence the logit is a position-dependent sum of block terms $\mathbf{q}^{(i)\top}R(\Delta\theta_i)\mathbf{k}^{(i)}$, for $i=1,\ldots,d_h/2$, with $\theta_i$ the frequency of block $i$. A cached key is therefore not used through a flat-vector interface. Key-cache quantization should allocate precision across RoPE blocks according to their logit impact, rather than optimize a single flat-vector reconstruction objective over the whole key head.
This block-wise view changes where bits should be spent. Uniform allocation within a key head is natural only if RoPE blocks have comparable influence on future logits. Empirically, block-energy profiles can be sharply uneven: a few frequency blocks carry most of the query-key signal, making future logits more sensitive to quantization error in those blocks than to comparable error elsewhere. RoPE-agnostic uniform allocation therefore spends the same precision on blocks with very different logit sensitivity, potentially over-protecting low-impact blocks and under-protecting high-impact ones. Figure [1](https://arxiv.org/html/2606.24033#S1.F1) illustrates this allocation gap: one KV head from Qwen3-8B has a sharply non-uniform block-energy profile, and under the same average bit budget $\bar{b}=3$, Block-GTQ shifts precision toward high-energy blocks instead of using the same bit width for every block.

Figure 1: RoPE-block allocation. (a) The RoPE attention logit $\mathbf{q}^\top R_\Delta\mathbf{k}$ decomposes into a sum over two-dimensional frequency blocks. (b) Per-block energy scores for one Qwen3-8B KV head (layer 10, head 4); scores are median-normalized for display and span orders of magnitude. (c) Under the same average bit width $\bar{b}=3$ (dashed line), Block-GTQ reallocates bits from low-energy blocks to high-energy blocks instead of using a uniform 3-bit width.

We propose *Block-GTQ*, a lightweight RoPE-aware allocator that spends key-cache precision where future logits are most sensitive. For each layer and KV head, Block-GTQ computes a label-free RoPE-block energy score from Q/K activations, combines it with the TurboQuant-MSE (TQ-MSE)\[[40](https://arxiv.org/html/2606.24033#bib.bib16)\] $4^{-b}$ squared-error rate law, and greedily assigns integer bit widths under a fixed average-bit budget. Blocks assigned the same bit width are grouped and encoded by the original TQ-MSE local quantizer. Since values do not enter the RoPE key-logit computation, V is encoded with uniform-allocation TQ-MSE. All RoPE blocks are still stored; Block-GTQ changes only their bit widths.
Our contributions are:
1. We formulate key-cache compression for RoPE models as a logit-preservation problem over two-dimensional frequency blocks, rather than a flat-vector reconstruction problem.
2. We derive a RoPE-block integer bit allocator that combines a label-free Q/K energy score with the TQ-MSE $4^{-b}$ error law, and reuse the TQ-MSE encoder for same-bit-width block groups.
3. We validate the mechanism from RoPE-logit fidelity to downstream long-context retrieval, understanding, and reasoning tasks. On a diverse ten-model diagnostic panel, at both 2 and 3 b/dim K-only, Block-GTQ cuts per-layer RoPE-logit MAE by 32–80% across models and wins all 367/367 layer comparisons at each budget against uniform TQ-MSE. At the K2V2 budget on Llama-3.1-8B-Instruct, it raises the six-task NIAH average from 70.6 to 97.4, and the eight-task LongBench-EN average from 36.87 to 53.31, relative to uniform-allocation TQ-MSE. On AIME 2024/2025 with DeepSeek-R1-Distill-Qwen-7B, and without relying on an fp16 recent-key buffer, Block-GTQ at K3V2 scores 51.7/37.5, close to fp16’s 54.2/37.9, whereas uniform-allocation TQ-MSE collapses to 0.0/0.0.
4. We implement a packed-cache serving path for compressed K/V codes and evaluate it with K3V3 Block-GTQ on Qwen2.5-3B-Instruct. Compared with fp16 FlashAttention2 using an uncompressed KV cache, at 128K tokens the packed path compresses the KV cache by 3.24×, reduces peak memory from 56.31 GB to 19.85 GB, and lowers single-request decode latency from 70.96 ms to 52.95 ms. At 256K and 512K tokens, the fp16 baseline runs out of memory on the same H800, while the packed path remains feasible with 33.42 GB and 60.56 GB peak memory.
## 2 RoPE-Structured Key-Cache Error
### 2.1 RoPE Block Notation
For one query/key head, let $\mathbf{q},\mathbf{k}\in\mathbb{R}^{d_h}$ be split into $L=d_h/2$ two-dimensional RoPE blocks $\mathbf{q}^{(i)},\mathbf{k}^{(i)}\in\mathbb{R}^{2}$, with block frequencies $\theta_i$. At relative offset $\Delta$, $R_\Delta=\mathrm{diag}(R(\Delta\theta_1),\ldots,R(\Delta\theta_L))$, so the query-key logit is $\mathcal{K}_\Delta(\mathbf{q},\mathbf{k})=\mathbf{q}^\top R_\Delta\mathbf{k}=\sum_{i=1}^{L}\mathbf{q}^{(i)\top}R(\Delta\theta_i)\mathbf{k}^{(i)}$.
Let $\hat{\mathbf{k}}$ be a decoded key in the same coordinate system as $\mathbf{k}$, and define $\mathbf{e}_{\mathbf{k}}^{(i)}=\mathbf{k}^{(i)}-\hat{\mathbf{k}}^{(i)}$. The induced logit error is $\sum_i\mathbf{q}^{(i)\top}R(\Delta\theta_i)\mathbf{e}_{\mathbf{k}}^{(i)}$, and Cauchy–Schwarz together with rotation orthogonality gives $|\mathcal{K}_\Delta(\mathbf{q},\mathbf{k})-\mathcal{K}_\Delta(\mathbf{q},\hat{\mathbf{k}})|\leq\sum_{i=1}^{L}\|\mathbf{q}^{(i)}\|_2\|\mathbf{e}_{\mathbf{k}}^{(i)}\|_2$. Each block thus contributes independently to the bound, with no cross-block terms. Since each RoPE block is an orthogonal $2\times 2$ rotation, RoPE preserves the $\ell_2$ norm of every query/key block, so norm-based block statistics are RoPE-invariant; we can therefore compute the energy score in pre-RoPE coordinates.
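The block decomposition and the per-block bound can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes that block $i$ occupies adjacent channels $(2i, 2i+1)$ and that frequencies follow the common $10000^{-i/L}$ schedule, and it uses a randomly perturbed key as a stand-in for a decoded key. Real models may pair channels differently.

```python
import numpy as np


def rot2(angle):
    """2x2 rotation matrix R(angle)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])


def rope_matrix(delta, thetas):
    """Block-diagonal R_delta = diag(R(delta*theta_1), ..., R(delta*theta_L))."""
    L = len(thetas)
    R = np.zeros((2 * L, 2 * L))
    for i, theta in enumerate(thetas):
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = rot2(delta * theta)
    return R


rng = np.random.default_rng(0)
d_h, delta = 8, 5
L = d_h // 2
thetas = 10000.0 ** (-np.arange(L) / L)      # assumed RoPE frequency schedule
q, k = rng.normal(size=d_h), rng.normal(size=d_h)
k_hat = k + 0.1 * rng.normal(size=d_h)       # hypothetical decoded key

R = rope_matrix(delta, thetas)
full_logit = q @ R @ k
# Same logit written as a sum over two-dimensional frequency blocks.
block_logit = sum(
    q[2 * i:2 * i + 2] @ rot2(delta * thetas[i]) @ k[2 * i:2 * i + 2]
    for i in range(L)
)

# Logit error versus the per-block Cauchy-Schwarz bound.
err = k - k_hat
logit_err = abs(q @ R @ k - q @ R @ k_hat)
bound = sum(
    np.linalg.norm(q[2 * i:2 * i + 2]) * np.linalg.norm(err[2 * i:2 * i + 2])
    for i in range(L)
)
print(np.isclose(full_logit, block_logit), logit_err <= bound)
```

The bound holds for every offset `delta` because each `rot2` factor is orthogonal and so leaves the block norms unchanged.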
### 2.2 TQ-MSE Rate Law
Block-GTQ reuses the local encoder from TQ-MSE\[[40](https://arxiv.org/html/2606.24033#bib.bib16)\]: after normalizing a nonzero vector $\mathbf{x}$, TQ-MSE applies a shared orthogonal rotation, scalar-quantizes the rotated coordinates, and restores the radius. For allocation, only its rate law is needed: at $b$ bits per coordinate, the decoded vector $\hat{\mathbf{x}}$ satisfies $\mathbb{E}\|\mathbf{x}-\hat{\mathbf{x}}\|_2^2\leq\|\mathbf{x}\|_2^2\,C_{\mathrm{TQ}}\,4^{-b}$ with $C_{\mathrm{TQ}}=\sqrt{3}\pi/2$. Each additional bit quarters the local MSE bound; Block-GTQ uses this $4^{-b}$ rate law to allocate bits across RoPE blocks.
### 2.3 Key-Cache Logit Error
Let $\mathbf{q}_n^{\mathrm{R}}=R_n\mathbf{q}_n$ and $\mathbf{k}_m^{\mathrm{R}}=R_m\mathbf{k}_m$ denote the post-RoPE query and key at positions $n,m$, with $R_t$ the absolute RoPE rotation at position $t$. If the deployed cache decodes the post-RoPE key as $\hat{\mathbf{k}}_m^{\mathrm{R}}$, the key-cache logit error $\mathcal{E}_{n,m}:=|(\mathbf{q}_n^{\mathrm{R}})^\top\mathbf{k}_m^{\mathrm{R}}-(\mathbf{q}_n^{\mathrm{R}})^\top\hat{\mathbf{k}}_m^{\mathrm{R}}|$ is, up to the usual $1/\sqrt{d_h}$ scaling, the logit perturbation induced by key-cache compression. We focus on keys because queries are computed on the fly and values are mixed only after softmax weights are computed.
Although deployment stores post-RoPE keys, we analyze $\mathcal{E}_{n,m}$ in pre-RoPE coordinates with relative offset $\Delta=m-n$; the per-block bound from Subsection [2.1](https://arxiv.org/html/2606.24033#S2.SS1) applies: block $i$’s contribution depends only on the query norm and key-error norm in the same block (Appendices [A.1](https://arxiv.org/html/2606.24033#A1.SS1) and [A.2](https://arxiv.org/html/2606.24033#A1.SS2)).
This block-wise structure motivates a per-block bit allocation: for each layer and KV head, choose integer bit widths $\mathbf{b}=(b_1,\ldots,b_L)$ with $b_{\min}\leq b_i\leq b_{\max}$ and $\sum_i b_i=B$, while keeping every RoPE block cached. In expectation, block $i$’s contribution to the bound is the ideal block weight $s_i^\star:=\mathbb{E}[\|\mathbf{q}^{(i)}\|_2\,\|\mathbf{k}^{(i)}\|_2]$ times the local quantizer’s bit-dependent rate (Appendix [A.3](https://arxiv.org/html/2606.24033#A1.SS3)). The optimal allocation therefore gives more bits to blocks with higher $s_i^\star$.
## 3 Block-GTQ: RoPE-Block Bit Allocation
### 3.1 Block-Energy Score
Directly estimating $s_i^\star$ requires paired query-key products, which can be noisy on a short calibration prefix. Block-GTQ instead uses an AM-GM-based energy score that depends only on marginal Q/K second moments: $s_i:=\frac{1}{2}\mathbb{E}[\|\mathbf{q}^{(i)}\|_2^2+\|\mathbf{k}^{(i)}\|_2^2]$. By AM-GM, $s_i^\star\leq s_i$ in expectation: $s_i$ may overestimate $s_i^\star$ but never underestimates it.
Instantiating $s_i$ for layer $\ell$ and KV head $h$, the empirical energy score is

$$s_{\ell,h,i}=\frac{1}{2}\left(\mathbb{E}_{t,g\in G(h)}\left[\|\mathbf{q}_{\ell,g,t}^{(i)}\|_2^2\right]+\mathbb{E}_{t}\left[\|\mathbf{k}_{\ell,h,t}^{(i)}\|_2^2\right]\right).\tag{1}$$

Here $G(h)$ is the set of query heads that read KV head $h$, and expectations are averaged over a short unlabeled calibration prefix; Appendix [C](https://arxiv.org/html/2606.24033#A3.SS0.SSS0.Px1) expands these expectations into explicit sums.
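Equation (1) could be computed from calibration activations roughly as follows. This is a sketch under stated assumptions rather than the released implementation: the function name and tensor layouts are hypothetical, the inputs are pre-RoPE activations for a single layer and KV head, and block $i$ is assumed to occupy adjacent channels $(2i, 2i+1)$.

```python
import numpy as np


def block_energy_scores(q, k):
    """Empirical RoPE-block energy scores for one layer and one KV head.

    q: array of shape (G, T, d_h), queries of the G query heads that read this
       KV head, over T calibration tokens (hypothetical layout).
    k: array of shape (T, d_h), keys of the KV head over the same tokens.
    Returns an array of shape (L,), with L = d_h // 2.
    """
    G, T, d_h = q.shape
    L = d_h // 2
    q_blocks = q.reshape(G, T, L, 2)           # split channels into 2-D blocks
    k_blocks = k.reshape(T, L, 2)
    # Squared l2 norm per block, averaged over tokens (and query heads for Q).
    q_energy = (q_blocks ** 2).sum(axis=-1).mean(axis=(0, 1))
    k_energy = (k_blocks ** 2).sum(axis=-1).mean(axis=0)
    return 0.5 * (q_energy + k_energy)


# Tiny deterministic example: G=2 query heads, T=3 tokens, d_h=4 (two blocks).
q = np.ones((2, 3, 4))
k = 2.0 * np.ones((3, 4))
scores = block_energy_scores(q, k)
print(scores)   # each block: 0.5 * (1 + 1 + 4 + 4) = 5
```

Only marginal second moments are needed, so the Q and K averages can be accumulated separately in a single pass over the calibration prefix.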
### 3.2 Budgeted RoPE-Block Allocation
For each layer $\ell$ and KV head $h$ with head-level integer budget $B\in[Lb_{\min},Lb_{\max}]$, Block-GTQ chooses an integer bit schedule $\mathbf{b}=(b_1,\ldots,b_L)$ that minimizes $J_{\ell,h}(\mathbf{b})=\sum_{i=1}^{L}s_{\ell,h,i}\,4^{-b_i}$, subject to $b_{\min}\leq b_i\leq b_{\max}$ and $\sum_i b_i=B$.
**Algorithm 1** Block-GTQ: greedy bit allocation per layer and KV head

1. **Input:** scores $s_1,\ldots,s_L$, feasible integer budget $B$, bounds $b_{\min},b_{\max}$
2. **Output:** per-block bit widths $b_1,\ldots,b_L$
3. $b_i\leftarrow b_{\min}$ for all $i=1,\ldots,L$
4. $B_{\mathrm{extra}}\leftarrow B-Lb_{\min}$
5. Initialize a max-priority queue with key $\Delta_i=\tfrac{3}{4}s_i4^{-b_i}$ for all $i$ with $b_i<b_{\max}$
6. **while** $B_{\mathrm{extra}}>0$ and the queue is nonempty **do**
7. Pop $i^\star$ with largest $\Delta_i$
8. $b_{i^\star}\leftarrow b_{i^\star}+1$, $B_{\mathrm{extra}}\leftarrow B_{\mathrm{extra}}-1$
9. **if** $b_{i^\star}<b_{\max}$ **then**
10. Push $i^\star$ back with updated key $\Delta_{i^\star}=\tfrac{3}{4}s_{i^\star}4^{-b_{i^\star}}$
11. **end if**
12. **end while**; **return** $b_1,\ldots,b_L$

To solve this objective, we use greedy bit allocation guided by the marginal reduction. Adding one bit to block $i$ at current width $b_i$ reduces $J_{\ell,h}$ by $\Delta_i(b_i)=s_{\ell,h,i}4^{-b_i}-s_{\ell,h,i}4^{-(b_i+1)}=\tfrac{3}{4}s_{\ell,h,i}4^{-b_i}$: high-score blocks ask for bits first, but each bit they receive divides their next marginal gain by four. Algorithm [1](https://arxiv.org/html/2606.24033#alg1) initializes every block at $b_{\min}$ and repeatedly assigns the next bit to the block with the largest current $\Delta_i$ until the budget is spent. In fact, greedy is optimal for this objective:
###### Theorem 1 (Greedy optimality for the allocation objective).
For positive scores $s_i$ and feasible integer budget $B\in[Lb_{\min},Lb_{\max}]$, Algorithm [1](https://arxiv.org/html/2606.24033#alg1) minimizes $J(\mathbf{b})=\sum_i s_i4^{-b_i}$ over all integer allocations satisfying $b_{\min}\leq b_i\leq b_{\max}$ and $\sum_i b_i=B$.
The proof is in Appendix [A.4](https://arxiv.org/html/2606.24033#A1.SS4).
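The greedy procedure of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative reimplementation, not the authors' code: the function names and the toy scores are made up, `heapq` (a min-heap) is used with negated gains to emulate the max-priority queue, and a brute-force search over a tiny instance is included only as a sanity check of the optimality claim.

```python
import heapq
from itertools import product


def greedy_bits(scores, budget, b_min, b_max):
    """Greedy integer bit allocation minimizing sum_i s_i * 4**(-b_i)."""
    L = len(scores)
    assert L * b_min <= budget <= L * b_max, "budget must be feasible"
    bits = [b_min] * L
    extra = budget - L * b_min
    # heapq is a min-heap, so store negated marginal gains.
    heap = [(-0.75 * s * 4.0 ** (-b_min), i)
            for i, s in enumerate(scores) if b_min < b_max]
    heapq.heapify(heap)
    while extra > 0 and heap:
        _, i = heapq.heappop(heap)           # block with largest marginal gain
        bits[i] += 1
        extra -= 1
        if bits[i] < b_max:                  # block can still accept bits
            heapq.heappush(heap, (-0.75 * scores[i] * 4.0 ** (-bits[i]), i))
    return bits


def objective(scores, bits):
    return sum(s * 4.0 ** (-b) for s, b in zip(scores, bits))


def brute_force_min(scores, budget, b_min, b_max):
    """Exhaustive minimum of the objective; only viable for tiny L."""
    return min(
        objective(scores, bits)
        for bits in product(range(b_min, b_max + 1), repeat=len(scores))
        if sum(bits) == budget
    )


scores = [100.0, 10.0, 1.0, 0.1]             # toy, sharply uneven energy profile
bits = greedy_bits(scores, budget=12, b_min=1, b_max=8)   # average 3 bits/block
print(bits)                                   # [5, 4, 2, 1]
print(abs(objective(scores, bits) - brute_force_min(scores, 12, 1, 8)) < 1e-12)
```

With a heap, each of the $B-Lb_{\min}$ assignments costs $O(\log L)$, so the allocator is cheap enough to run per layer and KV head.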
The greedy output is a bit schedule, not yet a physical cache layout. Block-GTQ realizes this schedule by grouping RoPE blocks with the same assigned bit width. For each nonempty group $\mathcal{G}_b^{(\ell,h)}=\{i:b_{\ell,h,i}=b\}$, we concatenate the corresponding post-RoPE key blocks and encode the resulting subvector with one TQ-MSE encoder at $b$ bits/dim. This keeps the allocation decision at RoPE-block granularity while avoiding a separate tiny quantizer for every two-dimensional block. Uniform TQ-MSE is the special case in which all blocks belong to one same-rate group.
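The grouping step can be illustrated as below. This is a sketch with hypothetical helper names; it assumes block $i$ maps to adjacent channels $(2i, 2i+1)$ and leaves the TQ-MSE encoder itself out of scope.

```python
from collections import defaultdict


def group_blocks_by_bits(bits):
    """Map each assigned bit width b to the sorted list of block indices using it."""
    groups = defaultdict(list)
    for i, b in enumerate(bits):
        groups[b].append(i)
    return dict(sorted(groups.items()))


def group_channels(block_ids):
    """Channel indices covered by a group (assumed layout: block i -> 2i, 2i+1)."""
    return [c for i in block_ids for c in (2 * i, 2 * i + 1)]


bits = [5, 4, 2, 1, 2, 4]                 # example schedule for L = 6 blocks
groups = group_blocks_by_bits(bits)
for b, block_ids in groups.items():
    # Each subvector key[channels] would be passed to one encoder at b bits/dim.
    print(b, block_ids, group_channels(block_ids))
```

A uniform schedule such as `[3, 3, 3, 3, 3, 3]` yields a single group, matching the special case noted in the text.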
## 4 Serving Block-GTQ from a Packed Cache
Figure 2: Packed-cache serving path. Persistent HBM stores packed K/V code streams plus norms and metadata. The fused attention kernel decodes only the current tile into kernel-local temporaries and consumes them directly in QK and PV products, avoiding a resident decoded fp16 KV cache.

At inference time, we serve Block-GTQ directly from a packed cache. The cache update writes packed K/V code streams, norms, and static layout metadata into HBM. The fused attention kernel loads only the current time tile, decodes it into kernel-local temporaries, and consumes them in QK and PV products; a full fp16 KV cache is never materialized in HBM. Figure [2](https://arxiv.org/html/2606.24033#S4.F2) shows this path. The K stream follows the mixed-rate Block-GTQ schedule, with low-bit groups stored in nibble containers and higher-bit groups stored as bytes; the V stream is also packed, with uniform-allocation TQ-MSE.
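To make the "nibble container" idea concrete, here is a minimal NumPy sketch of packing two codes of at most 4 bits into one byte and unpacking them again. It is an assumption-laden illustration, not the paper's layout: the low-nibble-first ordering and the function names are hypothetical, and the real kernels perform this inside fused Triton code.

```python
import numpy as np


def pack_nibbles(codes):
    """Pack an even-length array of codes in [0, 15] into bytes, two per byte."""
    codes = np.asarray(codes, dtype=np.uint8)
    assert codes.size % 2 == 0 and codes.max() < 16
    # Even positions go to the low nibble, odd positions to the high nibble.
    return (codes[0::2] | (codes[1::2] << 4)).astype(np.uint8)


def unpack_nibbles(packed):
    """Inverse of pack_nibbles: recover the original code sequence."""
    packed = np.asarray(packed, dtype=np.uint8)
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0x0F
    out[1::2] = packed >> 4
    return out


codes = np.array([1, 7, 0, 15, 3, 4], dtype=np.uint8)   # e.g. 3- or 4-bit codes
packed = pack_nibbles(codes)
print(packed.nbytes, unpack_nibbles(packed))             # 3 bytes for 6 codes
```

Storing a 3-bit code in a 4-bit container wastes one bit per code but keeps extraction to a mask and a shift.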
This layout turns the packed cache into a memory-bandwidth win. Single-token decoding is memory-bandwidth bound: one query attends to all $T$ cached keys, so each step is governed by streaming the KV cache from HBM. The fused kernel unpacks each tile (nibble extraction for $\leq 4$-bit groups, byte loads for higher-bit $K$ groups), dequantizes through a shared fp16 codebook small enough to stay resident in the L1 cache, rescales by the per-group $K$ and per-token $V$ norms, and forms $QK^\top$ and $PV$ as two fp16-input, fp32-accumulate tensor-core matmuls under a fully fp32 online softmax; long contexts are split along the key axis and recombined with an exact log-sum-exp merge. The dequantized $K$/$V$ stay in registers as tensor-core operands and are never written back to HBM, so the per-step HBM traffic is only the packed codes and norms: about 157 B per token and KV head at K3V3 versus 512 B for an fp16 pair (Table [18](https://arxiv.org/html/2606.24033#A5.T18), $\sim 3.26\times$). The in-kernel unpack adds a fixed per-step cost that fp16 FlashAttention-2 does not pay, so the packed path is marginally slower at short context and overtakes fp16 only once the sequence is long enough for KV bandwidth to dominate, crossing over at $T=128$K, where it decodes $1.34\times$ faster at $3.26\times$ less KV memory (Table [19](https://arxiv.org/html/2606.24033#A5.T19)). Per-step launch overhead is removed by capturing the cache update as a CUDA graph, and the matching $Q$-side rotation from TQ-MSE is a small $QR^\top$ matmul after the $q$-projection that can be folded into the $q$-projection weights offline.
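The "exact log-sum-exp merge" used when a long context is split along the key axis can be sketched in NumPy. This is a generic illustration of the technique for a single query, not the fused kernel: function names are hypothetical, and quantization, tiling, and fp16/fp32 details are omitted.

```python
import numpy as np


def partial_attention(q, K, V):
    """Softmax attention of one query over one key chunk.

    Returns the chunk-normalized output and the chunk's log-sum-exp of logits.
    """
    logits = K @ q / np.sqrt(q.shape[0])
    m = logits.max()                       # subtract the max for stability
    w = np.exp(logits - m)
    lse = m + np.log(w.sum())
    return (w @ V) / w.sum(), lse


def merge(parts):
    """Combine per-chunk (output, lse) pairs into the full-context output."""
    lses = np.array([lse for _, lse in parts])
    weights = np.exp(lses - lses.max())
    weights /= weights.sum()               # weight of chunk c is Z_c / sum_c Z_c
    return sum(w * out for w, (out, _) in zip(weights, parts))


rng = np.random.default_rng(0)
d, T = 16, 40
q = rng.normal(size=d)
K, V = rng.normal(size=(T, d)), rng.normal(size=(T, d))

full, _ = partial_attention(q, K, V)                    # unsplit reference
parts = [partial_attention(q, K[:16], V[:16]),
         partial_attention(q, K[16:], V[16:])]          # two key-axis chunks
print(np.allclose(merge(parts), full))
```

Because each chunk reports its own log-sum-exp, the merge reproduces the unsplit softmax up to floating-point rounding, which is why the split introduces no approximation.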
##### Prefill\.
We populate the cache one transformer layer at a time, so each layer runs in a single full-length pass instead of the $O(T)$ steps of an autoregressive fill. The QKV/MLP projections and rotary embedding execute as full-$T$ matrix multiplications, and two batched Triton kernels (one for $K$, one for $V$) each launch once per layer to quantize every head’s keys and values for the whole prompt, packing the codes into the cache’s nibble/mixed-byte layout and writing the code streams and norms straight into the persistent buffers in the same layout the decode kernel reads. Prefill attention is then a FlashAttention-2–style kernel over the packed cache: each program owns a tile of queries, streams the compressed $K$/$V$ tiles, decodes them into kernel-local temporaries for a per-segment (per-rate-group) $QK^\top$ accumulation and a full-width $PV$ product, and accumulates under a causal mask with an online softmax, without materializing an fp16 attention matrix or KV cache. Constant launches per layer (vs. $O(T)$) remove the dispatch overhead that otherwise dominates long-context prefill, making hundred-thousand-token prefill feasible on a single H800.
## 5 Related Work
##### Long\-context inference and KV\-cache memory\.
Autoregressive long\-context decoding is often limited by repeatedly reading a KV cache that grows with sequence length\[[27](https://arxiv.org/html/2606.24033#bib.bib4)\]\. Serving systems such as PagedAttention and CacheGen manage and reuse this state more carefully\[[17](https://arxiv.org/html/2606.24033#bib.bib5),[22](https://arxiv.org/html/2606.24033#bib.bib9)\], while context\-extension methods such as YaRN and LongLoRA change how models reach longer windows\[[26](https://arxiv.org/html/2606.24033#bib.bib2),[4](https://arxiv.org/html/2606.24033#bib.bib3)\]\.
##### KV\-cache quantization\.
Most KV\-cache quantizers optimize reconstruction or outlier objectives at channel, token, group, or vector granularity\. KIVI, KVQuant, ZipCache, Coupled Quantization, MiKV, MoQAE, and AQUA\-KV\[[24](https://arxiv.org/html/2606.24033#bib.bib11),[14](https://arxiv.org/html/2606.24033#bib.bib12),[13](https://arxiv.org/html/2606.24033#bib.bib13),[43](https://arxiv.org/html/2606.24033#bib.bib14),[39](https://arxiv.org/html/2606.24033#bib.bib15),[34](https://arxiv.org/html/2606.24033#bib.bib23),[29](https://arxiv.org/html/2606.24033#bib.bib22)\]pair low\-bit KV storage with outlier or mixed\-precision adjustments; KVSink, Outlier Tokens Tracing, and SQuat target sink tokens, outliers, and query\-subspace structure\[[33](https://arxiv.org/html/2606.24033#bib.bib19),[31](https://arxiv.org/html/2606.24033#bib.bib20),[35](https://arxiv.org/html/2606.24033#bib.bib21)\]\. PolarQuant and TurboQuant\[[12](https://arxiv.org/html/2606.24033#bib.bib17),[40](https://arxiv.org/html/2606.24033#bib.bib16)\]provide local vector\-quantization primitives, while GEAR adds low\-rank and sparse error recovery\[[16](https://arxiv.org/html/2606.24033#bib.bib40)\]\. Block\-GTQ uses TurboQuant\-MSE as its local primitive but greedily allocates bits per RoPE frequency block via a block\-energy score, since the key\-side attention logit decomposes into block terms\.
##### RoPE\-aware KV\-cache quantization\.
Several methods exploit RoPE structure when reducing KV\-cache cost\. KVQuant and RotateKV both operate before RoPE—the former quantizes keys, the latter applies outlier\-aware rotations\[[14](https://arxiv.org/html/2606.24033#bib.bib12),[32](https://arxiv.org/html/2606.24033#bib.bib28)\]; CommVQ learns codebooks that commute with RoPE\[[18](https://arxiv.org/html/2606.24033#bib.bib29)\]\. EliteKV combines head\-specific RoPE\-frequency selection with joint low\-rank projection\[[45](https://arxiv.org/html/2606.24033#bib.bib30)\]; RAP prunes RoPE\-aligned pairs\[[38](https://arxiv.org/html/2606.24033#bib.bib31)\]; TriAttention scores key importance via pre\-RoPE Q/K geometry and trigonometric distance\[[25](https://arxiv.org/html/2606.24033#bib.bib32)\]\. Block\-GTQ instead greedily allocates precision per RoPE block using a block\-energy score derived from the RoPE logit\-error bound; no block is dropped\.
##### Non\-uniform precision allocation\.
The closest line assigns precision non\-uniformly: PM\-KVQ at per\-layer granularity \(shared by K and V\)\[[21](https://arxiv.org/html/2606.24033#bib.bib24)\]; MixKVQ \(query\-aware\) and Kitty at key\-channel granularity\[[42](https://arxiv.org/html/2606.24033#bib.bib25),[36](https://arxiv.org/html/2606.24033#bib.bib26)\]; and Ada\-KV via head\-wise eviction budgets\[[10](https://arxiv.org/html/2606.24033#bib.bib27)\]\. Block\-GTQ differs in both unit and score: its greedy allocator assigns bits to RoPE frequency blocks inside each head from a block\-energy score derived from the RoPE logit\-error bound\.
##### Token retention and low\-rank compression\.
Another family reduces cache cost by keeping, merging, or sampling cached tensors\. Attention Sinks, H2O, Scissorhands, FastGen, SnapKV, PyramidKV, MagicPIG, and SubGen retain or sample tokens based on sink behavior, heavy hitters, persistence, profiled head patterns, attention scores, pyramidal layer budgets, or clustering\[[37](https://arxiv.org/html/2606.24033#bib.bib10),[44](https://arxiv.org/html/2606.24033#bib.bib33),[23](https://arxiv.org/html/2606.24033#bib.bib34),[11](https://arxiv.org/html/2606.24033#bib.bib35),[19](https://arxiv.org/html/2606.24033#bib.bib36),[2](https://arxiv.org/html/2606.24033#bib.bib37),[5](https://arxiv.org/html/2606.24033#bib.bib38),[41](https://arxiv.org/html/2606.24033#bib.bib39)\]\. Low\-rank and hybrid methods \(Palu, MiniCache, GEAR for KV cache; UniQL for edge LLMs\) reduce cache dimensionality, merge across depth, or recover quantization error\[[3](https://arxiv.org/html/2606.24033#bib.bib41),[20](https://arxiv.org/html/2606.24033#bib.bib42),[16](https://arxiv.org/html/2606.24033#bib.bib40),[6](https://arxiv.org/html/2606.24033#bib.bib43)\]\. These directions are orthogonal to Block\-GTQ’s per\-RoPE\-block bit allocation\.
## 6 Experiments
### 6.1 Allocation and Attention Diagnostics
We empirically verify Block-GTQ on a ten-model panel, inspecting (i) the bit allocation it produces, (ii) the resulting RoPE-logit error, and (iii) the resulting attention distributions. The panel covers GQA backbones from Qwen, Llama, DeepSeek, Mistral, and GLM plus an MLA-based DeepSeek model. All experiments in this subsection quantize $K$ only; $V$ remains in fp16 to isolate the effect of K-cache quantization. Details about the panel and experimental setup are in Appendix [B.1](https://arxiv.org/html/2606.24033#A2.SS1).
We first inspect the bit allocation Block-GTQ produces at the 3 b/dim budget. Figure [3](https://arxiv.org/html/2606.24033#S6.F3) shows that every architecture has a non-uniform RoPE-block energy profile, which Block-GTQ translates into a non-uniform allocation. Aggregate distributions and per-layer heterogeneity across the panel are in Appendix [B.2](https://arxiv.org/html/2606.24033#A2.SS2). We then measure per-layer RoPE-logit error. Table [1](https://arxiv.org/html/2606.24033#S6.T1) shows that Block-GTQ reduces mean RoPE-logit MAE versus uniform TQ-MSE on all 10 models and wins 367/367 (100%) layer comparisons; the same 367/367 pattern holds at the tighter 2 b/dim budget (Table [11](https://arxiv.org/html/2606.24033#A2.T11)). Definition and protocol are in Appendix [B.3](https://arxiv.org/html/2606.24033#A2.SS3). Finally, we test whether these per-layer reductions propagate to the attention distribution itself. Figure [4](https://arxiv.org/html/2606.24033#S6.F4) shows that, without a recent-token buffer, Block-GTQ achieves both the lowest mean softmax KL versus fp16 and the highest top-10 attended-token overlap at every budget. Setup, metrics, and additional results are in Appendix [B.4](https://arxiv.org/html/2606.24033#A2.SS4).

Table 1: Per-layer RoPE-logit error at the 3 b/dim budget, K-only. Values are mean RoPE-logit MAE across model layers; lower is better. $\Delta$ is the relative reduction versus TQ-MSE; “Wins” counts layers where Block-GTQ beats uniform TQ-MSE. Definition and protocol in Appendix [B.3](https://arxiv.org/html/2606.24033#A2.SS3).

| Model | TQ-MSE | Block-GTQ | $\Delta$ | Wins |
| --- | --- | --- | --- | --- |
| Qwen2.5-3B | 6.43 | 3.23 | +49.9% | 36/36 |
| Qwen2.5-14B | 4.14 | 2.61 | +37.1% | 48/48 |
| Qwen3-8B | 5.81 | 2.96 | +49.0% | 36/36 |
| Qwen3-30B-A3B | 6.76 | 3.00 | +55.6% | 48/48 |
| Llama-3.1-8B | 3.80 | 2.55 | +32.7% | 32/32 |
| DS-R1-Llama-8B | 3.44 | 2.33 | +32.2% | 32/32 |
| DS-R1-Qwen-7B | 11.44 | 2.40 | +79.1% | 28/28 |
| Mistral-Nemo-12B | 3.46 | 2.28 | +34.2% | 40/40 |
| GLM-4-9B | 7.30 | 4.51 | +38.2% | 40/40 |
| DS-V2-Lite | 6.01 | 3.87 | +35.5% | 27/27 |

Figure 3: Allocator fingerprint at the 3 b/dim budget. One subplot per model; each vertical slice is one layer, with stacked color bands giving its bit-width distribution (1 b red → 8 b green).

Figure 4: Mean softmax KL and Top-10 attended-token overlap across models. K-only quantization ($V$ stays fp16). (a) Mean softmax KL versus fp16 per method, averaged over the ten-model panel at 2, 3, and 4 b/dim budgets. (b) Top-10 attended-token overlap versus softmax KL, one marker per model (color: method, shape: bit-rate); the upper-left corner is best. See Appendix [B.4](https://arxiv.org/html/2606.24033#A2.SS4.SSS0.Px6) for the KIVI no-buffer setting.
### 6.2 Calibration Robustness Ablation
We ablate calibration along two axes (Table [2](https://arxiv.org/html/2606.24033#S6.T2)): (a) *length*, sweeping $N_{\mathrm{cal}}\in\{64,\ldots,2048\}$ tokens from WikiText-2 test; and (b) *corpus*, drawing 2048 tokens each from WikiText-2, PG19, C4, and code. We further extend the length axis to a continuous metric (Table [3](https://arxiv.org/html/2606.24033#S6.T3)): for each cell we draw three independent calibration prefixes from WikiText-2 train (offsets 0/10k/20k) and report sliding-window PPL on the full WikiText-2 test set as mean $\pm$ std. Both show K3V3 is robustly less sensitive than K2V2: at K3V3 the six NIAH subtasks stay within 1.07 pp and PPL within $\pm 1\sigma$ across seeds, while at K2V2 NIAH swings 1.57–4.09 pp and PPL by several $\sigma$. This reflects the $4^{-b}$ rate law in Block-GTQ’s allocator objective $\sum_i s_i\cdot 4^{-b_i}$: a misplaced bit at $b=3$ (K3V3) costs roughly $4\times$ less than at $b=2$ (K2V2), so the same calibration noise has a proportionally smaller downstream effect. Details and more results are in Appendix [C](https://arxiv.org/html/2606.24033#A3).
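As a worked instance of the factor-of-four argument, take the marginal reduction $\Delta_i(b)=\tfrac{3}{4}s_i4^{-b}$ from Section 3.2 and compare one bit at $b=2$ with one bit at $b=3$ for the same block score. The unit score $s_i=1$ below is chosen only for illustration.

$$\frac{\Delta_i(2)}{\Delta_i(3)}=\frac{\tfrac{3}{4}s_i4^{-2}}{\tfrac{3}{4}s_i4^{-3}}=4,\qquad \Delta_i(2)\big|_{s_i=1}=\tfrac{3}{64}\approx 0.047,\qquad \Delta_i(3)\big|_{s_i=1}=\tfrac{3}{256}\approx 0.012.$$

Under this objective, assigning a bit to the wrong block therefore changes $J$ by about a quarter as much at the 3-bit operating point as at the 2-bit one, which is consistent with the smaller calibration sensitivity reported for K3V3.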
Table 2: Calibration ablations on Llama-3.1-8B-Instruct along two axes: (a) calibration length and (b) calibration corpus. NIAH Overall (%) is the unweighted mean of six NIAH subtasks; higher is better.
(a) Calibration length: $\Delta$ vs. the $N_{\mathrm{cal}} = 2048$ baseline.
(b) Calibration corpus: $\Delta$ vs. the WikiText-2 baseline. Each row draws 2048 tokens from the named corpus.
Table 3: Calibration length sensitivity under prefix noise. Cells are PPL mean $\pm$ std across three WT2-train calibration prefixes (offsets 0/10k/20k). $\Delta$ is the change in mean PPL from $N_{\mathrm{cal}} = 128$ to $N_{\mathrm{cal}} = 2048$.
### 6.3 Downstream Evaluation
In Section [6.1](https://arxiv.org/html/2606.24033#S6.SS1) we showed that Block-GTQ reduces RoPE-logit error and preserves the softmax attention distribution. We now ask whether this attention-interface advantage carries to downstream task quality, focusing on two regimes where K-cache errors are most consequential: long-context retrieval and understanding, where old keys must remain useful across a long prompt; and reasoning-style generation, where small attention perturbations can compound over many decode steps.
#### 6.3.1 Long-Context Tasks
Figure 5: NIAH single-needle retrieval on Llama-3.1-8B-Instruct. Pass rate is shown over context length (4K–128K) and needle depth (0%–100%), averaged over three trials per cell. The two rows use the same method layout: fp16, KIVI-ScaleOnly (Appendix [B.4](https://arxiv.org/html/2606.24033#A2.SS4.SSS0.Px6)), TQ-MSE, and Block-GTQ.
Table 4: Multi-task NIAH pass-rate (%) on (a) Llama-3.1-8B-Instruct and (b) Qwen2.5-7B-Instruct. Each entry is averaged over context lengths 4K–128K, needle depths 0%–100%, and three trials per cell. Block-GTQ uses a 2048-token WikiText-2 calibration; KIVI-ScaleOnly is defined in Appendix [B.4](https://arxiv.org/html/2606.24033#A2.SS4.SSS0.Px6).
(a) Llama-3.1-8B-Instruct.
(b) Qwen2.5-7B-Instruct.
##### NIAH\.
NIAH [[15](https://arxiv.org/html/2606.24033#bib.bib47)] probes where retrieval breaks across context length and needle depth. On Llama-3.1-8B-Instruct, Figure [5](https://arxiv.org/html/2606.24033#S6.F5) shows that Block-GTQ’s NIAH retrieval pattern matches fp16’s at both K3V3 and K2V2. Table [4(a)](https://arxiv.org/html/2606.24033#S6.T4.st1) quantifies this across the six NIAH subtasks: TQ-MSE drops from 97.7 at K3V3 to 70.6 at K2V2 and KIVI-ScaleOnly never exceeds 35.4 Avg, while Block-GTQ stays close to fp16’s 99.6 ceiling, scoring 98.4/96.8/97.4 at K3V3/K3V2/K2V2. The gap is wider on Qwen2.5-7B-Instruct: TQ-MSE collapses to 0.0 at every budget and KIVI-ScaleOnly never exceeds 35.2 Avg, while Block-GTQ stays close to fp16’s 67.1 ceiling, scoring 65.1/64.8/60.1 at K3V3/K3V2/K2V2 (Qwen2.5-7B-Instruct heatmap: Appendix [D.1](https://arxiv.org/html/2606.24033#A4.SS1)).
##### LongBench\-EN\.
Table 5: LongBench-EN per-subtask scores on Llama-3.1-8B-Instruct. Subtask abbreviations and metrics are listed in Appendix [D.1](https://arxiv.org/html/2606.24033#A4.SS1); Avg is the unweighted mean. KIVI denotes KIVI-ScaleOnly (Appendix [B.4](https://arxiv.org/html/2606.24033#A2.SS4.SSS0.Px6)). Higher is better; bold marks the best quantized method within each budget.
LongBench-EN [[1](https://arxiv.org/html/2606.24033#bib.bib48)] is the natural-task counterpart to NIAH: it tests whether the quantizer preserves attention well enough for generation, not just retrieval. Table [5](https://arxiv.org/html/2606.24033#S6.T5) reports Llama-3.1-8B-Instruct on eight subtasks; full subtask definitions and the inference protocol are in Appendix [D.1](https://arxiv.org/html/2606.24033#A4.SS1). Across all three budgets, Block-GTQ’s Avg stays closest to the 59.83 fp16 ceiling (59.08/58.84/53.31 at K3V3/K3V2/K2V2); at the tight K2V2 budget, TQ-MSE drops to 36.87 and KIVI-ScaleOnly to 38.46.
#### 6.3.2 Reasoning Tasks
Reasoning tests the cache differently from retrieval: in long chain-of-thought decoding, small cache errors compound across many decode steps and surface as a wrong final answer. We evaluate Block-GTQ on AIME 2024 and AIME 2025 in thinking mode on two DeepSeek-R1 [[9](https://arxiv.org/html/2606.24033#bib.bib49)] distilled backbones at K3V2, reporting average pass@1 over 8 samples per problem.
To separate the contribution of the quantizer from that of the recent-token fp16 buffer, we compare two regimes. *No buffer* removes all uncompressed-token windows so every attended key is served from the compressed cache. *Protected* keeps the first 4 tokens (sink) and the last 128 tokens (recent) as fp16. The 128-token recent window matches PM-KVQ’s protected configuration [[21](https://arxiv.org/html/2606.24033#bib.bib24)]; the 4-token sink follows the attention-sink convention [[37](https://arxiv.org/html/2606.24033#bib.bib10)]. We apply this 4/128 allowance identically across methods; per-method details are in Appendix [D.2](https://arxiv.org/html/2606.24033#A4.SS2).
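The two regimes can be pictured as a per-position flag over the cache. The sketch below is an illustration only: the helper name `fp16_positions` and the 1000-token example are hypothetical, and only the 4-token sink and 128-token recent window come from the text.

```python
def fp16_positions(seq_len: int, sink: int = 4, recent: int = 128) -> list[bool]:
    """Return True for cache positions kept in fp16, False for quantized ones."""
    keep = []
    for pos in range(seq_len):
        is_sink = pos < sink                    # first `sink` tokens
        is_recent = pos >= seq_len - recent     # last `recent` tokens
        keep.append(is_sink or is_recent)
    return keep


protected = fp16_positions(1000)                      # Protected: 4 sink + 128 recent
no_buffer = fp16_positions(1000, sink=0, recent=0)    # No buffer: every key is quantized
print(sum(protected), sum(no_buffer))  # 132 0
```

In the no-buffer regime the flag is False everywhere, so every attended key, including the most recent ones, is read from the compressed cache.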
Table 6: AIME 2024/2025 pass@1 (%) at K3V2. *Protected*: 4 sink + 128 recent fp16; *No buffer*: both 0 (Section [6.3.2](https://arxiv.org/html/2606.24033#S6.SS3.SSS2)). Under no-buffer, KIVI is run as KIVI-ScaleOnly (Appendix [B.4](https://arxiv.org/html/2606.24033#A2.SS4.SSS0.Px6)).
In the protected regime, Block-GTQ stays close to fp16 on both backbones (matching on DeepSeek-R1-Distill-Qwen-7B, slightly exceeding on DeepSeek-R1-Distill-Llama-8B). TQ-MSE is notably lower on both, especially on DeepSeek-R1-Distill-Qwen-7B. PM-KVQ matches fp16 on DeepSeek-R1-Distill-Llama-8B (still slightly below Block-GTQ) but is lower on DeepSeek-R1-Distill-Qwen-7B.
In the no-buffer regime (AIME 2024/AIME 2025), Block-GTQ stays close to fp16 on DeepSeek-R1-Distill-Qwen-7B (51.7/37.5 vs 54.2/37.9) but is lower on DeepSeek-R1-Distill-Llama-8B (32.5/23.3 vs 43.3/28.8). PM-KVQ [[21](https://arxiv.org/html/2606.24033#bib.bib24)] shows the opposite pattern: leading at 42.9/24.6 on DeepSeek-R1-Distill-Llama-8B but lower at 40.8/27.5 on DeepSeek-R1-Distill-Qwen-7B. TQ-MSE collapses on both backbones (worst at 0.0/0.0 on DeepSeek-R1-Distill-Qwen-7B), as does KIVI without the buffer. Block-GTQ’s no-buffer drop on DeepSeek-R1-Distill-Llama-8B reflects a bit-allocation difference: PM-KVQ allocates K and V jointly per layer via loss-gradient sensitivity, whereas Block-GTQ allocates only K per RoPE block (via energy) and leaves V at uniform TQ-MSE. Without the recent-token buffer, this K-only allocation can surface as a quality gap on V-sensitive backbones. Adding a V-side allocator to Block-GTQ is a natural extension.
### 6.4 Block-GTQ Deployment
We run Qwen2.5-3B-Instruct on a single H800 GPU at the K3V3 operating point and report decode-step latency, peak GPU memory, and downstream perplexity. We compare Block-GTQ against an fp16 FlashAttention-2 (FA-2) baseline and uniform TQ-MSE. Block-GTQ and uniform TQ-MSE both run through our fused-attention packed-cache path (Section [4](https://arxiv.org/html/2606.24033#S4)), which unpacks compressed K/V codes inline within the attention kernel; they differ only in K layout: uniform TQ-MSE uses a single bit-width per head, while Block-GTQ varies it across RoPE blocks.
Figure 6: Kernel-optimization gains, decode latency, and peak memory for Block-GTQ on Qwen2.5-3B-Instruct (single H800). Panel (a) decomposes the speedup contribution of each kernel-optimization stage; each stage targets a specific bottleneck of the packed-cache decode path. As context length grows (panels (b), (c)), Block-GTQ’s optimized decode kernel overtakes fp16 FA-2 at $T = 128$K and continues to run cleanly at $T \geq 256$K, where fp16 OOMs. PPL is annotated in panel (b).
At short context ($T \leq 64$K), Block-GTQ’s decode kernel is slower than fp16 FA-2: the packed-cache path pays per-step overhead for in-kernel unpacking of compressed K/V codes that fp16 FA-2 does not incur. As context grows, KV bandwidth dominates per-step decode and Block-GTQ overtakes fp16; at $T = 128$K, Block-GTQ runs $1.34\times$ faster than fp16 and cuts peak memory from 56.31 GB to 19.85 GB. Beyond this, fp16 OOMs at $T \geq 256$K because *peak total* memory exceeds the 80 GB GPU budget, while Block-GTQ continues to run. Uniform TQ-MSE is modestly faster than Block-GTQ on decode ($\sim$14% at $T = 128$K) and has a slightly smaller KV footprint ($3.88\times$ vs. $3.24\times$ compression; the gap comes from per-segment metadata needed by Block-GTQ’s mixed-rate K storage). However, TQ-MSE’s quality collapses: its PPL is orders of magnitude worse than Block-GTQ’s at every tested context length, while Block-GTQ stays close to fp16’s PPL (annotated in Figure [6](https://arxiv.org/html/2606.24033#S6.F6)(b); full values in Table [21](https://arxiv.org/html/2606.24033#A5.T21)). This makes Block-GTQ the deployable operating point. Full latency, memory, and prefill matrices are in Appendix [E](https://arxiv.org/html/2606.24033#A5).
## 7 Conclusion
We reframe low-bit K-cache compression for RoPE models as a block-level rate-allocation problem. Because RoPE attention decomposes exactly over two-dimensional frequency blocks and block energy is non-uniform, Block-GTQ uses a label-free energy score to assign more bits to high-energy RoPE blocks. Both K and V are encoded with TQ-MSE, V at a uniform bit-width.
On a diverse ten-model panel, at both 2 and 3 b/dim K-only, Block-GTQ cuts per-layer RoPE-logit MAE by 32–80% across models and wins all 367/367 layer comparisons at each budget against uniform TQ-MSE. Across NIAH, LongBench-EN, and AIME, Block-GTQ stays close to the fp16 ceiling at tight K budgets, where uniform TQ-MSE typically collapses. On a single H800 at the K3V3 budget, our packed-cache serving path enables long-context inference that fp16 FlashAttention-2 cannot reach: with $3.24\times$ KV-cache compression and quality comparable to fp16, it runs $1.34\times$ faster at 128K context and remains feasible at 256K/512K, where fp16 OOMs.
##### Limitations and future work\.
Block\-GTQ allocates bits only on K, leaving V uniform\. A V\-side allocator, joint K\+V optimization, and denser packing could further reduce memory\. The fused decode path is an initial single\-GPU implementation; multi\-GPU and batched serving are open directions\.
## References
- \[1\]Y\. Bai, X\. Lv, J\. Zhang, H\. Lyu, J\. Tang, Z\. Huang, Z\. Du, X\. Liu, A\. Zeng, L\. Hou, Y\. Dong, J\. Tang, and J\. Li\(2024\)LongBench: a bilingual, multitask benchmark for long context understanding\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 3119–3137\.Cited by:[§D\.1\.2](https://arxiv.org/html/2606.24033#A4.SS1.SSS2.Px2.p1.1),[§6\.3\.1](https://arxiv.org/html/2606.24033#S6.SS3.SSS1.Px2.p1.6)\.
- \[2\]\(2025\)PyramidKV: dynamic kv cache compression based on pyramidal information funneling\.InConference on Language Modeling \(COLM\),Note:arXiv:2406\.02069Cited by:[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px5.p1.1)\.
- \[3\]C\. Chang, W\. Lin, C\. Lin, C\. Chen, Y\. Hu, P\. Wang, N\. Huang, L\. Ceze, M\. S\. Abdelfattah, and K\. Wu\(2025\)Palu: kv\-cache compression with low\-rank projection\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px5.p1.1)\.
- \[4\]Y\. Chen, S\. Qian, H\. Tang, X\. Lai, Z\. Liu, S\. Han, and J\. Jia\(2024\)LongLoRA: efficient fine\-tuning of long\-context large language models\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px1.p1.1)\.
- \[5\]Z\. Chen, R\. Sadhukhan, Z\. Ye, Y\. Zhou, J\. Zhang, N\. Nolte, Y\. Tian, M\. Douze, L\. Bottou, Z\. Jia, and B\. Chen\(2025\)MagicPIG: lsh sampling for efficient llm generation\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px5.p1.1)\.
- \[6\]H\. Chiang, C\. Chang, Y\. Lu, C\. Lin, K\. Wu, M\. S\. Abdelfattah, and D\. Marculescu\(2026\)UniQL: unified quantization and low\-rank compression for adaptive edge llms\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px5.p1.1)\.
- \[7\]T\. Dao, D\. Y\. Fu, S\. Ermon, A\. Rudra, and C\. Ré\(2022\)FlashAttention: fast and memory\-efficient exact attention with IO\-awareness\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§1](https://arxiv.org/html/2606.24033#S1.p1.1)\.
- \[8\]T\. Dao\(2023\)FlashAttention\-2: faster attention with better parallelism and work partitioning\.arXiv preprint arXiv:2307\.08691\.Cited by:[§1](https://arxiv.org/html/2606.24033#S1.p1.1)\.
- \[9\]DeepSeek\-AI\(2025\)DeepSeek\-R1: incentivizing reasoning capability in LLMs via reinforcement learning\.arXiv preprint arXiv:2501\.12948\.Cited by:[§6\.3\.2](https://arxiv.org/html/2606.24033#S6.SS3.SSS2.p1.1)\.
- \[10\]Y\. Feng, J\. Lv, Y\. Cao, X\. Xie, and S\. K\. Zhou\(2025\)Ada\-KV: optimizing KV cache eviction by adaptive budget allocation for efficient LLM inference\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Note:arXiv:2407\.11550Cited by:[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px4.p1.1)\.
- \[11\]S\. Ge, Y\. Zhang, L\. Liu, M\. Zhang, J\. Han, and J\. Gao\(2024\)Model tells you what to discard: adaptive KV cache compression for LLMs\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px5.p1.1)\.
- \[12\]I\. Han, P\. Kacham, V\. Mirrokni, A\. Zandieh, and A\. Karbasi\(2025\)PolarQuant: quantizing kv caches with polar transformation\.InInternational Conference on Machine Learning \(ICML\),Cited by:[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px2.p1.1)\.
- \[13\]Y\. He, L\. Zhang, W\. Wu, J\. Liu, H\. Zhou, and B\. Zhuang\(2024\)ZipCache: accurate and efficient kv cache quantization with salient token identification\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§1](https://arxiv.org/html/2606.24033#S1.p2.1),[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px2.p1.1)\.
- \[14\]C\. Hooper, S\. Kim, H\. Mohammadzadeh, M\. W\. Mahoney, Y\. S\. Shao, K\. Keutzer, and A\. Gholami\(2024\)KVQuant: towards 10 million context length llm inference with kv cache quantization\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§1](https://arxiv.org/html/2606.24033#S1.p2.1),[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px2.p1.1),[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px3.p1.1)\.
- \[15\]C\. Hsieh, S\. Sun, S\. Kriman, S\. Acharya, D\. Rekesh, F\. Jia, and B\. Ginsburg\(2024\)RULER: what’s the real context size of your long\-context language models?\.arXiv preprint arXiv:2404\.06654\.Cited by:[§6\.3\.1](https://arxiv.org/html/2606.24033#S6.SS3.SSS1.Px1.p1.13)\.
- \[16\]H\. Kang, Q\. Zhang, S\. Kundu, G\. Jeong, Z\. Liu, T\. Krishna, and T\. Zhao\(2024\)GEAR: an efficient KV cache compression recipe for near\-lossless generative inference of LLM\.arXiv preprint arXiv:2403\.05527\.Cited by:[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px2.p1.1),[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px5.p1.1)\.
- \[17\]W\. Kwon, Z\. Li, S\. Zhuang, Y\. Sheng, L\. Zheng, C\. H\. Yu, J\. E\. Gonzalez, H\. Zhang, and I\. Stoica\(2023\)Efficient memory management for large language model serving with pagedattention\.InProceedings of the 29th Symposium on Operating Systems Principles,pp\. 611–626\.Cited by:[§1](https://arxiv.org/html/2606.24033#S1.p1.1),[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px1.p1.1)\.
- \[18\]J\. Li, Y\. Zhang, M\. Y\. Hassan, T\. Chafekar, T\. Cai, Z\. Ren, P\. Guo, F\. Karimzadeh, C\. Reed, C\. Wang, and C\. Gan\(2025\)CommVQ: commutative vector quantization for KV cache compression\.InProceedings of the 42nd International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.267,pp\. 36831–36845\.External Links:[Link](https://proceedings.mlr.press/v267/li25du.html)Cited by:[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px3.p1.1)\.
- \[19\]Y\. Li, Y\. Huang, B\. Yang, B\. Venkitesh, A\. Locatelli, H\. Ye, T\. Cai, P\. Lewis, and D\. Chen\(2024\)SnapKV: llm knows what you are looking for before generation\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px5.p1.1)\.
- \[20\]A\. Liu, J\. Liu, Z\. Pan, Y\. He, G\. Haffari, and B\. Zhuang\(2024\)MiniCache: KV cache compression in depth dimension for large language models\.arXiv preprint arXiv:2405\.14366\.Cited by:[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px5.p1.1)\.
- \[21\]T\. Liu, S\. Li, J\. Yang, T\. Zhao, F\. Zhou, X\. Song, G\. Dai, S\. Yan, H\. Yang, and Y\. Wang\(2026\)PM\-KVQ: progressive mixed\-precision kv cache quantization for long\-cot llms\.InInternational Conference on Learning Representations \(ICLR\),Note:arXiv:2505\.18610; code:[https://github\.com/thu\-nics/PM\-KVQ](https://github.com/thu-nics/PM-KVQ)Cited by:[1st item](https://arxiv.org/html/2606.24033#A4.I3.i1.p1.11),[2nd item](https://arxiv.org/html/2606.24033#A4.I4.i2.p1.8),[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px4.p1.1),[§6\.3\.2](https://arxiv.org/html/2606.24033#S6.SS3.SSS2.p2.6),[§6\.3\.2](https://arxiv.org/html/2606.24033#S6.SS3.SSS2.p3.7)\.
- \[22\]Y\. Liu, H\. Li, Y\. Cheng, S\. Ray, Y\. Huang, Q\. Zhang, K\. Du, J\. Yao, S\. Lu, G\. Ananthanarayanan, M\. Maire, H\. Hoffmann, A\. Holtzman, and J\. Jiang\(2024\)CacheGen: KV cache compression and streaming for fast language model serving\.InProceedings of the ACM SIGCOMM 2024 Conference,External Links:[Document](https://dx.doi.org/10.1145/3651890.3672274)Cited by:[§1](https://arxiv.org/html/2606.24033#S1.p1.1),[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px1.p1.1)\.
- \[23\]Z\. Liu, A\. Desai, F\. Liao, W\. Wang, V\. Xie, Z\. Xu, A\. Kyrillidis, and A\. Shrivastava\(2023\)Scissorhands: exploiting the persistence of importance hypothesis for LLM KV cache compression at test time\.arXiv preprint arXiv:2305\.17118\.Cited by:[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px5.p1.1)\.
- \[24\]Z\. Liu, J\. Yuan, H\. Jin, S\. Zhong, Z\. Xu, V\. Braverman, B\. Chen, and X\. Hu\(2024\)KIVI: a tuning\-free asymmetric 2bit quantization for kv cache\.InForty\-first International Conference on Machine Learning \(ICML\),Cited by:[3rd item](https://arxiv.org/html/2606.24033#A4.I4.i3.p1.3),[§1](https://arxiv.org/html/2606.24033#S1.p2.1),[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px2.p1.1)\.
- \[25\]W\. Mao, X\. Lin, W\. Huang, Y\. Xie, T\. Fu, B\. Zhuang, S\. Han, and Y\. Chen\(2026\)TriAttention: efficient long reasoning with trigonometric KV compression\.arXiv preprint arXiv:2604\.04921\.Cited by:[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px3.p1.1)\.
- \[26\]B\. Peng, J\. Quesnelle, H\. Fan, and E\. Shippole\(2024\)YaRN: efficient context window extension of large language models\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px1.p1.1)\.
- \[27\]R\. Pope, S\. Douglas, A\. Chowdhery, J\. Devlin, J\. Bradbury, J\. Heek, K\. Xiao, S\. Agrawal, and J\. Dean\(2022\)Efficiently scaling transformer inference\.arXiv preprint arXiv:2211\.05102\.Cited by:[§1](https://arxiv.org/html/2606.24033#S1.p1.1),[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px1.p1.1)\.
- \[28\]Y\. Sheng, L\. Zheng, B\. Yuan, Z\. Li, M\. Ryabinin, D\. Y\. Fu, Z\. Xie, B\. Chen, C\. Barrett, J\. E\. Gonzalez, P\. Liang, C\. Ré, I\. Stoica, and C\. Zhang\(2023\)FlexGen: high\-throughput generative inference of large language models with a single GPU\.arXiv preprint arXiv:2303\.06865\.Cited by:[§1](https://arxiv.org/html/2606.24033#S1.p1.1)\.
- \[29\]A\. Shutova, V\. Malinovskii, V\. Egiazarian, D\. Kuznedelev, D\. Mazur, N\. Surkov, I\. Ermakov, and D\. Alistarh\(2025\)Cache me if you must: adaptive key\-value quantization for large language models\.InForty\-second International Conference on Machine Learning \(ICML\),Cited by:[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px2.p1.1)\.
- \[30\]J\. Su, M\. Ahmed, Y\. Lu, S\. Pan, W\. Bo, and Y\. Liu\(2024\)RoFormer: enhanced transformer with rotary position embedding\.Neurocomputing568,pp\. 127063\.Cited by:[§1](https://arxiv.org/html/2606.24033#S1.p3.10)\.
- \[31\]Y\. Su, Y\. Zhou, Q\. Qiu, J\. Li, Q\. Xia, P\. Li, X\. Duan, Z\. Wang, and M\. Zhang\(2025\)Accurate kv cache quantization with outlier tokens tracing\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Vienna, Austria,pp\. 12895–12915\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.631),[Link](https://aclanthology.org/2025.acl-long.631/)Cited by:[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px2.p1.1)\.
- \[32\]Z\. Su, H\. Wei, Z\. Chen, W\. Shen, L\. Li, H\. Yu, and K\. Yuan\(2025\)RotateKV: accurate and robust 2\-bit KV cache quantization for LLMs via outlier\-aware adaptive rotations\.InProceedings of the Thirty\-Fourth International Joint Conference on Artificial Intelligence \(IJCAI\),pp\. 6200–6208\.External Links:[Document](https://dx.doi.org/10.24963/ijcai.2025/690),[Link](https://www.ijcai.org/proceedings/2025/690)Cited by:[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px3.p1.1)\.
- \[33\]Z\. Su and K\. Yuan\(2025\)KVSink: understanding and enhancing the preservation of attention sinks in kv cache quantization for llms\.InConference on Language Modeling \(COLM\),Cited by:[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px2.p1.1)\.
- \[34\]W\. Tao, H\. Lu, X\. Qu, B\. Zhang, K\. Lu, J\. Wan, and J\. Wang\(2025\)MoQAE: mixed\-precision quantization for long\-context llm inference via mixture of quantization\-aware experts\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Vienna, Austria,pp\. 10810–10820\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.531),[Link](https://aclanthology.org/2025.acl-long.531/)Cited by:[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px2.p1.1)\.
- \[35\]H\. Wang, L\. Han, K\. Xu, and A\. Srivastava\(2025\)SQuat: subspace\-orthogonal kv cache quantization\.arXiv preprint arXiv:2503\.24358\.Cited by:[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px2.p1.1)\.
- \[36\]H\. Xia, X\. Wu, J\. Li, R\. Wu, J\. Wang, J\. Wang, C\. Li, A\. Singhal, A\. D\. Shah, A\. Ariyak, D\. Zhuang, Z\. Zhou, B\. Athiwaratkun, Z\. Zheng, and S\. L\. Song\(2025\)Kitty: accurate and efficient 2\-bit kv cache quantization with dynamic channel\-wise precision boost\.arXiv preprint arXiv:2511\.18643\.Note:Code:[https://github\.com/Summer\-Summer/Kitty](https://github.com/Summer-Summer/Kitty)Cited by:[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px4.p1.1)\.
- \[37\]G\. Xiao, Y\. Tian, B\. Chen, S\. Han, and M\. Lewis\(2024\)Efficient streaming language models with attention sinks\.InThe Twelfth International Conference on Learning Representations \(ICLR\),Cited by:[1st item](https://arxiv.org/html/2606.24033#A4.I3.i1.p1.11),[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px5.p1.1),[§6\.3\.2](https://arxiv.org/html/2606.24033#S6.SS3.SSS2.p2.6)\.
- \[38\]J\. Xin, T\. Lyu, D\. Keyes, H\. Ltaief, and M\. Canini\(2026\)RAP: KV\-cache compression via RoPE\-aligned pruning\.arXiv preprint arXiv:2602\.02599\.Cited by:[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px3.p1.1)\.
- \[39\]J\. Y\. Yang, B\. Kim, J\. Bae, B\. Kwon, G\. Park, E\. Yang, S\. J\. Kwon, and D\. Lee\(2024\)No token left behind: reliable kv cache compression via importance\-aware mixed precision quantization\.arXiv preprint arXiv:2402\.18096\.Cited by:[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px2.p1.1)\.
- \[40\]A\. Zandieh, M\. Daliri, M\. Hadian, and V\. Mirrokni\(2026\)TurboQuant: online vector quantization with near\-optimal distortion rate\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[4th item](https://arxiv.org/html/2606.24033#A4.I4.i4.p1.1),[§1](https://arxiv.org/html/2606.24033#S1.p2.1),[§1](https://arxiv.org/html/2606.24033#S1.p5.1),[§2\.2](https://arxiv.org/html/2606.24033#S2.SS2.p1.6),[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px2.p1.1)\.
- \[41\]A\. Zandieh, I\. Han, V\. Mirrokni, and A\. Karbasi\(2024\)SubGen: token generation in sublinear time and memory\.arXiv preprint arXiv:2402\.06082\.Cited by:[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px5.p1.1)\.
- \[42\]T\. Zhang, Z\. Zeng, H\. Peng, H\. Zhuang, and C\. Chen\(2025\)MixKVQ: query\-aware mixed\-precision kv cache quantization for long\-context reasoning\.arXiv preprint arXiv:2512\.19206\.Cited by:[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px4.p1.1)\.
- \[43\]T\. Zhang, J\. Yi, Z\. Xu, and A\. Shrivastava\(2024\)KV cache is 1 bit per channel: efficient large language model inference with coupled quantization\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§1](https://arxiv.org/html/2606.24033#S1.p2.1),[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px2.p1.1)\.
- \[44\]Z\. Zhang, Y\. Sheng, T\. Zhou, T\. Chen, L\. Zheng, R\. Cai, Z\. Song, Y\. Tian, C\. Ré, C\. Barrett, Z\. Wang, and B\. Chen\(2023\)H2O: heavy\-hitter oracle for efficient generative inference of large language models\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px5.p1.1)\.
- \[45\]Y\. Zhou, S\. Song, B\. Liu, Z\. Xi, S\. Jin, X\. Fan, Z\. Zhang, W\. Li, and X\. Huang\(2025\)EliteKV: scalable KV cache compression via RoPE frequency selection and joint low\-rank projection\.arXiv preprint arXiv:2503\.01586\.Cited by:[§5](https://arxiv.org/html/2606.24033#S5.SS0.SSS0.Px3.p1.1)\.
## Appendix Roadmap
Appendix [A](https://arxiv.org/html/2606.24033#A1) collects proofs for Block-GTQ (error bound, block weight, greedy optimality). Appendix [B](https://arxiv.org/html/2606.24033#A2) details the ten-model panel, the bit-allocation analysis, and the attention-fidelity diagnostics. Appendix [C](https://arxiv.org/html/2606.24033#A3) ablates calibration along length, score, and corpus, and reports cross-model PPL and allocation stability. Appendix [D](https://arxiv.org/html/2606.24033#A4) provides long-context (NIAH, LongBench) and reasoning (AIME) protocols. Appendix [E](https://arxiv.org/html/2606.24033#A5) provides the deployment data tables: footprint, latency/memory, and long-context perplexity.
## Appendix A Supplementary Theory Details
This appendix collects the theory details that are useful for auditability but are not needed in the main narrative. The main text uses three facts: the deployed K-cache error is a RoPE-logit error (Section [2.3](https://arxiv.org/html/2606.24033#S2.SS3)), that error admits a per-block bound (Lemma [2](https://arxiv.org/html/2606.24033#Thmtheorem2)), and the resulting allocation objective is optimized exactly by greedy allocation (Theorem [1](https://arxiv.org/html/2606.24033#Thmtheorem1)). The details below explain the coordinate change, the proof of the per-block error bound, the absolute-error chain behind the block weight, and the greedy allocation proof.
### A.1 Post-RoPE Cache and Pre-RoPE Coordinates
Although the cache stores post-RoPE keys, the analysis can be written in pre-RoPE coordinates. If $\hat{\mathbf{k}}_{m}^{\mathrm{R}}$ is the decoded post-RoPE key and $R_{t}$ denotes the absolute RoPE rotation at position $t$, define $\hat{\mathbf{k}}_{m} := R_{m}^{\top}\hat{\mathbf{k}}_{m}^{\mathrm{R}}$. Then, for a query at position $n$,

$$
(\mathbf{q}_{n}^{\mathrm{R}})^{\top}\hat{\mathbf{k}}_{m}^{\mathrm{R}} = (R_{n}\mathbf{q}_{n})^{\top}\hat{\mathbf{k}}_{m}^{\mathrm{R}} = \mathbf{q}_{n}^{\top}R_{n}^{\top}R_{m}\hat{\mathbf{k}}_{m} = \mathbf{q}_{n}^{\top}R_{m-n}\hat{\mathbf{k}}_{m}. \tag{2}
$$

RoPE is orthogonal block by block, so this coordinate change does not change block norms. It only lets us express the deployed post-RoPE cache error as a relative-position logit error.
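Equation (2) and the norm-preservation remark can be checked numerically. The following is a minimal sketch in plain Python that treats each RoPE block as a 2-D vector; the frequencies, vectors, and positions are arbitrary illustrative values, and the helper names are not from the paper.

```python
import math


def rotate(vec, angle):
    """Rotate a 2-D block counter-clockwise by `angle` radians."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)


def rope(blocks, pos, thetas):
    """Apply the absolute RoPE rotation R_pos block by block."""
    return [rotate(b, pos * th) for b, th in zip(blocks, thetas)]


def dot(a_blocks, b_blocks):
    """Inner product of two vectors stored as lists of 2-D blocks."""
    return sum(a[0] * b[0] + a[1] * b[1] for a, b in zip(a_blocks, b_blocks))


thetas = [1.0, 0.1, 0.01]                       # illustrative block frequencies
q = [(0.3, -1.2), (0.8, 0.5), (-0.4, 0.9)]      # pre-RoPE query blocks
k_hat = [(1.1, 0.2), (-0.7, 0.6), (0.5, 0.5)]   # pre-RoPE decoded key blocks
n, m = 17, 5                                    # query and key positions

lhs = dot(rope(q, n, thetas), rope(k_hat, m, thetas))   # (R_n q)^T (R_m k_hat)
rhs = dot(q, rope(k_hat, m - n, thetas))                # q^T R_{m-n} k_hat

# Block norms are unchanged by the rotation.
norms_equal = all(
    math.isclose(math.hypot(*b), math.hypot(*r))
    for b, r in zip(k_hat, rope(k_hat, m, thetas))
)
print(math.isclose(lhs, rhs, abs_tol=1e-9), norms_equal)
```

The two logits agree up to floating-point error because $R(n\theta_i)^{\top}R(m\theta_i) = R((m-n)\theta_i)$ within each block.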
### A.2 Proof of the Per-Block Accounting Bound
###### Lemma 2 (Per-block accounting of attention-logit error).
For a query at position $n$, a cached key at position $m$, and the equivalent pre-RoPE decoded key $\hat{\mathbf{k}}_{m} = R_{m}^{\top}\hat{\mathbf{k}}_{m}^{\mathrm{R}}$, let $\mathbf{e}_{\mathbf{k},m}^{(i)} = \mathbf{k}_{m}^{(i)} - \hat{\mathbf{k}}_{m}^{(i)}$ and

$$
\mathcal{E}_{n,m} := \left|\mathcal{K}_{m-n}(\mathbf{q}_{n},\mathbf{k}_{m}) - \mathcal{K}_{m-n}(\mathbf{q}_{n},\hat{\mathbf{k}}_{m})\right|.
$$

Then

$$
\mathcal{E}_{n,m} \leq \sum_{i}\|\mathbf{q}_{n}^{(i)}\|_{2}\,\|\mathbf{e}_{\mathbf{k},m}^{(i)}\|_{2}.
$$

###### Proof.
With $\Delta = m-n$, the block decomposition in Section [2.1](https://arxiv.org/html/2606.24033#S2.SS1) gives

$$
\mathcal{K}_{\Delta}(\mathbf{q}_{n},\mathbf{k}_{m}) - \mathcal{K}_{\Delta}(\mathbf{q}_{n},\hat{\mathbf{k}}_{m}) = \sum_{i}\mathbf{q}_{n}^{(i)\top}R(\Delta\theta_{i})\,\mathbf{e}_{\mathbf{k},m}^{(i)}.
$$

The triangle inequality and Cauchy–Schwarz yield

$$
\mathcal{E}_{n,m} \leq \sum_{i}\left|\mathbf{q}_{n}^{(i)\top}R(\Delta\theta_{i})\,\mathbf{e}_{\mathbf{k},m}^{(i)}\right| \leq \sum_{i}\|\mathbf{q}_{n}^{(i)}\|_{2}\,\|R(\Delta\theta_{i})\,\mathbf{e}_{\mathbf{k},m}^{(i)}\|_{2}.
$$

Each $R(\Delta\theta_{i})$ is a rotation, so it preserves the block norm: $\|R(\Delta\theta_{i})\,\mathbf{e}_{\mathbf{k},m}^{(i)}\|_{2} = \|\mathbf{e}_{\mathbf{k},m}^{(i)}\|_{2}$. ∎
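As a sanity check on Lemma 2, the bound can be tested on random inputs. This is a sketch under stated assumptions: the frequency schedule $10000^{-i/L}$ is a common RoPE choice assumed here for illustration, the Gaussian queries and errors are synthetic, and the helper names are hypothetical.

```python
import math
import random


def rotate(vec, angle):
    """Rotate a 2-D block counter-clockwise by `angle` radians."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)


def logit_error(q, err, delta, thetas):
    """|sum_i q_i^T R(delta * theta_i) e_i| for one query-key pair."""
    total = 0.0
    for qb, eb, th in zip(q, err, thetas):
        rx, ry = rotate(eb, delta * th)
        total += qb[0] * rx + qb[1] * ry
    return abs(total)


def block_bound(q, err):
    """Right-hand side of Lemma 2: sum_i ||q_i||_2 * ||e_i||_2."""
    return sum(math.hypot(*qb) * math.hypot(*eb) for qb, eb in zip(q, err))


rng = random.Random(0)
num_blocks = 8
thetas = [10000.0 ** (-i / num_blocks) for i in range(num_blocks)]  # assumed schedule

holds = True
for _ in range(1000):
    q = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(num_blocks)]
    err = [(rng.gauss(0, 0.1), rng.gauss(0, 0.1)) for _ in range(num_blocks)]
    delta = rng.randint(-4096, 4096)            # relative position m - n
    # Small slack absorbs floating-point rounding.
    holds = holds and logit_error(q, err, delta, thetas) <= block_bound(q, err) + 1e-12
print(holds)  # True
```

The bound does not depend on the relative position $\Delta$, which is why it can be used for keys that will be attended to by queries at unknown future positions.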
### A.3 From the Block Bound to the RoPE-Block Weight
Lemma [2](https://arxiv.org/html/2606.24033#Thmtheorem2) gives, for each query-key pair,

$$
\mathcal{E}_{n,m} \leq \sum_{i}\|\mathbf{q}_{n}^{(i)}\|_{2}\,\|\mathbf{e}_{\mathbf{k},m}^{(i)}\|_{2}.
$$

Suppose a local quantizer at bit width $b_{i}$ contributes a relative error factor $\alpha_{i}(b_{i})$ in block $i$, so that the typical block error is bounded by $\alpha_{i}(b_{i})\|\mathbf{k}^{(i)}\|_{2}$. Taking expectations over future query-key pairs gives

$$
\mathbb{E}[\mathcal{E}] \lesssim \sum_{i}\alpha_{i}(b_{i})\underbrace{\mathbb{E}\big[\|\mathbf{q}^{(i)}\|_{2}\,\|\mathbf{k}^{(i)}\|_{2}\big]}_{s_{i}^{\star}}. \tag{3}
$$

This derivation identifies the logit-error block weight $s_{i}^{\star}$; it is not the final method loss. Block-GTQ then uses the energy surrogate $s_{i}$ from Section [3.1](https://arxiv.org/html/2606.24033#S3.SS1) together with the TQ-MSE bit-error decay. The resulting allocation objective

$$
J(\mathbf{b}) = \sum_{i}s_{i}\,4^{-b_{i}}
$$

should be read as a rate-allocation proxy rather than a tight consequence of the absolute-error bound above: the score comes from the RoPE-logit sensitivity, while the factor $4^{-b_{i}}$ comes from the local MSE-oriented quantizer.
The $4^{-b_{i}}$ rate is not arbitrary: it matches the rate at which the *squared* logit error decays. Squaring the per-block bound and applying Cauchy–Schwarz gives $\mathcal{E}_{n,m}^{2} \leq L\sum_{i}\|\mathbf{q}_{n}^{(i)}\|_{2}^{2}\,\|\mathbf{e}_{\mathbf{k},m}^{(i)}\|_{2}^{2}$; together with the TQ-MSE squared-error bound $\mathbb{E}\|\mathbf{e}^{(i)}\|_{2}^{2} \lesssim 4^{-b_{i}}\|\mathbf{k}^{(i)}\|_{2}^{2}$, this yields a mean-squared logit-error bound of the form $\sum_{i}4^{-b_{i}}\,\mathbb{E}\big[\|\mathbf{q}^{(i)}\|_{2}^{2}\,\|\mathbf{k}^{(i)}\|_{2}^{2}\big]$. The $4^{-b_{i}}$ rate in $J$ is thus consistent with bounding the squared logit error, with $s_{i}$ serving as a simpler second-moment proxy for the product weight.
### A.4 Proof of Greedy Allocation Optimality

This section gives the full exchange proof for Theorem [1](https://arxiv.org/html/2606.24033#Thmtheorem1). Fix positive scores $s_{i}$, bounds $b_{\min}\leq b_{i}\leq b_{\max}$, and a feasible integer budget $B\in[Lb_{\min},Lb_{\max}]$. Start all blocks at $b_{\min}$ and let $K=B-Lb_{\min}$ be the number of extra bit units to assign. For block $i$, define the gain of its $r$-th extra bit as

$$g_{i,r}:=s_{i}\,4^{-(b_{\min}+r-1)}-s_{i}\,4^{-(b_{\min}+r)}=\tfrac{3}{4}\,s_{i}\,4^{-(b_{\min}+r-1)},$$

for $r=1,\ldots,b_{\max}-b_{\min}$. These gains decrease geometrically in $r$.

Choosing a final bit width $b_{i}=b_{\min}+k_{i}$ is equivalent to choosing the first $k_{i}$ gains $g_{i,1},\ldots,g_{i,k_{i}}$ from block $i$. Hence every feasible allocation chooses exactly $K$ gains subject to a prefix constraint: it may choose $g_{i,r}$ only if it also chooses $g_{i,1},\ldots,g_{i,r-1}$. The value of an allocation is the total chosen gain, because subtracting these gains from the all-$b_{\min}$ objective gives $J(\mathbf{b})$.

Algorithm [1](https://arxiv.org/html/2606.24033#alg1) repeatedly chooses the largest available gain, where available means that the required prefix for that block has already been chosen. We prove optimality by induction on the greedy prefix. Assume there is an optimal feasible set $O$ containing the first $t$ greedy gains, and let $P_{t}$ denote that prefix. Let $g$ be the next greedy gain, from block $a$. If $g\in O$, the invariant already holds for $P_{t}\cup\{g\}$. Otherwise, add $g$ to $O$; this is prefix-feasible because $g$ was available after $P_{t}$, and $P_{t}\subseteq O$. The enlarged set has one too many gains, so we remove a terminal gain from another block without reducing value. Since $O$ contains $K$ gains but omits $g$, some block $j$ has gains in $O\setminus P_{t}$. Let $h_{j}$ be the first such gain after the prefix of block $j$ already present in $P_{t}$. This gain was available to greedy at step $t+1$, so $g\geq h_{j}$. Rather than removing $h_{j}$ itself, remove the last selected gain from block $j$ in $O$; monotonicity gives this terminal gain a value of at most $h_{j}$, hence at most $g$, and removing a terminal gain preserves the prefix constraint. The exchange therefore produces an optimal feasible set containing $P_{t}\cup\{g\}$. Repeating for $t=0,\ldots,K-1$ proves the greedy allocation is optimal.
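The exchange argument can be checked numerically on small instances. The following is a minimal sketch, not the authors' implementation: the function names `objective`, `greedy_allocate`, and `brute_force` are hypothetical, and the heap-based loop is one straightforward way to realize "pick the largest available gain". It compares the greedy result against an exhaustive search over all feasible integer allocations.

```python
import heapq
from itertools import product


def objective(scores, bits):
    """J(b) = sum_i s_i * 4^(-b_i)."""
    return sum(s * 4.0 ** (-b) for s, b in zip(scores, bits))


def greedy_allocate(scores, budget, b_min=1, b_max=8):
    """Greedy integer bit allocation by largest available marginal gain."""
    n_blocks = len(scores)
    if not (n_blocks * b_min <= budget <= n_blocks * b_max):
        raise ValueError("budget is infeasible for the given bounds")
    bits = [b_min] * n_blocks
    # Max-heap via negated gains; each entry is a block's next available gain
    # g_{i,r} = (3/4) * s_i * 4^(-(current bit width)).
    heap = []
    if b_max > b_min:
        heap = [(-0.75 * s * 4.0 ** (-b_min), i) for i, s in enumerate(scores)]
        heapq.heapify(heap)
    for _ in range(budget - n_blocks * b_min):
        _, i = heapq.heappop(heap)
        bits[i] += 1
        if bits[i] < b_max:  # the block's next gain becomes available
            heapq.heappush(heap, (-0.75 * scores[i] * 4.0 ** (-bits[i]), i))
    return bits


def brute_force(scores, budget, b_min=1, b_max=8):
    """Exhaustive minimum of J over feasible allocations (tiny instances only)."""
    best = None
    for bits in product(range(b_min, b_max + 1), repeat=len(scores)):
        if sum(bits) == budget:
            value = objective(scores, bits)
            if best is None or value < best:
                best = value
    return best
```

Because each block's gains decrease geometrically, only the next gain per block ever needs to sit in the heap, so the loop costs $O(K\log L)$ for $K$ extra bits over $L$ blocks.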
## Appendix B Attention-Interface Diagnostic Details

The main text reports the bit-allocation fingerprint, the cross-model RoPE-logit error summary, and the panel-wide softmax-KL bars and top-10 overlap scatter. This appendix supplies the model panel, activation-extraction rules, the panel-level bit-allocation analysis (aggregate distributions and per-layer heterogeneity), metric definitions, the per-layer RoPE-logit error protocol, and per-model softmax-KL and top-10 overlap tables.

### B.1 Model Panel and Activation Extraction
Table[7](https://arxiv.org/html/2606.24033#A2.T7)lists the ten\-model panel used for the cross\-architecture attention diagnostics\. The panel is chosen for architectural coverage rather than leaderboard coverage\. Nine models use GQA \(small to larger Qwen2\.5, Qwen3 with QK\-RMSNorm including the MoE Qwen3\-30B\-A3B, Llama\-3\.1, two reasoning\-distilled DeepSeek\-R1 backbones, Mistral\-Nemo, and the fused\-QKV GLM\-4\-9B\), and one uses MLA \(DS\-V2\-Lite, which is also MoE\)\. For brevity in tables, we abbreviate DeepSeek\-R1\-Distill\-Llama\-8B, DeepSeek\-R1\-Distill\-Qwen\-7B, and DeepSeek\-V2\-Lite as DS\-R1\-Llama\-8B, DS\-R1\-Qwen\-7B, and DS\-V2\-Lite, respectively\.
Table 7: Ten-model panel used for the attention diagnostics and aggregate bit-allocation tables. “Geometry” reports number of layers, query/KV head counts, and per-head dimension; the last column notes each model’s role in the panel and any non-standard calibration handling.

Two models in the panel deviate from the standard GQA Q/K layout, so they need an extra step. GLM-4-9B fuses Q, K, and V into a single `query_key_value` projection matrix instead of the three separate matrices (`q_proj`, `k_proj`, `v_proj`) used by the other GQA models. This is an implementation-level fusion that leaves the attention math unchanged. We apply the fused projection, slice its output along the last dimension into Q, K, V, and feed Q and K through the same GQA averaging used elsewhere. DeepSeek-V2-Lite uses MLA, in which the K vector consumed by attention has two components: a content part recovered from a low-rank latent representation (not RoPE-rotated, identical across query heads) and a small decoupled RoPE-key that carries position through RoPE rotation (also shared across all query heads). Block-GTQ targets only RoPE-rotated keys, so the latent is outside its scope and the diagnostic uses only the decoupled RoPE-key path. In the panel table this path appears as one shared head with $d_{\mathrm{rope}}=64$, treated as a single KV head common to all query heads.

### B.2 Bit Allocation across Models
##### Aggregate distributions\.
Block-GTQ’s energy scores are calibrated on 2048 WikiText-2 train tokens (full protocol in Appendix [B.4](https://arxiv.org/html/2606.24033#A2.SS4)). At both the 3 b/dim and 2 b/dim budgets, every model produces a non-uniform allocation: the budget bit width is the mode, with nontrivial mass at lower and higher widths; the mode shifts from 3 to 2 bits between the two budgets for all ten models. Tables [8](https://arxiv.org/html/2606.24033#A2.T8) and [9](https://arxiv.org/html/2606.24033#A2.T9) give the per-model percentages at each budget, and Figure [7](https://arxiv.org/html/2606.24033#A2.F7) plots the per-layer fingerprint at the 2 b/dim budget.

Table 8: Aggregate Block-GTQ bit-allocation distribution at the 3 b/dim budget (numeric counterpart of Figure [3](https://arxiv.org/html/2606.24033#S6.F3)). Each cell is the percentage of all (layer, head, frequency-block) triples in that model assigned the given bit width.

Table 9: Aggregate Block-GTQ bit-allocation distribution at the 2 b/dim budget (per-layer fingerprint in Figure [7](https://arxiv.org/html/2606.24033#A2.F7)). Each cell is the percentage of all (layer, head, frequency-block) triples in that model assigned the given bit width.

Figure 7: Allocator fingerprint at the 2 b/dim budget. Per-layer bit-width distribution for each model. The numeric counterpart is Table [9](https://arxiv.org/html/2606.24033#A2.T9).
##### Per\-layer heterogeneity\.
The aggregate tables above show distributions at a single budget but hide how the allocation varies across layers within each model. For each layer $\ell$ we collapse the allocation over all (head, frequency-block) pairs into a histogram $\{n_{b}^{(\ell)}\}_{b=1}^{8}$, with normalized frequencies $p_{b}^{(\ell)}=n_{b}^{(\ell)}/\sum_{b'}n_{b'}^{(\ell)}$, and report two layer-level statistics: the distinct-bit-width count $\mathrm{grps}^{(\ell)}=|\{b:n_{b}^{(\ell)}>0\}|$ and the Shannon entropy $H^{(\ell)}=-\sum_{b}p_{b}^{(\ell)}\log_{2}p_{b}^{(\ell)}$ in bits ($H=0$ marks a single-bit-width layer; $H\approx 3$ marks near-uniform coverage of all 8 widths). Table [10](https://arxiv.org/html/2606.24033#A2.T10) reports the per-model mean and spread of both statistics at 3 b/dim, and Figure [8](https://arxiv.org/html/2606.24033#A2.F8) plots $H^{(\ell)}$ against normalized layer depth for each model. Every model uses multiple bit widths per layer ($\overline{\mathrm{grps}}\in[4.0,5.6]$), the entropy curves typically oscillate within $H\in[1.3,1.6]$, and the most heterogeneous layer varies by model.
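The two layer-level statistics are simple to compute from a flat list of assigned bit widths. The sketch below is illustrative only; the function name `layer_bit_stats` and the flat-list input format are assumptions, not part of the paper's code.

```python
import math
from collections import Counter


def layer_bit_stats(bit_widths):
    """Return (grps, H) for one layer.

    bit_widths: flat iterable of the integer bit widths assigned to every
    (head, frequency-block) pair in the layer (hypothetical input format).
    """
    counts = Counter(bit_widths)
    total = sum(counts.values())
    grps = len(counts)  # number of distinct bit widths in use
    # Shannon entropy in bits, written as p * log2(1/p) so each term is >= 0.
    entropy = sum((n / total) * math.log2(total / n) for n in counts.values())
    return grps, entropy
```

A layer that uses a single width gives `(1, 0.0)`, and a layer that spreads its blocks evenly over all eight widths gives an entropy of 3 bits, matching the two reference points quoted above.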
Table 10: Per-layer bit-distribution summary across all ten models at 3 b/dim. $\overline{\mathrm{grps}}$ is the mean number of distinct bit levels per layer; $\overline{H}$ is the mean per-layer Shannon entropy (bits); $\sigma_{H}$ is its standard deviation across layers; $H_{\max}$ (L) and $H_{\min}$ (L) are the most/least heterogeneous layer indices, with the corresponding entropy value.

Figure 8: Per-layer Shannon entropy $H^{(\ell)}$ of the bit-width histogram at 3 b/dim across normalized layer depth (0 = first layer, 1 = last layer). Each curve is one model; per-model means and extrema are listed in Table [10](https://arxiv.org/html/2606.24033#A2.T10).

### B.3 Per-Layer RoPE-Logit MAE
The per-layer RoPE-logit MAE between the original key $\mathbf{k}$ and its quantized reconstruction $\hat{\mathbf{k}}$, averaged over all KV heads at the layer, is

$$\mathrm{MAE}_{\ell}\;=\;\mathbb{E}_{h\in\mathcal{H}_{\ell}^{\mathrm{KV}}}\;\mathbb{E}_{g\in G_{\ell}(h)}\;\mathbb{E}_{(\mathbf{q}_{\ell,g},\mathbf{k}_{\ell,h})\sim\mathcal{T}_{\ell}}\;\mathbb{E}_{\Delta\in\mathcal{D}}\left|\mathbf{q}_{\ell,g}^{\top}R_{\Delta}\mathbf{k}_{\ell,h}-\mathbf{q}_{\ell,g}^{\top}R_{\Delta}\hat{\mathbf{k}}_{\ell,h}\right|,$$

where $\mathbf{q}_{\ell,g}$ and $\mathbf{k}_{\ell,h}$ are pre-RoPE query and key activations at layer $\ell$, query head $g$, and KV head $h$ (the analytic block-diagonal rotation $R_{\Delta}$ is applied identically to clean and quantized keys); $G_{\ell}(h)$ is the set of query heads served by KV head $h$; and $\mathcal{D}$ is a grid of 50 evenly spaced relative offsets in $[-1024,1024]$. For the DS-V2-Lite MLA, $H_{\mathrm{KV}}=1$ and $\mathbf{k}_{\ell,h}$ is the single shared decoupled RoPE-key; for the partial-rotary GLM-4, $R_{\Delta}$ rotates only the first 64 of the 128 key dimensions and is the identity on the remaining 64, which therefore contribute a static, offset-independent term. Architectural specifics for non-standard projections (Qwen3’s QK-RMSNorm, GLM-4’s fused QKV) are described in Appendix [B.1](https://arxiv.org/html/2606.24033#A2.SS1). We compute $\mathrm{MAE}_{\ell}$ independently for every (model, layer) pair under a K-only setting (V is unchanged). Block-GTQ’s frequency-block energy scores are fit on the first 2048 tokens of the WikiText-2 *train* split; $\mathrm{MAE}_{\ell}$ is then evaluated on the first 2048 tokens of the WikiText-2 *test* split (TQ-MSE is data-free and needs no fit).

Table [1](https://arxiv.org/html/2606.24033#S6.T1) reports $\mathrm{MAE}_{\ell}$ at the 3 b/dim budget; Table [11](https://arxiv.org/html/2606.24033#A2.T11) repeats it at 2 b/dim, where Block-GTQ again wins all 367/367 layer comparisons with comparable relative reductions (absolute MAE rises at the tighter budget for both methods).

Table 11: Per-layer RoPE-logit error at the 2 b/dim budget, K-only (appendix counterpart of Table [1](https://arxiv.org/html/2606.24033#S6.T1), which is at 3 b/dim). Values are mean RoPE-logit MAE across model layers; lower is better. $\Delta$ is the relative reduction versus TQ-MSE; “Wins” counts layers where Block-GTQ beats uniform TQ-MSE.

| Model | TQ-MSE | Block-GTQ | $\Delta$ | Wins |
|---|---|---|---|---|
| Qwen2.5-3B | 13.25 | 6.30 | +52.5% | 36/36 |
| Qwen2.5-14B | 8.20 | 5.15 | +37.2% | 48/48 |
| Qwen3-8B | 11.46 | 5.88 | +48.7% | 36/36 |
| Qwen3-30B-A3B | 12.97 | 5.89 | +54.6% | 48/48 |
| Llama-3.1-8B | 7.49 | 5.02 | +33.0% | 32/32 |
| DS-R1-Llama-8B | 6.77 | 4.54 | +33.0% | 32/32 |
| DS-R1-Qwen-7B | 25.20 | 5.01 | +80.1% | 28/28 |
| Mistral-Nemo-12B | 6.75 | 4.42 | +34.5% | 40/40 |
| GLM-4-9B | 16.31 | 9.90 | +39.3% | 40/40 |
| DS-V2-Lite | 11.31 | 7.25 | +35.9% | 27/27 |
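For a single (query head, KV head) pair, the inner expectation of $\mathrm{MAE}_{\ell}$ can be sketched as follows. This is an illustration under stated assumptions rather than the paper's evaluation code: the names `rope_rotate` and `rope_logit_mae` are hypothetical, RoPE blocks are assumed to be adjacent dimension pairs with frequencies $\theta_{i}=\mathrm{base}^{-2i/d}$ and base 10000, and real models may pair dimensions differently or use a different base.

```python
import numpy as np


def rope_rotate(x, delta, base=10000.0):
    """Apply the block-diagonal rotation R_delta along the last axis of x.

    Assumes adjacent-pair blocks (x[2i], x[2i+1]) and theta_i = base**(-2i/d).
    """
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # one frequency per 2-D block
    cos, sin = np.cos(delta * theta), np.sin(delta * theta)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x, dtype=float)
    out[..., 0::2] = cos * x_even - sin * x_odd
    out[..., 1::2] = sin * x_even + cos * x_odd
    return out


def rope_logit_mae(q, k, k_hat, offsets):
    """Mean |q^T R_delta k - q^T R_delta k_hat| over paired rows and offsets.

    q, k, k_hat: float arrays of shape (n_pairs, d); row t of q is paired
    with row t of k. R_delta is linear, so it suffices to rotate the error.
    """
    errors = []
    for delta in offsets:
        rotated_err = rope_rotate(k - k_hat, delta)
        errors.append(np.abs(np.sum(q * rotated_err, axis=-1)))
    return float(np.mean(errors))
```

With `offsets = np.linspace(-1024, 1024, 50)` the function averages over a 50-point grid like $\mathcal{D}$; whether the paper's grid uses integer offsets is not stated here, so that choice is also an assumption. The layer-level metric would additionally average over KV heads and their query-head groups.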
### B.4 Attention Diagnostics across Models
##### Test protocol\.
Test contexts are drawn from the held-out WikiText-2 test split. We forward the first 2048 tokens through the model as a single long-context sequence, and collect pre-RoPE Q/K at every transformer layer. Attention metrics are then computed as follows: each query position $t\in\{1025,\ldots,2048\}$ attends to its full causal prefix $\{1,\ldots,t-1\}$, with RoPE attention logits $s_{t,i}=\mathbf{q}_{t}^{\top}R_{t-i}\mathbf{k}_{i}$ formed analytically (the same rotation $R_{t-i}$ is applied to clean and quantized keys). Each (model, method, bit rate) cell is averaged over all 1024 query positions.
We report the no\-buffer setting, where every cached key is read from its quantized representation\. An fp16 recent\-key buffer leaves the most\-recent keys exact for every method; since attention places an outsized share of its mass on recent positions, a buffered comparison reflects that shared fp16 region more than the quantizer under test\. We therefore isolate the K\-quantizer with no buffer and represent KIVI by its buffer\-free ScaleOnly variant \(Appendix[B\.4](https://arxiv.org/html/2606.24033#A2.SS4.SSS0.Px6)\)\.
##### Calibration\.
The calibration sample $(\mathbf{q}_{\mathrm{cal}},\mathbf{k}_{\mathrm{cal}})$ is drawn from a 2048-token WikiText-2 train prompt. Each quantizer uses this sample differently: KIVI fits its initial per-channel scale on $\mathbf{k}_{\mathrm{cal}}$ in the no-buffer setting (see Appendix [B.4](https://arxiv.org/html/2606.24033#A2.SS4.SSS0.Px6)); TQ-MSE is data-free; Block-GTQ computes the per-block energy score from $(\mathbf{q}_{\mathrm{cal}},\mathbf{k}_{\mathrm{cal}})$, then derives the per-block bit allocation and the same-rate group codebooks.
##### Metrics\.
For a query $\mathbf{q}\in\mathbb{R}^{d}$ and original/quantized context-key matrices $K,\hat{K}\in\mathbb{R}^{C\times d}$, let $s,\hat{s}\in\mathbb{R}^{C}$ be the original and quantized RoPE attention logit rows and $p=\operatorname{softmax}(s/\sqrt{d})$, $\hat{p}=\operatorname{softmax}(\hat{s}/\sqrt{d})$. We report two diagnostics:

$$\text{Softmax KL}=\mathbb{E}\,\mathrm{KL}(p\,\|\,\hat{p}),\qquad\text{Top-}10\text{ overlap}=\mathbb{E}\!\left[\frac{|\operatorname{top}_{10}(s)\cap\operatorname{top}_{10}(\hat{s})|}{10}\right].$$

Softmax KL is the divergence between the quantized and fp16 softmax distributions; because KL weights each token by its fp16 attention mass, errors at high-attention tokens dominate. Top-10 attended-token overlap reports the fraction of fp16’s ten most-attended tokens that the quantized version also ranks in its top-10.

Table 12: Per-model softmax KL ($\downarrow$, lower is better). Per-model values behind the panel-mean bars in Figure [4](https://arxiv.org/html/2606.24033#S6.F4)(a), with columns grouped by the 2, 3, and 4 b/dim budgets. KIVI refers to the no-buffer KIVI-ScaleOnly variant (Appendix [B.4](https://arxiv.org/html/2606.24033#A2.SS4.SSS0.Px6)). Best per (model, budget) in bold; the last row is the panel mean.

Table 13: Per-model top-10 attended-token overlap ($\uparrow$, higher is better). Per-model values behind the scatter in Figure [4](https://arxiv.org/html/2606.24033#S6.F4)(b), with columns grouped by the 2, 3, and 4 b/dim budgets. Best per (model, budget) in bold; the last row is the panel mean.
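For one query row, the two diagnostics defined above can be sketched as below. The function names are hypothetical and the code is only a minimal illustration; it works in log-space so that very small quantized probabilities do not produce infinities.

```python
import numpy as np


def log_softmax(x):
    """Numerically stable log-softmax of a 1-D array."""
    z = x - np.max(x)
    return z - np.log(np.sum(np.exp(z)))


def softmax_kl(s, s_hat, d):
    """KL(p || p_hat) with p = softmax(s / sqrt(d)), p_hat = softmax(s_hat / sqrt(d))."""
    log_p = log_softmax(s / np.sqrt(d))
    log_p_hat = log_softmax(s_hat / np.sqrt(d))
    return float(np.sum(np.exp(log_p) * (log_p - log_p_hat)))


def topk_overlap(s, s_hat, k=10):
    """Fraction of the k highest original logits that are also top-k after quantization."""
    top = set(np.argsort(s)[-k:].tolist())
    top_hat = set(np.argsort(s_hat)[-k:].tolist())
    return len(top & top_hat) / k
```

Averaging both quantities over the 1024 query positions of the test protocol would give one (model, method, bit rate) cell; that outer loop is omitted here.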
##### Cross\-method results\.
Per-model numbers behind Figure [4](https://arxiv.org/html/2606.24033#S6.F4) are in Tables [12](https://arxiv.org/html/2606.24033#A2.T12) and [13](https://arxiv.org/html/2606.24033#A2.T13). Block-GTQ has the lowest mean softmax KL at every budget and jointly wins both axes (lowest softmax KL and highest top-10 overlap) on 7/10 models at 2 b/dim, 8/10 at 3 b/dim, and 9/10 at 4 b/dim. Relative to TQ-MSE, panel-mean softmax KL drops by 3.28× / 3.88× / 4.63× and panel-mean top-10 overlap rises by 12.8 / 8.9 / 5.6 percentage points at the three budgets. In relative softmax-KL terms the advantage widens with the bit budget, even as the absolute top-10 gain narrows: as more bits become available, the non-uniform allocator routes incremental bandwidth to high-energy RoPE blocks that uniform-rate baselines cannot exploit; at the tight 2-bit budget all three methods absorb relatively similar quantization noise, so the relative KL gap is the smallest.
##### Where KIVI is competitive\.
Block-GTQ beats TQ-MSE on every (model, bit-budget) cell, on both softmax KL and top-10 overlap, and beats KIVI-ScaleOnly on eight of the ten panel models. The remaining two, DS-V2-Lite and GLM-4-9B, are the architectures whose RoPE substructure is half-width: both leave the allocator only 32 RoPE-carrying frequency blocks, half the 64 of the standard GQA models. DS-V2-Lite uses MLA, whose single shared decoupled RoPE-key is 64-dimensional (32 blocks total); GLM-4-9B is partial-rotary, rotating only the first 64 of its 128 key dimensions, so only 32 of its 64 blocks carry RoPE-frequency structure. With half the structure to differentiate, Block-GTQ’s RoPE-aware advantage shrinks and per-channel KIVI-ScaleOnly becomes competitive. The effect is decisive on DS-V2-Lite, where KIVI wins both axes at every budget, by a wide margin at 2 b/dim (top-10 overlap 68.0% vs 49.9%), but only partial on GLM-4-9B, where KIVI wins both axes at 2 b/dim and softmax KL at 3 b/dim before Block-GTQ recovers both by 4 b/dim. We attribute the persistence on DS-V2-Lite to its MLA geometry: the single decoupled RoPE-key is consumed by all 16 query heads, so its quantization error is shared layer-wide and the allocator has only those 32 blocks to work with; GLM-4-9B instead keeps a full 128-dimensional key, so once the budget loosens the allocator can spend the extra bandwidth on its non-rotary half and recover.
##### Fair\-comparison note on KIVI\.
KIVI as originally proposed ships with a 32-token fp16 residual buffer as an integral part of the method: every cached key passes through this buffer before being quantized. Our diagnostic uses a custom KIVI-ScaleOnly variant that retains KIVI’s per-channel rolling-scale quantizer but removes the residual buffer. KIVI-ScaleOnly is therefore not a deployment configuration; it exists only to make the K-quantizer comparable across methods.
The above describes only the K side of KIVI\-ScaleOnly\. In the attention diagnostics \(Section[6\.1](https://arxiv.org/html/2606.24033#S6.SS1)\), V stays fp16, so this K\-only variant is used directly\. In the downstream tasks \(Section[6\.3](https://arxiv.org/html/2606.24033#S6.SS3): NIAH, LongBench, AIME\) at K3V3/K3V2/K2V2 budgets, Block\-GTQ, TQ\-MSE, and KIVI\-ScaleOnly share the same V quantizer \(TQ\-MSE\); the K quantizer is where they differ\.
## Appendix C Calibration Robustness

Block-GTQ’s bit allocation is computed from a per-RoPE-block energy score over a short calibration prefix; Algorithm [2](https://arxiv.org/html/2606.24033#alg2) states the calibration procedure and Equation [4](https://arxiv.org/html/2606.24033#A3.E4) below gives the GQA-aware score formula. This appendix quantifies the sensitivity of the resulting allocation to three calibration choices: prefix length, score function, and calibration corpus.

Algorithm 2 RoPE-block score calibration

1: Input: model $\mathcal{M}$, calibration tokens $X$
2: Output: score vectors $\{s_{\ell,h,i}\}$ for every layer $\ell$, KV head $h$, and RoPE block $i$
3: Run $\mathcal{M}$ on $X$ and capture Q/K vectors used by RoPE attention
4: for each layer $\ell$ do
5:   for each KV head $h$ do
6:     Identify the query-head group $G(h)$ served by KV head $h$
7:     Split each captured query and key head into RoPE blocks $i=1,\ldots,L$
8:     for each RoPE block $i$ do
9:       Average $\|\mathbf{q}_{\ell,g,t}^{(i)}\|_{2}^{2}$ over tokens $t$ and query heads $g\in G(h)$
10:      Average $\|\mathbf{k}_{\ell,h,t}^{(i)}\|_{2}^{2}$ over tokens $t$
11:      Set $s_{\ell,h,i}$ by Equation [4](https://arxiv.org/html/2606.24033#A3.E4)
12:    end for
13:  end for
14: end for
##### GQA energy formula\.
Under grouped-query attention (GQA), one KV head is shared by multiple query heads. With $G_{\ell}(h)$ denoting the layer-specific query-head group from Subsection [3.1](https://arxiv.org/html/2606.24033#S3.SS1) and $N$ the number of calibration tokens, the Block-GTQ score $s_{\ell,h,i}$ for layer $\ell$, KV head $h$, and RoPE block $i$ is

$$s_{\ell,h,i}=\frac{1}{2}\left(\frac{1}{N|G_{\ell}(h)|}\sum_{t=1}^{N}\sum_{g\in G_{\ell}(h)}\left\|\mathbf{q}_{\ell,g,t}^{(i)}\right\|^{2}+\frac{1}{N}\sum_{t=1}^{N}\left\|\mathbf{k}_{\ell,h,t}^{(i)}\right\|^{2}\right).\tag{4}$$

The Q-side term averages squared norms over the query heads served by the KV head. Averaging the Q vectors over heads *before* squaring would yield a value that is never larger, by Jensen’s inequality (with equality only when the heads’ block vectors coincide; collinear but unequal vectors still lose energy), and would therefore systematically under-count the Q-side energy.
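Equation (4) for one layer and one KV head can be sketched in a few lines of NumPy. This is a hedged illustration, not the released implementation: the name `block_energy_scores`, the array shapes, and the adjacent-pair block layout (dimensions $2i$ and $2i+1$ form block $i$) are all assumptions.

```python
import numpy as np


def block_energy_scores(q, k):
    """Per-block score s_i for one (layer, KV head), following Eq. (4).

    q: (G, N, d) captured queries of the G query heads served by this KV head.
    k: (N, d) captured keys of the KV head.
    Returns an array of shape (d // 2,), one score per 2-D RoPE block.
    """
    n_heads, n_tokens, d = q.shape
    q_blocks = q.reshape(n_heads, n_tokens, d // 2, 2)
    k_blocks = k.reshape(n_tokens, d // 2, 2)
    # Square first, then average over heads and tokens (not the other way round).
    q_energy = np.mean(np.sum(q_blocks ** 2, axis=-1), axis=(0, 1))
    k_energy = np.mean(np.sum(k_blocks ** 2, axis=-1), axis=0)
    return 0.5 * (q_energy + k_energy)
```

The order of operations is the point of the Jensen remark: for two heads with block vectors $\mathbf{v}$ and $2\mathbf{v}$, averaging first gives $\|1.5\,\mathbf{v}\|^{2}=2.25\,\|\mathbf{v}\|^{2}$, whereas averaging the squared norms gives $2.5\,\|\mathbf{v}\|^{2}$.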
### C.1 Calibration length ablation

K2V2 is sensitive to the calibration length (Table [14](https://arxiv.org/html/2606.24033#A3.T14)), and the curve is non-monotone: $N=64$ (95.68) beats $N=128$, 256, and 1024, and only $N=2048$ wins cleanly (97.36). The per-task breakdown (same table) shows multi-query alone swinging 12 pp peak-to-trough across the smaller budgets (74.24 at $N=1024$ to 86.70 at $N=2048$), with the binary subtasks staying $\geq 91.92$. K3V3 is much less sensitive: every $N$ lies within 1.07 pp of $N=2048$, and even m-query stays within 3.20 pp of the $N=2048$ baseline. The K2V2 non-monotonicity comes from finite-sample noise in the per-block energy estimates: small $N_{\mathrm{cal}}$ flips roughly five of 64 marginal-gain comparisons per head, damped out only at $N=2048$.

Table 14: Calibration length ablation, per-task NIAH pass-rate (%) on Llama-3.1-8B-Instruct. $\Delta_{2048}$ is the change in Overall vs $N=2048$. At K2V2 the budget-noise effect concentrates on the fractional-scored subtasks, with m-query swinging 12 pp peak-to-trough (74.24 at $N=1024$ vs. 86.70 at $N=2048$); binary subtasks stay $\geq 91.92$. At K3V3 every subtask is within $\sim 3.54$ pp of the $N=2048$ baseline.
### C.2 Energy score ablation

We compare five energy score functions on Block-GTQ: the default `qk_avg` (Eq. [4](https://arxiv.org/html/2606.24033#A3.E4)) and four alternatives spanning symmetric aggregations and single-sided variants:

$$\begin{aligned}
\texttt{qk\_avg}&=\tfrac{1}{2}\left(\mathbb{E}\|\mathbf{q}\|^{2}+\mathbb{E}\|\mathbf{k}\|^{2}\right), &
\texttt{qk\_max}&=\max\left(\mathbb{E}\|\mathbf{q}\|^{2},\,\mathbb{E}\|\mathbf{k}\|^{2}\right),\\
\texttt{qk\_product}&=\sqrt{\mathbb{E}\|\mathbf{q}\|^{2}\cdot\mathbb{E}\|\mathbf{k}\|^{2}}, &
\texttt{k\_only}&=\mathbb{E}\|\mathbf{k}\|^{2},\\
\texttt{q\_only}&=\mathbb{E}\|\mathbf{q}\|^{2}.
\end{aligned}$$

`qk_max` is another symmetric aggregator (the larger of the two squared norms); `qk_product` is their geometric mean; the two single-sided variants `k_only` and `q_only` drop one side of attention entirely and pin down which side carries the signal. All five variants share the same calibration (the first 2048 tokens of the WikiText-2 test split) and the same Block-GTQ allocator; we run NIAH on Llama-3.1-8B-Instruct at the rate-sensitive K2V2 budget, where the score choice is most consequential (Table [15](https://arxiv.org/html/2606.24033#A3.T15)). The symmetric default `qk_avg` wins by Overall (97.36) and on most per-task columns.

Table 15: Energy score ablation. Block-GTQ per-task NIAH pass-rate (%) on Llama-3.1-8B-Instruct at K2V2; per-column best in bold.
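The five variants differ only in how the two per-block second moments are combined. A small sketch, with a hypothetical function name and scalar inputs standing in for $\mathbb{E}\|\mathbf{q}\|^{2}$ and $\mathbb{E}\|\mathbf{k}\|^{2}$ of a single RoPE block:

```python
import math


def energy_score(eq2, ek2, kind="qk_avg"):
    """Combine the mean squared block norms E||q||^2 (eq2) and E||k||^2 (ek2)."""
    if kind == "qk_avg":
        return 0.5 * (eq2 + ek2)
    if kind == "qk_max":
        return max(eq2, ek2)
    if kind == "qk_product":
        return math.sqrt(eq2 * ek2)  # geometric mean of the two moments
    if kind == "k_only":
        return ek2
    if kind == "q_only":
        return eq2
    raise ValueError(f"unknown score kind: {kind}")
```

Since the allocator only compares marginal gains across blocks, any variant that preserves the ranking and ratios of block scores would produce the same allocation; the variants matter when the Q-side and K-side energies are distributed differently across blocks.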
### C.3 Calibration corpus ablation

To test whether the per-block energy ranking is sensitive to the calibration corpus, we compare four 2048-token calibration sources and re-evaluate Block-GTQ NIAH on Llama-3.1-8B-Instruct (Table [16](https://arxiv.org/html/2606.24033#A3.T16)):

- *WikiText-2 test* (baseline): curated Wikipedia prose; first 2048 tokens of the WikiText-2 test split.
- *PG19*: Project Gutenberg literary text in a comparatively older English register; first 2048 tokens of the Hugging Face `pg19` train split.
- *C4*: heterogeneous web text from Common Crawl, with prose interleaved with boilerplate, URLs, and navigation fragments; first 2048 tokens of the Hugging Face `c4` `en` validation split.
- *Code*: Python source code from the CPython 3.11.0 standard library; first 2048 tokens of a concatenation of `argparse.py`, `json/encoder.py`, and `json/decoder.py` fetched from the cpython GitHub repository.

NIAH evaluates retrieval over long passages of natural English prose (“haystacks”) with short factual “needles” inserted at varying depths. The four corpora span an ordered range of distance from this deployment distribution: WikiText-2 and PG19 are both natural prose (closest); C4 is mostly prose interleaved with web artifacts; code is structurally different in both surface form and distributional statistics (furthest). K2V2 is sensitive to that distance (Table [16](https://arxiv.org/html/2606.24033#A3.T16)): PG19 stays within 0.28 pp of WikiText-2, C4 drops 2.78 pp, and code drops 3.70 pp Overall, which is monotonic in how far the calibration diverges from prose. K3V3 is much less sensitive; all four corpora are within 0.34 pp of one another, 11–16× tighter than at K2V2.

This separation reflects the $4^{-b}$ rate law in Block-GTQ’s allocator objective $\sum_{i}s_{i}\cdot 4^{-b_{i}}$. When an off-domain calibration shifts the per-block energy ranking, the allocator misplaces some bits; the cost of each misplacement (e.g., assigning $b$ where $b+1$ would have been better) is $s_{i}\,(4^{-b}-4^{-(b+1)})$, which at K2V2 (average $b=2$) is $s_{i}(4^{-2}-4^{-3})$ and at K3V3 (average $b=3$) only $s_{i}(4^{-3}-4^{-4})$, a factor of $\sim 4$ smaller per misplaced bit. The same calibration-induced ranking shift therefore translates into a $\sim 4\times$ smaller objective penalty at K3V3, and the observed 11–16× NIAH swing reduction is the downstream manifestation of this rate-law amortization.

Table 16: Calibration corpus ablation, per-task NIAH pass-rate (%) on Llama-3.1-8B-Instruct. $\Delta$ is the change in Overall vs the WT2 baseline.

### C.4 Cross-model PPL and allocation-distance diagnostics
We run Block-GTQ at $N_{\mathrm{cal}}\in\{128,512,2048\}$ on Llama-3.1-8B-Instruct and DeepSeek-R1-Distill-Qwen-7B at both K3V3 and K2V2. For each cell we draw three calibration prefixes from WikiText-2 train at offsets 0, 10,000, and 20,000 tokens (different articles, three near-independent draws). We evaluate using four metrics; (ii)–(iv) compare each perturbed allocation $\mathbf{b}$ against the $N_{\mathrm{cal}}=2048$, seed-0 reference $\mathbf{b}^{\mathrm{ref}}$ over all (layer, KV head, RoPE-block) triples $\mathcal{I}$:

- *(i) Output PPL*: sliding-window perplexity on the full WikiText-2 test set ($C=4096$, $S=512$, $\sim 99$K tokens), reported in main-text Table [3](https://arxiv.org/html/2606.24033#S6.T3). PPL captures robustness at the *output* level but cannot tell whether the allocation itself is stable or whether it moves with small objective cost, hence (ii)–(iv) below.
- *(ii) Hamming distance*, the fraction of slots whose bit value changed:
  $$\mathrm{Hamm}(\mathbf{b},\mathbf{b}^{\mathrm{ref}})=\frac{1}{|\mathcal{I}|}\sum_{(\ell,h,i)\in\mathcal{I}}\mathbf{1}\!\left[b_{\ell,h,i}\neq b^{\mathrm{ref}}_{\ell,h,i}\right].\tag{5}$$
- *(iii) High-bit Jaccard at threshold 4*, measuring the overlap of the slots the allocator protected with $\geq 4$ bits:
  $$\mathrm{HB@4}(\mathbf{b},\mathbf{b}^{\mathrm{ref}})=\frac{\left|\{(\ell,h,i):b_{\ell,h,i}\geq 4\}\cap\{(\ell,h,i):b^{\mathrm{ref}}_{\ell,h,i}\geq 4\}\right|}{\left|\{(\ell,h,i):b_{\ell,h,i}\geq 4\}\cup\{(\ell,h,i):b^{\mathrm{ref}}_{\ell,h,i}\geq 4\}\right|}.\tag{6}$$
- *(iv) Energy-weighted regret*, the cost in the allocator’s own objective with each change weighted by importance $s^{\mathrm{ref}}_{i}$ and bit magnitude $4^{-b}$:
  $$\mathrm{Regret}(\mathbf{b})=\frac{\sum_{(\ell,h,i)\in\mathcal{I}}s^{\mathrm{ref}}_{\ell,h,i}\,\left(4^{-b_{\ell,h,i}}-4^{-b^{\mathrm{ref}}_{\ell,h,i}}\right)}{\sum_{(\ell,h,i)\in\mathcal{I}}s^{\mathrm{ref}}_{\ell,h,i}\,4^{-b^{\mathrm{ref}}_{\ell,h,i}}}.\tag{7}$$

The two non-reference $N=2048$ seed cells differ from the reference only in the random WikiText-2 slice, so their disagreement with the reference defines a within-source noise floor for each metric.
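Metrics (ii)–(iv) reduce to a few array operations once the allocations are flattened over all (layer, KV head, RoPE-block) triples. The sketch below uses hypothetical function names and assumes integer NumPy arrays `b` and `b_ref` plus a float array `s_ref` of the same shape; returning 1.0 for an empty union in the Jaccard index is a convention chosen here, not something the text specifies.

```python
import numpy as np


def hamming(b, b_ref):
    """Eq. (5): fraction of slots whose bit width changed."""
    return float(np.mean(np.asarray(b) != np.asarray(b_ref)))


def high_bit_jaccard(b, b_ref, threshold=4):
    """Eq. (6): Jaccard overlap of the slots given >= threshold bits."""
    hi = np.asarray(b) >= threshold
    hi_ref = np.asarray(b_ref) >= threshold
    union = np.sum(hi | hi_ref)
    if union == 0:
        return 1.0  # assumed convention: two empty high-bit sets agree
    return float(np.sum(hi & hi_ref) / union)


def regret(b, b_ref, s_ref):
    """Eq. (7): relative change in the allocator objective under reference scores."""
    b = np.asarray(b, dtype=float)
    b_ref = np.asarray(b_ref, dtype=float)
    s_ref = np.asarray(s_ref, dtype=float)
    numerator = np.sum(s_ref * (4.0 ** (-b) - 4.0 ** (-b_ref)))
    denominator = np.sum(s_ref * 4.0 ** (-b_ref))
    return float(numerator / denominator)
```

Hamming and HB@4 treat every moved slot equally, while regret weights each move by its score and bit level, which is why an allocation can move visibly on the first two metrics at small objective cost.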
Table 17: Allocation distance against the $N_{\mathrm{cal}}=2048$, seed-0 reference allocation, K3V3. Hamming counts changed bit slots (Eq. [5](https://arxiv.org/html/2606.24033#A3.E5)); HB@4 is the high-bit Jaccard at threshold 4 (Eq. [6](https://arxiv.org/html/2606.24033#A3.E6)); Regret is Eq. [7](https://arxiv.org/html/2606.24033#A3.E7). The 2048-token non-reference seeds define the within-source noise floor.

| Model | $N_{\mathrm{cal}}$ | seed | Hamming | HB@4 | Regret |
|---|---|---|---|---|---|
| Llama-3.1-8B-Instruct (K3V3) | 128 | 0 | 0.148 | 0.730 | +2.99% |
| | 128 | 1 | 0.124 | 0.776 | +2.15% |
| | 128 | 2 | 0.141 | 0.736 | +2.78% |
| | 512 | 0 | 0.077 | 0.845 | +0.80% |
| | 512 | 1 | 0.083 | 0.844 | +0.97% |
| | 512 | 2 | 0.089 | 0.831 | +1.09% |
| | 2048 | 0 | 0.000 | 1.000 | +0.00% |
| | 2048 | 1 | 0.081 | 0.855 | +0.89% |
| | 2048 | 2 | 0.074 | 0.870 | +0.73% |
| DeepSeek-R1-Distill-Qwen-7B (K3V3) | 128 | 0 | 0.141 | 0.827 | +1.89% |
| | 128 | 1 | 0.112 | 0.877 | +1.22% |
| | 128 | 2 | 0.178 | 0.802 | +3.11% |
| | 512 | 0 | 0.065 | 0.926 | +0.45% |
| | 512 | 1 | 0.084 | 0.915 | +0.67% |
| | 512 | 2 | 0.084 | 0.912 | +0.70% |
| | 2048 | 0 | 0.000 | 1.000 | +0.00% |
| | 2048 | 1 | 0.073 | 0.918 | +0.49% |
| | 2048 | 2 | 0.066 | 0.931 | +0.44% |

At K3V3, the within-source noise floor (two non-reference $N=2048$ seeds in Table [17](https://arxiv.org/html/2606.24033#A3.T17)) is Hamming 0.07–0.08, HB@4 0.86–0.93, regret +0.4–0.9% on both models. $N=128$ pushes Hamming to 1.4–2.7× the floor and drops HB@4 by 10–13 pp (about 27% of the high-bit tail reshuffled on Llama), so the allocation visibly moves beyond what calibration randomness alone explains, yet regret stays at 1.2–3.1%. This is the $4^{-b}$ rate law in action: each misplaced bit at $b=3$ costs roughly 4× less than at $b=2$, so the same allocator movement amortizes into $\leq 3.1\%$ regret at K3V3.
## Appendix D Downstream Evaluation Details
### D.1 Long-Context Tasks
#### D.1.1 NIAH Protocol
The main text shows the Llama-3.1-8B-Instruct single-needle heatmap (Figure [5](https://arxiv.org/html/2606.24033#S6.F5)) and the combined Llama / Qwen multi-task scores aggregated across six NIAH variants (Table [4](https://arxiv.org/html/2606.24033#S6.T4)). This appendix adds the matching Qwen2.5-7B-Instruct heatmap (Figure [9](https://arxiv.org/html/2606.24033#A4.F9)) and details the protocol and subtask definitions.
Figure 9: NIAH single-needle retrieval on Qwen2.5-7B-Instruct. Pass rate is shown over context length (4K–128K) and needle depth (0%–100%), averaged over three trials per cell. The two rows use the same method layout as Figure [5](https://arxiv.org/html/2606.24033#S6.F5): fp16, KIVI-ScaleOnly (Appendix [B.4](https://arxiv.org/html/2606.24033#A2.SS4.SSS0.Px6)), TQ-MSE, and Block-GTQ. Top row: K3V3. Bottom row: K2V2. The fp16 panel is identical across budgets for a fixed model and serves as the retrieval ceiling.
##### Calibration.
Block-GTQ is calibrated on the first 2048 tokens of the WikiText-2 test split, concatenated as a single contiguous raw-text stream with article headings stripped. TQ-MSE is data-independent; KIVI-ScaleOnly is defined in Appendix [B.4](https://arxiv.org/html/2606.24033#A2.SS4.SSS0.Px6).
##### Subtasks\.
The six NIAH variants in Table [4](https://arxiv.org/html/2606.24033#S6.T4) stress different facets of long-context retrieval. Across all six, the haystack is filler text into which one or more synthetic key–value *needles* are inserted; the model receives the haystack plus a query and must return the matching value(s). Variants differ in needle count, distractor structure, and query structure.
- `single`: one key–value needle is inserted at a given depth and the model is queried for its value. Tests basic retrieval: the model must locate the needle by key and return the corresponding value.
- `distract.`: one target needle is inserted alongside several distractor key–value pairs with similar formatting (but unrelated to the query). Tests discrimination against same-format distractors: the model must not be misled by lookalike but incorrect needles.
- `multi`: several distinct needles are inserted in the haystack, and the model is queried for one specific value. Tests selective retrieval when multiple plausible candidates exist.
- `m-key`: three distinct key–value needles are placed close together in the haystack, and the model is queried for each key in turn. Tests fine-grained key discrimination among nearby needles: the model must not conflate adjacent key–value pairs.
- `m-value`: several values are bound to a single entity, and the query requires returning all of them. Tests recall completeness: partial answers are penalized.
- `m-query`: several distinct queries are run against a haystack holding multiple needles. Tests robustness across multi-fact recall; the score is averaged over all queries.
The first three tasks are scored 0/1 (the model either returns the correct value or not); the last three (`m-key`, `m-value`, `m-query`) are scored as the fraction of correct answers among multiple expected responses.
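The two scoring modes just described can be sketched in a few lines of Python. This is a minimal illustration only: it assumes that a value counts as returned when it appears as a substring of the model output, which the protocol above does not specify, and the function names are hypothetical.

```python
from typing import Sequence


def score_binary(prediction: str, expected: str) -> float:
    """0/1 score for single, distract., and multi (substring match is an assumption)."""
    return 1.0 if expected in prediction else 0.0


def score_fraction(prediction: str, expected_values: Sequence[str]) -> float:
    """Fraction of expected answers found, for m-key, m-value, and m-query."""
    if not expected_values:
        raise ValueError("at least one expected value is required")
    hits = sum(value in prediction for value in expected_values)
    return hits / len(expected_values)


# One needle: either found or not.
print(score_binary("The magic number is 7421.", "7421"))

# Three expected values, two of which are returned.
print(score_fraction("Values: 11, 22", ["11", "22", "33"]))
```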
##### Sampling\.
Each (task, context length, depth) cell averages three haystack samples (random filler text, fixed needles) over six context lengths (4K–128K) and eleven needle depths $\{0\%, 10\%, \ldots, 100\%\}$. All methods and bit budgets share the same needle set, so cross-method and cross-budget comparisons are paired on identical needle facts.
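To make the size of this evaluation grid concrete, the sketch below enumerates it in Python. The six context lengths are an assumption (a doubling schedule from 4K to 128K gives exactly six values; the text only states the range and the count), and the variable names are hypothetical.

```python
from itertools import product

# Assumed doubling schedule; the protocol only states "six lengths, 4K-128K".
CONTEXT_LENGTHS_K = [4, 8, 16, 32, 64, 128]
NEEDLE_DEPTHS_PCT = [10 * i for i in range(11)]  # 0%, 10%, ..., 100%
SAMPLES_PER_CELL = 3
NUM_TASKS = 6

cells = list(product(CONTEXT_LENGTHS_K, NEEDLE_DEPTHS_PCT))
cells_per_task = len(cells)
samples_per_task = cells_per_task * SAMPLES_PER_CELL
samples_total = samples_per_task * NUM_TASKS

print(cells_per_task, samples_per_task, samples_total)
```

Because every method and budget reuses the same needle set, the per-cell scores can be compared pairwise rather than only in aggregate.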
#### D.1.2 LongBench-EN Protocol
The LongBench-EN table in Section [6.3.1](https://arxiv.org/html/2606.24033#S6.SS3.SSS1.Px2) uses Llama-3.1-8B-Instruct on eight subtasks spanning single-document QA, multi-document QA, summarization, few-shot classification, synthetic retrieval, and code completion.
##### Calibration\.
Block-GTQ uses the same calibration as for NIAH: the first 2048 tokens of the WikiText-2 test split. TQ-MSE is data-independent; KIVI-ScaleOnly is defined in Appendix [B.4](https://arxiv.org/html/2606.24033#A2.SS4.SSS0.Px6).
##### Subtasks\.
Each subtask is listed below with its column abbreviation, scoring metric, and output-token cap. Metrics follow LongBench [[1](https://arxiv.org/html/2606.24033#bib.bib48)]: *QA-F1* is token-level F1 between predicted and reference answers; *ROUGE-L* is LCS-based ROUGE for summarization; *classification score* and *retrieval score* are answer accuracy; *edit similarity* is edit-distance-based similarity for code. The *output-token cap* is the maximum number of tokens the model may generate per example, set by LongBench per task.
- `qasper` (Qasp; QA-F1, 128-token cap): single-document QA over NLP research papers (arXiv NLP papers as input). Questions require extracting specific factual details from a paper-length input, often spanning multiple sections (methodology, results, related work).
- `multifieldqa_en` (MFQA; QA-F1, 64-token cap): single-document QA over English-language documents from diverse domains (legal, government, encyclopedic, etc.). Questions require locating specific information in a long structured document.
- `hotpotqa` (HP; QA-F1, 32-token cap): multi-document QA from HotpotQA, requiring 2-hop reasoning across multiple Wikipedia paragraphs. The model must locate facts in different paragraphs and chain them to produce the answer.
- `2wikimqa` (2W; QA-F1, 32-token cap): multi-document QA from 2WikiMultiHopQA, with multi-hop bridges between Wikipedia articles; similar to `hotpotqa` but with explicit entity-bridge reasoning chains.
- `gov_report` (Gov; ROUGE-L, 512-token cap): abstractive summarization of long U.S. government reports (often several thousand to tens of thousands of tokens). The model must produce a faithful compressed summary covering key findings.
- `trec` (TREC; classification score, 64-token cap): few-shot question-type classification using the TREC label set. The model sees many in-context exemplars and must classify a test question into one of 50 fine-grained categories (e.g., `ABBR:abbreviation`, `NUM:date`).
- `passage_retrieval_en` (Pass; retrieval score, 32-token cap): synthetic retrieval. Given a paraphrased question and a set of candidate Wikipedia passages, the model must identify which passage contains the answer by outputting its index.
- `lcc` (LCC; edit similarity, 64-token cap): line-level code completion over long source files (often >10K tokens). The model sees a file with the last line removed and must reproduce that line, requiring understanding of surrounding code structure.
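The QA-F1 metric used by four of the subtasks above can be sketched as follows. This is a simplified illustration, not the LongBench scorer: it assumes lowercasing plus whitespace tokenization only, whereas an official scorer may apply further answer normalization. The function name `qa_f1` is hypothetical.

```python
from collections import Counter


def qa_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference answer (simplified)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Prediction has one extra token: precision 2/3, recall 1.
print(qa_f1("the cat sat", "the cat"))
```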
##### Inference\.
Inputs are middle-truncated to at most 31,500 tokens and decoded greedily, with per-task output caps as listed above. The Avg column in Table [5](https://arxiv.org/html/2606.24033#S6.T5) is the unweighted mean over the eight subtasks.
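Middle truncation keeps the beginning and end of an over-long input and drops the middle. The sketch below shows one way to do this on a list of token ids; the even head/tail split is an assumption, since the text only states the 31,500-token limit, and `middle_truncate` is a hypothetical helper name.

```python
from typing import List


def middle_truncate(tokens: List[int], max_len: int) -> List[int]:
    """Keep the first and last parts of `tokens` so the result has at most max_len ids."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if len(tokens) <= max_len:
        return list(tokens)
    head = max_len // 2          # tokens kept from the start
    tail = max_len - head        # tokens kept from the end (always >= 1 here)
    return tokens[:head] + tokens[-tail:]


# A 40,000-token input is cut to the 31,500-token limit used above.
long_input = list(range(40_000))
truncated = middle_truncate(long_input, 31_500)
print(len(truncated), truncated[0], truncated[-1])
```

Keeping both ends preserves the task instruction (usually at the start) and the question (usually at the end), which is why this scheme is preferred over cutting the tail.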
### D.2 Reasoning Tasks (AIME)
This appendix is organised in three parts: the AIME protocol, the buffer configurations of the two regimes, and the per-method calibration recipes that produced the numbers in Section [6.3.2](https://arxiv.org/html/2606.24033#S6.SS3.SSS2).
##### Protocol\.
All AIME runs use the K3V2 cache budget (K at 3 bits/dim, V at 2 bits/dim). Generation is stochastic at temperature 0.6 and top-$p$ 0.95, with a 32,768-token output cap. For each problem we draw eight samples and report pass@1 (avg@8). PM-KVQ is run through its official code ([https://github.com/thu-nics/PM-KVQ](https://github.com/thu-nics/PM-KVQ)). KIVI is run with its official quantization scheme ([https://github.com/jy-yuan/KIVI](https://github.com/jy-yuan/KIVI)) in the protected regime, and as the KIVI-ScaleOnly variant (Appendix [B.4](https://arxiv.org/html/2606.24033#A2.SS4.SSS0.Px6)) in the no-buffer regime. The upstream packed KIVI kernel requires bit-width $\in \{2, 4, 8\}$; at the 3-bit budgets used here we substitute a bit-exact round-trip that reproduces KIVI's quant–dequant numerics.
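The pass@1 (avg@8) figure can be read as the per-problem fraction of correct samples, averaged over problems. A minimal Python sketch under that reading follows; the function name and the toy correctness matrix are hypothetical.

```python
from typing import List


def pass_at_1_avg_k(correct: List[List[bool]]) -> float:
    """Mean over problems of the fraction of correct samples, in percent."""
    if not correct or any(len(samples) == 0 for samples in correct):
        raise ValueError("every problem needs at least one sample")
    per_problem = [sum(samples) / len(samples) for samples in correct]
    return 100.0 * sum(per_problem) / len(per_problem)


# Three toy problems with eight samples each: always right, never right, half right.
toy = [
    [True] * 8,
    [False] * 8,
    [True] * 4 + [False] * 4,
]
print(pass_at_1_avg_k(toy))
```

With eight samples per problem the score moves in steps of one eighth of a problem, so small differences between methods correspond to only a few sampled solutions.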
##### Buffer configurations\.
During decoding the KV cache is conceptually laid out as $[\,\text{sink (fp16)} \mid \text{compressed} \mid \text{recent (fp16)}\,]$; the two regimes differ only in sink and recent-window sizes.
- In the *protected-buffer* regime we keep the first 4 tokens as fp16 sink and the most recent 128 tokens as fp16 recent. The 128-token recent span matches PM-KVQ's protected configuration [[21](https://arxiv.org/html/2606.24033#bib.bib24)]; the 4-token sink follows the attention-sink convention of Xiao et al. [[37](https://arxiv.org/html/2606.24033#bib.bib10)], overriding PM-KVQ's native default of sink $=1$ so that the same protected allowance is applied uniformly across methods. TQ-MSE and Block-GTQ, which are buffer-free by design, run under this 4/128 allowance. KIVI uses its default path: per-$(T{=}32, \text{channel})$ asymmetric quantization for K and per-$(\text{token}, D{=}32)$ asymmetric quantization for V. KIVI's native 128-token fp16 residual coincides with the shared recent window, and its 32-token grouping is the K quantization group size along the token axis, not an additional fp16 buffer.
- In the *no-buffer stress* regime we set both sink and recent windows to 0, so every attended token is served from the compressed cache. KIVI is replaced by the KIVI-ScaleOnly variant: K uses *per-channel* quantization with a 32-token rolling buffer of fp32 statistics for scale refresh (the statistics never enter the attention path), and V uses TQ-MSE. PM-KVQ is run with neither sink nor sliding window, so no token is kept at high precision. Each K/V is quantized on arrival with per-group (128-channel) asymmetric quantization; following PM-KVQ's progressive scheme, the cache enters at 16-bit and a layer's entire cache is halved in bit-width ($16 \to 8 \to 4 \to 2$) whenever it exceeds its calibrated per-layer *memory* budget, so the effective precision decreases as the sequence grows. TQ-MSE and Block-GTQ already operate without any buffer.
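The sink / compressed / recent layout in the two regimes above can be sketched as a partition of token positions. This is an illustrative helper only, not the paper's cache implementation; `partition_cache` and the half-open `(start, end)` convention are assumptions.

```python
from typing import Dict, Tuple


def partition_cache(seq_len: int, sink: int, recent: int) -> Dict[str, Tuple[int, int]]:
    """Split positions [0, seq_len) into sink, compressed, and recent spans."""
    if seq_len < 0 or sink < 0 or recent < 0:
        raise ValueError("sizes must be non-negative")
    sink_end = min(sink, seq_len)
    # The recent window never overlaps the sink on short sequences.
    recent_start = max(sink_end, seq_len - recent)
    return {
        "sink": (0, sink_end),                  # kept in fp16
        "compressed": (sink_end, recent_start), # served from the quantized cache
        "recent": (recent_start, seq_len),      # kept in fp16
    }


# Protected-buffer regime: 4-token sink, 128-token recent window.
print(partition_cache(1000, sink=4, recent=128))

# No-buffer stress regime: every token is served from the compressed cache.
print(partition_cache(1000, sink=0, recent=0))
```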
##### Calibration\.
- Block-GTQ: calibrated on the first 2048 tokens of the WikiText-2 test split; see Appendix [C](https://arxiv.org/html/2606.24033#A3) for the energy score and bit-allocation procedure.
- PM-KVQ [[21](https://arxiv.org/html/2606.24033#bib.bib24)]: a progressive mixed-precision quantizer whose calibration produces two offline artifacts from a single PI calibration set: (i) `kv_budgets`, a per-layer memory budget (shared by K and V), obtained by integer programming over loss-gradient sensitivity with bit choices $\{4, 2\}$ and a 2.5 b/d average target (matching our K3V2 average); (ii) `rep_scales`, a SmoothQuant-style per-channel pre-scaling folded into `k_proj`/`q_proj`, obtained via a three-stage offline search and *disabled in all our PM-KVQ runs*. The PI calibration set is 512 sequences of 2048 tokens (about 1M tokens, more than two orders of magnitude beyond the 2048 tokens Block-GTQ uses) randomly sampled from the WikiText-2 train split, with position-ids stretched by stride 4 to an effective length of 8192.
- KIVI [[24](https://arxiv.org/html/2606.24033#bib.bib11)]: tuning-free. Per-channel K scales and per-token V scales are computed online from running statistics, with no calibration data. We use KIVI as-is in the protected regime. In the no-buffer regime we substitute KIVI-ScaleOnly. Since the very first tokens are quantized before any rolling statistics exist, we seed the estimator's per-channel K scale from a single forward pass over the first 64 tokens of the same WikiText-2 prefix used by Block-GTQ; this seed only governs the first 32-token group of each sequence, after which the scale is fully refreshed online from the 32-token rolling fp32-statistics buffer.
- TQ-MSE [[40](https://arxiv.org/html/2606.24033#bib.bib16)]: data-independent.
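The calibration-size comparison between PM-KVQ and Block-GTQ in the list above reduces to a few multiplications, reproduced here as a quick check. All numbers come from the text; only the variable names are ours.

```python
import math

PMKVQ_SEQUENCES = 512        # sequences in the PI calibration set
PMKVQ_SEQ_LEN = 2048         # tokens per sequence
POSITION_STRIDE = 4          # position-id stretch factor
BLOCK_GTQ_TOKENS = 2048      # Block-GTQ calibration tokens

pmkvq_tokens = PMKVQ_SEQUENCES * PMKVQ_SEQ_LEN
ratio = pmkvq_tokens / BLOCK_GTQ_TOKENS
orders_of_magnitude = math.log10(ratio)
effective_length = PMKVQ_SEQ_LEN * POSITION_STRIDE

print(pmkvq_tokens)                    # total PM-KVQ calibration tokens
print(ratio, orders_of_magnitude)      # how much larger than Block-GTQ's set
print(effective_length)                # stretched position range
```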
## Appendix E Deployment Protocol and Extended Results
This appendix gives the measurement protocol and numerical details behind Section [6.4](https://arxiv.org/html/2606.24033#S6.SS4). The deployment benchmarks run Qwen2.5-3B-Instruct on a single H800 80GB GPU and compare Block-GTQ and uniform TQ-MSE against an fp16 FlashAttention-2 (FA-2) baseline. Block-GTQ metadata uses the first 64 tokens of the WikiText-2 train split as its calibration prefix; decode latency is the median per-step time over 20 timed autoregressive steps on $T$ consecutive WikiText-2 input tokens, and peak memory includes model weights, the resident cache, and transient activations/buffers. All three methods run the same Qwen2.5-3B-Instruct in fp16, with identical fp16 weights and fp16 attention compute, so the only quantity that varies across methods is the KV-cache representation: a dense fp16 cache for the baseline versus the packed low-bit codes of uniform TQ-MSE and Block-GTQ, both quantized from the same fp16 keys and values emitted by the model's projections. The comparison is thus dtype-matched: the reported latency, memory, and perplexity differences are attributable to the cache representation alone, not to any change in weight or attention precision. The public Qwen2.5-3B-Instruct checkpoint is released in bf16; we cast it to fp16 uniformly for every method, including the baseline, so the dtype choice advantages none of them.
### E.1 Allocated Footprint Accounting
The deployed cache footprint should be read as an allocated tensor footprint, not as the ideal number of information bits. For $D=128$, a pure 3-bit code-only K+V cache would use $48+48=96$ bytes per token and KV head, giving an ideal $512/96 = 5.33\times$ compression relative to fp16. The deployed K3V3 path allocates $\approx 157$ bytes per token and KV head (Table [18](https://arxiv.org/html/2606.24033#A5.T18)); the $\approx 61$-byte overhead is dominated by two sources: (i) per-coordinate bit widths are rounded up to nibble (4-bit) or byte (8-bit) storage so that GPU decoding stays bit-shift-free, and (ii) each same-rate group carries an fp16 normalization scalar. The overhead is a physical layout cost of serving from a packed cache, not a change to the Block-GTQ allocation objective.
Table 18: Allocated footprint per token and KV head. Qwen2.5-3B-Instruct K3V3 deployment, in bytes.
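The ideal-footprint arithmetic in this subsection can be reproduced directly. The sketch below only restates the numbers given in the text ($D=128$, 3 bits per coordinate, roughly 157 allocated bytes); the helper name `code_bytes` is hypothetical.

```python
HEAD_DIM = 128            # D, coordinates per KV head
FP16_BYTES = 2            # bytes per fp16 coordinate
ALLOCATED_BYTES = 157     # approximate deployed K3V3 footprint from the text


def code_bytes(dim: int, bits: int) -> float:
    """Bytes needed to store `dim` coordinates at `bits` bits each, codes only."""
    return dim * bits / 8


fp16_kv_bytes = 2 * HEAD_DIM * FP16_BYTES        # K and V in fp16
ideal_kv_bytes = 2 * code_bytes(HEAD_DIM, 3)     # K and V at 3 bits each
ideal_ratio = fp16_kv_bytes / ideal_kv_bytes
layout_overhead = ALLOCATED_BYTES - ideal_kv_bytes

print(fp16_kv_bytes, ideal_kv_bytes)
print(round(ideal_ratio, 2), layout_overhead)
```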
### E.2 Decode Latency and Memory: Three-Way Comparison
Tables [19](https://arxiv.org/html/2606.24033#A5.T19) and [20](https://arxiv.org/html/2606.24033#A5.T20) give the numerical data behind Fig. [6](https://arxiv.org/html/2606.24033#S6.F6). The two compressed paths share the same packed-cache interface: uniform TQ-MSE applies a uniform 3-bit K budget, while Block-GTQ assigns K bits by RoPE block. Table [19](https://arxiv.org/html/2606.24033#A5.T19) reports per-step decode-latency statistics, prefill time, and the median-latency decode speedup over fp16 FA-2 across five context lengths $T$; Table [20](https://arxiv.org/html/2606.24033#A5.T20) reports KV-cache footprint, KV compression ratios, and total/non-weight peak GPU memory at the same $T$.
Table 19: Decode latency and prefill time. Qwen2.5-3B-Instruct K3V3 deployment on a single H800, comparing fp16 FA-2, uniform TQ-MSE, and Block-GTQ. Med., Mean, p5, p95 are the median, mean, and 5th/95th percentiles of per-step decode latency over 20 timed autoregressive steps. Prefill is the wall-clock time to construct the cache for the full prompt: fp16 prefill is FA-2-backed; the compressed paths build the packed cache. Med. speedup vs. fp16 FA-2 is the ratio of fp16 FA-2's median decode latency to the method's median decode latency.

| $T$ | Method | Med. (ms/step) | Mean (ms/step) | p5 (ms) | p95 (ms) | Prefill (s) | Med. speedup vs. fp16 FA-2 |
|---|---|---|---|---|---|---|---|
| 16K | FP16 FA-2 | 17.84 | 17.86 | 17.76 | 18.01 | 0.37 | 1.00$\times$ |
| | TQ-MSE | 45.47 | 45.46 | 45.32 | 45.61 | 0.87 | 0.39$\times$ |
| | Block-GTQ | 45.86 | 45.83 | 45.66 | 46.01 | 1.33 | 0.39$\times$ |
| 64K | FP16 FA-2 | 36.15 | 36.15 | 36.12 | 36.17 | 2.84 | 1.00$\times$ |
| | TQ-MSE | 45.58 | 45.59 | 45.41 | 45.68 | 9.88 | 0.79$\times$ |
| | Block-GTQ | 45.57 | 45.56 | 45.44 | 45.63 | 16.62 | 0.79$\times$ |
| 128K | FP16 FA-2 | 70.96 | 70.95 | 70.85 | 71.03 | 9.09 | 1.00$\times$ |
| | TQ-MSE | 45.53 | 45.63 | 45.39 | 45.97 | 36.41 | 1.56$\times$ |
| | Block-GTQ | 52.95 | 60.52 | 52.69 | 121.72 | 63.53 | 1.34$\times$ |
| 256K | FP16 FA-2 | OOM | | | | | |
| | TQ-MSE | 60.95 | 60.97 | 60.76 | 61.21 | 141.50 | n/a |
| | Block-GTQ | 82.24 | 82.27 | 81.97 | 82.61 | 250.17 | n/a |
| 512K | FP16 FA-2 | OOM | | | | | |
| | TQ-MSE | 101.18 | 101.22 | 101.02 | 101.45 | 558.75 | n/a |
| | Block-GTQ | 140.27 | 140.40 | 140.04 | 140.86 | 997.19 | n/a |

At short context ($T \leq 64$K), in-kernel decoding adds a per-step overhead that outweighs the KV-bandwidth savings, so fp16 FA-2 is fastest. As $T$ grows, KV-bandwidth dominates and the compressed paths overtake fp16 at $T=128$K in median per-step decode latency: Block-GTQ runs $1.34\times$ faster than fp16 FA-2, uniform TQ-MSE $1.56\times$. Beyond this, fp16 OOMs because peak total memory exceeds 80 GB.
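The speedup column of Table 19 follows from the median latencies alone. The short Python check below recomputes it from the table values for the three context lengths where fp16 FA-2 fits in memory; the dictionary layout and names are ours.

```python
# Median per-step decode latency in ms, copied from Table 19.
MEDIAN_MS = {
    "16K": {"FP16 FA-2": 17.84, "TQ-MSE": 45.47, "Block-GTQ": 45.86},
    "64K": {"FP16 FA-2": 36.15, "TQ-MSE": 45.58, "Block-GTQ": 45.57},
    "128K": {"FP16 FA-2": 70.96, "TQ-MSE": 45.53, "Block-GTQ": 52.95},
}


def median_speedup(context: str, method: str) -> float:
    """fp16 FA-2 median latency divided by the method's median latency."""
    row = MEDIAN_MS[context]
    return row["FP16 FA-2"] / row[method]


for context in MEDIAN_MS:
    for method in ("TQ-MSE", "Block-GTQ"):
        print(context, method, round(median_speedup(context, method), 2))
```

A value below 1 means the compressed path is slower than fp16 FA-2 at that context length; the crossover to values above 1 occurs between 64K and 128K.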
Uniform TQ-MSE is as fast as or faster than Block-GTQ at every tested context length in our current implementation (the two medians are tied at $T=64$K) because its K layout is simpler, but this speed advantage comes at a steep quality cost: as Appendix [E.3](https://arxiv.org/html/2606.24033#A5.SS3) shows, TQ-MSE's PPL collapses across all tested context lengths while Block-GTQ stays close to fp16, identifying Block-GTQ as the preferred operating point.
Table 20: KV footprint and peak GPU memory. Same Qwen2.5-3B-Instruct K3V3 deployment as Table [19](https://arxiv.org/html/2606.24033#A5.T19). KV comp.: ratio of fp16's KV cache size to this method's. Peak total: total GPU memory (model weights + KV cache + transient activations/buffers). Peak minus weights: peak total minus the 6.17 GB Qwen2.5-3B-Instruct weights, exposing non-weight memory.
Block-GTQ uses slightly more resident KV memory than uniform TQ-MSE, because mixed-rate K allocation carries additional per-segment metadata, but both compressed paths reduce the KV footprint by roughly $3.4\times$ at the K3V3 budget relative to fp16. Figure [6](https://arxiv.org/html/2606.24033#S6.F6)(c) shows that the peak-memory curves for the two compressed paths nearly overlap. From Table [20](https://arxiv.org/html/2606.24033#A5.T20), their entire peak-memory difference comes from the KV-cache gap (at most 0.98 GB at $T=512$K); all other peak components (weights and transient activations) are identical between the two.
### E.3 Perplexity at Long Context
Under the same deployment setting (Qwen2.5-3B-Instruct, K3V3), for each context length $T$, we feed $T$ consecutive WikiText-2 tokens into the model as input and score its perplexity on the next 1000 tokens (the same 1000 positions for all three methods). A larger $n_{\mathrm{ppl}}$ reduces token-averaging noise. Calibration (tokens $[0, 64)$) and PPL evaluation (tokens $[T, T+1000)$, with $T \geq 4096$) read from the same WikiText-2 train stream but use non-overlapping windows, so the setup is leakage-free.
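The perplexity measurement described above amounts to exponentiating the mean negative log-likelihood over a fixed window of positions. The sketch below assumes per-token natural-log probabilities are already available as a list (the model call itself is omitted), and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def window_perplexity(token_logprobs: Sequence[float], start: int, n_ppl: int = 1000) -> float:
    """exp(mean negative log-likelihood) over positions [start, start + n_ppl)."""
    window = token_logprobs[start:start + n_ppl]
    if len(window) != n_ppl:
        raise ValueError("not enough tokens after the context to score n_ppl positions")
    return math.exp(-sum(window) / n_ppl)


def windows_disjoint(a: Tuple[int, int], b: Tuple[int, int]) -> bool:
    """True when two half-open token windows [lo, hi) do not overlap."""
    return a[1] <= b[0] or b[1] <= a[0]


# Toy stream in which every token has probability 1/8, so PPL is about 8.
T = 4096
toy_logprobs = [math.log(1 / 8)] * (T + 1000)
print(window_perplexity(toy_logprobs, start=T))

# Calibration window [0, 64) versus evaluation window [T, T + 1000).
print(windows_disjoint((0, 64), (T, T + 1000)))
```

Scoring the same positions for all three methods keeps the comparison paired, so differences in PPL come from the cache representation rather than from the evaluated text.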
Table 21: Long-context perplexity. Same Qwen2.5-3B-Instruct K3V3 deployment as Table [19](https://arxiv.org/html/2606.24033#A5.T19). PPL on the 1000 tokens following $T$ consecutive WikiText-2 train input tokens; all three methods score the same 1000 positions. $\Delta$: Block-GTQ's PPL increase relative to fp16. TQ-MSE/Block-GTQ: ratio of the two PPLs.
Qwen2.5-3B-Instruct supports context up to 128K with YaRN extension. For $T \leq 128$K, Block-GTQ's PPL stays within 1.6%–3.6% of fp16 ($\Delta$ column). At $T=256$K and 512K, fp16 runs out of memory, and both packed paths' PPL values rise sharply; this reflects the model itself failing to extrapolate beyond its supported context, not cache compression. By contrast, TQ-MSE's PPL is $14\times$ to $36{,}879\times$ higher than Block-GTQ's at every $T$ (TQ-MSE/Block-GTQ column), including $T=4$K well within the supported range, which is a failure of uniform K allocation, not of context length. For instance, at $T=16$K, TQ-MSE's PPL collapses to 299,419, while Block-GTQ's PPL is 8.12, close to fp16's 7.84.