KVarN: Native vLLM backend for KV-cache quantization by Huawei
Summary
Huawei CSL releases KVarN, a native vLLM attention backend for KV-cache quantization that delivers 3-5x more KV-cache capacity and up to ~1.3x the throughput of FP16, with no calibration required. It claims up to ~2.4x the throughput of TurboQuant while maintaining FP16-level accuracy on models like Qwen3-32B.
huawei-csl/KVarN
Source: https://github.com/huawei-csl/KVarN
⚡️ Built for agentic and long-context workloads.
💡 KVarN delivers 3-5x more KV-cache capacity and up to ~1.3x the throughput of FP16, so you fit far longer contexts and serve more concurrent requests, with FP16-level accuracy.
🔌 Calibration-free, plug-and-play with vLLM. A native vLLM attention backend: add one flag, no model changes, no calibration.
🥊 Up to ~2.4× TurboQuant throughput, same capacity, higher accuracy.
Why KVarN (Variance Normalized KV-Cache)?
kvarn /kvɑːɳ/ · noun (Swedish)
- A grinding apparatus used to reduce substances into smaller particles or powder, especially grains, seeds, spices, coffee beans, KV-caches.
KV-cache quantization usually comes with a catch. As the vLLM TurboQuant blog shows, existing methods buy extra KV-cache capacity but give up throughput (TurboQuant reports 40 to 52% lower throughput for 2.3-3.7x capacity), and aggressive low-bit quantization also tends to cost accuracy. Losing both speed and quality is the main reason KV-cache quantization is rarely turned on in production.
KVarN is built to keep both. On Qwen3-32B (AIME25, 16K-context burst, TP=2), it matches FP16 accuracy and beats FP16 throughput while delivering ~4× the KV-cache capacity.
That places KVarN in the upper-right corner of the accuracy-throughput trade-off, which the methods in the blog can't reach: FP16-level accuracy, FP16-or-better throughput, and several times the context.
Quickstart
KVarN ships as a vLLM fork. Install it like vLLM, then select the KVarN KV-cache dtype.
```bash
# 1. Clone
git clone https://github.com/huawei-csl/KVarN.git
cd KVarN

# 2. Install (uses the upstream precompiled wheel; KVarN kernels are Triton, JIT-compiled at runtime)
VLLM_USE_PRECOMPILED=1 pip install -e .
```

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-32B",
    dtype="float16",                   # KVarN runs in float16
    kv_cache_dtype="kvarn_k4v2_g128",  # enable KVarN
    block_size=128,                    # KVarN tile size
)

print(llm.generate("Explain KV-cache quantization in one sentence.",
                   SamplingParams(max_tokens=64))[0].outputs[0].text)
```
Serving works the same way:
```bash
vllm serve Qwen/Qwen3-32B --dtype float16 --kv-cache-dtype kvarn_k4v2_g128 --block-size 128
```
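Because KV-cache quantization happens inside the server, a client should not need any KVarN-specific changes. As a minimal sketch, assuming the server started by the command above listens on vLLM's default OpenAI-compatible endpoint at `http://localhost:8000`, a request could be built with the standard library alone. The helper names `build_completion_request` and `complete` are hypothetical and exist only for this example.

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:8000",
                             model="Qwen/Qwen3-32B", max_tokens=64):
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the first completion's text.

    Requires a running server; base_url and model are assumptions that
    must match how the server was actually launched.
    """
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["choices"][0]["text"]
```

With a server running, `complete("Explain KV-cache quantization in one sentence.")` would return the generated text.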
Note: KVarN runs in float16 compute. The tile / page size is currently fixed at 128 (one vLLM block = one KVarN tile); other page sizes are coming soon.
Tip (capacity): KVarN realizes its full KV-cache capacity when there is room to amortize a small fixed decode workspace. On multi-GPU or generous --gpu-memory-utilization setups this is automatic. On a tight single-GPU budget, vLLM's CUDA-graph memory profiler can over-reserve and shrink the KV pool; set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 (and/or raise --gpu-memory-utilization) to recover the full capacity.
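For the offline Python API, the same tip could be applied as sketched below. This is an illustration, not part of KVarN: the helper `kvarn_engine_kwargs` and its `tight_single_gpu` flag are hypothetical, and the value 0.95 is only an example. The environment variable name comes from the tip above, and `gpu_memory_utilization` is the Python counterpart of the `--gpu-memory-utilization` flag.

```python
import os


def kvarn_engine_kwargs(model, gpu_memory_utilization=0.95, tight_single_gpu=False):
    """Return keyword arguments for vllm.LLM with the KVarN preset enabled.

    Hypothetical helper. On a tight single-GPU budget it also disables the
    CUDA-graph memory estimate, as suggested in the capacity tip.
    """
    if tight_single_gpu:
        # Set this before the engine is created so the profiler sees it.
        os.environ["VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS"] = "0"
    return dict(
        model=model,
        dtype="float16",                   # KVarN runs in float16
        kv_cache_dtype="kvarn_k4v2_g128",  # enable KVarN
        block_size=128,                    # KVarN tile size
        gpu_memory_utilization=gpu_memory_utilization,
    )


# Usage (requires the KVarN fork to be installed):
# from vllm import LLM
# llm = LLM(**kvarn_engine_kwargs("Qwen/Qwen3-32B", tight_single_gpu=True))
```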
How does KVarN work?
KVarN quantizes the KV cache one fixed-size token tile at a time, walking each tile through four stages:
- Cache: the raw fp16 KV tile (channels × tokens), straight from attention.
- Rotated Cache: a Hadamard rotation along the channel dimension mixes channels so that per-channel outliers are spread out, making the tile easier to quantize. The rotation is orthonormal, so attention scores are preserved.
- Normalized Cache: iterative variance normalization (Sinkhorn-like) alternates column- and row-wise standard-deviation normalization in log space, equalizing variance across the tile and shrinking quantization error before any rounding happens.
- Quantized Cache: asymmetric round-to-nearest at low bit-width, with the scales folded back in at read time (keys per channel, values per token).
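The four stages can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the KVarN Triton kernels. The function names, the iteration count, and the reading of "keys per channel, values per token" as the axis along which each quantization grid is computed are all assumptions made for this example.

```python
import numpy as np


def hadamard(n):
    """Orthonormal Sylvester Hadamard matrix; n must be a power of two."""
    assert n > 0 and n & (n - 1) == 0
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


def variance_normalize(x, iters=4, eps=1e-8):
    """Alternate column- and row-wise std normalization.

    Scales are accumulated in log space so that x == y * row_scale * col_scale.
    """
    y = x.copy()
    log_row = np.zeros((x.shape[0], 1))
    log_col = np.zeros((1, x.shape[1]))
    for _ in range(iters):
        col_std = y.std(axis=0, keepdims=True) + eps
        y = y / col_std
        log_col += np.log(col_std)
        row_std = y.std(axis=1, keepdims=True) + eps
        y = y / row_std
        log_row += np.log(row_std)
    return y, np.exp(log_row), np.exp(log_col)


def quantize_rtn(y, bits, axis):
    """Asymmetric round-to-nearest with one (scale, offset) per slice along `axis`."""
    levels = 2 ** bits - 1
    lo = y.min(axis=axis, keepdims=True)
    hi = y.max(axis=axis, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / levels
    q = np.clip(np.round((y - lo) / scale), 0, levels).astype(np.uint8)
    return q, scale, lo


def compress_tile(tile, bits, axis):
    """tile is (channels, tokens): rotate, normalize, then quantize."""
    rotated = hadamard(tile.shape[0]) @ tile  # mix along the channel dimension
    y, row_s, col_s = variance_normalize(rotated)
    q, scale, lo = quantize_rtn(y, bits, axis)
    return q, scale, lo, row_s, col_s


def decompress_tile(q, scale, lo, row_s, col_s):
    """Dequantize, fold the normalization scales back in, undo the rotation."""
    rotated_hat = (q * scale + lo) * row_s * col_s
    return hadamard(q.shape[0]).T @ rotated_hat
```

Under this reading, `compress_tile(tile, bits=4, axis=1)` would correspond to the key path (one grid per channel) and `compress_tile(tile, bits=2, axis=0)` to the value path (one grid per token). Because the rotation is orthonormal, rotating the query by the same matrix leaves the attention scores unchanged, so the inverse rotation in `decompress_tile` is only one of several places it could be applied.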
The shipped preset, kvarn_k4v2_g128, spends more bits on keys than values (4-bit keys, 2-bit values). We chose to release this configuration because it meets the strictest accuracy bar, matching FP16, that the most demanding production deployments and vLLM require, while still delivering throughput above FP16.
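A back-of-envelope calculation shows how a 4-bit key / 2-bit value preset relates to the quoted capacity range. The sketch below rests on assumptions that are not confirmed by the source: that "g128" denotes groups of 128 elements sharing metadata, and that each group stores one fp16 scale and one fp16 offset (32 bits). The function names are hypothetical.

```python
def bits_per_element(key_bits, value_bits, group_size, meta_bits_per_group):
    """Average stored bits per KV element, including per-group metadata."""
    payload = (key_bits + value_bits) / 2         # keys and values are equal in count
    overhead = meta_bits_per_group / group_size   # metadata amortized over the group
    return payload + overhead


def capacity_gain(key_bits, value_bits, group_size, meta_bits_per_group, baseline_bits=16):
    """Ideal KV-cache capacity multiplier relative to an FP16 cache."""
    return baseline_bits / bits_per_element(
        key_bits, value_bits, group_size, meta_bits_per_group
    )


ideal = capacity_gain(4, 2, 128, 0)       # payload only: 16 / 3
with_meta = capacity_gain(4, 2, 128, 32)  # assumed fp16 scale + offset per group: 16 / 3.25
```

Under these assumptions the ideal gain is about 5.3x without metadata and about 4.9x with it. Any extra per-tile scales and the fixed decode workspace mentioned in the capacity tip would lower the figure seen in practice, which is consistent with the 3-5x range and the ~4× measured on Qwen3-32B.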
Citation
KVarN is the official vLLM implementation of our paper:
📄 KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks (arXiv:2606.03458)
If you use KVarN, please cite:
@misc{muller2026kvarn,
title={KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks},
author={Lorenz K. Muller and Philippe Bich and Chiara Boretti and Hyun-Min Chang and Jiawei Zhuang and Lukas Cavigelli},
year={2026},
eprint={2606.03458},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={http://arxiv.org/abs/2606.03458}
}
License and attribution
KVarN is built on vLLM (v0.22.0) and is
released under the Apache 2.0 License. The original vLLM README is preserved as
README_vLLM.md.