OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
Summary
OSCAR is an offline spectral covariance-aware rotation method for 2-bit KV cache quantization that aligns quantization with attention covariance structures, achieving high accuracy and efficiency for long-context LLM serving.
Source: https://huggingface.co/papers/2605.17757
Abstract
OSCAR is an ultra-low-bit KV cache quantization method that aligns quantization with attention-aware covariance structures, achieving high accuracy and efficiency for long-context LLM serving.
INT2 KV-cache quantization is attractive for long-context LLM serving, but it remains difficult to make both accurate and deployable. Simple rotations such as Hadamard transforms reduce outliers, but still degrade at INT2 because they are not aligned with downstream attention. We propose OSCAR, an ultra-low-bit KV cache quantization method that estimates attention-aware covariance structures offline and uses them to derive fixed rotations and clipping thresholds for quantization. In this way, it aligns KV quantization with the covariance structures that attention actually consumes. More importantly, we not only provide theoretical justification but also develop a fully deployable OSCAR system with a custom INT2 attention kernel that remains compatible with paged KV-cache serving and fused kernel pipelines, enabling seamless integration into modern LLM serving frameworks such as SGLang and vLLM. We evaluate our methods on recent reasoning models with reasoning traces of up to 32k tokens across 5 tasks. On Qwen3-4B-Thinking-2507 and Qwen3-8B, OSCAR reduces the BF16 accuracy gap to 3.78 and 1.42 points, respectively, while naive rotation INT2 collapses to nearly zero. We further scale OSCAR to Qwen3-32B and GLM-4.7 (358B params), where it remains effectively on par with BF16. On long-context RULER-NIAH up to 128K, OSCAR remains robust on both Qwen3 models, while naive rotation INT2 collapses. System-wise, OSCAR reduces KV-cache memory by approximately 8x, improves throughput by up to 7x at large batch sizes under the same memory budget, and accelerates batch-size-1 decoding by up to 3x over BF16 due to reduced memory bandwidth overhead.
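The pipeline the abstract describes (offline statistics, a fixed rotation, clipping thresholds, then 2-bit codes) can be illustrated with a minimal NumPy sketch. This is not the OSCAR algorithm or its kernel. As a stand-in for the attention-aware covariance, it uses the plain second-moment matrix of synthetic calibration vectors, takes the eigenbasis as the rotation, and picks per-channel clipping thresholds from a percentile. The function names, the 99th-percentile choice, and the 4-codes-per-byte packing layout are all assumptions made for illustration.

```python
import numpy as np

def fit_rotation_and_clip(calib, clip_pct=99.0):
    """Offline step (illustrative): rotation and per-channel clip from calibration data.

    calib has shape (n_tokens, head_dim). The plain second-moment matrix is an
    assumption standing in for the paper's attention-aware covariance.
    """
    cov = calib.T @ calib / calib.shape[0]
    _, rot = np.linalg.eigh(cov)  # columns form an orthonormal basis
    clip = np.percentile(np.abs(calib @ rot), clip_pct, axis=0)
    return rot, np.maximum(clip, 1e-8)  # guard against zero thresholds

def quantize_int2(x, rot, clip):
    """Rotate, scale by the clip threshold, and map to 4 levels (codes 0..3)."""
    z = (x @ rot) / clip
    return np.clip(np.floor((z + 1.0) * 2.0), 0, 3).astype(np.uint8)

def dequantize_int2(codes, rot, clip):
    """Map codes to bin centers (-0.75, -0.25, 0.25, 0.75) and undo the rotation."""
    z = (codes.astype(np.float64) + 0.5) / 2.0 - 1.0
    return (z * clip) @ rot.T

def pack_int2(codes):
    """Pack four 2-bit codes into each byte (hypothetical layout)."""
    c = codes.reshape(codes.shape[0], -1, 4)
    return (c[..., 0] | (c[..., 1] << 2) | (c[..., 2] << 4) | (c[..., 3] << 6)).astype(np.uint8)

# Synthetic, correlated "KV" vectors; real calibration data would come from a model.
rng = np.random.default_rng(0)
head_dim = 64
mix = rng.normal(size=(head_dim, head_dim))
calib = rng.normal(size=(2048, head_dim)) @ mix

rot, clip = fit_rotation_and_clip(calib)
codes = quantize_int2(calib[:8], rot, clip)
packed = pack_int2(codes)
recon = dequantize_int2(codes, rot, clip)

bf16_bytes = 8 * head_dim * 2  # 8 tokens, 2 bytes per BF16 element
print(bf16_bytes / packed.nbytes)  # 8.0
```

The printed ratio is the raw 16-bit to 2-bit payload ratio. A deployed system also stores metadata such as scales or thresholds, which is one plausible reason the abstract reports the saving as approximately 8x.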
Models citing this paper: 1
Zhongzhu/OSCAR-RotationZoo (updated about 19 hours ago)
Similar Articles
New KV Quants coming 😍 Welcome OSCAR kv quant open sourced by togetherAI
Together AI open-sources OSCAR, an attention-aware 2-bit KV cache quantization system that enables efficient long-context LLM serving by redistributing quantization error according to attention importance.
RateQuant: Optimal Mixed-Precision KV Cache Quantization via Rate-Distortion Theory
This paper introduces RateQuant, a method for optimal mixed-precision KV cache quantization that uses rate-distortion theory to address distortion model mismatch. It significantly reduces perplexity compared to existing methods like KIVI and QuaRot with minimal calibration overhead.
@anirudhbv_ce: Introducing SpectralQuant.. here to save your KV cache :)
SpectralQuant is a new KV cache quantization technique achieving 5.95× compression on Mistral 7B with only 7.5% perplexity overhead, significantly outperforming TurboQuant while requiring only 15 seconds of calibration per model.
OjaKV: Context-Aware Online Low-Rank KV Cache Compression
OjaKV introduces a context-aware online low-rank KV cache compression framework that uses hybrid storage and Oja's algorithm for incremental subspace adaptation to reduce GPU memory bottlenecks in long-context LLM inference without model fine-tuning.
Here are my KV cache quantization benchmarks: TurboQuant is overrated but saved by TCQ, q5 deserves more attention, and symmetric q8 might be a waste of VRAM
A detailed benchmark comparing KV cache quantization methods (TurboQuant, TCQ, q4, q5, q8) using PPL and KLD metrics on Qwen 3.6 27B, finding that TCQ improves low-bit quantization, asymmetric KV beats symmetric at same size, and q8 is often overkill. Includes analysis and data in linked article.