OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

Summary

OSCAR is an offline spectral covariance-aware rotation method for 2-bit KV cache quantization that aligns quantization with attention covariance structures, achieving high accuracy and efficiency for long-context LLM serving.

INT2 KV-cache quantization is attractive for long-context LLM serving, but it remains difficult to make both accurate and deployable. Simple rotations such as Hadamard transforms reduce outliers, but still degrade at INT2 because they are not aligned with downstream attention. We propose OSCAR, an ultra-low-bit KV-cache quantization method that estimates attention-aware covariance structures offline and uses them to derive fixed rotations and clipping thresholds for quantization. In this way, it aligns KV quantization with the covariance structures that attention actually consumes. Beyond theoretical justification, we develop a fully deployable OSCAR system with a custom INT2 attention kernel that remains compatible with paged KV-cache serving and fused kernel pipelines, enabling seamless integration into modern LLM serving frameworks such as SGLang and vLLM. We evaluate our method on recent reasoning models with reasoning traces of up to 32K tokens across 5 tasks. On Qwen3-4B-Thinking-2507 and Qwen3-8B, OSCAR narrows the accuracy gap to BF16 to 3.78 and 1.42 points, respectively, while naive-rotation INT2 collapses to nearly zero. We further scale OSCAR to Qwen3-32B and GLM-4.7 (358B parameters), where it remains effectively on par with BF16. On long-context RULER-NIAH at up to 128K tokens, OSCAR remains robust on both Qwen3 models, while naive-rotation INT2 collapses. System-wise, OSCAR reduces KV-cache memory by approximately 8x, improves throughput by up to 7x at large batch sizes under the same memory budget, and accelerates batch-size-1 decoding by up to 3x over BF16 due to reduced memory-bandwidth overhead.
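
The "approximately 8x" memory figure follows from the bit widths alone: BF16 stores 16 bits per cached value and INT2 stores 2. The sketch below works through that arithmetic. The function name `kv_cache_bytes` and the model shape are hypothetical, chosen only for illustration, and the calculation ignores any per-group quantization metadata, which is one reason a real system may land slightly below 8x.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits_per_value):
    """Bytes needed to cache keys and values for `tokens` positions."""
    # Factor 2: one key tensor and one value tensor per layer.
    values = 2 * layers * kv_heads * head_dim * tokens
    return values * bits_per_value / 8

# Hypothetical model shape (an assumption, not taken from the paper).
shape = dict(tokens=32_768, layers=32, kv_heads=8, head_dim=128)

bf16_bytes = kv_cache_bytes(bits_per_value=16, **shape)
int2_bytes = kv_cache_bytes(bits_per_value=2, **shape)
ratio = bf16_bytes / int2_bytes

gib = 2 ** 30
print(f"BF16: {bf16_bytes / gib:.2f} GiB")  # BF16: 4.00 GiB
print(f"INT2: {int2_bytes / gib:.2f} GiB")  # INT2: 0.50 GiB
print(f"ratio: {ratio:.1f}x")               # ratio: 8.0x
```

Under a fixed memory budget, the same ratio bounds how many more sequences can be cached at once, which is consistent with the throughput gains the abstract reports at large batch sizes.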

Cached at: 05/19/26, 10:34 PM

Source: https://huggingface.co/papers/2605.17757

Abstract

OSCAR is an ultra-low-bit KV cache quantization method that aligns quantization with attention-aware covariance structures, achieving high accuracy and efficiency for long-context LLM serving.

INT2 KV-cache quantization is attractive for long-context LLM serving, but it remains difficult to make both accurate and deployable. Simple rotations such as Hadamard transforms reduce outliers, but still degrade at INT2 because they are not aligned with downstream attention. We propose OSCAR, an ultra-low-bit KV-cache quantization method that estimates attention-aware covariance structures offline and uses them to derive fixed rotations and clipping thresholds for quantization. In this way, it aligns KV quantization with the covariance structures that attention actually consumes. Beyond theoretical justification, we develop a fully deployable OSCAR system with a custom INT2 attention kernel that remains compatible with paged KV-cache serving and fused kernel pipelines, enabling seamless integration into modern LLM serving frameworks such as SGLang and vLLM. We evaluate our method on recent reasoning models with reasoning traces of up to 32K tokens across 5 tasks. On Qwen3-4B-Thinking-2507 and Qwen3-8B, OSCAR narrows the accuracy gap to BF16 to 3.78 and 1.42 points, respectively, while naive-rotation INT2 collapses to nearly zero. We further scale OSCAR to Qwen3-32B and GLM-4.7 (358B parameters), where it remains effectively on par with BF16. On long-context RULER-NIAH at up to 128K tokens, OSCAR remains robust on both Qwen3 models, while naive-rotation INT2 collapses. System-wise, OSCAR reduces KV-cache memory by approximately 8x, improves throughput by up to 7x at large batch sizes under the same memory budget, and accelerates batch-size-1 decoding by up to 3x over BF16 due to reduced memory-bandwidth overhead.
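
To make the "rotate, clip, quantize to four levels" pipeline concrete, here is a minimal sketch of the naive Hadamard-rotation baseline that the abstract contrasts OSCAR against. It is not OSCAR's algorithm: the rotation is a plain Hadamard matrix rather than one derived from attention-aware covariance, and the clipping threshold is taken from the vector itself rather than fixed offline. All function names and the toy vector are assumptions made for illustration.

```python
import math

def hadamard(n):
    """Orthonormal n x n Hadamard matrix via Sylvester's construction (n a power of two)."""
    H = [[1.0]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in row] for row in H]

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def quantize_int2(x, clip):
    """Clamp to [-clip, clip] and map each value to one of four evenly spaced codes 0..3."""
    step = 2.0 * clip / 3.0
    return [int(round((max(-clip, min(clip, v)) + clip) / step)) for v in x]

def dequantize_int2(codes, clip):
    step = 2.0 * clip / 3.0
    return [c * step - clip for c in codes]

def mse(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

# Toy vector with a single large outlier channel (illustrative data only).
x = [8.0, 0.1, -0.2, 0.3, 0.0, -0.1, 0.2, -0.3]

# Direct INT2: the outlier sets the range, so the small entries lose all detail.
clip_direct = max(abs(v) for v in x)
x_direct = dequantize_int2(quantize_int2(x, clip_direct), clip_direct)

# Rotate, quantize, then rotate back with the transpose (the inverse of an orthonormal matrix).
H = hadamard(len(x))
y = matvec(H, x)
clip_rot = max(abs(v) for v in y)
codes_rot = quantize_int2(y, clip_rot)
x_rot = matvec(transpose(H), dequantize_int2(codes_rot, clip_rot))

mse_direct = mse(x, x_direct)
mse_rot = mse(x, x_rot)
print(f"direct INT2 MSE:  {mse_direct:.3f}")
print(f"rotated INT2 MSE: {mse_rot:.3f}")
```

On this toy input the rotation spreads the outlier across all coordinates, so the reconstruction error drops sharply. That only illustrates outlier reduction; the abstract's point is that on real models such rotations still degrade at INT2 because they are not aligned with downstream attention.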

Models citing this paper: 1

#### Zhongzhu/OSCAR-RotationZoo
