OjaKV: Context-Aware Online Low-Rank KV Cache Compression

arXiv cs.CL Papers

Summary

OjaKV introduces a context-aware online low-rank KV cache compression framework that uses hybrid storage and Oja's algorithm for incremental subspace adaptation to reduce GPU memory bottlenecks in long-context LLM inference without model fine-tuning.

arXiv:2509.21623v2 Announce Type: replace Abstract: The expanding long-context capabilities of large language models are constrained by a significant memory bottleneck: the key-value (KV) cache required for autoregressive generation. This bottleneck is substantial; for instance, a Llama-3.1-8B model processing a 32K-token prompt at a batch size of 4 requires approximately 16GB for its KV cache, a size exceeding the model's weights. While KV-cache compression via low-rank projection is a promising direction, existing methods rely on a static, offline-learned subspace that performs poorly under data distribution shifts. To overcome these limitations, we introduce OjaKV, a novel framework that integrates a strategic hybrid storage policy with online subspace adaptation. First, OjaKV recognizes that not all tokens are equally important for compression; it preserves the crucial first and most recent tokens in full-rank, maintaining high-fidelity anchors for attention. Second, for the vast majority of intermediate tokens, it applies low-rank compression by incrementally adapting the projection basis using Oja's algorithm for online principal component analysis. This adaptation involves a comprehensive update during prompt prefilling and lightweight periodic updates during decoding, ensuring the subspace remains aligned with the evolving context. Crucially, our framework is fully compatible with modern attention modules like FlashAttention. Experiments demonstrate that OjaKV maintains or even improves zero-shot accuracy at high compression ratios. In particular, OjaKV achieves its strongest gains on very long-context benchmarks that require complex reasoning, highlighting the importance of online subspace adaptation in dynamically tracking context shifts. These results establish our hybrid framework as a practical, plug-and-play solution for memory-efficient long-context inference without requiring model fine-tuning.
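The 16 GB figure above can be reproduced with a short back-of-the-envelope calculation. The sketch below assumes Llama-3.1-8B's commonly reported configuration (32 transformer layers, 8 grouped-query KV heads, head dimension 128) and float16 storage; these constants and the rank-96 example are taken for illustration and are not stated in this abstract.

```python
def kv_cache_bytes(seq_len, batch, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """KV-cache size in bytes: keys + values for every layer, KV head, token, and batch entry."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len * batch

full = kv_cache_bytes(seq_len=32_768, batch=4)
print(f"full-rank KV cache: {full / 2**30:.1f} GiB")          # ~16.0 GiB

# Low-rank projection from head dimension d=128 to rank r=96 keeps r/d of that storage.
d, r = 128, 96
print(f"savings at rank 96: {1 - r / d:.0%}")                 # 25%
print(f"compressed cache:   {full * r / d / 2**30:.1f} GiB")  # ~12.0 GiB
```

Because token-eviction methods shrink the sequence length while low-rank projection shrinks the per-token payload, the two savings multiply when combined, which is the compatibility with token-selection methods the abstract alludes to.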

# OjaKV: Context-Aware Online Low-Rank KV Cache Compression

Source: https://arxiv.org/html/2509.21623

Yuxuan Zhu1*, David H. Yang1*, Mohammad Mohammadi Amiri1, Keerthiram Murugesan2, Tejaswini Pedapati2, Pin-Yu Chen2
1 Rensselaer Polytechnic Institute, 2 IBM Research (* Equal contribution)
{yangd13, zhuy27, mamiri}@rpi.edu, tejaswinip@us.ibm.com, {keerthiram.murugesan, pin-yu.chen}@ibm.com

###### Abstract

The expanding long-context capabilities of large language models are constrained by a significant memory bottleneck: the key-value (KV) cache required for autoregressive generation. This bottleneck is substantial; for instance, a Llama-3.1-8B model processing a 32K-token prompt at a batch size of 4 requires approximately 16 GB for its KV cache, a size exceeding the model's weights. While KV-cache compression via low-rank projection is a promising direction, existing methods rely on a static, offline-learned subspace that performs poorly under data distribution shifts. To overcome these limitations, we introduce OjaKV, a novel framework that integrates a strategic hybrid storage policy with online subspace adaptation. First, OjaKV recognizes that not all tokens are equally important for compression; it preserves the crucial tokens in full rank, maintaining high-fidelity anchors for attention. Second, for the majority of intermediate tokens, it applies low-rank compression by incrementally adapting the projection basis using Oja's algorithm for online principal component analysis. This adaptation involves a comprehensive update during prefilling and lightweight periodic updates during decoding, ensuring the subspace remains aligned with the evolving context. Crucially, our framework is fully compatible with modern attention modules like FlashAttention. Experiments show that OjaKV maintains or even improves accuracy at high compression ratios. In particular, OjaKV achieves its strongest gains on very long-context benchmarks that require complex reasoning, highlighting the importance of online subspace adaptation in dynamically tracking context shifts. Furthermore, our approach is compatible with token-selection methods, enabling compounded memory savings. These results establish our hybrid framework as a practical, plug-and-play solution for memory-efficient long-context inference without requiring model fine-tuning. Code is available at https://github.com/zzbright1998/OjaKV.

## 1 Introduction

Large language models (LLMs) such as GPT-4o (OpenAI et al., 2024) and DeepSeek-R1 (DeepSeek-AI et al., 2025) have demonstrated remarkable performance across diverse domains, including coding (Nam et al., 2024), mathematics (Setlur et al., 2024), and open-ended text generation (Kumichev et al., 2024). However, as model capabilities and context length expand, GPU memory emerges as a critical bottleneck for inference.
The memory footprint arises from two primary sources: (i) the model weights, with a model like Llama-3.1-8B requiring 16 GB alone; and (ii) the key-value (KV) cache used during prompt prefilling and autoregressive decoding. For instance, processing a 32K-token prompt with Llama-3.1-8B in float16 precision at a batch size of 4 consumes an additional 16 GB for the KV cache, rivalling the size of the model weights themselves. This substantial memory consumption makes long-context inference prohibitive on all but high-end hardware.

To mitigate this challenge, a variety of methods have been proposed to optimize KV-cache memory usage (Shi et al., 2024). These approaches can be grouped into four categories: (1) *Quantization*, which stores keys and values at a lower precision (e.g., 8-bit) (Hooper et al., 2024; Liu et al., 2024c); (2) *Token Selection*, which prunes or merges tokens deemed unimportant based on attention scores or heuristic saliency measures (Xiao et al., 2023; Li et al., 2024; Zhang et al., 2023); (3) *Offloading*, which transfers the KV cache to CPU memory and selectively streams it back during decoding (Tang et al., 2024; Sun et al., 2024; Zhu et al., 2025); and (4) *Low-rank Approximation*, which projects keys and values into a lower-dimensional subspace (Saxena et al., 2024; Lin et al., 2024). Our work focuses on this fourth direction. By compressing each key and value vector from dimension $d$ (e.g., $d=128$) to $r$ (e.g., $r=96$), low-rank methods can reduce cache memory by $(1-r/d)\times 100\%$ while preserving model accuracy, yielding substantial savings for long-context inference. While token selection has become a widely adopted strategy, we provide theoretical support showing that low-rank projection and token eviction are compatible, offering multiplicative benefits that enable even greater memory reductions when combined.

Existing low-rank methods fall into two main categories. (1) *Weight-decomposition* techniques directly factorize the linear projection weights for query, key, and value ($\bm{W}_q$, $\bm{W}_k$, $\bm{W}_v$) into low-rank matrices, thereby caching already-compressed intermediate states (Chang et al., 2024). However, this approach often incurs a noticeable degradation in accuracy. (2) *Projection-based* techniques learn fixed orthonormal projection bases ($\bm{U}_q$, $\bm{U}_k$, $\bm{U}_v$) from a calibration dataset. These bases are then used to compress the KV cache, which is reconstructed during attention computation (Saxena et al., 2024; Lin et al., 2024). While effective, these static bases implicitly assume that inference prompts will follow the same distribution as the calibration data.
In practice, distribution shifts (e.g., from dialogue to code generation) cause the approximation to deteriorate, harming generation quality.

To address these limitations, we propose OjaKV, a novel framework for KV-cache compression that operates on two core principles. Our first key insight is that uniform compression across all tokens is suboptimal. Motivated by the findings on attention sinks (Xiao et al., 2023) and SnapKV (Li et al., 2024), OjaKV employs a hybrid storage policy that strategically excludes the crucial tokens with high reconstruction error from low-rank projection. This preserves their full-rank fidelity, creating stable anchors for the attention mechanism and forming a significantly stronger performance baseline. Second, for the remaining intermediate tokens, we incorporate online subspace adaptation using Oja's incremental principal component analysis (PCA) (Oja, 1997). This mechanism performs a comprehensive update during the prefill stage on a selection of salient tokens, and subsequently executes periodic lightweight updates during decoding. This ensures the low-rank basis continuously adapts to the evolving context with negligible overhead. Our framework is fully compatible with modern attention modules such as FlashAttention (Dao et al., 2022), ensuring practicality in real-world long-context inference.

We evaluate OjaKV on multiple-choice benchmarks from the lm-eval-harness (Biderman et al., 2024), aligning with prior studies (Saxena et al., 2024; Lin et al., 2024). Additionally, for the first time to the best of our knowledge, we evaluate projection-based low-rank KV-cache compression methods on generation-centric long-context tasks using LongBench (Bai et al., 2023) and RULER (Hsieh et al., 2024). We also evaluate our method on the long-reasoning and generation task AIME (Balunović et al., 2025). Across all settings, OjaKV demonstrates superior performance over static low-rank baselines at equivalent compression ratios.

In summary, our contributions are threefold. First, we introduce a hybrid low-rank KV-cache compression framework, OjaKV, which combines a selective full-rank storage policy with context-aware online subspace adaptation. Second, our design is compatible with FlashAttention, ensuring practicality for modern long-context inference pipelines. Third, we conduct the first comprehensive evaluation of low-rank KV compression on challenging generation-centric benchmarks, moving beyond simpler multiple-choice tasks.

## 2 Related Work

Recent work on reducing the memory footprint of LLM inference has led to various KV-cache compression strategies, including quantization, token pruning, offloading, and subspace compression. Our method builds on low-rank approximation, extending it with an online, context-adaptive formulation. We review relevant methods below.

### 2.1 KV-Cache Compression

The memory overhead of the KV cache motivates four main compression strategies (Shi et al., 2024).
(1) *Quantization*: These methods reduce precision (e.g., to 4-bit or 2-bit) for storage. Approaches like KVQuant (Hooper et al., 2024) and KIVI (Liu et al., 2024c) achieve significant compression with minimal quality loss. (2) *Token selection*: This category retains only essential tokens. StreamingLLM (Xiao et al., 2023) keeps "attention sinks," while SnapKV (Li et al., 2024) selects tokens based on importance scores. (3) *Offloading*: Methods such as Quest (Tang et al., 2024) and ShadowKV (Sun et al., 2024) store the cache in CPU memory, selectively reloading relevant parts during computation. (4) *Low-rank approximation*: These approaches project keys and values into a lower-dimensional subspace (Saxena et al., 2024; Lin et al., 2024). Our method falls into this last category and is *orthogonal* to the others, enabling additive memory savings. Appendix A.9 provides analysis and experiments showing that our method can be combined with token-eviction-based methods.

### 2.2 Low-Rank Approximation for Attention

Low-rank structures are widely utilized for compression. Palu (Chang et al., 2024) and ReCalKV (Yan et al., 2025) factorize weights to cache compressed states, though often with accuracy degradation. EigenAttention (Saxena et al., 2024) and MatryoshkaKV (Lin et al., 2024) project activations using bases derived from calibration data ($\bm{U}_k$, $\bm{U}_v$). However, these methods rely on a *static* basis, which degrades performance under distribution shifts. While architectures like Multi-Head Latent Attention (Liu et al., 2024a, b) integrate low-rank structures natively, they require training new models. Our method addresses these limitations via *online subspace adaptation*, continuously updating projection matrices during inference for a plug-in, context-aware approximation.

### 2.3 Online Principal Component Analysis

In autoregressive decoding, rerunning SVD on the entire history is computationally infeasible. Online PCA algorithms address this by incrementally updating the subspace. Perturbation techniques (Gu and Eisenstat, 1994) provide accuracy but are too resource-intensive for high-dimensional streams. Incremental SVD (Brand, 2002) offers a middle ground but remains costly for real-time LLM inference. In contrast, stochastic methods like Oja's rule (Oja, 1997) efficiently optimize variance via gradient updates. Our proposed method, OjaKV, builds on this foundation. By applying Oja's rule to update key and value subspaces on the fly, it enables fast, adaptive, and calibration-free approximation.
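To illustrate the flavor of such an update, the following is a minimal PyTorch sketch of one Oja-style streaming-PCA step on a batch of newly observed key vectors. It is a sketch under generic assumptions (a fixed learning rate and a QR re-orthonormalization step), not the paper's exact implementation.

```python
import torch

def oja_update(U, X_new, lr=1e-2):
    """One Oja-style streaming-PCA step (illustrative; not the paper's exact code).

    U:     (d_h, r) current orthonormal projection basis
    X_new: (n, d_h) newly observed key (or value) vectors
    """
    Y = X_new @ U                                        # (n, r) coordinates in the current subspace
    # Gradient step that increases captured variance: U += lr/n * (X^T Y - U Y^T Y)
    U = U + lr * (X_new.T @ Y - U @ (Y.T @ Y)) / X_new.shape[0]
    # Re-orthonormalize so U remains a valid orthonormal basis.
    U, _ = torch.linalg.qr(U)
    return U

# Toy usage: adapt a rank-96 basis to a stream of 128-dimensional key vectors.
d_h, r = 128, 96
U, _ = torch.linalg.qr(torch.randn(d_h, r))
for _ in range(10):                                      # e.g., periodic updates during decoding
    U = oja_update(U, torch.randn(32, d_h))
print(U.shape, torch.dist(U.T @ U, torch.eye(r)))        # basis stays (128, 96) and near-orthonormal
```

In the setting described above, such a step could be run once over salient prompt tokens during prefilling and then periodically over small batches of newly generated tokens during decoding, so the basis keeps tracking the evolving context.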
## 3 Preliminaries: Low-Rank Attention

We begin by describing the standard attention mechanism and how it can be adapted for low-rank KV-cache compression. For simplicity, we focus on a single attention head. Let the head dimension be $d_h$ and the input sequence be $\bm{X}\in\mathbb{R}^{n\times d}$. Standard attention first projects the input into query, key, and value representations using projection matrices $\bm{W}_q,\bm{W}_k,\bm{W}_v\in\mathbb{R}^{d\times d_h}$:

$$\bm{Q}=\bm{X}\bm{W}_q,\qquad\bm{K}=\bm{X}\bm{W}_k,\qquad\bm{V}=\bm{X}\bm{W}_v,$$

where $\bm{Q},\bm{K},\bm{V}\in\mathbb{R}^{n\times d_h}$. The KV cache stores $\bm{K}$ and $\bm{V}$ for future use during decoding. The attention scores $\bm{A}$ are then computed as the scaled dot product between queries and keys, followed by a softmax operation. The output of the attention head, $\bm{O}$, is computed by applying these attention scores to the value matrix $\bm{V}$:

$$\bm{A}=\mathrm{softmax}\!\left(\bm{Q}\bm{K}^{\mathsf{T}}/\sqrt{d_h}\right)\in\mathbb{R}^{n\times n},\qquad\bm{O}=\bm{A}\bm{V}\in\mathbb{R}^{n\times d_h}.$$

### 3.1 Low-Rank KV-Cache Approximation

The core idea of low-rank approximation is to project $\bm{K}$ and $\bm{V}$ onto lower-dimensional subspaces. We define two orthonormal bases: $\bm{U}_k\in\mathbb{R}^{d_h\times r_k}$ for keys and queries, and $\bm{U}_v\in\mathbb{R}^{d_h\times r_v}$ for values, where $r_k,r_v\ll d_h$ are the desired ranks for compression. The bases satisfy $\bm{U}_k^{\mathsf{T}}\bm{U}_k=\bm{I}_{r_k}$ and $\bm{U}_v^{\mathsf{T}}\bm{U}_v=\bm{I}_{r_v}$. The low-rank bases $\bm{U}_k$ and $\bm{U}_v$ are initialized from a small calibration dataset (see Appendix A.3 for details).
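The following is a minimal numerical sketch of the projection and reconstruction described in this subsection, with randomly generated orthonormal matrices standing in for the calibration-initialized $\bm{U}_k$ and $\bm{U}_v$; the sequence length, ranks, and anchor-token counts are illustrative assumptions, not values from the paper.

```python
import torch

torch.manual_seed(0)
d_h, r_k, r_v, n = 128, 96, 96, 4096            # head dim, key/value ranks, sequence length (illustrative)

# Stand-ins for the orthonormal bases U_k, U_v (the paper initializes them from
# calibration data and adapts them online); random orthonormal matrices here.
U_k, _ = torch.linalg.qr(torch.randn(d_h, r_k))
U_v, _ = torch.linalg.qr(torch.randn(d_h, r_v))

K = torch.randn(n, d_h)                          # keys for one head
V = torch.randn(n, d_h)                          # values for one head

# Compression: cache only the r-dimensional coefficients.
K_lr, V_lr = K @ U_k, V @ U_v                    # shapes (n, 96)

# Reconstruction before attention: map back to the head dimension.
K_hat, V_hat = K_lr @ U_k.T, V_lr @ U_v.T        # shapes (n, 128), approximations of K and V

# Under OjaKV's hybrid policy, a few crucial tokens (e.g., the first and most
# recent ones) would instead be kept full-rank as high-fidelity anchors; the
# token counts below are hypothetical.
n_sink, n_recent = 4, 64
K_mixed = torch.cat([K[:n_sink], K_hat[n_sink:n - n_recent], K[n - n_recent:]], dim=0)

print(K_lr.shape, K_hat.shape, K_mixed.shape)
print(f"cache kept for compressed tokens: {r_k / d_h:.0%} of full rank")
```

At rank 96 out of 128, the compressed tokens occupy 75% of their full-rank footprint, matching the $(1-r/d)\times 100\%$ savings formula from the introduction.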

Similar Articles

KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs

Hugging Face Daily Papers

KV Packet proposes a recomputation-free cache reuse framework for LLMs that uses trainable soft-token adapters to bridge context discontinuities, eliminating overhead while maintaining performance comparable to full recomputation baselines on Llama-3.1 and Qwen2.5.

KV Cache Compression 900000x Beyond TurboQuant and Per-Vector Shannon Limit

Hacker News Top

A new paper proposes sequential KV cache compression using probabilistic language tries and predictive delta coding, achieving theoretical compression ratios of ~914,000× beyond TurboQuant by exploiting the sequential structure of language model tokens rather than treating vectors independently.