Key-Value Means
Summary
Key-Value Means (KVM) is a novel attention mechanism that combines the strengths of transformers and RNNs with controllable computational complexity and memory usage. It supports fixed-size or growing state, offers subquadratic prefill time and sublinear state growth, and can be implemented without custom kernels.
Source: https://huggingface.co/papers/2605.09877
Abstract
Key-Value Means introduces a novel attention mechanism that combines transformer and RNN advantages with controllable computational complexity and memory usage.
We present Key-Value Means ("KVM"), a novel block-recurrence for attention that can accommodate either fixed-size or growing state. Equipping a strong transformer baseline with fixed-size KVM attention layers yields a strong O(N) chunked RNN, while adding only an insignificant number of new parameters. We train a transformer with a growable KVM cache and show it performs competitively on long-context tests with only subquadratic prefill time and sublinear state growth. KVM is implementable with standard operations and without custom kernels, and supports chunk-wise parallelizable training and prefill. It provides many of the benefits of both traditional transformers (expandable context memory, chunk-wise parallelizable training and prefill) and linear RNNs in a single unified package. It can be used on every layer, saving KV-cache memory, and allowing a continuous range of choices of prefill time complexity between O(N) and O(N^2). It can also be implemented in a hybrid solution in tandem with LRNN layers in place of traditional attention, to supplement the LRNN with improved sublinear memory growth context length usage and long context decoding. We release our code at https://github.com/recursal/KVM-paper and trained models at https://huggingface.co/collections/recursal/key-value-means under the Apache 2.0 license.
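To make the "continuous range between O(N) and O(N^2)" claim concrete, here is a rough cost-model sketch. It is not the paper's algorithm. It assumes that each chunk attends to a compressed state plus its own tokens, and that the state size follows a hypothetical power law in the number of tokens seen. The names `prefill_cost`, `power_law_state`, `alpha` and `base` are invented for this illustration.

```python
import math


def prefill_cost(n_tokens, chunk_size, state_size_fn):
    """Count query-key score computations for a chunk-wise prefill.

    Assumed cost model (not taken from the paper): every token in a chunk
    is scored against the state built from earlier tokens plus all tokens
    of its own chunk (causal masking inside the chunk is ignored).
    """
    total = 0
    for start in range(0, n_tokens, chunk_size):
        chunk = min(chunk_size, n_tokens - start)
        state = state_size_fn(start)  # state entries available before this chunk
        total += chunk * (state + chunk)
    return total


def power_law_state(alpha, base=64):
    """Hypothetical growth rule: state size ~ base * (tokens_seen / base) ** alpha.

    alpha = 0 gives a fixed-size state, alpha = 1 keeps one entry per token
    (an ordinary KV cache), values in between give sublinear growth.
    """
    def size(tokens_seen):
        if tokens_seen == 0:
            return 0
        grown = math.ceil(base * (tokens_seen / base) ** alpha)
        return min(tokens_seen, grown)  # never more entries than tokens seen
    return size


if __name__ == "__main__":
    for alpha in (0.0, 0.5, 1.0):
        small = prefill_cost(4096, 64, power_law_state(alpha))
        large = prefill_cost(8192, 64, power_law_state(alpha))
        # Doubling N multiplies the cost by roughly 2 ** (1 + alpha).
        print(f"alpha={alpha}: cost ratio when N doubles = {large / small:.2f}")
```

Under these assumptions, a fixed-size state scales linearly with sequence length, a per-token state scales quadratically, and intermediate growth exponents fall in between, which is the trade-off the abstract describes.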
Similar Articles
@jiqizhixin: What if your AI’s memory didn’t have to balloon with every extra sentence? University of Oxford, Technion, AITHYRA, and…
Introduces KV-Compression Aware Training (KV-CAT), a method that encourages transformers to learn compressible key-value caches during training, improving memory efficiency for long-context tasks without sacrificing performance.
Models Take Notes at Prefill: KV Cache Can Be Editable and Composable
This paper proposes that the KV cache in transformers acts as a notebook of memoized conclusions, enabling surgical editing and composition without full recomputation. The method achieves significant latency reductions while preserving decision equivalence across model scales.
@TheTuringPost: Why KV cache is one of the main reasons LLMs are fast? KV cache is what connects attention mechanism with generation st…
KV cache stores previously computed key and value vectors during autoregressive generation, allowing models to avoid recomputing the entire sequence at each step, significantly speeding up inference at the cost of increased memory usage.
Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility
Introduces Self-Pruned Key-Value Attention (SP-KV), a mechanism that learns to predict the future utility of key-value pairs in order to prune the KV cache dynamically, cutting memory usage and speeding up decoding by 3-10x with minimal performance degradation. The model and utility predictor are trained end-to-end using next-token prediction.
NestedKV: Nested Memory Routing for Long-Context KV Cache Compression
NestedKV is a training-free KV cache compression method that uses nested memory routing with multi-time-scale anomaly scoring to improve long-context language model efficiency, achieving significant gains on benchmarks like RULER and LongBench.