EntmaxKV: Support-Aware Decoding for Entmax Attention

arXiv cs.LG Papers

Summary

EntmaxKV is a support-aware sparse decoding framework for entmax attention. It reduces KV-cache memory traffic by exploiting sparsity before KV pages are loaded, and reports up to 3.36× (softmax) and 5.43× (entmax) speedup over full-attention baselines at 1M context length while closely matching full-cache entmax quality.

arXiv:2605.21649v1

Abstract: Long-context decoding is increasingly limited by KV-cache memory traffic since each generated token attends over a cache whose size grows linearly with context length. Existing sparse decoding methods reduce this cost by selecting subsets of tokens or pages, but are designed for softmax attention, whose dense tails make any truncation discard nonzero probability mass. In contrast, $\alpha$-entmax produces exact zeros, turning sparse decoding from dense-tail approximation into support recovery: if the selected candidates contain the entmax support, sparse decoding remains exact. While recent entmax kernels enable efficient training, they do not address the autoregressive decoding bottleneck, where dense inference still streams the full KV cache before sparsity is known. In this work, we introduce EntmaxKV, an entmax-native sparse decoding framework that exploits sparsity before KV pages are loaded. EntmaxKV combines query-aware page scoring, support-aware candidate selection, and sparse entmax attention. We analyze truncation error through the dropped probability mass $\delta$, showing that output error is controlled by $\delta$ and vanishes when the entmax support is recovered. We further introduce a Gaussian-aware entmax selector that estimates the entmax threshold from lightweight page statistics, adapting the selected budget to the score distribution. Empirically, EntmaxKV drops less probability mass, retains more support tokens, and achieves lower output error than softmax-based sparse decoding at matched KV budgets. On long-context and language modeling benchmarks, it closely matches full-cache entmax while using a small fraction of the KV cache, achieving up to $3.36\times$ (softmax) and $5.43\times$ (entmax) speedup over full attention baselines at 1M context length. Code available at: https://github.com/deep-spin/entmaxkv.
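The support-recovery claim in the abstract can be illustrated with a small NumPy sketch. This is not the EntmaxKV implementation: `entmax_bisect` here is a plain bisection solver written for illustration, the candidate sets are simple top-k selections rather than the paper's page scoring or Gaussian-aware selector, and the sizes `n` and `d` are arbitrary assumptions. It only shows that sparse decoding matches full-cache entmax when the candidates contain the support, and that a nonzero dropped mass $\delta$ comes with a nonzero output error.

```python
import numpy as np

def entmax_bisect(scores, alpha=1.5, n_iter=60):
    """alpha-entmax (alpha > 1) of a 1-D score vector via bisection on the threshold tau.

    Illustrative solver only; p_i = max((alpha-1)*s_i - tau, 0) ** (1/(alpha-1)).
    """
    z = (alpha - 1.0) * np.asarray(scores, dtype=np.float64)
    power = 1.0 / (alpha - 1.0)
    # At tau = max(z) - 1 the mass is >= 1; at tau = max(z) it is 0.
    lo, hi = z.max() - 1.0, z.max()
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        mass = np.sum(np.clip(z - tau, 0.0, None) ** power)
        if mass >= 1.0:
            lo = tau
        else:
            hi = tau
    p = np.clip(z - lo, 0.0, None) ** power
    return p / p.sum()  # entries at or below the threshold are exactly zero

def sparse_decode(q, K, V, cand, alpha=1.5):
    """Entmax attention output computed only over the candidate rows `cand`."""
    s = K[cand] @ q / np.sqrt(q.shape[0])
    return entmax_bisect(s, alpha) @ V[cand]

# Hypothetical single-head decoding step with random keys and values.
rng = np.random.default_rng(0)
n, d = 4096, 64
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

scores = K @ q / np.sqrt(d)
p_full = entmax_bisect(scores)
out_full = p_full @ V
support = np.flatnonzero(p_full)  # tokens with nonzero entmax probability

order = np.argsort(-scores)                       # tokens sorted by score, descending
cand_super = order[: 2 * support.size]            # contains the whole support
cand_half = order[: max(1, support.size // 2)]    # misses part of the support

out_super = sparse_decode(q, K, V, cand_super)
out_half = sparse_decode(q, K, V, cand_half)

# Dropped probability mass delta and output error for each candidate set.
delta_super = 1.0 - p_full[cand_super].sum()
delta_half = 1.0 - p_full[cand_half].sum()
err_super = np.linalg.norm(out_super - out_full)
err_half = np.linalg.norm(out_half - out_full)

print(f"support size: {support.size} of {n}")
print(f"superset: delta={delta_super:.2e}, error={err_super:.2e}")
print(f"half:     delta={delta_half:.2e}, error={err_half:.2e}")
```

Because entmax assigns nonzero probability exactly to the scores above its threshold, any top-k set with k at least the support size contains the support, which is why the superset case reproduces the full-cache output up to floating-point error.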


# EntmaxKV: Support-Aware Decoding for Entmax Attention
Source: [https://arxiv.org/abs/2605.21649](https://arxiv.org/abs/2605.21649)
[View PDF](https://arxiv.org/pdf/2605.21649)


## Submission history

From: Marcos Vinícius Treviso

**\[v1\]** Wed, 20 May 2026 19:03:38 UTC \(809 KB\)

Similar Articles

MiniMax Sparse Attention

Hugging Face Daily Papers

MiniMax Sparse Attention introduces a blockwise sparse attention mechanism that achieves significant speedups for ultra-long-context LLMs, reducing per-token attention compute by 28.4x at 1M context with wall-clock speedups of 14.2x for prefill and 7.6x for decoding on H800 GPUs. The method is accompanied by an open-source inference kernel and a publicly released multimodal model.

SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference

arXiv cs.CL

SparDA proposes a decoupled sparse attention architecture that adds a lightweight 'Forecast' projection to predict future KV cache needs, enabling lookahead prefetching from CPU to GPU and reducing selection overhead. On 8B sparse-pretrained models, it achieves up to 1.25× prefill and 1.7× decode speedup, with up to 5.3× higher decode throughput over non-offload baselines.