@rohanpaul_ai: New Alibaba + Nanjing Univ paper claims million-token prefill can be sped up 9.36X (compared against FlashAttention-2) …
Summary
A new paper from Alibaba and Nanjing University introduces RTPurbo, a method that speeds up million-token prefill by up to 9.36x compared to FlashAttention-2 by applying full long-range attention only where it is needed, with only lightweight adaptation rather than training the model from scratch.
Cached at: 05/25/26, 04:43 AM
New Alibaba + Nanjing Univ paper claims million-token prefill can be sped up 9.36X (compared against FlashAttention-2) with only lightweight adaptation
It shows that standard LLMs can handle very long contexts faster by making attention selectively sparse.
The problem is that full attention gets very expensive when the input grows to hundreds of thousands or 1M tokens, because every token is compared against every earlier token, so the work grows roughly with the square of the input length.
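To make the scaling concrete, here is a small back-of-the-envelope sketch. It only counts the query-key pairs that one causal attention head has to score during prefill; the function name is made up for illustration and nothing here comes from the paper itself.

```python
def causal_attention_pairs(n_tokens: int) -> int:
    """Count the (query, key) pairs one causal attention head scores during prefill."""
    # Token i attends to tokens 0..i, so the total is 1 + 2 + ... + n.
    return n_tokens * (n_tokens + 1) // 2


if __name__ == "__main__":
    for n in (1_000, 100_000, 1_000_000):
        print(f"{n:>9,} tokens -> {causal_attention_pairs(n):,} pairs per head")
```

Going from 100K to 1M tokens multiplies the input by 10 but the pair count by roughly 100, reaching about 5 x 10^11 pairs per head per layer at 1M tokens.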
The paper’s claim is that a trained full-attention model already has a hidden sparse structure, so the model does not need to be rebuilt or trained from scratch.
RTPurbo uses that structure by finding the few attention heads that really need faraway tokens, while letting the other heads focus mostly on nearby text.
For those retrieval heads, it uses a small 16-dimensional token finder to guess which old tokens matter, then runs the real attention only on that selected set.
The authors tested this on long-context benchmarks and reasoning tasks, and RTPurbo kept accuracy close to full attention while reaching up to 9.36x faster prefill at 1M tokens and about 2x faster decoding.
RTPurbo’s engineering rule: keep expensive long-context access only where it matters, and route the rest through a smaller search space.
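One way to picture this rule is as a per-head attention mask. The sketch below is an illustration under loud assumptions, not the paper's implementation: the sliding window for local heads, the window size, and the set of retrieval-head indices are all hypothetical inputs, since the thread does not say how RTPurbo picks heads or what exact pattern the other heads follow.

```python
import numpy as np


def causal_mask(n: int) -> np.ndarray:
    """Full causal mask: query i may attend to every key j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))


def sliding_window_mask(n: int, window: int) -> np.ndarray:
    """Local causal mask: query i attends only to the last `window` keys, itself included."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)


def head_masks(n: int, num_heads: int, retrieval_heads: set, window: int) -> np.ndarray:
    """Build one mask per head (assumed scheme): long-range access for
    retrieval heads, a local window for every other head."""
    masks = np.empty((num_heads, n, n), dtype=bool)
    for h in range(num_heads):
        if h in retrieval_heads:
            masks[h] = causal_mask(n)
        else:
            masks[h] = sliding_window_mask(n, window)
    return masks


if __name__ == "__main__":
    masks = head_masks(n=8, num_heads=4, retrieval_heads={1}, window=3)
    # Allowed (query, key) pairs per head: local heads keep far fewer.
    print(masks.sum(axis=(1, 2)))
```

With these toy settings the retrieval head keeps all 36 causal pairs while each local head keeps 21, and the gap widens quickly as the sequence grows because the local cost is linear in length.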
The clever part is the 16-dimensional indexer.
It does not replace the model’s real attention computation; it acts like a cheap scout, finding likely useful tokens before the full representation is used on the selected set.
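The scout idea can be sketched for a single query as follows. This is a minimal illustration assuming a learned linear projection into 16 dimensions and a plain top-k selection; the projection matrices `Wq` and `Wk`, the budget `k`, and both function names are hypothetical, and the paper's actual indexer may be built differently.

```python
import numpy as np


def select_tokens(q, K, Wq, Wk, k):
    """Cheap scout: score every cached key in a small (here 16-dim) space
    and return the indices of the k highest-scoring tokens."""
    q_small = q @ Wq             # (16,)
    K_small = K @ Wk             # (n, 16)
    scores = K_small @ q_small   # (n,) approximate relevance, not real attention
    k = min(k, scores.shape[0])
    idx = np.argpartition(scores, -k)[-k:]
    return np.sort(idx)


def sparse_attention_step(q, K, V, Wq, Wk, k):
    """Run the real softmax attention, but only over the selected tokens."""
    idx = select_tokens(q, K, Wq, Wk, k)
    logits = K[idx] @ q / np.sqrt(q.shape[0])   # full-dimension scores on the subset
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ V[idx], idx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 1024, 64
    q, K, V = rng.normal(size=d), rng.normal(size=(n, d)), rng.normal(size=(n, d))
    Wq, Wk = rng.normal(size=(d, 16)), rng.normal(size=(d, 16))  # random stand-ins for learned weights
    out, idx = sparse_attention_step(q, K, V, Wq, Wk, k=64)
    print(out.shape, len(idx))
```

The design point is that the expensive full-dimension dot products and the value mixing touch only `k` tokens, while the pass over all `n` tokens happens in the much smaller 16-dimensional space.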
RTPurbo is not proof that every model can be safely sparsified this way.
But it is strong evidence that the waste in long-context inference is more structured than it looks.
Paper Link – arxiv.org/abs/2605.16928v1
Paper Title: “Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps”