@rohanpaul_ai: New Alibaba + Nanjing Univ paper claims million-token prefill can be sped up 9.36X (compared against FlashAttention-2) …

X AI KOLs Timeline 05/24/26, 07:53 PM Papers

million-token-prefill attention-speedup sparse-attention long-context inference-optimization rtpurbo alibaba

Summary

A new paper from Alibaba and Nanjing University introduces RTPurbo, a method that speeds up million-token prefill by up to 9.36x over FlashAttention-2. It keeps full long-range attention only in the heads that need it, and it requires only lightweight adaptation rather than training the model from scratch.

Cached at: 05/25/26, 04:43 AM

New Alibaba + Nanjing Univ paper claims million-token prefill can be sped up by up to 9.36x (compared against FlashAttention-2) with only lightweight adaptation.

It shows that standard LLMs can handle very long context faster by making attention selectively sparse.

The problem is that full attention gets very expensive when the input grows to hundreds of thousands or 1M tokens, because every token is compared with every earlier token, so the work grows roughly with the square of the input length.
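As a rough illustration of that growth (a back-of-envelope count, not a figure from the paper): under causal full attention, token i is scored against itself and the i - 1 tokens before it, so one head computes the following number of query-key pairs over n tokens. The symbol pairs(n) is just a name used here.

```latex
\mathrm{pairs}(n) = \sum_{i=1}^{n} i = \frac{n(n+1)}{2} \approx \frac{n^{2}}{2},
\qquad
\frac{\mathrm{pairs}(10^{6})}{\mathrm{pairs}(10^{5})} \approx 100
```

So a 10x longer input costs about 100x more attention scoring, which is why 1M-token prefill is where sparsity pays off most.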

The paper’s claim is that a trained full-attention model already has a hidden sparse structure, so the model does not need to be rebuilt or trained from scratch.

RTPurbo uses that structure by finding the few attention heads that really need faraway tokens, while letting the other heads focus mostly on nearby text.
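The post does not say how the paper identifies those heads. One simple way to picture the idea is to measure, for each head, how much attention mass lands on tokens outside a local window, and flag heads where that share is large. The sketch below does that on toy attention matrices; `distant_mass`, `split_heads`, `window`, and `threshold` are hypothetical names and a stand-in heuristic, not the paper's procedure.

```python
def distant_mass(attn, window):
    """Average attention mass per query that lands on keys at least
    `window` positions back. `attn[q][k]` is the weight query q puts on key k.
    Assumption: illustrative heuristic, not the criterion used in the paper."""
    total = 0.0
    for q, row in enumerate(attn):
        # Only keys 0..q are visible under causal attention.
        total += sum(w for k, w in enumerate(row[: q + 1]) if q - k >= window)
    return total / len(attn)


def split_heads(head_attn, window, threshold):
    """Split head indices into (retrieval, local) lists by distant attention mass."""
    retrieval, local = [], []
    for h, attn in enumerate(head_attn):
        if distant_mass(attn, window) > threshold:
            retrieval.append(h)
        else:
            local.append(h)
    return retrieval, local


if __name__ == "__main__":
    n = 8
    # Toy head 0 spreads attention evenly over all visible tokens.
    uniform = [[1.0 / (q + 1) if k <= q else 0.0 for k in range(n)] for q in range(n)]
    # Toy head 1 only ever looks at the current token.
    diagonal = [[1.0 if k == q else 0.0 for k in range(n)] for q in range(n)]
    print(split_heads([uniform, diagonal], window=2, threshold=0.2))
```

In this toy setup the evenly spread head is flagged as a retrieval head and the self-only head as local. Real heads are less clean, so any threshold would have to be checked against accuracy.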

For those retrieval heads, it uses a small 16-dimensional token finder (the indexer) to guess which earlier tokens matter, then runs the real attention only on that selected set.
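A minimal sketch of that two-stage pattern, for a single query in a single head: score every earlier token with short index vectors, keep the top few, then run ordinary softmax attention in full dimension on just those. The function names, `top_k`, and the way index vectors are supplied are assumptions for illustration; the post only says the finder is 16-dimensional and does not describe how it is trained.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def select_tokens(index_query, index_keys, top_k):
    """Stage 1 (cheap): rank earlier tokens by a low-dimensional score and
    return the positions of the top_k highest-scoring ones, in order.
    Assumption: index vectors would be 16-dimensional in the described setup."""
    scores = [dot(index_query, k) for k in index_keys]
    best = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    return sorted(best)


def attend(query, keys, values, positions):
    """Stage 2 (exact): softmax attention in full dimension, restricted to
    the given positions."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[p]) * scale for p in positions]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[p][d] for w, p in zip(weights, positions)) / total
        for d in range(dim)
    ]


def sparse_retrieval_attention(query, keys, values, index_query, index_keys, top_k):
    """Scout with the small index vectors, then attend only to the selected tokens."""
    positions = select_tokens(index_query, index_keys, top_k)
    return attend(query, keys, values, positions)
```

The full-size keys and values are untouched; the index vectors only decide which positions the real attention gets to see.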

The authors tested this on long-context benchmarks and reasoning tasks, and RTPurbo kept accuracy close to full attention while reaching up to 9.36x faster prefill at 1M tokens and about 2x faster decoding.

RTPurbo’s engineering rule: keep expensive long-context access only where it matters, and route the rest through a smaller search space.
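To see why that rule saves work, here is a back-of-envelope count of multiply-adds spent on attention scores only. It assumes local heads see a fixed window, retrieval heads scan everything with a short index vector and then attend to at most `top_k` tokens, and the indexer runs once per retrieval head. All parameter names and the cost model itself are assumptions for illustration; it ignores value aggregation, memory traffic, and kernel efficiency, so it will not reproduce the reported 9.36x.

```python
def causal_pairs(n):
    """Query-key pairs for one head under causal full attention."""
    return n * (n + 1) // 2


def capped_pairs(n, cap):
    """Query-key pairs when each query sees at most `cap` tokens (itself included)."""
    if n <= cap:
        return causal_pairs(n)
    # The first `cap` queries see a growing prefix; the rest see exactly `cap` keys.
    return causal_pairs(cap) + (n - cap) * cap


def estimated_score_cost(n, heads, retrieval_heads, head_dim, index_dim, window, top_k):
    """Return (full, sparse) multiply-add counts for attention scoring.
    Assumption: a toy cost model, not the paper's analysis."""
    full = heads * causal_pairs(n) * head_dim
    local = (heads - retrieval_heads) * capped_pairs(n, window) * head_dim
    scout = retrieval_heads * causal_pairs(n) * index_dim   # cheap scan over all tokens
    selected = retrieval_heads * capped_pairs(n, top_k) * head_dim
    return full, local + scout + selected


if __name__ == "__main__":
    # Hypothetical configuration; only index_dim=16 comes from the post.
    full, sparse = estimated_score_cost(
        n=1_000_000, heads=32, retrieval_heads=4,
        head_dim=128, index_dim=16, window=4096, top_k=2048,
    )
    print(f"score-only ratio: {full / sparse:.1f}x")
```

Note that the scout term still grows with the square of the input length; it is just cheaper per pair because the index vectors are short.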

The clever part is the 16-dimensional indexer.

It does not replace the model’s real attention computation; it acts like a cheap scout, finding likely useful tokens before the full representation is used on the selected set.

RTPurbo is not proof that every model can be safely sparsified this way.

But it is strong evidence that the waste in long-context inference is more structured than it looks.

Paper Link – arxiv.org/abs/2605.16928v1

Paper Title: “Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps”

Similar Articles

@rohanpaul_ai: Quite incredible, MiniMax Sparse Attention cuts attention compute by 28.4X at 1M tokens, with 14.2X faster prefill and …

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts

Anyone using Flash Attention 2 (ai-bond) on their V100's? How is the performance?

Nous Research Releases Token Superposition Training to Speed Up LLM Pre-Training by Up to 2.5x Across 270M to 10B Parameter Models
