Tag
A new paper from Alibaba and Nanjing University introduces RTPurbo, a method that speeds up million-token prefill by up to 9.36x compared to FlashAttention-2 by selectively applying full attention only where needed, without retraining the model.
RTPurbo converts full-attention LLMs into sparse models with only a few hundred training steps, achieving near-lossless accuracy and up to 9.36x prefill and 2.01x decode speedups.