RT-Lynx: Putting the GEMM Sparsity In a Right Way for Diffusion Models
Summary
RT-Lynx accelerates Diffusion Transformers by applying 2:4 semi-structured sparsity to activations rather than weights, achieving around 1.2× end-to-end speedup and up to 1.55× average linear-layer speedup while preserving generation quality. The paper was accepted at ICML 2026.
Cached at: 05/27/26, 02:47 AM
Source: https://huggingface.co/papers/2605.26632
👋 Hi everyone! We’re excited to share our ICML 2026 work, RT-Lynx: Putting GEMM Sparsity in the Right Place for Diffusion Models.
Semi-structured sparsity has the potential to nearly halve GEMM FLOPs, but applying it to diffusion models remains challenging: conventional weight sparsification often removes critical generative capacity and causes visible quality degradation.
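To make the "nearly halve" claim concrete, here is a minimal pure-Python sketch of the 2:4 pattern: in every group of four contiguous values along the reduction dimension, the two with the largest magnitude are kept and the other two are zeroed. The helper names `prune_2_4` and `gemm_macs` are hypothetical and are not taken from the RT-Lynx code; the magnitude-based selection rule is the common choice for 2:4 pruning and is assumed here.

```python
def prune_2_4(row):
    """Keep the 2 largest-magnitude values in every group of 4; zero the rest."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest |values|; ties resolve to the lower index.
        keep = sorted(range(4), key=lambda i: (-abs(group[i]), i))[:2]
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


def gemm_macs(m, k, n, density=1.0):
    """Multiply-accumulate count for an (m x k) @ (k x n) product.

    density is the fraction of the reduction dimension that is actually
    multiplied; 0.5 models one operand stored in 2:4 form.
    """
    return int(m * k * n * density)


row = [0.9, -0.1, 0.05, -1.2, 0.0, 0.3, -0.2, 0.01]
sparse_row = prune_2_4(row)

dense = gemm_macs(4096, 4096, 4096)
sparse = gemm_macs(4096, 4096, 4096, density=0.5)
print(sparse_row)
print(dense // sparse)
```

The count is an upper bound on the saving: it ignores the index metadata that records which two positions survive in each group and, for activations, the cost of choosing those positions at inference time.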
We revisit this problem and find that, unlike weights, DiT activations are intrinsically sparse and significantly more robust to 2:4 semi-structured sparsity. This suggests that activation sparsity is a better target than weight sparsity for accelerating Diffusion Transformers.
Based on this observation, we propose RT-Lynx, which shifts the sparsification target from weights to activations. It combines online activation sparsification with norm-based compensation and a lightweight LoRA branch to recover fine-grained visual details.
To make this practically efficient, we further design optimized CUDA kernels that fuse sparsification, compression, and sparse Tensor Core computation into a unified inference pipeline.
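The three ingredients named above (online activation sparsification, norm-based compensation, and a LoRA branch) can be sketched for a single token vector as follows. This is an illustration under stated assumptions, not the authors' method: the post does not define the compensation, so the sketch assumes it rescales the pruned activation to match the L2 norm of the dense one, and it assumes the LoRA branch reads the dense activation. All function names are hypothetical, and the real system runs these steps in fused CUDA kernels rather than Python loops.

```python
import math


def prune_2_4(x):
    """Zero the 2 smallest-magnitude values in every group of 4."""
    out = []
    for start in range(0, len(x), 4):
        group = x[start:start + 4]
        keep = sorted(range(len(group)), key=lambda i: (-abs(group[i]), i))[:2]
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


def l2_norm(v):
    return math.sqrt(sum(a * a for a in v))


def compensate(x, x_sparse):
    """ASSUMPTION: rescale x_sparse so its L2 norm equals that of x."""
    norm = l2_norm(x_sparse)
    if norm == 0.0:
        return list(x_sparse)
    scale = l2_norm(x) / norm
    return [scale * a for a in x_sparse]


def matvec(W, x):
    return [sum(w * a for w, a in zip(row, x)) for row in W]


def sparse_linear(W, lora_A, lora_B, x):
    """y = W @ compensate(prune(x)) + lora_B @ (lora_A @ x)."""
    x_hat = compensate(x, prune_2_4(x))
    main = matvec(W, x_hat)                        # the GEMM that 2:4 hardware would accelerate
    low_rank = matvec(lora_B, matvec(lora_A, x))   # small dense rank-r correction
    return [m + c for m, c in zip(main, low_rank)]


x = [3.0, 0.1, -0.2, 4.0]
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
lora_A = [[0.5, 0.5, 0.5, 0.5]]   # rank 1
lora_B = [[0.0], [0.0]]           # zero init, so the branch starts as a no-op

x_hat = compensate(x, prune_2_4(x))
y = sparse_linear(W, lora_A, lora_B, x)
print(x_hat, y)
```

With `lora_B` at zero the output equals the compensated sparse product; training the low-rank factors is what would let the branch restore detail lost to pruning.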
Across Qwen-Image, FLUX.1-dev, and Z-Image, RT-Lynx preserves generation quality while achieving around 1.2× end-to-end speedup and up to 1.55× average linear-layer acceleration.
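The gap between the linear-layer and end-to-end figures is what Amdahl's law predicts when only part of the runtime is accelerated. The back-of-envelope sketch below treats the 1.55× figure as if it applied uniformly to all accelerated linear layers, which is an assumption, and solves for the share of runtime those layers would need to occupy to give 1.2× overall. The resulting fraction is an estimate derived from the two quoted speedups, not a number reported by the authors; both function names are hypothetical.

```python
def end_to_end_speedup(linear_fraction, linear_speedup):
    """Amdahl's law: only linear_fraction of the runtime gets faster."""
    return 1.0 / ((1.0 - linear_fraction) + linear_fraction / linear_speedup)


def implied_linear_fraction(total_speedup, linear_speedup):
    """Invert Amdahl's law for the accelerated share of runtime."""
    return (1.0 - 1.0 / total_speedup) / (1.0 - 1.0 / linear_speedup)


fraction = implied_linear_fraction(total_speedup=1.2, linear_speedup=1.55)
print(f"implied accelerated share of runtime: {fraction:.2f}")
print(f"check: {end_to_end_speedup(fraction, 1.55):.2f}x end to end")
```

Under these assumptions, a bit under half of the baseline runtime would sit in the accelerated linear layers, with the remainder (attention, normalization, and other unaccelerated work) bounding the overall gain.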
We hope this work highlights activation sparsity as a more suitable and hardware-friendly direction for accelerating modern Diffusion Transformers. Feedback is very welcome!
Similar Articles
$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction
R²-dLLM introduces spatio-temporal redundancy reduction techniques that cut diffusion LLM decoding steps by up to 75% while preserving generation quality, addressing a key deployment bottleneck.
Supportive Token Revealing for Fast Diffusion Language Model Decoding
This paper proposes AXON, a training-free module that improves the quality-latency trade-off of discrete diffusion language model decoding by intelligently selecting 'anchor' tokens to reveal first, using attention, uncertainty, and confidence signals to support subsequent denoising steps. Experiments on reasoning and code-generation benchmarks show AXON reduces function evaluations while maintaining or improving accuracy.
@_akhaliq: SpenseGPT: Practical One-shot Pruning Enabling Sparse and Dense GEMMs for LLM Inference
SpenseGPT introduces a practical one-shot pruning method for LLMs that enables both sparse and dense GEMMs during inference, improving efficiency.
Efficient Diffusion LLMs via Temporal-Spatial Parallel Decoding and Confidence Extrapolation
This paper introduces Temporal-Spatial Parallel Decoding (TSPD) and Confidence Extrapolation (CE) to accelerate inference in diffusion-based large language models by dynamically deciding when tokens have converged and forecasting logit trends, reducing unnecessary denoising steps while preserving output quality.
Bug or Feature^2: Weight Drift, Activation Sparsity, and Spikes
This paper formally proves that training neural networks with asymmetric activation functions like ReLU, GELU, or SiLU causes weights to drift negative, leading to up to 90% activation sparsity. It also shows that squared activations like ReLU² improve performance but cause activation spikes, which can be fixed by clipping, with GELU² achieving the best validation loss.