RT-Lynx: Putting the GEMM Sparsity In a Right Way for Diffusion Models

Hugging Face Daily Papers 05/26/26, 12:00 AM Papers

diffusion-models sparsity activation-sparsity gemm acceleration transformers icml-2026

Summary

RT-Lynx accelerates diffusion models by applying N:M semi-structured sparsity to activations rather than weights, reaching up to 1.55× average linear-layer speedup while preserving generation quality. The paper was accepted at ICML 2026.

Diffusion Transformers (DiT) achieve strong performance in image generation but incur substantial inference costs. While prior work has reduced this cost via quantization and distillation, semi-structured sparsity, which can nearly halve FLOPs, remains underexplored. A key reason is that most existing approaches focus on weight sparsification, and pruning 50% of the weights can remove critical model capacity and degrade generation quality. Our study, however, shows that DiT activations are intrinsically sparse and significantly more robust to N:M semi-structured sparsification than weights. Motivated by this observation, we advocate a paradigm shift from weight sparsification to activation sparsification. We propose RT-Lynx, which applies N:M sparsification to activations and incorporates error-compensation techniques to mitigate accuracy loss. We further implement highly optimized CUDA kernels tailored to this setting, achieving up to a 1.55x speedup on average in linear layers. Extensive experiments across multiple diffusion models demonstrate that our method preserves the generation quality of the original models while substantially accelerating inference.

Original Article


Cached at: 05/27/26, 02:47 AM

Paper page - RT-Lynx: Putting the GEMM Sparsity In a Right Way for Diffusion Models

Source: https://huggingface.co/papers/2605.26632

👋 Hi everyone! We’re excited to share our ICML 2026 work, RT-Lynx: Putting GEMM Sparsity in the Right Place for Diffusion Models.

Semi-structured sparsity has the potential to nearly halve GEMM FLOPs, but applying it to diffusion models remains challenging: conventional weight sparsification often removes critical generative capacity and causes visible quality degradation.

We revisit this problem and find that, unlike weights, DiT activations are intrinsically sparse and significantly more robust to 2:4 semi-structured sparsity. This suggests that activation sparsity is a better target than weight sparsity for accelerating Diffusion Transformers.

Based on this observation, we propose RT-Lynx, which shifts the sparsification target from weights to activations. It combines online activation sparsification with norm-based compensation and a lightweight LoRA branch to recover fine-grained visual details.

To make this practically efficient, we further design optimized CUDA kernels that fuse sparsification, compression, and sparse Tensor Core computation into a unified inference pipeline.
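To make the 2:4 pattern concrete, here is a minimal pure-Python sketch of magnitude-based N:M sparsification applied to one activation row: in every group of 4 values, the 2 with the largest magnitude are kept and the rest are zeroed. The function names (`sparsify_n_m`, `rescale_to_norm`) and the sample values are hypothetical. The post does not spell out how its norm-based compensation works, so the rescaling step below is only one plausible illustration (matching the row's L2 norm), not the authors' method, and it has nothing to do with the fused CUDA kernels.

```python
from typing import List


def sparsify_n_m(row: List[float], n: int = 2, m: int = 4) -> List[float]:
    """Keep the n largest-magnitude values in every group of m; zero the rest."""
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    out: List[float] = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Indices of the n entries with the largest absolute value.
        # sorted() is stable, so ties are broken in favor of the lower index.
        keep = set(sorted(range(m), key=lambda i: -abs(group[i]))[:n])
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


def l2_norm(row: List[float]) -> float:
    return sum(v * v for v in row) ** 0.5


def rescale_to_norm(sparse: List[float], target_norm: float) -> List[float]:
    """ASSUMPTION: an illustrative compensation that restores the L2 norm.

    The source only says "norm-based compensation"; this is a stand-in.
    """
    norm = l2_norm(sparse)
    if norm == 0.0:
        return list(sparse)
    scale = target_norm / norm
    return [v * scale for v in sparse]


# Hypothetical activation row with two groups of four values.
activations = [0.9, -0.05, 0.02, -1.3, 0.0, 0.4, -0.1, 0.03]
sparse = sparsify_n_m(activations)
compensated = rescale_to_norm(sparse, l2_norm(activations))
```

Because the mask depends on the activation values, it has to be computed at inference time for every input, which is why the post calls the sparsification "online" and fuses it into the kernel pipeline.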

Across Qwen-Image, FLUX.1-dev, and Z-Image, RT-Lynx preserves generation quality while achieving around 1.2× end-to-end speedup and up to 1.55× average linear-layer acceleration.
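The gap between the linear-layer and end-to-end figures is what Amdahl's law would predict when only part of the runtime is accelerated. The sketch below is a back-of-the-envelope illustration, not something reported in the post: it assumes the two quoted numbers describe the same run, which the source does not state, and the function names are hypothetical.

```python
def end_to_end_speedup(linear_fraction: float, linear_speedup: float) -> float:
    """Amdahl's law: overall speedup when only the linear layers get faster."""
    if not 0.0 <= linear_fraction <= 1.0:
        raise ValueError("linear_fraction must be in [0, 1]")
    if linear_speedup <= 0.0:
        raise ValueError("linear_speedup must be positive")
    return 1.0 / ((1.0 - linear_fraction) + linear_fraction / linear_speedup)


def implied_linear_fraction(total_speedup: float, linear_speedup: float) -> float:
    """Invert Amdahl's law to get the runtime share of the accelerated part."""
    return (1.0 - 1.0 / total_speedup) / (1.0 - 1.0 / linear_speedup)


# ASSUMPTION: plugging in the two figures quoted above as if they were paired.
fraction = implied_linear_fraction(total_speedup=1.2, linear_speedup=1.55)
check = end_to_end_speedup(fraction, 1.55)
```

Under that assumption, `fraction` comes out at roughly 0.47, meaning linear layers would account for a little under half of the baseline runtime; attention and other non-GEMM work would then bound the end-to-end gain.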

We hope this work highlights activation sparsity as a more suitable and hardware-friendly direction for accelerating modern Diffusion Transformers. Feedback is very welcome!


Similar Articles

@_yucheng_lu: MTP makes autoregressive LLMs fast. Can the same trick work for diffusion LMs? Had a fun collaboration with @modal expl…

Accelerating Disaggregated RL for Visual Generative LLMs with Diffusion-Based Parallelism and Trainer-Assisted Generation

$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction

LaCache: Exact Caching and Precision-Adaptive Inference for Diffusion Large Language Models

Neuromorphic Diffusion Language Models: Addressing Compute and Memory Bottlenecks via Sparsity and Block Denoising
