RT-Lynx proposes using activation sparsity instead of weight sparsity to accelerate diffusion models, achieving up to 1.55× linear-layer speedup while maintaining generation quality. The paper has been accepted at ICML 2026.
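To illustrate why zeros in the activations (rather than in the weights) can reduce the work in a linear layer, here is a minimal sketch in plain Python. It is not RT-Lynx's kernel or method, which the summary does not describe. The function names and the toy matrix are hypothetical and chosen only for illustration.

```python
def dense_matvec(W, x):
    """Reference linear layer y = W x, touching every weight."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def sparse_activation_matvec(W, x):
    """Same product, but skipping columns of W whose activation is exactly zero.

    Returns the output vector and the number of columns actually used.
    (Illustrative assumption: real kernels would exploit structured sparsity
    on hardware rather than a Python loop.)
    """
    # Only non-zero activations contribute to the output.
    active = [j for j, x_j in enumerate(x) if x_j != 0.0]
    y = [sum(row[j] * x[j] for j in active) for row in W]
    return y, len(active)


# Toy example: half of the input activations are zero.
W = [[1.0, 2.0, 3.0, 4.0],
     [0.5, -1.0, 2.0, 0.0]]
x = [0.0, 3.0, 0.0, 1.0]

y_sparse, n_active = sparse_activation_matvec(W, x)
```

In this toy case only two of the four weight columns are read, and the result matches the dense product. Any practical speedup depends on how the sparsity pattern maps onto the hardware.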
This paper formally proves that training neural networks with asymmetric activation functions like ReLU, GELU, or SiLU causes weights to drift negative, leading to up to 90% activation sparsity. It also shows that squared activations like ReLU² improve performance but cause activation spikes, which can be fixed by clipping, with GELU² achieving the best validation loss.
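The activations named above can be sketched as follows, together with a simple way to measure activation sparsity. This is an illustrative sketch under stated assumptions: the squared variants are taken to be the elementwise square of the base activation, the clipping is applied to the activation output, and the clip limit of 6.0 is a hypothetical value. The summary does not specify the paper's exact definitions.

```python
import math


def relu(x):
    return max(0.0, x)


def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def silu(x):
    return x / (1.0 + math.exp(-x))


def squared(act):
    """Assumed form of a squared activation: act(x) ** 2 (e.g. ReLU^2, GELU^2)."""
    return lambda x: act(x) ** 2


def clipped(act, limit):
    """Cap the activation output at `limit` to bound large spikes (assumed form)."""
    return lambda x: min(act(x), limit)


def sparsity(values, tol=0.0):
    """Fraction of values whose magnitude is at most `tol`."""
    return sum(1 for v in values if abs(v) <= tol) / len(values)


relu2 = squared(relu)
clipped_relu2 = clipped(relu2, 6.0)  # 6.0 is a hypothetical clip limit

pre_activations = [-2.0, -0.5, 0.0, 1.0, 4.0]
relu2_outputs = [relu2(v) for v in pre_activations]
relu2_sparsity = sparsity(relu2_outputs)
```

With these toy inputs, three of the five ReLU² outputs are exactly zero, and the input 4.0 produces 16.0 without clipping but 6.0 with it. Note that GELU and SiLU outputs are small but rarely exactly zero for negative inputs, so measuring their sparsity would need a non-zero `tol`.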