Tag
This paper introduces a grammatically guided sparse attention mechanism for Transformers, aiming to improve efficiency and interpretability by leveraging linguistic structure.
RT-Lynx proposes using activation sparsity instead of weight sparsity to accelerate diffusion models, achieving up to a 1.55× speedup in linear layers while maintaining generation quality. The paper has been accepted at ICML 2026.
This paper proposes the Two-Valued Symmetric Circulant Matrix (TVSCM), a highly sparse architecture that uses only two weights per layer to achieve over 80× parameter reduction on the MNIST and MIT-BIH arrhythmia datasets while maintaining comparable accuracy, which makes it well suited to edge and TinyML platforms.
This paper proposes φ-balancing, a principled framework for load balancing in Mixture-of-Experts models that directly targets population-level expert balance using convex duality and mirror descent, achieving more stable expert utilization and outperforming prior methods on reasoning and code generation benchmarks.
This paper analyzes neural activation patterns across six LLM architectures on cognitive tasks, revealing differences in attention entropy and sparsity between encoder and decoder models.
This paper introduces TwELL and Hybrid sparse formats with custom CUDA kernels to efficiently leverage unstructured sparsity in LLMs, achieving over 20% faster training and inference on H100 GPUs while reducing energy and memory usage.