scaling-behavior

Rethinking the Role of Efficient Attention in Hybrid Architectures

arXiv cs.CL · 15h ago

This paper systematically analyzes the role of efficient attention modules in hybrid language model architectures. It finds that different designs converge in long-context performance given sufficient training, and that long-range retrieval is carried primarily by full attention, while efficient attention shapes the optimization trajectory. The analysis also reveals a phenomenon the authors call 'Large-Window Laziness'.
