Tag
This paper systematically analyzes the role of efficient attention modules in hybrid language model architectures. It finds that different designs converge in long-context performance under sufficient training, and that long-range retrieval is carried primarily by full attention, while efficient attention shapes the optimization trajectory. The analysis also reveals a 'Large-Window Laziness' phenomenon.