hybrid-architectures

Rethinking the Role of Efficient Attention in Hybrid Architectures

arXiv cs.CL · 17h ago

This paper systematically analyzes the role of efficient attention modules in hybrid language model architectures. It finds that different designs converge in long-context performance given sufficient training, and that long-range retrieval is carried primarily by full attention while efficient attention shapes the optimization trajectory. The analysis also reveals a 'Large-Window Laziness' phenomenon.

Sparse Prefix Caching for Hybrid and Recurrent LLM Serving

arXiv cs.LG · 2026-05-08

This paper introduces sparse prefix caching for hybrid and recurrent LLMs, which stores recurrent states at a limited set of checkpoint positions to avoid dense caching while minimizing recomputation. The method outperforms standard heuristics on real-world data, especially when requests share substantial but non-identical prefixes.
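The checkpointing idea in this summary can be illustrated with a toy sketch. It is not the paper's method: the fixed-stride checkpoint placement, the integer stand-in for a recurrent state, and the names `step`, `SparsePrefixCache`, and `stride` are all assumptions made for illustration. The summary does not say how the paper actually chooses checkpoint positions.

```python
from typing import Dict, List, Tuple

State = int  # hypothetical stand-in; a real recurrent state would be a tensor


def step(state: State, token: int) -> State:
    """Toy recurrent update (assumption); a real model runs its recurrent layer here."""
    return (state * 31 + token) % 1_000_003


class SparsePrefixCache:
    """Keeps recurrent states only at every `stride`-th position (assumed policy)."""

    def __init__(self, stride: int) -> None:
        self.stride = stride
        # Maps a token prefix to the state reached after consuming it.
        self.checkpoints: Dict[Tuple[int, ...], State] = {}

    def run(self, tokens: List[int]) -> Tuple[State, int]:
        """Return the final state and the number of tokens that were recomputed."""
        start, state = 0, 0
        # Search from the longest candidate checkpoint position down to the shortest.
        longest = len(tokens) // self.stride * self.stride
        for pos in range(longest, 0, -self.stride):
            cached = self.checkpoints.get(tuple(tokens[:pos]))
            if cached is not None:
                start, state = pos, cached
                break
        # Recompute only the suffix after the checkpoint, saving new checkpoints.
        for i in range(start, len(tokens)):
            state = step(state, tokens[i])
            if (i + 1) % self.stride == 0:
                self.checkpoints[tuple(tokens[: i + 1])] = state
        return state, len(tokens) - start


cache = SparsePrefixCache(stride=4)
first = list(range(10))
_, recomputed_first = cache.run(first)      # cold cache: all 10 tokens computed
second = first[:6] + [99, 98, 97]           # shares a 6-token prefix with `first`
_, recomputed_second = cache.run(second)    # resumes from the checkpoint at position 4
print(recomputed_first, recomputed_second)  # prints "10 5"
```

The second request shares six tokens with the first but can only resume from position 4, the nearest stored checkpoint, so five of its nine tokens are recomputed. This is the trade-off the summary describes: fewer stored states in exchange for some recomputation. A production system would likely key checkpoints by a prefix hash or a trie rather than by full token tuples.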
