This paper introduces sparse prefix caching for hybrid and recurrent LLMs, which stores recurrent states at a limited set of checkpoint positions to avoid dense caching while minimizing recomputation. The method outperforms standard heuristics on real-world data, especially when requests share substantial but non-identical prefixes.
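The idea can be illustrated with a minimal sketch: snapshot the recurrent state only at sparse checkpoint positions during prefill, then, for a new request, resume from the longest cached checkpoint that matches a prefix of its tokens and recompute only the tail. The function names, cache layout, and checkpoint stride below are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of sparse prefix caching for recurrent states.
# `step(state, token)` stands in for one step of a hypothetical recurrent/SSM layer;
# the stride, cache keys, and lookup strategy are assumptions for illustration.
from typing import Dict, List, Tuple

def prefill_with_checkpoints(tokens: List[int], step, init_state,
                             stride: int = 256) -> Dict[Tuple[int, ...], object]:
    """Run the recurrence over `tokens`, storing the state only every `stride`
    positions instead of at every token (dense caching)."""
    cache: Dict[Tuple[int, ...], object] = {}
    state = init_state
    for i, tok in enumerate(tokens, start=1):
        state = step(state, tok)
        if i % stride == 0:
            cache[tuple(tokens[:i])] = state  # keyed by the exact prefix
    return cache

def resume_from_cache(tokens: List[int], cache, step, init_state):
    """Find the longest cached checkpoint that is a prefix of `tokens`,
    then recompute only the uncached suffix."""
    best_len, state = 0, init_state
    for prefix, cached_state in cache.items():
        n = len(prefix)
        if n > best_len and tuple(tokens[:n]) == prefix:
            best_len, state = n, cached_state
    for tok in tokens[best_len:]:  # only the tail past the checkpoint is recomputed
        state = step(state, tok)
    return state
```

With a large stride the cache stays small, while requests that share a long prefix still skip most of the prefill work, which is the trade-off the paper's checkpoint placement aims to optimize.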
The article introduces Raven, a new State Space Model (SSM) with selective memory allocation that achieves state-of-the-art performance on recall tasks and demonstrates superior length generalization compared to existing approaches such as sliding-window attention (SWA).
Researchers from MIT CSAIL and other institutions introduced CompreSSM, a technique that compresses state-space AI models during training by removing unnecessary components early, resulting in faster training and smaller models without sacrificing performance.
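To make the "remove unnecessary components early" idea concrete, the sketch below prunes low-importance state dimensions from an SSM layer after a brief warm-up and then continues training the smaller model. The layer structure, the magnitude-based importance score, and the pruning schedule are assumptions for illustration; CompreSSM's actual criterion may differ.

```python
# Sketch of pruning state dimensions from a state-space layer early in training.
# The importance score and schedule are illustrative assumptions, not CompreSSM's rule.
import torch
import torch.nn as nn

class SSMLayer(nn.Module):
    def __init__(self, d_model: int, d_state: int):
        super().__init__()
        # Diagonal state transition plus input/output projections per state dimension.
        self.A = nn.Parameter(torch.randn(d_state))
        self.B = nn.Parameter(torch.randn(d_state, d_model))
        self.C = nn.Parameter(torch.randn(d_model, d_state))

    def prune_states(self, keep: int) -> None:
        """Drop the lowest-importance state dimensions, shrinking the layer."""
        score = self.B.abs().sum(dim=1) * self.C.abs().sum(dim=0)  # per-state importance
        idx = score.topk(keep).indices.sort().values
        self.A = nn.Parameter(self.A.data[idx])
        self.B = nn.Parameter(self.B.data[idx])
        self.C = nn.Parameter(self.C.data[:, idx])

# Early in training (e.g. after a few warm-up epochs), shrink the state once
# and train the smaller model for the remaining steps.
layer = SSMLayer(d_model=64, d_state=128)
layer.prune_states(keep=32)
```

Pruning early, rather than after convergence, is what yields the reported savings in training time as well as model size.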