The KV-cache wall: why fixed-size memory sequence models keep coming back

Reddit r/ArtificialInteligence News

Summary

Explores the growing memory bottleneck of the KV cache in transformer inference, where cached keys and values grow with every token processed, and explains why alternative architectures with fixed-size memory, such as Mamba and RWKV, are gaining renewed attention.
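To make the bottleneck concrete, here is a rough back-of-the-envelope sketch in Python. It uses the standard sizing rule for a KV cache (two tensors per layer, one for keys and one for values, each holding one vector per KV head per token). The model dimensions and the per-layer state size below are hypothetical values chosen for illustration; they are not taken from the article and do not describe Mamba, RWKV, or any specific model.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every token seen so far."""
    # Factor of 2: one tensor for keys and one for values in each layer.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def fixed_state_bytes(n_layers, state_elems_per_layer,
                      batch_size=1, bytes_per_elem=2):
    """Bytes for a recurrent-style state whose size ignores sequence length."""
    return n_layers * state_elems_per_layer * batch_size * bytes_per_elem


# Hypothetical transformer shape (assumption, for illustration only).
CONFIG = dict(n_layers=32, n_kv_heads=32, head_dim=128)

# Hypothetical per-layer state size for a fixed-memory model (assumption).
STATE_ELEMS_PER_LAYER = 4096 * 16

if __name__ == "__main__":
    state_mib = fixed_state_bytes(32, STATE_ELEMS_PER_LAYER) / 2**20
    for seq_len in (1_024, 32_768, 131_072):
        cache_gib = kv_cache_bytes(seq_len, **CONFIG) / 2**30
        print(f"{seq_len:>7} tokens: KV cache {cache_gib:6.1f} GiB, "
              f"fixed state {state_mib:.1f} MiB")
```

Under these assumptions, at 16-bit precision, the cache costs 0.5 MiB per token, so it grows linearly from 0.5 GiB at 1,024 tokens to 64 GiB at 131,072 tokens, while the fixed state stays the same size regardless of context length. Grouped-query attention, quantization, and cache eviction all lower the constant, but not the linear growth.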


Similar Articles

Models Take Notes at Prefill: KV Cache Can Be Editable and Composable

arXiv cs.LG

This paper proposes that the KV cache in transformers acts as a notebook of memoized conclusions, enabling surgical editing and composition without full recomputation. The method achieves significant latency reductions while preserving decision equivalence across model scales.