I solved kv-cache
Summary
The author has open-sourced a novel KV-cache solution called catalyst-brain, claiming to dramatically reduce RAM usage for local models and potentially enable infinite context windows.
Similar Articles
Maybe KV cache offload to RAM isn't bad
A user shares their experience offloading the KV cache to RAM in llama.cpp, achieving comparable speeds while freeing VRAM for larger models and context windows, suggesting this trade-off is often worthwhile.
@Michaelzsguo: KV cache is the model’s working memory during generation. As the context window gets longer, the model has to keep more…
DeepSeek's KV cache compression innovations, including MLA and CSA/HCA, reduce KV cache size by 93%, enabling efficient long-context inference and SSD-based caching, as demonstrated by antirez's ds4.c project.
Open sourcing InfiniteKV: a KV cache that files old tokens as 104-byte searchable records in RAM or on disk instead of deleting them. Mistral-7B answered from token 76,747, 2.3x past its trained window. Colab demo
InfiniteKV is an open-source KV cache technique that compresses old tokens into 104-byte searchable records stored in RAM or on disk, enabling models to handle million-token contexts beyond their trained window without discarding data. Verified working with Mistral-7B and SmolLM2.
The KV-cache wall: why fixed-size memory sequence models keep coming back
Explores the growing memory bottleneck of KV-cache in transformer inference, explaining why alternative architectures with fixed-size memory like Mamba and RWKV are gaining renewed attention.
Dynamic KV Cache Quantization and Load-on-demand mmproj/MTP: my llama.cpp wishlist
A developer has implemented a proof-of-concept PR for llama.cpp that adds dynamic KV cache quantization via an HTTP endpoint, allowing users to requantize their KV cache on-demand without fully reloading the model. The post also outlines a wishlist including load-on-demand mmproj/MTP swapping and an automatic --fit flag for context optimization.