I solved kv-cache

Reddit r/AI_Agents Tools

Summary

The author has open-sourced a KV-cache adapter built from their closed-source, freemium SDK, catalyst-brain. They claim it dramatically reduces RAM usage for local models, and say a variant aimed at an effectively infinite context window is in the works.

I have open-sourced a KV-cache solution... a complete solve, really. This is an adapter made from my closed-source/freemium SDK, catalyst-brain. This isn't another compression play -- this is a completely novel solution. It dramatically lowers the barrier to entry for running local, private models, as RAM will no longer explode with context. There is a variation I am working on that allows for a sort of infinite-context-window trick -- I will publish the adapter for that as well. Enjoy!!
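
For context on the "RAM explodes with context" point: in a standard transformer, the KV cache stores one key and one value vector per layer, per KV head, per token, so its size grows linearly with context length. The sketch below only illustrates that baseline growth. The post does not describe how catalyst-brain works, so nothing here reflects its method. The function name `kv_cache_bytes` and the model dimensions (32 layers, 8 KV heads, head dimension 128, fp16) are assumptions chosen to resemble a typical 7B model with grouped-query attention.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for seq_len tokens at batch size 1."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len


# Assumed, illustrative dimensions (not taken from the post).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128

for seq_len in (4096, 32768, 131072):
    gib = kv_cache_bytes(seq_len, N_LAYERS, N_KV_HEADS, HEAD_DIM) / 2**30
    print(f"{seq_len:>7} tokens -> {gib:.2f} GiB")
```

Under these assumptions each token costs 128 KiB of cache, so a 32,768-token context needs about 4 GiB on top of the model weights. That linear growth is what any KV-cache scheme has to work around.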

Similar Articles

Maybe KV cache offload to RAM isn't bad

Reddit r/LocalLLaMA

A user shares their experience offloading the KV cache to RAM in llama.cpp, achieving comparable speeds while freeing VRAM for larger models and context windows, suggesting this trade-off is often worthwhile.

Open sourcing InfiniteKV: a KV cache that files old tokens as 104-byte searchable records in RAM or on disk instead of deleting them. Mistral-7B answered from token 76,747, 2.3x past its trained window. Colab demo

Reddit r/LocalLLaMA

InfiniteKV is an open-source KV cache technique that compresses old tokens into 104-byte searchable records stored in RAM or on disk, enabling models to handle million-token contexts beyond their trained window without discarding data. Verified working with Mistral-7B and SmolLM2.

Dynamic KV Cache Quantization and Load-on-demand mmproj/MTP: my llama.cpp wishlist

Reddit r/LocalLLaMA

A developer has implemented a proof-of-concept PR for llama.cpp that adds dynamic KV cache quantization via an HTTP endpoint, allowing users to requantize their KV cache on-demand without fully reloading the model. The post also outlines a wishlist including load-on-demand mmproj/MTP swapping and an automatic --fit flag for context optimization.