I solved kv-cache
Summary
The author has open-sourced a KV-cache solution called catalyst-brain, claiming that it dramatically reduces RAM usage for local models and could potentially enable effectively infinite context windows.
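For context, the RAM pressure such projects target comes from the KV cache itself, whose size grows linearly with context length. A rough back-of-the-envelope sketch (the model dimensions below are illustrative, not those of any specific model or of catalyst-brain):

```python
# Rough KV-cache size estimate: 2 (K and V) * layers * kv_heads * head_dim
# * context_length * bytes_per_element. All numbers below are illustrative only.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Example: a hypothetical 8B-class model with 32 layers, 8 KV heads, head_dim 128,
# an fp16 cache, and a 128K-token context.
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=128_000)
print(f"{size / 2**30:.1f} GiB")  # ~15.6 GiB for the cache alone
```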
Similar Articles
KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs
KV Packet proposes a recomputation-free cache reuse framework for LLMs that uses trainable soft-token adapters to bridge context discontinuities, eliminating recomputation overhead while maintaining performance comparable to full-recomputation baselines on Llama-3.1 and Qwen2.5.
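A minimal sketch of the general idea behind context-independent KV reuse: precompute the cache for each chunk in isolation, then stitch the cached segments together with a few trainable "bridge" entries at the boundaries. The tensor shapes and the shared bridge parameter here are illustrative assumptions, not the paper's actual interface:

```python
import torch

# Illustrative per-layer cache shape: [2 (K and V), n_kv_heads, seq_len, head_dim].
def stitch_chunks(chunk_caches, bridge_kv):
    """Concatenate independently precomputed per-chunk KV caches, inserting
    trainable bridge KV entries at each chunk boundary (a rough stand-in
    for soft-token adapters)."""
    pieces = []
    for i, cache in enumerate(chunk_caches):
        pieces.append(cache)
        if i < len(chunk_caches) - 1:
            pieces.append(bridge_kv)        # learned, shared across boundaries here
    return torch.cat(pieces, dim=2)         # concatenate along the sequence axis

# Toy example: three cached chunks of 64 positions, bridged by 4 learned positions each.
chunks = [torch.randn(2, 4, 64, 16) for _ in range(3)]
bridge = torch.nn.Parameter(torch.randn(2, 4, 4, 16))
merged = stitch_chunks(chunks, bridge)      # shape [2, 4, 200, 16]
```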
ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing
This paper introduces ReST-KV, a novel method for robust KV cache eviction in large language models that uses layer-wise output reconstruction and spatial-temporal smoothing to improve efficiency. The method significantly reduces decoding latency and outperforms state-of-the-art baselines on long-context benchmarks like LongBench and RULER.
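As background on what KV-cache eviction means in practice, here is a generic score-and-keep sketch. The cumulative-attention scoring used below is a common heuristic chosen for illustration, not ReST-KV's reconstruction-based criterion:

```python
import torch

def evict_kv(keys, values, attn_weights, budget):
    """Keep only `budget` cached positions per head, scored by how much attention
    recent queries paid to them (a generic heuristic, not ReST-KV's method)."""
    # keys/values: [n_heads, seq_len, head_dim]; attn_weights: [n_heads, n_queries, seq_len]
    scores = attn_weights.sum(dim=1)                 # [n_heads, seq_len]
    keep = scores.topk(budget, dim=-1).indices       # positions to retain, per head
    keep, _ = keep.sort(dim=-1)                      # preserve original token order
    idx = keep.unsqueeze(-1).expand(-1, -1, keys.shape[-1])
    return keys.gather(1, idx), values.gather(1, idx)

# Toy example: 8 heads, 1024 cached positions, budget of 256.
k, v = torch.randn(8, 1024, 64), torch.randn(8, 1024, 64)
attn = torch.rand(8, 16, 1024)                       # attention from the last 16 queries
k_small, v_small = evict_kv(k, v, attn, budget=256)  # each now [8, 256, 64]
```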
OjaKV: Context-Aware Online Low-Rank KV Cache Compression
OjaKV introduces a context-aware online low-rank KV cache compression framework that uses hybrid storage and Oja's algorithm for incremental subspace adaptation to reduce GPU memory bottlenecks in long-context LLM inference without model fine-tuning.
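Oja's rule itself is a classic online PCA update. A minimal sketch of using it to maintain a low-rank basis for incoming key vectors follows; the rank, learning rate, and the idea of storing projected coefficients are illustrative assumptions, not OjaKV's published details:

```python
import torch

def oja_step(W, x, lr=1e-3):
    """One Oja-style update nudging the basis W: [dim, rank] toward the
    principal subspace of streaming vectors x: [dim]."""
    y = W.T @ x                                      # project onto current basis, [rank]
    W = W + lr * (torch.outer(x, y) - W @ torch.outer(y, y))
    Q, _ = torch.linalg.qr(W)                        # re-orthonormalize for stability
    return Q

# Maintain a rank-16 basis over 128-dim key vectors; store compressed coefficients.
dim, rank = 128, 16
W = torch.linalg.qr(torch.randn(dim, rank))[0]
compressed = []
for _ in range(1000):                                # stream of incoming key vectors
    k = torch.randn(dim)
    W = oja_step(W, k)
    compressed.append(W.T @ k)                       # keep rank coefficients, not the full key
```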
Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction
This paper introduces a learned global retention-based KV cache eviction method that improves long-context reasoning by selectively retaining useful tokens and reducing attention dilution, while significantly lowering memory usage.
TTKV: Temporal-Tiered KV Cache for Long-Context LLM Inference
TTKV introduces a temporal-tiered KV cache that mimics human memory to cut 128K-context LLM inference latency by 76% and double throughput, while reducing cross-tier traffic by 5.94×.
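The temporal-tier idea maps naturally onto a small data structure that keeps recent KV entries in fast memory and demotes older ones. A minimal sketch under assumed tier names, capacities, and a FIFO demotion policy, not TTKV's actual design:

```python
from collections import deque

class TieredKVCache:
    """Toy two-tier cache: a small 'hot' tier of recent positions and a larger
    'cold' tier for demoted ones (placeholder for e.g. CPU or compressed storage).
    Tier sizes and the FIFO demotion policy are illustrative assumptions."""
    def __init__(self, hot_capacity=4096):
        self.hot = deque()               # (position, kv_entry), newest on the right
        self.cold = {}                   # position -> kv_entry
        self.hot_capacity = hot_capacity

    def append(self, position, kv_entry):
        self.hot.append((position, kv_entry))
        while len(self.hot) > self.hot_capacity:
            pos, kv = self.hot.popleft() # demote the oldest hot entry
            self.cold[pos] = kv          # cross-tier traffic happens here

    def get(self, position):
        for pos, kv in self.hot:
            if pos == position:
                return kv
        return self.cold.get(position)   # slower path: fetch from the cold tier
```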