Redis creator antirez released an open-source project called ds4, a DeepSeek V4 Flash local inference engine optimized for Mac Metal, featuring disk KV caching, ultra-long context, and excellent performance.
A Stanford lecture on AI inference emphasizes practical bottlenecks such as the KV cache and techniques such as speculative decoding and continuous batching, offering more real-world insight than typical ML courses.
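A minimal sketch of the speculative-decoding idea the lecture covers: a small draft model proposes several tokens cheaply, the large target model verifies them, and only agreed-upon tokens are kept. The function names and greedy acceptance rule below are illustrative assumptions, not taken from the lecture.

```python
def speculative_step(target_next, draft_next, tokens, k=4):
    """Greedy speculative decoding sketch (illustrative only).

    `target_next` / `draft_next` are hypothetical callables that take a
    token list and return that model's greedy next-token id.
    """
    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposed, ctx = [], list(tokens)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. Target model verifies the proposals; a real engine does this in
    #    one batched forward pass, written per position here for clarity.
    accepted, ctx = [], list(tokens)
    for tok in proposed:
        if target_next(ctx) != tok:   # first disagreement stops acceptance
            break
        accepted.append(tok)
        ctx.append(tok)

    # 3. Always emit at least one token from the target so decoding advances.
    accepted.append(target_next(ctx))
    return accepted
```

When the draft model agrees with the target often, several tokens are emitted per expensive target pass, which is where the latency win comes from.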
This paper introduces a fixed-contract diagnostic tool to analyze why KV cache compression methods succeed or fail in long-context LLM inference. It identifies three failure modes (missing evidence, scoring irrelevant tokens, and breaking related evidence) and evaluates compression methods against them on LongBench and NeedleBench.
This paper introduces ReST-KV, a novel method for robust KV cache eviction in large language models that uses layer-wise output reconstruction and spatial-temporal smoothing to improve efficiency. The method significantly reduces decoding latency and outperforms state-of-the-art baselines on long-context benchmarks like LongBench and RULER.
Orthrus is a dual-architecture framework that combines autoregressive LLMs with diffusion models for fast parallel token generation while maintaining exact inference fidelity via shared KV caches and consensus mechanisms, achieving up to 7.8x speedup.
Comparison of hardware requirements for running multi-agent AI workflows locally, highlighting VRAM and KV Cache constraints.
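As a rough illustration of the KV-cache constraint such comparisons hinge on, here is a back-of-the-envelope estimate; the model shape, precision, and context length below are assumed examples, not figures from the article.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_len, batch,
                   bytes_per_elem=2):
    """Approximate KV cache size: one K and one V vector per layer per token."""
    return 2 * layers * kv_heads * head_dim * context_len * batch * bytes_per_elem

# Example: a hypothetical 7B-class model with grouped-query attention,
# an fp16 cache, and a single 32K-token conversation.
gib = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                     context_len=32_768, batch=1) / 2**30
print(f"~{gib:.1f} GiB of KV cache")   # ~4.0 GiB on top of the weights
```

Every concurrent agent or conversation adds its own cache, so multi-agent workflows multiply this figure, which is why VRAM rather than compute is often the binding constraint locally.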
This paper introduces Louver, a novel index structure for KV cache retrieval that reformulates sparse attention as a range searching problem, guaranteeing zero false negatives and improving efficiency over existing methods.
The paper introduces WiCER, an iterative algorithm for compiling domain knowledge into LLM Wiki systems to minimize information loss and catastrophic failure rates during knowledge distillation. It demonstrates that this approach improves upon full-context KV cache inference by preserving critical facts better than blind compilation methods.
This paper introduces LKV, a method for end-to-end learning of head-wise budgets and token selection to optimize KV cache eviction in large language models, achieving state-of-the-art performance with high compression rates.
This paper introduces RateQuant, a method for optimal mixed-precision KV cache quantization that uses rate-distortion theory to address distortion model mismatch. It significantly reduces perplexity compared to existing methods like KIVI and QuaRot with minimal calibration overhead.
The author has open-sourced a novel KV-cache solution called catalyst-brain, claiming to dramatically reduce RAM usage for local models and potentially enable infinite context windows.
Luce DFlash has achieved a 10-15% speedup by implementing per-layer K/V truncation in the draft graph for SWA layers.
This paper introduces a learned global retention-based KV cache eviction method that improves long-context reasoning by selectively retaining useful tokens and reducing attention dilution, while significantly lowering memory usage.
This article explains the concept of KV Cache in Large Language Models, detailing how it optimizes text generation by storing and reusing key-value pairs to avoid redundant computations during inference.
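A minimal sketch of the reuse the article describes: keys and values for past tokens are computed once and cached, and each decoding step only appends the new token's K/V instead of recomputing attention inputs for the whole sequence. Shapes and names below are illustrative, single-head, and unbatched.

```python
import numpy as np

def attend_with_cache(q_new, k_new, v_new, cache):
    """One decode step with a KV cache.

    q_new, k_new, v_new: (head_dim,) projections of the newly generated token.
    cache: dict holding K and V for all previous tokens, each (seq, head_dim).
    Past K/V are reused, not recomputed -- that is the optimization.
    """
    cache["K"] = np.vstack([cache["K"], k_new[None, :]])
    cache["V"] = np.vstack([cache["V"], v_new[None, :]])

    scores = cache["K"] @ q_new / np.sqrt(q_new.shape[0])   # (seq,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                 # softmax
    return weights @ cache["V"]                              # (head_dim,)

# Usage: start with an empty cache and append one token per step.
d = 64
cache = {"K": np.empty((0, d)), "V": np.empty((0, d))}
out = attend_with_cache(np.random.randn(d), np.random.randn(d),
                        np.random.randn(d), cache)
```

The trade-off is exactly the one the rest of this digest revolves around: the cache grows linearly with sequence length, so long contexts turn it into the dominant memory cost.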
Lecture notes from an Efficient AI course covering Transformer and LLM fundamentals, including multi-head attention, positional encoding, KV cache, and the connection between model architecture and inference efficiency. The content explains how design choices in transformers affect memory, latency, and hardware efficiency.
Modular published a blog post explaining why traditional HTTP routing doesn't work for LLM inference workloads. The article describes how their distributed inference framework handles stateful, heterogeneous GPU pods with KV caches, specialized prefill/decode backends, and conversation-level routing that traditional stateless routing algorithms cannot address.
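A toy sketch of the conversation-affinity idea the post argues for: requests from the same conversation should land on the pod that already holds that conversation's KV cache. This is an assumed, deliberately simplified illustration, not Modular's implementation, which also accounts for load, cache eviction, and prefill/decode specialization.

```python
import hashlib

# Hypothetical decode pods; a real router would track live membership and load.
PODS = ["pod-0", "pod-1", "pod-2", "pod-3"]

def route(conversation_id: str) -> str:
    """Sticky routing: hash the conversation id so repeat requests hit the
    same pod and reuse its warm KV cache instead of re-prefilling."""
    h = int(hashlib.sha256(conversation_id.encode()).hexdigest(), 16)
    return PODS[h % len(PODS)]

assert route("conv-42") == route("conv-42")   # same conversation, same pod
```

A plain stateless round-robin balancer cannot make this guarantee, which is the gap the article says LLM-aware routing has to fill.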
Proposes Memory-Efficient Looped Transformer (MELT), a novel recurrent LLM architecture that decouples reasoning depth from memory consumption by sharing a single KV cache across loops and using chunk-wise training with interpolated transition and attention-aligned distillation.
The paper introduces SPEED, a layer-asymmetric KV visibility policy that reduces long-context inference costs by processing prompt tokens only in lower layers during prefill while maintaining full-depth attention during decoding.
IceCache introduces Dynamic Continuous Indexing to keep GPU memory usage constant during long LLM generations with minimal accuracy loss.
TTKV introduces a temporal-tiered KV cache that mimics human memory to cut 128K-context LLM inference latency by 76% and double throughput, while reducing cross-tier traffic by 5.94×.