Tag
This paper proposes that the KV cache in transformers acts as a notebook of memoized conclusions, enabling surgical editing and composition without full recomputation. The method achieves significant latency reductions while preserving decision equivalence across model scales.
A custom FPGA implementation of a Transformer with KV cache achieves 56,000 tokens per second at 80 MHz, running microGPT on a tiny LCD.
Attemory is an open-source local memory retrieval engine that uses attention-based retrieval over KV cache instead of traditional embedding or BM25 methods, achieving state-of-the-art results on LongMemEval, LoCoMo, and code search benchmarks.
An NVIDIA Research technical blog post examines KV cache compression techniques and the infrastructure problems they face, including how FlashAttention and paged attention create practical obstacles to deploying long-context LLMs in production, and proposes a geometric solution based on RoPE.
A tweet from Song Han highlights continued work on KV cache compression, featuring a blog by Weian Mao that discusses system-level aspects often overlooked in papers.
DFlash, a block-diffusion drafter with KV injection, is now running at frontier scale: integrated with Modal and SGLang for Qwen 397B, it achieves up to 4.3x higher throughput than baseline.
A new KV cache optimization called kvflash doubles generation speed and reduces VRAM usage for Qwen 3.6-27B on a single RTX 3090 while maintaining accuracy.
The article reframes the KV cache from an engineering perspective, arguing that it is not just an inference optimization technique but, in the agent era, a runtime infrastructure for reusing already computed results, helping AI avoid redundant reasoning.
Introduces Parallel-Synthesis, a framework that enables direct consumption of KV caches from parallel worker agents, reducing time-to-first-token by 2.5x–11x while maintaining or improving performance on agentic tasks.
Introduces KV-Compression Aware Training (KV-CAT), a method that encourages transformers to learn compressible key-value caches during training, improving memory efficiency for long-context tasks without sacrificing performance.
InfiniteKV is an open-source KV cache technique that compresses old tokens into 104-byte searchable records stored in RAM or on disk, enabling models to handle million-token contexts beyond their trained window without discarding data. It has been verified to work with Mistral-7B and SmolLM2.
FlashMemory-DeepSeek-V4 proposes a novel inference paradigm called Lookahead Sparse Attention (LSA), which uses a neural memory indexer to actively predict future context needs, compressing physical KV cache usage to 13.5% of the full-context baseline while improving average accuracy by 0.6%. The method adopts a decoupled training strategy that allows the indexer to be trained independently, without loading the base model, significantly reducing training cost.
A user seeks advice on preventing llama.cpp from offloading the KV cache to swap before RAM is fully exhausted, and shares their configuration: an M2 Max with 96GB RAM running a large Qwen model.
This paper presents a method for dense latent communication between heterogeneous multi-agent systems using aligned KV-cache transformation, achieving better performance than text-based methods with lower computational costs.
Introduces RKSC, a training-free inference framework for multi-branch LLM reasoning that reduces KV cache redundancy via similarity-based sharing and early exit, achieving up to 3x speedup with minimal error.
IntentKV introduces a cross-turn intent-aware KV cache pruning method for multi-turn LLM agents, maintaining session-level query memory to efficiently prune cache without accuracy loss, significantly reducing token usage and KV reads.
This paper reveals that low-bit KV cache quantization can silently destroy safety alignment in instruction-tuned LLMs, and proposes a diagnostic method (PCR) to classify failure modes along with a training-free mitigation protocol that recovers up to 97% of lost alignment.
This paper proposes Prefilling-dLLM, a training-free framework that partitions the prefix into chunks and caches KV representations, achieving state-of-the-art quality and up to 28x speedup for long-context inference in diffusion language models.
This blog post introduces the concept of the 'Forgetting Wall' in long-horizon video generation and world models, arguing that the primary bottleneck is memory (KV cache growth) rather than compute, and explores compression as a key direction for future models.
Proposes Reroute, a training-free plug-in for vision-language models that replaces irreversible visual-token pruning with recoverable routing, allowing tokens to re-enter the pipeline later to improve grounding under aggressive token reduction while maintaining VQA performance.