kv-cache

#kv-cache

Models Take Notes at Prefill: KV Cache Can Be Editable and Composable

arXiv cs.LG · 4h ago

This paper proposes that the KV cache in transformers acts as a notebook of memoized conclusions, enabling surgical editing and composition without full recomputation. The method achieves significant latency reductions while preserving decision equivalence across model scales.


GateGPT: 56k tokens per second Transformer (KV cache) on FPGA at 80 MHz

Hacker News Top · 15h ago

A custom FPGA implementation of a Transformer with KV cache achieves 56,000 tokens per second at 80 MHz, running microGPT on a tiny LCD.


A local attention-based retrieval with SOTA results on LongMemEval, LoCoMo, and code search benchmarks

Reddit r/AI_Agents · 23h ago

Attemory is an open-source local memory retrieval engine that uses attention-based retrieval over KV cache instead of traditional embedding or BM25 methods, achieving state-of-the-art results on LongMemEval, LoCoMo, and code search benchmarks.


@yukangchen_: We are excited to share a new technical article “KV Cache Compression and Its Infra Problems.” https://research.nvidia.…

X AI KOLs Timeline · yesterday

NVIDIA Research publishes a technical blog post examining KV cache compression techniques and their infrastructure problems, including how FlashAttention and paged attention create practical obstacles for production deployment of long-context LLMs, with a proposed geometric solution using RoPE.


@songhan_mit: Explore our continued efforts on KV cache compression:

X AI KOLs Following · yesterday

A tweet from Song Han highlights continued work on KV cache compression, featuring a blog by Weian Mao that discusses system-level aspects often overlooked in papers.


@zhijianliu_: This is what DFlash was built for. Our block-diffusion drafter + KV injection, now running at frontier scale — thanks t…

X AI KOLs Following · yesterday

DFlash, a block-diffusion drafter with KV injection, is now running at frontier scale, achieving up to 4.3x higher throughput than the baseline. It is integrated with Modal and SGLang for Qwen 397B.


This is amazing. Token speed doubled + kv cache now need low vram - qwen 27b

Reddit r/LocalLLaMA · yesterday

A new KV cache optimization called kvflash doubles generation speed and reduces VRAM usage for Qwen 3.6-27B on a single RTX 3090 while maintaining accuracy.


@ZeroZ_JQ: https://x.com/ZeroZ_JQ/status/2066380476970103028

X AI KOLs Timeline · 2d ago

The article reframes the KV cache from an engineering perspective: it is not just an inference optimization technique but, in the agent era, a runtime infrastructure for reusing already computed results, helping AI avoid redundant reasoning.


Towards Direct Latent-Space Synthesis for Parallel Branches in LLM-Agent Workflows

arXiv cs.AI · 2d ago

Introduces Parallel-Synthesis, a framework that enables direct consumption of KV caches from parallel worker agents, reducing time-to-first-token by 2.5x–11x while maintaining or improving performance on agentic tasks.


@jiqizhixin: What if your AI’s memory didn’t have to balloon with every extra sentence? University of Oxford, Technion, AITHYRA, and…

X AI KOLs Timeline · 2d ago

Introduces KV-Compression Aware Training (KV-CAT), a method that encourages transformers to learn compressible key-value caches during training, improving memory efficiency for long-context tasks without sacrificing performance.


Open sourcing InfiniteKV: a KV cache that files old tokens as 104-byte searchable records in RAM or on disk instead of deleting them. Mistral-7B answered from token 76,747, 2.3x past its trained window. Colab demo

Reddit r/LocalLLaMA · 5d ago

InfiniteKV is an open-source KV cache technique that compresses old tokens into 104-byte searchable records stored in RAM or on disk, enabling models to handle million-token contexts beyond their trained window without discarding data. Verified working with Mistral-7B and SmolLM2.


@karminski3: Magic! DeepSeekV4 context memory compressed to 1/10! Everyone knows DeepSeekV4 supports 1M context and is heavily optimized. To actually use 1M context, VRAM usage is only about 10GB (compared to DeepSeek-V3.2 which needs about…

X AI KOLs Following · 5d ago

FlashMemory-DeepSeek-V4 proposes an inference paradigm called Lookahead Sparse Attention (LSA), which uses a neural memory indexer to predict future context needs. It compresses physical KV cache usage to 13.5% of the full-context baseline while improving average accuracy by 0.6%. The method adopts a decoupled training strategy in which the indexer is trained independently, without loading the base model, significantly reducing training cost.


How do i prevent llama.cpp from offloading on Swap?

Reddit r/LocalLLaMA · 5d ago

A user seeks advice on preventing llama.cpp from offloading the KV cache to swap before RAM is fully exhausted, and shares their configuration: an M2 Max with 96GB RAM and a large Qwen model.


See What I See, Know What I Think: Dense Latent Communication Across Heterogeneous Agents

Hugging Face Daily Papers · 6d ago

This paper presents a method for dense latent communication among heterogeneous agents in multi-agent systems using aligned KV-cache transformation, achieving better performance than text-based methods at lower computational cost.


RKSC: Reasoning-Aware KV Cache Sharing and Confident Early Exit for Multi-Step LLM Inference

arXiv cs.LG · 2026-06-10

Introduces RKSC, a training-free inference framework for multi-branch LLM reasoning that reduces KV cache redundancy via similarity-based sharing and early exit, achieving up to 3x speedup with minimal error.


IntentKV: Cross-Turn Intent-Aware KV Cache Pruning for Agent Inference

arXiv cs.LG · 2026-06-10

IntentKV introduces a cross-turn intent-aware KV cache pruning method for multi-turn LLM agents, maintaining session-level query memory to efficiently prune cache without accuracy loss, significantly reducing token usage and KV reads.


Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

arXiv cs.LG · 2026-06-10

This paper reveals that low-bit KV cache quantization can silently destroy safety alignment in instruction-tuned LLMs, and proposes a diagnostic method (PCR) to classify failure modes along with a training-free mitigation protocol that recovers up to 97% of lost alignment.


Prefilling-dLLM: Predictive Prefilling for Long-Context Inference in Diffusion Language Models

arXiv cs.CL · 2026-06-10

This paper proposes Prefilling-dLLM, a training-free framework that partitions the prefix into chunks and caches KV representations, achieving state-of-the-art quality and up to 28x speedup for long-context inference in diffusion language models.


@HaochengXiUCB: New blog post: The Forgetting Wall in Video and World Models Long-horizon video generation is not just limited by compu…

X AI KOLs Following · 2026-06-10

This blog post introduces the concept of the 'Forgetting Wall' in long-horizon video generation and world models, arguing that the primary bottleneck is memory (KV cache growth) rather than compute, and explores compression as a key direction for future models.


Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

Hugging Face Daily Papers · 2026-06-10

Proposes Reroute, a training-free plug-in for vision-language models that replaces irreversible visual-token pruning with recoverable routing, allowing tokens to re-enter the pipeline later to improve grounding under aggressive token reduction while maintaining VQA performance.
