kv-cache

#kv-cache

Epiphany-Aware KV Cache Eviction Without the Attention Matrix

arXiv cs.LG · 10h ago

This paper introduces EpiKV, a KV cache eviction method that scores token importance via changes in internal representations (epiphany score) instead of attention weights, avoiding the need to materialize the attention matrix. It achieves competitive performance on reasoning benchmarks while enabling up to 16× longer context lengths.

The KV-cache wall: why fixed-size memory sequence models keep coming back

Reddit r/ArtificialInteligence · 23h ago

Explores the growing KV-cache memory bottleneck in transformer inference and explains why fixed-size-memory architectures such as Mamba and RWKV are gaining renewed attention.
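
To make the bottleneck concrete, here is a rough back-of-the-envelope sketch of how KV-cache size scales with context length. The accounting is the standard one (a key and a value vector per head, per layer, per token); the model shape below is hypothetical and not taken from the linked post.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to hold keys and values for every cached token."""
    # Factor of 2: one key vector and one value vector per head, per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size

# Hypothetical model shape, chosen only for illustration (fp16 elements).
SHAPE = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for seq_len in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(seq_len=seq_len, **SHAPE) / 2**30
    print(f"{seq_len:>7} tokens -> {gib:.1f} GiB")
```

Under these assumptions the cache costs 128 KiB per token and grows linearly with context, whereas a fixed-size recurrent state stays the same size regardless of sequence length.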

Dustin: Draft-Augmented Sparse Verification for Efficient Long-Context Generation with Speculative Decoding

arXiv cs.CL · yesterday

Dustin introduces a sparse verification framework for speculative decoding that leverages draft model signals and sparse attention head scoring to overcome the KV cache verification bottleneck, achieving up to 27.85x speedup in self-attention and 9.17x end-to-end decoding speedup on long-context tasks with negligible accuracy loss.

DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference

Reddit r/singularity · yesterday

DualPath is a system that breaks the storage bandwidth bottleneck in agentic LLM inference by introducing a dual-path KV-cache loading mechanism, improving throughput by up to 1.87x offline and 1.96x online.

The memory wall gets expensive: KV cache is why you should stop worshiping softmax attention

Reddit r/singularity · yesterday

The article discusses how rising DDR5 memory prices signal a broader memory bottleneck in AI, particularly the KV cache in softmax attention for LLMs, and highlights post-transformer architectures like linear attention and state space models that aim to reduce memory usage.

RoPE-Aware Bit Allocation for KV-Cache Quantization

arXiv cs.LG · 2d ago

Proposes Block-GTQ, a RoPE-aware bit allocation method for key-value cache quantization that improves long-context performance and memory efficiency by allocating more bits to high-energy RoPE blocks.

Forget Without Compromise: Nexus Sampling for Streaming KV-Cache Eviction Under Fixed Budgets

arXiv cs.LG · 2d ago

Introduces Nexus Sampling, a training-free KV-cache eviction method that uses weighted reservoir sampling instead of deterministic top-k selection. It improves long-context LLM inference under fixed memory budgets and matches dense-attention performance at 80% eviction.
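
For readers unfamiliar with weighted reservoir sampling, the sketch below shows the generic technique in a streaming setting: each token gets a random key biased by its importance score, and only the `budget` tokens with the largest keys are kept. This is the textbook key-based variant, not the paper's algorithm; the function name `weighted_reservoir_keep` and the flat list of scores are assumptions made for illustration.

```python
import heapq
import random

def weighted_reservoir_keep(scores, budget, seed=0):
    """Pick up to `budget` token indices, favoring higher scores.

    Generic key-based weighted reservoir sampling: each item draws
    key = u ** (1 / weight) with u uniform in [0, 1), and the items
    with the largest keys survive. Illustrative only.
    """
    if budget <= 0:
        return []
    rng = random.Random(seed)
    heap = []  # min-heap of (key, index); the root is the next eviction candidate
    for idx, weight in enumerate(scores):
        if weight <= 0:
            continue  # tokens with non-positive weight are never kept
        key = rng.random() ** (1.0 / weight)
        if len(heap) < budget:
            heapq.heappush(heap, (key, idx))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, idx))  # evict the smallest key
    return sorted(idx for _, idx in heap)

# Keep 20% of 100 tokens (80% eviction) under hypothetical scores.
kept = weighted_reservoir_keep([1.0 + (i % 5) for i in range(100)], budget=20)
print(len(kept), kept[:5])
```

Unlike deterministic top-k, a low-scoring token still has a nonzero chance of surviving, and the selection needs only one pass over the stream with memory proportional to the budget.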

@techNmak: Your LLM inference is burning 50% of its compute on work it has already done. If you're running RAG or Multi-Turn Chat,…

X AI KOLs Timeline · 2d ago

LMCache is an open-source library that makes KV cache persistent and shareable across requests, eliminating recomputation in RAG and multi-turn chat workloads, achieving up to 15x throughput gain and 3-10x reduction in time-to-first-token.

I mapped the KLD of KV cache quantization for Qwen3.6-35B-A3B and Gemma4-E2B QAT

Reddit r/LocalLLaMA · 2d ago

The author maps the Kullback-Leibler divergence of KV cache quantization for the Qwen3.6-35B-A3B and Gemma4-E2B QAT models.

@kazukifujii: This vLLM blog post explains weight updates in RL + KV cache recompute in a very clear and illustrated way, and it also…

X AI KOLs Timeline · 3d ago

This article explains vLLM's weight syncing API for reinforcement learning, covering how it facilitates weight updates and KV cache recompute in RL training, with a focus on reducing complexity for training frameworks.

@AdinaYakup: Unlimited-OCR New OCR from @PaddlePaddle It can parse hundreds of pages in a single pass while maintaining stable speed…

X AI KOLs Following · 3d ago

PaddlePaddle releases Unlimited-OCR, a new OCR model using Reference Sliding Window Attention (R-SWA) to maintain constant KV cache during decoding, achieving 93% on OmniDocBench and a 6% improvement over previous methods.

Unlimited OCR Works

Hugging Face Daily Papers · 4d ago

Unlimited OCR introduces Reference Sliding Window Attention to eliminate growing memory consumption in long-sequence OCR tasks, enabling efficient transcription of multiple pages in a single forward pass.

An open handbook on LLM inference at scale (GPU internals, KV cache, batching, vLLM/SGLang/TensorRT-LLM) [P]

Reddit r/MachineLearning · 6d ago

An open, in-progress handbook explaining LLM inference internals including GPU memory hierarchy, KV cache, batching, and popular inference engines like vLLM and TensorRT-LLM.

@FakeMaidenMaker: Incredible! This open-source project can significantly speed up and save VRAM for self-hosted large model inference. It has garnered 9.2K stars on GitHub, joined the PyTorch Foundation, and NVIDIA's Dynamo has integrated it. GitHub: https://github.com/LMC…

X AI KOLs Timeline · 2026-06-18

LMCache is a KV cache management layer that accelerates large model inference and reduces VRAM consumption by caching and reusing KV entries. It has 9.2K GitHub stars, has joined the PyTorch Foundation, and has been integrated into NVIDIA Dynamo.

NVFP4 kv cache quantization on sm120 will make 32GB VRAM systems very capable

Reddit r/LocalLLaMA · 2026-06-18

NVFP4 KV cache quantization on sm120 significantly improves memory efficiency for large language models, enabling 32GB VRAM systems to achieve ~60 tok/sec inference at 196k context size with Qwen3.6-27B.

Dual Dimensionality for Local and Global Attention

arXiv cs.CL · 2026-06-18

Proposes Distance-Adaptive Representation (DAR), which reduces key-value dimensionality for distant tokens while preserving full dimensionality for nearby tokens, improving KV cache efficiency without performance loss.

Investigating Implicit Latent Trajectory Shifts: Bypassing Alignment via Long-Form Coherent Context

Reddit r/ArtificialInteligence · 2026-06-17

An empirical study investigating how long, semantically dense benign text can shift a model's latent space trajectory, diluting initial system prompts and bypassing post-training alignment constraints, as observed in both closed and open-source models.

@amitiitbhu: New Article: How does vLLM work? Read here: https://outcomeschool.com/blog/how-does-vllm-work…

X AI KOLs Timeline · 2026-06-17

A detailed blog post explaining how vLLM works, including PagedAttention, KV cache management, and continuous batching for efficient LLM serving.

Models Take Notes at Prefill: KV Cache Can Be Editable and Composable

arXiv cs.LG · 2026-06-17

This paper proposes that the KV cache in transformers acts as a notebook of memoized conclusions, enabling surgical editing and composition without full recomputation. The method achieves significant latency reductions while preserving decision equivalence across model scales.

GateGPT: 56k tokens per second Transformer (KV cache) on FPGA at 80 MHz

Hacker News Top · 2026-06-16

A custom FPGA implementation of a Transformer with KV cache achieves 56,000 tokens per second at 80 MHz, running microGPT on a tiny LCD.
