@Michaelzsguo: KV cache is the model’s working memory during generation. As the context window gets longer, the model has to keep more…
Summary
DeepSeek's KV cache compression techniques, MLA (which cut KV cache size by about 93% in DeepSeek-V2) and CSA/HCA in DeepSeek-V4, enable efficient long-context inference and make SSD-based KV caching practical, as demonstrated by antirez's ds4.c project.
Cached at: 05/24/26, 06:23 AM
KV cache is the model’s working memory during generation.
As the context window gets longer, the model has to keep more key/value attention state for previous tokens. That cache can take up a huge share of RAM or HBM, especially at long context. It directly limits how many long prompts you can serve at once, and it determines how much memory your hardware needs (for example, 128GB instead of 64GB) to serve the model locally.
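To make the growth concrete, here is a rough back-of-the-envelope sketch of how a conventional (uncompressed) KV cache scales with context length. The model shape below (32 layers, 32 KV heads, head dimension 128, fp16) is a hypothetical example, not any specific DeepSeek model, and `kv_cache_bytes` is just an illustrative helper name.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one key vector and one value vector per token, per layer, per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model shape (assumption): 32 layers, 32 KV heads, head_dim 128, fp16.
for seq_len in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(32, 32, 128, seq_len) / 2**30
    print(f"{seq_len:>7} tokens -> {gib:.1f} GiB")
# prints 2.0 GiB, 16.0 GiB and 64.0 GiB respectively
```

The cache grows linearly with context length, so under these assumptions a single 128K-token sequence already needs as much memory as a whole 64GB machine.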
So how did DeepSeek make KV cache so compact?
DeepSeek’s KV-cache story comes from two major innovations.
First, MLA (Multi-head Latent Attention) in DeepSeek-V2 made each token’s KV much smaller. DeepSeek reported that V2 reduced KV cache by about 93% compared with its earlier DeepSeek 67B model.
Then DeepSeek-V4 added CSA + HCA, which compress long-context memory itself: fewer full KV entries, sparse retrieval, and heavily compressed global memory.
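To see why MLA shrinks the per-token cache, here is a minimal sketch comparing what gets cached per token, per layer. Standard multi-head attention stores a full key and value for every head, while MLA stores one compressed latent shared by keys and values plus a small positional (RoPE) key. The dimensions below are illustrative assumptions, and the helper names are made up for this example. The resulting percentage is relative to plain multi-head attention with these dimensions, so it is not expected to match the 93% figure above, which depends on the baseline model.

```python
def mha_elems_per_token(num_heads, head_dim):
    # Standard multi-head attention caches a full key and a full value per head.
    return 2 * num_heads * head_dim


def mla_elems_per_token(latent_dim, rope_dim):
    # MLA caches one compressed latent shared by keys and values,
    # plus a small decoupled positional (RoPE) key.
    return latent_dim + rope_dim


# Illustrative dimensions (assumptions, not taken from the post).
NUM_HEADS, HEAD_DIM = 128, 128
LATENT_DIM, ROPE_DIM = 512, 64

mha = mha_elems_per_token(NUM_HEADS, HEAD_DIM)
mla = mla_elems_per_token(LATENT_DIM, ROPE_DIM)
reduction = 1 - mla / mha

print(f"MHA: {mha} elements per token per layer")
print(f"MLA: {mla} elements per token per layer")
print(f"Reduction vs. MHA: {reduction:.1%}")
```

The point of the sketch is the shape of the saving: the cached size no longer scales with the number of heads, only with the latent width.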
This compact KV cache leads directly to @antirez’s ds4.c project.
Once KV cache is small enough, it becomes practical to treat it as reusable local state on SSD:
prefill once, persist to SSD, reload later, and continue from the new suffix.
To quote antirez:
“The KV cache is actually a first-class disk citizen.”
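The "prefill once, persist to SSD, reload later, continue from the new suffix" flow can be sketched roughly as below. This is a toy illustration of the idea only, not how ds4.c is implemented: `toy_prefill` stands in for a real model's prefill, the cache is keyed by a hash of the token prefix, and the file layout and all function names are assumptions made for this example.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def toy_prefill(tokens, state=None):
    """Stand-in for a real model prefill: appends one fake KV entry per token."""
    kv = list(state) if state else []
    for tok in tokens:
        kv.append((tok, len(kv)))  # pretend (key, value) entry
    return kv


def cache_path(cache_dir, tokens):
    # Key the cache file by a hash of the exact token prefix.
    digest = hashlib.sha256(repr(list(tokens)).encode()).hexdigest()
    return Path(cache_dir) / f"{digest}.kv"


def save_kv(cache_dir, tokens, kv):
    # pickle is fine for trusted local files; never unpickle untrusted data.
    cache_path(cache_dir, tokens).write_bytes(pickle.dumps(kv))


def load_longest_prefix(cache_dir, tokens):
    """Return (prefix_len, kv) for the longest cached prefix of tokens."""
    for n in range(len(tokens), 0, -1):
        path = cache_path(cache_dir, tokens[:n])
        if path.exists():
            return n, pickle.loads(path.read_bytes())
    return 0, []


def prefill_with_disk_cache(cache_dir, tokens):
    n, kv = load_longest_prefix(cache_dir, tokens)
    suffix = tokens[n:]              # only the uncached suffix needs compute
    kv = toy_prefill(suffix, kv)
    save_kv(cache_dir, tokens, kv)   # persist for the next session
    return kv, len(suffix)


with tempfile.TemporaryDirectory() as d:
    prompt = [1, 2, 3, 4]
    _, computed_first = prefill_with_disk_cache(d, prompt)
    _, computed_second = prefill_with_disk_cache(d, prompt + [5, 6])
    print(computed_first, computed_second)  # 4 2
```

The second call processes only the two new tokens because the first four are reloaded from disk. The smaller the KV cache, the cheaper those files are to write and read back, which is what makes this pattern attractive.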
Similar Articles
@Michaelzsguo: Found this great tool that may be handy for your local LLM inference optimization: https://kvcache.ai/tools/kv-cache-ca…
A tweet shares the KV Cache Size Calculator from KVCache.ai, a tool for estimating KV cache memory usage for local LLM inference, highlighting that 1M tokens for DeepSeek V4 Pro uses only 5GB of RAM.
@TheTuringPost: Why KV cache is one of the main reasons LLMs are fast? KV cache is what connects attention mechanism with generation st…
KV cache stores previously computed key and value vectors during autoregressive generation, allowing models to avoid recomputing the entire sequence at each step, significantly speeding up inference at the cost of increased memory usage.
@pallavishekhar_: KV Cache in LLMs Read here: https://outcomeschool.com/blog/kv-cache-in-llms…
This article explains the concept of KV Cache in Large Language Models, detailing how it optimizes text generation by storing and reusing key-value pairs to avoid redundant computations during inference.
KV Cache Is Becoming the Memory Hierarchy of Inference
The article discusses how the KV cache is evolving into a memory hierarchy for LLM inference, optimizing memory management during decoding.
OjaKV: Context-Aware Online Low-Rank KV Cache Compression
OjaKV introduces a context-aware online low-rank KV cache compression framework that uses hybrid storage and Oja's algorithm for incremental subspace adaptation to reduce GPU memory bottlenecks in long-context LLM inference without model fine-tuning.