@Michaelzsguo: KV cache is the model’s working memory during generation. As the context window gets longer, the model has to keep more…

X AI KOLs Timeline 05/23/26, 10:15 PM Models

kv-cache deepseek attention compression long-context local-inference

Summary

DeepSeek's KV cache compression work, MLA in DeepSeek-V2 (about 93% smaller KV cache) followed by CSA/HCA in DeepSeek-V4, makes long-context inference more efficient and SSD-based KV caching practical, as demonstrated by antirez's ds4.c project.

Original Article

Cached at: 05/24/26, 06:23 AM

KV cache is the model’s working memory during generation.

As the context window gets longer, the model has to keep more key/value attention state for previous tokens. That cache can take up a large share of RAM or HBM, especially at long context, and it directly limits how many long prompts you can serve at once, or raises the hardware you need to serve the model locally (for example, 128GB of memory instead of 64GB).
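To make the scaling concrete, here is a back-of-the-envelope sketch for a standard multi-head attention cache. The model shape (32 layers, 32 KV heads, head dimension 128, fp16) is a hypothetical example chosen for round numbers, not any specific DeepSeek model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of a plain (uncompressed) KV cache for one sequence.

    The leading 2 accounts for storing both a key and a value vector
    per head, per layer, per token. bytes_per_elem=2 assumes fp16/bf16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


GIB = 1024 ** 3

# Hypothetical model shape, for illustration only.
for seq_len in (4_096, 32_768, 131_072):
    size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=seq_len)
    print(f"{seq_len:>7} tokens -> {size / GIB:.1f} GiB")

# Prints:
#    4096 tokens -> 2.0 GiB
#   32768 tokens -> 16.0 GiB
#  131072 tokens -> 64.0 GiB
```

The cache grows linearly with context length, so under these assumptions a single 128K-token sequence already needs 64 GiB for the cache alone, before counting model weights.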

So how did DeepSeek make KV cache so compact?

DeepSeek’s KV-cache story comes from two major innovations.

First, MLA (Multi-head Latent Attention) in DeepSeek-V2 made each token’s KV much smaller. DeepSeek-V2 already reduced KV cache by about 93% compared with DeepSeek 67B, its dense predecessor with a more conventional attention layout.
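A rough way to see where the per-token saving comes from: standard multi-head attention caches a full key and value vector for every head, while MLA caches one shared compressed latent plus a small positional (RoPE) key per layer. The dimensions below are the ones recalled from the DeepSeek-V2 paper (128 heads of dimension 128, latent dimension 512, RoPE key dimension 64); treat them as assumptions. The function names are hypothetical.

```python
def mha_elems_per_token(n_heads, head_dim):
    # Standard multi-head attention: one key and one value vector per head.
    return 2 * n_heads * head_dim


def mla_elems_per_token(latent_dim, rope_dim):
    # MLA: one compressed KV latent shared by all heads, plus a small
    # decoupled RoPE key. Both are stored once per token per layer.
    return latent_dim + rope_dim


# Assumed dimensions, per layer (see the note above).
mha = mha_elems_per_token(n_heads=128, head_dim=128)   # 32768 elements
mla = mla_elems_per_token(latent_dim=512, rope_dim=64)  # 576 elements

reduction = 1 - mla / mha
print(f"per-token, per-layer reduction vs same-shape MHA: {reduction:.1%}")
# Prints: per-token, per-layer reduction vs same-shape MHA: 98.2%
```

This number does not match the published figure of about 93% because the baselines differ: the sketch compares against a same-shape multi-head layout, whereas the published figure is measured against DeepSeek 67B, a different model with a different layer count and attention layout.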

Then DeepSeek-V4 added CSA + HCA, which compress long-context memory itself: fewer full KV entries, sparse retrieval, and heavily compressed global memory.

This compact KV cache leads directly to @antirez’s ds4.c project.

Once KV cache is small enough, it becomes practical to treat it as reusable local state on SSD:

prefill once, persist to SSD, reload later, and continue from the new suffix.

To quote antirez:

“The KV cache is actually a first-class disk citizen.”
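The prefill / persist / reload / continue flow can be sketched as a toy prefix cache. This is not ds4.c's actual code, file format, or API: `toy_prefill`, `DiskKVCache`, and `run` are hypothetical names, the "KV state" is a fake list of tuples standing in for real tensors, and `pickle` is used only to keep the example short.

```python
import hashlib
import os
import pickle
import tempfile


def toy_prefill(tokens, state=None):
    """Stand-in for a real forward pass: appends one fake KV entry per token."""
    state = list(state or [])
    for tok in tokens:
        state.append((len(state), tok))  # (position, token) as a fake KV entry
    return state


class DiskKVCache:
    """Hypothetical on-disk store keyed by a hash of the prefix tokens."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, tokens):
        digest = hashlib.sha256(repr(list(tokens)).encode()).hexdigest()
        return os.path.join(self.directory, digest + ".kv")

    def save(self, tokens, state):
        with open(self._path(tokens), "wb") as f:
            pickle.dump(state, f)

    def load(self, tokens):
        path = self._path(tokens)
        if not os.path.exists(path):
            return None
        with open(path, "rb") as f:
            return pickle.load(f)


def run(prompt, prefix_len, cache):
    """Return (state, number of tokens actually processed)."""
    prefix, suffix = prompt[:prefix_len], prompt[prefix_len:]
    state = cache.load(prefix)            # reload later
    computed = len(suffix)
    if state is None:
        state = toy_prefill(prefix)       # prefill once
        cache.save(prefix, state)         # persist to SSD
        computed += len(prefix)
    state = toy_prefill(suffix, state)    # continue from the new suffix
    return state, computed


with tempfile.TemporaryDirectory() as d:
    cache = DiskKVCache(d)
    doc = list(range(1000))               # a long shared prefix
    _, first = run(doc + [1, 2], 1000, cache)
    _, second = run(doc + [3, 4, 5], 1000, cache)
    print(first, second)  # 1002 3
```

The second call processes only the 3 new suffix tokens because the 1000-token prefix state is read back from disk. The smaller the per-token KV state, the cheaper that file is to write, store, and reload, which is why a compact cache makes this pattern practical.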

Similar Articles

SeKV: Resolution-Adaptive KV Cache with Hierarchical Semantic Memory for Long-Context LLM Inference

@Michaelzsguo: Found this great tool that may be handy for your local LLM inference optimization: https://kvcache.ai/tools/kv-cache-ca…

@TheTuringPost: Why KV cache is one of the main reasons LLMs are fast? KV cache is what connects attention mechanism with generation st…

CompressKV: Semantic-Retrieval-Guided KV-Cache Compression for Resource-Efficient Long-Context LLM Inference

@pallavishekhar_: KV Cache in LLMs Read here: https://outcomeschool.com/blog/kv-cache-in-llms…
