@Michaelzsguo: KV cache is the model’s working memory during generation. As the context window gets longer, the model has to keep more…


Summary

DeepSeek shrank the KV cache in two steps: MLA in DeepSeek-V2 cut it by about 93%, and CSA + HCA in DeepSeek-V4 compress long-context memory further. The compact cache makes efficient long-context inference and SSD-based caching practical, as antirez's ds4.c project demonstrates.


Cached at: 05/24/26, 06:23 AM

KV cache is the model’s working memory during generation.

As the context window gets longer, the model has to keep more key/value attention state for previous tokens. That cache can take up a large share of RAM or HBM, especially at long context. It directly limits how many long prompts you can serve at once, and it raises the hardware specs you need to serve the model locally (128GB instead of 64GB, for example).
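To make the scaling concrete, here is a minimal sketch of the usual size estimate for a plain multi-head attention cache: one key and one value vector per token, per layer, per KV head. The model dimensions (32 layers, 32 KV heads, head size 128, 16-bit values) are hypothetical and do not describe any DeepSeek model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """Estimate KV cache size for standard multi-head attention.

    The leading factor 2 accounts for storing both a key and a value
    vector for every token, layer, and KV head.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch


# Hypothetical model shape, chosen only for illustration.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 32, 128

per_token = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, seq_len=1)
total = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, seq_len=128 * 1024)

print(per_token / 2**20, "MiB per token")
print(total / 2**30, "GiB at 128K tokens")
```

With these assumed dimensions the cache alone costs half a MiB per token, so a single 128K-token prompt needs 64 GiB before counting the weights. That is the kind of pressure that pushes a local setup from a 64GB machine to a 128GB one.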

So how did DeepSeek make KV cache so compact?

DeepSeek’s KV-cache story comes from two major innovations.

First, MLA (Multi-head Latent Attention) in DeepSeek-V2 made each token’s KV much smaller. DeepSeek-V2 reports a KV cache about 93% smaller than that of its predecessor, DeepSeek 67B.
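A rough sketch of why caching a latent vector is so much cheaper than caching full keys and values: standard multi-head attention stores a key and a value for every head, while an MLA-style layer stores one shared compressed latent plus a small positional key. The dimensions below are assumptions for illustration (they resemble the published DeepSeek-V2 configuration, but check the paper before relying on them).

```python
# Assumed dimensions, for illustration only.
N_HEADS = 128      # attention heads per layer
HEAD_DIM = 128     # size of each head's key and value vector
LATENT_DIM = 512   # compressed KV latent shared by all heads
ROPE_DIM = 64      # separate positional (RoPE) key kept next to the latent

# Elements cached per token, per layer.
mha_elems = 2 * N_HEADS * HEAD_DIM   # full key + value for every head
mla_elems = LATENT_DIM + ROPE_DIM    # one latent + one positional key

reduction = 1 - mla_elems / mha_elems

print("MHA elements per token per layer:", mha_elems)
print("MLA elements per token per layer:", mla_elems)
print(f"reduction vs. plain MHA: {reduction:.1%}")
```

This compares one layer against plain multi-head attention at the same width, so the percentage it prints is not expected to match the roughly 93% figure above, which depends on the baseline model being compared against.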

Then DeepSeek-V4 added CSA + HCA, which compress long-context memory itself: fewer full KV entries, sparse retrieval, and heavily compressed global memory.

This compact KV cache directly leads to @antirez’s ds4.c project.

Once KV cache is small enough, it becomes practical to treat it as reusable local state on SSD:

prefill once, persist to SSD, reload later, and continue from the new suffix.

To quote antirez:

“The KV cache is actually a first-class disk citizen.”
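The prefill-once, reload-later flow described above can be sketched with a toy on-disk cache. This is not how ds4.c stores its state: the class `DiskKVCache`, the helper `fake_kv_for`, the JSON file format, and the hash-of-prefix file naming are all hypothetical, and the "KV state" is a stand-in for real tensors.

```python
import hashlib
import json
import os
import tempfile


def fake_kv_for(token):
    # Stand-in for the real per-token key/value tensors (hypothetical).
    return [token, token * 2]


class DiskKVCache:
    """Toy prefix cache: persist KV state per prompt, reuse the longest prefix."""

    def __init__(self, directory):
        self.directory = directory
        self.computed = 0  # tokens that actually went through "prefill"

    def _path(self, tokens):
        # File name is derived from a hash of the token prefix.
        digest = hashlib.sha256(json.dumps(tokens).encode()).hexdigest()
        return os.path.join(self.directory, digest + ".json")

    def prefill(self, tokens):
        kv, start = [], 0
        # Look for the longest prefix already saved on disk.
        # (Quadratic hashing work; acceptable for a sketch only.)
        for n in range(len(tokens), 0, -1):
            path = self._path(tokens[:n])
            if os.path.exists(path):
                with open(path) as f:
                    kv = json.load(f)
                start = n
                break
        # Compute KV state only for the new suffix.
        for token in tokens[start:]:
            kv.append(fake_kv_for(token))
            self.computed += 1
        # Persist the extended state so later prompts can resume from it.
        with open(self._path(tokens), "w") as f:
            json.dump(kv, f)
        return kv


with tempfile.TemporaryDirectory() as cache_dir:
    cache = DiskKVCache(cache_dir)
    first = cache.prefill([1, 2, 3])            # cold: all 3 tokens computed
    after_first = cache.computed
    second = cache.prefill([1, 2, 3, 4, 5])     # warm: only the 2-token suffix
    after_second = cache.computed

print("tokens computed on first call:", after_first)
print("tokens computed in total after second call:", after_second)
```

The second call reloads the saved state for `[1, 2, 3]` and computes only the two new tokens. The smaller the per-token KV state, the cheaper that save and reload becomes, which is what makes disk a reasonable home for it.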
