TTKV: Temporal-Tiered KV Cache for Long-Context LLM Inference

arXiv cs.CL Papers

Summary

TTKV introduces a temporal-tiered KV cache that mimics human memory to cut 128K-context LLM inference latency by 76% and double throughput while reducing cross-tier traffic 5.94×.

# TTKV: Temporal-Tiered KV Cache for Long-Context LLM Inference
Source: [https://arxiv.org/html/2604.19769](https://arxiv.org/html/2604.19769)
Gradwell Dzikanyanga¹†, Weihao Yang¹, Hao Huang¹, Donglei Wu², Shihao Wang¹, Wen Xia¹‡, Sanjeeb K C¹
¹Harbin Institute of Technology, Shenzhen; ²Guangzhou University
†gdzikanyanga@gmail.com, ‡xiawen@hit.edu.cn

###### Abstract

Key–value (KV) caching is critical for efficient inference in large language models (LLMs), yet its memory footprint scales linearly with context length, resulting in a severe scalability bottleneck. Existing approaches largely treat KV states as equally important across time, implicitly assuming uniform precision and accessibility. However, this assumption contrasts with human memory systems, where memories vary in clarity, recall frequency, and relevance with temporal proximity. Motivated by this insight, we propose TTKV, a KV cache management framework that maps the human memory system onto the KV cache. TTKV partitions the KV cache into temporal tiers with heterogeneous capacity and precision. The design addresses three aspects: (1) Tier Layout, decoupling fast and slow memory using HBM and DRAM; (2) Tier Content, assigning more recent KV states to faster, higher-precision tiers based on temporal proximity; and (3) Tier Interaction, employing block-wise streaming attention to overlap communication and computation when accessing slow tiers. Experiments show that TTKV reduces cross-tier traffic by 5.94× on 128K-context tasks, achieving up to 76% latency reduction and 2× throughput improvement over strong baselines.

## 1 Introduction

Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks Xiao et al. ([2024](https://arxiv.org/html/2604.19769#bib.bib8)). A key mechanism enabling efficient autoregressive inference is the KV cache, which stores the key and value activations of previously generated tokens to avoid recomputation in transformer attention layers Hooper et al. ([2025](https://arxiv.org/html/2604.19769#bib.bib14)); Liu et al. ([2024b](https://arxiv.org/html/2604.19769#bib.bib15)).

However, the KV cache grows linearly with context length, becoming a dominant bottleneck in both memory consumption and inference latency. For example, with a 32K-token window on LLaMA-7B, the KV cache alone can require tens of gigabytes of memory Wang et al. ([2025](https://arxiv.org/html/2604.19769#bib.bib34)). Standard LLM inference consists of two stages: a parallel *prefill* stage that processes the input and populates the KV cache, and a sequential *decode* stage that generates tokens autoregressively by repeatedly querying cached KVs Sheng et al. ([2023](https://arxiv.org/html/2604.19769#bib.bib36)); Liu et al. ([2024a](https://arxiv.org/html/2604.19769#bib.bib29)). During decoding, each step attends over the entire cached context, making KV cache access a primary determinant of per-token latency.

The KV cache bottleneck becomes more significant once its size exceeds high-bandwidth memory (HBM) and must be spilled into host memory (DRAM) Sheng et al. ([2023](https://arxiv.org/html/2604.19769#bib.bib36)); Liu et al. ([2024a](https://arxiv.org/html/2604.19769#bib.bib29)); Sharma et al. ([2025](https://arxiv.org/html/2604.19769#bib.bib37)). Since PCIe bandwidth is typically an order of magnitude lower than HBM bandwidth, cross-tier transfers dominate latency in long-context decoding Jiang et al. ([2025](https://arxiv.org/html/2604.19769#bib.bib44)). Consequently, the performance bottleneck shifts from merely reducing KV cache size to effectively integrating KV data-reduction techniques, such as quantization and sparsity, with the memory hierarchy to minimize costly cross-tier reads. Without such integration, these techniques remain constrained by host-to-GPU traffic when KV caches are offloaded to slower memory.

Existing approaches to mitigate this bottleneck fall into two categories: (1) KV reduction, which reduces the KV cache footprint through quantization and sparsification Hooper et al. ([2025](https://arxiv.org/html/2604.19769#bib.bib14)); Liu et al. ([2024b](https://arxiv.org/html/2604.19769#bib.bib15)); Zhang et al. ([2023](https://arxiv.org/html/2604.19769#bib.bib38)); and (2) KV offloading, which offloads the KV cache to DRAM and optimizes access patterns to reduce transfer costs Sheng et al. ([2023](https://arxiv.org/html/2604.19769#bib.bib36)); Sharma et al. ([2025](https://arxiv.org/html/2604.19769#bib.bib37)). However, neither approach alone fully resolves the challenges of long-context LLM inference. KV reduction alone may still produce KV caches that exceed GPU memory, while KV offloading alone does not address the growth of the KV cache. Moreover, naively combining these techniques is non-trivial, as it often leads to conflicting trade-offs among cache size reduction, transfer volume, and decode latency. This raises the question: *how can KV reduction and KV offloading be combined effectively to support efficient long-context LLM inference?*

To address these challenges, we propose TTKV, a new perspective on KV cache management inspired by human memory mechanisms. Existing approaches largely treat KV states as equally important across time, implicitly assuming uniform precision and accessibility. In contrast, our key insight is that the importance of KV states varies over time: recent states function as short-term memory and are more critical for generation, whereas older states form long-term memory and only a small subset remains relevant to the current query. This abstraction enables KV caches with different temporal relevance to be managed with heterogeneous latency and precision requirements, laying the foundation for efficient long-context LLM inference.

We implement TTKV, the Temporal-Tiered KV Cache, which abstracts KV cache management through human memory mechanisms. The design consists of three key components. (1) Tier Layout: KV caches are organized into a fast tier (short-term memory) and a slow tier (long-term memory) aligned with modern hardware hierarchies, with latency-sensitive states placed in HBM and capacity-oriented states in DRAM. (2) Tier Content: KV states are allocated across tiers according to temporal relevance: the fast tier preserves full precision for recent and frequently accessed tokens, while the slow tier applies differential quantization and sparsification to older, less frequently accessed states. (3) Tier Interaction: To mitigate the latency overhead introduced by the slow tier, we employ streaming attention that overlaps computation and communication, enabling efficient access to DRAM-resident KV states during decoding.

We evaluate TTKV on a diverse set of LLMs, including LLaMA-3.1-8B Grattafiori et al. ([2024](https://arxiv.org/html/2604.19769#bib.bib4)), Qwen2.5-32B Hui et al. ([2024](https://arxiv.org/html/2604.19769#bib.bib46)), DeepSeek-R1-14B Zheng et al. ([2025](https://arxiv.org/html/2604.19769#bib.bib47)), and LLaMA-3.1-70B Grattafiori et al. ([2024](https://arxiv.org/html/2604.19769#bib.bib4)). We compare TTKV with state-of-the-art KV cache optimization methods, including KIVI Liu et al. ([2024b](https://arxiv.org/html/2604.19769#bib.bib15)), KVQuant Hooper et al. ([2025](https://arxiv.org/html/2604.19769#bib.bib14)), DiffKV Zhang et al. ([2025](https://arxiv.org/html/2604.19769#bib.bib13)), and ShadowKV Sun et al. ([2025](https://arxiv.org/html/2604.19769#bib.bib16)), in long-context inference settings. Across these models, TTKV consistently outperforms prior approaches. In particular, on 128K-context tasks, TTKV reduces cross-tier traffic by up to 5.94×, leading to as much as 76% latency reduction and 2× throughput improvement over compared methods, while maintaining model accuracy.

In a nutshell, we make the following contributions:

- We observe that existing KV cache optimization methods typically emphasize either KV reduction or KV offloading, yet often fail to effectively mitigate cross-tier traffic and end-to-end decode latency in long-context inference while preserving model accuracy.
- We propose *TTKV*, a new perspective on KV cache management inspired by human memory mechanisms. TTKV partitions the KV cache into temporal tiers with heterogeneous capacity and precision, and is realized through tier layout, tier content, and tier interaction.
- We demonstrate that *TTKV* significantly reduces host-to-GPU traffic and end-to-end decode latency across a wide range of context lengths, while preserving model accuracy and enabling scalable long-context LLM inference compared to state-of-the-art methods.

## 2 Related Work

#### KV Reduction Methods

KV reduction methods aim to mitigate GPU memory overhead through quantization or sparsification. Specifically, quantization compresses KV states into low-precision representations to shrink the cache footprint Liu et al. ([2024b](https://arxiv.org/html/2604.19769#bib.bib15)); Hooper et al. ([2025](https://arxiv.org/html/2604.19769#bib.bib14)); Jiang et al. ([2026](https://arxiv.org/html/2604.19769#bib.bib43)). Sparsification limits memory I/O by selectively retaining or retrieving only salient KV states during attention computation Kitaev et al. ([2020](https://arxiv.org/html/2604.19769#bib.bib40)); Xiao et al. ([2024](https://arxiv.org/html/2604.19769#bib.bib8)); Liu et al. ([2026](https://arxiv.org/html/2604.19769#bib.bib45)). However, under long-context workloads, the KV cache still scales linearly with context length and can exceed GPU memory.

#### KV Offloading Methods

KV offloading methods aim to extend KV cache capacity by offloading KV states from GPU memory to slower but larger host DRAM Sheng et al. ([2023](https://arxiv.org/html/2604.19769#bib.bib36)); Sun et al. ([2025](https://arxiv.org/html/2604.19769#bib.bib16)); Jiang et al. ([2025](https://arxiv.org/html/2604.19769#bib.bib44)). These approaches primarily optimize memory placement strategies and data transfer scheduling to alleviate GPU memory pressure. However, without joint optimization with KV reduction techniques, they still incur substantial cross-tier traffic and fail to minimize end-to-end decode latency in long-context inference.

## 3 Design Principle

Human memory mechanisms assign different levels of importance to memories along the time dimension. We therefore mimic human memory by explicitly decoupling the model's KV cache management into a two-tier implementation. Based on this idea, we propose a key structure called Tier, considering the following three aspects:

1. Tier Layout: Tiers are placed in accordance with current computer architectures (e.g., HBM or DRAM).
2. Tier Content: Different tiers have different KV cache capacities and precision (bits) to simulate the differentiated importance along the time dimension.
3. Tier Interaction: The two tiers interact efficiently through a pipelined approach during both prefill and decoding, with overlapped computation and communication.

We now present empirical evidence validating the three design perspectives of Tier and demonstrating their necessity for efficient and scalable long-context LLM inference.

![Refer to caption](https://arxiv.org/html/2604.19769v1/x1.png)

Figure 1: Accuracy and latency trade-offs for different KV cache placement strategies. Only the tiered layout (i.e., TTKV) maintains both high accuracy and low latency as context length scales.

#### Tier Layout: Placement According to Architecture

A tiered layout enables longer context lengths and better efficiency. We validate this with four placement strategies for the KV cache in Llama-3.1-8B: storing all entries in GPU HBM, all in host DRAM, a naive uniform split, and our proposed tiered placement. As shown in Figure [1](https://arxiv.org/html/2604.19769#S3.F1), the GPU-only strategy fails at long contexts due to memory capacity limits, while the DRAM-only approach preserves capacity but suffers from prohibitive latency. The uniform partition performs poorly on both metrics. Only the tiered layout, which retains recent, frequently accessed tokens in HBM while offloading older context to DRAM, simultaneously maintains model accuracy and low latency. This confirms that tier placement aligned with the computer memory hierarchy is essential for scalable long-context inference.

#### Tier Content: Differential Capacity and Precision

Human memory prioritizes critical information with high fidelity while compressing less essential details\. This principle directly informs our tiered precision strategy: the tier for fast memory \(i\.e\., HBM\) retains high precision for recent, frequently accessed tokens—analogous to vivid, immediate recollections—while the tier for slow memory \(i\.e\., DRAM\) employs aggressive compression for older, less\-accessed contexts, mirroring how the brain archives distant memories\.

To further improve the compression ratio of the slow tier without sacrificing attention accuracy, we examine the inherent sensitivity of keys and values. Figure [2](https://arxiv.org/html/2604.19769#S3.F2) reveals that in Llama-3.1-8B, keys exhibit significantly greater variance and dynamic range than values, making them more susceptible to precision loss. As such, within the compressed slow-tier, we allocate more bits to accuracy-critical keys and fewer to more tolerant values, thereby maximizing storage efficiency while preserving the fidelity of attention computations.

![Refer to caption](https://arxiv.org/html/2604.19769v1/figures/Layer_1.png)

Figure 2: Magnitude distributions of keys (left) and values (right) in Llama-3.1-8B. Keys show greater dynamic range, justifying the higher precision allocated to keys relative to values in the slow-tier.
#### Tier Interaction: Pipelining Computation and Communication

Human memory retrieval does not halt cognitive processing; instead, it streams relevant details incrementally while reasoning continues. This efficient recall mechanism motivates pipelined transmission for cross-tier interaction, where communication overlaps with computation. We profile a 64K-context decoding step in Llama-3.1-8B (Figure [3](https://arxiv.org/html/2604.19769#S3.F3)). The baseline approach performs bulk transfers that serialize computation and communication, leaving the GPU idle for approximately 78% of the decode time. By simply implementing a pipeline that breaks transfers into smaller chunks and overlaps them with attention computation, we reduce effective transfer latency by approximately 3×. This demonstrates that overlapping computation and communication is essential for minimizing the performance impact of cross-tier data movement, indicating that a pipelined attention method for inter-tier interaction can ensure efficiency.

![Refer to caption](https://arxiv.org/html/2604.19769v1/figures/streaming_attention.png)

Figure 3: Timeline of a decoding step (64K context, batch=8). (a) Baseline: bulk transfers with GPU idle. (b) Pipeline: communication and computation are overlapped for better GPU utilization.

![Refer to caption](https://arxiv.org/html/2604.19769v1/x2.png)

Figure 4: The overview of TTKV. Inspired by the human memory mechanism, TTKV consists of three parts: Tier Layout (mapping), Tier Content (differential quantization), and Tier Interaction (streaming attention), which together enable efficient inference.

## 4 Methodology

Following the three design principles established in Section [3](https://arxiv.org/html/2604.19769#S3), namely Tier Layout, Tier Content, and Tier Interaction, we present TTKV, a unified framework that co-designs KV cache management with the memory hierarchy. As shown in Figure [4](https://arxiv.org/html/2604.19769#S3.F4), TTKV explicitly implements: (1) a two-tier memory layout (Section [4.1](https://arxiv.org/html/2604.19769#S4.SS1)), (2) differential quantization (Section [4.2](https://arxiv.org/html/2604.19769#S4.SS2)), and (3) streaming attention (Section [4.3](https://arxiv.org/html/2604.19769#S4.SS3)).

### 4.1 Tier Layout: Tiered Memory Architecture

The Tier Layout principle aligns the memory placement strategy with the computer architecture: the fast-tier resides in GPU HBM for low-latency access and stores KVs in full precision, while the slow-tier uses host DRAM for higher capacity and stores KVs in compressed form. As discussed in Section [3](https://arxiv.org/html/2604.19769#S3), this two-tier partitioning ensures that recent, frequently accessed tokens remain in fast memory, while older context is stored in slower memory without sacrificing model accuracy. TTKV's layout is shown in Figure [4](https://arxiv.org/html/2604.19769#S3.F4).

- Fast-tier (GPU HBM): Rather than fixing the block count for the fast-tier, we dynamically define its contents based on the available GPU HBM capacity. Specifically, let $L_{\text{fast}}$ denote the number of most recent tokens stored in full precision (FP16) in GPU HBM. The size $L_{\text{fast}}$ is chosen such that the memory footprint $M_{\text{fast}}$ does not exceed the allocated HBM capacity $C_{\text{HBM}}$: $M_{\text{fast}} = L_{\text{fast}} \times d_{\text{kv}} \times b_{\text{FP16}} \leq C_{\text{HBM}}$, where $d_{\text{kv}}$ is the per-token KV feature dimension and $b_{\text{FP16}}$ is the byte width of FP16 precision. Therefore, the fast-tier (i.e., HBM) holds the most recent $L_{\text{fast}}$ tokens in full precision, ensuring that the most frequently accessed data are cached in the fastest memory (a small numeric sketch of this sizing rule follows this list).
- Slow-tier (Host DRAM): The slow-tier stores older tokens in a compressed form, with larger memory capacity but slower access speed. Moreover, these tokens are partitioned into structured blocks for efficient memory management and cross-tier movement.
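To make the sizing rule concrete, the short sketch below (referenced at the end of the fast-tier item above) computes $L_{\text{fast}}$ from an HBM budget. It is a minimal illustration, not code from the paper: the LLaMA-3.1-8B-like dimensions (32 layers, 8 KV heads under GQA, head dimension 128) and the 8 GiB budget are assumptions chosen only for the arithmetic.

```python
# Minimal sketch of the fast-tier sizing rule M_fast = L_fast * d_kv * b_FP16 <= C_HBM.
# The model dimensions below are assumptions (LLaMA-3.1-8B-like, GQA), not values from the paper.
num_layers   = 32          # transformer layers holding a KV cache
num_kv_heads = 8           # KV heads per layer (grouped-query attention)
head_dim     = 128         # per-head feature dimension
b_fp16       = 2           # bytes per FP16 element

# d_kv: per-token KV footprint in elements (keys + values, across all layers)
d_kv = 2 * num_layers * num_kv_heads * head_dim   # = 65,536 elements per token
bytes_per_token = d_kv * b_fp16                   # = 131,072 bytes (128 KiB)

c_hbm = 8 * 1024**3        # hypothetical HBM budget reserved for the fast-tier (8 GiB)
l_fast = c_hbm // bytes_per_token                 # most recent tokens kept in FP16
print(l_fast)              # -> 65536 tokens fit in the fast-tier under these assumptions
```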

Formally, let $\mathcal{K}$ and $\mathcal{V}$ denote the full KV cache for a given context. TTKV partitions the cache as:

$$\mathcal{K} = \mathcal{K}_{\text{fast}} \cup \mathcal{K}_{\text{slow}}, \quad \mathcal{V} = \mathcal{V}_{\text{fast}} \cup \mathcal{V}_{\text{slow}},$$

where the subscripts fast and slow denote entries assigned to the fast and slow tiers, respectively. Entries in $\mathcal{K}_{\text{fast}}$ and $\mathcal{V}_{\text{fast}}$ are selected based on recency and role importance (e.g., high attention scores), whereas less frequently accessed or lower-importance entries are placed in $\mathcal{K}_{\text{slow}}$ and $\mathcal{V}_{\text{slow}}$. This selection maximizes the fraction of accesses served from GPU HBM, thereby reducing costly PCIe transfers and preserving accuracy for the most impactful tokens.

Block-wise Organization in Slow-tier. The block structure is a core mechanism enabling the Tier Interaction. By organizing the slow-tier KV cache into fixed-size *blocks* of $B_{\text{blk}}$ tokens, we create manageable units for streaming attention, allowing fine-grained scheduling that overlaps transfer and computation. This mirrors the cognitive process of chunking, where the brain groups information into units for efficient recall and processing. We experimented with block sizes $B_{\text{blk}} \in \{32, 64, 128, 256\}$ and found that 128 offered the best trade-off between latency and accuracy across models; details are in Section [5.3](https://arxiv.org/html/2604.19769#S5.SS3).

Each slow-tier block $B_j$ contains compressed representations of $B_{\text{blk}}$ key and value vectors. Moreover, a lightweight index, functioning like a cognitive pointer to memory chunks, maps token positions to block identifiers:

$$\mathrm{idx}: \mathcal{T}_{\text{slow}}(t) \to \{1, 2, \ldots\}, \tag{1}$$

so that, given a position $i \in \mathcal{T}_{\text{slow}}(t)$, $\mathrm{idx}(i)$ returns the specific block $B_j$. This index structure is fundamental to the Tier Interaction, as it (i) provides a deterministic, fine-grained unit for cross-tier data fetches, and (ii) enables the streaming attention to efficiently locate, schedule, and transfer only the required blocks, thereby minimizing overhead and supporting the overlap of computation and communication.
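As a concrete reading of the index in Eq. (1), the sketch below maps an absolute token position to its slow-tier block identifier. It is a minimal illustration under stated assumptions: the `SlowTierIndex` class, its `block_offsets` field, and the use of a plain integer division follow the paper's description of block-wise organization but are otherwise hypothetical.

```python
# Minimal sketch of the slow-tier block index idx(i): token position -> block id.
# B_BLK = 128 follows the paper's chosen block size; the class itself is illustrative.
from dataclasses import dataclass, field

B_BLK = 128  # tokens per slow-tier block

@dataclass
class SlowTierIndex:
    # maps block id -> offset of the block's compressed payload (hypothetical layout)
    block_offsets: dict = field(default_factory=dict)

    def idx(self, position: int) -> int:
        """Return the block id B_j holding the KV entry for an absolute token position."""
        return position // B_BLK

    def register(self, block_id: int, offset: int) -> None:
        self.block_offsets[block_id] = offset

index = SlowTierIndex()
index.register(block_id=0, offset=0)
assert index.idx(200) == 1   # token 200 lives in the second 128-token block
```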

Dynamic Tier Management. The flow of data between tiers is governed by the principles of Layout and Content. Newly generated tokens, representing the most immediate "working memory", are first appended to the fast-tier in full precision. When the fast-tier reaches its capacity $C_{\text{fast}}$, the oldest block of $B_{\text{blk}}$ tokens is evicted. This block is then compressed according to the differential quantization scheme (see Section [4.2](https://arxiv.org/html/2604.19769#S4.SS2)) before being appended to the slow-tier.

This First-In-First-Out (FIFO) management policy ensures that the working set of the attention operation remains in GPU HBM, while older context is preserved in a compressed, structured form in host DRAM. Moreover, this process mimics the cognitive mechanism of human memory consolidation, in which recent experiences are initially vivid and are later processed (e.g., compressed) for long-term storage.
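A minimal sketch of this FIFO consolidation policy is given below, assuming a plain Python deque for the fast-tier; `quantize_block`, `L_FAST`, and the list-based slow-tier are illustrative stand-ins rather than the paper's implementation.

```python
# Minimal sketch of the FIFO tier-management policy: new tokens enter the fast-tier in
# full precision; when it overflows, the oldest B_blk tokens are compressed as one block
# and appended to the slow-tier. quantize_block() stands in for the Sec. 4.2 quantizer.
from collections import deque

B_BLK = 128          # tokens per evicted block
L_FAST = 4096        # fast-tier capacity in tokens (illustrative)

fast_tier = deque()  # recent (k, v) pairs in FP16, oldest on the left
slow_tier = []       # list of compressed blocks

def quantize_block(block):
    # placeholder for differential quantization (8-bit keys / 4-bit values)
    return block

def append_token(k, v):
    fast_tier.append((k, v))
    if len(fast_tier) >= L_FAST:
        # evict the oldest B_BLK tokens as a single block and consolidate it
        block = [fast_tier.popleft() for _ in range(B_BLK)]
        slow_tier.append(quantize_block(block))
```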

### 4.2 Tier Content: Differential Quantization

The Tier Content principle dictates that different data should be stored with different precision, mirroring the selective retention characteristics of human memory. Moreover, our analysis in Figure [2](https://arxiv.org/html/2604.19769#S3.F2) suggests that keys exhibit larger variance and govern attention scores, whereas values are stable. Applying uniform compression is therefore suboptimal: it wastes bits on tolerant values and degrades critical keys.

TTKV implements this principle through differential quantization, aligning precision with functional role:

- Keys are preserved at higher precision (8-bit) to maintain the accuracy of attention score computation, which is the core component of the model's reasoning.
- Values are compressed to lower precision (4-bit) to maximize reductions in storage footprint and cross-tier transfer volume, with minimal impact on output quality.

Formally, for a vector $x$ in the slow-tier, the applied differential quantization operator $Q$ is defined as:

$$Q(x) = \begin{cases} Q_{8\text{-bit}}(x) & \text{if } x \in \mathcal{K}_{\text{slow}}, \\ Q_{4\text{-bit}}(x) & \text{if } x \in \mathcal{V}_{\text{slow}}, \end{cases}$$

where $Q_{8\text{-bit}}$ and $Q_{4\text{-bit}}$ are quantizers for their respective bit-widths. This targeted precision assignment is a pivotal co-design element: by reducing the data volume of values before they are placed in the slow-tier, differential quantization directly amplifies the efficiency gains of the Layout (smaller blocks to move) and Interaction (less traffic to schedule) principles, while preserving model accuracy.
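For illustration, the sketch below instantiates $Q_{8\text{-bit}}$ and $Q_{4\text{-bit}}$ as simple per-vector asymmetric (min/max) quantizers in PyTorch. The paper does not specify this exact quantizer, so the scheme and the `quantize`/`dequantize` helpers are assumptions; only the 8-bit-keys / 4-bit-values split follows the text.

```python
# Minimal sketch of the differential quantizer Q: 8-bit for keys, 4-bit for values.
# Per-vector asymmetric (min/max) quantization is one simple choice, assumed here.
import torch

def quantize(x: torch.Tensor, bits: int):
    """Quantize a vector to unsigned integers with a per-vector scale and zero point."""
    qmax = 2**bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo).clamp(min=1e-8) / qmax
    q = torch.round((x - lo) / scale).clamp(0, qmax).to(torch.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    return q.to(torch.float16) * scale + lo

def compress_kv(k: torch.Tensor, v: torch.Tensor):
    # keys keep more bits (attention scores are precision-sensitive), values fewer
    return quantize(k, bits=8), quantize(v, bits=4)

k, v = torch.randn(128), torch.randn(128)
(kq, ks, kz), (vq, vs, vz) = compress_kv(k, v)
print((dequantize(kq, ks, kz) - k).abs().max(), (dequantize(vq, vs, vz) - v).abs().max())
```

In practice the 4-bit codes would also be packed two per byte before being written to DRAM; the sketch keeps them in uint8 for readability.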

### 4.3 Tier Interaction: Streaming Attention

The Tier Interaction principle is realized through streaming attention, which shifts cross-tier data movement from a blocking cost to a hidden overhead. As Figure [3](https://arxiv.org/html/2604.19769#S3.F3) quantifies, naive access patterns leave the GPU idle during PCIe transfers. Inspired by the brain's ability to stream and process memories concurrently, TTKV's pipeline proactively manages data flow to overlap communication with computation.

The core mechanism, detailed in Algorithm [1](https://arxiv.org/html/2604.19769#alg1), orchestrates the fetching and execution of compressed KV blocks in three phases that operate in a pipelined manner:

1. Block Identification: Based on the current query and the block index, streaming attention identifies a subset of the most relevant compressed blocks in the slow-tier (i.e., streaming attention performs a form of sparse attention).
2. Asynchronous Prefetch: It issues non-blocking requests to transfer these compressed blocks from host DRAM to GPU HBM ahead of their immediate need.
3. Scheduled Overlap: These transfers proceed in parallel with the attention computation on data already present in the fast-tier, thus hiding the PCIe latency.

The Interaction pipeline is essential to unlock the full potential of the Layout and Content principles. The tiered layout provides the structure for block-wise management, while differential quantization reduces the volume of data to be moved. In turn, the pipeline efficiently hides the remaining transfer cost, ensuring that the theoretical benefits of the first two principles translate into real-world latency reduction.

Formally, during each decoding step $t$, the system state is defined by the tokens currently in the fast-tier, $\mathcal{K}_{\text{fast}}^{(t)}$ and $\mathcal{V}_{\text{fast}}^{(t)}$. TTKV's streaming attention executes the attention computation concurrently with data prefetching:

$$\mathbf{o}_t = \mathrm{Attn}\big(Q_t, \mathcal{K}_{\text{fast}}^{(t)}, \mathcal{V}_{\text{fast}}^{(t)}\big) \quad \text{while asynchronously fetching } \mathcal{B}_{\text{next}} \subset \{\mathcal{K}_{\text{slow}}, \mathcal{V}_{\text{slow}}\}.$$

Here, $\mathrm{Attn}(\cdot)$ computes the attention output using available data, and $\mathrm{AsyncPrefetch}(\cdot)$ proactively fetches the next block $\mathcal{B}_{\text{next}}$ from the slow-tier based on the block index $\mathrm{idx}$.

Algorithm 1: TTKV Streaming Attention. Input: query $\mathbf{q}_t$, fast-tier cache $(\mathcal{K}_{\text{fast}}, \mathcal{V}_{\text{fast}})$, slow-tier blocks $\{B_j\}$. Parameters: capacity $L_{\text{fast}}$, scoring function $\phi(\cdot)$. Output: attention output $\mathbf{o}_t$.

```
 1: o_t ← Attn(q_t, K_fast, V_fast)                      {Layout}
 2: J_scores ← []
 3: for each block B_j in the slow-tier do
 4:     append (φ(q_t, B_j), j) to J_scores              {Identify via idx}
 5: end for
 6: J_top ← TopK(J_scores, k)
 7: for each (score, j) in J_top do
 8:     AsyncPrefetch(B_j)                               {Interaction}
 9:     (K̃_j, Ṽ_j) ← Decompress(B_j)                     {Content (8-bit / 4-bit)}
10:     o_t ← o_t + Attn(q_t, K̃_j, Ṽ_j)
11: end for
12: if len(K_fast) ≥ L_fast then
13:     B_evict ← GetOldestBlock(K_fast, V_fast)
14:     CompressAndAppendToSlowTier(B_evict)             {Evict & compress}
15:     RemoveFromFastTier(B_evict)
16: end if
17: return o_t
```

This concurrent execution ensures that:

- GPU idle cycles are minimized by eliminating blocking waits for data.
- The effective latency of cross-tier movement is hidden behind ongoing, useful computation.
- Necessary KV blocks are staged in HBM just as they become relevant for subsequent attention steps.

This streaming attention is the essential enabler that translates the theoretical advantages of the Layout and Content principles into tangible latency reduction by overlapping communication with computation, directly targeting the data‑movement bottleneck in long‑context inference\.
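To ground the overlap idea, the sketch below shows one way to pipeline host-to-GPU block fetches with per-block attention using a dedicated CUDA stream in PyTorch. It is a simplified illustration, not the paper's kernel: block selection is assumed to have already happened, slow-tier blocks are assumed to live in pinned host memory, and, as in Algorithm 1, per-block outputs are simply accumulated (a production kernel would merge softmax statistics across blocks instead of adding raw outputs).

```python
# Minimal sketch of overlapping slow-tier fetches with attention compute using a
# separate CUDA stream. Names (fetch_stream, attend, slow_blocks) are illustrative.
import torch
import torch.nn.functional as F

fetch_stream = torch.cuda.Stream()   # dedicated stream for host->GPU copies

def attend(q, k, v):
    # q: [heads, 1, d]; k, v: [heads, n, d]
    return F.scaled_dot_product_attention(q, k, v)

def streaming_attention(q, fast_k, fast_v, slow_blocks):
    """slow_blocks: iterable of dicts with pinned-CPU 'k' and 'v' tensors (already selected)."""
    out = attend(q, fast_k, fast_v)            # compute on the fast-tier first
    pending = None
    for blk in slow_blocks:
        with torch.cuda.stream(fetch_stream):  # issue the next copy asynchronously
            nxt = (blk["k"].to("cuda", non_blocking=True),
                   blk["v"].to("cuda", non_blocking=True))
        if pending is not None:
            out = out + attend(q, *pending)    # overlaps with the in-flight copy
        torch.cuda.current_stream().wait_stream(fetch_stream)
        pending = nxt
    if pending is not None:
        out = out + attend(q, *pending)
    return out
```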

## 5 Experiments

### 5.1 Experimental Setup

Models. We evaluate TTKV on a set of large decoder-only language models spanning various scales and architectural families. Our primary models include LLaMA-3.1-8B and LLaMA-3.1-70B from the LLaMA 2/3 series Touvron et al. ([2023](https://arxiv.org/html/2604.19769#bib.bib5)), the Qwen2.5 family of models, up to 32B parameters, as described in the Qwen2.5-Coder technical report Hui et al. ([2024](https://arxiv.org/html/2604.19769#bib.bib46)), and DeepSeek-R1-14B, a distilled variant derived from reasoning-optimized backbones Zheng et al. ([2025](https://arxiv.org/html/2604.19769#bib.bib47)). We also include Mistral-7B, representative of efficient open-weight LLMs Jiang et al. ([2023](https://arxiv.org/html/2604.19769#bib.bib6)).

Baselines and Datasets. We compare TTKV against the standard FP16 dense-prefix KV cache and several state-of-the-art KV-cache methods, including KIVI Liu et al. ([2024b](https://arxiv.org/html/2604.19769#bib.bib15)), KVQuant Hooper et al. ([2025](https://arxiv.org/html/2604.19769#bib.bib14)), DiffKV Zhang et al. ([2025](https://arxiv.org/html/2604.19769#bib.bib13)), and ShadowKV Sun et al. ([2025](https://arxiv.org/html/2604.19769#bib.bib16)). For evaluation, we use Qasper and MultiNews from LongBench Bai et al. ([2024](https://arxiv.org/html/2604.19769#bib.bib1)), the Loong extended multi-document QA benchmark Wang et al. ([2024](https://arxiv.org/html/2604.19769#bib.bib17)), and the RULER synthetic long-context benchmark Hsieh et al. ([2024](https://arxiv.org/html/2604.19769#bib.bib18)).

Implementation details. All experiments are conducted on NVIDIA A100 and RTX 3090 GPUs, using a unified PyTorch and CUDA implementation. Our configuration is as follows: (1) measurements of memory usage, latency, and accuracy are taken under identical context lengths, with a batch size of 8, to ensure a fair comparison; (2) for differential quantization, we assign 8 bits to keys and 4 bits to values; (3) the KV cache in the slow-tier is organized into blocks of 128 tokens each; (4) latency is measured on the GPU using CUDA events, with sufficient warm-up steps to ensure stable results, and p95 per-token latency is reported; (5) memory usage is monitored using NVML counters, and host-to-GPU KV traffic is measured by aggregating host-to-GPU memory reads triggered by KV cache fetches during inference.
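As a small companion to the measurement protocol above, the sketch below shows how p95 per-token latency can be collected with CUDA events after warm-up. The `decode_one_token` callable, the warm-up count, and the step count are placeholders, not values from the paper.

```python
# Minimal sketch of per-token latency measurement with CUDA events and p95 reporting.
import torch

def measure_p95(decode_one_token, num_warmup=16, num_steps=256):
    for _ in range(num_warmup):            # warm-up to stabilize clocks and allocators
        decode_one_token()
    latencies_ms = []
    for _ in range(num_steps):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        decode_one_token()
        end.record()
        torch.cuda.synchronize()            # wait so elapsed_time() is valid
        latencies_ms.append(start.elapsed_time(end))
    latencies_ms.sort()
    return latencies_ms[int(0.95 * (len(latencies_ms) - 1))]   # p95 per-token latency
```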

| Model | Method | p95 Lat. (ms) ↓ | H→G (GB) ↓ | Acc. (%) ↑ |
|---|---|---|---|---|
| Mistral-7B | FP16 | 340 | 3.20 | 59.2 |
| Mistral-7B | KIVI (2-bit) | 325 | 2.75 | 58.8 |
| Mistral-7B | KVQuant (3-bit) | 332 | 2.95 | 59.0 |
| Mistral-7B | DiffKV | 338 | 3.05 | 58.8 |
| Mistral-7B | ShadowKV | 345 | 3.15 | 58.9 |
| Mistral-7B | **TTKV** | 215 | 0.95 | 59.1 |
| Llama-3.1-8B | FP16 | 380 | 3.60 | 65.4 |
| Llama-3.1-8B | KIVI (2-bit) | 362 | 3.10 | 64.9 |
| Llama-3.1-8B | KVQuant (3-bit) | 370 | 3.35 | 65.2 |
| Llama-3.1-8B | DiffKV | 378 | 3.45 | 65.1 |
| Llama-3.1-8B | ShadowKV | 385 | 3.55 | 65.3 |
| Llama-3.1-8B | **TTKV** | 245 | 1.05 | 65.0 |

Table 1: L-Eval (32K): TTKV achieves >3× lower Host-to-GPU (H→G) traffic and ∼35% lower p95 latency while matching FP16 accuracy.

| Model | Method | p95 Lat. (ms) ↓ | H→G (GB) ↓ | ROUGE-L ↑ |
|---|---|---|---|---|
| DeepSeek-R1-14B | FP16 | 285 | 2.40 | 50.2 |
| DeepSeek-R1-14B | KIVI (2-bit) | 275 | 2.15 | 48.8 |
| DeepSeek-R1-14B | KVQuant (3-bit) | 280 | 2.25 | 50.0 |
| DeepSeek-R1-14B | DiffKV | 285 | 2.30 | 49.2 |
| DeepSeek-R1-14B | ShadowKV | 290 | 2.38 | 49.9 |
| DeepSeek-R1-14B | **TTKV** | 175 | 0.85 | 50.0 |
| Llama-3.1-8B | FP16 | 185 | 1.60 | 48.5 |
| Llama-3.1-8B | KIVI (2-bit) | 178 | 1.40 | 47.8 |
| Llama-3.1-8B | KVQuant (3-bit) | 182 | 1.50 | 48.3 |
| Llama-3.1-8B | DiffKV | 185 | 1.55 | 48.1 |
| Llama-3.1-8B | ShadowKV | 188 | 1.58 | 48.2 |
| Llama-3.1-8B | **TTKV** | 118 | 0.62 | 48.4 |

Table 2: GovReport (16K): TTKV reduces Host-to-GPU (H→G) traffic by ∼2.8× and p95 latency by ∼35% under a 4 GB HBM KV cache budget while preserving the ROUGE-L score.

Table 3: Token throughput (tokens/sec) vs. context length for LLaMA-3.1-8B (batch size 8). Baselines show expected throughput degradation due to memory traffic, while TTKV maintains higher throughput by minimizing cross-tier overhead.

| Model | Method | MultiNews ROUGE-L ↑ | MultiNews p95 (ms) ↓ | Qasper F1 ↑ | Qasper p95 (ms) ↓ | RepoBench-P Score ↑ | RepoBench-P p95 (ms) ↓ | RULER Acc. (%) ↑ | RULER p95 (ms) ↓ |
|---|---|---|---|---|---|---|---|---|---|
| Llama-3.1-70B | FP16 | 29.5 | 2400 | 38.0 | 2450 | 58.0 | 2500 | 62.0 | 2450 |
| Llama-3.1-70B | KIVI | 27.7 | 1000 | 35.8 | 1020 | 55.8 | 1030 | 59.5 | 1015 |
| Llama-3.1-70B | KVQuant | 28.5 | 950 | 36.8 | 970 | 56.8 | 980 | 60.8 | 965 |
| Llama-3.1-70B | DiffKV | 28.7 | 850 | 37.0 | 880 | 57.0 | 890 | 61.0 | 875 |
| Llama-3.1-70B | ShadowKV | 28.5 | 820 | 36.8 | 850 | 56.8 | 860 | 60.8 | 845 |
| Llama-3.1-70B | **TTKV** | 29.3 | 580 | 37.8 | 610 | 57.8 | 600 | 61.8 | 590 |
| Qwen2.5-32B | FP16 | 28.0 | 2300 | 39.0 | 2350 | 60.0 | 2400 | 64.0 | 2350 |
| Qwen2.5-32B | KIVI | 26.2 | 980 | 36.5 | 1020 | 57.0 | 1050 | 60.5 | 1015 |
| Qwen2.5-32B | KVQuant | 27.2 | 920 | 37.8 | 960 | 58.2 | 980 | 62.3 | 965 |
| Qwen2.5-32B | DiffKV | 27.4 | 850 | 38.0 | 880 | 58.5 | 900 | 62.5 | 885 |
| Qwen2.5-32B | ShadowKV | 27.2 | 820 | 37.9 | 860 | 58.3 | 880 | 62.3 | 865 |
| Qwen2.5-32B | **TTKV** | 27.9 | 540 | 38.9 | 570 | 59.9 | 560 | 63.9 | 550 |
| DeepSeek-R1-14B | FP16 | 27.0 | 2500 | 37.5 | 2550 | 59.5 | 2600 | 63.5 | 2550 |
| DeepSeek-R1-14B | KIVI | 25.3 | 1020 | 35.0 | 1050 | 56.5 | 1070 | 59.8 | 1045 |
| DeepSeek-R1-14B | KVQuant | 26.2 | 950 | 36.2 | 980 | 58.0 | 1000 | 61.8 | 985 |
| DeepSeek-R1-14B | DiffKV | 26.4 | 880 | 36.5 | 910 | 58.3 | 930 | 62.1 | 915 |
| DeepSeek-R1-14B | ShadowKV | 26.2 | 845 | 36.3 | 880 | 58.1 | 900 | 61.9 | 885 |
| DeepSeek-R1-14B | **TTKV** | 26.9 | 450 | 37.3 | 480 | 59.3 | 470 | 63.3 | 460 |

Table 4: 128K long-context performance on MultiNews, Qasper, RepoBench-P, and RULER. TTKV delivers substantially lower p95 latency while maintaining near-FP16 quality.

Table 5: 128K summary table: TTKV reduces H→G traffic by ∼5.94× and p95 latency by ∼76% (averaged across tasks).

![Refer to caption](https://arxiv.org/html/2604.19769v1/x3.png)
(a) Host-to-GPU Traffic

![Refer to caption](https://arxiv.org/html/2604.19769v1/x4.png)
(b) p95 Decode Latency

Figure 5: Performance scaling on the RULER dataset.

![Refer to caption](https://arxiv.org/html/2604.19769v1/x5.png)

Figure 6: Normalized Host→GPU traffic and p95 latency across models. TTKV reduces both metrics compared to state-of-the-art KV compression baselines (KIVI, KVQuant, ShadowKV).
### 5.2 Results and Analysis

#### Host\-to\-GPU Traffic Reduction:

Figure [5](https://arxiv.org/html/2604.19769#S5.F5), Figure [6](https://arxiv.org/html/2604.19769#S5.F6), and Table [5](https://arxiv.org/html/2604.19769#S5.T5) show that TTKV achieves significant reductions in host-to-GPU traffic across all evaluated models, including LLaMA-3.1-70B, Qwen2.5-32B, and DeepSeek-R1-14B. The reductions range from 2.8× to 5.94×, depending on the context length. TTKV's advantage is most pronounced at longer context lengths: at the 128K context it reduces host-to-GPU traffic by 5.94× relative to the baseline FP16 method. These results show that TTKV effectively manages memory traffic, especially at longer context lengths, reducing overhead in large-scale inference compared to existing methods.

#### p95 Latency Reduction:

Table [1](https://arxiv.org/html/2604.19769#S5.T1) and Table [2](https://arxiv.org/html/2604.19769#S5.T2) indicate that TTKV reduces p95 latency by 35% on average compared to FP16 across different models, including LLaMA-3.1-8B, Mistral-7B, and DeepSeek-R1-14B, evaluated on medium-context tasks such as those in the L-Eval and GovReport datasets. The largest improvements are observed in long-context tasks, particularly at the 128K context length, where TTKV reduces p95 latency by approximately 76% compared to FP16. This is particularly evident on models like LLaMA-3.1-70B and Qwen2.5-32B, across tasks such as Qasper, MultiNews, and RULER (see Table [4](https://arxiv.org/html/2604.19769#S5.T4) and Figure [6](https://arxiv.org/html/2604.19769#S5.F6)). TTKV consistently outperforms all state-of-the-art methods, including KIVI, KVQuant, and ShadowKV, demonstrating its superior efficiency in optimizing decoding time while maintaining strong task performance.

#### Throughput Analysis

The data in Table [3](https://arxiv.org/html/2604.19769#S5.T3) indicate that TTKV consistently achieves higher token throughput (tokens/sec) across increasing context lengths, demonstrating its superior end-to-end generation efficiency. TTKV achieves up to approximately 2× higher throughput at the 128K context length compared to the best baseline, reflecting reduced cross-tier overhead and improved decoding efficiency.

#### Accuracy Preservation:

Results in Table [1](https://arxiv.org/html/2604.19769#S5.T1) and Table [4](https://arxiv.org/html/2604.19769#S5.T4) show that TTKV maintains accuracy comparable to or slightly better than state-of-the-art methods on MultiNews, Qasper, L-Eval, and RULER. For example, on the LLaMA-3.1-70B model, TTKV achieves a ROUGE-L score of 29.3, outperforming all compared KV-cache optimization baselines. This demonstrates TTKV's ability to preserve model accuracy while significantly reducing traffic and latency.

### 5.3 Ablation Study

We conduct an ablation study to evaluate the individual contributions of the fast/slow‑tier, differential quantization, streaming attention, and block size\. Experiments are performed on LLaMA‑3\.1‑8B with a 64K context on RULER, focusing on host‑to‑GPU traffic, p95 latency, and accuracy\.

#### Fast\-tier and Slow\-tier \(Tier Layout\):

We compare the full TTKV, with both fast-tier and slow-tier, against a variant that stores all KV entries in a single tier (slow-tier only). The results in Table [6](https://arxiv.org/html/2604.19769#S5.T6) indicate that TTKV with the tiered memory architecture significantly outperforms the single-tier version. Specifically, the single-tier configuration results in a 10% drop in accuracy, along with a substantial increase in both host-to-GPU traffic and p95 latency. This demonstrates the importance of the fast-tier in improving both performance and efficiency, particularly in long-context tasks.

#### Differential Quantization \(Tier Content\):

This ablation evaluates the Tier Content principle by comparing TTKV's differential quantization scheme to a uniform quantization baseline applied to both keys and values. The data in Table [6](https://arxiv.org/html/2604.19769#S5.T6) show a clear advantage for the differential approach: it reduces p95 latency by 24% and host-to-GPU traffic by 4.3%, while maintaining comparable accuracy. These results demonstrate the effectiveness of our proposed differential quantization in balancing performance and efficiency.

#### Streaming Attention \(Tier Interaction\):

This ablation isolates the contribution of the Tier Interaction principle by evaluating TTKV with and without streaming attention. The results are shown in Table [6](https://arxiv.org/html/2604.19769#S5.T6): removing streaming attention leads to significantly higher p95 latency and host-to-GPU traffic, which reveals that its ability to overlap data prefetching with attention computation is critical.

Table 6: Ablation results: p95 latency, Host-to-GPU traffic, and accuracy (LLaMA-3.1-8B on the RULER dataset).
#### Block Size in Slow\-tier

We experiment with slow-tier block sizes $B_{\text{blk}} \in \{32, 64, 128, 256\}$. As shown in Table [7](https://arxiv.org/html/2604.19769#S5.T7), 128 tokens offers the best latency/accuracy trade-off: smaller blocks improve accuracy but increase latency, while larger blocks reduce latency but hurt accuracy.

Table 7: Impact of block size in the slow-tier on p95 latency, Host-to-GPU traffic, and accuracy (LLaMA-3.1-70B on the RULER dataset).

## 6 Conclusion

This paper presents*TTKV*, a KV cache management framework inspired by human memory mechanisms\. Unlike existing approaches that treat all KV states as equally important over time, TTKV observes that recent context requires fast and precise access, while older context can be stored more compactly and retrieved with lower fidelity\. Based on this insight, TTKV organizes KV caches into temporal tiers with heterogeneous capacity, precision, and latency, aligned with the HBM–DRAM memory hierarchy\. By jointly designing tier layout, tier content, and tier interaction, TTKV effectively integrates KV reduction with memory hierarchy awareness, significantly reducing cross\-tier traffic while preserving model accuracy\. Experiments on multiple LLMs and long\-context benchmarks show that TTKV consistently reduces end\-to\-end decode latency and improves throughput over state\-of\-the\-art approaches, enabling efficient and scalable long\-context LLM inference\.

## References

- Y. Bai, X. Lv, J. Zhang, H. Lyu, J. Tang, Z. Huang, Z. Du, X. Liu, A. Zeng, L. Hou, Y. Dong, J. Tang, and J. Li (2024). LongBench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, pp. 3119–3137. https://aclanthology.org/2024.acl-long.172/
- A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. (2024). The Llama 3 herd of models. arXiv:2407.21783. https://arxiv.org/abs/2407.21783
- C. Hooper, S. Kim, H. Mohammadzadeh, M. W. Mahoney, Y. S. Shao, K. Keutzer, and A. Gholami (2025). KVQuant: Towards 10 million context length LLM inference with KV cache quantization. arXiv:2401.18079. https://arxiv.org/abs/2401.18079
- C. Hsieh, S. Sun, S. Kriman, S. Acharya, D. Rekesh, F. Jia, Y. Zhang, and B. Ginsburg (2024). RULER: What's the real context size of your long-context language models? arXiv:2404.06654. https://arxiv.org/abs/2404.06654
- B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, L. Zhang, T. Liu, J. Zhang, B. Yu, K. Lu, K. Dang, Y. Fan, Y. Zhang, A. Yang, R. Men, F. Huang, B. Zheng, Y. Miao, S. Quan, Y. Feng, X. Ren, X. Ren, J. Zhou, and J. Lin (2024). Qwen2.5-Coder technical report. arXiv:2409.12186. https://arxiv.org/abs/2409.12186
- A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed (2023). Mistral 7B. arXiv:2310.06825. https://arxiv.org/abs/2310.06825
- B. Jiang, T. Yang, Y. Liu, X. He, S. Di, and S. Jin (2026). PackKV: Reducing KV cache memory footprint through LLM-aware lossy compression. arXiv:2512.24449. https://arxiv.org/abs/2512.24449
- C. Jiang, L. Gao, H. E. Zarch, and M. Annavaram (2025). KVPR: Efficient LLM inference with I/O-aware KV cache partial recomputation. In Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, pp. 19474–19488. https://aclanthology.org/2025.findings-acl.997/
- N. Kitaev, Ł. Kaiser, and A. Levskaya (2020). Reformer: The efficient transformer. arXiv:2001.04451. https://arxiv.org/abs/2001.04451
- G. Liu, C. Li, Z. Ning, J. Lin, Y. Yao, D. Ke, M. Guo, and J. Zhao (2026). FreeKV: Boosting KV cache retrieval for efficient LLM inference. arXiv:2505.13109. https://arxiv.org/abs/2505.13109
- Y. Liu, H. Li, Y. Cheng, S. Ray, Y. Huang, Q. Zhang, K. Du, J. Yao, S. Lu, G. Ananthanarayanan, M. Maire, H. Hoffmann, A. Holtzman, and J. Jiang (2024a). CacheGen: KV cache compression and streaming for fast large language model serving. In Proceedings of the ACM SIGCOMM 2024 Conference, New York, NY, USA, pp. 38–56. https://doi.org/10.1145/3651890.3672274
- Z. Liu, J. Yuan, H. Jin, S. Zhong, Z. Xu, V. Braverman, B. Chen, and X. Hu (2024b). KIVI: A tuning-free asymmetric 2-bit quantization for KV cache. arXiv:2402.02750. https://arxiv.org/abs/2402.02750
- A. Sharma, H. Ding, J. Li, N. Dani, and M. Zhang (2025). MiniKV: Pushing the limits of 2-bit KV cache via compression and system co-design for efficient long context inference. In Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, pp. 18506–18523. https://aclanthology.org/2025.findings-acl.952/
- Y. Sheng, L. Zheng, B. Yuan, Z. Li, M. Ryabinin, B. Chen, P. Liang, C. Re, I. Stoica, and C. Zhang (2023). FlexGen: High-throughput generative inference of large language models with a single GPU. In Proceedings of the 40th International Conference on Machine Learning, PMLR 202, pp. 31094–31116. https://proceedings.mlr.press/v202/sheng23a.html
- H. Sun, L. Chang, W. Bao, S. Zheng, N. Zheng, X. Liu, H. Dong, Y. Chi, and B. Chen (2025). ShadowKV: KV cache in shadows for high-throughput long-context LLM inference. arXiv:2410.21465. https://arxiv.org/abs/2410.21465
- H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288. https://arxiv.org/abs/2307.09288
- D. Wang, Z. Liu, S. Wang, Y. Ren, J. Deng, J. Hu, T. Chen, and H. Yang (2025). FIER: Fine-grained and efficient KV cache retrieval for long-context LLM inference. arXiv:2508.08256. https://arxiv.org/abs/2508.08256
- M. Wang, L. Chen, F. Cheng, S. Liao, X. Zhang, B. Wu, H. Yu, N. Xu, L. Zhang, R. Luo, Y. Li, M. Yang, F. Huang, and Y. Li (2024). Leave no document behind: Benchmarking long-context LLMs with extended multi-doc QA. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, Florida, USA, pp. 5627–5646. https://aclanthology.org/2024.emnlp-main.322/
- G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis (2024). Efficient streaming language models with attention sinks. arXiv:2309.17453. https://arxiv.org/abs/2309.17453
- Y. Zhang, Y. Hu, R. Zhao, J. C. S. Lui, and H. Chen (2025). DiffKV: Differentiated memory management for large language models with parallel KV compaction. arXiv:2412.03131. https://arxiv.org/abs/2412.03131
- Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y. Tian, C. Ré, C. Barrett, Z. Wang, and B. Chen (2023). H2O: Heavy-hitter oracle for efficient generative inference of large language models. arXiv:2306.14048. https://arxiv.org/abs/2306.14048
- C. Zheng, Z. Zhang, B. Zhang, R. Lin, K. Lu, B. Yu, D. Liu, J. Zhou, and J. Lin (2025). ProcessBench: Identifying process errors in mathematical reasoning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, pp. 1009–1024. https://aclanthology.org/2025.acl-long.50/
