@_avichawla: A tricky LLM interview question: You're serving a reasoning model on vLLM, and it keeps running out of GPU memory on lo…

X AI KOLs Timeline 06/27/26, 11:14 AM News

kv-cache memory-management vllm reasoning-models attention compression nvidia

Summary

Explains why evicting 90% of KV cache tokens fails to free GPU memory when serving reasoning models on vLLM, due to paged attention fragmentation, and introduces NVIDIA's TriAttention as a solution that achieves 2.5x speedup and 10.7x memory reduction.

A tricky LLM interview question: You're serving a reasoning model on vLLM, and it keeps running out of GPU memory on long traces. So you add KV cache compression and evict 90% of the cached tokens. VRAM usage stays as is and GPU still runs out of memory. Why? (answer below) Evicting 90% of the KV cache can free almost none of the memory it was using. This sounds counterintuitive, but it follows directly from how production servers store the cache today. The KV cache grows with every token a model generates. Each token appends its key and value vectors across every layer, and nothing is freed while generation continues. This is the dominant memory cost for reasoning models. If a 32K-token CoT caches ~32K tokens of KV vectors, a Qwen3-32B with 4-bit weights will run out-of-memory around 24K tokens on a 24GB GPU. One obvious solution is to keep the important tokens and drop the rest, since attention is sparse enough to allow it. But this does not solve the memory problem yet. The reason is paged attention, which is the memory manager behind vLLM and most production servers. Under the hood, it splits GPU memory into fixed physical blocks, each one holds the KV for about 16 tokens. This block returns to the allocator only when every slot inside it is empty. Since the eviction logic selects tokens by importance, and such tokens are scattered across blocks... ...so despite eviction, almost every block is left with at least some survivor tokens. For instance, if the logic evicts 14k of 16k tokens across 1,000 blocks, most likely every block will still have a token. This means the allocator frees almost nothing. Placing the new tokens into those freed slots is not ideal because it breaks the cache's layout. Say token 16,001 arrives, and it's placed in the slot the 40th token used to hold. The cache now reads position 38, then 16,001, then 41, so the cache is no longer in token order. Attention can still compute the right answer from that, but only if every slot now carries a separate note recording which position it actually holds. This introduces another bookkeeping cost that an in-order layout inherently avoids. So the cache is logically 90% smaller and still physically the same size. Many compression results miss this because they measure on pre-allocated contiguous tensors rather than a paged server. There's another problem. Eviction methods pick which tokens to keep by looking at the attention scores themselves (as expected). But fast attention kernels used in production, like FlashAttention, never save those scores. They compute attention in small pieces and throw the full score grid away as they go, which is also why they're fast. So the exact signal eviction methods need isn't available in memory. The workaround is to fall back to eager attention and build the full matrix, which gives up the speed FlashAttention was there to provide. NVIDIA published a method called TriAttention to solve both these problems. It never needs attention scores. Instead, it scores tokens from the geometry of the model's key and query vectors before RoPE is applied, where those vectors sit in stable clusters. For the memory problem, it runs a compaction pass every 128 decoded tokens. The surviving tokens slide forward to close the holes eviction creates, so whole blocks empty out and return to the allocator while the cache stays in token order. On long reasoning traces, the approach matches full-attention accuracy while decoding 2.5x faster and using 10.7x less KV memory. KV cache compression is a big infrastructure problem. The number that decides whether it works is the count of freed blocks, not the count of evicted tokens. You can find the NVIDIA write-up here: https://research.nvidia.com/labs/eai/blogs/kv-cache-compression-and-its-infra-problems/… I wrote a first-principles breakdown of how the KV cache works. It walks through why the model stores keys and values at all, why the cache grows with every token, and a comparison of LLM generation speed with and without KV caching. Read it below.

Original Article

View Cached Full Text

Cached at: 06/27/26, 01:55 PM

A tricky LLM interview question:

You’re serving a reasoning model on vLLM, and it keeps running out of GPU memory on long traces.

So you add KV cache compression and evict 90% of the cached tokens.

VRAM usage stays as is and GPU still runs out of memory.

Why?

(answer below)

Evicting 90% of the KV cache can free almost none of the memory it was using.

This sounds counterintuitive, but it follows directly from how production servers store the cache today.

The KV cache grows with every token a model generates. Each token appends its key and value vectors across every layer, and nothing is freed while generation continues.

This is the dominant memory cost for reasoning models.

If a 32K-token CoT caches ~32K tokens of KV vectors, a Qwen3-32B with 4-bit weights will run out-of-memory around 24K tokens on a 24GB GPU.

One obvious solution is to keep the important tokens and drop the rest, since attention is sparse enough to allow it.

But this does not solve the memory problem yet.

The reason is paged attention, which is the memory manager behind vLLM and most production servers.

Under the hood, it splits GPU memory into fixed physical blocks, each one holds the KV for about 16 tokens.

This block returns to the allocator only when every slot inside it is empty.

Since the eviction logic selects tokens by importance, and such tokens are scattered across blocks…

…so despite eviction, almost every block is left with at least some survivor tokens.

For instance, if the logic evicts 14k of 16k tokens across 1,000 blocks, most likely every block will still have a token.

This means the allocator frees almost nothing.

Placing the new tokens into those freed slots is not ideal because it breaks the cache’s layout.

Say token 16,001 arrives, and it’s placed in the slot the 40th token used to hold. The cache now reads position 38, then 16,001, then 41, so the cache is no longer in token order.

Attention can still compute the right answer from that, but only if every slot now carries a separate note recording which position it actually holds.

This introduces another bookkeeping cost that an in-order layout inherently avoids.

So the cache is logically 90% smaller and still physically the same size. Many compression results miss this because they measure on pre-allocated contiguous tensors rather than a paged server.

There’s another problem.

Eviction methods pick which tokens to keep by looking at the attention scores themselves (as expected).

But fast attention kernels used in production, like FlashAttention, never save those scores.

They compute attention in small pieces and throw the full score grid away as they go, which is also why they’re fast.

So the exact signal eviction methods need isn’t available in memory. The workaround is to fall back to eager attention and build the full matrix, which gives up the speed FlashAttention was there to provide.

NVIDIA published a method called TriAttention to solve both these problems.

It never needs attention scores. Instead, it scores tokens from the geometry of the model’s key and query vectors before RoPE is applied, where those vectors sit in stable clusters.

For the memory problem, it runs a compaction pass every 128 decoded tokens.

The surviving tokens slide forward to close the holes eviction creates, so whole blocks empty out and return to the allocator while the cache stays in token order.

On long reasoning traces, the approach matches full-attention accuracy while decoding 2.5x faster and using 10.7x less KV memory.

KV cache compression is a big infrastructure problem. The number that decides whether it works is the count of freed blocks, not the count of evicted tokens.

You can find the NVIDIA write-up here: https://research.nvidia.com/labs/eai/blogs/kv-cache-compression-and-its-infra-problems/…

I wrote a first-principles breakdown of how the KV cache works. It walks through why the model stores keys and values at all, why the cache grows with every token, and a comparison of LLM generation speed with and without KV caching.

Read it below.

KV Cache Compression and Its Infra Problems

Source: https://research.nvidia.com/labs/eai/blogs/kv-cache-compression-and-its-infra-problems/ Consider serving OpenClaw on a local GPU with a capable reasoning model. The agent picks up a simple task — read a handful of documents, write a weekly report — and crashes halfway through. Not because the task is hard, but because the GPU ran out of memory.

This blog examines KV cache compression through an infrastructure lens: not only what the algorithms do, but whether they can actually run in production. An entire field is devoted to compressing the KV cache so that long inference fits in GPU memory; most of it works in the lab, much less in real deployment. The gap comes down to two infrastructure problems that papers almost never discuss. This post surveys the field’s main ideas, traces where they collide with these problems, and shows how a geometric property hidden beneath RoPE has recently been used to resolve both.

The hard part of KV cache compression is not choosing which tokens to keep — it istwo collisions with production infrastructure. Existing methods needhistorical attention scores, but theFlashAttention kernel never writes those scoresinto GPU memory. And paged attention reclaims memory only whena block holds no KV at all— yet after token eviction the survivors leave almost every block partially occupied, so the**“freed” memory never actually comes back**.

Contents

1. Background: The KV Cache and Memory Exhaustion in Long Inference

When a Transformer generates a token, it computes aquery, akey, and avaluevector; the new query attends over all previous keys and values to read the model’s own history. The KV cache is what makes this affordable: each token’s K and V are computed once and saved, and every later step reuses them instead of recomputing — O(n) work per step instead of O(n²). But nothing ever shrinks it: every generated token appends new K and V rows across all layers and heads.

The arithmetic ends badly. Qwen3-32B with 4-bit quantized weights — a configuration people actually deploy — crashes with an out-of-memory error on a 24 GB GPU after roughly 24,000 generated tokens, short of the 32K-token traces reasoning models routinely produce on hard problems (Figure 1).

KV cache growth consuming GPU memory (schematic)A horizontal bar represents GPU memory, schematic and not to scale. A fixed segment of about one third holds model weights plus runtime; the remaining two thirds fills with the growing KV cache until it reaches a dashed out-of-memory line at the bar’s right end. Below the bar, an LLM box emits output tokens into a history box aligned directly under the KV cache segment; the history fills with token squares one by one, and with each new token the KV cache segment above extends in sync.GPU memory(schematic — not to scale)Weights + runtime(fixed)KV cachegrows with every generated token →OOMLLMgenerating…output tokenshistory— every token’s K/V stays hereEvery generated token appends its keys and values to the history — that history is the KV cache.Nothing is freed while generation continues, so long outputs march the bar into the OOM wall.**Figure 1.**Schematic, not to scale: every token the model generates appends its keys and values to the running history — the KV cache — so cache memory grows with output length while the weights stay fixed. Let generation run long enough, and the cache alone drives the GPU out of memory.Compression is possible because attention is sparse. A small fraction of tokens collect the overwhelming majority of attention weight; a token that never gets attended to could, in principle, be deleted from the cache without changing the model’s outputs. The question is: which tokens are safe to delete, and when?

2. The Existing Methods — and Two Infrastructure Problems

The field launched in 2023 from one observation: attention is highly non-uniform. Models devote a disproportionate share of attention weight to the very first tokens — “attention sinks,” not semantically important but always present (StreamingLLM[2]) — while roughly 20% of tokens collect 80% of the total weight, “heavy hitters” that tend to stay important over time (H2O[3]; Scissorhands[4]). The standard recipe followed (Figure 2): always keep the sinks, always keep a sliding window of recent tokens, and keep a budget of heavy hitters from the middle. On long-document tasks — question answering, summarization, code retrieval — the recipe works well.

StreamingLLM: attention sinks plus a sliding windowA row of tokens. The first two are attention sinks and are always kept. A window outline covers the four most recent tokens. As generation proceeds, new tokens appear on the right, the window slides forward, and tokens that fall out of it are evicted into dashed ghost slots. The sinks never move and are never evicted.StreamingLLM — keep the sinks, keep a sliding window, evict the rest← oldest token sequence grows during generation newest →sinkKEEP ✓sinkKEEP ✓sliding window — keptsliding window — keptsliding window — keptsliding window — keptsliding window — keptsliding window — keptsliding window — keptAttention sink — always keptIn sliding window — keptFell out of window — evictedFigure 2.StreamingLLM keeps only two kinds of tokens: a handful of attention sinks at the very start, and a sliding window of the most recent tokens. As generation proceeds (animated), the window slides forward and everything that falls out of it is evicted — memory stays constant, but information outside the window is permanently lost.The recipe leaves its key question open: how does a method know which middle tokens are the heavy hitters? The answer that defines the largest category of compression methods is to read the model’s ownhistorical attention scores. Every decode step computes attention over the whole history anyway, and those scores are a running record of which cached tokens the model actually uses. H2O[3]is the canonical example: it maintains, for every cached token, a cumulative sum of the attention that token has received across all decode steps so far, and after each step it evicts the lowest-scoring token to hold the cache at a fixed budget (Figure 3).

H2O: cumulative historical attention scoringA query box at the top attends to eight cached tokens. With each decode step, every token’s score bar grows by the attention it just received. After several steps, tokens with high cumulative scores are kept; tokens with low scores are evicted into dashed ghost slots.H2O — every decode step adds to each token’s cumulative attention scorenew tokenstep tstep t+1step t+2sinkKEEP ✓evict★KEEP ✓evictevict★KEEP ✓evictrecentKEEP ✓Cumulative attention score (grows each step)Lowest scores — evicted at each step**Figure 3.**H2O’s “historical attention” scoring. Every decode step, the newest token attends over the whole cache (flashing arrows), and each cached token’s cumulative score grows by the attention it just received. As decoding proceeds, the lowest-scoring tokens are evicted one per step, holding the cache at a fixed budget. These per-step scores are exactly what FlashAttention never writes to memory — Infrastructure Problem 1, below.SnapKV[6]is the most influential refinement of the same idea. Instead of accumulating scores continuously, it scores once (Figure 4): the most recent W tokens of the prompt form an “observation window,” and the attention they pay to the full history decides — at the end of prefill, once and for all — which tokens stay. This removes the per-step bookkeeping and avoids cumulative scoring’s bias toward tokens that have simply been around longer. The category has many further variants on the same ingredients: Scissorhands[4]and TOVA[5]swap the decode-time scoring signal; PyramidKV[7]and Ada-KV[8]split the one-shot budget per layer and per head.

But the observation window cannot be made large. Because RoPE — the rotation that encodes each token’s position into its query vector — orients every query by its position, only the most recent queries, empirically about 25, reflect where the model is actually attending. So any piece of information that draws no attention during one phase of the reasoning process gets evicted — even if later reasoning depends on it.

Observation-window snapshot scoringThe most recent W tokens form an observation window. The attention they pay to history identifies heavy-hitter tokens. High-scoring tokens are kept within budget B; low-scoring tokens are evicted.Observation-Window Snapshot Scoring (SnapKV-style)← token history (oldest left → newest right) →sinkKEEP ✓—evict—evict★KEEP ✓—evict—evict★KEEP ✓—evict—evictWKEEP ✓WKEEP ✓WKEEP ✓observation window (W recent tokens)attnscore↑**Figure 4.**Snapshot-based methods in the SnapKV style assign each cached token an importance score by measuring how much attention it receives from a window of recent queries; tokens outside the top-Bbudget are evicted.This entire category — whether it scores continuously like H2O or once like SnapKV — rests on observing attention scores; among the methods above, only StreamingLLM’s fixed sink-plus-recency rule needs none. That shared requirement, not the quality of any particular heuristic, collides with production infrastructure in two ways.

Infrastructure Problem 1: FlashAttention does not expose attention scores.Production inference runs on FlashAttention, which tiles the attention computation through SRAM (the GPU’s small, fast on-chip memory) and never materializes the N×N score matrix in HBM (the GPU’s main memory). That is what makes it fast — and it means the scores eviction methods want to observe are never written anywhere compression code can read them.This hits hardest where the method needs thefull attention history. H2O-style cumulative scoring requires the attention scores ofevery decode step— the running sums cannot be updated without observing each step’s scores as they are computed. The reference H2O implementation resolves the conflict byfalling back to eager attention,materializing the full score matrixandabandoning FlashAttentionoutright.

Infrastructure Problem 2: Repeated token eviction cannot actually free GPU memory in production systems.Production serving systems like vLLM manage the KV cache withpaged attention: GPU memory is divided into fixed-size physical blocks, each holding KV data for about 16 tokens, and a block can be freed only when it is completely empty.Repeated decode-time evictionscatters survivors across blocks(Figure 5). Evicting 14,400 of 16,000 tokens leaves 1,600 survivors spread over the roughly 1,000 blocks originally allocated;almost every block keeps at least one survivor, so theallocator reclaims almost nothing. R-KV[10]— the strongest prior evictor designed specifically for reasoning models — reports 90% memory savings, but those savings were measured withpre-allocated contiguous tensors, not in vLLM, and its own Appendix D acknowledges that integrating with paged attention “presents a non-trivial challenge and may require further investigation.” Quest[9]sidesteps the block-reclamation problem bykeeping the full cacheand only selecting which pages each query reads, soKV memory still grows with context length.

Paged attention fragmentation after token evictionBefore eviction: four paged blocks all full. After evicting most tokens, survivors scatter across all blocks. No block is empty, so none can be freed.Before eviction — 4 blocks, all fullBlock 1Block 2Block 3Block 4All 4 slots in each block occupied — 4 blocks × 4 tokens = 16 tokensAfter evicting most — survivors scatteredBlock 1Block 2Block 3Block 4Survivor tokenGhost slot (evicted)Every block still has a survivor → zero blocks freedVRAM usage unchanged despite evicting most tokens→evictmost**Figure 5.**After a decode-time eviction pass (H2O, R-KV style), one survivor token lands in each physical block. Since no block is completely empty, the vLLM block allocator cannot reclaim any of them — GPU memory usage is identical before and after eviction.

3. A System Solution to the Two Infrastructure Problems

The scoring function does not need attention scores at all.

The way out starts from a different question. Instead of “which tokens received high attention recently?”, ask: “does the geometry of the model’s learned representation space predict that a token will be important?” The distinction changes what information a method needs at runtime — and therefore what infrastructure it requires. TriAttention[1]is the method built on this question.

Solution to Problem 1: No Attention Scores Needed — Pre-RoPE Geometry

TriAttention’s answer to Problem 1 is to never need attention scores in the first place. It decides which KV entries to keep from the geometry of the model’s learned representation space — no attention score is ever observed at runtime, so the records FlashAttention refuses to write are records the method never asks for. Problem 1 is not worked around; it simply does not apply.

The mechanism rests on a stable geometric property of the model’s learned Q/K vectors — Figure 6 walks through it step by step.

— playing —

**Figure 6.**The mechanism, animated step by step on real Q/K data (DeepSeek-R1, layer 10, head 1). The live caption above the progress bar narrates each step as it plays; the full loop runs about 73 seconds.### Solution to Problem 2: Forward-Packing Compaction

A score by itself frees nothing. In paged memory, a block can only be returned to the allocator onceeveryslot in it is dead — eviction scores merely mark tokens, so the survivors must be physically consolidated until whole blocks actually empty out. TriAttention runs this consolidation roughly every 128 decoded tokens, and there are two ways to do it. The order-preserving repack (Figure 7) slides every survivor forward so the cache stays in original token order: no extra position bookkeeping, and the changes to an inference engine like vLLM stay minimal. The hole-filling variant (Figure 8) instead drops the newest survivors straight into the slots vacated by eviction: far less data movement per compaction, at the cost of a scrambled physical order that must then be tracked explicitly. Both variants live in the TriAttention codebase (the hole-filling one since v0.1.0) and remain maintained. The two figures below play the same round of decoding through each strategy.

Order-preserving repack: survivors slide forward, one whole block is freedSix memory blocks of four slots each. B1 to B5 hold the past context, tokens T0 through T19, every slot full; B6 receives the current round’s newly generated tokens T20 to T23, standing in for about 128 decoded tokens. Scoring evicts T2, T9 and T13 from the history and T21 from the new round, leaving dashed empty frames. Every surviving token then slides forward to close the holes while keeping its original order, so B1 to B5 end up holding T0, T1, T3, T4, then T5 to T8, then T10, T11, T12, T14, then T15 to T18, then T19, T20, T22, T23 — and B6 ends up completely empty and is returned to the allocator as a freed block.Order-preserving repack — survivors slide forward, the free space drifts to the tailpast context — every slot occupiedcurrent roundT2T9T13T21T0T1T3T4T5T6T7T8T10T11T12T14T15T16T17T18T19T20T22T23freed ✓B1B2B3B4B5B6history token (kept)new-round tokenevicted slot (ghost)freed blockdecoding — a new round (~128 tokens) fills B6scoring — deciding which tokens to keep: 4 lose their slotsrepacking — every survivor slides forward in order; the holes drift to the tailB6 is empty — one whole block returned to the allocator**Figure 7.Order-preserving repack, played on one round of decoding (B6’s four slots stand in for ~128 new tokens). After scoring, every surviving entry slides forward, block by block, so the cache stays denseandin original token order — follow the IDs — while the holes drift to the tail until B6 holds nothing at all. The move is a single gather–clone–scatter pass: every survivor is read before any is written, so overlapping source and destination ranges cannot corrupt each other. The emptied block goes straight back to the paged allocator.Hole-filling: survivors stay put, the new round drops into the holesSame six blocks and the same evictions as the previous figure: B1 to B5 hold tokens T0 through T19 with every slot full, B6 receives the current round’s new tokens T20 to T23, and scoring evicts T2, T9, T13 and T21. This time the history survivors never move: the new round’s three survivors T20, T22 and T23 drop directly into the three holes in B1, B3 and B4, so B1 ends up reading T0, T1, T20, T3 — dense but out of original order. B6 again ends up empty and is returned to the allocator as a freed block.Hole-filling — survivors stay put, the new round drops into the holespast context — every slot occupiedcurrent roundT2T9T13T21T0T1T3T4T5T6T7T8T10T11T12T14T15T16T17T18T19T20T22T23freed ✓B1B2B3B4B5B6history token (never moves)new-round tokenevicted slot (ghost)freed blockdecoding — a new round (~128 tokens) fills B6scoring — deciding which tokens to keep: 4 lose their slotshole-filling — only B6’s survivors move, straight into the holesB6 is empty — same block freed with 3 copies instead of 18, order scrambledFigure 8.**Hole-filling on the exact same input as Figure 7. The history survivors in B1–B5 never move; only the new round’s three survivors drop into the three holes that scoring opened. The same whole block is reclaimed for a fraction of the data movement — 3 copies instead of 18 — but the physical order comes out scrambled (T20 now sits between T1 and T3), so each entry’s logical position must be tracked separately.### Results

The table below shows accuracy on competition math, where KV cache compression matters most. For AIME 2024 and AIME 2025 the KV budget is 2,048 tokens — one-sixteenth of what a full 32K-token trace would otherwise cache; for MATH500 it is 512. All methods run at the same budget.

MethodAIME 2024AIME 2025MATH 500Full Attention (no compression)57.1%40.8%69.6%SnapKV[6]34.6%20.0%49.2%R-KV[10]25.4%17.5%46.4%TriAttention[1]42.1%****32.9%**56.0% The accuracy gap is large. At budget 2,048, TriAttention nearly doubles R-KV’s AIME 2025 accuracy (32.9% vs. 17.5%). At a slightly larger budget of 3,072 tokens, it matches the full-attention baseline (40.8%) while delivering2.5× higher throughput**(563 vs. 223 tokens/second) and10.7× KV memory reduction. Code is available atgithub.com/WeianMao/triattention.

Accuracy-matched comparison on AIME 2025: TriAttention vs. full attentionTwo bar-chart panels comparing full attention and TriAttention on AIME 2025 with Qwen3-8B at identical 40.8 percent accuracy. Left panel, decoding throughput in tokens per second: full attention reaches 222.8, TriAttention reaches 563.5, annotated as 2.5 times faster. Right panel, relative KV cache memory: full attention is 100 percent, TriAttention is 9.3 percent, annotated as 10.7 times smaller.Full AttentionTriAttentionDecoding throughput (tokens/s)222.8563.52.5× fasterKV cache memory (relative)100%9.3%10.7× smaller**Figure 9.**Accuracy-matched comparison on AIME 2025 (Qwen3-8B, 32K-token generation): at a KV budget of 3,072 tokens, TriAttention matches full attention’s 40.8% accuracy while decoding 2.5× faster (563.5 vs. 222.8 tokens/s) and holding 10.7× less KV cache memory.## 4. KV Cache Infra in Video Generation

The memory pressure of long inference reappears in video generation — at larger scale. A video model’s tokens are spatial patches rather than words, and anautoregressivegenerator caches KV entries for every frame produced so far, exactly the way an LLM caches past tokens; within seconds of 480p video, the cache outgrows the model’s own weights[11]. The video community has pursued the same two tracks as text — quantization and token eviction — and it has learned the lesson this blog keeps returning to: the algorithmic idea is half the work, and the demons live in the infrastructure details. Each of the three systems below earns its gains there.

On the quantization track, Quant VideoGen[11]pushes the cache down to 2 bits per element. The enabler is a property specific to video: adjacent frames are nearly identical, so the cache is full of near-duplicate tokens; grouping those tokens and storing each one’s small residual from its group average instead of full-precision K and V tensors makes the data quantization-friendly, and progressively refining and quantizing the residuals yields up to 7× memory compression — no retraining, and less than 4% latency overhead. LongLive 2.0[13]takes the same track to production scale, quantizing both weights and KV cache to NVFP4 (4-bit) on a 5B-parameter generator for1.84× throughputat negligible quality cost — and its detail worth dwelling on is the fusedparallel dequantizationkernel (Figure 10). Storing the cache in 4 bits means every block must be dequantized before each attention step, and the naive approach launches one GPU kernel per block. Dequantization, however, is embarrassingly parallel in principle — every 4-bit value can be decoded independently of all the others — and the per-block loop hides that parallelism behind a queue of small sequential launches; the fused kernel instead issues one launch whose threads map one-to-one onto every packed pair of FP4 values across the whole cache window, presenting the entire workload to the GPU at once. In practice, this keeps the overall quantization/dequantization overhead below 2%.

LongLive 2.0 pipeline: an NVFP4 KV cache window of chunks is dequantized in parallel on the model GPU, while latent chunks are asynchronously VAE-decoded to video on a second GPU. **Figure 10.**LongLive 2.0’s two-GPU inference pipeline. The model on GPU 0 attends over a sliding window of NVFP4-quantized KV cache chunks — cutting cache memory by roughly 10 GB — with a fusedparallel dequantizationkernel expanding the quantized window before attention; finished latent chunks stream to GPU 1, where the VAE decodes them to video asynchronously, overlapping decoding with generation for the annotated 1.5× speedup.On the eviction track, §3’s system solution carries over directly: TriAttention’s trigonometric scoring[1]has been applied to LongLive[12], a real-time video generator built on Wan2.1-T2V-1.3B. Each KV entry corresponds to spatial patches from one frame, and the scoring decides which frames to evict — no attention scores observed, nothing for the serving stack to fight — cutting the cache by 50% with negligible quality drop; the integration ships in the TriAttention repository.

Both domains are converging on the same insight: good compression is not primarily about finding the right scoring heuristic. It is about understanding why a signal has the structure it has — and exploiting that structure at the kernel level.

5. Closing Thoughts

From 2023 to 2025 the field’s framing was: find the right heuristic for which tokens matter — heavy hitters, sinks plus recency, observation windows, redundancy. All reasonable hypotheses. But these methods run into two walls. Methods that select tokens by observed attention hit the first: FlashAttention never exposes the scores they need. Methods that evict repeatedly during decode hit the second: eviction frees no GPU memory when the allocator works in whole blocks and the survivors stay scattered. A method that cannot clear the second wall saves compute but not memory — the scarcer resource during long inference.

TriAttention clears both walls by changing the question: score tokens from the model’s stable pre-RoPE geometry rather than from observed attention, and physically consolidate survivors into a dense prefix so that tail blocks are actually freed. The result is a method that reduces the physical memory footprint of the KV cache in production paged-attention deployment — and, because pre-RoPE geometry predicts a token’s importance at any distance rather than through a ~25-query observation window, sustains accuracy at compression ratios where observation-window methods collapse.

One open question remains. TriAttention uses a uniform KV budget across heads, yet the per-head distance-preference curves it computes already classify heads by behavior — local, sink, range-specific. The natural next step is to allocate budget in proportion to a head’s complexity. Whether the pre-RoPE insight generalizes further — to model interpretability, to dynamic sparse attention, to other modalities — is an open problem worth exploring.

References

**TriAttention: Efficient Long Reasoning with Trigonometric KV Compression.**Mao W., Lin X., Huang W., Xie Y., Fu T., Zhuang B., Han S., Chen Y.ICML, 2026.github.com/WeianMao/triattention
**Efficient Streaming Language Models with Attention Sinks.**Xiao G., Tian Y., Chen B., Han S., Lewis M.ICLR, 2024.arXiv:2309.17453
**H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.**Zhang Z., Sheng Y., Zhou T., Chen T., Zheng L., Cai R., Song Z., Tian Y., Ré C., Barrett C., Wang Z., Chen B.NeurIPS, 2023.arXiv:2306.14048
**Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time.**Liu Z., Desai A., Liao F., Wang W., Xie V., Xu Z., Kyrillidis A., Shrivastava A.NeurIPS, 2023.arXiv:2305.17118
**Transformers are Multi-State RNNs.**Oren M., Hassid M., Yarden N., Adi Y., Schwartz R.EMNLP, 2024.arXiv:2401.06104(TOVA)
**SnapKV: LLM Knows What You Are Looking for Before Generation.**Li Y., Huang Y., Yang B., Venkitesh B., Locatelli A., Ye H., Cai T., Lewis P., Chen D.NeurIPS, 2024.arXiv:2404.14469
**PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling.**Cai Z., Zhang Y., Gao B., Liu Y., Li Y., Liu T., Lu K., Xiong W., Dong Y., Hu J., Xiao W.arXiv preprint, 2024.arXiv:2406.02069
**Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference.**Feng Y., Lv J., Cao Y., Xie X., Zhou S.K.NeurIPS, 2025.arXiv:2407.11550
**Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference.**Tang J., Zhao Y., Zhu K., Xiao G., Kasikci B., Han S.ICML, 2024.arXiv:2406.10774
**R-KV: Redundancy-aware KV Cache Compression for Reasoning Models.**Cai Z., Xiao W., Sun H., Luo C., Zhang Y., et al.NeurIPS, 2025.arXiv:2505.24133
**Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization.**Xi H., Yang S., Zhao Y., Li M., Cai H., et al., Han S., Keutzer K.ICML, 2026.arXiv:2602.02958
**LongLive: Real-time Interactive Long Video Generation.**Yang S., Huang W., Chu R., Xiao Y., Zhao Y., et al., Han S., Chen Y.ICLR, 2026.arXiv:2509.22622
**LongLive 2.0: An NVFP4 Parallel Infrastructure for Long Video Generation.**Chen Y., Wang L., Huang W., Yang S., Zhang B., Xiao Y., Chu R., Mao W., Hu Q., Liu S., Zhao Y., Mao H., Chen Y.-C., Xie E., Qi X., Han S.arXiv preprint, 2026.arXiv:2605.18739

@_avichawla: A tricky LLM interview question: You're serving a reasoning model on vLLM, and it keeps running out of GPU memory on lo…

KV Cache Compression and Its Infra Problems

1. Background: The KV Cache and Memory Exhaustion in Long Inference

2. The Existing Methods — and Two Infrastructure Problems

3. A System Solution to the Two Infrastructure Problems

Solution to Problem 1: No Attention Scores Needed — Pre-RoPE Geometry

5. Closing Thoughts

References

Similar Articles

Memory

@_avichawla: Prefill & decode in LLM inference. Have you ever noticed that the first token from an LLM always takes a moment to appe…

@tom_doerr: Runs 70B LLMs on single 4GB GPU https://github.com/lyogavin/airllm

Local LLM CPU users... How long is it taking you to do anything?

@techNmak: Your LLM inference is burning 50% of its compute on work it has already done. If you're running RAG or Multi-Turn Chat,…

Submit Feedback

Similar Articles

@_avichawla: Prefill & decode in LLM inference. Have you ever noticed that the first token from an LLM always takes a moment to appe…

@tom_doerr: Runs 70B LLMs on single 4GB GPU https://github.com/lyogavin/airllm

Local LLM CPU users... How long is it taking you to do anything?

@techNmak: Your LLM inference is burning 50% of its compute on work it has already done. If you're running RAG or Multi-Turn Chat,…