@KL_Div: LLMs require more GPU memory as they generate longer responses. Can we make GPU memory constant without significantly s…

Summary

IceCache introduces Dynamic Continuous Indexing to keep GPU memory usage constant during long LLM generations with minimal accuracy loss.

LLMs require more GPU memory as they generate longer responses. Can we make GPU memory constant without significantly sacrificing accuracy? IceCache is a new method for managing KV caches that leverages Dynamic Continuous Indexing (DCI) to efficiently group and retrieve tokens.
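
To make the memory claim concrete, here is a rough sketch of why an unmanaged KV cache grows with response length and what a fixed GPU-resident token budget would change. It uses the standard size estimate for a full KV cache (keys plus values, per layer, per KV head). The model shape and the 4096-token budget are hypothetical values chosen for illustration, and the budget only stands in for "whatever a constant-memory method keeps on the GPU"; this is not IceCache's algorithm and does not model DCI.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, batch=1):
    """Size in bytes of a full KV cache: keys + values for every token."""
    # The leading 2 accounts for storing both a key and a value per token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch


def budgeted_kv_cache_bytes(seq_len, budget_tokens, **shape):
    """Size if at most `budget_tokens` tokens are kept resident on the GPU.

    Assumption: how the resident tokens are selected is left out entirely.
    """
    return kv_cache_bytes(min(seq_len, budget_tokens), **shape)


# Hypothetical model shape (not taken from the post): 32 layers,
# 8 KV heads, head dimension 128, fp16 (2 bytes per element).
SHAPE = dict(n_layers=32, n_kv_heads=8, head_dim=128)
BUDGET = 4096  # hypothetical resident-token budget

if __name__ == "__main__":
    for n in (1024, 8192, 65536):
        full = kv_cache_bytes(n, **SHAPE) / 2**20
        capped = budgeted_kv_cache_bytes(n, BUDGET, **SHAPE) / 2**20
        print(f"{n:>6} tokens: full = {full:7.0f} MiB, budgeted = {capped:5.0f} MiB")
```

Under these assumed numbers the full cache costs 128 KiB per token, so it grows linearly with the response, while the budgeted figure stops growing once the sequence exceeds the budget. Which tokens to keep resident and how to retrieve the rest without losing accuracy is the hard part, and that is the problem the post says IceCache addresses.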

Cached at: 04/23/26, 01:07 PM


Similar Articles

llama.cpp - how to free up even more space on your GPU

Reddit r/LocalLLaMA

A thread sharing practical tips for freeing up GPU memory in llama.cpp, such as offloading mmproj to CPU and adjusting KV cache types, while discussing parameters like --cache-type-k/v and --spec-draft-n-max.

GPU Memory Math for LLMs (2026 Edition)

Reddit r/LocalLLaMA

A practical guide explaining how to calculate VRAM requirements for LLMs based on parameter count and quantization level, plus additional overhead from KV cache, activations, and batching.