@KL_Div: LLMs require more GPU memory as they generate longer responses. Can we make GPU memory constant without significantly s…
Summary
IceCache introduces Dynamic Continuous Indexing to keep GPU memory usage constant during long LLM generations with minimal accuracy loss.
View Cached Full Text
Cached at: 04/23/26, 01:07 PM
LLMs require more GPU memory as they generate longer responses. Can we make GPU memory constant without significantly sacrificing accuracy? IceCache is a new method for managing KV caches that leverages Dynamic Continuous Indexing (DCI) to efficiently group and retrieve tokens
Similar Articles
llama.cpp - how to free up even more space on your GPU
A thread sharing practical tips for freeing up GPU memory in llama.cpp, such as offloading mmproj to CPU and adjusting KV cache types, while discussing parameters like --cache-type-k/v and --spec-draft-n-max.
GPU Memory Math for LLMs (2026 Edition)
A practical guide explaining how to calculate VRAM requirements for LLMs based on parameter count and quantization level, plus additional overhead from KV cache, activations, and batching.
Personal continual learning for LLMs without GPU — position paper [OC]
The author proposes two architectures, Internal KV-Sphere Architecture (IKSA) and Background Micro Fine-Tuning (BMFT), for enabling LLMs to learn continually from personal interactions without GPU requirements and without catastrophic forgetting.
Built a tool that tells you exactly which LLMs fit on your GPU. Feedback wanted.
A tool that estimates which LLMs fit on a user's GPU memory, ranking models by performance while considering memory constraints and quantization levels.
@tom_doerr: Runs 70B LLMs on single 4GB GPU https://github.com/lyogavin/airllm
AirLLM is an open-source tool that optimizes inference memory usage, enabling 70B LLMs to run on a single 4GB GPU without quantization, and supports 405B models on 8GB VRAM.