gpu-memory

#gpu-memory

Built a tool that tells you exactly which LLMs fit on your GPU. Feedback wanted.

Reddit r/LocalLLaMA · yesterday

A tool that estimates which LLMs fit in a user's GPU memory and ranks the candidates by performance, taking memory constraints and quantization levels into account.
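The fit-and-rank idea in this summary can be sketched roughly as below. This is not the linked tool's code: the `Candidate` class, the `score` field, the `QUANT_BITS` table, and the 0.8 headroom factor are all assumptions made for illustration. The sketch counts weight memory only and uses nominal bit widths, so real quantized files will be somewhat larger.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical catalog entry; not taken from the linked tool."""
    name: str
    params_billion: float
    score: float  # assumed quality score, higher is better


# Nominal bits per weight, ordered from highest to lowest precision.
# Real quantization formats add metadata, so treat these as lower bounds.
QUANT_BITS = {"fp16": 16, "q8": 8, "q4": 4}


def rank_fitting_models(candidates, vram_gib, headroom=0.8):
    """Return (name, quant, weight_gib) for models that fit, best score first.

    headroom is an assumed fraction of VRAM reserved for weights; the rest is
    left for KV cache, activations, and runtime overhead.
    """
    budget_bytes = vram_gib * 2**30 * headroom
    fitting = []
    for c in candidates:
        for quant, bits in QUANT_BITS.items():
            need_bytes = c.params_billion * 1e9 * bits / 8
            if need_bytes <= budget_bytes:
                # Keep the highest-precision quantization that fits.
                fitting.append((c.score, c.name, quant, need_bytes / 2**30))
                break
    fitting.sort(key=lambda row: row[0], reverse=True)
    return [(name, quant, gib) for _, name, quant, gib in fitting]


# Made-up catalog and a 12 GiB card, purely for illustration.
catalog = [
    Candidate("model-a", 7, 60.0),
    Candidate("model-b", 13, 65.0),
    Candidate("model-c", 70, 80.0),
]
ranked = rank_fitting_models(catalog, vram_gib=12)
```

The sketch ranks by a single fixed score per model, which ignores the quality loss from lower-precision quantization; a real ranking would need a score per model and quantization pair.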


Open sourcing InfiniteKV: a KV cache that files old tokens as 104-byte searchable records in RAM or on disk instead of deleting them. Mistral-7B answered from token 76,747, 2.3x past its trained window. Colab demo

Reddit r/LocalLLaMA · 2d ago

InfiniteKV is an open-source KV cache technique that compresses old tokens into 104-byte searchable records stored in RAM or on disk, enabling models to handle million-token contexts beyond their trained window without discarding data. Verified working with Mistral-7B and SmolLM2.


FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

Hugging Face Daily Papers · 6d ago

Proposes Lookahead Sparse Attention with a Neural Memory Indexer on DeepSeek-V4, reducing GPU memory usage to ~13.5% of the full-context baseline while maintaining or slightly improving accuracy.


GPU Memory Math for LLMs (2026 Edition)

Reddit r/LocalLLaMA · 2026-05-20

A practical guide explaining how to calculate VRAM requirements for LLMs based on parameter count and quantization level, plus additional overhead from KV cache, activations, and batching.
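As a rough illustration of the kind of arithmetic this guide describes, the sketch below adds weight memory to KV cache memory and applies a flat overhead margin. The formulas are the commonly used approximations, not ones quoted from the post. The function names, the 10% overhead fraction, and the example model shape (32 layers, 8 KV heads, head dimension 128) are assumptions chosen for illustration.

```python
def weight_bytes(n_params, bits_per_weight):
    """Memory for the weights alone: one stored value per parameter."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """KV cache size; the leading 2 covers one key and one value tensor per layer."""
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


def estimate_vram_gib(n_params, bits_per_weight, n_layers, n_kv_heads,
                      head_dim, seq_len, batch_size=1, bytes_per_value=2,
                      overhead_fraction=0.10):
    """Weights plus KV cache, scaled by an assumed margin for activations
    and runtime buffers. The 10% default is a placeholder, not a measurement."""
    total = (weight_bytes(n_params, bits_per_weight)
             + kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                              batch_size, bytes_per_value))
    return total * (1 + overhead_fraction) / 2**30


# Illustrative shape: 7B parameters at 4 bits, 8192-token context, fp16 cache.
estimate = estimate_vram_gib(
    n_params=7e9, bits_per_weight=4,
    n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192,
)
```

With these example numbers the KV cache alone comes to exactly 1 GiB, and it scales linearly with both context length and batch size, which is why long contexts and batching dominate the overhead once weights are quantized.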


Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption

arXiv cs.AI · 2026-05-20

This paper presents an empirical study on scheduling multiple LLMs on shared heterogeneous hardware, focusing on performance implications of CPU-GPU offloading and preemption. It finds that offloading causes non-linear decode degradation, especially for smaller models, and preemption overhead is dominated by model state reload, providing design guidance for future multi-model schedulers.


Are the rich RAM / poor GPU people wrong here?

Reddit r/LocalLLaMA · 2026-05-15

Discusses the trade-off between dense and Mixture-of-Experts (MoE) models for local AI, noting that high-RAM users have few MoE options beyond Qwen 3.5 122B, and asks whether a large GPU is the only viable path.


@KL_Div: LLMs require more GPU memory as they generate longer responses. Can we make GPU memory constant without significantly s…

X AI KOLs Timeline · 2026-04-23

IceCache introduces Dynamic Continuous Indexing to keep GPU memory usage constant during long LLM generations with minimal accuracy loss.
