gpu-memory

#gpu-memory

Hierarchical Global Attention (HGA)

arXiv cs.LG ↗ · 3d ago Cached

Hierarchical Global Attention (HGA) is a drop-in replacement for dense causal attention in pretrained long-context transformers. It uses two-level hierarchical routing to select a small working set and computes exact attention over that set, allowing models like Qwen3-30B to run at 64K context on a single RTX 5090 with minimal quality loss.

llama.cpp - how to free up even more space on your GPU

Reddit r/LocalLLaMA ↗ · 2026-06-17

A thread sharing practical tips for freeing up GPU memory in llama.cpp, such as offloading mmproj to CPU and adjusting KV cache types, while discussing parameters like --cache-type-k/v and --spec-draft-n-max.

@amitiitbhu: New Article: How does vLLM work? Read here: https://outcomeschool.com/blog/how-does-vllm-work…

X AI KOLs Timeline ↗ · 2026-06-17 Cached

A detailed blog post explaining how vLLM works, including PagedAttention, KV cache management, and continuous batching for efficient LLM serving.

Built a tool that tells you exactly which LLMs fit on your GPU. Feedback wanted.

Reddit r/LocalLLaMA ↗ · 2026-06-12 Cached

A tool that estimates which LLMs fit in a user's GPU memory and ranks the candidates by performance, taking memory constraints and quantization levels into account.

Open sourcing InfiniteKV: a KV cache that files old tokens as 104-byte searchable records in RAM or on disk instead of deleting them. Mistral-7B answered from token 76,747, 2.3x past its trained window. Colab demo

Reddit r/LocalLLaMA ↗ · 2026-06-12

InfiniteKV is an open-source KV cache technique that compresses old tokens into 104-byte searchable records stored in RAM or on disk, enabling models to handle million-token contexts beyond their trained window without discarding data. Verified working with Mistral-7B and SmolLM2.

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

Hugging Face Daily Papers ↗ · 2026-06-08 Cached

Proposes Lookahead Sparse Attention with a Neural Memory Indexer on DeepSeek-V4, reducing GPU memory usage to ~13.5% of the full-context baseline while maintaining or slightly improving accuracy.

0 favorites 0 likes

#gpu-memory

GPU Memory Math for LLMs (2026 Edition)

Reddit r/LocalLLaMA ↗ · 2026-05-20 Cached

A practical guide explaining how to calculate VRAM requirements for LLMs based on parameter count and quantization level, plus additional overhead from KV cache, activations, and batching.
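
The kind of arithmetic such a guide covers can be sketched as follows. This is a rough illustration and not taken from the linked post: the function names, the flat `overhead_fraction` allowance for activations and runtime buffers, and the example model shape are all assumptions, and real runtimes will differ.

```python
GIB = 1024 ** 3


def weight_bytes(n_params, bits_per_weight):
    """Bytes needed to hold the weights at a given quantization level."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes for the KV cache of a standard (grouped-query) attention stack.

    The leading factor of 2 counts one K and one V tensor per layer.
    bytes_per_elem=2 corresponds to a 16-bit cache.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * batch_size * bytes_per_elem)


def estimate_vram_gib(n_params, bits_per_weight, n_layers, n_kv_heads,
                      head_dim, context_len, batch_size=1,
                      bytes_per_elem=2, overhead_fraction=0.1):
    """Weights plus KV cache, padded by an assumed flat overhead fraction."""
    total = (weight_bytes(n_params, bits_per_weight)
             + kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                              batch_size, bytes_per_elem))
    return total * (1 + overhead_fraction) / GIB


if __name__ == "__main__":
    # Hypothetical 7B-parameter model at 4 bits per weight, 4K context.
    estimate = estimate_vram_gib(
        n_params=7e9, bits_per_weight=4,
        n_layers=32, n_kv_heads=8, head_dim=128, context_len=4096,
    )
    print(f"Estimated VRAM: {estimate:.2f} GiB")
```

Weights scale with parameter count and bits per weight, while the KV cache scales linearly with context length and batch size, so long contexts or large batches can dominate the total even for a heavily quantized model.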

Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption

arXiv cs.AI ↗ · 2026-05-20

This paper presents an empirical study of scheduling multiple LLMs on shared heterogeneous hardware, focusing on the performance implications of CPU-GPU offloading and preemption. It finds that offloading causes non-linear decode degradation, especially for smaller models, and that preemption overhead is dominated by reloading model state. The authors use these findings to offer design guidance for future multi-model schedulers.

Are the rich RAM /poor GPU people wrong here?

Reddit r/LocalLLaMA ↗ · 2026-05-15

Discusses the trade-off between dense and Mixture-of-Experts (MoE) models for local AI, noting that high-RAM users have limited MoE options beyond Qwen 3.5 122B, and asking whether a large GPU is the only viable path.

@KL_Div: LLMs require more GPU memory as they generate longer responses. Can we make GPU memory constant without significantly s…

X AI KOLs Timeline ↗ · 2026-04-23 Cached

IceCache introduces Dynamic Continuous Indexing to keep GPU memory usage constant during long LLM generations with minimal accuracy loss.
