Tag
Proposes Block-GTQ, a RoPE-aware bit allocation method for key-value cache quantization that improves long-context performance and memory efficiency by allocating more bits to high-energy RoPE blocks.
A visual deep dive into FlashAttention, explaining memory optimization and operator fusion for efficient attention computation in language model training.
Reverse engineering the Qualcomm NPU compiler reveals undocumented VTCM memory management, MILP-based placement, automatic precision alteration, and a hidden analytical simulator (Hextimate) for edge deployment optimization.
LMCache is a KV cache management layer that accelerates large model inference and reduces VRAM consumption by caching and reusing KV cache. It has received 9.2K stars and joined the PyTorch Foundation, and is integrated by NVIDIA Dynamo.
NVFP4 KV cache quantization on sm120 significantly improves memory efficiency for large language models, enabling 32GB VRAM systems to achieve ~60 tok/sec inference at 196k context size with Qwen3.6-27B.
PolyKV is a layer-wise KV cache compression framework that assigns heterogeneous eviction policies and non-uniform budgets per layer, significantly improving over uniform baselines on LongBench with LLaMA-3.1-8B and Qwen3-8B.
This paper presents llada.cpp, an NPU-aware inference framework for accelerating diffusion large language models (dLLMs) on smartphones. It introduces three techniques—Multi-Block Speculative Decoding, Dual-Path Progressive Revision, and Swap-Optimized Memory Runtime—to align dLLM inference with mobile NPU characteristics, achieving 17-42x latency reduction over CPU baseline.
The paper introduces Tangram, a serving framework that statically resolves non-uniform KV cache compression for multi-turn LLM serving, achieving up to 2.6x throughput improvement over the full-KV baseline by eliminating runtime overheads.
IntentKV introduces a cross-turn intent-aware KV cache pruning method for multi-turn LLM agents, maintaining session-level query memory to efficiently prune cache without accuracy loss, significantly reducing token usage and KV reads.
This paper proposes an operator fusion strategy for LLM inference on Tenstorrent's Tensix architecture, fusing RMSNorm with matrix multiplications to improve data locality and reduce DRAM accesses. Experiments on the Wormhole platform with Qwen2.5-0.5B, Qwen3-0.6B, and Qwen3-4B show up to 37.44% latency reduction for attention and 15.89% for MLP.
This thread challenges the fundamental assumption that Transformers require separate Q, K, and V projections, proposing that merging them can yield massive memory savings for KV cache.
A blog post exploring the optimization of a ring buffer data structure for storing ping timestamps, discussing tagged unions, bitfields, and struct padding to reduce memory footprint.
Google has developed a method to shrink AI memory usage from 31GB to 4GB, representing a significant efficiency breakthrough for AI models.
A user shares their experience offloading the KV cache to RAM in llama.cpp, achieving comparable speeds while freeing VRAM for larger models and context windows, suggesting this trade-off is often worthwhile.
Explains the memory challenge of expanding transformer context windows due to quadratic attention complexity, and hints at solutions.
AirLLM is a fully open-source tool that uses layered inference (loading and releasing VRAM layer by layer) to enable 70B large language models to run on GPUs with only 4GB VRAM, without quantization, distillation, or pruning. It already supports running Llama3.1 405B on 8GB VRAM.
dMoE proposes block-level expert routing for diffusion LLMs, reducing the number of uniquely activated experts from 69.5 to 14.6 while retaining 99.11% performance and achieving 76-80% memory reduction with 1.14-1.66× speedup.
A user benchmarks a V100-compatible port of Flash Attention 2, reporting 3x-17x speedups and up to 94% memory reduction over default PyTorch attention.
Proposes GEMQ, a global expert-level mixed-precision quantization method for MoE LLMs that uses linear programming and router fine-tuning to reduce memory and accelerate inference with minimal accuracy degradation.
A tweet shares the KV Cache Size Calculator from KVCache.ai, a tool for estimating KV cache memory usage for local LLM inference, highlighting that 1M tokens for DeepSeek V4 Pro uses only 5GB of RAM.