memory-optimization

#memory-optimization

RoPE-Aware Bit Allocation for KV-Cache Quantization

arXiv cs.LG · 4d ago Cached

Proposes Block-GTQ, a RoPE-aware bit allocation method for key-value cache quantization that improves long-context performance and memory efficiency by allocating more bits to high-energy RoPE blocks.

@thtrkim: Visual deep dive on FlashAttention by hand (drawn with Excalidraw) https://winterrykim.github.io/blog/2026/training-lm-…

X AI KOLs Timeline · 4d ago Cached

A visual deep dive into FlashAttention, explaining memory optimization and operator fusion for efficient attention computation in language model training.

Reverse Engineering the Qualcomm NPU Compiler

Lobsters Hottest · 2026-06-20 Cached

Reverse engineering the Qualcomm NPU compiler reveals undocumented VTCM memory management, MILP-based placement, automatic precision alteration, and a hidden analytical simulator (Hextimate) for edge deployment optimization.

@FakeMaidenMaker: Incredible! This open-source project can significantly speed up and save VRAM for self-hosted large model inference. It has garnered 9.2K stars on GitHub, joined the PyTorch Foundation, and NVIDIA's Dynamo has integrated it. GitHub: https://github.com/LMC…

X AI KOLs Timeline · 2026-06-18 Cached

LMCache is a KV cache management layer that accelerates large model inference and reduces VRAM consumption by caching and reusing KV cache entries. It has received 9.2K GitHub stars, has joined the PyTorch Foundation, and has been integrated into NVIDIA Dynamo.

NVFP4 kv cache quantization on sm120 will make 32GB VRAM systems very capable

Reddit r/LocalLLaMA · 2026-06-18

NVFP4 KV cache quantization on sm120 significantly improves memory efficiency for large language models, enabling 32GB VRAM systems to achieve ~60 tok/sec inference at 196k context size with Qwen3.6-27B.

PolyKV: Heterogeneous Retention and Allocation for KV Cache Compression

arXiv cs.LG · 2026-06-16 Cached

PolyKV is a layer-wise KV cache compression framework that assigns heterogeneous eviction policies and non-uniform budgets per layer, significantly improving over uniform baselines on LongBench with LLaMA-3.1-8B and Qwen3-8B.

Efficient On-Device Diffusion LLM Inference with Mobile NPU

arXiv cs.LG · 2026-06-15 Cached

This paper presents llada.cpp, an NPU-aware inference framework for accelerating diffusion large language models (dLLMs) on smartphones. It introduces three techniques (Multi-Block Speculative Decoding, Dual-Path Progressive Revision, and Swap-Optimized Memory Runtime) to align dLLM inference with mobile NPU characteristics, achieving a 17-42x latency reduction over a CPU baseline.

Tangram: Unlocking Non-Uniform KV Cache Compression for Efficient Multi-turn LLM Serving

Hugging Face Daily Papers · 2026-06-15 Cached

The paper introduces Tangram, a serving framework that statically resolves non-uniform KV cache compression for multi-turn LLM serving, achieving up to 2.6x throughput improvement over the full-KV baseline by eliminating runtime overheads.

IntentKV: Cross-Turn Intent-Aware KV Cache Pruning for Agent Inference

arXiv cs.LG · 2026-06-10 Cached

IntentKV introduces a cross-turn intent-aware KV cache pruning method for multi-turn LLM agents, maintaining session-level query memory to efficiently prune cache without accuracy loss, significantly reducing token usage and KV reads.

Operator Fusion for LLM Inference on the Tensix Architecture

arXiv cs.LG · 2026-06-10 Cached

This paper proposes an operator fusion strategy for LLM inference on Tenstorrent's Tensix architecture, fusing RMSNorm with matrix multiplications to improve data locality and reduce DRAM accesses. Experiments on the Wormhole platform with Qwen2.5-0.5B, Qwen3-0.6B, and Qwen3-4B show up to 37.44% latency reduction for attention and 15.89% for MLP.

@che_shr_cat: 1/ We have spent years optimizing KV cache via head-sharing (GQA/MQA), but we ignored a fundamental assumption: why do …

X AI KOLs Timeline · 2026-06-09 Cached

This thread challenges the fundamental assumption that Transformers require separate Q, K, and V projections, proposing that merging them can yield massive memory savings for KV cache.

Premature Optimization is Fun Sometimes

Lobsters Hottest · 2026-06-08 Cached

A blog post exploring the optimization of a ring buffer data structure for storing ping timestamps, discussing tagged unions, bitfields, and struct padding to reduce memory footprint.

@RoundtableSpace: GOOGLE JUST FOUND A WAY TO SHRINK 31GB OF AI MEMORY DOWN TO 4GB

X AI KOLs Timeline · 2026-06-06 Cached

According to the tweet, Google has developed a method that shrinks AI memory usage from 31GB to 4GB.

Maybe KV cache offload to RAM isn't bad

Reddit r/LocalLLaMA · 2026-06-05

A user shares their experience offloading the KV cache to RAM in llama.cpp, achieving comparable speeds while freeing VRAM for larger models and context windows, suggesting this trade-off is often worthwhile.

@akshay_pachaar: Extending the context window isn't just about larger matrices. In a traditional transformer, expanding tokens by 8x inc…

X AI KOLs Following · 2026-06-03 Cached

Explains the memory challenge of expanding transformer context windows due to quadratic attention complexity, and hints at solutions.
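
The quadratic scaling mentioned in this summary can be made concrete with a little arithmetic. The sketch below assumes a naive attention implementation that materializes the full score matrix for every head; the head count, element size, and base sequence length are hypothetical values chosen for illustration and are not taken from the thread.

```python
def attention_scores_bytes(seq_len: int, num_heads: int = 32, bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold one layer's seq_len x seq_len score matrix for all heads.

    Assumes naive attention that materializes the full matrix (hypothetical
    defaults: 32 heads, 2-byte elements such as fp16).
    """
    return num_heads * seq_len * seq_len * bytes_per_elem


base_len = 4096  # hypothetical starting context length
base = attention_scores_bytes(base_len)
longer = attention_scores_bytes(8 * base_len)

# The matrix has seq_len^2 entries, so 8x more tokens means 8^2 = 64x more memory.
growth = longer // base

print(f"{base_len} tokens: {base / 2**30:.1f} GiB per layer")
print(f"{8 * base_len} tokens: {longer / 2**30:.1f} GiB per layer ({growth}x)")
```

Kernels in the FlashAttention family avoid storing this matrix at all, so the figure applies only to the naive formulation.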

@NFTCPS: 4GB VRAM running 70B large model? It actually works! AirLLM did a clever trick — layered inference, not loading the whole model into VRAM at once, but layer by layer, compute and discard, squeezing the giant into a small GPU. The best part: 100% open source, freebie warning https://github.com/0xSo…

X AI KOLs Timeline · 2026-06-03 Cached

AirLLM is a fully open-source tool that uses layered inference (loading one layer into VRAM at a time, computing it, and releasing it before loading the next) to run 70B large language models on GPUs with only 4GB of VRAM, without quantization, distillation, or pruning. It already supports running Llama3.1 405B on 8GB of VRAM.

dMoE: dLLMs with Learnable Block Experts

arXiv cs.CL · 2026-06-01 Cached

dMoE proposes block-level expert routing for diffusion LLMs, reducing the number of uniquely activated experts from 69.5 to 14.6 while retaining 99.11% performance and achieving 76-80% memory reduction with 1.14-1.66× speedup.

Anyone using Flash Attention 2 (ai-bond) on their V100's? How is the performance?

Reddit r/LocalLLaMA · 2026-05-29

A user benchmarks a V100-compatible port of Flash Attention 2, reporting 3x-17x speedups and up to 94% memory reduction over default PyTorch attention.

GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs

arXiv cs.LG · 2026-05-25 Cached

Proposes GEMQ, a global expert-level mixed-precision quantization method for MoE LLMs that uses linear programming and router fine-tuning to reduce memory and accelerate inference with minimal accuracy degradation.

@Michaelzsguo: Found this great tool that may be handy for your local LLM inference optimization: https://kvcache.ai/tools/kv-cache-ca…

X AI KOLs Timeline · 2026-05-23 Cached

A tweet shares the KV Cache Size Calculator from KVCache.ai, a tool for estimating KV cache memory usage for local LLM inference, highlighting that 1M tokens for DeepSeek V4 Pro uses only 5GB of RAM.
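
For readers who want a rough idea of what such an estimate involves, here is a minimal sketch of the textbook KV cache size formula for standard multi-head or grouped-query attention. It is not the calculator's method, and the model configuration is hypothetical. Architectures that compress the cache can come in far below this estimate, so it should not be expected to reproduce the 5GB figure above.

```python
def kv_cache_bytes(
    seq_len: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_elem: int = 2,  # e.g. fp16/bf16
    batch_size: int = 1,
) -> int:
    """Estimate KV cache size for standard MHA/GQA attention.

    Each token stores one key and one value vector (the factor of 2)
    per KV head, per layer.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return batch_size * seq_len * per_token


# Hypothetical configuration: 32 layers, 8 KV heads, head dimension 128, fp16.
size = kv_cache_bytes(seq_len=8192, num_layers=32, num_kv_heads=8, head_dim=128)
print(f"{size / 2**30:.2f} GiB for 8192 tokens")
```

Halving `bytes_per_elem` (for example with an 8-bit cache) or reducing `num_kv_heads` shrinks the result proportionally, which is why KV cache quantization and head sharing appear so often under this tag.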
