Lecture notes from an Efficient AI course covering Transformer and LLM fundamentals, including multi-head attention, positional encoding, KV cache, and the connection between model architecture and inference efficiency. The content explains how design choices in transformers affect memory, latency, and hardware efficiency.
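A minimal sketch of the KV-cache idea those notes cover (illustrative code, not from the lecture material): during decoding, only the newest token's key and value are computed and appended, so each step attends over cached rows instead of recomputing the whole prefix.

```python
# Minimal single-head sketch of KV caching during autoregressive decoding.
# Projections are left as identity for brevity; shapes are illustrative.
import numpy as np

def attend(q, K, V):
    """Scaled dot-product attention for one query vector over cached keys/values."""
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

d = 8
rng = np.random.default_rng(0)
K_cache = np.empty((0, d))   # one cached key row per past token
V_cache = np.empty((0, d))   # one cached value row per past token

for step in range(4):
    x = rng.normal(size=d)                 # hidden state of the newest token
    K_cache = np.vstack([K_cache, x])      # append instead of recomputing the prefix
    V_cache = np.vstack([V_cache, x])
    out = attend(x, K_cache, V_cache)
    print(step, K_cache.shape, out.shape)  # cache grows by one row per generated token
```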
Modular published a blog post explaining why traditional HTTP routing doesn't work for LLM inference workloads. The article describes how their distributed inference framework handles stateful, heterogeneous GPU pods with KV caches, specialized prefill/decode backends, and conversation-level routing that traditional stateless routing algorithms cannot address.
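A hedged sketch of the general problem the post describes, not Modular's actual algorithm: requests for the same conversation should land on the pod that already holds its KV cache, and prefill and decode pods are routed separately. All names and fields below are illustrative.

```python
# Illustrative KV-cache-aware router: prefer a pod that already holds the
# conversation's cache, fall back to least-loaded, keep prefill/decode pools apart.
from dataclasses import dataclass, field

@dataclass
class Pod:
    name: str
    role: str                                   # "prefill" or "decode"
    cached_convs: set = field(default_factory=set)
    load: int = 0

def route(conversation_id: str, pods: list[Pod], phase: str) -> Pod:
    candidates = [p for p in pods if p.role == phase]
    warm = [p for p in candidates if conversation_id in p.cached_convs]  # cache affinity
    target = min(warm or candidates, key=lambda p: p.load)
    target.cached_convs.add(conversation_id)
    target.load += 1
    return target

pods = [Pod("decode-0", "decode"), Pod("decode-1", "decode"), Pod("prefill-0", "prefill")]
print(route("conv-42", pods, "prefill").name)
print(route("conv-42", pods, "decode").name)   # later turns stick to the same decode pod
print(route("conv-42", pods, "decode").name)
```

A stateless round-robin router would scatter follow-up turns across replicas and force repeated prefill of the same conversation, which is the mismatch the article points to.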
IceCache introduces Dynamic Continuous Indexing to keep GPU memory usage constant during long LLM generations with minimal accuracy loss.
TTKV introduces a temporal-tiered KV cache that mimics human memory to cut 128K-context LLM inference latency by 76% and double throughput while reducing cross-tier traffic 5.94×.
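A generic illustration of temporal tiering (not TTKV's actual policy): recent tokens' KV entries stay in a small hot tier, older entries are demoted to a colder tier, and touching an old token pays a cross-tier transfer to promote it back.

```python
# Illustrative two-tier KV store: hot tier holds recent entries, cold tier holds
# demoted ones; cross-tier moves are the expensive path a tiered design tries to limit.
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, hot_capacity: int):
        self.hot = OrderedDict()   # token position -> KV entry, most recent last
        self.cold = {}
        self.hot_capacity = hot_capacity

    def put(self, pos: int, kv) -> None:
        self.hot[pos] = kv
        if len(self.hot) > self.hot_capacity:
            old_pos, old_kv = self.hot.popitem(last=False)   # demote oldest entry
            self.cold[old_pos] = old_kv

    def get(self, pos: int):
        if pos in self.hot:
            self.hot.move_to_end(pos)
            return self.hot[pos]
        kv = self.cold.pop(pos)     # cross-tier transfer
        self.put(pos, kv)           # promote back to the hot tier
        return kv

cache = TieredKVCache(hot_capacity=4)
for pos in range(8):
    cache.put(pos, f"kv{pos}")
print(list(cache.hot), list(cache.cold))   # positions 4-7 hot, 0-3 cold
print(cache.get(1))                        # touching an old token promotes it
```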
Community discussion on whether Google's TurboQuant compression can already be applied to KV cache in llama-server or if implementation is still pending.
A solo researcher released Spiral, a tool that compresses LLM weights to INT3 and the KV cache to INT2 with custom fused Metal kernels for Apple Silicon, currently shipping a Qwen-7B preview.
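A hedged sketch of the underlying technique, symmetric low-bit group quantization, not Spiral's Metal kernels: weights are mapped to a small signed integer range with a per-group scale and dequantized on use.

```python
# Illustrative symmetric 3-bit quantization of a weight group (range [-4, 3]).
import numpy as np

def quantize_int3(w: np.ndarray):
    """Quantize a 1-D float weight group to 3-bit integers plus one scale."""
    scale = np.abs(w).max() / 4.0
    q = np.clip(np.round(w / scale), -4, 3).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.default_rng(2).normal(size=64).astype(np.float32)
q, s = quantize_int3(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())
```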
A new paper proposes sequential KV cache compression using probabilistic language tries and predictive delta coding, achieving theoretical compression ratios of ~914,000× beyond TurboQuant by exploiting the sequential structure of language model tokens rather than treating vectors independently.
The STOP method prunes doomed reasoning trajectories early via KV-cache states, cutting token usage by 70% and boosting AIME/GPQA accuracy across 1.5B–20B models.
River-LLM proposes a training-free early-exit framework for decoder-only LLMs that uses KV-sharing to eliminate KV-cache gaps, achieving 1.71–2.16× speedup without quality loss.
Aravind Jayendran argues that while longer context windows improve LLM performance, they cannot fully replace weight updates, framing context as transient software and weights as hardware that fundamentally alters model capabilities.
A paper proposes an SRC pipeline that uses entropy-based selection and low-rank reconstruction to summarize the KV cache instead of pruning tokens, reducing VRAM for million-token LLM contexts while avoiding catastrophic attention errors.
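A hedged sketch of the general low-rank idea (not the paper's SRC pipeline): a block of cached key vectors can be stored as low-rank factors rather than dropped outright, trading a small reconstruction error for memory. Rank and block size below are illustrative.

```python
# Illustrative low-rank summarization of a (tokens, head_dim) key block via SVD.
import numpy as np

def lowrank_factors(K_block: np.ndarray, rank: int):
    """Factor the block into pieces needing rank*(tokens + head_dim) floats
    instead of tokens*head_dim."""
    U, S, Vt = np.linalg.svd(K_block, full_matrices=False)
    return U[:, :rank] * S[:rank], Vt[:rank]

K_block = np.random.default_rng(1).normal(size=(256, 64))  # 256 cached tokens
A, B = lowrank_factors(K_block, rank=16)
err = np.linalg.norm(K_block - A @ B) / np.linalg.norm(K_block)
print(f"stored floats: {A.size + B.size} vs {K_block.size}, relative error: {err:.3f}")
```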
Researchers propose Prefill-as-a-Service (PrfaaS), a system that offloads long-context prefill to remote compute-dense clusters and streams the KV cache over commodity Ethernet, enabling independent scaling and 32–54% higher throughput for a 1T-parameter hybrid model.
This paper introduces STOP (Super Token for Pruning), a lightweight method that learns to prune unpromising reasoning paths early during parallel decoding by appending learnable tokens and reading KV cache states, achieving 70% token reduction while improving performance on AIME and GPQA benchmarks.
This paper introduces PagedAttention, an algorithm inspired by virtual memory paging, and vLLM, a serving system that significantly improves LLM throughput by reducing memory fragmentation in key-value caches.
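A hedged sketch of the paging idea behind PagedAttention (not vLLM's implementation): the KV cache is allocated in fixed-size blocks from a shared pool, and each sequence keeps a block table mapping logical positions to physical blocks, so memory is not pre-reserved for the maximum sequence length and fragmentation stays low.

```python
# Illustrative block-table allocation for a paged KV cache.
BLOCK_SIZE = 16                      # tokens per KV block

class BlockPool:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def alloc(self) -> int:
        return self.free.pop()

    def release(self, block: int) -> None:
        self.free.append(block)

class Sequence:
    def __init__(self, pool: BlockPool):
        self.pool = pool
        self.block_table: list[int] = []   # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:   # current block full (or first token)
            self.block_table.append(self.pool.alloc())
        self.num_tokens += 1

pool = BlockPool(num_blocks=64)
seq = Sequence(pool)
for _ in range(40):
    seq.append_token()
print(seq.block_table)   # 3 physical blocks cover 40 tokens; unused pool blocks stay free
```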