A detailed thread explaining key concepts of LLM inference: attention, KV caching, chunked prefill, and batching techniques, including continuous batching as used in vLLM and SGLang.
CompactAttention introduces Block-Union KV Selection to accelerate chunked prefill for long-context LLMs, achieving up to 2.72x attention speedup on LLaMA-3.1-8B at 128K context while maintaining accuracy close to dense attention.