@venkat_systems: Inference is not just GPU/Accelerator problem. Unoptimized cpu work in hot path can drastically affect performance. v0.…
Summary
Venkat explains that unoptimized CPU work in the hot path can severely degrade inference performance, and introduces his PR to Mooncake, which adds a memory arena so that cache operations in the hot path need no further allocations or kernel calls. It can be enabled in vLLM and SGLang.
Cached at: 06/20/26, 02:38 PM
Inference is not just a GPU/accelerator problem. Unoptimized CPU work in the hot path can drastically affect performance. v0.3.11 of Mooncake by @Kimi_Moonshot has my first PR to the repo.
The lock-free playbook keeps repeating wherever performance matters. LMAX did it first: a pre-allocated ring buffer, lock-free CAS sequencing, no allocations in the hot path. @TigerBeetleDB lives by it: after startup there is no malloc or free.
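To make the playbook concrete, here is a minimal sketch of a pre-allocated ring buffer whose slots are claimed with CAS on a sequence counter. It follows the well-known bounded-queue design by Dmitry Vyukov, not LMAX's actual Disruptor code; the class name RingBuffer and the methods try_push and try_pop are hypothetical names chosen for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Illustrative bounded multi-producer/multi-consumer queue.
// All slots are allocated once in the constructor; try_push and
// try_pop never allocate and never take a lock.
template <typename T>
class RingBuffer {
public:
    // capacity_pow2 must be a power of two and at least 2.
    explicit RingBuffer(std::size_t capacity_pow2)
        : mask_(capacity_pow2 - 1), slots_(new Slot[capacity_pow2]) {
        assert(capacity_pow2 >= 2 && (capacity_pow2 & mask_) == 0);
        for (std::size_t i = 0; i < capacity_pow2; ++i)
            slots_[i].seq.store(i, std::memory_order_relaxed);
    }

    // Returns false when the buffer is full.
    bool try_push(const T& value) {
        std::size_t pos = tail_.load(std::memory_order_relaxed);
        for (;;) {
            Slot& s = slots_[pos & mask_];
            std::size_t seq = s.seq.load(std::memory_order_acquire);
            auto diff = static_cast<std::intptr_t>(seq) -
                        static_cast<std::intptr_t>(pos);
            if (diff == 0) {
                // Slot is free for this position: claim it with CAS.
                if (tail_.compare_exchange_weak(pos, pos + 1,
                                                std::memory_order_relaxed)) {
                    s.value = value;
                    s.seq.store(pos + 1, std::memory_order_release);
                    return true;
                }
                // CAS failure reloads pos; retry.
            } else if (diff < 0) {
                return false;  // full
            } else {
                pos = tail_.load(std::memory_order_relaxed);
            }
        }
    }

    // Returns false when the buffer is empty.
    bool try_pop(T& out) {
        std::size_t pos = head_.load(std::memory_order_relaxed);
        for (;;) {
            Slot& s = slots_[pos & mask_];
            std::size_t seq = s.seq.load(std::memory_order_acquire);
            auto diff = static_cast<std::intptr_t>(seq) -
                        static_cast<std::intptr_t>(pos + 1);
            if (diff == 0) {
                if (head_.compare_exchange_weak(pos, pos + 1,
                                                std::memory_order_relaxed)) {
                    out = s.value;
                    // Mark the slot reusable one full lap later.
                    s.seq.store(pos + mask_ + 1, std::memory_order_release);
                    return true;
                }
            } else if (diff < 0) {
                return false;  // empty
            } else {
                pos = head_.load(std::memory_order_relaxed);
            }
        }
    }

private:
    struct Slot {
        std::atomic<std::size_t> seq;
        T value;
    };
    const std::size_t mask_;
    std::unique_ptr<Slot[]> slots_;
    std::atomic<std::size_t> tail_{0};
    std::atomic<std::size_t> head_{0};
};
```

The only allocation is the `new Slot[]` in the constructor. After that, producers and consumers coordinate purely through atomic loads, stores, and compare-and-swap, which is the property the post is pointing at.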
My PR 1820 introduces a memory arena. Mooncake grabs one big block of memory at startup and reuses it for every cache operation. No kernel calls in the hot path after that. Enable it in @vllm_project and @sgl_project and enjoy free goodput gains!
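The arena idea can be sketched in a few lines. This is not the code from PR 1820 and not Mooncake's API: the class Arena and its methods allocate, reset, and used are hypothetical, and the sketch assumes a simple bump-pointer scheme where the whole block is recycled at once.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Illustrative bump arena: one malloc at startup, then allocation is
// just an atomic offset update with no locks and no kernel calls.
class Arena {
public:
    explicit Arena(std::size_t bytes)
        : base_(static_cast<std::uint8_t*>(std::malloc(bytes))),
          size_(base_ ? bytes : 0) {}
    ~Arena() { std::free(base_); }
    Arena(const Arena&) = delete;
    Arena& operator=(const Arena&) = delete;

    // Returns nullptr when the arena is exhausted. `align` must be a
    // power of two no larger than alignof(std::max_align_t), because
    // offsets are aligned relative to the malloc'd base.
    void* allocate(std::size_t n,
                   std::size_t align = alignof(std::max_align_t)) {
        std::size_t old = offset_.load(std::memory_order_relaxed);
        for (;;) {
            std::size_t start = (old + align - 1) & ~(align - 1);
            if (start > size_ || n > size_ - start) return nullptr;
            // CAS claims [start, start + n); on failure `old` is reloaded.
            if (offset_.compare_exchange_weak(old, start + n,
                                              std::memory_order_relaxed))
                return base_ + start;
        }
    }

    // Recycles the whole block. Callers must ensure nothing still uses it.
    void reset() { offset_.store(0, std::memory_order_relaxed); }

    std::size_t used() const {
        return offset_.load(std::memory_order_relaxed);
    }

private:
    std::uint8_t* base_;
    std::size_t size_;
    std::atomic<std::size_t> offset_{0};
};
```

A caller would construct one Arena at startup, call allocate for each operation's scratch space, and call reset when the batch is done. The trade-off is that individual allocations cannot be freed, which is acceptable when lifetimes are bounded by a request or a batch.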
Every GPU generation makes the same CPU work a bigger fraction of total request time. Amdahl’s law eventually finds every fixed cost in the hot path you didn’t optimize. Worth getting ahead of it.
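A quick worked example of that Amdahl's law argument. The numbers are made up for illustration (2 ms of fixed CPU work and 18 ms of GPU work per request), and the function names cpu_fraction and overall_speedup are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Share of request time spent on fixed CPU work once the GPU portion
// runs `gpu_speedup` times faster.
double cpu_fraction(double cpu_ms, double gpu_ms, double gpu_speedup) {
    assert(gpu_speedup > 0.0);
    return cpu_ms / (cpu_ms + gpu_ms / gpu_speedup);
}

// Amdahl's law: end-to-end speedup when only the GPU portion gets faster.
double overall_speedup(double cpu_ms, double gpu_ms, double gpu_speedup) {
    assert(gpu_speedup > 0.0);
    return (cpu_ms + gpu_ms) / (cpu_ms + gpu_ms / gpu_speedup);
}
```

With these assumed numbers, cpu_fraction(2, 18, 1) is 0.10, cpu_fraction(2, 18, 2) is about 0.18, and cpu_fraction(2, 18, 4) is about 0.31. A GPU that is 4x faster gives overall_speedup(2, 18, 4) of only about 3.08x, because the 2 ms of CPU work did not shrink.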