A practical guide to calculating VRAM requirements for LLMs from parameter count and quantization level, plus the additional overhead from KV cache, activations, and batching.
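As a rough illustration of the kind of arithmetic such a guide covers, the sketch below estimates weight memory from parameter count and bits per parameter, then adds a KV-cache term. It is a minimal sketch, not the guide's own method: the function names, the model shape in the usage example, and the `overhead_fraction` allowance for activations and runtime buffers are all assumptions chosen for illustration.

```python
GIB = 1024 ** 3  # bytes per GiB


def weights_gib(num_params: float, bits_per_param: float) -> float:
    """Memory for the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / GIB


def kv_cache_gib(num_layers: int, num_kv_heads: int, head_dim: int,
                 seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GiB: one key and one value tensor per layer,
    each of shape (batch, kv_heads, seq_len, head_dim)."""
    elems = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return elems * bytes_per_elem / GIB


def estimate_vram_gib(num_params: float, bits_per_param: float,
                      num_layers: int, num_kv_heads: int, head_dim: int,
                      seq_len: int, batch_size: int,
                      overhead_fraction: float = 0.1) -> float:
    """Weights plus KV cache, padded by an assumed overhead fraction
    (hypothetical allowance for activations and runtime buffers)."""
    base = weights_gib(num_params, bits_per_param) + kv_cache_gib(
        num_layers, num_kv_heads, head_dim, seq_len, batch_size)
    return base * (1 + overhead_fraction)


if __name__ == "__main__":
    # Illustrative 7B-class shape (assumed values, not taken from the guide).
    shape = dict(num_layers=32, num_kv_heads=32, head_dim=128,
                 seq_len=4096, batch_size=1)
    for bits in (16, 8, 4):
        total = estimate_vram_gib(7e9, bits, **shape)
        print(f"{bits:>2}-bit weights: ~{total:.1f} GiB")
```

The KV-cache term grows linearly with both sequence length and batch size, which is why long contexts and batching can dominate the total even when the weights are heavily quantized.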
The article argues that the same AI model can behave differently depending on the inference stack (e.g., scheduling, quantization, speculative decoding), especially in long sessions or agent workflows, which makes the serving method nearly as important as the model itself.
The vLLM v0.19.1rc0 release includes a cleanup of the Gemma4 implementation, part of routine maintenance and optimization of the popular open-source LLM inference and serving library.