model-serving

GPU Memory Math for LLMs (2026 Edition)

Reddit r/LocalLLaMA · 2026-05-20

A practical guide to estimating the VRAM an LLM needs from its parameter count and quantization level, plus the additional overhead from the KV cache, activations, and batching.
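The kind of arithmetic this summary describes can be sketched as follows. This is a rough back-of-the-envelope estimate, not the linked guide's own method: the function names, the example model shape (8B parameters, 32 layers, 8 KV heads, head dimension 128), and the flat 10% `overhead_fraction` standing in for activations and runtime buffers are all assumptions made for illustration.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone: one value per parameter at the given bit width."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: a key and a value vector (the factor 2) per layer,
    per KV head, per token, per sequence in the batch."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


def estimate_vram_gib(n_params: float, bits_per_weight: float,
                      n_layers: int, n_kv_heads: int, head_dim: int,
                      seq_len: int, batch_size: int,
                      overhead_fraction: float = 0.10) -> float:
    """Weights plus KV cache, padded by an assumed flat overhead fraction
    for activations and runtime buffers (a guess, not a measured value)."""
    base = weight_bytes(n_params, bits_per_weight) + kv_cache_bytes(
        n_layers, n_kv_heads, head_dim, seq_len, batch_size)
    return base * (1 + overhead_fraction) / 2**30


if __name__ == "__main__":
    # Hypothetical 8B-parameter model, 4-bit weights, 8192-token context, batch of 1.
    gib = estimate_vram_gib(8e9, 4, n_layers=32, n_kv_heads=8, head_dim=128,
                            seq_len=8192, batch_size=1)
    print(f"Estimated VRAM: {gib:.2f} GiB")
```

Real serving stacks allocate memory differently (paged KV caches, preallocated pools, quantized KV entries), so treat the result as a lower-bound sanity check and not a guarantee that a model will fit.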

The “same” model increasingly behaves like a different product depending on the inference stack behind it

Reddit r/ArtificialInteligence · 2026-05-14

The post argues that the same AI model can behave differently depending on the inference stack serving it (for example, its scheduling, quantization, and speculative decoding), especially in long sessions or agent workflows, which makes the serving method nearly as important as the model itself.

vllm-project/vllm v0.19.1rc0: [Misc] Clean up Gemma4 implementation (#38872)

GitHub Releases Watchlist · 2026-04-03

The vLLM v0.19.1rc0 release includes a cleanup of the Gemma4 implementation, part of routine maintenance and optimization of the popular open-source LLM inference and serving library.
