vllm

Tag

#vllm

@kazukifujii: This vLLM blog post explains weight updates in RL + KV cache recompute in a very clear and illustrated way, and it also…

X AI KOLs Timeline · yesterday

This article explains vLLM's weight syncing API for reinforcement learning, covering how it facilitates weight updates and KV cache recompute in RL training, with a focus on reducing complexity for training frameworks.

@Mayhem4Markets: https://x.com/Mayhem4Markets/status/2069090022117019928

X AI KOLs Following · yesterday

A detailed technical comparison of two dominant LLM serving frameworks, SGLang and vLLM, covering architectural differences in KV cache management (RadixAttention vs PagedAttention), throughput, latency, and deployment considerations for self-hosted environments.

Local LLM Inference Optimization: The Complete Guide

Reddit r/LocalLLaMA · 2d ago

A comprehensive guide to optimizing local LLM inference on consumer hardware, covering tools like llama.cpp, vLLM, and LM Studio, with practical advice on memory hierarchy, layer placement, and common failure modes.

@TheAhmadOsman: Why do I focus on Inference Engines/Software Stacks for your hardware? - 2x RTX 3090s: ~14.5 tok/s → ~64 tok/s moving t…

X AI KOLs Following · 2d ago

A comparison of inference engine performance across hardware: on 2x RTX 3090s, moving from a baseline to vLLM with TP=2 raises throughput from ~14.5 tok/s to ~64 tok/s; on an RTX PRO 6000, moving to SGLang raises it from ~32 tok/s to ~110 tok/s. The post recommends vLLM or SGLang for CUDA and multi-GPU setups and llama.cpp for edge devices.

ROCm vs Vulkan vs vLLM on Dual R9700's

Reddit r/LocalLLaMA · 2d ago

A comparison of ROCm, Vulkan, and vLLM for AI inference on dual AMD R9700 GPUs, likely benchmarking large language model performance.

$1800 in GPU cost: running Qwen/Qwen3.6-27b-FP8 with P2P, 262K context, and BF16 KV cache at 55 tok/s

Reddit r/LocalLLaMA · 4d ago

A user shares a configuration of 4x RTX 5060 Ti 16GB with P2P to run Qwen3.6-27B-FP8 at 55 tok/s with 262K context, highlighting the low cost of about $1800 for single-user inference.

DiffusionGemma 26b on a 4090 at up to 475t/s... and some thoughts...

Reddit r/LocalLLaMA · 5d ago

A user shares their experience running DiffusionGemma 26B on a 4090 GPU via vLLM, achieving up to 475 t/s but noting drawbacks such as a single-user limitation, lower accuracy, and short context, and concluding that it is not worth using over the regular 26B model.

@raydistributed: Ray Serve LLM now offers 4.4x higher request throughput on prefill-heavy workloads, and 24.8x higher request throughput…

X AI KOLs Following · 5d ago

Ray Serve LLM achieves 4.4x and 24.8x higher request throughput on prefill-heavy and decode-heavy workloads, respectively, via direct streaming, a new vLLM V2 executor backend, and HAProxy ingress. The improvements are available in Ray 2.56, in partnership with Google Cloud and vLLM.

@SpaceTimeViking: Qwen3.6 27B getting some love on the new AEON ULTIMATE VLLM image @NVIDIAAI DGX SPARK OPTIMIZED! https://github.com/AEO…

X AI KOLs Timeline · 5d ago

AEON-7 releases a fully uncensored, capability-enhanced abliteration of Qwen3.6-27B, optimized for NVIDIA DGX Spark with NVFP4 quantization and DFlash speculative decoding for improved performance.

@0xSero: Rejoice fellow 6000 enjoyers. We have GLM at home

X AI KOLs Following · 5d ago

A turnkey Docker setup to serve the GLM-5.2-NVFP4-REAP-469B model on 4× RTX PRO 6000 Blackwell GPUs using vLLM, with detailed instructions and configuration options.

@amitiitbhu: New Article: How does vLLM work? Read here: https://outcomeschool.com/blog/how-does-vllm-work…

X AI KOLs Timeline · 6d ago

A detailed blog post explaining how vLLM works, including PagedAttention, KV cache management, and continuous batching for efficient LLM serving.

@robertnishihara: Some intuition about PD disaggregation from the blog - PD doesn't speed up prefill and can actually hurt TTFT - PD's re…

X AI KOLs Following · 6d ago

This blog post from Anyscale explains the intuition behind Prefill-Decode (PD) disaggregation for LLM serving, showing how separating prefill and decode phases onto dedicated GPUs can achieve up to 2.7x better goodput and 67% cost savings when using Ray and vLLM on AMD MI325X, while also discussing when PD disaggregation does not help.

@TheAhmadOsman: You can run local models at home and use any agent harness like Codex or Claude Code with them

X AI KOLs Following · 2026-06-16

Ahmad built a simple tool that makes Claude Code work with any local LLM, demonstrated using vLLM serving GLM-4.5 Air on 4x RTX 3090s.

vLLM has a new streaming parser for Qwen3+ available in nightly

Reddit r/LocalLLaMA · 2026-06-15

vLLM now has a streaming parser for Qwen3+ models, available in the nightly build. vLLM is a fast and easy-to-use library for LLM inference and serving.

@CyrusHakha: One pattern we keep seeing with customers serving LLMs at scale: Prefill-decode disaggregation is often treated like a …

X AI KOLs Following · 2026-06-15

Discusses the nuanced reality of prefill-decode disaggregation in LLM serving at scale, based on customer patterns and validated on AMD with vLLM.

Tangram: Unlocking Non-Uniform KV Cache Compression for Efficient Multi-turn LLM Serving

Hugging Face Daily Papers · 2026-06-15

The paper introduces Tangram, a serving framework that statically resolves non-uniform KV cache compression for multi-turn LLM serving, achieving up to 2.6x throughput improvement over the full-KV baseline by eliminating runtime overheads.

@MiaAI_lab: A PR to vLLM to allow TP=3 for MiniMax M3 His NVFP4 quant is 260GB - lukealonso/MiniMax-M3-NVFP4 Hopefully this will wo…

X AI KOLs Timeline · 2026-06-14

A pull request to vLLM adds support for tensor parallelism degree 3 (TP=3) for MiniMax M3. The lukealonso/MiniMax-M3-NVFP4 quantization is 260GB, so TP=3 would let the model run on 3x DGX Sparks at roughly 87GB per device.

Minimax M3 sm_120

Reddit r/LocalLLaMA · 2026-06-12

MiniMax's M3 model requires vLLM updates to support the sm_120 compute capability; the current repo supports only sm_100.

DiffusionGemma 4 on 4x7900xtx

Reddit r/LocalLLaMA · 2026-06-11

A user reports running DiffusionGemma 26B on four AMD 7900 XTX GPUs using vLLM, reaching about 100 t/s in generation and 45-60 t/s overall, and shares performance metrics and setup commands.

@vllm_project: Congrats to @GoogleDeepMind on DiffusionGemma A 26B diffusion language model on the Gemma4 backbone, and the first dLLM…

X AI KOLs Timeline · 2026-06-10

vLLM announces native support for Google DeepMind's DiffusionGemma, a 26B discrete diffusion language model that generates 256-token blocks in parallel, enabling low-latency inference at 1200+ tok/s on a single H200.
