vllm

#vllm

vLLM ROCm has been added to Lemonade as an experimental backend

Reddit r/LocalLLaMA ↗ · yesterday

Lemonade has added vLLM on ROCm as an experimental backend, letting users run safetensors LLMs on AMD GPUs with a simple command.

#vllm

Gemma 4 26B Hits 600 Tok/s on One RTX 5090

Reddit r/LocalLLaMA ↗ · yesterday

A benchmark shows that using vLLM with DFlash speculative decoding boosts Gemma 4 26B inference to ~578 tokens per second on a single RTX 5090, achieving a 2.56x speedup over baseline.

#vllm

@knoYee_: https://x.com/knoYee_/status/2052626513888203131

X AI KOLs Timeline ↗ · yesterday

This article introduces 7 production-ready skills from the Hermes Skills Hub, covering the full lifecycle from tool integration and structured output to deployment, observability, and security.

#vllm

UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification

arXiv cs.CL ↗ · yesterday

UniPrefill is a prefill acceleration framework, proposed in a new research paper, that applies block-wise dynamic sparsification to universal long-context processing in LLMs. It integrates with vLLM and achieves up to 2.1x speedup in Time-To-First-Token across various model architectures.

#vllm

Benchmark Qwen 3.6 27B MTP on 2x3090 NVLINK

Reddit r/LocalLLaMA ↗ · yesterday

A benchmark of Qwen 3.6 27B MTP on 2x RTX 3090 GPUs shows that using NVLink for tensor parallelism yields significant throughput improvements (up to +53%) over PCIe configurations.

#vllm

TokenSpeed: A Speed-of-Light LLM Inference Engine for Agentic Workloads (5 minute read)

TLDR AI ↗ · 2d ago

Lightseek releases TokenSpeed, a high-performance LLM inference engine optimized for agentic workloads, featuring compiler-backed parallelism and advanced kernel optimizations that have been adopted by vLLM.

#vllm

vLLM V0 to V1: Correctness Before Corrections in RL

Hugging Face Blog ↗ · 3d ago

ServiceNow engineers detail their migration from vLLM V0 to V1, focusing on resolving backend correctness issues like logprob semantics and runtime defaults to ensure stable reinforcement learning training dynamics.

#vllm

vllm-project/vllm v0.20.1

GitHub Releases Watchlist ↗ · 5d ago

vLLM v0.20.1 is a patch release of the popular open-source LLM inference and serving library, which focuses on high throughput and efficient memory management.

#vllm

vllm-project/vllm v0.20.2rc0: [MRV2] Add shutdown() method (#41297)

GitHub Releases Watchlist ↗ · 6d ago

The vLLM v0.20.2rc0 release candidate adds a shutdown() method, a change tagged [MRV2] in the release title (#41297).

#vllm

z-lab/gemma-4-31B-it-DFlash

Hugging Face Models Trending ↗ · 2026-04-30

Z-lab released DFlash, a speculative decoding drafter model for Gemma-4-31B-it that uses lightweight block diffusion to draft multiple tokens in parallel, achieving up to a 5.8x speedup over the autoregressive baseline.

#vllm

vllm-project/vllm v0.20.0

GitHub Releases Watchlist ↗ · 2026-04-27

vLLM v0.20.0 has been released. vLLM is an open-source library for high-throughput LLM inference and serving, featuring PagedAttention and support for various hardware architectures.

#vllm

vllm-project/vllm v0.20.1rc0: Add system_fingerprint field to OpenAI-compatible API responses (#40537)

GitHub Releases Watchlist ↗ · 2026-04-27

vLLM version 0.20.1rc0 is released, adding a system_fingerprint field to OpenAI-compatible API responses for better request tracking.

#vllm

froggeric/Qwen-Fixed-Chat-Templates

Hugging Face Models Trending ↗ · 2026-04-23

This repository provides fixed Jinja chat templates for Qwen 3.5 and 3.6, addressing rendering errors, token waste, and missing features in the official templates for engines like LM Studio and llama.cpp.

#vllm

Intel LLM-Scaler vllm-0.14.0-b8.2 released with official Arc Pro B70 support

Reddit r/artificial ↗ · 2026-04-22

Intel’s LLM-Scaler vllm-0.14.0-b8.2 adds official support for the Arc Pro B70 GPU, enabling Docker-based large-model inference on Battlemage hardware.

#vllm

@iotcoi: Ran Google’s cookbook with 10 agents on my tiny GB10 GPU. 436 tok/s / 43.6 per agent Qwen3.6-35B + Dflash + DDTree on v…

X AI KOLs Timeline ↗ · 2026-04-22

A developer ran 10 concurrent agents on the 35B-parameter Qwen3.6 model using vLLM on a single 74W GB10 GPU, reaching 436 tok/s in total (43.6 tok/s per agent) and demonstrating high-efficiency edge deployment.

#vllm

@vllm_project: We just shipped a major redesign of http://recipes.vllm.ai. "How do I run model X on hardware Y for task Z?" now has a …

X AI KOLs Following ↗ · 2026-04-21

vLLM launched a redesigned recipes site that turns any HuggingFace model URL into a ready-to-run inference recipe for specific hardware and tasks.

#vllm

Every AI researcher should grasp inference acceleration—CUDA Graph is the heart of vLLM's GPU efficiency

X AI KOLs Timeline ↗ · 2026-04-21

A tweet urging AI researchers to learn inference-acceleration basics and spotlighting CUDA Graph as the key to vLLM’s GPU utilization.

#vllm

@0xSero: Here's everything you need to know about inference and hosting LLMs. Have you ever seen: - vllm - sglang - llama.cpp - …

X AI KOLs Timeline ↗ · 2026-04-20

An overview of popular open-source inference engines including vLLM, SGLang, llama.cpp, and ExLlamaV3 for hosting and running large language models.

#vllm

@RedHat_AI: GuideLLM just hit 1,000 GitHub stars. Benchmarking tool for LLM inference under @vllm_project. Test your deployment wit…

X AI KOLs Following ↗ · 2026-04-20

GuideLLM, a benchmarking tool for LLM inference maintained under the vLLM project, reached 1,000 GitHub stars. It lets developers test deployments with real workloads and measure throughput and latency before production.

#vllm

Qwen3.5-27B, Qwen3.5-122B, and Qwen3.6-35B on 4x RTX 3090 — MoEs struggle with strict global rules

Reddit r/LocalLLaMA ↗ · 2026-04-20

A user benchmarks three Qwen models (Qwen3.5-27B dense, Qwen3.5-122B-A10B MoE, Qwen3.6-35B-A3B MoE) on 4x RTX 3090 GPUs under real agentic workloads. The MoE models consistently underperform the dense 27B at following strict global rules despite their speed advantages, while Qwen3.6-35B leads in generation throughput.
