Lemonade has added an experimental ROCm backend for vLLM, letting users run safetensors LLMs on AMD GPUs with a single command.
A benchmark shows that using vLLM with DFlash speculative decoding boosts Gemma 4 26B inference to ~578 tokens per second on a single RTX 5090, achieving a 2.56x speedup over baseline.
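DFlash presumably ships its own drafter integration; as a rough stand-in, the sketch below uses vLLM's built-in n-gram speculative decoding, which exercises the same draft-then-verify loop (the model id and token counts are placeholders, not the benchmark's configuration):

```python
from vllm import LLM, SamplingParams

# Minimal speculative-decoding sketch using vLLM's built-in n-gram drafter
# (a stand-in for DFlash; all ids and values here are illustrative).
llm = LLM(
    model="google/gemma-3-27b-it",    # placeholder model id
    speculative_config={
        "method": "ngram",            # draft tokens via prompt lookup
        "num_speculative_tokens": 5,  # tokens drafted per verification step
        "prompt_lookup_max": 4,       # longest n-gram to match against
    },
)
out = llm.generate(["Explain speculative decoding briefly."],
                   SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```

The speedup comes from verifying several drafted tokens in a single target-model forward pass; a learned drafter like DFlash raises the acceptance rate well beyond what simple n-gram lookup achieves.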
This article introduces 7 production-ready skills from the Hermes Skills Hub, covering the full lifecycle from tool integration and structured output to deployment, observability, and security.
UniPrefill, a prefill-acceleration framework proposed in a new research paper, enables block-wise dynamic sparsification for universal long-context processing in LLMs. Integrated with vLLM, it achieves up to a 2.1x Time-To-First-Token speedup across a range of model architectures.
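The paper's exact mechanism isn't reproduced here; this toy sketch only illustrates the general idea behind block-wise dynamic sparsification, scoring pooled key blocks per query block and keeping the top-k before attention (all shapes and names are invented):

```python
import torch

# Toy illustration of block-wise dynamic sparsification (not UniPrefill's
# actual algorithm): pool queries and keys into blocks, score block pairs,
# and keep only the top-k key blocks per query block before full attention.
def topk_key_blocks(q, k, block=64, keep=4):
    # q, k: [seq, dim]
    qb = q.unflatten(0, (-1, block)).mean(1)   # [num_q_blocks, dim]
    kb = k.unflatten(0, (-1, block)).mean(1)   # [num_k_blocks, dim]
    scores = qb @ kb.T                         # block-level affinity
    return scores.topk(keep, dim=-1).indices   # kept key-block ids per query block

q = torch.randn(1024, 128)
k = torch.randn(1024, 128)
print(topk_key_blocks(q, k).shape)  # torch.Size([16, 4])
```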
A benchmark analysis of Qwen 3.6 27B MTP on 4x RTX 3090 GPUs, demonstrating that using NVLink for tensor parallelism yields significant throughput improvements (up to +53%) over PCIe configurations.
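The software side of such a setup is a single knob; a minimal sketch, assuming a placeholder model id. tensor_parallel_size=4 shards every layer across the four cards, producing the per-layer all-reduce traffic where NVLink's bandwidth advantage over PCIe shows up:

```python
from vllm import LLM, SamplingParams

# Tensor parallelism across 4 GPUs: each layer's weights are sharded and
# activations are all-reduced between cards at every layer, so interconnect
# bandwidth (NVLink vs. PCIe) directly bounds throughput.
llm = LLM(model="Qwen/Qwen2.5-32B-Instruct",  # placeholder model id
          tensor_parallel_size=4)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```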
Lightseek releases TokenSpeed, a high-performance LLM inference engine optimized for agentic workloads, featuring compiler-backed parallelism and advanced kernel optimizations that have been adopted by vLLM.
ServiceNow engineers detail their migration from vLLM V0 to V1, focusing on resolving backend correctness issues like logprob semantics and runtime defaults to ensure stable reinforcement learning training dynamics.
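A minimal probe of the surface in question, assuming nothing about ServiceNow's setup beyond vLLM's public logprobs API: request per-token logprobs and diff them between engine versions to catch semantic drift.

```python
from vllm import LLM, SamplingParams

# Request per-token logprobs so V0-vs-V1 semantics (e.g. whether the sampled
# token's logprob reflects sampling-time adjustments) can be compared directly.
llm = LLM(model="facebook/opt-125m")  # tiny placeholder model
params = SamplingParams(max_tokens=8, temperature=0.0, logprobs=3)
out = llm.generate(["The capital of France is"], params)
for pos, candidates in enumerate(out[0].outputs[0].logprobs):
    print(pos, candidates)  # token_id -> Logprob for the top candidates
```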
vLLM v0.20.1 is a minor version update for the popular open-source LLM inference and serving library, maintaining its focus on high-throughput serving and efficient memory management.
vLLM v0.20.2rc0 release candidate adds a shutdown() method to the LLM serving library.
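Assuming the method lands on the LLM class (the release note doesn't say where), usage would look something like:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model
print(llm.generate(["Hi"], SamplingParams(max_tokens=8))[0].outputs[0].text)
llm.shutdown()  # per the release note; assumed to release engine resources explicitly
```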
Z-lab released DFlash, a speculative decoding drafter model for Gemma-4-31B-it that uses lightweight block diffusion to draft multiple tokens in parallel, achieving up to 5.8x speedup over autoregressive baseline.
vLLM v0.20.0 is released, an open-source library for high-throughput LLM inference and serving, featuring PagedAttention and support for various hardware architectures.
vLLM version 0.20.1rc0 is released, adding a system_fingerprint field to OpenAI-compatible API responses for better request tracking.
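Reading the new field through the standard OpenAI client (the endpoint and model id are assumptions about a local deployment):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # whatever model the server hosts
    messages=[{"role": "user", "content": "ping"}],
)
# Previously empty from vLLM; now populated per the release note.
print(resp.system_fingerprint)
```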
This repository provides fixed Jinja chat templates for Qwen 3.5 and 3.6, addressing rendering errors, token waste, and missing features in the official templates for engines like LM Studio and llama.cpp.
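Dropping in such a fixed template is a one-liner with transformers' apply_chat_template (the file and model names below are placeholders); serving engines accept the same file through their chat-template options.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # placeholder id
fixed = open("qwen_fixed.jinja").read()  # hypothetical fixed template from the repo
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "hello"}],
    chat_template=fixed,  # overrides the template bundled with the model
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```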
Intel’s LLM-Scaler vllm-0.14.0-b8.2 adds official support for the Arc Pro B70 GPU, enabling Docker-based large-model inference on Battlemage hardware.
A developer ran 10 concurrent agents backed by the 35B-parameter Qwen3.6 model on a single 74W GB10 GPU at a combined 436 tok/s using vLLM, demonstrating high-efficiency edge deployment.
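A sketch of the client side of such an experiment, assuming a local vLLM OpenAI-compatible endpoint (the model id and prompts are invented): ten coroutines issue requests concurrently while the server batches them.

```python
import asyncio
from openai import AsyncOpenAI

# Drive N concurrent "agents" against a local vLLM server; vLLM's continuous
# batching serves them together, which is where the aggregate tok/s comes from.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def agent(i: int) -> str:
    resp = await client.chat.completions.create(
        model="Qwen/Qwen3.6-35B",  # hypothetical model id
        messages=[{"role": "user", "content": f"Agent {i}: summarize vLLM."}],
    )
    return resp.choices[0].message.content

async def main():
    results = await asyncio.gather(*(agent(i) for i in range(10)))
    print(len(results), "agents finished")

asyncio.run(main())
```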
vLLM launched a redesigned recipes site that turns any HuggingFace model URL into a ready-to-run inference recipe for specific hardware and tasks.
A tweet urging AI researchers to learn inference-acceleration basics, spotlighting CUDA Graphs as the key to vLLM's GPU utilization.
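The underlying mechanism in miniature, via PyTorch's public CUDA Graphs API (not vLLM's internal capture code): record a fixed kernel sequence once, then replay it without per-kernel CPU launch overhead.

```python
import torch

# Static buffers: a captured graph always reads/writes the same memory.
static_in = torch.randn(256, 256, device="cuda")
weight = torch.randn(256, 256, device="cuda")

# Warm up on a side stream so capture sees initialized kernels/allocations.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_out = static_in @ weight
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = static_in @ weight  # recorded, not executed

static_in.copy_(torch.randn(256, 256, device="cuda"))
g.replay()  # re-runs the captured kernels on the current contents of static_in
print(static_out.sum().item())
```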
An overview of popular open-source inference engines including vLLM, SGLang, llama.cpp, and ExLlamaV3 for hosting and running large language models.
GuideLLM, a benchmarking tool for LLM inference built on the vLLM project, reached 1,000 GitHub stars. It enables developers to test deployments with real workloads and measure throughput and latency before production.
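Not GuideLLM itself, but the kind of measurement it automates can be hand-rolled in a few lines against any OpenAI-compatible endpoint (the URL and model name are assumptions):

```python
import time
from openai import OpenAI

# Hand-rolled version of the basic measurement GuideLLM automates: time a
# request and compute output-token throughput from the usage stats.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def one_request(prompt: str):
    t0 = time.perf_counter()
    resp = client.completions.create(model="served-model",  # placeholder name
                                     prompt=prompt, max_tokens=128)
    dt = time.perf_counter() - t0
    return dt, resp.usage.completion_tokens / dt

lat, tps = one_request("Summarize PagedAttention in two sentences.")
print(f"latency: {lat:.2f}s  throughput: {tps:.1f} tok/s")
```

GuideLLM layers request-rate sweeps, concurrency control, and percentile reporting on top of this basic loop.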
A user benchmarks three Qwen models (Qwen3.5-27B dense, Qwen3.5-122B-A10B MoE, Qwen3.6-35B-A3B MoE) on 4x RTX 3090 GPUs under real agentic workloads, finding that MoE models consistently underperform the dense 27B at following strict global rules despite speed advantages, with the Qwen3.6-35B leading in generation throughput.