@vllm_project: Meet vLLM-Omni v0.22.0, a major upgrade for omnimodal world models and production-grade multimodal serving. Day-0 @NVID…
Summary
vLLM-Omni v0.22.0 is a major upgrade adding robust support for NVIDIA Cosmos world models, production TTS (Qwen3-TTS, Qwen3-Omni, VoxCPM2), faster diffusion model serving (Wan 2.2, HunyuanVideo 1.5, LTX-2.3), and broader quantization and hardware coverage with 339 commits from 124 contributors.
View Cached Full Text
Cached at: 06/08/26, 05:30 PM
Meet vLLM-Omni v0.22.0, a major upgrade for omnimodal world models and production-grade multimodal serving. Day-0 @NVIDIAAI Cosmos 3 world models: text, image, audio, video, and action, in and out. Robot serving: DreamZero + OpenPI realtime API. Production TTS: Qwen3-TTS, Qwen3-Omni, VoxCPM2 and more. Faster image/video/diffusion: Wan 2.2, HunyuanVideo 1.5, LTX-2.3. Broader quantization (FP8/INT8, MXFP4/MXFP8, W4A16, ModelOpt) and hardware coverage. 339 commits, 124 contributors, 52 of them new. Thank you all. https://github.com/vllm-project/vllm-omni/releases/tag/v0.22.0…
vllm-project/vllm-omni
Source: https://github.com/vllm-project/vllm-omni
Easy, fast, and cheap omni-modality model serving for everyone
| Documentation | DeepWiki | User Forum | Developer Slack | WeChat | Paper | Slides |
Latest News 🔥
- [2026/05] We released 0.20.0 - refreshes the serving/runtime stack for large-scale omni workloads, and improves diffusion model performance, quantization, and hardware readiness across CUDA, ROCm, MUSA, NPU, and XPU backends.
- [2026/03] We released 0.18.0 - strengthens the core runtime through a large entrypoint refactor and scheduler/runtime cleanups, expands unified quantization and diffusion execution, broadens multimodal model coverage, and improves production readiness across audio, omni, image, video, RL, and multi-platform deployments.
- [2026/03] Check out our first public project deepdive at the vLLM Hong Kong Meetup!
- [2026/03] vllm-omni-skills is a community-driven collection of AI assistant skills that help developers work with vLLM-Omni more effectively. These skills can be used with popular agentic AI coding assistants like Cursor IDE, Claude, Codex, and more.
- [2026/02] We released 0.16.0 - A major alignment + capability release that rebases onto upstream vLLM v0.16.0 and significantly expands performance, distributed execution, and production readiness across Qwen3-Omni / Qwen3-TTS, Bagel, MiMo-Audio, GLM-Image and the Diffusion (DiT) image/video stack—while also improving platform coverage (CUDA / ROCm / NPU / XPU), CI quality, and documentation.
- [2026/02] We released 0.14.0 - This is the first stable release of vLLM-Omni that expands Omni’s diffusion / image-video generation and audio / TTS stack, improves distributed execution and memory efficiency, and broadens platform/backend coverage (GPU/ROCm/NPU/XPU). It also brings meaningful upgrades to serving APIs, profiling & benchmarking, and overall stability. Please check our latest paper for architecture design and performance results.
- [2025/11] vLLM community officially released vllm-project/vllm-omni in order to support omni-modality models serving.
About
vLLM was originally designed to support large language models for text-based autoregressive generation tasks. vLLM-Omni is a framework that extends its support for omni-modality model inference and serving:
- Omni-modality: Text, image, video, and audio data processing
- Non-autoregressive Architectures: extend the AR support of vLLM to Diffusion Transformers (DiT) and other parallel generation models
- Heterogeneous outputs: from traditional text generation to multimodal outputs
vLLM-Omni is fast with:
- State-of-the-art AR support by leveraging efficient KV cache management from vLLM
- Pipelined stage execution overlapping for high throughput performance
- Fully disaggregation based on OmniConnector and dynamic resource allocation across stages
vLLM-Omni is flexible and easy to use with:
- Heterogeneous pipeline abstraction to manage complex model workflows
- Seamless integration with popular Hugging Face models
- Tensor, pipeline, data and expert parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
vLLM-Omni seamlessly supports most popular open-source models on HuggingFace, including:
- Omni-modality models (e.g. Qwen-Omni)
- Multi-modality generation models (e.g. Qwen-Image)
Getting Started
Visit our documentation to learn more.
Contributing
We welcome and value any contributions and collaborations. Please check out Contributing to vLLM-Omni for how to get involved.
Citation
If you use vLLM-Omni for your research, please cite our paper:
@article{yin2026vllmomni,
title={vLLM-Omni: Fully Disaggregated Serving for Any-to-Any Multimodal Models},
author={Peiqi Yin, Jiangyun Zhu, Han Gao, Chenguang Zheng, Yongxiang Huang, Taichang Zhou, Ruirui Yang, Weizhi Liu, Weiqing Chen, Canlin Guo, Didan Deng, Zifeng Mo, Cong Wang, James Cheng, Roger Wang, Hongsheng Liu},
journal={arXiv preprint arXiv:2602.02204},
year={2026}
}
Join the Community
Feel free to ask questions, provide feedbacks and discuss with fellow users of vLLM-Omni in #sig-omni slack channel at slack.vllm.ai or vLLM user forum at discuss.vllm.ai.
Star History
License
Apache License 2.0, as found in the LICENSE file.
Similar Articles
@vllm_project: vLLM v0.21.0 is out! 367 commits from 202 contributors (49 new). Highlights: KV Offload + HMA, spec decode with thinkin…
vLLM v0.21.0 has been released with KV Offload + HMA, speculative decoding with thinking budget for reasoning models, TOKENSPEED_MLA on Blackwell for DSR1/Kimi K2.5, Mooncake distributed KV, DeepSeek V4 pipeline parallelism, and a C++20 + Transformers v5 baseline.
vllm-project/vllm v0.19.1
vLLM v0.19.1 release - a fast and easy-to-use open-source library for LLM inference and serving with state-of-the-art throughput, supporting 200+ model architectures and diverse hardware including NVIDIA/AMD GPUs and CPUs.
vllm-project/vllm v0.21.0rc1
vLLM v0.21.0rc1 is a pre-release update for the high-performance LLM inference and serving library, featuring optimizations for throughput, quantization, and hardware support.
@Prince_Canuma: mlx-audio v0.4.3 is here A massive release across models, server, and DX → 6 new TTS models: Higgs Audio v2 (voice clon…
mlx-audio v0.4.3 releases with 6 new TTS models including Higgs Audio v2 and OmniVoice (646+ languages), plus server improvements like concurrent requests and continuous batching, ~3x faster Voxtral Realtime on 4-bit, and slimmer dependencies for Apple Silicon.
Qwen3.5-Omni Technical Report
Qwen3.5-Omni is a hundreds-of-billions-parameter multimodal model with advanced audio-visual understanding and generation capabilities, featuring novel Audio-Visual Vibe Coding and achieving SOTA results across 215 benchmarks while matching Gemini-3.1 Pro.