VibeVoice Technical Report
Summary
VibeVoice is a new model from Microsoft that synthesizes long-form multi-speaker speech using next-token diffusion and a highly efficient continuous speech tokenizer. The tokenizer is reported to improve data compression by 80 times over Encodec at comparable performance, and the model supports up to 90 minutes of audio with up to 4 speakers.
Source: https://huggingface.co/papers/2508.19205 Published on Aug 26, 2025
Submitted by Li Dong (https://huggingface.co/unilm) on Aug 27, 2025
#1 Paper of the day
Abstract
VibeVoice synthesizes long-form multi-speaker speech using next-token diffusion and a highly efficient continuous speech tokenizer, achieving superior performance and fidelity.
This report presents VibeVoice, a novel model designed to synthesize long-form speech with multiple speakers by employing next-token diffusion, which is a unified method for modeling continuous data by autoregressively generating latent vectors via diffusion. To enable this, we introduce a novel continuous speech tokenizer that, when compared to the popular Encodec model, improves data compression by 80 times while maintaining comparable performance. The tokenizer effectively preserves audio fidelity while significantly boosting computational efficiency for processing long sequences. Thus, VibeVoice can synthesize long-form speech for up to 90 minutes (in a 64K context window length) with a maximum of 4 speakers, capturing the authentic conversational “vibe” and surpassing open-source and proprietary dialogue models.
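As a rough sanity check on the figures in the abstract, the sketch below works out how many tokens per second of audio the tokenizer could emit at most if 90 minutes must fit in a 64K context. It uses only the two numbers stated above. Reading "64K" as 65,536 is an assumption (64,000 is also tried), the function name is hypothetical, and the result is an upper bound because the context must also hold the text script and any other tokens.

```python
def max_tokens_per_second(context_tokens: int, minutes: float) -> float:
    """Upper bound on speech tokens per second of audio, assuming the
    entire context window were filled with speech tokens only."""
    return context_tokens / (minutes * 60.0)


MINUTES = 90                      # from the abstract
CONTEXT_64K_BINARY = 64 * 1024    # assumption: "64K" read as 65,536
CONTEXT_64K_DECIMAL = 64_000      # alternative reading of "64K"

rate_binary = max_tokens_per_second(CONTEXT_64K_BINARY, MINUTES)
rate_decimal = max_tokens_per_second(CONTEXT_64K_DECIMAL, MINUTES)

print(f"64K = 65,536 tokens: at most {rate_binary:.2f} tokens per second of audio")
print(f"64K = 64,000 tokens: at most {rate_decimal:.2f} tokens per second of audio")
```

Under either reading the budget is about 12 tokens per second of audio, which gives a sense of how aggressive the tokenizer's compression has to be for long-form generation to fit in the window.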
Get this paper in your agent:
hf papers read 2508.19205
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 66
#### microsoft/VibeVoice-1.5B Text-to-Speech • 3B • Updated Jan 22 • 259k • 2.36k
#### microsoft/VibeVoice-Realtime-0.5B Text-to-Speech • 1B • Updated Dec 12, 2025 • 949k • 1.22k
#### aoi-ot/VibeVoice-Large Text-to-Speech • 9B • Updated Sep 25, 2025 • 8.38k • 230
#### vibevoice/VibeVoice-7B Text-to-Speech • 9B • Updated Sep 5, 2025 • 8.42k • 178
## Datasets citing this paper: 0
No dataset linking this paper
Spaces citing this paper: 123
Collections including this paper: 22
Similar Articles
Qwen3-TTS Technical Report
The Qwen3-TTS technical report introduces a series of advanced multilingual text-to-speech models with voice cloning and controllable generation, featuring a dual-track LM architecture and specialized tokenizers for low-latency streaming.
datawhalechina/easy-vibe
easy-vibe is an open-source beginner-friendly learning resource and tutorial framework by Datawhale China that teaches users to build AI-powered applications through natural language (vibe coding). It includes interactive tutorials, a learning map, and multilingual support.
Open Vibe
Open Vibe is a new product designed to help developers ship SaaS applications with AI integration without facing common development bottlenecks.
VibeAround
VibeAround is a local AI coding agent that users can chat with from any instant messaging app or web browser.
Vokenization: Multimodel Learning for Vision and Language
The article explains 'Vokenization,' a multimodal learning technique that bridges computer vision and natural language processing by using weak supervision to link visual data with language tokens. It contrasts this approach with text-only models like GPT-3 and BERT, highlighting how visual grounding can improve language understanding.