VibeVoice Technical Report


Summary

VibeVoice is a new model from Microsoft that synthesizes long-form multi-speaker speech using next-token diffusion and a highly efficient continuous speech tokenizer. It achieves superior fidelity and compression, supporting up to 90 minutes of audio with multiple speakers.

This report presents VibeVoice, a novel model designed to synthesize long-form speech with multiple speakers by employing next-token diffusion, which is a unified method for modeling continuous data by autoregressively generating latent vectors via diffusion. To enable this, we introduce a novel continuous speech tokenizer that, when compared to the popular Encodec model, improves data compression by 80 times while maintaining comparable performance. The tokenizer effectively preserves audio fidelity while significantly boosting computational efficiency for processing long sequences. Thus, VibeVoice can synthesize long-form speech for up to 90 minutes (in a 64K context window length) with a maximum of 4 speakers, capturing the authentic conversational "vibe" and surpassing open-source and proprietary dialogue models.
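The 90-minute and 64K figures together imply an average token budget per second of audio, which shows why a highly compressive tokenizer matters for long sequences. The following is a minimal back-of-the-envelope sketch, not a figure from the report: it assumes "64K" means 64 × 1024 tokens and that the whole window is available for one session. Since the window presumably also has to hold non-speech tokens such as the script, the result is best read as an upper bound on the speech token rate. The function name `tokens_per_second` is illustrative.

```python
# Average token budget implied by the figures quoted in the abstract.
# Assumption: "64K" is read as 64 * 1024; with 64,000 the result is ~11.85.
CONTEXT_TOKENS = 64 * 1024
MAX_MINUTES = 90


def tokens_per_second(context_tokens: int, minutes: float) -> float:
    """Average number of context tokens available per second of audio."""
    return context_tokens / (minutes * 60)


budget = tokens_per_second(CONTEXT_TOKENS, MAX_MINUTES)
print(f"{budget:.2f} tokens per second of audio")  # prints "12.14 tokens per second of audio"
```

In other words, fitting 90 minutes into the stated window leaves only about a dozen tokens per second on average for everything the model sees.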




Source: https://huggingface.co/papers/2508.19205 Published on Aug 26, 2025

Submitted by Li Dong (https://huggingface.co/unilm) on Aug 27, 2025

#1 Paper of the day

Abstract

VibeVoice synthesizes long-form multi-speaker speech using next-token diffusion and a highly efficient continuous speech tokenizer, achieving superior performance and fidelity.

This report presents VibeVoice, a novel model designed to synthesize long-form speech with multiple speakers by employing next-token diffusion, which is a unified method for modeling continuous data by autoregressively generating latent vectors via diffusion. To enable this, we introduce a novel continuous speech tokenizer that, when compared to the popular Encodec model, improves data compression by 80 times while maintaining comparable performance. The tokenizer effectively preserves audio fidelity while significantly boosting computational efficiency for processing long sequences. Thus, VibeVoice can synthesize long-form speech for up to 90 minutes (in a 64K context window length) with a maximum of 4 speakers, capturing the authentic conversational "vibe" and surpassing open-source and proprietary dialogue models.
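The phrase "autoregressively generating latent vectors via diffusion" can be illustrated with a toy control-flow sketch. This is not VibeVoice's architecture or code: `toy_backbone` and `toy_denoiser` are hypothetical stand-ins for a learned sequence model and a learned diffusion head, the noise schedule and sizes are arbitrary, and the sampler is a generic deterministic (DDIM-style) update. The only point is the loop structure: condition on the past latents, denoise from Gaussian noise to get the next continuous latent, append it, and repeat.

```python
import math
import random

LATENT_DIM = 4   # hypothetical latent size
NUM_STEPS = 10   # hypothetical number of denoising steps
# Cumulative signal level: ALPHA_BAR[0] = 1 (clean), falling to ~0.01 (mostly noise).
ALPHA_BAR = [1.0 - 0.099 * t for t in range(NUM_STEPS + 1)]


def toy_backbone(history):
    """Stand-in for the autoregressive model: summarize past latents."""
    if not history:
        return [0.0] * LATENT_DIM
    return [math.tanh(sum(col) / len(history)) for col in zip(*history)]


def toy_denoiser(x_t, t, cond):
    """Stand-in for a learned diffusion head: predicts the clean latent."""
    a = math.sqrt(ALPHA_BAR[t])
    return [0.5 * c + 0.5 * a * x for c, x in zip(cond, x_t)]


def sample_next_latent(cond, rng):
    """Denoise from Gaussian noise to one latent, conditioned on `cond`."""
    x = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
    for t in range(NUM_STEPS, 0, -1):
        x0 = toy_denoiser(x, t, cond)
        a_t, a_prev = ALPHA_BAR[t], ALPHA_BAR[t - 1]
        # Noise estimate implied by the clean-latent prediction.
        eps = [(xi - math.sqrt(a_t) * x0i) / math.sqrt(1.0 - a_t)
               for xi, x0i in zip(x, x0)]
        # Deterministic step to the next (less noisy) level.
        x = [math.sqrt(a_prev) * x0i + math.sqrt(1.0 - a_prev) * ei
             for x0i, ei in zip(x0, eps)]
    return x


def generate(num_tokens, seed=0):
    """Autoregressive loop: each new latent is conditioned on all earlier ones."""
    rng = random.Random(seed)
    history = []
    for _ in range(num_tokens):
        cond = toy_backbone(history)
        history.append(sample_next_latent(cond, rng))
    return history


latents = generate(5)
print(len(latents), len(latents[0]))  # prints "5 4"
```

In a real system the latents would come from a speech tokenizer and be decoded back to audio; here they are just lists of floats, which keeps the sketch self-contained.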


Get this paper in your agent:

hf papers read 2508.19205

Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash