Qwen3-TTS Technical Report

Papers with Code Trending 01/22/26, 03:51 AM Papers

text-to-speech voice-cloning multilingual open-source qwen speech-synthesis low-latency

Summary

The Qwen3-TTS technical report introduces a series of advanced multilingual text-to-speech models with voice cloning and controllable generation, featuring a dual-track LM architecture and specialized tokenizers for low-latency streaming.

In this report, we present the Qwen3-TTS series, a family of advanced multilingual, controllable, robust, and streaming text-to-speech models. Qwen3-TTS supports state-of-the-art 3-second voice cloning and description-based control, allowing both the creation of entirely novel voices and fine-grained manipulation of the output speech. Trained on over 5 million hours of speech data spanning 10 languages, Qwen3-TTS adopts a dual-track LM architecture for real-time synthesis, coupled with two speech tokenizers: 1) Qwen-TTS-Tokenizer-25Hz is a single-codebook codec emphasizing semantic content, which offers seamless integration with Qwen-Audio and enables streaming waveform reconstruction via a block-wise DiT. 2) Qwen-TTS-Tokenizer-12Hz achieves extreme bitrate reduction and ultra-low-latency streaming, enabling immediate first-packet emission (97 ms) through its 12.5 Hz, 16-layer multi-codebook design and a lightweight causal ConvNet. Extensive experiments indicate state-of-the-art performance across diverse objective and subjective benchmarks (e.g., TTS multilingual test set, InstructTTSEval, and our long speech test set). To facilitate community research and development, we release both tokenizers and models under the Apache 2.0 license.

Source: https://huggingface.co/papers/2601.15621

Abstract

The Qwen3-TTS series presents advanced multilingual text-to-speech models with voice cloning and controllable speech generation capabilities, utilizing a dual-track LM architecture and specialized speech tokenizers for efficient streaming synthesis.

In this report, we present the Qwen3-TTS series, a family of advanced multilingual, controllable, robust, and streaming text-to-speech models. Qwen3-TTS supports state-of-the-art 3-second voice cloning and description-based control, allowing both the creation of entirely novel voices and fine-grained manipulation of the output speech. Trained on over 5 million hours of speech data spanning 10 languages, Qwen3-TTS adopts a dual-track LM architecture for real-time synthesis, coupled with two speech tokenizers: 1) Qwen-TTS-Tokenizer-25Hz is a single-codebook codec emphasizing semantic content, which offers seamless integration with Qwen-Audio and enables streaming waveform reconstruction via a block-wise DiT. 2) Qwen-TTS-Tokenizer-12Hz achieves extreme bitrate reduction and ultra-low-latency streaming, enabling immediate first-packet emission (97 ms) through its 12.5 Hz, 16-layer multi-codebook design and a lightweight causal ConvNet. Extensive experiments indicate state-of-the-art performance across diverse objective and subjective benchmarks (e.g., TTS multilingual test set, InstructTTSEval, and our long speech test set). To facilitate community research and development, we release both tokenizers and models under the Apache 2.0 license.
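
To make the tokenizer figures in the abstract concrete, the following is a minimal sketch of the arithmetic they imply: how much audio one token frame covers and how many discrete codes are emitted per second. It is not code from the Qwen3-TTS release. The `TokenizerSpec` class and its fields are hypothetical names introduced for illustration, the 25 Hz frame rate is inferred from the tokenizer's name rather than stated in the abstract, and the codebook size passed to `bitrate_bps` is a placeholder, since the abstract does not give one.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenizerSpec:
    """Hypothetical container for the figures quoted in the abstract."""
    name: str
    frame_rate_hz: float   # token frames per second of audio
    num_codebooks: int     # codes emitted per frame

    @property
    def frame_duration_ms(self) -> float:
        # Audio duration covered by a single token frame.
        return 1000.0 / self.frame_rate_hz

    @property
    def codes_per_second(self) -> float:
        # Total discrete codes produced per second of audio.
        return self.frame_rate_hz * self.num_codebooks

    def bitrate_bps(self, codebook_size: int) -> float:
        # Bits per second if every codebook has `codebook_size` entries.
        # ASSUMPTION: the abstract does not state codebook sizes.
        return self.codes_per_second * math.log2(codebook_size)


# 25 Hz is assumed from the tokenizer name; "single-codebook" is from the abstract.
TOKENIZER_25HZ = TokenizerSpec("Qwen-TTS-Tokenizer-25Hz", 25.0, 1)
# 12.5 Hz and 16 codebook layers are stated in the abstract.
TOKENIZER_12HZ = TokenizerSpec("Qwen-TTS-Tokenizer-12Hz", 12.5, 16)

if __name__ == "__main__":
    for spec in (TOKENIZER_25HZ, TOKENIZER_12HZ):
        print(
            f"{spec.name}: {spec.frame_duration_ms:.0f} ms per frame, "
            f"{spec.codes_per_second:.0f} codes/s"
        )
    # Placeholder codebook size of 1024 entries (10 bits per code), illustration only.
    print(f"Illustrative bitrate: {TOKENIZER_12HZ.bitrate_bps(1024):.0f} bit/s")
```

Under these assumptions, one frame of the 12.5 Hz tokenizer spans 80 ms of audio and carries 16 codes, while one frame of the 25 Hz tokenizer spans 40 ms and carries a single code. The bitrate line only shows how such a figure would be derived once the real codebook sizes are known.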