Qwen3-TTS Technical Report

Summary

The Qwen3-TTS technical report introduces a series of advanced multilingual text-to-speech models with voice cloning and controllable generation, featuring a dual-track LM architecture and specialized tokenizers for low-latency streaming.

In this report, we present the Qwen3-TTS series, a family of advanced multilingual, controllable, robust, and streaming text-to-speech models. Qwen3-TTS supports state-of-the-art 3-second voice cloning and description-based control, allowing both the creation of entirely novel voices and fine-grained manipulation of the output speech. Trained on over 5 million hours of speech data spanning 10 languages, Qwen3-TTS adopts a dual-track LM architecture for real-time synthesis, coupled with two speech tokenizers: 1) Qwen-TTS-Tokenizer-25Hz is a single-codebook codec emphasizing semantic content, which offers seamless integration with Qwen-Audio and enables streaming waveform reconstruction via a block-wise DiT. 2) Qwen-TTS-Tokenizer-12Hz achieves extreme bitrate reduction and ultra-low-latency streaming, enabling immediate first-packet emission (97 ms) through its 12.5 Hz, 16-layer multi-codebook design and a lightweight causal ConvNet. Extensive experiments indicate state-of-the-art performance across diverse objective and subjective benchmarks (e.g., the TTS multilingual test set, InstructTTSEval, and our long speech test set). To facilitate community research and development, we release both tokenizers and models under the Apache 2.0 license.
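The frame rates and codebook counts quoted above translate directly into frame durations and token rates. The following is a minimal sketch of that arithmetic, not code from the Qwen3-TTS release: the `TokenizerSpec` class and its method names are hypothetical, and the codebook size needed for a bitrate estimate is not stated in the abstract, so it is left as a caller-supplied assumption.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenizerSpec:
    """Hypothetical container for the figures quoted in the abstract."""
    name: str
    frame_rate_hz: float  # token frames per second of audio
    num_codebooks: int    # codebook layers emitted per frame

    @property
    def frame_duration_ms(self) -> float:
        # Audio duration covered by one token frame.
        return 1000.0 / self.frame_rate_hz

    @property
    def tokens_per_second(self) -> float:
        # Discrete tokens produced per second of audio.
        return self.frame_rate_hz * self.num_codebooks

    def bitrate_bps(self, codebook_size: int) -> float:
        # Standard codec bitrate formula; codebook_size is an assumption
        # supplied by the caller, since the abstract does not state it.
        return self.tokens_per_second * math.log2(codebook_size)


# Frame rates and layer counts as stated in the abstract.
TOKENIZER_25HZ = TokenizerSpec("Qwen-TTS-Tokenizer-25Hz", 25.0, 1)
TOKENIZER_12HZ = TokenizerSpec("Qwen-TTS-Tokenizer-12Hz", 12.5, 16)

for spec in (TOKENIZER_25HZ, TOKENIZER_12HZ):
    print(f"{spec.name}: {spec.frame_duration_ms:.0f} ms per frame, "
          f"{spec.tokens_per_second:.0f} tokens per second")
```

Under these figures, one frame of the 12.5 Hz tokenizer covers 80 ms of audio and one frame of the 25 Hz tokenizer covers 40 ms. How the reported 97 ms first-packet latency breaks down is not described in the abstract, so the sketch makes no claim about it.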

Cached at: 05/10/26, 06:36 PM

Source: https://huggingface.co/papers/2601.15621

Abstract

The Qwen3-TTS series presents advanced multilingual text-to-speech models with voice cloning and controllable speech generation, using a dual-track LM architecture and specialized speech tokenizers for efficient streaming synthesis.


Get this paper in your agent:

hf papers read 2601.15621

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 240

Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice • Text-to-Speech • 2B • Updated Jan 29 • 1.65M • 1.46k

Qwen/Qwen3-TTS-12Hz-1.7B-Base • Updated Jan 23 • 1.67M • 390

Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign • Text-to-Speech • 2B • Updated Jan 29 • 454k • 338

Qwen/Qwen3-TTS-12Hz-0.6B-Base • Text-to-Speech • Updated Jan 29 • 665k • 234

Datasets citing this paper: 1

Izzyzlin/CFSDD • Updated Apr 7 • 395k • 264

Spaces citing this paper: 1,583

Collections including this paper: 20
