Qwen3-TTS Technical Report
Summary
The Qwen3-TTS technical report introduces a series of advanced multilingual text-to-speech models with voice cloning and controllable generation, featuring a dual-track LM architecture and specialized tokenizers for low-latency streaming.
Cached at: 05/10/26, 06:36 PM
Source: https://huggingface.co/papers/2601.15621
Abstract
The Qwen3-TTS series presents advanced multilingual text-to-speech models with voice cloning and controllable speech generation capabilities, using a dual-track LM architecture and specialized speech tokenizers for efficient streaming synthesis.
In this report, we present the Qwen3-TTS series, a family of advanced multilingual, controllable, robust, and streaming text-to-speech models. Qwen3-TTS supports state-of-the-art 3-second voice cloning and description-based control, allowing both the creation of entirely novel voices and fine-grained manipulation of the output speech. Trained on over 5 million hours of speech data spanning 10 languages, Qwen3-TTS adopts a dual-track LM architecture for real-time synthesis, coupled with two speech tokenizers: 1) Qwen-TTS-Tokenizer-25Hz is a single-codebook codec emphasizing semantic content, which offers seamless integration with Qwen-Audio and enables streaming waveform reconstruction via a block-wise DiT. 2) Qwen-TTS-Tokenizer-12Hz achieves extreme bitrate reduction and ultra-low-latency streaming, enabling immediate first-packet emission (97 ms) through its 12.5 Hz, 16-layer multi-codebook design and a lightweight causal ConvNet. Extensive experiments indicate state-of-the-art performance across diverse objective and subjective benchmarks (e.g., the TTS multilingual test set, InstructTTSEval, and our long speech test set). To facilitate community research and development, we release both tokenizers and models under the Apache 2.0 license.
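The frame rates and codebook counts quoted in the abstract translate directly into per-frame durations and token throughput. The following is a minimal arithmetic sketch, not code from the report: the `TokenizerSpec` class and its field names are hypothetical, the 25 Hz frame rate is inferred from the tokenizer's name rather than stated in the abstract, and the codebook size passed to `bitrate_bps` is a placeholder, since the abstract does not give one.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenizerSpec:
    """Hypothetical container for the figures quoted in the abstract."""

    name: str
    frame_rate_hz: float  # token frames emitted per second of audio
    num_codebooks: int    # codebook layers per frame

    @property
    def frame_duration_ms(self) -> float:
        # Audio duration covered by a single token frame.
        return 1000.0 / self.frame_rate_hz

    @property
    def tokens_per_second(self) -> float:
        # Discrete codes produced per second of audio.
        return self.frame_rate_hz * self.num_codebooks

    def bitrate_bps(self, codebook_size: int) -> float:
        # Each code carries log2(codebook_size) bits.
        # codebook_size is an assumption; the abstract does not state it.
        return self.tokens_per_second * math.log2(codebook_size)


# 25 Hz is inferred from the tokenizer name (assumption); single codebook per the abstract.
TOKENIZER_25HZ = TokenizerSpec("Qwen-TTS-Tokenizer-25Hz", 25.0, 1)
# 12.5 Hz and 16 codebook layers are stated in the abstract.
TOKENIZER_12HZ = TokenizerSpec("Qwen-TTS-Tokenizer-12Hz", 12.5, 16)

if __name__ == "__main__":
    for spec in (TOKENIZER_25HZ, TOKENIZER_12HZ):
        print(
            f"{spec.name}: {spec.frame_duration_ms:.0f} ms per frame, "
            f"{spec.tokens_per_second:.0f} codes per second"
        )
```

Under these figures, one 12.5 Hz frame spans 80 ms of audio, so the reported 97 ms first-packet latency is on the order of a single frame.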
Models citing this paper: 240
#### Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice Text-to-Speech • 2B • Updated Jan 29 • 1.65M • 1.46k
#### Qwen/Qwen3-TTS-12Hz-1.7B-Base Updated Jan 23 • 1.67M • 390
#### Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign Text-to-Speech • 2B • Updated Jan 29 • 454k • 338
#### Qwen/Qwen3-TTS-12Hz-0.6B-Base Text-to-Speech • Updated Jan 29 • 665k • 234
Datasets citing this paper: 1
#### Izzyzlin/CFSDD Viewer • Updated Apr 7 • 395k • 264
Spaces citing this paper: 1,583
Collections including this paper: 20