@ZyphraAI: Today we're releasing ZONOS2, our next-generation real-time TTS model with high-fidelity voice cloning. ZONOS2 is the m…
Summary
Zyphra releases ZONOS2, an open-source real-time TTS model with high-fidelity voice cloning, under Apache 2.0, available on Zyphra Cloud on AMD.
Cached at: 06/15/26, 12:52 AM
Today we’re releasing ZONOS2, our next-generation real-time TTS model with high-fidelity voice cloning.
ZONOS2 is the most expressive open-source TTS model, released under Apache 2.0 and available on Zyphra Cloud on @AMD.
Real-time TTS has always forced a tradeoff between quality and speed.
We achieve both with ZONOS2, the first sparse MoE TTS model released open-source, with 8B total params, 900M active.
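To illustrate what "8B total params, 900M active" means in a sparse MoE, here is a toy top-k routing sketch. The expert functions, router scores, and the choice of k = 2 are all hypothetical and are not taken from the ZONOS2 architecture, which the post does not describe at this level.

```python
import math

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their scores."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in order)
    exps = [math.exp(router_logits[i] - peak) for i in order]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(order, exps)]

def moe_forward(x, experts, router_logits, k):
    """Run only the selected experts and mix their outputs by router weight."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))

# Four toy "experts" (hypothetical): each just scales its input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]   # hypothetical router scores for one token

routes = top_k_route(logits, k=2)
y = moe_forward(10.0, experts, logits, k=2)

# Share of parameters touched per token, using the figures from the post.
# Always-on layers (attention, embeddings) count as active too, so this is
# not the same thing as "fraction of experts selected".
active_fraction = 0.9e9 / 8e9
print(routes, y, active_fraction)
```

Only the selected experts run for a given token, which is why per-token compute tracks the active parameter count rather than the total.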
ZONOS2 is fast, inference efficient, and super expressive.
ZONOS2 excels at voice cloning, making it the most natural-sounding open-source TTS model out there.
It captures far more of what makes a voice distinctive, so clones sound convincing across a wide range of speakers. Voice cloning is zero-shot, needing no fine-tuning.
ZONOS2 predicts Descript Audio Codec (DAC) tokens for studio-quality 44.1 kHz audio.
DAC tokens maximize quality but are harder to model than low-fi autoencoders. We close that gap with model + data scale, so fidelity doesn’t cost stability.
For text, we do not use a phonemizer; instead, ZONOS2 reads raw UTF-8 bytes. This gives us:
→ broader coverage, especially lower-resource languages
→ big gains on Chinese, Korean, Japanese
→ native code-switching mid-sentence
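A minimal sketch of byte-level text input, assuming the model simply consumes the UTF-8 byte values of the string. The function names are illustrative, and any special tokens or preprocessing in the real ZONOS2 pipeline are not shown in the post.

```python
def text_to_byte_ids(text):
    """Encode text as a list of UTF-8 byte values, each in the range 0..255."""
    return list(text.encode("utf-8"))

def byte_ids_to_text(ids):
    """Invert text_to_byte_ids."""
    return bytes(ids).decode("utf-8")

# One mixed-language sentence needs no per-language frontend:
# every script maps into the same 256-symbol vocabulary.
mixed = "Hello 你好 안녕"
ids = text_to_byte_ids(mixed)

print(ids[:5])                 # [72, 101, 108, 108, 111] for "Hello"
print(len(mixed), len(ids))    # characters vs. bytes
```

The cost is longer sequences for non-Latin scripts: each Chinese or Korean character in this example takes three bytes.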
Training data scaled from ~200K hours to 6M+ hours (~707 years of audio).
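A quick check of the hours-to-years conversion, assuming continuous playback and a 365.25-day year. The post gives only "6M+" hours, so the exact hour count behind the ~707-year figure is an inference here, not a stated number.

```python
HOURS_PER_YEAR = 24 * 365.25  # 8766 hours, assuming a 365.25-day year

def hours_to_years(hours):
    return hours / HOURS_PER_YEAR

def years_to_hours(years):
    return years * HOURS_PER_YEAR

print(round(hours_to_years(6_000_000), 1))   # 684.5
print(round(years_to_hours(707)))            # 6197562
print(6_000_000 / 200_000)                   # 30.0
```

Under these assumptions, exactly 6M hours is about 684 years, and ~707 years corresponds to roughly 6.2M hours, which is consistent with "6M+". The jump from ~200K hours is about a 30x increase.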
Staged data filtering ramps transcript-agreement strictness across pretraining → midtraining → annealing. This leads to fewer hallucinations, mispronunciations, and repetitions.
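The staged filtering idea can be sketched as a per-stage threshold on a transcript-agreement score. The score definition, the `Clip` type, and the threshold values below are hypothetical; the post does not give the actual metric or cutoffs.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    clip_id: str
    agreement: float  # hypothetical 0..1 score: how well the transcript matches the audio

# Hypothetical cutoffs: later stages keep only clips with cleaner transcripts.
STAGE_MIN_AGREEMENT = {
    "pretraining": 0.70,
    "midtraining": 0.85,
    "annealing": 0.95,
}

def filter_for_stage(clips, stage):
    """Keep clips whose agreement score meets the stage's minimum."""
    threshold = STAGE_MIN_AGREEMENT[stage]
    return [c for c in clips if c.agreement >= threshold]

clips = [Clip("a", 0.60), Clip("b", 0.75), Clip("c", 0.90), Clip("d", 0.98)]
kept = {stage: [c.clip_id for c in filter_for_stage(clips, stage)]
        for stage in STAGE_MIN_AGREEMENT}
print(kept)
```

Early stages see more (noisier) data for coverage, while the final stage trains only on clips whose transcripts are most trustworthy.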
We’re also releasing ZTTS1-Eval, a new TTS benchmark.
Existing evals lean on outdated ASR and read speech. ZTTS1-Eval spans clean + in-the-wild sets across up to 17 languages, modern judges (Qwen3-ASR, ReDimNet, MSR-UTMOS), and prosody metrics.
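ASR-judged intelligibility is commonly reported as word error rate (WER) between the input text and the ASR transcript of the synthesized audio. The sketch below shows that generic metric only; it is an assumption that ZTTS1-Eval reports something similar, and the text normalization a real eval would apply is omitted.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

target_text = "the cat sat on the mat"
asr_transcript = "the cat sat on a mat"   # hypothetical ASR output for the TTS audio
wer = word_error_rate(target_text, asr_transcript)
print(wer)
```

Because the ASR model's own errors are counted as TTS errors, a stronger ASR judge gives a cleaner signal, which is the motivation the post gives for moving to modern judges.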
ZONOS2 is open-weights under Apache 2.0, and free on Zyphra Cloud for a limited time.
Try it on Zyphra Cloud: http://cloud.zyphra.com
Blog: http://zyphra.com/our-work/zonos2
Weights: http://huggingface.co/Zyphra/ZONOS2
Inference code: http://github.com/Zyphra/ZONOS2
Eval code: http://github.com/Zyphra/ZTTS1-Eval…
@ZyphraAI is an open superintelligence research and product company based in San Francisco, CA on a mission to build human-aligned AI that helps individuals and organizations reach their fullest potential.
Apply to join us!