@ZyphraAI: Today we're releasing ZONOS2, our next-generation real-time TTS model with high-fidelity voice cloning. ZONOS2 is the m…

Summary

Zyphra releases ZONOS2, a real-time TTS model with high-fidelity voice cloning, open-sourced under Apache 2.0 and available on Zyphra Cloud on AMD.

Today we're releasing ZONOS2, our next-generation real-time TTS model with high-fidelity voice cloning. ZONOS2 is the most expressive open-source TTS model, released under Apache 2.0 and available on Zyphra Cloud on @AMD. 🧵 https://t.co/WvI7PXS80M

Cached at: 06/15/26, 12:52 AM


Real-time TTS has always forced a tradeoff between quality and speed.

We achieve both with ZONOS2, the first sparse MoE TTS model released open-source, with 8B total params, 900M active.
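As a rough illustration of why a sparse MoE can have 8B total parameters but only about 900M active per token, here is a minimal top-k routing sketch. It is a toy under stated assumptions, not ZONOS2's architecture: the expert count, the value of k, the toy experts, and the names `route_top_k` and `moe_layer` are all hypothetical.

```python
import math
import random

# Ratio taken from the figures in the post: 900M active of 8B total parameters.
ACTIVE_FRACTION = 900e6 / 8e9

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_layer(x, experts, router_logits, k=2):
    """Mix the outputs of the k selected experts; the others are never run."""
    out = [0.0] * len(x)
    for idx, weight in route_top_k(router_logits, k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Hypothetical setup: 8 toy experts, each scaling its input by a constant.
NUM_EXPERTS = 8
experts = [lambda x, s=s: [s * xi for xi in x]
           for s in range(1, NUM_EXPERTS + 1)]

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router scores
selected = route_top_k(logits, k=2)
output = moe_layer([1.0, 2.0], experts, logits, k=2)
```

Only the selected experts execute for a given token, which is the general reason per-token compute tracks the active parameter count rather than the total.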

ZONOS2 is fast, inference-efficient, and super expressive.

ZONOS2 excels at voice cloning, making it the most natural-sounding open-source TTS model out there.

It captures far more of what makes a voice distinctive, so clones sound convincing across a wide range of speakers. Voice cloning is zero-shot, needing no fine-tuning.

ZONOS2 predicts Descript Audio Codec (DAC) tokens for studio-quality 44.1 kHz audio.

DAC tokens maximize quality but are harder to model than low-fi autoencoders. We close that gap with model + data scale, so fidelity doesn’t cost stability.
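To give a sense of why DAC tokens are demanding to model, the arithmetic below estimates how many codec tokens one second of 44.1 kHz audio produces. The hop length, codebook count, and codebook size are assumptions taken from the commonly published 44.1 kHz DAC configuration; the post does not state which settings ZONOS2 uses.

```python
import math

SAMPLE_RATE_HZ = 44_100   # from the post
HOP_LENGTH = 512          # assumption: samples per codec frame
NUM_CODEBOOKS = 9         # assumption: residual codebooks per frame
CODEBOOK_SIZE = 1024      # assumption: entries per codebook

def frames_per_second(sample_rate_hz, hop_length):
    """Codec frames emitted for each second of audio."""
    return sample_rate_hz / hop_length

def tokens_per_second(sample_rate_hz, hop_length, num_codebooks):
    """Discrete tokens per second of audio, counting every codebook."""
    return frames_per_second(sample_rate_hz, hop_length) * num_codebooks

def bitrate_bps(sample_rate_hz, hop_length, num_codebooks, codebook_size):
    """Bits per second carried by the token stream."""
    bits_per_token = math.log2(codebook_size)
    return tokens_per_second(sample_rate_hz, hop_length, num_codebooks) * bits_per_token

fps = frames_per_second(SAMPLE_RATE_HZ, HOP_LENGTH)
tps = tokens_per_second(SAMPLE_RATE_HZ, HOP_LENGTH, NUM_CODEBOOKS)
bps = bitrate_bps(SAMPLE_RATE_HZ, HOP_LENGTH, NUM_CODEBOOKS, CODEBOOK_SIZE)
```

Under these assumed settings the model has to predict several hundred tokens for every second of speech, which is one way to read the "harder to model" remark.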

For the text, we do not use a phonemizer. Instead, ZONOS2 reads raw UTF-8 bytes. This gives us:

→ broader coverage, especially lower-resource languages
→ big gains on Chinese, Korean, Japanese
→ native code-switching mid-sentence
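The byte-level input described above can be sketched in a few lines. This only shows how any text, including a code-switched sentence, maps to IDs in the range 0 to 255 without a language-specific front end; the function names are hypothetical and this is not ZONOS2's actual preprocessing code.

```python
def text_to_byte_ids(text: str) -> list[int]:
    """Encode text as UTF-8 and use each byte value (0-255) as a token ID."""
    return list(text.encode("utf-8"))

def byte_ids_to_text(ids: list[int]) -> str:
    """Invert text_to_byte_ids by decoding the byte sequence."""
    return bytes(ids).decode("utf-8")

# A code-switched sample: English, Chinese, and Korean in one string.
sample = "Hello, 世界! 안녕"
ids = text_to_byte_ids(sample)
```

The vocabulary is fixed at 256 entries regardless of language, and a CJK character simply becomes three bytes, so no pronunciation dictionary or per-language phoneme set is involved.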

Training data scaled from ~200K hours to 6M+ hours (~707 years of audio).
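A quick check of the hours-to-years conversion, assuming 365.25-day years. Since "6M+" is a lower bound, 6M hours gives a slightly smaller figure than 707 years; the quoted ~707 years corresponds to roughly 6.2M hours.

```python
HOURS_PER_YEAR = 24 * 365.25  # assumption: average year length including leap days

def hours_to_years(hours):
    """Convert continuous audio hours to years."""
    return hours / HOURS_PER_YEAR

def years_to_hours(years):
    """Convert years of continuous audio back to hours."""
    return years * HOURS_PER_YEAR

previous_years = hours_to_years(200_000)       # the earlier ~200K-hour dataset
lower_bound_years = hours_to_years(6_000_000)  # exactly 6M hours
implied_hours = years_to_hours(707)            # hours implied by ~707 years
```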

Staged data filtering ramps transcript-agreement strictness across pretraining → midtraining → annealing. This leads to fewer hallucinations, mispronunciations, and repetitions.
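One way to picture staged filtering is a word-error-rate cutoff between each clip's transcript and an ASR pass over the audio, with the cutoff tightening at every stage. This is a sketch under assumptions: the thresholds, the sample records, and the use of WER as the agreement measure are illustrative, not Zyphra's published pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# Hypothetical thresholds: the allowed disagreement shrinks stage by stage.
STAGE_MAX_WER = {"pretraining": 0.30, "midtraining": 0.10, "annealing": 0.02}

def filter_for_stage(samples, stage):
    """Keep samples whose transcript agrees with the ASR text closely enough."""
    limit = STAGE_MAX_WER[stage]
    return [s for s in samples
            if word_error_rate(s["transcript"], s["asr_text"]) <= limit]

# Made-up records for illustration only.
samples = [
    {"id": "a", "transcript": "the quick brown fox",
     "asr_text": "the quick brown fox"},
    {"id": "b", "transcript": "one two three four five six seven eight nine ten",
     "asr_text": "one two three four five six seven eight nine then"},
    {"id": "c", "transcript": "the quick brown fox",
     "asr_text": "a quick brown fox"},
    {"id": "d", "transcript": "the quick brown fox",
     "asr_text": "completely different words here"},
]

kept = {stage: [s["id"] for s in filter_for_stage(samples, stage)]
        for stage in STAGE_MAX_WER}
```

Early stages keep noisier pairs for coverage, while the final stage trains only on clips whose text and audio agree almost exactly.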

We’re also releasing ZTTS1-Eval, a new TTS benchmark.

Existing evals lean on outdated ASR and read speech. ZTTS1-Eval spans clean + in-the-wild sets across up to 17 languages, modern judges (Qwen3-ASR, ReDimNet, MSR-UTMOS), and prosody metrics.
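For voice cloning, a common way to use a speaker-embedding judge is to compare the embedding of the reference clip with the embedding of the generated clip by cosine similarity. The sketch below shows only that comparison step; the embeddings are made-up vectors, no real ReDimNet API is called, and whether ZTTS1-Eval scores similarity exactly this way is an assumption.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)

# Hypothetical embeddings standing in for a speaker encoder's output.
reference_embedding = [0.2, 0.7, -0.1, 0.4]
cloned_embedding = [0.25, 0.65, -0.05, 0.45]
similarity = cosine_similarity(reference_embedding, cloned_embedding)
```

A score close to 1 means the two clips sit near each other in the encoder's speaker space; scores are only comparable when produced by the same encoder.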

ZONOS2 is open-weights under Apache 2.0, and free on Zyphra Cloud for a limited time.

Try it on Zyphra Cloud: http://cloud.zyphra.com
Blog: http://zyphra.com/our-work/zonos2
Weights: http://huggingface.co/Zyphra/ZONOS2
Inference code: http://github.com/Zyphra/ZONOS2
Eval code: http://github.com/Zyphra/ZTTS1-Eval…

@ZyphraAI is an open superintelligence research and product company based in San Francisco, CA on a mission to build human-aligned AI that helps individuals and organizations reach their fullest potential.

Apply to join us!
