@Gorden_Sun: ZONOS2: Open-source MoE TTS model with 8B total parameters and 0.9B activated parameters. Supports multilingual synthesis and voice cloning, with good results in Chinese. Model: https://t.co/ORL0UATU92
Summary
Zyphra released ZONOS2, an open-source MoE text-to-speech model trained on over 6 million hours of multilingual speech, supporting voice cloning and high-quality synthesis across many languages.
Cached Full Text
Cached at: 06/14/26, 12:16 AM
ZONOS2: Open-source MoE TTS model with 8B total parameters and 0.9B activated parameters. Supports multilingual synthesis and voice cloning, with good results in Chinese. Model: https://t.co/ORL0UATU92
---

# Zyphra/ZONOS2 · Hugging Face

Source: https://huggingface.co/Zyphra/ZONOS2

[Discord](https://discord.gg/gTW9JwST8q)

---

ZONOS2 is our latest text-to-speech model trained on more than 6 million hours of varied multilingual speech, delivering expressiveness and quality on par with, or even surpassing, top TTS providers at low latency with MoE. ZONOS2 excels at high-fidelity and naturalistic voice cloning.

During inference we use NeMo TN-normalized UTF-8 bytes and an ECAPA-TDNN embedding to generate DAC tokens with our MoE backbone.

[Figure: inference overview]

Language support is as follows.

| Tier | Languages |
|---|---|
| Tier 1 | English, Mandarin Chinese, Japanese |
| Tier 2 | Korean, Russian, Italian, Portuguese, French, Spanish, Vietnamese, German, Hebrew, Dutch |
| Tier 3 | Swedish, Hindi, Tamil, Telugu, Thai, Norwegian, Bengali, Tagalog, Arabic, Danish, Indonesian, Polish, Ukrainian, Romanian, Finnish, Hungarian, Lithuanian, Estonian, Slovak, Croatian, Latvian |

For local inference we provide a high-performance TTS inference server built on [Mini-SGLang](https://github.com/sgl-project/mini-sglang).

For more details and speech samples, check out our [blog](https://www.zyphra.com/our-work/zonos2). We also have a hosted version available at [cloud.zyphra.com/audio-playground](https://cloud.zyphra.com/audio-playground).

---

## Quick Start

> **Platform Support:** Linux only (x86_64). Requires NVIDIA GPU with CUDA toolkit matching your driver version (`nvidia-smi` to check).

### 1. Installation

Requires [uv](https://docs.astral.sh/uv/getting-started/installation/).

```bash
git clone https://github.com/Zyphra/ZONOS2.git
cd ZONOS2
uv sync
```

### 2. Launch the TTS Server

```bash
uv run python -m minisgl --model-path Zyphra/ZONOS2 --tts-default-voices-dir ./default_voices/
```

`uv run` always uses the project environment, so no venv activation is needed. The server starts on `http://localhost:1919` by default. TTS mode is auto-detected for zonos2 models.

`--tts-default-voices-dir` pre-populates the web UI with voice-clone speakers from disk; the folder is scanned recursively for speaker audio (`.wav`, `.mp3`, `.flac`, `.m4a`, `.ogg`, `.opus`, `.aac`, `.webm`) and saved embeddings (`.npy`, `.npz`). The newest voice is selected automatically on startup.

### 3. Generate Speech

**curl:**

```bash
curl -X POST http://localhost:1919/tts/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "stream": true}' \
  --output output.pcm

# Convert to WAV
ffmpeg -f f32le -ar 44100 -ac 1 -i output.pcm output.wav
```

**Web UI:** Open `http://localhost:1919/` in your browser.

## Python API (offline inference)

You can also run the engine directly in a Python script, without starting a server, via `TTSLLM`:

```python
from minisgl.message import TTSSamplingParams
from minisgl.tts import TTSLLM

tts = TTSLLM(model_path="Zyphra/ZONOS2")
results = tts.generate(
    ["Hello from the offline Python API.", "Batched prompts work too."],
    TTSSamplingParams(seed=42),
)
for i, result in enumerate(results):
    print(f"frames={len(result['audio_tokens'])}, eos_frame={result['eos_frame']}")
    tts.save_audio(result["audio"], f"output_{i}.wav")
```

## Citation

If you find this model useful in an academic context please cite as:

```bibtex
@misc{zyphra2025zonos,
  title  = {Zonos V2 Technical Report},
  author = {Gabriel Clark, Sofian Mejjoute, Mohamed Osman, George Close, Beren Millidge},
  year   = {2026},
}
```
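For readers who would rather call the server from Python than from curl and ffmpeg, the Generate Speech step can be sketched with the standard library alone. This is a minimal sketch, not part of the ZONOS2 project: `build_request` and `f32le_to_wav` are hypothetical helper names, the endpoint and JSON body are copied from the curl command, and the audio layout (mono, 44100 Hz, little-endian float32) is inferred from the ffmpeg flags rather than from any documented response format. Because Python's `wave` module writes integer PCM, the sketch converts samples to 16-bit instead of keeping float32.

```python
import array
import json
import struct
import urllib.request
import wave

SAMPLE_RATE = 44100  # assumption: taken from the ffmpeg flag "-ar 44100"


def build_request(text, url="http://localhost:1919/tts/generate"):
    """Build the same POST request as the curl example (hypothetical helper)."""
    payload = json.dumps({"text": text, "stream": True}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def f32le_to_wav(pcm, wav_path, sample_rate=SAMPLE_RATE):
    """Write mono little-endian float32 PCM bytes as a 16-bit WAV file.

    Returns the number of samples written. Trailing bytes that do not
    form a complete 4-byte sample are ignored.
    """
    count = len(pcm) // 4
    floats = struct.unpack(f"<{count}f", pcm[: count * 4])
    # Clamp to [-1.0, 1.0] and scale to the signed 16-bit range.
    ints = array.array(
        "h", (int(max(-1.0, min(1.0, s)) * 32767) for s in floats)
    )
    with wave.open(wav_path, "wb") as out:
        out.setnchannels(1)   # assumption: mono, from "-ac 1"
        out.setsampwidth(2)   # 16-bit samples
        out.setframerate(sample_rate)
        out.writeframes(ints.tobytes())
    return len(ints)
```

With the server from step 2 running, something like `f32le_to_wav(urllib.request.urlopen(build_request("Hello world")).read(), "output.wav")` should produce a playable file, assuming the response body is the same raw PCM stream that curl saves to `output.pcm`.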
Similar Articles
Zyphra/ZONOS2
ZONOS2 is a new text-to-speech model from Zyphra trained on over 6 million hours of multilingual speech, offering high-quality voice cloning and low latency using a mixture-of-experts architecture. It supports 30+ languages and includes a high-performance inference server.
@ZyphraAI: Today we're releasing ZONOS2, our next-generation real-time TTS model with high-fidelity voice cloning. ZONOS2 is the m…
Zyphra releases ZONOS2, an open-source real-time TTS model with high-fidelity voice cloning, under Apache 2.0, available on Zyphra Cloud on AMD.
@Gorden_Sun: NetEase Youdao open-sources Confucius4-TTS, a 1.3B TTS model, supports multilingual, supports voice cloning, good results, very fast. Github: https://github.com/netease-youdao/Confucius4-TTS… Online demo: …
NetEase Youdao open-sourced the 1.3B parameter Confucius4-TTS model, supporting zero-shot voice cloning and cross-lingual speech synthesis in 14 languages, fast and with excellent results.
@Honcia13: Open-source TTS is going crazy! New weapons for industrial park scams? Tsinghua OpenBMB just released VoxCPM2: 2 billion parameters, trained on 2 million hours of multilingual data, with 48kHz studio-quality sound! The most intense part is that it needs no tokenizer at all: it performs diffusion autoregression directly in continuous latent space, maximizing detail retention!
Tsinghua University's OpenBMB has released VoxCPM2, an open-source multilingual TTS model with 2 billion parameters. It performs diffusion autoregressive generation in continuous latent space without a tokenizer, offering 48kHz studio-quality audio and powerful voice cloning and voice design capabilities.
@Chenzeze777: Found an open-source voice synthesis model that I just had to share. 2 billion parameters, trained on 2 million hours of data, supports 30 languages plus 9 Chinese dialects—just input text and it synthesizes speech, including Sichuanese, Cantonese, and Northeastern dialects. The craziest part? Use natural language to describe a voice—like "young female, gentle and sweet"—and it creates a brand-new voice from scratch without needing any reference audio.
Introducing an open-source voice synthesis model with 2 billion parameters and 2 million hours of training. It supports 30 languages and 9 Chinese dialects, allows voice description via natural language, can clone voices from a 3-second recording, delivers 48kHz studio-quality audio, and is free for commercial use under the Apache-2.0 license.