Zyphra/ZONOS2

Hugging Face Models Trending 06/11/26, 02:20 AM Models

text-to-speech voice-cloning multilingual mixture-of-experts tts-model open-source

Summary

ZONOS2 is a new text-to-speech model from Zyphra trained on over 6 million hours of multilingual speech, offering high-quality voice cloning and low latency using a mixture-of-experts architecture. It supports 30+ languages and includes a high-performance inference server.

Task: text-to-speech
Tags: ZONOS2, text-to-speech, license:apache-2.0, region:us

Original Article

Cached at: 06/15/26, 09:07 AM

Zyphra/ZONOS2 · Hugging Face

Source: https://huggingface.co/Zyphra/ZONOS2

ZONOS2 is our latest text-to-speech model, trained on more than 6 million hours of varied multilingual speech. It delivers expressiveness and quality on par with, or even surpassing, top TTS providers, at low latency thanks to its mixture-of-experts (MoE) architecture. ZONOS2 excels at high-fidelity, naturalistic voice cloning.

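During inference, the input text is normalized with NeMo text normalization (TN) and encoded as UTF-8 bytes. These bytes, together with an ECAPA-TDNN speaker embedding, are fed to our MoE backbone, which generates DAC audio tokens. An inference overview can be seen below.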

[Figure: ZONOS2 inference overview]
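As a rough illustration of what "UTF-8 bytes" means as a model input, the sketch below turns a string into a sequence of byte values. It is not the ZONOS2 tokenizer: the function name `text_to_byte_ids` is hypothetical, and the NeMo normalization step, the speaker embedding, and any special tokens are left out.

```python
def text_to_byte_ids(text: str) -> list[int]:
    """Encode text as a sequence of UTF-8 byte values (each in 0..255)."""
    return list(text.encode("utf-8"))


# ASCII characters map to one byte each.
ascii_ids = text_to_byte_ids("Hi")

# Non-ASCII characters expand to several bytes each, so the byte
# sequence can be longer than the number of visible characters.
cjk_ids = text_to_byte_ids("你好")

print(ascii_ids, len(cjk_ids))
```

A byte-level input needs no language-specific vocabulary, which is one plausible reason for pairing it with a multilingual model, though the source does not state the motivation.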

Language support is as follows.

| Tier | Languages |
|------|-----------|
| Tier 1 | English, Mandarin Chinese, Japanese |
| Tier 2 | Korean, Russian, Italian, Portuguese, French, Spanish, Vietnamese, German, Hebrew, Dutch |
| Tier 3 | Swedish, Hindi, Tamil, Telugu, Thai, Norwegian, Bengali, Tagalog, Arabic, Danish, Indonesian, Polish, Ukrainian, Romanian, Finnish, Hungarian, Lithuanian, Estonian, Slovak, Croatian, Latvian |

For local inference we provide a high-performance TTS inference server built on Mini-SGLang.

For more details and speech samples, check out our blog.

We also have a hosted version available at cloud.zyphra.com/audio-playground.

Quick Start

Platform Support: Linux only (x86_64). Requires an NVIDIA GPU with a CUDA toolkit matching your driver version (run nvidia-smi to check).
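The requirements above can be checked before installing with a small preflight script. This is only a sketch using the Python standard library: `check_platform` and `detect` are hypothetical helper names, and the script merely confirms that `nvidia-smi` is on the PATH. It does not verify that the CUDA toolkit matches the driver version.

```python
import platform
import shutil


def check_platform(system: str, machine: str, has_nvidia_smi: bool) -> list[str]:
    """Return human-readable problems; an empty list means the basics look OK."""
    problems = []
    if system != "Linux":
        problems.append(f"unsupported OS: {system} (Linux is required)")
    if machine != "x86_64":
        problems.append(f"unsupported architecture: {machine} (x86_64 is required)")
    if not has_nvidia_smi:
        problems.append("nvidia-smi not found on PATH (is the NVIDIA driver installed?)")
    return problems


def detect() -> list[str]:
    """Run the checks against the current machine."""
    return check_platform(
        platform.system(),
        platform.machine(),
        shutil.which("nvidia-smi") is not None,
    )


for problem in detect():
    print("WARNING:", problem)
```

Keeping the checks in a pure function (`check_platform`) separate from the detection step makes the logic easy to test on machines that do not meet the requirements.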

1. Installation

Requires uv.

git clone https://github.com/Zyphra/ZONOS2.git
cd ZONOS2
uv sync

2. Launch the TTS Server

uv run python -m minisgl --model-path Zyphra/ZONOS2 --tts-default-voices-dir ./default_voices/

uv run always uses the project environment, so no venv activation is needed.

The server starts on http://localhost:1919 by default. TTS mode is auto-detected for zonos2 models. --tts-default-voices-dir <folder> pre-populates the web UI with voice-clone speakers from disk; the folder is scanned recursively for speaker audio (.wav, .mp3, .flac, .m4a, .ogg, .opus, .aac, .webm) and saved embeddings (.npy, .npz). The newest voice is selected automatically on startup.
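To preview which files in a folder would qualify as voices, the recursive scan described above can be approximated as follows. This is a sketch, not the server's code: `find_voice_files` is a hypothetical helper, and two details are assumptions because the source does not specify them: "newest" is taken to mean the most recent modification time, and extensions are matched case-insensitively.

```python
from pathlib import Path

# Extensions listed in the server documentation above.
AUDIO_EXTS = {".wav", ".mp3", ".flac", ".m4a", ".ogg", ".opus", ".aac", ".webm"}
EMBEDDING_EXTS = {".npy", ".npz"}
VOICE_EXTS = AUDIO_EXTS | EMBEDDING_EXTS


def find_voice_files(root: str) -> list[Path]:
    """Recursively collect candidate voice files, newest first.

    Assumption: "newest" means the latest modification time (mtime).
    """
    matches = [
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in VOICE_EXTS
    ]
    return sorted(matches, key=lambda p: p.stat().st_mtime, reverse=True)
```

Under these assumptions, `find_voice_files("./default_voices/")[0]` would be the file the server is most likely to select on startup, provided the folder contains at least one matching file.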

3. Generate Speech

curl:

curl -X POST http://localhost:1919/tts/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "stream": true}' \
  --output output.pcm

# Convert to WAV
ffmpeg -f f32le -ar 44100 -ac 1 -i output.pcm output.wav
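The same request and conversion can be sketched in Python with only the standard library. The endpoint, JSON fields, sample rate, and sample format are taken from the curl and ffmpeg commands above; the helper names (`build_request`, `pcm_f32le_to_wav`, `synthesize`) are hypothetical. The sketch assumes the response body is nothing but raw mono float32 little-endian PCM, as the `--output output.pcm` usage suggests, and it writes 16-bit WAV samples.

```python
import array
import json
import sys
import urllib.request
import wave

TTS_URL = "http://localhost:1919/tts/generate"  # endpoint from the curl example
SAMPLE_RATE = 44100  # matches `-ar 44100` in the ffmpeg command


def build_request(text: str, url: str = TTS_URL) -> urllib.request.Request:
    """Build the same JSON POST that the curl command sends."""
    body = json.dumps({"text": text, "stream": True}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def pcm_f32le_to_wav(pcm: bytes, wav_path: str, sample_rate: int = SAMPLE_RATE) -> int:
    """Write mono float32 little-endian PCM as 16-bit WAV; return the sample count."""
    floats = array.array("f")
    floats.frombytes(pcm[: len(pcm) - len(pcm) % 4])  # drop a trailing partial sample
    if sys.byteorder == "big":
        floats.byteswap()  # input is little-endian
    # Clip to [-1, 1], then scale to signed 16-bit integers.
    ints = array.array("h", (int(max(-1.0, min(1.0, x)) * 32767) for x in floats))
    if sys.byteorder == "big":
        ints.byteswap()  # WAV sample data is little-endian
    with wave.open(wav_path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(ints.tobytes())
    return len(ints)


def synthesize(text: str, wav_path: str) -> int:
    """POST the text to a running server and save the full response as WAV."""
    with urllib.request.urlopen(build_request(text)) as resp:
        return pcm_f32le_to_wav(resp.read(), wav_path)
```

With the server from step 2 running, `synthesize("Hello world", "output.wav")` would replace both the curl and ffmpeg steps. Note that `resp.read()` waits for the whole response, so this sketch does not take advantage of streaming for low-latency playback.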

Web UI: Open http://localhost:1919/ in your browser.

Python API (offline inference)

You can also run the engine directly in a Python script, without starting a server, via TTSLLM:

from minisgl.message import TTSSamplingParams
from minisgl.tts import TTSLLM

tts = TTSLLM(model_path="Zyphra/ZONOS2")

results = tts.generate(
    ["Hello from the offline Python API.", "Batched prompts work too."],
    TTSSamplingParams(seed=42),
)

for i, result in enumerate(results):
    print(f"frames={len(result['audio_tokens'])}, eos_frame={result['eos_frame']}")
    tts.save_audio(result["audio"], f"output_{i}.wav")

Citation

If you find this model useful in an academic context, please cite it as:

@misc{zyphra2025zonos,
  title     = {Zonos V2 Technical Report},
  author    = {Gabriel Clark and Sofian Mejjoute and Mohamed Osman and George Close and Beren Millidge},
  year      = {2026},
}

Similar Articles

@ZyphraAI: Today we're releasing ZONOS2, our next-generation real-time TTS model with high-fidelity voice cloning. ZONOS2 is the m…

@Gorden_Sun: ZONOS2: Open-source MoE TTS model. 8B total parameters, 0.9B activated parameters. Supports multilingual, voice cloning, Chinese, and Chinese results are good. Model:

k2-fsa/OmniVoice

OpenBMB/VoxCPM

@tom_doerr: Zero-shot voice cloning for 30 languages https://github.com/sunnyxrxrx/X-Voice…
