@tom_doerr: Zero-shot voice cloning for 30 languages https://github.com/sunnyxrxrx/X-Voice…

Summary

X-Voice is a flow-matching-based multilingual text-to-speech system that enables zero-shot voice cloning across 30 languages, with open-source code, model, and demo available.

Zero-shot voice cloning for 30 languages https://t.co/H3gRfI6z3U https://t.co/6XoJ4zTjTy

Cached at: 05/26/26, 07:04 AM

sunnyxrxrx/X-Voice

Source: https://github.com/sunnyxrxrx/X-Voice

X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning


X-Voice is a flow-matching-based multilingual text-to-speech system that enables one speaker to speak 30 languages.

Installation

Create a separate environment if needed

# Create a conda env with python_version>=3.10
conda create -n x-voice python=3.11
conda activate x-voice

# Install FFmpeg if you haven't yet
conda install ffmpeg
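
The two prerequisites above (Python 3.10 or newer, and FFmpeg on the PATH) can be verified before installing anything else. The following is a minimal sketch using only the standard library; the function name check_environment is hypothetical and not part of X-Voice.

```python
import shutil
import sys

# Minimum version taken from the README comment "python_version>=3.10".
MIN_PYTHON = (3, 10)


def check_environment(min_python=MIN_PYTHON):
    """Return a list of human-readable problems; an empty list means both checks passed."""
    problems = []
    # Compare only (major, minor) against the required minimum.
    if sys.version_info[:2] < tuple(min_python):
        problems.append(
            "Python %d.%d or newer is required, found %d.%d"
            % (min_python[0], min_python[1], sys.version_info[0], sys.version_info[1])
        )
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH (try: conda install ffmpeg)")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

Running the file inside the activated x-voice environment prints one warning per failed check and nothing if both pass.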

Install the PyTorch build that matches your device

NVIDIA GPU

# Install pytorch with your CUDA version, e.g.
pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128

AMD GPU

# Install pytorch with your ROCm version (Linux only), e.g.
pip install torch==2.5.1+rocm6.2 torchaudio==2.5.1+rocm6.2 --extra-index-url https://download.pytorch.org/whl/rocm6.2

Intel GPU

# Install pytorch with your XPU version, e.g.
pip install torch torchaudio --index-url https://download.pytorch.org/whl/test/xpu

Apple Silicon

# Install the stable pytorch, e.g.
pip install torch torchaudio
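
After installing, it can help to confirm that PyTorch actually sees the intended accelerator. The sketch below is an illustration only: pick_device is a hypothetical helper, not the way X-Voice itself selects a device. It relies on the fact that ROCm builds of PyTorch report through torch.cuda, and it guards the XPU and MPS checks because older PyTorch builds may not expose them.

```python
def pick_device():
    """Return the name of the first available PyTorch backend, falling back to "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed yet, so nothing can be probed.
        return "cpu"

    # Covers NVIDIA CUDA builds and AMD ROCm builds (ROCm is exposed via torch.cuda).
    if torch.cuda.is_available():
        return "cuda"

    # Intel GPU support; torch.xpu is absent from older builds, hence getattr.
    xpu = getattr(torch, "xpu", None)
    if xpu is not None and xpu.is_available():
        return "xpu"

    # Apple Silicon (Metal Performance Shaders backend).
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    return "cpu"


if __name__ == "__main__":
    print("PyTorch device:", pick_device())
```

If this prints cpu on a machine with a GPU, the installed wheel probably does not match the driver or toolkit version.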

Install X-Voice

git clone https://github.com/sunnyxrxrx/X-Voice.git
cd X-Voice
pip install -e .

Check your eSpeak NG installation:

espeak-ng --version

If espeak-ng is not found, run bash src/x_voice/prepare_ipa.sh first.
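
The same check can be scripted, for example in a setup script that decides whether prepare_ipa.sh needs to run. This is a minimal sketch under the assumption that espeak-ng prints its version to standard output; espeak_ng_version is a hypothetical helper name.

```python
import shutil
import subprocess


def espeak_ng_version(executable="espeak-ng"):
    """Return the version text printed by espeak-ng, or None if it cannot be found."""
    path = shutil.which(executable)
    if path is None:
        # Not on PATH: the README suggests running src/x_voice/prepare_ipa.sh.
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # An empty output is treated the same as a missing installation.
    return result.stdout.strip() or None


if __name__ == "__main__":
    version = espeak_ng_version()
    if version is None:
        print("espeak-ng not found; run: bash src/x_voice/prepare_ipa.sh")
    else:
        print(version)
```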

Inference

1. Gradio App

x-voice_infer-gradio --host 0.0.0.0 --port 7860

2. CLI Inference

# X-Voice Stage1
python -m x_voice.infer.infer_cli_stage1 -c src/x_voice/infer/examples/basic/basic_stage1.toml

# X-Voice Stage2
python -m x_voice.infer.infer_cli_stage2 -c src/x_voice/infer/examples/basic/basic_stage2.toml
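
To launch the CLI from another Python script, the two commands above can be assembled programmatically. The sketch below only builds the argument list; the module names and the -c flag come from the commands above, while build_infer_command and STAGE_MODULES are hypothetical names introduced for illustration.

```python
import sys

# Module names copied from the two CLI commands in this README.
STAGE_MODULES = {
    1: "x_voice.infer.infer_cli_stage1",
    2: "x_voice.infer.infer_cli_stage2",
}


def build_infer_command(stage, config):
    """Build the argument list for one X-Voice CLI inference stage."""
    if stage not in STAGE_MODULES:
        raise ValueError("stage must be 1 or 2, got %r" % (stage,))
    # sys.executable keeps the call inside the currently active environment.
    return [sys.executable, "-m", STAGE_MODULES[stage], "-c", str(config)]


if __name__ == "__main__":
    cmd = build_infer_command(1, "src/x_voice/infer/examples/basic/basic_stage1.toml")
    print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the repository root, so that the relative config path resolves.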

Training

TTS Model Training

Refer to the training guidance for best practices.

Speaking Rate Predictor Training

Refer to the speaking rate predictor guidance for the multilingual speaking rate predictor used in X-Voice.

Evaluation

Refer to the evaluation guidance for benchmark and metric scripts.

Repo Structure

X-Voice/
├── ckpts/                  # checkpoints
├── data/                   # datasets and processed data
├── src/
│   ├── rate_pred/          # speaking rate predictor
│   ├── third_party/
│   │   └── BigVGAN/        # BigVGAN submodule
│   └── x_voice/            # main X-Voice package
└── pyproject.toml          # package definition and dependencies

Development

Use pre-commit to ensure code quality:

pip install pre-commit
pre-commit install
pre-commit run --all-files

Citations

If you find our work useful, please cite as:

@article{xu2026xvoiceenablingspeak30,
      title={X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning}, 
      author={Rixi Xu and Qingyu Liu and Haitao Li and Yushen Chen and Zhikang Niu and Yunting Yang and Jian Zhao and Ke Li and Berrak Sisman and Qinyuan Cheng and Xipeng Qiu and Kai Yu and Xie Chen},
      journal={arXiv preprint arXiv:2605.05611},
      year={2026},
}

License

Our code is released under the MIT License. The pre-trained models are released under the CC-BY-NC license because of the training data, the X-Voice Dataset.
