@tom_doerr: Zero-shot voice cloning with support for 30 languages https://github.com/sunnyxrxrx/X-Voice…

X AI KOLs Timeline 2026/05/26 01:06 Models

zero-shot voice-cloning multilingual text-to-speech open-source flow-matching

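Summary

X-Voice is a flow-matching-based multilingual text-to-speech system that supports zero-shot voice cloning across 30 languages. Its code, models, and demo are open source.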

Zero-shot voice cloning, supporting 30 languages https://t.co/H3gRfI6z3U https://t.co/6XoJ4zTjTy

Cached at: 2026/05/26 07:04

# sunnyxrxrx/X-Voice

Source: https://github.com/sunnyxrxrx/X-Voice

# X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning

X-Voice is a flow-matching-based multilingual text-to-speech system that enables a single speaker to speak 30 languages.

## News

- 2026/04/30: The X-Voice codebase, models, demo, Hugging Face Space, dataset, and benchmark are released.
- 2026/05/08: The X-Voice paper (https://arxiv.org/abs/2605.05611) is publicly available on arXiv.

## Installation

### Create a separate environment if needed

```bash
# Create a conda environment with python_version>=3.10
conda create -n x-voice python=3.11
conda activate x-voice

# Install FFmpeg if it is not installed yet
conda install ffmpeg
```

### Install PyTorch matching your device

NVIDIA GPU

> ```bash
> # Install PyTorch for your CUDA version, e.g.
> pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128
> ```

AMD GPU

> ```bash
> # Install PyTorch for your ROCm version (Linux only), e.g.
> pip install torch==2.5.1+rocm6.2 torchaudio==2.5.1+rocm6.2 --extra-index-url https://download.pytorch.org/whl/rocm6.2
> ```

Intel GPU

> ```bash
> # Install PyTorch for your XPU version, e.g.
> pip install torch torchaudio --index-url https://download.pytorch.org/whl/test/xpu
> ```

Apple Silicon

> ```bash
> # Install stable PyTorch, e.g.
> pip install torch torchaudio
> ```

### Install X-Voice

```bash
git clone https://github.com/sunnyxrxrx/X-Voice.git
cd X-Voice
pip install -e .
```

Check that espeak-ng is installed:

```bash
espeak-ng --version
```

If it is not found, run `bash src/x_voice/prepare_ipa.sh` first.

## Inference

- For best performance, take a moment to read the detailed guide.

### 1. Gradio App

```bash
x-voice_infer-gradio --host 0.0.0.0 --port 7860
```

### 2. CLI Inference

```bash
# X-Voice Stage1
python -m x_voice.infer.infer_cli_stage1 -c src/x_voice/infer/examples/basic/basic_stage1.toml

# X-Voice Stage2
python -m x_voice.infer.infer_cli_stage2 -c src/x_voice/infer/examples/basic/basic_stage2.toml
```

## Training

### TTS Model Training

See the training guide for best practices.

### Speaking Rate Predictor Training

See the speaking rate predictor guide for the multilingual speaking rate predictor used in X-Voice.

## Evaluation

See the evaluation guide for the benchmark and metric scripts.

## Repository Structure

```text
X-Voice/
├── ckpts/                 # Checkpoints
├── data/                  # Datasets and processed data
├── src/
│   ├── rate_pred/         # Speaking rate predictor
│   ├── third_party/
│   │   └── BigVGAN/       # BigVGAN submodule
│   └── x_voice/           # Main X-Voice package
└── pyproject.toml         # Package definition and dependencies
```

## Development

Use pre-commit to ensure code quality:

```bash
pip install pre-commit
pre-commit install
pre-commit run --all-files
```

## Acknowledgements

- F5-TTS (https://arxiv.org/abs/2410.06885), for its excellent work and as the foundation of this codebase
- Cross-Lingual F5-TTS 2, for the supervised fine-tuning strategy using synthesized audio prompts
- Cross-Lingual F5-TTS (https://arxiv.org/abs/2509.14579), for the speaking rate predictor
- NLLB (https://huggingface.co/facebook/nllb-200-distilled-600M), for translation in the Gradio demo
- torchdiffeq (https://github.com/rtqichen/torchdiffeq) as the ODE solver; Vocos (https://huggingface.co/charactr/vocos-mel-24khz) and BigVGAN (https://github.com/NVIDIA/BigVGAN) as vocoders
- FunASR (https://github.com/modelscope/FunASR), faster-whisper (https://github.com/SYSTRAN/faster-whisper), UniSpeech (https://github.com/microsoft/UniSpeech), and SpeechMOS (https://github.com/tarepan/SpeechMOS) as evaluation tools
- MAVL (https://github.com/k1064190/MAVL/tree/main), for Japanese syllable counting

## Citation

If you find our work useful, please cite it as follows:

```
@article{xu2026xvoiceenablingspeak30,
  title={X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning},
  author={Rixi Xu and Qingyu Liu and Haitao Li and Yushen Chen and Zhikang Niu and Yunting Yang and Jian Zhao and Ke Li and Berrak Sisman and Qinyuan Cheng and Xipeng Qiu and Kai Yu and Xie Chen},
  journal={arXiv preprint arXiv:2605.05611},
  year={2026},
}
```

## License

Our code is released under the MIT License. The pre-trained models are licensed under CC-BY-NC because of the training data, the X-Voice dataset.
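The installation steps in the README name three prerequisites: Python 3.10 or newer, FFmpeg, and espeak-ng. A small pre-flight check along the following lines could confirm them before installing. This is a minimal sketch and not part of X-Voice: `check_environment`, its parameters, and the constants are hypothetical names introduced here for illustration. It only tests whether the executables are on `PATH`; it does not verify their versions or the PyTorch build.

```python
import shutil
import sys

# Values taken from the README: "python_version>=3.10", FFmpeg, espeak-ng.
MIN_PYTHON = (3, 10)
REQUIRED_TOOLS = ("ffmpeg", "espeak-ng")


def check_environment(min_python=MIN_PYTHON, tools=REQUIRED_TOOLS,
                      version=None, which=shutil.which):
    """Return a list of problems found; an empty list means all checks passed.

    `version` and `which` are injectable (hypothetical parameters) so the
    function can be exercised without touching the real system.
    """
    current = tuple(version) if version is not None else tuple(sys.version_info[:2])
    problems = []

    # Compare (major, minor) tuples, e.g. (3, 9) < (3, 10).
    if current < tuple(min_python):
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, "
            f"found {current[0]}.{current[1]}"
        )

    # shutil.which returns None when an executable is not on PATH.
    for tool in tools:
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")

    return problems
```

Calling `check_environment()` with no arguments inspects the running interpreter and the current `PATH`. If espeak-ng is reported missing, the README's suggested remedy is to run `bash src/x_voice/prepare_ipa.sh`.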

