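@Gorden_Sun: Youdao has open-sourced Confucius4-TTS, a 1.3B-parameter TTS model with multilingual support and voice cloning. Quality is good and it is very fast. Github:https://github.com/netease-youdao/Confucius4-TTS… Try it online:…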

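Summary

Youdao has open-sourced Confucius4-TTS, a 1.3B-parameter model that supports zero-shot voice cloning and cross-lingual speech synthesis in 14 languages, with fast inference and strong output quality.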

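Youdao has open-sourced Confucius4-TTS, a 1.3B-parameter TTS model with multilingual support and voice cloning. Quality is good and it is very fast. Github:https://github.com/netease-youdao/Confucius4-TTS… Try it online:https://confucius4-tts.youdao.com/gradio/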

Cached: 2026/06/18 12:15


netease-youdao/Confucius4-TTS

Source: https://github.com/netease-youdao/Confucius4-TTS

Confucius4-TTS

Confucius4-TTS: a Multilingual and Cross-Lingual Zero-Shot TTS Engine

One voice. Any language.

                             

Confucius4-TTS is an advanced LLM-based text-to-speech (TTS) system designed for multilingual and cross-lingual speech synthesis. Built on a speech encoder + large language model (LLM) architecture, Confucius4-TTS enables high-quality speech generation while preserving speaker identity across languages. You can try our online demo at https://confucius4-tts.youdao.com/gradio.

✨ Key Features

  • 14 Languages Supported: Chinese, English, Japanese, Korean, German, French, Spanish, Indonesian, Italian, Thai, Portuguese, Russian, Malay and Vietnamese (more coming soon)
  • Unconstrained Voice Cloning: No reference transcript required
  • Cross-Lingual Voice Transfer: Unaccented speech synthesis across 14 languages
  • Zero-Shot Voice Transfer: Clone voices without additional training
  • Seamless Emotion Transfer: Clone the feeling, not just the voice
  • Robust Generalization: Stable performance in real-world multilingual scenarios

With strong cross-lingual generalization, Confucius4-TTS allows users to seamlessly switch languages while keeping the same voice, delivering fluent, natural, and expressive speech.

🛠 Installation

Requirements

  • Python 3.10
  • CUDA 12.6

Setup

  1. Clone the repository:
     git clone https://github.com/netease-youdao/Confucius4-TTS.git
     cd Confucius4-TTS
  2. Create and activate a conda environment:
     conda create -n confuciustts python=3.10 -y
     conda activate confuciustts
  3. Install dependencies:
     pip install -r requirements.txt
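
After setup, it can be useful to confirm that the interpreter matches the stated requirements. The sketch below is not part of the repository: `python_matches` and `torch_cuda_version` are hypothetical helpers, and the second one assumes PyTorch is among the installed dependencies (suggested by the `import torch` in the API example, but not confirmed by a listing of `requirements.txt`).

```python
import sys


def python_matches(version_info=sys.version_info, required=(3, 10)):
    """Return True when the interpreter's major.minor equals `required`."""
    return tuple(version_info[:2]) == tuple(required)


def torch_cuda_version():
    """Return the CUDA version PyTorch was built with, or None if unavailable."""
    try:
        import torch  # assumption: installed via requirements.txt
    except ImportError:
        return None
    return torch.version.cuda  # None on CPU-only builds
```

Calling `python_matches()` inside the activated `confuciustts` environment should return `True`, and `torch_cuda_version()` can then be compared by eye against the CUDA 12.6 requirement.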

🚀 Inference

Use the provided example.py script for zero-shot TTS synthesis:

python example.py \
    --prompt_wav path/to/reference.wav \
    --text "Hello, this is a test of zero-shot voice cloning." \
    --lang en \
    --out output.wav \
    --config config/inference_config.yaml
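
To synthesize the same voice in several languages, the command above can be generated once per language. This is a minimal sketch that only uses the flags shown in the command; `build_command`, the sample sentences, and the output file names are illustrative assumptions, and the commands are printed rather than executed.

```python
import shlex
import sys


def build_command(prompt_wav, text, lang, out,
                  config="config/inference_config.yaml"):
    """Build the argv list for one example.py run, using the flags shown above."""
    return [
        sys.executable, "example.py",
        "--prompt_wav", prompt_wav,
        "--text", text,
        "--lang", lang,
        "--out", out,
        "--config", config,
    ]


# Hypothetical inputs: one reference clip, one sentence per target language.
sentences = {
    "en": "Hello, this is a test of zero-shot voice cloning.",
    "zh": "你好,这是零样本语音克隆的测试。",
}
commands = [
    build_command("path/to/reference.wav", text, lang, f"output_{lang}.wav")
    for lang, text in sentences.items()
]

for cmd in commands:
    # Printed only; pass `cmd` to subprocess.run(cmd, check=True) from the
    # repository root to actually run the synthesis.
    print(shlex.join(cmd))
```

Building the argument list explicitly, instead of formatting a shell string, avoids quoting problems when the text contains spaces or punctuation.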

You can also use the Python API directly:

import torch
import torchaudio
from confuciustts.cli.inference import ConfuciusTTS

model = ConfuciusTTS(
    config_path="config/inference_config.yaml",
    device="cuda" if torch.cuda.is_available() else "cpu",
)

audio = model.generate(
    text="Hello, welcome to Confucius4-TTS.",
    lang="en",
    prompt_wav="path/to/reference.wav",
    verbose=True,
)

torchaudio.save("output.wav", audio.cpu(), model.sample_rate)
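
The cross-lingual use case (one reference voice, several target languages) can be sketched as a loop over the `generate` call shown above. The helper `clone_across_languages` and the `EchoModel` stand-in are hypothetical; the stand-in exists only so the sketch runs without weights or a GPU, and it mirrors just the three keyword arguments used in the API example.

```python
def clone_across_languages(model, prompt_wav, sentences):
    """Synthesize each sentence with the same reference voice.

    `model` is any object exposing generate(text=..., lang=..., prompt_wav=...)
    as in the API example above; `sentences` maps language code -> text.
    """
    outputs = {}
    for lang, text in sentences.items():
        outputs[lang] = model.generate(text=text, lang=lang, prompt_wav=prompt_wav)
    return outputs


class EchoModel:
    """Stand-in for ConfuciusTTS: returns a description instead of audio."""

    def generate(self, text, lang, prompt_wav):
        return f"<audio lang={lang} chars={len(text)} ref={prompt_wav}>"


results = clone_across_languages(
    EchoModel(),
    "path/to/reference.wav",
    {"en": "Good morning.", "ja": "おはようございます。"},
)
```

With the real model, pass the `ConfuciusTTS` instance in place of `EchoModel()` and save each returned tensor with `torchaudio.save`, as in the example above.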

🚀 Fine-Tuning

Confucius4-TTS follows a “speech encoder + LLM” architecture. The training pipeline covers two modules:

  • Text2Semantic (T2S): generates semantic token sequences from text and speaker conditioning.
  • Semantic2Acoustic (S2A): a flow-matching model that converts semantic tokens into mel spectrograms.
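
The data flow between the two modules can be sketched as a simple composition. This is a toy illustration under stated assumptions, not the repository's code: `run_pipeline`, `toy_t2s`, and `toy_s2a` are hypothetical names, the stand-ins do no learning, and the real S2A stage may take additional conditioning that is omitted here.

```python
from typing import Callable, List, Sequence

SemanticTokens = List[int]
Mel = List[List[float]]  # frames x mel bins, kept as plain lists for illustration


def run_pipeline(
    text: str,
    speaker_condition: Sequence[float],
    t2s: Callable[[str, Sequence[float]], SemanticTokens],
    s2a: Callable[[SemanticTokens], Mel],
) -> Mel:
    """Chain the two stages: text -> semantic tokens -> mel spectrogram."""
    semantic_tokens = t2s(text, speaker_condition)
    return s2a(semantic_tokens)


# Toy stand-ins (not the real models): one token per character,
# one mel frame per token.
def toy_t2s(text, speaker_condition):
    return [ord(ch) % 256 for ch in text]


def toy_s2a(tokens, n_mels=4):
    return [[tok / 255.0] * n_mels for tok in tokens]


mel = run_pipeline("hi", [0.0, 1.0], toy_t2s, toy_s2a)
```

Keeping the stages behind plain callables reflects why the two modules can be trained separately: each one only depends on the other's output format.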

1. Prepare Pretrained Models

Download the two external models:

# Wav2Vec2-BERT (speaker conditioning & semantic feature extraction)
huggingface-cli download facebook/w2v-bert-2.0 \
    --local-dir pretrained/w2v-bert-2.0

# Amphion MaskGCT (semantic codec implementation)
git clone https://github.com/open-mmlab/Amphion.git external/Amphion

After downloading, your directory should look like:

checkpoints/
├── t2s_model.safetensors        # pretrained T2S weights
├── s2a_model.pt                 # pretrained S2A weights
├── wav2vec2bert_stats.pt        # semantic feature normalization statistics
├── special_tokens_map.json      # tokenizer files
├── tokenizer.json
├── tokenizer.model
└── tokenizer_config.json
pretrained/
├── w2v-bert-2.0/                # Wav2Vec2-BERT model
└── campplus/
    └── campplus_cn_common.bin   # CAMPPlus speaker encoder checkpoint
external/
└── Amphion/                     # MaskGCT semantic codec implementation
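
Before launching training, the layout above can be checked with a short script. This is a minimal sketch: the paths are copied from the listing, while `EXPECTED` and `missing_paths` are hypothetical names that do not come from the repository.

```python
from pathlib import Path

# Paths copied from the directory listing above.
EXPECTED = [
    "checkpoints/t2s_model.safetensors",
    "checkpoints/s2a_model.pt",
    "checkpoints/wav2vec2bert_stats.pt",
    "checkpoints/special_tokens_map.json",
    "checkpoints/tokenizer.json",
    "checkpoints/tokenizer.model",
    "checkpoints/tokenizer_config.json",
    "pretrained/w2v-bert-2.0",
    "pretrained/campplus/campplus_cn_common.bin",
    "external/Amphion",
]


def missing_paths(root="."):
    """Return the expected files or directories that do not exist under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED if not (base / rel).exists()]
```

Running `missing_paths()` from the repository root returns an empty list when everything is in place, or the entries that still need to be downloaded.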

2. Prepare Training Data

Training data is provided as TSV files (tab-separated, no header) with the following 5 columns:

| Column | Description |
|---|---|
| lang | Language code (e.g. zh, en, ja) |
| wav_path | Path to the target audio |
| norm_text | Normalized text |
| semantic_ids_path | Pre-extracted semantic tokens (.npy file path) |
| ref_audio_paths | Reference audio path(s), comma-separated for multiple |
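
A TSV file in this format can be validated before training with a small parser. The sketch below assumes only what the column list states (five tab-separated fields, no header, comma-separated reference paths); `parse_rows` and the sample line are hypothetical, and the paths in the sample are placeholders.

```python
import csv
import io

COLUMNS = ["lang", "wav_path", "norm_text", "semantic_ids_path", "ref_audio_paths"]


def parse_rows(handle):
    """Yield one dict per TSV line, checking that each line has exactly 5 fields."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_no, fields in enumerate(reader, start=1):
        if len(fields) != len(COLUMNS):
            raise ValueError(f"line {line_no}: expected 5 fields, got {len(fields)}")
        row = dict(zip(COLUMNS, fields))
        # Split the comma-separated reference list into individual paths.
        row["ref_audio_paths"] = [p for p in row["ref_audio_paths"].split(",") if p]
        yield row


# Hypothetical sample line; the paths are placeholders.
sample = (
    "en\tdata/wav/0001.wav\thello world\tdata/sem/0001.npy\t"
    "data/ref/a.wav,data/ref/b.wav\n"
)
rows = list(parse_rows(io.StringIO(sample)))
```

For a real file, pass `open("data/train.tsv", newline="", encoding="utf-8")` instead of the `io.StringIO` object.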

Configure the train/validation paths in config/train_t2s.yaml:

data:
  train_data_path:
    - data/train.tsv
  val_data_path:
    - data/val.tsv

3. Launch T2S Training

Set the pretrained T2S checkpoint path in config/train_t2s.yaml:

paths:
  t2s_checkpoint: checkpoints/t2s_model.safetensors

Single-node training:

python -m confuciustts.cli.train_t2s -c config/train_t2s.yaml

4. Launch S2A Training

Set the checkpoint paths in config/train_s2a.yaml. t2s_checkpoint points to the frozen T2S backbone; s2a_checkpoint is optional and can be used to resume from a pretrained S2A model:

paths:
  t2s_checkpoint: checkpoints/t2s_model.safetensors
  s2a_checkpoint: checkpoints/s2a_model.pt   # optional: resume from pretrained S2A

Single-node training:

python -m confuciustts.cli.train_s2a -c config/train_s2a.yaml

During S2A training, the T2S model, speaker encoder (Wav2Vec2-BERT), and style encoder (CAMPPlus) are all frozen. Only the flow-matching S2A model is trained.

📊 Performance

Confucius4-TTS achieves competitive results on multilingual and cross-lingual zero-shot TTS benchmarks, with strong intelligibility and speaker similarity across multiple languages.

Lower is better for WER/CER (↓), and higher is better for SIM (↑).
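
For readers unfamiliar with these metrics, the following sketch shows how they are commonly computed. It rests on assumptions the README does not state: that WER/CER are reported as percentages of edit distance over reference length, and that SIM is a cosine similarity between speaker embeddings. The benchmarks themselves also transcribe the synthesized audio with an ASR model and normalize the text first, which is omitted here.

```python
import math


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent: word-level edits / reference word count."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent, computed over characters."""
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

CER is the usual choice for languages such as Chinese, where word boundaries are not marked by spaces, which is consistent with the Chinese rows in the tables below being labeled CER.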

CV3-eval Cross-lingual

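CV3-eval Cross-lingual Results

| Direction | Metric | Confucius4-TTS | F5-TTS† | Spark-TTS | CosyVoice2† | CosyVoice3-0.5B† | CosyVoice3-0.5B + DiffRO† | CosyVoice3-1.5B† | CosyVoice3-1.5B + DiffRO† |
|---|---|---|---|---|---|---|---|---|---|
| en→zh | WER↓ | 6.71 | 11.60 | 12.40 | 13.50 | 8.48 | 5.16 | 8.01 | 5.09 |
| ja→zh | WER↓ | 4.93 | - | - | 48.10 | 6.86 | 3.22 | 6.78 | 3.05 |
| ko→zh | WER↓ | 1.46 | - | - | 7.70 | 5.24 | 1.03 | 3.30 | 1.06 |
| zh→en | WER↓ | 3.19 | 5.57 | 7.36 | 17.10 | 6.83 | 4.41 | 5.39 | 4.20 |
| ja→en | WER↓ | 3.44 | - | - | 11.20 | 5.86 | 4.78 | 5.94 | 4.19 |
| ko→en | WER↓ | 3.42 | - | - | 13.10 | 18.30 | 7.91 | 13.70 | 7.08 |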

† Requires reference text.

X-Voice Benchmark

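X-Voice Cross-lingual Results

| Direction | Metric | Confucius4-TTS | X-Voice | OmniVoice† | IndexTTS2 |
|---|---|---|---|---|---|
| de→zh | WER↓ | 2.86 | 3.07 | 13.10 | 3.46 |
| | SIM↑ | 0.569 | 0.516 | 0.691 | 0.544 |
| en→zh | WER↓ | 3.27 | 3.06 | 4.03 | 3.78 |
| | SIM↑ | 0.504 | 0.443 | 0.544 | 0.485 |
| fr→zh | WER↓ | 2.74 | 3.01 | 18.10 | 3.53 |
| | SIM↑ | 0.550 | 0.518 | 0.686 | 0.543 |
| ja→zh | WER↓ | 3.50 | 3.39 | 79.10 | 4.11 |
| | SIM↑ | 0.637 | 0.629 | 0.709 | 0.650 |
| ko→zh | WER↓ | 2.86 | 3.13 | 11.88 | 2.90 |
| | SIM↑ | 0.649 | 0.655 | 0.718 | 0.650 |
| th→zh | WER↓ | 2.87 | 2.79 | 3.30 | 3.08 |
| | SIM↑ | 0.623 | 0.614 | 0.661 | 0.622 |
| vi→zh | WER↓ | 2.75 | 2.78 | 10.51 | 2.98 |
| | SIM↑ | 0.640 | 0.641 | 0.701 | 0.641 |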

† Requires reference text.

Seed-TTS-eval

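Seed-TTS-eval English & Chinese Zero-shot Results

| Language | Metric | Confucius4-TTS | Qwen3-TTS | FishAudio S2† | OmniVoice† | VoxCPM2† | X-Voice |
|---|---|---|---|---|---|---|---|
| English | WER↓ | 1.49 | 1.24 | 0.99 | 1.60 | 1.84 | 1.91 |
| | SIM↑ | 0.70 | - | 0.714 | 0.741 | 0.753 | 0.627 |
| Chinese | CER↓ | 0.94 | 0.77 | 0.54 | 0.84 | 0.97 | 1.47 |
| | SIM↑ | 0.765 | - | 0.770 | 0.777 | 0.795 | 0.746 |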

† Requires reference text.

MiniMax-Multilingual-Test

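MiniMax-Multilingual-Test Results

| Language | Metric | Confucius4-TTS | ElevenLab | Qwen3-TTS | FishAudio S2† | OmniVoice† | VoxCPM2† | X-Voice |
|---|---|---|---|---|---|---|---|---|
| German | WER↓ | 0.47 | 0.57 | 1.24 | 0.55 | 0.96 | 0.68 | 2.00 |
| | SIM↑ | 0.775 | 0.614 | 0.768 | 0.767 | 0.812 | 0.803 | 0.763 |
| French | WER↓ | 3.66 | 5.22 | 2.86 | 3.05 | 3.35 | 4.53 | 4.73 |
| | SIM↑ | 0.723 | 0.535 | 0.716 | 0.698 | 0.801 | 0.735 | 0.746 |
| Indonesian | WER↓ | 1.12 | 1.06 | - | 1.46 | 1.97 | 1.08 | 1.47 |
| | SIM↑ | 0.765 | 0.660 | - | 0.763 | 0.805 | 0.800 | 0.725 |
| Korean | WER↓ | 1.84 | 1.87 | 1.76 | 1.18 | 2.65 | 1.96 | 2.27 |
| | SIM↑ | 0.812 | 0.700 | 0.790 | 0.817 | 0.828 | 0.833 | 0.788 |
| Thai | WER↓ | 1.56 | 73.94 | - | 4.23 | 3.98 | 2.96 | 4.71 |
| | SIM↑ | 0.773 | 0.588 | - | 0.786 | 0.841 | 0.840 | 0.791 |
| Japanese | WER↓ | 4.14 | 10.65 | 3.82 | 2.76 | 4.03 | 4.63 | 7.13 |
| | SIM↑ | 0.788 | 0.738 | 0.771 | 0.796 | 0.828 | 0.828 | 0.765 |
| Vietnamese | WER↓ | 1.61 | 73.42 | - | 7.41 | 1.37 | 3.31 | 1.40 |
| | SIM↑ | 0.751 | 0.369 | - | 0.740 | 0.805 | 0.806 | 0.672 |
| Italian | WER↓ | 1.30 | 1.74 | 0.95 | 1.27 | 2.07 | 1.56 | 2.27 |
| | SIM↑ | 0.787 | 0.579 | 0.752 | 0.747 | 0.812 | 0.780 | 0.780 |
| Portuguese | WER↓ | 2.48 | 1.33 | 1.53 | 1.14 | 2.51 | 1.94 | 2.61 |
| | SIM↑ | 0.796 | 0.711 | 0.805 | 0.781 | 0.859 | 0.837 | 0.794 |
| Spanish | WER↓ | 1.02 | 1.08 | 1.13 | 0.91 | 1.03 | 1.44 | 2.91 |
| | SIM↑ | 0.778 | 0.615 | 0.814 | 0.776 | 0.804 | 0.831 | 0.747 |
| Russian | WER↓ | 4.64 | 3.88 | 3.21 | 2.40 | 2.23 | 3.63 | 6.49 |
| | SIM↑ | 0.787 | 0.675 | 0.784 | 0.790 | 0.783 | 0.811 | 0.799 |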

† Requires reference text.


Acknowledgements

Confucius4-TTS builds on the following open-source projects:

  • Qwen3-TTS — Speaker encoder (ECAPA-TDNN) and text embedding projector architectures
  • CosyVoice — Text normalization pipeline
  • Amphion / MaskGCT — Semantic codec implementation
  • w2v-BERT 2.0 — Semantic feature extraction and speaker conditioning
  • Seed-VC — Flow matching architecture reference
  • BigVGAN — High-fidelity neural vocoder for mel-spectrogram to waveform synthesis

Citation

If you find Confucius4-TTS useful in your research or project, please consider citing:

@misc{confucius4tts_2026,
  title        = {Confucius4-TTS: A Multilingual and Cross-Lingual Zero-Shot TTS Engine},
  author       = {{NetEase Youdao}},
  year         = {2026},
  howpublished = {\url{https://github.com/netease-youdao/Confucius4-TTS}},
  note         = {GitHub repository}
}

Feng Zhou (@fengzhou): 🚀 Just released: Youdao Confucius4-TTS An open-weight 1.3B high-quality TTS engine:

  • 14 langs: EN, ZH, JA, KO, DE, FR, ES, ID, IT, TH, PT, RU, MS, VI
  • High-quality voice cloning w/o transcript
  • Cross-lingual voice transfer with min accent
  • Emotion transfer
  • Apache License
