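@QT9277: "Wait, AI voice synthesis has already gotten this insane???" A-Tai (阿台) here: I was scrolling GitHub today and was floored. VoxCPM2: #1 on the trending chart, 20k+ stars, blowing up overseas. I assumed it was yet another slideware open-source project, but after a close look at the demo, my ears honestly could not tell which one was the real person. …

X AI KOLs Timeline | Models

Summary

Introduces VoxCPM2, an open-source multilingual speech synthesis model that is free for commercial use, supports voice design, voice cloning, and 48kHz high-quality output, and ranks #1 on GitHub Trending.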


Cached: 2026/06/05 17:17

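"Wait, AI voice synthesis has already gotten this insane???"

I was scrolling GitHub today and was floored.

VoxCPM2: #1 on the trending chart, 20k+ stars, blowing up overseas. I assumed it was yet another slideware open-source project, but after a close look at the demo, my ears honestly could not tell which one was the real person.

Here is how wild this thing is:

Type and get a voice: enter "a calm woman's voice, in her 30s" and it generates one on the spot. No recording, no tuning, one sentence and you're done.

Drop in a recording and it copies even your verbal tics: not a stiff robotic voice, but your tone, your phrasing, even your catchphrases. It picks up all of it. That's not synthesis, that's cloning!

48kHz studio-grade audio: it sounds like it came out of a professional studio. I listened three times with headphones on and still couldn't find a flaw.

The wildest part is that it's completely free for commercial use: Apache 2.0 license, so you can use it, modify it, and make money with it without paying a cent. For someone like me who is climbing out of debt, this is a godsend.

Making short videos without showing your face, podcasting without gear, needing voice-over for a project: get on board at zero cost. What more could you ask for?

I've already starred it and will look into fitting it into my content workflow. Anyone already playing with it? Let's talk in the comments!

Purely a personal share, not an ad. I only just came across it myself.

And the kicker: it's all free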

https://github.com/OpenBMB/VoxCPM


OpenBMB/VoxCPM

Source: https://github.com/OpenBMB/VoxCPM

VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning


VoxCPM is a tokenizer-free Text-to-Speech system that directly generates continuous speech representations via an end-to-end diffusion autoregressive architecture, bypassing discrete tokenization to achieve highly natural and expressive synthesis.

VoxCPM2 is the latest major release — a 2B parameter model trained on over 2 million hours of multilingual speech data, now supporting 30 languages, Voice Design, Controllable Voice Cloning, and 48kHz studio-quality audio output. Built on a MiniCPM-4 backbone.

✨ Highlights

  • 🌍 30-Language Multilingual — Input text in any of the 30 supported languages and synthesize directly, no language tag needed
  • 🎨 Voice Design — Create a brand-new voice from a natural-language description alone (gender, age, tone, emotion, pace …), no reference audio required
  • 🎛️ Controllable Cloning — Clone any voice from a short reference clip, with optional style guidance to steer emotion, pace, and expression while preserving the original timbre
  • 🎙️ Ultimate Cloning — Reproduce every vocal nuance: provide both reference audio and its transcript, and the model continues seamlessly from the reference, faithfully preserving every vocal detail — timbre, rhythm, emotion, and style (same as VoxCPM1.5)
  • 🔊 48kHz High-Quality Audio — Accepts 16kHz reference audio and directly outputs 48kHz studio-quality audio via AudioVAE V2’s asymmetric encode/decode design, with built-in super-resolution — no external upsampler needed
  • 🧠 Context-Aware Synthesis — Automatically infers appropriate prosody and expressiveness from text content
  • Real-Time Streaming — RTF as low as ~0.3 on an NVIDIA RTX 4090, and ~0.13 when accelerated by Nano-vLLM or vLLM-Omni (the official vLLM omni-modal serving for VoxCPM2, with PagedAttention and an OpenAI-compatible API)
  • 📜 Fully Open-Source & Commercial-Ready — Weights and code released under the Apache-2.0 license, free for commercial use
🌍 Supported Languages (30)
Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese

Chinese dialects: Sichuanese, Cantonese, Wu, Northeastern Mandarin, Henan dialect, Shaanxi dialect, Shandong dialect, Tianjin dialect, Hokkien (Minnan)

News

  • [2026.04] 🔥 We release VoxCPM2 — 2B, 30 languages, Voice Design & Controllable Voice Cloning, 48kHz audio output! Weights | Docs | Playground
  • [2025.12] 🎉 Open-source VoxCPM1.5 weights with SFT & LoRA fine-tuning. (🏆 #1 GitHub Trending)
  • [2025.09] 🔥 Release VoxCPM Technical Report.
  • [2025.09] 🎉 Open-source VoxCPM-0.5B weights (🏆 #1 HuggingFace Trending)


🚀 Quick Start

Installation

pip install voxcpm

Requirements: Python ≥ 3.10 (<3.13), PyTorch ≥ 2.5.0, CUDA ≥ 12.0. See Quick Start Docs for details.
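
The Python bound in the requirements line can be checked before installing. The following is a minimal sketch using only the standard library; the helper name `python_version_supported` is hypothetical and not part of the voxcpm package, and it covers only the Python range, not the PyTorch or CUDA requirements.

```python
import sys

# Bounds taken from the requirements line: Python >= 3.10 and < 3.13.
MIN_PYTHON = (3, 10)
MAX_PYTHON_EXCLUSIVE = (3, 13)


def python_version_supported(version=None):
    """Return True if (major, minor) falls inside the documented range.

    Hypothetical helper for illustration; not part of the voxcpm package.
    """
    if version is None:
        version = sys.version_info[:2]
    major_minor = tuple(version[:2])
    return MIN_PYTHON <= major_minor < MAX_PYTHON_EXCLUSIVE


if __name__ == "__main__":
    running = sys.version.split()[0]
    print("Python", running, "supported:", python_version_supported())
```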

Python API

🗣️ Text-to-Speech

from voxcpm import VoxCPM
import soundfile as sf

model = VoxCPM.from_pretrained(
    "openbmb/VoxCPM2",
    load_denoiser=False,
)

wav = model.generate(
    text="VoxCPM2 is the current recommended release for realistic multilingual speech synthesis.",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("demo.wav", wav, model.tts_model.sample_rate)
print("saved: demo.wav")

If you prefer downloading from ModelScope first, you can use:

pip install modelscope

from modelscope import snapshot_download

# Download the weights into a local directory first
snapshot_download("OpenBMB/VoxCPM2", local_dir="./pretrained_models/VoxCPM2")

from voxcpm import VoxCPM
import soundfile as sf

# Load from the local directory instead of the hub ID
model = VoxCPM.from_pretrained("./pretrained_models/VoxCPM2", load_denoiser=False)

wav = model.generate(
    text="VoxCPM2 is the current recommended release for realistic multilingual speech synthesis.",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("demo.wav", wav, model.tts_model.sample_rate)

🎨 Voice Design

Create a voice from a natural-language description, with no reference audio needed. Format: put the description in parentheses at the start of text, e.g. "(your voice description)The text to synthesize.":

wav = model.generate(
    text="(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("voice_design.wav", wav, model.tts_model.sample_rate)
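
The prefix format above is easy to get wrong when descriptions come from user input, so a small string helper can build it. This is a sketch only: `with_voice_description` is a hypothetical name, and rejecting nested parentheses is this sketch's own precaution, not a documented rule of the model.

```python
def with_voice_description(description: str, text: str) -> str:
    """Prefix `text` with a parenthesized voice description.

    Hypothetical helper; it only builds the string passed as `text=`.
    """
    description = description.strip()
    if not description:
        return text  # no description: plain text-to-speech input
    if "(" in description or ")" in description:
        # Assumption: nested parentheses could confuse the prefix format,
        # so this sketch rejects them rather than guessing how they parse.
        raise ValueError("description must not contain parentheses")
    return f"({description}){text}"


prompt = with_voice_description(
    "A young woman, gentle and sweet voice",
    "Hello, welcome to VoxCPM2!",
)
print(prompt)
```

The resulting string can then be passed as the `text` argument of `model.generate`, exactly as in the example above.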

🎛️ Controllable Voice Cloning

Provide a reference audio clip. The model clones the timbre, and you can still use control instructions to adjust speed, emotion, or style.

wav = model.generate(
    text="This is a cloned voice generated by VoxCPM2.",
    reference_wav_path="path/to/voice.wav",
)
sf.write("clone.wav", wav, model.tts_model.sample_rate)

wav = model.generate(
    text="(slightly faster, cheerful tone)This is a cloned voice with style control.",
    reference_wav_path="path/to/voice.wav",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("controllable_clone.wav", wav, model.tts_model.sample_rate)

🎙️ Ultimate Cloning

Provide both the reference audio and its exact transcript for audio-continuation-based cloning with every vocal nuance reproduced. For maximum cloning similarity, pass the same reference clip to both reference_wav_path and prompt_wav_path as shown below:

wav = model.generate(
    text="This is an ultimate cloning demonstration using VoxCPM2.",
    prompt_wav_path="path/to/voice.wav",
    prompt_text="The transcript of the reference audio.",
    reference_wav_path="path/to/voice.wav",  # optional, for better similarity
)
sf.write("hifi_clone.wav", wav, model.tts_model.sample_rate)
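
The cloning modes differ only in which keyword arguments are passed, so they can be assembled by one small function. This is a sketch under assumptions: `cloning_kwargs` is a hypothetical helper, and the rule that the prompt audio and its transcript must be supplied together is inferred from the description above rather than from the library's own validation.

```python
def cloning_kwargs(reference_wav=None, prompt_wav=None, prompt_text=None):
    """Build keyword arguments for model.generate() in a cloning mode.

    Hypothetical helper. The parameter names it emits (reference_wav_path,
    prompt_wav_path, prompt_text) are the ones used in the examples above.
    """
    # Continuation-based cloning needs the audio and its transcript together.
    if (prompt_wav is None) != (prompt_text is None):
        raise ValueError("prompt_wav and prompt_text must be given together")
    if reference_wav is None and prompt_wav is None:
        raise ValueError("cloning needs reference_wav, prompt_wav, or both")

    kwargs = {}
    if reference_wav is not None:
        kwargs["reference_wav_path"] = reference_wav
    if prompt_wav is not None:
        kwargs["prompt_wav_path"] = prompt_wav
        kwargs["prompt_text"] = prompt_text
    return kwargs


# Controllable cloning: reference audio only.
print(cloning_kwargs(reference_wav="path/to/voice.wav"))

# Ultimate cloning: same clip as prompt and reference, plus its transcript.
print(cloning_kwargs(
    reference_wav="path/to/voice.wav",
    prompt_wav="path/to/voice.wav",
    prompt_text="The transcript of the reference audio.",
))
```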

🔄 Streaming API

import numpy as np

chunks = []
for chunk in model.generate_streaming(
    text="Streaming text to speech is easy with VoxCPM!",
):
    chunks.append(chunk)
wav = np.concatenate(chunks)
sf.write("streaming.wav", wav, model.tts_model.sample_rate)
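
The example above collects every chunk in memory before writing. For long texts it may be preferable to write each chunk as it arrives. The sketch below does that with the standard-library `wave` module; it assumes each chunk is a 1-D sequence of floats in [-1.0, 1.0] (not confirmed by the source), and it uses a stand-in iterator instead of `model.generate_streaming` so that it runs without the model. `write_stream_to_wav` is a hypothetical helper name.

```python
import os
import struct
import tempfile
import wave


def write_stream_to_wav(chunks, path, sample_rate):
    """Write an iterable of float chunks to a 16-bit mono WAV file.

    Assumption: each chunk is a 1-D sequence of floats in [-1.0, 1.0].
    Returns the total number of samples written.
    """
    total = 0
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 16-bit PCM
        wav_file.setframerate(sample_rate)
        for chunk in chunks:
            # Clip to [-1, 1], then scale to the int16 range.
            pcm = [int(max(-1.0, min(1.0, float(s))) * 32767) for s in chunk]
            wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
            total += len(pcm)
    return total


# Stand-in for model.generate_streaming(...): two short silent chunks.
fake_stream = iter([[0.0] * 480, [0.0] * 480])
demo_path = os.path.join(tempfile.gettempdir(), "streaming_demo.wav")
total_samples = write_stream_to_wav(fake_stream, demo_path, 48000)
print(total_samples, "samples written to", demo_path)
```

With the real model, the first argument would be the generator returned by `model.generate_streaming(text=...)` and the sample rate would be `model.tts_model.sample_rate`.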

CLI Usage

# Voice design (no reference audio needed)
voxcpm design \
  --text "VoxCPM2 brings studio-quality multilingual speech synthesis." \
  --output out.wav

# Voice design with a style description (--control)
voxcpm design \
  --text "VoxCPM2 brings studio-quality multilingual speech synthesis." \
  --control "Young female voice, warm and gentle, slightly smiling" \
  --output out.wav

# Voice cloning (reference audio)
voxcpm clone \
  --text "This is a voice cloning demo." \
  --reference-audio path/to/voice.wav \
  --output out.wav

# Ultimate cloning (prompt audio + transcript)
# --reference-audio is optional; passing it improves similarity
voxcpm clone \
  --text "This is a voice cloning demo." \
  --prompt-audio path/to/voice.wav \
  --prompt-text "reference transcript" \
  --reference-audio path/to/voice.wav \
  --output out.wav

# Batch processing
voxcpm batch --input examples/input.txt --output-dir outs

# Help
voxcpm --help

Web Demo

python app.py --port 8808  # then open in browser: http://localhost:8808

Use --device to choose the runtime device:

python app.py --device auto

Supported values are auto, cpu, mps, cuda, and cuda:N. On Apple Silicon Macs, auto uses MPS when available.
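
A wrapper script can reject misspelled device strings before launching the demo. The following is a minimal sketch; `is_supported_device` is a hypothetical helper that checks only the spelling of the value against the list above, not whether the device exists on the machine.

```python
import re

# Values listed above: auto, cpu, mps, cuda, and cuda:N (N = GPU index).
_DEVICE_PATTERN = re.compile(r"auto|cpu|mps|cuda(:\d+)?")


def is_supported_device(value: str) -> bool:
    """Return True if `value` matches one of the documented --device forms.

    Hypothetical helper; it validates the string only.
    """
    return _DEVICE_PATTERN.fullmatch(value) is not None


for candidate in ["auto", "cuda:1", "cuda:", "gpu"]:
    print(candidate, is_supported_device(candidate))
```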

🚢 Production Deployment (Nano-vLLM)

For high-throughput serving, use Nano-vLLM-VoxCPM — a dedicated inference engine built on Nano-vLLM with concurrent request support and an async API.

pip install nano-vllm-voxcpm

from nanovllm_voxcpm import VoxCPM
import numpy as np
import soundfile as sf

server = VoxCPM.from_pretrained(model="/path/to/VoxCPM", devices=[0])
chunks = list(server.generate(target_text="Hello from VoxCPM!"))
sf.write("out.wav", np.concatenate(chunks), 48000)
server.stop()

RTF as low as ~0.13 on NVIDIA RTX 4090 (vs ~0.3 with the standard PyTorch implementation), with support for batched concurrent requests and a FastAPI HTTP server. See the Nano-vLLM-VoxCPM repo for deployment details.
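
RTF (real-time factor) is the time spent synthesizing divided by the duration of the audio produced, so values below 1 mean faster than real time. The sketch below shows the arithmetic; the function name is hypothetical and the numbers in the example are illustrative, not measurements.

```python
def real_time_factor(synthesis_seconds: float, num_samples: int,
                     sample_rate: int) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    audio_seconds = num_samples / sample_rate
    if audio_seconds <= 0:
        raise ValueError("audio must have a positive duration")
    return synthesis_seconds / audio_seconds


# Illustrative numbers only (not measurements): 10 s of 48 kHz audio
# produced in 3 s of compute corresponds to an RTF of 0.3.
print(real_time_factor(3.0, 480_000, 48_000))
```

To measure it on your own hardware, one option is to wrap the generation call with `time.perf_counter()` and pass `len(wav)` together with the model's sample rate.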

🏭 Production Serving (vLLM-Omni)

For production multi-tenant deployments, use vLLM-Omni, the official vLLM project's omni-modal extension with native VoxCPM2 support. It provides a PagedAttention KV cache, continuous batching, and a drop-in OpenAI-compatible /v1/audio/speech endpoint.

# Install from source (latest main — vllm-omni is rapidly evolving)
uv pip install vllm==0.19.0 --torch-backend=auto
git clone https://github.com/vllm-project/vllm-omni.git && cd vllm-omni
uv pip install -e .

See the vLLM-Omni installation guide for other platforms (ROCm, XPU, MUSA, NPU) and Docker images.

# Launch an OpenAI-compatible TTS server (--omni enables omni-modal serving)
vllm serve openbmb/VoxCPM2 --omni --port 8000

# Call it from any OpenAI client
curl http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"openbmb/VoxCPM2","input":"Hello from VoxCPM2 on vLLM-Omni!","voice":"default"}' \
  --output out.wav

Built on the upstream vLLM scheduler, with batched concurrent requests, streaming chunk delivery, and multi-GPU deployment out of the box. See the VoxCPM2 example for full deployment recipes.
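
The same request as the curl call can be made from Python with only the standard library. This is a sketch: the JSON fields (`model`, `input`, `voice`) and the URL mirror the curl example, while the function names are hypothetical and nothing is assumed about other fields the server may accept. Building the request is separated from sending it, so the first part runs without a server.

```python
import json
import urllib.request


def build_speech_request(text, base_url="http://localhost:8000",
                         model="openbmb/VoxCPM2", voice="default"):
    """Build (but do not send) a POST request for /v1/audio/speech.

    The JSON fields mirror the curl example; hypothetical helper.
    """
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text, out_path, **kwargs):
    """Send the request and save the response body to out_path."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as f:
        f.write(response.read())


request = build_speech_request("Hello from VoxCPM2 on vLLM-Omni!")
print(request.get_method(), request.full_url)
```

Calling `synthesize_to_file("Hello", "out.wav")` requires the server from the `vllm serve` command to be running.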

Full parameter reference, multi-scenario examples, and voice cloning tips → Quick Start Guide | Usage Guide | Cookbook


📦 Models & Versions

| | VoxCPM2 | VoxCPM1.5 | VoxCPM-0.5B |
|---|---|---|---|
| Status | 🟢 Latest | Stable | Legacy |
| Backbone Parameters | 2B | 0.6B | 0.5B |
| Audio Sample Rate | 48kHz | 44.1kHz | 16kHz |
| LM Token Rate | 6.25Hz | 6.25Hz | 12.5Hz |
| Languages | 30 | 2 (zh, en) | 2 (zh, en) |
| Cloning Mode | Isolated Reference & Continuation | Continuation only | Continuation only |
| Voice Design | | | |
| Controllable Voice Cloning | | | |
| SFT / LoRA | | | |
| RTF (RTX 4090) | ~0.30 | ~0.15 | ~0.17 |
| RTF in Nano-VLLM (RTX 4090) | ~0.13 | ~0.08 | ~0.10 |
| VRAM | ~8 GB | ~6 GB | ~5 GB |
| Weights | 🤗 HF / MS | 🤗 HF / MS | 🤗 HF / MS |

Technical Report: Coming soon | arXiv ICLR 2026
Demo Page: Audio Samples | Audio Samples
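
The sample-rate and token-rate rows in the table can be combined into two rough figures: how many LM steps a clip needs and how many output samples each step covers. The sketch below assumes that "LM Token Rate" means LM tokens per second of audio, which the source does not spell out; the function names are hypothetical.

```python
def lm_steps(audio_seconds: float, token_rate_hz: float) -> float:
    """LM steps for a clip, assuming one step per token period."""
    return audio_seconds * token_rate_hz


def samples_per_step(sample_rate_hz: int, token_rate_hz: float) -> float:
    """Output samples covered by a single LM step."""
    return sample_rate_hz / token_rate_hz


# Values from the table: (sample rate in Hz, LM token rate in Hz).
models = {
    "VoxCPM2": (48_000, 6.25),
    "VoxCPM1.5": (44_100, 6.25),
    "VoxCPM-0.5B": (16_000, 12.5),
}

for name, (sample_rate, token_rate) in models.items():
    steps = lm_steps(10.0, token_rate)  # a 10-second clip
    per_step = samples_per_step(sample_rate, token_rate)
    print(f"{name}: {steps} steps per 10 s, {per_step} samples per step")
```

Under that assumption, halving the token rate from 12.5Hz to 6.25Hz halves the number of autoregressive steps for the same clip length.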

VoxCPM2 is built on a tokenizer-free, diffusion autoregressive paradigm. The model operates entirely in the latent space of AudioVAE V2, following a four-stage pipeline: LocEnc → TSLM → RALM → LocDiT, enabling rich expressiveness and 48kHz native audio output.

VoxCPM2 Model Architecture

For full architectural details, VoxCPM2-specific upgrades, and a model comparison table, see the Architecture Design.


📊 Performance

VoxCPM2 achieves state-of-the-art or comparable results on public zero-shot and controllable TTS benchmarks.

Seed-TTS-eval

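Seed-TTS-eval WER (⬇) and SIM (⬆) results

| Model | Parameters | Open-Source | test-EN WER/% ⬇ | test-EN SIM/% ⬆ | test-ZH CER/% ⬇ | test-ZH SIM/% ⬆ | test-Hard CER/% ⬇ | test-Hard SIM/% ⬆ |
|---|---|---|---|---|---|---|---|---|
| MegaTTS3 | 0.5B | | 2.79 | 77.1 | 1.52 | 79.0 | - | - |
| DiTAR | 0.6B | | 1.69 | 73.5 | 1.02 | 75.3 | - | - |
| CosyVoice3 | 0.5B | | 2.02 | 71.8 | 1.16 | 78.0 | 6.08 | 75.8 |
| CosyVoice3 | 1.5B | | 2.22 | 72.0 | 1.12 | 78.1 | 5.83 | 75.8 |
| Seed-TTS | - | | 2.25 | 76.2 | 1.12 | 79.6 | 7.59 | 77.6 |
| MiniMax-Speech | - | | 1.65 | 69.2 | 0.83 | 78.3 | - | - |
| F5-TTS | 0.3B | | 2.00 | 67.0 | 1.53 | 76.0 | 8.67 | 71.3 |
| MaskGCT | 1B | | 2.62 | 71.7 | 2.27 | 77.4 | - | - |
| CosyVoice | 0.3B | | 4.29 | 60.9 | 3.63 | 72.3 | 11.75 | 70.9 |
| CosyVoice2 | 0.5B | | 3.09 | 65.9 | 1.38 | 75.7 | 6.83 | 72.4 |
| SparkTTS | 0.5B | | 3.14 | 57.3 | 1.54 | 66.0 | - | - |
| FireRedTTS | 0.5B | | 3.82 | 46.0 | 1.51 | 63.5 | 17.45 | 62.1 |
| FireRedTTS-2 | 1.5B | | 1.95 | 66.5 | 1.14 | 73.6 | - | - |
| Qwen2.5-Omni | 7B | | 2.72 | 63.2 | 1.70 | 75.2 | 7.97 | 74.7 |
| Qwen3-Omni | 30B-A3B | | 1.39 | - | 1.07 | - | - | - |
| OpenAudio-s1-mini | 0.5B | | 1.94 | 55.0 | 1.18 | 68.5 | 23.37 | 64.3 |
| IndexTTS2 | 1.5B | | 2.23 | 70.6 | 1.03 | 76.5 | 7.12 | 75.5 |
| VibeVoice | 1.5B | | 3.04 | 68.9 | 1.16 | 74.4 | - | - |
| HiggsAudio-v2 | 3B | | 2.44 | 67.7 | 1.50 | 74.0 | 55.07 | 65.6 |
| VoxCPM-0.5B | 0.6B | | 1.85 | 72.9 | 0.93 | 77.2 | 8.87 | 73.0 |
| VoxCPM1.5 | 0.8B | | 2.12 | 71.4 | 1.18 | 77.0 | 7.74 | 73.1 |
| MOSS-TTS | | | 1.85 | 73.4 | 1.20 | 78.8 | - | - |
| Qwen3-TTS | 1.7B | | 1.23 | 71.7 | 1.22 | 77.0 | 6.76 | 74.8 |
| FishAudio S2 | 4B | | 0.99 | - | 0.54 | - | 5.99 | - |
| LongCat-Audio-DiT | 3.5B | | 1.50 | 78.6 | 1.09 | 81.8 | 6.04 | 79.7 |
| VoxCPM2 | 2B | | 1.84 | 75.3 | 0.97 | 79.5 | 8.13 | 75.3 |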

CV3-eval

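CV3-eval multilingual WER/CER (⬇) results

| Model | zh | en | hard-zh | hard-en | ja | ko | de | es | fr | it | ru |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CosyVoice2 | 4.08 | 6.32 | 12.58 | 11.96 | 9.13 | 19.7 | - | - | - | - | - |
| CosyVoice3-1.5B | 3.91 | 4.99 | 9.77 | 10.55 | 7.57 | 5.69 | 6.43 | 4.47 | 11.8 | 10.5 | 6.64 |
| Fish Audio S2 | 2.65 | 2.43 | 9.10 | 4.40 | 3.96 | 2.76 | 2.22 | 2.00 | 6.26 | 2.04 | 2.78 |
| VoxCPM2 | 3.65 | 5.00 | 8.55 | 8.48 | 5.96 | 5.69 | 4.77 | 3.80 | 9.85 | 4.25 | 5.21 |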

MiniMax-Multilingual-Test

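Minimax-MLS-test WER (⬇) results

| Language | Minimax | ElevenLabs | Qwen3-TTS | FishAudio S2 | VoxCPM2 |
|---|---|---|---|---|---|
| Arabic | 1.665 | 1.666 | | 3.500 | 13.046 |
| Cantonese | 34.111 | 51.513 | | 30.670 | 38.584 |
| Chinese | 2.252 | 16.026 | 0.928 | 0.730 | 1.136 |
| Czech | 3.875 | 2.108 | | 2.840 | 24.132 |
| Dutch | 1.143 | 0.803 | | 0.990 | 0.913 |
| English | 2.164 | 2.339 | 0.934 | 1.620 | 2.289 |
| Finnish | 4.666 | 2.964 | | 3.330 | 2.632 |
| French | 4.099 | 5.216 | 2.858 | 3.050 | 4.534 |
| German | 1.906 | 0.572 | 1.235 | 0.550 | 0.679 |
| Greek | 2.016 | 0.991 | | 5.740 | 2.844 |
| Hindi | 6.962 | 5.827 | | 14.640 | 19.699 |
| Indonesian | 1.237 | 1.059 | | 1.460 | 1.084 |
| Italian | 1.543 | 1.743 | 0.948 | 1.270 | 1.563 |
| Japanese | 3.519 | 10.646 | 3.823 | 2.760 | 4.628 |
| Korean | 1.747 | 1.865 | 1.755 | 1.180 | 1.962 |
| Polish | 1.415 | 0.766 | | 1.260 | 1.141 |
| Portuguese | 1.877 | 1.331 | 1.526 | 1.140 | 1.938 |
| Romanian | 2.878 | 1.347 | | 10.740 | 21.577 |
| Russian | 4.281 | 3.878 | 3.212 | 2.400 | 3.634 |
| Spanish | 1.029 | 1.084 | 1.126 | 0.910 | 1.438 |
| Thai | 2.701 | 73.936 | | 4.230 | 2.961 |
| Turkish | 1.52 | 0.699 | | 0.870 | 0.817 |
| Ukrainian | 1.082 | 0.997 | | 2.300 | 6.316 |
| Vietnamese | 0.887 | 3.415 | | 7.410 | 3.307 |

Minimax-MLS-test SIM (⬆) results

| Language | Minimax | ElevenLabs | Qwen3-TTS | FishAudio S2 | VoxCPM2 |
|---|---|---|---|---|---|
| Arabic | 73.6 | 70.6 | | 75.0 | 79.1 |
| Cantonese | 77.8 | 67.0 | | 80.5 | 83.5 |
| Chinese | 78.0 | 67.7 | 79.9 | 81.6 | 82.5 |
| Czech | 79.6 | 68.5 | | 79.8 | 78.3 |
| Dutch | 73.8 | 68.0 | | 73.0 | 80.8 |
| English | 75.6 | 61.3 | 77.5 | 79.7 | 85.4 |
| Finnish | 83.5 | 75.9 | | 81.9 | 89.0 |
| French | 62.8 | 53.5 | 62.8 | 69.8 | 73.5 |
| German | 73.3 | 61.4 | 77.5 | 76.7 | 80.3 |
| Greek | 82.6 | 73.3 | | 79.5 | 86.0 |
| Hindi | 81.8 | 73.0 | | 82.1 | 85.6 |
| Indonesian | 72.9 | 66.0 | | 76.3 | 80.0 |
| Italian | 69.9 | 57.9 | 81.7 | 74.7 | 78.0 |
| Japanese | 77.6 | 73.8 | 78.8 | 79.6 | 82.8 |
| Korean | 77.6 | 70.0 | 79.9 | 81.7 | 83.3 |
| Polish | 80.2 | 72.9 | | 81.9 | 88.4 |
| Portuguese | 80.5 | 71.1 | 81.7 | 78.1 | 83.7 |
| Romanian | 80.9 | 69.9 | | 73.3 | 79.7 |
| Russian | 76.1 | 67.6 | 79.2 | 79.0 | 81.1 |
| Spanish | 76.2 | 61.5 | 81.4 | 77.6 | 83.1 |
| Thai | 80.0 | 58.8 | | 78.6 | 84.0 |
| Turkish | 77.9 | 59.6 | | 83.5 | 87.1 |
| Ukrainian | 73.0 | 64.7 | | 74.7 | 79.8 |
| Vietnamese | 74.3 | 36.9 | | 74.0 | 80.6 |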

Internal 30-Language ASR Benchmark

We additionally run an internal multilingual intelligibility benchmark with 30 languages × 500 samples. ASR transcription is evaluated via Gemini 3.1 Flash Lite API.

Internal 30-Language ASR Benchmark results

| Language | Metric | VoxCPM2 | Fish S2-Pro |
|---|---|---|---|
| ar (Arabic) | CER | 1.23% | 0.30% |
| da (Danish) | WER | 2.70% | 3.52% |
| de (German) | WER | 0.96% | 0.64% |
| el (Greek) | WER | 3.17% | 4.61% |
| en (English) | WER | 0.42% | 1.03% |
| es (Spanish) | WER | 1.33% | 0.64% |
| fi (Finnish) | WER | 2.24% | 2.80% |
| fr (French) | WER | 2.16% | 2.34% |
| he (Hebrew) | CER | 2.98% | 15.27% |
| hi (Hindi) | CER | 0.79% | 0.91% |
| id (Indonesian) | WER | 1.36% | 1.68% |
| it (Italian) | WER | 1.65% | 1.08% |
| ja (Japanese) | CER | 2.40% | 1.82% |
| km (Khmer) | CER | 2.05% | 75.15% |
| ko (Korean) | CER | 0.95% | 0.29% |
| lo (Lao) | CER | 1.90% | 87.40% |
| ms (Malay) | WER | 1.75% | 1.41% |
| my (Burmese) | CER | 1.42% | 85.27% |
| nl (Dutch) | WER | 1.25% | 1.68% |
| no (Norwegian) | WER | 2.49% | 3.76% |
| pl (Polish) | WER | 1.90% | 1.65% |
| pt (Portuguese) | WER | 1.48% | 1.49% |
| ru (Russian) | WER | 0.90% | 0.86% |
| sv (Swedish) | WER | 2.22% | 2.63% |
| sw (Swahili) | CER | 1.07% | 2.02% |
| th (Thai) | CER | 0.94% | 1.92% |
| tl (Tagalog) | WER | 2.63% | 4.00% |
| tr (Turkish) | WER | 1.65% | 1.65% |
| vi (Vietnamese) | WER | 1.56% | 5.56% |
| zh (Chinese) | CER | 0.92% | 1.02% |
| Average (30 languages) | | 1.68% | - |
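
For readers unfamiliar with the two metrics in this table, WER and CER are both an edit distance divided by the reference length, counted over words and over characters respectively. The sketch below shows the standard definitions; it is not the benchmark's own scoring code, and any text normalization applied before scoring (casing, punctuation) is unknown and left out.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (two-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring whitespace."""
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


print(wer("the cat sat", "the dog sat"))
print(cer("你好世界", "你好世"))
```

CER is the usual choice for languages written without spaces between words, which matches the languages marked CER in the table.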

InstructTTSEval

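Instruction-guided voice design results

| Model | ZH APS ⬆ | ZH DSD ⬆ | ZH RP ⬆ | EN APS ⬆ | EN DSD ⬆ | EN RP ⬆ |
|---|---|---|---|---|---|---|
| Hume | | | | 83.0 | 75.3 | 54.3 |
| VoxInstruct | 47.5 | 52.3 | 42.6 | 54.9 | 57.0 | 39.3 |
| Parler-tts-mini | | | | 63.4 | 48.7 | 28.6 |
| Parler-tts-large | | | | 60.0 | 45.9 | 31.2 |
| PromptTTS | | | | 64.3 | 47.2 | 31.4 |
| PromptStyle | | | | 57.4 | 46.4 | 30.9 |
| VoiceSculptor | 75.7 | 64.7 | 61.5 | | | |
| Mimo-Audio-7B-Instruct | 75.7 | 74.3 | 61.5 | 80.6 | 77.6 | 59.5 |
| Qwen3TTS-12Hz-1.7B-VD | 85.2 | 81.1 | 65.1 | 82.9 | 82.4 | 68.4 |
| VoxCPM2 | 85.2 | 71.5 | 60.8 | 84.2 | 83.2 | 71.4 |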

⚙️ Fine-tuning

VoxCPM supports both full fine-tuning (SFT) and LoRA fine-tuning. With as little as 5–10 minutes of audio, you can adapt to a specific speaker, language, or domain.

# LoRA fine-tuning (parameter-efficient, recommended)
python scripts/train_voxcpm_finetune.py \
    --config_path conf/voxcpm_v2/voxcpm_finetune_lora.yaml

# Full fine-tuning
python scripts/train_voxcpm_finetune.py \
    --config_path conf/voxcpm_v2/voxcpm_finetune_all.yaml

# WebUI for training & inference
python lora_ft_webui.py   # then open http://localhost:7860

Full guide → Fine-tuning Guide (data preparation, configuration, training, LoRA hot-swapping, FAQ)


📚 Documentation

Full documentation: voxcpm.readthedocs.io

| Topic | Link |
|---|---|
| Quick Start & Installation | Quick Start |
| Usage Guide & Cookbook | User Guide |
| VoxCPM Series | Models |
| Fine-tuning (SFT & LoRA) | Fine-tuning Guide |
| FAQ & Troubleshooting | FAQ |

🌟 Ecosystem & Community

| Project | Description |
|---|---|
| Nano-vLLM | High-throughput and fast GPU serving |
| vLLM-Omni | Official vLLM omni-modal serving for VoxCPM2: PagedAttention, OpenAI-compatible API |
| VoxCPM.cpp | GGML/GGUF: CPU, CUDA, Vulkan inference |
| VoxCPM-ONNX | ONNX export for CPU inference |
| VoxCPMANE | Apple Neural Engine backend |
| voxcpm_rs | Rust re-implementation |
| ComfyUI-VoxCPM | ComfyUI node-based workflows |
| ComfyUI_RH_VoxCPM | Feature-complete ComfyUI workflow for VoxCPM 2 with multi-speaker generation, LoRA, and auto-ASR |
| ComfyUI-VoxCPMTTS | ComfyUI TTS extension |
| TTS WebUI | Browser-based TTS extension |

See the full Ecosystem in the docs. Community projects are not officially maintained by OpenBMB. Built something cool? Open an issue or PR to add it!


⚠️ Risks and Limitations

  • Potential for Misuse: VoxCPM’s voice cloning can generate highly realistic synthetic speech. It is strictly forbidden to use VoxCPM for impersonation, fraud, or disinformation. We strongly recommend clearly marking any AI-generated content.
  • Controllable Generation Stability: Voice Design and Controllable Voice Cloning results can vary between runs, so you may need to generate 1 to 3 times to get the desired voice or style. We are actively working on improving controllability consistency.
  • Language Coverage: VoxCPM2 officially supports 30 languages. For languages not on the list, you are welcome to test directly or try fine-tuning on your own data. We plan to expand language coverage in future releases.
  • Usage: This model is released under the Apache-2.0 license. For production deployments, we recommend conducting thorough testing and safety evaluation tailored to your use case.
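
Because Voice Design and Controllable Voice Cloning results vary between runs, one practical pattern is to generate a few candidates in one go and pick the best by listening. The sketch below is an illustration only: `generate_candidates` is a hypothetical helper, and it uses a stand-in function in place of `model.generate` so that it runs without the model.

```python
def generate_candidates(generate_fn, text, attempts=3, **kwargs):
    """Run generate_fn `attempts` times and return all results in order.

    Hypothetical helper: generate_fn is any callable that accepts text=...
    (for example model.generate) and returns one audio result per call.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return [generate_fn(text=text, **kwargs) for _ in range(attempts)]


# Stand-in for model.generate so the sketch runs without the model.
def fake_generate(text, **kwargs):
    return f"audio for: {text}"


candidates = generate_candidates(
    fake_generate, "(calm, slow pace)Hello.", attempts=3
)
print(len(candidates), "candidates generated")
```

With the real model, each returned array could be written to its own file (for example `candidate_0.wav`, `candidate_1.wav`) with `sf.write` and compared by ear.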

📖 Citation

If you find VoxCPM helpful, please consider citing our work and starring ⭐ the repository!

@article{voxcpm2_2026,
  title   = {VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning},
  author  = {VoxCPM Team},
  journal = {GitHub},
  year    = {2026},
}

@article{voxcpm2025,
  title   = {VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation
             and True-to-Life Voice Cloning},
  author  = {Zhou, Yixuan and Zeng, Guoyang and Liu, Xin and Li, Xiang and
             Yu, Renjie and Wang, Ziyang and Ye, Runchuan and Sun, Weiyue and
             Gui, Jiancheng and Li, Kehan and Wu, Zhiyong and Liu, Zhiyuan},
  journal = {arXiv preprint arXiv:2509.24650},
  year    = {2025},
}

📄 License

VoxCPM model weights and code are open-sourced under the Apache-2.0 license.

🙏 Acknowledgments

  • DiTAR for the diffusion autoregressive backbone
  • MiniCPM-4 for the language model foundation
  • CosyVoice for the Flow Matching-based LocDiT implementation
  • DAC for the Audio VAE backbone
  • Our community users for trying VoxCPM, reporting issues, sharing ideas, and contributing—your support helps the project keep getting better

Institutions

ModelBest     THUHCSI
