openbmb/VoxCPM2
Summary
VoxCPM2 is an open-source, tokenizer-free diffusion autoregressive Text-to-Speech model with 2B parameters. It supports 30 languages, outputs 48kHz audio, and offers voice design from natural-language descriptions, controllable voice cloning, and real-time streaming.
Cached at: 04/20/26, 02:45 PM
openbmb/VoxCPM2 · Hugging Face
Source: https://huggingface.co/openbmb/VoxCPM2
VoxCPM2 is a tokenizer-free, diffusion autoregressive Text-to-Speech model: 2B parameters, 30 languages, 48kHz audio output, trained on over 2 million hours of multilingual speech data.
Highlights
- 30-Language Multilingual: No language tag needed; input text in any supported language directly
- Voice Design: Generate a novel voice from a natural-language description alone (gender, age, tone, emotion, pace…); no reference audio required
- Controllable Cloning: Clone any voice from a short clip, with optional style guidance to steer emotion, pace, and expression while preserving timbre
- Ultimate Cloning: Provide reference audio + its transcript for audio-continuation cloning; every vocal nuance faithfully reproduced
- 48kHz Studio-Quality Output: Accepts 16kHz reference; outputs 48kHz via AudioVAE V2's built-in super-resolution, no external upsampler needed
- Context-Aware Synthesis: Automatically infers appropriate prosody and expressiveness from text content
- Real-Time Streaming: RTF as low as ~0.3 on NVIDIA RTX 4090, and ~0.13 when accelerated by Nano-vLLM
- Fully Open-Source & Commercial-Ready: Apache-2.0 license, free for commercial use
**Supported Languages (30):** Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese
Chinese Dialects: Sichuanese, Cantonese, Wu, Northeastern (Dongbei), Henan, Shaanxi, Shandong, Tianjin, Minnan (Southern Min)
Quick Start
Installation
pip install voxcpm
**Requirements:** Python ≥ 3.10, PyTorch ≥ 2.5.0, CUDA ≥ 12.0 · Full Quick Start →
Text-to-Speech
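Before installing, it can help to confirm that the local environment meets these minimums. The following is a minimal sketch, not part of the VoxCPM2 package: `parse_version`, `meets_minimum`, and `check_environment` are hypothetical helper names, and the check assumes that `torch.version.cuda` reports the CUDA version PyTorch was built against (it is `None` on CPU-only builds).

```python
import re
import sys


def parse_version(text):
    """Turn a string such as '2.5.1+cu121' into (2, 5, 1)."""
    core = text.split("+")[0]  # drop local build suffixes like '+cu121'
    parts = []
    for piece in core.split("."):
        match = re.match(r"\d+", piece)
        if match is None:  # stop at non-numeric parts such as 'dev2025'
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings, padding the shorter one with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment():
    """Return a dict of requirement name -> bool for the stated minimums."""
    python_version = ".".join(str(n) for n in sys.version_info[:3])
    report = {"python": meets_minimum(python_version, "3.10")}
    try:
        import torch
    except ImportError:
        report["torch"] = False
        report["cuda"] = False
        return report
    report["torch"] = meets_minimum(torch.__version__, "2.5.0")
    cuda = torch.version.cuda  # None on CPU-only builds
    report["cuda"] = bool(cuda) and meets_minimum(cuda, "12.0")
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'below minimum or missing'}")
```

The version comparison is deliberately simple; a production check would more likely rely on the `packaging` library.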
from voxcpm import VoxCPM
import soundfile as sf
model = VoxCPM.from_pretrained("openbmb/VoxCPM2", load_denoiser=False)
wav = model.generate(
    text="VoxCPM2 brings multilingual support, creative voice design, and controllable voice cloning.",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("output.wav", wav, model.tts_model.sample_rate)
Voice Design
Put the voice description in parentheses at the start of the text argument, followed by the content to synthesize:
wav = model.generate(
    text="(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("voice_design.wav", wav, model.tts_model.sample_rate)
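The parenthesized-prefix convention above is easy to get wrong when descriptions come from user input. The sketch below shows one way to build the prompt string. `with_voice_description` is a hypothetical helper, not part of the `voxcpm` API, and rejecting parentheses inside the description is a cautious assumption, since the model card does not say how nested parentheses are handled.

```python
def with_voice_description(description, content):
    """Prefix `content` with a voice description in parentheses.

    Mirrors the "(description)content" layout shown in the model card.
    """
    description = description.strip()
    if not description:
        # No description: plain text-to-speech input.
        return content
    if "(" in description or ")" in description:
        # Assumption: nested parentheses could be ambiguous, so refuse them.
        raise ValueError("description must not contain parentheses")
    return f"({description}){content}"


prompt = with_voice_description(
    "A young woman, gentle and sweet voice",
    "Hello, welcome to VoxCPM2!",
)
print(prompt)
```

The resulting string would then be passed as `text=` to `model.generate`, exactly as in the example above.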
Controllable Voice Cloning
# Basic cloning
wav = model.generate(
    text="This is a cloned voice generated by VoxCPM2.",
    reference_wav_path="speaker.wav",
)
sf.write("clone.wav", wav, model.tts_model.sample_rate)
# Cloning with style control
wav = model.generate(
    text="(slightly faster, cheerful tone)This is a cloned voice with style control.",
    reference_wav_path="speaker.wav",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("controllable_clone.wav", wav, model.tts_model.sample_rate)
Ultimate Cloning
Provide both the reference audio and its exact transcript for maximum fidelity. Pass the same clip to both reference_wav_path and prompt_wav_path for highest similarity:
wav = model.generate(
    text="This is an ultimate cloning demonstration using VoxCPM2.",
    prompt_wav_path="speaker_reference.wav",
    prompt_text="The transcript of the reference audio.",
    reference_wav_path="speaker_reference.wav",
)
sf.write("hifi_clone.wav", wav, model.tts_model.sample_rate)
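The three cloning modes differ only in which keyword arguments are passed to `model.generate`. The sketch below collects them in one place. `cloning_kwargs` is a hypothetical helper; the keyword names it emits (`text`, `reference_wav_path`, `prompt_wav_path`, `prompt_text`) are taken from the examples above, while the rule that a prompt clip and its transcript must be supplied together is an assumption drawn from the wording of this section rather than documented library behavior.

```python
def cloning_kwargs(text, reference_wav=None, prompt_wav=None,
                   prompt_text=None, style=None):
    """Build keyword arguments for model.generate() for a cloning mode."""
    # Assumption: audio-continuation cloning needs the clip AND its transcript.
    if (prompt_wav is None) != (prompt_text is None):
        raise ValueError("prompt_wav and prompt_text must be given together")

    # Optional style guidance uses the parenthesized prefix convention.
    kwargs = {"text": f"({style}){text}" if style else text}
    if reference_wav is not None:
        kwargs["reference_wav_path"] = reference_wav
    if prompt_wav is not None:
        kwargs["prompt_wav_path"] = prompt_wav
        kwargs["prompt_text"] = prompt_text
    return kwargs


# Controllable cloning with style guidance.
print(cloning_kwargs("Hello there.", reference_wav="speaker.wav",
                     style="slightly faster, cheerful tone"))

# Ultimate cloning: same clip for reference and prompt, plus its transcript.
print(cloning_kwargs("Hello there.",
                     reference_wav="speaker_reference.wav",
                     prompt_wav="speaker_reference.wav",
                     prompt_text="The transcript of the reference audio."))
```

With a loaded model, the result would be used as `wav = model.generate(**cloning_kwargs(...))`.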
Streaming
import numpy as np
chunks = []
# Collect audio chunks as they are produced
for chunk in model.generate_streaming(text="Streaming is easy with VoxCPM!"):
    chunks.append(chunk)
# Join the chunks into a single waveform before writing
wav = np.concatenate(chunks)
sf.write("streaming.wav", wav, model.tts_model.sample_rate)
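The example above buffers every chunk in memory before writing. For long inputs it may be preferable to write each chunk as it arrives. The sketch below does this with Python's standard `wave` module. `write_stream_to_wav` is a hypothetical helper, and it assumes each chunk is a one-dimensional sequence of mono float samples in the range [-1, 1], which the model card does not state explicitly.

```python
import struct
import wave


def write_stream_to_wav(chunks, path, sample_rate):
    """Write an iterable of float sample chunks to a 16-bit mono WAV file.

    Returns the total number of samples written.
    """
    total = 0
    with wave.open(path, "wb") as out:
        out.setnchannels(1)       # assumption: mono output
        out.setsampwidth(2)       # 16-bit PCM
        out.setframerate(sample_rate)
        for chunk in chunks:
            # Clip to [-1, 1] and scale to the signed 16-bit range.
            pcm = [int(max(-1.0, min(1.0, float(s))) * 32767) for s in chunk]
            out.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
            total += len(pcm)
    return total


if __name__ == "__main__":
    # Stand-in chunks; with a loaded model, pass
    # model.generate_streaming(text=...) and model.tts_model.sample_rate.
    fake_chunks = [[0.0, 0.25, 0.5], [-0.5, -1.0]]
    print(write_stream_to_wav(fake_chunks, "streaming_demo.wav", 48000))
```

The per-sample Python loop is slow for real workloads; it keeps the sketch dependency-free, and a NumPy-based conversion would be the natural replacement.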
Model Details

| Property | Value |
|---|---|
| Architecture | Tokenizer-free Diffusion Autoregressive (LocEnc → TSLM → RALM → LocDiT) |
| Backbone | Based on MiniCPM-4, 2B parameters in total |
| Audio VAE | AudioVAE V2 (asymmetric encode/decode, 16kHz in → 48kHz out) |
| Training Data | 2M+ hours multilingual speech |
| LM Token Rate | 6.25 Hz |
| Max Sequence Length | 8192 tokens |
| dtype | bfloat16 |
| VRAM | ~8 GB |
| RTF (RTX 4090) | ~0.30 (standard) / ~0.13 (Nano-vLLM) |
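The RTF and token-rate figures above can be turned into rough planning numbers. The sketch below uses the conventional definition of real-time factor (synthesis time divided by audio duration; the model card does not describe its measurement setup) and treats the maximum sequence length as if every token were an audio token, which gives an upper bound only, since text and prompt tokens presumably share the same budget. All function names are hypothetical.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def audio_seconds_for_tokens(num_tokens, token_rate_hz=6.25):
    """Audio duration covered by `num_tokens` at the stated LM token rate."""
    return num_tokens / token_rate_hz


def samples_per_token(sample_rate, token_rate_hz=6.25):
    """Output samples represented by one LM token."""
    return sample_rate / token_rate_hz


# 10 s of audio synthesized in 3 s corresponds to an RTF of 0.3.
print(real_time_factor(3.0, 10.0))

# Upper bound: 8192 tokens at 6.25 Hz is about 1310.72 s (roughly 21.8 min).
print(audio_seconds_for_tokens(8192))

# At 48 kHz output, one token spans 7680 samples (160 ms).
print(samples_per_token(48000))
```

An RTF below 1 means audio is produced faster than it plays back, which is the precondition for real-time streaming.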
Performance
VoxCPM2 achieves state-of-the-art or competitive results on major zero-shot and controllable TTS benchmarks.
See the GitHub repo for full benchmark tables (Seed-TTS-eval, CV3-eval, InstructTTSEval, MiniMax Multilingual Test).
Fine-tuning
VoxCPM2 supports both full SFT and LoRA fine-tuning with as little as 5–10 minutes of audio:
# LoRA fine-tuning (recommended)
python scripts/train_voxcpm_finetune.py \
    --config_path conf/voxcpm_v2/voxcpm_finetune_lora.yaml
# Full fine-tuning
python scripts/train_voxcpm_finetune.py \
    --config_path conf/voxcpm_v2/voxcpm_finetune_all.yaml
See the Fine-tuning Guide for full instructions.
Limitations
- Voice Design and Style Control results may vary between runs; generating 1–3 times is recommended to obtain the desired output.
- Performance varies across languages depending on training data availability.
- Occasional instability may occur with very long or highly expressive inputs.
- Use for impersonation, fraud, or disinformation is strictly forbidden. AI-generated content should be clearly labeled.
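Because Voice Design and Style Control results vary between runs, the recommendation to generate a few times can be wrapped in a small loop. The sketch below is illustrative only: `generate_candidates` is a hypothetical helper, and `score_fn` stands in for whatever selection criterion is available (manual listening is the likely default, in which case it is omitted).

```python
def generate_candidates(generate_fn, attempts=3, score_fn=None):
    """Call `generate_fn` several times and return all results.

    If `score_fn` is given, results are ordered best-first by its score.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    candidates = [generate_fn() for _ in range(attempts)]
    if score_fn is not None:
        candidates.sort(key=score_fn, reverse=True)
    return candidates


if __name__ == "__main__":
    # Stand-in generator; with a loaded model this could be
    # lambda: model.generate(text="(calm, low voice)Hello.", cfg_value=2.0)
    outputs = iter(["take-1", "take-2", "take-3"])
    print(generate_candidates(lambda: next(outputs), attempts=3))
```

Each candidate could then be written to its own file for comparison before choosing one.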
Citation
@article{voxcpm2_2026,
  title   = {VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning},
  author  = {VoxCPM Team},
  journal = {GitHub},
  year    = {2026},
}
@article{voxcpm2025,
  title   = {VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning},
  author  = {Zhou, Yixuan and Zeng, Guoyang and Liu, Xin and Li, Xiang and
             Yu, Renjie and Wang, Ziyang and Ye, Runchuan and Sun, Weiyue and
             Gui, Jiancheng and Li, Kehan and Wu, Zhiyong and Liu, Zhiyuan},
  journal = {arXiv preprint arXiv:2509.24650},
  year    = {2025},
}
License
Released under the Apache-2.0 license, free for commercial use. For production deployments, we recommend thorough testing and safety evaluation tailored to your use case.