openbmb/VoxCPM2

Hugging Face Models Trending 04/03/26, 05:25 AM Models

text-to-speech multilingual open-source voice-synthesis diffusion-model tts audio-generation

Summary

VoxCPM2 is an open-source, tokenizer-free diffusion autoregressive Text-to-Speech model supporting 30 languages with 2B parameters, 48kHz audio output, and features including voice design from natural language descriptions, controllable voice cloning, and real-time streaming capabilities.

Task: text-to-speech
Tags: voxcpm, safetensors, text-to-speech, tts, multilingual, voice-cloning, voice-design, diffusion, audio, zh, en, ar, my, da, nl, fi, fr, de, el, he, hi, id, it, ja, km, ko, lo, ms, no, pl, pt, ru, es, sw, sv, tl, th, tr, vi, arxiv:2509.24650, license:apache-2.0, region:us


Cached at: 04/20/26, 02:45 PM


Source: https://huggingface.co/openbmb/VoxCPM2

VoxCPM2 is a tokenizer-free, diffusion autoregressive Text-to-Speech model with 2B parameters, 30 languages, and 48kHz audio output, trained on over 2 million hours of multilingual speech data.

Highlights

🌍 30-Language Multilingual: No language tag needed; input text in any supported language directly
🎨 Voice Design: Generate a novel voice from a natural-language description alone (gender, age, tone, emotion, pace…); no reference audio required
🎛️ Controllable Cloning: Clone any voice from a short clip, with optional style guidance to steer emotion, pace, and expression while preserving timbre
🎙️ Ultimate Cloning: Provide reference audio + its transcript for audio-continuation cloning; every vocal nuance faithfully reproduced
🔊 48kHz Studio-Quality Output: Accepts 16kHz reference; outputs 48kHz via AudioVAE V2’s built-in super-resolution, no external upsampler needed
🧠 Context-Aware Synthesis: Automatically infers appropriate prosody and expressiveness from text content
⚡ Real-Time Streaming: RTF as low as ~0.3 on NVIDIA RTX 4090, and ~0.13 accelerated by Nano-vLLM
📜 Fully Open-Source & Commercial-Ready: Apache-2.0 license, free for commercial use
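
The real-time factor (RTF) quoted above is conventionally the ratio of synthesis time to the duration of the audio produced, so values below 1.0 mean audio is generated faster than it plays back. The sketch below works through that arithmetic, assuming this conventional definition (the model card does not spell it out). The function names are illustrative and are not part of the voxcpm package.

```python
def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimated wall-clock time to synthesize `audio_seconds` of speech."""
    return audio_seconds * rtf


def is_faster_than_real_time(rtf: float) -> bool:
    """True when audio is produced faster than it plays back."""
    return rtf < 1.0


# Approximate RTF figures quoted in the highlights (NVIDIA RTX 4090).
STANDARD_RTF = 0.30
NANO_VLLM_RTF = 0.13

for label, rtf in [("standard", STANDARD_RTF), ("Nano-vLLM", NANO_VLLM_RTF)]:
    estimate = synthesis_seconds(10.0, rtf)
    print(f"{label}: 10 s of audio in roughly {estimate:.1f} s")
```

Both quoted figures are well below 1.0, which is what makes chunk-by-chunk streaming playback feasible on that hardware.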

**Supported Languages (30):** Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese

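Chinese Dialects: Sichuanese (四川话), Cantonese (粤语), Wu (吴语), Northeastern Mandarin (东北话), Henan dialect (河南话), Shaanxi dialect (陕西话), Shandong dialect (山东话), Tianjin dialect (天津话), Southern Min (闽南话)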

Quick Start

Installation

pip install voxcpm

**Requirements:** Python ≥ 3.10, PyTorch ≥ 2.5.0, CUDA ≥ 12.0 · Full Quick Start →
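
Before installing, it can help to confirm that the interpreter and any existing PyTorch install meet the stated minimums. The following is a minimal sketch using only the standard library. The helper names are hypothetical, the version parsing is deliberately simple, and the CUDA requirement is not checked here.

```python
import re
import sys
from importlib import metadata

MIN_PYTHON = "3.10"
MIN_TORCH = "2.5.0"


def parse_version(text: str) -> tuple:
    """Turn '2.5.1+cu121' into (2, 5, 1), padding missing parts with zeros."""
    numbers = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    numbers = numbers[:3]
    return tuple(numbers + [0] * (3 - len(numbers)))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two dotted version strings numerically."""
    return parse_version(installed) >= parse_version(minimum)


def check_environment() -> dict:
    """Return {'python': bool, 'torch': bool or None}; None means torch is absent."""
    python_version = ".".join(str(part) for part in sys.version_info[:3])
    try:
        torch_ok = meets_minimum(metadata.version("torch"), MIN_TORCH)
    except metadata.PackageNotFoundError:
        torch_ok = None
    return {"python": meets_minimum(python_version, MIN_PYTHON), "torch": torch_ok}


print(check_environment())
```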

Text-to-Speech

from voxcpm import VoxCPM
import soundfile as sf

model = VoxCPM.from_pretrained("openbmb/VoxCPM2", load_denoiser=False)

wav = model.generate(
    text="VoxCPM2 brings multilingual support, creative voice design, and controllable voice cloning.",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("output.wav", wav, model.tts_model.sample_rate)

Voice Design

Put the voice description in parentheses at the start of the text argument, followed by the content to synthesize:

wav = model.generate(
    text="(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("voice_design.wav", wav, model.tts_model.sample_rate)
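
When descriptions come from user input, it may be convenient to assemble the "(description)content" string in one place. The helper below is a hypothetical sketch and not part of the voxcpm package. Rejecting parentheses inside the description is an assumption made here to keep the prefix unambiguous; the model card does not state how nested parentheses are handled.

```python
def with_voice_description(description: str, content: str) -> str:
    """Prefix `content` with a parenthesized voice or style description."""
    description = description.strip()
    if not description:
        # No description: plain synthesis text.
        return content
    if "(" in description or ")" in description:
        # Assumption: nested parentheses would make the prefix ambiguous.
        raise ValueError("description must not contain parentheses")
    return f"({description}){content}"


prompt = with_voice_description(
    "A young woman, gentle and sweet voice",
    "Hello, welcome to VoxCPM2!",
)
print(prompt)
```

The resulting string can be passed as the text argument of model.generate, exactly as in the example above.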

Controllable Voice Cloning

# Basic cloning
wav = model.generate(
    text="This is a cloned voice generated by VoxCPM2.",
    reference_wav_path="speaker.wav",
)
sf.write("clone.wav", wav, model.tts_model.sample_rate)

# Cloning with style control
wav = model.generate(
    text="(slightly faster, cheerful tone)This is a cloned voice with style control.",
    reference_wav_path="speaker.wav",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("controllable_clone.wav", wav, model.tts_model.sample_rate)

Ultimate Cloning

Provide both the reference audio and its exact transcript for maximum fidelity. Pass the same clip to both reference_wav_path and prompt_wav_path for the highest similarity:

wav = model.generate(
    text="This is an ultimate cloning demonstration using VoxCPM2.",
    prompt_wav_path="speaker_reference.wav",
    prompt_text="The transcript of the reference audio.",
    reference_wav_path="speaker_reference.wav",
)
sf.write("hifi_clone.wav", wav, model.tts_model.sample_rate)
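
The cloning modes differ only in which keyword arguments are passed to model.generate. The sketch below collects them in a hypothetical helper that builds the argument dictionary. The keyword names come from the examples above. Requiring prompt_wav_path and prompt_text together is an assumption based on the instruction to supply the transcript with the audio, not a documented rule of the library.

```python
from typing import Optional


def cloning_kwargs(
    text: str,
    reference_wav: Optional[str] = None,
    prompt_wav: Optional[str] = None,
    prompt_text: Optional[str] = None,
) -> dict:
    """Build keyword arguments for basic, controllable, or ultimate cloning."""
    if (prompt_wav is None) != (prompt_text is None):
        # Assumption: the prompt audio and its transcript only make sense as a pair.
        raise ValueError("prompt_wav and prompt_text must be given together")
    kwargs = {"text": text}
    if reference_wav is not None:
        kwargs["reference_wav_path"] = reference_wav
    if prompt_wav is not None:
        kwargs["prompt_wav_path"] = prompt_wav
        kwargs["prompt_text"] = prompt_text
    return kwargs


# Ultimate cloning: the same clip is used as both reference and prompt.
ultimate = cloning_kwargs(
    "This is an ultimate cloning demonstration using VoxCPM2.",
    reference_wav="speaker_reference.wav",
    prompt_wav="speaker_reference.wav",
    prompt_text="The transcript of the reference audio.",
)
print(sorted(ultimate))
```

With a loaded model, the dictionary would be unpacked as model.generate(**ultimate).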

Streaming

import numpy as np

chunks = []
for chunk in model.generate_streaming(text="Streaming is easy with VoxCPM!"):
    chunks.append(chunk)
wav = np.concatenate(chunks)
sf.write("streaming.wav", wav, model.tts_model.sample_rate)
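
The example above collects every chunk in memory before writing. For long outputs, an alternative is to write each chunk to disk as it arrives. The sketch below does this with the standard-library wave module. It assumes each chunk is a one-dimensional sequence of float samples in the range [-1.0, 1.0] and that the output is mono at 48kHz; check both against model.tts_model.sample_rate and the actual chunk format. The fake_stream generator is a stand-in so the sketch runs without the model.

```python
import io
import struct
import wave
from typing import Iterable, Sequence


def write_stream_to_wav(
    chunks: Iterable[Sequence[float]], dest, sample_rate: int = 48_000
) -> int:
    """Write float chunks to a 16-bit mono WAV as they arrive; return the sample count.

    `dest` is a path string or a writable, seekable binary file object.
    """
    total = 0
    with wave.open(dest, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 16-bit PCM
        wav_file.setframerate(sample_rate)
        for chunk in chunks:
            # Assumption: samples are floats in [-1.0, 1.0]; clip anything outside.
            pcm = [int(round(max(-1.0, min(1.0, float(s))) * 32767)) for s in chunk]
            wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
            total += len(pcm)
    return total


def fake_stream():
    """Stand-in for model.generate_streaming(...); yields two tiny chunks."""
    yield [0.0, 0.5]
    yield [-0.5, 1.0, -1.0]


buffer = io.BytesIO()
written = write_stream_to_wav(fake_stream(), buffer)
print(f"wrote {written} samples")
```

With the real model, the first argument would be model.generate_streaming(text=...) and dest a file path such as "streaming.wav".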

Model Details

| Property | Value |
| --- | --- |
| Architecture | Tokenizer-free Diffusion Autoregressive (LocEnc → TSLM → RALM → LocDiT) |
| Backbone | Based on MiniCPM-4, 2B parameters in total |
| Audio VAE | AudioVAE V2 (asymmetric encode/decode, 16kHz in → 48kHz out) |
| Training Data | 2M+ hours multilingual speech |
| LM Token Rate | 6.25 Hz |
| Max Sequence Length | 8192 tokens |
| dtype | bfloat16 |
| VRAM | ~8 GB |
| RTF (RTX 4090) | ~0.30 (standard) / ~0.13 (Nano-vLLM) |
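
A few back-of-the-envelope figures follow from these numbers. The sketch below is illustrative arithmetic only and rests on several assumptions: each LM token is taken to cover 1/6.25 s of audio, the maximum duration treats the whole 8192-token context as audio (text and prompt tokens would reduce it), "2B" is read as exactly 2 × 10^9 parameters, and the memory estimate covers weights only.

```python
LM_TOKEN_RATE_HZ = 6.25
REFERENCE_SAMPLE_RATE_HZ = 16_000
OUTPUT_SAMPLE_RATE_HZ = 48_000
MAX_SEQUENCE_TOKENS = 8192
PARAMETERS = 2_000_000_000  # "2B", taken as a round figure
BYTES_PER_BF16 = 2

# Audio covered by one LM token: 0.16 s, or 7680 output samples at 48kHz.
seconds_per_token = 1 / LM_TOKEN_RATE_HZ
output_samples_per_token = OUTPUT_SAMPLE_RATE_HZ / LM_TOKEN_RATE_HZ

# Ratio between the 48kHz output and the 16kHz reference input.
upsampling_factor = OUTPUT_SAMPLE_RATE_HZ // REFERENCE_SAMPLE_RATE_HZ

# Upper bound on audio length if every token in the context were audio
# (about 1310.72 s, a little under 22 minutes).
max_audio_seconds = MAX_SEQUENCE_TOKENS / LM_TOKEN_RATE_HZ

# bfloat16 weights alone: 4.0 GB (decimal), before activations and caches.
weight_gigabytes = PARAMETERS * BYTES_PER_BF16 / 1e9

print(f"{seconds_per_token:.2f} s per token, {output_samples_per_token:.0f} samples per token")
print(f"upsampling factor: {upsampling_factor}x")
print(f"context upper bound: {max_audio_seconds / 60:.1f} min of audio")
print(f"weights: {weight_gigabytes:.1f} GB")
```

The weights-only estimate sits below the ~8 GB VRAM figure in the table, which is consistent with the remainder going to activations, caches, and the audio VAE.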

Performance

VoxCPM2 achieves state-of-the-art or competitive results on major zero-shot and controllable TTS benchmarks.

See the GitHub repo for full benchmark tables (Seed-TTS-eval, CV3-eval, InstructTTSEval, MiniMax Multilingual Test).

Fine-tuning

VoxCPM2 supports both full SFT and LoRA fine-tuning with as little as 5–10 minutes of audio:

# LoRA fine-tuning (recommended)
python scripts/train_voxcpm_finetune.py \
    --config_path conf/voxcpm_v2/voxcpm_finetune_lora.yaml

# Full fine-tuning
python scripts/train_voxcpm_finetune.py \
    --config_path conf/voxcpm_v2/voxcpm_finetune_all.yaml

See the Fine-tuning Guide for full instructions.
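
Before starting a run, it may be useful to check how much audio a dataset folder actually contains against the 5–10 minute figure mentioned above. The sketch below sums durations with the standard-library wave module, which reads PCM WAV files only. The helper names and the five-minute threshold constant are illustrative assumptions, and the expected dataset layout for the training script is not covered here.

```python
import tempfile
import wave
from pathlib import Path

MIN_RECOMMENDED_SECONDS = 5 * 60  # lower end of the 5-10 minute range


def total_wav_seconds(folder) -> float:
    """Sum the durations of all PCM .wav files directly inside `folder`."""
    total = 0.0
    for path in sorted(Path(folder).glob("*.wav")):
        with wave.open(str(path), "rb") as wav_file:
            total += wav_file.getnframes() / wav_file.getframerate()
    return total


def write_silence(path, seconds: float, sample_rate: int = 16_000) -> None:
    """Demo helper: write a silent 16-bit mono WAV file."""
    with wave.open(str(path), "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * int(seconds * sample_rate))


# Demo on a temporary folder holding 1.0 s + 2.5 s of silence.
with tempfile.TemporaryDirectory() as folder:
    write_silence(Path(folder) / "a.wav", 1.0)
    write_silence(Path(folder) / "b.wav", 2.5)
    demo_seconds = total_wav_seconds(folder)

enough = demo_seconds >= MIN_RECOMMENDED_SECONDS
print(f"{demo_seconds:.1f} s of audio; meets the 5-minute minimum: {enough}")
```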

Limitations

Voice Design and Style Control results may vary between runs; generating 1–3 times is recommended to obtain the desired output.
Performance varies across languages depending on training data availability.
Occasional instability may occur with very long or highly expressive inputs.
Use for impersonation, fraud, or disinformation is strictly forbidden. AI-generated content should be clearly labeled.
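
Because Voice Design and Style Control results vary between runs, one practical approach to the first limitation is to generate a few candidates and choose among them by listening. The sketch below shows that loop with a hypothetical helper. fake_generate is a stand-in so the example runs without the model. Any callable that accepts a text keyword argument would work in its place.

```python
from typing import Any, Callable, List


def generate_candidates(
    generate_fn: Callable[..., Any], text: str, attempts: int = 3, **kwargs
) -> List[Any]:
    """Call `generate_fn` several times with the same inputs and keep every result."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return [generate_fn(text=text, **kwargs) for _ in range(attempts)]


def fake_generate(text: str, **kwargs) -> str:
    """Stand-in for model.generate; returns a label instead of audio."""
    fake_generate.calls += 1
    return f"take {fake_generate.calls}: {text}"


fake_generate.calls = 0

candidates = generate_candidates(
    fake_generate, "(calm, low voice)Good evening.", attempts=3
)
for candidate in candidates:
    print(candidate)
```

With the real model, generate_fn would be model.generate, and each returned waveform could be saved to its own file with sf.write for comparison.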

Citation

@article{voxcpm2_2026,
  title   = {VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning},
  author  = {VoxCPM Team},
  journal = {GitHub},
  year    = {2026},
}

@article{voxcpm2025,
  title   = {VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning},
  author  = {Zhou, Yixuan and Zeng, Guoyang and Liu, Xin and Li, Xiang and
             Yu, Renjie and Wang, Ziyang and Ye, Runchuan and Sun, Weiyue and
             Gui, Jiancheng and Li, Kehan and Wu, Zhiyong and Liu, Zhiyuan},
  journal = {arXiv preprint arXiv:2509.24650},
  year    = {2025},
}

License

Released under the Apache-2.0 license, free for commercial use. For production deployments, we recommend thorough testing and safety evaluation tailored to your use case.