Tested out VoxCPM2 (Open-Source TTS) locally. The "Ultimate Cloning" mode capturing breathing/accents is getting insane.
Summary
A technical breakdown and benchmarks of VoxCPM2, an open-source TTS model whose Ultimate Cloning mode captures breathing and accents. Tested locally, it showed a low VRAM footprint and retained accents across languages.
Similar Articles
openbmb/VoxCPM2
VoxCPM2 is an open-source, tokenizer-free diffusion autoregressive Text-to-Speech model supporting 30 languages with 2B parameters, 48kHz audio output, and features including voice design from natural language descriptions, controllable voice cloning, and real-time streaming capabilities.
OpenBMB/VoxCPM
OpenBMB releases VoxCPM2, a 2B-parameter tokenizer-free TTS model trained on 2M+ hours of multilingual speech data, supporting 30 languages, voice design, controllable cloning, and 48kHz output.
@Honcia13: Open-source TTS is going crazy! New weapons for industrial park scams? Tsinghua OpenBMB just released VoxCPM2: 2 billion parameters + 2 million hours of multilingual data training, 48kHz studio-quality sound! The most intense part: no tokenizer needed at all, performing diffusion autoregression directly in continuous latent space, maximizing detail retention!
Tsinghua University's OpenBMB has released VoxCPM2, an open-source multilingual TTS model with 2 billion parameters. It performs tokenizer-free diffusion autoregressive generation in a continuous latent space, offering 48kHz studio-quality audio and powerful voice cloning and design capabilities.
@QT9277: "No way, AI voice synthesis has gotten this insane???" I was browsing GitHub today and was completely stunned. VoxCPM2, trending #1, over 20k stars, blowing up overseas. I thought it was another PPT open-source project, but after carefully checking the demo—my ears really couldn't tell which one was real. …
Introducing VoxCPM2, an open-source multilingual voice synthesis model that is completely free for commercial use. It supports voice design, cloning, and 48kHz high-quality output, and ranked #1 on GitHub trending.
seshat-tts: A local real-time narrator for games that supports voice cloning
seshat-tts is an open-source tool that enables real-time game narration with voice cloning, using OCR or an LLM for text extraction and local synthesis with pocket-tts. Voice cloning takes ~10 seconds on an RTX 2070 Super and runs on CPU after caching.