Tested out VoxCPM2 (Open-Source TTS) locally. The "Ultimate Cloning" mode capturing breathing/accents is getting insane.

Reddit r/ArtificialInteligence 06/04/26, 03:01 AM Models

Summary

Technical breakdown and benchmarks of VoxCPM2, an open-source TTS model featuring Ultimate Cloning Mode for capturing breathing and accents, tested locally with low VRAM footprint and cross-lingual accent retention.

Hey everyone, I’ve been diving deep into open-source Text-to-Speech models to build local automation workflows, and I wanted to share my technical breakdown and benchmarks for **VoxCPM2**.

Most open-source TTS models struggle with emotional flatness or metallic artifacts. VoxCPM2 has a mode called **"Ultimate Cloning Mode"** that tries to close this gap by modeling the non-verbal elements of human speech.

# 1. Key Technical Features Tested:

* **Micro-Detail Capture:** Unlike standard Bark or Tortoise-based models, it captures breathing gaps, micro-pauses, and natural human speech rhythm.
* **Local VRAM Footprint:** It runs entirely locally. VRAM consumption is highly optimized, making it viable for local MicroSaaS backend integration or pipeline automation without racking up heavy API bills.
* **Cross-Lingual Accent Retention:** Tested across its 30+ supported languages. The model retains the core voice timbre and character even when the speaker is forced to speak a completely foreign language.

# 2. The Sandbox Architecture:

For this benchmark, I isolated the model locally and fed it a clean 15-second studio voice sample. The pipeline was set to output studio-grade 48kHz audio. The alignment between the synthesized phonemes and the original audio's emotional curve was surprisingly tight.

# 3. 55-Second Audio Comparison & Benchmark Walkthrough:

I recorded the exact terminal execution, VRAM behavior, and a side-by-side audio output comparison (Original Voice vs Cloned Voice generating technical prose) in a quick breakdown video. You can listen to the raw voice replication quality and check the real-time processing speed directly here: [**https://youtube.com/shorts/qIKywJXLQhU**](https://youtube.com/shorts/qIKywJXLQhU)
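If you want to sanity-check a similar setup yourself, here is a minimal sketch of the bookkeeping around the run. It does not call VoxCPM2 at all. It only uses Python's standard `wave` module to confirm that a reference clip is roughly 15 seconds long and to read the sample rate of an output file, and it computes a real-time factor from timings you measure yourself. The helper names (`inspect_wav`, `check_reference`, `real_time_factor`) and the 2-second tolerance are my own assumptions for illustration, not part of the model or its tooling.

```python
import wave
from dataclasses import dataclass


@dataclass
class ClipInfo:
    sample_rate: int   # samples per second (Hz)
    channels: int      # 1 = mono, 2 = stereo
    duration_s: float  # clip length in seconds


def inspect_wav(path: str) -> ClipInfo:
    """Read basic properties of a PCM WAV file (the stdlib reader does not handle float WAVs)."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        return ClipInfo(rate, wf.getnchannels(), wf.getnframes() / rate)


def check_reference(info: ClipInfo, target_s: float = 15.0,
                    tolerance_s: float = 2.0) -> list[str]:
    """Return warnings if the clip is far from the ~15 s reference length.

    The tolerance is an arbitrary illustrative value.
    """
    warnings = []
    if abs(info.duration_s - target_s) > tolerance_s:
        warnings.append(
            f"reference is {info.duration_s:.1f}s, expected about {target_s:.0f}s"
        )
    return warnings


def real_time_factor(synthesis_s: float, audio_s: float) -> float:
    """Wall-clock synthesis time divided by generated audio length.

    Values below 1.0 mean generation is faster than real time.
    """
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_s / audio_s
```

To use it, time the synthesis call with `time.perf_counter()`, run `inspect_wav` on the generated file to confirm it really is 48000 Hz, and pass the elapsed time and the file's duration to `real_time_factor`.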

Similar Articles

openbmb/VoxCPM2

OpenBMB/VoxCPM

seshat-tts: A local real-time narrator for games that supports voice cloning

@Honcia13: Open-source TTS is going crazy! New weapons for industrial park scams? Tsinghua OpenBMB just released VoxCPM2: 20 billion parameters + 2 million hours of multilingual data training, 48kHz studio-quality sound! The most intense part is—no Tokenizer needed at all, performing diffusion autoregression directly in continuous latent space, maximizing detail retention!

@QT9277: "No way, AI voice synthesis has gotten this insane???" I was browsing GitHub today and was completely stunned. VoxCPM2, trending #1, over 20k stars, blowing up overseas. I thought it was another PPT open-source project, but after carefully checking the demo—my ears really couldn't tell which one was real. …
