Tag
MOSS TTS 1.5 is a new text-to-speech model with voice cloning capabilities, offered via a Hugging Face Space, and is considered better than Fish Audio S2 Pro due to open licensing.
seshat-tts is an open-source tool that enables real-time game narration with voice cloning, using OCR or an LLM for text extraction and local synthesis with pocket-tts. Voice cloning takes ~10 seconds on an RTX 2070 Super and runs on CPU after caching.
VoxCPM2, an open-source project on GitHub, performs AI voice cloning without reference audio, generating the target voice precisely from just one sentence. It has gained 20K stars.
A woman lost $5,400 after scammers used AI voice cloning to mimic her daughter's voice in a fake kidnapping scheme, highlighting the growing threat of AI-powered scams.
MOSS-TTS v1.5 is an updated open-source text-to-speech model with improved multilingual synthesis (supporting 31 languages), more stable zero-shot voice cloning, and explicit inline pause control.
GPT-SoVITS is an open-source AI voice cloning tool that supports high-fidelity zero-shot cloning (from a 5-second voice sample) and few-shot cloning (with 1 minute of training data), as well as cross-lingual inference, and comes with a complete WebUI toolchain. It has garnered 57.8k stars on GitHub, making it the leading open-source project in the voice cloning field.
X-Voice is a flow-matching-based multilingual text-to-speech system that enables zero-shot voice cloning across 30 languages, with open-source code, model, and demo available.
OmniVoice Studio is a free, open-source tool that locally dubs MP4 videos into 600 languages using Whisper for transcription, voice cloning from 3 seconds of audio, and Demucs for background separation, eliminating the need for paid subscriptions like ElevenLabs and HeyGen.
Voice-Pro is a web tool that integrates six top open-source models (Whisper, Demucs, CosyVoice, F5-TTS, etc.), supporting YouTube video downloading, vocal removal, transcription, translation, voice cloning, and fully automatic dubbing. It takes less than 2 minutes, runs 100% locally, and is free.
NetEase Youdao has open-sourced ZiYue 4, a 27B-parameter model that achieves SOTA in math and science. Its voice feature supports cross-language voice cloning from 3 seconds of audio across 14 languages without accent artifacts. The company also open-sourced its all-scenario intelligent agent 'Longxia' (Lobster).
StepFun launches the Step Plan subscription at $6.99/month, integrating LLM, TTS, ASR, image generation, and other AI models. It supports direct connection through the OpenAI SDK and is suited to voice cloning, meeting transcription, AI podcast generation, and similar uses.
A user benchmarks 21 consumer GPUs on vast.ai running a small TTS model (OmniVoice) with peak VRAM of 5GB, comparing performance relative to real-time and to an RTX 3090.
OpenAI quietly acquired voice-cloning startup Weights.gg and absorbed its six-person team, likely to remove the public catalog of unauthorized celebrity voices while keeping its own Voice Engine restricted on safety grounds.
An open-source app called Voicebox aims to replace ElevenLabs and WisprFlow with local voice cloning, multiple TTS engines, and MCP server support. It runs on a range of hardware and is released under the MIT license.
DramaBox is an open-weight TTS model fine-tuned from LTX-2.3 that uses stage directions as prompts to generate expressive speech, with optional voice cloning from a 10-second sample.
Scenema AI releases Scenema Audio, an open-source diffusion-based model for zero-shot expressive voice cloning and speech generation, separating emotional performance from voice identity to allow any voice to perform any emotion.
OmniVoice Studio is an open-source desktop app that enables local voice cloning and cinematic video dubbing across 646 languages, fully offline with no API keys, positioning itself as a privacy-focused alternative to ElevenLabs.
Irodori-TTS-500M-v3 is a Japanese TTS model based on Rectified Flow Diffusion Transformer, supporting zero-shot voice cloning and unique emoji-based style/sound effect control.
Tsinghua University's OpenBMB has released VoxCPM2, an open-source multilingual TTS model with 2 billion parameters. It is tokenizer-free, using diffusion-autoregressive generation in a continuous latent space, and offers 48kHz studio-quality audio along with powerful voice cloning and voice design capabilities.
mlx-audio v0.4.3 releases with 6 new TTS models including Higgs Audio v2 and OmniVoice (646+ languages), plus server improvements like concurrent requests and continuous batching, ~3x faster Voxtral Realtime on 4-bit, and slimmer dependencies for Apple Silicon.