speech-generation

dots.tts Technical Report

Hugging Face Daily Papers · 4d ago

dots.tts presents a 2B-parameter continuous autoregressive TTS model trained on multilingual data, achieving state-of-the-art performance on benchmarks like Seed-TTS-Eval with low-latency streaming via CFG-aware MeanFlow distillation. The model, code, and checkpoints are released under Apache 2.0.

ElevenLabs Dubbing v2

Reddit r/singularity · 2026-05-29

ElevenLabs launched Dubbing v2, an AI dubbing model that preserves the original speaker's emotion, tone, and performance across 90+ languages by conditioning on the original audio directly, offering broadcast-quality dubbing at a fraction of the cost.

Comprehensive Benchmarking of Long-Form Speech Generation in Diverse Scenarios

Hugging Face Daily Papers · 2026-05-27

Swanbench-Speech is a comprehensive benchmark for evaluating long-form speech generation across diverse scenarios. It uses multi-dimensional metrics covering acoustics, semantics, and expressiveness, and its results reveal limitations of current models.

Thinking-while-speaking: A Controlled, Interleaved Reasoning Method for Real-Time Speech Generation

arXiv cs.CL · 2026-05-21

This paper introduces InterRS, a method for real-time speech generation that interleaves reasoning steps into the natural pauses in speech, improving performance on math and logic benchmarks while keeping responses fluent and immediate.

Scenema Audio: Zero-shot expressive voice cloning and speech generation [N]

Reddit r/MachineLearning · 2026-05-13

Scenema AI releases Scenema Audio, an open-source diffusion-based model for zero-shot expressive voice cloning and speech generation, separating emotional performance from voice identity to allow any voice to perform any emotion.

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

arXiv cs.CL · 2026-05-11

VITA-QinYu is an expressive end-to-end spoken language model capable of role-playing and singing, trained on 15.8K hours of data to outperform peers in expressiveness and conversational accuracy.

ScenemaAI/scenema-audio

Hugging Face Models Trending · 2026-04-26

Scenema Audio is a zero-shot expressive voice cloning and speech generation model that produces speech with emotional arcs, pacing, and breath control from text prompts. Built on an audio diffusion transformer, it supports multilingual generation, voice cloning from 10-20 seconds of reference audio, and scene-aware audio with ambient effects.

OpenMOSS-Team/MOSS-TTS-Nano-100M

Hugging Face Models Trending · 2026-04-02

MOSS-TTS-Nano is an open-source multilingual speech generation model with only 0.1B parameters, designed for real-time TTS that runs on a CPU without a GPU. Released by the OpenMOSS team and MOSI.AI, it enables simple local deployment for web serving and product integration.
