HiCoDiT is a novel Hierarchical Codec Diffusion Transformer for video-to-speech generation. It leverages the hierarchical structure of discrete speech tokens from RVQ-based codecs, using coarse-to-fine conditioning with dual-scale normalization to achieve strong audio-visual alignment.
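As background on the token hierarchy such models condition on, here is a minimal sketch of residual vector quantization (RVQ): each quantizer level encodes the residual left by the previous one, so early tokens are coarse and later tokens add detail. The codebook sizes and toy data below are illustrative assumptions, not HiCoDiT's actual configuration.

```python
import random

def _nearest(codebook, residual):
    """Index of the codebook entry closest to the residual (squared L2)."""
    return min(range(len(codebook)),
               key=lambda k: sum((c - r) ** 2 for c, r in zip(codebook[k], residual)))

def rvq_encode(x, codebooks):
    """One token per level; each level quantizes the previous level's residual."""
    indices, residual = [], list(x)
    for cb in codebooks:
        k = _nearest(cb, residual)
        indices.append(k)
        # Subtract the chosen code and pass the residual to the next (finer) level.
        residual = [r - c for r, c in zip(residual, cb[k])]
    return indices

def rvq_decode(indices, codebooks):
    """Sum the chosen codes; keeping only the first levels gives a coarse reconstruction."""
    out = [0.0] * len(codebooks[0][0])
    for k, cb in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, cb[k])]
    return out

random.seed(0)
# Toy setup (assumed for illustration): 3 quantizer levels, 8 codes each, dim 4.
codebooks = [[[random.gauss(0, 1) for _ in range(4)] for _ in range(8)]
             for _ in range(3)]
x = [random.gauss(0, 1) for _ in range(4)]
tokens = rvq_encode(x, codebooks)   # one discrete token per level, coarse to fine
recon = rvq_decode(tokens, codebooks)
```

Truncating `tokens` to the first levels before decoding yields the coarse approximation that coarse-to-fine conditioning schemes exploit.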
Qwen3.5-Omni is a multimodal model with hundreds of billions of parameters and advanced audio-visual understanding and generation capabilities, featuring a novel Audio-Visual Vibe Coding capability and achieving SOTA results across 215 benchmarks, performing on par with Gemini-3.1 Pro.
Google DeepMind upgraded its speech synthesis model to sound more natural across 70+ languages and now applies SynthID watermarking to all outputs.
This paper introduces Continuous Audio Language Models (CALM), which generate audio using continuous frames instead of discrete tokens to improve fidelity and reduce computational cost in speech and music generation.
VibeVoice is a new model from Microsoft that synthesizes long-form multi-speaker speech using next-token diffusion and a highly efficient continuous speech tokenizer. It achieves high fidelity at strong compression rates, supporting up to 90 minutes of audio with multiple speakers.
Google announces Gemini 2.5's advanced native audio capabilities, enabling real-time conversational AI with natural speech generation, style control, and multimodal understanding across 24+ languages.