HiCoDiT is a novel Hierarchical Codec Diffusion Transformer for video-to-speech generation. It leverages the hierarchical structure of discrete speech tokens from RVQ-based codecs, using coarse-to-fine conditioning with dual-scale normalization to achieve strong audio-visual alignment.
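As background on the token hierarchy such models condition on, here is a minimal sketch of residual vector quantization (RVQ): each quantizer level encodes the residual left by the previous one, so early tokens are coarse and later tokens add detail. The codebook sizes and toy data below are illustrative assumptions, not HiCoDiT's actual configuration.

```python
import random

def _nearest(codebook, residual):
    """Index of the codebook entry closest to the residual (squared L2)."""
    return min(range(len(codebook)),
               key=lambda k: sum((c - r) ** 2 for c, r in zip(codebook[k], residual)))

def rvq_encode(x, codebooks):
    """One token per level; each level quantizes the previous level's residual."""
    indices, residual = [], list(x)
    for cb in codebooks:
        k = _nearest(cb, residual)
        indices.append(k)
        # Subtract the chosen code and pass the residual to the next (finer) level.
        residual = [r - c for r, c in zip(residual, cb[k])]
    return indices

def rvq_decode(indices, codebooks):
    """Sum the chosen codes; keeping only the first levels gives a coarse reconstruction."""
    out = [0.0] * len(codebooks[0][0])
    for k, cb in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, cb[k])]
    return out

random.seed(0)
# Toy setup (assumed for illustration): 3 quantizer levels, 8 codes each, dim 4.
codebooks = [[[random.gauss(0, 1) for _ in range(4)] for _ in range(8)]
             for _ in range(3)]
x = [random.gauss(0, 1) for _ in range(4)]
tokens = rvq_encode(x, codebooks)   # one discrete token per level, coarse to fine
recon = rvq_decode(tokens, codebooks)
```

Truncating `tokens` to the first levels before decoding yields the coarse approximation that coarse-to-fine conditioning schemes exploit.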
Qwen3.5-Omni is a multimodal model with hundreds of billions of parameters and advanced audio-visual understanding and generation capabilities, featuring a novel Audio-Visual Vibe Coding capability and achieving SOTA results across 215 benchmarks, performing on par with Gemini-3.1 Pro.
Google DeepMind upgraded its speech synthesis model to sound more natural across 70+ languages and now applies SynthID watermarking to all outputs.
This paper introduces Continuous Audio Language Models (CALM), which generate audio using continuous frames instead of discrete tokens to improve fidelity and reduce computational cost in speech and music generation.
VibeVoice is a new model from Microsoft that synthesizes long-form multi-speaker speech using next-token diffusion and a highly efficient continuous speech tokenizer. It achieves high fidelity at strong compression rates, supporting up to 90 minutes of audio with multiple speakers.
Google announces Gemini 2.5's advanced native audio capabilities, enabling real-time conversational AI with natural speech generation, style control, and multimodal understanding across 24+ languages.