Thinking Machines announced TML-Interaction-Small, a 276B-parameter mixture-of-experts (MoE) model designed for real-time, always-on interaction, with sub-0.4s latency and integrated multimodal processing.
A robot capable of mimicking human speech, highlighting advances in robotic voice synthesis and human-robot interaction.
HiCoDiT is a novel Hierarchical Codec Diffusion Transformer for video-to-speech generation. It leverages the hierarchical structure of discrete speech tokens from residual-vector-quantization (RVQ) based codecs, using coarse-to-fine conditioning with dual-scale normalization to achieve strong audio-visual alignment.
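The "coarse-to-fine" hierarchy comes from RVQ itself: each codebook quantizes the residual left by the previous one, so the first code carries the coarse structure and later codes refine it. A minimal sketch of that idea (codebook count, sizes, and dimensions are illustrative, not HiCoDiT's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, codebook_size, n_levels = 8, 16, 3
# One codebook per level; entries are random here purely for illustration.
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(n_levels)]

def rvq_encode(frame, codebooks):
    """Quantize one frame level by level: each codebook encodes the
    residual left over by the previous level (coarse-to-fine)."""
    residual = frame.copy()
    codes = []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected entry from each level."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))

frame = rng.normal(size=dim)
codes = rvq_encode(frame, codebooks)   # one discrete token per level
recon = rvq_decode(codes, codebooks)
print(codes, float(np.linalg.norm(frame - recon)))
```

A generator conditioned coarse-to-fine can attend to the level-0 codes first and treat deeper levels as refinements, which is the structure the summary above describes.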
Qwen3.5-Omni is a hundreds-of-billions-parameter multimodal model with advanced audio-visual understanding and generation capabilities, featuring novel Audio-Visual Vibe Coding and achieving state-of-the-art results across 215 benchmarks while matching Gemini-3.1 Pro.
Google DeepMind upgraded its speech synthesis model to sound more natural across 70+ languages and now applies SynthID watermarking to all outputs.
The Qwen3-TTS technical report introduces a series of advanced multilingual text-to-speech models with voice cloning and controllable generation, featuring a dual-track LM architecture and specialized tokenizers for low-latency streaming.
This paper introduces Continuous Audio Language Models (CALM), which generate audio using continuous frames instead of discrete tokens to improve fidelity and reduce computational cost in speech and music generation.
VibeVoice is a new model from Microsoft that synthesizes long-form multi-speaker speech using next-token diffusion and a highly efficient continuous speech tokenizer. It achieves superior fidelity and compression, supporting up to 90 minutes of audio with multiple speakers.
Google announces Gemini 2.5's advanced native audio capabilities, enabling real-time conversational AI with natural speech generation, style control, and multimodal understanding across 24+ languages.