Higgs Audio v3 is a 4B-parameter TTS model designed for voice chat applications, supporting 100 languages with inline control capabilities.
SpurAudio is a new benchmark for evaluating shortcut learning and spurious correlations in few-shot audio classification. It reveals that state-of-the-art methods, including large pretrained audio foundation models, suffer significant performance degradation when background correlations are disrupted.
SpeechEditBench is a bilingual multi-attribute benchmark for evaluating instruction-guided speech editing across seven atomic tasks and compositional tasks, using an anchor-based evaluation protocol with three metrics. Evaluation of mainstream Speech LLMs reveals no single model excels across all dimensions, and compositional editing remains highly challenging.
OpenSTBench is a unified multidimensional evaluation framework for speech translation systems that jointly assesses translation quality, speech quality, speaker preservation, emotion fidelity, and latency across both S2TT and S2ST systems in offline and streaming settings. The framework addresses the gap left by fragmented evaluation protocols and provides a reproducible benchmark for comparing heterogeneous speech translation systems.
A new AI music model has been released, with demos that sound surprisingly realistic.
Spotify debuts a new desktop app called Studio by Spotify Labs that uses AI to generate personalized podcasts from users' email, calendar, and documents, directly competing with Google's NotebookLM.
GPT-Realtime-2 shows a 15-percentage-point improvement over version 1.5 on the Big Bench Audio benchmark, bringing the benchmark close to saturation.
APEX is a large-scale multi-task learning framework that predicts both popularity and aesthetic quality of AI-generated music using frozen audio embeddings. The model demonstrates strong generalization across different generative architectures by jointly predicting engagement signals and perceptual quality dimensions.
Google has released Lyria 3, its newest music generation model, available to developers through the Gemini API and Google AI Studio. The model offers two variants: Lyria 3 Pro for full songs and Lyria 3 Clip for shorter clips, with controls for tempo, lyrics, and image-to-music multimodal input.
Google has developed DolphinGemma, a large language model designed to learn and generate dolphin vocalizations. The project is a collaboration with Georgia Tech and the Wild Dolphin Project that aims to advance understanding of dolphin communication patterns and enable potential interspecies dialogue.