Tag
Thom Wolf and Cerebras released a fully open-source realtime voice demo with models and code, showcasing state-of-the-art speech-to-speech capabilities.
This paper proposes a reference-based evaluation protocol for assessing prosody and rhythm in speech-to-speech AI systems, using matched human conversation data to provide interpretable behavioral plausibility checks.
Hugging Face and Cerebras demonstrate a real-time speech-to-speech pipeline combining open-source models (Nvidia's Parakeet, Gemma 4, Qwen3TTS) with Cerebras' fast inference, enabling natural conversational AI and powering robots like Reachy Mini.
This paper presents a modular end-to-end speech-to-speech conversational system for the low-resource Algerian Dialect, integrating ASR, NLU, RAG, and TTS with dedicated datasets and fine-tuned models.
Discusses using Gemma 4 12B's encoder-free architecture for native voice input, and asks for out-of-the-box solutions for low-latency streaming audio ingestion.
Gemini 3.5 Live Translate is a new audio model for real-time speech-to-speech translation.
Google DeepMind announces Live Translate, a feature that translates speech into more than 70 languages in real time while preserving tone, pace, and pitch for more natural conversations.
Google releases Gemini 3.5 Live Translate, an audio model for near real-time speech-to-speech translation in over 70 languages, preserving speaker intonation and pacing. It is rolling out across Google products including the Gemini Live API, Google Meet, and Google Translate.
COMPASS is a unified benchmarking framework for speech-to-speech translation (S2ST) that integrates 46 metrics across eight dimensions and is evaluated on 1,248 model-language configurations. It identifies complementary strengths across architectures and proposes reduced metric subsets that preserve rankings while cutting evaluation time.
OpenAI released a new specialized model, gpt-realtime-translate, that takes speech audio from over 70 input languages and outputs speech in 13 target languages for real-time translation.
OpenSTBench is a unified multidimensional evaluation framework for speech translation systems that jointly assesses translation quality, speech quality, speaker preservation, emotion fidelity, and latency across both S2TT and S2ST systems in offline and streaming settings. The framework addresses the gap left by fragmented evaluation protocols and provides a reproducible benchmark for comparing heterogeneous speech translation systems.
OpenAI releases gpt-realtime-translate, a low-latency speech-to-speech model optimized for live interpretation, accompanied by a developer cookbook for building multilingual browser, phone, and video applications.
Announces liquid-audio, an open-source repository for Liquid AI's end-to-end speech-to-speech LFM models (LFM2-Audio-1.5B and LFM2.5-Audio-1.5B) with interleaved and sequential generation modes and fine-tuning support.
OpenAI has released gpt-realtime-2, a new speech-to-speech model optimized for real-time voice agent interactions with low-latency tool calling.
OpenAI is making the Realtime API generally available with a new advanced speech-to-speech model called gpt-realtime, featuring improved instruction following, tool calling, and natural speech quality. New capabilities include MCP server support, image inputs, SIP phone calling, and two new voices (Cedar and Marin).