Tag
May saw over $1.8 billion in voice AI funding, led by Sierra's $925M and Hark's $700M rounds, while ElevenLabs launched new models for music generation and dubbing with enhanced control. The newsletter also highlights healthcare deals and India's growing voice market.
This paper evaluates four leading real-time voice AI systems (GPT Realtime 2, Gemini 3.1 Flash Live, Qwen3.5 Omni Plus, Omni Flash) and finds that they consistently act on words rather than vocal tone, ignoring distress, fear, or sarcasm even when they can perceive them. The paper calls this the 'emotional intelligence gap' of voice AI.
Coval is a simulation and observability platform for voice agents that helps enterprises scale voice applications safely. Founder Brooke Hopkins discussed the potential of voice as a natural interface for AI, as well as the architectural similarities between voice AI and autonomous driving.
Coval, a startup focusing on simulation and evaluation for voice AI agents, raises a $28M Series A led by Norwest Venture Partners.
The EdgeSpeak desktop voice transcription tool is now live, featuring the local Lattice-2 voice model. It supports offline audio and video transcription, handles multiple languages and accents, and provides a local API for developers to integrate.
This article shares hard-won lessons from building real-time voice AI agents, highlighting the importance of proper turn-taking, VAD handling, billing awareness, and avoiding echo loops.
Andrew Ng announces a new course on adding voice to AI agents using VocalBridge, taught by its CEO. The course covers three integration patterns and evaluation techniques for building reliable and low-latency voice applications.
A cheatsheet comparing Vapi and ElevenLabs, highlighting their voice AI features and differences.
A guide on building a Voice AI capable of performing mathematical calculations and generating accurate quotes.
Announcement of white label AI voice agents, enabling businesses to deploy customizable voice AI solutions under their own brand.
Tyto by ai-coustics is a tool that provides audio insights to predict voice AI performance.
The article shares key prompting habits for making voice AI agents sound more human, including reading prompts aloud, explicitly using filler words, showing examples instead of telling, handling special characters, and allowing the agent to say it doesn't know.
A developer built a local voice-controlled music system using an ESP32 microcontroller, a MacBook, Magenta Realtime 2 for real-time music generation, MLX Whisper for transcription, and a Qwen model for tool calling, enabling conversational control over music elements like genre and instruments.
Hush is an open-source tool for noise suppression designed for voice AI agents, improving audio clarity in real-time interactions.
A collection of 50+ hands-on AI engineering tutorials covering AI agents, RAG, MCP, OCR, voice AI, and more, open-sourced with 1k+ GitHub stars.
Santiago highlights the limitation of traditional STT pipelines that lose tone and emotion, then introduces Velma, a voice-native AI model from Modulate that analyzes raw audio to capture intent, emotion, and other acoustic signals, available via API at 10x cheaper than LLM-based approaches.
A practitioner argues that speech start latency—not model selection—is the critical factor in AI tutoring systems, recommending targets under 1 second for speech start and highlighting streaming TTS as the highest-leverage optimization. The post outlines a full pipeline from ASR through TTS and avatar sync, identifying where latency compounds most.
ElevenLabs introduces the ability to call your Hermes Agent, enabling voice-based interaction with AI agents through their platform.
Microsoft open-sourced VibeVoice, a speech AI framework that supports single-pass transcription of 60-minute audio, multi-speaker diarization, and timestamp labeling, and also provides multi-speaker TTS synthesis. It is based on Qwen2.5 and includes a lightweight 0.5B real-time version. It has received 24.8k stars on GitHub.
An open-weights 8B-parameter voice model achieves a latency of only 110ms, faster than the average human conversational latency of 200-250ms. It can be run locally and is freely available via a GitHub repository.