Tag
This paper introduces the first public multimodal dataset of 100 Turkish scam and benign phone calls, evaluating seven LLMs under raw audio, ASR transcripts, and human-corrected transcripts. Results show transcript-based inputs outperform direct audio, highlighting the need for inclusive AI safety research in low-resource languages.
This article shares hard-won lessons from building real-time voice AI agents, highlighting the importance of proper turn-taking, VAD handling, billing awareness, and avoiding echo loops.
A local CLI tool that uses OpenAI's Whisper to detect and remove filler words (um, uh, erm) from audio recordings, employing techniques to avoid audio artifacts like clicks and background hiss.
Hush is an open-source tool for noise suppression designed for voice AI agents, improving audio clarity in real-time interactions.
Microsoft released VibeVoice, an open-source model that processes a full hour of audio in one pass and returns a structured transcript with speaker identification and timestamps, disrupting paid transcription services.
Resonate is a low-latency, low-memory algorithm for perceptually relevant spectral analysis of audio signals, using resonator models with exponentially weighted moving averages.
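As a rough illustration of the idea in the Resonate summary, the sketch below keeps one complex state per analysis frequency and updates it with an exponentially weighted moving average of the input multiplied by a rotating phasor. This is one reading of that one-line description, not the project's actual code; the function name `resonator_magnitudes`, the smoothing factor `alpha`, and the test frequencies are all assumptions made for the example.

```python
import cmath
import math


def resonator_magnitudes(samples, sample_rate, freqs, alpha=0.01):
    """Return one smoothed magnitude per frequency in `freqs`.

    Hypothetical sketch: each resonator demodulates the input at its own
    frequency and smooths the result with an EWMA (factor `alpha`).
    """
    states = [0j] * len(freqs)
    # Per-sample rotation for each resonator's reference phasor.
    steps = [cmath.exp(-2j * math.pi * f / sample_rate) for f in freqs]
    phasors = [1 + 0j] * len(freqs)
    for x in samples:
        for k in range(len(freqs)):
            # EWMA update: constant memory and constant work per sample.
            states[k] = (1 - alpha) * states[k] + alpha * x * phasors[k]
            phasors[k] *= steps[k]
    return [abs(s) for s in states]


if __name__ == "__main__":
    sr = 8000
    tone = [math.sin(2 * math.pi * 440 * n / sr) for n in range(4000)]
    print(resonator_magnitudes(tone, sr, [220.0, 440.0, 880.0]))
```

The state per frequency is a single complex number, which is where the low-memory property in the summary would come from under this reading; a larger `alpha` reacts faster but smooths less.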
Santiago highlights the limitation of traditional STT pipelines that lose tone and emotion, then introduces Velma, a voice-native AI model from Modulate that analyzes raw audio to capture intent, emotion, and other acoustic signals, available via API at 10x cheaper than LLM-based approaches.
An open-source project that uses a phone microphone for live breath detection and biofeedback, processing audio on-device to enhance self-awareness without wearables or cloud uploads.
Perplexity shared engineering best practices for adding voice functionality to its AI browser Comet using the OpenAI Realtime API, including chunked context feeding, role management, and a unified audio pipeline.
This article explains the technical architecture of a real-time chord recognizer, detailing a four-stage pipeline using pitch-class bitmasks, candidate generation, score normalization, and musical heuristics.
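The four stages named in the chord recognizer summary could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the article's implementation: the chord templates, the Jaccard-style normalization, the bass-note bonus, and every function name here are hypothetical choices made for the example.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Hypothetical chord templates, given as intervals above the root.
TEMPLATES = {
    "maj": (0, 4, 7),
    "min": (0, 3, 7),
    "dim": (0, 3, 6),
    "7": (0, 4, 7, 10),
}


def to_mask(pitches):
    """Stage 1: fold pitches into a 12-bit pitch-class bitmask."""
    mask = 0
    for p in pitches:
        mask |= 1 << (p % 12)
    return mask


def popcount(x):
    return bin(x).count("1")


def recognize(midi_notes):
    """Return a chord name such as 'Cmaj', or None for empty input."""
    if not midi_notes:
        return None
    observed = to_mask(midi_notes)
    bass = min(midi_notes) % 12
    best_name, best_score = None, -1.0
    for root in range(12):
        # Stage 2: only roots that are actually sounding become candidates.
        if not observed & (1 << root):
            continue
        for quality, intervals in TEMPLATES.items():
            template = to_mask(root + i for i in intervals)
            # Stage 3: normalize to [0, 1] as shared notes / all notes involved.
            score = popcount(observed & template) / popcount(observed | template)
            # Stage 4: assumed heuristic, slightly prefer the bass note as root.
            if root == bass:
                score += 0.1
            if score > best_score:
                best_name, best_score = NOTE_NAMES[root] + quality, score
    return best_name


if __name__ == "__main__":
    print(recognize([60, 64, 67]))
```

Bitmasks make the comparison between the observed notes and a template a pair of bitwise operations, so scoring every root and quality stays cheap enough to repeat on each audio frame.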
Derpy Turtle is a Windows GUI tool designed to enhance Kokoro voice outputs by integrating voice search, RVC model training, and post-generation voice conversion into a unified workflow.
The article discusses how multimodal AI models like GPT-4o and Claude 3.5 Sonnet are overcoming text-only bottlenecks by enabling visual debugging, audio-to-data conversion, and enhanced RAG systems.
GPT-Realtime-2 is introduced as a tool for real-time audio translation.
mlx-audio v0.4.3 ships 6 new TTS models, including Higgs Audio v2 and OmniVoice (646+ languages), plus server improvements such as concurrent requests and continuous batching, ~3x faster Voxtral Realtime on 4-bit, and slimmer dependencies for Apple Silicon.
A web-based guitar tuner that utilizes the phone's accelerometer to detect string vibrations and calculate pitch.