An evaluation of leading STT models on 1000+ noisy real-world clips reveals most perform poorly in noisy environments, with DG Nova performing best. Applying noise cancellation significantly improves accuracy.
The author argues that for live voice agents, STT latency and real-time behavior are more critical than raw transcription accuracy, and proposes a different evaluation scorecard.
Explores whether OpenAI's Whisper remains the top choice for real-time speech-to-text applications, considering alternatives and performance trade-offs.
A guide to building a fully local voice assistant using Platypush on a Raspberry Pi, covering hotword detection, speech-to-text, text-to-speech, and home automation integration.
Google AI Edge Eloquent now supports Mac as a fully local Wispr Flow alternative, offering real-time voice transcription and voice-command text editing based on the latest Gemma model. It is free, requires no subscription, and runs entirely on-device for privacy.
Mutter AI Dictation is a private AI dictation tool that operates offline.
Andrei Cebotar, a gamer with Spinal Muscular Atrophy, shares the assistive tools he uses daily to play games and communicate, including PlayAbility for facial gesture control, Handy for local speech-to-text, and the Xbox Adaptive Controller.
This paper documents the Montreal Forced Aligner 3.0, a widely used open-source tool for forced alignment, achieving state-of-the-art performance across English, Japanese, and Korean with mean boundary errors below 15 ms.
Cartesia released Sonic-3.5 (text-to-speech) and Ink-2 (speech-to-text), claiming they are the #1 streaming models for voice agents, with potential to disrupt call centers.
A post compiling open-source tools for content creation, including video editing, speech-to-text, AI drawing, and media processing, with an emphasis on free, open-source software and the ability to build your own system.
NVIDIA released Nemotron 3.5 ASR, an open-source multilingual speech-to-text model with the lowest latency tested, available in multilingual and English-only variants, ideal for voice agents and self-hosted deployments.
A demo of Telugu Thodu, an app built using SarvamAI's speech-to-text system that translates Telugu to English with high accuracy, handling pauses and nuances.
NVIDIA's Parakeet speech-to-text models have been ported to pure C++/ggml, achieving byte-identical output to NeMo, up to 5x faster inference on GPU, and quantized GGUF variants for efficient deployment anywhere without Python or PyTorch.
A personal account of how the Linux desktop's upcoming Wayland-only future will break accessibility for users relying on input tools like Talon Voice, highlighting the lack of attention to input accessibility compared to output accessibility.
A recommendation of Scribe2SRT, an open-source speech-to-subtitle tool built on PySide6 and the ElevenLabs API. It supports multiple languages and uses optimized formatting to quickly generate high-quality SRT subtitles.
The article evaluates Wispr Flow, an AI-powered transcription tool, comparing it with free alternatives like open-source models (Whisper, Canary) and built-in features (Apple dictation, Google Voice Typing), concluding that paid subscriptions may not be necessary for many users.
Parrot Speech-to-text API offers fast and accurate transcription for production-grade voice agents.
The author fine-tuned Cohere Transcribe, which they call the best open-source speech-to-text model, to support diarization and timestamps. The new model is available on Hugging Face.
A detailed explainer on the five-layer architecture of AI voice agents, including speech-to-text, LLM, text-to-speech, orchestrator, and telephony, all operating under a 500ms latency constraint to maintain natural conversation flow.
StepFun launches the Step Plan subscription at $6.99/month, integrating LLM, TTS, ASR, image generation, and other AI models. It supports direct connection through the OpenAI SDK and is suited to voice cloning, meeting transcription, AI podcast generation, and similar uses.