Tag
Explores whether OpenAI's Whisper remains the top choice for real-time speech-to-text applications, considering alternatives and performance trade-offs.
A guide to building a fully local voice assistant using Platypush on a Raspberry Pi, covering hotword detection, speech-to-text, text-to-speech, and home automation integration.
Google AI Edge Eloquent now supports Mac as a fully local Wispr Flow alternative, offering real-time voice transcription and voice-command text editing based on the latest Gemma model. It is free, requires no subscription, and runs entirely on-device for privacy.
Mutter AI Dictation is a private AI dictation tool that operates offline.
Andrei Cebotar, a gamer with Spinal Muscular Atrophy, shares the assistive tools he uses daily to play games and communicate, including PlayAbility for facial gesture control, Handy for local speech-to-text, and the Xbox Adaptive Controller.
This paper documents the Montreal Forced Aligner 3.0, a widely used open-source tool for forced alignment, achieving state-of-the-art performance across English, Japanese, and Korean with mean boundary errors below 15 ms.
Cartesia released Sonic-3.5 (text-to-speech) and Ink-2 (speech-to-text), claiming they are the #1 streaming models for voice agents, with potential to disrupt call centers.
A post compiling open-source tools for content creation, including video editing, speech-to-text, AI drawing, and media processing, emphasizing that they are free and open source and can be combined to build your own system.
NVIDIA released Nemotron 3.5 ASR, an open-source multilingual speech-to-text model with the lowest latency tested, available in multilingual and English-only variants, ideal for voice agents and self-hosted deployments.
A demo of Telugu Thodu, an app built using SarvamAI's speech-to-text system that translates Telugu to English with high accuracy, handling pauses and nuances.
NVIDIA's Parakeet speech-to-text models have been ported to pure C++/ggml, achieving byte-identical output to NeMo, up to 5x faster inference on GPU, and quantized GGUF variants for efficient deployment anywhere without Python or PyTorch.
A personal account of how the Linux desktop's upcoming Wayland-only future will break accessibility for users relying on input tools like Talon Voice, highlighting the lack of attention to input accessibility compared to output accessibility.
Recommends Scribe2SRT, an open-source speech-to-subtitle tool built on PySide6 and the ElevenLabs API that supports multiple languages and applies optimized formatting to quickly generate high-quality SRT subtitles.
The article evaluates Wispr Flow, an AI-powered transcription tool, comparing it with free alternatives like open-source models (Whisper, Canary) and built-in features (Apple dictation, Google Voice Typing), concluding that paid subscriptions may not be necessary for many users.
Parrot Speech-to-text API offers fast and accurate transcription for production-grade voice agents.
Cohere Transcribe, described as the best open-source speech-to-text model, has been fine-tuned to support diarization and timestamps. The new model is available on Hugging Face.
A detailed explainer on the five-layer architecture of AI voice agents, including speech-to-text, LLM, text-to-speech, orchestrator, and telephony, all operating under a 500ms latency constraint to maintain natural conversation flow.
StepFun launches the Step Plan subscription at $6.99/month, bundling LLM, TTS, ASR, image generation, and other AI models. It supports direct connection through the OpenAI SDK and targets use cases such as voice cloning, meeting transcription, and AI podcast generation.
TongueType is a local dictation app for macOS that does not require a subscription.
Presents a SpeechLLM architecture for streaming speech-to-text translation that adaptively decides when to output tokens based on audio, achieving 1-2 second latency with quality close to non-streaming baselines.