Tag
Whisper large-v3-turbo has been compressed to 368 MB using Q3_K-matched quantization-aware training, with multilingual word error rate results reported.
Whisperian is an Android application that enables users to use a microphone with local automatic speech recognition (ASR) models, and it is available on the Play Store.
This paper presents NEST-V1, a proof-of-concept multimodal framework for generating emotion-conditioned Nepali Sign Language avatars from spoken input, achieving 81.1% ASR accuracy and 79.21% emotion recognition accuracy on a dataset of 600 audio samples from 50 speakers.
This paper proposes an error-aware TF-IDF retrieval-augmented generation framework for correcting automatic speech recognition errors, achieving significant accuracy gains on Persian FLEURS with near-zero inference latency.
This paper investigates the impact of data scale versus latency on cross-lingual transfer for streaming ASR, finding that multilingual initialization benefits are data-limited, not latency-limited, and diminish as target-language data increases.
This paper uses layer-wise probing to investigate how wav2vec 2.0 and Whisper encode consonant cluster reduction in African American English, finding that both models distinguish reduced and canonical forms and preserve cues to underlying stops.
The FFASR Leaderboard is an open, community-driven benchmark for evaluating automatic speech recognition models under realistic far-field acoustic conditions; it highlights the significant performance gap between near-field and far-field scenarios.
NVIDIA quietly released Nemotron-3.5-ASR, a lightweight 0.6B-parameter open-source speech recognition model designed for real-time, low-latency streaming, with support for 40+ languages and a cache-aware architecture.
The EdgeSpeak desktop voice transcription tool is now live, featuring the local Lattice-2 voice model. It supports offline audio/video transcription, handles multiple languages and accents, and provides a local API that developers can integrate with.
Andrew Ng announces a new course on adding voice to AI agents using VocalBridge, taught by its CEO. The course covers three integration patterns and evaluation techniques for building reliable and low-latency voice applications.
ASTRA is an end-to-end training simulator for air traffic control operators that automates sim pilot roles using locally adapted speech models, achieving a significant reduction in word error rates for Singaporean-accented aviation speech and incorporating AI-assisted performance evaluation.
parakeet.cpp runs NVIDIA Parakeet ASR locally behind an OpenAI-compatible API using prebuilt Docker images, supporting CPU and CUDA (including arm64) for real-time transcription with word timestamps.
This study evaluates bilingual fine-tuning with language identification tokens for improving ASR in low-resource languages across nine diverse language pairs, finding that high LID accuracy is beneficial and that providing the LID token at inference can boost performance when LID accuracy is low.
A speech company trained a model that cancels noise and isolates the primary speaker, achieving a 50% lower word error rate with leading ASR models in noisy environments.
This paper introduces MoDiCoL, a modular diagnostic continual learning dataset for robust speech recognition, enabling controlled analysis of linguistic content, speaker characteristics, and acoustic environments, and proposes a continual learning curriculum to study how robustness is acquired, transferred, and forgotten.
This paper proposes a continual learning approach to integrate disfluency tokens into pretrained ASR models, addressing catastrophic forgetting and improving recognition of disfluent speech.
At PyTorch Conference Europe 2026, Mistral AI's Patrick von Platen explains why real-world AI interaction requires streaming architectures that process continuous input and produce continuous output, using Vox Real Time as a live transcription example.
WhisperX is a tool for fast automatic speech recognition with word-level timestamps and speaker diarization, offering 70x realtime transcription using Whisper large-v2.
Revi is a voice dictation app that runs on-device without needing cloud services or an account.
The article explains how to implement ASR biasing for voice transcription models, using examples from Groq and local models, and introduces the open-source Freestyle project that incorporates this feature.