WhisperX is a tool for fast automatic speech recognition with word-level timestamps and speaker diarization, offering 70x realtime transcription using Whisper large-v2.
Xiaomi has released updates to its MiMo model series, including mimo-v2.5-asr (supporting multiple dialects and lyric transcription), mimo-v2.5-pro (trillion parameters, 1M context), mimo-v2.5 (full-modal perception), and a TTS series, significantly improving agent performance and recognition capability in complex acoustic scenarios.
A user attempted to benchmark Google's new on-device dictation app Eloquent, which uses proprietary models, and found that it frequently drops words or returns incomplete transcripts; its accuracy is competitive only when the transcript is complete. The author theorizes that the underlying chat-style model sometimes refuses to transcribe.
Cohere Transcribe, an open-source speech recognition model, achieved first place on Hugging Face's new Far-Field ASR benchmark.
Investigates how self-supervised speech recognition models encode speaker group information (gender, age, dialect, ethnicity, native speaker status) across layers, and how finetuning for tasks like ASR or speaker identification affects this encoding.
ServiceNow AI releases a benchmark and dataset for evaluating automatic speech recognition (ASR) on code-switched speech across four language pairs (Spanish-English, French-English, Canadian French-English, German-English) in enterprise HR and IT scenarios, finding that current frontier ASR models still struggle with code-switching and show higher error rates on it.
Proposes a POI-aware contrastive training framework using LLM-generated near-misses to improve ASR robustness at code-switching regions, achieving consistent error reductions on two benchmarks.
A benchmark comparing ONNX Runtime, HF Transformers, and GGUF for the Parakeet TDT 0.6B ASR model on CPU-only hardware shows ONNX Runtime achieves 37% faster inference than HF Transformers bfloat16, while GGUF prioritizes memory efficiency.
A practitioner argues that speech start latency, not model selection, is the critical factor in AI tutoring systems, recommending a target of under 1 second to speech start and highlighting streaming TTS as the highest-leverage optimization. The post outlines a full pipeline from ASR through TTS and avatar sync, identifying where latency compounds most.
Microsoft open-sourced the VibeVoice speech AI framework, which supports single-pass transcription of 60-minute audio, multi-speaker diarization, and timestamp labeling, and also provides multi-role TTS synthesis. It is based on Qwen2.5 and includes a lightweight 0.5B real-time version. It has received 24.8k stars on GitHub.
pyVideoTrans is an open-source video translation tool that supports automatic speech recognition, subtitle translation, AI dubbing, and video synthesis. It integrates multiple ASR, translation, and TTS engines, making it suitable for cross-language video production and localization.
SALSA introduces a lightweight adaptation method for speech-aware LLMs that learns layer-wise steering vectors via a supervised objective, achieving significant improvements (up to 46.8% relative) on out-of-domain speech benchmarks, and shows that steering the encoder layers is more effective than modifying the LLM backbone.
parakeet.cpp is a fast, dependency-light C++17 inference pipeline for NVIDIA's NeMo Parakeet speech recognition models, built on ggml. It achieves byte-identical transcripts to NeMo with significant speedups on CPU and GPU.
This paper evaluates nine ASR models (Whisper, Parakeet, Wav2Vec2) on Dutch child speech datasets JASMIN and DART, finding that fine-tuned Whisper-medium achieves the best performance (WER 5.54% on JASMIN, 70.37% on DART). It also proposes a selection method to automatically identify correctly pronounced utterances with high precision, reducing the need for manual verification.
This paper proposes a dialect-aware phonetic framework for modeling phonetic variation in Vietnamese ASR, decomposing syllables into structured components and mapping them to dialect-specific IPA representations. The approach matches pretrained baselines with fewer parameters and no external pretraining on the UIT-ViMD multi-dialect dataset.
NTU, NUS, and Shanghai AI Lab jointly released Mega-ASR, a fully open-source ASR model built on Qwen3-ASR. Using the Voices-in-the-Wild-2M dataset and progressive acoustic-to-semantic optimization, it achieves up to 30% relative Word Error Rate (WER) reduction in real-world noisy environments. With only 1.7B parameters, it enables efficient inference on consumer-grade hardware.
This paper introduces CLD, a lightweight convex optimization-based language detection head for ASR that achieves 97-98% accuracy with under 100 training samples while reducing compute costs by 13x, addressing accent and dialect robustness across 5 languages and 24 sub-dialects.
StepAudio 2.5 is a unified audio-language model that achieves state-of-the-art results across ASR, TTS, and real-time spoken interaction by leveraging task-tailored reinforcement learning from human feedback to optimize shared representations.
Mega-ASR is a 1.7B parameter robust ASR model under Apache 2.0, designed for noisy, reverberant, and overlapping speech, with an audio quality router to handle clean vs degraded audio.
SCRIBE is a diagnostic evaluation framework for automatic speech recognition that provides categorical error decomposition for Indic languages, releasing benchmarks and open-weight rich transcription models for Hindi, Malayalam, and Kannada.
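Several of the items above report word error rate (WER) and relative WER reductions. As a rough illustration of how such figures are conventionally computed, the sketch below scores a hypothesis transcript against a reference using word-level edit distance and then derives a relative reduction between two WER values. The function names and the whitespace tokenization are assumptions made for this example and are not taken from any of the listed projects. Real evaluations usually normalize casing and punctuation before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                       # deletion (word dropped)
                curr[j - 1] + 1,                   # insertion (extra word)
                prev[j - 1] + substitution_cost,   # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer
```

Under this definition a dropped word counts as one deletion, so a transcript that omits one word of a six-word reference scores a WER of 1/6. A drop from a hypothetical 20% WER to 14% WER is a 30% relative reduction.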