Tag
This paper investigates whether pretrained self-supervised speech models like Wav2Vec2 and HuBERT can accurately recognize click consonants, which are rare in training data, by fine-tuning on Khoisan languages. Results show the models recognize clicks more accurately than non-clicks, indicating generalization to uncommon phonemes.
VTT for Mac is a voice-to-text tool for macOS that offers a fully on-device option for privacy.
A user attempted to benchmark Google's new on-device dictation app Eloquent, which uses proprietary models, and found it frequently drops words or returns incomplete transcripts; its accuracy is competitive only when the transcript is complete. The author theorizes that the underlying chat-style model sometimes refuses to transcribe.
Cohere Transcribe, an open-source speech recognition model, achieved first place on Hugging Face's new Far-Field ASR benchmark.
This paper investigates how self-supervised speech recognition models encode speaker group information (gender, age, dialect, ethnicity, native speaker status) across layers, and how fine-tuning for tasks like ASR or speaker identification affects this encoding.
The founder of Omi Health fine-tuned NVIDIA's Parakeet TDT 0.6B for medical ASR and released the open-weights model Omi Med STT v1, which achieves competitive medical WER while running locally on Mac, CUDA, or CPU.
A benchmark comparing ONNX Runtime, HF Transformers, and GGUF for the Parakeet TDT 0.6B ASR model on CPU-only hardware shows ONNX Runtime achieves 37% faster inference than HF Transformers bfloat16, while GGUF prioritizes memory efficiency.
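Proposes a domain-aware mispronunciation detection and diagnosis method built on language-specific statistical graphs, reaching an F1 score of 59.52% on the L2-ARCTIC benchmark and outperforming several baseline models.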
This paper demonstrates that Whisper's hallucination failures on silence, noise, or music can be detected and mitigated purely from internal activations using sparse autoencoders, achieving large reductions in hallucination rate without fine-tuning.
Microsoft open-sourced the VibeVoice speech AI framework, which supports one-shot transcription of 60-minute audio with multi-speaker diarization and timestamps, and also provides multi-speaker TTS synthesis. It is based on Qwen2.5 and includes a lightweight 0.5B real-time version. It has received 24.8k stars on GitHub.
This paper investigates whether code-switching ASR capabilities learned from limited seen language pairs can generalize to unseen pairs using model merging and domain generalization methods, finding only modest transfer.
LaSR proposes a latent reasoning training paradigm for context-aware speech recognition, aligning chain-of-thought supervision around acoustic features to improve terminology recognition without added latency, outperforming standard fine-tuning on Fun-Audio-Chat.
A routing-based approach for real-time multilingual ASR that uses smaller monolingual models with a rollback mechanism to handle language switches, achieving ~13% WER on inter-utterance code-switching and open-sourcing the system.
This paper presents the first bias evaluation of multimodal speech recognition models, finding significant accuracy differences across gender and ethnicity when pairing faces with audio, with implications for fairness in AI systems.
parakeet.cpp is a fast, dependency-light C++17 inference pipeline for NVIDIA's NeMo Parakeet speech recognition models, built on ggml. It achieves byte-identical transcripts to NeMo with significant speedups on CPU and GPU.
This paper evaluates nine ASR models (Whisper, Parakeet, Wav2Vec2) on Dutch child speech datasets JASMIN and DART, finding that fine-tuned Whisper-medium achieves the best performance (WER 5.54% on JASMIN, 70.37% on DART). It also proposes a selection method to automatically identify correctly pronounced utterances with high precision, reducing the need for manual verification.
This paper introduces MeDial-Speech, a dataset of robot-patient and doctor-patient medical dialogues for spoken language processing, and evaluates three LLMs on a sentence selection benchmark, finding Claude Sonnet 4 most accurate.
This paper proposes a dialect-aware phonetic framework for modeling phonetic variation in Vietnamese ASR, decomposing syllables into structured components and mapping them to dialect-specific IPA representations. The approach matches pretrained baselines with fewer parameters and no external pretraining on the UIT-ViMD multi-dialect dataset.
This paper applies Direct Preference Optimization (DPO) to align Audio LLMs for transcribing English-Mandarin code-switching speech, achieving up to 89.6% MER reduction in-distribution and 20% out-of-distribution. It identifies three failure modes—language omission, translation instead of transcription, and hallucination—and shows that preference-based alignment effectively elicits correct code-switching behavior from multilingual Audio LLMs.
NTU, NUS, and Shanghai AI Lab jointly released Mega-ASR, a fully open-source ASR model built on Qwen3-ASR. Using the Voices-in-the-Wild-2M dataset and progressive acoustic-to-semantic optimization, it achieves up to 30% relative Word Error Rate (WER) reduction in real-world noisy environments. With only 1.7B parameters, it enables efficient inference on consumer-grade hardware.