Tag
The paper proposes a novel framework (CDDTLDA) that uses transfer learning and data augmentation to improve Chinese dialect discrimination under low-resource conditions, achieving state-of-the-art results on two benchmark corpora.
ASTRA is an end-to-end training simulator for air traffic control operators that automates sim pilot roles using locally adapted speech models, achieving a significant reduction in word error rates for Singaporean-accented aviation speech and incorporating AI-assisted performance evaluation.
parakeet.cpp enables running NVIDIA Parakeet ASR behind the OpenAI API locally with prebuilt Docker images, supporting CPU and CUDA (including arm64) for real-time transcription with word timestamps.
This study evaluates bilingual fine-tuning with language identification tokens for improving ASR in low-resource languages across nine diverse language pairs, finding that high LID accuracy is beneficial and that providing the LID token at inference can boost performance when LID accuracy is low.
A speech company trained a model that cancels noise and isolates the primary speaker, cutting the word error rate of leading ASR models by 50% in noisy environments.
This paper proposes a continual learning approach to integrate disfluency tokens into pretrained ASR models, addressing catastrophic forgetting and improving recognition of disfluent speech.
WhisperX is a tool for fast automatic speech recognition with word-level timestamps and speaker diarization, offering 70x realtime transcription using Whisper large-v2.
Xiaomi has released updates to its MiMo model series, including mimo-v2.5-asr (supporting multiple dialects and lyric transcription), mimo-v2.5-pro (trillion parameters, 1M context), mimo-v2.5 (full-modal perception), and a TTS series, significantly improving agent performance and recognition capability in complex acoustic scenarios.
A user attempted to benchmark Eloquent, Google's new on-device dictation app built on proprietary models, and found that it frequently drops words or returns incomplete transcripts; its accuracy is competitive only when the transcript is complete. The author theorizes that the underlying chat-style model sometimes refuses to transcribe.
Cohere Transcribe, an open-source speech recognition model, achieved first place on Hugging Face's new Far-Field ASR benchmark.
Investigates how self-supervised speech recognition models encode speaker group information (gender, age, dialect, ethnicity, native speaker status) across layers, and how finetuning for tasks like ASR or speaker identification affects this encoding.
ServiceNow AI releases a benchmark and dataset for evaluating automatic speech recognition (ASR) on code-switched speech across four language pairs (Spanish-English, French-English, Canadian French-English, German-English) in enterprise HR and IT scenarios, finding that current frontier ASR models still struggle with code-switching, leading to higher error rates.
Proposes a POI-aware contrastive training framework using LLM-generated near-misses to improve ASR robustness at code-switching regions, achieving consistent error reductions on two benchmarks.
A benchmark comparing ONNX Runtime, HF Transformers, and GGUF for the Parakeet TDT 0.6B ASR model on CPU-only hardware shows that ONNX Runtime achieves 37% faster inference than HF Transformers in bfloat16, while GGUF prioritizes memory efficiency.
A practitioner argues that speech start latency, not model selection, is the critical factor in AI tutoring systems, recommending a target of under 1 second for speech start and highlighting streaming TTS as the highest-leverage optimization. The post outlines a full pipeline from ASR through TTS and avatar sync, identifying where latency compounds most.
Microsoft open-sourced VibeVoice, a speech AI framework that supports transcription of 60-minute audio in one pass, multi-speaker diarization, and timestamp labeling, and also provides multi-role TTS. It is based on Qwen2.5 and includes a lightweight 0.5B real-time version. It has received 24.8k stars on GitHub.
pyVideoTrans is an open-source video translation tool that supports automatic speech recognition, subtitle translation, AI dubbing, and video synthesis. It integrates multiple ASR, translation, and TTS engines, making it suitable for cross-language video production and localization.
SALSA introduces a lightweight adaptation method for speech-aware LLMs that learns layer-wise steering vectors via a supervised objective, achieving significant improvements (up to 46.8% relative) on out-of-domain speech benchmarks, and shows that steering the encoder layers is more effective than modifying the LLM backbone.
parakeet.cpp is a fast, dependency-light C++17 inference pipeline for NVIDIA's NeMo Parakeet speech recognition models, built on ggml. It achieves byte-identical transcripts to NeMo with significant speedups on CPU and GPU.
This paper evaluates nine ASR models (Whisper, Parakeet, Wav2Vec2) on Dutch child speech datasets JASMIN and DART, finding that fine-tuned Whisper-medium achieves the best performance (WER 5.54% on JASMIN, 70.37% on DART). It also proposes a selection method to automatically identify correctly pronounced utterances with high precision, reducing the need for manual verification.
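Several entries above report word error rate (WER) or a relative reduction in it. As a reading aid, the sketch below shows how these figures are commonly computed: WER as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, and relative reduction as the drop from a baseline expressed as a fraction of that baseline. The function names are illustrative, and the whitespace tokenization with no text normalization is an assumption; none of the summarized works state their exact scoring setup here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: plain whitespace tokenization, no case or punctuation
    normalization. Real evaluations usually normalize text first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    # A WER falling from 20% to 10% is a 50% relative reduction.
    print(relative_reduction(0.20, 0.10))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference, and a "50% lower" WER is a relative figure, not a drop of 50 percentage points.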