asr

#asr

@tom_doerr: Transcribes audio at 70x real-time speed https://github.com/m-bain/whisperX

X AI KOLs Timeline ↗ · yesterday Cached

WhisperX is a tool for fast automatic speech recognition with word-level timestamps and speaker diarization, offering 70x realtime transcription using Whisper large-v2.
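
To put the "70x realtime" figure in concrete terms, the sketch below converts a speedup factor into wall-clock processing time. The function name and the one-hour example are illustrative assumptions, not part of WhisperX; actual throughput depends on hardware, batch size, and model.

```python
def processing_seconds(audio_seconds: float, speedup: float) -> float:
    """Wall-clock seconds needed to transcribe audio at `speedup` x real time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


# Hypothetical example: one hour of audio at the quoted 70x figure.
one_hour = 60 * 60
print(f"{processing_seconds(one_hour, 70.0):.1f} s")  # prints "51.4 s"
```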

@seclink: Xiaomi is on a strong growth trajectory! MiMo v2.5-ASR Released on 2026-06-02

X AI KOLs Following ↗ · yesterday Cached

Xiaomi has released updates to its MiMo model series: mimo-v2.5-asr (multiple dialects and lyric transcription), mimo-v2.5-pro (trillion parameters, 1M context), mimo-v2.5 (omni-modal perception), and a TTS series. The release is said to significantly improve agent performance and recognition in complex acoustic scenarios.

Tried to benchmark Google’s new on-device dictation models (Eloquent) and basically couldn’t

Reddit r/LocalLLaMA ↗ · 2d ago

A user attempted to benchmark Google's new on-device dictation app Eloquent, which uses proprietary models, and found that it frequently drops words or returns incomplete transcripts; accuracy is competitive only when the transcript is complete. The author theorizes that the underlying chat-style model sometimes refuses to transcribe.

@cohere: Cohere Transcribe, our open-source speech recognition model, is #1 on the new @huggingface Far-Field ASR benchmark.

X AI KOLs Following ↗ · 2d ago Cached

Cohere Transcribe, an open-source speech recognition model, achieved first place on Hugging Face's new Far-Field ASR benchmark.

Speaker Group Encoding in Self-supervised Speech Recognition Models

arXiv cs.CL ↗ · 3d ago Cached

Investigates how self-supervised speech recognition models encode speaker group information (gender, age, dialect, ethnicity, native speaker status) across layers, and how finetuning for tasks like ASR or speaker identification affects this encoding.

Can Voice Agents Handle Bilingual Customers? Benchmarking Frontier ASR on Code-Switched Speech

Hugging Face Blog ↗ · 3d ago Cached

ServiceNow AI releases a benchmark and dataset for evaluating automatic speech recognition (ASR) on code-switched speech across four language pairs (Spanish-English, French-English, Canadian French-English, German-English) in enterprise HR and IT scenarios. It finds that current frontier ASR models still show higher error rates on code-switched speech.

Contrastive Training with LLM-generated Near-Misses for Robust Code-Switching Speech Recognition

arXiv cs.CL ↗ · 5d ago Cached

Proposes a POI-aware contrastive training framework using LLM-generated near-misses to improve ASR robustness at code-switching regions, achieving consistent error reductions on two benchmarks.

Benchmark: ONNX Runtime vs HF Transformers vs GGUF for Parakeet TDT 0.6B on CPU-only hardware [D]

Reddit r/MachineLearning ↗ · 2026-06-05

A benchmark comparing ONNX Runtime, HF Transformers, and GGUF for the Parakeet TDT 0.6B ASR model on CPU-only hardware shows ONNX Runtime achieves 37% faster inference than HF Transformers bfloat16, while GGUF prioritizes memory efficiency.

Latency matters more than model selection when building AI tutoring systems

Reddit r/AI_Agents ↗ · 2026-06-04

A practitioner argues that speech start latency—not model selection—is the critical factor in AI tutoring systems, recommending targets under 1 second for speech start and highlighting streaming TTS as the highest-leverage optimization. The post outlines a full pipeline from ASR through TTS and avatar sync, identifying where latency compounds most.
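
The way latency compounds across stages can be sketched as a simple budget: speech-start latency is the sum of the time each stage takes before the first audio plays. All stage names and millisecond values below are hypothetical placeholders chosen for illustration, not measurements from the post; only the under-1-second target comes from it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: float  # time this stage adds before the first audio can play


def speech_start_latency_ms(stages: list[Stage]) -> float:
    """Sum per-stage latencies for a strictly sequential pipeline."""
    return sum(stage.latency_ms for stage in stages)


TARGET_MS = 1000.0  # "under 1 second for speech start"

# Hypothetical numbers: the same ASR and LLM stages, with TTS either
# synthesizing the whole reply first or streaming its first chunk.
batch_tts = [Stage("asr_final", 300), Stage("llm_reply", 400), Stage("tts_full_audio", 900)]
streaming_tts = [Stage("asr_final", 300), Stage("llm_reply", 400), Stage("tts_first_chunk", 200)]

for label, pipeline in (("batch TTS", batch_tts), ("streaming TTS", streaming_tts)):
    total = speech_start_latency_ms(pipeline)
    print(f"{label}: {total:.0f} ms, within target: {total < TARGET_MS}")
```

With these placeholder values, only the streaming variant lands under the target, which is the kind of effect the post attributes to streaming TTS.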

@uniswap12: Microsoft open-sourced a voice AI that can transcribe 60 minutes of long audio in one go, handling 4 people speaking simultaneously. VibeVoice, open-sourced by Microsoft, 24.8k stars, I only found out about it today. For converting recordings to text, I've been using Whisper, but it often times out on long meeting recordings and struggles with multi-speaker recognition...

X AI KOLs Timeline ↗ · 2026-06-04 Cached

Microsoft open-sourced the VibeVoice speech AI framework, which supports single-pass transcription of 60-minute audio, multi-speaker diarization, and timestamp labeling, and also provides multi-role TTS synthesis. It is based on Qwen2.5 and includes a lightweight 0.5B real-time version. It has 24.8k stars on GitHub.

@yhslgg: Bro, sharing another open-source video translation tool—pyVideoTrans, with 17,700 stars on GitHub, a must-have for video repurposing and localization! In a nutshell: drop a video in, and it automatically runs through the entire pipeline of speech recognition → subtitle translation → AI dubbing → video synthesis, outputting a complete video in another language. Core...

X AI KOLs Timeline ↗ · 2026-06-03 Cached

pyVideoTrans is an open-source video translation tool that supports automatic speech recognition, subtitle translation, AI dubbing, and video synthesis. It integrates multiple ASR, translation, and TTS engines, making it suitable for cross-language video production and localization.

SALSA: Speech Aware LLM Adaptation via Learned Steering Activation Vectors

arXiv cs.CL ↗ · 2026-06-02 Cached

SALSA introduces a lightweight adaptation method for speech-aware LLMs that learns layer-wise steering vectors via a supervised objective. It achieves significant improvements (up to 46.8% relative) on out-of-domain speech benchmarks and shows that steering the encoder layers is more effective than modifying the LLM backbone.

@badlogicgames: what a wonderful project: parakeet.cpp https://github.com/mudler/parakeet.cpp… GGML based parakeet inference pipeline t…

X AI KOLs Following ↗ · 2026-05-31 Cached

parakeet.cpp is a fast, dependency-light C++17 inference pipeline for NVIDIA's NeMo Parakeet speech recognition models, built on ggml. It achieves byte-identical transcripts to NeMo with significant speedups on CPU and GPU.

Transcribing Children's Speech: ASR Performance and Obtaining Reliable Orthographic Transcriptions

arXiv cs.CL ↗ · 2026-05-29 Cached

This paper evaluates nine ASR models (Whisper, Parakeet, Wav2Vec2) on Dutch child speech datasets JASMIN and DART, finding that fine-tuned Whisper-medium achieves the best performance (WER 5.54% on JASMIN, 70.37% on DART). It also proposes a selection method to automatically identify correctly pronounced utterances with high precision, reducing the need for manual verification.
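
For readers unfamiliar with the metric, word error rate (WER) is commonly computed as the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below is a minimal generic implementation with a made-up English example; it is not the paper's evaluation code, and real evaluations usually normalize text (casing, punctuation) first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution and one deletion over six words.
print(f"{word_error_rate('the cat sat on the mat', 'the cat sit on mat'):.2%}")  # prints "33.33%"
```

Because insertions count as errors, WER is not capped at 100%.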

Phonetic Modeling of Dialectal Variation in Vietnamese Speech

arXiv cs.CL ↗ · 2026-05-26 Cached

This paper proposes a dialect-aware phonetic framework for modeling phonetic variation in Vietnamese ASR, decomposing syllables into structured components and mapping them to dialect-specific IPA representations. The approach matches pretrained baselines with fewer parameters and no external pretraining on the UIT-ViMD multi-dialect dataset.

@MaxForAI: If you are working on voice agents, you should try this project. A team from NTU, NUS, and Shanghai AI Lab released: Mega-ASR. This fully open-source ASR is built on Qwen3-ASR, aiming to break the long-standing bottleneck of ASR performance in noisy, reverberant, or other impaired real-world environments...

X AI KOLs Timeline ↗ · 2026-05-22 Cached

NTU, NUS, and Shanghai AI Lab jointly released Mega-ASR, a fully open-source ASR model built on Qwen3-ASR. Using the Voices-in-the-Wild-2M dataset and progressive acoustic-to-semantic optimization, it achieves up to 30% relative Word Error Rate (WER) reduction in real-world noisy environments. With only 1.7B parameters, it enables efficient inference on consumer-grade hardware.
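
A "relative" WER reduction is measured against the baseline's own error rate rather than in absolute percentage points. The sketch below shows the arithmetic; the 20% and 14% figures are hypothetical and are not Mega-ASR results.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER that the new system removes."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical example: 20% WER falling to 14% is a 6-point absolute drop,
# which corresponds to a 30% relative reduction.
print(f"{relative_wer_reduction(0.20, 0.14):.0%}")  # prints "30%"
```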

Convex Low-resource Accent-Robust Language Detection in Speech Recognition

Hugging Face Daily Papers ↗ · 2026-05-22 Cached

This paper introduces CLD, a lightweight convex optimization-based language detection head for ASR that achieves 97-98% accuracy with under 100 training samples while reducing compute costs by 13x, addressing accent and dialect robustness across 5 languages and 24 sub-dialects.

StepAudio 2.5 Technical Report

Hugging Face Daily Papers ↗ · 2026-05-22 Cached

StepAudio 2.5 is a unified audio-language model that achieves state-of-the-art results across ASR, TTS, and real-time spoken interaction by leveraging task-tailored reinforcement learning from human feedback to optimize shared representations.

@AdinaYakup: Mega-ASR https://huggingface.co/zhifeixie/Mega-ASR… 1.7B Apache 2.0 Built for Noise/Reverb/Clipping/Overlapping speaker…

X AI KOLs Following ↗ · 2026-05-21 Cached

Mega-ASR is a 1.7B-parameter robust ASR model released under Apache 2.0. It is designed for noisy, reverberant, clipped, and overlapping speech, and uses an audio quality router to handle clean versus degraded audio.

SCRIBE: Diagnostic Evaluation and Rich Transcription Models for Indic ASR

arXiv cs.CL ↗ · 2026-05-21 Cached

SCRIBE is a diagnostic evaluation framework for automatic speech recognition that provides categorical error decomposition for Indic languages, releasing benchmarks and open-weight rich transcription models for Hindi, Malayalam, and Kannada.
