asr

#asr

Low-resource Language Discrimination Towards Chinese Dialects with Transfer learning and Data Augmentation

arXiv cs.CL ↗ · yesterday Cached

The paper proposes a novel framework (CDDTLDA) that uses transfer learning and data augmentation to improve Chinese dialect discrimination under low-resource conditions, achieving state-of-the-art results on two benchmark corpora.

#asr

ASTRA: A Scalable Next-Generation ATCO Training Simulator with Autonomous Simpilots

arXiv cs.LG ↗ · yesterday Cached

ASTRA is an end-to-end training simulator for air traffic control operators that automates sim pilot roles using locally adapted speech models, achieving a significant reduction in word error rates for Singaporean-accented aviation speech and incorporating AI-assisted performance evaluation.

#asr

@mudler_it: parakeet.cpp now runs NVIDIA Parakeet behind the OpenAI API. Point any OpenAI client at a local server, send an audio, …

X AI KOLs Timeline ↗ · yesterday Cached

parakeet.cpp enables running NVIDIA Parakeet ASR behind the OpenAI API locally with prebuilt Docker images, supporting CPU and CUDA (including arm64) for real-time transcription with word timestamps.

#asr

Improving low-resource ASR using bilingual fine-tuning with language identification: a cross-linguistic evaluation

arXiv cs.CL ↗ · 2d ago Cached

This study evaluates bilingual fine-tuning with language identification tokens for improving ASR in low-resource languages across nine diverse language pairs, finding that high LID accuracy is beneficial and that providing the LID token at inference can boost performance when LID accuracy is low.

#asr

Voice agents in noisy environments

Reddit r/AI_Agents ↗ · 2d ago

A speech company trained a model that cancels noise and isolates the primary speaker, cutting the word error rate of leading ASR models by 50% in noisy environments.
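For readers unfamiliar with the metric, the following is a minimal sketch of how word error rate (WER) and a relative reduction such as the 50% figure are usually computed. It assumes the standard word-level edit-distance definition; the post does not say how its numbers were measured, and the function names here are illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer (0.5 means 50% lower)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer
```

Under this definition, a model whose WER drops from 0.20 to 0.10 after noise cancellation shows a relative reduction of 0.5, which is how a "50% lower" claim would typically be read.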

#asr

Learning to Hear Hesitation: Continual Learning for Disfluency-Aware ASR

arXiv cs.CL ↗ · 4d ago Cached

This paper proposes a continual learning approach to integrate disfluency tokens into pretrained ASR models, addressing catastrophic forgetting and improving recognition of disfluent speech.

#asr

@tom_doerr: Transcribes audio at 70x real-time speed https://github.com/m-bain/whisperX

X AI KOLs Timeline ↗ · 2026-06-12 Cached

WhisperX is a tool for fast automatic speech recognition with word-level timestamps and speaker diarization, offering 70x real-time transcription with Whisper large-v2.
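As a rough illustration of what a "70x real-time" figure means in practice, the sketch below converts between audio duration, processing time, and speed-up. The helper names are hypothetical and the calculation is plain arithmetic, not a WhisperX API; actual throughput depends on hardware and settings.

```python
def speedup_over_real_time(audio_seconds: float, processing_seconds: float) -> float:
    """How many seconds of audio are transcribed per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def expected_processing_seconds(audio_seconds: float, speedup: float) -> float:
    """Time needed to transcribe audio_seconds at a given real-time speed-up."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


# A one-hour recording at the quoted 70x speed-up would take under a minute.
one_hour = 60 * 60
estimate = expected_processing_seconds(one_hour, 70)
print(f"{estimate:.1f} s")
```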

#asr

@seclink: Xiaomi is on a strong growth trajectory! MiMo v2.5-ASR Released on 2026-06-02

X AI KOLs Following ↗ · 2026-06-12 Cached

Xiaomi has released updates to its MiMo model series, including mimo-v2.5-asr (supporting multiple dialects and lyric transcription), mimo-v2.5-pro (trillion parameters, 1M context), mimo-v2.5 (full-modal perception), and a TTS series, significantly improving agent performance and recognition capability in complex acoustic scenarios.

#asr

Tried to benchmark Google’s new on-device dictation models (Eloquent) and basically couldn’t

Reddit r/LocalLLaMA ↗ · 2026-06-10

A user attempted to benchmark Google's new on-device dictation app Eloquent, which uses proprietary models, and found that it frequently drops words or returns incomplete transcripts; its accuracy is competitive only when the transcript is complete. The author theorizes that the underlying chat-style model sometimes refuses to transcribe.

#asr

@cohere: Cohere Transcribe, our open-source speech recognition model, is #1 on the new @huggingface Far-Field ASR benchmark.

X AI KOLs Following ↗ · 2026-06-10 Cached

Cohere Transcribe, an open-source speech recognition model, achieved first place on Hugging Face's new Far-Field ASR benchmark.

#asr

Speaker Group Encoding in Self-supervised Speech Recognition Models

arXiv cs.CL ↗ · 2026-06-10 Cached

Investigates how self-supervised speech recognition models encode speaker group information (gender, age, dialect, ethnicity, native speaker status) across layers, and how finetuning for tasks like ASR or speaker identification affects this encoding.

#asr

Can Voice Agents Handle Bilingual Customers? Benchmarking Frontier ASR on Code-Switched Speech

Hugging Face Blog ↗ · 2026-06-09 Cached

ServiceNow AI releases a benchmark and dataset for evaluating automatic speech recognition (ASR) on code-switched speech across four language pairs (Spanish-English, French-English, Canadian French-English, German-English) in enterprise HR and IT scenarios, finding that current frontier ASR models still struggle with code-switching, leading to higher error rates.

#asr

Contrastive Training with LLM-generated Near-Misses for Robust Code-Switching Speech Recognition

arXiv cs.CL ↗ · 2026-06-08 Cached

Proposes a POI-aware contrastive training framework using LLM-generated near-misses to improve ASR robustness at code-switching regions, achieving consistent error reductions on two benchmarks.

#asr

Benchmark: ONNX Runtime vs HF Transformers vs GGUF for Parakeet TDT 0.6B on CPU-only hardware [D]

Reddit r/MachineLearning ↗ · 2026-06-05

A benchmark comparing ONNX Runtime, HF Transformers, and GGUF for the Parakeet TDT 0.6B ASR model on CPU-only hardware shows that ONNX Runtime achieves 37% faster inference than HF Transformers in bfloat16, while GGUF prioritizes memory efficiency.

#asr

Latency matters more than model selection when building AI tutoring systems

Reddit r/AI_Agents ↗ · 2026-06-04

A practitioner argues that speech start latency—not model selection—is the critical factor in AI tutoring systems, recommending targets under 1 second for speech start and highlighting streaming TTS as the highest-leverage optimization. The post outlines a full pipeline from ASR through TTS and avatar sync, identifying where latency compounds most.
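The point about streaming can be made concrete with a simplified latency model. The sketch below is an assumption-laden illustration, not the post's own measurements: the stage names follow the pipeline described in the summary, but every timing value is a hypothetical placeholder, and the model assumes each stage starts as soon as its upstream stage emits a first chunk (streaming) or its full output (non-streaming).

```python
from dataclasses import dataclass


@dataclass
class StageLatency:
    name: str
    first_output_s: float  # time until the stage emits its first usable chunk
    full_output_s: float   # time until the stage has emitted everything


def speech_start_latency(stages: list[StageLatency], streaming: bool) -> float:
    """Time from end of user speech until the tutor's audio starts playing."""
    if streaming:
        # Each stage hands off its first chunk immediately.
        return sum(stage.first_output_s for stage in stages)
    # Each stage waits for the previous one to finish completely.
    return sum(stage.full_output_s for stage in stages)


# Hypothetical placeholder timings, chosen only to show how latency compounds.
pipeline = [
    StageLatency("asr", first_output_s=0.2, full_output_s=0.3),
    StageLatency("llm", first_output_s=0.3, full_output_s=1.5),
    StageLatency("tts", first_output_s=0.2, full_output_s=1.2),
]

TARGET_S = 1.0  # the "under 1 second" speech-start target from the post

for mode in (False, True):
    latency = speech_start_latency(pipeline, streaming=mode)
    label = "streaming" if mode else "non-streaming"
    print(f"{label}: {latency:.1f} s, meets target: {latency < TARGET_S}")
```

With these placeholder values, only the streaming configuration lands under the one-second target, which matches the post's argument that handing off partial output early matters more than swapping models.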

#asr

@uniswap12: Microsoft open-sourced a voice AI that can transcribe 60 minutes of long audio in one go, handling 4 people speaking simultaneously. VibeVoice, open-sourced by Microsoft, 24.8k stars, I only found out about it today. For converting recordings to text, I've been using Whisper, but it often times out on long meeting recordings and struggles with multi-speaker recognition...

X AI KOLs Timeline ↗ · 2026-06-04 Cached

Microsoft open-sourced the VibeVoice speech AI framework, which supports single-pass transcription of 60-minute audio with multi-speaker diarization and timestamp labeling, and also provides multi-speaker TTS synthesis. It is based on Qwen2.5 and includes a lightweight 0.5B real-time version. It has received 24.8k stars on GitHub.

#asr

@yhslgg: Bro, sharing another open-source video translation tool—pyVideoTrans, with 17,700 stars on GitHub, a must-have for video repurposing and localization! In a nutshell: drop a video in, and it automatically runs through the entire pipeline of speech recognition → subtitle translation → AI dubbing → video synthesis, outputting a complete video in another language. Core...

X AI KOLs Timeline ↗ · 2026-06-03 Cached

pyVideoTrans is an open-source video translation tool that supports automatic speech recognition, subtitle translation, AI dubbing, and video synthesis. It integrates multiple ASR, translation, and TTS engines, making it suitable for cross-language video production and localization.

#asr

SALSA: Speech Aware LLM Adaptation via Learned Steering Activation Vectors

arXiv cs.CL ↗ · 2026-06-02 Cached

SALSA introduces a lightweight adaptation method for speech-aware LLMs that learns layer-wise steering vectors via a supervised objective, achieving significant improvements (up to 46.8% relative) on out-of-domain speech benchmarks. It also shows that steering the encoder layers is more effective than modifying the LLM backbone.

#asr

@badlogicgames: what a wonderful project: parakeet.cpp https://github.com/mudler/parakeet.cpp… GGML based parakeet inference pipeline t…

X AI KOLs Following ↗ · 2026-05-31 Cached

parakeet.cpp is a fast, dependency-light C++17 inference pipeline for NVIDIA's NeMo Parakeet speech recognition models, built on ggml. It achieves byte-identical transcripts to NeMo with significant speedups on CPU and GPU.

#asr

Transcribing Children's Speech: ASR Performance and Obtaining Reliable Orthographic Transcriptions

arXiv cs.CL ↗ · 2026-05-29 Cached

This paper evaluates nine ASR models (Whisper, Parakeet, Wav2Vec2) on Dutch child speech datasets JASMIN and DART, finding that fine-tuned Whisper-medium achieves the best performance (WER 5.54% on JASMIN, 70.37% on DART). It also proposes a selection method to automatically identify correctly pronounced utterances with high precision, reducing the need for manual verification.
