speech-recognition

Pretrained self-supervised speech models can recognize unseen consonants

arXiv cs.CL ↗ · 2026-06-11 Cached

This paper investigates whether pretrained self-supervised speech models like Wav2Vec2 and HuBERT can accurately recognize click consonants, which are rare in training data, by fine-tuning on Khoisan languages. Results show the models recognize clicks more accurately than non-clicks, indicating generalization to uncommon phonemes.

VTT for Mac

Product Hunt ↗ · 2026-06-11

VTT for Mac is a voice-to-text tool for macOS that offers a fully on-device option for privacy.

Tried to benchmark Google’s new on-device dictation models (Eloquent) and basically couldn’t

Reddit r/LocalLLaMA ↗ · 2026-06-10

A user attempted to benchmark Google's new on-device dictation app Eloquent, which uses proprietary models, and found that it frequently drops words or returns incomplete transcripts; accuracy is competitive only when the transcript is complete. The author theorizes that the underlying chat-style model sometimes refuses to transcribe.

@cohere: Cohere Transcribe, our open-source speech recognition model, is #1 on the new @huggingface Far-Field ASR benchmark.

X AI KOLs Following ↗ · 2026-06-10 Cached

Cohere Transcribe, an open-source speech recognition model, achieved first place on Hugging Face's new Far-Field ASR benchmark.

Speaker Group Encoding in Self-supervised Speech Recognition Models

arXiv cs.CL ↗ · 2026-06-10 Cached

This paper investigates how self-supervised speech recognition models encode speaker group information (gender, age, dialect, ethnicity, native speaker status) across layers, and how fine-tuning for tasks such as ASR or speaker identification affects this encoding.

I fine-tuned Parakeet 0.6B for medical ASR — open weights, local Mac/CUDA/CPU

Reddit r/LocalLLaMA ↗ · 2026-06-09

The founder of Omi Health fine-tuned NVIDIA's Parakeet TDT 0.6B for medical ASR and released the open-weights model Omi Med STT v1, which achieves competitive medical WER while running locally on Mac, CUDA, or CPU.

Benchmark: ONNX Runtime vs HF Transformers vs GGUF for Parakeet TDT 0.6B on CPU-only hardware [D]

Reddit r/MachineLearning ↗ · 2026-06-05

A benchmark comparing ONNX Runtime, HF Transformers, and GGUF for the Parakeet TDT 0.6B ASR model on CPU-only hardware shows that ONNX Runtime achieves 37% faster inference than HF Transformers in bfloat16, while GGUF prioritizes memory efficiency.

Domain-Aware Mispronunciation Detection and Diagnosis Using Language-Specific Statistical Graphs

arXiv cs.CL ↗ · 2026-06-05 Cached

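This paper proposes a domain-aware mispronunciation detection and diagnosis method built on language-specific statistical graphs, achieving an F1 score of 59.52% on the L2-ARCTIC benchmark and outperforming several baseline models.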

Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders

Hugging Face Daily Papers ↗ · 2026-06-05 Cached

This paper demonstrates that Whisper's hallucination failures on silence, noise, or music can be detected and mitigated purely from internal activations using sparse autoencoders, achieving large reductions in hallucination rate without fine-tuning.

@uniswap12: Microsoft open-sourced a voice AI that can transcribe 60 minutes of long audio in one go, handling 4 people speaking simultaneously. VibeVoice, open-sourced by Microsoft, 24.8k stars, I only found out about it today. For converting recordings to text, I've been using Whisper, but it often times out on long meeting recordings and struggles with multi-speaker recognition...

X AI KOLs Timeline ↗ · 2026-06-04 Cached

Microsoft open-sourced VibeVoice, a speech AI framework that supports one-shot transcription of 60-minute audio, multi-speaker diarization, and timestamp labeling, and also provides multi-role TTS synthesis. It is based on Qwen2.5 and includes a lightweight 0.5B real-time version. It has received 24.8k stars on GitHub.

Towards Truly Multilingual ASR: Generalizing Code-Switching ASR to Unseen Language Pairs

Hugging Face Daily Papers ↗ · 2026-06-04 Cached

This paper investigates whether code-switching ASR capabilities learned from limited seen language pairs can generalize to unseen pairs using model merging and domain generalization methods, finding only modest transfer.

LaSR: Context-Aware Speech Recognition via Latent Reasoning

arXiv cs.CL ↗ · 2026-06-02 Cached

LaSR proposes a latent reasoning training paradigm for context-aware speech recognition, aligning chain-of-thought supervision around acoustic features to improve terminology recognition without added latency, outperforming standard fine-tuning on Fun-Audio-Chat.

Real-time multilingual ASR using rolling buffers and monolingual models [P]

Reddit r/MachineLearning ↗ · 2026-06-01

This project takes a routing-based approach to real-time multilingual ASR, using smaller monolingual models with a rollback mechanism to handle language switches. It achieves ~13% WER on inter-utterance code-switching, and the system is open source.

Your Multimodal Speech Model Says I Have a Face for Radio

arXiv cs.CL ↗ · 2026-06-01 Cached

This paper presents the first bias evaluation of multimodal speech recognition models, finding significant accuracy differences across gender and ethnicity when pairing faces with audio, with implications for fairness in AI systems.

@badlogicgames: what a wonderful project: parakeet.cpp https://github.com/mudler/parakeet.cpp… GGML based parakeet inference pipeline t…

X AI KOLs Following ↗ · 2026-05-31 Cached

parakeet.cpp is a fast, dependency-light C++17 inference pipeline for NVIDIA's NeMo Parakeet speech recognition models, built on ggml. It achieves byte-identical transcripts to NeMo with significant speedups on CPU and GPU.

Transcribing Children's Speech: ASR Performance and Obtaining Reliable Orthographic Transcriptions

arXiv cs.CL ↗ · 2026-05-29 Cached

This paper evaluates nine ASR models (Whisper, Parakeet, Wav2Vec2) on Dutch child speech datasets JASMIN and DART, finding that fine-tuned Whisper-medium achieves the best performance (WER 5.54% on JASMIN, 70.37% on DART). It also proposes a selection method to automatically identify correctly pronounced utterances with high precision, reducing the need for manual verification.

A Dataset of Robot-Patient and Doctor-Patient Medical Dialogues for Spoken Language Processing Tasks

arXiv cs.AI ↗ · 2026-05-27 Cached

This paper introduces MeDial-Speech, a dataset of robot-patient and doctor-patient medical dialogues for spoken language processing, and evaluates three LLMs on a sentence selection benchmark, finding Claude Sonnet 4 most accurate.

Phonetic Modeling of Dialectal Variation in Vietnamese Speech

arXiv cs.CL ↗ · 2026-05-26 Cached

This paper proposes a dialect-aware phonetic framework for modeling phonetic variation in Vietnamese ASR, decomposing syllables into structured components and mapping them to dialect-specific IPA representations. The approach matches pretrained baselines with fewer parameters and no external pretraining on the UIT-ViMD multi-dialect dataset.

Direct Preference Optimization for English-Mandarin Code-Switching Speech Recognition in Audio LLMs

arXiv cs.CL ↗ · 2026-05-26 Cached

This paper applies Direct Preference Optimization (DPO) to align Audio LLMs for transcribing English-Mandarin code-switching speech, achieving up to 89.6% MER reduction in-distribution and 20% out-of-distribution. It identifies three failure modes—language omission, translation instead of transcription, and hallucination—and shows that preference-based alignment effectively elicits correct code-switching behavior from multilingual Audio LLMs.

@MaxForAI: If you are working on voice agents, you should try this project. A team from NTU, NUS, and Shanghai AI Lab released: Mega-ASR. This fully open-source ASR is built on Qwen3-ASR, aiming to break the long-standing bottleneck of ASR performance in noisy, reverberant, or other impaired real-world environments...

X AI KOLs Timeline ↗ · 2026-05-22 Cached

NTU, NUS, and Shanghai AI Lab jointly released Mega-ASR, a fully open-source ASR model built on Qwen3-ASR. Using the Voices-in-the-Wild-2M dataset and progressive acoustic-to-semantic optimization, it achieves up to 30% relative Word Error Rate (WER) reduction in real-world noisy environments. With only 1.7B parameters, it enables efficient inference on consumer-grade hardware.
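
For readers unfamiliar with the metric, the sketch below shows one common way to compute word error rate and a relative WER reduction such as the "up to 30%" figure quoted above. It is a minimal illustration, not the evaluation code used by Mega-ASR or any other project listed here: the function names `word_error_rate` and `relative_wer_reduction` and the example numbers are hypothetical, and real evaluations usually normalize text (casing, punctuation) before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers: dropping from 20% WER to 14% WER is a 6-point
# absolute reduction but roughly a 30% relative reduction.
example_reduction = relative_wer_reduction(0.20, 0.14)
```

A relative reduction depends on the baseline, so a 30% relative figure can correspond to very different absolute improvements across benchmarks.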
