Voice agents in noisy environments

Reddit r/AI_Agents Models

Summary

A speech company trained a model that cancels noise and identifies the primary speaker, achieving 50% lower word error rate on leading ASR models in noisy environments.

I have been working with Voice agents since more than a year now. And they always break in certain environments - noises, people talking around, etc. I looked for solutions, even asked on a few subreddits, and found nothing that worked well. Being a Speech company, we decided to solve this. We trained a model that identifies the primary speaker and cancels out everything else. Months of exploration, data sourcing, and multiple model iterations later, we built something that is now outperforming industry benchmarks- 50% lower WER on leading ASR models. Since I actively follow and learn from this group, just wanted to share the journey. :)
Original Article

Similar Articles

@MaxForAI: If you are working on voice agents, you should try this project. A team from NTU, NUS, and Shanghai AI Lab released: Mega-ASR. This fully open-source ASR is built on Qwen3-ASR, aiming to break the long-standing bottleneck of ASR performance in noisy, reverberant, or other impaired real-world environments...

X AI KOLs Timeline

NTU, NUS, and Shanghai AI Lab jointly released Mega-ASR, a fully open-source ASR model built on Qwen3-ASR. Using the Voices-in-the-Wild-2M dataset and progressive acoustic-to-semantic optimization, it achieves up to 30% relative Word Error Rate (WER) reduction in real-world noisy environments. With only 1.7B parameters, it enables efficient inference on consumer-grade hardware.

Can Voice Agents Handle Bilingual Customers? Benchmarking Frontier ASR on Code-Switched Speech

Hugging Face Blog

ServiceNow AI releases a benchmark and dataset for evaluating automatic speech recognition (ASR) on code-switched speech across four language pairs (Spanish-English, French-English, Canadian French-English, German-English) in enterprise HR and IT scenarios, finding that current frontier ASR models still struggle with code-switching, leading to higher error rates.

@FeitengLi: Actually, these problems can be well solved: 1. Ditch whisper, switch to an ASR model. Qwen3-ASR is great with few hallucinations, and there are other ASR options. Whisper has many hallucinations and requires 30s segments. Qwen3-ASR gets more accurate with longer audio, supporting up to 20…

X AI KOLs Timeline

Recommends using Qwen3-ASR instead of Whisper to reduce hallucinations, using LattifAI tools for precise audio-text alignment and subtitle generation, and introducing their own OmniVAD-Kit project for voice activity detection.

Transcribing Children's Speech: ASR Performance and Obtaining Reliable Orthographic Transcriptions

arXiv cs.CL

This paper evaluates nine ASR models (Whisper, Parakeet, Wav2Vec2) on Dutch child speech datasets JASMIN and DART, finding that fine-tuned Whisper-medium achieves the best performance (WER 5.54% on JASMIN, 70.37% on DART). It also proposes a selection method to automatically identify correctly pronounced utterances with high precision, reducing the need for manual verification.