How I implemented ASR bias for voice transcription models [Open Source]
Summary
The article explains how to implement ASR biasing for voice transcription models, using examples from Groq and local models, and introduces the open-source Freestyle project that incorporates this feature.
Similar Articles
@MaxForAI: If you are working on voice agents, you should try this project. A team from NTU, NUS, and Shanghai AI Lab has released Mega-ASR, a fully open-source ASR model built on Qwen3-ASR that aims to break the long-standing bottleneck of ASR performance in noisy, reverberant, or otherwise impaired real-world environments...
NTU, NUS, and Shanghai AI Lab jointly released Mega-ASR, a fully open-source ASR model built on Qwen3-ASR. Using the Voices-in-the-Wild-2M dataset and progressive acoustic-to-semantic optimization, it achieves up to a 30% relative reduction in word error rate (WER) in noisy real-world environments. With only 1.7B parameters, it runs efficiently on consumer-grade hardware.
Evaluating Bias in Phoneme-Based Automatic Speech Recognition Systems: An Analysis of IPA Transcription Models
This paper evaluates demographic and accent biases in phoneme-based ASR systems, specifically WhisperIPA and ZIPA, using phoneme error rate and a new Soft PER metric, revealing persistent disparities across languages and groups.
SamaVaani: Auditing and Debiasing Multilingual Clinical ASR for Indian Languages
This paper audits multilingual clinical ASR systems on psychiatric interviews in Indian languages and proposes SamaVaani, a unified debiasing technique to improve performance and fairness across demographic groups.
Real-time multilingual ASR using rolling buffers and monolingual models [P]
This project presents a routing-based approach to real-time multilingual ASR that uses smaller monolingual models with a rollback mechanism to handle language switches. It achieves ~13% WER on inter-utterance code-switching, and the system is open source.
Introducing next-generation audio models in the API
OpenAI introduced next-generation audio models for the API: improved speech-to-text models (gpt-4o-transcribe, gpt-4o-mini-transcribe) and customizable text-to-speech models. They let developers build more intelligent and expressive voice agents with better accuracy in challenging scenarios.