@MaxForAI: 如果你在做语音Agent,你应该试一下这个项目 来自南洋理工、新国立和上海 AI Lab的团队发布了:Mega-ASR 这个完全开源的ASR基于 Qwen3-ASR构建,目的是打破长期困扰ASR的在嘈杂、混响或其他受损现实环境中表现的瓶颈…

X AI KOLs Timeline 模型

摘要

南洋理工、新国立和上海 AI Lab 联合发布 Mega-ASR,一个基于 Qwen3-ASR 构建的完全开源 ASR 模型,通过 Voices-in-the-Wild-2M 数据集和渐进式声学到语义优化,在真实世界嘈杂环境中实现最高 30% 的相对词错误率下降,且仅 1.7B 参数可在消费级硬件高效推理。

如果你在做语音Agent,你应该试一下这个项目 来自南洋理工、新国立和上海 AI Lab的团队发布了:Mega-ASR 这个完全开源的ASR基于 Qwen3-ASR构建,目的是打破长期困扰ASR的在嘈杂、混响或其他受损现实环境中表现的瓶颈。 以前ASR(例如Whisper)当然也有「抗噪」「远场」「会议转写」「口音鲁棒」这些能力。 但Mega-ASR想解决的是Whisper没真正解决的那块: 真实世界里的烂音频。 远场、混响、回声、电流声、设备录音失真、传输丢包、遮挡、背景噪声等等。 为此他们做了一个Voices-in-the-Wild-2M数据集,覆盖7类典型声学现象和54种复合场景。 然后用A2S-SFT和DG-WGPO做训练,也就是先让模型适应声学污染,再让它学会在高错误率音频里保留语义、减少漏句和幻觉。 结果非常不错。 在专为真实世界设计的具有挑战性的场景中,与 Qwen3-ASR、Gemini-3.1-Pro、Seed-ASR 和 Whisper 等强基线相比,它可实现高达 30% 的相对词错误率(WER)下降。 同时由于仅仅只有1.7B的参数大小, 在消费级硬件上依然可以高效推理。 所有内容均以 Apache 2.0 许可发布:模型权重、训练代码、评估工具、200 万数据集以及 Voices-in-the-Wild-Bench。 研究人员、开发者和企业可以在没有限制的情况下进行微调、部署或进行二次开发。 技术报告:https://arxiv.org/abs/2605.19833 项目主页:https://xzf-thu.github.io/Mega-ASR/ Github:https://github.com/xzf-thu/Mega-ASR… Hugging Face:https://huggingface.co/zhifeixie/Mega-ASR…
查看原文
查看缓存全文

缓存时间: 2026/05/22 09:47

如果你在做语音Agent,你应该试一下这个项目

来自南洋理工、新国立和上海 AI Lab的团队发布了:Mega-ASR

这个完全开源的ASR基于 Qwen3-ASR构建,目的是打破长期困扰ASR的在嘈杂、混响或其他受损现实环境中表现的瓶颈。

以前ASR(例如Whisper)当然也有「抗噪」「远场」「会议转写」「口音鲁棒」这些能力。

但Mega-ASR想解决的是Whisper没真正解决的那块: 真实世界里的烂音频。

远场、混响、回声、电流声、设备录音失真、传输丢包、遮挡、背景噪声等等。

为此他们做了一个Voices-in-the-Wild-2M数据集,覆盖7类典型声学现象和54种复合场景。

然后用A2S-SFT和DG-WGPO做训练,也就是先让模型适应声学污染,再让它学会在高错误率音频里保留语义、减少漏句和幻觉。

结果非常不错。

在专为真实世界设计的具有挑战性的场景中,与 Qwen3-ASR、Gemini-3.1-Pro、Seed-ASR 和 Whisper 等强基线相比,它可实现高达 30% 的相对词错误率(WER)下降。

同时由于仅仅只有1.7B的参数大小, 在消费级硬件上依然可以高效推理。

所有内容均以 Apache 2.0 许可发布:模型权重、训练代码、评估工具、200 万数据集以及 Voices-in-the-Wild-Bench。

研究人员、开发者和企业可以在没有限制的情况下进行微调、部署或进行二次开发。

技术报告:https://arxiv.org/abs/2605.19833 项目主页:https://xzf-thu.github.io/Mega-ASR/ Github:https://github.com/xzf-thu/Mega-ASR… Hugging Face:https://huggingface.co/zhifeixie/Mega-ASR…


Mega-ASR: Towards In-the-wild2 Speech Recognition via Scaling Up Real-world Acoustic Simulation

Source: https://arxiv.org/html/2605.19833 Zhifei Xie1*, Kaiyu Pang3*, Haobin Zhang2*, Deheng Ye1†\dagger, Xiaobin Hu2†\dagger Shuicheng Yan2†\dagger,Chunyan Miao1†\dagger 1NTU2NUS3Shanghai AI Lab [email protected]

Abstract

Despite rapid advances in automatic speech recognition (ASR) and large audio-language models, robust recognition in real-world environments remains limited by an “acoustic robustness bottleneck”: models often lose acoustic grounding and produce omissions or hallucinations under severe, compositional distortions. We proposeMEGA-ASR, a unified ASR-in-the-wild framework that combines scalable compound-data construction with progressive acoustic-to-semantic optimization. We introduceVOICES-IN-THE-WILD-2M, covering7classic acoustic phenomena and54physically plausible compound scenarios, and train MEGA-ASR withAcoustic-to-Semantic Progressive Supervised Fine-TuningandDual-Granularity WER-Gated Policy Optimization. Extensive experiments demonstrate that MEGA-ASR achieves significant advantages over prior state-of-the-art systems on adverse-condition ASR benchmarks (45.69% vs. 54.01% on VOiCES R4-B-F, and 21.49% vs. 29.34% on NOIZEUS Sta-0). On complex compositional acoustic scenarios, MEGA-ASR further deliversover 30%relative WER reduction against strong open- and closed-source baselines, establishing a scalable paradigm for robust ASR in-the-wild.

Refer to caption

Figure 1:Radar comparison of Qwen3-ASR-1.7B and Mega-ASR across selected ASR evaluation subsets, covering both clean and robustness benchmarks.## 1Introduction

Automatic speech recognition (ASR) is one of the most fundamental tasks in the speech domain, and has evolved rapidly in recent years. State-of-the-art ASR models(Shiet al.,2026; Xuet al.,2026; Gaoet al.,2022)achieve excellent accuracy on widely used benchmarks(Panayotovet al.,2015), with word error rates approaching 1%. Beyond this, large audio-language models (LALMs)(Xuet al.,2025b; Dinget al.,2025)scale to billion-parameter architectures that integrate pretrained linguistic knowledge and even support reasoning-based error correction(Lina and Aksyonov,2024), improving contextual consistency and achieving human-level performance on canonical benchmarks.

However, performance drops sharply under real-world acoustic conditions: WER typically rises to 10%–30%, and in harder cases can be as high as 70%, often accompanied bydropped utterances orsevere hallucination. Recent work onASR-in-the-wild(Yanet al.,2025; Hanet al.,2017)seeks to bridge this gap through improved data and post-training strategies. Nevertheless, three limitations persist.(D1) Limited scenario coverage.Prior work typically targets one or two isolated conditions (e.g., noise or far-field), requiring different specialized models for different environments.(D2) Lack of compositional robustness.Robustness factors are studied independently, while real-world conditions are inherently compositional (e.g., simultaneous reverberation, echo, and frequency dropout), and large-scale data for such mixtures remains scarce.(D3) Mismatch between training data and real-world conditions.The data that existing models are trained on emphasize relatively mild WER ranges (4%4\%–10%10\%), which do not reflect challenging settings where WER exceeds 30% and demands stronger semantic reasoning over degraded signals. These gaps motivate a shift towardASR-in-the-wild2, pushing ASR models to handle acoustic conditions that are not just singly complex, and to recognize speech under much harder settings.

In this work, we proposeMega-ASR, a framework specifically designed to strengthen ASR capability underin-the-wildcomplex acoustic environments.Mega-ASRis able to(1)achieve state-of-the-art accuracy on individual environmental conditions within a single model,(2)deliver superior performance on real-world recordings exhibiting compound environmental effects, and(3)recover semantic information under highly challenging conditions, which requires a dataset that is both close to the real-world distribution and scalable. To this end, we introduceVoices-in-the-wild-2M, a large-scale ASR dataset comprising7canonical meta-scenarios and54newly constructed compound scenarios, generated by a spectral-manipulation-based simulation method. We first(i) simulate 7atomic acoustic effectsin isolation as the foundation, then(ii) scale to 54 compound scenarioswith an agentic check that verifies physical plausibility (e.g., a church corresponds to far-field plus echo). To obtain data that is both challenging and suitable for training, we(iii) calibrate the difficultydistributionthrough controlled experiments, and finally(iv) filter out samples with WER above 70%to ensure training stability. We then developAcoustic-to-Semantic Progressive Supervised Fine-Tuning (A2S-SFT), addressing two coupled bottlenecks at medium-to-high WER: extracting semantic information from acoustic signals under heavy perturbation, and recovering the intended semantics. Through this progressive capability building, we obtainxMega-ASR-Base, whose foundational capabilities for the reward signal that subsequent reinforcement learning depends on.

Finally, during RL training, recognition errors at medium difficulty are mostly word-level mistakes, but once WER exceeds 30%, the dominant failure mode changes sharply into severely incorrect semantics, hallucinated guesses, and large portions of dropped sentences. As a result, WER-based rewards cannot provide an effective learning signal in this situation. We therefore proposeDual-Granularity WER-Gated Policy Optimization (DG-WGPO), a dynamic reward scheme with two parts. We also adopt a classicstatic rule-based rewardconsisting of WER and a repetition penalty as the basic learning signal. As the core of DG-WGPO, we introduce aDual-Granularity Dynamic Rewarddesigned specifically for ASR under complex acoustic environments, which combines atoken-level refinement reward for local information recoveryand asentence-level reconstruction reward for overall semantic preservation on hard samples, with aWER-gated mirrored fusion strategythat dynamically allocates the weights between them. Extensive experiments show that MEGA-ASR substantially outperforms prior state-of-the-art systems on adverse-condition and compositional real-world benchmarks.

2Related Work

ASR Foundation Models and Robust Speech Recognition.

Recent ASR foundation models, spanning encoder-decoder systems, large-scale self-supervised models, and audio-language models, have achieved strong results on standard benchmarks(Radfordet al.,2023; Xuet al.,2026; Shiet al.,2026; Gaoet al.,2023; Xuet al.,2025a,b; Dinget al.,2025; Wuet al.,2025). However, strong performance under clean or mildly noisy conditions does not imply robustness in deployment, where speech is often corrupted by simultaneous degradations such as noise, far-field propagation, reverberation, obstructed, device distortion, and transmission dropout. Existing robust ASR studies typically address only one or two such factors, leaving severe and compositional conditions underexplored.

Datasets and Simulation for In-the-wild ASR.

A long line of robust ASR benchmarks studies recognition under adverse conditions, including additive noise, distant microphones, reverberation, replayed speech, and device effects(Hu and Loizou,2007; Watanabeet al.,2016; Richeyet al.,2018; Mysore,2014; Rousseauet al.,2012; Ardilaet al.,2020; Pavlichenkoet al.,2021), but most emphasize isolated factors or mild degradation regimes. In practice, environments such as classrooms, corridors, or vehicles routinely combine background noise, far-field attenuation, echo, occlusion, and device-induced distortion. Augmentation methods like noise mixing, RIR convolution, spectral masking, clipping, and codec simulation partially address this(Snyderet al.,2015; Reddyet al.,2020; Koet al.,2015,2017; Paradaet al.,2022), but typically serve as local training perturbations rather than a systematic model of real acoustic worlds.

3Voices-in-the-wild-2M

3.1Overview

Existing datasets for robust ASR mostly cover only a narrow set of isolated acoustic conditions, with mild WER typically between4%–10%as shown in Table1, whereas real-world environments mix multiple environmental effects (e.g., far-field with echo&reverb in a church interior) and routinely push WER beyond 30%. To facilitate research in this regime, we introduceVoices-in-the-wild-2M, a large-scale dataset built through spectrogram-level code-based simulation, the design choice that makes its scale tractable. To faithfully simulate the complex acoustic conditions encountered in-the-wild, we first identify, as shown in Figure2, seven classic in-the-field acoustic effects{noise,far-field,obstructed,echo&reverb,recording,electronic distortion,transmission dropout}\left\{\textit{noise},\ \textit{far-field},\ \textit{obstructed},\ \textit{echo\&reverb},\ \textit{recording},\ \textit{electronic distortion},\ \textit{transmission dropout}\right\}, which we termatomic acoustic effects. Each atomic effect is implemented as a dedicated spectral processing pipeline and iteratively calibrated against real recordings, with parameters re-tuned and validated via SFT on Qwen3-ASR until the simulator attains best fit on real data. The atomic phenomena are then composed into54agent-validated configurations, yielding2.4Msynthesized clips whose effectiveness on real-world data is empirically verified after mixed-condition training.Voices-in-the-wild-2Mis also substantially more challenging, thereby promoting robustness in complex real-world environments: even the state-of-the-art Qwen3-ASR(Shiet al.,2026)attains a high average WER of35%on this benchmark.

Table 1:Coverage comparison of acoustic degradation scenarios across datasets.sourceAcoustic PhenomenaDatasetreal.sim.NoiseFarBarr.E&RRecordDistortDropScaleWERNOIZEUS(Hu and Loizou,2007)✗✔✔✗✗✗✗✗✗1K9.45TED-LIUM(Rousseauet al.,2012)✔✗✗✔✗✗✗✗✗59K2.31CHiME-4(Watanabeet al.,2016)✔✔✔✗✗✗✗✗✗15K5.39VOiCES(Watanabeet al.,2016)✔✗✔✔✔✔✔✗✗1M8.94BERSt(Tuttösíet al.,2026)✔✗✗✔✔✗✗✗✗4.5K22.41DAPS(Mysore,2014)✔✗✗✗✗✗✗✔✗2K6.24Voices-in-the-wild-2M✔✔✔✔✔✔✔✔✔2M18.42Refer to captionFigure 2:Voices-in-the-wild-2Menables environmentally robust ASR by expanding 7 meta-scenarios into 54 hybrid scenarios, covering diverse real-world acoustic degradations at scale.

3.2Realistic Simulation of Compound Acoustic Environments

In principle, two routes exist for building such a dataset:(Option 1) curating existing materials such as online videos, which we found costly and fundamentally unscalable, and(Option 2) synthesizing from clean speech clips. We adopt the latter for its flexibility and, more importantly, its scalability. The pipeline proceeds as follows.(i) Atomic acoustic effect simulation.As the foundation of the pipeline, we simulate each of the seven phenomena directly on the spectrogram via filtering, convolution, and related signal-level transformations, with parameters iteratively tuned to best fit real-world recordings. We further incorporate a broad collection of real-world material spanning comprehensive background and speech sources: noise from MUSAN(Snyderet al.,2015), DNS Challenge(Reddyet al.,2020), ESC-50(Piczak,2015), and UrbanSound8K(Salamonet al.,2014)(~42K clips, 129 hours), and clean speech from LibriSpeech(Panayotovet al.,2015), Common Voice(Richeyet al.,2018), WenetSpeech(Zhanget al.,2022), and AISHELL-1(Buet al.,2017).(ii) Reality-grounded composition.Since real environments rarely exhibit a single isolated effect, we scale from atomic effects to compound scenarios by composing 2 to 5 atomic effects, retaining only physically plausible combinations (e.g., far-field with ambient noise in a church interior) and yielding the 54 compound configurations above.(iii) Controllable-difficulty synthesis.To obtain data that is both challenging and suitable for training, we calibrate the difficulty distribution by exposing a unified severity parameterk∈[0,1]k\in[0,1]for every effect and generating 50K probe samples under four candidate distributions overkk(Sqrt-Forward, Sqrt-Backward, Gaussian-Mid, Linear); as shown in Figure3, theLinear distribution is adopted as the severity profile of the dataset.(iv) Learnability fi-

Refer to captionFigure 3:Left: SFT accuracy curves on real samples after careful tuning, shown for individual and mixed atomic effects. Right: comparison of difficulty sampling distributions on Noizeus 0dB.tering.To ensure training stability, we discard samples with WER above 70%, which we observe to destabilize training otherwise. Full pipeline details and examples are provided in the appendixC.

3.3Voices-in-the-wild-Bench: A Real-Recording Evaluation Benchmark

We further release Voices-in-the-wild-Bench, a 5,000-clip English/Mandarin evaluation set covering the same seven atomic phenomena asVoices-in-the-wild-2M, comprising 3,500 synthetic clips and 1,500 real-world recordings collected from internet sources and 16 human participants.

4Mega-ASR

We propose a framework, as shown in figure4for robust speech recognition under complex acoustic conditions. We first developMega-ASR-Baseon top of Qwen3-ASR(Shiet al.,2026)viaAcoustic-to-Semantic Progressive Supervised Fine-Tuning, instilling perceptual robustness and semantic recovery.We then applyDual-Granularity WER-Gated Policy Optimizationthat supplies token- and sentence-level rewards, dynamically modulating their granularity to mitigate WER reward failure.

Refer to captionFigure 4:Overview of the proposed DG-WGPO framework. Starting from A2S-SFT initialization, the policy model generates multiple hypotheses scored by a dynamic reward with gated fusion.### 4.1Acoustic-to-Semantic Progressive Supervised Fine-Tuning

We observe that existing ASR models struggle to maintain reliable acoustic understanding in the medium and high WER regimes, often producingempty outputs, severe hallucinations, or off-audiotranscriptions.The failure stems from two coupled bottlenecks:(i)extracting reliable acoustic evidence from corrupted waveforms, which the encoder-aligner stack alone cannot guarantee, and(ii)leveraging the LLM’s semantic prior to reconstruct the intended transcription when that evidence is only partially reliable. A2S-SFT addresses them in three phases:(i)a WER-graded curriculum on the encoder and aligner, successively expanding fromWER<30%\text{WER}{<}30\%toWER<50%\text{WER}{<}50\%and finally toWER<70%\text{WER}{<}70\%, to build acoustic perception incrementally;(ii)LLM fine-tuning on fullWER<70%\text{WER}{<}70\%samples to activate semantic recovery under unreliable acoustic evidence; and(iii)joint fine-tuning of encoder, aligner, and LLM for end-to-end alignment.

4.2Dual-Granularity WER-Gated Policy Optimization

Building onMega-ASR-Base, we apply DAPO(Yuet al.,2025)to sharpen the policy.We observe during training that errors whenWER<=30%\text{WER}{<=}30\%are predominantly word-level confusions, whereas beyond this threshold they shift abruptly into sentence-level failures such as hallucinations and omissions.The standard WER reward, however, conflates these two regimes and further saturates under heavy degradation, collapsing intra-group dispersion precisely where the policy needs it most. We therefore proposeDual-Granularity WER-Gated Policy Optimization (DG-WGPO), which retains a classicstatic rule-based reward(WER plus a repetition penalty) as the basic learning signal, and introduces aDual-Granularity Dynamic Rewardas its core, applying WER-gated fine- and coarse-grained rewards aligned with the two error regimes.

4.2.1Static Rule-Based Rewards

The static rewards provide a stable, sample-independent anchor that ties the policy directly to the evaluation metric while filtering out degenerate rollouts.

WER reward.

The WER reward serves as a direct anchor to the evaluation metric:

Rwer​(H,R)=1−WER​(H,R).R_{\text{wer}}(H,R)=1-\text{WER}(H,R).(1)

Anti-repetition reward.

Rollouts occasionally collapse into repeated short n-grams, inflating token coverage with hallucinated content. We apply a multiplicative hard gate that zeros out such rollouts:

Rrep​(H)={0,if​H​contains repeatedn-grams beyond threshold,1,otherwise.R_{\text{rep}}(H)=\begin{cases}0,&\text{if }H\text{ contains repeated n-grams beyond threshold},\\ 1,&\text{otherwise}.\end{cases}(2)We aggregate the two into a single static signal that gates transcription accuracy on non-degenerate rollouts:

Rstatic=Rrep⋅Rwer.R_{\text{static}}=R_{\text{rep}}\cdot R_{\text{wer}}.(3)

4.2.2Dual-Granularity Dynamic Reward

At the core of DG-WGPO, the Dual-Granularity Dynamic Reward is designed specifically for ASR under complex acoustic environments. It combines a token-level refinement reward for local information recovery and a sentence-level reconstruction reward for overall semantic preservation on hard samples, with a WER-gated mirrored fusion strategy that dynamically allocates the weights between them.

Token-level refinement reward.

Targeting failure mode(i), we partition substitution errors by character-level edit similarity. Given a hypothesis tokenhhand reference tokenrr,

sim​(h,r)=1−edit​(h,r)max⁡(|h|,|r|)∈[0,1],\text{sim}(h,r)=1-\frac{\text{edit}(h,r)}{\max(|h|,|r|)}\in[0,1],(4)and we classify a substitution assoftifsim​(h,r)≥0.5\text{sim}(h,r)\geq 0.5(the midpoint of the similarity range) andhardotherwise. Insertions and deletions are uniformly treated as hard, since both signal hallucination rather than acoustic confusion. The refinement reward discounts the two error types separately:

Rfine=nCnC+nhard+αs​nsoft+ϵ,R_{\text{fine}}=\frac{n_{C}}{n_{C}+n_{\text{hard}}+\alpha_{s}\,n_{\text{soft}}+\epsilon},(5)wherenCn_{C},nhardn_{\text{hard}},nsoftn_{\text{soft}}are the counts of correct tokens, hard errors, and soft errors respectively,αs∈(0,1)\alpha_{s}\in(0,1)is the soft-error discount, andϵ=10−8\epsilon=10^{-8}ensures numerical stability.

Sentence-level reconstruction reward.

Targeting failure mode(ii), we score the hypothesis by backbone preservation rather than token-level agreement:

Rstruc=12⋅LCS​(H,R)|R|+12⋅max⁡(0,1−||H|−|R|||R|),R_{\text{struc}}=\frac{1}{2}\cdot\frac{\text{LCS}(H,R)}{|R|}+\frac{1}{2}\cdot\max\!\left(0,\,1-\frac{\big||H|-|R|\big|}{|R|}\right),(6)where the LCS term rewards backbone agreement under local reordering and the length term penalizes truncation and runaway generation. The two terms are equally weighted as both contribute to structural integrity.

WER-gated dynamic fusion.

The relative usefulness of the two granularities flips at the refinement-reconstruction boundary, so we fuse them with a WER-gated mirrored weighting that always assigns the dominant weight to the regime-appropriate granularity:

Rdynamic={0.75​Rfine+0.25​Rstruc,WER​(H,R)<τ,0.25​Rfine+0.75​Rstruc,WER​(H,R)≥τ.R_{\text{dynamic}}=\begin{cases}0.75\,R_{\text{fine}}+0.25\,R_{\text{struc}},&\text{WER}(H,R)<\tau,\\[2.0pt] 0.25\,R_{\text{fine}}+0.75\,R_{\text{struc}},&\text{WER}(H,R)\geq\tau.\end{cases}(7)

Final objective.

The full reward combines the rule-based anchor with the dynamic signal:

R=(1−αdyn)​Rsimple+αdyn​Rdynamic.R=(1-\alpha_{\text{dyn}})\,R_{\text{simple}}+\alpha_{\text{dyn}}\,R_{\text{dynamic}}.(8)We set the three hyperparameters asτ=0.3\tau=0.3,αs=0.4\alpha_{s}=0.4, andαdyn=0.6\alpha_{\text{dyn}}=0.6.

4.3Environment-Aware Routing for Plug-and-Play Inference

TrainingMega-ASRon heavily degraded audio sharpens its noise robustness but partially erodes complementary capabilities such as clean-speech recognition, hotword recognition, and streaming ASR. To preserve both, we route each utterance to the appropriate model at inference time. Specifically, as illustrated in figure4.3we fine-tune a lightweight binary classifier with LoRA on a mixture of clean speech andVoices-in-the-Wildsamples, predicting whether an input requires Mega-ASR’s noise-robust weights or the original backbone. This routing keepsMega-ASRas a plug-and-play module that activates only when the acoustic environment demands it, leaving clean-domain performance untouched.

Refer to captionFigure 5:Environment-aware routing for plug-and-play inference.Table 2:Performance comparison on noisy and robust ASR benchmarks.ModelCHiME-4VOiCESNOIZEUSAvg.RealSimAvg.rm1rm2rm3rm4Avg.0dB5dB10dB15dBAvg.Closed-source modelsGemini3-Flash6.585.676.1253.104.2725.9921.8613.8155.7824.4818.498.5226.8215.59Doubao-LLM ASR9.9511.6210.794.866.9917.237.859.2325.789.514.962.8710.7810.27GPT-4o-trans.5.367.576.4710.9712.5646.6829.3822.6562.4020.566.152.6422.9417.35Open-source modelsVoxtral-Mini6.019.047.533.503.5127.5416.4512.7541.0615.804.852.9416.1612.15Kimi-Audio5.667.466.562.102.2326.9515.1311.6038.3311.364.342.2714.0810.74Whisper-L-v35.658.397.022.852.9725.6815.6511.7934.7112.553.932.1713.3410.72Canary-1B-v27.199.738.463.143.0024.8815.5611.6538.5312.766.563.7715.4111.84Parakeet-v36.618.827.723.233.2719.7713.8410.0338.9514.675.993.1515.6911.15Qwen2.5-Omni6.628.137.374.154.0344.7622.5318.8754.9117.723.200.8819.1815.14Step-Audio-2-mini5.357.066.201.811.9823.2515.1910.5632.028.943.722.2711.749.50Qwen3-ASR4.666.115.392.522.6219.1811.448.9423.978.473.411.969.457.93Our modelMega-ASR4.416.045.232.362.4315.139.467.3519.806.612.790.887.526.70Mega-ASR w/ router4.385.625.002.422.4915.329.267.3719.806.973.051.767.906.76 Table 3:Performance comparison on standard ASR benchmarks. For LibriSpeech, each entry is reported as clean/other. Underline indicates the best performance among open-source models.ModelLibriSp.Comm.VoiceFleursAISHELL-1WenetSp.VoxPop.DevTestzhenzhentestnetmeetingenClosed-source modelsGemini-3-Flash1.7|3.561.81|4.9113.588.497.524.012.6614.3817.627.74Doubao-LLM ASR2.95|4.062.92|5.324.607.122.927.220.984.464.907.14GPT-4o-trans.1.52|3.291.75|4.2312.617.222.622.713.5215.7131.407.02Open-source modelsCanary-1B-v22.07|4.032.20|3.58-8.91-4.48---6.20Parakeet-TDT-0.6B-v31.91|3.541.93|3.60-8.54-4.88---6.11Voxtral-Mini-3B-25071.89|3.881.89|4.08-10.15-3.84---7.08Step-Audio-2-mini1.21|2.501.37|2.754.777.042.483.930.815.565.467.43Kimi-Audio-7B1.38|2.561.34|2.556.748.355.888.070.766.416.258.15Whisper Large-v31.74|3.681.78|3.5315.3316.187.704.105.8912.0217.799.00Qwen2.5-Omni-7B2.05|4.192.37|4.215.018.564.644.011.156.169.646.02Qwen3-ASR-1.7B1.62|3.071.62|3.407.427.573.933.191.524.995.806.25Our modelOurs1.62|3.211.78|3.575.88.155.433.761.495.196.177.44Ours w/ router1.64|3.071.63|3.377.377.573.863.171.534.955.896.26

5Experiments

5.1Experimental setup

Datasets and Evaluation.

We initialize fromQwen3-ASR-1.7B(Shiet al.,2026)and train onVoices-in-the-wild-2Mfor both SFT and RL stages. We evaluate along three axes.(i) Standard ASR: LibriSpeech(Panayotovet al.,2015), CommonVoice22(Ardilaet al.,2020), FLEURS(Conneauet al.,2023), AISHELL-1(Buet al.,2017), WenetSpeech(Zhanget al.,2022), and VoxPopuli(Pavlichenkoet al.,2021), reported with and without our dynamic routing LoRA to verify that robustness adaptation does not regress clean-speech performance.(ii) Adverse-condition ASR: CHiME-4(Watanabeet al.,2016), VOiCES(Richeyet al.,2018), and NOIZEUS(Hu and Loizou,2007), covering noise, reverberation, far-field, and signal degradation.(iii) Compound conditions: ourVoices-in-the-Wild-Bench, targeting realistic multi-factor acoustic environments.

Baselines.

We compare against 12 representative systems spanning conventional ASR, large audio-language models, and omni-modal foundation models: Whisper-Large-v3(Radfordet al.,2023), Canary-1B-v2(Sekoyanet al.,2025), Parakeet-TDT-0.6B-v3(Sekoyanet al.,2025), Qwen2.5-Omni-7B(Xuet al.,2025a), Step-Audio-2-mini(Wuet al.,2025), Voxtral-Mini-3B(Liuet al.,2025), Kimi-Audio-7B(Dinget al.,2025), Gemini-3-Flash , Seed-ASR(Baiet al.,2024), GPT-4o(Hurstet al.,2024), and Step-Audio-2(Wuet al.,2025).

Implementation Details.

A2S-SFT uses learning rates of1×10−31{\times}10^{-3}for the audio encoder and adapter,2×10−52{\times}10^{-5}for the LLM, and2×10−62{\times}10^{-6}for the joint stage. RL runs for 6,000 steps with learning rate1×10−61{\times}10^{-6}andK=16K{=}16rollouts per input, optimized under the combined reward0.4​Rrule+0.6​Rdynamic0.4\,R_{\text{rule}}+0.6\,R_{\text{dynamic}}.

5.2Main results

The main results demonstrate3key findings, verifying thatMega-ASRachieves strong robustness from clean speech to highly compositional real-world acoustic environments.[Enh.1] Competitive general ASR with adaptive routing (Table3).Mega-ASRremains highly competitive on clean and multilingual benchmarks against Qwen3-ASR, Seed-ASR, and Kimi-Audio. With routing, it improves LibriSpeech WER from 1.78/3.57 to 1.63/3.37, achieves 3.86/3.17 on Fleurs zh/en, and shows consistent gains on WenetSpeech-meeting and VoxPopuli.[Enh.2] State-of-the-art robustness under acoustic perturbations (Table3Figure1).Mega-ASRachieves the best overall robustness on CHiME-4, VOiCES, and NOIZEUS with an average WER of 6.70, outperforming Qwen3-ASR (7.93), Whisper-Large-v3 (10.72), and Qwen2.5-Omni (15.14). Under extreme NOIZEUS 0dB conditions, it further reduces WER to 19.80 versus 23.97 for Qwen3-ASR and 55.78 for Gemini-3-Flash, a relative reduction of 17.4% over the strongest baseline and 64.5% over Gemini-3-Flash.[Enh.3] Superior robustness in compositional real-world environments (Table4).On Voices-in-the-Wild-Bench,Mega-ASRconsistently achieves the strongest performance across mixed degradations, far-field speech, and recording artifacts. Under mixed degradations, it achieves 2.73/4.57 WER, substantially outperforming Whisper-Large-v3 (8.91/14.79) and Gemini-3-Flash (7.99/9.62), corresponding to a 65.8%/69.1% relative reduction over Whisper-Large-v3 and 65.8% over Gemini-3-Flash.

Table 4:Breakdown results onVoices-in-the-Wild-Benchby acoustic scenario.ModelNoiseFar.Obst.Echo.Record.Elc.Dis.Trans.Drop.MixedReal.Sim.Real.Sim.Real.Sim.Real.Sim.Real.Sim.Real.Sim.Real.Sim.Real.Sim.Closed-source modelsGemini3-Flash7.6310.615.141.903.732.658.7514.868.3819.853.157.565.477.657.999.62Seed-ASR8.218.113.063.193.102.7616.5518.2118.4823.333.895.717.977.466.889.29GPT-4o-trans.13.1945.781.872.391.572.7715.6228.7613.3722.603.708.438.767.715.6211.00Open-source modelsWhisper-L-v316.5718.193.386.853.066.0125.3439.8718.3331.813.748.777.048.058.9114.79Qwen2.5-Omni11.9217.882.352.442.402.0820.0132.6413.7130.092.465.966.345.886.4010.29Kimi-Audio35.1014.592.711.922.491.6424.0026.588.7318.091.832.784.546.334.446.19Qwen3-ASR7.519.522.231.541.731.2710.4014.619.5719.421.543.414.164.193.305.39Our modelOurs6.338.262.351.611.621.238.6212.597.6514.211.713.722.592.622.734.57Ours w/ router6.128.092.331.691.801.418.6612.226.9113.231.603.352.722.882.634.53

5.3Analysis

Through ablation studies, we derive five key observations ([Obs.1]–[Obs.5]) spanning semantic-level gains, training recipe, reward design, and hyperparameter sensitivity. We elaborate each below, with the corresponding evidence drawn from Tables55.3

[Obs.1] Mega-ASR’s gains generalize beyond WER to semantic-level metrics.

Table8shows consistent semantic-level improvements over Qwen3-ASR, with missed-content dropping from14.214.2to5.95.9. This validates thatMega-ASRdelivers semantic- and holistic-level gains, exemplified by reduced hallucination and dropped utterances, beyond merely lowering WER.

Table 5:A2S-SFT and DG-WGPO ablation. WER (%,↓\downarrow) on Voices/Noizeus mid+high.VariantVoicesNoizeusQwen3-ASR (baseline)8.949.45+ SFT w/o A2S8.318.79Mega-ASR-Base7.598.12+ vanilla GRPO (RwerR_{\text{wer}}only)7.738.11+ vanilla DAPO (RwerR_{\text{wer}}only)7.627.98+ DG-WGPO w/oRrepR_{\text{rep}}7.467.73+ DG-WGPO w/oRfineR_{\text{fine}}7.457.71+ DG-WGPO w/oRstrucR_{\text{struc}}7.547.85+ DG-WGPO w/o gated fusion7.417.68Mega-ASR (full)7.357.64Table 6:Reward design. WER (%,↓\downarrow) on three test sets and average training time per step (Avg. T., relative).RewardVoicesNoizeusVoi-R.Avg.T.LLM-judge7.517.719.2762.23Rule-based7.537.649.3819.57 Table 7:LLM-as-judge evaluation. Avg over Voices and Noizeus.ModelHall.MissSem.KeyE.Qwen3-ASR18.714.271.322.5Mega-ASR-Base15.411.679.820.1Mega-ASR11.85.986.419.5

Table 8:Sensitivity to reward weights(αdyn,αs)(\alpha_{\text{dyn}},\alpha_{s}). WER (%,↓\downarrow) is reported on four held-out subsets grouped by degradation type (V.N.R.: Voices-Noise-Real; V.F.R.: Voices-Far-Real).SettingsNoiseFarNzV.N.R.V.F.V.F.R.αdyn=0.4,αs=0.4\alpha_{\text{dyn}}{=}0.4,\ \alpha_{s}{=}0.47.77.67.89.5αdyn=0.4,αs=0.6\alpha_{\text{dyn}}{=}0.4,\ \alpha_{s}{=}0.67.87.67.99.4αdyn=0.6,αs=0.2\alpha_{\text{dyn}}{=}0.6,\ \alpha_{s}{=}0.27.87.57.69.3αdyn=0.6,αs=0.6\alpha_{\text{dyn}}{=}0.6,\ \alpha_{s}{=}0.67.57.57.49.3αdyn=0.8,αs=0.4\alpha_{\text{dyn}}{=}0.8,\ \alpha_{s}{=}0.48.19.18.09.9𝜶dyn=0.6,𝜶𝒔=0.4\boldsymbol{\alpha_{\text{dyn}}{=}0.6,\ \alpha_{s}{=}0.4}7.67.47.49.2

[Obs.2] Ablation of A2S-SFT and DG-WGPO components.

We ablate each stage of A2S-SFT and each component of DG-WGPO on Voices/Noizeus in Table5. Removing the first two progressive stages (SFT w/o A2S) reaches8.31/8.798.31/8.79WER, still0.72/0.670.72/0.67behind Mega-ASR-Base, confirming the value of staged acoustic-to-semantic adaptation. On top of Mega-ASR-Base, vanilla DAPO withRwerR_{\text{wer}}alone outperforms vanilla GRPO by0.11/0.130.11/0.13WER, motivating our choice of DAPO as the RL backbone. Among the DG-WGPO components, removingRstrucR_{\text{struc}}causes the largest degradation (7.54/7.857.54/7.85), indicating that sentence-level reconstruction is critical on mid- and high-WER samples; removingRrepR_{\text{rep}},RfineR_{\text{fine}}, or gated fusion each yields a smaller but consistent drop. The fullMega-ASRreaches7.35/7.647.35/7.64, a1.59/1.811.59/1.81reduction over Qwen3-ASR.

[Obs.3] Rule-based reward matches LLM-judge at3.2×3.2\timeslower time-cost.

We replaceRdynamicR_{\text{dynamic}}with a Gemini-2.5-flash-lite scalar score and compare it against our rule-based design (Table8). The two variants achieve comparable WER across all three test sets, with differences within roughly 0.1 on Voices and Noizeus and 0.11 on Voi-R., suggesting that the rule-based reward already captures the supervision signals an LLM judge would provide. The LLM-judge variant, however, takes 62.23s per training step compared to 19.57s for the rule-based reward, a3.2×3.2\timesslowdown that scales unfavorably with longer training. Given the negligible accuracy difference and the substantial computational overhead, we adopt the rule-based design as the default.

[Obs.4] Ablation on hyperparameters.

We perturbαdyn\alpha_{\text{dyn}}andαs\alpha_{s}around the default(0.6,0.4)(0.6,0.4)in Table8. Pushingαdyn\alpha_{\text{dyn}}to 0.8 causes the sharpest degradation, with V.N.R. rising from 7.4 to 9.1 and Nz from 7.6 to 8.1, indicating that an over-weighted gating term suppresses the dominant WER-driven signalRwerR_{\text{wer}}and harms recognition. Loweringαdyn\alpha_{\text{dyn}}to 0.4 instead hurts the far-field subsets, where V.F. rises by 0.4 and V.F.R. by 0.3, while varyingαs\alpha_{s}in{0.2,0.6}\{0.2,0.6\}produces only minor fluctuations across all four subsets. These observations suggest thatαdyn\alpha_{\text{dyn}}governs a more sensitive trade-off thanαs\alpha_{s}, and we therefore adopt(αdyn,αs)=(0.6,0.4)(\alpha_{\text{dyn}},\alpha_{s}){=}(0.6,0.4), which achieves the best or near-best WER on every subset.

We further sweep the gating thresholdτ\taufrom 0.2 to 0.5 (Table5.3). The trend mirrors our earlier observation:τ=0.3\tau{=}0.3gives the most balanced result,τ=0.2\tau{=}0.2andτ=0.4\tau{=}0.4have only marginal effect, whileτ=0.5\tau{=}0.5leads to a clear degradation, consistent with the over-restrictive gating effect seen at highαdyn\alpha_{\text{dyn}}.

Table 9:Sensitivity to gating thresholdτ\tau. WER (%,↓\downarrow) on Noizeus.τ\tau0.20.30.40.5Noizeus7.687.647.667.70

6Case study

Figure6presents a comparative case study where the state-of-the-art closed-source modelGemini-3-Pro, the open-source modelQwen3-ASR, and our proposedMega-ASRtranscribe the same challenging audio across three scenarios: far-field reconstruction, content hallucination, and entity recovery. In the far-field case (Peak-5.2 dB),Qwen3-ASRoffers only a superficial response, returning an empty transcription with a WER of100.0%.Gemini-3-Progoes beyond this and produces a fluent hypothesis, yet fabricates content unrelated to the source (WER86.1%). In contrast,Mega-ASRprecisely recovers the reference transcript (WER0.0%), a pattern that persists under severe noise and in entity-dense utterances. This highlights the intrinsic difficulty of robust speech recognition: errors are often subtle, originate from degraded signals or rare entities, and remain hidden behind outputs that appear fluent at the surface level.

Refer to captionFigure 6:Case study against SOTA modelsGemini-3-ProandQwen3-ASRon semantic reconstruction under strong environmental robustness, hallucination, and fine-grained detail recovery.Mega-ASRfaithfully aligns with the reference transcript (WER0.0%\mathbf{0.0\%}on far-field), while competing SOTA systems either return empty outputs or fabricate fluent but incorrect content.

7Conclusion

We presentedMEGA-ASR, a unified ASR-in-the-wild framework designed to overcome the acoustic robustness bottleneck of current ASR and large audio-language models under severe, compositional distortions. Central to MEGA-ASR isVOICES-IN-THE-WILD-2M, a large-scale dataset covering7classic acoustic phenomena and54physically plausible compound scenarios, together withAcoustic-to-Semantic Progressive Supervised Fine-TuningandDual-Granularity WER-Gated Policy Optimizationfor robust perceptual recovery and semantic reconstruction. Extensive experiments show that MEGA-ASR achieves significant improvements over prior state-of-the-art systems, especially under challenging real-world acoustic conditions where relative WER reductions can exceed30%30\%. Our results highlight the importance of modeling compound acoustic environments at scale and establish MEGA-ASR as a scalable paradigm for robust ASR in-the-wild.

References

  • R. Ardila, M. Branson, K. Davis, M. Kohler, J. Meyer, M. Henretty, R. Morais, L. Saunders, F. Tyers, and G. Weber (2020)Common voice: a massively-multilingual speech corpus.InProceedings of the twelfth language resources and evaluation conference,pp. 4218–4222.Cited by:Appendix F,§2,§5.1.
  • Seed-asr: understanding diverse speech and contexts with llm-based speech recognition.arXiv preprint arXiv:2407.04675.Cited by:Appendix F,§5.1.
  • H. Bu, J. Du, X. Na, B. Wu, and H. Zheng (2017)Aishell-1: an open-source mandarin speech corpus and a speech recognition baseline.In2017 20th conference of the oriental chapter of the international coordinating committee on speech databases and speech I/O systems and assessment (O-COCOSDA),pp. 1–5.Cited by:§3.2,§5.1.
  • A. Conneau, M. Ma, S. Khanuja, Y. Zhang, V. Axelrod, S. Dalmia, J. Riesa, C. Rivera, and A. Bapna (2023)Fleurs: few-shot learning evaluation of universal representations of speech.In2022 IEEE Spoken Language Technology Workshop (SLT),pp. 798–805.Cited by:§5.1.
  • S. Deshmukh, B. Elizalde, R. Singh, and H. Wang (2023)Pengi: an audio language model for audio tasks.Advances in Neural Information Processing Systems36,pp. 18090–18108.Cited by:Appendix F.
  • D. Ding, Z. Ju, Y. Leng, S. Liu, T. Liu, Z. Shang, K. Shen, W. Song, X. Tan, H. Tang,et al.(2025)Kimi-audio technical report.arXiv preprint arXiv:2504.18425.Cited by:§1,§2,§5.1.
  • Z. Gao, Z. Li, J. Wang, H. Luo, X. Shi, M. Chen, Y. Li, L. Zuo, Z. Du, Z. Xiao,et al.(2023)Funasr: a fundamental end-to-end speech recognition toolkit.arXiv preprint arXiv:2305.11013.Cited by:§2.
  • Z. Gao, S. Zhang, I. McLoughlin, and Z. Yan (2022)Paraformer: fast and accurate parallel transformer for non-autoregressive end-to-end speech recognition.arXiv preprint arXiv:2206.08317.Cited by:§1.
  • J. J. Godfrey, E. C. Holliman, and J. McDaniel (1992)SWITCHBOARD: telephone speech corpus for research and development.In[Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing,Vol.1,pp. 517–520.Cited by:Appendix F.
  • Y. Gong, H. Luo, A. Liu, L. Karlinsky, and J. R. Glass (2024)Listen, think, and understand.InInternational Conference on Learning Representations,Vol.2024,pp. 18516–18545.Cited by:Appendix F.
  • A. Gulati, J. Qin, C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu,et al.(2020)Conformer: convolution-augmented transformer for speech recognition.arXiv preprint arXiv:2005.08100.Cited by:Appendix F.
  • K. J. Han, S. Hahm, B. Kim, J. Kim, and I. R. Lane (2017)Deep learning-based telephony speech recognition in the wild..InInterspeech,pp. 1323–1327.Cited by:§1.
  • W. Hsu, B. Bolte, Y. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed (2021)Hubert: self-supervised speech representation learning by masked prediction of hidden units.IEEE/ACM transactions on audio, speech, and language processing29,pp. 3451–3460.Cited by:Appendix F.
  • S. Hu, L. Zhou, S. Liu, S. Chen, L. Meng, H. Hao, J. Pan, X. Liu, J. Li, S. Sivasankaran,et al.(2024)Wavllm: towards robust and adaptive speech large language model.InFindings of the Association for Computational Linguistics: EMNLP 2024,pp. 4552–4572.Cited by:Appendix F,Appendix F.
  • Y. Hu and P. C. Loizou (2007)Subjective comparison and evaluation of speech enhancement algorithms.Speech communication49(7-8),pp. 588–601.Cited by:§2,Table 1,§5.1.
  • T. Huang, V. Shejwalkar, O. Chang, M. Nasr, and L. Liu (2025)Rebellion: noise-robust reasoning training for audio reasoning models.arXiv preprint arXiv:2511.09682.Cited by:Appendix F.
  • A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford,et al.(2024)Gpt-4o system card.arXiv preprint arXiv:2410.21276.Cited by:§5.1.
  • T. Ko, V. Peddinti, D. Povey, and S. Khudanpur (2015)Audio augmentation for speech recognition..InInterspeech,Vol.2015,pp. 3586.Cited by:§2.
  • T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur (2017)A study on data augmentation of reverberant speech for robust speech recognition.In2017 IEEE international conference on acoustics, speech and signal processing (ICASSP),pp. 5220–5224.Cited by:§2.
  • Z. Kong, A. Goel, R. Badlani, W. Ping, R. Valle, and B. Catanzaro (2024)Audio flamingo: a novel audio language model with few-shot learning and dialogue abilities.arXiv preprint arXiv:2402.01831.Cited by:Appendix F.
  • W. Kraaij, T. Hain, M. Lincoln, and W. Post (2005)The ami meeting corpus.InProc. International Conference on Methods and Techniques in Behavioral Research,pp. 1–4.Cited by:Appendix F.
  • S. Lina and K. A. Aksyonov (2024)Error correction for speech recognition systems using large language model reasoning capabilities.In2024 IEEE 25th International Conference of Young Professionals in Electron Devices and Materials (EDM),pp. 2300–2303.Cited by:§1.
  • A. H. Liu, A. Ehrenberg, A. Lo, C. Denoix, C. Barreau, G. Lample, J. Delignon, K. R. Chandu, P. von Platen, P. R. Muddireddy,et al.(2025)Voxtral.arXiv preprint arXiv:2507.13264.Cited by:§5.1.
  • G. J. Mysore (2014)Can we automatically transform speech recorded on common consumer devices in real-world environments into professional production quality speech?—a dataset, insights, and challenges.IEEE Signal Processing Letters22(8),pp. 1006–1010.Cited by:§2,Table 1.
  • V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015)Librispeech: an asr corpus based on public domain audio books.In2015 IEEE international conference on acoustics, speech and signal processing (ICASSP),pp. 5206–5210.Cited by:Appendix F,§1,§3.2,§5.1.
  • P. P. Parada, A. Dobrowolska, K. Saravanan, and M. Ozay (2022)PMCT: patched multi-condition training for robust speech recognition.arXiv preprint arXiv:2207.04949.Cited by:§2.
  • D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le (2019)Specaugment: a simple data augmentation method for automatic speech recognition.arXiv preprint arXiv:1904.08779.Cited by:Appendix F.
  • N. Pavlichenko, I. Stelmakh, and D. Ustalov (2021)Crowdspeech and voxdiy: benchmark datasets for crowdsourced audio transcription.arXiv preprint arXiv:2107.01091.Cited by:§2,§5.1.
  • K. J. Piczak (2015)ESC: dataset for environmental sound classification.InProceedings of the 23rd ACM international conference on Multimedia,pp. 1015–1018.Cited by:§3.2.
  • A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever (2023)Robust speech recognition via large-scale weak supervision.InInternational conference on machine learning,pp. 28492–28518.Cited by:Appendix F,§2,§5.1.
  • C. K. Reddy, V. Gopal, R. Cutler, E. Beyrami, R. Cheng, H. Dubey, S. Matusevych, R. Aichner, A. Aazami, S. Braun,et al.(2020)The interspeech 2020 deep noise suppression challenge: datasets, subjective testing framework, and challenge results.arXiv preprint arXiv:2005.13981.Cited by:§2,§3.2.
  • C. Richey, M. A. Barrios, Z. Armstrong, C. Bartels, H. Franco, M. Graciarena, A. Lawson, M. K. Nandwana, A. Stauffer, J. van Hout,et al.(2018)Voices obscured in complex environmental settings (voices) corpus.arXiv preprint arXiv:1804.05053.Cited by:§2,§3.2,§5.1.
  • A. Rousseau, P. Deléglise, and Y. Esteve (2012)TED-lium: an automatic speech recognition dedicated corpus..InLREC,pp. 125–129.Cited by:Appendix F,§2,Table 1.
  • P. K. Rubenstein, C. Asawaroengchai, D. D. Nguyen, A. Bapna, Z. Borsos, F. d. C. Quitry, P. Chen, D. E. Badawy, W. Han, E. Kharitonov,et al.(2023)Audiopalm: a large language model that can speak and listen.arXiv preprint arXiv:2306.12925.Cited by:Appendix F.
  • J. Salamon, C. Jacoby, and J. P. Bello (2014)A dataset and taxonomy for urban sound research.InProceedings of the 22nd ACM international conference on Multimedia,pp. 1041–1044.Cited by:§3.2.
  • M. Sekoyan, N. R. Koluguri, N. Tadevosyan, P. Zelasko, T. Bartley, N. Karpov, J. Balam, and B. Ginsburg (2025)Canary-1b-v2 & parakeet-tdt-0.6 b-v3: efficient and high-performance models for multilingual asr and ast.arXiv preprint arXiv:2509.14128.Cited by:§5.1.
  • M. Shah, D. Solans Noguero, M. Heikkilä, B. Raj, and N. Kourtellis (2025)Speech robust bench: a robustness benchmark for speech recognition.InInternational Conference on Learning Representations,Vol.2025,pp. 38625–38651.Cited by:Appendix F.
  • X. Shi, X. Wang, Z. Guo, Y. Wang, P. Zhang, X. Zhang, Z. Guo, H. Hao, Y. Xi, B. Yang,et al.(2026)Qwen3-asr technical report.arXiv preprint arXiv:2601.21337.Cited by:§1,§2,§3.1,§4,§5.1.
  • D. Snyder, G. Chen, and D. Povey (2015)Musan: a music, speech, and noise corpus.arXiv preprint arXiv:1510.08484.Cited by:§2,§3.2.
  • P. Tuttösí, M. Dhillon, L. Sang, S. Eastwood, P. Bhatia, Q. M. Dinh, A. Kapoor, Y. Jin, and A. Lim (2026)BERSting at the screams: a benchmark for distanced, emotional and shouted speech recognition.Computer Speech & Language95,pp. 101815.Cited by:Table 1.
  • N. Vaessen and D. A. Van Leeuwen (2022)Fine-tuning wav2vec2 for speaker recognition.InICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),pp. 7967–7971.Cited by:Appendix F.
  • E. V. S. Watanabe, M. Mandel, and J. Barker (2016)The 4th chime speech separation and recognition challenge.Cited by:Appendix F,§2,Table 1,Table 1,§5.1.
  • B. Wu, C. Yan, C. Hu, C. Yi, C. Feng, F. Tian, F. Shen, G. Yu, H. Zhang, J. Li,et al.(2025)Step-audio 2 technical report.arXiv preprint arXiv:2507.16632.Cited by:Appendix F,§2,§5.1.
  • Z. Xie, Z. Ma, Z. Liu, K. Pang, H. Li, J. Zhang, Y. Liao, D. Ye, C. Miao, and S. Yan (2025)Mini-omni-reasoner: token-level thinking-in-speaking in large speech models.arXiv preprint arXiv:2508.15827.Cited by:Appendix F.
  • Z. Xie and C. Wu (2024a)Mini-omni: language models can hear, talk while thinking in streaming.arXiv preprint arXiv:2408.16725.Cited by:Appendix F.
  • Z. Xie and C. Wu (2024b)Mini-omni2: towards open-source gpt-4o with vision, speech and duplex capabilities.arXiv preprint arXiv:2410.11190.Cited by:Appendix F.
  • J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y. Fan, K. Dang, B. Zhang, X. Wang, Y. Chu, and J. Lin (2025a)Qwen2.5-omni technical report.External Links:2503.20215,LinkCited by:Appendix F,§2,§5.1.
  • J. Xu, Z. Guo, H. Hu, Y. Chu, X. Wang, J. He, Y. Wang, X. Shi, T. He, X. Zhu,et al.(2025b)Qwen3-omni technical report.arXiv preprint arXiv:2509.17765.Cited by:§1,§2.
  • K. Xu, Y. Jia, K. Huang, J. Chen, W. Li, K. Liu, F. Xie, X. Tang, and Y. Hu (2026)FireRedASR2S: a state-of-the-art industrial-grade all-in-one automatic speech recognition system.arXiv preprint arXiv:2603.10420.Cited by:§1,§2.
  • B. Yan, V. Pratap, S. Watanabe, and M. Auli (2025)Improving multilingual asr in the wild using simple n-best re-ranking.InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),pp. 1–5.Cited by:§1.
  • Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu,et al.(2025)Dapo: an open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476.Cited by:§4.2.
  • B. Zhang, H. Lv, P. Guo, Q. Shao, C. Yang, L. Xie, X. Xu, H. Bu, X. Chen, C. Zeng,et al.(2022)Wenetspeech: a 10000+ hours multi-domain mandarin corpus for speech recognition.InICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),pp. 6182–6186.Cited by:§3.2,§5.1.
  • X. Zhifei, M. Lin, Z. Liu, P. Wu, S. Yan, and C. Miao (2025)Audio-reasoner: improving reasoning capability in large audio language models.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp. 23840–23862.Cited by:Appendix F.

Appendix AQualitative Case Studies

We provide representative qualitative examples to illustrate howMega-ASRchanges the error modes of the baseline under severe acoustic degradation. The examples cover five common failure patterns: off-audio hallucination, empty-output collapse, dropout-induced semantic drift, noisy semantic drift, and entity-level recovery on standard noisy benchmarks. These examples are not intended to replace quantitative evaluation; instead, they clarify the types of errors that are reduced by Mega-ASR.

Observation.

Across these examples, the baseline errors are often not local substitutions. They include cross-lingual hallucination, empty outputs, severe semantic drift, and missing key entities.Mega-ASRoften converts these catastrophic failures into correct or near-correct transcriptions, preserving the semantic backbone of the utterance even when minor lexical differences remain. As shown in Figure9, the baseline frequently fails in ways that are qualitatively different from ordinary word-level substitutions. In the compound case, it produces a cross-lingual off-audio hallucination; under recording coloration, it collapses into an empty output; under dropout and noise, it drifts toward plausible but incorrect semantics; and on CHiME-4, it changes both the named entity and the relation.Mega-ASRreduces these catastrophic errors and recovers the semantic backbone of the reference utterances. These examples support our central observation that severe acoustic degradation changes the ASR error regime from local recognition errors to sentence-level semantic failures, and thatMega-ASRmitigates this transition.

Appendix BAdditional Robust Benchmark Results

To complement the main results, we provide a focused comparison among Qwen3-ASR-1.7B, Merged-v2, and the quality-routed 3-LoRA variant on three robust ASR benchmarks: CHiME-4, NOIZEUS, and VOiCES. These benchmarks cover different types of adverse acoustic conditions, including real and simulated noisy speech, controlled additive noise at different SNR levels, and far-field room acoustics. Table10summarizes the average WER on each benchmark, while Tables1113provide detailed subset-level breakdowns.

Table 10:Average WER comparison on three robust ASR benchmarks. Lower WER is better.ModelCHiME-4NOIZEUSVOiCESAvg.Qwen3-ASR-1.7B5.399.458.947.93Mega-ASR5.237.527.356.70Mega-ASR w/ router5.007.907.376.76Both enhanced variants improve robustness over the Qwen3-ASR-1.7B backbone on the three-benchmark average. Merged-v2 achieves the best average WER on VOiCES and NOIZEUS, indicating that always-on robust adaptation is particularly effective under far-field room acoustics and controlled noisy conditions. The quality-routed 3-LoRA variant achieves the best average WER on CHiME-4, suggesting that quality-aware routing is especially useful when the model needs to balance real and simulated noisy speech while preserving backbone behavior. Overall, the results show that the proposed robust adaptation improves recognition accuracy across different adverse acoustic regimes, while the routed variant provides a practical trade-off between robustness and backbone preservation.

Table 11:Detailed WER comparison on CHiME-4. Lower WER is better.SubsetQwen3-ASR-1.7BMega-ASRMega-ASR w/ routerdt05_bus_real4.013.553.76dt05_bus_simu3.854.223.71dt05_caf_real4.003.843.90dt05_caf_simu5.925.945.61dt05_ped_real3.573.763.50dt05_ped_simu4.594.704.22dt05_str_real3.794.033.89dt05_str_simu5.025.274.62et05_bus_real6.596.215.94et05_bus_simu6.065.475.53et05_caf_real5.535.224.89et05_caf_simu8.368.367.97et05_ped_real4.924.534.71et05_ped_simu6.476.616.08et05_str_real4.854.164.42et05_str_simu8.647.747.19Average5.395.235.00

Table 12:Detailed WER comparison on NOIZEUS. Lower WER is better.SubsetQwen3-ASR-1.7BMega-ASRMega-ASR w/ routerairport_0dB16.1212.8012.31airport_5dB5.373.313.72airport_10dB2.892.072.89airport_15dB1.240.411.65babble_0dB24.7925.2027.43babble_5dB9.505.795.79babble_10dB2.072.072.48babble_15dB1.241.241.24car_0dB29.3423.9022.47car_5dB7.855.796.20car_10dB2.892.072.89car_15dB2.070.831.65exhibition_0dB16.1213.7014.20exhibition_5dB9.098.685.79exhibition_10dB3.311.652.89exhibition_15dB1.650.832.07restaurant_0dB23.1418.1015.03restaurant_5dB9.927.857.85restaurant_10dB2.892.892.48restaurant_15dB2.070.411.65station_0dB29.3421.1023.71station_5dB6.615.375.79station_10dB3.312.482.48station_15dB1.650.831.65street_0dB28.9321.9022.47street_5dB10.748.2611.16street_10dB4.964.134.13street_15dB2.481.242.07train_0dB23.9721.7020.81train_5dB8.687.859.50train_10dB4.964.964.13train_15dB3.311.242.07Average9.457.527.90

Table 13:Detailed WER comparison on VOiCES. Lower WER is better.SubsetQwen3-ASR-1.7BMega-ASRMega-ASR w/ routerrm1_babb_clo2.241.942.16rm1_babb_far3.143.002.89rm1_musi_clo2.312.202.21rm1_musi_far2.772.722.69rm1_none_clo2.101.952.12rm1_none_far2.162.232.23rm1_tele_clo2.442.212.28rm1_tele_far3.002.662.76rm2_babb_clo2.262.272.26rm2_babb_far3.533.243.29rm2_musi_clo2.252.132.12rm2_musi_far3.142.692.89rm2_none_clo2.021.961.98rm2_none_far2.412.142.24rm2_tele_clo2.282.172.17rm2_tele_far3.082.842.97rm3_babb_clo7.706.776.50rm3_babb_far46.6236.5037.60rm3_musi_clo5.484.935.15rm3_musi_far33.8325.8026.70rm3_none_clo3.142.642.62rm3_none_far10.358.408.80rm3_tele_clo5.964.854.85rm3_tele_far40.4031.1530.34rm4_babb_clo2.792.562.71rm4_babb_far54.0145.6943.73rm4_musi_clo2.401.992.26rm4_musi_far12.4310.549.71rm4_none_clo2.031.912.08rm4_none_far2.692.642.64rm4_tele_clo2.182.022.12rm4_tele_far12.958.368.80Average8.947.357.37

Appendix CDetails ofVoices-in-the-wild-2MConstruction

C.1Hierarchical Simulation Pipeline

Voices-in-the-wild-2Mis constructed through a hierarchical acoustic simulation pipeline. Rather than directly enumerating complex real-world environments, we decompose in-the-wild speech degradation into three levels: primitive acoustic effects, atomic acoustic effects, and compound acoustic scenarios.

At the lowest level, we define eight primitive acoustic effects, each corresponding to an independent and controllable signal-level transformation: additive noise, echo delay, reverberation, nonlinear distortion, resampling, spectral filtering, loudness transformation, and frame-level stutter. These primitive effects are designed to capture basic physical or device-induced degradation mechanisms, such as background interference, delayed reflection, room reverberation, clipping, bandwidth limitation, spectral attenuation, gain mismatch, and packet loss.

At the intermediate level, we construct seven atomic acoustic effects from these primitive effects:noise,far-field,obstructed,echo&reverb,recording,electronic distortion, andtransmission dropout. Each atomic effect is not necessarily implemented by a single primitive effect. Instead, it is instantiated as a physically motivated composition of one dominant primitive effect and several auxiliary primitive effects. For example, far-field speech is not only quieter, but also more reverberant and spectrally attenuated; similarly, low-quality recording may simultaneously involve bandwidth limitation, gain mismatch, and channel coloration.

At the highest level, we construct compound acoustic scenarios by composing multiple atomic acoustic effects. This produces complex acoustic environments that better match in-the-wild speech, where multiple degradation sources often co-occur. Importantly, during both atomic-effect construction and compound scenario construction, we preserve a fixed topological order among primitive effects. This avoids physically implausible processing chains and ensures that the same low-level degradation mechanism is applied consistently across different scenarios. The final pipeline can therefore be summarized as

8 primitive effects→7 atomic acoustic effects→54 compound acoustic scenarios.\text{8 primitive effects}\rightarrow\text{7 atomic acoustic effects}\rightarrow\text{54 compound acoustic scenarios}.

C.2Primitive Acoustic Effects

Motivation.

The seven atomic acoustic effects used in the main paper are high-level descriptions of real-world acoustic phenomena. However, such phenomena are usually caused by multiple lower-level signal transformations. For example, speech behind a door may be attenuated, low-pass filtered, and slightly reverberant; speech transmitted through an unstable communication channel may contain repeated frames, local dropouts, and bandwidth loss. We therefore first define a set of primitive acoustic effects, which serve as the basic signal-level operators for building both atomic and compound scenarios.

Each primitive effect exposes a small number of interpretable parameters. These parameters control the strength of the degradation and are later tied to the global severity variable used in dataset synthesis. The primitive effects are kept modular, allowing them to be composed while preserving a consistent topological order.

Table 14:Eight primitive acoustic effects used as signal-level building blocks inVoices-in-the-wild-2M.Primitive effectMain parametersSimulated degradationTypical real-world sourceAdditive noisenoise source, noise category, relative noise level, wet ratioBackground interference from environmental sounds, human voices, or device noiseStreet, office, vehicle, crowd, household environmentEcho delaydelay time, feedback, mix ratioDiscrete delayed reflections and repeated copies of speechEmpty room, corridor, tunnel, large hallReverberationroom size, damping, wet level, dry levelDense room reflections and long-tail spatial smearingClassroom, auditorium, church, meeting roomNonlinear distortiondrive gain, wet ratioOverload, saturation, clipping, and harmonic artifactsLow-quality microphone, over-amplified recorder, damaged deviceResamplingtarget sampling rate, wet ratio, probability gateBandwidth limitation and high-frequency information lossTelephone channel, compressed audio, low-bandwidth transmissionSpectral filteringfilter type, cutoff frequency, repeat count, wet ratioFrequency attenuation and channel colorationMask, door, wall, glass, narrow-band deviceLoudness transformationtarget LUFSGain mismatch, distance-induced attenuation, or abnormal recording levelDistant speaker, quiet recording, over-amplified microphoneFrame-level stutterframe length, stutter probability, repeat probability, maximum repeatsLocal dropout, repeated frames, and unstable temporal continuityPacket loss, unstable streaming, corrupted recording

Additive noise.

The additive-noise primitive mixes an external noise waveform into the clean speech signal. The noise source can be selected from a specified noise category or from a given noise file. If the noise waveform is shorter than the speech signal, it is tiled and then cropped to match the speech duration. The noise is RMS-normalized according to a target relative level and then mixed with the clean speech using a wet ratio. This primitive captures background interference from environmental sounds, human voices, and device noise.

Echo delay.

The echo-delay primitive adds delayed copies of the original signal. It is controlled by the delay time, feedback strength, and dry-wet mix ratio. The delay time determines the temporal offset between the direct path and the reflected path, while the feedback parameter controls the strength and number of repeated reflections. This primitive mainly simulates sparse and perceptible echoes in highly reflective spaces.

Reverberation.

The reverberation primitive simulates dense room reflections. It is controlled by room size, damping, wet level, and dry level. Larger room size and higher wet level produce stronger spatial smearing, while damping controls the decay of high-frequency components in the reverberant tail. Unlike echo delay, which models discrete delayed repetitions, reverberation models dense and continuous reflection patterns.

Nonlinear distortion.

The nonlinear-distortion primitive applies overdrive to the waveform and produces saturation or clipping artifacts. The drive gain controls the strength of the distortion: small values introduce mild coloration, whereas large values produce clear overload artifacts and additional harmonics. After distortion, the output is clipped to the valid amplitude range, further simulating harsh device overload.

Resampling.

The resampling primitive first downsamples the waveform to a lower target sampling rate and then upsamples it back to the original sampling rate. This removes high-frequency details and introduces bandwidth limitation while keeping the final sampling rate compatible with the rest of the pipeline. A probability gate is used so that resampling can be applied stochastically when constructing mixed scenarios.

Spectral filtering.

The spectral-filtering primitive applies either a low-pass or high-pass filter with a specified cutoff frequency. The filter can be repeatedly applied to increase the strength of spectral attenuation. Low-pass filtering removes high-frequency details and is useful for muffled or occluded speech, while high-pass filtering removes low-frequency energy and simulates thin channel responses or device coloration.

Loudness transformation.

The loudness primitive adjusts the signal to a target LUFS value. Unlike simple amplitude scaling, LUFS normalization provides a perceptually meaningful measure of loudness. This primitive is used to simulate distance-induced attenuation, microphone gain mismatch, quiet speech, and over-amplified recordings. When the loudness of extremely short or silent audio cannot be estimated reliably, the original signal is kept unchanged.

Frame-level stutter.

The frame-level stutter primitive partitions the waveform into short frames and randomly triggers local replacement events. Once triggered, several consecutive frames are either replaced by the previous frame or replaced by silence. The total audio length is kept unchanged, which allows the resulting audio to remain aligned with the original transcript while still containing local temporal discontinuities. This primitive simulates packet loss, unstable streaming, frame repetition, and local dropout.

C.3Construction of Seven Atomic Acoustic Effects

Based on the eight primitive acoustic effects, we further construct seven atomic acoustic effects that correspond to common in-the-wild acoustic conditions:noise,far-field,obstructed,echo&reverb,recording coloration,electronic distortion, andtransmission dropout. Each atomic effect is implemented as an ordered chain of primitive effects. The primitive chain is designed to make one degradation mechanism dominant while retaining secondary artifacts that naturally co-occur in the corresponding real-world condition.

Table15first provides a structural overview of the seven atomic acoustic effects. It reports the ordered primitive-effect chain, the dominant degradation mechanism, and the representative real-world condition for each atomic effect. The listed order follows the corresponding scene configuration rather than a manually imposed global order.

To make the simulation fully reproducible, Table16further summarizes the key parameters used in each scene configuration. We group parameters by primitive effect and distinguish randomly sampled ranges from fixed values. Parameters marked with “core” are the primary severity-controlling parameters used to modulate the difficulty of the corresponding atomic effect.

Table 15:Construction of seven atomic acoustic effects from primitive acoustic effects.Atomic effectPrimitive-effect chainDominant degradationRepresentative conditionNoiseadd_noise→\rightarrowchange_volumeLow signal-to-noise ratioStreet, cafe, vehicle, crowdFar-fieldadd_reverb→\rightarrowapply_filter→\rightarrowchange_volumeDistance-induced reverberation and attenuationSpeaking to a distant microphoneObstructedapply_filter→\rightarrowadd_reverb→\rightarrowchange_volumeOcclusion-induced spectral lossSpeech behind a wall, door, or maskEcho&reverbadd_reverb→\rightarrowapply_filter→\rightarrowadd_echo→\rightarrowchange_volumeStrong reflections and delayed echoesGymnasium, garage, large hallRecording Colorationadd_resample→\rightarrowadd_noise→\rightarrowapply_filter→\rightarrowapply_filter→\rightarrowchange_volumePlayback-recording channel degradationPhone playback recorded by another deviceElectronic distortionadd_distortion→\rightarrowapply_filter→\rightarrowchange_volume_distortionClipping and nonlinear overloadClose-talking with excessive recording gainTransmission dropoutadd_stutter_replace→\rightarrowchange_volumeLocal temporal discontinuityVoIP packet loss, unstable Bluetooth or streamingTable 16:Parameterization of the seven atomic acoustic effects. Randomly sampled parameters are shown as ranges, while fixed parameters are listed separately. Core parameters are the main severity-controlling variables in the corresponding scene configuration.Atomic effectPrimitive effectSampled severity parametersFixed parametersNoiseadd_noisenoise_db∈[−5,10]\in[-5,10](core)noise_category=filtered_wavs,wet=1.0change_volume–target_lufs=-23.0Far-fieldadd_reverbroom_size∈[0.4,0.6]\in[0.4,0.6](core),damping∈[0.6,0.8]\in[0.6,0.8],wet_level∈[0.4,0.5]\in[0.4,0.5]dry_level=0.5apply_filtercutoff_hz∈[3500,4500]\in[3500,4500](core)filter_type=lowpass,repeat=3,wet=1.0change_volumetarget_lufs∈[−38,−27]\in[-38,-27](core)–Obstructedapply_filtercutoff_hz∈[1500,2000]\in[1500,2000](core),repeat∈{2,3,4}\in\{2,3,4\}filter_type=lowpass,wet=0.9add_reverbwet_level∈[0.5,0.7]\in[0.5,0.7]room_size=0.4,damping=0.9,dry_level=0.4change_volumetarget_lufs∈[−25,−15]\in[-25,-15](core)–Echo&reverbadd_reverbroom_size∈[0.8,0.95]\in[0.8,0.95](core),wet_level∈[0.6,0.8]\in[0.6,0.8]damping=0.5,dry_level=0.4apply_filtercutoff_hz∈[100,300]\in[100,300]filter_type=highpass,repeat=1,wet=1.0add_echodelay_seconds∈[0.1,0.3]\in[0.1,0.3](core),feedback∈[0.3,0.5]\in[0.3,0.5],mix∈[0.2,0.3]\in[0.2,0.3]–change_volumetarget_lufs∈[−30,−23]\in[-30,-23](core)–Recordingadd_resampleprob∈[0,1]\in[0,1](core)target_sr=8000,wet=1.0,threshold=0.4add_noisenoise_db∈[−5,10]\in[-5,10](core)use_white_noise=True,wet=1.0apply_filtercutoff_hz∈[400,600]\in[400,600](core),repeat∈{4,5,6}\in\{4,5,6\}filter_type=highpass,wet=1.0apply_filtercutoff_hz∈[3500,4500]\in[3500,4500](core),repeat∈{4,5,6}\in\{4,5,6\}filter_type=lowpass,wet=1.0change_volume–target_lufs=-23.0Electronic distortionadd_distortiondrive_db∈[20,60]\in[20,60](core)wet=1.0apply_filtercutoff_hz∈[2800,6000]\in[2800,6000]filter_type=lowpass,repeat=1,wet=1.0change_volume_distortiontarget_lufs∈[−38,−27]\in[-38,-27](core)–Transmission dropoutadd_stutter_replacestutter_prob∈[0.05,0.3]\in[0.05,0.3](core),max_repeats∈{2,3,4}\in\{2,3,4\}repeat_prob=0.7,frame_ms=20change_volume–target_lufs=-23.0##### Noise.

The noise atomic effect is designed to isolate low-SNR recognition difficulty. It therefore uses additive noise as the dominant primitive effect and avoids introducing strong reverberation or filtering artifacts. The noise source is sampled from the prepared noise pool, and its relative level is varied to produce different SNR regimes. A final loudness normalization step keeps the overall output level comparable across samples, ensuring that the primary challenge comes from masking rather than from abnormal global volume. This design matches common noisy environments such as streets, cafes, vehicles, and crowded rooms.

Far-field.

The far-field atomic effect models speech captured by a distant microphone. Its primitive chain first introduces room reverberation, then applies low-pass filtering to mimic high-frequency attenuation, and finally reduces the loudness to simulate distance-induced energy decay. This combination reflects the main acoustic properties of far-field speech: stronger room reflections, weaker direct-path energy, and mild spectral attenuation. The resulting samples target scenarios such as speaking to a smart speaker from across a room.

Obstructed.

The barrier atomic effect simulates speech transmitted through an obstacle, such as a wall, door, glass, or mask. The dominant operation is low-pass filtering, which removes high-frequency components that are difficult to transmit through physical barriers. The filter can be repeatedly applied to represent thicker or more absorptive obstacles. We then add reverberation with a relatively high wet component, reflecting the fact that the listener often receives a mixture of attenuated direct speech and room-reflected sound from the other side of the barrier. Finally, the signal is attenuated through loudness transformation. This makes the generated speech muffled, weaker, and less spectrally detailed.

Echo&reverb.

The strong-echo atomic effect targets highly reflective environments. It combines dense reverberation with a separate echo-delay primitive. The reverberation component produces a long reflection tail, while the echo-delay component introduces perceptible delayed copies of the speech signal. A mild high-pass filter is additionally applied to control low-frequency muddiness, and the final loudness transformation keeps the generated samples within a reasonable intensity range. This construction is suitable for large empty spaces, underground garages, gymnasiums, and other environments where both reverberant smearing and discrete echoes are present.

Recording coloration.

The recording or acoustic-crosstalk atomic effect simulates a playback-recording loop, such as playing speech from one phone and recording it with another device. This chain first applies resampling to model bandwidth limitation, then adds white or device-like noise, and subsequently applies both high-pass and low-pass filtering. The high-pass filter removes low-frequency energy and makes the signal thinner, while the low-pass filter limits the upper bandwidth of the playback-recording channel. A final loudness normalization step standardizes the output level. Together, these operations produce speech that is narrower in frequency response, noisier, and more blurred than the original recording.

Electronic distortion.

The electronic-distortion atomic effect focuses on nonlinear device overload. It uses distortion as the dominant primitive effect, where larger drive values produce stronger saturation and clipping. A subsequent low-pass filter mimics the limited response of low-quality microphone hardware under large input dynamics, while the final loudness adjustment controls the output level. Unlike far-field or strong-echo conditions, this scene intentionally avoids adding reverberation or background noise, so that the dominant challenge remains waveform-level clipping and harmonic distortion rather than room acoustics or SNR degradation.

Transmission dropout.

The transmission-dropout atomic effect models temporal corruption rather than spectral coloration. It uses frame-level stutter as the dominant primitive effect: short frames are randomly replaced by previous frames or by silence, creating local repetitions and dropouts while keeping the total audio length unchanged. A final loudness normalization step keeps the recording level standard. We intentionally avoid additional filtering because network or Bluetooth instability does not necessarily make the speech spectrally muffled. This effect therefore isolates temporal discontinuity caused by VoIP packet loss, unstable wireless links, or corrupted streaming.

These seven atomic acoustic effects form the basic scenario vocabulary ofVoices-in-the-wild-2M. Each one emphasizes a distinct degradation mechanism, while still including the secondary primitive effects required to make the simulation realistic. In the next subsection, we use these atomic effects as building blocks for constructing compound acoustic scenarios.

C.4Construction of Compound Acoustic Scenarios

The seven atomic acoustic effects above serve as the basic scenario vocabulary for constructing compound acoustic scenarios. However, not all atomic effects play the same role in real-world acoustic environments. We therefore divide them into two groups:scene-defining anchor effectsandportable modifier effects.

The scene-defining anchor effects includefar-field,Echo&reverb, andobstructed. These effects usually determine the dominant acoustic geometry of a recording condition. For example, far-field speech is primarily characterized by distance-induced attenuation and reverberation; strong echo corresponds to highly reflective spaces with delayed reflections; and barrier speech is dominated by occlusion-induced spectral attenuation. Since these effects describe mutually distinctive propagation conditions, we do not directly combine multiple anchor effects within the same scenario.

The portable modifier effects includerecording coloration,electronic distortion,noise, andtransmission dropout. These effects are more flexible and can be attached to different anchor conditions. They correspond to playback-recording artifacts, device overload, background interference, and unstable transmission, respectively. Such factors commonly co-occur with different acoustic geometries in real deployments. For example, far-field speech may also be noisy and distorted, and barrier speech may additionally suffer from recording-channel degradation.

Scenario enumeration.

Following this anchor–modifier decomposition, we enumerate 54 acoustic scenario categories in total. The enumeration consists of four groups.

First, we include the seven single-effect scenarios, corresponding to the seven atomic acoustic effects themselves. Second, we construct 18 two-effect scenarios, including all anchor–modifier pairs and all modifier–modifier pairs. Third, we construct 13 three-effect scenarios. These include anchor-prefixed combinations with two selected modifiers, as well as all three-way combinations among the four modifier effects. Finally, we construct 16 higher-order scenarios, including anchor-prefixed combinations with three or four modifiers and the modifier-only four-way combination.

Table 17:Enumeration of the 54 acoustic scenario categories inVoices-in-the-wild-2M. Anchor effects arefar-field,echo&reverb, andobstructed; modifier effects arerecording coloration,electronic distortion,noise, andtransmission dropout.GroupConstruction ruleNumberSingle-effect scenariosSeven atomic acoustic effects77Two-effect scenariosAnchor–modifier pairs:3×43\times 4; modifier–modifier pairs:(42)\binom{4}{2}12+6=1812+6=18Three-effect scenariosAnchor with two selected modifiers:3×33\times 3; modifier-only triples:(43)\binom{4}{3}9+4=139+4=13Higher-order scenariosAnchor with three modifiers:3×(43)3\times\binom{4}{3}; modifier-only four-way combination:11; anchor with all four modifiers:3312+1+3=1612+1+3=16Total–54

Anchor–modifier composition.

The anchor effects define the main acoustic environment, while the modifier effects introduce additional degradations that are portable across environments. This design avoids unrealistic combinations among mutually distinctive propagation geometries. For example,far-field,echo&reverb, andobstructedeach describe a different dominant acoustic path, and therefore are not directly combined with each other. In contrast, modifiers such asnoise,distortion,recording coloration, andtransmission dropoutcan naturally co-occur with many acoustic geometries.

For two-effect scenarios, we include all anchor–modifier pairs and all modifier–modifier pairs. For three-effect scenarios, we include two types of compositions: an anchor effect combined with two modifiers, and modifier-only triples. For higher-order scenarios, we further include anchor-prefixed combinations with three modifiers, the modifier-only four-way combination, and anchor-prefixed combinations with all four modifiers. This enumeration yields a balanced set of atomic, moderate-composition, and high-composition acoustic conditions.

Effect-chain maintenance.

Each compound scenario is represented by a list of atomic effects. To generate the final signal-processing chain, we merge the ordered primitive-effect chains of the selected atomic effects. The merge procedure preserves the within-scene order of each atomic effect and removes cross-scene duplicate primitive effects, except for additive noise. This exception is used because real environments may contain multiple independent noise sources. The resulting merged chain is then parameterized and applied sequentially to the waveform.

Algorithm 1Effect-chain maintenance for compound acoustic scenarios1:Atomic scene configurations

𝒞\mathcal{C}; selected atomic effects

S=[s1,…,sm]S=[s_{1},\ldots,s_{m}] 2:Duplicate-allowed primitive set

𝒟={add_noise}\mathcal{D}=\{\texttt{add\_noise}\} 3:Initialize merged chain

M←[]M\leftarrow[\,] 4:Initialize previously seen primitive set

V←∅V\leftarrow\emptyset 5:for

sis_{i}in

SSdo

6:Load ordered primitive-effect chain

EiE_{i}from

𝒞​[si]\mathcal{C}[s_{i}] 7:Initialize current-scene primitive set

U←∅U\leftarrow\emptyset 8:forprimitive effect

eein

EiE_{i}do

9:if

e.name∈𝒟e.\mathrm{name}\in\mathcal{D}then

10:Append

eeto

MM 11:elseif

e.name∉Ve.\mathrm{name}\notin Vthen

12:Append

eeto

MM 13:Add

e.namee.\mathrm{name}to

UU 14:else

15:Skip

eeas a cross-scene duplicate

16:endif

17:endfor

18:

V←V∪UV\leftarrow V\cup U 19:endfor

20:returnmerged primitive-effect chain

MM

This merge strategy is important for preserving atomic-effect definitions. For example, the recording coloration effect intentionally contains two filtering operations, one high-pass and one low-pass, to narrow the frequency response. Such within-scene repeated operators are preserved, while duplicate operators introduced by different atomic effects are removed unless explicitly allowed.

C.5Severity Sampling and Difficulty Calibration

To make the simulated data both diverse and controllable, we associate each generated sample with a global severity variablem∈[0,1]m\in[0,1]. Rather than sampling every effect parameter independently, we first sample a latent variablex∼𝒰​(0,1)x\sim\mathcal{U}(0,1)and then map it to the final severity valuemmusing a predefined difficulty mapping function. The resultingmmis shared across the primitive effects in the same sample, which ensures that the degradation level remains globally coherent instead of varying arbitrarily across different effects.

Formally, for each generated sample we first draw

x∼𝒰​(0,1),x\sim\mathcal{U}(0,1),and compute

wheref​(⋅)f(\cdot)is one of four candidate mapping functions. We consider the following mappings:

mlinear​(x)=x,m_{\text{linear}}(x)=x,(9) msqrt-fwd​(x)=x,m_{\text{sqrt-fwd}}(x)=\sqrt{x},(10) msqrt-bwd​(x)=x2,m_{\text{sqrt-bwd}}(x)=x^{2},(11) mgaussian-mid(x)=clip(Φ−1(0.05+0.9x;μ=0.5,σ),0,1),m_{\text{gaussian-mid}}(x)=\mathrm{clip}\!\left(\Phi^{-1}(0.05+0.9x;\mu=0.5,\sigma),\,0,\,1\right),(12)whereΦ−1​(⋅;μ,σ)\Phi^{-1}(\cdot;\mu,\sigma)denotes the inverse CDF of a Gaussian distribution with meanμ=0.5\mu=0.5, andσ\sigmais set such that the central region is emphasized while the two extremes are compressed. In practice, this mapping increases the density of medium-difficulty samples and avoids over-sampling the easiest and hardest regimes.

The four mappings differ in how they distribute probability mass over the final severity variablemm. The linear mapping preserves the original uniform sampling and therefore distributes difficulty evenly over the full range. The sqrt-forward mapping allocates more samples to the hard regime by increasingmmrapidly at smallxx. In contrast, the sqrt-backward mapping biases the sampling toward easier samples, sincemmgrows more slowly at smallxx. The gaussian-mid mapping concentrates more samples around intermediate difficulty levels and suppresses both extremes.

Figure8visualizes the four mapping functions. Although they all map the same uniform variablexxinto the common severity range[0,1][0,1], they induce substantially different difficulty profiles for the generated dataset.

After obtaining the global severity valuemm, we use it to instantiate the random parameters in each primitive effect. Each random parameter is defined by a range and a monotonicity flag indicating whether smaller values are easier or harder. For a parameter with range[a,b][a,b], the sampled value is computed as

θ={a+(b−a)​m,if larger values correspond to harder samples,b−(b−a)​m,if smaller values correspond to harder samples.\theta=\begin{cases}a+(b-a)m,&\text{if larger values correspond to harder samples},\\[3.0pt] b-(b-a)m,&\text{if smaller values correspond to harder samples}.\end{cases}Integer-valued parameters are rounded to the nearest valid integer after this mapping. For categorical parameters, the same severity valuemmis used to select an option index from an ordered candidate list.

This design gives us a unified severity interface across heterogeneous effects. For example, a largermmmay correspond to lower cutoff frequency in a low-pass filter, larger distortion drive, stronger reverberation, lower target loudness, or higher stutter probability, depending on the semantics of the parameter. Importantly, because all parameters in the same sample share the same global severity variable, the resulting degradation remains internally consistent: a hard sample tends to be hard across all of its active primitive effects, while an easy sample remains globally mild.

We empirically compare the four candidate mappings by generating probe sets under each mapping and evaluating the resulting training utility on real noisy speech. The goal is not only to increase nominal difficulty, but to obtain a severity profile that yields the best downstream robustness after supervised fine-tuning. Among the four candidates, the linear mapping provides the most balanced coverage of easy, medium, and hard samples, and leads to the best overall robustness in our pilot experiments. We therefore adopt the linear mapping as the default severity profile inVoices-in-the-wild-2M.

Intuitively, the sqrt-forward mapping over-emphasizes hard samples, which may reduce learnability in the early stages of training; the sqrt-backward mapping places too much mass on easy samples and therefore under-exposes the model to challenging conditions; and the gaussian-mid mapping improves coverage of medium-difficulty samples but under-represents the two ends of the spectrum. The linear mapping strikes the best balance between coverage, learnability, and difficulty diversity.

Implementation detail: global severity sharing.

In our implementation, we use a shared global severity value for all primitive effects within the same sample. Concretely, a singlexxis first sampled and mapped into a singlemm, and thismmis then reused when resolving the random parameters of all active primitive effects in the corresponding effect chain. This mechanism avoids internally inconsistent mixtures such as a sample with extremely strong reverberation but almost negligible noise, or severe dropout combined with otherwise near-clean recording quality. In this way, the sampled difficulty more faithfully reflects a coherent acoustic condition rather than an arbitrary mixture of independently sampled parameter strengths.

Appendix DRouter Implementation and Training Details

D.1Motivation

Mega-ASR is optimized for acoustically degraded speech, but always using the robust weights is not necessarily optimal for all inputs. In particular, the original Qwen3-ASR backbone retains strong clean-domain behavior and can better preserve complementary capabilities such as clean-speech recognition, hotword recognition, and streaming-style inference. We therefore introduce a lightweight environment-aware router that predicts whether an input utterance should be processed by the original backbone or by the robust LoRA-enhanced Mega-ASR weights.

The router is used only for model selection. It does not generate transcripts and does not modify the ASR decoding process. Given an input audio clip, the router outputs a binary decision: clean inputs are routed to the base Qwen3-ASR model, while degraded inputs are routed to theMega-ASRLoRA branch. This makes Mega-ASR a plug-and-play robustness module rather than a full replacement of the original ASR system.

D.2Router Model Architecture

The router is implemented as a lightweight audio-quality classifier. It takes log-Mel acoustic features as input and predicts a binary label indicating whether the input is clean or degraded. We use a single-layer Transformer architecture to minimize routing overhead.

The model first extracts 80-dimensional log-Mel features from the waveform. A lightweight convolutional frontend maps the Mel features to a hidden dimension and performs temporal downsampling. The downsampled sequence is then augmented with sinusoidal positional encoding and passed through a single Transformer encoder layer. Finally, an attention-pooling module aggregates the frame-level representations into an utterance-level embedding, followed by a linear binary classification head.

Table 18:Architecture of the environment-aware router.ComponentConfigurationInput feature80-dimensional log-Mel spectrogramSample rate16 kHzMaximum duration30 sFrontendLightweight 1D convolutional frontendTemporal downsampling2×2\timesHidden dimension128Transformer layers1Attention heads4Feed-forward dimension256PoolingAttention poolingClassifierLinear binary headOutput labelsclean / degraded

D.3Router Training Data

The router is trained with binary supervision. Clean speech is labeled as0and routed to the original Qwen3-ASR backbone, while degraded speech is labeled as11and routed to theMega-ASRLoRA branch. The clean subset is constructed from LibriSpeech, AISHELL-1, CommonVoice22, and WenetSpeech. The degraded subset is constructed fromVoices-in-the-wild-2M.

The final router dataset contains 552,651 clean samples and 674,107 degraded samples. We split the data into training, validation, and test sets, containing 1,104,084, 61,337, and 61,337 samples respectively.

Table 19:Dataset used for training the environment-aware router. Clean samples are labeled as 0 and degraded samples are labeled as 1.SubsetSourceNumber of samplesCleanLibriSpeech, AISHELL-1, CommonVoice22, WenetSpeech552,651DegradedVoices-in-the-wild-2M674,107Train splitMixed clean/degraded1,104,084Validation splitMixed clean/degraded61,337Test splitMixed clean/degraded61,337On the held-out development set, the router achieves over 99.5% binary classification accuracy, indicating that the acoustic difference between clean and degraded inputs can be reliably detected by a lightweight model.

D.4Training Objective and Optimization

The router is trained as a binary classifier. Given an input utterancexxand a labely∈{0,1}y\in\{0,1\}, wherey=1y=1denotes degraded speech, the router predictspθ​(y∣x)p_{\theta}(y\mid x). We optimize the standard cross-entropy loss:

ℒrouter=−log⁡pθ​(y∣x).\mathcal{L}_{\mathrm{router}}=-\log p_{\theta}(y\mid x). During training, each audio file is resampled to 16 kHz, converted to mono, and truncated to at most 30 seconds. We extract log-Mel spectrogram features and pad variable-length sequences within each mini-batch. For training samples, we apply lightweight augmentation including random gain perturbation and weak additive noise. The model is optimized with AdamW, a warmup cosine learning-rate schedule, gradient clipping, label smoothing, and mixed-precision training.

Table 20:Router training configuration.ItemSettingTaskBinary clean/degraded classificationInput featureLog-Mel spectrogramSample rate16 kHzMaximum duration30 sLossCross entropyLabel smoothing0.1OptimizerAdamWLearning-rate scheduleWarmup + cosine decayWarmup ratio0.1Gradient clipping1.0Mixed precisionEnabledValidation metricsAccuracy, precision, recall, F1, AUCBest checkpoint criterionValidation accuracy / AUCDevelopment accuracy>99.5%>99.5\%

D.5Integration with Qwen3-ASR and LoRA Delta Switching

We integrate the router with Qwen3-ASR-1.7B using a single-model delta-switching design. Instead of loading separate base and robust ASR models, we load one Qwen3-ASR-1.7B instance and precompute the LoRA weight deltas of the robust adapters. At inference time, the router predicts whether the input is degraded. If the input is predicted as clean, the model keeps or switches to the base weights; if it is predicted as degraded, the LoRA deltas are activated and the utterance is decoded with the robustMega-ASRbranch.

Concretely, the system first runs the audio-quality predictor and obtains a dirty probabilitypdirtyp_{\mathrm{dirty}}. With thresholdγ=0.5\gamma=0.5, routing is defined as

route​(x)={Mega-ASR LoRA branch,pdirty​(x)≥γ,Qwen3-ASR base branch,pdirty​(x)<γ.\mathrm{route}(x)=\begin{cases}\text{Mega-ASR LoRA branch},&p_{\mathrm{dirty}}(x)\geq\gamma,\\ \text{Qwen3-ASR base branch},&p_{\mathrm{dirty}}(x)<\gamma.\end{cases}The LoRA switch is implemented by adding or subtracting the precomputed LoRA delta tensors from the corresponding base weights. Therefore, switching does not require reloading the full model and introduces only a small runtime overhead.

Algorithm 2Router-guided LoRA delta switching for Qwen3-ASR1:Input audio

xx, router

gg, threshold

γ\gamma 2:Qwen3-ASR base model with preloaded LoRA deltas

3:Compute dirty probability

pdirty←g​(x)p_{\mathrm{dirty}}\leftarrow g(x) 4:if

pdirty≥γp_{\mathrm{dirty}}\geq\gammathen

5:Set LoRA state to active

6:Decode

xxwith theMega-ASRbranch

7:else

8:Set LoRA state to inactive

9:Decode

xxwith the Qwen3-ASR base branch

10:endif

11:Return transcription

D.6Inference Overhead

Because the router is a small single-layer classifier and LoRA switching is implemented by adding or subtracting precomputed delta tensors, the additional runtime cost is negligible. The router is executed once before transcription, and the ASR model itself is not reloaded during switching. In batch inference, we first group utterances by routing decision and then decode the LoRA-routed and base-routed groups separately, further reducing unnecessary switching.

We measure inference time on CHiME-4 using the same evaluation pipeline for direct Qwen3-ASR inference and router-guided inference. As shown in Table21, the routed system has a total runtime of 371 seconds, compared with 374 seconds for direct Qwen3-ASR inference. The relative difference is−0.8%-0.8\%, which is within normal runtime fluctuation. Therefore, the router and delta-switching mechanism introduce no measurable inference overhead in practice.

Table 21:Inference-time overhead of router-guided LoRA switching on CHiME-4. The routed system shows comparable runtime to direct Qwen3-ASR inference, with relative difference below 1%.SystemDatasetTotal runtimeRelative differenceQwen3-ASR-1.7BCHiME-4374 s–Qwen3-ASR + router + LoRA delta switchCHiME-4371 s−0.8%-0.8\%

Appendix ETraining and Implementation Details

E.1A2S-SFT Hyperparameters

This section reports the training configuration of the Acoustic-to-Semantic Supervised Fine-Tuning (A2S-SFT) stage. A2S-SFT is implemented as a three-phase training procedure:(i)encoder-aligner acoustic adaptation,(ii)LLM-side semantic adaptation, and(iii)joint acoustic-semantic adaptation. All phases are initialized from Qwen3-ASR-1.7B and use LoRA-based parameter-efficient fine-tuning. Unless otherwise specified, the effective batch size is set to128128.

Training schedule.

The three phases differ in both trainable scope and data schedule. In the first phase, only the acoustic encoder and the speech-to-LLM aligner are updated. This phase is the only stage where we apply a WER-graded curriculum. Specifically, the training subset is progressively expanded fromWER<30%\mathrm{WER}<30\%toWER<50%\mathrm{WER}<50\%, and finally toWER<70%\mathrm{WER}<70\%. This schedule provides a stable acoustic warm start before exposing the encoder-aligner stack to harder and noisier samples.

The second and third phases do not use the progressive WER curriculum. Instead, they are trained directly on the full targeted split. In Phase II, the acoustic encoder and aligner are frozen, and only the LLM-side LoRA parameters are updated to adapt the language model to noisy transcription recovery. In Phase III, the encoder, aligner, and LLM are jointly updated with LoRA to align the acoustic representations and semantic decoding behavior end-to-end.

Stage I: Encoder–Aligner Acoustic AdaptationLoRA update on acoustic encoder and speech-to-LLM alignerWER<30%→WER<50%→WER<70%\mathrm{WER}<30\%\;\rightarrow\;\mathrm{WER}<50\%\;\rightarrow\;\mathrm{WER}<70\%⇓\DownarrowStage II: LLM-side Semantic AdaptationLoRA update on LLM-side parameters; full targeted split⇓\DownarrowStage III: Joint Acoustic-Semantic AdaptationLoRA update on encoder, aligner, and LLM; full targeted split

Figure 7:A2S-SFT training schedule. The WER-graded curriculum is applied only in Stage I for encoder-aligner adaptation. Stages II and III are trained on the full targeted split.

Data construction.

For the encoder-aligner warm-start phase, we sample3030K utterances from the training pool and use them to construct the WER-graded acoustic curriculum. The curriculum first uses relatively reliable samples to stabilize the acoustic interface, then gradually introduces more challenging samples with higher WER. The subsequent LLM-side and joint adaptation phases use the full targeted split constructed from the same preprocessing pipeline. The validation set is kept disjoint from the training set and is used for checkpoint monitoring and failure pattern inspection.

Stage-wise hyperparameters.

Table22summarizes the main hyperparameters of the three phases. Phase I adapts the acoustic tower and the projection/aligner module with LoRA. In our implementation, the trainable acoustic scope focuses on the upper acoustic blocks and the projection module, so that the model can adjust high-level acoustic representations while keeping the majority of the pretrained backbone stable. Phase II freezes the acoustic side and updates the LLM-side LoRA parameters. Phase III jointly updates all three module groups, with a smaller learning rate for the acoustic encoder and aligner to avoid disrupting the representations obtained in Phase I.

Table 22:Stage-wise hyperparameters of A2S-SFT. The WER curriculum is used only in Phase I; Phases II and III are trained on the full targeted split.SettingPhase IPhase IIPhase IIITraining roleAcoustic warm startSemantic adaptationJoint alignmentTrainable modulesEncoder + alignerLLMEncoder + aligner + LLMData scheduleWER-graded curriculumFull targeted splitFull targeted splitWER range<30%→<50%→<70%<30\%\rightarrow<50\%\rightarrow<70\%Full targeted rangeFull targeted rangePer-device batch size888Number of GPUs222Gradient accumulation888Effective batch size128128128Epochs211Encoder learning rate1.0×10−61.0\times 10^{-6}frozen5.0×10−75.0\times 10^{-7}Aligner learning rate1.0×10−61.0\times 10^{-6}frozen5.0×10−75.0\times 10^{-7}LLM learning ratefrozen1.0×10−61.0\times 10^{-6}1.0×10−61.0\times 10^{-6}Warmup ratio0.050.050.03Weight decay0.010.010.01Maximum gradient norm1.01.01.0LoRA rankrr888LoRA alpha161616LoRA dropout0.050.050.05Checkpoint interval200 steps200 steps200 stepsSaved weightsAdapter onlyAdapter onlyAdapter / merged adapter

Optimization details.

All phases are trained with distributed data parallelism on two GPUs. The effective batch size is computed as

Beff=Bdevice×NGPU×Naccum=8×2×8=128.B_{\mathrm{eff}}=B_{\mathrm{device}}\times N_{\mathrm{GPU}}\times N_{\mathrm{accum}}=8\times 2\times 8=128.We use conservative learning rates because the model is initialized from a pretrained ASR-LLM checkpoint rather than trained from scratch. In the final joint phase, the encoder and aligner learning rates are reduced to5.0×10−75.0\times 10^{-7}, while the LLM-side learning rate remains1.0×10−61.0\times 10^{-6}. This asymmetric setting helps preserve the acoustic adaptation from Phase I while still allowing the language model to adjust to the full noisy transcription distribution. Gradients are clipped to1.01.0in all phases, and checkpoints are saved every200200optimization steps. We save adapter-only checkpoints during intermediate phases to reduce storage overhead and simplify later merging.

Table 23:Common implementation settings used in A2S-SFT.ItemSettingBackbone modelQwen3-ASR-1.7BTraining typeLoRA-based supervised fine-tuningDistributed training2-GPU training with one process per GPUPer-device batch size8Gradient accumulation steps8Effective batch size128Optimizer regularizationWeight decay 0.01Gradient clippingMaximum gradient norm 1.0WarmupLinear warmup with ratio 0.05 or 0.03 depending on phaseCheckpointingSave every 200 optimization stepsCheckpoint formatAdapter-only during intermediate phases; merged adapter for downstream useModel selectionValidation WER together with inspection of empty, hallucinated, and off-audio outputs

Preliminary training variants.

Before fixing the above schedule, we examined several alternative update orders. These comparisons were used to validate the need for staged optimization rather than as separate model variants in the final system. Directly training all modules from the beginning was less stable on medium- and high-WER samples, since the language model could adapt to unreliable acoustic representations before the encoder-aligner interface became sufficiently grounded. Training only the encoder-aligner improved acoustic consistency but gave limited gains on heavily corrupted samples requiring semantic recovery. Conversely, adapting the LLM before the acoustic warm start made the model more prone to relying on language priors. The final schedule therefore uses encoder-aligner adaptation first, LLM-side adaptation second, and joint acoustic-semantic alignment last.

Table 24:Preliminary A2S-SFT variants considered during development.VariantObservationDirect joint SFT from the beginningLess stable in medium- and high-WER regimes; the model could adapt to unreliable acoustic representations early in training.Encoder-aligner onlyImproved acoustic grounding, but provided limited recovery when the acoustic evidence was incomplete or severely corrupted.LLM adaptation before acoustic warm startIncreased reliance on the language prior before the acoustic interface was sufficiently stabilized.No WER curriculum in Phase IProduced larger validation fluctuations during the acoustic warm-start stage.Final three-phase scheduleProvided the most stable training behavior by separating acoustic adaptation, semantic adaptation, and final end-to-end alignment.

E.2DG-WGPO Hyperparameters

This section provides the implementation and training details of Dual-Granularity WER-Gated Policy Optimization (DG-WGPO). DG-WGPO is implemented with DAPO-style policy optimization in an RLHF framework. Since Qwen3-ASR is not a standard text-only causal language model, we introduce a custom multimodal adaptation layer to make the audio encoder, speech-to-language aligner, and language model compatible with group-based policy optimization.

Framework and model adaptation.

Qwen3-ASR takes both an audio signal and a text-side prompt as input. Therefore, directly treating it as a pure text model would break the rollout and loss construction used in GRPO/DAPO. We adapt Qwen3-ASR as a multimodal policy model while keeping its official inference behavior unchanged. The adaptation mainly addresses four issues:(i)preserving the original audio preprocessing and prompt construction protocol,(ii)exposing a training interface that accepts both text tokens and acoustic features,(iii)making the inner language model compatible with LoRA-based policy updates, and(iv)ensuring that rollout completions retain the raw ASR format for reward parsing.

Table 25:Summary of the Qwen3-ASR adaptation used for DG-WGPO. We only list the model-level adaptation principles and omit implementation-specific function or class names.Adaptation itemPurposeMultimodal model loadingLoad Qwen3-ASR as an audio-language policy model rather than a text-only decoder.Official processor consistencyKeep the same audio preprocessing, tokenizer behavior, and prompt protocol between inference, rollout, and RL training.Multimodal forward interfaceAllow the trainer to pass text tokens, text masks, acoustic features, acoustic masks, and response labels in a unified training call.Language-model interface alignmentExpose the inner language-model embeddings and output head to the LoRA/RLHF trainer without changing the Qwen3-ASR architecture.Module groupingSeparate the model into language model, acoustic encoder, and aligner groups, so that update scopes can be controlled explicitly.Rollout re-encodingRe-encode sampled completions together with the original multimodal prompt and apply loss only on the generated response tokens.Padding policyUse left padding for text-side sequences and temporal padding for audio-side features, matching decoder-only generation and acoustic feature batching.Raw-output preservationPreserve the original ASR completion format during decoding, allowing the reward function to parse empty outputs, language prefixes, repetitions, and format irregularities consistently.

Data and initialization.

The DG-WGPO stage is initialized from the A2S-SFT LoRA-merged checkpoint rather than from the original Qwen3-ASR checkpoint. This ensures that the initial policy already has stable acoustic grounding and reasonable transcription ability before entering reinforcement learning. The RL training and validation sets are constructed as targeted WER-aware splits. Unlike general SFT data, the RL data are enriched with medium- and high-WER examples, while relatively clean utterances are reduced to prevent the policy update from being dominated by easy samples. This design matches the goal of DG-WGPO: improving robustness in the regimes where standard supervised fine-tuning and WER-only rewards provide limited corrective signals.

Each RL example contains a multimodal prompt, an audio input, a reference transcription, the base-model prediction, and the base WER used for data selection and analysis. Table26summarizes the data schema. During training, the reference transcription is used only by the reward function; the policy is optimized from sampled completions and their group-wise relative rewards.

Table 26:JSONL data schema used in DG-WGPO. The absolute audio paths are omitted from the paper.FieldDescriptionmessagesSystem and user instructions that define the ASR task, e.g., transcribe the given audio and output plain text only.audiosA list containing the audio file associated with the current prompt. Each training example uses one audio input.solutionReference transcription used for WER computation and reward evaluation.predictionTranscription generated by the initialization model. This field is used for data targeting and diagnostic comparison, not as a supervised label.base_werWER of the initialization model on the current sample. We use it to emphasize medium- and high-WER regions in RL data construction and analysis.metaOptional metadata field for bookkeeping.

Policy optimization setup.

DG-WGPO uses GRPO-style group-relative advantage estimation with the DAPO loss. The reported main run uses three GPUs, and the same setting can be scaled to four or eight GPUs by increasing the number of distributed processes. We keep the per-device batch size, number of generations, and reward settings unchanged when scaling the number of GPUs.

In the main run, the language model, acoustic encoder, and aligner are all allowed to receive LoRA updates. Although all three module groups participate in policy optimization, the total number of trainable parameters remains controlled because the update is parameter-efficient. This full-scope LoRA update is important for DG-WGPO because the reward simultaneously targets acoustic grounding and semantic reconstruction. Updating only the language model improves language-side fluency but is less effective for acoustically induced substitutions and omissions, while updating only the encoder-aligner limits sentence-level recovery in high-WER cases. Therefore, the reported DG-WGPO results use LoRA updates across the acoustic encoder, aligner, and LLM.

Table 27:Main training hyperparameters of DG-WGPO. The effective prompt batch size shown below corresponds to the three-GPU main run.HyperparameterSettingInitializationA2S-SFT LoRA-merged Qwen3-ASR-1.7B checkpointOptimization methodGRPO-style advantage estimation with DAPO lossTraining typeLoRATrainable scopeAcoustic encoder + aligner + language modelMain number of GPUs3Scalable GPU settings3 / 4 / 8 GPUsPer-device train batch size4Per-device evaluation batch size4Gradient accumulation steps16Effective prompt batch size4×3×16=1924\times 3\times 16=192Number of generations per prompt12Evaluation generations per prompt4Maximum completion length256 tokensLearning rate5.0×10−55.0\times 10^{-5}Learning-rate schedulerCosine decayWarmup ratio0.03KL coefficientβ\beta0.04DAPO upper clipping parameter0.28Number of RL iterations2Dynamic samplingEnabledMaximum resampling times4Overlong filteringEnabledTruncation strategyDelete overlong samplesCheckpoint intervalEvery 20 stepsLogging intervalEvery 5 steps

Rollout and generation protocol.

For each prompt, DG-WGPO samplesK=12K=12candidate transcriptions and computes group-relative rewards. We use stochastic decoding because the policy update requires sufficient intra-group diversity: if all completions are nearly identical, the advantage signal collapses. However, excessive exploration can increase hallucinations, overlong outputs, and format violations. We therefore choose the generation parameters through a small exploratory probing stage rather than fixing them heuristically.

Letbib_{i}denote the WER of the initialization model on exampleii, and letHi,k(T)H_{i,k}^{(T)}be thekk-th sampled completion under temperatureTT. We use two statistics to compare temperature settings. The first measures whether the sample group contains a potentially better candidate:

CPIδ​(T)=1N​∑i=1N𝕀​[min1≤k≤K⁡WER​(Hi,k(T),Ri)≤bi−δ],\mathrm{CPI}_{\delta}(T)=\frac{1}{N}\sum_{i=1}^{N}\mathbb{I}\left[\min_{1\leq k\leq K}\mathrm{WER}\!\left(H_{i,k}^{(T)},R_{i}\right)\leq b_{i}-\delta\right],(13)whereCPI\mathrm{CPI}denotes the candidate potential indicator. The second measures whether the reward-selected candidate preserves the base transcription ability:

BAPδ​(T)=1N​∑i=1N𝕀​[WER​(Hi,k⋆(T),Ri)≤bi+δ],k⋆=arg⁡maxk⁡R​(Hi,k(T),Ri),\mathrm{BAP}_{\delta}(T)=\frac{1}{N}\sum_{i=1}^{N}\mathbb{I}\left[\mathrm{WER}\!\left(H_{i,k^{\star}}^{(T)},R_{i}\right)\leq b_{i}+\delta\right],\quad k^{\star}=\arg\max_{k}R\!\left(H_{i,k}^{(T)},R_{i}\right),(14)whereBAP\mathrm{BAP}denotes base-ability preservation. In practice, we also monitor the valid-output rate, including non-empty outputs, non-repetitive outputs, and completions within the maximum length. The final temperature is chosen to balance candidate potential, base-ability preservation, and output validity.

Table 28:Generation settings used in the main DG-WGPO run.HyperparameterSettingNumber of generations12Evaluation generations4Temperature0.50Top-pp0.95Top-kk50Repetition penalty1.08Maximum completion length256 tokensDynamic samplingEnabledMaximum resampling times4Overlong filteringEnabledTable 29:Temperature probing protocol. The final run usesT=0.50T=0.50because it provides moderate exploration while preserving the base ASR behavior.TemperatureObserved tendencyRole in selection0.30Conservative decoding with low intra-group diversityUsed to check the lower-exploration regime; often yields weak advantage dispersion.0.50Moderate diversity with stable ASR formatting and relatively high valid-output rateSelected as the default setting by balancing candidate potential and base-ability preservation.0.70Higher diversity and more possible correctionsUsed to probe whether more aggressive sampling reveals better candidates, but requires stronger filtering.0.90Strong exploration with higher risk of hallucination, off-audio text, and overlong outputsUsed only as a stress test for reward robustness and filtering behavior.

Reward tuning and diagnostics.

The main reward function follows the DG-WGPO formulation described in the main text. We use a WER-gated dynamic reward withτ=0.5\tau=0.5, soft-error discountαs=0.4\alpha_{s}=0.4, and dynamic-reward weightαdyn=0.6\alpha_{\mathrm{dyn}}=0.6. For samples below the WER gate, the reward emphasizes token-level refinement; for samples above the gate, it assigns more weight to sentence-level structural recovery. This design is especially important for the targeted RL split, where many examples contain medium- or high-WER predictions and the standard WER reward can become less discriminative.

Table 30:Reward hyperparameters used in DG-WGPO.Reward hyperparameterSettingStatic WER reward1−WER1-\mathrm{WER}Repetition gateEnabledSoft substitution thresholdCharacter-level edit similarity≥0.5\geq 0.5Soft-error discountαs\alpha_{s}0.4WER gate thresholdτ\tau0.5Dynamic reward weightαdyn\alpha_{\mathrm{dyn}}0.6Low-WER fusion0.75​Rfine+0.25​Rstruc0.75R_{\mathrm{fine}}+0.25R_{\mathrm{struc}}High-WER fusion0.25​Rfine+0.75​Rstruc0.25R_{\mathrm{fine}}+0.75R_{\mathrm{struc}}Reward scalingGroup-wise scalingLength and overlong controlEnabled through rollout filtering and reward diagnosticsWe tune the reward by inspecting rollout groups rather than relying only on the scalar training reward. For each diagnostic group, we compare the reference, the initial model prediction, sampled hypotheses, component rewards, and final reward ranking. This allows us to check whether a higher reward corresponds to a real error reduction, such as correcting acoustically plausible substitutions, recovering omitted content, reducing repeated phrases, or improving sentence-level structure. We also inspect failure cases where the scalar reward prefers a shorter but incomplete hypothesis, a fluent but off-audio hypothesis, or a format-valid but semantically incorrect transcription. These diagnostics are used to calibrate the WER gate, the soft-error discount, and the balance between the static and dynamic rewards.

Model selection is based on validation WER together with rollout quality statistics, including empty-output rate, repetition rate, overlong-output rate, and the behavior of reward-selected candidates on medium- and high-WER samples. This avoids selecting checkpoints that over-optimize a single reward component while degrading transcription faithfulness.

Appendix FAdditional Related works

Traditional Robust ASR Methods.

Robust automatic speech recognition has been widely studied to improve transcription under noise, reverberation, channel mismatch, speaker variation, and domain shift. Traditional methods usually rely on speech enhancement, feature normalization, speaker adaptation, multi-condition training, and language-model rescoring. With the development of end-to-end ASR, data augmentation and large-scale pretraining have become dominant solutions. Representative works include SpecAugmentParket al.[2019], ConformerGulatiet al.[2020], wav2vec 2.0Vaessen and Van Leeuwen [2022], HuBERTHsuet al.[2021], WavLMHuet al.[2024], and WhisperRadfordet al.[2023]. These methods greatly improve robustness, but they are still mainly optimized for transcription accuracy and are usually evaluated by word error rate, rather than semantic understanding or reasoning over speech.

Large Audio Language Models.

Large Audio Language Models (LALMs) connect speech or general audio signals with large language models, enabling audio-conditioned instruction following, question answering, and reasoning. Compared with conventional ASR systems, LALMs are attractive for ASR because they can use linguistic knowledge and contextual reasoning to recover corrupted or ambiguous speech. Existing LALMs have shown strong capabilities in audio-conditioned understanding, instruction following, and speech-language reasoningGonget al.[2024], Deshmukhet al.[2023], Xuet al.[2025a], Rubensteinet al.[2023], Huet al.[2024], Konget al.[2024], Xie and Wu [2024a,b], Wuet al.[2025], Baiet al.[2024]. However, this ability may also introduce hallucinations, where the model generates plausible but incorrect transcriptions that are not grounded in the input audio. Recent works have further explored reasoning-based ASR, attempting to use audio-language reasoning to improve recognition beyond direct acoustic decodingZhifeiet al.[2025], Xieet al.[2025], Huanget al.[2025].

Speech Recognition Datasets and Benchmarks.

Speech recognition datasets and benchmarks provide the basis for evaluating ASR performance under different acoustic and linguistic conditions. Commonly used clean or read-speech datasets include LibriSpeechPanayotovet al.[2015]and TED-LIUMRousseauet al.[2012], while SwitchboardGodfreyet al.[1992]is widely used for conversational speech recognition. Common VoiceArdilaet al.[2020]supports multilingual and diverse-speaker ASR evaluation. For noisy, far-field, and meeting scenarios, representative benchmarks include CHiMEWatanabeet al.[2016], AMIKraaijet al.[2005], and Speech Robust BenchShahet al.[2025]. These datasets mainly evaluate transcription quality with WER, making them suitable for measuring ASR robustness but less focused on reasoning or instruction-following ability.

Refer to captionFigure 8:Difficulty mapping functions used to transform a uniform samplex∈[0,1]x\in[0,1]into the final global severity variablem∈[0,1]m\in[0,1]. Linear preserves a uniform severity profile; Sqrt Forward emphasizes hard samples; Sqrt Backward emphasizes easy samples; and Gaussian Mid concentrates samples around the medium-difficulty region.Refer to captionFigure 9:Qualitative case studies showing error-mode transitions from Qwen3-ASR to Mega-ASR. The examples cover compound acoustic degradation, recording coloration, dropout, noise, and CHiME-4 street noise. Compared with the baseline,Mega-ASRreduces catastrophic failure modes such as cross-lingual hallucination, empty-output collapse, semantic drift, and entity-relation errors.

Xie Zhifei (@XieZhifei14110): Hi everyone, 如果你的项目中有语音识别,你一定要看这个帖子。

在去年的一些项目中我们发现,面对远场、混响、电子录制杂音等场景时,即使是目前最好的开源/闭源模型表现也不理想,常常出现大范围幻觉和句子丢失(最典型的就是远场场景,识别率往往不到一半)。

为此我们做了

相似文章

@aigclink: 阿里通义实验室最新发布了款ASR:Fun-ASR 1.5,核心能力:方言工业级可用 单模型即可无缝覆盖30种语言、汉语七大方言体系及20+ 地方口音,古诗词吟诵也能精准转写 典型方言场景CER相对上代下降56.2%,有5种方言准确率破 9…

X AI KOLs Timeline

阿里通义实验室发布Fun-ASR 1.5,单模型覆盖30种语言、汉语七大方言及20余种地方口音,典型方言场景字错率较上代下降56.2%,5种方言准确率突破90%。