@AdinaYakup: Mega-ASR https://huggingface.co/zhifeixie/Mega-ASR… 1.7B Apache 2.0 Built for Noise/Reverb/Clipping/Overlapping speaker…

X AI KOLs Following 05/21/26, 01:51 PM Models

asr speech-recognition robust-asr huggingface qwen3 apache-2.0 open-source

Summary

Mega-ASR is a 1.7B-parameter robust ASR model released under Apache 2.0. It is designed for noisy, reverberant, clipped, and overlapping speech, and includes an audio quality router that sends clean audio down the standard path and degraded audio down the robust path.

Mega-ASR 🔊 https://t.co/KzapZyRL2Y
✨ 1.7B ✨ Apache 2.0
Built for Noise/Reverb/Clipping/Overlapping speakers, the scenarios where standard ASR breaks down.
It includes an audio quality router: clean audio takes the standard path, degraded audio gets the robust path. https://t.co/ha0tN39ET4

Cached at: 05/21/26, 09:38 PM

zhifeixie/Mega-ASR · Hugging Face

Source: https://huggingface.co/zhifeixie/Mega-ASR

Mega-ASR is a robust automatic speech recognition system designed for real-world audio with severe acoustic degradation. It targets noisy, reverberant, clipped, band-limited, overlapping, and otherwise difficult recording conditions where standard ASR systems often produce empty outputs, omissions, repetitions, or hallucinated text.

The release contains the Qwen3-ASR-1.7B foundation model files, Mega-ASR adaptation weights, and an audio quality router. The router decides whether to use the robust Mega-ASR path or the base recognition path for each input, which helps preserve clean-speech recognition quality while improving robustness on degraded speech.
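The routing decision can be pictured as a simple threshold test. The sketch below is illustrative only and is not the repository's implementation: the function name `choose_path`, the path labels, and the assumption that the router outputs a probability that the audio is degraded are all hypothetical. Only the default threshold of 0.5 and the behaviour of always using the robust path when routing is disabled come from the model card.

```python
def choose_path(degraded_prob: float, threshold: float = 0.5,
                routing_enabled: bool = True) -> str:
    """Return "robust" or "base" for one audio clip.

    Assumption: degraded_prob is the router's estimated probability that
    the clip is acoustically degraded. The real router's score semantics
    may differ.
    """
    if not 0.0 <= degraded_prob <= 1.0:
        raise ValueError("degraded_prob must be in [0, 1]")
    if not routing_enabled:
        # With routing off, every clip goes down the robust path.
        return "robust"
    return "robust" if degraded_prob >= threshold else "base"


print(choose_path(0.92))                         # likely degraded clip
print(choose_path(0.08))                         # likely clean clip
print(choose_path(0.08, routing_enabled=False))  # routing disabled
```

Keeping the decision separate from the recognizers makes it easy to see why clean-speech quality is preserved: clips scored below the threshold never touch the adapted weights.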

Model Details

**Model name:** Mega-ASR
**Task:** Automatic speech recognition
**Backbone:** Qwen3-ASR-1.7B
**Primary use case:** In-the-wild ASR under challenging acoustic conditions
**Default decoding:** Greedy decoding
**Default max new tokens:** 256 in the Mega-ASR inference wrapper
**Router:** Audio quality classifier with a default threshold of 0.5
**License:** Apache-2.0

Repository Contents

Mega-ASR/
├── Qwen3-ASR-1.7B/              # Backbone model, tokenizer, processor, and generation config
├── mega-asr-merged/             # Mega-ASR adaptation weights used by the inference wrapper
├── audio_quality_router/        # Audio quality router checkpoint
└── README.md                    # Model card

Intended Use

Mega-ASR is intended for speech-to-text transcription of real-world audio, especially audio affected by compound acoustic distortions. Example scenarios include far-field recording, environmental noise, reverberation, low-quality microphones, compression artifacts, partial signal corruption, and mixed acoustic conditions.

Quick Start

Install the Mega-ASR codebase and dependencies:

git clone https://github.com/xzf-thu/Mega-ASR.git
cd Mega-ASR

conda create -n mega-asr python=3.10 -y
conda activate mega-asr
pip install -r requirements.txt

Place this checkpoint directory at:

ckpt/Mega-ASR

Run inference:

python infer.py --audio /path/to/audio.wav --ckpt_dir ckpt/Mega-ASR

Disable routing if you want to always use the robust recognition path:

python infer.py --audio /path/to/audio.wav --ckpt_dir ckpt/Mega-ASR --routing false

Python usage:

from MegaASR.model.megaASR import MegaASR

model = MegaASR(
    model_path="ckpt/Mega-ASR/Qwen3-ASR-1.7B",
    router_checkpoint="ckpt/Mega-ASR/audio_quality_router/best_acc_model.pt",
    routing_enabled=True,
)

result = model.infer("/path/to/audio.wav", return_route=True)
print(result)

Decoding Defaults

The Mega-ASR wrapper uses Qwen3-ASR generation defaults unless explicitly overridden. In the provided wrapper, `max_new_tokens` is set to 256.

The default generation configuration is deterministic:

do_sample: false
num_beams: 1
repetition_penalty: 1.0
top_p: 1.0
top_k: 50

Because `do_sample` is false, decoding is greedy by default, and sampling controls such as temperature, top-p, and top-k do not affect normal inference.
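A small, self-contained illustration of why this holds: greedy decoding takes the highest-scoring token at each step, and rescaling the logits by any positive temperature does not change which token that is. The helper names are hypothetical, and the logits are made-up numbers rather than model output.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_token(logits, temperature=1.0):
    """Pick the index of the most probable token (no sampling)."""
    probs = softmax(logits, temperature)
    return max(range(len(probs)), key=probs.__getitem__)


step_logits = [1.2, 3.4, 0.7, 2.9]  # toy scores for a 4-token vocabulary
picks = {t: greedy_token(step_logits, temperature=t) for t in (0.5, 1.0, 2.0)}
print(picks)  # the same index is chosen at every temperature
```

Top-k and top-p filtering likewise always retain the single most probable token, so they cannot change a greedy choice either.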

Training Summary

Mega-ASR is trained for robust speech recognition in realistic acoustic environments. The training pipeline uses acoustic-to-semantic supervised fine-tuning, where the model is exposed to progressively harder speech examples and learns to recover both local acoustic details and sentence-level semantics under degradation.

The system is designed to improve recognition robustness on difficult audio while using a routing mechanism to reduce unnecessary changes on clean audio.

Evaluation

Mega-ASR is evaluated on standard ASR benchmarks, noisy robustness benchmarks, and in-the-wild compound acoustic scenarios. The recommended evaluation metrics are:

WER for English and whitespace-tokenized languages
CER for Chinese and character-based evaluation
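For reference, both metrics are an edit distance divided by the reference length: WER counts word-level substitutions, insertions, and deletions, and CER does the same over characters. The sketch below uses the textbook definitions with hypothetical helper names. It is not the repository's evaluation script, which may apply text normalization (casing, punctuation) that this version omits.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring whitespace."""
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


print(wer("the cat sat", "the bat sat"))
print(cer("你好世界", "你好视界"))
```

Because insertions count as errors, both rates can exceed 1.0 when the hypothesis is much longer than the reference, which is one way hallucinated text shows up in these metrics.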

The Mega-ASR repository includes an evaluation script:

python src/MegaASR/eval/evaluate_wer.py \
  --ckpt_dir ckpt/Mega-ASR \
  --input_jsonl examples/test.jsonl \
  --output_jsonl outputs/pred_with_wer.jsonl

Input JSONL format:

{"audio": "examples/audio/noise.wav", "answer": "I usually take the quieter road home because the main street gets crowded after work."}
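Each line is a standalone JSON object with an `audio` path and a reference transcript in `answer`. A minimal sketch for building and checking such a manifest with the standard library follows. The helper names are hypothetical, and the check covers only the two keys shown in the example above, since the evaluation script may accept or require others.

```python
import json

REQUIRED_KEYS = ("audio", "answer")


def build_manifest(samples):
    """Serialize a list of dicts to JSONL text, one object per line."""
    return "\n".join(json.dumps(s, ensure_ascii=False) for s in samples) + "\n"


def parse_manifest(text):
    """Parse JSONL text and verify each record has the expected keys."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [k for k in REQUIRED_KEYS if k not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing keys {missing}")
        records.append(record)
    return records


samples = [
    {"audio": "examples/audio/noise.wav",
     "answer": "I usually take the quieter road home because the main "
               "street gets crowded after work."},
]
manifest = build_manifest(samples)
print(manifest, end="")
print(len(parse_manifest(manifest)), "record(s) validated")
```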

Citation

If you use Mega-ASR, please cite the project:

@misc{xie2026megaasrinthewild2speechrecognition,
      title={Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation},
      author={Zhifei Xie and Kaiyu Pang and Haobin Zhang and Deheng Ye and Xiaobin Hu and Shuicheng Yan and Chunyan Miao},
      year={2026},
      eprint={2605.19833},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2605.19833},
}

Acknowledgements

Mega-ASR builds on Qwen3-ASR. We thank the Qwen3-ASR team and the creators of public speech and audio datasets used in the project.
