@AdinaYakup: Mega-ASR https://huggingface.co/zhifeixie/Mega-ASR… 1.7B Apache 2.0 Built for Noise/Reverb/Clipping/Overlapping speaker…
Summary
Mega-ASR is a 1.7B parameter robust ASR model under Apache 2.0, designed for noisy, reverberant, and overlapping speech, with an audio quality router to handle clean vs degraded audio.
Cached at: 05/21/26, 09:38 PM
Mega-ASR 🔊 https://t.co/KzapZyRL2Y
✨1.7B ✨Apache 2.0
Built for Noise/Reverb/Clipping/Overlapping speakers, the scenarios where standard ASR breaks down
It includes an audio quality router: clean audio takes the standard path, degraded audio gets the robust path https://t.co/ha0tN39ET4
zhifeixie/Mega-ASR · Hugging Face
Source: https://huggingface.co/zhifeixie/Mega-ASR
Mega-ASR is a robust automatic speech recognition system designed for real-world audio with severe acoustic degradation. It targets noisy, reverberant, clipped, band-limited, overlapping, and otherwise difficult recording conditions where standard ASR systems often produce empty outputs, omissions, repetitions, or hallucinated text.
The release contains the Qwen3-ASR-1.7B foundation model files, Mega-ASR adaptation weights, and an audio quality router. The router decides whether to use the robust Mega-ASR path or the base recognition path for each input, which helps preserve clean-speech recognition quality while improving robustness on degraded speech.
Model Details
- **Model name:** Mega-ASR
- **Task:** Automatic speech recognition
- **Backbone:** Qwen3-ASR-1.7B
- **Primary use case:** In-the-wild ASR under challenging acoustic conditions
- **Default decoding:** Greedy decoding
- **Default max new tokens:** 256 in the Mega-ASR inference wrapper
- **Router:** Audio quality classifier with a default threshold of 0.5
- **License:** Apache-2.0
Repository Contents
Mega-ASR/
├── Qwen3-ASR-1.7B/ # Backbone model, tokenizer, processor, and generation config
├── mega-asr-merged/ # Mega-ASR adaptation weights used by the inference wrapper
├── audio_quality_router/ # Audio quality router checkpoint
└── README.md # Model card
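Before running inference, it may help to confirm that a downloaded checkpoint directory matches this layout. The following is a minimal sketch; the helper `missing_entries` is hypothetical and only checks that the three subdirectories from the listing exist, not that their contents are complete.

```python
from pathlib import Path

# Directory names taken from the repository listing; the check is illustrative.
EXPECTED_ENTRIES = ("Qwen3-ASR-1.7B", "mega-asr-merged", "audio_quality_router")


def missing_entries(ckpt_dir):
    """Return the expected subdirectories that are absent from ckpt_dir."""
    root = Path(ckpt_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).is_dir()]
```

Calling `missing_entries("ckpt/Mega-ASR")` returns an empty list when all three subdirectories are present.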
Intended Use
Mega-ASR is intended for speech-to-text transcription of real-world audio, especially audio affected by compound acoustic distortions. Example scenarios include far-field recording, environmental noise, reverberation, low-quality microphones, compression artifacts, partial signal corruption, and mixed acoustic conditions.
Quick Start
Install the Mega-ASR codebase and dependencies:
git clone https://github.com/xzf-thu/Mega-ASR.git
cd Mega-ASR
conda create -n mega-asr python=3.10 -y
conda activate mega-asr
pip install -r requirements.txt
Place this checkpoint directory at:
ckpt/Mega-ASR
Run inference:
python infer.py --audio /path/to/audio.wav --ckpt_dir ckpt/Mega-ASR
Disable routing if you want to always use the robust recognition path:
python infer.py --audio /path/to/audio.wav --ckpt_dir ckpt/Mega-ASR --routing false
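If you prefer to drive the command-line script from Python, one option is to assemble the same command with `subprocess`. This is a sketch under assumptions: the helper names `build_infer_command` and `run_infer` are hypothetical, only the flags shown in the commands above are used, and the format of the script's output is not specified here, so the raw stdout is returned unchanged.

```python
import subprocess
import sys


def build_infer_command(audio, ckpt_dir="ckpt/Mega-ASR", routing=True):
    """Assemble the argv list for infer.py using only the documented flags."""
    cmd = [sys.executable, "infer.py", "--audio", str(audio),
           "--ckpt_dir", str(ckpt_dir)]
    if not routing:
        # Mirrors the documented "--routing false" form.
        cmd += ["--routing", "false"]
    return cmd


def run_infer(audio, **kwargs):
    """Run infer.py from the repository root and return its stdout as text."""
    completed = subprocess.run(build_infer_command(audio, **kwargs),
                               capture_output=True, text=True, check=True)
    return completed.stdout
```

`sys.executable` is used instead of a bare `python` so the call runs in the currently active environment.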
Python usage:
from MegaASR.model.megaASR import MegaASR

model = MegaASR(
    model_path="ckpt/Mega-ASR/Qwen3-ASR-1.7B",
    router_checkpoint="ckpt/Mega-ASR/audio_quality_router/best_acc_model.pt",
    routing_enabled=True,
)
result = model.infer("/path/to/audio.wav", return_route=True)
print(result)
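To transcribe several files, the same `infer` call can be wrapped in a loop. The helper below is a hypothetical sketch: it relies only on the `infer(path, return_route=True)` call shown in the usage example, stores whatever that call returns without assuming its structure, and records failures under an `"error"` key, which is a convention of this sketch rather than of Mega-ASR.

```python
def transcribe_many(model, audio_paths):
    """Call model.infer on each path and collect the result or error per file."""
    results = {}
    for path in audio_paths:
        try:
            results[path] = model.infer(path, return_route=True)
        except Exception as exc:
            # Keep going if one file fails; store the message for later review.
            results[path] = {"error": str(exc)}
    return results
```

With a loaded `MegaASR` instance, `transcribe_many(model, ["a.wav", "b.wav"])` would return one entry per input path.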
Decoding Defaults
The Mega-ASR wrapper uses Qwen3-ASR generation defaults unless explicitly overridden. In the provided wrapper, `max_new_tokens` is set to 256.
The default generation configuration is deterministic:
do_sample: false
num_beams: 1
repetition_penalty: 1.0
top_p: 1.0
top_k: 50
Because `do_sample` is false, decoding is greedy by default, and sampling controls such as temperature, top-p, and top-k do not affect normal inference.
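A toy next-token picker illustrates why the sampling controls are inert under greedy decoding. This is a simplified sketch, not the Qwen3-ASR or Mega-ASR decoding code; the function `next_token` and its arguments are hypothetical, and the sampling branch is included only for contrast.

```python
import math
import random


def next_token(logits, do_sample=False, top_k=50, rng=None):
    """Pick one token id from a list of logits (toy example)."""
    if not do_sample:
        # Greedy: take the highest-scoring token; top_k is never consulted.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Sampling branch: softmax over the top_k highest logits (top_k >= 1).
    rng = rng or random.Random()
    ranked = sorted(range(len(logits)), key=lambda i: logits[i],
                    reverse=True)[:top_k]
    peak = logits[ranked[0]]
    weights = [math.exp(logits[i] - peak) for i in ranked]
    return rng.choices(ranked, weights=weights, k=1)[0]
```

With `do_sample=False`, changing `top_k` leaves the chosen token unchanged, which matches the behavior described for the default configuration.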
Training Summary
Mega-ASR is trained for robust speech recognition in realistic acoustic environments. The training pipeline uses acoustic-to-semantic supervised fine-tuning, where the model is exposed to progressively harder speech examples and learns to recover both local acoustic details and sentence-level semantics under degradation.
The system is designed to improve recognition robustness on difficult audio while using a routing mechanism to reduce unnecessary changes on clean audio.
Evaluation
Mega-ASR is evaluated on standard ASR benchmarks, noisy robustness benchmarks, and in-the-wild compound acoustic scenarios. The recommended evaluation metrics are:
- WER for English and whitespace-tokenized languages
- CER for Chinese and character-based evaluation
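Both metrics are edit distances normalized by the reference length, computed over words for WER and over characters for CER. The sketch below shows the standard definitions; it is not the repository's evaluation script, it applies no text normalization (casing, punctuation), and dropping whitespace before computing CER is a simplifying assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate, ignoring whitespace in both strings."""
    ref_chars = [c for c in reference if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)
```

Because insertions count as errors, both rates can exceed 1.0 when the hypothesis is much longer than the reference.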
The Mega-ASR repository includes an evaluation script:
python src/MegaASR/eval/evaluate_wer.py \
    --ckpt_dir ckpt/Mega-ASR \
    --input_jsonl examples/test.jsonl \
    --output_jsonl outputs/pred_with_wer.jsonl
Input JSONL format:
{"audio": "examples/audio/noise.wav", "answer": "I usually take the quieter road home because the main street gets crowded after work."}
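An input file in this format can be checked before evaluation with a few lines of Python. The helper `parse_jsonl` is a hypothetical sketch: it assumes one JSON object per line with the `audio` and `answer` keys shown in the example, and it says nothing about any additional fields the evaluation script may accept.

```python
import json

# Keys taken from the example record; other fields are not checked.
REQUIRED_KEYS = ("audio", "answer")


def parse_jsonl(text):
    """Parse JSONL text into a list of records, checking the documented keys."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing keys {missing}")
        records.append(record)
    return records
```

For a file on disk, pass its contents, for example `parse_jsonl(open("examples/test.jsonl", encoding="utf-8").read())`.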
Citation
If you use Mega-ASR, please cite the project:
@misc{xie2026megaasrinthewild2speechrecognition,
  title={Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation},
  author={Zhifei Xie and Kaiyu Pang and Haobin Zhang and Deheng Ye and Xiaobin Hu and Shuicheng Yan and Chunyan Miao},
  year={2026},
  eprint={2605.19833},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2605.19833},
}
Acknowledgements
Mega-ASR builds on Qwen3-ASR. We thank the Qwen3-ASR team and the creators of public speech and audio datasets used in the project.