Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition

Hugging Face Daily Papers

Summary

Introduces Vividh-ASR, a complexity-tiered benchmark for Hindi and Malayalam ASR, identifies studio-bias in fine-tuning, and proposes R-MFT to improve spontaneous speech performance efficiently.

Fine-tuning multilingual ASR models like Whisper for low-resource languages often improves read speech but degrades spontaneous audio performance, a phenomenon we term studio-bias. To diagnose this mismatch, we introduce Vividh-ASR, a complexity-stratified benchmark for Hindi and Malayalam across four tiers: studio, broadcast, spontaneous, and synthetic noise. Through a controlled study of learning-rate timing and curriculum ordering, we find that early large parameter updates improve global WER by 12 absolute points, while a hard-to-easy curriculum adds gains for spontaneous speech. These findings motivate reverse multi-stage fine-tuning (R-MFT), a training recipe that enables a parameter-efficient 244M Whisper model to match or exceed conventionally fine-tuned 769M counterparts. Representational analysis via CKA and SVD reveals effective schedules concentrate adaptation in the decoder, preserving the pre-trained encoder's acoustic geometry. We release the benchmark and models.
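The abstract describes R-MFT only at a high level: a hard-to-easy curriculum with the largest parameter updates applied early. As a minimal illustrative sketch of such a reverse staged schedule, the tier names below mirror the benchmark's complexity tiers, while the stage lengths and learning rates are invented placeholders, not the paper's actual hyperparameters.

```python
# Sketch of a reverse (hard-to-easy) multi-stage fine-tuning schedule
# in the spirit of R-MFT. Hyperparameters are illustrative assumptions.

STAGES = [
    # (data tier, peak learning rate) -- hardest tier first, with the
    # largest updates applied early in training.
    ("spontaneous", 1e-4),
    ("broadcast",   5e-5),
    ("studio",      1e-5),
]

def schedule(step, steps_per_stage=1000):
    """Return the (tier, learning rate) pair for a global training step."""
    stage = min(step // steps_per_stage, len(STAGES) - 1)
    return STAGES[stage]

# Early steps see the hardest (spontaneous) tier at the largest LR;
# later stages move to cleaner data with progressively smaller updates.
```

Under this sketch, the early large updates land while the model sees spontaneous speech, which is consistent with the paper's finding that this ordering benefits spontaneous-speech WER.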
Original Article

Cached at: 05/14/26, 08:17 AM

Source: https://huggingface.co/papers/2605.13087



Get this paper in your agent:

hf papers read 2605.13087

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 0

No model linking this paper

Cite arxiv.org/abs/2605.13087 in a model README.md to link it from this page.

Datasets citing this paper: 0

No dataset linking this paper

Cite arxiv.org/abs/2605.13087 in a dataset README.md to link it from this page.

Spaces citing this paper: 0

No Space linking this paper

Cite arxiv.org/abs/2605.13087 in a Space README.md to link it from this page.

Collections including this paper: 1

Similar Articles

BlasBench: An Open Benchmark for Irish Speech Recognition

arXiv cs.CL

BlasBench introduces an open evaluation benchmark for Irish speech recognition with Irish-aware text normalization that preserves linguistic features like fadas, lenition, and eclipsis. The paper benchmarks 12 ASR systems across four architecture families, revealing significant generalization gaps and showing that existing multilingual systems struggle with Irish due to inadequate normalization.
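The normalization problem described above can be illustrated concretely. The sketch below (my own minimal example, not code from the BlasBench paper) shows how a naive diacritic-stripping normalizer destroys Irish fadas: for instance, "Seán" (a name) and "sean" ("old") collapse to the same token, so a correct transcription can be scored as an error, or a real error can go unpenalized.

```python
import unicodedata

def strip_diacritics(text):
    # Naive ASR text normalization: decompose to NFD and drop all
    # combining marks. This erases Irish fadas (long-vowel accents),
    # merging genuinely distinct words.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# "Seán" (name) and "sean" ("old") become indistinguishable:
print(strip_diacritics("Seán"))  # -> "Sean"
print(strip_diacritics("Tá sé go maith"))  # -> "Ta se go maith"
```

An Irish-aware normalizer would instead leave such marks intact and restrict normalization to casing, punctuation, and whitespace.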