Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition
Summary
Introduces Vividh-ASR, a complexity-tiered benchmark for Hindi and Malayalam ASR, identifies studio-bias in fine-tuning, and proposes R-MFT to improve spontaneous speech performance efficiently.
Source: https://huggingface.co/papers/2605.13087
Abstract
Research identifies studio-bias in multilingual ASR fine-tuning and proposes R-MFT method to improve spontaneous speech performance while maintaining efficiency.
Fine-tuning multilingual ASR models like Whisper for low-resource languages often improves read speech but degrades spontaneous audio performance, a phenomenon we term studio-bias. To diagnose this mismatch, we introduce Vividh-ASR, a complexity-stratified benchmark for Hindi and Malayalam across four tiers: studio, broadcast, spontaneous, and synthetic noise. Through a controlled study of learning-rate timing and curriculum ordering, we find that early large parameter updates improve global WER by 12 absolute points, while a hard-to-easy curriculum adds gains for spontaneous speech. These findings motivate reverse multi-stage fine-tuning (R-MFT), a training recipe that enables a parameter-efficient 244M Whisper model to match or exceed conventionally fine-tuned 769M counterparts. Representational analysis via CKA and SVD reveals effective schedules concentrate adaptation in the decoder, preserving the pre-trained encoder's acoustic geometry. We release the benchmark and models.
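The CKA analysis mentioned in the abstract can be illustrated with the standard linear-CKA formula for comparing two layers' activations on the same inputs. This is a generic sketch, not the paper's released code; the activation shapes and the before/after fine-tuning framing are assumptions for illustration.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between representation matrices
    X (n x d1) and Y (n x d2), where rows correspond to the same n inputs.
    Returns a similarity in [0, 1]; 1 means identical up to rotation/scale."""
    # Center each feature dimension.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    den = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return num / den

# Illustrative use: compare hypothetical encoder activations before and
# after fine-tuning on a batch of 128 frames with 64-dim features.
rng = np.random.default_rng(0)
before = rng.normal(size=(128, 64))
after = rng.normal(size=(128, 64))
print(linear_cka(before, before))  # 1.0: identical representations
print(linear_cka(before, after))   # lower: unrelated representations
```

A high encoder CKA between the pre-trained and fine-tuned model would be consistent with the paper's claim that effective schedules leave the encoder's acoustic geometry largely intact while adapting the decoder.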
Get this paper in your agent:
hf papers read 2605.13087
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
Voice of India: A Large-Scale Benchmark for Real-World Speech Recognition in India
Researchers introduce Voice of India, a 536-hour closed benchmark of unscripted telephonic conversations across 15 Indian languages and 139 regional clusters, exposing geographic and demographic ASR performance disparities.
BlasBench: An Open Benchmark for Irish Speech Recognition
BlasBench introduces an open evaluation benchmark for Irish speech recognition with Irish-aware text normalization that preserves linguistic features like fadas, lenition, and eclipsis. The paper benchmarks 12 ASR systems across four architecture families, revealing significant generalization gaps and showing that existing multilingual systems struggle with Irish due to inadequate normalization.
@SarvamAI: We're open-sourcing two frameworks for evaluating Indian ASR, and a full guide on evaluation across 22 languages. WER (…)
SarvamAI releases open-source evaluation frameworks and a guide tailored for 22 Indian languages, addressing limitations of standard WER/CER metrics.
A Calculus-Based Framework for Determining Vocabulary Size in End-to-End ASR
This paper presents a calculus-based framework that uses first and second derivative tests to estimate the optimal vocabulary size hyper-parameter for end-to-end ASR systems, improving performance on the Librispeech corpus.
MTR-DuplexBench: Towards a Comprehensive Evaluation of Multi-Round Conversations for Full-Duplex Speech Language Models
MTR-DuplexBench introduces a comprehensive benchmark for evaluating Full-Duplex Speech Language Models in multi-round conversations, addressing challenges like blurred turn boundaries and context inconsistency while assessing conversational features, dialogue quality, instruction following, and safety.