Convex Low-resource Accent-Robust Language Detection in Speech Recognition
Summary
This paper introduces CLD, a lightweight language detection head for ASR based on convex optimization. It reaches 97–98% accuracy with under 100 training samples and cuts compute costs by 13x, targeting accent and dialect robustness across 5 languages and 24 sub-dialects.
Cached at: 05/29/26, 11:04 PM
Source: https://huggingface.co/papers/2605.23235
🎵 Meet Convex Language Detection (CLD)!
Automatic Speech Recognition (ASR) often fails on accents and dialects, but collecting more data to retrain a larger model is slow and expensive. CLD addresses this not by grid-searching hyperparameters or collecting massive datasets, but through the elegant geometry of convex optimization.
🌐🎙️ Instead of relying on unpredictable large-scale neural networks that struggle with accent variance, CLD introduces a lightweight, pluggable detection head that yields mathematically certified margin stability.
We benchmarked CLD across 5 languages, 24 unique sub-dialects (including highly challenging regimes like Singaporean English and regional Mandarin), and foundation models like Whisper and MMS-1B. The results: even with under 100 training samples, CLD locks in 97–98% accuracy, reduces cross-lingual decoding failures, and cuts compute costs by 13x.
The shift is structural. Current multilingual ASR models are heavily skewed toward standard, high-resource speech datasets, leaving millions of global speakers facing cascading errors. By recasting language identification as a convex program solved via parallelized ADMM in JAX, we don’t just guess a decision boundary; we compute a verifiable radius of label invariance with guarantees. We see this as a highly scalable, theoretically backed plug-and-play module that aims to bring equity, speed, and reliability to global speech systems.
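To make the ADMM and invariance-radius ideas above concrete, here is a minimal JAX sketch. It is not the CLD implementation and does not use the jaxcld API. It assumes a binary language decision on fixed, synthetic embeddings, swaps in L1-regularized least squares (the textbook ADMM example) as a stand-in for the paper's convex program, and omits a bias term. The names soft_threshold, admm_lasso, and certified_radius are hypothetical. For a linear rule sign(x·w), the radius |x·w| / ||w||_2 is the distance from x to the decision boundary, so no smaller L2 perturbation of the embedding can change the label.

```python
import jax
import jax.numpy as jnp
from jax.scipy.linalg import cho_factor, cho_solve


def soft_threshold(v, kappa):
    # Proximal operator of kappa * ||.||_1.
    return jnp.sign(v) * jnp.maximum(jnp.abs(v) - kappa, 0.0)


def admm_lasso(X, y, lam=1.0, rho=1.0, iters=200):
    """Minimize 0.5*||Xw - y||^2 + lam*||z||_1 subject to w = z (scaled-form ADMM)."""
    d = X.shape[1]
    # Every w-update solves the same linear system, so factor it once.
    chol = cho_factor(X.T @ X + rho * jnp.eye(d))
    Xty = X.T @ y

    def step(_, state):
        w, z, u = state
        w = cho_solve(chol, Xty + rho * (z - u))   # quadratic subproblem
        z = soft_threshold(w + u, lam / rho)       # L1 proximal step
        u = u + w - z                              # scaled dual update
        return w, z, u

    zeros = jnp.zeros(d)
    _, z, _ = jax.lax.fori_loop(0, iters, step, (zeros, zeros, zeros))
    return z


def certified_radius(w, X):
    # Distance from each row of X to the hyperplane {x : x @ w = 0}.
    return jnp.abs(X @ w) / jnp.linalg.norm(w)


# Synthetic stand-in for frozen ASR embeddings (assumption): two Gaussian
# clusters, 40 samples per "language", 16 dimensions.
key_a, key_b = jax.random.split(jax.random.PRNGKey(0))
d, n_per_class = 16, 40
mu = jnp.zeros(d).at[:2].set(2.0)
X = jnp.concatenate([
    mu + jax.random.normal(key_a, (n_per_class, d)),
    -mu + jax.random.normal(key_b, (n_per_class, d)),
])
y = jnp.concatenate([jnp.ones(n_per_class), -jnp.ones(n_per_class)])

w = admm_lasso(X, y)
pred = jnp.sign(X @ w)
radius = certified_radius(w, X)
print("train accuracy:", float(jnp.mean(pred == y)))
print("smallest certified radius:", float(radius.min()))
```

The radius here is exact only because the stand-in head is linear. Whatever guarantee CLD provides for its own model is stated in the full paper, not derived from this sketch.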
🛠️ Open-Source Code: https://github.com/pilancilab/CLD
📦 JAX Package: pip install jaxcld (https://pypi.org/project/jaxcld/)
📄 Full Paper: https://arxiv.org/abs/2605.23235
Similar Articles
Direct Preference Optimization for English-Mandarin Code-Switching Speech Recognition in Audio LLMs
This paper applies Direct Preference Optimization (DPO) to align Audio LLMs for transcribing English-Mandarin code-switching speech, achieving up to 89.6% MER reduction in-distribution and 20% out-of-distribution. It identifies three failure modes—language omission, translation instead of transcription, and hallucination—and shows that preference-based alignment effectively elicits correct code-switching behavior from multilingual Audio LLMs.
Linear Semantic Segmentation for Low-Resource Spoken Dialects
This paper introduces a benchmark for semantic segmentation in low-resource dialectal Arabic and proposes a model that improves performance on conversational speech compared to standard baselines.
CRoCoDiL: Continuous and Robust Conditioned Diffusion for Language
CRoCoDiL proposes a continuous and robust conditioned diffusion approach for language that shifts masked diffusion models into a continuous semantic space, achieving superior generation quality and 10x faster sampling speeds compared to discrete methods like LLaDA.
LaSR: Context-Aware Speech Recognition via Latent Reasoning
LaSR proposes a latent reasoning training paradigm for context-aware speech recognition, aligning chain-of-thought supervision around acoustic features to improve terminology recognition without added latency, outperforming standard fine-tuning on Fun-Audio-Chat.
Lightweight Stylistic Consistency Profiling: Robust Detection of LLM-Generated Textual Content for Multimedia Moderation
Proposes LiSCP, a lightweight stylistic consistency profiling method for robust detection of LLM-generated textual content, focusing on feature stability under adversarial manipulation. Achieves superior performance on in-domain and cross-domain detection with notable robustness.