Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation
Summary
This paper introduces Agentic ASR, an interactive speech recognition framework that uses semantic correction and reasoning-based editing to reduce semantic errors through multi-turn refinement. It also proposes a new sentence-level semantic error rate metric and an interactive simulation system for benchmarking.
Source: https://huggingface.co/papers/2605.29430
Abstract
Interactive ASR framework integrates semantic correction and reasoning-based editing to reduce semantic errors through multi-turn refinement, validated by a new sentence-level semantic error rate metric and interactive simulation system.
Automatic speech recognition (ASR) is a core component of human-computer interaction and an increasingly important front-end for LLM-based assistants and agents. However, most current ASR systems still follow a single-pass paradigm, which is poorly aligned with human communication, where misunderstandings are resolved through iterative clarification and refinement. This mismatch makes it difficult to correct meaning-critical errors once they occur. Meanwhile, token-level metrics such as WER or CER cannot adequately reflect such a problem. To address these limitations, we formulate Interactive ASR as a multi-turn refinement task and propose Agentic ASR, a closed-loop framework that combines a single-pass ASR front-end with semantic correction, intent routing, and reasoning-based editing. We further introduce the Sentence-level Semantic Error Rate (S^2ER), an LLM-based semantic evaluation metric, together with an Interactive Simulation System for scalable and reproducible benchmarking. Experiments on multilingual, named-entity-intensive, and code-switching benchmarks show that iterative interaction consistently reduces semantic errors, with much larger gains in S^2ER than in conventional token-level metrics. Human-AI alignment and ablation studies further validate the reliability of the semantic judge and the robustness of the proposed framework. The code is available at https://interactiveasr.github.io/ and the live demo is available at https://i-asr.sjtuxlance.com/
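To make the contrast between token-level and sentence-level scoring concrete, here is a minimal Python sketch. It is not the paper's implementation: the abstract does not give the exact definition of S^2ER, so the sketch assumes it is the fraction of sentences that a judge marks as not meaning-equivalent to the reference. The names `word_error_rate`, `sentence_semantic_error_rate`, and `toy_judge` are hypothetical, and `toy_judge` is a rule-based stand-in for the LLM-based judge the abstract describes.

```python
from typing import Callable, Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Token-level WER: word edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (cost 0 on a match)
            )
        prev = curr
    return prev[-1] / max(len(ref), 1)


def sentence_semantic_error_rate(
    references: Sequence[str],
    hypotheses: Sequence[str],
    judge: Callable[[str, str], bool],
) -> float:
    """Assumed definition: share of sentences the judge rejects."""
    if not references or len(references) != len(hypotheses):
        raise ValueError("need equally sized, non-empty inputs")
    errors = sum(0 if judge(r, h) else 1 for r, h in zip(references, hypotheses))
    return errors / len(references)


def toy_judge(reference: str, hypothesis: str) -> bool:
    """Hypothetical stand-in for an LLM judge: compare content words only."""
    stop_words = {"the", "a", "an", "to", "please"}

    def content(sentence: str) -> list:
        return [w for w in sentence.lower().split() if w not in stop_words]

    return content(reference) == content(hypothesis)


refs = ["send the report to anna", "do not delete the file"]
hyps = ["send a report to anna", "do delete the file"]

for r, h in zip(refs, hyps):
    print(f"WER={word_error_rate(r, h):.2f}  equivalent={toy_judge(r, h)}")
print(f"sentence-level error rate={sentence_semantic_error_rate(refs, hyps, toy_judge):.2f}")
```

Both hypotheses differ from their references by one word, so each has the same WER, yet only the second one (a dropped "not") changes the meaning. Under the assumed definition, the sentence-level rate counts just that sentence as an error. This is the kind of gap between token-level and semantic scoring that the abstract points to.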
Similar Articles
@MaxForAI: If you are working on voice agents, you should try this project. A team from NTU, NUS, and Shanghai AI Lab released: Mega-ASR. This fully open-source ASR is built on Qwen3-ASR, aiming to break the long-standing bottleneck of ASR performance in noisy, reverberant, or other impaired real-world environments...
NTU, NUS, and Shanghai AI Lab jointly released Mega-ASR, a fully open-source ASR model built on Qwen3-ASR. Using the Voices-in-the-Wild-2M dataset and progressive acoustic-to-semantic optimization, it achieves up to 30% relative Word Error Rate (WER) reduction in real-world noisy environments. With only 1.7B parameters, it enables efficient inference on consumer-grade hardware.
Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents
This paper introduces Afrispeech Semantics, a benchmark for evaluating audio language models on semantic reasoning tasks including entailment, consistency, plausibility, accent drift, and accent restraint across diverse domains and accents.
Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation
Mega-ASR proposes scaling up real-world acoustic simulation to improve automatic speech recognition in challenging, wild conditions, aiming to narrow the performance gap between lab and real-world settings.
HawkesLLM: Semantic Uncertainty Propagation in Agentic Text Simulation
This paper introduces HawkesLLM, a framework that models semantic uncertainty propagation in multi-step agentic text simulations by combining a multivariate Hawkes process for temporal influence and memory selection with a language model for text generation. Evaluation on a GDELT news-cascade case study shows improved late-stage semantic alignment under compact prompt-memory constraints.
@sheriyuo: This paper proposes ASAG, Attention-State Adaptive Generation, a training-free, plug-and-play stopping framework for re…
ASAG uses attention entropy to detect when reasoning is unproductive, stopping early to improve accuracy and reduce token generation. Experiments on Qwen3-8B show a 4.4% accuracy gain and over 40% fewer generated tokens.