Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

Hugging Face Daily Papers 05/28/26, 12:00 AM Papers

Summary

This paper introduces Agentic ASR, an interactive speech recognition framework that uses semantic correction and reasoning-based editing to reduce semantic errors through multi-turn refinement. It also proposes a new sentence-level semantic error rate metric and an interactive simulation system for benchmarking.

Automatic speech recognition (ASR) is a core component of human--computer interaction and an increasingly important front-end for LLM-based assistants and agents. However, most current ASR systems still follow a single-pass paradigm, which is poorly aligned with human communication, where misunderstandings are resolved through iterative clarification and refinement. This mismatch makes it difficult to correct meaning-critical errors once they occur. Meanwhile, token-level metrics such as WER or CER cannot adequately reflect such a problem. To address these limitations, we formulate Interactive ASR as a multi-turn refinement task and propose Agentic ASR, a closed-loop framework that combines a single-pass ASR front-end with semantic correction, intent routing, and reasoning-based editing. We further introduce the Sentence-level Semantic Error Rate (S^2ER), an LLM-based semantic evaluation metric, together with an Interactive Simulation System for scalable and reproducible benchmarking. Experiments on multilingual, named-entity-intensive, and code-switching benchmarks show that iterative interaction consistently reduces semantic errors, with much larger gains in S^2ER than in conventional token-level metrics. Human--AI alignment and ablation studies further validate the reliability of the semantic judge and the robustness of the proposed framework. The code is available at: https://interactiveasr.github.io/ and the live demo is available at https://i-asr.sjtuxlance.com/

Original Article

View Cached Full Text

Cached at: 06/08/26, 03:16 PM

Paper page - Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

Source: https://huggingface.co/papers/2605.29430 Authors:

Abstract

Interactive ASR framework integrates semantic correction and reasoning-based editing to reduce semantic errors through multi-turn refinement, validated by a new sentence-level semantic error rate metric and interactive simulation system.

Automatic speech recognition(ASR) is a core component of human--computer interaction and an increasingly important front-end for LLM-based assistants and agents. However, most current ASR systems still follow a single-pass paradigm, which is poorly aligned with human communication, where misunderstandings are resolved through iterative clarification and refinement. This mismatch makes it difficult to correct meaning-critical errors once they occur. Meanwhile, token-level metrics such as WER or CER cannot adequately reflect such a problem. To address these limitations, we formulate Interactive ASR as amulti-turn refinementtask and propose Agentic ASR, a closed-loop framework that combines a single-pass ASR front-end withsemantic correction,intent routing, andreasoning-based editing. We further introduce theSentence-level Semantic Error Rate(S^2ER), an LLM-based semantic evaluation metric, together with anInteractive Simulation Systemfor scalable and reproducible benchmarking. Experiments onmultilingual,named-entity-intensive, andcode-switchingbenchmarks show that iterative interaction consistently reduces semantic errors, with much larger gains in S^2ER than in conventional token-level metrics. Human--AI alignment and ablation studies further validate the reliability of the semantic judge and the robustness of the proposed framework. The code is available at: https://interactiveasr.github.io/ and the live demo is available at https://i-asr.sjtuxlance.com/

View arXiv page View PDF Project page GitHub2 Add to collection

Get this paper in your agent:

hf papers read 2605\.29430

Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2605.29430 in a model README.md to link it from this page.

Datasets citing this paper0

No dataset linking this paper

Cite arxiv.org/abs/2605.29430 in a dataset README.md to link it from this page.

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2605.29430 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

Paper page - Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

Abstract

Models citing this paper0

Datasets citing this paper0

Spaces citing this paper0

Collections including this paper0

Similar Articles

Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents

Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation

HawkesLLM: Semantic Uncertainty Propagation in Agentic Text Simulation

@sheriyuo: This paper proposes ASAG, Attention-State Adaptive Generation, a training-free, plug-and-play stopping framework for re…

Submit Feedback

Similar Articles

@MaxForAI: If you are working on voice agents, you should try this project. A team from NTU, NUS, and Shanghai AI Lab released: Mega-ASR. This fully open-source ASR is built on Qwen3-ASR, aiming to break the long-standing bottleneck of ASR performance in noisy, reverberant, or other impaired real-world environments...

Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents

Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation

HawkesLLM: Semantic Uncertainty Propagation in Agentic Text Simulation

@sheriyuo: This paper proposes ASAG, Attention-State Adaptive Generation, a training-free, plug-and-play stopping framework for re…