Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

Hugging Face Daily Papers Papers

Summary

This paper introduces Agentic ASR, an interactive speech recognition framework that uses semantic correction and reasoning-based editing to reduce semantic errors through multi-turn refinement. It also proposes a new sentence-level semantic error rate metric and an interactive simulation system for benchmarking.

Automatic speech recognition (ASR) is a core component of human--computer interaction and an increasingly important front-end for LLM-based assistants and agents. However, most current ASR systems still follow a single-pass paradigm, which is poorly aligned with human communication, where misunderstandings are resolved through iterative clarification and refinement. This mismatch makes it difficult to correct meaning-critical errors once they occur. Meanwhile, token-level metrics such as WER or CER cannot adequately reflect such a problem. To address these limitations, we formulate Interactive ASR as a multi-turn refinement task and propose Agentic ASR, a closed-loop framework that combines a single-pass ASR front-end with semantic correction, intent routing, and reasoning-based editing. We further introduce the Sentence-level Semantic Error Rate (S^2ER), an LLM-based semantic evaluation metric, together with an Interactive Simulation System for scalable and reproducible benchmarking. Experiments on multilingual, named-entity-intensive, and code-switching benchmarks show that iterative interaction consistently reduces semantic errors, with much larger gains in S^2ER than in conventional token-level metrics. Human--AI alignment and ablation studies further validate the reliability of the semantic judge and the robustness of the proposed framework. The code is available at: https://interactiveasr.github.io/ and the live demo is available at https://i-asr.sjtuxlance.com/
Original Article
View Cached Full Text

Cached at: 06/08/26, 03:16 PM

Paper page - Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

Source: https://huggingface.co/papers/2605.29430 Authors:

,

,

,

,

,

,

,

,

,

Abstract

Interactive ASR framework integrates semantic correction and reasoning-based editing to reduce semantic errors through multi-turn refinement, validated by a new sentence-level semantic error rate metric and interactive simulation system.

Automatic speech recognition(ASR) is a core component of human--computer interaction and an increasingly important front-end for LLM-based assistants and agents. However, most current ASR systems still follow a single-pass paradigm, which is poorly aligned with human communication, where misunderstandings are resolved through iterative clarification and refinement. This mismatch makes it difficult to correct meaning-critical errors once they occur. Meanwhile, token-level metrics such as WER or CER cannot adequately reflect such a problem. To address these limitations, we formulate Interactive ASR as amulti-turn refinementtask and propose Agentic ASR, a closed-loop framework that combines a single-pass ASR front-end withsemantic correction,intent routing, andreasoning-based editing. We further introduce theSentence-level Semantic Error Rate(S^2ER), an LLM-based semantic evaluation metric, together with anInteractive Simulation Systemfor scalable and reproducible benchmarking. Experiments onmultilingual,named-entity-intensive, andcode-switchingbenchmarks show that iterative interaction consistently reduces semantic errors, with much larger gains in S^2ER than in conventional token-level metrics. Human--AI alignment and ablation studies further validate the reliability of the semantic judge and the robustness of the proposed framework. The code is available at: https://interactiveasr.github.io/ and the live demo is available at https://i-asr.sjtuxlance.com/

View arXiv pageView PDFProject pageGitHub2Add to collection

Get this paper in your agent:

hf papers read 2605\.29430

Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2605.29430 in a model README.md to link it from this page.

Datasets citing this paper0

No dataset linking this paper

Cite arxiv.org/abs/2605.29430 in a dataset README.md to link it from this page.

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2605.29430 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

Similar Articles

@MaxForAI: If you are working on voice agents, you should try this project. A team from NTU, NUS, and Shanghai AI Lab released: Mega-ASR. This fully open-source ASR is built on Qwen3-ASR, aiming to break the long-standing bottleneck of ASR performance in noisy, reverberant, or other impaired real-world environments...

X AI KOLs Timeline

NTU, NUS, and Shanghai AI Lab jointly released Mega-ASR, a fully open-source ASR model built on Qwen3-ASR. Using the Voices-in-the-Wild-2M dataset and progressive acoustic-to-semantic optimization, it achieves up to 30% relative Word Error Rate (WER) reduction in real-world noisy environments. With only 1.7B parameters, it enables efficient inference on consumer-grade hardware.

HawkesLLM: Semantic Uncertainty Propagation in Agentic Text Simulation

arXiv cs.CL

This paper introduces HawkesLLM, a framework that models semantic uncertainty propagation in multi-step agentic text simulations by combining a multivariate Hawkes process for temporal influence and memory selection with a language model for text generation. Evaluation on a GDELT news-cascade case study shows improved late-stage semantic alignment under compact prompt-memory constraints.