Graph-Based Phonetic Error Correction of Noisy ASR

arXiv cs.CL 06/25/26, 04:00 AM Papers
asr error-correction phonetic-modeling graph-neural-network llm inference-time
Summary
Proposes G-SPIN, a lightweight framework that combines phonetic graph modeling with contextual language understanding for correcting ASR errors, using a GNN to generate phonetically plausible candidate tokens, an MLM for local scoring, and an LLM for final re-ranking, all operating at inference time.
arXiv:2606.24889v1 Announce Type: new Abstract: Automatic speech recognition (ASR) systems, despite low overall word error rates, produce residual lexical errors that disproportionately affect semantically critical tokens such as named entities, negations, and sentiment-bearing words. These errors are often structured, arising from phonetic similarity rather than random noise, making naive token-level correction insufficient. We propose a structured ASR correction framework, that we call G-SPIN, that combines phonetic graph modeling with contextual language understanding. A graph neural network (GNN) first constructs acoustically plausible candidate neighborhoods for flagged tokens, explicitly restricting the correction search space to phonetic alternatives. A masked language model (MLM) then provides local contextual scoring, and an instruction-tuned large language model (LLM) performs final context-aware re-ranking over this compact candidate set. By decoupling structured phonetic reasoning from contextual semantic selection, our method avoids unconstrained generation while improving correction accuracy. The framework is lightweight, modular, and operates entirely at inference time.
# Graph-Based Phonetic Error Correction of Noisy ASR
Source: [https://arxiv.org/html/2606.24889](https://arxiv.org/html/2606.24889)

###### Abstract

Automatic speech recognition \(ASR\) systems, despite low overall word error rates, produce residual lexical errors that disproportionately affect semantically critical tokens such as named entities, negations, and sentiment\-bearing words\. These errors are often structured, arising from phonetic similarity rather than random noise, making naive token\-level correction insufficient\. We propose a structured ASR correction framework, that we call G\-SPIN, that combines phonetic graph modeling with contextual language understanding\. A graph neural network \(GNN\) first constructs acoustically plausible candidate neighborhoods for flagged tokens, explicitly restricting the correction search space to phonetic alternatives\. A masked language model \(MLM\) then provides local contextual scoring, and an instruction\-tuned large language model \(LLM\) performs final context\-aware re\-ranking over this compact candidate set\. By decoupling structured phonetic reasoning from contextual semantic selection, our method avoids unconstrained generation while improving correction accuracy\. The framework is lightweight, modular, and operates entirely at inference time\.

Pratik Rakesh Singh, Mohammadi Zaki, Aneesh Mukkamala and Pankaj Wasnik, Sony Research India, {pratik.singh, mohammadi.zaki, aneesh.mukkamala, pankaj.wasnik}@sony.com

## 1 Introduction

Automatic speech recognition (ASR) systems serve as a foundational interface between spoken language and text-based processing. Despite substantial progress and low aggregate word error rates, residual transcription errors remain inevitable in spontaneous, conversational, and acoustically challenging settings (szymański2020werwerthink). Importantly, these errors are not uniformly distributed across tokens, nor are they purely random Leng et al. ([2023](https://arxiv.org/html/2606.24889#bib.bib19)). Instead, ASR errors exhibit structured patterns, often governed by phonetic similarity, acoustic ambiguity, and contextual dependencies.

A central challenge in ASR correction lies in the nature of these perturbations. Many errors arise from systematic phoneme-level confusions: homophones, near-homophones, boundary shifts, or insertions and deletions driven by acoustic similarity Wang et al. ([2024](https://arxiv.org/html/2606.24889#bib.bib13)); Nemoto et al. ([2008](https://arxiv.org/html/2606.24889#bib.bib14)). Consequently, effective correction cannot rely solely on generic contextual rewriting. Naïve token-level edits risk either failing to repair acoustically plausible confusions or introducing spurious replacements that are contextually fluent but phonetically implausible Bouselmi et al. ([2006](https://arxiv.org/html/2606.24889#bib.bib28)); Chen et al. ([2021](https://arxiv.org/html/2606.24889#bib.bib12)).

Recent approaches have employed LLMs in prompt-driven settings to directly rewrite noisy transcripts. While flexible, such methods treat correction as unconstrained generation. As a result, they may hallucinate novel content, over-correct benign variations, or exhibit instability across similar contexts Lyu et al. ([2025](https://arxiv.org/html/2606.24889#bib.bib16)); Ma et al. ([2025](https://arxiv.org/html/2606.24889#bib.bib17)). These behaviors reflect reliance on implicit in-context reasoning rather than explicit modeling of ASR noise structure. Recent work has also explored multi-stage ASR error correction using LLMs without explicit structural modeling. For example, the Reliable LLM Correction Framework (RLLM-CF) proposed by Fang et al. ([2025a](https://arxiv.org/html/2606.24889#bib.bib20)) decomposes ASR correction into three LLM-driven stages (error pre-detection, iterative chain-of-thought subtask correction, and reasoning verification) to reduce hallucinations and improve correction accuracy. While this approach demonstrates promising reductions in character and word error rates, it remains fully dependent on prompt-based reasoning and iterative LLM calls for identification and rewriting, without explicit modeling of phonetic ambiguity or structured candidate generation. By contrast, our framework constrains the correction search space using a phonetic graph and integrates contextual scoring models, thereby reducing reliance on unconstrained LLM generation and improving correction consistency across contexts.

An alternative direction is to incorporate structured inductive bias into the correction process\. Since many ASR errors arise from phonetic ambiguity, candidate corrections should be restricted to acoustically plausible alternatives before contextual reasoning is applied\. This suggests a two\-level correction paradigm: first constrain the lexical search space using phonetic structure, then apply contextual models to select the most coherent alternative\.

In this work, we propose a structured ASR correction framework that explicitly separates phonetic reasoning from contextual selection\. We construct a phoneme\-level graph and train a graph neural network \(GNN\) to model relationships among acoustically similar lexical candidates\. For each flagged token, the GNN defines a compact, structured candidate neighborhood\. A masked language model \(MLM\) provides local contextual scoring, and an instruction\-tuned LLM performs final context\-aware re\-ranking strictly within this constrained set\. By reducing the correction search space prior to semantic reasoning, our approach avoids unconstrained rewriting while remaining flexible and context\-sensitive\.

Our framework is lightweight and modular, and primarily operates at inference time\. While G\-SPIN does not require any retraining or fine\-tuning during deployment, it includes a one\-time offline pretraining step for the GNN component, which captures phonetic relationships within a fixed vocabulary for a given language\. This pretraining is data\-independent and need not be repeated across datasets or domains\. Once trained, the GNN is reused without modification, and the entire correction pipeline runs in an inference\-only manner\. In contrast to conventional approaches that require repeated fine\-tuning for domain adaptation, our method avoids any task\- or dataset\-specific retraining\.

Empirical results demonstrate that structured phonetic constraint combined with contextual re\-ranking significantly improves correction accuracy over prompt\-based rewriting and purely contextual baselines\. These findings highlight the importance of integrating phonetic structure into modern ASR correction systems\. Our contributions can be summarized as follows:

- We propose a principled ASR correction framework that combines phonetic structure and contextual understanding. A phoneme-level GNN constructs acoustically plausible candidate neighborhoods, and MLM scoring with instruction-tuned LLM re-ranking selects corrections strictly within this set, avoiding unconstrained rewriting and hallucinations.
- Our method operates entirely at inference time without retraining ASR systems or LLMs, making it easily deployable and backbone-agnostic.
- Extensive experiments on English, Telugu, Spanish, and Hindi demonstrate consistent improvements over strong LLM-based baselines for ASR correction across diverse linguistic settings.

## 2 Problem Setup and Motivation

##### Problem Setup\.

Let $x \in \mathcal{X}$ denote the intended (clean) sentence and $\tilde{x} = x + \delta$ its noisy ASR realization, where $\delta$ represents lexical perturbations arising from phonetic confusions, along with insertions, deletions, and boundary shifts. In this work, we focus primarily on correcting *phonetic confusions*, which constitute a dominant class of ASR errors, by leveraging phonetic similarity structure. Concretely, we assume that erroneous tokens in $\tilde{x}$ can be mapped to a set of acoustically plausible alternatives defined over a phonetic neighborhood.

The ASR correction task seeks to recover a corrected sentence $\hat{x}$ from $\tilde{x}$ such that $\hat{x}$ is both (i) phonetically plausible under the acoustic evidence that produced $\tilde{x}$, operationalized via constraints over a phonetic similarity graph, and (ii) contextually coherent at the sentence level. Formally, we aim to design a correction operator $C: \mathcal{X} \rightarrow \mathcal{X}$ that maps noisy transcripts to refined textual realizations, $\hat{x} = C(\tilde{x})$, with the objective that $\hat{x}$ closely approximates the latent clean sentence $x$.

##### Motivation\.

Automatic speech recognition \(ASR\) systems have achieved impressive reductions in aggregate word error rates\. However, residual transcription errors remain inevitable in spontaneous, conversational, and acoustically challenging environments\. Importantly, these errors are not uniformly distributed across tokens\. Perturbations affecting semantically salient words—such as named entities, negations, or sentiment\-bearing terms—can substantially alter the meaning of a sentence, even when overall error rates appear low\.

Let $x$ denote the intended source sentence and $\tilde{x} = x + \delta$ its ASR realization. The perturbation $\delta$ is rarely arbitrary noise. Instead, ASR errors typically arise from structured phonetic confusions, including substitutions among acoustically similar words, insertions, deletions, or boundary shifts. As a result, many erroneous tokens are locally plausible yet globally inconsistent with sentence-level context.

Naive correction strategies that rely purely on contextual rewriting treat ASR repair as unconstrained generation\. While such approaches may improve fluency, they risk hallucinating new content, over\-correcting benign variations, or introducing inconsistencies across similar contexts\. Conversely, purely string\-based or edit\-distance heuristics fail to capture long\-range semantic dependencies and may under\-correct impactful errors\.

Effective ASR correction therefore requires integrating two complementary inductive biases: \(i\) phonetic structure, to restrict candidate replacements to acoustically plausible alternatives, and \(ii\) contextual language understanding, to select the most coherent option within that constrained set\. The challenge is to design a lightweight inference\-time mechanism that combines these signals without relying on unconstrained rewriting or expensive retraining\.

Importantly, the need to restrict decoding to a structured phonetic space is not only empirically motivated but also theoretically grounded. As shown in Appendix [A.1](https://arxiv.org/html/2606.24889#A1.SS1) (Lemma [1](https://arxiv.org/html/2606.24889#Thmlemma1)), phonetic space restriction induces a contraction in the input perturbation norm, thereby improving the local stability of the frozen LLM under ASR noise. This formal result provides a principled justification for structured correction prior to downstream generation.

These considerations motivate a structured correction framework that explicitly models phonetic neighborhoods while leveraging contextual scoring for final selection, thereby balancing acoustic plausibility with semantic coherence\.

## 3 Methodology

This section describes our methodology, Graph-based Structured Phonetic INference (G-SPIN), which treats ASR input correction as a three-step process: 1) faulty input identification, 2) correct input retrieval, and 3) input correction. Together, these steps identify noisy ASR input and replace it with the alternative that best fits the context.

##### Step 1: Faulty input identification.

Automatic Speech Recognition \(ASR\) systems often produce noisy outputs, where acoustically confusable or out\-of\-vocabulary words are substituted with incorrect lexical forms\. To identify such erroneous words without access to reference transcripts, we propose a Contextual Anomaly Detection \(CAD\) method based on a masked language model \(MLM\)\.

Given a sentence $S = (w_1, \dots, w_n)$ embedded within its local discourse context $C$, we approximate the pseudo-log-likelihood (PLL) of each word by masking its constituent tokens and measuring the model's confidence in reconstructing them. Let a word $w_i$ consist of tokens $\{t_{i1}, \dots, t_{im}\}$. For each token $t_{ij}$, we compute $P(t_{ij} \mid S_{\setminus t_{ij}}, C)$, where $S_{\setminus t_{ij}}$ denotes the input sequence with token $t_{ij}$ replaced by the [MASK] symbol. The word-level log-probability is then defined as the average token log-probability:

$$\log P(w_i \mid C) = \frac{1}{m} \sum_{j=1}^{m} \log P(t_{ij} \mid S_{\setminus t_{ij}}, C),$$

and the corresponding word probability is $P(w_i \mid C) = \exp\left(\log P(w_i \mid C)\right)$.

Additionally, we compute a minimum token confidence score:

$$P_{\min}(w_i) = \min_{j} P(t_{ij} \mid S_{\setminus t_{ij}}, C),$$

which captures the weakest reconstruction confidence among the word's sub-word tokens. To detect anomalous words, we first keep as candidates only the words whose probability falls below a predefined threshold $\tau$:

$$P(w_i \mid C) < \tau. \tag{1}$$

For these candidates, we compute a combined anomaly score:

$$\mathcal{A}(w_i) = \log P(w_i \mid C) + \alpha \log\left(P_{\min}(w_i)\right),$$

where $\alpha$ controls the contribution of the minimum token confidence. Words are ranked by $\mathcal{A}(w_i)$ in ascending order, and the top-$f$ ($f = 5$ in our case) lowest-scoring words are flagged as potential ASR errors. In our experiments we use $\alpha = 0.5$ and $\tau = 10^{-2}$, which were determined experimentally.

This formulation enables context\-aware detection of improbable lexical realizations without requiring supervision, making it suitable for large\-scale post\-processing of ASR outputs in realistic conversational settings\.
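
The detection rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-token probabilities are hard-coded toy values standing in for the masked-token probabilities an MLM would return, and the names `word_scores` and `flag_words` are hypothetical, not taken from the paper.

```python
import math

def word_scores(token_probs, alpha=0.5):
    """Return (word probability, anomaly score) from a word's masked-token probabilities."""
    # Word-level log-probability: average of the token log-probabilities.
    log_p = sum(math.log(p) for p in token_probs) / len(token_probs)
    p_word = math.exp(log_p)
    # Anomaly score: log P(w | C) + alpha * log P_min(w).
    anomaly = log_p + alpha * math.log(min(token_probs))
    return p_word, anomaly

def flag_words(words, probs_per_word, tau=1e-2, alpha=0.5, top_f=5):
    """Keep words with probability below tau, then flag the top_f lowest anomaly scores."""
    shortlisted = []
    for index, (word, token_probs) in enumerate(zip(words, probs_per_word)):
        p_word, anomaly = word_scores(token_probs, alpha)
        if p_word < tau:
            shortlisted.append((anomaly, index, word))
    shortlisted.sort()  # ascending: most anomalous first
    return [(index, word) for _, index, word in shortlisted[:top_f]]

# Toy values (assumed): one probability per sub-word token of each word.
words = ["the", "whether", "is", "nice"]
probs = [[0.9], [0.0004, 0.05], [0.8], [0.005]]
flagged = flag_words(words, probs)
print(flagged)
```

In this toy input only "whether" and "nice" fall below the threshold, and "whether" ranks first because its weakest sub-word token is very unlikely in context.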

##### Step 2: Correct input retrieval.

Previous studies observe that automatic speech recognition (ASR) frequently substitutes phonetically similar yet semantically unrelated tokens Ruan et al. ([2020](https://arxiv.org/html/2606.24889#bib.bib29)); Huang and Chen ([2020](https://arxiv.org/html/2606.24889#bib.bib10)). To exploit this structure, we convert the language model vocabulary into a phoneme-level graph space and retrieve candidate correct inputs via graph-based similarity. Formally, let $\mathcal{V}$ denote the (LLM) vocabulary and let $\phi: \mathcal{V} \rightarrow \mathcal{P}^{+}$ be a phonemizer that maps each vocabulary item to a sequence of phonemes. We create a phoneme-node set

$$\mathcal{N} = \big\{ p \mid p \in \phi(v),\ v \in \mathcal{V} \big\},$$

and obtain fixed-length phoneme embeddings via an embedding function $e: \mathcal{P}^{+} \rightarrow \mathbb{R}^{d}$ (e.g., averaging subtoken/phoneme encoder outputs). We then build an undirected graph $G = (\mathcal{N}, \mathcal{E})$ where

$$(p_i, p_j) \in \mathcal{E} \quad \text{iff} \quad \mathrm{cosine}\big(e(p_i), e(p_j)\big) \geq \eta,$$

for a similarity threshold $\eta$ (0.9 in our case). Node attributes may include the phoneme embedding $e(p)$, enabling downstream disambiguation from phonetic signals.
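
As a rough illustration of the edge rule, the sketch below builds an undirected graph from a handful of toy embeddings by thresholding cosine similarity at $\eta$. The two-dimensional vectors, the node names, and the function `build_graph` are assumptions made for the example; the paper does not specify the embedding model or the construction code.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def build_graph(embeddings, eta=0.9):
    """Connect two nodes iff the cosine similarity of their embeddings is >= eta."""
    adjacency = {node: set() for node in embeddings}
    for u, v in combinations(embeddings, 2):
        if cosine(embeddings[u], embeddings[v]) >= eta:
            adjacency[u].add(v)  # undirected: store the edge in both directions
            adjacency[v].add(u)
    return adjacency

# Toy embeddings (assumed): "p" and "b" are close, "s" is far from both.
toy_embeddings = {"p": [1.0, 0.1], "b": [0.9, 0.2], "s": [0.0, 1.0]}
graph = build_graph(toy_embeddings)
print(graph)
```

The pairwise loop is quadratic in the number of nodes, which is fine for a sketch; for a full vocabulary an approximate nearest-neighbour index would likely be needed.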

##### GNN training\.

We train a GNN ($f_{\theta}$) for a link-prediction objective that encourages nodes representing confusable phonemes (and thus confusable words) to be strongly associated while separating spurious pairs. Let $h_v = f_{\theta}(v)$ denote the learned node representation. Define a differentiable pairwise score $s(u,v) = h_u^{\top} h_v$ (or a small feed-forward scorer $s(u,v) = \mathrm{MLP}([h_u; h_v; h_u \odot h_v])$). For each positive pair $(u,v) \in \mathcal{P}^{+}$ we sample a set of negative nodes $\mathcal{N}_u^{-}$ (negative sampling) and minimize the binary cross-entropy style objective

$$\mathcal{L}(\theta) = -\sum_{(u,v) \in \mathcal{P}^{+}} \log \sigma\big(s(u,v)\big) - \lambda \sum_{(u,n) \in \mathcal{N}^{-}} \log\big(1 - \sigma\big(s(u,n)\big)\big),$$

where $\sigma$ is the sigmoid function and $\lambda$ balances positive/negative terms. In practice we construct positives using multi-hop connectivity: nodes within $\leq H$ hops are treated as positive pairs (to capture one-to-many phonetic neighborhoods), while negatives are sampled uniformly or by degree-aware sampling to avoid easy negatives.
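
To make the objective concrete, the following sketch evaluates the loss for fixed node representations and builds multi-hop positive pairs with a breadth-first search. It assumes the dot-product scorer, treats the representations $h_v$ as given (no GNN or gradient step is shown), and uses hypothetical helper names `positives_within_hops` and `link_loss`.

```python
import math
from collections import deque

def positives_within_hops(adjacency, max_hops):
    """Ordered pairs (u, v), u != v, such that v is reachable from u in <= max_hops hops."""
    pairs = set()
    for source in adjacency:
        distance = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            if distance[u] == max_hops:
                continue  # do not expand beyond the hop limit
            for v in adjacency[u]:
                if v not in distance:
                    distance[v] = distance[u] + 1
                    queue.append(v)
        pairs.update((source, v) for v in distance if v != source)
    return pairs

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def link_loss(h, positives, negatives, lam=1.0):
    """-sum log sigma(s(u,v)) over positives - lam * sum log(1 - sigma(s(u,n))) over negatives."""
    pos_term = -sum(math.log(sigmoid(dot(h[u], h[v]))) for u, v in positives)
    neg_term = -sum(math.log(1.0 - sigmoid(dot(h[u], h[n]))) for u, n in negatives)
    return pos_term + lam * neg_term

# Toy chain graph a - b - c and toy representations (assumed values).
chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
h = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [-1.0, 0.0]}
loss = link_loss(h, positives=[("a", "b")], negatives=[("a", "c")])
print(sorted(positives_within_hops(chain, 2)), loss)
```

With $H = 1$ only direct neighbours are positives; raising $H$ to 2 also pairs "a" with "c", which is the one-to-many neighbourhood effect described above. A production implementation would use a numerically stable log-sigmoid rather than `math.log(1.0 - sigmoid(...))`.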

##### Inference and correction\.

At inference time, for a faulty ASR token $w_{\mathrm{ASR}}$, we first compute its phoneme representation $p_{\mathrm{ASR}} = \phi(w_{\mathrm{ASR}})$. We then locate the corresponding node $u$ in the graph and compute link scores $s(u,v)$ to candidate nodes $v$. The top-$k$ phoneme candidates are finally mapped back to vocabulary items in $\mathcal{V}$ using an inverted lexicon to produce correction candidates.
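
The retrieval step can be sketched as a top-$k$ ranking followed by a lexicon lookup. Everything concrete here is assumed for illustration: the ARPAbet-style node keys, the toy representations, the inverted lexicon, the dot-product scorer, and the choice to exclude the query node itself from the ranking.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve_candidates(query_node, h, inverted_lexicon, k=20):
    """Rank all other nodes by link score to query_node and map the top k back to words."""
    scored = [(dot(h[query_node], h[v]), v) for v in h if v != query_node]
    scored.sort(key=lambda item: (-item[0], item[1]))  # highest score first, ties by name
    candidates = []
    for _, node in scored[:k]:
        for word in inverted_lexicon.get(node, []):
            if word not in candidates:
                candidates.append(word)
    return candidates

# Toy node representations keyed by phoneme sequence (assumed values).
h = {
    "N AY T": [1.0, 0.0],
    "L AY T": [0.9, 0.1],
    "N IY T": [0.8, 0.3],
    "K AE T": [-1.0, 0.5],
}
# Toy inverted lexicon: phoneme sequence -> vocabulary items with that pronunciation.
inverted_lexicon = {
    "N AY T": ["night", "knight"],
    "L AY T": ["light", "lite"],
    "N IY T": ["neat"],
    "K AE T": ["cat"],
}
candidates = retrieve_candidates("N AY T", h, inverted_lexicon, k=2)
print(candidates)
```

Because one phoneme sequence can map to several spellings, the candidate list can be longer than $k$; whether homophones of the flagged word itself are also offered as candidates is a design choice the text above does not settle.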

##### Step 3: Beam\-search decoding with MLM scoring\.

Given a flagged ASR token $w_{\mathrm{ASR}}$, the GNN provides a set of candidate phoneme-aligned token segments $\mathcal{S} = \{\mathcal{S}_1, \dots, \mathcal{S}_K\}$, where each segment $\mathcal{S}_i$ contains multiple candidate subword tokens. Our objective is to reconstruct the most plausible lexical candidate by composing one token from each segment. Since the search space grows exponentially with the number of segments, we employ beam search to efficiently explore the candidate space.

At decoding step $t$, each beam represents a partial candidate composed of subword pieces $\mathbf{p}_{1:t} = (p_1, \dots, p_t)$, which are detokenized into a surface form $c_t$. Each candidate is scored using a weighted combination of contextual language model likelihood, edit distance to the ASR token, and lexical frequency:

$$\mathrm{Score}(c_t) = \lambda_{\mathrm{lm}} \cdot \log P_{\mathrm{MLM}}(c_t \mid S) + \lambda_{\mathrm{edit}} \cdot \big(-\mathrm{NED}(c_t, w_{\mathrm{ASR}})\big) + \lambda_{\mathrm{freq}} \cdot \mathrm{Freq}(c_t),$$

where $\mathrm{NED}(\cdot,\cdot)$ denotes normalized edit distance, and $P_{\mathrm{MLM}}(c_t \mid S)$ is computed using a masked language model (XLM-RoBERTa) by masking the original token position and evaluating the log-probability of the candidate's subword tokens. We set $\lambda_{\mathrm{lm}} = 1$, $\lambda_{\mathrm{edit}} = 2$, and $\lambda_{\mathrm{freq}} = 0.5$.

Beam search maintains the top-$B$ candidates (where $B = 10$ in our experiments) at each step:

$$\mathcal{B}_t = \operatorname{TopB}\left( \bigcup_{\mathbf{p} \in \mathcal{B}_{t-1},\ s \in \mathcal{S}_t} \{ \mathbf{p} \oplus s \} \right),$$

where $\oplus$ denotes concatenation. After processing all segments, the final decoded candidate is selected as

$$\hat{w} = \arg\max_{c \in \mathcal{B}_K} \mathrm{Score}(c).$$

This beam-based formulation enables efficient exploration of the large combinatorial candidate space while jointly incorporating phonetic similarity, contextual plausibility, and lexical priors to recover the most likely correction.
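
A compact sketch of this decoding loop is given below. Several pieces are assumptions rather than details from the paper: the MLM log-probability and the frequency prior are passed in as plain callables (stubbed with toy values), detokenization is simple string concatenation, and the edit distance is normalized by the longer string's length.

```python
def levenshtein(a, b):
    """Character-level edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        cur = [i]
        for j, ch_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ch_a != ch_b)))
        prev = cur
    return prev[-1]

def ned(a, b):
    """Normalized edit distance; dividing by the longer length is an assumption."""
    return levenshtein(a, b) / max(len(a), len(b), 1)

def beam_search(segments, asr_word, mlm_logprob, freq, beam_size=10,
                lam_lm=1.0, lam_edit=2.0, lam_freq=0.5):
    """Compose one piece per segment, keeping the beam_size best partial candidates."""
    def score(pieces):
        candidate = "".join(pieces)  # assumed detokenization: plain concatenation
        return (lam_lm * mlm_logprob(candidate)
                - lam_edit * ned(candidate, asr_word)
                + lam_freq * freq(candidate))

    beams = [()]
    for segment in segments:
        expanded = [pieces + (piece,) for pieces in beams for piece in segment]
        expanded.sort(key=score, reverse=True)
        beams = expanded[:beam_size]
    return ["".join(pieces) for pieces in beams]  # best candidate first

# Toy stand-ins (assumed): a lookup table instead of a real MLM, and a flat frequency prior.
toy_logprob = {"weather": -1.0, "whether": -6.0}
ranked = beam_search(
    segments=[["wea", "whe"], ["ther"]],
    asr_word="whether",
    mlm_logprob=lambda c: toy_logprob.get(c, -10.0),
    freq=lambda c: 0.0,
)
print(ranked)
```

In this toy run the contextual term outweighs the edit-distance penalty, so "weather" is ranked above the original ASR token.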

![Refer to caption](https://arxiv.org/html/2606.24889v1/x1.png)Figure 1: Block diagram of the proposed graph-based ASR correction strategy. Phase I represents the offline training phase of the GNN, while Phase II represents the inference pipeline.

##### LLM-based contextual candidate selection.

Although beam search with masked language model (MLM) scoring produces a set of plausible candidate corrections, the final selection requires deeper semantic and contextual reasoning beyond token-level likelihoods. To address this, we employ an instruction-tuned LLM as a context-aware selection module. Given the noisy ASR sentence $S$, the flagged word $w_{\mathrm{ASR}}$, and the decoded candidate set $\mathcal{C} = \{c_1, \dots, c_K\}$ from beam search, the LLM is prompted to select the single best replacement that fits the sentence context. The full prompt is shown in Appendix [A.2](https://arxiv.org/html/2606.24889#A1.SS2). Formally, the LLM defines a conditional selection function:

$$\hat{c} = g_{\mathrm{LLM}}\left(S, w_{\mathrm{ASR}}, \mathcal{C}\right),$$

where $g_{\mathrm{LLM}}$ maps the noisy sentence, faulty token, and candidate set to the most contextually appropriate correction.

This LLM-based selection stage serves as a semantic correction layer on top of the phonetic graph and MLM-based decoding, enabling robust recovery of contextually appropriate lexical forms from noisy ASR outputs. A complete overview of our pipeline is shown in Figure [1](https://arxiv.org/html/2606.24889#S3.F1).
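
The selection function $g_{\mathrm{LLM}}$ can be sketched as a thin wrapper that prompts a model and accepts only answers from the candidate set. The prompt wording below is hypothetical (the actual prompt is in the appendix and is not reproduced here), `llm` is any callable mapping a prompt string to a reply string, and the fallback to the first beam candidate is an assumption of this sketch.

```python
def build_prompt(sentence, flagged_word, candidates):
    """Hypothetical prompt; the wording is illustrative, not the paper's prompt."""
    options = "\n".join(f"{i}. {c}" for i, c in enumerate(candidates, 1))
    return (
        "The following sentence comes from a speech recognizer and may contain an error.\n"
        f"Sentence: {sentence}\n"
        f"Suspicious word: {flagged_word}\n"
        f"Candidate replacements:\n{options}\n"
        "Answer with exactly one candidate from the list."
    )

def select_candidate(sentence, flagged_word, candidates, llm):
    """Return the LLM's choice if it is in the candidate set, else the top beam candidate."""
    reply = llm(build_prompt(sentence, flagged_word, candidates)).strip().lower()
    for candidate in candidates:
        if reply == candidate.lower():
            return candidate
    return candidates[0]  # assumed fallback: never accept text outside the set

def fake_llm(prompt):
    """Stand-in for an instruction-tuned model; always answers 'Weather'."""
    return " Weather\n"

sentence = "the whether is nice today"
candidates = ["weather", "whether", "wether"]
choice = select_candidate(sentence, "whether", candidates, fake_llm)
corrected = sentence.replace("whether", choice, 1)
print(corrected)
```

Checking the reply against the candidate set is what keeps the LLM in a re-ranking role: any free-form output is discarded instead of being written into the transcript.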

## 4 Experiments

### 4.1 Experimental Setup

We evaluate our method and baselines on Gemma-3-4b-it Team et al. ([2025](https://arxiv.org/html/2606.24889#bib.bib21)), a strong and budget-friendly multilingual model. For faulty word detection (CAD) in Step 1 and MLM-based decoding in Step 3, we use XLM-RoBERTa Conneau et al. ([2020](https://arxiv.org/html/2606.24889#bib.bib22)) as the MLM. To train the GNN, we use GraphSAGE Hamilton et al. ([2017](https://arxiv.org/html/2606.24889#bib.bib31)); for link prediction, we attach MLP layers to the final layer of the GNN. The architectural settings of the GNN are given in the Appendix, Table [4](https://arxiv.org/html/2606.24889#A1.T4).

### 4.2 Baselines

We compare our approach, G-SPIN, against several strong baselines for ASR error correction. First, we evaluate against DoCIA Lyu et al. ([2025](https://arxiv.org/html/2606.24889#bib.bib16)), a contextual LLM-based ASR refinement framework that leverages document-level context to improve transcription consistency and correctness. Second, we include RLLM-CF Fang et al. ([2025b](https://arxiv.org/html/2606.24889#bib.bib15)), a three-stage LLM-based correction pipeline designed to reduce hallucinations and improve factual and contextual reliability in ASR refinement. In addition to LLM-based baselines, we also evaluate a graph-based baseline using the seed knowledge graph (KG) directly, without GNN retrieval. The plain baseline is the uncorrected noisy ASR output. We evaluate G-SPIN and the baselines on three metrics: WER (word error rate), SeMaScore (S.) Sasindran et al. ([2024](https://arxiv.org/html/2606.24889#bib.bib30)), and BERTScore (B.) Zhang et al. ([2020](https://arxiv.org/html/2606.24889#bib.bib27)). Furthermore, we evaluate GNN training using Hits@10, Hits@20, and the AUC score.
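
For reference, WER is conventionally the word-level edit distance between hypothesis and reference divided by the reference length. The sketch below follows that standard definition with whitespace tokenization; the function name `wer` and the absence of any text normalization are assumptions, since the evaluation script is not described in the text.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # edit distances against an empty reference prefix
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)

print(wer("the weather is nice", "the whether is nice"))
```

A single substituted word in a four-word reference gives a WER of 0.25, which shows how one phonetic confusion on a content word can look small in aggregate while changing the meaning.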

### 4.3 Dataset and languages

For our experimental setup, we use Loquacious-Set Parcollet et al. ([2025](https://arxiv.org/html/2606.24889#bib.bib24)), which contains noisy and clean output, making it well suited to our use case. To create parallel noisy–clean pairs, we systematically injected pure environmental and ambient noise samples into the clean audio. The noise types and mixing process were designed to approximate realistic deployment conditions encountered in real-world ASR scenarios. To ensure that the injected noise meaningfully impacted ASR performance without excessively degrading the linguistic content, we designed a validation protocol across multiple noise levels. Specifically, we constructed a validation set spanning varying signal-to-noise ratios (SNRs) and evaluated transcription quality using both Word Error Rate (WER) and SeMaScore. In addition to these quantitative metrics, we performed manual inspection to verify that (i) the noise induced plausible ASR errors rather than arbitrary corruption, and (ii) the underlying semantic content of the utterances remained largely recoverable. To transcribe and translate the clean and noisy audio, we use seamless-m4t-v2-large Team ([2023](https://arxiv.org/html/2606.24889#bib.bib25)). We compare models on four languages: hi, en, es, and te.
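
The text does not detail the mixing procedure, so the following is only a common recipe shown for orientation: scale the noise so that the ratio of clean power to noise power matches a target SNR in decibels, then add it sample by sample. The function names, the tiling of short noise clips, and the synthetic signals are all assumptions of this sketch.

```python
import math

def mean_power(signal):
    """Average squared amplitude of a list of samples."""
    return sum(x * x for x in signal) / len(signal)

def mix_at_snr(clean, noise, snr_db):
    """Return (mixed, scaled_noise) with 10*log10(P_clean / P_noise) equal to snr_db."""
    # Repeat the noise clip so it covers the whole clean signal (assumed handling).
    tiled = [noise[i % len(noise)] for i in range(len(clean))]
    target_ratio = 10 ** (snr_db / 10)
    scale = math.sqrt(mean_power(clean) / (mean_power(tiled) * target_ratio))
    scaled = [scale * n for n in tiled]
    mixed = [c + n for c, n in zip(clean, scaled)]
    return mixed, scaled

# Synthetic stand-ins for audio samples (assumed): a sine wave and an alternating noise clip.
clean = [math.sin(0.1 * i) for i in range(1000)]
noise = [0.3 * (-1) ** i for i in range(100)]
mixed, scaled = mix_at_snr(clean, noise, snr_db=10)
achieved = 10 * math.log10(mean_power(clean) / mean_power(scaled))
print(round(achieved, 6))
```

Sweeping `snr_db` over several values and transcribing each mixture would give the kind of validation set across noise levels described above.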

### 4.4 Results

Table [1](https://arxiv.org/html/2606.24889#S4.T1) shows that G-SPIN outperforms all baselines in terms of WER, demonstrating the effectiveness of our GNN-based candidate generation combined with LLM-based contextual selection. G-SPIN also achieves the best SeMaScore (S.), although the margin over competing methods is relatively small, indicating comparable semantic preservation across approaches. In contrast, BERTScore (B.) shows minimal variation among methods, suggesting that it is less sensitive to fine-grained lexical corrections and provides limited discriminative insight for ASR error correction performance.

Table 1: WER $\downarrow$, BERTScore (B.) $\uparrow$ and SeMaScore (S.) $\uparrow$ comparison across languages. All scores lie in [0, 1].

![Refer to caption](https://arxiv.org/html/2606.24889v1/x2.png)Figure 2: Comparison of different selection methods with the corresponding performance metrics.

![Refer to caption](https://arxiv.org/html/2606.24889v1/newkvswer.png)Figure 3: Ablation plot of the value of $K$ vs. WER.

### 4.5 Ablation Experiments

##### Effect of Top-$K$ candidate selection.

In Figure [3](https://arxiv.org/html/2606.24889#S4.F3), we analyze the impact of varying the number of Top-$K$ candidate tokens retrieved from the GNN during decoding. We observe that for smaller values such as $K=5$ and $K=10$, the WER remains largely unchanged, indicating that the highest-confidence candidates produced by the GNN are already highly relevant and sufficient for effective correction. Increasing the candidate pool to $K=20$ results in further improvement, suggesting that some optimal corrections may lie slightly beyond the top-ranked candidates and become accessible with a larger search space. However, when $K=30$, performance begins to degrade, likely due to the introduction of lower-quality or noisy candidates, which increase ambiguity during decoding and negatively affect candidate selection.

##### MLM vs LLM based selection\.

To evaluate contextual candidate selection, we compare our LLM\-based selection against an MLM\-based baseline, where the candidate with the highest masked language model probability is chosen\. As shown in Figure[2](https://arxiv.org/html/2606.24889#S4.F2), the LLM\-based method significantly outperforms MLM\-based selection in terms of WER and achieves notable improvements in SeMA score, while both methods yield comparable BERTScore\. These results indicate that LLM\-based selection enables more effective context\-aware disambiguation, leading to more accurate correction of ASR errors beyond token\-level likelihood estimation\.

Table 2: GNN training evaluation across languages. Higher is better for all metrics ($\uparrow$).

##### GNN link prediction performance.

From Table[2](https://arxiv.org/html/2606.24889#S4.T2), we observe that the GNN achieves strong performance across all languages on the link prediction task, as reflected by high Hits@10, Hits@20, and AUC scores\. These results indicate that the model effectively captures phonetic similarity and reliably connects phonemes with acoustically and articulatorily similar counterparts\. For Hindi and Telugu, the scores are slightly lower compared to other languages, which can be attributed to higher phonetic variability and increased ambiguity in phoneme–token mappings\. This motivates the use of a larger candidate pool during retrieval\. In particular, selecting Top\-20 candidates from the GNN improves recall of correct phonetic matches, ensuring optimal correction candidates are included during downstream decoding and contextual re\-ranking\.

### 4.6 Error Analysis and Qualitative Behavior

We perform a detailed error analysis to better understand the strengths and limitations of our proposed method, G\-SPIN, across different categories of ASR noise\. For this analysis, we curate 200 samples for each error type from English audio and compute Word Error Rate \(WER\) across different noise categories \(see Table[3](https://arxiv.org/html/2606.24889#S4.T3)\)\.

#### 4.6.1 Substitution and Phonetically Similar Errors

Our method is particularly effective at correcting substitution errors, especially those arising from phonetic ambiguity\. This aligns with the design of G\-SPIN, which explicitly models phonetic similarity during correction\. As shown in Table[3](https://arxiv.org/html/2606.24889#S4.T3), our approach significantly reduces errors in:

- Grammatical errors: from 0.4037 to 0.1866
- Similar-sounding substitutions: from 0.4000 to 0.149

These improvements indicate that the proposed three\-step pipeline—particularly the error detection and phonetic correction stages—is effective at identifying and resolving substitution\-type errors\. Qualitatively, we observe that G\-SPIN is able to recover intended words even when the ASR output contains plausible but incorrect phonetic variants\.

#### 4.6.2 Entity-Level Errors

We also observe strong improvements in entity-related errors (e.g., named entities), with error rates reduced from 0.5107 to 0.192. This suggests that the model is capable of correcting semantically important tokens, which is critical for downstream real-world applications.

#### 4.6.3 Insertion Errors

For insertion errors, G-SPIN achieves moderate improvements (0.4941 → 0.3656), but does not fully resolve them. This is expected, as the current framework is primarily designed around phonetic correction rather than sequence-level restructuring. However, we find that the faulty word detection stage (Step 1) is effective at identifying spurious inserted tokens that degrade sentence intelligibility. This suggests that, while G-SPIN alone does not completely eliminate insertion errors, it can serve as a strong precursor for downstream filtering or language model-based refinement.

#### 4.6.4 Deletion Errors

Deletion errors remain the most challenging category, with only marginal improvement (0.73 → 0.71). This limitation stems from the inherent difficulty of recovering missing information from text-only ASR outputs without access to the underlying speech signal. To the best of our knowledge, accurately correcting deletion errors in such settings is fundamentally challenging, as the model must infer absent content without sufficient contextual or acoustic cues.

Table 3: Category-wise WER (↓) on English ASR outputs (200 samples per category).
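The category-wise values quoted in Sections 4.6.1 to 4.6.4 can be put on a common scale by computing the relative WER reduction, (before − after) / before. The snippet below is only a convenience calculation over the numbers stated in the text; the dictionary keys and the reading of each pair as (WER before correction, WER after correction) are assumptions.

```python
# (WER before, WER after) as quoted in the text of Section 4.6; labels are informal.
reported = {
    "grammatical": (0.4037, 0.1866),
    "similar-sounding": (0.4000, 0.149),
    "entity": (0.5107, 0.192),
    "insertion": (0.4941, 0.3656),
    "deletion": (0.73, 0.71),
}


def relative_reduction(before: float, after: float) -> float:
    """Relative WER reduction as a fraction of the starting WER."""
    if before <= 0:
        raise ValueError("starting WER must be positive")
    return (before - after) / before


reductions = {name: relative_reduction(b, a) for name, (b, a) in reported.items()}
for name, r in reductions.items():
    print(f"{name:>17}: {100 * r:.1f}% relative WER reduction")
```

On this scale the substitution-type and entity categories improve by more than half, insertions by roughly a quarter, and deletions by under 3%, which matches the qualitative ordering described above.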

## 5 Conclusion

We introduce G-SPIN, a lightweight, inference-time framework that improves frozen LLMs under ASR noise. By combining phonetic constraints with graph-based contextual reasoning, it corrects transcription errors without retraining or unconstrained rewriting. Experiments on en, es, hi, and te show consistent gains, and theoretical analysis demonstrates that restricting the phonetic space improves local stability. G-SPIN offers a practical, plug-and-play approach to enhancing robustness in speech-driven applications.

## Limitations

One limitation of our work is that the effectiveness of G\-SPIN depends on the quality of the phonetic neighborhoods constructed\. If the candidate space fails to include the correct alternative, the correction module cannot recover the intended token\. In highly noisy or domain\-mismatched ASR settings, this may limit performance gains\. G\-SPIN’s performance degrades when the ASR output contains missing or dropped audio segments\. In such cases, the method cannot recover the missing information, potentially resulting in incomplete or inaccurate corrections\.

## Ethics Statement

Our work focuses on improving the robustness of multilingual ASR output correction to enhance accessibility and communication\. However, the system relies on automatically generated transcriptions, which may contain errors or biases that could lead to misinterpretation, particularly in sensitive or minority\-language contexts\. Moreover, the datasets and underlying ASR systems may reflect demographic or linguistic biases, and while our method can correct transcription errors, it does not inherently remove these societal biases\. Responsible deployment requires careful auditing for fairness across languages, accents, and dialects, as well as caution when using the system in critical applications\.

## Appendix A Theoretical Motivation

### A.1 Local Stability under Structured Phonetic Projection

We provide a first\-order analysis to understand why constraining corrections to phonetic neighborhoods improves local stability\.

##### Setup\.

Let $x \in \mathbb{R}^{d}$ denote the embedding representation of the clean sentence, and let $\tilde{x} = x + \delta$ denote its ASR-corrupted version. Let $f: \mathbb{R}^{d} \to \mathbb{R}$ be a continuously differentiable contextual scoring function (e.g., masked likelihood, semantic consistency score).

Assume the correction operator $C$ produces:

$$C(\tilde{x}) = x + \delta_{C},$$

where $\delta_{C}$ lies in a structured phonetic manifold $\mathcal{M} \subseteq \mathbb{R}^{d}$, representing acoustically plausible lexical directions.

We interpret $\delta_{C}$ as a projection of $\delta$ onto $\mathcal{M}$:

$$\delta_{C} = P_{\mathcal{M}}(\delta).$$

###### Lemma 1 (Local Stability under Structured Projection).

Let $f$ be continuously differentiable in a neighborhood of $x$. Then for sufficiently small perturbations $\delta$,

$$f(x+\delta) - f(x) = \nabla f(x)^{\top}\delta + \mathcal{O}(\|\delta\|^{2}).$$

If the correction operator $C$ satisfies:

$$\|\delta_{C}\| \leq \alpha\|\delta\|, \quad \text{for some } \alpha < 1,$$

then the first-order deviation obeys

$$|f(x+\delta_{C}) - f(x)| \leq \alpha\|\nabla f(x)\|\,\|\delta\| + \mathcal{O}(\|\delta\|^{2}).$$

###### Proof.

By first-order Taylor expansion,

$$f(x+\delta) = f(x) + \nabla f(x)^{\top}\delta + \mathcal{O}(\|\delta\|^{2}).$$

Applying this to $\delta_{C}$ gives:

$$f(x+\delta_{C}) = f(x) + \nabla f(x)^{\top}\delta_{C} + \mathcal{O}(\|\delta_{C}\|^{2}).$$

Using Cauchy–Schwarz,

$$|\nabla f(x)^{\top}\delta_{C}| \leq \|\nabla f(x)\|\,\|\delta_{C}\|.$$

Substituting $\|\delta_{C}\| \leq \alpha\|\delta\|$, which also bounds the remainder $\mathcal{O}(\|\delta_{C}\|^{2})$ by $\mathcal{O}(\|\delta\|^{2})$, completes the proof. ∎
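The bound in Lemma 1 can be sanity-checked numerically on a toy scoring function. The sketch below assumes $f(x) = \sum_i \sin(x_i)$, chosen only because its Hessian has spectral norm at most 1 so the second-order remainder is at most $\tfrac{1}{2}\|\delta_{C}\|^{2}$, and it models the correction as a uniform shrinkage $\delta_{C} = \alpha\delta$. Neither choice comes from the paper; both are illustrative assumptions.

```python
import math


def f(x):
    """Toy smooth scoring function (assumption, not the paper's scorer)."""
    return sum(math.sin(v) for v in x)


def grad_f(x):
    return [math.cos(v) for v in x]


def norm(v):
    return math.sqrt(sum(c * c for c in v))


x = [0.3, -1.2, 0.7]
delta = [0.05, -0.02, 0.04]
alpha = 0.5
# Toy contraction satisfying ||delta_c|| <= alpha * ||delta||; not a phonetic projection.
delta_c = [alpha * d for d in delta]

lhs = abs(f([a + b for a, b in zip(x, delta_c)]) - f(x))
first_order = alpha * norm(grad_f(x)) * norm(delta)
remainder = 0.5 * norm(delta_c) ** 2  # valid because the Hessian norm of f is <= 1
bound = first_order + remainder
print(f"|f(x + delta_c) - f(x)| = {lhs:.5f} <= bound = {bound:.5f}")
```

This checks a single point only; it illustrates the inequality rather than proving it.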

##### Directional Attenuation.

Beyond norm reduction, stability further improves when the projection removes components aligned with high-sensitivity directions. Let $g = \nabla f(x)$ and decompose:

$$\delta = \delta_{\parallel} + \delta_{\perp},$$

where $\delta_{\parallel}$ lies in $\mathrm{span}(g)$. If $\mathcal{M}$ excludes components aligned with $g$ and $\delta_{\parallel} \neq 0$, then

$$|g^{\top}\delta_{C}| < |g^{\top}\delta|,$$

even when $\|\delta_{C}\| \approx \|\delta\|$.
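A small numerical illustration of directional attenuation follows. It removes a fraction `beta` of the component of $\delta$ along $g$ and compares the first-order terms before and after. The helper `attenuate`, the value of `beta`, and the vectors are hypothetical and serve only to show that the first-order term can shrink sharply while the norm barely changes.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(v):
    return math.sqrt(dot(v, v))


def attenuate(delta, g, beta):
    """Remove a fraction beta (0 < beta <= 1) of the component of delta along g."""
    scale = beta * dot(g, delta) / dot(g, g)
    return [d - scale * gi for d, gi in zip(delta, g)]


g = [1.0, 0.0, 0.0]          # stand-in for the gradient direction
delta = [0.1, 1.0, 1.0]      # perturbation that is mostly orthogonal to g
delta_c = attenuate(delta, g, beta=0.9)

first_order_before = abs(dot(g, delta))
first_order_after = abs(dot(g, delta_c))
norm_ratio = norm(delta_c) / norm(delta)
print(first_order_before, first_order_after, norm_ratio)
```

Here the first-order term drops by about a factor of ten while the perturbation norm changes by well under one percent.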

##### Interpretation\.

The correction operator therefore improves local stability via two mechanisms:

1. Norm contraction: reducing perturbation magnitude.
2. Directional filtering: attenuating components aligned with high-sensitivity semantic directions.

Since ASR errors are primarily phonetic in nature, constraining corrections to phonetic neighborhoods acts as a structured projection that removes arbitrary semantic drift\. This yields improved first\-order stability without requiring retraining or modification of contextual language models\.

### A.2 Prompt used by our LLM-based Decoding strategy

![Prompt used for ranking](https://arxiv.org/html/2606.24889v1/x3.png)

Figure 4: Prompt used for Ranking.

##### GNN training architecture details

Table 4: Graph neural network architecture and training configuration used for phoneme-level link prediction.

### A.3 Pseudo-code for ASR Error Correction

Algorithm 1: Graph-Guided ASR Error Correction (G-SPIN)

1. Input: noisy sentence $S$, flagged tokens $\{w\}$, GNN dictionary
2. for each flagged token $w$ in $S$ do
3. $\mathcal{C}_{\text{GNN}} \leftarrow \text{CollectGNN}(w)$
4. $\mathcal{C} \leftarrow \text{BeamDecodeMLM}(S, w, \mathcal{C}_{\text{GNN}})$
5. $\mathcal{C} \leftarrow \text{Unique}(\mathcal{C} \cup \{w\})$
6. $c^{\star} \leftarrow g_{\mathrm{LLM}}(S, w, \mathcal{C})$
7. if $c^{\star}$ is empty then
8. $c^{\star} \leftarrow \arg\max_{c \in \mathcal{C}} \mathrm{Score}_{\mathrm{MLM}}(c)$
9. end if
10. Replace $w$ in $S$ with $c^{\star}$
11. end for
12. Output: corrected sentence $S$
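The control flow of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the GNN dictionary is modeled as a plain `dict`, the MLM beam decoding step is replaced by a simple top-k cut over MLM scores, and `mlm_score` and `llm_rank` are hypothetical callables standing in for the real models.

```python
from typing import Callable, Dict, List, Optional, Sequence


def correct_sentence(
    tokens: Sequence[str],
    flagged: Sequence[int],
    gnn_neighbors: Dict[str, List[str]],
    mlm_score: Callable[[List[str], int, str], float],
    llm_rank: Callable[[List[str], int, List[str]], Optional[str]],
    top_k: int = 5,
) -> List[str]:
    """Replace each flagged token with the best phonetic candidate in context."""
    out = list(tokens)
    for i in flagged:
        w = out[i]
        # Step 3: phonetic candidates from the (assumed) precomputed GNN dictionary.
        c_gnn = gnn_neighbors.get(w, [])
        # Step 4 (simplified): keep the top-k candidates by MLM score, no beam search.
        ranked = sorted(c_gnn, key=lambda c: mlm_score(out, i, c), reverse=True)
        # Step 5: add the original token and deduplicate, preserving order.
        cands = list(dict.fromkeys(ranked[:top_k] + [w]))
        # Step 6: ask the LLM to pick one candidate.
        choice = llm_rank(out, i, cands)
        # Steps 7-9: fall back to the best MLM-scored candidate on an empty answer.
        if not choice:
            choice = max(cands, key=lambda c: mlm_score(out, i, c))
        out[i] = choice  # Step 10
    return out


# Toy stand-ins for the models (hypothetical scores, for illustration only).
toy_neighbors = {"meat": ["meet", "mete", "neat"]}
toy_scores = {"meet": 0.9, "meat": 0.3, "neat": 0.2, "mete": 0.1}


def toy_mlm_score(tokens: List[str], i: int, cand: str) -> float:
    return toy_scores.get(cand, 0.0)


def toy_llm_rank(tokens: List[str], i: int, cands: List[str]) -> Optional[str]:
    return None  # simulate an empty LLM answer to exercise the fallback


corrected = correct_sentence(
    "nice to meat you".split(), [2], toy_neighbors, toy_mlm_score, toy_llm_rank
)
```

Keeping the original token in the candidate set, as in step 5, lets the ranker leave a token unchanged when it was flagged by mistake.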