SERC: LDPC-Inspired Semantic Error Correction for Retrieval-Augmented Generation

arXiv cs.CL Papers

Summary

Proposes SERC, a training-free method inspired by LDPC codes to correct hallucinations in LLMs by treating generation as a noisy channel and using sparse verification queries against external evidence.

arXiv:2605.28837v1 Announce Type: new Abstract: While Large Language Models (LLMs) have demonstrated remarkable capabilities, their reliability is significantly compromised by hallucinations. Existing intrinsic self-correction methods attempt to address this, but often fail due to self-bias, where models struggle to identify errors in their own outputs without external verification. To overcome these limitations, we propose the LDPC-inspired semantic error correction for retrieval-augmented generation (SERC), providing a theoretical framework to interpret and mitigate LLM hallucinations. We reformulate the text generation process as a semantic noisy channel, treating generated responses as noise-corrupted codewords. Inspired by low-density parity-check (LDPC) codes, SERC employs a sparse verification strategy: instead of exhaustively checking all facts, it generates low-density verification queries and validates them against external evidence to efficiently detect and correct errors. We evaluate SERC on LongForm Bio and TruthfulQA benchmarks using Llama-3-8B and Qwen2.5-14B. Experimental results demonstrate that SERC outperforms both intrinsic self-correction methods and strong retrieval-augmented baselines, demonstrating significant gains especially in factual precision (FactScore). Notably, SERC enables small language models (SLMs) to surpass the performance of larger baselines in hallucination reduction and information preservation. Our findings demonstrate that SERC provides a training-free, model-agnostic solution that significantly reduces verification overhead compared to dense methods, achieving an optimal trade-off between cost and fidelity in resource-constrained environments.

Cached at: 05/29/26, 09:12 AM

# SERC: LDPC-Inspired Semantic Error Correction for Retrieval-Augmented Generation
Source: [https://arxiv.org/html/2605.28837](https://arxiv.org/html/2605.28837)
Juhwan Park, Jaeha Kim, Seunggyun Han, Kyungrak Son†, Ikbeom Jang†

1. Department of Information Communications Engineering, Hankuk University of Foreign Studies, Republic of Korea
2. Division of Computer Engineering, Hankuk University of Foreign Studies, Republic of Korea
3. Department of Statistics, Hankuk University of Foreign Studies, Republic of Korea

Email: {rhzs1208, amry0719, jaehakim, mashan120, krson, ijang}@hufs.ac.kr

† Corresponding authors.

###### Abstract

While Large Language Models (LLMs) have demonstrated remarkable capabilities, their reliability is significantly compromised by hallucinations. Existing intrinsic self-correction methods attempt to address this, but often fail due to self-bias, where models struggle to identify errors in their own outputs without external verification. To overcome these limitations, we propose the LDPC-inspired semantic error correction for retrieval-augmented generation (SERC), providing a theoretical framework to interpret and mitigate LLM hallucinations. We reformulate the text generation process as a semantic noisy channel, treating generated responses as noise-corrupted codewords. Inspired by low-density parity-check (LDPC) codes, SERC employs a sparse verification strategy: instead of exhaustively checking all facts, it generates low-density verification queries and validates them against external evidence to efficiently detect and correct errors. We evaluate SERC on LongForm Bio and TruthfulQA benchmarks using Llama-3-8B and Qwen2.5-14B. Experimental results demonstrate that SERC outperforms both intrinsic self-correction methods and strong retrieval-augmented baselines, demonstrating significant gains especially in factual precision (FactScore). Notably, SERC enables small language models (SLMs) to surpass the performance of larger baselines in hallucination reduction and information preservation. Our findings demonstrate that SERC provides a training-free, model-agnostic solution that significantly reduces verification overhead compared to dense methods, achieving an optimal trade-off between cost and fidelity in resource-constrained environments. The code and data are available at [https://github.com/labhai/SERC](https://github.com/labhai/SERC).

## 1 Introduction

Recent advancements in large language models (LLMs) [[2](https://arxiv.org/html/2605.28837#bib.bib17)] are constrained by hallucinations, that is, content inconsistent with factual reality [[8](https://arxiv.org/html/2605.28837#bib.bib4)]. This structural limitation poses significant risks in high-stakes domains like healthcare and law [[21](https://arxiv.org/html/2605.28837#bib.bib19)], making the detection and correction of reasoning errors a critical challenge.

Existing self-correction methods, such as Chain-of-Verification (CoVe) [[3](https://arxiv.org/html/2605.28837#bib.bib3)], largely rely on intrinsic reasoning but suffer from self-bias, where initial biases propagate into the verification phase [[6](https://arxiv.org/html/2605.28837#bib.bib5)]. Implementing such procedures in small language models (SLMs) is further complicated by their limited reasoning capacity [[20](https://arxiv.org/html/2605.28837#bib.bib18)]. We argue that incorporating retrieval-augmented generation (RAG) is a prerequisite for effective self-correction, especially in resource-constrained SLM environments [[11](https://arxiv.org/html/2605.28837#bib.bib1)].

To address these limitations, we redefine hallucinations through an information-theoretic lens, interpreting LLM outputs as signals transmitted via an imperfect semantic channel [[16](https://arxiv.org/html/2605.28837#bib.bib16)]. By reformulating hallucination as a probabilistic inference problem where generated text is a noisy observation of unobserved factual latent variables [[10](https://arxiv.org/html/2605.28837#bib.bib21)], we apply classical error-correction principles to recover the ground-truth manifold.

Building upon this, we propose the SERC (Semantic Error-Reduction and Correction) framework. Inspired by the design philosophy of low-density parity-check (LDPC) codes [[4](https://arxiv.org/html/2605.28837#bib.bib13)], SERC establishes a low-density verification plan to efficiently expose error patterns. Detected errors are rectified through a procedure analogous to the belief propagation (BP) algorithm [[15](https://arxiv.org/html/2605.28837#bib.bib22)], propagating updated beliefs to restore global semantic coherence. Our primary contributions are threefold: (1) we propose a semantic channel model that interprets LLM hallucinations through the lens of error correcting codes (ECC), providing a theoretical basis for self-correction; (2) we implement the SERC framework, which utilizes low-density verification on atomic facts to efficiently detect errors without additional training; and (3) we demonstrate through experiments on LongForm Bio and TruthfulQA that SERC significantly outperforms baselines like CoVe, specifically enhancing the reliability of SLMs.

## 2 Related Works

LLM Correction and Retrieval-Augmented Generation: LLM hallucinations stem from training data tails and parametric gaps [[8](https://arxiv.org/html/2605.28837#bib.bib4),[13](https://arxiv.org/html/2605.28837#bib.bib8)]. Intrinsic self-correction (e.g., CoVe [[3](https://arxiv.org/html/2605.28837#bib.bib3)]) often succumbs to self-bias without external grounding [[6](https://arxiv.org/html/2605.28837#bib.bib5)]. RAG [[11](https://arxiv.org/html/2605.28837#bib.bib1)] mitigates this, but adaptive variants (Self-RAG [[1](https://arxiv.org/html/2605.28837#bib.bib2)]) require costly fine-tuning and struggle with noisy retrieval [[24](https://arxiv.org/html/2605.28837#bib.bib7)]. Advanced systems optimizing retrieval pipelines (CRAG [[23](https://arxiv.org/html/2605.28837#bib.bib26)], MIGRES [[19](https://arxiv.org/html/2605.28837#bib.bib27)], Adaptive RAG [[7](https://arxiv.org/html/2605.28837#bib.bib50)]) focus primarily on information gathering, remaining vulnerable to semantic noise introduced during generation. Conversely, SERC transcends mere retrieval, functioning as an information-theoretic error correction (ECC) mechanism to systematically rectify corrupted semantics. Operating training-free and model-agnostic at the semantic proposition level, SERC’s low-density verification provides superior cost-efficiency over exhaustive baselines like RARR [[5](https://arxiv.org/html/2605.28837#bib.bib48)].

Information Theory and Coding: Error correcting codes (ECC) [[17](https://arxiv.org/html/2605.28837#bib.bib12)] ensure reliable transmission. Notably, low-density parity-check (LDPC) codes [[4](https://arxiv.org/html/2605.28837#bib.bib13)] enable efficient decoding via sparse parity-check structures, represented as Tanner Graphs [[18](https://arxiv.org/html/2605.28837#bib.bib14)] and decoded via iterative belief propagation (BP) [[15](https://arxiv.org/html/2605.28837#bib.bib22)]. In semantic communication, systems like DeepSC [[22](https://arxiv.org/html/2605.28837#bib.bib51)] transmit semantic meaning rather than mere bits, preserving semantics over noisy channels. SERC adapts these principles for hallucination mitigation, interpreting LLM outputs as noise-corrupted codewords and verifications as sparse parity checks. The resulting Tanner-style graph links factual propositions with evidence, enabling structured hallucination correction from an information-theoretic perspective.

## 3 Information Theoretic Abstraction

### 3.1 Semantic Channel Modeling: LLM as a Noisy Channel

![Refer to caption](https://arxiv.org/html/2605.28837v1/x1.png)

Figure 1: The proposed Semantic Channel Model. The generation process of an LLM is modeled as a noisy channel where hallucinations are treated as semantic noise, and SERC acts as the decoder to restore original information.

To address the self-bias in intrinsic self-correction [[3](https://arxiv.org/html/2605.28837#bib.bib3),[6](https://arxiv.org/html/2605.28837#bib.bib5)], we model the LLM generation process as a semantic noisy channel. By drawing a formal analogy to classical information theory [[17](https://arxiv.org/html/2605.28837#bib.bib12),[16](https://arxiv.org/html/2605.28837#bib.bib16)], we decompose the QA process into five core components as illustrated in Fig. [1](https://arxiv.org/html/2605.28837#S3.F1):

Source: The origin of information, representing the real-world entity or subject of inquiry that the user intends to learn about.

Message ($M$): The set of objective ground-truth facts existing in the real world regarding the subject. $M$ defines the complete space of factual validity independent of its linguistic expression.

Codeword ($C$): The ideal, hallucination-free natural language response constructed solely using facts belonging to message $M$. For example, for a query regarding Einstein, a potential Codeword is: “Einstein was born in Germany and published the theory of relativity.”

Semantic Noisy Channel: The stochastic LLM generation process. While an ideal model outputs $C$, actual LLMs act as a channel injecting semantic noise [[8](https://arxiv.org/html/2605.28837#bib.bib4)], distorting the ideal codeword into a corrupted observation $C' = C \oplus \text{Noise}$, where $\oplus$ denotes the superposition of semantic distortions (see Supp. Sec. 5).

Decoder (SERC): The proposed framework that serves as a semantic decoder. Analogous to LDPC [[4](https://arxiv.org/html/2605.28837#bib.bib13)] and BP [[15](https://arxiv.org/html/2605.28837#bib.bib22)], SERC reconstructs the original codeword $C$ from the noisy observation $C'$ to restore factual fidelity to the original message $M$.

### 3.2 Manifold of Truth and Operational Approximation

Building on the semantic channel abstraction, SERC operates as an external error correction layer that rectifies the noisy output $C'$ without accessing the LLM’s latent space. We define the Operational Fact Set $F = \{f_{k,i} \mid 1 \leq k \leq n,\ 1 \leq i \leq m_k\}$ as the collection of atomic factual propositions extracted from $C'$. Here, $f_{k,i}$ denotes the $i$-th atomic fact derived from the $k$-th sentence $s_k$, serving as the individual variable nodes for our decoding graph.

The objective of SERC is to project the corrupted set $F$ onto the manifold of truth ($\mathcal{M}_{\text{truth}}(Q)$), defined as the space of all fact sets derived from ideal, hallucination-free responses $\mathcal{C}^{*}(Q)$:

$$\mathcal{M}_{\text{truth}}(Q) = \{\text{Facts}(C^{*}) \mid C^{*} \in \mathcal{C}^{*}(Q)\} \tag{1}$$

Since $\mathcal{M}_{\text{truth}}(Q)$ is unobservable, SERC operationally approximates it using the subspace of facts consistent with external evidence retrieved via RAG.

To perform this projection efficiently, we adopt a graph-based strategy inspired by low-density parity-check (LDPC) codes. Instead of dense, exhaustive verification, we construct a sparse Tanner graph where multiple facts (variable nodes) are verified by a single, grouped verification query (check node). This sparsity minimizes the computational overhead of verification (LLM/RAG calls). Finally, the correction process mimics belief propagation (BP); a local correction in one fact (e.g., entity type) logically propagates to related sentences, ensuring the reconstructed text converges to global semantic consistency.

## 4 Proposed Methodology: SERC Framework

We propose the SERC (Semantic Error-Reduction and Correction) framework. Building on the theoretical foundation established in Section [3](https://arxiv.org/html/2605.28837#S3), SERC instantiates the semantic decoder. We map the abstract components of the channel model to concrete RAG operations: the noisy codeword $C'$ is instantiated as the initial LLM response $R_{\text{init}}$, and the parity check constraints are implemented via sparse verification queries. As illustrated in Algorithm [1](https://arxiv.org/html/2605.28837#alg1), the framework operates sequentially through three mathematically defined phases to align the response with the Manifold of Truth.

### 4.1 Phase 1: Coarse Alignment and Entity Firewall

The process initiates with the generation of an initial response $R_{\text{init}}$ (corresponding to the noisy observation $C'$) from the language model $LM$ given a user query $Q$:

$$R_{\text{init}} \sim P_{LM}(y \mid Q) \tag{2}$$

Standard RAG approaches often fail when the initial generation suffers from source confusion (e.g., confusing two people with the same name). In channel coding terms, this represents a Synchronization Error, where the decoder attempts to decode a signal using the wrong codebook. To mitigate this, we introduce the entity firewall mechanism. Let $\mathcal{T}(\cdot)$ be a topic entity extraction function. We extract the core subject entity from both the model’s internal knowledge ($T_{\text{model}} = \mathcal{T}(R_{\text{init}})$) and external evidence retrieved via a search module $\mathcal{R}$ ($T_{\text{rag}} = \mathcal{T}(\mathcal{R}(Q))$). The firewall verifies the consistency between these two entities (see Supp. 1.1 for the exact judge prompt):

$$R_{\text{init}} = \begin{cases} LM(Q, \mathcal{R}(Q)) & \text{if } \text{Consistency}(T_{\text{model}}, T_{\text{rag}}) = \text{False} \quad (\text{Hard Reset}) \\ R_{\text{init}} & \text{otherwise} \end{cases} \tag{3}$$

If a mismatch is detected, a Hard Reset is triggered, forcing the model to regenerate the baseline using the retrieved context to align the semantic trajectory before fine-grained verification.
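The control flow of Eq. (3) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the callables `lm`, `retrieve`, `extract_topic`, and `is_consistent` are hypothetical stand-ins for $LM$, $\mathcal{R}$, $\mathcal{T}$, and the Consistency judge, and the toy versions exist only so the branch logic can be exercised offline.

```python
from typing import Callable, Optional, Tuple


def entity_firewall(
    query: str,
    lm: Callable[[str, Optional[str]], str],
    retrieve: Callable[[str], str],
    extract_topic: Callable[[str], str],
    is_consistent: Callable[[str, str], bool],
) -> Tuple[str, bool]:
    """Apply Eq. (3) and return (R_init, hard_reset_triggered)."""
    r_init = lm(query, None)           # R_init ~ P_LM(y | Q), Eq. (2)
    context = retrieve(query)          # R(Q)
    t_model = extract_topic(r_init)    # T_model
    t_rag = extract_topic(context)     # T_rag
    if not is_consistent(t_model, t_rag):
        # Hard Reset: regenerate the baseline conditioned on retrieved context.
        return lm(query, context), True
    return r_init, False


# Toy stand-ins (hypothetical), used only to exercise the control flow.
def toy_lm(query: str, context: Optional[str]) -> str:
    if context is None:
        return "John Smith (footballer) played as a striker."
    return "John Smith (physicist) studied optics."


def toy_retrieve(query: str) -> str:
    return "John Smith (physicist) was a researcher in optics."


def toy_topic(text: str) -> str:
    # Naive extractor: everything up to and including the first ")".
    return text.split(")")[0] + ")"


def toy_consistent(a: str, b: str) -> bool:
    return a.strip().lower() == b.strip().lower()


response, reset = entity_firewall(
    "Who is John Smith?", toy_lm, toy_retrieve, toy_topic, toy_consistent
)
```

In this toy run the model's topic ("footballer") disagrees with the retrieved topic ("physicist"), so the reset branch fires and the response is regenerated from the retrieved context.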

### 4.2 Phase 2: Fact Decomposition and Sparse Verification

![Refer to caption](https://arxiv.org/html/2605.28837v1/x2.png)

Figure 2: Semantic Tanner Graph Structure. The bottom nodes (Variable Nodes) represent atomic facts extracted from sentences, while the top nodes (Check Nodes) represent grouped verification queries. This bipartite structure enables efficient sparse verification.

To perform granular error correction, we first decompose the continuous signal $R_{\text{init}}$ into discrete semantic symbols. The baseline response $R_{\text{init}}$ is decomposed into a set of sentences $S = \{s_1, \dots, s_n\}$. For each sentence $s_k$, we extract a subset of atomic facts $F_k = \{f_{k,1}, \dots, f_{k,m_k}\}$ using a dedicated prompt (see Supp. 1.2).

Tanner Graph Construction: We map the verification problem to a Semantic Tanner Graph $G = (V, N_c, \mathcal{A})$, analogous to low-density parity-check (LDPC) codes, where $\mathcal{A}$ represents the set of arcs connecting facts to verification queries, as illustrated in Figure [2](https://arxiv.org/html/2605.28837#S4.F2):

Variable Nodes ($V$): Represent individual atomic facts $f_{k,i}$ (bottom nodes).

Check Nodes ($N_c$): Represent verification queries (top nodes). To optimize computational cost, we adopt a sparse verification strategy. Instead of verifying each fact independently, we generate a single comprehensive query $q_k = \text{GenQ}(F_k)$ for each sentence group $F_k$, which functions as the Check Node constraint (details in Supp. 1.3).
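To make the bipartite structure concrete, the sketch below builds $G = (V, N_c, \mathcal{A})$ from per-sentence fact groups and compares the number of retrieval queries under dense (one per fact) and sparse (one per sentence group) verification. The class and function names are illustrative assumptions, and skipping sentences with no extracted facts is a choice made here, not something the paper specifies.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

FactId = Tuple[int, int]  # (k, i): i-th fact of the k-th sentence, 1-based


@dataclass
class SemanticTannerGraph:
    variable_nodes: List[FactId] = field(default_factory=list)    # V
    check_nodes: List[int] = field(default_factory=list)          # N_c, one q_k per group
    arcs: List[Tuple[FactId, int]] = field(default_factory=list)  # A


def build_tanner_graph(facts_per_sentence: List[List[str]]) -> SemanticTannerGraph:
    """Create one check node per non-empty fact group F_k and link its facts to it."""
    graph = SemanticTannerGraph()
    for k, facts in enumerate(facts_per_sentence, start=1):
        if not facts:
            continue  # assumption: a sentence with no atomic facts needs no query
        graph.check_nodes.append(k)
        for i in range(1, len(facts) + 1):
            graph.variable_nodes.append((k, i))
            graph.arcs.append(((k, i), k))
    return graph


facts = [
    ["Einstein was born in Germany", "Einstein was born in 1879"],
    ["Einstein published the theory of relativity"],
    ["Einstein won the Nobel Prize", "The prize was awarded in 1921",
     "The prize was for the photoelectric effect"],
]
g = build_tanner_graph(facts)
dense_queries = len(g.variable_nodes)  # one retrieval query per fact
sparse_queries = len(g.check_nodes)    # one retrieval query per sentence group
print(dense_queries, sparse_queries)   # prints: 6 3
```

With six facts spread over three sentences, grouped verification issues three queries instead of six; the saving grows with the average number of facts per sentence.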

Syndrome Detection: Using the query $q_k$, we retrieve external context via $\mathcal{R}$ and generate a concise evidence summary $E_k$ using the backbone model. The validity of each fact is then evaluated by a verifier $\mathcal{V}$ (refer to Supp. 1.3 for the verdict prompt) to compute the Semantic Syndrome $\sigma_{k,i}$:

$$\sigma_{k,i} = \mathcal{V}(f_{k,i}, E_k) \in \{\text{SUP}, \text{CON}, \text{NF}\} \tag{4}$$

where SUP (Supported) indicates a valid fact, CON (Contradicted) indicates a hallucination, and NF (Not Found) indicates unverifiable information. Contradicted facts are collected in a syndrome buffer $\mathcal{B}_{\text{syn}}$, while unverifiable facts are assigned to a deletion set $\mathcal{S}_{\text{del}}$ for pruning.
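The routing of verdicts from Eq. (4) into $\mathcal{B}_{\text{syn}}$ and $\mathcal{S}_{\text{del}}$ can be sketched as below. The `verifier` argument is a hypothetical stand-in for $\mathcal{V}$ (an LLM judge in the paper); the lookup-table verifier here is only a toy so that the routing logic is testable.

```python
from enum import Enum
from typing import Callable, Dict, List, Tuple


class Syndrome(Enum):
    SUP = "supported"
    CON = "contradicted"
    NF = "not_found"


def detect_syndromes(
    fact_groups: Dict[int, List[str]],
    evidence: Dict[int, str],
    verifier: Callable[[str, str], Syndrome],
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Return (syndrome buffer B_syn, deletion set S_del) as ordered lists."""
    syndrome_buffer: List[Tuple[str, str]] = []
    deletion_set: List[str] = []
    for k, facts in fact_groups.items():
        e_k = evidence[k]  # one shared evidence summary per sentence group
        for fact in facts:
            sigma = verifier(fact, e_k)
            if sigma is Syndrome.CON:
                syndrome_buffer.append((fact, e_k))  # kept with evidence for correction
            elif sigma is Syndrome.NF:
                deletion_set.append(fact)            # marked for pruning
    return syndrome_buffer, deletion_set


# Toy verifier (hypothetical): a fixed lookup table; unknown facts count as NF.
toy_table = {
    "Einstein was born in Germany": Syndrome.SUP,
    "Einstein was born in 1900": Syndrome.CON,
}


def toy_verifier(fact: str, evidence_text: str) -> Syndrome:
    return toy_table.get(fact, Syndrome.NF)


groups = {1: ["Einstein was born in Germany", "Einstein was born in 1900",
              "Einstein enjoyed chess"]}
evidence = {1: "Albert Einstein was born in Ulm, Germany, in 1879."}
b_syn, s_del = detect_syndromes(groups, evidence, toy_verifier)
```

Supported facts pass through untouched; only contradicted facts carry their evidence forward, since that evidence is what the correction step needs.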

### 4.3 Phase 3: Logic Propagation via BP-inspired Heuristic

The final phase corrects detected errors and reconstructs the text. We define the correction process as a function $\text{BP}(\mathcal{B}_{\text{syn}})$, which outputs a Correction Map $\Phi$. This map assigns a corrected fact $f'_{k,i}$ to each $f_{k,i}$ associated with a non-zero syndrome.

While traditional BP algorithms rely on iterative message passing until convergence, such recursion is computationally prohibitive in the semantic domain due to the high latency of LLM inference. To address this, we propose a Single-Pass Unfolding of the belief propagation algorithm. We propagate the “beliefs” of verified facts (acting as clamped variable nodes) exactly once through the semantic dependency graph. This acts as a first-order approximation of BP, capturing the core mechanism of message passing (updating dependent variables based on reliable evidence) while maintaining inference efficiency. Crucially, we empirically observed that semantic dependencies in hallucinations are predominantly local (1-hop); thus, a single-pass approximation provides sufficient error-correction capability without the need for iterative convergence.
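The difference between a single pass and iteration to convergence can be illustrated with an explicit dependency map. This is only a sketch under a strong assumption: the paper realizes propagation through prompting, whereas the `depends_on_me` dictionary and the function below are hypothetical devices to show what "1-hop only" means.

```python
from typing import Dict, List, Set


def single_pass_dependents(
    corrected: Set[str],
    depends_on_me: Dict[str, List[str]],
) -> Set[str]:
    """Facts to revisit after one propagation step (1-hop neighbours only)."""
    to_revisit: Set[str] = set()
    for fact in corrected:
        for dependent in depends_on_me.get(fact, []):
            if dependent not in corrected:
                to_revisit.add(dependent)
    # No second sweep: dependents of dependents are deliberately not followed.
    return to_revisit


# f1 -> f2 -> f3 is a two-step dependency chain.
deps = {"f1": ["f2"], "f2": ["f3"]}
revisit = single_pass_dependents({"f1"}, deps)  # {"f2"}; "f3" is not reached
```

An iterative scheme would keep sweeping until no new facts are added, which here would also reach `f3`; the single pass trades that coverage for one round of LLM calls.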

Step 1: Local Belief Update. Based on the syndrome $\sigma_{k,i}$, we update the belief of each variable node (fact) to produce a corrected fact $f'_{k,i}$:

$$f'_{k,i} = \begin{cases} \text{Correct}(f_{k,i}, E_k) & \text{if } \sigma_{k,i} = \text{CON} \\ \emptyset \text{ (Prune)} & \text{if } \sigma_{k,i} = \text{NF} \\ f_{k,i} & \text{if } \sigma_{k,i} = \text{SUP} \end{cases} \tag{5}$$

Crucially, we employ a Logic Propagation prompt. If a fundamental attribute is corrected, this change propagates to logically dependent facts within the same group using the shared evidence $E_k$.

Step 2: Sequential Reconstruction. To ensure global coherence, the text is reconstructed sequentially. To prevent the regurgitation of hallucinations caused by anchoring bias, we adopt a Fact-to-Text generation strategy, explicitly excluding the original sentence $S_k$. Consequently, the generation of the corrected sentence $s'_k$ is conditioned solely on the corrected facts $F'_k$ and the history of previously reconstructed sentences $H_{k-1}$ (represented as $C_{\text{draft}}$ in Algorithm [1](https://arxiv.org/html/2605.28837#alg1)):

$$s'_k = LM_{\text{rewrite}}(F'_k \mid H_{k-1}) \tag{6}$$

This auto-regressive reconstruction ensures that local corrections propagate through the narrative flow. Finally, a lightweight Polishing module smooths the reconstructed draft to yield the final response $C_{\text{final}}$ (see Supp. 1.4 for details).
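Equations (5) and (6) can be sketched together as a per-fact update followed by a sequential rewrite. The `correct` and `rewrite` callables are hypothetical stand-ins for the LLM prompts, and skipping a sentence whose facts were all pruned is an assumption made here rather than a documented behavior.

```python
from typing import Callable, List, Optional, Tuple


def local_belief_update(
    fact: str,
    syndrome: str,
    evidence: str,
    correct: Callable[[str, str], str],
) -> Optional[str]:
    """Eq. (5): correct CON facts, prune NF facts (None), keep SUP facts."""
    if syndrome == "CON":
        return correct(fact, evidence)
    if syndrome == "NF":
        return None
    if syndrome == "SUP":
        return fact
    raise ValueError(f"unknown syndrome: {syndrome}")


def reconstruct(
    corrected_groups: List[List[str]],
    rewrite: Callable[[List[str], List[str]], str],
) -> List[str]:
    """Eq. (6): s'_k = LM_rewrite(F'_k | H_{k-1}); the original sentence is never passed in."""
    history: List[str] = []
    for facts in corrected_groups:
        if not facts:
            continue  # assumption: nothing to write if every fact was pruned
        history.append(rewrite(facts, list(history)))
    return history


# Toy stand-ins (hypothetical): the "correction" copies the evidence verbatim,
# and the "rewrite" joins the facts into one sentence.
def toy_correct(fact: str, evidence: str) -> str:
    return evidence


def toy_rewrite(facts: List[str], history: List[str]) -> str:
    return " and ".join(facts) + "."


verdicts: List[List[Tuple[str, str, str]]] = [
    [("Einstein was born in France", "CON", "Einstein was born in Germany")],
    [("Einstein enjoyed chess", "NF", "")],
    [("Einstein published the theory of relativity", "SUP", "")],
]
corrected_groups: List[List[str]] = []
for group in verdicts:
    updated = [local_belief_update(f, s, e, toy_correct) for f, s, e in group]
    corrected_groups.append([f for f in updated if f is not None])

draft = reconstruct(corrected_groups, toy_rewrite)
```

Because `rewrite` receives only the corrected facts and the running history, a hallucinated phrasing in the original sentence has no path back into the draft.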

Algorithm 1: Compact SERC Execution Flow

Input: Query $Q$, Model $LM$, Retriever $\mathcal{R}$

Output: Final Response $C_{\text{final}}$

1. Alignment: $R_{\text{init}} \leftarrow LM(Q)$
   - if $\text{Consistency}(\mathcal{T}(R_{\text{init}}), \mathcal{T}(\mathcal{R}(Q))) = \text{False}$ then
     - $R_{\text{init}} \leftarrow LM(Q, \mathcal{R}(Q))$ ▷ Firewall (Hard Reset)
   - end if
2. Detection: $S \leftarrow \text{Split}(R_{\text{init}})$; $\mathcal{B}_{\text{syn}} \leftarrow \emptyset$; $\mathcal{S}_{\text{del}} \leftarrow \emptyset$
   - for each $S_k \in S$ do
     - $F_k \leftarrow \text{Facts}(S_k)$; $E_k \leftarrow LM(\text{GenQ}(F_k) \mid \mathcal{R}(\text{GenQ}(F_k)))$
     - for $f \in F_k$ do
       - $\sigma \leftarrow \mathcal{V}(f, E_k)$ ▷ Calculate Syndrome
       - if $\sigma = \text{CON}$ then $\mathcal{B}_{\text{syn}}.\text{add}(f, E_k)$
       - if $\sigma = \text{NF}$ then $\mathcal{S}_{\text{del}}.\text{add}(f)$ ▷ Mark for deletion
     - end for
   - end for
3. Correction: $\Phi \leftarrow \text{BP}(\mathcal{B}_{\text{syn}})$ ▷ Compute Correction Map
   - $\forall f \in \mathcal{S}_{\text{del}},\ \Phi[f] \leftarrow \emptyset$ ▷ Map NF to empty
4. Reconstruction: $C_{\text{draft}} \leftarrow \emptyset$
   - for each $S_k \in S$ do
     - $F'_k \leftarrow \{\Phi[f] \text{ if } f \in \text{dom}(\Phi) \text{ else } f \mid f \in F_k\}$
     - $C_{\text{draft}}.\text{append}(LM_{\text{rewrite}}(F'_k \mid C_{\text{draft}}))$
   - end for
   - return $\text{Polish}(Q, C_{\text{draft}})$

Definitions:

- $\mathcal{T}(\cdot)$: Topic Extractor
- $\text{Consistency}(\cdot)$: Entity Consistency Checker
- $\text{Facts}(\cdot)$: Fact Decomposition
- $\text{GenQ}(\cdot)$: Verification Query Generator
- $\mathcal{V}(\cdot)$: Fact Verifier
- $\text{BP}(\cdot)$: Belief Propagation Correction
- $\Phi$: Correction Map
- $LM_{\text{rewrite}}(\cdot)$: Fact-to-Text Generator

## 5 Experiment

### 5.1 Overview and Model Selection

We evaluate SERC’s effectiveness regarding factual precision, token efficiency, and scalability across model scales. We utilize Llama-3-8B-Instruct and Qwen2.5-14B-Instruct as backbone models. Llama-3-8B serves as a testbed for knowledge-deficient SLMs on consumer GPUs, while Qwen2.5-14B, with its expanded capacity and distinct training distribution, verifies the scalability of our semantic correction mechanism.

### 5.2 Benchmarks and Evaluation Metrics

Experiments use a generation temperature of 0 to ensure reproducibility.

LongForm Bio & FactScore: We evaluate long-form factual consistency using FActScore [[14](https://arxiv.org/html/2605.28837#bib.bib10)], which decomposes text into atomic propositions and verifies them against a reliable knowledge base.

Fact Preservation Rate [[14](https://arxiv.org/html/2605.28837#bib.bib10)]: To distinguish information refinement from defensive deletion, we measure the ratio of facts retained relative to the initial answer, ensuring SERC does not merely truncate uncertain content.

TruthfulQA & LLM-as-a-Judge: We assess the model’s tendency to mimic human falsehoods [[12](https://arxiv.org/html/2605.28837#bib.bib9)]. We replace traditional selection-based metrics (MC1/MC2) with the LLM-as-a-Judge paradigm [[25](https://arxiv.org/html/2605.28837#bib.bib11)]. To mitigate bias, Gemini-3-Pro (gemini-3-pro-preview) and GPT-5.1 (gpt-5.1-2025-11-13) serve as dual evaluators to ensure objective scoring [[25](https://arxiv.org/html/2605.28837#bib.bib11)] (rubric and prompts are in Supp. Sec. 2).

Cost Analysis: Total token usage is analyzed and compared against exhaustive baselines like RARR [[5](https://arxiv.org/html/2605.28837#bib.bib48)] to demonstrate the efficiency of our low-density verification strategy.
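The two fact-level metrics above can be read as simple ratios. The sketch below assumes that the Fact Preservation Rate is the number of atomic facts in the final answer divided by the number in the initial answer, and that a FActScore-style precision is the share of atomic facts judged supported; the counts are hypothetical and chosen only to show values below and above 100%.

```python
def fact_preservation_rate(n_initial_facts: int, n_final_facts: int) -> float:
    """Ratio of atomic facts in the final answer to those in the initial answer."""
    if n_initial_facts <= 0:
        raise ValueError("the initial answer must contain at least one fact")
    return n_final_facts / n_initial_facts


def supported_fraction(n_supported: int, n_total: int) -> float:
    """Share of atomic facts judged supported (a FActScore-style precision)."""
    if n_total <= 0:
        raise ValueError("at least one fact is required")
    return n_supported / n_total


# Hypothetical counts for illustration only.
pruned = fact_preservation_rate(40, 30)   # 0.75: unverifiable facts were removed
refined = fact_preservation_rate(40, 44)  # above 1: vague facts split into specific ones
precision = supported_fraction(27, 30)    # 0.9 of the remaining facts are supported
```

Reading the two numbers together is the point of the metric pair: a high preservation rate with low precision suggests hallucinations were kept, while a low rate with high precision suggests defensive deletion.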

### 5.3 Baselines and Comparative Methods

To verify the effectiveness of SERC, we compare it against the following baselines and state-of-the-art methodologies (prompts in Supp. Sec. 4):

Initial Answer: The raw, uncorrected response generated by the backbone model (Llama-3-8B or Qwen2.5-14B).

CoVe (Chain-of-Verification) [[3](https://arxiv.org/html/2605.28837#bib.bib3)]: An intrinsic self-correction method that uses a multi-step reasoning process (generating verification questions and answering them) without external tools.

CoVe+RAG: An augmented version of CoVe provided with the same external knowledge as SERC, used to isolate the performance gains attributable to our ECC-inspired framework versus simple retrieval.

RARR (Research and Revise) [[5](https://arxiv.org/html/2605.28837#bib.bib48)]: A retrieval-augmented baseline that exhaustively researches and revises each claim in the generated text using external evidence.

Re-Ex (Revising after Explanation) [[9](https://arxiv.org/html/2605.28837#bib.bib49)]: A baseline focused on the iterative re-extraction and verification of claims from retrieved contexts to ensure maximum factual grounding.

### 5.4 Implementation Details

Experiments were conducted on a single NVIDIA A100 (80GB) GPU using the transformers library. Backbone models were loaded in 16-bit (bfloat16) precision to reduce memory usage. For retrieval, we utilized the Tavily Search API with an “advanced” search depth and a top-$k$ of 8, truncating retrieved contexts at 20,000 characters to prevent overflow.
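The retrieval limits stated above (top-$k$ of 8, 20,000-character cap) can be sketched as a small context-assembly helper. This does not call the Tavily API; it assumes the search results are already available as a ranked list of text snippets, and it assumes the character cap applies to the concatenated context rather than to each result, which the text does not specify.

```python
from typing import List

TOP_K = 8                   # number of search results kept, as stated in the text
MAX_CONTEXT_CHARS = 20_000  # character cap on the retrieved context


def build_context(
    snippets: List[str],
    top_k: int = TOP_K,
    max_chars: int = MAX_CONTEXT_CHARS,
) -> str:
    """Join the first top_k snippets and cut the result at max_chars characters."""
    joined = "\n\n".join(snippets[:top_k])
    return joined[:max_chars]


# Ten hypothetical results of 5,000 characters each: only eight are kept,
# and their concatenation is then truncated to the cap.
context = build_context(["a" * 5000] * 10)
```

Truncating by characters is a blunt guard against context overflow; a token-based cap would track the model's limit more closely, at the cost of running the tokenizer.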

The pipeline utilized deterministic decoding ($T = 0.0$) for most phases, including fact extraction, verification, and reconstruction, to ensure factual consistency. Temperature $T = 0.1$ was applied exclusively during the final polishing stage to enhance linguistic fluency, with `max_new_tokens=512` for all generative tasks. Further implementation details are provided in the supplement and our GitHub repository, including prompt templates, the TruthfulQA evaluation protocol, baseline method prompts, configuration and hyperparameters, failure cases, examples of extracted facts, generated queries, and semantic noise injection.

Table 1: Main results on LongForm Bio and TruthfulQA. Fact Preservation Rate (in parentheses) indicates the ratio of facts retained compared to the Initial Answer. SERC demonstrates superior preservation (e.g., 106.5% for 14B), indicating it corrects errors by refining information rather than deleting it.

## 6 Results and Analysis

### 6.1 Main Results

The experimental results demonstrate that the proposed SERC framework achieves the highest quantitative metrics across both benchmarks and model scales. As detailed in Table [1](https://arxiv.org/html/2605.28837#S5.T1), SERC significantly enhances factual precision; specifically, the Llama-3-8B model showed a $43.4\%$ relative improvement ($0.5976 \rightarrow 0.8568$) and the Qwen2.5-14B model exhibited a $68\%$ relative improvement ($0.4850 \rightarrow 0.8146$) in FactScore. On TruthfulQA, the accuracy improved by $30.0$ percentage points for the 8B model and $17.5$ percentage points for the 14B model.
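The FactScore percentages quoted above are consistent with gains measured relative to the initial score, which the following arithmetic checks from the reported before and after values:

```latex
\begin{align*}
\text{Llama-3-8B:}\quad
  &\frac{0.8568 - 0.5976}{0.5976} = \frac{0.2592}{0.5976} \approx 0.434
  \;\Rightarrow\; 43.4\% \\[4pt]
\text{Qwen2.5-14B:}\quad
  &\frac{0.8146 - 0.4850}{0.4850} = \frac{0.3296}{0.4850} \approx 0.680
  \;\Rightarrow\; 68\%
\end{align*}
```

In absolute terms these correspond to gains of about 25.9 and 33.0 FactScore points respectively, which is the scale on which the TruthfulQA improvements are reported.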

Domain Optimization and Inference\-time Alignment\.The initial FactScore of Qwen2\.5\-14B\-Instruct \(0\.48500\.4850\) was lower than that of Llama\-3\-8B\-Instruct \(0\.59760\.5976\)\. This disparity is attributed to the models’ distinct optimization focuses: while Llama\-3 is highly tuned for general English reasoning, Qwen2\.5 excels in multilingual, mathematical, and coding domains\. Despite this pre\-training domain bias, SERC successfully bridged the gap, boosting Qwen’s performance to0\.81460\.8146\. This empirically proves that SERC acts as a robust inference\-time alignment layer, ensuring high factual reliability regardless of the backbone model’s specific optimization\.

Resolution of Ambiguity and Granular Correction\.A critical insight lies in the analysis of Fact Preservation\. While iterative methods like CoVe often suffer from a decrease in fact count \(defensive deletion\), SERC maintains a robust volume of validated facts\.

Specifically, for the 8B model, while Re\-Ex exhibits a higher preservation rate \(98\.6%98\.6\\%\) than SERC \(77\.7%77\.7\\%\), its significantly lower FactScore \(0\.76360\.7636vs\.0\.85680\.8568\) implies a failure to filter out hallucinations \(Over\-preservation\)\. In contrast, SERC demonstratesselective preservation, effectively pruning semantic noise\.

Furthermore, for the 14B model, the preservation rate reaches 106\.5%\. This increase does not imply hallucinated expansion, but rather thegranular refinementof vague assertions\. For instance, when SERC corrects a broad claim \(e\.g\., "He was a politician"\) into specific, verified roles \(e\.g\., "He was a diplomat and a scholar"\), the number of atomic facts naturally increases\. This confirms that SERC functions as a high\-resolution error\-corrector that restores semantic fidelity\.

Table 2:Comparison of average token usage and performance efficiency on LongForm Bio \(Llama\-3\-8B\)\. Although SERC incurs a higher token cost than baselines, it provides a justified trade\-off by achieving significantly superior factual precision\.Cost\-Benefit Analysis: The Price of Reliability\.Table[2](https://arxiv.org/html/2605.28837#S6.T2)presents the computational cost analysis\. While SERC incurs a higher token overhead \(39,49239,492\) compared to baselines like RARR \(29,65829,658\) or Re\-Ex \(27,35227,352\), this increase is a justifiedReliability Costfor deep semantic verification\.

Unlike RARR, which often performs surface\-level edits, SERC engages in a rigorous reconstruction process via the LDPC\-inspired framework\. This investment translates directly into a substantial performance gap; SERC surpasses RARR by approximately \+5\.7 points in FactScore \(8B\) and \+5\.0 points \(14B\)\. Consequently, SERC occupies the optimal trade\-off point, delivering state\-of\-the\-art fidelity that outweighs the moderate increase in computational resources\.
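As a rough way to quantify this trade-off, the sketch below computes SERC's relative token overhead from the Table 2 figures quoted above, plus the FactScore points gained over RARR per 1,000 extra tokens for the 8B model. The per-1k-token metric is an illustrative assumption and is not reported in the paper.

```python
serc_tokens = 39_492
baseline_tokens = {"RARR": 29_658, "Re-Ex": 27_352}

# Relative token overhead of SERC over each baseline, in percent.
overhead_pct = {
    name: round(100.0 * (serc_tokens - tokens) / tokens, 1)
    for name, tokens in baseline_tokens.items()
}

# FactScore points gained over RARR per 1,000 extra tokens (8B),
# using the approximate +5.7-point gap stated in the text.
extra_k_tokens = (serc_tokens - baseline_tokens["RARR"]) / 1000
points_per_1k = round(5.7 / extra_k_tokens, 2)

print(overhead_pct)   # {'RARR': 33.2, 'Re-Ex': 44.4}
print(points_per_1k)  # 0.58
```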

Table 3: Performance degradation when RAG is disabled. Reliance on intrinsic knowledge fails to improve factuality, leading to score regression.

Table 4: Failure case study where local correction is applied without coarse-level alignment. The model attempts to patch individual facts, resulting in a contradictory Semantic Chimera.

### 6.2 Ablation Study

To validate our structural rationale, we conduct an ablation study on 20 random LongForm Bio instances\. As shown in Tables 3–5, removing RAG or coarse\-grained alignment causes significant score regression and semantic inconsistencies\. Conversely, our low\-density verification maintains high precision while being 45\.6% more cost\-efficient than exhaustive baselines, confirming the synergy of each component in the SERC framework\.

RAG module. To demonstrate the necessity of external knowledge, we deactivated the RAG module (Table [3](https://arxiv.org/html/2605.28837#S6.T3)). The results indicate that correction attempts relying solely on intrinsic knowledge lead to quality degradation; for the 8B model, FactScore dropped from 0.6825 to 0.6695. This suggests that SLMs cannot overcome self-bias without a reliable external reference. Thus, RAG is a prerequisite for the SERC framework.

Coarse-grained alignment. As illustrated in Table [4](https://arxiv.org/html/2605.28837#S6.T4), omitting coarse alignment results in a semantic chimera. While local facts (e.g., birth dates) were corrected, the incorrect context (e.g., football career) persisted. This illustrates that local “bit flips” cannot restore the message if the initial codeword deviates too far from the truth manifold. Coarse-grained alignment serves as an essential fail-safe to force the global context onto the correct trajectory.

Table 5: Efficiency and Performance Comparison. Initial represents the raw LLM output before correction. SERC significantly reduces token costs while achieving higher FactScore than the 1:1 verification baseline (High-Density). Note. Reduct.: Cost Reduction (%). Pres.%: Preservation Rate relative to the initial answer.

Low-Density Verification. We compare SERC's sparse verification with a High-Density (1:1) baseline that verifies each atomic fact (Table [5](https://arxiv.org/html/2605.28837#S6.T5)). Although dense verification can increase coverage, it substantially increases prompt length and token usage due to repeated retrieval and redundant reasoning over overlapping fact clusters. In practice, High-Density consumed 72,634 tokens (8B) and 54,407 tokens (14B), whereas SERC reduced the cost to 39,492 (−45.6%) and 36,533 (−32.9%) tokens, yielding a more practical performance–cost trade-off.

Moreover, dense verification may introduce excessive evidence and correction signals, increasing over-correction risk. For 8B, Dense yields only a modest FactScore gain (0.6572 → 0.7002) while sharply reducing Avg Facts (23.55 → 14.50), suggesting defensive deletion. While Dense improves 14B more strongly (0.5052 → 0.8128; Avg Facts 21.25 → 24.75), the cost remains disproportionately high and the process becomes overly fragmented. By grouping related facts and enforcing key constraints via sparse checks, SERC reduces redundant signals and promotes more stable refinement.
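The token reductions reported for sparse versus High-Density verification can be reproduced directly from the token counts given in the text. The helper name below is hypothetical; the figures are the ones quoted for Table 5.

```python
def cost_reduction(dense_tokens: int, sparse_tokens: int) -> float:
    """Percentage of tokens saved by sparse verification relative to dense."""
    return 100.0 * (dense_tokens - sparse_tokens) / dense_tokens


# (High-Density tokens, SERC tokens) as reported in the text.
reported = {"8B": (72_634, 39_492), "14B": (54_407, 36_533)}

reductions = {
    model: round(cost_reduction(dense, sparse), 1)
    for model, (dense, sparse) in reported.items()
}
print(reductions)  # {'8B': 45.6, '14B': 32.9}
```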

### 6.3 Qualitative Analysis: Logic Propagation

Table 6: Example of Logic Propagation. SERC updates the dependent context (“fast bowling” to “ball-carrying”) after correcting the root entity type (“cricket” to “rugby”), ensuring global semantic coherence.

To further investigate the internal mechanisms of SERC, we perform a qualitative analysis of its logic propagation capability (full traces in Supp. Sec. 3). Unlike baseline methods that often perform isolated local corrections, SERC ensures global semantic consistency by approximating the belief propagation (BP) algorithm.

As shown in Table [6](https://arxiv.org/html/2605.28837#S6.T6), when a root entity is corrected (e.g., from a “cricket player” to a “rugby player”), SERC recognizes that dependent attributes, such as specific skills or team associations, must also be updated to maintain logical coherence. Without this mechanism, the model produces a “Semantic Chimera,” where factually correct updates (e.g., identity) coexist with hallucinated contexts (e.g., “fast bowling” skills remaining for a rugby player). This result supports the claim that SERC's information-theoretic approach restores the intended information manifold from noisy semantic observations.
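The dependency idea behind this example can be sketched as a traversal over a fact graph: once a root fact is corrected, every fact that depends on it is flagged for re-verification. This is a deliberately simplified breadth-first propagation, not the belief propagation approximation SERC uses, and the graph structure and fact names are hypothetical.

```python
from collections import deque


def propagate_correction(dependents: dict[str, list[str]], corrected_root: str) -> list[str]:
    """Return the facts to re-verify after `corrected_root` changes, in BFS order."""
    stale: list[str] = []
    seen = {corrected_root}
    queue = deque([corrected_root])
    while queue:
        node = queue.popleft()
        for child in dependents.get(node, []):
            if child not in seen:
                seen.add(child)
                stale.append(child)
                queue.append(child)
    return stale


# Hypothetical dependency graph: which facts rely on which other facts.
dependents = {
    "sport": ["skill", "team"],  # skill and team only make sense given the sport
    "team": ["league"],
    "birth_date": [],            # independent of the sport
}

to_recheck = propagate_correction(dependents, "sport")
print(to_recheck)  # ['skill', 'team', 'league']
```

Correcting "sport" from cricket to rugby flags the skill (such as "fast bowling") for re-verification, while the independent birth date is left alone.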

## 7 Conclusion

This study proposes SERC (LDPC-inspired semantic error correction for retrieval-augmented generation), a novel correction framework inspired by error-correcting codes (ECC), to address the critical challenge of hallucinations in large language models (LLMs). By reconceptualizing text generation as a semantic noisy channel, we treat generated responses as corrupted codewords and systematically perform error detection and restoration based on information-theoretic principles. Experimental results show that SERC achieves state-of-the-art performance across benchmarks, marking a substantial improvement in factual precision compared to strong retrieval-augmented baselines. Notably, in an 8B-scale small language model (SLM) setting, SERC attains higher factual precision than a 14B model equipped with CoVe+RAG, demonstrating robust error correction that is not dependent on parameter scale. In addition, SERC exhibits a high information preservation rate by replacing incorrect information with verified facts rather than simply deleting it, thereby offering practical value for high-fidelity generation.

The academic contributions of this work are twofold. First, we establish a new theoretical foundation by interpreting the deep learning-based generation process as a semantic noisy channel model. By transplanting LDPC mechanisms such as sparse parity-check matrices and syndrome decoding into the text domain, we provide a principled basis for systematic hallucination mitigation. Second, we implement a model-agnostic, training-free methodology applicable at the prompt level, empirically showing that reliable response generation is achievable even in resource-constrained SLM environments. Future work will extend this framework to multimodal channels and optimize the verification density for real-time applications.

## 8 Limitations and Ethical Considerations

Operational Cost, Latency, and Environmental Impact. While achieving a 45.6% token reduction over exhaustive methods, SERC's multi-step pipeline incurs higher token overhead and inference latency than vanilla generation (Table [2](https://arxiv.org/html/2605.28837#S6.T2)). This increased computational and environmental cost makes SERC currently better suited for offline rather than real-time applications. To mitigate latency without compromising precision, future work will investigate Check Node parallelization within the Tanner Graph, evaluating sparse parity-checks independently.
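A minimal sketch of what independent check evaluation could look like is given below. It assumes each check covers a small group of facts and can be evaluated without reference to other checks; a dictionary lookup stands in for the retrieval and LLM call a real check would make. All names and data here are hypothetical and are not part of the paper's implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def run_checks_in_parallel(checks, facts, evidence, max_workers=4):
    """Evaluate each sparse check independently; return indices of failed checks."""

    def evaluate(check):
        # A check passes only if every fact it covers agrees with the evidence.
        return all(facts[key] == evidence.get(key) for key in check)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(evaluate, checks))  # map preserves input order
    return [i for i, passed in enumerate(results) if not passed]


# Hypothetical generated facts and retrieved evidence.
facts = {"sport": "cricket", "skill": "fast bowling", "birth_year": "1985"}
evidence = {"sport": "rugby", "skill": "ball-carrying", "birth_year": "1985"}

# Each check covers a small group of related facts (low density).
checks = [("sport", "skill"), ("birth_year",)]

failed = run_checks_in_parallel(checks, facts, evidence)
print(failed)  # [0]
```

Because the checks share no state, wall-clock latency in such a design would be bounded by the slowest check rather than the sum of all checks, although the total token cost stays the same.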

Dependency on External Knowledge and Bias. SERC's reliability heavily depends on the RAG module's knowledge base, risking the propagation of biased or incorrect retrievals. Furthermore, reliance on Western-centric benchmarks (e.g., LongForm Bio, TruthfulQA) leaves performance in culturally diverse or non-English settings unverified, potentially leading to demographic disparities.

Domain, Modality Constraints, and Generalization. Currently optimized for entity-centric natural language QA, SERC requires adaptations for structured domains like mathematics or code. However, abstracting "atomic facts" into broader "logical semantic units" enables extending SERC's information-theoretic framework to symbolic reasoning. Applying this semantic channel model to such domains and cross-modal environments remains a critical future direction.

Balance between Precision and Verbosity. While high preservation rates (e.g., 106.5%) demonstrate effective granular correction, they risk verbosity. Information-theoretically, an ideal decoder reconstructs messages without extraneous details. Future iterations will incorporate length constraints to strictly align the corrected response's information density with the original codeword.

## References

- [1] A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi (2024) Self-RAG: learning to retrieve, generate, and critique through self-reflection. In International Conference on Learning Representations (ICLR).
- [2] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 33, pp. 1877–1901.
- [3] S. Dhuliawala, M. Komeili, J. Xu, R. Raileanu, X. Li, A. Celikyilmaz, and J. Weston (2023) Chain-of-verification reduces hallucination in large language models. arXiv preprint arXiv:2309.11495.
- [4] R. G. Gallager (1962) Low-density parity-check codes. IRE Transactions on Information Theory 8(1), pp. 21–28.
- [5] L. Gao, J. Schulman, and J. Hilton (2023) RARR: researching and revising what language models say, using language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 16477–16508.
- [6] J. Huang, X. Chen, S. Mishra, H. S. Zheng, A. W. Yu, X. Song, and D. Zhou (2024) Large language models cannot self-correct reasoning yet. In International Conference on Learning Representations (ICLR).
- [7] S. Jeong, J. Baek, S. Cho, S. J. Hwang, and J. C. Park (2024) Adaptive-RAG: learning to adapt retrieval-augmented large language models through question complexity. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 7036–7050.
- [8] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung (2023) Survey of hallucination in natural language generation. ACM Computing Surveys 55(12), pp. 1–38.
- [9] J. Kim et al. (2024) Re-Ex: revising after explanation reduces the factual errors in LLM responses. arXiv preprint arXiv:2402.17097. [Link](https://arxiv.org/abs/2402.17097)
- [10] D. P. Kingma and M. Welling (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
- [11] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, et al. (2020) Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems 33, pp. 9459–9474.
- [12] S. Lin, J. Hilton, and O. Evans (2022) TruthfulQA: measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 3214–3252.
- [13] A. Mallen, A. Asai, V. Zhong, R. Das, D. Khashabi, and H. Hajishirzi (2023) When not to trust language models: investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), pp. 9802–9822.
- [14] S. Min, K. Krishna, X. Lyu, M. Lewis, W. Yih, P. W. Koh, M. Iyyer, L. Zettlemoyer, and H. Hajishirzi (2023) FActScore: fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 12076–12100.
- [15] J. Pearl (1988) Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann Publishers.
- [16] Z. Qin, X. Tao, J. Liu, and G. Y. Li (2021) Semantic communications: principles and challenges. arXiv preprint arXiv:2201.01389.
- [17] C. E. Shannon (1948) A mathematical theory of communication. The Bell System Technical Journal 27(3), pp. 379–423.
- [18] R. M. Tanner (1981) A recursive approach to low complexity codes. IEEE Transactions on Information Theory 27(5), pp. 533–547.
- [19] K. Wang et al. (2024) LLMs know what they need: leveraging a missing information guided framework to empower retrieval-augmented generation. arXiv preprint arXiv:2404.14043.
- [20] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, et al. (2022) Emergent abilities of large language models. Transactions on Machine Learning Research.
- [21] L. Weidinger et al. (2021) Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359.
- [22] H. Xie, Z. Qin, G. Y. Li, and B. Juang (2021) Deep learning enabled semantic communication systems. IEEE Transactions on Signal Processing 69, pp. 2663–2675.
- [23] S. Yan, J. Gu, Y. Zhu, and Z. Ling (2024) Corrective retrieval augmented generation. arXiv preprint arXiv:2401.15884.
- [24] O. Yoran, T. Wolfson, O. Ram, and J. Berant (2024) Making retrieval-augmented language models robust to irrelevant context. In International Conference on Learning Representations (ICLR).
- [25] L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, et al. (2024) Judging LLM-as-a-judge with MT-bench and chatbot arena. In Advances in Neural Information Processing Systems (NeurIPS).
