Beyond Document Grounding: Span-Level Hallucination Detection over Code, Tool Output, and Documents

arXiv cs.CL Papers

Summary

This paper introduces a unified benchmark for span-level hallucination detection in RAG systems that extends beyond natural language to code, tool output, and structured documents, and presents a fine-tuned Qwen3.5-2B detector that outperforms existing methods on these new domains while remaining competitive on standard NLP benchmarks.

arXiv:2607.00895v1 Announce Type: new Abstract: Hallucination detection for retrieval-augmented generation (RAG) is usually evaluated on natural-language document evidence. However, grounded generation systems increasingly rely on structured inputs: source code, developer-tool output, markdown documents, tables, and repository metadata. We introduce a unified benchmark for span-level hallucination detection over code, tool output, structured documents, and existing natural-language RAG datasets. The benchmark is built by starting from grounded correct answers, injecting localized hallucinations with exact character labels, and validating the code test split with evidence-based review. Our fine-tuned Qwen3.5-2B detector reaches 0.689 span-F1 on the unified test set and 0.60 on the code-agent source, where it substantially outperforms LettuceDetect-large (0.17) and the strongest zero-shot LLM judges we evaluated (at most 0.22). The same model remains competitive on established natural-language benchmarks, with 81.8 RAGTruth example-F1 and 0.724 English PsiloQA IoU.

Cached at: 07/02/26, 05:39 AM

# Beyond Document Grounding: Span-Level Hallucination Detection over Code, Tool Output, and Documents
Source: [https://arxiv.org/html/2607.00895](https://arxiv.org/html/2607.00895)
Ádám Kovács¹, Bowei He²,³, Xue Liu²,³, István Boros¹, Szilveszter Tóth¹, Gábor Recski¹,⁴

¹KR Labs, ²MBZUAI, ³McGill University, ⁴TU Wien

Correspondence: kovacs@krlabs\.eu

###### Abstract

Hallucination detection for retrieval\-augmented generation \(RAG\) is usually evaluated on natural\-language document evidence\. However, grounded generation systems increasingly rely on structured inputs: source code, developer\-tool output, markdown documents, tables, and repository metadata\. We introduce a unified benchmark for span\-level hallucination detection over code, tool output, structured documents, and existing natural\-language RAG datasets\. The benchmark is built by starting from grounded correct answers, injecting localized hallucinations with exact character labels, and validating the code test split with evidence\-based review\. Our fine\-tuned Qwen3\.5\-2B detector reaches 0\.689 span\-F1 on the unified test set and 0\.60 on the code\-agent source, where it substantially outperforms LettuceDetect\-large \(0\.17\) and the strongest zero\-shot LLM judges we evaluated \(at most 0\.22\)\. The same model remains competitive on established natural\-language benchmarks, with 81\.8 RAGTruth example\-F1 and 0\.724 English PsiloQA IoU\.

## 1 Introduction

Retrieval\-augmented generation \(RAG\) grounds model outputs in external evidence \(Lewis et al\., [2020](https://arxiv.org/html/2607.00895#bib.bib2)\), but it does not remove the need for verification\. A generated answer can still contradict the retrieved context, introduce unsupported information, or cite a reference that is not present in the evidence\. Hallucination detection methods therefore ask whether an answer is supported by the supplied context, often at the level of examples, sentences, tokens, or spans \(Niu et al\., [2024](https://arxiv.org/html/2607.00895#bib.bib3); Rykov et al\., [2025](https://arxiv.org/html/2607.00895#bib.bib30); Vazquez et al\., [2025](https://arxiv.org/html/2607.00895#bib.bib31)\)\.

Most existing benchmarks and detectors study this problem in natural\-language RAG, where both the evidence and the answer are usually document text \(Niu et al\., [2024](https://arxiv.org/html/2607.00895#bib.bib3); Belyi et al\., [2025](https://arxiv.org/html/2607.00895#bib.bib7); Tang et al\., [2024](https://arxiv.org/html/2607.00895#bib.bib29); Kovács and Recski, [2025](https://arxiv.org/html/2607.00895#bib.bib1); Rykov et al\., [2025](https://arxiv.org/html/2607.00895#bib.bib30); Vazquez et al\., [2025](https://arxiv.org/html/2607.00895#bib.bib31)\)\. Real grounded\-generation systems are broader: coding agents work over repository files, git history, and test output \(Jimenez et al\., [2024](https://arxiv.org/html/2607.00895#bib.bib12)\); developer assistants summarize command output and tool observations \(Kovacs, [2026](https://arxiv.org/html/2607.00895#bib.bib13)\); and research or documentation systems retrieve markdown pages, tables, citations, and structured documents \(Recski et al\., [2026](https://arxiv.org/html/2607.00895#bib.bib14); Index, [2026](https://arxiv.org/html/2607.00895#bib.bib15)\)\. These settings are not well covered by current training data or evaluations: there is no shared span\-level benchmark that covers generated code, tool observations, and structured documents under the same verification task\.

We study post\-generation verification for this structured setting: given an answer that has already been produced, together with its request and context, a detector should flag the parts that the evidence does not support\. We frame this at the span level rather than as a whole\-answer accept/reject decision, because in code and tool output a single unsupported substring, such as a wrong field, a fabricated method name, a misreported value, or an invented section reference, can change program behavior or mislead a user while leaving the rest of the answer correct\. A verifier should therefore point to the unsupported substring, not just reject the answer\.

Existing hallucination\-detection benchmarks and models leave this setting only partially covered\. RAGTruth \(Niu et al\., [2024](https://arxiv.org/html/2607.00895#bib.bib3)\), Luna \(Belyi et al\., [2025](https://arxiv.org/html/2607.00895#bib.bib7)\), MiniCheck \(Tang et al\., [2024](https://arxiv.org/html/2607.00895#bib.bib29)\), and LettuceDetect \(Kovács and Recski, [2025](https://arxiv.org/html/2607.00895#bib.bib1)\) verify generated text against retrieved documents\. Code hallucination work studies generated snippets \(Tian et al\., [2025](https://arxiv.org/html/2607.00895#bib.bib8); Agarwal et al\., [2025](https://arxiv.org/html/2607.00895#bib.bib9)\), generation\-time divergence \(Jiang et al\., [2024](https://arxiv.org/html/2607.00895#bib.bib10)\), or agent trajectories \(Liu et al\., [2026](https://arxiv.org/html/2607.00895#bib.bib11)\)\. These resources are useful, but they do not provide one span\-level formulation that covers natural\-language RAG, structured documents, source code, and developer\-tool output\.

We address this gap with a unified benchmark spanning SWE\-bench code \(Jimenez et al\., [2024](https://arxiv.org/html/2607.00895#bib.bib12)\), developer\-tool output from Squeez \(Kovacs, [2026](https://arxiv.org/html/2607.00895#bib.bib13)\), ACL paper chunks \(Recski et al\., [2026](https://arxiv.org/html/2607.00895#bib.bib14)\), READMEs, Wikipedia markdown \(Index, [2026](https://arxiv.org/html/2607.00895#bib.bib15)\), RAGTruth \(Niu et al\., [2024](https://arxiv.org/html/2607.00895#bib.bib3)\), and PsiloQA \(Rykov et al\., [2025](https://arxiv.org/html/2607.00895#bib.bib30)\)\. The dataset construction uses a shared edit\-based labeling step: start from a correct grounded answer, inject a small localized hallucination, recover exact character offsets from the edit, and split by grounding source so the test set uses unseen repositories, papers, or articles\. On this benchmark we train two detector families\. Our best model is LettuceDetect\-Qwen\-2B, a fine\-tuned Qwen3\.5\-2B detector \(Team, [2026](https://arxiv.org/html/2607.00895#bib.bib18)\) with a 32,768\-token maximum sequence length that localizes unsupported spans across code, tool output, structured documents, and natural\-language RAG\. We also train LettuceDetect\-mmBERT\-base, a 307M\-parameter mmBERT\-base encoder \(Marone et al\., [2025](https://arxiv.org/html/2607.00895#bib.bib19)\), as a token classification model\. Multilingual supervision comes from the 14\-language PsiloQA portion of the training set\.

Our contributions are:

- a span\-level task formulation for post\-generation hallucination detection over structured grounded\-generation contexts;
- a unified benchmark with 74,285 newly constructed examples across code, tool output, and structured documents, plus converted examples from existing natural\-language RAG benchmarks;
- two detector families that localize unsupported spans across code, tool output, structured documents, and natural\-language RAG: LettuceDetect\-Qwen\-2B, a fine\-tuned Qwen3\.5\-2B detector, and LettuceDetect\-mmBERT\-base, a 307M\-parameter encoder baseline, with results showing that the generative detector substantially outperforms off\-the\-shelf detectors and zero\-shot LLM judges on the code\-agent split while remaining competitive on natural\-language RAG\.

Code, data, model checkpoints, prompts, evaluation outputs, and the reviewed code\-test arbitration files are released through GitHub and Hugging Face: code at [https://github\.com/KRLabsOrg/LettuceDetect](https://github.com/KRLabsOrg/LettuceDetect); models and datasets under [https://huggingface\.co/KRLabsOrg](https://huggingface.co/KRLabsOrg)\.

## 2 Related Work

#### Grounded text verification\.

Hallucination detection in grounded generation includes prompt\-based judging, self\-consistency methods such as SelfCheckGPT \(Manakul et al\., [2023](https://arxiv.org/html/2607.00895#bib.bib5)\), and benchmark\-driven detectors trained on datasets such as HaluEval \(Li et al\., [2023](https://arxiv.org/html/2607.00895#bib.bib6)\) and RAGTruth \(Niu et al\., [2024](https://arxiv.org/html/2607.00895#bib.bib3)\)\. Recent compact detectors, including Luna \(Belyi et al\., [2025](https://arxiv.org/html/2607.00895#bib.bib7)\) and LettuceDetect \(Kovács and Recski, [2025](https://arxiv.org/html/2607.00895#bib.bib1)\), show that long\-context encoders can localize unsupported spans at lower cost than LLM judges\. We use mmBERT, a ModernBERT\-family multilingual encoder trained with annealed language learning \(Marone et al\., [2025](https://arxiv.org/html/2607.00895#bib.bib19)\); PsiloQA reports strong span\-localization results from fine\-tuning mmBERT\-base \(Rykov et al\., [2025](https://arxiv.org/html/2607.00895#bib.bib30)\)\. Other work optimizes related but different objectives: RAG\-HAT \(Song et al\., [2024](https://arxiv.org/html/2607.00895#bib.bib27)\) reports response\-level F1 on RAGTruth, RL4HS \(Su et al\., [2025](https://arxiv.org/html/2607.00895#bib.bib28)\) optimizes span\-F1 with reinforcement learning, and PsiloQA \(Rykov et al\., [2025](https://arxiv.org/html/2607.00895#bib.bib30)\) and Mu\-SHROOM \(Vazquez et al\., [2025](https://arxiv.org/html/2607.00895#bib.bib31)\) evaluate multilingual span localization\. Fine\-grained taxonomies have also been proposed, most notably FAVA \(Mishra et al\., [2024](https://arxiv.org/html/2607.00895#bib.bib4)\), but these taxonomies are still mainly designed for natural\-language responses grounded in textual documents\.

#### Code hallucination\.

CodeHalu \(Tian et al\., [2025](https://arxiv.org/html/2607.00895#bib.bib8)\) and CodeMirage \(Agarwal et al\., [2025](https://arxiv.org/html/2607.00895#bib.bib9)\) treat hallucinations as defects in generated snippets\. Collu\-Bench \(Jiang et al\., [2024](https://arxiv.org/html/2607.00895#bib.bib10)\) predicts hallucination during generation from token probabilities and execution feedback\. AgentHallu \(Liu et al\., [2026](https://arxiv.org/html/2607.00895#bib.bib11)\) attributes failures across agent trajectories\. Delulu \(Erfanian et al\., [2026](https://arxiv.org/html/2607.00895#bib.bib32)\) is closest in spirit because it targets code hallucination, but it is an execution\-verified fill\-in\-the\-middle benchmark with binary accept/reject labels\. In contrast, our task is post\-generation, repository\-grounded, and span\-level\.

## 3 Task

Each example consists of a request $q$, context $c$, and answer $a$\. The context may contain source code at a specific commit, tool output, or structured document text\. The goal is to predict character spans in $a$ that are not supported by $q$ and $c$\.

We use three top\-level hallucination categories\. *Contradiction* covers wrong logic, values, fields, or conditions in otherwise plausible code\. *Unsupported addition* covers extra behavior or claims not requested or evidenced\. *Fabricated reference* covers invented methods, attributes, keyword arguments, sections, or identifiers\. This split follows the common distinction made by prior taxonomies between conflicts with evidence, baseless additions, and invented entities or references \(Niu et al\., [2024](https://arxiv.org/html/2607.00895#bib.bib3); Mishra et al\., [2024](https://arxiv.org/html/2607.00895#bib.bib4)\)\. For code, the first two are mostly judged against the request, while fabricated references are judged against repository and library evidence\. The detector does not see generator logits or tool trajectories; it only sees inputs available to an external checker after the answer has been produced\.

The top\-level labels are paired with 13 subcategories that describe the surface element affected: entity, temporal, numerical, value, relational, identifier, section, attribute, claim, behavior, elaboration, subjective, and unspecified\. We choose these subcategories by harmonizing distinctions used in prior taxonomies: RAGTruth’s conflict and baseless\-information labels \(Niu et al\., [2024](https://arxiv.org/html/2607.00895#bib.bib3)\), FAVA’s entity, relation, invented, subjective, and unverifiable labels \(Mishra et al\., [2024](https://arxiv.org/html/2607.00895#bib.bib4)\), code\-hallucination categories such as naming, resource, and logic errors \(Tian et al\., [2025](https://arxiv.org/html/2607.00895#bib.bib8)\), and the code/tool labels needed by our structured sources\. The result keeps the three\-way category decision interpretable while preserving the surface\-level error types needed for analysis\.
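As an illustration of the task formulation, the sketch below shows one way an example and its span labels could be represented. The class and field names, the half\-open offset convention, the `contradiction` label string, and the request and context text are assumptions made for this sketch; only the three categories, the 13 subcategory names, and the `set_active_device` answer come from the paper.

```python
from dataclasses import dataclass, field
from enum import Enum


class Category(str, Enum):
    # Label strings are assumed; the paper shows "fabricated_reference"
    # and "unsupported_addition" in this snake_case form.
    CONTRADICTION = "contradiction"
    UNSUPPORTED_ADDITION = "unsupported_addition"
    FABRICATED_REFERENCE = "fabricated_reference"


# The 13 subcategories listed in the task description.
SUBCATEGORIES = (
    "entity", "temporal", "numerical", "value", "relational", "identifier",
    "section", "attribute", "claim", "behavior", "elaboration", "subjective",
    "unspecified",
)


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset into the answer (assumed convention)
    end: int    # exclusive character offset
    category: Category
    subcategory: str = "unspecified"

    def __post_init__(self):
        if not 0 <= self.start < self.end:
            raise ValueError("span must be non-empty with 0 <= start < end")
        if self.subcategory not in SUBCATEGORIES:
            raise ValueError(f"unknown subcategory: {self.subcategory}")


@dataclass
class Example:
    request: str   # q
    context: str   # c
    answer: str    # a
    spans: list[Span] = field(default_factory=list)  # empty for clean answers

    def span_texts(self) -> list[str]:
        """Return the answer substrings covered by each labeled span."""
        return [self.answer[s.start:s.end] for s in self.spans]


# Hypothetical request and context; the answer mirrors the paper's example.
example = Example(
    request="Pin the process to one GPU.",
    context="def set_device(device): ...",
    answer="torch.cuda.set_active_device(gpu)",
    spans=[Span(11, 28, Category.FABRICATED_REFERENCE, "identifier")],
)
print(example.span_texts())
```

A clean example under this representation simply has an empty `spans` list, which keeps clean and hallucinated answers in one schema.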

## 4 Benchmark Construction

### 4\.1 Sources

The benchmark has five newly constructed sources and two incorporated natural\-language RAG benchmarks\. The programming\-oriented sources are *code*, built from SWE\-bench \(Jimenez et al\., [2024](https://arxiv.org/html/2607.00895#bib.bib12)\), and *tool output*, built from Squeez \(Kovacs, [2026](https://arxiv.org/html/2607.00895#bib.bib13)\) containing a query, verbose tool observation, and gold relevant lines\. The structured\-document sources are *ACL*, built from ACL\-Verbatim retrieved paper chunks and questions \(Recski et al\., [2026](https://arxiv.org/html/2607.00895#bib.bib14)\); *README*, collected from popular GitHub repositories through the GitHub API; and *Wikipedia*, sampled from English open\-wikipedia\-markdown articles \(Index, [2026](https://arxiv.org/html/2607.00895#bib.bib15)\)\. We also include RAGTruth, a human\-annotated word\-level hallucination corpus for RAG outputs \(Niu et al\., [2024](https://arxiv.org/html/2607.00895#bib.bib3)\), and PsiloQA, a 14\-language span\-level hallucination benchmark built from Wikipedia QA \(Rykov et al\., [2025](https://arxiv.org/html/2607.00895#bib.bib30)\), to keep the model tied to established natural\-language detection tasks\.

All new sources use the same sample abstraction: a prompt containing context and request, an answer, and character\-level span annotations over the answer\. Context and clean\-answer construction differ by source\. Code examples use the gold SWE\-bench fix; tool\-output examples generate a short answer from the Squeez query and relevant lines; ACL examples use the top five retrieved paper chunks as context; README and Wikipedia examples are generated from heading\-based markdown chunks\. Train, development, and test splits are separated by grounding source\.

Table 1: Newly constructed benchmark sources\. The complete evaluation also includes converted RAGTruth and PsiloQA examples\. New\-source splits are train/dev/test = 66,368 / 2,816 / 5,101\.

### 4\.2 Code Source

SWE\-bench provides real GitHub issues, repository metadata, base commits, and gold patches\. For each instance we recover the files touched by the gold patch at the base commit and render the gold fix as one coding\-assistant answer: a patched function, a changed\-line fragment, or a natural\-language edit instruction\. Clean examples use this answer verbatim\. Hallucinated examples contain a small edit to the same answer\. We do not include clean and hallucinated versions of the same instance as a pair\. This setup differs from snippet\-only code hallucination, because the answer is evaluated against a concrete repository state and request\.

The three answer renderings are meant to cover the kinds of outputs a developer may ask an assistant for\. The *function* rendering gives the largest patched function that fits the length cap, and is closest to a direct code suggestion\. The *fragment* rendering gives the changed hunk, preserving the local edit without forcing the model to read a full function\. The *edit* rendering gives an instruction such as “in file X, replace Y with Z”, which is common when agents summarize a patch rather than printing a full diff\. Instances whose gold answer is a trivial version bump or too long for the context budget are skipped\. This filtering is practical rather than conceptual: if the answer itself consumes the full sequence window, the detector cannot use the repository evidence\.

The raw SWE\-bench issue text is also rewritten by our pipeline into a short developer request\. We keep the technical intent but remove issue\-tracker noise, reproduction logs, and long discussions that would make the answer\-verification task depend on irrelevant text\. This request is included in the detector input\. It matters especially for unsupported\_addition, because an extra behavior can be unsupported even when it is syntactically valid and uses real repository symbols\.

### 4\.3 Generation Pipeline

The construction has a shared labeling step across sources\. First, source\-specific preparation builds the context, request, and a known\-correct grounded answer\. Then a source\-specific injector proposes localized replacement edits as structured original/hallucinated pairs \(Figure [1](https://arxiv.org/html/2607.00895#S4.F1)\)\. Applying these edits gives exact character offsets without diffing mixed natural\-language/code answers\.

Figure 1 content \(pipeline diagram\):

1. Clean answer: torch\.cuda\.set\_device\(gpu\), correct by construction from the grounding source\.
2. Injector output \(structured JSON\): \{"original": "set\_device", "hallucinated": "set\_active\_device", "category": "fabricated\_reference"\}
3. Span\-labeled answer: torch\.cuda\.set\_active\_device\(gpu\), with set\_active\_device labeled fabricated\_reference at exact character offsets\.

The arrow from step 1 to step 2 is labeled "inject"; the arrow from step 2 to step 3 is labeled "apply \+ locate"\.

Figure 1: Edit\-based injection yields exact spans\. The injector returns each change as an original/hallucinated pair; applying the edit deterministically locates the hallucinated substring in the final answer\.

The edit\-based design is important for label quality\. Instead of diffing two mixed natural\-language/code answers, we apply the injector’s structured replacement and locate the hallucinated substring directly in the final answer\. For the code source, the clean answer is the gold SWE\-bench fix and the injector is Gemma 4 31B \(Google DeepMind, [2026](https://arxiv.org/html/2607.00895#bib.bib16)\)\. For tool output and the markdown sources, Qwen 3\.6 35B A3B \(Qwen Team, [2026](https://arxiv.org/html/2607.00895#bib.bib17)\) first generates a request or clean answer when the source does not already supply one, and then proposes the hallucination edit\. Because the final label comes from the applied replacement, the span can stay narrow: in torch\.cuda\.set\_active\_device\(gpu\), only set\_active\_device is labeled, not the surrounding correct call structure\.
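The apply\-and\-locate step can be sketched in a few lines. This is a minimal illustration, not the released pipeline: the function name `apply_edit` and the half\-open offset convention are assumptions, and the uniqueness requirement is included here because it is what makes the located span unambiguous.

```python
def apply_edit(answer: str, original: str, hallucinated: str):
    """Replace `original` with `hallucinated` and return the new answer
    together with the (start, end) character span of the injected text.

    Hypothetical helper: offsets are half-open and index the *new* answer.
    """
    if answer.count(original) != 1:
        # With zero or several matches the span location would be ambiguous.
        raise ValueError("original substring must occur exactly once")
    start = answer.index(original)
    new_answer = answer[:start] + hallucinated + answer[start + len(original):]
    return new_answer, (start, start + len(hallucinated))


new_answer, (start, end) = apply_edit(
    "torch.cuda.set_device(gpu)", "set_device", "set_active_device"
)
print(new_answer)              # the span-labeled answer text
print(new_answer[start:end])   # only the injected identifier is covered
```

Because the span is derived from the replacement itself, no diff between the clean and edited answers is needed, which is the property the edit\-based design relies on.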

Qualitative benchmark example: contradiction / wrong implementation

Repository / request\. matplotlib\_\_matplotlib\-23049\. User request: *“Can you add support for a minor keyword argument to plt\.xticks\(\)?”*

Answer snippet\.

```
def xticks(ticks=None, labels=None, *, minor=False, **kwargs):
    ...
    if ticks is None:
        locs = ax.get_xticks(minor=minor)
    else:
        [H] locs = ax.set_xticks(ticks, minor=not minor) [/H]
    ...

def yticks(ticks=None, labels=None, *, minor=False, **kwargs):
    ...
    if ticks is None:
        [H] locs = ax.get_yticks(minor=not minor) [/H]
    ...
```

Type\. Contradiction \(wrong\_implementation\)\. The answer is syntactically valid and locally plausible, but reverses the intended behavior of the minor argument\.

Figure 2: Example benchmark instance\. We use \[H\] \.\.\. \[/H\] markers for gold unsupported spans\.

The injector is prompted to preserve the surrounding answer and to make only localized changes\. For code, the target changes fall into two families\. Intent errors change a value, field, condition, or side effect while leaving the answer plausible\. Structural errors replace a real method, attribute, or keyword with a plausible name that does not exist\. The injector returns JSON containing the original substring, the hallucinated substring, and the category\. We then apply the replacement deterministically\. Attempts are discarded when the original substring is not uniquely found, the edit is a no\-op, the hallucinated span covers too much of the answer, or the injected name already occurs in the evidence\.

This approach deliberately trades generation yield for cleaner labels\. Roughly half of the attempted code injections pass the automatic checks\. The most common failures are non\-unique originals, overly broad edits, hint words such as “incorrect”, and fabricated references that are not actually absent from context\. These rejections are useful: they remove examples where the model could learn artifacts of the generation process rather than the verification task\.
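The automatic checks can be illustrated as a single filter that returns the first reason for rejecting an attempt. This is a sketch under stated assumptions: the 0\.5 span\-fraction threshold, the hint\-word list beyond “incorrect”, the reason strings, and the choice to apply the evidence check only to fabricated references are all hypothetical, since the paper does not give these details.

```python
from typing import Optional

# Assumed values: the paper names only "incorrect" as a hint word and does
# not state the maximum fraction of the answer a span may cover.
HINT_WORDS = ("incorrect", "wrong", "hallucinated")
MAX_SPAN_FRACTION = 0.5


def rejection_reason(answer: str, original: str, hallucinated: str,
                     evidence: str, category: str) -> Optional[str]:
    """Return why an injection attempt is discarded, or None if it passes."""
    if answer.count(original) != 1:
        return "non_unique_original"
    if original == hallucinated:
        return "no_op"
    edited_length = len(answer) - len(original) + len(hallucinated)
    if len(hallucinated) > MAX_SPAN_FRACTION * edited_length:
        return "span_too_broad"
    if any(word in hallucinated.lower() for word in HINT_WORDS):
        return "hint_word"
    # A fabricated name that actually appears in the evidence is not fabricated.
    if category == "fabricated_reference" and hallucinated in evidence:
        return "name_present_in_evidence"
    return None


answer = "torch.cuda.set_device(gpu)\nmodel = model.cuda(gpu)"
evidence = "def set_device(device): ..."
print(rejection_reason(answer, "set_device", "set_active_device",
                       evidence, "fabricated_reference"))
```

Returning a reason rather than a boolean makes it straightforward to tabulate which failure modes dominate, as in the breakdown described above.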

#### Reference grounding\.

Clean answers often refer to sibling methods, imported helpers, or third\-party APIs not present in a truncated source context\. If these are not added, a verifier cannot distinguish a correct but unseen reference from a fabricated one\. We therefore append referenced definitions from the modified files, imported repository modules, and modules imported by the changed files, all resolved at the historical base commit rather than by current\-code search\. For third\-party APIs, which are not in the repository at all, we retrieve real signatures from Context7 \([https://context7\.com](https://context7.com/)\), a library\-documentation index: we parse the external libraries imported by the answer and query Context7 for each imported symbol’s signature and usage, then append the returned snippets \(Figure [3](https://arxiv.org/html/2607.00895#S4.F3)\)\. This gives the verifier genuine evidence for external calls, so a real but unfamiliar library API is not penalized as a fabrication\.
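The first half of this step, finding which external libraries an answer imports, could look like the sketch below. It assumes the answer parses as Python, which may not hold for fragment or edit renderings, and the function names and the `repo_packages` argument are hypothetical. The Context7 lookup is deliberately left out because its interface is not described here.

```python
import ast


def imported_symbols(answer_code: str) -> dict[str, set[str]]:
    """Map each imported top-level module to the names imported from it."""
    symbols: dict[str, set[str]] = {}
    for node in ast.walk(ast.parse(answer_code)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                symbols.setdefault(alias.name.split(".")[0], set())
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports, which point inside the repository.
            top_level = node.module.split(".")[0]
            symbols.setdefault(top_level, set()).update(a.name for a in node.names)
    return symbols


def external_libraries(answer_code: str, repo_packages: set[str]) -> dict[str, set[str]]:
    """Keep only imports that do not resolve to packages inside the repository."""
    return {
        module: names
        for module, names in imported_symbols(answer_code).items()
        if module not in repo_packages
    }


code = (
    "import torch\n"
    "from typing_extensions import deprecated\n"
    "from mypkg.utils import helper\n"
)
print(external_libraries(code, repo_packages={"mypkg"}))
```

Standard\-library modules are not filtered out in this sketch; a fuller version would likely exclude them before querying a documentation index.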

Figure 3 content \(evidence sources assembled for the verifier\):

- Answer references: self\.save\_checkpoint, deprecated, torch\.cuda\.set\_device
- Repository evidence: changed functions and full changed files at the base commit
- Import graph evidence: modules imported by the answer or by changed files
- Third\-party evidence: compact library signatures for external APIs
- Verifier context: original prompt plus referenced definitions / library signatures

Figure 3: Reference grounding for code examples\. Correct references missing from the truncated source context are resolved at the base commit and appended to the verifier context; third\-party APIs are grounded with compact signatures\.

#### Other sources\.

The non\-code sources use the same edit\-application framework but different prompts\. Tool\-output injections misreport identifiers, line references, values, or claims from the observation\. ACL injections use paper\-specific numerical, entity, relational, methodological, and citation\-like edits detectable from the retrieved excerpts\. README and Wikipedia injections use a generic markdown prompt covering numerical, temporal, entity, relational, fabricated\-reference, and unsupported\-claim edits\.

### 4\.4 Test\-Set Verification

For quality assurance, we reviewed every code\-source test sample before release: 2,038 examples were reviewed and 2,015 retained\. Model\-assisted triage first flagged span validity, category, boundary, and plausibility issues \(we used Claude Sonnet 4\.6 to flag candidate issues; final decisions were made by the authors during arbitration\); flagged cases were then re\-judged blindly and finally arbitrated by the authors against the true pre\-fix repository evidence rather than the injector’s intended edit\. For the 44 disputed hallucinations, arbitration upheld 41 as genuine and dropped 3 that matched original code\. The review tightened 235 boundaries, dropped 23 invalid spans, corrected 2 categories, reclassified 5 examples as clean, and removed 23 examples for question–answer mismatch, no\-op edits, or incoherent rendering\. No rebalancing was applied after review; the released code test set is 50\.3% hallucinated\. Per\-sample verdicts, contested cases, resolutions, and the rubric are released with the dataset\.

## 5 Benchmark Characteristics

The code source contains 18,524 examples from 53 repositories, split by SWE\-bench repository into 16,319 train / 190 development / 2,015 test examples over 35 / 6 / 12 disjoint repositories \(50\.0% hallucinated overall\)\. Average answers are about 1\.8k characters, and hallucinated code examples contain just under two labeled spans on average\. Answer formats are uneven—function \(8,550\), fragment \(4,893\), and edit\-style \(5,081\) renderings—as are error types over hallucinated examples: contradiction \(4,719\), fabricated reference \(2,275\), and unsupported addition \(2,274\)\.
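As a quick consistency check, the counts reported above can be summed directly. The snippet assumes that each hallucinated example is counted under exactly one error type, which the paper does not state explicitly; under that assumption the error\-type counts reproduce the reported 50\.0% hallucinated share.

```python
# Counts copied from the paragraph above.
splits = {"train": 16_319, "dev": 190, "test": 2_015}
repos = {"train": 35, "dev": 6, "test": 12}
formats = {"function": 8_550, "fragment": 4_893, "edit": 5_081}
error_types = {
    "contradiction": 4_719,
    "fabricated_reference": 2_275,
    "unsupported_addition": 2_274,
}

total = sum(splits.values())
assert total == 18_524                    # examples in the code source
assert sum(repos.values()) == 53          # disjoint repositories
assert sum(formats.values()) == total     # every example has one rendering

# Assumption: one error type per hallucinated example.
hallucinated = sum(error_types.values())
hallucinated_share = hallucinated / total
print(f"{hallucinated} hallucinated examples, {hallucinated_share:.1%} of {total}")
```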

The examples vary in what evidence is needed\. Some hallucinations are local: a wrong config key is contradicted by the nearby code\. Others require cross\-file evidence, for example a method called on self whose definition lives in an imported mixin\. Third\-party API fabrications require library signatures rather than repository code\. Without reference grounding, correct answers would contain many unsupported\-looking names simply because the context was truncated\.

For the tool\-output dataset the context is often an observation rather than a source file: a test failure, a grep result, a package\-manager response, or a git command\. Here hallucinations are usually value and relation errors: reporting the wrong file, count, version, status, or failing test\. The markdown sources broaden the data beyond code while preserving structure\. ACL papers, READMEs, and Wikipedia pages contain headings, tables, citations, and lists, so the detector sees contexts that are not plain paragraphs but are still text\-grounded\.

## 6 Experiments

### 6\.1 Models and Metrics

Our best model, LettuceDetect\-Qwen\-2B, is a Qwen3\.5\-2B generative detector \(Team, [2026](https://arxiv.org/html/2607.00895#bib.bib18)\) fine\-tuned to return JSON spans with category and subcategory\. We fine\-tune and evaluate it with a 32,768\-token maximum sequence length, so each prediction can include the user request, retrieved documents or repository evidence, tool output, and the answer to check\. It is trained on the unified training set using the same data\-agnostic prompt across sources\. We compare it with LettuceDetect\-mmBERT\-base, a 307M\-parameter mmBERT\-base encoder \(Marone et al\., [2025](https://arxiv.org/html/2607.00895#bib.bib19)\) trained on the same task, an LFM2\.5\-8B\-A1B generative sibling \(AI, [2026](https://arxiv.org/html/2607.00895#bib.bib26)\) fine\-tuned with the same prompt, LettuceDetect\-large \(Kovács and Recski, [2025](https://arxiv.org/html/2607.00895#bib.bib1)\), answer\-level faithfulness detectors, and two large zero\-shot LLM judges: Nemotron\-3\-Ultra\-550B \(NVIDIA et al\., [2026](https://arxiv.org/html/2607.00895#bib.bib23)\) and gpt\-oss\-120b \(OpenAI et al\., [2025](https://arxiv.org/html/2607.00895#bib.bib24)\)\.

We report character\-overlap span precision, recall, and F1; example\-F1, where an answer is flagged if at least one span is predicted; and mean IoU between predicted and gold span mass\. For typed detection, a predicted span only receives credit when its category or subcategory also matches\.
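To make these metrics concrete, the following sketch computes them for half\-open `(start, end)` character spans. It is an illustration under assumptions rather than the released evaluation code: how scores are aggregated across examples, and the convention that IoU is 1\.0 when both prediction and gold are empty, are choices made for this sketch.

```python
def char_set(spans):
    """Expand half-open (start, end) spans into the set of covered characters."""
    return {i for start, end in spans for i in range(start, end)}


def span_prf(pred_spans, gold_spans):
    """Character-overlap precision, recall, and F1 for one answer."""
    pred, gold = char_set(pred_spans), char_set(gold_spans)
    overlap = len(pred & gold)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def span_iou(pred_spans, gold_spans):
    """Intersection over union of predicted and gold character mass."""
    pred, gold = char_set(pred_spans), char_set(gold_spans)
    union = pred | gold
    # Assumed convention: two empty span sets agree perfectly.
    return len(pred & gold) / len(union) if union else 1.0


def example_f1(pred_by_example, gold_by_example):
    """Answer-level F1: an answer counts as flagged if any span is predicted."""
    tp = fp = fn = 0
    for pred, gold in zip(pred_by_example, gold_by_example):
        flagged, hallucinated = bool(pred), bool(gold)
        tp += flagged and hallucinated
        fp += flagged and not hallucinated
        fn += hallucinated and not flagged
    total = 2 * tp + fp + fn
    return 2 * tp / total if total else 0.0


print(span_prf([(0, 10)], [(5, 15)]))
print(span_iou([(0, 10)], [(5, 15)]))
```

For typed detection, the same functions could be applied after discarding predicted spans whose category or subcategory does not match the gold label.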

#### Training details\.

The generative detector is trained with supervised fine\-tuning on the merged structured and natural\-language training split\. The 74,285 examples in Table [1](https://arxiv.org/html/2607.00895#S4.T1) are newly constructed; after adding converted RAGTruth and PsiloQA examples, the full split contains 145,250 train, 6,171 validation, and 10,698 test examples\. The training and evaluation splits are released on Hugging Face, and the construction/evaluation code is released on GitHub\. We fine\-tune Qwen3\.5\-2B with LoRA rank 32, $\alpha=64$, dropout 0, bf16 weights, learning rate $2\times10^{-4}$, linear schedule with 3% warmup, weight decay 0\.01, two epochs, effective batch size 32, and a 32,768\-token maximum sequence length\. The multilingual portion of the training data comes from PsiloQA, while the long context budget is mainly used by code, tool\-output, and structured\-document examples\. The prompt is the same across sources: it defines a hallucinated span, lists the taxonomy, then provides the user request, context, and answer to verify\. Returned strings are matched back into the answer to recover character offsets\.
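The final matching step can be sketched as follows. The output schema \(a JSON list of objects with `text`, `category`, and `subcategory` keys\), the function name, and the handling of repeated or unmatched strings are assumptions for illustration; the paper states only that returned strings are matched back into the answer.

```python
import json


def locate_spans(answer: str, model_output: str):
    """Map the detector's returned strings to character offsets in the answer.

    Assumed schema: a JSON list of {"text", "category", "subcategory"} objects.
    """
    located = []
    next_search_start = {}  # lets a repeated string match successive occurrences
    for item in json.loads(model_output):
        text = item["text"]
        start = answer.find(text, next_search_start.get(text, 0))
        if start == -1:
            continue  # string not found verbatim; skipped in this sketch
        end = start + len(text)
        next_search_start[text] = end
        located.append({
            "start": start,
            "end": end,
            "category": item.get("category"),
            "subcategory": item.get("subcategory"),
        })
    return located


output = (
    '[{"text": "set_active_device", '
    '"category": "fabricated_reference", "subcategory": "identifier"}]'
)
spans = locate_spans("torch.cuda.set_active_device(gpu)", output)
print(spans)
```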

LettuceDetect\-mmBERT\-base follows the LettuceDetect token\-classification architecture with jhu\-clsp/mmBERT\-base as the backbone: context and request tokens are masked in the loss, and answer tokens receive supported/unsupported labels\. We train it on the same unified split with an 8,192\-token maximum sequence length, batch size 8, gradient accumulation 4, learning rate $10^{-5}$, and three epochs, selecting the best checkpoint by development hallucinated\-token F1\. The encoder is cheaper at inference and decodes contiguous positive answer tokens into character spans, but it only predicts binary spans in its base form\. We also evaluate a typed encoder cascade in which spans from the frozen binary model are classified by a label\-conditioned head that scores each span against category descriptions\.
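The decoding of contiguous positive answer tokens into character spans can be illustrated with a small helper. The function name, the input format \(per\-token character offsets plus 0/1 predictions\), and the tokenization in the example are assumptions for this sketch, not the released implementation.

```python
def tokens_to_spans(offsets, labels):
    """Merge runs of consecutive positive tokens into character spans.

    offsets: per-token (start, end) character offsets into the answer.
    labels:  per-token predictions, 1 for unsupported and 0 for supported.
    """
    spans = []
    current = None
    for (start, end), label in zip(offsets, labels):
        if label == 1:
            if current is None:
                current = [start, end]   # open a new span
            else:
                current[1] = end         # extend the open span
        elif current is not None:
            spans.append(tuple(current))
            current = None
    if current is not None:
        spans.append(tuple(current))     # close a span that runs to the end
    return spans


# Hypothetical tokenization of "torch.cuda.set_active_device(gpu)".
offsets = [(0, 5), (5, 6), (6, 10), (10, 11), (11, 14),
           (14, 21), (21, 28), (28, 29), (29, 32), (32, 33)]
labels = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
print(tokens_to_spans(offsets, labels))
```

Because adjacent positive tokens are merged, a single mislabeled token in the middle of a hallucinated identifier would split it into two spans; this sketch does not attempt to bridge such gaps.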

### 6\.2 Main Results

Table [2](https://arxiv.org/html/2607.00895#S6.T2) gives per\-source results for the 2B generative detector\. The model performs best on README and Wikipedia, where grounding resembles factual document QA, and remains strong on the harder coding\-agent sources\. Code\-agent is the most difficult source: the context is long, the answer often contains new code, and some errors are intent mistakes rather than simple factual contradictions\.

Table 2: Per\-source results\. P/R/F1/Ex\.\-F1 are character\-overlap span metrics for LettuceDetect\-Qwen\-2B; LD\-mmBERT, LFM\-8B, and gpt\-oss report span\-F1\. gpt\-oss was run only on the five newly constructed sources\.

The generative detector outperforms LettuceDetect\-mmBERT\-base on every source: 0\.689 vs\. 0\.642 span\-F1 overall, and 0\.602 vs\. 0\.508 on code\-agent\. On code\-agent answers it is also well above LettuceDetect\-large and the strongest zero\-shot LLM judge we evaluated; on natural\-language RAG benchmarks it is close to specialized systems\.

### 6\.3Comparison with Existing RAG Benchmarks

Table [3](https://arxiv.org/html/2607.00895#S6.T3) compares against published natural-language RAG results. On RAGTruth, our model is competitive with specialized systems while also covering code and tool output. On PsiloQA, it sets the best reported English IoU (0.724), above the benchmark authors' fine-tuned mmBERT-base model (0.707) and far above their 32B few-shot judge (0.400). Across all 14 PsiloQA languages, it reaches 0.689 mean IoU, compared with 0.623 for the PsiloQA mmBERT-base model and 0.383 for the Qwen2.5-32B judge reported by Rykov et al. ([2025](https://arxiv.org/html/2607.00895#bib.bib30)). These are the strongest results among the published PsiloQA systems we compare against, obtained with the same detector that is trained for code, tool output, and structured documents.

| Benchmark | Method | Score |
|---|---|---|
| RAGTruth Ex.-F1 | RAG-HAT 8B (Song et al., [2024](https://arxiv.org/html/2607.00895#bib.bib27)) | 83.9 |
| RAGTruth Ex.-F1 | LettuceDetect-Qwen-2B | 81.8 |
| RAGTruth Ex.-F1 | LettuceDetect-large (Kovács and Recski, [2025](https://arxiv.org/html/2607.00895#bib.bib1)) | 79.2 |
| RAGTruth Ex.-F1 | Fine-tuned Llama-2-13B (Niu et al., [2024](https://arxiv.org/html/2607.00895#bib.bib3)) | 78.7 |
| RAGTruth Ex.-F1 | GPT-4 (Niu et al., [2024](https://arxiv.org/html/2607.00895#bib.bib3)) | 63.4 |
| PsiloQA EN IoU | LettuceDetect-Qwen-2B | 0.724 |
| PsiloQA EN IoU | mmBERT-base (PsiloQA) (Rykov et al., [2025](https://arxiv.org/html/2607.00895#bib.bib30)) | 0.707 |
| PsiloQA EN IoU | Qwen2.5-32B judge (Rykov et al., [2025](https://arxiv.org/html/2607.00895#bib.bib30)) | 0.400 |
| PsiloQA 14-lang IoU | LettuceDetect-Qwen-2B | 0.689 |
| PsiloQA 14-lang IoU | mmBERT-base (PsiloQA) (Rykov et al., [2025](https://arxiv.org/html/2607.00895#bib.bib30)) | 0.623 |
| PsiloQA 14-lang IoU | Qwen2.5-32B judge (Rykov et al., [2025](https://arxiv.org/html/2607.00895#bib.bib30)) | 0.383 |

Table 3: Comparison with established natural-language RAG benchmarks. Rows combine published numbers with our evaluation, so they should be read as external reference points rather than a single shared leaderboard.

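
The IoU scores in Table 3 measure intersection over union between predicted and gold hallucinated characters. A minimal sketch is given below; the function name `span_iou` is hypothetical, and scoring an answer with no predicted and no gold spans as 1.0 is an assumption of this sketch rather than a documented PsiloQA convention.

```python
def span_iou(pred_spans, gold_spans):
    """Character-level intersection over union for one answer.

    Spans are (start, end) character offsets with end exclusive.
    """
    pred = {i for start, end in pred_spans for i in range(start, end)}
    gold = {i for start, end in gold_spans for i in range(start, end)}
    if not pred and not gold:
        return 1.0  # assumption: agreeing that nothing is hallucinated counts as perfect
    return len(pred & gold) / len(pred | gold)
```

Unlike F1, IoU penalizes every character in the union, so the same partial overlap yields a lower IoU than F1.
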
### 6\.4Code and Tool Evidence Evaluation

Table [4](https://arxiv.org/html/2607.00895#S6.T4) shows results on the code-agent test set. LettuceDetect-large, trained for natural-language RAG, reaches only 0.17 span-F1. Large zero-shot judges find some true error regions but over-predict heavily: the task-aware Nemotron prompt improves over a generic prompt but still reaches only 0.22 span-F1. Answer-level faithfulness systems, including HHEM-2.1-Open (Li et al., [2024](https://arxiv.org/html/2607.00895#bib.bib20)), Lynx-8B (Ravi et al., [2024](https://arxiv.org/html/2607.00895#bib.bib21)), Granite-Guardian-4.1-8B (Padhi et al., [2025](https://arxiv.org/html/2607.00895#bib.bib22)), and MiniCheck-7B (Tang et al., [2024](https://arxiv.org/html/2607.00895#bib.bib29)), show the same tendency: high recall but much lower precision on the hallucinated class. In most cases these judges flag correct newly written patch code as unsupported instead of checking whether the edit follows from the request and repository evidence.

Table [2](https://arxiv.org/html/2607.00895#S6.T2) shows results on the other sources. Under the same generic prompt, gpt-oss-120b (OpenAI et al., [2025](https://arxiv.org/html/2607.00895#bib.bib24)) reaches reasonable span-F1 on document-like data (ACL, README, Wikipedia), but performs poorly on tool output and code-agent answers.

| Detector (code-agent) | Eval | P | R | F1 | Ex.-F1 |
|---|---|---|---|---|---|
| LettuceDetect-large (v1) | span | 0.112 | 0.373 | 0.172 | 0.684 |
| Nemotron-3-Ultra, generic | span | 0.108 | 0.665 | 0.186 | 0.655 |
| Nemotron-3-Ultra, task-aware | span | 0.132 | 0.589 | 0.216 | 0.700 |
| gpt-oss-120b, task-aware | span | 0.135 | 0.501 | 0.212 | 0.691 |
| HHEM-2.1-Open | answer | 0.50 | 0.86 | 0.63 | – |
| Lynx-8B | answer | 0.52 | 0.73 | 0.61 | – |
| Granite-Guardian-4.1-8B | answer | 0.53 | 0.90 | 0.66 | – |
| MiniCheck-7B | answer | 0.50 | 1.00 | 0.67 | – |
| LettuceDetect-mmBERT-base | span | 0.619 | 0.430 | 0.508 | 0.770 |
| LFM2.5-8B-A1B (ours, SFT) | span | 0.531 | 0.485 | 0.507 | 0.811 |
| LettuceDetect-Qwen-2B | span | 0.596 | 0.609 | 0.602 | 0.835 |

Table 4: Baselines on the code-agent test set. Span systems report character-overlap span P/R/F1 plus example-F1; answer systems report hallucinated-class example P/R/F1.

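
To make the answer-level columns concrete, the following sketch computes hallucinated-class precision, recall, and F1 from per-example flags. The names are hypothetical, treating a span system's answer as flagged when it returns at least one span is our assumption, and the balanced toy set is illustrative only and says nothing about the class balance of the benchmark.

```python
def example_prf(pred_flags, gold_flags):
    """Answer-level precision, recall, and F1 for the hallucinated class."""
    pairs = list(zip(pred_flags, gold_flags))
    tp = sum(1 for p, g in pairs if p and g)
    fp = sum(1 for p, g in pairs if p and not g)
    fn = sum(1 for p, g in pairs if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def flags_from_spans(span_lists):
    """Assumption: an answer is flagged if the detector returned any span."""
    return [len(spans) > 0 for spans in span_lists]


# Toy balanced set: a judge that flags every answer gets full recall
# but only 50% precision on the hallucinated class.
gold = [True, False] * 50
always_flag = [True] * 100
precision, recall, f1 = example_prf(always_flag, gold)
```

On this toy set the always-flag judge scores precision 0.5, recall 1.0, and F1 of about 0.67, which shows how high recall alone can produce a moderate answer-level F1.
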
### 6\.5Typed Detection

The generative detector emits a category and subcategory with every span. On the full unified test set it reaches category-gated span-F1 0.585 and subcategory-gated span-F1 0.468, compared with binary span-F1 0.689. Subcategory prediction is harder, especially for natural-language examples where claim, elaboration, value, and relational can overlap. As a cheaper alternative, we also evaluate a typed encoder cascade: given gold spans, the label-conditioned head reaches 0.82 category and 0.64 subcategory accuracy, but end-to-end it trails the generative model (category-gated span-F1 0.461 vs. 0.585, subcategory-gated 0.315 vs. 0.468 overall).
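
One plausible reading of "gated" span-F1 is that a predicted character counts as correct only if it overlaps a gold span and carries the same label. The sketch below implements that reading for a single answer; it is our interpretation for illustration, with hypothetical names, and the released evaluation code may define the gate differently.

```python
def label_map(spans):
    """Map each covered character position to its span label."""
    mapping = {}
    for start, end, label in spans:
        for i in range(start, end):
            mapping[i] = label  # later spans overwrite earlier ones on overlap
    return mapping


def gated_span_f1(pred_spans, gold_spans, gated=True):
    """Character-overlap F1 where a hit also requires a matching label.

    Spans are (start, end, label) triples. With gated=False the label is
    ignored and the score reduces to binary span-F1.
    """
    pred = label_map(pred_spans)
    gold = label_map(gold_spans)
    hits = sum(
        1 for i, label in pred.items()
        if i in gold and (not gated or gold[i] == label)
    )
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0
```

Under this reading the gated score can never exceed the binary score, which is consistent with the ordering binary, category-gated, subcategory-gated reported above.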

### 6\.6Error Analysis

The hardest remaining code examples are request\-grounded intent errors and broad generated edits\. A wrong field name or condition may use real repository symbols and look structurally valid, while only one small part of a larger generated block is unsupported\. Tool\-output examples show the same pattern when an otherwise fluent answer copies one version, count, filename, or status incorrectly\.

Zero-shot judges make similar errors. With a generic prompt, Nemotron-3-Ultra often marks clean patch code as fabricated when it should verify newly written lines against the request and repository evidence. A task-aware prompt improves precision from 0.11 to 0.13 but does not solve the problem, and gpt-oss-120b behaves similarly. In a small reasoning-judge diagnostic, the judge treated correct patch lines as unsupported, took truncated context as evidence for fabricated references, and returned non-verbatim spans for edit-style answers. These results show that strong zero-shot judges do not replace task-specific training here.

## 7Conclusion

We introduced a span-level hallucination detection benchmark across code, tool output, structured documents, and natural-language RAG. A 2B generative detector reaches 0.689 span-F1 overall and 0.602 on code-agent answers, well above the off-the-shelf detectors and zero-shot LLM judges we evaluated. It also reaches the best reported English PsiloQA IoU (0.724) and 81.8 RAGTruth example-F1, close to RAG-HAT's 83.9.

## 8Limitations

Most labels come from synthetic injection\. The reviewed code test set gives us confidence in that split, but train/development labels and non\-code test labels are generated labels guarded by automated checks, and the code review is model\-assisted rather than independently annotated by multiple human annotators\. The benchmark covers final\-answer verification, not full agent trajectories, and should not be read as measuring real\-world hallucination prevalence\.

## References

- CodeMirage: hallucinations in code generated by large language models\.External Links:2408\.08333,[Link](https://arxiv.org/abs/2408.08333)Cited by:[§1](https://arxiv.org/html/2607.00895#S1.p4.1),[§2](https://arxiv.org/html/2607.00895#S2.SS0.SSS0.Px2.p1.1)\.
- Liquid AI \(2026\)LFM2\.5\-8b\-a1b: personal assistant on your laptop\.Liquid AI Blog\.Note:www\.liquid\.ai/blog/lfm2\-5\-8b\-a1bCited by:[§6\.1](https://arxiv.org/html/2607.00895#S6.SS1.p1.1)\.
- M\. Belyi, R\. Friel, S\. Shao, and A\. Sanyal \(2025\)Luna: a lightweight evaluation model to catch language model hallucinations with high accuracy and low cost\.InProceedings of the 31st International Conference on Computational Linguistics: Industry Track,O\. Rambow, L\. Wanner, M\. Apidianaki, H\. Al\-Khalifa, B\. D\. Eugenio, S\. Schockaert, K\. Darwish, and A\. Agarwal \(Eds\.\),Abu Dhabi, UAE,pp\. 398–409\.External Links:[Link](https://aclanthology.org/2025.coling-industry.34/)Cited by:[§1](https://arxiv.org/html/2607.00895#S1.p2.1),[§1](https://arxiv.org/html/2607.00895#S1.p4.1),[§2](https://arxiv.org/html/2607.00895#S2.SS0.SSS0.Px1.p1.1)\.
- M\. Erfanian, N\. D\. Troncoso, A\. Garg, A\. Gale, X\. Liu, P\. A\. Golnari, and S\. Fu \(2026\)Delulu: a verified multi\-lingual benchmark for code hallucination detection in fill\-in\-the\-middle tasks\.External Links:2605\.07024,[Link](https://arxiv.org/abs/2605.07024)Cited by:[§2](https://arxiv.org/html/2607.00895#S2.SS0.SSS0.Px2.p1.1)\.
- Google DeepMind \(2026\)Gemma 4 31b it model card\.Note:[https://huggingface\.co/google/gemma\-4\-31B\-it](https://huggingface.co/google/gemma-4-31B-it)Cited by:[§4\.3](https://arxiv.org/html/2607.00895#S4.SS3.p2.1)\.
- Open Index \(2026\)Open wikipedia \(markdown\)\.Hugging Face\.External Links:[Link](https://huggingface.co/datasets/open-index/open-wikipedia-markdown)Cited by:[§1](https://arxiv.org/html/2607.00895#S1.p2.1),[§1](https://arxiv.org/html/2607.00895#S1.p5.1),[§4\.1](https://arxiv.org/html/2607.00895#S4.SS1.p1.1)\.
- N\. Jiang, Q\. Li, L\. Tan, and T\. Zhang \(2024\)Collu\-bench: a benchmark for predicting language model hallucinations in code\.External Links:2410\.09997,[Link](https://arxiv.org/abs/2410.09997)Cited by:[§1](https://arxiv.org/html/2607.00895#S1.p4.1),[§2](https://arxiv.org/html/2607.00895#S2.SS0.SSS0.Px2.p1.1)\.
- C\. E\. Jimenez, J\. Yang, A\. Wettig, S\. Yao, K\. Pei, O\. Press, and K\. Narasimhan \(2024\)SWE\-bench: can language models resolve real\-world github issues?\.External Links:2310\.06770,[Link](https://arxiv.org/abs/2310.06770)Cited by:[§1](https://arxiv.org/html/2607.00895#S1.p2.1),[§1](https://arxiv.org/html/2607.00895#S1.p5.1),[§4\.1](https://arxiv.org/html/2607.00895#S4.SS1.p1.1)\.
- Á\. Kovács and G\. Recski \(2025\)LettuceDetect: a hallucination detection framework for RAG applications\.External Links:2502\.17125,[Link](https://arxiv.org/abs/2502.17125)Cited by:[§1](https://arxiv.org/html/2607.00895#S1.p2.1),[§1](https://arxiv.org/html/2607.00895#S1.p4.1),[§2](https://arxiv.org/html/2607.00895#S2.SS0.SSS0.Px1.p1.1),[§6\.1](https://arxiv.org/html/2607.00895#S6.SS1.p1.1),[Table 3](https://arxiv.org/html/2607.00895#S6.T3.1.4.4.2.1.1)\.
- A\. Kovacs \(2026\)Squeez: task\-conditioned tool\-output pruning for coding agents\.External Links:2604\.04979,[Link](https://arxiv.org/abs/2604.04979)Cited by:[§1](https://arxiv.org/html/2607.00895#S1.p2.1),[§1](https://arxiv.org/html/2607.00895#S1.p5.1),[§4\.1](https://arxiv.org/html/2607.00895#S4.SS1.p1.1)\.
- P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, N\. Goyal, H\. Küttler, M\. Lewis, W\. Yih, T\. Rocktäschel, S\. Riedel, and D\. Kiela \(2020\)Retrieval\-augmented generation for knowledge\-intensive nlp tasks\.InProceedings of the 34th International Conference on Neural Information Processing Systems,NIPS ’20,Red Hook, NY, USA\.External Links:ISBN 9781713829546Cited by:[§1](https://arxiv.org/html/2607.00895#S1.p1.1)\.
- J\. Li, X\. Cheng, X\. Zhao, J\. Nie, and J\. Wen \(2023\)HaluEval: a large\-scale hallucination evaluation benchmark for large language models\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 6449–6464\.External Links:[Link](https://aclanthology.org/2023.emnlp-main.397/),[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.397)Cited by:[§2](https://arxiv.org/html/2607.00895#S2.SS0.SSS0.Px1.p1.1)\.
- M\. Li, R\. Luo, and O\. Mendelevitch \(2024\)HHEM\-2\.1\-Open\.Hugging Face\.External Links:[Link](https://huggingface.co/vectara/hallucination_evaluation_model),[Document](https://dx.doi.org/10.57967/hf/3240)Cited by:[§6\.4](https://arxiv.org/html/2607.00895#S6.SS4.p1.2)\.
- X\. Liu, X\. Yang, Z\. Li, P\. Li, and R\. He \(2026\)AgentHallu: benchmarking automated hallucination attribution of LLM\-based agents\.External Links:2601\.06818,[Link](https://arxiv.org/abs/2601.06818)Cited by:[§1](https://arxiv.org/html/2607.00895#S1.p4.1),[§2](https://arxiv.org/html/2607.00895#S2.SS0.SSS0.Px2.p1.1)\.
- P\. Manakul, A\. Liusie, and M\. Gales \(2023\)SelfCheckGPT: zero\-resource black\-box hallucination detection for generative large language models\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 9004–9017\.External Links:[Link](https://aclanthology.org/2023.emnlp-main.557/),[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.557)Cited by:[§2](https://arxiv.org/html/2607.00895#S2.SS0.SSS0.Px1.p1.1)\.
- M\. Marone, O\. Weller, W\. Fleshman, E\. Yang, D\. Lawrie, and B\. V\. Durme \(2025\)MmBERT: a modern multilingual encoder with annealed language learning\.External Links:2509\.06888,[Link](https://arxiv.org/abs/2509.06888)Cited by:[§1](https://arxiv.org/html/2607.00895#S1.p5.1),[§2](https://arxiv.org/html/2607.00895#S2.SS0.SSS0.Px1.p1.1),[§6\.1](https://arxiv.org/html/2607.00895#S6.SS1.p1.1)\.
- A\. Mishra, A\. Asai, V\. Balachandran, Y\. Wang, G\. Neubig, Y\. Tsvetkov, and H\. Hajishirzi \(2024\)Fine\-grained hallucination detection and editing for language models\.External Links:2401\.06855,[Link](https://arxiv.org/abs/2401.06855)Cited by:[§2](https://arxiv.org/html/2607.00895#S2.SS0.SSS0.Px1.p1.1),[§3](https://arxiv.org/html/2607.00895#S3.p2.1),[§3](https://arxiv.org/html/2607.00895#S3.p3.1)\.
- C\. Niu, Y\. Wu, J\. Zhu, S\. Xu, K\. Shum, R\. Zhong, J\. Song, and T\. Zhang \(2024\)RAGTruth: a hallucination corpus for developing trustworthy retrieval\-augmented language models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 10862–10878\.External Links:[Link](https://aclanthology.org/2024.acl-long.585/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.585)Cited by:[§1](https://arxiv.org/html/2607.00895#S1.p1.1),[§1](https://arxiv.org/html/2607.00895#S1.p2.1),[§1](https://arxiv.org/html/2607.00895#S1.p4.1),[§1](https://arxiv.org/html/2607.00895#S1.p5.1),[§2](https://arxiv.org/html/2607.00895#S2.SS0.SSS0.Px1.p1.1),[§3](https://arxiv.org/html/2607.00895#S3.p2.1),[§3](https://arxiv.org/html/2607.00895#S3.p3.1),[§4\.1](https://arxiv.org/html/2607.00895#S4.SS1.p1.1),[Table 3](https://arxiv.org/html/2607.00895#S6.T3.1.5.5.2.1.1),[Table 3](https://arxiv.org/html/2607.00895#S6.T3.1.6.6.2.1.1)\.
- NVIDIA, :, A\. Blakeman, A\. Thomas, A\. Jhunjhunwala, A\. Gupta, A\. Khattar, A\. Rajfer, A\. Renduchintala, A\. Asif, A\. Vavre, A\. F\. Miranda, A\. Bilal, A\. Zaman, A\. Hotchandani, A\. Shukla, A\. Bercovich, A\. Ficek, A\. Gronskiy, A\. Kondratenko, A\. Steiner, A\. Ye, A\. Bukharin, A\. Milesi, A\. Taghibakhshi, A\. Gatti, A\. Liu, A\. Kumar, A\. Phanishayee, A\. S\. Mahabaleshwarkar, A\. Klein, A\. Zuker, A\. Geifman, A\. Bhiwandiwalla, A\. Subramaniam, A\. Santilli, A\. Fulks, A\. McHarg, A\. Tao, A\. Skliar, A\. Agrusa, A\. Srivastava, A\. Verma, A\. Shors, A\. Warno, A\. S\. I\. Llaquet, A\. Mehta, A\. Nowaczynski, A\. Jain, A\. Aithal, A\. Poojary, A\. Ahamed, A\. Mishra, A\. K\. Thekkumpate, A\. Sohrabizadeh, A\. Kaur, A\. Vem, A\. Dattagupta, B\. S\. Anandan, B\. Sadeghi, B\. Lanir, B\. Schifferer, B\. Nushi, B\. Kartal, B\. Thiede, B\. D\. Rouhani, B\. Deng, B\. Schatz, B\. Ginsburg, B\. Wang, B\. Nemire, B\. Norick, B\. Dang, B\. Westphal, B\. Yu, B\. Khailany, B\. Catanzaro, C\. del Mundo, C\. Aarish, C\. Lee, C\. Hwang, C\. Sakr, C\. Wang, C\. Truong, C\. Cui, C\. Cheng, C\. Hsieh, C\. Zhang, C\. Deng, C\. Patel, C\. Alexiuk, C\. Cosgrove, C\. Munley, C\. Harvey, C\. Parisien, C\. Shen, C\. Li, C\. Neale, C\. Gao, C\. Meurillon, D\. Gil, D\. Su, D\. Zhao, D\. Corneil, D\. Afrimi, D\. Egert, D\. Korzekwa, D\. Lo, D\. Machlab, D\. Serebrenik, D\. Sorokin, D\. Gitman, D\. Levy, D\. Stosic, D\. Mosallanezhad, D\. Yu, D\. Karamyan, D\. Donia, D\. Debroy, D\. Narayanan, D\. O’Kelly, D\. Peri, D\. Nathawani, Di, Wu, D\. Rekesh, D\. Kakwani, D\. Plummer, D\. Anh, D\. Yu, D\. Jiang, D\. Kim, D\. Poorkay, D\. Riach, D\. Stosic, D\. VanStee, E\. Meng, E\. Minasyan, E\. Lin, E\. M\. P\. Long, E\. Sarafin, E\. Segal, E\. Lantz, E\. Evans, E\. Ning, E\. Chung, E\. Harper, E\. Pham\-Hung, E\. Tramel, E\. Yang, E\. Galinkin, E\. Pounds, E\. G\. Goncalves, E\. Briones, E\. Wu, E\. Bakhturina, E\. Tsykunov, E\. Dobrowolska, F\. Ladhak, F\. Memarian, F\. Wang, F\. 
Jia, F\. Soares, F\. V\. Frujeri, F\. Chen, F\. Lin, F\. Galko, F\. Sun, F\. Siino, F\. Hou, G\. H\. Agam, G\. Kaplun, G\. Bhatt, G\. Prasad, G\. Kulshreshtha, G\. Armstrong, G\. Shen, G\. Borghesi, G\. Neskovic, G\. Batmaz, G\. Lam, G\. Mason, G\. Pauloski, G\. Nalbandyan, G\. Chlebus, G\. Karch, G\. Liu, G\. Zhang, G\. Huang, H\. Maron, H\. Qian, H\. Elisha, H\. Ren, H\. K\. S\. Kumar, H\. Hud, H\. Nover, H\. S\. Hall, H\. Iso, H\. Ngo, H\. Hum, H\. Sahota, H\. Wang, H\. Soni, H\. Tamoyan, H\. Li, H\. Chen, H\. Li, H\. Wang, H\. Nguyen, I\. Chiles, I\. Galil, I\. Shahaf, I\. Gitman, I\. Shovkun, I\. Loshchilov, I\. Guehring, I\. Schen, I\. Levy, I\. Neeman, I\. Moshkov, I\. Golan, I\. Putterman, J\. Choi, J\. Slowikowski, J\. Kautz, J\. P\. Scowcroft, J\. Casper, J\. Mitra, J\. Glick, J\. Chen, J\. Oliver, J\. Xu, J\. Zhu, J\. Song, J\. Zhang, J\. Jiao, J\. Zeng, J\. Lou, J\. King, J\. Zhang, J\. Wang, J\. Choi, J\. Chu, J\. Conway, J\. Guman, J\. Jatko, J\. Rausch, J\. Kamalu, J\. Roberts, J\. Greco, J\. Mensel, J\. Alben, J\. Yang, J\. Cohen, J\. Raiman, J\. Jennings, J\. Mabry, J\. Pierce, J\. Daw, J\. V\. Vialard, J\. Yi, J\. Parmar, K\. Jain, K\. Zhu, K\. Briski, K\. Cheung, K\. Luna, K\. Willowhawk, K\. Wyss, K\. Santhanam, K\. Shih, K\. Kong, K\. Nguyen, K\. Bhardwaj, K\. S\. Sivamani, K\. Krommydas, K\. C\. Puvvada, K\. Pawelec, K\. Anik, K\. Keprios, K\. Day, L\. McAfee, L\. Du, L\. Derczynski, L\. Ding, L\. Liu, L\. Wu, L\. Kadoch, L\. Wei, L\. Vega, L\. Robison, L\. Su, M\. V\. Segbroeck, M\. J\. Mikulski, M\. R\. de Melo, M\. Sypula, M\. Fathi, M\. N\. Sreedhar, M\. T\. Chandran, M\. Kilaru, M\. Ashkenazi, M\. Cuevas, M\. Romeijn, M\. Chochowski, M\. Cai, M\. Mozolewski, M\. Kliegl, M\. Stepniewska\-Dziubinska, M\. Patelka, M\. Machczynski, M\. Novikov, M\. Ferrato, M\. Golub, M\. Samadi, M\. Corpuz, M\. Wang, M\. Wu, M\. Price, M\. Boubdir, M\. Schaffer, M\. Andersch, M\. Boone, M\. Gschwind, M\. Lightstone, M\. Loh, M\. Bien, M\. Zawalski, M\. 
Gill, M\. Martinez, M\. Khona, M\. Chrzanowski, M\. Houston, M\. Ma, M\. Lee, M\. Fawzy, M\. Dabbah, M\. Shoeybi, M\. Patwary, N\. Mulepati, N\. Nabwani, N\. Dhameja, N\. Hennouni, N\. Hereth, N\. Pinckney, N\. Algarici, N\. Assaf, N\. Haber, N\. Knight, N\. Reamaroon, N\. Quak, N\. Bhatia, N\. Desai, N\. Ludwig, N\. Tajbakhsh, N\. Xu, N\. Ailon, N\. Juluru, N\. Nitin, O\. Masad, O\. Rybakov, O\. Hrinchuk, O\. Kuchaiev, O\. Viessmann, O\. Delalleau, O\. Olabiyi, O\. U\. Argov, O\. Puny, O\. Tropp, P\. Ribalta, P\. Bhattacharya, P\. Lampropoulos, P\. Mannan, P\. Shamis, P\. Legresley, P\. Gibbons, P\. Molchanov, P\. Morkisz, P\. Dykas, P\. Jin, P\. Aquilanti, P\. Xu, P\. Januszewski, P\. Laskiewicz, P\. Jannaty, P\. Gurumurthy, P\. P\. Thombre, P\. Varshney, P\. Gundecha, P\. Tredak, P\. Meng, Q\. Wan, R\. K\. Mahabadi, R\. Oberman, R\. Garg, R\. Sri\-Tharan, R\. Kandu, R\. Sanadhya, R\. El\-Yaniv, R\. Zilberstein, R\. Shafipour, R\. Macalisang, R\. Tian, R\. Kovacs, R\. Pi, R\. Izzo, R\. Shahbazyan, R\. Garg, R\. Puri, R\. F\. Neves, R\. Zhao, R\. Borkar, R\. Gala, R\. Islam, R\. Clark, R\. Hesse, R\. Kirby, R\. Waleffe, R\. Watve, R\. Koren, R\. Banner, R\. Zhang, R\. J\. Hewett, R\. Prenger, R\. Stewart, R\. Egashira, S\. Mahdavi, S\. Paliwal, S\. Singh, S\. Modi, S\. Dave, S\. Shinagawa, S\. Kriman, S\. Bhaskar, S\. Lym, S\. Kariyappa, S\. Satheesh, S\. V\. Murari, S\. Pasumarthi, S\. Mishra, S\. Muralidharan, S\. Hara, S\. Narentharen, S\. Anandaraj, S\. Na, S\. Bak, S\. Bak, S\. Sameni, S\. Mard, S\. Panev, S\. Henneman, S\. Poulos, S\. Mor, S\. Acharya, S\. Ghosh, S\. T\. Sreenivas, S\. Mendelson, S\. Kotek, S\. Wang, S\. Aharon, S\. Gharghabi, S\. Lin, S\. Chen, S\. Fan, S\. Baskaran, S\. Gopa, S\. Prabhumoye, S\. Pachori, S\. Toshniwal, S\. Ding, S\. Krishnamurthy, S\. Singh, S\. Sun, S\. Das, S\. A\. Thottakara, S\. Ithape, S\. Majumdar, S\. Singhal, S\. H\. Singudasu, S\. Bhuvanapalli, S\. Veccham, S\. Sergienko, S\. Alborghetti, S\. Ge, S\. Rong, S\. D\. 
Devare, S\. Rao, S\. K\. Barua, S\. Ha, S\. Gai, S\. Gunasekar, S\. Panguluri, S\. Gupta, S\. Hinzburh, S\. Priyadarshi, S\. N\. Akter, T\. Abramovich, T\. Bui, T\. Varshney, T\. Ter\-Hovhannisyan, T\. Ene, T\. Kong, T\. Do, T\. Zhang, T\. Moore, T\. Blankevoort, T\. Moon, T\. Mitra, T\. Balough, T\. Grzegorzek, T\. Hliwiak, T\. Asida, T\. B\. Natan, T\. Keren, T\. Ronen, T\. Salim, T\. Wang, T\. Rebedea, T\. Konuk, T\. Vashishth, U\. Karpas, U\. De, V\. Noorozi, V\. Srinivasan, V\. Elango, V\. Agrawal, V\. Cui, V\. Korthikanti, V\. Mehta, V\. Rao, V\. Wu, V\. Kurin, V\. Lavrukhin, V\. Anisimov, V\. Pham, W\. Jiang, W\. U\. Ahmad, W\. Ishihara, W\. Du, W\. Ping, W\. Chai, W\. Dai, W\. Helmholz, W\. Jennings, W\. Zhu, W\. Prazuch, X\. Ren, X\. Yu, Y\. Breek, Y\. Chen, Y\. Yu, Y\. Chen, Y\. Galron, Y\. Karnati, Y\. Choi, Y\. Meyer, Y\. Wu, Y\. Zhang, Y\. Lin, Y\. Geifman, Y\. Fu, Y\. Kwon, Y\. Yao, Y\. Guvvla, Y\. Huang, Y\. Liu, Z\. Moshe, Z\. Newell, Z\. Wang, Z\. Li, Z\. Zhu, Z\. Yang, Z\. Liu, Z\. Yan, and Z\. Wertheimer \(2026\)Nemotron 3 ultra: open, efficient mixture\-of\-experts hybrid mamba\-transformer model for agentic reasoning\.External Links:2606\.15007,[Link](https://arxiv.org/abs/2606.15007)Cited by:[§6\.1](https://arxiv.org/html/2607.00895#S6.SS1.p1.1)\.
- OpenAI, :, S\. Agarwal, L\. Ahmad, J\. Ai, S\. Altman, A\. Applebaum, E\. Arbus, R\. K\. Arora, Y\. Bai, B\. Baker, H\. Bao, B\. Barak, A\. Bennett, T\. Bertao, N\. Brett, E\. Brevdo, G\. Brockman, S\. Bubeck, C\. Chang, K\. Chen, M\. Chen, E\. Cheung, A\. Clark, D\. Cook, M\. Dukhan, C\. Dvorak, K\. Fives, V\. Fomenko, T\. Garipov, K\. Georgiev, M\. Glaese, T\. Gogineni, A\. Goucher, L\. Gross, K\. G\. Guzman, J\. Hallman, J\. Hehir, J\. Heidecke, A\. Helyar, H\. Hu, R\. Huet, J\. Huh, S\. Jain, Z\. Johnson, C\. Koch, I\. Kofman, D\. Kundel, J\. Kwon, V\. Kyrylov, E\. Y\. Le, G\. Leclerc, J\. P\. Lennon, S\. Lessans, M\. Lezcano\-Casado, Y\. Li, Z\. Li, J\. Lin, J\. Liss, Lily, Liu, J\. Liu, K\. Lu, C\. Lu, Z\. Martinovic, L\. McCallum, J\. McGrath, S\. McKinney, A\. McLaughlin, S\. Mei, S\. Mostovoy, T\. Mu, G\. Myles, A\. Neitz, A\. Nichol, J\. Pachocki, A\. Paino, D\. Palmie, A\. Pantuliano, G\. Parascandolo, J\. Park, L\. Pathak, C\. Paz, L\. Peran, D\. Pimenov, M\. Pokrass, E\. Proehl, H\. Qiu, G\. Raila, F\. Raso, H\. Ren, K\. Richardson, D\. Robinson, B\. Rotsted, H\. Salman, S\. Sanjeev, M\. Schwarzer, D\. Sculley, H\. Sikchi, K\. Simon, K\. Singhal, Y\. Song, D\. Stuckey, Z\. Sun, P\. Tillet, S\. Toizer, F\. Tsimpourlas, N\. Vyas, E\. Wallace, X\. Wang, M\. Wang, O\. Watkins, K\. Weil, A\. Wendling, K\. Whinnery, C\. Whitney, H\. Wong, L\. Yang, Y\. Yang, M\. Yasunaga, K\. Ying, W\. Zaremba, W\. Zhan, C\. Zhang, B\. Zhang, E\. Zhang, and S\. Zhao \(2025\)Gpt\-oss\-120b & gpt\-oss\-20b model card\.External Links:2508\.10925,[Link](https://arxiv.org/abs/2508.10925)Cited by:[§6\.1](https://arxiv.org/html/2607.00895#S6.SS1.p1.1),[§6\.4](https://arxiv.org/html/2607.00895#S6.SS4.p2.1)\.
- I\. Padhi, M\. Nagireddy, G\. Cornacchia, S\. Chaudhury, T\. Pedapati, P\. Dognin, K\. Murugesan, E\. Miehling, M\. Santillán Cooper, K\. Fraser, G\. Zizzo, M\. Z\. Hameed, M\. Purcell, M\. Desmond, Q\. Pan, I\. Vejsbjerg, E\. M\. Daly, M\. Hind, W\. Geyer, A\. Rawat, K\. R\. Varshney, and P\. Sattigeri \(2025\)Granite guardian: comprehensive LLM safeguarding\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 3: Industry Track\),W\. Chen, Y\. Yang, M\. Kachuee, and X\. Fu \(Eds\.\),Albuquerque, New Mexico,pp\. 607–615\.External Links:[Link](https://aclanthology.org/2025.naacl-industry.49/),[Document](https://dx.doi.org/10.18653/v1/2025.naacl-industry.49),ISBN 979\-8\-89176\-194\-0Cited by:[§6\.4](https://arxiv.org/html/2607.00895#S6.SS4.p1.2)\.
- Qwen Team \(2026\)Qwen3\.6\-35B\-A3B: agentic coding power, now open to all\.External Links:[Link](https://qwen.ai/blog?id=qwen3.6-35b-a3b)Cited by:[§4\.3](https://arxiv.org/html/2607.00895#S4.SS3.p2.1)\.
- S\. S\. Ravi, B\. Mielczarek, A\. Kannappan, D\. Kiela, and R\. Qian \(2024\)Lynx: an open source hallucination evaluation model\.External Links:2407\.08488,[Link](https://arxiv.org/abs/2407.08488)Cited by:[§6\.4](https://arxiv.org/html/2607.00895#S6.SS4.p1.2)\.
- G\. Recski, S\. Toth, N\. Verdha, I\. Boros, and A\. Kovacs \(2026\)ACL\-verbatim: hallucination\-free question answering for research\.External Links:2605\.21102,[Link](https://arxiv.org/abs/2605.21102)Cited by:[§1](https://arxiv.org/html/2607.00895#S1.p2.1),[§1](https://arxiv.org/html/2607.00895#S1.p5.1),[§4\.1](https://arxiv.org/html/2607.00895#S4.SS1.p1.1)\.
- E\. Rykov, K\. Petrushina, M\. Savkin, V\. Olisov, A\. Vazhentsev, K\. Titova, A\. Panchenko, V\. Konovalov, and J\. Belikova \(2025\)When models lie, we learn: multilingual span\-level hallucination detection with PsiloQA\.InFindings of the Association for Computational Linguistics: EMNLP 2025,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 11663–11682\.External Links:[Link](https://aclanthology.org/2025.findings-emnlp.626/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.626),ISBN 979\-8\-89176\-335\-7Cited by:[§1](https://arxiv.org/html/2607.00895#S1.p1.1),[§1](https://arxiv.org/html/2607.00895#S1.p2.1),[§1](https://arxiv.org/html/2607.00895#S1.p5.1),[§2](https://arxiv.org/html/2607.00895#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2607.00895#S4.SS1.p1.1),[§6\.3](https://arxiv.org/html/2607.00895#S6.SS3.p1.6),[Table 3](https://arxiv.org/html/2607.00895#S6.T3.1.11.11.2.1.1),[Table 3](https://arxiv.org/html/2607.00895#S6.T3.1.12.12.2.1.1),[Table 3](https://arxiv.org/html/2607.00895#S6.T3.1.8.8.2.1.1),[Table 3](https://arxiv.org/html/2607.00895#S6.T3.1.9.9.2.1.1)\.
- J\. Song, X\. Wang, J\. Zhu, Y\. Wu, X\. Cheng, R\. Zhong, and C\. Niu \(2024\)RAG\-HAT: a hallucination\-aware tuning pipeline for LLM in retrieval\-augmented generation\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track,F\. Dernoncourt, D\. Preoţiuc\-Pietro, and A\. Shimorina \(Eds\.\),Miami, Florida, US,pp\. 1548–1558\.External Links:[Link](https://aclanthology.org/2024.emnlp-industry.113/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-industry.113)Cited by:[§2](https://arxiv.org/html/2607.00895#S2.SS0.SSS0.Px1.p1.1),[Table 3](https://arxiv.org/html/2607.00895#S6.T3.1.2.2.2.1.1)\.
- H\. Su, T\. Hu, H\. S\. Koppula, K\. Krishna, H\. Pouransari, C\. Hsieh, C\. Koc, J\. Y\. Cheng, O\. Tuzel, and R\. Vemulapalli \(2025\)Learning to reason for hallucination span detection\.External Links:2510\.02173,[Link](https://arxiv.org/abs/2510.02173)Cited by:[§2](https://arxiv.org/html/2607.00895#S2.SS0.SSS0.Px1.p1.1)\.
- L\. Tang, P\. Laban, and G\. Durrett \(2024\)MiniCheck: efficient fact\-checking of LLMs on grounding documents\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 8818–8847\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.499/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.499)Cited by:[§1](https://arxiv.org/html/2607.00895#S1.p2.1),[§1](https://arxiv.org/html/2607.00895#S1.p4.1),[§6\.4](https://arxiv.org/html/2607.00895#S6.SS4.p1.2)\.
- Qwen Team \(2026\)Qwen3\.5: accelerating productivity with native multimodal agents\.External Links:[Link](https://qwen.ai/blog?id=qwen3.5)Cited by:[§1](https://arxiv.org/html/2607.00895#S1.p5.1),[§6\.1](https://arxiv.org/html/2607.00895#S6.SS1.p1.1)\.
- Y\. Tian, W\. Yan, Q\. Yang, X\. Zhao, Q\. Chen, W\. Wang, Z\. Luo, L\. Ma, and D\. Song \(2025\)CodeHalu: investigating code hallucinations in llms via execution\-based verification\.InProceedings of the Thirty\-Ninth AAAI Conference on Artificial Intelligence and Thirty\-Seventh Conference on Innovative Applications of Artificial Intelligence and Fifteenth Symposium on Educational Advances in Artificial Intelligence,AAAI’25/IAAI’25/EAAI’25\.External Links:ISBN 978\-1\-57735\-897\-8,[Link](https://doi.org/10.1609/aaai.v39i24.34717),[Document](https://dx.doi.org/10.1609/aaai.v39i24.34717)Cited by:[§1](https://arxiv.org/html/2607.00895#S1.p4.1),[§2](https://arxiv.org/html/2607.00895#S2.SS0.SSS0.Px2.p1.1),[§3](https://arxiv.org/html/2607.00895#S3.p3.1)\.
- R\. Vazquez, T\. Mickus, E\. Zosa, T\. Vahtola, J\. Tiedemann, A\. Sinha, V\. Segonne, F\. Sanchez \- Vega, A\. Raganato, J\. Libovický, J\. Karlgren, S\. Ji, J\. Helcl, L\. Guillou, O\. De Gibert, J\. Bengoetxea, J\. Attieh, and M\. Apidianaki \(2025\)SemEval\-2025 task 3: mu\-SHROOM, the multilingual shared\-task on hallucinations and related observable overgeneration mistakes\.InProceedings of the 19th International Workshop on Semantic Evaluation \(SemEval\-2025\),S\. Rosenthal, A\. Rosá, D\. Ghosh, and M\. Zampieri \(Eds\.\),Vienna, Austria,pp\. 2472–2497\.External Links:[Link](https://aclanthology.org/2025.semeval-1.322/),ISBN 979\-8\-89176\-273\-2Cited by:[§1](https://arxiv.org/html/2607.00895#S1.p1.1),[§1](https://arxiv.org/html/2607.00895#S1.p2.1),[§2](https://arxiv.org/html/2607.00895#S2.SS0.SSS0.Px1.p1.1)\.

## Appendix APrompt Templates

We release the exact prompt code with the dataset. This appendix gives the main templates used for detector training and synthetic data construction. Variable slots are shown as \{name\}; lines beginning with \#\# are annotations in the figure, not part of the prompt.

You are an expert annotator who identifies hallucinated spans in a generated answer with respect to a given context (the only trusted evidence). A hallucinated span is a substring of the answer that is not supported by the context. Spans consistent with the context are not hallucinations.

Quote each hallucinated span verbatim from the answer and classify it into exactly one category and one subcategory.

Categories:

\- contradiction: conflicts with the context (a wrong value, number, date, name, or relationship)

\- fabricated\_reference: an entity, name, identifier, or section that is absent from the context

\- unsupported\_addition: a claim, detail, or behavior the context never states

Subcategories:

entity, temporal, numerical, value, relational, identifier, section, attribute, claim, behavior, elaboration, subjective, unspecified

Reply with ONLY a JSON object:

\{"hallucinated\_spans": \[\{"text": "\.\.\.", "category": "\.\.\.", "subcategory": "\.\.\."\}\]\}\.

If nothing is unsupported, reply \{"hallucinated\_spans": \[\]\}\.

User message:

\{request\_and\_context\}

Answer to verify:

\{answer\}

Figure 4: Detector prompt used for generative detector training and zero-shot LLM-judge evaluation. For code-agent rows in the generative detector training set, the same prompt also requests a short explanation field for each span; clean and hallucinated rows from that source both use the explanation variant, so the prompt choice does not leak the label.

\#\# Question generation for README and Wikipedia sources

You generate a single, natural information-seeking question that can be answered from a given document.

Document:

\{document\_chunk\}

Generate one \{question\_type\} question (\{question\_type\_definition\}) that the document answers.

Rules:

1\. Return ONLY the question, nothing else.

2\. Use neutral, self-contained phrasing.

3\. Keep it short and natural, like a query typed into a search engine.

4\. It must be answerable from the document above.

\#\# Clean-answer generation for tool-output, ACL, README, and Wikipedia sources

You are a helpful assistant answering a user's question using ONLY the provided evidence. Write a correct, natural answer grounded strictly in that evidence.

Your answer MUST:

\- Be accurate and fully supported by the evidence \-\- invent nothing.

\- Reference concrete details from the evidence where relevant.

\- Be concise.

Your answer must NOT:

\- Add claims, identifiers, or values not present in the evidence.

\- Include filler.

Figure 5: Question and clean\-answer generation prompts\. README and Wikipedia examples first generate a self\-contained question from a markdown chunk; tool\-output, ACL, README, and Wikipedia examples then generate a clean answer grounded only in the supplied evidence\.

You are a hallucination injector for building a hallucination detection dataset\.

You are given a CORRECT answer and the CONTEXT it is grounded in\. Return ONLY a small set of localized replacement edits that turn the answer into one containing a specific kind of hallucination\. The pipeline applies your edits; outside them the answer must stay identical\.

Target hallucination:

\- Category: \{category\}

\- Subtype: \{subcategory\}

CRITICAL grounding rule:

\- The injected error MUST be detectable by comparing the answer against the provided context alone\.

Rules:

\- Make 1\-2 distinct edits targeting the subtype above\.

\- Each replacement span must be as small as possible\.

\- Total changed text must be less than 30% of the answer\.

\- Changes must be plausible and subtle\.

\- Each "original" must be an exact substring of the answer, appearing exactly once\.

Respond in JSON:

\{"changes": \[\{"original": "\.\.\.", "hallucinated": "\.\.\.", "explanation": "\.\.\."\}\]\}

Figure 6: Generic injection prompt\. The model proposes structured replacement edits; the pipeline applies them deterministically and recovers exact character offsets\. Code uses two source\-specific variants, one for wrong implementations and unrequested changes and one for fabricated methods, attributes, and keyword arguments\. Both keep the same original/hallucinated output format\.
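The deterministic application step described in the Figure 6 caption can be illustrated with a short sketch. It is not the released pipeline. The names `Span`, `apply_edits`, and `locate_quoted_spans` are hypothetical. Measuring the 30% limit on the replaced original text is an assumption, as is mapping each span quoted by a Figure 4 style reply to its first occurrence in the answer.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    text: str


def apply_edits(answer, changes, max_changed_fraction=0.30):
    """Apply replacement edits; return the edited answer and injected spans."""
    located = []
    for change in changes:
        original = change["original"]
        start = answer.find(original) if original else -1
        # Enforce "an exact substring of the answer, appearing exactly once".
        if start == -1 or answer.find(original, start + 1) != -1:
            raise ValueError(f"'original' must occur exactly once: {original!r}")
        located.append((start, start + len(original), change["hallucinated"]))
    located.sort()
    for (_, prev_end, _), (next_start, _, _) in zip(located, located[1:]):
        if next_start < prev_end:
            raise ValueError("edits overlap")
    # Assumption: "changed text" is measured on the replaced original text.
    changed = sum(end - start for start, end, _ in located)
    if changed >= max_changed_fraction * len(answer):
        raise ValueError("edits change too much of the answer")

    # Rebuild left to right so offsets refer to the edited answer.
    pieces, spans, cursor, out_len = [], [], 0, 0
    for start, end, replacement in located:
        pieces.append(answer[cursor:start])
        out_len += start - cursor
        spans.append(Span(out_len, out_len + len(replacement), replacement))
        pieces.append(replacement)
        out_len += len(replacement)
        cursor = end
    pieces.append(answer[cursor:])
    return "".join(pieces), spans


def locate_quoted_spans(answer, reply):
    """Map spans quoted in a detector's JSON reply to character offsets."""
    spans = []
    for item in json.loads(reply)["hallucinated_spans"]:
        start = answer.find(item["text"])  # assumption: first occurrence
        if start != -1:  # quotes that are not verbatim substrings are dropped
            spans.append(Span(start, start + len(item["text"]), item["text"]))
    return spans


answer = "The script retries 3 times and logs to stderr."
changes = [{"original": "3 times", "hallucinated": "5 times",
            "explanation": "wrong retry count"}]
edited, gold = apply_edits(answer, changes)
reply = json.dumps({"hallucinated_spans": [
    {"text": "5 times", "category": "contradiction", "subcategory": "numerical"}]})
assert locate_quoted_spans(edited, reply) == gold
```

Locating every `original` in the unedited answer before rebuilding the string keeps the result independent of the order in which the model lists its edits, and the recorded offsets always index into the edited answer that a detector would see.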
