Source or It Didn't Happen: A Multi-Agent Framework for Citation Hallucination Detection


Summary

This paper introduces CiteTracer, a multi-agent framework for detecting citation hallucinations in LLM-generated scientific writing, achieving high accuracy on synthetic and real-world benchmarks.

arXiv:2605.08583v1 Announce Type: new Abstract: Large language models are increasingly used in scientific writing, yet they can fabricate citation-shaped references that appear plausible but fail bibliographic verification. Existing detectors often reduce verification to binary found/not-found decisions and rely on brittle parsing or incomplete retrieval, offering little field-level signal to auditors. We reframe citation hallucination detection as taxonomy-aligned field-level adjudication and introduce a 12-code taxonomy spanning Real, Potential, and Hallucinated citations. Based on this taxonomy, we build CiteTracer, a cascading multi-agent detector that extracts structured citations from PDF and BibTeX, retrieves evidence through cache lookup, URL fetch, scholar connectors, and web search, applies deterministic field matching, and routes ambiguous cases to class-specialist judgers. We release a benchmark of 2,450 synthetic citations built from real seeds with controlled LLM mutations, paired with 957 real-world fabricated citations drawn from ICLR 2026 and an anonymous conference desk-rejected submissions. CiteTracer reaches 97.1% accuracy on the synthetic benchmark, with class-level F1 scores of 97.0, 95.8, and 98.5 for Real, Potential, and Hallucinated, respectively, and detects 97.1% of fabrications on the real-world set without abstaining. Code: https://github.com/aaFrostnova/CiteTracer.

# Source or It Didn’t Happen: A Multi-Agent Framework for Citation Hallucination Detection
Source: [https://arxiv.org/html/2605.08583](https://arxiv.org/html/2605.08583)
Mingzhe Li (1), Zhiqiang Lin (2), Shiqing Ma (1)
(1) University of Massachusetts Amherst, (2) The Ohio State University

###### Abstract

Large language models are increasingly used in scientific writing, yet they can fabricate citation-shaped references that appear plausible but fail bibliographic verification. Existing detectors often reduce verification to binary found/not-found decisions and rely on brittle parsing or incomplete retrieval, offering little field-level signal to auditors. We reframe citation hallucination detection as taxonomy-aligned field-level adjudication and introduce a 12-code taxonomy spanning Real, Potential, and Hallucinated citations. Based on this taxonomy, we build CiteTracer, a cascading multi-agent detector that extracts structured citations from PDF and BibTeX, retrieves evidence through cache lookup, URL fetch, scholar connectors, and web search, applies deterministic field matching, and routes ambiguous cases to class-specialist judgers. We release a benchmark of 2,450 synthetic citations built from real seeds with controlled LLM mutations, paired with 957 real-world fabricated citations drawn from desk-rejected submissions to ICLR 2026 and an anonymous conference. CiteTracer reaches 97.1% accuracy on the synthetic benchmark, with class-level $F_1$ of 97.0, 95.8, and 98.5 for Real, Potential, and Hallucinated, respectively, and detects 97.1% of fabrications on the real-world set without abstaining. Code: [https://github.com/aaFrostnova/CiteTracer](https://github.com/aaFrostnova/CiteTracer).

## 1 Introduction

Citations are the infrastructure of scientific communication: they justify claims, allocate scholarly credit, and trace the chain of evidence behind every paper (Waltman, [2016](https://arxiv.org/html/2605.08583#bib.bib38)). Within this broader notion of citation integrity, bibliographic integrity asks whether a cited entry’s title, authors, venue, year, and identifiers actually correspond to a real publication (Yuan et al., [2026](https://arxiv.org/html/2605.08583#bib.bib1)). A bibliographic-level error denies the original authors their credit, breaks reproducibility because the metadata no longer leads back to a retrievable source, and propagates downstream as search engines surface the fabricated entry (Rekdal, [2014](https://arxiv.org/html/2605.08583#bib.bib29); Sarol et al., [2024](https://arxiv.org/html/2605.08583#bib.bib26)).

Large language models are now deeply embedded in the research workflow, especially in academic writing, where they help generate ideas, polish exposition, and draft submission text. This shift introduces a new bibliographic failure mode: an LLM can rely on distributional patterns in text to produce citation-shaped entries with hallucinated or mismatched fields, such as an incorrect title, a nonexistent author, or a venue that does not correspond to the cited work (Yuan et al., [2026](https://arxiv.org/html/2605.08583#bib.bib1)). This risk follows from the broader problem of hallucination, but citations make the failure especially consequential: they are high-stakes factual claims whose fields should be externally verifiable, yet LLMs are highly fluent at producing references that appear plausible by construction (Walters and Wilder, [2023](https://arxiv.org/html/2605.08583#bib.bib12); Chelli et al., [2024](https://arxiv.org/html/2605.08583#bib.bib13)). Hallucinated citations range from incorrect metadata on real papers, to entries that mix real and fabricated fields, to entirely nonexistent publications, and they call for different auditor responses (correction, rejection, or uncertainty) rather than a single binary judgment. The problem is now operational at the venue level: ICLR 2026 chairs assembled a desk-reject queue of more than 600 submissions flagged for fabricated references, and ICML and ACM CCS have announced similar policies for the 2026 cycle (Sakai et al., [2026](https://arxiv.org/html/2605.08583#bib.bib14); GPTZero, [2025a](https://arxiv.org/html/2605.08583#bib.bib16); The Register, [2026](https://arxiv.org/html/2605.08583#bib.bib15)).

![Refer to caption](https://arxiv.org/html/2605.08583v1/figs/overview.png)

Figure 1: Overview of CiteTracer. Four stages run in sequence: (1) the *Reference Extractor* parses each citation block into a structured field-level record; (2) the *Cascading Evidence Collector* walks a memory cache, URL fetch, eight Scholar Connectors, and web search; (3) the *Field Matcher* compares the record against the evidence field by field; (4) *Class-specialist Judgers* adjudicate ambiguous cases and emit a taxonomy-aligned verdict with the offending fields and reasons.

Existing detectors miss this failure surface in two specific ways. First, they lack a fine-grained taxonomy and the field-level audit that would back one. Commercial citation auditors such as Citely (Citely, [2024](https://arxiv.org/html/2605.08583#bib.bib5)), SwanRef (SwanRef, [2024](https://arxiv.org/html/2605.08583#bib.bib6)), CiteCheck (CiteCheck, [2024](https://arxiv.org/html/2605.08583#bib.bib7)), and RefCheck-AI (RefCheck-AI, [2024](https://arxiv.org/html/2605.08583#bib.bib8)) report only a binary Real-or-Fake label (van Rensburg, [2025](https://arxiv.org/html/2605.08583#bib.bib9)), and academic auditors such as CiteAudit (Yuan et al., [2026](https://arxiv.org/html/2605.08583#bib.bib1)) query multiple bibliographic APIs but still emit the same binary verdict, so the ambiguous middle ground (nickname variants, non-academic sources, peripheral metadata gaps) collapses into the same yes/no signal. Open tools such as Hallucinator (Sbardella, [2024](https://arxiv.org/html/2605.08583#bib.bib3)) consult more than ten bibliographic databases in parallel, but key the verdict on title and author and leave venue, year, DOI, pages, and publisher unaudited. GPTZero’s hallucination mode (GPTZero Team, [2023](https://arxiv.org/html/2605.08583#bib.bib4)) does cross-check external sources, but audits only five fields (title, author, date, URL, publisher) and gates the throughput behind a paid subscription. Second, PDF input compounds the gap: their reference parsers drop entries, mis-segment author and title spans, and occasionally hallucinate fields of their own, so the verifier inherits a corrupted input before any auditing happens.

To address these gaps, we introduce a comprehensive benchmark and a multi-agent framework for citation hallucination detection. The benchmark spans the three classes an auditor actually needs to act on (correct citations, the ambiguous middle ground, and concrete fabrications) and exercises every core bibliographic field (title, authors, venue, year, identifiers, and peripheral metadata); we build it by drawing real-world citations from heterogeneous bibliographic sources and applying controlled LLM-driven mutations field by field, so every entry carries a known ground-truth code ([Table 1](https://arxiv.org/html/2605.08583#S3.T1)). The framework then strengthens the three steps prior systems leave brittle: a layout-aware PDF extractor that re-parses each reference from a bounding-box crop with a vision LLM, a comprehensive retrieval pipeline that queries every applicable bibliographic connector in parallel, and a rigorous layered verification stage that resolves easy cases with deterministic rules and reserves class-specialist judge agents only for the ambiguous remainder. Experiments show that CiteTracer reaches 97.1% accuracy on the 2,450-citation synthetic benchmark, with class-level $F_1$ of 97.0 for Real, 95.8 for Potential, and 98.5 for Hallucinated, surpassing every baseline under both PDF and BibTeX inputs; on a real-world hallucinated-citation dataset of 957 fabricated citations released by venue chairs, CiteTracer detects 97.1% of fabrications without abstaining. Our contributions are summarized as follows:

- We introduce a 12-code citation hallucination taxonomy that names every field-level failure mode under three classes (Real, Potential, Hallucinated), and release a 2,450-citation synthetic benchmark spanning five rendering styles.
- We propose CiteTracer, a four-module multi-agent detector that combines a layout-aware vision-LLM Reference Extractor, a verdict-driven cascade over eight bibliographic connectors, deterministic field-level rule matching, and three class-specialist judgers, emitting per-field taxonomy-aligned verdicts.
- We evaluate CiteTracer against five advanced baselines (GPT-5.5 Thinking, Claude 4.7 Opus Adaptive Thinking, Gemini 3.1 Pro, GPTZero, Hallucinator) under both PDF and BibTeX inputs, where CiteTracer reaches 97.1% accuracy on the synthetic benchmark and 97.1% recall on the real-world set, surpassing every baseline on every class.

## 2 Related Work

Hallucination in Academic Writing. Large language models hallucinate factual content even when surface fluency is maintained, a failure mode characterized across model families, training regimes, and deployment settings in recent surveys (Huang et al., [2025](https://arxiv.org/html/2605.08583#bib.bib20); Tonmoy et al., [2024](https://arxiv.org/html/2605.08583#bib.bib21); Rahman et al., [2026](https://arxiv.org/html/2605.08583#bib.bib46)) and in zero-resource detection work such as SelfCheckGPT (Manakul et al., [2023](https://arxiv.org/html/2605.08583#bib.bib19)). The failure is especially consequential in academic writing because citations are structured factual claims whose title, authors, venue, year, and identifiers should resolve to a real publication, yet LLMs readily produce references that look plausible but fail bibliographic verification (Walters and Wilder, [2023](https://arxiv.org/html/2605.08583#bib.bib12); Chelli et al., [2024](https://arxiv.org/html/2605.08583#bib.bib13); Sakai et al., [2026](https://arxiv.org/html/2605.08583#bib.bib14)). The problem is now operational at venue scale. NeurIPS 2025 chairs documented widespread fabricated references in submitted papers, with third-party tooling flagging dozens of cases per session (GPTZero, [2025b](https://arxiv.org/html/2605.08583#bib.bib17); The Register, [2026](https://arxiv.org/html/2605.08583#bib.bib15)); ICLR 2026 assembled a desk-reject queue of submissions whose bibliographies contained hallucinated citations (GPTZero, [2025a](https://arxiv.org/html/2605.08583#bib.bib16)); and ACM CCS 2026 published a Transparency Report enumerating the citations its review cycle flagged as AI-fabricated (ACM CCS 2026 Program Committee, [2026](https://arxiv.org/html/2605.08583#bib.bib18)). These cases establish citation hallucination as a deployment-level concern rather than a research curiosity, and motivate the field-level, taxonomy-aligned detection that we target in this paper.

Citation Hallucination Detection. Existing tools split into two camps that each leave the verdict hard to audit at the field level. Commercial citation auditors such as Citely (Citely, [2024](https://arxiv.org/html/2605.08583#bib.bib5)), SwanRef (SwanRef, [2024](https://arxiv.org/html/2605.08583#bib.bib6)), CiteCheck (CiteCheck, [2024](https://arxiv.org/html/2605.08583#bib.bib7)), and RefCheck-AI (RefCheck-AI, [2024](https://arxiv.org/html/2605.08583#bib.bib8)) report only a binary Real-or-Fake label (van Rensburg, [2025](https://arxiv.org/html/2605.08583#bib.bib9)), which hides which field is wrong and forces auditors to redo the diagnostic work themselves. Academic auditors such as CiteAudit (Yuan et al., [2026](https://arxiv.org/html/2605.08583#bib.bib1)) query multiple bibliographic APIs but still emit a binary verdict, so the Potential middle ground (nickname variants, non-academic sources, peripheral metadata gaps) collapses into the same yes/no signal. Open tools such as Hallucinator (Sbardella, [2024](https://arxiv.org/html/2605.08583#bib.bib3)) consult more than ten bibliographic databases in parallel, but key the verdict on title and author and leave venue, year, DOI, pages, and publisher unaudited. GPTZero’s hallucination mode (GPTZero Team, [2023](https://arxiv.org/html/2605.08583#bib.bib4)) does cross-check external sources, but audits only five fields (title, author, date, URL, publisher), gates throughput behind an expensive paid subscription, and accepts only PDF input. None of these systems exposes a per-field taxonomy that supports auditing which field is wrong and why, which is the gap our 12-code taxonomy and field-level multi-agent detector close.

## 3 Benchmark

Existing citation auditors are largely closed-source and report opaque metrics, so the field lacks an open benchmark that compares methods on consistent ground truth. We close this gap with a 2,450-citation synthetic benchmark grounded in real bibliographies and a 957-citation real-world test set drawn from the ICLR 2026 desk-reject queue (807 citations) and another anonymous conference (150 citations); full construction and per-code details are deferred to Appendix [A](https://arxiv.org/html/2605.08583#A1).

Taxonomy. A bibliographic citation decomposes into a fixed set of fields (title, authors, venue, year, identifiers, peripheral metadata), and the appropriate auditor response depends on which field is wrong and whether the error can be verified externally. We define 12 fine-grained codes grouped into three auditor-facing classes ([Table 1](https://arxiv.org/html/2605.08583#S3.T1)). Real (R1–R3) covers exact matches and normalizable formatting variants such as venue abbreviations, author initials, and *et al.* truncation. Hallucinated (H1–H6) localizes a single bibliographic error to one field: title (H1), authors (H2), venue (H3), year (H4), identifier (H5), or peripheral metadata (H6). Potential (P1–P3) buffers auditor-ambiguous cases: nickname or transliteration variants (P1), non-academic sources whose existence cannot be verified through bibliographic indices (P2), and peripheral fields that no public source records for the cited paper (P3). Per-field localization gives the benchmark its diagnostic value: a wrong title and a wrong DOI on otherwise identical seeds correspond to two distinct error modes that require different auditor corrections.
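As a reading aid, the taxonomy can be written down as a small lookup table. The sketch below is illustrative only: the names `CitationClass`, `TAXONOMY`, and `class_of` are hypothetical, the descriptions paraphrase the paragraph above, and R1–R3 share one generic description because the text does not itemize them individually (Table 1 remains the authoritative definition).

```python
from enum import Enum


class CitationClass(Enum):
    REAL = "Real"
    POTENTIAL = "Potential"
    HALLUCINATED = "Hallucinated"


# Hypothetical encoding of the 12 codes; descriptions paraphrase the text.
REAL_NOTE = "exact match or normalizable formatting variant"
TAXONOMY = {
    **{code: REAL_NOTE for code in ("R1", "R2", "R3")},
    "P1": "nickname or transliteration variant",
    "P2": "non-academic source, unverifiable via bibliographic indices",
    "P3": "peripheral field that no public source records",
    "H1": "title error",
    "H2": "authors error",
    "H3": "venue error",
    "H4": "year error",
    "H5": "identifier error",
    "H6": "peripheral metadata error",
}

_PREFIX_TO_CLASS = {
    "R": CitationClass.REAL,
    "P": CitationClass.POTENTIAL,
    "H": CitationClass.HALLUCINATED,
}


def class_of(code: str) -> CitationClass:
    """Map a fine-grained code such as 'H3' to its auditor-facing class."""
    if code not in TAXONOMY:
        raise ValueError(f"unknown taxonomy code: {code!r}")
    return _PREFIX_TO_CLASS[code[0]]
```

Keeping the class derivable from the code prefix means a per-code verdict can always be collapsed to the three-way label used for class-level metrics.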

Table 1: The 12-code citation hallucination taxonomy and per-code counts in the 2,450-citation synthetic benchmark.

Construction. We draw seed BibTeX entries from open-access bibliographic repositories (e.g., DBLP, arXiv, ACL) across 50 recent ML and CS papers, prioritizing entries that populate the largest set of fields. For every non-R1 code we apply a code-specific mutation operator that touches a documented set of fields and leaves the rest of the seed identical: an LLM-driven generator proposes a candidate value, and a deterministic post-processor enforces the operator’s field schema. We do not include synthetic P2 cases because P2 is defined by source type rather than bibliographic-field correctness: any clearly non-academic citation, such as a blog post, GitHub repository, or forum thread, is directly routed to P2, making it a routing case rather than a challenging verification case. Each synthetic entry passes three independent checks before it enters the benchmark: a round-trip audit on operator diffs, a verifiability check on every R1 and P3 entry, and an author-curated boundary review on every P1 substitution. These checks retain 2,450 taxonomy-labeled instances out of 3,100 generated entries; per-code counts are reported alongside each code in [Table 1](https://arxiv.org/html/2605.08583#S3.T1).
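One plausible reading of the "round-trip audit on operator diffs" is a check that a mutated entry differs from its seed only on the fields its operator is allowed to touch. The sketch below illustrates that idea under stated assumptions: the function names, the `OPERATOR_FIELDS` mapping, and the requirement that at least one field changes are all hypothetical and are not taken from the paper's appendix.

```python
# Hypothetical operator schema: which fields each mutation code may change.
OPERATOR_FIELDS = {
    "H1": {"title"},
    "H4": {"year"},
}


def changed_fields(seed: dict, mutated: dict) -> set:
    """Return the field names whose values differ between seed and mutant."""
    keys = set(seed) | set(mutated)
    return {k for k in keys if seed.get(k) != mutated.get(k)}


def passes_diff_audit(seed: dict, mutated: dict, code: str) -> bool:
    """Accept a mutant only if it changed something, and only allowed fields."""
    diff = changed_fields(seed, mutated)
    return bool(diff) and diff <= OPERATOR_FIELDS[code]


# Made-up seed entry used purely for illustration.
seed = {"title": "A Made-Up Title", "year": "2021", "venue": "ExampleConf"}
ok_mutant = dict(seed, year="2019")                          # touches only year
leaky_mutant = dict(seed, year="2019", venue="OtherConf")    # also edits venue
```

Under this sketch, `ok_mutant` would pass as an H4 instance, while `leaky_mutant` would be rejected because the generator also altered the venue.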

Real-world test set. We additionally collect two real-world slices on which fabrications were flagged by the venue’s own chairs. The first slice contains 807 citations from 647 ICLR 2026 submissions that the program chairs desk-rejected for fabricated references (footnote 1: [https://openreview.net/group?id=ICLR.cc/2026/Conference#tab-desk-rejected-submissions](https://openreview.net/group?id=ICLR.cc/2026/Conference#tab-desk-rejected-submissions)). The second slice contains 150 citations from 41 desk-rejected submissions to an anonymous conference. Every entry in both slices carries the chairs’ verdict and the cited bibliographic record, so synthetic-set numbers can be cross-checked against fabrications two different venues actually rejected.

## 4 Methodology

In this section, we introduce CiteTracer, an end-to-end agentic framework that turns citation hallucination detection into per-citation, per-field verdicts an auditor can act on. Instead of asking a single model to audit an entire bibliography in one prompt, CiteTracer decomposes the task into four modules: 1) a Reference Extractor, 2) a Cascading Evidence Collector, 3) a Field Matcher, and 4) a panel of Class-specialist Judgers. Given a paper, these modules parse every reference into a structured citation record, retrieve external evidence across public bibliographic sources, perform deterministic field-level matching between the parsed citation and retrieved evidence, and route each case to a class-specialist judge that returns a taxonomy-aligned code together with the offending field span and the bibliographic sources that produced the verdict. At a high level, the full pipeline maps an input paper to a set of citation-level decisions. Formally, for an input paper $P$, CiteTracer produces

$$\textsc{CiteTracer}(P)=\{(r_i, y_i, \Delta_i, \mathcal{S}_i)\}_{i=1}^{N},$$

where $r_i$ is the $i$-th structured citation record, $y_i$ is its taxonomy-aligned verdict, $\Delta_i$ is the set of offending field spans, and $\mathcal{S}_i$ is the set of bibliographic sources supporting the decision.
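The output tuple can be pictured as a small record type. The following is a minimal sketch, assuming plain Python containers for each component; `CitationDecision` and `summarize` are hypothetical names and do not describe the released code.

```python
from dataclasses import dataclass, field


@dataclass
class CitationDecision:
    record: dict                                   # r_i: field name -> extracted value
    verdict: str                                   # y_i: taxonomy code, e.g. "R1" or "H3"
    offending: set = field(default_factory=set)    # Delta_i: offending fields
    sources: list = field(default_factory=list)    # S_i: supporting sources


def summarize(decisions):
    """Count decisions per class letter (R, P, H) across one paper."""
    counts = {"R": 0, "P": 0, "H": 0}
    for decision in decisions:
        counts[decision.verdict[0]] += 1
    return counts


# Made-up decisions for a two-reference paper.
decisions = [
    CitationDecision({"title": "Example A"}, "R1", sources=["dblp"]),
    CitationDecision({"title": "Example B", "year": "2031"}, "H4", {"year"}),
]
```

Carrying the offending fields and sources alongside the verdict is what lets an auditor see why a reference was flagged rather than only that it was.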

### 4.1 Reference Extractor

The Reference Extractor takes a paper as input and produces a list of canonical citation records, with every bibliographic field a downstream verifier might check. This step is challenging because citation extraction still requires character-level precision under realistic PDF layouts. Although modern OCR systems can detect bibliography regions and citation blocks, their transcriptions may still contain subtle character-level errors, especially for author names, venue abbreviations, page numbers, and identifiers. Moreover, bibliography styles vary widely across papers, and even references within the same paper may exhibit different surface formats. As a result, purely rule-based extraction is often brittle and difficult to scale across bibliography styles, and learning-based approaches such as soft-constrained citation field extractors trained on the UMass Citations corpus (Anzaroot et al., [2014](https://arxiv.org/html/2605.08583#bib.bib43); Anzaroot and McCallum, [2013](https://arxiv.org/html/2605.08583#bib.bib44)) still leave residual character-level errors that propagate into downstream verification.

To address these issues, we use the OCR model as a high-recall citation-block proposer rather than as the final parser. Let $\mathcal{M}_{\mathrm{ocr}}$ denote the OCR model. Given the bibliography region $P_{\mathrm{bib}}$ of an input paper $P$, the OCR model returns citation blocks together with their initial transcriptions:

$$\{(B_k, T_k)\}_{k=1}^{K}=\mathcal{M}_{\mathrm{ocr}}(P_{\mathrm{bib}}),$$

where $B_k$ is the page-level region of the $k$-th detected citation block, and $T_k$ is its OCR transcription. We then introduce a parsing agent as a second safeguard. Let $\mathcal{A}_{\mathrm{Parser}}$ denote the parsing agent. For each detected citation block, the agent takes the cropped block image and its OCR transcription as input, rechecks the extracted text against the visual evidence, and directly extracts structured bibliographic fields. Formally, let $\mathcal{F}$ denote the set of bibliographic fields to be verified, including title, authors, venue, year, DOI, pages, publisher, location, and URL. For the $k$-th detected citation block, the parsing agent produces a provisional structured citation record:

$$r_k=\mathcal{A}_{\mathrm{Parser}}(P[B_k], T_k)=\{(f, v_{k,f}) \mid f\in\mathcal{F}\},$$

where $v_{k,f}$ is the extracted value of field $f$ from the $k$-th detected citation block. This crop-level rechecking allows the extractor to repair OCR errors without relying on rigid hand-crafted rules for specific bibliography styles. Some references may be split across a column boundary or a page boundary, so a detected citation block does not always correspond to a complete reference. In these boundary cases, the parsing agent identifies continuation blocks and merges their visual-textual evidence before finalizing the structured record. This boundary repair step allows the extractor to recover references that are fragmented across columns or pages. The final output of the Reference Extractor is the set of structured citation records $\mathcal{R}(P)=\{r_i\}_{i=1}^{N}$, where $N$ is the number of finalized references after boundary repair.

### 4.2 Cascading Evidence Collector

The Cascading Evidence Collector takes a structured citation record $r_i$ and returns a ranked list of candidate matches together with the bibliographic evidence supporting each match. This step is challenging because citation verification must balance retrieval cost against source coverage. Many citations can be resolved by cheap signals, such as previously verified records or explicit DOI/arXiv links, but long-tail references may only appear in specialized bibliographic sources or unstructured web pages. As a result, querying every source for every citation wastes connector calls on the easy majority, while relying on a single source leaves biomedical papers, ACL Anthology entries, workshop papers, and non-standard web references uncovered.

To address this trade-off, we use a four-stage retrieval cascade ordered from cheapest to most general: Memory, URL Fetch, Scholar Connectors, and Web Search. The first stage, Memory, queries a cache initialized from an offline DBLP mirror and updated with every newly verified Real citation, in the spirit of long-term memory layers proposed for production agent systems (Chhikara et al., [2025](https://arxiv.org/html/2605.08583#bib.bib47)). It returns previously seen candidate records at near-zero cost. The second stage, URL Fetch, is triggered when the citation contains explicit links such as a DOI, arXiv URL, or publisher landing page. The Web Agent follows each URL and extracts structured metadata, so this stage produces evidence from direct citation links rather than from a general query.

The third stage, Scholar Connectors, sends the Scholar Agent to query multiple public bibliographic sources in parallel. This parallel fan-out keeps latency bounded while covering both general computer science literature and domain-specific sources. The final stage, Web Search, uses the Web Agent again, but now with a search query generated from the citation record rather than a direct URL, in the spirit of multi-agent systems that collect evidence from open-web sources for misinformation detection and structured data acquisition (Tian et al., [2024](https://arxiv.org/html/2605.08583#bib.bib45); Ma et al., [2025](https://arxiv.org/html/2605.08583#bib.bib48)). It retrieves raw web summaries or pages and extracts candidate bibliographic records when structured sources miss.

The cascade stops on a *verdict*. After each stage, the Field Matcher and Class-Specialist Judgers (Sections [4.3](https://arxiv.org/html/2605.08583#S4.SS3) and [4.4](https://arxiv.org/html/2605.08583#S4.SS4)) examine the cumulative *evidence bundle* $\mathcal{E}_i$, the union of candidate records collected by every stage tried so far, and emit a citation-level verdict in {Real, Potential, Hallucinated}. The cascade stops at the first stage whose evidence supports a Real verdict and returns that verdict immediately, skipping the remaining stages.
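The stopping rule described above can be sketched as a loop over stages that accumulates evidence and exits on the first Real verdict. This is a minimal sketch under stated assumptions: `run_cascade`, the `(name, fetch)` stage pairs, and the `adjudicate` callable (standing in for the Field Matcher plus judgers) are hypothetical, and real stages would perform network calls rather than return canned lists.

```python
def run_cascade(record, stages, adjudicate):
    """Walk stages from cheapest to most general, stopping on a Real verdict.

    stages:     ordered list of (name, fetch) pairs; fetch(record) returns
                a list of candidate evidence records.
    adjudicate: callable (record, evidence) -> "Real" | "Potential" |
                "Hallucinated"; a stand-in for the matcher and judgers.
    """
    evidence = []      # cumulative evidence bundle E_i
    tried = []         # names of the stages actually executed
    verdict = None
    for name, fetch in stages:
        evidence.extend(fetch(record))
        tried.append(name)
        verdict = adjudicate(record, evidence)
        if verdict == "Real":
            break      # skip the remaining, more expensive stages
    return verdict, evidence, tried
```

A caller would pass the four stages in order, for example `[("memory", ...), ("url_fetch", ...), ("scholar", ...), ("web_search", ...)]`; when no stage supports a Real verdict, the function returns the verdict computed on the full bundle after the last stage.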

### 4.3 Field Matcher

The Field Matcher takes a structured citation record $r_i$ and its evidence bundle $\mathcal{E}_i$ as input, and emits a field-level status profile for downstream judgers. This step is necessary because citation correctness is often field-dependent: a citation may match the retrieved evidence on title and year, but disagree on authors, venue, DOI, or peripheral metadata. A citation-level similarity score would hide these differences, whereas field-level matching exposes which parts of the reference are supported by evidence. The challenge is to avoid unnecessary LLM calls on the easy majority while still handling residual cases that require flexible reasoning. To address this, the Field Matcher uses two stages. The first stage is a deterministic rule matcher, which applies field-specific normalizers and supports early exit. The second stage is a Matcher Agent, which is invoked only when deterministic rules cannot fully resolve the citation.

For the deterministic stage, let $\nu_f(\cdot)$ denote the rule-based normalizer for field $f$. These normalizers only encode high-confidence, reproducible transformations, such as case folding, punctuation removal, DOI canonicalization, page-range normalization, author-order normalization, and known venue abbreviations. Given the extracted field value $v_{i,f}$ from citation $r_i$ and the corresponding field value $u_{e,f}$ from candidate evidence $e\in\mathcal{E}_i$, the rule matcher assigns

$$m^{\mathrm{rule}}_{i,e,f}=\begin{cases}\textsc{match}, & \nu_f(v_{i,f})=\nu_f(u_{e,f}),\\ \textsc{missing}, & v_{i,f}=\varnothing\ \text{or}\ u_{e,f}=\varnothing,\\ \textsc{mismatch}, & \text{otherwise}.\end{cases}$$

Here, $m^{\mathrm{rule}}_{i,e,f}$ is a deterministic field status and does not rely on generative reasoning. If at least one candidate matches all explicitly provided fields under these deterministic normalizers, the matcher exits early without invoking the Matcher Agent. Let $\mathcal{F}_i^{+}=\{f\in\mathcal{F}\mid v_{i,f}\neq\varnothing\}$ denote the fields present in citation $r_i$. The early-exit condition is

$$\exists e\in\mathcal{E}_i\quad\text{s.t.}\quad\forall f\in\mathcal{F}_i^{+},\; m^{\mathrm{rule}}_{i,e,f}=\textsc{match}.$$

When this condition holds, the citation is treated as a deterministic Valid case. If no candidate satisfies the early-exit condition, the case is passed to the Matcher Agent. Let $\mathcal{A}_{\mathrm{Matcher}}$ denote the Matcher Agent. Unlike the deterministic normalizer, the Matcher Agent does not merely canonicalize strings; it examines the citation, the retrieved evidence, and the rule-based status pattern to produce a residual field-status profile:

$$\mathbf{m}_i=\mathcal{A}_{\mathrm{Matcher}}\left(r_i,\mathcal{E}_i,\{m^{\mathrm{rule}}_{i,e,f}\}\right).$$

The output $\mathbf{m}_i$ records, for each audited field, whether the residual discrepancy is best explained by a normalizable variation, missing candidate metadata, missing reference metadata, or a true field contradiction. For example, the Matcher Agent may label an author mismatch as reordered authors, a venue mismatch as match after abbreviation, or a publisher/page field as candidate missing. This residual field-status profile is then passed to the Class-Specialist Judgers for taxonomy-level adjudication.
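The deterministic stage can be illustrated with two toy normalizers and the early-exit test. This is a sketch, not the paper's implementation: the normalizers cover only case folding, punctuation removal, and a simple DOI prefix strip; all names are hypothetical; emptiness is checked before normalization; and the guard against an empty $\mathcal{F}_i^{+}$ is an added assumption, since the formula would otherwise hold vacuously.

```python
import re

MATCH, MISSING, MISMATCH = "match", "missing", "mismatch"


def norm_text(value: str) -> str:
    """Case-fold, replace punctuation with spaces, and collapse whitespace."""
    kept = re.sub(r"[^a-z0-9 ]", " ", value.lower())
    return " ".join(kept.split())


def norm_doi(value: str) -> str:
    """Lower-case a DOI and drop a resolver URL or 'doi:' prefix."""
    return re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)", "", value.strip().lower())


NORMALIZERS = {"doi": norm_doi}  # every other field falls back to norm_text


def field_status(name: str, cited: str, found: str) -> str:
    """Deterministic status for one field of one candidate."""
    if not cited or not found:
        return MISSING
    nu = NORMALIZERS.get(name, norm_text)
    return MATCH if nu(cited) == nu(found) else MISMATCH


def early_exit(record: dict, bundle: list) -> bool:
    """True if some candidate matches every field present in the citation."""
    present = [f for f, v in record.items() if v]  # F_i^+
    return bool(present) and any(
        all(field_status(f, record[f], e.get(f, "")) == MATCH for f in present)
        for e in bundle
    )


# Made-up citation and candidates; the DOI is a placeholder, not a real one.
record = {"title": "A Made-Up Title: For Illustration", "year": "2017",
          "doi": "https://doi.org/10.0000/EXAMPLE", "venue": ""}
good = {"title": "a made-up title for illustration.", "year": "2017",
        "doi": "10.0000/example", "venue": "ExampleConf"}
bad = dict(good, year="2018")
```

Here `early_exit(record, [bad, good])` holds because `good` agrees on every field the citation actually provides; the empty `venue` is outside $\mathcal{F}_i^{+}$ and is therefore ignored.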

### 4.4 Class-Specialist Judgers

The Class-Specialist Judgers adjudicate cases that cannot be fully resolved by deterministic field matching and emit a final taxonomy-aligned verdict for each citation. This step is challenging because different error classes require different decision logic. For example, format variations such as author reordering or venue abbreviation should be treated differently from missing candidate metadata, and both are different from cases where the retrieved evidence contradicts the cited title, year, DOI, or venue. A single general-purpose judge over all taxonomy codes can easily become miscalibrated because it must apply different evidence thresholds across Real, Potential, and Hallucinated cases.

To address this issue, we use class-specialist judgers instead of one monolithic judge. The routing decision is based on the field-status profile produced by the Field Matcher. Let $\mathbf{m}_i$ denote the final field-level status profile for citation $r_i$, and let $\mathcal{E}_i$ denote its retrieved evidence bundle. A judger router selects the specialist judger according to the residual field pattern:

$$J_i=\rho(r_i,\mathcal{E}_i,\mathbf{m}_i),\qquad J_i\in\mathcal{J}_{\mathrm{cls}},$$

where $\rho$ is the routing function and $\mathcal{J}_{\mathrm{cls}}$ is the set of class-specialist judgers. This routing step sends normalizable residual cases to the Valid Judger, ambiguous but plausible cases to the Potential Judger, and evidence-contradicting or evidence-absent cases to the Hallucinated Judger.

The selected judger then produces the final citation-level decision. Formally,

$$(y_i,\Delta_i,\mathcal{S}_i)=J_i(r_i,\mathcal{E}_i,\mathbf{m}_i),$$

where $y_i$ is the final taxonomy code, $\Delta_i\subseteq\mathcal{F}$ is the set of offending or unresolved fields, and $\mathcal{S}_i$ is the supporting evidence used to justify the decision.
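A routing function in the spirit of $\rho$ can be sketched as a few rules over the residual field-status profile. The status labels (`"match"`, `"normalizable"`, `"candidate_missing"`, `"contradiction"`), the precedence of the rules, and the name `route` are all assumptions made for illustration; the text specifies only which kinds of cases reach which judger, not how the router decides.

```python
def route(profile: dict, has_evidence: bool) -> str:
    """Pick a specialist judger from a residual field-status profile.

    profile:      field name -> status label from the (hypothetical) matcher.
    has_evidence: False when retrieval returned no candidate at all.
    """
    statuses = set(profile.values())
    # Evidence-absent or evidence-contradicting cases.
    if not has_evidence or "contradiction" in statuses:
        return "hallucinated"
    # Every residual difference is explained by a normalizable variation.
    if statuses <= {"match", "normalizable"}:
        return "valid"
    # Anything else (e.g. missing metadata) is ambiguous but plausible.
    return "potential"


# Made-up profiles for three citations.
reordered = {"title": "match", "authors": "normalizable"}
gap = {"title": "match", "publisher": "candidate_missing"}
wrong_year = {"title": "match", "year": "contradiction"}
```

Checking contradictions first is a deliberate choice in this sketch: a single contradicted field sends the case to the Hallucinated Judger even when other fields are merely missing.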

## 5 Evaluation

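Table 2: Label-level performance on BibTeX and PDF inputs.

Table 3: Per-subtype TPR and FPR (%); R aggregates the three Real codes, all other buckets are reported individually.

| Method | Input | Metric | R | P1 | P3 | P-avg | H1 | H2 | H3 | H4 | H5 | H6 | H-avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5.5 | PDF | TPR | 94.3 | 40.2 | 1.3 | 20.8 | 72.2 | 77.2 | 89.7 | 87.2 | 96.4 | 95.5 | 86.4 |
| GPT-5.5 | PDF | FPR | 11.5 | 0.1 | 0.7 | 0.4 | 1.6 | 1.7 | 0.5 | 0.4 | 0.1 | 6.9 | 1.9 |
| GPT-5.5 | BibTeX | TPR | 93.6 | 78.0 | 3.9 | 41.0 | 90.5 | 88.9 | 92.4 | 98.5 | 98.5 | 94.0 | 93.8 |
| GPT-5.5 | BibTeX | FPR | 3.9 | 0.1 | 0.2 | 0.2 | 1.2 | 2.4 | 0.2 | 0.3 | 0.0 | 7.6 | 2.0 |
| Claude 4.7 Opus | PDF | TPR | 92.7 | 23.0 | 13.3 | 18.1 | 53.0 | 48.2 | 88.7 | 90.3 | 95.3 | 79.5 | 75.8 |
| Claude 4.7 Opus | PDF | FPR | 17.7 | 0.1 | 5.0 | 2.5 | 1.6 | 0.2 | 0.6 | 0.6 | 0.1 | 5.7 | 1.5 |
| Claude 4.7 Opus | BibTeX | TPR | 90.7 | 51.6 | 15.6 | 33.6 | 70.0 | 53.0 | 90.9 | 93.8 | 96.5 | 87.3 | 81.9 |
| Claude 4.7 Opus | BibTeX | FPR | 11.9 | 0.2 | 4.6 | 2.4 | 1.2 | 0.5 | 0.7 | 0.5 | 0.4 | 6.5 | 1.6 |
| Gemini 3.1 Pro | PDF | TPR | 90.0 | 17.2 | 2.5 | 9.9 | 26.3 | 22.8 | 56.7 | 64.1 | 64.6 | 37.2 | 45.3 |
| Gemini 3.1 Pro | PDF | FPR | 41.5 | 0.7 | 4.5 | 2.6 | 4.7 | 0.5 | 0.7 | 0.9 | 0.1 | 2.6 | 1.6 |
| Gemini 3.1 Pro | BibTeX | TPR | 88.5 | 19.8 | 7.2 | 13.5 | 19.5 | 25.8 | 53.3 | 69.7 | 47.0 | 19.3 | 39.1 |
| Gemini 3.1 Pro | BibTeX | FPR | 46.3 | 0.2 | 12.0 | 6.1 | 2.4 | 0.4 | 0.8 | 0.8 | 0.0 | 0.9 | 0.9 |
| GPTZero† | PDF | TPR | 62.0 | - | - | - | 51.0 | 34.5 | - | 72.8 | 33.3 | 37.8 | 45.9 |
| GPTZero† | PDF | FPR | 36.8 | - | - | - | 3.5 | 6.1 | - | 33.6 | 5.4 | 11.6 | 12.0 |
| Ours | PDF | TPR | 90.8 | 100.0 | 99.4 | 99.7 | 100.0 | 99.0 | 99.5 | 95.4 | 100.0 | 100.0 | 99.0 |
| Ours | PDF | FPR | 0.1 | 0.6 | 0.4 | 0.5 | 1.1 | 1.0 | 0.3 | 0.0 | 1.1 | 0.1 | 0.6 |
| Ours | BibTeX | TPR | 94.3 | 100.0 | 99.4 | 99.7 | 100.0 | 99.0 | 99.5 | 95.4 | 100.0 | 100.0 | 99.0 |
| Ours | BibTeX | FPR | 0.1 | 0.6 | 0.4 | 0.5 | 0.5 | 0.7 | 0.3 | 0.0 | 0.3 | 0.1 | 0.3 |

### 5.1 Experiment Setup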

Datasets and Input Modes. We evaluate on two corpora introduced in Section [3](https://arxiv.org/html/2605.08583#S3): a synthetic benchmark of 2,450 citations covering the 11 taxonomy codes other than P2, and a 957-citation real-world set drawn from 647 ICLR 2026 desk-rejected submissions and 41 desk-rejected submissions to another anonymous conference, whose references venue chairs flagged as fabricated (ground truth Hallucinated by construction; Section [5.4](https://arxiv.org/html/2605.08583#S5.SS4)). Synthetic-benchmark citations are rendered under five bibliography styles spanning single-column (plain, ICLR) and two-column (IEEE, ACM Reference Format, Springer LNCS) layouts. Each system is run under two input modes: *PDF input* on the rendered benchmark PDF ($N = 2{,}392$ after excluding 58 render-omitted citations) and *BibTeX input* on the source .bib entries ($N = 2{,}450$).

Baselines. We compare CiteTracer against frontier AI chatbots and existing citation auditors: GPT-5.5 Thinking (OpenAI, [2026](https://arxiv.org/html/2605.08583#bib.bib39)), Claude 4.7 Opus Adaptive Thinking (Anthropic, [2026](https://arxiv.org/html/2605.08583#bib.bib40)), and Gemini 3.1 Pro (Google, [2026](https://arxiv.org/html/2605.08583#bib.bib41)), all prompted with the same audit prompt; Hallucinator (Sbardella, [2024](https://arxiv.org/html/2605.08583#bib.bib3)), which queries twelve bibliographic sources in parallel but keys the verdict on title and author only; and GPTZero (GPTZero Team, [2023](https://arxiv.org/html/2605.08583#bib.bib4)), which audits five fields (title, author, date, URL, publisher) behind a paid subscription. Neither Hallucinator nor GPTZero exposes a Potential prediction class, so we score them as binary Real-vs-Hallucinated classifiers; GPTZero further accepts only PDF input.

Evaluation Metrics. We evaluate at two granularities. At the label level we cast the three-way verdict (Real, Potential, Hallucinated) as a one-versus-rest task and report precision, recall, and $F_1$ per class. At the subtype level we score predictions against the nine fine-grained buckets (R, P1, P3, H1–H6) with in-bucket TPR (bucket recall) and out-of-bucket FPR; the (TPR, FPR) pair shows whether the system identifies the failure mode without flooding other buckets.
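The two granularities can be made concrete with a short sketch. It assumes that out-of-bucket FPR is the share of citations whose gold bucket is different but which are nevertheless predicted into the bucket; the paper does not spell out the formula, so that reading, along with the function names, is an assumption.

```python
def one_vs_rest(gold, pred, label):
    """Precision, recall, and F1 for one class treated as positive."""
    pairs = list(zip(gold, pred))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def bucket_rates(gold, pred, bucket):
    """In-bucket TPR and out-of-bucket FPR, both in percent."""
    pairs = list(zip(gold, pred))
    in_bucket = [p for g, p in pairs if g == bucket]
    out_bucket = [p for g, p in pairs if g != bucket]
    tpr = 100.0 * sum(p == bucket for p in in_bucket) / len(in_bucket) if in_bucket else float("nan")
    fpr = 100.0 * sum(p == bucket for p in out_bucket) / len(out_bucket) if out_bucket else float("nan")
    return tpr, fpr


# Toy labels for illustration only; these are not benchmark data.
gold = ["R", "R", "P1", "H1", "H1"]
pred = ["R", "P1", "P1", "H1", "R"]
print(bucket_rates(gold, pred, "P1"))
print(one_vs_rest(gold, pred, "R"))
```

Reporting the pair rather than TPR alone matters because a system could reach a high bucket TPR simply by assigning most citations to that bucket, which the FPR would expose.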

### 5.2 Main Verification Results

Label-level Performance. We compare CiteTracer against three frontier AI chatbots and two existing citation auditors on the three-way verdict (Real/Potential/Hallucinated). As shown in [Table 2](https://arxiv.org/html/2605.08583#S5.T2), CiteTracer surpasses every baseline on every class under both input modes, with the largest margin on the Potential class that binary auditors cannot represent. Concretely, on BibTeX input CiteTracer attains $F_1$ of 97.0 on Real, 95.8 on Potential, and 98.5 on Hallucinated, whereas the strongest baseline, GPT-5.5, reaches only 43.8 on Potential. On PDF input CiteTracer records 95.1/95.5/96.9 and is similarly ahead. On the binary Real-vs-Hallucinated subset, CiteTracer keeps a 30+-point $F_1$ lead over Hallucinator and GPTZero.

Per-subtype Performance. We further evaluate whether CiteTracer identifies the correct fine-grained code among the nine scoring buckets. As shown in [Table 3](https://arxiv.org/html/2605.08583#S5.T3), on BibTeX input CiteTracer reaches the highest in-bucket TPR on every reported code except H4 (95.4 versus 98.5 for GPT-5.5) while keeping out-of-bucket FPR at or below 0.7 on every code, and the gap is largest on the Potential buckets that prior auditors cannot adjudicate. Concretely, CiteTracer attains TPR/FPR of 94.3/0.1 on R, 100.0/0.6 on P1, 99.4/0.4 on P3, and an H-average of 99.0/0.3; the strongest baseline, GPT-5.5, reaches 93.6/3.9 on R, 41.0/0.2 on P-avg, and 93.8/2.0 on H-avg, while Gemini 3.1 Pro collapses on P3 (TPR 7.2), and GPTZero leaves every P\* bucket blank because its output space cannot represent the Potential class. [Figure 2](https://arxiv.org/html/2605.08583#S5.F2) corroborates this from a different angle: the BibTeX confusion matrix concentrates on the diagonal, H1, H5, and H6 are fully recovered, and the residual errors are dominated by 50 R-row leaks into P1 or the H\* codes when a single peripheral field fails rule-based normalization.

![Refer to caption](https://arxiv.org/html/2605.08583v1/x1.png)

Figure 2: Confusion matrix on BibTeX input.

Table 4: Per-field extraction accuracy across three reference extractor variants.

### 5.3 Ablations

PDF Extraction. We compare three Reference Extractor variants that share the same OCR and reference-segmentation pass and differ only in the parsing step: A uses the rule-based parser only, B adds an LLM reparse over the OCR text, and C attaches the per-entry cropped page image to the same reparse. As shown in Table [4](https://arxiv.org/html/2605.08583#S5.T4), the LLM reparse step (A to B) yields the larger gain, lifting Authors from 85.6 to 98.2, Location from 81.6 to 100.0, Venue from 92.0 to 99.7, and Volume from 93.5 to 99.8. Adding the page image (B to C) yields a smaller, more targeted gain: Title rises from 96.5 to 98.5 and Identifier from 93.1 to 96.5, with the largest gain on the densest layouts.

Table 5: Impact of the Web and Scholar Agent in the cascading evidence collector.

Impact of Web Agent and Scholar Connectors. The cascading evidence collector pulls from three sources: Scholar Connectors as the primary academic lookup, URL Fetch for direct DOI/arXiv links, and the Web Agent as the long-tail fallback when academic endpoints rate-limit or the cited work lives in an unindexed database. We disable the Web Agent and the Scholar Connectors in turn and re-run the cascade. As shown in [Table 5](https://arxiv.org/html/2605.08583#S5.T5), removing the Web Agent drops $F_1$ across all three classes (Real from 97.0 to 79.6, Potential from 95.8 to 79.0, Hallucinated from 98.5 to 85.8), and removing the Scholar Connectors collapses the pipeline further (Real to 31.4, Potential to 43.3, Hallucinated to 69.1) because the Web Agent and URL Fetch alone cannot recover the structured metadata that academic APIs return. The two ablations establish that Scholar Connectors and the Web Agent address distinct failure modes, and the system needs both.
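A cascade of this kind can be sketched as an ordered list of lookups with fall-through. The sketch assumes the cascade stops at the first stage that returns candidates and treats a failing stage (for example, a rate-limited endpoint) as a miss; the stage order follows the abstract (cache, URL fetch, scholar connectors, web search), and every function name here is hypothetical.

```python
from typing import Callable, Optional

Evidence = dict  # hypothetical: one candidate metadata record
Lookup = Callable[[dict], Optional[list]]


def collect_evidence(citation: dict, stages: list) -> tuple:
    """Try each (name, lookup) stage in order; return the first non-empty result."""
    for name, lookup in stages:
        try:
            candidates = lookup(citation)
        except Exception:
            # Sketch-level handling: any stage failure falls through to the next stage.
            candidates = None
        if candidates:
            return name, candidates
    return "none", []


# Stub stages standing in for real connectors (assumptions, not real APIs).
def cache_lookup(citation):
    return None  # cache miss


def url_fetch(citation):
    raise TimeoutError("simulated rate limit")


def scholar_connectors(citation):
    return [{"title": citation["title"], "source": "scholar"}]


def web_agent(citation):
    return [{"title": citation["title"], "source": "web"}]


STAGES = [
    ("cache", cache_lookup),
    ("url_fetch", url_fetch),
    ("scholar", scholar_connectors),
    ("web", web_agent),
]
stage, found = collect_evidence({"title": "Some Paper"}, STAGES)
```

Disabling a source in an ablation then amounts to dropping its entry from `STAGES`, which is one way to read the two rows of Table 5.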

### 5.4 Real-World Evaluation

We evaluate on two real-world hallucination sets where venue chairs themselves flagged the fabrications. On 807 citations from 647 ICLR 2026 desk-rejected submissions, CiteTracer flags 796 as Hallucinated for **98.6%** recall, with the 11 remaining citations landing in Potential (1 P1, 4 P3, and 6 P2 on non-academic mentions); on 150 chair-confirmed hallucinated citations from 41 anonymous-conference papers, CiteTracer labels **133** as Fake-Reference and the remaining 17 as Potential (author-variant ambiguity), so every confirmed hallucination across both venues is surfaced as either Hallucinated or Potential. On average, each correctly detected citation triggers 2.24 distinct error codes, consistent with LLM-fabricated references inventing multiple fields at once.
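As a check on how these counts combine into the pooled real-world figure, assuming that only Hallucinated verdicts are counted as detections and Potential verdicts are not:

$$\frac{796}{807} \approx 98.6\%, \qquad \frac{133}{150} \approx 88.7\%, \qquad \frac{796 + 133}{807 + 150} = \frac{929}{957} \approx 97.1\%.$$

The remaining $11 + 17 = 28$ citations are the ones routed to Potential for manual inspection.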

## 6 Conclusion

We reframed citation hallucination detection from a binary found-or-not problem into a 12-code taxonomy and built a four-module cascading multi-agent detector that follows the taxonomy's structure: a deterministic rule matcher closes Valid and Hallucinated cases at near-zero cost, an ordered cascade over eight bibliographic connectors collects evidence before any LLM call, and three specialist agents adjudicate disjoint taxonomy slices with calibrated evidence thresholds. The 2,450-citation synthetic benchmark and a 957-citation real-world set drawn from conference submissions let us attribute improvements to specific design choices: CiteTracer reaches 97.1% accuracy on the synthetic set and 97.1% recall on the real-world set.

## References

- \[1\] (2026) Transparency report on AI-generated citations in ACM CCS 2026 submissions. Note: [https://github.com/ACM-CCS-2026/Transparency-Report](https://github.com/ACM-CCS-2026/Transparency-Report). Accessed: 2026-05. Cited by: [§2](https://arxiv.org/html/2605.08583#S2.p1.1).
- \[2\] Anthropic (2026) Claude (opus 4.7 version) \[large language model\]. External Links: [Link](https://claude.ai/). Cited by: [§5.1](https://arxiv.org/html/2605.08583#S5.SS1.p2.1).
- \[3\] S. Anzaroot and A. McCallum (2013) UMass citation field extraction dataset. Note: [http://www.iesl.cs.umass.edu/data/data-umasscitationfield](http://www.iesl.cs.umass.edu/data/data-umasscitationfield). Cited by: [§4.1](https://arxiv.org/html/2605.08583#S4.SS1.p1.1).
- \[4\] S. Anzaroot, A. Passos, D. Belanger, and A. McCallum (2014) Learning soft linear constraints with application to citation field extraction. arXiv preprint arXiv:1403.1349. Cited by: [§4.1](https://arxiv.org/html/2605.08583#S4.SS1.p1.1).
- \[5\] S. Bai, Y. Cai, R. Chen, et al. (2025) Qwen3-vl technical report. arXiv preprint arXiv:2511.21631. Cited by: [Appendix C](https://arxiv.org/html/2605.08583#A3.p1.11).
- \[6\] M. Chelli, J. Descamps, V. Lavoué, C. Trojani, M. Azar, M. Deckert, J. Raynier, G. Clowez, P. Boileau, C. Ruetsch-Chelli, et al. (2024) Hallucination rates and reference accuracy of ChatGPT and Bard for systematic reviews: comparative analysis. Journal of Medical Internet Research 26(1), pp. e53164. Cited by: [§1](https://arxiv.org/html/2605.08583#S1.p2.1), [§2](https://arxiv.org/html/2605.08583#S2.p1.1).
- \[7\] P. Chhikara, D. Khant, S. Aryan, T. Singh, and D. Yadav (2025) Mem0: building production-ready ai agents with scalable long-term memory. arXiv preprint arXiv:2504.19413. Cited by: [§4.2](https://arxiv.org/html/2605.08583#S4.SS2.p2.1).
- \[8\] CiteCheck (2024) CiteCheck: ai-powered citation verification. Note: [https://citecheck.ai/](https://citecheck.ai/). Accessed: 2026-04. Cited by: [§1](https://arxiv.org/html/2605.08583#S1.p3.8), [§2](https://arxiv.org/html/2605.08583#S2.p2.1).
- \[9\] Citely (2024) Citely: AI citation assistant. Note: [https://citely.ai/](https://citely.ai/). Accessed: 2026-04. Cited by: [§1](https://arxiv.org/html/2605.08583#S1.p3.8), [§2](https://arxiv.org/html/2605.08583#S2.p2.1).
- \[10\] Google (2026) Gemini (3.1 pro version) \[large language model\]. External Links: [Link](https://gemini.google.com/). Cited by: [§5.1](https://arxiv.org/html/2605.08583#S5.SS1.p2.1).
- \[11\] GPTZero Team (2023) GPTZero: detecting AI-generated text (Website). GPTZero. External Links: [Link](https://gptzero.me/). Cited by: [§1](https://arxiv.org/html/2605.08583#S1.p3.8), [§2](https://arxiv.org/html/2605.08583#S2.p2.1), [§5.1](https://arxiv.org/html/2605.08583#S5.SS1.p2.1).
- \[12\] GPTZero (2025) GPTZero finds over 50 hallucinations in ICLR 2026 submissions. Note: [https://gptzero.me/news/iclr-2026](https://gptzero.me/news/iclr-2026). Cited by: [§1](https://arxiv.org/html/2605.08583#S1.p2.1), [§2](https://arxiv.org/html/2605.08583#S2.p1.1).
- \[13\] GPTZero (2025) GPTZero flags fabricated citations in NeurIPS submissions. Note: [https://gptzero.me/news/neurips/](https://gptzero.me/news/neurips/). Accessed: 2026-05. Cited by: [§2](https://arxiv.org/html/2605.08583#S2.p1.1).
- \[14\] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, et al. (2025) A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems 43(2), pp. 1–55. Cited by: [§2](https://arxiv.org/html/2605.08583#S2.p1.1).
- \[15\] Kimi Team, T. Bai, Y. Bai, et al. (2026) Kimi k2.5: visual agentic intelligence. External Links: 2602.02276, [Link](https://arxiv.org/abs/2602.02276). Cited by: [Appendix C](https://arxiv.org/html/2605.08583#A3.p1.11).
- \[16\] T. Ma, Y. Qian, Z. Zhang, Z. Wang, X. Qian, F. Bai, Y. Ding, X. Luo, S. Zhang, K. Murugesan, et al. (2025) AutoData: a multi-agent system for open web data collection. arXiv preprint arXiv:2505.15859. Cited by: [§4.2](https://arxiv.org/html/2605.08583#S4.SS2.p3.1).
- \[17\] P. Manakul, A. Liusie, and M. J. F. Gales (2023) SelfCheckGPT: zero-resource black-box hallucination detection for generative large language models. In Proceedings of EMNLP. Cited by: [§2](https://arxiv.org/html/2605.08583#S2.p1.1).
- \[18\] OpenAI (2026) ChatGPT (5.5 version) \[large language model\]. External Links: [Link](https://chat.openai.com/). Cited by: [§5.1](https://arxiv.org/html/2605.08583#S5.SS1.p2.1).
- \[19\] S. S. Rahman, M. A. Islam, M. M. Alam, M. Zeba, M. A. Rahman, S. S. Chowa, M. A. K. Raiaan, and S. Azam (2026) Hallucination to truth: a review of fact-checking and factuality evaluation in large language models. Artificial Intelligence Review. Cited by: [§2](https://arxiv.org/html/2605.08583#S2.p1.1).
- \[20\] RefCheck-AI (2024) RefCheck-AI. Note: [https://github.com/HuaHenry/RefCheck_ai](https://github.com/HuaHenry/RefCheck_ai). Accessed: 2026-04. Cited by: [§1](https://arxiv.org/html/2605.08583#S1.p3.8), [§2](https://arxiv.org/html/2605.08583#S2.p2.1).
- \[21\] O. B. Rekdal (2014) Academic urban legends. Social Studies of Science 44(4), pp. 638–654. Cited by: [§1](https://arxiv.org/html/2605.08583#S1.p1.1).
- \[22\] Y. Sakai, H. Kamigaito, and T. Watanabe (2026) HalluCitation matters: revealing the impact of hallucinated references with 300 hallucinated papers in ACL conferences. Note: [https://arxiv.org/abs/2601.18724](https://arxiv.org/abs/2601.18724). Cited by: [§1](https://arxiv.org/html/2605.08583#S1.p2.1), [§2](https://arxiv.org/html/2605.08583#S2.p1.1).
- \[23\] M. J. Sarol, S. Ming, S. Radhakrishna, J. Schneider, and H. Kilicoglu (2024) Assessing citation integrity in biomedical publications: corpus annotation and NLP models. Bioinformatics 40(7), pp. btae420. Cited by: [§1](https://arxiv.org/html/2605.08583#S1.p1.1).
- \[24\] G. Sbardella (2024) Hallucinator: a citation hallucination checker. Note: [https://github.com/gianlucasb/hallucinator](https://github.com/gianlucasb/hallucinator). Cited by: [§1](https://arxiv.org/html/2605.08583#S1.p3.8), [§2](https://arxiv.org/html/2605.08583#S2.p2.1), [§5.1](https://arxiv.org/html/2605.08583#S5.SS1.p2.1).
- \[25\] SwanRef (2024) SwanRef: reference verification platform. Note: [https://www.swanref.org/](https://www.swanref.org/). Accessed: 2026-04. Cited by: [§1](https://arxiv.org/html/2605.08583#S1.p3.8), [§2](https://arxiv.org/html/2605.08583#S2.p2.1).
- \[26\] The Register (2026) AI conference’s papers contaminated by AI hallucinations. Note: [https://www.theregister.com/2026/01/22/neurips_papers_contaiminated_ai_hallucinations/](https://www.theregister.com/2026/01/22/neurips_papers_contaiminated_ai_hallucinations/). Cited by: [§1](https://arxiv.org/html/2605.08583#S1.p2.1), [§2](https://arxiv.org/html/2605.08583#S2.p1.1).
- \[27\] J. Tian, H. Yu, Y. Orlovskiy, T. Vergho, M. Rivera, M. Goel, Z. Yang, J. Godbout, R. Rabbany, and K. Pelrine (2024) Web retrieval agents for evidence-based misinformation detection. arXiv preprint arXiv:2409.00009. Cited by: [§4.2](https://arxiv.org/html/2605.08583#S4.SS2.p3.1).
- \[28\] S. M. T. I. Tonmoy, S. M. Zaman, V. Jain, A. Rani, V. Rawte, A. Chadha, and A. Das (2024) A comprehensive survey of hallucination mitigation techniques in large language models. arXiv preprint arXiv:2401.01313. Cited by: [§2](https://arxiv.org/html/2605.08583#S2.p1.1).
- \[29\] L. J. J. van Rensburg (2025) AI-powered citation auditing: a zero-assumption protocol for systematic reference verification in academic research. External Links: 2511.04683, [Link](https://arxiv.org/abs/2511.04683). Cited by: [§1](https://arxiv.org/html/2605.08583#S1.p3.8), [§2](https://arxiv.org/html/2605.08583#S2.p2.1).
- \[30\] W. H. Walters and E. I. Wilder (2023) Fabrication and errors in the bibliographic citations generated by ChatGPT. Scientific Reports 13(1), pp. 14045. Cited by: [§1](https://arxiv.org/html/2605.08583#S1.p2.1), [§2](https://arxiv.org/html/2605.08583#S2.p1.1).
- \[31\] L. Waltman (2016) A review of the literature on citation impact indicators. Journal of informetrics 10(2), pp. 365–391. Cited by: [§1](https://arxiv.org/html/2605.08583#S1.p1.1).
- \[32\] H. Wei, Y. Sun, and Y. Li (2026) DeepSeek-ocr 2: visual causal flow. arXiv preprint arXiv:2601.20552. Cited by: [Appendix C](https://arxiv.org/html/2605.08583#A3.p1.11).
- \[33\] Z. Yuan, K. Shi, Z. Zhang, L. Sun, N. V. Chawla, and Y. Ye (2026) CiteAudit: you cited it, but did you read it? a benchmark for verifying scientific references in the llm era. arXiv preprint arXiv:2602.23452. Cited by: [§1](https://arxiv.org/html/2605.08583#S1.p1.1), [§1](https://arxiv.org/html/2605.08583#S1.p2.1), [§1](https://arxiv.org/html/2605.08583#S1.p3.8), [§2](https://arxiv.org/html/2605.08583#S2.p2.1).

## Appendix A Benchmark Details

This appendix expands Section [3](https://arxiv.org/html/2605.08583#S3) with the per-code prose, mutation operator schemas, and quality-control protocols that the main paper compresses for space.

### A.1 Per-code Definitions

The taxonomy of [Table 1](https://arxiv.org/html/2605.08583#S3.T1) groups 12 codes into three auditor-facing classes. The class-level summary in the main paper compresses what each code names; here we restate the codes in full so an auditor can map a verdict to a concrete auditor action.

Real citations resolve to the intended publication on every field an auditor would normally check. R1 matches the seed BibTeX entry character-for-character. R2 differs only by a normalizable surface variant such as a venue abbreviation, punctuation difference, capitalization change, or initialed author name. R3 replaces a long author list with *et al.* while preserving the correctness of the named authors and the underlying publication.

Hallucinated citations contain field-level bibliographic errors that can be verified against external sources, and each code targets exactly one field so the label identifies the exact correction or auditor action required. H1 corrupts the title through word substitution, paraphrase, or full fabrication. H2 corrupts the author list through addition, deletion, reordering, substitution, or fabrication. H3 preserves title and authors but assigns the work to a venue in which it did not appear. H4 changes the publication year. H5 replaces an identifier with one that either resolves to a different work or fails to resolve. H6 corrupts peripheral metadata (pages, volume, publisher, location) when that metadata can still be checked against an indexed source.

Potential citations cannot be safely resolved by automatic verification alone and should be routed for manual inspection; these cases are not necessarily erroneous, but lack either a stable matching rule or sufficient external evidence for a confident automatic verdict. P1 covers author-name variants, where a citation uses a known nickname, spelling variant, or transliteration variant such as “Kate” for “Katherine” or “Mike” for “Michael”; bibliographic records often do not explicitly validate such equivalences, and strict string matching may falsely flag them. P2 marks non-academic sources, including blog posts, GitHub repositories, model release notes, and forum threads, whose citation formats are too diverse to support a uniform bibliographic-index-based judgment. P3 covers peripheral metadata when the relevant field is absent from available bibliographic sources; because these fields are often less consistently indexed, their absence may reflect incomplete source coverage rather than fabrication.
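For readers who want to work with the taxonomy programmatically, the twelve codes and their three auditor-facing classes can be written down as an enumeration. The short descriptions paraphrase the definitions above; the names `Code`, `auditor_class`, and `needs_manual_review` are illustrative and are not taken from the released code.

```python
from enum import Enum


class Code(Enum):
    R1 = "exact match with the seed entry"
    R2 = "normalizable surface variant"
    R3 = "author list truncated with et al."
    P1 = "author-name variant (nickname, spelling, transliteration)"
    P2 = "non-academic source"
    P3 = "peripheral metadata absent from available sources"
    H1 = "corrupted title"
    H2 = "corrupted author list"
    H3 = "wrong venue"
    H4 = "wrong publication year"
    H5 = "wrong or unresolvable identifier"
    H6 = "corrupted but checkable peripheral metadata"


CLASS_BY_PREFIX = {"R": "Real", "P": "Potential", "H": "Hallucinated"}


def auditor_class(code: Code) -> str:
    """Map a fine-grained code to its auditor-facing class via its prefix letter."""
    return CLASS_BY_PREFIX[code.name[0]]


def needs_manual_review(code: Code) -> bool:
    """Potential codes are the ones routed to a human auditor."""
    return auditor_class(code) == "Potential"
```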

### A.2 Source Selection

Table 6: Mutation operators and per-code counts in the synthetic benchmark.

We extract official BibTeX entries from publicly available open-access bibliographic repositories such as Crossref, DBLP, and arXiv, spanning a broad spectrum of research areas and publication venues. To control seed quality, we prioritize entries that populate the largest number of bibliographic fields (title, authors, venue, year, identifiers, peripheral metadata), so each seed offers a rich substrate for downstream mutation. We then apply the per-code mutation operators of [Table 6](https://arxiv.org/html/2605.08583#A1.T6) to generate the synthetic entries.

[Figure 3](https://arxiv.org/html/2605.08583#A1.F3) summarizes the seed-pool composition for the 2,270 benchmark entries that derive from a real publication (the remaining 180 entries are P3 pure fabrications with no real seed by construction). The left panel breaks down seeds by the Scholar Connector that returned the canonical record: Crossref (41.6%) and DBLP (35.2%) together cover roughly three quarters of the pool, ACL Anthology adds 15.2%, and the remaining 8.0% is distributed across arXiv, OpenAlex, and Semantic Scholar. The right panel breaks down seeds by research topic: the 15 topics span reinforcement learning, graph neural networks, knowledge distillation, large language models, and other major subareas of contemporary AI and machine learning, with no single topic exceeding 9.6% and the smallest topic still contributing 3.5%, so no single subarea dominates the benchmark.

![Refer to caption](https://arxiv.org/html/2605.08583v1/figs/dataset_distribution.png)

Figure 3: Seed-pool composition of the 2,270 synthetic entries that derive from a real publication. Left: distribution over the six Scholar Connectors that returned the canonical record. Right: distribution over the 15 research topics used to query the connectors. P3 pure fabrications (180 entries) are excluded from both panels by construction.

### A.3 Per-code Mutation Operators

For every non-R1 code we apply a small fixed set of mutation operators that produce exactly the failure mode the code names; every operator changes a documented set of fields and leaves the rest identical to the seed. An LLM-driven generator proposes a candidate value for each operator, and a deterministic post-processing step enforces the field boundaries documented in the operator schema. The Potential class admits operators that no purely surface-text method can recognize: P1 substitutes a single author name with a known nickname or transliteration variant, so the citation remains semantically correct yet trips strict matchers; P3 fabricates a peripheral field that no public bibliographic source indexes for the cited paper, so the verdict requires recognizing coordinated absence across sources rather than a contradicting source. Each Hallucinated code targets exactly one bibliographic field, so a wrong title (H1) and a wrong DOI (H5) on otherwise identical seeds produce two distinct benchmark entries and two distinct error modes.
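The deterministic post-processing step can be illustrated with a small sketch. It assumes an operator schema is simply the set of fields the operator is allowed to change, and it models the LLM generator's output as a plain dictionary; both choices, and the names `OPERATOR_FIELDS` and `apply_operator`, are illustrative assumptions rather than the benchmark's actual schema.

```python
import copy

# Hypothetical schema: which fields each single-field code may touch.
OPERATOR_FIELDS = {
    "H1": {"title"},
    "H4": {"year"},
    "H5": {"doi"},
}


def apply_operator(seed: dict, proposal: dict, code: str) -> dict:
    """Apply a proposed mutation, discarding edits outside the operator's fields."""
    allowed = OPERATOR_FIELDS[code]
    mutated = copy.deepcopy(seed)
    for key, value in proposal.items():
        if key in allowed:
            mutated[key] = value
    return mutated


seed = {"title": "A Study of X", "year": 2021, "doi": "10.0000/example"}
# A generator proposal that drifts beyond the year field (illustrative values).
proposal = {"year": 2019, "title": "A Study of Y"}
entry = apply_operator(seed, proposal, "H4")
```

Here the stray title edit is dropped, so the resulting entry differs from its seed only in `year`, which is what keeps an H4 entry from also behaving like an H1 entry.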

### A.4 Quality Control

Every synthetic entry passes three independent checks before it enters the benchmark. The *round-trip audit* re-runs each operator against its seed and verifies that the resulting diff matches the operator's documented changed fields; entries that fail the audit are regenerated. The *verifiability check* confirms that every R1 seed resolves on at least one public bibliographic source and that every P3 fabrication is unresolvable across every source consulted, so the P3 ground-truth label does not depend on any single source's coverage. The *author-curated boundary review* hand-inspects every P1 citation and confirms that the substituted nickname or transliteration is a recognized variant for the named author rather than a plausible-but-fictional alternative; this protects P1 from absorbing H2 mutations. After applying these filters we retain 2,450 taxonomy-labeled instances out of 3,100 collected and synthesized entries.
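The round-trip audit reduces to comparing a field-level diff with the operator's documented field set. The sketch below assumes entries are flat dictionaries and that "matches" means the set of changed fields equals the documented set exactly; the function names are hypothetical.

```python
def changed_fields(seed: dict, entry: dict) -> set:
    """Names of fields whose values differ between the seed and the mutated entry."""
    keys = seed.keys() | entry.keys()
    return {k for k in keys if seed.get(k) != entry.get(k)}


def passes_round_trip(seed: dict, entry: dict, documented: set) -> bool:
    """True when the observed diff is exactly the operator's documented field set."""
    return changed_fields(seed, entry) == documented


seed = {"title": "A Study of X", "year": 2021, "venue": "ICLR"}
good = {"title": "A Study of X", "year": 2019, "venue": "ICLR"}
drifted = {"title": "A Study of Y", "year": 2019, "venue": "ICLR"}
```

An entry like `drifted`, which changes the title as well as the year, would fail an H4 audit and be regenerated.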

## Appendix B Efficiency Analysis

Across the 2,450-citation BibTeX benchmark, the Cascading Evidence Collector closes 3.6% within seconds via cache hits and non-academic short-circuits, the Field Matcher closes another 61.7% at deterministic rule-based latency with no LLM call, and the remaining 34.7% reach the Class-Specialist Judgers, where the Potential and Hallucinated judges run sequential LLM passes plus external-API cross-checks that account for most of the per-citation latency. CiteTracer sustains roughly 0.50 citations per second end-to-end, and the long tail comes primarily from external-API round trips rather than LLM inference itself.
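In absolute terms, and assuming the reported throughput holds across the whole benchmark, these shares correspond to approximately:

$$0.036 \times 2450 \approx 88, \qquad 0.617 \times 2450 \approx 1512, \qquad 0.347 \times 2450 \approx 850 \ \text{citations},$$

$$\frac{2450\ \text{citations}}{0.50\ \text{citations/s}} = 4900\ \text{s} \approx 82\ \text{min}.$$

So roughly 850 citations reach an LLM judger, and a full pass over the benchmark would take on the order of an hour and a half.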

## Appendix C Implementation Details

The OCR model $\mathcal{M}_{\mathrm{ocr}}$ uses DeepSeek-OCR 2 \[[32](https://arxiv.org/html/2605.08583#bib.bib49)\] for layout-aware bibliography-region detection and citation-block transcription, and the Parser Agent $\mathcal{A}_{\mathrm{Parser}}$ runs on Kimi K2.5 \[[15](https://arxiv.org/html/2605.08583#bib.bib50)\] for the cropped-block reparse and boundary merging. The Matcher Agent $\mathcal{A}_{\mathrm{Matcher}}$ runs on Qwen3-VL-235B \[[5](https://arxiv.org/html/2605.08583#bib.bib42)\]. Every LLM call samples at temperature 0 with a 4,096-token generation cap, and the Cascading Evidence Collector keeps the top-5 candidates per connector for downstream adjudication. The Scholar Connectors $\mathcal{A}_{\mathrm{Scholar}}$ connect to eight academic data sources (arXiv, DBLP, Crossref, Semantic Scholar, OpenAlex, ACL Anthology, Europe PMC, and PubMed); the URL Fetch step covers direct DOI and arXiv links; and the Web Agent $\mathcal{A}_{\mathrm{Web}}$ uses a general web-search engine for the residual long tail. By default the pipeline runs three nested layers of parallelism: up to 16 papers are processed concurrently, within each paper up to 16 citations are verified in parallel, and within each citation up to 10 Scholar Connector queries are issued in parallel.
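The three nested layers of parallelism can be sketched with `asyncio` semaphores. The limits (16, 16, 10) come from the text; the use of `asyncio`, the per-paper and per-citation semaphores, and the placeholder connector call are assumptions made for illustration, not a description of the released code.

```python
import asyncio

MAX_PAPERS, MAX_CITATIONS, MAX_QUERIES = 16, 16, 10


async def query_connector(name: str, citation: str, limit: asyncio.Semaphore) -> str:
    async with limit:
        await asyncio.sleep(0)  # placeholder for a real connector request
        return f"{name}:{citation}"


async def verify_citation(citation: str, connectors: list, limit: asyncio.Semaphore) -> list:
    async with limit:
        # Innermost layer: at most MAX_QUERIES connector queries per citation.
        query_limit = asyncio.Semaphore(MAX_QUERIES)
        tasks = [query_connector(c, citation, query_limit) for c in connectors]
        return await asyncio.gather(*tasks)


async def verify_paper(citations: list, connectors: list, limit: asyncio.Semaphore) -> list:
    async with limit:
        # Middle layer: at most MAX_CITATIONS citations in flight per paper.
        citation_limit = asyncio.Semaphore(MAX_CITATIONS)
        tasks = [verify_citation(c, connectors, citation_limit) for c in citations]
        return await asyncio.gather(*tasks)


async def run(papers: list, connectors: list) -> list:
    # Outermost layer: at most MAX_PAPERS papers processed concurrently.
    paper_limit = asyncio.Semaphore(MAX_PAPERS)
    tasks = [verify_paper(p, connectors, paper_limit) for p in papers]
    return await asyncio.gather(*tasks)


results = asyncio.run(run([["c1", "c2"], ["c3"]], ["dblp", "crossref"]))
```

Because the inner semaphores are created per paper and per citation, the worst-case number of simultaneous connector requests under this sketch is the product of the three limits, which is why external-API rate limits would be expected to dominate the long tail.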

### C.1 Agent Prompts

We list the LLM prompts behind the three agents discussed in Section [4](https://arxiv.org/html/2605.08583#S4): the Parser Agent (Reference Extractor), the Matcher Agent (Field Matcher), and the Potential Judger in Class-Specialist Judgers.

#### C.1.1 Parser Agent

The Parser Agent (Section [4.1](https://arxiv.org/html/2605.08583#S4.SS1)) takes the OCR transcription of a reference block with the cropped page image and emits a structured citation record. The system and user prompts the agent uses for the text-only reparse path are:

Parser Agent Prompt

    [System prompt]
    You are a strict bibliography parser. Extract structured fields from a raw reference string. Return JSON only.

    [User prompt]
    Reference:
    {raw_text}

    Return only valid JSON with exact keys:
    {"title": "", "authors": [], "venue": "", "year": null, "volume": "", "pages": "", "publisher": "", "location": "", "doi": "", "arxiv_id": "", "url": ""}

    Rules:
    - authors must be an array of strings.
    - IMPORTANT: If the reference uses 'et al.', 'et al', 'others', or 'and others' to indicate that the author list was truncated, you MUST include that marker as the LAST item in the authors array verbatim (e.g., ['Tom Brown', 'Benjamin Mann', ..., 'et al.']). Do NOT silently drop it -- downstream code uses this marker to detect that the author list is intentionally partial.
    - year must be integer or null.
    - venue is ONLY the journal or conference name (e.g., 'NeurIPS', 'Nature', 'Proceedings of EMNLP 2021'). Do NOT include pages, volume, issue, location, or publisher in venue.
    - volume is the volume number (e.g., '104', '15'). Empty if not present.
    - pages is the page range (e.g., '1234-1245'). Recognize 'pp.' as pages. Empty if not present.
    - publisher is the publishing organization (e.g., 'Association for Computational Linguistics', 'Springer'). Empty if not present.
    - location is the conference location (e.g., 'Online', 'Seoul, Korea'). Empty if not present.
    - doi/arxiv_id/url should be plain strings (empty if unknown).
    - if uncertain, leave empty string / [] / null.
    - do not output explanations.
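Since the prompt fixes an exact JSON shape, downstream code can check a model response before using it. The sketch below validates the keys and the `authors` and `year` types stated in the prompt and detects the truncation marker; the function names and the choice to raise `ValueError` are assumptions, not the paper's implementation.

```python
import json

EXPECTED_KEYS = {
    "title", "authors", "venue", "year", "volume", "pages",
    "publisher", "location", "doi", "arxiv_id", "url",
}
TRUNCATION_MARKERS = {"et al.", "et al", "others", "and others"}


def validate_record(raw: str) -> dict:
    """Parse a parser response and enforce the shape the prompt asks for."""
    record = json.loads(raw)
    if not isinstance(record, dict) or set(record) != EXPECTED_KEYS:
        raise ValueError("response does not have exactly the expected keys")
    authors = record["authors"]
    if not isinstance(authors, list) or not all(isinstance(a, str) for a in authors):
        raise ValueError("authors must be an array of strings")
    year = record["year"]
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if year is not None and (isinstance(year, bool) or not isinstance(year, int)):
        raise ValueError("year must be an integer or null")
    return record


def is_truncated(record: dict) -> bool:
    """True when the last author entry is one of the truncation markers."""
    authors = record["authors"]
    return bool(authors) and authors[-1].strip().lower() in TRUNCATION_MARKERS


# Illustrative response with placeholder values, not real model output.
sample = json.dumps({
    "title": "Example Title", "authors": ["Tom Brown", "et al."], "venue": "NeurIPS",
    "year": 2020, "volume": "", "pages": "", "publisher": "", "location": "",
    "doi": "", "arxiv_id": "", "url": "",
})
record = validate_record(sample)
```

Detecting the marker at this stage is what would let a later matcher treat a shortened author list as an R3 case rather than as a count mismatch.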

#### C.1.2 Matcher Agent (Field Matcher)

The Matcher Agent (Section [4.3](https://arxiv.org/html/2605.08583#S4.SS3)) is invoked when the deterministic rule matcher cannot fully resolve a citation-candidate pair. For each (citation, candidate) pair the agent emits a per-field verdict on authors, venue, and publisher; the citation-side and candidate-side values for those fields are spliced into the prompt at runtime. We reproduce its directive, the category labels for each audited field, and the output schema.

Matcher Agent Prompt (Field Classifier)

    You are a citation field equivalence classifier. For a single (citation, candidate) pair, produce authoritative verdicts for three fields at once: authors, venue, publisher.

    ============================================================
    AUTHORS -- 4 categories
    ============================================================
    - exact:       every author pair is byte-identical after case normalization.
    - r2_initial:  same people; the only first-name difference is single-letter
                   initial expansion (e.g., "G. Hao" <-> "Gao Hao").
    - p1_variant:  same people (surnames equal), but first-name form differs by
                   nickname, multi-letter truncation, transliteration, or
                   middle-name add/drop (e.g., "Mike" <-> "Michael",
                   "Chao" <-> "Chaowei", "Dmitry P. Vetrov" <-> "Dmitry Vetrov").
    - h2_error:    genuinely different people, reordered authors, or count
                   mismatch with no et-al marker.

    [detailed name-decomposition rules, surname-equality hard rule, count-
     mismatch dedicated rule, and ~50 worked examples are listed in the full
     prompt; full text in the released code at
     packages/core/bedrock_agents.py]

    ============================================================
    VENUE -- 3 categories
    ============================================================
    - exact:    byte-identical after case/punctuation normalization.
    - alias:    same venue, expressed differently
                (acronym <-> full name; "Proceedings of ..." <-> acronym;
                "EMNLP 2024" <-> "EMNLP"; preprint synonyms; sub-track of the
                same conference; non-English official translation).
    - different: distinct venues (e.g., "ACM" vs "ACL", "ICML" vs "NeurIPS").

    ============================================================
    PUBLISHER -- 3 categories (same structure as venue)
    ============================================================
    - exact, alias, different. Aliases include acronym <-> full name
      ("ACM" <-> "Association for Computing Machinery"; "PMLR" <-> "Proceedings
      of Machine Learning Research"; "Springer" <-> "Springer Nature").

    ============================================================
    OUTPUT (JSON only, no other text)
    ============================================================
    For each field, write the "reason" key FIRST (step-by-step evidence walk),
    then the "overall" key, which MUST equal the conclusion of that reason.

    Required shape:
    {
      "authors":   {"reason": "...", "overall": "exact"|"r2_initial"|"p1_variant"|"h2_error"},
      "venue":     {"reason": "...", "overall": "exact"|"alias"|"different"},
      "publisher": {"reason": "...", "overall": "exact"|"alias"|"different"}
    }
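The "Required shape" at the end of the matcher prompt can be checked mechanically before the verdicts are used downstream. The following is a minimal sketch of such a check, not the released implementation: the function name `parse_matcher_output` and the `ALLOWED` table are hypothetical, and only the field names, label sets, and the reason-before-overall ordering are taken from the prompt.

```python
import json

# Allowed "overall" labels per field, copied from the prompt's "Required shape".
ALLOWED = {
    "authors": {"exact", "r2_initial", "p1_variant", "h2_error"},
    "venue": {"exact", "alias", "different"},
    "publisher": {"exact", "alias", "different"},
}


def parse_matcher_output(raw: str) -> dict:
    """Parse the matcher's JSON reply and return {field: overall_label}.

    Hypothetical helper. Raises ValueError when the reply is not valid JSON,
    a field or key is missing, "reason" does not precede "overall", or a
    label falls outside the label set listed in the prompt.
    """
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("top-level JSON value must be an object")
    verdicts = {}
    for field, labels in ALLOWED.items():
        entry = data.get(field)
        if not isinstance(entry, dict):
            raise ValueError(f"missing or malformed field: {field}")
        keys = list(entry)  # json.loads keeps keys in document order
        if "reason" not in keys or "overall" not in keys:
            raise ValueError(f"{field}: needs both 'reason' and 'overall'")
        if keys.index("reason") > keys.index("overall"):
            raise ValueError(f"{field}: 'reason' must come before 'overall'")
        if entry["overall"] not in labels:
            raise ValueError(f"{field}: unknown label {entry['overall']!r}")
        verdicts[field] = entry["overall"]
    return verdicts
```

Rejecting a malformed reply outright, rather than repairing it, keeps an unparseable verdict from silently turning into a match; whether the released code retries or falls back in that case is not stated in this appendix.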

#### C.1.3 Potential Judger

The Potential Judger (Section [4.4](https://arxiv.org/html/2605.08583#S4.SS4)) is the class-specialist agent invoked when the residual field-status profile is consistent with Potential but the system needs to decide between explainable discrepancies (P1/P2/P3) and unexplained errors that escalate to Hallucinated. We additionally include several worked examples as in-context learning demonstrations; the full set is available in our released code.

Potential Judger Prompt

    You are a citation discrepancy analyst. The citation was NOT a direct match. Your job: determine if the discrepancies are explainable.

    ## Taxonomy

    ### POTENTIAL (discrepancies are explainable)
    - P1. Author name variant: Name is a plausible nickname/transliteration/spelling variant of the SAME PERSON. The surname sets must still align one-to-one (same count, same people, same order). Only the first-name form differs.
      Valid P1 examples: "Katherine" vs "Kate", "Mike" vs "Michael", "Wei Zhang" vs "William Zhang", "Nando" vs "Fernando", "Shu Yang" vs "Shuang Yang", "Afouras T" vs "Triantafyllos Afouras", "L. Zhao" vs "Long Zhao".
    - P2. Non-academic source unverifiable: Citation references a non-academic source whose existence cannot be fully verified through structured databases.
    - P3. Insufficient field evidence: Citation provides optional peripheral fields (volume, pages, publisher, location -- ONLY these four) that cannot be confirmed or denied because no candidate source supplies them. The CORE identity fields (title + authors + year + venue) all match the candidate, so the paper is verifiably real -- but one or more of volume/pages/publisher/location carry no external evidence either way.
      Signals: one or more of {volume,pages,publisher,location}_candidate_missing in the issues list while core fields match.
      Valid P3 examples:
        - Citation has volume="35" for a NeurIPS paper; DBLP/CrossRef/arXiv do not supply volume for NeurIPS -> cannot verify. Other fields match -> P3.
        - Citation lists publisher="PMLR" and location="Vancouver, Canada" for an ICML paper; no connector confirms these specific values -> P3.
      Do NOT use P3 when:
        - Any core field (title / authors / year / venue) has a real mismatch.
        - A secondary field has an explicit contradicting value from a candidate (that is H6, not P3).
        - doi_candidate_missing or arxiv_id_candidate_missing is the only issue -- DOI/arxiv_id fall under H5, not P3.

    ### HALLUCINATED (unexplained errors -> escalate)
    *** AUTHOR COUNT / SET MISMATCH IS ALWAYS HALLUCINATED (do NOT explain as P1) ***
    These patterns ALWAYS escalate to HALLUCINATED, never P1:
      - Citation has MORE authors than candidate (added author) -> H2
      - Citation has FEWER authors than candidate AND ref has no "et al." -> H2
      - A surname in citation is absent from candidate (or vice versa) -> H2
      - Authors reordered (same set, different order) -> H2
    DBLP disambiguation suffixes ("Ting Chen 0007") are NOT a reason to explain away a missing author. The suffix applies to the existing author; a separate added/removed author is still a real discrepancy -> escalate.

    If ANY discrepancy cannot be explained by P1/P2/P3, return HALLUCINATED.

    ## Examples

    [Several worked examples for P1, P2, P3, and HALLUCINATED escalation cases are included here as in-context learning demonstrations; full text in the released code at packages/core/bedrock_agents.py.]

    ## Now evaluate:

    Citation: Title='{citation_title}' Authors={citation_authors} Venue='{citation_venue}' Year={citation_year} Location='{citation_location}'
    Candidate: Title='{candidate_title}' Authors={candidate_authors} Venue='{candidate_venue}' Year={candidate_year} Location='{candidate_location}'

    Issues from ValidAgent: {issues}
    ValidAgent reason: {valid_reason}

    Secondary evidence:
    {evidence_lines}

    Return JSON only:
    {"label": "POTENTIAL"/"HALLUCINATED", "taxonomy": ["P1"/"P2"/"P3"/"H1".."H6"], "reason": "..."}
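The P3 "Signals" line in the prompt describes a condition that can be approximated with plain set operations on the issues list. The sketch below is an illustration under stated assumptions, not the paper's code: `is_p3_signal` and `core_fields_match` are hypothetical names, and treating any issue outside the four peripheral `*_candidate_missing` tags as disqualifying is a conservative reading that the prompt does not spell out.

```python
# Issue tags follow the prompt's pattern
# {volume,pages,publisher,location}_candidate_missing.
PERIPHERAL_FIELDS = ("volume", "pages", "publisher", "location")
P3_ISSUES = {f"{name}_candidate_missing" for name in PERIPHERAL_FIELDS}


def is_p3_signal(issues, core_fields_match):
    """Return True when the issues list looks like a P3 case.

    Hypothetical helper. Assumption: every reported issue must be one of the
    four peripheral "candidate missing" tags; anything else (for example
    doi_candidate_missing, which the prompt assigns to H5) rules P3 out.
    """
    issue_set = set(issues)
    if not core_fields_match or not issue_set:
        return False
    return issue_set <= P3_ISSUES  # subset test: only peripheral gaps remain
```

For example, `is_p3_signal(["volume_candidate_missing"], True)` holds, while a list containing only `doi_candidate_missing` does not, matching the prompt's "Do NOT use P3 when" clause.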

## Appendix D Per-Subtype TPR and FPR Heatmaps

![Refer to caption](https://arxiv.org/html/2605.08583v1/figs/per_subtype_heatmap.png)

Figure 4: Per-subtype TPR (left, %) and FPR (right, %) across the four chatbot baselines and CiteTracer, on both PDF and BibTeX inputs.

[Figure 4](https://arxiv.org/html/2605.08583#A4.F4) renders the per-subtype data of [Table 3](https://arxiv.org/html/2605.08583#S5.T3) as two side-by-side continuous heatmaps, with methods on the vertical axis and the nine fine-grained scoring buckets (R, P1, P3, H1 to H6) on the horizontal axis grouped by their parent class (Real, Potential, Hallucinated). The left panel encodes in-bucket TPR (recall) and the right panel encodes out-of-bucket FPR; both panels share a single linear interpolation from amber through cream to green, but the FPR colormap is inverted so that low false-positive rates render green and high false-positive rates render amber, giving every cell a consistent reading: green is good, amber is bad. The FPR axis is capped at 20% to keep the common 0 to 5% range visually discriminating without saturating the few R-bucket cells where Gemini and Claude over-predict Real. GPTZero is omitted from both panels because three of its buckets are n/a by output-space construction.

Three patterns become immediately readable. First, on the TPR panel the Potential columns (P1, P3) are dominated by amber across every baseline: none of the frontier chatbots reaches even the cream midpoint on P3, and P1 stays amber for Claude and Gemini and only modestly above midpoint for GPT-5.5. Second, the two CiteTracer rows are uniformly deep green across every TPR bucket, with the only non-saturated cell being R on PDF input (90.8), where Stage 1 extraction noise downgrades a small fraction of Real citations. Third, the FPR panel shows that every chatbot baseline pays a large false-positive cost on the R bucket (Gemini reaches 41.5% and 46.3% on PDF and BibTeX), reflecting the well-known tendency of LLM judges to flag genuine citations as suspicious; CiteTracer keeps R-bucket FPR at 0.1% on both inputs and stays under 1.1% on every other bucket, making the right panel almost uniformly green.

Together the two panels are a visual restatement of the per-subtype gain that [Table 3](https://arxiv.org/html/2605.08583#S5.T3) reports row by row, useful when the reader wants to scan across methods without parsing percentages.
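The two quantities plotted in the heatmaps can be computed from paired gold and predicted bucket labels. The sketch below assumes the usual one-vs-rest reading, which the appendix does not define formally: in-bucket TPR is the share of citations whose gold bucket is b that are predicted as b, and out-of-bucket FPR is the share of citations whose gold bucket is not b that are nonetheless predicted as b. The function name `per_bucket_rates` and the toy labels are hypothetical.

```python
from collections import Counter


def per_bucket_rates(gold, pred, buckets):
    """Return {bucket: (tpr_percent, fpr_percent)} under a one-vs-rest reading.

    Hypothetical helper; gold and pred are equal-length sequences of bucket
    labels such as "R", "P1", "P3", "H1", ..., "H6".
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    total = len(gold)
    gold_count = Counter(gold)
    # Correct predictions, keyed by the (shared) bucket.
    hits = Counter(g for g, p in zip(gold, pred) if g == p)
    # Wrong predictions, keyed by the bucket that was wrongly predicted.
    false_pos = Counter(p for g, p in zip(gold, pred) if g != p)
    rates = {}
    for bucket in buckets:
        positives = gold_count[bucket]
        negatives = total - positives
        tpr = 100.0 * hits[bucket] / positives if positives else float("nan")
        fpr = 100.0 * false_pos[bucket] / negatives if negatives else float("nan")
        rates[bucket] = (tpr, fpr)
    return rates


# Toy data for illustration only; these are not results from the paper.
toy_gold = ["R", "R", "P1", "H1"]
toy_pred = ["R", "P1", "P1", "R"]
toy_rates = per_bucket_rates(toy_gold, toy_pred, ["R", "P1", "H1"])
```

On the toy data, half of the gold R citations are recovered and one of the two non-R citations is wrongly labelled R, so `toy_rates["R"]` is `(50.0, 50.0)`.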

## Appendix E Limitations

Our evaluation concentrates on Computer Science papers, especially the ML literature. For citations from other fields with less standard formats, more complex structures, or limited coverage in the bibliographic connectors we query, the pipeline may miss candidates and emit incorrect Hallucinated verdicts. Under high-concurrency verification, parallel calls to the eight Scholar Connectors can trigger API rate limits and drop candidate evidence; a future Scholar Connector router that routes each citation to the most appropriate connector by venue, publisher, and documented API coverage would cut per-citation query volume and improve system robustness.

## Appendix F Broader impacts

CiteTracer serves two stakeholder groups. For authors, it is a pre-submission self-check tool that surfaces field-level citation errors before a manuscript leaves the desk, helping researchers ship more rigorous and reproducible publications and reducing the risk of inadvertently propagating fabricated references. For conference chairs and journal editors, it is a triage tool that flags hallucinated citations during desk review, scaling the audits that ICLR 2026 and a real conference already run by hand. We release the taxonomy, datasets, and pipeline for both groups; a wrong Hallucinated verdict on an honest citation is a reputational harm that our precision-first design treats as the primary failure to avoid.
