Where Fake Citations Are Made: Tracing Field-Level Hallucination to Specific Neurons in LLMs
Summary
Rutgers researchers trace citation hallucination in LLMs to sparse field-specific neurons, showing causal intervention can suppress fake references.
# Where Fake Citations Are Made: Tracing Field-Level Hallucination to Specific Neurons in LLMs
Source: [https://arxiv.org/html/2604.18880](https://arxiv.org/html/2604.18880)
Yihao Quan \(Rutgers University\), Xiaodong Lin \(Rutgers University\), Ruixiang Tang \(Rutgers University; corresponding author: ruixiang\.tang@rutgers\.edu\)
Abstract
LLMs frequently generate fictitious yet convincing citations, often expressing high confidence even when the underlying reference is wrong\. We study this failure across 9 models and 108,000 generated references, and find that author names fail far more often than other fields across all models and settings\. Citation style has no measurable effect, while reasoning\-oriented distillation degrades recall\. Probes trained on one field transfer at near\-chance levels to the others, suggesting that hallucination signals do not generalize across fields\. Building on this finding, we apply elastic\-net regularization with stability selection to neuron\-level CETT values of Qwen2\.5\-32B\-Instruct and identify a sparse set of field\-specific hallucination neurons \(FH\-neurons\)\. Causal intervention further confirms their role: amplifying these neurons increases hallucination, while suppressing them improves performance across fields, with larger gains in some fields\. These results suggest a lightweight approach to detecting and mitigating citation hallucination using internal model signals alone\.
## 1 Introduction
Figure 1: Overview of citation hallucination in LLMs\. Given a topic, models generate references with plausible but incorrect metadata\. We investigate three research questions: the prevalence of such hallucinations, how they are encoded in model representations, and whether targeted neuron intervention can reduce errors\.
Large language models are increasingly used to draft related work and bibliographies, but when they rely on parametric memory alone they cannot distinguish confident recall from confident fabrication\. This leads to a recurring pattern: references that look correct at first glance but contain errors in one or more bibliographic fields\. This problem has already appeared in recent work\. A recent audit of NeurIPS 2025 accepted papers found over 100 hallucinated citations that went undetected during peer review \[[18](https://arxiv.org/html/2604.18880#bib.bib22)\]\.
Prior work largely focuses on detecting or avoiding hallucination, without explicitly modeling its underlying causes\. For example, post\-hoc verification pipelines \[[13](https://arxiv.org/html/2604.18880#bib.bib3), [22](https://arxiv.org/html/2604.18880#bib.bib8)\] can detect incorrect references, but they require multiple API calls per citation and operate as black\-box checks, providing little insight into why the model makes these errors\. Retrieval\-augmented generation \[[8](https://arxiv.org/html/2604.18880#bib.bib18)\] reduces hallucination by grounding outputs in external documents\. However, because it relies on external information, it does not address the internal mechanisms that give rise to hallucination\. As a result, it remains unclear where citation hallucination arises within the model, and whether it can be detected and corrected using internal model signals\.
This question is especially important for citations because a bibliographic reference is a structured composition of distinct fields, including author, title, venue, year, and DOI\. Prior interpretability work has shown that truth\-related signals in LLM hidden states can be recovered with simple probes \[[11](https://arxiv.org/html/2604.18880#bib.bib5), [1](https://arxiv.org/html/2604.18880#bib.bib11)\], and that activation\-level interventions can steer model behavior toward truthfulness \[[9](https://arxiv.org/html/2604.18880#bib.bib36), [25](https://arxiv.org/html/2604.18880#bib.bib37)\]\. At the neuron level, \[[5](https://arxiv.org/html/2604.18880#bib.bib13)\] identified hallucination\-associated neurons and showed that suppressing them reduces errors in factoid QA\. Meanwhile, \[[16](https://arxiv.org/html/2604.18880#bib.bib21)\] found that truthfulness probes generalize weakly across datasets, suggesting that these internal signals may vary across settings rather than transfer uniformly\. Together, these findings highlight the importance of studying hallucination in structured settings such as citations, where a single reference contains multiple co\-dependent fields, each of which may fail for different reasons\. However, tools developed for general factoid QA cannot be directly applied to this structured, multi\-field setting\.
To address this gap, we conduct three analyses, each organized around a distinct research question \(Figure [1](https://arxiv.org/html/2604.18880#S1.F1)\):
- How prevalent is citation hallucination in different fields? We build a large\-scale dataset by prompting multiple models to generate references across 50 topics and 8 citation styles, and verify each bibliographic field against metadata records from OpenAlex \[[17](https://arxiv.org/html/2604.18880#bib.bib9)\]\. Using this dataset, we find a consistent ranking of field\-level error rates across models and generation settings, with author names being the most error\-prone, followed by venues, titles, and years\.
- Is hallucination encoded in field\-specific representations? We train linear probes on the hidden states of Qwen2\.5\-32B\-Instruct \[[2](https://arxiv.org/html/2604.18880#bib.bib41)\] and find that different fields show different layer\-wise patterns\. Probes trained on one field perform at near\-chance levels on the others, suggesting that citation errors in different fields are represented differently inside the model\.
- Can targeted neuron intervention reduce field\-level errors? We apply elastic\-net regularization with stability selection to neuron\-level CETT contributions and identify a small set of field\-specific hallucination neurons \(FH\-neurons\)\. When we amplify these neurons, hallucination increases\. When we suppress them, accuracy improves, and this improvement does not appear under random ablation\. This provides causal evidence that specific neurons contribute to field\-level citation errors\.
Together, these findings show that citation hallucination is detectable from internal model signals, follows different patterns across bibliographic fields, and can be partly reduced through targeted neuron suppression without external retrieval\.

| Model | Author (N=5) | Title (N=5) | Year (N=5) | Venue (N=5) | DOI (N=5) | Total (N=5) | Author (N=10) | Title (N=10) | Year (N=10) | Venue (N=10) | DOI (N=10) | Total (N=10) | Author (N=15) | Title (N=15) | Year (N=15) | Venue (N=15) | DOI (N=15) | Total (N=15) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-14B-Instruct | 9.3 | 39.1 | 32.8 | 25.4 | 22.7 | 4.3 | 7.7 | 31.6 | 28.9 | 23.7 | 21.0 | 3.2 | 7.1 | 29.1 | 27.4 | 23.0 | 20.0 | 2.9 |
| Qwen2.5-32B-Instruct | 16.0 | 45.8 | 43.0 | 39.8 | 54.9 | 10.5 | 13.6 | 43.4 | 41.6 | 39.3 | 53.8 | 8.4 | 13.1 | 40.2 | 40.6 | 38.6 | 53.5 | 8.0 |
| Qwen3-30B-A3B-Base | 19.2 | 57.1 | 51.9 | 38.9 | 27.7 | 5.0 | 13.3 | 42.1 | 41.3 | 31.7 | 20.7 | 3.1 | 10.3 | 32.5 | 34.0 | 26.6 | 17.7 | 2.3 |
| Qwen3-30B-A3B-Instruct | 15.5 | 55.2 | 47.3 | 33.0 | 33.3 | 6.8 | 11.5 | 43.1 | 39.1 | 28.6 | 25.4 | 4.5 | 9.1 | 34.4 | 33.8 | 25.9 | 20.2 | 3.1 |
| Moonlight-16B-A3B-Instruct | 9.6 | 44.8 | 35.2 | 26.2 | 38.6 | 4.3 | 7.1 | 37.0 | 29.3 | 21.4 | 32.7 | 2.5 | 6.2 | 31.2 | 27.8 | 20.4 | 29.8 | 2.0 |
| Mistral-Small-24B-Instruct-2501 | 15.4 | 52.7 | 45.5 | 33.2 | 32.6 | 5.0 | 10.2 | 42.5 | 38.5 | 27.5 | 25.0 | 2.9 | 8.2 | 34.1 | 33.6 | 25.5 | 20.4 | 1.9 |
| DeepSeek-R1-Distill-Qwen3-8B | 20.4 | 39.8 | 44.1 | 29.0 | 23.7 | 3.2 | 12.0 | 28.3 | 31.0 | 21.2 | 25.0 | 2.2 | 10.2 | 22.5 | 25.5 | 17.5 | 25.5 | 2.2 |
| DeepSeek-R1-Distill-Qwen2.5-14B | 7.7 | 22.0 | 26.4 | 19.8 | 13.2 | 1.1 | 3.9 | 12.7 | 21.5 | 13.8 | 8.3 | 0.6 | 3.4 | 10.9 | 21.5 | 12.5 | 6.4 | 0.4 |
| DeepSeek-R1-Distill-Qwen2.5-32B | 19.8 | 45.5 | 47.5 | 38.6 | 23.8 | 4.0 | 14.4 | 39.6 | 38.5 | 31.0 | 21.4 | 3.2 | 13.3 | 35.8 | 36.5 | 29.2 | 18.8 | 3.0 |

Table 1: Field\-level verification accuracy \(%\) across different models and number of references per prompt \(N\)\.
## 2 Empirical Analysis of Citation Hallucination
### 2\.1 Data Collection
We prompt a set of large language models to generate academic citations from parametric memory alone for 50 computer science research topics spanning machine learning, NLP, systems, security, and theory \(full list in Appendix [C](https://arxiv.org/html/2604.18880#A3)\)\. For each topic, the models generate $N \in \{5, 10, 15\}$ references under 8 citation styles: APA, MLA, Chicago, Harvard, Vancouver, IEEE, ACM, and AMA\. This design allows us to examine how citation style and generation volume affect hallucination rates\. For each model, this procedure yields approximately 12,000 generated references across 50 topics, 3 generation volumes, and 8 citation styles\. Outputs are constrained by a JSON schema with five structured fields per reference: `title`, `authors`, `venue`, `year`, and `doi`\. Each reference is treated as an independent example and is subsequently verified and labeled\.
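The dataset size quoted above follows directly from the prompt grid. The short check below is only an arithmetic sketch using the counts stated in the text (50 topics, 8 styles, N in {5, 10, 15}, 9 models); the variable names are illustrative.

```python
# Counts taken from the text: 50 topics, 8 citation styles, N in {5, 10, 15}, 9 models.
TOPICS = 50
STYLES = 8
REFS_PER_PROMPT = (5, 10, 15)
MODELS = 9

# Each (topic, style, N) prompt requests N references, so the per-model total
# is topics * styles * (5 + 10 + 15).
refs_per_model = TOPICS * STYLES * sum(REFS_PER_PROMPT)
refs_total = refs_per_model * MODELS

print(refs_per_model)  # 12000
print(refs_total)      # 108000
```

These are nominal counts; the text says "approximately" because a model may return fewer valid references than requested.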
Figure 2: Per\-field citation accuracy at $N=15$ across all models\. Author accuracy is consistently the lowest across all models, while DOI accuracy is notably higher for Qwen2\.5\-32B\-Instruct than for all other models\.
### 2\.2 Verification Pipeline
We first verify each generated reference against OpenAlex \[[17](https://arxiv.org/html/2604.18880#bib.bib9)\] through its public REST API\. We use a two\-stage lookup procedure\. If a DOI is present, we query OpenAlex directly using the normalized DOI\. Otherwise, we retrieve the top 10 candidates through title search and select the best match based on title similarity, first\-author overlap, and year proximity\. This step assigns each reference a binary label for each field, together with a global verdict of *Supported*, *Partial*, or *Unsupported*\.
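The candidate-selection step in the title-search branch can be sketched as a weighted score over the three signals named above. This is a minimal illustration, not the authors' implementation: the weights, the five-year window, the surname heuristic, and the record layout are all assumptions, and the example records are invented toy data.

```python
from difflib import SequenceMatcher

# Hypothetical weights and year window; the source does not report how the
# three signals are combined.
W_TITLE, W_AUTHOR, W_YEAR = 0.6, 0.25, 0.15
YEAR_WINDOW = 5

def surname(full_name):
    """Lower-cased last whitespace-separated token of a name (a crude surname guess)."""
    parts = full_name.split()
    return parts[-1].lower() if parts else ""

def match_score(generated, candidate):
    """Combine title similarity, first-author overlap, and year proximity into a score in [0, 1]."""
    title = SequenceMatcher(None, generated["title"].lower(), candidate["title"].lower()).ratio()
    gen_first = surname(generated["authors"][0]) if generated["authors"] else ""
    cand_first = surname(candidate["authors"][0]) if candidate["authors"] else ""
    author = 1.0 if gen_first and gen_first == cand_first else 0.0
    year = max(0.0, 1.0 - abs(generated["year"] - candidate["year"]) / YEAR_WINDOW)
    return W_TITLE * title + W_AUTHOR * author + W_YEAR * year

def best_match(generated, candidates):
    """Return the highest-scoring candidate, or None when the search returned nothing."""
    return max(candidates, key=lambda c: match_score(generated, c), default=None)

# Toy records for illustration only.
generated = {"title": "A Toy Study of Probing", "authors": ["Ada Lovelace"], "year": 2020}
candidates = [
    {"id": "W1", "title": "A toy study of probing", "authors": ["A. Lovelace", "C. Babbage"], "year": 2020},
    {"id": "W2", "title": "Unrelated Systems Paper", "authors": ["G. Hopper"], "year": 2015},
]
print(best_match(generated, candidates)["id"])
```

In a real pipeline the candidate list would come from the top-10 title search; a minimum score threshold for declaring "no match" would also be needed and is left out here.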
Some cases remain ambiguous after this first stage\. A reference may appear *Partial* because the model cites an arXiv preprint while OpenAlex records the published version, or vice versa\. A reference may also appear *Unsupported* because the work has not yet been indexed in OpenAlex rather than because it is fabricated\. To resolve these cases, we introduce a second verification stage using GPT\-5\.4\-mini \[[15](https://arxiv.org/html/2604.18880#bib.bib32)\] with web search access\. For each *Partial* or *Unsupported* reference, the verifier retrieves live search results and compares them with the OpenAlex result\. When the web\-grounded evidence clearly supports a different verdict, we use the web\-based result as the final label\. Because this verifier relies on retrieval rather than parametric memory, it serves as a grounded judge rather than reproducing the same error mode under evaluation\. GPT\-5\.4\-mini is also not among the citation\-generation models in our benchmark, which avoids direct model overlap in the verification pipeline\. To validate the full pipeline, two expert annotators independently reviewed a random sample of 200 labels\. Their judgments agreed with the automated verdicts in 93% of cases, suggesting that the two\-stage procedure reliably distinguishes genuine hallucinations from database coverage artifacts\.
### 2\.3 Model and Generation Factors
With this verification pipeline in place, we then study how model design and generation settings affect citation hallucination\. We vary four factors\. First, we compare Qwen2\.5\-14B\-Instruct and Qwen2\.5\-32B\-Instruct to isolate the effect of scale within the same model family\. Second, we compare Qwen2\.5\-32B\-Instruct with Qwen3\-30B\-A3B\-Instruct \[[24](https://arxiv.org/html/2604.18880#bib.bib35)\] to examine differences across model versions\. Third, we include Qwen3\-30B\-A3B\-Base together with its instruct counterpart to assess the effect of alignment training\. Finally, we vary the number of references requested per prompt across $N \in \{5, 10, 15\}$ to test whether hallucination increases as the model is asked to recall more entries at once\.
### 2\.4 Results: Hallucination Rates and Patterns
Table [1](https://arxiv.org/html/2604.18880#S1.T1) reports per\-field accuracy across all models and generation volumes\. Even the strongest models produce errors on a substantial fraction of references, with author and DOI fields consistently showing the lowest correct\-generation rates\. We analyze these results along three dimensions below\.
Field\-level Analysis\. Table [1](https://arxiv.org/html/2604.18880#S1.T1) and Figure [2](https://arxiv.org/html/2604.18880#S2.F2) reveal a consistent hierarchy of field difficulty\. Across all models and generation volumes, the author field is by far the most error\-prone, with correct generation rates below 14% at $N=15$\. Title and year achieve broadly similar accuracy in the 27–41% range, with no consistent ordering between them\. Venue lags slightly behind in most configurations\. DOI accuracy is the most model\-dependent: Qwen2\.5\-32B\-Instruct reaches 53\.5%, suggesting it has internalized a disproportionately large number of exact identifiers, whereas all other models fall in the 17–30% range\.
Crucially, the ordering author $<$ venue $<$ title $\approx$ year holds across every model and generation volume in the table, suggesting that field difficulty reflects a structural property of how these models encode bibliographic knowledge rather than an artifact of any particular configuration\. This hierarchy is compounded by a positional effect: accuracy is highest for the first two citations in a prompt and degrades sharply thereafter, indicating that models exhaust their most reliable parametric memory early regardless of field \(Appendix [D](https://arxiv.org/html/2604.18880#A4)\)\.
Citation Style\-level Analysis\. Figure [3](https://arxiv.org/html/2604.18880#S2.F3) presents hallucination rates across eight citation styles and five bibliographic fields\. The author field consistently exhibits the highest hallucination rate across all styles, ranging from 0\.86 \(AMA\) to 0\.89 \(APA, Chicago, Harvard, Vancouver, and MLA\), while DOI shows the lowest and most variable rates\. The remaining fields cluster closely in the mid\-range, indicating that models achieve similar reliability on these fields regardless of format\.
Citation style itself has negligible influence on hallucination behavior\. The maximum observed difference in hallucination rate between any two styles is less than 0\.04 across all fields, and Kruskal\-Wallis tests confirm that none of these differences reach statistical significance \(all $p > 0.05$; see Appendix [A](https://arxiv.org/html/2604.18880#A1) for full results\)\. This indicates that hallucination is governed by the nature of the bibliographic field rather than the formatting conventions of the citation style\.
Figure 3: Hallucination rate by citation style and field\. Each axis corresponds to a citation style; lines represent individual bibliographic fields\. Author hallucination rates \(annotated\) are consistently the highest across all styles, while style itself has negligible effect on any field\.
Model\-level Analysis\. As shown in Figure [4](https://arxiv.org/html/2604.18880#S2.F4)\(a\), we compare three models spanning a range of effective parameter counts: Qwen3\-30B\-A3B\-Instruct \(MoE, 3B active parameters\), Qwen2\.5\-14B\-Instruct \(dense, 14B\), and Qwen2\.5\-32B\-Instruct \(dense, 32B\)\. Despite activating only 3B parameters per token, the MoE model matches or slightly exceeds the 14B dense model on title, year, and venue, suggesting that its larger total parameter pool partially compensates for the sparse activation\. However, Qwen2\.5\-32B\-Instruct substantially outperforms both, with the largest gains on DOI and venue\. Author accuracy remains below 14% for all three models, indicating that neither dense scaling nor MoE capacity resolves the most difficult recall task\. Moonlight\-16B\-A3B\-Instruct and Mistral\-Small\-24B\-Instruct\-2501 follow the same field\-difficulty hierarchy, with author accuracy below 10% and DOI showing the largest between\-model spread\.
Additionally, three DeepSeek\-R1\-Distill variants \[[6](https://arxiv.org/html/2604.18880#bib.bib38)\] reveal that reasoning\-oriented distillation degrades citation recall\. DeepSeek\-R1\-Distill\-Qwen2\.5\-14B achieves only 0\.4% total accuracy at $N=15$, well below its non\-distilled counterpart Qwen2\.5\-14B\-Instruct, and DeepSeek\-R1\-Distill\-Qwen2\.5\-32B trails Qwen2\.5\-32B\-Instruct on every field, with the largest gap on DOI\. This pattern suggests that chain\-of\-thought distillation prioritizes reasoning structure at the expense of factual memorization\.
Figure 4: Per\-field accuracy under model\-level comparisons at $N=15$\. \(a\) Qwen2\.5\-32B\-Instruct \(dense, 32B\) leads on all fields\. \(b\) Instruction tuning produces negligible differences across all fields\.
Figure 5: Probe AUC across transformer layers for each bibliographic field in Qwen2\.5\-32B\-Instruct\.
## 3 Probing for Field\-Level Hallucination
The preceding analysis establishes that citation hallucination is pervasive across model families and generation settings, with certain fields like author lists consistently more error\-prone than others\. These findings raise a deeper question: do hallucinated citations arise only at the decoding surface, or has the model already committed to an erroneous output within its internal representations? If the latter is the case, it should be possible to *read off* hallucination from hidden states before any output token is produced\.
We focus probing and neuron\-localization analyses on Qwen2\.5\-32B\-Instruct because it is the strongest citation generator in Table [1](https://arxiv.org/html/2604.18880#S1.T1) and provides open\-weight access for hidden\-state and neuron\-level analysis\. To examine whether the cross\-field transfer pattern extends beyond this model family, we additionally replicate the transfer heatmap on Mistral\-Small\-24B\-Instruct\-2501 \[[14](https://arxiv.org/html/2604.18880#bib.bib33)\] in Appendix [E](https://arxiv.org/html/2604.18880#A5)\. We then train token\-level linear probes on the hidden states of Qwen2\.5\-32B\-Instruct and pose three progressively sharper questions:
- Is citation hallucination decodable from internal representations at all?
- How does the hallucination signal evolve across transformer layers, and do different bibliographic fields exhibit distinct layer\-wise profiles?
- Can a probe trained on one field generalize to another; in other words, does each field rely on the same internal mechanism?
To answer these questions, we train linear probes on hidden states extracted from field\-specific token spans to test whether hallucination signals are linearly decodable across layers\. The procedure consists of three steps: \(i\) serializing each reference with explicit field markers, \(ii\) extracting hidden states for tokens within each field span, and \(iii\) training field\-specific probes to detect hallucination signals\. This design enables layer\-wise and cross\-field analysis, as detailed below:
Figure 6: Probing pipeline for citation hallucination detection\.
### 3\.1 Input Serialization
Each generated reference is serialized into a plain\-text sequence in which every bibliographic field is enclosed by XML\-style tags \(see Appendix [B](https://arxiv.org/html/2604.18880#A2) for an example\)\. During serialization, the character\-level start and end offsets of each tagged block are recorded and later mapped to token positions via the tokenizer’s offset mapping\. This yields precise token spans for every field, enabling the field\-specific hidden\-state extraction described next\.
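The serialization and offset bookkeeping described above can be sketched in a few lines. This is an illustration under stated assumptions: the exact tag layout is in the paper's appendix and is not reproduced here, the spans below cover only the field value (not the tags), and `toy_offsets` is a hypothetical regex stand-in for a real tokenizer's offset mapping (for example, the one Hugging Face fast tokenizers return with `return_offsets_mapping=True`).

```python
import re

FIELDS = ("title", "authors", "venue", "year", "doi")

def serialize(reference):
    """Wrap each field in XML-style tags and record the character span of each field value."""
    parts, spans, cursor = [], {}, 0
    for field in FIELDS:
        value = str(reference[field])
        open_tag, close_tag = f"<{field}>", f"</{field}>"
        start = cursor + len(open_tag)
        spans[field] = (start, start + len(value))   # half-open [start, end)
        block = open_tag + value + close_tag + "\n"
        parts.append(block)
        cursor += len(block)
    return "".join(parts), spans

def toy_offsets(text):
    """Hypothetical stand-in for a tokenizer offset mapping: one (start, end) pair per token."""
    return [m.span() for m in re.finditer(r"\w+|[^\w\s]", text)]

def char_span_to_token_span(offsets, char_span):
    """Return the half-open token range [t_s, t_e) overlapping a character span."""
    start, end = char_span
    hits = [i for i, (s, e) in enumerate(offsets) if s < end and e > start]
    return (hits[0], hits[-1] + 1) if hits else (0, 0)

# Toy reference for illustration only.
ref = {"title": "A Toy Paper", "authors": "Ada Lovelace", "venue": "ToyConf",
       "year": 2020, "doi": "10.0000/toy"}
text, spans = serialize(ref)
offsets = toy_offsets(text)
print(char_span_to_token_span(offsets, spans["title"]))  # (3, 6)
```

The half-open convention matches the token span notation used in the feature-extraction step, so the resulting range can index hidden states directly.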
### 3\.2 Feature Extraction
Given a serialized reference, the text is tokenized and passed through the model with all hidden states retained\. For a target field $f$ with token span $[t_s, t_e)$, we collect the hidden\-state vectors $\{\mathbf{h}_{\ell}^{(t)}\}_{t=t_s}^{t_e-1}$ at each layer $\ell$\. Each vector serves as an individual training instance, paired with the binary hallucination label of field $f$\. Features are strictly field\-specific: probing for title hallucination uses only hidden states over the `title` span, probing for venue uses only the `venue` span, and so on\.
### 3\.3 Probe Training and Evaluation
We train $\ell_2$\-regularized logistic regression probes \(L\-BFGS, balanced class weights\) independently for each field and layer\. A linear model is chosen deliberately: if hallucination is detectable by a linear probe, the signal must be linearly accessible in hidden\-state space, a stronger claim than nonlinear detectability\. To prevent topic leakage, all references from the same topic are assigned to either training or test, never both, with a 1:1 class balance and an 80%/20% topic\-level split\. To test whether fields share a common hallucination signal, we evaluate each field’s probe on every other field’s test data; near\-chance cross\-field AUC would be consistent with field\-specific encoding \(details in Appendix [J](https://arxiv.org/html/2604.18880#A10)\)\. The pipeline is shown in Figure [6](https://arxiv.org/html/2604.18880#S3.F6)\.
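The topic-level split and class balancing described above can be sketched as follows. The record layout (`topic`, `label` keys), the seed handling, and the use of downsampling to reach a 1:1 balance are assumptions made for illustration; the probe itself is omitted.

```python
import random

def topic_level_split(examples, test_fraction=0.2, seed=0):
    """Assign whole topics to train or test so no topic appears in both partitions."""
    topics = sorted({ex["topic"] for ex in examples})
    rng = random.Random(seed)
    rng.shuffle(topics)
    n_test = max(1, round(len(topics) * test_fraction))
    test_topics = set(topics[:n_test])
    train = [ex for ex in examples if ex["topic"] not in test_topics]
    test = [ex for ex in examples if ex["topic"] in test_topics]
    return train, test

def balance_classes(examples, seed=0):
    """Downsample the majority class to a 1:1 ratio (label 1 = hallucinated field)."""
    rng = random.Random(seed)
    pos = [ex for ex in examples if ex["label"] == 1]
    neg = [ex for ex in examples if ex["label"] == 0]
    k = min(len(pos), len(neg))
    return rng.sample(pos, k) + rng.sample(neg, k)

# Toy data: 10 topics with 10 examples each, half of them labelled hallucinated.
examples = [{"topic": f"topic_{i % 10}", "label": (i // 10) % 2} for i in range(100)]
train, test = topic_level_split(examples)
train_balanced = balance_classes(train)
print(len(train), len(test))  # 80 20
```

Splitting by topic before balancing keeps the leakage guarantee intact, since balancing only removes examples within a partition.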
### 3\.4 Layer\-wise and Cross\-field Probing Analysis
Figure [5](https://arxiv.org/html/2604.18880#S2.F5) plots probe AUC across all 64 transformer layers for each citation field\. All five fields share a common early\-layer dip: AUC starts relatively high at layers 2–4, drops through layers 6–10, then diverges sharply by field\. Authors exhibits the strongest recovery, climbing from a minimum of 0\.892 at layer 8 to a peak of 0\.935 at layer 46 before declining in the final layers, a swing of over four points that indicates substantial mid\-layer consolidation of author knowledge\. Title follows a similar but more gradual recovery, reaching its peak of 0\.888 only at layer 64, the sole field whose signal strengthens monotonically through the final layers\. DOI and Venue are comparatively stable across depth, with DOI fluctuating narrowly around 0\.878 and Venue around 0\.824\. Year is the only field whose signal degrades monotonically after the initial layers, declining from 0\.830 at layer 2 to 0\.790 by layer 62\.
Spearman rank correlations \[[19](https://arxiv.org/html/2604.18880#bib.bib34)\] confirm that these trajectories are statistically distinct: Authors, Title, and DOI show significant positive trends with depth \($\rho \geq +0.55$, all $p < 0.001$\), while Year trends negatively \($\rho = -0.52$, $p < 0.01$\)\. Fisher $z$\-tests show that Year’s trend differs significantly from every other field \(all $p < 0.001$\), and a permutation test rejects the null that all fields share the same layer–AUC profile \($p < 0.0001$\)\. Bootstrap 95% confidence intervals for the peak layer are non\-overlapping across four of five fields, spanning from layer 2 \(Year\) to layer 64 \(Title\), while Venue’s wide interval covering from layer 4 to 48 reflects its flat profile \(full statistical details in Appendix [I](https://arxiv.org/html/2604.18880#A9)\)\.
We next test whether the hallucination signal is shared across fields or encoded independently\. Figure [7](https://arxiv.org/html/2604.18880#S3.F7) presents a $5 \times 5$ cross\-field AUC heatmap, where entry $(i, j)$ reports the AUC of a probe trained on field $i$ and evaluated on the test set of field $j$\. Diagonal entries \(in\-field performance\) range from 0\.812 to 0\.922, confirming that hallucination is reliably decodable within each field\. Off\-diagonal entries, however, remain near chance \(0\.46–0\.59\), indicating that a probe trained on one field’s hallucination signal carries almost no predictive power for any other field\. This gap suggests that each field’s hallucination is encoded in a structurally distinct subspace of the model’s representation, rather than a shared signal that generalizes across bibliographic fields\.
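The cross-field evaluation amounts to scoring every probe on every field's test set and computing AUC for each pair. The sketch below uses a rank-based AUC and two hand-built toy probes that each read a different feature dimension; the probes, features, and field names are invented for illustration and do not come from the paper's data.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def cross_field_auc(probes, test_sets):
    """Entry (i, j): AUC of the probe trained on field i, evaluated on field j's test set."""
    return {
        (train_f, eval_f): roc_auc(labels, [probe(x) for x in feats])
        for train_f, probe in probes.items()
        for eval_f, (feats, labels) in test_sets.items()
    }

# Toy probes: each reads a different coordinate of a 2-d feature vector.
probes = {"title": lambda x: x[0], "year": lambda x: x[1]}
test_sets = {
    "title": ([(1, 0), (1, 1), (0, 0), (0, 1)], [1, 1, 0, 0]),  # label follows dim 0
    "year":  ([(0, 1), (1, 1), (0, 0), (1, 0)], [1, 1, 0, 0]),  # label follows dim 1
}
heatmap = cross_field_auc(probes, test_sets)
print(heatmap[("title", "title")], heatmap[("title", "year")])  # 1.0 0.5
```

When each field's signal lives in a different direction, the diagonal stays high while off-diagonal entries fall to 0.5, which is the qualitative pattern reported for the heatmap.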
Figure 7: Cross\-field AUC heatmap for Qwen2\.5\-32B\-Instruct\. Diagonal entries \(in\-field\) range from 0\.812 to 0\.922, while off\-diagonal entries remain near chance \(0\.46–0\.59\), indicating that hallucination signals do not transfer across bibliographic fields\.
## 4 Field\-Specific Neuron Localization
The probing results establish that hallucination is encoded in a field\-specific, linearly accessible manner, but do not identify which individual neurons carry the signal\. Pinpointing these neurons is a prerequisite for causal intervention\. We develop a two\-stage pipeline that isolates a sparse, stable set of per\-field hallucination neurons from the full CETT feature space and tests their causal role via activation patching\. We term the neurons identified through this procedure *field\-specific hallucination neurons* \(FH\-neurons\)\.
For each FFN layer $\ell$ and intermediate neuron $n$, CETT is defined as
$$\text{CETT}_{\ell,n} \;=\; \frac{|a_n| \cdot \|W_{\text{down}}^{(\ell)}[:,n]\|_2}{\|y^{(\ell)}\|_2}, \tag{1}$$
where $a_n$ is the neuron’s pre\-projection activation, $W_{\text{down}}^{(\ell)}$ is the down\-projection weight matrix, and $y^{(\ell)}$ is the FFN output vector\. We compute CETT for every valid token across all 64 layers via forward hooks on each layer’s down\-projection, yielding a 1\.77M\-dimensional feature vector per token \(64 layers $\times$ 27,648 neurons\)\. To obtain a single feature representation per reference, we average the per\-token CETT values across all tokens within the target field’s span, yielding one vector per reference rather than per token\. Training and test sets are split at the topic level to prevent information leakage, and within each partition the two classes are balanced by downsampling\.
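Equation (1) and the span averaging can be illustrated on a toy layer. This is a plain-Python sketch, assuming a bias-free down-projection so that the FFN output is simply the weight matrix applied to the activations; the tiny 2x2 weights are made up for the example.

```python
import math

def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))

def cett_for_token(activations, w_down):
    """
    CETT per neuron for one token, following Eq. (1).
    activations: intermediate activations a_n (length = number of neurons)
    w_down: d_model x n_neurons matrix as a list of rows, so column n is W_down[:, n]
    """
    d_model, n_neurons = len(w_down), len(activations)
    # FFN output y = W_down @ a (assumes no bias term).
    y = [sum(w_down[i][n] * activations[n] for n in range(n_neurons)) for i in range(d_model)]
    y_norm = l2_norm(y)
    if y_norm == 0.0:
        return [0.0] * n_neurons
    col_norms = [l2_norm([w_down[i][n] for i in range(d_model)]) for n in range(n_neurons)]
    return [abs(a) * c / y_norm for a, c in zip(activations, col_norms)]

def mean_over_span(per_token_cett):
    """Average per-token CETT vectors over a field's token span (one vector per reference)."""
    n_tokens = len(per_token_cett)
    return [sum(col) / n_tokens for col in zip(*per_token_cett)]

w_down = [[1.0, 0.0],
          [0.0, 2.0]]
token_a = cett_for_token([3.0, 0.0], w_down)   # only neuron 0 is active
token_b = cett_for_token([0.0, 1.5], w_down)   # only neuron 1 is active
print(token_a, token_b, mean_over_span([token_a, token_b]))
```

At the scale in the text, the same quantity would be computed for 27,648 neurons in each of 64 layers and concatenated into one feature vector.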
### 4\.1 Selection Pipeline
The CETT feature space is both high\-dimensional and highly correlated across neighboring neurons, making direct feature selection unstable\. We therefore decompose FH\-neuron identification into two stages: sparse candidate selection via elastic\-net regression, followed by stability filtering across bootstrap resamples to retain only neurons whose selection is robust to the particular data split\.
Stage 1: Sparse selection via elastic\-net regression\. We train a logistic regression with an elastic\-net penalty over the full CETT feature vector\. The regularization combines an $\ell_1$ term with a small $\ell_2$ term:
$$\mathcal{L} \;=\; \text{BCE}(\hat{y}, y) \;+\; \alpha\, r\, \lVert w \rVert_1 \;+\; \frac{\alpha\,(1-r)}{2}\, \lVert w \rVert_2^2, \tag{2}$$
where $r = 0.8$ is the $\ell_1$ ratio\. The $\ell_1$ component is optimized via SGD with a proximal soft\-thresholding step after each gradient update, while the $\ell_2$ component enters through standard gradient descent\. The small $\ell_2$ regularization improves stability when CETT features are correlated across neighboring neurons, encouraging grouped selection without sacrificing the sparsity induced by $\ell_1$\. The surviving non\-zero entries nominate candidate \(layer, neuron\) pairs whose CETT values are most predictive of hallucination\.
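One update of the optimization described above can be sketched as a gradient step on the smooth part of Eq. (2) followed by soft-thresholding for the l1 term. This is a single-example, bias-free illustration; the learning rate and alpha value are arbitrary assumptions, and the function names are not from the paper.

```python
import math

def soft_threshold(w, t):
    """Proximal operator of t * |w|: shrink toward zero and clip at zero."""
    return math.copysign(max(abs(w) - t, 0.0), w)

def elastic_net_step(w, x, y, lr, alpha, r=0.8):
    """One proximal SGD step on BCE + alpha*r*||w||_1 + alpha*(1-r)/2*||w||_2^2."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))                 # predicted hallucination probability
    updated = []
    for wi, xi in zip(w, x):
        # Gradient of the smooth part: BCE term plus the l2 penalty.
        grad = (p - y) * xi + alpha * (1.0 - r) * wi
        wi = wi - lr * grad
        # Proximal step for the l1 penalty.
        updated.append(soft_threshold(wi, lr * alpha * r))
    return updated

# Feature 0 is active on a hallucinated example (y = 1); feature 1 is inactive.
w_new = elastic_net_step([0.0, 0.0], x=[1.0, 0.0], y=1, lr=0.1, alpha=0.5)
print(w_new)
```

The inactive feature stays exactly at zero after the proximal step, which is how the l1 term produces a sparse set of candidate neurons.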
The regularization strength $\alpha$ is selected via grid search, scoring each candidate by a composite criterion that balances validation AUC against sparsity\. We adopt AUC rather than accuracy as the detection metric because our goal is to identify neurons that *reliably discriminate* hallucinated from correct fields; AUC captures this ranking quality and is less sensitive to the choice of decision threshold\. The score penalizes the proportion of non\-zero neurons to favor parsimonious selections while preserving discriminative power\. We train one such model per field, yielding a field\-specific set of candidate FH\-neurons\.
Stage 2: Stability selection with permutation control\. The candidates identified by a single elastic\-net fit may be sensitive to the particular data split\. To guard against this, we apply stability selection: we repeat the elastic\-net regression across 20 bootstrap resamples drawn with a 50% subsample ratio, and record how frequently each neuron is selected across resamples\. Only neurons whose selection frequency exceeds 60% are retained as stable FH\-neurons\. Among the stable set, we further restrict to neurons with *positive* regression weights, since a positive coefficient indicates that higher CETT activation of that neuron is associated with increased hallucination probability; these are the *pro\-hallucination* neurons targeted by our intervention\.
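The stability filter can be sketched as a frequency count over repeated fits. In this illustration `toy_selector` is a hypothetical stand-in for one elastic-net fit, subsamples are drawn without replacement, and the sign filter uses the mean weight across resamples; the text does not specify either of the latter two choices, so both are assumptions.

```python
import random
from collections import defaultdict

def stability_select(examples, fit_selector, n_resamples=20, subsample_ratio=0.5,
                     threshold=0.6, seed=0):
    """Keep neurons selected in more than `threshold` of resamples with positive mean weight."""
    rng = random.Random(seed)
    k = max(1, int(len(examples) * subsample_ratio))
    counts = defaultdict(int)
    weight_sums = defaultdict(float)
    for _ in range(n_resamples):
        subsample = rng.sample(examples, k)        # assumption: drawn without replacement
        for neuron, weight in fit_selector(subsample).items():
            if weight != 0.0:
                counts[neuron] += 1
                weight_sums[neuron] += weight
    return {
        neuron for neuron, c in counts.items()
        if c / n_resamples > threshold and weight_sums[neuron] / c > 0.0
    }

def toy_selector(subsample):
    """Hypothetical stand-in for one elastic-net fit; returns {(layer, neuron): weight}."""
    weights = {(10, 3): 0.8, (40, 7): -0.5}        # always selected: one positive, one negative
    if 0 in subsample and 1 in subsample:          # selected only in some subsamples
        weights[(55, 1)] = 0.2
    return weights

stable = stability_select(list(range(100)), toy_selector)
print(sorted(stable))
```

The consistently selected negative-weight neuron is dropped by the sign filter, and the sporadically selected neuron is dropped by the frequency threshold, leaving only stable pro-hallucination candidates.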
After stability selection, this procedure retains 224, 78, 129, 51, and 30 positive\-weight FH\-neurons for Title, Authors, Year, Venue, and DOI respectively, out of 1,769,472 candidates per field \(at most 0\.013% of the feature space\)\. These neurons are not uniformly distributed across layers\. We divide the 64 layers into three roughly equal bands: early \(layers 0–21\), middle \(layers 22–42\), and late \(layers 43–63\)\. Authors FH\-neurons concentrate in the middle band \(60\.3%\), DOI neurons cluster in the early band \(66\.7%\), while Title and Year neurons skew toward the late band \(41\.5% and 46\.5%\)\. Venue neurons are the most dispersed, spread across 30 layers with no layer exceeding 4 neurons\. The full per\-layer distribution is provided in Table [4](https://arxiv.org/html/2604.18880#A6.T4) \(Appendix [F](https://arxiv.org/html/2604.18880#A6)\)\.
Following the stability selection framework of Meinshausen and Bühlmann \[[12](https://arxiv.org/html/2604.18880#bib.bib24)\], the expected number of false discoveries is bounded by
$$\mathbb{E}[V] \;\leq\; \frac{q^2}{(2\pi_{\mathrm{thr}} - 1) \cdot p}, \tag{3}$$
where $q$ is the average number of selected variables per subsample, $p$ is the number of candidates, and $\pi_{\mathrm{thr}}$ is the selection\-frequency threshold \(0\.6 here\)\. With our threshold and large $p$, this bound remains well below one\. As an additional sanity check, we repeat the entire procedure with randomly permuted labels\. No neuron exceeds the selection threshold under the null distribution, confirming that the stable set reflects genuine hallucination signal rather than statistical noise\.
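To see how loose or tight the bound in Eq. (3) is at this scale, it helps to plug in the stated values. The sketch below uses p = 64 x 27,648 and a threshold of 0.6 from the text; the average selection size q is not reported in this section, so the value 300 is purely hypothetical.

```python
import math

def false_discovery_bound(q, p, pi_thr):
    """Upper bound on E[V] from Eq. (3): q^2 / ((2 * pi_thr - 1) * p)."""
    if not 0.5 < pi_thr <= 1.0:
        raise ValueError("the bound requires 0.5 < pi_thr <= 1")
    return q ** 2 / ((2 * pi_thr - 1) * p)

P = 64 * 27_648          # candidates per field, from the text
PI_THR = 0.6             # selection-frequency threshold, from the text

# Largest average selection size q for which the bound stays below one.
q_limit = math.sqrt((2 * PI_THR - 1) * P)

# q is not reported in this section; 300 is a purely hypothetical value.
example_bound = false_discovery_bound(q=300, p=P, pi_thr=PI_THR)

print(round(q_limit, 1), round(example_bound, 3))
```

Under these assumptions the bound stays below one whenever each subsample fit selects fewer than roughly 595 neurons on average.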
### 4\.2 Causal Intervention
The selection pipeline above establishes that certain neurons are reliably associated with hallucination, but association does not imply causation\. To test whether the identified FH\-neurons bear a causal relationship to citation hallucination, we perform activation patching at inference time\.
Intervention mechanism\. For each pro\-hallucination FH\-neuron $(\ell, n)$, we register a forward pre\-hook on the down\-projection of layer $\ell$ that intercepts the intermediate activation vector before it is projected to the residual stream\. The target neuron’s activation is scaled by a factor $\beta$:
$$a_n' = \beta \cdot a_n, \tag{4}$$
where $\beta = 0$ fully suppresses the neuron, $\beta = 0.5$ attenuates it, $\beta = 1$ leaves it unchanged \(baseline\), and $\beta > 1$ amplifies it\. All other neurons remain untouched\.
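The effect of Eq. (4) on a single layer can be shown without any deep-learning framework. The sketch below scales selected entries of the intermediate activation vector and then applies a toy down-projection; in PyTorch the same scaling would typically live inside a hook registered with `register_forward_pre_hook` on the down-projection module, which is not shown here. The weights and neuron indices are made up.

```python
def scale_neurons(activations, target_neurons, beta):
    """Return a copy of the activations with a'_n = beta * a_n for targeted neurons only."""
    return [beta * a if n in target_neurons else a for n, a in enumerate(activations)]

def down_project(w_down, activations):
    """Project intermediate activations to the residual stream: y = W_down @ a."""
    return [sum(w * a for w, a in zip(row, activations)) for row in w_down]

activations = [1.0, 2.0, 3.0]
w_down = [[1.0, 1.0, 1.0]]          # toy 1 x 3 down-projection
targets = {1}                       # hypothetical FH-neuron index within this layer

baseline = down_project(w_down, scale_neurons(activations, targets, beta=1.0))
suppressed = down_project(w_down, scale_neurons(activations, targets, beta=0.0))
amplified = down_project(w_down, scale_neurons(activations, targets, beta=2.0))
print(baseline, suppressed, amplified)  # [6.0] [4.0] [8.0]
```

Because the scaling is applied before the projection, only the targeted neuron's column of the weight matrix changes its contribution to the layer output.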
Experimental conditionsWe design three conditions to establish causality from complementary directions\. In thesuppressionexperiment, we setβ<1\\beta<1and regenerate citations across the same topics and styles used in our dataset, then re\-verify each reference to measure the change in per\-field hallucination rate\. A decrease in hallucination rate would indicate that the suppressed neurons are causally involved in producing errors\. In theenhancementexperiment, we setβ\>1\\beta\>1to test whether amplifying FH\-neuron activations increases hallucination, providing a complementary confirmation of the causal direction\. Finally, arandom controlexperiment applies the same suppression \(β=0\\beta=0\) to randomly selected neurons of the same count, repeated over five trials\. A null effect in this condition would confirm that the observed changes are specific to the identified FH\-neurons rather than a generic consequence of neuron ablation\. All conditions use greedy decoding to ensure that any difference in output is attributable solely to the intervention\. To further confirm that the intervention itself does not corrupt generation, we verified that JSON schema validity remains high across all conditions \(baseline and suppression: 100%; enhancement: 99\.0%; random control: 97\.0%\), ruling out schema collapse as a confound\.
| Targeted field | Direction | $\beta$ | Title | Authors | Year | Venue | DOI |
|---|---|---|---|---|---|---|---|
| Title | Enhance ↑ | 2.0 | 73.2% | 15.5% | 40.2% | 42.3% | 51.5% |
| Title | Enhance ↑ | 4.0 | 4.7% | 0.0% | 10.6% | 8.2% | 2.4% |
| Title | Suppress ↓ | 0.0 | 82.8% | 20.2% | 54.5% | 41.4% | 55.6% |
| Title | Suppress ↓ | 0.5 | 76.5% | 19.4% | 48.0% | 38.8% | 52.0% |
| Authors | Enhance ↑ | 2.0 | 77.6% | 13.3% | 55.1% | 43.9% | 59.2% |
| Authors | Enhance ↑ | 4.0 | 70.2% | 8.5% | 33.0% | 24.5% | 41.5% |
| Authors | Suppress ↓ | 0.0 | 69.1% | 23.7% | 45.4% | 43.3% | 58.8% |
| Authors | Suppress ↓ | 0.5 | 68.4% | 24.2% | 46.3% | 43.2% | 60.0% |
| Year | Enhance ↑ | 2.0 | 66.7% | 12.5% | 40.6% | 32.3% | 46.9% |
| Year | Enhance ↑ | 4.0 | 63.8% | 11.7% | 27.7% | 29.8% | 62.8% |
| Year | Suppress ↓ | 0.0 | 77.8% | 23.2% | 46.5% | 44.4% | 59.6% |
| Year | Suppress ↓ | 0.5 | 79.4% | 19.6% | 44.3% | 41.2% | 50.5% |
| Venue | Enhance ↑ | 2.0 | 78.9% | 15.8% | 45.3% | 38.9% | 49.5% |
| Venue | Enhance ↑ | 4.0 | 67.4% | 11.6% | 37.9% | 31.6% | 33.7% |
| Venue | Suppress ↓ | 0.0 | 73.2% | 20.6% | 40.2% | 35.1% | 52.6% |
| Venue | Suppress ↓ | 0.5 | 74.7% | 20.0% | 50.5% | 41.1% | 53.7% |
| DOI | Enhance ↑ | 2.0 | 78.4% | 19.6% | 42.3% | 42.3% | 47.4% |
| DOI | Enhance ↑ | 4.0 | 80.4% | 23.7% | 57.7% | 52.6% | 38.1% |
| DOI | Suppress ↓ | 0.0 | 77.1% | 18.8% | 42.7% | 40.6% | 55.2% |
| DOI | Suppress ↓ | 0.5 | 77.1% | 21.9% | 55.2% | 40.6% | 57.3% |
| Random Control | Random | - | 59.9% | 15.3% | 40.8% | 35.2% | 49.1% |
| Baseline | Baseline | - | 76.3% | 17.5% | 48.5% | 41.2% | 54.6% |

Table 2: Field-level accuracy on Qwen2.5-32B-Instruct citations under neuron activation scaling. Each cell reports the percentage of citations judged correct by the verifier for that field. ↑/↓ denotes the scaling direction (enhance/suppress). For computational feasibility, all intervention conditions and the baseline are evaluated on a fixed sample of 2 citation styles, 10 randomly selected topics, and 5 references per prompt. All values are therefore mutually comparable within this table. The baseline differs from the corresponding entry in Table [1](https://arxiv.org/html/2604.18880#S1.T1) due to topic-level variance in the model's parametric knowledge; see Appendix [H](https://arxiv.org/html/2604.18880#A8) for the full per-topic accuracy distribution.
### 4.3 Results
Table [2](https://arxiv.org/html/2604.18880#S4.T2) reports per-field accuracy under each intervention condition; full statistical procedures are provided in Appendix [G](https://arxiv.org/html/2604.18880#A7).
The enhancement condition provides the clearest evidence for a causal link. At $\beta = 4.0$, accuracy on every targeted field drops sharply relative to baseline: Title collapses from 76.3% to 4.7%, Authors falls from 17.5% to 8.5%, and the remaining three fields lose between 9.6 and 20.8 percentage points. All five paired differences are negative under a one-sided Wilcoxon signed-rank test ($p = 0.031$), confirming that FH-neurons actively promote hallucination when amplified.
In the opposite direction, suppressing FH-neurons at $\beta = 0$ raises Title accuracy by 6.5 percentage points and Authors accuracy by 6.2 percentage points relative to baseline, while random ablation of the same number of neurons degrades all five fields, consistent with non-specific disruption rather than targeted correction. Comparing the two directly, FH-neuron suppression outperforms random ablation on four of five targeted fields ($p = 0.062$; narrowly above $\alpha = 0.05$ due to the small effective sample size of $n = 5$, but consistent in direction and substantial in magnitude). We also observe partial positive spillover to non-targeted fields under moderate suppression, suggesting that bibliographic fields share some but not all of their underlying neural substrates. This asymmetry is consistent with the cross-field probe transfer reported in Section [3.4](https://arxiv.org/html/2604.18880#S3.SS4).
Not all fields respond equally to intervention. Venue-targeted suppression at $\beta = 0$ matches the random-control level (35.1% vs. 35.2%), yet the milder $\beta = 0.5$ preserves accuracy at 41.1%, suggesting that Venue's 51 FH-neurons are correctly identified but too sparsely distributed for complete ablation to remain distinguishable from random disruption. DOI enhancement at $\beta = 4.0$ unexpectedly improves Year and Venue while degrading only DOI itself. This effect is unique to DOI, whose 30 FH-neurons are the fewest of any field and concentrate in early layers (66.7% in layers 0–21) rather than the middle-to-late layers where most other fields' neurons reside (Table [4](https://arxiv.org/html/2604.18880#A6.T4)).
## 5 Related Work
#### Citation hallucination and model reliability.
LLMs generate structurally plausible but factually incorrect references at alarming rates when operating from parametric memory alone. \[[21](https://arxiv.org/html/2604.18880#bib.bib25)\] found that 55% of GPT-3.5 citations and 18% of GPT-4 citations across 42 multidisciplinary topics were entirely fabricated, and \[[10](https://arxiv.org/html/2604.18880#bib.bib28)\] showed that fabrication rates in GPT-4o vary sharply with topic familiarity. At the level of published literature, \[[23](https://arxiv.org/html/2604.18880#bib.bib26)\] benchmarked 13 LLMs across 40 research domains and found hallucination rates ranging from 14% to 95%. The practical consequences are already visible: a recent audit of NeurIPS 2025 accepted papers uncovered over 100 hallucinated citations that had passed peer review undetected \[[18](https://arxiv.org/html/2604.18880#bib.bib22)\], underscoring that this is not merely a theoretical concern but an active threat to the integrity of the scientific record. Post-hoc verification pipelines such as FActScore \[[13](https://arxiv.org/html/2604.18880#bib.bib3)\] and SAFE \[[22](https://arxiv.org/html/2604.18880#bib.bib8)\] can identify these errors but require dozens of API calls per reference and offer no insight into why the model erred; retrieval-augmented generation \[[8](https://arxiv.org/html/2604.18880#bib.bib18)\] sidesteps the problem but is inapplicable offline. In the embodied setting, \[[20](https://arxiv.org/html/2604.18880#bib.bib31)\] showed that standard token-level uncertainty metrics can mask safety-critical failure signals in vision-language-action models, requiring task-specific aggregation to produce reliable confidence estimates. These findings share a common theme with our work: models can appear confident while their internal signals are structured in ways that poorly reflect actual correctness, and domain-specific analysis is needed to expose and address the gap.
#### Probing and intervening on internal representations.
A growing body of work shows that LLM hidden states encode linearly accessible truth-related signals \[[11](https://arxiv.org/html/2604.18880#bib.bib5), [1](https://arxiv.org/html/2604.18880#bib.bib11), [3](https://arxiv.org/html/2604.18880#bib.bib39), [7](https://arxiv.org/html/2604.18880#bib.bib27)\], and that activation-level interventions can steer models toward truthfulness at inference time \[[9](https://arxiv.org/html/2604.18880#bib.bib36), [4](https://arxiv.org/html/2604.18880#bib.bib40)\]. At the neuron level, \[[5](https://arxiv.org/html/2604.18880#bib.bib13)\] identified hallucination-associated neurons and showed that suppressing them reduces errors in factoid QA. \[[16](https://arxiv.org/html/2604.18880#bib.bib21)\] found that truthfulness probes generalize weakly across datasets, implying that truthfulness encoding is multifaceted rather than universal. However, these studies uniformly treat hallucination as a monolithic phenomenon. Citation hallucination is harder to address: different fields within a reference can fail for different reasons. It remains unclear whether these failures share a common mechanism or are field-specific, and whether neuron-level interventions can target them individually.
## 6 Conclusion
We study citation hallucination as a structured failure of bibliographic generation, where errors decompose across fields and are internally localized. Empirically, different fields fail at different rates and follow distinct internal patterns, with signals learned for one field transferring poorly to others. Building on these findings, we identify a sparse set of field-specific hallucination neurons and show that targeted interventions on them can selectively improve citation accuracy. Overall, our results indicate that citation hallucination arises from field-dependent internal representations and can be partly mitigated from within the model, without relying solely on external retrieval.
## 7 Limitations
All intervention analyses are limited by the small number of structured fields ($n = 5$), which constrains statistical power, and we do not evaluate whether neuron suppression affects output fluency beyond the JSON schema validity reported in Section [4.2](https://arxiv.org/html/2604.18880#S4.SS2). Additionally, probing and neuron localization use a single model (Qwen2.5-32B-Instruct) and 50 topics from computer science, leaving generalization to other architectures and domains untested.
## References
- \[1\] A. Azaria and T. Mitchell (2023). The internal state of an llm knows when it’s lying. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 967–976.
- \[2\] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, et al. (2023). Qwen technical report. arXiv preprint arXiv:2309.16609.
- \[3\] C. Burns, H. Ye, D. Klein, and J. Steinhardt (2022). Discovering latent knowledge in language models without supervision. arXiv preprint arXiv:2212.03827.
- \[4\] Y. Chuang, Y. Xie, H. Luo, Y. Kim, J. Glass, and P. He (2023). Dola: decoding by contrasting layers improves factuality in large language models. arXiv preprint arXiv:2309.03883.
- \[5\] C. Gao, H. Chen, C. Xiao, Z. Chen, Z. Liu, and M. Sun (2025). H-Neurons: on the existence, impact, and origin of hallucination-associated neurons in LLMs. arXiv preprint arXiv:2512.01797.
- \[6\] D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. (2025). Deepseek-r1: incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.
- \[7\] J. Kossen, J. Han, M. Razzak, L. Schut, S. Malik, and Y. Gal (2024). Semantic entropy probes: robust and cheap hallucination detection in llms. arXiv preprint arXiv:2406.15927.
- \[8\] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, et al. (2020). Retrieval-augmented generation for knowledge-intensive nlp tasks. In Advances in Neural Information Processing Systems, Vol. 33, pp. 9459–9474.
- \[9\] K. Li, O. Patel, F. Viégas, H. Pfister, and M. Wattenberg (2023). Inference-time intervention: eliciting truthful answers from a language model. Advances in Neural Information Processing Systems 36, pp. 41451–41530.
- \[10\] J. Linardon, H. K. Jarman, Z. McClure, C. Anderson, C. Liu, and M. Messer (2025). Influence of topic familiarity and prompt specificity on citation fabrication in mental health research using large language models: experimental study. JMIR Mental Health 12, pp. e80371.
- \[11\] S. Marks and M. Tegmark (2023). The geometry of truth: emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824.
- \[12\] N. Meinshausen and P. Bühlmann (2010). Stability selection. Journal of the Royal Statistical Society Series B: Statistical Methodology 72(4), pp. 417–473.
- \[13\] S. Min, K. Krishna, X. Lyu, M. Lewis, W. Yih, P. W. Koh, M. Iyyer, L. Zettlemoyer, and H. Hajishirzi (2023). FActScore: fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 12076–12100.
- \[14\] Mistral AI Team (2025-01-30). Mistral small 3. Mistral AI. [https://mistral.ai/news/mistral-small-3](https://mistral.ai/news/mistral-small-3). Accessed: 2026-04-15.
- \[15\] OpenAI (2026). Models. [https://platform.openai.com/docs/models](https://platform.openai.com/docs/models). Accessed: 2026-04-15.
- \[16\] H. Orgad, M. Toker, and Y. Belinkov (2024). LLM-Assisted hallucination detection in LLM-generated text through probing. In Findings of the Association for Computational Linguistics: ACL 2024.
- \[17\] J. Priem, H. Piwowar, and R. Orr (2022). OpenAlex: a fully-open index of the world’s research works, authors, venues, institutions, and concepts. arXiv preprint arXiv:2205.01833.
- \[18\] N. Shmatko, A. Adam, and P. Esau (2026-01). GPTZero finds 100 new hallucinations in NeurIPS 2025 accepted papers. [https://gptzero.me/news/neurips/](https://gptzero.me/news/neurips/). Accessed: 2026-03-20.
- \[19\]Cited by:[§3\.4](https://arxiv.org/html/2604.18880#S3.SS4.p2.7)\.
- \[20\] Y. Tang, T. Wang, Y. Chen, B. Zhang, Q. Guan, and R. Tang (2026). Shifting uncertainty to critical moments: towards reliable uncertainty quantification for vla model. arXiv preprint arXiv:2603.18342.
- \[21\] W. H. Walters and E. I. Wilder (2023). Fabrication and errors in the bibliographic citations generated by chatgpt. Scientific Reports 13(1), pp. 14045.
- \[22\] J. Wei, C. Yang, X. Song, Y. Lu, N. Hu, D. Tran, D. Peng, R. Liu, D. Huang, C. Du, et al. (2024). Long-form factuality in large language models. arXiv preprint arXiv:2403.18802.
- \[23\] Z. Xu, Y. Qiu, L. Sun, F. Miao, F. Wu, X. Wang, X. Li, H. Lu, Z. Zhang, Y. Hu, et al. (2026). GhostCite: a large-scale analysis of citation validity in the age of large language models. arXiv preprint arXiv:2602.06718.
- \[24\] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025). Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- \[25\] A. Zou, L. Phan, S. Chen, J. Campbell, P. Guo, R. Ren, A. Pan, X. Yin, M. Mazeika, A. Dombrowski, et al. (2023). Representation engineering: a top-down approach to ai transparency. arXiv preprint arXiv:2310.01405.
## Appendix A Effect of Citation Style on Hallucination Rate
Table [3](https://arxiv.org/html/2604.18880#A1.T3) reports Kruskal-Wallis test results comparing hallucination rates across eight citation styles for each bibliographic field. No field yields a significant difference ($p > 0.05$, $\eta^2 < 0.01$ in all cases), and the maximum observed difference between any two styles is less than 0.04, confirming that citation format has negligible practical impact on hallucination behavior.

Table 3: Kruskal-Wallis test results across citation styles.

| Field | H-statistic | p-value | Sig. |
|---|---|---|---|
| Title | 6.398 | 0.4941 | ns |
| Authors | 5.642 | 0.5821 | ns |
| Year | 11.062 | 0.1359 | ns |
| Venue | 11.868 | 0.1050 | ns |
| DOI | 13.639 | 0.0580 | ns |
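As a sanity check on Table 3, the sketch below recomputes each p-value from its H-statistic. It assumes the usual chi-square approximation for the Kruskal-Wallis statistic with $k - 1 = 7$ degrees of freedom for eight styles; the function name `chi2_sf_odd_df` is ours, and the closed form used applies only to odd degrees of freedom.

```python
import math


def chi2_sf_odd_df(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable with odd `df`.

    Uses Q(1/2, y) = erfc(sqrt(y)) and the recurrence
    Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1), with y = x / 2.
    """
    if df < 1 or df % 2 == 0:
        raise ValueError("this sketch handles odd degrees of freedom only")
    y = x / 2.0
    tail = math.erfc(math.sqrt(y))
    a = 0.5
    while a < df / 2.0:
        tail += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return tail


# H-statistics copied from Table 3; eight styles give 7 degrees of freedom.
H_STATS = {"Title": 6.398, "Authors": 5.642, "Year": 11.062, "Venue": 11.868, "DOI": 13.639}
DF = 8 - 1
p_values = {field: chi2_sf_odd_df(h, DF) for field, h in H_STATS.items()}
```

Under that assumption the recomputed values agree with the reported p-values to about three decimal places, and none falls below 0.05.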
## Appendix B Serialization Example
Probing hidden states at the level of individual fields requires knowing *exactly* which tokens correspond to which bibliographic component. To achieve this, each generated reference is serialized into a plain-text sequence in which every field is enclosed by XML-style tags:
```
<TITLE> Attention Is All You Need </TITLE>
<AUTHORS> Ashish Vaswani | ... </AUTHORS>
<VENUE> NeurIPS </VENUE>
<YEAR> 2017 </YEAR>
<DOI> 10.48550/arXiv.1706.03762 </DOI>
```
During serialization, the character-level start and end offsets of each tagged block are recorded. For the `authors` field, individual author names (delimited by "|" separators) are further tracked as sub-spans, enabling finer-grained analysis if needed. These character offsets are later mapped to token positions via the tokenizer's offset mapping, yielding precise token spans for every field. This design ensures that downstream feature extraction operates on the exact set of tokens that encode a given field, rather than relying on heuristic or approximate boundaries.
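A minimal sketch of this bookkeeping, under stated assumptions: `serialize_reference`, `whitespace_offsets`, and `field_token_spans` are hypothetical names, the whitespace tokenizer is a stand-in for the model tokenizer's offset mapping, and author sub-spans are not tracked here. The tag layout follows the serialized example in this appendix.

```python
import re
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # half-open [start, end) character offsets
FIELD_ORDER = ["title", "authors", "venue", "year", "doi"]


def serialize_reference(ref: Dict[str, str]) -> Tuple[str, Dict[str, Span]]:
    """Serialize a reference into tagged text and record each field's character span."""
    parts: List[str] = []
    spans: Dict[str, Span] = {}
    cursor = 0
    for name in FIELD_ORDER:
        open_tag, close_tag = f"<{name.upper()}> ", f" </{name.upper()}>"
        start = cursor + len(open_tag)
        spans[name] = (start, start + len(ref[name]))
        line = open_tag + ref[name] + close_tag
        parts.append(line)
        cursor += len(line) + 1  # +1 for the newline that joins the lines
    return "\n".join(parts), spans


def whitespace_offsets(text: str) -> List[Span]:
    """Toy tokenizer: one token per whitespace-delimited chunk, with character offsets."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def field_token_spans(char_spans: Dict[str, Span], offsets: List[Span]) -> Dict[str, List[int]]:
    """Map each field's character span to the indices of the tokens that overlap it."""
    return {
        name: [i for i, (s, e) in enumerate(offsets) if s < end and e > start]
        for name, (start, end) in char_spans.items()
    }


reference = {
    "title": "Attention Is All You Need",
    "authors": "Ashish Vaswani | ...",
    "venue": "NeurIPS",
    "year": "2017",
    "doi": "10.48550/arXiv.1706.03762",
}
text, char_spans = serialize_reference(reference)
token_spans = field_token_spans(char_spans, whitespace_offsets(text))
```

With a subword tokenizer the same overlap test applies; only the offsets list changes.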
## Appendix C List of Computer Science Research Topics
The following 50 computer science research topics were used to prompt the models for citation generation.
1. Causal inference and treatment effect estimation
2. Bayesian optimization and probabilistic modeling
3. Reinforcement learning and policy optimization
4. Robustness and adversarial machine learning
5. Explainable AI and interpretability methods
6. Representation learning and self-supervised learning
7. Graph neural networks and graph learning
8. Federated learning and distributed ML
9. Fairness and bias mitigation in ML
10. ML evaluation, benchmarks, and reproducibility
11. Information extraction and relation extraction
12. Question answering and retrieval-augmented generation
13. Machine translation and multilingual NLP
14. Summarization and factual consistency
15. Dialogue systems and conversational agents
16. Prompting and instruction tuning methods
17. NLP robustness, safety, and alignment
18. Text generation evaluation and metrics
19. Knowledge grounding and entity linking
20. Long-context modeling and efficient attention
21. Image classification and representation learning
22. Object detection and instance segmentation
23. Vision transformers and efficient vision models
24. 3D vision and point cloud understanding
25. Visual question answering and vision-language models
26. Video understanding and temporal action recognition
27. Generative vision models and diffusion
28. Self-supervised learning for vision
29. Vision robustness and adversarial attacks
30. Medical imaging and computer-aided diagnosis
31. Distributed systems and consensus protocols
32. Cloud computing and serverless architectures
33. Storage systems and key-value stores
34. Operating systems scheduling and resource management
35. Compilers and code optimization
36. Systems performance modeling and profiling
37. Datacenter networking and traffic engineering
38. GPU systems and ML systems optimization
39. Fault tolerance and reliability engineering
40. Observability and telemetry systems
41. Network security and intrusion detection
42. Cryptography and secure computation
43. Privacy-preserving data analysis and differential privacy
44. Malware analysis and reverse engineering
45. Web security and authentication protocols
46. Database query optimization and indexing
47. Data integration and entity resolution
48. Human-computer interaction and usability studies
49. Program analysis and static/dynamic analysis
50. Approximation algorithms and randomized algorithms
## Appendix D Output Volume Analysis
To examine whether citation order within a prompt affects reliability, we analyze hallucination rate as a function of citation position across all models and generation volumes. As shown in Figure [8](https://arxiv.org/html/2604.18880#A4.F8), the first two citations exhibit notably lower hallucination rates, after which performance rapidly degrades and plateaus from position 3 onward. This suggests that models exhaust their most reliable parametric memory early, with subsequent references increasingly fabricated.

Figure 8: Hallucination rate by citation position, aggregated across all models and generation volumes. Each bar reflects the spread of hallucination rates observed across different experimental settings. Hallucination rate increases sharply from the second to the third citation position, then plateaus.
## Appendix E Replication on Mistral-Small-24B
Figure 9: Cross-field AUC heatmap on Mistral-Small-24B-Instruct-2501. In-field performance remains strong, but off-diagonal transfer is higher than in Qwen2.5-32B-Instruct, indicating weaker field-specific separation.

We repeated the cross-field probe transfer analysis on Mistral-Small-24B-Instruct-2501. The resulting heatmap in Figure [9](https://arxiv.org/html/2604.18880#A5.F9) shows that hallucination remains linearly decodable across all five bibliographic fields, with strong in-field AUC on the diagonal. However, compared with Qwen2.5-32B-Instruct, off-diagonal transfer is noticeably higher, indicating weaker field-specific separation and greater overlap among field representations. This suggests that the existence of decodable hallucination signals generalizes across model families, while the degree of field-specific modularity is model-dependent.
## Appendix F FH-Neuron Distribution
Table [4](https://arxiv.org/html/2604.18880#A6.T4) reports the number of positive-weight FH-neurons retained after stability selection for each bibliographic field, along with their distribution across layer terciles.

Table 4: FH-neuron count and layer distribution per field. Terciles: Early (layers 0–21), Mid (layers 22–42), Late (layers 43–63). Total candidates per field: 1,769,472.

| Field | FH-neurons | Sparsity | Early | Mid | Late |
|---|---|---|---|---|---|
| Title | 224 | 0.013% | 24.6% | 33.9% | 41.5% |
| Authors | 78 | 0.004% | 12.8% | 60.3% | 26.9% |
| Year | 129 | 0.007% | 16.3% | 37.2% | 46.5% |
| Venue | 51 | 0.003% | 49.0% | 19.6% | 31.4% |
| DOI | 30 | 0.002% | 66.7% | 23.3% | 10.0% |
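The sparsity column follows directly from the counts: each value is the FH-neuron count divided by the 1,769,472 candidates, expressed as a percentage. The short check below reproduces it; the variable names are ours and the rounding to three decimals is an assumption about how the table was formatted.

```python
TOTAL_CANDIDATES = 1_769_472  # candidate neurons per field, from the Table 4 caption
FH_COUNTS = {"Title": 224, "Authors": 78, "Year": 129, "Venue": 51, "DOI": 30}

# Sparsity as a percentage of all candidate neurons, rounded to three decimals.
sparsity_pct = {
    field: round(100.0 * count / TOTAL_CANDIDATES, 3)
    for field, count in FH_COUNTS.items()
}

# Total number of FH-neurons summed over the five fields.
total_fh = sum(FH_COUNTS.values())
```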
## Appendix G Statistical Analysis of Causal Intervention
This appendix details the statistical procedures underlying the causal intervention results reported in Section [4.3](https://arxiv.org/html/2604.18880#S4.SS3). All tests treat the five bibliographic fields (Title, Authors, Year, Venue, DOI) as paired observations, with accuracy on the *targeted* field, i.e., the diagonal entry of Table [2](https://arxiv.org/html/2604.18880#S4.T2), as the dependent variable.
#### Test 1: Enhancement effect.
We compare the targeted-field accuracy under Enhancement ($\beta = 4.0$) against Baseline for each field.

| Field | Baseline | Enhance $\beta = 4.0$ | $\Delta$ (pp) |
|---|---|---|---|
| Title | 76.3% | 4.7% | $-71.6$ |
| Authors | 17.5% | 8.5% | $-9.0$ |
| Year | 48.5% | 27.7% | $-20.8$ |
| Venue | 41.2% | 31.6% | $-9.6$ |
| DOI | 54.6% | 38.1% | $-16.5$ |

All five differences are negative ($H_0$: enhancement does not decrease accuracy; $H_1$: enhancement decreases accuracy). The one-sided Wilcoxon signed-rank test yields $p = 0.031$; we reject $H_0$ at $\alpha = 0.05$. The mean degradation is $-25.5$ percentage points.
#### Test 2: Random ablation effect.
We compare Random Control accuracy against Baseline to characterize the effect of ablating an equal number of randomly selected neurons at $\beta = 0$.

| Field | Baseline | Random | $\Delta$ (pp) |
|---|---|---|---|
| Title | 76.3% | 59.9% | $-16.4$ |
| Authors | 17.5% | 15.3% | $-2.2$ |
| Year | 48.5% | 40.8% | $-7.7$ |
| Venue | 41.2% | 35.2% | $-6.0$ |
| DOI | 54.6% | 49.1% | $-5.5$ |

All five differences are again negative ($p = 0.031$). Crucially, the degradation appears in every field, with no field showing improvement, indicating non-specific disruption of generation capacity rather than targeted interference with hallucination circuitry.
#### Test 3: Specificity of FH-neuron suppression.
We compare FH-neuron Suppression at $\beta = 0$ against Random Control on each field's targeted accuracy to determine whether FH-neuron suppression yields improvements beyond what random ablation produces.

| Field | FH-Suppress $\beta = 0$ | Random | $\Delta$ (pp) |
|---|---|---|---|
| Title | 82.8% | 59.9% | $+22.9$ |
| Authors | 23.7% | 15.3% | $+8.4$ |
| Year | 46.5% | 40.8% | $+5.7$ |
| Venue | 35.1% | 35.2% | $-0.1$ |
| DOI | 55.2% | 49.1% | $+6.1$ |

Four of five differences are positive; Venue is essentially tied ($\Delta = -0.1$ pp). Excluding the tied pair yields an effective sample size of $n = 4$, for which the one-sided Wilcoxon signed-rank test gives $p = 0.062$. Although this narrowly exceeds the conventional $\alpha = 0.05$ threshold due to the small sample size, the effect is consistent in direction and substantial in magnitude, with a mean improvement of $+8.6$ percentage points across all five fields. When considered alongside the strong enhancement result in Test 1 and the non-specific degradation pattern of random ablation in Test 2, the combined evidence supports the conclusion that FH-neuron suppression produces field-selective improvements that are qualitatively distinct from random neuron removal.
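With only five paired observations, these p-values can be reproduced exactly by enumerating all $2^n$ sign patterns. The sketch below is an illustration under stated assumptions, not the authors' analysis code: the function name is ours, and it assumes an exact (not normal-approximation) one-sided test with no zero or tied absolute differences, which holds for the three tables in this appendix.

```python
from itertools import product
from typing import List


def wilcoxon_one_sided_exact(diffs: List[float], alternative: str) -> float:
    """Exact one-sided Wilcoxon signed-rank p-value by enumerating all sign patterns.

    alternative="less" tests for negative shifts, "greater" for positive shifts.
    Assumes no zero differences and no ties among absolute values.
    """
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    positive = alternative == "greater"
    # Sum of ranks whose sign contradicts the alternative hypothesis.
    observed = sum(r for r, d in zip(ranks, diffs) if (d < 0) == positive)
    hits = 0
    for signs in product([0, 1], repeat=n):
        opposing = sum(r for r, s in zip(ranks, signs) if s)
        hits += opposing <= observed
    return hits / 2 ** n


# Differences in percentage points, copied from the three tables in this appendix.
test1 = wilcoxon_one_sided_exact([-71.6, -9.0, -20.8, -9.6, -16.5], "less")
test2 = wilcoxon_one_sided_exact([-16.4, -2.2, -7.7, -6.0, -5.5], "less")
test3_all_five = wilcoxon_one_sided_exact([22.9, 8.4, 5.7, -0.1, 6.1], "greater")
test3_no_venue = wilcoxon_one_sided_exact([22.9, 8.4, 5.7, 6.1], "greater")
```

Under these assumptions, Tests 1 and 2 give 1/32 ≈ 0.031, and Test 3 gives 0.0625 whether Venue is kept as the lowest-ranked negative difference (2/32) or dropped (1/16).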
## Appendix H Per-Topic Accuracy Distribution
Figure 10: Per-topic accuracy on Qwen2.5-32B-Instruct across the full 50-topic set, shown for the first $N$ references of each prompt ($N \in \{5, 10, 15\}$). Dashed lines mark the per-panel mean reported in Table [1](https://arxiv.org/html/2604.18880#S1.T1). Accuracy is relatively stable across $N$ but varies substantially across topics, which explains why the baseline in Table [2](https://arxiv.org/html/2604.18880#S4.T2) (evaluated on a sampled topic subset) can differ from the 50-topic average.

To clarify the baseline discrepancy between Table [1](https://arxiv.org/html/2604.18880#S1.T1) and Table [2](https://arxiv.org/html/2604.18880#S4.T2), we report per-topic accuracy across the full 50-topic set in Figure [10](https://arxiv.org/html/2604.18880#A8.F10). Accuracy varies substantially across topics, reflecting uneven coverage of research areas in the model's parametric memory. Title accuracy in particular ranges from near 0% on some topics to over 90% on others, with comparable spread observed for Year, Venue, and DOI. Because our intervention experiments are conducted on a randomly sampled subset of topics for computational feasibility, the baseline accuracy observed in Table [2](https://arxiv.org/html/2604.18880#S4.T2) can differ from the 50-topic average in Table [1](https://arxiv.org/html/2604.18880#S1.T1). Within Table [2](https://arxiv.org/html/2604.18880#S4.T2) itself, however, all intervention conditions are evaluated on the same topic sample, so the reported differences across conditions remain internally consistent and unaffected by topic composition.
## Appendix I Statistical Analysis of Layer-wise Probe Performance
This appendix reports the statistical tests underlying the layer-wise probe analysis in Section [3.4](https://arxiv.org/html/2604.18880#S3.SS4).
#### Spearman rank correlation.
Table [5](https://arxiv.org/html/2604.18880#A9.T5) reports the Spearman correlation between layer index and probe AUC for each field. The null hypothesis is that AUC has no monotonic association with layer depth ($\rho = 0$).

Table 5: Spearman correlation between layer index and probe AUC per bibliographic field.

| Field | $\rho$ | $p$-value | Trend |
|---|---|---|---|
| DOI | $+0.763$ | $3.87 \times 10^{-7}$ | Increasing |
| Authors | $+0.600$ | $2.83 \times 10^{-4}$ | Increasing |
| Title | $+0.554$ | $9.97 \times 10^{-4}$ | Increasing |
| Venue | $+0.371$ | $3.66 \times 10^{-2}$ | Increasing |
| Year | $-0.516$ | $2.52 \times 10^{-3}$ | Decreasing |

#### Pairwise comparison of trends.
Table [6](https://arxiv.org/html/2604.18880#A9.T6) reports Fisher $z$-test $p$-values comparing the Spearman $\rho$ of each field pair. The null hypothesis is $\rho_i = \rho_j$.

Table 6: Fisher $z$-test $p$-values for pairwise comparison of layer–AUC trends. Significant results ($p < 0.05$) in bold.

| | Title | Authors | Year | Venue | DOI |
|---|---|---|---|---|---|
| Title | — | 0.793 | **<0.001** | 0.371 | 0.150 |
| Authors | | — | **<0.001** | 0.248 | 0.238 |
| Year | | | — | **<0.001** | **<0.001** |
| Venue | | | | — | **0.020** |
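The pairwise p-values can be approximated with the standard Fisher $z$ comparison of two independent correlations. In the sketch below, `fisher_z_pvalue` is our name, the independence of the two correlations is assumed, and `N_LAYERS_PROBED = 32` is an assumption, since the number of probed layers is not stated in this appendix. With that value, the function returns roughly 0.79 for Title vs. Authors, in line with Table 6.

```python
import math


def fisher_z_pvalue(rho_i: float, rho_j: float, n_i: int, n_j: int) -> float:
    """Two-sided p-value for H0: rho_i == rho_j using Fisher's z-transform."""
    z_i, z_j = math.atanh(rho_i), math.atanh(rho_j)
    se = math.sqrt(1.0 / (n_i - 3) + 1.0 / (n_j - 3))
    z = (z_i - z_j) / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal.
    return math.erfc(abs(z) / math.sqrt(2.0))


N_LAYERS_PROBED = 32  # assumption: not reported in this appendix
RHO = {"Title": 0.554, "Authors": 0.600, "Year": -0.516, "Venue": 0.371, "DOI": 0.763}

p_title_authors = fisher_z_pvalue(RHO["Title"], RHO["Authors"], N_LAYERS_PROBED, N_LAYERS_PROBED)
p_year_doi = fisher_z_pvalue(RHO["Year"], RHO["DOI"], N_LAYERS_PROBED, N_LAYERS_PROBED)
```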
#### Global permutation test.
To test whether the five fields share the same layer–AUC relationship, we use the variance of the five Spearman $\rho$ values as a test statistic and generate a null distribution by permuting field labels across all (layer, AUC) pairs (10,000 permutations). The observed variance (0.205) exceeds all permuted values ($p < 0.0001$), confirming that the layer-wise profiles differ across fields.
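The procedure can be sketched in plain Python as below. The helper names, the add-one p-value correction, the reduced permutation count, and the toy (layer, AUC) data are all assumptions made for illustration; only `TABLE5_RHO` is taken from the paper, and its population variance reproduces the reported statistic of 0.205.

```python
import math
import random
import statistics
from typing import List, Tuple

Pair = Tuple[float, float]  # (layer index, probe AUC)


def average_ranks(values: List[float]) -> List[float]:
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def spearman(pairs: List[Pair]) -> float:
    """Spearman rho as the Pearson correlation of the two rank vectors."""
    rx = average_ranks([x for x, _ in pairs])
    ry = average_ranks([y for _, y in pairs])
    mx, my = statistics.fmean(rx), statistics.fmean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    denom = math.sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))
    return cov / denom if denom else 0.0


def rho_variance(groups: List[List[Pair]]) -> float:
    """Population variance of the per-field Spearman rho values."""
    return statistics.pvariance([spearman(g) for g in groups])


def permutation_pvalue(groups: List[List[Pair]], n_perm: int = 200, seed: int = 0) -> float:
    """Shuffle field labels across pooled pairs; add-one correction is an assumption."""
    rng = random.Random(seed)
    observed = rho_variance(groups)
    pooled = [pair for g in groups for pair in g]
    sizes = [len(g) for g in groups]
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        exceed += rho_variance(shuffled) >= observed
    return (exceed + 1) / (n_perm + 1)


# Observed statistic recomputed from the rho values in Table 5.
TABLE5_RHO = [0.763, 0.600, 0.554, 0.371, -0.516]
observed_variance = statistics.pvariance(TABLE5_RHO)

# Toy data (not the paper's): four rising fields and one falling field over 32 layers.
toy_groups = [[(float(layer), 0.5 + 0.01 * layer) for layer in range(32)] for _ in range(4)]
toy_groups.append([(float(layer), 0.9 - 0.01 * layer) for layer in range(32)])
toy_p = permutation_pvalue(toy_groups)
```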
#### Bootstrap confidence intervals for peak layer.
Table [7](https://arxiv.org/html/2604.18880#A9.T7) reports bootstrap 95% confidence intervals for the peak-AUC layer per field (10,000 resamples with Gaussian noise $\sigma = 0.001$ to break ties).

Table 7: Bootstrap 95% CI for peak layer per field.

| Field | Observed Peak | Median | 95% CI |
|---|---|---|---|
| Year | L2 | L2 | $[2,\ 2]$ |
| Authors | L46 | L46 | $[44,\ 46]$ |
| DOI | L48 | L48 | $[46,\ 48]$ |
| Title | L64 | L64 | $[62,\ 64]$ |
| Venue | L4 | L36 | $[4,\ 48]$ |
## Appendix J Probe Training Details
We adopt a linear probe because it tests a stronger claim than nonlinear alternatives: that hallucination status is not merely encoded in the hidden states, but is linearly separable in representation space. This follows the methodology established in prior probing work \[[11](https://arxiv.org/html/2604.18880#bib.bib5), [1](https://arxiv.org/html/2604.18880#bib.bib11)\], where linear separability is treated as evidence that a concept is represented as a direction rather than a nonlinear manifold. Within each train/test partition, examples are downsampled to a 1:1 positive-to-negative ratio to address class imbalance, ensuring that probe accuracy is not inflated by majority-class prediction. For cross-field evaluation, a probe trained on field $i$ is applied to the hidden states extracted for field $j$, and performance is measured by AUC. Concretely, a probe trained to detect author hallucination is evaluated on the hidden states of title, venue, year, and DOI spans. If cross-field AUC remains high, the fields share a common hallucination representation; if it drops to near chance (AUC $\approx 0.5$), each field's hallucination signal occupies a distinct subspace, confirming field-specific rather than shared encoding.
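The downsampling and cross-field evaluation steps can be sketched as follows. This is a toy illustration under stated assumptions: the helper names are ours, the probe weights are hand-set rather than trained, the two-dimensional hidden states are invented, and label 1 stands for a hallucinated field.

```python
import random
from typing import List, Sequence


def balanced_downsample(labels: Sequence[int], seed: int = 0) -> List[int]:
    """Return sorted example indices with a 1:1 positive-to-negative ratio."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    k = min(len(pos), len(neg))
    return sorted(rng.sample(pos, k) + rng.sample(neg, k))


def probe_scores(weights: Sequence[float], bias: float, states: Sequence[Sequence[float]]) -> List[float]:
    """Linear probe logits: w . h + b for each hidden state h."""
    return [sum(w * h for w, h in zip(weights, state)) + bias for state in states]


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as P(positive score > negative score), counting ties as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical probe for field i: it reads only the first hidden dimension.
W_FIELD_I, B_FIELD_I = [1.0, 0.0], 0.0
LABELS = [1, 1, 0, 0]
STATES_FIELD_I = [[1.0, 0.0], [2.0, 0.0], [-1.0, 0.0], [-2.0, 0.0]]  # signal in dim 0
STATES_FIELD_J = [[0.0, 1.0], [0.0, 1.0], [0.0, -1.0], [0.0, -1.0]]  # signal in dim 1

in_field_auc = roc_auc(LABELS, probe_scores(W_FIELD_I, B_FIELD_I, STATES_FIELD_I))
cross_field_auc = roc_auc(LABELS, probe_scores(W_FIELD_I, B_FIELD_I, STATES_FIELD_J))
kept = balanced_downsample([1, 0, 0, 0, 1, 0])
```

In this toy setup the two fields encode their labels along orthogonal directions, so the probe separates its own field perfectly and falls to chance on the other, which is the pattern the cross-field analysis looks for.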