Hallucination as Trajectory Commitment: Causal Evidence for Asymmetric Attractor Dynamics in Transformer Generation
Summary
This paper presents causal evidence that hallucination in autoregressive language models results from early trajectory commitment governed by asymmetric attractor dynamics, using same-prompt bifurcation and activation patching experiments on Qwen2.5-1.5B to show that hallucinated trajectories diverge at the first token and exhibit strong causal asymmetry across model layers.
# Causal Evidence for Asymmetric Attractor Dynamics in Transformer Generation

Source: https://arxiv.org/html/2604.15400

## Hallucination as Trajectory Commitment: Causal Evidence for Asymmetric Attractor Dynamics in Transformer Generation

Gokturk Aytug Akarlar, Chimera Research Initiative, Istanbul, Turkey. Corresponding author: [email protected]. Chimera Research Initiative is an independent research effort exploring causal and neuro-symbolic approaches to AI systems. (April 2026)

###### Abstract

We present causal evidence that hallucination in autoregressive language models is an *early trajectory commitment* governed by asymmetric attractor dynamics. Using same-prompt bifurcation, in which we repeatedly sample identical inputs to observe spontaneous divergence, we isolate trajectory dynamics from prompt-level confounds. On Qwen2.5-1.5B across 61 prompts spanning six categories, 27 prompts (44.3%) bifurcate, with factual and hallucinated trajectories diverging at the first generated token (KL = 0 at step 0, KL > 1.0 at step 1). Activation patching across 28 layers reveals a pronounced causal asymmetry: injecting a hallucinated activation into a correct trajectory corrupts the output in 87.5% of trials (layer 20), while the reverse recovers only 33.3% (layer 24); both exceed the 10.4% baseline (p = 0.025) and the 12.5% random-patch control. Window patching shows that correction requires sustained multi-step intervention, whereas corruption needs only a single perturbation. Probing the prompt encoding itself, step-0 residual states predict per-prompt hallucination rate at Pearson r = 0.776 at layer 15 (p < 0.001 against a 1000-permutation null); unsupervised clustering identifies five regime-like groups (η² = 0.55) whose saddle-adjacent cluster concentrates 12 of the 13 bifurcating false-premise prompts, indicating that the basin structure is organized around regime commitments fixed at prompt encoding. These findings characterize hallucination as a locally stable attractor basin: entry is probabilistic and rapid, exit demands coordinated intervention across layers and steps, and the relevant basins are selected by clusterable regimes already discernible at step 0.

## 1 Introduction

Large language models hallucinate: they generate plausible but factually incorrect text with high confidence [1](https://arxiv.org/html/2604.15400#bib.bib1), [2](https://arxiv.org/html/2604.15400#bib.bib2). Despite extensive empirical study, the internal mechanisms governing hallucination remain poorly characterized. Existing work has established that hallucination correlates with identifiable features in model internals: probes over hidden states can detect hallucination above chance [3](https://arxiv.org/html/2604.15400#bib.bib3), [4](https://arxiv.org/html/2604.15400#bib.bib4), entropy-based signals precede hallucinated outputs [5](https://arxiv.org/html/2604.15400#bib.bib5), and representation engineering can partially steer models toward truthfulness [4](https://arxiv.org/html/2604.15400#bib.bib4), [6](https://arxiv.org/html/2604.15400#bib.bib6). Yet correlation does not establish mechanism. A central question persists: *when and where does a model commit to a hallucinated trajectory, and is this commitment causally reversible?*

We address this question through two methodological contributions.

#### Same-prompt bifurcation.

Rather than comparing different prompts that elicit correct versus hallucinated outputs, which conflates prompt-level semantics with trajectory-level dynamics, we sample the *same prompt* repeatedly under non-zero temperature and identify prompts where the model produces *both* factual and hallucinated completions. This isolates trajectory-level divergence: identical initial states yield different outcomes solely through the stochastic sampling path.

#### Symmetric causal patching.

For each bifurcating prompt, we collect correct and hallucinated runs with full hidden-state caches. We then perform bidirectional activation patching: replacing a hallucinated run's activation with a correct run's activation (and vice versa) at each layer and generation step, with three control conditions (random-prompt patch, wrong-to-wrong patch, unpatched baseline).

Our findings reveal a sharp asymmetry. Corrupting a correct trajectory via single-layer activation replacement succeeds in 87.5% of trials, while correcting a hallucinated trajectory succeeds in only 33.3%, and correction requires sustained multi-step intervention to reach even this rate. We interpret this asymmetry through the lens of dynamical systems: hallucination operates as a locally stable attractor basin in the residual-stream state space, characterized by easy entry and difficult escape (Figure 1).

**Figure 1:** Conceptual overview. From a shared initial state $h_0$, stochastic token sampling commits the trajectory to either a correct (green) or hallucinated (red) basin. Activation patching reveals that corruption (crossing into the hallucination basin) requires only a single-point perturbation, while correction (escaping the hallucination basin) requires sustained multi-step intervention, the hallmark of an asymmetric attractor landscape.

## 2 Related Work

#### Hallucination detection via internals.

Li et al. [4](https://arxiv.org/html/2604.15400#bib.bib4) and Azaria & Mitchell [3](https://arxiv.org/html/2604.15400#bib.bib3) demonstrate that linear probes over hidden states can detect hallucination. Burns et al. [12](https://arxiv.org/html/2604.15400#bib.bib12) find truth-correlated directions via unsupervised methods. These establish that hallucination leaves detectable traces but do not address causality.

#### Representation engineering and steering.

Zou et al. [6](https://arxiv.org/html/2604.15400#bib.bib6) and Li et al. [4](https://arxiv.org/html/2604.15400#bib.bib4) show that adding learned steering vectors to activations can shift model behavior toward truthfulness. The concurrent work of Cherukuri & Varshney [7](https://arxiv.org/html/2604.15400#bib.bib7) frames hallucination through basin geometry and proposes geometry-aware steering. Our work differs in methodology: we employ same-prompt bifurcation and classical activation patching with controls, providing causal rather than correlational evidence.

#### Activation patching and causal tracing.

Meng et al. [8](https://arxiv.org/html/2604.15400#bib.bib8) introduce activation patching for localizing factual recall. Heimersheim & Nanda [9](https://arxiv.org/html/2604.15400#bib.bib9) systematize its interpretation. We extend this methodology to hallucination trajectories, with the novel contribution of measuring *directional asymmetry* between corruption and correction.

#### Trajectory analysis in generation.

Suresh et al. [10](https://arxiv.org/html/2604.15400#bib.bib10) show that transformers activate coherent but input-insensitive features under uncertainty. Naparstek [11](https://arxiv.org/html/2604.15400#bib.bib11) studies commitment timing via projected autoregression in continuous state spaces. We provide the first same-prompt bifurcation analysis demonstrating that identical initial states diverge at the first generation step.
## 3 Method

### 3.1 Experimental Setup

We conduct all experiments on Qwen2.5-1.5B [14](https://arxiv.org/html/2604.15400#bib.bib14), a 28-layer transformer with $d_{\text{model}} = 1536$, using TransformerLens [13](https://arxiv.org/html/2604.15400#bib.bib13) on Apple Silicon (MPS backend). Activations are extracted from the residual stream post-attention at each layer ($h_l^{(t)}$ denotes layer $l$ at generation step $t$).

### 3.2 Prompt Dataset

We construct a dataset of 61 prompts across six categories designed to elicit hallucination through distinct mechanisms:

- **Factual** (14 prompts): Questions with definite correct answers (e.g., "The capital of Myanmar is a city called").
- **False premise** (14 prompts): Statements embedding factual errors (e.g., "Since the Amazon River flows through Europe,").
- **Confabulation** (22 prompts): References to fictitious entities (e.g., "The Krasnov Effect in quantum mechanics describes").
- **Leading** (3 prompts): Common misconceptions posed as questions.
- **Multi-hop** (4 prompts): Questions requiring chained reasoning.
- **Math** (4 prompts): Arithmetic with verifiable answers.

Each prompt is annotated with ground-truth indicators (for correct classification) and wrong-answer indicators (for hallucination classification).

### 3.3 Phase 1: Bifurcation Discovery

For each prompt $x$, we generate $N = 20$ completions using temperature sampling at $\tau = 0.7$. Each completion is classified as Correct, Hallucination, or Other based on substring matching against the ground-truth and wrong-answer indicators.

###### Definition 1 (Bifurcating prompt).

A prompt $x$ is *bifurcating* if at least 2 of its $N$ completions are classified as Correct and at least 2 as Hallucination.

Bifurcating prompts are the experimental targets: they demonstrate that the model occupies a decision boundary where identical inputs yield divergent outputs, with the outcome determined by the sampling trajectory rather than the prompt encoding. A minimal code sketch of this discovery loop follows.
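The sketch below illustrates the Phase 1 protocol under stated assumptions: `generate_completion` is a hypothetical helper that draws one temperature-sampled completion from the model, the indicator lists are the per-prompt annotations of Section 3.2, and resolving completions that match both indicator sets in favor of Correct is our convention, not one specified in the paper.

```python
# Phase 1 sketch: classify sampled completions and test Definition 1.
# `generate_completion` is a hypothetical sampling helper (not from the paper).

def classify(completion: str, correct_inds: list[str], wrong_inds: list[str]) -> str:
    """Label a completion by substring matching against the indicator lists."""
    text = completion.lower()
    if any(ind.lower() in text for ind in correct_inds):
        return "correct"       # ties resolved toward Correct (our convention)
    if any(ind.lower() in text for ind in wrong_inds):
        return "hallucination"
    return "other"

def is_bifurcating(prompt: str, correct_inds: list[str], wrong_inds: list[str],
                   n_samples: int = 20, temperature: float = 0.7) -> bool:
    """Definition 1: >= 2 Correct AND >= 2 Hallucination among N samples."""
    labels = [
        classify(generate_completion(prompt, temperature=temperature),
                 correct_inds, wrong_inds)
        for _ in range(n_samples)
    ]
    return labels.count("correct") >= 2 and labels.count("hallucination") >= 2
```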
### 3.4 Phase 2: Trajectory Divergence Analysis

For each bifurcating prompt, we collect $K = 6$ cached runs per class (correct and hallucinated), storing the full residual stream $\{h_l^{(t)}\}_{l=0}^{L-1}$ at every layer for each generation step $t$.

#### Step-wise KL divergence.

At each step $t$, we compute the KL divergence between the mean output distributions of correct and hallucinated runs:

$$D_{\mathrm{KL}}^{(t)} = D_{\mathrm{KL}}\!\left(\bar{P}_{\text{hall}}^{(t)} \,\big\|\, \bar{P}_{\text{corr}}^{(t)}\right),$$

where $\bar{P}_{\text{hall}}^{(t)} = \frac{1}{K}\sum_{k=1}^{K} P_k^{(t)}$ is the mean softmax distribution over hallucinated runs at step $t$. We define the *divergence onset* as the first step where $D_{\mathrm{KL}}^{(t)} > 0.5$.

#### Layer-wise separation.

At each (layer, step), we compute Cohen's d between the hidden states of correct and hallucinated runs:

$$d_{l,t} = \frac{\big\|\bar{h}_{l,t}^{\text{hall}} - \bar{h}_{l,t}^{\text{corr}}\big\|_2}{s_{l,t}^{\text{pooled}}},$$

where $s_{l,t}^{\text{pooled}}$ is the pooled standard deviation across the two groups. This yields a separation heatmap over the (layer × step) grid; both metrics are sketched below.
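A minimal NumPy sketch of the two Phase 2 metrics, under stated assumptions: `p_hall` and `p_corr` stack the per-run softmax distributions at one step, shape `(K, vocab)`; `h_hall` and `h_corr` stack the per-run hidden states at one (layer, step), shape `(K, d_model)`; and taking the pooled standard deviation as a scalar over all components is one reasonable reading of $s^{\text{pooled}}$, which the paper does not pin down.

```python
import numpy as np

def stepwise_kl(p_hall: np.ndarray, p_corr: np.ndarray, eps: float = 1e-12) -> float:
    """D_KL(mean hallucinated dist || mean correct dist) at one generation step."""
    p = p_hall.mean(axis=0)  # average over K hallucinated runs -> (vocab,)
    q = p_corr.mean(axis=0)  # average over K correct runs
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def divergence_onset(kl_per_step, threshold: float = 0.5):
    """First step t with D_KL^(t) above the onset threshold (None if never)."""
    return next((t for t, kl in enumerate(kl_per_step) if kl > threshold), None)

def cohens_d(h_hall: np.ndarray, h_corr: np.ndarray) -> float:
    """Separation d_{l,t}: distance between group means over a pooled std."""
    gap = np.linalg.norm(h_hall.mean(axis=0) - h_corr.mean(axis=0))
    n1, n2 = len(h_hall), len(h_corr)
    pooled = np.sqrt(((n1 - 1) * h_hall.std(ddof=1) ** 2 +
                      (n2 - 1) * h_corr.std(ddof=1) ** 2) / (n1 + n2 - 2))
    return float(gap / pooled)
```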
### 3.5 Phase 3: Causal Activation Patching

We perform activation patching [8](https://arxiv.org/html/2604.15400#bib.bib8) to establish causal relationships between hidden-state values and generation outcomes.

###### Definition 2 (Activation patch).

Given a *target run* generating from prompt $x$ and a *source run* of the same prompt, an activation patch at (layer $l$, step $t$) replaces the target run's residual stream activation with the source run's:

$$h_l^{(t),\,\text{target}} \leftarrow h_l^{(t),\,\text{source}}.$$

Generation then continues autoregressively from step $t+1$, with the patched state propagated through all downstream layers. We implement this via TransformerLens forward hooks, patching only the last token position at the specified step.

#### Experimental conditions.

We test four patching configurations:

1. **H → C (correction)**: Target = hallucinated run, source = correct run. Measures whether injecting a correct activation redirects a hallucinated trajectory.
2. **C → H (corruption)**: Target = correct run, source = hallucinated run. Measures whether injecting a hallucinated activation derails a correct trajectory.
3. **Random clean control**: Target = hallucinated run, source = correct run from a *different prompt*. Tests whether any correct-looking activation suffices, or whether the effect is prompt-specific.
4. **Wrong-to-wrong control**: Target = hallucinated run, source = a *different* hallucinated run of the same prompt. Tests whether the patching effect is due to injecting any different state versus a specifically correct state.

We additionally measure an unpatched baseline: the natural correct rate when simply resampling the prompt without intervention.

#### Sweep protocol.

We perform three sweeps:

- **Layer sweep**: Fix step = 1, vary layer $l \in \{0, \ldots, 27\}$.
- **Step sweep**: Fix layer $l^*$ (the best layer from the layer sweep), vary step $t \in \{0, \ldots, 4\}$.
- **Window sweep**: Fix layer $l^*$, patch steps {1}, {1, 2}, {1, 2, 3}, {1, 2, 3, 4}.

Each condition is evaluated over 8 bifurcating prompts × 3 trials = 24 trials per cell.

#### Metrics.

For each patching condition, we report:

- **Flip rate**: fraction of trials where the output classification changes to the target class (correct for H → C, hallucinated for C → H).
- **Abstain rate**: fraction producing Other (neither clearly correct nor hallucinated).

A hook-based sketch of a single patched generation follows.
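The following is a minimal TransformerLens sketch of one patched generation, not the authors' exact implementation. The cache layout `cached_source[layer][step]` (one `(d_model,)` tensor per patch point) is an assumption, as is the hook point: `blocks.{l}.hook_resid_post` is the end-of-block residual hook, whereas the paper's "post-attention" extraction could instead map to `hook_resid_mid`. Because this loop re-runs the full sequence each step rather than using a KV cache, the patch is applied on the single forward pass at the designated step and influences later steps through the sampled token.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("Qwen/Qwen2.5-1.5B")

def generate_with_patch(prompt: str, cached_source, layer: int, patch_step: int,
                        max_new_tokens: int = 30, temperature: float = 0.7) -> str:
    """Generate from `prompt`, patching the residual stream at (layer, patch_step)."""
    tokens = model.to_tokens(prompt)  # shape (1, prompt_len)
    hook_name = f"blocks.{layer}.hook_resid_post"

    def patch_hook(resid, hook):
        # Overwrite the last token position with the source run's activation;
        # downstream layers process the patched state on this forward pass.
        resid[:, -1, :] = cached_source[layer][patch_step].to(resid.device)
        return resid

    for step in range(max_new_tokens):
        fwd_hooks = [(hook_name, patch_hook)] if step == patch_step else []
        with torch.no_grad():
            logits = model.run_with_hooks(tokens, fwd_hooks=fwd_hooks)
        probs = torch.softmax(logits[0, -1] / temperature, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)       # shape (1,)
        tokens = torch.cat([tokens, next_tok.unsqueeze(0)], dim=-1)

    return model.to_string(tokens[0])
```

The C → H condition simply swaps which run supplies `cached_source`; the random-clean and wrong-to-wrong controls change only where the source cache comes from.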
## 4 Results

### 4.1 Bifurcation Discovery

Of the 61 prompts, 27 (44.3%) exhibit genuine bifurcation. The distribution varies markedly by category (Table 1).

**Table 1:** Bifurcation rates by hallucination category. Bifurcating prompts produce both correct and hallucinated outputs from identical inputs under temperature sampling (τ = 0.7, N = 20).

**Figure 2:** Per-prompt correct rate (above axis) and hallucination rate (below axis) for all 61 prompts, colored by category. Stars (★) mark bifurcating prompts. False-premise prompts (red) are almost universally bifurcating; confabulation prompts (purple) tend toward deterministic hallucination.

Three observations are noteworthy. First, *false premise* prompts are almost universally bifurcating: the model is genuinely uncertain whether to accept or reject the embedded falsehood. Second, *confabulation* prompts are predominantly deterministic; the model either confidently fabricates (9/22 always hallucinate) or occasionally self-corrects. This suggests that confabulatory hallucination reflects a different internal regime than false-premise hallucination. Third, an additional 6 prompts are *near-bifurcating* (producing exactly 1 correct or 1 hallucinated sample out of 20), indicating that bifurcation is not a binary property but lies on a continuum: the model's proximity to the decision boundary varies smoothly across prompts (Figure 2).

### 4.2 Step-wise Divergence

Across all 27 bifurcating prompts, the KL divergence between correct and hallucinated output distributions follows a characteristic pattern:

$$D_{\mathrm{KL}}^{(0)} = 0.00, \qquad D_{\mathrm{KL}}^{(1)} \in [0.12,\ 19.25], \qquad \text{mean onset} = 1.1.$$

The zero KL at step 0 is a methodological validation: identical prompts produce identical logits (before sampling), confirming that any subsequent divergence is trajectory-driven rather than prompt-driven.

**Figure 3:** Step-wise KL divergence across all 24 bifurcating prompts with trajectory data. Thin gray lines: individual prompts. Bold line: median. Shaded region: interquartile range. All prompts share the same pattern: $D_{\mathrm{KL}}^{(0)} = 0$ (identical logits at step 0), followed by divergence at step 1.