Hallucination as Trajectory Commitment: Causal Evidence for Asymmetric Attractor Dynamics in Transformer Generation
Summary
This paper presents causal evidence that hallucination in autoregressive language models results from early trajectory commitment governed by asymmetric attractor dynamics, using same-prompt bifurcation and activation patching experiments on Qwen2.5-1.5B to show that hallucinated trajectories diverge at the first token and exhibit strong causal asymmetry across model layers.
# Causal Evidence for Asymmetric Attractor Dynamics in Transformer Generation

Source: https://arxiv.org/html/2604.15400

## Hallucination as Trajectory Commitment: Causal Evidence for Asymmetric Attractor Dynamics in Transformer Generation

Gokturk Aytug Akarlar, Chimera Research Initiative, Istanbul, Turkey. Corresponding author: [email protected]. Chimera Research Initiative is an independent research effort exploring causal and neuro-symbolic approaches to AI systems. (April 2026)

###### Abstract

We present causal evidence that hallucination in autoregressive language models is an *early trajectory commitment* governed by asymmetric attractor dynamics. Using same-prompt bifurcation, in which we repeatedly sample identical inputs to observe spontaneous divergence, we isolate trajectory dynamics from prompt-level confounds. On Qwen2.5-1.5B across 61 prompts spanning six categories, 27 prompts (44.3%) bifurcate, with factual and hallucinated trajectories diverging at the first generated token (KL = 0 at step 0, KL > 1.0 at step 1). Activation patching across 28 layers reveals a pronounced causal asymmetry: injecting a hallucinated activation into a correct trajectory corrupts the output in 87.5% of trials (layer 20), while the reverse recovers only 33.3% (layer 24); both exceed the 10.4% baseline (p = 0.025) and the 12.5% random-patch control. Window patching shows that correction requires sustained multi-step intervention, whereas corruption needs only a single perturbation. Probing the prompt encoding itself, step-0 residual states predict per-prompt hallucination rate at Pearson r = 0.776 at layer 15 (p < 0.001 against a 1000-permutation null); unsupervised clustering identifies five regime-like groups (η² = 0.55) whose saddle-adjacent cluster concentrates 12 of the 13 bifurcating false-premise prompts, indicating that the basin structure is organized around regime commitments fixed at prompt encoding. These findings characterize hallucination as a locally stable attractor basin: entry is probabilistic and rapid, exit demands coordinated intervention across layers and steps, and the relevant basins are selected by clusterable regimes already discernible at step 0.

## 1 Introduction

Large language models hallucinate: they generate plausible but factually incorrect text with high confidence [1](https://arxiv.org/html/2604.15400#bib.bib1), [2](https://arxiv.org/html/2604.15400#bib.bib2). Despite extensive empirical study, the internal mechanisms governing hallucination remain poorly characterized. Existing work has established that hallucination correlates with identifiable features in model internals: probes over hidden states can detect hallucination above chance [3](https://arxiv.org/html/2604.15400#bib.bib3), [4](https://arxiv.org/html/2604.15400#bib.bib4), entropy-based signals precede hallucinated outputs [5](https://arxiv.org/html/2604.15400#bib.bib5), and representation engineering can partially steer models toward truthfulness [4](https://arxiv.org/html/2604.15400#bib.bib4), [6](https://arxiv.org/html/2604.15400#bib.bib6). Yet correlation does not establish mechanism. A central question persists: *when and where does a model commit to a hallucinated trajectory, and is this commitment causally reversible?*

We address this question through two methodological contributions.

#### Same-prompt bifurcation.

Rather than comparing different prompts that elicit correct versus hallucinated outputs, which conflates prompt-level semantics with trajectory-level dynamics, we sample the *same prompt* repeatedly under non-zero temperature and identify prompts where the model produces *both* factual and hallucinated completions. This isolates trajectory-level divergence: identical initial states yield different outcomes solely through the stochastic sampling path.

#### Symmetric causal patching.

For each bifurcating prompt, we collect correct and hallucinated runs with full hidden-state caches. We then perform bidirectional activation patching: replacing a hallucinated run's activation with a correct run's activation (and vice versa) at each layer and generation step, with three control conditions (random-prompt patch, wrong-to-wrong patch, unpatched baseline).

Our findings reveal a sharp asymmetry. Corrupting a correct trajectory via single-layer activation replacement succeeds in 87.5% of trials, while correcting a hallucinated trajectory succeeds in only 33.3%, and correction requires sustained multi-step intervention to reach even this rate. We interpret this asymmetry through the lens of dynamical systems: hallucination operates as a locally stable attractor basin in the residual-stream state space, characterized by easy entry and difficult escape (Figure 1).

**Figure 1:** Conceptual overview. From a shared initial state $h_0$, stochastic token sampling commits the trajectory to either a correct (green) or hallucinated (red) basin. Activation patching reveals that corruption (crossing into the hallucination basin) requires only a single-point perturbation, while correction (escaping the hallucination basin) requires sustained multi-step intervention, the hallmark of an asymmetric attractor landscape.

## 2 Related Work

#### Hallucination detection via internals.

Li et al. [4](https://arxiv.org/html/2604.15400#bib.bib4) and Azaria & Mitchell [3](https://arxiv.org/html/2604.15400#bib.bib3) demonstrate that linear probes over hidden states can detect hallucination. Burns et al. [12](https://arxiv.org/html/2604.15400#bib.bib12) find truth-correlated directions via unsupervised methods. These establish that hallucination leaves detectable traces but do not address causality.

#### Representation engineering and steering.

Zou et al. [6](https://arxiv.org/html/2604.15400#bib.bib6) and Li et al. [4](https://arxiv.org/html/2604.15400#bib.bib4) show that adding learned steering vectors to activations can shift model behavior toward truthfulness. The concurrent work of Cherukuri & Varshney [7](https://arxiv.org/html/2604.15400#bib.bib7) frames hallucination through basin geometry and proposes geometry-aware steering. Our work differs in methodology: we employ same-prompt bifurcation and classical activation patching with controls, providing causal rather than correlational evidence.

#### Activation patching and causal tracing.

Meng et al. [8](https://arxiv.org/html/2604.15400#bib.bib8) introduce activation patching for localizing factual recall. Heimersheim & Nanda [9](https://arxiv.org/html/2604.15400#bib.bib9) systematize its interpretation. We extend this methodology to hallucination trajectories, with the novel contribution of measuring *directional asymmetry* between corruption and correction.

#### Trajectory analysis in generation.

Suresh et al. [10](https://arxiv.org/html/2604.15400#bib.bib10) show that transformers activate coherent but input-insensitive features under uncertainty. Naparstek [11](https://arxiv.org/html/2604.15400#bib.bib11) studies commitment timing via projected autoregression in continuous state spaces. We provide the first same-prompt bifurcation analysis demonstrating that identical initial states diverge at the first generation step.
## 3 Method

### 3.1 Experimental Setup

We conduct all experiments on Qwen2.5-1.5B [14](https://arxiv.org/html/2604.15400#bib.bib14), a 28-layer transformer with $d_{\text{model}} = 1536$, using TransformerLens [13](https://arxiv.org/html/2604.15400#bib.bib13) on Apple Silicon (MPS backend). Activations are extracted from the residual stream post-attention at each layer ($h_l^{(t)}$ denotes layer $l$ at generation step $t$).

### 3.2 Prompt Dataset

We construct a dataset of 61 prompts across six categories designed to elicit hallucination through distinct mechanisms:

- **Factual** (14 prompts): Questions with definite correct answers (e.g., "The capital of Myanmar is a city called").
- **False premise** (14 prompts): Statements embedding factual errors (e.g., "Since the Amazon River flows through Europe,").
- **Confabulation** (22 prompts): References to fictitious entities (e.g., "The Krasnov Effect in quantum mechanics describes").
- **Leading** (3 prompts): Common misconceptions posed as questions.
- **Multi-hop** (4 prompts): Questions requiring chained reasoning.
- **Math** (4 prompts): Arithmetic with verifiable answers.

Each prompt is annotated with ground-truth indicators (for correct classification) and wrong-answer indicators (for hallucination classification).

### 3.3 Phase 1: Bifurcation Discovery

For each prompt $x$, we generate $N = 20$ completions using temperature sampling at $\tau = 0.7$. Each completion is classified as Correct, Hallucination, or Other based on substring matching against the ground-truth and wrong-answer indicators.

###### Definition 1 (Bifurcating prompt).

A prompt $x$ is *bifurcating* if at least 2 of its $N$ completions are classified as Correct and at least 2 as Hallucination.

Bifurcating prompts are the experimental targets: they demonstrate that the model occupies a decision boundary where identical inputs yield divergent outputs, with the outcome determined by the sampling trajectory rather than the prompt encoding. A minimal code sketch of this discovery loop follows.
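The sketch below illustrates the Phase 1 protocol under stated assumptions: `generate_completion` is a hypothetical helper that draws one temperature-sampled completion from the model, the indicator lists are the per-prompt annotations of Section 3.2, and resolving completions that match both indicator sets in favor of Correct is our convention, not one specified in the paper.

```python
# Phase 1 sketch: classify sampled completions and test Definition 1.
# `generate_completion` is a hypothetical sampling helper (not from the paper).

def classify(completion: str, correct_inds: list[str], wrong_inds: list[str]) -> str:
    """Label a completion by substring matching against the indicator lists."""
    text = completion.lower()
    if any(ind.lower() in text for ind in correct_inds):
        return "correct"       # ties resolved toward Correct (our convention)
    if any(ind.lower() in text for ind in wrong_inds):
        return "hallucination"
    return "other"

def is_bifurcating(prompt: str, correct_inds: list[str], wrong_inds: list[str],
                   n_samples: int = 20, temperature: float = 0.7) -> bool:
    """Definition 1: >= 2 Correct AND >= 2 Hallucination among N samples."""
    labels = [
        classify(generate_completion(prompt, temperature=temperature),
                 correct_inds, wrong_inds)
        for _ in range(n_samples)
    ]
    return labels.count("correct") >= 2 and labels.count("hallucination") >= 2
```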
### 3.4 Phase 2: Trajectory Divergence Analysis

For each bifurcating prompt, we collect $K = 6$ cached runs per class (correct and hallucinated), storing the full residual stream $\{h_l^{(t)}\}_{l=0}^{L-1}$ at every layer for each generation step $t$.

#### Step-wise KL divergence.

At each step $t$, we compute the KL divergence between the mean output distributions of correct and hallucinated runs:

$$D_{\mathrm{KL}}^{(t)} = D_{\mathrm{KL}}\!\left(\bar{P}_{\text{hall}}^{(t)} \,\big\|\, \bar{P}_{\text{corr}}^{(t)}\right),$$

where $\bar{P}_{\text{hall}}^{(t)} = \frac{1}{K}\sum_{k=1}^{K} P_k^{(t)}$ is the mean softmax distribution over hallucinated runs at step $t$. We define the *divergence onset* as the first step where $D_{\mathrm{KL}}^{(t)} > 0.5$.

#### Layer-wise separation.

At each (layer, step), we compute Cohen's d between the hidden states of correct and hallucinated runs:

$$d_{l,t} = \frac{\big\|\bar{h}_{l,t}^{\text{hall}} - \bar{h}_{l,t}^{\text{corr}}\big\|_2}{s_{l,t}^{\text{pooled}}},$$

where $s_{l,t}^{\text{pooled}}$ is the pooled standard deviation across the two groups. This yields a separation heatmap over the (layer × step) grid; both metrics are sketched below.
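A minimal NumPy sketch of the two Phase 2 metrics, under stated assumptions: `p_hall` and `p_corr` stack the per-run softmax distributions at one step, shape `(K, vocab)`; `h_hall` and `h_corr` stack the per-run hidden states at one (layer, step), shape `(K, d_model)`; and taking the pooled standard deviation as a scalar over all components is one reasonable reading of $s^{\text{pooled}}$, which the paper does not pin down.

```python
import numpy as np

def stepwise_kl(p_hall: np.ndarray, p_corr: np.ndarray, eps: float = 1e-12) -> float:
    """D_KL(mean hallucinated dist || mean correct dist) at one generation step."""
    p = p_hall.mean(axis=0)  # average over K hallucinated runs -> (vocab,)
    q = p_corr.mean(axis=0)  # average over K correct runs
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def divergence_onset(kl_per_step, threshold: float = 0.5):
    """First step t with D_KL^(t) above the onset threshold (None if never)."""
    return next((t for t, kl in enumerate(kl_per_step) if kl > threshold), None)

def cohens_d(h_hall: np.ndarray, h_corr: np.ndarray) -> float:
    """Separation d_{l,t}: distance between group means over a pooled std."""
    gap = np.linalg.norm(h_hall.mean(axis=0) - h_corr.mean(axis=0))
    n1, n2 = len(h_hall), len(h_corr)
    pooled = np.sqrt(((n1 - 1) * h_hall.std(ddof=1) ** 2 +
                      (n2 - 1) * h_corr.std(ddof=1) ** 2) / (n1 + n2 - 2))
    return float(gap / pooled)
```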
### 3.5 Phase 3: Causal Activation Patching

We perform activation patching [8](https://arxiv.org/html/2604.15400#bib.bib8) to establish causal relationships between hidden-state values and generation outcomes.

###### Definition 2 (Activation patch).

Given a *target run* generating from prompt $x$ and a *source run* of the same prompt, an activation patch at (layer $l$, step $t$) replaces the target run's residual stream activation with the source run's:

$$h_l^{(t),\,\text{target}} \leftarrow h_l^{(t),\,\text{source}}.$$

Generation then continues autoregressively from step $t+1$, with the patched state propagated through all downstream layers. We implement this via TransformerLens forward hooks, patching only the last token position at the specified step.

#### Experimental conditions.

We test four patching configurations:

1. **H → C (correction)**: Target = hallucinated run, source = correct run. Measures whether injecting a correct activation redirects a hallucinated trajectory.
2. **C → H (corruption)**: Target = correct run, source = hallucinated run. Measures whether injecting a hallucinated activation derails a correct trajectory.
3. **Random clean control**: Target = hallucinated run, source = correct run from a *different prompt*. Tests whether any correct-looking activation suffices, or whether the effect is prompt-specific.
4. **Wrong-to-wrong control**: Target = hallucinated run, source = a *different* hallucinated run of the same prompt. Tests whether the patching effect is due to injecting any different state versus a specifically correct state.

We additionally measure an unpatched baseline: the natural correct rate when simply resampling the prompt without intervention.

#### Sweep protocol.

We perform three sweeps:

- **Layer sweep**: Fix step = 1, vary layer $l \in \{0, \ldots, 27\}$.
- **Step sweep**: Fix layer $l^*$ (the best layer from the layer sweep), vary step $t \in \{0, \ldots, 4\}$.
- **Window sweep**: Fix layer $l^*$, patch steps {1}, {1, 2}, {1, 2, 3}, {1, 2, 3, 4}.

Each condition is evaluated over 8 bifurcating prompts × 3 trials = 24 trials per cell.

#### Metrics.

For each patching condition, we report:

- **Flip rate**: fraction of trials where the output classification changes to the target class (correct for H → C, hallucinated for C → H).
- **Abstain rate**: fraction producing Other (neither clearly correct nor hallucinated).

A hook-based sketch of a single patched generation follows.
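The following is a minimal TransformerLens sketch of one patched generation, not the authors' exact implementation. The cache layout `cached_source[layer][step]` (one `(d_model,)` tensor per patch point) is an assumption, as is the hook point: `blocks.{l}.hook_resid_post` is the end-of-block residual hook, whereas the paper's "post-attention" extraction could instead map to `hook_resid_mid`. Because this loop re-runs the full sequence each step rather than using a KV cache, the patch is applied on the single forward pass at the designated step and influences later steps through the sampled token.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("Qwen/Qwen2.5-1.5B")

def generate_with_patch(prompt: str, cached_source, layer: int, patch_step: int,
                        max_new_tokens: int = 30, temperature: float = 0.7) -> str:
    """Generate from `prompt`, patching the residual stream at (layer, patch_step)."""
    tokens = model.to_tokens(prompt)  # shape (1, prompt_len)
    hook_name = f"blocks.{layer}.hook_resid_post"

    def patch_hook(resid, hook):
        # Overwrite the last token position with the source run's activation;
        # downstream layers process the patched state on this forward pass.
        resid[:, -1, :] = cached_source[layer][patch_step].to(resid.device)
        return resid

    for step in range(max_new_tokens):
        fwd_hooks = [(hook_name, patch_hook)] if step == patch_step else []
        with torch.no_grad():
            logits = model.run_with_hooks(tokens, fwd_hooks=fwd_hooks)
        probs = torch.softmax(logits[0, -1] / temperature, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)       # shape (1,)
        tokens = torch.cat([tokens, next_tok.unsqueeze(0)], dim=-1)

    return model.to_string(tokens[0])
```

The C → H condition simply swaps which run supplies `cached_source`; the random-clean and wrong-to-wrong controls change only where the source cache comes from.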
## 4 Results

### 4.1 Bifurcation Discovery

Of the 61 prompts, 27 (44.3%) exhibit genuine bifurcation. The distribution varies markedly by category (Table 1).

**Table 1:** Bifurcation rates by hallucination category. Bifurcating prompts produce both correct and hallucinated outputs from identical inputs under temperature sampling (τ = 0.7, N = 20).

**Figure 2:** Per-prompt correct rate (above axis) and hallucination rate (below axis) for all 61 prompts, colored by category. Stars (★) mark bifurcating prompts. False-premise prompts (red) are almost universally bifurcating; confabulation prompts (purple) tend toward deterministic hallucination.

Three observations are noteworthy. First, *false premise* prompts are almost universally bifurcating: the model is genuinely uncertain whether to accept or reject the embedded falsehood. Second, *confabulation* prompts are predominantly deterministic; the model either confidently fabricates (9/22 always hallucinate) or occasionally self-corrects. This suggests that confabulatory hallucination reflects a different internal regime than false-premise hallucination. Third, an additional 6 prompts are *near-bifurcating* (producing exactly 1 correct or 1 hallucinated sample out of 20), indicating that bifurcation is not a binary property but lies on a continuum: the model's proximity to the decision boundary varies smoothly across prompts (Figure 2).

### 4.2 Step-wise Divergence

Across all 27 bifurcating prompts, the KL divergence between correct and hallucinated output distributions follows a characteristic pattern:

$$D_{\mathrm{KL}}^{(0)} = 0.00, \qquad D_{\mathrm{KL}}^{(1)} \in [0.12,\ 19.25], \qquad \text{mean onset} = 1.1.$$

The zero KL at step 0 is a methodological validation: identical prompts produce identical logits (before sampling), confirming that any subsequent divergence is trajectory-driven rather than prompt-driven.

**Figure 3:** Step-wise KL divergence across all 24 bifurcating prompts with trajectory data. Thin gray lines: individual prompts. Bold line: median. Shaded region: interquartile range. All prompts share the same pattern: $D_{\mathrm{KL}}^{(0)} = 0$ (identical logits at step 0), followed by divergence at step 1.