mechanistic-analysis

Reasoning Fine-Tuning Induces Persistent Latent Policy States

arXiv cs.CL · 8h ago

This paper models Chain-of-Thought reasoning as a switching dynamical system, showing that reasoning fine-tuning globally reorganizes latent policy states, leading to improved multi-step reasoning. The proposed framework combines time-aware contrastive learning with discrete regime discovery, and experiments demonstrate that fine-tuned models exhibit richer latent-policy organization with functional specialization.

When Data Imbalance Helps: Robust Generalization Through Shortcut Saturation

arXiv cs.LG · 2026-07-14

This paper challenges the standard prescription of balancing datasets to avoid spurious correlations, showing that in a synthetic sum parity task with two-layer transformers, high data imbalance (spurious ratio 0.9) promotes robust generalization while low imbalance (0.5) hinders it, through a mechanism of shortcut saturation.

LLM Parameters for Math Across Languages: Shared or Separate?

arXiv cs.CL · 2026-06-18

This paper presents a cross-lingual mechanistic analysis of mathematical reasoning in LLMs, finding partial overlap of math-associated parameters across languages, concentrated in intermediate layers. English has the largest set of math-relevant parameters, while lower-resource languages have smaller sets.

Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

Hugging Face Daily Papers · 2026-06-11

SWITCH is a switchable latent reasoning framework that uses explicit boundary tokens to enable trainable and interpretable recurrent hidden-state reasoning via on-policy reinforcement learning, outperforming prior approaches.

Mechanistic Analysis of Alignment Algorithms in Language Models

arXiv cs.LG · 2026-06-10

This paper presents a systematic mechanistic analysis of six preference optimization methods (PPO, DPO, SimPO, ORPO, GRPO, KTO) across three open-weight model families, using probing and sparse autoencoders to reveal how alignment algorithms reshape internal representations in qualitatively distinct ways.

Trust, but Don't Verify: Epistemic Blind Spots in LLM Source Evaluation

arXiv cs.LG · 2026-06-05

This paper identifies a failure mode in which LLMs synthesizing multiple sources do not verify the validity of numerical statistics and instead rely on stylistic markers of analytical rigor. The authors term this 'epistemic alignment' and show that it persists across models and domains and resists prompting-based mitigations.

Why LLMs Hallucinate on Structured Knowledge: A Mechanistic Analysis of Reasoning over Linearized Representations

arXiv cs.CL · 2026-05-27

This paper presents a mechanistic analysis of why LLMs hallucinate when reasoning over linearized structured knowledge, finding that hallucinations stem from systematic internal dynamics such as attention on shortcut cues and failures in semantic grounding in feed-forward layers, rather than random noise.

In-Context Learning Operates as Concept Subspace Learning

arXiv cs.LG · 2026-05-20

This paper proposes that in-context learning in LLMs operates through low-dimensional concept subspaces, where task-relevant information concentrates in a small fraction of the representation space, supported by experiments on Llama-3-8B and Qwen2.5-7B.

Mechanisms of Prompt-Induced Hallucination in Vision-Language Models

arXiv cs.CL · 2026-04-20

This paper investigates prompt-induced hallucinations (PIH) in vision-language models through mechanistic analysis, identifying specific attention heads responsible for the models' tendency to favor textual prompts over visual evidence. The authors demonstrate that ablating these PIH-heads reduces hallucinations by at least 40% without additional training, and that the underlying mechanisms are model-specific.
