This paper replicates the finding of 'emotion vectors' in the open-weight LLMs Apertus-8B and Gemma-4-E4B, showing that valence geometry is recoverable in both models, although the layer at which it emerges differs between them. The study also finds that arousal encoding is sensitive to the story corpus used for extraction.
A curated GitHub list of tools, papers, and communities for LLM interpretability, helping researchers navigate the field efficiently.
Introduces the Awesome LLM Interpretability resource collection, which gathers various interpretability tools, papers, and community resources to help understand the internal workings of large language models.
This paper presents improvements to Activation Oracles (AOs) for interpreting residual stream activations, including a new conversational dataset, multi-layer injections, and on-policy training. The authors also release AObench, the first comprehensive evaluation suite for AO quality.
This paper proposes a neuron-level intervention method to identify gender-specific neurons in language models (feminine, masculine, gender-neutral) and steer sentence generation toward a target gender form while preserving meaning, with experiments showing precise control and bias mitigation.
This paper presents a methodology for delineating concepts and training linear probes to detect them in LLM embeddings, using four example concepts across three models. The work aims to enable scalable monitoring of LLM internal representations.
This paper characterizes compositional literary primitives in instruction-tuned LLMs using sparse autoencoders, discovering feature classes for self, style, and affect that enable emotion steering across two architectures.
This paper investigates whether reasoning-trained language models simply allocate more compute (longer chains of thought) or follow qualitatively different internal trajectories, by analyzing hidden-state trajectory geometry across code, math, and SAT domains. After correcting for generation length, the authors find that reasoning-trained models exhibit distinct trajectory geometry, most clearly in code, indicating that reasoning training changes how computation unfolds, not just how much is used.
This paper introduces the concept of 'minimal cores' in overcomplete reasoning traces, showing that on average 46% of steps can be removed while preserving the final answer, and that minimal cores improve trace separation and reduce intrinsic dimensionality.
This paper introduces a framework for token-level influence attribution in large language models by learning orthogonal latent spaces with sparse autoencoders, enabling precise identification of training data tokens that jointly influence predictions, with applications in high-stakes domains like healthcare.
This paper introduces Steering via Key-Orthogonal Projections (SKOP), a method to control LLM behavior by preventing attention rerouting, thereby reducing utility degradation while maintaining steering efficacy.
This research paper investigates how Large Language Models encode social role granularity as a structured latent dimension. It demonstrates that this 'Granularity Axis' is consistent across architectures like Qwen3 and Llama-3, and can be causally manipulated via activation steering.
Researchers propose a surrogate modeling framework to quantify and interpret latent medical knowledge encoded in black-box LLMs, revealing both valid associations and persistent racial biases.
Researchers trace how LLMs recall relational facts by probing per-head attention contributions, showing these are strong linear features whose fidelity correlates with relation specificity and entity connectedness.
Independent researchers show that sparse "hallucination neurons" identified in LLMs do not transfer across domains: AUROC drops from 0.783 to 0.563, indicating that hallucination is domain-specific rather than a universal neural signature.
An arXiv preprint maps stereotype-encoding neurons and attention heads in GPT-2 Small and Llama 3.2, showing that biases cluster in small subsets of neurons, yet ablating those subsets barely reduces biased text generation.
TPA is a method for detecting hallucinations in RAG systems that attributes next-token probabilities to seven distinct sources (Query, RAG Context, Past Token, Self Token, FFN, Final LayerNorm, Initial Embedding) and aggregates the attributions by Part-of-Speech tag. The approach achieves state-of-the-art performance across five LLMs including Llama2, Llama3, Mistral, and Qwen.
This paper investigates how large language models perform arithmetic operations by analyzing internal mechanisms through early decoding, revealing that proficient models exhibit a clear division of labor between attention and MLP modules in reasoning tasks.
A comparative study evaluating three explainability techniques (Integrated Gradients, Attention Rollout, SHAP) on a fine-tuned DistilBERT model for sentiment classification, highlighting trade-offs between gradient-based, attention-based, and model-agnostic approaches to LLM interpretability.
A researcher analyzes LLM internal representations across 8 languages and multiple models, finding that concepts are represented geometrically in the middle transformer layers independently of the input language. The result supports a universal deep-structure hypothesis akin to Chomsky's theory rather than Sapir-Whorf linguistic relativity.