llm-interpretability

#llm-interpretability

Where Do Models Find Happiness? Emotion Vectors in Open-Source LLMs

arXiv cs.CL · yesterday

This paper replicates the finding of 'emotion vectors' in the open-weight LLMs Apertus-8B and Gemma-4-E4B, showing that valence geometry is recoverable in both models, although it emerges at different layers. The study also finds that arousal encoding is sensitive to the story corpus used for extraction.


@DanKornas: LLM interpretability is a rabbit hole. This repo gives you a map. Awesome LLM Interpretability is a curated GitHub list…

X AI KOLs Timeline · 5d ago

A curated GitHub list of tools, papers, and communities for LLM interpretability, helping researchers navigate the field efficiently.


@GitHub_Daily: How do large language models work internally, why do they hallucinate, and why do they sometimes give irrelevant answers? For a deeper understanding, check out the Awesome LLM Interpretability resource collection, which provides a systematic path to unpack the AI black box. It covers attention visualization, neuron analysis, and more.

X AI KOLs Timeline · 2026-06-18

Introduces the Awesome LLM Interpretability resource collection, which gathers various interpretability tools, papers, and community resources to help understand the internal workings of large language models.


Building Better Activation Oracles

arXiv cs.LG · 2026-06-03

This paper presents improvements to Activation Oracles (AOs) for interpreting residual stream activations, including a new conversational dataset, multi-layer injections, and on-policy training. The authors also release AObench, the first comprehensive evaluation suite for AO quality.


Neuron-Level Interventions for Gendered and Gender-Neutral Generation in Language Models

arXiv cs.CL · 2026-06-01

This paper proposes a neuron-level intervention method that identifies gender-specific neurons (feminine, masculine, gender-neutral) in language models and steers sentence generation toward a target gender form while preserving meaning. Experiments show precise control and bias mitigation.


What are They Thinking? Delineation, Probing and Tracking of Concepts in LLMs

arXiv cs.CL · 2026-05-29

This paper presents a methodology for delineating concepts and training linear probes to detect them in LLM embeddings, using four example concepts across three models. The work aims to enable scalable monitoring of LLM internal representations.
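
As a rough illustration of what a linear probe over embeddings looks like, here is a minimal sketch. It is not the paper's method: it swaps a trained probe for a simple difference-of-means direction, and the three-dimensional "embeddings" are made-up toy vectors. The names `fit_mean_difference_probe` and `probe_score` are hypothetical.

```python
from statistics import fmean


def fit_mean_difference_probe(pos, neg):
    """Return (weights, bias) of a linear probe; score > 0 means 'concept present'.

    The weight vector is the difference between the class means, and the bias
    places the decision boundary at the midpoint between the two means.
    """
    dim = len(pos[0])
    mu_pos = [fmean(v[i] for v in pos) for i in range(dim)]
    mu_neg = [fmean(v[i] for v in neg) for i in range(dim)]
    weights = [p - n for p, n in zip(mu_pos, mu_neg)]
    midpoint = [(p + n) / 2 for p, n in zip(mu_pos, mu_neg)]
    bias = -sum(w * m for w, m in zip(weights, midpoint))
    return weights, bias


def probe_score(weights, bias, x):
    """Linear score w.x + b for a single embedding x."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


# Toy embeddings (assumed data, not taken from any model).
concept_present = [[1.0, 0.2, 0.0], [0.9, -0.1, 0.1], [1.1, 0.0, -0.2]]
concept_absent = [[-1.0, 0.1, 0.0], [-0.8, -0.2, 0.1], [-1.2, 0.0, 0.2]]

weights, bias = fit_mean_difference_probe(concept_present, concept_absent)
predictions = [
    probe_score(weights, bias, x) > 0 for x in concept_present + concept_absent
]
```

In practice a probe would be fit on held-out activations from a real model and evaluated on separate data; the toy set here only shows the shape of the computation.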


Compositional Literary Primitives in Instruction-Tuned LLMs: Cross-Architectural SAE Features for Self, Style, and Affect

arXiv cs.LG · 2026-05-20

This paper characterizes compositional literary primitives in instruction-tuned LLMs using sparse autoencoders, discovering feature classes for self, style, and affect that enable emotion steering across two architectures.


Reasoning Models Don't Just Think Longer, They Move Differently

arXiv cs.CL · 2026-05-18

This paper investigates whether reasoning-trained language models simply allocate more compute (longer chains of thought) or follow qualitatively different internal trajectories, by analyzing hidden-state trajectory geometry across code, math, and SAT domains. After correcting for generation length, the authors find that reasoning-trained models exhibit distinct trajectory geometry, most clearly in code, indicating that reasoning training changes how computation unfolds, not just how much is used.


Uncovering the Representation Geometry of Minimal Cores in Overcomplete Reasoning Traces

arXiv cs.AI · 2026-05-15

This paper introduces the concept of 'minimal cores' in overcomplete reasoning traces, showing that on average 46% of steps can be removed while preserving the final answer, and that minimal cores improve trace separation and reduce intrinsic dimensionality.


Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

arXiv cs.LG · 2026-05-14

This paper introduces a framework for token-level influence attribution in large language models by learning orthogonal latent spaces with sparse autoencoders, enabling precise identification of training data tokens that jointly influence predictions, with applications in high-stakes domains like healthcare.


Don't Lose Focus: Activation Steering via Key-Orthogonal Projections

arXiv cs.CL · 2026-05-08

This paper introduces Steering via Key-Orthogonal Projections (SKOP), a method to control LLM behavior by preventing attention rerouting, thereby reducing utility degradation while maintaining steering efficacy.


The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models

Hugging Face Daily Papers · 2026-05-07

This research paper investigates how Large Language Models encode social role granularity as a structured latent dimension. It demonstrates that this 'Granularity Axis' is consistent across architectures like Qwen3 and Llama-3, and can be causally manipulated via activation steering.
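
For readers unfamiliar with activation steering, the basic operation can be sketched as adding a scaled unit direction to a hidden state. This is a generic illustration under stated assumptions, not the paper's procedure: the hidden state and the `axis` vector are toy values, and `steer` and `projection` are hypothetical helper names.

```python
import math


def steer(hidden, direction, alpha):
    """Return hidden + alpha * unit(direction), the basic steering update."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    return [h + alpha * u for h, u in zip(hidden, unit)]


def projection(hidden, direction):
    """Scalar projection of a hidden state onto the direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(h * d for h, d in zip(hidden, direction)) / norm


# Toy hidden state and latent direction (assumed values).
hidden = [0.5, -1.0, 2.0]
axis = [3.0, 0.0, 4.0]

steered = steer(hidden, axis, alpha=2.0)

# Steering by alpha moves the state exactly alpha units along the direction.
shift = projection(steered, axis) - projection(hidden, axis)
```

In a real model the update would be applied to residual-stream activations at a chosen layer during the forward pass; which layer and what scale work best are empirical questions this sketch does not address.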


Surrogate modeling for interpreting black-box LLMs in medical predictions

arXiv cs.CL · 2026-04-23

Researchers propose a surrogate modeling framework to quantify and interpret latent medical knowledge encoded in black-box LLMs, revealing both valid associations and persistent racial biases.


Tracing Relational Knowledge Recall in Large Language Models

arXiv cs.CL · 2026-04-23

Researchers trace how LLMs recall relational facts by probing per-head attention contributions, showing these are strong linear features whose fidelity correlates with relation specificity and entity connectedness.


Do Hallucination Neurons Generalize? Evidence from Cross-Domain Transfer in LLMs

arXiv cs.CL · 2026-04-23

Independent researchers show that sparse "hallucination neurons" identified in LLMs do not transfer across domains: AUROC drops from 0.783 to 0.563 under cross-domain transfer, indicating that hallucination is domain-specific rather than a universal neural signature.
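
To make the AUROC comparison concrete, the metric can be computed as the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The sketch below uses made-up detector scores for an in-domain and a cross-domain set; the numbers are illustrative assumptions and do not reproduce the paper's 0.783 or 0.563.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Labels are 1 (hallucination) or 0.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 1, 0, 0]

# Hypothetical detector scores on the domain the neurons were found in.
in_domain_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
# Hypothetical scores from the same detector on a different domain.
cross_domain_scores = [0.6, 0.4, 0.5, 0.65, 0.7, 0.2]

in_domain_auc = auroc(in_domain_scores, labels)
cross_domain_auc = auroc(cross_domain_scores, labels)
```

An AUROC of 0.5 corresponds to chance-level ranking, which is why a drop toward 0.5 is read as a failure to transfer.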


Can We Locate and Prevent Stereotypes in LLMs?

arXiv cs.CL · 2026-04-23

This arXiv preprint maps stereotype-encoding neurons and attention heads in GPT-2 Small and Llama 3.2, showing that biases cluster in small neuron subsets, yet ablating those neurons barely reduces biased text generation.


TPA: Next Token Probability Attribution for Detecting Hallucinations in RAG

arXiv cs.CL · 2026-04-20

TPA detects hallucinations in RAG systems by attributing next-token probabilities to seven distinct sources (Query, RAG Context, Past Token, Self Token, FFN, Final LayerNorm, Initial Embedding) and aggregating the attributions by Part-of-Speech tag. The authors report state-of-the-art performance across five LLMs, including Llama2, Llama3, Mistral, and Qwen.


Disentangling Mathematical Reasoning in LLMs: A Methodological Investigation of Internal Mechanisms

arXiv cs.CL · 2026-04-20

This paper investigates how large language models perform arithmetic operations by analyzing internal mechanisms through early decoding, revealing that proficient models exhibit a clear division of labor between attention and MLP modules in reasoning tasks.


Applied Explainability for Large Language Models: A Comparative Study

arXiv cs.CL · 2026-04-20

A comparative study evaluating three explainability techniques (Integrated Gradients, Attention Rollout, SHAP) on fine-tuned DistilBERT for sentiment classification, highlighting trade-offs between gradient-based, attention-based, and model-agnostic approaches for LLM interpretability.
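
Of the three techniques named above, Attention Rollout is the simplest to sketch: each layer's attention matrix is averaged with the identity (to account for residual connections) and the results are multiplied across layers. The example below is a minimal pure-Python sketch, assuming head-averaged attention matrices and toy two-token inputs; it is not the study's code, and `attention_rollout` is a hypothetical name.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def attention_rollout(layers):
    """Combine per-layer attention into token-to-token relevance.

    `layers` is ordered from the first layer to the last; each entry is a
    row-stochastic attention matrix already averaged over heads.
    """
    n = len(layers[0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layers:
        # Mix attention with the identity to model the residual connection.
        mixed = [
            [0.5 * attn[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
            for i in range(n)
        ]
        rollout = matmul(mixed, rollout)
    return rollout


# Toy attention for a 2-token sequence across 2 layers (assumed values).
layer_1 = [[1.0, 0.0], [0.5, 0.5]]
layer_2 = [[0.2, 0.8], [0.0, 1.0]]

rollout = attention_rollout([layer_1, layer_2])
row_sums = [sum(row) for row in rollout]
```

Each row of the result stays normalized, so entry `rollout[i][j]` can be read as the share of output token `i`'s attention flow that traces back to input token `j`.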


LLM Neuroanatomy III - LLMs seem to think in geometry, not language

Reddit r/LocalLLaMA · 2026-04-19

A researcher analyzes LLM internal representations across 8 languages and multiple models, finding that concepts are represented geometrically in the middle transformer layers, independently of the input language. The author argues that this supports a universal deep structure hypothesis akin to Chomsky's theory rather than Sapir-Whorf linguistic relativity.
