This paper introduces Prototype-Based Sparse Steering, a method that applies sparse autoencoders to attention query activations in LLMs, then uses gradient-based optimization during inference to steer generation toward target behaviors. The approach is validated in both a logical planning task and a stylistic educational domain, demonstrating interpretable and disentangled control.
This paper introduces a principled approach to multilingual language steering using sparse autoencoders (SAEs) trained on multilingual data and a novel layer selection rule based on the intersection of multilingual alignment and language separability, evaluated on LLaMA-3.1-8B and Gemma-2-9B for machine translation and cross-lingual summarization.
This paper uses sparse autoencoders to decompose LLMs into interpretable features and shows that semantic features explain brain alignment with cortical semantic topography, generalizing across English, Chinese, and French.
This paper presents a function-centric framework using Transcoders to trace computational pathways in vision-language models, demonstrating stronger attribution of visual grounding and the ability to predict hallucinations via graph-based features.
This paper proposes a five-stage methodology for causal feature analysis in transformer language models, demonstrated on GPT-2 small for the indirect object identification (IOI) task. It finds that features are specifically causal but not necessary, and exposes a gap between detection and causal robustness.
This article explores how sparse autoencoders (SAEs) can capture curved neural geometry, revealing three distinct ways SAE features represent manifolds, and presents an unsupervised pipeline to uncover geometric structure in neural representations.
This paper characterizes compositional literary primitives in instruction-tuned LLMs using sparse autoencoders, discovering feature classes for self, style, and affect that enable emotion steering across two architectures.
This paper introduces a diagnostic framework using Sparse Autoencoders to analyze concept-level forgetting in continual learning, finding that much forgetting is due to representational inaccessibility rather than erasure.
This paper investigates preference instability in reward models for LLMs, where subtle input variations cause contradictory preference assignments. The authors propose two SAE-based mitigation strategies, SAE Feature Steering and SAE Residual Correction, to reduce incorrect preference assignments without retraining.
This paper applies TopK Sparse Autoencoders to three EEG foundation models (SleepFM, REVE, LaBraM) to extract interpretable feature dictionaries and introduces a framework for concept steering, revealing representational failures and clinical entanglements.
SAE-FT is a fine-tuning method for CLIP models that uses sparse autoencoder constraints to regularize visual representations, improving robustness to distribution shifts while maintaining performance and enabling interpretability.
This paper introduces a framework for token-level influence attribution in large language models by learning orthogonal latent spaces with sparse autoencoders, enabling precise identification of training data tokens that jointly influence predictions, with applications in high-stakes domains like healthcare.
This paper introduces a novel adaptive scheduler for steering discrete diffusion language models using sparse autoencoders, demonstrating that targeting interventions based on when specific attributes commit improves control quality and strength over uniform methods.
This preprint introduces a method to inject emotion vectors into language models to simulate somatic markers, aiming to bridge the gap between semantic and episodic memory. The authors demonstrate that combining emotional echoes with semantic knowledge improves decision-making capabilities, replicating findings from human cognitive science.
This paper introduces Latent Visualization by Optimization (LVO), a mechanistic interpretability technique that uses sparse autoencoders to visualize monosemantic features in diffusion models like Stable Diffusion 1.5.
This study investigates how LLMs ground abstract concepts compared to humans, finding a significant 'grounding gap' where models rely heavily on word associations rather than emotional or internal states. Using sparse autoencoders, the authors identify internal features related to grounding dimensions, suggesting LLMs possess this information but do not recruit it naturally during generation.
This research paper introduces 'Feature Rivalry' in Sparse Autoencoder representations as a mechanistic signature of uncertainty in LLMs. Using Gemma-2-2B, the study demonstrates that negatively correlated feature pairs localize uncertainty to specific layers and causally influence model outputs.
This paper introduces a mechanistic interpretability toolkit using Sparse Autoencoders and linear probes to monitor internal model states before AI agents invoke tools, aiming to improve diagnostics and safety in enterprise workflows.
This paper compares Crosscoders and Differential SAEs (Diff-SAEs) for detecting backdoors in fine-tuned LLMs, finding that Diff-SAEs significantly outperform Crosscoders by isolating directional activation shifts.
An undergraduate researcher expresses disillusionment with recent mechanistic interpretability research from Anthropic, specifically criticizing their new natural language autoencoder approach as a black-box technique that lacks rigorous metric comparisons against sparse autoencoder baselines.
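Most of the entries above build on the same primitive: a sparse autoencoder that encodes an activation vector into a few active features, decodes it back, and optionally steers by adding a feature's decoder direction. The sketch below illustrates that primitive in pure Python with a TopK activation. It is not taken from any of the listed papers; the function names (`topk_encode`, `decode`, `steer`), the weight layout, and the toy weights are all assumptions chosen for illustration.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def topk_encode(x: Vector, w_enc: Matrix, b_enc: Vector, k: int) -> Vector:
    """Encode x into a sparse code with at most k non-zero features.

    Assumed layout: w_enc has one row per feature, each of length len(x).
    """
    # Pre-activation of every feature: w_enc @ x + b_enc.
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(w_enc, b_enc)]
    # Keep the k largest pre-activations, zero the rest.
    keep = set(sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k])
    # Clamp kept values at zero so the code stays non-negative.
    return [max(p, 0.0) if i in keep else 0.0 for i, p in enumerate(pre)]


def decode(z: Vector, w_dec: Matrix, b_dec: Vector) -> Vector:
    """Reconstruct an activation from a sparse code.

    Assumed layout: w_dec has one row per feature; row i is that
    feature's direction in activation space.
    """
    return [
        b_dec[j] + sum(z[i] * w_dec[i][j] for i in range(len(z)))
        for j in range(len(b_dec))
    ]


def steer(x: Vector, w_dec: Matrix, feature: int, strength: float) -> Vector:
    """Add a scaled decoder direction to the activation (simple additive steering)."""
    return [xj + strength * dj for xj, dj in zip(x, w_dec[feature])]


# Toy weights (hypothetical): 2-dimensional activations, 3 features.
x = [1.0, 2.0]
w_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b_dec = [0.0, 0.0]

z = topk_encode(x, w_enc, b_enc, k=1)   # only the strongest feature survives
x_hat = decode(z, w_dec, b_dec)          # lossy reconstruction from one feature
x_steered = steer(x, w_dec, feature=0, strength=2.0)
```

With `k=1` only feature 2 stays active, so the reconstruction is deliberately lossy. In practice the weights are learned by minimizing reconstruction error, and the steering variants described above (gradient-based, scheduled, residual correction) are more involved than this additive form.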