ICA Lens: Interpreting Language Models Without Training Another Dictionary

Hugging Face Daily Papers

Summary

ICA Lens revives independent component analysis as an efficient method for interpreting language model representations, offering a faster alternative to sparse autoencoder training while maintaining competitive performance.

Finding interpretable directions in language-model representations is critical for understanding and controlling model behavior. Sparse autoencoders (SAEs) have become the standard tool for this purpose, but using them as the default first lens often requires training, storing, and evaluating large overcomplete dictionaries. This bottleneck limits rapid exploration and raises a fundamental question: how much interpretable structure is already visible from activation geometry before training another neural dictionary? Our intuition is simple: many interpretable directions are selective on tokens, and these directions should look less Gaussian than random directions. We therefore revisit independent component analysis (ICA), a classical method for finding non-Gaussian directions, as a compact lens for language-model interpretability. We find that ICA has been underestimated for LLM interpretability, because prior uses often relied on off-the-shelf ICA implementations that are brittle on LLM activations and lacked systematic tools for inspecting and evaluating the recovered directions. To bridge these gaps, we introduce ICALens, the first practical workflow for stable, efficient, and auditable ICA analysis of LLM representations. It combines an optimized GPU-parallel FastICA pipeline with LLM-specific stability recipes and better fitting diagnostics, enabling efficient and reliable layer-wise analysis. Across GPT-2 Small, Gemma 2 2B, and Qwen 3.5 2B Base, ICALens efficiently recovers compact, human-interpretable directions without per-layer gradient-based dictionary training. On SAEBench, ICA is competitive with public SAEs in sparse probing and outperforms them in targeted probe perturbation under small-to-medium budgets. These results suggest that ICA should not be viewed as a weak baseline, but as an efficient and complementary first lens for exploring language-model representations.
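The non-Gaussianity intuition can be illustrated with a small, self-contained sketch. It is not taken from the paper: the two signals below are synthetic stand-ins (a hypothetical "selective" feature that fires on roughly 2% of tokens, and a Gaussian sample standing in for a projection onto a random direction), and excess kurtosis is used here as one common, simple measure of non-Gaussianity.

```python
import random


def excess_kurtosis(xs):
    """Excess kurtosis: near 0 for Gaussian data, large for sparse, heavy-tailed data."""
    n = len(xs)
    mean = sum(xs) / n
    centered = [x - mean for x in xs]
    var = sum(c * c for c in centered) / n
    fourth = sum(c ** 4 for c in centered) / n
    return fourth / (var * var) - 3.0


rng = random.Random(0)
n_tokens = 20000  # hypothetical number of token positions

# Assumption: a token-selective feature is silent on most tokens and
# strongly active on a small fraction (about 2% here).
selective = [
    rng.gauss(5.0, 1.0) if rng.random() < 0.02 else 0.0
    for _ in range(n_tokens)
]

# Assumption: a projection onto a random direction is modeled directly
# as a Gaussian sample.
gaussian_like = [rng.gauss(0.0, 1.0) for _ in range(n_tokens)]

k_selective = excess_kurtosis(selective)
k_gaussian = excess_kurtosis(gaussian_like)

print(f"selective feature excess kurtosis: {k_selective:.2f}")
print(f"random direction excess kurtosis:  {k_gaussian:.2f}")
```

Under these assumptions the selective signal shows a much larger excess kurtosis than the Gaussian one, which is the kind of contrast ICA exploits when it searches for non-Gaussian directions.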

Cached at: 06/11/26, 01:41 PM


Source: https://huggingface.co/papers/2606.11722

Abstract

Independent component analysis (ICA) is revived as an efficient method for discovering interpretable directions in language model representations, offering a faster alternative to sparse autoencoder training while maintaining competitive performance in probing tasks.

Finding interpretable directions in language-model representations is critical for understanding and controlling model behavior. Sparse autoencoders (SAEs) have become the standard tool for this purpose, but using them as the default first lens often requires training, storing, and evaluating large overcomplete dictionaries. This bottleneck limits rapid exploration and raises a fundamental question: how much interpretable structure is already visible from activation geometry before training another neural dictionary? Our intuition is simple: many interpretable directions are selective on tokens, and these directions should look less Gaussian than random directions. We therefore revisit independent component analysis (ICA), a classical method for finding non-Gaussian directions, as a compact lens for language-model interpretability. We find that ICA has been underestimated for LLM interpretability, because prior uses often relied on off-the-shelf ICA implementations that are brittle on LLM activations and lacked systematic tools for inspecting and evaluating the recovered directions. To bridge these gaps, we introduce ICALens, the first practical workflow for stable, efficient, and auditable ICA analysis of LLM representations. It combines an optimized GPU-parallel FastICA pipeline with LLM-specific stability recipes and better fitting diagnostics, enabling efficient and reliable layer-wise analysis. Across GPT-2 Small, Gemma 2 2B, and Qwen 3.5 2B Base, ICALens efficiently recovers compact, human-interpretable directions without per-layer gradient-based dictionary training. On SAEBench, ICA is competitive with public SAEs in sparse probing and outperforms them in targeted probe perturbation under small-to-medium budgets. These results suggest that ICA should not be viewed as a weak baseline, but as an efficient and complementary first lens for exploring language-model representations.
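For readers unfamiliar with FastICA, the following is a minimal NumPy sketch of the textbook symmetric variant with a tanh contrast function. It is not the ICALens pipeline: it has no GPU parallelism, none of the LLM-specific stability recipes, and no fitting diagnostics. The function names (`whiten`, `sym_decorrelate`, `fastica`), the Laplace-distributed sources, and the mixing matrix are all illustrative assumptions; real use would replace `X` with a matrix of layer activations.

```python
import numpy as np


def whiten(X):
    """Center X (n_samples, n_features) and rescale it to identity covariance."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / Xc.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Scale each eigenvector column by 1/sqrt(eigenvalue).
    return Xc @ (eigvecs / np.sqrt(eigvals))


def sym_decorrelate(W):
    """Symmetric orthogonalization: W <- (W W^T)^(-1/2) W."""
    s, u = np.linalg.eigh(W @ W.T)
    return (u / np.sqrt(s)) @ u.T @ W


def fastica(X, n_iter=200, tol=1e-6, seed=0):
    """Textbook symmetric FastICA with g = tanh; returns (components, unmixing)."""
    Z = whiten(X)
    n, d = Z.shape
    rng = np.random.default_rng(seed)
    W = sym_decorrelate(rng.standard_normal((d, d)))
    for _ in range(n_iter):
        Y = Z @ W.T                # projections onto current directions
        G = np.tanh(Y)             # contrast nonlinearity g
        G_prime = 1.0 - G ** 2     # its derivative g'
        # Fixed-point step per row: E[z g(w^T z)] - E[g'(w^T z)] w
        W_new = (G.T @ Z) / n - G_prime.mean(axis=0)[:, None] * W
        W_new = sym_decorrelate(W_new)
        # Converged when each direction stops rotating (up to sign).
        delta = np.max(np.abs(np.abs(np.sum(W_new * W, axis=1)) - 1.0))
        W = W_new
        if delta < tol:
            break
    return Z @ W.T, W


# Synthetic check: mix three heavy-tailed sources and try to recover them.
rng = np.random.default_rng(1)
S = rng.laplace(size=(5000, 3))            # hypothetical non-Gaussian sources
A = np.array([[1.0, 0.5, 0.2],
              [0.3, 1.0, 0.4],
              [0.2, 0.6, 1.0]])            # hypothetical mixing matrix
X = S @ A.T
Y, W = fastica(X)

# Absolute correlation between each true source and each recovered component.
corr = np.abs(np.corrcoef(S.T, Y.T)[:3, 3:])
print(corr.max(axis=1))
```

ICA recovers sources only up to permutation and sign, which is why the check takes the best absolute correlation per source rather than comparing components position by position.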



