A Geometric View for Understanding Concept Learning and Neuron Interpretation in Sparse Autoencoders


Summary

This paper proposes a unified geometric framework for understanding concept learning and neuron interpretation in sparse autoencoders, formalizing concepts as sets and defining detection, separation, and approximation. It provides error bounds, capacity constraints, and links to formal concept analysis, with experiments on synthetic data.

arXiv:2606.07007v1 Announce Type: new

# A Geometric View for Understanding Concept Learning and Neuron Interpretation in Sparse Autoencoders
Source: [https://arxiv.org/html/2606.07007](https://arxiv.org/html/2606.07007)
###### Abstract

We propose a unified mathematical framework for a geometric understanding of concept learning and neuron interpretation in sparse autoencoders (SAEs). While SAEs improve interpretability of neural networks by learning sparse feature representations, a principled definition of "concept" and "learning" remains unclear. We formalize concepts as sets of data points and cast concept learning as a set-alignment problem between human-defined and model-induced concepts. This formulation distinguishes three increasingly strong notions of learning (detection, separation, and approximation) and yields geometric conditions, error bounds, and capacity constraints for when concepts can be represented by individual neurons or multi-neuron units. It also provides a set-theoretic account for common SAE phenomena, including feature splitting, feature absorption, feature families, and hierarchical concepts. Finally, we connect concept learning and neuron interpretation through formal concept analysis, showing that the two directions need not agree and that their many-to-many structure can be organized by concept lattices. Experiments on synthetic data with ReLU and Top-$K$ SAEs illustrate the theory and reveal the effects of SAE size and sparsity on concept learning.

Machine Learning, interpretability, ICML

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2606.07007v1/x1.png)

Figure 1: Examples of single-neuron total activation (SNTA) and total neuron single activation (TNSA) for a ReLU SAE (expansion factor = 8, L1 regularization = 0.5) and a Top-K SAE (expansion factor = 8, K = 4). (a) SNTA of ReLU SAE; (b) TNSA of ReLU SAE. Note that the SNTA of a ReLU SAE is simply a half space and its TNSA is a hyperplane arrangement region (Stanley and others, [2007](https://arxiv.org/html/2606.07007#bib.bib28)). (c) SNTA of Top-K SAE; (d) TNSA of Top-K SAE. Note that the SNTA of a Top-K SAE is a subset of a half space and its TNSA is a subset of a hyperplane arrangement region. The shaded area in (d) is the intersection of positively pre-activated (i.e., $z > 0$) hyperplanes and negatively pre-activated (i.e., $z < 0$) hyperplanes.

Large models based on neural networks achieve remarkable performance across many tasks, yet their internal mechanisms remain largely opaque (Elhage et al., [2021](https://arxiv.org/html/2606.07007#bib.bib35); Olah, [2022](https://arxiv.org/html/2606.07007#bib.bib36)). This lack of interpretability limits scientific understanding (DeGrave et al., [2021](https://arxiv.org/html/2606.07007#bib.bib41); Simon and Zou, [2025](https://arxiv.org/html/2606.07007#bib.bib40)), safety auditing (Anwar et al., [2024a](https://arxiv.org/html/2606.07007#bib.bib38), [b](https://arxiv.org/html/2606.07007#bib.bib39)), and reliable deployment. Mechanistic interpretability (Elhage et al., [2021](https://arxiv.org/html/2606.07007#bib.bib35); Olah, [2022](https://arxiv.org/html/2606.07007#bib.bib36)) aims to understand the internal computations of models by analyzing how information is represented and used within model activations.

A major challenge is that neurons in neural networks are mostly polysemantic (Templeton, [2024](https://arxiv.org/html/2606.07007#bib.bib14)). A single neuron may encode multiple unrelated concepts, while a single concept may be distributed across many neurons. This phenomenon, known as polysemanticity or superposition (Templeton, [2024](https://arxiv.org/html/2606.07007#bib.bib14); Elhage et al., [2022](https://arxiv.org/html/2606.07007#bib.bib13); O’Neill et al., [2024](https://arxiv.org/html/2606.07007#bib.bib5)), makes neuron-level interpretation difficult. Sparse autoencoders (SAEs) (Ng and others, [2011](https://arxiv.org/html/2606.07007#bib.bib3)) address this by learning overcomplete sparse representations of activations, producing neurons that are often more interpretable and monosemantic (Cunningham et al., [2023](https://arxiv.org/html/2606.07007#bib.bib4)).

SAEs are commonly motivated by the linear representation hypothesis (Park et al., [2023](https://arxiv.org/html/2606.07007#bib.bib1)), which posits that semantically meaningful concepts correspond to directions in the activation space and are approximately linearly combined. However, vector directions alone do not define human-interpretable concepts in practice. Instead, interpretation needs to be contextualized with respect to input data. Specifically, data examples that highly activate a neuron are identified, and their shared patterns are summarized to describe the neuron (Bills et al., [2023](https://arxiv.org/html/2606.07007#bib.bib30)). Thus, SAE interpretation also fundamentally relies on relationships between internal neurons and sets of data examples.

This perspective becomes particularly important in light of empirical SAE phenomena. SAE neurons can resolve polysemanticity (Cunningham et al., [2023](https://arxiv.org/html/2606.07007#bib.bib4); O’Neill et al., [2024](https://arxiv.org/html/2606.07007#bib.bib5)); larger SAEs may split coarse neurons into finer semantic components (O’Neill et al., [2024](https://arxiv.org/html/2606.07007#bib.bib5); Bricken et al., [2023](https://arxiv.org/html/2606.07007#bib.bib16)); feature absorption may occur when one neuron captures examples expected to belong to another (Chanin et al., [2024](https://arxiv.org/html/2606.07007#bib.bib15)); and groups of neurons may co-activate as a feature family (Bricken et al., [2023](https://arxiv.org/html/2606.07007#bib.bib16)). These observations suggest that SAE neurons are not isolated vectors with fixed meanings, but part of a structured relationship defined through input data, activations, and human-interpretable abstractions. Despite extensive empirical study, we still lack a unified framework that explains when these phenomena arise and how they relate to activation geometry (Costa et al., [2025](https://arxiv.org/html/2606.07007#bib.bib9); Fel et al., [2025](https://arxiv.org/html/2606.07007#bib.bib21)).

At the root of this difficulty lies a more fundamental ambiguity: neither “concept” nor “learning” is formally defined in most discussions of concept learning and neuron interpretation\(Ayonrinde and Jaburi,[2025](https://arxiv.org/html/2606.07007#bib.bib20)\)\. This ambiguity echoes a long\-standing philosophical debate\. A Platonic or realist view treats concepts as abstract entities independent of their instances, whereas a nominalist or data\-grounded view treats concepts as abstractions constructed from collections of examples\. Machine learning is closer to the latter\. That is, models are trained on empirical data, and interpretations are validated through examples associated with internal neurons\. From this data\-grounded perspective, a neuron should not be assumed to be a primitive object such as a single vector, but rather interpreted as a set of data points associated with the neuron\.

The data\-grounded view also clarifies what it means for a model to learn a concept\. Human concepts such as “animal” or “food” correspond to coherent sets of examples that humans can group and describe\. A neuron in a model learns a human concept only when the examples associated with the neuron align with a human\-understandable set\. Conversely, when a neuron activates on heterogeneous examples lacking a coherent abstraction, we regard it as polysemantic or uninterpretable\. In this work, we propose a novel framework for understanding concept learning and neuron interpretation\.

We formulate both concept learning and neuron interpretation as a set-alignment problem between human concepts and model-induced concepts, where a concept corresponds to a set and learning corresponds to alignment. It is the implicit biases and underlying assumptions in both the data and the model that make concept learning and neuron interpretation possible.

Under this view, neuron interpretation asks how to characterize the set selected by a neuron or SAE neuron, while concept learning asks whether the learned set aligns with a target human concept.

Therefore, in our proposed mathematical framework, we represent concepts as sets and study their alignment through geometric and set\-theoretic structures\. This framework distinguishes different modes of concept learning, derives conditions under which they arise, and explains empirical SAE phenomena such as feature splitting, feature absorption, and feature families\. We further connect concept learning and neuron interpretation through formal concept lattices, providing a formulation for representing hierarchical concept structure\.

To summarize, our contributions are as follows:

1. We propose a unified geometric and set-theoretic framework for concept learning, human concept alignment, and SAE interpretation, formulating concept learning as set alignment between human-understandable and model-induced concepts.

2. We distinguish three modes of concept learning (concept detection, concept separation, and concept approximation) and derive necessary and sufficient conditions, along with scaling laws governing when each mode is achievable.

3. We show that concept learning and neuron interpretation are related but distinct, and connect them through formal concept lattices that characterize hierarchical structure and neuron semantics.

## 2 Related Work

Mechanistic Interpretability and Sparse Autoencoders. Mechanistic interpretability studies the internal computations of large models (Elhage et al., [2021](https://arxiv.org/html/2606.07007#bib.bib35); Olah, [2022](https://arxiv.org/html/2606.07007#bib.bib36); Olsson et al., [2022](https://arxiv.org/html/2606.07007#bib.bib37)). A key challenge is superposition, where multiple concepts are represented in overlapping neural directions (Elhage et al., [2022](https://arxiv.org/html/2606.07007#bib.bib13)). Sparse autoencoders (SAEs), closely related to dictionary learning (Olshausen and Field, [1997](https://arxiv.org/html/2606.07007#bib.bib2)), learn overcomplete sparse features that reconstruct model activations (Ng and others, [2011](https://arxiv.org/html/2606.07007#bib.bib3)). Recent work shows that SAEs can disentangle superposed representations and recover more monosemantic, human-interpretable features (Cunningham et al., [2023](https://arxiv.org/html/2606.07007#bib.bib4); O’Neill et al., [2024](https://arxiv.org/html/2606.07007#bib.bib5); Templeton, [2024](https://arxiv.org/html/2606.07007#bib.bib14)). More recently, Bhalla et al. ([2026](https://arxiv.org/html/2606.07007#bib.bib24)) find that concepts lie on a low-dimensional shape and that SAEs can globally and locally capture the concept manifold. Our work generalizes Park et al. ([2023](https://arxiv.org/html/2606.07007#bib.bib1)) and Bhalla et al. ([2026](https://arxiv.org/html/2606.07007#bib.bib24)), because we study the general case of concepts in a set-theoretic framework where concepts can be arbitrary measurable sets.

SAE Architectures and Phenomena. A growing literature studies empirical phenomena in SAE features, including polysemanticity and monosemanticity (Elhage et al., [2022](https://arxiv.org/html/2606.07007#bib.bib13)), feature splitting and feature families (Cunningham et al., [2023](https://arxiv.org/html/2606.07007#bib.bib4); Bricken et al., [2023](https://arxiv.org/html/2606.07007#bib.bib16)), and feature absorption (Chanin et al., [2024](https://arxiv.org/html/2606.07007#bib.bib15)). Several SAE variants aim to improve sparsity, feature quality, or structure, including Gated SAEs (Rajamanoharan et al., [2024a](https://arxiv.org/html/2606.07007#bib.bib7)), JumpReLU SAEs (Rajamanoharan et al., [2024b](https://arxiv.org/html/2606.07007#bib.bib8)), Top-$K$ SAEs (Gao et al., [2024](https://arxiv.org/html/2606.07007#bib.bib6)), matching-pursuit SAEs (Costa et al., [2025](https://arxiv.org/html/2606.07007#bib.bib9)), ensemble SAEs (Gadgil et al., [2025](https://arxiv.org/html/2606.07007#bib.bib11)), SPaDE (Hindupur et al., [2025](https://arxiv.org/html/2606.07007#bib.bib17)), and hierarchical SAEs (Leask et al., [2025](https://arxiv.org/html/2606.07007#bib.bib10); Muchane et al., [2025](https://arxiv.org/html/2606.07007#bib.bib12)). In particular, hierarchical and matching-pursuit SAEs seek to capture hierarchical or conditional relationships among concepts. However, recent work also raises concerns: SAEs may find seemingly interpretable features in randomly initialized transformers (Heap et al., [2025](https://arxiv.org/html/2606.07007#bib.bib19)), and very large SAEs can learn pathological concepts (Michaud et al., [2025](https://arxiv.org/html/2606.07007#bib.bib18)).

Concept Learning and Neuron Interpretation. The linear representation hypothesis posits that concepts are represented as directions in activation space and combined approximately linearly (Park et al., [2023](https://arxiv.org/html/2606.07007#bib.bib1)). Since such directions are not directly interpretable, neuron interpretation methods often explain a neuron or feature using its most activating examples. For example, Bills et al. ([2023](https://arxiv.org/html/2606.07007#bib.bib30)) use an LLM to infer concepts from highly activating and random samples, with related work studying black-box neuron interpretation (Singh et al., [2023](https://arxiv.org/html/2606.07007#bib.bib31)). Complementary concept-based methods instead start from data or predefined concepts and search for corresponding neurons or directions (Gurnee et al., [2023](https://arxiv.org/html/2606.07007#bib.bib34); Koh et al., [2020](https://arxiv.org/html/2606.07007#bib.bib33)). Evaluation is crucial for such associations, with Oikarinen et al. ([2025](https://arxiv.org/html/2606.07007#bib.bib32)) summarizing metrics and proposing criteria for testing interpretation faithfulness. Recent work further suggests that concepts may be represented by richer geometric structures rather than single linear directions (Fel et al., [2025](https://arxiv.org/html/2606.07007#bib.bib21); Shafran et al., [2026](https://arxiv.org/html/2606.07007#bib.bib22); Sarfati et al., [2026](https://arxiv.org/html/2606.07007#bib.bib23); Costa et al., [2025](https://arxiv.org/html/2606.07007#bib.bib9); Hindupur et al., [2025](https://arxiv.org/html/2606.07007#bib.bib17)).

Network Capacity and Hyperplane Arrangements. Neural network expressivity is often studied through the regions induced by activation patterns, with the number of regions serving as a measure of capacity (Montúfar et al., [2014](https://arxiv.org/html/2606.07007#bib.bib26); Pascanu et al., [2013](https://arxiv.org/html/2606.07007#bib.bib27)). This view connects naturally to hyperplane arrangements, where neurons define hyperplanes that partition representation space (Stanley and others, [2007](https://arxiv.org/html/2606.07007#bib.bib28)). Closely related to sparse selection mechanisms, Su et al. ([2026](https://arxiv.org/html/2606.07007#bib.bib25)) analyze the capacity of Top-$K$ mixture-of-expert networks by casting expert selection as a hyperplane arrangement problem. This geometric perspective is relevant for understanding the capacity of Top-$K$ SAEs and related sparse architectures.

## 3 Preliminaries

We first review sparse autoencoders (SAEs), focusing on ReLU SAE (Cunningham et al., [2023](https://arxiv.org/html/2606.07007#bib.bib4)) and Top-K SAE (Gao et al., [2024](https://arxiv.org/html/2606.07007#bib.bib6)). Let $x \in \mathbb{R}^{n}$ be an activation vector from a large model. An SAE maps $x$ to a higher-dimensional sparse activation vector $a \in \mathbb{R}^{d}$, where $d \gg n$, and then reconstructs $x$ from $a$:

$$
\begin{aligned}
z &= Enc(x) = W_{enc}(x - b_{pre}) + b_{enc},\\
a &= Act(z),\\
\hat{x} &= Dec(a) = W_{dec}\,a + b_{dec}.
\end{aligned}
$$

Here $Act$ is the SAE activation function, $a_i$ is the activation of the $i$-th SAE neuron, and $\hat{x}$ is the reconstruction of $x$. For ReLU SAE, $Act$ is the ReLU function, and the objective is usually written as $\mathcal{L}_{ReLU} = \|x - \hat{x}\|_2^2 + \lambda \|a\|_1$. For Top-K SAE, $Act$ keeps only the $k$ largest coordinates of the pre-activation vector and sets the rest to zero, imposing a hard sparsity constraint, and the objective is $\mathcal{L}_{topk} = \|x - \hat{x}\|_2^2$.
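The encoder, activation, and decoder steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the weight matrices, the input `x`, the value `lam = 0.5`, and the tie-breaking rule in `top_k` are all hypothetical choices made for the example.

```python
def matvec(W, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * vj for w, vj in zip(row, v)) for row in W]


def encode(x, W_enc, b_enc, b_pre):
    """Pre-activation z = W_enc (x - b_pre) + b_enc."""
    centered = [xi - bi for xi, bi in zip(x, b_pre)]
    return [zi + bi for zi, bi in zip(matvec(W_enc, centered), b_enc)]


def relu(z):
    """ReLU activation: clamp negative pre-activations to zero."""
    return [max(zi, 0.0) for zi in z]


def top_k(z, k):
    """Keep the k largest coordinates of z and zero the rest.

    Ties are broken by lower index (an assumption of this sketch).
    """
    keep = set(sorted(range(len(z)), key=lambda i: (-z[i], i))[:k])
    return [zi if i in keep else 0.0 for i, zi in enumerate(z)]


def decode(a, W_dec, b_dec):
    """Reconstruction x_hat = W_dec a + b_dec."""
    return [xi + bi for xi, bi in zip(matvec(W_dec, a), b_dec)]


def squared_error(x, x_hat):
    """Squared L2 reconstruction error."""
    return sum((xi - hi) ** 2 for xi, hi in zip(x, x_hat))


# Hypothetical toy SAE with n = 2 inputs and d = 4 neurons.
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
W_dec = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
b_enc, b_pre, b_dec = [0.0] * 4, [0.0] * 2, [0.0] * 2
lam = 0.5  # hypothetical L1 coefficient

x = [2.0, -1.0]
z = encode(x, W_enc, b_enc, b_pre)

# ReLU SAE: reconstruction error plus L1 penalty on the activations.
a_relu = relu(z)
loss_relu = squared_error(x, decode(a_relu, W_dec, b_dec)) + lam * sum(abs(ai) for ai in a_relu)

# Top-K SAE with k = 1: reconstruction error only.
a_topk = top_k(z, k=1)
loss_topk = squared_error(x, decode(a_topk, W_dec, b_dec))
```

In this toy setting the ReLU SAE reconstructs `x` exactly and pays only the sparsity penalty, whereas the Top-K SAE with `k = 1` keeps a single neuron and pays in reconstruction error instead.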

Throughout the paper, with a slight abuse of notation, we write $W, b$ for $W_{enc}, b_{enc}$, respectively, and $a_i(x) = Act(\langle w_i, x \rangle + b_i)$ for the activation of SAE neuron $i$.

## 4 Main Framework

### 4.1 Notations and Definitions

Concepts. Let $X \subseteq \mathbb{R}^{n}$ denote the SAE input space, i.e., the activation space of the original model. Although concepts and interpretations are usually described in the raw data space, Nikolaou et al. ([2025](https://arxiv.org/html/2606.07007#bib.bib42)) show that the LLM imposes an invertible map between the raw data space and its internal activation space. Thus, we treat working in the internal space as equivalent to working in the raw data space. We equip $X$ with the Borel $\sigma$-algebra $\mathcal{B}(X)$. A human-understandable concept is a measurable set $C \in \mathcal{B}(X)$, and the human concept set is $\mathcal{C} \subseteq \mathcal{B}(X)$. We use $\mu$ for the data-supported measure used in concept detection and concept separation, and $\nu$ for a measure on the ambient space used in concept approximation. Details are discussed in Section [4.3](https://arxiv.org/html/2606.07007#S4.SS3). Intuitively, $\mu$ evaluates only observed data, whereas $\nu$ also evaluates blank or novel regions outside the observed data support.

SAE Neurons. For a neuron $i$, define the threshold set and its two sides by

$$
\begin{aligned}
H_i^{+} &= \{x \in X : z_i(x) > \tau_i\},\\
H_i^{-} &= \{x \in X : z_i(x) \leq \tau_i\},
\end{aligned}
$$

where $\tau_i$ is a threshold. For ReLU SAE, we usually take $\tau_i = 0$. The positive side $H_i^{+}$ is called the activation region of neuron $i$ under ReLU gating.

SAE Activations. An activation pattern $s$ assigns a sign $\sigma_{s;i} \in \{+, -\}$ to every neuron. Its corresponding total neuron single activation (TNSA) region is

$$
R_s = \bigcap_{i \in [d]} H_i^{\sigma_{s;i}}. \tag{1}
$$

The collection of all activation patterns, or equivalently the corresponding TNSA regions, is denoted by $\mathcal{A}$. The sparsity of a pattern is $|\{i : \sigma_{s;i} = +\}|$.

The single-neuron total activation (SNTA), or simply the activation region of neuron $i$, is

$$
N_i = \bigcup_{s \in \mathcal{A} :\, \sigma_{s;i} = +} R_s. \tag{2}
$$

For ReLU SAE, $N_i = H_i^{+}$. For Top-K SAE, however, $N_i$ is generally only a subset of $H_i^{+}$, because a neuron with a positive score may still fail to enter the top-$k$ set. An example is shown in Figure [1](https://arxiv.org/html/2606.07007#S1.F1). This distinction is important and is discussed in Section [4.2](https://arxiv.org/html/2606.07007#S4.SS2).

For a set of neurons $M \subseteq [d]$, define the multi-neuron activation as

$$
\theta_M = \bigcap_{j \in M} N_j. \tag{3}
$$

The collection of model-learned concepts is denoted by

$$
\Theta = \{\theta_M : M \subseteq [d]\}. \tag{4}
$$

We argue that concept learning and neuron interpretation should use $\theta_M$ rather than only $N_i$ or $R_s$: $N_i$ can be too large and may cover many unrelated regions, leading to polysemanticity, while $R_s$ can be too small, making model-learned concepts fragile. By aggregating selected neurons, $\theta_M$ provides a useful granularity for concept learning and neuron interpretation. Examples are shown in Figure [1](https://arxiv.org/html/2606.07007#S1.F1). Details are discussed in Section [4.3](https://arxiv.org/html/2606.07007#S4.SS3).

Different concepts need not use the same number of neurons. If concept $C$ is represented by $\theta_M$, then the number of neurons used for $C$ is $|M|$.
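To make the three granularities concrete, the sketch below computes empirical versions of $R_s$, $N_i$, and $\theta_M$ on a finite sample under ReLU gating with $\tau_i = 0$. The encoder weights, the sample points, and all function names are hypothetical, and each region is represented by the set of sample indices that fall inside it rather than by the region itself.

```python
from collections import defaultdict


def pre_activations(x, W, b):
    """z_i(x) = <w_i, x> + b_i for every neuron i."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi for row, bi in zip(W, b)]


def sign_pattern(x, W, b, tau=0.0):
    """Activation pattern s: '+' if z_i(x) > tau, else '-'."""
    return tuple("+" if zi > tau else "-" for zi in pre_activations(x, W, b))


def tnsa_regions(points, W, b):
    """Group sample indices by activation pattern (empirical R_s)."""
    regions = defaultdict(set)
    for idx, x in enumerate(points):
        regions[sign_pattern(x, W, b)].add(idx)
    return dict(regions)


def snta(regions, i):
    """N_i: union of R_s over all patterns s in which neuron i is positive."""
    members = set()
    for s, region in regions.items():
        if s[i] == "+":
            members |= region
    return members


def multi_neuron_activation(regions, M, num_points):
    """theta_M: intersection of N_j over j in M (all samples if M is empty)."""
    theta = set(range(num_points))
    for j in M:
        theta &= snta(regions, j)
    return theta


# Hypothetical toy encoder: d = 3 neurons on n = 2 inputs.
W = [[1, 0], [0, 1], [1, 1]]
b = [0, 0, -1]
points = [(1, 1), (2, -1), (-1, 2), (-1, -1), (0.2, 0.3)]

regions = tnsa_regions(points, W, b)
N0 = snta(regions, 0)
theta_01 = multi_neuron_activation(regions, {0, 1}, len(points))
sparsity = {s: s.count("+") for s in regions}
```

Here `N0` contains three of the five samples, each individual pattern region contains exactly one sample, and `theta_01` sits in between, which mirrors the argument that $\theta_M$ offers an intermediate granularity.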

### 4.2 Sparse Autoencoder Architectures

We categorize SAE architectures into two classes: absolute gating and relative gating. In absolute gating, each neuron's activation is determined independently of other neurons. ReLU SAE, JumpReLU SAE, and Gated SAE are examples of absolute gating. In relative gating, a neuron's activation depends on other neurons. Top-K SAE, Matching Pursuit SAE, and SPaDE are examples of relative gating.

The geometric difference between the two mechanisms appears in the SNTA regions. In absolute gating, each neuron's activation region is a half space, $N_i = H_i^{+}$. Thus, for ReLU SAE, $\theta_M = \bigcap_{j \in M} H_j^{+}$. In relative gating, a neuron can be positive but inactive because it is not selected by the competitive gating rule. For Top-K SAE,

$$
N_i = \{x \in X : z_i(x) > \tau_i \text{ and } i \in \mathrm{TopK}_k(z(x))\}.
$$

Hence $N_i \subseteq H_i^{+}$, and $N_i$ is generally a union of patches inside $H_i^{+}$. An example is shown in Figure [1](https://arxiv.org/html/2606.07007#S1.F1). Prior work describes Top-K gating regions as unions of polyhedra (Su et al., [2026](https://arxiv.org/html/2606.07007#bib.bib25); Hindupur et al., [2025](https://arxiv.org/html/2606.07007#bib.bib17)). Therefore, most of our geometric results are first stated for absolute gating. For Top-K SAE, the same framework applies, but the additional relative-gating effect must be considered.
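The inclusion $N_i \subseteq H_i^{+}$ under Top-K gating can be checked directly on a handful of pre-activation vectors. The sketch below is illustrative only: the vectors in `z_samples`, the choice $k = 1$, and the tie-breaking rule are assumptions made for the example.

```python
def top_k_indices(z, k):
    """Indices of the k largest pre-activations (ties broken by lower index)."""
    return set(sorted(range(len(z)), key=lambda i: (-z[i], i))[:k])


def in_half_space(z, i, tau=0.0):
    """Membership in H_i^+: the pre-activation exceeds the threshold."""
    return z[i] > tau


def in_topk_region(z, i, k, tau=0.0):
    """Membership in N_i under Top-K gating: positive and selected."""
    return z[i] > tau and i in top_k_indices(z, k)


# Hypothetical pre-activation vectors z(x) for four inputs, d = 3 neurons.
z_samples = [
    [3.0, 2.0, 1.0],     # neuron 0 is positive and is the top-1 neuron
    [0.5, -1.0, 2.0],    # neuron 0 is positive but beaten by neuron 2
    [-1.0, -2.0, -3.0],  # neuron 0 is largest but not above the threshold
    [1.0, 4.0, 2.0],     # neuron 0 is positive but beaten by neurons 1 and 2
]

H0_plus = {n for n, z in enumerate(z_samples) if in_half_space(z, 0)}
N0_topk = {n for n, z in enumerate(z_samples) if in_topk_region(z, 0, k=1)}
```

Three of the four inputs lie in $H_0^{+}$, but only the first also survives the competition, so `N0_topk` is a strict subset of `H0_plus` in this example.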

### 4.3 Concept Learning

For simplicity, we write $\theta$ for $\theta_M$. Unless otherwise specified, we use "concepts" to refer to human concepts and "units" to refer to model-learned concepts. When a distinction is needed, we use "unit" for a multi-neuron activation pattern and "neuron" for the total activation region of a single neuron.

The goal of concept learning is to align human concepts $\mathcal{C}$ with model-learned concepts $\Theta$. We consider three levels of learning: concept detection, concept separation, and concept approximation.

#### 4.3.1 Concept Detection

Concept detection is the weakest form of concept learning. Its goal is to cover a selected concept. Formally, concept detection holds if, for every $C \in \mathcal{C}$, there exists $\theta \in \Theta$ such that $\mu(C \setminus \theta) = 0$. In SAE, this condition is often easy to satisfy because it only requires at least one unit to cover the concept. However, concept detection alone allows many-to-many mappings: one concept may be covered by multiple units, and one unit may cover multiple concepts. This motivates the stronger notions below.
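As a small illustration of the detection condition and of the many-to-many mappings it permits, the sketch below takes $\mu$ to be the empirical measure on five sample points and represents concepts and units as sets of sample indices. The concept names, the unit names, and the sets themselves are hypothetical.

```python
def empirical_measure(subset, num_points):
    """mu(A) as the fraction of observed samples that fall in A."""
    return len(subset) / num_points


def detects(unit, concept, num_points):
    """True if mu(C \\ theta) = 0, i.e. the unit covers the concept on the data."""
    return empirical_measure(concept - unit, num_points) == 0


def detection_holds(concepts, units, num_points):
    """Every concept is covered by at least one unit."""
    return all(
        any(detects(unit, concept, num_points) for unit in units.values())
        for concept in concepts.values()
    )


def covering_units(concepts, units, num_points):
    """For each concept, list the names of all units that cover it."""
    return {
        cname: sorted(uname for uname, unit in units.items() if detects(unit, concept, num_points))
        for cname, concept in concepts.items()
    }


# Hypothetical concepts and units over five sample indices.
num_points = 5
concepts = {"animal": {0, 1, 2}, "food": {3, 4}}
units = {"theta_a": {0, 1, 2, 3}, "theta_b": {2, 3, 4}, "theta_c": {0, 1, 2, 3, 4}}

holds = detection_holds(concepts, units, num_points)
cover = covering_units(concepts, units, num_points)
```

Detection holds here, yet "animal" is covered by two units and `theta_c` covers both concepts, so detection alone does not single out one unit per concept.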

#### 4.3.2 Concept Separation

Concept separation asks whether a selected concept can be separated from other concepts on the observed data support.

###### Definition 4.1 (Concept separation).

A concept $C$ is said to be separated by $\theta_M$ if (i) $x \in H_i^{+}$ for all $x \in C$ and $i \in M$, and (ii) $x' \in H_j^{-}$ for all $x' \in X \setminus C$ and $j \in [d] \setminus M$.

Concept separation removes some ambiguity in concept detection because the selected unit must cover $C$ exclusively. Empirically, when a single neuron is used as the concept learner, both ReLU and Top-K SAE can fail to separate complicated concepts (Hindupur et al., [2025](https://arxiv.org/html/2606.07007#bib.bib17)).

The main limitation of concept separation is generalization. Since $\mu$ evaluates only observed data, $\theta$ may separate $C$ on the training support while still including large blank regions outside that support. Thus, concept separation is useful for classification-like tasks, but it is not sufficient for novel concept discovery, where the model discovers concepts unknown to users. Novel concept discovery is especially useful in scientific domains (Singh et al., [2025](https://arxiv.org/html/2606.07007#bib.bib54); Stokes et al., [2020](https://arxiv.org/html/2606.07007#bib.bib55)). The importance of novel concept discovery is also highlighted in agentic interpretability (Kim et al., [2025](https://arxiv.org/html/2606.07007#bib.bib56)), where models can help users understand new concepts and correct human concept annotations (Simon and Zou, [2025](https://arxiv.org/html/2606.07007#bib.bib40)).

#### 4.3.3 Concept Approximation

Concept approximation is the strongest form of concept learning. It evaluates whether a unit tightly approximates a concept in the ambient space.

###### Definition 4.2 (Concept approximation).

A concept $C$ is said to be approximated by $\theta_M$ if (i) $x \in H_i^{+}$ for all $x \in C$ and $i \in M$, and (ii) $x' \in H_j^{-}$ for all $x' \in \mathbb{R}^{d} \setminus C$ and $j \in [d] \setminus M$.

The key difference between concept separation and concept approximation is the choice of space: the former is evaluated on $X$, whereas the latter is evaluated in $\mathbb{R}^{d}$. Intuitively, concept approximation requires the unit to surround the concept as tightly as possible. Unlike concept separation, concept approximation can support concept discovery because it penalizes false positives and false negatives beyond the observed data. An analogue of concept approximation is anomaly detection (Ruff et al., [2018](https://arxiv.org/html/2606.07007#bib.bib50)).
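The gap between evaluating on the data support and evaluating in the ambient space can be seen in a one-dimensional toy case. The sketch below does not implement Definitions 4.1 or 4.2 literally; it only contrasts a data-supported check (in the spirit of $\mu$) with an ambient check (in the spirit of $\nu$). The concept $C = [0, 1]$, the other data on $[3, 4]$, the unit $\{x : -x + 2 > 0\}$, and the bounding box $[-10, 10]$ with a uniform grid are all assumptions of the example.

```python
def unit(x, w=-1.0, b=2.0):
    """One-neuron unit theta = {x : w*x + b > 0}; with these values, x < 2."""
    return w * x + b > 0


def in_concept(x):
    """Hypothetical concept C = [0, 1]."""
    return 0.0 <= x <= 1.0


# Observed data: samples of C and of the rest of the data support, [3, 4].
concept_data = [i / 10 for i in range(11)]
other_data = [3 + i / 10 for i in range(11)]

# Data-supported check: the unit covers C and fires on no other observed point.
covers_concept = all(unit(x) for x in concept_data)
false_positives_on_data = sum(unit(x) for x in other_data)

# Ambient check: a midpoint grid on the box [-10, 10] stands in for nu.
steps = 2000
grid = [-10 + (i + 0.5) * 20 / steps for i in range(steps)]
ambient_false_positive_fraction = sum(unit(x) and not in_concept(x) for x in grid) / steps
```

On the observed data the unit looks perfect, with full coverage and no false positives, yet roughly 55% of the ambient box lies inside the unit but outside the concept. This is the kind of blank-region over-coverage that an ambient measure is meant to penalize.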

### 4.4 Neuron Interpretation

![Refer to caption](https://arxiv.org/html/2606.07007v1/x2.png)

Figure 2: Toy example of a concept lattice. From top to bottom, concepts become more specific, and the associated neuron intents become more refined. From bottom to top, concepts become more general, and neuron intents are merged into coarser descriptions.

Concept learning and neuron interpretation are two directions of the same relation between concepts and units. Denote by $\mathcal{N} = [d]$ the set of neurons, and define a relation $\mathcal{R} \subseteq \mathcal{C} \times \mathcal{N}$. The relation $\mathcal{R}$ can be interpreted as a formal context in formal concept analysis (Ganter et al., [1999](https://arxiv.org/html/2606.07007#bib.bib51)), and the formal concepts of this context form a concept lattice. This lattice captures many-to-many correspondences between human concepts and model-learned concepts, providing a more structured view than selecting a single best match in either direction. An example of a concept lattice can be found in Fig. [2](https://arxiv.org/html/2606.07007#S4.F2).
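For readers unfamiliar with formal concept analysis, the sketch below enumerates the formal concepts of a tiny context by closing every subset of objects, using the standard derivation operators. Human concepts play the role of objects and neurons the role of attributes. The relation shown is hypothetical and is not taken from the paper's experiments; how $\mathcal{R}$ would be built from a trained SAE is left open here. The brute-force enumeration is only practical for very small contexts.

```python
from itertools import combinations


def intent(objects_subset, relation, attributes):
    """Attributes (neurons) related to every object (human concept) in the subset."""
    return frozenset(m for m in attributes if all((g, m) in relation for g in objects_subset))


def extent(attributes_subset, relation, objects):
    """Objects (human concepts) related to every attribute (neuron) in the subset."""
    return frozenset(g for g in objects if all((g, m) in relation for m in attributes_subset))


def formal_concepts(objects, attributes, relation):
    """All (extent, intent) pairs that are closed under both derivation operators."""
    found = set()
    for r in range(len(objects) + 1):
        for subset in combinations(sorted(objects), r):
            b = intent(subset, relation, attributes)
            a = extent(b, relation, objects)
            found.add((a, b))
    return found


# Hypothetical context: which neuron is associated with which human concept.
human_concepts = {"animal", "mammal", "dog", "food"}
neurons = {0, 1, 2, 3}
relation = {
    ("animal", 0),
    ("mammal", 0), ("mammal", 1),
    ("dog", 0), ("dog", 1), ("dog", 2),
    ("food", 3),
}

lattice = formal_concepts(human_concepts, neurons, relation)

# Print from the most specific extents to the most general ones.
for a, b in sorted(lattice, key=lambda pair: (len(pair[0]), sorted(pair[0]))):
    print(sorted(a), sorted(b))
```

In this toy context the extents grow as the neuron intents shrink: the pair with intent `{0, 1, 2}` contains only "dog", while the pair with intent `{0}` contains "animal", "mammal", and "dog".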

### 4.5 Sparse Autoencoder Phenomena

We now formulate and explain several common SAE phenomena in our framework\.

Polysemanticity and monosemanticity. Polysemanticity means that a neuron is related to multiple concepts. In terms of $\mathcal{R}$, this means that a neuron is associated with multiple unrelated concepts. Monosemanticity means that this reverse relation is concentrated on one concept, and thus requires the assignment $f$ of concepts to neurons to be injective: different concepts should not be assigned to the same neuron. Therefore, the number of available neurons must be at least the number of concepts. This is made precise in Section [5.3](https://arxiv.org/html/2606.07007#S5.SS3).

Feature splitting. Feature splitting means that a broad neuron in a smaller SAE is split into several more specific neurons in a larger SAE. Formally, if $\theta$ is a broad neuron and $\theta_1, \ldots, \theta_r$ are more specific neurons, then feature splitting can be expressed as $\theta \approx \bigcup_{j=1}^{r} \theta_j$ with $\theta_j \cap \theta_l \approx \varnothing$ for $j \neq l$. The approximate disjointness is consistent with sparsity: if the split neurons overlapped heavily, then data in the overlap would activate many neurons at once and violate the sparsity constraint.

Feature absorption. Feature absorption is a failure mode of hierarchical learning. Suppose $C_i \subset C_j$, where $C_i$ is the child concept and $C_j$ is the parent concept. Ideally, data in $C_i$ should also activate the model concept for $C_j$. Absorption occurs when $\mu(C_i \cap \theta_{C_j}^{c}) > 0$. That is, the parent feature fails to activate on part of its child concept. This can happen because activating both parent and child features increases the sparsity cost.

Feature family. A feature family consists of several neurons that tend to activate together. Formally, a family $\theta_1, \ldots, \theta_r$ has nontrivial co-activation if $\bigcap_{l=1}^{r} \theta_l \neq \varnothing$. These neurons may represent different aspects of the same concept or nearby concepts in a semantic family.

Hierarchical concepts. Hierarchical concepts correspond to set inclusion. If $C_i \subset C_j$, then $C_i$ is more specific than $C_j$. For example, "mammal" is contained in "animal". Ideally, the learned model concepts should satisfy $\theta_{C_i} \subset \theta_{C_j}$. However, maintaining this hierarchy can increase the sparsity cost because data in the child concept may need to activate both child and parent neurons.
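Each of the set-theoretic conditions in this subsection can be evaluated on a finite sample. The sketch below is a hypothetical illustration: sets hold sample indices, $\mu$ is the empirical measure on ten points, and the two numbers returned by `splitting_gaps` are one possible way to quantify the "$\approx$" relations, which the text leaves informal.

```python
from itertools import combinations


def measure(subset, num_points):
    """Empirical measure: fraction of samples in the subset."""
    return len(subset) / num_points


def splitting_gaps(theta, parts, num_points):
    """Return (coverage gap, largest pairwise overlap) for theta ~ union of parts."""
    union = set().union(*parts)
    gap = measure(theta ^ union, num_points)
    overlap = max((measure(p & q, num_points) for p, q in combinations(parts, 2)), default=0.0)
    return gap, overlap


def absorption(child_concept, theta_parent, num_points):
    """Mass of the child concept on which the parent unit fails to activate."""
    return measure(child_concept - theta_parent, num_points)


def co_activates(family):
    """Nontrivial co-activation: the units share at least one sample."""
    return bool(set.intersection(*family))


def hierarchy_preserved(theta_child, theta_parent):
    """Strict inclusion of the child unit in the parent unit."""
    return theta_child < theta_parent


num_points = 10

# Feature splitting: a broad unit and three finer, disjoint units.
split = splitting_gaps(set(range(8)), [{0, 1, 2}, {3, 4, 5}, {6, 7}], num_points)

# Feature absorption: the parent unit misses samples 0 and 1 of the child concept.
child_concept = {0, 1, 2, 3}
theta_parent = {2, 3, 4, 5, 6, 7}
absorbed = absorption(child_concept, theta_parent, num_points)

# Feature family and hierarchy checks.
family_overlaps = co_activates([{0, 1, 2}, {1, 2, 3}, {2, 5}])
nested = hierarchy_preserved({2, 3}, theta_parent)
broken = hierarchy_preserved({0, 1, 2, 3}, theta_parent)
```

The same parent unit that absorbs part of the child concept also breaks the inclusion check when the child unit covers the whole child concept, which is consistent with reading absorption as a failure of hierarchical learning.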

## 5 Theoretical Results

In this section, we present the main theoretical results for concept learning and neuron interpretation, including necessary and sufficient conditions, failure conditions, and error bounds. We focus primarily on ReLU SAE and defer the corresponding results for Top-K SAE to Appendix [9.7](https://arxiv.org/html/2606.07007#S9.SS7).

We begin with the following assumption\.

###### Assumption 5\.1\.

$\mathcal{C}$ has finite cardinality (i.e., there are finitely many concepts). Every concept $C\in\mathcal{C}$ is compact, and $X=\bigcup_{C\in\mathcal{C}}C$.

The finite cardinality assumption is natural because a human cannot perceive infinitely many concepts. In $\mathbb{R}^{n}$, compact sets are exactly the closed and bounded sets by the Heine–Borel theorem (Munkres, [1984](https://arxiv.org/html/2606.07007#bib.bib52)). This assumption is reasonable and weak: closedness means that if a data point is arbitrarily close to a concept, then it belongs to that concept; boundedness rules out infinite-valued data or concepts; and $X=\bigcup_{C\in\mathcal{C}}C$ ensures that every data point belongs to at least one concept. In particular, concepts consisting of finitely many points are automatically compact. Overall, assuming concepts to be measurable and compact rules out pathological cases while still allowing flexible shapes, such as disconnected sets, Swiss rolls, and helices.

### 5.1 Concept Separation

We start with a simple warm-up case: using one neuron to separate a concept $C$. Let $N$ denote $X\setminus C$, equivalently $\bigcup_{C'\in\mathcal{C},\,C'\neq C}(C'\setminus C)$. Recall that separating $C$ requires a neuron, or more generally a unit, to place $C$ on its positive side and $N$ on its negative side. The following theorem gives the necessary and sufficient condition for separation by one neuron.

###### Theorem 5\.2\(Concept separation with one neuron\)\.

$C$ can be separated from $N$ with one neuron if and only if $\mathrm{Conv}(C)\cap\overline{\mathrm{Conv}(N)}=\varnothing$, where $\mathrm{Conv}$ denotes the convex hull.

The proof can be found in Appendix[9\.4\.1](https://arxiv.org/html/2606.07007#S9.SS4.SSS1)\. If we use one neuron for each concept to separate all concepts, the following two corollaries follow immediately\.

###### Corollary 5\.3\.

1. All concepts in $\mathcal{C}$ can be separated from each other if and only if $\mathrm{Conv}(C_{i})\cap\overline{\mathrm{Conv}(N_{i})}=\varnothing$ for all $C_{i}\in\mathcal{C}$, where $N_{i}=X\setminus C_{i}$.
2. When concept separation is possible for $\mathcal{C}$, the minimum number of selected neurons is $|\mathcal{C}|$.

The proof can be found in Appendix [9.4.2](https://arxiv.org/html/2606.07007#S9.SS4.SSS2). These results show that concept separation with a single neuron is highly demanding. Corollary [5.3](https://arxiv.org/html/2606.07007#S5.Thmtheorem3) also clarifies a difficulty in neuron interpretation: monosemanticity is hard to achieve because concepts in LLM activation spaces can be twisted, and their convex hulls are unlikely to be disjoint. Despite this difficulty, SAEs can mitigate polysemanticity because (1) they introduce more neurons as candidates for concept separation, as shown below, and (2) concepts that can be inherently disentangled or split may satisfy these requirements. Our proofs are constructive; training an SAE is not guaranteed to find such constructions, but having more neurons increases the chance of successful separation.
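For finite point sets, the condition in Theorem 5.2 reduces to strict linear separability, which can be probed with a classical perceptron. The sketch below is illustrative only: `find_separating_neuron`, the epoch budget, and the toy points are assumptions rather than the paper's construction, and a `None` result suggests but does not prove non-separability.

```python
def find_separating_neuron(pos, neg, epochs=1000):
    """Perceptron search for (w, b) with w.x + b > 0 on pos and < 0 on neg.

    Returns (w, b) on success, or None if no separator is found within
    the epoch budget.
    """
    dim = len(pos[0])
    w, b = [0.0] * dim, 0.0
    labelled = [(x, 1) for x in pos] + [(x, -1) for x in neg]
    for _ in range(epochs):
        mistakes = 0
        for x, y in labelled:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified or on the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            return w, b
    return None


# Disjoint convex hulls: a single neuron separates C from N.
C_easy = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
N_easy = [(3.0, 3.0), (4.0, 3.0), (3.0, 4.0)]
easy = find_separating_neuron(C_easy, N_easy)

# C lies inside Conv(N): no single neuron can separate it.
C_hard = [(0.0, 0.0)]
N_hard = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
hard = find_separating_neuron(C_hard, N_hard)

print(easy is not None, hard is None)
```

The second configuration is the "surrounded concept" case: the origin belongs to the convex hull of the four surrounding points, so every candidate hyperplane misclassifies at least one point.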

We next consider separation by multiple neurons, or a unit\. In this case, the requirements can be weaker:

###### Theorem 5\.4\(Concept separation with multiple neurons\)\.

$C$ can be separated from $N$ by a unit if and only if $\mathrm{Conv}(C)\cap\overline{N}=\varnothing$.

The proof can be found in Appendix [9.4.3](https://arxiv.org/html/2606.07007#S9.SS4.SSS3). The corresponding corollary for separating all concepts with units is as follows.

###### Corollary 5\.5\.

All concepts in $\mathcal{C}$ can be separated from each other by units if and only if $\mathrm{Conv}(C_{i})\cap\overline{(C_{j}\setminus C_{i})}=\varnothing$ for all $C_{i},C_{j}\in\mathcal{C}$ with $C_{j}\neq C_{i}$.

The proof can be found in Appendix[9\.4\.4](https://arxiv.org/html/2606.07007#S9.SS4.SSS4)\. These results show that unit\-based concept separation has much weaker requirements than neuron\-based separation, leading to cleaner concept learning and more disentangled neuron interpretation\. For example, with a single neuron, a concept cannot be separated when other concepts surround it; with a unit, such separation may still be possible\. One consequence is that feature splitting is not universal across all concepts, and hierarchical concepts can be difficult to learn without architectural changes\.
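The surrounded-concept example can be made concrete by modelling a unit as the intersection of half-space activation regions. This is a minimal sketch under that assumption; the four "box" neurons and the toy points are hypothetical and hand-picked, not learned by an SAE.

```python
def neuron_active(w, b, x):
    """A ReLU neuron is active on x when its pre-activation is positive."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b > 0


def unit_active(neurons, x):
    """A unit is active on x only when all of its neurons are active."""
    return all(neuron_active(w, b, x) for w, b in neurons)


# Target concept near the origin, surrounded by other concepts.
C = [(0.0, 0.0), (0.2, -0.1), (-0.1, 0.2)]
N = [(2.0, 0.0), (-2.0, 0.0), (0.0, 2.0), (0.0, -2.0)]

# Four half-spaces whose intersection is the open box -1 < x, y < 1.
box_unit = [
    ((1.0, 0.0), 1.0),   # x > -1
    ((-1.0, 0.0), 1.0),  # x < 1
    ((0.0, 1.0), 1.0),   # y > -1
    ((0.0, -1.0), 1.0),  # y < 1
]

covers_concept = all(unit_active(box_unit, x) for x in C)
excludes_rest = not any(unit_active(box_unit, x) for x in N)
# Each neuron alone also fires on some point outside the concept.
each_neuron_contaminated = all(
    any(neuron_active(w, b, x) for x in N) for w, b in box_unit
)

print(covers_concept, excludes_rest, each_neuron_contaminated)
```

Every individual neuron is polysemantic on this data, yet their intersection is exclusive to the target concept, which is the gap between Theorem 5.2 and Theorem 5.4.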

Although the above results are primarily built on finite\-dimensional spaces, we maintain flexibility to extend to infinite\-dimensional spaces, as discussed in Appendix[9\.4\.1](https://arxiv.org/html/2606.07007#S9.SS4.SSS1)\.

When perfect separation is not possible, we need to study the resulting error\. We define the separation error as follows\.

###### Definition 5\.6\(Separation error\)\.

The separation error is the symmetric difference on the data support:

$$e_{\mathrm{sep}}(C,\theta)=\mu(C\,\Delta\,\theta)=\underbrace{\mu(\theta\setminus C)}_{\text{contamination error } e_{c}}+\underbrace{\mu(C\setminus\theta)}_{\text{missing error } e_{m}},$$
where $\mu$ is a Borel probability measure supported on $X$.

This definition also applies when $\theta$ is a single neuron. Intuitively, $e_{m}$ measures how much of the target concept is missed by the selected unit, while $e_{c}$ measures how much unrelated content is covered by the selected unit, leading to polysemanticity. Variants of separation error are discussed in Appendix [9.4.5](https://arxiv.org/html/2606.07007#S9.SS4.SSS5). The separation error is zero when perfect separation holds up to a $\mu$-null set. In non-separable concept learning, however, it cannot reach zero and has an irreducible component.
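On a finite sample, the decomposition in Definition 5.6 can be computed with the empirical measure. The following sketch assumes data points are integer ids; `separation_error` and the toy sets are hypothetical names and values chosen for illustration.

```python
def separation_error(concept, theta, data):
    """Return (e_sep, e_c, e_m) under the empirical measure on `data`."""
    n = len(data)
    e_c = len((theta - concept) & data) / n  # contamination: theta minus C
    e_m = len((concept - theta) & data) / n  # missing: C minus theta
    return e_c + e_m, e_c, e_m


data = set(range(100))
concept = set(range(0, 20))  # target concept C
theta = set(range(5, 30))    # activation region of the selected unit

e_sep, e_c, e_m = separation_error(concept, theta, data)
print(e_sep, e_c, e_m)
```

In this example the unit covers ten points outside the concept and misses five points inside it, so the contamination error is twice the missing error.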

When concepts are not disjoint, separation on the overlapping boundary reduces to concept approximation, which we discuss next\.

### 5.2 Concept Approximation

We define the approximation error as follows\.

###### Definition 5\.7\(Approximation error\)\.

The approximation error is the symmetric difference under the ambient measure:

$$e_{\mathrm{app}}(C,\theta)=\nu(C\,\Delta\,\theta),$$
where $\nu$ is a Borel probability measure supported on $\mathbb{R}^{n}$.

Unlike concept separation, which considers only observed data, concept approximation must also account for novel data\. Although a single neuron may separate concepts on the observed support, it is generally insufficient for concept approximation\. We therefore focus on multi\-neuron activations, or units\. The necessary and sufficient condition is as follows\.

###### Theorem 5\.8\(Concept approximation condition\)\.

A concept $C\in\mathcal{C}$ can be arbitrarily well approximated under the approximation error by a unit if and only if $C$ is convex up to a $\nu$-null set.

The proof can be found in Appendix [9.5.1](https://arxiv.org/html/2606.07007#S9.SS5.SSS1). Thus, for all concepts to be arbitrarily well approximated by units, each concept must be convex up to a $\nu$-null set. The error rate is given by the following theorem.

###### Theorem 5\.9\(Concept approximation error rate\)\.

Under regularity and boundary-smoothness conditions, a concept $C\in\mathcal{C}$ can be approximated by a unit $\theta_{M}$ with error

$$e_{\mathrm{app}}(C,\theta_{M})\lesssim e_{\mathrm{irr}}+A|M|^{-\frac{2}{r-1}},$$
where $A$ is a constant related to boundary smoothness, $r$ is the effective dimension of $C$, and $e_{\mathrm{irr}}$ is an irreducible error. In particular, $e_{\mathrm{irr}}=0$ when $C$ is convex, or more generally when $\nu(\mathrm{Conv}(C)\setminus C)=0$.

Details and proofs can be found in Appendix [9.5.2](https://arxiv.org/html/2606.07007#S9.SS5.SSS2). Intuitively, the irreducible error is nonzero when $C$ is non-convex to a positive degree, because a unit is essentially a convex polytope and therefore cannot eliminate the penalty from non-convexity. To approximate all concepts arbitrarily well, each concept must be convex (up to a $\nu$-null set), and the number of neurons must satisfy $d\geq\sum_{i=1}^{|\mathcal{C}|}|M_{i}|$ when $\theta_{M_{i}}$ is used to approximate $C_{i}$; this lower bound can be reduced when neurons are reused across concepts. Although concept approximation appears stricter than concept separation and requires more neurons, perfect concept approximation can impose weaker structural requirements in cases where concepts overlap: overlapping convex concepts can be approximated arbitrarily well, whereas concept separation is impossible in this setting. The larger neuron requirement in concept approximation helps exclude unrelated regions and better resolve polysemanticity.
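The rate in Theorem 5.9 can be illustrated with a classical special case: a unit disk ($r=2$, convex, so $e_{\mathrm{irr}}=0$) approximated by an inscribed regular polygon whose $m$ edges play the role of $|M|=m$ neurons. This is only a sketch under simplifying assumptions: it uses plain area instead of a probability measure $\nu$, and the inscribed polygon is a hand-picked unit rather than a learned one.

```python
import math


def inscribed_polygon_error(m):
    """Area of the unit disk not covered by an inscribed regular m-gon."""
    polygon_area = 0.5 * m * math.sin(2 * math.pi / m)
    return math.pi - polygon_area


# With r = 2 the exponent -2 / (r - 1) equals -2, so err * m**2 should
# approach a constant (here 2 * pi**3 / 3) as m grows.
limit = 2 * math.pi ** 3 / 3
for m in (8, 16, 32, 64):
    err = inscribed_polygon_error(m)
    print(m, err, err * m ** 2, limit)

ratio = inscribed_polygon_error(32) / inscribed_polygon_error(64)
print(ratio)  # close to 4: doubling the neurons quarters the error
```

The constant $2\pi^{3}/3$ follows from the Taylor expansion of $\sin$ and plays the role of $A$ in this toy setting.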

Note that although the above theories for concept separation and concept approximation are for concept learning, they also apply to neuron interpretation because the selected neuron/unit forms an exclusive relation with the target concept, so the interpretation of the neuron/unit corresponds to the target concept\.

### 5.3 Concept Learning Capacity

To make monosemanticity possible, the selected concept learning function $f$ should be approximately injective. This yields a necessary combinatorial capacity condition.

Without loss of generality, let $d$ be the number of non-dead neurons. Let $k_{c}$ be the maximum number of neurons allowed to represent a concept $C$. The value of $k_{c}$ is an interpretation budget; for Top-$K$ SAE, one should have $k_{c}\leq k$, where $k$ is the Top-$K$ sparsity.

###### Theorem 5\.10\(Concept learning capacity\)\.

Suppose perfect monosemanticity holds for all concepts, and each concept is represented by at most $k_{c}$ neurons. In the regime $d\gg k_{c}$, this requires approximately

$$d\gtrsim\left(k_{c}!\,|\mathcal{C}|\right)^{1/k_{c}}.\tag{5}$$

The proof can be found in Appendix[9\.6](https://arxiv.org/html/2606.07007#S9.SS6)\. This is a necessary condition and does not by itself guarantee concept separation or approximation\. Although it may appear to contradict Corollary[5\.3](https://arxiv.org/html/2606.07007#S5.Thmtheorem3), Corollary[5\.5](https://arxiv.org/html/2606.07007#S5.Thmtheorem5), and Theorem[5\.8](https://arxiv.org/html/2606.07007#S5.Thmtheorem8), those earlier results do not account for sparsity, whereas sparsity is explicitly considered here\.
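To get a feel for Eq. (5), one can compare its right-hand side with an exact count. The sketch below assumes the bound comes from requiring at least $|\mathcal{C}|$ distinct neuron subsets of size $k_{c}$, i.e. $\binom{d}{k_{c}}\geq|\mathcal{C}|$; this counting argument is an assumption made here for illustration (the actual proof is in the appendix), and both function names are hypothetical.

```python
import math


def capacity_lower_bound(num_concepts, k_c):
    """Right-hand side of Eq. (5): (k_c! * |C|) ** (1 / k_c)."""
    return (math.factorial(k_c) * num_concepts) ** (1.0 / k_c)


def min_width_by_counting(num_concepts, k_c):
    """Smallest d with C(d, k_c) >= |C| (assumed counting argument)."""
    d = k_c
    while math.comb(d, k_c) < num_concepts:
        d += 1
    return d


for k_c in (1, 2, 3):
    approx = capacity_lower_bound(1000, k_c)
    exact = min_width_by_counting(1000, k_c)
    print(k_c, round(approx, 2), exact)
```

For 1000 concepts the required width falls from 1000 neurons at $k_{c}=1$ to a few dozen at $k_{c}=2$ or $3$, which shows how strongly a larger interpretation budget relaxes this necessary condition.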

### 5.4 Concept Learning and Neuron Interpretation

Although neuron interpretation has not been specified in the previous sections, the theoretical results above still provide insight into it\. In particular, using individual neurons or multi\-neuron units for concept learning leads to different levels of interpretation quality, such as different degrees of monosemanticity\. We now state the link between concept learning and neuron interpretation in an algebraic way\.

Define $U$ as the power set of all data, $U=\mathcal{P}(X)$, so that the human concept family $\mathcal{C}$ is a finite subset of $U$. Recall that $\mathcal{N}=[d]$ is the set of neurons, $M\in\mathcal{P}(\mathcal{N})$ is a set of neurons, and $\theta_{M}\subseteq X$ is the activation region of $M$. For a single neuron $N\in\mathcal{N}$, we write $\theta_{N}:=\theta_{\{N\}}$. Since each $\theta_{M}$ is a subset of $X$, we also regard it as an element of $U$.

We define a binary relation $R\subseteq U\times\mathcal{N}$ by

$$C\,R\,N\Longleftrightarrow C\subseteq\theta_{N}.$$
Intuitively, $C\,R\,N$ means that neuron $N$ is active on the whole data region $C$. When $C\in\mathcal{C}$, this says that $N$ covers the whole human concept region $C$. This relation is not assumed to be a function: a concept may be represented by multiple neurons, and a neuron may be related to multiple concepts.

The relation $R$ induces two maps

$$f:U\rightarrow\mathcal{P}(\mathcal{N}),\qquad f(C)=\{N\in\mathcal{N}:C\,R\,N\},$$
$$g:\mathcal{P}(\mathcal{N})\rightarrow U,\qquad g(M)=\bigcap_{N\in M}\theta_{N},$$
with the convention that $g(\emptyset)=X$. The map $f$ sends a data region to the set of neurons that are active on the entire region. Thus, $f$ corresponds to the concept-to-neuron direction. The map $g$ sends a set of neurons to their common activation region. Thus, $g$ corresponds to the neuron-to-region direction used in neuron interpretation.
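The two maps are easy to compute on finite data, which also makes the closure behaviour visible. This is a minimal sketch assuming integer data ids and three hypothetical activation regions; the function names `f` and `g` mirror the maps above but the data are invented for illustration.

```python
def f(region, theta):
    """Neurons active on the entire region: {N : region is a subset of theta[N]}."""
    return frozenset(n for n, act in theta.items() if region <= act)


def g(neurons, theta, X):
    """Common activation region of a neuron set, with g(empty set) = X."""
    out = set(X)
    for n in neurons:
        out &= theta[n]
    return frozenset(out)


X = frozenset(range(10))
theta = {
    0: frozenset(range(0, 6)),
    1: frozenset(range(3, 10)),
    2: frozenset(range(4, 8)),
}

C = frozenset({4, 5})
M = f(C, theta)            # all three neurons cover C
closure = g(M, theta, X)   # their common region is exactly C

small = frozenset({4})
small_closure = g(f(small, theta), theta, X)  # strictly larger than {4}

print(sorted(M), sorted(closure), sorted(small_closure))
```

The region `{4, 5}` is returned unchanged by the round trip, so it is a fixed point, while `{4}` is enlarged to `{4, 5}`: the neurons cannot distinguish the two points, so the forward and reverse directions need not agree.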

Due to space limits, the complete results are given in Appendix [9.3](https://arxiv.org/html/2606.07007#S9.SS3). They include the construction of the concept lattice and algebraic explanations of SAE phenomena.

![Refer to caption](https://arxiv.org/html/2606.07007v1/x3.png)

Figure 3: Concept separation, concept approximation, concept-learning capacity, and concept learning–neuron interpretation disagreement for ReLU SAEs. Panels (a)–(h) and (k)–(l) use expansion factor 8; panels (i)–(j) vary the expansion factor. (a)(b): F1 score and visualization for a neuron-separable concept (Theorem [5.2](https://arxiv.org/html/2606.07007#S5.Thmtheorem2)). (c)(d): F1 score and visualization for a non-neuron-separable but unit-separable concept (Theorem [5.4](https://arxiv.org/html/2606.07007#S5.Thmtheorem4)). (e)(f): approximation F1 score and visualization for an easier-to-approximate concept (Theorems [5.8](https://arxiv.org/html/2606.07007#S5.Thmtheorem8) and [5.9](https://arxiv.org/html/2606.07007#S5.Thmtheorem9)); the blue region inside the convex hull of all concepts represents novel/unseen data. (g)(h): approximation F1 score and visualization for a harder-to-approximate concept. (i)(j): concept-learning capacity, measured by the best F1 score (Theorem [5.10](https://arxiv.org/html/2606.07007#S5.Thmtheorem10)), on disjoint and overlapping concepts. (k)(l): concept learning–neuron interpretation disagreement ratio (Section [4.4](https://arxiv.org/html/2606.07007#S4.SS4)) for the selected concepts in (d) and (h).

![Refer to caption](https://arxiv.org/html/2606.07007v1/x4.png)

Figure 4: Concept separation, concept approximation, concept-learning capacity, and concept learning–neuron interpretation disagreement for Top-$K$ SAEs. The panel layout, expansion factors, and panel descriptions are the same as in Figure 3.

## 6 Experiments

Our theoretical results are constructive\. In a trained SAE, however, the neurons are fixed, so concept learning becomes a search problem over the learned model\-concept set:

$$\theta_{C}^{*}=\arg\min_{\theta\in\Theta}\mathrm{metric}(C,\theta),\tag{6}$$
where $\mathrm{metric}$ is a task-dependent loss, such as separation or approximation error. For score-based metrics such as F1, we equivalently maximize the score. Thus, the constructive results characterize when concept learning is possible in principle, while the empirical setting measures how well the trained neurons approximate the constructive results. Our experiments study how concept learning quality changes with the number of selected neurons, SAE expansion factor, and SAE sparsity. In particular, we ask whether larger SAEs provide more candidate features for representing a concept, and whether concept learning and neuron interpretation induce the same feature–concept correspondence. We empirically study Corollaries [5.3](https://arxiv.org/html/2606.07007#S5.Thmtheorem3) and [5.5](https://arxiv.org/html/2606.07007#S5.Thmtheorem5), Theorems [5.9](https://arxiv.org/html/2606.07007#S5.Thmtheorem9) and [5.10](https://arxiv.org/html/2606.07007#S5.Thmtheorem10), and Section [4.4](https://arxiv.org/html/2606.07007#S4.SS4).

### 6.1 Setup

Data and Model. We use two-dimensional synthetic data for ease of visualization and analysis. Each human concept is represented by a cluster in the input space. This controlled setting lets us directly compare the geometry of human concepts with the model concepts induced by SAE features. We consider two data configurations: mutually disjoint concepts and partially overlapping concepts. We train ReLU SAEs and Top-$K$ SAEs while varying expansion factor and sparsity. All models are trained until convergence. Details are provided in Appendix [9.2](https://arxiv.org/html/2606.07007#S9.SS2).

Metrics. Following Hindupur et al. ([2025](https://arxiv.org/html/2606.07007#bib.bib17)), we use F1 score as the primary metric. To obtain a more complete view of concept-learning quality, we also report the separation error $e_{\mathrm{sep}}$ and approximation error $e_{\mathrm{app}}$ in the case study in Section [7.3](https://arxiv.org/html/2606.07007#S7.SS3).

Neuron Selection Algorithms. For an SAE with target sparsity $L_{0}=k$, we evaluate units containing at most $k$ selected neurons. Selecting an SAE feature subset, or equivalently a set of hyperplanes, to optimize a target metric is combinatorial. Exhaustive search quickly becomes infeasible: for example, choosing exactly 8 features from an SAE of width 64 requires evaluating $\binom{64}{8}=4{,}426{,}165{,}368$ candidate subsets. We therefore use heuristic selection. In the main results, we report top-$N$ selection, which is closest to common interpretability practice: features are ranked by their score for the target concept, and the top $N$ features are selected to form the unit. Unless stated otherwise, the x-axis reports the exact number $N$ of selected neurons; when reporting a best score, we optimize over feasible $N\leq k$.
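The top-$N$ heuristic described above can be sketched on finite data as follows. This is an illustrative reading, not the authors' code: it assumes the per-feature score is F1 on data ids, that a unit's region is the intersection of the selected features' activation regions, and that ties are broken by dictionary order; `f1`, `top_n_unit`, and the toy regions are hypothetical.

```python
import math


def f1(concept, region):
    """F1 score between a target concept and an activation region."""
    tp = len(concept & region)
    if tp == 0:
        return 0.0
    precision = tp / len(region)
    recall = tp / len(concept)
    return 2 * precision * recall / (precision + recall)


def top_n_unit(concept, theta, n):
    """Rank features by individual F1 and intersect the top n regions."""
    ranked = sorted(theta, key=lambda j: f1(concept, theta[j]), reverse=True)
    chosen = ranked[:n]
    region = set.intersection(*(theta[j] for j in chosen))
    return chosen, region


concept = set(range(0, 10))
theta = {
    0: set(range(0, 20)),                       # covers C plus 10 extra points
    1: set(range(0, 10)) | set(range(30, 50)),  # covers C plus 20 extra points
    2: set(range(50, 60)),                      # unrelated feature
}

chosen, region = top_n_unit(concept, theta, 2)
print(chosen, f1(concept, region))

# Size of the exhaustive search space quoted in the text.
print(math.comb(64, 8))  # 4426165368
```

Ranking costs one score evaluation per feature, in contrast to the billions of subsets an exhaustive search over 8-feature units would need.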

### 6.2 Results

The theory suggests that enlarging the candidate set (through a larger expansion factor, a higher $L_{0}$, or a larger selection budget) should improve the chance of finding a good concept learner. However, exact-$N$ performance need not be monotone. We therefore study how F1 score and disagreement ratio depend on SAE size, sparsity, and the number of selected neurons.

We present the results for ReLU SAEs and Top-$K$ SAEs in Fig. [3](https://arxiv.org/html/2606.07007#S5.F3) and Fig. [4](https://arxiv.org/html/2606.07007#S5.F4), respectively.

#### 6.2.1 Concept Separation and Approximation

To study the effect of the number of selected neurons, we fix expansion factor 8 for panels (a)–(h) in Fig. [3](https://arxiv.org/html/2606.07007#S5.F3) and Fig. [4](https://arxiv.org/html/2606.07007#S5.F4). For each concept, we plot one curve per $L_{0}$. In the discussion below, we report the $L_{0}$ curve that achieves the best F1 score in the corresponding panel.

For a neuron-separable concept (panels (a),(b)), both ReLU and Top-$K$ SAEs can separate the concept with a single neuron, achieving perfect F1. For a non-neuron-separable but unit-separable concept (panels (c),(d)), multi-neuron units substantially improve performance. In the ReLU SAE at $L_{0}=5$, F1 increases from 0.4144 with one neuron to 0.8529 with a 4-neuron unit. In the Top-$K$ SAE at $L_{0}=6$, F1 increases from 0.4559 to 0.9646 with a 3-neuron unit. Although the learned units are not always perfect, the visualizations show that intersecting multiple neurons yields more exclusive activation regions and can represent more complex concepts than a single neuron.

Panels (e)–(h) study concept approximation with overlapping concepts and novel/unseen probe regions. Concept clusters are shown as circles. The blue region is the convex hull of all concepts after excluding seen concept regions; we use a soft boundary, treating points within the 95th-percentile distance to a concept cluster as seen. Panels (e),(f) show an easier-to-approximate concept that overlaps with only one other concept. In the ReLU SAE at $L_{0}=6$, F1 increases from 0.710 with one neuron to 0.9281 with a 4-neuron unit. In the Top-$K$ SAE at $L_{0}=4$, F1 increases from 0.7815 to 0.9089 with a 5-neuron unit. Panels (g),(h) show a harder-to-approximate concept that overlaps with three other concepts. In the ReLU SAE at $L_{0}=7$, F1 increases from 0.2829 with one neuron to 0.6261 with a 5-neuron unit; in the Top-$K$ SAE, the best multi-neuron unit reaches a substantially higher F1 than any single neuron, but still remains below the easier approximation case. These results show that a single neuron is generally insufficient for concept approximation, and that multi-neuron units improve approximation but do not eliminate the difficulty of heavy overlap.

The visualizations also clarify the difference between separation and approximation\. Concept separation only needs to separate observed concepts, so its learned regions may remain unbounded\. Concept approximation must also exclude novel/unseen regions, so the learned unit tends to shrink and bound the target concept more tightly\. This is most visible when comparing panels \(f\) and \(h\): as the target concept overlaps more heavily with others, the unit becomes more closed around the target\.

We also observe that F1 is not monotone in the exact number of selected neurons: it often first increases and then drops to zero when too many neurons are selected\. We analyze this phenomenon in Section[7\.2](https://arxiv.org/html/2606.07007#S7.SS2)\. We further analyze the difference between separation and approximation in Section[7\.1](https://arxiv.org/html/2606.07007#S7.SS1)\.

#### 6.2.2 Concept Learning Capacity

To study the effect of expansion factor on concept-learning capacity, we aggregate each concept by taking the number of selected neurons that achieves the best F1 score, and then plot one curve per $L_{0}$. We use this best F1 score as an operational measure of capacity: a higher-capacity SAE should separate or approximate concepts more accurately. Panels (i),(j) in Fig. [3](https://arxiv.org/html/2606.07007#S5.F3) and Fig. [4](https://arxiv.org/html/2606.07007#S5.F4) show the results.

For ReLU SAEs, increasing the expansion factor from 2 to 16 substantially improves F1 at fixed $L_{0}$, after which the gains plateau. This is expected: larger SAEs provide a larger pool of candidate neurons, but once useful feature combinations are available, additional width brings diminishing returns. The plateau is even clearer for Top-$K$ SAEs. Disjoint concepts are easier to learn than overlapping concepts, as reflected by their higher peak F1 scores. At fixed expansion factor, larger $L_{0}$ generally improves F1 because each concept can use a larger selection budget.

#### 6.2.3 Neuron Interpretation

To study concept learning–neuron interpretation disagreement, we again choose, for each concept, the number of selected neurons that gives the best F1 score, and then compute the corresponding disagreement ratio. We plot one curve per $L_{0}$ in panels (k),(l) of Fig. [3](https://arxiv.org/html/2606.07007#S5.F3) and Fig. [4](https://arxiv.org/html/2606.07007#S5.F4). For a target concept $A$, let

$$\begin{aligned}\theta^{*}&=\operatorname*{arg\,max}_{\theta\in\Theta}s(A,\theta),\\ B&=\operatorname*{arg\,max}_{C\in\mathcal{C}}s(C,\theta^{*}),\\ \mathrm{Disagree}&=\mathbf{1}\{A\neq B\},\end{aligned}$$
where $s$ is the F1 score. The first two lines correspond to the forward concept-learning map $f$ and the reverse neuron-interpretation map $g$ in Section [4.4](https://arxiv.org/html/2606.07007#S4.SS4). The disagreement ratio averages this indicator over concepts. When $A=B$, the target concept and its selected unit form a fixed point of the Galois connection, i.e., a formal concept in the concept lattice.
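The disagreement indicator can be traced on a small example with two overlapping concepts. The sketch below assumes F1 on integer data ids as the score $s$, a fixed list of candidate units as $\Theta$, and first-wins tie-breaking from Python's `max`; the concepts, units, and function names are hypothetical.

```python
def f1(concept, region):
    """F1 score between a concept and an activation region."""
    tp = len(concept & region)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(region), tp / len(concept)
    return 2 * precision * recall / (precision + recall)


def disagreement(target, concepts, units):
    """Forward step picks the best unit; reverse step picks its best concept."""
    theta_star = max(units, key=lambda th: f1(concepts[target], th))
    best = max(concepts, key=lambda name: f1(concepts[name], theta_star))
    return theta_star, best, int(best != target)


concepts = {"A": set(range(0, 10)), "B": set(range(5, 25))}  # overlap on 5..9
units = [set(range(5, 15)), set(range(20, 30))]

flags = {name: disagreement(name, concepts, units)[2] for name in concepts}
ratio = sum(flags.values()) / len(flags)
print(flags, ratio)
```

For target `A` the best available unit is `{5, ..., 14}`, but that unit matches `B` better than `A`, so the reverse step returns `B` and the pair disagrees; for target `B` the two directions agree.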

Panels \(k\),\(l\) study the disagreement ratio for the separation example in \(d\) and the approximation example in \(h\)\. Comparing \(c\) with \(k\), and \(g\) with \(l\), lower but nonzero F1 generally corresponds to higher disagreement, while high F1 gives low disagreement\. This matches the role of F1 as a membership\-exclusivity metric\. Comparing \(k\) and \(l\), overlapping concepts produce more disagreement than disjoint concepts, because the reverse map can easily select a nearby overlapping concept\. A zero F1 score can sometimes yield no disagreement when too many selected neurons make the intersection empty or nearly empty; under our convention, the reverse step then selects the target concept or no competing concept\.

Overall, these results show that concept learning and neuron interpretation need not agree\. We discuss this mismatch in more detail in Section[7\.4](https://arxiv.org/html/2606.07007#S7.SS4)\.

## 7 Analysis

### 7.1 Difference Between Concept Separation and Approximation

![Refer to caption](https://arxiv.org/html/2606.07007v1/x5.png)

Figure 5: Difference between concept separation and concept approximation, with expansion factor 8. (a)(b): F1 score and visualization for a neuron-separable concept under concept separation; one neuron at $L_{0}=1$ achieves F1 $=1.0$. (c)(d): F1 score and visualization for the same concept under concept approximation; the best case shown uses $L_{0}=6$ and achieves F1 $=0.9281$.

Although concept separation and concept approximation are mathematically related, they behave differently in practice. Fig. [5](https://arxiv.org/html/2606.07007#S7.F5) shows a case study on the same neuron-separable concept. Under concept separation, one neuron perfectly separates the target concept (F1 $=1.0$). Under concept approximation, perfect F1 is not achieved; the best unit shown reaches F1 $=0.9281$.

The difference comes from what each task penalizes\. Concept separation is evaluated only on observed data support, so the learner only needs to include the target concept and exclude other observed concepts\. Concept approximation also evaluates novel/unseen regions, so the unit must bound the target concept more tightly and avoid over\-generalizing into blank regions\. This explains why approximation uses more neurons in Fig\.[5](https://arxiv.org/html/2606.07007#S7.F5)\(d\), and why its F1 curves resemble the harder multi\-neuron cases in Fig\.[3](https://arxiv.org/html/2606.07007#S5.F3)and Fig\.[4](https://arxiv.org/html/2606.07007#S5.F4)\. It also supports the observation from Section[5\.1](https://arxiv.org/html/2606.07007#S5.SS1): when concepts overlap, separation on the boundary begins to behave like approximation\.

### 7.2 More Selected Neurons Is Not Monotonically Better

![Refer to caption](https://arxiv.org/html/2606.07007v1/x6.png)

Figure 6: Visualizations of concept separation with different exact numbers of selected neurons. Left: best unit for $L_{0}=5$ with $N=4$ selected neurons, achieving F1 $=0.8529$. Middle: the same SAE with $N=5$, where F1 drops to 0.0924. Right: best unit with exact $N=5$, achieving F1 $=0.6748$.

In Fig. [3](https://arxiv.org/html/2606.07007#S5.F3) and Fig. [4](https://arxiv.org/html/2606.07007#S5.F4), increasing the exact number of selected neurons does not always increase F1. This does not contradict Corollary [5.5](https://arxiv.org/html/2606.07007#S5.Thmtheorem5): the theory says that a larger neuron budget gives more freedom, not that a concept must benefit from using more neurons.

Fig. [6](https://arxiv.org/html/2606.07007#S7.F6) illustrates the issue. The left panel shows the best unit for $L_{0}=5$ with $N=4$ selected neurons. If we keep the same SAE and force one more selected neuron, the activation region can shrink sharply and cut out most of the target concept, causing the F1 drop shown in the middle panel. Even after re-optimizing the selection for exact $N=5$ (right panel), the best F1 remains below the $N=4$ optimum. Thus, exact-$N$ curves can decrease because additional intersections may remove useful parts of the target concept. If the x-axis instead represented “up to $N$” selected neurons, the curve would be non-decreasing, since the learner could always reuse the best smaller unit.
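The relation between exact-$N$ and "up to $N$" curves is a running maximum. A minimal sketch, using invented F1 values that are purely illustrative and are not results from the paper:

```python
from itertools import accumulate

# Hypothetical exact-N F1 values for N = 1..6 (illustrative only).
exact_f1 = [0.40, 0.55, 0.70, 0.85, 0.65, 0.0]

# "Up to N": the learner may reuse the best unit with at most N neurons.
up_to_f1 = list(accumulate(exact_f1, max))

print(exact_f1)
print(up_to_f1)
```

The exact-$N$ curve rises and then collapses, while the running maximum stays flat once the best unit size has been reached.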

### 7.3 Illusions of Separation/Approximation Error & Limitations of Top-$K$ Selection

![Refer to caption](https://arxiv.org/html/2606.07007v1/x7.png)

Figure 7: Negative separation error and visualizations when $-e_{\mathrm{sep}}$ is used as the neuron-selection objective, with expansion factor 8. Left: $-e_{\mathrm{sep}}$ versus the number of selected neurons. Middle: best 2-neuron unit for $L_{0}=6$, with $-e_{\mathrm{sep}}=-0.0026$. Right: best 3-neuron unit for $L_{0}=6$, with $-e_{\mathrm{sep}}=-0.1253$.

Although $e_{\mathrm{sep}}=0$ and $e_{\mathrm{app}}=0$ characterize perfect concept separation and approximation, these errors can be poor objectives for selecting neurons. Fig. [7](https://arxiv.org/html/2606.07007#S7.F7) shows a representative failure case.

The left panel may appear benign because several curves remain close to zero. However, for $L_{0}=6$, increasing the number of selected neurons from 2 to 3 makes the activation region miss the target concept almost entirely (middle and right panels). The heuristic selects neuron 14 because it has small separation error: it misses target mass 0.12 but has zero contamination. For small target concepts, missing the target can therefore be penalized less than including unrelated concepts.

This is a class-imbalance effect, which has been studied intensively in anomaly detection (Ruff et al., [2018](https://arxiv.org/html/2606.07007#bib.bib50)). Each concept occupies only a small part of the total space, so treating concept learning as binary classification with an unnormalized measure can favor conservative solutions that reject the target concept rather than risk false positives. Measure-weighted errors also underweight small concepts: if $\mu(C_{1})=0.2$ and $\mu(C_{2})=0.01$, then completely missing $C_{2}$ incurs a much smaller penalty than missing $C_{1}$. This suggests that concept learning and neuron interpretation need metrics that normalize by concept mass, such as F1 score or Intersection-over-Union (IoU). This issue is related to the metric sanity checks studied in Oikarinen et al. ([2025](https://arxiv.org/html/2606.07007#bib.bib32)). We omit separate IoU results because they are qualitatively similar to the F1 results reported above. Analytically, the corresponding IoU error has the same asymptotic approximation rate as the approximation error, up to an additional $1/\mu(C)$ factor; this factor gives relatively more weight to small-mass concepts.
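The class-imbalance effect can be reproduced with a few lines of arithmetic on region masses. The sketch below takes the small concept with $\mu(C_{2})=0.01$ from the text and compares two hypothetical units, one that rejects everything and one that covers the concept with some contamination; the function `metrics` and the contamination level 0.02 are assumptions for illustration.

```python
def metrics(mass_c, mass_theta, mass_overlap):
    """Return (e_sep, F1, IoU) from the masses of C, theta, and their overlap."""
    e_sep = (mass_theta - mass_overlap) + (mass_c - mass_overlap)
    total = mass_c + mass_theta
    f1 = 2 * mass_overlap / total if total > 0 else 0.0
    union = total - mass_overlap
    iou = mass_overlap / union if union > 0 else 0.0
    return e_sep, f1, iou


mass_c = 0.01  # small concept, as in the example mu(C_2) = 0.01

# Unit that never activates: misses the whole concept, no contamination.
reject = metrics(mass_c, mass_theta=0.0, mass_overlap=0.0)

# Unit that covers the concept plus 0.02 of unrelated mass.
cover = metrics(mass_c, mass_theta=0.03, mass_overlap=0.01)

print("reject:", reject)
print("cover: ", cover)
```

The rejecting unit has the smaller separation error even though it captures none of the concept, whereas F1 and IoU both rank the covering unit higher.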

Finally, Fig. [7](https://arxiv.org/html/2606.07007#S7.F7) also shows a limitation of top-$N$ selection. For $N=3$, the better combination is neurons $(3,11,0)$, but neuron 0 has large individual contamination error and is not selected; the heuristic instead selects neuron 14, which looks good individually but destroys the intersection. Thus, a neuron can be poor alone but useful in combination. This score-wise greediness can break top-$N$ selection, and similar failures can occur in other greedy methods, including matching pursuit (Wang et al., [2012](https://arxiv.org/html/2606.07007#bib.bib46); Costa et al., [2025](https://arxiv.org/html/2606.07007#bib.bib9)) and forward selection (Borboudakis and Tsamardinos, [2019](https://arxiv.org/html/2606.07007#bib.bib45)).
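A toy finite example makes this failure mode concrete. The sketch below is not the paper's experiment: the data points, the neuron names `n0` to `n3`, and the error (size of the symmetric difference with the target) are all assumptions chosen so that the qualitative behavior matches the description above. A unit's region is taken to be the intersection of its neurons' activation regions.

```python
from itertools import combinations

target = {0, 1, 2}
regions = {
    "n0": {0, 1, 2} | set(range(6, 16)),  # covers the target, heavy contamination
    "n1": {0, 1, 2, 3, 4},
    "n2": {0, 1, 2, 3, 5},
    "n3": {16},                           # tiny region that misses the target
}


def error(region):
    """Assumed error: number of points where region and target disagree."""
    return len(region ^ target)


def unit_region(names):
    """Activation region of a multi-neuron unit: intersection of its members."""
    return set.intersection(*(regions[n] for n in names))


def top_n(n):
    """Pick the n neurons with the smallest individual error."""
    ranked = sorted(regions, key=lambda name: (error(regions[name]), name))
    return tuple(ranked[:n])


def exhaustive(n):
    """Pick the n-neuron unit whose intersection has the smallest error."""
    return min(combinations(sorted(regions), n),
               key=lambda names: error(unit_region(names)))


top2, top3, best3 = top_n(2), top_n(3), exhaustive(3)
print(top2, error(unit_region(top2)))
print(top3, error(unit_region(top3)))
print(best3, error(unit_region(best3)))
```

Here `n0` is the worst neuron individually but is needed to remove the leftover point shared by `n1` and `n2`, while `n3` ranks ahead of it individually and empties the intersection once added. Exhaustive search is only feasible for small dictionaries, so it serves here as a reference, not as a recommended algorithm.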

### 7.4 More About Disagreement Ratio

Table 1: Disagreement ratio by concept type, concept-learning task, SAE architecture, and expansion factor (EF). Each cell reports the disagreement ratio after optimizing F1. For each concept, we choose the number of selected neurons that gives the best F1 score, average over $L_0$ and seeds, and then average over concepts.

Table 2: Disagreement ratio by F1 interval. Each cell reports the disagreement ratio for cases whose optimized F1 score falls in the corresponding interval. The optimized unit size is chosen per concept, and values are averaged over $L_0$, seeds, and concepts.

Table [1](https://arxiv.org/html/2606.07007#S7.T1) gives a broader view of how disagreement depends on expansion factor, concept type, learning task, and SAE architecture. Four trends stand out. First, disjoint concepts yield lower disagreement than overlapping concepts. Second, concept separation usually yields lower disagreement than concept approximation. Third, on disjoint concepts, Top-$K$ SAEs are worse than ReLU SAEs at small expansion factors but catch up or improve at larger expansion factors; on overlapping concepts, Top-$K$ SAEs are generally less stable. Fourth, increasing the expansion factor reduces disagreement.

The first two trends follow from concept\-learning quality\. Disjoint concepts and separation tasks usually achieve higher F1 scores, meaning that the learned unit captures the target concept more exclusively; this lowers the chance that the reverse neuron\-interpretation map selects another concept\. Approximation is harder because unseen and unapproximatable regions create additional ambiguity\. The fourth trend is also expected: a larger expansion factor gives a larger pool of candidate neurons and allows activation regions to surround the data support more flexibly\.

The architectural trend is more subtle. For Top-$K$ SAEs, activation regions are subsets of the corresponding ReLU halfspaces because a neuron can have positive pre-activation but still fail to enter the top-$K$ set. This relative gating creates “holes” in activation regions. On disjoint concepts, such holes are less likely to intersect relevant data support, so the main effect of Top-$K$ gating is to suppress weak residual activations; at larger expansion factors this can produce cleaner features and lower disagreement. On overlapping concepts, however, the holes are more likely to fall inside data support, which destabilizes the reverse map and can increase disagreement. Thus, Top-$K$ competition gives cleaner sparsity but may reduce the stability of the concept lattice when concepts overlap.
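The subset relation between Top-$K$ and ReLU activation regions can be seen on a single input. The following is a minimal sketch under stated assumptions: the encoder is linear with hypothetical weights and zero biases, and the Top-$K$ gate is assumed to keep the $K$ largest pre-activations and then require positivity. Actual Top-$K$ SAE implementations may differ in tie-breaking and in whether a ReLU follows the gate.

```python
def pre_activations(weights, biases, x):
    """Linear encoder pre-activations, one value per neuron."""
    return [sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
            for w, b in zip(weights, biases)]


def relu_active(pre):
    """Neurons active under ReLU: positive pre-activation."""
    return {i for i, z in enumerate(pre) if z > 0}


def topk_active(pre, k):
    """Neurons active under the assumed Top-K gate: among the k largest and positive."""
    ranked = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)
    return {i for i in ranked[:k] if pre[i] > 0}


# Hypothetical 2-D encoder with four neurons.
weights = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (-1.0, 0.0)]
biases = [0.0, 0.0, 0.0, 0.0]
x = (2.0, 1.0)

pre = pre_activations(weights, biases, x)
relu_set = relu_active(pre)
topk_set = topk_active(pre, k=2)
gated_off = relu_set - topk_set   # positive pre-activation, but outside the top-k

print(pre, relu_set, topk_set, gated_off)
```

At this input, neuron 1 lies inside its ReLU halfspace but loses the competition to neurons 0 and 2, so the point is a "hole" in neuron 1's Top-$K$ activation region.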

Table [2](https://arxiv.org/html/2606.07007#S7.T2) further shows that disagreement is concentrated at intermediate F1 scores. In this regime, a neuron or unit partially learns the target concept while also mixing in other concepts. High F1 scores, especially $\mathrm{F1}=1$, almost never produce disagreement. This supports the use of F1 as a practical metric for concept learning because it directly rewards membership exclusivity.

Since real\-world concepts are often overlapping rather than disjoint, future work should account for overlap when choosing the SAE architecture, concept\-learning objective, and neuron\-selection algorithm\.

## 8 Discussion

In this work, we propose a mathematical framework for studying concept learning and neuron interpretation in sparse autoencoders\. We formulate human and model concepts as sets, and view concept learning as a problem of set alignment\. This perspective distinguishes three modes of alignment: concept detection, concept separation, and concept approximation, each capturing a different relationship between human concepts and SAE\-induced model concepts\.

Our analysis explains several phenomena in SAE\-based interpretability\. In particular, it shows why a single neuron is often insufficient for representing a human concept, why wider SAEs can improve concept learning by providing more candidate features, and why excessive sparsity may hurt when a concept requires multiple features\. We also derive capacity and data geometry requirements for successful concept learning, linking concept geometry and SAE width\.

Finally, we show that concept learning and neuron interpretation are not equivalent\. A feature set may represent a human concept well in the forward direction, but need not be uniquely associated with that concept in the reverse direction\. This discrepancy suggests the need for metrics and algorithms that capture bidirectional alignment and the set geometry of concepts\.

This work has several limitations\. First, our theoretical analysis focuses primarily on ReLU SAEs, while SAEs with gated activations or other architectural variants are not fully studied\. Second, although we identify a discrepancy between concept learning and neuron interpretation, we do not yet provide a rigorous concept\-lattice formulation that characterizes their bidirectional relationship, and our empirical study of this relationship remains preliminary\. Future work can develop a more general formulation of both problems, study richer SAE architectures and data geometries, and design algorithms that reach agreement between concept learning and neuron interpretation\.

## References

- C\. D\. Aliprantis and K\. C\. Border \(2006\)Infinite dimensional analysis: a hitchhiker’s guide\.Springer,Berlin; London\.External Links:[Document](https://dx.doi.org/10.1007/3-540-29587-9),ISBN 9783540326960 3540326960Cited by:[§9\.4\.1](https://arxiv.org/html/2606.07007#S9.SS4.SSS1.Px1.p2.1),[Lemma 9\.2](https://arxiv.org/html/2606.07007#S9.Thmtheorem2),[Corollary 9\.3](https://arxiv.org/html/2606.07007#S9.Thmtheorem3.p1.1)\.
- U\. Anwar, A\. Saparov, J\. Rando, D\. Paleka, M\. Turpin, P\. Hase, E\. S\. Lubana, E\. Jenner, S\. Casper, O\. Sourbut,et al\.\(2024a\)Foundational challenges in assuring alignment and safety of large language models\.arXiv preprint arXiv:2404\.09932\.Cited by:[§1](https://arxiv.org/html/2606.07007#S1.p1.1)\.
- U\. Anwar, A\. Saparov, J\. Rando, D\. Paleka, M\. Turpin, P\. Hase, E\. S\. Lubana, E\. Jenner, S\. Casper, O\. Sourbut,et al\.\(2024b\)Foundational challenges in assuring alignment and safety of large language models\.arXiv preprint arXiv:2404\.09932\.Cited by:[§1](https://arxiv.org/html/2606.07007#S1.p1.1)\.
- K\. Ayonrinde and L\. Jaburi \(2025\)A mathematical philosophy of explanations in mechanistic interpretability–the strange science part ii\.arXiv preprint arXiv:2505\.00808\.Cited by:[§1](https://arxiv.org/html/2606.07007#S1.p5.1)\.
- U\. Bhalla, T\. Fel, C\. Rager, S\. Feucht, T\. Haklay, D\. Wurgaft, S\. Boppana, M\. Kowal, V\. Shyam, J\. Merullo,et al\.\(2026\)Do sparse autoencoders capture concept manifolds?\.arXiv preprint arXiv:2604\.28119\.Cited by:[§2](https://arxiv.org/html/2606.07007#S2.p1.1)\.
- S\. Bills, N\. Cammarata, D\. Mossing, H\. Tillman, L\. Gao, G\. Goh, I\. Sutskever, J\. Leike, J\. Wu, and W\. Saunders \(2023\)Language models can explain neurons in language models\.Note:[https://openaipublic\.blob\.core\.windows\.net/neuron\-explainer/paper/index\.html](https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html)Cited by:[§1](https://arxiv.org/html/2606.07007#S1.p3.1),[§2](https://arxiv.org/html/2606.07007#S2.p3.1)\.
- G\. Borboudakis and I\. Tsamardinos \(2019\)Forward\-backward selection with early dropping\.Journal of Machine Learning Research20\(8\),pp\. 1–39\.External Links:[Link](http://jmlr.org/papers/v20/17-334.html)Cited by:[§7\.3](https://arxiv.org/html/2606.07007#S7.SS3.p4.4)\.
- T\. Bricken, A\. Templeton, J\. Batson, B\. Chen, A\. Jermyn, T\. Conerly, N\. Turner, C\. Anil, C\. Denison, A\. Askell,et al\.\(2023\)Towards monosemanticity: decomposing language models with dictionary learning\.Transformer Circuits Thread2\(5\),pp\. 6\.Cited by:[§1](https://arxiv.org/html/2606.07007#S1.p4.1),[§2](https://arxiv.org/html/2606.07007#S2.p2.1)\.
- D\. Chanin, J\. Wilken\-Smith, T\. Dulka, H\. Bhatnagar, S\. Golechha, and J\. Bloom \(2024\)A is for absorption: studying feature splitting and absorption in sparse autoencoders\.arXiv preprint arXiv:2409\.14507\.Cited by:[§1](https://arxiv.org/html/2606.07007#S1.p4.1),[§2](https://arxiv.org/html/2606.07007#S2.p2.1)\.
- V\. Costa, T\. Fel, E\. S\. Lubana, B\. Tolooshams, and D\. Ba \(2025\)From flat to hierarchical: extracting sparse representations with matching pursuit\.arXiv preprint arXiv:2506\.03093\.Cited by:[§1](https://arxiv.org/html/2606.07007#S1.p4.1),[§2](https://arxiv.org/html/2606.07007#S2.p2.1),[§2](https://arxiv.org/html/2606.07007#S2.p3.1),[§7\.3](https://arxiv.org/html/2606.07007#S7.SS3.p4.4)\.
- H\. Cunningham, A\. Ewart, L\. Riggs, R\. Huben, and L\. Sharkey \(2023\)Sparse autoencoders find highly interpretable features in language models\.arXiv preprint arXiv:2309\.08600\.Cited by:[§1](https://arxiv.org/html/2606.07007#S1.p2.1),[§1](https://arxiv.org/html/2606.07007#S1.p4.1),[§2](https://arxiv.org/html/2606.07007#S2.p1.1),[§2](https://arxiv.org/html/2606.07007#S2.p2.1),[§3](https://arxiv.org/html/2606.07007#S3.p1.6)\.
- A\. J\. DeGrave, J\. D\. Janizek, and S\. Lee \(2021\)AI for radiographic covid\-19 detection selects shortcuts over signal\.Nature Machine Intelligence3\(7\),pp\. 610–619\.Cited by:[§1](https://arxiv.org/html/2606.07007#S1.p1.1)\.
- N\. Elhage, T\. Hume, C\. Olsson, N\. Schiefer, T\. Henighan, S\. Kravec, Z\. Hatfield\-Dodds, R\. Lasenby, D\. Drain, C\. Chen,et al\.\(2022\)Toy models of superposition\.arXiv preprint arXiv:2209\.10652\.Cited by:[§1](https://arxiv.org/html/2606.07007#S1.p2.1),[§2](https://arxiv.org/html/2606.07007#S2.p1.1),[§2](https://arxiv.org/html/2606.07007#S2.p2.1)\.
- N\. Elhage, N\. Nanda, C\. Olsson, T\. Henighan, N\. Joseph, B\. Mann, A\. Askell, Y\. Bai, A\. Chen, T\. Conerly,et al\.\(2021\)A mathematical framework for transformer circuits\.Transformer Circuits Thread1\(1\),pp\. 12\.Cited by:[§1](https://arxiv.org/html/2606.07007#S1.p1.1),[§2](https://arxiv.org/html/2606.07007#S2.p1.1)\.
- T\. Fel, G\. Franchi,et al\.\(2025\)A geometric unification of concept learning with concept cones\.arXiv preprint arXiv:2512\.07355\.Cited by:[§1](https://arxiv.org/html/2606.07007#S1.p4.1),[§2](https://arxiv.org/html/2606.07007#S2.p3.1)\.
- S\. Gadgil, C\. Lin, and S\. Lee \(2025\)Ensembling sparse autoencoders\.arXiv preprint arXiv:2505\.16077\.Cited by:[§2](https://arxiv.org/html/2606.07007#S2.p2.1)\.
- B\. Ganter and R\. Wille \(1999\)Formal concept analysis\.Vol\.150,Springer\.Cited by:[§4\.4](https://arxiv.org/html/2606.07007#S4.SS4.p1.3)\.
- L\. Gao, T\. D\. la Tour, H\. Tillman, G\. Goh, R\. Troll, A\. Radford, I\. Sutskever, J\. Leike, and J\. Wu \(2024\)Scaling and evaluating sparse autoencoders\.arXiv preprint arXiv:2406\.04093\.Cited by:[§2](https://arxiv.org/html/2606.07007#S2.p2.1),[§3](https://arxiv.org/html/2606.07007#S3.p1.6)\.
- W\. Gurnee, N\. Nanda, M\. Pauly, K\. Harvey, D\. Troitskii, and D\. Bertsimas \(2023\)Finding neurons in a haystack: case studies with sparse probing\.arXiv preprint arXiv:2305\.01610\.Cited by:[§2](https://arxiv.org/html/2606.07007#S2.p3.1)\.
- T\. Heap, T\. Lawson, L\. Farnik, and L\. Aitchison \(2025\)Sparse autoencoders can interpret randomly initialized transformers\.arXiv e\-prints,pp\. arXiv–2501\.Cited by:[§2](https://arxiv.org/html/2606.07007#S2.p2.1)\.
- S\. S\. R\. Hindupur, E\. S\. Lubana, T\. Fel, and D\. Ba \(2025\)Projecting assumptions: the duality between sparse autoencoders and concept geometry\.arXiv preprint arXiv:2503\.01822\.Cited by:[§2](https://arxiv.org/html/2606.07007#S2.p2.1),[§2](https://arxiv.org/html/2606.07007#S2.p3.1),[§4\.2](https://arxiv.org/html/2606.07007#S4.SS2.p2.5),[§4\.3\.2](https://arxiv.org/html/2606.07007#S4.SS3.SSS2.p2.1),[§6\.1](https://arxiv.org/html/2606.07007#S6.SS1.p2.2)\.
- B\. Kim, J\. Hewitt, N\. Nanda, N\. Fiedel, and O\. Tafjord \(2025\)Because we have llms, we can and should pursue agentic interpretability\.arXiv preprint arXiv:2506\.12152\.Cited by:[§4\.3\.2](https://arxiv.org/html/2606.07007#S4.SS3.SSS2.p3.3)\.
- P\. W\. Koh, T\. Nguyen, Y\. S\. Tang, S\. Mussmann, E\. Pierson, B\. Kim, and P\. Liang \(2020\)Concept bottleneck models\.InInternational conference on machine learning,pp\. 5338–5348\.Cited by:[§2](https://arxiv.org/html/2606.07007#S2.p3.1)\.
- P\. Leask, B\. Bussmann, M\. T\. Pearce, J\. I\. Bloom, C\. Tigges, N\. Al Moubayed, L\. Sharkey, and N\. Nanda \(2025\)Sparse autoencoders do not find canonical units of analysis\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2606.07007#S2.p2.1)\.
- M\. Ludwig \(1999\)Asymptotic approximation of smooth convex bodies by general polytopes\.Mathematika46\(1\),pp\. 103–125\.External Links:[Document](https://dx.doi.org/10.1112/S0025579300007609)Cited by:[§9\.5\.2](https://arxiv.org/html/2606.07007#S9.SS5.SSS2.Px1.2.p2.5)\.
- E\. J\. Michaud, L\. Gorton, and T\. McGrath \(2025\)Understanding sparse autoencoder scaling in the presence of feature manifolds\.arXiv preprint arXiv:2509\.02565\.Cited by:[§2](https://arxiv.org/html/2606.07007#S2.p2.1)\.
- G\. Montúfar, R\. Pascanu, K\. Cho, and Y\. Bengio \(2014\)On the number of linear regions of deep neural networks\.Advances in neural information processing systems27\.Cited by:[§2](https://arxiv.org/html/2606.07007#S2.p4.2)\.
- M\. Muchane, S\. Richardson, K\. Park, and V\. Veitch \(2025\)Incorporating hierarchical semantics in sparse autoencoder architectures\.arXiv preprint arXiv:2506\.01197\.Cited by:[§2](https://arxiv.org/html/2606.07007#S2.p2.1)\.
- J\. R\. Munkres \(1984\)Elements of algebraic topology\.Addison\-Wesley\.External Links:ISBN 978\-0\-201\-04586\-4Cited by:[§5](https://arxiv.org/html/2606.07007#S5.p3.2),[§9\.4\.1](https://arxiv.org/html/2606.07007#S9.SS4.SSS1.Px1.2.p2.6)\.
- A\. Ng et al\.\(2011\)Sparse autoencoder\.CS294A Lecture notes72\(2011\),pp\. 1–19\.Cited by:[§1](https://arxiv.org/html/2606.07007#S1.p2.1),[§2](https://arxiv.org/html/2606.07007#S2.p1.1)\.
- G\. Nikolaou, T\. Mencattini, D\. Crisostomi, A\. Santilli, Y\. Panagakis, and E\. Rodolà \(2025\)Language models are injective and hence invertible\.arXiv preprint arXiv:2510\.15511\.Cited by:[§4\.1](https://arxiv.org/html/2606.07007#S4.SS1.p1.9)\.
- C\. O’Neill, C\. Ye, K\. Iyer, and J\. F\. Wu \(2024\)Disentangling dense embeddings with sparse autoencoders\.arXiv preprint arXiv:2408\.00657\.Cited by:[§1](https://arxiv.org/html/2606.07007#S1.p2.1),[§1](https://arxiv.org/html/2606.07007#S1.p4.1),[§2](https://arxiv.org/html/2606.07007#S2.p1.1)\.
- T\. Oikarinen, G\. Yan, and T\. Weng \(2025\)Evaluating neuron explanations: a unified framework with sanity checks\.arXiv preprint arXiv:2506\.05774\.Cited by:[§2](https://arxiv.org/html/2606.07007#S2.p3.1),[§7\.3](https://arxiv.org/html/2606.07007#S7.SS3.p3.5)\.
- C\. Olah \(2022\)Mechanistic interpretability, variables, and the importance of interpretable bases\.Transformer Circuits Thread2\(4\)\.Cited by:[§1](https://arxiv.org/html/2606.07007#S1.p1.1),[§2](https://arxiv.org/html/2606.07007#S2.p1.1)\.
- B\. A\. Olshausen and D\. J\. Field \(1997\)Sparse coding with an overcomplete basis set: a strategy employed by v1?\.Vision research37\(23\),pp\. 3311–3325\.Cited by:[§2](https://arxiv.org/html/2606.07007#S2.p1.1)\.
- C\. Olsson, N\. Elhage, N\. Nanda, N\. Joseph, N\. DasSarma, T\. Henighan, B\. Mann, A\. Askell, Y\. Bai, A\. Chen,et al\.\(2022\)In\-context learning and induction heads\.arXiv preprint arXiv:2209\.11895\.Cited by:[§2](https://arxiv.org/html/2606.07007#S2.p1.1)\.
- K\. Park, Y\. J\. Choe, and V\. Veitch \(2023\)The linear representation hypothesis and the geometry of large language models\.arXiv preprint arXiv:2311\.03658\.Cited by:[§1](https://arxiv.org/html/2606.07007#S1.p3.1),[§2](https://arxiv.org/html/2606.07007#S2.p1.1),[§2](https://arxiv.org/html/2606.07007#S2.p3.1)\.
- R\. Pascanu, G\. Montufar, and Y\. Bengio \(2013\)On the number of response regions of deep feed forward networks with piece\-wise linear activations\.arXiv preprint arXiv:1312\.6098\.Cited by:[§2](https://arxiv.org/html/2606.07007#S2.p4.2)\.
- S\. Rajamanoharan, A\. Conmy, L\. Smith, T\. Lieberum, V\. Varma, J\. Kramár, R\. Shah, and N\. Nanda \(2024a\)Improving dictionary learning with gated sparse autoencoders\.arXiv preprint arXiv:2404\.16014\.Cited by:[§2](https://arxiv.org/html/2606.07007#S2.p2.1)\.
- S\. Rajamanoharan, T\. Lieberum, N\. Sonnerat, A\. Conmy, V\. Varma, J\. Kramár, and N\. Nanda \(2024b\)Jumping ahead: improving reconstruction fidelity with jumprelu sparse autoencoders\.arXiv preprint arXiv:2407\.14435\.Cited by:[§2](https://arxiv.org/html/2606.07007#S2.p2.1)\.
- L\. Ruff, R\. Vandermeulen, N\. Goernitz, L\. Deecke, S\. A\. Siddiqui, A\. Binder, E\. Müller, and M\. Kloft \(2018\)Deep one\-class classification\.InProceedings of the 35th International Conference on Machine Learning,J\. Dy and A\. Krause \(Eds\.\),Proceedings of Machine Learning Research, Vol\.80,pp\. 4393–4402\.External Links:[Link](https://proceedings.mlr.press/v80/ruff18a.html)Cited by:[§4\.3\.3](https://arxiv.org/html/2606.07007#S4.SS3.SSS3.p2.2),[§7\.3](https://arxiv.org/html/2606.07007#S7.SS3.p3.5)\.
- R\. Sarfati, E\. Bigelow, D\. Wurgaft, J\. Merullo, A\. Geiger, O\. Lewis, T\. McGrath, and E\. S\. Lubana \(2026\)The shape of beliefs: geometry, dynamics, and interventions along representation manifolds of language models’ posteriors\.arXiv preprint arXiv:2602\.02315\.Cited by:[§2](https://arxiv.org/html/2606.07007#S2.p3.1)\.
- O\. Shafran, S\. Ronen, O\. Fahn, S\. Ravfogel, A\. Geiger, and M\. Geva \(2026\)From directions to regions: decomposing activations in language models via local geometry\.arXiv preprint arXiv:2602\.02464\.Cited by:[§2](https://arxiv.org/html/2606.07007#S2.p3.1)\.
- E\. Simon and J\. Zou \(2025\)InterPLM: discovering interpretable features in protein language models via sparse autoencoders\.Nature methods22\(10\),pp\. 2107–2117\.Cited by:[§1](https://arxiv.org/html/2606.07007#S1.p1.1),[§4\.3\.2](https://arxiv.org/html/2606.07007#S4.SS3.SSS2.p3.3)\.
- C\. Singh, A\. R\. Hsu, R\. Antonello, S\. Jain, A\. G\. Huth, B\. Yu, and J\. Gao \(2023\)Explaining black box text modules in natural language with language models\.arXiv preprint arXiv:2305\.09863\.Cited by:[§2](https://arxiv.org/html/2606.07007#S2.p3.1)\.
- N\. Singh, S\. Lane, T\. Yu, J\. Lu, A\. Ramos, H\. Cui, and H\. Zhao \(2025\)A generalized platform for artificial intelligence\-powered autonomous enzyme engineering\.Nature communications16\(1\),pp\. 5648\.Cited by:[§4\.3\.2](https://arxiv.org/html/2606.07007#S4.SS3.SSS2.p3.3)\.
- R\. P\. Stanley et al\.\(2007\)An introduction to hyperplane arrangements\.Geometric combinatorics13,pp\. 389–496\.Cited by:[Figure 1](https://arxiv.org/html/2606.07007#S1.F1),[Figure 1](https://arxiv.org/html/2606.07007#S1.F1.4.2),[§2](https://arxiv.org/html/2606.07007#S2.p4.2)\.
- J\. M\. Stokes, K\. Yang, K\. Swanson, W\. Jin, A\. Cubillos\-Ruiz, N\. M\. Donghia, C\. R\. MacNair, S\. French, L\. A\. Carfrae, Z\. Bloom\-Ackermann,et al\.\(2020\)A deep learning approach to antibiotic discovery\.Cell180\(4\),pp\. 688–702\.Cited by:[§4\.3\.2](https://arxiv.org/html/2606.07007#S4.SS3.SSS2.p3.3)\.
- Y\. Su, H\. Tang, Z\. Gong, and Y\. Liu \(2026\)Sparsity is combinatorial depth: quantifying moe expressivity via tropical geometry\.arXiv preprint arXiv:2602\.03204\.Cited by:[§2](https://arxiv.org/html/2606.07007#S2.p4.2),[§4\.2](https://arxiv.org/html/2606.07007#S4.SS2.p2.5)\.
- A\. Templeton \(2024\)Scaling monosemanticity: extracting interpretable features from claude 3 sonnet\.Anthropic\.Cited by:[§1](https://arxiv.org/html/2606.07007#S1.p2.1),[§2](https://arxiv.org/html/2606.07007#S2.p1.1)\.
- J\. Wang, S\. Kwon, and B\. Shim \(2012\)Generalized orthogonal matching pursuit\.IEEE Transactions on signal processing60\(12\),pp\. 6202–6216\.Cited by:[§7\.3](https://arxiv.org/html/2606.07007#S7.SS3.p4.4)\.

## 9 Appendix

### 9.1 List of Contents

We organize the appendix as follows\.

- Experiment Details (Section [9.2](https://arxiv.org/html/2606.07007#S9.SS2)) describes the synthetic data generation, SAE training hyperparameters, hardware, and training time.
- A Complete View of Concept Learning and Neuron Interpretation (Section [9.3](https://arxiv.org/html/2606.07007#S9.SS3)) includes proofs for Section [5.4](https://arxiv.org/html/2606.07007#S5.SS4).
- Proofs for Concept Separation (Section [9.4](https://arxiv.org/html/2606.07007#S9.SS4)) contains proofs for Section [5.1](https://arxiv.org/html/2606.07007#S5.SS1).
- Proofs for Concept Approximation (Section [9.5](https://arxiv.org/html/2606.07007#S9.SS5)) contains proofs for Section [5.2](https://arxiv.org/html/2606.07007#S5.SS2).
- Proofs for Concept Learning Capacity (Section [9.6](https://arxiv.org/html/2606.07007#S9.SS6)) contains the proof for Section [5.3](https://arxiv.org/html/2606.07007#S5.SS3).
- Additional Discussion on Top-$K$ SAE (Section [9.7](https://arxiv.org/html/2606.07007#S9.SS7)) discusses the additional rank-interference effect induced by Top-$K$ gating.

### 9.2 Experiment Details

For each cluster, we sample 10,000 points with uniform density. The disjoint-concept dataset contains 8 clusters, and the overlapping-concept dataset contains 12 clusters. In the overlapping dataset, the density in an overlap region is kept the same as the density in a non-overlap region. This mimics realistic concept densities: for example, the density of “red car” should not be double-counted simply because a point belongs to both “red” and “car.” We approximate the data-supported measure $\mu$ by the empirical measure on observed data. For concept approximation, we additionally sample 10,000 probe points from the blank region inside the convex hull of all observed data, i.e., $\operatorname{Conv}(X)\setminus X$, and include these points when empirically estimating the ambient/probe measure $\nu$.

For ReLU SAEs, we use expansion factors in $\{1,2,4,8,16,32,64\}$ and $L_1$ regularization coefficients in

$$\{10^{-5},\ 5\times10^{-5},\ 10^{-4},\ 5\times10^{-4},\ 10^{-3},\ 5\times10^{-3},\ 10^{-2},\ 5\times10^{-2},\ 10^{-1},\ 5\times10^{-1},\ 1,\ 3,\ 5,\ 7,\ 10\}.$$

For Top-$K$ SAEs, $K$ must not exceed the SAE width. Since the input dimension is two, the SAE width is $2\mathrm{EF}$. We therefore use

$$K\in\{1,\ldots,2\mathrm{EF}\}\quad\text{for }\mathrm{EF}\in\{1,2,4,8,16\},$$

$$K\in\{1,\ldots,35,40,45,50,55,60,64\}\quad\text{for }\mathrm{EF}=32,$$

$$K\in\{1,\ldots,35,40,45,50,55,60,64,80,100,128\}\quad\text{for }\mathrm{EF}=64.$$

For both architectures, we use learning rates in $\{0.1,0.01,0.001,0.0003\}$ and train for 200 epochs, which is sufficient for convergence in this synthetic setting. Since our theory is constructive, we run 10 random seeds, $\{0,1,42,3407,8347,19285,657306,482910,915673,2746089\}$, and report the best concept-learning metric across seeds. Each run is trained on one NVIDIA A40 GPU and takes approximately 5–20 minutes, depending on the expansion factor.
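For reproducibility, the Top-$K$ grid described above can be written as a small helper. This is a sketch of the stated grid only; the function name `topk_grid` and its signature are hypothetical and do not come from the authors' code.

```python
def topk_grid(ef, input_dim=2):
    """Return the list of K values swept for a given expansion factor (EF)."""
    width = input_dim * ef                      # SAE width; K must not exceed it
    if ef in (1, 2, 4, 8, 16):
        return list(range(1, width + 1))        # dense sweep: K = 1, ..., 2*EF
    coarse_tail = {
        32: [40, 45, 50, 55, 60, 64],
        64: [40, 45, 50, 55, 60, 64, 80, 100, 128],
    }
    if ef not in coarse_tail:
        raise ValueError(f"no grid specified for EF={ef}")
    return list(range(1, 36)) + coarse_tail[ef]  # dense up to 35, then coarse


grids = {ef: topk_grid(ef) for ef in (1, 2, 4, 8, 16, 32, 64)}
for ef, ks in grids.items():
    print(ef, len(ks), ks[-1])
```

Every grid ends at the SAE width $2\mathrm{EF}$, so the largest setting in each sweep lets all neurons pass the gate.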

### 9.3 A Complete View of Concept Learning and Neuron Interpretation

Although neuron interpretation has not been specified in the previous sections, the theoretical results above still provide insight into it\. In particular, using individual neurons or multi\-neuron units for concept learning leads to different levels of interpretation quality, such as different degrees of monosemanticity\. We now state the link between concept learning and neuron interpretation in an algebraic way\.

Define $U$ as the power set of all data, $U=\mathcal{P}(X)$, so that the human concept family $\mathcal{C}$ is a finite subset of $U$. Define $\mathcal{N}=[d]$ as the set of neurons. Recall that $M\in\mathcal{P}(\mathcal{N})$ is a set of neurons and $\theta_M\subseteq X$ is the activation region of $M$. For a single neuron $N\in\mathcal{N}$, we write $\theta_N:=\theta_{\{N\}}$. Since each $\theta_M$ is a subset of $X$, we also regard it as an element of $U$.

We define a binary relation $R\subseteq U\times\mathcal{N}$ by

$$CRN \Longleftrightarrow C\subseteq\theta_N.$$

Intuitively, $CRN$ means that neuron $N$ is active on the whole data region $C$. When $C\in\mathcal{C}$, this says that $N$ covers the whole human concept region $C$. This relation is not assumed to be a function: a concept may be represented by multiple neurons, and a neuron may be related to multiple concepts.

The relation $R$ induces two maps

$$f: U\rightarrow\mathcal{P}(\mathcal{N}),\qquad f(C)=\{N\in\mathcal{N}: CRN\},$$

$$g: \mathcal{P}(\mathcal{N})\rightarrow U,\qquad g(M)=\bigcap_{N\in M}\theta_N,$$

with the convention that $g(\emptyset)=X$. The map $f$ sends a data region to the set of neurons that are active on the entire region. Thus, $f$ corresponds to the concept-to-neuron direction. The map $g$ sends a set of neurons to their common activation region. Thus, $g$ corresponds to the neuron-to-region direction used in neuron interpretation.

Both $U$ and $\mathcal{P}(\mathcal{N})$ are partially ordered by set inclusion. We write these posets as $(U,\subseteq)$ and $(\mathcal{P}(\mathcal{N}),\subseteq)$. This order-theoretic formulation is useful because it allows us to study hierarchical structure among concepts and neuron-defined regions. The larger the data region $C$ is, the fewer neurons can be active on all of it; the larger the neuron set $M$ is, the smaller its common activation region becomes. This is captured by the following Galois connection.

###### Theorem 9.1 (Galois Connection Between Human Concepts and Model Neurons).

The maps $f$ and $g$ form a contravariant Galois connection between $U$ and $\mathcal{P}(\mathcal{N})$. That is, for every $C\in U$ and every $M\in\mathcal{P}(\mathcal{N})$,

$$C\subseteq g(M)\Longleftrightarrow M\subseteq f(C).$$

###### Proof\.

Fix $C\in U$ and $M\in\mathcal{P}(\mathcal{N})$. By the definition of $g$,

$$C\subseteq g(M)\Longleftrightarrow C\subseteq\bigcap_{N\in M}\theta_N.$$

This holds if and only if $C\subseteq\theta_N$ for every $N\in M$. By the definition of $R$, this is equivalent to saying that $CRN$ for every $N\in M$. By the definition of $f(C)$, this holds if and only if every $N\in M$ belongs to $f(C)$, or equivalently $M\subseteq f(C)$. Therefore, $C\subseteq g(M)\Longleftrightarrow M\subseteq f(C)$. ∎

Intuitively, the theorem says that a data region $C$ lies inside the common activation region of a neuron set $M$ if and only if every neuron in $M$ is active on the whole region $C$. When $C\in\mathcal{C}$, we interpret $C$ as a human concept. The word “contravariant” reflects the reversal of inclusion: adding more neurons to $M$ shrinks $g(M)$, while enlarging $C$ can only remove neurons from $f(C)$. Therefore, the concept-learning direction $f$ and the neuron-interpretation direction $g$ are two sides of the same order-theoretic structure.
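On a finite dataset the equivalence in Theorem 9.1 can be verified by brute force. The sketch below assumes a hypothetical four-point dataset and three hypothetical activation regions, and encodes $f$ and $g$ exactly as defined above, including the convention $g(\emptyset)=X$.

```python
from itertools import combinations

X = frozenset(range(4))                     # hypothetical data points
theta = {                                   # hypothetical activation regions
    "n1": frozenset({0, 1, 2}),
    "n2": frozenset({1, 2, 3}),
    "n3": frozenset({2}),
}
neurons = frozenset(theta)


def powerset(s):
    """All subsets of s, as frozensets."""
    items = sorted(s)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def f(C):
    """Neurons that are active on the whole region C."""
    return frozenset(n for n in neurons if C <= theta[n])


def g(M):
    """Common activation region of the neuron set M; g(empty set) = X."""
    region = X
    for n in M:
        region = region & theta[n]
    return region


# Check C <= g(M)  <=>  M <= f(C) for every region C and every neuron set M.
galois_holds = all((C <= g(M)) == (M <= f(C))
                   for C in powerset(X) for M in powerset(neurons))
print(galois_holds)
```

The check iterates over all $2^{|X|}\cdot 2^{|\mathcal{N}|}$ pairs, so it is meant only as a sanity test on small examples.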

This Galois connection naturally induces closure operators $g\circ f$ on $U$ and $f\circ g$ on $\mathcal{P}(\mathcal{N})$. For a data region $C\in U$, the region $g(f(C))$ is the smallest neuron-closed region, relative to the current neuron family, that contains $C$. If $C=g(f(C))$, then $C$ is closed under the concept-to-neuron-to-concept operation and is called a fixed point. Similarly, a neuron set $M$ is a fixed point if $M=f(g(M))$. Studying fixed points is useful for two reasons. First, different regions or neuron sets with the same closure collapse to the same canonical object, which makes redundancy visible. Second, fixed points carry lattice operations, giving a clean way to compare broader and narrower model-learned regions.

From fixed points, we define formal concepts as follows: a formal concept is a pair $(C,M)$ satisfying $f(C)=M$ and $g(M)=C$. Equip formal concepts with the order

$$(C_1,M_1)\leq(C_2,M_2)\Longleftrightarrow C_1\subseteq C_2.$$

Equivalently, the neuron-side order is reversed: $(C_1,M_1)\leq(C_2,M_2)$ if and only if $M_2\subseteq M_1$. The meet and join operations are

$$(C_1,M_1)\land(C_2,M_2)=(C_1\cap C_2,\ f(C_1\cap C_2)),$$

$$(C_1,M_1)\vee(C_2,M_2)=(g(M_1\cap M_2),\ M_1\cap M_2).$$

Intuitively, the meet gives a more specific region, $C_1\cap C_2$, and then collects the neurons active on all of that smaller region. The join keeps the neurons shared by the two descriptions, $M_1\cap M_2$, and returns the broader closed region described by those shared neurons. The collection of formal concepts with this order is the concept lattice. The $U$-side components $C$ are extents, and the $\mathcal{P}(\mathcal{N})$-side components $M$ are intents. Extents that are not in $\mathcal{C}$ are not necessarily meaningless; rather, they are closed regions that are not named by the chosen human concept family, and may therefore appear uninterpretable under that vocabulary.
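For a small context, the formal concepts and their meet and join can be enumerated directly. The sketch below uses a hypothetical four-point dataset and three hypothetical neurons; it relies on the standard fact that every extent has the form $g(M)$ for some neuron set $M$, so closing each $g(M)$ with $f$ yields all formal concepts.

```python
from itertools import combinations

X = frozenset({0, 1, 2, 3})                 # hypothetical data points
theta = {                                   # hypothetical activation regions
    "n1": frozenset({0, 1}),
    "n2": frozenset({1, 2}),
    "n3": frozenset({0, 1, 2}),
}
neurons = frozenset(theta)


def f(C):
    """Intent of a region: neurons active on all of C."""
    return frozenset(n for n in neurons if C <= theta[n])


def g(M):
    """Extent of a neuron set: common activation region; g(empty set) = X."""
    region = X
    for n in M:
        region = region & theta[n]
    return region


def formal_concepts():
    """All pairs (C, M) with f(C) = M and g(M) = C."""
    found = set()
    names = sorted(neurons)
    for r in range(len(names) + 1):
        for combo in combinations(names, r):
            extent = g(frozenset(combo))
            found.add((extent, f(extent)))
    return found


def meet(a, b):
    extent = a[0] & b[0]
    return (extent, f(extent))


def join(a, b):
    intent = a[1] & b[1]
    return (g(intent), intent)


lattice = formal_concepts()
a = (frozenset({0, 1}), frozenset({"n1", "n3"}))
b = (frozenset({1, 2}), frozenset({"n2", "n3"}))
print(len(lattice), meet(a, b), join(a, b))
```

In this example the neuron set $\{n_1\}$ is not a fixed point, since $f(g(\{n_1\}))=\{n_1,n_3\}$: the closure makes visible that $n_3$ is always active wherever $n_1$ is.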

This algebraic framework is an abstraction of the geometric framework\. It gives a more general view of SAE phenomena and helps explain why concept learning can be cast as a set alignment problem between human concepts and model\-learned regions\.

Polysemanticity/monosemanticity. Neuron $N$ is polysemantic when its activation region covers two unrelated concepts: there exist disjoint $C_1,C_2\in\mathcal{C}$ with $C_1\subseteq g(\{N\})$ and $C_2\subseteq g(\{N\})$, i.e., the region $\theta_N=g(\{N\})$ is too coarse. If $\{N\}$ is already closed, the corresponding lattice node has extent $g(\{N\})$ and intent $\{N\}$; in general, its intent is $f(g(\{N\}))$. Its extent contains $C_1\cup C_2$, so the current dictionary does not distinguish the two concepts inside this neuron region. Lattice operations inside this fixed context can only recombine regions already induced by the available neurons; they cannot create a new boundary inside $\theta_N$ unless some neuron already encodes that boundary. Conversely, $N$ is monosemantic in the ideal case when $g(\{N\})=C$ for a single $C\in\mathcal{C}$, up to the chosen notion of approximation. Disentangling a polysemantic neuron therefore requires changing the dictionary or training objective, not merely reordering one fixed lattice. This is what feature splitting attempts to do.
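The set-based definitions of polysemanticity and ideal monosemanticity translate into simple membership tests. The following sketch uses hypothetical concepts `A`, `B`, `C` and hypothetical neurons `n1` to `n3`, and checks exact containment and exact equality only; it ignores the "up to the chosen notion of approximation" relaxation mentioned above.

```python
from itertools import combinations

concepts = {                                # hypothetical human concept family
    "A": frozenset({0, 1}),
    "B": frozenset({2, 3}),
    "C": frozenset({4, 5}),
}
theta = {                                   # hypothetical activation regions
    "n1": frozenset({0, 1, 2, 3}),          # covers both A and B
    "n2": frozenset({4, 5}),                # equals C exactly
    "n3": frozenset({0, 1, 6}),             # covers A plus an unnamed point
}


def covered(neuron):
    """Names of concepts fully contained in the neuron's activation region."""
    return sorted(name for name, C in concepts.items() if C <= theta[neuron])


def is_polysemantic(neuron):
    """True if the region covers at least two disjoint concepts."""
    names = covered(neuron)
    return any(concepts[a].isdisjoint(concepts[b])
               for a, b in combinations(names, 2))


def is_ideal_monosemantic(neuron):
    """True if the region coincides exactly with one concept."""
    return any(theta[neuron] == C for C in concepts.values())


report = {n: (covered(n), is_polysemantic(n), is_ideal_monosemantic(n))
          for n in sorted(theta)}
print(report)
```

Neuron `n3` illustrates the middle ground: it is not polysemantic under this test, yet it is not ideally monosemantic either, because its region strictly contains concept `A`.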

Feature splitting. Enlarging the SAE replaces the current neuron family $\Theta$ by a richer family $\Theta^{\prime}$. The training objective, through reconstruction under sparsity, may learn new neurons $N_{1},N_{2},\dots$ with $\theta_{N_{i}}\approx C_{i}$ that did not exist in $\Theta$. Once these neurons are present, a coarse region $\theta_{N}$ that previously covered several concepts can be resolved into purer regions. It is important to keep two directions separate. The actual split from coarse to fine is performed by optimization in the larger dictionary; it is what carves $C_{1}$ from $C_{2}$. After training, the refined lattice records the result by giving the $C_{i}$ their own nodes. If the old coarse feature remains, or if the refined dictionary contains shared neurons describing it, the broad region can appear above the finer ones. Otherwise, the relation $\theta_{N}\approx\bigcup_{i}\theta_{N_{i}}$ should be read as an approximate geometric relation, not necessarily as an exact join inside a single intersection-based lattice.

Concept Hierarchy. This is natural in the poset. If $C_{\text{large}}$ is more general than $C_{\text{small}}$, then $C_{\text{small}}\subseteq C_{\text{large}}$, and by antitonicity the neuron sets are reverse-ordered:

$$f(C_{\text{large}})\subseteq f(C_{\text{small}}).$$

Thus, moving to a more specific concept can only add neuron constraints, while moving to a more general concept removes them. However, under a sparsity constraint, a standard SAE tends to learn a relatively flat set of features and does not explicitly represent these subconcept/superconcept containments. Capturing such hierarchy may require lattice- or graph-structured SAEs.

Feature family. A feature family can be viewed as a local part of the concept lattice around a parent concept: the nodes lying below a parent extent, or a smaller interval selected by co-activating neurons. This captures related features without forcing each member of the family to be represented by a single neuron.

Feature absorption. In the exact lattice, the hierarchy $C_{\text{small}}\subseteq C_{\text{large}}$ implies

$$f(C_{\text{large}})\subseteq f(C_{\text{small}}).$$

That is, any neuron $N$ that is active on the whole parent concept $C_{\text{large}}$ must also be active on the child concept $C_{\text{small}}$. This gives the ideal implication

$$x\in C_{\text{small}}\Longrightarrow N\text{ activates on }x,\qquad\text{for }N\in f(C_{\text{large}}).$$

Feature absorption is an empirical failure of this ideal implication: a detector intended for $C_{\text{large}}$ fails to activate on a non-negligible part of $C_{\text{small}}$. In the empirical relation, this means that $N\notin f(C_{\text{small}})$ even though the concept hierarchy predicts that it should be. This discrepancy suggests a simple diagnostic: list the implications $C_{\text{small}}\Rightarrow N$ predicted by the known concept hierarchy, and flag the cases where the empirical activation data do not support the implication.
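The diagnostic described above can be sketched in a few lines. Everything in this example is hypothetical: the activation log `fires_on`, the labelled concepts, the hierarchy pairs, the detector assignment, and the `tolerance` parameter are made up for illustration and are not taken from the paper's experiments.

```python
# Hypothetical activation log: neuron -> ids of samples on which it fired.
fires_on = {"n_parent": {1, 2, 3, 5}}

# Hypothetical labelled concepts: concept -> ids of samples it contains.
concepts = {
    "starts_with_s": {1, 2, 3, 4, 5},  # parent concept
    "snake": {4},                       # child concept
    "sun": {5},                         # child concept
}

# Known hierarchy as (child, parent) pairs, and the detector intended for each parent.
hierarchy = [("snake", "starts_with_s"), ("sun", "starts_with_s")]
detector_for = {"starts_with_s": "n_parent"}


def absorption_report(tolerance=0.0):
    """List predicted implications 'child => neuron' that the data do not support."""
    flagged = []
    for child, parent in hierarchy:
        neuron = detector_for[parent]
        child_points = concepts[child]
        missed = child_points - fires_on[neuron]
        miss_rate = len(missed) / len(child_points)
        if miss_rate > tolerance:
            flagged.append((child, neuron, miss_rate))
    return flagged


print(absorption_report())
```

Here the parent detector never fires on the `snake` sample, so that implication is flagged, while the `sun` implication is supported by the data. A nonzero `tolerance` would correspond to ignoring failures on a negligible part of the child concept.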

### 9.4 Proofs for Concept Separation

Throughout this subsection, $C\in\mathcal{C}$ is a nonempty compact concept and

$$N:=X\setminus C=\bigcup_{C^{\prime}\in\mathcal{C},\,C^{\prime}\neq C}(C^{\prime}\setminus C)$$

is the non-target region. We assume $|\mathcal{C}|$ is finite and $X\subset\mathbb{R}^{n}$ is bounded. All closures are taken in the ambient input space, and $\operatorname{Conv}(\cdot)$ denotes convex hull.

#### 9.4.1 Proof of Theorem [5.2](https://arxiv.org/html/2606.07007#S5.Thmtheorem2)

We first recall two standard convexity facts.

###### Lemma 9.2 (Carathéodory convexity theorem (Aliprantis and Border, [2006](https://arxiv.org/html/2606.07007#bib.bib53))).

In an $n$-dimensional vector space, every vector in the convex hull of a nonempty set can be written as a convex combination of at most $n+1$ points from that set.

###### Corollary 9.3.

(Aliprantis and Border, [2006](https://arxiv.org/html/2606.07007#bib.bib53)) The convex hull of a compact subset of a finite-dimensional vector space is compact.

##### Theorem [5.2](https://arxiv.org/html/2606.07007#S5.Thmtheorem2).

A concept $C$ can be separated from $N$ with one neuron if and only if

$$\operatorname{Conv}(C)\cap\overline{\operatorname{Conv}(N)}=\phi.$$
###### Proof\.

If $N=\phi$, the condition is immediate and a sufficiently large bias makes any nonzero neuron positive on the bounded set $C$. We therefore assume $N\neq\phi$.

By Corollary [9.3](https://arxiv.org/html/2606.07007#S9.Thmtheorem3), $K:=\operatorname{Conv}(C)$ is compact and convex. Since $N\subset X$ and $X$ is bounded, $\operatorname{Conv}(N)$ is bounded; hence $L:=\overline{\operatorname{Conv}(N)}$ is compact by the Heine–Borel theorem (Munkres, [1984](https://arxiv.org/html/2606.07007#bib.bib52)). The set $L$ is also convex because the closure of a convex set is convex.

Recall that separation by one neuron means that there are $w\in\mathbb{R}^{n}\setminus\{0\}$ and $b\in\mathbb{R}$ such that

$$w^{\top}x+b>0\quad(x\in C),\qquad w^{\top}y+b\leq 0\quad(y\in N).\tag{7}$$

Equivalently, with $\beta:=-b$, we require $w^{\top}x>\beta$ on $C$ and $w^{\top}y\leq\beta$ on $N$. The usual hyperplane-separation theorem almost gives this result, but here the positive side is strict while the negative side is weak. We therefore give the direct proof.

($\Leftarrow$) Assume $K\cap L=\phi$. Since $K$ and $L$ are nonempty compact sets, the continuous map $(u,v)\mapsto\|u-v\|$ attains its minimum over $K\times L$ at some $(u^{\ast},v^{\ast})$. Let

$$\delta:=\|u^{\ast}-v^{\ast}\|,\qquad w:=u^{\ast}-v^{\ast}.$$

Because $K\cap L=\phi$, we have $\delta>0$ and $w\neq 0$.

For any $u\in K$ and $t\in[0,1]$, convexity gives $u^{\ast}+t(u-u^{\ast})\in K$. By minimality,

$$\|u^{\ast}+t(u-u^{\ast})-v^{\ast}\|^{2}\geq\|u^{\ast}-v^{\ast}\|^{2}.$$

Expanding, dividing by $t>0$, and sending $t\to 0^{+}$ gives

$$w^{\top}u\geq w^{\top}u^{\ast}\qquad(u\in K).\tag{8}$$

Similarly, varying $v\in L$ while keeping $u^{\ast}$ fixed yields

$$w^{\top}v\leq w^{\top}v^{\ast}\qquad(v\in L).\tag{9}$$

Moreover,

$$w^{\top}u^{\ast}-w^{\top}v^{\ast}=w^{\top}(u^{\ast}-v^{\ast})=\|w\|^{2}=\delta^{2}>0.\tag{10}$$

Set $\beta:=\frac{1}{2}(w^{\top}u^{\ast}+w^{\top}v^{\ast})$ and $b:=-\beta$. For $x\in C\subseteq K$, equations ([8](https://arxiv.org/html/2606.07007#S9.E8)) and ([10](https://arxiv.org/html/2606.07007#S9.E10)) imply

$$w^{\top}x+b\geq w^{\top}u^{\ast}-\beta=\tfrac{1}{2}\delta^{2}>0.$$

For $y\in N\subseteq\operatorname{Conv}(N)\subseteq L$, equations ([9](https://arxiv.org/html/2606.07007#S9.E9)) and ([10](https://arxiv.org/html/2606.07007#S9.E10)) imply

$$w^{\top}y+b\leq w^{\top}v^{\ast}-\beta=-\tfrac{1}{2}\delta^{2}<0.$$

Thus $(w,b)$ satisfies ([7](https://arxiv.org/html/2606.07007#S9.E7)).

($\Rightarrow$) Conversely, suppose $(w,b)$ satisfies ([7](https://arxiv.org/html/2606.07007#S9.E7)), and set $\beta:=-b$. Then $w^{\top}x>\beta$ for all $x\in C$ and $w^{\top}y\leq\beta$ for all $y\in N$.

By Lemma [9.2](https://arxiv.org/html/2606.07007#S9.Thmtheorem2), each $u\in\operatorname{Conv}(C)$ is a finite convex combination $u=\sum_{j}\lambda_{j}x_{j}$ with $x_{j}\in C$, $\lambda_{j}\geq 0$, and $\sum_{j}\lambda_{j}=1$. Therefore

$$w^{\top}u=\sum_{j}\lambda_{j}w^{\top}x_{j}\geq\min_{j}w^{\top}x_{j}>\beta,$$

where the final strict inequality is valid because the minimum is over finitely many terms. Hence

$$w^{\top}u>\beta\qquad(u\in\operatorname{Conv}(C)).\tag{11}$$

The same finite-convex-combination argument gives $w^{\top}v\leq\beta$ for all $v\in\operatorname{Conv}(N)$. By continuity, this extends to

$$w^{\top}v\leq\beta\qquad(v\in\overline{\operatorname{Conv}(N)}).\tag{12}$$

If a point $p$ belonged to both $\operatorname{Conv}(C)$ and $\overline{\operatorname{Conv}(N)}$, equations ([11](https://arxiv.org/html/2606.07007#S9.E11)) and ([12](https://arxiv.org/html/2606.07007#S9.E12)) would give the contradiction $w^{\top}p>\beta$ and $w^{\top}p\leq\beta$. Thus the two sets are disjoint. ∎
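For finite point sets, the construction in the sufficiency direction can be imitated numerically. The sketch below is an assumption-laden illustration, not the paper's procedure: it uses a plain Frank-Wolfe iteration (a swapped-in technique) to approximate the closest points $u^{\ast}$, $v^{\ast}$ of the two convex hulls, takes $w$ as their difference, and places the threshold midway between the smallest response on $C$ and the largest response on $N$. When the iteration has converged this coincides with the $\beta$ of the proof. The function name and the sample points are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def closest_point_neuron(C, N, iters=1000, tol=1e-12):
    """Return (w, b) with w.x + b > 0 on C and w.y + b < 0 on N, or None.

    None means no separating certificate was found; it is not a proof
    that the hulls intersect if the iteration stopped early.
    """
    dim = len(C[0])
    u = [sum(p[k] for p in C) / len(C) for k in range(dim)]  # start inside Conv(C)
    v = [sum(p[k] for p in N) / len(N) for k in range(dim)]  # start inside Conv(N)
    for _ in range(iters):
        z = [ui - vi for ui, vi in zip(u, v)]
        s_c = min(C, key=lambda p: dot(z, p))  # vertex of C moving u towards v
        s_n = max(N, key=lambda p: dot(z, p))  # vertex of N moving v towards u
        d = [ci - ni - zi for ci, ni, zi in zip(s_c, s_n, z)]
        gap = -dot(z, d)
        dd = dot(d, d)
        if gap <= tol or dd == 0.0:
            break
        t = min(1.0, gap / dd)  # exact line search, clipped to [0, 1]
        u = [ui + t * (si - ui) for ui, si in zip(u, s_c)]
        v = [vi + t * (si - vi) for vi, si in zip(v, s_n)]
    w = [ui - vi for ui, vi in zip(u, v)]
    lo = min(dot(w, p) for p in C)
    hi = max(dot(w, p) for p in N)
    if lo <= hi:
        return None
    return w, -0.5 * (lo + hi)


C = [(2.0, 0.0), (3.0, 1.0), (3.0, -1.0)]
N = [(0.0, 0.0), (-1.0, 1.0), (-1.0, -1.0), (0.0, 2.0)]
w, b = closest_point_neuron(C, N)
print(w, b)
```

The final check `lo > hi` is what certifies the separation, so the returned neuron is valid even if the closest-point search is only approximate. When one set surrounds the other, for example `C = [(0, 0)]` and `N = [(1, 0), (-1, 0)]`, the hulls intersect and the function returns `None`.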

In infinite-dimensional settings, such as an RKHS induced by an RBF kernel, the convex hull of a compact concept need not be compact without additional assumptions. In that case, the natural replacement is the closed convex hull, denoted $\overline{\operatorname{Conv}(C)}$. In complete metrizable locally convex spaces, the closed convex hull of a compact set is compact (Aliprantis and Border, [2006](https://arxiv.org/html/2606.07007#bib.bib53)). Since the closure of a convex set is convex, this closed convex hull is exactly the closure of the ordinary convex hull.

#### 9.4.2 Proof of Corollary [5.3](https://arxiv.org/html/2606.07007#S5.Thmtheorem3)

##### Corollary [5.3](https://arxiv.org/html/2606.07007#S5.Thmtheorem3).

1. All concepts in $\mathcal{C}$ can be separated from each other by one neuron per concept if and only if $\operatorname{Conv}(C_{i})\cap\overline{\operatorname{Conv}(N_{i})}=\phi$ for every $C_{i}\in\mathcal{C}$, where $N_{i}=X\setminus C_{i}$.
2. If perfect one-neuron separation is possible for all distinct, nonempty concepts in a finite $\mathcal{C}$, then at least $|\mathcal{C}|$ neurons are necessary and $|\mathcal{C}|$ neurons are sufficient.

###### Proof\.

The first claim follows by applying Theorem [5.2](https://arxiv.org/html/2606.07007#S5.Thmtheorem2) to each concept $C_{i}$ with $N_{i}=X\setminus C_{i}$.

For the second claim, sufficiency follows from the first claim: each concept can be assigned one separating neuron. For necessity, suppose a neuron with activation region $H^{+}$ perfectly represents two concepts $C_{i}$ and $C_{j}$. Perfect separation gives $H^{+}\cap X=C_{i}$ and also $H^{+}\cap X=C_{j}$, so $C_{i}=C_{j}$. Thus distinct concepts cannot share the same one-neuron representation, and at least $|\mathcal{C}|$ neurons are required. ∎

#### 9.4.3 Proof of Theorem [5.4](https://arxiv.org/html/2606.07007#S5.Thmtheorem4)

We first record a strict point-versus-convex-set separation lemma.

###### Lemma 9.4 (Strict separation of a point).

Let $K\subseteq\mathbb{R}^{n}$ be nonempty, compact, and convex, and let $y\notin K$. Then there exist $w\neq 0$ and $\beta\in\mathbb{R}$ such that

$$w^{\top}y<\beta,\qquad w^{\top}u>\beta\quad(u\in K).$$

###### Proof\.

Since $K$ is compact, the map $u\mapsto\|u-y\|$ attains its minimum at some $u^{\ast}\in K$. Because $y\notin K$, $\delta:=\|u^{\ast}-y\|>0$. Let $w:=u^{\ast}-y$. For $u\in K$ and $t\in[0,1]$, convexity gives $u^{\ast}+t(u-u^{\ast})\in K$, and minimality gives

$$\|u^{\ast}+t(u-u^{\ast})-y\|^{2}\geq\|u^{\ast}-y\|^{2}.$$

Expanding and letting $t\to 0^{+}$ yields $w^{\top}(u-u^{\ast})\geq 0$, so

$$w^{\top}u\geq w^{\top}u^{\ast}=w^{\top}y+\|w\|^{2}\qquad(u\in K).$$

Choosing $\beta:=w^{\top}y+\frac{1}{2}\|w\|^{2}$ proves the claim. ∎

##### Theorem [5.4](https://arxiv.org/html/2606.07007#S5.Thmtheorem4).

A concept $C$ can be separated from $N$ by a unit if and only if

$$\operatorname{Conv}(C)\cap\overline{N}=\phi.$$
###### Proof\.

A unit is a finite intersection of neuron activation regions. Given neurons $(w_{i},b_{i})_{i=1}^{m}$, write

$$\theta:=\bigcap_{i=1}^{m}\{x:w_{i}^{\top}x+b_{i}>0\}.$$

Separating $C$ from $N$ by this unit means

$$C\subseteq\theta,\qquad N\cap\theta=\phi,\tag{13}$$

or equivalently, every point in $C$ activates all selected neurons, while every point in $N$ fails to activate at least one selected neuron.

($\Leftarrow$) Assume $\operatorname{Conv}(C)\cap\overline{N}=\phi$. Let $K:=\operatorname{Conv}(C)$, which is nonempty, compact, and convex by Corollary [9.3](https://arxiv.org/html/2606.07007#S9.Thmtheorem3). The set $\overline{N}$ is compact because $N\subseteq X$ and $X$ is bounded.

For each $y\in\overline{N}$, Lemma [9.4](https://arxiv.org/html/2606.07007#S9.Thmtheorem4) gives $w_{y}\neq 0$ and $\beta_{y}$ such that

$$w_{y}^{\top}y<\beta_{y},\qquad w_{y}^{\top}u>\beta_{y}\quad(u\in K).\tag{14}$$

The open halfspaces $U_{y}:=\{x:w_{y}^{\top}x<\beta_{y}\}$ form an open cover of $\overline{N}$. By compactness, choose a finite subcover $U_{y_{1}},\ldots,U_{y_{m}}$. Define neurons by $(w_{i},b_{i}):=(w_{y_{i}},-\beta_{y_{i}})$ and set

$$\theta=\bigcap_{i=1}^{m}\{x:w_{i}^{\top}x+b_{i}>0\}.$$

For every $x\in C\subseteq K$, the second inequality in ([14](https://arxiv.org/html/2606.07007#S9.E14)) gives $w_{i}^{\top}x+b_{i}>0$ for all $i$, so $C\subseteq\theta$. For every $y\in N\subseteq\overline{N}$, the finite subcover gives some $i$ with $y\in U_{y_{i}}$, hence $w_{i}^{\top}y+b_{i}<0$ and $y\notin\theta$. Therefore $N\cap\theta=\phi$.

($\Rightarrow$) Conversely, suppose a finite unit $\theta$ satisfies ([13](https://arxiv.org/html/2606.07007#S9.E13)). We first show that $\operatorname{Conv}(C)\subseteq\theta$. For any $u\in\operatorname{Conv}(C)$, write $u=\sum_{j}\lambda_{j}x_{j}$ with $x_{j}\in C$, $\lambda_{j}\geq 0$, and $\sum_{j}\lambda_{j}=1$. For each selected neuron $i$,

$$w_{i}^{\top}u+b_{i}=\sum_{j}\lambda_{j}(w_{i}^{\top}x_{j}+b_{i})>0,$$

because every term with $\lambda_{j}>0$ is positive and at least one such term exists. Hence $u\in\theta$.

If $p\in\operatorname{Conv}(C)\cap\overline{N}$, then $p\in\theta$. Since $\theta$ is open and $p\in\overline{N}$, there exists a sequence $y_{k}\in N$ with $y_{k}\to p$ and eventually $y_{k}\in\theta$, contradicting $N\cap\theta=\phi$. Thus $\operatorname{Conv}(C)\cap\overline{N}=\phi$. ∎
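The gap between Theorem 5.2 and Theorem 5.4 shows up already in a tiny example where the non-target points surround the concept. The sketch below makes simplifying assumptions that are not in the proof: the concept is a single point $c$, so $K=\{c\}$ and the closest point in Lemma 9.4 is $c$ itself, and $N$ is finite, so one halfspace per point of $N$ replaces the finite subcover. The helper names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def lemma_neuron(c, y):
    """Lemma 9.4 with K = {c}: w = c - y and beta = w.y + |w|^2 / 2, as (w, b = -beta)."""
    w = [ci - yi for ci, yi in zip(c, y)]
    beta = dot(w, y) + 0.5 * dot(w, w)
    return w, -beta


def in_unit(unit, x):
    """A point lies in the unit iff it activates every selected neuron."""
    return all(dot(w, x) + b > 0 for w, b in unit)


c = (0.0, 0.0)                                            # singleton concept
N = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]    # surrounding non-target points
unit = [lemma_neuron(c, y) for y in N]

print(in_unit(unit, c))                  # the concept point activates all four neurons
print([in_unit(unit, y) for y in N])     # each non-target point fails at least one neuron
```

Because $c$ is the average of the four points of $N$, it lies in $\operatorname{Conv}(N)$, so the one-neuron condition of Theorem 5.2 fails here, while the four-neuron unit (an open square around $c$) still separates.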

#### 9.4.4 Proof of Corollary [5.5](https://arxiv.org/html/2606.07007#S5.Thmtheorem5)

##### Corollary [5.5](https://arxiv.org/html/2606.07007#S5.Thmtheorem5).

Assume $\mathcal{C}$ is finite. All concepts in $\mathcal{C}$ can be separated from each other by units if and only if

$$\operatorname{Conv}(C_{i})\cap\overline{(C_{j}\setminus C_{i})}=\phi\qquad(C_{i},C_{j}\in\mathcal{C},\ C_{j}\neq C_{i}).$$
###### Proof\.

Fix $C_{i}\in\mathcal{C}$ and write

$$N_{i}:=X\setminus C_{i}=\bigcup_{j\neq i}(C_{j}\setminus C_{i}).$$

By Theorem [5.4](https://arxiv.org/html/2606.07007#S5.Thmtheorem4), $C_{i}$ can be separated from $N_{i}$ by a unit if and only if

$$\operatorname{Conv}(C_{i})\cap\overline{N_{i}}=\phi.$$

Since $\mathcal{C}$ is finite,

$$\overline{N_{i}}=\overline{\bigcup_{j\neq i}(C_{j}\setminus C_{i})}=\bigcup_{j\neq i}\overline{(C_{j}\setminus C_{i})}.$$

Therefore

$$\operatorname{Conv}(C_{i})\cap\overline{N_{i}}=\phi\quad\Longleftrightarrow\quad\operatorname{Conv}(C_{i})\cap\overline{(C_{j}\setminus C_{i})}=\phi\quad\text{for all }j\neq i.$$

Taking the conjunction over all $i$ proves the equivalence. ∎

##### Neuron-budget remark.

The equivalence above guarantees that each concept can be separated by some finite unit, but the number of halfspaces obtained from the compact-cover proof depends on the geometry of the sets. A geometry-independent bound of $|\mathcal{C}|(|\mathcal{C}|-1)$ neurons follows under the stronger pairwise one-neuron condition

$$\operatorname{Conv}(C_{i})\cap\overline{\operatorname{Conv}(C_{j}\setminus C_{i})}=\phi\qquad(i\neq j).$$

Indeed, Theorem [5.2](https://arxiv.org/html/2606.07007#S5.Thmtheorem2) then gives one neuron that is positive on $C_{i}$ and non-positive on $C_{j}\setminus C_{i}$ for each ordered pair $(i,j)$. Intersecting the $|\mathcal{C}|-1$ neurons associated with a fixed $i$ separates $C_{i}$ from $X\setminus C_{i}$. Across all concepts this uses at most $|\mathcal{C}|(|\mathcal{C}|-1)$ neurons before any reuse. Under the weaker condition of Corollary [5.5](https://arxiv.org/html/2606.07007#S5.Thmtheorem5), the correct general statement is finiteness, not a universal $|\mathcal{C}|(|\mathcal{C}|-1)$ bound.

This explains why units can be more monosemantic than individual neurons: neurons can be reused across several units, and each unit can combine several halfspaces to exclude different non-target regions.

#### 9.4.5 Discussion of Definition [5.6](https://arxiv.org/html/2606.07007#S5.Thmtheorem6)

Recall that the separation error is the symmetric difference on the data support:

$$e_{\mathrm{sep}}(C,\theta)=\mu(C\Delta\theta)=\underbrace{\mu(\theta\setminus C)}_{\text{contamination error }e_{c}}+\underbrace{\mu(C\setminus\theta)}_{\text{missing error }e_{m}},$$

where $\mu$ is a Borel probability measure supported on $X$.

An equivalent form is

$$e_{\mathrm{sep}}(C,\theta)=\mu(C\setminus\theta)+\mu\!\left(\theta\cap\bigcup_{C^{\prime}\in\mathcal{C},\,C^{\prime}\neq C}(C^{\prime}\setminus C)\right).$$

The second term measures how much unrelated concept mass is covered by $\theta$.

###### Proof\.

Because $X=\bigcup_{C^{\prime}\in\mathcal{C}}C^{\prime}$, we have

$$\bigcup_{C^{\prime}\in\mathcal{C},\,C^{\prime}\neq C}(C^{\prime}\setminus C)=X\setminus C.$$

For concept separation, $\theta\subseteq X$, and hence

$$\theta\cap\bigcup_{C^{\prime}\in\mathcal{C},\,C^{\prime}\neq C}(C^{\prime}\setminus C)=\theta\cap(X\setminus C)=\theta\setminus C.$$

Finally,

$$C\Delta\theta=(C\setminus\theta)\cup(\theta\setminus C),$$

where the union is disjoint. Additivity of $\mu$ gives the claimed expression. ∎
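The two forms of the separation error can be checked on a finite sample. In this sketch the data set `X`, the concepts, and the unit region `theta` are hypothetical, and $\mu$ is taken to be the empirical (uniform) measure on the ten sample points; exact fractions are used so the equalities hold without rounding.

```python
from fractions import Fraction

X = set(range(10))                          # hypothetical data support
C = {0, 1, 2, 3}                            # target concept
OTHER_CONCEPTS = [{3, 4, 5, 6}, {7, 8, 9}]  # together with C these cover X
theta = {2, 3, 4}                           # region claimed by a unit


def mu(region):
    """Empirical probability measure on X."""
    return Fraction(len(region & X), len(X))


e_c = mu(theta - C)    # contamination error
e_m = mu(C - theta)    # missing error
e_sep = mu(C ^ theta)  # measure of the symmetric difference

# Equivalent form: missing mass plus unrelated concept mass covered by theta.
non_target = set().union(*(other - C for other in OTHER_CONCEPTS))
e_alt = mu(C - theta) + mu(theta & non_target)

print(e_c, e_m, e_sep, e_alt)
```

With these sets the contamination error is 1/10, the missing error is 1/5, and both forms of the separation error equal 3/10.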

### 9.5 Proofs for Concept Approximation

For the approximation results, we work on a bounded convex probe domain $\Omega\subset\mathbb{R}^{n}$ containing the support on which the approximation error is evaluated. We assume that $\nu$ has a density with respect to the relevant Lebesgue measure on $\Omega$, bounded above and below on its support. These regularity assumptions rule out purely atomic or otherwise degenerate probe measures, for which arbitrary labels on finitely many atoms may not reflect the ambient geometry.

#### 9.5.1 Proof of Theorem [5.8](https://arxiv.org/html/2606.07007#S5.Thmtheorem8)

##### Theorem [5.8](https://arxiv.org/html/2606.07007#S5.Thmtheorem8).

A concept $C\in\mathcal{C}$ can be arbitrarily well approximated under the approximation error by units if and only if $C$ is convex up to a $\nu$-null set.

###### Proof.

Here “convex up to a $\nu$-null set” means that there exists a convex set $K$ such that $\nu(C\Delta K)=0$.

($\Leftarrow$) Suppose such a convex set $K$ exists. Standard convex-body approximation gives a sequence of polytopes $P_{m}$, each representable as an intersection of finitely many halfspaces, such that $\nu(K\Delta P_{m})\to 0$. Each $P_{m}$ can be implemented by a unit, up to boundary sets of $\nu$-measure zero. Therefore

$$\nu(C\Delta P_{m})\leq\nu(C\Delta K)+\nu(K\Delta P_{m})\to 0,$$

so $C$ can be arbitrarily well approximated by units.

($\Rightarrow$) Conversely, suppose there exist units $\theta_{m}$ such that $\nu(C\Delta\theta_{m})\to 0$. Each $\theta_{m}$ is convex because it is an intersection of halfspaces. Intersecting with the bounded domain $\operatorname{Supp}(\nu)$ and taking closures changes the sets only on polyhedral boundaries, which are $\nu$-null; thus, the compact convex sets $K_{m}:=\overline{\theta_{m}\cap\operatorname{Supp}(\nu)}$ still satisfy $\nu(C\Delta K_{m})\to 0$. By the compactness theorem for convex bodies on a bounded domain, a subsequence of $K_{m}$ converges in Hausdorff distance to a compact convex set $K\subseteq\operatorname{Supp}(\nu)$. Since Hausdorff convergence of convex bodies implies convergence in measure, $\nu(K_{m}\Delta K)\to 0$ along this subsequence. Therefore

$$\nu(C\Delta K)\leq\nu(C\Delta K_{m})+\nu(K_{m}\Delta K)\to 0,$$

so $\nu(C\Delta K)=0$. Thus $C$ is convex up to a $\nu$-null set. ∎

#### 9.5.2 Proof of Theorem [5.9](https://arxiv.org/html/2606.07007#S5.Thmtheorem9)

##### Theorem [5.9](https://arxiv.org/html/2606.07007#S5.Thmtheorem9).

Under regularity and boundary-smoothness conditions, a concept $C\in\mathcal{C}$ can be approximated by a unit $\theta_{M}$ with error

$$e_{\mathrm{app}}(C,\theta_{M})\lesssim e_{\mathrm{irr}}+A|M|^{-\frac{2}{r-1}},$$

where $A$ depends on boundary smoothness, $r$ is the effective dimension of $C$, and $e_{\mathrm{irr}}$ is an irreducible convexification error. In particular, $e_{\mathrm{irr}}=0$ when $C$ is convex, or more generally when $\nu(\operatorname{Conv}(C)\setminus C)=0$.

###### Proof\.

We first consider the convex case. Let $m:=|M|$. Assume that $C$ is an $r$-dimensional convex body, $r\geq 2$, with $\mathcal{C}^{2}$ boundary and positive Gaussian curvature, and that $\nu$ has positive continuous density $\rho_{\nu}$ with respect to $r$-dimensional Lebesgue measure on the affine hull of $C$. The case $r=1$ is simpler: an interval can be represented exactly by two halfspaces, up to boundary measure zero.

A unit with $m$ selected neurons is an intersection of $m$ halfspaces, hence a polytope with at most $m$ facets. Conversely, any such polytope can be represented by a unit with at most $m$ neurons. Theorem 3 of Ludwig ([1999](https://arxiv.org/html/2606.07007#bib.bib48)) gives the asymptotic best approximation rate by $m$-facet polytopes:

$$\inf_{\theta_{M}:\,|M|=m}\nu(C\Delta\theta_{M})\sim\frac{1}{2}\operatorname{ldiv}_{r-1}\left(\int_{\partial C}\rho_{\nu}(x)^{\frac{r-1}{r+1}}\kappa_{C}(x)^{\frac{1}{r+1}}\,d\mathcal{H}^{r-1}(x)\right)^{\frac{r+1}{r-1}}m^{-\frac{2}{r-1}}.$$

Absorbing the boundary integral and dimension-dependent constants into $A$ yields

$$e_{\mathrm{app}}(C,\theta_{M})\lesssim Am^{-\frac{2}{r-1}}.$$

For a non-convex concept, let $K$ be a convex body used as a convex envelope for $C$; when $\operatorname{Conv}(C)$ satisfies the same regularity assumptions, we may take $K=\operatorname{Conv}(C)$. Let $\theta_{M}$ be the $m$-facet unit approximating $K$. By the triangle inequality for symmetric difference,

$$\nu(C\Delta\theta_{M})\leq\nu(C\Delta K)+\nu(K\Delta\theta_{M})\lesssim\nu(C\Delta K)+A_{K}m^{-\frac{2}{r-1}}.$$

The first term is the irreducible error caused by approximating a non-convex set with convex units. Taking $K=\operatorname{Conv}(C)$ gives $e_{\mathrm{irr}}=\nu(\operatorname{Conv}(C)\setminus C)$ whenever $C\subseteq\operatorname{Conv}(C)$ is evaluated under $\nu$; more generally, $e_{\mathrm{irr}}$ may be defined as the best such convex-envelope error over admissible convex $K$. This proves the claimed upper bound. ∎
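The rate $m^{-2/(r-1)}$ can be seen in the simplest planar case, $r=2$, where it predicts an error of order $m^{-2}$. The sketch below assumes a unit disc as the concept, area (unnormalized Lebesgue measure) in place of $\nu$, and the regular $m$-gon circumscribed about the disc as the unit. This polygon is an intersection of $m$ tangent halfspaces but is not claimed to be the optimal $m$-facet approximation, so the numbers only illustrate the scaling.

```python
import math


def circumscribed_error(m):
    """Area of the regular m-gon circumscribed about the unit disc, minus the disc area."""
    return m * math.tan(math.pi / m) - math.pi


for m in (8, 16, 32, 64, 128):
    err = circumscribed_error(m)
    # err * m**2 should level off if the error scales like m**(-2).
    print(m, err, err * m ** 2)
```

Since the disc is contained in the polygon, the area difference equals the measure of the symmetric difference. Doubling the number of neurons divides the error by roughly four, and `err * m**2` approaches $\pi^{3}/3$, consistent with the exponent $-2/(r-1)=-2$.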

### 9.6 Proofs for Concept Learning Capacity

##### Theorem [5.10](https://arxiv.org/html/2606.07007#S5.Thmtheorem10).

Suppose perfect monosemanticity holds for all concepts, and each concept is represented by at most $k_{c}$ neurons. In the regime $d\gg k_{c}$, this requires approximately

$$d\gtrsim(k_{c}!\,|\mathcal{C}|)^{1/k_{c}},$$

where $d$ is the number of non-dead neurons.

###### Proof\.

If each concept is represented by at most $k_{c}$ neurons, then the number of possible nonempty neuron sets is

$$\sum_{r=1}^{k_{c}}\binom{d}{r}.$$

Perfect monosemanticity requires the concept-to-unit map $f:\mathcal{C}\to\Theta$ to be injective, so two distinct concepts cannot be assigned to the same neuron set. Therefore

$$|\mathcal{C}|\leq\sum_{r=1}^{k_{c}}\binom{d}{r}.$$

When $d\gg k_{c}$, the leading term is

$$\binom{d}{k_{c}}=\frac{d^{k_{c}}}{k_{c}!}\,(1+o(1)).$$

Thus a necessary asymptotic scaling is

$$\frac{d^{k_{c}}}{k_{c}!}\gtrsim|\mathcal{C}|,$$

which gives

$$d\gtrsim(k_{c}!\,|\mathcal{C}|)^{1/k_{c}}.$$

∎
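The counting bound and its leading-order estimate are easy to compare numerically. The function names below are hypothetical, and the concept counts are arbitrary illustrative values; the first function evaluates the exact inequality $|\mathcal{C}|\leq\sum_{r=1}^{k_{c}}\binom{d}{r}$, the second the asymptotic expression from the theorem.

```python
import math


def min_neurons(num_concepts, k_c):
    """Smallest d with sum_{r=1}^{k_c} C(d, r) >= num_concepts (exact counting bound)."""
    d = 1
    while sum(math.comb(d, r) for r in range(1, k_c + 1)) < num_concepts:
        d += 1
    return d


def asymptotic_estimate(num_concepts, k_c):
    """Leading-order estimate (k_c! * |C|)^(1 / k_c), meaningful when d >> k_c."""
    return (math.factorial(k_c) * num_concepts) ** (1.0 / k_c)


for num_concepts, k_c in [(10, 1), (10, 2), (10_000, 2), (1_000_000, 3)]:
    print(num_concepts, k_c, min_neurons(num_concepts, k_c),
          round(asymptotic_estimate(num_concepts, k_c), 1))
```

With one neuron per concept the bound is simply $d\geq|\mathcal{C}|$, while allowing pairs already reduces the requirement to about $\sqrt{2|\mathcal{C}|}$. This is only a necessary condition from injectivity and says nothing about whether the geometry permits such an assignment.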

### 9.7 Additional Discussion on Top-$K$ SAE

For Top-$K$ SAE, write $z_{i}(x)=\langle w_{i},x\rangle+b_{i}$. The SNTA of neuron $i$ is

$$N_{i}^{\mathrm{topk}}=\{x\in X:z_{i}(x)>\tau_{i}\ \text{and}\ i\in\mathrm{TopK}_{k}(z(x))\}.$$

Thus $N_{i}^{\mathrm{topk}}\subseteq H_{i}^{+}$. For a set of neurons $M$,

$$\theta_{M}^{\mathrm{topk}}=\bigcap_{j\in M}N_{j}^{\mathrm{topk}}.$$

If $|M|>k$, then $\theta_{M}^{\mathrm{topk}}=\phi$. If $|M|\leq k$, then $\theta_{M}^{\mathrm{topk}}$ equals the absolute-gating region $\bigcap_{j\in M}H_{j}^{+}$ after removing points where at least one neuron in $M$ fails to enter the top-$k$ set. This is the relative-gating effect.

Let

$$P_{M}:=\bigcap_{j\in M}H_{j}^{+},\qquad Q_{M}:=\theta_{M}^{\mathrm{topk}}.$$

Then $Q_{M}\subseteq P_{M}$. Define the rank-interference region

$$I_{M}:=P_{M}\setminus Q_{M}.$$

For any concept $C$ and any measure $\eta$ equal to either $\mu$ or $\nu$,

$$\eta(C\Delta Q_{M})=\eta(C\Delta P_{M})+\eta(C\cap I_{M})-\eta((P_{M}\setminus C)\cap I_{M}).$$

Indeed, $Q_{M}=P_{M}\setminus I_{M}$, so replacing $P_{M}$ by $Q_{M}$ adds false negatives on $C\cap I_{M}$ and removes false positives on $(P_{M}\setminus C)\cap I_{M}$.

Therefore, Top-$K$ gating can hurt by removing true target points, but it can also help by removing false positives outside the target. The absolute-gating results transfer cleanly to Top-$K$ only when the rank-interference term is small on the target concept, or when its false-positive reduction outweighs the additional false negatives.
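The decomposition of the Top-$K$ error can be checked on a small random encoder. Everything in this sketch is an assumption made for illustration: the random weights, the zero thresholds $\tau_{i}$, the sample points, the concept (points with positive first coordinate), and the counting measure standing in for $\eta$. It also assumes $H_{i}^{+}=\{x:z_{i}(x)>\tau_{i}\}$, as the inclusion $N_{i}^{\mathrm{topk}}\subseteq H_{i}^{+}$ suggests.

```python
import random

random.seed(0)
D, K = 6, 2  # hypothetical dictionary size and Top-K budget
W = [[random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)] for _ in range(D)]
B = [random.gauss(0.0, 0.5) for _ in range(D)]
TAU = [0.0] * D  # per-neuron thresholds tau_i (assumed zero here)
X = [(random.uniform(-2, 2), random.uniform(-2, 2)) for _ in range(500)]


def pre_activations(x):
    """z_i(x) = <w_i, x> + b_i for every neuron i."""
    return [W[i][0] * x[0] + W[i][1] * x[1] + B[i] for i in range(D)]


def top_k(z):
    """Indices of the K largest pre-activations."""
    return set(sorted(range(D), key=lambda i: z[i], reverse=True)[:K])


def regions(M):
    """Index sets (P_M, Q_M) over the sample: absolute gating and Top-K gating."""
    P, Q = set(), set()
    for idx, x in enumerate(X):
        z = pre_activations(x)
        if all(z[j] > TAU[j] for j in M):
            P.add(idx)
            if M <= top_k(z):
                Q.add(idx)
    return P, Q


M = {0, 1}
C = {idx for idx, x in enumerate(X) if x[0] > 0}  # hypothetical concept
P, Q = regions(M)
I = P - Q  # rank-interference region
lhs = len(C ^ Q)
rhs = len(C ^ P) + len(C & I) - len((P - C) & I)
print(lhs, rhs, len(C & I), len((P - C) & I))
```

The two counts agree for any choice of weights because the identity is purely set-theoretic once $Q_{M}\subseteq P_{M}$. The last two printed numbers show how many target points Top-$K$ gating drops and how many false positives it removes in this particular sample.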
