Interpreting Brain Responses to Language with Sparse Features from Language Models

arXiv cs.CL Papers

Summary

This paper introduces Augmented Sparse Encoding Models to interpret brain responses to language using sparse features from language models, validated on high-field 7T fMRI data. It recovers known neural tuning properties and discovers a new voxel population tuned to people-related content.

arXiv:2606.06857v1 Announce Type: new Abstract: A central goal of cognitive neuroscience is to characterize the features that are represented by human language cortex. Artificial language models (LMs) have emerged as a powerful tool to address this challenge, but studies relating biological and artificial representations are often criticized as relating one black box to another. The present work introduces Augmented Sparse Encoding Models, an encoding framework that replaces dense LM hidden states with hierarchically-organized sparse autoencoder (SAE) features, while explicitly including surprisal as a predictor. Using this approach, we (i) produce interpretations of neural responses and (ii) test whether model-brain alignment reflects primary or idiosyncratic variation in LM representations. Using a high-field 7T fMRI dataset of eight participants listening to 200 linguistically diverse sentences, we first validate our modeling framework by recovering previous interpretations of voxel populations tuned to processing difficulty and meaning abstractness. We then interpret a previously-uncharacterized (but reliable) voxel population and find that it is tuned to people-related content. Next, we show that the fronto-temporal human language network is predicted by a common set of features across its constituent regions, but find that frontal regions are relatively well-explained by surprisal alone, even in the absence of LM-based features. Finally, we show that brain responses during language processing are not merely predictable from an arbitrary set of LM features. Rather, brain responses are best explained by the features that tend to capture the most general information encoded in LM representations, suggesting a nontrivial correspondence between brain and LM language representation.

# Interpreting Brain Responses to Language with Sparse Features from Language Models
Source: [https://arxiv.org/html/2606.06857](https://arxiv.org/html/2606.06857)
Michael A\. Lepori \(Dept\. of Computer Science, Brown University, michael\_lepori@brown\.edu\), Kendrick Kay \(Dept\. of Radiology, University of Minnesota, kay@umn\.edu\), Greta Tuckute \(Kempner Institute, Harvard University, gtuckute@fas\.harvard\.edu\)

###### Abstract

A central goal of cognitive neuroscience is to characterize the features that are represented by human language cortex\. Artificial language models \(LMs\) have emerged as a powerful tool to address this challenge, but studies relating biological and artificial representations are often criticized as relating one black box to another\. The present work introduces Augmented Sparse Encoding Models, an encoding framework that replaces dense LM hidden states with hierarchically\-organized sparse autoencoder \(SAE\) features, while explicitly including surprisal as a predictor\. Using this approach, we \(i\) produce interpretations of neural responses and \(ii\) test whether model\-brain alignment reflects primary or idiosyncratic variation in LM representations\. Using a high\-field 7T fMRI dataset of eight participants listening to 200 linguistically diverse sentences, we first validate our modeling framework by recovering previous interpretations of voxel populations tuned to processing difficulty and meaning abstractness\. We then interpret a previously\-uncharacterized \(but reliable\) voxel population and find that it is tuned to people\-related content\. Next, we show that the fronto\-temporal human language network is predicted by a common set of features across its constituent regions, but find that frontal regions are relatively well\-explained by surprisal alone, even in the absence of LM\-based features\. Finally, we show that brain responses during language processing are not merely predictable from an arbitrary set of LM features\. Rather, brain responses are best explained by the features that tend to capture the most general information encoded in LM representations, suggesting a nontrivial correspondence between brain and LM language representation\.

## 1 Introduction

Humans effortlessly map speech signals to complex meanings through language, but the representations that support this process in the brain remain poorly characterized\. Language models \(LMs\) have become a central tool for probing the representations underlying language processing \(Jain and Huth, [2018](https://arxiv.org/html/2606.06857#bib.bib25); Caucheteux and King, [2022](https://arxiv.org/html/2606.06857#bib.bib93); Tuckute et al\., [2024a](https://arxiv.org/html/2606.06857#bib.bib99)\)\. At the same time, this line of work has been criticized as relating one black box \(an LM\) to another \(the brain\) without yielding clear claims about what neural populations represent or compute\.

Two recent developments make it increasingly feasible to gain scientific insight from LM encoding models\. On the LM side, sparse autoencoders \(SAEs\) provide a tool for decomposing dense LM hidden states into latent features that are often more identifiable and easier to interpret than residual stream features \(Bricken et al\., [2023](https://arxiv.org/html/2606.06857#bib.bib163)\)\. On the neuroscience side, decades of neuroimaging work have delineated a fronto\-temporal language network \(Binder et al\., [1997](https://arxiv.org/html/2606.06857#bib.bib135); Fedorenko et al\., [2024](https://arxiv.org/html/2606.06857#bib.bib97)\), and additional work has made progress in characterizing the response properties of these regions\. One line of work shows that responses in language regions track linguistic processing difficulty \(as quantified by LM surprisal, for instance; Henderson et al\., [2016](https://arxiv.org/html/2606.06857#bib.bib132); Shain et al\., [2020](https://arxiv.org/html/2606.06857#bib.bib114); Wehbe et al\., [2021](https://arxiv.org/html/2606.06857#bib.bib31); Heilbron et al\., [2022](https://arxiv.org/html/2606.06857#bib.bib14); Tuckute et al\., [2024b](https://arxiv.org/html/2606.06857#bib.bib22)\), while another shows tracking of concrete vs\. abstract meanings during sentence or narrative comprehension \(Botch and Finn, [2024](https://arxiv.org/html/2606.06857#bib.bib15); Tuckute et al\., [2025](https://arxiv.org/html/2606.06857#bib.bib1); with related work on single words, e\.g\., Binder et al\., [2005](https://arxiv.org/html/2606.06857#bib.bib16); West and Holcomb, [2000](https://arxiv.org/html/2606.06857#bib.bib32); Fernandino et al\., [2015](https://arxiv.org/html/2606.06857#bib.bib33)\)\.

In the present study, we leverage these advancements to propose and validate a new class of encoding models: Augmented Sparse Encoding Models \(code available [here](https://github.com/mlepori1/Interpretable_Encoding_Models)\)\. Compared to standard LM encoding models, our framework introduces two changes: we project dense residual\-stream LM features into a sparse, hierarchically organized SAE basis, and we augment that basis with an explicit feature that captures processing difficulty \(surprisal\)\. Our first goal is to produce interpretations of voxel response tuning\. Next, because the SAE basis is hierarchically organized from general, primary features to fine\-grained, idiosyncratic features, it enables us to go beyond claims from prior LM encoding work: whereas prior studies show that LM representations can predict neural responses, they do not characterize the properties of the features that drive alignment\. Our second goal, therefore, is to identify whether brain alignment relies on primary or idiosyncratic LM feature dimensions\.

Using Augmented Sparse Encoding Models, we make the following contributions:

1. SAE features can predict voxel responses to language as accurately as dense LM residual\-stream features, while also providing interpretations that affirm and extend prior neuroscience findings on processing difficulty\-tuned and content\-tuned voxels \(Section [4](https://arxiv.org/html/2606.06857#S4)\)\.
2. We discover and interpret previously uncharacterized voxel populations \(Section [5](https://arxiv.org/html/2606.06857#S5)\)\.
3. We show how different regions of the human language network respond differentially to processing difficulty vs\. content features: some frontal brain regions are explained well by processing difficulty, whereas temporal regions draw more on LM\-derived “content” features\. We further characterize the features that are prevalent across brain regions and individuals \(Section [6](https://arxiv.org/html/2606.06857#S6)\)\.
4. We show that brain responses to language are predicted by primary, general features of LM representations rather than idiosyncratic ones, revealing a deeper correspondence between artificial and biological language representations than implied by measures of encoding accuracy alone \(Section [6](https://arxiv.org/html/2606.06857#S6)\)\.

## 2 Related Work

#### Organization of the Language Network

A set of frontal and temporal brain areas in the left hemisphere, the “language network”, supports language understanding and production across input modalities \(Binder et al\., [1997](https://arxiv.org/html/2606.06857#bib.bib135); Deniz et al\., [2019](https://arxiv.org/html/2606.06857#bib.bib126); Lipkin et al\., [2022](https://arxiv.org/html/2606.06857#bib.bib127); Hu et al\., [2023](https://arxiv.org/html/2606.06857#bib.bib131); Fedorenko et al\., [2024](https://arxiv.org/html/2606.06857#bib.bib97)\)\. Within this system, prior work has emphasized processing difficulty and predictability as robust drivers of univariate response magnitude during comprehension \(Shain et al\., [2020](https://arxiv.org/html/2606.06857#bib.bib114); Wehbe et al\., [2021](https://arxiv.org/html/2606.06857#bib.bib31); Tuckute et al\., [2024b](https://arxiv.org/html/2606.06857#bib.bib22)\)\. A separate line of work has studied abstract–concrete semantic organization in the brain, mostly in single\-word experiments \(Binder et al\., [2005](https://arxiv.org/html/2606.06857#bib.bib16); West and Holcomb, [2000](https://arxiv.org/html/2606.06857#bib.bib32); Fernandino et al\., [2015](https://arxiv.org/html/2606.06857#bib.bib33)\); one recent study links these lines of work by showing that sentence\-evoked voxel responses are jointly organized along dimensions of processing difficulty and content \(specifically meaning\-abstractness\) \(Tuckute et al\., [2025](https://arxiv.org/html/2606.06857#bib.bib1)\)\. Thus, although the location of the language network and its broad response properties are now well established, how linguistic information is represented at a finer grain across voxels \(within and across regions, and across individuals\) remains less well understood\.

#### Mechanistic Interpretability

Efforts to interpret the algorithms and representations learned by trained neural networks have coalesced into a set of techniques and paradigms that comprise the field of mechanistic interpretability \(Elhage et al\., [2021](https://arxiv.org/html/2606.06857#bib.bib162); Geiger et al\., [2025](https://arxiv.org/html/2606.06857#bib.bib164)\)\. In particular, the rediscovery of sparse dictionary learning \(Olshausen and Field, [1996](https://arxiv.org/html/2606.06857#bib.bib165)\) has driven progress on unsupervised methods for uncovering interpretable features \(Fel et al\., [2023](https://arxiv.org/html/2606.06857#bib.bib167)\), discovering circuits \(Ameisen et al\., [2025](https://arxiv.org/html/2606.06857#bib.bib166)\), and controlling model behavior \(Bricken et al\., [2023](https://arxiv.org/html/2606.06857#bib.bib163)\)\. Sparse dictionary learning methods, such as SAEs, attempt to learn overcomplete dictionaries of features, such that dense representations can be approximated by a small number of monosemantic \(and therefore interpretable\) features\. In this work, we employ pretrained SAEs to create feature spaces for Augmented Sparse Encoding Models\.

#### LM Encoding Models

Standard LM encoding studies have established that LM representations can predict brain responses to language at the granularity of fMRI voxels, M\(EEG\) sensors, and intracranial recordings \(Jain and Huth, [2018](https://arxiv.org/html/2606.06857#bib.bib25); Caucheteux and King, [2022](https://arxiv.org/html/2606.06857#bib.bib93); Hosseini et al\., [2024](https://arxiv.org/html/2606.06857#bib.bib70); AlKhamissi et al\., [2025](https://arxiv.org/html/2606.06857#bib.bib26)\)\. A growing body of work has aimed to distill and interpret the properties of LM representations that enable LMs to capture neural responses to language\. One direction restricts or perturbs the linguistic information \(e\.g\., syntactic vs\. semantic information\) a model is trained on or can use during inference \(Pasquiou et al\., [2023](https://arxiv.org/html/2606.06857#bib.bib138); Merlin and Toneva, [2024](https://arxiv.org/html/2606.06857#bib.bib112); Kauf et al\., [2024](https://arxiv.org/html/2606.06857#bib.bib27)\)\. A second direction isolates specific internal LM components, such as attention weights rather than standard residual\-stream embeddings \(Lamarre et al\., [2022](https://arxiv.org/html/2606.06857#bib.bib29); Kumar et al\., [2024](https://arxiv.org/html/2606.06857#bib.bib92)\)\. A third direction develops explicitly human\-readable or sparse, identifiable feature spaces for voxel\-wise encoding models \(Benara et al\., [2024](https://arxiv.org/html/2606.06857#bib.bib113); Singh et al\., [2025](https://arxiv.org/html/2606.06857#bib.bib137); Zeng and Gallant, [2025](https://arxiv.org/html/2606.06857#bib.bib136)\)\. Our work is closest to the third direction\. Most closely related, Zeng and Gallant \([2025](https://arxiv.org/html/2606.06857#bib.bib136)\) apply sparse dictionary learning to word\-level embeddings in narratives\. We differ by \(i\) using SAEs over contextual LM hidden states, and \(ii\) leveraging the hierarchical structure of one class of SAE architecture to ask which levels of the LM feature hierarchy best align with brain responses\. Concurrent and complementary work employs SAEs to study semantic organization in the cortex \(Guo et al\., [2026](https://arxiv.org/html/2606.06857#bib.bib5)\) and to dissociate temporal windows and broad cortical regions using intracranial measurements \([Kleinman and Goldstein,](https://arxiv.org/html/2606.06857#bib.bib6)\)\.

## 3 Methods

#### Brain Data

We used brain responses from prior work \(Tuckute et al\., [2025](https://arxiv.org/html/2606.06857#bib.bib1)\), in which eight proficient English speakers underwent ultra\-high\-field 7 Tesla \(7T\) fMRI while listening to 200 linguistically diverse sentences, each repeated three times in pseudorandomized order\. The fMRI blood\-oxygenation\-level\-dependent \(BOLD\) response amplitude was modeled with a General Linear Model \(Prince et al\., [2022](https://arxiv.org/html/2606.06857#bib.bib23)\), yielding a single beta value per voxel per sentence trial \(relative to a fixation baseline\)\. We averaged over the three repetitions of a sentence\. We restricted analyses to voxels with a noise ceiling signal\-to\-noise ratio \> 0\.4 \(NCSNR; computed from stimulus repetitions as in Allen et al\., [2022](https://arxiv.org/html/2606.06857#bib.bib35)\), ensuring that voxel selection and subsequent interpretation are based on voxels with reliable responses\. Given the left\-lateralization of language \(Lipkin et al\., [2022](https://arxiv.org/html/2606.06857#bib.bib127)\), all analyses are restricted to the left hemisphere\. Analyses were conducted on the fsaverage surface, but for simplicity we refer to surface vertices as “voxels” throughout the paper\.
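The two preprocessing steps described above (averaging the three repetitions of each sentence and keeping only voxels above the NCSNR threshold) can be sketched as follows. This is a minimal illustration rather than the authors' code: the nested-list layout `betas[voxel][sentence][repetition]` and both function names are assumptions, and the NCSNR values are taken as given rather than computed.

```python
def average_repetitions(betas):
    """Average beta values over repetitions.

    betas[v][s] is a list of per-repetition betas for voxel v and
    sentence s (a hypothetical layout chosen for this sketch).
    Returns one mean beta per voxel per sentence.
    """
    return [[sum(reps) / len(reps) for reps in voxel] for voxel in betas]


def reliable_voxels(ncsnr, threshold=0.4):
    """Return indices of voxels whose NCSNR exceeds the threshold."""
    return [v for v, value in enumerate(ncsnr) if value > threshold]


# Toy example: 3 voxels, 2 sentences, 3 repetitions each.
betas = [
    [[1.0, 2.0, 3.0], [0.0, 0.0, 3.0]],
    [[0.5, 0.5, 0.5], [1.0, 1.0, 1.0]],
    [[2.0, 4.0, 6.0], [3.0, 3.0, 3.0]],
]
ncsnr = [0.55, 0.10, 0.41]  # made-up reliability values, not real data

mean_betas = average_repetitions(betas)
keep = reliable_voxels(ncsnr)
responses = [mean_betas[v] for v in keep]  # rows: reliable voxels only
```

In this toy input, the second voxel falls below the 0.4 threshold and is dropped before any encoding model is fit.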

We select voxels of interest in two main ways: one based on two principal components \(PCs\) of sentence\-evoked responses identified by Tuckute et al\. \([2025](https://arxiv.org/html/2606.06857#bib.bib1)\), and the other based on whether a voxel is part of the fronto\-temporal language network \(Fedorenko et al\., [2010](https://arxiv.org/html/2606.06857#bib.bib19)\)\. For the PC\-based selection, we leverage a recent finding that two PCs capture most generalizable variance in sentence\-evoked voxel responses \(Tuckute et al\., [2025](https://arxiv.org/html/2606.06857#bib.bib1); and related work: Shain et al\., [2020](https://arxiv.org/html/2606.06857#bib.bib114); Wehbe et al\., [2021](https://arxiv.org/html/2606.06857#bib.bib31); Botch and Finn, [2024](https://arxiv.org/html/2606.06857#bib.bib15)\): PC1 reflects processing difficulty \(characterized by frequency and surprisal\) and PC2 reflects concrete–abstract content \(the degree to which a sentence’s meaning is grounded in perceptual experience\)\. Based on each voxel’s correlation with these two PCs, we define five populations: Hard\-to\-Process \(high PC1\), Easy\-to\-Process \(low PC1\), Abstract \(high PC2\), Concrete \(low PC2\), and Ghost \(weak loading on both\)\. For the language\-network selection, we use five canonical left\-hemisphere functional regions of interest \(fROIs\) defined via a sentences \> nonwords contrast \(Fedorenko et al\., [2010](https://arxiv.org/html/2606.06857#bib.bib19)\), selecting the top 10% significant voxels within three frontal brain parcels: inferior frontal gyrus \(IFG\), its orbital portion \(IFGorb\), and middle frontal gyrus \(MFG\); and two temporal parcels: anterior temporal \(AntTemp\) and posterior temporal \(PostTemp\) \(see Appendix [A](https://arxiv.org/html/2606.06857#A1) for details\)\.

This dataset, which combines 7T resolution with highly reliable sentence\-level measurements, provides an ideal testbed for the voxel\-level analyses we pursue below\. We additionally use an independent 3T dataset for replication \(see Appendix [F](https://arxiv.org/html/2606.06857#A6)\)\.

#### Augmented Sparse Encoding Models

Encoding models are defined by two components: \(i\) a feature extractor that transforms language stimuli into a feature vector, and \(ii\) a linear readout that maps that feature vector onto neural responses to the same linguistic stimuli\. LM encoding models typically use pretrained LMs as feature extractors, aggregating the intermediate residual\-stream representations of an LM to create one feature vector per sentence \(see Fig\. [1](https://arxiv.org/html/2606.06857#S3.F1)B, bottom\)\. Linear readouts are typically regularized linear regressions\. See individual study sections for details on the linear readouts that we employ\.

![Refer to caption](https://arxiv.org/html/2606.06857v1/x1.png)

Figure 1: Overview of methods\. \(A\) Each dot is a voxel from one sample participant, projected onto two PCs from prior work\. Voxels selected for analysis in Study 1 and 2 are annotated\. Color denotes each voxel’s correlation with sentence\-frequency annotations derived from humans \(Tuckute et al\., [2025](https://arxiv.org/html/2606.06857#bib.bib1)\)\. \(B\) Visualization of LM feature spaces\. Prior LM encoding models rely on residual stream features, whereas Augmented Sparse Encoding Models use SAE feature spaces along with surprisal, which is computed from LM output probabilities and averaged over the tokens in a sentence\.

In this work, we introduce two changes to the typical framework to make the resulting LM encoding model more interpretable\. First, we employ pretrained SAEs to transform dense, difficult\-to\-interpret residual\-stream LM representations into a much larger, sparser, and \(ideally\) more interpretable feature basis\. Specifically, we pass LM hidden states \(i\.e\., residual stream representations\) into the encoder of an SAE and compute the mean representation over tokens in a sentence in this sparse basis \(Fig\. [1](https://arxiv.org/html/2606.06857#S3.F1)B, middle\)\. Our second change is the inclusion of a dedicated feature that encodes the average surprisal \(i\.e\., negative log probability\) of a sentence, as computed by an LM \(see Fig\. [1](https://arxiv.org/html/2606.06857#S3.F1)B, top\)\. Surprisal is widely used as a measure of sentence processing difficulty, and has been shown to correlate with behavioral and neural signatures of processing difficulty in human participants \(Wilcox et al\., [2020](https://arxiv.org/html/2606.06857#bib.bib12); Michaelov et al\., [2024](https://arxiv.org/html/2606.06857#bib.bib13)\)\. Together, these two changes enable us to \(i\) distinguish whether neural responses are best explained by processing difficulty vs\. representational content obtained from the LM/SAE, and \(ii\) attempt to directly interpret the SAE features that explain neural responses\.
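A minimal sketch of how one sentence could be turned into an augmented feature vector, assuming the per-token LM probabilities and per-token SAE encoder activations have already been computed. The function names, the use of the natural logarithm, and the choice to place surprisal in the first position are assumptions made for illustration and are not taken from the released code.

```python
import math


def mean_surprisal(token_probs):
    """Mean negative log probability (in nats) over the tokens of a sentence."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def mean_pool(token_activations):
    """Average SAE activations over tokens, one value per SAE feature."""
    n_tokens = len(token_activations)
    return [sum(column) / n_tokens for column in zip(*token_activations)]


def augmented_features(token_probs, token_sae_activations):
    """Concatenate the surprisal feature with the mean-pooled SAE features."""
    return [mean_surprisal(token_probs)] + mean_pool(token_sae_activations)


# Toy sentence with two tokens and a three-feature "SAE" (made-up numbers).
token_probs = [0.5, 0.25]
token_sae_activations = [
    [0.0, 2.0, 0.0],
    [4.0, 0.0, 0.0],
]
features = augmented_features(token_probs, token_sae_activations)
```

Keeping surprisal as its own column is what allows a surprisal-only baseline to be compared against the full feature set later on.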

We instantiate these changes using the gemma\-2\-2b base model \(Team et al\., [2024](https://arxiv.org/html/2606.06857#bib.bib11)\), owing to its competitive language performance, relatively small size, and availability of pretrained SAEs\. Following prior work on SAEs in gemma models, we primarily use representations from layer 12, though we replicate some of our findings with layer 14 \(Appendix [B](https://arxiv.org/html/2606.06857#A2)\)\.

We analyze two different SAE architectures: JumpReLU and Matryoshka\. The JumpReLU SAE is a common, high\-performing variation of a standard SAE \(Rajamanoharan et al\., [2024](https://arxiv.org/html/2606.06857#bib.bib4)\)\. The Matryoshka SAE introduces additional structure into the SAE basis to learn more disentangled features at varying levels of granularity \(Bussmann et al\., [2025](https://arxiv.org/html/2606.06857#bib.bib2)\)\. Specifically, it is trained in a hierarchical fashion that produces five nested feature sets: the first set can coarsely approximate LM residual states on its own, and each subsequent set is trained to reduce the residual reconstruction error given all earlier sets\. As a result, early SAE features capture general, widely\-applicable information, and later features capture progressively finer\-grained features\. We use pretrained JumpReLU SAEs from the Gemma Scope release \(Lieberum et al\., [2024](https://arxiv.org/html/2606.06857#bib.bib7)\), and pretrained Matryoshka SAEs from chanind et al\. \([2025](https://arxiv.org/html/2606.06857#bib.bib3)\)\. Standard gemma\-2\-2b residual\-stream embeddings have 2304 dimensions, the JumpReLU SAE has 16\.4K dimensions, and the Matryoshka SAE has 32K dimensions\.
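Because the Matryoshka feature sets are nested, a feature's index determines which level of the hierarchy it belongs to. The sketch below shows one way to map selected feature indices to levels. The group boundaries are hypothetical placeholders (the text above does not state the sizes of the five sets, and 32K is assumed here to mean 32768), so the resulting levels are illustrative only.

```python
from bisect import bisect_right

# Hypothetical cumulative sizes of the five nested feature sets.
# These values are NOT from the paper; substitute the real boundaries.
GROUP_BOUNDARIES = [128, 512, 2048, 8192, 32768]


def hierarchy_level(feature_index, boundaries=GROUP_BOUNDARIES):
    """Return the 0-based nested set that a feature index falls into."""
    if not 0 <= feature_index < boundaries[-1]:
        raise ValueError("feature index outside the SAE basis")
    return bisect_right(boundaries, feature_index)


def level_counts(selected_features, boundaries=GROUP_BOUNDARIES):
    """Count how many selected features fall into each nested set."""
    counts = [0] * len(boundaries)
    for index in selected_features:
        counts[hierarchy_level(index, boundaries)] += 1
    return counts


# Example with arbitrary feature indices.
counts = level_counts([5, 100, 300, 4000, 4001])
```

A histogram of this kind, computed over the features an encoding model selects, is one simple way to ask whether predictive features concentrate in early (general) or late (fine-grained) sets.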

## 4 Study 1: Validating Augmented Sparse Encoding Models Using PC\-Derived Voxel Subtypes

We first evaluate our Augmented Sparse Encoding Models on the four PC\-derived voxel subtypes, defined using two brain\-derived PCs \(Tuckute et al\., [2025](https://arxiv.org/html/2606.06857#bib.bib1)\): one corresponding to processing difficulty \(easy\-to\-process to hard\-to\-process\) and another corresponding to content \(concrete to abstract sentence meanings\) \(Fig\. [1](https://arxiv.org/html/2606.06857#S3.F1)A, Section [3](https://arxiv.org/html/2606.06857#S3.SS0.SSS0.Px1)\)\. These PC interpretations were obtained via a small set of manually collected \(human\-derived\) annotations\. These voxel populations therefore provide a useful validation setting: because their response properties have already been partially characterized, we can ask whether interpretations obtained from Augmented Sparse Encoding Models recover or contradict these prior accounts\. This validation requires more than just matching predictive performance\.

#### Analysis Methods

We wish to identify a sparse set of features that best predict a voxel’s response profile in order to gain insight into the underlying features driving that voxel’s activity\. To do so, we employ a two\-stage analysis pipeline comprising feature selection \(using a LASSO regression\) followed by refitting \(using a Ridge regression\)\. We always include the surprisal feature in the Ridge regression, allowing for clean comparisons between full LM encoding model predictivity and a surprisal\-only baseline\. This pipeline is fully cross\-validated \(5\-fold\)\. The final predictivity score is defined as the mean Fisher z\-transformed correlation across folds, noise\-ceiling normalized at the voxel level\. See Appendix [C](https://arxiv.org/html/2606.06857#A3) for more details\.
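The select-then-refit idea can be sketched in a few lines of dependency-free Python. This is an illustration under stated assumptions, not the pipeline from Appendix C: it uses a small coordinate-descent solver in place of a library implementation, assumes the LASSO stage sees only the SAE features, fits no intercept (inputs are assumed centred), and omits the 5-fold cross-validation and noise-ceiling normalization. All names and penalty values are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the LASSO proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def coordinate_descent(X, y, l1=0.0, l2=0.0, n_iter=100):
    """Minimise 0.5*||y - Xw||^2 + l1*||w||_1 + 0.5*l2*||w||^2.

    X is a list of rows. l1 > 0 gives a LASSO fit, l2 > 0 a Ridge fit.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    for _ in range(n_iter):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            # Correlation of feature j with the residual that excludes it.
            rho = sum(col[i] * (residual[i] + col[i] * w[j]) for i in range(n))
            denom = sum(c * c for c in col) + l2
            new_w = soft_threshold(rho, l1) / denom if denom > 0 else 0.0
            for i in range(n):
                residual[i] += col[i] * (w[j] - new_w)
            w[j] = new_w
    return w


def select_then_refit(X_sae, y, surprisal, l1, l2):
    """Stage 1: LASSO picks SAE features. Stage 2: Ridge refit with surprisal."""
    lasso_w = coordinate_descent(X_sae, y, l1=l1)
    selected = [j for j, weight in enumerate(lasso_w) if weight != 0.0]
    # Surprisal is always included, as the first column of the refit design.
    X_refit = [[surprisal[i]] + [row[j] for j in selected]
               for i, row in enumerate(X_sae)]
    ridge_w = coordinate_descent(X_refit, y, l2=l2)
    return selected, ridge_w


# Toy data: 4 sentences, 3 SAE features; the voxel depends on feature 0 only.
X_sae = [[1.0, 1.0, 1.0], [-1.0, 1.0, -1.0], [1.0, -1.0, -1.0], [-1.0, -1.0, 1.0]]
y = [2.0, -2.0, 2.0, -2.0]
surprisal = [0.5, -0.5, -0.5, 0.5]

selected, ridge_w = select_then_refit(X_sae, y, surprisal, l1=0.5, l2=1.0)
```

In this toy case the LASSO stage keeps only feature 0, and the Ridge refit assigns a near-zero weight to surprisal (`ridge_w[0]`) because the toy response is uncorrelated with it.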

#### Finding 1: Sparse SAE features predict voxel responses as well as dense residual\-stream features\.

From Fig\. [2](https://arxiv.org/html/2606.06857#S4.F2)A, we see that SAE features tend to predict voxel responses to language stimuli nearly as well as dense LM residual features\. In particular, features in the Matryoshka SAE basis achieve performance very similar to that of features in the dense residual LM basis, across all four voxel subtypes \(on average, 18 Matryoshka SAE features were selected per voxel; Appendix [D](https://arxiv.org/html/2606.06857#A4)\)\. The JumpReLU basis consistently yields feature sets that are somewhat worse for prediction\. All feature spaces outperform a control that attempts to predict voxel responses from mismatched LM features \(Hadidi et al\., [2026](https://arxiv.org/html/2606.06857#bib.bib128)\)\.

![Refer to caption](https://arxiv.org/html/2606.06857v1/x2.png)

Figure 2: \(A\) Normalized encoding model predictivity of four voxel subtypes using different feature spaces\. Processing\-difficulty voxels show no predictivity benefit from LM representation features beyond surprisal, whereas abstract and concrete voxels benefit substantially\. \(B\) Matryoshka SAE features selected by regressions on Abstract and Concrete voxels\. Features form different subspaces for each subtype, generalize across individual brains, and correspond to features that match, and extend, existing interpretations of these voxel populations\.

#### Finding 2: Dissociation between processing difficulty\-driven and content\-driven voxel regimes\.

Fig\. [2](https://arxiv.org/html/2606.06857#S4.F2)A shows a clear dissociation across the PC\-derived subtypes: voxels in the Hard\-to\-Process subtype are captured by surprisal alone, with no significant improvement from SAE or residual\-stream features\. In contrast, voxels in the Abstract and Concrete subtypes benefit significantly from the inclusion of SAE or residual\-stream features, above and beyond surprisal \(Fig\. [2](https://arxiv.org/html/2606.06857#S4.F2)A, p < 0\.001\)\. This finding recovers, at the single\-voxel level, a dissociation between processing difficulty\-driven and content\-driven voxel populations during language comprehension\. Consistent with prior work that summarized voxel responses via PCA or region\-level averaging \(Tuckute et al\., [2025](https://arxiv.org/html/2606.06857#bib.bib1), also e\.g\., Wehbe et al\., [2021](https://arxiv.org/html/2606.06857#bib.bib31)\), we show this here using Augmented Sparse Encoding Models\.

#### Finding 3: Distinct features refine interpretations of content\-driven voxels\.

We next ask which specific model features support predictivity of the content\-driven voxel subtypes\. SAE features can often \(but not always\) be interpreted by examining the natural\-language contexts that maximally activate them, using resources such as Neuronpedia \(Team, [2024](https://arxiv.org/html/2606.06857#bib.bib168)\)\. For this analysis, we focus on the Matryoshka SAE, as it \(i\) better matched the predictivity of the residual\-stream feature space, \(ii\) is known to produce more interpretable features than JumpReLU SAEs \(Bussmann et al\., [2025](https://arxiv.org/html/2606.06857#bib.bib2)\), and \(iii\) results in sparser feature bases than the JumpReLU SAE \(Appendix [D](https://arxiv.org/html/2606.06857#A4)\)\. For each voxel, we refit the encoding model on all sentences and examine the \(signed\) SAE features selected by the model\. Because this feature basis is learned without supervision and contains tens of thousands of candidate features, agreement with prior interpretations would be nontrivial\.

Table 1: Descriptions of Matryoshka SAE features that often predict voxel responses\. Feature \# corresponds to the dimension in the Matryoshka latent space\. Feature Label and Annotation are manually annotated summaries of the semantics of the feature\. Max\-Activating examples demonstrate the natural\-language contexts that most activate these features, from Neuronpedia \(Team, [2024](https://arxiv.org/html/2606.06857#bib.bib168)\)\.

From Fig\. [2](https://arxiv.org/html/2606.06857#S4.F2)B, we see that Abstract and Concrete voxels are consistently supported by different feature subspaces\. Moreover, many of these features are selected across all eight participants \(e\.g\., among the 160 Abstract voxels analyzed, 20 per participant across eight participants, feature 79 is selected in 156; Fig\. [2](https://arxiv.org/html/2606.06857#S4.F2)B\)\. Thus, out of a 32k\-feature SAE basis, a small set of signed features is consistently selected across individual brains\. From Table [1](https://arxiv.org/html/2606.06857#S4.T1) it is evident that the interpretations of these features broadly align with existing interpretations \(Tuckute et al\., [2025](https://arxiv.org/html/2606.06857#bib.bib1)\)\. For example, Abstract voxels are driven by features related to emotions \(Feat\. \#40\+; 144/160 voxels\), and suppressed by features related to scenery \(Feat\. \#79−; 156/160 voxels\)\.

Furthermore, this analysis refines existing interpretations of these voxel subtypes\. By characterizing voxels based on their projection along a PC, prior work was limited to looking for high\-level, symmetric interpretations that cleanly contrast voxel populations at both ends of a PC\. By attempting to create voxel\-level interpretations, our encoding models can reveal subtle asymmetries that were previously obscured\. For example, we find an asymmetry when investigating a people/event\-related feature \(Feat\. \#94\), which suppresses Abstract voxels but is not very reliable for predicting Concrete voxels\. Finally, the Abstract voxels are additionally driven by question\- and consequence\-related features \(Feats\. \#389\+, \#1916\+\), as well as by failure/error\-related content \(\#16\+\)\. Thus, what might appear to be a symmetric contrast when performing a PC\-based analysis is, at the voxel level, asymmetric: some features drive one population without driving its counterpart to the same degree\.

## 5 Study 2: Identifying and Interpreting Uncharacterized Voxel Populations

Having validated Augmented Sparse Encoding Models in Section [4](https://arxiv.org/html/2606.06857#S4), we now present a case study of deriving novel hypotheses about the tuning of previously uncharacterized voxel populations\. We use the voxels present in the Ghost subtype, as described in Section [3](https://arxiv.org/html/2606.06857#S3.SS0.SSS0.Px1)\. This population shows stable responses to language \(i\.e\., high noise ceilings\), but is not captured by the two main organizing PCs \(Fig\. [1](https://arxiv.org/html/2606.06857#S3.F1)A\)\. In this section, we use Augmented Sparse Encoding Models to tackle both the problem of identifying coherent voxel subpopulations and then interpreting them\.

![Refer to caption](https://arxiv.org/html/2606.06857v1/x3.png)

Figure 3: \(A\) Generalization heatmap for “Ghost” voxels \(20 per participant, 160 in total\), testing whether signed Matryoshka SAE features used to predict one voxel \(source\) generalize to predicting other voxels \(target\)\. Voxel order is determined by a hierarchical agglomerative clustering algorithm to cluster voxels with similar generalization profiles\. Mutually well\-predicted voxels are denoted by the red square\. \(B\) Matryoshka SAE features selected by regressions on the red\-square voxels in panel A\. These features correspond to a “people\-specific” voxel tuning\.

#### Analysis Methods

To identify subpopulations of voxels that are predicted by a coherent set of features within the Ghost subtype \(G\), we assess how well features that are found to predict one voxel \(a source voxel, s\) can generalize to predicting another voxel \(a target voxel, t\)\. For every s ∈ G, we identify a signed sparse feature basis that predicts s using all linguistic stimuli\. Then, for every t ∈ G, we run a cross\-validated and sign\-constrained Ridge regression to predict t using exactly the set of \(signed\) features that are used to predict s, ensuring that features are being used in “the same way” for both the source and target voxels \(i\.e\., directional feature agreement\)\. See Appendix [C](https://arxiv.org/html/2606.06857#A3) for further analysis details\.
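The sign constraint is the key ingredient of this generalization test, so a small sketch may help. The code below fits a Ridge regression in which each weight is forced to keep the sign it had for the source voxel, using projected coordinate descent. It is an illustrative stand-in, not the procedure from Appendix C: the solver, the function name, the penalty value, and the omission of cross-validation and of an intercept are all assumptions.

```python
def sign_constrained_ridge(X, y, signs, l2=1.0, n_iter=100):
    """Ridge fit where weight j must be >= 0 if signs[j] > 0, else <= 0.

    X is a list of rows restricted to the source voxel's selected features,
    y is the target voxel's response, and l2 must be positive.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    for _ in range(n_iter):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            rho = sum(col[i] * (residual[i] + col[i] * w[j]) for i in range(n))
            new_w = rho / (sum(c * c for c in col) + l2)
            # Project onto the half-line allowed by the source voxel's sign.
            new_w = max(new_w, 0.0) if signs[j] > 0 else min(new_w, 0.0)
            for i in range(n):
                residual[i] += col[i] * (w[j] - new_w)
            w[j] = new_w
    return w


# Toy target voxel: driven by feature 0 and suppressed by feature 1.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y_target = [1.0, -3.0, 3.0, -1.0]

# Source voxel used the features the same way (+, -): both weights survive.
agree = sign_constrained_ridge(X, y_target, signs=[+1, -1])
# Source voxel used feature 1 with the opposite sign: its weight is zeroed.
disagree = sign_constrained_ridge(X, y_target, signs=[+1, +1])
```

When the source and target voxels disagree on a feature's direction, that feature contributes nothing to the target fit, so predictivity drops. This is what makes the resulting score a measure of directional feature agreement rather than of mere feature overlap.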

#### Finding 1: A subset of the Ghost voxels forms a coherent feature-defined subpopulation.

Overall, many voxels in the Ghost subtype are not well predicted by surprisal or LM/SAE representations (Fig. [3](https://arxiv.org/html/2606.06857#S5.F3)A, and Appendix [E](https://arxiv.org/html/2606.06857#A5)). However, one voxel subpopulation stands out as an exception: using the feature generalization analysis described above, we identify a coherent subpopulation of 32 voxels (Fig. [3](https://arxiv.org/html/2606.06857#S5.F3)A, red square). Within this subpopulation, signed features that predict one voxel also generalize to other voxels, consistent with a shared underlying feature basis.

#### Finding 2: The coherent Ghost subpopulation is tuned to people-centered content.

To interpret this voxel subpopulation, we fit Augmented Sparse Encoding Models to each voxel and examined the signed Matryoshka SAE features selected by the encoding models (i.e., the same analysis procedure as in Section [4](https://arxiv.org/html/2606.06857#S4)). We find that the features used to predict this voxel subpopulation are typically driven by sentences that include “people-specific” content, such as discussions of relationships/emotions (Feat. #40++), descriptions of people taking concrete actions (Feat. #94++), or pronouns (Feat. #49++).

Interestingly, these signed features span a subspace orthogonal to the signed features that predict Abstract or Concrete voxels: intuitively, whether a sentence’s content is about people is not tied to abstractness or concreteness, nor to processing difficulty. Rather, this subpopulation appears to respond to sentences centered on social relations and person-referential events.

Whereas the analyses in Section [4](https://arxiv.org/html/2606.06857#S4) focused on main dimensions of language processing that generalize across brains, we find that this voxel population appears to be driven by just a few individuals. More than one-third of the voxels in this cluster (37.5%) come from a single participant, with most of the remainder from just two others (28.1% and 18.8%).
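
For readers who prefer counts, the reported percentages can be converted back to voxel numbers using the cluster size of 32 from Finding 1. The counts below are inferred from the rounded percentages and are not stated in the text; the participant labels are placeholders.

```python
CLUSTER_SIZE = 32  # voxels in the coherent subpopulation (Finding 1)

# Reported shares of the cluster, in percent; participant labels are placeholders.
reported_shares = {"participant A": 37.5, "participant B": 28.1, "participant C": 18.8}

# Nearest whole number of voxels consistent with each rounded percentage.
voxel_counts = {name: round(CLUSTER_SIZE * pct / 100) for name, pct in reported_shares.items()}
remaining_voxels = CLUSTER_SIZE - sum(voxel_counts.values())

print(voxel_counts)       # voxels per participant for the three largest contributors
print(remaining_voxels)   # voxels left for the other five participants
```

Under this reading, three participants contribute 27 of the 32 voxels, leaving 5 voxels spread across the remaining five participants.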

Anatomically, these voxels were located largely outside traditional frontal and temporal areas for language. They fell in the inferior angular gyrus, near parcel-based approximations of the temporoparietal junction (a canonical theory-of-mind area in the right hemisphere, although we study the left hemisphere here; Saxe and Kanwisher, [2003](https://arxiv.org/html/2606.06857#bib.bib30); Miao et al., [2026](https://arxiv.org/html/2606.06857#bib.bib34)) and of the visual extrastriate body area (Rosenke et al., [2021](https://arxiv.org/html/2606.06857#bib.bib28)), with some in the medial prefrontal cortex. The identification and interpretation of this subpopulation demonstrates how Augmented Sparse Encoding Models can recover not only broad organizing dimensions, but also representational structure shared across a subset of individuals.

## 6 Study 3: Characterizing the Feature Basis of the Fronto-Temporal Language Network

![Refer to caption](https://arxiv.org/html/2606.06857v1/x4.png) Figure 4: (A) Normalized encoding model predictivity of five language fROIs. (B) Generalization heatmap for language fROI voxels (2,296 voxels in total across eight participants; see Appendix [G](https://arxiv.org/html/2606.06857#A7)), testing whether signed Matryoshka SAE features used to predict one voxel (source) generalize to predicting other voxels (target). Voxels are sorted by participant and fROI. (C) Quantification of the prevalence of features across participants. Each point is a feature (828 in total); the x-axis shows how often that feature is selected across voxel regressions, and the y-axis shows participant entropy in bits ($\log_2$ effective participants), so higher values indicate broader sharing across individuals. Orange shows the theoretical maximum-prevalence baseline (a feature present in every voxel regression), and green shows maximum-specificity baselines for features confined to a single participant (a feature present in all regressions from exactly one participant).

We next analyze the five canonical left-lateralized language regions (Fedorenko et al., [2024](https://arxiv.org/html/2606.06857#bib.bib97)), using Augmented Sparse Encoding Models to ask whether the language network implements a largely shared representational code, or whether differences exist between regions and/or individuals.

#### Finding 1: Language regions show mixed tuning to processing difficulty and content, but not uniformly across regions.

First, in line with Section [4](https://arxiv.org/html/2606.06857#S4), we find that Matryoshka SAEs still provide a better feature basis than JumpReLU SAEs, and are on par with residual-stream features (Fig. [4](https://arxiv.org/html/2606.06857#S6.F4)A). Second, frontal fROIs are relatively well-captured by surprisal alone; in particular, the middle frontal gyrus (MFG) does not show a significant gain from SAE/residual-stream features beyond surprisal. In contrast, temporal fROIs show a significant boost from feature-based models above and beyond surprisal, indicating stronger sensitivity to content-linked structure during sentence comprehension (Fig. [4](https://arxiv.org/html/2606.06857#S6.F4)A, $p<0.001$). We replicated these findings in an independent dataset with different participants, a different MRI field strength, a different stimulus presentation modality, and a different experimental design (Appendix [F](https://arxiv.org/html/2606.06857#A6)).

#### Finding 2: Language-network voxels draw on a common feature basis, with graded participant-specific variation.

Next, we ask whether the human language network is organized around a shared feature basis or around more sharply separated regional and participant-specific feature spaces. First, we find that the voxel-to-voxel generalization heatmap shows substantial transfer between feature bases across the language network (Fig. [4](https://arxiv.org/html/2606.06857#S6.F4)B), in stark contrast to the Ghost voxel-to-voxel heatmap (Fig. [3](https://arxiv.org/html/2606.06857#S5.F3)A). More specifically, we find that the heatmap does not reveal substantially higher transfer within-fROI than cross-fROI (within-fROI predictivity = 0.51 vs. cross-fROI predictivity = 0.50). This finding indicates substantial feature sharing across language regions, rather than region-specific feature bases. See Appendix [G](https://arxiv.org/html/2606.06857#A7) for further analysis details.
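
A within-group versus cross-group comparison of this kind can be computed from any source-by-target generalization matrix by averaging entries according to whether the two voxels share a label (fROI or participant). The sketch below is illustrative only: the function name, the exclusion of the diagonal, the toy matrix, and the labels are assumptions, and the exact averaging used in the paper is described in its appendix rather than here.

```python
from statistics import mean

def within_vs_cross(heatmap, labels):
    """Mean transfer for voxel pairs with the same label vs. pairs with different labels.

    heatmap[s][t] is the score for predicting target t with the source s feature basis.
    Self-prediction (s == t) is excluded, which is an assumption of this sketch.
    """
    within, cross = [], []
    for s, row in enumerate(heatmap):
        for t, score in enumerate(row):
            if s == t:
                continue
            (within if labels[s] == labels[t] else cross).append(score)
    return mean(within), mean(cross)

# Toy example with four voxels in two hypothetical regions.
toy_labels = ["frontal", "frontal", "temporal", "temporal"]
toy_heatmap = [
    [1.0, 0.6, 0.5, 0.4],
    [0.5, 1.0, 0.5, 0.6],
    [0.4, 0.5, 1.0, 0.6],
    [0.6, 0.5, 0.4, 1.0],
]
within_mean, cross_mean = within_vs_cross(toy_heatmap, toy_labels)
```

Passing participant identifiers instead of region names as `labels` gives the analogous within-participant versus cross-participant comparison.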

Second, we also find substantial transfer across individuals. Specifically, predictivity within individuals is slightly higher than across individuals (within-participant predictivity = 0.61 vs. cross-participant predictivity = 0.49). To understand the feature sharing across participants further, we quantify how broadly individual SAE features are shared across individuals by plotting each feature’s prevalence across language-voxel regressions against the entropy of its distribution across participants (Fig. [4](https://arxiv.org/html/2606.06857#S6.F4)C). Intuitively, high entropy means the feature occurs with roughly uniform frequency across participants; low entropy means it is concentrated in a few. Entropy is measured in bits, so $2^{H}$ gives the effective number of participants associated with that feature. This prevalence analysis revealed four regimes. First, a small number of features are extremely prevalent and broadly shared across all eight participants (upper-right portion of the plot; entropy $\approx 3$). Their interpretations are given in Table [1](https://arxiv.org/html/2606.06857#S4.T1) (see Appendix [H](https://arxiv.org/html/2606.06857#A8) for further details). Second, and interestingly, many substantially less prevalent features are also reused across all eight participants despite appearing in far fewer voxel regressions. This suggests that the same type of heterogeneity found within individual brains also occurs across participants (i.e., these features are only predictive of a subset of voxels, but are found at equal rates across brains). Third, a broad middle regime contains features with nontrivial prevalence reused across multiple, but not all, participants. Finally, a smaller low-entropy tail suggests participant-specific features (inset in Fig. [4](https://arxiv.org/html/2606.06857#S6.F4)C).
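
The entropy measure can be made concrete with a short sketch. It assumes that a feature’s distribution across participants is simply the share of its selections (voxel regressions) that fall in each participant; whether the paper applies any further normalization, for example by each participant’s voxel count, is not stated in this section. The function names and example data are hypothetical.

```python
import math
from collections import Counter

def participant_entropy(participant_ids):
    """Shannon entropy (bits) of a feature's selections across participants.

    participant_ids holds one entry per voxel regression that selected the feature,
    naming the participant that voxel belongs to.
    """
    counts = Counter(participant_ids)
    total = sum(counts.values())
    entropy = 0.0
    for n in counts.values():
        p = n / total
        entropy -= p * math.log2(p)
    return entropy

def effective_participants(participant_ids):
    """2 ** H: the number of equally weighted participants with the same entropy."""
    return 2 ** participant_entropy(participant_ids)

# Hypothetical features.
shared = [f"P{i}" for i in range(1, 9) for _ in range(5)]  # 5 selections in each of 8 participants
specific = ["P1"] * 40                                     # all selections in one participant
skewed = ["P1"] * 6 + ["P2"] * 2                           # mostly one participant

for name, ids in [("shared", shared), ("specific", specific), ("skewed", skewed)]:
    print(name, participant_entropy(ids), effective_participants(ids))
```

A feature spread uniformly over eight participants reaches the maximum of $\log_2 8 = 3$ bits (eight effective participants), which is why the broadly shared regime sits near entropy 3, while a feature confined to one participant has zero entropy (one effective participant).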

What are these shared features? The two most prevalent features are Feat. #79 and #44. Feat. #79 is content-linked, responding to locations and scenes, whereas Feat. #44 (and Feat. #71, another prevalent feature) appears to be driven by contexts with fewer tokens. Because each sentence contains the same number of words, this feature implicitly captures whether a sentence contains infrequent words that are divided into a larger number of tokens. These features may reflect additional form- or difficulty-linked information beyond average sentence-level surprisal alone, as surprisal is explicitly included in all models. Other prevalent features capture people/emotions (Feat. #40), people/actions (Feat. #94), questions (Feat. #389), consequences/statements (Feat. #1916), and technical or specialized vocabulary (Feat. #82) (see Table [1](https://arxiv.org/html/2606.06857#S4.T1)). These widely-shared features represent only a portion of a much richer feature basis: in total, 828 unique (unsigned) features were selected at least once across language-network voxels.

Taken together, these results argue that the entire language network is organized around a shared, dominant feature set consisting of both content- and form-linked features. Our results are consistent with recent views of the language network as an integrated system (Braga et al., [2020](https://arxiv.org/html/2606.06857#bib.bib58); Fedorenko et al., [2024](https://arxiv.org/html/2606.06857#bib.bib97)). We extend this view with our feature prevalence analysis, which shows that, at a single-voxel level, this code consists of a small cross-participant core alongside a broader regime of features reused across only subsets of participants, likely reflecting individual variation in linguistic representations.

![Refer to caption](https://arxiv.org/html/2606.06857v1/x5.png) Figure 5: Augmented Sparse Encoding Models preferentially rely on general features to predict voxels across language fROIs. (A) Histogram of how often each Matryoshka feature (ordered by feature index) is selected by the encoding models. (B) Average number of features per Matryoshka bin when predicting voxels in each language fROI. (C) Performance of encoding models with feature sets restricted to individual Matryoshka feature bins. General features are most predictive despite making up 0.4% of all features.

#### Finding 3: Human language network voxels are preferentially explained by general, widely-applicable LM features.

Finally, we ask where brain-predictive features fall within the Matryoshka SAE hierarchy. As described in Section [3](https://arxiv.org/html/2606.06857#S3.SS0.SSS0.Px2), the Matryoshka SAE imposes an ordered feature hierarchy: early latent features account for general, widely-applicable variance in LM representations, and later features are progressively more granular (Bussmann et al., [2025](https://arxiv.org/html/2606.06857#bib.bib2)). We find that language-network encoding models preferentially load on the most general Matryoshka features (feature indices $<128$; Fig. [5](https://arxiv.org/html/2606.06857#S6.F5)A,B). A complementary analysis reaches the same conclusion: when models are fit using only a single Matryoshka feature bin, the most general bin outperforms every more granular bin, despite those bins containing far more features (Fig. [5](https://arxiv.org/html/2606.06857#S6.F5)C). In Appendix [I](https://arxiv.org/html/2606.06857#A9), we further show that the most general bin also outperforms the union of all finer-grained bins. These findings indicate that brain alignment with the human language network relies on the LM’s general, widely-applicable feature dimensions rather than on fine-grained, idiosyncratic features. They are replicated using the four PC-derived voxel subtypes (Appendix [I](https://arxiv.org/html/2606.06857#A9)) and are not driven by a lack of active features in the finer-grained bins (Appendix [J](https://arxiv.org/html/2606.06857#A10)). Together, these results extend “standard” LM-encoding claims from predictivity to a more specific correspondence between the properties of the language features that organize biological and artificial language systems.
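
To make the binning concrete, the sketch below assigns feature indices to nested Matryoshka bins and counts how many selected features fall in each. Only the first boundary (128) comes from the text. The remaining bin edges and the dictionary size of 32,768 are hypothetical, chosen so that the most general bin makes up about 0.4% of all features (128 / 32768 ≈ 0.39%), and treating the feature numbers quoted in the text as feature indices is likewise an assumption.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical nested prefix sizes; only the first boundary (128) is taken from the text.
BIN_EDGES = [128, 512, 2048, 8192, 32768]

def matryoshka_bin(feature_index, edges=BIN_EDGES):
    """Return the bin number (0 = most general) that contains feature_index."""
    if not 0 <= feature_index < edges[-1]:
        raise ValueError("feature index outside the dictionary")
    return bisect_right(edges, feature_index)

def selections_per_bin(selected_indices, edges=BIN_EDGES):
    """Count how many of the selected features fall into each bin."""
    counts = Counter(matryoshka_bin(i, edges) for i in selected_indices)
    return [counts.get(b, 0) for b in range(len(edges))]

# Feature numbers quoted in the text as prevalent in language-network regressions.
prevalent = [79, 44, 71, 40, 94, 389, 1916, 82]
per_bin = selections_per_bin(prevalent)
general_share = BIN_EDGES[0] / BIN_EDGES[-1]  # fraction of features in the most general bin

print(per_bin)
print(f"{general_share:.2%}")
```

With these assumed edges, six of the eight prevalent features named in the text fall in the most general bin, in line with the preference for low feature indices reported above.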

## 7 Discussion

#### Conclusion

Augmented Sparse Encoding Models move beyond prior approaches that treat both models and brains as black boxes. Our empirical results show that small sets of features can be used to predict brain responses to language and to separate processing difficulty-related signal from content. Within the language network, tuning is shared but not uniform: regions differ in their sensitivity to processing difficulty versus content, and content-linked features range from those broadly shared across individuals to individual-specific features. This study provides a step towards characterizing how linguistic meaning is represented and organized in the human brain.

Below we discuss three implications of our findings: (i) the nature of linguistic representations in the language network (across regions and individuals), (ii) interpretation of the “Ghost” voxels that are tuned to people-related content, and (iii) which properties of LM representations mediate brain alignment with human language responses.

#### The nature of linguistic representations: a shared feature basis.

Prior work has shown that language-responsive regions show similar response profiles during controlled and naturalistic language paradigms (Rodd et al., [2010](https://arxiv.org/html/2606.06857#bib.bib10); Fedorenko et al., [2020](https://arxiv.org/html/2606.06857#bib.bib20); Shain et al., [2020](https://arxiv.org/html/2606.06857#bib.bib114)). Here, we extend this view by demonstrating that these profiles rely on a common, interpretable feature basis. Small sets of signed SAE features selected to predict one language-network voxel generalize strongly to voxels in other fROIs (Fig. [4](https://arxiv.org/html/2606.06857#S6.F4)B), unlike in the Ghost population (Fig. [3](https://arxiv.org/html/2606.06857#S5.F3)A). This finding indicates that features can be reused across language fROIs. The full feature basis is large (predicting voxels across all five language fROIs draws on 828 features in total) but far from random: a handful of SAE features dominate, including “processing difficulty-linked” features that track token count (and, implicitly, lexical frequency) and persist even with sentence-level surprisal as an explicit predictor (Appendix [H](https://arxiv.org/html/2606.06857#A8)). Though a shared basis set of features can predict voxels across fROIs, different fROIs are not uniformly tuned to these features. We find that processing difficulty features are useful for predicting voxels in all fROIs to different degrees (Fig. [4](https://arxiv.org/html/2606.06857#S6.F4)A, surprisal bar), and that there exists a basis of content-linked features that are useful for predicting voxels in most, but not all, fROIs.

Moreover, we find that a small set of core features also appears across individuals (Fig. [4](https://arxiv.org/html/2606.06857#S6.F4)C; in line with Tuckute et al., [2025](https://arxiv.org/html/2606.06857#bib.bib1)). This core set of features is supplemented by many less prevalent features that are shared across most participants, as well as a broad set of features that are only shared within one or a few individuals. While previous studies have shown that it is possible to generalize from one brain to another via LM residual-stream spaces (e.g., de Varda et al., [2025](https://arxiv.org/html/2606.06857#bib.bib129); Tuckute et al., [2024b](https://arxiv.org/html/2606.06857#bib.bib22)), these studies have been based on average fROI responses across the entire language network. The current analyses extend this prior work by generalizing at a single-voxel level instead of a network average (also see, e.g., Tang and Huth, [2025](https://arxiv.org/html/2606.06857#bib.bib9)), and by further interpreting which features are shared. These results bring us closer to understanding how linguistic meaning is represented in individual brains. Future work with larger sets of stimuli can interpret these subset- and individual-specific features; one such direction is neural control (Bashivan et al., [2019](https://arxiv.org/html/2606.06857#bib.bib154); Tuckute et al., [2024b](https://arxiv.org/html/2606.06857#bib.bib22); Antonello et al., [2024](https://arxiv.org/html/2606.06857#bib.bib8)) in interpretable voxel subspaces, i.e., designing stimuli to drive targeted voxel populations.

#### People-centered content outside the dominant language dimensions.

The Ghost analysis (Section [5](https://arxiv.org/html/2606.06857#S5)) shows that Augmented Sparse Encoding Models can be used to identify and interpret voxel subpopulations that exhibit coherent feature tuning. The cluster identified in the current work was a people-centered subpopulation, predicted by features related to relationships, pronouns, and actions involving people.

This subpopulation falls outside the dominant language processing dimensions. Anatomically, the largest proportion of voxels was located in the angular gyrus area (43.8%), outside the five “core” fronto-temporal language fROIs analyzed in Section [6](https://arxiv.org/html/2606.06857#S6). The subpopulation also included some medial frontal voxels. These broad anatomical regions have been described as implicated in social cognition and theory-of-mind tasks (Rowe et al., [2001](https://arxiv.org/html/2606.06857#bib.bib63); Stuss et al., [2001](https://arxiv.org/html/2606.06857#bib.bib64); Saxe and Kanwisher, [2003](https://arxiv.org/html/2606.06857#bib.bib30); Dufour et al., [2013](https://arxiv.org/html/2606.06857#bib.bib60); Miao et al., [2026](https://arxiv.org/html/2606.06857#bib.bib34)). We emphasize that these anatomical interpretations are post hoc; we did not include functional localizers to test for, e.g., theory-of-mind reasoning.

This finding is consistent with prior work showing that the core language network is not tuned to social content or engaged by (non-verbal) theory-of-mind reasoning (Shain et al., [2023](https://arxiv.org/html/2606.06857#bib.bib65); Tuckute et al., [2024b](https://arxiv.org/html/2606.06857#bib.bib22); cf. Mellem et al., [2016](https://arxiv.org/html/2606.06857#bib.bib61)). Interestingly, however, a smaller proportion of voxels in this cluster fell within the two temporal language parcels: anterior temporal cortex (15.6%) and posterior temporal cortex (12.5%). Thus, our results suggest that social tuning during comprehension of short sentences appears partly adjacent to the core language network and partly interspersed with temporal language areas. Because this tuning is spatially distributed and not aligned with the dominant PCs, it would be difficult to detect from anatomy or low-dimensional response structure alone. Augmented Sparse Encoding Models isolate this people-centered voxel subpopulation through a small, signed feature basis, separating it from voxel populations tuned to processing difficulty or concrete–abstract content.

#### General LM features predict brain responses.

The Matryoshka hierarchy analysis (Section [6](https://arxiv.org/html/2606.06857#S6)) gives a more specific interpretation of why LM representations align with brain responses. The strongest alignment is not carried by the multiplicity of fine-grained features that contribute to an LM’s high-dimensional feature space. Instead, brain-predictive features are concentrated among the set of features that most generically characterize LM representations. This finding suggests partial overlap in the features that organize internal language representations in the human language network and LMs, even though these systems presumably have very different means of arriving at these representations.

A priori, it was plausible that fine-grained features could have contributed substantially to predicting voxel responses. Various metrics that capture representation reconstruction steadily improve when such fine-grained features are incorporated (Bussmann et al., [2025](https://arxiv.org/html/2606.06857#bib.bib2)), and incorporating disentangled hierarchical features has improved sparse probing (Luo et al., [2026](https://arxiv.org/html/2606.06857#bib.bib171)) and the removal of spurious correlations (Bussmann et al., [2025](https://arxiv.org/html/2606.06857#bib.bib2)). However, our work suggests that most single voxels do not tend to represent fine-grained features, and are instead tuned to broader linguistic features. We do note that some fine-grained features predict responses in individual participants or small subsets of participants (Fig. [4](https://arxiv.org/html/2606.06857#S6.F4)C, inset), likely reflecting individual-specific representations of linguistic meaning.

Our claim about general SAE dimensions aligns with Antonello et al. ([2021](https://arxiv.org/html/2606.06857#bib.bib62)), who showed that low-dimensional structure in LM representations is reflected in brain responses. We extend this work by identifying the relevant dimensions in an interpretable sparse feature basis. We further leverage these features to identify voxel subpopulations and characterize how these feature bases are shared across regions and individuals.

#### Limitations

Augmented Sparse Encoding Models face several limitations stemming from the choice of SAE. Specifically, the interpretations that one can generate are limited to the feature basis of the SAE, and some SAE features remain opaque (especially for the JumpReLU SAE). However, other common critiques of SAEs (e.g., lack of causal relevance; Wu et al., [2025](https://arxiv.org/html/2606.06857#bib.bib170)) do not apply to our framework, as we merely require an unsupervised, interpretable feature basis to predict neural responses. Other limitations stem from the neural data. Voxels encompass hundreds of thousands of neurons. It is possible that more general Matryoshka SAE features predict voxel tuning because voxels are a coarse-grained measure of neural activity, but one could imagine a coarse, yet “brain-irrelevant” feature set that fails to predict voxel responses, or could only do so with a larger and less semantically coherent feature set. Finally, collecting neural data for a larger set of diverse stimuli will enable further investigation of feature tuning during language processing; the present work provides a foundation for such analyses.

## 8 Acknowledgments

The authors would like to thank Thomas Serre, Ellie Pavlick, Andrea de Varda, Cindy Luo, Etha Hua, Eduardo Michelsen, and Jojo Yang for their helpful comments on an earlier version of the manuscript and for their many insightful discussions throughout the project. The authors would also like to thank Rachel Goepner for proofreading the manuscript.

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. 2439559. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. This work has been made possible in part by a gift from the Chan Zuckerberg Initiative Foundation to establish the Kempner Institute for the Study of Natural and Artificial Intelligence at Harvard University. We also acknowledge support from the McGovern Institute for Brain Research at MIT. We also acknowledge funding from National Institutes of Health (NIH) grant R01EY034118.

## References

- From language to cognition: how llms outgrow the human language network. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 24332–24350. Cited by: [§2](https://arxiv.org/html/2606.06857#S2.SS0.SSS0.Px3.p1.1).
- E. J. Allen, G. St-Yves, Y. Wu, J. L. Breedlove, J. S. Prince, L. T. Dowdle, M. Nau, B. Caron, F. Pestilli, I. Charest, et al. (2022) A massive 7t fmri dataset to bridge cognitive neuroscience and artificial intelligence. Nature neuroscience 25(1), pp. 116–126. Cited by: [Appendix F](https://arxiv.org/html/2606.06857#A6.p2.1), [§3](https://arxiv.org/html/2606.06857#S3.SS0.SSS0.Px1.p1.1).
- E. Ameisen, J. Lindsey, A. Pearce, W. Gurnee, N. L. Turner, B. Chen, C. Citro, D. Abrahams, S. Carter, B. Hosmer, J. Marcus, M. Sklar, A. Templeton, T. Bricken, C. McDougall, H. Cunningham, T. Henighan, A. Jermyn, A. Jones, A. Persic, Z. Qi, T. Ben Thompson, S. Zimmerman, K. Rivoire, T. Conerly, C. Olah, and J. Batson (2025) Circuit tracing: revealing computational graphs in language models. Transformer Circuits Thread. External Links: [Link](https://transformer-circuits.pub/2025/attribution-graphs/methods.html). Cited by: [§2](https://arxiv.org/html/2606.06857#S2.SS0.SSS0.Px2.p1.1).
- R. Antonello, C. Singh, S. Jain, A. Hsu, S. Guo, J. Gao, B. Yu, and A. Huth (2024) Generative causal testing to bridge data-driven models and scientific theories in language neuroscience. arXiv preprint arXiv:2410.00812. Cited by: [§7](https://arxiv.org/html/2606.06857#S7.SS0.SSS0.Px2.p2.1).
- R. Antonello, J. S. Turek, V. Vo, and A. Huth (2021) Low-dimensional structure in the space of language representations is reflected in brain responses. Advances in neural information processing systems 34, pp. 8332–8344. Cited by: [§7](https://arxiv.org/html/2606.06857#S7.SS0.SSS0.Px4.p3.1).
- P. Bashivan, K. Kar, and J. J. DiCarlo (2019) Neural population control via deep image synthesis. Science 364(6439), pp. eaav9436. External Links: [Link](https://www.science.org/doi/10.1126/science.aav9436). Cited by: [§7](https://arxiv.org/html/2606.06857#S7.SS0.SSS0.Px2.p2.1).
- V. Benara, C. Singh, J. X. Morris, R. J. Antonello, I. Stoica, A. G. Huth, and J. Gao (2024) Crafting interpretable embeddings for language neuroscience by asking llms questions. Advances in neural information processing systems 37, pp. 124137. Cited by: [§2](https://arxiv.org/html/2606.06857#S2.SS0.SSS0.Px3.p1.1).
- J. R. Binder, J. A. Frost, T. A. Hammeke, R. W. Cox, S. M. Rao, and T. Prieto (1997) Human brain language areas identified by functional magnetic resonance imaging. Journal of neuroscience 17(1), pp. 353–362. Cited by: [§1](https://arxiv.org/html/2606.06857#S1.p2.1), [§2](https://arxiv.org/html/2606.06857#S2.SS0.SSS0.Px1.p1.1).
- J. R. Binder, C. F. Westbury, K. A. McKiernan, E. T. Possing, and D. A. Medler (2005) Distinct brain systems for processing concrete and abstract concepts. Journal of cognitive neuroscience 17(6), pp. 905–917. Cited by: [§1](https://arxiv.org/html/2606.06857#S1.p2.1), [§2](https://arxiv.org/html/2606.06857#S2.SS0.SSS0.Px1.p1.1).
- T. L. Botch and E. S. Finn (2024) Neural representations of concreteness and concrete concepts are specific to the individual. Journal of Neuroscience 44(45). Cited by: [§1](https://arxiv.org/html/2606.06857#S1.p2.1), [§3](https://arxiv.org/html/2606.06857#S3.SS0.SSS0.Px1.p2.2).
- R. M. Braga, L. M. DiNicola, H. C. Becker, and R. L. Buckner (2020) Situating the left-lateralized language network in the broader organization of multiple specialized large-scale distributed networks. Journal of neurophysiology 124(5), pp. 1415–1448. Cited by: [§6](https://arxiv.org/html/2606.06857#S6.SS0.SSS0.Px2.p4.1).
- T. Bricken, A. Templeton, J. Batson, B. Chen, A. Jermyn, T. Conerly, N. Turner, C. Anil, C. Denison, A. Askell, R. Lasenby, Y. Wu, S. Kravec, N. Schiefer, T. Maxwell, N. Joseph, Z. Hatfield-Dodds, A. Tamkin, K. Nguyen, B. McLean, J. E. Burke, T. Hume, S. Carter, T. Henighan, and C. Olah (2023) Towards monosemanticity: decomposing language models with dictionary learning. Transformer Circuits Thread. Note: https://transformer-circuits.pub/2023/monosemantic-features/index.html Cited by: [§1](https://arxiv.org/html/2606.06857#S1.p2.1), [§2](https://arxiv.org/html/2606.06857#S2.SS0.SSS0.Px2.p1.1).
- B. Bussmann, N. Nabeshima, A. Karvonen, and N. Nanda (2025) Learning multi-level features with matryoshka sparse autoencoders. In Forty-second International Conference on Machine Learning. Cited by: [§3](https://arxiv.org/html/2606.06857#S3.SS0.SSS0.Px2.p4.1), [§4](https://arxiv.org/html/2606.06857#S4.SS0.SSS0.Px4.p1.1), [§6](https://arxiv.org/html/2606.06857#S6.SS0.SSS0.Px3.p1.1), [§7](https://arxiv.org/html/2606.06857#S7.SS0.SSS0.Px4.p2.1).
- C. Caucheteux and J. King (2022) Brains and algorithms partially converge in natural language processing. Communications Biology 5(1), pp. 134. External Links: [Link](https://www.nature.com/articles/s42003-022-03036-1). Cited by: [§1](https://arxiv.org/html/2606.06857#S1.p1.1), [§2](https://arxiv.org/html/2606.06857#S2.SS0.SSS0.Px3.p1.1).
- chanind, TomasD, and A. Garriga-alonso (2025) A bunch of matryoshka saes. Cited by: [§3](https://arxiv.org/html/2606.06857#S3.SS0.SSS0.Px2.p4.1).
- A. G. de Varda, S. Malik-Moraleda, G. Tuckute, and E. Fedorenko (2025) Multilingual computational models reveal shared brain responses to 21 languages. bioRxiv, pp. 2025–02. Cited by: [§7](https://arxiv.org/html/2606.06857#S7.SS0.SSS0.Px2.p2.1).
- F. Deniz, A. O. Nunez-Elizalde, A. G. Huth, and J. L. Gallant (2019) The representation of semantic information across human cerebral cortex during listening versus reading is invariant to stimulus modality. Journal of Neuroscience 39(39), pp. 7722–7736. Cited by: [§2](https://arxiv.org/html/2606.06857#S2.SS0.SSS0.Px1.p1.1).
- N. Dufour, E. Redcay, L. Young, P. L. Mavros, J. M. Moran, C. Triantafyllou, J. D. Gabrieli, and R. Saxe (2013) Similar brain activation during false belief tasks in a large sample of adults with and without autism. PloS one 8(9), pp. e75468. Cited by: [§7](https://arxiv.org/html/2606.06857#S7.SS0.SSS0.Px3.p2.1).
- N. Elhage, N. Nanda, C. Olsson, T. Henighan, N. Joseph, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, et al. (2021) A mathematical framework for transformer circuits. Transformer Circuits Thread 1(1), pp. 12. Cited by: [§2](https://arxiv.org/html/2606.06857#S2.SS0.SSS0.Px2.p1.1).
- E. Fedorenko, I. A. Blank, M. Siegelman, and Z. Mineroff (2020) Lack of selectivity for syntax relative to word meanings throughout the language network. Cognition 203, pp. 104348. Cited by: [§7](https://arxiv.org/html/2606.06857#S7.SS0.SSS0.Px2.p1.1).
- E. Fedorenko, P. Hsieh, A. Nieto-Castañón, S. Whitfield-Gabrieli, and N. Kanwisher (2010) New method for fMRI investigations of language: Defining ROIs functionally in individual subjects. Journal of Neurophysiology 104(2), pp. 1177–1194. External Links: [Link](https://journals.physiology.org/doi/prev/20100421-aop/pdf/10.1152/jn.00032.2010). Cited by: [Appendix A](https://arxiv.org/html/2606.06857#A1.SS0.SSS0.Px3.p1.4), [Appendix A](https://arxiv.org/html/2606.06857#A1.p1.1), [§3](https://arxiv.org/html/2606.06857#S3.SS0.SSS0.Px1.p2.2).
- E. Fedorenko, A. A. Ivanova, and T. I. Regev (2024) The language network as a natural kind within the broader landscape of the human brain. Nature Reviews Neuroscience 25(5), pp. 289–312. External Links: [Link](https://www.nature.com/articles/s41583-024-00802-4). Cited by: [§1](https://arxiv.org/html/2606.06857#S1.p2.1), [§2](https://arxiv.org/html/2606.06857#S2.SS0.SSS0.Px1.p1.1), [§6](https://arxiv.org/html/2606.06857#S6.SS0.SSS0.Px2.p4.1), [§6](https://arxiv.org/html/2606.06857#S6.p1.1).
- T. Fel, A. Picard, L. Bethune, T. Boissin, D. Vigouroux, J. Colin, R. Cadène, and T. Serre (2023) Craft: concept recursive activation factorization for explainability. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2711–2721. Cited by: [§2](https://arxiv.org/html/2606.06857#S2.SS0.SSS0.Px2.p1.1).
- L. Fernandino, C. J. Humphries, M. S. Seidenberg, W. L. Gross, L. L. Conant, and J. R. Binder (2015) Predicting brain activation patterns associated with individual lexical concepts based on five sensory-motor attributes. Neuropsychologia 76, pp. 17–26. Cited by: [§1](https://arxiv.org/html/2606.06857#S1.p2.1), [§2](https://arxiv.org/html/2606.06857#S2.SS0.SSS0.Px1.p1.1).
- B. Fischl, M. I. Sereno, R. B. Tootell, and A. M. Dale (1999) High-resolution intersubject averaging and a coordinate system for the cortical surface. Human brain mapping 8(4), pp. 272–284. Cited by: [Appendix G](https://arxiv.org/html/2606.06857#A7.p1.6).
- A. Geiger, D. Ibeling, A. Zur, M. Chaudhary, S. Chauhan, J. Huang, A. Arora, Z. Wu, N. Goodman, C. Potts, et al. (2025) Causal abstraction: a theoretical foundation for mechanistic interpretability. Journal of Machine Learning Research 26(83), pp. 1–64. Cited by: [§2](https://arxiv.org/html/2606.06857#S2.SS0.SSS0.Px2.p1.1).
- D. Guo, J. Wu, and S. M. Yiu (2026) Sparse autoencoders map brain-llm alignment onto cortical semantic topography. arXiv preprint arXiv:2605.23035. Cited by: [§2](https://arxiv.org/html/2606.06857#S2.SS0.SSS0.Px3.p1.1).
- N. Hadidi, E. Feghhi, B. H. Song, I. A. Blank, and J. C. Kao (2026) Spurious alignment between large language models and brains can emerge from non-robust methods and overlooked confounds. Nature Communications. Cited by: [§4](https://arxiv.org/html/2606.06857#S4.SS0.SSS0.Px2.p1.1).
- M. Heilbron, K. Armeni, J. Schoffelen, P. Hagoort, and F. P. De Lange (2022) A hierarchy of linguistic predictions during natural language comprehension. Proceedings of the National Academy of Sciences 119(32), pp. e2201968119. Cited by: [§1](https://arxiv.org/html/2606.06857#S1.p2.1).
- J\. M\. Henderson, W\. Choi, M\. W\. Lowder, and F\. Ferreira \(2016\)Language structure in the brain: a fixation\-related fmri study of syntactic surprisal in reading\.NeuroImage132,pp\. 293–300\.Cited by:[§1](https://arxiv.org/html/2606.06857#S1.p2.1)\.
- E\. Hosseini, C\. Casto, N\. Zaslavsky, C\. Conwell, M\. Richardson, and E\. Fedorenko \(2024\)Universality of representation in biological and artificial neural networks\.bioRxiv\.External Links:[Link](https://www.biorxiv.org/content/10.1101/2024.12.26.629294v1.abstract)Cited by:[§2](https://arxiv.org/html/2606.06857#S2.SS0.SSS0.Px3.p1.1)\.
- J\. Hu, H\. Small, H\. Kean, A\. Takahashi, L\. Zekelman, D\. Kleinman, E\. Ryan, A\. Nieto\-Castañón, V\. Ferreira, and E\. Fedorenko \(2023\)Precision fmri reveals that the language\-selective network supports both phrase\-structure building and lexical access during language production\.Cerebral Cortex33\(8\),pp\. 4384–4404\.Cited by:[§2](https://arxiv.org/html/2606.06857#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Jain and A\. Huth \(2018\)Incorporating context into language encoding models for fmri\.Advances in neural information processing systems31\.Cited by:[§1](https://arxiv.org/html/2606.06857#S1.p1.1),[§2](https://arxiv.org/html/2606.06857#S2.SS0.SSS0.Px3.p1.1)\.
- C\. Kauf, G\. Tuckute, R\. Levy, J\. Andreas, and E\. Fedorenko \(2024\)Lexical\-semantic content, not syntactic structure, is the main contributor to ann\-brain similarity of fmri responses in the language network\.Neurobiology of Language5\(1\),pp\. 7–42\.Cited by:[§2](https://arxiv.org/html/2606.06857#S2.SS0.SSS0.Px3.p1.1)\.
- T\. W\. Kleinman and A\. Goldstein\. Back to the feature: toward a feature\-centric account of brain–lm alignment\.InICLR 2026 Workshop on Representational Alignment \(Re^4\-Align\),Cited by:[§2](https://arxiv.org/html/2606.06857#S2.SS0.SSS0.Px3.p1.1)\.
- S\. Kumar, T\. R\. Sumers, T\. Yamakoshi, A\. Goldstein, U\. Hasson, K\. A\. Norman, T\. L\. Griffiths, R\. D\. Hawkins, and S\. A\. Nastase \(2024\)Shared functional specialization in transformer\-based language models and the human brain\.Nature Communications15\(1\),pp\. 5523\.External Links:[Link](https://www.nature.com/articles/s41467-024-49173-5)Cited by:[§2](https://arxiv.org/html/2606.06857#S2.SS0.SSS0.Px3.p1.1)\.
- M\. Lamarre, C\. Chen, and F\. Deniz \(2022\)Attention weights accurately predict language representations in the brain\.InFindings of the Association for Computational Linguistics: EMNLP 2022,pp\. 4513–4529\.Cited by:[§2](https://arxiv.org/html/2606.06857#S2.SS0.SSS0.Px3.p1.1)\.
- T\. Lieberum, S\. Rajamanoharan, A\. Conmy, L\. Smith, N\. Sonnerat, V\. Varma, J\. Kramár, A\. Dragan, R\. Shah, and N\. Nanda \(2024\)Gemma scope: open sparse autoencoders everywhere all at once on gemma 2\.InProceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP,pp\. 278–300\.Cited by:[§3](https://arxiv.org/html/2606.06857#S3.SS0.SSS0.Px2.p4.1)\.
- B\. Lipkin, G\. Tuckute, J\. Affourtit, H\. Small, Z\. Mineroff, H\. Kean, O\. Jouravlev, L\. Rakocevic, B\. Pritchett, M\. Siegelman,et al\.\(2022\)Probabilistic atlas for the language network based on precision fmri data from \>800 individuals\.Scientific data9\(1\),pp\. 529\.Cited by:[§2](https://arxiv.org/html/2606.06857#S2.SS0.SSS0.Px1.p1.1),[§3](https://arxiv.org/html/2606.06857#S3.SS0.SSS0.Px1.p1.1)\.
- Y\. Luo, Y\. Zhan, J\. Jiang, T\. Liu, M\. Wu, Z\. Zhou, and B\. Dong \(2026\)From atoms to trees: building a structured feature forest with hierarchical sparse autoencoders\.arXiv preprint arXiv:2602\.11881\.Cited by:[§7](https://arxiv.org/html/2606.06857#S7.SS0.SSS0.Px4.p2.1)\.
- M\. S\. Mellem, K\. M\. Jasmin, C\. Peng, and A\. Martin \(2016\)Sentence processing in anterior superior temporal cortex shows a social\-emotional bias\.Neuropsychologia89,pp\. 217–224\.Cited by:[§7](https://arxiv.org/html/2606.06857#S7.SS0.SSS0.Px3.p3.1)\.
- G\. Merlin and M\. Toneva \(2024\)Language models and brains align due to more than next\-word prediction and word\-level information\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,pp\. 18431–18454\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.1024/)Cited by:[§2](https://arxiv.org/html/2606.06857#S2.SS0.SSS0.Px3.p1.1)\.
- Z\. Miao, H\. Jung, P\. A\. Kragel, K\. Bo, P\. Sadil, M\. A\. Lindquist, and T\. D\. Wager \(2026\)Common and distinct neural correlates of social interaction processing and theory of mind in narratives\.Nature Communications\.Cited by:[§5](https://arxiv.org/html/2606.06857#S5.SS0.SSS0.Px3.p4.1),[§7](https://arxiv.org/html/2606.06857#S7.SS0.SSS0.Px3.p2.1)\.
- J\. A\. Michaelov, M\. D\. Bardolph, C\. K\. Van Petten, B\. K\. Bergen, and S\. Coulson \(2024\)Strong prediction: language model surprisal explains multiple n400 effects\.Neurobiology of language5\(1\),pp\. 107–135\.Cited by:[§3](https://arxiv.org/html/2606.06857#S3.SS0.SSS0.Px2.p2.1)\.
- B\. A\. Olshausen and D\. J\. Field \(1996\)Emergence of simple\-cell receptive field properties by learning a sparse code for natural images\.Nature381\(6583\),pp\. 607–609\.Cited by:[§2](https://arxiv.org/html/2606.06857#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Pasquiou, Y\. Lakretz, B\. Thirion, and C\. Pallier \(2023\)Information\-restricted neural language models reveal different brain regions’ sensitivity to semantics, syntax, and context\.Neurobiology of Language4\(4\),pp\. 611–636\.Cited by:[§2](https://arxiv.org/html/2606.06857#S2.SS0.SSS0.Px3.p1.1)\.
- J\. S\. Prince, I\. Charest, J\. W\. Kurzawski, J\. A\. Pyles, M\. J\. Tarr, and K\. N\. Kay \(2022\)Improving the accuracy of single\-trial fMRI response estimates using GLMsingle\.eLife11,pp\. e77599\.External Links:[Link](https://elifesciences.org/articles/77599)Cited by:[§3](https://arxiv.org/html/2606.06857#S3.SS0.SSS0.Px1.p1.1)\.
- S\. Rajamanoharan, T\. Lieberum, N\. Sonnerat, A\. Conmy, V\. Varma, J\. Kramár, and N\. Nanda \(2024\)Jumping ahead: improving reconstruction fidelity with jumprelu sparse autoencoders\.arXiv preprint arXiv:2407\.14435\.Cited by:[§3](https://arxiv.org/html/2606.06857#S3.SS0.SSS0.Px2.p4.1)\.
- J\. M\. Rodd, O\. A\. Longe, B\. Randall, and L\. K\. Tyler \(2010\)The functional organisation of the fronto\-temporal language system: evidence from syntactic and semantic ambiguity\.Neuropsychologia48\(5\),pp\. 1324–1335\.Cited by:[§7](https://arxiv.org/html/2606.06857#S7.SS0.SSS0.Px2.p1.1)\.
- M\. Rosenke, R\. Van Hoof, J\. Van Den Hurk, K\. Grill\-Spector, and R\. Goebel \(2021\)A probabilistic functional atlas of human occipito\-temporal visual cortex\.Cerebral Cortex31\(1\),pp\. 603–619\.Cited by:[§5](https://arxiv.org/html/2606.06857#S5.SS0.SSS0.Px3.p4.1)\.
- A\. D\. Rowe, P\. R\. Bullock, C\. E\. Polkey, and R\. G\. Morris \(2001\)‘Theory of mind’ impairments and their relationship to executive functioning following frontal lobe excisions\.Brain124\(3\),pp\. 600–616\.Cited by:[§7](https://arxiv.org/html/2606.06857#S7.SS0.SSS0.Px3.p2.1)\.
- R\. Saxe, M\. Brett, and N\. Kanwisher \(2006\)Divide and conquer: a defense of functional localizers\.Neuroimage30\(4\),pp\. 1088–1096\.Cited by:[Appendix A](https://arxiv.org/html/2606.06857#A1.SS0.SSS0.Px3.p1.4)\.
- R\. Saxe and N\. Kanwisher \(2003\)People thinking about thinking people: the role of the temporo\-parietal junction in “theory of mind”\.NeuroImage19\(4\),pp\. 1835–1842\.External Links:[Document](https://dx.doi.org/10.1016/S1053-8119%2803%2900230-1)Cited by:[§5](https://arxiv.org/html/2606.06857#S5.SS0.SSS0.Px3.p4.1),[§7](https://arxiv.org/html/2606.06857#S7.SS0.SSS0.Px3.p2.1)\.
- C\. Shain, I\. A\. Blank, M\. van Schijndel, W\. Schuler, and E\. Fedorenko \(2020\)FMRI reveals language\-specific predictive coding during naturalistic sentence comprehension\.Neuropsychologia138,pp\. 107307\.Cited by:[§1](https://arxiv.org/html/2606.06857#S1.p2.1),[§2](https://arxiv.org/html/2606.06857#S2.SS0.SSS0.Px1.p1.1),[§3](https://arxiv.org/html/2606.06857#S3.SS0.SSS0.Px1.p2.2),[§7](https://arxiv.org/html/2606.06857#S7.SS0.SSS0.Px2.p1.1)\.
- C\. Shain, A\. Paunov, X\. Chen, B\. Lipkin, and E\. Fedorenko \(2023\)No evidence of theory of mind reasoning in the human language network\.Cerebral Cortex33\(10\),pp\. 6299–6319\.Cited by:[§7](https://arxiv.org/html/2606.06857#S7.SS0.SSS0.Px3.p3.1)\.
- C\. Singh, R\. J\. Antonello, S\. Guo, G\. Mischler, J\. Gao, N\. Mesgarani, and A\. G\. Huth \(2025\)Evaluating scientific theories as predictive models in language neuroscience\.bioRxiv\.Cited by:[§2](https://arxiv.org/html/2606.06857#S2.SS0.SSS0.Px3.p1.1)\.
- D\. T\. Stuss, G\. G\. Gallup Jr, and M\. P\. Alexander \(2001\)The frontal lobes are necessary for ‘theory of mind’\.Brain124\(2\),pp\. 279–286\.Cited by:[§7](https://arxiv.org/html/2606.06857#S7.SS0.SSS0.Px3.p2.1)\.
- J\. Tang and A\. G\. Huth \(2025\)Semantic language decoding across participants and stimulus modalities\.Current Biology35\(5\),pp\. 1023–1032\.Cited by:[§7](https://arxiv.org/html/2606.06857#S7.SS0.SSS0.Px2.p2.1)\.
- G\. Team, M\. Riviere, S\. Pathak, P\. G\. Sessa, C\. Hardin, S\. Bhupatiraju, L\. Hussenot, T\. Mesnard, B\. Shahriari, A\. Ramé,et al\.\(2024\)Gemma 2: improving open language models at a practical size\.arXiv preprint arXiv:2408\.00118\.Cited by:[§3](https://arxiv.org/html/2606.06857#S3.SS0.SSS0.Px2.p3.1)\.
- N\. Team \(2024\)Neuronpedia: an open platform for mechanistic interpretability\.Note:[https://www\.neuronpedia\.org](https://www.neuronpedia.org/)Cited by:[§4](https://arxiv.org/html/2606.06857#S4.SS0.SSS0.Px4.p1.1),[Table 1](https://arxiv.org/html/2606.06857#S4.T1),[Table 1](https://arxiv.org/html/2606.06857#S4.T1.4.2)\.
- G\. Tuckute, N\. Kanwisher, and E\. Fedorenko \(2024a\)Language in brains, minds, and machines\.Annual Review of Neuroscience47\.External Links:[Link](https://www.annualreviews.org/content/journals/10.1146/annurev-neuro-120623-101142?TRACK=RSS)Cited by:[§1](https://arxiv.org/html/2606.06857#S1.p1.1)\.
- G\. Tuckute, E\. J\. Lee, Y\. Ou, E\. Fedorenko, and K\. Kay \(2025\)A two\-dimensional space of linguistic representations shared across individuals\.bioRxiv\.External Links:[Document](https://dx.doi.org/https%3A//doi.org/10.1101/2025.05.21.65533)Cited by:[Appendix A](https://arxiv.org/html/2606.06857#A1.SS0.SSS0.Px1.p2.1),[Appendix A](https://arxiv.org/html/2606.06857#A1.p1.1),[Appendix H](https://arxiv.org/html/2606.06857#A8.p1.1),[§1](https://arxiv.org/html/2606.06857#S1.p2.1),[§2](https://arxiv.org/html/2606.06857#S2.SS0.SSS0.Px1.p1.1),[Figure 1](https://arxiv.org/html/2606.06857#S3.F1),[Figure 1](https://arxiv.org/html/2606.06857#S3.F1.5.2),[§3](https://arxiv.org/html/2606.06857#S3.SS0.SSS0.Px1.p1.1),[§3](https://arxiv.org/html/2606.06857#S3.SS0.SSS0.Px1.p2.2),[§4](https://arxiv.org/html/2606.06857#S4.SS0.SSS0.Px3.p1.1),[§4](https://arxiv.org/html/2606.06857#S4.SS0.SSS0.Px4.p2.2),[§4](https://arxiv.org/html/2606.06857#S4.p1.1),[§7](https://arxiv.org/html/2606.06857#S7.SS0.SSS0.Px2.p2.1)\.
- G\. Tuckute, A\. Sathe, S\. Srikant, M\. Taliaferro, M\. Wang, M\. Schrimpf, K\. Kay, and E\. Fedorenko \(2024b\)Driving and suppressing the human language network using large language models\.Nature Human Behaviour8\(3\),pp\. 544–561\.External Links:[Link](https://www.nature.com/articles/s41562-023-01783-7)Cited by:[Figure 10](https://arxiv.org/html/2606.06857#A6.F10),[Figure 10](https://arxiv.org/html/2606.06857#A6.F10.2.1),[Appendix F](https://arxiv.org/html/2606.06857#A6.p1.1),[§1](https://arxiv.org/html/2606.06857#S1.p2.1),[§2](https://arxiv.org/html/2606.06857#S2.SS0.SSS0.Px1.p1.1),[§7](https://arxiv.org/html/2606.06857#S7.SS0.SSS0.Px2.p2.1),[§7](https://arxiv.org/html/2606.06857#S7.SS0.SSS0.Px3.p3.1)\.
- L\. Wehbe, I\. A\. Blank, C\. Shain, R\. Futrell, R\. Levy, T\. von der Malsburg, N\. Smith, E\. Gibson, and E\. Fedorenko \(2021\)Incremental language comprehension difficulty predicts activity in the language network but not the multiple demand network\.Cerebral Cortex31\(9\),pp\. 4006–4023\.Cited by:[§1](https://arxiv.org/html/2606.06857#S1.p2.1),[§2](https://arxiv.org/html/2606.06857#S2.SS0.SSS0.Px1.p1.1),[§3](https://arxiv.org/html/2606.06857#S3.SS0.SSS0.Px1.p2.2),[§4](https://arxiv.org/html/2606.06857#S4.SS0.SSS0.Px3.p1.1)\.
- W\. C\. West and P\. J\. Holcomb \(2000\)Imaginal, semantic, and surface\-level processing of concrete and abstract words: an electrophysiological investigation\.Journal of Cognitive Neuroscience12\(6\),pp\. 1024–1037\.Cited by:[§1](https://arxiv.org/html/2606.06857#S1.p2.1),[§2](https://arxiv.org/html/2606.06857#S2.SS0.SSS0.Px1.p1.1)\.
- E\. G\. Wilcox, J\. Gauthier, J\. Hu, P\. Qian, and R\. P\. Levy \(2020\)On the predictive power of neural language models for human real\-time comprehension behavior\.InProceedings of the Annual Meeting of the Cognitive Science Society,Vol\.42\.Cited by:[§3](https://arxiv.org/html/2606.06857#S3.SS0.SSS0.Px2.p2.1)\.
- Z\. Wu, A\. Arora, A\. Geiger, Z\. Wang, J\. Huang, D\. Jurafsky, C\. D\. Manning, and C\. Potts \(2025\)AxBench: steering llms? even simple baselines outperform sparse autoencoders\.InForty\-second International Conference on Machine Learning,Cited by:[§7](https://arxiv.org/html/2606.06857#S7.SS0.SSS0.Px5.p1.1)\.
- A\. Zeng and J\. Gallant \(2025\)Disentangling superpositions: interpretable brain encoding model with sparse concept atoms\.bioRxiv,pp\. 2025–11\.Cited by:[§2](https://arxiv.org/html/2606.06857#S2.SS0.SSS0.Px3.p1.1)\.

## Appendix A Detailed Voxel Selection Procedure

We selected voxels of interest in two main ways: one based on the two principal components \(PCs\) of sentence\-evoked responses identified by Tuckute et al\. \([2025](https://arxiv.org/html/2606.06857#bib.bib1)\), and the other based on whether a voxel is part of the fronto\-temporal language network \(Fedorenko et al\., [2010](https://arxiv.org/html/2606.06857#bib.bib19)\)\.

#### PC\-derived subtypes \(Hard\-to\-Process, Easy\-to\-Process, Abstract, Concrete\)\.

We define these four subtypes via a three\-step procedure, which was applied separately for each of the eight participants\. We selected $n=20$ voxels per subtype\.

Step 1: Identify voxels significantly predicted by either of the two PCs\. We restrict to left\-hemisphere voxels that are significantly predicted by a two\-PC linear encoding model \(PC 1 and PC 2 as regressors; cross\-validated $R^{2} > 1.06$; see Tuckute et al\. [2025](https://arxiv.org/html/2606.06857#bib.bib1) for details\)\. This step ensures that the voxels entering subtype assignment are meaningfully captured by at least one of the two PCs\.

Step 2: Assign voxels to subtypes based on PC correlations\. For each remaining voxel, we compute its Pearson correlation with the sentence\-level PC 1 and PC 2 scores \(vectors of length 200, i\.e\., the number of stimuli in the experiment\)\. The threshold $|r| = 0.14$ marks the correlation magnitude expected by chance at a two\-tailed $p < 0.05$ threshold with 200 samples\. We assign voxels to subtypes as follows:

Table 2: Criteria used to define the four voxel subtypes\.

Step 3: Ensure voxel reliability\. We require an NCSNR of at least 0\.4\. If fewer than $n=20$ voxels meet this threshold for a given subtype, we incrementally relax it in steps of 0\.02 down to a minimum of 0\.20 \(thresholds: 0\.40, 0\.38, 0\.36, …, 0\.20\)\. This relaxation is occasionally necessary given that some participants have more reliable voxels than others \(see the NCSNR values for each voxel subtype in Table [3](https://arxiv.org/html/2606.06857#A1.T3)\)\. No anatomical constraint is imposed beyond restricting to the left hemisphere\. Across all subtypes, the $n \times 4 = 80$ selected voxels are mutually exclusive \(i\.e\., no voxel appears in more than one subtype\)\.
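
The threshold relaxation in Step 3 can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the input dictionary, and the rule for picking $n$ voxels when more than $n$ pass (here, the most reliable ones) are assumptions that the text does not specify.

```python
def select_reliable_voxels(ncsnr_by_voxel, n=20, start=0.40, floor=0.20, step=0.02):
    """Return (selected voxel ids, NCSNR threshold that was used).

    ncsnr_by_voxel: dict mapping a voxel id to its NCSNR, for the candidate
    voxels of ONE subtype in ONE participant (hypothetical input layout).
    """
    n_steps = round((start - floor) / step)
    threshold, passing = start, []
    for i in range(n_steps + 1):
        # Thresholds visited in order: 0.40, 0.38, 0.36, ..., 0.20
        threshold = round(start - i * step, 2)
        passing = [v for v, s in ncsnr_by_voxel.items() if s >= threshold]
        if len(passing) >= n:
            break
    # Assumption: when more than n voxels pass, keep the n most reliable ones.
    passing.sort(key=lambda v: ncsnr_by_voxel[v], reverse=True)
    return passing[:n], threshold
```

If fewer than `n` voxels pass even at the floor of 0.20, the sketch simply returns the voxels that do pass.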

#### Ghost voxels\.

Ghost voxels are selected separately and do not require significant prediction by the two\-PC encoding model, since by definition they are not well captured by either PC\. Starting from all left\-hemisphere voxels with NCSNR $> 0.4$, we identify voxels whose correlations with both PC 1 and PC 2 fall within $(-0.14, 0.14)$ \(i\.e\., chance\-level\)\. Among these voxels, we rank by the sum of absolute correlations \($|r_{\mathrm{PC1}}| + |r_{\mathrm{PC2}}|$\) and select the $n=20$ voxels with the lowest combined absolute correlation, i\.e\., those closest to the origin in PC\-correlation space \(see Figure [1](https://arxiv.org/html/2606.06857#S3.F1)A\)\.
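
As a rough sketch of this selection rule, assuming each voxel is described by a small record with hypothetical keys `id`, `r_pc1`, `r_pc2`, and `ncsnr` (the data layout is not given in the text):

```python
def select_ghost_voxels(voxels, n=20, r_chance=0.14, min_ncsnr=0.4):
    """Pick the n reliable voxels closest to the origin in PC-correlation space.

    voxels: list of dicts with keys 'id', 'r_pc1', 'r_pc2', 'ncsnr'
    (hypothetical layout; one participant, left hemisphere only).
    """
    candidates = [
        v for v in voxels
        if v["ncsnr"] > min_ncsnr          # reliability requirement
        and abs(v["r_pc1"]) < r_chance     # chance-level correlation with PC 1
        and abs(v["r_pc2"]) < r_chance     # chance-level correlation with PC 2
    ]
    # Rank by |r_PC1| + |r_PC2|, smallest first
    candidates.sort(key=lambda v: abs(v["r_pc1"]) + abs(v["r_pc2"]))
    return [v["id"] for v in candidates[:n]]
```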

#### Language\-network fROIs\.

Using a standard approach in cognitive neuroscience, we use an independent “functional localizer” experiment to select voxels that are responsive to language \(Saxe et al\., [2006](https://arxiv.org/html/2606.06857#bib.bib24); Fedorenko et al\., [2010](https://arxiv.org/html/2606.06857#bib.bib19)\)\. For each participant, we identify voxels falling within five predefined left\-hemisphere parcels \(IFG, IFGorb, MFG, AntTemp, and PostTemp\) and select the top $10\%$ of voxels by $t$\-statistic from a standard sentences $>$ nonwords localizer contrast \(Fedorenko et al\., [2010](https://arxiv.org/html/2606.06857#bib.bib19)\)\. As in all other analyses, we require NCSNR $> 0.4$ for the language fROI voxels\.

Table 3: Summary statistics for the five voxel subtypes across eight participants\.

Table 4: Summary statistics for language fROI voxels \(NCSNR $> 0.4$\) across eight participants\.

## Appendix B Replication Using gemma\-2\-2b Layer 14

In this section, we replicate our primary quantitative regression results in Section [4](https://arxiv.org/html/2606.06857#S4) using LM representations from layer 14 \(rather than layer 12\) of gemma\-2\-2b\. From Fig\. [6](https://arxiv.org/html/2606.06857#A2.F6), one can see that our results are qualitatively similar to those in Fig\. [2](https://arxiv.org/html/2606.06857#S4.F2)A\.

![Refer to caption](https://arxiv.org/html/2606.06857v1/x6.png)

Figure 6: Replication of Fig\. [2](https://arxiv.org/html/2606.06857#S4.F2)A using representations from layer 14 of gemma\-2\-2b\.

## Appendix C Additional Methodological Details for All Analyses

#### Computational Costs

All analyses presented in this work were run on CPUs\. A single Nvidia Titan RTX GPU was used only to featurize the stimuli\.

#### Cross\-Validated Predictivity Analysis

For each voxel, we use 5\-fold cross\-validation over sentence stimuli to estimate the performance of our encoding models in Sections [4](https://arxiv.org/html/2606.06857#S4) and [6](https://arxiv.org/html/2606.06857#S6)\. For each train/test split, we employ a two\-stage processing pipeline\. The first stage consists of feature selection\. For Matryoshka and JumpReLU feature sets, we first filter to the 8000 most promising features using F\-tests from univariate linear regression on each feature\. This removes features that correlate only weakly with voxel responses\. Next, we fit an L1\-penalized \(LASSO\) regression\. We use a nested 5\-fold CV to identify the LASSO $\alpha$ hyperparameter, searching over a log\-spaced range of 10 values from 0\.01 to 1\. After identifying the $\alpha$ hyperparameter, we fit the LASSO regression to the entire train split to identify a sparse feature basis\.

The second stage consists of fitting a generalizable regression model to the sparse feature basis identified in the first stage\. We employ a Ridge regression for this purpose\. Regardless of the results of the LASSO feature selection, we always include the surprisal feature in the Ridge regression\. This procedure ensures that the feature set is never empty, and allows for clean comparisons between full LM encoding model predictivity and a surprisal\-only baseline\. We use a nested leave\-one\-out cross\-validation to identify the optimal $\alpha$ hyperparameter, searching over a range from 0\.01 to 100000\. We refit the Ridge regression on the entire train split using the optimal $\alpha$ in order to predict test split responses\.

Within each fold, brain responses were $z$\-scored using training\-set statistics only, and surprisal was $z$\-scored using the training set only, preventing information leakage across folds\. The final predictivity score was defined as the mean Fisher\-$z$\-transformed correlation across folds, noise\-ceiling normalized at the voxel level\.
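
The two-stage pipeline described above can be sketched with scikit-learn as follows. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the function names, the 8-value Ridge grid, the random fold assignment, and the choice to keep surprisal out of the LASSO stage are all assumptions, and the voxel-level noise-ceiling normalization is omitted because its exact form is not given here.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import KFold


def predict_fold(X_tr, s_tr, y_tr, X_te, s_te, k_best=8000):
    """Fit on one training split and predict the held-out sentences.

    X_*: (sentences x features) SAE activations, s_*: surprisal, y_tr: one voxel.
    """
    # z-score responses and surprisal using training-set statistics only
    y_z = (y_tr - y_tr.mean()) / y_tr.std()
    s_mu, s_sd = s_tr.mean(), s_tr.std()
    s_tr_z, s_te_z = (s_tr - s_mu) / s_sd, (s_te - s_mu) / s_sd

    # Stage 1a: univariate F-test filter down to at most k_best features
    filt = SelectKBest(f_regression, k=min(k_best, X_tr.shape[1])).fit(X_tr, y_z)
    F_tr, F_te = filt.transform(X_tr), filt.transform(X_te)

    # Stage 1b: LASSO, alpha chosen by nested 5-fold CV over 10 log-spaced values
    lasso = LassoCV(alphas=np.logspace(-2, 0, 10), cv=5).fit(F_tr, y_z)
    keep = np.flatnonzero(lasso.coef_)  # sparse feature basis (may be empty)

    # Stage 2: Ridge on the sparse basis plus surprisal (always included).
    # RidgeCV with its default cv=None uses efficient leave-one-out CV.
    D_tr = np.column_stack([F_tr[:, keep], s_tr_z])
    D_te = np.column_stack([F_te[:, keep], s_te_z])
    ridge = RidgeCV(alphas=np.logspace(-2, 5, 8)).fit(D_tr, y_z)  # grid size assumed
    return ridge.predict(D_te)


def cv_predictivity(X, surprisal, y, n_splits=5, seed=0):
    """Mean Fisher-z-transformed correlation across folds (not NC-normalized)."""
    zs = []
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr, te in folds.split(X):
        pred = predict_fold(X[tr], surprisal[tr], y[tr], X[te], surprisal[te])
        zs.append(np.arctanh(np.corrcoef(pred, y[te])[0, 1]))
    return float(np.mean(zs))
```

Because surprisal is appended after feature selection, the Ridge design matrix has at least one column even when LASSO selects nothing, which mirrors the surprisal-only baseline described in the text.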

For testing statistical significance among different featurizers \(Figs\. [2](https://arxiv.org/html/2606.06857#S4.F2), [4](https://arxiv.org/html/2606.06857#S6.F4), [6](https://arxiv.org/html/2606.06857#A2.F6), [9](https://arxiv.org/html/2606.06857#A5.F9)\), we use a paired $t$\-test on the predictivity values obtained across all eight participants\. For each figure, \* = $p < 0.05$, \*\* = $p < 0.01$, \*\*\* = $p < 0.001$\.
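
For reference, the standard paired $t$ statistic for this setting is written out below. Here $a_i$ and $b_i$ denote participant $i$'s predictivity under the two featurizers being compared; how voxel-level scores are aggregated into one value per participant is an assumption not specified in this appendix.

$$
d_i = a_i - b_i, \qquad
\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i, \qquad
s_d = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(d_i - \bar{d}\right)^2}, \qquad
t = \frac{\bar{d}}{s_d/\sqrt{n}}
$$

With $n = 8$ participants, the statistic is compared against a $t$ distribution with $n - 1 = 7$ degrees of freedom.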

#### Generalization Analysis

Generalization analyses described in Sections [5](https://arxiv.org/html/2606.06857#S5) and [6](https://arxiv.org/html/2606.06857#S6) proceeded as follows\. At a high level, the goal is to understand which voxels in a population are predicted by the same sets of features\. To do this, we first identify a sparse set of features that predict one voxel, a source voxel, and then assess how well the identified \(signed\) feature set predicts each other voxel in the population\.

In service of this goal, we identify features that predict a source voxel by first filtering features using an F\-Test, then fitting a LASSO regression for feature selection, and finally fitting a Ridge regression to arrive at a final set of coefficients and features\. Regardless of the results of the LASSO feature selection, we always include the surprisal feature in the Ridge regression\. We run this pipeline using all sentence stimuli, resulting in a sparse, signed set of features that predict the source voxel\.

Next, we wish to assess the degree to which this sparse, signed feature set can predict each of the other voxels in the population\. We call each of these voxels target voxels\. We use 5\-fold cross\-validation over sentence stimuli to assess how well the \(signed\) features predict a target voxel\. Within each of these folds, we use an 80\-20 train/test split \(rather than a computationally expensive nested 5\-fold CV\) to identify the optimal $\alpha$ value for a sign\-constrained Ridge regression, searching over a range from 0\.01 to 100000\. After identifying the optimal $\alpha$ value for a fold, we refit a sign\-constrained Ridge regression and predict test split voxel responses\.

#### Feature Interpretation Analysis

The feature interpretation analyses employed by Sections [4](https://arxiv.org/html/2606.06857#S4), [5](https://arxiv.org/html/2606.06857#S5), and [6](https://arxiv.org/html/2606.06857#S6) proceed by first filtering features using an F\-Test, then fitting a LASSO regression for feature selection, and finally fitting a Ridge regression to arrive at a final set of coefficients and features\. To arrive at a feature set that predicts voxel responses to all stimuli, we run this pipeline over all sentence stimuli \(rather than performing a cross\-validated analysis\)\. After fitting the Ridge regression, we note the sign of each selected feature\.

## Appendix D Feature Set Support Size

In Fig\. [7](https://arxiv.org/html/2606.06857#A4.F7), we present the average support size \(i\.e\., number of features\) resulting from LASSO\-based feature selection using residual stream, JumpReLU, and Matryoshka feature sets for Abstract and Concrete voxel subtypes\. When including processing difficulty voxel subtypes, the average Matryoshka support size is 18 features\. In Fig\. [8](https://arxiv.org/html/2606.06857#A4.F8), we do the same, except for each language fROI\. We find that Matryoshka feature sets result in comparable support set sizes to residual stream regressions, and typically smaller support sets than JumpReLU regressions\.

![Refer to caption](https://arxiv.org/html/2606.06857v1/x7.png)

Figure 7: Average support set sizes \(number of features selected by LASSO regression\) for Abstract and Concrete voxel subtype regressions for all three feature sets used in Section [4](https://arxiv.org/html/2606.06857#S4)\.

![Refer to caption](https://arxiv.org/html/2606.06857v1/x8.png)

Figure 8: Average support set sizes \(number of features selected by LASSO regression\) for fROI regressions for all three feature sets used in Section [6](https://arxiv.org/html/2606.06857#S6)\.

## Appendix E Ghost Voxel Regressions

In this section, we demonstrate that, on average, none of our feature sets can predict voxels in the Ghost subtype well\. From Fig\. [9](https://arxiv.org/html/2606.06857#A5.F9), one can see that even the best performing feature sets \(Residual and Matryoshka\) only achieve Normalized Predictivity of approximately 0\.1\.

![Refer to caption](https://arxiv.org/html/2606.06857v1/x9.png)

Figure 9: Regression predictivity for Ghost voxels using all feature sets\.

## Appendix F Replication Using 3T Brain Dataset

In this section, we replicate our main quantitative results from Section [6](https://arxiv.org/html/2606.06857#S6) using an independent brain dataset \(Tuckute et al\., [2024b](https://arxiv.org/html/2606.06857#bib.bib22)\)\. Five participants read 1,000 linguistically diverse sentences during 3T fMRI\. Language fROIs were defined using the same localizer procedure as in the main analyses, and we refer the reader to Tuckute et al\. \([2024b](https://arxiv.org/html/2606.06857#bib.bib22)\) for additional methodological details on this dataset\. This dataset differs from the main 7T dataset in several respects: different participants, lower field strength \(3T vs\. 7T\), different stimulus modality \(reading vs\. listening\), different MRI preprocessing, and a different experimental design\.

The experiment was designed for fROI\-level rather than voxel\-level analyses: it was acquired at 3T \(vs\. 7T\) and contains no within\-participant stimulus repetitions, so voxel\-wise reliability cannot be estimated from repeated stimulus presentation \(Allen et al\., [2022](https://arxiv.org/html/2606.06857#bib.bib35)\)\.

Despite these differences, Fig\. [10](https://arxiv.org/html/2606.06857#A6.F10) shows the same qualitative pattern as Fig\. [4](https://arxiv.org/html/2606.06857#S6.F4): MFG exhibits the smallest difference between surprisal and the other feature sets, and AntTemp exhibits the largest difference\.

![Refer to caption](https://arxiv.org/html/2606.06857v1/x10.png)

Figure 10: Replication of Fig\. [4](https://arxiv.org/html/2606.06857#S6.F4) on an independent dataset\. Following the procedure in Tuckute et al\. \([2024b](https://arxiv.org/html/2606.06857#bib.bib22)\), the brain responses in each language fROI were averaged across the five participants\. Predictivity is reported as the mean Fisher\-$z$\-transformed correlation across 5 CV folds, normalized by the across\-participant noise ceiling; see Tuckute et al\. \([2024b](https://arxiv.org/html/2606.06857#bib.bib22)\) for details\. Error bars show SEM across 5 cross\-validation splits\.

## Appendix G Detailed Information about Language fROI Voxel Selection for Generalization Analysis

Due to the computational cost of running the voxel\-to\-voxel generalization analyses for the language fROI voxels \(which grows quadratically with the number of voxels; Section [6](https://arxiv.org/html/2606.06857#S6)\), we sub\-sampled fROI voxels \(see Table [4](https://arxiv.org/html/2606.06857#A1.T4)\)\. The total number of language fROI voxels across participants was 5,021\. The sub\-sampling proceeded as follows\. First, we “deduplicated” voxels with identical brain response profiles within each participant \(such duplicates arise from $k$\-nearest\-neighbor interpolation between native participant surfaces and the fsaverage template space \(Fischl et al\., [1999](https://arxiv.org/html/2606.06857#bib.bib57)\), where distinct fsaverage vertices can inherit the response of the same native\-surface vertex\), dropping 646 voxels \($13\%$; $5{,}021 \to 4{,}375$\)\. Second, we kept the top $50\%$ of voxels per \(participant, fROI\) combination, ranked by cross\-validated NC\-normalized predictivity \($r$\) using the Matryoshka SAE features\. Two participants had relatively few high\-NCSNR voxels \(P4: 117, P7: 113\), and hence we did not further subsample these two participants’ voxels\. The final $2{,}296$ voxels are distributed across participants and fROIs as reported in Table [5](https://arxiv.org/html/2606.06857#A7.T5)\.
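
The two sub-sampling steps can be sketched as below. The record layout, the function name, the choice to keep the first of each set of duplicates, the floor rounding for odd group sizes, and counting a participant's voxels after deduplication are all assumptions made for illustration.

```python
from collections import defaultdict


def subsample_froi_voxels(voxels, keep_frac=0.5, min_voxels=200):
    """Deduplicate voxels, then keep the top fraction per (participant, fROI).

    voxels: list of dicts with keys 'participant', 'froi', 'responses'
    (sequence of per-sentence responses) and 'predictivity' (hypothetical layout).
    """
    # Step 1: within each participant, drop voxels whose response profile
    # exactly duplicates one already seen (first occurrence is kept).
    seen, unique = set(), []
    for v in voxels:
        key = (v["participant"], tuple(v["responses"]))
        if key not in seen:
            seen.add(key)
            unique.append(v)

    n_per_participant = defaultdict(int)
    groups = defaultdict(list)
    for v in unique:
        n_per_participant[v["participant"]] += 1
        groups[(v["participant"], v["froi"])].append(v)

    # Step 2: keep the top keep_frac by predictivity within each group,
    # except for participants with fewer than min_voxels voxels.
    kept = []
    for (participant, _froi), members in groups.items():
        if n_per_participant[participant] < min_voxels:
            kept.extend(members)
            continue
        members.sort(key=lambda v: v["predictivity"], reverse=True)
        kept.extend(members[: max(1, int(len(members) * keep_frac))])
    return kept
```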

Table 5: Voxel counts for the fROI generalization analysis\. Starting from the language fROI voxels \(NCSNR $> 0.4$\), the top 50% by NC\-normalized predictivity \($r$\) were selected per participant $\times$ fROI, except for participants with fewer than 200 voxels \(P4, P7\), who were kept in full\.

## Appendix H Language fROI Qualitative Feature Analysis

We visualize the prevalence of signed Matryoshka features across all participants and fROIs in Fig\. [11](https://arxiv.org/html/2606.06857#A8.F11)\. We first note that many of the most prevalent features are from the most general Matryoshka bin\. Next, we note that many of these features appear in the voxel subtype analysis in Section [4](https://arxiv.org/html/2606.06857#S4)\. These features broadly reflect the interpretations of the PCs that have been identified in Tuckute et al\. \([2025](https://arxiv.org/html/2606.06857#bib.bib1)\)\. Features \-79, \-94, \+389, and \+40 are all concordant with Abstract voxels, which are predominantly located in the language network\. Features 44 and 71 are both driven more by sentences composed of fewer tokens, and are thus suppressed by more complicated sentences\. This reflects the prevalence of Hard\-to\-Process voxels in the language network\.

![Refer to caption](https://arxiv.org/html/2606.06857v1/x11.png)

Figure 11: Signed Matryoshka feature prevalence across all participants and all language fROIs\. We find many features from the most general Matryoshka bin, and many that arose when analyzing the voxel subtypes\.

## Appendix I Additional Hierarchical Representational Alignment Analyses

In this section, we present additional analyses on the utility of different Matryoshka features for predicting voxel responses\.

### I\.1 Extended fROI Analyses

We present an extreme case of the analysis presented in Fig\. [5](https://arxiv.org/html/2606.06857#S6.F5): we compare the predictivity of a regression restricted to features from the most general bin of Matryoshka features \(128 features\) to that of a regression restricted to the union of all other feature bins \(\>30K features\)\. In Fig\. [12](https://arxiv.org/html/2606.06857#A9.F12), we find that the general feature bin enables better predictivity than the union of all other feature bins\. This provides further support for the role of general features in predicting voxel tuning\.
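The shape of this restricted-feature comparison can be illustrated with a small NumPy sketch. Several things here are assumptions rather than the paper's method: the helper `cv_predictivity` is hypothetical, closed-form ridge regression is swapped in for the sparse (LASSO) encoding models to keep the example dependency-free, noise-ceiling normalization is omitted, and the data are synthetic and deliberately built so that the "general" features carry the signal. The output therefore shows how the two feature sets are compared, not the empirical result.

```python
import numpy as np

def cv_predictivity(X, y, alpha=1.0, n_folds=5, seed=0):
    """Cross-validated Pearson r between held-out ridge predictions and y."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    pred = np.empty(n)
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        mu_x, mu_y = X[train].mean(axis=0), y[train].mean()
        Xc = X[train] - mu_x
        # Dual-form ridge, w = Xc^T (Xc Xc^T + alpha I)^{-1} (y - mu_y),
        # which stays cheap when features far outnumber sentences.
        K = Xc @ Xc.T + alpha * np.eye(len(train))
        w = Xc.T @ np.linalg.solve(K, y[train] - mu_y)
        pred[test] = (X[test] - mu_x) @ w + mu_y
    return float(np.corrcoef(pred, y)[0, 1])

# Synthetic stand-in for one voxel (NOT the paper's data): 200 sentences,
# 128 dense "general" features, 2,000 sparse "other" features.
rng = np.random.default_rng(0)
n_sentences, n_general, n_other = 200, 128, 2000
X_general = rng.normal(size=(n_sentences, n_general))
X_other = rng.normal(size=(n_sentences, n_other)) * (
    rng.random((n_sentences, n_other)) < 0.05
)
# By construction, the response depends only on ten general features plus noise.
y = X_general[:, :10].sum(axis=1) + 0.5 * rng.normal(size=n_sentences)

r_general = cv_predictivity(X_general, y)  # regression restricted to the general bin
r_other = cv_predictivity(X_other, y)      # regression restricted to all other bins
print(f"general bin: r = {r_general:.2f}; union of other bins: r = {r_other:.2f}")
```

Repeating the two calls per voxel and comparing the resulting pairs of correlations gives a comparison of the kind summarized in the figure.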

![Refer to caption](https://arxiv.org/html/2606.06857v1/x12.png)

Figure 12: Predictivity of regressions restricted to using features from the most general Matryoshka bin vs\. the union of all other features\. We find that general features enable better predictivity\.

### I\.2 Replication using Subtypes Data

In this section, we replicate our feature granularity findings using the voxel subtypes data that were analyzed in Section [4](https://arxiv.org/html/2606.06857#S4)\. Figs\. [13](https://arxiv.org/html/2606.06857#A9.F13), [14](https://arxiv.org/html/2606.06857#A9.F14), and [15](https://arxiv.org/html/2606.06857#A9.F15) replicate the results presented in Fig\. [5](https://arxiv.org/html/2606.06857#S6.F5), and Fig\. [16](https://arxiv.org/html/2606.06857#A9.F16) replicates the results in Fig\. [12](https://arxiv.org/html/2606.06857#A9.F12)\.

![Refer to caption](https://arxiv.org/html/2606.06857v1/x13.png)

Figure 13: Histogram of how often each of the Matryoshka features is selected by encoding models in the voxel subtypes data\.

![Refer to caption](https://arxiv.org/html/2606.06857v1/x14.png)

Figure 14: Average support set sizes \(number of features selected by LASSO regression\) per Matryoshka bin when predicting voxels in each subtype\.

![Refer to caption](https://arxiv.org/html/2606.06857v1/x15.png)

Figure 15: Performance of encoding models with feature sets restricted to individual Matryoshka feature bins\.

![Refer to caption](https://arxiv.org/html/2606.06857v1/x16.png)

Figure 16: Predictivity of regressions restricted to using features from the most general Matryoshka bin vs\. the union of all other features\. We find that general features enable better predictivity for voxels in the PC\-derived subtypes\.

## Appendix J Matryoshka SAE Feature Summary Statistics

In this section, we provide summary statistics for the Matryoshka SAE features \(Gemma\-2\-2B, layer 12\)\. Table [6](https://arxiv.org/html/2606.06857#A10.T6) shows the count of non\-zero features per Matryoshka bin on the 200\-sentence stimulus set used in the main analyses, and Fig\. [17](https://arxiv.org/html/2606.06857#A10.F17) shows the number of non\-zero features firing on at least $N$ sentences as a function of $N$\.

Table [6](https://arxiv.org/html/2606.06857#A10.T6) shows that even in the fine\-grained \(late\) bins there are still thousands of non\-zero features \(e\.g\., 3,397 in the 2048–8192 bin; 5,921 in 8192\+\), and at least one feature in every bin fires on nearly all 200 sentences \(“Max fires” of 197–200\)\. The right panel of Fig\. [17](https://arxiv.org/html/2606.06857#A10.F17) shows that around $N \approx 50$, all five bins converge to comparable non\-zero feature counts \(45, 48, 41, 51, 37 at $N = 50$, respectively\), even though the bins differ in raw size by orders of magnitude\.

Bin\-level predictivity differences are therefore not a trivial consequence of, for example, all\-zero features in the late bins\.
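The per-bin quantities reported in Table 6 and Fig. 17 are simple counts over a sentence-by-feature activation matrix, as the sketch below illustrates. It assumes the activations have already been pooled to one value per sentence and feature (the pooling step is not described here), and the function names, the toy matrix, and the toy bin edges are hypothetical placeholders, not the SAE's actual bin boundaries.

```python
import numpy as np

def bin_stats(acts, bin_edges):
    """Per-bin counts in the spirit of Table 6.

    acts      : (n_sentences, n_features) sentence-level activations
    bin_edges : feature-index boundaries; bin i covers [bin_edges[i], bin_edges[i+1])
    """
    fires = (acts != 0).sum(axis=0)  # number of sentences each feature fires on
    stats = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        f = fires[lo:hi]
        nz = f[f > 0]  # features that fire on at least one sentence
        stats.append({
            "bin": (lo, hi),
            "n_nonzero": int(nz.size),
            "pct_nonzero": 100.0 * nz.size / (hi - lo),
            "min_fires": int(nz.min()) if nz.size else 0,
            "max_fires": int(nz.max()) if nz.size else 0,
        })
    return stats

def n_firing_at_least(acts, bin_edges, n):
    """Per-bin number of features firing on at least n sentences (cf. Fig. 17)."""
    fires = (acts != 0).sum(axis=0)
    return [int((fires[lo:hi] >= n).sum())
            for lo, hi in zip(bin_edges[:-1], bin_edges[1:])]

# Toy example: 4 sentences, 6 features, two bins (features 0-1 and 2-5).
acts = np.array([
    [1.0, 0.0, 0.0, 2.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 3.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
])
edges = [0, 2, 6]

stats = bin_stats(acts, edges)
print(stats[0])  # n_nonzero=2, pct_nonzero=100.0, min_fires=1, max_fires=4
print(stats[1])  # n_nonzero=1, pct_nonzero=25.0, min_fires=2, max_fires=2
print(n_firing_at_least(acts, edges, n=2))  # [1, 1]
```

Sweeping `n` from 1 to the number of sentences and plotting the per-bin counts would give curves of the kind shown in the figure.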

Table 6: Per\-bin statistics for the Matryoshka SAE on the 200 sentences used in the encoding models of the main analyses\. “N non\-zero” is the number of features that activate on at least one sentence; the parenthetical is the percentage of the total number of features in the bin\. “Min / Max fires” is the smallest / largest number of sentences \(out of 200\) that any single non\-zero feature in the bin fires on\.

![Refer to caption](https://arxiv.org/html/2606.06857v1/x17.png)

Figure 17: Number of non\-zero Matryoshka features firing on at least $N$ sentences, per bin\. Left: full range, $N \in [1, 200]$\. Right: zoom on $N \in [50, 200]$\.
