Brain Score Tracks Shared Properties of Languages: Evidence from Many Natural Languages and Structured Sequences

arXiv cs.CL Papers

Summary

This paper investigates whether Brain Score, a metric comparing language model representations to human fMRI activations during reading, is truly capturing human-like language processing or merely structural similarity. The researchers train language models on diverse natural languages and non-linguistic structured data (genome, Python, nested parentheses), finding that models trained on different languages and even non-linguistic sequences achieve similar Brain Score performance, suggesting the metric may not be sensitive enough to distinguish human-specific processing.

arXiv:2604.15503v1 Announce Type: new Abstract: Recent breakthroughs in language models (LMs) using neural networks have raised the question: how similar are these models' processing to human language processing? Results using a framework called Brain Score (BS) -- predicting fMRI activations during reading from LM activations -- have been used to argue for a high degree of similarity. To understand this similarity, we conduct experiments by training LMs on various types of input data and evaluate them on BS. We find that models trained on various natural languages from many different language families have very similar BS performance. LMs trained on other structured data -- the human genome, Python, and pure hierarchical structure (nested parentheses) -- also perform reasonably well and close to natural languages in some cases. These findings suggest that BS can highlight language models' ability to extract common structure across natural languages, but that the metric may not be sensitive enough to allow us to infer human-like processing from a high BS score alone.

Cached at: 04/20/26, 08:27 AM

# Brain Score Tracks Shared Properties of Languages: Evidence from Many Natural Languages and Structured Sequences
Source: https://arxiv.org/html/2604.15503
Ashvin Ranjan, Shane Steinert-Threlkeld, University of Washington, {jingnong, ar31, shanest}@uw.edu

###### Abstract

Recent breakthroughs in language models (LMs) using neural networks have raised the question: how similar are these models' processing to human language processing? Results using a framework called Brain Score (BS)—predicting fMRI activations during reading from LM activations—have been used to argue for a high degree of similarity. To understand this similarity, we conduct experiments by training LMs on various types of input data and evaluate them on BS. We find that models trained on various natural languages from many different language families have very similar BS performance. LMs trained on other structured data—the human genome, Python, and pure hierarchical structure (nested parentheses)—also perform reasonably well and close to natural languages in some cases. These findings suggest that BS can highlight language models' ability to extract common structure across natural languages, but that the metric may not be sensitive enough to allow us to infer human-like processing from a high BS score alone.

## 1 Introduction

Modern language models (LMs) have proven potent in imitating human use of language (Cai et al., 2024; Wilcox et al., 2024). It is therefore an interesting question whether the language models represent languages similarly to humans. Schrimpf et al. (2021) developed Brain Score (BS) for language as a metric to quantify one important aspect of this similarity. In particular, the metric tests how well the internal representations of a language model can predict functional magnetic resonance imaging (fMRI) responses of human brains when reading text.

While proposing the metric, Schrimpf et al. (2021) also found a strong correlation between models' next-word prediction ability and their BS performance. They took this finding as evidence that human language understanding is also optimized for predictive processing, an interesting claim that would benefit from more careful testing.

In this paper, we set out to test this link. If next-word prediction in language models indeed mirrors human language processing, we would expect such similarity to be language-specific. That is, if we are processing English, the prediction should be based on English rather than on a typologically and structurally different language such as Indonesian. For the hypothesis to fully hold, this language-specificity should carry over to language models. In other words, an Indonesian language model should not perform as well as an English language model in BS evaluation with English stimuli. Section 2 gives more background on BS and on the work that inspired our experimental design.

To test the language-specificity of BS, we conduct a series of experiments—depicted in Figure 1—where we train a group of language models using a wide variety of natural languages and other structured sequences (the human genome, Python code, and nested parentheses). We then evaluate these LMs using BS on the same English reading data as above (Pereira et al., 2018; Schrimpf et al., 2021). In order to do this, we lightly adapt only the embedding layers of all of these models to acquire English vocabulary. The details of the experiments are explained in Section 3.

Our results show no statistically significant difference in BS among models trained on various natural languages, on either of two evaluation datasets. LMs trained on structured sequences have significantly higher BS performance than random baselines. LMs trained on a programming language, Python, have only slightly lower BS than LMs trained on natural languages. These results, along with more detailed analyses of these experiments, can be found in Section 4.

On one hand, our findings suggest that LMs are able to extract common structure across human languages, which can be responsible for high BS scores. On the other hand, the indistinguishability between natural languages and the high scores for structured sequences also cast doubt on the hypothesized similarity between language model processing and human language processing, due to the lack of language-specificity. We discuss further ramifications of our results and avenues for refined metrics in Section 5.

Figure 1: The pipeline for training and evaluating the models. All training starts from a randomly initialized model. These randomly initialized models diverge in the first step by going through full training on a variety of datasets. They then go through a separate step of embedding adaptation on English, where the embedding layers are further trained and the rest of the model is frozen. Finally, each model produces representations of the English sentences used by Pereira et al. (2018), and these sentence representations are used to predict the human brain voxel responses to the same sentences.

## 2 Related Work

### 2.1 Brain Score

Schrimpf et al. (2018) first proposed BS as a measure of similarity between neural networks and human brains in the task of visual object recognition. Schrimpf et al. (2021) later implemented BS for natural language on several English datasets of human brain responses by comparing neural network responses to natural language stimuli with human brain imaging responses to the same stimuli. Here, we focus on the dataset that derives from Pereira et al. (2018), consistent with follow-up works to Schrimpf et al. (2021).

Several works have since attempted to identify the factors that contribute to BS for language. Pasquiou et al. (2022) tested BS using the brain imaging data of participants listening to The Little Prince and concluded that training increases models' performance in BS.

Kauf et al. (2024) manipulated the sentences used by Pereira et al. (2018) as stimuli for human subjects in various ways before using them to predict human brain activity from pretrained English language models. Computing the BS of pretrained models on the manipulated stimuli, they observed that manipulations that modify semantics have significantly more impact on BS than manipulations that alter syntax. They concluded that lexical-semantic information is vital for performance in BS, while syntactic structure is not.

Hosseini et al. (2024) found that models trained on a developmentally realistic amount of data (specifically, 100M tokens) achieve a BS nearly as high as very large models. Because of this finding, we use 100M tokens for all data types in our training procedures.

Feghhi et al. (2024) discovered that the reason for the performance of untrained GPT2-XL models, which achieve surprisingly good performance on BS, can be largely attributed to the use of shuffled train-test splits, sentence length, and sentence position. They also found that the performance of trained models in BS can be largely accounted for by sentence length, sentence position, and static word embeddings.

### 2.2 Pretraining on Alternative Datasets

The methodology of this paper is inspired by previous works that employ alternative pretraining data in place of natural languages for neural networks.

Papadimitriou and Jurafsky (2020, 2023) have found success in pretraining on a wide variety of data, including simple formal languages, music, and typologically distinct natural languages, to lower perplexity of language models. We also base our embedding adaptation on the methods used by Papadimitriou and Jurafsky (2020) on long short-term memory (LSTM) models.

Similar procedures have also been shown to be effective for various natural language downstream tasks. Chiang and Lee (2022) tested sample manipulations on token distribution and formal languages and found success in various English downstream tasks. Hu et al. (2025) concluded that some formal language data is more helpful than natural language data in pretraining for lowering loss and improving linguistic generalization. Jiang et al. (2026) focused on procedural data, which are based on formal languages and simple algorithms, and found that front-loading such data can improve model performance on natural language, code, and informal mathematics. Kim et al. (2024) found that pretraining on code helps models better track entities in natural languages. Ri and Tsuruoka (2022) created artificial languages for pretraining and discovered that a nesting dependency structure is helpful for language modeling and dependency parsing. These successes lead us to reasonably expect some level of transfer to BS performance from training on non-natural-language datasets.

## 3 Methodology

Our overall methodology, depicted in Figure 1, is to train language models from scratch on a variety of different datasets and then evaluate their BS on the English reading data from Pereira et al. (2018) after an embedding adaptation step. We outline each component of this pipeline and detail our full experimental setup in the subsequent subsections. Code and data are available at https://github.com/CLMBRs/xlbs.

### 3.1 Datasets

We curated a group of datasets that covers training situations with different levels of similarity to English, the subject language used for calculating BS (Pereira et al., 2018; Schrimpf et al., 2021). These datasets can be divided into three categories: natural languages, other structured sequences, and training without structures. Exact details on corpus construction are provided in Section 3.3.

Table 1: Classifications of natural languages according to Dryer and Haspelmath (2013). We consider the variety of Chinese on Wikipedia to be Mandarin, and the variety of Arabic to be Modern Standard Arabic.

#### Natural Languages

To ensure a similar data quality and style across different natural languages, we use official Wikipedia dumps from November 2023 (Wikimedia Foundation, 2023). To balance typological diversity and data availability, we select 7 languages as shown in Table 1.

The experiments by Pereira et al. (2018), which underlie the evaluation of BS, consist of two parts, Experiments 2 and 3. Experiment 2 uses only Wikipedia-style texts as stimuli for human subjects, while Experiment 3 uses both Wikipedia-style texts and first- and third-person narratives as stimuli. To have a dataset more aligned with the stimuli of Experiment 3, we also include a separate dataset that combines the English Wikipedia dump with the English subset of the Project Gutenberg dataset (Project Gutenberg, n.d.; Faysse, 2023) at a 3:1 ratio by example count, following a similar mix used by Hosseini et al. (2024). We refer to this dataset as "Mix".
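As a rough illustration of mixing at a 3:1 ratio by example count, the sketch below subsamples two lists of examples and shuffles the result. The function name `mix_datasets`, the subsampling strategy, and the final shuffle are assumptions made for illustration; the text above only specifies the ratio.

```python
import random


def mix_datasets(wiki_examples, gutenberg_examples, ratio=3, seed=0):
    """Combine two example lists at ratio:1 (Wikipedia : Gutenberg) by count.

    Hypothetical helper: how examples are selected and ordered is an
    assumption, not something stated in the paper.
    """
    rng = random.Random(seed)
    # Use as many Gutenberg examples as the available Wikipedia data allows.
    n_gutenberg = min(len(gutenberg_examples), len(wiki_examples) // ratio)
    n_wiki = n_gutenberg * ratio
    mixed = rng.sample(wiki_examples, n_wiki) + rng.sample(gutenberg_examples, n_gutenberg)
    rng.shuffle(mixed)
    return mixed
```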

#### Other Structured Sequences

We also select a variety of structured sequences that are not natural languages. These datasets include a simple formal language of nesting parentheses (the Dyck language), Python code from the Stack (Kocetkov et al., 2022), and the reference genome of Homo sapiens (National Center for Biotechnology Information, 2022).

In the Dyck language, each type of parenthesis is assigned a single unique token that serves as both its opening and its closing symbol. For each generated token, we set a probability of 0.51 that the token is a closing parenthesis, in which case it is the same token as the most recent one that has appeared an odd number of times in the current string, unless the token has to be an opening one. Otherwise, the token is an opening parenthesis drawn with equal chance from 49,999 unique tokens.
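The generation procedure can be sketched with a stack of currently open tokens. This is one reading of the description, not the authors' code: the function name `generate_dyck`, the fixed target length, and the choice to leave any still-open parentheses unclosed at the end are all assumptions.

```python
import random


def generate_dyck(length, vocab_size=49_999, p_close=0.51, seed=0):
    """Sample a nested-parentheses sequence of token ids.

    Each id acts as both the opening and the closing symbol of its
    parenthesis type, so closing repeats the innermost open id.
    """
    rng = random.Random(seed)
    open_stack = []  # unmatched token ids, innermost last
    sequence = []
    for _ in range(length):
        # Closing is only possible when something is open; with an empty
        # stack the token has to be an opening one.
        if open_stack and rng.random() < p_close:
            token = open_stack.pop()
        else:
            token = rng.randrange(vocab_size)  # uniform over parenthesis types
            open_stack.append(token)
        sequence.append(token)
    return sequence
```

Because the closing probability is only slightly above one half, the stack depth behaves like a weakly downward-biased random walk, which allows deep nesting to occur before the structure closes again.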

We preprocess the Python code by first tokenizing the code using Python's built-in tokenizer and assigning special tokens for semantically significant whitespace in Python (i.e. newline, indent, dedent). In addition, all comments and strings are masked with corresponding special tokens to avoid natural language leaking into the dataset.
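A minimal sketch of this preprocessing using the standard-library `tokenize` module follows. The special-token strings (`<NEWLINE>`, `<INDENT>`, and so on) and the function name `preprocess_python` are hypothetical, since the actual token names are not given here.

```python
import io
import tokenize

# Hypothetical special-token names; the paper's actual names are not given.
SPECIAL = {
    tokenize.NEWLINE: "<NEWLINE>",
    tokenize.INDENT: "<INDENT>",
    tokenize.DEDENT: "<DEDENT>",
    tokenize.COMMENT: "<COMMENT>",  # mask comments
    tokenize.STRING: "<STRING>",    # mask string literals
}
# Token types that carry no syntactic meaning and are dropped in this sketch.
SKIP = {tokenize.NL, tokenize.ENDMARKER}


def preprocess_python(source):
    """Tokenize Python source, masking comments/strings and marking whitespace."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIP:
            continue
        tokens.append(SPECIAL.get(tok.type, tok.string))
    return tokens
```

One limitation of this sketch: on Python 3.12 and later, f-strings are emitted as separate `FSTRING_*` token types rather than `STRING`, so they would need additional handling to be masked.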

For the human genome data, we eliminate all the headings in the dataset to again avoid natural language leakage and irrelevant information.
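Assuming the reference genome is read from FASTA-formatted text, where each record begins with a header line starting with `>`, removing the headings could look like the sketch below. The FASTA assumption and the function name `strip_fasta_headers` are illustrative and not taken from the text above.

```python
def strip_fasta_headers(lines):
    """Yield only sequence lines, dropping FASTA header lines and blank lines.

    Assumes FASTA input, where header lines start with '>'.
    """
    for line in lines:
        line = line.strip()
        if line and not line.startswith(">"):
            yield line
```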

#### Unstructured Training

To set up baselines, we include a dataset of scrambled English Wikipedia. In this dataset, all tokens are scrambled across the dataset. Despite keeping some basic statistical information of the token frequencies in the dataset, the contextual dependence in natural languages is eliminated. Finally, we also test a version of the model that is not trained on any dataset after initialization.
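The scrambled baseline can be illustrated by shuffling all tokens in one global pool, which preserves corpus-level token frequencies while destroying contextual dependence. Keeping the original document lengths when redistributing the tokens is an assumption of this sketch, as is the function name `scramble_corpus`.

```python
import random


def scramble_corpus(documents, seed=0):
    """Shuffle tokens across the entire corpus.

    `documents` is a list of token lists. Corpus-wide unigram counts are
    unchanged; document lengths are kept (an assumption of this sketch).
    """
    rng = random.Random(seed)
    pool = [tok for doc in documents for tok in doc]
    rng.shuffle(pool)
    scrambled, start = [], 0
    for doc in documents:
        scrambled.append(pool[start:start + len(doc)])
        start += len(doc)
    return scrambled
```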

### 3.2 Brain Score

Conceptually, BS compares the similarity between the representations of stimuli in human brains and neural networks. As shown in Figure 1, sentences are represented in the language models and compared to the human fMRI images when subjects are shown the same sentences. In particular: a linear regression model is trained to predict fMRI responses in the language network from LM activations, and BS is then the Pearson correlation between actual and predicted fMRI responses. We utilize a Python implementation of this metric from an open-source GitHub repository (https://github.com/brain-score/language) (Schrimpf et al., 2018, 2020, 2021).
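The core computation described above (fit a linear map from LM activations to voxel responses, then correlate predicted and actual responses) can be sketched with NumPy as follows. This is a simplified illustration and not the brain-score implementation: it uses ordinary least squares, a single train/test split supplied by the caller, and no aggregation across voxels, subjects, or layers. The function name `brain_score_sketch` is hypothetical.

```python
import numpy as np


def brain_score_sketch(train_X, train_Y, test_X, test_Y):
    """Per-voxel Pearson r between predicted and actual held-out responses.

    X arrays: (n_sentences, n_features) LM activations.
    Y arrays: (n_sentences, n_voxels) fMRI responses.
    """
    def with_bias(X):
        # Append a column of ones so the regression has an intercept.
        return np.hstack([X, np.ones((X.shape[0], 1))])

    # Ordinary least squares fit of voxel responses from activations.
    weights, *_ = np.linalg.lstsq(with_bias(train_X), train_Y, rcond=None)
    predicted = with_bias(test_X) @ weights

    # Pearson correlation per voxel (column-wise).
    pred_c = predicted - predicted.mean(axis=0)
    true_c = test_Y - test_Y.mean(axis=0)
    numerator = (pred_c * true_c).sum(axis=0)
    denominator = np.linalg.norm(pred_c, axis=0) * np.linalg.norm(true_c, axis=0)
    return numerator / denominator
```

On real data the per-voxel correlations would still need to be aggregated into a single score; that step is left out here because its details are not given in this section.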

We compute the metric using the fMRI data from Experiments 2 and 3 conducted by Pereira et al. (2018). As discussed in Section 3.1, the two experiments use different styles of stimuli. A language model is evaluated on BS using this specific dataset under the method proposed by Schrimpf et al. (2021). Each layer of a language model is treated separately. 80% of the model representations and the human fMRI data are
