Brain-LLM Alignment Tracks Training Data, Not Typology


Summary

This paper investigates brain-LLM alignment across English, Chinese, and French using fMRI data and multiple LLMs, finding that training-language dominance and typological distance, not an inherent English advantage, drive alignment patterns.

arXiv:2605.23032v1 Announce Type: new Abstract: Brain-LLM alignment is well established in English, yet the brain's language network is neuroanatomically universal across languages. Does alignment also generalize cross-linguistically, and what governs the variation? We test this using fMRI data from 112 participants across English, Chinese, and French (the Le Petit Prince corpus) and seven LLMs spanning English-dominant, Chinese-dominant, and multilingual architectures. Our central finding is that training-language dominance, not an inherent property of English, drives the alignment pattern: a Chinese-dominant model (Baichuan2-7B), architecture-matched to LLaMA-2-7B, reverses the gradient entirely, aligning best with Chinese brains and worst with English. Beyond training dominance, formal typological distance independently covaries with alignment degradation, syntax-associated brain regions (IFG) show $2.3\times$ steeper typological gradients than lexico-semantic regions (PTL), and tokenization fertility accounts for $\sim$60% of a cross-linguistic shift in optimal encoding layer. These results reveal that the apparent "English advantage" in brain-LLM alignment is an artifact of training data composition, while the remaining variation reflects genuine typological structure concentrated in syntactic processing.

# Brain–LLM Alignment Tracks Training Data, Not Typology
Source: [https://arxiv.org/html/2605.23032](https://arxiv.org/html/2605.23032)
Dongxin Guo (The University of Hong Kong, Hong Kong, China; bettyguo@connect.hku.hk), Jikun Wu (Stellaris AI Limited, Hong Kong, China; hk950014@connect.hku.hk), Siu Ming Yiu (The University of Hong Kong, Hong Kong, China; smyiu@cs.hku.hk)

###### Abstract

Brain–LLM alignment is well established in English, yet the brain’s language network is neuroanatomically universal across languages. Does alignment also generalize cross-linguistically, and what governs the variation? We test this using fMRI data from 112 participants across English, Chinese, and French (the Le Petit Prince corpus) and seven LLMs spanning English-dominant, Chinese-dominant, and multilingual architectures. Our central finding is that training-language dominance, not an inherent property of English, drives the alignment pattern: a Chinese-dominant model (Baichuan2-7B), architecture-matched to LLaMA-2-7B, reverses the gradient entirely, aligning best with Chinese brains and worst with English. Beyond training dominance, formal typological distance independently covaries with alignment degradation, syntax-associated brain regions (IFG) show $2.3\times$ steeper typological gradients than lexico-semantic regions (PTL), and tokenization fertility accounts for $\sim$60% of a cross-linguistic shift in optimal encoding layer. These results reveal that the apparent “English advantage” in brain–LLM alignment is an artifact of training data composition, while the remaining variation reflects genuine typological structure concentrated in syntactic processing.

## 1 Introduction

The brain’s language network (a set of left-lateralized frontal and temporal regions) constitutes a functionally universal system, activating with remarkably conserved topography across 45 languages from 12 families (Malik-Moraleda et al., [2022](https://arxiv.org/html/2605.23032#bib.bib33); Fedorenko et al., [2024](https://arxiv.org/html/2605.23032#bib.bib32)). This universality generates a foundational question for cognitive science: if the neural hardware for language is shared, do the computational representations it builds also converge across typologically diverse languages? And if so, what factors (typological structure, training experience, or representational depth) modulate the degree of convergence?

Large language models (LLMs) provide a powerful tool for investigating this question. Encoding models that predict neural activity from LLM representations have revealed striking alignment between transformer representations and human brain responses (Schrimpf et al., [2021](https://arxiv.org/html/2605.23032#bib.bib34); Goldstein et al., [2022](https://arxiv.org/html/2605.23032#bib.bib35); Caucheteux and King, [2022](https://arxiv.org/html/2605.23032#bib.bib36)), with the best models approaching the noise ceiling of fMRI data (Tuckute et al., [2024b](https://arxiv.org/html/2605.23032#bib.bib38); Antonello et al., [2023](https://arxiv.org/html/2605.23032#bib.bib1)). However, this alignment has been established almost exclusively with English data (Tuckute et al., [2024a](https://arxiv.org/html/2605.23032#bib.bib37)), creating a critical blind spot for understanding the universality of language representation.

de Varda et al. ([2025](https://arxiv.org/html/2605.23032#bib.bib46)) recently demonstrated that multilingual encoding models transfer zero-shot across 21 languages, confirming that a shared meaning component underlies brain–LLM alignment. However, their study leaves key gaps: no formal typological distance metrics, only 2–3 participants per language in Study II (precluding voxel-wise analysis), exclusively multilingual models tested, and $\sim$4.5 minutes of stimuli compared to our $\sim$100 minutes of *The Little Prince*.

Construction grammar (CxG) provides a principled theoretical framework for predicting where cross-linguistic alignment should vary. CxG holds that linguistic knowledge consists of form–function pairings (constructions) at all levels of abstraction (Goldberg, [2005](https://arxiv.org/html/2605.23032#bib.bib55); Croft, [2001](https://arxiv.org/html/2605.23032#bib.bib54); Boas and Sag, [2012](https://arxiv.org/html/2605.23032#bib.bib78)). CxG predicts that cross-linguistic variation resides primarily in constructional (syntactic-functional) representations, while core semantic content is more universal (Goldberg, [2024](https://arxiv.org/html/2605.23032#bib.bib53)). This generates a testable neural prediction: brain regions associated with syntactic-constructional processing should show larger typological distance effects than regions associated with lexico-semantic processing. However, we note upfront that this prediction is also derivable from non-CxG theories that distinguish syntax from semantics (Mahowald et al., [2024](https://arxiv.org/html/2605.23032#bib.bib51)); we discuss CxG-specific evidence requirements in §[6](https://arxiv.org/html/2605.23032#S6.SS0.SSS0.Px4).

#### Why the result is non\-trivial\.

Three prima facie hypotheses would predict outcomes incompatible with the pattern we report. First, if the language network’s functional universality across 45 languages (Malik-Moraleda et al., [2022](https://arxiv.org/html/2605.23032#bib.bib33)) extends to representational format, alignment should depend on *what* is represented rather than *which* language dominates the model, predicting comparable alignment across all model–language pairs. Second, if alignment primarily reflects shared feature spaces emerging from sufficiently rich linguistic exposure (Antonello and Huth, [2023](https://arxiv.org/html/2605.23032#bib.bib39)), English-dominant models trained on $\sim$2T tokens ($\sim$90% English) could in principle align with Chinese brains as well as a Chinese-dominant model trained on far less. Third, typological-distance-based accounts predict that alignment tracks structural similarity between training and target languages, independent of training proportion. Distinguishing training-language dominance from these alternatives requires architecture- and scale-matched models that vary primarily in training-language composition: the controlled comparison Baichuan2-7B affords against LLaMA-2-7B.

We address these gaps with five advances: \(i\) formal typological distance metrics as quantitative predictors; \(ii\) a Chinese\-dominant LLM \(Baichuan2\-7B\) to disentangle the training\-data confound; \(iii\) tokenization fertility analysis; \(iv\) region\-specific typological gradient analysis; and \(v\) noise\-ceiling\-normalized comparisons throughout\. Our contributions are:

1. First quantitative demonstration that formal typological distance covaries with brain encoding performance. In mixed-effects models controlling for training data proportion, Grambank distance is associated with alignment degradation for language-dominant models ($\beta=-0.41$). With three languages providing limited degrees of freedom for the language-level predictor, we treat this as a well-supported descriptive pattern establishing a hypothesis to be tested across $\geq$10 languages; a cluster bootstrap at the language level yields a wider confidence interval (see §[4.2](https://arxiv.org/html/2605.23032#S4.SS2)).
2. Disentangling training dominance from typological distance via controlled comparison. Baichuan2-7B (Chinese-dominant) reverses the *alignment gradient*, the systematic pattern by which encoding performance for a given model decreases as the listener’s language diverges from the model’s dominant training language. Baichuan2-7B performs best for Chinese ($\tilde{r}=.85$) and substantially lower for non-dominant languages ($\tilde{r}_{\text{EN}}=.59$, $\tilde{r}_{\text{FR}}=.54$). This provides the first controlled demonstration with architecture- and scale-matched models that training language is the primary driver of this gradient.
3. Cross-linguistic layer dynamics with partial resolution of the Chinese shift. The intermediate layer advantage peaks 1.8 layers later for Chinese ($d=0.92$). Tokenization fertility (the average number of subword tokens produced per orthographic word; Rust et al. [2021](https://arxiv.org/html/2605.23032#bib.bib22)) reveals that Chinese has a fertility of $2.4$ tokens per word under XLM-R’s tokenizer versus $1.4$ for English ($1.7\times$ higher); controlling for fertility attenuates $\sim$60% of the shift (95% CI: [42%, 76%]), though collinearity with information density means this is an upper bound on fertility’s contribution.
4. Region-specific neural gradient consistent with syntax–semantics dissociation. IFG shows a $2.3\times$ steeper typological gradient than PTL (language-level bootstrap 95% CI: [1.4, 3.8]). This pattern is consistent with CxG’s prediction that structural variation is constructional (Goldberg, [2005](https://arxiv.org/html/2605.23032#bib.bib55)), but is also predicted by generic accounts distinguishing syntax from semantics, as well as IFG’s role in cognitive control for non-dominant language processing (Green and Abutalebi, [2013](https://arxiv.org/html/2605.23032#bib.bib69)). We discuss what CxG-specific evidence would require (§[6](https://arxiv.org/html/2605.23032#S6.SS0.SSS0.Px4)).

## 2 Background and Theoretical Motivation

### 2.1 The Universal Language Network

The language network comprises interconnected regions in left frontal and temporal cortex that respond selectively to linguistic input across modalities (Fedorenko et al., [2011](https://arxiv.org/html/2605.23032#bib.bib63), [2024](https://arxiv.org/html/2605.23032#bib.bib32)). These regions are functionally dissociable from domain-general systems (Blank and Fedorenko, [2017](https://arxiv.org/html/2605.23032#bib.bib64)) and are conserved across 45 languages from 12 families (Malik-Moraleda et al., [2022](https://arxiv.org/html/2605.23032#bib.bib33)). Studies of polyglots further show that all languages activate the same network (Malik-Moraleda et al., [2024](https://arxiv.org/html/2605.23032#bib.bib68); Perani and Abutalebi, [2005](https://arxiv.org/html/2605.23032#bib.bib67)). Cross-linguistic differences in syntactic processing have been documented at the behavioral level (Berzak et al., [2022](https://arxiv.org/html/2605.23032#bib.bib80)), but the extent to which these differences manifest in the language network’s alignment with computational models remains unknown.

### 2.2 Brain–LLM Alignment

The alignment between transformer representations and human brain activity has been established through encoding models (Schrimpf et al., [2021](https://arxiv.org/html/2605.23032#bib.bib34); Caucheteux and King, [2022](https://arxiv.org/html/2605.23032#bib.bib36); Goldstein et al., [2022](https://arxiv.org/html/2605.23032#bib.bib35)). Tuckute et al. ([2024b](https://arxiv.org/html/2605.23032#bib.bib38)) provided causal evidence for shared representational structure by driving and suppressing the language network using LLM-generated stimuli. Caucheteux and King ([2022](https://arxiv.org/html/2605.23032#bib.bib36)) identified the intermediate layer advantage: middle transformer layers best predict brain activity. Antonello and Huth ([2023](https://arxiv.org/html/2605.23032#bib.bib39)) offered an alternative account, arguing that alignment reflects shared feature spaces rather than shared computational objectives. Scaling laws (Antonello et al., [2023](https://arxiv.org/html/2605.23032#bib.bib1)), predictive coding hierarchies (Caucheteux et al., [2023](https://arxiv.org/html/2605.23032#bib.bib40); Heilbron et al., [2022](https://arxiv.org/html/2605.23032#bib.bib41)), developmental plausibility (Hosseini et al., [2024a](https://arxiv.org/html/2605.23032#bib.bib44)), and architectural comparisons (Goldstein et al., [2025](https://arxiv.org/html/2605.23032#bib.bib88)) further characterize this alignment, but virtually all studies use English data (Tuckute et al., [2024a](https://arxiv.org/html/2605.23032#bib.bib37)).

### 2.3 Cross-Linguistic Evidence and Surprisal Theory

Surprisal theory posits that processing difficulty is proportional to the negative log-probability of a word in context (Hale, [2001](https://arxiv.org/html/2605.23032#bib.bib4); Levy, [2008](https://arxiv.org/html/2605.23032#bib.bib47)). Cross-linguistic tests show that LLM surprisal predicts reading times across 11 languages (Wilcox et al., [2023](https://arxiv.org/html/2605.23032#bib.bib5)), with the logarithmic form confirmed at scale (Shain et al., [2024](https://arxiv.org/html/2605.23032#bib.bib48)), and GPT-3 surprisal explaining multiple N400 effects (Michaelov et al., [2024](https://arxiv.org/html/2605.23032#bib.bib49); Frank et al., [2015](https://arxiv.org/html/2605.23032#bib.bib50)). However, Oh and Schuler ([2023](https://arxiv.org/html/2605.23032#bib.bib6)) demonstrated an inverse scaling paradox: larger models give poorer reading-time predictions despite richer representations. This dissociation is directly relevant to cross-linguistic comparisons, as hidden-state and surprisal-based alignment may diverge differently across languages. Information-theoretic approaches to typological variation (Cotterell et al., [2018](https://arxiv.org/html/2605.23032#bib.bib31)) and information-restricted contrasts isolating syntax from semantics in fMRI (Pasquiou et al., [2023](https://arxiv.org/html/2605.23032#bib.bib8)) provide complementary lenses; our cross-language typological-distance covariate operates at a coarser grain and is therefore complementary rather than competing with within-sentence syntactic predictors.

### 2.4 Construction Grammar and Cross-Linguistic Variation

Construction grammar holds that linguistic knowledge is organized as constructions: learned pairings of form and function at all levels of abstraction (Goldberg, [2005](https://arxiv.org/html/2605.23032#bib.bib55); Croft, [2001](https://arxiv.org/html/2605.23032#bib.bib54)). This framework is grounded in cross-linguistic constructional approaches (Boas and Sag, [2012](https://arxiv.org/html/2605.23032#bib.bib78)) and connected to embodied processing accounts (Bergen and Chang, [2005](https://arxiv.org/html/2605.23032#bib.bib79)). Goldberg ([2024](https://arxiv.org/html/2605.23032#bib.bib53)) identified parallels (though not identity) between construction-based learning and LLM training. Rakshit and Goldberg ([2025](https://arxiv.org/html/2605.23032#bib.bib84)) showed that Pythia’s internal representations reflect the gradience predicted by CxG, and Kwon et al. ([2025](https://arxiv.org/html/2605.23032#bib.bib85)) argued that LLMs’ apparent rule-following failures are consistent with CxG.

We adopt CxG specifically, rather than only a generic syntax–semantics dissociation framework, for three convergent reasons. First, CxG’s central tenet that linguistic knowledge consists of form–function pairings provides a unit of cross-linguistic variation (the construction) at the same grain as ROI-level neural measurements, where constructional differences plausibly recruit distinct frontal and temporal subnetworks. Second, CxG is explicitly usage-based, aligning with the statistical-learning regime that produces LLM representations (Goldberg, [2024](https://arxiv.org/html/2605.23032#bib.bib53)): this lets us interpret model–brain alignment as a comparison between two usage-based learners rather than across mismatched levels of explanation. Third, computational evidence that LLM representations exhibit constructional gradience (Rakshit and Goldberg, [2025](https://arxiv.org/html/2605.23032#bib.bib84)) and CxG-consistent failure modes (Kwon et al., [2025](https://arxiv.org/html/2605.23032#bib.bib85)) indicates that constructional structure is in fact a meaningful target of comparison, not merely a theoretical preference.

For cross-linguistic brain–LLM alignment, CxG generates a prediction: because cross-linguistic variation is primarily structural, brain regions processing constructional information should show larger typological distance effects than regions processing core semantic content. We operationalize this as: the IFG (associated with syntactic processing; Fedorenko et al. [2011](https://arxiv.org/html/2605.23032#bib.bib63)) should show a steeper alignment gradient with typological distance than the PTL (associated with lexico-semantic processing; Hickok and Poeppel [2007](https://arxiv.org/html/2605.23032#bib.bib65)). As we discuss in §[6](https://arxiv.org/html/2605.23032#S6.SS0.SSS0.Px4), this same neural prediction is derivable from non-CxG frameworks; CxG provides a principled motivation but is not the only theory consistent with the result.

### 2.5 Representational vs. Computational Alignment and Theoretical Predictions

Following Antonello and Huth ([2023](https://arxiv.org/html/2605.23032#bib.bib39)), we distinguish representational alignment (do model features predict brain activity?) from computational alignment (does the brain use the same algorithms?). Encoding models test representational alignment; we reserve computational claims for future causal work (Tuckute et al., [2024b](https://arxiv.org/html/2605.23032#bib.bib38); McCoy et al., [2023](https://arxiv.org/html/2605.23032#bib.bib7)). We derive four testable predictions:

Prediction 1 (Language network universality): LLM representations should predict brain activity significantly above chance in *all* tested languages, not just English.

Prediction 2 (Syntax–semantics dissociation, consistent with CxG): Typological distance effects should be larger in brain regions processing syntactic/constructional information (IFG) than in regions processing core semantics (PTL).

Prediction 3 (Training-data dominance): If alignment reflects training data quality, a Chinese-dominant LLM should show a reversed alignment gradient (best for Chinese, worst for English).

Prediction 4 (Surprisal theory): If perplexity mediates alignment, per-language perplexity should negatively correlate with encoding performance.

## 3 Methods

### 3.1 Neuroimaging Data

We use the Le Petit Prince (LPP) multilingual fMRI corpus (Li et al., [2022](https://arxiv.org/html/2605.23032#bib.bib70)) (OpenNeuro ds003643): English ($n=49$; 9 runs, $\sim$99 min; Cornell University), Mandarin Chinese ($n=35$; 9 runs, $\sim$99 min; Jiangsu Normal University), and French ($n=28$; 9 runs, $\sim$98 min; NeuroSpin). These span moderate typological diversity (Grambank Hamming: EN–FR = 0.21, EN–ZH = 0.48, FR–ZH = 0.44). These three are the only LPP languages with sufficient subject-level coverage for voxelwise encoding with stable noise-ceiling estimation under a common time-aligned stimulus. The pair (English, Chinese) anchors the maximal typological contrast available within LPP, while French provides an Indo-European near-neighbor of English that partially separates family-level from training-dominance effects (§[4.2](https://arxiv.org/html/2605.23032#S4.SS2)). We note the language $\times$ site confound (each language at a different institution); the noise ceiling partially controls for site-level signal quality, but residual effects cannot be fully excluded (§[Limitations](https://arxiv.org/html/2605.23032#Sx2)). Following Li et al. ([2022](https://arxiv.org/html/2605.23032#bib.bib70)), fMRI data were motion-corrected, smoothed (4 mm FWHM), registered to MNI152, and high-pass filtered (0.01 Hz). Word-level predictors were convolved with a canonical HRF and downsampled to TR (2 s). We define six bilateral language network ROIs following Fedorenko et al. ([2011](https://arxiv.org/html/2605.23032#bib.bib63)): IFG, MFG, ATL, PTL, AG, and TP, using a leave-one-run-out functional localizer ($p<0.001$).
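
The convolution and downsampling step for word-level predictors can be illustrated with a minimal sketch. The double-gamma shape parameters, the 0.1 s fine grid, and the function names `canonical_hrf` and `word_features_to_tr` are assumptions made for illustration; the paper specifies only a canonical HRF and a 2 s TR.

```python
import numpy as np
from scipy.stats import gamma

def canonical_hrf(dt, duration=32.0):
    """Double-gamma HRF sampled every `dt` seconds (shape parameters are assumed)."""
    t = np.arange(0.0, duration, dt)
    peak = gamma.pdf(t, 6.0)         # positive lobe, mode near 5 s
    undershoot = gamma.pdf(t, 16.0)  # late undershoot, mode near 15 s
    hrf = peak - undershoot / 6.0
    return hrf / hrf.sum()

def word_features_to_tr(onsets, features, n_trs, tr=2.0, dt=0.1):
    """Put word features on a fine grid, convolve with the HRF, sample once per TR."""
    n_fine = int(round(n_trs * tr / dt))
    fine = np.zeros((n_fine, features.shape[1]))
    idx = np.clip(np.round(np.asarray(onsets) / dt).astype(int), 0, n_fine - 1)
    np.add.at(fine, idx, features)   # accumulate words that share a time bin
    hrf = canonical_hrf(dt)
    convolved = np.apply_along_axis(lambda x: np.convolve(x, hrf)[:n_fine], 0, fine)
    step = int(round(tr / dt))
    return convolved[::step][:n_trs]
```

Here `onsets` are word onset times in seconds and `features` is a words-by-dimensions array; the output has one row per fMRI volume.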

### 3.2 Language Models

We extract representations from seven models. English-dominant: GPT-2 Medium (345M, 24 layers; English-only) and LLaMA-2-7B (7B, 32 layers; $\sim$90% English; Touvron et al. [2023](https://arxiv.org/html/2605.23032#bib.bib14)). Chinese-dominant: Baichuan2-7B (7B, 32 layers; $\sim$55% Chinese; Yang et al. [2023](https://arxiv.org/html/2605.23032#bib.bib16)), architecture- and scale-matched to LLaMA-2-7B, differing primarily in training language composition; sensitivity analyses at 40% and 70% are reported in §[4.2](https://arxiv.org/html/2605.23032#S4.SS2.SSS0.Px2). Multilingual: mBERT (110M, 12 layers; Devlin et al. [2019](https://arxiv.org/html/2605.23032#bib.bib10)), XLM-R Large (560M, 24 layers; Conneau et al. [2020a](https://arxiv.org/html/2605.23032#bib.bib9)), BLOOM-7B (7B, 30 layers; Scao et al. [2022](https://arxiv.org/html/2605.23032#bib.bib13)), and Qwen2.5-7B (7B, 32 layers; Yang et al. [2024](https://arxiv.org/html/2605.23032#bib.bib81)). For autoregressive models, we use the last subword token hidden state; for bidirectional models, the mean across subword tokens.
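
The two pooling rules (last subword token for autoregressive models, mean over subword tokens for bidirectional ones) can be sketched as follows. The function name `pool_word_vectors` and the `(start, end)` span format are hypothetical; the sketch assumes the hidden states for one layer and the word-to-token alignment have already been computed.

```python
import numpy as np

def pool_word_vectors(hidden, spans, autoregressive):
    """Collapse subword hidden states into one vector per word.

    hidden: (n_tokens, d) activations from a single layer.
    spans:  list of (start, end) token indices per word, end exclusive.
    """
    pooled = []
    for start, end in spans:
        if autoregressive:
            pooled.append(hidden[end - 1])                 # last subword token
        else:
            pooled.append(hidden[start:end].mean(axis=0))  # mean over subwords
    return np.stack(pooled)
```

Taking the last subword for a causal model is the natural choice because only that position has attended to the whole word.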

#### Model selection rationale\.

We privileged architecture- and scale-matched comparisons over breadth. LLaMA-2-7B and Baichuan2-7B share architecture, parameter count, and broad training setup, differing primarily in training-language composition; this pairing isolates the training-dominance manipulation as cleanly as currently possible with publicly released models. The multilingual set (mBERT, XLM-R, BLOOM-7B, Qwen2.5-7B) provides the baseline for what multilingual training achieves at comparable scale. We do not include French-dominant models such as CamemBERT or FlauBERT because they are an order of magnitude smaller, bidirectional rather than autoregressive, and trained on substantially smaller corpora; these differences would confound rather than complement the dominance comparison. Even within the matched 7B pair the manipulation is approximate: residual differences in training-data composition and quality between Baichuan2 and LLaMA-2 are bounded by the $\pm$15% sensitivity analysis (§[4.2](https://arxiv.org/html/2605.23032#S4.SS2.SSS0.Px2)).

### 3.3 Encoding Model Pipeline

We use voxelwise ridge regression encoding models (Mitchell et al., [2008](https://arxiv.org/html/2605.23032#bib.bib61); Huth et al., [2016](https://arxiv.org/html/2605.23032#bib.bib19); Jain and Huth, [2018](https://arxiv.org/html/2605.23032#bib.bib20)): $\hat{y}_{v}=\mathbf{X}\boldsymbol{\beta}_{v}+\epsilon_{v}$, where $\mathbf{X}\in\mathbb{R}^{T\times d^{\prime}}$ is the PCA-reduced (top 100 components) feature matrix and $\boldsymbol{\beta}_{v}$ is the ridge coefficient vector. The ridge parameter $\alpha\in\{10^{-4},\ldots,10^{4}\}$ is selected via 8-fold inner CV; outer CV uses 9 folds corresponding to experimental runs.
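
A minimal sketch of this nested cross-validation with scikit-learn is given below. Several details are assumptions rather than the authors' implementation: the function name `encode_leave_one_run_out` is hypothetical, the inner 8 folds are plain contiguous K-folds over time points, one $\alpha$ is shared across voxels, and PCA is refit inside each outer fold.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.pipeline import make_pipeline

def encode_leave_one_run_out(X, Y, runs, n_components=100):
    """Mean held-out Pearson r per voxel.

    X: (T, d) stimulus features, Y: (T, V) BOLD responses, runs: (T,) run labels.
    """
    alphas = np.logspace(-4, 4, 9)  # 10^-4 ... 10^4
    scores = []
    for train, test in LeaveOneGroupOut().split(X, Y, groups=runs):
        k = min(n_components, X.shape[1], len(train) - 1)
        model = make_pipeline(PCA(n_components=k), RidgeCV(alphas=alphas, cv=8))
        model.fit(X[train], Y[train])
        pred = model.predict(X[test])
        # Pearson r per voxel as the mean product of z-scored time series
        pz = (pred - pred.mean(0)) / pred.std(0)
        yz = (Y[test] - Y[test].mean(0)) / Y[test].std(0)
        scores.append((pz * yz).mean(0))
    return np.mean(scores, axis=0)
```

Fitting PCA on the training runs only keeps the held-out run from leaking into the feature basis, which matters when the reported scores are compared against a noise ceiling.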

#### Evaluation and noise ceiling\.

Encoding performance is the Pearson $r$ between predicted and actual voxel time series, averaged across voxels and subjects. The noise ceiling is estimated via inter-subject correlation (Nastase et al., [2021](https://arxiv.org/html/2605.23032#bib.bib62); Lage-Castellanos et al., [2019](https://arxiv.org/html/2605.23032#bib.bib21)). We report both raw $r$ and normalized $\tilde{r}=r/\text{NC}$ with bootstrap 95% CIs (10,000 resamples).
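
One way to compute these quantities is sketched below, assuming a leave-one-subject-out inter-subject correlation as the noise ceiling and a percentile bootstrap over subjects. The exact estimator used in the paper is not stated in this section, and the names `isc_noise_ceiling` and `bootstrap_ci` are hypothetical.

```python
import numpy as np

def isc_noise_ceiling(data):
    """Leave-one-out inter-subject correlation for one voxel or ROI.

    data: (n_subjects, T) response time series to the same stimulus.
    """
    data = np.asarray(data, dtype=float)
    total = data.sum(axis=0)
    rs = []
    for s in range(data.shape[0]):
        others = (total - data[s]) / (data.shape[0] - 1)  # mean of the other subjects
        rs.append(np.corrcoef(data[s], others)[0, 1])
    return float(np.mean(rs))

def bootstrap_ci(values, n_boot=10_000, seed=42):
    """Percentile 95% CI of the mean, resampling subjects with replacement."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    means = values[idx].mean(axis=1)
    lo, hi = np.percentile(means, [2.5, 97.5])
    return float(lo), float(hi)
```

Given per-subject raw scores `r_subj` and a ceiling `nc`, the normalized scores are `r_subj / nc`, and `bootstrap_ci(r_subj / nc)` gives the interval.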

#### Cross\-linguistic uniformity\.

We define:

$$U_{m}=1-\frac{\sigma\left(\{r_{m,\ell}\}_{\ell\in\mathcal{L}}\right)}{\mu\left(\{r_{m,\ell}\}_{\ell\in\mathcal{L}}\right)} \tag{1}$$

where $r_{m,\ell}$ is the mean encoding performance of model $m$ for language $\ell$. $U$ is bounded above by 1 (perfect uniformity) and can theoretically be negative if $\sigma>\mu$; in practice, all observed values fall in $[0.62, 0.98]$. We report bootstrap 95% CIs (10,000 resamples).
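
The uniformity score is one minus the coefficient of variation across languages, which a few lines of Python make concrete. This sketch assumes the population standard deviation (the paper does not say which convention it uses) and, purely for illustration, feeds in the noise-ceiling-normalized scores quoted in the text rather than the raw $r$ values of Eq. 1, so the outputs need not match the reported $U$ values.

```python
import numpy as np

def uniformity(scores):
    """U = 1 - sigma/mu over per-language encoding scores (population SD assumed)."""
    scores = np.asarray(scores, dtype=float)
    return 1.0 - scores.std() / scores.mean()

# Normalized scores (EN, ZH, FR) quoted in the text, used here only as an illustration
xlmr = [0.83, 0.82, 0.81]
baichuan2 = [0.59, 0.85, 0.54]
print(uniformity(xlmr), uniformity(baichuan2))
```

A model whose scores barely change across languages scores close to 1, while a model with one strongly favored language scores lower.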

### 3.4 Typological Distance and Tokenization Fertility

#### Typological distance\.

We use Grambank (Skirgård et al., [2023](https://arxiv.org/html/2605.23032#bib.bib58)), which encodes 195 binary grammatical features per language spanning morphology (e.g., overt case-marking, agreement systems), syntax (e.g., basic word order, head-directionality), and structure–meaning mappings (e.g., classifier systems, alignment, tense–aspect grammaticalization). For each language pair we compute the Hamming distance over features defined for both languages (features missing in either are dropped pairwise), which can be read as the proportion of grammatical features on which the two languages disagree. Concretely, EN–FR = 0.21 reflects two head-initial Indo-European languages with overt verbal inflection and similar argument-structure marking; EN–ZH = 0.48 captures the contrasts that Chinese lacks inflectional morphology, uses classifiers, is topic-prominent, and is tonal; FR–ZH = 0.44 is similar in character to EN–ZH but smaller because French shares some properties (e.g., relatively analytic morphology) with Chinese. We complement Grambank with WALS (Dryer and Haspelmath, [2013](https://arxiv.org/html/2605.23032#bib.bib59)) (normalized Hamming over shared features) and lang2vec (Littell et al., [2017](https://arxiv.org/html/2605.23032#bib.bib17)) subspace vectors (syntactic, phonological, geographic) for sensitivity analysis.
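
The pairwise Hamming distance with missing features dropped can be sketched as follows. The sketch assumes each language is a vector of 0/1 feature values with `NaN` marking uncoded features; the function name `grambank_hamming` and the toy vectors are illustrative, not taken from Grambank.

```python
import numpy as np

def grambank_hamming(a, b):
    """Proportion of disagreeing features among those coded for both languages."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    shared = ~np.isnan(a) & ~np.isnan(b)  # drop features missing in either language
    if not shared.any():
        raise ValueError("no features coded for both languages")
    return float(np.mean(a[shared] != b[shared]))

# Toy 5-feature vectors (hypothetical values): 3 shared features, 1 disagreement
lang_a = [1, 0, np.nan, 1, 0]
lang_b = [1, 1, 0, np.nan, 0]
print(grambank_hamming(lang_a, lang_b))
```

Because the denominator is the number of shared features, distances for different language pairs may be computed over slightly different feature sets.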

#### Tokenization fertility\.

We compute tokenization fertility (the average number of subword tokens produced per orthographic word) across all language $\times$ tokenizer combinations from the full LPP text (Rust et al., [2021](https://arxiv.org/html/2605.23032#bib.bib22); Singh et al., [2019](https://arxiv.org/html/2605.23032#bib.bib23)). Higher fertility means the same content is spread across more tokens, which both dilutes per-token information and can shift the optimal encoding layer because deeper aggregation is required to recover word-level meaning. We include fertility as a covariate in the layer-shift analysis (§[4.3](https://arxiv.org/html/2605.23032#S4.SS3)).
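
Fertility reduces to a ratio of token count to word count, as the sketch below shows. It assumes the text is already segmented into words (for Chinese this requires a word segmenter, a choice the section does not specify) and uses a toy fixed-width tokenizer, `chunk3`, as a hypothetical stand-in for a real subword tokenizer.

```python
def fertility(words, tokenize):
    """Average number of subword tokens per word.

    words:    list of word strings (already segmented).
    tokenize: callable mapping one word to a list of subword tokens.
    """
    if not words:
        raise ValueError("empty word list")
    return sum(len(tokenize(w)) for w in words) / len(words)

def chunk3(word):
    """Toy tokenizer for illustration: split a word into pieces of up to 3 characters."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]

print(fertility(["the", "little", "prince"], chunk3))
```

In practice `tokenize` could be the `tokenize` method of a Hugging Face tokenizer, so that the same function is applied to every language and tokenizer combination.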

### 3.5 Statistical Framework

Encoding significance is assessed via permutation testing (1000 shuffles, FDR-corrected). For cross-linguistic comparisons, we use linear mixed-effects models with typological distance (the language-level predictor $d_{\text{typ}}$), model type, and their interaction as fixed effects, training proportion $p_{\text{train}}$ as a covariate, and random intercepts for subject and ROI:

$$r\sim d_{\text{typ}}\times\text{model\_type}+p_{\text{train}}+(1|\text{subj})+(1|\text{ROI}) \tag{2}$$

#### Degrees-of-freedom transparency.

The typological distance predictor varies at the language level; with three languages, there are at most 3 unique distance values per model category, yielding $\approx$1 residual degree of freedom (Snijders and Bosker, [2012](https://arxiv.org/html/2605.23032#bib.bib77)). Standard mixed-effects $p$-values exploit subject-level replication and overstate confidence. We therefore report both the standard LMM $p$-value and a cluster bootstrap resampling entire language groups (10,000 iterations). We apply Bonferroni correction across four pre-registered predictions.
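
The idea of resampling whole language groups can be sketched as below. To stay short, the sketch substitutes an ordinary least-squares slope for the full mixed-effects fit, which is a simplification and not the authors' model; the function name `cluster_bootstrap_slope` is hypothetical, and draws in which every sampled group has the same distance are skipped because the slope is undefined there.

```python
import numpy as np

def cluster_bootstrap_slope(d_typ, r, lang, n_boot=10_000, seed=42):
    """Percentile 95% CI for the slope of r on d_typ, resampling language groups.

    Returns ((lo, hi), n_kept), where n_kept counts draws with a defined slope.
    """
    rng = np.random.default_rng(seed)
    d_typ, r, lang = map(np.asarray, (d_typ, r, lang))
    groups = [np.flatnonzero(lang == g) for g in np.unique(lang)]
    slopes = []
    for _ in range(n_boot):
        picks = rng.integers(0, len(groups), size=len(groups))
        idx = np.concatenate([groups[p] for p in picks])
        if np.unique(d_typ[idx]).size < 2:  # only one distance value drawn
            continue
        slopes.append(np.polyfit(d_typ[idx], r[idx], 1)[0])
    lo, hi = np.percentile(slopes, [2.5, 97.5])
    return (float(lo), float(hi)), len(slopes)
```

With only three clusters, many draws contain just two distinct languages, which is exactly why the resulting interval is wider than the one implied by subject-level replication.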

#### Software\.

Python 3.10, scikit-learn 1.3, statsmodels 0.14; random seed 42. Computing: NVIDIA A100 GPUs (80 GB); feature extraction $\sim$6 hours, encoding $\sim$16 hours.

## 4 Experiments and Results

### 4.1 Experiment 1: Cross-Linguistic Brain–LLM Alignment

We first test Prediction 1: LLM representations should predict brain activity across all languages. Table [1](https://arxiv.org/html/2605.23032#S4.T1) presents encoding performance for each model–language combination.

Table 1: Encoding performance: raw $r$ and noise-ceiling-normalized $\tilde{r}=r/\text{NC}$. Bold: best per column. Full SEs in Appendix [B](https://arxiv.org/html/2605.23032#A2). All $p<0.001$, permutation test, FDR corrected ($p_{\text{Holm}}<0.004$ for Prediction 1). Baichuan2-7B *reverses* the typical alignment gradient.

Prediction 1 is confirmed ($p<0.001$ for all model–language pairs; $p_{\text{Holm}}<0.004$): all models show significant encoding across all three languages, confirming that brain–LLM alignment is not English-specific. XLM-R achieves the highest and most uniform performance ($\tilde{r}$: EN .83, ZH .82, FR .81). The critical new observation is Baichuan2-7B: this Chinese-dominant model *reverses* the alignment gradient, achieving its highest performance for Chinese ($\tilde{r}=.85$) and substantially lower performance for both non-dominant languages ($\tilde{r}_{\text{EN}}=.59$, $\tilde{r}_{\text{FR}}=.54$), directly mirroring LLaMA-2-7B’s pattern (EN-best, ZH-worst) with the dominant language reversed. The PTL shows the highest alignment across all languages and models (see Table [3](https://arxiv.org/html/2605.23032#A3.T3) in Appendix), and the multiple-demand (MD) network shows minimal encoding ($r<0.04$, n.s.), providing a specificity control.

### 4.2 Experiment 2: Disentangling Training Dominance from Typological Distance

The central confound in prior cross-linguistic brain–LLM work is that training data proportion correlates with typological distance ($\rho=0.89$ for English-dominant models). Baichuan2-7B directly addresses this.

Prediction 3 is confirmed ($p_{\text{Holm}}<0.001$): Baichuan2-7B shows the reversed pattern: ZH ($\tilde{r}=.85$) $>$ EN ($\tilde{r}=.59$). This demonstrates that the dominant alignment pattern (best alignment for the model’s dominant training language) is primarily driven by training data composition, not by an inherent typological advantage of English. The 7B autoregressive-only comparison (Table [9](https://arxiv.org/html/2605.23032#A8.T9) in Appendix) reveals a critical asymmetry: BLOOM-7B (multilingual) achieves higher *minimum* performance across languages ($\min\tilde{r}=.77$) than either LLaMA-2-7B ($\min\tilde{r}=.50$) or Baichuan2-7B ($\min\tilde{r}=.54$), and dramatically higher cross-linguistic uniformity ($U=.96$ vs. $.71$ and $.78$). Multilingual training creates representations that transcend training-language dominance.

#### Mixed\-effects test of typological distance\.

We fit the LMM described in §[3.5](https://arxiv.org/html/2605.23032#S3.SS5) (Eq. [2](https://arxiv.org/html/2605.23032#S3.E2)). For English-dominant models, the standard LMM yields $\beta=-0.41$, SE $=0.12$, $p<0.005$ (uncorrected). However, the cluster bootstrap, which resamples at the language level and therefore properly reflects the three unique distance values, yields a wider 95% CI [$-0.78$, $-0.09$] with $p_{\text{cluster}}=0.031$. The partial correlation analysis shows: alignment $\sim$ typological distance $|$ training proportion: $r_{\text{partial}}=-0.34$, $p<0.02$ (with the same degrees-of-freedom caveat); alignment $\sim$ training proportion $|$ typological distance: $r_{\text{partial}}=-0.47$, $p<0.005$. Both factors contribute, but training proportion explains more variance. For multilingual models, neither typological distance ($\beta=-0.08$, $p=0.31$) nor training proportion ($\beta=-0.05$, $p=0.44$) significantly predicts alignment, consistent with multilingual training creating language-invariant representations. We treat the typological distance effect as a well-supported descriptive pattern requiring confirmation across $\geq$10 languages, given that the effective degrees of freedom for the distance predictor are $\approx$1 per model category (§[3.5](https://arxiv.org/html/2605.23032#S3.SS5)).
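To make the language-level resampling concrete, the following is a minimal sketch of a percentile cluster bootstrap for a distance slope. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`ols_slope`, `cluster_bootstrap_slope_ci`), the plain OLS slope in place of the LMM, and the toy data are all hypothetical. Whole languages are drawn with replacement, so the interval reflects that only three distinct distance values exist.

```python
import random


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys on xs; None if xs has no spread."""
    if len(set(xs)) < 2:
        return None
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def cluster_bootstrap_slope_ci(clusters, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the slope, resampling whole languages.

    clusters maps language -> (distance, list of alignment values).
    """
    rng = random.Random(seed)
    names = list(clusters)
    slopes = []
    for _ in range(n_boot):
        drawn = [rng.choice(names) for _ in names]  # languages, with replacement
        xs, ys = [], []
        for name in drawn:
            distance, values = clusters[name]
            xs.extend([distance] * len(values))
            ys.extend(values)
        slope = ols_slope(xs, ys)
        if slope is not None:  # skip draws containing a single language
            slopes.append(slope)
    if not slopes:
        raise ValueError("no resample had more than one distinct distance")
    slopes.sort()
    lower = slopes[int((alpha / 2) * len(slopes))]
    upper = slopes[int((1 - alpha / 2) * len(slopes)) - 1]
    return lower, upper


# Hypothetical, noise-free toy data lying exactly on alignment = 1 - 0.4 * distance.
toy = {
    "EN": (0.0, [1.0, 1.0]),
    "FR": (0.5, [0.8, 0.8]),
    "ZH": (1.0, [0.6, 0.6]),
}
print(cluster_bootstrap_slope_ci(toy))
```

With only three clusters, a substantial share of resamples contain two or fewer distinct languages, which is why such intervals come out wider than observation-level standard errors suggest.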

#### Sensitivity to Baichuan2 training proportion\.

Because the $\sim$55% Chinese estimate for Baichuan2 is approximate, we re-ran the partial correlation analysis with the estimate varied by $\pm 15$ percentage points. At 40% Chinese: $r_{\text{partial}}(\text{alignment}\sim d_{\text{typ}}\,|\,p_{\text{train}})=-0.31$; at 55%: $-0.34$; at 70%: $-0.37$. The typological distance effect is stable across this range, and the qualitative conclusions are unchanged (full sensitivity results in Appendix [I](https://arxiv.org/html/2605.23032#A9)).
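A first-order partial correlation of the kind reported here can be computed from three pairwise Pearson correlations. The sketch below assumes a single control variable and uses hypothetical toy vectors; the function names are illustrative, and this is not the authors' analysis code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical toy vectors: the control z is uncorrelated with both x and y,
# so the partial correlation equals the plain Pearson correlation.
x = [1, 2, 3, 4]
y = [2, 1, 4, 3]
z = [1, -1, -1, 1]
print(partial_corr(x, y, z))
```

A sensitivity check in the spirit of the one described would recompute `partial_corr` after substituting each candidate training-proportion value into the control vector.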

### 4.3 Experiment 3: Layer-Wise Encoding Profiles

Figure 1: Layer-wise noise-ceiling-normalized encoding ($\tilde{r}$) for XLM-R (x-axis: XLM-R layer, 1 to 24; y-axis: encoding $\tilde{r}$; series: English, Chinese, French). The intermediate layer advantage is preserved cross-linguistically, but the peak shifts rightward for Chinese (layer 18 vs. 15, dashed lines). $\pm$1 SE bands in Appendix [D](https://arxiv.org/html/2605.23032#A4).

The intermediate layer advantage is preserved in all languages (Figure [1](https://arxiv.org/html/2605.23032#S4.F1)): performance peaks at layer 15 ($\pm 1.1$) for English, layer 18 ($\pm 1.3$) for Chinese, and layer 15 ($\pm 1.0$) for French. The Chinese rightward shift is significant (bootstrap $p<0.001$; $t(82)=4.21$, Cohen’s $d=0.92$) and replicates across BLOOM-7B ($d=0.85$; Appendix [D](https://arxiv.org/html/2605.23032#A4)). mBERT also shows a Chinese rightward shift (peak layer 8 vs. 7, $d=0.61$), arguing against a pure training-data explanation since mBERT’s Chinese proportion ($\sim$8%) is comparable to XLM-R’s ($\sim$9%).

#### Tokenization fertility analysis\.

Table [4](https://arxiv.org/html/2605.23032#A5.T4) (Appendix [E](https://arxiv.org/html/2605.23032#A5)) reveals a striking cross-linguistic asymmetry: Chinese requires substantially more subword tokens per word across all tokenizers. For XLM-R, Chinese fertility is $2.4$ tokens per word versus $1.4$ for English ($1.7\times$ higher). Including log-fertility as a covariate in the layer-shift mixed-effects model reduces the Chinese shift from $+1.8$ to $+0.7$ layers ($\Delta d=0.48$), suggesting that tokenization granularity accounts for approximately 60% (bootstrap 95% CI: [42%, 76%]) of the layer shift. The residual shift ($+0.7$ layers, $p<0.05$) may reflect genuine typological processing differences. We note that fertility correlates with information density per word, so the 60% attenuation should be interpreted as an upper bound on fertility’s causal contribution.
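The arithmetic behind the fertility ratio and the roughly 60% attenuation can be checked directly. In this sketch the helper names `fertility` and `attenuation` and the toy token list are hypothetical; the numeric inputs (2.4, 1.4, +1.8, +0.7) are the values quoted above.

```python
def fertility(subword_tokens, words):
    """Mean number of subword tokens per word."""
    return len(subword_tokens) / len(words)


def attenuation(shift_before, shift_after):
    """Fraction of a shift removed once a covariate is added."""
    return (shift_before - shift_after) / shift_before


# Hypothetical segmentation, only to show how fertility is counted.
toy_fertility = fertility(["un", "believ", "able", "story"], ["unbelievable", "story"])

ratio = 2.4 / 1.4              # Chinese vs. English fertility for XLM-R
share = attenuation(1.8, 0.7)  # shift in layers before and after the covariate

print(f"toy fertility: {toy_fertility:.1f} tokens per word")
print(f"fertility ratio ZH/EN: {ratio:.2f}")
print(f"share of shift removed: {share:.0%}")
```

The point estimate of about 61% is consistent with the "approximately 60%" reported, and the caveat in the text applies: this is attenuation under a covariate, not a causal decomposition.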

### 4.4 Experiment 4: Typological Distance and Alignment

Table [7](https://arxiv.org/html/2605.23032#A8.T7) (Appendix [H](https://arxiv.org/html/2605.23032#A8)) presents the alignment gradient alongside typological distances, and Figure [4](https://arxiv.org/html/2605.23032#A8.F4) (Appendix [H](https://arxiv.org/html/2605.23032#A8)) visualizes the key pattern: both language-dominant models (LLaMA-2, Baichuan2) show steep alignment degradation with distance from their dominant training language, while XLM-R remains flat. This symmetry supports the interpretation that the alignment gradient is primarily training-driven, but the mixed-effects test (§[4.2](https://arxiv.org/html/2605.23032#S4.SS2)) shows that typological distance retains independent predictive power ($\beta=-0.41$; cluster bootstrap $p=0.031$). With three languages, we treat this as a descriptive pattern establishing a hypothesis for larger-scale confirmation.

### 4.5 Experiment 5: ROI-Specific Typological Gradient

We test Prediction 2: does the typological gradient differ across brain regions?

Figure 2: Typological gradient slope (steeper = more negative = larger alignment drop with typological distance) per ROI for English-dominant models (x-axis: brain region; y-axis: typological gradient slope; bar values in plotted order: IFG $-0.76$, MFG $-0.62$, ATL $-0.51$, PTL $-0.33$, AG $-0.58$, TP $-0.45$). IFG shows the steepest gradient ($2.3\times$ PTL). Note: each slope is fit through three language-level data points; the language-level bootstrap 95% CI on the IFG/PTL ratio is [1.4, 3.8]. Error bars: $\pm$1 SE (bootstrap).

Prediction 2 is supported (Figure [2](https://arxiv.org/html/2605.23032#S4.F2)): IFG shows the steepest typological gradient (slope $=-0.76$), while PTL shows the shallowest ($-0.33$), a $2.3\times$ difference. The ROI $\times$ typological distance interaction is significant in the standard mixed-effects model ($F(5,312)=3.41$, $p<0.01$; $p_{\text{Holm}}=0.03$). The language-level bootstrap yields a 95% CI on the IFG/PTL ratio of [1.4, 3.8], confirming that the direction of the effect is robust even under conservative resampling, though the magnitude is imprecisely estimated with three languages. This pattern is consistent with a syntax–semantics dissociation: regions associated with syntactic processing (IFG) are more sensitive to typological variation than lexico-semantic regions (PTL), suggesting that the universal component of brain–LLM alignment is primarily semantic. We discuss the theoretical implications, including why this result does not uniquely support CxG, in §[6](https://arxiv.org/html/2605.23032#S6.SS0.SSS0.Px4).
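The $2.3\times$ figure follows from the two slopes stated in the text. In the sketch below, the IFG and PTL values are those given in the prose; the other four are read from the Figure 2 bar labels on the assumption that they appear in plotted order, so treat them as illustrative.

```python
# Typological gradient slopes per ROI. IFG and PTL are stated in the text;
# the remaining values assume the figure's bar labels are in plotted order.
slopes = {
    "IFG": -0.76,
    "MFG": -0.62,
    "ATL": -0.51,
    "PTL": -0.33,
    "AG": -0.58,
    "TP": -0.45,
}

steepest = min(slopes, key=slopes.get)    # most negative slope
shallowest = max(slopes, key=slopes.get)  # least negative slope
ratio = slopes[steepest] / slopes[shallowest]

print(f"steepest: {steepest}, shallowest: {shallowest}, ratio: {ratio:.1f}x")
```

Because each slope rests on three language-level points, the ratio is a point estimate only; the bootstrap interval of [1.4, 3.8] quoted above is the better guide to its precision.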

### 4.6 Experiment 6: Cross-Language Transfer

All cross-language transfer results are significantly above chance. The gradient aligns with typological distance: EN$\to$FR transfer retains 73% of within-language performance, substantially exceeding EN$\to$ZH (52%). Normalized results ($\tilde{r}$) confirm that this gradient persists after accounting for noise ceiling differences. Transfer results for LLaMA-2-7B and Baichuan2-7B (Appendix [J](https://arxiv.org/html/2605.23032#A10)) show that Baichuan2 exhibits a reversed transfer gradient, paralleling the within-language alignment pattern and further supporting the training-dominance account. Full results are reported in Table [13](https://arxiv.org/html/2605.23032#A13.T13) (Appendix [M](https://arxiv.org/html/2605.23032#A13)).

## 5 Related Work

Our work builds on encoding models (Mitchell et al., [2008](https://arxiv.org/html/2605.23032#bib.bib61); Huth et al., [2016](https://arxiv.org/html/2605.23032#bib.bib19); Jain and Huth, [2018](https://arxiv.org/html/2605.23032#bib.bib20)) systematized for LLMs by Schrimpf et al. ([2021](https://arxiv.org/html/2605.23032#bib.bib34)) and Caucheteux and King ([2022](https://arxiv.org/html/2605.23032#bib.bib36)). Key precedents include bidirectional brain–NLP benefits (Toneva and Wehbe, [2019](https://arxiv.org/html/2605.23032#bib.bib2)), semantic decoding (Tang et al., [2023](https://arxiv.org/html/2605.23032#bib.bib42)), architectural comparisons (Goldstein et al., [2025](https://arxiv.org/html/2605.23032#bib.bib88)), and evidence that alignment reflects more than next-word prediction (Merlin and Toneva, [2024](https://arxiv.org/html/2605.23032#bib.bib3)). On multilingual NLP: Conneau et al. ([2020a](https://arxiv.org/html/2605.23032#bib.bib9)) and Conneau and Lample ([2019](https://arxiv.org/html/2605.23032#bib.bib11)) established cross-lingual pretraining; Muller et al. ([2021](https://arxiv.org/html/2605.23032#bib.bib56)), Libovický et al. ([2020](https://arxiv.org/html/2605.23032#bib.bib57)), and Conneau et al. ([2020b](https://arxiv.org/html/2605.23032#bib.bib12)) characterized multilingual representations; Ponti et al. ([2019](https://arxiv.org/html/2605.23032#bib.bib18)), Pires et al. ([2019](https://arxiv.org/html/2605.23032#bib.bib15)), Chi et al. ([2020](https://arxiv.org/html/2605.23032#bib.bib28)), Beinborn et al. ([2019](https://arxiv.org/html/2605.23032#bib.bib29)), and Stanczak et al. ([2022](https://arxiv.org/html/2605.23032#bib.bib30)) addressed typological evaluation; and Cotterell et al. ([2018](https://arxiv.org/html/2605.23032#bib.bib31)) provided information-theoretic perspectives. Kauf et al. ([2024](https://arxiv.org/html/2605.23032#bib.bib45)) showed lexical-semantic content dominates brain–LLM similarity; Hosseini et al. ([2024b](https://arxiv.org/html/2605.23032#bib.bib43)) identified shared functional specialization. Key datasets include the LPP corpus (Li et al., [2022](https://arxiv.org/html/2605.23032#bib.bib70)) and extensions (Stehwien et al., [2020](https://arxiv.org/html/2605.23032#bib.bib72); Momenian et al., [2024](https://arxiv.org/html/2605.23032#bib.bib71)), MECO (Siegelman et al., [2022](https://arxiv.org/html/2605.23032#bib.bib73)), and cross-linguistic eye-tracking (Berzak et al., [2022](https://arxiv.org/html/2605.23032#bib.bib80)). Misra and Mahowald ([2024](https://arxiv.org/html/2605.23032#bib.bib52)), McCoy et al. ([2023](https://arxiv.org/html/2605.23032#bib.bib7)), and Millet et al. ([2022](https://arxiv.org/html/2605.23032#bib.bib74)) provide complementary perspectives on LLM learning and speech–brain alignment.

## 6 Discussion

#### Training dominance\.

The Baichuan2-7B results are perhaps the most informative finding: by reversing the alignment gradient, they demonstrate that the apparent English-best pattern in prior brain–LLM alignment work reflects the predominance of English-dominant models and English-language fMRI data (Tuckute et al., [2024a](https://arxiv.org/html/2605.23032#bib.bib37)), not a special relationship between English and the brain’s representational format. We note that no individual prior study explicitly claimed an English-specific advantage; rather, the literature’s near-exclusive focus on English left the question of cross-linguistic generalization empirically open. Our result settles one direction of that question: dominance, not English specifically, drives the gradient. Cross-linguistic brain–LLM studies must therefore control for training-language dominance before attributing alignment differences to typological or neurocognitive factors.

#### Universality debate\.

Our results support a nuanced position (Evans and Levinson, [2009](https://arxiv.org/html/2605.23032#bib.bib60); Malik-Moraleda et al., [2022](https://arxiv.org/html/2605.23032#bib.bib33)): brain–LLM alignment is universal in its existence (Prediction 1 confirmed) but variable in its character. The ROI-specific analysis shows this variation is structured: it is concentrated in syntactic regions, while semantic regions show near-universal alignment, consistent with “functional universality” beneath structural diversity (Croft, [2001](https://arxiv.org/html/2605.23032#bib.bib54); de Varda et al., [2025](https://arxiv.org/html/2605.23032#bib.bib46); Ulrich et al., [2025](https://arxiv.org/html/2605.23032#bib.bib83)). Specific predictions for future $\geq$10-language work are detailed in Appendix [N](https://arxiv.org/html/2605.23032#A14).

#### Implications for human language processing\.

The findings constrain three aspects of how the language network builds representations. First, the near-uniform PTL alignment across all model–language cells suggests that posterior-temporal semantic representations converge across radically different training distributions, consistent with this region encoding language-invariant meaning structure once form has been parsed (Kauf et al., [2024](https://arxiv.org/html/2605.23032#bib.bib45); Hickok and Poeppel, [2007](https://arxiv.org/html/2605.23032#bib.bib65)). Second, the IFG gradient implies that the structural component of comprehension is sensitive to whether the input distribution matches the listener’s typological profile, a pattern better fit by procedural skill aligned to specific structural regularities than by encoding of universal grammatical primitives. Third, the asymmetry between language-dominant and multilingual models is informative about *learning* rather than *processing*: biological and artificial learners both produce representations whose cross-linguistic transferability is bounded by the breadth of their input, regardless of how that breadth was acquired.

#### Syntax–semantics dissociation and CxG\.

The IFG gradient, $2.3\times$ steeper than that of PTL, is consistent with CxG’s claim that variation is localized in constructional representations (Goldberg, [2005](https://arxiv.org/html/2605.23032#bib.bib55)). However, at least three alternative accounts predict the same pattern: generic syntax–semantics dissociation (Mahowald et al., [2024](https://arxiv.org/html/2605.23032#bib.bib51)), cognitive control demands (Green and Abutalebi, [2013](https://arxiv.org/html/2605.23032#bib.bib69)), and signal-to-noise ratio differences (Kauf et al., [2024](https://arxiv.org/html/2605.23032#bib.bib45)). A genuinely CxG-specific test requires construction-level analysis (e.g., testing bǎ-constructions versus double-object constructions using the Universal Constructicon; Boas and Sag, [2012](https://arxiv.org/html/2605.23032#bib.bib78)). Extended discussion is provided in Appendix [O](https://arxiv.org/html/2605.23032#A15).

#### Tokenization and fertility\.

The finding that $\sim$60% of the Chinese layer shift is attributable to tokenization granularity (an upper bound, given the information-density confound) has implications for all cross-linguistic LLM evaluation (Rust et al., [2021](https://arxiv.org/html/2605.23032#bib.bib22); Beinborn et al., [2019](https://arxiv.org/html/2605.23032#bib.bib29); Stanczak et al., [2022](https://arxiv.org/html/2605.23032#bib.bib30)). Reporting fertility alongside encoding results should become routine, since tokenization artifacts can masquerade as cross-linguistic processing differences.

#### Surprisal theory\.

Prediction 4 is partially supported: per-language perplexity negatively correlates with encoding performance ($\rho=-0.62$, $p<0.01$; $p_{\text{Holm}}=0.03$). The effect is strongest for English-dominant models ($\rho=-0.78$) and weakest for multilingual models ($\rho=-0.41$), suggesting it partially reflects training-language match. The middle-layer advantage across all languages is consistent with the language network primarily encoding formal linguistic representations (Mahowald et al., [2024](https://arxiv.org/html/2605.23032#bib.bib51); Pasquiou et al., [2023](https://arxiv.org/html/2605.23032#bib.bib8); AlKhamissi et al., [2025](https://arxiv.org/html/2605.23032#bib.bib87)).
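A correlation of this kind can be sketched as follows, assuming $\rho$ denotes Spearman's rank correlation (the text here does not say which coefficient is used). The function names and the perplexity and encoding values are hypothetical, chosen only to show the computation.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spearman(xs, ys):
    """Spearman rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical values: higher perplexity paired with lower encoding score.
perplexity = [5.1, 7.3, 9.8, 12.4]
encoding = [0.81, 0.74, 0.66, 0.52]
print(spearman(perplexity, encoding))
```

Rank correlation is a reasonable choice when the number of model-language cells is small and the relationship need not be linear, though with few cells any such coefficient is imprecise.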

#### Typological distance\.

The typological distance effect ($\beta=-0.41$; cluster bootstrap $p=0.031$) provides preliminary quantitative evidence that formal typological metrics can predict brain encoding variation beyond training data effects, and that Grambank, WALS, and lang2vec agree closely on the relevant pairwise rankings (Ponti et al., [2019](https://arxiv.org/html/2605.23032#bib.bib18); Beinborn et al., [2019](https://arxiv.org/html/2605.23032#bib.bib29)). With three languages, this is a descriptive pattern; confirmation requires $\geq$10 languages.

## Conclusion

The apparent dominance of English in prior brain–LLM alignment results is not a property of English or of the brain: it is a property of training data. By comparing architecture- and scale-matched models that differ primarily in training-language composition, we show that Baichuan2-7B reverses the alignment gradient entirely, performing best for Chinese ($\tilde{r}=.85$) and far worse for English ($.59$) and French ($.54$). This reversal establishes training-language dominance as the primary driver of cross-linguistic alignment variation and reframes the field’s English-centric findings as reflecting data composition rather than neurocognitive privilege.

Beyond training dominance, three additional results shape how the field should approach multilingual brain–LLM comparisons. First, formal typological distance covaries with alignment degradation independently of training proportion, establishing a descriptive pattern that warrants confirmation across ten or more languages. Second, tokenization fertility accounts for approximately 60% of the Chinese rightward layer shift, revealing that what appears to be a cross-linguistic processing difference is substantially a tokenization artifact, a confound that affects any multilingual evaluation relying on subword representations. Third, the typological gradient is not uniform across the language network: syntax-associated regions (IFG) show a $2.3\times$ steeper gradient than lexico-semantic regions (PTL), consistent with syntax–semantics dissociation accounts and suggesting that cross-linguistic divergence is concentrated in structural rather than semantic processing.

Taken together, the results support a nuanced view of universality: biological and artificial language processors converge on shared semantic representations while diverging where typology demands language\-specific solutions\. Whether this extends to agglutinative, polysynthetic, and tonal languages remains open\.

#### Reproducibility\.

## Limitations

#### Language sample and statistical power\.

Three languages provide meaningful but limited coverage. The effective degrees of freedom for the distance predictor are $\approx$1 per model category, and the cluster bootstrap yields wider CIs than standard mixed-effects $p$-values suggest. Agglutinative (Turkish, Finnish), polysynthetic (Mohawk), and tonal non-Chinese (Yoruba) languages remain untested; confirmation requires $\geq$10 languages. The specific $\geq$10 target reflects the requirement that a language-level fixed effect of typological distance be reliably distinguished from sampling noise. With $k$ languages, the distance predictor has at most $k(k-1)/2$ unique pairwise values and yields $\approx k-3$ residual degrees of freedom after accounting for an intercept and two cross-language covariates (training-data proportion, fertility); achieving conventional power ($1-\beta=0.8$) for the cross-language slope observed here ($\beta\approx-0.4$) requires roughly $k\geq 10$. A fuller sample should also vary along writing-system dimensions (alphabetic, logographic, syllabic, abugida): orthographic granularity is what drives tokenizer fertility (Rust et al., [2021](https://arxiv.org/html/2605.23032#bib.bib22)), and treating typological distance as purely syntactic without conditioning on script confounds the gradient with subword-segmentation effects (§[4.3](https://arxiv.org/html/2605.23032#S4.SS3)).
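The two counting formulas in this paragraph are easy to evaluate for the current and target sample sizes. This sketch only evaluates $k(k-1)/2$ and $k-3$ as stated; it does not reproduce the power calculation, and the function names are hypothetical.

```python
from math import comb


def pairwise_distances(k):
    """Maximum number of unique pairwise distance values among k languages."""
    return comb(k, 2)


def residual_df(k, n_covariates=2):
    """Language-level residual df after an intercept and the covariates."""
    return k - 1 - n_covariates


for k in (3, 10):
    print(f"k={k}: pairs={pairwise_distances(k)}, residual df={residual_df(k)}")
```

For three languages this gives 3 pairwise values and no residual degrees of freedom under the stated accounting, against 45 and 7 for ten languages, which illustrates why the distance effect is treated as descriptive.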

#### Language $\times$ site confound.

Each language was collected at a different institution (Cornell, Jiangsu Normal, NeuroSpin). The noise ceiling partially controls for site-level signal quality, but residual site effects cannot be fully excluded. The Cantonese extension (Momenian et al., [2024](https://arxiv.org/html/2605.23032#bib.bib71)) could help, though its older adult sample (mean age 69) introduces an age confound.

#### Modality and architecture\.

The LPP uses auditory stimuli while LLMs process text, which may be particularly consequential for Chinese, where tone carries lexical information (Millet et al., [2022](https://arxiv.org/html/2605.23032#bib.bib74)). Additionally, while the 7B autoregressive comparison controls for architecture and scale, mBERT and XLM-R are bidirectional encoders that may inherently align differently with fMRI data.

#### Correlational evidence\.

Encoding models test representational, not computational, alignment (Antonello and Huth, [2023](https://arxiv.org/html/2605.23032#bib.bib39)). Causal methods are needed for computational claims. Baichuan2-7B’s $\sim$55% Chinese estimate is approximate; sensitivity analysis (§[4.2](https://arxiv.org/html/2605.23032#S4.SS2.SSS0.Px2)) shows stable results when the estimate is varied by $\pm$15 percentage points.

## Acknowledgments

We thank the anonymous CoNLL reviewers and the area chair for constructive feedback that substantially improved this paper, and the original *Le Petit Prince* multilingual fMRI corpus collectors (Li et al., [2022](https://arxiv.org/html/2605.23032#bib.bib70)) for making the data publicly available on OpenNeuro (ds003643). We also acknowledge the open-source releases of GPT-2, LLaMA-2, Baichuan2, mBERT, XLM-R, BLOOM, and Qwen2.5, which made this analysis possible.

## References

- AlKhamissi et al. (2025) From language to cognition: how LLMs outgrow the human language network. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng (Eds.), Suzhou, China, pp. 24321–24339. External Links: [Link](https://aclanthology.org/2025.emnlp-main.1237/), [Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1237), ISBN 979-8-89176-332-6.
- R. Antonello and A. Huth (2023) Predictive coding or just feature discovery? an alternative account of why language models fit brain data. Neurobiology of Language, pp. 1–16. External Links: ISSN 2641-4368, [Document](https://dx.doi.org/10.1162/nol%5Fa%5F00087).
- R. J. Antonello, A. R. Vaidya, and A. Huth (2023) Scaling laws for language encoding models in fMRI. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.).
- L. Beinborn, S. Abnar, and R. Choenni (2019) Robust evaluation of language-brain encoding experiments. In Computational Linguistics and Intelligent Text Processing - 20th International Conference, CICLing 2019, La Rochelle, France, April 7-13, 2019, Revised Selected Papers, Part I, A. F. Gelbukh (Ed.), Lecture Notes in Computer Science, Vol. 13451, pp. 44–61. External Links: [Document](https://dx.doi.org/10.1007/978-3-031-24337-0%5F4).
- B. K. Bergen and N. Chang (2005) Embodied construction grammar in simulation-based language understanding. In Construction Grammars, pp. 147–190. External Links: ISBN 9789027294708, ISSN 1573-594X, [Document](https://dx.doi.org/10.1075/cal.3.08ber).
- Y. Berzak, C. Nakamura, A. Smith, E. Weng, B. Katz, S. Flynn, and R. Levy (2022) CELER: a 365-participant corpus of eye movements in L1 and L2 English reading. Open Mind 6, pp. 41–50. External Links: ISSN 2470-2986, [Document](https://dx.doi.org/10.1162/opmi%5Fa%5F00054).
- I. A. Blank and E. Fedorenko (2017) Domain-general brain regions do not track linguistic input as closely as language-selective regions. The Journal of Neuroscience 37(41), pp. 9999–10011. External Links: [Document](https://dx.doi.org/10.1523/JNEUROSCI.3642-16.2017), [Link](https://doi.org/10.1523/JNEUROSCI.3642-16.2017).
- H. C. Boas and I. A. Sag (2012) Sign-based construction grammar. CSLI Lecture Notes. External Links: ISBN 978-1-57586-628-4.
- C. Caucheteux, A. Gramfort, and J. King (2023) Evidence of a predictive coding hierarchy in the human brain listening to speech. Nature Human Behaviour 7(3), pp. 430–441. External Links: ISSN 2397-3374, [Document](https://dx.doi.org/10.1038/s41562-022-01516-2).
- C. Caucheteux and J. King (2022) Brains and algorithms partially converge in natural language processing. Communications Biology 5(1). External Links: ISSN 2399-3642, [Document](https://dx.doi.org/10.1038/s42003-022-03036-1).
- E. A. Chi, J. Hewitt, and C. D. Manning (2020) Finding universal grammatical relations in multilingual BERT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault (Eds.), pp. 5564–5577. External Links: [Document](https://dx.doi.org/10.18653/V1/2020.ACL-MAIN.493).
- A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov (2020a) Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault (Eds.), pp. 8440–8451. External Links: [Document](https://dx.doi.org/10.18653/V1/2020.ACL-MAIN.747).
- A. Conneau and G. Lample (2019) Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 7057–7067.
- A. Conneau, S. Wu, H. Li, L. Zettlemoyer, and V. Stoyanov (2020b) Emerging cross-lingual structure in pretrained language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault (Eds.), pp. 6022–6034. External Links: [Document](https://dx.doi.org/10.18653/V1/2020.ACL-MAIN.536).
- R. Cotterell, S. J. Mielke, J. Eisner, and B. Roark (2018) Are all languages equally hard to language-model? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), M. A. Walker, H. Ji, and A. Stent (Eds.), pp. 536–541. External Links: [Document](https://dx.doi.org/10.18653/V1/N18-2085).
- W. Croft (2001) Radical construction grammar: syntactic theory in typological perspective. Oxford University Press, Oxford. External Links: ISBN 9780191708091, [Document](https://dx.doi.org/10.1093/acprof%3Aoso/9780198299554.001.0001).
- A. G. de Varda, S. Malik-Moraleda, G. Tuckute, and E. Fedorenko (2025) Multilingual computational models capture a shared meaning component in brain responses across 21 languages. bioRxiv preprint 2025.02.01.636044. External Links: [Document](https://dx.doi.org/10.1101/2025.02.01.636044), [Link](https://www.biorxiv.org/content/early/2025/11/18/2025.02.01.636044).
- J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio (Eds.), pp. 4171–4186. External Links: [Document](https://dx.doi.org/10.18653/V1/N19-1423).
- M. S. Dryer and M. Haspelmath (2013) In WALS Online (v2020.4) [Data set]. External Links: [Document](https://dx.doi.org/10.5281/zenodo.13950591), [Link](https://wals.info/).
- N. Evans and S. C. Levinson (2009) The myth of language universals: language diversity and its importance for cognitive science. Behavioral and Brain Sciences 32(5), pp. 429–448. External Links: ISSN 1469-1825, [Document](https://dx.doi.org/10.1017/s0140525x0999094x).
- E. Fedorenko, M. K. Behr, and N. Kanwisher (2011) Functional specificity for high-level linguistic processing in the human brain. Proceedings of the National Academy of Sciences 108(39), pp. 16428–16433. External Links: ISSN 1091-6490, [Document](https://dx.doi.org/10.1073/pnas.1112937108).
- E. Fedorenko, A. A. Ivanova, and T. I. Regev (2024) The language network as a natural kind within the broader landscape of the human brain. Nature Reviews Neuroscience 25(5), pp. 289–312. External Links: ISSN 1471-0048, [Document](https://dx.doi.org/10.1038/s41583-024-00802-4).
- S. L. Frank, L. J. Otten, G. Galli, and G. Vigliocco (2015) The ERP response to the amount of information conveyed by words in sentences. Brain and Language 140, pp. 1–11. External Links: ISSN 0093-934X, [Document](https://dx.doi.org/10.1016/j.bandl.2014.10.006).
- A. E. Goldberg (2024) Usage-based constructionist approaches and large language models. Constructions and Frames 16(2), pp. 220–254. External Links: [Document](https://dx.doi.org/10.1075/cf.23017.gol).
- A. Goldberg (2005) Constructions at work: the nature of generalization in language. Oxford University Press, Oxford. External Links: ISBN 9780191708428, [Document](https://dx.doi.org/10.1093/acprof%3Aoso/9780199268511.001.0001).
- A. Goldstein, E. Ham, M. Schain, S. A. Nastase, B. Aubrey, Z. Zada, A. Grinstein-Dabush, H. Gazula, A. Feder, W. Doyle, S. Devore, P. Dugan, D. Friedman, M. Brenner, A. Hassidim, Y. Matias, O. Devinsky, N. Siegelman, A. Flinker, O. Levy, R. Reichart, and U. Hasson (2025) Temporal structure of natural language processing in the human brain corresponds to layered hierarchy of large language models. Nature Communications 16. External Links: [Document](https://dx.doi.org/10.1038/s41467-025-65518-0), ISSN 20411723.
- A. Goldstein, Z. Zada, E. Buchnik, M. Schain, A. Price, B. Aubrey, S. A. Nastase, A. Feder, D. Emanuel, A. Cohen, A. Jansen, H. Gazula, G. Choe, A. Rao, C. Kim, C. Casto, L. Fanda, W. Doyle, D. Friedman, P. Dugan, L. Melloni, R. Reichart, S. Devore, A. Flinker, L. Hasenfratz, O. Levy, A. Hassidim, M. Brenner, Y. Matias, K. A. Norman, O. Devinsky, and U. Hasson (2022) Shared computational principles for language processing in humans and deep language models. Nature Neuroscience 25(3), pp. 369–380. External Links: ISSN 1546-1726, [Document](https://dx.doi.org/10.1038/s41593-022-01026-4).
- D\. W\. Green and J\. Abutalebi \(2013\)Language control in bilinguals: the adaptive control hypothesis\.Journal of Cognitive Psychology25\(5\),pp\. 515–530\.External Links:ISSN 2044\-592X,[Document](https://dx.doi.org/10.1080/20445911.2013.796377)Cited by:[§O\.1](https://arxiv.org/html/2605.23032#A15.SS1.SSS0.Px2.p1.1),[item 4](https://arxiv.org/html/2605.23032#S1.I1.i4.p1.1),[§6](https://arxiv.org/html/2605.23032#S6.SS0.SSS0.Px4.p1.1)\.
- J\. Hale \(2001\)A probabilistic earley parser as a psycholinguistic model\.InLanguage Technologies 2001: The Second Meeting of the North American Chapter of the Association for Computational Linguistics, NAACL 2001, Pittsburgh, PA, USA, June 2\-7, 2001,Cited by:[§2\.3](https://arxiv.org/html/2605.23032#S2.SS3.p1.1)\.
- M\. Heilbron, K\. Armeni, J\. Schoffelen, P\. Hagoort, and F\. P\. de Lange \(2022\)A hierarchy of linguistic predictions during natural language comprehension\.Proceedings of the National Academy of Sciences119\(32\),pp\. e2201968119\.External Links:[Document](https://dx.doi.org/10.1073/pnas.2201968119)Cited by:[§2\.2](https://arxiv.org/html/2605.23032#S2.SS2.p1.1)\.
- G\. Hickok and D\. Poeppel \(2007\)The cortical organization of speech processing\.Nature Reviews Neuroscience8\(5\),pp\. 393–402\.External Links:ISSN 1471\-0048,[Document](https://dx.doi.org/10.1038/nrn2113)Cited by:[§2\.4](https://arxiv.org/html/2605.23032#S2.SS4.p3.1),[§6](https://arxiv.org/html/2605.23032#S6.SS0.SSS0.Px3.p1.1)\.
- E\. A\. Hosseini, M\. Schrimpf, Y\. Zhang, S\. Bowman, N\. Zaslavsky, and E\. Fedorenko \(2024a\)Artificial neural network language models predict human brain responses to language even after a developmentally realistic amount of training\.Neurobiology of Language5\(1\),pp\. 43–63\.External Links:[Document](https://dx.doi.org/10.1162/nol%5Fa%5F00137)Cited by:[§2\.2](https://arxiv.org/html/2605.23032#S2.SS2.p1.1)\.
- E\. A\. Hosseini, M\. Schrimpf, Y\. Zhang, S\. Bowman, N\. Zaslavsky, and E\. Fedorenko \(2024b\)Artificial neural network language models predict human brain responses to language even after a developmentally realistic amount of training\.Neurobiology of Language5\(1\),pp\. 43–63\.External Links:[Document](https://dx.doi.org/10.1162/nol%5Fa%5F00137)Cited by:[§5](https://arxiv.org/html/2605.23032#S5.p1.1)\.
- A\. G\. Huth, W\. A\. de Heer, T\. L\. Griffiths, F\. E\. Theunissen, and J\. L\. Gallant \(2016\)Natural speech reveals the semantic maps that tile human cerebral cortex\.Nature532\(7600\),pp\. 453–458\.External Links:[Document](https://dx.doi.org/10.1038/NATURE17637)Cited by:[§3\.3](https://arxiv.org/html/2605.23032#S3.SS3.p1.4),[§5](https://arxiv.org/html/2605.23032#S5.p1.1)\.
- S\. Jain and A\. Huth \(2018\)Incorporating context into language encoding models for fmri\.InAdvances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3\-8, 2018, Montréal, Canada,S\. Bengio, H\. M\. Wallach, H\. Larochelle, K\. Grauman, N\. Cesa\-Bianchi, and R\. Garnett \(Eds\.\),pp\. 6629–6638\.Cited by:[§3\.3](https://arxiv.org/html/2605.23032#S3.SS3.p1.4),[§5](https://arxiv.org/html/2605.23032#S5.p1.1)\.
- C\. Kauf, G\. Tuckute, R\. Levy, J\. Andreas, and E\. Fedorenko \(2024\)Lexical\-semantic content, not syntactic structure, is the main contributor to ANN\-brain similarity of fMRI responses in the language network\.Neurobiology of Language5\(1\),pp\. 7–42\.External Links:[Document](https://dx.doi.org/10.1162/nol%5Fa%5F00116)Cited by:[§O\.1](https://arxiv.org/html/2605.23032#A15.SS1.SSS0.Px3.p1.1),[Appendix C](https://arxiv.org/html/2605.23032#A3.p1.1),[§5](https://arxiv.org/html/2605.23032#S5.p1.1),[§6](https://arxiv.org/html/2605.23032#S6.SS0.SSS0.Px3.p1.1),[§6](https://arxiv.org/html/2605.23032#S6.SS0.SSS0.Px4.p1.1)\.
- Y\. Kwon, S\. Zhu, F\. Bianchi, K\. Zhou, and J\. Zou \(2025\)ReasonIF: large reasoning models fail to follow instructions during reasoning\.arXiv preprintarXiv\.2510\.15211\.External Links:[Link](https://arxiv.org/arXiv.2510.15211)Cited by:[§2\.4](https://arxiv.org/html/2605.23032#S2.SS4.p1.1),[§2\.4](https://arxiv.org/html/2605.23032#S2.SS4.p2.1)\.
- A\. Lage\-Castellanos, G\. Valente, E\. Formisano, and F\. D\. Martino \(2019\)Methods for computing the maximum performance of computational models of fmri responses\.PLoS Comput\. Biol\.15\(3\)\.External Links:[Document](https://dx.doi.org/10.1371/JOURNAL.PCBI.1006397)Cited by:[§3\.3](https://arxiv.org/html/2605.23032#S3.SS3.SSS0.Px1.p1.3)\.
- R\. Levy \(2008\)Expectation\-based syntactic comprehension\.Cognition106\(3\),pp\. 1126–1177\.External Links:[Document](https://dx.doi.org/10.1016/j.cognition.2007.05.006)Cited by:[§2\.3](https://arxiv.org/html/2605.23032#S2.SS3.p1.1)\.
- J\. Li, S\. Bhattasali, S\. Zhang, B\. Franzluebbers, W\. Luh, R\. N\. Spreng, J\. R\. Brennan, Y\. Yang, C\. Pallier, and J\. Hale \(2022\)Le petit prince multilingual naturalistic fmri corpus\.Scientific Data9\(1\)\.External Links:ISSN 2052\-4463,[Document](https://dx.doi.org/10.1038/s41597-022-01625-7)Cited by:[Appendix P](https://arxiv.org/html/2605.23032#A16.p1.1),[§3\.1](https://arxiv.org/html/2605.23032#S3.SS1.p1.8),[§5](https://arxiv.org/html/2605.23032#S5.p1.1),[Acknowledgments](https://arxiv.org/html/2605.23032#Sx3.p1.1)\.
- J\. Libovický, R\. Rosa, and A\. Fraser \(2020\)On the language neutrality of pre\-trained multilingual representations\.InFindings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16\-20 November 2020,T\. Cohn, Y\. He, and Y\. Liu \(Eds\.\),Findings of ACL, Vol\.EMNLP 2020,pp\. 1663–1674\.External Links:[Document](https://dx.doi.org/10.18653/V1/2020.FINDINGS-EMNLP.150)Cited by:[§5](https://arxiv.org/html/2605.23032#S5.p1.1)\.
- P\. Littell, D\. R\. Mortensen, K\. Lin, K\. Kairis, C\. Turner, and L\. S\. Levin \(2017\)URIEL and lang2vec: representing languages as typological, geographical, and phylogenetic vectors\.InProceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3\-7, 2017, Volume 2: Short Papers,M\. Lapata, P\. Blunsom, and A\. Koller \(Eds\.\),pp\. 8–14\.External Links:[Document](https://dx.doi.org/10.18653/V1/E17-2002)Cited by:[§3\.4](https://arxiv.org/html/2605.23032#S3.SS4.SSS0.Px1.p1.1)\.
- K\. Mahowald, A\. A\. Ivanova, I\. A\. Blank, N\. Kanwisher, J\. B\. Tenenbaum, and E\. Fedorenko \(2024\)Dissociating language and thought in large language models\.Trends in Cognitive Sciences28\(6\),pp\. 517–540\.External Links:[Document](https://dx.doi.org/10.1016/j.tics.2024.01.011)Cited by:[§O\.1](https://arxiv.org/html/2605.23032#A15.SS1.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.23032#S1.p4.1),[§6](https://arxiv.org/html/2605.23032#S6.SS0.SSS0.Px4.p1.1),[§6](https://arxiv.org/html/2605.23032#S6.SS0.SSS0.Px6.p1.5)\.
- S\. Malik\-Moraleda, D\. Ayyash, J\. Gallée, J\. Affourtit, M\. Hoffmann, Z\. Mineroff, O\. Jouravlev, and E\. Fedorenko \(2022\)An investigation across 45 languages and 12 language families reveals a universal language network\.Nature Neuroscience25\(8\),pp\. 1014–1019\.External Links:ISSN 1546\-1726,[Document](https://dx.doi.org/10.1038/s41593-022-01114-5)Cited by:[§1](https://arxiv.org/html/2605.23032#S1.SS0.SSS0.Px1.p1.2),[§1](https://arxiv.org/html/2605.23032#S1.p1.1),[§2\.1](https://arxiv.org/html/2605.23032#S2.SS1.p1.1),[§6](https://arxiv.org/html/2605.23032#S6.SS0.SSS0.Px2.p1.1)\.
- S\. Malik\-Moraleda, O\. Jouravlev, M\. Taliaferro, Z\. Mineroff, T\. Cucu, K\. Mahowald, I\. A\. Blank, and E\. Fedorenko \(2024\)Functional characterization of the language network of polyglots and hyperpolyglots with precision fMRI\.Cerebral Cortex34\(3\)\.External Links:[Document](https://dx.doi.org/10.1093/cercor/bhae049)Cited by:[§2\.1](https://arxiv.org/html/2605.23032#S2.SS1.p1.1)\.
- R\. T\. McCoy, S\. Yao, D\. Friedman, M\. Hardy, and T\. L\. Griffiths \(2023\)Embers of autoregression: understanding large language models through the problem they are trained to solve\.arXiv preprintarXiv\.2309\.13638\.External Links:[Document](https://dx.doi.org/10.48550/ARXIV.2309.13638)Cited by:[§2\.5](https://arxiv.org/html/2605.23032#S2.SS5.p1.1),[§5](https://arxiv.org/html/2605.23032#S5.p1.1)\.
- G\. Merlin and M\. Toneva \(2024\)Language models and brains align due to more than next\-word prediction and word\-level information\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12\-16, 2024,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),pp\. 18431–18454\.External Links:[Document](https://dx.doi.org/10.18653/V1/2024.EMNLP-MAIN.1024)Cited by:[§5](https://arxiv.org/html/2605.23032#S5.p1.1)\.
- J\. A\. Michaelov, M\. D\. Bardolph, C\. K\. Van Petten, B\. K\. Bergen, and S\. Coulson \(2024\)Strong prediction: language model surprisal explains multiple n400 effects\.Neurobiology of Language5\(1\),pp\. 107–135\.External Links:ISSN 2641\-4368,[Document](https://dx.doi.org/10.1162/nol%5Fa%5F00105)Cited by:[§2\.3](https://arxiv.org/html/2605.23032#S2.SS3.p1.1)\.
- J\. Millet, C\. Caucheteux, P\. Orhan, Y\. Boubenec, A\. Gramfort, E\. Dunbar, C\. Pallier, and J\. King \(2022\)Toward a realistic model of speech processing in the brain with self\-supervised learning\.InAdvances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 \- December 9, 2022,S\. Koyejo, S\. Mohamed, A\. Agarwal, D\. Belgrave, K\. Cho, and A\. Oh \(Eds\.\),Cited by:[§5](https://arxiv.org/html/2605.23032#S5.p1.1),[Modality and architecture\.](https://arxiv.org/html/2605.23032#Sx2.SS0.SSS0.Px3.p1.1)\.
- K\. Misra and K\. Mahowald \(2024\)Language models learn rare phenomena from less rare phenomena: the case of the missing aanns\.arXiv preprintarXiv\.2403\.19827\.External Links:[Document](https://dx.doi.org/10.48550/ARXIV.2403.19827)Cited by:[§5](https://arxiv.org/html/2605.23032#S5.p1.1)\.
- T\. M\. Mitchell, S\. V\. Shinkareva, A\. Carlson, K\. Chang, V\. L\. Malave, R\. A\. Mason, and M\. A\. Just \(2008\)Predicting human brain activity associated with the meanings of nouns\.Science320\(5880\),pp\. 1191–1195\.External Links:ISSN 1095\-9203,[Document](https://dx.doi.org/10.1126/science.1152876)Cited by:[§3\.3](https://arxiv.org/html/2605.23032#S3.SS3.p1.4),[§5](https://arxiv.org/html/2605.23032#S5.p1.1)\.
- M\. Momenian, Z\. Ma, S\. Wu, C\. Wang, J\. Brennan, J\. Hale, L\. Meyer, and J\. Li \(2024\)Le Petit Prince Hong Kong \(LPPHK\): naturalistic fMRI and EEG data from older Cantonese speakers\.Scientific Data11,pp\. 992\.External Links:[Document](https://dx.doi.org/10.1038/s41597-024-03745-8)Cited by:[§5](https://arxiv.org/html/2605.23032#S5.p1.1),[Language × site confound\.](https://arxiv.org/html/2605.23032#Sx2.SS0.SSS0.Px2.p1.1)\.
- B\. Muller, Y\. Elazar, B\. Sagot, and D\. Seddah \(2021\)First align, then predict: understanding the cross\-lingual ability of multilingual BERT\.InProceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2021, Online, April 19 \- 23, 2021,P\. Merlo, J\. Tiedemann, and R\. Tsarfaty \(Eds\.\),pp\. 2214–2231\.External Links:[Document](https://dx.doi.org/10.18653/V1/2021.EACL-MAIN.189)Cited by:[§5](https://arxiv.org/html/2605.23032#S5.p1.1)\.
- S\. A\. Nastase, Y\. Liu, H\. Hillman, A\. Zadbood, L\. Hasenfratz, N\. Keshavarzian, J\. Chen, C\. J\. Honey, Y\. Yeshurun, M\. Regev, M\. Nguyen, C\. H\. C\. Chang, C\. Baldassano, O\. Lositsky, E\. Simony, M\. A\. Chow, Y\. C\. Leong, P\. P\. Brooks, E\. Micciche, G\. Choe, A\. Goldstein, T\. Vanderwal, Y\. O\. Halchenko, K\. A\. Norman, and U\. Hasson \(2021\)The “Narratives” fMRI dataset for evaluating models of naturalistic language comprehension\.Scientific Data8,pp\. 250\.External Links:[Document](https://dx.doi.org/10.1038/s41597-021-01033-3)Cited by:[§3\.3](https://arxiv.org/html/2605.23032#S3.SS3.SSS0.Px1.p1.3)\.
- B\. Oh and W\. Schuler \(2023\)Why does surprisal from larger transformer\-based language models provide a poorer fit to human reading times?\.Trans\. Assoc\. Comput\. Linguistics11,pp\. 336–350\.External Links:[Document](https://dx.doi.org/10.1162/TACL%5FA%5F00548)Cited by:[§2\.3](https://arxiv.org/html/2605.23032#S2.SS3.p1.1)\.
- A\. Pasquiou, Y\. Lakretz, B\. Thirion, and C\. Pallier \(2023\)Information\-restricted neural language models reveal different brain regions’ sensitivity to semantics, syntax and context\.arXiv preprintarXiv\.2302\.14389\.External Links:[Document](https://dx.doi.org/10.48550/ARXIV.2302.14389)Cited by:[§2\.3](https://arxiv.org/html/2605.23032#S2.SS3.p1.1),[§6](https://arxiv.org/html/2605.23032#S6.SS0.SSS0.Px6.p1.5)\.
- D\. Perani and J\. Abutalebi \(2005\)The neural basis of first and second language processing\.Current Opinion in Neurobiology15\(2\),pp\. 202–206\.External Links:ISSN 0959\-4388,[Document](https://dx.doi.org/10.1016/j.conb.2005.03.007)Cited by:[§2\.1](https://arxiv.org/html/2605.23032#S2.SS1.p1.1)\.
- T\. Pires, E\. Schlinger, and D\. Garrette \(2019\)How multilingual is multilingual bert?\.InProceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28\- August 2, 2019, Volume 1: Long Papers,A\. Korhonen, D\. R\. Traum, and L\. Màrquez \(Eds\.\),pp\. 4996–5001\.External Links:[Document](https://dx.doi.org/10.18653/V1/P19-1493)Cited by:[§5](https://arxiv.org/html/2605.23032#S5.p1.1)\.
- E\. M\. Ponti, H\. O’Horan, Y\. Berzak, I\. Vulic, R\. Reichart, T\. Poibeau, E\. Shutova, and A\. Korhonen \(2019\)Modeling language variation and universals: A survey on typological linguistics for natural language processing\.Comput\. Linguistics45\(3\),pp\. 559–601\.External Links:[Document](https://dx.doi.org/10.1162/COLI%5FA%5F00357)Cited by:[§5](https://arxiv.org/html/2605.23032#S5.p1.1),[§6](https://arxiv.org/html/2605.23032#S6.SS0.SSS0.Px7.p1.3)\.
- S\. Rakshit and A\. E\. Goldberg \(2025\)Meaning\-infused grammar: gradient acceptability shapes the geometric representations of constructions in LLMs\.InProceedings of the Second International Workshop on Construction Grammars and NLP,C\. Bonial, M\. Torgbi, L\. Weissweiler, A\. Blodgett, K\. Beuls, P\. Van Eecke, and H\. Tayyar Madabushi \(Eds\.\),Düsseldorf, Germany,pp\. 151–157\.External Links:[Link](https://aclanthology.org/2025.cxgsnlp-1.15/),ISBN 979\-8\-89176\-318\-0Cited by:[§2\.4](https://arxiv.org/html/2605.23032#S2.SS4.p1.1),[§2\.4](https://arxiv.org/html/2605.23032#S2.SS4.p2.1)\.
- P\. Rust, J\. Pfeiffer, I\. Vulic, S\. Ruder, and I\. Gurevych \(2021\)How good is your tokenizer? on the monolingual performance of multilingual language models\.InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, \(Volume 1: Long Papers\), Virtual Event, August 1\-6, 2021,C\. Zong, F\. Xia, W\. Li, and R\. Navigli \(Eds\.\),pp\. 3118–3135\.External Links:[Document](https://dx.doi.org/10.18653/V1/2021.ACL-LONG.243)Cited by:[item 3](https://arxiv.org/html/2605.23032#S1.I1.i3.p1.5),[§3\.4](https://arxiv.org/html/2605.23032#S3.SS4.SSS0.Px2.p1.1),[§6](https://arxiv.org/html/2605.23032#S6.SS0.SSS0.Px5.p1.1),[Language sample and statistical power\.](https://arxiv.org/html/2605.23032#Sx2.SS0.SSS0.Px1.p1.10)\.
- T\. L\. Scao, A\. Fan, C\. Akiki, E\. Pavlick, S\. Ilic, D\. Hesslow, R\. Castagné, A\. S\. Luccioni, F\. Yvon, M\. Gallé, J\. Tow, A\. M\. Rush, S\. Biderman, A\. Webson, P\. S\. Ammanamanchi, T\. Wang, B\. Sagot, N\. Muennighoff, A\. V\. del Moral, O\. Ruwase, R\. Bawden, S\. Bekman, A\. McMillan\-Major, I\. Beltagy, H\. Nguyen, L\. Saulnier, S\. Tan, P\. O\. Suarez, V\. Sanh, H\. Laurençon, Y\. Jernite, J\. Launay, M\. Mitchell, C\. Raffel, A\. Gokaslan, A\. Simhi, A\. Soroa, A\. F\. Aji, A\. Alfassy, A\. Rogers, A\. K\. Nitzav, C\. Xu, C\. Mou, C\. Emezue, C\. Klamm, C\. Leong, D\. van Strien, D\. I\. Adelani, and et al\. \(2022\)BLOOM: A 176b\-parameter open\-access multilingual language model\.arXiv preprintarXiv\.2211\.05100\.External Links:[Document](https://dx.doi.org/10.48550/ARXIV.2211.05100)Cited by:[§3\.2](https://arxiv.org/html/2605.23032#S3.SS2.p1.2)\.
- M\. Schrimpf, I\. A\. Blank, G\. Tuckute, C\. Kauf, E\. A\. Hosseini, N\. Kanwisher, J\. B\. Tenenbaum, and E\. Fedorenko \(2021\)The neural architecture of language: integrative modeling converges on predictive processing\.Proceedings of the National Academy of Sciences118\(45\),pp\. e2105646118\.External Links:[Document](https://dx.doi.org/10.1073/pnas.2105646118)Cited by:[§1](https://arxiv.org/html/2605.23032#S1.p2.1),[§2\.2](https://arxiv.org/html/2605.23032#S2.SS2.p1.1),[§5](https://arxiv.org/html/2605.23032#S5.p1.1)\.
- C\. Shain, C\. Meister, T\. Pimentel, R\. Cotterell, and R\. Levy \(2024\)Large\-scale evidence for logarithmic effects of word predictability on reading time\.Proceedings of the National Academy of Sciences121\(10\),pp\. e2307876121\.External Links:[Document](https://dx.doi.org/10.1073/pnas.2307876121)Cited by:[§2\.3](https://arxiv.org/html/2605.23032#S2.SS3.p1.1)\.
- N\. Siegelman, S\. Schroeder, C\. Acartürk, H\. Ahn, S\. Alexeeva, S\. Amenta, R\. Bertram, R\. Bonandrini, M\. Brysbaert, D\. Chernova, S\. M\. Da Fonseca, N\. Dirix, W\. Duyck, A\. Fella, R\. Frost, C\. A\. Gattei, A\. Kalaitzi, N\. Kwon, K\. Lõo, M\. Marelli, T\. C\. Papadopoulos, A\. Protopapas, S\. Savo, D\. E\. Shalom, N\. Slioussar, R\. Stein, L\. Sui, A\. Taboh, V\. Tønnesen, K\. A\. Usal, and V\. Kuperman \(2022\)Expanding horizons of cross\-linguistic research on reading: the multilingual eye\-movement corpus \(meco\)\.Behavior Research Methods54\(6\),pp\. 2843–2863\.External Links:ISSN 1554\-3528,[Document](https://dx.doi.org/10.3758/s13428-021-01772-6)Cited by:[§5](https://arxiv.org/html/2605.23032#S5.p1.1)\.
- J\. Singh, B\. McCann, R\. Socher, and C\. Xiong \(2019\)BERT is not an interlingua and the bias of tokenization\.InProceedings of the 2nd Workshop on Deep Learning Approaches for Low\-Resource NLP, DeepLo@EMNLP\-IJCNLP 2019, Hong Kong, China, November 3, 2019,C\. Cherry, G\. Durrett, G\. F\. Foster, R\. Haffari, S\. Khadivi, N\. Peng, X\. Ren, and S\. Swayamdipta \(Eds\.\),pp\. 47–55\.External Links:[Document](https://dx.doi.org/10.18653/V1/D19-6106)Cited by:[§3\.4](https://arxiv.org/html/2605.23032#S3.SS4.SSS0.Px2.p1.1)\.
- H\. Skirgård, H\. J\. Haynie, D\. E\. Blasi, H\. Hammarström, J\. Collins, J\. J\. Latarche, J\. Lesage, T\. Weber, A\. Witzlack\-Makarevich, S\. Passmore, A\. Chira, L\. Maurits, R\. Dinnage, M\. Dunn, G\. Reesink, R\. Singer, C\. Bowern, P\. Epps, J\. Hill, O\. Vesakoski, M\. Robbeets, N\. K\. Abbas, D\. Auer, N\. A\. Bakker, G\. Barbos, R\. D\. Borges, S\. Danielsen, L\. Dorenbusch, E\. Dorn, J\. Elliott, G\. Falcone, J\. Fischer, Y\. Ghanggo Ate, H\. Gibson, H\. Göbel, J\. A\. Goodall, V\. Gruner, A\. Harvey, R\. Hayes, L\. Heer, R\. E\. Herrera Miranda, N\. Hübler, B\. Huntington\-Rainey, J\. K\. Ivani, M\. Johns, E\. Just, E\. Kashima, C\. Kipf, J\. V\. Klingenberg, N\. König, A\. Koti, R\. G\. A\. Kowalik, O\. Krasnoukhova, N\. L\. M\. Lindvall, M\. Lorenzen, H\. Lutzenberger, T\. R\. A\. Martins, C\. Mata German, S\. van der Meer, J\. Montoya Samamé, M\. Müller, S\. Muradoglu, K\. Neely, J\. Nickel, M\. Norvik, C\. A\. Oluoch, J\. Peacock, I\. O\. C\. Pearey, N\. Peck, S\. Petit, S\. Pieper, M\. Poblete, D\. Prestipino, L\. Raabe, A\. Raja, J\. Reimringer, S\. C\. Rey, J\. Rizaew, E\. Ruppert, K\. K\. Salmon, J\. Sammet, R\. Schembri, L\. Schlabbach, F\. W\. P\. Schmidt, A\. Skilton, W\. D\. Smith, H\. de Sousa, K\. Sverredal, D\. Valle, J\. Vera, J\. Voß, T\. Witte, H\. Wu, S\. Yam, J\. Ye, M\. Yong, T\. Yuditha, R\. Zariquiey, R\. Forkel, N\. Evans, S\. C\. Levinson, M\. Haspelmath, S\. J\. Greenhill, Q\. D\. Atkinson, and R\. D\. Gray \(2023\)Grambank reveals the importance of genealogical constraints on linguistic diversity and highlights the impact of language loss\.Science Advances9\(16\)\.External Links:ISSN 2375\-2548,[Document](https://dx.doi.org/10.1126/sciadv.adg6175)Cited by:[§3\.4](https://arxiv.org/html/2605.23032#S3.SS4.SSS0.Px1.p1.1)\.
- T\. A\. B\. Snijders and R\. J\. Bosker \(2012\)Multilevel analysis: an introduction to basic and advanced multilevel modeling\.2nd edition,SAGE,London\.External Links:ISBN 978\-1\-84920\-201\-5Cited by:[§3\.5](https://arxiv.org/html/2605.23032#S3.SS5.SSS0.Px1.p1.3)\.
- K\. Stanczak, E\. M\. Ponti, L\. T\. Hennigen, R\. Cotterell, and I\. Augenstein \(2022\)Same neurons, different languages: probing morphosyntax in multilingual pre\-trained models\.InProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10\-15, 2022,M\. Carpuat, M\. de Marneffe, and I\. V\. M\. Ruíz \(Eds\.\),pp\. 1589–1598\.External Links:[Document](https://dx.doi.org/10.18653/V1/2022.NAACL-MAIN.114)Cited by:[§5](https://arxiv.org/html/2605.23032#S5.p1.1),[§6](https://arxiv.org/html/2605.23032#S6.SS0.SSS0.Px5.p1.1)\.
- S\. Stehwien, L\. Henke, J\. Hale, J\. Brennan, and L\. Meyer \(2020\)The little prince in 26 languages: towards a multilingual neuro\-cognitive corpus\.InProceedings of the Second Workshop on Linguistic and Neurocognitive Resources,E\. Chersoni, B\. Devereux, and C\. Huang \(Eds\.\),Marseille, France,pp\. 43–49\(eng\)\.External Links:[Link](https://aclanthology.org/2020.lincr-1.6/),ISBN 979\-10\-95546\-52\-8Cited by:[§5](https://arxiv.org/html/2605.23032#S5.p1.1)\.
- J\. Tang, A\. LeBel, S\. Jain, and A\. G\. Huth \(2023\)Semantic reconstruction of continuous language from non\-invasive brain recordings\.Nature Neuroscience26\(5\),pp\. 858–866\.External Links:[Document](https://dx.doi.org/10.1038/s41593-023-01304-9)Cited by:[§5](https://arxiv.org/html/2605.23032#S5.p1.1)\.
- M\. Toneva and L\. Wehbe \(2019\)Interpreting and improving natural\-language processing \(in machines\) with natural language\-processing \(in the brain\)\.InAdvances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8\-14, 2019, Vancouver, BC, Canada,H\. M\. Wallach, H\. Larochelle, A\. Beygelzimer, F\. d’Alché\-Buc, E\. B\. Fox, and R\. Garnett \(Eds\.\),pp\. 14928–14938\.Cited by:[§5](https://arxiv.org/html/2605.23032#S5.p1.1)\.
- H\. Touvron, T\. Lavril, G\. Izacard, X\. Martinet, M\. Lachaux, T\. Lacroix, B\. Rozière, N\. Goyal, E\. Hambro, F\. Azhar, A\. Rodriguez, A\. Joulin, E\. Grave, and G\. Lample \(2023\)LLaMA: open and efficient foundation language models\.arXiv preprintarXiv\.2302\.13971\.External Links:[Document](https://dx.doi.org/10.48550/ARXIV.2302.13971)Cited by:[§3\.2](https://arxiv.org/html/2605.23032#S3.SS2.p1.2)\.
- G\. Tuckute, N\. Kanwisher, and E\. Fedorenko \(2024a\)Language in brains, minds, and machines\.Annual Review of Neuroscience47\(1\),pp\. 277–301\.External Links:ISSN 1545\-4126,[Document](https://dx.doi.org/10.1146/annurev-neuro-120623-101142)Cited by:[§1](https://arxiv.org/html/2605.23032#S1.p2.1),[§2\.2](https://arxiv.org/html/2605.23032#S2.SS2.p1.1),[§6](https://arxiv.org/html/2605.23032#S6.SS0.SSS0.Px1.p1.1)\.
- G\. Tuckute, A\. Sathe, S\. Srikant, M\. Taliaferro, M\. Wang, M\. Schrimpf, K\. Kay, and E\. Fedorenko \(2024b\)Driving and suppressing the human language network using large language models\.Nature Human Behaviour8\(3\),pp\. 544–561\.External Links:[Document](https://dx.doi.org/10.1038/s41562-023-01783-7)Cited by:[§1](https://arxiv.org/html/2605.23032#S1.p2.1),[§2\.2](https://arxiv.org/html/2605.23032#S2.SS2.p1.1),[§2\.5](https://arxiv.org/html/2605.23032#S2.SS5.p1.1)\.
- M\. Ulrich, M\. Harpaintner, N\. M\. Trumpp, A\. Berger, F\. Günther, and M\. Kiefer \(2025\)Indirect experiential grounding: semantic similarity of abstract scientific concepts is reflected in activity patterns in visual and motor cortex\.Scientific Reports15\.External Links:[Document](https://dx.doi.org/10.1038/s41598-025-32189-2),ISSN 20452322Cited by:[§6](https://arxiv.org/html/2605.23032#S6.SS0.SSS0.Px2.p1.1)\.
- E\. G\. Wilcox, T\. Pimentel, C\. Meister, R\. Cotterell, and R\. P\. Levy \(2023\)Testing the predictions of surprisal theory in 11 languages\.Trans\. Assoc\. Comput\. Linguistics11,pp\. 1451–1470\.External Links:[Document](https://dx.doi.org/10.1162/TACL%5FA%5F00612)Cited by:[§2\.3](https://arxiv.org/html/2605.23032#S2.SS3.p1.1)\.
- A\. Yang, B\. Xiao, B\. Wang, B\. Zhang, C\. Bian, C\. Yin, C\. Lv, D\. Pan, D\. Wang, D\. Yan, F\. Yang, F\. Deng, F\. Wang, F\. Liu, G\. Ai, G\. Dong, H\. Zhao, H\. Xu, H\. Sun, H\. Zhang, H\. Liu, J\. Ji, J\. Xie, J\. Dai, K\. Fang, L\. Su, L\. Song, L\. Liu, L\. Ru, L\. Ma, M\. Wang, M\. Liu, M\. Lin, N\. Nie, P\. Guo, R\. Sun, T\. Zhang, T\. Li, T\. Li, W\. Cheng, W\. Chen, X\. Zeng, X\. Wang, X\. Chen, X\. Men, X\. Yu, X\. Pan, Y\. Shen, Y\. Wang, Y\. Li, Y\. Jiang, Y\. Gao, Y\. Zhang, Z\. Zhou, and Z\. Wu \(2023\)Baichuan 2: open large\-scale language models\.arXiv preprintarXiv\.2309\.10305\.External Links:[Document](https://dx.doi.org/10.48550/ARXIV.2309.10305)Cited by:[§3\.2](https://arxiv.org/html/2605.23032#S3.SS2.p1.2)\.
- A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, and Z\. Qiu \(2024\)Qwen2\.5 technical report\.arXiv preprintarXiv\.2412\.15115\.External Links:[Link](https://doi.org/10.48550/arXiv.2412.15115),[Document](https://dx.doi.org/10.48550/ARXIV.2412.15115)Cited by:[§3\.2](https://arxiv.org/html/2605.23032#S3.SS2.p1.2)\.

## Appendix A Conceptual Framework

Figure [3](https://arxiv.org/html/2605.23032#A1.F3) summarizes our experimental setup: fMRI data from three languages and seven LLMs are fed through identical PCA \+ ridge\-regression encoding pipelines, with each of four pre\-registered predictions \(P1–P4\) mapped onto a specific contrast or covariate in the resulting alignment matrix\.

\[Figure 3 diagram: fMRI responses from the universal language network \($N=112$; English $n=49$, Chinese $n=35$, French $n=28$\) and representations from seven LLMs \(EN\-dominant: GPT\-2, LLaMA\-2; ZH\-dominant: Baichuan2; multilingual: mBERT, XLM\-R, BLOOM, Qwen2\.5\) pass through PCA \($d \to 100$\) and ridge regression onto voxel responses $y_v$, yielding $r$ and $\tilde{r}$ per ROI\. Predictions: P1, universal alignment? P2, syntax–semantics ROI gradient? P3, training dominance? P4, surprisal mediation?\]

Figure 3: Conceptual framework\. Four theory\-derived predictions tested across three languages and seven models with noise\-ceiling normalization\.
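
The pipeline in Figure 3 can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the function names (`pca_fit`, `ridge_fit`, `encode_roi`), the ridge penalty `alpha`, the single train/test split, and the synthetic data are all assumptions. Only the reduction to 100 principal components, the ridge regression step, and the per-voxel correlation $r$ come from the text.

```python
import numpy as np

def pca_fit(X, k):
    """Return the mean and the top-k principal axes of X (samples x features)."""
    mu = X.mean(axis=0)
    # SVD of the centered matrix; rows of Vt are principal axes.
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]

def ridge_fit(X, Y, alpha):
    """Closed-form ridge weights: (X^T X + alpha I)^-1 X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

def encode_roi(feat_train, bold_train, feat_test, bold_test, k=100, alpha=1.0):
    """Per-voxel Pearson r between predicted and held-out BOLD responses."""
    mu, axes = pca_fit(feat_train, k)
    z_train = (feat_train - mu) @ axes.T
    z_test = (feat_test - mu) @ axes.T
    y_mu = bold_train.mean(axis=0)
    W = ridge_fit(z_train, bold_train - y_mu, alpha)
    pred = z_test @ W + y_mu
    # Pearson correlation computed column-wise (one value per voxel).
    pc = pred - pred.mean(axis=0)
    bc = bold_test - bold_test.mean(axis=0)
    return (pc * bc).sum(axis=0) / (
        np.linalg.norm(pc, axis=0) * np.linalg.norm(bc, axis=0)
    )

rng = np.random.default_rng(0)
features = rng.standard_normal((400, 256))   # hypothetical: 400 TRs x 256-d hidden states
true_w = rng.standard_normal((256, 20))      # hypothetical: 20 voxels in one ROI
bold = features @ true_w + rng.standard_normal((400, 20))
r = encode_roi(features[:300], bold[:300], features[300:], bold[300:])
```

Fitting PCA and the response mean on the training split only keeps the held-out correlation free of leakage. A real analysis would typically select `alpha` by cross-validation, which this sketch omits.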

## Appendix B Encoding Performance Standard Errors

Table [2](https://arxiv.org/html/2605.23032#A2.T2) reports the standard error of the mean encoding correlation for each model–language pair, computed as the across\-subject standard deviation of voxel\-mean correlations divided by $\sqrt{n_{\text{subj}}}$\. All SEs lie below 0\.015, indicating that the cross\-language gradients summarized in Table [1](https://arxiv.org/html/2605.23032#S4.T1) are not driven by per\-subject noise\.

Table 2: Standard errors of mean encoding correlations \($r$\) for each model–language combination\. All SEs $< .015$\.
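
The standard-error computation described above can be written out directly. The following is a sketch under stated assumptions: the function name `sem_across_subjects`, the use of the sample standard deviation (`ddof=1`), and the synthetic data are illustrative choices that the text does not specify.

```python
import numpy as np

def sem_across_subjects(voxel_r):
    """voxel_r: array (n_subjects, n_voxels) of encoding correlations.

    Returns (mean, SE) of the per-subject voxel-mean correlation.
    """
    subj_means = voxel_r.mean(axis=1)      # one voxel-mean r per subject
    n_subj = subj_means.shape[0]
    sd = subj_means.std(ddof=1)            # sample SD across subjects (assumption)
    return subj_means.mean(), sd / np.sqrt(n_subj)

rng = np.random.default_rng(1)
# Hypothetical data: 49 subjects, 500 voxels, subject-level means near 0.20.
subject_offsets = rng.normal(0.20, 0.05, size=(49, 1))
voxel_r = subject_offsets + rng.normal(0.0, 0.03, size=(49, 500))
mean_r, se_r = sem_across_subjects(voxel_r)
```

Averaging over voxels first means the SE reflects between-subject variability only, which matches the appendix's claim about per-subject noise.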

## Appendix C Full ROI Results

Table [3](https://arxiv.org/html/2605.23032#A3.T3) reports noise\-ceiling\-normalized encoding \($\tilde{r}$\) by ROI for three representative models \(XLM\-R, Baichuan2\-7B, LLaMA\-2\-7B\)\. Two patterns are visible\. First, PTL shows the highest alignment in every model–language cell, consistent with the lexico\-semantic gradient \(Kauf et al\., [2024](https://arxiv.org/html/2605.23032#bib.bib45)\)\. Second, the across\-language drop for language\-dominant models is concentrated in the more anterior ROIs \(IFG, MFG\), foreshadowing the region\-specific gradient analysis of §[4](https://arxiv.org/html/2605.23032#S4) \(Experiment 5\)\.

Table 3: Noise\-ceiling\-normalized encoding \($\tilde{r}$\) by ROI for selected models\. PTL shows the highest alignment across all conditions\. Note: Baichuan2 Chinese ATL \($\tilde{r}=.87$\) and PTL \($\tilde{r}=.90$\) approach the noise ceiling; these values should be interpreted cautiously, as they may be influenced by uncertainty in the noise\-ceiling estimate\.

## Appendix D Layer\-Wise Details

The Chinese rightward shift replicates across multiple models: BLOOM\-7B \(peak shift: \+1\.5 layers, $d=0.85$\); mBERT \(peak shift: \+1\.0 layers, $d=0.61$\); LLaMA\-2\-7B \(peak shift: \+2\.1 layers, $d=1.04$\)\. The Baichuan2 peak layer for Chinese is *earlier* than for English \(18\.1 vs\. 19\.8\), the reverse of the pattern seen with other models, consistent with Baichuan2’s low Chinese fertility \(1\.6 vs\. 3\.8 for LLaMA\-2\) and further supporting the tokenization\-mediated account\.
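
To make the two reported quantities concrete, the sketch below computes a peak-layer shift and an effect size from per-subject layer profiles. How the reported $d$ values were computed is not stated in this appendix; treating $d$ as Cohen's $d$ for two independent groups with a pooled standard deviation is an assumption, as are the function names and the toy profiles.

```python
import numpy as np

def peak_layers(layer_r):
    """layer_r: (n_subjects, n_layers) encoding r per layer -> peak layer index per subject."""
    return layer_r.argmax(axis=1)

def cohens_d(a, b):
    """Cohen's d for two independent groups using the pooled sample SD."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Hypothetical layer profiles: each subject's curve peaks at a known layer.
en_peaks_true = np.array([17, 18, 19, 18, 20])
zh_peaks_true = np.array([19, 20, 21, 20, 22])
layers = np.arange(32)
en_profiles = -(layers[None, :] - en_peaks_true[:, None]) ** 2
zh_profiles = -(layers[None, :] - zh_peaks_true[:, None]) ** 2

en_peaks = peak_layers(en_profiles)
zh_peaks = peak_layers(zh_profiles)
shift = zh_peaks.mean() - en_peaks.mean()   # positive = later (rightward) peak for Chinese
d = cohens_d(zh_peaks, en_peaks)
```

A positive `shift` corresponds to the rightward pattern reported for BLOOM-7B, mBERT, and LLaMA-2-7B; a negative value corresponds to the Baichuan2 reversal.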

## Appendix E Tokenization Fertility

Table [4](https://arxiv.org/html/2605.23032#A5.T4) reports tokenization fertility \(the average number of subword tokens produced per orthographic word in the LPP stimulus text\) across every language × tokenizer combination used in our experiments\. Two regularities are visible\. Tokenizers trained predominantly on English produce nearly word\-level segmentation for English and French but break Chinese into many more pieces \($1.7$–$2.9\times$\)\. Tokenizers with substantial Chinese training \(Baichuan2, Qwen2\.5\) achieve near\-parity across all three languages, demonstrating that the Chinese\-high\-fertility pattern is a property of vocabulary composition rather than of Chinese script per se\.

Table 4: Tokenization fertility: mean subword tokens per word in the LPP stimulus text. Chinese shows consistently higher fertility except for Baichuan2 and Qwen2.5, whose vocabularies are optimized for Chinese.

## Appendix F PCA Variance Explained

Table [5](https://arxiv.org/html/2605.23032#A6.T5) reports cumulative variance explained by the top 50, 100, and 200 principal components of each model’s representations (best-performing layer, English). The 100-PC choice used throughout the main analysis retains 85–93% of variance across all models. A robustness analysis varying the cut at 50, 100, 150, and 200 PCs produced encoding-correlation differences $\Delta r<0.008$, ruling out PCA bandwidth as a meaningful source of cross-language variation.
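
The cumulative-variance figures can be computed from the singular values of the centered representation matrix. The sketch below assumes a matrix `X` with one row per stimulus timepoint and one column per hidden unit; the function name and the synthetic input are illustrative, not the released analysis code.

```python
import numpy as np

def cumulative_variance_explained(X, ks=(50, 100, 200)):
    """Fraction of total variance captured by the top-k principal components."""
    Xc = X - X.mean(axis=0, keepdims=True)          # center each feature
    singular_values = np.linalg.svd(Xc, compute_uv=False)
    variances = singular_values ** 2                # proportional to PC variances
    cumulative = np.cumsum(variances) / variances.sum()
    # Guard against k larger than the number of available components.
    return {k: float(cumulative[min(k, len(cumulative)) - 1]) for k in ks}

# Synthetic placeholder input (random numbers, not model activations).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 250))
explained = cumulative_variance_explained(X)
```

Random Gaussian input has a nearly flat spectrum, so it retains far less variance at 100 PCs than the structured model representations summarized in Table 5.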

Table 5: Variance explained by PCA across models (best layer, English). 100 PCs explain 85–93% of variance. Robustness analysis (50/100/150/200 PCs) yields stable encoding results ($\Delta r<.008$).

## Appendix G Individual Variability

Table [6](https://arxiv.org/html/2605.23032#A7.T6) reports the across-subject distribution of XLM-R encoding correlations within each language. Most subjects (47/49 EN, 33/35 ZH, 27/28 FR) show above-chance encoding, indicating that aggregate cross-language patterns are not produced by a small subset of outlier subjects. The medians (.19–.21) are tightly clustered across languages, while the broader spread for English likely reflects its larger sample size rather than language-specific variability.

Table 6: Distribution of individual-subject encoding correlations (XLM-R). 47/49 EN, 33/35 ZH, 27/28 FR subjects show above-chance encoding. Individual-difference analyses (e.g., L2 proficiency, language exposure) could reveal whether the typological gradient varies with individual linguistic experience; we reserve this for future work.

## Appendix H Typological Distance Details

Table [7](https://arxiv.org/html/2605.23032#A8.T7) reports pairwise typological distances under three metrics alongside the alignment drop (in percent of raw $r$) each model exhibits when tested on a language other than its training-dominant language. The three metrics agree closely on the rank-ordering EN–FR $<$ FR–ZH $<$ EN–ZH, supporting the use of any of them as the distance covariate; Grambank is preferred in the main analysis because of its dense, balanced feature set. The alignment-drop rows show the central asymmetry: both language-dominant models degrade steeply with typological distance, while the multilingual model (XLM-R) shows only minor drops.

Table 7: Typological distances and corresponding alignment drops (%) from each model’s dominant training language. Drops are computed from raw $r$ values.

[Figure 4 plot. x-axis: Grambank distance from dominant language; y-axis: normalized $\tilde{r}$; series: LLaMA-2 (EN-dom.), Baichuan2 (ZH-dom.), XLM-R (multilingual).]

Figure 4: Encoding performance ($\tilde{r}$) vs. Grambank distance from each model’s dominant training language. Dashed lines: linear fits. Both language-dominant models show steep degradation; XLM-R (multilingual) is flat. Baichuan2 distances are computed from Chinese (its dominant language); LLaMA-2 distances from English.

Table 8: Full pairwise typological, genealogical, and geographic distances (normalized to [0, 1]).

Table 9: Autoregressive-only comparison (all 7B). BLOOM achieves dramatically higher uniformity than either language-dominant model. $U$ = uniformity (Eq. [1](https://arxiv.org/html/2605.23032#S3.E1)); 95% bootstrap CIs (10,000 resamples). $U$ is computed from raw $r$ values using sample standard deviation.

## Appendix I Sensitivity Analysis for Baichuan2 Training Proportion

Because Baichuan2’s training-language composition is reported as approximately 55% Chinese without an exact split per source-language, we re-estimated the partial-correlation and slope coefficients under values of $p_{\text{ZH}}$ ranging from 40% to 70% in 5-point increments (Table [10](https://arxiv.org/html/2605.23032#A9.T10)). The typological-distance coefficient $\beta$ and partial correlation vary by less than 0.07 across this 30-percentage-point range, and all qualitative conclusions are preserved. This bounds the influence of imprecision in our training-proportion covariate.
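
A sensitivity sweep of this kind can be sketched as follows. The sketch assumes the partial correlation is obtained by regressing the training-proportion covariate (plus an intercept) out of both alignment and typological distance and correlating the residuals, and that only one covariate entry changes with $p_{\text{ZH}}$; both are simplifying assumptions, and all data below are random placeholders rather than values from the paper.

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation between x and y after regressing z (plus an intercept) out of both."""
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))
    design = np.column_stack([np.ones_like(z), z])
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return float(np.corrcoef(res_x, res_y)[0, 1])

def sensitivity(alignment, distance, proportion, cell, grid):
    """Recompute the partial correlation with proportion[cell] set to each grid value."""
    results = {}
    for p in grid:
        prop = np.array(proportion, dtype=float)  # copy, so the input is not modified
        prop[cell] = p
        results[round(float(p), 2)] = partial_corr(alignment, distance, prop)
    return results

# Synthetic placeholder data (random numbers, NOT values from the paper).
rng = np.random.default_rng(0)
distance = rng.uniform(0.0, 0.5, size=9)
proportion = rng.uniform(0.0, 0.9, size=9)
alignment = 0.6 - 0.5 * distance + 0.2 * proportion + rng.normal(0, 0.02, size=9)

grid = np.linspace(0.40, 0.70, 7)  # 40% to 70% in 5-point steps
table = sensitivity(alignment, distance, proportion, cell=0, grid=grid)
```

The spread of the values in `table` plays the role of the "less than 0.07" bound quoted above: a small spread indicates that the conclusion does not hinge on the exact training proportion.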

Table 10: Sensitivity analysis: partial correlation (alignment $\sim$ typological distance $|$ training proportion) and $\beta$ under varying assumptions about Baichuan2’s Chinese training proportion. All results remain qualitatively stable across the full $\pm15\%$ range. The near-linear increments reflect the approximately linear relationship between the training-proportion covariate and the partial correlation over this narrow range; exact values may show minor deviations at extremes depending on convergence criteria.

## Appendix J Cross-Language Transfer: Additional Models

Table [11](https://arxiv.org/html/2605.23032#A10.T11) summarizes cross-language encoding transfer for the two language-dominant 7B models. LLaMA-2 retains $\sim$61% of within-language performance on its typologically closer non-dominant language (French) and $\sim$36% on its more distant non-dominant language (Chinese). Baichuan2 shows the mirror-image pattern (the typologically closer non-dominant language being French at distance 0.44, the more distant being English at 0.48, with comparable gradient magnitudes), further supporting the training-dominance account introduced in §[4.2](https://arxiv.org/html/2605.23032#S4.SS2).

Table 11: Cross-language transfer for language-dominant models. “Close” = typologically closer non-dominant language; “Far” = typologically distant non-dominant language. Baichuan2 shows a reversed gradient paralleling the within-language alignment pattern.

## Appendix K Multiple Comparison Correction

Table [12](https://arxiv.org/html/2605.23032#A11.T12) reports Bonferroni-corrected $p$-values for our four pre-registered predictions (§[2.5](https://arxiv.org/html/2605.23032#S2.SS5)). All four remain significant after correction. We use Bonferroni rather than Holm here because the four predictions were specified before data inspection and we treat them as independent confirmatory tests.
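
For reference, the two corrections mentioned above can be written in a few lines. This is a generic sketch of the standard Bonferroni and Holm step-down adjustments, not the paper's analysis code; the example $p$-values in the usage note are arbitrary.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply each by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def holm(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)        # enforce monotonicity
        adjusted[i] = running_max
    return adjusted
```

For arbitrary inputs such as `[0.001, 0.01, 0.02, 0.3]`, every Holm-adjusted value is at most its Bonferroni counterpart, so a prediction that survives Bonferroni also survives Holm.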

Table 12: Bonferroni correction across the four pre-registered predictions. All predictions remain significant after correction.

## Appendix L Whole-Brain Analysis

Whole-brain encoding reveals significant effects in the default mode network (DMN), including medial prefrontal cortex and posterior cingulate, consistent with narrative comprehension. The cross-linguistic pattern in the DMN parallels the language-network results: multilingual models show more uniform performance. Absolute magnitude is lower (mean $r=0.11$ vs. $0.19$ in the language network for XLM-R). The MD network shows minimal encoding ($r<0.04$, n.s.).

## Appendix M Cross-Language Transfer: Full Results

Table [13](https://arxiv.org/html/2605.23032#A13.T13) reports the full set of cross-language transfer results for XLM-R: encoding models trained on subjects listening to one language and tested on subjects listening to another. Discussion of the patterns follows the table.

Table 13: Cross-language encoding transfer (XLM-R) with noise-ceiling-normalized $\tilde{r}$. Transfer to typologically closer pairs (EN$\to$FR: 73%) outperforms distant pairs (FR$\to$ZH: 50%). LLaMA-2 and Baichuan2 transfer results are reported in Appendix [J](https://arxiv.org/html/2605.23032#A10). The *% within* column normalizes raw transfer $r$ against the within-language baseline of the language whose tokenizer dominates the encoding (English’s $r=0.208$ for transfers involving English; the source language’s within-language $r$ otherwise), which captures the practical question of how much of the achievable encoding ceiling is recoverable across the language boundary.

All cross-language transfer results are significantly above chance ($p<0.001$). Several patterns emerge. First, the transfer gradient aligns with typological distance: EN$\to$FR transfer retains 73% of within-language performance, substantially exceeding EN$\to$ZH (52%) and FR$\to$ZH (50%). Second, the asymmetry between ZH$\to$EN (57%) and ZH$\to$FR (53%) may reflect the higher proportion of English in XLM-R’s training data rather than typological proximity, since Grambank distance places Chinese slightly closer to French (0.44) than to English (0.48). Third, FR$\to$EN (66%) exceeds both FR$\to$ZH (50%) and EN$\to$FR (62%), consistent with the Indo-European pair sharing more transferable representations. These patterns reinforce the finding that typological distance, training data composition, and genealogical relatedness all contribute to cross-linguistic brain–LLM alignment.
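
The *% within* normalization described in the Table 13 caption can be sketched as a small helper. The baseline rule follows the caption; the function name is hypothetical, and apart from the English baseline of 0.208 quoted there, every number below is a placeholder rather than a reported result.

```python
def percent_within(transfer_r, source, target, within_r):
    """Express a raw transfer correlation as a percentage of a within-language baseline.

    Baseline rule (from the Table 13 caption): English's within-language r when
    either side of the transfer is English, otherwise the source language's.
    """
    baseline = within_r["EN"] if "EN" in (source, target) else within_r[source]
    return 100.0 * transfer_r / baseline

# 0.208 is the English within-language r quoted in the caption; the FR and ZH
# baselines and the transfer r below are hypothetical placeholders.
within_r = {"EN": 0.208, "FR": 0.20, "ZH": 0.20}
example = percent_within(0.104, "EN", "ZH", within_r)  # half of the English baseline
```

Because the denominator switches between the English and source-language baselines, percentages for pairs that involve English and pairs that do not are expressed against different ceilings and should be compared with that in mind.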

## Appendix N Future Predictions for $\geq$10 Languages

Our results generate three specific predictions for future cross\-linguistic work: \(a\) typological features indexing structural variation \(e\.g\., head direction, topic\-prominence, morphological complexity\) should predict ROI\-specific alignment gradients more strongly than features indexing semantic organization; \(b\) the IFG gradient should be steeper for language pairs differing in argument\-structure properties than for pairs differing primarily in phonological features; and \(c\) agglutinative languages \(e\.g\., Turkish, Finnish\) should show a layer shift in the opposite direction from Chinese if the shift reflects morphological decomposition demands rather than information density\.

## Appendix O CxG Specificity: Extended Analysis

### O.1 Detailed Alternative Accounts

Three alternative accounts predict the same IFG $>$ PTL typological gradient pattern observed in our data:

#### Generic syntax–semantics dissociation\.

Any theory that posits cross-linguistic syntactic variation while maintaining relative semantic universality (including generative grammar and functionalist approaches) would predict larger typological effects in syntax-associated regions (Mahowald et al., [2024](https://arxiv.org/html/2605.23032#bib.bib51)). Under this account, IFG’s steeper gradient simply reflects the well-established cross-linguistic variability of syntactic structure relative to semantic content, without requiring construction-specific representations.

#### Cognitive control\.

IFG is implicated in cognitive control for non-dominant language processing (Green and Abutalebi, [2013](https://arxiv.org/html/2605.23032#bib.bib69)). The steeper IFG gradient could partially reflect increased control demands when LLM representations diverge from the listener’s native language, rather than constructional processing per se. If this account is correct, the gradient should correlate with measures of cognitive control demand (e.g., conflict adaptation, switching costs) rather than purely linguistic typological distance.

#### Signal\-to\-noise ratio\.

Kauf et al. ([2024](https://arxiv.org/html/2605.23032#bib.bib45)) showed that lexical-semantic content dominates brain–LLM similarity. Because IFG contributes less to overall alignment than PTL, its signal may be noisier and more susceptible to degradation with typological distance, producing a steeper apparent gradient without implicating constructional specificity. Under this account, normalizing for baseline alignment magnitude should attenuate the IFG/PTL gradient difference.

### O.2 What CxG-Specific Evidence Would Require

A genuinely CxG-specific test would need to go beyond ROI-level gradients to examine construction-level alignment. Three approaches would provide evidence specific to CxG:

#### Construction\-type\-specific analysis\.

This approach would test whether sentences containing Chinese bǎ-constructions versus English double-object constructions produce different IFG alignment patterns. CxG predicts that specific form–function mappings, not just generic syntactic complexity, drive the gradient. This requires construction-annotated stimuli across languages.

#### Gradient category boundaries\.

Because CxG predicts gradient, not categorical, constructional differences across languages, the neural effect should itself be graded rather than step-like. Languages with partially overlapping constructional inventories (e.g., English and French datives) should show intermediate gradients compared to languages with fully different constructions (e.g., English and Chinese disposal constructions).

#### Form–function dissociation\.

Constructions that are form\-similar but function\-different across languages should show different alignment patterns than form\-different but function\-similar pairs\. For example, English and French passive constructions share similar form but may differ in discourse function; if the IFG gradient tracks functional rather than formal similarity, this would support CxG over purely formal syntactic accounts\.

We note that the Universal Constructicon (UCxn) project’s 10-language annotations (Boas and Sag, [2012](https://arxiv.org/html/2605.23032#bib.bib78)) could provide the infrastructure for such tests. The present finding is best described as consistent with, but not uniquely supporting, CxG.

## Appendix P Ethical Considerations

The LPP corpus (Li et al., [2022](https://arxiv.org/html/2605.23032#bib.bib70)) is publicly available on OpenNeuro (ds003643) under the CC0 license. The original study obtained IRB approval from Cornell University, Jiangsu Normal University, and NeuroSpin; all participants provided informed written consent. Participants were recruited from university communities and compensated at standard institutional rates. Our study involves secondary analysis of de-identified data and does not require additional IRB approval. All models are publicly available under open-source or research-use licenses. Code for all analyses is publicly released at [https://github.com/bettyguo/cross-lingual-brain-llm](https://github.com/bettyguo/cross-lingual-brain-llm).
