Heterogeneous Neural Predictivity from Language Models During Naturalistic Comprehension

arXiv cs.CL Papers

Summary

This paper investigates how language model representations predict neural activity during naturalistic language comprehension, using intracranial and MEG recordings from Brain Treebank, Podcast ECoG, and MEG-MASC. The findings indicate that language model features are useful neural predictors, but caution against treating predictive success as evidence for shared neural organization or language-processing computations.

arXiv:2606.26880v1 Announce Type: new Abstract: Language-model representations provide structured, high-dimensional annotations of naturalistic language stimuli and can serve as informative neural predictors during comprehension. We analyzed locked derived data from Brain Treebank, MEG-MASC, and Podcast ECoG with eight frozen language models, blocked encoding models, and matched temporal, nuisance, and representation-capacity controls. Positive held-out prediction and gains over low-level baselines were widespread in source-level summaries. Across Brain Treebank and Podcast ECoG, 67 of 432 evaluable rows met a controlled predictive-only criterion, and model-side feature ablations changed prediction scores in most evaluable source rows. Brain-derived, timing-linked, acoustic, and implanted-signal controls confirmed component-level sensitivity of the analysis pipeline. These findings show that language-model-derived quantities can annotate neural activity during natural speech and text comprehension. Participant-level matched-control advantages were localized rather than uniform, response-profile and feature-specificity contrasts bounded representational or computational interpretations, and complete co-indexed integrated interpretation will require future jointly indexed coverage. Together, the analyses identify language-model features as useful neural predictors and separate predictive usefulness from claims about shared neural organization or language-processing computations.

# Heterogeneous Neural Predictivity from Language Models During Naturalistic Comprehension
Source: [https://arxiv.org/html/2606.26880](https://arxiv.org/html/2606.26880)
Xiao Jia, School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen, xiaojia@link\.cuhk\.edu\.cn

\(June 25, 2026\)

###### Abstract

Language\-model representations provide structured, high\-dimensional annotations of naturalistic language stimuli and can serve as informative neural predictors during comprehension\. We analyzed locked derived data from Brain Treebank, MEG\-MASC, and Podcast ECoG with eight frozen language models, blocked encoding models, and matched temporal, nuisance, and representation\-capacity controls\. Positive held\-out prediction and gains over low\-level baselines were widespread in source\-level summaries\. Across Brain Treebank and Podcast ECoG, 67 of 432 evaluable rows met a controlled predictive\-only criterion, and model\-side feature ablations changed prediction scores in most evaluable source rows\. Brain\-derived, timing\-linked, acoustic, and implanted\-signal controls confirmed component\-level sensitivity of the analysis pipeline\. These findings show that language\-model\-derived quantities can annotate neural activity during natural speech and text comprehension\. Participant\-level matched\-control advantages were localized rather than uniform, response\-profile and feature\-specificity contrasts bounded representational or computational interpretations, and complete co\-indexed integrated interpretation will require future jointly indexed coverage\. Together, the analyses identify language\-model features as useful neural predictors and separate predictive usefulness from claims about shared neural organization or language\-processing computations\.

Keywords: naturalistic language comprehension, language models, cognitive computational neuroscience, neural encoding, positive controls, evidence calibration

## 1 Introduction

Language\-model representations have become effective quantitative probes of naturalistic language comprehension\. Contextual embeddings, surprisal estimates, and layer\-specific features can predict neural responses measured with fMRI, MEG, EEG, and intracranial electrophysiology, providing a tractable way to relate unfolding linguistic context to neural activity\[[13](https://arxiv.org/html/2606.26880#bib.bib26),[23](https://arxiv.org/html/2606.26880#bib.bib48),[3](https://arxiv.org/html/2606.26880#bib.bib11),[7](https://arxiv.org/html/2606.26880#bib.bib19),[27](https://arxiv.org/html/2606.26880#bib.bib50),[28](https://arxiv.org/html/2606.26880#bib.bib51),[22](https://arxiv.org/html/2606.26880#bib.bib45),[34](https://arxiv.org/html/2606.26880#bib.bib55),[12](https://arxiv.org/html/2606.26880#bib.bib27),[14](https://arxiv.org/html/2606.26880#bib.bib29)\]\. This predictive success is scientifically useful because language\-model features provide structured, high\-dimensional representations of the stimulus that can account for variation in neural responses\. The next question is what kind of neural and cognitive information this predictivity supports\.

Positive neural prediction can support several levels of inference\. A feature may predict held\-out neural responses because it tracks stimulus properties relevant to the measurement\. Additional evidence can show whether the trained representation exceeds matched temporal, nuisance, and capacity controls, whether it reproduces organization across sampled neural units, or whether it depends selectively on a candidate language\-related quantity\. The present study treats these outcomes as separable empirical claims\.

Naturalistic stimuli make this separation important\. Word onset, word rate, acoustic envelope, sentence position, discourse progression, lexical frequency, token predictability, and local transition statistics are intercorrelated\. Neural measurements also contain temporal autocorrelation, and modern language\-model features are high\-dimensional, layer\-indexed, and context\-length\-dependent\[[32](https://arxiv.org/html/2606.26880#bib.bib54),[4](https://arxiv.org/html/2606.26880#bib.bib13),[2](https://arxiv.org/html/2606.26880#bib.bib9),[36](https://arxiv.org/html/2606.26880#bib.bib46),[37](https://arxiv.org/html/2606.26880#bib.bib47)\]\. Positive neural prediction may therefore combine language\-related information with temporal, lexical, acoustic, and representation\-capacity contributions\.

Recent language\-neuroscience work motivates a more explicit link between computational measures and cognitive interpretation\. Theoretical conclusions depend on the chain connecting constructs, tasks, measurements, analyses, and auxiliary assumptions\[[29](https://arxiv.org/html/2606.26880#bib.bib8)\]\. Syntax–semantics, surprisal, and narrative\-comprehension studies show the value of linking computational measures to specified neural response patterns and interpretive targets\[[24](https://arxiv.org/html/2606.26880#bib.bib30),[35](https://arxiv.org/html/2606.26880#bib.bib32),[26](https://arxiv.org/html/2606.26880#bib.bib31)\]\. Hadidi and colleagues further show that shuffled training and testing partitions, activation\-extraction choices, positional signals, and word\-rate controls can strongly affect brain–language\-model predictivity\[[9](https://arxiv.org/html/2606.26880#bib.bib2)\]\. These results motivate a narrower question: how much of the observed signal supports predictive usefulness, model\-specific advantage, shared neural organization, or candidate\-computation interpretation\.

Here we characterize heterogeneous neural predictivity from language\-model features across three naturalistic language datasets\. We quantify positive neural predictions, gains over nuisance baselines, controlled predictive\-only rows, participant\-level consistency, and sensitivity to model\-side ablation\. We then compare these outputs with matched controls, response\-profile tests, feature\-specificity diagnostics, and reliability\-bounded summaries to determine which interpretations are supported by the available derived data\.

## 2 Materials and Methods

### 2\.1 Participants and data sources

This secondary analysis used preprocessed and derived data from previously released sources; raw neural recordings were governed by the original data providers and were not redistributed\. Three datasets were treated as the primary naturalistic language sources because the available derived data contained neural time series or neural targets, word\-level event grids, language\-model features, nuisance variables, matched controls, and reliability or coverage metadata\[[33](https://arxiv.org/html/2606.26880#bib.bib10),[39](https://arxiv.org/html/2606.26880#bib.bib57),[8](https://arxiv.org/html/2606.26880#bib.bib21)\]\. Brain Treebank contributed 10 participants, 26 subject\-run units, and 248 modality\-specific brain units\. Podcast ECoG contributed 9 participants, 9 subject\-run units, and 235 ECoG\-derived brain units\. MEG\-MASC contributed 11 participants, 84 subject\-run units, and 257 MEG\-derived target units\. These brain units are electrodes, sensors, source targets, time windows, or derived target profiles depending on the source dataset\. They are distinct from independent participants\.

The sample size was determined by the secondary\-data design and by the subset of publicly released or locally accessible derived data that could be matched across neural targets, word\-event grids, model features, controls, and reliability metadata\. No new participants were recruited, and participant expansion was outside the available derived\-data scope\. Inference is therefore bounded by the participant and participant\-run coverage retained for each contrast\. The matched participant\-run predictive inference retained 26 Brain Treebank subject\-run units from 10 participants, 44 MEG\-MASC subject\-run units from 11 participants, and 8 Podcast ECoG subject\-run units from 8 participants after complete model\-control matching\. When a participant contributed multiple runs, runs were averaged within participant before participant\-cluster bootstrap\. These coverage counts define the predictive inference boundary; electrodes, sensors, layers, and table rows define nested analysis dimensions\.
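The run-averaging and participant-cluster bootstrap described above can be sketched as follows. This is a minimal illustration, not the analysis package's code: the row layout `(participant_id, run_id, score)`, the function names, the seed, and the 95% percentile level are assumptions. The default of 1000 resamples follows the count reported in Section 2.6.

```python
import random
from statistics import mean

def participant_means(rows):
    """Average scores over runs within each participant.

    rows: iterable of (participant_id, run_id, score); this layout is an
    illustrative assumption, not the paper's table schema.
    """
    by_participant = {}
    for pid, _run, score in rows:
        by_participant.setdefault(pid, []).append(score)
    return {pid: mean(scores) for pid, scores in by_participant.items()}

def cluster_bootstrap_ci(rows, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap over participant means (participants are the clusters).

    Returns (estimate, lower, upper), or None when fewer than two
    participants are retained, mirroring the "unavailable" label.
    """
    means = list(participant_means(rows).values())
    if len(means) < 2:
        return None
    rng = random.Random(seed)
    # Resample participants with replacement; runs were already averaged.
    boots = sorted(mean(rng.choices(means, k=len(means))) for _ in range(n_boot))
    lower = boots[int((alpha / 2) * n_boot)]
    upper = boots[int((1 - alpha / 2) * n_boot) - 1]
    return mean(means), lower, upper

# Toy data: p1 and p3 contribute two runs each, p2 contributes one.
rows = [("p1", "r1", 0.5), ("p1", "r2", 1.5), ("p2", "r1", 2.0),
        ("p3", "r1", 2.5), ("p3", "r2", 3.5)]
estimate, lower, upper = cluster_bootstrap_ci(rows)
```

Resampling participant means rather than runs or electrodes keeps the interval tied to the number of independent participants, which is the inference boundary the text describes.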

The source datasets contain material beyond the subset with complete local feature matching\. Brain Treebank records intracranial electrophysiology while neurosurgical participants watched naturalistic movie stimuli, with manually corrected transcripts, word onsets, part\-of\-speech labels, and dependency parses aligned to the audio track\[[33](https://arxiv.org/html/2606.26880#bib.bib10)\]\. Podcast ECoG records intracranial responses while participants listened to a natural spoken podcast, with high\-gamma preprocessed derivatives and linguistic feature annotations available from the source release\[[39](https://arxiv.org/html/2606.26880#bib.bib57)\]\. MEG\-MASC records English\-speaking participants listening to naturalistic MASC stories across repeated MEG sessions that also include word\-list and comprehension\-question material\[[8](https://arxiv.org/html/2606.26880#bib.bib21)\]\. The present manuscript analyzes the participant\-run, event\-grid, representation, and control rows that could be matched in the derived data\.

The source inventory retained Narratives and LPP multilingual as secondary or exploratory sources\[[18](https://arxiv.org/html/2606.26880#bib.bib42),[16](https://arxiv.org/html/2606.26880#bib.bib39)\]\. Learning Brain was treated as validation\-only, and Natural Stories was treated as stimulus–language\-model\-only or diagnostic because the available derived data lacked the neural\-side coverage required for the present contrasts\[[6](https://arxiv.org/html/2606.26880#bib.bib18),[20](https://arxiv.org/html/2606.26880#bib.bib37)\]\. Participant demographics, original exclusion criteria, ethics approvals, and consent procedures are governed by the source publications and repositories\. The present manuscript reports the derived units available to this analysis and preserves de\-identification and raw\-data access boundaries\.

### 2\.2 Naturalistic stimuli and neural measurements

The three primary datasets sample naturalistic comprehension with different measurement modalities\. Brain Treebank provides intracranial recordings during naturalistic audiovisual or language stimuli\. Podcast ECoG provides intracranial recordings during podcast comprehension\. MEG\-MASC provides MEG responses to natural speech\. The local analysis inherited each source’s preprocessing, artifact rejection, and target definition from the released or derived artifacts\. Word onset and event grids were used to align language\-model features and nuisance variables to the neural targets\.

The primary inferential unit was the participant or participant\-run whenever the matched data allowed that aggregation\. Electrodes, sensors, target windows, layers, models, metrics, and candidate quantities were treated as nested or crossed analysis dimensions\. Row counts describe coverage of model–dataset–layer–metric combinations; independent predictive evidence comes from participant\-aware summaries\. A modality, region, and time\-window summary was generated from the predictive outputs\. In the matched derived data, predictive rows retain modality labels and a broad target\-coverage label\. Modality and window summaries are therefore available, whereas within\-region comparisons require finer retained target strata\. Supplementary Table 30 lists the dataset\-specific neural target type, brain\-unit count descriptor, temporal window, selection rule, final unit, and what was averaged before the manuscript\-facing contrasts\. The broad target label denotes the retained target coverage in the derived data\. Predictive intervals are reported over 10 Brain Treebank participants and 26 subject\-run units, 11 MEG\-MASC participants and 44 subject\-run units, and 8 Podcast ECoG participants and 8 subject\-run units\. Response\-profile and feature\-ablation summaries retain their contrast\-specific target\-profile or diagnostic scopes and are reported as bounded summaries without formal participant\-level equivalence tests\.

### 2\.3 Language\-model representations and candidate quantities

The analysis used fixed language\-model representation files from the analysis package\. The analyzed model inventory was bounded to eight validated feature sets: DistilGPT\-2, GPT\-2, GPT\-2 Medium, Pythia\-160M, Pythia\-410M, Qwen2\.5\-0\.5B\-Instruct, Qwen2\.5\-1\.5B\-Instruct, and Qwen3\-1\.7B\. Larger locally indexed checkpoints without matched analysis rows, including Qwen2\.5\-7B\-Instruct and Qwen3\-4B\-Instruct\-2507, were outside the analyzed model set\.

Candidate language\-related quantities were operationalized from the fixed representation files and word\-event tables\. Word surprisal was computed in natural\-log units as the summed subword surprisal for each word, $-\sum_{i \in w} \log p(t_{i} \mid t_{<i})$, using model logits and token\-to\-word maps\. The event\-level implementation used model\-derived surprisal when available and otherwise used the fixed unigram\-surprisal proxy; proxy rows are treated as source\-boundary diagnostics\. Proxy\-derived quantities are excluded from model\-specific computational correspondence claims\. Semantic transition and context update were read from the available word\-feature annotations\. Dependency integration, syntactic boundaries, and discourse boundaries were inherited from the event grid as scalar feature streams or event\-indicator streams\. These operational variables annotate contextual predictability, semantic change, syntactic or dependency structure, boundary structure, and context updating\. Biological\-mechanism claims require evidence beyond these operational variables alone\.
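The word-level surprisal sum can be illustrated with a short sketch. It assumes per-token log-probabilities have already been extracted from the model logits. The names `token_logprobs` and `token_to_word` are hypothetical and do not come from the analysis package.

```python
import math

def word_surprisal(token_logprobs, token_to_word):
    """Sum subword surprisal (in nats) within each word.

    token_logprobs[i] is log p(t_i | t_<i) in natural-log units;
    token_to_word[i] is the index of the word containing token i.
    Both argument names are illustrative assumptions.
    """
    if len(token_logprobs) != len(token_to_word):
        raise ValueError("token_logprobs and token_to_word must align")
    n_words = max(token_to_word) + 1 if token_to_word else 0
    surprisal = [0.0] * n_words
    for logp, word_index in zip(token_logprobs, token_to_word):
        surprisal[word_index] += -logp
    return surprisal

# Toy example: three words, the second split into two subword tokens.
token_probs = [0.5, 0.25, 0.5, 0.125]
token_logprobs = [math.log(p) for p in token_probs]
token_to_word = [0, 1, 1, 2]
example = word_surprisal(token_logprobs, token_to_word)
```

Summing log-probabilities within a word equals the log of the product of the subword conditionals, so the word-level value does not depend on how the tokenizer splits the word, given the same model.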

### 2\.4 Temporal alignment, encoding models, and matched controls

Where held\-out scores were recomputed from matchable event, feature, and neural grids, neural prediction used ridge\-regularized linear encoding within blocked cross\-validation\[[11](https://arxiv.org/html/2606.26880#bib.bib25),[10](https://arxiv.org/html/2606.26880#bib.bib22),[21](https://arxiv.org/html/2606.26880#bib.bib44)\]\. Lag structure, ridge penalty, PCA dimensionality, residualization, projection\-removal steps, and ablation transformations were selected or fit only inside training data, following leakage\-prevention cautions from predictive modeling and neuroimaging\[[25](https://arxiv.org/html/2606.26880#bib.bib49),[30](https://arxiv.org/html/2606.26880#bib.bib52),[31](https://arxiv.org/html/2606.26880#bib.bib53),[38](https://arxiv.org/html/2606.26880#bib.bib56),[15](https://arxiv.org/html/2606.26880#bib.bib36)\]\. The blocked design was used to reduce temporal leakage from autocorrelated naturalistic stimuli\. Supplementary Table 29 summarizes the manuscript\-facing implementation settings, including block and fold policy, ridge alpha grid, lag and time\-window specification, PCA dimensionality cap, hidden\-state token pooling, subword\-to\-word aggregation, score aggregation unit, and proxy policy\.
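A blocked split of the kind described above can be sketched in a few lines. The fold count, the `gap` buffer around each test block, and the function name are illustrative assumptions. The manuscript's actual block and fold policy is the one summarized in its Supplementary Table 29.

```python
def blocked_folds(n_samples, n_folds, gap=0):
    """Yield (train_idx, test_idx) pairs with contiguous test blocks.

    `gap` samples on each side of the test block are dropped from training
    to limit leakage through temporal autocorrelation. The gap size is an
    illustrative choice, not a setting taken from the paper.
    """
    if n_folds < 2 or n_folds > n_samples:
        raise ValueError("need 2 <= n_folds <= n_samples")
    bounds = [(k * n_samples) // n_folds for k in range(n_folds + 1)]
    for k in range(n_folds):
        start, stop = bounds[k], bounds[k + 1]
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples)
                     if i < start - gap or i >= stop + gap]
        yield train_idx, test_idx

folds = list(blocked_folds(n_samples=20, n_folds=4, gap=2))
```

In a pipeline following the text, every fitted step (lag selection, ridge penalty, PCA, residualization) would be estimated on `train_idx` only and then applied unchanged to `test_idx`.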

Matched controls tested whether a model advantage was specific to the real language\-model representation\. The control families included nuisance features, random matched\-dimensionality features, autocorrelation\-matched random features, circular shifts, sentence\-reset features, reversed\-context features, layer\-label permutation, token\-order shuffle, and within\-story block shuffle\. For each model–dataset–layer–metric summary row, the matched\-control sensitivity contrast compared the real\-model score with the most competitive available matched control for the same row\. This comparison asks whether a result remains positive after the closest available alternative in the matched derived data\. Family\-specific contrasts, the frequency with which each family became the most competitive control, and single\-control\-family removal analyses are retained in the Supplement and detailed CSV files to show the role of individual control families\.
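The most-competitive-control contrast for one summary row can be sketched as below. The control-family keys and the scores are made-up illustrations, and the function name is hypothetical.

```python
import math

def matched_control_contrast(real_score, control_scores):
    """Return (delta, family): the real-model score minus the highest-scoring
    available control, and the name of that control family.

    Controls that are missing (None) or nonfinite are skipped; if nothing
    usable remains the contrast is reported as unavailable (None, None).
    """
    usable = {name: score for name, score in control_scores.items()
              if score is not None and math.isfinite(score)}
    if not usable or not math.isfinite(real_score):
        return None, None
    family = max(usable, key=usable.get)
    return real_score - usable[family], family

# Illustrative (made-up) held-out scores for one summary row.
controls = {
    "circular_shift": 0.018,
    "random_matched_dim": 0.004,
    "token_order_shuffle": float("nan"),  # unavailable for this row
}
delta, family = matched_control_contrast(0.031, controls)
```

Returning the winning family alongside the delta makes it possible to tabulate how often each family is the most competitive control, as the Supplement is said to do.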

### 2\.5 Claim hierarchy and statistical summaries

The analysis separates information\-bearing predictivity from model\-specific predictive advantage\. Information\-bearing predictivity refers to positive held\-out prediction or improvement over a low\-level nuisance baseline\. It establishes that a representation contains information useful for neural prediction, with model specificity evaluated at the next evidence level\. Model\-specific predictive advantage requires a positive real\-minus\-matched\-control contrast under the configured temporal, capacity, and contextual controls\. Cross\-neural\-unit response\-profile correspondence is a representational\-organization contrast asking whether model\-to\-brain profiles reproduce brain\-derived profile organization over the same sampled units\. Candidate\-computation ablation is a computation\-specificity contrast asking whether model\-side ablation of a candidate language\-related quantity selectively changes held\-out neural prediction, conditional on predictive evidence\. Reliability\-bounded response\-profile magnitude asks whether a surviving response\-profile effect is large relative to the reliable brain\-derived profile signal\. Integrated summary rows combine the predictive, response\-profile, feature\-ablation, reliability\-bounded response\-profile magnitude, matched\-control, and replication criteria\.

Table 1: Claim\-to\-evidence mapping for language\-model neural predictivity\. Alternative explanations addressed at each level are listed in Supplementary Table 3\.

The Supplementary Information documents the thresholded contrast rules used for reproducibility\. In the main text, supported, unsupported, and unavailable refer to the specified contrast and available data chain\. Integrated summary rows represent the combined claim level and are treated as coverage summaries; participant\-level predictive contrasts are reported separately\.

### 2\.6 Response\-profile and reliability\-bounded response\-profile analyses

Response profiles were constructed to test cross\-neural\-unit organization beyond target\-wise prediction\. For each matched dataset, subject, run, stimulus, model, layer, and candidate quantity, the model–brain profile was the ordered vector of held\-out readout scores across sampled neural units\. Each element of this vector corresponds to one electrode or MEG sensor\-group target after event alignment and blocked cross\-validated ridge readout\. The brain–brain profile used the same unit ordering for brain\-derived pattern vectors\. Target order was fixed by a stored unit\-order hash; rows with mismatched order were marked invalid\. Nonfinite profile elements were dropped pairwise for the similarity calculation\. Pearson and Spearman profile similarities use their standard centered or rank\-centered definitions; cosine similarity uses the finite profile vectors without additional centering\. The configured profile readouts are descriptive per\-unit readout profiles, distinct from matrix\-level CKA or representational\-similarity\-analysis estimates\. In the implemented event\-response readout, per\-unit Pearson prediction scores are transformed to $r^{2}$ for the squared\-correlation profile, to $|r|$ for the absolute\-correlation profile, and kept signed for the signed\-correlation profile\.
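The profile transforms and two of the similarity metrics described above can be sketched as follows. Function names and the toy vectors are illustrative assumptions. Spearman similarity is omitted for brevity; it would apply the same Pearson computation to ranks.

```python
import math

def _finite_pairs(a, b):
    """Keep only positions where both profile elements are finite."""
    return [(x, y) for x, y in zip(a, b)
            if math.isfinite(x) and math.isfinite(y)]

def pearson(a, b):
    """Centered correlation over pairwise-finite elements."""
    pairs = _finite_pairs(a, b)
    n = len(pairs)
    if n < 2:
        return float("nan")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)

def cosine(a, b):
    """Uncentered similarity over pairwise-finite elements."""
    pairs = _finite_pairs(a, b)
    dot = sum(x * y for x, y in pairs)
    norm_x = math.sqrt(sum(x * x for x, _ in pairs))
    norm_y = math.sqrt(sum(y * y for _, y in pairs))
    if norm_x == 0 or norm_y == 0:
        return float("nan")
    return dot / (norm_x * norm_y)

def transform_profile(r_values, kind):
    """Map per-unit Pearson scores to a squared, absolute, or signed profile."""
    if kind == "squared":
        return [r * r for r in r_values]
    if kind == "absolute":
        return [abs(r) for r in r_values]
    if kind == "signed":
        return list(r_values)
    raise ValueError(f"unknown profile kind: {kind}")

# Toy profiles over four units; the last model-side element is nonfinite.
model_brain = [0.1, -0.2, 0.3, float("nan")]
brain_brain = [0.2, -0.1, 0.4, 0.5]
r_profile = pearson(model_brain, brain_brain)
cos_profile = cosine(model_brain, brain_brain)
```

In the toy vectors the finite brain-derived values are the model-derived values plus a constant offset, so the Pearson similarity is at its maximum while the cosine similarity is lower. This shows why the centering distinction in the text matters.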

For each profile\-similarity cell, the real model was compared with matched profile controls generated from the same event grids and target order\. The fixed controls include matched\-dimensionality random features, autocorrelation\-matched random features, circular shifts, context\-reset and reversed\-context features, layer\-label permutations, token\-order shuffles, and within\-story block shuffles when the corresponding representation files are available\. The summary table uses the maximum profile similarity among available controls as the conservative control contrast for that cell; profile controls are deterministic or seeded by the fixed configuration, with no repeated resampling of null draws\. The response\-profile delta is $\Delta_{\mathrm{profile}} = s_{\mathrm{real}} - \max_{c} s_{c}$, where $s$ is the configured profile\-similarity metric\. A response\-profile cell can pass when the real profile similarity and control similarity are finite, the target order matches, and $\Delta_{\mathrm{profile}} > 0$\. For manuscript\-facing summary rows, these profile deltas are then averaged over the sampled unit and subject\-run rows in the corresponding dataset–model–layer–candidate\-quantity contrast; isolated positive target rows are descriptive within their local target scope\. A valid brain ceiling is a requirement for the separate reliability\-bounded response\-profile magnitude criterion\.

Brain reliability ceilings were computed from brain\-derived profile vectors\. Split\-half reliability uses the split\-half brain\-ceiling values stored with the brain pattern table\. Run\-to\-run, subject\-to\-subject, and session\-to\-session reliability, when available, are pairwise Pearson similarities between brain pattern vectors sharing the same dataset, region group, profile\-similarity metric, and unit key while differing in the named grouping variable\. Method\-specific reliability is the mean of available pairwise values, with a 500\-sample percentile bootstrap interval\. A ceiling is valid only when reliability is finite and at least 0\.10\. Ceiling\-normalized response\-profile summaries use $f_{\mathrm{ceiling}} = s_{\mathrm{real}} / r_{\mathrm{brain}}$ and $\Delta f_{\mathrm{ceiling}} = \Delta_{\mathrm{profile}} / r_{\mathrm{brain}}$\. Negative or missing reliabilities are retained as invalid ceilings\. The reliability\-bounded response\-profile criterion additionally requires $f_{\mathrm{ceiling}} \geq 0.50$, a positive profile\-control delta, and a passing response\-profile cell\. Predictive uncertainty summaries use 1000 bootstrap samples over participant means\. When a participant contributed multiple runs, runs were first averaged within participant and the bootstrap resampled participants at the participant level\. When only one participant was retained, participant\-cluster inference was marked as unavailable\. Feature\-ablation double\-dissociation tests use 1000 bootstrap and 1000 sign/permutation samples where applicable\. False\-discovery\-rate values use Benjamini–Hochberg correction over the configured dataset\-by\-candidate\-quantity\-by\-model\-by\-region family and are interpreted with $q < 0.05$\.
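The ceiling validity rule and the reliability-bounded criterion can be sketched as a single check. The thresholds (0.10 and 0.50) are the ones stated above. The function name, argument names, and returned dictionary layout are illustrative assumptions.

```python
import math

def ceiling_summary(s_real, delta_profile, r_brain, cell_passes,
                    min_reliability=0.10, min_fraction=0.50):
    """Ceiling-normalize a response-profile result and apply the
    reliability-bounded criterion.

    s_real: real-model profile similarity
    delta_profile: s_real minus the maximum control similarity
    r_brain: brain reliability ceiling (may be None, negative, or nonfinite)
    cell_passes: whether the response-profile cell itself passed
    """
    valid = (r_brain is not None and math.isfinite(r_brain)
             and r_brain >= min_reliability)
    if not valid:
        # Invalid ceilings are retained but cannot support the criterion.
        return {"valid_ceiling": False, "f_ceiling": None,
                "delta_f_ceiling": None, "passes": False}
    f_ceiling = s_real / r_brain
    delta_f_ceiling = delta_profile / r_brain
    passes = bool(cell_passes and delta_profile > 0
                  and f_ceiling >= min_fraction)
    return {"valid_ceiling": True, "f_ceiling": f_ceiling,
            "delta_f_ceiling": delta_f_ceiling, "passes": passes}

supported = ceiling_summary(0.30, 0.05, 0.50, cell_passes=True)
low_ceiling = ceiling_summary(0.30, 0.05, 0.05, cell_passes=True)
```

Treating low or negative reliabilities as invalid, instead of dividing by them, avoids inflated ratios when the brain-derived profile itself is not repeatable.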

### 2\.7 Model\-side feature\-ablation analyses

The feature\-ablation analyses recomputed held\-out scores after feature zeroing, train\-only residualization, train\-only projection removal, layer ablation, context reset, or reversed\-context transformation\. These operations differ in their interpretation\. Feature zeroing changes the available feature dimensions directly\. Residualization and projection removal remove variance associated with a candidate quantity while preserving train\-only fitting\. Layer ablation tests dependence on layer\-specific representations\. Context reset and reversed context test sensitivity to the model’s contextual history\. For this reason, ablation results are interpreted by operation and candidate quantity before they are summarized across the decision table\.

For a candidate quantity $m$, the feature\-ablation delta was $\Delta_{m} = s_{\mathrm{real}} - s_{\mathrm{ablated}(m)}$, where $s$ is the held\-out predictive score after the same blocked cross\-validation policy\. The feature\-specificity index was $\mathrm{FSI}_{m} = \Delta_{m} - \overline{\Delta}_{\neg m}$, where $\overline{\Delta}_{\neg m}$ is the mean ablation delta for nonmatching target quantities within the same dataset, subject\-run, region group, time window, model, layer, and ablation method\. Double\-dissociation tests compared candidate quantities such as surprisal, semantic transition, dependency integration, syntactic boundary, discourse boundary, and context update\. They provide diagnostic evidence about selectivity\. Neural\-system intervention claims are outside the scope of these model\-side perturbations\. A positive ablation delta is a diagnostic result; integrated interpretation requires predictive matched\-control support and response\-profile correspondence as separate evidence levels\.
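The ablation delta and feature-specificity index can be worked through on toy numbers. The sketch adopts one reading of the definition, namely that the comparison term averages the ablation deltas of the other candidate quantities within the same analysis cell. That reading, the function names, and the scores are assumptions for illustration.

```python
from statistics import mean

def ablation_deltas(s_real, s_ablated):
    """Delta_m = s_real - s_ablated(m) for each candidate quantity m.

    s_ablated maps a candidate-quantity name to the held-out score obtained
    after ablating that quantity (same cross-validation policy assumed).
    """
    return {m: s_real - score for m, score in s_ablated.items()}

def feature_specificity_index(deltas, m):
    """FSI_m = Delta_m minus the mean delta of the other quantities.

    Assumes all deltas come from one dataset / subject-run / region /
    window / model / layer / method cell.
    """
    others = [d for name, d in deltas.items() if name != m]
    if not others:
        return float("nan")
    return deltas[m] - mean(others)

# Illustrative (made-up) held-out scores for one analysis cell.
deltas = ablation_deltas(
    0.040,
    {"surprisal": 0.025, "semantic_transition": 0.036,
     "syntactic_boundary": 0.038},
)
fsi_surprisal = feature_specificity_index(deltas, "surprisal")
```

In this toy cell, ablating surprisal lowers the score more than the average of the other ablations, so its index is positive. As the text notes, such a value is diagnostic of selectivity and does not by itself support a neural-mechanism claim.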

### 2\.8 Positive controls and uncertainty summaries

Positive controls were stratified by the kind of sensitivity they test\. Brain–brain reliability estimates test whether the derived neural data contain repeatable signal\. Brain\-as\-model controls test whether response\-profile machinery can recover brain\-derived organization\. Low\-level neural and acoustic checks test component\-level sensitivity to timing\-linked structure where source media and event grids permit the check\. The implanted\-signal simulations are integrated engineered controls: a single strong\-signal row verifies the implementation under a known signal, and a stochastic graded calibration estimates detection probability across candidate\-signal strengths\. In the stochastic calibration, synthetic brain responses were generated as $y = \beta x_{\mathrm{implant}} + \epsilon$ for the implanted latent feature in the affected units; the plotted strength is the coefficient $\beta$ on a unit\-variance latent scale, with synthetic observation noise added separately\. The 80% threshold is therefore an implementation\-scale threshold in this engineered parameterization, outside the scale of empirical neural signal\-to\-noise ratios or empirical predictive and ablation deltas\. Podcast brain\-as\-model summaries are interpreted as auxiliary because they are brain\-derived assay checks and are limited to the available subject\-to\-subject Podcast profile coverage\.
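A graded implanted-signal calibration of the kind described above can be sketched with a simple simulation. Only the generative form (a unit-variance latent feature scaled by a coefficient, plus noise) and the 80% detection level come from the text. The detection rule (a fixed correlation threshold), the sample sizes, the noise level, and the strength grid are illustrative assumptions and not the manuscript's settings.

```python
import math
import random

def pearson_r(x, y):
    """Plain Pearson correlation for two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def detection_probability(beta, n_samples=200, n_sims=200,
                          r_threshold=0.2, noise_sd=1.0, seed=0):
    """Fraction of simulations in which the implanted signal is 'detected'.

    Synthetic responses follow y = beta * x + noise with a unit-variance
    latent x. Detection here means the x-y correlation exceeds r_threshold,
    which is a stand-in for whatever decision rule a pipeline applies.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        y = [beta * xi + rng.gauss(0.0, noise_sd) for xi in x]
        if pearson_r(x, y) > r_threshold:
            hits += 1
    return hits / n_sims

# Calibration curve over a hypothetical grid of implanted strengths.
grid = [0.0, 0.1, 0.2, 0.3, 0.5, 1.0]
curve = {beta: detection_probability(beta) for beta in grid}
# Smallest grid strength detected in at least 80% of simulations.
threshold_beta = next((b for b in grid if curve[b] >= 0.80), None)
```

Because the resulting threshold depends on the chosen noise level, sample size, and detection rule, it is meaningful only inside the simulation's own parameterization, which matches the caution in the text.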

Dataset\-level uncertainty summaries aggregate over participant\-run or derived target units before bootstrapping when those units are available\. Supplementary Table 31 reports the aggregation path\. The source inventory retains subject, run, stimulus, and brain\-unit coverage, and the predictive score table retains subject/session/run/stimulus columns\. The participant\-run export preserves subject and run identifiers for predictive contrasts where the join is available\. The matched predictive inference retained 26 Brain Treebank subject\-run units from 10 participants, 44 MEG\-MASC subject\-run units from 11 participants, and 8 Podcast ECoG subject\-run units from 8 participants\. Participant\-cluster bootstrap resampled participant means after within\-participant run averaging\. The integrated summary table joins predictive, response\-profile, feature\-ablation, ceiling, matched\-control, and replication summaries by dataset, candidate quantity, model, layer, region group, and time window\. This table is treated as a coverage\-summary table; population inference is limited to contrasts with retained participant\-cluster coverage\. Degenerate, low\-unit, or low\-participant intervals are labeled as coverage limitations\. Supplementary Table 18 reports the interval\-bound summaries\. No smallest effect size of interest was preregistered, so the summaries report observed bounds and excluded\-effect limits without formal equivalence tests\. Positive upper interval bounds indicate compatibility with positive model\-specific effects up to that bound\. Resampling, permutation\-style diagnostics, and false\-discovery\-rate correction follow standard bootstrap, nonparametric neuroimaging, and Benjamini\-Hochberg logic\[[5](https://arxiv.org/html/2606.26880#bib.bib15),[19](https://arxiv.org/html/2606.26880#bib.bib43),[1](https://arxiv.org/html/2606.26880#bib.bib1)\]\. 
Software versions, model identifiers, random\-seed policy, checksums, and provenance records are documented in the Supplementary Information\.
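The Benjamini–Hochberg adjustment referenced above follows the standard step-up procedure, sketched here for a single family of p-values. The p-values are made up, and how rows are grouped into families is left to the configuration described in the text.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values

# Illustrative (made-up) p-values for one correction family.
p_family = [0.01, 0.04, 0.03, 0.20]
q_family = benjamini_hochberg(p_family)
significant = [q < 0.05 for q in q_family]
```

With these toy values only the smallest p-value survives the q < 0.05 rule. The number of tests in the family changes every adjusted value, so the family definition needs to be fixed before the results are inspected.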

### 2\.9 Human\-participant and secondary\-data ethics

This study analyzed de\-identified, previously released neural and stimulus\-derived data and included no new human\-participant recruitment or new human\-data collection\. As reported in the source publications, Brain Treebank experiments were approved by the Boston Children’s Hospital/Harvard Institutional Review Board and were conducted with subjects’ informed consent\[[33](https://arxiv.org/html/2606.26880#bib.bib10)\]; Podcast ECoG participants provided oral and written informed consent, with study approval from the Institutional Review Boards at New York University Langone Medical Center and Princeton University\[[39](https://arxiv.org/html/2606.26880#bib.bib57)\]; and MEG\-MASC participants provided written informed consent under approval from the Institutional Review Board ethics committee of New York University Abu Dhabi\[[8](https://arxiv.org/html/2606.26880#bib.bib21)\]\. Raw neural datasets and stimulus media were not redistributed and must be obtained from the original repositories or data owners under the terms set by those providers\.

## 3 Results

### 3\.1 Positive predictivity and evidence levels

We first summarize where language\-model features produced positive neural prediction \(Figure[1](https://arxiv.org/html/2606.26880#S3.F1)\)\. We then evaluate how this signal changes under matched controls, participant\-level summaries, response\-profile tests, feature\-ablation diagnostics, and calibration checks\. This ordering keeps the positive predictive result as the empirical anchor and separates it from stronger evidence for model\-specific predictive advantage, response\-profile organization, computation\-specific sensitivity, and reliability\-bounded response\-profile magnitude\. The source tables preserve experimental coverage across models, layers, datasets, controls, response\-profile metrics, feature\-ablation diagnostics, reliability summaries, and positive controls; participant\-level inference is reported for contrasts that retained participant or participant\-run identifiers\.

![Refer to caption](https://arxiv.org/html/2606.26880v1/x1.png)

Figure 1: Positive information\-bearing predictivity and participant\-level scope\. \(A\) Analysis path and evidence ladder\. \(B\) Participant\-level raw Pearson\-$r$ after within\-participant run averaging; intervals are participant\-cluster bootstrap summaries\. \(C\) Participant\-level gain over the nuisance baseline\. \(D\) Predictive\-only criterion rows by dataset, model, layer, and candidate quantity; dot area gives passed configurations, fill gives participant consistency, and labelled empty facets denote zero passed rows among evaluable rows\. \(E\) Participant consistency among predictive\-only configurations; dashed lines mark medians\.

![Refer to caption](https://arxiv.org/html/2606.26880v1/x2.png)

Figure 2: Predictive coverage and matched controls\. \(A\) Dataset inventory and branch availability in the matched derived data; Ready denotes branch coverage, including reliability\-bounded profile rows where available, and NE denotes unavailable complete\-chain coverage\. \(B\) Participant\-level model\-control means summarize dataset\-level scope; labels give participants/subject\-run units retained after complete matching\. \(C\) Independent control\-family contrasts show mean Pearson\-$r$ real\-minus\-control deltas before the most\-competitive\-control reduction; point size gives the positive\-row fraction, and the circular\-shift contrast is slightly positive\. \(D\) Model parameter scale and predictive\-only row counts show passed rows across GPT\-2, Pythia, and Qwen families, with no single family dominating\.

### 3\.2 Language\-model features show heterogeneous neural predictivity

Across the three primary datasets, language\-model features produced widespread positive held\-out prediction\. In the Pearson predictive\-control source table, 5541 of 11232 rows had a positive raw model score, and 5819 of 11232 rows improved over the nuisance baseline before matched\-control comparison was considered\. These source\-level counts describe the prevalence of information\-bearing predictivity and are separate from participant\-level inferential units\. Under the controlled predictive\-only criterion, which additionally required the configured matched\-control contrast at the summary\-row level, 67 of 432 evaluable predictive rows were retained: 38 of 144 in Brain Treebank and 29 of 144 in Podcast ECoG\. MEG\-MASC contributed no predictive\-only summary row\. Participant\-level mean raw Pearson\-$r$ summaries were small after averaging runs within participant: Brain Treebank had a mean of $0.0058$ with a bootstrap interval of $[-0.0026, 0.0141]$, MEG\-MASC had a mean of $-0.0001$ with interval $[-0.0025, 0.0026]$, and Podcast ECoG had a mean of $0.0033$ with interval $[-0.0048, 0.0123]$\. Participant\-level nuisance\-baseline gains were similarly bounded: $-0.0270$ for Brain Treebank, $0.0077$ for MEG\-MASC, and $-0.0144$ for Podcast ECoG, with all corresponding bootstrap intervals crossing zero\. Information\-bearing predictivity was therefore widespread in source\-level summaries and remained detectable in a subset of controlled configurations, while the participant\-level averages, whose intervals all include zero, bound what can be claimed at the population level\.
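The participant\-level summaries above average runs within each participant and then resample participants. A minimal sketch of that kind of participant\-cluster percentile bootstrap is given below, under stated assumptions: the function name `participant_cluster_bootstrap`, the number of resamples, the 95% level, and the toy scores are all illustrative and are not taken from the released analysis code.

```python
import numpy as np

def participant_cluster_bootstrap(scores, participants, n_boot=2000, alpha=0.05, seed=0):
    """Mean of participant-level means with a percentile bootstrap over participants.

    scores:       one held-out Pearson r per run (hypothetical input layout)
    participants: participant label for each run
    """
    scores = np.asarray(scores, dtype=float)
    participants = np.asarray(participants)
    ids = np.unique(participants)
    # Step 1: average runs within each participant.
    per_participant = np.array([scores[participants == p].mean() for p in ids])
    # Step 2: resample participants (the clusters) with replacement.
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(ids), size=(n_boot, len(ids)))
    boot_means = per_participant[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return per_participant.mean(), (lo, hi)

# Toy data: two runs for each of three hypothetical participants.
toy_scores = [0.010, 0.030, -0.020, 0.000, 0.004, 0.012]
toy_labels = ["p1", "p1", "p2", "p2", "p3", "p3"]
mean_r, (ci_lo, ci_hi) = participant_cluster_bootstrap(toy_scores, toy_labels)
print(f"mean r = {mean_r:.4f}, interval = [{ci_lo:.4f}, {ci_hi:.4f}]")
```

Resampling whole participants rather than individual runs keeps runs from the same participant together, so the interval reflects between\-participant variability.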

Other components showed the same lower\-level pattern\. In the all\-metric response\-profile table, 313411 of 539136 metric cells had positive raw model\-to\-brain profile similarity\. In the model\-side feature\-ablation table, 2704 of 3348 source rows had positive raw ablation deltas\. Finally, 195632 of 539136 ceiling\-normalized response\-profile metric cells reached a raw fraction of ceiling of at least 0\.25 before the model–control delta was considered\. Together, these counts show that language\-model\-derived quantities can annotate neural responses and produce positive local summaries\. The subsequent contrasts specify how far those summaries travel toward representational and computational interpretation\.

### 3\.3 Predictive information is heterogeneous across datasets and model configurations

The 67 predictive\-only rows were distributed across the analyzed model set\. Qwen3\-1\.7B contributed 10 rows; DistilGPT\-2, Pythia\-160M, Pythia\-410M, and GPT\-2 each contributed 9; Qwen2\.5\-0\.5B\-Instruct and GPT\-2 Medium each contributed 8; and Qwen2\.5\-1\.5B\-Instruct contributed 5\. The layer distribution was similarly broad, with 26 rows in final\-layer features, 23 in middle\-layer features, and 18 in embedding\-layer features\. The retained rows were concentrated in two candidate quantities in the matched data, with 35 semantic transition rows and 32 context update rows\. Figure [1](https://arxiv.org/html/2606.26880#S3.F1)D–E shows the model, layer, candidate\-quantity, and participant\-consistency pattern; Figure [2](https://arxiv.org/html/2606.26880#S3.F2)D summarizes row counts by model parameter scale and family; and Supplementary Table 32 provides the compact table\. The controlled positives support localized configuration\-level predictive information with limited participant consistency across datasets\.

The contrast\-status summaries in Figure [1](https://arxiv.org/html/2606.26880#S3.F1)D use status labels only for evaluable rows\. A passed cell means that at least one summary row met the specified contrast rule\. Dataset\-level population support is evaluated separately through participant\-aware summaries\. Unavailable cells mark missing matched coverage\.

### 3\.4 Local predictive information is heterogeneous at the participant level

The most\-competitive\-control comparison paired each real model summary with nuisance features and the strongest matched control available for the same row \(Figure [2](https://arxiv.org/html/2606.26880#S3.F2)\)\. The leading control family varied across rows: circular\-shift controls led in 1097 rows, autocorrelation\-matched random controls in 1108 rows, random matched\-dimensionality controls in 633 rows, circular\-shifted language\-model controls in 669 rows, and layer\-label permutation controls in 566 rows, with remaining rows assigned to token\-order, within\-story\-block, sentence\-reset, or reversed\-context controls\. Family\-specific contrasts and control\-family\-removal checks provide complementary views of the same matched\-control question\.
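The row\-wise reduction described above can be sketched as follows. This is an illustration under assumptions, not the released implementation: the function name `most_competitive_control_delta`, the family keys, the use of NaN for unavailable controls, and the toy scores are all hypothetical.

```python
import numpy as np

def most_competitive_control_delta(real, controls):
    """Real-minus-strongest-control delta per row.

    real:     array of real-model scores, one per row
    controls: dict mapping control-family name -> array of scores,
              with NaN where that family is unavailable for a row
    Returns the delta per row and the name of the leading family per row.
    """
    families = list(controls)
    stacked = np.vstack([np.asarray(controls[f], dtype=float) for f in families])
    # Treat unavailable controls as -inf so they never win the row-wise max.
    filled = np.where(np.isnan(stacked), -np.inf, stacked)
    best_idx = filled.argmax(axis=0)
    best = filled.max(axis=0)
    # Rows with no available control at all get NaN rather than -inf.
    best = np.where(np.isinf(best), np.nan, best)
    delta = np.asarray(real, dtype=float) - best
    leading = [families[i] if not np.isnan(b) else None
               for i, b in zip(best_idx, best)]
    return delta, leading

# Toy example with two hypothetical control families and three rows.
real_scores = [0.05, 0.02, 0.01]
control_scores = {
    "circular_shift": [0.03, np.nan, np.nan],
    "random_matched_dim": [0.04, 0.06, np.nan],
}
delta, leading = most_competitive_control_delta(real_scores, control_scores)
print(delta, leading)
```

Because the maximum is taken over whichever families are available for a row, a real\-model score can be positive in isolation and still yield a negative delta.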

At the dataset level, the most\-competitive matched\-control predictive comparison localized the controlled positives rather than showing a uniform participant\-level advantage\. Brain Treebank had a mean participant\-cluster predictive delta of $-0.0673$ across 26 retained subject\-run units from 10 participants, with a participant\-cluster interval of $[-0.2147, -0.0008]$\. MEG\-MASC had a mean participant\-cluster predictive delta of $-0.0627$ across 44 subject\-run units from 11 participants, with an interval of $[-0.2226, -0.0038]$\. Podcast ECoG had a mean participant\-level predictive delta of $-0.0440$ across 8 subject\-run units from 8 participants, with an interval of $[-0.2012, 0.0000]$\. The distinction between local summary positives and dataset\-level means is the main heterogeneity result\. Because no smallest effect size of interest was prespecified, these results provide interval bounds; formal equivalence tests would require a prespecified effect threshold\.

The independent family\-specific contrasts in Figure [2](https://arxiv.org/html/2606.26880#S3.F2)C were at or below zero for four of the five displayed control families: random matched\-dimensionality, autocorrelation\-matched random, layer\-label permutation, and reversed\-context controls\. The circular\-shift contrast was slightly positive \($0.0023$\), with 8973 of 32832 rows positive; the participant\-level most\-competitive\-control summaries stayed below zero\. The modality and time\-window summary yielded the same coverage\-bounded pattern\. For Pearson\-$r$ predictive rows, Brain Treebank used the broad 0–1000 ms ECoG summary and had a mean most\-competitive\-control delta of $-0.0780$\. Podcast ECoG used the broad 0–1000 ms ECoG summary and had a mean most\-competitive\-control delta of $-0.0440$\. MEG\-MASC used the broad 100–250 ms MEG summary and had a mean Pearson delta of $-0.0627$\. These summaries document the available modality and time\-window coverage; anatomical or latency\-specific hypotheses require finer retained target strata\.

![Refer to caption](https://arxiv.org/html/2606.26880v1/x3.png)

Figure 3: Response\-profile and reliability\-bounded profile evidence\. \(A\) Raw model\-to\-brain profile similarity is positive; best matched controls are stronger on average\. \(B\) Dataset\-level matched profile\-control deltas are below the most competitive matched controls across all three primary datasets; intervals denote target\-profile coverage summaries\. \(C\) Brain\-as\-model positive controls show median metric\-cell deltas versus shuffled unit order; bars show central metric\-cell intervals and labels give positive row fractions\. \(D\) Reliability\-bounded fraction\-of\-ceiling deltas after matched profile controls\.

### 3\.5 Positive profile similarity localizes the evidence level

The response\-profile analysis asked whether model\-to\-brain profiles reproduce organization across sampled neural units beyond individual\-target prediction\. Raw model\-to\-brain profile similarities were positive in many metric cells, as noted above\. Matched profile deltas were available for all three primary datasets in the matched derived data, and the dataset\-level mean real\-minus\-best\-control profile deltas were below the most competitive matched controls\. Brain Treebank had a mean real model\-to\-brain profile similarity of $0.1028$, a mean best\-control similarity of $0.3815$, and a mean delta of $-0.2787$\. MEG\-MASC had corresponding values of $0.1124$, $0.4019$, and $-0.2895$\. Podcast ECoG had corresponding values of $0.1225$, $0.3077$, and $-0.1852$\.
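As a quick consistency check on the values reported above, each dataset\-level delta equals the mean real similarity minus the mean best\-control similarity. The short sketch below only redoes that arithmetic from the reported means; the dictionary name and layout are illustrative, and it assumes the two means were computed over the same rows.

```python
# Reported dataset-level means: (real similarity, best-control similarity).
reported_profile_means = {
    "Brain Treebank": (0.1028, 0.3815),
    "MEG-MASC": (0.1124, 0.4019),
    "Podcast ECoG": (0.1225, 0.3077),
}

# Delta = real minus best control, rounded to the four reported decimals.
profile_deltas = {
    name: round(real - best_control, 4)
    for name, (real, best_control) in reported_profile_means.items()
}

for name, delta in profile_deltas.items():
    print(f"{name}: {delta:+.4f}")
```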

This pattern localizes the evidence level\. The tested representations produced positive raw profile similarities, whereas the matched profile\-control contrast placed shared cross\-neural\-unit organization beyond the current dataset\-mean evidence\. Other relationships between language models and neural responses remain outside what this contrast tests\.

### 3\.6 Model\-side ablations reveal widespread sensitivity

Model\-side feature\-ablation summaries were available for Brain Treebank and Podcast ECoG in the matched participant\-run data\. Brain Treebank had a mean raw feature\-ablation delta of $0.0948$, with 2133 of 2700 table\-row deltas positive\. Podcast ECoG had a mean raw feature\-ablation delta of $0.0967$, with 571 of 648 table\-row deltas positive\. These positive deltas show that candidate\-quantity ablations can change held\-out neural prediction\. In the present evidence framework, selective computational correspondence is a higher evidence level that additionally requires predictive matched\-control support, feature\-specificity support, and aligned response\-profile evidence\.

The interpretation differs by candidate quantity\. Surprisal remains useful as a predictive annotation of contextual predictability\. Semantic transition deltas are diagnostic for semantic\-state coding hypotheses\. Dependency integration deltas are diagnostic for syntactic or combinatorial hypotheses\. Boundary\-related deltas are diagnostic for localized boundary\-detector interpretations\. Context update deltas are diagnostic for shared model–brain context\-accumulation interpretations\. These outcomes support diagnostic sensitivity, with strong computational correspondence requiring evidence beyond the current feature\-ablation summaries\.

### 3\.7 Reliability ceilings bound response\-profile magnitude

Reliability normalization calibrated the profile signal against reliable brain\-derived profile structure\. After aggregation to the ceiling table, the brain–brain reliability summaries contained 36 dataset\-by\-method rows, of which 18 met the configured 0\.10 reliability minimum\. Ceiling\-normalized response\-profile rows were available for all three primary datasets\. The mean fraction\-of\-ceiling deltas were $-0.2829$ for Brain Treebank, $-0.3093$ for MEG\-MASC, and $-0.8823$ for Podcast ECoG after matched profile controls\. These results locate the positive raw profile similarities below the reliability\-bounded evidence level and keep them separate from predictive and feature\-ablation contrasts\.
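One common way to express a profile delta relative to a reliability ceiling is to divide by the brain–brain reliability and to skip rows whose reliability falls below a minimum. The sketch below illustrates that idea only; the exact normalization used in the analysis is not specified in this section, so the formula, the function name `fraction_of_ceiling_delta`, and the toy inputs are assumptions. Only the 0\.10 minimum is taken from the text.

```python
import numpy as np

def fraction_of_ceiling_delta(real_sim, control_sim, ceiling, min_reliability=0.10):
    """(real - control) / ceiling, with NaN where the ceiling is too unreliable.

    Assumed normalization for illustration; not the paper's confirmed formula.
    """
    real_sim = np.asarray(real_sim, dtype=float)
    control_sim = np.asarray(control_sim, dtype=float)
    ceiling = np.asarray(ceiling, dtype=float)
    usable = ceiling >= min_reliability          # rows meeting the reliability minimum
    out = np.full(real_sim.shape, np.nan)
    out[usable] = (real_sim[usable] - control_sim[usable]) / ceiling[usable]
    return out

# Toy rows: the second row has a ceiling below the 0.10 minimum and is excluded.
toy_delta = fraction_of_ceiling_delta(
    real_sim=[0.10, 0.10],
    control_sim=[0.30, 0.30],
    ceiling=[0.50, 0.05],
)
print(toy_delta)
```

Under this kind of normalization a small reliability ceiling magnifies the delta, which is one reason a reliability minimum is applied before dividing.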

### 3\.8 Brain\-derived and implanted signals validate component\-level sensitivity

The control analyses tested whether the analysis could recover signal in settings with expected structure\. The most complete integrated check was the implanted\-signal simulation\. Its single strong\-signal row produced a predictive score of $0.9867$, a predictive delta of $0.8945$, a response\-profile delta of $1.6154$, a model\-side feature\-ablation delta of $0.2659$, and a fraction of ceiling of $0.9911$\. A stochastic graded version of this synthetic check is shown in Figure [4](https://arxiv.org/html/2606.26880#S3.F4)C\. Repeating each candidate\-signal strength across 100 seeded random draws yielded an interpolated 80% integrated\-detection threshold of 1\.49\. The strength axis is the synthetic coefficient $\beta$ in the engineered latent model; 1\.49 is therefore an implementation\-scale detection threshold and is not expressed in empirical neural effect\-size units\. This engineered control shows recovery of a known signal once it is strong enough in the synthetic setting, with real\-data sensitivity bounded by the available participant\-run and derived target units\.
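The calibration logic above (a detection rate per signal strength from repeated seeded draws, a Wilson interval on each rate, and an interpolated 80% crossing) can be sketched as follows. The grid of strengths and the detection counts are invented for illustration, and linear interpolation between neighbouring grid points is an assumption about how the threshold was obtained; the function names are hypothetical.

```python
import math
import numpy as np

def wilson_interval(successes, n, z=1.959963984540054):
    """Wilson score interval for a binomial proportion (z defaults to 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def interpolated_threshold(strengths, detection_rates, target=0.80):
    """Smallest strength at which the rate first reaches `target`,
    using linear interpolation between adjacent grid points."""
    s = np.asarray(strengths, dtype=float)
    r = np.asarray(detection_rates, dtype=float)
    for i in range(1, len(s)):
        if r[i - 1] < target <= r[i]:
            frac = (target - r[i - 1]) / (r[i] - r[i - 1])
            return s[i - 1] + frac * (s[i] - s[i - 1])
    return None  # target never crossed on this grid

# Invented example: detections out of 100 seeded implants per strength.
toy_strengths = [0.5, 1.0, 1.5, 2.0]
toy_detections = [5, 40, 70, 90]
toy_rates = [d / 100 for d in toy_detections]

for beta, d in zip(toy_strengths, toy_detections):
    lo, hi = wilson_interval(d, 100)
    print(f"beta={beta:.1f}: rate={d / 100:.2f}, Wilson 95% = [{lo:.3f}, {hi:.3f}]")
print("interpolated 80% threshold:", interpolated_threshold(toy_strengths, toy_rates))
```

The resulting threshold is in units of the synthetic coefficient, so it calibrates the pipeline rather than estimating a neural effect size.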

The brain\-derived controls support component\-level sensitivity \(Figure [3](https://arxiv.org/html/2606.26880#S3.F3)C\)\. Across primary and secondary reliability checks, all 111 valid brain–brain reliability rows exceeded the configured minimum reliability of 0\.10\. After aggregation by dataset and method, 18 of 36 reliability summaries met the same threshold\. The brain\-as\-model panel reports median deltas against shuffled brain\-unit order\. Brain Treebank showed positive run\-to\-run and subject\-to\-subject median deltas of $0.621$ and $0.113$, with positive\-row fractions of 94\.3% and 68\.9%\. MEG\-MASC showed positive run\-to\-run and subject\-to\-subject median deltas of $0.690$ and $0.636$, with positive\-row fractions of 97\.1% and 96\.1%\. Podcast ECoG subject\-to\-subject brain\-as\-model profiles were weaker but positive and are interpreted as auxiliary: the median delta was $0.034$, with 50\.5% positive rows\.

Low\-level and acoustic checks showed source\-limited sensitivity to timing\-linked signals\. Brain Treebank word\-onset checks exceeded the configured criterion in 5 of 30 table rows in the 0–100 ms window and 4 of 30 table rows in the 100–300 ms window\. MEG\-MASC word\-onset and word\-rate checks also produced partial evidence\. WAV\-derived acoustic\-envelope readouts were available for MEG\-MASC local audio and Podcast ECoG standalone and full\-length checks\. The full\-length Podcast acoustic check exceeded the configured criterion in 6 of 8 readable subject\-run rows, with a median delta over the best control of $0.005433$\. Brain Treebank is retained as a source\-limited case for acoustic\-envelope analysis because rights\-cleared waveform or movie source files were unavailable\.

### 3\.9 Robustness analyses preserve the same pattern

Robustness analyses summarize how the results change across evidence levels \(Figure [4](https://arxiv.org/html/2606.26880#S3.F4)D\)\. The evaluable predictive summary table contains 432 rows: 144 rows in each primary dataset\. These rows are coverage summaries over tested model–dataset–layer combinations, separate from participant counts and complete\-chain tests\. In the matched derived data, response\-profile rows were matched for all three datasets, participant\-run feature\-ablation rows were matched for Brain Treebank and Podcast ECoG, and remaining unavailable contrasts lacked matched coverage\. Thirty\-eight Brain Treebank rows and 29 Podcast ECoG rows were labeled predictive\-only by the configured decision rule\. Complete co\-indexing across all required contrasts remains a future coverage requirement, so Figure [4](https://arxiv.org/html/2606.26880#S3.F4)D marks those chains as NE\.

Leave\-one\-dataset\-out checks, reliability\-threshold sweeps, fraction\-of\-ceiling sweeps, positive\-direction sweeps, most\-competitive\-control aggregation sweeps, and single\-control\-family removals preserved the same qualitative pattern for real language\-model rows\. Some relaxed rules created stage\-level positives by removing or weakening a relevant contrast\. Jointly indexed predictive, response\-profile, feature\-ablation, reliability\-bounded, matched\-control, and replication evidence for the same row remains a future coverage requirement\. Information\-bearing predictivity and diagnostic ablation sensitivity were reproducible across these checks, while response\-profile organization and candidate\-computation correspondence occupied higher evidence levels under the specified contrasts\.

![Refer to caption](https://arxiv.org/html/2606.26880v1/x4.png)

Figure 4: Feature\-ablation, specificity, and calibration\. \(A\) Model\-side ablation deltas for Brain Treebank and Podcast ECoG candidate quantities\. \(B\) Stage\-level summaries with distinct denominators\. \(C\) Stochastic implanted\-signal calibration with 100 seeded implants per strength; the band gives Wilson 95% intervals, and dashed lines mark 80% detection and the threshold of 1\.49\. \(D\) Robustness status matrix; P denotes at least one passed row, NS evaluated without support, and NE unavailable matched coverage\.

## 4 Discussion

### 4\.1 Heterogeneous predictivity during naturalistic comprehension

Language\-model features provided heterogeneous, information\-bearing neural predictivity during naturalistic comprehension\. Positive held\-out prediction and nuisance\-baseline gains were widespread, a subset of controlled configurations met the predictive\-only criterion, and model\-side ablations frequently altered neural prediction\. These results show where language\-model features act as useful neural annotations and what additional evidence is needed for model\-specific, representational, or computation\-specific interpretations\.

These results are tied to the analyzed datasets and fixed representation files\. The analysis therefore treats predictive usefulness, response\-profile correspondence, candidate\-computation specificity, and reliability\-bounded response\-profile magnitude as different claims\. A positive local encoding score supports predictive usefulness; stronger statements require additional matched\-control, profile, ablation\-specific, and reliability\-bounded evidence\.

### 4\.2 Local information and shared organization are empirically dissociable

Naturalistic language contains structure at many time scales\. Word onsets, word rate, acoustic envelope, sentence position, discourse progression, lexical frequency, token predictability, and temporal autocorrelation can all help a flexible readout predict neural measurements\. A high\-dimensional language\-model representation may capture some of this structure while leaving cross\-unit organization unresolved\. Matched controls and response\-profile contrasts clarify what kind of theoretical conclusion the data can support\.

The response\-profile results were especially informative\. The tested models had positive raw profile similarities in many rows, and the matched profile deltas were lower than the most competitive matched control in all three primary datasets\. This pattern is consistent with predictive or annotational usefulness and locates shared cross\-neural\-unit response\-profile organization at a higher evidence level than the current contrasts support\.

### 4\.3 Ablation sensitivity provides diagnostic evidence about candidate quantities

The feature\-ablation results sharpen model\-based accounts of language comprehension\. Surprisal remains a useful candidate annotation of contextual predictability, and prior work links surprisal to neural responses during natural speech and reading\[[35](https://arxiv.org/html/2606.26880#bib.bib32),[17](https://arxiv.org/html/2606.26880#bib.bib33)\]\. In the present analysis, surprisal\-related ablation effects indicated sensitivity of fitted neural readouts to model\-side information removal\. Selective dependence on the tested language\-model surprisal computation would require additional support from the predictive matched\-control contrast and feature\-specificity diagnostics\.

The same logic applies to the other tested quantities\. Semantic transition effects remain diagnostic for semantic\-state coding hypotheses\. Dependency integration effects remain diagnostic for syntactic or combinatorial hypotheses\. Syntactic and discourse boundary effects remain diagnostic for localized boundary\-detector interpretations\. Context update effects remain diagnostic for shared model–brain context\-accumulation interpretations\. These quantities remain theoretically meaningful, and strong computational correspondence requires evidence beyond raw or diagnostic ablation sensitivity\.

### 4\.4 Implications for brain–AI model evaluation

The findings are compatible with prior reports that language\-model features predict neural responses during comprehension\[[23](https://arxiv.org/html/2606.26880#bib.bib48),[7](https://arxiv.org/html/2606.26880#bib.bib19),[3](https://arxiv.org/html/2606.26880#bib.bib11),[12](https://arxiv.org/html/2606.26880#bib.bib27)\]\. Those studies often ask whether model\-derived quantities contain information useful for explaining neural or behavioral measurements\. The present analyses add a complementary inferential question: whether the same comparisons support claims about response\-profile organization or candidate language computations after matched controls, source\-coverage checks, ablation diagnostics, and reliability\-bounded response\-profile interpretation\.

Future brain–AI model evaluation should separately report positive held\-out prediction, nuisance\-baseline gain, matched\-control advantage, response\-profile organization, computation\-specificity diagnostics, and reliability\-bounded response\-profile magnitude\. This separation preserves the scientific value of language\-model features as neural annotations and keeps model\-specific representational claims tied to the evidence required to support them\.
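One way to keep these quantities separate in practice is to store them as distinct fields and to label each one independently. The sketch below is a hypothetical reporting structure, not part of the released code: the class name `EvidenceReport`, the field names, and the "greater than zero" rule are assumptions, and the P/NS/NE labels are borrowed from the robustness status matrix in Figure 4D.

```python
from dataclasses import dataclass, fields
from typing import Dict, Optional

@dataclass
class EvidenceReport:
    """One value per evidence level; None means the contrast was not evaluable."""
    raw_score: Optional[float] = None               # positive held-out prediction
    nuisance_gain: Optional[float] = None           # gain over nuisance baseline
    matched_control_delta: Optional[float] = None   # real minus strongest control
    profile_delta: Optional[float] = None           # response-profile organization
    ablation_delta: Optional[float] = None          # computation-specificity diagnostic
    ceiling_fraction_delta: Optional[float] = None  # reliability-bounded magnitude

    def status(self) -> Dict[str, str]:
        """Label each level as P (supported), NS (evaluated, unsupported), or NE."""
        def label(value: Optional[float]) -> str:
            if value is None:
                return "NE"
            # Illustrative rule only: a real analysis would apply its
            # configured decision rule and interval criteria here.
            return "P" if value > 0 else "NS"
        return {f.name: label(getattr(self, f.name)) for f in fields(self)}

# Hypothetical row: positive prediction, but no matched-control advantage,
# and no profile, ablation, or ceiling coverage.
row = EvidenceReport(raw_score=0.012, nuisance_gain=0.004, matched_control_delta=-0.04)
print(row.status())
```

Keeping "not evaluable" distinct from "evaluated without support" avoids reading missing coverage as negative evidence.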

### 4\.5 Relation to prior control analyses

Hadidi and colleagues establish that methodological choices and stimulus\-related variables can inflate neural predictivity, and that positional signals and word rate can perform competitively with trained language models in widely used datasets\[[9](https://arxiv.org/html/2606.26880#bib.bib2)\]\. The present findings build on this concern while retaining the observed information\-bearing predictivity of model features\. Substantial information\-bearing predictivity and localized predictive\-only positives appeared alongside separate tests of response\-profile organization and computation\-specific correspondence\.

The two studies therefore address different inferential transitions: robustness of prediction and interpretation of surviving prediction\. The analyses here separate predictive information, cross\-neural\-unit response\-profile correspondence, candidate\-computation ablation sensitivity, and reliability\-bounded response\-profile magnitude\.

### 4\.6 Limitations and future tests

Several scope limits remain\. The analysis relies on a fixed derived\-data set and contains no new prospective data collection\. Dataset coverage is uneven across evidence levels\. Participant\-run predictive summaries cover 10 Brain Treebank participants and 26 subject\-run units, 11 MEG\-MASC participants and 44 subject\-run units, and 8 Podcast ECoG participants and 8 subject\-run units, which are still modest samples for participant\-cluster inference\. Response\-profile deltas and ceiling\-normalized profile summaries use target\-profile grids, while feature\-ablation closure is currently available for Brain Treebank and Podcast ECoG through matchable derived grids\. These contrast\-specific scopes define coverage limits without population\-equivalence estimates\.

The positive controls also have contrast\-specific scope\. The implanted\-signal calibration in Figure [4](https://arxiv.org/html/2606.26880#S3.F4)C repeats stochastic synthetic implants across candidate\-signal strengths and estimates the strength needed for high\-probability detection\. This engineered simulation calibrates the implementation\-level detection path; participant\-level detection power over real participants, regions, or stimulus samples remains unresolved\.

The model inventory is bounded to the fixed representation files\. Larger open models, paired base and instruction\-tuned models, and prospective holdout applications could change the evidence landscape\. Future work should also broaden MEG\-MASC contrast\-level participant\-run coverage and response\-profile coverage, prespecify a smallest effect size of interest before formal equivalence testing, add finer anatomical and latency\-specific analyses, and estimate participant\-level detection sensitivity where source data permit those analyses\.

### 4\.7 Conclusion

This study identifies heterogeneous neural predictivity from language models during naturalistic comprehension\. Across three datasets, language\-model features produced widespread positive held\-out prediction, nuisance\-baseline gains, localized predictive\-only effects, and sensitivity to model\-side ablation\. These results support language\-model features as informative neural annotations in the analyzed derived data\. The same analyses define the additional conditions needed for claims about model\-specific advantage, shared cross\-neural\-unit response organization, and shared language\-processing computations\.

## Data and Code Availability Statements

Derived result tables, analysis code, manuscript tables, and reproducibility scripts are available through an OSF view\-only repository: [https://osf\.io/7s84h/overview?view\_only=e4a83f8cece44c90af40417820c8acd8](https://osf.io/7s84h/overview?view_only=e4a83f8cece44c90af40417820c8acd8)\. Raw neural datasets and stimulus media are not redistributed and remain governed by their original data providers\. Upon acceptance, the repository record will be made public and assigned or updated with a persistent accession or DOI\.

## AI\-Assisted Tools Disclosure

AI\-assisted tools were used during manuscript preparation only for manuscript grammar and clarity checks and for code\-quality review\. The author reviewed and verified all final text, analyses, interpretations, citations, code, and submission materials, and accepts full responsibility for the content\.

## Supplementary Material

Supplementary Information follows the references in this arXiv source package\.

## References

- \[1\] \(1995\) Controlling the false discovery rate: a practical and powerful approach to multiple testing\. Journal of the Royal Statistical Society: Series B 57, pp\. 289–300\. External Links: [Document](https://dx.doi.org/10.1111/j.2517-6161.1995.tb02031.x)\. Cited by: [§2\.8](https://arxiv.org/html/2606.26880#S2.SS8.p2.1)\.
- \[2\] T\. B\. Brown et al\. \(2020\) Language models are few\-shot learners\. In Advances in Neural Information Processing Systems, Vol\. 33, pp\. 1877–1901\. External Links: 2005\.14165, [Link](https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html)\. Cited by: [§1](https://arxiv.org/html/2606.26880#S1.p3.1)\.
- \[3\] C\. Caucheteux and J\. King \(2022\) Brains and algorithms partially converge in natural language processing\. Communications Biology 5, pp\. 134\. External Links: [Document](https://dx.doi.org/10.1038/s42003-022-03036-1)\. Cited by: [§1](https://arxiv.org/html/2606.26880#S1.p1.1), [§4\.4](https://arxiv.org/html/2606.26880#S4.SS4.p1.1)\.
- \[4\] J\. Devlin, M\. Chang, K\. Lee, and K\. Toutanova \(2019\) BERT: pre\-training of deep bidirectional transformers for language understanding\. In Proceedings of NAACL\-HLT, pp\. 4171–4186\. External Links: [Document](https://dx.doi.org/10.18653/v1/N19-1423), [Link](https://aclanthology.org/N19-1423/)\. Cited by: [§1](https://arxiv.org/html/2606.26880#S1.p3.1)\.
- \[5\] B\. Efron and R\. J\. Tibshirani \(1994\) An introduction to the bootstrap\. Chapman and Hall/CRC\. External Links: [Document](https://dx.doi.org/10.1201/9780429246593)\. Cited by: [§2\.8](https://arxiv.org/html/2606.26880#S2.SS8.p2.1)\.
- \[6\] R\. Futrell et al\. \(2021\) The Natural Stories corpus: a reading\-time corpus of English texts containing rare syntactic constructions\. Language Resources and Evaluation 55, pp\. 63–77\. External Links: [Document](https://dx.doi.org/10.1007/s10579-020-09503-7)\. Cited by: [§2\.1](https://arxiv.org/html/2606.26880#S2.SS1.p4.1)\.
- \[7\] A\. Goldstein et al\. \(2022\) Shared computational principles for language processing in humans and deep language models\. Nature Neuroscience 25, pp\. 369–380\. External Links: [Document](https://dx.doi.org/10.1038/s41593-022-01026-4)\. Cited by: [§1](https://arxiv.org/html/2606.26880#S1.p1.1), [§4\.4](https://arxiv.org/html/2606.26880#S4.SS4.p1.1)\.
- \[8\] L\. Gwilliams et al\. \(2023\) Introducing MEG\-MASC: a high\-quality magneto\-encephalography dataset for evaluating natural speech processing\. Scientific Data 10, pp\. 862\. External Links: [Document](https://dx.doi.org/10.1038/s41597-023-02752-5)\. Cited by: [§2\.1](https://arxiv.org/html/2606.26880#S2.SS1.p1.1), [§2\.1](https://arxiv.org/html/2606.26880#S2.SS1.p3.1), [§2\.9](https://arxiv.org/html/2606.26880#S2.SS9.p1.1)\.
- \[9\] N\. Hadidi, E\. Feghhi, B\. H\. Song, I\. A\. Blank, and J\. C\. Kao \(2026\) Spurious alignment between large language models and brains can emerge from non\-robust methods and overlooked confounds\. Nature Communications\. External Links: [Document](https://dx.doi.org/10.1038/s41467-026-72253-7)\. Cited by: [§1](https://arxiv.org/html/2606.26880#S1.p4.1), [§4\.5](https://arxiv.org/html/2606.26880#S4.SS5.p1.1)\.
- \[10\] T\. Hastie, R\. Tibshirani, and J\. Friedman \(2009\) The elements of statistical learning\. 2nd edition, Springer\. External Links: [Document](https://dx.doi.org/10.1007/978-0-387-84858-7)\. Cited by: [§2\.4](https://arxiv.org/html/2606.26880#S2.SS4.p1.1)\.
- \[11\] A\. E\. Hoerl and R\. W\. Kennard \(1970\) Ridge regression: biased estimation for nonorthogonal problems\. Technometrics 12, pp\. 55–67\. External Links: [Document](https://dx.doi.org/10.1080/00401706.1970.10488634)\. Cited by: [§2\.4](https://arxiv.org/html/2606.26880#S2.SS4.p1.1)\.
- \[12\] E\. A\. Hosseini, M\. Schrimpf, Y\. Zhang, S\. R\. Bowman, N\. Zaslavsky, and E\. Fedorenko \(2024\) Artificial neural network language models predict human brain responses to language even after a developmentally realistic amount of training\. Neurobiology of Language 5\(1\), pp\. 43–63\. External Links: [Document](https://dx.doi.org/10.1162/nol%5Fa%5F00137)\. Cited by: [§1](https://arxiv.org/html/2606.26880#S1.p1.1), [§4\.4](https://arxiv.org/html/2606.26880#S4.SS4.p1.1)\.
- \[13\] A\. G\. Huth, W\. A\. de Heer, T\. L\. Griffiths, F\. E\. Theunissen, and J\. L\. Gallant \(2016\) Natural speech reveals the semantic maps that tile human cerebral cortex\. Nature 532, pp\. 453–458\. External Links: [Document](https://dx.doi.org/10.1038/nature17637)\. Cited by: [§1](https://arxiv.org/html/2606.26880#S1.p1.1)\.
- \[14\] S\. Jain, V\. A\. Vo, L\. Wehbe, and A\. G\. Huth \(2024\) Computational language modeling and the promise of in silico experimentation\. Neurobiology of Language 5\(1\), pp\. 80–106\. External Links: [Document](https://dx.doi.org/10.1162/nol%5Fa%5F00101)\. Cited by: [§1](https://arxiv.org/html/2606.26880#S1.p1.1)\.
- \[15\] N\. Kriegeskorte, W\. K\. Simmons, P\. S\. F\. Bellgowan, and C\. I\. Baker \(2009\) Circular analysis in systems neuroscience: the dangers of double dipping\. Nature Neuroscience 12, pp\. 535–540\. External Links: [Document](https://dx.doi.org/10.1038/nn.2303)\. Cited by: [§2\.4](https://arxiv.org/html/2606.26880#S2.SS4.p1.1)\.
- \[16\] J\. Li et al\. \(2022\) Le Petit Prince multilingual naturalistic fMRI corpus\. Scientific Data 9, pp\. 530\. External Links: [Document](https://dx.doi.org/10.1038/s41597-022-01625-7)\. Cited by: [§2\.1](https://arxiv.org/html/2606.26880#S2.SS1.p4.1)\.
- \[17\] J\. A\. Michaelov, M\. D\. Bardolph, C\. K\. Van Petten, B\. K\. Bergen, and S\. Coulson \(2024\) Strong prediction: language model surprisal explains multiple N400 effects\. Neurobiology of Language 5\(1\), pp\. 107–135\. External Links: [Document](https://dx.doi.org/10.1162/nol%5Fa%5F00105)\. Cited by: [§4\.3](https://arxiv.org/html/2606.26880#S4.SS3.p1.1)\.
- \[18\] S\. A\. Nastase et al\. \(2021\) The Narratives fMRI dataset for evaluating models of naturalistic language comprehension\. Scientific Data 8, pp\. 250\. External Links: [Document](https://dx.doi.org/10.1038/s41597-021-01033-3)\. Cited by: [§2\.1](https://arxiv.org/html/2606.26880#S2.SS1.p4.1)\.
- \[19\] T\. E\. Nichols and A\. P\. Holmes \(2002\) Nonparametric permutation tests for functional neuroimaging: a primer with examples\. Human Brain Mapping 15, pp\. 1–25\. External Links: [Document](https://dx.doi.org/10.1002/hbm.1058)\. Cited by: [§2\.8](https://arxiv.org/html/2606.26880#S2.SS8.p2.1)\.
- \[20\] A\. M\. Olszewska, M\. Gaca, D\. Drozdziel, B\. Kossowski, A\. M\. Herman, and A\. Marchewka \(2025\) LEARNING BRAIN: a longitudinal dataset on neural plasticity dynamics in Braille and music training\. Data set, OpenNeuro\. Note: OpenNeuro ds007022 v1\.1\.0\. External Links: [Link](https://openneuro.org/datasets/ds007022/versions/1.1.0)\. Cited by: [§2\.1](https://arxiv.org/html/2606.26880#S2.SS1.p4.1)\.
- \[21\] F\. Pedregosa et al\. \(2011\) Scikit\-learn: machine learning in Python\. Journal of Machine Learning Research 12, pp\. 2825–2830\. External Links: [Link](https://jmlr.org/papers/v12/pedregosa11a.html)\. Cited by: [§2\.4](https://arxiv.org/html/2606.26880#S2.SS4.p1.1)\.
- \[22\]F\. Pereiraet al\.\(2018\)Toward a universal decoder of linguistic meaning from brain activation\.Nature Communications9,pp\. 963\.External Links:[Document](https://dx.doi.org/10.1038/s41467-018-03068-4)Cited by:[§1](https://arxiv.org/html/2606.26880#S1.p1.1)\.
- \[23\]M\. Schrimpfet al\.\(2021\)The neural architecture of language: integrative modeling converges on predictive processing\.Proceedings of the National Academy of Sciences118,pp\. e2105646118\.External Links:[Document](https://dx.doi.org/10.1073/pnas.2105646118)Cited by:[§1](https://arxiv.org/html/2606.26880#S1.p1.1),[§4\.4](https://arxiv.org/html/2606.26880#S4.SS4.p1.1)\.
- \[24\]C\. Shain, H\. Kean, C\. Casto, B\. Lipkin, J\. Affourtit, M\. Siegelman, F\. Mollica, and E\. Fedorenko\(2024\)Distributed sensitivity to syntax and semantics throughout the language network\.Journal of Cognitive Neuroscience36\(7\),pp\. 1427–1471\.External Links:[Document](https://dx.doi.org/10.1162/jocn%5Fa%5F02164)Cited by:[§1](https://arxiv.org/html/2606.26880#S1.p4.1)\.
- \[25\]M\. Stone\(1974\)Cross\-validatory choice and assessment of statistical predictions\.Journal of the Royal Statistical Society: Series B36,pp\. 111–133\.External Links:[Document](https://dx.doi.org/10.1111/j.2517-6161.1974.tb00994.x)Cited by:[§2\.4](https://arxiv.org/html/2606.26880#S2.SS4.p1.1)\.
- \[26\]M\. Thye, P\. Hoffman, and D\. Mirman\(2024\)“All the stars will be wells with a rusty pulley”: neural processing of the social and pragmatic content in a narrative\.Journal of Cognitive Neuroscience36\(11\),pp\. 2495–2517\.External Links:[Document](https://dx.doi.org/10.1162/jocn%5Fa%5F02228)Cited by:[§1](https://arxiv.org/html/2606.26880#S1.p4.1)\.
- \[27\]M\. Toneva and L\. Wehbe\(2019\)Interpreting and improving natural\-language processing in machines with natural language\-processing in the brain\.InAdvances in Neural Information Processing Systems,Vol\.32\.External Links:1905\.11833,[Link](https://papers.nips.cc/paper/7987-interpreting-and-improving-natural-language-processing-in-machines-with-natural-language-processing-in-the-brain)Cited by:[§1](https://arxiv.org/html/2606.26880#S1.p1.1)\.
- \[28\]G\. Tuckuteet al\.\(2024\)Driving and suppressing the human language network using large language models\.Nature Human Behaviour8,pp\. 544–561\.External Links:[Document](https://dx.doi.org/10.1038/s41562-023-01783-7)Cited by:[§1](https://arxiv.org/html/2606.26880#S1.p1.1)\.
- \[29\]C\. L\. van der Burght, A\. D\. Friederici, M\. Maran, G\. Papitto, E\. Pyatigorskaya, J\. A\. M\. Schroën, P\. C\. Trettenbrein, and E\. Zaccarella\(2023\)Cleaning up the brickyard: how theory and methodology shape experiments in cognitive neuroscience of language\.Journal of Cognitive Neuroscience35\(12\),pp\. 2067–2088\.External Links:[Document](https://dx.doi.org/10.1162/jocn%5Fa%5F02058)Cited by:[§1](https://arxiv.org/html/2606.26880#S1.p4.1)\.
- \[30\]S\. Varma and R\. Simon\(2006\)Bias in error estimation when using cross\-validation for model selection\.BMC Bioinformatics7,pp\. 91\.External Links:[Document](https://dx.doi.org/10.1186/1471-2105-7-91)Cited by:[§2\.4](https://arxiv.org/html/2606.26880#S2.SS4.p1.1)\.
- \[31\]G\. Varoquauxet al\.\(2017\)Assessing and tuning brain decoders: cross\-validation, caveats, and guidelines\.NeuroImage145,pp\. 166–179\.External Links:[Document](https://dx.doi.org/10.1016/j.neuroimage.2016.10.038)Cited by:[§2\.4](https://arxiv.org/html/2606.26880#S2.SS4.p1.1)\.
- \[32\]A\. Vaswaniet al\.\(2017\)Attention is all you need\.InAdvances in Neural Information Processing Systems,Vol\.30\.External Links:1706\.03762,[Link](https://papers.nips.cc/paper/7181-attention-is-all-you-need)Cited by:[§1](https://arxiv.org/html/2606.26880#S1.p3.1)\.
- \[33\]C\. Wang, A\. U\. Yaari, A\. K\. Singh, V\. Subramaniam, D\. Rosenfarb, J\. DeWitt, P\. Misra, J\. R\. Madsen, S\. Stone, G\. Kreiman, B\. Katz, I\. Cases, and A\. Barbu\(2024\)Brain treebank: large\-scale intracranial recordings from naturalistic language stimuli\.InAdvances in Neural Information Processing Systems 37, Datasets and Benchmarks Track,Note:OpenReview: KZlJF8kguO; arXiv:2411\.08343External Links:2411\.08343,[Document](https://dx.doi.org/10.52202/079017-3060),[Link](https://openreview.net/forum?id=KZlJF8kguO)Cited by:[§2\.1](https://arxiv.org/html/2606.26880#S2.SS1.p1.1),[§2\.1](https://arxiv.org/html/2606.26880#S2.SS1.p3.1),[§2\.9](https://arxiv.org/html/2606.26880#S2.SS9.p1.1)\.
- \[34\]L\. Wehbeet al\.\(2014\)Simultaneously uncovering the patterns of brain regions involved in different story reading subprocesses\.PLOS ONE9,pp\. e112575\.External Links:[Document](https://dx.doi.org/10.1371/journal.pone.0112575)Cited by:[§1](https://arxiv.org/html/2606.26880#S1.p1.1)\.
- \[35\]H\. Weissbart, K\. D\. Kandylaki, and T\. Reichenbach\(2020\)Cortical tracking of surprisal during continuous speech comprehension\.Journal of Cognitive Neuroscience32\(1\),pp\. 155–166\.External Links:[Document](https://dx.doi.org/10.1162/jocn%5Fa%5F01467)Cited by:[§1](https://arxiv.org/html/2606.26880#S1.p4.1),[§4\.3](https://arxiv.org/html/2606.26880#S4.SS3.p1.1)\.
- \[36\]A\. Yanget al\.\(2024\)Qwen2\.5 technical report\.Note:arXiv:2412\.15115External Links:2412\.15115,[Link](https://arxiv.org/abs/2412.15115)Cited by:[§1](https://arxiv.org/html/2606.26880#S1.p3.1)\.
- \[37\]A\. Yanget al\.\(2025\)Qwen3 technical report\.Note:arXiv:2505\.09388External Links:2505\.09388,[Link](https://arxiv.org/abs/2505.09388)Cited by:[§1](https://arxiv.org/html/2606.26880#S1.p3.1)\.
- \[38\]T\. Yarkoni and J\. Westfall\(2017\)Choosing prediction over explanation in psychology: lessons from machine learning\.Perspectives on Psychological Science12,pp\. 1100–1122\.External Links:[Document](https://dx.doi.org/10.1177/1745691617693393)Cited by:[§2\.4](https://arxiv.org/html/2606.26880#S2.SS4.p1.1)\.
- \[39\]Z\. Zadaet al\.\(2025\)The Podcast ECoG dataset for modeling neural activity during natural language comprehension\.Scientific Data12,pp\. 1135\.External Links:[Document](https://dx.doi.org/10.1038/s41597-025-05462-2)Cited by:[§2\.1](https://arxiv.org/html/2606.26880#S2.SS1.p1.1),[§2\.1](https://arxiv.org/html/2606.26880#S2.SS1.p3.1),[§2\.9](https://arxiv.org/html/2606.26880#S2.SS9.p1.1)\.

## Supplementary Information

## Supplementary Overview

This Supplementary Information documents the reproducibility and coverage materials supporting the manuscript\. The main manuscript evaluates what cognitive\-neuroscientific inferences language\-model neural predictivity supports during naturalistic language comprehension\. This supplement provides the table index, source\-coverage notes, criterion implementation, and supplementary figures needed to interpret the manuscript\.

The supplement should be read together with the derived CSV tables available through the OSF data\-and\-code repository\. Raw neural data and stimulus media are not redistributed\. Terminology matches the main manuscript: predictive, response\-profile, feature\-ablation, reliability\-bounded profile, and integrated evidence levels\.

## Supplementary Figures

![Refer to caption](https://arxiv.org/html/2606.26880v1/x5.png)

Figure S1: Criterion\-threshold sensitivity matrix\. Threshold and rule sweeps summarize criterion status under relaxed rules; component\-level positives are reported separately from jointly indexed integrated coverage\.

![Refer to caption](https://arxiv.org/html/2606.26880v1/x6.png)

Figure S2: Control\-family ablation\. Labels report stage positives to complete\-chain positives\. Single\-family removal leaves the integrated interpretation outside jointly indexed complete\-chain coverage\.

![Refer to caption](https://arxiv.org/html/2606.26880v1/x7.png)

Figure S3: Model inventory and coverage limits\. The included\-feature\-store label identifies models retained in the fixed representation\-file set\. The final summary table is bounded to fixed representation files and matched analysis rows\.

![Refer to caption](https://arxiv.org/html/2606.26880v1/x8.png)

Figure S4: Integrated\-coverage heatmap\. The integrated table retains 504 real\-model coverage rows, of which 432 are evaluable predictive summary rows in the matched derived data\.

![Refer to caption](https://arxiv.org/html/2606.26880v1/x9.png)

Figure S5: Predictive delta distributions by dataset and control family\. Dataset\-level mean predictive contrasts are nonpositive, although a subset of evaluable summary rows meets the predictive\-only criterion\.

![Refer to caption](https://arxiv.org/html/2606.26880v1/x10.png)

Figure S6: Model\-side feature\-ablation uncertainty by candidate quantity\. Intervals are bootstrap intervals over source\-row summaries, not participant\-level population intervals\. Positive ablation deltas are diagnostic\-level evidence; ablation\-supported computational correspondence requires predictive matched\-control evidence, feature\-specificity diagnostics, and response\-profile evidence for the integrated claim\.

![Refer to caption](https://arxiv.org/html/2606.26880v1/x11.png)

Figure S7: Positive\-control and source\-coverage atlas\. The horizontal dot plot collects brain–brain reliability, brain\-as\-model profile controls, word\-onset and word\-rate checks, full\-length Podcast acoustic\-envelope checks, and the stochastic implanted\-signal calibration\. The brain–brain reliability row reports both total table rows and valid\-row coverage; other labels use the denominators of the corresponding source tables\. The atlas makes auxiliary checks and source\-coverage boundaries visible; component\-level positives are interpreted at their corresponding evidence level\.
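
The Figure S6 caption describes bootstrap intervals computed over source-row summaries rather than over participants. As a rough illustration of that kind of interval, the sketch below resamples row-level deltas with replacement and takes percentile bounds on the resampled mean. The function name, the resample count, the seed, and the example values are all assumptions made for illustration; the values are synthetic and are not taken from the supplementary tables, and the authors' figures are generated by their own scripts, which may use a different procedure.

```python
import random
import statistics


def bootstrap_interval(row_deltas, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of row-level deltas.

    Rows are the resampling unit, so the interval describes variability
    across source-row summaries, not across participants.
    """
    if not row_deltas:
        raise ValueError("need at least one row summary")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(row_deltas)
    means = []
    for _ in range(n_resamples):
        # Draw n rows with replacement and record the resampled mean.
        resample = [row_deltas[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(resample))
    means.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return statistics.fmean(row_deltas), means[lo_idx], means[hi_idx]


# Synthetic ablation deltas (illustrative only, not values from the paper).
example_deltas = [0.012, -0.004, 0.020, 0.007, 0.001, 0.015, -0.002, 0.009]
point, lower, upper = bootstrap_interval(example_deltas)
print(f"mean delta = {point:.4f}, interval = [{lower:.4f}, {upper:.4f}]")
```

Because rows from the same participant are treated as exchangeable here, an interval of this kind should not be read as a population-level statement, which matches the caveat in the caption.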
## Supplementary Table S0: Figure Reproducibility Index

Supplementary Table S0 is provided as a machine\-readable CSV at:

```
anc/figure_source_data/Supplementary_Table_S0_figure_reproducibility_index.csv
```

The table maps each main\-figure and atlas panel to its manuscript\-facing source\-data CSV, source tables, generation script, seed policy, input directory, output files, and MD5 checksums for the source\-data and output files\. This panel\-level index connects the visual claims in Figures 1–4 and Supplementary Figure S7 to the exact derived tables and R outputs\.
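
Since the index records MD5 checksums for source-data and output files, a reader could re-check a local copy against it. The following is a minimal sketch of such a check using only the Python standard library. The column names `source_data_csv` and `source_data_md5` are hypothetical placeholders, since the actual header of Supplementary Table S0 is not reproduced in this document; they would need to be replaced with the real column names.

```python
import csv
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_index(index_csv, base_dir,
                 path_col="source_data_csv",   # hypothetical column name
                 md5_col="source_data_md5"):   # hypothetical column name
    """Compare recorded checksums with files found under base_dir.

    Returns a list of (relative_path, status) pairs, where status is
    "ok", "mismatch", or "missing".
    """
    results = []
    with open(index_csv, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            target = Path(base_dir) / row[path_col]
            if not target.is_file():
                status = "missing"
            elif md5_of_file(target) == row[md5_col].strip().lower():
                status = "ok"
            else:
                status = "mismatch"
            results.append((row[path_col], status))
    return results


# Example call (paths are placeholders for a local copy of the files):
# for path, status in verify_index("index.csv", "figure_source_data"):
#     print(status, path)
```

Reporting a status per row, rather than stopping at the first failure, makes it easier to see at a glance which panels can be regenerated from the files at hand.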

## Key Supplementary Tables Displayed in This PDF

The detailed machine\-readable CSV files remain the authoritative table exports\. The compact tables below reproduce the submission\-facing content needed to read the Methods and Results alongside the source files\.

### Supplementary Table 3\. Binary criterion definitions, alternatives addressed, and interpretation

### Supplementary Table 4\. Matched\-control families

### Supplementary Tables 17–18\. Dataset\-level uncertainty and interval\-bound audit

### Supplementary Table 26\. Model inventory summary

Published or model\-card nominal parameter counts are shown\. The detailed model\-inventory CSV separately preserves registry parameter estimates and local checkpoint tensor counts for provenance; manuscript parameter counts use the published or model\-card values\.

### Supplementary Table 29\. Manuscript\-facing implementation settings

### Supplementary Table 30\. Dataset\-specific neural targets

### Supplementary Table 31\. Aggregation\-path audit

### Supplementary Table 32\. Predictive\-only enrichment and participant\-consistency coverage

The corresponding machine\-readable tables are available through the OSF data\-and\-code repository cited in the main text\.

## Supplementary Table Index

## Detailed CSV Directory

Detailed row\-level tables are available through the OSF data\-and\-code repository cited in the main text\. The arXiv package includes lightweight figure source\-data exports as ancillary files under:

The positive\-information, enrichment, and model\-inventory tables used in Figure 1, Figure 2D, and Supplementary Table 32 are provided as:

- main\_table\_positive\_information\_subject\_run\.csv
- main\_table\_positive\_information\_participant\_values\.csv
- main\_table\_positive\_information\_participant\_summary\.csv
- main\_table\_predictive\_only\_enrichment\_by\_dataset\_model\_layer\.csv
- main\_table\_predictive\_only\_enrichment\_overview\.csv
- main\_table\_predictive\_only\_participant\_consistency\.csv
- main\_table\_model\_inventory\.csv
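
Before rerunning any analysis, it can help to confirm that all seven listed tables are present in a local download. The sketch below does only that. The file names are copied from the list above, while the helper name and the directory argument are assumptions; the directory that holds these files is not stated in this document.

```python
from pathlib import Path

# File names as listed in the supplement.
EXPECTED_TABLES = [
    "main_table_positive_information_subject_run.csv",
    "main_table_positive_information_participant_values.csv",
    "main_table_positive_information_participant_summary.csv",
    "main_table_predictive_only_enrichment_by_dataset_model_layer.csv",
    "main_table_predictive_only_enrichment_overview.csv",
    "main_table_predictive_only_participant_consistency.csv",
    "main_table_model_inventory.csv",
]


def missing_tables(directory, expected=EXPECTED_TABLES):
    """Return the expected table names that are not files in `directory`."""
    root = Path(directory)
    return [name for name in expected if not (root / name).is_file()]


# Example call; "." stands in for wherever the tables were downloaded.
print(missing_tables("."))
```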

## Claim Scope

The supplement preserves the same scope as the main manuscript\. Positive controls establish component\-level sensitivity\. Claims about language\-model response\-profile organization or candidate\-computation correspondence require the calibrated contrasts applied to the tested model rows\. Mean participant\-run predictive model–control contrasts were nonpositive across the three primary datasets, although 67 of 432 evaluable predictive summary rows met the predictive\-only criterion\. Integrated mechanism\-specific or reliability\-bounded support was not established by the available contrasts\. The integrated interpretation required jointly indexed measurements across all contrasts, which were not available in the current dataset coverage\. Learning Brain records remain validation\-only in this manuscript, with longitudinal plasticity and brain–LLM co\-plasticity outside the current claim scope\.
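
The counts quoted in this supplement can be turned into proportions, and a small synthetic example shows how a nonpositive mean contrast can coexist with individual rows that exceed a threshold. In the sketch below, the counts 67, 432, and 504 come from the text and the Figure S4 caption; the `toy_deltas` values are invented for illustration, and the simple `> 0` test is only a stand-in for the paper's controlled predictive-only criterion, which is defined in Supplementary Table 3 and is not reproduced here.

```python
import statistics


def proportion(count, total):
    """Return count / total, guarding against an empty denominator."""
    if total <= 0:
        raise ValueError("total must be positive")
    return count / total


# Counts reported in the supplement.
evaluable_share = proportion(432, 504)  # evaluable rows among coverage rows
criterion_share = proportion(67, 432)   # rows meeting the predictive-only criterion

print(f"evaluable rows: {evaluable_share:.1%}")        # 85.7%
print(f"predictive-only rows: {criterion_share:.1%}")  # 15.5%

# Synthetic model-minus-control deltas (NOT data from the paper).
toy_deltas = [-0.03, -0.02, 0.01, -0.01, 0.02, -0.04]
mean_delta = statistics.fmean(toy_deltas)
positive_rows = sum(delta > 0 for delta in toy_deltas)  # stand-in criterion

print(f"toy mean delta: {mean_delta:.4f}")  # negative on average
print(f"toy rows above zero: {positive_rows} of {len(toy_deltas)}")
```

The toy case mirrors the pattern described above: the average contrast is below zero, yet a minority of rows still clears the threshold, so a row-level count and a dataset-level mean answer different questions.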
