Fine-tuning language encoding models on slow fMRI improves prediction for fast ECoG

arXiv cs.CL 05/20/26, 04:00 AM Papers
neuroscience fmri ecog language-encoding fine-tuning encoding-models
Summary
This paper demonstrates that fine-tuning language encoding models on fMRI data improves their ability to predict neural activity from ECoG recordings, despite fMRI's lower temporal resolution. The findings suggest that abundant 'slow' fMRI data can enhance models for 'fast' ECoG data.
arXiv:2605.19224v1 Announce Type: new Abstract: Neuroscientists have recently turned to intracranial brain recording methods, like electrocorticography (ECoG), for human experiments because of the fine spatial and temporal resolution that they afford. Models trained on this data, however, are fundamentally restricted by the patient populations that can receive the implants necessary for recording. We propose using non-invasive fMRI to bridge the gap in training data. Using spoken language representations fine-tuned on fMRI, we build encoding models of ECoG. These representations showed improved prediction performance in ECoG, even though the temporal resolution of fMRI is two orders of magnitude worse. Prediction improved in frequency bands well beyond what is directly measured in fMRI. Next, to test the procedure's generalization ability, we fine-tuned models on fMRI responses that were temporally downsampled by a factor of 2. Despite the loss in resolution, these models were able to predict fMRI and ECoG responses at levels comparable to the original fMRI-tuned models. Finally, we showed that ECoG performance steadily scales with the amount of fMRI-tuning data. Our results show that "slow" data like fMRI can be a valuable resource for building better models of "fast" brain data like ECoG. In the future, integrating across multiple recording methods may further improve performance in other applications, like decoding.
# Fine-tuning language encoding models on slow fMRI improves prediction for fast ECoG
Source: [https://arxiv.org/html/2605.19224](https://arxiv.org/html/2605.19224)
Aditya R. Vaidya, Department of Computer Science, The University of Texas at Austin, USA (avaidya@utexas.edu)

Richard J. Antonello, Zuckerman Mind Brain Behavior Institute, Columbia University, USA (rja2163@columbia.edu)

Alexander G. Huth, Departments of Neuroscience and Statistics, University of California, Berkeley, USA (alex.huth@berkeley.edu)

###### Abstract

Neuroscientists have recently turned to intracranial brain recording methods, like electrocorticography \(ECoG\), for human experiments because of the fine spatial and temporal resolution that they afford\. Models trained on this data, however, are fundamentally restricted by the patient populations that can receive the implants necessary for recording\. We propose using non\-invasive fMRI to bridge the gap in training data\. Using spoken language representations fine\-tuned on fMRI, we build encoding models of ECoG\. These representations showed improved prediction performance in ECoG, even though the temporal resolution of fMRI is two orders of magnitude worse\. Prediction improved in frequency bands well beyond what is directly measured in fMRI\. Next, to test the procedure’s generalization ability, we fine\-tuned models on fMRI responses that were temporally downsampled by a factor of 2\. Despite the loss in resolution, these models were able to predict fMRI and ECoG responses at levels comparable to the original fMRI\-tuned models\. Finally, we showed that ECoG performance steadily scales with the amount of fMRI\-tuning data\. Our results show that “slow” data like fMRI can be a valuable resource for building better models of “fast” brain data like ECoG\. In the future, integrating across multiple recording methods may further improve performance in other applications, like decoding\.

## 1 Introduction

Many neuroscience experiments use data recorded from intracranial electrodes that have been surgically implanted in human patients as part of clinical treatment. In electrocorticography (ECoG), the electrodes are arranged in grids that are applied directly to the surface of the brain. Electrodes in such close proximity to the brain tissue record neural activity with high temporal and spatial precision, enabling the construction of detailed encoding models that predict brain activity elicited by a stimulus. These encoding models can then be used to answer neuroscience research questions Mesgarani et al. ([2014](https://arxiv.org/html/2605.19224#bib.bib25)); Keshishian et al. ([2026](https://arxiv.org/html/2605.19224#bib.bib15)) or in downstream applications like brain-computer interfaces (BCIs) Tang et al. ([2023](https://arxiv.org/html/2605.19224#bib.bib37)); Littlejohn et al. ([2025](https://arxiv.org/html/2605.19224#bib.bib20)).

The downside of intracranial recordings is that datasets are small and rare. ECoG electrodes are only implanted when clinically necessary, and are typically removed within a few days Chang ([2015](https://arxiv.org/html/2605.19224#bib.bib3)); Mercier et al. ([2022](https://arxiv.org/html/2605.19224#bib.bib24)). The number of electrodes and their placement also differ from patient to patient, depending on clinical needs. This severely limits the amount of data collected from each individual, making it challenging to build elaborate encoding models. In contrast, non-invasive measurements like functional magnetic resonance imaging (fMRI), while having much lower temporal resolution than ECoG, are easy to acquire repeatedly and have whole-brain coverage. Can plentiful fMRI data be exploited to improve encoding models for intracranial data?

Modern encoding models are typically based on pretrained deep neural networks. Language encoding models often use neural network language models pretrained on large text corpora Jain & Huth ([2018](https://arxiv.org/html/2605.19224#bib.bib13)); Caucheteux & King ([2022](https://arxiv.org/html/2605.19224#bib.bib2)) or speech models pretrained on audio corpora Vaidya et al. ([2022](https://arxiv.org/html/2605.19224#bib.bib39)); Millet et al. ([2022](https://arxiv.org/html/2605.19224#bib.bib26)); Tuckute et al. ([2023](https://arxiv.org/html/2605.19224#bib.bib38)). These networks have mostly been used as frozen feature extractors Kell et al. ([2018](https://arxiv.org/html/2605.19224#bib.bib14)); Oota et al. ([2024](https://arxiv.org/html/2605.19224#bib.bib33)), but in more recent “brain-tuning” studies the networks are also fine-tuned on brain data to improve encoding performance Moussa et al. ([2024](https://arxiv.org/html/2605.19224#bib.bib28)); Negi et al. ([2025](https://arxiv.org/html/2605.19224#bib.bib31)). fMRI-tuned models increase prediction performance relative to pretrained models Vattikonda et al. ([2025](https://arxiv.org/html/2605.19224#bib.bib40)), and also generalize to fMRI data from new subjects or different brain areas Moussa & Toneva ([2025](https://arxiv.org/html/2605.19224#bib.bib27)). While the limited size of ECoG datasets makes fine-tuning challenging, it may be possible that fMRI-tuned models, by virtue of learning more brain-like representations, generalize to ECoG. This would enable ECoG encoding model performance to scale with available fMRI datasets.

The idea that ECoG encoding models could benefit from fine-tuning on fMRI data is surprising but not impossible. ECoG measures electrical activity in the brain that varies at the scale of milliseconds, while fMRI measures blood flow that varies over seconds, at least 2 orders of magnitude slower. Yet despite their low sampling rate, fMRI signals are sensitive to fast-timescale stimulus properties. In auditory cortex, for example, fMRI responses can depend upon temporal modulations in sound that are much faster than the fMRI signal itself Overath et al. ([2015](https://arxiv.org/html/2605.19224#bib.bib34)); Schönwiesner & Zatorre ([2009](https://arxiv.org/html/2605.19224#bib.bib36)). More recent studies have shown that auditory cortex fMRI responses to speech are well-modeled using networks like HuBERT Hsu et al. ([2021](https://arxiv.org/html/2605.19224#bib.bib10)), WavLM Chen et al. ([2021](https://arxiv.org/html/2605.19224#bib.bib4)), or Whisper Radford et al. ([2022](https://arxiv.org/html/2605.19224#bib.bib35)). These models’ fMRI prediction performance is improved by fine-tuning on fMRI data Moussa et al. ([2024](https://arxiv.org/html/2605.19224#bib.bib28)); Vattikonda et al. ([2025](https://arxiv.org/html/2605.19224#bib.bib40)). The same networks can also very effectively predict ECoG responses to speech Li et al. ([2022](https://arxiv.org/html/2605.19224#bib.bib18)). These results all support the possibility that fMRI-tuned models could improve ECoG prediction performance.

![Refer to caption](https://arxiv.org/html/2605.19224v1/x1.png)

Figure 1: fMRI-to-ECoG transfer via fMRI-tuning. We fine-tune the 9th layer of a deep speech representation model, WavLM Base+ Chen et al. ([2021](https://arxiv.org/html/2605.19224#bib.bib4)), to predict fMRI responses (measured at 0.5 Hz) to spoken language. We then freeze the weights of the WavLM model and use its representations to build linearized encoding models of ECoG responses (measured at 20 Hz) to speech from a separate dataset. Successfully performing this task requires learning representations that are useful across brain recording methods and robust to new subjects and stimuli.

In this work, we demonstrate that encoding models fine-tuned on fMRI can generalize to new subjects, stimuli, and recording methods (Figure [1](https://arxiv.org/html/2605.19224#S1.F1)). Models fine-tuned on fMRI improved prediction performance in ECoG, despite the difference in temporal resolution between the two methods. We stress test the “temporal resolution generalization” of this procedure by fine-tuning on downsampled fMRI responses. Despite a sampling rate of only 0.25 Hz, fine-tuning on these responses still yielded a significant improvement in ECoG prediction performance compared to the pretrained model. We show that this slow-to-fast generalization even applies within fMRI data; models fine-tuned on downsampled fMRI responses were able to predict the original fMRI responses better than the pretrained model. Finally, we show that ECoG performance steadily increases with the amount of fMRI fine-tuning data, demonstrating that scaling fMRI language datasets could benefit ECoG models for applications in neuroscience or BCI, despite the vast difference in temporal resolution.

## 2 Data and methods

### 2.1 fMRI data

From the public dataset released by LeBel et al. ([2023](https://arxiv.org/html/2605.19224#bib.bib17)) and Tang et al. ([2023](https://arxiv.org/html/2605.19224#bib.bib37)), we used pre-processed fMRI data from 3 participants who were scanned while listening to 94–103 naturally spoken narrative stories (17.8–19.7 hours per participant). These stories elicit a range of responses that may be easier to capture in whole-brain recordings (like fMRI) than in more limited intracranial datasets. Three of the stories were presented to the subjects multiple times: two (“fromboyhoodtofatherhood” and “onapproachtopluto”) were presented five times each, and one (“wheretheressmoke”) ten times. We averaged across the responses of the repeated presentations to reduce measurement noise.

### 2.2 Speech encoding models

In this work, we build linearized speech encoding models that aim to estimate the response $R$ to a stimulus $S$ as:

$$\hat{R}_{t}=f(S_{t};\theta)\beta \tag{1}$$

where $f:\mathbb{S}\rightarrow\mathbb{R}^{F}$ is a non-linear transform of the stimulus $S$ at time $t$ with parameters $\theta$, and $\beta\in\mathbb{R}^{F\times C}$ is a linear projection of the $F$ features to each of $C$ channels of the brain response. Response channels are voxels in fMRI and electrodes in ECoG. When possible, encoding models use a finite impulse response (FIR) structure to capture temporal properties of the response. For example, fMRI responses derive from the blood-oxygen-level-dependent (BOLD) signal which, after an impulse of neural activity, rises to a peak over the course of 3-4 s and then falls back to baseline over another 4-6 s Naselaris et al. ([2011](https://arxiv.org/html/2605.19224#bib.bib30)). To capture this behavior, our fMRI encoding models use concatenated stimulus features from several timepoints ($t-4$, $t-3$, $t-2$, and $t-1$) to predict the responses at time $t$. For an FIR model with $d$ delays, this expands the feature space to $d\cdot F$ dimensions. In ECoG, responses are also delayed by upstream neural processing, often spanning 200-800 ms after a stimulus transient Hullett et al. ([2016](https://arxiv.org/html/2605.19224#bib.bib12)); Hamilton et al. ([2018](https://arxiv.org/html/2605.19224#bib.bib8)). Because the sampling rate of ECoG data is also much higher, capturing this temporal response could require $d=20$ or more. This quickly becomes computationally intractable for large feature spaces. To circumvent this issue, we followed earlier studies Goldstein et al. ([2025](https://arxiv.org/html/2605.19224#bib.bib7)); Zada et al. ([2025](https://arxiv.org/html/2605.19224#bib.bib41)) and fit ECoG encoding models that use only a single delay hyperparameter $\tau$, yielding $\hat{R}_{t}=f(S_{t-\tau};\theta)\beta$.
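The FIR construction described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: the helper names `make_delayed` and `fit_ridge`, the zero-padding at the start of the recording, and the single shared ridge penalty `alpha` are all assumptions made for the example.

```python
import numpy as np

def make_delayed(features, delays):
    """Concatenate time-shifted copies of a (T, F) feature matrix.

    Column block i of the output holds features[t - delays[i]] at row t,
    so the result has shape (T, len(delays) * F). Rows with no valid
    source sample are left as zeros (an assumption of this sketch).
    """
    T, F = features.shape
    out = np.zeros((T, len(delays) * F))
    for i, d in enumerate(delays):
        if d < 0:
            raise ValueError("this sketch only handles non-negative delays")
        out[d:, i * F:(i + 1) * F] = features[:T - d]
    return out

def fit_ridge(X, Y, alpha):
    """Closed-form ridge regression: beta = (X^T X + alpha I)^-1 X^T Y."""
    n_feat = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_feat), X.T @ Y)

# Toy example: F = 3 features, C = 2 response channels, delays of 1..4 samples.
rng = np.random.default_rng(0)
feats = rng.standard_normal((200, 3))
X = make_delayed(feats, delays=[1, 2, 3, 4])   # shape (200, 12), i.e. d * F columns
true_beta = rng.standard_normal((12, 2))
Y = X @ true_beta                              # noiseless synthetic responses
beta_hat = fit_ridge(X, Y, alpha=1e-8)         # shape (12, 2)
```

With $d$ delays the weight matrix grows from $F\times C$ to $dF\times C$, which is why the single-delay form is attractive when both the feature dimension and the sampling rate are high.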

Here, we parametrize the non-linear stimulus transform $f$ using WavLM Chen et al. ([2021](https://arxiv.org/html/2605.19224#bib.bib4)), a neural network that operates on the waveform of the stimulus. Layer 9 has been previously shown to have the highest encoding performance in fMRI Antonello et al. ([2023](https://arxiv.org/html/2605.19224#bib.bib1)), so we only extract features from this layer. Following previous work, we extract features by sliding a 4 s window over the waveform, feeding the windowed stimulus into the model, and saving the hidden state of the final token of the layer.

For modeling fMRI, we extract features with a 0\.25 s stride, and the result is downsampled to 0\.5 Hz, the sampling rate of the fMRI responses\. For modeling ECoG, we extract features with a 0\.05 s stride, resulting in features at 20 Hz, the sampling rate we use for the ECoG responses\.

### 2.3 fMRI-tuning of audio models

To induce a brain-like bias in the speech representations, we fine-tune the underlying WavLM model to predict fMRI responses in a procedure we call “fMRI-tuning”. We adopt the procedure of Vattikonda et al. ([2025](https://arxiv.org/html/2605.19224#bib.bib40)) to fine-tune separate WavLM models to predict the fMRI responses of three subjects from the fMRI dataset, starting from the WavLM Base+ checkpoint. To reduce the likelihood of overfitting, we use rank-4 low-rank adaptation (LoRA; Hu et al. ([2021](https://arxiv.org/html/2605.19224#bib.bib11))) on the $W^{Q},W^{K},W^{V}$ matrices of each Transformer layer, and we constrain the final linear projection from WavLM to fMRI voxels to be rank-100. We use the Adam optimizer Kingma & Ba ([2017](https://arxiv.org/html/2605.19224#bib.bib16)) with a learning rate of $5\times 10^{-4}$ to optimize the LoRA matrices and linear projection to minimize a spatial correlation loss.
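To make the rank-4 constraint concrete, the following sketch shows the generic LoRA update on a single projection matrix. It is illustrative only: the hidden width of 768, the initialisation scale, and the function name `lora_forward` are assumptions of this example, and the scaling factor that many LoRA implementations apply to the update is omitted.

```python
import numpy as np

def lora_forward(x, W, A, B):
    """Apply a frozen weight W plus a trainable low-rank update B @ A.

    x: (n, d_in), W: (d_out, d_in), A: (r, d_in), B: (d_out, r).
    """
    return x @ (W + B @ A).T

d_model, rank = 768, 4            # 768 is assumed here as the hidden width
rng = np.random.default_rng(0)
W = rng.standard_normal((d_model, d_model))      # frozen pretrained weight
A = rng.standard_normal((rank, d_model)) * 0.01  # trainable
B = np.zeros((d_model, rank))                    # trainable, zero-initialised

full_params = W.size              # parameters in one dense projection
lora_params = A.size + B.size     # parameters actually updated by fine-tuning
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained layer exactly, and only `A` and `B` receive gradient updates. Under the assumed width, that is 6,144 trainable values per matrix instead of 589,824.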

We use two stories as a validation set \(“fromboyhoodtofatherhood” and “onapproachtopluto”\) and, as we do for the pretrained model, evaluate encoding performance on one test story \(“wheretheressmoke”\)\. We fine\-tuned each model for 30 epochs with a batch size of 10 TRs using the same feature extraction parameters as described in Section[2\.2](https://arxiv.org/html/2605.19224#S2.SS2)\. To select the best epoch, we evaluated encoding performance on the validation set\. Using the ridge parameters from the pretrained model, we fit ridge regression encoding models using features from each of the 30 epochs, and we selected the epoch with the best validation encoding performance\. For this epoch, we then re\-ran cross\-validation to select the best ridge parameters, and we evaluated its encoding performance on the unseen test story\.

Fine\-tuning one model took 30 hours on a single 48GB NVIDIA RTX A6000\.

### 2.4 ECoG data and evaluation

We use the “Podcast” dataset Zada et al. ([2025](https://arxiv.org/html/2605.19224#bib.bib41)) to evaluate the generalization of our fMRI-tuned models to intracranial data. In this dataset, nine patients listened to a 30-minute podcast while electrical brain activity was recorded with electrocorticography (ECoG). The dataset contains 1,268 electrodes across all patients, who are distinct from the participants in the fMRI dataset.

As is conventional for ECoG, we used the high-gamma power (provided by Zada et al. ([2025](https://arxiv.org/html/2605.19224#bib.bib41))) in each ECoG electrode as the response of interest Mukamel et al. ([2005](https://arxiv.org/html/2605.19224#bib.bib29)); Manning et al. ([2009](https://arxiv.org/html/2605.19224#bib.bib22)). Power in the high-gamma band was computed as the analytic amplitude of the Hilbert transform of ECoG signals that had been band-pass filtered in the 70-200 Hz band. High-gamma power is thought to represent pooled activity across synchronously active neurons in the region near the electrode. We further downsampled the high-gamma signals to 20 Hz to reduce computational burdens.
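The analytic-amplitude computation can be illustrated with a small FFT-based sketch. Several choices here are assumptions rather than details from the dataset: the raw sampling rate of 1000 Hz, the ideal (brick-wall) band-pass implemented by masking FFT bins, and downsampling by block averaging. The function names are hypothetical.

```python
import numpy as np

def band_envelope(x, fs, lo, hi):
    """Analytic amplitude of x restricted to the [lo, hi] Hz band.

    Keeping only positive-frequency FFT bins inside the band and doubling
    them yields the analytic signal of the band-passed input; its magnitude
    is the envelope. Requires 0 < lo < hi < fs / 2.
    """
    n = len(x)
    freqs = np.fft.fftfreq(n, d=1.0 / fs)
    keep = (freqs >= lo) & (freqs <= hi)
    analytic = np.fft.ifft(2.0 * np.fft.fft(x) * keep)
    return np.abs(analytic)

def block_average(x, factor):
    """Downsample by averaging non-overlapping blocks of `factor` samples."""
    n = (len(x) // factor) * factor
    return x[:n].reshape(-1, factor).mean(axis=1)

fs = 1000.0                                   # assumed raw sampling rate (Hz)
t = np.arange(2000) / fs
slow = 1.0 + 0.5 * np.cos(2 * np.pi * 2 * t)  # 2 Hz modulation of the envelope
x = slow * np.cos(2 * np.pi * 100 * t)        # 100 Hz carrier inside the band
x = x + np.cos(2 * np.pi * 10 * t)            # 10 Hz component outside the band

high_gamma = band_envelope(x, fs, lo=70.0, hi=200.0)
high_gamma_20hz = block_average(high_gamma, factor=int(fs // 20))
```

In this toy signal the recovered envelope follows the slow 2 Hz modulation while the out-of-band 10 Hz component is discarded, which is the sense in which high-gamma power is a slowly varying summary of fast activity.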

As mentioned earlier, the high temporal resolution of ECoG and high dimensionality of the speech representations make it computationally difficult to fit FIR encoding weights. Instead, we offset the stimulus by different amounts relative to the response, and build a separate linear model for each lag with ridge regression. We use 81 lags evenly spaced between -2 and +2 seconds. ECoG encoding performance is calculated with 4-fold cross-validation on the 30-minute stimulus. The encoding performance for a model is that of the best-performing lag.
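A per-lag search of this kind might look like the sketch below. It is not the authors' pipeline: the fixed ridge penalty, the contiguous fold layout, the zero-padding when shifting, and all function names are assumptions made for illustration.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation between two 1-D arrays."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def ridge_predict(X_train, y_train, X_test, alpha):
    """Fit ridge regression on the training rows and predict the test rows."""
    gram = X_train.T @ X_train + alpha * np.eye(X_train.shape[1])
    return X_test @ np.linalg.solve(gram, X_train.T @ y_train)

def shift(X, lag):
    """Row t of the result holds X[t - lag]; rows with no source are zero."""
    out = np.zeros_like(X)
    if lag > 0:
        out[lag:] = X[:-lag]
    elif lag < 0:
        out[:lag] = X[-lag:]
    else:
        out[:] = X
    return out

def best_lag(X, y, lags, alpha=1.0, n_folds=4):
    """Return (lag, score) with the highest mean cross-validated correlation."""
    idx = np.arange(len(y))
    folds = np.array_split(idx, n_folds)        # contiguous folds (assumed)
    scores = []
    for lag in lags:
        X_lag = shift(X, lag)
        fold_scores = []
        for test in folds:
            train = np.setdiff1d(idx, test)
            pred = ridge_predict(X_lag[train], y[train], X_lag[test], alpha)
            fold_scores.append(pearson(pred, y[test]))
        scores.append(np.mean(fold_scores))
    i = int(np.argmax(scores))
    return lags[i], float(scores[i])

# 81 lags between -2 s and +2 s, expressed in samples at 20 Hz.
lags_in_samples = np.round(np.linspace(-2.0, 2.0, 81) * 20).astype(int)
```

At 20 Hz the 81 lags correspond to integer shifts of -40 to +40 samples, so each candidate lag costs one ridge fit of size $F$ rather than one fit of size $dF$.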

Along with encoding performance (Pearson correlation coefficient) on the overall signal, we also evaluate performance within individual frequency bands by computing the power spectral density (PSD) of the residual of the model’s predictions. The residual PSD for a model is chosen to be that of the lag with the highest encoding performance.
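As a rough sketch of the residual analysis, the code below computes a one-sided periodogram of the prediction residual and integrates it below and above a split frequency. The choice of a plain periodogram (rather than, say, an averaged estimator) and the function names are assumptions of this example.

```python
import numpy as np

def periodogram(x, fs):
    """One-sided power spectral density of a real 1-D signal."""
    n = len(x)
    spec = np.abs(np.fft.rfft(x)) ** 2 / (fs * n)
    # Double every bin except DC (and Nyquist when n is even) to fold in
    # the negative frequencies.
    if n % 2 == 0:
        spec[1:-1] *= 2.0
    else:
        spec[1:] *= 2.0
    return np.fft.rfftfreq(n, d=1.0 / fs), spec

def band_power(freqs, psd, lo, hi):
    """Integrate the PSD over lo <= f < hi (rectangle rule)."""
    df = freqs[1] - freqs[0]
    mask = (freqs >= lo) & (freqs < hi)
    return float(psd[mask].sum() * df)

def residual_band_powers(y_true, y_pred, fs=20.0, split_hz=0.25):
    """Residual power below and above split_hz for one electrode."""
    freqs, psd = periodogram(y_true - y_pred, fs)
    low = band_power(freqs, psd, 0.0, split_hz)
    high = band_power(freqs, psd, split_hz, np.inf)
    return low, high
```

Comparing these two numbers between a pretrained and an fMRI-tuned model, electrode by electrode, gives a band-wise view of where the prediction error shrinks.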

Fitting all ECoG encoding models for one fMRI\-tuned model took 10 hours on a single NVIDIA RTX A6000\.

### 2.5 Signal-to-noise ratio of fMRI responses

fMRI data comprise both signal and noise. To tease the two apart one must use data collected while the same stimulus is played several times. Averaging the response timecourse across repeats gives an estimate of the “signal” portion of the data. Subtracting that average from each individual recording then gives an estimate of the “noise” part. Computing the variance of the signal and noise components then enables one to directly estimate the total signal-to-noise ratio (SNR). We estimate the SNR within each frequency band in a similar fashion: compute the power spectral densities (PSD) of the signal and noise components, and then compute the ratio at each frequency Hsu et al. ([2004](https://arxiv.org/html/2605.19224#bib.bib9)). We use the 10 individual presentations of the test story (“wheretheressmoke”) to compute band-wise SNR in our fMRI data.
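The repeat-based estimate described above can be written down directly. The sketch below is a naive version under stated assumptions: it applies no bias correction (the across-repeat mean still contains a fraction of the noise, so the estimate is somewhat inflated), and the function names are hypothetical.

```python
import numpy as np

def snr_from_repeats(repeats):
    """Overall SNR from a (n_repeats, T) array of responses to one stimulus."""
    signal = repeats.mean(axis=0)          # estimate of the stimulus-driven part
    noise = repeats - signal               # per-repeat deviation from the mean
    return float(signal.var() / noise.var())

def bandwise_snr(repeats, fs):
    """Ratio of signal power to mean noise power at each frequency."""
    signal = repeats.mean(axis=0)
    noise = repeats - signal
    n = repeats.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    sig_power = np.abs(np.fft.rfft(signal)) ** 2
    noise_power = (np.abs(np.fft.rfft(noise, axis=1)) ** 2).mean(axis=0)
    # PSD normalisation constants cancel in the ratio, so raw power is used.
    return freqs, sig_power / (noise_power + 1e-12)
```

For a single voxel, `repeats` would hold the 10 presentations of the test story sampled at `fs = 0.5` Hz, and the band-wise ratio can then be averaged within any frequency range of interest.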

## 3 Results

![Refer to caption](https://arxiv.org/html/2605.19224v1/x2.png)

Figure 2: fMRI-tuned models for predicting ECoG. (a) ECoG encoding performance averaged across all electrodes within regions of interest (ROIs). Error bars show the standard error of the mean (SEM) across electrodes. The Destrieux atlas Destrieux et al. ([2010](https://arxiv.org/html/2605.19224#bib.bib5)) was used to find electrodes in each ROI: pre-frontal cortex (PFC), primary auditory cortex (AC), and the language network as described in Lipkin et al. ([2022](https://arxiv.org/html/2605.19224#bib.bib19)). fMRI-tuned models outperform pretrained models in every ROI ($p<1\times10^{-8}$ for paired $t$-tests across electrodes within the ROI). (b) Encoding performance of each electrode using the pretrained vs. fMRI-tuned models. fMRI-tuned models consistently outperform pretrained models, especially for electrodes that were already well-modeled ($p<1\times10^{-87}$ for a paired $t$-test across all electrodes). (c) Change in encoding performance for each electrode after fMRI-tuning, visualized on a brain atlas. Electrodes with the greatest improvement are found in bilateral auditory cortex. Smaller gains are seen in other components of the language network, including the superior temporal gyrus (STG) and inferior frontal gyrus (IFG). In Appendix [A](https://arxiv.org/html/2605.19224#A1), we separate the electrodes by subject (Figure [6](https://arxiv.org/html/2605.19224#A1.F6)).

### 3.1 Fine-tuning on fMRI improves ECoG performance

Following Vattikonda et al. ([2025](https://arxiv.org/html/2605.19224#bib.bib40)), we fine-tuned separate WavLM models on responses from each fMRI participant. Using features from these fine-tuned models, we fit new encoding models to ECoG responses from the “Podcast” dataset.

Averaged across cortex, encoding models fine-tuned on fMRI outperform the pretrained WavLM model when predicting ECoG (Figure [2](https://arxiv.org/html/2605.19224#S3.F2)a). Improvements are most pronounced in electrodes within auditory cortex (AC) and the “language network”, a set of areas hypothesized to be selective for language processing Lipkin et al. ([2022](https://arxiv.org/html/2605.19224#bib.bib19)). Electrodes that did not improve after fine-tuning had poor pretrained performance as well, suggesting their response is not stimulus-driven (Figure [2](https://arxiv.org/html/2605.19224#S3.F2)b).

In Figure [2](https://arxiv.org/html/2605.19224#S3.F2)c, we show the change in performance after fMRI-tuning for each electrode on an atlas brain map. Despite evaluating on a different recording modality, the cortical areas that improve in ECoG after fine-tuning largely overlap with previous fMRI-specific findings Moussa & Toneva ([2025](https://arxiv.org/html/2605.19224#bib.bib27)); Negi et al. ([2025](https://arxiv.org/html/2605.19224#bib.bib31)); Vattikonda et al. ([2025](https://arxiv.org/html/2605.19224#bib.bib40)).

### 3.2 fMRI-tuning improves prediction in high-frequency bands of high-gamma

Responses in auditory cortex (AC) tend to follow spectral features of the stimulus and can fluctuate quickly Hullett et al. ([2016](https://arxiv.org/html/2605.19224#bib.bib12)); Norman-Haignere et al. ([2022](https://arxiv.org/html/2605.19224#bib.bib32)). Interestingly, electrodes in AC also showed the greatest improvements after fMRI-tuning. This led us to ask what parts of the neural response were better predicted after fMRI-tuning. We characterized the spectral properties of the model’s performance by examining the power spectral density of the residuals from the encoding model predictions. Because fMRI responses are temporally smooth, one might expect fMRI-tuning to only improve predictions for slowly-varying components of the ECoG response at or below the fMRI Nyquist rate (0.25 Hz). However, these fMRI signals still ultimately derive from neural responses to rapidly-changing speech stimuli, which we know are well-modeled by the WavLM features. Thus it is also possible that fine-tuning captures or accentuates brain-relevant information at much finer timescales, leading to improvements in ECoG prediction performance above 0.25 Hz.

Surprisingly, we observed that fMRI-tuning improved the prediction of ECoG across the entire spectrum (Figure [3](https://arxiv.org/html/2605.19224#S3.F3)). The biggest improvements were at frequencies below 0.25 Hz, but substantial and significant improvements were seen at frequencies up to 1 Hz. Even in the 1-10 Hz band we found small but consistent improvement from fMRI-tuning. The power of the error was reduced by 1.16% below 0.25 Hz and by 0.602% above it. This result indicates that models tuned to predict fMRI data also predict ECoG at timescales that fMRI cannot resolve.

![Refer to caption](https://arxiv.org/html/2605.19224v1/x3.png)

Figure 3: Frequency-binned improvement of fMRI-tuned models. Change in the power spectral density (PSD) of the ECoG encoding model residual after fMRI-tuning (lower is better). The shaded area shows the standard error across electrodes. The dotted green line at 0.25 Hz shows the Nyquist frequency of the fMRI responses. fMRI-tuning improves model fit (reduces the residual power) overall, with improvement both below and above the fMRI Nyquist frequency (two-sided $t$-test across electrodes between pretrained and fMRI-tuned PSD, $p<5\times10^{-5}$ for every frequency). The power of the residual below 0.25 Hz decreases by 1.16% ($p<1\times10^{-23}$, two-sided $t$-test), and above 0.25 Hz the power decreases by 0.602% ($p<1\times10^{-31}$, two-sided $t$-test). This demonstrates that models fine-tuned on fMRI data can generalize and improve prediction performance of responses much faster than can be measured in fMRI.

### 3.3 Models fine-tuned on downsampled fMRI generalize to faster responses

In the previous sections we showed that fMRI\-tuned models generalize to the higher temporal resolution of ECoG\. How far can we push this temporal generalization? To further stress\-test our approach we created a new dataset at even lower temporal resolution by downsampling the fMRI data to 0\.25 Hz \(4 seconds between timepoints\)\. We then fine\-tuned new models on these downsampled fMRI responses as before\.

First, we compared how well the original fMRI\-tuned and downsampled models predict held\-out fMRI data\. Figure[4](https://arxiv.org/html/2605.19224#S3.F4)a shows the difference in model prediction performance in one fMRI subject \(S3\)\. We found that models fine\-tuned on downsampled responses perform similarly to the original models overall, with slight decreases in performance in some higher\-level areas like prefrontal cortex \(PFC\) and the boundaries of the occipital and temporal lobes\. Overall prediction performance was comparable, and both models outperformed the pretrained model in all brain areas \(Figure[4](https://arxiv.org/html/2605.19224#S3.F4)b\)\.

While surprising, this result can be explained by the signal-to-noise ratio (SNR) properties of fMRI responses to natural stimuli. To explore this idea, we measured the SNR of the fMRI data across different frequency bands using a spectral approach Hsu et al. ([2004](https://arxiv.org/html/2605.19224#bib.bib9)) (see Section [2.5](https://arxiv.org/html/2605.19224#S2.SS5)). This showed that not all frequency bands of the fMRI response are equally informative. Most of the usable signal lies in the band between 0.01 and 0.1 Hz, while very little falls in the highest frequencies (Figure [4](https://arxiv.org/html/2605.19224#S3.F4)c). Thus, by downsampling and truncating the noisiest frequencies, we actually *raise* the overall SNR of the fMRI responses. Fine-tuning on these “cleaner” downsampled responses allows the representations to generalize to the original signal.

Finally, we tested whether the models fine-tuned on downsampled fMRI data could still generalize to ECoG data. Figure [4](https://arxiv.org/html/2605.19224#S3.F4)d compares ECoG encoding model performance using fMRI-tuned and downsampled-fMRI-tuned models. As was the case when predicting fMRI data, we found virtually no difference between the two models at predicting ECoG data. This demonstrates even more extreme temporal generalization than before: models fine-tuned on fMRI data with a Nyquist rate of 0.125 Hz improve performance on ECoG data.

![Refer to caption](https://arxiv.org/html/2605.19224v1/x4.png)

Figure 4: Fine-tuning on downsampled fMRI responses. (a) Flattened cortical surface of fMRI subject S3 that compares the encoding performance of models fine-tuned on original fMRI responses (blue) and downsampled fMRI responses (red). Voxel brightness is proportional to overall model performance. The 2-dimensional histogram shows that voxels have similar performance across fine-tuning conditions. We show subjects S1 and S2 in Appendix [B](https://arxiv.org/html/2605.19224#A2). (b) fMRI encoding performance, averaged within ROIs, after fine-tuning on original or downsampled fMRI responses. Error bars show standard error across subjects. There was no significant effect between the two fine-tuning conditions (paired $t$-test across subjects had $p$-values 0.424, 0.565, and 0.798 for each respective ROI). (c) Signal-to-noise ratio (SNR) of the fMRI responses, averaged across subjects. The dark and light green vertical lines are the Nyquist frequency of the original and downsampled fMRI responses at 0.25 Hz and 0.125 Hz, respectively. The average SNR is $0.155\pm 0.0143$ below the 0.125 Hz threshold and $0.114\pm 0.000534$ above it. The original signal had an overall SNR of $0.139\pm 0.00895$. This shows why downsampling the fMRI data before fine-tuning had little effect on model performance: most of the responses between 0.125 and 0.25 Hz are noise. (d) Effect on ECoG performance after fine-tuning on downsampled or original fMRI responses. We see no significant change in performance due to fine-tuning data ($p=0.514$, two-sided paired $t$-test across all electrodes).

![Refer to caption](https://arxiv.org/html/2605.19224v1/x5.png)

Figure 5: Scaling of fMRI-tuning for ECoG. (a) ECoG encoding performance as a function of the number of fMRI fine-tuning stories. Error bars indicate the standard error over bootstraps of the fMRI stories. We show the scaling performance with the full fMRI-tuning dataset in Appendix [D](https://arxiv.org/html/2605.19224#A4). (b) Pretrained encoding performance vs. scaling coefficient $m_{e}$ for all electrodes. The scaling law is measured per electrode as the slope of a linear regression between (1) its encoding performance and (2) log-fMRI story count. The electrodes that tend to improve have strong pretrained performance. (c) Electrodes were thresholded by pretrained encoding performance $\rho>0.1$, yielding $n=219$ electrodes, and plotted on a brain atlas.

### 3.4 ECoG prediction scales with fMRI data

Several works have examined the scaling laws of encoding models in brain data Matsuyama et al. ([2023](https://arxiv.org/html/2605.19224#bib.bib23)); Gokce & Schrimpf ([2025](https://arxiv.org/html/2605.19224#bib.bib6)). In particular, Antonello et al. ([2023](https://arxiv.org/html/2605.19224#bib.bib1)) showed that language encoding models in fMRI scale logarithmically with the amount of training data. Moussa & Toneva ([2025](https://arxiv.org/html/2605.19224#bib.bib27)) followed up this finding by fine-tuning the underlying models, and they found that fine-tuning reduces the amount of data needed to generalize to new subjects. We next examined whether this effect holds across modalities as well. To test scaling, we fine-tuned models using subsets of the LeBel et al. fMRI dataset, and then tested them on the “Podcast” ECoG dataset Antonello et al. ([2023](https://arxiv.org/html/2605.19224#bib.bib1)).

We found that improvements in overall ECoG prediction scaled logarithmically with the number of stories used for fMRI\-tuning \(Figure [5](https://arxiv.org/html/2605.19224#S3.F5)a\)\. This result mirrors the literature Antonello et al\. \([2023](https://arxiv.org/html/2605.19224#bib.bib1)\); Moussa & Toneva \([2025](https://arxiv.org/html/2605.19224#bib.bib27)\), but with smaller improvement magnitudes due to transferring across modalities\. \(In Appendix [C](https://arxiv.org/html/2605.19224#A3), we additionally compare the scaling behavior of linear and fMRI\-tuned models on fMRI encoding\.\)
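
The error bars in Figure 5a are described as the standard error over bootstraps of the fMRI stories\. As a minimal sketch, assuming each bootstrap replicate yields one mean ECoG encoding score, the bootstrap standard error can be taken as the sample standard deviation of those scores\. The function name `bootstrap_mean_and_se` and the scores below are hypothetical illustrations, not the authors' code or data\.

```python
import statistics


def bootstrap_mean_and_se(scores):
    """Return (mean, standard error) over bootstrap replicates.

    Assumption: each entry in `scores` is the mean ECoG encoding
    performance obtained from one bootstrap of the fMRI stories.
    The bootstrap standard error is the sample standard deviation
    of the replicate statistics.
    """
    if len(scores) < 2:
        raise ValueError("need at least two bootstrap replicates")
    return statistics.fmean(scores), statistics.stdev(scores)


# Hypothetical replicate scores for one story count (not data from the paper).
replicate_scores = [0.10, 0.12, 0.11, 0.13, 0.09]
mean_score, standard_error = bootstrap_mean_and_se(replicate_scores)
```

Repeating this for each story count would give one point and one error bar per position on the x\-axis\.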

We next quantified the scaling law of fMRI\-tuning data on ECoG\. Adapting the formulation of Antonello et al\. \(2023\), for each electrode $e$ we fit a linear model $\Delta\rho_{e} \approx m_{e} \log_{2} N$ to predict the change in encoding performance $\Delta\rho_{e}$ from the number of fMRI\-tuning stories $N$\. In Figure [5](https://arxiv.org/html/2605.19224#S3.F5)b, we show the scaling coefficient $m_{e}$ for all electrodes\. The trend was consistent across electrodes: after filtering for language\-responsive electrodes, 176 of 219 electrodes across cortex had a positive relationship between encoding performance and the number of stories used for fMRI\-tuning \(Figure [5](https://arxiv.org/html/2605.19224#S3.F5)b\)\. Scaling effects were largest in areas with strong baseline encoding performance \(Figure [5](https://arxiv.org/html/2605.19224#S3.F5)c\)\.
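
The per\-electrode fit described above can be sketched as follows\. This is an illustration under stated assumptions, not the authors' implementation: it fits the slope by least squares through the origin, matching the intercept\-free form of the model as written, whereas the actual analysis may include an intercept\. The function name, electrode labels, and values are hypothetical\.

```python
import math


def scaling_coefficient(story_counts, delta_rho):
    """Least-squares slope m_e for delta_rho ~= m_e * log2(N).

    Assumption: the fit passes through the origin, as in the
    intercept-free model stated in the text.
    """
    x = [math.log2(n) for n in story_counts]
    sum_xx = sum(xi * xi for xi in x)
    if sum_xx == 0:
        raise ValueError("need at least one story count greater than 1")
    sum_xy = sum(xi * yi for xi, yi in zip(x, delta_rho))
    return sum_xy / sum_xx


# Hypothetical changes in encoding performance for two electrodes
# (illustrative numbers only, not data from the paper).
story_counts = [1, 2, 4, 8, 16]
delta_rho_by_electrode = {
    "e1": [0.0, 0.01, 0.02, 0.03, 0.04],        # gains 0.01 per doubling
    "e2": [0.0, -0.005, -0.01, -0.015, -0.02],  # loses 0.005 per doubling
}

coefficients = {
    name: scaling_coefficient(story_counts, values)
    for name, values in delta_rho_by_electrode.items()
}

# Count electrodes with a positive scaling coefficient.
n_positive = sum(m > 0 for m in coefficients.values())
```

Counting positive coefficients in this way corresponds to the kind of tally reported in the text \(176 of 219 electrodes\)\.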

Our results extend the literature by showing that fMRI\-tuned models not only generalize across subjects, but also across recording methods\. These results suggest that increasing amounts of fMRI data could be used to mitigate the limitations of intracranial data collection\.

## 4 Conclusion

In this work, we showed that speech representations fine\-tuned with fMRI data transfer to ECoG prediction, a recording modality with orders of magnitude higher temporal resolution\. Transfer succeeds even when the fMRI data are further downsampled by a factor of 2, to a sampling rate of 0\.25 Hz, and performance scales logarithmically with the amount of fMRI fine\-tuning data\.

How is this possible? High\-gamma power, the ECoG signal that we model, is thought to be the closest neural correlate of the BOLD signal measured in fMRI Logothetis et al\. \([2001](https://arxiv.org/html/2605.19224#bib.bib21)\)\. Thus it is plausible that fine\-tuning on fMRI accentuates features that are relevant for high\-gamma, even at response frequencies invisible to fMRI\. This shared underpinning may be sufficient to enable slow\-to\-fast model transfer\.

ECoG datasets are typically small: clinically necessary electrodes are implanted for a few days and then removed\. fMRI datasets, in contrast, face no such constraints\. Thus, as fMRI datasets continue to grow, our scaling results suggest that leveraging them will provide better and better ECoG encoding models\. This is relevant for BCI applications where encoding model quality limits performance\.

Still, several limitations apply to these results\. We primarily focused on one model \(WavLM Base\+\) and the single task domain of spoken language\. Our approach is probably not fully data\-optimal: there is also a rich landscape of training recipes involving nonlinear encoding adapters for ECoG data which we do not explore in this work, and there are other known means of fine\-tuning acoustic representations to be more brain\-like, such as fine\-tuning on phonemes or semantic representations Vattikonda et al\. \([2025](https://arxiv.org/html/2605.19224#bib.bib40)\)\. A wider search of this landscape would help clarify the optimal means by which fMRI data can be used to predict ECoG data\. Yet it also seems likely that the optimal approach would combine all available data modes: fine\-tune on fMRI *and* ECoG *and* other potential information sources in a joint training cocktail\. Future work \(and larger datasets\) will be required to fully explore this space\. Broadly, our work presents strong evidence that brain\-like representations are highly modality invariant, and that out\-of\-modality training data can be a core component of the training mix for future brain encoding models\.

## References

- Antonello et al\. \(2023\) Richard Antonello, Aditya Vaidya, and Alexander Huth\. Scaling laws for language encoding models in fMRI\. *Advances in Neural Information Processing Systems*, 36:21895–21907, December 2023\.
- Caucheteux & King \(2022\) Charlotte Caucheteux and Jean\-Rémi King\. Brains and algorithms partially converge in natural language processing\. *Communications Biology*, 5\(1\):1–10, February 2022\. ISSN 2399\-3642\. doi:10\.1038/s42003\-022\-03036\-1\.
- Chang \(2015\) Edward F\. Chang\. Towards Large\-Scale, Human\-Based, Mesoscopic Neurotechnologies\. *Neuron*, 86\(1\):68–78, April 2015\. ISSN 0896\-6273\. doi:10\.1016/j\.neuron\.2015\.03\.037\.
- Chen et al\. \(2021\) Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, and Furu Wei\. WavLM: Large\-Scale Self\-Supervised Pre\-Training for Full Stack Speech Processing\. *arXiv:2110\.13900 \[cs, eess\]*, October 2021\.
- Destrieux et al\. \(2010\) Christophe Destrieux, Bruce Fischl, Anders Dale, and Eric Halgren\. Automatic parcellation of human cortical gyri and sulci using standard anatomical nomenclature\. *NeuroImage*, 53\(1\):1–15, October 2010\. ISSN 1053\-8119\. doi:10\.1016/j\.neuroimage\.2010\.06\.010\.
- Gokce & Schrimpf \(2025\) Abdulkadir Gokce and Martin Schrimpf\. Scaling Laws for Task\-Optimized Models of the Primate Visual Ventral Stream\. In *Forty\-Second International Conference on Machine Learning*, June 2025\.
- Goldstein et al\. \(2025\) Ariel Goldstein, Haocheng Wang, Leonard Niekerken, Mariano Schain, Zaid Zada, Bobbi Aubrey, Tom Sheffer, Samuel A\. Nastase, Harshvardhan Gazula, Aditi Singh, Aditi Rao, Gina Choe, Catherine Kim, Werner Doyle, Daniel Friedman, Sasha Devore, Patricia Dugan, Avinatan Hassidim, Michael Brenner, Yossi Matias, Orrin Devinsky, Adeen Flinker, and Uri Hasson\. A unified acoustic\-to\-speech\-to\-language embedding space captures the neural basis of natural language processing in everyday conversations\. *Nature Human Behaviour*, 9\(5\):1041–1055, May 2025\. ISSN 2397\-3374\. doi:10\.1038/s41562\-025\-02105\-9\.
- Hamilton et al\. \(2018\) Liberty S\. Hamilton, Erik Edwards, and Edward F\. Chang\. A Spatial Map of Onset and Sustained Responses to Speech in the Human Superior Temporal Gyrus\. *Current Biology*, 28\(12\):1860–1871\.e4, June 2018\. ISSN 0960\-9822\. doi:10\.1016/j\.cub\.2018\.04\.033\.
- Hsu et al\. \(2004\) Anne Hsu, Alexander Borst, and Frédéric E Theunissen\. Quantifying variability in neural responses and its application for the validation of model predictions\. *Network: Computation in Neural Systems*, 15\(2\):91–109, January 2004\. ISSN 0954\-898X, 1361\-6536\. doi:10\.1088/0954\-898X\_15\_2\_002\.
- Hsu et al\. \(2021\) Wei\-Ning Hsu, Benjamin Bolte, Yao\-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed\. HuBERT: Self\-Supervised Speech Representation Learning by Masked Prediction of Hidden Units\. *arXiv:2106\.07447 \[cs, eess\]*, June 2021\.
- Hu et al\. \(2021\) Edward J\. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen\-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen\. LoRA: Low\-Rank Adaptation of Large Language Models, October 2021\.
- Hullett et al\. \(2016\) Patrick W\. Hullett, Liberty S\. Hamilton, Nima Mesgarani, Christoph E\. Schreiner, and Edward F\. Chang\. Human Superior Temporal Gyrus Organization of Spectrotemporal Modulation Tuning Derived from Speech Stimuli\. *Journal of Neuroscience*, 36\(6\):2014–2026, February 2016\. ISSN 0270\-6474, 1529\-2401\. doi:10\.1523/JNEUROSCI\.1779\-15\.2016\.
- Jain & Huth \(2018\) Shailee Jain and Alexander Huth\. Incorporating Context into Language Encoding Models for fMRI\. In S\. Bengio, H\. Wallach, H\. Larochelle, K\. Grauman, N\. Cesa\-Bianchi, and R\. Garnett \(eds\.\), *Advances in Neural Information Processing Systems 31*, pp\. 6628–6637\. Curran Associates, Inc\., 2018\.
- Kell et al\. \(2018\) Alexander J\. E\. Kell, Daniel L\. K\. Yamins, Erica N\. Shook, Sam V\. Norman\-Haignere, and Josh H\. McDermott\. A Task\-Optimized Neural Network Replicates Human Auditory Behavior, Predicts Brain Responses, and Reveals a Cortical Processing Hierarchy\. *Neuron*, 98\(3\):630–644\.e16, May 2018\. ISSN 0896\-6273\. doi:10\.1016/j\.neuron\.2018\.03\.044\.
- Keshishian et al\. \(2026\) Menoua Keshishian, Gavin Mischler, Samuel Thomas, Brian Kingsbury, Stephan Bickel, Ashesh D\. Mehta, and Nima Mesgarani\. Parallel hierarchical encoding of linguistic representations in the human auditory cortex and recurrent automatic speech recognition systems\. *Nature Machine Intelligence*, 8\(2\):257–269, February 2026\. ISSN 2522\-5839\. doi:10\.1038/s42256\-026\-01185\-0\.
- Kingma & Ba \(2017\) Diederik P\. Kingma and Jimmy Ba\. Adam: A Method for Stochastic Optimization\. *arXiv:1412\.6980 \[cs\]*, January 2017\.
- LeBel et al\. \(2023\) Amanda LeBel, Lauren Wagner, Shailee Jain, Aneesh Adhikari\-Desai, Bhavin Gupta, Allyson Morgenthal, Jerry Tang, Lixiang Xu, and Alexander G\. Huth\. A natural language fMRI dataset for voxelwise encoding models\. *Scientific Data*, 10\(1\):555, August 2023\. ISSN 2052\-4463\. doi:10\.1038/s41597\-023\-02437\-z\.
- Li et al\. \(2022\) Yuanning Li, Gopala K\. Anumanchipalli, Abdelrahman Mohamed, Junfeng Lu, Jinsong Wu, and Edward F\. Chang\. Dissecting neural computations of the human auditory pathway using deep neural networks for speech, March 2022\.
- Lipkin et al\. \(2022\) Benjamin Lipkin, Greta Tuckute, Josef Affourtit, Hannah Small, Zachary Mineroff, Hope Kean, Olessia Jouravlev, Lara Rakocevic, Brianna Pritchett, Matthew Siegelman, Caitlyn Hoeflin, Alvincé Pongos, Idan A\. Blank, Melissa Kline Struhl, Anna Ivanova, Steven Shannon, Aalok Sathe, Malte Hoffmann, Alfonso Nieto\-Castañón, and Evelina Fedorenko\. Probabilistic atlas for the language network based on precision fMRI data from \>800 individuals\. *Scientific Data*, 9\(1\):529, August 2022\. ISSN 2052\-4463\. doi:10\.1038/s41597\-022\-01645\-3\.
- Littlejohn et al\. \(2025\) Kaylo T\. Littlejohn, Cheol Jun Cho, Jessie R\. Liu, Alexander B\. Silva, Bohan Yu, Vanessa R\. Anderson, Cady M\. Kurtz\-Miott, Samantha Brosler, Anshul P\. Kashyap, Irina P\. Hallinan, Adit Shah, Adelyn Tu\-Chan, Karunesh Ganguly, David A\. Moses, Edward F\. Chang, and Gopala K\. Anumanchipalli\. A streaming brain\-to\-voice neuroprosthesis to restore naturalistic communication\. *Nature Neuroscience*, pp\. 1–11, March 2025\. ISSN 1546\-1726\. doi:10\.1038/s41593\-025\-01905\-6\.
- Logothetis et al\. \(2001\) Nikos K\. Logothetis, Jon Pauls, Mark Augath, Torsten Trinath, and Axel Oeltermann\. Neurophysiological investigation of the basis of the fMRI signal\. *Nature*, 412\(6843\):150–157, July 2001\. ISSN 1476\-4687\. doi:10\.1038/35084005\.
- Manning et al\. \(2009\) Jeremy R\. Manning, Joshua Jacobs, Itzhak Fried, and Michael J\. Kahana\. Broadband Shifts in Local Field Potential Power Spectra Are Correlated with Single\-Neuron Spiking in Humans\. *Journal of Neuroscience*, 29\(43\):13613–13620, October 2009\. ISSN 0270\-6474, 1529\-2401\. doi:10\.1523/JNEUROSCI\.2041\-09\.2009\.
- Matsuyama et al\. \(2023\) Takuya Matsuyama, Kota S Sasaki, and Shinji Nishimoto\. Applicability of scaling laws to vision encoding models, August 2023\.
- Mercier et al\. \(2022\) Manuel R\. Mercier, Anne\-Sophie Dubarry, François Tadel, Pietro Avanzini, Nikolai Axmacher, Dillan Cellier, Maria Del Vecchio, Liberty S\. Hamilton, Dora Hermes, Michael J\. Kahana, Robert T\. Knight, Anais Llorens, Pierre Megevand, Lucia Melloni, Kai J\. Miller, Vitória Piai, Aina Puce, Nick F Ramsey, Caspar M\. Schwiedrzik, Sydney E\. Smith, Arjen Stolk, Nicole C\. Swann, Mariska J Vansteensel, Bradley Voytek, Liang Wang, Jean\-Philippe Lachaux, and Robert Oostenveld\. Advances in human intracranial electroencephalography research, guidelines and good practices\. *NeuroImage*, 260:119438, October 2022\. ISSN 1053\-8119\. doi:10\.1016/j\.neuroimage\.2022\.119438\.
- Mesgarani et al\. \(2014\) N\. Mesgarani, C\. Cheung, K\. Johnson, and E\. F\. Chang\. Phonetic Feature Encoding in Human Superior Temporal Gyrus\. *Science*, 343\(6174\):1006–1010, February 2014\. ISSN 0036\-8075, 1095\-9203\. doi:10\.1126/science\.1245994\.
- Millet et al\. \(2022\) Juliette Millet, Charlotte Caucheteux, Pierre Orhan, Yves Boubenec, Alexandre Gramfort, Ewan Dunbar, Christophe Pallier, and Jean\-Remi King\. Toward a realistic model of speech processing in the brain with self\-supervised learning, June 2022\.
- Moussa & Toneva \(2025\) Omer Moussa and Mariya Toneva\. Brain\-tuning Improves Generalizability and Efficiency of Brain Alignment in Speech Models\. In *The Thirty\-ninth Annual Conference on Neural Information Processing Systems*, October 2025\.
- Moussa et al\. \(2024\) Omer Moussa, Dietrich Klakow, and Mariya Toneva\. Improving Semantic Understanding in Speech Language Models via Brain\-tuning\. In *The Thirteenth International Conference on Learning Representations*, October 2024\.
- Mukamel et al\. \(2005\) Roy Mukamel, Hagar Gelbard, Amos Arieli, Uri Hasson, Itzhak Fried, and Rafael Malach\. Coupling Between Neuronal Firing, Field Potentials, and fMRI in Human Auditory Cortex\. *Science*, 309\(5736\):951–954, August 2005\. doi:10\.1126/science\.1110913\.
- Naselaris et al\. \(2011\) Thomas Naselaris, Kendrick N\. Kay, Shinji Nishimoto, and Jack L\. Gallant\. Encoding and decoding in fMRI\. *NeuroImage*, 56\(2\):400–410, May 2011\. ISSN 1053\-8119\. doi:10\.1016/j\.neuroimage\.2010\.07\.073\.
- Negi et al\. \(2025\) Anuja Negi, Subba Reddy Oota, Anwar O\. Nunez\-Elizalde, Manish Gupta, and Fatma Deniz\. Brain\-Informed Fine\-Tuning for Improved Multilingual Understanding in Language Models\. In *The Thirty\-ninth Annual Conference on Neural Information Processing Systems*, October 2025\.
- Norman\-Haignere et al\. \(2022\) Sam V\. Norman\-Haignere, Laura K\. Long, Orrin Devinsky, Werner Doyle, Ifeoma Irobunda, Edward M\. Merricks, Neil A\. Feldstein, Guy M\. McKhann, Catherine A\. Schevon, Adeen Flinker, and Nima Mesgarani\. Multiscale temporal integration organizes hierarchical computation in human auditory cortex\. *Nature Human Behaviour*, pp\. 1–15, February 2022\. ISSN 2397\-3374\. doi:10\.1038/s41562\-021\-01261\-y\.
- Oota et al\. \(2024\) Subba Reddy Oota, Emin Çelik, Fatma Deniz, and Mariya Toneva\. Speech language models lack important brain\-relevant semantics\. In Lun\-Wei Ku, Andre Martins, and Vivek Srikumar \(eds\.\), *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pp\. 8503–8528, Bangkok, Thailand, August 2024\. Association for Computational Linguistics\. doi:10\.18653/v1/2024\.acl\-long\.462\.
- Overath et al\. \(2015\) Tobias Overath, Josh H\. McDermott, Jean Mary Zarate, and David Poeppel\. The cortical analysis of speech\-specific temporal structure revealed by responses to sound quilts\. *Nature Neuroscience*, 18\(6\):903–911, June 2015\. ISSN 1546\-1726\. doi:10\.1038/nn\.4021\.
- Radford et al\. \(2022\) Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever\. Robust Speech Recognition via Large\-Scale Weak Supervision, 2022\.
- Schönwiesner & Zatorre \(2009\) Marc Schönwiesner and Robert J\. Zatorre\. Spectro\-temporal modulation transfer function of single voxels in the human auditory cortex measured with high\-resolution fMRI\. *Proceedings of the National Academy of Sciences*, 106\(34\):14611–14616, August 2009\. doi:10\.1073/pnas\.0907682106\.
- Tang et al\. \(2023\) Jerry Tang, Amanda LeBel, Shailee Jain, and Alexander G\. Huth\. Semantic reconstruction of continuous language from non\-invasive brain recordings\. *Nature Neuroscience*, 26\(5\):858–866, May 2023\. ISSN 1546\-1726\. doi:10\.1038/s41593\-023\-01304\-9\.
- Tuckute et al\. \(2023\) Greta Tuckute, Jenelle Feather, Dana Boebinger, and Josh H\. McDermott\. Many but not all deep neural network audio models capture brain responses and exhibit correspondence between model stages and brain regions\. *PLOS Biology*, 21\(12\):e3002366, December 2023\. ISSN 1545\-7885\. doi:10\.1371/journal\.pbio\.3002366\.
- Vaidya et al\. \(2022\) Aditya R\. Vaidya, Shailee Jain, and Alexander Huth\. Self\-Supervised Models of Audio Effectively Explain Human Cortical Responses to Speech\. In *Proceedings of the 39th International Conference on Machine Learning*, pp\. 21927–21944\. PMLR, June 2022\.
- Vattikonda et al\. \(2025\) Nishitha Vattikonda, Aditya R\. Vaidya, Richard J\. Antonello, and Alexander G\. Huth\. BrainWavLM: Fine\-tuning Speech Representations with Brain Responses to Language, February 2025\.
- Zada et al\. \(2025\) Zaid Zada, Samuel A\. Nastase, Bobbi Aubrey, Itamar Jalon, Sebastian Michelmann, Haocheng Wang, Liat Hasenfratz, Werner Doyle, Daniel Friedman, Patricia Dugan, Lucia Melloni, Sasha Devore, Adeen Flinker, Orrin Devinsky, Ariel Goldstein, and Uri Hasson\. The “Podcast” ECoG dataset for modeling neural activity during natural language comprehension\. *Scientific Data*, 12\(1\):1135, July 2025\. ISSN 2052\-4463\. doi:10\.1038/s41597\-025\-05462\-2\.

## Appendix A Per\-subject change in ECoG performance after fMRI\-tuning

In Figure [6](https://arxiv.org/html/2605.19224#A1.F6), we visualize the per\-electrode effect of fMRI\-tuning separately for each subject in the “Podcast” dataset\.

![Refer to caption](https://arxiv.org/html/2605.19224v1/x6.png)

Figure 6: Improvement in encoding performance for all electrodes \(Figure [2](https://arxiv.org/html/2605.19224#S3.F2)c\), visualized separately for each subject\.

## Appendix B fMRI\-tuning on downsampled fMRI responses

Figure [7](https://arxiv.org/html/2605.19224#A2.F7) compares the fMRI encoding performance of models fine\-tuned on the original fMRI responses and the downsampled responses\.

![Refer to caption](https://arxiv.org/html/2605.19224v1/x7.png)

Figure 7: Flattened cortical surfaces of subjects S1 and S2 from LeBel et al\. \([2023](https://arxiv.org/html/2605.19224#bib.bib17)\) that compare the encoding performance of the original fMRI\-tuned models against the downsampled\-fMRI\-tuned models\. On the right, 2\-dimensional histograms show that individual voxel performance is stable across fine\-tuning conditions\. \(See Figure [4](https://arxiv.org/html/2605.19224#S3.F4)a for S3\.\)

## Appendix C Scaling of fMRI\-tuning for fMRI prediction

Using the models fMRI\-tuned in Section [3\.4](https://arxiv.org/html/2605.19224#S3.SS4), we show how within\-modality prediction scales with fine\-tuning dataset size in Figure [8](https://arxiv.org/html/2605.19224#A3.F8)\. Fine\-tuned models improve at a faster rate than the pretrained model, and they show gains even at small dataset sizes\.

![Refer to caption](https://arxiv.org/html/2605.19224v1/x8.png)

Figure 8: Within\-subject fMRI encoding performance scales with the size of the fine\-tuning dataset\. Pink, green, and blue show scaling for each fMRI subject, and red shows the mean of the three\. Solid lines show the scaling performance of the pretrained model, while dashed lines show scaling of fine\-tuned models\. Black lines show a best linear fit to the mean across subjects\. Fine\-tuned models outperform the pretrained model at all dataset sizes and also improve faster\.

## Appendix D Scaling of fMRI\-tuning for ECoG prediction

In Figure [9](https://arxiv.org/html/2605.19224#A4.F9), we extend Figure [5](https://arxiv.org/html/2605.19224#S3.F5)a by showing fMRI\-tuning performance using the full fMRI training set\.

![Refer to caption](https://arxiv.org/html/2605.19224v1/x9.png)

Figure 9: ECoG encoding performance as a function of the number of fMRI fine\-tuning stories, including the full training set \(93 stories for S1, and 100 stories for S2 and S3\)\. Error bars indicate the standard error over bootstraps of the fMRI stories\.