Network-Aware Bilinear Tokenization for Brain Functional Connectivity Representation Learning
Summary
NERVE proposes a network-aware bilinear tokenization method for self-supervised learning on brain functional connectivity matrices using masked autoencoders, improving representation learning across developmental cohorts.
Cached at: 05/15/26, 06:19 AM
# Network-Aware Bilinear Tokenization for Brain Functional Connectivity Representation Learning
Source: [https://arxiv.org/html/2605.14048](https://arxiv.org/html/2605.14048)
Affiliations: (1) Department of Radiology, Weill Cornell Medicine, New York, NY, USA. (2) School of Electrical and Computer Engineering, Cornell University and Cornell Tech, New York, NY, USA. Email: lem4012@med.cornell.edu

###### Abstract
Masked autoencoders (MAEs) have recently shown promise for self-supervised representation learning of resting-state brain functional connectivity (FC). However, a fundamental question remains unresolved: how should FC matrices be tokenized to align with the intrinsic modular organization of large-scale brain networks? Existing approaches typically adopt region-centric or graph-based schemes that treat FC as a collection of structurally homogeneous elements and overlook the large-scale network organization of the brain. We introduce NERVE (Network-Aware Representations of Brain Functional Connectivity via Bilinear Tokenization), a self-supervised learning framework that redefines FC tokenization by partitioning FC matrices into patches of intra- and inter-network connectivity blocks. Unlike image-based MAE, where fixed-size patches share a common tokenizer, FC patches defined by network pairs are heterogeneous in size and correspond to distinct functional roles. To resolve this problem, NERVE embeds FC patches through a novel structured bilinear factorization. This formulation preserves network identity and reduces parameter complexity from quadratic to linear scaling in the number of networks. We evaluate NERVE across three large-scale developmental cohorts (ABCD, PNC, and CCNP) for behavior and psychopathology prediction. Compared to structurally agnostic MAE variants and graph-based self-supervised baselines, the proposed network-aware formulation yields more stable and transferable representations, particularly in cross-cohort evaluation. Ablation studies confirm that the proposed bilinear network embedding and anatomically grounded parcellation are critical for performance. These findings highlight the importance of incorporating domain-specific structural priors into self-supervised learning for functional connectomics.
## 1 Introduction
Resting-state functional magnetic resonance imaging (rs-fMRI) enables the estimation of functional connectivity (FC), defined as the temporal correlation between spatially distributed brain regions. FC has become a central tool for studying individual differences in large-scale brain organization and their association with cognition, behavior, and mental health [[15](https://arxiv.org/html/2605.14048#bib.bib15), [17](https://arxiv.org/html/2605.14048#bib.bib17), [14](https://arxiv.org/html/2605.14048#bib.bib14), [29](https://arxiv.org/html/2605.14048#bib.bib29)]. However, extracting compact predictive representations from FC remains challenging due to its high dimensionality, low signal-to-noise ratio, and substantial inter-subject variability [[31](https://arxiv.org/html/2605.14048#bib.bib31), [2](https://arxiv.org/html/2605.14048#bib.bib2), [28](https://arxiv.org/html/2605.14048#bib.bib28)]. Indeed, large-scale studies have reported that increasing model complexity does not reliably improve performance over classical approaches [[9](https://arxiv.org/html/2605.14048#bib.bib9), [23](https://arxiv.org/html/2605.14048#bib.bib23), [27](https://arxiv.org/html/2605.14048#bib.bib27), [10](https://arxiv.org/html/2605.14048#bib.bib10)], suggesting that more appropriate inductive biases may be required.
Masked autoencoders (MAE) [[8](https://arxiv.org/html/2605.14048#bib.bib8)] provide a principled framework for representation learning by partitioning the input into units, embedding each unit into a token representation, masking a subset of tokens, and reconstructing the masked content from the visible ones. In computer vision, these units correspond to spatial image patches that naturally align with local structure. When adapted to FC, however, the notion of a “patch” lacks a canonical definition, as the arrangement of regions in an FC matrix is largely arbitrary and does not necessarily reflect spatial or functional locality. Defining appropriate patch units and their corresponding token embeddings thus becomes central for applying MAE to FC. Existing approaches for FC adopt heuristic tokenization schemes prior to masking [[4](https://arxiv.org/html/2605.14048#bib.bib4), [32](https://arxiv.org/html/2605.14048#bib.bib32), [3](https://arxiv.org/html/2605.14048#bib.bib3), [20](https://arxiv.org/html/2605.14048#bib.bib20), [6](https://arxiv.org/html/2605.14048#bib.bib6)]. BrainMass [[32](https://arxiv.org/html/2605.14048#bib.bib32)], for example, treats individual regions as units and masks randomly selected rows of the FC matrix, while RS-MAE [[20](https://arxiv.org/html/2605.14048#bib.bib20)] masks grouped regions inspired by spatiotemporal strategies. Graph-based methods similarly operate on node-level embeddings, where masking is applied to node tokens [[5](https://arxiv.org/html/2605.14048#bib.bib5), [12](https://arxiv.org/html/2605.14048#bib.bib12), [22](https://arxiv.org/html/2605.14048#bib.bib22), [30](https://arxiv.org/html/2605.14048#bib.bib30)]. Despite promising results, a fundamental question remains: how should FC matrices be partitioned into patches and embedded into tokens to align the learned representations with the modular, network-level organization of brain FC?
Figure 1: Overview of NERVE. A. The functional connectivity (FC) matrix is partitioned into patches defined by pairs of functional brain networks. B. Network-aware Bilinear Tokenization. Each functional network is assigned learnable network-specific weights at initialization, and patch tokens are computed through structured bilinear interactions between network weights during the forward pass. C. MAE Framework. We apply a standard MAE framework to the proposed network-aware tokens, thereby introducing a functionally informed inductive bias over connectivity structure.

Our central insight is that the conceptual analog of spatially neighboring pixels in an image is groups of brain regions that share similar functional dynamics, e.g., regions organized into large-scale functional networks [[33](https://arxiv.org/html/2605.14048#bib.bib33)]. Under this view, the natural counterpart of an image patch is a connectivity block defined by intra- or inter-network interactions. However, a key challenge arises: FC patches defined by pairs of networks vary in dimensionality, precluding the use of a shared patch encoder. Moreover, each network carries distinct functional roles, suggesting that tokenization should preserve network identity rather than collapse all patches into a homogeneous representation space. To address this, we introduce a novel and parameter-efficient *bilinear tokenization* scheme. Instead of learning independent embeddings for each network-pair patch, we learn network-specific region embeddings and model inter-network connectivity via bilinear interactions. This factorization replaces quadratic growth in patch-specific parameters with linear scaling in the number of networks, while explicitly encoding network identity and structured intra- and inter-network interactions.
Integrating this design into an MAE framework, we introduce NERVE (Network-Aware Representations of Brain Functional Connectivity via Bilinear Tokenization), a self-supervised approach tailored to the modular organization of brain FC. We evaluate NERVE across three large-scale adolescent neuroimaging cohorts on the challenging task of predicting behavioral and psychopathology scores. Our results demonstrate that NERVE learns more informative and transferable FC representations, outperforming alternative tokenization strategies and existing self-supervised learning (SSL) approaches.
## 2 Methods
Let $\mathcal{D}=\{X^{(i)}\}_{i=1}^{N}$ denote FC matrices of $N$ participants, where each $X^{(i)}\in\mathbb{R}^{R\times R}$ is a correlation matrix constructed from functional time series across $R$ brain regions. We aim to learn structured and transferable representations of $X$ in a self-supervised manner. To this end, we adopt a transformer-based MAE framework, which partitions $X$ into patches, encodes each patch into a token, randomly masks a subset of tokens, and reconstructs the masked content from the visible tokens.
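As a small illustration of the input described above, the following sketch builds one correlation-based FC matrix from regional time series. The synthetic signals, the number of time points `T`, and the use of Pearson correlation via `np.corrcoef` are assumptions made for the example; the preprocessing actually used for each cohort is described in Section 3.

```python
import numpy as np

rng = np.random.default_rng(0)
T, R = 200, 400                      # hypothetical: 200 time points, R = 400 regions
ts = rng.standard_normal((T, R))     # synthetic stand-in for parcellated BOLD signals

# Pearson correlation between every pair of regional time series.
X = np.corrcoef(ts, rowvar=False)    # shape (R, R), symmetric, unit diagonal
```

Each participant contributes one such $R\times R$ matrix to $\mathcal{D}$.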
Network-based Patching. The way data are partitioned into patches determines what structure the model can exploit through masking and reconstruction. While image-based MAE relies on a natural decomposition into fixed-size image patches, FC matrices lack a canonical patching scheme, making tokenization a non-trivial design choice. Rather than treating rows of $X$ as patches [[4](https://arxiv.org/html/2605.14048#bib.bib4), [32](https://arxiv.org/html/2605.14048#bib.bib32), [3](https://arxiv.org/html/2605.14048#bib.bib3)], we observe that the conceptual analog of neighboring pixels in images is groups of brain regions that share similar functional dynamics. This organization is captured by large-scale functional networks [[33](https://arxiv.org/html/2605.14048#bib.bib33)]. Therefore, we propose to reorganize $X$ by grouping the $R$ regions into $N_n$ established functional networks $\{\mathcal{N}_1,\dots,\mathcal{N}_{N_n}\}$ (e.g., Visual, Default, Dorsal Attention). For each network pair $(l,m)$ with $l\leq m$, we define a connectivity block $x_{l,m}\in\mathbb{R}^{|\mathcal{N}_l|\times|\mathcal{N}_m|}$ representing intra- ($l=m$) or inter-network ($l<m$) connectivity (Fig. [1](https://arxiv.org/html/2605.14048#S1.F1)A). These connectivity patches are then treated as the basic units for masking and reconstruction. The total number of patches is $N_{\text{patch}}=\frac{N_n(N_n+1)}{2}$.
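The patching step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `network_patches`, the synthetic FC matrix, and the balanced random region-to-network assignment are all hypothetical stand-ins for a real atlas labeling.

```python
import numpy as np

def network_patches(X, labels):
    """Split an R x R FC matrix into blocks x_{l,m} for all network pairs l <= m."""
    nets = np.unique(labels)
    idx = [np.flatnonzero(labels == n) for n in nets]   # region indices per network
    patches = {}
    for l in range(len(nets)):
        for m in range(l, len(nets)):
            # Rows from network l, columns from network m.
            patches[(l, m)] = X[np.ix_(idx[l], idx[m])]
    return patches

rng = np.random.default_rng(0)
R, n_networks = 400, 17
# Hypothetical assignment: every network gets 23 or 24 regions, in shuffled order.
labels = rng.permutation(np.arange(R) % n_networks)
X = np.corrcoef(rng.standard_normal((200, R)), rowvar=False)

patches = network_patches(X, labels)
sizes = np.bincount(labels)          # |N_l| for each network
```

With 17 networks this yields $17\cdot 18/2 = 153$ patches of varying shape, which is why a single fixed-size tokenizer does not apply directly.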
Shared vs. Patch-specific Tokenization. In image-based MAE, patches are fixed-size and interchangeable, enabling a shared projection for tokenization. Here, network-defined patches vary in size and correspond to specific network interactions, making them structurally and semantically distinct. Embedding these irregular and network-specific patches, therefore, requires a dedicated network-aware tokenization strategy. A straightforward adaptation of image-based MAE to FC would flatten each patch $x_{l,m}$ and project it through a linear transformation (denoted as shared linear): $t_{l,m}=W^{\top}\,\mathrm{vec}(x_{l,m})$, where $W\in\mathbb{R}^{S_{\max}\times d_E}$ with $d_E$ the encoder embedding dimension, and $S_{\max}=\max_{l,m}|\mathcal{N}_l|\cdot|\mathcal{N}_m|$ the maximum flattened patch size. Because patches have different sizes, zero-padding to the largest patch size $S_{\max}$ is required. While parameter-efficient, this shared projection enforces a common representation across semantically distinct network interactions and ignores the structural heterogeneity of FC. An alternative is to assign a distinct projection layer to each network pair (denoted as patch-specific linear): $t_{l,m}=W_{l,m}^{\top}\,\mathrm{vec}(x_{l,m})$, where $W_{l,m}\in\mathbb{R}^{(|\mathcal{N}_l||\mathcal{N}_m|)\times d_E}$. Although this allows patch-specific (network-pair) modeling, it introduces a quadratic growth in parameters with respect to the number of networks $N_n$, which quickly becomes impractical and risks overfitting.
Bilinear Tokenization. To preserve network specificity while maintaining parameter efficiency, we propose a bilinear network-aware tokenization. Each functional network $\mathcal{N}_l$ is assigned a learnable matrix $U_l\in\mathbb{R}^{|\mathcal{N}_l|\times d_E}$, where each column represents a network-specific embedding dimension. For an FC patch $x_{l,m}\in\mathbb{R}^{|\mathcal{N}_l|\times|\mathcal{N}_m|}$ between networks $l$ and $m$, we construct the corresponding tokenizer via a column-wise Kronecker (Khatri–Rao) [[16](https://arxiv.org/html/2605.14048#bib.bib16)] product (Fig. [1](https://arxiv.org/html/2605.14048#S1.F1)B):

$$W_{l,m}=U_l\odot U_m\;\in\;\mathbb{R}^{(|\mathcal{N}_l||\mathcal{N}_m|)\times d_E},\quad t_{l,m}=W_{l,m}^{\top}\,\mathrm{vec}(x_{l,m})\;\in\;\mathbb{R}^{d_E},$$

where $\odot$ denotes the Khatri–Rao product defined elementwise as $[W_{l,m}]_{(i,j),k}=[U_l]_{i,k}\,[U_m]_{j,k}$. Conceptually, instead of learning an independent projection for each network pair, we learn network-level region embeddings and model inter-network connectivity as bilinear interactions between them. The contribution of a connection between region $i\in\mathcal{N}_l$ and region $j\in\mathcal{N}_m$ to embedding dimension $k$ is given by the product $[U_l]_{i,k}[U_m]_{j,k}$, capturing structured interactions while sharing parameters across networks. This low-rank factorization of patch-specific projections replaces quadratic growth in patch-specific parameters with a linear scaling in the number of networks, while explicitly encoding network identity and structured intra- and inter-network interactions. Specifically, if we assume the $R$ brain regions are approximately evenly distributed across the $N_n$ networks, we can approximate $S_{\max}\approx(R/N_n)^2$. Under this assumption, the parameter complexities for shared linear, patch-specific linear, and bilinear embedding are $\mathcal{O}\left(\frac{R^2}{N_n^2}\times d_E\right)$, $\mathcal{O}\left(R^2\times d_E\right)$, and $\mathcal{O}\left(R\times d_E\right)$, respectively. This highlights the trade-off between parameter efficiency and expressivity: complexity shifts from a universal or patch-specific projection to network-centric representations.
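The bilinear tokenizer can be checked numerically with a short NumPy sketch. The helper names `khatri_rao` and `bilinear_token`, the random weights, and the row-major convention for $\mathrm{vec}(\cdot)$ are assumptions for illustration; the identity being demonstrated is that $[t_{l,m}]_k=\sum_{i,j}[U_l]_{i,k}\,[x_{l,m}]_{i,j}\,[U_m]_{j,k}$, so the token can be computed without ever materializing $W_{l,m}$.

```python
import numpy as np

def khatri_rao(U_l, U_m):
    """Column-wise Kronecker product: row (i, j), column k holds U_l[i, k] * U_m[j, k]."""
    n_l, d = U_l.shape
    n_m, _ = U_m.shape
    return (U_l[:, None, :] * U_m[None, :, :]).reshape(n_l * n_m, d)

def bilinear_token(x, U_l, U_m):
    """t_k = sum_{i,j} U_l[i, k] * x[i, j] * U_m[j, k], without building W_{l,m}."""
    return np.einsum("ik,ij,jk->k", U_l, x, U_m)

rng = np.random.default_rng(0)
n_l, n_m, d_E = 24, 23, 256                 # hypothetical network sizes
U_l = rng.standard_normal((n_l, d_E))       # stand-ins for learnable weights
U_m = rng.standard_normal((n_m, d_E))
x = rng.standard_normal((n_l, n_m))         # stand-in for an FC patch x_{l,m}

W = khatri_rao(U_l, U_m)
t_explicit = W.T @ x.reshape(-1)            # row-major vec matches the (i, j) row order of W
t_bilinear = bilinear_token(x, U_l, U_m)
```

The two routes give the same token, while the learnable parameters are only the $(24+23)\times 256$ entries of `U_l` and `U_m` rather than the $24\cdot 23\times 256$ entries of a free `W`.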
MAE for FC. Given the network-aware token sequence $T\in\mathbb{R}^{N_{\text{patch}}\times d_E}$, a learnable CLS token is prepended, and learnable positional embeddings are added to encode the identity of each network pair. For each subject, a fixed proportion of patch tokens is randomly masked, and only the visible tokens are processed by a transformer encoder, producing contextualized representations. To reconstruct masked connectivity, mask tokens are first inserted at masked positions, and the full token sequence is then processed by a lightweight transformer decoder. The decoded tokens are then projected back to the FC patch space using a bilinear decoding layer consistent with the tokenization scheme, yielding reconstructed patches $\hat{x}_{l,m}$ (Fig. [1](https://arxiv.org/html/2605.14048#S1.F1)C). The reconstruction objective is $\mathcal{L}_{\text{recon}}=\frac{1}{|\mathcal{M}|}\sum_{(l,m)\in\mathcal{M}}\left\|\hat{x}_{l,m}-x_{l,m}\right\|_2^2$, where $\mathcal{M}$ denotes the set of masked patches.
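The masking and loss computation can be sketched independently of the transformer itself. In this minimal example, `random_mask` and `recon_loss` are hypothetical helper names, the number of masked patches is obtained by flooring (the text does not say how a ratio of 0.5 is rounded for 153 patches), and the norm is read as the sum of squared entries of each patch.

```python
import numpy as np

def random_mask(n_patches, mask_ratio, rng):
    """Return (masked, visible) patch indices for one subject."""
    n_masked = int(mask_ratio * n_patches)      # assumption: floor
    perm = rng.permutation(n_patches)
    return np.sort(perm[:n_masked]), np.sort(perm[n_masked:])

def recon_loss(patches, recon, masked_keys):
    """Mean over masked patches of the summed squared reconstruction error."""
    return float(np.mean([np.sum((recon[k] - patches[k]) ** 2) for k in masked_keys]))

rng = np.random.default_rng(0)
masked, visible = random_mask(153, 0.5, rng)

# Toy patches of different shapes, with a constant reconstruction error of 0.5.
patches = {(0, 0): np.zeros((2, 2)), (0, 1): np.zeros((2, 3)), (1, 1): np.zeros((3, 3))}
recon = {k: v + 0.5 for k, v in patches.items()}
loss = recon_loss(patches, recon, [(0, 1), (1, 1)])   # (0.25*6 + 0.25*9) / 2
```

Only the masked patches contribute to the loss; visible patches serve as context for the encoder.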
Implementation Details. We use Schaefer’s 17-network parcellation ($R=400$, $N_n=17$, $N_{\text{patch}}=153$) [[26](https://arxiv.org/html/2605.14048#bib.bib26)] with a masking ratio of $0.5$. The encoder uses $4$ layers, $4$ heads, and $d_E=256$. The decoder has $1$ layer and $2$ heads with $d_D=64$. Models are trained for 4,000 epochs using AdamW with a learning rate of 1e-2, weight decay of 1e-2, cosine scheduling with linear warmup for 400 epochs, batch size of 1,024, and mixed precision using PyTorch on a single NVIDIA L40 GPU. We perform preliminary hyperparameter tuning on the training data for the masking ratio, the depth and width of the transformer encoder and decoder, and the learning rate. For brevity, we report results using the best-performing configuration and focus ablations on methodological design choices central to our contribution. Code and pretrained models will be released upon acceptance.
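For concreteness, the patch count and tokenizer sizes implied by this configuration can be worked out as follows. The shared and patch-specific figures assume evenly sized networks, as in the complexity argument above, and ignore bias terms; real Schaefer networks are not evenly sized, so these are order-of-magnitude estimates only. The bilinear count is exact because $\sum_l |\mathcal{N}_l| = R$.

$$
\begin{aligned}
N_{\text{patch}} &= \frac{N_n(N_n+1)}{2} = \frac{17\cdot 18}{2} = 153,\\
\text{shared linear:}\quad & S_{\max}\, d_E \approx \left(\frac{400}{17}\right)^{2}\cdot 256 \approx 1.4\times 10^{5},\\
\text{patch-specific linear:}\quad & d_E \sum_{l\le m} |\mathcal{N}_l||\mathcal{N}_m| = d_E\,\frac{R^{2}+\sum_l |\mathcal{N}_l|^{2}}{2} \approx 256\cdot\frac{400^{2}\cdot 18}{2\cdot 17} \approx 2.2\times 10^{7},\\
\text{bilinear:}\quad & d_E \sum_{l} |\mathcal{N}_l| = R\, d_E = 400\cdot 256 = 102{,}400.
\end{aligned}
$$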
## 3 Experiments and Results
Datasets & Downstream Evaluation. We evaluate our framework on three large-scale developmental rs-fMRI cohorts: ABCD (N=1,791, age 9–10, 53.8% F) [[7](https://arxiv.org/html/2605.14048#bib.bib7)], PNC (N=1,416, age 8–23, 53.6% F) [[25](https://arxiv.org/html/2605.14048#bib.bib25)], and CCNP (N=178, age 6–17, 52.3% F) [[19](https://arxiv.org/html/2605.14048#bib.bib19)]. For ABCD, we directly use the FC matrices preprocessed and released by CBIG [[21](https://arxiv.org/html/2605.14048#bib.bib21)], which contain a subset of the full ABCD cohort with the highest image quality. We preprocess PNC and CCNP data using the publicly available DeepPrep pipeline [[24](https://arxiv.org/html/2605.14048#bib.bib24)], which includes motion correction, nuisance regression, and temporal and spatial filtering. All FC matrices are defined on the $R=400$-region Schaefer atlas [[26](https://arxiv.org/html/2605.14048#bib.bib26)]. We train the MAE on the FC matrices from the combined ABCD+PNC datasets, treating CCNP as out-of-domain samples, and apply the trained model to extract FC representations for all three datasets. Then, within each dataset, we use the representations to predict behavioral phenotypes. For ABCD, we predict the Internalizing, Externalizing, and Total scores from the Child Behavior Checklist (CBCL) [[1](https://arxiv.org/html/2605.14048#bib.bib1)]. For PNC and CCNP, we predict the Reproducible Brain Charts (RBC) [[11](https://arxiv.org/html/2605.14048#bib.bib11)] harmonized Internalizing, Externalizing, and general psychopathology (p-factor) scores. Kernel Ridge Regression (KRR) is adopted as the downstream regression model following prior large-scale evaluations demonstrating its strong and stable performance for behavioral prediction compared to more complex neural network-based probes [[9](https://arxiv.org/html/2605.14048#bib.bib9)].
KRR is assessed using stratified 10\-fold cross\-validation within each dataset, with age and sex effects removed via linear regressions fit on the training folds\. Finally, we report the Pearson correlation between the true and predicted behavioral scores concatenated across testing folds\. Uncertainty is quantified via bootstrap resampling over subjects \(1,000 iterations\) to estimate 95% confidence intervals \(CI\)\. The statistical significance of performance differences between the top two methods was evaluated using two\-sided paired bootstrap tests on subject\-level out\-of\-fold predictions \(1,000 iterations\) for each behavioral variable\.
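The downstream protocol can be sketched in NumPy as follows. Several details here are assumptions not stated in the text: the confounds are regressed out of the target (rather than the features), the kernel is linear, the regularization `lam` is fixed, and folds are plain shuffled splits rather than stratified ones. All data are synthetic and the function names are hypothetical.

```python
import numpy as np

def residualize(y_tr, C_tr, y_te, C_te):
    """Fit y ~ [1, covariates] on the training fold only, then remove the fit from both folds."""
    A_tr = np.column_stack([np.ones(len(C_tr)), C_tr])
    A_te = np.column_stack([np.ones(len(C_te)), C_te])
    beta, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return y_tr - A_tr @ beta, y_te - A_te @ beta

def krr_cross_val(Z, y, C, n_folds=10, lam=1.0, seed=0):
    """Return out-of-fold (residualized target, prediction) pairs for every subject."""
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = np.array_split(rng.permutation(n), n_folds)
    y_true, y_pred = np.empty(n), np.empty(n)
    for te in folds:
        tr = np.setdiff1d(np.arange(n), te)
        y_tr, y_te = residualize(y[tr], C[tr], y[te], C[te])
        K_tr = Z[tr] @ Z[tr].T                                  # linear kernel (assumed)
        alpha = np.linalg.solve(K_tr + lam * np.eye(len(tr)), y_tr)
        y_pred[te] = (Z[te] @ Z[tr].T) @ alpha
        y_true[te] = y_te
    return y_true, y_pred

rng = np.random.default_rng(0)
n, d = 200, 16
Z = rng.standard_normal((n, d))                                  # stand-in for learned FC representations
C = np.column_stack([rng.uniform(8, 23, n), rng.integers(0, 2, n)])  # synthetic age, sex
y = Z @ rng.standard_normal(d) + 0.5 * C[:, 0] + 0.1 * rng.standard_normal(n)

y_true, y_pred = krr_cross_val(Z, y, C)
r = np.corrcoef(y_true, y_pred)[0, 1]   # Pearson r on predictions concatenated across folds
```

Fitting the confound regression on training folds only avoids leaking test-fold information into the targets.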
Table 1: Behavioral prediction performance across developmental cohorts. We report Pearson correlation $r_{\pm\Delta}$ with bootstrap 95% CI half-width $\Delta$ from 10-fold cross-validation. n.: negative values; OOD: out-of-domain.

| Method | ABCD Int. | ABCD Ext. | ABCD Total | PNC Int. | PNC Ext. | PNC p-factor | CCNP (OOD) Int. | CCNP (OOD) Ext. | CCNP (OOD) p-factor |
|---|---|---|---|---|---|---|---|---|---|
| BrainNetCNN | .05 ± .04 | .08 ± .04 | .11 ± .04 | .07 ± .03 | .08 ± .04 | .05 ± .03 | .15 ± .12 | .10 ± .13 | .01 ± .12 |
| BrainGNN | .05 ± .04 | .08 ± .05 | .06 ± .04 | .08 ± .04 | .04 ± .04 | .06 ± .04 | .17 ± .13 | .16 ± .16 | .04 ± .10 |
| BrainNetTF | – | – | – | .08 ± .04 | .07 ± .05 | .05 ± .04 | .01 ± .09 | .21 ± .14 | .05 ± .09 |
| GraphMAE | .06 ± .05 | .06 ± .04 | .06 ± .04 | .09 ± .05 | .08 ± .05 | .09 ± .05 | .07 ± .16 | .22 ± .16 | .12 ± .13 |
| GATE | – | – | – | .07 ± .06 | .08 ± .05 | .09 ± .04 | .04 ± .09 | .14 ± .14 | .11 ± .08 |
| BrainGSLs | .04 ± .05 | .05 ± .05 | .05 ± .05 | .05 ± .05 | .08 ± .06 | .04 ± .05 | n. | .22 ± .15 | n. |
| BrainMass | .06 ± .05 | .08 ± .05 | .01 ± .04 | .03 ± .06 | .09 ± .06 | .10 ± .05 | .04 ± .13 | .14 ± .14 | .13 ± .16 |
| NERVE | .11 ± .06* | .09 ± .05 | .13 ± .04 | .14 ± .05* | .13 ± .05* | .12 ± .05 | .08 ± .14 | .33 ± .14* | .13 ± .15 |

\* Gray shading in the original: $p<0.05$, two-sided paired bootstrap test between the top-1 and top-2 methods.
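The uncertainty estimates behind Table 1 can be illustrated with a subject-level bootstrap. This is a sketch under assumptions: the percentile method for the interval and the "twice the smaller tail" rule for the two-sided p-value are common choices, but the text does not specify which variants were used. The function names and synthetic predictions are hypothetical.

```python
import numpy as np

def pearson(a, b):
    return np.corrcoef(a, b)[0, 1]

def bootstrap_ci(y, pred, n_boot=1000, seed=0):
    """Percentile 95% CI for Pearson r, resampling subjects with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y)
    rs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        rs[b] = pearson(y[idx], pred[idx])
    return np.percentile(rs, [2.5, 97.5])

def paired_bootstrap_p(y, pred_a, pred_b, n_boot=1000, seed=0):
    """Two-sided p-value for r(a) - r(b), using the same resampled subjects for both methods."""
    rng = np.random.default_rng(seed)
    n = len(y)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        diffs[b] = pearson(y[idx], pred_a[idx]) - pearson(y[idx], pred_b[idx])
    # Assumed rule: twice the smaller tail mass of the bootstrap differences around zero.
    return min(1.0, 2 * min(np.mean(diffs <= 0), np.mean(diffs >= 0)))

rng = np.random.default_rng(1)
y = rng.standard_normal(500)                       # synthetic out-of-fold targets
pred_a = y + 0.5 * rng.standard_normal(500)        # synthetic strong method
pred_b = y + 3.0 * rng.standard_normal(500)        # synthetic weak method

lo, hi = bootstrap_ci(y, pred_a)
p = paired_bootstrap_p(y, pred_a, pred_b)
```

Resampling the same subjects for both methods is what makes the test paired: it compares the methods on identical bootstrap samples.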
Comparison Methods. We benchmark NERVE against established methods for FC encoding, restricting comparisons to approaches operating on FC matrices to ensure fair and interpretable evaluation. Compared encoders include BrainMass [[32](https://arxiv.org/html/2605.14048#bib.bib32)], a foundation-style model combining temporal masking and contrastive learning, and graph-based SSL methods: GraphMAE [[12](https://arxiv.org/html/2605.14048#bib.bib12)], BrainGSLs [[30](https://arxiv.org/html/2605.14048#bib.bib30)], and GATE [[22](https://arxiv.org/html/2605.14048#bib.bib22)], which apply node-masked graph encoders with SSL objectives. All models are assessed using the same protocol: training encoders on ABCD+PNC and evaluating downstream KRR within each dataset via 10-fold cross-validation. To put the prediction performance in the context of fully supervised, task-specific learning capacity, we also evaluate state-of-the-art supervised FC-based prediction models, which include BrainNetCNN [[15](https://arxiv.org/html/2605.14048#bib.bib15)], BrainGNN [[17](https://arxiv.org/html/2605.14048#bib.bib17)], and BrainNetTF [[14](https://arxiv.org/html/2605.14048#bib.bib14)]. These supervised models are trained and evaluated end-to-end within each dataset by 10-fold cross-validation, as behavioral targets derive from different assessment instruments and are not strictly interchangeable across cohorts. GATE and BrainNetTF require BOLD time series preprocessing and are therefore evaluated only on PNC and CCNP, as only FC matrices are available for ABCD in the CBIG release.
Main Results. Past rs-fMRI studies have consistently shown that continuous behavioral prediction from FC is intrinsically challenging, with state-of-the-art models typically achieving modest effect sizes (often $r<0.15$) even in large cohorts [[21](https://arxiv.org/html/2605.14048#bib.bib21), [9](https://arxiv.org/html/2605.14048#bib.bib9)]. In this context, the representations learned by NERVE demonstrate strong and stable predictive power across datasets (Table [1](https://arxiv.org/html/2605.14048#S3.T1)). Specifically, in in-domain evaluations (ABCD, PNC), NERVE achieves the highest or tied-highest performance across all behavioral targets, supporting the advantage of incorporating network-level structure over structurally agnostic SSL baselines and supervised FC architectures trained within a single dataset. Among self-supervised baselines, GraphMAE and BrainMass show competitive performance on specific targets but exhibit higher variability across targets. In the out-of-domain CCNP cohort, NERVE exhibits strong cross-cohort generalization, achieving the highest observed correlation for externalizing symptoms ($r=.33$) and competitive performance for the general psychopathology (p-factor) score. These results further support the role of network-aware tokenization in enhancing representation stability beyond the training distribution. The only target where NERVE does not achieve top performance is the internalizing phenotype in the CCNP cohort, where supervised models trained specifically on CCNP obtain higher correlations.
Table 2: Tokenization ablation. We report Pearson correlation $r_{\pm\Delta}$ with bootstrap 95% CI half-width $\Delta$ from 10-fold cross-validation. n. indicates negative values.

Table 3: Parcellation and network structure ablation. We report Pearson $r$ on 10-fold cross-validation. For the Permutation baselines, we report mean $\pm$ std $r$ over 100 random permutations of region-to-network assignment.

Tokenization Ablation. We compare three tokenization strategies within the NERVE framework: i) shared linear, ii) patch-specific linear, and iii) the proposed bilinear network-aware formulation (Table [2](https://arxiv.org/html/2605.14048#S3.T2)). The bilinear formulation consistently achieves the strongest overall performance, particularly on in-domain datasets, while the patch-specific variant performs second-best in most settings. The shared linear formulation remains competitive on the out-of-domain CCNP samples, consistent with the idea that reduced parameterization benefits generalization. Overall, the proposed bilinear strategy achieves the strongest predictive performance, supporting the effectiveness of combining constrained parameterization with an explicit network-level inductive bias.
Network Structure and Parcellation. To evaluate the interaction between network-based patching and the proposed network-specific bilinear tokenization, we design controlled ablations that modify patch definitions while keeping all other architectural and training components fixed, isolating the effect of anatomically grounded tokenization in NERVE. First, we perform 100 permutation baselines by randomly shuffling region indices prior to patch assignment within the 17-network parcellation [[33](https://arxiv.org/html/2605.14048#bib.bib33)]. This serves as a null model without meaningful network structure. Second, we evaluate a vanilla MAE configuration that partitions the FC matrix into fixed contiguous $16\times 16$ square patches following image-based practice [[8](https://arxiv.org/html/2605.14048#bib.bib8)]. We then encode these patches using the shared, patch-specific, and proposed bilinear network-aware formulations, thereby removing explicit functional network grounding. Third, we test a coarser 7-network parcellation [[33](https://arxiv.org/html/2605.14048#bib.bib33)] to assess sensitivity to network granularity. Table [3](https://arxiv.org/html/2605.14048#S3.T3) reports performance and one-sided permutation test p-values computed against 100 region-permuted baselines. NERVE with the 17-network parcellation achieves the highest performance and significantly outperforms the baselines, supporting the relevance of anatomically informed patch definitions. Notably, when network structure is disrupted, patch-specific linear tokenization tends to outperform the bilinear formulation, suggesting that the proposed bilinear factorization is most effective when aligned with meaningful functional organization.
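The permutation baseline and its p-value can be sketched as follows. The helper names are hypothetical, the null scores are made-up numbers rather than results from the paper, and the `+1` correction in the p-value is a common convention that the text does not confirm.

```python
import numpy as np

def permuted_labels(labels, rng):
    """Shuffle the region-to-network assignment while keeping every network's size."""
    return rng.permutation(labels)

def one_sided_p(observed, null_scores):
    """Share of null scores at least as large as the observed score (with +1 correction)."""
    null_scores = np.asarray(null_scores)
    return (1 + np.sum(null_scores >= observed)) / (1 + len(null_scores))

rng = np.random.default_rng(0)
labels = np.arange(400) % 17                  # hypothetical 17-network assignment
shuffled = permuted_labels(labels, rng)       # one null parcellation

# Hypothetical correlations from 100 models trained on permuted parcellations.
null_scores = np.linspace(0.0, 0.099, 100)
p = one_sided_p(0.13, null_scores)            # 1/101, since no null score reaches 0.13
```

Because the shuffle preserves network sizes, the null models see patches of the same shapes as NERVE; only the functional meaning of the grouping is destroyed.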
## 4 Discussion & Conclusion
Our results consistently demonstrate that incorporating network-level structure improves stability and cross-cohort generalizability of FC representations. Nevertheless, several limitations should be acknowledged. First, evaluation was conducted within developmental cohorts and psychopathology measures. While this setting reflects a clinically relevant and challenging prediction task, further evaluation across additional populations and phenotypic domains will help assess the generalizability of the learned representations. Second, our framework currently operates on static FC derived from a predefined parcellation. Extensions incorporating dynamic FC patterns [[13](https://arxiv.org/html/2605.14048#bib.bib13)] or multimodal integration with structural connectivity derived from diffusion MRI [[18](https://arxiv.org/html/2605.14048#bib.bib18)] represent promising directions for extending the proposed framework. Future work will focus on scaling to more heterogeneous datasets, exploring alternative downstream tasks, and improving interpretability of the learned representations. In particular, future analyses will examine the learned network-specific weights in the bilinear embedding and characterize attention patterns between network-aware tokens to better understand how large-scale network connectivity patterns across parcellations contribute to representation learning.
In this work, we introduced a self-supervised framework that incorporates large-scale functional network structure into brain connectivity representation learning through a bilinear tokenization strategy. Our results demonstrate that network-aware modeling provides a principled inductive bias for functional connectomics and supports scalable cross-cohort brain–behavior mapping.
### 4.0.1 Acknowledgements
This work was supported in part by NIH Grant AA028840 (QZ), a BBRF Young Investigator Grant (QZ), and a NIARR Pilot Resource Grant (QZ).
### 4.0.2 Disclosure of Interests
The authors have no competing interests to declare that are relevant to the content of this article.
## References
- [1] Achenbach, T.M., Edelbrock, C.S.: The classification of child psychopathology: A review and analysis of empirical efforts. Psychological Bulletin 85(6), 1275–1301 (1978)
- [2] Button, K.S., Ioannidis, J.P., Mokrysz, C., Nosek, B.A., Flint, J., Robinson, E.S., et al.: Power failure: why small sample size undermines the reliability of neuroscience. Nat. Rev. Neurosci. 14(5), 365–376 (2013)
- [3] Caro, J.O., Fonseca, A.H.d.O., Averill, C., Rizvi, S.A., Rosati, M., Cross, J.L., et al.: BrainLM: A foundation model for brain activity recordings. In: ICLR. Curran Associates, Inc. (2024)
- [4] Dong, Z., Li, R., Wu, Y., Nguyen, T., Su, J., Chong, et al.: Brain-JEPA: Brain Dynamics Foundation Model with Gradient Positioning and Spatiotemporal Masking. In: NeurIPS. vol. 37, pp. 86048–86073. Curran Associates, Inc. (2024)
- [5] Farahani, F.V., Karwowski, W., Lighthall, N.R.: Application of graph theory for identifying connectivity patterns in human brain networks: A systematic review. Frontiers in Neuroscience 13 (2019)
- [6] Gao, J., Ge, B., Qiang, N., Zhao, S.: 3D masked autoencoder with spatiotemporal transformer for modeling of 4D fMRI data. Medical Image Analysis 107(Pt B), 103861 (2026)
- [7] Garavan, H., Bartsch, H., Conway, K., Decastro, A., Goldstein, R.Z., Heeringa, S., et al.: Recruiting the ABCD sample: design considerations and procedures. Developmental Cognitive Neuroscience 32, 16–22 (2018)
- [8] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked Autoencoders Are Scalable Vision Learners. In: CVPR. IEEE, Inc. (2021)
- [9] He, T., Kong, R., Holmes, A.J., Nguyen, M., Sabuncu, M.R., et al.: Deep neural networks and kernel regression achieve comparable accuracies for functional connectivity prediction of behavior and demographics. NeuroImage 206 (2020)
- [10] He, T., Kong, R., Holmes, A.J., Sabuncu, M.R., Eickhoff, S.B., Bzdok, et al.: Is deep learning better than kernel regression for functional connectivity prediction of fluid intelligence? In: PRNI. IEEE, Inc. (2018)
- [11] Hoffmann, M.S., Moore, T.M., Axelrud, L.K., Tottenham, N., Pan, P.M., Miguel, et al.: An Evaluation of Item Harmonization Strategies Between Assessment Tools of Psychopathology in Children and Adolescents. Assessment 31(2), 502–517 (2024)
- [12] Hou, Z., Liu, X., Cen, Y., Dong, Y., Yang, H., Wang, C., et al.: GraphMAE: Self-Supervised Masked Graph Autoencoders. In: SIGKDD. pp. 594–604. Association for Computing Machinery (2022)
- [13] Hutchison, R.M., Womelsdorf, T., Allen, E.A., Bandettini, P.A., Calhoun, V.D., Corbetta, et al.: Dynamic functional connectivity: Promise, issues, and interpretations. NeuroImage 80, 360–378 (2013)
- [14] Kan, X., Dai, W., Cui, H., Zhang, Z., Guo, Y., Yang, C.: Brain Network Transformer. In: NeurIPS. vol. 35. Curran Associates, Inc. (2022)
- [15] Kawahara, J., Brown, C.J., Miller, S.P., Booth, B.G., Chau, V., Grunau, R.E., et al.: BrainNetCNN: Convolutional neural networks for brain networks; towards predicting neurodevelopment. NeuroImage 146, 1038–1049 (2017)
- [16] Khatri, C.G., Radhakrishna Rao, C.: Solutions to Some Functional Equations and Their Applications to Characterization of Probability Distributions. The Indian Journal of Statistics 30(2), 167–180 (1968)
- [17] Li, X., Zhou, Y., Dvornek, N., Zhang, M., Gao, S., Zhuang, et al.: BrainGNN: Interpretable Brain Graph Neural Network for fMRI Analysis. Medical Image Analysis 74, 102233 (2021)
- [18] Litwińczuk, M.C., Muhlert, N., Cloutman, L., Trujillo-Barreto, N., Woollams, A.: Combination of structural and functional connectivity explains unique variation in specific domains of cognitive function. NeuroImage 262(3), 119531 (2022)
- [19] Liu, S., Wang, Y.S., Zhang, Q., Zhou, Q., Cao, L.Z., Jiang, C., et al.: Chinese Color Nest Project: An accelerated longitudinal brain-mind cohort. Developmental Cognitive Neuroscience 52, 101020 (2021)
- [20] Ma, H., Xu, Y., Tian, L.: RS-MAE: Region-State Masked Autoencoder for Neuropsychiatric Disorder Classifications Based on Resting-State fMRI. IEEE Transactions on Neural Networks and Learning Systems 36(6), 10707–10720 (2025)
- \[21\]Ooi, L\.Q\.R\., Chen, J\., Zhang, S\., Kong, R\., Tam, A\., Li, J\., et al\.: Comparison of individualized behavioral predictions across anatomical, diffusion and functional connectivity MRI\. NeuroImage263, 119636 \(2022\)
- \[22\]Peng, L\., Wang, N\., Xu, J\., Zhu, X\., Li, X\.: GATE: Graph CCA for Temporal Self\-Supervised Learning for Label\-Efficient fMRI Analysis\. IEEE Transactions on Medical Imaging42\(2\), 391–402 \(2023\)
- \[23\]Pervaiz, U\., Vidaurre, D\., Woolrich, M\.W\., Smith, S\.M\.: Optimising network modelling methods for fMRI\. NeuroImage211, 116604 \(2020\)
- \[24\]Ren, J\., An, N\., Lin, C\., Zhang, Y\., Sun, Z\., Zhang, et al\.: DeepPrep: an accelerated, scalable and robust pipeline for neuroimaging preprocessing empowered by deep learning\. Nature Methods22\(3\), 473–476 \(2025\)
- \[25\]Satterthwaite, T\.D\., Elliott, M\.A\., Ruparel, K\., Loughead, J\., Prabhakaran, K\., Calkins, et al\.: Neuroimaging of the Philadelphia Neurodevelopmental Cohort\. NeuroImage86, 544–553 \(2014\)
- \[26\]Schaefer, A\., Kong, R\., Gordon, E\.M\., Laumann, T\.O\., Zuo, X\.N\., Holmes, et al\.: Local\-Global Parcellation of the Human Cerebral Cortex from Intrinsic Functional Connectivity MRI\. Cerebral cortex28\(9\), 3095–3114 \(2018\)
- \[27\]Schulz, M\.A\., Yeo, B\.T\., Vogelstein, J\.T\., Mourao\-Miranada, J\., Kather, J\.N\., Kording, K\., et al\.: Different scaling of linear models and deep learning in UKBiobank brain images versus machine\-learning datasets\. Nature Communications11\(1\), 1–15 \(2020\)
- \[28\]Tiego, J\., Martin, E\.A\., DeYoung, C\.G\., Hagan, K\., Cooper, S\.E\., Pasion, et al\.: Precision behavioral phenotyping as a strategy for uncovering the biological correlates of psychopathology\. Nature Mental Health1\(5\), 304–315 \(2023\)
- \[29\]Wei, W\., Zhang, K\., Chang, J\., Zhang, S\., Ma, L\., Wang, H\., et al\.: Analyzing 20 years of Resting\-State fMRI Research: Trends and collaborative networks revealed\. Brain Research1822, 148634 \(2024\)
- \[30\]Wen, G\., Cao, P\., Liu, L\., Yang, J\., Zhang, X\., Wang, F\., et al\.: Graph Self\-Supervised Learning With Application to Brain Networks Analysis\. IEEE Journal of Biomedical and Health Informatics27\(8\), 4154–4165 \(2023\)
- \[31\]Woo, C\.W\., Chang, L\.J\., Lindquist, M\.A\., Wager, T\.D\.: Building better biomarkers: brain models in translational neuroimaging\. Nature Neuroscience20\(3\), 365–377 \(2017\)
- \[32\]Yang, Y\., Ye, C\., Su, G\., Zhang, Z\., Chang, Z\., Chen, H\., et al\.: BrainMass: Advancing Brain Network Analysis for Diagnosis with Large\-scale Self\-Supervised Learning\. IEEE Transactions on Medical Imaging43\(11\), 4004–4016 \(2024\)
- \[33\] Yeo, B\.T\., Krienen, F\.M\., Sepulcre, J\., Sabuncu, M\.R\., Lashkari, D\., Hollinshead, M\., et al\.: The organization of the human cerebral cortex estimated by intrinsic functional connectivity\. Journal of Neurophysiology 106\(3\), 1125–1165 \(2011\)