Artemis: Anatomy-Resolved inTervention for Eliminating Multimodal NeuroImage confounderS


Summary

Artemis proposes a region-level causal framework that learns region-specific confounder representations to eliminate demographic confounders in multimodal neuroimaging, improving graph neural network performance on disease diagnosis and classification tasks.

arXiv:2606.18287v1 Announce Type: new


# Artemis: Anatomy-Resolved inTervention for Eliminating Multimodal NeuroImage confounderS
Source: [https://arxiv.org/html/2606.18287](https://arxiv.org/html/2606.18287)
Yang Du¹, Kun Zhao¹, Zhusuyi Chen (Department of Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, PA; siyuan.dai@pitt.edu, liang.zhan@pitt.edu); Heng Huang (Department of Computer Science, University of Maryland, College Park, MD); Paul Thompson (Imaging Genetics Center, Mark & Mary Stevens Institute for Neuroimaging & Informatics, Keck School of Medicine, University of Southern California, Los Angeles, CA); Chao Shi (School of Systems Science and Industrial Engineering, Binghamton University, Binghamton, NY); Haoteng Tang (Department of Computer Science, University of Texas Rio Grande Valley, Edinburg, TX); Liang Zhan¹

###### Abstract

Multimodal neuroimaging, integrating functional connectivity from fMRI and structural connectivity from DTI, enables non-invasive analysis of brain networks using graph neural networks. However, demographic factors such as age and sex systematically confound the relationship between brain connectivity and clinical outcomes, causing GNNs to exploit spurious shortcuts rather than learning causally invariant representations. While recent causal GNN methods introduce causality at the graph-modeling level, their causal mechanisms remain domain-agnostic without accounting for the real-world confounders inherent in clinical neuroimaging data. Moreover, brain networks are constructed from atlas-based parcellations where each region exhibits distinct sensitivity to demographic factors, necessitating region-aware adjustment. We propose Artemis, a region-level causal framework that bridges this gap with causal intervention at each brain region independently by learning region-specific confounder representations with lightweight parameters. Our adjustment uses both the functional and structural features for graph reasoning and acts as a plug-in module compatible with arbitrary GNN backbones. Experiments on three benchmarks, ADNI for disease diagnosis, OASIS for dementia staging, and HCP for sex classification, demonstrate consistent improvements over representative GNN-based baselines. Multiple supporting experiments further demonstrate statistical significance and neuroscientific interpretability.

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2606.18287v1/x1.png)

Figure 1: Demographic confounders are entangled with labels across all three benchmarks. (a) On ADNI, MCI prevalence climbs from the youngest age group to the oldest. (b) On HCP, the younger cohort is male-skewed. (c) On OASIS, dementia prevalence roughly triples in the oldest age quartile relative to the three younger ones.

Non-invasive neuroimaging is a cornerstone of clinical neuroscience. Functional MRI (fMRI) captures the temporal co-activation of brain regions, while diffusion tensor imaging (DTI) delineates the anatomical white-matter pathways that support this activity. Modeling each modality as a graph over atlas-defined regions of interest (ROIs) lets graph neural networks (GNNs) learn discriminative representations for disease diagnosis, cognitive prediction, and demographic analysis [Li2020BrainGNNIB, Kawahara2017BrainNetCNNCN, Kan2022BrainNT]. Fusing the two modalities further provides a richer, anatomically grounded substrate for population-level brain analysis [Tang2024InterpretableSE, Yin2024AHG, Ye2023BidirectionalMW].

However, demographic confounding in multimodal clinical neuroimages is pervasive and largely overlooked in downstream predictive modeling\. Clinical cohorts are naturally demographically unbalanced: age, sex, and education co\-vary with nearly every clinical label of interest\. For instance, in ADNI\[Jack2008TheAD\], subjects with mild cognitive impairment \(MCI\) are on average older than healthy controls, as demonstrated in Fig\.[1](https://arxiv.org/html/2606.18287#F1), while in HCP\[Essen2013TheWH\], sex is associated with educational differences\. Meanwhile, demographic factors are known to shape regional brain structure and function, including age\-related decline in hippocampal integrity and sex\-related effects in cortical regions such as the primary visual cortex\[Ritchie2017SexDI,Fjell2014WhatIN\]\. Without explicit adjustment, GNNs trained end\-to\-end can therefore rely on demographic\-driven connectivity patterns as spurious shortcuts, neglecting causally relevant disease signals\. This issue can lead to biased predictions, degraded robustness across subpopulations, and misleading neuroscientific interpretations, all of which are especially undesirable in clinical applications\.

Existing causal GNNs, while promising, remain domain-agnostic. Typical works inject causal reasoning into GNNs, most prominently by disentangling *"causal"* and *"spurious"* subgraphs via intervention or invariance objectives [Fan2022DebiasingGN, Sui2021CausalAF, Chen2022LearningCI]. These methods advance causally robust graph learning in general, but their notion of a confounder is *structural*: it is induced by the graph topology itself rather than by any external variable. For neuroimages, CI-GNN [zheng2024ci] employs Granger causality as a post-hoc interpretability tool, and Contrasformer [xu2024contrasformer] addresses sub-population shift via contrast graphs without invoking causal adjustment at all. Such approaches implicitly assume that confounding is latent in the connectivity matrix and therefore unobservable, ignoring the fact that in clinical cohorts the dominant confounders are *observed* demographic attributes with well-documented neuroscientific effects. As a consequence, current causal GNNs cannot perform the most basic causal operation that neuroimaging analysis requires: adjusting for known demographic confounders via backdoor adjustment.

Furthermore, demographic confounding is not uniform across the brain [Alex2023AGM, Eickhoff2018ImagingbasedPO]. Atlas-based parcellations divide the cortex and subcortex into anatomically and functionally distinct regions, each with its own developmental trajectory and sensitivity profile. A single global confounder correction applied uniformly to all ROIs cannot capture this heterogeneity: it risks either under-correcting regions like the hippocampus, which are strongly affected by both aging and Alzheimer's pathology [Frisoni1999HippocampalAE], or over-correcting regions that are already clean.

To address these gaps, we introduce Artemis, a region-level causal intervention framework motivated by backdoor adjustment and illustrated in Fig. [1](https://arxiv.org/html/2606.18287#F1). Artemis maps the demographics vector to a *per-ROI* confounder embedding through a shared multilayer perceptron combined with learnable region tokens, capturing region-specific sensitivity. We introduce a lightweight exponential-moving-average (EMA) memory bank that maintains a running per-ROI estimate of the population confounder distribution, enabling each sample's confounder to be centered against the cohort mean, a low-variance approximation of the backdoor-adjustment expectation. The entire intervention adds only a few thousand parameters and plugs into any GNN backbone, making it a drop-in module rather than a new architecture. Across three clinical benchmarks, ADNI (NC vs. MCI), HCP (sex), and OASIS (CDR three-class), Artemis outperforms ten representative GNN-based baselines across multiple categories, improving accuracy over the vanilla GCN backbone by +20.9%, +27.9%, and +7.8%, and AUC by +26.2%, +34.2%, and +8.0%, respectively. We summarize our contributions as follows.

- We identify region-specific demographic confounding as an overlooked but crucial source of spurious shortcuts in multimodal brain-network GNNs.
- We propose Artemis by formulating a region-level backdoor adjustment, a lightweight plug-in intervention module with only 7K parameters, which is compatible with arbitrary GNN backbones.
- On three clinical benchmarks, Artemis consistently outperforms ten representative GNN-based baselines across multiple categories, with substantial improvements in accuracy, F1, and AUC. Multiple supportive analyses validate the formulated region-level backdoor adjustment.

## 2 Related Works

![Refer to caption](https://arxiv.org/html/2606.18287v1/x2.png)

Figure 2: Artemis pipeline. (1) Per-ROI multimodal features together with subject-level demographics $d$ serve as inputs. (2) A shared MLP combined with learnable per-ROI tokens $\mathtt{roi\_emb}_i$ produces region-specific confounder embeddings $c_i$. (3) A per-ROI EMA memory bank stores the running mean of $c_i$ over the training population, approximating $\mathbb{E}[c_i]$ for backdoor centering. (4) A learned gate $\sigma(W c_i^{\mathrm{centered}})$ is applied elementwise ($\odot$) to *both* $f_i$ and $s_i$. (5) The adjusted features feed any GNN backbone for downstream prediction.

### 2.1 Brain-Network GNNs

Graph neural networks have become the dominant paradigm for learning over brain connectomes. BrainNetCNN [Kawahara2017BrainNetCNNCN] pioneered the use of edge-to-edge and edge-to-node convolutions tailored to symmetric connectivity, and BrainGNN [Li2020BrainGNNIB] introduced ROI-aware graph convolutions with a top-$K$ pooling mechanism that highlights clinically salient regions. More recently, transformer-style architectures have been adapted to brain networks: BrainNetTF [Kan2022BrainNT] employs a cluster-readout module that captures community structure in functional connectivity, and BioBGT [Peng2025BiologicallyPB] incorporates spectral positional encodings and community-guided attention to inject biological priors into the attention mechanism. Multimodal extensions exploit the complementarity of functional and structural connectivity: Tang et al. [Tang2024InterpretableSE] propose an interpretable FC-SC fusion framework for Alzheimer's disease staging, and Yin et al. [Yin2024AHG] align multi-view connectivity through heterogeneous graph attention. Despite their architectural diversity, these models are trained end-to-end on class labels alone and never account for demographic variables that co-vary with both the input graphs and the clinical target, so they are free to exploit demographic shortcuts that merely reduce the training loss.

### 2.2 Causal Reasoning on Graphs

One line of work injects causal reasoning into general-purpose GNNs, typically by disentangling a causal subgraph from a spurious complement. DIR [wu2022discovering] partitions each input into invariant and variant components; CAL [Sui2021CausalAF] implements this via causal attention and do-calculus-style training; and GIL [li2022learning], CIGA [Chen2022LearningCI], GSAT [miao2022interpretable], MoleOOD [yang2022learning], and Fan et al. [Fan2022DebiasingGN] extend this agenda through graph-level invariance, sparse stochastic attention, environment-invariant features, or stable-learning objectives. A separate thread brings causality specifically to brain networks: CI-GNN [zheng2024ci] uses Granger-causal interactions as a *post-hoc* interpretability tool without any adjustment during learning, Contrasformer [xu2024contrasformer] targets sub-population shift via a contrast graph that is distributional rather than causal in the do-calculus sense, and MediAD [jin2025cross] pursues a heavyweight, LLM-augmented cross-modal causal view at the patient level. A growing literature on algorithmic fairness in medical imaging further highlights that demographic attributes routinely induce subgroup disparities [seyyed2021underdiagnosis, petersen2023path], motivating adjustment without providing a graph-level causal mechanism. Across both threads, the confounder is either treated as *latent graph structure* or handled only distributionally; none of these methods adjusts for the *observed* demographic variables that are known a priori and dominate confounding in clinical brain-network studies.

## 3 Methodology

As shown in Fig. [2](https://arxiv.org/html/2606.18287#F2), the Artemis pipeline consists of a region-specific confounder encoder, an EMA memory bank, and a gated multimodal intervention, which together plug into any GNN backbone. The following subsections formalize each component.

### 3.1 Problem Formulation and Causal Graph

![Refer to caption](https://arxiv.org/html/2606.18287v1/x3.png)

Figure 3: Causal graph for brain-network classification. (a) Observed demographics $d$ confound both brain features $X$ and the clinical label $Y$ via backdoor paths. (b) Artemis performs backdoor adjustment by intervening on the $d \to X$ path at each ROI.

For each subject, we are given a multimodal brain network defined over a fixed atlas of $N$ regions of interest (ROIs). The functional connectivity matrix $\mathrm{FC} \in \mathbb{R}^{N \times N}$ is computed from resting-state fMRI and the structural connectivity matrix $\mathrm{SC} \in \mathbb{R}^{N \times N}$ from DTI tractography. We treat each $ROI_i$ as a node whose multimodal feature is the concatenation of its FC and SC rows, written as $f_i \in \mathbb{R}^{N}$ and $s_i \in \mathbb{R}^{N}$, so that the node-feature tensor is $X = \{(f_i, s_i)\}_{i=1}^{N}$. In addition, each subject carries an observed demographics vector $d \in \mathbb{R}^{d_{\mathrm{demo}}}$ comprising age, sex, and education (the attributes actually used depend on the dataset), and a prediction target $y$.

Following the most common backdoor path in neuroimaging analysis (e.g., *demographics* $\rightarrow$ *connectivity* $\rightarrow$ *prediction*), we adopt the causal directed acyclic graph illustrated in Fig. [3](https://arxiv.org/html/2606.18287#F3): $d \to X$, $d \to y$, $X \to y$, in which the demographics $d$ are an observed common cause of both the imaging features and the clinical label. Standard maximum-likelihood training of $P(y \mid X)$ leaves the backdoor path $X \leftarrow d \to y$ open and is therefore prone to demographic shortcuts. Our objective is the interventional distribution $P(y \mid \mathrm{do}(X))$, which by the backdoor criterion [pearl2009causality] admits the identification

$$
P(y \mid \mathrm{do}(X)) \;=\; \mathbb{E}_{d}\!\left[\,P(y \mid X, d)\,\right]. \tag{3.1}
$$

Crucially, Eq. ([3.1](https://arxiv.org/html/2606.18287#S3.E1)) adjusts at the *feature* level rather than at the label level: it requires modeling how $d$ modulates the per-ROI features and then marginalizing over the population distribution of $d$.
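To make Eq. (3.1) concrete, the following minimal sketch evaluates the backdoor formula on a toy problem with one binary confounder. Every probability table is invented for illustration and has nothing to do with the paper's data; the point is only that conditioning on $X$ and intervening on $X$ can disagree when $d$ drives both $X$ and $y$.

```python
# Toy illustration of backdoor adjustment (Eq. 3.1) with one binary confounder.
# All probability tables below are made up for illustration only.

# P(d): population distribution of the confounder (e.g., 0 = younger, 1 = older).
p_d = {0: 0.5, 1: 0.5}

# P(X=1 | d): the confounder shifts the imaging feature.
p_x1_given_d = {0: 0.2, 1: 0.8}

# P(y=1 | X, d): here y depends on d only, so X has no causal effect on y.
p_y1_given_xd = {(0, 0): 0.1, (1, 0): 0.1, (0, 1): 0.7, (1, 1): 0.7}


def p_x_given_d(x, d):
    return p_x1_given_d[d] if x == 1 else 1.0 - p_x1_given_d[d]


def observational(x):
    """P(y=1 | X=x) = sum_d P(y=1 | x, d) P(d | x); the backdoor path stays open."""
    joint = {d: p_x_given_d(x, d) * p_d[d] for d in p_d}
    norm = sum(joint.values())
    return sum(p_y1_given_xd[(x, d)] * joint[d] / norm for d in p_d)


def interventional(x):
    """P(y=1 | do(X=x)) = sum_d P(y=1 | x, d) P(d); the backdoor formula."""
    return sum(p_y1_given_xd[(x, d)] * p_d[d] for d in p_d)


for x in (0, 1):
    print(x, round(observational(x), 2), round(interventional(x), 2))
```

In this toy setting the observational quantity changes with $X$ (0.22 versus 0.58) even though $X$ has no effect on $y$, while the adjusted quantity is 0.40 for both values of $X$. That gap is the kind of demographic shortcut the adjustment is meant to close.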

### 3.2 Region-Specific Confounder Encoder

Different brain regions exhibit markedly different sensitivity to demographic factors, e.g., the hippocampus and entorhinal cortex are dominated by age-related atrophy [Alex2023AGM, Eickhoff2018ImagingbasedPO]. A single global confounder vector applied uniformly to all ROIs cannot respect this heterogeneity. At the same time, instantiating $N$ independent MLPs is parameter-inefficient and discards inductive bias across regions.

We therefore implement *region specificity* via a single shared MLP combined with a learnable per-ROI token. Let $\mathtt{roi\_emb} \in \mathbb{R}^{N \times d_{\mathrm{roi}}}$ be a learnable embedding table whose $i$-th row $\mathtt{roi\_emb}_i$ is a small, randomly initialized identity vector for $ROI_i$. The per-ROI confounder embedding is

$$
c_i \;=\; \mathrm{MLP}\!\left(\,[\,d\,;\,\mathtt{roi\_emb}_i\,]\,\right) \;\in\; \mathbb{R}^{d_c}, \tag{3.2}
$$

which can be viewed as a feature-wise conditioning of the demographics vector by an ROI identity [perez2018film]. Although the same MLP weights are used for every region, the ROI token shifts its input and yields an ROI-specific output, so identical demographics produce different $c_i$ for different regions. The total parameter count of the encoder is $|\mathrm{MLP}| + N \cdot d_{\mathrm{roi}}$, which is two orders of magnitude smaller than $N$ separate MLPs (e.g., $90 \times 16 = 1440$ extra parameters for ADNI with $d_{\mathrm{roi}} = 16$). Crucially, $c_i$ is not a $[0,1]$ saliency or an importance weight; it is an unbounded embedding that summarizes how this region responds along the demographics axes, i.e., a region-conditional surrogate for $P(c \mid d, \mathrm{ROI} = i)$.
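The shared-MLP-plus-token construction of Eq. (3.2) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the hidden width `D_HID`, the two-layer depth, the `tanh` nonlinearity, and the initialization scale are all hypothetical, and the code uses only the standard library so it runs anywhere (a tensor library would be the practical choice).

```python
import math
import random

random.seed(0)

# N, D_ROI, D_C follow the ADNI defaults in the text; D_HID is an assumed width.
N, D_DEMO, D_ROI, D_HID, D_C = 90, 2, 16, 32, 32


def rand_matrix(rows, cols, scale=0.1):
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def linear(W, b, x):
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


# Learnable per-ROI tokens: one small random vector per region.
roi_emb = rand_matrix(N, D_ROI)

# Shared MLP weights, reused for every region (two layers are an assumption).
W1, b1 = rand_matrix(D_HID, D_DEMO + D_ROI), [0.0] * D_HID
W2, b2 = rand_matrix(D_C, D_HID), [0.0] * D_C


def confounder_embedding(d, i):
    """c_i = MLP([d ; roi_emb_i]) as in Eq. (3.2)."""
    x = list(d) + roi_emb[i]
    h = [math.tanh(v) for v in linear(W1, b1, x)]
    return linear(W2, b2, h)


d = [0.3, 1.0]  # e.g., standardized age and a binary sex indicator
c_a = confounder_embedding(d, 36)  # ROI indices are arbitrary here
c_b = confounder_embedding(d, 3)

extra_roi_params = N * D_ROI  # size of the ROI token table
print(len(c_a), c_a != c_b, extra_roi_params)
```

The same demographics vector yields different embeddings for the two regions because only the token part of the input differs, and the token table accounts for the $90 \times 16 = 1440$ extra parameters mentioned above.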

### 3.3 Population-Level Memory Bank and Backdoor Centering

The backdoor adjustment in Eq. ([3.1](https://arxiv.org/html/2606.18287#S3.E1)) requires the population-level expectation $\mathbb{E}_{d}[c_i]$ for every ROI. Computing this exactly at every step is infeasible under mini-batch training, and per-batch means are noisy on small clinical cohorts. We therefore maintain a per-ROI memory bank $B \in \mathbb{R}^{N \times d_c}$, registered as a non-trainable buffer and updated by an exponential moving average (EMA):

$$
B_i \;\leftarrow\; m \cdot B_i \;+\; (1-m) \cdot \tfrac{1}{|\mathcal{B}|} \sum_{j \in \mathcal{B}} c_i^{(j)}, \tag{3.3}
$$

where $\mathcal{B}$ is the current mini-batch and $m \in [0.9, 0.999]$ is the momentum (default $m = 0.999$). The first batch initializes $B_i$ directly to its batch mean rather than via Eq. ([3.3](https://arxiv.org/html/2606.18287#S3.E3)), avoiding the cold-start bias of starting from zero.

The bank is then used to *center* the confounder embedding before any downstream use:

$$
c_i^{\mathrm{centered}} \;=\; c_i \;-\; B_i. \tag{3.4}
$$

Two points are worth emphasizing. First, $B$ is a *consistent, low-variance approximation* of $\mathbb{E}_{d}[c_i]$ rather than a strictly unbiased Monte-Carlo estimate: the EMA weights do not sum to one and the underlying encoder drifts during training, so $B$ averages snapshots of a non-stationary distribution. Once training has converged, however, the effective sample size of order $1/(1-m) \approx 10^{3}$ drives the variance well below that of any single mini-batch mean. Second, the bank is frozen at inference time, exactly as the running statistics of batch normalization are frozen at evaluation [ioffe2015batch]. This guarantees train-test consistency: every test subject is centered against the same population reference that the model was calibrated to during training.
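The update, centering, and freezing behavior described by Eqs. (3.3) and (3.4) can be sketched as a small class. The class name `EmaBank`, its method names, and the nested-list data layout are hypothetical choices for illustration; the sketch uses plain Python lists rather than tensors so that it is self-contained.

```python
class EmaBank:
    """Per-ROI running mean of confounder embeddings (Eqs. 3.3 and 3.4)."""

    def __init__(self, momentum=0.999):
        self.m = momentum
        self.bank = None      # becomes a list of N rows, each with d_c floats
        self.training = True

    @staticmethod
    def _batch_mean(batch):
        # batch[j][i] is the embedding c_i of subject j (a list of d_c floats).
        n = len(batch)
        n_roi, d_c = len(batch[0]), len(batch[0][0])
        return [[sum(subj[i][k] for subj in batch) / n for k in range(d_c)]
                for i in range(n_roi)]

    def update(self, batch):
        if not self.training:
            return  # frozen at inference, like batch-norm running statistics
        mean = self._batch_mean(batch)
        if self.bank is None:
            self.bank = mean  # first batch: initialize directly to the batch mean
            return
        m = self.m
        self.bank = [[m * b + (1.0 - m) * x for b, x in zip(b_row, x_row)]
                     for b_row, x_row in zip(self.bank, mean)]

    def center(self, c):
        # c[i] is one subject's embedding for ROI i; subtract the bank row B_i.
        return [[v - b for v, b in zip(c_i, b_i)] for c_i, b_i in zip(c, self.bank)]


# Tiny example: one ROI, d_c = 2, momentum 0.9 so the effect is visible.
bank = EmaBank(momentum=0.9)
bank.update([[[1.0, 3.0]], [[3.0, 5.0]]])   # first batch sets the bank to [2, 4]
bank.update([[[4.0, 6.0]]])                 # 0.9 * [2, 4] + 0.1 * [4, 6]
bank.training = False
bank.update([[[100.0, 100.0]]])             # ignored: the bank is frozen
centered = bank.center([[2.2, 4.2]])
```

After the two training updates the bank holds approximately `[2.2, 4.2]`, and the third call leaves it unchanged because the bank is in evaluation mode, which mirrors the train-test consistency argument above.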

### 3.4 Gated Multimodal Intervention

![Refer to caption](https://arxiv.org/html/2606.18287v1/x4.png)

Figure 4: Single-stream vs. dual-stream gated intervention for one ROI. Left: a single gate is applied to the concatenated multimodal feature $[f_i; s_i]$. Right (Artemis): the same gate is applied independently to the $f_i$ and $s_i$ streams.

The centered confounder $c_i^{\mathrm{centered}}$ drives a learned gate that selectively suppresses confounded directions in the per-ROI multimodal features. Let $W \in \mathbb{R}^{d_{\mathrm{feat}} \times d_c}$ and $b \in \mathbb{R}^{d_{\mathrm{feat}}}$ be the weight and bias of a single linear layer, where $d_{\mathrm{feat}}$ is the projected feature dimension shared by the FC and SC streams. The gate is

$$
g_i \;=\; \sigma\!\left(W\,c_i^{\mathrm{centered}} + b\right) \;\in\; (0,1)^{d_{\mathrm{feat}}}, \tag{3.5}
$$

and is applied elementwise to *both* modalities of region $i$:

$$
f_i' \;=\; f_i \odot g_i, \qquad s_i' \;=\; s_i \odot g_i. \tag{3.6}
$$

We stress that $g_i$ is *not* attention. It is a per-ROI elementwise mask whose drive signal is the demographic confounder rather than pairwise ROI similarity. Intuitively, $|g_{i,k} - 0.5|$ encodes how much confounder contamination feature dimension $k$ of region $i$ carries, given this subject's demographics, and the linear layer is shaped by the downstream task loss to push the mask towards $0$ along precisely those directions whose variance is explained by $d$. Gate strength is therefore an *interaction* between the demographic sensitivity of $c_i$ and the task-driven suppression learned by $W$.
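A minimal sketch of Eqs. (3.5) and (3.6) for a single ROI is given below. The dimensions are shrunk for readability, the weights are random rather than trained, and the zero bias is an assumption made so that the behavior at the population mean is easy to read off.

```python
import math
import random

random.seed(0)
D_C, D_FEAT = 4, 6  # small sizes for readability; the paper's defaults are 32 and 64

# Untrained gate layer; in the method these weights are shaped by the task loss.
W = [[random.gauss(0.0, 0.5) for _ in range(D_C)] for _ in range(D_FEAT)]
b = [0.0] * D_FEAT  # zero bias is an assumption for this illustration


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gate(c_centered):
    """g_i = sigmoid(W c_i^centered + b), one value per feature dimension (Eq. 3.5)."""
    return [sigmoid(sum(w * c for w, c in zip(row, c_centered)) + bias)
            for row, bias in zip(W, b)]


def intervene(f_i, s_i, c_centered):
    """Apply the same gate elementwise to the FC and SC streams (Eq. 3.6)."""
    g = gate(c_centered)
    f_adj = [f * gk for f, gk in zip(f_i, g)]
    s_adj = [s * gk for s, gk in zip(s_i, g)]
    return f_adj, s_adj, g


f_i = [1.0] * D_FEAT
s_i = [2.0] * D_FEAT

# A subject exactly at the population mean has a zero centered embedding, so
# with b = 0 every gate value is 0.5 and the gate strength |g - 0.5| is zero.
_, _, g_mean = intervene(f_i, s_i, [0.0] * D_C)

# A subject far from the mean moves the gate values away from 0.5.
f_adj, s_adj, g_far = intervene(f_i, s_i, [2.0, -1.0, 0.5, 3.0])
strength = sum(abs(gk - 0.5) for gk in g_far) / D_FEAT
```

Because the gate depends only on the centered confounder embedding, both modalities of a region are scaled by identical factors, which is the dual-stream design shown on the right of Figure 4.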

The adjusted features $\{(f_i', s_i')\}_{i=1}^{N}$, together with edges derived from the SC matrix, are passed to any GNN backbone. We use a 2-layer GCN by default, but the intervention is architecture-agnostic. In this case, the total overhead of the intervention (encoder MLP, ROI embedding table, EMA buffer, and gate layer) is only on the order of a few thousand parameters (about 7K in our default configuration), which is negligible compared with any modern GNN-based backbone.
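The rough size of this overhead can be checked with simple arithmetic. The sketch below uses the ADNI defaults stated in the experiments ($N = 90$, two demographic attributes, $d_{\mathrm{roi}} = 16$, $d_c = 32$, $d_{\mathrm{feat}} = 64$) and assumes, purely for illustration, that the shared encoder is a single linear layer; the encoder depth is not stated in this section, so the total should be read as indicative only.

```python
# Rough parameter budget for the intervention module under the ADNI defaults.
# ASSUMPTION: the shared encoder is a single linear layer; its real depth is
# not given in the text, so the total is illustrative only.
N = 90        # AAL ROIs on ADNI
D_DEMO = 2    # age and sex on ADNI
D_ROI = 16    # ROI token width
D_C = 32      # confounder embedding width
D_FEAT = 64   # gated feature width


def linear_params(n_in, n_out):
    return n_in * n_out + n_out  # weights plus bias


budget = {
    "encoder (assumed single linear layer)": linear_params(D_DEMO + D_ROI, D_C),
    "ROI embedding table": N * D_ROI,
    "EMA bank (non-trainable buffer)": N * D_C,
    "gate layer": linear_params(D_C, D_FEAT),
}
total = sum(budget.values())

for name, count in budget.items():
    print(f"{name:40s} {count:5d}")
print(f"{'total':40s} {total:5d}")
```

Under these assumptions the total comes to 7,040 entries, in the same range as the roughly 7K reported; a deeper or wider encoder would raise the figure accordingly.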

## 4 Experiments

![Refer to caption](https://arxiv.org/html/2606.18287v1/x5.png)

Figure 5: Pre-classifier embedding visualization via t-SNE on all three benchmarks (silhouette scores, raw $d$-dim embedding w.r.t. task label).

![Refer to caption](https://arxiv.org/html/2606.18287v1/x6.png)

Figure 6: Gate strength vs. permutation importance for 90 ADNI ROIs (AAL atlas). The two quantities are largely uncorrelated (Spearman $\rho = -0.06$, $p = 0.58$). Upper-right: clinically relevant regions that are also demographically confounded (e.g., hippocampus). Upper-left: age-sensitive but clinically irrelevant regions.

### 4.1 Experimental Setup

We evaluate Artemis on three clinical neuroimaging benchmarks. ADNI (NC vs. MCI, $N = 90$ AAL ROIs, $n = 338$) uses age and sex as observed confounders. HCP (sex classification, $N = 82$ Desikan-Killiany ROIs, $n = 999$) uses age and education. OASIS (CDR three-class: NC/MCI/Dementia, $N = 132$ Harvard-Oxford ROIs, $n = 1324$) uses age and sex as confounders.

We compare against ten baselines organized in three groups: *general GNN baselines* (GCN [kipf2016semi], GAT [velivckovic2018graph], GIN [xu2018powerful]), *brain-specific architectures* (BrainGNN [Li2020BrainGNNIB], BrainNetTF [Kan2022BrainNT], BrainNetCNN [Kawahara2017BrainNetCNNCN], BioBGT [Peng2025BiologicallyPB]), and *causality-related GNNs* (CI-GNN [zheng2024ci], Contrasformer [xu2024contrasformer], CAL [Sui2021CausalAF]).

Unless stated otherwise, Artemis uses $d_c = 32$, $d_{\mathrm{roi}} = 16$, $d_{\mathrm{feat}} = 64$, a 2-layer GCN backbone, and EMA momentum $m = 0.999$. The training objective is a single cross-entropy loss with no auxiliary loss. All models are evaluated with fixed 5-fold subject-level cross-validation, with early stopping on validation accuracy (patience 20) and a maximum of 100 epochs. We report accuracy, macro-F1, and macro-averaged ROC-AUC (one-vs-rest for the three-class OASIS task) as our evaluation metrics.
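Since the results below lean on the gap between accuracy and macro-F1, a short sketch of how macro-F1 is computed may help. The labels and predictions are hypothetical and chosen only to show how a majority-class predictor can look acceptable on accuracy while scoring poorly on macro-F1.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical fold: 8 controls (0) and 2 patients (1); the classifier
# predicts the majority class for everyone.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = macro_f1(y_true, y_pred)
print(accuracy, round(f1, 3))
```

Here accuracy is 0.8 while macro-F1 is about 0.444, because the minority class contributes an F1 of zero to the unweighted average.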

### 4.2 Main Results

Table 1: Experimental results on ADNI, HCP, and OASIS (5-fold cross-validation, mean $\pm$ std, %). Acc = accuracy; F1 = macro-F1; AUC = macro-averaged ROC-AUC (one-vs-rest for the 3-class OASIS task). Bold = best overall; underline = second best. Params counted with ADNI configuration ($N = 90$, hidden_dim = 64). The intervention adds only ~7K parameters to any backbone.

Table [1](https://arxiv.org/html/2606.18287#S4.SS2) summarizes results across all three benchmarks. Artemis outperforms every baseline on every dataset, with especially pronounced gains on the metrics most sensitive to class imbalance. On ADNI, Artemis achieves 84.3% accuracy, a +5.0% absolute improvement over the best baseline BrainNetCNN (79.3%), and a macro-F1 of 73.5% compared with BrainNetCNN's 52.4%, a +21.1 percentage-point gain that reflects far more balanced recall across the NC and MCI classes. On HCP, Artemis reaches 86.7% accuracy (+6.2% over BrainNetCNN's 80.5%) and 92.4% AUC (+8.7% over the best baseline AUC of 83.7%). On OASIS, Artemis improves accuracy by +4.7% over Contrasformer (53.2%) and delivers consistent gains across F1 and AUC.

Crucially, Artemis shares the same GCN backbone as the weakest baseline (vanilla GCN, 12K parameters), while the only addition is the 7K-parameter intervention module. Adding this drop-in causal adjustment alone lifts accuracy by +20.9%, +27.9%, and +7.8% on ADNI, HCP, and OASIS, respectively (Table [1](https://arxiv.org/html/2606.18287#S4.SS2), $\Delta$ row), and macro-F1 by +27.3, +28.8, and +5.4 points.

Notably, the three causal baselines (CI-GNN, Contrasformer, and CAL) consistently underperform the brain-specific architectures. Their mechanisms (Granger causality, contrast graphs, and causal attention on structural subgraphs, respectively) do not account for the demographic confounders that dominate clinical brain imaging. CAL in particular, a general-purpose causal GNN, drops to 61.6% accuracy on ADNI, which is even worse than vanilla GCN, reinforcing the motivation for Artemis. Meanwhile, the Artemis intervention module adds only ~7K parameters on top of the GCN backbone (12K), a negligible overhead compared with Contrasformer (178K) or BrainNetCNN (882K), indicating that the gains originate from the causal intervention mechanism rather than model capacity.

### 4.3 Ablation Study

Table 2: Ablation on intervention granularity. GCN backbone for all rows. FC-only / SC-only apply region-level gated intervention to one modality only. Single-stream concatenates $[f_i; s_i]$. Global shares a single confounder embedding across all ROIs. All rows use default hyperparameters.

Table [2](https://arxiv.org/html/2606.18287#T2) isolates each design choice under default hyperparameters. The progression from no intervention to global and region-level is monotonic across all three benchmarks: global backdoor adjustment alone already lifts ADNI accuracy from 63.4% to 80.5%, and region-level differentiation adds another +3.8% accuracy and +16.2 F1 points on ADNI, consistent with the view that heterogeneous per-ROI confounder effects cannot be captured by a single shared embedding; the large drop in F1 std (Global 28.9 $\rightarrow$ Artemis 5.2 on ADNI) further indicates that per-ROI parameterization acts as a regularizer. Single-modality ablations (FC-only, SC-only) recover much of the gain but neither matches full Artemis, showing that the two modalities carry complementary confounding signal. Finally, replacing the dual-stream gate with a single shared gate on the concatenated $[f_i; s_i]$ (Figure [4](https://arxiv.org/html/2606.18287#F4) left) loses 3-6% accuracy across benchmarks (e.g., 80.6% vs. 86.7% on HCP), validating the modality-specific gating design.

### 4.4 Parameter Sensitivity

![Refer to caption](https://arxiv.org/html/2606.18287v1/x7.png)

Figure 7: Sensitivity to confounder embedding dimension $d_c$. Accuracy remains stable ($\leq 3\%$ variation) across all three benchmarks.

Figure [7](https://arxiv.org/html/2606.18287#F7) shows accuracy as a function of the confounder embedding dimension $d_c \in \{8, 16, 32, 64, 128\}$ under otherwise default hyperparameters. Performance remains stable across all three benchmarks, with less than 3% variation in accuracy. ADNI and HCP peak near $d_c = 32$-$64$; larger values ($d_c = 128$) offer no further gain and introduce mild overfitting. This insensitivity confirms that the performance improvements stem from the causal intervention mechanism itself, not from additional model capacity.

### 4.5 Interpretability Analysis

We first visualize the pre-classifier embeddings with t-SNE on all three benchmarks (Figure [5](https://arxiv.org/html/2606.18287#F5)): vanilla GCN embeddings conflate task labels with silhouette near zero, whereas Artemis yields clearly separable class structure across ADNI, HCP, and OASIS.

To examine *where* the intervention acts, we compare per-ROI gate strength ($|g_i - 0.5|$, averaged over subjects and feature dimensions) with permutation importance (accuracy drop when shuffling ROI $i$) for all 90 ADNI ROIs (Figure [6](https://arxiv.org/html/2606.18287#F6)). The two are largely uncorrelated (Spearman $\rho = -0.06$, $p = 0.58$), confirming that the gate is driven by the centered confounder embedding, not by the task label, and therefore decoupled from classification importance by design. The *upper-right* quadrant (high gate, high importance) contains the left hippocampus and bilateral amygdala, regions central to Alzheimer's pathology and strongly modulated by aging [Fjell2014WhatIN]; Artemis applies targeted adjustment here, preserving the disease-relevant component while suppressing the age-driven shortcut. The *upper-left* quadrant (e.g., Frontal_Sup_R, ParaHippocampal_R) contains age-sensitive but task-irrelevant regions whose high gate strength reflects suppression of a spurious backdoor path, while the *lower-right* quadrant contains clean biomarkers that require no adjustment. This pattern validates that Artemis calibrates precisely the regions where demographic and disease effects are entangled.
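The rank correlation used in this comparison can be sketched in a few lines. The helper names and the four-ROI input vectors below are hypothetical and serve only to show the computation; the sketch returns the coefficient $\rho$ and does not compute a $p$-value.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-ROI values for four regions (not taken from the paper).
gate_strength = [0.10, 0.40, 0.25, 0.05]
perm_importance = [0.02, 0.01, 0.03, 0.04]
rho = spearman(gate_strength, perm_importance)
print(round(rho, 2))
```

For these made-up inputs $\rho = -0.8$; a value near zero, as reported for the 90 ADNI ROIs, would indicate that the two rankings carry largely independent information.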

![Refer to caption](https://arxiv.org/html/2606.18287v1/x8.png)

Figure 8: Integrated-Gradient attribution of the top-5 ROIs for HCP (top), OASIS (middle), and ADNI (bottom). ROIs driving the task decision after Artemis intervention.

To locate the regions that drive the task decision *after* the intervention, we compute Integrated Gradients (IG, 50 steps, zero baseline) of the predicted class w.r.t. the per-ROI input features for every subject, and project the top-5 ROIs per benchmark onto the cortical surface (Figure [8](https://arxiv.org/html/2606.18287#F8)). On ADNI, IG highlights bilateral precentral gyrus, left middle temporal gyrus, and bilateral putamen, consistent with motor-cortex thinning and striatal tauopathy reported in advancing AD [whitwell2008rates], while hippocampus and amygdala (ranked 54-88) are deliberately down-weighted because their variance is already explained by the demographic gate. On HCP, the top-5 concentrates on bilateral insula, posterior cingulate, lateral occipital, and paracentral cortex, regions with the strongest documented structural sex dimorphism in the connectome [ingalhalikar2014sex, Ritchie2017SexDI]. On OASIS, bilateral precentral gyrus, lingual gyrus, and temporo-occipital fusiform emerge as most informative, matching the posterior-dominant atrophy pattern observed in dementia cohorts [lehmann2010atrophy]. The contrast between IG (task-driven attribution) and gate strength (confounder-driven modulation) on ADNI is complementary: the gate suppresses demographically confounded directions in medial-temporal regions, allowing the GCN to ground its decision in the remaining, less-confounded motor-subcortical signal.
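For readers unfamiliar with Integrated Gradients, the sketch below applies it with 50 steps and a zero baseline, as in the text, to a small logistic model. The logistic model is a stand-in with an analytic gradient and is not the GCN used in the paper; the weights and input are hypothetical, and the midpoint rule is one of several reasonable ways to approximate the path integral.

```python
import math


def model(x, w, bias):
    """Stand-in scalar model: logistic regression output."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def grad(x, w, bias):
    """Analytic gradient of the logistic output with respect to the input."""
    p = model(x, w, bias)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, w, bias, steps=50):
    """Midpoint-rule approximation of IG along the straight path from 0 to x."""
    baseline = [0.0] * len(x)
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b0 + alpha * (xi - b0) for xi, b0 in zip(x, baseline)]
        total = [t + g for t, g in zip(total, grad(point, w, bias))]
    return [(xi - b0) * t / steps for xi, b0, t in zip(x, baseline, total)]


w, bias = [1.5, -2.0, 0.5], 0.1   # hypothetical weights
x = [1.0, 0.5, 2.0]               # hypothetical input features
attr = integrated_gradients(x, w, bias)

# Completeness check: attributions should sum to f(x) - f(baseline).
gap = sum(attr) - (model(x, w, bias) - model([0.0] * 3, w, bias))
```

The attributions sum to the difference between the model output at the input and at the baseline up to a small discretization error, which is the property that makes per-ROI IG scores comparable across regions.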

## 5 Conclusion

We presented Artemis, a region\-level causal intervention framework that addresses demographic confounding in multimodal brain\-network analysis\. By mapping observed demographics to per\-ROI confounder embeddings via a shared MLP with learnable region tokens, centering each embedding against an EMA population bank, and applying a learned sigmoid gate to both functional and structural features, Artemis performs backdoor adjustment at each brain region independently\. The entire intervention adds only a few thousand parameters and plugs into any GNN backbone without architectural modification\. Experiments on ADNI, HCP, and OASIS demonstrate consistent improvements over ten representative baselines, and ROI\-level analyses confirm that the intervention targets confounder\-sensitive regions while remaining decoupled from task\-relevant signals\.

### 5.1 Limitations and Future Work

Artemis models only *observed* demographic confounders (age, sex, education) and is evaluated on cross-sectional cohorts; unmeasured confounders (e.g., socioeconomic status, medication) and longitudinal dynamics are not yet considered. We plan to (i) extend Artemis to implicit confounders via a post-backbone adversarial component, (ii) apply it to regression targets (e.g., MMSE, depression/anxiety scores), and (iii) scale to multi-site cohorts to address scanner-affected demographic confounding.

## References
