Treatment Effect Estimation with Differentiated Networked Effect on Graph Data
Summary
This paper addresses the challenge of estimating individual treatment effects from graph data by modeling differentiated networked effects, proposing a mechanism with partial attention and a message amplifier to capture varying neighbor importance and scale. Experiments show improved performance over existing methods.
# Treatment Effect Estimation with Differentiated Networked Effect on Graph Data
Source: [https://arxiv.org/html/2605.24358](https://arxiv.org/html/2605.24358)
###### Abstract\.
Estimating the individual treatment effect (ITE) from observational graph data is crucial for decision-making in fields such as commerce and medicine. This task is challenging due to interference, where individual outcomes can be influenced by the treatments and covariates of their neighbors. Existing methods attempt to model such interference for accurate ITE estimation. However, a critical issue is often overlooked: the differentiated networked effect (DNE), an effect caused by local networks consisting of neighbors with varying importance and scales. Capturing DNE is vital; otherwise, interference is characterized erroneously, ITE estimates become imprecise, and the resulting decisions can be misguided. To address this challenge, we propose a novel interference modeling mechanism that incorporates two partial attention mechanisms and a message amplifier. The partial attention mechanisms automatically estimate the importance of different neighbors in contributing to interference, while the message amplifier adjusts the results of the interference modeling mechanism based on the scale of neighbors; together, these enable the model to capture DNE. Experiments on three real-world graphs demonstrate that our methods outperform existing approaches for ITE estimation from graph data, which corroborates the importance of explicitly capturing DNE.
Causal Inference; Treatment Effect Estimation; Interference
CCS Concepts: Mathematics of computing → Causal networks; Mathematics of computing → Graph algorithms; Information systems → Social networks

## 1. Introduction
Treatment effect estimation from graph data has been applied to decision-making in various areas, such as medicine (Chang et al., [2023](https://arxiv.org/html/2605.24358#bib.bib115); Ma et al., [2022a](https://arxiv.org/html/2605.24358#bib.bib109); Schnitzer, [2022](https://arxiv.org/html/2605.24358#bib.bib48)) and commerce (Nabi et al., [2022](https://arxiv.org/html/2605.24358#bib.bib47)). For example, it enables business owners to assess whether an advertisement stimulates a customer to purchase, which supports reasonable decisions on promotional strategies. A crucial task in this context is estimating the individual treatment effect (ITE), sometimes known as the conditional average treatment effect (CATE) (Ma et al., [2022b](https://arxiv.org/html/2605.24358#bib.bib53); Shalit et al., [2017](https://arxiv.org/html/2605.24358#bib.bib4)), which quantifies the difference in the outcome of an individual with and without treatment (Shalit et al., [2017](https://arxiv.org/html/2605.24358#bib.bib4)).
We aim to estimate ITE from observational graph data, which typically include covariates, treatments, and outcomes of individuals, together with a network structure among individuals. In this case, the outcomes of individuals can be influenced by the treatments and covariates of their neighbors (Ma and Tresp, [2021](https://arxiv.org/html/2605.24358#bib.bib12); Rakesh et al., [2018](https://arxiv.org/html/2605.24358#bib.bib52)), a phenomenon known as *interference* (Rakesh et al., [2018](https://arxiv.org/html/2605.24358#bib.bib52)). Such interference can propagate among individuals and their multi-hop neighbors, which is referred to as *networked interference* (Ma and Tresp, [2021](https://arxiv.org/html/2605.24358#bib.bib12)). The multi-hop neighbors of an individual, along with the connections among them, form a local network that contributes to the networked interference received by the individual. Properly modeling networked interference is critical; otherwise, inaccurate ITE estimation leads to unreasonable decisions.
Figure 1. Comparison of representations generated by improper (left) and proper (right) interference modeling mechanisms. Individuals with similar clothes have similar covariates. Each example shows two similar target individuals exposed to local networks with different scales of neighbors, which leads to different levels of interference. Proper modeling mechanisms capture this by generating distinct representations. In contrast, improper mechanisms, such as mean aggregation, cannot generate distinct representations.

Figure 2. Architecture of GITE. Here, we show an example of three individuals. NIML represents an NIM layer.

Despite the significant contributions of previous studies in demonstrating the effectiveness of modeling interference for ITE estimation from observational graph data, their modeling mechanisms still have limitations in properly capturing networked interference, which can result in imprecise ITE estimation. Several methods model interference (detailed in Section [2](https://arxiv.org/html/2605.24358#S2)) by applying a mean aggregation or a graph convolutional network (GCN) (Welling and Kipf, [2016](https://arxiv.org/html/2605.24358#bib.bib17)), such as Ma and Tresp ([2021](https://arxiv.org/html/2605.24358#bib.bib12)), Jiang and Sun ([2022](https://arxiv.org/html/2605.24358#bib.bib103)), Chen et al. ([2024](https://arxiv.org/html/2605.24358#bib.bib126)), and Lin et al. ([2025](https://arxiv.org/html/2605.24358#bib.bib132)). However, such aggregation mechanisms cannot fully capture networked interference, since a critical issue is overlooked: the differentiated networked effect (DNE), an effect caused by local networks consisting of neighbors with varying importance and scales. Specifically, DNE consists of two key sub-issues.
(I) The importance of different neighbors in contributing to interference varies (Huang et al., [2023](https://arxiv.org/html/2605.24358#bib.bib102); Lin et al., [2023](https://arxiv.org/html/2605.24358#bib.bib91); Ma et al., [2022b](https://arxiv.org/html/2605.24358#bib.bib53); Adhikari and Zheleva, [2025](https://arxiv.org/html/2605.24358#bib.bib175)). For instance, purchase behaviors of customers are usually more significantly influenced by their closer friends than by others. In that literature, this phenomenon is called heterogeneous interference or influence; because that notion does not consider sub-issue (II) of DNE, DNE can be regarded as a refinement of it. (II) The scale of neighbors varies, leading to different levels of interference (see Figure [1](https://arxiv.org/html/2605.24358#S1.F1)). An individual with many neighbors may experience more severe interference than one with few neighbors. For example, purchase behaviors of customers may be more significantly influenced by advertisements shared by many friends than by those shared by only a few friends. Methods based on mean aggregation or GCN do not address issues (I) and (II), as they lack explicit mechanisms to automatically estimate the importance of interference from each neighbor, and may fail to generate distinct representations for interference received by individuals from local networks with different scales of neighbors.
Although a line of work takes issue (I) into account by applying a graph attention mechanism (GAT) (Veličković et al., [2017](https://arxiv.org/html/2605.24358#bib.bib119)) to estimate the importance of different neighbors (Huang et al., [2023](https://arxiv.org/html/2605.24358#bib.bib102); Lin et al., [2023](https://arxiv.org/html/2605.24358#bib.bib91); Ma et al., [2022b](https://arxiv.org/html/2605.24358#bib.bib53); Zhao et al., [2024](https://arxiv.org/html/2605.24358#bib.bib176)), these methods do not fully address issue (II), as they may degenerate to a mean aggregation when individuals are similar (see Figure [1](https://arxiv.org/html/2605.24358#S1.F1)). A detailed proof in Appendix [C](https://arxiv.org/html/2605.24358#A3) shows that capturing DNE remains a challenge for most existing interference modeling methods. If DNE is not properly captured by jointly addressing issues (I) and (II), ITE estimation deteriorates and the subsequent decision-making is misguided.
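The degeneration described above can be illustrated with a toy example. The following is a minimal sketch, not the analysis of Appendix C: it assumes scalar messages and a hypothetical dot-product attention score, and all function names are illustrative. When every neighbor carries the same message, both the mean and the softmax-attention aggregate are blind to the number of neighbors, whereas a sum is not.

```python
import math

def mean_aggregate(messages):
    """Average the neighbor messages (scalars, for simplicity)."""
    return sum(messages) / len(messages)

def softmax_attention_aggregate(target, messages):
    """Attention-weighted average with a toy score: target * message.

    The score function is an assumption made for illustration only.
    """
    scores = [target * m for m in messages]
    max_score = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - max_score) for s in scores]
    total = sum(weights)
    return sum(w / total * m for w, m in zip(weights, messages))

def sum_aggregate(messages):
    """Sum the neighbor messages; the result grows with the neighbor count."""
    return sum(messages)

# Two similar target individuals whose neighbors all look alike,
# but whose local networks differ in scale (2 vs. 20 neighbors).
target = 0.5
few_neighbors = [0.5] * 2
many_neighbors = [0.5] * 20

for name, messages in [("few", few_neighbors), ("many", many_neighbors)]:
    print(
        name,
        mean_aggregate(messages),
        softmax_attention_aggregate(target, messages),
        sum_aggregate(messages),
    )
```

With identical neighbors the attention weights are uniform, so attention coincides with the mean up to floating-point rounding; only the sum separates the two local networks.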
To overcome the challenge of capturing DNE, we propose graph-based individual treatment effect estimation (GITE), which models the propagation of networked interference while capturing DNE. A novel networked interference modeling (NIM) layer forms the core of our approach. It is designed to capture DNE through two partial attention mechanisms and a message amplifier. Specifically, we design two partial attention mechanisms, individual partial attention (IPAtt) and structure partial attention (SPAtt), which are intended to adaptively capture the varying contributions of neighbors to interference based on two key factors that influence their relative importance. IPAtt estimates the individual partial importance of interference between two individuals based on their interference-related information, which assists in addressing issue (I). SPAtt estimates the structure partial importance of interference between two individuals based on the structures of their local networks, which assists in addressing both issues (I) and (II). Subsequently, the NIM layer conducts aggregations using the estimated partial importance and integrates the aggregated results adaptively through a learnable summary function. Each partial attention mechanism can be implemented using either GAT (Veličković et al., [2017](https://arxiv.org/html/2605.24358#bib.bib119)) or the attention mechanism of the Transformer (Vaswani et al., [2017](https://arxiv.org/html/2605.24358#bib.bib5); Ying et al., [2021](https://arxiv.org/html/2605.24358#bib.bib151)). To precisely capture structural information on the local network of every individual for the SPAtt mechanism, we apply a graph isomorphism network (GIN) (Xu et al., [2019](https://arxiv.org/html/2605.24358#bib.bib133)). To address issue (II), we design a message amplifier that varies the integrated results based on the degree of individuals, inspired by Corso et al. ([2020](https://arxiv.org/html/2605.24358#bib.bib129)).
Details of the NIM layer are described in Section [4.1](https://arxiv.org/html/2605.24358#S4.SS1). Furthermore, we propose a representation balancing strategy for ITE estimation from observational graph data, as detailed in Section [4.2](https://arxiv.org/html/2605.24358#S4.SS2). We theoretically analyze the error bound of ITE estimation based on this strategy in Appendix [E](https://arxiv.org/html/2605.24358#A5).
We summarize three contributions of this study, as follows:
- We propose the NIM layer to address the challenging issue of DNE, and further introduce a representation balancing strategy for ITE estimation from observational graph data.
- We show that capturing DNE remains a challenge for most existing methods (see Appendix [C](https://arxiv.org/html/2605.24358#A3)) and theoretically analyze the error bound of ITE estimation based on the proposed balancing strategy (see Appendix [E](https://arxiv.org/html/2605.24358#A5)).
- Results of extensive experiments reveal that the proposed method outperforms existing methods in ITE estimation with networked interference, which suggests the importance of capturing DNE.
## 2. Related work
ITE estimation from observational data with interference. Although many studies model interference by assuming neighbor interference, where interference exists among close neighbors only (Cai et al., [2023](https://arxiv.org/html/2605.24358#bib.bib125); Chen et al., [2024](https://arxiv.org/html/2605.24358#bib.bib126); Huang et al., [2023](https://arxiv.org/html/2605.24358#bib.bib102); Jiang and Sun, [2022](https://arxiv.org/html/2605.24358#bib.bib103); Rakesh et al., [2018](https://arxiv.org/html/2605.24358#bib.bib52); Viviano, [2019](https://arxiv.org/html/2605.24358#bib.bib11); [Wu et al.](https://arxiv.org/html/2605.24358#bib.bib148)), real-world data often involve networked interference, where interference propagates widely among individuals and their multi-hop yet influential neighbors (Lin et al., [2023](https://arxiv.org/html/2605.24358#bib.bib91), [2025](https://arxiv.org/html/2605.24358#bib.bib132); Ma et al., [2022b](https://arxiv.org/html/2605.24358#bib.bib53); Ma and Tresp, [2021](https://arxiv.org/html/2605.24358#bib.bib12); Sui et al., [2024](https://arxiv.org/html/2605.24358#bib.bib2)).
Specifically, existing methods model neighbor interference by applying a mean aggregation (Cai et al., [2023](https://arxiv.org/html/2605.24358#bib.bib125); Chen et al., [2024](https://arxiv.org/html/2605.24358#bib.bib126); Forastiere et al., [2021](https://arxiv.org/html/2605.24358#bib.bib10), [2022](https://arxiv.org/html/2605.24358#bib.bib110); Jiang and Sun, [2022](https://arxiv.org/html/2605.24358#bib.bib103)), GCN (Cai et al., [2023](https://arxiv.org/html/2605.24358#bib.bib125); Chen et al., [2024](https://arxiv.org/html/2605.24358#bib.bib126); Huang et al., [2023](https://arxiv.org/html/2605.24358#bib.bib102); Jiang and Sun, [2022](https://arxiv.org/html/2605.24358#bib.bib103); [Wu et al.](https://arxiv.org/html/2605.24358#bib.bib148)), or GAT (Huang et al., [2023](https://arxiv.org/html/2605.24358#bib.bib102); Zhao et al., [2024](https://arxiv.org/html/2605.24358#bib.bib176)). Several studies model the propagation of networked interference through a mean aggregation (Ma and Tresp, [2021](https://arxiv.org/html/2605.24358#bib.bib12)) and GCN (Ma and Tresp, [2021](https://arxiv.org/html/2605.24358#bib.bib12); Adhikari and Zheleva, [2025](https://arxiv.org/html/2605.24358#bib.bib175)). To accelerate the training of GNN-based estimators, Lin et al. ([2025](https://arxiv.org/html/2605.24358#bib.bib132)) aggregate interference-related information before training. To estimate ITE from more complex graphs, Ma et al. ([2022b](https://arxiv.org/html/2605.24358#bib.bib53)) and Lin et al. ([2023](https://arxiv.org/html/2605.24358#bib.bib91)) propose methods for hypergraphs and heterogeneous graphs, respectively. When applied to an ordinary graph, these methods use GAT (Veličković et al., [2017](https://arxiv.org/html/2605.24358#bib.bib119)) to model interference.
Despite the valuable contributions that previous studies have made to ITE estimation with interference, their methods still face challenges in capturing DNE, as detailed in Appendix [C](https://arxiv.org/html/2605.24358#A3). A comparison table and studies on treatment effect estimation in other settings are given in Appendix [H](https://arxiv.org/html/2605.24358#A8).
Graph machine learning methods. Beyond GNN-based methods for modeling interference in ITE estimation, several methods have been proposed in the graph machine learning (GML) community. Although these methods contribute significantly to GML tasks, they are unable to estimate ITE from observational graph data with DNE. Sum aggregation (Liu et al., [2024](https://arxiv.org/html/2605.24358#bib.bib138); Xu et al., [2019](https://arxiv.org/html/2605.24358#bib.bib133)) cannot capture DNE, as it does not consider the different importance of neighbors in contributing to interference. Pooling-based methods have also been explored (Liu et al., [2022](https://arxiv.org/html/2605.24358#bib.bib137)), among which max-pooling (Hamilton et al., [2017](https://arxiv.org/html/2605.24358#bib.bib76)) is one of the most widely used. The max-pooling operation does not capture DNE, as proved in Appendix [C](https://arxiv.org/html/2605.24358#A3). Although several studies have proposed methods to enhance the expressive power of representations for complex GML tasks (Corso et al., [2020](https://arxiv.org/html/2605.24358#bib.bib129); Ma et al., [2023](https://arxiv.org/html/2605.24358#bib.bib131)), ITE estimation from observational graph data differs from standard GML tasks due to confounding and interference biases (see Section [3](https://arxiv.org/html/2605.24358#S3)), which are absent in GML (Jiang and Sun, [2022](https://arxiv.org/html/2605.24358#bib.bib103)). Thus, GML methods lack mechanisms to address these biases, which can result in biased ITE estimation. For a broader overview of GML, refer to the recent literature (Liu et al., [2022](https://arxiv.org/html/2605.24358#bib.bib137), [2024](https://arxiv.org/html/2605.24358#bib.bib138)).
## 3. Problem setting
In this study, we aim to estimate ITE from observational graph data with networked interference. We use $\boldsymbol{x}_i \in \mathbb{R}^{c}$ to denote the covariates of individual $i$, $t_i \in \{0,1\}$ to denote the treatment assigned to individual $i$, $y_i \in \mathbb{R}$ to denote the factual (observed) outcome under the assigned $t_i$, and $N$ to denote the number of individuals. Let $\boldsymbol{X} \in \mathbb{R}^{N \times c}$ be the covariates of all individuals, $\boldsymbol{T} = [t_1, \ldots, t_N]$ be all treatment assignments, and $\boldsymbol{Y} = [y_1, \ldots, y_N]$ be all factual outcomes. Moreover, we use uppercase letters (e.g., $Y$) to denote random variables. A notation table is given in Appendix [A](https://arxiv.org/html/2605.24358#A1).
Observational graph data. $\mathcal{D} = (\boldsymbol{X}, \boldsymbol{T}, \boldsymbol{Y}, \boldsymbol{A})$ denotes observational graph data, where $\boldsymbol{A} \in \{0,1\}^{N \times N}$ denotes the adjacency matrix of a directed graph. If there is an edge from individual $k$ to individual $i$, then $A_{ik} = 1$; otherwise, $A_{ik} = 0$. Let $\mathbb{N}_i$ denote the set of neighbors of individual $i$, $\mathbb{G}_i$ denote the set of related individuals who can reach individual $i$ in the graph $\boldsymbol{A}$ (Cheng et al., [2012](https://arxiv.org/html/2605.24358#bib.bib136)), $\boldsymbol{x}_{\mathbb{G}_i} = \{\boldsymbol{x}_k \mid k \in \mathbb{G}_i\}$ denote the set of covariates in $\mathbb{G}_i$, and $\boldsymbol{t}_{\mathbb{G}_i} = \{t_k \mid k \in \mathbb{G}_i\}$ denote the set of treatments in $\mathbb{G}_i$. Importantly, only individuals in $\mathbb{G}_i$ can interfere with individual $i$.
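As a concrete reading of this notation, the sketch below computes the neighbor set and the set of related individuals from an adjacency matrix that follows the stated convention (an entry at row i, column k equals 1 when there is an edge from k to i). It makes two assumptions that the text does not spell out: neighbors are taken to be in-neighbors, and an individual is excluded from its own related set. The function names are hypothetical.

```python
from collections import deque

def neighbors(A, i):
    """In-neighbors of i: every k != i with an edge k -> i, i.e. A[i][k] == 1."""
    return {k for k, a in enumerate(A[i]) if a == 1 and k != i}

def related_individuals(A, i):
    """All individuals that can reach i along directed edges (i itself excluded).

    Implemented as a breadth-first search that walks edges backwards from i.
    """
    seen = set()
    queue = deque([i])
    while queue:
        node = queue.popleft()
        for k in neighbors(A, node):
            if k != i and k not in seen:
                seen.add(k)
                queue.append(k)
    return seen

# Toy graph with edges 1 -> 0, 2 -> 1, and 0 -> 3.
A = [
    [0, 1, 0, 0],  # individual 0 receives an edge from 1
    [0, 0, 1, 0],  # individual 1 receives an edge from 2
    [0, 0, 0, 0],  # individual 2 receives no edges
    [1, 0, 0, 0],  # individual 3 receives an edge from 0
]

print(neighbors(A, 0))            # one-hop in-neighbors of individual 0
print(related_individuals(A, 0))  # multi-hop related individuals of 0
print(related_individuals(A, 3))  # multi-hop related individuals of 3
```

In this toy graph, individual 2 is not a direct neighbor of individual 0 but still belongs to its related set, which is the distinction between neighbor interference and networked interference.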
Challenges\.ITE estimation from observational graph data suffers from four challenges:
- Networked interference. In a graph, the outcome of individual $i$ can be influenced by the covariates $\boldsymbol{x}_{\mathbb{G}_i}$ and treatments $\boldsymbol{t}_{\mathbb{G}_i}$. This phenomenon is referred to as networked interference (Ma and Tresp, [2021](https://arxiv.org/html/2605.24358#bib.bib12)). We consider that networked interference exhibits DNE; see Section [1](https://arxiv.org/html/2605.24358#S1).
- Counterfactual outcome. The counterfactual outcome is the outcome under the alternative treatment $1 - t$; it is unobserved in observational data but needed for ITE estimation (Yao et al., [2021](https://arxiv.org/html/2605.24358#bib.bib71)).
- Confounding bias. Confounders of an individual are the part of the covariates $\boldsymbol{x}_i$ that jointly affects the treatment assignment and the outcome (Yao et al., [2021](https://arxiv.org/html/2605.24358#bib.bib71)), which can introduce confounding bias in ITE estimation. For example, consider a scenario where customers are treated with advertisements. Younger customers may be more likely than elderly customers both to receive advertisements and to go shopping. In this case, age acts as a confounder. Due to the existence of networked interference among individuals, many studies on ITE estimation from a graph suggest that the treatment assignment of an individual can also be affected by the confounders of related individuals, $\boldsymbol{x}_{\mathbb{G}_i}$ (Chu et al., [2021](https://arxiv.org/html/2605.24358#bib.bib54); Guo et al., [2020](https://arxiv.org/html/2605.24358#bib.bib51), [2021](https://arxiv.org/html/2605.24358#bib.bib127); Ma et al., [2021](https://arxiv.org/html/2605.24358#bib.bib1)). In this study, we refer to the confounders of an individual contained in $\boldsymbol{x}_i$ as individual confounders and to the confounders contained in $\boldsymbol{x}_{\mathbb{G}}$ as networked confounders. In observational graph data, confounding bias results in $p(t \mid \boldsymbol{x}, \boldsymbol{x}_{\mathbb{G}}) \neq p(1 - t \mid \boldsymbol{x}, \boldsymbol{x}_{\mathbb{G}})$ (Jiang and Sun, [2022](https://arxiv.org/html/2605.24358#bib.bib103)).
- Interference bias. A bias issue might also exist in networked interference (Jiang et al., [2023](https://arxiv.org/html/2605.24358#bib.bib94)), resulting in $p(\boldsymbol{t}_{\mathbb{G}} \mid \boldsymbol{x}, \boldsymbol{x}_{\mathbb{G}}, t) \neq p(\boldsymbol{t}_{\mathbb{G}} \mid \boldsymbol{x}, \boldsymbol{x}_{\mathbb{G}}, 1 - t)$, which is called interference bias. This can result in additional bias in ITE estimation. For instance, younger customers are not only more likely to be treated but also tend to have younger friends, who in turn have higher exposure to advertisements, whereas elderly customers have more elderly friends.
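The advertising example for confounding bias can be made concrete with a small simulation. This is a minimal sketch under made-up assumptions, not data or a procedure from the paper: a single binary confounder (young or not), no interference between individuals, a true effect fixed at 1.0, and illustrative probabilities and coefficients. It contrasts a naive difference in means with an estimate that adjusts for the confounder by stratification.

```python
import random

def simulate(n=20000, seed=0):
    """Draw (young, treated, outcome) triples with age as a confounder."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        young = 1 if rng.random() < 0.5 else 0
        p_treat = 0.8 if young else 0.2          # the young see more ads
        t = 1 if rng.random() < p_treat else 0
        # True treatment effect is 1.0; being young adds 2.0 regardless of t.
        y = 1.0 * t + 2.0 * young + rng.gauss(0.0, 1.0)
        rows.append((young, t, y))
    return rows

def mean(values):
    return sum(values) / len(values)

def naive_effect(rows):
    """Difference in mean outcomes between treated and control groups."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return mean(treated) - mean(control)

def stratified_effect(rows):
    """Average of within-stratum differences, weighted by stratum size."""
    total = 0.0
    for stratum in (0, 1):
        sub = [r for r in rows if r[0] == stratum]
        treated = [y for _, t, y in sub if t == 1]
        control = [y for _, t, y in sub if t == 0]
        total += (len(sub) / len(rows)) * (mean(treated) - mean(control))
    return total

rows = simulate()
print("naive:", naive_effect(rows))
print("stratified:", stratified_effect(rows))
```

Under these assumptions the naive estimate absorbs part of the age effect and overshoots the true value of 1.0, while the stratified estimate stays close to it. Interference bias adds a further layer that this sketch deliberately leaves out, because the treatments of related individuals are also correlated with the individual's own treatment.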
Figure 3. An example of a causal graph for an individual in graph data. In the causal graph, blue arrows represent the effect caused by confounders, green arrows represent the effect caused by networked interference, and yellow arrows represent the effect caused by treatments assigned to the individual. Here, $T$ and $T_{\mathbb{G}}$ are associated (Zhao et al., [2024](https://arxiv.org/html/2605.24358#bib.bib176)), but there is no causal edge between them.

ITE estimation from observational graph data. In observational graph data $\mathcal{D} = (\boldsymbol{X}, \boldsymbol{T}, \boldsymbol{Y}, \boldsymbol{A})$, we assume the existence of both confounders and networked interference with DNE. A causal graph is shown in Figure [3](https://arxiv.org/html/2605.24358#S3.F3). The potential outcomes of individual $i$ under the individual treatments $t = 1$ and $t = 0$, along with the treatments of related individuals $\boldsymbol{t}_{\mathbb{G}_i}$, are denoted by $y_i(1, \boldsymbol{t}_{\mathbb{G}_i})$ and $y_i(0, \boldsymbol{t}_{\mathbb{G}_i})$, respectively. Then, ITE with confounders and networked interference can be defined as:
$$
\tau_i \coloneqq \mathbb{E}\left[Y(1, \boldsymbol{t}_{\mathbb{G}_i}) - Y(0, \boldsymbol{t}_{\mathbb{G}_i}) \mid \boldsymbol{x}_i, \boldsymbol{x}_{\mathbb{G}_i}\right]. \tag{1}
$$

This definition is extended from that of ITE with neighbor interference in Jiang and Sun ([2022](https://arxiv.org/html/2605.24358#bib.bib103)).
Identifiability of ITE. We now show that ITE is identifiable from observational graph data under a set of assumptions. First, we extend the consistency assumption to networked interference (Forastiere et al., [2021](https://arxiv.org/html/2605.24358#bib.bib10)), as follows:
###### Assumption 3\.1\.
$y_i = y_i(t_i, \boldsymbol{t}_{\mathbb{G}_i})$ for individual $i$ with $t_i$ and $\boldsymbol{t}_{\mathbb{G}_i}$.

This assumption means that the potential outcome is equal to the observed outcome given $t_i$ and $\boldsymbol{t}_{\mathbb{G}_i}$. Then, we extend the unconfoundedness assumption for confounders of neighbors (Jiang and Sun, [2022](https://arxiv.org/html/2605.24358#bib.bib103)) to networked confounders:
###### Assumption 3\.2\.
For any individual $i$, given the covariates of the individual and of the individual's related individuals, the treatments of the individual and of the related individuals are independent of the potential outcomes, i.e., $T_i, T_{\mathbb{G}_i} \perp Y(1, \boldsymbol{t}_{\mathbb{G}_i}), Y(0, \boldsymbol{t}_{\mathbb{G}_i}) \mid \boldsymbol{x}_i, \boldsymbol{x}_{\mathbb{G}_i}$.

This assumption says that the confounders that describe the difference between the treated and control groups are observed in the individual covariates and the covariates of related individuals. Lastly, we extend the overlap assumption for neighbor interference (Chen et al., [2024](https://arxiv.org/html/2605.24358#bib.bib126)) to networked interference, as follows:
###### Assumption 3\.3\.
Given the covariates of any individual and of the individual's related individuals, the treatment pair $(t_i, \boldsymbol{t}_{\mathbb{G}_i})$ has a non-zero probability, i.e., $0 < p(t_i, \boldsymbol{t}_{\mathbb{G}_i} \mid \boldsymbol{x}_i, \boldsymbol{x}_{\mathbb{G}_i}) < 1, \forall \boldsymbol{x}_i, \forall \boldsymbol{x}_{\mathbb{G}_i}$.

This assumption means that the treatment assignment is nondeterministic (Chen et al., [2024](https://arxiv.org/html/2605.24358#bib.bib126)).
###### Theorem 3\.4\.
Under Assumptions [3.1](https://arxiv.org/html/2605.24358#S3.Thmtheorem1), [3.2](https://arxiv.org/html/2605.24358#S3.Thmtheorem2), and [3.3](https://arxiv.org/html/2605.24358#S3.Thmtheorem3), ITE is identifiable from observational graph data.

We prove Theorem [3.4](https://arxiv.org/html/2605.24358#S3.Thmtheorem4) as follows:
###### Proof\.
$$
\begin{aligned}
&\mathbb{E}\left[Y(1, \boldsymbol{t}_{\mathbb{G}_i}) - Y(0, \boldsymbol{t}_{\mathbb{G}_i}) \mid \boldsymbol{x}_i, \boldsymbol{x}_{\mathbb{G}_i}\right] \\
&= \mathbb{E}\left[Y(1, \boldsymbol{t}_{\mathbb{G}_i}) \mid \boldsymbol{x}_i, \boldsymbol{x}_{\mathbb{G}_i}\right] - \mathbb{E}\left[Y(0, \boldsymbol{t}_{\mathbb{G}_i}) \mid \boldsymbol{x}_i, \boldsymbol{x}_{\mathbb{G}_i}\right] \\
&= \mathbb{E}\left[Y(1, \boldsymbol{t}_{\mathbb{G}_i}) \mid \boldsymbol{x}_i, \boldsymbol{x}_{\mathbb{G}_i}, 1, \boldsymbol{t}_{\mathbb{G}_i}\right] - \mathbb{E}\left[Y(0, \boldsymbol{t}_{\mathbb{G}_i}) \mid \boldsymbol{x}_i, \boldsymbol{x}_{\mathbb{G}_i}, 0, \boldsymbol{t}_{\mathbb{G}_i}\right] \\
&= \mathbb{E}\left[Y \mid \boldsymbol{x}_i, \boldsymbol{x}_{\mathbb{G}_i}, 1, \boldsymbol{t}_{\mathbb{G}_i}\right] - \mathbb{E}\left[Y \mid \boldsymbol{x}_i, \boldsymbol{x}_{\mathbb{G}_i}, 0, \boldsymbol{t}_{\mathbb{G}_i}\right],
\end{aligned}
$$

where the second equality is based on Assumption [3.2](https://arxiv.org/html/2605.24358#S3.Thmtheorem2) and the third equality is based on Assumptions [3.1](https://arxiv.org/html/2605.24358#S3.Thmtheorem1) and [3.3](https://arxiv.org/html/2605.24358#S3.Thmtheorem3).
Let $\mathbb{C}_i$ denote the individuals that cannot reach individual $i$ in the graph $\boldsymbol{A}$, $\boldsymbol{x}_{\mathbb{C}_i}$ denote the covariates of the individuals in $\mathbb{C}_i$, and $\boldsymbol{t}_{\mathbb{C}_i}$ denote the treatments of the individuals in $\mathbb{C}_i$. In this case, the covariates $\boldsymbol{x}_{\mathbb{C}_i}$ and treatments $\boldsymbol{t}_{\mathbb{C}_i}$ cannot interfere with the outcome of individual $i$, which means that the outcome of individual $i$ can only receive interference from $\boldsymbol{x}_{\mathbb{G}_i}$ and $\boldsymbol{t}_{\mathbb{G}_i}$ in the graph $\boldsymbol{A}$. This implies that, given $\boldsymbol{x}_i$, $t_i$, $\boldsymbol{x}_{\mathbb{G}_i}$, and $\boldsymbol{t}_{\mathbb{G}_i}$, the outcome of individual $i$ does not receive interference from $\boldsymbol{x}_{\mathbb{C}_i}$, $\boldsymbol{t}_{\mathbb{C}_i}$, and $\boldsymbol{A}$. Then, we have:
$$
\begin{aligned}
&\mathbb{E}\left[Y \mid \boldsymbol{x}_i, \boldsymbol{x}_{\mathbb{G}_i}, t, \boldsymbol{t}_{\mathbb{G}_i}\right] \\
&= \mathbb{E}\left[Y \mid \boldsymbol{x}_i, t, \boldsymbol{x}_{\mathbb{G}_i}, \boldsymbol{t}_{\mathbb{G}_i}, \boldsymbol{x}_{\mathbb{C}_i}, \boldsymbol{t}_{\mathbb{C}_i}, \boldsymbol{A}\right] \\
&= \mathbb{E}\left[Y \mid \boldsymbol{x}_i, t, \boldsymbol{X}, \boldsymbol{T}, \boldsymbol{A}\right],
\end{aligned} \tag{2}
$$

where $\boldsymbol{x}_{\mathbb{G}}$ and $\boldsymbol{x}_{\mathbb{C}}$ constitute the covariates $\boldsymbol{X}$, while $\boldsymbol{t}_{\mathbb{G}}$ and $\boldsymbol{t}_{\mathbb{C}}$ constitute the treatments $\boldsymbol{T}$ for all individuals. ∎
This shows that we can recover the ITE from observational graph data. However, as the propagation mechanism of networked interference between individual $i$ and its related individuals $\mathbb{G}_i$ is unknown and complex, it is hard to model the networked interference received by individual $i$ from $\boldsymbol{x}_{\mathbb{G}_i}$ and $\boldsymbol{t}_{\mathbb{G}_i}$ on the data $(\boldsymbol{X}, \boldsymbol{T}, \boldsymbol{A})$. A common choice for addressing this issue is to generate representations of $\boldsymbol{x}_{\mathbb{G}_i}$ and $\boldsymbol{t}_{\mathbb{G}_i}$ by applying some aggregation function (Lin et al., [2025](https://arxiv.org/html/2605.24358#bib.bib132); Ma et al., [2022b](https://arxiv.org/html/2605.24358#bib.bib53); Ma and Tresp, [2021](https://arxiv.org/html/2605.24358#bib.bib12)). In this case, it is crucial to design a proper aggregation function; otherwise, we will erroneously characterize the networked interference with DNE received by individuals from $\boldsymbol{x}_{\mathbb{G}_i}$ and $\boldsymbol{t}_{\mathbb{G}_i}$, which can result in imprecise ITE estimation.
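The identification result suggests a plug-in estimate: fit a model of the conditional expected outcome given the individual's covariates and treatment together with the graph data, then toggle only the individual's own treatment while keeping every other treatment as observed. The sketch below illustrates that idea with a hand-written stand-in for a fitted model. The function `toy_outcome_model`, its coefficients, and its one-hop exposure summary are all hypothetical and are not GITE.

```python
def toy_outcome_model(i, t_i, X, T, A):
    """Hypothetical stand-in for a fitted model of E[Y | x_i, t_i, X, T, A].

    Interference enters only through the number of treated in-neighbors,
    which is a deliberate simplification for illustration.
    """
    in_neighbors = [k for k, a in enumerate(A[i]) if a == 1]
    exposure = sum(T[k] for k in in_neighbors)
    # The effect of t_i itself depends on how many neighbors are treated.
    return 0.5 * X[i] + t_i * (2.0 + 0.5 * exposure) + 0.3 * exposure

def plug_in_ite(model, i, X, T, A):
    """Toggle only t_i; the treatments of all other individuals stay as observed."""
    return model(i, 1, X, T, A) - model(i, 0, X, T, A)

X = [1.0, 0.0, 2.0]  # one scalar covariate per individual
T = [1, 0, 1]        # observed treatments
A = [
    [0, 1, 1],       # individual 0 receives edges from 1 and 2
    [0, 0, 1],       # individual 1 receives an edge from 2
    [0, 0, 0],       # individual 2 receives no edges
]

for i in range(len(X)):
    print(i, plug_in_ite(toy_outcome_model, i, X, T, A))
```

In this toy model, individuals with a treated in-neighbor get a larger estimated effect than the individual with none, which shows how the ITE in Eq. (1) depends on the treatments of related individuals even though only the individual's own treatment is toggled.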
## 4. Proposed method
Given observational graph data $\mathcal{D} = (\boldsymbol{X}, \boldsymbol{T}, \boldsymbol{Y}, \boldsymbol{A})$, we aim to estimate ITE with confounders and networked interference, which incorporates DNE. To this end, we propose GITE, as illustrated in Figure [2](https://arxiv.org/html/2605.24358#S1.F2) (the code is released at https://github.com/LINXF208/GITE/tree/main). GITE contains three key modules: a representation learning module, a representation balancing module, and an outcome prediction module. The representation learning module generates covariate representations of $\boldsymbol{x}_i$, as well as representations of networked interference derived from $\boldsymbol{x}_{\mathbb{G}_i}$ and $\boldsymbol{t}_{\mathbb{G}_i}$, which are called interference representations.
### 4.1. Representation learning module
In this section, we aim to generate covariate and interference representations. We generate covariate representations by an MLP, which are expected to capture individual confounders. To capture networked interference, it is important to model the propagation of interference among individuals. We model it with a layer-by-layer aggregation function that, for every individual, aggregates the interference-related information of individuals within the same local network. When modeling interference representations, we need to design the aggregation function carefully because of DNE, which consists of two sub-issues. (I) The importance of different neighbors in contributing to interference varies (Huang et al., [2023](https://arxiv.org/html/2605.24358#bib.bib102); Lin et al., [2023](https://arxiv.org/html/2605.24358#bib.bib91); Ma et al., [2022b](https://arxiv.org/html/2605.24358#bib.bib53)). (II) The scale of neighbors varies, leading to different levels of interference (see Figure [1](https://arxiv.org/html/2605.24358#S1.F1)). Although several powerful aggregation functions have been proposed for modeling interference, they still have limitations in capturing DNE, as discussed in Appendix [C](https://arxiv.org/html/2605.24358#A3). For example, the mean aggregation (Ma and Tresp, [2021](https://arxiv.org/html/2605.24358#bib.bib12)) and GCN (Welling and Kipf, [2016](https://arxiv.org/html/2605.24358#bib.bib17)) cannot capture DNE, as they fail to address issues (I) and (II). Although GAT-based methods (Huang et al., [2023](https://arxiv.org/html/2605.24358#bib.bib102); Lin et al., [2023](https://arxiv.org/html/2605.24358#bib.bib91); Ma et al., [2022b](https://arxiv.org/html/2605.24358#bib.bib53)) estimate importance based on individual information to take issue (I) into account, they cannot fully address issue (II), as they may degenerate to a mean aggregation when individuals are similar.
Sum aggregation (Xu et al., [2019](https://arxiv.org/html/2605.24358#bib.bib133)) can address issue (II), but does not address issue (I) and may suffer from a numerical explosion issue in some graphs with many connections, as shown in the results of ablation experiments (see results of GITE$_{\text{NA}}$ in Table [2](https://arxiv.org/html/2605.24358#S5.T2)). Therefore, a proper aggregation function needs to address both issues (I) and (II), while avoiding the risk of numerical explosion.
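The trade-off between mean and sum aggregation can be seen in a toy example. The sketch below is only an illustration under simplifying assumptions (scalar messages, all neighbors carrying the same message); the helper name `aggregate` is hypothetical and is not part of GITE.

```python
import numpy as np

def aggregate(messages, mode):
    """Aggregate a 1-D array of neighbor messages with 'mean' or 'sum'."""
    messages = np.asarray(messages, dtype=float)
    if mode == "mean":
        return float(messages.mean())
    if mode == "sum":
        return float(messages.sum())
    raise ValueError(f"unknown mode: {mode}")

# Two individuals whose neighbors all carry the same message (1.0),
# but whose local networks differ in scale: 2 neighbors vs. 1000 neighbors.
small = np.ones(2)
large = np.ones(1000)

# Mean aggregation returns the same value for both local networks,
# so the difference in neighbor scale (issue II) is lost.
mean_small, mean_large = aggregate(small, "mean"), aggregate(large, "mean")

# Sum aggregation separates them, but its output grows linearly with the
# number of neighbors, which is the source of the numerical explosion risk.
sum_small, sum_large = aggregate(small, "sum"), aggregate(large, "sum")
```

In this toy setting the mean cannot distinguish the two individuals, while the sum distinguishes them at the cost of an output that scales with degree.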
Figure 4. Architecture of an NIM layer. We show an example where the underlying graph consists of three individuals.

To address these nontrivial issues, we design an NIM layer, as illustrated in Figure [4](https://arxiv.org/html/2605.24358#S4.F4). To address issue (I), an NIM layer contains two partial attention mechanisms, IPAtt and SPAtt, which are intended to adaptively capture the varying contributions of neighbors to interference based on two key factors that influence the importance of interference. Specifically, the IPAtt mechanism estimates the importance of interference among individuals based on their interference representations, whereas the SPAtt mechanism estimates the importance of interference based on the structures of their local networks, which can also contribute to addressing issue (II). The two estimated partial importances guide separate aggregations, the results of which are adaptively integrated by a learnable summary function in each layer. To tackle issue (II), the NIM layer applies a message amplifier that scales the integrated results with the degrees of individuals and updates the interference representations. By aggregating interference-related information with NIM layers, we can jointly address issues (I) and (II), which enables the model to capture DNE.
Specifically, given covariates $\boldsymbol{X}$, treatments $\boldsymbol{T}$, and network $\boldsymbol{A}$, we aim to generate covariate representation $\boldsymbol{z}_i$, interference representation $\boldsymbol{z}_{\mathbb{X}_i}$ of $\boldsymbol{x}_{\mathbb{G}_i}$, and interference representation $\boldsymbol{z}_{\mathbb{T}_i}$ of $\boldsymbol{t}_{\mathbb{G}_i}$. Here, information of network $\boldsymbol{A}$ is encoded into $\boldsymbol{x}_{\mathbb{G}_i}$ and $\boldsymbol{t}_{\mathbb{G}_i}$. We use two aggregation functions to generate $\boldsymbol{z}_{\mathbb{X}_i}$ and $\boldsymbol{z}_{\mathbb{T}_i}$ separately due to the different dimensions and scales among covariates and treatments. Let $\boldsymbol{z}_{\mathbb{S}_i}$ denote the representation of the structure of the local network of the individual $i$. We use $\boldsymbol{Z}$ with a subscript to denote corresponding representations of all individuals, such as $\boldsymbol{Z}_{\mathbb{X}}$ and $\boldsymbol{Z}_{\mathbb{S}}$, and use a superscript $(l)$ to denote the representation generated by the $l$-th layer, such as $\boldsymbol{z}_{\mathbb{S}_i}^{(l)}$. Let $\sigma$ be an activation function, $\boldsymbol{W}^{(l)}$ with any subscript be a learnable parameter matrix, and $\boldsymbol{w}$ with any subscript be a learnable parameter vector.
We generate $\boldsymbol{z}_i$ independently by an MLP, i.e., $\mathrm{MLP}(\boldsymbol{x}_i)=\boldsymbol{z}_i$, as covariates are typically important to treatment assignment and outcome prediction of an individual.
Now, we describe the architecture of an NIM layer that consists of IPAtt, SPAtt, an encoder $\phi_{\mathbb{S}}$ for generating $\boldsymbol{z}_{\mathbb{S}}$, a covariate aggregation function $\phi_{\mathbb{X}}$ for generating $\boldsymbol{z}_{\mathbb{X}}$, and a treatment aggregation function $\phi_{\mathbb{T}}$ for generating $\boldsymbol{z}_{\mathbb{T}}$. Let $\pi_{\mathbb{S}}^{(l)}$ be a learnable one-dimensional parameter. In each NIM layer, we first generate $\boldsymbol{z}_{\mathbb{S}_i}^{(l)}$ by applying $\phi_{\mathbb{S}}\bigl(\boldsymbol{Z}_{\mathbb{S}}^{(l-1)},\mathbb{N}_i\bigr)$ with a GIN layer (Xu et al., [2019](https://arxiv.org/html/2605.24358#bib.bib133)), chosen for its strong ability to capture the structural information, as follows:
$$
\boldsymbol{z}_{\mathbb{S}_i}^{(l)}=\sigma\Biggl(\boldsymbol{W}_{\mathbb{S}}^{(l)}\biggl(\bigl(1+\pi_{\mathbb{S}}^{(l)}\bigr)\cdot\boldsymbol{z}_{\mathbb{S}_i}^{(l-1)}+\sum_{k\in\mathbb{N}_i}\boldsymbol{z}_{\mathbb{S}_k}^{(l-1)}\biggr)\Biggr). \tag{3}
$$

Next, we estimate individual partial importance $\alpha_{ik}^{\rm in}$ and structure partial importance $\alpha_{ik}^{\rm st}$ by IPAtt and SPAtt mechanisms, respectively. Let $\boldsymbol{p}_i^{\rm in}=\boldsymbol{z}_{\mathbb{X}_i}^{(l-1)}$ for the individual partial importance of $\phi_{\mathbb{X}}$, $\boldsymbol{p}_i^{\rm in}=[\boldsymbol{z}_{\mathbb{X}_i}^{(l-1)}\,\|\,\boldsymbol{z}_{\mathbb{T}_i}^{(l-1)}]$ for that of $\phi_{\mathbb{T}}$ due to the consideration that covariates can influence $\boldsymbol{t}_{\mathbb{G}_i}$ (see Figure [3](https://arxiv.org/html/2605.24358#S3.F3)), and $\boldsymbol{p}_i^{\rm st}=\boldsymbol{z}_{\mathbb{S}_i}^{(l-1)}$ for structure partial importance estimation. Here, $\|$ denotes the concatenation operation. $\alpha_{ik}^{\rm in}$ and $\alpha_{ik}^{\rm st}$ are estimated as follows:
$$
\alpha_{ik}^{\rm in}=\mathrm{Norm}\Bigl(a\bigl(\boldsymbol{p}_i^{\rm in},\boldsymbol{p}_k^{\rm in}\bigr)\Bigr),\quad\alpha_{ik}^{\rm st}=\mathrm{Norm}\Bigl(a\bigl(\boldsymbol{p}_i^{\rm st},\boldsymbol{p}_k^{\rm st}\bigr)\Bigr), \tag{4}
$$

where $a$ estimates the importance between two individuals based on its inputs, which can be implemented by several attention mechanisms, such as GAT (Veličković et al., [2017](https://arxiv.org/html/2605.24358#bib.bib119)) and the attention mechanism of Transformer (Vaswani et al., [2017](https://arxiv.org/html/2605.24358#bib.bib5); Ying et al., [2021](https://arxiv.org/html/2605.24358#bib.bib151)), as detailed in Appendix [B](https://arxiv.org/html/2605.24358#A2). $\mathrm{Norm}$ is a normalization operation, which adjusts estimated importance based on the sum of estimated importance of all neighbors $\tilde{\mathbb{N}}_i=\mathbb{N}_i\cup\{i\}$ of the individual $i$ to prevent the risk of numerical explosion.
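Equations (3) and (4) can be sketched in NumPy as follows. This is a minimal illustration rather than the released implementation: the scoring function $a$ is assumed to be a scaled dot product, Norm is assumed to be a softmax over $\tilde{\mathbb{N}}_i$, the all-ones initial structure feature and the fixed weights are placeholders, and the function names are hypothetical.

```python
import numpy as np

def gin_structure_layer(Z_prev, A, W, pi, activation=np.tanh):
    """Eq. (3) for all individuals: sigma(W((1 + pi) z_i + sum_{k in N_i} z_k)).

    Z_prev: (n, d_in) structure representations from layer l-1.
    A:      (n, n) binary adjacency without self-loops; A[i, k] = 1 iff k in N_i.
    W:      (d_out, d_in) weight matrix; pi: scalar (learnable in the paper).
    """
    combined = (1.0 + pi) * Z_prev + A @ Z_prev
    return activation(combined @ W.T)

def partial_attention(P, A):
    """Eq. (4) under the stated assumptions; returns an (n, n) matrix alpha.

    alpha[i, k] is nonzero only for k in N_i or k == i, and each row sums to 1.
    """
    n, d = P.shape
    mask = (A + np.eye(n)) > 0                   # neighborhood with self-loop
    scores = (P @ P.T) / np.sqrt(d)              # assumed scoring function a
    scores = np.where(mask, scores, -np.inf)     # ignore non-neighbors
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

# Path graph 0 - 1 - 2 (adjacency without self-loops).
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])

# Placeholder initial structure features and weights (assumptions).
Z_S0 = np.ones((3, 1))
W_S = np.array([[1.0]])
Z_S1 = gin_structure_layer(Z_S0, A, W_S, pi=0.0, activation=lambda x: x)

# Structure partial importance computed from the structure representations.
alpha_st = partial_attention(Z_S1, A)

# With identical inputs, the attention weights become uniform over N_i and i.
alpha_uniform = partial_attention(np.ones((3, 2)), A)
```

With an identity activation and unit weights, each structure representation equals one plus the degree of the individual. The uniform case shows why normalized attention alone behaves like a mean aggregation when individuals are similar.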
Subsequently, we update interference representations $\boldsymbol{z}_{\mathbb{X}_i}^{(l-1)}$ and $\boldsymbol{z}_{\mathbb{T}_i}^{(l-1)}$ by using two functions $\phi_{\mathbb{X}}\bigl(\boldsymbol{Z}_{\mathbb{X}}^{(l-1)},\tilde{\mathbb{N}}_i,\{\alpha^{\rm in}_{ik},\alpha^{\rm st}_{ik}\}_{k\in\tilde{\mathbb{N}}_i}\bigr)$ and $\phi_{\mathbb{T}}\bigl(\boldsymbol{Z}_{\mathbb{T}}^{(l-1)},\tilde{\mathbb{N}}_i,\{\alpha^{\rm in}_{ik},\alpha^{\rm st}_{ik}\}_{k\in\tilde{\mathbb{N}}_i}\bigr)$, respectively, as follows:
$$
\begin{aligned}
\mathring{\boldsymbol{z}}_{\mathbb{X}_i}^{(l)}={}&\mathring{\pi}_{\mathbb{X}}^{(l)}\cdot\sigma\Biggl(\sum_{k\in\tilde{\mathbb{N}}_i}\alpha_{ik}^{\rm in}\cdot\boldsymbol{W}_{\mathbb{X}_{\rm in}}^{(l)}\boldsymbol{z}_{\mathbb{X}_k}^{(l-1)}\Biggr)+\left(1-\mathring{\pi}_{\mathbb{X}}^{(l)}\right)\cdot\sigma\Biggl(\sum_{k\in\tilde{\mathbb{N}}_i}\alpha_{ik}^{\rm st}\cdot\boldsymbol{W}_{\mathbb{X}_{\rm st}}^{(l)}\boldsymbol{z}_{\mathbb{X}_k}^{(l-1)}\Biggr),\\
\mathring{\boldsymbol{z}}_{\mathbb{T}_i}^{(l)}={}&\mathring{\pi}_{\mathbb{T}}^{(l)}\cdot\sigma\Biggl(\sum_{k\in\tilde{\mathbb{N}}_i}\alpha_{ik}^{\rm in}\cdot\boldsymbol{W}_{\mathbb{T}_{\rm in}}^{(l)}\boldsymbol{z}_{\mathbb{T}_k}^{(l-1)}\Biggr)+\left(1-\mathring{\pi}_{\mathbb{T}}^{(l)}\right)\cdot\sigma\Biggl(\sum_{k\in\tilde{\mathbb{N}}_i}\alpha_{ik}^{\rm st}\cdot\boldsymbol{W}_{\mathbb{T}_{\rm st}}^{(l)}\boldsymbol{z}_{\mathbb{T}_k}^{(l-1)}\Biggr).
\end{aligned} \tag{5}
$$

Here $\mathring{\pi}_{\bullet}^{(l)}$ for $\bullet\in\{\mathbb{X},\mathbb{T}\}$ is a learnable one-dimensional parameter with the range of $[0,1]$, which is used in the summary function to adaptively integrate aggregated results with different partial importance.
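The summary function in Equation (5) can be sketched as below for one of the two branches (the same code applies to $\mathbb{X}$ and $\mathbb{T}$). This is an illustrative sketch only: the attention matrices are supplied by hand, the weights are fixed placeholders, and the name `nim_aggregate` is hypothetical. How the range $[0,1]$ of the mixing parameter is enforced during training is not specified here, so the sketch simply checks it.

```python
import numpy as np

def nim_aggregate(Z_prev, alpha_in, alpha_st, W_in, W_st, pi, activation=np.tanh):
    """Eq. (5): pi * sigma(sum_k a_in W_in z_k) + (1 - pi) * sigma(sum_k a_st W_st z_k).

    Z_prev:             (n, d_in) interference representations from layer l-1.
    alpha_in, alpha_st: (n, n) partial importance matrices (rows over N_i and i).
    W_in, W_st:         (d_out, d_in) weight matrices.
    pi:                 scalar in [0, 1] mixing the two aggregations.
    """
    if not 0.0 <= pi <= 1.0:
        raise ValueError("pi must lie in [0, 1]")
    h_in = activation(alpha_in @ Z_prev @ W_in.T)   # IPAtt-guided aggregation
    h_st = activation(alpha_st @ Z_prev @ W_st.T)   # SPAtt-guided aggregation
    return pi * h_in + (1.0 - pi) * h_st

# Toy inputs on a fully connected graph of three individuals (assumptions).
Z_prev = np.array([[1.0], [2.0], [3.0]])
alpha_in = np.eye(3)                      # each individual attends only to itself
alpha_st = np.full((3, 3), 1.0 / 3.0)     # uniform structure-based importance
W = np.array([[1.0]])

Z_ring = nim_aggregate(Z_prev, alpha_in, alpha_st, W, W,
                       pi=0.5, activation=lambda x: x)
```

With `pi=0.5` and an identity activation, each output is the average of the individual's own value and the neighborhood mean, which shows how the two partial views are blended.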
To prevent the normalization operation in IPAtt and SPAtt from exacerbating the issue (II) of DNE (as observed in Appendix [D](https://arxiv.org/html/2605.24358#A4)), we design a message amplifier $\eta_i$ to vary interference representations with the degrees of individuals, inspired by Corso et al. ([2020](https://arxiv.org/html/2605.24358#bib.bib129)):
$$
\begin{aligned}
\boldsymbol{z}_{\mathbb{X}_i}^{(l)}&=(1+\eta_i)\cdot\mathring{\boldsymbol{z}}_{\mathbb{X}_i}^{(l)},\quad\boldsymbol{z}_{\mathbb{T}_i}^{(l)}=(1+\eta_i)\cdot\mathring{\boldsymbol{z}}_{\mathbb{T}_i}^{(l)},\\
\eta_i&=\pi_{\eta}\cdot\left(\log(\tilde{d}_i)\Big/\sum_{j=1}^{n_{\rm tr}}\log(\tilde{d}_j)\right),
\end{aligned} \tag{6}
$$

where $\pi_{\eta}$ is a learnable parameter or hyperparameter that adjusts the amplification level, $\tilde{d}_i$ represents the degree of $i$ including the self-loop, and $n_{\rm tr}$ represents the size of the training set. Implementation and hyperparameter details are described in Appendix [M](https://arxiv.org/html/2605.24358#A13). By applying the NIM layer with the message amplifier to generate interference representations, the model can address the issue (II) of DNE, as stated in Proposition [4.1](https://arxiv.org/html/2605.24358#S4.Thmtheorem1) and proved in Appendix [D](https://arxiv.org/html/2605.24358#A4).
###### Proposition 4.1.
Interference representation generated by NIM layers after applying the message amplifier can address the issue (II) of DNE, even in local networks where all individuals have similar interference-related information.
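The message amplifier of Equation (6) and the intuition behind Proposition 4.1 can be illustrated with a small sketch. It assumes a binary adjacency matrix without self-loops and treats $\pi_{\eta}$ as a fixed hyperparameter; the function name `message_amplifier` and the toy values are hypothetical.

```python
import numpy as np

def message_amplifier(A, train_idx, pi_eta):
    """Eq. (6): eta_i = pi_eta * log(d~_i) / sum_{j in train} log(d~_j).

    A:         (n, n) binary adjacency matrix without self-loops.
    train_idx: indices of the training individuals used in the denominator.
    pi_eta:    scalar amplification level.
    """
    deg_tilde = A.sum(axis=1) + 1.0          # degree including the self-loop
    log_deg = np.log(deg_tilde)
    denom = log_deg[train_idx].sum()
    if denom == 0.0:
        # Happens only if every training individual is isolated (d~ = 1).
        raise ValueError("all training individuals are isolated; eta is undefined")
    return pi_eta * log_deg / denom

# Star graph: individual 0 is connected to individuals 1, 2, and 3.
A = np.zeros((4, 4))
A[0, 1:] = 1.0
A[1:, 0] = 1.0

eta = message_amplifier(A, train_idx=np.arange(4), pi_eta=5.0)

# Identical pre-amplification representations, as in the degenerate case
# where normalized attention reduces to a mean over similar neighbors.
Z_ring = np.ones((4, 2))
Z = (1.0 + eta)[:, None] * Z_ring
```

Here the hub has $\tilde{d}=4$ and the leaves have $\tilde{d}=2$, so the hub's representation is amplified more than the leaves' even though all pre-amplification representations are identical. The logarithm keeps the amplification from growing linearly with degree.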
### 4.2. Representation balancing module
In this section, we aim to address issues of both confounding bias $p(t\mid\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})\neq p(1-t\mid\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})$ and interference bias $p(\boldsymbol{t}_{\mathbb{G}}\mid\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}},t)\neq p(\boldsymbol{t}_{\mathbb{G}}\mid\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}},1-t)$. Most ITE estimators for graph data mitigate these bias issues by a strategy of separately balancing representations $\boldsymbol{z}_i$, $\boldsymbol{z}_{\mathbb{X}_i}$, and $\boldsymbol{z}_{\mathbb{T}_i}$ between treated and control groups, such as Ma and Tresp ([2021](https://arxiv.org/html/2605.24358#bib.bib12)), Ma et al. ([2022b](https://arxiv.org/html/2605.24358#bib.bib53)), Jiang and Sun ([2022](https://arxiv.org/html/2605.24358#bib.bib103)), and Lin et al. ([2025](https://arxiv.org/html/2605.24358#bib.bib132)). If we use their strategies for representation balancing, we need to achieve two goals: $p(\boldsymbol{z},\boldsymbol{z}_{\mathbb{X}}\mid t=1)\approx p(\boldsymbol{z},\boldsymbol{z}_{\mathbb{X}}\mid t=0)$ and $p(\boldsymbol{z}_{\mathbb{T}}\mid\boldsymbol{z},\boldsymbol{z}_{\mathbb{X}},t=1)\approx p(\boldsymbol{z}_{\mathbb{T}}\mid\boldsymbol{z},\boldsymbol{z}_{\mathbb{X}},t=0)$ for mitigating confounding and interference biases, respectively. This requires multiple hyperparameters to trade off each loss term for different representation balancing targets, leading to costly hyperparameter selection.
Furthermore, there might exist bias caused by some unobserved variables (Wang et al., [2023](https://arxiv.org/html/2605.24358#bib.bib122)), which cannot be addressed by the strategy of separate balancing. To address these limitations, we propose a new representation balancing strategy that can jointly balance representations $\boldsymbol{z}_i$, $\boldsymbol{z}_{\mathbb{X}_i}$, and $\boldsymbol{z}_{\mathbb{T}_i}$ between treated and control groups while mitigating the bias issue caused by unobserved variables through a proximal factual outcome regularizer (PFOR) (Courty et al., [2017](https://arxiv.org/html/2605.24358#bib.bib135); Wang et al., [2023](https://arxiv.org/html/2605.24358#bib.bib122)). Our key insight stems from the observation $p(\boldsymbol{z},\boldsymbol{z}_{\mathbb{X}}\mid t)\,p(\boldsymbol{z}_{\mathbb{T}}\mid\boldsymbol{z},\boldsymbol{z}_{\mathbb{X}},t)=p(\boldsymbol{z},\boldsymbol{z}_{\mathbb{X}},\boldsymbol{z}_{\mathbb{T}}\mid t)$. This implies that we can balance $\boldsymbol{z}_i$, $\boldsymbol{z}_{\mathbb{X}_i}$, and $\boldsymbol{z}_{\mathbb{T}_i}$ between treated and control groups by balancing their joint distribution. In this case, the model can adaptively prioritize which parts of representations $\boldsymbol{z}_i$, $\boldsymbol{z}_{\mathbb{X}_i}$, and $\boldsymbol{z}_{\mathbb{T}_i}$ need to be balanced at each training iteration. Importantly, the proposed strategy has the potential to mitigate bias caused by some unobserved variables through a Wasserstein discrepancy (Shalit et al., [2017](https://arxiv.org/html/2605.24358#bib.bib4)) (see Definition [E.2](https://arxiv.org/html/2605.24358#A5.Thmtheorem2)) with a PFOR.
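The step from joint balancing to the two separate goals can be spelled out as follows. This is a short derivation sketch, assuming that the densities exist and that the marginal $p(\boldsymbol{z},\boldsymbol{z}_{\mathbb{X}}\mid t)$ is positive wherever we divide by it.

$$
\begin{aligned}
&\text{Suppose}\quad p(\boldsymbol{z},\boldsymbol{z}_{\mathbb{X}},\boldsymbol{z}_{\mathbb{T}}\mid t=1)=p(\boldsymbol{z},\boldsymbol{z}_{\mathbb{X}},\boldsymbol{z}_{\mathbb{T}}\mid t=0).\\
&\text{Marginalizing:}\quad p(\boldsymbol{z},\boldsymbol{z}_{\mathbb{X}}\mid t)=\int p(\boldsymbol{z},\boldsymbol{z}_{\mathbb{X}},\boldsymbol{z}_{\mathbb{T}}\mid t)\,d\boldsymbol{z}_{\mathbb{T}}
\;\Rightarrow\; p(\boldsymbol{z},\boldsymbol{z}_{\mathbb{X}}\mid t=1)=p(\boldsymbol{z},\boldsymbol{z}_{\mathbb{X}}\mid t=0).\\
&\text{Dividing:}\quad p(\boldsymbol{z}_{\mathbb{T}}\mid\boldsymbol{z},\boldsymbol{z}_{\mathbb{X}},t)=\frac{p(\boldsymbol{z},\boldsymbol{z}_{\mathbb{X}},\boldsymbol{z}_{\mathbb{T}}\mid t)}{p(\boldsymbol{z},\boldsymbol{z}_{\mathbb{X}}\mid t)}
\;\Rightarrow\; p(\boldsymbol{z}_{\mathbb{T}}\mid\boldsymbol{z},\boldsymbol{z}_{\mathbb{X}},t=1)=p(\boldsymbol{z}_{\mathbb{T}}\mid\boldsymbol{z},\boldsymbol{z}_{\mathbb{X}},t=0).
\end{aligned}
$$

In other words, exact balance of the joint distribution implies both separate balancing goals, so a single discrepancy term on the joint can stand in for the two separately weighted terms. In practice the balance is only approximate, so this should be read as motivation rather than a guarantee.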
Specifically, let $\mathcal{W}(p_1,p_2)$ denote the Wasserstein discrepancy between two distributions $p_1$ and $p_2$, and $\Phi$ denote a map function that is achieved by combining the MLP and NIM layers (see Section [4.1](https://arxiv.org/html/2605.24358#S4.SS1)). Let $\boldsymbol{u}_i=(\boldsymbol{x}_i,\boldsymbol{x}_{\mathbb{G}_i},\boldsymbol{t}_{\mathbb{G}_i})$ and $\boldsymbol{r}_i=\Phi(\boldsymbol{u}_i)=(\boldsymbol{z}_i,\boldsymbol{z}_{\mathbb{X}_i},\boldsymbol{z}_{\mathbb{T}_i})$ for simplicity.
We minimize Wasserstein discrepancy with PFOR (Courty et al., [2017](https://arxiv.org/html/2605.24358#bib.bib135); Wang et al., [2023](https://arxiv.org/html/2605.24358#bib.bib122)) to balance representations $\boldsymbol{r}_i$ between different treatment groups. The definition for Wasserstein discrepancy is detailed in Definition [E.2](https://arxiv.org/html/2605.24358#A5.Thmtheorem2). As the scales of values of representations $\boldsymbol{z}$, $\boldsymbol{z}_{\mathbb{X}}$, and $\boldsymbol{z}_{\mathbb{T}}$ may be uneven, we propose a proxy module that contains a normalization layer followed by a projection function without nonlinear activation function. Then, we balance representations $\boldsymbol{z}$, $\boldsymbol{z}_{\mathbb{X}}$, and $\boldsymbol{z}_{\mathbb{T}}$ jointly by balancing the output of the proxy module. There might be unobserved variables that introduce an additional bias issue (Wang et al., [2023](https://arxiv.org/html/2605.24358#bib.bib122)). By applying the proposed joint balancing strategy, we can use the PFOR (Courty et al., [2017](https://arxiv.org/html/2605.24358#bib.bib135); Wang et al., [2023](https://arxiv.org/html/2605.24358#bib.bib122)) to mitigate the bias from unobserved variables. Let $\boldsymbol{r}'_i$ be the output of the proxy module. In this case, if we calculate unit-wise distance $D_{ij}=\|\boldsymbol{r}'_i-\boldsymbol{r}'_j\|^2$ for Wasserstein discrepancy, we can only mitigate the bias introduced by observed data. Let $\boldsymbol{v}$ be the unobserved variables. To take $\boldsymbol{v}$ into account, the individual-wise distance needs to be modified as $D_{ij}=\|\boldsymbol{r}'_i-\boldsymbol{r}'_j\|^2+\|\boldsymbol{v}_i-\boldsymbol{v}_j\|^2$.
However, we do not have information about $\boldsymbol{v}$. Inspired by Wang et al. ([2023](https://arxiv.org/html/2605.24358#bib.bib122)), who designed a PFOR for the no-interference setting, we design a PFOR for scenarios involving networked interference. Specifically, when $\boldsymbol{r}$ (or $\boldsymbol{r}'$) is balanced and $t$ is identical, the only variable reflecting the variation of $\boldsymbol{v}$ is the outcome. Therefore, we can use information of outcomes to replace $\boldsymbol{v}$ in the modified individual-wise distance by the PFOR (Courty et al., [2017](https://arxiv.org/html/2605.24358#bib.bib135); Wang et al., [2023](https://arxiv.org/html/2605.24358#bib.bib122)). Then, the individual-wise distance can be modified as:
$$
D_{ij}^{\lambda_D}=\|\boldsymbol{r}'_i-\boldsymbol{r}'_j\|^2+\lambda_D\cdot\Bigl(\|y_i(1,\boldsymbol{t}_{\mathbb{G}_i})-y_j(1,\boldsymbol{t}_{\mathbb{G}_i})\|^2+\|y_i(0,\boldsymbol{t}_{\mathbb{G}_i})-y_j(0,\boldsymbol{t}_{\mathbb{G}_i})\|^2\Bigr), \tag{7}
$$

where $\lambda_D$ is a hyperparameter. As only a part of potential outcomes is observed, we use factual and predicted outcomes to replace potential outcomes for $D_{ij}^{\lambda_D}$, then we have:
$$
D_{ij}^{\lambda_D}=\|\boldsymbol{r}'_i-\boldsymbol{r}'_j\|^2+\lambda_D\cdot\Bigl(\|y_i-\hat{y}_j\|^2+\|\hat{y}_i-y_j\|^2\Bigr), \tag{8}
$$

where $i$ and $j$ represent two individuals in different treatment groups, and $\hat{y}_i$ represents the predicted outcome.
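The cost matrix of Equation (8) can be computed with broadcasting, as in the following sketch. It assumes scalar outcomes and that the predictions $\hat{y}$ are supplied by the caller (which prediction head produces them is not fixed by this sketch); the function name `pfor_distance` and the toy values are hypothetical. The resulting matrix would then serve as the ground cost of the Wasserstein discrepancy.

```python
import numpy as np

def pfor_distance(R_treated, R_control, y_treated, y_control,
                  y_hat_treated, y_hat_control, lambda_d):
    """Eq. (8): D[i, j] for treated individual i and control individual j.

    R_treated, R_control:         (n1, d) and (n0, d) proxy-module outputs r'.
    y_treated, y_control:         (n1,) and (n0,) factual outcomes.
    y_hat_treated, y_hat_control: (n1,) and (n0,) predicted outcomes.
    lambda_d:                     weight of the outcome-based regularizer.
    """
    # Squared Euclidean distance between every treated/control pair.
    diff = R_treated[:, None, :] - R_control[None, :, :]
    rep_dist = (diff ** 2).sum(axis=-1)
    # Outcome terms: |y_i - y_hat_j|^2 + |y_hat_i - y_j|^2.
    out_dist = ((y_treated[:, None] - y_hat_control[None, :]) ** 2
                + (y_hat_treated[:, None] - y_control[None, :]) ** 2)
    return rep_dist + lambda_d * out_dist

# Toy example with two treated individuals and one control individual.
R_treated = np.array([[0.0, 0.0], [3.0, 4.0]])
R_control = np.array([[3.0, 4.0]])
D = pfor_distance(R_treated, R_control,
                  y_treated=np.array([1.0, 2.0]),
                  y_control=np.array([2.0]),
                  y_hat_treated=np.array([0.0, 2.0]),
                  y_hat_control=np.array([3.0]),
                  lambda_d=0.5)
```

The second treated individual shares its representation with the control individual, so its cost is driven entirely by the outcome term. This is the mechanism by which outcomes act as a proxy for unobserved variation.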
Variant. We propose a variant named GITE$_v$ that uses an MLP with a nonlinear activation function for the proxy module, since the expressive ability of projection function without nonlinear activation function is often limited. Let $\boldsymbol{r}_i'$ be the output of the proxy module. We add a term into loss: $\mathcal{L}_P=\|\boldsymbol{r}_i-\boldsymbol{r}_i'\|^2$ for GITE$_v$ to ensure that $\boldsymbol{r}_i$ is close to $\boldsymbol{r}_i'$ when balancing the representations.
Table 1. Results (mean and standard errors) on the test sets. Results are averaged over ten executions. Results in boldface represent the lowest mean error, whereas underlined results represent the second and third lowest mean error. Here, the AMZ-N dataset is a sparse graph, as its number of nodes is approximately equal to its number of edges, whereas the Flickr and Blog datasets are far denser than the AMZ-N dataset.
### 4.3. Outcome prediction module
Given the covariate representation $\boldsymbol{z}_i$, interference representations $\boldsymbol{z}_{\mathbb{X}_i}$ and $\boldsymbol{z}_{\mathbb{T}_i}$, and treatment assignment $t_i$, we train two predictors to infer the outcomes with different values of $t$.
Specifically, let $f_0$ and $f_1$ denote the predictor for potential outcome with $t=0$ and $t=1$, respectively. Each predictor is achieved by an MLP. Let $\Theta$ be all learnable parameters of GITE. We add L2 regularization into our loss function to avoid model overfitting. The loss function $\mathcal{L}_{\rm total}$ of the proposed GITE consists of mean square error (MSE) between predicted and factual outcomes, Wasserstein discrepancy of representations between different treatment groups, and L2 regularization (denoted as $\|\Theta\|^2$). Each term in $\mathcal{L}_{\rm total}$ is traded off by hyperparameters $\beta$ and $\lambda$, as follows:
$$
\mathcal{L}_{\rm total}=\frac{1}{n_{\rm tr}}\sum_{i=1}^{n_{\rm tr}}\left(f_t(\boldsymbol{z}_i,\boldsymbol{z}_{\mathbb{X}_i},\boldsymbol{z}_{\mathbb{T}_i})-y_i\right)^2+\beta\cdot\mathcal{W}+\lambda\cdot\|\Theta\|^2. \tag{9}
$$

The parameters of GITE are optimized by minimizing $\mathcal{L}_{\rm total}$. For GITE$_v$, we add $\lambda_P\cdot\mathcal{L}_P$ into Equation ([9](https://arxiv.org/html/2605.24358#S4.E9)), where $\lambda_P$ is a hyperparameter. After training, ITE can be estimated using the trained predictors and generated representations, as follows:
$$
\hat{\tau}_i=f_1(\boldsymbol{z}_i,\boldsymbol{z}_{\mathbb{X}_i},\boldsymbol{z}_{\mathbb{T}_i})-f_0(\boldsymbol{z}_i,\boldsymbol{z}_{\mathbb{X}_i},\boldsymbol{z}_{\mathbb{T}_i}). \tag{10}
$$
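Equations (9) and (10) translate into a few lines of code. The sketch below is only illustrative: the Wasserstein term is passed in as a precomputed number, the predictors are stand-in callables rather than trained MLPs, and the names `total_loss` and `estimate_ite` are hypothetical.

```python
import numpy as np

def total_loss(y_hat_factual, y, wasserstein, params, beta, lam):
    """Eq. (9): MSE on factual outcomes + beta * W + lambda * ||Theta||^2.

    y_hat_factual: (n,) predictions from the head matching each observed t_i.
    y:             (n,) factual outcomes.
    wasserstein:   scalar discrepancy between treatment groups (precomputed).
    params:        list of parameter arrays making up Theta.
    """
    mse = np.mean((y_hat_factual - y) ** 2)
    l2 = sum(np.sum(p ** 2) for p in params)
    return float(mse + beta * wasserstein + lam * l2)

def estimate_ite(f1, f0, z, z_x, z_t):
    """Eq. (10): difference between the two predictors on the same inputs."""
    return f1(z, z_x, z_t) - f0(z, z_x, z_t)

# Toy values for the loss (assumptions, not results from the paper).
loss = total_loss(y_hat_factual=np.array([1.0, 2.0]),
                  y=np.array([1.0, 4.0]),
                  wasserstein=0.5,
                  params=[np.array([1.0, 2.0])],
                  beta=2.0, lam=0.1)

# Stand-in predictors that only look at the covariate representation.
f1 = lambda z, z_x, z_t: 2.0 * z[:, 0]
f0 = lambda z, z_x, z_t: z[:, 0]
z = np.array([[1.0], [2.0]])
tau_hat = estimate_ite(f1, f0, z, z_x=None, z_t=None)
```

Both predictors receive the same representations, including the same interference representations, so the estimated effect isolates the change in the individual's own treatment.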
Error bounds. To estimate ITE from observational data, the training of the model relies on factual outcomes instead of true ITE due to the absence of counterfactual outcomes. However, confounding and interference biases may cause the performance of factual outcomes to poorly reflect that of ITE estimation, which raises concerns about the ability of the trained model to guide decision-making. To address this, Shalit et al. ([2017](https://arxiv.org/html/2605.24358#bib.bib4)) and Cai et al. ([2023](https://arxiv.org/html/2605.24358#bib.bib125)) analyze the error bound for ITE estimation by building a bridge between errors in ITE estimation and factual outcome prediction. However, the former applies only to non-graph data, and the latter assumes neighbor interference rather than networked interference. Therefore, we analyze the error bound of ITE with networked interference based on the proposed representation balancing strategy, as detailed in Appendix [E](https://arxiv.org/html/2605.24358#A5).
## 5. Experiments
In this section, we conducted experiments to answer the following research questions (RQs).
- RQ 1: Do the proposed methods outperform baseline methods in ITE estimation with confounders and networked interference?
- RQ 2: Are proposed components important to the proposed methods?
- RQ 3: How sensitive are the proposed methods to their hyperparameters?
Table 2. Results (mean and standard errors) of ablation experiments. Results are averaged over ten executions.

### 5.1. Experimental setting
Datasets. We conducted experiments on three public datasets widely used in the task of ITE estimation with interference: Amazon negative (abbreviated as AMZ-N) dataset (He and McAuley, [2016](https://arxiv.org/html/2605.24358#bib.bib67)), Flickr dataset (Wang et al., [2013](https://arxiv.org/html/2605.24358#bib.bib69)), and BlogCatalog dataset (abbreviated as Blog) (Li et al., [2015](https://arxiv.org/html/2605.24358#bib.bib74), [2019](https://arxiv.org/html/2605.24358#bib.bib73)). The AMZ-N dataset contains 14,538 items with 15,011 edges. For the AMZ-N dataset, we used the covariates, treatments, outcomes, and ITE, all of which were provided by Rakesh et al. ([2018](https://arxiv.org/html/2605.24358#bib.bib52)). The Flickr dataset contains 7,575 users with 479,476 directed edges. We used the 1,206-dimensional embeddings of user profiles that were provided by Guo et al. ([2020](https://arxiv.org/html/2605.24358#bib.bib51)). The Blog dataset contains 5,196 units with 343,486 directed edges. We used the 2,198-dimensional embeddings of user profiles that were provided by Guo et al. ([2020](https://arxiv.org/html/2605.24358#bib.bib51)). The details of each dataset are described in Appendix [I](https://arxiv.org/html/2605.24358#A9). The ground truth of ITE is hard to collect due to the lack of ground truth regarding counterfactual outcomes. Following existing works (Cai et al., [2023](https://arxiv.org/html/2605.24358#bib.bib125); Jiang and Sun, [2022](https://arxiv.org/html/2605.24358#bib.bib103); Ma and Tresp, [2021](https://arxiv.org/html/2605.24358#bib.bib12)), we transformed the covariates and graph structures to simulate treatments and outcomes with confounders and networked interference for the Flickr and Blog datasets, as detailed in Appendix [J](https://arxiv.org/html/2605.24358#A10).
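To make the difference in density concrete, the edge-to-node ratios implied by the counts above can be computed directly. This is simple arithmetic on the reported numbers; the dictionary layout is only for illustration.

```python
# (number of nodes, number of edges) as reported for each dataset.
datasets = {
    "AMZ-N": (14538, 15011),
    "Flickr": (7575, 479476),
    "Blog": (5196, 343486),
}

# Average number of edges per node for each graph.
edges_per_node = {name: edges / nodes for name, (nodes, edges) in datasets.items()}

for name, ratio in edges_per_node.items():
    print(f"{name}: {ratio:.2f} edges per node")
```

AMZ-N has roughly one edge per node, whereas Flickr and Blog have roughly 63 and 66 edges per node, respectively.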
Baselines. Baseline methods can be divided into the following four categories. (I) ITE estimators for non-graph data: BNN (Johansson et al., [2016](https://arxiv.org/html/2605.24358#bib.bib7)), CFR-MMD (Shalit et al., [2017](https://arxiv.org/html/2605.24358#bib.bib4)), CFR-Wass (Shalit et al., [2017](https://arxiv.org/html/2605.24358#bib.bib4)), TARNet (Shalit et al., [2017](https://arxiv.org/html/2605.24358#bib.bib4)), ESCFR (Wang et al., [2023](https://arxiv.org/html/2605.24358#bib.bib122)), and RERUM (He et al., [2024](https://arxiv.org/html/2605.24358#bib.bib177)). (II) ITE estimator for graph data that does not address interference: NetDeconf (Guo et al., [2020](https://arxiv.org/html/2605.24358#bib.bib51)). (III) ITE estimators for graph data that address neighbor interference only: NetEst (Jiang and Sun, [2022](https://arxiv.org/html/2605.24358#bib.bib103)), SPNet (Huang et al., [2023](https://arxiv.org/html/2605.24358#bib.bib102)), DWR (Zhao et al., [2024](https://arxiv.org/html/2605.24358#bib.bib176)), and CauGramer ([Wu et al.,](https://arxiv.org/html/2605.24358#bib.bib148)). (IV) ITE estimators for graph data that address networked interference: GCN-HSIC (Ma and Tresp, [2021](https://arxiv.org/html/2605.24358#bib.bib12)), SAGE-HSIC (Ma and Tresp, [2021](https://arxiv.org/html/2605.24358#bib.bib12)), SITE (Lin et al., [2025](https://arxiv.org/html/2605.24358#bib.bib132)), and IDENet (Adhikari and Zheleva, [2025](https://arxiv.org/html/2605.24358#bib.bib175)) model networked interference by using GCN or mean aggregation function; HyperSCI (Ma et al., [2022b](https://arxiv.org/html/2605.24358#bib.bib53)) and HINITE (Lin et al., [2023](https://arxiv.org/html/2605.24358#bib.bib91)) model networked interference by using GAT. We describe the details of each baseline in Appendix [K](https://arxiv.org/html/2605.24358#A11).
Metrics. Following Ma and Tresp ([2021](https://arxiv.org/html/2605.24358#bib.bib12)) and Lin et al. ([2025](https://arxiv.org/html/2605.24358#bib.bib132)), we consider two widely used metrics $\sqrt{\epsilon_{\rm MSE}}$ and $\sqrt{\epsilon_{\rm PEHE}}$ for all datasets. $\sqrt{\epsilon_{\rm MSE}}$ quantifies the performance in outcome prediction, while $\sqrt{\epsilon_{\rm PEHE}}$ quantifies the performance in ITE estimation. They are defined as follows:
$$
\sqrt{\epsilon_{\rm MSE}}=\sqrt{\frac{1}{n_{\rm te}}\sum_{i=1}^{n_{\rm te}}(\hat{y}_i-y_i)^2},\quad\sqrt{\epsilon_{\rm PEHE}}=\sqrt{\frac{1}{n_{\rm te}}\sum_{i=1}^{n_{\rm te}}\left(\hat{\tau}_i-\tau_i\right)^2},
$$

where $n_{\rm te}$ is the size of the test set. We randomly partitioned all datasets into training/validation/test splits with a ratio of 70%/15%/15% and averaged results over ten repeated executions. We defer implementation and hyperparameter details to Appendix [M](https://arxiv.org/html/2605.24358#A13).
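Both metrics are root-mean-square errors, one on outcomes and one on treatment effects, and can be computed as in the sketch below. The function names and toy arrays are hypothetical. Note that $\sqrt{\epsilon_{\rm PEHE}}$ requires the true ITE, which is available here only because the datasets provide or simulate it.

```python
import numpy as np

def sqrt_mse(y_hat, y):
    """Root mean squared error between predicted and factual outcomes."""
    y_hat, y = np.asarray(y_hat, dtype=float), np.asarray(y, dtype=float)
    return float(np.sqrt(np.mean((y_hat - y) ** 2)))

def sqrt_pehe(tau_hat, tau):
    """Root of the PEHE: RMSE between estimated and true ITE."""
    tau_hat, tau = np.asarray(tau_hat, dtype=float), np.asarray(tau, dtype=float)
    return float(np.sqrt(np.mean((tau_hat - tau) ** 2)))

# Toy test-set values (assumptions for illustration only).
rmse = sqrt_mse(y_hat=[1.0, 1.0, 3.0, 3.0], y=[0.0, 2.0, 2.0, 4.0])
pehe = sqrt_pehe(tau_hat=[1.0, 1.0], tau=[1.0, 3.0])
```

A model can score well on the first metric while scoring poorly on the second, since the first only evaluates factual predictions.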
### 5.2. Performance evaluation experiments (RQ 1)
As shown in Table [1](https://arxiv.org/html/2605.24358#S4.T1), we conducted experiments to answer RQ 1. Overall, GITE and GITE$_v$ outperform all baseline methods, which shows their effectiveness. Specifically, GAT-based methods (e.g., HyperSCI, HINITE) generally outperform GCN/GNN-based methods (GCN-HSIC, SAGE-HSIC, and SITE) in ITE estimation under DNE, as they partially account for the differing importance of neighbors, but they still ignore variations in neighbor size. GITE outperforms GAT-based methods by using two partial attention mechanisms and a message amplifier to properly model DNE. Furthermore, we can observe that the improvements achieved by the proposed methods in outcome prediction and ITE estimation on the Flickr and Blog datasets are more significant than those on the AMZ-N dataset. We attribute this to the fact that the Flickr and Blog datasets contain far more edges than AMZ-N, which introduces more severe DNE, whereas AMZ-N is a sparse graph that contains 14,538 nodes with only 15,011 directed edges. On the Flickr and Blog datasets, the baseline methods may generate inappropriate interference representations because they cannot properly capture DNE, which degrades the performance in both outcome prediction and ITE estimation. This reveals that properly capturing DNE is important to ITE estimation from graph data. On the AMZ-N dataset, GCN, mean aggregation, and GAT performed well in generating interference representations, as the network is relatively sparse and they suffer less from DNE. It is noteworthy that even on such a sparse graph, GITE achieves approximately 2.6% improvement in outcome prediction performance and 5.5% improvement in ITE estimation performance.
Figure 5. Results (mean and standard errors) of sensitivity experiments for hyperparameters $\beta$ and $\lambda$, averaged over ten executions. Panels (a)-(l): for each dataset (AMZ-N, Flickr, Blog), $\sqrt{\epsilon_{\mathrm{MSE}}}$ and $\sqrt{\epsilon_{\mathrm{PEHE}}}$ as functions of $\beta$ and of $\lambda$.
### 5.3. Ablation experiments (RQ 2)
To answer RQ 2, we conducted ablation experiments using the following variants of the original GITE:

- GITE$_{\mathrm{NR}}$ removes the L2 regularization by setting $\lambda=0$.
- GITE$_{\mathrm{NB}}$ removes the representation balancing module by setting $\beta=0$.
- GITE$_{\mathrm{NS}}$ removes the message amplifier and the importance estimated by SPAtt.
- GITE$_{\mathrm{NM}}$ removes the message amplifier module.
- GITE$_{\mathrm{NATT}}$ removes attention but keeps the message amplifier.
- GITE$_{\mathrm{NA}}$ removes the importance estimated by both SPAtt and IPAtt, i.e., it uses the sum operation for networked interference modeling.
- GITE$_{\mathrm{NP}}$ removes the proxy module but still balances representations jointly.
- GITE$_{\mathrm{BS}}$ balances representations separately.
Results of the ablation experiments are shown in Table [2](https://arxiv.org/html/2605.24358#S5.T2). Overall, the results show that each component contributes to the performance of the proposed methods. In particular, performance declines significantly on the Flickr and Blog datasets when the importance estimated by both SPAtt and IPAtt is removed. This is because GITE$_{\mathrm{NA}}$ uses the sum operation for aggregation, which causes numerical explosion for individuals with many neighbors and connections and makes the model difficult to train. The issue is less serious on the AMZ-N dataset, which is very sparse.
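The numerical explosion of sum aggregation can be illustrated with a toy comparison. The sketch below is not the paper's SPAtt, IPAtt, or message amplifier; it contrasts a plain sum with a generic softmax-weighted aggregation, and the synthetic messages, scores, and function names are assumptions for illustration only.

```python
import math
import random


def sum_aggregate(messages):
    """Plain sum over neighbor messages; magnitude grows with the degree."""
    dim = len(messages[0])
    return [sum(m[d] for m in messages) for d in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_aggregate(messages, scores):
    """Convex combination of messages with softmax-normalized weights."""
    weights = softmax(scores)
    dim = len(messages[0])
    return [sum(w * m[d] for w, m in zip(weights, messages)) for d in range(dim)]


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


rng = random.Random(0)
for degree in (1, 10, 100, 1000):
    # Synthetic positive messages of dimension 4 (illustrative only).
    messages = [[1.0 + 0.1 * rng.random() for _ in range(4)] for _ in range(degree)]
    scores = [rng.random() for _ in range(degree)]
    print(degree,
          round(l2_norm(sum_aggregate(messages)), 2),
          round(l2_norm(attention_aggregate(messages, scores)), 2))
```

In this toy setting the norm of the summed representation scales linearly with the number of neighbors, whereas the normalized aggregation stays within the range of a single message. Normalization alone, however, discards the neighbor scale entirely, which is the gap the message amplifier is described as addressing.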
### 5.4. Sensitivity experiments (RQ 3)
To answer RQ 3, we tested GITE and GITE$_{v}$ with values of $\beta$ and $\lambda$ drawn from the set $\{0.001, 0.01, 0.1, 0.2, 0.5, 1.0\}$. Results are shown in Figure [5](https://arxiv.org/html/2605.24358#S5.F5). We observe no significant changes in performance across values of $\beta$ on the AMZ-N and Blog datasets. However, performance degrades when $\beta>0.2$ on the Flickr dataset, so we recommend searching for $\beta$ in the range $(0, 0.2]$. Furthermore, a large value of $\lambda$ ($>0.2$) can cause significant performance degradation on the AMZ-N and Flickr datasets, because a large $\lambda$ (strong L2 regularization) hinders the models from updating their weights. Thus, we also recommend searching for $\lambda$ in the range $(0, 0.2]$. More sensitivity experiments are detailed in Appendix [N.1](https://arxiv.org/html/2605.24358#A14.SS1).
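The recommended search ranges can be turned into a small grid search, sketched below. The helper names and the `toy_validation_error` function are hypothetical stand-ins; in practice the evaluation callback would train the model with the given $\beta$ and $\lambda$ and return a validation error.

```python
from itertools import product

# Values tested in the sensitivity experiments.
SEARCH_VALUES = [0.001, 0.01, 0.1, 0.2, 0.5, 1.0]


def recommended(values, upper=0.2):
    """Keep only values inside the recommended range (0, upper]."""
    return [v for v in values if 0 < v <= upper]


def grid_search(evaluate, betas, lambdas):
    """Return (best_error, beta, lam) minimizing evaluate(beta, lam)."""
    best = None
    for beta, lam in product(betas, lambdas):
        error = evaluate(beta, lam)
        if best is None or error < best[0]:
            best = (error, beta, lam)
    return best


def toy_validation_error(beta, lam):
    """Hypothetical stand-in for training a model and scoring it."""
    return (beta - 0.1) ** 2 + (lam - 0.01) ** 2


if __name__ == "__main__":
    candidates = recommended(SEARCH_VALUES)
    print(grid_search(toy_validation_error, candidates, candidates))
```

Restricting both hyperparameters to $(0, 0.2]$ reduces the grid from 36 to 16 combinations for the value set used here.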
### 5.5. Additional experiments
Additional experiments addressing RQs 4, 5, and 6 are detailed in Appendices [N.2](https://arxiv.org/html/2605.24358#A14.SS2), [N.3](https://arxiv.org/html/2605.24358#A14.SS3), and [N.4](https://arxiv.org/html/2605.24358#A14.SS4).
## 6. Conclusion
In this work, we studied an important issue, DNE, which remains a challenge for previous approaches. We proposed a novel method to address this issue and conducted experiments that demonstrate its effectiveness in ITE estimation from graph data with DNE, which underscores the importance of capturing DNE. We also introduced a representation balancing strategy and theoretically analyzed an error bound based on this strategy in Appendix [E](https://arxiv.org/html/2605.24358#A5). Future research directions are discussed in Appendix [P](https://arxiv.org/html/2605.24358#A16).
## Acknowledgments
This work was supported by JSPS KAKENHI Grant-in-Aid for Scientific Research (B), Grant Number 26K02984, and by JST SPRING, Grant Number JPMJSP2110.
## References
- S. Adhikari and E. Zheleva (2025) Inferring individual direct causal effects under heterogeneous peer influence. Machine Learning 114(4), pp. 113.
- P. M. Aronow and C. Samii (2017) Estimating average causal effects under general interference, with application to a social network experiment. The Annals of Applied Statistics 11, pp. 1912–1947.
- U. Awan, M. Morucci, V. Orlandi, S. Roy, C. Rudin, and A. Volfovsky (2020) Almost-matching-exactly for treatment effect estimation under network interference. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, pp. 3252–3262.
- J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450.
- G. Basse and A. Feller (2018) Analyzing two-stage experiments in the presence of interference. Journal of the American Statistical Association 113(521), pp. 41–55.
- R. Cai, Z. Yang, W. Chen, Y. Yan, and Z. Hao (2023) Generalization bound for estimating causal effects from observational network data. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pp. 163–172.
- S. Chang, D. Vrabac, J. Leskovec, and J. Ugander (2023) Estimating geographic spillover effects of covid-19 policies from large-scale mobility networks. In Proceedings of the 37th AAAI Conference on Artificial Intelligence, Vol. 37, pp. 14161–14169.
- W. Chen, R. Cai, Z. Yang, J. Qiao, Y. Yan, Z. Li, and Z. Hao (2024) Doubly robust causal effect estimation under networked interference via targeted learning. In Proceedings of the 41st International Conference on Machine Learning.
- J. Cheng, Z. Shang, H. Cheng, H. Wang, and J. X. Yu (2012) K-reach: who is in your small world. arXiv preprint arXiv:1208.0090.
- Z. Chu, S. L. Rathbun, and S. Li (2021) Graph infomax adversarial learning for treatment effect estimation with networked observational data. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 176–184.
- G. Corso, L. Cavalleri, D. Beaini, P. Liò, and P. Veličković (2020) Principal neighbourhood aggregation for graph nets. Advances in Neural Information Processing Systems 33, pp. 13260–13271.
- N. Courty, R. Flamary, A. Habrard, and A. Rakotomamonjy (2017) Joint distribution optimal transportation for domain adaptation. Advances in Neural Information Processing Systems 30.
- Z. Cui, X. Tang, Y. Qiao, B. He, L. Chen, X. He, and C. Ma (2024) Treatment-aware hyperbolic representation learning for causal effect estimation with social networks. In Proceedings of the 2024 SIAM International Conference on Data Mining, pp. 289–297.
- M. Cuturi (2013) Sinkhorn distances: lightspeed computation of optimal transport. Advances in Neural Information Processing Systems 26.
- L. Forastiere, E. M. Airoldi, and F. Mealli (2021) Identification and estimation of treatment and interference effects in observational studies on networks. Journal of the American Statistical Association 116(534), pp. 901–918.
- L. Forastiere, F. Mealli, A. Wu, and E. M. Airoldi (2022) Estimating causal effects under network interference with bayesian generalized propensity scores. Journal of Machine Learning Research 23(289), pp. 1–61.
- D. Frauen and S. Feuerriegel (2022) Estimating individual treatment effects under unobserved confounding using binary instruments. arXiv preprint arXiv:2208.08544.
- D. Frauen, K. Hess, and S. Feuerriegel (2024) Model-agnostic meta-learners for estimating heterogeneous treatment effects over time. arXiv preprint arXiv:2407.05287.
- A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf (2005) Measuring statistical dependence with Hilbert-Schmidt norms. In Proceedings of the 16th International Conference on Algorithmic Learning Theory, pp. 63–77.
- R. Guo, J. Li, Y. Li, K. S. Candan, A. Raglin, and H. Liu (2021) Ignite: a minimax game toward learning individual treatment effects from networked observational data. In Proceedings of the 29th International Conference on International Joint Conferences on Artificial Intelligence, pp. 4534–4540.
- R. Guo, J. Li, and H. Liu (2020) Learning individual causal effects from networked observational data. In Proceedings of the 13th International Conference on Web Search and Data Mining, pp. 232–240.
- X. Guo, Y. Zhang, J. Wang, and M. Long (2023) Estimating heterogeneous treatment effects: mutual information bounds and learning algorithms. In Proceedings of the 40th International Conference on Machine Learning, pp. 12108–12121.
- W. L. Hamilton, R. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. Advances in Neural Information Processing Systems 30.
- S. Harada and H. Kashima (2021) Graphite: estimating individual effects of graph-structured treatments. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp. 659–668.
- B. He, Y. Weng, X. Tang, Z. Cui, Z. Sun, L. Chen, X. He, and C. Ma (2024) Rankability-enhanced revenue uplift modeling framework for online marketing. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 5093–5104.
- R. He and J. McAuley (2016) Ups and downs: modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 2016 World Wide Web Conference, pp. 507–517.
- Q. Huang, J. Ma, J. Li, R. Guo, H. Sun, and Y. Chang (2023) Modeling interference for individual treatment effect estimation from networked observational data. ACM Transactions on Knowledge Discovery from Data 18(3), pp. 1–21.
- M. G. Hudgens and M. E. Halloran (2008) Toward causal inference with interference. Journal of the American Statistical Association 103(482), pp. 832–842.
- A. Jesson, S. Mindermann, Y. Gal, and U. Shalit (2021a) Quantifying ignorance in individual-level causal-effect estimates under hidden confounding. In Proceedings of the 38th International Conference on Machine Learning, pp. 4829–4838.
- A. Jesson, P. Tigas, J. van Amersfoort, A. Kirsch, U. Shalit, and Y. Gal (2021b) Causal-bald: deep bayesian active learning of outcomes to infer treatment-effects from observational data. Advances in Neural Information Processing Systems 34, pp. 30465–30478.
- S. Jiang, Z. Huang, X. Luo, and Y. Sun (2023) CF-GODE: continuous-time causal inference for multi-agent dynamical systems. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 997–1009.
- S. Jiang and Y. Sun (2022) Estimating causal effects on networked observational data via representation learning. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pp. 852–861.
- F. D. Johansson, U. Shalit, N. Kallus, and D. Sontag (2022) Generalization bounds and representation learning for estimation of potential outcomes and causal effects. Journal of Machine Learning Research 23(166), pp. 1–50.
- F. Johansson, U. Shalit, and D. Sontag (2016) Learning representations for counterfactual inference. In Proceedings of the 33rd International Conference on Machine Learning, Vol. 48, pp. 3020–3029.
- J. Kaddour, Y. Zhu, Q. Liu, M. J. Kusner, and R. Silva (2021) Causal effect inference for structured treatments. Advances in Neural Information Processing Systems 34, pp. 24841–24854.
- L. V. Kantorovich (2006) On the translocation of masses. Journal of Mathematical Sciences 133(4).
- D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations.
- K. Kuang, L. Li, Z. Geng, L. Xu, K. Zhang, B. Liao, H. Huang, P. Ding, W. Miao, and Z. Jiang (2020) Causal inference. Engineering 6(3), pp. 253–263.
- Q. Le and T. Mikolov (2014) Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning, pp. 1188–1196.
- H. Li, Y. Lyu, C. Zheng, and P. Wu (2023a) TDR-CL: targeted doubly robust collaborative learning for debiased recommendations. In Proceedings of the 11th International Conference on Learning Representations.
- H. Li, C. Zheng, and P. Wu (2023b) StableDR: stabilized doubly robust learning for recommendation on data missing not at random. In Proceedings of the 11th International Conference on Learning Representations.
- J. Li, R. Guo, C. Liu, and H. Liu (2019) Adaptive unsupervised feature selection on attributed networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 92–100.
- J. Li, X. Hu, J. Tang, and H. Liu (2015) Unsupervised streaming feature selection in social media. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 1041–1050.
- X. Lin, H. Bao, Y. Cui, K. Takeuchi, and H. Kashima (2025) Scalable individual treatment effect estimator for large graphs. Machine Learning 114(1), pp. 1–19.
- X. Lin, G. Zhang, X. Lu, H. Bao, K. Takeuchi, and H. Kashima (2023) Estimating treatment effects under heterogeneous interference. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 576–592.
- C. Liu, Y. Zhan, J. Wu, C. Li, B. Du, W. Hu, T. Liu, and D. Tao (2022) Graph pooling for graph neural networks: progress, challenges, and opportunities. arXiv preprint arXiv:2204.07321.
- L. Liu and M. G. Hudgens (2014) Large sample randomization inference of causal effects in the presence of interference. Journal of the American Statistical Association 109(505), pp. 288–301.
- M. Liu, H. Yu, and S. Ji (2024) Empowering GNNs via edge-aware weisfeiler-leman algorithm. Transactions on Machine Learning Research. ISSN 2835-8856.
- J. Ma, Y. Dong, Z. Huang, D. Mietchen, and J. Li (2022a) Assessing the causal impact of covid-19 related policies on outbreak dynamics: a case study in the us. In Proceedings of the 2022 Web Conference, pp. 2678–2686.
- J. Ma, R. Guo, C. Chen, A. Zhang, and J. Li (2021) Deconfounding with networked observational data in a dynamic environment. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pp. 166–174.
- J. Ma, M. Wan, L. Yang, J. Li, B. Hecht, and J. Teevan (2022b) Learning causal effects on hypergraphs. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 1202–1212.
- L. Ma, C. Lin, D. Lim, A. Romero-Soriano, P. K. Dokania, M. Coates, P. Torr, and S. Lim (2023) Graph inductive biases in transformers without message passing. In Proceedings of the 40th International Conference on Machine Learning, pp. 23321–23337.
- Y. Ma and V. Tresp (2021) Causal inference under networked interference and intervention policy enhancement. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics, Vol. 130, pp. 3700–3708.
- A. L. Maas, A. Y. Hannun, A. Y. Ng, et al. (2013) Rectifier nonlinearities improve neural network acoustic models. In Proceedings of the 30th International Conference on Machine Learning, Vol. 30, pp. 3.
- V. Melnychuk, D. Frauen, and S. Feuerriegel (2023) Bounds on representation-induced confounding bias for treatment effect estimation. arXiv preprint arXiv:2311.11321.
- R. Nabi, J. Pfeiffer, D. Charles, and E. Kıcıman (2022) Causal inference in the presence of interference in sponsored search advertising. Frontiers in Big Data 5.
- M. Oprescu, J. Dorn, M. Ghoummaid, A. Jesson, N. Kallus, and U. Shalit (2023) B-learner: quasi-oracle bounds on heterogeneous causal effects under hidden confounding. In Proceedings of the 40th International Conference on Machine Learning, pp. 26599–26618.
- T. Qin, T. Wang, and Z. Zhou (2021) Budgeted heterogeneous treatment effect estimation. In Proceedings of the 38th International Conference on Machine Learning, pp. 8693–8702.
- V. Rakesh, R. Guo, R. Moraffah, N. Agarwal, and H. Liu (2018) Linked causal variational autoencoder for inferring paired spillover effects. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 1679–1682.
- P. R. Rosenbaum (2007) Interference between units in randomized experiments. Journal of the American Statistical Association 102(477), pp. 191–200.
- D\. B\. Rubin \(1980\)Randomization analysis of experimental data: the fisher randomization test comment\.Journal of the American Statistical Association75\(371\),pp\. 591–593\.Cited by:[Appendix H](https://arxiv.org/html/2605.24358#A8.p1.1)\.
- D\. B\. Rubin \(2005\)Causal inference using potential outcomes: design, modeling, decisions\.Journal of the American Statistical Association100\(469\),pp\. 322–331\.Cited by:[Appendix H](https://arxiv.org/html/2605.24358#A8.p1.1)\.
- M\. E\. Schnitzer \(2022\)Estimands and estimation of COVID\-19 vaccine effectiveness under the test\-negative design: connections to causal inference\.Epidemiology33\(3\),pp\. 325\.Cited by:[Appendix P](https://arxiv.org/html/2605.24358#A16.p1.1),[§1](https://arxiv.org/html/2605.24358#S1.p1.1)\.
- U\. Shalit, F\. D\. Johansson, and D\. Sontag \(2017\)Estimating individual treatment effect: generalization bounds and algorithms\.InProceedings of the 34th International Conference on Machine Learning,Vol\.70,pp\. 3076–3085\.Cited by:[Appendix K](https://arxiv.org/html/2605.24358#A11.p2.1),[Appendix E](https://arxiv.org/html/2605.24358#A5.p1.4),[Appendix E](https://arxiv.org/html/2605.24358#A5.p2.6),[Table 4](https://arxiv.org/html/2605.24358#A8.T4.1.2.2.1),[Table 4](https://arxiv.org/html/2605.24358#A8.T4.1.4.4.1),[Appendix H](https://arxiv.org/html/2605.24358#A8.p1.1),[§1](https://arxiv.org/html/2605.24358#S1.p1.1),[§4\.2](https://arxiv.org/html/2605.24358#S4.SS2.p1.17),[§4\.3](https://arxiv.org/html/2605.24358#S4.SS3.p3.1),[Table 1](https://arxiv.org/html/2605.24358#S4.T1.9.11.4.1),[Table 1](https://arxiv.org/html/2605.24358#S4.T1.9.12.5.1),[Table 1](https://arxiv.org/html/2605.24358#S4.T1.9.9.2.1),[§5\.1](https://arxiv.org/html/2605.24358#S5.SS1.p2.1),[footnote 1](https://arxiv.org/html/2605.24358#footnote1)\.
- C\. Shi, D\. Blei, and V\. Veitch \(2019\)Adapting neural networks for the estimation of treatment effects\.Advances in Neural Information Processing Systems32\.Cited by:[Appendix H](https://arxiv.org/html/2605.24358#A8.p1.1)\.
- Y\. Sui, C\. Tang, Z\. Chu, J\. Fang, Y\. Gao, Q\. Cui, L\. Li, J\. Zhou, and X\. Wang \(2024\)Invariant graph learning for treatment effect estimation from networked observational data\.InProceedings of the 2024 Web Conference,Cited by:[§2](https://arxiv.org/html/2605.24358#S2.p1.1)\.
- E\. J\. T\. Tchetgen and T\. J\. VanderWeele \(2012\)On causal inference in the presence of interference\.Statistical Methods in Medical Research21\(1\),pp\. 55–75\.Cited by:[Appendix H](https://arxiv.org/html/2605.24358#A8.p2.1)\.
- A\. Thorat, R\. Kolla, N\. Pedanekar, and N\. Onoe \(2023\)Estimation of individual causal effects in network setup for multiple treatments\.arXiv preprint arXiv:2312\.11573\.Cited by:[Appendix H](https://arxiv.org/html/2605.24358#A8.p1.1)\.
- P\. Toulis and E\. Kao \(2013\)Estimation of causal peer influence effects\.InProceedings of the 30th International Conference on Machine Learning,pp\. 1489–1497\.Cited by:[Appendix H](https://arxiv.org/html/2605.24358#A8.p2.1)\.
- A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin \(2017\)Attention is all you need\.Advances in Neural Information Processing Systems30\.Cited by:[Appendix M](https://arxiv.org/html/2605.24358#A13.p1.11),[§N\.3](https://arxiv.org/html/2605.24358#A14.SS3.p1.1),[Appendix B](https://arxiv.org/html/2605.24358#A2.p1.1),[Appendix B](https://arxiv.org/html/2605.24358#A2.p3.1),[§1](https://arxiv.org/html/2605.24358#S1.p4.1),[§4\.1](https://arxiv.org/html/2605.24358#S4.SS1.p5.24)\.
- V\. Veitch, Y\. Wang, and D\. Blei \(2019\)Using embeddings to correct for unobserved confounding in networks\.Advances in Neural Information Processing Systems32\.Cited by:[Appendix H](https://arxiv.org/html/2605.24358#A8.p1.1)\.
- P\. Veličković, G\. Cucurull, A\. Casanova, A\. Romero, P\. Lio, and Y\. Bengio \(2017\)Graph attention networks\.arXiv preprint arXiv:1710\.10903\.Cited by:[Appendix K](https://arxiv.org/html/2605.24358#A11.p4.1),[Appendix K](https://arxiv.org/html/2605.24358#A11.p5.1),[Appendix M](https://arxiv.org/html/2605.24358#A13.p1.11),[§N\.3](https://arxiv.org/html/2605.24358#A14.SS3.p1.1),[Appendix B](https://arxiv.org/html/2605.24358#A2.p1.1),[Appendix B](https://arxiv.org/html/2605.24358#A2.p2.1),[Appendix C](https://arxiv.org/html/2605.24358#A3.p5.4),[Appendix C](https://arxiv.org/html/2605.24358#A3.p5.9),[Appendix H](https://arxiv.org/html/2605.24358#A8.p1.1),[§1](https://arxiv.org/html/2605.24358#S1.p3.1),[§1](https://arxiv.org/html/2605.24358#S1.p4.1),[§2](https://arxiv.org/html/2605.24358#S2.p1.1),[§4\.1](https://arxiv.org/html/2605.24358#S4.SS1.p5.24)\.
- C\. Villani et al\. \(2008\) Optimal transport: old and new\. Vol\. 338, Springer\. Cited by: [Appendix E](https://arxiv.org/html/2605.24358#A5.p3.2)\.
- D\. Viviano \(2019\)Policy targeting under network interference\.arXiv preprint arXiv:1906\.10258\.Cited by:[§2](https://arxiv.org/html/2605.24358#S2.p1.1)\.
- H\. Wang, J\. Fan, Z\. Chen, H\. Li, W\. Liu, T\. Liu, Q\. Dai, Y\. Wang, Z\. Dong, and R\. Tang \(2023\)Optimal transport for treatment effect estimation\.Advances in Neural Information Processing Systems36\.Cited by:[Appendix K](https://arxiv.org/html/2605.24358#A11.p2.1),[Appendix E](https://arxiv.org/html/2605.24358#A5.p1.4),[Appendix E](https://arxiv.org/html/2605.24358#A5.p3.2),[Appendix E](https://arxiv.org/html/2605.24358#A5.p9.5),[Appendix G](https://arxiv.org/html/2605.24358#A7.p1.1),[Appendix G](https://arxiv.org/html/2605.24358#A7.p2.6),[Table 4](https://arxiv.org/html/2605.24358#A8.T4.1.5.5.1),[Appendix H](https://arxiv.org/html/2605.24358#A8.p1.1),[§4\.2](https://arxiv.org/html/2605.24358#S4.SS2.p1.17),[§4\.2](https://arxiv.org/html/2605.24358#S4.SS2.p3.18),[Table 1](https://arxiv.org/html/2605.24358#S4.T1.9.13.6.1),[§5\.1](https://arxiv.org/html/2605.24358#S5.SS1.p2.1)\.
- H\. Wang, W\. Yang, L\. Yang, A\. Wu, L\. Xu, J\. Ren, F\. Wu, and K\. Kuang \(2022\)Estimating individualized causal effect with confounded instruments\.InProceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining,pp\. 1857–1867\.Cited by:[Appendix H](https://arxiv.org/html/2605.24358#A8.p1.1)\.
- X\. Wang, L\. Tang, H\. Liu, and L\. Wang \(2013\)Learning with multi\-resolution overlapping communities\.Knowledge and Information Systems36,pp\. 517–535\.Cited by:[Appendix I](https://arxiv.org/html/2605.24358#A9.p2.1),[Appendix I](https://arxiv.org/html/2605.24358#A9.p2.1.1),[§5\.1](https://arxiv.org/html/2605.24358#S5.SS1.p1.8)\.
- M\. Welling and T\. N\. Kipf \(2016\)Semi\-supervised classification with graph convolutional networks\.InProceedings of the 4th International Conference on Learning Representations,Cited by:[Appendix K](https://arxiv.org/html/2605.24358#A11.p3.1),[Appendix M](https://arxiv.org/html/2605.24358#A13.p2.3),[Appendix C](https://arxiv.org/html/2605.24358#A3.p5.2),[Appendix C](https://arxiv.org/html/2605.24358#A3.p5.4),[Appendix H](https://arxiv.org/html/2605.24358#A8.p1.1),[§1](https://arxiv.org/html/2605.24358#S1.p3.1),[§4\.1](https://arxiv.org/html/2605.24358#S4.SS1.p1.1)\.
- H\. Wen, T\. Chen, L\. K\. Chai, S\. Sadiq, J\. Gao, and H\. Yin \(2023a\)Variational counterfactual prediction under runtime domain corruption\.IEEE Transactions on Knowledge and Data Engineering36\(5\),pp\. 2271–2284\.Cited by:[Appendix H](https://arxiv.org/html/2605.24358#A8.p1.1)\.
- H\. Wen, T\. Chen, L\. K\. Chai, S\. Sadiq, K\. Zheng, and H\. Yin \(2023b\)To predict or to reject: causal effect estimation with uncertainty on networked data\.InProceedings of the 23rd IEEE International Conference on Data Mining,pp\. 1415–1420\.Cited by:[Appendix H](https://arxiv.org/html/2605.24358#A8.p1.1)\.
- H\. Wen, T\. Chen, M\. Gong, L\. K\. Chai, S\. Sadiq, and H\. Yin \(2025a\)Enhancing treatment effect estimation via active learning: a counterfactual covering perspective\.arXiv preprint arXiv:2505\.05242\.Cited by:[Appendix H](https://arxiv.org/html/2605.24358#A8.p1.1)\.
- H\. Wen, T\. Chen, G\. Ye, L\. K\. Chai, S\. Sadiq, and H\. Yin \(2025b\)Progressive generalization risk reduction for data\-efficient causal effect estimation\.InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V\.1,pp\. 1575–1586\.Cited by:[Appendix H](https://arxiv.org/html/2605.24358#A8.p1.1)\.
- A\. Wu, K\. Kuang, B\. Li, and F\. Wu \(2022a\)Instrumental variable regression with confounder balancing\.InProceedings of the 39th International Conference on Machine Learning,pp\. 24056–24075\.Cited by:[Appendix H](https://arxiv.org/html/2605.24358#A8.p1.1)\.
- A\. Wu, H\. Qiu, Z\. Chen, Z\. Li, R\. Xiong, F\. Wu, and K\. Zhang\. Causal graph transformer for treatment effect estimation under unknown interference\. In Proceedings of the 13th International Conference on Learning Representations, Cited by: [Appendix K](https://arxiv.org/html/2605.24358#A11.p4.1), [Table 4](https://arxiv.org/html/2605.24358#A8.T4.1.17.17.1), [§2](https://arxiv.org/html/2605.24358#S2.p1.1), [Table 1](https://arxiv.org/html/2605.24358#S4.T1.9.19.12.1), [§5\.1](https://arxiv.org/html/2605.24358#S5.SS1.p2.1)\.
- A\. Wu, J\. Yuan, K\. Kuang, B\. Li, R\. Wu, Q\. Zhu, Y\. Zhuang, and F\. Wu \(2022b\)Learning decomposed representations for treatment effect estimation\.IEEE Transactions on Knowledge and Data Engineering35\(5\),pp\. 4989–5001\.Cited by:[Appendix H](https://arxiv.org/html/2605.24358#A8.p1.1)\.
- K\. Xu, W\. Hu, J\. Leskovec, and S\. Jegelka \(2019\)How powerful are graph neural networks?\.InProceedings of the 7th International Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.24358#S1.p4.1),[§2](https://arxiv.org/html/2605.24358#S2.p2.1),[§4\.1](https://arxiv.org/html/2605.24358#S4.SS1.p1.1),[§4\.1](https://arxiv.org/html/2605.24358#S4.SS1.p5.9)\.
- L\. Yao, Z\. Chu, S\. Li, Y\. Li, J\. Gao, and A\. Zhang \(2021\)A survey on causal inference\.ACM Transactions on Knowledge Discovery from Data15\(5\),pp\. 1–46\.Cited by:[Appendix H](https://arxiv.org/html/2605.24358#A8.p1.1),[2nd item](https://arxiv.org/html/2605.24358#S3.I1.i2.p1.1),[3rd item](https://arxiv.org/html/2605.24358#S3.I1.i3.p1.5)\.
- L\. Yao, S\. Li, Y\. Li, M\. Huai, J\. Gao, and A\. Zhang \(2018\)Representation learning for treatment effect estimation from observational data\.Advances in Neural Information Processing Systems31\.Cited by:[Appendix H](https://arxiv.org/html/2605.24358#A8.p1.1)\.
- C\. Ying, T\. Cai, S\. Luo, S\. Zheng, G\. Ke, D\. He, Y\. Shen, and T\. Liu \(2021\)Do transformers really perform badly for graph representation?\.Advances in Neural Information Processing Systems34\.Cited by:[Appendix M](https://arxiv.org/html/2605.24358#A13.p1.11),[§N\.3](https://arxiv.org/html/2605.24358#A14.SS3.p1.1),[Appendix B](https://arxiv.org/html/2605.24358#A2.p1.1),[Appendix B](https://arxiv.org/html/2605.24358#A2.p3.1),[§1](https://arxiv.org/html/2605.24358#S1.p4.1),[§4\.1](https://arxiv.org/html/2605.24358#S4.SS1.p5.24)\.
- Z\. Zhao, Y\. Bai, R\. Xiong, Q\. Cao, C\. Ma, N\. Jiang, F\. Wu, and K\. Kuang \(2024\)Learning individual treatment effects under heterogeneous interference in networks\.ACM Transactions on Knowledge Discovery from Data18\(8\),pp\. 1–21\.Cited by:[Appendix K](https://arxiv.org/html/2605.24358#A11.p4.1),[Appendix C](https://arxiv.org/html/2605.24358#A3.p5.10),[Table 4](https://arxiv.org/html/2605.24358#A8.T4.1.15.15.1),[§1](https://arxiv.org/html/2605.24358#S1.p3.1),[§2](https://arxiv.org/html/2605.24358#S2.p1.1),[Figure 3](https://arxiv.org/html/2605.24358#S3.F3),[Table 1](https://arxiv.org/html/2605.24358#S4.T1.9.18.11.1),[§5\.1](https://arxiv.org/html/2605.24358#S5.SS1.p2.1)\.
## Appendix A Notation table
We provide a notation table in Table[3](https://arxiv.org/html/2605.24358#A1.T3)\.
Table 3\.Notation Table\.
## Appendix B Attention mechanisms for IPAtt and SPAtt
Several attention mechanisms can be used to implement the proposed IPAtt and SPAtt mechanisms. We consider two widely used ones: GAT (Veličković et al., [2017](https://arxiv.org/html/2605.24358#bib.bib119)) and the Transformer attention mechanism based on query and key vectors, abbreviated as QK-based AT (Vaswani et al., [2017](https://arxiv.org/html/2605.24358#bib.bib5); Ying et al., [2021](https://arxiv.org/html/2605.24358#bib.bib151)). Let $a(\boldsymbol{p}_{i},\boldsymbol{p}_{k})$ denote the mechanism that estimates the importance between two individuals based on the inputs.
The implementation of $a(\boldsymbol{p}_{i},\boldsymbol{p}_{k})$ for the IPAtt and SPAtt mechanisms with GAT (Veličković et al., [2017](https://arxiv.org/html/2605.24358#bib.bib119)) is as follows:

$$a\left(\boldsymbol{p}_{i},\boldsymbol{p}_{k}\right)=\mathrm{LeakyReLU}\left(\left(\boldsymbol{w}^{(l)}\right)^{\top}\left[\boldsymbol{W}^{(l)}\boldsymbol{p}_{i}\,\|\,\boldsymbol{W}^{(l)}\boldsymbol{p}_{k}\right]\right), \tag{11}$$

where $\mathrm{LeakyReLU}$ denotes the LeakyReLU activation function (Maas et al., [2013](https://arxiv.org/html/2605.24358#bib.bib108)), $\boldsymbol{w}^{(l)}$ denotes a learnable parameter vector, and $\boldsymbol{W}^{(l)}$ denotes a learnable parameter matrix.
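As a minimal sketch of Eq. (11), the snippet below computes the unnormalized GAT-style score for a single pair of individuals in plain Python. The helper names (`matvec`, `leaky_relu`, `gat_score`), the toy parameter values, and the negative slope of 0.2 are illustrative assumptions and are not taken from the paper.

```python
def matvec(W, p):
    """Multiply a matrix W (given as a list of rows) by a vector p."""
    return [sum(w_rc * p_c for w_rc, p_c in zip(row, p)) for row in W]


def leaky_relu(x, negative_slope=0.2):
    """LeakyReLU; the slope 0.2 is an assumed value, not one from the paper."""
    return x if x > 0 else negative_slope * x


def gat_score(w, W, p_i, p_k):
    """Unnormalized score a(p_i, p_k) = LeakyReLU(w^T [W p_i || W p_k])."""
    # List concatenation plays the role of the || operator.
    concat = matvec(W, p_i) + matvec(W, p_k)
    return leaky_relu(sum(w_c * z_c for w_c, z_c in zip(w, concat)))


# Toy parameters (hypothetical): W is 2x2, so w must have length 4.
W = [[1.0, 0.0], [0.0, 1.0]]
w = [1.0, 0.0, 0.0, -1.0]
score = gat_score(w, W, p_i=[1.0, 2.0], p_k=[3.0, 4.0])
```

Note that the score is not symmetric in its arguments, because $\boldsymbol{w}^{(l)}$ weights the two halves of the concatenation differently.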
The implementation of $a(\boldsymbol{p}_{i},\boldsymbol{p}_{k})$ for the IPAtt and SPAtt mechanisms with the QK-based AT (Vaswani et al., [2017](https://arxiv.org/html/2605.24358#bib.bib5); Ying et al., [2021](https://arxiv.org/html/2605.24358#bib.bib151)) is as follows:

$$a\left(\boldsymbol{p}_{i},\boldsymbol{p}_{k}\right)=\frac{\boldsymbol{p}_{\text{Q}_{i}}\cdot\left(\boldsymbol{p}_{\text{K}_{k}}\right)^{\top}}{\sqrt{c_{\text{K}}}},\quad\boldsymbol{p}_{\text{Q}_{i}}=\boldsymbol{W}^{(l)}_{\text{Q}}\boldsymbol{p}_{i},\quad\boldsymbol{p}_{\text{K}_{k}}=\boldsymbol{W}^{(l)}_{\text{K}}\boldsymbol{p}_{k}, \tag{12}$$

where $\boldsymbol{p}_{\text{Q}}$ denotes a query vector, $\boldsymbol{p}_{\text{K}}$ denotes a key vector, $c_{\text{K}}$ denotes the dimension of the key vector, $\boldsymbol{W}^{(l)}_{\text{Q}}$ denotes a learnable parameter matrix for the query vector, and $\boldsymbol{W}^{(l)}_{\text{K}}$ denotes a learnable parameter matrix for the key vector.
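The scaled dot-product score of Eq. (12) can be sketched in the same way. The function name `qk_score` and the toy matrices below are hypothetical. The code is only meant to show how the query, the key, and the $\sqrt{c_{\text{K}}}$ scaling fit together.

```python
import math


def matvec(W, p):
    """Multiply a matrix W (given as a list of rows) by a vector p."""
    return [sum(w_rc * p_c for w_rc, p_c in zip(row, p)) for row in W]


def qk_score(W_q, W_k, p_i, p_k):
    """Score a(p_i, p_k) = (W_q p_i) . (W_k p_k) / sqrt(c_K)."""
    query = matvec(W_q, p_i)
    key = matvec(W_k, p_k)
    c_k = len(key)  # dimension of the key vector
    return sum(q * k for q, k in zip(query, key)) / math.sqrt(c_k)


# Toy parameters (hypothetical values).
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[0.0, 1.0], [1.0, 0.0]]  # swaps the two coordinates of p_k
score = qk_score(W_q, W_k, p_i=[1.0, 2.0], p_k=[3.0, 4.0])
```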
For our experiments in Section[5](https://arxiv.org/html/2605.24358#S5), we used the implementation with GAT by default for both IPAtt and SPAtt mechanisms\. Additional experiments using implementation with QK\-based AT are detailed in Appendix[N\.3](https://arxiv.org/html/2605.24358#A14.SS3)\.
## Appendix C Proof that existing interference modeling methods cannot fully capture DNE
In this section, we prove that existing interference modeling methods cannot capture DNE for some local networks. DNE consists of two sub-issues: (I) the importance of different neighbors in contributing to interference varies (Huang et al., [2023](https://arxiv.org/html/2605.24358#bib.bib102); Lin et al., [2023](https://arxiv.org/html/2605.24358#bib.bib91); Ma et al., [2022b](https://arxiv.org/html/2605.24358#bib.bib53)), and (II) the scale of neighbors varies, leading to different levels of interference (see Figure [1](https://arxiv.org/html/2605.24358#S1.F1)).
Let $\mathbb{N}_{i}$ denote the set of neighbors of individual $i$, $d_{i}$ denote the degree of individual $i$, and $\boldsymbol{p}\in\mathbb{R}^{c_{p}}$ denote the interference-related information of an individual, which serves as the input of a mean aggregation, GCN, or GAT layer. Here, $\boldsymbol{p}$ is typically initialized as individual information, such as the covariates and treatment of the individual, and $c_{p}$ depends on the specific initialization strategy of $\boldsymbol{p}$. Let $\boldsymbol{p}^{\prime}\in\mathbb{R}^{c_{p^{\prime}}}$ denote the interference representation generated by such a layer for an individual, and let $\tilde{\mathbb{N}}_{i}=\mathbb{N}_{i}\cup\{i\}$ denote the set of neighbors of individual $i$ with the self-loop.
###### Definition C\.1\.
Given an individual $i$ and the set of neighbors $\mathbb{N}_{i}$, let $a(i,k)$ be a learnable importance estimation mechanism that can assign a non-negative weight to each neighbor $k\in\mathbb{N}_{i}$. This weight adaptively captures the importance of neighbor $k$ in contributing to the interference received by individual $i$, as determined by their interference-related and structural information.
###### Definition C\.2\.
Given an individual $i$ and the set of related individuals $\mathbb{G}_{i}$, let $\delta_{i}=f_{\mathbb{G}}(\boldsymbol{x}_{\mathbb{G}_{i}},t_{\mathbb{G}_{i}})$ be the effect of interference received by individual $i$ with DNE. Here, $f_{\mathbb{G}}(\boldsymbol{x}_{\mathbb{G}_{i}},t_{\mathbb{G}_{i}})=g_{\mathbb{X}}(\boldsymbol{x}_{\mathbb{G}_{i}})+g_{\mathbb{T}}(t_{\mathbb{G}_{i}})$, where $g_{\mathbb{X}}(\boldsymbol{x}_{\mathbb{G}_{i}})$ and $g_{\mathbb{T}}(t_{\mathbb{G}_{i}})$ are two aggregation functions.
Here, we consider an example of the aggregation functions $g_{\mathbb{X}}(\boldsymbol{x}_{\mathbb{G}_{i}})$ and $g_{\mathbb{T}}(t_{\mathbb{G}_{i}})$. For simplicity, we assume that $\mathbb{G}_{i}$ contains the two-hop neighbors of each individual. In this case,

$$g_{\mathbb{X}}(\boldsymbol{x}_{\mathbb{G}_{i}})=s(\mathbb{N}_{i})\cdot\sum_{j\in\mathbb{N}_{i}}\alpha_{ij}\cdot s(\mathbb{N}_{j})\cdot\sum_{k\in\mathbb{N}_{j}}\alpha_{jk}\cdot(\boldsymbol{w}_{\mathbb{X}})^{\top}\cdot\boldsymbol{x}_{k}$$

and

$$g_{\mathbb{T}}(t_{\mathbb{G}_{i}})=s(\mathbb{N}_{i})\cdot\sum_{j\in\mathbb{N}_{i}}\alpha_{ij}\cdot s(\mathbb{N}_{j})\cdot\sum_{k\in\mathbb{N}_{j}}\alpha_{jk}\cdot w_{\mathbb{T}}\cdot t_{k},$$

where $s(\cdot)$ denotes a mechanism that identifies the neighbor scale of an individual, $\alpha_{ij}$ denotes the importance of interference between individuals $i$ and $j$, and $\boldsymbol{w}_{\mathbb{X}}$ and $w_{\mathbb{T}}$ denote weight parameters, which can be learned from data.
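To make the two-hop example concrete, the following sketch evaluates $g_{\mathbb{T}}$ on a three-node path graph. Two choices here are assumptions made purely for illustration and are not specified in the text: the scale mechanism is taken to be $s(\mathbb{N})=|\mathbb{N}|$, and the importance weights are uniform, $\alpha_{jk}=1/|\mathbb{N}_{j}|$. The graph, the treatments, and the value of `w_t` are likewise hypothetical.

```python
def g_t(i, neighbors, treatment, alpha, scale, w_t):
    """Two-hop treatment aggregation:
    s(N_i) * sum_j alpha[i][j] * s(N_j) * sum_k alpha[j][k] * w_t * t_k."""
    total = 0.0
    for j in neighbors[i]:
        inner = sum(alpha[j][k] * w_t * treatment[k] for k in neighbors[j])
        total += alpha[i][j] * scale(neighbors[j]) * inner
    return scale(neighbors[i]) * total


# Hypothetical path graph 0 - 1 - 2 with binary treatments.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
treatment = {0: 1, 1: 0, 2: 1}

# Assumptions for illustration only: uniform importance and s(N) = |N|.
alpha = {j: {k: 1.0 / len(nb) for k in nb} for j, nb in neighbors.items()}
scale = len

effect_on_0 = g_t(0, neighbors, treatment, alpha, scale, w_t=0.5)
effect_on_1 = g_t(1, neighbors, treatment, alpha, scale, w_t=0.5)
```

Under these particular assumptions the uniform weights are exactly cancelled by the scale factors, so the result reduces to a plain sum over two-hop walks. Learned choices of $s(\cdot)$ and $\alpha$ would generally not cancel in this way.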
To address issue (I), a proper method should contain a mechanism as defined in Definition [C.1](https://arxiv.org/html/2605.24358#A3.Thmtheorem1). Such a mechanism can be learned in a data-driven manner, which enables the model to adaptively estimate the importance of different neighbors rather than assigning equal or fixed weights to all neighbors. Due to the existence of DNE, different neighbors of an individual contribute equally to the interference received by this individual only if they have both similar interference-related information and similar local network structures. To address issue (II), a proper method needs to generate distinct representations for individuals exposed to different local networks where the scales of neighbors differ. To this end, the method needs to include a mechanism that can identify the scales of neighbors in $g_{\mathbb{X}}(\boldsymbol{x}_{\mathbb{G}_{i}})$ and $g_{\mathbb{T}}(t_{\mathbb{G}_{i}})$. If an interference modeling method fails to address either issue (I) or issue (II) for individuals exposed to different local networks, it cannot capture the DNE for them. Specifically, this failure arises either when the method lacks a mechanism that assigns importance to neighbors by adaptively estimating their contributions to interference, which leaves issue (I) unaddressed, or when it generates identical representations for individuals exposed to local networks with different scales of neighbors, which leaves issue (II) unaddressed.
Existing methods for interference modeling apply mean aggregation (Cai et al., [2023](https://arxiv.org/html/2605.24358#bib.bib125); Chen et al., [2024](https://arxiv.org/html/2605.24358#bib.bib126); Forastiere et al., [2021](https://arxiv.org/html/2605.24358#bib.bib10), [2022](https://arxiv.org/html/2605.24358#bib.bib110); Jiang and Sun, [2022](https://arxiv.org/html/2605.24358#bib.bib103); Ma and Tresp, [2021](https://arxiv.org/html/2605.24358#bib.bib12)), GCN (Cai et al., [2023](https://arxiv.org/html/2605.24358#bib.bib125); Chen et al., [2024](https://arxiv.org/html/2605.24358#bib.bib126); Huang et al., [2023](https://arxiv.org/html/2605.24358#bib.bib102); Jiang and Sun, [2022](https://arxiv.org/html/2605.24358#bib.bib103); Lin et al., [2025](https://arxiv.org/html/2605.24358#bib.bib132); Ma and Tresp, [2021](https://arxiv.org/html/2605.24358#bib.bib12); Adhikari and Zheleva, [2025](https://arxiv.org/html/2605.24358#bib.bib175)), or GAT (Huang et al., [2023](https://arxiv.org/html/2605.24358#bib.bib102); Lin et al., [2023](https://arxiv.org/html/2605.24358#bib.bib91); Ma et al., [2022b](https://arxiv.org/html/2605.24358#bib.bib53); Zhao et al., [2024](https://arxiv.org/html/2605.24358#bib.bib176)) to generate interference representations. The mean aggregation can be defined as follows (Ma and Tresp, [2021](https://arxiv.org/html/2605.24358#bib.bib12)):
$$\boldsymbol{p}_{i}^{\prime}=\sigma\left(\sum_{k\in\tilde{\mathbb{N}}_{i}}\frac{1}{|\tilde{\mathbb{N}}_{i}|}\boldsymbol{W}\boldsymbol{p}_{k}\right), \tag{13}$$

where $\boldsymbol{W}$ denotes a learnable parameter matrix. Notably, $\boldsymbol{W}$ is not always applied for mean aggregation, as seen in methods for treatment aggregation (Cai et al., [2023](https://arxiv.org/html/2605.24358#bib.bib125); Jiang and Sun, [2022](https://arxiv.org/html/2605.24358#bib.bib103)). The individual-level aggregation of GCN can be defined as follows (Jiang and Sun, [2022](https://arxiv.org/html/2605.24358#bib.bib103); Welling and Kipf, [2016](https://arxiv.org/html/2605.24358#bib.bib17)):
$$\boldsymbol{p}_{i}^{\prime}=\sigma\left(\sum_{k\in\tilde{\mathbb{N}}_{i}}\frac{1}{\sqrt{\tilde{d}_{i}\tilde{d}_{k}}}\boldsymbol{W}\boldsymbol{p}_{k}\right), \tag{14}$$

where $\tilde{d}_{i}$ is the degree of individual $i$ with the self-loop. The individual-level aggregation of GCN is transformed from the aggregation mechanism of the original GCN (Welling and Kipf, [2016](https://arxiv.org/html/2605.24358#bib.bib17)). The individual-level aggregation of GAT can be defined as follows (Veličković et al., [2017](https://arxiv.org/html/2605.24358#bib.bib119)):
$$\boldsymbol{p}^{\prime}_{i}=\sigma\left(\sum_{k\in\tilde{\mathbb{N}}_{i}}\alpha_{ik}\cdot\boldsymbol{W}\boldsymbol{p}_{k}\right), \tag{15}$$

where $\alpha_{ik}$ is the estimated attention weight between individuals $i$ and $k$, computed by the graph attention mechanism (Veličković et al., [2017](https://arxiv.org/html/2605.24358#bib.bib119)). Here, $\alpha_{ik}$ is typically normalized by softmax, such that $\sum_{k\in\tilde{\mathbb{N}}_{i}}\alpha_{ik}=1$, as seen in previous methods (Huang et al., [2023](https://arxiv.org/html/2605.24358#bib.bib102); Lin et al., [2023](https://arxiv.org/html/2605.24358#bib.bib91); Ma et al., [2022b](https://arxiv.org/html/2605.24358#bib.bib53); Veličković et al., [2017](https://arxiv.org/html/2605.24358#bib.bib119)). These methods cannot capture DNE for some local networks, as stated in Proposition [C.3](https://arxiv.org/html/2605.24358#A3.Thmtheorem3).
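The three aggregation rules in Eqs. (13) to (15) differ only in how the per-neighbor weight is chosen. The sketch below makes that explicit on a toy star graph. It assumes scalar features, $\boldsymbol{W}=1$, and an identity $\sigma$ for brevity, and `toy_score` is a hypothetical stand-in for a learned attention score. None of these choices come from the paper.

```python
import math


def hood(i, nbrs):
    """Neighbors of i plus i itself (the self-loop)."""
    return nbrs[i] + [i]


def mean_weights(i, nbrs):
    """Eq. (13): every member of the self-looped neighborhood gets 1/|N~_i|."""
    members = hood(i, nbrs)
    return {k: 1.0 / len(members) for k in members}


def gcn_weights(i, nbrs):
    """Eq. (14): 1/sqrt(d~_i * d~_k), degrees counted with the self-loop."""
    d_i = len(hood(i, nbrs))
    return {k: 1.0 / math.sqrt(d_i * len(hood(k, nbrs))) for k in hood(i, nbrs)}


def gat_weights(i, nbrs, score):
    """Eq. (15): softmax over the scores a(i, k) in the self-looped neighborhood."""
    exps = {k: math.exp(score(i, k)) for k in hood(i, nbrs)}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def aggregate(weights, feats):
    """Weighted sum of scalar features (W = 1 and sigma = identity are assumed)."""
    return sum(wt * feats[k] for k, wt in weights.items())


# Hypothetical star graph: node 0 is connected to nodes 1, 2, and 3.
nbrs = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
feats = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}


def toy_score(i, k):
    """Stand-in for a learned attention score (an illustrative assumption)."""
    return -abs(feats[i] - feats[k])


mean_out = aggregate(mean_weights(0, nbrs), feats)
gcn_out = aggregate(gcn_weights(0, nbrs), feats)
gat_w = gat_weights(0, nbrs, toy_score)
gat_out = aggregate(gat_w, feats)
```

In this toy run the mean and GAT weights each sum to one, whereas the GCN weights are fixed by the degrees alone and need not sum to one.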
###### Proposition C\.3\.
Interference representation generated by the mean aggregation, GCN, or GAT cannot capture DNE for some local networks\.
We now provide a proof for Proposition[C\.3](https://arxiv.org/html/2605.24358#A3.Thmtheorem3)\.
###### Proof\.
To prove Proposition [C.3](https://arxiv.org/html/2605.24358#A3.Thmtheorem3), we need to provide examples in which the interference representation generated by the mean aggregation, GCN, or GAT cannot capture DNE. For simplicity, we consider local networks that consist of individuals and their 1-hop neighbors. In this case, the neighbor set of an individual with the self-loop, e.g., $\tilde{\mathbb{N}}_{i}$, constitutes the local network of that individual.
First, we prove that the interference representation generated by a mean aggregation cannot capture DNE\. The mean aggregation uses the same importance for different neighbors, so it is unable to address the issue \(I\) of DNE\. Furthermore, the mean aggregation cannot address the issue \(II\) of DNE for some local networks\. We can derive from the mean aggregation, i\.e\., equality \([13](https://arxiv.org/html/2605.24358#A3.E13)\) as follows:
$$\boldsymbol{p}_{i}^{\prime}=\sigma\left(\sum_{k\in\tilde{\mathbb{N}}_{i}}\frac{1}{|\tilde{\mathbb{N}}_{i}|}\boldsymbol{W}\boldsymbol{p}_{k}\right)=\sigma\left(\boldsymbol{W}\sum_{k\in\tilde{\mathbb{N}}_{i}}\frac{1}{|\tilde{\mathbb{N}}_{i}|}\boldsymbol{p}_{k}\right)=\sigma\left(\boldsymbol{W}\bar{\boldsymbol{p}}_{\tilde{\mathbb{N}}_{i}}\right), \tag{16}$$

where $\bar{\boldsymbol{p}}_{\tilde{\mathbb{N}}_{i}}$ represents the mean of the interference-related information over the neighbors $\tilde{\mathbb{N}}_{i}$ of individual $i$ with the self-loop. Based on equality ([16](https://arxiv.org/html/2605.24358#A3.E16)), we consider a case in which individuals $i$ and $j$ are exposed to two different local networks, where $\bar{\boldsymbol{p}}_{\tilde{\mathbb{N}}_{i}}=\bar{\boldsymbol{p}}_{\tilde{\mathbb{N}}_{j}}$ and $|\tilde{\mathbb{N}}_{i}|\neq|\tilde{\mathbb{N}}_{j}|$. In this case, the mean aggregation cannot generate distinct representations for individuals $i$ and $j$, even though they are exposed to different local networks with $|\tilde{\mathbb{N}}_{i}|\neq|\tilde{\mathbb{N}}_{j}|$, which can be shown as follows:
$$\boldsymbol{p}_{i}^{\prime}=\sigma\left(\boldsymbol{W}\bar{\boldsymbol{p}}_{\tilde{\mathbb{N}}_{i}}\right)=\sigma\left(\boldsymbol{W}\bar{\boldsymbol{p}}_{\tilde{\mathbb{N}}_{j}}\right)=\boldsymbol{p}_{j}^{\prime}. \tag{17}$$

This reveals that the mean aggregation cannot address issue (II) of DNE for this case, as it generates the same representation for individuals $i$ and $j$ although $|\tilde{\mathbb{N}}_{i}|\neq|\tilde{\mathbb{N}}_{j}|$. We can consider another case (similar to the examples in Figure [1](https://arxiv.org/html/2605.24358#S1.F1)) in which similar individuals $i$ and $j$ (i.e., $\boldsymbol{p}_{i}=\boldsymbol{p}_{j}$) are exposed to two different local networks where $\boldsymbol{p}_{i}=\boldsymbol{p}_{k_{1}},\forall k_{1}\in\tilde{\mathbb{N}}_{i}$, $\boldsymbol{p}_{j}=\boldsymbol{p}_{k_{2}},\forall k_{2}\in\tilde{\mathbb{N}}_{j}$, and $|\tilde{\mathbb{N}}_{i}|\neq|\tilde{\mathbb{N}}_{j}|$. In this case, $\bar{\boldsymbol{p}}_{\tilde{\mathbb{N}}_{i}}=\boldsymbol{p}_{i}$ and $\bar{\boldsymbol{p}}_{\tilde{\mathbb{N}}_{j}}=\boldsymbol{p}_{j}$ hold, and we then have:
$$\boldsymbol{p}_{i}^{\prime}=\sigma\left(\boldsymbol{W}\bar{\boldsymbol{p}}_{\tilde{\mathbb{N}}_{i}}\right)=\sigma\left(\boldsymbol{W}\boldsymbol{p}_{i}\right)=\sigma\left(\boldsymbol{W}\boldsymbol{p}_{j}\right)=\sigma\left(\boldsymbol{W}\bar{\boldsymbol{p}}_{\tilde{\mathbb{N}}_{j}}\right)=\boldsymbol{p}_{j}^{\prime}. \tag{18}$$

This indicates that the mean aggregation is also unable to address issue (II) of DNE for this case, as it generates the same representation for individuals $i$ and $j$. Therefore, the mean aggregation cannot address issue (II) of DNE for some local networks, as shown in equalities ([17](https://arxiv.org/html/2605.24358#A3.E17)) and ([18](https://arxiv.org/html/2605.24358#A3.E18)). As a result, the mean aggregation cannot capture DNE: it never addresses issue (I), and it fails to address issue (II) for some local networks.
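Both counterexamples for the mean aggregation can be checked numerically. The sketch below assumes $\boldsymbol{W}=\boldsymbol{I}$ and an identity $\sigma$, which does not affect the argument because both are applied after the mean. The feature values are hypothetical and chosen so that the arithmetic is exact.

```python
def mean_aggregate(hood_feats):
    """Mean over a self-looped neighborhood (W = I and sigma = identity assumed)."""
    dim = len(hood_feats[0])
    return [sum(p[c] for p in hood_feats) / len(hood_feats) for c in range(dim)]


# Case of Eq. (17): equal neighborhood means, different neighborhood sizes.
hood_i = [[1.0, 0.0], [3.0, 2.0]]              # |N~_i| = 2
hood_j = [[2.0, 1.0], [0.0, 1.0], [4.0, 1.0]]  # |N~_j| = 3

# Case of Eq. (18): every member carries the same information as i (and j).
same_i = [[1.0, 1.0]] * 2                      # |N~_i| = 2
same_j = [[1.0, 1.0]] * 5                      # |N~_j| = 5

rep_i, rep_j = mean_aggregate(hood_i), mean_aggregate(hood_j)
rep_same_i, rep_same_j = mean_aggregate(same_i), mean_aggregate(same_j)
```

In both cases the two representations coincide even though the neighborhood sizes differ, which is exactly the failure on issue (II) described above.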
Next, we prove that GCN cannot capture DNE for some local networks. GCN degenerates to mean aggregation when $\tilde{d}_{i}=\tilde{d}_{k},\forall k\in\tilde{\mathbb{N}}_{i}$, which can be shown as follows:
$$\begin{aligned}\boldsymbol{p}_{i}^{\prime}&=\sigma\left(\sum_{k\in\tilde{\mathbb{N}}_{i}}\frac{1}{\sqrt{\tilde{d}_{i}\tilde{d}_{k}}}\boldsymbol{W}\boldsymbol{p}_{k}\right)\\&=\sigma\left(\sum_{k\in\tilde{\mathbb{N}}_{i}}\frac{1}{\sqrt{\tilde{d}_{i}\tilde{d}_{i}}}\boldsymbol{W}\boldsymbol{p}_{k}\right)\\&=\sigma\left(\sum_{k\in\tilde{\mathbb{N}}_{i}}\frac{1}{|\tilde{\mathbb{N}}_{i}|}\boldsymbol{W}\boldsymbol{p}_{k}\right).\end{aligned} \tag{19}$$

Therefore, GCN is unable to address issue (I) of DNE for some networks, where $\tilde{d}_{i}=\tilde{d}_{k},\forall k\in\tilde{\mathbb{N}}_{i}$. Furthermore, similar to the mean aggregation, it cannot address issue (II) of DNE for some local networks. We consider the case in which individuals $i$ and $j$ are exposed to two different local networks, where $\tilde{d}_{i}=\tilde{d}_{k_{1}},\forall k_{1}\in\tilde{\mathbb{N}}_{i}$, $\tilde{d}_{j}=\tilde{d}_{k_{2}},\forall k_{2}\in\tilde{\mathbb{N}}_{j}$, $\bar{\boldsymbol{p}}_{\tilde{\mathbb{N}}_{i}}=\bar{\boldsymbol{p}}_{\tilde{\mathbb{N}}_{j}}$, and $|\tilde{\mathbb{N}}_{i}|\neq|\tilde{\mathbb{N}}_{j}|$. In this case, we can derive from equality ([19](https://arxiv.org/html/2605.24358#A3.E19)) as follows:
$$\begin{aligned}\boldsymbol{p}_{i}^{\prime}&=\sigma\left(\sum_{k_{1}\in\tilde{\mathbb{N}}_{i}}\frac{1}{\sqrt{\tilde{d}_{i}\tilde{d}_{k_{1}}}}\boldsymbol{W}\boldsymbol{p}_{k_{1}}\right)=\sigma\left(\sum_{k_{1}\in\tilde{\mathbb{N}}_{i}}\frac{1}{|\tilde{\mathbb{N}}_{i}|}\boldsymbol{W}\boldsymbol{p}_{k_{1}}\right)=\sigma\left(\boldsymbol{W}\bar{\boldsymbol{p}}_{\tilde{\mathbb{N}}_{i}}\right),\\\boldsymbol{p}_{j}^{\prime}&=\sigma\left(\sum_{k_{2}\in\tilde{\mathbb{N}}_{j}}\frac{1}{\sqrt{\tilde{d}_{j}\tilde{d}_{k_{2}}}}\boldsymbol{W}\boldsymbol{p}_{k_{2}}\right)=\sigma\left(\sum_{k_{2}\in\tilde{\mathbb{N}}_{j}}\frac{1}{|\tilde{\mathbb{N}}_{j}|}\boldsymbol{W}\boldsymbol{p}_{k_{2}}\right)=\sigma\left(\boldsymbol{W}\bar{\boldsymbol{p}}_{\tilde{\mathbb{N}}_{j}}\right).\end{aligned} \tag{20}$$

Based on equalities ([20](https://arxiv.org/html/2605.24358#A3.E20)) and $\bar{\boldsymbol{p}}_{\tilde{\mathbb{N}}_{i}}=\bar{\boldsymbol{p}}_{\tilde{\mathbb{N}}_{j}}$, we have $\boldsymbol{p}_{i}^{\prime}=\boldsymbol{p}_{j}^{\prime}$ in this case. This indicates that GCN addresses neither issue (I) nor issue (II) of DNE in this scenario, as it assigns equal importance to all neighbors and generates the same representation for individuals $i$ and $j$ although $|\tilde{\mathbb{N}}_{i}|\neq|\tilde{\mathbb{N}}_{j}|$.
Similarly, we consider another case in which similar individuals $i$ and $j$ (i.e., $\boldsymbol{p}_{i}=\boldsymbol{p}_{j}$) are exposed to two different local networks where $\tilde{d}_{i}=\tilde{d}_{k_{1}},\boldsymbol{p}_{i}=\boldsymbol{p}_{k_{1}},\forall k_{1}\in\tilde{\mathbb{N}}_{i}$, $\tilde{d}_{j}=\tilde{d}_{k_{2}},\boldsymbol{p}_{j}=\boldsymbol{p}_{k_{2}},\forall k_{2}\in\tilde{\mathbb{N}}_{j}$, and $|\tilde{\mathbb{N}}_{i}|\neq|\tilde{\mathbb{N}}_{j}|$. In this case, we have
$$\begin{aligned}\boldsymbol{p}_{i}^{\prime}&=\sigma\left(\sum_{k_{1}\in\tilde{\mathbb{N}}_{i}}\frac{1}{\sqrt{\tilde{d}_{i}\tilde{d}_{k_{1}}}}\boldsymbol{W}\boldsymbol{p}_{k_{1}}\right)=\sigma\left(\sum_{k_{1}\in\tilde{\mathbb{N}}_{i}}\frac{1}{\sqrt{\tilde{d}_{i}\tilde{d}_{i}}}\boldsymbol{W}\boldsymbol{p}_{i}\right)=\sigma\left(\boldsymbol{W}\boldsymbol{p}_{i}\right),\\\boldsymbol{p}_{j}^{\prime}&=\sigma\left(\sum_{k_{2}\in\tilde{\mathbb{N}}_{j}}\frac{1}{\sqrt{\tilde{d}_{j}\tilde{d}_{k_{2}}}}\boldsymbol{W}\boldsymbol{p}_{k_{2}}\right)=\sigma\left(\sum_{k_{2}\in\tilde{\mathbb{N}}_{j}}\frac{1}{\sqrt{\tilde{d}_{j}\tilde{d}_{j}}}\boldsymbol{W}\boldsymbol{p}_{j}\right)=\sigma\left(\boldsymbol{W}\boldsymbol{p}_{j}\right).\end{aligned} \tag{21}$$

Based on equalities ([21](https://arxiv.org/html/2605.24358#A3.E21)) and $\boldsymbol{p}_{i}=\boldsymbol{p}_{j}$, we have $\boldsymbol{p}_{i}^{\prime}=\boldsymbol{p}_{j}^{\prime}$. This indicates that GCN is also unable to address issues (I) and (II) of DNE for this case, as it still generates the same representation for individuals $i$ and $j$ although $|\tilde{\mathbb{N}}_{i}|\neq|\tilde{\mathbb{N}}_{j}|$. As a result, GCN cannot capture DNE for some local networks, as it cannot address issues (I) and (II) jointly for these cases.
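The degree condition used for GCN holds, for instance, on complete graphs, where every individual has the same degree. The sketch below instantiates the second counterexample (identical features, equal degrees) on complete graphs with 3 and 5 nodes. Scalar features, $\boldsymbol{W}=1$, an identity $\sigma$, and the helper names are all assumptions made for illustration.

```python
import math


def degree(v, nbrs):
    """Degree of v counted with the self-loop."""
    return len(nbrs[v]) + 1


def gcn_aggregate(i, nbrs, feats):
    """GCN-style aggregation for node i (W = 1 and sigma = identity assumed)."""
    members = nbrs[i] + [i]
    return sum(
        feats[k] / math.sqrt(degree(i, nbrs) * degree(k, nbrs)) for k in members
    )


def complete_graph(n):
    """Adjacency lists of the complete graph on n nodes."""
    return {v: [u for u in range(n) if u != v] for v in range(n)}


small, large = complete_graph(3), complete_graph(5)
feats_small = {v: 2.0 for v in small}  # every individual carries the same p
feats_large = {v: 2.0 for v in large}

rep_i = gcn_aggregate(0, small, feats_small)  # |N~_i| = 3
rep_j = gcn_aggregate(0, large, feats_large)  # |N~_j| = 5
same_representation = math.isclose(rep_i, rep_j)
```

Because all degrees within each graph are equal, the GCN weights reduce to $1/|\tilde{\mathbb{N}}_{i}|$ and the two individuals receive the same representation despite the different neighborhood sizes.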
Lastly, we prove that GAT with the softmax operation cannot capture DNE for some local networks. From the aggregation of GAT, i.e., equality ([15](https://arxiv.org/html/2605.24358#A3.E15)), we can derive:

$$
\begin{aligned}
\boldsymbol{p}'_i &= \sigma\left(\sum_{k\in\tilde{\mathbf{N}}_i}\alpha_{ik}\cdot\boldsymbol{W}\boldsymbol{p}_k\right)\\
&= \sigma\left(\alpha_{ii}\cdot\boldsymbol{W}\boldsymbol{p}_i+\alpha_{i2}\cdot\boldsymbol{W}\boldsymbol{p}_2+\dots+\alpha_{in}\cdot\boldsymbol{W}\boldsymbol{p}_n\right)\\
&= \sigma\biggl(\left(\alpha_{ii}\cdot\boldsymbol{W}\boldsymbol{p}_i+\alpha_{i2}\cdot\boldsymbol{W}\boldsymbol{p}_i+\dots+\alpha_{in}\cdot\boldsymbol{W}\boldsymbol{p}_i\right)+\left(\alpha_{i2}\cdot\boldsymbol{W}\boldsymbol{p}_2-\alpha_{i2}\cdot\boldsymbol{W}\boldsymbol{p}_i\right)+\dots+\left(\alpha_{in}\cdot\boldsymbol{W}\boldsymbol{p}_n-\alpha_{in}\cdot\boldsymbol{W}\boldsymbol{p}_i\right)\biggr)\\
&= \sigma\Biggl(\left(\left[\sum_{k\in\tilde{\mathbf{N}}_i}\alpha_{ik}\right]\cdot\boldsymbol{W}\boldsymbol{p}_i\right)+\sum_{k\in\mathbf{N}_i}\alpha_{ik}\left(\boldsymbol{W}\boldsymbol{p}_k-\boldsymbol{W}\boldsymbol{p}_i\right)\Biggr)\\
&= \sigma\Biggl(\boldsymbol{W}\boldsymbol{p}_i+\sum_{k\in\mathbf{N}_i}\alpha_{ik}\left(\boldsymbol{W}\boldsymbol{p}_k-\boldsymbol{W}\boldsymbol{p}_i\right)\Biggr),
\end{aligned}
\tag{22}
$$

where the last equality holds as $\sum_{k\in\tilde{\mathbf{N}}_i}\alpha_{ik}=1$ due to the softmax operation in GAT. Here, the attention weights $\alpha_{ik}$ are estimated by the GAT, which tends to capture a pattern that assigns importance between two individuals based on individual information, i.e., $\boldsymbol{p}_k$ and $\boldsymbol{p}_i$.
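The rewriting of the GAT aggregation as "center term plus weighted differences" relies only on the softmax weights summing to 1. A small numerical check is sketched below; the attention logits, weight matrix, and representations are made-up values rather than outputs of a trained GAT, and the comparison is done before the activation $\sigma$.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, p):
    """Multiply matrix W (a list of rows) by vector p."""
    return [sum(w * x for w, x in zip(row, p)) for row in W]

W = [[1.0, 0.5], [-0.5, 2.0]]                   # illustrative weights (assumption)
# Index 0 is the center node i (its self-loop); the rest are proper neighbors.
P = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
alpha = softmax([0.2, -1.0, 0.7])               # hypothetical attention logits

WP = [matvec(W, p) for p in P]
dims = range(len(WP[0]))

# Plain attention-weighted sum over the local network with self-loop.
lhs = [sum(a * wp[d] for a, wp in zip(alpha, WP)) for d in dims]

# W p_i plus attention-weighted differences over the proper neighbors only.
rhs = [WP[0][d] + sum(a * (wp[d] - WP[0][d]) for a, wp in zip(alpha[1:], WP[1:]))
       for d in dims]

print(lhs, rhs)   # the two pre-activations agree up to rounding
```

If the weights were not normalized to sum to 1, the two expressions would differ by a multiple of $\boldsymbol{W}\boldsymbol{p}_i$.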
Although the GAT can adaptively estimate the importance of interference between individuals and their neighbors using individual information, it learns the same pattern across all local networks in a graph. This may introduce risks arising from a potential conflict issue between the estimated importance and individual information differences during aggregation. For example, suppose a GAT learns from a graph a pattern that tends to assign high importance to similar neighbors ($\boldsymbol{p}_k-\boldsymbol{p}_i=\boldsymbol{0}$) of individual $i$ while assigning low importance to dissimilar neighbors ($\boldsymbol{p}_k-\boldsymbol{p}_i\gg\boldsymbol{0}$) of individual $i$, where $\boldsymbol{0}$ denotes the zero vector. Here, if a neighbor $\boldsymbol{p}_k$ is similar to $\boldsymbol{p}_i$, $\alpha_{ik}$ can be high, but $\boldsymbol{W}\boldsymbol{p}_k-\boldsymbol{W}\boldsymbol{p}_i$ will be close to $\boldsymbol{0}$. In contrast, if a neighbor $\boldsymbol{p}_k$ is totally different from $\boldsymbol{p}_i$, every element of $\boldsymbol{W}\boldsymbol{p}_k-\boldsymbol{W}\boldsymbol{p}_i$ will be large, but the value of $\alpha_{ik}$ can be zero due to the softmax operation. Based on the analysis above, we consider a case where similar individuals $i$ and $j$ (i.e., $\boldsymbol{p}_i=\boldsymbol{p}_j$) are exposed to two different local networks.
In their networks, for all $k_1\in\tilde{\mathbf{N}}_i$ and $k_2\in\tilde{\mathbf{N}}_j$, their neighbors are either similar to them, i.e., $\boldsymbol{p}_i=\boldsymbol{p}_{k_1}$ and $\boldsymbol{p}_j=\boldsymbol{p}_{k_2}$, or totally different from them, i.e., $\boldsymbol{p}_{k_1}-\boldsymbol{p}_i\gg\boldsymbol{0}$ and $\boldsymbol{p}_{k_2}-\boldsymbol{p}_j\gg\boldsymbol{0}$. In this case, we have:
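$$
\alpha_{ik_1}=\begin{cases}>0, & \text{if } \boldsymbol{p}_{k_1}=\boldsymbol{p}_i,\\ 0, & \text{if } \boldsymbol{p}_{k_1}-\boldsymbol{p}_i\gg\boldsymbol{0},\end{cases}
\qquad
\alpha_{jk_2}=\begin{cases}>0, & \text{if } \boldsymbol{p}_{k_2}=\boldsymbol{p}_j,\\ 0, & \text{if } \boldsymbol{p}_{k_2}-\boldsymbol{p}_j\gg\boldsymbol{0}.\end{cases}
\tag{23}
$$

Based on ([23](https://arxiv.org/html/2605.24358#A3.E23)), we have the following results for this case: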
$$
\begin{aligned}
\boldsymbol{p}'_i &= \sigma\Biggl(\boldsymbol{W}\boldsymbol{p}_i+\sum_{k_1\in\mathbf{N}_i}\alpha_{ik_1}\left(\boldsymbol{W}\boldsymbol{p}_{k_1}-\boldsymbol{W}\boldsymbol{p}_i\right)\Biggr)=\sigma\left(\boldsymbol{W}\boldsymbol{p}_i\right),\\
\boldsymbol{p}'_j &= \sigma\Biggl(\boldsymbol{W}\boldsymbol{p}_j+\sum_{k_2\in\mathbf{N}_j}\alpha_{jk_2}\left(\boldsymbol{W}\boldsymbol{p}_{k_2}-\boldsymbol{W}\boldsymbol{p}_j\right)\Biggr)=\sigma\left(\boldsymbol{W}\boldsymbol{p}_j\right).
\end{aligned}
\tag{24}
$$

Each summand vanishes because either the attention weight or the difference $\boldsymbol{W}\boldsymbol{p}_k-\boldsymbol{W}\boldsymbol{p}_i$ is zero. Based on equalities ([24](https://arxiv.org/html/2605.24358#A3.E24)) and $\boldsymbol{p}_i=\boldsymbol{p}_j$, we have $\boldsymbol{p}'_i=\boldsymbol{p}'_j$. This indicates that the GAT is unable to address issue (II) of DNE in this case, as it still generates the same representations for individuals $i$ and $j$ even though $|\tilde{\mathbb{N}}_i|\neq|\tilde{\mathbb{N}}_j|$. We consider another case, in which similar individuals $i$ and $j$ (i.e., $\boldsymbol{p}_i=\boldsymbol{p}_j$) are exposed to two different local networks where $\boldsymbol{p}_i=\boldsymbol{p}_{k_1}$, $\forall k_1\in\tilde{\mathbb{N}}_i$; $\boldsymbol{p}_j=\boldsymbol{p}_{k_2}$, $\forall k_2\in\tilde{\mathbb{N}}_j$; and $|\tilde{\mathbb{N}}_i|\neq|\tilde{\mathbb{N}}_j|$. From equality ([22](https://arxiv.org/html/2605.24358#A3.E22)), we can derive:
$$
\begin{aligned}
\boldsymbol{p}'_i &= \sigma\Biggl(\boldsymbol{W}\boldsymbol{p}_i+\sum_{k_1\in\mathbf{N}_i}\alpha_{ik_1}\left(\boldsymbol{W}\boldsymbol{p}_{k_1}-\boldsymbol{W}\boldsymbol{p}_i\right)\Biggr)=\sigma(\boldsymbol{W}\boldsymbol{p}_i),\\
\boldsymbol{p}'_j &= \sigma\Biggl(\boldsymbol{W}\boldsymbol{p}_j+\sum_{k_2\in\mathbf{N}_j}\alpha_{jk_2}\left(\boldsymbol{W}\boldsymbol{p}_{k_2}-\boldsymbol{W}\boldsymbol{p}_j\right)\Biggr)=\sigma(\boldsymbol{W}\boldsymbol{p}_j).
\end{aligned}
\tag{25}
$$

Based on equalities ([25](https://arxiv.org/html/2605.24358#A3.E25)) and $\boldsymbol{p}_i=\boldsymbol{p}_j$, we have $\boldsymbol{p}'_i=\boldsymbol{p}'_j$. This indicates that GAT is also unable to address issue (II) of DNE in this case, as it still generates the same representations for individuals $i$ and $j$ even though $|\tilde{\mathbb{N}}_i|\neq|\tilde{\mathbb{N}}_j|$. Furthermore, the estimated importance of neighbors also tends to be the same in this case, which further prevents traditional GAT-based methods from capturing DNE. As a result, GAT cannot capture DNE for some local networks, such as those in equalities ([24](https://arxiv.org/html/2605.24358#A3.E24)) and ([25](https://arxiv.org/html/2605.24358#A3.E25)), as it cannot address issues (I) and (II) jointly for these cases.
As discussed above, interference representation generated by mean aggregation, GCN, or GAT, cannot capture DNE for some local networks\. ∎
This indicates that most existing interference modeling methods face challenges in capturing DNE effectively\.
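The homogeneous-neighborhood case for GAT can be illustrated with a single attention head. The sketch below assumes the standard GAT scoring form (a LeakyReLU applied to an attention vector dotted with the concatenated transformed representations) and ReLU for $\sigma$; the weights `W`, attention vector `a`, and helper `gat_center` are hypothetical values and names chosen for illustration.

```python
import math

def matvec(W, p):
    """Multiply matrix W (a list of rows) by vector p."""
    return [sum(w * x for w, x in zip(row, p)) for row in W]

def relu(v):
    return [max(0.0, x) for x in v]

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gat_center(p_i, members, W, a):
    """Single-head GAT update for center p_i. `members` is the local
    network including the center itself (the self-loop)."""
    Wp_i = matvec(W, p_i)
    WP = [matvec(W, p) for p in members]
    # Score each member from the concatenation [W p_i || W p_k].
    logits = [leaky_relu(sum(x * y for x, y in zip(a, Wp_i + Wp_k))) for Wp_k in WP]
    alpha = softmax(logits)
    agg = [sum(al * Wp_k[d] for al, Wp_k in zip(alpha, WP)) for d in range(len(Wp_i))]
    return relu(agg), alpha

W = [[1.0, 0.5], [-0.5, 2.0]]
a = [0.3, -0.2, 0.1, 0.4]
p = [1.0, 2.0]
out_small, alpha_small = gat_center(p, [p] * 3, W, a)
out_large, alpha_large = gat_center(p, [p] * 40, W, a)
print(out_small, out_large)   # both approximate relu(W p)
```

With identical members the softmax weights are uniform, so the weighted sum returns the shared message whatever the neighborhood size.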
Apart from the above-mentioned aggregation-based methods for ITE estimation, the graph community also uses pooling-based methods, of which max pooling (Hamilton et al., [2017](https://arxiv.org/html/2605.24358#bib.bib76)) is one of the most popular.
###### Proposition C\.4\.
Interference representation generated by max\-pooling or min\-pooling operations cannot capture DNE for some local networks\.
We now prove Proposition[C\.4](https://arxiv.org/html/2605.24358#A3.Thmtheorem4), as follows:
###### Proof\.
As pooling\-based methods do not contain a mechanism to estimate the importance of interference among individuals, they cannot address issue \(I\)\.
Next, we show that pooling-based methods cannot address issue (II) for some networks. For simplicity, we consider local networks that consist of individuals and their 1-hop neighbors. In this case, the neighbor set of an individual together with its self-loop constitutes its local network, e.g., $\tilde{\mathbb{N}}_i$. Let $\mathrm{Pool}$ be a max-pooling or min-pooling operation. We consider the case (similar to the examples in Figure [1](https://arxiv.org/html/2605.24358#S1.F1)) in which similar individuals $i$ and $j$ (i.e., $\boldsymbol{p}_i=\boldsymbol{p}_j$) are exposed to two different local networks where $\boldsymbol{p}_i=\boldsymbol{p}_{k_1}$, $\forall k_1\in\tilde{\mathbb{N}}_i$; $\boldsymbol{p}_j=\boldsymbol{p}_{k_2}$, $\forall k_2\in\tilde{\mathbb{N}}_j$; and $|\tilde{\mathbb{N}}_i|\neq|\tilde{\mathbb{N}}_j|$.
We have the result of the pooling operation as follows:
$$
\begin{aligned}
\boldsymbol{p}'_i &= \mathrm{Pool}\Bigl(\bigl\{\sigma(\boldsymbol{W}\boldsymbol{p}_{k_1}),\forall k_1\in\tilde{\mathbb{N}}_i\bigr\}\Bigr)=\mathrm{Pool}\Bigl(\bigl\{\sigma(\boldsymbol{W}\boldsymbol{p}_i),\dots,\sigma(\boldsymbol{W}\boldsymbol{p}_i)\bigr\}\Bigr)=\sigma(\boldsymbol{W}\boldsymbol{p}_i),\\
\boldsymbol{p}'_j &= \mathrm{Pool}\Bigl(\bigl\{\sigma(\boldsymbol{W}\boldsymbol{p}_{k_2}),\forall k_2\in\tilde{\mathbb{N}}_j\bigr\}\Bigr)=\mathrm{Pool}\Bigl(\bigl\{\sigma(\boldsymbol{W}\boldsymbol{p}_j),\dots,\sigma(\boldsymbol{W}\boldsymbol{p}_j)\bigr\}\Bigr)=\sigma(\boldsymbol{W}\boldsymbol{p}_j).
\end{aligned}
\tag{26}
$$

Based on equalities ([26](https://arxiv.org/html/2605.24358#A3.E26)) and $\boldsymbol{p}_i=\boldsymbol{p}_j$, we have $\boldsymbol{p}'_i=\boldsymbol{p}'_j$. This indicates that the pooling operation is unable to address issue (II) of DNE in this case, as it still generates the same representations for individuals $i$ and $j$ even though $|\tilde{\mathbb{N}}_i|\neq|\tilde{\mathbb{N}}_j|$. As a result, the pooling operation cannot capture DNE, as it cannot address issues (I) and (II) jointly.
∎
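The pooling argument can be reproduced in a few lines. This is a minimal sketch under the same assumptions as the proof (all members of a local network share one representation), with ReLU standing in for $\sigma$ and illustrative weights; the helper `pooled` is a hypothetical name.

```python
def matvec(W, p):
    """Multiply matrix W (a list of rows) by vector p."""
    return [sum(w * x for w, x in zip(row, p)) for row in W]

def relu(v):
    return [max(0.0, x) for x in v]

def pooled(p, W, size, pool):
    """Pool the messages sigma(W p_k) over `size` members that all share p.
    `pool` is Python's built-in max or min, applied per coordinate."""
    messages = [relu(matvec(W, p)) for _ in range(size)]
    return [pool(column) for column in zip(*messages)]

W = [[0.5, -1.0], [2.0, 0.25]]   # illustrative weights (assumption)
p = [1.0, 3.0]
for pool in (max, min):
    small = pooled(p, W, 2, pool)
    large = pooled(p, W, 50, pool)
    print(pool.__name__, small == large)   # prints "max True" then "min True"
```

Because every message is the same vector, the per-coordinate maximum and minimum both return that vector, so the neighborhood size leaves no trace in the output.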
## Appendix D Proof of Proposition [4.1](https://arxiv.org/html/2605.24358#S4.Thmtheorem1)
In this section, we prove that the proposed NIM layer can capture DNE. A proper aggregation function that can capture DNE should address two sub-issues. (I) The importance of different neighbors in contributing to interference varies (Lin et al., [2023](https://arxiv.org/html/2605.24358#bib.bib91)). (II) The scale of neighbors varies, leading to different levels of interference (see Figure [1](https://arxiv.org/html/2605.24358#S1.F1)). The IPAtt and SPAtt mechanisms can address issue (I) of DNE in an adaptive manner, while the SPAtt mechanism can mitigate the conflict issue (see Appendix [C](https://arxiv.org/html/2605.24358#A3)) of GAT-based methods. However, for the case where individuals in the same local network have similar individual information, we still suffer from an issue similar to that of GAT-based methods (Huang et al., [2023](https://arxiv.org/html/2605.24358#bib.bib102); Lin et al., [2023](https://arxiv.org/html/2605.24358#bib.bib91); Ma et al., [2022b](https://arxiv.org/html/2605.24358#bib.bib53)) when applying the normalization operation to the estimated importance without the message amplifier. Here, we consider a case in which similar individuals $i$ and $j$ (i.e., $\boldsymbol{z}_{\bullet_i}^{(l-1)}=\boldsymbol{z}_{\bullet_j}^{(l-1)}$, $\bullet\in\{\mathbb{X},\mathbb{T}\}$) are exposed to two different local networks, which consist of individuals and their 1-hop neighbors.
In their local networks, $\boldsymbol{z}_{\bullet_i}^{(l-1)}=\boldsymbol{z}_{\bullet_{k_1}}^{(l-1)}$, $\forall k_1\in\tilde{\mathbb{N}}_i$; $\boldsymbol{z}_{\bullet_j}^{(l-1)}=\boldsymbol{z}_{\bullet_{k_2}}^{(l-1)}$, $\forall k_2\in\tilde{\mathbb{N}}_j$; $|\tilde{\mathbf{N}}_i|=m$, $|\tilde{\mathbf{N}}_j|=n$, and $n\gg m$. In this case, $\boldsymbol{W}^{(l)}_{\bullet_{\mathrm{in}}}\boldsymbol{z}_{\bullet_i}^{(l-1)}=\boldsymbol{W}^{(l)}_{\bullet_{\mathrm{in}}}\boldsymbol{z}_{\bullet_{k_1}}^{(l-1)}$, $\boldsymbol{W}^{(l)}_{\bullet_{\mathrm{st}}}\boldsymbol{z}_{\bullet_i}^{(l-1)}=\boldsymbol{W}^{(l)}_{\bullet_{\mathrm{st}}}\boldsymbol{z}_{\bullet_{k_1}}^{(l-1)}$, $\boldsymbol{W}^{(l)}_{\bullet_{\mathrm{in}}}\boldsymbol{z}_{\bullet_j}^{(l-1)}=\boldsymbol{W}^{(l)}_{\bullet_{\mathrm{in}}}\boldsymbol{z}_{\bullet_{k_2}}^{(l-1)}$, and $\boldsymbol{W}^{(l)}_{\bullet_{\mathrm{st}}}\boldsymbol{z}_{\bullet_j}^{(l-1)}=\boldsymbol{W}^{(l)}_{\bullet_{\mathrm{st}}}\boldsymbol{z}_{\bullet_{k_2}}^{(l-1)}$ hold. Therefore, we omit $\boldsymbol{W}^{(l)}_{\bullet_{\mathrm{in}}}$ and $\boldsymbol{W}^{(l)}_{\bullet_{\mathrm{st}}}$ in our proof for simplicity. Then, we have:
$$
\begin{aligned}
\mathring{\boldsymbol{z}}_{\bullet_i}^{(l)} ={}& \mathring{\pi}_{\bullet}^{(l)}\cdot\sigma\Biggl(\sum_{k_1\in\tilde{\mathbb{N}}_i}\alpha_{ik_1}^{\mathrm{in}}\cdot\boldsymbol{z}_{\bullet_{k_1}}^{(l-1)}\Biggr)+\left(1-\mathring{\pi}_{\bullet}^{(l)}\right)\cdot\sigma\Biggl(\sum_{k_1\in\tilde{\mathbb{N}}_i}\alpha_{ik_1}^{\mathrm{st}}\cdot\boldsymbol{z}_{\bullet_{k_1}}^{(l-1)}\Biggr)\\
={}& \mathring{\pi}_{\bullet}^{(l)}\cdot\sigma\Biggl(\sum_{k_1\in\tilde{\mathbb{N}}_i}\alpha_{ik_1}^{\mathrm{in}}\cdot\boldsymbol{z}_{\bullet_i}^{(l-1)}\Biggr)+\left(1-\mathring{\pi}_{\bullet}^{(l)}\right)\cdot\sigma\Biggl(\sum_{k_1\in\tilde{\mathbb{N}}_i}\alpha_{ik_1}^{\mathrm{st}}\cdot\boldsymbol{z}_{\bullet_i}^{(l-1)}\Biggr)\\
={}& \mathring{\pi}_{\bullet}^{(l)}\cdot\sigma\left(\boldsymbol{z}_{\bullet_i}^{(l-1)}\right)+\left(1-\mathring{\pi}_{\bullet}^{(l)}\right)\cdot\sigma\left(\boldsymbol{z}_{\bullet_i}^{(l-1)}\right),\\
\mathring{\boldsymbol{z}}_{\bullet_j}^{(l)} ={}& \mathring{\pi}_{\bullet}^{(l)}\cdot\sigma\Biggl(\sum_{k_2\in\tilde{\mathbb{N}}_j}\alpha_{jk_2}^{\mathrm{in}}\cdot\boldsymbol{z}_{\bullet_{k_2}}^{(l-1)}\Biggr)+\left(1-\mathring{\pi}_{\bullet}^{(l)}\right)\cdot\sigma\Biggl(\sum_{k_2\in\tilde{\mathbb{N}}_j}\alpha_{jk_2}^{\mathrm{st}}\cdot\boldsymbol{z}_{\bullet_{k_2}}^{(l-1)}\Biggr)\\
={}& \mathring{\pi}_{\bullet}^{(l)}\cdot\sigma\Biggl(\sum_{k_2\in\tilde{\mathbb{N}}_j}\alpha_{jk_2}^{\mathrm{in}}\cdot\boldsymbol{z}_{\bullet_j}^{(l-1)}\Biggr)+\left(1-\mathring{\pi}_{\bullet}^{(l)}\right)\cdot\sigma\Biggl(\sum_{k_2\in\tilde{\mathbb{N}}_j}\alpha_{jk_2}^{\mathrm{st}}\cdot\boldsymbol{z}_{\bullet_j}^{(l-1)}\Biggr)\\
={}& \mathring{\pi}_{\bullet}^{(l)}\cdot\sigma\left(\boldsymbol{z}_{\bullet_j}^{(l-1)}\right)+\left(1-\mathring{\pi}_{\bullet}^{(l)}\right)\cdot\sigma\left(\boldsymbol{z}_{\bullet_j}^{(l-1)}\right).
\end{aligned}
\tag{27}
$$
These equalities hold because the importance estimated by each partial attention mechanism sums to 1 due to the normalization operation. Based on equalities ([27](https://arxiv.org/html/2605.24358#A4.E27)) and $\boldsymbol{z}_{\bullet_i}^{(l-1)}=\boldsymbol{z}_{\bullet_j}^{(l-1)}$, we have $\mathring{\boldsymbol{z}}_{\bullet_i}^{(l)}=\mathring{\boldsymbol{z}}_{\bullet_j}^{(l)}$. We can observe that, when the message amplifier is not applied, the aggregated results are identical regardless of the scales of neighbors in different local networks in this case. This holds even when the scales of their neighbors differ significantly, i.e., $|\tilde{\mathbf{N}}_i|=m$, $|\tilde{\mathbf{N}}_j|=n$, and $n\gg m$. Therefore, we apply the message amplifier to ensure that the generated representations of the two individuals differ according to the sizes of their neighborhoods, i.e., $|\tilde{\mathbf{N}}_i|$ and $|\tilde{\mathbf{N}}_j|$, which correspond to their degrees $\tilde{d}_i$ and $\tilde{d}_j$ with self-loops.
###### Proposition D\.1\(Proposition[4\.1](https://arxiv.org/html/2605.24358#S4.Thmtheorem1), main text\)\.
Interference representation generated by NIM layers after applying the message amplifier can address the issue \(II\), even in some local networks where all individuals have similar interference\-related information\.
We now prove Proposition[D\.1](https://arxiv.org/html/2605.24358#A4.Thmtheorem1)\.
###### Proof\.
For individual $i$ with $|\tilde{\mathbf{N}}_i|=m$ and individual $j$ with $|\tilde{\mathbf{N}}_j|=n$, where $n\gg m$, we have:

$$
\begin{aligned}
(1+\eta_i)\cdot\mathring{\boldsymbol{z}}_{\bullet_i}^{(l)} &= \Biggl(1+\pi_{\eta}\cdot\frac{\log(\tilde{d}_i)}{\sum_{i=1}^{n_{\mathrm{tr}}}\log(\tilde{d}_i)}\Biggr)\cdot\mathring{\boldsymbol{z}}_{\bullet_i}^{(l)}=\Biggl(1+\pi_{\eta}\cdot\frac{\log(m)}{\sum_{i=1}^{n_{\mathrm{tr}}}\log(\tilde{d}_i)}\Biggr)\cdot\mathring{\boldsymbol{z}}_{\bullet_i}^{(l)},\\
(1+\eta_j)\cdot\mathring{\boldsymbol{z}}_{\bullet_j}^{(l)} &= \Biggl(1+\pi_{\eta}\cdot\frac{\log(\tilde{d}_j)}{\sum_{j=1}^{n_{\mathrm{tr}}}\log(\tilde{d}_j)}\Biggr)\cdot\mathring{\boldsymbol{z}}_{\bullet_j}^{(l)}=\Biggl(1+\pi_{\eta}\cdot\frac{\log(n)}{\sum_{j=1}^{n_{\mathrm{tr}}}\log(\tilde{d}_j)}\Biggr)\cdot\mathring{\boldsymbol{z}}_{\bullet_j}^{(l)}.
\end{aligned}
\tag{28}
$$

This shows that the generated interference representation with the message amplifier differs according to the degree of neighbors, which can address issue (II) of DNE, even in local networks where all individuals have similar information. ∎
Therefore, by applying two partial attention mechanisms to address issue \(I\) and applying the message amplifier to address issue \(II\), the proposed NIM layer can capture DNE\.
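The scaling in (28) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: $\pi_\eta$ is treated as a fixed scalar, the degree list plays the role of the $n_{\mathrm{tr}}$ training individuals, and the names `amplifier_factors` and `z_ring` are hypothetical.

```python
import math

def amplifier_factors(degrees, pi_eta):
    """Return 1 + eta_v for every node v, where
    eta_v = pi_eta * log(d_v) / sum_u log(d_u), with degrees counted with self-loops."""
    log_deg = [math.log(d) for d in degrees]
    total = sum(log_deg)
    if total == 0.0:
        # Every degree equals 1 (isolated nodes with self-loops only).
        raise ValueError("sum of log-degrees is zero; the ratio is undefined")
    return [1.0 + pi_eta * ld / total for ld in log_deg]

degrees = [3, 50, 4, 10]                 # node 0 has m = 3, node 1 has n = 50
factors = amplifier_factors(degrees, pi_eta=1.0)

# Identical normalized aggregation results for nodes 0 and 1, as in (27).
z_ring = [0.8, 1.2]
z_i = [factors[0] * z for z in z_ring]
z_j = [factors[1] * z for z in z_ring]
print(z_i, z_j)   # the node with the larger neighborhood is scaled up more
```

Since the logarithm is increasing, a node with more neighbors always receives the larger factor, while the logarithm keeps the gap between very large and moderate degrees from growing linearly.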
## Appendix E Error bound
Our theoretical analysis of the error bound is inspired by Shalit et al. ([2017](https://arxiv.org/html/2605.24358#bib.bib4)); we extend their analysis of non-graph data to graph data by using the proposed representation balancing strategy. Let $\Phi$ be a mapping function, assumed to be twice-differentiable and invertible, following Shalit et al. ([2017](https://arxiv.org/html/2605.24358#bib.bib4)); Cai et al. ([2023](https://arxiv.org/html/2605.24358#bib.bib125)); Wang et al. ([2023](https://arxiv.org/html/2605.24358#bib.bib122)). Let $\Phi^{-1}$ be the inverse of $\Phi$ and, for simplicity, let $\boldsymbol{u}=(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}},\boldsymbol{t}_{\mathbb{G}})$. Then, the expected loss for an individual is as follows:
$$
\mathcal{L}_{f,\Phi}(\boldsymbol{u},t)=\int_{\mathcal{Y}}\mathcal{L}\bigl(y(t,\boldsymbol{t}_{\mathbb{G}}),f\left(\Phi(\boldsymbol{u}),t\right)\bigr)\,p\bigl(y(t,\boldsymbol{t}_{\mathbb{G}})\mid\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}}\bigr)\,dy(t,\boldsymbol{t}_{\mathbb{G}}).
\tag{29}
$$
We consider MSE for $\mathcal{L}\bigl(y(t,\boldsymbol{t}_{\mathbb{G}}),f(\Phi(\boldsymbol{u}),t)\bigr)$. The expected losses of factual and counterfactual outcomes are, respectively:

$$
\begin{aligned}
\epsilon_{\mathrm{F}}(f,\Phi) &\coloneqq \int_{\mathcal{U}\times\{0,1\}}\mathcal{L}_{f,\Phi}(\boldsymbol{u},t)\,p(\boldsymbol{u},t)\,d\boldsymbol{u}\,dt,\\
\epsilon_{\mathrm{CF}}(f,\Phi) &\coloneqq \int_{\mathcal{U}\times\{0,1\}}\mathcal{L}_{f,\Phi}(\boldsymbol{u},1-t)\,p(\boldsymbol{u},t)\,d\boldsymbol{u}\,dt.
\end{aligned}
\tag{30}
$$

We can decompose $p(\boldsymbol{u},t)=p(t)\,p(\boldsymbol{u}\mid t)$. Let $p^{t=1}(\boldsymbol{u})=p(\boldsymbol{u}\mid t=1)$ and $p^{t=0}(\boldsymbol{u})=p(\boldsymbol{u}\mid t=0)$. Then, the expected losses of factual and counterfactual outcomes for the treated and control groups are, respectively:
$$
\begin{aligned}
\epsilon_{\mathrm{F}}^{t=1}(f,\Phi) &\coloneqq \int_{\mathcal{U}}\mathcal{L}_{f,\Phi}(\boldsymbol{u},1)\,p^{t=1}(\boldsymbol{u})\,d\boldsymbol{u}, &
\epsilon_{\mathrm{F}}^{t=0}(f,\Phi) &\coloneqq \int_{\mathcal{U}}\mathcal{L}_{f,\Phi}(\boldsymbol{u},0)\,p^{t=0}(\boldsymbol{u})\,d\boldsymbol{u},\\
\epsilon_{\mathrm{CF}}^{t=1}(f,\Phi) &\coloneqq \int_{\mathcal{U}}\mathcal{L}_{f,\Phi}(\boldsymbol{u},1)\,p^{t=0}(\boldsymbol{u})\,d\boldsymbol{u}, &
\epsilon_{\mathrm{CF}}^{t=0}(f,\Phi) &\coloneqq \int_{\mathcal{U}}\mathcal{L}_{f,\Phi}(\boldsymbol{u},0)\,p^{t=1}(\boldsymbol{u})\,d\boldsymbol{u}.
\end{aligned}
\tag{31}
$$

Let $p_t\coloneqq p(t=1)$. We then have the following results:
$$
\begin{aligned}
\epsilon_{\mathrm{F}}(f,\Phi) &= p_t\cdot\epsilon_{\mathrm{F}}^{t=1}(f,\Phi)+(1-p_t)\cdot\epsilon_{\mathrm{F}}^{t=0}(f,\Phi),\\
\epsilon_{\mathrm{CF}}(f,\Phi) &= (1-p_t)\cdot\epsilon_{\mathrm{CF}}^{t=1}(f,\Phi)+p_t\cdot\epsilon_{\mathrm{CF}}^{t=0}(f,\Phi).
\end{aligned}
\tag{32}
$$

The integral probability metric (IPM) is defined as follows (Shalit et al., [2017](https://arxiv.org/html/2605.24358#bib.bib4)):
###### Definition E\.1\.
Let $\mathbb{O}$ be a function family consisting of functions $o:\mathcal{R}\rightarrow\mathbb{R}$. For a pair of distributions $p_1$ and $p_2$ over $\mathcal{R}$, define the IPM:

$$
\mathrm{IPM}_{\mathbb{O}}(p_1,p_2)\coloneqq\sup_{o\in\mathbb{O}}\Biggl|\int_{\mathcal{R}}o(\boldsymbol{r})\bigl(p_1(\boldsymbol{r})-p_2(\boldsymbol{r})\bigr)\,d\boldsymbol{r}\Biggr|.
\tag{33}
$$
When $\mathbb{O}$ is the family of 1-Lipschitz functions, $\mathrm{IPM}_{\mathbb{O}}(p_1,p_2)=\mathcal{W}(p_1,p_2)$, as demonstrated by Villani and others ([2008](https://arxiv.org/html/2605.24358#bib.bib139)). We use the Wasserstein discrepancy for representation balancing. For the Wasserstein discrepancy, we follow the Kantorovich problem (Kantorovich, [2006](https://arxiv.org/html/2605.24358#bib.bib128)) and add an entropic regularization to reduce computational costs (Wang et al., [2023](https://arxiv.org/html/2605.24358#bib.bib122)), as follows:
###### Definition E\.2\.
$$
\begin{aligned}
&\mathcal{W}(p_1,p_2)\coloneqq\langle\boldsymbol{D},\boldsymbol{\pi}^{\xi}\rangle,\quad\boldsymbol{\pi}^{\xi}\coloneqq\mathop{\arg\min}_{\boldsymbol{\pi}\in\Pi(p_1,p_2)}\langle\boldsymbol{D},\boldsymbol{\pi}\rangle-\xi H(\boldsymbol{\pi}),\\
&\Pi(p_1,p_2)\coloneqq\bigl\{\boldsymbol{\pi}\in\mathbb{R}_{+}^{n\times m}:\boldsymbol{\pi}\mathbb{1}_m=\mathbb{p}_1,\ \boldsymbol{\pi}^{\top}\mathbb{1}_n=\mathbb{p}_2\bigr\},\\
&H(\boldsymbol{\pi})\coloneqq-\sum_{i,j}\boldsymbol{\pi}_{i,j}\bigl(\log(\boldsymbol{\pi}_{i,j})-1\bigr).
\end{aligned}
\tag{34}
$$

Here, $\mathcal{W}$ is the Wasserstein discrepancy between $p_1$ and $p_2$, and $\boldsymbol{D}$ consists of the unit-wise distances between $p_1$ and $p_2$. $\mathcal{W}(p_1,p_2)$ can be solved by the Sinkhorn algorithm (Cuturi, [2013](https://arxiv.org/html/2605.24358#bib.bib130)).
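For readers unfamiliar with the Sinkhorn algorithm, the following is a minimal pure-Python sketch of the standard iteration for the entropy-regularized problem in Definition E.2. It is not the authors' implementation: the toy one-dimensional samples, the choice of absolute difference as the unit-wise distance, the value of $\xi$, and the fixed iteration count are all assumptions made for illustration.

```python
import math

def sinkhorn(D, p1, p2, xi=0.5, n_iter=500):
    """Entropy-regularized OT between histograms p1 (length n) and p2 (length m)
    for cost matrix D (n x m). Returns (<D, plan>, plan)."""
    n, m = len(p1), len(p2)
    # Gibbs kernel induced by the entropic term.
    K = [[math.exp(-D[i][j] / xi) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the two marginals.
        u = [p1[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [p2[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    cost = sum(D[i][j] * plan[i][j] for i in range(n) for j in range(m))
    return cost, plan

# Toy representations of two groups (hypothetical values), uniform weights.
r_treated = [0.0, 1.0]
r_control = [0.0, 1.0, 2.0]
D = [[abs(a - b) for b in r_control] for a in r_treated]
p1 = [1.0 / len(r_treated)] * len(r_treated)
p2 = [1.0 / len(r_control)] * len(r_control)

cost, plan = sinkhorn(D, p1, p2)
row_sums = [sum(row) for row in plan]
col_sums = [sum(col) for col in zip(*plan)]
print(cost, row_sums, col_sums)
```

A smaller $\xi$ brings the plan closer to the unregularized optimum but makes the kernel entries tiny, so practical implementations often work in the log domain; that refinement is omitted here.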
By using𝒲\\mathcal\{W\}for representation balancing, we can have the error bound for the counterfactual outcome, as stated in Lemma[E\.3](https://arxiv.org/html/2605.24358#A5.Thmtheorem3)\.
###### Lemma E\.3\.
Let $\mathbb{O}$ be a family of functions $o:\mathcal{R}\rightarrow\mathbb{R}$. Assume there exists a constant $B_{\Phi}>0$ such that, for $t\in\{0,1\}$, the function $\frac{1}{B_{\Phi}}\cdot\mathcal{L}_{f,\Phi}\left(\Phi^{-1}(\boldsymbol{r}),t\right)\in\mathbb{O}$. Then, the bound for the expected loss of counterfactual outcomes is:

\begin{align}
\epsilon_{\mathrm{CF}}(f,\Phi)\leq{}&(1-p_t)\cdot\epsilon_{\mathrm{F}}^{t=1}(f,\Phi)+p_t\cdot\epsilon_{\mathrm{F}}^{t=0}(f,\Phi)+{} \tag{35}\\
&B_{\Phi}\cdot\mathcal{W}\bigl(p^{t=1}_{\Phi}(\boldsymbol{r}),p^{t=0}_{\Phi}(\boldsymbol{r})\bigr), \tag{36}
\end{align}

where $p^{t=1}_{\Phi}$ and $p^{t=0}_{\Phi}$ are the distributions of representations $\boldsymbol{r}$ with $t=1$ and $t=0$, respectively.
We provide the proof for Lemma[E\.3](https://arxiv.org/html/2605.24358#A5.Thmtheorem3)as follows:
###### Proof\.
\begin{align}
&\epsilon_{\mathrm{CF}}(f,\Phi)-\bigl((1-p_t)\cdot\epsilon_{\mathrm{F}}^{t=1}(f,\Phi)+p_t\cdot\epsilon_{\mathrm{F}}^{t=0}(f,\Phi)\bigr) \notag\\
={}&\bigl((1-p_t)\cdot\epsilon_{\mathrm{CF}}^{t=1}(f,\Phi)+p_t\cdot\epsilon_{\mathrm{CF}}^{t=0}(f,\Phi)\bigr)-\bigl((1-p_t)\cdot\epsilon_{\mathrm{F}}^{t=1}(f,\Phi)+p_t\cdot\epsilon_{\mathrm{F}}^{t=0}(f,\Phi)\bigr) \notag\\
={}&(1-p_t)\cdot\bigl(\epsilon_{\mathrm{CF}}^{t=1}(f,\Phi)-\epsilon_{\mathrm{F}}^{t=1}(f,\Phi)\bigr)+p_t\cdot\bigl(\epsilon_{\mathrm{CF}}^{t=0}(f,\Phi)-\epsilon_{\mathrm{F}}^{t=0}(f,\Phi)\bigr) \notag\\
={}&(1-p_t)\cdot\int_{\mathcal{U}}\mathcal{L}_{f,\Phi}(\boldsymbol{u},1)\bigl(p^{t=0}(\boldsymbol{u})-p^{t=1}(\boldsymbol{u})\bigr)\,d\boldsymbol{u}+p_t\cdot\int_{\mathcal{U}}\mathcal{L}_{f,\Phi}(\boldsymbol{u},0)\bigl(p^{t=1}(\boldsymbol{u})-p^{t=0}(\boldsymbol{u})\bigr)\,d\boldsymbol{u} \notag\\
={}&B_{\Phi}\cdot(1-p_t)\cdot\int_{\mathcal{R}}\frac{1}{B_{\Phi}}\mathcal{L}_{f,\Phi}\bigl(\Phi^{-1}(\boldsymbol{r}),1\bigr)\bigl(p_{\Phi}^{t=0}(\boldsymbol{r})-p_{\Phi}^{t=1}(\boldsymbol{r})\bigr)\,d\boldsymbol{r}+{} \notag\\
&B_{\Phi}\cdot p_t\cdot\int_{\mathcal{R}}\frac{1}{B_{\Phi}}\mathcal{L}_{f,\Phi}\bigl(\Phi^{-1}(\boldsymbol{r}),0\bigr)\bigl(p^{t=1}_{\Phi}(\boldsymbol{r})-p^{t=0}_{\Phi}(\boldsymbol{r})\bigr)\,d\boldsymbol{r} \tag{37}\\
\leq{}&B_{\Phi}\cdot(1-p_t)\cdot\sup_{o\in\mathbb{O}}\Biggl|\int_{\mathcal{R}}o(\boldsymbol{r})\bigl(p^{t=0}_{\Phi}(\boldsymbol{r})-p^{t=1}_{\Phi}(\boldsymbol{r})\bigr)\,d\boldsymbol{r}\Biggr|+{} \notag\\
&B_{\Phi}\cdot p_t\cdot\sup_{o\in\mathbb{O}}\Biggl|\int_{\mathcal{R}}o(\boldsymbol{r})\bigl(p^{t=1}_{\Phi}(\boldsymbol{r})-p^{t=0}_{\Phi}(\boldsymbol{r})\bigr)\,d\boldsymbol{r}\Biggr| \tag{38}\\
={}&B_{\Phi}\cdot\mathrm{IPM}_{\mathbb{O}}\bigl(p^{t=0}_{\Phi}(\boldsymbol{r}),p_{\Phi}^{t=1}(\boldsymbol{r})\bigr). \tag{39}
\end{align}

Here, the second equality uses the decomposition of $\epsilon_{\mathrm{CF}}$ in ([32](https://arxiv.org/html/2605.24358#A5.E32)), equality ([37](https://arxiv.org/html/2605.24358#A5.E37)) follows from the change of variables $\boldsymbol{r}=\Phi(\boldsymbol{u})$, and inequality ([38](https://arxiv.org/html/2605.24358#A5.E38)) and equality ([39](https://arxiv.org/html/2605.24358#A5.E39)) follow from the definition of IPM. Then, by applying $\mathcal{W}$ for IPM, we have:
\(40\)\([39](https://arxiv.org/html/2605.24358#A5.E39)\)=BΦ⋅𝒲\(pΦt=0\(𝒓\),pΦt=1\(𝒓\)\)\.\(\\ref\{eq:cf4\}\)=B\_\{\\Phi\}\\cdot\\mathcal\{W\}\\biggr\(p^\{t=0\}\_\{\\Phi\}\(\\boldsymbol\{r\}\),p^\{t=1\}\_\{\\Phi\}\(\\boldsymbol\{r\}\)\\biggr\)\.∎
Let $\hat{\tau}$ denote the proposed individual treatment effect estimator and $\tau$ denote the true treatment effect. We have the following definitions.
###### Definition E\.4\.
Let $f_{1}(\Phi(\boldsymbol{u}))=f(\Phi(\boldsymbol{u}),1)$ and $f_{0}(\Phi(\boldsymbol{u}))=f(\Phi(\boldsymbol{u}),0)$. The individual treatment effect estimator for an individual on graph data is:

$$
\hat{\tau}=f_{1}\bigl(\Phi(\boldsymbol{u})\bigr)-f_{0}\bigl(\Phi(\boldsymbol{u})\bigr). \tag{41}
$$
###### Definition E\.5\.
The expected precision in estimation of heterogeneous effect (PEHE) is:

$$
\epsilon_{\mathrm{PEHE}}(f)=\int_{\mathcal{X}\times\mathcal{X}_{\mathbb{G}}}(\hat{\tau}-\tau)^{2}\,p(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})\,d\boldsymbol{x}\,d\boldsymbol{x}_{\mathbb{G}}. \tag{42}
$$
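As a minimal sketch of Definitions E.4 and E.5, the empirical counterpart of $\epsilon_{\mathrm{PEHE}}$ replaces the integral with a sample average over individuals. The function names `ite_estimate` and `empirical_pehe` and the toy numbers are illustrative assumptions, not part of the original method.

```python
from typing import List, Sequence


def ite_estimate(f1: Sequence[float], f0: Sequence[float]) -> List[float]:
    """Return tau_hat_i = f_1(Phi(u_i)) - f_0(Phi(u_i)) for each individual i."""
    if len(f1) != len(f0):
        raise ValueError("f1 and f0 must have the same length")
    return [a - b for a, b in zip(f1, f0)]


def empirical_pehe(tau_hat: Sequence[float], tau: Sequence[float]) -> float:
    """Sample average of (tau_hat - tau)^2, an empirical stand-in for Eq. (42)."""
    if len(tau_hat) != len(tau) or len(tau) == 0:
        raise ValueError("tau_hat and tau must be non-empty and equally long")
    return sum((h - t) ** 2 for h, t in zip(tau_hat, tau)) / len(tau)


# Hypothetical predicted outcomes under treatment and control for 3 individuals.
tau_hat = ite_estimate([2.0, 1.0, 0.5], [1.0, 1.0, 0.0])
# Hypothetical ground-truth effects (available only in simulated benchmarks).
pehe = empirical_pehe(tau_hat, [1.0, 1.0, 0.5])
```

The ground-truth effect $\tau$ is only observable in simulated or semi-synthetic data, which is why this quantity is typically reported on benchmarks with synthesized outcomes.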
The error bound for PEHE is stated in Theorem[E\.6](https://arxiv.org/html/2605.24358#A5.Thmtheorem6)\.
###### Theorem E\.6\.
Let $\mathbb{O}$ be a family of functions $o:\mathcal{R}\rightarrow\mathbb{R}$. Assume there exists a constant $B_{\Phi}>0$ such that, for $t\in\{0,1\}$, the function $\frac{1}{B_{\Phi}}\cdot\mathcal{L}_{f,\Phi}(\Phi^{-1}(\boldsymbol{r}),t)\in\mathbb{O}$. Then, we have:

$$
\epsilon_{\mathrm{PEHE}}(f)\leq 2\Bigl(\epsilon_{\mathrm{F}}^{t=0}(f,\Phi)+\epsilon_{\mathrm{F}}^{t=1}(f,\Phi)+B_{\Phi}\cdot\mathcal{W}\bigl(p_{\Phi}^{t=0}(\boldsymbol{r}),p_{\Phi}^{t=1}(\boldsymbol{r})\bigr)\Bigr). \tag{43}
$$
###### Proof\.
Let $m_{t}(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}}):=\mathbb{E}[y(t,\boldsymbol{t}_{\mathbb{G}})\mid\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}}]$ for $t\in\{1,0\}$, and $\tau=m_{1}(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})-m_{0}(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})$. We have:

$$
\begin{align*}
\epsilon_{\mathrm{PEHE}}(f)
&=\int_{\mathcal{X}\times\mathcal{X}_{\mathbb{G}}}(\hat{\tau}-\tau)^{2}\,p(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})\,d\boldsymbol{x}\,d\boldsymbol{x}_{\mathbb{G}}\\
&=\int_{\mathcal{X}\times\mathcal{X}_{\mathbb{G}}}\Bigl(\bigl[f_{1}(\Phi(\boldsymbol{u}))-f_{0}(\Phi(\boldsymbol{u}))\bigr]-\bigl[m_{1}(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})-m_{0}(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})\bigr]\Bigr)^{2}p(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})\,d\boldsymbol{x}\,d\boldsymbol{x}_{\mathbb{G}}\\
&=\int_{\mathcal{X}\times\mathcal{X}_{\mathbb{G}}}\Bigl(\bigl[f_{1}(\Phi(\boldsymbol{u}))-m_{1}(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})\bigr]+\bigl[m_{0}(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})-f_{0}(\Phi(\boldsymbol{u}))\bigr]\Bigr)^{2}p(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})\,d\boldsymbol{x}\,d\boldsymbol{x}_{\mathbb{G}}. \tag{44}
\end{align*}
$$

Based on the inequality $(a+b)^{2}\leq 2(a^{2}+b^{2})$, we have:

$$
\text{(44)}\leq 2\int_{\mathcal{X}\times\mathcal{X}_{\mathbb{G}}}\Bigl(\bigl[f_{1}(\Phi(\boldsymbol{u}))-m_{1}(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})\bigr]^{2}+\bigl[m_{0}(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})-f_{0}(\Phi(\boldsymbol{u}))\bigr]^{2}\Bigr)p(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})\,d\boldsymbol{x}\,d\boldsymbol{x}_{\mathbb{G}}. \tag{45}
$$

We use $p(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})=\int_{\mathcal{T}_{\mathbb{G}}}p(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}},\boldsymbol{t}_{\mathbb{G}})\,d\boldsymbol{t}_{\mathbb{G}}$, which can be further decomposed over $t=1$ and $t=0$, i.e., $\int_{\mathcal{T}_{\mathbb{G}}}p(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}},\boldsymbol{t}_{\mathbb{G}},t=1)\,d\boldsymbol{t}_{\mathbb{G}}+\int_{\mathcal{T}_{\mathbb{G}}}p(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}},\boldsymbol{t}_{\mathbb{G}},t=0)\,d\boldsymbol{t}_{\mathbb{G}}$. Then, replacing $(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}},\boldsymbol{t}_{\mathbb{G}})$ with $\boldsymbol{u}$, we have:
$$
\begin{align*}
\text{(45)}=2\Bigl(&\int_{\mathcal{U}}\bigl[f_{1}(\Phi(\boldsymbol{u}))-m_{1}(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})\bigr]^{2}p(\boldsymbol{u},t=1)\,d\boldsymbol{u}+\int_{\mathcal{U}}\bigl[m_{0}(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})-f_{0}(\Phi(\boldsymbol{u}))\bigr]^{2}p(\boldsymbol{u},t=0)\,d\boldsymbol{u}\\
&+\int_{\mathcal{U}}\bigl[f_{1}(\Phi(\boldsymbol{u}))-m_{1}(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})\bigr]^{2}p(\boldsymbol{u},t=0)\,d\boldsymbol{u}+\int_{\mathcal{U}}\bigl[m_{0}(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})-f_{0}(\Phi(\boldsymbol{u}))\bigr]^{2}p(\boldsymbol{u},t=1)\,d\boldsymbol{u}\Bigr)\\
=2\Bigl(&\int_{\mathcal{U}\times\{0,1\}}\bigl[f_{t}(\Phi(\boldsymbol{u}))-m_{t}(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})\bigr]^{2}p(\boldsymbol{u},t)\,d\boldsymbol{u}\,dt\\
&+\int_{\mathcal{U}\times\{0,1\}}\bigl[m_{t}(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})-f_{t}(\Phi(\boldsymbol{u}))\bigr]^{2}p(\boldsymbol{u},1-t)\,d\boldsymbol{u}\,dt\Bigr) \tag{46}\\
=2\Bigl(&\bigl(\epsilon_{\mathrm{F}}(f,\Phi)-\sigma_{y}\bigr)+\bigl(\epsilon_{\mathrm{CF}}(f,\Phi)-\sigma_{y}\bigr)\Bigr)\\
=2\Bigl(&\epsilon_{\mathrm{F}}(f,\Phi)+\epsilon_{\mathrm{CF}}(f,\Phi)-2\sigma_{y}\Bigr), \tag{47}
\end{align*}
$$

where equality ([46](https://arxiv.org/html/2605.24358#A5.E46)) holds as:

$$
\begin{align*}
\epsilon_{\mathrm{F}}(f,\Phi)&=\int_{\mathcal{U}\times\{0,1\}}\Bigl(f_{t}(\Phi(\boldsymbol{u}))-m_{t}(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})\Bigr)^{2}p(\boldsymbol{u},t)\,d\boldsymbol{u}\,dt+\sigma_{y},\\
\epsilon_{\mathrm{CF}}(f,\Phi)&=\int_{\mathcal{U}\times\{0,1\}}\Bigl(f_{t}(\Phi(\boldsymbol{u}))-m_{t}(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})\Bigr)^{2}p(\boldsymbol{u},1-t)\,d\boldsymbol{u}\,dt+\sigma_{y},\\
\sigma_{y}&=\int_{\mathcal{U}\times\{0,1\}\times\mathcal{Y}}\Bigl(m_{t}(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})-y(t,\boldsymbol{t}_{\mathbb{G}})\Bigr)^{2}\cdot p\bigl(y(t,\boldsymbol{t}_{\mathbb{G}})\mid\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}}\bigr)\,p(\boldsymbol{u},t)\,dy(t,\boldsymbol{t}_{\mathbb{G}})\,d\boldsymbol{u}\,dt,
\end{align*}
$$

where $\sigma_{y}$ is $0$ when $y(t,\boldsymbol{t}_{\mathbb{G}})$ is a deterministic function of $\boldsymbol{u}$ and $t$. Let $\sigma_{y}$ be zero for simplicity. We derive the expression for $\epsilon_{\mathrm{F}}(f,\Phi)$ below; the derivation for $\epsilon_{\mathrm{CF}}(f,\Phi)$ is similar:
$$
\begin{align*}
\epsilon_{\mathrm{F}}(f,\Phi)
&=\int_{\mathcal{U}\times\{0,1\}}\mathcal{L}_{f,\Phi}(\boldsymbol{u},t)\,p(\boldsymbol{u},t)\,d\boldsymbol{u}\,dt\\
&=\int_{\mathcal{U}\times\{0,1\}\times\mathcal{Y}}\Bigl(f_{t}(\Phi(\boldsymbol{u}))-y(t,\boldsymbol{t}_{\mathbb{G}})\Bigr)^{2}\cdot p\bigl(y(t,\boldsymbol{t}_{\mathbb{G}})\mid\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}}\bigr)\,p(\boldsymbol{u},t)\,dy(t,\boldsymbol{t}_{\mathbb{G}})\,d\boldsymbol{u}\,dt\\
&=\int_{\mathcal{U}\times\{0,1\}\times\mathcal{Y}}\Bigl[\Bigl(f_{t}(\Phi(\boldsymbol{u}))-m_{t}(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})\Bigr)+\Bigl(m_{t}(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})-y(t,\boldsymbol{t}_{\mathbb{G}})\Bigr)\Bigr]^{2}\cdot p\bigl(y(t,\boldsymbol{t}_{\mathbb{G}})\mid\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}}\bigr)\,p(\boldsymbol{u},t)\,dy(t,\boldsymbol{t}_{\mathbb{G}})\,d\boldsymbol{u}\,dt\\
&=\int_{\mathcal{U}\times\{0,1\}\times\mathcal{Y}}\Bigl(f_{t}(\Phi(\boldsymbol{u}))-m_{t}(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})\Bigr)^{2}\cdot p\bigl(y(t,\boldsymbol{t}_{\mathbb{G}})\mid\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}}\bigr)\,p(\boldsymbol{u},t)\,dy(t,\boldsymbol{t}_{\mathbb{G}})\,d\boldsymbol{u}\,dt\\
&\quad+\int_{\mathcal{U}\times\{0,1\}\times\mathcal{Y}}\Bigl(m_{t}(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})-y(t,\boldsymbol{t}_{\mathbb{G}})\Bigr)^{2}\cdot p\bigl(y(t,\boldsymbol{t}_{\mathbb{G}})\mid\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}}\bigr)\,p(\boldsymbol{u},t)\,dy(t,\boldsymbol{t}_{\mathbb{G}})\,d\boldsymbol{u}\,dt\\
&\quad+2\int_{\mathcal{U}\times\{0,1\}\times\mathcal{Y}}\Bigl(m_{t}(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})-y(t,\boldsymbol{t}_{\mathbb{G}})\Bigr)\cdot\Bigl(f_{t}(\Phi(\boldsymbol{u}))-m_{t}(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})\Bigr)\cdot p\bigl(y(t,\boldsymbol{t}_{\mathbb{G}})\mid\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}}\bigr)\,p(\boldsymbol{u},t)\,dy(t,\boldsymbol{t}_{\mathbb{G}})\,d\boldsymbol{u}\,dt \tag{48}\\
&=\int_{\mathcal{U}\times\{0,1\}}\Bigl(f_{t}(\Phi(\boldsymbol{u}))-m_{t}(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})\Bigr)^{2}p(\boldsymbol{u},t)\,d\boldsymbol{u}\,dt+\sigma_{y}+0,
\end{align*}
$$

where this equality holds because the final integral in equality ([48](https://arxiv.org/html/2605.24358#A5.E48)) is zero due to the definition of $m_{t}(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})$. Based on this equality, equality ([32](https://arxiv.org/html/2605.24358#A5.E32)), and Lemma [E.3](https://arxiv.org/html/2605.24358#A5.Thmtheorem3), we have:
$$
\text{(47)}\leq 2\Bigl(\epsilon_{\mathrm{F}}^{t=0}(f,\Phi)+\epsilon_{\mathrm{F}}^{t=1}(f,\Phi)+B_{\Phi}\cdot\mathcal{W}\bigl(p_{\Phi}^{t=0}(\boldsymbol{r}),p_{\Phi}^{t=1}(\boldsymbol{r})\bigr)\Bigr).
$$
∎
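For reference, the two elementary facts the proof relies on can be spelled out as follows. This is a short worked restatement under the definitions above, where $g(\boldsymbol{u},t):=f_{t}(\Phi(\boldsymbol{u}))-m_{t}(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})$ is shorthand introduced here only for illustration and $y$ abbreviates $y(t,\boldsymbol{t}_{\mathbb{G}})$.

$$
2(a^{2}+b^{2})-(a+b)^{2}=(a-b)^{2}\geq 0\quad\Longrightarrow\quad(a+b)^{2}\leq 2(a^{2}+b^{2}),
$$

$$
\int_{\mathcal{Y}}\bigl(m_{t}(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})-y\bigr)\,g(\boldsymbol{u},t)\,p(y\mid\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})\,dy
=g(\boldsymbol{u},t)\Bigl(m_{t}(\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}})-\mathbb{E}[y\mid\boldsymbol{x},\boldsymbol{x}_{\mathbb{G}}]\Bigr)=0.
$$

The first fact gives the step from (44) to (45). The second holds because $g$ does not depend on $y$ and $m_{t}$ is defined as the conditional expectation of $y$, which is why the cross term in the expansion of $\epsilon_{\mathrm{F}}(f,\Phi)$ vanishes.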
This bound indicates that, when confounding and interference biases are present, minimizing the discrepancy $\mathcal{W}$ between the treated and control groups in the joint representation $\boldsymbol{r}$ mitigates both biases. The discrepancy $\mathcal{W}$ is differentiable with respect to the map function $\Phi$ (Wang et al., [2023](https://arxiv.org/html/2605.24358#bib.bib122)), so it can be minimized by updating $\Phi$. In our approach, this is implemented by minimizing the loss function in Equation ([9](https://arxiv.org/html/2605.24358#S4.E9)).
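To make the discrepancy term concrete, the sketch below computes an empirical 1-Wasserstein distance between treated and control representations in the simplest setting: one-dimensional representations and equally sized groups, where the distance reduces to the mean absolute difference of the sorted samples. This is an illustrative simplification only. The helper name `wasserstein_1d` and the toy values are assumptions, and the computation of $\mathcal{W}$ on multi-dimensional joint representations used in the paper is not reproduced here.

```python
from typing import Sequence


def wasserstein_1d(treated: Sequence[float], control: Sequence[float]) -> float:
    """Empirical 1-Wasserstein distance for two equally sized 1-D samples.

    For equal sample sizes, the optimal coupling in one dimension matches
    the i-th smallest treated value with the i-th smallest control value.
    """
    if len(treated) != len(control) or len(treated) == 0:
        raise ValueError("samples must be non-empty and equally sized")
    a, b = sorted(treated), sorted(control)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical scalar representations r for treated and control units.
gap = wasserstein_1d([0.0, 1.0, 2.0], [3.0, 1.0, 2.0])
```

A representation that balances the two groups drives this quantity toward zero, which is the role the discrepancy term plays in the bound of Theorem E.6.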
## Appendix F Time complexity for the NIM layer
In this section, we analyze the complexity of the NIM layer, which is the main source of time complexity. For simplicity, let $c_{d}$ denote the input and output dimension of each NIM layer and $L$ the layer depth of every sub-network. Our analysis assumes the straightforward runtime of matrix multiplication and drops constant factors in big-O expressions. For the structure encoder of each individual, the time complexity is $O(|\mathbb{N}_{i}|\cdot L\cdot(c_{d})^{2})$, where $|\mathbb{N}_{i}|$ is the number of neighbors of individual $i$. For each partial attention mechanism of each individual, the time complexity is $O(|\mathbb{N}_{i}|\cdot L\cdot(c_{d})^{2}+|\mathbb{N}_{i}|\cdot c_{d})$. Combining them, the time complexity of the NIM layer is $O(|\mathbb{N}_{i}|\cdot L\cdot(c_{d})^{2}+|\mathbb{N}_{i}|\cdot c_{d})$ per individual.
By applying the neighbor sampling mechanism (Hamilton et al., [2017](https://arxiv.org/html/2605.24358#bib.bib76)), the time complexity of each partial attention mechanism can be reduced to $O(k\cdot L\cdot(c_{d})^{2}+k\cdot c_{d})$, where $k$ is the number of sampled neighbors ($|\mathbb{N}_{i}|\gg k$). For the structure encoder, we can use the pre-computation technique of Lin et al. ([2025](https://arxiv.org/html/2605.24358#bib.bib132)), which pre-computes the summary information and saves it for training MLPs. In this case, the time complexity of the structure encoder can be reduced to $O(L\cdot(c_{d})^{2})$. As a result, the time complexity of the NIM layer can be reduced to $O(k\cdot L\cdot(c_{d})^{2}+k\cdot c_{d}+L\cdot(c_{d})^{2})$ per individual. However, such techniques often involve a trade-off between efficiency and performance. Therefore, accelerating the NIM layer while preserving the performance of ITE estimation with DNE is a promising direction for future research.
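The operation counts above can be compared numerically. The following sketch evaluates the stated per-individual expressions with constant factors dropped; the function names and the example values ($|\mathbb{N}_{i}|=500$, $k=10$, $L=2$, $c_{d}=64$) are hypothetical and chosen only for illustration.

```python
def nim_cost_full(num_neighbors: int, depth: int, dim: int) -> int:
    """|N_i| * L * c_d^2 + |N_i| * c_d: the unreduced per-individual count."""
    return num_neighbors * depth * dim ** 2 + num_neighbors * dim


def nim_cost_reduced(k: int, depth: int, dim: int) -> int:
    """k * L * c_d^2 + k * c_d + L * c_d^2: neighbor sampling for the
    attention mechanisms plus a pre-computed structure-encoder summary."""
    return k * depth * dim ** 2 + k * dim + depth * dim ** 2


# Hypothetical setting: 500 neighbors, 10 sampled, depth 2, width 64.
full = nim_cost_full(num_neighbors=500, depth=2, dim=64)   # 4128000
reduced = nim_cost_reduced(k=10, depth=2, dim=64)          # 90752
```

In this hypothetical setting the reduced count is dominated by the $k\cdot L\cdot(c_{d})^{2}$ term, so the saving scales roughly with $|\mathbb{N}_{i}|/k$ when the neighborhood is large.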
## Appendix G Theoretical guarantee for PFOR
In this section, we discuss the assumption required by PFOR. We extend a monotonicity assumption from Wang et al. ([2023](https://arxiv.org/html/2605.24358#bib.bib122)).
###### Assumption G\.1\.
For all values $U=\boldsymbol{u}$ of the observed variables in the population of interest, $\mathbb{E}[Y\mid U=\boldsymbol{u},V=\boldsymbol{v},T=t]$ is monotonically increasing or decreasing with respect to $\boldsymbol{v}$.
When this assumption holds and units share a balanced joint representation $\boldsymbol{r}$ and an identical treatment $t$, the only variable reflecting the variation of $\boldsymbol{v}$ is the outcome. In this case, we can calibrate distances using outcome differences rather than the unobserved $\boldsymbol{v}$. Because the unobserved $\boldsymbol{v}$ affects outcomes under similar observed $\boldsymbol{u}$ and treatments, outcome differences serve as a reasonable calibration. PFOR cannot handle variables that add a constant effect to all units. However, in real scenarios, it is rare that different values of $\boldsymbol{v}$ only add a constant effect to the outcome (Wang et al., [2023](https://arxiv.org/html/2605.24358#bib.bib122)), so PFOR remains effective in a wide range of application scenarios.
## Appendix H Additional related work
Treatment effect estimation from observational data without interference. The potential outcome framework underlies many existing treatment effect estimators for non-graph data (Rubin, [1980](https://arxiv.org/html/2605.24358#bib.bib72), [2005](https://arxiv.org/html/2605.24358#bib.bib159)). Most existing methods in this context focus on mitigating confounding bias by minimizing the discrepancy between different treatment groups (Guo et al., [2023](https://arxiv.org/html/2605.24358#bib.bib163); Johansson et al., [2016](https://arxiv.org/html/2605.24358#bib.bib7), [2022](https://arxiv.org/html/2605.24358#bib.bib153); Shalit et al., [2017](https://arxiv.org/html/2605.24358#bib.bib4); Wang et al., [2023](https://arxiv.org/html/2605.24358#bib.bib122); Wen et al., [2023a](https://arxiv.org/html/2605.24358#bib.bib155); Yao et al., [2018](https://arxiv.org/html/2605.24358#bib.bib152)). Several studies consider estimating treatment effects with a budget constraint (Jesson et al., [2021b](https://arxiv.org/html/2605.24358#bib.bib173); Qin et al., [2021](https://arxiv.org/html/2605.24358#bib.bib147); Wen et al., [2025a](https://arxiv.org/html/2605.24358#bib.bib157), [b](https://arxiv.org/html/2605.24358#bib.bib154)), unobserved confounders (Frauen and Feuerriegel, [2022](https://arxiv.org/html/2605.24358#bib.bib164); Jesson et al., [2021a](https://arxiv.org/html/2605.24358#bib.bib158); Oprescu et al., [2023](https://arxiv.org/html/2605.24358#bib.bib161); Wang et al., [2022](https://arxiv.org/html/2605.24358#bib.bib145); Wu et al., [2022a](https://arxiv.org/html/2605.24358#bib.bib142)), or other complex situations (Frauen et al., [2024](https://arxiv.org/html/2605.24358#bib.bib174); Kaddour et al., [2021](https://arxiv.org/html/2605.24358#bib.bib162); Melnychuk et al., [2023](https://arxiv.org/html/2605.24358#bib.bib165); Shi et al., [2019](https://arxiv.org/html/2605.24358#bib.bib146); Wu et al., [2022b](https://arxiv.org/html/2605.24358#bib.bib143)).
These methods generally assume no interference among individuals (Rubin, [1980](https://arxiv.org/html/2605.24358#bib.bib72), [2005](https://arxiv.org/html/2605.24358#bib.bib159)). Although this assumption is reasonable in many scenarios, it does not always hold in real-world data such as graph data, where individuals are typically connected and exchange information. Other studies estimate treatment effects from graph data by using GNNs (Veličković et al., [2017](https://arxiv.org/html/2605.24358#bib.bib119); Welling and Kipf, [2016](https://arxiv.org/html/2605.24358#bib.bib17)) to capture networked confounders (Chu et al., [2021](https://arxiv.org/html/2605.24358#bib.bib54); Cui et al., [2024](https://arxiv.org/html/2605.24358#bib.bib141); Guo et al., [2020](https://arxiv.org/html/2605.24358#bib.bib51), [2021](https://arxiv.org/html/2605.24358#bib.bib127); Ma et al., [2021](https://arxiv.org/html/2605.24358#bib.bib1); Thorat et al., [2023](https://arxiv.org/html/2605.24358#bib.bib167); Veitch et al., [2019](https://arxiv.org/html/2605.24358#bib.bib140)), model high-dimensional treatments (Harada and Kashima, [2021](https://arxiv.org/html/2605.24358#bib.bib160)), or identify unreliable estimation (Wen et al., [2023b](https://arxiv.org/html/2605.24358#bib.bib156)), but they still do not take interference into account. For a comprehensive review of treatment effect estimation from observational data without interference, refer to recent surveys (Kuang et al., [2020](https://arxiv.org/html/2605.24358#bib.bib144); Yao et al., [2021](https://arxiv.org/html/2605.24358#bib.bib71)).
Treatment effect estimation from experimental data with interference. Several studies focus on treatment effect estimation from experimental data with interference (Aronow and Samii, [2017](https://arxiv.org/html/2605.24358#bib.bib9); Awan et al., [2020](https://arxiv.org/html/2605.24358#bib.bib169); Basse and Feller, [2018](https://arxiv.org/html/2605.24358#bib.bib170); Hudgens and Halloran, [2008](https://arxiv.org/html/2605.24358#bib.bib8); Liu and Hudgens, [2014](https://arxiv.org/html/2605.24358#bib.bib27); Rosenbaum, [2007](https://arxiv.org/html/2605.24358#bib.bib168); Tchetgen and VanderWeele, [2012](https://arxiv.org/html/2605.24358#bib.bib26); Toulis and Kao, [2013](https://arxiv.org/html/2605.24358#bib.bib172)). To collect experimental data, these studies first conduct experiments based on random treatment assignment strategies that they design, and then estimate treatment effects from the collected data. Experimental data are the ideal choice: they provide the most reliable evidence and serve as the gold standard in treatment effect estimation. However, collecting them is expensive and time-consuming. In contrast, estimating treatment effects from observational data provides a low-cost alternative.
Comparison of different methods for interference modeling. We compare different methods for interference modeling in Table [4](https://arxiv.org/html/2605.24358#A8.T4).

Table 4. Comparison of different methods. "✓" or "-" indicates whether the corresponding issue is considered or not. "Neighbor" represents neighbor interference, whereas "Networked" represents networked interference. DNE consists of two key sub-issues. (I) The importance of different neighbors in contributing to interference varies. (II) The scale of neighbors varies, leading to different levels of interference.

| Method | Confounder | Unobserved Confounder | Neighbor | Networked | Sub-issue (I) of DNE | Sub-issue (II) of DNE |
| --- | --- | --- | --- | --- | --- | --- |
| TARNet (Shalit et al., [2017](https://arxiv.org/html/2605.24358#bib.bib4)) | - | - | - | - | - | - |
| BNN (Johansson et al., [2016](https://arxiv.org/html/2605.24358#bib.bib7)) | ✓ | - | - | - | - | - |
| CFR (Shalit et al., [2017](https://arxiv.org/html/2605.24358#bib.bib4)) | ✓ | - | - | - | - | - |
| ESCFR (Wang et al., [2023](https://arxiv.org/html/2605.24358#bib.bib122)) | ✓ | ✓ | - | - | - | - |
| RERUM (He et al., [2024](https://arxiv.org/html/2605.24358#bib.bib177)) | ✓ | - | - | - | - | - |
| NetDeconf (Guo et al., [2020](https://arxiv.org/html/2605.24358#bib.bib51)) | ✓ | ✓ | - | - | - | - |
| NetEst (Jiang and Sun, [2022](https://arxiv.org/html/2605.24358#bib.bib103)) | ✓ | - | ✓ | - | - | - |
| SPNet (Huang et al., [2023](https://arxiv.org/html/2605.24358#bib.bib102)) | ✓ | - | ✓ | - | ✓ | - |
| GCN-HSIC (Ma and Tresp, [2021](https://arxiv.org/html/2605.24358#bib.bib12)) | ✓ | - | ✓ | ✓ | - | - |
| SAGE-HSIC (Ma and Tresp, [2021](https://arxiv.org/html/2605.24358#bib.bib12)) | ✓ | - | ✓ | ✓ | - | - |
| SITE (Lin et al., [2025](https://arxiv.org/html/2605.24358#bib.bib132)) | ✓ | - | ✓ | ✓ | - | - |
| HyperSCI (Ma et al., [2022b](https://arxiv.org/html/2605.24358#bib.bib53)) | ✓ | - | ✓ | ✓ | ✓ | - |
| HINITE (Lin et al., [2023](https://arxiv.org/html/2605.24358#bib.bib91)) | ✓ | - | ✓ | ✓ | ✓ | - |
| DWR (Zhao et al., [2024](https://arxiv.org/html/2605.24358#bib.bib176)) | ✓ | - | ✓ | - | ✓ | - |
| IDE-NET (Adhikari and Zheleva, [2025](https://arxiv.org/html/2605.24358#bib.bib175)) | ✓ | - | ✓ | ✓ | ✓ | - |
| CauGramer ([Wu et al.,](https://arxiv.org/html/2605.24358#bib.bib148)) | ✓ | ✓ | ✓ | - | ✓ | - |
| GITE (proposed) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
## Appendix I Details of dataset description
We describe the details of each dataset in this section.
Flickr dataset (Wang et al., [2013](https://arxiv.org/html/2605.24358#bib.bib69)): Flickr is an online social website where users share their images. The Flickr dataset (Wang et al., [2013](https://arxiv.org/html/2605.24358#bib.bib69)) is collected from this website. In this dataset, each individual is a user of Flickr. There are 7,575 individuals with 479,476 directed edges. Here, we aim to estimate how much recommending a hot photo to a user (treatment) affects the user's experience of this photo (outcome). In this case, users may share recommended photos with their friends (related individuals), which constitutes networked interference. We used the 1,206-dimensional embeddings of user profiles provided by Guo et al. ([2020](https://arxiv.org/html/2605.24358#bib.bib51)) as covariates.
BlogCatalog dataset (Li et al., [2015](https://arxiv.org/html/2605.24358#bib.bib74)): BlogCatalog is an online community where users post their blogs. The BlogCatalog (abbreviated as Blog) dataset (Li et al., [2015](https://arxiv.org/html/2605.24358#bib.bib74), [2019](https://arxiv.org/html/2605.24358#bib.bib73)) is collected from this online community. Every individual in this dataset is a user of BlogCatalog. There are 5,196 individuals with 343,486 edges. Here, we aim to estimate how much recommending a blog to a user (treatment) affects the user's experience of this blog (outcome). In this case, users may share recommended blogs with their friends (related individuals), which constitutes networked interference. We used the embeddings of each individual provided by Guo et al. ([2020](https://arxiv.org/html/2605.24358#bib.bib51)) as covariates.
Amazon negative dataset (He and McAuley, [2016](https://arxiv.org/html/2605.24358#bib.bib67)): The Amazon negative dataset (abbreviated as AMZ-N) was extracted from the Amazon dataset (He and McAuley, [2016](https://arxiv.org/html/2605.24358#bib.bib67)) by Rakesh et al. ([2018](https://arxiv.org/html/2605.24358#bib.bib52)) to study the effect of negative reviews on the sales of products and the issue of interference. Every unit in the AMZ-N dataset is an item, and every edge indicates that the two items are always purchased together by customers. The AMZ-N dataset contains 14,538 units with 15,011 directed edges. The treatment $t\in\{0,1\}$ depends on the number of negative reviews: $t=1$ if a unit has more than three negative reviews, and $t=0$ if it has fewer than three negative reviews (Rakesh et al., [2018](https://arxiv.org/html/2605.24358#bib.bib52)). The covariate $\boldsymbol{x}$ (with 300 features) of each unit is created by applying the doc2vec method (Le and Mikolov, [2014](https://arxiv.org/html/2605.24358#bib.bib93)) to encode the review of the user. We used the covariates, treatments, outcomes, and ITE provided by Rakesh et al. ([2018](https://arxiv.org/html/2605.24358#bib.bib52)). As the values of $y$ fluctuate over a large range, we applied z-score normalization to $y$ during the training and testing phases, following Lin et al. ([2025](https://arxiv.org/html/2605.24358#bib.bib132)).
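A minimal sketch of z-score normalization of outcomes follows. It assumes the mean and standard deviation are computed once from a reference set of outcomes (here, hypothetical training outcomes) and reused for test outcomes. The text does not state which statistics were used, so this choice, the use of the population standard deviation, and the function names are all assumptions.

```python
import statistics
from typing import List, Sequence, Tuple


def fit_zscore(y_ref: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, population standard deviation) of the reference outcomes."""
    mean = statistics.fmean(y_ref)
    std = statistics.pstdev(y_ref)
    if std == 0.0:
        raise ValueError("outcomes are constant; z-score is undefined")
    return mean, std


def apply_zscore(y: Sequence[float], mean: float, std: float) -> List[float]:
    """Map each outcome to (y - mean) / std."""
    return [(v - mean) / std for v in y]


# Hypothetical outcomes with a wide range of values.
y_train = [10.0, 200.0, 3500.0, 42.0, 980.0]
mean, std = fit_zscore(y_train)
z_train = apply_zscore(y_train, mean, std)
z_test = apply_zscore([150.0, 2700.0], mean, std)
```

Reusing the same `mean` and `std` for both phases keeps the outcomes on one scale, so errors measured on normalized test outcomes are comparable to the training loss.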
## Appendix J Simulation
We simulated treatments and outcomes for the Flickr and Blog datasets\.
Treatment simulation. We generated treatments for the Flickr and Blog datasets as follows:

$$
t_{i}\sim\text{Ber}\Bigl(\text{sigmoid}(0.5\cdot\boldsymbol{z}_{i}+0.5\cdot\boldsymbol{z}_{\mathbb{X}_{i}})+\epsilon_{t_{i}}\Bigr). \tag{49}
$$

Here, $\boldsymbol{z}_{i}=\boldsymbol{w}_{t}^{\top}\boldsymbol{x}_{i}$ represents individual confounders, and $\boldsymbol{z}_{\mathbb{X}_{i}}=\textrm{AGG}_{\mathbb{X}}(\tilde{\boldsymbol{x}}_{\mathbb{G}_{i}})$ represents networked confounders from related individuals of $i$, with $\tilde{\boldsymbol{x}}_{\mathbb{G}_{i}}=\{\tilde{\boldsymbol{x}}_{k}\mid k\in\mathbb{G}_{i}\}$. Importantly, to properly simulate interference from related individuals of $i$ with DNE, we rescale the covariates as $\tilde{\boldsymbol{x}}_{k}=\boldsymbol{w}_{\mathbb{G}}^{\top}(\boldsymbol{x}_{k}+\tilde{\eta}_{k}\cdot\boldsymbol{x}_{k})$, where $\tilde{\eta}_{k}=\log(\tilde{d}_{k})/(\sum_{k=1}^{N}\log(\tilde{d}_{k}))$. Every element of $\boldsymbol{w}_{t}$ and $\boldsymbol{w}_{\mathbb{G}}$ was generated from a normal distribution $\mathcal{N}(0,1)$ or a uniform distribution $\mathcal{U}(-1,1)$, and $\epsilon_{t_{i}}$ is random noise generated from a normal distribution $\mathcal{N}(0,1)$. We computed $\textrm{AGG}_{\mathbb{X}}(\tilde{\boldsymbol{x}}_{\mathbb{G}_{i}})$ by repeating the summary computation of the one-hop neighbor information $\sum_{k\in\mathbb{N}_{i}}e_{ij}\tilde{\boldsymbol{x}}_{k}$ three times for every individual, where $e_{ij}$ was generated from a uniform distribution $\mathcal{U}(0,1)$.
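The treatment simulation can be sketched in plain Python as below. This is an illustration under several labeled assumptions, not the authors' implementation: $\tilde{d}_{k}$ is taken to be the neighbor count plus one, "repeating the summary computation three times" is read as three rounds of weighted one-hop summation, the Bernoulli parameter is clipped to $[0,1]$ after adding noise, and the weights are drawn from $\mathcal{N}(0,1)$. The function name `simulate_treatments` and the toy graph are hypothetical.

```python
import math
import random
from typing import List


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def dot(a: List[float], b: List[float]) -> float:
    return sum(p * q for p, q in zip(a, b))


def simulate_treatments(x: List[List[float]], neighbors: List[List[int]],
                        rounds: int = 3, seed: int = 0) -> List[int]:
    rng = random.Random(seed)
    n, dim = len(x), len(x[0])
    w_t = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    w_g = [rng.gauss(0.0, 1.0) for _ in range(dim)]

    # eta_k = log(d_k) / sum_k log(d_k); ASSUMPTION: d_k = |N_k| + 1.
    log_deg = [math.log(len(neighbors[k]) + 1) for k in range(n)]
    total = sum(log_deg)
    eta = [ld / total if total > 0 else 0.0 for ld in log_deg]

    # x_tilde_k = w_G^T (x_k + eta_k * x_k), a scalar per individual.
    z_net = [(1.0 + eta[k]) * dot(w_g, x[k]) for k in range(n)]

    # One weight e ~ U(0, 1) per directed edge, reused in every round.
    e = {(i, k): rng.uniform(0.0, 1.0) for i in range(n) for k in neighbors[i]}
    for _ in range(rounds):  # ASSUMPTION: three rounds of one-hop summation
        z_net = [sum(e[(i, k)] * z_net[k] for k in neighbors[i])
                 for i in range(n)]

    treatments = []
    for i in range(n):
        z_ind = dot(w_t, x[i])
        p = sigmoid(0.5 * z_ind + 0.5 * z_net[i]) + rng.gauss(0.0, 1.0)
        p = min(1.0, max(0.0, p))  # ASSUMPTION: clip to a valid probability
        treatments.append(1 if rng.random() < p else 0)
    return treatments


# Hypothetical 3-node graph with 2-dimensional covariates.
x_toy = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
nbrs_toy = [[1, 2], [0], [0, 1]]
t_toy = simulate_treatments(x_toy, nbrs_toy)
```

Because the noise term has unit variance, the clipped probability is frequently exactly 0 or 1, so in this reading the noise contributes much of the randomness in treatment assignment.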
Outcome simulation. We generated outcomes for the Flickr and Blog datasets as follows:

$$y_i = f_{\boldsymbol{x}}(\boldsymbol{x}_i) + f_t(t_i, \boldsymbol{x}_i) + f_{\mathbb{G}}(\boldsymbol{x}_{\mathbb{G}_i}, \boldsymbol{t}_{\mathbb{G}_i}) + \epsilon_{y_i}. \tag{50}$$

Here, $f_{\boldsymbol{x}}(\boldsymbol{x}_i) = \boldsymbol{w}_{\boldsymbol{x}}^{\top}\boldsymbol{x}_i$ is the synthetic outcome of individual $i$ without the treatment effect and the effect from related individuals, where every element of $\boldsymbol{w}_{\boldsymbol{x}}$ independently follows $\mathcal{N}(0,1)$. The term $f_t(t_i, \boldsymbol{x}_i) = t_i \cdot \boldsymbol{w}_1^{\top}\boldsymbol{x}_i$ synthesizes the ITE, where $\boldsymbol{w}_1$ also follows $\mathcal{N}(0,1)$. The term

$$f_{\mathbb{G}}(\boldsymbol{x}_{\mathbb{G}_i}, \boldsymbol{t}_{\mathbb{G}_i}) = g_{\mathbb{X}}(\boldsymbol{x}_{\mathbb{G}_i}) + g_{\mathbb{T}}(\boldsymbol{t}_{\mathbb{G}_i}) = \text{AGG}_{\mathbb{X}}(\tilde{\boldsymbol{x}}_{\mathbb{G}_i}) + \text{AGG}_{\mathbb{T}}(\tilde{\boldsymbol{t}}_{\mathbb{G}_i})$$

synthesizes the effect from the related individuals of individual $i$ on the graph, where $\tilde{\boldsymbol{t}}_{\mathbb{G}_i} = \{\tilde{t}_k \mid k \in \mathbb{G}_i\}$ and $\tilde{t}_k = t_k + \tilde{\eta}_k \cdot t_k$, similar to the rescaling operation for covariates. We computed $\text{AGG}_{\mathbb{T}}(\tilde{\boldsymbol{t}}_{\mathbb{G}_i})$ by repeating the summary computation of the one-hop neighbor information, $\sum_{k \in \mathbb{N}_i} e_{ij}\tilde{t}_k$, three times for every individual. Moreover, $\epsilon_{y_i}$ is random noise generated from a normal distribution $\mathcal{N}(0,1)$.
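As a rough illustration of Equation (50), the following Python sketch assembles the four outcome terms. It is a sketch under assumptions rather than the authors' code: all function and argument names are hypothetical, the rescaled neighbor covariates `x_tilde`, the scale factors `eta`, and the edge weights are assumed to be precomputed and shared with the treatment simulation, and the repeated aggregation is again read as three consecutive weighted one-hop sums.

```python
import random


def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wj * xj for wj, xj in zip(w, x))


def aggregate(values, neighbors, edge_weight, repeats=3):
    """Apply the weighted one-hop neighbor sum `repeats` times in a row."""
    current = list(values)
    for _ in range(repeats):
        current = [
            sum(edge_weight[(i, k)] * current[k] for k in neighbors[i])
            for i in range(len(current))
        ]
    return current


def simulate_outcomes(x, t, x_tilde, eta, neighbors, edge_weight, rng):
    """Return (outcomes, effects) following the additive form of Equation (50)."""
    n, dim = len(x), len(x[0])
    w_x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    w_1 = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    # Rescale treatments the same way as covariates: t_k + eta_k * t_k.
    t_tilde = [(1.0 + eta[k]) * t[k] for k in range(n)]
    g_x = aggregate(x_tilde, neighbors, edge_weight)  # AGG_X term
    g_t = aggregate(t_tilde, neighbors, edge_weight)  # AGG_T term
    outcomes, effects = [], []
    for i in range(n):
        effect = dot(w_1, x[i])  # f_t(1, x_i) - f_t(0, x_i)
        noise = rng.gauss(0.0, 1.0)
        outcomes.append(dot(w_x, x[i]) + t[i] * effect + g_x[i] + g_t[i] + noise)
        effects.append(effect)
    return outcomes, effects
```

Because the individual treatment term is $t_i \cdot \boldsymbol{w}_1^{\top}\boldsymbol{x}_i$, flipping only $t_i$ while holding the neighbors fixed changes $y_i$ by exactly `effects[i]`, which is why the second return value can serve as the ground-truth ITE in this sketch.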
## Appendix K Details of baseline methods
We compared our methods with baseline methods from the following four categories:
ITE estimators for non-graph data. Balancing neural network (BNN) (Johansson et al., [2016](https://arxiv.org/html/2605.24358#bib.bib7)), counterfactual regression with maximum mean discrepancy (CFR-MMD) (Shalit et al., [2017](https://arxiv.org/html/2605.24358#bib.bib4)), and counterfactual regression with Wasserstein discrepancy (CFR-Wass) (Shalit et al., [2017](https://arxiv.org/html/2605.24358#bib.bib4)) address confounders by minimizing, respectively, a distribution discrepancy, the maximum mean discrepancy (MMD), and the Wasserstein discrepancy between the control and treated groups. TARNet (Shalit et al., [2017](https://arxiv.org/html/2605.24358#bib.bib4)) has the same model architecture as CFR but applies no measure to address confounders. Entire space CFR (ESCFR) (Wang et al., [2023](https://arxiv.org/html/2605.24358#bib.bib122)) has the same architecture and representation balancing technique as CFR-Wass but modifies the calculation of the Wasserstein distance for mini-batch training. RERUM (He et al., [2024](https://arxiv.org/html/2605.24358#bib.bib177)) includes an outcome ranking loss and an ITE ranking loss to enhance the ranking ability of the model. Following He et al. ([2024](https://arxiv.org/html/2605.24358#bib.bib177)), we implement it by combining the outcome ranking loss and the ITE ranking loss with CFR-MMD. All of these methods ignore interference and networked confounders.
ITE estimators for graph data that do not address interference. Network deconfounder (NetDeconf) (Guo et al., [2020](https://arxiv.org/html/2605.24358#bib.bib51)) models networked confounders by GCN (Welling and Kipf, [2016](https://arxiv.org/html/2605.24358#bib.bib17)) without modeling interference. Moreover, it balances representations by the Wasserstein discrepancy.
ITE estimators for graph data that address neighbor interference. Networked causal effect estimation (NetEst) (Jiang and Sun, [2022](https://arxiv.org/html/2605.24358#bib.bib103)) models neighbor confounders by a single GCN layer and neighbor interference by a mean aggregation. NetEst balances representations by adversarial learning. SPNet (Huang et al., [2023](https://arxiv.org/html/2605.24358#bib.bib102)) models networked confounders by GCN and neighbor interference by GAT (Veličković et al., [2017](https://arxiv.org/html/2605.24358#bib.bib119)). SPNet balances representations by minimizing the Wasserstein discrepancy between different treatment groups. DWR (Zhao et al., [2024](https://arxiv.org/html/2605.24358#bib.bib176)) models neighbor confounders and neighbor interference by GAT (Veličković et al., [2017](https://arxiv.org/html/2605.24358#bib.bib119)). CauGramer ([Wu et al.,](https://arxiv.org/html/2605.24358#bib.bib148)) uses GCN to model interference and combines it with a Transformer to discover interference from unknown neighbors.
ITE estimators for graph data that address networked interference. GCN-HSIC (Ma and Tresp, [2021](https://arxiv.org/html/2605.24358#bib.bib12)) models networked interference by GCN and balances representations by the Hilbert-Schmidt independence criterion (HSIC) (Gretton et al., [2005](https://arxiv.org/html/2605.24358#bib.bib16)). SAGE-HSIC (Ma and Tresp, [2021](https://arxiv.org/html/2605.24358#bib.bib12)) uses the same representation balancing technique as GCN-HSIC but replaces the GCN of GCN-HSIC with the mean aggregator of GraphSAGE (Hamilton et al., [2017](https://arxiv.org/html/2605.24358#bib.bib76)). Scalable individual treatment effect estimator (SITE) (Lin et al., [2025](https://arxiv.org/html/2605.24358#bib.bib132)) reduces the computation in the aggregation of GCN-HSIC by pre-aggregating related individuals and balances representations by MMD regularization. IDENet (Adhikari and Zheleva, [2025](https://arxiv.org/html/2605.24358#bib.bib175)) combines an ego MLP, GCN, and a similarity score to model networked interference. HyperSCI (Ma et al., [2022b](https://arxiv.org/html/2605.24358#bib.bib53)) and HINITE (Lin et al., [2023](https://arxiv.org/html/2605.24358#bib.bib91)) are designed for more complex graphs: hypergraphs and heterogeneous graphs, respectively. Both can be extended to estimate treatment effects from ordinary graphs, in which case they use GAT (Veličković et al., [2017](https://arxiv.org/html/2605.24358#bib.bib119)) to model networked interference. HyperSCI balances representations by the Wasserstein discrepancy, and HINITE balances representations by HSIC regularization.
## Appendix L Compute resources
The compute resources used in our experiments are as follows:
- Operating system: Ubuntu 22.04 LTS.
- GPU: H100 with 80GB GPU memory.
- CPU: AMD EPYC 7343 (16C/32T, 3.2GHz, 128M Cache).
## Appendix M Implementation details
For the representation learning module, $\boldsymbol{z}^{(0)}_{\mathbb{S}_i}$ is initialized as a one-dimensional vector whose single element is 1, $\boldsymbol{z}^{(0)}_{\mathbb{X}_i}$ is initialized as $\boldsymbol{x}_i$, and $\boldsymbol{z}^{(0)}_{\mathbb{T}_i}$ is initialized as $t_i$. The importance estimation mechanism $a$ of IPAtt and SPAtt in Equation ([4](https://arxiv.org/html/2605.24358#S4.E4)) is implemented by GAT (Veličković et al., [2017](https://arxiv.org/html/2605.24358#bib.bib119)) by default in our experiments. We conducted experiments with the attention mechanism of Transformer (Vaswani et al., [2017](https://arxiv.org/html/2605.24358#bib.bib5); Ying et al., [2021](https://arxiv.org/html/2605.24358#bib.bib151)) in Appendix [N.3](https://arxiv.org/html/2605.24358#A14.SS3). Following Veličković et al. ([2017](https://arxiv.org/html/2605.24358#bib.bib119)) and Vaswani et al. ([2017](https://arxiv.org/html/2605.24358#bib.bib5)), we use the softmax operation for the normalization operation in Equation ([4](https://arxiv.org/html/2605.24358#S4.E4)). We apply layer normalization (Ba et al., [2016](https://arxiv.org/html/2605.24358#bib.bib171)) to the proxy module. We computed the weight $\mathring{\pi}_{\bullet}^{(l)} = \exp(\mathring{\pi}_{\bullet}^{(l)}) / \bigl(\exp(\pi_{\bullet}^{(l)}) + \exp(1 - \pi_{\bullet}^{(l)})\bigr)$ for $\bullet \in \{\mathbb{X}, \mathbb{T}\}$. For simplicity, we set $\pi_{\eta} = 1$ in the message amplifier of GITE. We conducted experiments with a different strategy for $\pi_{\eta}$, i.e., setting $\pi_{\eta}$ as a learnable parameter, in Appendix [N.4](https://arxiv.org/html/2605.24358#A14.SS4).
Following Welling and Kipf ([2016](https://arxiv.org/html/2605.24358#bib.bib17)), Guo et al. ([2020](https://arxiv.org/html/2605.24358#bib.bib51)), Ma and Tresp ([2021](https://arxiv.org/html/2605.24358#bib.bib12)), Jiang and Sun ([2022](https://arxiv.org/html/2605.24358#bib.bib103)), and Lin et al. ([2025](https://arxiv.org/html/2605.24358#bib.bib132)), we consider a transductive setting: the graph structure $\boldsymbol{A}$, covariates $\boldsymbol{X}$, and treatments $\boldsymbol{T}$ are given during the training, validation, and testing phases, whereas only the observed outcomes of individuals in the training set are provided to the graph-based methods during training. Importantly, the proposed methods are not limited to the transductive setting. Researchers who want to consider an inductive setting can, during the training phase, mask the covariates of individuals in the validation and test sets and the edges connecting to those individuals.
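The inductive masking described above might look like the following minimal Python sketch. The function name and the data layout (an edge list of index pairs and a dictionary of covariates) are hypothetical choices made for illustration; the paper does not prescribe an implementation.

```python
def mask_for_inductive_training(edges, covariates, held_out):
    """Return the edges and covariates visible during training.

    edges: iterable of (i, j) node-index pairs.
    covariates: dict mapping a node index to its covariate vector.
    held_out: node indices that belong to the validation and test sets.
    """
    hidden = set(held_out)
    # Drop every edge that touches a held-out individual.
    visible_edges = [(i, j) for i, j in edges
                     if i not in hidden and j not in hidden]
    # Drop the covariates of held-out individuals.
    visible_covariates = {i: x for i, x in covariates.items()
                          if i not in hidden}
    return visible_edges, visible_covariates
```

At validation and test time the full graph and covariates would be restored, so the model is evaluated on individuals and edges it never saw during training.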
We use the Adam optimizer (Kingma and Ba, [2015](https://arxiv.org/html/2605.24358#bib.bib21)) with a learning rate of 0.001 and a decay rate of 0.001 to train the proposed methods and set the maximum number of training iterations to 2,000. We set the number of layers of the proxy component to 1 for GITE and 3 for GITEv, and the number of layers of all other components to 3. We set the dimension of the proxy component to 100 or 300 for GITE and to $(100, 200, 300)$ for GITEv. We set the dimension of the layers of the other components to 100. We adopted ReLU as the activation function. We searched for hyperparameters by checking the results on the validation set. Specifically, we searched $\lambda$ over $\{0.001, 0.01, 0.1, 0.2, 0.5, 1.0\}$, $\beta$ over $\{0.001, 0.01, 0.1, 0.2, 0.5, 1.0\}$, and $\lambda_D$ over $\{0.1, 0.5, 1.0, 5.0, 10.0\}$. For GITEv, we also searched $\lambda_P$ over $\{0.1, 0.5, 1.0, 5.0, 10.0\}$. Early stopping and dropout were applied to the proposed methods to avoid overfitting.
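The validation-based hyperparameter search for GITE can be sketched as an exhaustive grid over the three ranges listed above. Exhaustive enumeration is our assumption, since the text only states that candidates were compared on the validation set; `grid_search` and the `validation_error` callback (which would train a model with a given configuration and return its validation error) are hypothetical names.

```python
from itertools import product

# Candidate values for GITE as listed in the text.
SEARCH_SPACE = {
    "lambda": [0.001, 0.01, 0.1, 0.2, 0.5, 1.0],
    "beta": [0.001, 0.01, 0.1, 0.2, 0.5, 1.0],
    "lambda_D": [0.1, 0.5, 1.0, 5.0, 10.0],
}


def grid_search(validation_error, space=SEARCH_SPACE):
    """Return the configuration with the lowest validation error."""
    names = list(space)
    best_config, best_error = None, float("inf")
    for values in product(*(space[name] for name in names)):
        config = dict(zip(names, values))
        error = validation_error(config)  # hypothetical: train, then validate
        if error < best_error:
            best_config, best_error = config, error
    return best_config, best_error
```

This grid contains 6 × 6 × 5 = 180 configurations; adding the five candidate values of $\lambda_P$ for GITEv would multiply that by five.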
We used default hyperparameters or searched hyperparameters from the ranges suggested in the literature to implement the baseline methods\. We also applied early stopping and dropout to the baselines for all datasets to avoid overfitting\.
## Appendix N Additional experiments
### N.1. Additional sensitivity experiments
We conducted sensitivity experiments with different values of $\lambda_D$ in the range $\{0.1, 0.5, 1.0, 5.0, 10.0\}$ for GITE and GITEv. Results are shown in Figure [6](https://arxiv.org/html/2605.24358#A14.F6). They reveal no significant performance change across different values of $\lambda_D$.
We conducted sensitivity experiments with different values of $\lambda_P$ in the range $\{0.1, 0.5, 1.0, 5.0, 10.0\}$ for GITEv. Results are shown in Figure [6](https://arxiv.org/html/2605.24358#A14.F6). They reveal a performance degradation for GITEv when $\lambda_P$ is set to a large value ($>5$), which suggests searching the value of $\lambda_P$ in the range $(0, 5]$.
We conducted sensitivity experiments with different layer dimensions of the proxy module in the range $\{100, 300\}$ for GITE. Results are shown in Table [5](https://arxiv.org/html/2605.24358#A14.T5). They show that both dimensions are reasonable for GITE.
Figure 6. Results (mean and standard errors) of additional sensitivity experiments for hyperparameters $\lambda_D$ and $\lambda_P$. Results are averaged over ten executions. Panels (a)-(d) show AMZ-N, panels (e)-(h) show Flickr, and panels (i)-(l) show Blog; within each dataset, the four panels report, in order, $\lambda_D$ with $\sqrt{\epsilon_{\text{MSE}}}$, $\lambda_D$ with $\sqrt{\epsilon_{\text{PEHE}}}$, $\lambda_P$ with $\sqrt{\epsilon_{\text{MSE}}}$, and $\lambda_P$ with $\sqrt{\epsilon_{\text{PEHE}}}$.

Table 5. Results (mean and standard errors) of experiments with different dimensions of different layers on the test sets. Results are averaged over ten executions.
### N.2. RQ 4: is the training time of the proposed methods reasonable
This section investigates the answer to RQ 4\. Experimental results are presented in Table[6](https://arxiv.org/html/2605.24358#A14.T6)\. Results show that the training time for the proposed methods is reasonable\.
We discuss the time complexity of the NIM layer in Appendix [F](https://arxiv.org/html/2605.24358#A6). Training efficiency might be further improved by applying acceleration techniques for GNNs and partial attention mechanisms, such as neighbor sampling (Hamilton et al., [2017](https://arxiv.org/html/2605.24358#bib.bib76)). However, such techniques often involve a trade-off between efficiency and performance. Therefore, accelerating the implementation of the NIM layer while preserving the performance of ITE estimation with DNE is a promising direction for future research.
Table 6. Training time (in minutes) of GITE and GITEv for one execution.
### N.3. RQ 5: can the IPAtt and SPAtt mechanisms be implemented with a different technology other than GAT
This section investigates the answer to RQ 5. In Table [7](https://arxiv.org/html/2605.24358#A14.T7), we compare the performance of GITE with two different attention mechanisms: GAT (Veličković et al., [2017](https://arxiv.org/html/2605.24358#bib.bib119)) and the attention mechanism of Transformer based on query and key vectors (QK-based AT) (Vaswani et al., [2017](https://arxiv.org/html/2605.24358#bib.bib5); Ying et al., [2021](https://arxiv.org/html/2605.24358#bib.bib151)). Results indicate that the proposed IPAtt and SPAtt mechanisms achieve comparable performance with both attention mechanisms. This reveals that the IPAtt and SPAtt mechanisms can be implemented using an attention mechanism other than GAT and do not rely on a specific attention mechanism.
Table 7. Results (mean and standard errors) of experiments with different attention mechanisms on the test sets. Results are averaged over ten executions. GITE (GAT) denotes GITE with GAT for the IPAtt and SPAtt mechanisms, whereas GITE (QK-AT) denotes GITE with the QK-based attention mechanism for the IPAtt and SPAtt mechanisms.

Table 8. Results (mean and standard errors) of experiments with different strategies for the hyperparameter $\pi_{\eta}$ on the test sets. Results are averaged over ten executions. $\pi_{\eta} = 1$ denotes that GITE sets $\pi_{\eta}$ as a hyperparameter with a fixed value of 1, whereas Auto denotes that GITE sets $\pi_{\eta}$ as a learnable parameter with an initial value of 1.
### N.4. RQ 6: can the message amplifier be implemented using different strategies
This section investigates the answer to RQ 6. As introduced in Equation ([6](https://arxiv.org/html/2605.24358#S4.E6)), we have two strategies for $\pi_{\eta}$ of the message amplifier: (I) setting it as a hyperparameter and (II) setting it as an adaptive learnable parameter. In Table [8](https://arxiv.org/html/2605.24358#A14.T8), we compare the performance of the two strategies for $\pi_{\eta}$. Results indicate that the two strategies achieve comparable performance in outcome prediction and ITE estimation, but a learnable $\pi_{\eta}$ can slightly improve the performance of GITE.
## Appendix O Impact
This paper proposes GITE, a method for estimating treatment effects from observational graph data\. Although randomized controlled trials \(RCTs\) remain the gold standard for estimating individual treatment effects, they are often costly and time\-consuming\. In contrast, observational graph data offers a low\-cost alternative\. As such, the proposed approach holds potential for applications in various domains, including decision\-making in commerce and medicine\.
## Appendix P Future work
We introduce four promising directions for future research based on this study. First, ITE estimation in the presence of DNE could be extended to more complex graphs, such as hypergraphs (Ma et al., [2022b](https://arxiv.org/html/2605.24358#bib.bib53)) and heterogeneous graphs (Lin et al., [2023](https://arxiv.org/html/2605.24358#bib.bib91)), in which DNE is more complex than in ordinary graphs. Second, the proposed GITE could be applied to support decision-making in areas such as medicine (Chang et al., [2023](https://arxiv.org/html/2605.24358#bib.bib115); Ma et al., [2022a](https://arxiv.org/html/2605.24358#bib.bib109); Schnitzer, [2022](https://arxiv.org/html/2605.24358#bib.bib48)) and commerce (Nabi et al., [2022](https://arxiv.org/html/2605.24358#bib.bib47)), as well as to task-specific applications such as recommendation systems (Li et al., [2023a](https://arxiv.org/html/2605.24358#bib.bib150), [b](https://arxiv.org/html/2605.24358#bib.bib149)). Third, as discussed in Appendix [N.2](https://arxiv.org/html/2605.24358#A14.SS2), although GITE exhibits reasonable training time, developing more efficient implementations of the NIM layer without performance degradation is a valuable direction. Finally, our methods face limitations under severe violations of the overlap assumption; addressing this constraint is a promising direction for future work.