A Unified Perspective for Learning Graph Representations Across Multi-Level Abstractions

arXiv cs.LG 05/14/26, 04:00 AM Papers
graph-self-supervised-learning contrastive-learning multi-level-abstractions graph-representations multi-task-learning self-weighting-mechanism
Summary
This paper proposes a unified contrastive framework for learning graph representations across multiple abstraction levels (node, proximity, cluster, graph) with a parameter-free self-weighting mechanism that adaptively assigns weights to similarity scores, outperforming state-of-the-art on downstream tasks like classification, clustering, and link prediction.
arXiv:2605.12685v1 Announce Type: new Abstract: Graph Self-Supervised Learning (GSSL) has emerged as a powerful paradigm for generating high-quality representations for graph-structured data. While multi-scale graph contrastive learning has received increasing attention, many existing methods still predominantly focus on a single graph abstraction level. To address this limitation, we propose a unified contrastive framework that can target node-level, proximity-level, cluster-level, and graph-level information and integrate them through a linear combination of similarity scores on positive pairs and dissimilarity scores (i.e., similarity scores on negative pairs). Furthermore, current approaches typically assign uniform penalty strengths to all examples, which reduces optimization flexibility and leads to ambiguous convergence status. To overcome this, we introduce a novel parameter-free fine-grained self-weighting mechanism that adaptively assigns weights to individual similarity and dissimilarity scores. The proposed mechanism emphasizes the scores that deviate significantly from their target values. Our approach not only enhances optimization flexibility but also eliminates the computational overhead of hyperparameter tuning in conventional multi-task GSSL methods. Comprehensive experiments on real-world datasets show that our methods consistently outperform state-of-the-art approaches across downstream tasks, including classification, clustering, and link prediction, in both single-level and multi-level scenarios.
Cached at: 05/14/26, 06:17 AM
# A Unified Perspective for Learning Graph Representations Across Multi-Level Abstractions
Source: [https://arxiv.org/html/2605.12685](https://arxiv.org/html/2605.12685)
Mohamed Mahmoud Amar, Nairouz Mrabah, Mohamed Bouguessa, Abdoulaye Baniré Diallo

M. M. Amar, N. Mrabah, M. Bouguessa, and A. B. Diallo are with the Department of Computer Science, University of Quebec at Montreal, Montreal, QC, Canada. E-mails: amar.mohamed_mahmoud@courrier.uqam.ca, mrabah.nairouz@gmail.com, bouguessa.mohamed@uqam.ca, diallo.abdoulaye@uqam.ca

###### Abstract

Graph Self-Supervised Learning (GSSL) has emerged as a powerful paradigm for generating high-quality representations for graph-structured data. While multi-scale graph contrastive learning has received increasing attention, many existing methods still predominantly focus on a single graph abstraction level. To address this limitation, we propose a unified contrastive framework that can target node-level, proximity-level, cluster-level, and graph-level information and integrate them through a linear combination of similarity scores on positive pairs and dissimilarity scores (i.e., similarity scores on negative pairs). Furthermore, current approaches typically assign uniform penalty strengths to all examples, which reduces optimization flexibility and leads to ambiguous convergence status. To overcome this, we introduce a novel parameter-free fine-grained self-weighting mechanism that adaptively assigns weights to individual similarity and dissimilarity scores. The proposed mechanism emphasizes the scores that deviate significantly from their target values. Our approach not only enhances optimization flexibility but also eliminates the computational overhead of hyperparameter tuning in conventional multi-task GSSL methods. Comprehensive experiments on real-world datasets show that our methods consistently outperform state-of-the-art approaches across downstream tasks, including classification, clustering, and link prediction, in both single-level and multi-level scenarios.

###### Index Terms:

Graph Self-Supervised Learning; Contrastive Learning; Multi-Task Learning.

## 1 Introduction

Self-supervised learning (SSL) \[[45](https://arxiv.org/html/2605.12685#bib.bib77)\] has emerged as a highly effective approach for learning data representations, particularly in scenarios where supervisory signals are unavailable. Unlike supervised learning, which requires large labeled datasets, or reinforcement learning, which depends on repeated trials and feedback, SSL uses inherent signals within the data itself. By designing pretext tasks that challenge models to predict or reconstruct parts of the input data, SSL enables the learning of intrinsic data properties and relationships. As a result, representations learned through SSL are often more robust and versatile. When fine-tuned for downstream tasks, they show strong results across various applications \[[3](https://arxiv.org/html/2605.12685#bib.bib4), [36](https://arxiv.org/html/2605.12685#bib.bib76), [35](https://arxiv.org/html/2605.12685#bib.bib75)\].

In recent years, the contrastive learning paradigm has been widely adopted in graph self-supervised learning (GSSL) \[[30](https://arxiv.org/html/2605.12685#bib.bib7)\]. This paradigm aims to bring similar entities closer together and push dissimilar ones farther apart in the representation space. Augmented views of a graph are generated through transformations like node feature masking, edge perturbation, and subgraph sampling. These augmentations create diverse yet semantically consistent views. Thus, the model can learn invariant representations by maximizing similarity between positive pairs while distinguishing them from negative pairs. By emphasizing intrinsic relationships within the graph, contrastive GSSL methods learn generalizable representations, which can be effectively used in node classification, link prediction, and clustering without requiring extensive retraining.

GSSL methods operate across multiple abstraction levels to capture different aspects of a graph's structure and semantics. Node-level methods \[[62](https://arxiv.org/html/2605.12685#bib.bib24), [50](https://arxiv.org/html/2605.12685#bib.bib25)\] focus on local structural and feature information, making them particularly effective for tasks like node classification and link prediction. However, they may struggle to capture global properties, risk overfitting to local structures, and are sensitive to noise. Proximity-level methods \[[13](https://arxiv.org/html/2605.12685#bib.bib26), [23](https://arxiv.org/html/2605.12685#bib.bib27)\] emphasize structural relationships within local neighborhoods, excelling in tasks like link prediction and community detection. However, they often fail to capture long-range dependencies and global structures. Cluster-level methods \[[38](https://arxiv.org/html/2605.12685#bib.bib73), [37](https://arxiv.org/html/2605.12685#bib.bib74)\] identify and utilize community structures to capture relationships between node groups. Their performance depends on achieving the right balance in cluster granularity, as overly coarse or fine clusters can lead to overfitting or loss of detail. Moreover, the clustering process is prone to noisy clustering assignments, which can affect the quality of the learned representations. Graph-level methods \[[51](https://arxiv.org/html/2605.12685#bib.bib30), [14](https://arxiv.org/html/2605.12685#bib.bib31)\] focus on the global structure of the graph, making them ideal for graph classification tasks. However, these methods may overlook important local details.

Previous multi-task GSSL methods combine several pretext tasks simultaneously to improve task generalization. AutoSSL \[[20](https://arxiv.org/html/2605.12685#bib.bib10)\] employs a pseudo-homophily mechanism to assess the quality of representations across pretext tasks. Then, evolutionary algorithms or meta-gradient descent are used to identify the best linear combination of GSSL tasks. ParetoGNN \[[21](https://arxiv.org/html/2605.12685#bib.bib13)\] leverages diverse pretext tasks, including generative reconstruction, whitening decorrelation, and mutual information maximization. This model introduces a multi-gradient descent mechanism that promotes Pareto optimality across pretext tasks and mitigates potential conflicts. However, both approaches rely on an inner optimization process to search the hyperparameters associated with self-supervised losses, resulting in significant computational overhead. Moreover, both methods select the set of candidate pretext tasks based on heuristics and only focus on identifying the optimal combination of these tasks. The synergy across multiple graph abstraction levels (node-level, proximity-level, cluster-level, and graph-level) remains unexplored in multi-task GSSL. We refer to this paradigm as multi-level GSSL.

![Refer to caption](https://arxiv.org/html/2605.12685v1/figures/avg_per_cora.png)
![Refer to caption](https://arxiv.org/html/2605.12685v1/figures/avg_per_citeseer.png)
![Refer to caption](https://arxiv.org/html/2605.12685v1/figures/avg_per_pubmed.png)
![Refer to caption](https://arxiv.org/html/2605.12685v1/figures/cos_sim_cora.png) (a) Cora.
![Refer to caption](https://arxiv.org/html/2605.12685v1/figures/cos_sim_citeseer.png) (b) CiteSeer.
![Refer to caption](https://arxiv.org/html/2605.12685v1/figures/cos_sim_pubmed.png) (c) Pubmed.

Figure 1: Top row: Average performance of our single-level GSSL model (SL-GSSL) across three downstream tasks (node classification, node clustering, and link prediction) at different abstraction levels on the Cora, CiteSeer, and Pubmed datasets. Bottom row: Average cosine similarity between the gradients of two abstraction-level GSSL losses when trained using our linear multi-level GSSL model (L-ML-GSSL).

We introduce a unified framework that can operate seamlessly at different abstraction levels through a linear combination of similarity and dissimilarity scores. The proposed framework yields competitive results compared with the corresponding state-of-the-art methods at each level. Using the unified framework, we analyze the correlation between the graph abstraction levels and their relevance to the downstream tasks. As illustrated in Fig. [1](https://arxiv.org/html/2605.12685#S1.F1), the GSSL abstraction levels do not perform equally well across datasets, and there is a potential conflict between them. To this end, we extend the linear combination of similarity and dissimilarity scores across all abstraction levels. As an advantage, multi-level learning can capture both local and global information.
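
The gradient-alignment measure reported in the bottom row of Fig. 1 can be illustrated with a minimal Python sketch. It computes the cosine similarity between two flattened gradient vectors; a negative value indicates that the two level-wise losses pull the shared encoder in conflicting directions. The function name, the toy gradient values, and the zero-gradient convention are assumptions made for illustration, not details taken from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for this sketch: a zero gradient has no direction
    return dot / (norm_u * norm_v)

# Hypothetical gradients of two level-wise losses w.r.t. the shared encoder parameters.
grad_node = [0.5, -1.0, 0.25]
grad_graph = [-0.5, 1.0, 0.0]

# Negative here, i.e. the two toy objectives conflict.
alignment = cosine_similarity(grad_node, grad_graph)
```

In practice the two vectors would be obtained by back-propagating each level's loss separately and concatenating the per-parameter gradients of the encoder.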

From another perspective, previous single-task and multi-task GSSL approaches implicitly apply uniform penalty strengths to all examples, which limits their optimization flexibility and can lead to ambiguous convergence status \[[48](https://arxiv.org/html/2605.12685#bib.bib45)\]. We address this by introducing a dynamic weighting mechanism that adaptively assigns weights to individual similarity and dissimilarity scores across examples and abstraction levels, prioritizing those that significantly deviate from the optimum. We implement the weighting coefficients as linear functions w.r.t. their similarity/dissimilarity scores. This results in a hyperspherical decision boundary w.r.t. similarity/dissimilarity scores. This mechanism enhances optimization flexibility and ensures definite convergence status while eliminating the computational cost associated with inner optimization for hyperparameter tuning, as seen in conventional multi-task GSSL.
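
To make the idea of score-dependent weights concrete, the following Python sketch follows the general form used by the circle loss \[48\], where each weight is a clipped linear function of the distance between a score and its target. It is only an illustration: the paper's own weighting equations are not part of this excerpt and may differ, and the function name and the target values `pos_target=1.0` and `neg_target=0.0` are assumptions.

```python
def linear_self_weights(pos_scores, neg_scores, pos_target=1.0, neg_target=0.0):
    """Weights that grow linearly with each score's distance from its target.

    pos_target and neg_target are assumed optima for this sketch.
    Scores already at or beyond their target receive zero weight.
    """
    pos_weights = [max(pos_target - s, 0.0) for s in pos_scores]
    neg_weights = [max(s - neg_target, 0.0) for s in neg_scores]
    return pos_weights, neg_weights

pos_w, neg_w = linear_self_weights([0.9, 0.2], [0.7, -0.1])
# pos_w is approximately [0.1, 0.8]: the poorly aligned positive (0.2) is emphasized.
# neg_w is approximately [0.7, 0.0]: the negative already below its target is ignored.
```

Because each weight is linear in its score, the weighted term (weight times score) is quadratic in that score, which is the reason a circle-loss-style decision boundary is circular for one positive/negative pair and hyperspherical for many scores.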

Contributions. This work introduces a principled multi-level graph self-supervised learning (GSSL) framework. It unifies learning objectives across four abstraction levels and proposes a score-level self-weighted multi-task formulation with theoretical and empirical support. The contributions are summarized below:

- Multi-level GSSL paradigm with explicit cross-level complementarity and conflict. We formalize GSSL as learning from four abstraction levels: node, proximity, cluster, and graph. We empirically show that no single level is consistently optimal across datasets and downstream tasks. Fig. [1](https://arxiv.org/html/2605.12685#S1.F1) (top row) summarizes these task-dependent behaviors. Fig. [1](https://arxiv.org/html/2605.12685#S1.F1) (bottom row) further shows that level-wise objectives can have weak gradient alignment under joint training, which motivates principled multi-level integration.
- Unified single-level contrastive objective across four abstraction levels (SL-GSSL). We develop a single contrastive formulation that is instantiated at each level by changing only the level-specific positive/negative sample generators (Eqs. ([1](https://arxiv.org/html/2605.12685#S3.E1))–([4](https://arxiv.org/html/2605.12685#S3.E4))). The training pipeline is summarized in Algorithm [1](https://arxiv.org/html/2605.12685#alg1). We adopt a fixed augmentation and training configuration shared across datasets and downstream tasks (Table [II](https://arxiv.org/html/2605.12685#S5.T2)), and tune only a small set of data-dependent hyperparameters (Table [III](https://arxiv.org/html/2605.12685#S5.T3)). Under this unified setup, SL-GSSL achieves competitive or superior within-level performance compared with state-of-the-art single-task baselines (Table [IV](https://arxiv.org/html/2605.12685#S5.T4)). Ablations validate the design of the unified loss and show that margin removal, hinge replacement, and InfoNCE \[[2](https://arxiv.org/html/2605.12685#bib.bib72)\] substitution degrade performance (Table [V](https://arxiv.org/html/2605.12685#S5.T5)).
- Score-level multi-level integration and a unified multi-task formulation (L-ML-GSSL/LW-ML-GSSL). We extend the unified formulation to multi-level learning by integrating positive and negative similarity scores across levels inside a single objective (Eqs. ([8](https://arxiv.org/html/2605.12685#S4.E8)) and ([10](https://arxiv.org/html/2605.12685#S4.E10))). This yields a multi-task GSSL setup where abstraction levels act as complementary tasks under a shared encoder (Fig. [2](https://arxiv.org/html/2605.12685#S3.F2)). Table [IX](https://arxiv.org/html/2605.12685#S5.T9) shows that different levels contribute distinct information and that multi-level variants improve over single-level ones. At the same time, Table [VII](https://arxiv.org/html/2605.12685#S5.T7) shows that naive linear integration is not sufficient to consistently surpass the best single-level configuration, which motivates adaptive balancing.
- Dynamic score-level self-weighting with hyperspherical convergence geometry (LSW-ML-GSSL). We propose a linear self-weighting mechanism that assigns weights to individual positive and negative similarity scores across levels (Eqs. ([11](https://arxiv.org/html/2605.12685#S4.E11)) and ([12](https://arxiv.org/html/2605.12685#S4.E12))), resulting in the self-weighted multi-level objective (Eq. ([14](https://arxiv.org/html/2605.12685#S4.E14))). The induced decision boundary becomes hyperspherical in similarity-score space (Eqs. ([15](https://arxiv.org/html/2605.12685#S4.E15)) and ([16](https://arxiv.org/html/2605.12685#S4.E16))). Gradient visualizations show improved optimization flexibility and reduced convergence ambiguity compared with uniform weighting (Figs. [3](https://arxiv.org/html/2605.12685#S3.F3), [4](https://arxiv.org/html/2605.12685#S4.F4), and [5](https://arxiv.org/html/2605.12685#S4.F5)). We provide a dedicated convergence analysis in similarity-score space (Theorem [1](https://arxiv.org/html/2605.12685#Thmtheorem1) and Proposition [2](https://arxiv.org/html/2605.12685#Thmproposition2)). The method avoids auxiliary expert networks and inner-loop optimization, and remains computationally efficient (Fig. [6](https://arxiv.org/html/2605.12685#S5.F6)).
- Extensive empirical validation, ablations, and robustness studies. We evaluate on six real-world datasets and three downstream tasks. The proposed LSW-ML-GSSL achieves the best overall performance against state-of-the-art multi-task and multi-scale baselines (Table [VI](https://arxiv.org/html/2605.12685#S5.T6)). It consistently improves over linear multi-level variants and over the strongest single-level configuration (Table [VII](https://arxiv.org/html/2605.12685#S5.T7)). It also outperforms representative multi-loss weighting strategies (Table [VIII](https://arxiv.org/html/2605.12685#S5.T8)). Sensitivity analyses indicate stable performance across wide hyperparameter ranges (Figs. [7](https://arxiv.org/html/2605.12685#S5.F7) and [8](https://arxiv.org/html/2605.12685#S5.F8)).

## 2 Related Work

Single-Task Methods. Conventional GSSL methods construct self-supervision signals by extracting information from different graph abstraction levels, including individual nodes, local proximity, clusters, and the overall graph structure. At the node level, methods such as GRACE \[[62](https://arxiv.org/html/2605.12685#bib.bib24)\] and GraphCL \[[57](https://arxiv.org/html/2605.12685#bib.bib8)\] learn node representations by maximizing agreement between augmented graph views. This agreement is achieved using the InfoNCE loss \[[41](https://arxiv.org/html/2605.12685#bib.bib46)\], which serves as a lower-bound estimator of Mutual Information (MI). Building upon this, GCA \[[63](https://arxiv.org/html/2605.12685#bib.bib5)\] enhances GRACE by introducing an adaptive augmentation strategy that selectively modifies edges and node features based on their importance, rather than applying random perturbations. Proximity-level GSSL methods differ from node-level approaches by eliminating the need for explicit data augmentation. Instead, these methods leverage neighborhood relationships to encode semantic invariance. Methods such as DeepWalk \[[42](https://arxiv.org/html/2605.12685#bib.bib32)\] and Node2Vec \[[13](https://arxiv.org/html/2605.12685#bib.bib26)\] employ a random walk to capture proximity-based information. Another category of methods focuses on reconstructing the graph to capture proximity-level information. For example, MGAE \[[53](https://arxiv.org/html/2605.12685#bib.bib33)\] and MaskGAE \[[25](https://arxiv.org/html/2605.12685#bib.bib34)\] randomly mask edges and train the model to reconstruct the missing connections. When the observed topology is noisy or imperfect, structure-learning methods aim to refine or infer a task-adaptive graph. For example, SLAPS \[[9](https://arxiv.org/html/2605.12685#bib.bib67)\] shows that self-supervised signals can improve graph structure learning for GNNs, leading to better downstream performance, while latent graph inference methods aim to recover an underlying graph structure from limited supervision \[[32](https://arxiv.org/html/2605.12685#bib.bib68)\]. From another perspective, methods such as NCLA \[[47](https://arxiv.org/html/2605.12685#bib.bib56)\] apply neighbor contrastive learning on learnable graph augmentations, enabling the joint learning of both augmentations and embeddings. Cluster-level GSSL methods emphasize capturing intra-cluster similarities and inter-cluster distinctions by integrating clustering with embedding learning. These methods minimize a clustering-oriented loss. For instance, DAEGC \[[52](https://arxiv.org/html/2605.12685#bib.bib6)\] uses an attention mechanism to generate expressive latent representations while simultaneously performing embedding clustering. GMM-VGAE \[[16](https://arxiv.org/html/2605.12685#bib.bib28)\] leverages a variational graph autoencoder with a latent Gaussian mixture model. Graph-level GSSL methods generate representations that encode the global structure of graphs. For example, DGI \[[51](https://arxiv.org/html/2605.12685#bib.bib30)\] maximizes mutual information (MI) between localized patch representations and their corresponding high-level graph summaries. This objective ensures that the encoder captures features relevant across the graph. MVGRL \[[14](https://arxiv.org/html/2605.12685#bib.bib31)\] builds on this by introducing a contrastive framework that maximizes MI between representations derived from local (i.e., node-level) and global (i.e., graph-level) structural views of graphs. Beyond contrastive and reconstruction-based GSSL, recent work has explored leveraging large language models (LLMs) for graph representation learning by encoding graphs into token or text-like sequences compatible with LLM processing \[[10](https://arxiv.org/html/2605.12685#bib.bib69), [31](https://arxiv.org/html/2605.12685#bib.bib70)\].
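
The InfoNCE objective mentioned above for node-level methods can be sketched for a single anchor as follows. This is the standard form of the loss (negative log-softmax of the positive score among all candidate scores), written with a numerically stable log-sum-exp. The function name and the temperature value of 0.5 are illustrative assumptions, not settings reported in the paper.

```python
import math

def info_nce(pos_score, neg_scores, temperature=0.5):
    """InfoNCE loss for one anchor: -log softmax of the positive among all candidates."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[0]

# A well-separated anchor (high positive score, low negative scores) gives a small loss.
loss_easy = info_nce(0.9, [-0.5, -0.8])
# A poorly separated anchor gives a larger loss.
loss_hard = info_nce(0.1, [0.6, 0.4])
```

Note that every negative enters the softmax with the same temperature, which is one way of seeing the uniform penalty strength that the self-weighting mechanism is designed to relax.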

Multi-Scale Methods. Recent research has increasingly combined insights from multiple hierarchical levels of graphs \[[17](https://arxiv.org/html/2605.12685#bib.bib53), [18](https://arxiv.org/html/2605.12685#bib.bib50), [19](https://arxiv.org/html/2605.12685#bib.bib52), [29](https://arxiv.org/html/2605.12685#bib.bib51), [24](https://arxiv.org/html/2605.12685#bib.bib54)\]. SUBG-CON \[[17](https://arxiv.org/html/2605.12685#bib.bib53)\] is a self-supervised representation learning method that captures regional structural information by leveraging the strong correlation between central nodes and their corresponding sampled subgraphs. ANEMONE \[[18](https://arxiv.org/html/2605.12685#bib.bib50)\] is a graph anomaly detection framework that identifies anomalies across multiple graph scales by employing a GNN-based encoder and multi-scale contrastive learning to capture pattern distributions through simultaneous agreement learning at the patch and context levels. MERIT \[[19](https://arxiv.org/html/2605.12685#bib.bib52)\] is another multi-scale graph contrastive learning approach that first generates two augmented views of the input graph, one focusing on local and the other on global perspectives. It then employs two objectives (cross-view and cross-network contrastiveness) to maximize the alignment of node representations across these different views and networks. Based on the idea that nodes can be observed at different abstraction levels, MNCSCL \[[24](https://arxiv.org/html/2605.12685#bib.bib54)\] samples multiple node-centered subgraphs to reflect differences across various granularity levels. Then it applies contrastive learning to maximize mutual information between the graph views generated at different abstraction levels. In another graph self-supervised learning method, MSSGCL \[[29](https://arxiv.org/html/2605.12685#bib.bib51)\] generates multi-scale global and local views via subgraph sampling and builds multiple contrastive relations between these abstraction levels. LMGTA \[[26](https://arxiv.org/html/2605.12685#bib.bib60)\] similarly adopts a multi-order contrastive strategy that integrates subgraph-level augmentations with a topology-aware global module, enabling the model to capture both local and structural irregularities. However, multi-scale GSSL methods often instantiate "scale" using only local–global or patch–context contrasts, providing limited coverage of intermediate abstractions. They often couple each scale with scale-specific objectives and augmentation heuristics, confounding the effect of abstraction level with objective engineering. Finally, they usually aggregate scale losses via fixed linear weighting, leaving inter-scale interference and overall optimization dynamics insufficiently characterized.

Multi-Task Methods. Recently, extensive research has been devoted to learning from multiple tasks \[[44](https://arxiv.org/html/2605.12685#bib.bib36), [33](https://arxiv.org/html/2605.12685#bib.bib47), [28](https://arxiv.org/html/2605.12685#bib.bib48), [40](https://arxiv.org/html/2605.12685#bib.bib49), [58](https://arxiv.org/html/2605.12685#bib.bib38), [6](https://arxiv.org/html/2605.12685#bib.bib35), [11](https://arxiv.org/html/2605.12685#bib.bib39), [59](https://arxiv.org/html/2605.12685#bib.bib37)\]. The motivation is that a single pretext task favors task-specific features, which leads to suboptimal performance across diverse downstream objectives. Several multi-task approaches have been developed for graph-structured data. For example, AutoSSL \[[20](https://arxiv.org/html/2605.12685#bib.bib10)\] introduces pseudo-homophily as a metric to assess the quality of GSSL tasks and searches for an optimal combination of these tasks using evolutionary algorithms. ParetoGNN \[[21](https://arxiv.org/html/2605.12685#bib.bib13)\] leverages a multi-gradient descent algorithm to assign task weights that promote Pareto optimality. DyFSS \[[61](https://arxiv.org/html/2605.12685#bib.bib40)\] introduces a Mixture of Experts (MoE) framework that integrates features derived from various GSSL tasks, with a focus on node clustering. This model trains a gating network to learn node-specific weights for each task. From another perspective, GraphTCM \[[8](https://arxiv.org/html/2605.12685#bib.bib41)\] models the correlations between GSSL tasks and exploits these correlations to derive representations that maximize performance. WAS \[[7](https://arxiv.org/html/2605.12685#bib.bib43)\] formulates the multi-task learning problem as multi-teacher knowledge distillation. Moreover, the authors of \[[7](https://arxiv.org/html/2605.12685#bib.bib43)\] highlight the significance of selecting a set of tasks based on their compatibility before assigning importance weights to them. Previous multi-task learning methods typically require a trainable gating network (e.g., DyFSS), trainable task-specific expert networks (e.g., GraphTCM, DyFSS, and WAS), and inner-optimization algorithms (e.g., the evolutionary algorithm for AutoSSL and the multi-gradient descent algorithm for ParetoGNN) to learn task-specific importance weights. Unlike previous methods, our approach employs a single multi-task network and does not require inner-optimization algorithms. The proposed method introduces a self-weighting mechanism based on similarity and dissimilarity scores across different graph abstraction levels. Furthermore, our approach inherits the advantages of the circle loss \[[48](https://arxiv.org/html/2605.12685#bib.bib45)\]. In particular, it improves optimization flexibility and reduces convergence uncertainty.

Multi-Loss Weighting. Beyond multi-task learning, a broad body of literature has emerged on strategies for balancing multiple objectives during optimization. Uncertainty weighting \[[22](https://arxiv.org/html/2605.12685#bib.bib61)\] learns scalar loss weights from homoscedastic task uncertainty, down-weighting objectives that appear noisier. Instead of modeling uncertainty, GradNorm \[[4](https://arxiv.org/html/2605.12685#bib.bib62)\] dynamically rescales task losses to equalize gradient norms and encourage comparable learning speeds across tasks. Complementarily, gradient-conflict methods modify the update direction itself: PCGrad \[[58](https://arxiv.org/html/2605.12685#bib.bib38)\] projects gradients to reduce pairwise conflicts, while CAGrad \[[27](https://arxiv.org/html/2605.12685#bib.bib63)\] computes a conflict-averse direction that explicitly limits interference. More recent approaches address dominance effects over longer training horizons. For instance, AdaTask \[[55](https://arxiv.org/html/2605.12685#bib.bib64)\] maintains task-wise accumulators under adaptive optimizers, and IGBv1 \[[5](https://arxiv.org/html/2605.12685#bib.bib65)\] prioritizes tasks with larger improvable gaps between current and desired progress. As a recent example, AUTAUT \[[60](https://arxiv.org/html/2605.12685#bib.bib66)\] uses LLM-based retrieval to identify candidate auxiliary tasks and adaptively reweights them via gradient alignment. However, these strategies typically operate at the task-loss granularity by assigning a single scalar weight to an entire objective. In contrast, we adopt a more fine-grained weighting strategy that reweights individual positive and negative similarity scores within each abstraction level, thereby controlling the relative influence of each score on the optimization process.
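
As an example of a task-level (rather than score-level) strategy, the pairwise projection step at the core of PCGrad can be sketched as follows. When two task gradients have a negative inner product, the first is projected onto the normal plane of the second. Only this single step is shown; the full method repeats it over all task pairs in random order, and the function name is an illustrative choice.

```python
def project_conflicting(g_i, g_j):
    """PCGrad-style step: remove from g_i its component that conflicts with g_j."""
    dot = sum(a * b for a, b in zip(g_i, g_j))
    if dot >= 0.0:
        return list(g_i)  # no conflict, gradient is left unchanged
    norm_sq = sum(b * b for b in g_j)  # strictly positive whenever dot < 0
    return [a - (dot / norm_sq) * b for a, b in zip(g_i, g_j)]

# Two toy task gradients with a negative inner product.
g_a = [1.0, 0.0]
g_b = [-1.0, 1.0]
g_a_projected = project_conflicting(g_a, g_b)  # orthogonal to g_b after projection
```

The operation acts on a whole task gradient at once, which is the granularity the paragraph above contrasts with reweighting individual similarity scores.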

## 3 A Unified GSSL Perspective

Let $\mathcal{G}=(\mathcal{V},\,\mathcal{E},\,\mathcal{X})$ be an undirected attributed graph, where $\mathcal{V}=\{v_{1},\ldots,v_{n}\}$ represents the set of $n$ nodes, $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$ is the set of edges, and $\mathcal{X}\in\mathbb{R}^{n\times d'}$ is the node feature matrix. The $i^{\text{th}}$ row $\mathbf{x}_{i}$ of $\mathcal{X}$ is the feature vector of node $v_{i}$. We define the adjacency matrix of $\mathcal{G}$ as $\mathbf{A}=(a_{ij})\in\mathbb{R}^{n\times n}$, such that $a_{ij}=1$ if $(v_{i},v_{j})\in\mathcal{E}$ and $a_{ij}=0$ otherwise. A graph neural network (GNN) encodes the input graph $\mathcal{G}$ into a $d$-dimensional latent space and generates the node embedding matrix $\mathbf{H}\in\mathbb{R}^{n\times d}$. The $i$-th row $\mathbf{h}_{i}$ of $\mathbf{H}$ represents the embedding vector of node $v_{i}$. In this section, we introduce a unified contrastive learning framework that operates seamlessly at different graph abstraction levels and facilitates their integration.
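
As a small illustration of the notation above, the following Python sketch builds the symmetric 0/1 adjacency matrix of an undirected graph from an edge list. The function name and the toy four-node graph are illustrative and do not come from the paper.

```python
def adjacency_matrix(num_nodes, edges):
    """Return the n x n adjacency matrix (as nested lists) of an undirected graph."""
    A = [[0] * num_nodes for _ in range(num_nodes)]
    for i, j in edges:
        A[i][j] = 1
        A[j][i] = 1  # undirected: every edge is stored in both directions
    return A

# Toy path graph v0 - v1 - v2 - v3 (0-based indices).
A = adjacency_matrix(4, [(0, 1), (1, 2), (2, 3)])
```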

We generate three augmented views of $\mathcal{G}$, which are then encoded by the GNN encoder to derive node representations for contrastive learning. At each epoch, the input graph undergoes three augmentations, denoted as $\widetilde{\mathcal{G}}^{(1)}=\{\widetilde{\mathcal{V}}^{(1)},\,\widetilde{\mathcal{E}}^{(1)},\,\widetilde{\mathcal{X}}^{(1)}\}$, $\widetilde{\mathcal{G}}^{(2)}=\{\widetilde{\mathcal{V}}^{(2)},\,\widetilde{\mathcal{E}}^{(2)},\,\widetilde{\mathcal{X}}^{(2)}\}$, and $\widetilde{\mathcal{G}}^{(3)}=\{\widetilde{\mathcal{V}}^{(3)},\,\widetilde{\mathcal{E}}^{(3)},\,\widetilde{\mathcal{X}}^{(3)}\}$. These augmentations are categorized into positive and negative transformations. Positive augmentations preserve key structural and semantic properties of the graph. We employ stochastic edge dropping and node feature masking as positive augmentations \[[62](https://arxiv.org/html/2605.12685#bib.bib24), [63](https://arxiv.org/html/2605.12685#bib.bib5)\]. In contrast, negative augmentations introduce substantial modifications that alter the graph's structure or feature distribution. We use random node shuffling as negative augmentation. This corruption breaks node-attribute alignment and is commonly used to generate negative views in GSSL methods such as DGI \[[51](https://arxiv.org/html/2605.12685#bib.bib30)\] and MVGRL \[[14](https://arxiv.org/html/2605.12685#bib.bib31)\]. The first two graphs $\widetilde{\mathcal{G}}^{(1)}$ and $\widetilde{\mathcal{G}}^{(2)}$ are positive views, while the third $\widetilde{\mathcal{G}}^{(3)}$ is negative.
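
The three augmentations described above can be sketched in plain Python as follows. The sketch assumes GRACE-style feature masking (one random mask over feature dimensions, shared by all nodes) and DGI-style corruption (a row permutation of the feature matrix with the edges left unchanged). The function names, probabilities, and toy data are illustrative assumptions, not the configuration used in the paper.

```python
import random

def drop_edges(edges, drop_prob, rng):
    """Positive augmentation: remove each edge independently with probability drop_prob."""
    return [e for e in edges if rng.random() >= drop_prob]

def mask_features(X, mask_prob, rng):
    """Positive augmentation: zero out whole feature dimensions, shared across nodes."""
    keep = [rng.random() >= mask_prob for _ in range(len(X[0]))]
    return [[x if k else 0.0 for x, k in zip(row, keep)] for row in X]

def shuffle_node_features(X, rng):
    """Negative augmentation: permute feature rows so nodes receive other nodes' attributes."""
    perm = list(range(len(X)))
    rng.shuffle(perm)
    return [list(X[p]) for p in perm]

rng = random.Random(0)  # fixed seed only to make the sketch reproducible
edges = [(0, 1), (1, 2), (2, 3)]
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

view_1 = (drop_edges(edges, 0.2, rng), mask_features(X, 0.2, rng))  # positive view
view_2 = (drop_edges(edges, 0.2, rng), mask_features(X, 0.2, rng))  # positive view
view_3 = (list(edges), shuffle_node_features(X, rng))               # negative view
```

Each view would then be passed through the same GNN encoder to obtain its embedding matrix.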

For each augmented view $\widetilde{\mathcal{G}}^{(j)}$, we define $\mathbf{H}^{(j)}\in\mathbb{R}^{n\times d}$ as its node embedding matrix and $\mathbf{h}_{i}^{(j)}$ as the embedding vector of node $v_{i}$. The core principle of graph contrastive learning involves two steps: (i) generating positive and negative samples for each anchor, and (ii) bringing the anchor closer to positive samples while distancing it from the negative ones. Formally, the problem can be framed as maximizing the similarity between the anchor and the positive samples, while minimizing the similarity between the anchor and the negative samples. Our unified contrastive learning framework aligns with the multi-level GSSL paradigm by capturing similarities and dissimilarities at four graph abstraction levels: node-level, proximity-level, cluster-level, and graph-level. The index $l$ selects the granularity level at which the GSSL is performed. Formally, we have $l\in\{\text{node, proximity, cluster, graph}\}$.

The anchors are defined as the node representations in the first augmented graph view, $\widetilde{\mathcal{G}}^{(1)}$. For each anchor, $n_{1}$ positive samples are drawn from the second graph view, $\widetilde{\mathcal{G}}^{(2)}$, while $n_{2}$ negative samples are selected from the second and third views, $\widetilde{\mathcal{G}}^{(2)}$ and $\widetilde{\mathcal{G}}^{(3)}$. Sample generation is governed by the functions $G_{l}^{+}$ and $G_{l}^{-}$, which operate at a granularity level $l$. Here, $G_{k,\,l}^{+}$ produces the $k$-th positive sample, and $G_{k,\,l}^{-}$ generates the $k$-th negative sample.

We employ a function $\theta$ to measure the similarity between anchors and their positive and negative samples. The similarity between the $i$-th anchor and its $k$-th positive sample at granularity level $l$ is denoted as $s^{+}_{ik,\,l}$, where $(i,\,k)\in\{1,\dots,n\}\times\{1,\dots,n_{1}\}$. Similarly, the similarity between the $i$-th anchor and its $k$-th negative sample is given by $s^{-}_{ik,\,l}$, where $(i,\,k)\in\{1,\dots,n\}\times\{1,\dots,n_{2}\}$. The positive and negative similarities are formally expressed as:

$$s^{+}_{ik,\,l}=\theta\big(\mathbf{h}^{(1)}_{i},\,G^{+}_{k,\,l}(v_{i})\big),\tag{1}$$

$$s^{-}_{ik,\,l}=\theta\big(\mathbf{h}^{(1)}_{i},\,G^{-}_{k,\,l}(v_{i})\big).\tag{2}$$

At the node level, the positive samples for an anchor consist of a single element, which is the embedding of the same node in $\widetilde{\mathcal{G}}^{(2)}$. The negative samples contain the embeddings of $n_{2}$ nodes, different from the anchor node, selected from the embeddings of $\widetilde{\mathcal{G}}^{(2)}$ and $\widetilde{\mathcal{G}}^{(3)}$. At the proximity level, positive samples are generated by considering the embeddings of $\widetilde{\mathcal{G}}^{(2)}$ for nodes along a $\kappa$-length path in $\mathcal{G}$ originating from the anchor node. Negative samples are generated by considering the embeddings of $\widetilde{\mathcal{G}}^{(2)}$ and $\widetilde{\mathcal{G}}^{(3)}$ for nodes along a $\kappa$-length path in $\mathcal{G}$ that does not intersect the anchor's $\kappa$-hop neighborhood in $\mathcal{G}$. At the cluster level, the graph is first partitioned by applying k-means to the node features $\mathcal{X}$. The positive samples are selected from the embeddings of $\widetilde{\mathcal{G}}^{(2)}$ for nodes within the same cluster as the anchor node, while the negative samples are drawn from the embeddings of $\widetilde{\mathcal{G}}^{(2)}$ and $\widetilde{\mathcal{G}}^{(3)}$ for nodes outside the anchor's cluster. At the graph level, the positive samples are drawn from the embeddings of $\widetilde{\mathcal{G}}^{(2)}$, whereas the negative samples come from the embeddings of $\widetilde{\mathcal{G}}^{(3)}$.

In the general case, the similarities are indexed by the anchor and sample identifiers. For notational simplicity, we omit these indices when no ambiguity arises, and denote the similarities of positive and negative pairs by $s^{+}$ and $s^{-}$, respectively. We instantiate $\theta$ as the shifted cosine similarity:

$$\theta(\mathbf{u},\mathbf{v})=\frac{1+\cos(\mathbf{u},\mathbf{v})}{2}=\frac{1}{2}\left(1+\frac{\mathbf{u}^{\top}\mathbf{v}}{\|\mathbf{u}\|_{2}\,\|\mathbf{v}\|_{2}}\right)\in[0,1].\tag{3}$$

The target values are $s^{+}\to 1$ (cosine $\to 1$) and $s^{-}\to 0$ (cosine $\to -1$). For terminological convenience, we use the term *dissimilarity score* to refer to the same similarity function $\theta(\cdot,\cdot)$ evaluated on negative pairs (i.e., $s^{-}$). That is, we do not introduce a separate distance/dissimilarity function. Under the shifted cosine similarity $\theta\in[0,1]$, minimizing the negative-pair similarity $s^{-}$ is equivalent to maximizing dissimilarity in the representation space (ideally $s^{-}\to 0$).
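As a quick illustration of Eq. (3), the shifted cosine similarity can be written in a few lines of NumPy. The function name and the small `eps` stabilizer are assumptions added for the sketch; they are not part of the paper's notation.

```python
import numpy as np

def shifted_cosine(u, v, eps=1e-12):
    """theta(u, v) = (1 + cos(u, v)) / 2, which lies in [0, 1]."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps)  # eps avoids 0/0
    return 0.5 * (1.0 + cos)

u = np.array([1.0, 2.0, 3.0])
same = shifted_cosine(u, u)            # aligned vectors: close to the positive target 1
opposite = shifted_cosine(u, -u)       # opposed vectors: close to the negative target 0
orthogonal = shifted_cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0]))  # midpoint 0.5
```

The shift maps the usual cosine range $[-1, 1]$ onto $[0, 1]$, so an orthogonal pair sits exactly halfway between the two targets.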

We propose a unified contrastive loss that can operate at different granularity levels. For each anchor node, this loss function maximizes the similarity between the anchor node representation and its positive samples, while minimizing the similarity with its negative samples. The unified loss function $\mathcal{L}_{\text{SL-GSSL}}$ for single-level graph self-supervised learning (SL-GSSL) is defined as follows:

$$\mathcal{L}_{\text{SL-GSSL}}^{(l)}=\frac{1}{n}\sum_{i=1}^{n}\log\left[1+\sum_{j=1}^{n_{1}}\sum_{k=1}^{n_{2}}\exp\left(\gamma\,(s^{-}_{ik,\,l}-s^{+}_{ij,\,l}+m)\right)\right],\tag{4}$$

where $\gamma$ denotes a scaling factor, and $m$ is a margin hyperparameter that prevents excessive updates: a positive-negative pair is largely disregarded once its separation $s^{+}_{ij,\,l}-s^{-}_{ik,\,l}$ surpasses the threshold $m$. This loss function iterates through each similarity pair to minimize $(s^{-}_{ik,\,l}-s^{+}_{ij,\,l})$. Algorithm [1](https://arxiv.org/html/2605.12685#alg1) outlines the training procedure of SL-GSSL.

Algorithm 1: The unified perspective SL-GSSL

1: Input: The input graph $\mathcal{G}$, number of epochs $T$, scaling factor $\gamma$, margin $m$, number of positive samples $n_{1}$, number of negative samples $n_{2}$, abstraction level $l\in\{\text{node, proximity, cluster, graph}\}$, similarity function $\theta$, positive augmentations $O^{+}_{1}$ and $O^{+}_{2}$, negative augmentation $O^{-}$, function $G^{+}_{l}$ that generates positive samples, and function $G^{-}_{l}$ that generates negative samples

2: Output: Trained model

3: for $\text{epoch}=1$ to $T$ do

4: Generate $\widetilde{\mathcal{G}}^{(1)}$, $\widetilde{\mathcal{G}}^{(2)}$, and $\widetilde{\mathcal{G}}^{(3)}$ by stochastically corrupting $\mathcal{G}$ using $O^{+}_{1}$, $O^{+}_{2}$, and $O^{-}$, respectively.

5: Compute $\mathbf{H}^{(1)}$, $\mathbf{H}^{(2)}$, and $\mathbf{H}^{(3)}$ by encoding the graphs $\widetilde{\mathcal{G}}^{(1)}$, $\widetilde{\mathcal{G}}^{(2)}$, and $\widetilde{\mathcal{G}}^{(3)}$, respectively, using the GNN.

6: Generate $n_{1}$ positive samples and $n_{2}$ negative samples for each node using $G^{+}_{l}$ and $G^{-}_{l}$, respectively.

7: Compute the positive and negative similarity scores for each node using Eq. ([1](https://arxiv.org/html/2605.12685#S3.E1)) and Eq. ([2](https://arxiv.org/html/2605.12685#S3.E2)), respectively.

8: Compute the loss function $\mathcal{L}_{\text{SL-GSSL}}^{(l)}$ using Eq. ([4](https://arxiv.org/html/2605.12685#S3.E4)).

9: Update the model's parameters using the Adam optimizer.

10: end for
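Step 8 of Algorithm 1 evaluates Eq. (4) from precomputed similarity scores. A minimal NumPy sketch of that computation is shown below; the function name, the array layout (`s_pos` of shape `(n, n1)`, `s_neg` of shape `(n, n2)`), and the numerically stable log-sum-exp formulation are assumptions of this example rather than details taken from the released code.

```python
import numpy as np

def sl_gssl_loss(s_pos, s_neg, gamma, m):
    """Eq. (4): mean_i log(1 + sum_{j,k} exp(gamma * (s_neg[i,k] - s_pos[i,j] + m)))."""
    # Pairwise differences for every (positive j, negative k) pair of each anchor i.
    diff = s_neg[:, None, :] - s_pos[:, :, None] + m          # shape (n, n1, n2)
    z = gamma * diff.reshape(diff.shape[0], -1)               # shape (n, n1 * n2)
    # log(1 + sum exp(z)) equals logsumexp over [0, z]; subtract the max for stability.
    z = np.concatenate([np.zeros((z.shape[0], 1)), z], axis=1)
    z_max = z.max(axis=1, keepdims=True)
    per_anchor = z_max[:, 0] + np.log(np.exp(z - z_max).sum(axis=1))
    return per_anchor.mean()

# Toy usage with hypothetical scores: 3 anchors, 2 positives, 4 negatives each.
gamma, m = 8.0, 0.25                                          # hypothetical hyperparameters
loss_good = sl_gssl_loss(np.full((3, 2), 0.9), np.full((3, 4), 0.1), gamma, m)
loss_bad = sl_gssl_loss(np.full((3, 2), 0.1), np.full((3, 4), 0.9), gamma, m)
```

Well-separated scores (`loss_good`) yield a smaller loss than inverted scores (`loss_bad`), as expected from the form of Eq. (4).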

Gradient Analysis. To simplify visualization, we assume the toy scenario of a single anchor with one corresponding positive sample and one negative sample ($n=n_{1}=n_{2}=1$). The unified loss function becomes:

$$\mathcal{L}=\log\left[1+\exp\left(\gamma\,(s^{-}-s^{+}+m)\right)\right].\tag{5}$$

For notational simplicity, we omit the granularity index $l$. We analyze the gradient of $\mathcal{L}$ w.r.t. the positive and negative similarities:

$$\left|\frac{\partial\mathcal{L}}{\partial s^{+}}\right|=\left|\frac{\partial\mathcal{L}}{\partial s^{-}}\right|=\frac{\gamma\,\exp\left(\gamma\,(s^{-}-s^{+}+m)\right)}{1+\exp\left(\gamma\,(s^{-}-s^{+}+m)\right)}.\tag{6}$$

Since the two expressions $\left|\frac{\partial\mathcal{L}}{\partial s^{+}}\right|$ and $\left|\frac{\partial\mathcal{L}}{\partial s^{-}}\right|$ are mathematically identical, we plot only one. Fig. [3](https://arxiv.org/html/2605.12685#S3.F3) illustrates the partial derivative of $\mathcal{L}$ w.r.t. $s^{+}$. The figure shows a sharp transition from large to very small values, reflecting the saturation of the logistic term once the margin constraint is satisfied. This transition induces an effective (soft) decision boundary in the $(s^{+},s^{-})$ plane. In particular, the line $s^{-}-s^{+}+m=0$ corresponds to the midpoint of the transition (where $\left|\frac{\partial\mathcal{L}}{\partial s^{+}}\right|=\gamma/2$) and separates a high-gradient regime ($s^{-}-s^{+}+m>0$) from a low-gradient regime ($s^{-}-s^{+}+m<0$). Note that the gradient remains strictly positive for any finite value of $s^{-}-s^{+}+m$, but it decays exponentially as $s^{-}-s^{+}+m$ becomes more negative.

![Refer to caption](https://arxiv.org/html/2605.12685v1/figures/LSW.png)

Figure 2: Illustration of our multi-task approach, LSW-ML-GSSL. The positive and negative similarity scores are denoted by $s^{+}_{l}$ and $s^{-}_{l}$, respectively, with their corresponding reference endpoints denoted as $o^{+}_{l}$ and $o^{-}_{l}$. The corresponding weights in the multi-level loss $\mathcal{L}_{\text{LSW-ML-GSSL}}$ are denoted by $\alpha^{+}_{l}$ and $\alpha^{-}_{l}$. $H^{(i)}$ denotes the embedding of the augmented graph $\widetilde{\mathcal{G}}^{(i)}$.

![Refer to caption](https://arxiv.org/html/2605.12685v1/figures/gradient_SL.png)

Figure 3: The gradient magnitude of $\mathcal{L}$ w.r.t. $s^{+}$ ($\left|\frac{\partial\mathcal{L}}{\partial s^{+}}\right|$).

We select two points, $B_{1}(0.8,0.8)$ and $B_{2}(0.6,0.5)$, on the gradient function graph. Point $B_{1}$ exhibits high positive and negative similarities, yet still maintains a large derivative magnitude w.r.t. $s^{+}$. The derivative magnitude remains identical w.r.t. $s^{-}$ and $s^{+}$, regardless of their relative balance. This indicates that the optimization process is limited in its flexibility to adapt according to the balance between $s^{+}$ and $s^{-}$. Point $B_{2}$ lies closer to the decision boundary $s^{-}-s^{+}+m=0$ compared to point $B_{1}$. However, the derivatives at point $B_{2}$ w.r.t. $s^{+}$ and $s^{-}$ remain nearly identical to those at point $B_{1}$. Consequently, the loss function penalizes both points equally, regardless of their relative proximity to the boundary. Moreover, for any pair $(s^{+},s^{-})$ on the convergence boundary (i.e., $s^{+}-s^{-}=m$), the model exhibits no preference between these points. Thus, the optimization process is susceptible to ambiguity in the convergence outcome.
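The behaviour described by Eqs. (5) and (6) can be checked numerically. The sketch below compares a finite-difference derivative with the closed form and evaluates two different points on the boundary $s^{+}-s^{-}=m$; the values of $\gamma$ and $m$ and the chosen points are hypothetical and serve only to illustrate the argument.

```python
import numpy as np

def toy_loss(s_pos, s_neg, gamma, m):
    """Eq. (5): log(1 + exp(gamma * (s_neg - s_pos + m)))."""
    return np.log1p(np.exp(gamma * (s_neg - s_pos + m)))

def grad_magnitude(s_pos, s_neg, gamma, m):
    """Eq. (6): gamma * sigmoid(gamma * (s_neg - s_pos + m)), same for s_pos and s_neg."""
    z = gamma * (s_neg - s_pos + m)
    return gamma / (1.0 + np.exp(-z))

gamma, m = 8.0, 0.25          # hypothetical hyperparameters

# Central finite difference of the loss w.r.t. s_pos at an arbitrary point.
s_pos, s_neg, h = 0.6, 0.5, 1e-6
fd = (toy_loss(s_pos + h, s_neg, gamma, m) - toy_loss(s_pos - h, s_neg, gamma, m)) / (2 * h)

# Two distinct points on the boundary s_pos - s_neg = m receive the same gradient.
g_a = grad_magnitude(0.95, 0.70, gamma, m)
g_b = grad_magnitude(0.35, 0.10, gamma, m)
```

Because the gradient depends on the scores only through $s^{-}-s^{+}$, both boundary points receive the magnitude $\gamma/2$ even though one of them is much closer to the targets $s^{+}\to 1$, $s^{-}\to 0$ than the other.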

## 4 Multi-Level Approach

We extend the unified formulation to the multi-task scenario by leveraging multi-level granularity information. Initially, we aggregate the positive and negative similarity scores linearly across all abstraction levels. Then, we present a self-weighting mechanism to enhance optimization flexibility and reduce convergence ambiguity. Our multi-level approach, LSW-ML-GSSL, is illustrated in Fig. [2](https://arxiv.org/html/2605.12685#S3.F2).

Linear Combination. The effectiveness of single-level GSSL on each downstream task varies depending on the selected graph granularity level. To ensure task generalization, we integrate positive and negative similarity scores across four graph granularity levels into the same loss function through a linear combination.

We define $\Delta_{ijk,\,l}$ as the contrastive similarity difference between the $i$-th anchor, its $j$-th positive sample, and its $k$-th negative sample at granularity level $l$. The expression of $\Delta_{ijk,\,l}$ is given by:

$$\Delta_{ijk,\,l}=s^{-}_{ik,\,l}-s^{+}_{ij,\,l}+m.\tag{7}$$

We perform a linear combination of the contrastive similarity differences across the four graph granularity levels. The linear multi-level GSSL loss, denoted as $\mathcal{L}_{\text{L-ML-GSSL}}$, is expressed as follows:

$$\mathcal{L}_{\text{L-ML-GSSL}}=\frac{1}{n}\sum_{i=1}^{n}\log\left[1+\sum_{j=1}^{n_{1}}\sum_{k=1}^{n_{2}}\exp\left(\gamma\sum_{l=1}^{4}\beta_{l}\,\Delta_{ijk,\,l}\right)\right],\tag{8}$$

where $\beta_{l}$ is a balancing hyperparameter that controls the contribution of the $l$-th graph granularity level to the loss function. To simplify notation, we enumerate the index $l$, where $l=1$ corresponds to the node level, $l=2$ to the proximity level, $l=3$ to the cluster level, and $l=4$ to the graph level.

Linear Weighted Combination. Similar to the single-level GSSL, discussed in Sec. [3](https://arxiv.org/html/2605.12685#S3), the training process w.r.t. the loss function $\mathcal{L}_{\text{L-ML-GSSL}}$ lacks optimization flexibility and is prone to ambiguity in its convergence outcome. To address this issue, we introduce a dynamic weighting mechanism that adjusts the contribution of each similarity score to the final loss. The goal is to optimize each score at its own pace by prioritizing the less optimized ones.

We define $\Delta^{\prime}_{ijk,\,l}$ as the weighted contrastive similarity difference between the $i$-th anchor, its $j$-th positive sample, and its $k$-th negative sample at granularity level $l$. The expression of $\Delta^{\prime}_{ijk,\,l}$ is:

$$\Delta^{\prime}_{ijk,\,l}=\alpha^{-}_{ik,\,l}\,(s^{-}_{ik,\,l}-\delta^{-})-\alpha^{+}_{ij,\,l}\,(s^{+}_{ij,\,l}-\delta^{+}),\tag{9}$$

where $\alpha^{+}_{ij,\,l}$ and $\alpha^{-}_{ik,\,l}$ are the weighting coefficients assigned to the positive and negative similarity scores $s^{+}_{ij,\,l}$ and $s^{-}_{ik,\,l}$, respectively; $\delta^{+}$ and $\delta^{-}$ are the positive and negative margins. In SL-GSSL and L-ML-GSSL, the positive and negative similarity scores are assigned equal weights. This allows the use of a single margin hyperparameter $m$. After integrating the weighting coefficients, $s^{+}_{ij,\,l}$ and $s^{-}_{ik,\,l}$ are no longer in symmetric positions. Thus, we need two margins.

We sum the weighted contrastive similarity differences across the four graph granularity levels. The linear weighted multi-level GSSL loss, denoted as $\mathcal{L}_{\text{LW-ML-GSSL}}$, is formally expressed as follows:

$$\mathcal{L}_{\text{LW-ML-GSSL}}=\frac{1}{n}\sum_{i=1}^{n}\log\left[1+\sum_{j=1}^{n_{1}}\sum_{k=1}^{n_{2}}\exp\left(\gamma\sum_{l=1}^{4}\Delta^{\prime}_{ijk,\,l}\right)\right].\tag{10}$$

The hyperparameters $(\beta_{l})_{l\in\{1,\cdots,4\}}$, which control the contribution of the graph granularity levels in $\mathcal{L}_{\text{L-ML-GSSL}}$, are absorbed by the weighting coefficients $\alpha^{+}_{ij,\,l}$ and $\alpha^{-}_{ik,\,l}$ in $\mathcal{L}_{\text{LW-ML-GSSL}}$.

Linear Self-Weighted Combination. We introduce $o_{l}^{+}$ and $o_{l}^{-}$ as *reference endpoints* (level-wise constants) used in the self-weighting coefficients for $s^{+}_{ij,l}$ and $s^{-}_{ik,l}$. Under shifted cosine similarity, $s_{ij,\,l}^{+},s_{ik,l}^{-}\in[0,1]$, and the feasible targets remain $s^{+}\to 1$ and $s^{-}\to 0$. The weighting terms $\alpha^{+}_{ij,\,l}$ and $\alpha^{-}_{ik,\,l}$ control the gradient contribution of each similarity score during optimization. Ideally, the positive similarity scores should be high, while the negative scores should be low. When a similarity score deviates significantly from its target value, its corresponding weight should increase to enforce a stronger correction. Thus, we can define the self-weighting coefficients $\alpha^{+}_{ij,\,l}$ and $\alpha^{-}_{ik,\,l}$ as follows:

$$\alpha^{+}_{ij,l}=[\,o_{l}^{+}-s_{ij,l}^{+}\,]_{+},\tag{11}$$

$$\alpha^{-}_{ik,l}=[\,s_{ik,l}^{-}-o_{l}^{-}\,]_{+},\tag{12}$$

where $[\cdot]_{+}$ represents the “cut-off at zero” operation, ensuring the weights remain non-negative. Under the shifted cosine similarity, the desired (feasible) targets are $s^{+}_{ij,l}\to 1$ for positive pairs and $s^{-}_{ik,l}\to 0$ for negative pairs. Accordingly, when a negative similarity score $s^{-}_{ik,l}$ is large (i.e., far above its target $0$), its weight $\alpha^{-}_{ik,l}$ increases, strengthening the update that reduces $s^{-}_{ik,l}$. Likewise, when a positive similarity score $s^{+}_{ij,l}$ is small (i.e., far below its target $1$), its weight $\alpha^{+}_{ij,l}$ increases, strengthening the update that increases $s^{+}_{ij,l}$. The quantities $o^{+}_{l}$ and $o^{-}_{l}$ are *reference endpoints* used only in the weighting functions (Eqs. ([11](https://arxiv.org/html/2605.12685#S4.E11))–([12](https://arxiv.org/html/2605.12685#S4.E12))); they are not similarity targets and may lie outside $[0,1]$.
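The self-weighting coefficients of Eqs. (11) and (12) amount to two clipped differences. The following sketch computes them for a few scores; the function name and the endpoint values $o^{+}=1.25$ and $o^{-}=-0.25$ are illustrative assumptions.

```python
import numpy as np

def self_weights(s_pos, s_neg, o_pos, o_neg):
    """Eqs. (11)-(12): alpha+ = [o+ - s+]_+ and alpha- = [s- - o-]_+."""
    alpha_pos = np.maximum(o_pos - s_pos, 0.0)   # grows as s+ falls below its target
    alpha_neg = np.maximum(s_neg - o_neg, 0.0)   # grows as s- rises above its target
    return alpha_pos, alpha_neg

# Hypothetical reference endpoints lying outside [0, 1].
o_pos, o_neg = 1.25, -0.25

# First pair is poorly optimized, second pair is close to its targets.
s_pos = np.array([0.2, 0.9])
s_neg = np.array([0.8, 0.1])
alpha_pos, alpha_neg = self_weights(s_pos, s_neg, o_pos, o_neg)
```

The poorly optimized pair receives the larger weights, which is the behaviour the text describes: scores far from their targets are corrected more strongly.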

We define $\Delta^{\prime\prime}_{ijk,\,l}$ as the self-weighted contrastive similarity difference between the $i$-th anchor, its $j$-th positive sample, and its $k$-th negative sample at the $l$-th level. The expression of $\Delta^{\prime\prime}_{ijk,\,l}$ is:

$$\Delta^{\prime\prime}_{ijk,\,l}=[s_{ik,l}^{-}-o_{l}^{-}]_{+}\,\Big(s^{-}_{ik,\,l}-\delta^{-}\Big)-[o_{l}^{+}-s_{ij,\,l}^{+}]_{+}\,\Big(s^{+}_{ij,\,l}-\delta^{+}\Big).\tag{13}$$

We sum the self-weighted contrastive similarity differences across the four graph granularity levels. The linear self-weighted multi-level GSSL loss, denoted as $\mathcal{L}_{\text{LSW-ML-GSSL}}$, is expressed as follows:

$$\mathcal{L}_{\text{LSW-ML-GSSL}}=\frac{1}{n}\sum_{i=1}^{n}\log\left[1+\sum_{j=1}^{n_{1}}\sum_{k=1}^{n_{2}}\exp\left(\gamma\sum_{l=1}^{4}\Delta^{\prime\prime}_{ijk,\,l}\right)\right].\tag{14}$$

After omitting the indices $i$, $j$, and $k$ for the sake of simplicity and substituting the self-weighting coefficients $\alpha^{+}_{ij,\,l}$ and $\alpha^{-}_{ik,\,l}$ with their respective values, the decision boundary associated with $\mathcal{L}_{\text{LSW-ML-GSSL}}$ (i.e., $\sum_{l=1}^{4}\Delta^{\prime\prime}_{l}=0$) can be formally expressed as follows:

$$\sum_{l=1}^{4}\left[\left(s^{-}_{l}-\frac{o^{-}_{l}+\delta^{-}}{2}\right)^{2}+\left(s^{+}_{l}-\frac{o^{+}_{l}+\delta^{+}}{2}\right)^{2}\right]=\frac{1}{4}\sum_{l=1}^{4}\left[(\delta^{-}-o^{-}_{l})^{2}+(o^{+}_{l}-\delta^{+})^{2}\right].\tag{15}$$

Eq. ([15](https://arxiv.org/html/2605.12685#S4.E15)) represents a hypersphere, i.e., a 7-dimensional manifold embedded in the 8-dimensional similarity-score space, with radius $r=\frac{1}{2}\sqrt{\sum_{l=1}^{4}\left[(\delta^{-}-o^{-}_{l})^{2}+(o^{+}_{l}-\delta^{+})^{2}\right]}$. The loss function expects that $s^{+}>\delta^{+}$ and $s^{-}<\delta^{-}$. We use a single margin hyperparameter $m$ by setting $\delta^{+}=1-m$ and $\delta^{-}=m$. We then set the reference endpoints to $o_{l}^{+}=1+m$ and $o_{l}^{-}=-m$ for all $l\in\{1,\dots,4\}$. These endpoint values are used only in the weighting functions in Eqs. ([11](https://arxiv.org/html/2605.12685#S4.E11))–([12](https://arxiv.org/html/2605.12685#S4.E12)). They are not similarity scores and they are not required to lie in $[0,1]$. The stationary targets induced by Eq. ([15](https://arxiv.org/html/2605.12685#S4.E15)) are the midpoints $\frac{o_{l}^{+}+\delta^{+}}{2}=1$ and $\frac{o_{l}^{-}+\delta^{-}}{2}=0$, which lie in $[0,1]$ under the shifted cosine similarity. The equation of the decision boundary then becomes:

$$\sum_{l=1}^{4}\left[(s^{-}_{l})^{2}+(s^{+}_{l}-1)^{2}\right]=8m^{2}.\tag{16}$$

The only parameter in Eq. ([16](https://arxiv.org/html/2605.12685#S4.E16)) is $m$, which determines the radius of the decision boundary ($r=2\sqrt{2}\,m$). It can be interpreted as a relaxation factor that adjusts the level of flexibility during the optimization process.
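The step from Eq. (13) to Eq. (16) can be verified numerically: with $\delta^{+}=1-m$, $\delta^{-}=m$, $o_{l}^{+}=1+m$, and $o_{l}^{-}=-m$, the summed self-weighted difference equals the squared distance to the ideal point minus $8m^{2}$. The sketch below checks this on random scores; the function name, the value of $m$, and the random seed are assumptions of the example.

```python
import numpy as np

def self_weighted_delta(s_pos, s_neg, m):
    """Eq. (13) per level, with delta+ = 1 - m, delta- = m, o+ = 1 + m, o- = -m."""
    d_pos, d_neg = 1.0 - m, m
    o_pos, o_neg = 1.0 + m, -m
    a_neg = np.maximum(s_neg - o_neg, 0.0)       # Eq. (12)
    a_pos = np.maximum(o_pos - s_pos, 0.0)       # Eq. (11)
    return a_neg * (s_neg - d_neg) - a_pos * (s_pos - d_pos)

rng = np.random.default_rng(0)
m = 0.25                                         # hypothetical margin
s_pos = rng.uniform(0.0, 1.0, size=4)            # one positive score per level
s_neg = rng.uniform(0.0, 1.0, size=4)            # one negative score per level

exponent = self_weighted_delta(s_pos, s_neg, m).sum()
squared_distance = np.sum(s_neg**2 + (s_pos - 1.0)**2)   # left-hand side of Eq. (16)
```

Here `exponent` matches `squared_distance - 8 * m**2`, so the exponent vanishes exactly on the hypersphere of Eq. (16). For scores in $[0,1]$ and $m>0$ the cut-off at zero is never active, which is why the identity holds without case distinctions.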

Gradient Analysis. To simplify visualization, we will focus on the single abstraction level case with one anchor, one positive sample, and one negative sample. The linear self-weighted loss becomes:

$$\mathcal{L}=\log\left[1+\exp\left(\gamma\,\big(\alpha^{-}(s^{-}-\delta^{-})-\alpha^{+}(s^{+}-\delta^{+})\big)\right)\right].\tag{17}$$

We study the gradient of $\mathcal{L}$ w.r.t. the positive and negative similarities. The corresponding partial derivatives are given by:

$$\left|\frac{\partial\mathcal{L}}{\partial s^{-}}\right|=\frac{2\,\gamma\,s^{-}\,\exp(\gamma\,\Delta^{\prime\prime})}{1+\exp(\gamma\,\Delta^{\prime\prime})},\tag{18}$$

$$\left|\frac{\partial\mathcal{L}}{\partial s^{+}}\right|=\frac{2\,\gamma\,\left|1-s^{+}\right|\exp(\gamma\,\Delta^{\prime\prime})}{1+\exp(\gamma\,\Delta^{\prime\prime})},\tag{19}$$

where $\Delta^{\prime\prime}=\alpha^{-}(s^{-}-\delta^{-})-\alpha^{+}(s^{+}-\delta^{+})$ in both equations. Under the setting $\delta^{+}=1-m$, $\delta^{-}=m$, $o^{+}=1+m$, and $o^{-}=-m$, this quantity reduces to $\Delta^{\prime\prime}=(s^{-})^{2}+(1-s^{+})^{2}-2m^{2}$, whose partial derivatives are $2s^{-}$ and $-2(1-s^{+})$.

![Refer to caption](https://arxiv.org/html/2605.12685v1/figures/gradient_LSW_sn.png)

Figure 4: The gradient magnitude of $\mathcal{L}$ w.r.t. $s^{-}$ ($\left|\frac{\partial\mathcal{L}}{\partial s^{-}}\right|$).

![Refer to caption](https://arxiv.org/html/2605.12685v1/figures/gradient_LSW_sp.png)

Figure 5: The gradient magnitude of $\mathcal{L}$ w.r.t. $s^{+}$ ($\left|\frac{\partial\mathcal{L}}{\partial s^{+}}\right|$).

Figs. [4](https://arxiv.org/html/2605.12685#S4.F4) and [5](https://arxiv.org/html/2605.12685#S4.F5) depict the gradient magnitude of $\mathcal{L}$ w.r.t. $s^{-}$ and $s^{+}$, respectively. In Fig. [3](https://arxiv.org/html/2605.12685#S3.F3), the partial derivatives remain large even when the similarity scores are close to their target values (i.e., $s^{+}\approx 1$ and $s^{-}\approx 0$); the proposed self-weighting mechanism addresses this issue. Specifically, the weighting coefficients dynamically scale the gradient according to the deviation of the similarity scores from their targets. At $B_{1}(0.8,0.8)$, $s^{-}$ is far from its target while $s^{+}$ is relatively close to its target, so the loss prioritizes reducing $s^{-}$ by yielding a large $\left|\partial\mathcal{L}/\partial s^{-}\right|$ and a smaller $\left|\partial\mathcal{L}/\partial s^{+}\right|$. Conversely, at $B_{2}(0.6,0.5)$, $s^{-}$ is smaller, which results in a noticeably smaller $\left|\partial\mathcal{L}/\partial s^{-}\right|$ (and relatively more emphasis on improving $s^{+}$). This demonstrates that the proposed self-weighting mechanism improves optimization flexibility. Moreover, the new convergence boundary (a circle in our toy scenario) reduces the convergence ambiguity that arises under uniform weighting of positive and negative similarities.

Convergence Analysis. We study the optimization dynamics directly in the similarity-score space $\{s_{l}^{+},s_{l}^{-}\}_{l=1}^{4}$ for the toy setting with one anchor and one positive/negative sample per level. This allows us to rigorously characterize the effect of the proposed self-weighting mechanism on the geometry and dynamics of the loss.

###### Proposition 1 (Quadratic form of the exponent argument).

By adopting a shifted cosine similarity ($(1+\cos)/2$) and applying the reparameterization $\delta^{+}=1-m$, $\delta^{-}=m$, $o_{l}^{+}=1+m$, and $o_{l}^{-}=-m$ for all levels $l$, we can rewrite $\Delta^{\prime\prime}_{l}$ as follows:

$$\Delta^{\prime\prime}_{l}=(s_{l}^{-})^{2}+(1-s_{l}^{+})^{2}-2m^{2}.$$

Consequently, the exponent argument of the toy version of Eq. ([14](https://arxiv.org/html/2605.12685#S4.E14)) becomes:

$$\sum_{l=1}^{4}\Delta^{\prime\prime}_{l}=D-8m^{2},\qquad D=\sum_{l=1}^{4}\big[(s_{l}^{-})^{2}+(1-s_{l}^{+})^{2}\big].$$

It is important to reiterate that $o_{l}^{+}$ and $o_{l}^{-}$ are reference endpoints, while the feasible midpoint targets are $\frac{o_{l}^{+}+\delta^{+}}{2}=1$ and $\frac{o_{l}^{-}+\delta^{-}}{2}=0$.

###### Theorem 1 (Error contraction in the toy similarity space).

In the same toy setting, we define the error vector:

$$\mathbf{e}=[s_{1}^{-},\dots,s_{4}^{-},\,1-s_{1}^{+},\dots,1-s_{4}^{+}]^{\top}.$$

Let

$$Z=D-8m^{2},\qquad\mathcal{L}=\log\big(1+\exp(\gamma Z)\big).$$

Then

$$\nabla_{\mathbf{e}}\mathcal{L}=2\,\gamma\,\sigma(\gamma\,Z)\,\mathbf{e},\quad\text{where}\quad\sigma(t)=\frac{e^{t}}{1+e^{t}}\in(0,1).$$

Under gradient descent with step size $0<\eta<\frac{1}{2\gamma}$, the error vector contracts multiplicatively:

$$\mathbf{e}^{(t+1)}=\Big(1-2\,\eta\,\gamma\,\sigma(\gamma\,Z^{(t)})\Big)\mathbf{e}^{(t)},$$

and therefore, since $D=\|\mathbf{e}\|_{2}^{2}$:

$$D^{(t+1)}=\Big(1-2\,\eta\,\gamma\,\sigma(\gamma\,Z^{(t)})\Big)^{2}\,D^{(t)}.$$

Hence, the dynamics perform radial descent toward the ideal point $s_{l}^{-}\to 0$ and $s_{l}^{+}\to 1$ for all $l$.
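The contraction stated in Theorem 1 can be observed in a short simulation that runs gradient descent directly on the error vector. This is only a sketch of the toy dynamics: the values of $\gamma$, $m$, and $\eta$, the random initial point, and the number of steps are assumptions chosen so that $\eta<\frac{1}{2\gamma}$ holds.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

gamma, m, eta = 4.0, 0.25, 0.1          # hypothetical values; eta < 1 / (2 * gamma) = 0.125
rng = np.random.default_rng(1)
e = rng.uniform(0.0, 1.0, size=8)       # [s_1^-, ..., s_4^-, 1 - s_1^+, ..., 1 - s_4^+]
e0 = e.copy()

D_hist = [np.sum(e**2)]                 # D = ||e||^2
for _ in range(50):
    Z = np.sum(e**2) - 8 * m**2
    grad = 2 * gamma * sigmoid(gamma * Z) * e     # gradient from Theorem 1
    e = e - eta * grad                            # plain gradient descent step
    D_hist.append(np.sum(e**2))

# The update only rescales e, so its direction never changes (radial descent).
direction_preserved = np.allclose(e / np.linalg.norm(e), e0 / np.linalg.norm(e0))
```

In this run `D_hist` decreases at every step and `direction_preserved` is true, which is what the multiplicative update $\mathbf{e}^{(t+1)}=c_{t}\,\mathbf{e}^{(t)}$ with $c_{t}\in(0,1)$ implies.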

We now show that the multiplicative contraction property in Theorem [1](https://arxiv.org/html/2605.12685#Thmtheorem1) is specific to the self-weighted construction and does not hold for the linear multi-level combination in Eq. ([8](https://arxiv.org/html/2605.12685#S4.E8)). Consider the toy version of Eq. ([8](https://arxiv.org/html/2605.12685#S4.E8)), with fixed coefficients $\beta_{l}$:

$$\mathcal{L}_{\mathrm{lin}}=\log\Big(1+\exp\big(\gamma S\big)\Big),\tag{20}$$

$$S=\sum_{l=1}^{4}\beta_{l}\big(s_{l}^{-}-s_{l}^{+}+m\big).\tag{21}$$

###### Proposition 2 (No radial descent / no multiplicative contraction for the linear combination).

The gradient of $\mathcal{L}_{\mathrm{lin}}$ in $\mathbf{e}$-coordinates can be expressed as follows:

$$\nabla_{\mathbf{e}}\mathcal{L}_{\mathrm{lin}}=\gamma\,\sigma(\gamma S)\,[\beta_{1},\dots,\beta_{4},\beta_{1},\dots,\beta_{4}]^{\top},$$

which is a fixed direction independent of $\mathbf{e}$ (up to the scalar factor $\sigma(\gamma S)$). Therefore, in general $\nabla_{\mathbf{e}}\mathcal{L}_{\mathrm{lin}}$ is not parallel to $\mathbf{e}$ and the update is not radial. In particular, there does not exist a scalar sequence $\{c_{t}\}_{t}$ such that $\mathbf{e}^{(t+1)}=c_{t}\,\mathbf{e}^{(t)}$ holds for all initial $\mathbf{e}^{(0)}$ under gradient descent on $\mathcal{L}_{\mathrm{lin}}$.
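Proposition 2 can be illustrated with a single gradient step on the linear loss. In the sketch below, the coefficients $\beta_{l}$, the hyperparameters, and the two starting points are hypothetical; the point is only that the gradient direction is the same at both points and that the update is not a rescaling of $\mathbf{e}$.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def linear_grad(e, beta, gamma, m):
    """Gradient of L_lin in e-coordinates, with s^- = e[:4] and 1 - s^+ = e[4:]."""
    S = np.sum(beta * (e[:4] + e[4:] - 1.0 + m))   # Eq. (21) rewritten in e-coordinates
    return gamma * sigmoid(gamma * S) * np.concatenate([beta, beta])

gamma, m, eta = 4.0, 0.25, 0.1                     # hypothetical hyperparameters
beta = np.array([0.4, 0.3, 0.2, 0.1])              # hypothetical level weights

e_a = np.array([0.9, 0.1, 0.5, 0.3, 0.2, 0.8, 0.4, 0.6])
e_b = np.array([0.1, 0.7, 0.2, 0.9, 0.6, 0.3, 0.8, 0.5])
g_a, g_b = linear_grad(e_a, beta, gamma, m), linear_grad(e_b, beta, gamma, m)

# Same normalized direction at both points: the gradient ignores where e currently is.
same_direction = np.allclose(g_a / np.linalg.norm(g_a), g_b / np.linalg.norm(g_b))

# One gradient step from e_a, then the cosine between the old and new error vectors.
e_next = e_a - eta * g_a
cos_next = e_next @ e_a / (np.linalg.norm(e_next) * np.linalg.norm(e_a))
```

Since `cos_next` is strictly below 1, the new error vector is not a scalar multiple of the old one, in contrast with the radial update of the self-weighted loss.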

The proofs of Proposition 1, Theorem 1, and Proposition 2 are provided in Appendix[A](https://arxiv.org/html/2605.12685#A1)\.

## 5 Experiments

We carry out extensive experiments to evaluate the effectiveness of our methods. At the single level, we compare our unified SL-GSSL approach with state-of-the-art methods that operate within the same abstraction level. Subsequently, we evaluate our multi-level approaches (i.e., L-ML-GSSL, LW-ML-GSSL, and LSW-ML-GSSL) against state-of-the-art multi-task GSSL methods. This systematic evaluation ensures a comprehensive understanding of how our method performs relative to the best available techniques. The code of the proposed approaches is available at this [GitHub repository](https://github.com/M-M-Amar/LSW_ML_GSSL).

Baselines. We evaluate the performance of our single-level approaches by comparing them with thirteen state-of-the-art GSSL methods. These methods are selected to ensure comprehensive coverage of all abstraction levels. At the node level, we evaluate our approach (with $l=\text{node}$) against three prominent node-level GSSL methods: GRACE \[[62](https://arxiv.org/html/2605.12685#bib.bib24)\], BGRL \[[50](https://arxiv.org/html/2605.12685#bib.bib25)\], and GCMAE \[[54](https://arxiv.org/html/2605.12685#bib.bib55)\]. At the proximity level, we compare our method (with $l=\text{proximity}$) with Node2Vec \[[13](https://arxiv.org/html/2605.12685#bib.bib26)\], GAE \[[23](https://arxiv.org/html/2605.12685#bib.bib27)\], VGAE \[[23](https://arxiv.org/html/2605.12685#bib.bib27)\], and NCLA \[[47](https://arxiv.org/html/2605.12685#bib.bib56)\]. At the cluster level, we assess our approach (with $l=\text{cluster}$) against GMM-VGAE \[[16](https://arxiv.org/html/2605.12685#bib.bib28)\], DGAE \[[39](https://arxiv.org/html/2605.12685#bib.bib29)\], and DGCLUSTER \[[1](https://arxiv.org/html/2605.12685#bib.bib57)\]. It is crucial to highlight that, at the cluster level, most deep graph clustering methods involve a pretraining self-supervised phase. Following this phase, pseudo-labels are generated and refined during a clustering phase. For instance, GMM-VGAE and DGAE employ proximity-level adjacency reconstruction as a pretext task during the pretraining phase. To ensure comparability with our approach, the proximity-level pretraining is excluded, and the two methods are denoted GMM-VGAE\* and DGAE\*. Additionally, we exclude the auxiliary term in DGCLUSTER's loss function that incorporates pairwise node information, and refer to this method as DGCLUSTER\*. 
Finally, at the graph level, our method \(withl=graphl=\\text\{graph\}\) is compared with DGI\[[51](https://arxiv.org/html/2605.12685#bib.bib30)\], MVGRL\[[14](https://arxiv.org/html/2605.12685#bib.bib31)\], and LS\-GCL\[[56](https://arxiv.org/html/2605.12685#bib.bib59)\]\. We evaluate our multi\-level self\-weighting framework \(LSW\-ML\-GSSL\) against five leading methods in multi\-task graph learning, namely AutoSSL\[[20](https://arxiv.org/html/2605.12685#bib.bib10)\], ParetoGNN\[[21](https://arxiv.org/html/2605.12685#bib.bib13)\], DyFSS\[[61](https://arxiv.org/html/2605.12685#bib.bib40)\], WAS\[[7](https://arxiv.org/html/2605.12685#bib.bib43)\], and GraphTCM\[[8](https://arxiv.org/html/2605.12685#bib.bib41)\], together with four notable approaches for multi\-scale graph learning: ANEMONE\[[18](https://arxiv.org/html/2605.12685#bib.bib50)\], MERIT\[[19](https://arxiv.org/html/2605.12685#bib.bib52)\], MSSGCL\[[29](https://arxiv.org/html/2605.12685#bib.bib51)\], and LMGTA\[[26](https://arxiv.org/html/2605.12685#bib.bib60)\]\. For each baseline, we use the official code shared by the authors and tune the hyperparameters in cases where explicit guidelines are not provided\. To ensure a fair comparison, we run each method1010times and report the average performance along with its standard deviation\.

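TABLE I: Datasets statistics\.

| | Cora | CiteSeer | Pubmed | DBLP | Photo | Computers |
|---|---|---|---|---|---|---|
| \#Nodes | 2,708 | 3,327 | 19,717 | 17,716 | 7,650 | 13,752 |
| \#Edges | 10,556 | 9,104 | 88,648 | 105,734 | 238,162 | 491,722 |
| \#Features | 1,433 | 3,703 | 500 | 1,639 | 745 | 767 |
| \#Classes | 7 | 6 | 3 | 4 | 8 | 10 |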

Datasets, Downstream Tasks, and Evaluation Metrics\. We employed six diverse datasets to assess the performance of our approach: Cora \[[34](https://arxiv.org/html/2605.12685#bib.bib17)\], CiteSeer \[[12](https://arxiv.org/html/2605.12685#bib.bib18)\], Pubmed \[[43](https://arxiv.org/html/2605.12685#bib.bib19)\], DBLP \[[49](https://arxiv.org/html/2605.12685#bib.bib20)\], Photo \[[46](https://arxiv.org/html/2605.12685#bib.bib21)\], and Computers \[[46](https://arxiv.org/html/2605.12685#bib.bib21)\]\. Table [I](https://arxiv.org/html/2605.12685#S5.T1) provides a summary of key dataset characteristics, with additional details outlined below\.

- Cora: The dataset is a citation network comprising 2,708 scientific publications \(nodes\), each classified into one of seven categories\. It contains a total of 10,556 links\. Each publication in the dataset is characterized by a binary vector \(0/1\), where each element signifies whether a specific word from a dictionary of 1,433 words \(features\) is absent or present\.
- CiteSeer: The dataset includes 3,327 scientific publications divided into six categories, connected by a total of 9,104 links\. Each publication is defined by a binary vector, marking whether each of 3,703 unique dictionary words is present or absent\.
- Pubmed: This dataset includes 19,717 scientific articles from the PubMed database, all focused on diabetes, categorized into three distinct classes\. The citation network contains 88,648 connections\. Each article is represented by a TF\-IDF \(term frequency\-inverse document frequency\) weighted word vector derived from a dictionary containing 500 unique words\.
- DBLP: This is a citation network compiled from sources like DBLP, ACM, MAG \(Microsoft Academic Graph\), and others\. The network comprises 17,716 articles connected by 105,734 links\. Each article is represented by 1,639 features and categorized into one of 4 classes\.
- Photo: This dataset’s nodes represent goods, while edges represent frequent co\-purchases\. The reviews are utilized to generate bag\-of\-words node features\. The Photo network comprises 7,650 nodes, 238,162 edges, and 745 features\. It is partitioned into 8 distinct classes\.
- Computers: Similar to Photo, Computers is a co\-purchase graph derived from Amazon\. It consists of 13,752 nodes, 491,722 edges, 767 features, and 10 classes\.

TABLE II: Fixed hyperparameters for all datasets\.

Our experiments focus on three downstream tasks: node classification, node clustering, and link prediction\. We assess the models using metrics specific to the downstream tasks\. For node classification, we measure performance using Accuracy \(ACC\)\. Node clustering is evaluated using Normalized Mutual Information \(NMI\) and Adjusted Rand Index \(ARI\), while link prediction performance is evaluated using the Area Under the Receiver Operating Characteristic Curve \(AUC\-ROC\)\.
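The four metrics named above can be sketched in plain Python as follows\. This is a minimal illustration, not the evaluation code used in the experiments: the function names are hypothetical, NMI is normalized by the arithmetic mean of the two entropies \(an assumption, since the normalization is not stated here\), and degenerate inputs such as a single cluster or a single class are not handled\.

```python
from collections import Counter
from math import comb, log

def accuracy(y_true, y_pred):
    """Fraction of nodes whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(labels, scores):
    """Probability that a random positive edge outranks a random negative one (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def adjusted_rand_index(a, b):
    """ARI computed from pair counts of the contingency table of two partitions."""
    n = len(a)
    pairs = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    max_index = 0.5 * (pairs_a + pairs_b)
    return (pairs - expected) / (max_index - expected)

def normalized_mutual_info(a, b):
    """Mutual information divided by the arithmetic mean of the two entropies (assumed normalization)."""
    n = len(a)
    ca, cb, cab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum(c / n * log(c * n / (ca[x] * cb[y])) for (x, y), c in cab.items())
    ha = -sum(c / n * log(c / n) for c in ca.values())
    hb = -sum(c / n * log(c / n) for c in cb.values())
    return mi / (0.5 * (ha + hb))

# Toy inputs: two clusterings that agree up to a relabeling of the cluster ids.
truth, pred = [0, 0, 1, 1], [1, 1, 0, 0]
print(adjusted_rand_index(truth, pred), normalized_mutual_info(truth, pred))
print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]), roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

ARI and NMI are invariant to cluster relabeling, which is why they suit clustering evaluation, whereas ACC compares labels directly\.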

TABLE III: Data\-dependent hyperparameters\.

TABLE IV: Performance and task generalization of our SL\-GSSL methods and the state\-of\-the\-art single\-task GSSL approaches\.

TABLE V: Ablation study of the single\-level unified objective in Eq\. \([4](https://arxiv.org/html/2605.12685#S3.E4)\)\. Best\-SL\-GSSL denotes the best performing single\-level SL\-GSSL variant across the abstraction levels\. For each ablation setting, we also report the best result attained across the abstraction levels\. Best\-SL\-GSSL\-No\-Margin removes the margin by setting $m=0$\. Best\-SL\-GSSL\-Hinge\-Loss replaces the exponential transformation with the hinge form $\max\!\left(0,\,s^{-}-s^{+}+m\right)$\. Best\-SL\-GSSL\-InfoNCE uses the standard Information Noise Contrastive Estimation loss defined over $s^{+}$ and $s^{-}$\.

TABLE VI: Performance and task generalization of our LSW\-ML\-GSSL method and the state\-of\-the\-art multi\-task and multi\-scale GSSL approaches\.

TABLE VII: The effect of the self\-weighting mechanism compared to its linear counterparts\.

TABLE VIII: Comparison of multi\-loss weighting methods and LSW\-ML\-GSSL\.

TABLE IX: The impact of each abstraction level on the performance of LSW\-ML\-GSSL\.

Hyperparameters\. To ensure a fair comparison, the same GNN encoder architecture is employed across all baselines for each dataset\. The GNN encoder is a graph convolutional network \(GCN\) that projects the input graph into a latent space of dimension 256\. We use PReLU activation functions for the GNN encoder\. Subsequently, a two\-layer fully\-connected projection head refines the embeddings\. The ELU activation function is used for the projection head\. We measure positive and negative similarity scores using a shifted cosine similarity, $(1+\cos)/2$\. The graph augmentation hyperparameters control the transformations applied to the graph during training\. These augmentations include edge dropout, which randomly removes edges from the graph with a different rate for each of the two positive augmentations, and node feature dropout, which introduces stochasticity by masking a fraction of the node features with a different probability for each of the two positive augmentations\.
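The shifted cosine similarity and the two augmentations described above can be sketched in plain Python as follows\. This is a minimal illustration under stated assumptions: the function names and the drop rates are hypothetical \(the fixed rates from Table II are not reproduced here\), and masking whole feature columards shared by all nodes is an assumed convention, since the text only says that a fraction of the node features is masked\.

```python
import math
import random

def shifted_cosine(u, v):
    """Map cosine similarity from [-1, 1] to [0, 1] via (1 + cos) / 2."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 0.5 * (1.0 + dot / norm)  # undefined for zero vectors (norm == 0)

def drop_edges(edges, rate, rng):
    """Keep each edge independently with probability 1 - rate."""
    return [e for e in edges if rng.random() >= rate]

def mask_features(x, rate, rng):
    """Zero each feature column with probability `rate` (column-wise masking is an assumption)."""
    keep = [rng.random() >= rate for _ in range(len(x[0]))]
    return [[val if k else 0.0 for val, k in zip(row, keep)] for row in x]

rng = random.Random(0)
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
x = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Two positive views built with two different (hypothetical) drop rates.
view_1 = (drop_edges(edges, 0.2, rng), mask_features(x, 0.3, rng))
view_2 = (drop_edges(edges, 0.4, rng), mask_features(x, 0.1, rng))

print(shifted_cosine(x[0], x[1]))
```

Because the shifted score lies in $[0,1]$, the targets $s^{+}\to 1$ and $s^{-}\to 0$ used in the convergence analysis are the two ends of its range\.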

We categorize the hyperparameters of our model into two types\. The first type consists of constant hyperparameters that are independent of the processed dataset, as detailed in Table [II](https://arxiv.org/html/2605.12685#S5.T2)\. This category includes hyperparameters related to data augmentation, the GNN architecture, and the training process\. For example, the edge drop rates and feature drop rates for positive augmentations are fixed, as are the GNN hidden and latent dimensions, the learning rate, and the optimizer\. The second category consists of three hyperparameters influenced by the input dataset: $m$, $\gamma$, and the number of epochs\. We select fixed values for $m$ and $\gamma$ from the respective candidate sets $[0.10,\,0.15,\,0.20,\,0.25,\,0.30]$ and $[1.0,\,1.5,\,2.0,\,2.5,\,3.0]$\. The number of epochs is determined based on the maximum validation accuracy\. The hyperparameters that are influenced by the properties of the data are provided in Table [III](https://arxiv.org/html/2605.12685#S5.T3)\.

Single\-Task Results\. Table [IV](https://arxiv.org/html/2605.12685#S5.T4) summarizes the comparative performance of various single\-task GSSL methods across multiple datasets and tasks\. The results are categorized based on the abstraction levels of the methods: Node, Proximity, Cluster, and Graph\. The evaluated tasks include node classification \(Accuracy\), node clustering \(NMI/ARI\), and link prediction \(ROC\-AUC\)\. We report the average performance across all tasks to provide a comprehensive assessment of overall effectiveness\. Our methods yield competitive results across diverse tasks and datasets and consistently outperform previous approaches in terms of average performance\. In particular, SL\-GSSL excels in node clustering and link prediction\.

No single abstraction level consistently outperforms the others on every downstream task\. Each level has its strengths and weaknesses, and its suitability depends on the demands and inherent properties of the task\. For node classification, our node\-level approach achieves the best performance on 4 of the 6 evaluated datasets\. This could be attributed to its ability to capture fine\-grained, node\-specific patterns that are critical for node classification\. For link prediction, our proximity\-level approach achieves the best performance on most datasets, which likely stems from its ability to capture relational patterns and interactions between node pairs\. Unlike node\-level approaches that focus on individual nodes and their features, proximity\-level models prioritize the structural and semantic relationships between nodes, which are essential for accurate link prediction\. In addition, the results show that our node\-level approach outperforms other methods in terms of clustering performance\. However, cluster\-level methods have the potential to yield superior results when combined with a pretext task, which was not included in our experiments\. Consequently, methods such as GMM\-VGAE\* and DGAE\*, which strongly rely on proximity\-level pretraining, exhibit lower performance in this context\. Overall, none of the single\-level methods generalizes consistently across all tasks and datasets\. This limitation highlights the need for multi\-task GSSL\.

Ablation on the Single\-Level Unified Loss\. To assess how the single\-level unified objective benefits from its design choices, we compare Best\-SL\-GSSL with three ablation variants that remove the margin, replace the exponential term in Eq\. \([4](https://arxiv.org/html/2605.12685#S3.E4)\) with a hinge form \[[15](https://arxiv.org/html/2605.12685#bib.bib71)\], or substitute the loss with the standard InfoNCE objective \[[62](https://arxiv.org/html/2605.12685#bib.bib24)\]\. Best\-SL\-GSSL\-No\-Margin sets $m=0$\. Best\-SL\-GSSL\-Hinge\-Loss replaces the exponential transformation with the hinge form $\max(0,\,s^{-}-s^{+}+m)$\. Best\-SL\-GSSL\-InfoNCE replaces the unified loss with the standard InfoNCE objective over the same positive and negative similarity scores\. As can be seen in Table [V](https://arxiv.org/html/2605.12685#S5.T5), Best\-SL\-GSSL is consistently the most effective and robust across datasets and tasks\. Removing the margin consistently harms performance, especially for clustering, highlighting its role as a separation constraint that stabilizes optimization and improves the structure of the learned embedding space\. Replacing the exponential term with a hinge penalty is generally more detrimental, especially for link prediction and clustering, suggesting that the exponential loss yields better optimization\. The InfoNCE\-based replacement yields lower performance than the unified objective in most settings, indicating that the proposed formulation is more closely matched to the single\-level similarity optimization problem addressed in this work\. Overall, both the margin and smooth exponential penalty are key to strong downstream generalization\.
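One way to see why a smooth penalty can behave differently from the hinge form is to compare their gradients once the margin is already satisfied\. The sketch below is illustrative only: the hinge form is the one stated above, whereas `softplus_penalty` is an assumed smooth counterpart, $\log\big(1+\exp(\gamma\,(s^{-}-s^{+}+m))\big)$, modeled on the $\log(1+\exp(\gamma\,\cdot))$ shape used in the convergence analysis\. It is not claimed to be Eq\. \(4\), and all function names and numeric values are hypothetical\.

```python
import math

def hinge_penalty(s_pos, s_neg, m):
    """Hinge form from the ablation: max(0, s^- - s^+ + m)."""
    return max(0.0, s_neg - s_pos + m)

def hinge_grad_s_pos(s_pos, s_neg, m):
    """Derivative of the hinge w.r.t. s^+: -1 while the margin is violated, else 0."""
    return -1.0 if s_neg - s_pos + m > 0 else 0.0

def softplus_penalty(s_pos, s_neg, m, gamma):
    """Assumed smooth counterpart: log(1 + exp(gamma * (s^- - s^+ + m)))."""
    return math.log1p(math.exp(gamma * (s_neg - s_pos + m)))

def softplus_grad_s_pos(s_pos, s_neg, m, gamma):
    """Derivative of the smooth penalty w.r.t. s^+; it is never exactly zero."""
    z = gamma * (s_neg - s_pos + m)
    return -gamma / (1.0 + math.exp(-z))

m, gamma = 0.2, 2.0
for s_pos, s_neg in [(0.5, 0.5), (0.9, 0.1)]:
    print(
        f"s+={s_pos} s-={s_neg} "
        f"hinge={hinge_penalty(s_pos, s_neg, m):.3f} grad={hinge_grad_s_pos(s_pos, s_neg, m):.3f} "
        f"smooth={softplus_penalty(s_pos, s_neg, m, gamma):.3f} "
        f"grad={softplus_grad_s_pos(s_pos, s_neg, m, gamma):.3f}"
    )
```

For a well\-separated pair the hinge contributes neither loss nor gradient, while the smooth form keeps a small pull toward $s^{+}\to 1$; setting `m = 0.0` in the same functions gives a rough analogue of the no\-margin variant\.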

Multi\-Task and Multi\-Scale Results\. Table [VI](https://arxiv.org/html/2605.12685#S5.T6) reports the performance of our linear self\-weighting multi\-level GSSL \(LSW\-ML\-GSSL\) method on node classification, node clustering, and link prediction, compared to several state\-of\-the\-art multi\-task and multi\-scale GSSL approaches\. Overall, LSW\-ML\-GSSL consistently achieves the best and most robust performance across datasets and tasks, highlighting its strong generalization beyond a single evaluation setting\. In contrast to multi\-task methods that can suffer from objective interference, and multi\-scale methods that may not adequately reconcile granularity levels, our approach explicitly integrates node\-, proximity\-, cluster\-, and graph\-level signals while automatically balancing their contributions through linear self\-weighting\. This adaptive coordination prevents any single level from dominating the optimization and promotes representations that remain simultaneously discriminative\. This further emphasizes the significance of the new decision boundary in enhancing optimization flexibility and achieving a more definitive convergence status\.

The Impact of the Proposed Weighting Mechanism\. As shown in Table [VII](https://arxiv.org/html/2605.12685#S5.T7), the proposed LSW\-ML\-GSSL consistently delivers superior performance compared to both its linear weighting counterparts \(L\-ML\-GSSL and LW\-ML\-GSSL\) and the leading single\-level baseline \(Best\-SL\-GSSL\)\. Notably, neither L\-ML\-GSSL nor LW\-ML\-GSSL is able to surpass the best\-performing single\-level method\. This indicates that simply applying a linear combination of similarities and dissimilarities across the four abstraction levels is insufficient to properly balance their contributions\. As a result, the downstream performance of these linear strategies remains limited, often falling below that of Best\-SL\-GSSL\. In contrast, the self\-weighting mechanism adjusts the weights in a fine\-grained, adaptive way based on how far each similarity and dissimilarity score is from its target value\. This enables a more balanced and effective use of multi\-level graph semantics, resulting in consistently superior performance\.

![Refer to caption](https://arxiv.org/html/2605.12685v1/figures/execution_time.png)
Figure 6: Execution time in seconds of LSW\-ML\-GSSL and state\-of\-the\-art multi\-task GSSL methods\.

Comparison with Multi\-Loss Weighting Baselines\. Table [VIII](https://arxiv.org/html/2605.12685#S5.T8) compares our LSW\-ML\-GSSL approach with seven multi\-loss balancing baselines, namely Uncertainty weighting \[[22](https://arxiv.org/html/2605.12685#bib.bib61)\], GradNorm \[[4](https://arxiv.org/html/2605.12685#bib.bib62)\], PCGrad \[[58](https://arxiv.org/html/2605.12685#bib.bib38)\], CAGrad \[[27](https://arxiv.org/html/2605.12685#bib.bib63)\], AdaTask \[[55](https://arxiv.org/html/2605.12685#bib.bib64)\], IGBv1 \[[5](https://arxiv.org/html/2605.12685#bib.bib65)\], and AUT \[[60](https://arxiv.org/html/2605.12685#bib.bib66)\]\. For these methods, the aggregated training loss is obtained by combining the corresponding single\-level losses through the respective weighting strategy\. These approaches regulate optimization primarily at the loss or gradient level by adjusting task\-level weights or alleviating gradient conflicts, and thus focus on learning a global compromise among objectives\. In contrast, our linear self\-weighting mechanism operates at a finer\-grained level, weighting similarity and dissimilarity scores directly\. Even against these advanced balancing schemes, LSW\-ML\-GSSL consistently ranks first across tasks and datasets\. This suggests that loss reweighting or gradient conflict handling is less effective for multi\-level GSSL than our linear self\-weighting strategy, which better captures cross\-level synergy and improves generalization\.

Ablation of Abstraction Levels in LSW\-ML\-GSSL\. This experiment examines the impact of each abstraction level on the performance of LSW\-ML\-GSSL across all downstream tasks\. As shown in Table [IX](https://arxiv.org/html/2605.12685#S5.T9), both node\-level and graph\-level contributions are fundamental to our approach\. This implies that effective learning requires the two extreme levels of granularity \(i\.e\., node and graph\)\. In addition, multi\-level variants consistently surpass single\-level ones, showing that each abstraction level contributes distinct and complementary information\. It can further be observed that each abstraction level benefits different downstream tasks\. Cluster and graph information yield the largest gains for clustering, proximity information is most influential for link prediction, and the combination of node and graph information proves most effective for node classification\. Finally, the full model achieves superior performance compared to all other variants\. This highlights the complementary roles of all granularity levels and further illustrates how the self\-weighting mechanism is essential for enabling effective synergy among these abstraction levels\.

![Refer to caption](https://arxiv.org/html/2605.12685v1/figures/sensitivity_acc_cora.png)\(a\)ACC Cora
![Refer to caption](https://arxiv.org/html/2605.12685v1/figures/sensitivity_nmi_cora.png)\(b\)NMI Cora
![Refer to caption](https://arxiv.org/html/2605.12685v1/figures/sensitivity_ari_cora.png)\(c\)ARI Cora
![Refer to caption](https://arxiv.org/html/2605.12685v1/figures/sensitivity_auc_cora.png)\(d\)ROC\-AUC Cora
![Refer to caption](https://arxiv.org/html/2605.12685v1/figures/sensitivity_acc_citeseer.png)\(e\)ACC CiteSeer
![Refer to caption](https://arxiv.org/html/2605.12685v1/figures/sensitivity_nmi_citeseer.png)\(f\)NMI CiteSeer
![Refer to caption](https://arxiv.org/html/2605.12685v1/figures/sensitivity_ari_citeseer.png)\(g\)ARI CiteSeer
![Refer to caption](https://arxiv.org/html/2605.12685v1/figures/sensitivity_auc_citeseer.png)\(h\)ROC\-AUC CiteSeer

Figure 7: Sensitivity of LSW\-ML\-GSSL to $m$ and $\gamma$\.

![Refer to caption](https://arxiv.org/html/2605.12685v1/figures/sensitivity_heatmap_results_lambda1/cora/sensitivity_acc_cora_lambda1.png)\(a\)ACC Cora
![Refer to caption](https://arxiv.org/html/2605.12685v1/figures/sensitivity_heatmap_results_lambda1/cora/sensitivity_nmi_cora_lambda1.png)\(b\)NMI Cora
![Refer to caption](https://arxiv.org/html/2605.12685v1/figures/sensitivity_heatmap_results_lambda1/cora/sensitivity_ari_cora_lambda1.png)\(c\)ARI Cora
![Refer to caption](https://arxiv.org/html/2605.12685v1/figures/sensitivity_heatmap_results_lambda1/cora/sensitivity_auc_cora_lambda1.png)\(d\)ROC\-AUC Cora
![Refer to caption](https://arxiv.org/html/2605.12685v1/figures/sensitivity_heatmap_results_lambda1/citeseer/sensitivity_acc_citeseer_lambda1.png)\(e\)ACC CiteSeer
![Refer to caption](https://arxiv.org/html/2605.12685v1/figures/sensitivity_heatmap_results_lambda1/citeseer/sensitivity_nmi_citeseer_lambda1.png)\(f\)NMI CiteSeer
![Refer to caption](https://arxiv.org/html/2605.12685v1/figures/sensitivity_heatmap_results_lambda1/citeseer/sensitivity_ari_citeseer_lambda1.png)\(g\)ARI CiteSeer
![Refer to caption](https://arxiv.org/html/2605.12685v1/figures/sensitivity_heatmap_results_lambda1/citeseer/sensitivity_auc_citeseer_lambda1.png)\(h\)ROC\-AUC CiteSeer

Figure 8: Sensitivity of LSW\-ML\-GSSL to the neighborhood size and the number of clusters\.

Execution Time\. In this experiment, we evaluate the execution time of LSW\-ML\-GSSL compared to previous multi\-task GSSL methods\. The results in Fig\. [6](https://arxiv.org/html/2605.12685#S5.F6) illustrate the efficiency of our approach compared with state\-of\-the\-art methods\. This efficiency stems from the simplicity of the self\-weighting solution that eliminates the inner\-optimization requirement\. Furthermore, our multi\-task GSSL method employs a single GNN encoder, whereas previous multi\-task GSSL methods typically rely on multiple architectural components and/or computationally intensive operations\. These include trainable gating networks \(e\.g\., DyFSS\), task\-specific expert networks \(e\.g\., GraphTCM, DyFSS, and WAS\), and inner\-optimization algorithms \(e\.g\., AutoSSL and ParetoGNN\)\. Notably, GraphTCM was excluded from the comparison due to its significantly higher computational cost\. Specifically, its execution times on the Cora, CiteSeer, Pubmed, DBLP, Photo, and Computers datasets are 716, 1056, 39696, 59109, 29364, and 63116 seconds, respectively\.

Sensitivity to Hyperparameters\. We evaluate the sensitivity of LSW\-ML\-GSSL to the data\-dependent hyperparameters $m$ and $\gamma$\. Specifically, we vary $m$ over $[0.10, 0.15, 0.20, 0.25, 0.30]$ and $\gamma$ over $[1.0, 1.5, 2.0, 2.5, 3.0]$, while keeping all other hyperparameters constant across datasets\. Fig\. [7](https://arxiv.org/html/2605.12685#S5.F7) illustrates the impact of these variations on model performance, assessed on three downstream tasks \(i\.e\., node classification, node clustering, and link prediction\) using ACC, NMI, ARI, and ROC\-AUC for the Cora and CiteSeer datasets\. The results indicate that LSW\-ML\-GSSL remains highly robust across different hyperparameter settings\. As we can see, our approach exhibits small fluctuations in performance across the three downstream tasks\. For both datasets, ROC\-AUC scores are particularly stable across all values of $m$ and $\gamma$, confirming the method’s reliability in link prediction tasks\. ACC values show a slight increase for higher values of $\gamma$, particularly for CiteSeer, where the best performance is observed around $\gamma=2.5$\. For NMI and ARI, which evaluate clustering quality, the performance of LSW\-ML\-GSSL exhibits minor variations, especially for smaller values of $m$ and higher values of $\gamma$\. These results suggest the existence of a trade\-off between the two hyperparameters\. Interestingly, Cora appears to be more sensitive to changes in $m$ and $\gamma$ than CiteSeer, with noticeable performance peaks for certain configurations\. Overall, our approach yields consistent performance for a broad range of values\. These findings confirm that LSW\-ML\-GSSL is highly stable and robust to small hyperparameter changes\.

In addition, we conduct a sensitivity analysis of LSW\-ML\-GSSL with respect to the neighborhood size and the number of clusters, and report the results in Fig\. [8](https://arxiv.org/html/2605.12685#S5.F8)\. Overall, the model exhibits a broad performance plateau on both Cora and CiteSeer\. ACC and ROC\-AUC remain consistently high across a wide range of values, while NMI/ARI vary only mildly and degrade mainly when neighborhoods become large\. These trends suggest that these parameters do not require careful dataset\-specific tuning\. Consequently, we fix them to moderate values to ensure stable performance while maintaining a consistent accuracy–efficiency trade\-off\. In contrast, $m$ and $\gamma$ directly affect the objective, making them inherently more data dependent and worth tuning\.

## 6 Conclusion

This work introduces the multi\-level GSSL paradigm that captures information from four levels of granularity: node, proximity, cluster, and graph\. In single\-task learning, we propose a unified framework that can operate seamlessly at different abstraction levels\. In multi\-task learning, we extend the unified framework by integrating the positive and negative similarity scores across granularity levels\. Our multi\-task formulation aligns with the multi\-level GSSL paradigm\. First, the positive and negative scores are linearly combined\. Second, we propose a self\-weighting mechanism that adaptively prioritizes the scores that deviate significantly from their target values\. The self\-weighting mechanism enhances the optimization flexibility and defines a more precise convergence target\. Extensive experiments show that the proposed single\-task and multi\-task approaches outperform state\-of\-the\-art methods\. In particular, our self\-weighted method promotes task generalization and removes the need for inner optimization and task\-specific expert networks\.

While our framework demonstrates strong performance on homogeneous graphs, many real\-world networks exhibit heterogeneous structures with multiple node and edge types\. A promising direction is extending the multi\-level GSSL paradigm to heterogeneous graphs by replacing the homogeneous encoder with a type\- or relation\-aware GNN, together with per node category projection modules that embed different entities in a common space, so the unified similarity objective remains well defined\. In this setting, the proximity level can be redefined using semantically meaningful meta\-paths, and the cluster level can leverage type\-aware community structures\. Notably, our self\-weighting mechanism is well\-suited for this extension, as it can naturally balance contributions across both abstraction levels and relation types\. A comprehensive heterogeneous treatment and evaluation are left as future work\.

## Appendix A Convergence Analysis

We analyze the optimization dynamics directly in the similarity\-score space $\{s_l^{+},s_l^{-}\}_{l=1}^{4}$ under a toy setting with a single anchor and one positive/negative sample per level\. This formulation enables a rigorous characterization of how the proposed self\-weighting mechanism shapes the loss geometry and the resulting training dynamics\.

###### Proposition 1 \(Quadratic form of the exponent argument\)\.

By adopting a shifted cosine similarity, $(1+\cos)/2$, and applying the reparameterization $\delta^{+}=1-m$, $\delta^{-}=m$, $o_l^{+}=1+m$, and $o_l^{-}=-m$ for all levels $l$, we can rewrite $\Delta''_l$ as follows:

$$\Delta''_l=(s_l^{-})^{2}+(1-s_l^{+})^{2}-2m^{2}.$$

Consequently, the exponent argument of the toy version of Eq\. \([14](https://arxiv.org/html/2605.12685#S4.E14)\) becomes:

$$\sum_{l=1}^{4}\Delta''_l=D-8m^{2},\qquad D=\sum_{l=1}^{4}\big[(s_l^{-})^{2}+(1-s_l^{+})^{2}\big].$$

###### Proof:

From Eqs\. \([11](https://arxiv.org/html/2605.12685#S4.E11)–[12](https://arxiv.org/html/2605.12685#S4.E12)\), the self\-weighting coefficients are $\alpha_l^{+}=[o_l^{+}-s_l^{+}]_{+}$ and $\alpha_l^{-}=[s_l^{-}-o_l^{-}]_{+}$\. Since we use a shifted cosine similarity, $(1+\cos)/2$, the similarity scores satisfy $s_l^{+},s_l^{-}\in[0,1]$ for all $l$\. Since $s_l^{-}\in[0,1]$ and $o_l^{-}=-m$ with $m>0$, we have $s_l^{-}-o_l^{-}=s_l^{-}+m>0$, thus $\alpha_l^{-}=s_l^{-}+m$\. Similarly, since $s_l^{+}\in[0,1]$ and $o_l^{+}=1+m$, we have $o_l^{+}-s_l^{+}=1+m-s_l^{+}>0$, thus $\alpha_l^{+}=1+m-s_l^{+}$\. Plugging into Eq\. \([13](https://arxiv.org/html/2605.12685#S4.E13)\) and using $\delta^{-}=m$ and $\delta^{+}=1-m$:

$$\begin{aligned}
\Delta''_l &= (s_l^{-}+m)\,(s_l^{-}-m)-(1+m-s_l^{+})\,\big(s_l^{+}-(1-m)\big)\\
&= (s_l^{-})^{2}-m^{2}-(1+m-s_l^{+})\,\big(s_l^{+}-1+m\big).
\end{aligned}\tag{22}$$

Let $a=s_l^{+}-1$\. Then $a\in[-1,0]$, $(1+m-s_l^{+})=m-a$, and $(s_l^{+}-1+m)=a+m$, hence

$$\begin{aligned}
(1+m-s_l^{+})\,\big(s_l^{+}-1+m\big) &= (m-a)\,(m+a)\\
&= m^{2}-a^{2}\\
&= m^{2}-(s_l^{+}-1)^{2}.
\end{aligned}\tag{23}$$

Substituting back into \([22](https://arxiv.org/html/2605.12685#A1.E22)\) gives

$$\begin{aligned}
\Delta''_l &= (s_l^{-})^{2}-m^{2}-\Big(m^{2}-(s_l^{+}-1)^{2}\Big)\\
&= (s_l^{-})^{2}+(s_l^{+}-1)^{2}-2m^{2}\\
&= (s_l^{-})^{2}+(1-s_l^{+})^{2}-2m^{2}.
\end{aligned}\tag{24}$$

Summing over $l=1,\dots,4$ yields $\sum_{l=1}^{4}\Delta''_l=\sum_{l=1}^{4}\big[(s_l^{-})^{2}+(1-s_l^{+})^{2}\big]-8m^{2}=D-8m^{2}$\. ∎
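The identity in Proposition 1 can be checked numerically\. The sketch below is a sanity check under the proposition's own assumptions \(scores in $[0,1]$, $o_l^{+}=1+m$, $o_l^{-}=-m$, $\delta^{+}=1-m$, $\delta^{-}=m$\); the function names are hypothetical and the per\-level term is written exactly as it is expanded in the proof, $\alpha_l^{-}(s_l^{-}-\delta^{-})-\alpha_l^{+}(s_l^{+}-\delta^{+})$\.

```python
import random

def relu(x):
    """The [.]_+ operator used in the self-weighting coefficients."""
    return max(0.0, x)

def delta_weighted(s_pos, s_neg, m):
    """Per-level term alpha^- (s^- - delta^-) - alpha^+ (s^+ - delta^+)."""
    alpha_pos = relu((1.0 + m) - s_pos)  # [o^+ - s^+]_+ with o^+ = 1 + m
    alpha_neg = relu(s_neg - (-m))       # [s^- - o^-]_+ with o^- = -m
    return alpha_neg * (s_neg - m) - alpha_pos * (s_pos - (1.0 - m))

def delta_quadratic(s_pos, s_neg, m):
    """Closed form from Proposition 1: (s^-)^2 + (1 - s^+)^2 - 2 m^2."""
    return s_neg ** 2 + (1.0 - s_pos) ** 2 - 2.0 * m ** 2

rng = random.Random(0)
max_gap = 0.0
for _ in range(1000):
    s_pos, s_neg = rng.random(), rng.random()  # shifted cosine scores in [0, 1)
    m = rng.uniform(0.10, 0.30)                # margin range used in the experiments
    gap = abs(delta_weighted(s_pos, s_neg, m) - delta_quadratic(s_pos, s_neg, m))
    max_gap = max(max_gap, gap)

print(f"largest absolute difference over 1000 random draws: {max_gap:.2e}")
```

At the ideal point $s^{+}=1$, $s^{-}=0$ both expressions reduce to $-2m^{2}$, the minimum of the per\-level term\.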

###### Theorem 1 \(Error contraction in the toy similarity space\)\.

In the same toy setting, we define the error vector:

$$\mathbf{e}=[s_1^{-},\dots,s_4^{-},\,1-s_1^{+},\dots,1-s_4^{+}]^{\top}.$$

Let

$$Z=D-8m^{2},\qquad\mathcal{L}=\log\big(1+\exp(\gamma Z)\big).$$

Then

$$\nabla_{\mathbf{e}}\mathcal{L}=2\,\gamma\,\sigma(\gamma\,Z)\,\mathbf{e},\quad\text{where}\quad\sigma(t)=\frac{e^{t}}{1+e^{t}}\in(0,1).$$

Under gradient descent with step size $0<\eta<\frac{1}{2\gamma}$, the error vector contracts multiplicatively:

$$\mathbf{e}^{(t+1)}=\Big(1-2\,\eta\,\gamma\,\sigma(\gamma\,Z^{(t)})\Big)\mathbf{e}^{(t)},$$

and therefore:

$$D^{(t+1)}=\Big(1-2\,\eta\,\gamma\,\sigma(\gamma\,Z^{(t)})\Big)^{2}\,D^{(t)}.$$

Hence the dynamics perform radial descent toward the ideal point $s_l^{-}\to 0$ and $s_l^{+}\to 1$ for all $l$\.

###### Proof:

We first compute the partial derivatives by the chain rule\. Sinceℒ=log⁡\(1\+exp⁡\(γZ\)\)\\mathcal\{L\}=\\log\(1\+\\exp\(\\gamma\\,Z\)\), we have:

∂ℒ∂Z=γexp⁡\(γZ\)1\+exp⁡\(γZ\)=γσ\(γZ\)\.\\frac\{\\partial\\mathcal\{L\}\}\{\\partial Z\}=\\frac\{\\gamma\\exp\(\\gamma\\,Z\)\}\{1\+\\exp\(\\gamma\\,Z\)\}=\\gamma\\,\\sigma\(\\gamma\\,Z\)\.\(25\)By definition,D=∑l=14\[\(sl−\)2\+\(1−sl\+\)2\]D=\\sum\_\{l=1\}^\{4\}\\big\[\(s\_\{l\}^\{\-\}\)^\{2\}\+\(1\-s\_\{l\}^\{\+\}\)^\{2\}\\big\]andZ=D−8m2Z=D\-8m^\{2\}\. Hence, for eachll, we have:

∂Z∂sl−=∂D∂sl−=2sl−,\\frac\{\\partial Z\}\{\\partial s\_\{l\}^\{\-\}\}=\\frac\{\\partial D\}\{\\partial s\_\{l\}^\{\-\}\}=2\\,s\_\{l\}^\{\-\},\(26\)∂Z∂sl\+=∂D∂sl\+=2\(sl\+−1\)=−2\(1−sl\+\)\.\\frac\{\\partial Z\}\{\\partial s\_\{l\}^\{\+\}\}=\\frac\{\\partial D\}\{\\partial s\_\{l\}^\{\+\}\}=2\\,\(s\_\{l\}^\{\+\}\-1\)=\-2\\,\(1\-s\_\{l\}^\{\+\}\)\.\(27\)Combining these gives:

∂ℒ∂sl−\\displaystyle\\frac\{\\partial\\mathcal\{L\}\}\{\\partial s\_\{l\}^\{\-\}\}=∂ℒ∂Z∂Z∂sl−\\displaystyle=\\frac\{\\partial\\mathcal\{L\}\}\{\\partial Z\}\\frac\{\\partial Z\}\{\\partial s\_\{l\}^\{\-\}\}\(28\)=γσ\(γZ\)⋅2sl−\\displaystyle=\\gamma\\,\\sigma\(\\gamma\\,Z\)\\cdot 2\\,s\_\{l\}^\{\-\}=2γσ\(γZ\)sl−,\\displaystyle=2\\,\\gamma\\,\\sigma\(\\gamma\\,Z\)\\,s\_\{l\}^\{\-\},and

∂ℒ∂sl\+\\displaystyle\\frac\{\\partial\\mathcal\{L\}\}\{\\partial s\_\{l\}^\{\+\}\}=∂ℒ∂Z∂Z∂sl\+\\displaystyle=\\frac\{\\partial\\mathcal\{L\}\}\{\\partial Z\}\\frac\{\\partial Z\}\{\\partial s\_\{l\}^\{\+\}\}\(29\)=γσ\(γZ\)⋅2\(sl\+−1\)\\displaystyle=\\gamma\\,\\sigma\(\\gamma\\,Z\)\\cdot 2\\,\(s\_\{l\}^\{\+\}\-1\)=−2γσ\(γZ\)\(1−sl\+\)\.\\displaystyle=\-2\\,\\gamma\\,\\sigma\(\\gamma\\,Z\)\(1\-s\_\{l\}^\{\+\}\)\.
We now derive the gradient descent updates\. For the negative similarities, we have:

sl−\(t\+1\)\\displaystyle s\_\{l\}^\{\-\(t\+1\)\}=sl−\(t\)−η∂ℒ∂sl−\|t\\displaystyle=s\_\{l\}^\{\-\(t\)\}\-\\eta\\,\\frac\{\\partial\\mathcal\{L\}\}\{\\partial s\_\{l\}^\{\-\}\}\\Big\|\_\{t\}=sl−\(t\)−η⋅2γσ\(γZ\(t\)\)sl−\(t\)\\displaystyle=s\_\{l\}^\{\-\(t\)\}\-\\eta\\cdot 2\\,\\gamma\\,\\sigma\(\\gamma\\,Z^\{\(t\)\}\)\\,s\_\{l\}^\{\-\(t\)\}=\(1−2ηγσ\(γZ\(t\)\)\)sl−\(t\)\.\\displaystyle=\\Big\(1\-2\\,\\eta\\,\\gamma\\,\\sigma\(\\gamma\\,Z^\{\(t\)\}\)\\Big\)s\_\{l\}^\{\-\(t\)\}\.\(30\)For the positive similarities:

$$\begin{aligned}
s_{l}^{+(t+1)} &= s_{l}^{+(t)}-\eta\,\frac{\partial\mathcal{L}}{\partial s_{l}^{+}}\Big|_{t} \\
&= s_{l}^{+(t)}-\eta\cdot\Big(-2\,\gamma\,\sigma(\gamma Z^{(t)})\,(1-s_{l}^{+(t)})\Big) \\
&= s_{l}^{+(t)}+2\,\eta\,\gamma\,\sigma(\gamma Z^{(t)})\,(1-s_{l}^{+(t)}).
\end{aligned} \tag{31}$$

Define $u_{l}^{(t)}=1-s_{l}^{+(t)}$. From ([31](https://arxiv.org/html/2605.12685#A1.E31)), we get:

$$\begin{aligned}
u_{l}^{(t+1)} &= 1-s_{l}^{+(t+1)} \\
&= 1-\Big(s_{l}^{+(t)}+2\,\eta\,\gamma\,\sigma(\gamma Z^{(t)})\,(1-s_{l}^{+(t)})\Big) \\
&= (1-s_{l}^{+(t)})-2\,\eta\,\gamma\,\sigma(\gamma Z^{(t)})\,(1-s_{l}^{+(t)}) \\
&= \Big(1-2\,\eta\,\gamma\,\sigma(\gamma Z^{(t)})\Big)\,u_{l}^{(t)}.
\end{aligned} \tag{32}$$

Equations ([30](https://arxiv.org/html/2605.12685#A1.E30)) and ([32](https://arxiv.org/html/2605.12685#A1.E32)) show that all eight components of $\mathbf{e}^{(t)}=[s_{1}^{-(t)},\dots,s_{4}^{-(t)},u_{1}^{(t)},\dots,u_{4}^{(t)}]^{\top}$ are multiplied by the same scalar factor $c_{t}=1-2\,\eta\,\gamma\,\sigma(\gamma Z^{(t)})$, i.e., $\mathbf{e}^{(t+1)}=c_{t}\,\mathbf{e}^{(t)}$.

Finally, since $0<\sigma(\cdot)<1$ and $0<\eta<\frac{1}{2\gamma}$, we have $0<c_{t}<1$ for all $t$, so $\|\mathbf{e}^{(t+1)}\|_{2}^{2}=c_{t}^{2}\,\|\mathbf{e}^{(t)}\|_{2}^{2}$. Because $D^{(t)}=\|\mathbf{e}^{(t)}\|_{2}^{2}$, this yields $D^{(t+1)}=c_{t}^{2}\,D^{(t)}$ and proves monotone contraction toward $\mathbf{e}^{\star}=\mathbf{0}$, i.e., $s_{l}^{-}=0$ and $s_{l}^{+}=1$ for all $l$. ∎
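The contraction can also be observed numerically. The following sketch runs the updates of Eqs. (30) and (31) directly on the similarity scores and then checks that every component of $\mathbf{e}^{(t)}$ shrinks by the same factor at each step. The starting scores, `gamma`, `m`, and `eta` are hypothetical values chosen so that $0<\eta<\frac{1}{2\gamma}$ holds; they are not taken from the paper's experiments.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def error_vector(s_neg, s_pos):
    """e = [s_1^-, ..., s_4^-, 1 - s_1^+, ..., 1 - s_4^+]."""
    return list(s_neg) + [1.0 - sp for sp in s_pos]

def gd_step(s_neg, s_pos, eta, gamma, m):
    """One gradient-descent step on the similarities, Eqs. (30) and (31)."""
    D = sum(x * x for x in error_vector(s_neg, s_pos))
    w = gamma * sigmoid(gamma * (D - 8.0 * m ** 2))  # gamma * sigma(gamma * Z)
    new_neg = [sn - eta * 2.0 * w * sn for sn in s_neg]
    new_pos = [sp + eta * 2.0 * w * (1.0 - sp) for sp in s_pos]
    return new_neg, new_pos

def run(s_neg, s_pos, eta, gamma, m, steps):
    """Return one (c_t estimate, spread of per-component ratios, D) per step.

    The per-component ratios require every entry of e to be nonzero,
    which holds for the starting point used below.
    """
    history = []
    for _ in range(steps):
        e_old = error_vector(s_neg, s_pos)
        s_neg, s_pos = gd_step(s_neg, s_pos, eta, gamma, m)
        e_new = error_vector(s_neg, s_pos)
        ratios = [new / old for new, old in zip(e_new, e_old)]
        D_new = sum(x * x for x in e_new)
        history.append((ratios[0], max(ratios) - min(ratios), D_new))
    return history

gamma, m = 2.0, 0.25
eta = 0.2  # hypothetical step size, satisfies 0 < eta < 1 / (2 * gamma) = 0.25
history = run([0.3, 0.1, 0.5, 0.2], [0.6, 0.9, 0.4, 0.7], eta, gamma, m, steps=10)
for t, (c_t, spread, D) in enumerate(history):
    print(f"t={t}: c_t={c_t:.4f}  ratio spread={spread:.1e}  D={D:.6f}")
```

At each step the spread of the eight per-component ratios should vanish up to floating-point error, which is the numerical counterpart of $\mathbf{e}^{(t+1)}=c_{t}\,\mathbf{e}^{(t)}$, and $D$ should decrease monotonically.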

Having established the multiplicative contraction property of Theorem [1](https://arxiv.org/html/2605.12685#Thmtheorem1a) for the self-weighting mechanism, we now show that this behavior does not, in general, hold for the linear multi-level combination in Eq. ([8](https://arxiv.org/html/2605.12685#S4.E8)), so the property is specific to self-weighting. To illustrate this point, we consider the same toy setting with fixed coefficients $\beta_{l}$:

$$\mathcal{L}_{\mathrm{lin}}=\log\Big(1+\exp\big(\gamma S\big)\Big), \tag{33}$$

$$S=\sum_{l=1}^{4}\beta_{l}\big(s_{l}^{-}-s_{l}^{+}+m\big). \tag{34}$$

###### Proposition 2 (No radial descent / no multiplicative contraction for the linear combination).

The gradient of $\mathcal{L}_{\mathrm{lin}}$ in $\mathbf{e}$-coordinates can be expressed as follows:

$$\nabla_{\mathbf{e}}\mathcal{L}_{\mathrm{lin}}=\gamma\,\sigma(\gamma S)\,[\beta_{1},\dots,\beta_{4},\beta_{1},\dots,\beta_{4}]^{\top},$$

which is a fixed direction independent of $\mathbf{e}$ (up to the scalar factor $\sigma(\gamma S)$). Therefore, in general $\nabla_{\mathbf{e}}\mathcal{L}_{\mathrm{lin}}$ is not parallel to $\mathbf{e}$ and the update is not radial. In particular, there does not exist a scalar sequence $\{c_{t}\}_{t}$ such that $\mathbf{e}^{(t+1)}=c_{t}\,\mathbf{e}^{(t)}$ holds for all initial $\mathbf{e}^{(0)}$ under gradient descent on $\mathcal{L}_{\mathrm{lin}}$.

###### Proof:

The derivative of $\mathcal{L}_{\mathrm{lin}}=\log(1+\exp(\gamma S))$ w.r.t. $S$ is $\partial\mathcal{L}_{\mathrm{lin}}/\partial S=\gamma\,\sigma(\gamma S)$. Moreover, $\partial S/\partial s_{l}^{-}=\beta_{l}$ and $\partial S/\partial s_{l}^{+}=-\beta_{l}$. Thus $\partial\mathcal{L}_{\mathrm{lin}}/\partial s_{l}^{-}=\gamma\,\sigma(\gamma S)\,\beta_{l}$ and $\partial\mathcal{L}_{\mathrm{lin}}/\partial s_{l}^{+}=-\gamma\,\sigma(\gamma S)\,\beta_{l}$.

From the definition of $\mathbf{e}$, its components satisfy $e_{l}=s_{l}^{-}$ and $e_{4+l}=1-s_{l}^{+}$ for $l=1,\dots,4$. By the chain rule, $\partial\mathcal{L}_{\mathrm{lin}}/\partial e_{4+l}=-(\partial\mathcal{L}_{\mathrm{lin}}/\partial s_{l}^{+})=\gamma\,\sigma(\gamma S)\,\beta_{l}$. Hence $\nabla_{\mathbf{e}}\mathcal{L}_{\mathrm{lin}}=\gamma\,\sigma(\gamma S)\,[\beta_{1},\dots,\beta_{4},\beta_{1},\dots,\beta_{4}]^{\top}$, which is a fixed direction independent of $\mathbf{e}$ (up to the scalar factor).

To show that multiplicative contraction cannot hold in general, assume by contradiction that there exists a scalar $c_{t}$ such that $\mathbf{e}^{(t+1)}=c_{t}\,\mathbf{e}^{(t)}$ for all $\mathbf{e}^{(t)}$. Then the gradient descent step $\mathbf{e}^{(t+1)}=\mathbf{e}^{(t)}-\eta\,\nabla_{\mathbf{e}}\mathcal{L}_{\mathrm{lin}}(\mathbf{e}^{(t)})$ implies that $\nabla_{\mathbf{e}}\mathcal{L}_{\mathrm{lin}}(\mathbf{e}^{(t)})$ is always parallel to $\mathbf{e}^{(t)}$. However, since $\nabla_{\mathbf{e}}\mathcal{L}_{\mathrm{lin}}$ has a fixed direction, this can only hold when $\mathbf{e}^{(t)}$ is collinear with $[\beta_{1},\dots,\beta_{4},\beta_{1},\dots,\beta_{4}]^{\top}$, which is not true for general initial conditions.

For an explicit counterexample, take $\mathbf{e}^{(0)}$ with nonzero entries only in the first level, e.g., $\mathbf{e}^{(0)}=[1,0,0,0,\,0,0,0,0]^{\top}$. Then $\nabla_{\mathbf{e}}\mathcal{L}_{\mathrm{lin}}(\mathbf{e}^{(0)})$ has nonzero entries in the first and fifth coordinates whenever $\beta_{1}\neq 0$, so $\mathbf{e}^{(1)}=\mathbf{e}^{(0)}-\eta\,\nabla_{\mathbf{e}}\mathcal{L}_{\mathrm{lin}}(\mathbf{e}^{(0)})$ generally has nonzero entries in multiple coordinates and cannot equal $c_{0}\,\mathbf{e}^{(0)}$ for any scalar $c_{0}$. This proves that the multiplicative contraction property of Theorem [1](https://arxiv.org/html/2605.12685#Thmtheorem1a) does not hold for the linear combination. ∎
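The counterexample can be reproduced in a few lines. The sketch below takes one gradient-descent step from $\mathbf{e}^{(0)}=[1,0,0,0,0,0,0,0]^{\top}$ under the linear combination and, for contrast, under the self-weighted toy loss, then measures collinearity with the Cauchy-Schwarz gap. The uniform coefficients `beta`, the values of `gamma`, `m`, and `eta`, and the helper names are assumptions made for illustration; the self-weighted loss is again assumed to be $\log(1+\exp(\gamma Z))$, the form consistent with Eq. (25).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def linear_step(e, beta, eta, gamma, m):
    """One gradient-descent step on L_lin (Eqs. 33-34), in e-coordinates."""
    s_neg = e[:4]
    s_pos = [1.0 - u for u in e[4:]]
    S = sum(b * (sn - sp + m) for b, sn, sp in zip(beta, s_neg, s_pos))
    scale = gamma * sigmoid(gamma * S)
    grad = [scale * b for b in beta] + [scale * b for b in beta]
    return [x - eta * g for x, g in zip(e, grad)]

def self_weighted_step(e, eta, gamma, m):
    """One gradient-descent step on the assumed self-weighted toy loss."""
    D = sum(x * x for x in e)
    scale = 2.0 * gamma * sigmoid(gamma * (D - 8.0 * m ** 2))
    return [x - eta * scale * x for x in e]

def parallel_gap(a, b):
    """Cauchy-Schwarz gap |a|^2 |b|^2 - <a, b>^2, zero iff a and b are collinear."""
    dot = sum(x * y for x, y in zip(a, b))
    return sum(x * x for x in a) * sum(y * y for y in b) - dot * dot

# Hypothetical settings for the illustration.
gamma, m, eta = 2.0, 0.25, 0.1
beta = [0.25, 0.25, 0.25, 0.25]
e0 = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

e1_lin = linear_step(e0, beta, eta, gamma, m)
e1_sw = self_weighted_step(e0, eta, gamma, m)

print("linear step:        ", [round(x, 4) for x in e1_lin])
print("self-weighted step: ", [round(x, 4) for x in e1_sw])
print("collinearity gap (linear):       ", parallel_gap(e0, e1_lin))
print("collinearity gap (self-weighted):", parallel_gap(e0, e1_sw))
```

Under the linear combination the step moves coordinates that were zero in $\mathbf{e}^{(0)}$, so the gap is strictly positive and no scalar $c_{0}$ can satisfy $\mathbf{e}^{(1)}=c_{0}\,\mathbf{e}^{(0)}$. Under the self-weighted loss the zero coordinates stay at zero and the gap vanishes.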


![Mohamed Mahmoud Amar](https://arxiv.org/html/2605.12685v1/photos/Mohamed_Mahmoud_Amar.png) Mohamed Mahmoud Amar is a Ph.D. student at the University of Quebec at Montreal (UQAM). His research interests include graph representation learning and multi-task learning.

![Nairouz Mrabah](https://arxiv.org/html/2605.12685v1/photos/Nairouz.png) Nairouz Mrabah received his Ph.D. degree from the University of Quebec at Montreal (UQAM). His research interests include clustering, graph representation learning, and vision-language models.

![Mohamed Bouguessa](https://arxiv.org/html/2605.12685v1/photos/Mohamed.png) Mohamed Bouguessa received the M.Sc. and Ph.D. degrees in 2005 and 2009, respectively, from the University of Sherbrooke, Quebec, Canada. He is currently a professor of computer science at the University of Quebec at Montreal (UQAM). His research focuses on graph mining and graph representation learning.

![Abdoulaye Baniré Diallo](https://arxiv.org/html/2605.12685v1/photos/Abdoulaye.png) Abdoulaye Baniré Diallo received the Ph.D. degree from McGill University, Montreal, QC, Canada, in 2009. He is currently a professor of computer science at the Université du Québec à Montréal (UQAM). His research focuses on the design of algorithmic methods for the analysis of heterogeneous biological data.