# Robustness of Graph Self-Supervised Learning to Real-World Noise: A Case Study on Text-Driven Biomedical Graphs
Source: [https://arxiv.org/html/2605.05463](https://arxiv.org/html/2605.05463)
Othmane KABAL¹

¹ Nantes University, LS2N, Nantes 44300, France
² National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan
³ Institute of Science Tokyo, Tokyo 152-8550, Japan
###### Abstract
Graph Self-Supervised Learning (GSSL) offers a powerful paradigm for learning graph representations without labeled data. However, existing work assumes clean, manually curated graphs. Recent advances in NLP enable the large-scale automatic extraction of knowledge graphs from text, opening new opportunities for GSSL while introducing substantial real-world noise. This type of noise remains largely unexplored, as prior robustness studies typically rely on synthetic perturbations. To address this gap, we present the first comprehensive evaluation of GSSL methods on text-driven graphs for unsupervised term typing. We introduce Noise-Aware Text-Driven GSSL (NATD-GSSL), a unified framework that combines automatic graph construction, graph refinement, and GSSL. Our evaluation follows a dual-graph protocol that contrasts a noisy graph derived from MedMentions with a clean Unified Medical Language System (UMLS) reference graph, aligned through a shared gold standard. Our results reveal variability in robustness across both pretext tasks and Graph Neural Network (GNN) architectures. Relation reconstruction is highly sensitive to noise and benefits from well-defined schemas, whereas feature reconstruction is considerably more robust, achieving performance comparable to clean-graph settings. Contrastive objectives are generally less affected by noise but depend strongly on alignment with downstream tasks. GNN architecture also plays a critical role: bidirectional relational message-passing designs are better suited to noisy, text-driven graphs, while unidirectional relational ones perform best on clean graphs. Overall, NATD-GSSL provides practical guidance for applying GSSL to real-world, noisy graphs and achieves up to a 7% improvement over pretrained language model baselines. All code and benchmarks are publicly available at [https://github.com/OthmaneKabal/MC2GAE](https://github.com/OthmaneKabal/MC2GAE).
###### keywords:
Graph Self-Supervised Learning, Knowledge Graph Construction, Noisy Graphs, Robustness Evaluation, Term Typing, Noise in GNNs
## 1 Introduction
Graph Neural Network (GNN)-based learning has witnessed rapid progress in recent years [wu2020comprehensive, dai2024comprehensive]. GNNs are now widely adopted and often preferred over purely text-based methods, as they jointly exploit textual content and the relational structure encoded in graphs [10.5555/3171642.3171829, wang2024graph]. Two main paradigms have shaped this field. Supervised methods [10.1109/TKDE.2020.2981333, xiao2022graph] achieve state-of-the-art results but require extensive labeled data, which are costly to obtain. In contrast, Graph Self-Supervised Learning (GSSL) [liu2022graph, liu2022survey] avoids annotation by learning from pretext tasks, offering scalability and adaptability across domains. A wide range of GSSL frameworks has emerged, including generative methods [kipf2016variational, tan2023s2gae] and contrastive approaches [you2021graphcontrastivelearningaugmentations], implemented using various encoder-decoder architectures [velivckovic2017graph, yang2015embeddingentitiesrelationslearning, 10.1007/978-3-319-93417-4_38]. However, most existing studies rely on general-purpose, well-curated graphs such as Wikidata [10.1145/2629489], FB15k-237 [toutanova2015observed], or citation networks [sen2008collective], which are typically considered high-quality and readily available. In practice, such resources are scarce, limiting the applicability of GSSL methods to the few domains where curated graphs exist and leaving many others unexplored.

To broaden the applicability of GSSL across domains, it becomes necessary to construct knowledge graphs directly from textual resources. Since manual curation is prohibitively time-consuming, automatic graph construction [zhong2023comprehensive, kabal2024enhancing] often remains the only feasible option. This process inevitably introduces noise [cai2025understanding], posing major challenges for the robustness of GSSL methods. While several studies have investigated the impact of noise on GNN performance [liu2022survey, Ju2024ASO, zhuang2022defending], they primarily rely on synthetic perturbations, such as random edge deletions, additions, or adversarial attacks [zhuang2022defending], and assess their effects on downstream performance. Nevertheless, these controlled settings rarely capture the complexity and heterogeneity of noise inherent in text-driven graphs, which tend to be structurally weak, sparse, and highly fragmented, in addition to exhibiting factual and semantic inaccuracies [hegde2015entity, mo2025kggen]. Due to the absence of ground-truth datasets pairing raw text with large, connected knowledge graphs [cai2025understanding], a critical issue remains largely underexplored: the sensitivity of GSSL methods to the quality of text-driven graphs.

This paper addresses this gap by systematically investigating how extraction-induced graph quality affects the downstream performance of GSSL methods. We focus in particular on *unsupervised term typing* as a downstream task, a fundamental yet underexplored application of GSSL with broad implications for knowledge extraction and ontology construction [zhong2023comprehensive, kabal2024enhancing]. We design Noise-Aware Text-Driven GSSL (NATD-GSSL), a framework that enables GSSL to operate on automatically constructed graphs by integrating a graph construction step followed by a refinement stage, which implements a set of strategies to improve graph quality.
To evaluate robustness in real-world scenarios, we propose a dual-graph evaluation protocol that contrasts GSSL performance on two graphs built from the same domain: a noisy graph automatically extracted from the MedMentions corpus [mohan2019medmentions], and a clean reference knowledge graph from UMLS [bodenreider2004unified]. Because both graphs share a common set of entities and annotation schema, this setting enables a controlled and quantitative analysis of the performance gap induced by automatic graph construction. Unlike prior work based on synthetic perturbations, this evaluation reflects noise arising naturally from language ambiguity and extraction errors. Our empirical study spans multiple GSSL methods, implemented with various GNN architectures and trained on diverse pretext tasks, in order to analyze how robustness varies across model designs and learning objectives. The results demonstrate that relation reconstruction requires a clean graph with a well-defined schema, while feature reconstruction remains the most robust and achieves performance comparable to that on clean graphs. In contrast, contrastive approaches reveal that robustness depends less on graph quality and more on alignment with the downstream task. Regarding model design, the GNN architecture also plays a critical role: bidirectional relational message-passing architectures are better suited to noisy, text-driven graphs, whereas unidirectional relational architectures perform best on clean graphs. Moreover, our refinement strategies show that graph augmentation is beneficial when the graph is sparse, while denoising, even in the presence of errors, may lead to degraded performance. Finally, our results also show a +7% improvement in term typing compared to pretrained language models. To summarize, our contributions are as follows:
- **NATD-GSSL**, the first framework that integrates graph construction, refinement, and GSSL into a unified process for learning from raw text.
- **A dual-graph evaluation protocol** that quantitatively measures the impact of real-world noise using paired noisy and clean knowledge graphs aligned with the same gold standard.
- **A comprehensive empirical study** comparing GSSL methods implemented with different GNN architectures and trained on various pretext tasks, providing practical guidance for robust model design.
The remainder of this paper is organized as follows. Section 2 reviews related work on GSSL and learning under noisy graphs. Section 3 presents the proposed NATD-GSSL framework and its modules. Section 4 details the experimental setup, Section 5 reports results and analysis, and Section 6 concludes and outlines future directions.
## 2 Related Work
### 2.1 Graph Self-Supervised Learning
Graphs provide a natural and expressive way to model relational data, where entities are represented as nodes and their interactions as edges [ZHOU202057]. To learn from such structures without requiring labeled data, Graph Self-Supervised Learning [liu2022graph, wang2022graph] is employed. GSSL is typically designed as an encoder-decoder framework, wherein the *encoder* consists of stacked GNN layers that transform nodes into low-dimensional representations. Various GNN architectures have been proposed to perform this transformation, each with distinct mechanisms for aggregating neighborhood information. GCN [kipf2016semi] performs simple neighborhood aggregation, while GAT [velivckovic2017graph] refines this through attention mechanisms, though neither considers multi-relational aspects. To address this, RGCN [10.1007/978-3-319-93417-4_38] applies relation-specific transformations but suffers from parameter explosion and unidirectional propagation. TransGCN and RotatEGCN [cai2019transgcn] address these limitations by encoding relations as translation or rotation operators, enabling bidirectional message passing with fewer parameters. Table 1 summarizes these architectures.
Table 1: Comparison of representative GNN architectures. Attn: attention mechanism; Multi-rel: multi-relational graph support; MPD: message passing direction (→ unidirectional, ↔ bidirectional).

| GNN Architecture | Attn | Multi-rel | Relation Modeling | MPD |
|---|---|---|---|---|
| GCN [kipf2016semi] | × | × | – | → |
| GAT [velivckovic2017graph] | ✓ | × | – | → |
| RGCN [10.1007/978-3-319-93417-4_38] | × | ✓ | Relation-specific | → |
| TransGCN [cai2019transgcn] | Optional | ✓ | Translation-based | ↔ |
| RotatEGCN [cai2019transgcn] | Optional | ✓ | Rotation-based | ↔ |

On the *decoder* side, which defines the learning objective through pretext tasks, various architectures can be employed for the same task, such as standard neural networks (MLPs), GNNs, or simple scoring functions (e.g., dot product, cosine similarity). Based on the nature of these tasks, GSSL methods can be broadly divided into two main families: generative and contrastive approaches. *Generative methods* formulate the pretext task as the reconstruction of the input graph from two complementary perspectives [liu2022graph]. First, *structure reconstruction* approaches typically employ GNN-based encoders with dot-product decoders to recover the adjacency matrix [kipf2016variational, pan2019learning]. Other approaches extend this idea to multi-relational reconstruction, aiming to recover relation types in heterogeneous graphs [10.1007/978-3-319-93417-4_38]. Although effective for link prediction and relation extraction, these methods strongly depend on graph structure, which can limit performance in tasks where semantically similar nodes are weakly connected or linked through non-informative relations. Second, *feature reconstruction* methods focus on reconstructing node attributes [wang2017mgae, park2019symmetric], which preserves semantic content when features are meaningful but often neglects structural information. Third, *dual reconstruction* approaches address this limitation by jointly reconstructing both structure and features via multi-task learning with two distinct decoders [li2023multi, sun2021dual], yielding more comprehensive embeddings while incurring a higher computational cost. These methods can be further enhanced through masking strategies [tan2023s2gae, hou2022graphmaeselfsupervisedmaskedgraph], in which nodes or edges are partially masked and subsequently reconstructed, improving generalization and reducing overfitting. Despite their effectiveness, generative methods tend to overfit local graph structures and often struggle to capture global contextual information, particularly in graphs with multiple disconnected components [ren2020heterogeneousdeepgraphinfomax]. Moreover, their reconstruction-focused objectives often yield embeddings with limited discriminative power. *Contrastive methods*, which by their nature capture global information well and produce more discriminative embeddings [10597920], are generally built on the principle of mutual information maximization: the estimated mutual information between different augmented views of the same object (a node, a subgraph, or an entire graph) is maximized [zhu2020deepgraphcontrastiverepresentation, you2021graphcontrastivelearningaugmentations].
Within this family, the encoder is typically a GNN, while the decoder acts as a discriminator that estimates the level of agreement between two instances, typically using simple similarity functions such as a dot product or a bilinear function [liu2022graph]. These methods mainly differ in their contrastive level and augmentation strategy. For instance, Deep Graph Infomax [48921] contrasts node representations with a global graph summary, generating negative samples through node shuffling. GraphCL [you2021graphcontrastivelearningaugmentations] adopts a graph-level contrastive approach and applies a variety of data augmentations, including node dropping, edge perturbation, feature masking, and subgraph sampling. GRACE [zhu2020deepgraphcontrastiverepresentation] focuses on node-level contrastive learning by combining edge removal and feature masking to enrich local contexts, while ASP [chen2023attribute] contrasts original, attribute-based, and global views to better handle both homophilous and heterophilous graphs. While these methods produce discriminative embeddings, they remain sensitive to augmentation quality, negative sampling design, and the risk of discarding essential structural information. Despite the wide range of available GSSL methods, most existing studies have evaluated them on clean, well-curated graphs. In practice, however, real-world graphs, particularly those derived from text, exhibit pervasive noise. The effectiveness of GSSL methods under such noisy conditions remains underexplored, especially in relation to the GNN architectures that implement these methods and the pretext tasks used to guide their learning.
### 2.2 Noise in Graph Neural Networks
The quality of the input graph plays a crucial role in the effectiveness of GNNs [Ju2024ASO]. In practice, graphs are rarely perfect and often suffer from various types of noise, commonly categorized into two main types [paulheim2016knowledge, liu2022survey]. *Structural noise* refers to inconsistencies in the graph topology, such as missing or spurious edges, that distort the genuine relationships between nodes [rong2019dropedge, yuan2023self]. Missing edges increase graph sparsity and hinder effective information propagation, while spurious edges may introduce misleading connections, leading to over-smoothing and incorrect message aggregation. Together, these issues disrupt the message-passing mechanisms of GNNs and negatively impact model performance [fox2019robust]. *Node-level noise* arises from erroneous, missing, or incomplete node attributes [Ju2024ASO]. Such noise reduces the informativeness of node features and impairs neighborhood aggregation, ultimately limiting a model's ability to learn accurate and meaningful node representations [liu2022survey]. To study the impact of noise on GNNs, a common methodology in the literature consists in introducing controlled synthetic perturbations into the input graph [wang2023user, ennadir2024simple]. These perturbations typically include randomly adding or removing edges, corrupting node features (e.g., by injecting Gaussian noise or masking attributes), or applying more sophisticated adversarial attacks [zhuang2022defending, guo2022learning]. Model robustness is then assessed on downstream tasks, where a noticeable degradation in performance is consistently observed, highlighting the high sensitivity of GNNs to graph corruption even under moderate perturbations [jin2020graph]. While these approaches are well suited to controlled experimental settings, they are largely conducted on mono-relational benchmark graphs such as Cora, CiteSeer, and PubMed [sen2008collective], which are relatively simple and carefully curated. Alternatively, some studies [wang2019robust, dong-etal-2025-refining] rely on existing multi-relational graphs that are manually or semi-automatically constructed from structured sources, such as DBpedia and Freebase [auer2007dbpedia, bollacker2008freebase]. In both cases, these methodologies assume access to a clean, high-quality underlying graph, which limits their applicability to real-world scenarios where the graph is inherently noisy, as is typically the case for graphs automatically constructed from textual data. Noise in curated graphs is comparatively simple: it more commonly stems from fact obsolescence, human editing errors, or source alignment issues, and rarely involves severe violations of semantic constraints, owing to the limited expressiveness of the underlying ontologies [paulheim2016knowledge]. In contrast, noise in text-driven graphs predominantly originates from upstream NLP pipelines [zhong2023comprehensive], including errors in entity recognition, entity disambiguation, relation extraction, and coreference resolution. These errors often lead to inconsistent triples (e.g., *Barack Obama, siblingOf, White House*), factually false triples (e.g., *Boston, capitalOf, USA*), or overly generic triples (e.g., *family, residesIn, New York*) [mihindukulasooriya2017towards], which are characteristic failure modes of automatic extraction systems.
Moreover, text-driven graphs are particularly affected by entity duplication, arising from spelling variations, unresolved aliases, or incomplete entity normalization. This leads to the creation of artificial nodes, an issue that is largely absent from curated knowledge graphs [cai2025understanding], and further results in increased sparsity and severe graph fragmentation [mo2025kggen]. Finally, whereas curated graphs typically prioritize precision over coverage, automatically constructed graphs often favor high recall, leading to substantially higher levels of structural and semantic noise [faralli2023benchmark, kabal2024enhancing].
Given these characteristics, existing NLP research lacks ground-truth datasets pairing raw text with large, connected knowledge graphs, limiting systematic analysis of how realistic extraction noise propagates to downstream graph learning tasks [cai2025understanding]. As a result, the impact of extraction-induced noise on GNNs, and on GSSL methods in particular, remains largely unexplored. Our study addresses these limitations by (i) integrating knowledge graph construction and graph self-supervised learning into a unified framework, NATD-GSSL, enabling systematic analysis of noise propagation; and (ii) introducing a novel dual-graph evaluation protocol that contrasts noisy, text-driven graphs with clean, existing graphs in order to quantitatively assess the performance degradation induced by the graph construction process.
## 3 NATD-GSSL Framework
This section presents the NATD-GSSL framework, which extends the applicability of GSSL methods to settings where only raw textual data is available. Unlike conventional approaches that assume access to pre-existing, clean graphs, NATD-GSSL explicitly integrates knowledge graph construction into the learning pipeline, thereby enabling systematic analysis of extraction-induced noise and its impact on downstream tasks. Furthermore, the framework incorporates a dedicated component that mitigates this noise through a set of graph refinement strategies. As illustrated in Figure 1, the NATD-GSSL framework consists of four main components.
Figure 1: Overview of the NATD-GSSL Framework.

### 3.1 Input Data and Knowledge Graph Construction
NATD-GSSL takes as input a domain-specific corpus consisting of a collection of unstructured textual documents relevant to the target domain in which GSSL is to be applied. The framework accepts additional inputs specific to the downstream task, which may include entities, concepts, or labels, along with any auxiliary structures required by the task formulation. Based on these inputs, a *knowledge graph (KG)* is automatically constructed using the general-purpose method General Text To Knowledge Graph (GT2KG) [kabal2024enhancing], which ensures the domain independence of the framework. Formally, the KG is defined as

$$G = (V, E, X, R),$$

where $V$ denotes the set of nodes and $E \subseteq V \times R \times V$ represents the set of relational edges. The node feature matrix $X \in \mathbb{R}^{|V| \times d}$ is obtained by encoding node-associated textual information using pretrained language models (PLMs). The set $R$ denotes the collection of relation types. Each edge is represented as a triplet $(v_i, r, v_j)$, where $v_i, v_j \in V$ are the source and target nodes, respectively, and $r \in R$ specifies the semantic relation connecting them.
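To make this definition concrete, the following is a minimal sketch, assuming PyTorch Geometric and the sentence-transformers library, of how extracted triples and PLM-encoded node features could be assembled into the structure $G = (V, E, X, R)$. The triple list is purely illustrative, not actual GT2KG output.

```python
import torch
from torch_geometric.data import Data
from sentence_transformers import SentenceTransformer

# Illustrative triples (v_i, r, v_j); a real run would use GT2KG output.
triples = [
    ("stem cell therapy", "is-a", "cell therapy"),
    ("cell therapy", "is-a", "therapy"),
    ("heavy metal", "causes", "abiotic stress"),
]

nodes = sorted({t[0] for t in triples} | {t[2] for t in triples})
rels = sorted({t[1] for t in triples})
node_id = {n: i for i, n in enumerate(nodes)}
rel_id = {r: i for i, r in enumerate(rels)}

# X in R^{|V| x d}: node features encoded with a pretrained language model.
plm = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
x = torch.tensor(plm.encode(nodes))  # d = 384 for MiniLM-L6-v2

# E subset of V x R x V, stored as an edge index plus a relation-type vector.
edge_index = torch.tensor(
    [[node_id[s] for s, _, _ in triples],
     [node_id[o] for _, _, o in triples]], dtype=torch.long)
edge_type = torch.tensor([rel_id[r] for _, r, _ in triples], dtype=torch.long)

kg = Data(x=x, edge_index=edge_index, edge_type=edge_type)
```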
### 3.2 Graph Refinement
The constructed graph, as discussed in Section 2.2, suffers from several imperfections. These include structural issues related to fragmentation and incompleteness, as well as errors in the semantic content. The graph refinement component proposes a set of strategies aimed at improving the overall quality of the graph. These strategies are grouped into two main categories: *enrichment* and *cleaning*.
##### Enrichment
This step enhances the graph by adding new nodes and/or edges in order to mitigate structural weaknesses and reinforce connectivity. For this purpose, we propose a rule-based *is-a* augmentation, which introduces new nodes and is-a relations by exploiting morphosyntactic patterns in multi-word terms to infer missing hierarchical relations. This augmentation reduces graph fragmentation and improves message passing, while minimizing the risk of introducing semantic errors. We distinguish two types of rules depending on whether the term contains the preposition "of". The corresponding rules and illustrative examples are presented in Table 2.
Table 2: Rules for is-a augmentation from composed terms.

| Rule Type | Syntactic Structure | Example with Generated Relationships |
|---|---|---|
| Without "of" | Term: A B C. Root: C (last word). Rules: (B C, is-a, C); (A B C, is-a, B C) | Term: Stem cell therapy. Relationships: (Cell therapy, is-a, therapy); (Stem cell therapy, is-a, cell therapy) |
| With "of" | Term: C of B of A. Root: C (first segment before "of"). Rules: (B of C, is-a, C); (C of B of A, is-a, B of C) | Term: Disease of cell physiology. Relationships: (Disease of cell physiology, is-a, disease) |
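Both rule families can be read as iteratively peeling the term down to its root. The sketch below is a hypothetical plain-string rendering of Table 2, assuming whitespace tokenization; the actual rules rely on morphosyntactic patterns, so this is illustrative only.

```python
def isa_augment(term: str):
    """Generate (child, "is-a", parent) triples from a multi-word term,
    following the two rule families of Table 2 (simplified: plain string
    splitting instead of morphosyntactic analysis)."""
    triples = []
    if " of " in term:
        # "C of B of A": the head is the segment before the first "of".
        parts = [p.strip() for p in term.split(" of ")]
        for k in range(len(parts) - 1, 0, -1):
            child = " of ".join(parts[: k + 1])
            parent = " of ".join(parts[:k])
            triples.append((child, "is-a", parent))
    else:
        # "A B C": the head is the last word; peel modifiers from the left.
        words = term.split()
        for k in range(len(words) - 1):
            child = " ".join(words[k:])
            parent = " ".join(words[k + 1:])
            triples.append((child, "is-a", parent))
    return triples

# isa_augment("stem cell therapy")
# -> [("stem cell therapy", "is-a", "cell therapy"),
#     ("cell therapy", "is-a", "therapy")]
```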
##### Cleaning
This step aims to remove noise from the graph content, affecting both relations and nodes. To this end, we adopt a removal-based refinement strategy proposed by [dong-etal-2025-refining]. This method relies on large language models (LLMs) and has been validated by its original authors, ensuring the reliability of the underlying validation process. We follow this established framework without modifying its core decision mechanism. The approach operates in two stages: noise detection and filtering. Given a triple $(E_1, R, E_2)$ extracted from a source sentence $s$, an LLM is employed to assess whether the relation expressed by the triple is supported by the sentence context and the model's linguistic knowledge. Formally, this validation step is modeled as a binary decision function:

$$\mathcal{V}_{\text{LLM}}(E_1, R, E_2 \mid s) \rightarrow \{0, 1\},$$

where a value of $1$ indicates that the relation is semantically supported by $s$, while $0$ denotes a noisy or unsupported triple. Based on this decision, triples identified as noisy are removed from the graph, while only validated triples are retained.
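A minimal sketch of this decision function follows. The prompt wording and the `llm_call` callable are assumptions (the callable stands in for, e.g., a DeepSeek chat wrapper); the actual prompting protocol is the one defined in [dong-etal-2025-refining].

```python
def build_validation_prompt(e1: str, rel: str, e2: str, sentence: str) -> str:
    """Prompt realizing V_LLM(E1, R, E2 | s) -> {0, 1}. The exact wording
    used in [dong-etal-2025-refining] differs; this is a sketch."""
    return (
        "Given the source sentence and a candidate triple, answer with a "
        "single digit: 1 if the sentence supports the relation, 0 otherwise.\n"
        f"Sentence: {sentence}\n"
        f"Triple: ({e1}, {rel}, {e2})\n"
        "Answer:"
    )

def validate_triple(llm_call, e1, rel, e2, sentence) -> bool:
    """llm_call is any callable str -> str wrapping the chosen LLM."""
    answer = llm_call(build_validation_prompt(e1, rel, e2, sentence)).strip()
    return answer.startswith("1")

# Filtering stage: keep only validated triples.
# kept = [t for t, s in zip(triples, source_sentences)
#         if validate_triple(llm_call, *t, s)]
```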
### 3.3 GSSL and Downstream Task
Once the graph is refined, the text-driven graph is passed to the GSSL component, which aims to learn node embeddings by leveraging both the structural and semantic information encoded in the graph. To this end, a graph encoder $f_\theta$ is employed, producing:

$$\mathbf{H} = f_\theta(G) = f_\theta(\mathbf{X}, \mathcal{R}, \mathcal{E}_R),$$

where $\mathbf{H} \in \mathbb{R}^{|V| \times d}$ denotes the set of node representations learned in a self-supervised manner through a pretext task using a decoder $f_\phi$, without requiring any labeled data. Finally, the learned representations $\mathbf{H}$ are exploited by downstream tasks through task-specific inference functions. Depending on the application, these embeddings can be used for various tasks such as node classification, clustering, or link prediction, either directly or via lightweight task-specific heads.
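As one concrete instantiation, the sketch below pairs an RGCN encoder $f_\theta$ with an MLP decoder $f_\phi$ trained on the feature reconstruction pretext task (MSE loss), reusing the `kg` object from the construction sketch in Section 3.1. Layer dimensions follow Section 4.5, but the learning rate is an assumption and the full-graph loop is a simplification of the paper's mini-batch pipeline.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import RGCNConv

class FeatRecGSSL(torch.nn.Module):
    """RGCN encoder f_theta producing H, MLP decoder f_phi reconstructing X."""
    def __init__(self, d_in=384, d_hid=256, d_out=128, num_rels=2, num_bases=None):
        super().__init__()
        self.conv1 = RGCNConv(d_in, d_hid, num_rels, num_bases=num_bases)
        self.conv2 = RGCNConv(d_hid, d_out, num_rels, num_bases=num_bases)
        self.decoder = torch.nn.Sequential(
            torch.nn.Linear(d_out, d_hid), torch.nn.ReLU(),
            torch.nn.Linear(d_hid, d_in))

    def forward(self, x, edge_index, edge_type):
        h = F.relu(self.conv1(x, edge_index, edge_type))
        h = self.conv2(h, edge_index, edge_type)   # H in R^{|V| x d_out}
        return h, self.decoder(h)

model = FeatRecGSSL(num_rels=int(kg.edge_type.max()) + 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr is an assumption
for epoch in range(50):                              # 50 epochs, as in Sec. 4.5
    opt.zero_grad()
    h, x_rec = model(kg.x, kg.edge_index, kg.edge_type)
    loss = F.mse_loss(x_rec, kg.x)                   # pretext: reconstruct X
    loss.backward()
    opt.step()
```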
## 4 Experiments
In this section, we present our experimental study, which instantiates the NATD-GSSL framework to investigate the impact of graph construction quality on the performance of GSSL methods. Unless otherwise stated, all experimental settings correspond to specific instantiations of NATD-GSSL, differing only in the graph refinement stage. We first introduce the downstream task adopted in this study, followed by the proposed dual-graph evaluation protocol and a description of the datasets and graph variants used in our experiments. We then detail the different GSSL methods considered in this study, as well as the experimental setting.
### 4.1 Downstream Task: Term Typing
We adopt*term typing*as the downstream task, which consists in assigning each target term to one or more predefined semantic types\. This task is particularly relevant in graph\-based settings, as it directly evaluates the semantic coherence of learned node representations and is widely used in knowledge graph construction and enrichment\. For this task, NATD\-GSSL requires two additional inputs: \(i\) a set of*target terms*to be typed, which appear in the corpus and are therefore contextualized by it; and \(ii\) a predefined set of*semantic types*to which these target terms are to be assigned\. Once the graph is constructed to explicitly include both target terms and predefined semantic types as nodes, node representations are learned through the GSSL component of NATD\-GSSL\. The resulting embeddings𝐇\\mathbf\{H\}are then used to perform type assignment\. We adopt a*nearest type assignment*strategy based on embedding similarity\. Given the embeddings produced by the GSSL encoder, each target nodeviv\_\{i\}is assigned the semantic typetj∈𝒯t\_\{j\}\\in\\mathcal\{T\}whose embedding is most similar to that ofviv\_\{i\}, measured using cosine similarity:
$$\phi(v_i) = \arg\max_{t_j \in \mathcal{T}} \cos\left(\mathbf{H}_{v_i}, \mathbf{H}_{t_j}\right), \qquad (1)$$

where $\mathbf{H}_{v_i}$ and $\mathbf{H}_{t_j}$ denote the embeddings of the target term node and the semantic type node, respectively.
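In code, Eq. (1) reduces to a row-wise argmax over a cosine similarity matrix. A minimal sketch, assuming `term_idx` and `type_idx` hold the node indices of target terms and type nodes:

```python
import torch
import torch.nn.functional as F

def assign_types(h: torch.Tensor, term_idx: torch.Tensor,
                 type_idx: torch.Tensor) -> torch.Tensor:
    """Nearest type assignment (Eq. 1): each target term v_i receives the
    type t_j whose embedding maximizes cosine similarity with H_{v_i}."""
    h_terms = F.normalize(h[term_idx], dim=-1)   # |terms| x d
    h_types = F.normalize(h[type_idx], dim=-1)   # |T| x d
    sim = h_terms @ h_types.t()                  # cosine similarities
    return sim.argmax(dim=-1)                    # position within type_idx
```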
### 4.2 Dual-Graph Evaluation Protocol
To quantify the impact of graph quality on downstream performance, we introduce a *dual-graph evaluation protocol* based on two graphs derived from the same domain. The first graph is automatically constructed from raw text and, as such, contains various forms of real-world noise, including structural fragmentation, spurious or missing edges, and noisy or incomplete node attributes. This graph includes a set of target terms to be typed, along with predefined semantic type nodes, following the NATD-GSSL framework. The second graph is a high-quality reference graph curated by domain experts. It includes the same set of target terms and semantic type nodes as the noisy graph, which enables a direct and meaningful comparison. All models are trained and evaluated using the same experimental setup, including architecture, dimensionality, optimization settings, and evaluation procedure. The only difference between runs is the input graph. By applying GSSL independently to both graphs and evaluating the same downstream task (unsupervised term typing), we can measure the performance gap, that is, the loss in performance attributable to the noise present in the automatically constructed graph. Figure 2 illustrates this dual-graph evaluation protocol.
Figure 2: Dual-graph evaluation protocol.
### 4.3 Data and Graph Variants
To support the dual\-graph evaluation protocol, this section describes the data sources and the different graph variants used in our experiments\.
##### Input Corpus
We use the MedMentions corpus [mohan2019medmentions], a large-scale biomedical dataset originally designed for concept recognition and entity typing. This corpus is particularly well suited to our experimental setting for two main reasons. First, it provides a large collection of domain-specific terms requiring semantic typing, which naturally supports the unsupervised term typing task considered in this work. Second, MedMentions is annotated according to the UMLS semantic type system, enabling direct alignment with the UMLS knowledge graph.
##### Graph Variants
Based on this corpus, we construct and evaluate the following graph variants:
- **Clean Reference Graph.** A high-quality biomedical knowledge graph validated by domain experts, derived from the UMLS-NCI Thesaurus [bodenreider2004unified]. It is characterized by a well-defined schema, dense connectivity, absence of fragmentation, and coherent semantic organization. This graph serves as the reference graph in our dual-graph evaluation protocol.
- **Noisy Graph.** Automatically constructed from the MedMentions corpus using the GT2KG approach. This graph exhibits severe structural weaknesses, including extreme fragmentation, low average node degree, and sparse connectivity. Examples of noisy triplets in the constructed graph are illustrated in Table 3 below.
- **Enriched Graph.** Obtained by applying enrichment-based refinement strategies to the noisy graph, aimed at alleviating structural deficiencies and reinforcing connectivity through the insertion of additional hierarchical relations.
- **Cleaned Graph.** Generated by applying the cleaning refinement strategy, in which detected noisy triples are filtered out. We follow the same prompting strategy protocol as in [dong-etal-2025-refining], using DeepSeek with reasoning enabled.
- **Combined Refined Graph.** Generated by applying both refinement strategies: first enrichment, followed by cleaning, in order to also check the added triplets.

Table 3: Examples of real-world noise in the text-driven graph.

| Noise Type | Extracted Triplet (or Missing Element) | Noise Category | Explanation |
|---|---|---|---|
| Erroneous Link | (Plant, related_to, Intellectual Product) | Structural | Two nodes that should not be connected. |
| Missing Node | "Cancer Genetics Network" present in text but not extracted | Node-level | Important concept missing, degrading subgraph completeness. |
| Generic Entity | (study, associated_with, outcome) | Node-level | Overly frequent generic nodes form hubs, biasing message passing. |
| Boundary Error | (avian influenza and Ebola virus hemorrhagic fever, cause, symptoms) | Node-level | Two entities are concatenated due to conjunction misinterpretation. |
| Entity Ambiguity | (conventional lipid extraction technique, compare, traditional method of lipid extraction) | Semantic | Two expressions refer to the same concept but appear as different nodes. |
| Wrong Relation | (heavy metal, is, abiotic stress) | Structural & Semantic | Heavy metals cause stress but are not a subtype of abiotic stress. |
| Node-level & Structural & Semantic | (mesenchymal stem cell MTH-68H, provide, Newcastle disease virus novel therapeutic approach) | Node-level & Semantic & Structural | Object mixes a true concept (virus) with the irrelevant add-on "novel therapeutic approach". |
| Wrong Relation & Wrong Entity | (median age, is, 63 year) | Node-level & Structural & Semantic | "is-a" is incorrect; numeric values are not valid entities. |
Key topological statistics of all graph variants are reported in Table 4.
Table 4: Graph statistics with topological measures. $n_{\text{comp}}$: number of connected components; $r_{\text{giant}}$: giant component ratio; avg_deg: average node degree.

| Graph Variant | Nodes | Edges | Relations | $n_{\text{comp}}$ | $r_{\text{giant}}$ | avg_deg |
|---|---|---|---|---|---|---|
| Noisy Graph | 37,142 | 34,322 | 76 | 6,987 | 0.51 | 1.85 |
| Clean Reference Graph | 184,574 | 1,258,931 | 110 | 1 | 1.00 | 13.60 |
| Enriched Graph | 58,483 | 80,193 | 76 | 989 | 0.94 | 2.74 |
| Cleaned Graph | 30,007 | 25,932 | 76 | 6,532 | 0.43 | 1.72 |
| Combined Refined Graph | 50,895 | 60,984 | 76 | 3,210 | 0.82 | 2.39 |
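For reference, these measures can be recomputed from any triple list with networkx. A minimal sketch, treating the graph as weakly connected for component analysis and counting in- plus out-degree, so that avg_deg = 2|E|/|V| (e.g., 2 × 34,322 / 37,142 ≈ 1.85 for the noisy graph):

```python
import networkx as nx

def topo_stats(triples):
    """n_comp, r_giant, avg_deg for a list of (subject, relation, object)."""
    g = nx.MultiDiGraph()
    g.add_edges_from((s, o, {"rel": r}) for s, r, o in triples)
    comps = list(nx.weakly_connected_components(g))
    n_comp = len(comps)
    r_giant = max(len(c) for c in comps) / g.number_of_nodes()
    avg_deg = 2 * g.number_of_edges() / g.number_of_nodes()
    return n_comp, r_giant, avg_deg
```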
##### Gold Standard and Evaluation Scope
All graph variants share an identical set of target terms, forming a gold standard composed of 1,040 nodes equally distributed across eight semantic types. Evaluation is conducted exclusively on this shared set of target terms. Performance is evaluated using Accuracy, macro-Precision, and macro-F1-score. Recall is not reported separately, as it is equivalent to Accuracy under the balanced class distribution of the gold standard.
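A sketch of the evaluation itself, assuming scikit-learn and integer-encoded gold and predicted types; with the balanced gold standard, macro-Recall coincides with Accuracy, which is why it is omitted:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score

def evaluate(y_true, y_pred):
    """Accuracy, macro-Precision, and macro-F1 over the 1,040 gold terms."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_precision": precision_score(y_true, y_pred,
                                           average="macro", zero_division=0),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
    }
```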
### 4.4 GSSL Methods and Baseline
We systematically explored a broad range of GSSL methods, covering both generative and contrastive paradigms. These methods are trained on different pretext tasks, including relation reconstruction, feature reconstruction, and contrastive learning, each implemented using various encoder-decoder architectures built upon different layers, such as GNNs and MLPs. This systematic exploration allows us to assess how architectural choices influence robustness to real-world graph noise. In total, 36 distinct GSSL configurations were evaluated. The complete set of explored variants is summarized in Table 5.
Table 5: GSSL methods considered. Feat. Rec.: Feature Reconstruction; Rel. Rec.: Relation Reconstruction.

| Pretext Task | Category | Encoders | Decoders | Loss |
|---|---|---|---|---|
| Feat. Rec. | Generative | GCN [kipf2016semi], GAT [velivckovic2017graph], RGCN [10.1007/978-3-319-93417-4_38], TransGCN/RotatEGCN [cai2019transgcn] | GCN [kipf2016semi], GAT [velivckovic2017graph], MLP, RGCN [10.1007/978-3-319-93417-4_38], TransGCN/RotatEGCN [cai2019transgcn] | MSE |
| Rel. Rec. | Generative | RGCN [10.1007/978-3-319-93417-4_38] | DistMult [yang2015embeddingentitiesrelationslearning] | MCE |
| Contrastive | Contrastive | GCN [kipf2016semi], GAT [velivckovic2017graph], RGCN [10.1007/978-3-319-93417-4_38], TransGCN/RotatEGCN [cai2019transgcn] | Contrastive scoring [zhu2020deepgraphcontrastiverepresentation] | InfoNCE [zhu2020deepgraphcontrastiverepresentation] |

To establish a meaningful point of comparison, we define a non-graph baseline referred to as the Textual PLM Baseline (PLM-Only). This baseline relies solely on pretrained textual semantics without using any graph structure. Each term is represented using static embeddings derived from a pretrained language model, and semantic types are assigned based on cosine similarity in the embedding space using Eq. (1). This baseline serves two key purposes: first, it provides a lower bound that reflects the performance achievable without leveraging relational information; second, it allows us to isolate the contribution of GSSL, as the same pretrained embeddings are used to initialize the node representations in all graph-based configurations.
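The PLM-Only baseline amounts to running Eq. (1) directly on static sentence embeddings. A minimal sketch, with illustrative term and type strings (the actual inputs are the 1,040 gold terms and the eight UMLS semantic types):

```python
import torch
from sentence_transformers import SentenceTransformer

plm = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
terms = ["stem cell therapy", "heavy metal"]    # illustrative terms
types = ["Therapeutic Procedure", "Chemical"]   # illustrative types

e_terms = torch.tensor(plm.encode(terms, normalize_embeddings=True))
e_types = torch.tensor(plm.encode(types, normalize_embeddings=True))
pred = (e_terms @ e_types.t()).argmax(dim=-1)   # Eq. (1) on static embeddings
print([types[i] for i in pred])
```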
### 4.5 Experimental Setup
All experiments were conducted on an NVIDIA RTX 4500 Ada GPU (24 GB VRAM) with CUDA 12.6, using PyTorch and PyTorch Geometric as the primary frameworks. Node and relation features were initialized using pre-trained language models, with sentence-transformers/all-MiniLM-L6-v2 selected after comparative evaluation across general-purpose, biomedical, and semantic similarity PLM families (details in Appendix A). Training was performed in mini-batch mode with GraphSAGE [hamilton2017inductive] neighborhood sampling, retaining 200 neighbors per layer. All models were trained for 50 epochs and repeated three times under different random seeds to ensure reproducibility. The encoder architecture consisted of two hidden layers with dimensional configurations of [384, 256] and [256, 128], while batch sizes were set to 256 or 512 depending on the graph size. For RGCN layers, the number of basis matrices (num_bases) was set between 5 and 10 to balance expressiveness and computational efficiency.
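This mini-batch setup can be reproduced with PyTorch Geometric's NeighborLoader. The following sketch wires up the sampling parameters stated above (200 neighbors per layer for a two-layer encoder, batch size 256), assuming the `kg` Data object and `model` from the earlier sketches:

```python
from torch_geometric.loader import NeighborLoader

# GraphSAGE-style neighborhood sampling, as described in Section 4.5.
loader = NeighborLoader(
    kg,
    num_neighbors=[200, 200],  # up to 200 neighbors per layer
    batch_size=256,            # 256 or 512 depending on graph size
    shuffle=True,
)

for batch in loader:
    # Edge-level attributes such as edge_type are carried into each batch.
    h, x_rec = model(batch.x, batch.edge_index, batch.edge_type)
```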
## 5 Results and Analysis
In this section, we present the results of the conducted experiments, followed by an in\-depth analysis\. This analysis aims to answer the following research questions:
- **Q1.** How does the performance of GSSL on a text-driven graph compare to its performance on a clean graph?
- **Q2.** Is the relative performance of GSSL methods (pretext task, encoder-decoder architecture) influenced by graph quality?
- **Q3.** What is the contribution of graph refinement methods, particularly enrichment versus cleaning techniques?
### 5.1 Results Overview
The results obtained with the different GSSL methods on all graph variants are presented in Table 6, where the best-performing models for each pretext task are highlighted. For feature reconstruction, however, the top three configurations are reported due to the limited performance variation observed across models.
Table 6: Performance of GSSL model variants per graph type. Feat. Rec.: Feature Reconstruction; Rel. Rec.: Relation Reconstruction; Acc.: Accuracy; Prec.: Precision. Best results per graph variant in **bold**.

| Pretext Task | Encoder | Decoder | Acc. | F1 | Prec. |
|---|---|---|---|---|---|
| **Baseline (PLM initial embedding)** | all-MiniLM-L6-v2 | – | 0.4634 | 0.4371 | 0.4730 |
| **Clean Reference Graph** | | | | | |
| Feat. Rec. | RGCN | MLP | 0.5480 ± 0.008 | 0.5180 ± 0.010 | 0.5345 ± 0.009 |
| Feat. Rec. | TransEGCN(conv) | GAT | 0.5442 ± 0.009 | 0.5254 ± 0.011 | 0.5342 ± 0.010 |
| Feat. Rec. | RotatEGCN(conv) | MLP | 0.5413 ± 0.007 | **0.5287 ± 0.009** | **0.5400 ± 0.008** |
| Rel. Rec. | RGCN | DistMult | **0.5577 ± 0.014** | 0.5234 ± 0.016 | 0.5290 ± 0.015 |
| Contrastive | GAT | Contrastive scoring | 0.3836 ± 0.028 | 0.3629 ± 0.031 | 0.3688 ± 0.029 |
| **Noisy Graph** | | | | | |
| Feat. Rec. | TransEGCN(conv) | RotatEGCN(attn) | **0.5355 ± 0.014** | **0.5197 ± 0.017** | 0.5275 ± 0.015 |
| Feat. Rec. | TransEGCN(attn) | TransEGCN(attn) | 0.5346 ± 0.015 | 0.5144 ± 0.018 | 0.5125 ± 0.016 |
| Feat. Rec. | TransEGCN(conv) | TransEGCN(attn) | 0.5336 ± 0.013 | 0.5148 ± 0.017 | **0.5402 ± 0.014** |
| Rel. Rec. | RGCN | DistMult | 0.3038 ± 0.026 | 0.2776 ± 0.030 | 0.2802 ± 0.028 |
| Contrastive | RotatEGCN(conv) | Contrastive scoring | 0.4028 ± 0.041 | 0.3792 ± 0.045 | 0.4117 ± 0.039 |
| **Enriched Graph** | | | | | |
| Feat. Rec. | RotatEGCN(conv) | MLP | **0.5413 ± 0.009** | **0.5223 ± 0.012** | **0.5490 ± 0.010** |
| Feat. Rec. | RGCN | MLP | 0.5394 ± 0.010 | 0.5162 ± 0.013 | 0.5349 ± 0.011 |
| Feat. Rec. | TransEGCN(conv) | RGCN | 0.5288 ± 0.012 | 0.4965 ± 0.015 | 0.5176 ± 0.013 |
| Rel. Rec. | RGCN | DistMult | 0.3952 ± 0.020 | 0.3384 ± 0.024 | 0.4265 ± 0.022 |
| Contrastive | TransEGCN(conv) | Contrastive scoring | 0.4394 ± 0.034 | 0.4316 ± 0.037 | 0.4699 ± 0.035 |
| **Cleaned Graph** | | | | | |
| Feat. Rec. | TransEGCN(attn) | TransEGCN(attn) | **0.5346 ± 0.011** | **0.5189 ± 0.014** | **0.5434 ± 0.012** |
| Feat. Rec. | TransEGCN(conv) | TransEGCN(attn) | 0.5307 ± 0.012 | 0.5163 ± 0.015 | 0.5400 ± 0.013 |
| Feat. Rec. | RotatEGCN(conv) | TransEGCN(conv) | 0.5307 ± 0.011 | 0.5128 ± 0.014 | 0.5251 ± 0.012 |
| Rel. Rec. | RGCN | DistMult | 0.3952 ± 0.021 | 0.3384 ± 0.025 | 0.4265 ± 0.023 |
| Contrastive | RotatEGCN(conv) | Contrastive scoring | 0.3269 ± 0.029 | 0.3125 ± 0.032 | 0.3292 ± 0.030 |
| **Combined Refined Graph** | | | | | |
| Feat. Rec. | TransEGCN(attn) | TransEGCN(conv) | **0.5375 ± 0.012** | 0.5186 ± 0.015 | 0.5180 ± 0.014 |
| Feat. Rec. | TransEGCN(conv) | GAT | 0.5355 ± 0.013 | **0.5224 ± 0.016** | **0.5445 ± 0.014** |
| Feat. Rec. | RGCN | MLP | 0.5042 ± 0.015 | 0.5042 ± 0.017 | 0.5183 ± 0.016 |
| Rel. Rec. | RGCN | DistMult | 0.4067 ± 0.019 | 0.3497 ± 0.023 | 0.3955 ± 0.021 |
| Contrastive | RotatEGCN(conv) | Contrastive scoring | 0.4278 ± 0.033 | 0.4086 ± 0.036 | 0.4165 ± 0.034 |

The results show a noticeable variation in performance depending on both the encoder-decoder architecture and the pretext task. While some configurations lead to significant improvements, up to +7% in Accuracy compared to the initial PLM embeddings, others exhibit a clear degradation, down to -16%, with accuracy gaps reaching up to 0.25 between the best and worst models under the same typing task. Since Table 6 only reports the best-performing models per task, Figure 3 complements it by illustrating the full range of results obtained across all encoder-decoder combinations for each pretext task.
This variation highlights the strong influence of two factors: the choice of pretext task itself, and, within the same task, the specific architectural design of the encoder\-decoder components\. We therefore conduct a two\-fold analysis based on these axes of variation\.
Figure 3: Accuracy distribution across pretext tasks under different data settings: (a) Noisy Graph, (b) Clean Reference Graph, (c) Enriched Graph, (d) Cleaned Graph, (e) Combined Refined Graph. The red dashed line indicates the performance of the initial PLM embedding.
### 5.2 Pretext Task Analysis
- **Feature Reconstruction:** The feature reconstruction pretext task demonstrates the highest robustness across all graph variants. It consistently yields strong performance, achieving an approximate 8% improvement in accuracy compared to PLM-initialized embeddings, with a minimal performance gap of less than 1% between the noisy and clean reference graph settings. This stability suggests that the task generalizes well to varying noise levels, as it aims to reconstruct the semantic node features originally derived from PLMs while incorporating contextual information through neighborhood aggregation. Even in the presence of noisy neighborhoods, such as weak connections or erroneous edges, the pretext task seeks to preserve the initial embeddings. As illustrated in Figure 3, however, performance stability varies across graph qualities: results on the Noisy Graph exhibit higher variance across architectural choices, indicating that feature reconstruction is more sensitive to model design in structurally deficient settings. In contrast, the Clean Reference Graph leads to more consistent results across architectures, highlighting the stabilizing effect of well-structured relational data. Moreover, the refined graphs (enriched, cleaned, and combined) demonstrate both greater stability and overall performance gains over the PLM baseline, further confirming that even partial graph refinement can mitigate architectural sensitivity and reinforce the effectiveness of feature reconstruction. However, this strategy presents a double-edged sword: while it ensures stability, it may perform poorly when the initial embeddings themselves are suboptimal. As shown in Figure 4, correct classifications obtained from the initial embeddings are largely preserved in the final predictions (65.7% on the clean reference graph and 86.3% on the noisy graph), whereas incorrect initial classifications are rarely corrected (44.8% and 25.3%, respectively).

  Figure 4: Transition matrices between initial embedding-based classification and final GSSL classification, for (a) the clean reference graph and (b) the noisy graph. Rows correspond to initial correctness and columns to final correctness.
- **Relation Reconstruction:** The relation reconstruction pretext task achieves the best overall performance on the clean graph compared to all other pretext tasks and graph variants. However, its performance degrades significantly in the noisy and refined graph settings, falling below that of the PLM-initialized embeddings. This decline can be attributed to the structural differences between the graphs. The clean graph is organized according to a core ontology, wherein each relation adheres to well-defined domain and range constraints. This structure enables the model to implicitly infer the types of subject and object entities during training. In contrast, the noisy graph, constructed using a general-purpose method (GT2KG), lacks ontological structure, exhibits a fragmented and sparse topology, and contains erroneous connections. As a result, the correspondence between relations and entity types deteriorates. Although DistMult remains effective at reconstructing relations under these conditions (Fig. 5), it is not well suited for the typing task due to its limited capacity to model type-specific semantic dependencies.

  Figure 5: Accuracy evolution for the relation reconstruction task using the DistMult decoder on (a) the clean graph and (b) the noisy graph.
- **Contrastive:** While contrastive learning is widely recognized for its strong discriminative capacity, our results indicate that it consistently degrades performance relative to the initial embeddings across all graph variants. This uniform drop suggests that the limitation stems not from the quality of the underlying graphs, but from a fundamental misalignment between the contrastive learning objective and the entity typing task. A key factor contributing to this misalignment is the negative sampling strategy typically employed in contrastive learning methods [zhu2020deepgraphcontrastiverepresentation]. Specifically, for a given node in one augmented view of the graph, all other nodes in the second view, including the predefined type nodes, are treated as negatives. As a result, the model is explicitly encouraged to push apart a term and its correct type, which directly contradicts the goal of the typing task; a minimal loss sketch illustrating this effect follows this list. This contradiction leads to degraded performance, as clearly illustrated in Figure 6: on the clean graph, the architecture that achieved the highest accuracy corresponds to the one that optimized the contrastive loss the least, further highlighting the incompatibility between the contrastive objective and the typing task.

  Figure 6: Accuracy (left) and contrastive loss (right) curves for the contrastive learning pretext task across different encoder architectures on the clean reference graph.
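As announced in the contrastive bullet above, the following is a minimal, GRACE-style InfoNCE sketch (simplified to inter-view negatives only) showing why type nodes end up as negatives: the positive for node $i$ is only its own counterpart in the other view, so a term node is pushed away from every type node, including its correct one.

```python
import torch
import torch.nn.functional as F

def infonce(h1: torch.Tensor, h2: torch.Tensor, tau: float = 0.5):
    """Node-level InfoNCE over two augmented views h1, h2 (aligned by row).
    For node i, the positive is row i of the other view; every other node,
    type nodes included, acts as a negative."""
    z1, z2 = F.normalize(h1, dim=-1), F.normalize(h2, dim=-1)
    sim = z1 @ z2.t() / tau              # pairwise cosine similarities
    targets = torch.arange(z1.size(0))   # positives lie on the diagonal
    return F.cross_entropy(sim, targets)
```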
#### 5.2.1 Analysis of Encoder-Decoder Architecture Choices
We focus here on the architectures associated with the feature reconstruction pretext task, which proved to be the most robust overall. As shown in Figure 3, performance varies depending on the chosen encoder-decoder architecture, and each graph variant exhibits its own set of top-performing combinations. To analyze how these architectural choices generalize across different graph qualities, we selected the top three performing encoder-decoder combinations for each graph variant. Their performance and stability across all graph variants are summarized in Figure 7 and Figure 8.
Figure 7: Global heatmap showing encoder-decoder performance across different graph variants. Colors indicate mean accuracy; stars ($\star$) indicate the number of graph variants for which the encoder-decoder combination achieves the best performance.

Figure 8: Mean accuracy versus stability (standard deviation) for different encoder-decoder combinations. Highlighted models indicate favorable performance-stability trade-offs.

From these visualizations, three trends emerge:
- **RGCN-MLP:** consistently appears among the top three performers on the *Clean Reference*, *Enriched*, and *Combined Refined* graphs, all of which exhibit high connectivity and low fragmentation. The effectiveness of RGCN-MLP in these settings is attributable to its relational message-passing mechanism, which depends on diverse relation types and aggregates messages from incoming neighbors. However, its performance drops significantly on the *Noisy* and *Cleaned* graphs, which are sparse and structurally degraded. In such graphs, lower in-degree, reduced relational diversity, and isolated components hinder message propagation, negatively impacting contextualization and node typing, thereby reducing both accuracy and stability.
- **RotatEGCN(conv)-MLP:** This combination yields the *highest mean accuracy overall* and ranks in the top three for the *Clean Reference* and *Enriched* graphs, both well connected. Notably, despite not ranking in the top three for the *Noisy* and *Cleaned* graphs, it still performs *robustly* in these sparse settings. Its generalization ability stems from the RotatEGCN layers, which integrate *both incoming and outgoing* edges. This bidirectional setting enables even sparsely connected or partially isolated nodes (e.g., those with only one outgoing edge) to receive meaningful information. The architecture therefore demonstrates a favorable trade-off between performance and stability across all graph types.
- **TransEGCN(conv)-TransEGCN(attn):** This architecture is particularly effective for *structurally weak graphs*, ranking among the top three on both the *Noisy* and *Cleaned* variants. Moreover, it performs competitively on augmented or well-structured graphs such as the *Clean Reference* and *Enriched* versions. The dual use of TransEGCN layers at both encoder and decoder levels, combined with the incorporation of attention mechanisms at the decoder, supports robust feature propagation while mitigating over-smoothing in dense graphs. Its ability to leverage both incoming and outgoing edges further enhances its adaptability across different structural conditions.
### 5.3 Graph Refinement Analysis
To evaluate the contribution of graph refinement strategies, we compare the performance of the GSSL framework across four graph variants: *noisy*, *enriched*, *cleaned*, and *combined refined* graphs. Figure 9 summarizes the impact of each refinement strategy on the three pretext tasks: feature reconstruction, relation reconstruction, and contrastive learning.
Figure 9: Impact of graph refinement strategies on GSSL performance, using mean accuracy per task and graph variant.

Graph enrichment substantially improves structural connectivity by reducing graph fragmentation and increasing the giant component ratio. This structural enhancement translates into consistent performance gains across all tasks. Compared to the noisy graph, enrichment improves relation reconstruction accuracy by approximately +9%, contrastive learning by +3.7%, and feature reconstruction by +0.5%. The relatively modest improvement observed for feature reconstruction confirms its robustness to structural noise, whereas relation reconstruction and contrastive learning benefit directly from increased connectivity and more informative neighborhoods. In contrast, graph cleaning exhibits a paradoxical effect. Although it removes noisy or semantically inconsistent edges, it also degrades global connectivity, as reflected by the reduced giant component ratio. This over-aggressive edge pruning leads to a severe drop in contrastive learning performance (-7.6%), which strongly depends on neighborhood structure. At the same time, relation reconstruction benefits from the removal of spurious relations, while feature reconstruction remains largely unaffected, further highlighting its limited sensitivity to structural perturbations. The combined refinement strategy yields mixed results. While it achieves the highest accuracy for relation reconstruction (+10% relative to the noisy graph), its performance for feature reconstruction and contrastive learning is comparable to enrichment alone. This suggests that the additional computational cost of the cleaning step may not be justified. Overall, these results indicate that for sparse, text-driven graphs, structural enrichment appears to be a more reliable refinement strategy than aggressive cleaning.
## 6 Conclusion
Graph Self-Supervised Learning methods offer a wide variety of architectures and learning paradigms capable of deriving meaningful representations from graphs without requiring labeled data. With the growing accessibility of text-driven knowledge graphs, enabled by recent advances in NLP, new opportunities emerge for applying GSSL across multiple domains. These opportunities, however, come with an urgent need to understand how GSSL performs on such graphs, which often suffer from complex and heterogeneous noise introduced by automatic construction pipelines, an aspect that most existing robustness studies neglect by focusing only on synthetic noise. This work presents the first comprehensive study evaluating the performance of various GSSL methods on text-driven graphs for a downstream term typing task. We introduce *NATD-GSSL*, a unified framework that integrates graph construction, refinement, and GSSL into a single pipeline. To quantify the impact of real-world noise, we propose a *dual-graph evaluation protocol* that compares GSSL performance on paired noisy and clean graphs, aligned to a shared gold standard within the biomedical domain. Our results reveal several key findings. First, *relation reconstruction* tasks are highly sensitive to noise and require clean graphs with well-defined schemas, while *feature reconstruction* proves the most robust, achieving performance comparable to that on clean graphs and underscoring the importance of architectural choices. In contrast, contrastive approaches show that robustness depends less on graph quality and more on alignment with the downstream objective. Additionally, our refinement experiments indicate that *graph enrichment* is beneficial for sparse graphs, while over-aggressive *denoising* may degrade performance by reducing structural connectivity. Overall, NATD-GSSL achieves up to a +7% improvement in term typing accuracy over pretrained language model baselines.
While our findings provide a solid foundation for understanding GSSL behavior on noisy, text-driven graphs, the study has some limitations: it relies on a single graph construction method, clean, schema-aligned graphs remain difficult to obtain from text, and the refinement techniques are evaluated solely through downstream performance. These limitations suggest several promising directions for future research: (i) extending the evaluation to other domains, downstream tasks, and alternative graph construction pipelines; (ii) designing contrastive learning strategies with task-aligned negative sampling and loss formulations; (iii) performing fine-grained architectural analyses to separately assess the impact of encoders and decoders; and (iv) developing more advanced noise mitigation techniques, including new GNN layers robust to extraction-induced noise. Pursuing these directions with strong empirical grounding will help the community move beyond the limitations of prior synthetic-noise studies and support the practical deployment of GSSL in realistic, noisy, text-driven knowledge graph scenarios.
## Availability of Data and Materials
All datasets, resources, and source code used in this study are publicly available and fully documented on GitHub at [https://github.com/OthmaneKabal/MC2GAE](https://github.com/OthmaneKabal/MC2GAE). The UMLS-NCI graph is not publicly redistributable due to licensing restrictions; access requires an academic research license for the Unified Medical Language System, so this dataset cannot be shared directly by the authors. However, we provide all scripts and preprocessing pipelines needed to reconstruct and prepare the graph once the data are obtained from the official source: [https://www.nlm.nih.gov/research/umls/index.html](https://www.nlm.nih.gov/research/umls/index.html).
## Appendix A: PLM Initial Embeddings
This appendix reports the evaluation of initial embeddings obtained from different categories of pre-trained language models on the term typing task, in order to identify the most effective representations for initializing graph nodes and relations. The results, summarized in Table [7](https://arxiv.org/html/2605.05463#A1.T7), provide a comparative overview across general-purpose, domain-specific, and semantic-similarity-oriented PLMs.
Table 7: Evaluation of PLM-based initial embeddings on the term typing task.

| Category | Model | Acc. | F1 | Prec. |
|---|---|---|---|---|
| Semantic similarity | all-MiniLM-L6-v2 | 0.4635 | 0.4372 | 0.4731 |
| Semantic similarity | paraphrase-MiniLM-L6-v2 | 0.4394 | 0.4211 | 0.4475 |
| Semantic similarity | nli-roberta-base-v2 | 0.4163 | 0.3881 | 0.4249 |
| Semantic similarity | msmarco-distilbert-base-v3 | 0.3683 | 0.3553 | 0.3587 |
| Semantic similarity | multi-qa-MiniLM-L6-cos-v1 | 0.4144 | 0.3905 | 0.4189 |
| Semantic similarity | stsb-roberta-base-v2 | 0.3481 | 0.3268 | 0.3438 |
| Domain-specific | scibert_scivocab_uncased | 0.3981 | 0.3484 | 0.4091 |
| Domain-specific | specter | 0.3404 | 0.3248 | 0.3815 |
| Domain-specific | S-BioBert-snli-multinli-stsb | 0.3913 | 0.3667 | 0.3863 |
| Domain-specific | S-PubMedBert-MS-MARCO | 0.4115 | 0.3539 | 0.4263 |
| Domain-specific | BioLinkBERT-base | 0.1798 | 0.1408 | 0.1908 |
| Domain-specific | biobert-base-cased-v1.1 | 0.2644 | 0.2288 | 0.3776 |
| Domain-specific | Bio_ClinicalBERT | 0.2538 | 0.2217 | 0.3002 |
| General-purpose | bert-base-uncased | 0.2442 | 0.2191 | 0.2952 |
| General-purpose | bert-base-cased | 0.2712 | 0.2568 | 0.3079 |
| General-purpose | roberta-base | 0.2144 | 0.1713 | 0.2599 |
| General-purpose | distilbert-base-uncased | 0.3288 | 0.3069 | 0.3553 |
| General-purpose | graphcodebert-base | 0.1673 | 0.1483 | 0.1626 |
| General-purpose | bart-large | 0.2712 | 0.2555 | 0.3149 |
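As a rough illustration of how such embeddings can be probed on term typing, the sketch below assigns each term to its nearest type label by cosine similarity, using the best-performing model from Table 7. The toy terms and types are hypothetical examples, and this nearest-type assignment is only one simple probe; it does not reproduce the paper's exact evaluation protocol.

```python
from sentence_transformers import SentenceTransformer

# Hypothetical toy examples; the paper types biomedical terms against
# a UMLS-derived type inventory, which is not reproduced here.
terms = ["aspirin", "myocardial infarction", "hemoglobin"]
types = ["Pharmacologic Substance", "Disease or Syndrome",
         "Amino Acid, Peptide, or Protein"]

model = SentenceTransformer("all-MiniLM-L6-v2")  # top row of Table 7
term_emb = model.encode(terms, normalize_embeddings=True)
type_emb = model.encode(types, normalize_embeddings=True)

# With L2-normalized vectors the dot product equals cosine similarity;
# each term is assigned the label of its nearest type embedding.
sims = term_emb @ type_emb.T          # shape (n_terms, n_types)
for term, idx in zip(terms, sims.argmax(axis=1)):
    print(f"{term!r} -> {types[idx]}")
```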