Graph Transductive Sharpening: Leveraging Unlabeled Predictions in Node Classification

arXiv cs.LG 05/21/26, 04:00 AM Papers
Summary
This paper introduces Transductive Sharpening (TS), a loss-level modification for semi-supervised node classification that minimizes prediction entropy on unlabeled nodes while counterbalancing on labeled nodes, achieving consistent performance improvements without architectural changes.
arXiv:2605.20248v1 Announce Type: new
# Leveraging Unlabeled Predictions in Node Classification
Source: [https://arxiv.org/html/2605.20248](https://arxiv.org/html/2605.20248)
## Graph Transductive Sharpening: Leveraging Unlabeled Predictions in Node Classification

Brown Zaz (University of Cambridge, jb2650@cl.cam.ac.uk), Mar Gonzàlez I Català (University of Cambridge, mg2211@cl.cam.ac.uk), Ferran Hernandez Caralt (University of Cambridge, fh455@cl.cam.ac.uk), Moshe Eliasof (University of Cambridge, me532@cl.cam.ac.uk), Pietro Liò (University of Cambridge, pl219@cl.cam.ac.uk)

###### Abstract

In the transductive setting, where the full graph is observed but node labels are only partially available, progress in semi-supervised node classification has largely focused on architectural innovation. In this paper, we revisit an orthogonal axis: the training objective. We start from a simple observation: transductive models produce predictions for every node during training, including nodes without labels. These unlabeled-node predictions may contain useful training signal, but standard supervised objectives discard them because no ground-truth labels are available. Inspired by the decomposition of cross-entropy into a label-dependent alignment term and a label-independent entropy term, we propose prediction confidence as a natural way to extract this signal in the absence of labels. This motivates Transductive Sharpening (TS): a loss-level modification that minimizes prediction entropy on unlabeled nodes while counterbalancing this effect on labeled nodes. We evaluate Transductive Sharpening across a wide range of node-classification benchmarks and observe consistent performance improvements without requiring any changes to the backbone architecture. Code is available at [https://github.com/transductive-sharpening/tunedGNN](https://github.com/transductive-sharpening/tunedGNN).

## 1 Introduction

Graph neural networks (GNNs) \[[13](https://arxiv.org/html/2605.20248#bib.bib5), [14](https://arxiv.org/html/2605.20248#bib.bib8), [4](https://arxiv.org/html/2605.20248#bib.bib6), [7](https://arxiv.org/html/2605.20248#bib.bib7), [21](https://arxiv.org/html/2605.20248#bib.bib32), [42](https://arxiv.org/html/2605.20248#bib.bib30), [17](https://arxiv.org/html/2605.20248#bib.bib27), [12](https://arxiv.org/html/2605.20248#bib.bib23)\] have become the dominant approach for node classification tasks, particularly in the transductive setting \[[3](https://arxiv.org/html/2605.20248#bib.bib3)\], where the full graph is observed but only a subset of node labels is available. Over the past several years, progress in this area has been driven largely by architectural innovation, with increasingly sophisticated message-passing schemes \[[16](https://arxiv.org/html/2605.20248#bib.bib29), [57](https://arxiv.org/html/2605.20248#bib.bib33), [1](https://arxiv.org/html/2605.20248#bib.bib39), [31](https://arxiv.org/html/2605.20248#bib.bib26), [24](https://arxiv.org/html/2605.20248#bib.bib2)\] and transformer-based models \[[19](https://arxiv.org/html/2605.20248#bib.bib28)\]. In contrast, the design of training objectives has received comparatively little attention, despite its central role in shaping model performance.

In the transductive setting, models produce predictions for all nodes in the graph at every training step, including those without labels, yet the training objective is applied only to labeled nodes, as standard supervised losses require ground\-truth labels\. However, once a model begins to form reliable and confident predictions, these may themselves provide a useful learning signal for nodes with unknown labels\. We build on this observation by leveraging such predictions, encouraging confidence on unlabeled nodes while preventing overconfidence on labeled ones\.

We introduce Transductive Sharpening \(TS\), a simple and elegant loss\-level modification that implements this idea\. The method introduces a single hyperparameter and can be applied on top of any GNN architecture\. Empirically, we show that it consistently improves performance across a wide range of models and benchmarks\.

To understand the behavior induced by TS, we study the effect of the sharpening coefficient $\lambda$ and analyze how the objective changes the distribution of predictive confidence across the graph. Empirically, we find that moderate positive values of $\lambda$ yield the most reliable gains, and that TS reallocates confidence toward unlabeled nodes as intended by the objective.

Our results point to a simple but powerful principle: predictions generated during training, typically discarded when labels are unavailable, can be directly leveraged to improve learning\. While we study this idea in the context of transductive graph learning, it naturally extends to other settings, suggesting a general avenue for improving learning algorithms without increasing model complexity\.

#### Contributions\.

Our main contributions are as follows:

- We introduce *Transductive Sharpening* (TS), a simple architecture-agnostic loss modification that turns predictions on unlabeled nodes into a direct training signal for transductive node classification.
- We show that TS provides a strong performance-complexity trade-off: it improves standard GNN and MLP baselines across 13 node-classification benchmarks while adding only a single scalar hyperparameter and requiring no architectural changes.
- We study the role of the sharpening coefficient $\lambda$, showing that TS remains effective across a range of positive values, and that a single conservative setting preserves much of the benefit across models and datasets.

## 2 Background and Setup

In this section we provide background material related to our work\.

#### Notation\.

We denote by $\Delta^{C-1}=\{p\in\mathbb{R}^{C}_{\geq 0}:\sum_{i=1}^{C}p_{i}=1\}$ the probability simplex over $C$ classes. Throughout, labels are represented as one-hot vectors $y_{v}\in\{0,1\}^{C}$, where $y_{v,i}$ indicates whether node $v$ belongs to class $i$.

Node classification tasks consist of assigning a label to each node in a graph based on its features and the graph structure\[[21](https://arxiv.org/html/2605.20248#bib.bib32)\]\. In the transductive setting, the full graph and node features are available during training, but labels are observed only for a subset of nodes, and the goal is to predict the rest\.

###### Definition 1 (Transductive node classification).

Let $G=(V,E)$ be a graph with node features $X\in\mathbb{R}^{|V|\times d}$. Each node $v\in V$ has an associated label $y_{v}\in\{0,1\}^{C}$, observed only for a subset $V_{L}\subset V$ (referred to as *labeled nodes*). We denote by $V_{U}:=V\setminus V_{L}$ the remaining nodes (referred to as *unlabeled nodes*). The objective is to learn a model that predicts labels for nodes in $V_{U}$, using the full graph $G$, all node features $X$, and the labels observed on $V_{L}$.

From standard to augmented training objectives. A common approach to transductive node classification is to train a model in a supervised manner on the labeled subset of nodes, and then use it to generate predictions for the unlabeled nodes.

In practice, a model produces, for each node $v\in V$, a logit vector $z_{v}\in\mathbb{R}^{C}$ and a corresponding probability distribution $p_{v}=\mathrm{softmax}(z_{v})\in\Delta^{C-1}$.

Models are trained using cross\-entropy loss applied only to labeled nodes:

$$\mathcal{L}_{\mathrm{sup}}=-\sum_{v\in V_{L}}\sum_{i=1}^{C}y_{v,i}\log p_{v,i}. \tag{1}$$

This objective aligns the model’s predictions with ground-truth labels, but ignores the feature-based outputs generated on unlabeled nodes $V_{U}$, even though these are computed at every training step as the model processes the full graph.
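To make the masking in Eq. (1) concrete, the following is a minimal pure-Python sketch, not the authors' implementation. The helper names (`softmax`, `supervised_loss`) and the toy logits are illustrative assumptions; a real pipeline would compute the same quantity with an automatic-differentiation framework.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one node's logit vector z_v."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def supervised_loss(logits, labels, labeled_idx):
    """Eq. (1): cross-entropy summed over labeled nodes only.

    logits      -- list of per-node logit vectors, one for every node in V
    labels      -- dict mapping node index -> class index (nodes in V_L only)
    labeled_idx -- iterable of node indices in V_L
    """
    loss = 0.0
    for v in labeled_idx:
        p_v = softmax(logits[v])
        # With a one-hot label, the inner sum over classes keeps a single term.
        loss -= math.log(p_v[labels[v]])
    return loss

# Hypothetical toy graph: 4 nodes, 3 classes, nodes 0 and 1 are labeled.
logits = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5], [0.0, 0.0, 3.0]]
labels = {0: 0, 1: 1}
loss = supervised_loss(logits, labels, labeled_idx=[0, 1])
print(f"supervised loss on labeled nodes: {loss:.4f}")
```

In this sketch the logits of nodes 2 and 3 are computed but never enter the loss, which is exactly the discarded signal the text refers to.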

Although the loss cannot be evaluated on unlabeled nodes due to the absence of labels, the model’s predictions produced during training may still contain useful information that could be exploited. This suggests augmenting the objective with an additional term defined over unlabeled nodes, similarly to \[[15](https://arxiv.org/html/2605.20248#bib.bib34)\].

###### Definition 2 (Augmented transductive objective).

Consider a transductive node classification problem with labeled nodes $V_{L}$ and unlabeled nodes $V_{U}$. Let $p_{v}\in\Delta^{C-1}$ denote the predictive distribution for node $v$. An augmented transductive objective is any training objective of the form

$$\mathcal{L}=\mathcal{L}_{\mathrm{sup}}+f\bigl(\{p_{v}\}_{v\in V_{U}}\bigr), \tag{2}$$

where $f:(\Delta^{C-1})^{|V_{U}|}\to\mathbb{R}$ extracts a learning signal from the model’s predictions on the unlabeled nodes $V_{U}$.

Definition [2](https://arxiv.org/html/2605.20248#Thmdefinition2) highlights the flexibility of this framework: different choices of $f$ induce different ways of extracting learning signals from unlabeled predictions. For instance, in a binary classification problem with a known balanced class distribution, $f$ could penalize deviations from a balanced prediction distribution over the unlabeled nodes during training.
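As an illustration of the balanced-class example, one possible choice of $f$ is sketched below. The squared-deviation form and the name `balance_penalty` are assumptions made for this sketch; the paper does not specify a particular penalty.

```python
def balance_penalty(unlabeled_probs):
    """Hypothetical f for binary classification with a known 50/50 class prior.

    unlabeled_probs -- list of [p_class0, p_class1] for the nodes in V_U.
    Returns the squared deviation of the mean class-1 probability from 0.5.
    """
    mean_p1 = sum(p[1] for p in unlabeled_probs) / len(unlabeled_probs)
    return (mean_p1 - 0.5) ** 2

# Predictions that are balanced on average incur (almost) no penalty...
balanced = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.2, 0.8]]
# ...while predictions skewed toward class 1 are penalized.
skewed = [[0.1, 0.9], [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]

print(balance_penalty(balanced), balance_penalty(skewed))
```

Such a penalty only makes sense when the class prior is known in advance.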

While useful, this example depends on information that may not be available in general. The central question, then, is whether we can choose $f$ in a principled, task-agnostic way that applies broadly across transductive node classification problems.

Uncertainty-based learning signals. A natural starting point for designing $f$ is to examine the structure of the supervised loss itself. In particular, cross-entropy admits the following decomposition:

###### Lemma 1 (Cross-entropy decomposition).

For any target distribution $y$ and prediction $p$, cross-entropy loss can be written as

$$\mathcal{L}_{\mathrm{CE}}(y,p)=H(p)+\sum_{i=1}^{C}(p_{i}-y_{i})\log p_{i}, \tag{3}$$

where $H(p)=-\sum_{i}p_{i}\log p_{i}$ denotes the Shannon entropy \[[36](https://arxiv.org/html/2605.20248#bib.bib10)\].

The proof of Lemma [1](https://arxiv.org/html/2605.20248#Thmlemma1) is provided in Appendix [F](https://arxiv.org/html/2605.20248#A6).
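The identity in Lemma 1 can also be checked numerically. The snippet below is a small sanity check rather than a proof; the function names and the example vectors `y` and `p` are chosen only for illustration.

```python
import math

def cross_entropy(y, p):
    """Standard cross-entropy: -sum_i y_i log p_i."""
    return -sum(y_i * math.log(p_i) for y_i, p_i in zip(y, p))

def shannon_entropy(p):
    """Label-independent term: H(p) = -sum_i p_i log p_i."""
    return -sum(p_i * math.log(p_i) for p_i in p)

def alignment_term(y, p):
    """Label-dependent term of Eq. (3): sum_i (p_i - y_i) log p_i."""
    return sum((p_i - y_i) * math.log(p_i) for y_i, p_i in zip(y, p))

y = [0.0, 1.0, 0.0]   # one-hot target
p = [0.2, 0.7, 0.1]   # strictly positive prediction, so every log is finite

lhs = cross_entropy(y, p)
rhs = shannon_entropy(p) + alignment_term(y, p)
print(abs(lhs - rhs) < 1e-12)
```

The two sides agree because the $p_{i}\log p_{i}$ contributions of the entropy and alignment terms cancel, leaving $-\sum_{i}y_{i}\log p_{i}$.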

The second term depends explicitly on the target labels, whereas the entropy term $H(p)$ depends only on the model’s predictions (for a Bayesian perspective on this approach, see \[[15](https://arxiv.org/html/2605.20248#bib.bib34)\]). This separation reveals that part of the supervised objective is inherently label-independent, and can therefore be evaluated on any node.

This suggests a principled class of choices for $f$: functions that depend only on the predictive distribution and capture uncertainty-related properties of the model’s predictions, as minimizing these should also implicitly minimize the original cross-entropy loss.

In this work, we adopt a simple instantiation of this perspective by explicitly controlling prediction confidence across labeled and unlabeled nodes through the training objective\.

## 3 Transductive Sharpening for Graph Learning

In this section, we introduce *Transductive Sharpening* (TS), a loss-level modification for transductive node classification.

### 3.1 Transductive Sharpening Objective

Lemma [1](https://arxiv.org/html/2605.20248#Thmlemma1) motivates a label-free way to use unlabeled nodes: shape the uncertainty of their predictive distributions directly through the objective. We instantiate this principle by adding an uncertainty term over $V_{U}$ to the supervised loss.

A natural choice for the unlabeled\-node term is to encourage low\-uncertainty predictions on unlabeled nodes, without otherwise modifying the supervised objective\. However, naively minimizing uncertainty everywhere can lead to overconfident and poorly calibrated models\. To address this, we introduce a simple symmetric objective that sharpens predictions on unlabeled nodes while counterbalancing this effect on labeled ones\.

###### Definition 3 (Generic Transductive Sharpening objective).

Let $R:\Delta^{C-1}\to\mathbb{R}$ be a function on the probability simplex that measures the uncertainty of a predictive distribution. For a model producing, for each node $v\in V$, a probability vector $p_{v}\in\Delta^{C-1}$, we define the Generic Transductive Sharpening objective by

$$\mathcal{L}_{R}=\mathcal{L}_{\mathrm{sup}}+\lambda\cdot\frac{1}{|V_{U}|}\sum_{v\in V_{U}}R(p_{v})-\lambda\cdot\frac{1}{|V_{L}|}\sum_{v\in V_{L}}R(p_{v}), \tag{4}$$

where $\lambda\in\mathbb{R}$ controls the influence of the sharpening.

This formulation captures two complementary effects. On unlabeled nodes, minimizing uncertainty encourages confident predictions \[[15](https://arxiv.org/html/2605.20248#bib.bib34)\], allowing the model to leverage its own outputs as a learning signal. On labeled nodes, maximizing uncertainty counteracts overconfidence, helping to prevent overfitting to the training data \[[29](https://arxiv.org/html/2605.20248#bib.bib57)\].

### 3.2 Implementing an Uncertainty Function

The formulation introduced in Definition [3](https://arxiv.org/html/2605.20248#Thmdefinition3) depends on the choice of the function $R$, which determines how predictive confidence is shaped during training. A natural candidate for $R$ is Shannon entropy \[[36](https://arxiv.org/html/2605.20248#bib.bib10)\], widely used as a measure of uncertainty in probabilistic models.

However, Shannon’s logarithmic form yields unbounded gradients near the boundary of the probability simplex, which can lead to overly aggressive updates for confident predictions and, in turn, unstable training dynamics and a tendency toward degenerate one\-hot solutions\.

To address these limitations, we consider an alternative based on the Tsallis entropy\.

###### Definition 4 (Tsallis entropy with $q=2$).

The Tsallis entropy \[[41](https://arxiv.org/html/2605.20248#bib.bib9)\] of order $q=2$, also known as Gini impurity, is defined as

$$S_{2}(p)=1-\sum_{i=1}^{C}p_{i}^{2}. \tag{5}$$

Tsallis entropy is a one-parameter generalization of Shannon entropy that recovers Shannon entropy in the limit $q\to 1$. For $q=2$, it preserves the same qualitative behavior, assigning low values to confident predictions and high values to diffuse predictions, but admits a simple quadratic form

$$S_{2}(p)=1-\|p\|_{2}^{2}.$$

Thus, minimizing $S_{2}(p)$ is equivalent to maximizing the squared $\ell_{2}$-norm of the predictive distribution.
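The difference in gradient behavior near the boundary of the simplex can be seen in a small numerical sketch. It takes partial derivatives with respect to the probabilities themselves and ignores both the simplex constraint and the softmax parameterization, which is a simplifying assumption of this illustration; the function names are hypothetical.

```python
import math

def shannon_grad(p):
    """Partial derivatives of H(p) = -sum_i p_i log p_i w.r.t. each p_i."""
    return [-(math.log(p_i) + 1.0) for p_i in p]

def tsallis2_grad(p):
    """Partial derivatives of S_2(p) = 1 - sum_i p_i^2 w.r.t. each p_i."""
    return [-2.0 * p_i for p_i in p]

# Move a two-class prediction toward a one-hot vertex of the simplex.
for eps in (1e-2, 1e-6, 1e-12):
    p = [1.0 - eps, eps]
    max_shannon = max(abs(g) for g in shannon_grad(p))
    max_tsallis = max(abs(g) for g in tsallis2_grad(p))
    print(f"eps={eps:.0e}  max|dH/dp|={max_shannon:.2f}  max|dS2/dp|={max_tsallis:.2f}")
```

The Shannon derivative grows like $-\log\varepsilon$ as a probability approaches zero, whereas the Tsallis derivative is bounded in magnitude by 2 on the simplex.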

We adopt $R(p)=S_{2}(p)$ in the Transductive Sharpening objective because the quadratic form gives gradients that are linear in $p$, leading to stable updates even when predictions are already confident. Substituting $R(p)=S_{2}(p)$ into Definition [3](https://arxiv.org/html/2605.20248#Thmdefinition3) yields the final form of the Transductive Sharpening objective:

###### Definition 5 (Transductive Sharpening Objective).

For a model producing, for each node $v\in V$, a probability vector $p_{v}\in\Delta^{C-1}$, we define the Transductive Sharpening objective by

$$\mathcal{L}_{\mathrm{TS}}=\mathcal{L}_{\mathrm{sup}}+\lambda\cdot\frac{1}{|V_{U}|}\sum_{v\in V_{U}}\left(1-\|p_{v}\|_{2}^{2}\right)-\lambda\cdot\frac{1}{|V_{L}|}\sum_{v\in V_{L}}\left(1-\|p_{v}\|_{2}^{2}\right), \tag{6}$$

where $\lambda\in\mathbb{R}$ controls the strength of the sharpening.

On the Choice of Sharpening Coefficient. The Transductive Sharpening objective introduces a single scalar hyperparameter $\lambda$ that controls the strength of the unlabeled-node confidence signal relative to the supervised objective. In our main experiments (Section [4.1](https://arxiv.org/html/2605.20248#S4.SS1)), we select $\lambda$ by standard validation tuning over a fixed grid and report test performance at the validation-selected value. To assess whether TS depends on precise per-dataset tuning, we also evaluate a universal setting with $\lambda=0.25$ applied across all datasets and architectures (Section [4.2](https://arxiv.org/html/2605.20248#S4.SS2)). Notably, TS requires no changes to the backbone architecture and adds negligible computational overhead to existing GNN training pipelines.
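A minimal pure-Python sketch of the objective in Eq. (6) is given below. It is not taken from the authors' repository: the function names, the dictionary-based label format, and the toy logits are assumptions for illustration. In practice the same expression would be written in an automatic-differentiation framework so that gradients flow to the model parameters.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one node's logit vector."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gini(p):
    """Tsallis entropy of order 2 (Eq. 5): 1 - ||p||_2^2."""
    return 1.0 - sum(p_i * p_i for p_i in p)

def ts_loss(logits, labels, lam):
    """Eq. (6). `labels` maps labeled node index -> class index;
    every node without an entry is treated as unlabeled.
    Assumes both the labeled and the unlabeled set are non-empty."""
    probs = [softmax(z) for z in logits]
    labeled = sorted(labels)
    unlabeled = [v for v in range(len(logits)) if v not in labels]
    sup = -sum(math.log(probs[v][labels[v]]) for v in labeled)
    sharpen_unlabeled = sum(gini(probs[v]) for v in unlabeled) / len(unlabeled)
    counter_labeled = sum(gini(probs[v]) for v in labeled) / len(labeled)
    return sup + lam * sharpen_unlabeled - lam * counter_labeled

# Hypothetical toy example: nodes 0 and 1 are labeled, nodes 2 and 3 are not.
labels = {0: 0, 1: 1}
diffuse = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
sharp = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [5.0, 0.0, 0.0], [0.0, 0.0, 5.0]]
print(ts_loss(diffuse, labels, lam=0.25), ts_loss(sharp, labels, lam=0.25))
```

Setting `lam=0.0` recovers the supervised loss of Eq. (1), and for positive `lam` the loss decreases as the unlabeled-node predictions become more confident while the labeled-node predictions stay fixed.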

## 4 Experiments

In this section, we conduct an extensive set of experiments to demonstrate the effectiveness of TS for node classification in transductive settings\. Our experiments seek to address the following questions:

1. (Q1) Does transductive sharpening consistently improve the performance of GNNs across a broad set of node classification benchmarks?
2. (Q2) Does transductive sharpening consistently improve the performance of MLPs across a broad set of node classification benchmarks, and does it substitute for the effect of message passing?
3. (Q3) How does the choice of sharpening coefficient $\lambda$ affect the performance of TS?
4. (Q4) Can a fixed $\lambda$ perform competitively across architectures and datasets?

#### Baselines\.

To isolate the effect of TS, we consider two classes of baselines: (i) GNNs, using well-tuned implementations of standard message-passing architectures (GCN, GAT, GraphSAGE) following \[[26](https://arxiv.org/html/2605.20248#bib.bib4)\], and (ii) MLPs, which operate on node features alone and do not use graph structure. A full comparison with current competitive methods can be found in Appendix [E](https://arxiv.org/html/2605.20248#A5).

We provide complete details on the experimental settings and datasets in Appendix [A](https://arxiv.org/html/2605.20248#A1), as well as additional experiments in Appendix [B](https://arxiv.org/html/2605.20248#A2).

### 4.1 Node Classification Results

Table 1: Per-cell results, with the baselines shown on the left and the corresponding Baseline+TS (Ours) columns to their right. Each TS cell shows accuracy ± std followed by the $\Delta$ vs. the matching baseline in parentheses. Green when positive, red when negative; bold when $|\Delta|>\sigma$.

| Dataset | MLP | GCN | SAGE | GAT | MLP + TS | GCN + TS | SAGE + TS | GAT + TS |
|---|---|---|---|---|---|---|---|---|
| Cora | 60.96 ± 2.51 | 84.54 ± 0.86 | 83.60 ± 0.58 | 82.40 ± 1.01 | 64.48 ± 2.83 (+3.52) | 85.74 ± 0.54 (+1.20) | 85.28 ± 1.11 (+1.68) | 84.62 ± 0.89 (+2.22) |
| CiteSeer | 56.58 ± 1.14 | 72.68 ± 0.43 | 69.60 ± 0.61 | 71.90 ± 0.25 | 62.72 ± 3.20 (+6.14) | 75.18 ± 0.15 (+2.50) | 74.96 ± 0.24 (+5.36) | 74.84 ± 0.48 (+2.94) |
| PubMed | 68.96 ± 1.16 | 80.70 ± 0.96 | 77.86 ± 1.49 | 79.76 ± 1.21 | 72.30 ± 1.89 (+3.34) | 80.74 ± 0.30 (+0.04) | 79.72 ± 0.69 (+1.86) | 78.84 ± 0.67 (−0.92) |
| Computer | 82.46 ± 0.45 | 94.12 ± 0.08 | 93.25 ± 0.36 | 93.98 ± 0.22 | 82.87 ± 0.56 (+0.41) | 93.98 ± 0.26 (−0.14) | 93.43 ± 0.11 (+0.18) | 93.86 ± 0.13 (−0.12) |
| Photo | 87.57 ± 0.52 | 95.90 ± 0.33 | 96.43 ± 0.27 | 96.69 ± 0.14 | 87.65 ± 0.22 (+0.08) | 96.21 ± 0.11 (+0.31) | 96.51 ± 0.23 (+0.08) | 96.60 ± 0.07 (−0.09) |
| CS | 91.54 ± 0.20 | 95.88 ± 0.03 | 96.29 ± 0.12 | 96.17 ± 0.02 | 91.77 ± 0.38 (+0.23) | 95.89 ± 0.06 (+0.01) | 96.24 ± 0.10 (−0.05) | 96.17 ± 0.02 (+0.00) |
| Physics | 95.97 ± 0.07 | 97.38 ± 0.06 | 97.25 ± 0.08 | 97.26 ± 0.03 | 95.98 ± 0.08 (+0.01) | 97.44 ± 0.14 (+0.06) | 97.23 ± 0.00 (−0.02) | 97.38 ± 0.04 (+0.12) |
| WikiCS | 70.96 ± 1.00 | 79.97 ± 0.43 | 80.71 ± 0.19 | 80.92 ± 0.58 | 72.48 ± 0.83 (+1.52) | 80.31 ± 0.44 (+0.34) | 81.10 ± 0.32 (+0.39) | 81.78 ± 0.22 (+0.86) |
| Squirrel | 39.30 ± 0.79 | 43.75 ± 1.91 | 40.48 ± 2.90 | 41.51 ± 2.34 | 39.39 ± 0.91 (+0.09) | 44.57 ± 2.04 (+0.82) | 41.32 ± 2.35 (+0.84) | 40.36 ± 1.66 (−1.15) |
| Chameleon | 43.86 ± 5.23 | 45.30 ± 2.30 | 44.32 ± 4.55 | 43.07 ± 5.25 | 43.86 ± 5.23 (+0.00) | 45.27 ± 4.74 (−0.03) | 43.32 ± 4.67 (−1.00) | 44.52 ± 3.69 (+1.45) |
| Amazon-Rat. | 48.85 ± 0.55 | 53.64 ± 0.54 | 55.18 ± 0.93 | 55.09 ± 0.19 | 49.50 ± 0.27 (+0.65) | 54.06 ± 0.58 (+0.42) | 56.72 ± 0.35 (+1.54) | 55.73 ± 0.31 (+0.64) |
| Roman-Emp. | 66.10 ± 0.44 | 91.15 ± 0.20 | 90.50 ± 0.21 | 90.49 ± 0.22 | 66.12 ± 0.29 (+0.02) | 91.66 ± 0.20 (+0.51) | 91.27 ± 0.36 (+0.77) | 90.93 ± 0.20 (+0.44) |
| Minesweeper | 51.06 ± 1.76 | 97.26 ± 0.22 | 97.09 ± 1.00 | 97.86 ± 0.37 | 50.97 ± 1.56 (−0.09) | 97.80 ± 0.20 (+0.54) | 97.33 ± 0.94 (+0.24) | 97.86 ± 0.37 (+0.00) |

Table [1](https://arxiv.org/html/2605.20248#S4.T1) reports test accuracy for each dataset–model pair, comparing the supervised baseline with the corresponding TS-augmented model across 13 different datasets. Our key takeaways are as follows:

1. (A1) TS improves standard GNN training. Across GCN, GraphSAGE, and GAT, adding TS generally matches or improves the corresponding supervised baseline. This supports the view that unlabeled-node predictions contain recoverable training signal, as formalized by Definition [2](https://arxiv.org/html/2605.20248#Thmdefinition2), and that uncertainty provides an effective way to recover it, as suggested by Lemma [1](https://arxiv.org/html/2605.20248#Thmlemma1).
2. (A2) TS does not replace message passing. TS also improves MLPs on several datasets, showing that the sharpening signal is not specific to GNN architectures. However, MLP+TS remains below the corresponding GNN performance, indicating that TS does not replace the benefit of message passing.

Overall, these results demonstrate that TS provides a simple and broadly applicable improvement to transductive node classification\.

### 4.2 On the Effect of the Sharpening Coefficient

To compare $\lambda$ values across datasets with different baseline accuracies and variances, we report improvements using Glass’s $\Delta$,

$$\Delta_{\mathrm{Glass}}(\lambda)=\frac{\mathrm{Acc}_{\lambda}-\mathrm{Acc}_{0}}{\sigma_{0}},$$

where $\mathrm{Acc}_{0}$ and $\sigma_{0}$ are the mean and standard deviation of the corresponding $\lambda=0$ supervised baseline. This normalization measures each gain in units of the baseline variability and is preferable to raw accuracy differences in this case because the datasets differ substantially in both difficulty and noise.
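As a worked example of this normalization, the sketch below applies Glass’s $\Delta$ to two entries of Table 1. The function name `glass_delta` is an assumption for illustration, and the example assumes that the ± values reported for the baselines in Table 1 are the baseline standard deviations $\sigma_{0}$.

```python
def glass_delta(acc_lambda, acc_baseline, std_baseline):
    """Glass's Delta: accuracy gain in units of the baseline standard deviation."""
    if std_baseline <= 0:
        raise ValueError("baseline standard deviation must be positive")
    return (acc_lambda - acc_baseline) / std_baseline

# Cora with GCN: baseline 84.54 +/- 0.86, with TS 85.74 (Table 1).
cora_gcn = glass_delta(85.74, 84.54, 0.86)
# PubMed with GAT: baseline 79.76 +/- 1.21, with TS 78.84 (Table 1).
pubmed_gat = glass_delta(78.84, 79.76, 1.21)

print(f"Cora/GCN: {cora_gcn:.2f}  PubMed/GAT: {pubmed_gat:.2f}")
```

Under this assumption, the gain on Cora with GCN exceeds one baseline standard deviation, while the regression on PubMed with GAT stays within one.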

Figure [1](https://arxiv.org/html/2605.20248#S4.F1) aggregates the Glass-normalized gains over the 13 datasets for each GNN backbone. The median curves remain close to or above zero for small positive values of $\lambda$, with the most stable region lying roughly between $\lambda=0$ and $\lambda=0.5$. Beyond this range, the curves gradually deteriorate, and large values become harmful more often.

![Refer to caption](https://arxiv.org/html/2605.20248v1/x1.png)

Figure 1: Glass’s $\Delta$ on test accuracy vs. $\lambda$, where each curve aggregates one GNN backbone over its 13 datasets. The solid line denotes the median and the shaded band denotes the interquartile range. The figure indicates that restricting $\lambda$ to $[0,0.5]$ generally improves performance across the graphs considered. Additional per-$\lambda$ visualizations, including the full distribution of improvements and regressions and the corresponding per-dataset accuracy curves, are provided in Appendix [D](https://arxiv.org/html/2605.20248#A4).

1. (A3) Effect of $\lambda$: Across datasets and backbones, Figure [1](https://arxiv.org/html/2605.20248#S4.F1) shows a broadly consistent relationship between $\lambda$ and accuracy: performance typically improves for small positive values, reaches a plateau or local maximum at moderate sharpening strength, and then degrades when $\lambda$ becomes too large. This behavior matches the intuition behind TS: a mild sharpening signal can help the model exploit reliable unlabeled-node predictions, whereas excessive sharpening may force the model to commit too strongly to incorrect predictions.

The consistency of the pattern observed in Figure [1](https://arxiv.org/html/2605.20248#S4.F1) suggests that the useful range of $\lambda$ is not entirely dataset-specific. We therefore study whether a single coefficient can work reasonably well across datasets and architectures. Based on Figure [1](https://arxiv.org/html/2605.20248#S4.F1), we choose the midpoint of the stable region, $\lambda=0.25$, as a simple universal setting.

Table 2: Per-dataset results for TS at $\lambda=0.25$ (universal). Each cell shows test accuracy ± std followed by the $\Delta$ vs. the matching baseline in parentheses. Green when positive, red when negative; bold when $|\Delta|>\sigma$. We observe that TS, even without tuning $\lambda$ as a hyperparameter, improves the performance of classic baselines on most model-dataset pairs.

| Model | Cora | CiteSeer | PubMed | Computer | Photo | CS | Physics | WikiCS | Squirrel | Chameleon | Amazon-Rat. | Roman-Emp. | Minesweeper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GCN | 85.16 ± 0.36 (+0.62) | 75.10 ± 0.19 (+2.42) | 80.88 ± 0.31 (+0.18) | 93.81 ± 0.22 (−0.31) | 96.32 ± 0.08 (+0.42) | 95.90 ± 0.03 (+0.02) | 97.44 ± 0.14 (+0.06) | 80.17 ± 0.48 (+0.20) | 44.25 ± 2.04 (+0.50) | 45.27 ± 4.74 (−0.03) | 54.11 ± 0.34 (+0.47) | 91.57 ± 0.20 (+0.42) | 97.44 ± 0.17 (+0.18) |
| SAGE | 84.94 ± 0.89 (+1.34) | 72.00 ± 1.36 (+2.40) | 78.40 ± 1.12 (+0.54) | 93.51 ± 0.06 (+0.26) | 96.41 ± 0.41 (−0.02) | 96.20 ± 0.06 (−0.09) | 97.23 ± 0.00 (−0.02) | 80.77 ± 0.14 (+0.06) | 39.77 ± 2.13 (−0.71) | 42.93 ± 5.67 (−1.39) | 55.69 ± 0.23 (+0.51) | 91.21 ± 0.25 (+0.71) | 96.96 ± 0.47 (−0.13) |
| GAT | 83.70 ± 1.19 (+1.30) | 74.80 ± 0.81 (+2.90) | 79.80 ± 0.93 (+0.04) | 93.82 ± 0.13 (−0.16) | 96.47 ± 0.00 (−0.22) | 96.19 ± 0.11 (+0.02) | 97.32 ± 0.06 (+0.06) | 81.03 ± 0.92 (+0.11) | 39.53 ± 2.12 (−1.98) | 43.87 ± 5.20 (+0.80) | 55.59 ± 0.27 (+0.50) | 90.99 ± 0.11 (+0.50) | 97.10 ± 0.69 (−0.76) |

1. (A4) Universal $\lambda$: Table [2](https://arxiv.org/html/2605.20248#S4.T2) shows that despite removing per-dataset tuning, the universal setting preserves much of the benefit of TS: it improves many model-dataset pairs and rarely causes large regressions. This suggests that TS is not overly sensitive to precise coefficient selection, provided that $\lambda$ is chosen in a conservative positive range.

### 4.3 Ablation Studies

We evaluate the impact of key design choices in TS through a series of ablations\.

#### Removing the labeled\-node correction\.

At $\lambda=0.25$, we evaluate a variant that applies the entropy-minimization term to unlabeled nodes while removing the entropy-maximization term on labeled training nodes. Table [3](https://arxiv.org/html/2605.20248#S4.T3) shows that removing this correction generally does not improve performance and can substantially hurt accuracy on several datasets, suggesting that the labeled-node term is important for the stability of TS.

Table 3: Test-accuracy gain of dropping the labeled-side entropy-max term ($\lambda_{L}=0$) over the symmetric default ($\lambda_{L}=-\lambda_{U}$). Green when positive, red when negative; bold when $|\Delta|>\sigma$. Results show that removing the labeled-node correction tends to perform worse than the symmetric TS objective.

| Model | Cora | CiteSeer | PubMed | Computer | Photo | CS | Physics | WikiCS | Squirrel | Chameleon | Amazon-Rat. | Roman-Emp. | Minesweeper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GCN | −3.80 ± 0.97 | −2.96 ± 0.27 | −3.02 ± 0.61 | +0.07 ± 0.31 | −0.13 ± 0.20 | +0.06 ± 0.10 | +0.03 ± 0.16 | −0.10 ± 0.62 | −0.54 ± 3.16 | −1.08 ± 6.54 | +0.01 ± 0.58 | −0.05 ± 0.24 | −0.02 ± 0.29 |
| SAGE | −3.94 ± 1.27 | −1.12 ± 1.74 | +1.68 ± 1.24 | +0.10 ± 0.18 | +0.15 ± 0.54 | +0.05 ± 0.08 | 0 ± 0.01 | −0.31 ± 0.49 | +0.79 ± 3.32 | −0.19 ± 7.46 | −0.33 ± 0.68 | −0.17 ± 0.47 | +0.22 ± 0.54 |
| GAT | −2.74 ± 1.41 | −3.30 ± 1.11 | −1.92 ± 1.36 | +0.16 ± 0.19 | −0.04 ± 0.14 | +0.05 ± 0.15 | 0 ± 0.09 | +0.19 ± 1.22 | −0.02 ± 3.17 | −0.02 ± 6.80 | +0.26 ± 0.43 | −0.08 ± 0.44 | +0.17 ± 0.75 |

#### Symmetry of $\lambda$.

We test whether the symmetric formulation ($\lambda_{u}=-\lambda_{t}$) is essential by introducing an offset in $\{-0.10,-0.05,+0,+0.05,+0.10\}$ while keeping $\lambda_{u}-\lambda_{t}$ fixed. Table [4](https://arxiv.org/html/2605.20248#S4.T4) shows that asymmetric configurations do not yield systematic improvements, and often slightly degrade performance, suggesting that the balance between sharpening unlabeled nodes and regularizing labeled ones is important for stable gains.

Table 4: Best test-accuracy gain from the non-symmetric offset, relative to the symmetric default $\lambda_{u}=-\lambda_{t}$. Green when positive, red when negative; bold when $|\Delta|>\sigma$. This table shows that the symmetric approach outperforms asymmetric variants of the Generic Transductive Sharpening Objective (Definition [3](https://arxiv.org/html/2605.20248#Thmdefinition3)).

| Model | Cora | CiteSeer | PubMed | Computer | Photo | CS | Physics | WikiCS | Squirrel | Chameleon | Amazon-Rat. | Roman-Emp. | Minesweeper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GCN | −0.24 ± 0.49 | −0.10 ± 0.67 | −0.20 ± 0.94 | 0 | −0.26 ± 0.36 | −0.07 ± 0.11 | −0.05 ± 0.18 | −0.25 ± 0.71 | −0.57 ± 2.82 | 0 | −0.03 ± 0.34 | −0.17 ± 0.20 | 0 |
| SAGE | 0 ± 1.35 | 0 | −0.84 ± 0.85 | 0 | 0 | −0.05 ± 0.20 | 0 ± 0.06 | −0.12 ± 0.43 | −1.29 ± 3.10 | 0 | +0.14 ± 0.54 | −0.07 ± 0.21 | 0 |
| GAT | 0 | 0 | −0.26 ± 1.19 | 0 | −0.31 ± 0.08 | −0.08 ± 0.05 | −0.07 ± 0.13 | −0.23 ± 0.24 | −0.69 ± 2.73 | −1.07 ± 6.04 | −0.28 ± 0.35 | +0.08 ± 0.43 | 0 |

#### Choice of entropy\.

We compare Tsallis entropy \(q=2q=2\) with Shannon entropy \(q=1q=1\) at a fixedλ=0\.25\\lambda=0\.25\. While average accuracy remains similar, Shannon entropy leads to higher variance and instability in some cases \(see GAT\+\+Roman\-Empire cell\)\. This confirms that the quadratic form of Tsallis entropy provides more stable optimization dynamics and is a safer default\.

Table 5: Test-accuracy gain of Shannon entropy ($q=1$) over Gini ($q=2$) at $\lambda = 0.25$. Green when positive, red when negative; bold when $|\Delta| > \sigma$. These results show that the choice of entropy generally does not affect the reported accuracy, which motivates the choice of Tsallis entropy for its simplicity and its better differentiability properties.

| Model | Cora | CiteSeer | PubMed | Computer | Photo | CS | Physics | WikiCS | Squirrel | Chameleon | Amazon-Rat. | Roman-Emp. | Minesweeper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GCN | −0.14 ± 0.77 | −0.12 ± 0.38 | −0.16 ± 0.57 | +0.21 ± 0.24 | −0.15 ± 0.13 | 0 ± 0.09 | +0.01 ± 0.20 | −0.02 ± 0.66 | −0.40 ± 2.92 | −1.61 ± 6.00 | −0.11 ± 0.54 | −0.02 ± 0.36 | +0.04 ± 0.29 |
| SAGE | −0.42 ± 1.35 | −2.52 ± 1.93 | +0.54 ± 1.42 | −0.06 ± 0.12 | +0.13 ± 0.48 | −0.08 ± 0.17 | +0.06 ± 0.05 | +0.13 ± 0.31 | +0.02 ± 3.00 | +0.63 ± 7.31 | +0.21 ± 0.59 | −0.11 ± 0.33 | — |
| GAT | +0.74 ± 1.27 | −0.68 ± 0.92 | +0.26 ± 1.24 | +0.06 ± 0.31 | +0.07 ± 0.07 | −0.08 ± 0.14 | +0.03 ± 0.08 | −0.01 ± 1.07 | +0.90 ± 2.58 | −0.81 ± 6.97 | −0.27 ± 0.45 | −26.21 ± 44.01 | +0.29 ± 0.78 |
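To make the comparison concrete, the sketch below evaluates both entropies and their partial derivatives with respect to the class probabilities. It assumes the standard Tsallis form $S_q(p) = (1 - \sum_i p_i^q)/(q-1)$, which reduces to $1 - \sum_i p_i^2$ at $q=2$ and approaches Shannon entropy as $q \to 1$; the normalization used in this work's own definition may differ. The function names are illustrative, and the derivatives treat each $p_i$ as a free variable, ignoring the simplex constraint.

```python
import math


def tsallis_entropy(p, q):
    """Tsallis entropy S_q(p) = (1 - sum_i p_i^q) / (q - 1), defined for q != 1."""
    return (1.0 - sum(x ** q for x in p)) / (q - 1.0)


def shannon_entropy(p):
    """Shannon entropy in nats: the q -> 1 limit of the Tsallis family."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def gini_grad(p):
    """Partial derivatives of the q = 2 entropy: -2 * p_i, bounded on [0, 1]."""
    return [-2.0 * x for x in p]


def shannon_grad(p):
    """Partial derivatives of Shannon entropy: -(log p_i + 1), unbounded as p_i -> 0."""
    return [-(math.log(x) + 1.0) for x in p]
```

The bounded derivative of the quadratic form is one plausible reading of the "better differentiability properties" mentioned in the caption: the Shannon derivative grows without bound as a class probability approaches zero, whereas the $q=2$ derivative stays within $[-2, 0]$.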

#### Meta\-learned sharpening coefficient\.

We consider a dynamic variant in which the sharpening coefficient is adapted during training. At each epoch, we randomly split the labeled training nodes into an inner-training subset and a held-out meta-training subset. The model parameters are first updated differentiably on the inner-training subset using the transductive sharpening objective. We then update $\lambda$ so that the one-step-updated model minimizes the supervised loss on the held-out meta-training subset. As shown in Table [6](https://arxiv.org/html/2605.20248#S4.T6), this adaptive strategy performs worse than the simpler validation-selected constant $\lambda$, suggesting that the additional meta-optimization introduces instability or noise that is not offset by better coefficient selection.
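In symbols, one epoch of this procedure can be sketched as the following pair of updates. The notation is illustrative rather than defined elsewhere in this work: $\theta$ denotes the model parameters, $S_{\mathrm{in}}$ and $S_{\mathrm{meta}}$ the inner-training and held-out meta-training subsets, $\mathcal{L}_{\mathrm{TS}}$ the transductive sharpening objective, $\mathcal{L}_{\mathrm{sup}}$ the supervised loss, and $\eta$, $\eta_{\mathrm{meta}}$ the two step sizes. Plain gradient steps are assumed for readability; the optimizer actually used may differ.

$$\theta'(\lambda) = \theta - \eta\, \nabla_{\theta}\, \mathcal{L}_{\mathrm{TS}}(\theta; \lambda, S_{\mathrm{in}}), \qquad \lambda \leftarrow \lambda - \eta_{\mathrm{meta}}\, \nabla_{\lambda}\, \mathcal{L}_{\mathrm{sup}}\big(\theta'(\lambda); S_{\mathrm{meta}}\big).$$

The second update differentiates through the first, which is why the inner step must be differentiable with respect to $\lambda$.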

Table 6: Test-accuracy difference between the meta-learned-$\lambda$ variant and the constant-$\lambda$ transductive sharpening baseline. Negative values indicate that adapting $\lambda$ during training performs worse than selecting a fixed value by validation. Green when positive, red when negative; bold when $|\Delta| > \sigma$. Meta-learning $\lambda$ shows no advantage over using a fixed value.

| Model | Cora | CiteSeer | PubMed | Computer | Photo | CS | Physics | WikiCS | Squirrel | Chameleon | Amazon-Rat. | Roman-Emp. | Minesweeper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GCN | −3.72 ± 0.80 | −3.40 ± 0.33 | −2.18 ± 0.83 | −0.19 ± 0.27 | −1.35 ± 0.76 | −0.06 ± 0.11 | 0 ± 0.14 | −0.23 ± 0.72 | +0.16 ± 2.83 | −0.25 ± 5.82 | −0.57 ± 0.88 | −0.50 ± 0.27 | −0.38 ± 0.48 |
| SAGE | −4.68 ± 1.59 | −5.36 ± 0.94 | +0.16 ± 0.98 | −0.39 ± 0.20 | −0.32 ± 0.42 | +0.07 ± 0.15 | −0.03 ± 0.05 | −0.63 ± 0.70 | +0.40 ± 3.11 | −1.05 ± 8.67 | −1.84 ± 0.50 | −0.53 ± 0.39 | +0.49 ± 1.07 |
| GAT | −4.22 ± 1.40 | −3.88 ± 1.32 | −0.36 ± 1.57 | 0 ± 0.46 | −0.12 ± 0.23 | +0.13 ± 0.07 | −0.15 ± 0.05 | −1.13 ± 0.61 | +1.57 ± 2.19 | −1.05 ± 5.47 | −0.34 ± 0.48 | −0.43 ± 0.41 | −1.00 ± 0.54 |

## 5 A Mechanistic Analysis of Transductive Sharpening

In this section, we examine the training dynamics induced by Transductive Sharpening\. We focus on a measurable effect of the objective: the evolution of predictive entropy on labeled and unlabeled nodes during training\.

The central observation is that TS changes how confidence is allocated across the graph: it encourages the model to become more confident on unlabeled nodes, while the labeled\-node correction prevents confidence from concentrating only on the supervised subset\.

To measure this effect, we track the average predictive entropy on labeled and unlabeled nodes,

$$H_L(t) = \frac{1}{|V_L|} \sum_{v \in V_L} H(p_v(t)), \qquad H_U(t) = \frac{1}{|V_U|} \sum_{v \in V_U} H(p_v(t)),$$

as well as their difference,

$$\Delta H(t) = H_L(t) - H_U(t).$$

Larger values of $\Delta H(t)$ indicate that the model is relatively more confident on unlabeled nodes than on labeled nodes.
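These diagnostics are simple to compute from the per-node predictive distributions at a given epoch. The sketch below is a minimal illustration, assuming the predictions are available as lists of class probabilities and the labeled set as a boolean mask; the function name `entropy_gap` and the default choice of the $q=2$ entropy for $H$ are assumptions, and any other entropy function can be passed in.

```python
def gini_entropy(p):
    """Tsallis entropy with q = 2 for one probability vector: 1 - sum_i p_i^2."""
    return 1.0 - sum(x * x for x in p)


def entropy_gap(probs, labeled_mask, entropy=gini_entropy):
    """Return (H_L, H_U, Delta H) for one training step.

    probs        : list of per-node class-probability lists, p_v(t)
    labeled_mask : list of booleans, True where the node is in V_L
    Assumes both the labeled and the unlabeled set are non-empty.
    """
    h_labeled = [entropy(p) for p, m in zip(probs, labeled_mask) if m]
    h_unlabeled = [entropy(p) for p, m in zip(probs, labeled_mask) if not m]
    h_l = sum(h_labeled) / len(h_labeled)      # average entropy on V_L
    h_u = sum(h_unlabeled) / len(h_unlabeled)  # average entropy on V_U
    return h_l, h_u, h_l - h_u
```

Calling `entropy_gap` once per epoch and storing the third return value yields the $\Delta H(t)$ trajectory.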

![Figure 2](https://arxiv.org/html/2605.20248v1/x2.png)

Figure 2: Entropy dynamics during training for the supervised baseline (grey) and TS (blue). TS lowers the entropy of unlabeled-node predictions relative to labeled-node predictions, producing a larger entropy gap $\Delta H(t)$. This implies that TS is distributing confidence to the unlabeled nodes as intended by its design.

Figure [2](https://arxiv.org/html/2605.20248#S5.F2) shows that TS reallocates confidence toward the unlabeled portion of the observed graph. Across the training trajectory, the entropy gap for $\lambda = 0.25$ remains consistently above that of the supervised baseline ($\lambda = 0$), indicating that TS lowers the entropy of unlabeled-node predictions relative to labeled-node predictions.

This behavior suggests a reason why the effect of TS depends on the choice of $\lambda$ in Section [4.2](https://arxiv.org/html/2605.20248#S4.SS2). Moderate sharpening can improve the learned decision boundary by turning reliable unlabeled predictions into an optimization signal, whereas overly large $\lambda$ can force the model to commit too strongly to its own predictions.

## 6 Related Work

Graph Neural Networks. Graph neural networks (GNNs) have become the standard approach for node classification \[[25](https://arxiv.org/html/2605.20248#bib.bib56)\], with architectures such as GCN \[[21](https://arxiv.org/html/2605.20248#bib.bib32)\], GAT \[[42](https://arxiv.org/html/2605.20248#bib.bib30)\], and GraphSAGE \[[17](https://arxiv.org/html/2605.20248#bib.bib27)\]. Recent work has shown that well-tuned implementations of these models remain highly competitive \[[26](https://arxiv.org/html/2605.20248#bib.bib4)\], suggesting that performance gains are not solely driven by architectural innovation.

Improved training of GNNs. A line of work has explored improving GNN performance by designing different training policies and losses. Common approaches include dropout-based methods \[[39](https://arxiv.org/html/2605.20248#bib.bib22)\], structural perturbations such as DropEdge \[[35](https://arxiv.org/html/2605.20248#bib.bib21)\] and DropNode \[[9](https://arxiv.org/html/2605.20248#bib.bib20)\], normalization techniques like PairNorm \[[54](https://arxiv.org/html/2605.20248#bib.bib19)\], and data augmentation strategies including Mixup \[[52](https://arxiv.org/html/2605.20248#bib.bib14),[43](https://arxiv.org/html/2605.20248#bib.bib18)\] and G-Mixup \[[18](https://arxiv.org/html/2605.20248#bib.bib16)\]. While these methods improve GNN training, they do not explicitly leverage predictions on unlabeled nodes during training.

Leveraging unlabeled data in GNN training. Recent methods explicitly incorporate both labeled and unlabeled data to improve GNN training. InfoGraph \[[40](https://arxiv.org/html/2605.20248#bib.bib15)\] learns representations by maximizing mutual information between local and global views of the graph. Related approaches leverage contrastive learning to exploit unlabeled data \[[50](https://arxiv.org/html/2605.20248#bib.bib13)\], while others rely on augmentation-based regularization such as consistency and diversity objectives \[[2](https://arxiv.org/html/2605.20248#bib.bib12)\]. Mixup-based methods incorporate unlabeled data through interpolation strategies \[[46](https://arxiv.org/html/2605.20248#bib.bib11),[44](https://arxiv.org/html/2605.20248#bib.bib17)\], whereas Eliasof et al. \[[10](https://arxiv.org/html/2605.20248#bib.bib1)\] incorporate unlabeled nodes through information-theoretic objectives and additional regularization terms. Our approach differs from these methods in two key aspects. First, rather than introducing auxiliary objectives or augmentations, we operate directly on the model’s predictive distribution. Second, our method requires no architectural changes and integrates seamlessly into standard training.

Entropy and confidence-based objectives. Entropy plays a central role in many areas of machine learning as a measure of uncertainty. In semi-supervised learning, entropy minimization \[[15](https://arxiv.org/html/2605.20248#bib.bib34)\] has been used to encourage confident predictions on unlabeled data. Another approach widely used to prevent over-confidence in training is label smoothing. However, in graph neural networks, standard training objectives do not explicitly leverage entropy on unlabeled nodes, despite predictions being available for the entire graph. Our work revisits this gap and shows that directly shaping prediction entropy provides a simple and effective improvement.

## 7 Conclusions and Future Work

In this work, we introduced Transductive Sharpening \(TS\), a simple loss\-level modification that exploits the predictions models already produce on unlabeled nodes during transductive graph learning\. TS adds only a single scalar hyperparameter, requires no architectural changes, and consistently improves performance across architectures and benchmarks\. These results show that TS offers a strong performance\-complexity trade\-off by delivering meaningful gains while requiring no substantive changes to the model architecture, data pipeline, or training procedure\. More broadly, our findings suggest that unlabeled predictions, which are typically discarded by the standard supervised objective, provide a useful and underexploited signal for transductive graph learning, and that confidence\-based estimates offer an effective way to extract it\.

#### Future work\.

This work focuses on transductive node classification with standard GNN and MLP backbones, where Transductive Sharpening can be applied as a lightweight loss\-level modification\. While our experiments show consistent gains across a broad set of benchmarks, an immediate extension is to study whether the same objective remains effective in closely related transductive settings, such as link prediction or temporal node classification\. Another useful direction is to make the sharpening strength mildly adaptive, for example by varying it across training stages or by using simple confidence\-based criteria\. We hope this work inspires new transductive graph\-learning objectives that leverage unlabeled predictions produced during training to obtain gains in downstream tasks without adding architectural complexity\.

## Acknowledgments and Disclosure of Funding

The authors thank Petar Veličković for helpful discussions and feedback\. Mar Gonzàlez I Català acknowledges that this project was supported by G\-Research\. Ferran Hernandez Caralt acknowledges that the project that gave rise to these results received the support of a fellowship from “la Caixa” Foundation \(ID 100010434\)\. The fellowship code is LCF/BQ/PFA25/11000012\.

## References

- \[1\] J. Bamberger, F. Barbero, X. Dong, and M. M. Bronstein (2025). Bundle neural networks for message diffusion on graphs. In The Thirteenth International Conference on Learning Representations (ICLR). arXiv:2405.15540.
- \[2\] D. Bo, B. Hu, X. Wang, Z. Zhang, C. Shi, and J. Zhou (Jun. 2022). Regularizing graph neural networks via consistency-diversity graph augmentations. Proceedings of the AAAI Conference on Artificial Intelligence 36(4), pp. 3913–3921. [Document](https://dx.doi.org/10.1609/aaai.v36i4.20307).
- \[3\] M. M. Bronstein, J. Bruna, T. Cohen, and P. Veličković (2021). Geometric deep learning: grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478.
- \[4\] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun (2014). Spectral networks and locally connected networks on graphs. arXiv:1312.6203.
- \[5\] J. Chen, K. Gao, G. Li, and K. He (2023). NAGphormer: a tokenized graph transformer for node classification in large graphs. In International Conference on Learning Representations. arXiv:2206.04910.
- \[6\] E. Chien, J. Peng, P. Li, and O. Milenkovic (2021). Adaptive universal generalized PageRank graph neural network. In International Conference on Learning Representations. arXiv:2006.07988.
- \[7\] M. Defferrard, X. Bresson, and P. Vandergheynst (2017). Convolutional neural networks on graphs with fast localized spectral filtering. arXiv:1606.09375.
- \[8\] C. Deng, Z. Yue, and Z. Zhang (2024). Polynormer: polynomial-expressive graph transformer in linear time. In The Twelfth International Conference on Learning Representations (ICLR). [Link](https://openreview.net/forum?id=hmv1LpNfXa), arXiv:2403.01232.
- \[9\] T. H. Do, D. M. Nguyen, G. Bekoulis, A. Munteanu, and N. Deligiannis (Jul. 2021). Graph convolutional neural networks with node transition probability-based message passing and dropnode regularization. Expert Systems with Applications 174, pp. 114711. ISSN 0957-4174, [Document](https://dx.doi.org/10.1016/j.eswa.2021.114711).
- \[10\] M. Eliasof, E. Haber, and E. Treister (2022). Every node counts: improving the training of graph neural networks on node classification. arXiv:2211.16631.
- \[11\] M. Fey and J. E. Lenssen (2019). Fast graph representation learning with pytorch geometric. arXiv:1903.02428.
- \[12\] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017). Neural message passing for quantum chemistry. arXiv:1704.01212.
- \[13\] C. Goller and A. Kuchler (1996). Learning task-dependent distributed representations by backpropagation through structure. In Proceedings of International Conference on Neural Networks (ICNN’96), Vol. 1, pp. 347–352. [Document](https://dx.doi.org/10.1109/ICNN.1996.548916).
- \[14\] M. Gori, G. Monfardini, and F. Scarselli (2005). A new model for learning in graph domains. In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, Vol. 2, pp. 729–734. [Document](https://dx.doi.org/10.1109/IJCNN.2005.1555942).
- \[15\] Y. Grandvalet and Y. Bengio (2004). Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems, Vol. 17. [Link](https://proceedings.neurips.cc/paper/2004/hash/96f2b50b5d3613adf9c27049b2a888c7-Abstract.html).
- \[16\] A. Gupta, G. Waghmare, G. Oberoi, and N. Srivastava (2025). Flow matters: directional and expressive gnns for heterophilic graphs. arXiv:2509.00772.
- \[17\] W. L. Hamilton, R. Ying, and J. Leskovec (2018). Inductive representation learning on large graphs. arXiv:1706.02216.
- \[18\] X. Han, Z. Jiang, N. Liu, and X. Hu (2022). G-mixup: graph data augmentation for graph classification. arXiv:2202.07179.
- \[19\] Z. Hu, K. Li, H. Fan, and Y. Yang (2026). GraphTARIF: linear graph transformer with augmented rank and improved focus. Note: Accepted to The Web Conference (WWW) 2026. arXiv:2510.10631.
- \[20\] J. Huang, Y. Mo, X. Shi, L. Feng, and X. Zhu (2025). Enhancing the influence of labels on unlabeled nodes in graph convolutional networks. In Proceedings of the 42nd International Conference on Machine Learning (ICML). arXiv:2411.02279.
- \[21\] T. N. Kipf and M. Welling (2017). Semi-supervised classification with graph convolutional networks. arXiv:1609.02907.
- \[22\] K. Kong, J. Chen, J. Kirchenbauer, R. Ni, C. B. Bruss, and T. Goldstein (2023). GOAT: a global transformer on large-scale graphs. In International Conference on Machine Learning. [Link](https://proceedings.mlr.press/v202/kong23a.html).
- \[23\] X. Li, R. Zhu, Y. Cheng, C. Shan, S. Luo, D. Li, and W. Qian (2022). Finding global homophily in graph neural networks when meeting heterophily. In International Conference on Machine Learning. arXiv:2205.07308.
- \[24\] S. Luan, C. Hua, M. Xu, Q. Lu, J. Zhu, X. Chang, J. Fu, J. Leskovec, and D. Precup (2023). When do graph neural networks help with node classification? investigating the homophily principle on node distinguishability. Advances in Neural Information Processing Systems 36, pp. 28748–28760.
- \[25\] Y. Luo, L. Shi, and X. Wu (2024). Classic gnns are strong baselines: reassessing gnns for node classification. Advances in Neural Information Processing Systems 37, pp. 97650–97669.
- \[26\] Y. Luo, L. Shi, and X. Wu (2024). Classic GNNs are strong baselines: reassessing GNNs for node classification. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track. [Link](https://openreview.net/forum?id=xkljKdGe4E), [Document](https://dx.doi.org/10.52202/079017-3098).
- \[27\] S. K. Maurya, X. Liu, and T. Murata (2022). Simplifying approach to node classification in graph neural networks. Journal of Computational Science 62, pp. 101695. [Document](https://dx.doi.org/10.1016/j.jocs.2022.101695), arXiv:2111.06748.
- \[28\] P. Mernyei and C. Cangea (2020). Wiki-CS: a Wikipedia-based benchmark for graph neural networks. arXiv:2007.02901.
- \[29\] R. Müller, S. Kornblith, and G. E. Hinton (2019). When does label smoothing help? Advances in Neural Information Processing Systems 32.
- \[30\] S. H. Pahng and S. Hormoz (2025). Improving graph neural networks by learning continuous edge directions. In The Thirteenth International Conference on Learning Representations (ICLR). arXiv:2410.14109.
- \[31\] M. Park, J. Heo, and D. Kim (21–27 Jul. 2024). Mitigating oversmoothing through reverse process of GNNs for heterophilic graphs. In Proceedings of the 41st International Conference on Machine Learning, R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp (Eds.), Proceedings of Machine Learning Research, Vol. 235, pp. 39667–39681. [Link](https://proceedings.mlr.press/v235/park24d.html).
- \[32\] H. Pei, B. Wei, K. C. Chang, Y. Lei, and B. Yang (2020). Geom-GCN: geometric graph convolutional networks. In International Conference on Learning Representations. arXiv:2002.05287.
- \[33\] O. Platonov, D. Kuznedelev, M. Diskin, A. Babenko, and L. Prokhorenkova (2023). A critical look at the evaluation of GNNs under heterophily: are we really making progress? arXiv preprint arXiv:2302.11640. [Document](https://dx.doi.org/10.48550/arXiv.2302.11640).
- \[34\] L. Rampášek, M. Galkin, V. P. Dwivedi, A. T. Luu, G. Wolf, and D. Beaini (2022). Recipe for a general, powerful, scalable graph transformer. In Advances in Neural Information Processing Systems. arXiv:2205.12454.
- \[35\] Y. Rong, W. Huang, T. Xu, and J. Huang (2020). DropEdge: towards deep graph convolutional networks on node classification. arXiv:1907.10903.
- \[36\] C. E. Shannon (1948). A mathematical theory of communication. Bell System Technical Journal 27(3), pp. 379–423. [Document](https://dx.doi.org/10.1002/j.1538-7305.1948.tb01338.x).
- \[37\] O. Shchur, M. Mumme, A. Bojchevski, and S. Günnemann (2018). Pitfalls of graph neural network evaluation. arXiv:1811.05868.
- \[38\] H. Shirzad, A. Velingker, B. Venkatachalam, D. J. Sutherland, and A. K. Sinop (2023). Exphormer: sparse transformers for graphs. In International Conference on Machine Learning. arXiv:2303.06147.
- \[39\] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(56), pp. 1929–1958. [Link](http://jmlr.org/papers/v15/srivastava14a.html).
- \[40\] F. Sun, J. Hoffmann, V. Verma, and J. Tang (2020). InfoGraph: unsupervised and semi-supervised graph-level representation learning via mutual information maximization. arXiv:1908.01000.
- \[41\] C. Tsallis (1988). Possible Generalization of Boltzmann-Gibbs Statistics. J. Statist. Phys. 52, pp. 479–487. [Document](https://dx.doi.org/10.1007/BF01016429).
- \[42\] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018). Graph attention networks. arXiv:1710.10903.
- \[43\] V. Verma, A. Lamb, C. Beckham, A. Najafi, I. Mitliagkas, A. Courville, D. Lopez-Paz, and Y. Bengio (2019). Manifold mixup: better representations by interpolating hidden states. arXiv:1806.05236.
- \[44\]Cited by:[§6](https://arxiv.org/html/2605.20248#S6.p3.1)\.
- \[45\] M. Wang, D. Zheng, Z. Ye, Q. Gan, M. Li, X. Song, J. Zhou, C. Ma, L. Yu, Y. Gai, T. Xiao, T. He, G. Karypis, J. Li, and Z. Zhang (2020). Deep graph library: a graph-centric, highly-performant package for graph neural networks. arXiv:1909.01315.
- \[46\] Y. Wang, W. Wang, Y. Liang, Y. Cai, and B. Hooi (2021). Mixup for node and graph classification. In Proceedings of the Web Conference 2021, WWW ’21, New York, NY, USA, pp. 3663–3674. ISBN 9781450383127, [Document](https://dx.doi.org/10.1145/3442381.3449796).
- \[47\] Q. Wu, W. Zhao, Z. Li, D. Wipf, and J. Yan (2022). NodeFormer: a scalable graph structure learning transformer for node classification. In Advances in Neural Information Processing Systems. [Link](https://openreview.net/forum?id=sMezXGG5So).
- \[48\] Q. Wu, W. Zhao, C. Yang, H. Zhang, F. Nie, H. Jiang, Y. Bian, and J. Yan (2023). SGFormer: simplifying and empowering transformers for large-graph representations. In Advances in Neural Information Processing Systems. arXiv:2306.10759.
- \[49\] Y. Xing, X. Wang, B. Wu, H. Huang, and C. Shi (2025). Unifying and enhancing graph transformers via a hierarchical mask framework. In Advances in Neural Information Processing Systems (NeurIPS). arXiv:2510.18825.
- \[50\] Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, and Y. Shen (2021). Graph contrastive learning with augmentations. arXiv:2010.13902.
- \[51\] B. Zhang, M. Chen, J. Song, S. Li, J. Zhang, and C. Wang (2025). Normalize then propagate: efficient homophilous regularization for few-shot semi-supervised node classification. In Proceedings of the AAAI Conference on Artificial Intelligence. arXiv:2501.08581.
- \[52\] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2018). Mixup: beyond empirical risk minimization. arXiv:1710.09412.
- \[53\] Y. Zhang, X. Li, Y. Xu, X. Xu, and Z. Wang (2025). A graph transformer with optimized attention scores for node classification. Scientific Reports 15(1), pp. 30015. [Document](https://dx.doi.org/10.1038/s41598-025-15551-2), [Link](https://www.nature.com/articles/s41598-025-15551-2).
- \[54\] L. Zhao and L. Akoglu (2020). PairNorm: tackling oversmoothing in gnns. arXiv:1909.12223.
- \[55\] J. Zhu, R. A. Rossi, A. Rao, T. Mai, N. Lipka, N. K. Ahmed, and D. Koutra (2021). Graph neural networks with heterophily. In Proceedings of the AAAI Conference on Artificial Intelligence. arXiv:2009.13566.
- \[56\] J. Zhu, Y. Yan, L. Zhao, M. Heimann, L. Akoglu, and D. Koutra (2020). Beyond homophily in graph neural networks: current limitations and effective designs. In Advances in Neural Information Processing Systems. arXiv:2006.11468.
- \[57\] J. Zhuo, Y. Liu, Y. Lu, Z. Ma, K. Fu, C. Wang, Y. Guo, Z. Wang, X. Cao, and L. Yang (2025). DUALFormer: dual graph transformer. In The Thirteenth International Conference on Learning Representations (ICLR). [Link](https://openreview.net/forum?id=4v4RcAODj9).

## Appendix A Datasets and Experimental Details

### A.1 Computing Environment

Our implementation is built upon tunedGNN \[[26](https://arxiv.org/html/2605.20248#bib.bib4)\], which is based on PyG \[[11](https://arxiv.org/html/2605.20248#bib.bib25)\] and DGL \[[45](https://arxiv.org/html/2605.20248#bib.bib24)\]. The experiments were conducted on a single workstation with an RTX 5090. With 3 datasets running in parallel, total compute time to run all 39 (model, dataset) pairs averaged 1 h 7 min.

### A.2 Declaration of Hyperparameters

#### TS variants\.

All TS variants inherit their per-cell backbone hyperparameters (depth, width, learning rate, dropout, normalisation, residual connections, training epochs) from TunedGNN \[[26](https://arxiv.org/html/2605.20248#bib.bib4)\], Tables 8-10, reproduced here as Tables [7](https://arxiv.org/html/2605.20248#A1.T7), [8](https://arxiv.org/html/2605.20248#A1.T8), and [9](https://arxiv.org/html/2605.20248#A1.T9). The best $\lambda$ we found is reported in the $\lambda^{\star}$ column.

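Table 7: TunedGNN hyperparameters for GCN.

| Dataset | ResNet | Normalization | Dropout rate | GNN layers $L$ | Hidden dim | LR | Epochs | $\lambda^{\star}$ |
|---|---|---|---|---|---|---|---|---|
| Cora | False | False | 0.7 | 3 | 512 | 0.001 | 500 | 1.35 |
| Citeseer | False | False | 0.5 | 2 | 512 | 0.001 | 500 | 0.15 |
| Pubmed | False | False | 0.7 | 2 | 256 | 0.005 | 500 | 0.3 |
| Computer | False | LN | 0.5 | 3 | 512 | 0.001 | 1000 | 0.65 |
| Photo | True | LN | 0.5 | 6 | 256 | 0.001 | 1000 | 0.35 |
| CS | True | LN | 0.3 | 2 | 512 | 0.001 | 1500 | 0.6 |
| Physics | True | LN | 0.3 | 2 | 64 | 0.001 | 1500 | 0.25 |
| WikiCS | False | LN | 0.5 | 3 | 256 | 0.001 | 1000 | 0.8 |
| Squirrel | True | BN | 0.7 | 4 | 256 | 0.01 | 500 | 0.45 |
| Chameleon | False | False | 0.2 | 5 | 512 | 0.005 | 200 | 0.25 |
| Amazon-Ratings | True | BN | 0.5 | 4 | 512 | 0.001 | 2500 | 0.5 |
| Roman-Empire | True | BN | 0.5 | 9 | 512 | 0.001 | 2500 | 0.3 |
| Minesweeper | True | BN | 0.2 | 12 | 64 | 0.01 | 2000 | 0.1 |

Table 8: TunedGNN hyperparameters for GAT.

| Dataset | ResNet | Normalization | Dropout rate | GNN layers $L$ | Hidden dim | LR | Epochs | $\lambda^{\star}$ |
|---|---|---|---|---|---|---|---|---|
| Cora | True | False | 0.2 | 3 | 512 | 0.001 | 500 | 1.25 |
| Citeseer | True | False | 0.5 | 3 | 256 | 0.001 | 500 | 0.1 |
| Pubmed | False | False | 0.5 | 2 | 512 | 0.01 | 500 | 1.55 |
| Computer | False | LN | 0.5 | 2 | 64 | 0.001 | 1000 | 0.1 |
| Photo | True | LN | 0.5 | 3 | 64 | 0.001 | 1000 | 0.4 |
| CS | True | LN | 0.3 | 1 | 256 | 0.001 | 1500 | 0 |
| Physics | True | BN | 0.7 | 2 | 256 | 0.001 | 1500 | 0.5 |
| WikiCS | True | LN | 0.7 | 2 | 512 | 0.001 | 1000 | 0.9 |
| Squirrel | True | BN | 0.5 | 7 | 512 | 0.005 | 500 | 1 |
| Chameleon | True | BN | 0.7 | 2 | 256 | 0.01 | 200 | 0.3 |
| Amazon-Ratings | True | BN | 0.5 | 4 | 512 | 0.001 | 2500 | 0.3 |
| Roman-Empire | True | BN | 0.3 | 10 | 512 | 0.001 | 2500 | 0.15 |
| Minesweeper | True | BN | 0.2 | 15 | 64 | 0.01 | 2000 | 0 |

Table 9: TunedGNN hyperparameters for SAGE.

| Dataset | ResNet | Normalization | Dropout rate | GNN layers $L$ | Hidden dim | LR | Epochs | $\lambda^{\star}$ |
|---|---|---|---|---|---|---|---|---|
| Cora | False | False | 0.7 | 3 | 256 | 0.001 | 500 | 0.65 |
| Citeseer | False | False | 0.2 | 3 | 512 | 0.001 | 500 | 0.05 |
| Pubmed | False | False | 0.7 | 4 | 512 | 0.005 | 500 | 1.65 |
| Computer | False | LN | 0.3 | 4 | 64 | 0.001 | 1000 | 0.4 |
| Photo | True | LN | 0.2 | 6 | 64 | 0.001 | 1000 | 0.45 |
| CS | True | LN | 0.5 | 2 | 512 | 0.001 | 1500 | 0.05 |
| Physics | True | BN | 0.7 | 2 | 64 | 0.001 | 1500 | 0.25 |
| WikiCS | False | LN | 0.7 | 2 | 256 | 0.001 | 1000 | 1.55 |
| Squirrel | True | BN | 0.7 | 3 | 256 | 0.01 | 500 | 0.75 |
| Chameleon | True | BN | 0.7 | 4 | 256 | 0.01 | 200 | 0.15 |
| Amazon-Ratings | True | BN | 0.5 | 9 | 512 | 0.001 | 2500 | 0.9 |
| Roman-Empire | False | BN | 0.3 | 9 | 256 | 0.001 | 2500 | 0.35 |
| Minesweeper | True | BN | 0.2 | 15 | 64 | 0.01 | 2000 | 0.4 |

#### TS with retune.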

The $\lambda$ sweep with retune variant in Appendix [B.2](https://arxiv.org/html/2605.20248#A2.SS2) inherits the same backbone hyperparameters and replaces $(\mathrm{lr}, \mathrm{dropout})$ with the values from the $5 \times 5$ retune (Table [10](https://arxiv.org/html/2605.20248#A1.T10)); all other settings are unchanged.

Table 10: Hyperparameters for the $\lambda$ sweep with retune.

| Dataset | GCN lr | GCN dropout | SAGE lr | SAGE dropout | GAT lr | GAT dropout |
|---|---|---|---|---|---|---|
| Cora | 0.001 | 0.65 | 0.00025 | 0.7 | 0.0005 | 0.3 |
| Citeseer | 0.004 | 0.6 | 0.001 | 0.3 | 0.0005 | 0.55 |
| Pubmed | 0.01 | 0.8 | 0.0025 | 0.45 | 0.01 | 0.5 |
| Computer | 0.001 | 0.55 | 0.0005 | 0.2 | 0.004 | 0.6 |
| Photo | 0.001 | 0.4 | 0.0005 | 0.3 | 0.001 | 0.5 |
| CS | 0.002 | 0.3 | 0.00025 | 0.4 | 0.00025 | 0.35 |
| Physics | 0.0005 | 0.3 | 0.001 | 0.6 | 0.004 | 0.7 |
| WikiCS | 0.0005 | 0.5 | 0.004 | 0.75 | 0.00025 | 0.8 |
| Squirrel | 0.02 | 0.6 | 0.02 | 0.75 | 0.005 | 0.4 |
| Chameleon | 0.02 | 0.15 | 0.02 | 0.7 | 0.01 | 0.75 |
| Amazon-Ratings | 0.004 | 0.6 | 0.001 | 0.45 | 0.001 | 0.6 |
| Roman-Empire | 0.004 | 0.4 | 0.004 | 0.25 | 0.001 | 0.3 |
| Minesweeper | 0.02 | 0.25 | 0.02 | 0.2 | 0.01 | 0.2 |

#### MLP baselines\.

For the MLP baselines and their TS variants, we use a single fixed architecture and optimization configuration across all datasets (Table [11](https://arxiv.org/html/2605.20248#A1.T11)). This avoids dataset-specific MLP tuning and isolates the effect of adding Transductive Sharpening to a feature-only model.

#### Meta\-learned lambda\.

For the meta-learned $\lambda$ variant in Ablation [4.3](https://arxiv.org/html/2605.20248#S4.SS3.SSS0.Px4), we additionally use the fixed meta-learner hyperparameters listed in Table [12](https://arxiv.org/html/2605.20248#A1.T12).

Table 11: Fixed MLP hyperparameters used across all datasets.

| Hyperparameter | Value |
|---|---|
| Hidden channels | 512 |
| Training epochs | 1000 |
| Learning rate | 0.001 |
| Local layers | 3 |
| Weight decay | 0.0005 |
| Dropout | 0.5 |
| Tsallis $q$ | 2.0 |
| Seeds (runs) per cell | 5 |

Table 12: Meta-learner constants.

| Meta-learner constant | Value |
|---|---|
| $\lambda_{\max}$ | 1 |
| Meta-LR | 0.001 |
| Meta-hidden dim | 32 |
| Meta-warmup epochs | 20 |
| Tsallis $q$ | 2 |
| Seeds (runs) per cell | 5 |
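
Both tables list a Tsallis parameter $q = 2$. As a reminder of what that quantity measures, the following is a minimal sketch of the standard Tsallis entropy $S_q(p) = \bigl(1 - \sum_i p_i^q\bigr)/(q-1)$, which reduces to Shannon entropy as $q \to 1$. The function name is hypothetical, and how this entropy enters the TS loss is not specified in this appendix, so the snippet illustrates only the formula.

```python
import math


def tsallis_entropy(p, q=2.0):
    """Tsallis entropy of a probability vector p (hypothetical helper).

    For q == 1 the Shannon entropy is returned, which is the q -> 1 limit.
    """
    if q == 1.0:
        return -sum(pi * math.log(pi) for pi in p if pi > 0)
    return (1.0 - sum(pi ** q for pi in p)) / (q - 1.0)


# A uniform prediction over 4 classes is maximally uncertain;
# a one-hot prediction has zero entropy.
uniform = tsallis_entropy([0.25, 0.25, 0.25, 0.25], q=2.0)
one_hot = tsallis_entropy([1.0, 0.0, 0.0, 0.0], q=2.0)
print(uniform, one_hot)
```

With $q = 2$ the entropy is simply $1 - \sum_i p_i^2$, so it is bounded and needs no logarithm.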

### A.3 Declaration of Splits

- Cora, CiteSeer, PubMed: random class-balanced splits with 20 training nodes per class, 500 validation nodes, and 1000 test nodes.
- Amazon-Computer, Amazon-Photo, Coauthor-CS, Coauthor-Physics \[[37](https://arxiv.org/html/2605.20248#bib.bib53)\]: a single fixed 60/20/20 split per dataset.
- WikiCS \[[28](https://arxiv.org/html/2605.20248#bib.bib54)\]: the 20 splits provided with the dataset.
- Squirrel, Chameleon \[[32](https://arxiv.org/html/2605.20248#bib.bib55)\]: 10 splits on the filtered versions of Platonov et al. \[[33](https://arxiv.org/html/2605.20248#bib.bib31)\] (approximately 48/32/20 after filtering).
- Roman-Empire, Amazon-Ratings, Minesweeper \[[33](https://arxiv.org/html/2605.20248#bib.bib31)\]: 10 fixed 50/25/25 splits.
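
The class-balanced protocol used for Cora, CiteSeer, and PubMed can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function name and the seed handling are hypothetical, and it assumes that validation and test nodes are drawn uniformly at random from the nodes left over after the training nodes are chosen, which the text above does not state explicitly.

```python
import random


def class_balanced_split(labels, n_train_per_class=20, n_val=500,
                         n_test=1000, seed=0):
    """Return (train, val, test) index lists for one random split.

    Assumption: val/test are sampled uniformly from the non-training nodes.
    """
    rng = random.Random(seed)

    # Group node indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    # Draw the same number of training nodes from every class.
    train = []
    for label in sorted(by_class):
        members = by_class[label]
        if len(members) < n_train_per_class:
            raise ValueError(f"class {label} has too few nodes")
        train.extend(rng.sample(members, n_train_per_class))

    # Shuffle the remaining nodes and carve out validation and test sets.
    train_set = set(train)
    rest = [i for i in range(len(labels)) if i not in train_set]
    if len(rest) < n_val + n_test:
        raise ValueError("not enough nodes left for val and test")
    rng.shuffle(rest)
    return train, rest[:n_val], rest[n_val:n_val + n_test]


# Toy labels: 2708 nodes spread over 7 classes (sizes chosen for illustration).
toy_labels = [i % 7 for i in range(2708)]
train_idx, val_idx, test_idx = class_balanced_split(toy_labels)
```

Any nodes not selected for one of the three sets stay unassigned, which is why the Planetoid datasets contain nodes outside train, validation, and test.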

## Appendix B Additional Results

### B.1 Results on ogbn-arxiv

Table 13: Results on ogbn-arxiv. Means ± std over 5 seeds. Green when positive, red when negative; bold when $|\Delta| > \sigma$. The first three columns are the baselines; the last three add TS ($\lambda = 0.25$), with the change over the corresponding baseline in parentheses.

| Dataset | GCN | SAGE | GAT | GCN + TS | SAGE + TS | GAT + TS |
|---|---|---|---|---|---|---|
| ogbn-arxiv | 73.09 ± 0.18 | 72.48 ± 0.26 | 72.38 ± 0.12 | 73.30 ± 0.23 (+0.21) | 72.72 ± 0.17 (+0.24) | 72.53 ± 0.18 (+0.15) |

### B.2 Hyperparameter retuning

In a separate diagnostic, we additionally re-tune $(\mathrm{lr}, \mathrm{dropout})$ on a $5 \times 5$ grid centred on TunedGNN's per-cell defaults at the validation-best $\lambda^{\star}$. Table [14](https://arxiv.org/html/2605.20248#A2.T14) reports the resulting test-accuracy shifts relative to the same cell with TunedGNN's defaults. Most cells move by less than one combined std, suggesting that the gains reported in the main results would not increase substantially under this additional tuning.
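
To make the idea of a grid "centred on the defaults" concrete, here is a minimal sketch of how such a $5 \times 5$ grid could be generated. The grid spacing is an assumption: the multiplicative learning-rate factors and additive dropout offsets below are hypothetical choices, since the text does not state the actual grid values, and the function name is illustrative only.

```python
def retune_grid(lr, dropout,
                lr_factors=(0.25, 0.5, 1.0, 2.0, 4.0),
                dropout_offsets=(-0.1, -0.05, 0.0, 0.05, 0.1)):
    """Build a 5x5 (lr, dropout) grid around one cell's default values.

    Assumption: the spacings are hypothetical, not taken from the paper.
    """
    grid = []
    for factor in lr_factors:
        for offset in dropout_offsets:
            # Keep dropout a valid probability and avoid float noise.
            new_dropout = round(min(max(dropout + offset, 0.0), 1.0), 2)
            grid.append((lr * factor, new_dropout))
    return grid


# Grid around a cell whose defaults are lr = 0.001 and dropout = 0.5.
grid = retune_grid(0.001, 0.5)
print(len(grid), grid[12])  # 25 candidates; index 12 is the default itself
```

Each candidate would then be trained at the fixed $\lambda^{\star}$ and the configuration with the best validation accuracy kept.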

Table 14: Test-accuracy gain attributable to the $(\mathrm{lr}, \mathrm{dropout})$ retune, relative to the same cell pre-retune (i.e., at the same $\lambda^{\star}$ but with TunedGNN's default $\mathrm{lr}$ and $\mathrm{dropout}$). Green when positive, red when negative; bold when $|\Delta| > \sigma$.

| | Cora | CiteSeer | PubMed | Computer | Photo | CS | Physics | WikiCS | Squirrel | Chameleon | Amazon-Rat. | Roman-Emp. | Minesweeper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GCN | −0.12 ± 0.73 | −0.08 ± 0.64 | +0.18 ± 0.91 | −0.25 ± 0.32 | −0.07 ± 0.33 | −0.01 ± 0.17 | +0.04 ± 0.09 | −0.18 ± 0.77 | −0.04 ± 2.75 | −0.07 ± 6.57 | +0.41 ± 0.49 | +0.36 ± 0.34 | +0.21 ± 0.25 |
| SAGE | −0.16 ± 1.21 | −0.36 ± 0.32 | +0.38 ± 1.38 | +0.04 ± 0.24 | +0.33 ± 0.36 | +0.11 ± 0.13 | +0.04 ± 0.06 | +0.15 ± 0.44 | −1.28 ± 2.51 | +0.67 ± 6.97 | +0.11 ± 0.63 | +0.46 ± 0.38 | +0.81 ± 0.52 |
| GAT | −1.66 ± 1.70 | +0.38 ± 0.63 | 0 | +0.05 ± 0.32 | 0 | +0.01 ± 0.05 | +0.03 ± 0.06 | +0.11 ± 0.31 | +0.61 ± 3.09 | −1.89 ± 6.34 | +0.13 ± 0.42 | 0 | 0 |

## Appendix C Further Ablations

#### Sharpen only on test nodes\.

We evaluate a variant with $\lambda = 0.25$ in which TS is applied only to test nodes, excluding validation nodes from the unlabeled-node sharpening term.

Table 15: Test-accuracy gain of the restricted variant over the standard TS. Green when positive, red when negative; bold when $|\Delta| > \sigma$.

| | Cora | CiteSeer | PubMed | Computer | Photo | CS | Physics | WikiCS | Squirrel | Chameleon | Amazon-Rat. | Roman-Emp. | Minesweeper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GCN | −3.84 ± 0.70 | −3.10 ± 0.30 | −2.66 ± 0.74 | +0.25 ± 0.29 | −0.02 ± 0.20 | +0.10 ± 0.07 | +0.02 ± 0.15 | −0.15 ± 0.55 | +0.04 ± 2.70 | −0.58 ± 6.47 | −0.04 ± 0.47 | −0.30 ± 0.28 | −0.11 ± 0.37 |
| SAGE | −4.56 ± 1.43 | −0.78 ± 1.65 | +1.20 ± 1.60 | −0.19 ± 0.19 | +0.26 ± 0.48 | +0.09 ± 0.11 | −0.06 ± 0.12 | −0.25 ± 0.43 | +0.30 ± 3.26 | +1.84 ± 7.61 | −0.07 ± 0.45 | −0.22 ± 0.32 | +0.06 ± 0.90 |
| GAT | −2.32 ± 1.49 | −3.00 ± 1.33 | −2.18 ± 1.40 | +0.11 ± 0.21 | −0.05 ± 0.12 | +0.03 ± 0.11 | −0.01 ± 0.08 | −0.19 ± 1.27 | +1.30 ± 3.46 | −1.58 ± 5.89 | 0 ± 0.38 | −0.52 ± 0.51 | +0.40 ± 0.76 |

The Planetoid datasets (Cora, CiteSeer, PubMed) contain nodes that belong to none of the train, validation, or test sets. For this ablation, we continue to sharpen these nodes.
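
The difference between the two variants comes down to which nodes contribute to the sharpening term. The sketch below builds both node masks from index lists. It is an illustration under stated assumptions: the function name is hypothetical, and it assumes that standard TS sharpens every node outside the training set, which is how the description of this ablation reads.

```python
def sharpening_masks(n_nodes, train_idx, val_idx):
    """Return boolean masks (standard, restricted) over all nodes.

    standard:   every non-training node is sharpened (assumed default).
    restricted: validation nodes are excluded as well, so only test nodes
                and nodes outside all three sets are sharpened.
    """
    train, val = set(train_idx), set(val_idx)
    standard = [i not in train for i in range(n_nodes)]
    restricted = [i not in train and i not in val for i in range(n_nodes)]
    return standard, restricted


# Toy graph with 8 nodes: 0-1 train, 2-3 val, 4-5 test, 6-7 in no split.
standard, restricted = sharpening_masks(8, train_idx=[0, 1], val_idx=[2, 3])
print(standard)    # nodes 2..7 are sharpened
print(restricted)  # nodes 4..7 are sharpened
```

Nodes 6 and 7 play the role of the Planetoid nodes that lie outside every split: they remain in both masks.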

## Appendix D Further Visualizations

This appendix provides additional visualizations of the effect of the sharpening coefficient $\lambda$, further detailing the trends studied in Section [4.2](https://arxiv.org/html/2605.20248#S4.SS2).

Figure [3](https://arxiv.org/html/2605.20248#A4.F3) reports improvements and regressions in units of the baseline standard deviation, using Glass's $\Delta$ relative to the $\lambda = 0$ supervised baseline. The results show that small positive values of $\lambda$ provide the most favorable trade-off: improvements are frequent and often larger than the corresponding regressions. As $\lambda$ increases, regressions become more common and more severe, indicating that aggressive sharpening is less robust. Figure [4](https://arxiv.org/html/2605.20248#A4.F4) complements this view by showing the full test-accuracy curves for each dataset and backbone. The same qualitative pattern appears across many settings: negative values of $\lambda$ are usually harmful, moderate positive values often improve performance, and overly large values eventually degrade accuracy.
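
Glass's $\Delta$ divides the difference in means by the standard deviation of the control group only, here the $\lambda = 0$ baseline. A minimal sketch follows; the function name and the accuracy values are made up for illustration, and the use of the sample (n − 1) standard deviation is an assumption, since the text does not say which estimator is used.

```python
import statistics


def glass_delta(treated, control):
    """Glass's Delta: mean difference in units of the control group's std.

    Assumption: the sample standard deviation (n - 1) is used.
    """
    control_std = statistics.stdev(control)
    if control_std == 0:
        raise ValueError("control group has zero variance")
    return (statistics.mean(treated) - statistics.mean(control)) / control_std


# Hypothetical per-seed test accuracies, not results from the paper.
baseline_runs = [80.0, 81.0, 82.0]  # lambda = 0
ts_runs = [82.0, 83.0, 84.0]        # some lambda > 0
print(glass_delta(ts_runs, baseline_runs))  # 2.0
```

Normalizing by the baseline's spread alone puts noisy datasets such as Chameleon and stable ones such as Physics on a common scale.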

![Refer to caption](https://arxiv.org/html/2605.20248v1/x3.png)

Figure 3: Distribution of improvements and regressions as a function of $\lambda$, measured using Glass's $\Delta$ relative to the $\lambda = 0$ supervised baseline. Small positive values of $\lambda$ yield the most favorable balance, with improvements occurring frequently and regressions remaining comparatively limited. Larger values of $\lambda$ increase both the frequency and severity of regressions, indicating that aggressive sharpening is less robust across datasets and backbones.

![Refer to caption](https://arxiv.org/html/2605.20248v1/x4.png)

Figure 4: Test accuracy as a function of $\lambda$ for each dataset and backbone. The crosshair marks the $\lambda = 0$ supervised baseline. Across many dataset–backbone pairs, performance improves over a finite interval of positive $\lambda$ values before degrading when $\lambda$ becomes too large, while negative values of $\lambda$ are often harmful. This supports the use of moderate positive sharpening and helps explain why a conservative universal value such as $\lambda = 0.25$ performs reliably.

## Appendix E Comparison with competitive architectures

This appendix provides a comparison between TS and recent node-classification methods. Our aim is not to claim that a loss-level modification replaces architectural advances, but to contextualize the performance of TS relative to more complex approaches. In Table [16](https://arxiv.org/html/2605.20248#A5.T16) and Table [17](https://arxiv.org/html/2605.20248#A5.T17), we compare TS against competitive architectures across homophilous and heterophilous datasets, respectively.

Overall, TS remains competitive despite using substantially simpler backbone architectures\. This is especially clear on the homophilous benchmarks, where standard GNNs augmented with TS often match or approach the performance of more specialized methods\. On heterophilous datasets, the comparison is more mixed, as expected, since many competing methods are specifically designed to handle heterophily\. Nevertheless, the results show that a simple objective\-level modification can recover a meaningful amount of performance without introducing additional architectural machinery\.

Table 16: Node classification results over homophilous graphs (%). Cells are left blank where the original paper used a different evaluation protocol than ours. The top $\mathbf{1^{st}}$, $\mathbf{2^{nd}}$ and $\mathbf{3^{rd}}$ results are highlighted.

| Method | Cora | CiteSeer | PubMed | Computer | Photo | CS | Physics | WikiCS |
|---|---|---|---|---|---|---|---|---|
| GraphGPS \[[34](https://arxiv.org/html/2605.20248#bib.bib42)\] | 83.87 ± 0.96 | 72.73 ± 1.23 | 79.94 ± 0.26 | 91.79 ± 0.63 | 94.89 ± 0.14 | 94.04 ± 0.21 | 96.71 ± 0.15 | 78.66 ± 0.49 |
| NAGphormer \[[5](https://arxiv.org/html/2605.20248#bib.bib43)\] | 80.92 ± 1.17 | 70.59 ± 0.89 | 80.14 ± 1.06 | 91.69 ± 0.30 | 96.14 ± 0.16 | 95.85 ± 0.16 | 97.35 ± 0.12 | 77.92 ± 0.93 |
| Exphormer \[[38](https://arxiv.org/html/2605.20248#bib.bib44)\] | 83.29 ± 1.36 | 71.85 ± 1.11 | 79.67 ± 0.73 | 91.80 ± 0.35 | 95.69 ± 0.39 | 95.92 ± 0.25 | 97.06 ± 0.13 | 79.38 ± 0.62 |
| GOAT \[[22](https://arxiv.org/html/2605.20248#bib.bib45)\] | 83.26 ± 1.24 | 72.21 ± 1.29 | 80.06 ± 0.67 | 92.29 ± 0.37 | 94.33 ± 0.21 | 93.81 ± 0.19 | 96.47 ± 0.16 | 77.96 ± 0.63 |
| NodeFormer \[[47](https://arxiv.org/html/2605.20248#bib.bib46)\] | 82.73 ± 0.75 | 72.37 ± 1.20 | 79.59 ± 0.92 | 87.29 ± 0.58 | 93.43 ± 0.56 | 95.69 ± 0.27 | 96.48 ± 0.34 | 75.13 ± 0.93 |
| SGFormer \[[48](https://arxiv.org/html/2605.20248#bib.bib47)\] | 84.82 ± 0.85 | 72.72 ± 1.15 | 80.60 ± 0.49 | 92.42 ± 0.66 | 95.58 ± 0.36 | 95.71 ± 0.24 | 96.75 ± 0.26 | 80.05 ± 0.46 |
| Polynormer \[[8](https://arxiv.org/html/2605.20248#bib.bib36)\] | 83.43 ± 0.89 | 72.19 ± 0.83 | 79.35 ± 0.73 | 93.78 ± 0.10 | 96.57 ± 0.23 | 95.42 ± 0.19 | 97.18 ± 0.11 | 80.26 ± 0.92 |
| MLP | 60.96 ± 2.51 | 56.58 ± 1.14 | 68.96 ± 1.16 | 82.46 ± 0.45 | 87.57 ± 0.52 | 91.54 ± 0.20 | 95.97 ± 0.07 | 70.96 ± 1.00 |
| GCN \[[21](https://arxiv.org/html/2605.20248#bib.bib32)\] | 84.54 ± 0.86 | 72.68 ± 0.43 | 80.70 ± 0.96 | 94.12 ± 0.08 | 95.90 ± 0.33 | 95.88 ± 0.03 | 97.38 ± 0.06 | 79.97 ± 0.43 |
| SAGE \[[17](https://arxiv.org/html/2605.20248#bib.bib27)\] | 83.60 ± 0.58 | 69.60 ± 0.61 | 77.86 ± 1.49 | 93.25 ± 0.36 | 96.43 ± 0.27 | 96.29 ± 0.12 | 97.25 ± 0.08 | 80.71 ± 0.19 |
| GAT \[[42](https://arxiv.org/html/2605.20248#bib.bib30)\] | 82.40 ± 1.01 | 71.90 ± 0.25 | 79.76 ± 1.21 | 93.98 ± 0.22 | 96.69 ± 0.14 | 96.17 ± 0.02 | 97.26 ± 0.03 | 80.92 ± 0.58 |
| MLP+TS | 64.48 ± 2.83 | 62.72 ± 3.20 | 72.30 ± 1.89 | 82.87 ± 0.56 | 87.65 ± 0.22 | 91.77 ± 0.38 | 95.98 ± 0.08 | 72.48 ± 0.83 |
| GCN+TS | 85.74 ± 0.54 | 75.18 ± 0.15 | 80.74 ± 0.30 | 93.98 ± 0.26 | 96.21 ± 0.11 | 95.89 ± 0.06 | 97.44 ± 0.14 | 80.31 ± 0.44 |
| SAGE+TS | 85.28 ± 1.11 | 74.96 ± 0.24 | 79.72 ± 0.69 | 93.43 ± 0.11 | 96.51 ± 0.23 | 96.24 ± 0.10 | 97.23 ± 0.00 | 81.10 ± 0.32 |
| GAT+TS | 84.62 ± 0.89 | 74.84 ± 0.48 | 78.84 ± 0.67 | 93.86 ± 0.13 | 96.60 ± 0.07 | 96.17 ± 0.02 | 97.38 ± 0.04 | 81.78 ± 0.22 |

Methods reported on only a subset of the datasets (values in column order, blank cells omitted; these rows sit between MLP and GCN in the original table):

- NormProp \[[51](https://arxiv.org/html/2605.20248#bib.bib37)\]: 85.46 ± 0.51, 74.33 ± 0.57, 80.72 ± 1.09
- OGFormer \[[53](https://arxiv.org/html/2605.20248#bib.bib35)\]: 86.40 ± 0.30, 74.70 ± 0.50, 81.50 ± 0.50, 92.90 ± 0.30, 95.50 ± 0.00, 95.20 ± 0.10
- ELU-GCN \[[20](https://arxiv.org/html/2605.20248#bib.bib38)\]: 84.29 ± 0.39, 74.23 ± 0.62, 80.51 ± 0.21
- GraphTARIF \[[19](https://arxiv.org/html/2605.20248#bib.bib28)\]: 94.61 ± 0.17, 97.03 ± 0.19, 96.51 ± 0.11, 97.39 ± 0.07, 80.93 ± 0.57

Table 17: Node classification results over heterophilous graphs (%). Cells are left blank where the original paper used a different evaluation protocol than ours. The top $\mathbf{1^{st}}$, $\mathbf{2^{nd}}$ and $\mathbf{3^{rd}}$ results are highlighted.

| Method | Squirrel | Chameleon | Amazon-Ratings | Roman-Empire | Minesweeper |
|---|---|---|---|---|---|
| GraphGPS \[[34](https://arxiv.org/html/2605.20248#bib.bib42)\] | 39.81 ± 2.28 | 41.55 ± 3.91 | 53.27 ± 0.66 | 82.72 ± 0.68 | 90.75 ± 0.89 |
| NodeFormer \[[47](https://arxiv.org/html/2605.20248#bib.bib46)\] | 38.89 ± 2.67 | 36.38 ± 3.85 | 43.79 ± 0.57 | 74.83 ± 0.81 | 87.71 ± 0.69 |
| SGFormer \[[48](https://arxiv.org/html/2605.20248#bib.bib47)\] | 42.65 ± 2.41 | 45.21 ± 3.72 | 54.14 ± 0.62 | 80.01 ± 0.44 | 91.42 ± 0.41 |
| Polynormer \[[8](https://arxiv.org/html/2605.20248#bib.bib36)\] | 41.97 ± 2.14 | 41.97 ± 3.18 | 54.96 ± 0.22 | 92.66 ± 0.60 | 97.49 ± 0.48 |
| MLP | 39.30 ± 0.79 | 43.86 ± 5.23 | 48.85 ± 0.55 | 66.10 ± 0.44 | 51.06 ± 1.76 |
| GCN+ReP \[[31](https://arxiv.org/html/2605.20248#bib.bib26)\] | 45.89 ± 1.45 | 47.57 ± 3.90 | 52.75 ± 0.62 | 86.43 ± 0.74 | 96.05 ± 0.19 |
| CPGNN \[[55](https://arxiv.org/html/2605.20248#bib.bib48)\] | 30.04 ± 2.03 | 33.00 ± 3.15 | 39.79 ± 0.77 | 63.96 ± 0.62 | 52.03 ± 5.46 |
| FSGNN \[[27](https://arxiv.org/html/2605.20248#bib.bib49)\] | 35.92 ± 1.32 | 40.61 ± 2.97 | 52.74 ± 0.83 | 79.92 ± 0.56 | 90.08 ± 0.70 |
| GloGNN \[[23](https://arxiv.org/html/2605.20248#bib.bib50)\] | 35.11 ± 1.24 | 25.90 ± 3.58 | 36.89 ± 0.14 | 59.63 ± 0.69 | 51.08 ± 1.23 |
| GPRGNN \[[6](https://arxiv.org/html/2605.20248#bib.bib51)\] | 38.95 ± 1.99 | 39.93 ± 3.30 | 44.88 ± 0.34 | 64.85 ± 0.27 | 86.24 ± 0.61 |
| H2GCN \[[56](https://arxiv.org/html/2605.20248#bib.bib52)\] | 35.10 ± 1.15 | 26.75 ± 3.64 | 36.47 ± 0.23 | 60.11 ± 0.52 | 89.71 ± 0.31 |
| GCN \[[21](https://arxiv.org/html/2605.20248#bib.bib32)\] | 43.75 ± 1.91 | 45.30 ± 2.30 | 53.64 ± 0.54 | 91.15 ± 0.20 | 97.26 ± 0.22 |
| GAT \[[42](https://arxiv.org/html/2605.20248#bib.bib30)\] | 41.51 ± 2.34 | 43.07 ± 5.25 | 55.09 ± 0.19 | 90.49 ± 0.22 | 97.86 ± 0.37 |
| SAGE \[[17](https://arxiv.org/html/2605.20248#bib.bib27)\] | 40.48 ± 2.90 | 44.32 ± 4.55 | 55.18 ± 0.93 | 90.50 ± 0.21 | 97.09 ± 1.00 |
| MLP+TS | 39.39 ± 0.91 | 43.86 ± 5.23 | 49.50 ± 0.27 | 66.12 ± 0.29 | 50.97 ± 1.56 |
| GCN+TS | 44.57 ± 2.04 | 45.27 ± 4.74 | 54.06 ± 0.58 | 91.66 ± 0.20 | 97.80 ± 0.20 |
| GAT+TS | 40.36 ± 1.66 | 44.52 ± 3.69 | 55.73 ± 0.31 | 90.93 ± 0.20 | 97.86 ± 0.37 |
| SAGE+TS | 41.32 ± 2.35 | 43.32 ± 4.67 | 56.72 ± 0.35 | 91.27 ± 0.36 | 97.33 ± 0.94 |

Methods reported on only a subset of the datasets (values in column order, blank cells omitted; these rows sit between GCN+ReP and CPGNN in the original table):

- CoED \[[30](https://arxiv.org/html/2605.20248#bib.bib40)\]: 45.50 ± 1.62, 47.27 ± 3.62, 92.17 ± 0.29
- M3Dphormer \[[49](https://arxiv.org/html/2605.20248#bib.bib41)\]: 44.34 ± 1.94, 47.09 ± 4.05, 98.27 ± 0.20
- GraphTARIF \[[19](https://arxiv.org/html/2605.20248#bib.bib28)\]: 45.58 ± 1.91, 55.86 ± 0.42, 93.23 ± 0.38, 99.03 ± 0.19
- BuNN \[[1](https://arxiv.org/html/2605.20248#bib.bib39)\]: 53.74 ± 0.51, 91.75 ± 0.39, 98.99 ± 0.16
- Dir-Poly \[[16](https://arxiv.org/html/2605.20248#bib.bib29)\]: 50.73 ± 0.56, 94.51 ± 0.22, 93.74 ± 0.70

## Appendix F Proof of Lemma [1](https://arxiv.org/html/2605.20248#Thmlemma1)

###### Proof\.

By definition, the cross-entropy between a target distribution $y$ and a prediction $p$ is

$$\mathcal{L}_{\mathrm{CE}}(y,p) = -\sum_{i=1}^{C} y_{i} \log p_{i}.$$

Adding and subtracting $\sum_{i=1}^{C} p_{i} \log p_{i}$ gives

$$\mathcal{L}_{\mathrm{CE}}(y,p) = -\sum_{i=1}^{C} p_{i} \log p_{i} + \sum_{i=1}^{C} p_{i} \log p_{i} - \sum_{i=1}^{C} y_{i} \log p_{i}.$$

The first term is the Shannon entropy $H(p)$. Combining the remaining terms yields

$$\mathcal{L}_{\mathrm{CE}}(y,p) = H(p) + \sum_{i=1}^{C} (p_{i} - y_{i}) \log p_{i},$$

which proves the claim. ∎
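
The identity can also be checked numerically. The following sketch evaluates both sides for one example; the helper names and the example vectors are illustrative and not taken from the paper.

```python
import math


def cross_entropy(y, p):
    """L_CE(y, p) = -sum_i y_i log p_i."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p) if yi > 0)


def shannon_entropy(p):
    """H(p) = -sum_i p_i log p_i."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def alignment_term(y, p):
    """Label-dependent remainder: sum_i (p_i - y_i) log p_i."""
    return sum((pi - yi) * math.log(pi) for yi, pi in zip(y, p))


# One-hot target on class 1 and a strictly positive prediction.
y = [0.0, 1.0, 0.0]
p = [0.2, 0.7, 0.1]

lhs = cross_entropy(y, p)
rhs = shannon_entropy(p) + alignment_term(y, p)
print(lhs, rhs)  # the two values agree up to floating-point error
```

Only the second term depends on the labels, which is why the entropy term remains available on nodes without ground truth.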

## Appendix G Broader Impacts

This paper presents work whose goal is to advance the field of Machine Learning\. The method is foundational graph\-learning research with no direct high\-risk application\.