Beyond the Golden Teacher: Enhancing Graph Learning through LLM-GNN Co-teaching

arXiv cs.LG Papers

Summary

This paper proposes LLM-GNN Co-Teaching, a bidirectional framework for few-shot graph learning on text-attributed graphs. The LLM and GNN exchange confident pseudo-labels and use round-based preference optimization (RPL-PO) to mutually improve, outperforming prior methods on benchmarks.

arXiv:2606.11583v1 Announce Type: new

Cached at: 06/11/26, 01:49 PM

# Beyond the Golden Teacher: Enhancing Graph Learning through LLM-GNN Co-teaching
Source: [https://arxiv.org/html/2606.11583](https://arxiv.org/html/2606.11583)
Zhuoyi Peng¹, Hanlin Gu², Lixin Fan², Yi Yang¹ \(¹The Hong Kong University of Science and Technology, ²WeBank\)

###### Abstract

Text\-attributed graphs \(TAGs\) underlie real\-world applications such as citation networks, social media, and e\-commerce\. Few\-shot graph learning on TAGs is hard: with only a handful of labels per class and the rest of the graph unannotated, neither GNNs nor LLMs can learn well on their own\. GNNs read topology and fail on cold nodes; LLMs read text and fail on text\-ambiguous nodes\. Existing LLM\-GNN methods all follow the same recipe: *designate one model as the golden teacher and use its outputs \(e\.g\., features or pseudo\-labels\) to supervise the other\.* We argue this golden\-teacher assumption breaks under sparse supervision: neither model is golden, and treating either as such transfers its blind spots into the student\. We therefore ask: *can we avoid designating either model as the golden teacher, and still perform effective graph learning?* We answer with LLM\-GNN Co\-Teaching, a bidirectional co\-teaching framework in which neither model is fixed as teacher\. The GNN and LLM exchange their most confident pseudo\-labels under an architecture\-specific small\-loss criterion, and both update every round\. Supervision is then mined from the trajectory: whenever a node moves from cross\-model contradiction at round $t$ to cross\-model agreement at round $t+1$, the LLM’s two answers on the same input form a preference pair \(*old contradicting self* $\prec$ *new peer\-endorsed self*\) for DPO training\. We call this Round\-based Pseudo\-Label Preference Optimization \(RPL\-PO\)\. On six benchmarks, LLM\-GNN Co\-Teaching consistently outperforms GNN\-as\-Judge and all prior methods, with absolute 3\-shot gains of 7\.86% on Cora and 7\.73% on ogbn\-arxiv; improvements carry over to 5\-shot and to zero\-shot cross\-dataset transfer\. Error\-structure analysis further shows that abandoning the golden\-teacher assumption substantially improves the LLM’s graph learning capability on challenging samples\.

Code: [https://github\.com/llmgnncoteaching/LLM\-GNN\-Coteaching](https://github.com/llmgnncoteaching/LLM-GNN-Coteaching)\.

## 1 Introduction

Text\-attributed graphs \(TAGs\) \[[1](https://arxiv.org/html/2606.11583#bib.bib1),[2](https://arxiv.org/html/2606.11583#bib.bib2),[3](https://arxiv.org/html/2606.11583#bib.bib3),[4](https://arxiv.org/html/2606.11583#bib.bib4)\] underlie real\-world applications such as citation networks, social media, recommendation, and e\-commerce, where each node carries raw text alongside graph topology\. The rise of Large Language Models \(LLMs\) \[[5](https://arxiv.org/html/2606.11583#bib.bib5),[6](https://arxiv.org/html/2606.11583#bib.bib6),[7](https://arxiv.org/html/2606.11583#bib.bib7)\] has driven growing interest in using them for TAG learning \[[8](https://arxiv.org/html/2606.11583#bib.bib8),[4](https://arxiv.org/html/2606.11583#bib.bib4),[9](https://arxiv.org/html/2606.11583#bib.bib9),[10](https://arxiv.org/html/2606.11583#bib.bib10),[11](https://arxiv.org/html/2606.11583#bib.bib11),[12](https://arxiv.org/html/2606.11583#bib.bib12)\]\. Most existing work on TAG learning, however, focuses on the supervised setting where abundant labels are available and both models can be reliably fine\-tuned \[[4](https://arxiv.org/html/2606.11583#bib.bib4),[13](https://arxiv.org/html/2606.11583#bib.bib13),[9](https://arxiv.org/html/2606.11583#bib.bib9),[11](https://arxiv.org/html/2606.11583#bib.bib11),[10](https://arxiv.org/html/2606.11583#bib.bib10),[14](https://arxiv.org/html/2606.11583#bib.bib14),[12](https://arxiv.org/html/2606.11583#bib.bib12)\]\. Real\-world TAGs are rarely labeled at this scale: only a handful of labels per class are typically available, and the bulk of the graph carries no supervision \[[15](https://arxiv.org/html/2606.11583#bib.bib15),[16](https://arxiv.org/html/2606.11583#bib.bib16),[17](https://arxiv.org/html/2606.11583#bib.bib17),[18](https://arxiv.org/html/2606.11583#bib.bib18),[19](https://arxiv.org/html/2606.11583#bib.bib19)\]\.
Under this few\-shot regime, neither GNNs \[[20](https://arxiv.org/html/2606.11583#bib.bib20),[21](https://arxiv.org/html/2606.11583#bib.bib21),[22](https://arxiv.org/html/2606.11583#bib.bib22),[23](https://arxiv.org/html/2606.11583#bib.bib23)\] nor LLMs work well alone: GNNs read topology and fail on cold \(low\-degree\) nodes whose neighborhoods provide too little signal \[[24](https://arxiv.org/html/2606.11583#bib.bib24),[25](https://arxiv.org/html/2606.11583#bib.bib25)\], while LLMs read text and fail when text is short or class\-ambiguous \[[26](https://arxiv.org/html/2606.11583#bib.bib26),[27](https://arxiv.org/html/2606.11583#bib.bib27),[28](https://arxiv.org/html/2606.11583#bib.bib28),[29](https://arxiv.org/html/2606.11583#bib.bib29)\]\. Their disjoint failure modes have motivated a substantial line of work combining them\.

Existing LLM\-GNN methods all share a common structure: one model is designated as a fixed teacher whose outputs are treated as ground truth, and the other is trained to match those outputs\. We refer to this shared structural assumption as the *golden\-teacher assumption*\. Prior approaches differ only in which side is designated as golden\. *LLM\-as\-Enhancers* \[[4](https://arxiv.org/html/2606.11583#bib.bib4),[13](https://arxiv.org/html/2606.11583#bib.bib13),[30](https://arxiv.org/html/2606.11583#bib.bib30),[31](https://arxiv.org/html/2606.11583#bib.bib31)\] freeze LLM\-derived features or explanations and train a downstream GNN to imitate them\. *LLM\-as\-Predictor* methods \[[9](https://arxiv.org/html/2606.11583#bib.bib9),[11](https://arxiv.org/html/2606.11583#bib.bib11),[10](https://arxiv.org/html/2606.11583#bib.bib10),[12](https://arxiv.org/html/2606.11583#bib.bib12),[14](https://arxiv.org/html/2606.11583#bib.bib14),[32](https://arxiv.org/html/2606.11583#bib.bib32),[33](https://arxiv.org/html/2606.11583#bib.bib33)\] treat the once\-instruction\-tuned LLM as the golden predictor, typically prompting it with structural tokens\. *GNN\-as\-Judge* \[[34](https://arxiv.org/html/2606.11583#bib.bib34)\] reverses the direction: a once\-trained GNN’s verdicts filter or re\-weight pseudo\-labels for LLM fine\-tuning\. In every case, supervision flows in one direction from a fixed teacher, and the student has no way to revise what the teacher said\.

The golden\-teacher assumption breaks under sparse supervision\. With few labels per class, neither model is reliable enough to serve as the golden teacher: the GNN cannot learn good representations for cold nodes, and the LLM cannot disambiguate short or class\-ambiguous text without exemplars\. Treating either as golden transfers its blind spots into the student wholesale, and unidirectional supervision leaves the student no way to revise what the teacher said\. The question that few\-shot LLM\-GNN learning has not yet asked is therefore:

> *Can we avoid designating either model as the golden teacher, and still perform effective graph learning?*

The question is not trivial: with only a few labeled anchors as direct supervision, two weak models updating each other freely can collapse onto each other’s mistakes rather than converge toward truth\. The framework needs a mechanism that extracts reliable supervisory signal from their joint dynamics\.

Figure 1: Co\-teaching without a golden teacher\. A single round between two weak models leaves them contradicting on node $v$, with no way to choose which should serve as the golden teacher \(left\)\. After one more round of bidirectional co\-teaching, both models evolve, and if they now agree on $B$, the LLM’s $R_t$ contradicting answer $C$ and $R_{t+1}$ peer\-endorsed answer $B$ together form a preference pair: the earlier self is rejected, the peer\-endorsed self is preferred \(right\)\. The reward signal comes from the trajectory itself\. No golden teacher, no human label, no reward model, no external judge\.

Our answer is LLM\-GNN Co\-Teaching, a co\-teaching framework that does not designate either side as the golden teacher and instead lets the GNN and LLM evolve together\. Training proceeds in rounds: in each round, every peer extracts its most confident pseudo\-labels under an architecture\-specific small\-loss criterion \(cross\-entropy fit for the GNN, minimum token log probability for the LLM\) and passes them to the other model, so that both peers grow from weak to strong round by round\. To create additional supervision, we further mine a preference signal from this trajectory: whenever a node transitions from cross\-model contradiction at round $t$ to cross\-model agreement at round $t+1$, the LLM’s two answers on the same node, the earlier contradicting one and the later peer\-endorsed one, form a natural preference pair, which we feed to direct preference optimization \(DPO\) \[[35](https://arxiv.org/html/2606.11583#bib.bib35)\]\. We call this Round\-based Pseudo\-Label Preference Optimization \(RPL\-PO\)\. The reward signal comes from the trajectory itself: no golden teacher, no human label, no reward model, no external judge\.

#### Contributions\.

\(1\) We abandon the golden\-teacher assumption\. LLM\-GNN Co\-Teaching is the first LLM\-GNN method in which neither model is designated as authoritative, with both updating every round and supervising each other through a small\-loss criterion\.

\(2\) RPL\-PO: a self\-supervised preference\-pair generator\. A node that transitions from cross\-model contradiction at round $t$ to cross\-model agreement at round $t+1$ yields a DPO preference pair from the LLM’s two answers on the same input\. RPL\-PO requires no human labels, no reward models, and no external judges, and is structurally inaccessible to single\-round or frozen\-teacher pipelines\.

\(3\) State\-of\-the\-art on six benchmarks\. LLM\-GNN Co\-Teaching outperforms GNN\-as\-Judge by up to 7\.86 percentage points on Cora and 7\.73 percentage points on ogbn\-arxiv under 3\-shot supervision, with the same lead carrying over to 5\-shot and to zero\-shot cross\-dataset transfer\. The error\-structure analysis in §[5\.6](https://arxiv.org/html/2606.11583#S5.SS6) shows that abandoning the golden\-teacher assumption substantially improves the LLM’s graph learning capability on challenging samples\.

## 2 Related Work

#### LLM\-GNN methods for graph learning\.

Combining LLMs and GNNs for TAGs has been extensively explored \[[8](https://arxiv.org/html/2606.11583#bib.bib8),[9](https://arxiv.org/html/2606.11583#bib.bib9),[10](https://arxiv.org/html/2606.11583#bib.bib10),[4](https://arxiv.org/html/2606.11583#bib.bib4),[29](https://arxiv.org/html/2606.11583#bib.bib29),[26](https://arxiv.org/html/2606.11583#bib.bib26),[27](https://arxiv.org/html/2606.11583#bib.bib27),[28](https://arxiv.org/html/2606.11583#bib.bib28),[36](https://arxiv.org/html/2606.11583#bib.bib36),[37](https://arxiv.org/html/2606.11583#bib.bib37)\]\. *LLM\-as\-Enhancers* \[[4](https://arxiv.org/html/2606.11583#bib.bib4),[13](https://arxiv.org/html/2606.11583#bib.bib13),[30](https://arxiv.org/html/2606.11583#bib.bib30),[31](https://arxiv.org/html/2606.11583#bib.bib31)\] freeze LLM\-derived features or explanations as enriched node input to a downstream GNN\. *LLM\-as\-Predictors* \[[9](https://arxiv.org/html/2606.11583#bib.bib9),[8](https://arxiv.org/html/2606.11583#bib.bib8),[10](https://arxiv.org/html/2606.11583#bib.bib10),[11](https://arxiv.org/html/2606.11583#bib.bib11),[12](https://arxiv.org/html/2606.11583#bib.bib12),[33](https://arxiv.org/html/2606.11583#bib.bib33),[32](https://arxiv.org/html/2606.11583#bib.bib32),[14](https://arxiv.org/html/2606.11583#bib.bib14)\] frame node classification as text generation, typically with structural prompts or graph tokens\. *GNN\-as\-Judge* \[[34](https://arxiv.org/html/2606.11583#bib.bib34)\] reverses the direction: a once\-trained GNN’s verdicts filter pseudo\-labels for fine\-tuning the LLM, with a theoretical lower bound on agreement\-set accuracy under conditional independence; Sheng et al\. \[[38](https://arxiv.org/html/2606.11583#bib.bib38)\] similarly treat LLM annotations as noisy oracles for graph active learning\. In every case, one model is fixed as the golden teacher and supervision flows in one direction\.
LLM\-GNN Co\-Teaching instead designates no golden teacher: both models update every round and judge each other across multiple rounds\.

#### Co\-teaching, noisy labels, and pseudo\-label selection\.

Co\-teaching \[[39](https://arxiv.org/html/2606.11583#bib.bib39)\] trains two networks simultaneously, each selecting small\-loss samples for its peer\. Co\-Teaching\+ \[[40](https://arxiv.org/html/2606.11583#bib.bib40)\] adds disagreement filtering, DivideMix \[[41](https://arxiv.org/html/2606.11583#bib.bib41)\] introduces mixture\-model selection, and earlier co\-training \[[42](https://arxiv.org/html/2606.11583#bib.bib42),[43](https://arxiv.org/html/2606.11583#bib.bib43),[44](https://arxiv.org/html/2606.11583#bib.bib44)\] variants pair networks of the same architecture\. The broader noisy\-label literature \[[45](https://arxiv.org/html/2606.11583#bib.bib45),[46](https://arxiv.org/html/2606.11583#bib.bib46),[47](https://arxiv.org/html/2606.11583#bib.bib47),[48](https://arxiv.org/html/2606.11583#bib.bib48),[49](https://arxiv.org/html/2606.11583#bib.bib49)\] likewise treats noise as homogeneous across views, and recent work warns that LLMs trained on their own outputs can degrade over time \[[50](https://arxiv.org/html/2606.11583#bib.bib50)\]\. Closely related is pseudo\-labeling \[[51](https://arxiv.org/html/2606.11583#bib.bib51)\], which augments small labeled sets with model\-generated labels, with mining of both easy and hard samples shown to be crucial \[[52](https://arxiv.org/html/2606.11583#bib.bib52),[53](https://arxiv.org/html/2606.11583#bib.bib53)\]\. On graphs, prior work explores multi\-stage self\-training \[[19](https://arxiv.org/html/2606.11583#bib.bib19)\], label\-propagation hybrids \[[54](https://arxiv.org/html/2606.11583#bib.bib54),[24](https://arxiv.org/html/2606.11583#bib.bib24)\], confidence\-aware filtering \[[55](https://arxiv.org/html/2606.11583#bib.bib55)\], and active labeling \[[56](https://arxiv.org/html/2606.11583#bib.bib56),[57](https://arxiv.org/html/2606.11583#bib.bib57)\], with single\-round GNN\-LLM agreement filters \[[34](https://arxiv.org/html/2606.11583#bib.bib34)\] the closest to our setup\.
This prior work either pairs homogeneous networks of the same architecture or uses a single\-round selection\. We are the first to co\-teach across *heterogeneous* architectures \(GNN \+ LLM\) iteratively, whose complementary inductive biases \(structural vs\. semantic\) provide stronger error independence than random\-initialization diversity\.

#### Preference optimization\.

LLM alignment from feedback originates in RLHF \[[58](https://arxiv.org/html/2606.11583#bib.bib58),[59](https://arxiv.org/html/2606.11583#bib.bib59),[60](https://arxiv.org/html/2606.11583#bib.bib60)\], with DPO \[[35](https://arxiv.org/html/2606.11583#bib.bib35)\] and its variants \[[61](https://arxiv.org/html/2606.11583#bib.bib61),[62](https://arxiv.org/html/2606.11583#bib.bib62),[63](https://arxiv.org/html/2606.11583#bib.bib63),[64](https://arxiv.org/html/2606.11583#bib.bib64),[65](https://arxiv.org/html/2606.11583#bib.bib65),[66](https://arxiv.org/html/2606.11583#bib.bib66)\] replacing the reward model with pairwise preferences\. On graphs, GNN\-as\-Judge \[[34](https://arxiv.org/html/2606.11583#bib.bib34)\] and InstructGraph \[[14](https://arxiv.org/html/2606.11583#bib.bib14)\] apply preference tuning to within\-round GNN\-LLM disagreements; RPL\-PO instead exploits the *temporal* structure of co\-teaching, contrasting the same LLM’s predictions across consecutive rounds\.

#### Our novelty\.

LLM\-GNN Co\-Teaching contributes on two fronts\. \(i\) *A new LLM\-GNN framework without an explicitly designated golden teacher*, operationalized via heterogeneous co\-teaching in which the GNN and LLM iteratively pseudo\-label each other under a small\-loss criterion and both update every round\. \(ii\) *A novel preference optimization signal mined from the learning trajectory*, which constructs preference pairs from cross\-round agreement transitions and requires no golden teacher, no human label, no reward model, and no external judge, fully releasing the supervision signal latent in sparse\-label graph learning\.

## 3 Preliminaries

We consider few\-shot semi\-supervised node classification on text\-attributed graphs \(TAGs\) defined as $\mathcal{G}=(\mathcal{V},\mathcal{E},\mathbf{T})$, where $\mathcal{V}=\{v_1,v_2,\ldots,v_N\}$ is the node set, $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$ is the edge set, and $\mathbf{T}=\{t_v\}_{v\in\mathcal{V}}$ are the per\-node text attributes \(e\.g\., paper title and abstract in citation networks\)\. Each node $v$ has a label $y_v\in\{1,\ldots,C\}$\. Given a small labeled set $\mathcal{V}_{\text{train}}$ in which each class has exactly $k$ labeled nodes, along with a validation set $\mathcal{V}_{\text{val}}$, the goal is to predict labels for the test nodes $\mathcal{V}_{\text{test}}=\mathcal{V}\setminus(\mathcal{V}_{\text{train}}\cup\mathcal{V}_{\text{val}})$\. Few\-shot *semi\-supervised* learning trains the models on both \(i\) the small labeled set $\mathcal{V}_{\text{train}}$ together with its ground\-truth labels and \(ii\) the unlabeled nodes $\mathcal{V}_{\text{U}}\subseteq\mathcal{V}\setminus\mathcal{V}_{\text{train}}$ and their text contents, e\.g\., by predicting on $\mathcal{V}_{\text{U}}$ and using the most confident predictions as pseudo\-labels\.
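To make the split concrete, here is a minimal sketch of drawing a $k$-shot training set with exactly $k$ labeled nodes per class. The function name `k_shot_split`, the dict-based label format, and the random choice of validation nodes are illustrative assumptions; the splits the paper actually uses are specified in its Appendix E.

```python
import random
from collections import defaultdict


def k_shot_split(labels, k, num_val, seed=0):
    """Split node ids into train (exactly k per class), val, and test.

    labels: dict mapping node id -> class id (hypothetical input format).
    num_val: number of validation nodes, drawn at random (an assumption).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for node, y in labels.items():
        by_class[y].append(node)

    train = []
    for y, nodes in sorted(by_class.items()):
        if len(nodes) < k:
            raise ValueError(f"class {y} has fewer than {k} nodes")
        train.extend(rng.sample(sorted(nodes), k))

    # Everything outside V_train is split into V_val and V_test.
    train_set = set(train)
    rest = sorted(n for n in labels if n not in train_set)
    rng.shuffle(rest)
    val, test = rest[:num_val], rest[num_val:]
    return sorted(train), sorted(val), sorted(test)


labels = {i: i % 3 for i in range(30)}  # 30 toy nodes, 3 classes
train, val, test = k_shot_split(labels, k=3, num_val=6)
```

With 3 classes and $k=3$, the training set holds 9 nodes, and the test set is everything outside the training and validation sets, matching the definition of $\mathcal{V}_{\text{test}}$ above.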

#### LLM predictor\.

A large language model $f_{\text{LLM}}$ predicts node labels by framing classification as text generation\. Given node $v$, we build a prompt $\mathcal{P}(v)$ containing $v$’s text $t_v$, the texts of $v$’s neighbors, and the candidate class names\. The LLM emits a class name, giving prediction $\hat{y}_v^{L}=f_{\text{LLM}}(\mathcal{P}(v))$\.

#### GNN predictor\.

A graph neural network $f_{\text{GNN}}:\mathcal{V}\to\mathbb{R}^{C}$ produces class logits by aggregating information from a node’s local neighborhood via message passing\. Layer $\ell$ computes

$$\mathbf{h}_v^{(\ell)}=\text{UPDATE}^{(\ell)}\!\left(\mathbf{h}_v^{(\ell-1)},\;\text{AGG}^{(\ell)}\!\left(\{\mathbf{h}_u^{(\ell-1)}:u\in\mathcal{N}(v)\}\right)\right),\tag{1}$$

with $\mathbf{h}_v^{(0)}=\mathbf{x}_v$ \(a numerical embedding of $t_v$\) and final prediction $\hat{y}_v^{G}=\arg\max f_{\text{GNN}}(v)$\.
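As an illustration of Eq. (1), the sketch below implements one message-passing step with a deliberately simple choice: AGG is the mean over neighbor states and UPDATE is a fixed convex mix of the node's own state and the aggregate. Both choices, the name `mean_aggregate_layer`, and the `self_weight` parameter are assumptions made for the example; a real GNN layer would add learnable weights and a nonlinearity.

```python
def mean_aggregate_layer(h, neighbors, self_weight=0.5):
    """One message-passing step over a dict of node states.

    h: dict node -> list of floats (the states h_v^(l-1)).
    neighbors: dict node -> list of neighbor ids (N(v)).
    """
    new_h = {}
    for v, h_v in h.items():
        nbrs = neighbors.get(v, [])
        if nbrs:
            # AGG: element-wise mean of the neighbor states.
            agg = [sum(h[u][d] for u in nbrs) / len(nbrs)
                   for d in range(len(h_v))]
        else:
            # Isolated node: nothing to aggregate, fall back to its own state.
            agg = list(h_v)
        # UPDATE: convex mix of own state and aggregated neighborhood.
        new_h[v] = [self_weight * own + (1 - self_weight) * nb
                    for own, nb in zip(h_v, agg)]
    return new_h


h0 = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.0, 0.0]}
neighbors = {0: [1, 2], 1: [0], 2: []}
h1 = mean_aggregate_layer(h0, neighbors)
```

Node 2 has no neighbors, so its state is unchanged after the layer. This is the cold-node situation described in the introduction: message passing adds no information where the neighborhood is empty or very small.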

## 4 LLM\-GNN Co\-teaching

Why no golden teacher? Existing LLM\-GNN methods all designate one model as a fixed teacher whose outputs are treated as ground truth\. Under sparse supervision, this assumption breaks down: with few labels per class, neither model is reliable, so freezing either side as the teacher transfers its blind spots into the student wholesale and the unidirectional supervision leaves no path to revise mistakes\. Effective few\-shot LLM\-GNN learning therefore needs a framework where neither model is fixed as teacher and both can correct each other through their joint dynamics\.

Difference from classical co\-teaching\. Classical Co\-Teaching \[[39](https://arxiv.org/html/2606.11583#bib.bib39)\] pairs two networks of the same architecture and relies on random\-init diversity for the two peers to make different mistakes; this diversity is fragile and shrinks as the two peers converge\. LLM\-GNN Co\-Teaching instead pairs a GNN and an LLM, whose inductive biases are disjoint by construction: the LLM’s text\-only view fails on semantically ambiguous descriptions, while the GNN’s structural view fails on low\-degree, sparsely connected nodes\. Each peer is reliably strong where the other is weak, so confident pseudo\-labels from one carry information the other could not have produced alone, providing a far more durable basis for mutual correction than random initialization\. §[5\.6](https://arxiv.org/html/2606.11583#S5.SS6) confirms this complementarity empirically through per\-degree error rates, and Appendix [J](https://arxiv.org/html/2606.11583#A10) shows it through per\-dataset Venn diagrams of LLM/GNN correctness\.

LLM\-GNN Co\-Teaching runs for $T$ rounds\. Figure [2](https://arxiv.org/html/2606.11583#S4.F2) visualises one round, organised in three stages\.

1. Select and exchange confident pseudo\-labels \(§[4\.1](https://arxiv.org/html/2606.11583#S4.SS1)\)\. Both models predict on the same unlabeled batch, each picks its most confident subset, and the two subsets are swapped\.
2. Update the LLM and GNN \(§[4\.2](https://arxiv.org/html/2606.11583#S4.SS2)\)\. The LLM is fine\-tuned on the GNN\-selected pseudo\-labels and the GNN is trained on the LLM\-selected pseudo\-labels, both starting from the previous round’s checkpoint\.
3. Reinforce pseudo\-label quality with RPL\-PO \(§[4\.3](https://arxiv.org/html/2606.11583#S4.SS3)\)\. Every two rounds, nodes whose LLM prediction flipped from disagreeing\-with\-GNN to agreeing\-with\-GNN form a temporal preference pair\. The LLM is then updated by DPO on these pairs, with no human annotation or external reward model\.
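The round structure can be sketched as a schedule. The snippet below only plans which stages run in which round and how a batch might be drawn. It assumes that RPL-PO runs at the end of every even round and that rounds $2k-1$ and $2k$ share a sampling seed so that they see the same batch (the paper states the same-batch property in §4.3; the seeding scheme here is one possible way to obtain it). The names `batch_for_round`, `round_plan`, and `base_seed` are hypothetical.

```python
import random


def batch_for_round(t, unlabeled, batch_size, base_seed=0):
    """Draw the unlabeled batch for round t (1-indexed).

    Rounds 2k-1 and 2k map to the same pair index, hence the same seed
    and the same batch.
    """
    pair_index = (t + 1) // 2
    rng = random.Random(base_seed + pair_index)
    return sorted(rng.sample(sorted(unlabeled), batch_size))


def round_plan(T):
    """List the stages executed in each of the T rounds."""
    plan = []
    for t in range(1, T + 1):
        stages = ["select_and_exchange", "update_llm_and_gnn"]
        if t % 2 == 0:
            stages.append("rpl_po")  # preference optimization every two rounds
        plan.append((t, stages))
    return plan


unlabeled = list(range(100))
plan = round_plan(4)
batch_1 = batch_for_round(1, unlabeled, batch_size=10)
batch_2 = batch_for_round(2, unlabeled, batch_size=10)
```

Here `batch_1` and `batch_2` are identical, which is what allows the LLM's two answers on the same node to be compared across the odd and even round.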

Figure 2: Round\-$t$ co\-teaching flow\. The GNN and LLM predict on the same unlabeled batch $\mathcal{B}_t$ and score by a model\-specific self\-confidence signal\. The top\-$R(t)$ confident pairs are cross\-exchanged: $\mathcal{S}_t^{G\to L}$ supervises the LLM in §[4\.2](https://arxiv.org/html/2606.11583#S4.SS2), and $\mathcal{S}_t^{L\to G}$ supervises the GNN in §[4\.2](https://arxiv.org/html/2606.11583#S4.SS2)\. Every two rounds the LLM additionally undergoes RPL\-PO \(zoomed in Fig\. [3](https://arxiv.org/html/2606.11583#S4.F3), §[4\.3](https://arxiv.org/html/2606.11583#S4.SS3)\)\.

### 4\.1 Selecting and Exchanging Confident Pseudo\-Labels

Intuition\. Two weak models can still teach each other if they are honest about what they do not know\. We therefore let each model speak only on the nodes where it is confident, and we send those confident answers to the peer as supervision; the peer treats them as ground truth on *its* blind spots\. Concretely, each model passes to its peer only the pseudo\-labels it is most certain about\. Certainty is measured by how strongly the model’s own scoring agrees with the answer it just produced, following the small\-loss principle of Co\-Teaching \[[39](https://arxiv.org/html/2606.11583#bib.bib39)\]\. An LLM that hesitates on some token of its generated answer, or a GNN that poorly fits its own predicted class, is more likely to be wrong than a model that scores its own answer with high confidence\.

For each node $v\in\mathcal{B}_t$ we compute a self\-confidence signal,

$$\ell_v^{G}\;=\;\ell_{\text{CE}}\bigl(f_{\text{GNN}}(v),\,\hat{y}_v^{G}\bigr)\quad\text{(GNN)},\qquad d_v\;=\;\min_{j}\,\log p_{\text{LLM}}(w_j\mid w_{<j})\quad\text{(LLM)},$$

where low $\ell_v^{G}$ means the GNN strongly fits the label it just produced and high $d_v$ means no token along the LLM’s produced answer was uncertain\. We keep the most confident top\-$R(t)$ fraction\. The selection ratio $R(t)\in(0,1]$ is annealed linearly from $R_{\min}$ at $t=1$ to $R_{\max}$ at $t=T$,

$$R(t)\;=\;R_{\min}+(R_{\max}-R_{\min})\cdot\frac{t-1}{T-1},\tag{2}$$

so early rounds \(when both models are still weak\) exchange few but reliable pseudo\-labels, while later rounds exchange more\. The two paired sets $\mathcal{S}_t^{G\to L}=\{(v,\,\hat{y}_v^{G})\}$ and $\mathcal{S}_t^{L\to G}=\{(v,\,\hat{y}_v^{L})\}$ resulting from this exchange feed the model updates that follow\.
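A minimal sketch of the selection step, assuming the GNN exposes per-node logits and the LLM exposes per-token log-probabilities of its generated answer. The helper names, the rounding rule (`ceil`, with at least one node kept), and the handling of $T=1$ are assumptions that Eq. (2) does not specify.

```python
import math


def selection_ratio(t, T, r_min, r_max):
    """Eq. (2): linear anneal from r_min at t=1 to r_max at t=T."""
    if T == 1:
        return r_max  # assumption: Eq. (2) is undefined for a single round
    return r_min + (r_max - r_min) * (t - 1) / (T - 1)


def gnn_self_loss(logits):
    """Cross-entropy of the logits against the GNN's own argmax prediction."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - m  # equals -log softmax(logits)[argmax]


def llm_min_logprob(token_logprobs):
    """d_v: log-probability of the least confident token in the answer."""
    return min(token_logprobs)


def top_fraction(scores, ratio, higher_is_better):
    """Return the most confident ceil(ratio * n) node ids, best first."""
    n_keep = max(1, math.ceil(ratio * len(scores)))
    ranked = sorted(scores, key=scores.get, reverse=higher_is_better)
    return ranked[:n_keep]


gnn_logits = {"a": [4.0, 0.0, 0.0], "b": [1.0, 0.9, 0.8],
              "c": [3.0, 0.0, 1.0], "d": [0.2, 0.1, 0.0]}
llm_logprobs = {"a": [-0.05, -0.10], "b": [-0.01, -2.30],
                "c": [-0.02, -0.03], "d": [-1.20, -0.40]}

ratio = selection_ratio(t=1, T=5, r_min=0.5, r_max=1.0)
gnn_scores = {v: gnn_self_loss(l) for v, l in gnn_logits.items()}
llm_scores = {v: llm_min_logprob(lp) for v, lp in llm_logprobs.items()}

# Low loss is confident for the GNN; high min log-prob is confident for the LLM.
gnn_selected = top_fraction(gnn_scores, ratio, higher_is_better=False)
llm_selected = top_fraction(llm_scores, ratio, higher_is_better=True)
```

In this toy batch both models keep nodes `a` and `c` at ratio 0.5. Node `b` shows why the LLM criterion uses the minimum rather than the mean: one hesitant token (log-probability -2.30) is enough to exclude the answer.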

### 4\.2 Updating the LLM and GNN

Intuition\. The exchanged pseudo\-labels are the peer’s best guesses on the receiver’s weak spots, so each model now has access to supervision it could not have produced alone\. We update both models on the same training data structure, anchors plus peer\-supplied pseudo\-labels, but with model\-specific losses\. Both updates start from the previous round’s checkpoint, so each model carries forward the gains from earlier rounds\.

#### LLM SFT\.

At round $t$, the LLM $f_{\text{LLM}}$ is fine\-tuned on the labeled anchors $\mathcal{V}_{\text{train}}$ together with the GNN\-selected pseudo\-label set $\mathcal{S}_t^{G\to L}$, starting from the previous round’s adapter\. Let $\hat{y}$ denote the supervision target: the ground\-truth label $y_v$ when $v\in\mathcal{V}_{\text{train}}$, and the GNN’s prediction $\hat{y}_v^{G}$ when $v\in\mathcal{S}_t^{G\to L}$\. Let $\hat{y}_{<i}$ denote its first $i-1$ tokens\. We minimise the standard token\-level cross\-entropy

$$\mathcal{L}_{\text{LLM}}^{(t)}\;=\;-\sum_{(v,\hat{y})\,\in\,\mathcal{V}_{\text{train}}\,\cup\,\mathcal{S}_t^{G\to L}}\;\sum_{i=1}^{|\hat{y}|}\log p_{\text{LLM}}\big(\hat{y}_i\mid\mathcal{P}(v),\,\hat{y}_{<i}\big),\tag{3}$$

with $p_{\text{LLM}}$ the LLM’s next\-token distribution and $\mathcal{P}(v)$ the prompt defined in §[3](https://arxiv.org/html/2606.11583#S3)\.
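As a small worked instance of Eq. (3), the sketch below sums the negative log-likelihood over target tokens, given the probability the model assigns to each target token. The function name `sft_loss` and the list-of-probabilities input are illustrative assumptions; in practice these probabilities come from the LLM's forward pass.

```python
import math


def sft_loss(examples):
    """Eq. (3): summed negative log-likelihood over all target tokens.

    examples: list of lists; each inner list holds
    p_LLM(y_i | P(v), y_<i) for the tokens of one supervision target.
    """
    return -sum(math.log(p) for probs in examples for p in probs)


anchor = [0.9, 0.8]  # ground-truth label, tokenized into two tokens
pseudo = [0.5]       # GNN-supplied pseudo-label, a single token
loss = sft_loss([anchor, pseudo])
```

Note that Eq. (3) treats anchors and pseudo-labels identically: both enter the same sum with no mixing weight, unlike the GNN objective in Eq. (4).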

#### GNN training\.

The GNN is trained on the few ground\-truth anchors and the larger LLM\-selected pseudo\-label set\. Naively concatenating the two would let the larger pseudo\-label set drown out the anchor signal, so we average each loss within its own set and combine them with a round\-dependent weight\. This gives a convex combination of an anchor loss and a pseudo\-label loss:

$$\mathcal{L}_{\text{GNN}}^{(t)}\;=\;(1-\alpha_t)\cdot\underbrace{\tfrac{1}{|\mathcal{V}_{\text{train}}|}\sum_{(v,y)\in\mathcal{V}_{\text{train}}}\ell_{\text{CE}}(f_{\text{GNN}}(v),y)}_{\text{anchor loss}}\;+\;\alpha_t\cdot\underbrace{\tfrac{1}{|\mathcal{S}_t^{L\to G}|}\sum_{(v,\hat{y}_v^{L})\in\mathcal{S}_t^{L\to G}}\ell_{\text{CE}}(f_{\text{GNN}}(v),\hat{y}_v^{L})}_{\text{pseudo-label loss}}.\tag{4}$$

The mixing weight $\alpha_t$ is annealed linearly from $\alpha_0$ at round 1 to $\alpha_{\max}$ at round $T$\. Early rounds emphasise the anchor signal\. Later rounds, when the LLM produces cleaner pseudo\-labels, weight those more heavily, while the within\-set averaging prevents the larger pseudo\-label set from washing out the anchors\.
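The following sketch evaluates Eq. (4) on toy logits to show the effect of within-set averaging. The helper names are hypothetical, and the linear form of `alpha_schedule` mirrors Eq. (2) as an assumption, since the text only states that $\alpha_t$ is annealed linearly between its endpoints. The sketch assumes both sets are non-empty.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one logit vector against an integer class index."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def mean_ce(batch):
    """Average cross-entropy over a non-empty list of (logits, label) pairs."""
    return sum(cross_entropy(l, y) for l, y in batch) / len(batch)


def alpha_schedule(t, T, alpha_0, alpha_max):
    """Linear anneal of the mixing weight from alpha_0 (t=1) to alpha_max (t=T)."""
    if T == 1:
        return alpha_max  # assumption: single-round edge case
    return alpha_0 + (alpha_max - alpha_0) * (t - 1) / (T - 1)


def gnn_loss(anchors, pseudo, alpha_t):
    """Eq. (4): convex combination of anchor loss and pseudo-label loss."""
    return (1 - alpha_t) * mean_ce(anchors) + alpha_t * mean_ce(pseudo)


anchors = [([2.0, 0.0], 0)]          # one ground-truth anchor
pseudo = [([0.0, 0.0], 1)] * 100     # 100 LLM-supplied pseudo-labels

loss_small = gnn_loss(anchors, pseudo, alpha_t=0.25)
loss_large = gnn_loss(anchors, pseudo * 10, alpha_t=0.25)
```

Growing the pseudo-label set tenfold leaves the loss unchanged up to floating-point error, because each term is averaged within its own set. Only $\alpha_t$ controls how much the pseudo-labels count relative to the anchors.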

### 4\.3 Reinforcing Pseudo\-Label Quality with RPL\-PO

Figure 3: RPL\-PO zoom\-in: every two rounds, the LLM’s even\-round answer \(peer\-confirmed\) is preferred\. In round $t-1$ \(odd\) the LLM disagrees with the GNN; in round $t$ \(even\) it agrees\. The pair \(chosen = even\-round answer, rejected = odd\-round answer\) is passed to DPO\.

We incorporate RPL\-PO to mine an additional supervisory signal from the cross\-round trajectory itself: the same LLM’s two answers on the same node, before and after one round of teaching, form a free preference pair without any label or external judge\.

#### Intuition\.

Round\-by\-round co\-teaching gives us something single\-round pipelines do not have: *the same LLM’s two answers on the same node, before and after one round of teaching*\. When the second answer is endorsed by the now\-stronger GNN while the first was not, the LLM has visibly self\-corrected on that node, and we want training to make this correction stick\. We harvest this signal as *Round\-based Pseudo\-Label Preference Optimization* \(RPL\-PO\), illustrated in Fig\. [3](https://arxiv.org/html/2606.11583#S4.F3): the later, peer\-endorsed answer is preferred over the earlier, peer\-rejected answer, and the LLM is updated by DPO \[[35](https://arxiv.org/html/2606.11583#bib.bib35)\] on these pairs\. As argued in §[1](https://arxiv.org/html/2606.11583#S1), this lets us mine supervision from the trajectory itself, without designating either model as the golden teacher\.

Consecutive rounds $(2k-1,\,2k)$ are seeded to draw the *same* batch, so for every node $v$ we have an odd-round LLM prediction $\hat{y}_{v}^{L,\text{odd}}$ and an even-round prediction $\hat{y}_{v}^{L,\text{even}}$ on the same input. We keep $v$ in the preference set when both conditions hold:

$$
\underbrace{\hat{y}_{v}^{L,\text{odd}}\neq\hat{y}_{v}^{G,\text{odd}}}_{\text{(i) odd round disagrees}}\quad\text{and}\quad\underbrace{\hat{y}_{v}^{L,\text{even}}=\hat{y}_{v}^{G,\text{even}}\ \ \text{and}\ \ \hat{y}_{v}^{L,\text{even}}\neq\hat{y}_{v}^{L,\text{odd}}}_{\text{(ii) LLM changes to agree}}. \tag{5}
$$

The second condition ensures the pair represents genuine self-correction by the LLM rather than the GNN drifting onto an unchanged LLM. The LLM updated its answer, and the now-stronger GNN endorses the new one. We set

$$
\text{chosen}\;=\;\hat{y}_{v}^{L,\text{even}},\qquad\text{rejected}\;=\;\hat{y}_{v}^{L,\text{odd}}, \tag{6}
$$

and apply standard DPO[[35](https://arxiv.org/html/2606.11583#bib.bib35)] with the SFT checkpoint as the reference. The contrast is across *time* (same model, two training stages) and grounded by cross-model consensus, providing supervision that requires no human annotation and no external reward model.
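As a concrete reading of Eqs. (5) and (6), the sketch below filters nodes into preference pairs and evaluates the standard per-pair DPO objective. It is an illustration under stated assumptions, not the paper's code: predictions are represented as plain `{node: label}` dictionaries, the function names are hypothetical, and `beta=0.1` is a placeholder value (the paper's setting is not given in this section).

```python
import math


def mine_preference_pairs(llm_odd, gnn_odd, llm_even, gnn_even):
    """Return {node: (chosen, rejected)} for nodes satisfying Eq. (5)."""
    pairs = {}
    for v in llm_odd:
        if v not in gnn_odd or v not in llm_even or v not in gnn_even:
            continue  # the node must appear in both rounds for both models
        disagreed_before = llm_odd[v] != gnn_odd[v]   # condition (i)
        agrees_now = llm_even[v] == gnn_even[v]       # condition (ii), first part
        llm_changed = llm_even[v] != llm_odd[v]       # condition (ii), second part
        if disagreed_before and agrees_now and llm_changed:
            pairs[v] = (llm_even[v], llm_odd[v])      # Eq. (6)
    return pairs


def dpo_pair_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log sigmoid(beta * margin)."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Toy predictions for four nodes (labels are arbitrary class ids).
llm_odd = {"a": 0, "b": 1, "c": 2, "d": 3}
gnn_odd = {"a": 1, "b": 1, "c": 0, "d": 0}
llm_even = {"a": 1, "b": 1, "c": 2, "d": 1}
gnn_even = {"a": 1, "b": 1, "c": 2, "d": 0}
pairs = mine_preference_pairs(llm_odd, gnn_odd, llm_even, gnn_even)
```

Only node `a` survives here: `b` never disagreed, `c` reaches agreement because the GNN moved while the LLM stayed put, and `d` still disagrees in the even round. The log-probabilities passed to `dpo_pair_loss` would come from the current LLM and from the SFT reference checkpoint.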

## 5 Experiments

### 5.1 Experimental Setup

#### Datasets\.

We evaluate on six text-attributed graphs spanning citation networks (Cora[[2](https://arxiv.org/html/2606.11583#bib.bib2)], Citeseer[[3](https://arxiv.org/html/2606.11583#bib.bib3)], PubMed[[3](https://arxiv.org/html/2606.11583#bib.bib3)], ogbn-arxiv[[1](https://arxiv.org/html/2606.11583#bib.bib1)]), Wikipedia hyperlinks (WikiCS[[67](https://arxiv.org/html/2606.11583#bib.bib67)]), and an Amazon product co-purchase subset (ogbn-products[[1](https://arxiv.org/html/2606.11583#bib.bib1)]). Five of these (Cora, Citeseer, PubMed, ogbn-arxiv, and ogbn-products) overlap with the benchmarks of Xu and Ding[[34](https://arxiv.org/html/2606.11583#bib.bib34)]. We additionally include WikiCS, on which their numbers are not available, and we re-run the published implementations of GNN-as-Judge and the recent LLM-as-Predictor baselines (LLM-GNN, LLaGA, GraphGPT) to fill the WikiCS column. Full per-dataset statistics (#nodes, #edges, #features, #classes) and the exact $k$-shot training / validation / test splits we use are listed in Appendix [E](https://arxiv.org/html/2606.11583#A5).

#### Baselines\.

We compare against methods from three categories. (1) *Classical GNN models*: GCN[[20](https://arxiv.org/html/2606.11583#bib.bib20)], GAT[[68](https://arxiv.org/html/2606.11583#bib.bib68)], and GraphSAGE[[69](https://arxiv.org/html/2606.11583#bib.bib69)]. (2) *LLM-as-Predictors*: Zero-shot, Graph-CoT[[70](https://arxiv.org/html/2606.11583#bib.bib70)], and neighbor-augmented prompting. (3) *LLM-Graph methods*: GLEM[[13](https://arxiv.org/html/2606.11583#bib.bib13)], TAPE[[4](https://arxiv.org/html/2606.11583#bib.bib4)], LLM-GNN[[8](https://arxiv.org/html/2606.11583#bib.bib8)], LLaGA[[11](https://arxiv.org/html/2606.11583#bib.bib11)], GraphGPT[[9](https://arxiv.org/html/2606.11583#bib.bib9)], and GNN-as-Judge[[34](https://arxiv.org/html/2606.11583#bib.bib34)].

#### Implementation details\.

Full implementation details are deferred to Appendix [F](https://arxiv.org/html/2606.11583#A6). For GLEM[[13](https://arxiv.org/html/2606.11583#bib.bib13)], TAPE[[4](https://arxiv.org/html/2606.11583#bib.bib4)], LLM-GNN[[8](https://arxiv.org/html/2606.11583#bib.bib8)], LLaGA[[11](https://arxiv.org/html/2606.11583#bib.bib11)], GraphGPT[[9](https://arxiv.org/html/2606.11583#bib.bib9)], and GNN-as-Judge[[34](https://arxiv.org/html/2606.11583#bib.bib34)], we verified that the results from [[34](https://arxiv.org/html/2606.11583#bib.bib34)] are reproducible under matched splits, and report their numbers on Cora, Citeseer, PubMed, ogbn-arxiv, and ogbn-products. WikiCS is not covered by [[34](https://arxiv.org/html/2606.11583#bib.bib34)], so we run these baselines on it ourselves. All remaining baselines are implemented by us.

### 5.2 Few-Shot Semi-Supervised Node Classification

Table 1: Node classification accuracy (%) on six benchmarks under 3/5/10-shot settings. Bold: best.

Table [1](https://arxiv.org/html/2606.11583#S5.T1) presents the main results. We highlight three key observations.

Obs 1. LLM-GNN Co-Teaching achieves the best accuracy on all six benchmarks across all three label budgets. The average absolute gain over the strongest prior method (GNN-as-Judge) is $+5.40\%$ at 3-shot, with the largest individual lift of $+7.86\%$ on Cora. The gains shrink but stay consistently positive as the label budget grows, indicating that the trajectory-mined supervision in RPL-PO is most useful when labels are scarcest.

Obs 2. The improvement scales with task difficulty. On the large, fine-grained benchmarks (ogbn-arxiv with 40 classes, ogbn-products with 47 classes) and on the topologically heterogeneous WikiCS, classical GNNs and recent LLM-as-Predictor baselines both deteriorate sharply (LLaGA and GraphGPT drop to roughly 30% accuracy on 3-shot ogbn-arxiv), revealing that pure structural inductive bias and pure semantic prompting both struggle when the label budget is small and the class space is large. LLM-GNN Co-Teaching narrows the gap to the supervised regime that prior methods could not close.

Obs 3. Bidirectional co-teaching beats every form of unidirectional supervision. LLM-as-Enhancer methods (TAPE, GLEM, LLM-GNN) and LLM-as-Predictor methods (LLaGA, GraphGPT) treat one model’s outputs as fixed supervision; GNN-as-Judge keeps the supervision flow unidirectional but inverts who judges whom. LLM-GNN Co-Teaching is the only entry where neither model is fixed as teacher, and it dominates all of these on every (dataset, shot) cell. Removing the golden-teacher constraint thus pays off across the spectrum from frozen-feature transfer to single-round agreement filtering, which we further dissect through ablations and error-structure analysis below.

### 5.3 Ablation Study

We ablate the key components of LLM-GNN Co-Teaching under the 3-shot setting across all six datasets (Table [2](https://arxiv.org/html/2606.11583#S5.T2)). *Co-teaching structure*: removing bidirectional teaching (teach-once, GNN frozen after Round 0) reduces performance close to the GNN-as-Judge baseline, confirming that mutual improvement is essential, while removing RPL-PO isolates the additional gain attributable to trajectory-based preference optimization on top of SFT. *Selection mechanism*: replacing the linearly annealed $R(t)$ with a fixed ratio (0.5 or 0.2) consistently underperforms, and agreement-based selection (keeping only nodes where GNN and LLM agree) lags small-loss ranking on most datasets, indicating that confidence-based ranking provides value beyond simple agreement filtering. *Training configuration*: removing neighbor information from the LLM prompt drops accuracy on every dataset, showing that structural context in the prompt complements the LLM’s text-only view.
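To make the selection variants in this ablation concrete, the sketch below contrasts small-loss ranking under a linearly interpolated ratio with plain agreement filtering. It is a hedged illustration: the exact form of $R(t)$ and its endpoints are defined elsewhere in the paper, so `r_start`, `r_end`, the function names, and the toy values are assumptions.

```python
import math


def keep_ratio(t, T, r_start, r_end):
    """Linearly interpolate a selection ratio between round 1 and round T.

    r_start and r_end are hypothetical endpoints; the direction of the
    schedule depends on how R(t) is defined in the method section.
    """
    if T < 2 or not 1 <= t <= T:
        raise ValueError("expected T >= 2 and 1 <= t <= T")
    return r_start + (t - 1) / (T - 1) * (r_end - r_start)


def small_loss_select(losses, ratio):
    """Keep the ceil(ratio * n) nodes with the smallest per-node loss."""
    k = max(1, math.ceil(ratio * len(losses)))
    ranked = sorted(losses, key=losses.get)  # ascending loss
    return ranked[:k]


def agreement_select(gnn_pred, llm_pred):
    """Ablation variant: keep nodes on which both models predict the same label."""
    return [v for v in gnn_pred if v in llm_pred and gnn_pred[v] == llm_pred[v]]


# Toy per-node losses and predictions.
losses = {"a": 0.05, "b": 1.2, "c": 0.3, "d": 0.9}
gnn_pred = {"a": 0, "b": 1, "c": 2, "d": 1}
llm_pred = {"a": 0, "b": 2, "c": 1, "d": 1}

by_loss = small_loss_select(losses, ratio=0.5)
by_agreement = agreement_select(gnn_pred, llm_pred)
```

In this toy case the two rules pick different sets: small-loss ranking keeps `a` and `c`, while agreement filtering keeps `a` and `d`. Ranking can therefore pass along a confident prediction that the peer currently disagrees with, which agreement filtering discards.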

Table 2: Ablation study (3-shot). We report best LLM test accuracy (%) across rounds. Ablation rows are mean ± std over 3 seeds. Each row removes or modifies one component from the full method.

### 5.4 Cross-Dataset Zero-Shot Transfer

We evaluate the zero-shot generalization of LLM-GNN Co-Teaching by training on ogbn-arxiv and evaluating on Cora, Citeseer, and PubMed *without any fine-tuning on the target dataset*. Unlike GNNs, which require task-specific classification heads, LLMs trained via co-teaching can transfer across label sets because the learned capability is text-based classification, not tied to a fixed output space. Table [3](https://arxiv.org/html/2606.11583#S5.T3) shows that LLM-GNN Co-Teaching achieves strong zero-shot transfer, outperforming prior LLM-graph methods. This suggests that iterative co-teaching improves the LLM’s general graph reasoning ability, not just its performance on the training distribution.

### 5.5 Pseudo-Label Quality over Rounds

Table 3: Zero-shot cross-dataset accuracy.

![Refer to caption](https://arxiv.org/html/2606.11583v1/x1.png)

Figure 4: Per-round signals on Cora.

For each round we record the accuracy of the small-loss-selected GNN and LLM pseudo-label streams that each peer feeds the other, together with downstream LLM test accuracy. On Cora, the GNN stream stays at 97–100%, above the GNN-pseudo quality of GNN-as-Judge (GAJ, 93.1%) and uniform Random selection (88.2%); the LLM stream rises from 80% at R1 to about 87%, an upward trajectory that fixed-rule baselines cannot match. The clean pseudo-labels translate into accuracy gains: the LLM climbs from 78.9% (R1) to 85.9% (R19). The same pattern holds on ogbn-arxiv, PubMed, and WikiCS (App. [C](https://arxiv.org/html/2606.11583#A3)).
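The per-round stream accuracy described above can be computed with a small diagnostic helper. The sketch assumes that ground-truth labels of the selected nodes are available for analysis only (they are not part of training), and the names `stream_accuracy` and `log_round` are hypothetical.

```python
def stream_accuracy(selected_nodes, pseudo_labels, true_labels):
    """Fraction of selected pseudo-labels that match the ground truth."""
    if not selected_nodes:
        return None  # nothing was selected this round
    hits = sum(1 for v in selected_nodes if pseudo_labels[v] == true_labels[v])
    return hits / len(selected_nodes)


def log_round(round_idx, gnn_stream, llm_stream, true_labels):
    """Collect the accuracy of both pseudo-label streams for one round.

    Each stream is a (selected_nodes, pseudo_labels) tuple.
    """
    gnn_nodes, gnn_labels = gnn_stream
    llm_nodes, llm_labels = llm_stream
    return {
        "round": round_idx,
        "gnn_stream_acc": stream_accuracy(gnn_nodes, gnn_labels, true_labels),
        "llm_stream_acc": stream_accuracy(llm_nodes, llm_labels, true_labels),
    }


# Toy data: the GNN stream selects two nodes, the LLM stream selects four.
true_labels = {"a": 0, "b": 1, "c": 2, "d": 1}
gnn_stream = (["a", "b"], {"a": 0, "b": 1})
llm_stream = (["a", "c", "d", "b"], {"a": 0, "c": 2, "d": 0, "b": 1})
record = log_round(1, gnn_stream, llm_stream, true_labels)
```

Here the GNN stream is fully correct and the LLM stream has one wrong label out of four, so the record holds 1.0 and 0.75. Plotting such records over rounds yields curves of the kind summarized in Fig. 4.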

### 5.6 Error Structure Analysis

To examine *where* co-teaching helps, we analyze how errors are structured along node degree on ogbn-arxiv (3-shot, $1{,}000$ test nodes); degree is one representative axis, not a complete characterization. Figure [5](https://arxiv.org/html/2606.11583#S5.F5)(a) plots the smoothed per-degree error rate of the no-teaching baseline. Panels (b)–(d) plot the *error-fraction density*: a Gaussian KDE on the degrees of misclassified nodes, scaled by $n_{\text{err}}/N_{\text{total}}$ so the area under each curve equals the model’s overall error rate. The three densities compare *No teaching*, *GNN-as-Judge*[[34](https://arxiv.org/html/2606.11583#bib.bib34)] (its full SFT+ORPO pipeline, using GAJ checkpoints), and LLM-GNN Co-Teaching; (b) and (d) share node IDs from the same arxiv run.
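The scaling in the error-fraction density can be checked numerically: a KDE integrates to 1, so multiplying it by $n_{\text{err}}/N_{\text{total}}$ gives a curve whose area is the error rate. The sketch below uses a hand-rolled Gaussian KDE to stay dependency-free. The fixed bandwidth, the use of raw (not log) degree, and the toy degrees are assumptions; the paper does not specify these details here.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a Gaussian kernel density estimate that integrates to 1."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density


def error_fraction_density(error_degrees, n_total, bandwidth=1.0):
    """KDE over degrees of misclassified nodes, scaled by n_err / N_total."""
    kde = gaussian_kde(error_degrees, bandwidth)
    scale = len(error_degrees) / n_total

    def density(x):
        return scale * kde(x)

    return density


def trapezoid_area(f, lo, hi, steps=2000):
    """Approximate the integral of f over [lo, hi] with the trapezoid rule."""
    h = (hi - lo) / steps
    inner = sum(f(lo + i * h) for i in range(1, steps))
    return h * (0.5 * (f(lo) + f(hi)) + inner)


# Toy example: 5 misclassified nodes out of 10 test nodes (error rate 0.5).
error_degrees = [1, 1, 2, 3, 8]
n_total = 10
density = error_fraction_density(error_degrees, n_total, bandwidth=1.0)
area = trapezoid_area(density, -10.0, 20.0)
```

The integration range is chosen wide enough that the Gaussian tails are negligible, so `area` matches the error rate of 0.5 up to numerical error. This property is what makes curves from models with different overall accuracy directly comparable on one axis.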

![Refer to caption](https://arxiv.org/html/2606.11583v1/x2.png)

Figure 5: Error structure on ogbn-arxiv 3-shot. (a) Per-degree error rate at the no-teaching baseline: LLM stays nearly flat at $\approx 0.5$, while GNN spikes to $0.85$ at degree $1$ and decays to $0.65$ at high degree. (b)–(d) Error-fraction density across stages.

Panel (a) reveals two complementary failure modes: the LLM is degree-invariant (text-only classification ignores neighborhood size), while the GNN fails on cold nodes ($P(\mathrm{error}\,|\,\mathrm{deg}{=}1)\approx 0.85$ vs. $0.65$ at high degree). GAJ’s single-pass pipeline (c) reduces the GNN’s low-degree peak relative to (b) but cannot close the gap, because it never updates the GNN past Stage 0. Co-teaching (d) attains the smallest error mass under both curves and roughly halves the GNN low-degree spike, suggesting that repeated rounds transfer the LLM’s degree-robust signal to the GNN where neighbors are sparse, while the LLM benefits from a progressively cleaner GNN-selected pseudo-label set.

### 5.7 Time Analysis, Robustness Check, and Sensitivity Check

For space, the time analysis (App. [A](https://arxiv.org/html/2606.11583#A1), [G](https://arxiv.org/html/2606.11583#A7)), robustness check (App. [H](https://arxiv.org/html/2606.11583#A8)), and sensitivity check (App. [I](https://arxiv.org/html/2606.11583#A9)) are deferred to the appendix.

## 6 Conclusion

We proposed LLM-GNN Co-Teaching, a co-teaching framework that abandons the golden-teacher assumption and enables mutual improvement between GNNs and LLMs for graph learning. By combining bidirectional co-teaching with RPL-PO, which mines preference pairs from the round-by-round training trajectory, LLM-GNN Co-Teaching establishes a new state of the art on six graph benchmarks, improving over GNN-as-Judge by up to 7.86% under 3-shot supervision, with consistent gains under 5-shot supervision and in zero-shot cross-dataset transfer. More broadly, our work demonstrates that bidirectional co-teaching, in which the LLM and the GNN teach each other, is an effective paradigm for LLM graph learning. We believe this paradigm extends naturally to other tasks in which GNNs and LLMs provide complementary views of the data. Limitations of the present study are discussed in Appendix [A](https://arxiv.org/html/2606.11583#A1).

## References

- Hu et al\. \[2020\]Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec\.Open graph benchmark: Datasets for machine learning on graphs\.*Advances in neural information processing systems*, 33:22118–22133, 2020\.
- Yang et al\. \[2016\]Zhilin Yang, William Cohen, and Ruslan Salakhudinov\.Revisiting semi\-supervised learning with graph embeddings\.In*International conference on machine learning*, pages 40–48\. PMLR, 2016\.
- Sen et al\. \[2008\]Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi\-Rad\.Collective classification in network data\.*AI magazine*, 29\(3\):93–93, 2008\.
- He et al\. \[2023\]Xiaoxin He, Xavier Bresson, Thomas Laurent, Adam Perold, Yann LeCun, and Bryan Hooi\.Harnessing explanations: Llm\-to\-lm interpreter for enhanced text\-attributed graph representation learning\.*arXiv preprint arXiv:2305\.19523*, 2023\.
- Brown et al\. \[2020\]Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al\.Language models are few\-shot learners\.*Advances in neural information processing systems*, 33:1877–1901, 2020\.
- Achiam et al\. \[2023\]Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al\.Gpt\-4 technical report\.*arXiv preprint arXiv:2303\.08774*, 2023\.
- Grattafiori et al\. \[2024\]Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al\-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al\.The llama 3 herd of models\.*arXiv preprint arXiv:2407\.21783*, 2024\.
- Chen et al\. \[2023\]Zhikai Chen, Haitao Mao, Hongzhi Wen, Haoyu Han, Wei Jin, Haiyang Zhang, Hui Liu, and Jiliang Tang\.Label\-free node classification on graphs with large language models \(llms\)\.*arXiv preprint arXiv:2310\.04668*, 2023\.
- Tang et al\. \[2024\]Jiabin Tang, Yuhao Yang, Wei Wei, Lei Shi, Lixin Su, Suqi Cheng, Dawei Yin, and Chao Huang\.Graphgpt: Graph instruction tuning for large language models\.In*Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval*, pages 491–500, 2024\.
- Ye et al\. \[2024\]Ruosong Ye, Caiqi Zhang, Runhui Wang, Shuyuan Xu, and Yongfeng Zhang\.Language is all a graph needs\.In*Findings of the association for computational linguistics: EACL 2024*, pages 1955–1973, 2024\.
- Chen et al\. \[2024\]Runjin Chen, Tong Zhao, Ajay Jaiswal, Neil Shah, and Zhangyang Wang\.Llaga: Large language and graph assistant\.*arXiv preprint arXiv:2402\.08170*, 2024\.
- Liu et al\. \[2023\]Hao Liu, Jiarui Feng, Lecheng Kong, Ningyue Liang, Dacheng Tao, Yixin Chen, and Muhan Zhang\.One for all: Towards training one graph model for all classification tasks\.*arXiv preprint arXiv:2310\.00149*, 2023\.
- Zhao et al\. \[2022\]Jianan Zhao, Meng Qu, Chaozhuo Li, Hao Yan, Qian Liu, Rui Li, Xing Xie, and Jian Tang\.Learning on large\-scale text\-attributed graphs via variational inference\.*arXiv preprint arXiv:2210\.14709*, 2022\.
- Wang et al\. \[2024a\]Jianing Wang, Junda Wu, Yupeng Hou, Yao Liu, Ming Gao, and Julian McAuley\.Instructgraph: Boosting large language models via graph\-centric instruction tuning and preference alignment\.In*Findings of the Association for Computational Linguistics: ACL 2024*, pages 13492–13510, 2024a\.
- Ding et al\. \[2020\]Kaize Ding, Jianling Wang, Jundong Li, Kai Shu, Chenghao Liu, and Huan Liu\.Graph prototypical networks for few\-shot learning on attributed networks\.In*Proceedings of the 29th ACM international conference on information & knowledge management*, pages 295–304, 2020\.
- Ding et al\. \[2022\]Kaize Ding, Jianling Wang, James Caverlee, and Huan Liu\.Meta propagation networks for graph few\-shot semi\-supervised learning\.In*Proceedings of the AAAI conference on artificial intelligence*, volume 36, pages 6524–6531, 2022\.
- Wang et al\. \[2022\]Song Wang, Chen Chen, and Jundong Li\.Graph few\-shot learning with task\-specific structures\.*Advances in Neural Information Processing Systems*, 35:38925–38936, 2022\.
- Huang and Zitnik \[2020\]Kexin Huang and Marinka Zitnik\.Graph meta learning via local subgraphs\.*Advances in neural information processing systems*, 33:5862–5874, 2020\.
- Sun et al\. \[2020\]Ke Sun, Zhouchen Lin, and Zhanxing Zhu\.Multi\-stage self\-supervised learning for graph convolutional networks on graphs with few labeled nodes\.In*Proceedings of the AAAI conference on artificial intelligence*, volume 34, pages 5892–5899, 2020\.
- Kipf and Welling \[2016\]Thomas N Kipf and Max Welling\.Semi\-supervised classification with graph convolutional networks\.*arXiv preprint arXiv:1609\.02907*, 2016\.
- Wu et al\. \[2019\]Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger\.Simplifying graph convolutional networks\.In*International conference on machine learning*, pages 6861–6871\. PMLR, 2019\.
- Gasteiger et al\. \[2018\]Johannes Gasteiger, Aleksandar Bojchevski, and Stephan Günnemann\.Predict then propagate: Graph neural networks meet personalized pagerank\.*arXiv preprint arXiv:1810\.05997*, 2018\.
- Xu et al\. \[2018\]Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken\-ichi Kawarabayashi, and Stefanie Jegelka\.Representation learning on graphs with jumping knowledge networks\.In*International conference on machine learning*, pages 5453–5462\. PMLR, 2018\.
- Li et al\. \[2018\]Qimai Li, Zhichao Han, and Xiao\-Ming Wu\.Deeper insights into graph convolutional networks for semi\-supervised learning\.In*Proceedings of the AAAI conference on artificial intelligence*, volume 32, 2018\.
- Zhu et al\. \[2020\]Jiong Zhu, Yujun Yan, Lingxiao Zhao, Mark Heimann, Leman Akoglu, and Danai Koutra\.Beyond homophily in graph neural networks: Current limitations and effective designs\.*Advances in neural information processing systems*, 33:7793–7804, 2020\.
- Huang et al\. \[2023\]Jin Huang, Xingjian Zhang, Qiaozhu Mei, and Jiaqi Ma\.Can llms effectively leverage graph structural information through prompts, and why?*arXiv preprint arXiv:2309\.16595*, 2023\.
- Wu et al\. \[2025\]Xixi Wu, Yifei Shen, Fangzhou Ge, Caihua Shan, Yizhu Jiao, Xiangguo Sun, and Hong Cheng\.When do llms help with node classification? a comprehensive analysis\.*arXiv preprint arXiv:2502\.00829*, 2025\.
- Dai et al\. \[2024\]Xinnan Dai, Haohao Qu, Yifei Shen, Bohang Zhang, Qihao Wen, Wenqi Fan, Dongsheng Li, Jiliang Tang, and Caihua Shan\.How do large language models understand graph patterns? a benchmark for graph pattern comprehension\.In*The Thirteenth International Conference on Learning Representations*, 2024\.
- Guo et al\. \[2023\]Jiayan Guo, Lun Du, Hengyu Liu, Mengyu Zhou, Xinyi He, and Shi Han\.Gpt4graph: Can large language models understand graph structured data? an empirical evaluation and benchmarking\.*arXiv preprint arXiv:2305\.15066*, 2023\.
- Li et al\. \[2023\]Yichuan Li, Kaize Ding, and Kyumin Lee\.Grenade: Graph\-centric language model for self\-supervised representation learning on text\-attributed graphs\.In*Findings of the Association for Computational Linguistics: EMNLP 2023*, pages 2745–2757, 2023\.
- Yang et al\. \[2021\]Junhan Yang, Zheng Liu, Shitao Xiao, Chaozhuo Li, Defu Lian, Sanjay Agrawal, Amit Singh, Guangzhong Sun, and Xing Xie\.Graphformers: Gnn\-nested transformers for representation learning on textual graph\.*Advances in Neural Information Processing Systems*, 34:28798–28810, 2021\.
- Wang et al\. \[2024b\]Duo Wang, Yuan Zuo, Fengzhi Li, and Junjie Wu\.Llms as zero\-shot graph learners: Alignment of gnn representations with llm token embeddings\.*Advances in neural information processing systems*, 37:5950–5973, 2024b\.
- Hu et al\. \[2024\]Zhengyu Hu, Yichuan Li, Zhengyu Chen, Jingang Wang, Han Liu, Kyumin Lee, and Kaize Ding\.Let’s ask gnn: Empowering large language model for graph in\-context learning\.In*Findings of the Association for Computational Linguistics: EMNLP 2024*, pages 1396–1409, 2024\.
- Xu and Ding \[2026\]Ruiyao Xu and Kaize Ding\.Gnn\-as\-judge: Unleashing the power of llms for graph learning with gnn feedback\.*arXiv preprint arXiv:2604\.08553*, 2026\.
- Rafailov et al\. \[2023\]Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn\.Direct preference optimization: Your language model is secretly a reward model\.*Advances in neural information processing systems*, 36:53728–53741, 2023\.
- Wen and Fang \[2023\]Zhihao Wen and Yuan Fang\.Augmenting low\-resource text classification with graph\-grounded pre\-training and prompting\.In*Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval*, pages 506–516, 2023\.
- Yu et al\. \[2025\]Jianxiang Yu, Yuxiang Ren, Chenghua Gong, Jiaqi Tan, Xiang Li, and Xuecang Zhang\.Leveraging large language models for node generation in few\-shot learning on text\-attributed graphs\.In*Proceedings of the AAAI conference on artificial intelligence*, volume 39, pages 13087–13095, 2025\.
- Sheng et al\. \[2025\]Zeang Sheng, Weiyang Guo, Yingxia Shao, Wentao Zhang, and Bin Cui\.Llms are noisy oracles\! llm\-based noise\-aware graph active learning for node classification\.In*Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V\. 2*, pages 2526–2537, 2025\.
- Han et al\. \[2018\]Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama\.Co\-teaching: Robust training of deep neural networks with extremely noisy labels\.*Advances in neural information processing systems*, 31, 2018\.
- Yu et al\. \[2019\]Xingrui Yu, Bo Han, Jiangchao Yao, Gang Niu, Ivor Tsang, and Masashi Sugiyama\.How does disagreement help generalization against label corruption?In*International conference on machine learning*, pages 7164–7173\. PMLR, 2019\.
- Li et al\. \[2020\]Junnan Li, Richard Socher, and Steven CH Hoi\.Dividemix: Learning with noisy labels as semi\-supervised learning\.*arXiv preprint arXiv:2002\.07394*, 2020\.
- Qiao et al\. \[2018\]Siyuan Qiao, Wei Shen, Zhishuai Zhang, Bo Wang, and Alan Yuille\.Deep co\-training for semi\-supervised image recognition\.In*Proceedings of the european conference on computer vision \(eccv\)*, pages 135–152, 2018\.
- Ma et al\. \[2017\]Fan Ma, Deyu Meng, Qi Xie, Zina Li, and Xuanyi Dong\.Self\-paced co\-training\.In*International Conference on Machine Learning*, pages 2275–2284\. PMLR, 2017\.
- Kumar et al\. \[2010\]M Kumar, Benjamin Packer, and Daphne Koller\.Self\-paced learning for latent variable models\.*Advances in neural information processing systems*, 23, 2010\.
- Natarajan et al\. \[2013\]N\. Natarajan, I\. S\. Dhillon, P\. Ravikumar, and A\. Tewari\.Learning with noisy labels\.In*NeurIPS*, 2013\.
- Gui et al\. \[2021\]Xian\-Jin Gui, Wei Wang, and Zhang\-Hao Tian\.Towards understanding deep learning from noisy labels with small\-loss criterion\.*arXiv preprint arXiv:2106\.09291*, 2021\.
- Chen et al\. \[2019\]Pengfei Chen, Ben Ben Liao, Guangyong Chen, and Shengyu Zhang\.Understanding and utilizing deep neural networks trained with noisy labels\.In*International conference on machine learning*, pages 1062–1070\. PMLR, 2019\.
- Cheng et al\. \[2020\]Hao Cheng, Zhaowei Zhu, Xingyu Li, Yifei Gong, Xing Sun, and Yang Liu\.Learning with instance\-dependent label noise: A sample sieve approach\.*arXiv preprint arXiv:2010\.02347*, 2020\.
- Luo et al\. \[2024\]Junyu Luo, Xiao Luo, Kaize Ding, Jingyang Yuan, Zhiping Xiao, and Ming Zhang\.Robustft: Robust supervised fine\-tuning for large language models under noisy response\.*arXiv preprint arXiv:2412\.14922*, 2024\.
- Shumailov et al\. \[2023\]Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson\.The curse of recursion: Training on generated data makes models forget\.*arXiv preprint arXiv:2305\.17493*, 2023\.
- Lee et al\. \[2013\]Dong\-Hyun Lee et al\.Pseudo\-label: The simple and efficient semi\-supervised learning method for deep neural networks\.In*Workshop on challenges in representation learning, ICML*, volume 3, page 896\. Atlanta, 2013\.
- Mukherjee and Awadallah \[2020\]Subhabrata Mukherjee and Ahmed Awadallah\.Uncertainty\-aware self\-training for few\-shot text classification\.*Advances in Neural Information Processing Systems*, 33:21199–21212, 2020\.
- Rizve et al\. \[2021\]Mamshad Nayeem Rizve, Kevin Duarte, Yogesh S Rawat, and Mubarak Shah\.In defense of pseudo\-labeling: An uncertainty\-aware pseudo\-label selection framework for semi\-supervised learning\.*arXiv preprint arXiv:2101\.06329*, 2021\.
- Wang and Leskovec \[2020\]Hongwei Wang and Jure Leskovec\.Unifying graph convolutional neural networks and label propagation\.*arXiv preprint arXiv:2002\.06755*, 2020\.
- Liu et al\. \[2022\]Hongrui Liu, Binbin Hu, Xiao Wang, Chuan Shi, Zhiqiang Zhang, and Jun Zhou\.Confidence may cheat: Self\-training on graph neural networks under distribution shift\.In*Proceedings of the ACM web conference 2022*, pages 1248–1258, 2022\.
- Cai et al\. \[2017\]Hongyun Cai, Vincent W Zheng, and Kevin Chen\-Chuan Chang\.Active learning for graph embedding\.*arXiv preprint arXiv:1705\.05085*, 2017\.
- Zhang et al\. \[2021\]Wentao Zhang, Yexin Wang, Zhenbang You, Meng Cao, Ping Huang, Jiulong Shan, Zhi Yang, and Bin Cui\.Rim: Reliable influence\-based active learning on graphs\.*Advances in neural information processing systems*, 34:27978–27990, 2021\.
- Christiano et al\. \[2017\]Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei\.Deep reinforcement learning from human preferences\.*Advances in neural information processing systems*, 30, 2017\.
- Stiennon et al\. \[2020\]Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano\.Learning to summarize with human feedback\.*Advances in neural information processing systems*, 33:3008–3021, 2020\.
- Ouyang et al\. \[2022\]Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al\.Training language models to follow instructions with human feedback\.*Advances in neural information processing systems*, 35:27730–27744, 2022\.
- Azar et al\. \[2024\]Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bilal Piot, Remi Munos, Mark Rowland, Michal Valko, and Daniele Calandriello\.A general theoretical paradigm to understand learning from human preferences\.In*International Conference on Artificial Intelligence and Statistics*, pages 4447–4455\. PMLR, 2024\.
- Ethayarajh et al\. \[2024\]Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela\.Model alignment as prospect theoretic optimization\.In*Forty\-first International Conference on Machine Learning*, 2024\.
- Meng et al\. \[2024\]Yu Meng, Mengzhou Xia, and Danqi Chen\.Simpo: Simple preference optimization with a reference\-free reward\.*Advances in Neural Information Processing Systems*, 37:124198–124235, 2024\.
- Zhao et al\. \[2023\]Siyan Zhao, John Dang, and Aditya Grover\.Group preference optimization: Few\-shot alignment of large language models\.*arXiv preprint arXiv:2310\.11523*, 2023\.
- Amini et al\. \[2024\]Afra Amini, Tim Vieira, and Ryan Cotterell\.Direct preference optimization with an offset\.In*Findings of the Association for Computational Linguistics: ACL 2024*, pages 9954–9972, 2024\.
- Hong et al\. \[2024\]Jiwoo Hong, Noah Lee, and James Thorne\.Orpo: Monolithic preference optimization without reference model\.In*Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing*, pages 11170–11189, 2024\.
- Mernyei and Cangea \[2020\]Péter Mernyei and Cătălina Cangea\.Wiki\-cs: A wikipedia\-based benchmark for graph neural networks\.*arXiv preprint arXiv:2007\.02901*, 2020\.
- Veličković et al\. \[2018\]Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio\.Graph attention networks\.In*International Conference on Learning Representations*, 2018\.
- Hamilton et al\. \[2017\]Will Hamilton, Zhitao Ying, and Jure Leskovec\.Inductive representation learning on large graphs\.*Advances in neural information processing systems*, 30, 2017\.
- Wei et al\. \[2022\]Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al\.Chain\-of\-thought prompting elicits reasoning in large language models\.*Advances in neural information processing systems*, 35:24824–24837, 2022\.
- Liu et al\. \[2019\]Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov\.Roberta: A robustly optimized bert pretraining approach\.*arXiv preprint arXiv:1907\.11692*, 2019\.
- Hu et al\. \[2022\]Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen\-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al\.Lora: Low\-rank adaptation of large language models\.In*International Conference on Learning Representations*, 2022\.

## Appendix A Limitations

#### Time complexity\.

LLM-GNN Co-Teaching increases time complexity over a single-shot LLM-on-graph pipeline by a factor of $T$, the number of co-teaching rounds, because each round repeats vLLM inference on the unlabeled batch, an LLM SFT pass on the anchors plus the GNN-selected pseudo-labels, and a GNN training pass on the full graph. Every even round additionally runs a DPO pass on the temporal preference pairs from rounds $t-1$ and $t$. The GNN cost is negligible against the two LLM operations because the GNN has $\mathcal{O}(10^{5})$ parameters versus the LLM’s $\mathcal{O}(10^{9})$. We measured end-to-end wall-clock time on a single NVIDIA A100-40GB on ogbn-arxiv 3-shot. Initialisation (data preparation, initial GNN training, and the warm-up LLM SFT) consumes about 17 minutes. Each co-teaching round takes about 26 minutes (~13 min for the LLM SFT pass, ~6 min for vLLM cross-inference on the 1,500-node batch, ~1.5 min for the GNN re-train, and ~3 min for the held-out evaluation), with even rounds adding ~5 min for the DPO update. The framework remains flexible because the user controls $T$ directly and the LLM accuracy plateaus well before our default. On ogbn-arxiv 3-shot, $T=10$ already delivers the strongest result reported in Table [1](https://arxiv.org/html/2606.11583#S5.T1) and finishes in about 282 minutes (~4 h 42 min). On smaller benchmarks the effective $T$ is even lower because the LLM converges earlier. A held-out validation signal supports early stopping, so practitioners can dial $T$ to match their compute budget without sacrificing the reported gains. Appendix [G](https://arxiv.org/html/2606.11583#A7) compares our wall-clock time against every baseline in the same A100-40GB setting.

#### Coverage of data domains\.

We evaluate on six text\-attributed graphs from three domains: citation networks \(Cora, Citeseer, PubMed, ogbn\-arxiv\), Wikipedia hyperlinks \(WikiCS\), and Amazon co\-purchase \(ogbn\-products\)\. All six are settings where node text is descriptive and the LLM can extract a strong unimodal signal from text alone\. We have not validated the framework on graphs where node text is noisy, sparse, or absent, such as molecular property prediction, biological interaction networks, or financial transaction graphs\. The assumption that GNN structural inductive bias and LLM semantic reasoning offer complementary signal is most clearly supported in our setting and may need re\-examination in domains where one of the two views is degraded\. Extending the evaluation to such domains is left for future work\.

## Appendix B Broader Impacts

#### Positive impacts\.

LLM\-GNN Co\-Teaching targets the few\-shot regime, where labeled data is scarce\. This setting is common in domains where annotation is expensive or only domain experts can produce labels, including rare scientific subfields, niche legal or policy corpora, low\-resource languages, and biomedical sub\-disciplines\. By turning an LLM and a GNN into mutual teachers, our method extracts more value from the few existing labels and from the unlabeled remainder of the graph\. The training signal comes entirely from the model trajectory, with no human annotators, no reward model, and no external judge, which lowers the practical barrier for groups with limited labeling budgets\. The framework is also model\-agnostic at the LLM side, so the same recipe applies to smaller open\-weight LLMs and is therefore accessible to research groups without frontier\-scale resources\.

#### Negative impacts\.

The recipe inherits the well\-known concerns of large language models\. LLM\-generated pseudo\-labels can carry the social, demographic, or political biases of the pretraining corpus, and the iterative co\-teaching loop can amplify any bias that the GNN does not correct\. We mitigate this with the small\-loss criterion, which keeps only the LLM’s most confident predictions, but confident LLM mistakes are still possible and can persist across rounds\. The compute and energy footprint per run is higher than a single\-shot pipeline, as quantified in Appendix[A](https://arxiv.org/html/2606.11583#A1), which has the usual implications for carbon cost\. Finally, automated node classification at scale on social or behavioral graphs raises privacy concerns when the underlying data is not consented or the application is surveillance\-oriented\. Practitioners deploying this framework outside academic benchmarks should audit the pretrained LLM for domain bias, evaluate calibration on a held\-out set, and follow the dataset licence and consent conditions\.

## Appendix C Per\-Round Pseudo\-Label Quality

Figure [6](https://arxiv.org/html/2606.11583#A3.F6) shows the same per\-round signals as Figure [4](https://arxiv.org/html/2606.11583#S5.F4) \(§[5\.5](https://arxiv.org/html/2606.11583#S5.SS5)\) but for all four datasets \(arxiv, cora, pubmed, wikics\)\. The trends shown for cora/pubmed in the main paper hold across the additional datasets: top\-10% pseudo\-label streams climb across rounds and stay in the $80$–$99\%$ band, while LLM test accuracy improves across rounds\.

![Refer to caption](https://arxiv.org/html/2606.11583v1/x3.png)

Figure 6: Per\-round LLM test accuracy and top\-10% GNN/LLM pseudo\-label quality across all four 3\-shot benchmarks \(arxiv, cora, pubmed, wikics\)\. Blue \(circles\): LLM test acc\. Green \(squares\): GNN top\-10% pseudo\-label GT\-acc\. Salmon \(triangles\): LLM top\-10% pseudo\-label GT\-acc\. Pseudo\-label quality at round $t+1$ is paired with the test accuracy at round $t$ since round\-$(t+1)$’s pseudo\-labels are produced by the round\-$t$ models\.
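The "top-10% pseudo-label GT-acc" metric plotted above can be computed along the following lines. This is a minimal sketch under stated assumptions: `top_fraction_accuracy` is a hypothetical helper, and `scores` stands in for whatever per-node ranking signal is used (the paper ranks with an architecture-specific small-loss criterion, for which a negated loss could be passed here).

```python
def top_fraction_accuracy(scores, pseudo_labels, true_labels, fraction=0.10):
    """Ground-truth accuracy of the `fraction` highest-scoring pseudo-labels."""
    if not (len(scores) == len(pseudo_labels) == len(true_labels)):
        raise ValueError("inputs must have the same length")
    if not scores:
        raise ValueError("inputs must be non-empty")
    n_keep = max(1, round(fraction * len(scores)))
    # Rank node indices from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = order[:n_keep]
    hits = sum(pseudo_labels[i] == true_labels[i] for i in top)
    return hits / n_keep
```

Ground-truth labels are used here only to evaluate pseudo-label quality after the fact, as in the figure; they play no part in selecting the pseudo-labels.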
## Appendix D LLM SFT Prompt Specifications

The following boxes give the verbatim per\-dataset prompt template appended to each node’s text during LLM SFT\. The user message passed to Llama\-3\-8B\-Instruct is the concatenation of \(i\) the node’s raw text \(and, when using neighbor information, the 1\-hop and 2\-hop neighbor titles\) and \(ii\) the dataset\-specific prompt below\.

cora \(7 classes\)

```
Question: Which of the following sub-categories of AI does this paper
belong to? Here are the 7 categories: Rule_Learning, Neural_Networks,
Case_Based, Genetic_Algorithms, Theory, Reinforcement_Learning,
Probabilistic_Methods. Reply only one category that you think this paper
might belong to. Only reply the category phrase without any other
explanation words.

Answer:
```

citeseer \(6 classes\)

```
Question: Which of the following theme does this paper belong to?
Here are the 6 categories: Agents, ML (Machine Learning), IR
(Information Retrieval), DB (Databases), HCI (Human-Computer
Interaction), AI (Artificial Intelligence). Reply only one category
that you think this paper might belong to. Only reply the category
full name I give you without any other words.

Answer:
```

pubmed \(3 classes\)

```
Question: Which of the following topic does this scientific publication
talk about? Here are the 3 categories: Experimental, Diabetes Mellitus
Type 1, Diabetes Mellitus Type 2. Reply only one category that you
think this paper might belong to. Only reply the category name without
any other words.

Answer:
```

wikics \(10 classes\)

```
Question: Which of the following branch of Computer science does this
Wikipedia-based dataset belong to? Here are the 10 categories:
Computational Linguistics, Databases, Operating Systems, Computer
Architecture, Computer Security, Internet Protocols, Computer File
Systems, Distributed Computing Architecture, Web Technology,
Programming Language Topics. Reply only one category that you think
this paper might belong to. Only reply the category full name without
any other words.

Answer:
```

ogbn\-arxiv \(40 classes\)

```
Question: Which of the following arXiv CS sub-categories does this
dataset belong to? Here are the 40 categories:
’arxiv cs na’, ’arxiv cs mm’, ’arxiv cs lo’, ’arxiv cs cy’,
’arxiv cs cr’, ’arxiv cs dc’, ’arxiv cs hc’, ’arxiv cs ce’,
’arxiv cs ni’, ’arxiv cs cc’, ’arxiv cs ai’, ’arxiv cs ma’,
’arxiv cs gl’, ’arxiv cs ne’, ’arxiv cs sc’, ’arxiv cs ar’,
’arxiv cs cv’, ’arxiv cs gr’, ’arxiv cs et’, ’arxiv cs sy’,
’arxiv cs cg’, ’arxiv cs oh’, ’arxiv cs pl’, ’arxiv cs se’,
’arxiv cs lg’, ’arxiv cs sd’, ’arxiv cs si’, ’arxiv cs ro’,
’arxiv cs it’, ’arxiv cs pf’, ’arxiv cs cl’, ’arxiv cs ir’,
’arxiv cs ms’, ’arxiv cs fl’, ’arxiv cs ds’, ’arxiv cs os’,
’arxiv cs gt’, ’arxiv cs db’, ’arxiv cs dl’, ’arxiv cs dm’.
Use the words in this part to answer me, not the explanation part below.

Here are the explanation of each category:
’arxiv cs ai (Artificial Intelligence)’, ’arxiv cs ar (Hardware
Architecture)’, ’arxiv cs cc (Computational Complexity)’,
’arxiv cs ce (Computational Engineering, Finance, and Science)’,
’arxiv cs cg (Computational Geometry)’,
’arxiv cs cl (Computation and Language)’,
’arxiv cs cr (Cryptography and Security)’,
’arxiv cs cv (Computer Vision and Pattern Recognition)’,
’arxiv cs cy (Computers and Society)’, ’arxiv cs db (Databases)’,
’arxiv cs dc (Distributed, Parallel, and Cluster Computing)’,
’arxiv cs dl (Digital Libraries)’, ’arxiv cs dm (Discrete Mathematics)’,
’arxiv cs ds (Data Structures and Algorithms)’,
’arxiv cs et (Emerging Technologies)’,
’arxiv cs fl (Formal Languages and Automata Theory)’,
’arxiv cs gl (General Literature)’, ’arxiv cs gr (Graphics)’,
’arxiv cs gt (Computer Science and Game Theory)’,
’arxiv cs hc (Human-Computer Interaction)’,
’arxiv cs ir (Information Retrieval)’, ’arxiv cs it (Information Theory)’,
’arxiv cs lg (Machine Learning)’, ’arxiv cs lo (Logic in Computer Science)’,
’arxiv cs ma (Multiagent Systems)’, ’arxiv cs mm (Multimedia)’,
’arxiv cs ms (Mathematical Software)’, ’arxiv cs na (Numerical Analysis)’,
’arxiv cs ne (Neural and Evolutionary Computing)’,
’arxiv cs ni (Networking and Internet Architecture)’,
’arxiv cs oh (Other Computer Science)’, ’arxiv cs os (Operating Systems)’,
’arxiv cs pf (Performance)’, ’arxiv cs pl (Programming Languages)’,
’arxiv cs ro (Robotics)’, ’arxiv cs sc (Symbolic Computation)’,
’arxiv cs sd (Sound)’, ’arxiv cs se (Software Engineering)’,
’arxiv cs si (Social and Information Networks)’,
’arxiv cs sy (Systems and Control)’.

Reply only one category that you think this paper might belong to.
Only reply the category name (not the explanation) I given without
any other words.

Answer:
```

ogbn\-products \(47 classes\)

```
Which of the following categories does this product belong to? There
are a total of 47 categories, including Home & Kitchen, Health &
Personal Care, Beauty, Sports & Outdoors, Books, Patio, Lawn & Garden,
Toys & Games, CDs & Vinyl, Cell Phones & Accessories, Grocery &
Gourmet Food, Arts, Crafts & Sewing, Clothing, Shoes & Jewelry,
Electronics, Movies & TV, Software, Video Games, Automotive, Pet
Supplies, Office Products, Industrial & Scientific, Musical
Instruments, Tools & Home Improvement, Magazine Subscriptions, Baby
Products, NAN, Appliances, Kitchen & Dining, Collectibles & Fine Art,
All Beauty, Luxury Beauty, Amazon Fashion, Computers, All Electronics,
Purchase Circles, MP3 Players & Accessories, Gift Cards, Office &
School Supplies, Home Improvement, Camera & Photo, GPS & Navigation,
Digital Music, Car Electronics, Baby, Kindle Store, Kindle Apps,
Furniture. Reply only one category that you think this product might
belong to. Only reply the category name I give of the category without
any other words and numbers.

Answer:
```
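The concatenation described at the start of this appendix (node text, optional neighbor titles, then the dataset prompt) can be sketched as follows. This is only an illustration: `build_user_message` is a hypothetical helper, and the section labels and separators are assumptions, since the exact layout of the neighbor titles is not given here. The cap of five titles per hop follows the hyperparameter table in Appendix F.

```python
def build_user_message(node_text, dataset_prompt, one_hop_titles=(),
                       two_hop_titles=(), use_neighbor_info=True, max_titles=5):
    """Concatenate node text, optional neighbor titles, and the dataset prompt."""
    parts = [node_text.strip()]
    if use_neighbor_info:
        # Hypothetical section labels; the exact wording is not specified in the paper.
        for label, titles in (("1-hop neighbor titles", one_hop_titles),
                              ("2-hop neighbor titles", two_hop_titles)):
            kept = list(titles)[:max_titles]  # keep at most max_titles per hop
            if kept:
                parts.append(label + ": " + "; ".join(kept))
    # The dataset-specific prompt always comes last, so the message ends with "Answer:".
    parts.append(dataset_prompt.strip())
    return "\n\n".join(parts)
```

Keeping the prompt last means the model's completion directly follows the `Answer:` cue in every template above.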

## Appendix E Datasets and Splits

We evaluate on six text\-attributed graphs\. The first five \(Cora, Citeseer, PubMed, ogbn\-arxiv, ogbn\-products\) match the benchmark of Xu and Ding \[[34](https://arxiv.org/html/2606.11583#bib.bib34)\]\. We additionally include WikiCS, a Wikipedia\-hyperlink graph where each node is an English Wikipedia article on computer science\. Per\-dataset statistics are listed in Table [4](https://arxiv.org/html/2606.11583#A5.T4)\. The exact training / validation / test sizes we use under each $k$\-shot setting are given in Table [5](https://arxiv.org/html/2606.11583#A5.T5)\. The test pool is the entire remainder of the graph\.

Table 4: Dataset statistics, all extracted from our local files\. For ogbn\-products\(subset\) we use the curated subset of He et al\. \[[4](https://arxiv.org/html/2606.11583#bib.bib4)\] \(54,025 product nodes, the same subset reused by Xu and Ding \[[34](https://arxiv.org/html/2606.11583#bib.bib34)\]\)\. \#Edges is the count of unique undirected edges actually present in our pipeline’s products graph \(nnz $= 144{,}638$ in the symmetric adjacency, no self\-loops\), \#Features matches the original OGB feature dimension, and \#Classes is the full OGB label space \(only $44$ of the $47$ classes have nodes in the subset\)\.

#### $k$\-shot splits\.

For Cora, Citeseer, PubMed, and WikiCS we sample $k$ labeled nodes per class as the training \(anchor\) set, fix $500$ random nodes as the validation set, and treat the remaining nodes as the test pool\. For ogbn\-products\(subset\) we follow the same per\-class sampling protocol at $k \in \{3, 5, 10\}$, with the validation set fixed at $500$ random nodes and the remainder used as the test pool\. Only $44$ of the $47$ OGB classes appear in the subset, and several of those classes have very few labeled candidates, so the realised 3\-shot anchor set has $92$ nodes \(rather than the nominal $3 \times 47 = 141$\); the realised 5\- and 10\-shot pools likewise undershoot $5 \times 47$ and $10 \times 47$ for the same reason\. We verified the 3\-shot count directly from our pipeline run logs\. For ogbn\-arxiv we adopt the official OGB validation and test splits and re\-sample $k$\-shot training nodes per class from the OGB training pool, treating all other nodes \(the unused OGB\-train remainder $+$ the OGB\-test set\) as the test pool\.
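The per-class sampling protocol for the small graphs can be sketched as below. This is a minimal illustration rather than the authors' split code: `k_shot_split` is a hypothetical helper, the use of Python's `random` module and the default seed are assumptions, and the ogbn-arxiv protocol (which reuses the official OGB validation split) is not covered.

```python
import random
from collections import defaultdict


def k_shot_split(labels, k, val_size=500, seed=42):
    """Sample up to k anchors per class, then val_size validation nodes; the rest is test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for node, label in enumerate(labels):
        by_class[label].append(node)

    train = []
    for label in sorted(by_class):
        candidates = by_class[label]
        # A class with fewer than k candidates contributes all of them, so the
        # realised anchor set can be smaller than k * (number of classes).
        train.extend(rng.sample(candidates, min(k, len(candidates))))

    train_set = set(train)
    remaining = [n for n in range(len(labels)) if n not in train_set]
    val = rng.sample(remaining, min(val_size, len(remaining)))
    val_set = set(val)
    test = [n for n in remaining if n not in val_set]
    return sorted(train), sorted(val), test
```

Classes that are absent from the label list contribute nothing and under-populated classes contribute fewer than `k` nodes, which is the mechanism behind the undershoot on ogbn-products(subset) noted above.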

Table 5: Train / validation / test sizes per dataset and $k$\-shot setting\. Train $= k \times$ \#Classes \(the small labeled anchor set\)\. Val $= 500$ for the small\-graph datasets and the OGB official validation set \($29{,}799$\) for ogbn\-arxiv\. Test is the remainder of the graph\. We evaluate accuracy on a random subset of $1{,}000$ test nodes \(consistent across runs\), reported in Table [1](https://arxiv.org/html/2606.11583#S5.T1)\. † nominal $k \times 47$ for ogbn\-products\(subset\); the 3\-shot row reports the realised pool \($92$ nodes verified from logs\), and the 5\- and 10\-shot realised counts can be lower than nominal because only $44$ of the $47$ classes have labelled candidates\.

| Dataset | 3\-shot Train | 3\-shot Val | 3\-shot Test | 5\-shot Train | 5\-shot Val | 5\-shot Test | 10\-shot Train | 10\-shot Val | 10\-shot Test |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cora | 21 | 500 | 2,187 | 35 | 500 | 2,173 | 70 | 500 | 2,138 |
| Citeseer | 18 | 500 | 2,668 | 30 | 500 | 2,656 | 60 | 500 | 2,626 |
| PubMed | 9 | 500 | 19,208 | 15 | 500 | 19,202 | 30 | 500 | 19,187 |
| WikiCS | 30 | 500 | 11,171 | 50 | 500 | 11,151 | 100 | 500 | 11,101 |
| ogbn\-arxiv | 120 | 29,799 | 139,424 | 200 | 29,799 | 139,344 | 400 | 29,799 | 139,144 |
| ogbn\-products\(subset\) | 92 | 500 | 53,433 | 235† | 500 | 53,290† | 470† | 500 | 53,055† |

#### Per\-dataset notes\.

- Cora, Citeseer, PubMed\. Standard citation networks \[[2](https://arxiv.org/html/2606.11583#bib.bib2),[3](https://arxiv.org/html/2606.11583#bib.bib3)\] where each node is a paper and edges encode citations\. Node text is paper title $+$ abstract\.
- WikiCS\. A Wikipedia hyperlink graph filtered to computer\-science articles, with $10$ subdomain classes \(e\.g\. Computational Linguistics, Operating Systems, Web Technology\)\. Each node’s raw text is the lead paragraph of the article\. WikiCS is *not* included in the GAJ benchmark\. We re\-run their published implementation to fill the WikiCS column of Table [1](https://arxiv.org/html/2606.11583#S5.T1)\.
- ogbn\-arxiv\. An arXiv CS\-paper citation network \[[1](https://arxiv.org/html/2606.11583#bib.bib1)\] with $40$ sub\-categories\. We use the OGB official splits \($90{,}941$ train / $29{,}799$ val / $48{,}603$ test\)\. Our $k$\-shot training set is sub\-sampled from the OGB\-train pool, and the remainder of the graph \(unused OGB\-train $\cup$ OGB\-test\) becomes the test pool\.
- ogbn\-products\(subset\)\. A subset of the Amazon co\-purchase graph from He et al\. \[[4](https://arxiv.org/html/2606.11583#bib.bib4)\], filtered to $54{,}025$ products across $47$ categories, with text being the product title \+ description\. We adopt the same subset and split convention as Xu and Ding \[[34](https://arxiv.org/html/2606.11583#bib.bib34)\]\.

## Appendix F Implementation Details and Hyperparameters

This appendix documents how every method in Table [1](https://arxiv.org/html/2606.11583#S5.T1) is implemented and trained\. Section [F\.1](https://arxiv.org/html/2606.11583#A6.SS1) covers the baseline methods \(the same set used by Xu and Ding \[[34](https://arxiv.org/html/2606.11583#bib.bib34)\], with their numbers reproduced in our main table\)\. Section [F\.2](https://arxiv.org/html/2606.11583#A6.SS2) describes the classical GNN backbones \(GCN, GAT, SAGE\) reported in our table\. Section [F\.3](https://arxiv.org/html/2606.11583#A6.SS3) lists every hyperparameter used by LLM\-GNN Co\-Teaching\.

### F\.1 Baselines

#### Classical GNN backbones \(GCN, GAT, GraphSAGE\)\.

All classical\-GNN baselines are 2\-layer message\-passing networks with $64$\-dimensional hidden representations, trained with Adam \(learning rate $10^{-2}$, weight decay $5 \times 10^{-4}$\) for up to $200$ epochs with early stopping at patience $100$\. Dropout is selected from $\{0.3, 0.5, 0.7\}$ on the validation set and batch normalization is inserted between the two layers\. These choices match the setting in Xu and Ding \[[34](https://arxiv.org/html/2606.11583#bib.bib34)\] so the numbers transfer directly\.

#### LLM\-as\-Predictors\.

Three prompting\-only baselines all use Llama\-3\-8B\-Instruct without any fine\-tuning\. *Zero\-shot* prompts feed the node text and the candidate class names\. *Graph Chain\-of\-Thought* \[[70](https://arxiv.org/html/2606.11583#bib.bib70)\] adds a step\-by\-step reasoning instruction\. *Neighbor\-Augmented Prompting* \[[8](https://arxiv.org/html/2606.11583#bib.bib8)\] additionally appends the texts of the node’s local neighbors\.

#### LLM\-Graph methods\.

- *GLEM* \[[13](https://arxiv.org/html/2606.11583#bib.bib13)\] alternates EM steps between an LM and a GNN\. Following the original setting, EM iterations $= 1$ and pseudo\-label ratio $= 0.5$\. The GNN is the same 2\-layer 64\-d architecture as the classical baselines\. The LM is RoBERTa \[[71](https://arxiv.org/html/2606.11583#bib.bib71)\] with LoRA \[[72](https://arxiv.org/html/2606.11583#bib.bib72)\], batch size $32$, pre\-trained on each dataset before joint training\.
- *TAPE* \[[4](https://arxiv.org/html/2606.11583#bib.bib4)\] uses Llama\-3\-8B\-Instruct \+ LoRA \(default settings\) to generate per\-node explanations that are appended to textual features\. The GNN configuration is identical to the classical baselines\.
- *LLM\-GNN* \[[8](https://arxiv.org/html/2606.11583#bib.bib8)\] is adapted from its zero\-shot original to our few\-shot setting: the LLM \(Llama\-3\-8B\-Instruct\) acts as an annotator on labeled data, the DA\-AGE method from the original paper selects pseudo\-labels, and the GNN is jointly trained on labeled and pseudo\-labeled data\.
- *LLaGA* \[[11](https://arxiv.org/html/2606.11583#bib.bib11)\] uses HO templates with hop count $4$, RoBERTa \[[71](https://arxiv.org/html/2606.11583#bib.bib71)\] as the text encoder, a linear projection $\phi_{\theta}$ implemented as a 2\-layer MLP with hidden dimension $2{,}048$, batch size $64$, learning rate $10^{-4}$, $10$ epochs\.
- *GraphGPT* \[[9](https://arxiv.org/html/2606.11583#bib.bib9)\] uses two instruction\-tuning stages on dataset\-specific graph\-matching tasks\. Self\-supervised stage: lr $10^{-4}$, batch $16$, $2$ epochs\. Task\-specific stage: lr $10^{-4}$, batch $32$, $10$ epochs\.
- *GNN\-as\-Judge* \[[34](https://arxiv.org/html/2606.11583#bib.bib34)\] is the closest competitor\. The GNN is a 2\-layer 64\-d GCN with the same training setup as the classical baselines\. The LLM is Llama\-3\-8B\-Instruct \+ LoRA \($r = 8$, $\alpha = 16$, dropout $0.1$, batch size $8$\)\. Instruction tuning runs for $10$ epochs at lr $5 \times 10^{-6}$\. The subsequent weakly\-supervised fine\-tuning runs for $8$ epochs at lr $10^{-5}$ with the IT/PT mixing weight $\lambda = 0.1$\. Top\-$K = 1{,}500$ influential nodes are selected for pseudo\-labeling and the ORPO preference threshold $\tau$ is fixed at $0.7$\.

### F\.2 Our GCN, GAT, and SAGE Backbones

The Classical\-GNN rows of Table [1](https://arxiv.org/html/2606.11583#S5.T1) use GCN, GAT, and GraphSAGE with identical optimization \(Adam, lr $10^{-2}$, weight decay $5 \times 10^{-4}$\), depth \($2$ layers\), and stopping \(early stop at patience $100$, max $500$ epochs\) as the classical baselines above\. Architecture\-specific choices: GAT uses $4$ attention heads in layer 1 and $1$ head in layer 2, GraphSAGE uses mean aggregation, and dropout is fixed at $0.5$ for all three\. We re\-train these from scratch under each $n$\-shot setting and report mean and standard deviation over $5$ random seeds\.

### F\.3 LLM\-GNN Co\-Teaching Hyperparameters

Table [6](https://arxiv.org/html/2606.11583#A6.T6) lists every hyperparameter used to produce the LLM\-GNN Co\-Teaching results in Table [1](https://arxiv.org/html/2606.11583#S5.T1)\. All six 3\-shot primary runs \(cora, citeseer, pubmed, wikics, ogbn\-arxiv, ogbn\-products\) share these values\. The only per\-dataset variation is the dataset\-specific prompt template \(Appendix [D](https://arxiv.org/html/2606.11583#A4)\)\. The identical setting was applied for the 5\- and 10\-shot rows\.

Table 6: Full hyperparameter configuration for LLM\-GNN Co\-Teaching on the 3\-shot benchmarks\.

| Stage | Hyperparameter | Value |
| --- | --- | --- |
| GNN initialization \(Stage 2\) | Architecture | GCN |
| | Hidden dimension | 64 |
| | Number of layers | 2 |
| | Dropout | 0\.5 |
| | Learning rate | $1 \times 10^{-2}$ |
| | Epochs \(early stop patience\) | 500 \(100\) |
| LLM warm\-up SFT \(Stage 3\) | Base model | Llama\-3\-8B\-Instruct |
| | PEFT method | LoRA, $r = 8$, $\alpha = 16$ |
| | Learning rate | $5 \times 10^{-6}$ |
| | Epochs | 10 |
| | Per\-device batch size | 4 \(grad\. accum\. $= 1$\) |
| | Anchor repeat $K$ | 3 |
| Per\-round LLM SFT | Learning rate | $2 \times 10^{-5}$ |
| | Epochs | 2; 5; 10 |
| | Per\-device batch size | 4 |
| | Anchor repeat $K$ | 3 |
| Per\-round GNN training | Learning rate | $1 \times 10^{-3}$ |
| | Epochs | 200 |
| | Pseudo\-label weight $\alpha$ | $0.3 \to 0.7$ \(linear\) |
| Co\-teaching schedule | Number of rounds $T$ | 20 |
| | Unlabeled batch size $B$ | 1500 |
| | $R(t)$ min | linear, $R_{\min} = 0.05; 0.1; 0.2; 0.3; 0.4$ |
| | $R(t)$ max | linear, $R_{\max} = 0.5; 0.6; 0.7; 0.8; 0.9; 1$ |
| | Neighbor info in LLM prompt | 1\-hop \($\leq 5$ titles\) \+ 2\-hop \($\leq 5$\) |
| RPL\-PO \(even rounds\) | Learning rate | $5 \times 10^{-6}$ |
| | Epochs | 1 |
| | $\beta$ | 0\.1 |
| | Loss type | sigmoid \(vanilla DPO\) |
| Hardware / numerics | GPU | $1\times$ NVIDIA RTX 5880 Ada \(46 GB\) |
| | Precision | bf16 |
| | Random seed | 42 |
| | Wall\-clock per round | $\sim 25$ min \(arxiv\), $\sim 5$–$10$ min \(cora\) |

#### Per\-round seed for the unlabeled batch\.

The unlabeled batch sampler uses seed $= 42 + \lfloor (t-1)/2 \rfloor$ at round $t$, so consecutive odd/even rounds share the same batch\. This pairing is what makes the RPL\-PO preference construction \(§[4](https://arxiv.org/html/2606.11583#S4)\) meaningful: a node’s odd\-round LLM prediction and the corresponding even\-round \(post\-DPO\) LLM prediction are comparable on the same input\.
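The seed rule and the pairing it enables can be sketched as follows. `batch_seed` implements the formula above directly. `mine_preference_pairs` is a hypothetical illustration of the contradiction-to-agreement rule from the abstract, not the authors' code: it takes dictionaries mapping node ids to predicted labels, and skipping nodes whose LLM answer did not change between the two rounds is an added assumption, since a pair with identical chosen and rejected answers carries no preference signal.

```python
def batch_seed(t, base_seed=42):
    """Sampler seed for co-teaching round t (1-indexed): base_seed + floor((t - 1) / 2)."""
    return base_seed + (t - 1) // 2


def mine_preference_pairs(llm_prev, gnn_prev, llm_new, gnn_new):
    """Return {node: (chosen, rejected)} for nodes that moved from LLM/GNN
    contradiction in the earlier round to LLM/GNN agreement in the later round."""
    shared = llm_prev.keys() & gnn_prev.keys() & llm_new.keys() & gnn_new.keys()
    pairs = {}
    for node in shared:
        contradicted_before = llm_prev[node] != gnn_prev[node]
        agree_now = llm_new[node] == gnn_new[node]
        # Assumption: require the LLM answer to have changed, so that the
        # chosen (new) and rejected (old) answers actually differ.
        llm_changed = llm_prev[node] != llm_new[node]
        if contradicted_before and agree_now and llm_changed:
            pairs[node] = (llm_new[node], llm_prev[node])
    return pairs
```

Because rounds 1 and 2 (and likewise 3 and 4, and so on) map to the same seed, both rounds' predictions cover the same nodes, so the intersection of keys above is the whole batch.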

#### Reproducibility\.

The exact pipeline launcher is `bash pipeline.sh <dataset> 3 42 20 <llama-snapshot>` with environment variables `USE_NEIGHBOR_INFO=1`, `USE_DPO=1`, `RT_MIN=0.2`, `RT_MAX=0.6`\. All other defaults in Table [6](https://arxiv.org/html/2606.11583#A6.T6) come from `config.sh` and `pipeline.sh` fallbacks in the codebase\.
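Several entries in Table 6 are marked "linear", namely the pseudo-label weight ($0.3 \to 0.7$) and the $R(t)$ bounds set through `RT_MIN` and `RT_MAX`. A linear schedule over rounds can be sketched as below. This is an assumption-laden illustration: `linear_schedule` is a hypothetical helper, interpolating over the round index from round 1 to round $T$ is assumed, and which endpoint $R(t)$ starts from is not stated in this appendix.

```python
def linear_schedule(t, num_rounds, start, end):
    """Linearly interpolate from `start` at round 1 to `end` at round `num_rounds`."""
    if not 1 <= t <= max(1, num_rounds):
        raise ValueError("round index t must lie in [1, num_rounds]")
    if num_rounds <= 1:
        return end
    frac = (t - 1) / (num_rounds - 1)
    return start + frac * (end - start)


# Example: pseudo-label weight over T = 20 rounds, moving from 0.3 to 0.7.
alpha_by_round = [linear_schedule(t, 20, 0.3, 0.7) for t in range(1, 21)]
```

Swapping the `start` and `end` arguments gives a decreasing schedule, so the same helper covers either direction for $R(t)$.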

## Appendix G Time Analysis on ogbn\-arxiv \(3\-shot\)

To complement the time\-complexity discussion in Appendix [A](https://arxiv.org/html/2606.11583#A1), Figure [7](https://arxiv.org/html/2606.11583#A7.F7) plots end\-to\-end wall\-clock on a single NVIDIA A100\-40GB against the corresponding 3\-shot test accuracy on ogbn\-arxiv\. All methods are run under their declared training schedules; LLM\-GNN Co\-Teaching runs the default $T = 10$ co\-teaching rounds plus the round\-0 initialisation\.

\[Scatter plot: Accuracy \(%\) and Total Time \(min\) for GCN, LLaGA, GLEM, LLM\-GNN, GNN\-as\-Judge, GraphGPT, TAPE, and LG Co\-Teaching\.\]

Figure 7: Training time versus ogbn\-arxiv 3\-shot accuracy on a single A100\-40GB\. LG Co\-Teaching \(red\) sits in the same cost band as TAPE and GraphGPT but achieves the highest accuracy by a clear margin\. GNN\-as\-Judge \(cyan, thicker border\) is the strongest baseline\.

#### Discussion\.

Three patterns stand out\. *Raw cost does not predict accuracy\.* TAPE and GraphGPT each consume close to five hours and still trail GAJ on ogbn\-arxiv\. *LLM\-GNN Co\-Teaching sits in the high\-cost band but pays back the compute\.* Its wall\-clock is within five percent of TAPE and GraphGPT, yet it lifts accuracy by $21.69$ percentage points over TAPE and $38.68$ percentage points over GraphGPT, and beats GAJ by $7.73$ percentage points at roughly $2.6\times$ the GAJ runtime\. *The cheap end caps out below $50\%$\.* GCN, GLEM, LLaGA, and LLM\-GNN finish in under half an hour and reach at most $42.36\%$ on this 40\-class task, the regime where pure GNNs or single\-shot LLM\-on\-graph methods cannot extract more from three labels per class\. Iterative co\-teaching trades extra compute for a markedly better point on the cost\-accuracy curve\.

## Appendix H Robustness to Backbone Choice

We swap the GNN and the LLM peers in turn while keeping every other hyperparameter fixed at the default in Appendix [F](https://arxiv.org/html/2606.11583#A6) and re\-run the 3\-shot pipeline across all six benchmarks\. Three variants are compared with the Full Model \(Llama\-3\-8B\-Instruct \+ 2\-layer GCN\) in Table [7](https://arxiv.org/html/2606.11583#A8.T7)\. Vicuna\-7B substitutes a weaker LLM with no graph\-aware pretraining\. GAT and GraphSAGE substitute the GNN backbone, with all other hyperparameters left at their classical\-GNN defaults from Appendix [F\.2](https://arxiv.org/html/2606.11583#A6.SS2)\.

Table 7: Robustness to backbone choice \(3\-shot, mean $\pm$ std over 3 seeds\)\. Full Model uses Llama\-3\-8B\-Instruct and a 2\-layer GCN\. Each row swaps a single component\.

#### Findings\.

*Robust to GNN specification\.* Swapping the GCN backbone for GAT or GraphSAGE leaves accuracy within roughly one percentage point of the Full Model on every dataset, with the differences sitting inside one standard deviation on most benchmarks\. The framework therefore does not rely on a specific message\-passing architecture\. *Suffers under a low\-capability LLM\.* Swapping Llama\-3\-8B\-Instruct for Vicuna\-7B drops accuracy by $4.0$ to $7.2$ percentage points across all six datasets\. The LLM peer carries the bulk of the semantic signal, so a weaker base model bottlenecks the co\-teaching loop even when the GNN side is held fixed\. The takeaway is that LLM\-GNN Co\-Teaching is robust to GNN choice but expects an LLM with sufficient base capability\.

## Appendix I Hyperparameter Sensitivity

We sweep two hyperparameters of LLM\-GNN Co\-Teaching and report 3\-shot LLM test accuracy across all six benchmarks \(Figure [8](https://arxiv.org/html/2606.11583#A9.F8)\)\. The first sweep varies the unlabeled sample size $|\mathcal{B}_{t}|$ used for cross\-inference at each round \(default $1500$\)\. The second sweep varies the number of DPO epochs in RPL\-PO \(default $1$\)\. Each value is mean $\pm$ std over 3 seeds\.

\[Two line plots of Accuracy \(%\) for Cora, WikiCS, Citeseer, arxiv, PubMed, and products\. Left panel: sample size $|\mathcal{B}_{t}|$ \(300, 500, 1000, 1500\)\. Right panel: RPL\-PO epochs \(1, 2, 3, 4\)\.\]

Figure 8: Sensitivity analysis of two LLM\-GNN Co\-Teaching hyperparameters across all six 3\-shot benchmarks\. Left: accuracy versus the unlabeled sample size $|\mathcal{B}_{t}|$ used for cross\-inference per round; $|\mathcal{B}_{t}| = 1500$ is sufficient for every dataset and adding more nodes barely moves the accuracy\. Right: accuracy versus the number of DPO epochs in RPL\-PO; the curves are flat within one standard deviation, showing that the framework is not sensitive to this knob and a single epoch suffices\.

## Appendix J Per\-Dataset Venn Diagrams of GNN/LLM Correctness

Figure [9](https://arxiv.org/html/2606.11583#A10.F9) visualizes the GNN/LLM correctness sets on three 3\-shot benchmarks \(arxiv, pubmed, wikics\), the datasets for which the linear $R(t)$ run reached the LLM\-test\-accuracy peak well past $R = 7$ co\-teaching rounds\. Each row corresponds to one dataset, and the three columns show R0 \(initial state\), R1 \(after a single teach\-once round\), and the LLM\-test\-accuracy peak under co\-teaching \(R7 for arxiv, R11 for pubmed, R17 for wikics\)\. The shared\-correct region \(overlap of the green GNN\-correct and blue LLM\-correct circles\) grows monotonically across rounds on all three datasets \($167 \to 302 \to 414$ on arxiv, $578 \to 682 \to 803$ on pubmed, $437 \to 560 \to 679$ on wikics\), empirically supporting the picture that GNN and LLM are correct on largely complementary node subsets at initialization, and that co\-teaching expands their joint coverage round by round\. On pubmed, the R1 “both wrong” count is larger than R0’s because the LLM was already very strong at initialization \($80.1\%$\), and a single SFT round on noisy GNN\-selected pseudo\-labels temporarily underperforms the initial model\. This is a known SFT\-overfit even\-round dip that the LLM fully recovers from by R11 \($91.5\%$\)\.

![Refer to caption](https://arxiv.org/html/2606.11583v1/x4.png)

Figure 9: Per\-dataset Venn diagrams of GNN\-correct \(green\) vs\. LLM\-correct \(blue\) sets\. Rows: arxiv \(40 classes, $1{,}000$ test nodes\), pubmed \(3 classes, $1{,}000$ test nodes\), wikics \(10 classes, $1{,}000$ test nodes\)\. Columns: init \(R0\), teach\-once \(R1\), LLM peak under co\-teaching \(R7 / R11 / R17 respectively\)\. Across all three datasets, co\-teaching enlarges the shared\-correct region while shrinking each model’s exclusive\-correct sliver\.
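The four regions shown in each Venn diagram can be counted from per-node predictions as in the sketch below. `correctness_regions` is a hypothetical helper written for illustration; it assumes the GNN predictions, LLM predictions, and ground-truth labels are aligned sequences over the same evaluation nodes.

```python
def correctness_regions(gnn_pred, llm_pred, truth):
    """Count nodes that both models, only one model, or neither model classify correctly."""
    if not (len(gnn_pred) == len(llm_pred) == len(truth)):
        raise ValueError("inputs must have the same length")
    regions = {"both_correct": 0, "gnn_only": 0, "llm_only": 0, "both_wrong": 0}
    for g, l, y in zip(gnn_pred, llm_pred, truth):
        gnn_ok, llm_ok = (g == y), (l == y)
        if gnn_ok and llm_ok:
            regions["both_correct"] += 1   # overlap of the two circles
        elif gnn_ok:
            regions["gnn_only"] += 1       # GNN-exclusive sliver
        elif llm_ok:
            regions["llm_only"] += 1       # LLM-exclusive sliver
        else:
            regions["both_wrong"] += 1     # outside both circles
    return regions
```

Tracking `both_correct` across rounds gives the shared-correct counts quoted above, and the four counts always sum to the number of evaluated nodes.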
