TRN-R1-Zero: Text-rich Network Reasoning via LLMs with Reinforcement Learning Only

arXiv cs.CL 04/22/26, 04:00 AM Papers
Summary
TRN-R1-Zero introduces a post-training framework that enables LLMs to perform zero-shot reasoning on text-rich networks using only reinforcement learning, without supervised fine-tuning or chain-of-thought data.
arXiv:2604.19070v1 Announce Type: new Abstract: Zero-shot reasoning on text-rich networks (TRNs) remains a challenging frontier, as models must integrate textual semantics with relational structure without task-specific supervision. While graph neural networks rely on fixed label spaces and supervised objectives, recent large language model (LLM)-based approaches often overlook graph context or depend on distillation from larger models, limiting generalisation. We propose TRN-R1-Zero, a post-training framework for TRN reasoning trained solely via reinforcement learning. TRN-R1-Zero directly optimises base LLMs using a Neighbour-aware Group Relative Policy Optimisation objective that dynamically adjusts rewards based on a novel margin gain metric for the informativeness of neighbouring signals, effectively guiding the model toward relational reasoning. Unlike prior methods, TRN-R1-Zero requires no supervised fine-tuning or chain-of-thought data generated from large reasoning models. Extensive experiments across citation, hyperlink, social and co-purchase TRN benchmarks demonstrate the superiority and robustness of TRN-R1-Zero. Moreover, relying strictly on node-level training, TRN-R1-Zero achieves zero-shot inference on edge- and graph-level tasks, extending beyond cross-domain transfer. The codebase is publicly available at https://github.com/superallen13/TRN-R1-Zero.
# TRN-R1-Zero: Text-rich Network Reasoning via LLMs with Reinforcement Learning Only
Source: [https://arxiv.org/html/2604.19070](https://arxiv.org/html/2604.19070)

###### Abstract

Zero\-shot reasoning on text\-rich networks \(TRNs\) remains a challenging frontier, as models must integrate textual semantics with relational structure without task\-specific supervision\. While graph neural networks rely on fixed label spaces and supervised objectives, recent large language model \(LLM\)\-based approaches often overlook graph context or depend on distillation from larger models, limiting generalisation\. We propose TRN\-R1\-Zero, a post\-training framework for TRN reasoning trained solely via reinforcement learning\. TRN\-R1\-Zero directly optimises base LLMs using a Neighbour\-aware Group Relative Policy Optimisation objective that dynamically adjusts rewards based on a novel margin gain metric for the informativeness of neighbouring signals, effectively guiding the model toward relational reasoning\. Unlike prior methods, TRN\-R1\-Zero requires no supervised fine\-tuning or chain\-of\-thought data generated from large reasoning models\. Extensive experiments across citation, hyperlink, social and co\-purchase TRN benchmarks demonstrate the superiority and robustness of TRN\-R1\-Zero\. Moreover, relying strictly on node\-level training, TRN\-R1\-Zero achieves zero\-shot inference on edge\- and graph\-level tasks, extending beyond cross\-domain transfer\. The codebase is publicly available at [https://github\.com/superallen13/TRN\-R1\-Zero](https://github.com/superallen13/TRN-R1-Zero)\.

Yilun Liu, Ruihong Qiu and Zi Huang, School of Electrical Engineering and Computer Science, The University of Queensland, Brisbane, Queensland, Australia, \{yilun\.liu, r\.qiu, helen\.huang\}@uq\.edu\.au

## 1 Introduction

Text classification is a cornerstone task in natural language processing, underpinning applications from information retrieval to content recommendation\. Yet, in real\-world scenarios, texts seldom exist in isolation: scientific papers cite one another, Wikipedia pages are interlinked through hyperlinks, users in social networks follow each other, and e\-commerce products often co\-occur in purchases\. These relational connections naturally form text\-rich networks \(TRNs\), where nodes correspond to textual entities and edges capture their semantic or functional relations\. As illustrated in Figure [1](https://arxiv.org/html/2604.19070#S1.F1), TRNs from citation, hyperlink, social, and co\-purchase domains exhibit rich relational structures that go beyond isolated document understanding Tang et al\. \([2024c](https://arxiv.org/html/2604.19070#bib.bib39), [b](https://arxiv.org/html/2604.19070#bib.bib40), [2026](https://arxiv.org/html/2604.19070#bib.bib37)\); Chen et al\. \([2025](https://arxiv.org/html/2604.19070#bib.bib42)\); Wu et al\. \([2024](https://arxiv.org/html/2604.19070#bib.bib41)\); Wang et al\. \([2025a](https://arxiv.org/html/2604.19070#bib.bib38)\); Liu et al\. \([2023](https://arxiv.org/html/2604.19070#bib.bib44), [2025b](https://arxiv.org/html/2604.19070#bib.bib45), [2025a](https://arxiv.org/html/2604.19070#bib.bib46)\)\. Effectively reasoning over such TRNs, particularly in zero\-shot classification settings without domain\-specific supervision, is a crucial step toward more generalisable and context\-aware language intelligence\.

\[Figure 1 image: example TRNs from the citation, hyperlink, social link, and co\-purchase domains, with a sample task, zero\-shot node classification on TRNs: “Given the target paper and its cited papers, what is the most suitable category?”\]

Figure 1: Top: Examples of text\-rich networks \(TRNs\) from citation, hyperlink, social and co\-purchase domains\. Bottom: An example of reasoning\-based user query over TRNs\.

Existing large language model \(LLM\)\-based methods for node classification on TRNs generally follow two paradigms\. \(1\) Encoder\-based approaches use LLMs as text encoders for both node and label descriptions Li et al\. \([2024a](https://arxiv.org/html/2604.19070#bib.bib9)\); Fang et al\. \([2025](https://arxiv.org/html/2604.19070#bib.bib18)\); Wang et al\. \([2025b](https://arxiv.org/html/2604.19070#bib.bib10)\)\. The resulting embeddings are aggregated through structure\-aware mechanisms across neighbouring nodes, and classification is performed via node\-label similarity\. However, these methods largely treat the LLM as a feature extractor rather than an explicit reasoner\. \(2\) Generative approaches, on the other hand, reformulate node classification as a label\-token generation task\. To incorporate structural information, some employ soft\-embedding techniques that align graph encodings with the LLM’s embedding space Tang et al\. \([2024a](https://arxiv.org/html/2604.19070#bib.bib6)\); Wang et al\. \([2024](https://arxiv.org/html/2604.19070#bib.bib7)\); Chen et al\. \([2024b](https://arxiv.org/html/2604.19070#bib.bib17)\); Kong et al\. \([2024](https://arxiv.org/html/2604.19070#bib.bib8)\), while others use natural language descriptions of graph structures as inputs Wang et al\. \([2023](https://arxiv.org/html/2604.19070#bib.bib34)\); Chen et al\. \([2024a](https://arxiv.org/html/2604.19070#bib.bib31)\); Huang et al\. \([2024](https://arxiv.org/html/2604.19070#bib.bib33)\); Li et al\. \([2024b](https://arxiv.org/html/2604.19070#bib.bib19)\); Wu et al\. \([2025a](https://arxiv.org/html/2604.19070#bib.bib13)\)\. Recent work on both text\-based and non\-text networks attempts to transfer reasoning abilities from large reasoning models \(LRMs\) by fine\-tuning on chain\-of\-thought data Chen et al\. \([2024a](https://arxiv.org/html/2604.19070#bib.bib31)\); Wang et al\. \([2025c](https://arxiv.org/html/2604.19070#bib.bib32)\); Wu et al\. \([2025b](https://arxiv.org/html/2604.19070#bib.bib2)\)\. Despite these advances, existing paradigms struggle to directly elicit explicit reasoning within LLMs on TRNs, while often relying on additional supervision or external reasoning resources\.

| Model | LRM | CoT SFT | Reason\. |
| --- | --- | --- | --- |
| GraphWiz | ✓ \(GPT\-4\) | ✓ \(GPT\-4\) | ✓ |
| Graph\-NPH\-R1 | ✗ | ✓ \(QwQ\-32B\) | ✓ |
| Graph\-R1 | ✓ \(DeepSeek\-V3\) | ✓ \(DeepSeek\-R1\) | ✓ |
| TRN\-R1\-Zero | ✗ | ✗ | ✓ |

Table 1: Comparison of reasoning training for LLMs on graph tasks\. For graph theory problems, GraphWiz Chen et al\. \([2024a](https://arxiv.org/html/2604.19070#bib.bib31)\) and Graph\-NPH\-R1 Wang et al\. \([2025c](https://arxiv.org/html/2604.19070#bib.bib32)\) rely on mimicking the CoT process of larger LLMs\. For text\-based network problems, Graph\-R1 Wu et al\. \([2025b](https://arxiv.org/html/2604.19070#bib.bib2)\) and GraphWiz further depend on external LRMs to provide reasoning supervision\. Our TRN\-R1\-Zero achieves reasoning ability without relying on LRMs or their generated CoT data\.

To address these limitations, we propose TRN\-R1\-Zero, a reinforcement learning\-based framework that enables explicit reasoning on TRNs without any supervised fine\-tuning or external distillation\. Unlike encoder\-based methods that treat LLMs merely as feature extractors, or generative approaches that depend on pre\-generated reasoning traces, TRN\-R1\-Zero learns to reason relationally through direct optimisation over the underlying graph context\. We develop a novel Neighbour\-aware Group Relative Policy Optimisation objective with a margin gain metric, which leverages local neighbourhood information as adaptive signals to guide reasoning training, allowing the model to infer structural and semantic relationships for node classification on text\-rich networks in unseen domains\. This design activates the LLM’s reasoning capability intrinsically, rather than relying on external supervision or task\-specific data\. A comparative summary of existing paradigms and TRN\-R1\-Zero is provided in Table [1](https://arxiv.org/html/2604.19070#S1.T1)\. Our main contributions are:

1. An RL\-only pipeline for zero\-shot node classification on TRNs, without distillation, SFT, or external LRMs\.
2. A neighbour\-aware policy objective with a margin gain mechanism that explicitly encourages the use of relational context\.
3. Extensive experiments on citation, hyperlink, social, and co\-purchase TRNs demonstrate consistent zero\-shot gains in cross\-domain and cross\-task settings over prior methods\.

## 2 Related Work

#### Large Language Models for Node Classification\.

Existing approaches to zero\-shot node classification on text\-rich networks \(TRNs\) can be categorised into encoder\-based and generative paradigms\.

Encoder\-based methods use language models \(LMs\) or LLMs primarily as text encoders, generating embeddings for nodes and labels that are subsequently aligned or aggregated by external algorithms\. ZeroG Li et al\. \([2024a](https://arxiv.org/html/2604.19070#bib.bib9)\) fine\-tunes Sentence\-BERT Reimers and Gurevych \([2019](https://arxiv.org/html/2604.19070#bib.bib20)\) with LoRA Hu et al\. \([2022](https://arxiv.org/html/2604.19070#bib.bib21)\) to effectively encode both node texts and label descriptions\. UniGLM Fang et al\. \([2025](https://arxiv.org/html/2604.19070#bib.bib18)\) fine\-tunes BERT Devlin et al\. \([2019](https://arxiv.org/html/2604.19070#bib.bib22)\) into a more generalised text encoder through contrastive learning, boosting the performance of downstream graph models trained in this embedding space\. TAPE He et al\. \([2024a](https://arxiv.org/html/2604.19070#bib.bib23)\) fine\-tunes DeBERTa He et al\. \([2021](https://arxiv.org/html/2604.19070#bib.bib24)\) with explanations and predictions generated from an extra LLM\. Nevertheless, these fine\-tuned encoders exhibit limited generalisation ability because of the small model size and the data scarcity of the tuning phase\. LLM\-BP Wang et al\. \([2025b](https://arxiv.org/html/2604.19070#bib.bib10)\) employs LLM2Vec BehnamGhader et al\. \([2024](https://arxiv.org/html/2604.19070#bib.bib25)\) as a text encoder and applies propagation\-based techniques to integrate neighbour information\. These methods, however, fail to exploit the explicit reasoning capabilities of LLMs\.

Generative methods formulate node classification as a text generation task\. GraphGPT Tang et al\. \([2024a](https://arxiv.org/html/2604.19070#bib.bib6)\), GOFA Kong et al\. \([2024](https://arxiv.org/html/2604.19070#bib.bib8)\), TEA\-GLM Wang et al\. \([2024](https://arxiv.org/html/2604.19070#bib.bib7)\), and LLaGA Chen et al\. \([2024b](https://arxiv.org/html/2604.19070#bib.bib17)\) employ a learnable mapping model to project graph structures into the LLM’s token embedding space, creating soft embeddings that enable the LLM to generate graph\-aware representations after supervised fine\-tuning\. Alternatively, other works describe graph information directly in natural language, enabling the LLM to reason over structural context without explicit graph encoders Huang et al\. \([2024](https://arxiv.org/html/2604.19070#bib.bib33)\); Li et al\. \([2024b](https://arxiv.org/html/2604.19070#bib.bib19)\); Wu et al\. \([2025a](https://arxiv.org/html/2604.19070#bib.bib13)\)\.

#### Large Language Model Reasoning\.

LLMs trained with reinforcement learning have demonstrated impressive reasoning abilities and human\-like performance across a range of tasks, including mathematics Shao et al\. \([2024](https://arxiv.org/html/2604.19070#bib.bib15)\); Balunovic et al\. \([2025](https://arxiv.org/html/2604.19070#bib.bib26)\), task planning Hu et al\. \([2024](https://arxiv.org/html/2604.19070#bib.bib27)\), code generation Jimenez et al\. \([2024](https://arxiv.org/html/2604.19070#bib.bib29)\), and debugging Zhong et al\. \([2024](https://arxiv.org/html/2604.19070#bib.bib28)\)\. Proximal Policy Optimisation \(PPO\) Schulman et al\. \([2017](https://arxiv.org/html/2604.19070#bib.bib16)\) serves as the foundation for reasoning\-oriented fine\-tuning\. The recent Group Relative Policy Optimisation \(GRPO\) Shao et al\. \([2024](https://arxiv.org/html/2604.19070#bib.bib15)\); Guo et al\. \([2025](https://arxiv.org/html/2604.19070#bib.bib3)\) introduces a rule\-based objective that enables reasoning skills without human\-annotated supervision, while Dr\.GRPO Liu et al\. \([2025c](https://arxiv.org/html/2604.19070#bib.bib5)\) further enhances reward shaping and variance control during reasoning training\.

The reasoning ability of LLMs has recently been extended to structured data\. For general graphs without textual attributes, GraphWiz Chen et al\. \([2024a](https://arxiv.org/html/2604.19070#bib.bib31)\) and Graph\-NPH\-R1 Wang et al\. \([2025c](https://arxiv.org/html/2604.19070#bib.bib32)\) leverage large reasoning models \(LRMs\) to generate chain\-of\-thought \(CoT\) Wei et al\. \([2022](https://arxiv.org/html/2604.19070#bib.bib30)\) data for reasoning over graph\-theoretic problems such as shortest path and connectivity\. Graph\-R1 Wu et al\. \([2025b](https://arxiv.org/html/2604.19070#bib.bib2)\) further targets text\-rich graphs, using LRMs to produce long CoT traces that supervise the fine\-tuning of smaller models\. In contrast, TRN\-R1\-Zero removes the need for distillation or externally generated CoT data from larger LRMs, directly eliciting reasoning ability within the base model itself\.

## 3 Methodology: TRN\-R1\-Zero

![Refer to caption](https://arxiv.org/html/2604.19070v1/x1.png)

Figure 2: Overall training pipeline of TRN\-R1\-Zero, comprising three key components: graph sampling, prompt construction, and neighbour\-aware policy objective\.

To perform node classification with reasoning on text\-rich networks, this section describes how TRN\-R1\-Zero achieves this capability through a novel optimisation paradigm, as illustrated in Figure [2](https://arxiv.org/html/2604.19070#S3.F2)\.

### 3\.1 Zero\-shot Node Classification

Given a text\-rich network $G=(V,E,Y)$, where $V=\{v_{1},\dots,v_{|V|}\}$ denotes the set of nodes with associated texts, $E\subseteq V\times V$ denotes the set of edges, and $Y=\{y_{1},\dots,y_{|Y|}\}$ denotes the set of label texts, the objective is to predict the label of a target node $v_{i}\in V$ without any supervision from the target network\.

#### Classification as Token Generation\.

Given a large language model $\mathcal{M}_{\theta}$, the input comprises the text of the target node $t_{i}$, its neighbourhood $\mathcal{N}(v_{i})$, and the candidate label texts $Y$\. Each class $y\in Y$ is mapped to a discrete identifier token \(e\.g\., “1”, “2”, “3”\)\. Thus, node classification is reformulated as a next\-token prediction task:

$$\hat{y}_{i}=\arg\max_{y\in Y}P_{\theta}\big(y\mid\mathcal{P}(t_{i},\mathcal{N}(v_{i}),Y)\big),\tag{1}$$

where $\mathcal{P}(\cdot)$ denotes the constructed prompt that integrates node, neighbour, and label information\.
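As a minimal sketch of the decision rule in Eq. (1), the snippet below takes the argmax over identifier tokens only, ignoring every other vocabulary entry. The function name `predict_label` and the toy log-probabilities are illustrative assumptions, not part of the released codebase; in practice the scores would come from the LLM's next-token distribution.

```python
import math


def predict_label(token_logprobs, label_ids):
    """Return the identifier token with the highest log-probability.

    token_logprobs: dict mapping a token string to log P(token | prompt).
    label_ids: identifier tokens, one per candidate class, e.g. ["0", "1", "2"].
    """
    # Identifier tokens missing from the dict are treated as probability zero.
    scores = {tok: token_logprobs.get(tok, -math.inf) for tok in label_ids}
    return max(scores, key=scores.get)


# Toy next-token distribution (made-up numbers, for illustration only).
toy_logprobs = {
    "0": math.log(0.15),
    "1": math.log(0.60),
    "2": math.log(0.20),
    "the": math.log(0.05),
}
prediction = predict_label(toy_logprobs, ["0", "1", "2"])
```

Restricting the argmax to the label identifiers keeps the prediction inside the candidate set $Y$ even when a non-label token is more probable.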

### 3\.2 Prompt with Neighbourhood Sampling

The input to the LLM is constructed by combining the target node text, sampled neighbour texts, and candidate label descriptions into an instruction\-style prompt \(see Box 1 below\)\. In our prompt design, neighbourhood sampling serves a dual purpose: it controls the input length and acts as a form of data augmentation\. For each target node $v_{i}$, multiple subgraphs are randomly sampled following a fixed width–depth strategy, where \(i\) width limits the number of included neighbours, and \(ii\) depth truncates the text of each neighbour\. By repeatedly drawing diverse subsets of neighbours, the LLM is exposed to varied local contexts, effectively expanding the training corpus and mitigating overfitting in low\-resource graph settings\.

Box 1: Train Prompt for TRN\-R1\-Zero

\# System Prompt

You are a helpful AI Assistant that provides well\-reasoned and detailed responses\. You first think about the reasoning process as an internal monologue and then provide the user with the answer\. Respond in the following format: <think\> \.\.\. </think\> <answer\> \.\.\. </answer\>

\# Graph Prompt

Target node: \{target\_node\_text\}

Neighbour nodes: \{neighbor\_node\_text\}

\# Task Instruction

I provide the content of the target node and its neighbour nodes\. Each node content is \{node\_type\}\. The relation between the target node and its neighbour nodes is \{relation\}\. The \{num\_categories\} categories are: \{labels\}\. Question: Based on the information of the target and neighbour nodes, predict the category ID \(0 to \{max\_id\}\) for the target node\.
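The width–depth sampling described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the helper names, the default values, and the choice to measure depth in whitespace-separated words are all hypothetical, since the text only says that depth truncates each neighbour's text.

```python
import random


def sample_neighbours(neighbour_texts, width, depth, rng):
    """Keep at most `width` random neighbours and the first `depth` words of each."""
    k = min(width, len(neighbour_texts))
    chosen = rng.sample(neighbour_texts, k)
    # Assumption: depth counts whitespace-separated words, not tokens or characters.
    return [" ".join(text.split()[:depth]) for text in chosen]


def build_graph_prompt(target_text, neighbour_texts, width=3, depth=8, seed=0):
    """Fill the 'Graph Prompt' part of Box 1 with a freshly sampled neighbourhood."""
    rng = random.Random(seed)
    sampled = sample_neighbours(neighbour_texts, width, depth, rng)
    lines = ["Target node: " + target_text, "Neighbour nodes:"]
    lines += ["- " + text for text in sampled]
    return "\n".join(lines)
```

Calling `build_graph_prompt` with different seeds for the same target node yields different neighbour subsets, which is the augmentation effect the section describes.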

### 3\.3 Neighbour\-aware Group Relative Policy Optimisation Objective

Reinforcement learning for LLM post\-training builds upon the GRPO objective Shao et al\. \([2024](https://arxiv.org/html/2604.19070#bib.bib15)\), a variant of PPO Schulman et al\. \([2017](https://arxiv.org/html/2604.19070#bib.bib16)\) adapted for sequence\-level rewards\. Given a query $q$ and an output sequence $o=(o_{1},\dots,o_{|o|})$, the objective is defined as:

$$\mathcal{J}(\theta)=\mathbb{E}_{q\sim\mathcal{D},\;o\sim\pi_{\theta_{\text{old}}}}\Bigg[\sum_{t=1}^{|o|}\min\Big(r_{t}\hat{A}_{t},\;\text{clip}(r_{t},1-\epsilon,1+\epsilon)\,\hat{A}_{t}\Big)\Bigg],\tag{2}$$

where $r_{t}=\tfrac{\pi_{\theta}(o_{t}\mid q,o_{<t})}{\pi_{\theta_{\text{old}}}(o_{t}\mid q,o_{<t})}$ is the token\-level importance sampling ratio, and $\hat{A}_{t}$ is the advantage estimator:

$$\hat{A}_{t}=\tfrac{R_{i}-\bar{R}}{\mathrm{std}(R)},\tag{3}$$

where $R_{i}$ is the reward of the $i$\-th output sampled for the query, and $\bar{R}$ and $\mathrm{std}(R)$ are the mean and standard deviation of the rewards within that group\. A KL regularisation term

$$-\beta\cdot\mathrm{KL}\left[\pi_{\theta}(\cdot\mid q,o_{<t})\,\|\,\pi_{\text{ref}}(\cdot\mid q,o_{<t})\right]\tag{4}$$

is added to penalise deviation from a frozen reference policy $\pi_{\text{ref}}$, stabilising optimisation\.
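To make the clipped term inside Eq. (2) concrete, here is a small sketch for a single rollout, written in plain Python. The function name and the default `eps=0.2` are assumptions (the clipping range is not stated here), and the KL penalty of Eq. (4) is deliberately left out.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """Sum over tokens of min(r_t * A, clip(r_t, 1 - eps, 1 + eps) * A).

    logp_new / logp_old: per-token log-probabilities under the current and old policy.
    advantage: one sequence-level advantage shared by every token of the rollout.
    """
    total = 0.0
    for lp_new, lp_old in zip(logp_new, logp_old):
        ratio = math.exp(lp_new - lp_old)  # importance sampling ratio r_t
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        total += min(ratio * advantage, clipped * advantage)
    return total
```

With a positive advantage the contribution of a token is capped once its ratio exceeds `1 + eps`; with a negative advantage the unclipped, more pessimistic value is kept.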

This formulation can hinder reward shaping, as the standard deviation term dampens variations in $R_{i}$\. Dr\.GRPO Liu et al\. \([2025c](https://arxiv.org/html/2604.19070#bib.bib5)\) addresses this by removing the denominator:

$$\hat{A}_{t}=R_{i}-\bar{R}.\tag{5}$$

This allows shaped rewards to influence optimisation magnitude directly\. Although Dr\.GRPO is commonly implemented without KL, empirical results in this task reveal that omitting KL causes unstable training\. Therefore, the adopted objective is Dr\.GRPO with KL regularisation, which preserves both stability and effective scaling\.
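The difference between Eq. (3) and Eq. (5) is easiest to see on two groups whose rewards differ only in scale. The sketch below uses the population standard deviation and a small `eps` to avoid division by zero; both choices are assumptions made for illustration.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Eq. (3): centre by the group mean, then divide by the group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Eq. (5): centre by the group mean only, preserving reward scale."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


small_group = [0.0, 2.0]   # e.g. unscaled rewards
large_group = [0.0, 20.0]  # same outcomes with rewards scaled up tenfold
```

For these two groups, `grpo_advantages` returns approximately `[-1, 1]` in both cases, so the tenfold scaling is erased, whereas `dr_grpo_advantages` returns `[-1.0, 1.0]` and `[-10.0, 10.0]`. This is why a reward multiplier such as the margin-gain factor can only change update magnitude under the Dr.GRPO form.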

#### Margin Gain: Quantifying Neighbouring Contribution\.

During reasoning over neighbouring nodes, the neighbourhood information may either complement or distract from the target node’s textual signal\. To identify cases where neighbour information plays a pivotal role, we introduce a margin gain metric that quantifies how much the classification decision boundary shifts after incorporating neighbours\.

Let $e_{i}=f(x_{i})\in\mathbb{R}^{d}$ be the embedding of node text $x_{i}$ and $e_{c}=f(y_{c})$ the embedding of label text $y_{c}$, where $f(\cdot)$ is a frozen text encoder\. The raw logit of node $i$ for class $c$ is

$$\ell_{i,c}=e_{i}^{\top}e_{c}.\tag{6}$$

Let $y_{i}$ denote the ground\-truth class of node $i$\. The raw margin score is defined as

$$m_{i}(\ell)=\ell_{i,y_{i}}-\max_{c\neq y_{i}}\ell_{i,c},\tag{7}$$

which measures how confidently the encoder classifies the node text in isolation\.

To measure the influence of neighbours, we apply a lightweight $K$\-layer Simple Graph Convolution \(SGC\)\-style aggregator Wu et al\. \([2019](https://arxiv.org/html/2604.19070#bib.bib35)\):

$$\tilde{E}=\big(D^{-\tfrac{1}{2}}AD^{-\tfrac{1}{2}}\big)^{K}E,\tag{8}$$

where $A$ is the adjacency matrix with self\-loops, $D$ is its degree matrix, and $E$ stacks all node embeddings row\-wise\. The aggregated embedding $\tilde{e}_{i}$ induces aggregated logits

$$\tilde{\ell}_{i,c}=\tilde{e}_{i}^{\top}e_{c},\tag{9}$$

and a corresponding aggregated margin

$$m_{i}(\tilde{\ell})=\tilde{\ell}_{i,y_{i}}-\max_{c\neq y_{i}}\tilde{\ell}_{i,c}.\tag{10}$$

Therefore, the margin gain can be defined to quantify the contribution of the neighbourhood:

$$\Delta_{i}=m_{i}(\tilde{\ell})-m_{i}(\ell),\tag{11}$$

which captures how much neighbourhood aggregation improves \(or degrades\) the classification margin\.

Intuitively, $\Delta_{i}>0$ indicates that neighbours provide helpful context; $\Delta_{i}\approx 0$ suggests neighbourhood information is redundant; and $\Delta_{i}<0$ implies neighbours are distracting\. We use the absolute value $|\Delta_{i}|$ to measure the strength of neighbourhood influence, regardless of whether the effect is positive or negative\.
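Eqs. (6) to (11) can be traced end to end on a toy graph. The following pure-Python sketch is illustrative only: the function names are hypothetical, edges are treated as undirected, and the two-dimensional embeddings stand in for the output of the frozen encoder $f(\cdot)$.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def margin(embedding, label_embs, true_class):
    """Eq. (7)/(10): true-class logit minus the best competing logit."""
    logits = [dot(embedding, e_c) for e_c in label_embs]
    best_other = max(l for c, l in enumerate(logits) if c != true_class)
    return logits[true_class] - best_other


def sgc_aggregate(embs, edges, num_layers=1):
    """Eq. (8): apply (D^-1/2 A D^-1/2)^K row-wise, with self-loops in A."""
    n = len(embs)
    nbrs = [{i} for i in range(n)]  # start from self-loops
    for u, v in edges:              # assumption: edges are undirected
        nbrs[u].add(v)
        nbrs[v].add(u)
    deg = [len(s) for s in nbrs]
    out = embs
    for _ in range(num_layers):
        out = [
            [sum(out[j][d] / math.sqrt(deg[i] * deg[j]) for j in nbrs[i])
             for d in range(len(out[i]))]
            for i in range(n)
        ]
    return out


def margin_gain(embs, label_embs, labels, edges, num_layers=1):
    """Eq. (11): aggregated margin minus raw margin, for every node."""
    agg = sgc_aggregate(embs, edges, num_layers)
    return [
        margin(agg[i], label_embs, labels[i]) - margin(embs[i], label_embs, labels[i])
        for i in range(len(embs))
    ]


# Toy example: node 0 is ambiguous on its own, node 1 is clearly class 0.
label_embs = [[1.0, 0.0], [0.0, 1.0]]
embs = [[0.5, 0.5], [1.0, 0.0]]
gains = margin_gain(embs, label_embs, labels=[0, 0], edges=[(0, 1)])
```

Here the ambiguous node gains margin from its confident neighbour (a positive $\Delta$), while the confident node loses margin by being averaged with the ambiguous one (a negative $\Delta$); both have the same $|\Delta|$.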

#### Reward Design with Margin Gain\.

Reinforcement learning with GRPO assigns each prompt a scalar reward $R_{i}$, which determines the magnitude of policy gradient updates through the advantage estimator\. Rather than treating all prompts equally, we scale the rewards by the neighbourhood influence so that samples where neighbours have a stronger impact on decisions receive greater emphasis\.

For a rollout $o_{i}$ associated with node $v_{i}$, the base reward comprises two components:

$$R_{i}^{\text{base}}=s_{\text{format}}(o_{i})+s_{\text{acc}}(o_{i}),\tag{12}$$

where $s_{\text{format}}$ enforces adherence to the output schema \(e\.g\., correct use of <think\> and <answer\> tags\), and $s_{\text{acc}}$ verifies whether the final answer matches the ground\-truth identifier token\.

To reflect the importance of the neighbourhood via the margin gain from Eq\. \([11](https://arxiv.org/html/2604.19070#S3.E11)\), we define a reshaping factor:

$$g_{i}=\exp(\alpha\cdot|\Delta_{i}|),\tag{13}$$

where $\alpha$ is a temperature hyperparameter controlling sensitivity\.

This exponential form has two intuitive effects: \(i\) when $|\Delta_{i}|=0$, $g_{i}=1$ and the reward remains unchanged; \(ii\) larger $|\Delta_{i}|$ values exponentially amplify the reward, encouraging the model to focus more on neighbour\-influenced samples during policy optimisation\. The final reward is therefore:

$$R_{i}=g_{i}\cdot R_{i}^{\text{base}}=\exp(\alpha|\Delta_{i}|)\big(s_{\text{format}}(o_{i})+s_{\text{acc}}(o_{i})\big).\tag{14}$$

Incorporating this reward design into the LLM update using the objective in Eq\. \([2](https://arxiv.org/html/2604.19070#S3.Ex1)\) emphasises structurally informative neighbourhoods, guiding the LLM to leverage relational context more effectively when reasoning on text\-rich networks\.
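A minimal sketch of the shaped reward in Eq. (14) is given below. The 0/1 scoring of $s_{\text{format}}$ and $s_{\text{acc}}$, the regular expression, and the function name are assumptions made for illustration, since the exact values of the two components are not specified here; `alpha=10.0` follows the setting reported in the implementation details.

```python
import math
import re

# Assumed schema check: one <think> block followed by one <answer> block.
_FORMAT = re.compile(
    r"^\s*<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)


def shaped_reward(output, true_id, delta, alpha=10.0):
    """Eq. (14): exp(alpha * |delta|) * (s_format + s_acc).

    output: raw model rollout; true_id: ground-truth category ID;
    delta: precomputed margin gain of the target node (Eq. 11).
    """
    match = _FORMAT.match(output)
    s_format = 1.0 if match else 0.0  # assumption: binary format score
    s_acc = 1.0 if match and match.group(1).strip() == str(true_id) else 0.0
    return math.exp(alpha * abs(delta)) * (s_format + s_acc)
```

Because the factor multiplies the base reward, a rollout that earns nothing stays at zero regardless of $|\Delta_i|$; the factor only widens the gap between good and bad rollouts on neighbour-sensitive nodes.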

### 3\.4 Inference on Edge and Graph Tasks

Although TRN\-R1\-Zero is trained only on node\-level tasks, extending it to other TRN tasks such as link prediction and graph reasoning is straightforward\. The input prompt requires only the sampled graph with neighbour information, and the task instruction\. Detailed prompt designs are provided in Appendix[B](https://arxiv.org/html/2604.19070#A2), with cross\-task experiments reported in Section[4\.3](https://arxiv.org/html/2604.19070#S4.SS3)\.

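| Type | Method | Cora Acc | Cora Macro\-F1 | WikiCS Acc | WikiCS Macro\-F1 | Instagram Acc | Instagram Macro\-F1 | Photo Acc | Photo Macro\-F1 | Avg\. Acc | Avg\. Macro\-F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLM | GPT\-4o | 70\.30 | 71\.44 | 69\.69 | 64\.51 | 42\.42 | 39\.79 | 69\.93 | 68\.55 | 63\.09 | 61\.07 |
| LLM | Llama\-3\.1\-8B | 64\.55 | 64\.41 | 59\.43 | 54\.16 | 36\.98 | 28\.32 | 45\.49 | 50\.44 | 51\.61 | 49\.33 |
| LLM | Qwen2\.5\-1\.5B\-it | 47\.96 | 49\.91 | 61\.71 | 56\.17 | 36\.82 | 28\.37 | 50\.72 | 51\.50 | 49\.30 | 46\.49 |
| LLM | Qwen2\.5\-7B\-it | 67\.59 | 67\.19 | 67\.44 | 63\.93 | 52\.20 | 50\.32 | 55\.67 | 59\.37 | 60\.73 | 60\.20 |
| LLM | Qwen2\.5\-14B\-it | 67\.22 | 68\.26 | 73\.03 | 70\.78 | 55\.60 | 52\.94 | 58\.51 | 61\.45 | 63\.59 | 63\.36 |
| GFM | ZeroG | 62\.55 | 57\.56 | 62\.71 | 57\.87 | 50\.71 | 50\.43 | 46\.27 | 51\.52 | 55\.56 | 54\.35 |
| GFM | LLaGA | 18\.82 | 8\.49 | 8\.20 | 8\.29 | 47\.93 | 47\.70 | 39\.18 | 4\.71 | 28\.53 | 17\.30 |
| SFT \+ RL | Graph\-R1 \(14B\) | 68\.15 | 67\.34 | 73\.25 | 70\.11 | 52\.03 | 52\.06 | \- | \- | \- | \- |
| RL Only | TRN\-R1\-Zero \(7B\) | 72\.59 | 70\.33 | 73\.63 | 70\.30 | 54\.76 | 52\.54 | 65\.12 | 64\.22 | 66\.53 | 64\.35 |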

Table 2: Performance comparison under the zero\-shot setting with Accuracy \(%\) and Macro\-F1 \(%\), reported on benchmarks following Wu et al\. \([2025a](https://arxiv.org/html/2604.19070#bib.bib13)\)\. The best and second\-best results are highlighted per column \(paired t\-test, $p\leq 0.05$, Bonferroni corrected\)\. Graph\-R1 is not evaluated on Photo because Photo was used in its pre\-training and therefore does not qualify for zero\-shot evaluation\.
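For readers who want to reproduce the two metrics in Table 2 from raw predictions, a small reference sketch follows. It is not taken from the paper's evaluation code; in particular, averaging F1 over the classes that occur in the ground truth is an assumption, and other toolkits may average over a different label set.

```python
def accuracy(y_true, y_pred):
    """Fraction of nodes whose predicted class equals the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Macro-F1 weights every class equally, so it can diverge from Accuracy on imbalanced label distributions; LLaGA on Photo in Table 2 (39.18 Accuracy against 4.71 Macro-F1) shows how large that gap can become.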

## 4 Experiments

### 4\.1 Setup

#### Datasets\.

Experiments are conducted on nine datasets spanning four relational structures \(citation, hyperlink, social, and co\-purchase\) and three task types \(node\-, graph\-, and edge\-level\)\. For RL training, Citeseer and History are used to capture citation and co\-purchase relations\. The remaining Cora, Photo, WikiCS, and Instagram are used for zero\-shot in\-domain and cross\-domain evaluation, while Expla\-Graph He et al\. \([2024b](https://arxiv.org/html/2604.19070#bib.bib43)\), WikiCS\-Link, and Instagram\-Link are used for cross\-task evaluation\. All datasets \(detailed in Tables [7](https://arxiv.org/html/2604.19070#A2.T7) and [8](https://arxiv.org/html/2604.19070#A3.T8)\) are sourced from NodeBed Wu et al\. \([2025a](https://arxiv.org/html/2604.19070#bib.bib13)\)\.

- Cora, Citeseer: Each node represents a scientific publication including the paper title and abstract\. Edges denote citation links between papers, forming a citation network\.
- WikiCS: Each node corresponds to a Wikipedia article, and edges represent hyperlinks between articles, forming a web graph\.
- Instagram: Each node represents a user account, and edges correspond to social\-follow relations\. Node texts are profile descriptions or short post contents, reflecting social interaction context\.
- Photo, History: Each node corresponds to a product on the Amazon platform\. Node texts are customer reviews in Photo and product descriptions in History, and edges capture co\-purchase relations between products\.
- Expla\-Graph: Each node denotes a commonsense concept, and each edge denotes the relation between two concepts\.
- WikiCS\-Link and Instagram\-Link: Both datasets are constructed from original node\-level datasets by retaining the original edges as positives and uniformly sampling an equal number of non\-existent edges as negatives\.
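The construction of the link-level datasets in the last item can be sketched as below. This is a hedged illustration, not the released preprocessing: the function name is hypothetical, edges are treated as undirected, and rejection sampling is just one simple way to draw non-edges uniformly.

```python
import random


def build_link_dataset(num_nodes, edges, seed=0):
    """Return [(pair, label)] with every edge as a positive (1) and an equal
    number of uniformly sampled non-edges as negatives (0)."""
    rng = random.Random(seed)
    positives = {tuple(sorted(e)) for e in edges}  # assumption: undirected edges
    max_pairs = num_nodes * (num_nodes - 1) // 2
    if 2 * len(positives) > max_pairs:
        raise ValueError("graph is too dense to sample enough negatives")
    negatives = set()
    while len(negatives) < len(positives):
        u, v = rng.sample(range(num_nodes), 2)  # two distinct nodes
        pair = (min(u, v), max(u, v))
        if pair not in positives:               # reject existing edges
            negatives.add(pair)
    return [(p, 1) for p in sorted(positives)] + [(n, 0) for n in sorted(negatives)]
```

The resulting dataset is balanced by construction, so a trivial classifier scores 50% accuracy, which gives a reference point for the edge-level numbers reported later.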

#### Baselines\.

The comparison includes three categories of baselines:

- LLMs: GPT\-4o is included to represent a potential upper bound of LLM performance\. LLaMA\-3\.1\-8B, Qwen2\.5\-1\.5B\-Instruct, Qwen2\.5\-7B\-Instruct, and Qwen2\.5\-14B\-Instruct are selected to cover diverse open\-source model families and scales\.
- Graph Foundation Models \(GFMs\): ZeroG Li et al\. \([2024a](https://arxiv.org/html/2604.19070#bib.bib9)\) and LLaGA Chen et al\. \([2024b](https://arxiv.org/html/2604.19070#bib.bib17)\) are evaluated in an intra\-domain manner, where each model is pre\-trained on the same domain dataset \(e\.g\., arXiv for academic data\) before being tested on the target dataset\.
- Reasoning LLMs: Graph\-R1 Wu et al\. \([2025b](https://arxiv.org/html/2604.19070#bib.bib2)\) introduces a *rethink* template that encourages LLMs to reason carefully and revise their responses before producing the final answer\. In the original setup, DeepSeek\-v3 is used to summarise node texts into compact representations\. For fairness, the following experiments use raw node texts directly\.

#### Implementations\.

The dataset statistics are summarised in Table [8](https://arxiv.org/html/2604.19070#A3.T8), covering four relation types: citation, co\-purchase, hyperlink, and social\. During training, Citeseer \(citation domain\) and History \(co\-purchase domain\) are used to fine\-tune the base LLM, enabling it to capture the semantic characteristics of two distinct relational types and to learn reasoning under relational constraints\. To ensure that evaluation reflects genuine cross\-domain and cross\-relation generalisation, datasets from the hyperlink and social domains are deliberately excluded from training\. All datasets are randomly split into 60%/20%/20% for training, validation, and testing, respectively\. Prompt templates for generative LLMs are listed in Section [3\.2](https://arxiv.org/html/2604.19070#S3.SS2)\. Qwen2\.5\-7B\-Instruct serves as the base model for TRN\-R1\-Zero\. Low\-Rank Adaptation \(LoRA, Hu et al\. \([2022](https://arxiv.org/html/2604.19070#bib.bib21)\)\) is employed for memory\-efficient fine\-tuning, with the rank set to 64\. All experiments are conducted on a single AMD MI300X GPU\.

For the margin\-gain computation, the SGC aggregator in Eq\. \([8](https://arxiv.org/html/2604.19070#S3.E8)\) is applied with $K=1$\. The temperature in the reshaping factor of Eq\. \([13](https://arxiv.org/html/2604.19070#S3.E13)\) is set to $\alpha=10$, amplifying the reward contribution of samples whose classification margin shifts substantially under neighbourhood aggregation\.
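As a quick worked example of what $\alpha=10$ means for the reshaping factor of Eq. (13), consider three hypothetical margin gains. The $|\Delta_i|$ values below are chosen only for illustration; their realistic range depends on the encoder and on whether embeddings are normalised, which is not stated here.

```latex
\begin{aligned}
|\Delta_i| = 0    &\;\Rightarrow\; g_i = \exp(10 \cdot 0)    = 1,\\
|\Delta_i| = 0.05 &\;\Rightarrow\; g_i = \exp(10 \cdot 0.05) = e^{0.5} \approx 1.65,\\
|\Delta_i| = 0.2  &\;\Rightarrow\; g_i = \exp(10 \cdot 0.2)  = e^{2}   \approx 7.39.
\end{aligned}
```

Under this setting, a fourfold increase in $|\Delta_i|$ (from 0.05 to 0.2) raises the reward multiplier by a factor of $e^{1.5}\approx 4.48$, which illustrates how strongly neighbour-sensitive nodes are emphasised.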

### 4\.2 Overall Results

The overall zero\-shot node classification results are presented in Table [2](https://arxiv.org/html/2604.19070#S3.T2)\. TRN\-R1\-Zero attains the highest average Accuracy and Macro\-F1 across all datasets, validating its effectiveness and superior generalisation across domains\.

LLaGA exhibits noticeably lower performance, indicating limited domain transferability, as its mapping layer to align graph embeddings with LLM embeddings is trained on a source graph and struggles to generalise to unseen graphs\. In contrast, ZeroG achieves competitive results, since its post\-encoding information aggregation does not compromise generalisation ability\.

Among pure LLMs, GPT\-4o achieves the best performance on Cora and Photo, whereas Qwen2\.5\-14B\-Instruct surpasses GPT\-4o on WikiCS and Instagram\. For models of comparable scale, Qwen2\.5\-7B\-Instruct consistently outperforms LLaMA\-3\.1\-8B across all datasets\. Although the smaller Qwen2\.5\-1\.5B\-Instruct exceeds LLaMA\-3\.1\-8B on WikiCS and Photo, it falls behind on Cora\. These results suggest that both larger model capacity and instruction tuning contribute positively to zero\-shot relational reasoning\.

For reasoning\-based LLMs, TRN\-R1\-Zero not only achieves the best or near\-best results overall but also demonstrates strong generalisation capability\. Despite being trained only on the Citeseer and History datasets and without any exposure to test graphs, TRN\-R1\-Zero performs well on both in\-domain datasets \(Cora, Photo\) and out\-of\-domain datasets \(WikiCS, Instagram\)\.

\[Figure 3 plot: two panels \(Cora, WikiCS\); y\-axis: Performance \(%\); x\-axis: Acc and F1; bars for Graph\-R1 and TRN\-R1\-Zero before and after RL\. Annotated changes after RL, in plotted order: Cora \+0\.9, \+5\.0, −0\.9, \+3\.1; WikiCS \+0\.2, \+6\.2, −0\.7, \+6\.4\.\]

Figure 3: Performance comparison with RL training between our TRN\-R1\-Zero \(red\) and Graph\-R1 \(blue\)\.

### 4\.3 Generalisation to Graph and Edge Tasks

Although TRN\-R1\-Zero is trained only on node classification, its zero\-shot ability is further examined on two unseen task types: graph\-level reasoning on Expla\-Graph and edge\-level link prediction on WikiCS\-Link and Instagram\-Link\. For graph\-level reasoning, TRN\-R1\-Zero improves over the base model at both 7B and 14B scales, and under the same Qwen2\.5\-14B backbone it also surpasses Graph\-R1, even though Graph\-R1 is explicitly trained on graph\-level tasks\. For edge\-level prediction, the gains are particularly substantial at the 7B scale \(e\.g\., \+16\.10 on WikiCS\-Link\), indicating effective transfer of relational reasoning, and TRN\-R1\-Zero again exceeds Graph\-R1 under the same 14B backbone despite never receiving edge\-level supervision\. Together, these results highlight the zero\-shot ability of TRN\-R1\-Zero on graph\- and edge\-level tasks, which derives from neighbour\-aware reasoning training on node\-level tasks alone\.

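| Scale | Model | Expla\-Graph | WikiCS\-Link | Insta\-Link |
| --- | --- | --- | --- | --- |
| 7B | Qwen2\.5 | 84\.12 | 52\.10 | 64\.90 |
| 7B | TRN\-R1\-Zero | 87\.18 \(\+3\.06\) | 68\.20 \(\+16\.10\) | 66\.80 \(\+1\.90\) |
| 14B | Qwen2\.5 | 89\.89 | 72\.10 | 71\.80 |
| 14B | Graph\-R1 | 89\.71 | 48\.90 | 56\.40 |
| 14B | TRN\-R1\-Zero | 90\.25 \(\+0\.36\) | 73\.90 \(\+1\.80\) | 74\.20 \(\+2\.40\) |

Table 3: Zero\-shot performance on graph reasoning and link prediction\. Trained only on node\-level tasks\. Gains over the Qwen2\.5 base model of the same scale are shown in parentheses\.

### 4\.4 Effectiveness of Neighbour\-aware RL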

This experiment investigates how our proposed neighbour\-aware RL post\-training can enhance the zero\-shot node classification capability of LLMs\. The comparison is between using the vanilla CoT distillation in Graph\-R1 \(14B\) and our TRN\-R1\-Zero \(7B\)\. The evaluation is conducted on the Cora and WikiCS datasets, which represent the citation and hyperlink domains, using accuracy and macro\-F1 as the primary metrics\.

Figure [3](https://arxiv.org/html/2604.19070#S4.F3) shows the performance gains achieved through RL training on the Cora and WikiCS datasets\. TRN\-R1\-Zero consistently yields larger improvements in both Accuracy and F1, whereas Graph\-R1 even experiences a decline in F1 on both datasets\. These results indicate that the neighbour\-aware reward design effectively stabilises the optimisation process and promotes more balanced metric improvements\. TRN\-R1\-Zero not only delivers consistent and robust performance gains through reinforcement learning, but also exhibits superior optimisation stability and generalisation compared with baselines\.
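
Accuracy and macro\-F1 can diverge, which is one way a model can keep or gain accuracy while losing F1\. The sketch below uses the standard definitions in plain Python: macro\-F1 is the unweighted mean of per\-class F1, so neglecting a minority class is penalised even when accuracy is unchanged\. The function names and the toy labels are hypothetical and are not taken from the paper's evaluation code\.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in gold or predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


gold = [0, 0, 0, 0, 1, 1]
majority_only = [0, 0, 0, 0, 0, 0]  # ignores the minority class entirely
balanced = [0, 0, 1, 1, 1, 1]       # same number of mistakes, spread over classes

print(accuracy(gold, majority_only), macro_f1(gold, majority_only))
print(accuracy(gold, balanced), macro_f1(gold, balanced))
```

Both toy predictions have the same accuracy, but the one that ignores the minority class has a clearly lower macro\-F1\.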

### 4\.5 Effectiveness of Margin Gain

![Refer to caption](https://arxiv.org/html/2604.19070v1/x2.png)
![Refer to caption](https://arxiv.org/html/2604.19070v1/x3.png)

Figure 4: Original margin gain values $\Delta_i$ across the training datasets \(Citeseer and History\)\. These results demonstrate the distribution of impact from neighbour information towards the target node, motivating the neighbour\-aware reward design\.

The margin gain visualisations in Figure [4](https://arxiv.org/html/2604.19070#S4.F4) provide an intuitive view of how neighbour aggregation influences decision confidence during RL training\. Specifically, a positive $\Delta_i$ indicates that neighbour aggregation shifts the target embedding closer to the ground\-truth label embedding, whereas a negative $\Delta_i$ indicates the opposite effect\. A larger $|\Delta_i|$ therefore reflects a stronger influence of neighbour information on the classification of the target node, and such high\-impact samples are the ones most worth emphasising during policy optimisation\.

To examine the effect of the neighbour\-aware policy objective on training dynamics, the Qwen2\.5\-1\.5B\-Instruct model is used as the base policy model for computational efficiency\. Two reward variants are compared: \(i\) the base reward without scaling and \(ii\) the $\exp(|\Delta_i|)$\-scaled reward\. The Cora dataset is used for evaluation, and each checkpoint model is evaluated five times to ensure robustness\. Figure [5\(a\)](https://arxiv.org/html/2604.19070#S4.F5.sf1) illustrates the average accuracy across training steps under both settings\. The results show that incorporating neighbour\-aware reward shaping stabilises optimisation and yields more consistent performance improvements compared with the unshaped baseline\. The training statistics in Figure [5\(b\)](https://arxiv.org/html/2604.19070#S4.F5.sf2), [5\(c\)](https://arxiv.org/html/2604.19070#S4.F5.sf3), and [5\(d\)](https://arxiv.org/html/2604.19070#S4.F5.sf4) further support the effectiveness of neighbour\-aware shaping: the entropy remains relatively high, encouraging the policy model to explore more diverse and optimised responses rather than over\-exploit early patterns; meanwhile, the average response length steadily increases, suggesting that the model performs deeper reasoning\. Additionally, the rising frequency of the word “neighbour” indicates that the model gradually learns to leverage relational context more effectively\. Overall, the neighbour\-aware margin gain improves training stability, the use of neighbourhood information, and reasoning depth for node classification\.
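
The effect of the $\exp(|\Delta_i|)$ scaling on a group of rollouts can be sketched as follows\. This is an illustration under assumptions, not the paper's objective: it uses a hypothetical binary correctness reward, applies one scale per prompt, and computes group\-relative advantages by subtracting the group mean only\. The mean\-only form is chosen here because a per\-prompt scale would cancel out if advantages were also divided by the group standard deviation; how Eq\. \(13\) and the temperature $\alpha$ enter the actual objective is not shown in this section\.

```python
import math
import statistics


def shaped_rewards(base_rewards, delta):
    """Scale all rollout rewards of one prompt by exp(|delta|)."""
    scale = math.exp(abs(delta))
    return [r * scale for r in base_rewards]


def centred_advantages(rewards):
    """Group-relative advantage: reward minus the group mean (no std division)."""
    mean = statistics.fmean(rewards)
    return [r - mean for r in rewards]


# Hypothetical correctness rewards for four rollouts of the same prompt.
base = [1.0, 0.0, 0.0, 1.0]

low_impact = centred_advantages(shaped_rewards(base, delta=0.0))
high_impact = centred_advantages(shaped_rewards(base, delta=-1.0))

print(low_impact)
print(high_impact)
```

Under these assumptions, a prompt with a large $|\Delta_i|$ contributes proportionally larger advantages than a prompt whose margin is unaffected by its neighbours, regardless of the sign of $\Delta_i$\.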

![Refer to caption](https://arxiv.org/html/2604.19070v1/x4.png) \(a\) Cora Evaluation
![Refer to caption](https://arxiv.org/html/2604.19070v1/x5.png) \(b\) Entropy
![Refer to caption](https://arxiv.org/html/2604.19070v1/x6.png) \(c\) Response Length
![Refer to caption](https://arxiv.org/html/2604.19070v1/x7.png) \(d\) \#Neighbour Word

Figure 5: Accuracy comparison between base reward and neighbour\-aware reward on the Cora dataset\. Neighbour\-aware shaping consistently improves both optimisation stability and reasoning depth\.
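
Two of the training statistics in Figure 5, response length and the frequency of the word “neighbour”, can be approximated with a few lines of Python\. The sketch below is an assumption about how such statistics might be gathered: it counts whitespace\-separated tokens and matches both the British and American spellings, whereas the paper does not state its length unit or matching rule\.

```python
import re

NEIGHBOUR_RE = re.compile(r"neighbou?r", re.IGNORECASE)


def neighbour_mentions(response):
    """Count occurrences of 'neighbour'/'neighbor', including plurals and any case."""
    return len(NEIGHBOUR_RE.findall(response))


def response_length(response):
    """Length in whitespace-separated tokens (the paper's unit is not stated)."""
    return len(response.split())


sample = "Looking at the neighbor nodes: the Neighbours agree."
print(neighbour_mentions(sample), response_length(sample))
```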

### 4\.6 Impact of Different LLM Backbones

To assess the generality of TRN\-R1\-Zero across LLM backbones, models spanning different families and scales are trained, including Llama\-3\.2\-3B, Llama\-3\.1\-8B, and Qwen2\.5\-14B\. Across all families and scales, TRN\-R1\-Zero consistently improves zero\-shot node classification, with the largest gains observed on smaller backbones \(e\.g\., \+14\.4 and \+9\.0 in average accuracy on the 3B and 8B models, respectively\)\. These results indicate that the proposed training paradigm is not tied to a specific backbone and generalises across architectures and scales\.

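\[Figure 6 plot: three bar\-chart panels, one per backbone, comparing the base LLM with TRN\-R1\-Zero on Cora, WikiCS, Instagram, Photo, and their average\. The annotated values are tabulated below, with the annotated gain in parentheses\.\]

| Panel | Model | Cora | WikiCS | Instagram | Photo | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| \(a\) Llama\-3\.2\-3B\-Instruct | Base LLM | 32\.5 | 51\.2 | 50\.1 | 36\.6 | 42\.6 |
| \(a\) Llama\-3\.2\-3B\-Instruct | TRN\-R1\-Zero | 65\.1 \(\+32\.6\) | 67\.9 \(\+16\.7\) | 52\.6 \(\+2\.5\) | 42\.1 \(\+5\.6\) | 56\.9 \(\+14\.4\) |
| \(b\) Llama\-3\.1\-8B\-Instruct | Base LLM | 64\.6 | 59\.4 | 37\.0 | 45\.5 | 51\.6 |
| \(b\) Llama\-3\.1\-8B\-Instruct | TRN\-R1\-Zero | 68\.2 \(\+3\.6\) | 71\.7 \(\+12\.2\) | 45\.0 \(\+8\.0\) | 57\.8 \(\+12\.3\) | 60\.7 \(\+9\.0\) |
| \(c\) Qwen2\.5\-14B\-Instruct | Base LLM | 67\.2 | 73\.0 | 55\.6 | 58\.5 | 63\.6 |
| \(c\) Qwen2\.5\-14B\-Instruct | TRN\-R1\-Zero | 69\.3 \(\+2\.1\) | 76\.8 \(\+3\.8\) | 57\.3 \(\+1\.7\) | 69\.9 \(\+11\.4\) | 68\.3 \(\+4\.7\) |

Figure 6: Zero\-shot node classification accuracy across three model scales\. Green annotations show absolute gain over the base LLM\.

### 4\.7 Performance under Supervised Settings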

In the supervised setting, 60% of each dataset is used for training and 20% for testing\. This configuration follows common practice for GNNs and for GNN\-adapter frameworks for LLMs \(e\.g\., LLaGA\)\. Since direct supervision signals are available from the training\-testing split within the same dataset, traditional supervised models such as Graph Convolutional Networks \(GCNs\) can perform effectively\. As shown in Table [4](https://arxiv.org/html/2604.19070#S4.T4), TRN\-R1\-Zero still outperforms both GCN and LLaGA under the supervised setting, demonstrating strong optimisation and reasoning capabilities even when explicit labels are provided\. Overall, TRN\-R1\-Zero maintains superior performance and stable optimisation behaviour under supervised learning conditions\.

| Model | Citeseer | History |
| --- | --- | --- |
| GCN | 76\.45 | 84\.23 |
| LLaGA | 76\.73 | 85\.56 |
| TRN\-R1\-Zero | 77\.74 | 86\.71 |

Table 4: Comparison of TRN\-R1\-Zero with GCN \(Kipf and Welling, [2017](https://arxiv.org/html/2604.19070#bib.bib36)\) and LLaGA under the supervised setting on the Citeseer and History datasets\.

### 4\.8 Case Study and Efficiency

Compared with other reasoning\-based methods, TRN\-R1\-Zero is considerably more efficient: it \(1\) employs a substantially smaller model and \(2\) significantly shortens the average response length, producing more concise yet effective reasoning traces while achieving superior performance \(see Box 2 and Box 3\)\. In Table 5, its responses are roughly 6× shorter than those of Graph\-R1 on Cora and nearly 9× shorter on WikiCS\. This improvement not only reduces GPU memory consumption but also accelerates inference, demonstrating that reinforcement learning alone can promote efficient reasoning behaviour without relying on large\-scale supervised fine\-tuning or external LLMs\.

| Dataset | Model | Size | Resp\. Len\. | Acc\. |
| --- | --- | --- | --- | --- |
| Cora | TRN\-R1\-Zero | 7B | 152\.57 | 72\.59 |
| Cora | Graph\-R1 | 14B | 930\.47 | 68\.15 |
| WikiCS | TRN\-R1\-Zero | 7B | 129\.27 | 73\.63 |
| WikiCS | Graph\-R1 | 14B | 1129\.22 | 73\.25 |

Table 5: Comparison of model size, average response length \(Resp\. Len\.\), and accuracy \(Acc\.\) between TRN\-R1\-Zero and Graph\-R1 on Cora and WikiCS\.

Box 2: Response of TRN\-R1\-Zero

\# Response for Node 11 in Cora by TRN\-R1\-Zero

<think\> The target node discusses the MIN\-FEATURES bias in machine learning, which is related to the selection of relevant features\. It talks about learning algorithms, sample complexity, and generalization performance\. The focus is on understanding and improving the learning process by considering the relevance of features, which is a theoretical aspect of machine learning\.

Looking at the neighbor nodes:
\- One discusses maximizing coverage in learning algorithms\.
\- Another focuses on prototype and feature selection using sampling and random mutation hill climbing\.
\- The third uses compression\-based methods for feature subset selection\.

All these topics are closely related to feature selection and learning algorithms, which fall under the broader category of machine learning theory\. Therefore, the most appropriate category for the target node would be "Theory" \(category 4\)\. </think\>

<answer\> 4 ✓ </answer\>

Box 3: Response of Graph\-R1

\# Response for Node 11 in Cora by Graph\-R1

Okay, let’s try to figure out how to classify the paper segment represented by node0\. \.\.\.38 words\.\.\. First, I need to understand what each node is about\. Node0’s paper is about the MIN\-FEATURES bias in machine learning, focusing on feature selection\. \.\.\.35 words\.\.\. Looking at the connections: node0 is connected to node1, node2, and node3\. Node1 talks about feature subset selection using Information Theory\. Node2 discusses MDL \(Minimum Description Length\) for feature selection\. Node3 is about constructive induction and feature selection with a non\-greedy strategy\. \.\.\.334 words\.\.\. Given the options, "Rule\_Learning" \(0\) seems the most fitting because it directly relates to the algorithms discussed \(ID3, FOCUS\) and the context of feature selection in decision trees\. "Theory" is a possible contender but might be too broad\. The other categories don’t align as well\. Therefore, the best fit is Rule\_Learning \(0\)\. </think\>

<structure\> The subgraph forms a star topology with node0 at the center connected to nodes1,2,3\. \.\.\.43 words\.\.\. </structure\>

<semantic\> \.\.\.40 words\.\.\. "Theory" \(category4\) is plausible due to PAC\-learning and Information Theory, but the emphasis on practical algorithms \(ID3, FOCUS\) leans more toward applied methods\. "Probabilistic\_Methods" \(6\) is less fitting as MDL is information\-theoretic, not probabilistic\. Other categories \(Neural\_Networks, Case\_Based, Genetic\_Algorithms\) are irrelevant\. </semantic\>

<comprehensive\> Candidate answers:
1\. \*\*Rule\_Learning \(0\)\*\*: Directly supported by ID3/FOCUS comparisons, feature selection in decision trees, and algorithmic focus\.
2\. \*\*Theory \(4\)\*\*: PAC\-learning bounds and Information Theory overlap, but less emphasis on theoretical frameworks compared to applied algorithms\.
3\. \*\*Probabilistic\_Methods \(6\)\*\*: MDL relates to information theory but lacks explicit probabilistic modeling\.

Rethinking:
\- \*\*Rule\_Learning\*\* best captures the practical algorithm comparisons \(ID3, FOCUS\) and feature selection in decision trees\.
\- \*\*Theory\*\* is secondary but insufficient due to the applied focus of node0\.
\- Other categories are invalid based on content exclusion\.

Final Answer: 0 \(Rule\_Learning\) ✗

## 5 Conclusion

This paper introduces TRN\-R1\-Zero, a reinforcement learning\-only framework for text\-rich node reasoning\. A neighbour\-aware group relative policy optimisation objective models the impact of neighbouring information during training with a novel margin gain\. Without distillation or supervised fine\-tuning, TRN\-R1\-Zero directly optimises a base LLM for strong zero\-shot reasoning\. Extensive experiments verify the effectiveness of TRN\-R1\-Zero\.

## Limitations

For zero\-shot text\-rich network \(TRN\) tasks, LLMs must not only extract useful information from the target node text, neighbour node texts, and candidate label texts, but also comprehend the underlying semantics of these texts\. Therefore, if the base LLM lacks sufficient domain knowledge, reinforcement learning may offer limited improvements, as such knowledge primarily originates from the pre\-training phase\.

## Acknowledgments

This research has been supported by Australian Research Council Discovery Projects \(CE200100025, DP230101196 and DE250100919\)\.

## References

- M\. Balunovic, J\. Dekoninck, I\. Petrov, N\. Jovanovic, and M\. T\. Vechev \(2025\) MathArena: evaluating llms on uncontaminated math competitions\. CoRR abs/2505\.23281\. Cited by: [§2](https://arxiv.org/html/2604.19070#S2.SS0.SSS0.Px2.p1.1)\.
- P\. BehnamGhader, V\. Adlakha, M\. Mosbach, D\. Bahdanau, N\. Chapados, and S\. Reddy \(2024\) LLM2vec: large language models are secretly powerful text encoders\. In COLM\. Cited by: [§2](https://arxiv.org/html/2604.19070#S2.SS0.SSS0.Px1.p2.1)\.
- N\. Chen, Y\. Li, J\. Tang, and J\. Li \(2024a\) GraphWiz: an instruction\-following language model for graph computational problems\. In KDD\. Cited by: [Table 1](https://arxiv.org/html/2604.19070#S1.T1), [§1](https://arxiv.org/html/2604.19070#S1.p2.1), [§2](https://arxiv.org/html/2604.19070#S2.SS0.SSS0.Px2.p2.1)\.
- R\. Chen, T\. Zhao, A\. K\. Jaiswal, N\. Shah, and Z\. Wang \(2024b\) LLaGA: large language and graph assistant\. In ICML\. Cited by: [§1](https://arxiv.org/html/2604.19070#S1.p2.1), [§2](https://arxiv.org/html/2604.19070#S2.SS0.SSS0.Px1.p3.1), [2nd item](https://arxiv.org/html/2604.19070#S4.I2.i2.p1.1)\.
- Z\. Chen, R\. Tang, G\. Deng, F\. Wu, J\. Wu, Z\. Jiang, V\. K\. Prasanna, A\. Cohan, and X\. Wang \(2025\) LocAgent: graph\-guided LLM agents for code localization\. In ACL\. Cited by: [§1](https://arxiv.org/html/2604.19070#S1.p1.1)\.
- J\. Devlin, M\. Chang, K\. Lee, and K\. Toutanova \(2019\) BERT: pre\-training of deep bidirectional transformers for language understanding\. In NAACL\. Cited by: [§2](https://arxiv.org/html/2604.19070#S2.SS0.SSS0.Px1.p2.1)\.
- Y\. Fang, D\. Fan, S\. Ding, N\. Liu, and Q\. Tan \(2025\) UniGLM: training one unified language model for text\-attributed graphs embedding\. In WSDM\. Cited by: [§1](https://arxiv.org/html/2604.19070#S1.p2.1), [§2](https://arxiv.org/html/2604.19070#S2.SS0.SSS0.Px1.p2.1)\.
- D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu, R\. Xu, R\. Zhang, S\. Ma, X\. Bi, et al\. \(2025\) DeepSeek\-r1 incentivizes reasoning in llms through reinforcement learning\. Nature\. Cited by: [§2](https://arxiv.org/html/2604.19070#S2.SS0.SSS0.Px2.p1.1)\.
- P\. He, X\. Liu, J\. Gao, and W\. Chen \(2021\) Deberta: decoding\-enhanced bert with disentangled attention\. In ICLR\. Cited by: [§2](https://arxiv.org/html/2604.19070#S2.SS0.SSS0.Px1.p2.1)\.
- X\. He, X\. Bresson, T\. Laurent, A\. Perold, Y\. LeCun, and B\. Hooi \(2024a\) Harnessing explanations: LLM\-to\-LM interpreter for enhanced text\-attributed graph representation learning\. In ICLR\. Cited by: [§2](https://arxiv.org/html/2604.19070#S2.SS0.SSS0.Px1.p2.1)\.
- X\. He, Y\. Tian, Y\. Sun, N\. V\. Chawla, T\. Laurent, Y\. LeCun, X\. Bresson, and B\. Hooi \(2024b\) G\-retriever: retrieval\-augmented generation for textual graph understanding and question answering\. In NeurIPS\. Cited by: [§4\.1](https://arxiv.org/html/2604.19070#S4.SS1.SSS0.Px1.p1.1)\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2022\) LoRA: low\-rank adaptation of large language models\. In ICLR\. Cited by: [§2](https://arxiv.org/html/2604.19070#S2.SS0.SSS0.Px1.p2.1), [§4\.1](https://arxiv.org/html/2604.19070#S4.SS1.SSS0.Px3.p1.1)\.
- H\. Hu, H\. Lu, H\. Zhang, Y\. Song, W\. Lam, and Y\. Zhang \(2024\) Chain\-of\-symbol prompting for spatial reasoning in large language models\. In COLM\. Cited by: [§2](https://arxiv.org/html/2604.19070#S2.SS0.SSS0.Px2.p1.1)\.
- X\. Huang, K\. Han, Y\. Yang, D\. Bao, Q\. Tao, Z\. Chai, and Q\. Zhu \(2024\) Can GNN be good adapter for llms?\. In WWW\. Cited by: [§1](https://arxiv.org/html/2604.19070#S1.p2.1), [§2](https://arxiv.org/html/2604.19070#S2.SS0.SSS0.Px1.p3.1)\.
- C\. E\. Jimenez, J\. Yang, A\. Wettig, S\. Yao, K\. Pei, O\. Press, and K\. R\. Narasimhan \(2024\) SWE\-bench: can language models resolve real\-world github issues?\. In ICLR\. Cited by: [§2](https://arxiv.org/html/2604.19070#S2.SS0.SSS0.Px2.p1.1)\.
- T\. N\. Kipf and M\. Welling \(2017\) Semi\-supervised classification with graph convolutional networks\. In ICLR\. Cited by: [Table 4](https://arxiv.org/html/2604.19070#S4.T4)\.
- L\. Kong, J\. Feng, H\. Liu, C\. Huang, J\. Huang, Y\. Chen, and M\. Zhang \(2024\) Gofa: a generative one\-for\-all model for joint graph language modeling\. In ICLR\. Cited by: [§1](https://arxiv.org/html/2604.19070#S1.p2.1), [§2](https://arxiv.org/html/2604.19070#S2.SS0.SSS0.Px1.p3.1)\.
- Y\. Li, P\. Wang, Z\. Li, J\. X\. Yu, and J\. Li \(2024a\) Zerog: investigating cross\-dataset zero\-shot transferability in graphs\. In KDD\. Cited by: [§1](https://arxiv.org/html/2604.19070#S1.p2.1), [§2](https://arxiv.org/html/2604.19070#S2.SS0.SSS0.Px1.p2.1), [2nd item](https://arxiv.org/html/2604.19070#S4.I2.i2.p1.1)\.
- Y\. Li, P\. Wang, X\. Zhu, A\. Chen, H\. Jiang, D\. Cai, W\. K\. Chan, and J\. Li \(2024b\) GLBench: A comprehensive benchmark for graph with large language models\. In NeurIPS\. Cited by: [§1](https://arxiv.org/html/2604.19070#S1.p2.1), [§2](https://arxiv.org/html/2604.19070#S2.SS0.SSS0.Px1.p3.1)\.
- Y\. Liu, R\. Qiu, and Z\. Huang \(2023\) CaT: balanced continual graph learning with graph condensation\. In ICDM\. Cited by: [§1](https://arxiv.org/html/2604.19070#S1.p1.1)\.
- Y\. Liu, R\. Qiu, and Z\. Huang \(2025a\) GCondenser: benchmarking graph condensation\. In CIKM\. Cited by: [§1](https://arxiv.org/html/2604.19070#S1.p1.1)\.
- Y\. Liu, R\. Qiu, Y\. Tang, H\. Yin, and Z\. Huang \(2025b\) PUMA: efficient continual graph learning for node classification with graph condensation\. TKDE\. Cited by: [§1](https://arxiv.org/html/2604.19070#S1.p1.1)\.
- Z\. Liu, C\. Chen, W\. Li, P\. Qi, T\. Pang, C\. Du, W\. S\. Lee, and M\. Lin \(2025c\) Understanding r1\-zero\-like training: a critical perspective\. arXiv preprint arXiv:2503\.20783\. Cited by: [§2](https://arxiv.org/html/2604.19070#S2.SS0.SSS0.Px2.p1.1), [§3\.3](https://arxiv.org/html/2604.19070#S3.SS3.p2.1)\.
- N\. Reimers and I\. Gurevych \(2019\) Sentence\-bert: sentence embeddings using siamese bert\-networks\. In EMNLP\. Cited by: [§2](https://arxiv.org/html/2604.19070#S2.SS0.SSS0.Px1.p2.1)\.
- J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\) Proximal policy optimization algorithms\. CoRR abs/1707\.06347\. Cited by: [§2](https://arxiv.org/html/2604.19070#S2.SS0.SSS0.Px2.p1.1), [§3\.3](https://arxiv.org/html/2604.19070#S3.SS3.p1.2)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, M\. Zhang, Y\. K\. Li, Y\. Wu, and D\. Guo \(2024\) DeepSeekMath: pushing the limits of mathematical reasoning in open language models\. CoRR abs/2402\.03300\. Cited by: [§2](https://arxiv.org/html/2604.19070#S2.SS0.SSS0.Px2.p1.1), [§3\.3](https://arxiv.org/html/2604.19070#S3.SS3.p1.2)\.
- J\. Tang, Y\. Yang, W\. Wei, L\. Shi, L\. Su, S\. Cheng, D\. Yin, and C\. Huang \(2024a\) Graphgpt: graph instruction tuning for large language models\. In SIGIR\. Cited by: [§1](https://arxiv.org/html/2604.19070#S1.p2.1), [§2](https://arxiv.org/html/2604.19070#S2.SS0.SSS0.Px1.p3.1)\.
- Y\. Tang, R\. Qiu, Y\. Liu, X\. Li, and Z\. Huang \(2024b\) CaseGNN: graph neural networks for legal case retrieval with text\-attributed graphs\. In ECIR\. Cited by: [§1](https://arxiv.org/html/2604.19070#S1.p1.1)\.
- Y\. Tang, R\. Qiu, Y\. Liu, X\. Li, and Z\. Huang \(2026\) LEXA: legal case retrieval via graph contrastive learning with contextualised LLM embeddings\. World Wide Web \(WWW\) 29\(2\), pp\. 20\. Cited by: [§1](https://arxiv.org/html/2604.19070#S1.p1.1)\.
- Y\. Tang, R\. Qiu, H\. Yin, X\. Li, and Z\. Huang \(2024c\) CaseLink: inductive graph learning for legal case retrieval\. In SIGIR\. Cited by: [§1](https://arxiv.org/html/2604.19070#S1.p1.1)\.
- D\. Wang, R\. Qiu, G\. Bai, and Z\. Huang \(2025a\) Text meets topology: rethinking out\-of\-distribution detection in text\-rich networks\. In EMNLP\. Cited by: [§1](https://arxiv.org/html/2604.19070#S1.p1.1)\.
- D\. Wang, Y\. Zuo, F\. Li, and J\. Wu \(2024\) Llms as zero\-shot graph learners: alignment of gnn representations with llm token embeddings\. NeurIPS\. Cited by: [§1](https://arxiv.org/html/2604.19070#S1.p2.1), [§2](https://arxiv.org/html/2604.19070#S2.SS0.SSS0.Px1.p3.1)\.
- H\. Wang, S\. Liu, R\. Wei, and P\. Li \(2025b\) Model generalization on text attribute graphs: principles with large language models\. In ICML\. Cited by: [§1](https://arxiv.org/html/2604.19070#S1.p2.1), [§2](https://arxiv.org/html/2604.19070#S2.SS0.SSS0.Px1.p2.1)\.
- H\. Wang, S\. Feng, T\. He, Z\. Tan, X\. Han, and Y\. Tsvetkov \(2023\) Can language models solve graph problems in natural language?\. In NeurIPS\. Cited by: [§1](https://arxiv.org/html/2604.19070#S1.p2.1)\.
- Y\. Wang, B\. Liu, J\. Tang, N\. Chen, Y\. Li, Q\. Zhang, and J\. Li \(2025c\) Graph\-r1: unleashing LLM reasoning with np\-hard graph problems\. CoRR abs/2508\.20373\. Cited by: [Table 1](https://arxiv.org/html/2604.19070#S1.T1), [§1](https://arxiv.org/html/2604.19070#S1.p2.1), [§2](https://arxiv.org/html/2604.19070#S2.SS0.SSS0.Px2.p2.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. H\. Chi, Q\. V\. Le, and D\. Zhou \(2022\) Chain\-of\-thought prompting elicits reasoning in large language models\. In NeurIPS\. Cited by: [§2](https://arxiv.org/html/2604.19070#S2.SS0.SSS0.Px2.p2.1)\.
- F\. Wu, A\. H\. S\. Jr\., T\. Zhang, C\. Fifty, T\. Yu, and K\. Q\. Weinberger \(2019\) Simplifying graph convolutional networks\. In ICML\. Cited by: [§3\.3](https://arxiv.org/html/2604.19070#S3.SS3.SSS0.Px1.p3.1)\.
- X\. Wu, Y\. Shen, F\. Ge, C\. Shan, Y\. Jiao, X\. Sun, and H\. Cheng \(2025a\) When do LLMs help with node classification? a comprehensive analysis\. In ICML\. Cited by: [§1](https://arxiv.org/html/2604.19070#S1.p2.1), [§2](https://arxiv.org/html/2604.19070#S2.SS0.SSS0.Px1.p3.1), [Table 2](https://arxiv.org/html/2604.19070#S3.T2), [§4\.1](https://arxiv.org/html/2604.19070#S4.SS1.SSS0.Px1.p1.1)\.
- X\. Wu, Y\. Shen, C\. Shan, K\. Song, S\. Wang, B\. Zhang, J\. Feng, H\. Cheng, W\. Chen, Y\. Xiong, and D\. Li \(2024\) Can graph learning improve planning in llm\-based agents?\. In NeurIPS\. Cited by: [§1](https://arxiv.org/html/2604.19070#S1.p1.1)\.
- Y\. Wu, G\. Lu, Y\. Zuo, H\. Zhang, and J\. Wu \(2025b\) Graph\-r1: incentivizing the zero\-shot graph learning capability in llms via explicit reasoning\. In EMNLP\. Cited by: [Table 1](https://arxiv.org/html/2604.19070#S1.T1), [§1](https://arxiv.org/html/2604.19070#S1.p2.1), [§2](https://arxiv.org/html/2604.19070#S2.SS0.SSS0.Px2.p2.1), [3rd item](https://arxiv.org/html/2604.19070#S4.I2.i3.p1.1)\.
- L\. Zhong, Z\. Wang, and J\. Shang \(2024\) Debug like a human: A large language model debugger via verifying runtime execution step by step\. In ACL\. Cited by: [§2](https://arxiv.org/html/2604.19070#S2.SS0.SSS0.Px2.p1.1)\.

## Appendix A Text Quality for Current Text\-rich Networks

Text attached to nodes in text\-rich networks is often noisy or incomplete, which makes it hard for an LLM to directly assign a reasonable label to such nodes\. Examples from the Cora and Photo datasets are shown in Table [6](https://arxiv.org/html/2604.19070#A1.T6)\.

| Dataset | Example node text |
| --- | --- |
| Cora | TABLE DES MATI ERES 1 Apprentissage et approximation les techniques de regularisation 3 1\.1 Introduction: |
| Photo | Good product, good price good price without shipping fee\. With shipping fee, it is still a good deal\. |

Table 6: Examples of meaningless and noisy texts in TRNs\.

## Appendix B Prompt Design for Edge\-level and Graph\-level Tasks

The following prompts are used to evaluate TRN\-R1\-Zero for edge\-level and graph\-level tasks\. The differences between them are the graph prompt and the task instruction parts\.

| Relation | Dataset | Node Type | Labels |
| --- | --- | --- | --- |
| Citation | Cora | Paper segment | Rule\_Learning; Neural\_Networks; Case\_Based; Genetic\_Algorithms; Theory; Reinforcement\_Learning; Probabilistic\_Methods |
| Citation | Citeseer | Paper segment | Agents; ML \(Machine Learning\); IR \(Information Retrieval\); DB \(Databases\); HCI \(Human\-Computer Interaction\); AI \(Artificial Intelligence\) |
| Hyperlink | WikiCS | Wikipedia article | Computational Linguistics; Databases; Operating Systems; Computer Architecture; Computer Security; Internet Protocols; Computer File Systems; Distributed Computing Architecture; Web Technology; Programming Language Topics |
| Social | Instagram | Instagram User Bio | Normal Users; Commercial Users |
| Co\-purchase | History | Product description | World; Americas; Asia; Military; Europe; Russia; Africa; Ancient Civilizations; Middle East; Historical Study & Educational Resources; Australia & Oceania; Arctic & Antarctica |
| Co\-purchase | Photo | Customer review | Video Surveillance; Accessories; Binoculars & Scopes; Video; Lighting & Studio; Bags & Cases; Tripods & Monopods; Flashes; Digital Cameras; Film Photography; Lenses; Underwater Photography |
| Commonsense | Expla\-Graph | Commonsense concept | Support; Counter |

Table 7: Meta information for benchmark TRNs grouped by relation type\.

Edge\-level Prompt for TRN\-R1\-Zero

\# System Prompt

You are a helpful AI Assistant that provides well\-reasoned and detailed responses\. You first think about the reasoning process as an internal monologue and then provide the user with the answer\. Respond in the following format: <think\> \.\.\. </think\> <answer\> \.\.\. </answer\>

\# Graph Prompt

Source node: \{source\_node\}
Target node: \{target\_node\}
Neighbours of source node: \{source\_neighbors\}
Neighbours of target node: \{target\_neighbors\}

\# Task Instruction

Your task is to predict whether a link exists between two nodes in a graph\. Each node represents a \{node\_type\}\. The relation type in this graph is \{relation\}\.

Question: Based on the attributes and neighbourhood structure of both nodes, predict whether a \{relation\} link exists between the source and target nodes\. Answer with 0 \(no link\) or 1 \(link exists\)\.
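
The placeholders in the edge\-level prompt can be filled with ordinary string formatting\. The following sketch shows one way to do so; the function name `build_edge_prompt`, the way neighbour texts are joined \(with `"; "`\), and the example inputs are assumptions, since the paper does not describe how neighbours are serialised\.

```python
EDGE_GRAPH_PROMPT = (
    "Source node: {source_node}\n"
    "Target node: {target_node}\n"
    "Neighbours of source node: {source_neighbors}\n"
    "Neighbours of target node: {target_neighbors}\n"
)

EDGE_TASK_INSTRUCTION = (
    "Your task is to predict whether a link exists between two nodes in a graph. "
    "Each node represents a {node_type}. "
    "The relation type in this graph is {relation}.\n"
    "Question: Based on the attributes and neighbourhood structure of both nodes, "
    "predict whether a {relation} link exists between the source and target nodes. "
    "Answer with 0 (no link) or 1 (link exists)."
)


def build_edge_prompt(source, target, source_nbrs, target_nbrs, node_type, relation):
    """Fill the graph prompt and task instruction for one candidate edge."""
    graph_part = EDGE_GRAPH_PROMPT.format(
        source_node=source,
        target_node=target,
        source_neighbors="; ".join(source_nbrs),  # serialisation is an assumption
        target_neighbors="; ".join(target_nbrs),
    )
    task_part = EDGE_TASK_INSTRUCTION.format(node_type=node_type, relation=relation)
    return graph_part + "\n" + task_part


# Hypothetical inputs; node_type and relation follow the WikiCS row of Table 7.
prompt = build_edge_prompt(
    "Article about file systems",
    "Article about operating systems",
    ["Article about disk partitioning", "Article about journaling"],
    [],
    node_type="Wikipedia article",
    relation="hyperlink",
)
print(prompt)
```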

Graph\-level Prompt for TRN\-R1\-Zero

\# System Prompt

You are a helpful AI Assistant that provides well\-reasoned and detailed responses\. You first think about the reasoning process as an internal monologue and then provide the user with the answer\. Respond in the following format: <think\> \.\.\. </think\> <answer\> \.\.\. </answer\>

\# Graph Prompt

Nodes: \{node\_list\}
Relationships: \{edge\_list\}

\# Task Instruction

Your task is to determine if two arguments support or counter each other, based on the provided commonsense graph\. The commonsense graph is defined by nodes and their relationships\. Based on this graph, consider the following: \{question\}\. Your answer must be a single integer ID, where 0 means support and 1 means counter\.
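
Both prompts ask for a response in the `<think> ... </think> <answer> ... </answer>` format with an integer answer\. A response in that format could be parsed as sketched below\. The regular expression, the function names, and the binary `correctness_reward` are hypothetical; the reward actually used for training is defined elsewhere in the paper and may differ\.

```python
import re

RESPONSE_RE = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)


def parse_response(text):
    """Return (reasoning, answer_id), or None if the expected format is violated."""
    match = RESPONSE_RE.search(text)
    if match is None:
        return None
    digits = re.search(r"\d+", match.group(2))
    if digits is None:
        return None
    return match.group(1).strip(), int(digits.group())


def correctness_reward(text, gold_id):
    """Hypothetical rule-based reward: 1.0 for a well-formed, correct answer, else 0.0."""
    parsed = parse_response(text)
    return 1.0 if parsed is not None and parsed[1] == gold_id else 0.0


well_formed = "<think> The concepts reinforce each other. </think> <answer> 0 </answer>"
malformed = "Final Answer: 0"

print(parse_response(well_formed))
print(correctness_reward(well_formed, gold_id=0), correctness_reward(malformed, gold_id=0))
```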

## Appendix C Extended Dataset Statistics

The following tables provide the extended dataset statistics \(Table [8](https://arxiv.org/html/2604.19070#A3.T8)\) together with a detailed description of the metadata for each dataset, such as the label names in text form \(Table [7](https://arxiv.org/html/2604.19070#A2.T7)\)\.

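| Domain | Dataset | \#Nodes | \#Graphs | \#Edges | Avg\. Deg\. | Homo\. | \#Classes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Citation | Citeseer∗ | 3,186 | 1 | 8,450 | 2\.65 | 0\.72 | 6 |
| Citation | Cora | 2,708 | 1 | 5,429 | 3\.90 | 0\.83 | 7 |
| Webpage | WikiCS | 11,701 | 1 | 431,206 | 36\.85 | 0\.68 | 10 |
| Social | Instagram | 11,339 | 1 | 144,010 | 12\.70 | 0\.59 | 2 |
| Co\-purchase | Photo | 48,362 | 1 | 873,782 | 18\.07 | 0\.79 | 12 |
| Co\-purchase | History∗ | 41,551 | 1 | 503,180 | 12\.11 | 0\.78 | 12 |
| Commonsense | Expla\-Graph | 5\.2 | 2,766 | 4\.3 | \- | \- | 2 |

Table 8: Statistics of benchmark datasets\. Datasets marked with ∗ \(Citeseer and History\) are used for RL training, while the others are held out for evaluation and generalisation studies\.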
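
The Homo\. column in Table 8 is presumably a homophily ratio\. Assuming it refers to the commonly used edge homophily, that is, the fraction of edges whose two endpoints share a label, it could be computed as in the sketch below\. The function name and the toy graph are hypothetical, and the paper does not state which homophily definition it uses\.

```python
def edge_homophily(edges, labels):
    """Fraction of edges whose two endpoints carry the same label."""
    same = sum(labels[u] == labels[v] for u, v in edges)
    return same / len(edges)


# Toy path graph 0-1-2-3 with two classes.
toy_edges = [(0, 1), (1, 2), (2, 3)]
toy_labels = [0, 0, 1, 1]
print(edge_homophily(toy_edges, toy_labels))
```

Under this definition, a value such as 0\.59 for Instagram would mean that a little over half of the edges connect users of the same class, so neighbour text is a weaker label signal there than in Cora \(0\.83\)\.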