
# GraphReAct: Reasoning and Acting for Multi-step Graph Inference
Source: [https://arxiv.org/html/2605.07357](https://arxiv.org/html/2605.07357)
Xingtong Yu¹, Zhongwei Kuai², Chang Zhou², Xuanting Xie³, Renhe Jiang⁴, Xikun Zhang⁵, Hong Cheng¹, Xinming Zhang², Yuan Fang⁶

¹The Chinese University of Hong Kong, ²The University of Science and Technology of China, ³University of Electronic Science and Technology of China, ⁴The University of Tokyo, ⁵RMIT University, ⁶Singapore Management University

###### Abstract

Reasoning-acting frameworks enhance large language models (LLMs) by interleaving reasoning with actions for dynamic information acquisition. However, extending this paradigm to graph learning remains underexplored. Graph data is inherently structured, with information distributed across nodes and edges and encoded through both topology and latent representations. As a result, effective reasoning over graphs requires not only retrieving informative evidence from the graph, but also progressively refining the accumulated context during multi-step inference. In this work, we propose GraphReAct, a graph reasoning-acting framework that enables step-by-step inference over graph-structured data. Specifically, we design a graph-based action space with two complementary retrieval actions: topological retrieval, which captures local structural dependencies, and semantic retrieval, which accesses non-local but relevant evidence in the representation space. These actions dynamically expand the reasoning context. To further support multi-step reasoning, we introduce another type of action, context refinement, which distills and reorganizes accumulated information into a compact representation. By interleaving reasoning with both retrieval and refinement actions, our framework enables a progressive transition from context expansion to compression. Extensive experiments on six benchmark datasets demonstrate that GraphReAct consistently outperforms state-of-the-art methods, validating the effectiveness of reasoning-acting for graph learning.

## 1 Introduction

Recent advances in large language models have demonstrated strong reasoning capabilities through Chain-of-Thought (CoT) prompting (Wei et al., [2022](https://arxiv.org/html/2605.07357#bib.bib16); Feng et al., [2023](https://arxiv.org/html/2605.07357#bib.bib21); Zhang et al., [2023](https://arxiv.org/html/2605.07357#bib.bib19)), which decomposes complex problems into intermediate reasoning steps. Building upon this paradigm, the reasoning-acting framework (Yao et al., [2022](https://arxiv.org/html/2605.07357#bib.bib20); Shen et al., [2025](https://arxiv.org/html/2605.07357#bib.bib170); Fu et al., [2025](https://arxiv.org/html/2605.07357#bib.bib169)) further enhances reasoning by interleaving it with structured actions, enabling models to acquire additional information during inference, as shown in Fig. [1](https://arxiv.org/html/2605.07357#S1.F1)(a). This reasoning-acting synergy has proven particularly effective in tasks such as question answering and decision making, where actions (e.g., search) help reduce hallucination and improve factual accuracy. These successes suggest that integrating reasoning with information acquisition is a powerful paradigm for solving complex problems.

![Refer to caption](https://arxiv.org/html/2605.07357v1/x1.png)

Figure 1: Comparison of reasoning paradigms. (a) Reasoning and acting in NLP. (b) CoT-based graph prompting. (c) Our proposed graph reasoning-acting (GraphReAct) framework.

Despite its effectiveness in natural language processing, extending the reasoning-acting framework to graph learning remains largely underexplored. Graph-structured data (Xia et al., [2021](https://arxiv.org/html/2605.07357#bib.bib18); Cook and Holder, [2006](https://arxiv.org/html/2605.07357#bib.bib17)) is inherently different from textual environments: information is distributed across nodes and edges, and meaningful signals are often encoded implicitly through topology and latent representations (Kipf and Welling, [2017](https://arxiv.org/html/2605.07357#bib.bib167); Veličković et al., [2018a](https://arxiv.org/html/2605.07357#bib.bib171)). Existing graph learning methods, including end-to-end graph encoders (Xu et al., [2019](https://arxiv.org/html/2605.07357#bib.bib172); Hamilton et al., [2017](https://arxiv.org/html/2605.07357#bib.bib168)), fine-tuning (You et al., [2020](https://arxiv.org/html/2605.07357#bib.bib107); Veličković et al., [2018b](https://arxiv.org/html/2605.07357#bib.bib143)), and graph prompting (Liu et al., [2023](https://arxiv.org/html/2605.07357#bib.bib113); Yu et al., [2023](https://arxiv.org/html/2605.07357#bib.bib368)) approaches, typically rely on a fixed receptive field and perform single-pass inference, which may limit their ability to resolve ambiguity under sparse or incomplete information. While recent work has incorporated CoT-style reasoning into graph-related tasks (Yu et al., [2025](https://arxiv.org/html/2605.07357#bib.bib496)), it lacks an action mechanism to dynamically expand the available context, as shown in Fig. [1](https://arxiv.org/html/2605.07357#S1.F1)(b).
A key obstacle is the mismatch between the unstructured action space of the natural language domain (e.g., web search (Yao et al., [2022](https://arxiv.org/html/2605.07357#bib.bib20))) and the structured nature of graph data. This raises a fundamental question: *how can we enable reasoning-acting synergy in graph learning through appropriate action design?* This problem is non-trivial due to two key challenges.

First, in the natural language domain, actions typically correspond to interactions with external knowledge sources such as Wikipedia (Yao et al., [2022](https://arxiv.org/html/2605.07357#bib.bib20); Fu et al., [2025](https://arxiv.org/html/2605.07357#bib.bib169)). In contrast, for graph learning tasks, while external information remains accessible, it is often misaligned with the task objective and may introduce irrelevant or noisy signals. Instead, the most informative evidence is inherently structured within the graph itself, encoded through both topology and latent representations. Therefore, directly transferring text-based action designs to graph domains is suboptimal. This raises a key challenge: *how can we design graph-aware actions that effectively retrieve informative evidence from both topological structure and semantic similarity?* In this work, we propose a graph-based retrieval action space that enables structured information acquisition from the graph, as shown in Fig. [1](https://arxiv.org/html/2605.07357#S1.F1)(c). Specifically, we design two complementary actions: (1) *topological retrieval*, which gathers textual information from structurally neighboring nodes to capture local dependencies, and (2) *semantic retrieval*, which identifies nodes with similar embeddings to provide non-local, semantically relevant evidence. By jointly leveraging these two actions, our framework balances locality and globality, allowing the model to dynamically expand the context beyond fixed receptive fields.

Second, *how can we support multi-step reasoning with actions in graph domains?* In standard reasoning-acting frameworks, actions are typically used for information acquisition, where the model queries external environments to gather additional evidence (Yao et al., [2022](https://arxiv.org/html/2605.07357#bib.bib20)). However, graph reasoning exhibits a different requirement. While early-stage reasoning benefits from expanding the context through information retrieval, later stages require consolidating and distilling the accumulated information to avoid redundancy and noise. This introduces an inherent tension between context expansion and context compression, which is not explicitly addressed in existing reasoning-acting frameworks. To address this challenge, we design another type of action, termed *context refinement*, tailored for graph-aware LLM reasoning. Specifically, after the initial graph-aware retrieval stage, subsequent actions focus on context refinement, where the model distills, summarizes, and reorganizes the accumulated information into a more compact and informative representation. This design enables a natural transition from expansion to compression, allowing the model to perform multi-step reasoning over a progressively constructed context rather than relying on repeated retrieval from the graph. Notably, unlike fully agentic reasoning-acting frameworks, our action sequence follows a structured design tailored to graph data, where retrieval and refinement are applied in a predefined order to ensure stable and efficient reasoning.

In summary, our contributions are fourfold. (1) We propose GraphReAct, a novel reasoning-acting framework for graph learning that enables step-by-step inference via structured interaction with graph-structured data. To the best of our knowledge, this is the first work that systematically extends the reasoning-acting paradigm to graph domains. (2) We propose a graph-based retrieval action space with topological and semantic retrieval, enabling structured and complementary information acquisition from both local neighborhoods and the global semantic space. (3) We design a context refinement action that balances context expansion and compression, enabling progressive refinement of graph-aware information during inference. (4) We conduct extensive experiments on six datasets, demonstrating that GraphReAct consistently outperforms state-of-the-art methods and validating the effectiveness of reasoning-acting in graph learning.

## 2 Related Work

**LLMs for graph learning.** Recent studies integrate LLMs into graph learning in three main ways. Some use LLMs as semantic feature extractors to enrich node representations and mitigate domain gaps (Wang et al., [2024b](https://arxiv.org/html/2605.07357#bib.bib13); He et al., [2025](https://arxiv.org/html/2605.07357#bib.bib14); Liu et al., [2024](https://arxiv.org/html/2605.07357#bib.bib286); Yuan et al., [2025](https://arxiv.org/html/2605.07357#bib.bib12)). Others employ LLMs as predictors to leverage their reasoning capabilities (Tang et al., [2024a](https://arxiv.org/html/2605.07357#bib.bib11), [b](https://arxiv.org/html/2605.07357#bib.bib10); Chen et al., [2024](https://arxiv.org/html/2605.07357#bib.bib9)). There are also efforts to align graph representations with LLM embedding spaces to enable graph–language interaction (Tang et al., [2024a](https://arxiv.org/html/2605.07357#bib.bib11); Chen et al., [2024](https://arxiv.org/html/2605.07357#bib.bib9)). Despite these advances, most methods rely on static graph–language interfaces with single-pass inference, lacking explicit multi-step reasoning and structured evidence acquisition. In contrast, our method enables iterative evidence integration through graph-aware operations, supporting multi-step reasoning over graph data.

**Reasoning and acting in LLMs.** Chain-of-Thought (CoT) prompting improves reasoning by decomposing problems into intermediate steps (Wei et al., [2022](https://arxiv.org/html/2605.07357#bib.bib16); Zhou et al., [2023](https://arxiv.org/html/2605.07357#bib.bib8); Wang et al., [2023](https://arxiv.org/html/2605.07357#bib.bib7); Yao et al., [2023](https://arxiv.org/html/2605.07357#bib.bib2)). Building on this, reasoning-acting frameworks interleave reasoning with external actions to acquire additional information (Yao et al., [2022](https://arxiv.org/html/2605.07357#bib.bib20); Fu et al., [2025](https://arxiv.org/html/2605.07357#bib.bib169)). However, directly applying such approaches to graph learning is non-trivial, as graph information is inherently structured and distributed across topology and latent semantic space. Existing methods neither exploit these structured inductive biases nor address the need to organize evidence acquisition and refinement in multi-step reasoning. Our work instead introduces a structured mechanism tailored to graph data, combining graph-based retrieval with progressive context refinement.

## 3 Preliminaries

**Graph encoder.** A graph is defined as $G=(\mathcal{V},\mathcal{E},\mathbf{X})$, where $\mathcal{V}$ and $\mathcal{E}$ denote the sets of nodes and edges, respectively, and $\mathbf{X}$ is the node feature matrix, whose $i$-th row $x_{i}$ corresponds to the feature vector of node $v_{i}\in\mathcal{V}$. We denote a collection of graphs as $\mathcal{G}$. A mainstream technique for graph representation learning is GNNs, which recursively update node representations through message passing. Let $\mathbf{H}^{l}\in\mathbb{R}^{|\mathcal{V}|\times d}$ denote the embedding matrix at the $l$-th layer, where the $i$-th row $\mathbf{h}_{i}^{l}$ represents the embedding of node $v_{i}$. The layer-wise update is defined as

$$\mathbf{h}^{l}_{v}=\mathtt{MP}\big(\mathbf{h}^{l-1}_{v},\{\mathbf{h}^{l-1}_{u}:u\in\mathcal{N}_{v}\};\theta^{l}\big),\tag{1}$$

where $\mathcal{N}_{v}$ denotes the neighbors of $v$, $\mathtt{MP}$ denotes the message-passing function, and $\theta^{l}$ represents the learnable parameters of the $l$-th layer. The initial embedding is given by $\mathbf{h}^{0}_{v}=x_{v}$, and after $L$ layers, the final node representations are denoted as

$$\mathbf{H}=\mathtt{GE}(\mathbf{X},G;\Theta),\tag{2}$$

where $\Theta=\{\theta^{1},\dots,\theta^{L}\}$.
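To make Eqs. (1) and (2) concrete, the following is a minimal NumPy sketch of a message-passing encoder. The mean aggregation, the `tanh` nonlinearity, and the names `mp_layer` and `graph_encoder` are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

def mp_layer(H, neighbors, W_self, W_nbr):
    """One message-passing layer (Eq. 1): combine each node's own
    embedding with the mean of its neighbors' embeddings."""
    H_new = np.empty((H.shape[0], W_self.shape[1]))
    for v in range(H.shape[0]):
        nbr = neighbors.get(v, [])
        agg = H[nbr].mean(axis=0) if nbr else np.zeros(H.shape[1])
        H_new[v] = np.tanh(H[v] @ W_self + agg @ W_nbr)
    return H_new

def graph_encoder(X, neighbors, params):
    """Stack L layers (Eq. 2): H = GE(X, G; Theta), with H^0 = X."""
    H = X
    for W_self, W_nbr in params:
        H = mp_layer(H, neighbors, W_self, W_nbr)
    return H
```

Each call to `mp_layer` is one application of $\mathtt{MP}$; stacking $L$ layers yields the final representations $\mathbf{H}$.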

**Graph–LLM interface.** Recent approaches introduce a graph-to-language interface that integrates graph encoders with LLMs, enabling LLMs to perform prediction over graph-structured data. To bridge the gap between graph representations and the LLM token space, existing methods typically first conduct pre-training based on multimodal contrastive learning (Wang et al., [2024a](https://arxiv.org/html/2605.07357#bib.bib1); Chen et al., [2024](https://arxiv.org/html/2605.07357#bib.bib9)). Given a graph $G$ and its associated text, the graph encoder $\mathtt{GE}$ is trained to produce representations that are aligned with the embedding space of the LLM. Specifically, the pre-training objective encourages graph representations to align with corresponding textual representations, thereby embedding graph features into the LLM semantic space. This alignment enables graph embeddings to be interpreted as pseudo-tokens that are compatible with LLM inputs. After pre-training, the graph encoder is frozen. Given an input graph, it first produces representations $\mathbf{H}_{\mathcal{V}}=\mathtt{GE}(\mathbf{X},G;\Theta)$, and then maps them into the token embedding space of the LLM via a projection function:

$$\mathbf{H}^{\mathrm{tok}}=\mathtt{Proj}(\mathbf{H};\phi),\tag{3}$$

where $\phi$ is the learnable parameter. The graph embedding tokens $\mathbf{H}^{\mathrm{tok}}$ share the same embedding dimension as LLM tokens and are incorporated into an instruction template for the LLM.
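A minimal sketch of Eq. (3), assuming a simple linear projection; the paper does not specify the projector's form at this point, and `build_llm_input` is a hypothetical helper showing how a resulting pseudo-token could be spliced into a token-embedding sequence:

```python
import numpy as np

def project_to_tokens(H, W_proj, b_proj):
    # Eq. (3): H_tok = Proj(H; phi). Here phi = (W_proj, b_proj) is a
    # linear map whose output dimension matches the LLM token-embedding size.
    return H @ W_proj + b_proj

def build_llm_input(text_token_embs, node_tok):
    # Prepend the node's pseudo-token to the ordinary text-token embeddings
    # so the LLM consumes graph and text evidence in one sequence.
    return np.vstack([node_tok[None, :], text_token_embs])
```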

## 4 Proposed Approach

In this section, we introduce GraphReAct. We first provide a high-level overview of the framework and then describe its core components in detail.

### 4.1 Overall Framework

![Refer to caption](https://arxiv.org/html/2605.07357v1/x2.png)

Figure 2: Overall framework of GraphReAct.

We illustrate the overall framework of GraphReAct in Fig. [2](https://arxiv.org/html/2605.07357#S4.F2), which consists of two phases: (1) pre-training, and (2) graph-based reasoning and acting. First, we pre-train a graph encoder to produce representations aligned with the embedding space of a large language model, as shown in Fig. [2](https://arxiv.org/html/2605.07357#S4.F2)(a). Details of the pre-training phase are provided in Sect. [3](https://arxiv.org/html/2605.07357#S3). Second, we propose a graph-based reasoning and acting framework that guides the LLM to perform multi-step inference via iterative context updating, as illustrated in Fig. [2](https://arxiv.org/html/2605.07357#S4.F2)(b). Specifically, at each step, the LLM generates an intermediate thought based on the current context, which is used to guide a predefined action for updating the context. We consider two types of actions. The first is graph-based retrieval, which expands the context by incorporating information from topologically related nodes and semantically similar nodes, as shown in Fig. [2](https://arxiv.org/html/2605.07357#S4.F2)(b1, b2). The second is context refinement, which compresses and reorganizes the accumulated context to reduce redundancy and enhance relevance. Through this progressive context-updating process, the model constructs a high-quality context for the final prediction.

### 4.2 Graph-based Retrieval

Graph-based retrieval is designed to acquire informative evidence from graph-structured data, where useful signals reside in both the graph topology and the latent representation space. To this end, we design a unified set of action primitives that captures two complementary sources of information: structural locality and semantic similarity.

**Topological retrieval.** We first introduce a topology-based action that extracts structurally grounded evidence from the graph. In graphs, informative signals are often distributed along connectivity patterns, making local neighborhoods a primary source of context for reasoning. To capture such structural dependencies, we retrieve nodes that are topologically related to the target node $v$ via a breadth-first traversal (Bundy and Wallen, [1984](https://arxiv.org/html/2605.07357#bib.bib150)), collecting the first $N$ visited neighbors. Unlike random neighbor sampling, this traversal expands the receptive field in a structured manner, ensuring a consistent retrieval budget across nodes while adaptively incorporating higher-order neighborhoods when necessary.

We then transform the retrieved structural evidence into a compact representation suitable for reasoning. Specifically, we construct a topology-based instruction by integrating the target node text with the textual information of the retrieved neighbors, and use the frozen LLM to generate a topology-based summary. This summary, denoted as $s^{\mathrm{top}}$, serves as a structured abstraction of the local neighborhood, encoding aggregated structural signals into a form that can be directly consumed by subsequent reasoning steps.
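The breadth-first collection of the first $N$ visited neighbors can be sketched as follows; the adjacency-dictionary representation and the function name are assumptions for illustration:

```python
from collections import deque

def topological_retrieve(neighbors, v, n_budget):
    """Collect the first `n_budget` nodes visited by breadth-first
    traversal from target node v (excluding v itself). Higher-order
    neighbors are included only when the 1-hop ring is too small."""
    visited, order = {v}, []
    queue = deque([v])
    while queue and len(order) < n_budget:
        u = queue.popleft()
        for w in neighbors.get(u, []):
            if w not in visited:
                visited.add(w)
                order.append(w)
                queue.append(w)
                if len(order) == n_budget:
                    break
    return order
```

Because the traversal stops as soon as the budget is met, every node receives the same retrieval budget regardless of its degree.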

**Semantic retrieval.** While topological retrieval captures local structural dependencies, it may fail to access semantically relevant but structurally distant evidence. To address this limitation, we introduce a complementary semantic retrieval action that operates in the representation space. Specifically, we identify nodes that are semantically similar to the target node $v$ based on cosine similarity (Xia et al., [2015](https://arxiv.org/html/2605.07357#bib.bib152)) between node embeddings, and select the top-$M$ most similar nodes. This process enables the model to access globally relevant evidence that is not constrained by graph connectivity, thereby extending the reasoning context beyond local neighborhoods.

Similar to topological retrieval, we convert the retrieved semantic evidence into a reasoning-compatible representation. We construct a semantic-based instruction by integrating the target node text with the textual information of the retrieved semantically similar nodes, and feed it into the frozen LLM to produce a semantic-based summary. This summary, denoted as $s^{\mathrm{sem}}$, provides a global semantic abstraction of the target node, complementing the locality of topology-based context and enabling reasoning over both structural and semantic dimensions. This dual-retrieval design provides a structured mechanism for accessing both local and global evidence, forming the foundation for graph-aware reasoning in subsequent steps.
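A minimal sketch of the semantic retrieval step, selecting the top-$M$ nodes by cosine similarity over the (frozen) encoder embeddings; the function name and array-based interface are illustrative:

```python
import numpy as np

def semantic_retrieve(H, v, m):
    """Return indices of the top-M nodes most similar to node v by
    cosine similarity of their embeddings (excluding v itself)."""
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
    sims = Hn @ Hn[v]          # cosine similarity to the target node
    sims[v] = -np.inf          # never retrieve the target itself
    return np.argsort(-sims)[:m].tolist()
```

Unlike the breadth-first action, this selection ignores graph connectivity entirely, so retrieved nodes may be arbitrarily far from $v$ in the topology.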

**Comparison with heterophilic graph learning.** Existing methods for modeling non-local dependencies, particularly in heterophilic graphs (Ma et al., [2022](https://arxiv.org/html/2605.07357#bib.bib222); Luan et al., [2022](https://arxiv.org/html/2605.07357#bib.bib223)), typically extend message passing with multi-hop aggregation or similarity-based propagation. However, these approaches rely on fixed aggregation schemes learned during training and lack instance-specific evidence selection at inference time. Moreover, they operate in a single-modality setting and are not designed for zero-shot generalization across datasets. In contrast, our approach formulates graph information access as an explicit action in the reasoning process. Instead of static propagation, it performs explicit evidence acquisition via topological and semantic retrieval and converts the results into natural language summaries for subsequent reasoning. This enables flexible and instance-specific evidence integration, which is particularly suitable for zero-shot graph learning.

**Context construction.** Graph-based retrieval is applied after the initial reasoning step to construct an evidence-augmented context for subsequent reasoning. We first perform an initial inference using the graph–LLM interface:

$$\text{Thought}^{1}=\mathtt{LLM}(\mathcal{I},\mathcal{Q}),\tag{4}$$

where $\mathcal{I}$ denotes the initial instruction formed by combining the target node text $x^{\mathrm{text}}$ and its node token embedding $\mathbf{h}_{V}^{\mathrm{tok}}$, and $\mathcal{Q}$ denotes a task-level instruction specifying the prediction objective. $\mathcal{Q}$ is shared across all nodes and remains fixed during inference. We then retrieve structured evidence from the graph via graph-based retrieval:

$$\text{Observation}^{1}=\mathtt{Act}_{\text{retrieve}}(G,v)=\{s^{\mathrm{top}},\;s^{\mathrm{sem}}\},\tag{5}$$

where $s^{\mathrm{top}}$ and $s^{\mathrm{sem}}$ denote the topological and semantic summaries, respectively, and $\mathtt{Act}_{\text{retrieve}}(\cdot)$ denotes a composite operation that includes topological and semantic retrieval and LLM-based abstraction into textual summaries. Finally, we construct the initial context by integrating the initial thought with the retrieved evidence:

$$\mathcal{C}^{1}=\mathtt{Init}(\text{Thought}^{1},\;\text{Observation}^{1}),\tag{6}$$

where $\mathtt{Init}(\cdot)$ denotes a prompt construction operation that organizes the thought and retrieved summaries into a structured instruction serving as the input context for the LLM. $\mathcal{C}^{1}$ is a structured textual instruction that encodes the accumulated evidence and reasoning state fed into the LLM. This initial context serves as the starting point for subsequent multi-step reasoning, where the context is progressively refined through iterative updates. Notably, these retrieval operations are applied in a predefined manner rather than being dynamically selected by the model, reflecting the structured nature of graph data. We provide detailed instruction templates in Appendix [H](https://arxiv.org/html/2605.07357#A8).
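As a toy illustration of the $\mathtt{Init}(\cdot)$ operation in Eq. (6): the actual instruction templates are given in the paper's Appendix H, so the field names below are placeholders rather than the paper's wording:

```python
def init_context(thought1, observation1):
    """Init(Thought^1, Observation^1) of Eq. (6): assemble the first
    thought and the two retrieval summaries into one textual context."""
    s_top, s_sem = observation1
    return (
        f"Thought: {thought1}\n"
        f"Topological evidence: {s_top}\n"
        f"Semantic evidence: {s_sem}"
    )
```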

### 4.3 Context Refinement

Context refinement is designed to support multi-step reasoning by progressively updating the reasoning context under the guidance of intermediate thoughts. Unlike graph-based retrieval, which acquires external evidence from the graph, refinement is formulated as another type of action within the same action space, operating on the accumulated context and reasoning signals to enable structured consolidation and reorganization of information across steps.

At each reasoning step, the model first generates a new thought based on the current context. This thought is then used to guide a predefined refinement operation that updates the context. To unify different forms of context updates, we define a generalized refinement operator that produces an observation:

$$\text{Observation}^{k+1}=\mathtt{Act}_{\text{refine}}(\mathcal{C}^{k},\;\text{Thought}^{k+1}),\tag{7}$$

where $\mathtt{Act}_{\text{refine}}(\cdot)$ denotes a refinement action that transforms the accumulated context into a more compact and informative representation. The updated context is then constructed by integrating the previous context, the newly generated thought, and the resulting observation:

$$\mathcal{C}^{k+1}=\mathtt{Update}(\mathcal{C}^{k},\;\text{Thought}^{k+1},\;\text{Observation}^{k+1}),\tag{8}$$

where $\mathtt{Update}(\cdot)$ incorporates the observation into the existing context while preserving relevant historical information. Specifically, the refinement action is implemented via instruction-guided generation, where the LLM performs a reasoning-conditioned transformation over $\mathcal{C}^{k}$ and $\text{Thought}^{k+1}$, conducting information distillation and structural reorganization. The resulting observation serves as a condensed abstraction of the accumulated context, analogous to the evidence summaries obtained during graph-based retrieval.

To summarize the overall reasoning process, at each step $k$, the LLM operates on a structured input consisting of the node-specific instruction, current context, and task query:

$$\text{Thought}^{k+1}=\mathtt{LLM}(\mathcal{I},\;\mathcal{C}^{k},\;\mathcal{Q}).\tag{9}$$

The generated thought is then used to guide the corresponding refinement operation as defined in Eq. ([7](https://arxiv.org/html/2605.07357#S4.E7)) and Eq. ([8](https://arxiv.org/html/2605.07357#S4.E8)). At the final step, the LLM directly produces the prediction:

$$\hat{y}=\mathtt{LLM}(\mathcal{I},\;\mathcal{C}^{K},\;\mathcal{Q}).\tag{10}$$
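Putting Eqs. (4)–(10) together, the control flow of one inference episode can be sketched as below; `llm`, `retrieve`, and `refine` are caller-supplied callables standing in for the frozen LLM and the graph-based actions, and the string-concatenation context update is a simplification of the $\mathtt{Init}$/$\mathtt{Update}$ operations:

```python
def graph_react(llm, retrieve, refine, instruction, query, K):
    """Sketch of one GraphReAct episode: one retrieval step, then K-1
    refinement steps, then a final prediction."""
    thought = llm(instruction, None, query)               # Eq. (4)
    observation = retrieve()                              # Eq. (5)
    context = f"{thought}\n{observation}"                 # Eq. (6)
    for _ in range(K - 1):
        thought = llm(instruction, context, query)        # Eq. (9)
        observation = refine(context, thought)            # Eq. (7)
        context = f"{context}\n{thought}\n{observation}"  # Eq. (8)
    return llm(instruction, context, query)               # Eq. (10): prediction
```

Note that the action order is fixed in advance (retrieval first, refinement afterwards), matching the predefined, non-agentic action sequence described above.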

### 4.4 Adaptation and Inference

We consider a node classification task with a labeled training set $\mathcal{D}=\{(v_{1},y_{1}),(v_{2},y_{2}),\dots\}$, where $v_{i}\in\mathcal{V}$ denotes a target node and $y_{i}\in Y$ is its class label. Given a target node $v$, our framework performs multi-step reasoning over the constructed context and produces a predictive distribution over labels via the LLM. Let $p(y\mid v)$ denote the probability of predicting label $y$ from the LLM output conditioned on the final instruction. The training objective is defined as the negative log-likelihood:

$$\mathcal{L}_{\text{down}}(\mathcal{D};\phi)=-\sum_{(v_{i},y_{i})\in\mathcal{D}}\log p(y_{i}\mid v_{i};\phi),\tag{11}$$

where the probability is derived from the LLM output via label verbalization or answer matching. During training, both the graph encoder and the LLM are kept frozen. The only trainable component is the projection function $\mathtt{Proj}(\cdot;\phi)$, which maps graph representations into the LLM token embedding space. By optimizing $\phi$, the model learns to better align graph-derived representations with the LLM input space for downstream prediction.
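Given per-node logits over the candidate label verbalizations, the objective in Eq. (11) reduces to a standard negative log-likelihood; this NumPy sketch assumes the label distribution is obtained by a softmax over those logits:

```python
import numpy as np

def label_log_probs(logits):
    # Numerically stable log-softmax over the candidate label verbalizations.
    z = logits - logits.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def nll_loss(logits, labels):
    # Eq. (11): sum of -log p(y_i | v_i) over the labeled training set.
    lp = label_log_probs(logits)
    return -lp[np.arange(len(labels)), labels].sum()
```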

For cross-dataset inference, the learned projection function can be directly applied to unseen graphs without further training. Specifically, given a new graph and a target instance, we construct the corresponding instruction using the projected graph representations and perform the same multi-step reasoning process to obtain predictions. This enables GraphReAct to be applied to unseen graphs in a zero-shot manner, leveraging the learned graph–LLM alignment without requiring additional task-specific training.

## 5 Experiments

In this section, we conduct experiments to evaluate GraphReAct and analyze the empirical results.

### 5.1 Experimental Setup

**Datasets.** We evaluate GraphReAct on eight text-attributed graph datasets from two domains: citation networks and e-commerce networks. For citation networks, we use Arxiv (Hu et al., [2020](https://arxiv.org/html/2605.07357#bib.bib432)), PubMed (He et al., [2024](https://arxiv.org/html/2605.07357#bib.bib504)), and an expanded version of Cora (Wen and Fang, [2023](https://arxiv.org/html/2605.07357#bib.bib193)), where nodes represent papers and edges denote citation relationships. For e-commerce networks, we use Computer, Photo, Children, History, and Sports from the TAG benchmark (Yan et al., [2023](https://arxiv.org/html/2605.07357#bib.bib505)), where nodes represent products and edges indicate frequent co-viewing or co-purchasing relationships. Detailed statistics of all datasets are provided in Table [4](https://arxiv.org/html/2605.07357#A3.T4), and further descriptions are provided in Appendix [B](https://arxiv.org/html/2605.07357#A2).

Table 1: Accuracy of zero-shot node classification. Cora and Pubmed are citation networks; Children, History, Photo, and Sports are e-commerce networks.

| Model | Cora | Pubmed | Children | History | Photo | Sports |
| --- | --- | --- | --- | --- | --- | --- |
| MLP | 0.021±0.006 | 0.323±0.027 | 0.029±0.037 | 0.080±0.041 | 0.110±0.070 | 0.042±0.021 |
| GCN | 0.017±0.004 | 0.288±0.092 | 0.030±0.018 | 0.063±0.042 | 0.103±0.047 | 0.042±0.025 |
| GraphSAGE | 0.014±0.007 | 0.316±0.058 | 0.008±0.007 | 0.195±0.206 | 0.056±0.055 | 0.051±0.015 |
| GAT | 0.016±0.004 | 0.343±0.064 | 0.086±0.084 | 0.172±0.098 | 0.050±0.027 | 0.142±0.138 |
| NodeFormer | 0.016±0.007 | 0.308±0.093 | 0.048±0.028 | 0.168±0.127 | 0.073±0.015 | 0.165±0.057 |
| DIFFormer | 0.029±0.014 | 0.361±0.071 | 0.129±0.030 | 0.275±0.171 | 0.321±0.055 | 0.306±0.131 |
| DGI | 0.026±0.009 | 0.329±0.103 | 0.082±0.035 | 0.218±0.168 | 0.224±0.127 | 0.049±0.017 |
| GKD | 0.042±0.008 | 0.399±0.033 | 0.202±0.064 | 0.339±0.138 | 0.166±0.086 | 0.208±0.077 |
| GLNN | 0.031±0.006 | 0.390±0.011 | 0.187±0.012 | 0.283±0.021 | 0.403±0.019 | 0.317±0.048 |
| Vicuna-7B-v1.5 | 0.156±0.001 | 0.719±0.010 | 0.270±0.001 | 0.363±0.001 | 0.378±0.004 | 0.370±0.001 |
| Vicuna-7B-SPT | 0.168±0.018 | 0.768±0.036 | 0.227±0.015 | 0.281±0.088 | 0.350±0.061 | 0.230±0.018 |
| OFA | 0.130±0.019 | 0.314±0.059 | 0.064±0.086 | 0.052±0.049 | 0.340±0.026 | 0.101±0.071 |
| GraphGPT-std | 0.126 | 0.701 | — | — | — | — |
| GraphGPT-cot | 0.181 | 0.521 | — | — | — | — |
| LLaGA | 0.168±0.032 | 0.793±0.036 | 0.199±0.007 | 0.146±0.067 | 0.276±0.069 | 0.352±0.033 |
| TEA-GLM | <u>0.202±0.014</u> | **0.848±0.010** | <u>0.271±0.010</u> | <u>0.528±0.058</u> | <u>0.497±0.027</u> | <u>0.404±0.010</u> |
| GraphReAct | **0.273±0.002** | <u>0.819±0.001</u> | **0.294±0.001** | **0.645±0.001** | **0.523±0.003** | **0.483±0.001** |

The best method is bolded and the runner-up is underlined.

Baselines. We compare GraphReAct with representative methods across six categories. (1) *Non-graph model*: MLP (Taud and Mas, [2017](https://arxiv.org/html/2605.07357#bib.bib151)) serves as a structure-agnostic baseline that relies solely on node features without exploiting graph topology. (2) *Supervised graph methods*: GCN (Kipf and Welling, [2017](https://arxiv.org/html/2605.07357#bib.bib167)), GraphSAGE (Hamilton et al., [2017](https://arxiv.org/html/2605.07357#bib.bib168)), and GAT (Veličković et al., [2018a](https://arxiv.org/html/2605.07357#bib.bib171)) perform supervised learning via message passing over graph structures, while NodeFormer (Wu et al., [2022](https://arxiv.org/html/2605.07357#bib.bib501)) and DIFFormer (Wu et al., [2023](https://arxiv.org/html/2605.07357#bib.bib502)) extend transformer architectures to graphs to capture long-range dependencies. (3) *Self-supervised graph methods*: DGI (Veličković et al., [2018b](https://arxiv.org/html/2605.07357#bib.bib143)) learns node representations via contrastive objectives on unlabeled graphs, followed by a classifier for downstream prediction. (4) *Graph knowledge distillation*: GKD (Yang et al., [2022](https://arxiv.org/html/2605.07357#bib.bib499)) transfers structural knowledge from a teacher GNN trained on a full graph to a student model operating under restricted structural access, while GLNN (Zhang et al., [2022](https://arxiv.org/html/2605.07357#bib.bib500)) distills graph-aware representations into an MLP-like architecture to reduce reliance on graph connectivity during inference. (5) *Large language models*: Vicuna-7B-v1.5 (Chiang et al., [2023](https://arxiv.org/html/2605.07357#bib.bib503)) and Vicuna-7B-SPT evaluate the capability of LLMs for graph tasks via textual prompting without explicit graph modeling. (6) *Graph–LLM methods*: OFA (Liu et al., [2024](https://arxiv.org/html/2605.07357#bib.bib286)), GraphGPT (Tang et al., [2024a](https://arxiv.org/html/2605.07357#bib.bib11)), LLaGA (Chen et al., [2024](https://arxiv.org/html/2605.07357#bib.bib9)), and TEA-GLM (Wang et al., [2024a](https://arxiv.org/html/2605.07357#bib.bib1)) integrate graph representations with LLMs to enable zero-shot or few-shot inference on graph tasks. Detailed descriptions of all baselines are provided in Appendix [C](https://arxiv.org/html/2605.07357#A3), with implementation details in Appendix [D](https://arxiv.org/html/2605.07357#A4).

Evaluation setting. We follow the cross-dataset zero-shot evaluation protocol of TEA-GLM (Wang et al., [2024a](https://arxiv.org/html/2605.07357#bib.bib1)). Specifically, Arxiv and Computer are used as source datasets for pre-training and downstream adaptation, and all methods are evaluated on unseen target datasets without further adaptation. For the citation domain, PubMed and Cora are used as target datasets; for the e-commerce domain, we evaluate on Children, History, Photo, and Sports. We adopt the same data splits as TEA-GLM, i.e., 90,941 nodes for training on Arxiv and 62,748 nodes for training on Computer, with 1,000 nodes used for evaluation on each of the other datasets. For all baseline methods, we directly use the results reported in TEA-GLM. We report accuracy and Macro-F1 for node classification. Each experiment is conducted with five random seeds, and we report the mean and standard deviation.
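Since each reported number is a mean ± standard deviation over five random seeds, the aggregation can be sketched with plain-Python helpers (the paper does not specify its metric implementation, so these are illustrative):

```python
from statistics import mean, stdev

def accuracy(y_true, y_pred):
    # Fraction of exactly matching labels.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    # Unweighted mean of per-class F1 scores.
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

def report(per_seed_scores):
    # Mean and standard deviation over the five random seeds.
    return mean(per_seed_scores), stdev(per_seed_scores)
```

Macro-F1 weights every class equally, which matters for the fine-grained label sets (e.g., 70 classes on Cora) where per-class support is highly imbalanced.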

### 5.2 Performance Evaluation

We make the following observations. First, GraphReAct consistently outperforms or remains highly competitive with all baselines across datasets, demonstrating the effectiveness of incorporating graph-aware reasoning and acting into zero-shot graph learning. Compared with conventional graph learning methods, the performance gains indicate that dynamically acquiring and refining graph-aware evidence provides informative and transferable context for cross-dataset prediction. Second, while LLM-based and graph–LLM methods, such as LLaGA and TEA-GLM, incorporate textual or graph representations into LLM inference, they typically rely on a fixed and static context, limiting their ability to adaptively exploit available evidence. In contrast, GraphReAct introduces a dynamic reasoning process that iteratively expands and refines the context through graph-aware retrieval and context refinement actions. This allows the model to progressively integrate topological and semantic evidence while filtering out irrelevant information, leading to more informative and task-relevant representations for prediction. On PubMed, GraphReAct achieves performance comparable to TEA-GLM. One possible reason is that PubMed contains only three coarse-grained classes, where semantically or topologically similar nodes (e.g., papers on related types of diabetes) may introduce ambiguous signals during context expansion, limiting the advantage of reasoning-based refinement.

Ablation study. We analyze the contribution of each component in Table [2](https://arxiv.org/html/2605.07357#S5.T2). First, comparing Variant 1 with Variants 2–4 shows that both topological retrieval and semantic retrieval consistently improve performance over the no-retrieval baseline, demonstrating the effectiveness of graph-aware evidence acquisition. Among them, topological retrieval brings slightly larger gains than semantic retrieval, indicating the importance of structural signals. Second, combining topological and semantic retrieval (Variant 4) further improves performance, suggesting that the two types of retrieval provide complementary information from structural and semantic perspectives. Third, context refinement alone (Variant 5) yields limited improvement, as it operates without additional external evidence. However, when combined with both topological and semantic retrieval, the full model achieves the best performance on most datasets, highlighting that refinement plays a crucial role in consolidating retrieved information and enhancing reasoning quality.

Table 2: Ablation study of key components, reporting accuracy.

| Methods | TR | SR | CF | Cora | Children | History | Sports |
|---|---|---|---|---|---|---|---|
| Variant 1 | × | × | × | 0.248±0.001 | 0.270±0.003 | 0.568±0.002 | 0.404±0.008 |
| Variant 2 | ✓ | × | × | 0.274±0.003 | 0.290±0.002 | 0.603±0.006 | 0.445±0.001 |
| Variant 3 | × | ✓ | × | 0.256±0.002 | 0.283±0.003 | 0.599±0.001 | 0.418±0.002 |
| Variant 4 | ✓ | ✓ | × | 0.285±0.001 | 0.293±0.004 | 0.631±0.001 | 0.421±0.001 |
| Variant 5 | × | × | ✓ | 0.255±0.006 | 0.272±0.003 | 0.578±0.005 | 0.410±0.003 |
| GraphReAct | ✓ | ✓ | ✓ | 0.273±0.002 | 0.294±0.001 | 0.645±0.001 | 0.483±0.001 |

TR: topological retrieval; SR: semantic retrieval; CF: context refinement\.

### 5.3 Comparison with Textual Search Action

We further analyze whether the standard text\-basedSearchaction is useful for graph reasoning\. Unlike our graph\-aware actions,Searchretrieves external evidence from a Wikipedia\-based111[https://dumps\.wikimedia\.org/backup\-index\.html](https://dumps.wikimedia.org/backup-index.html)knowledge source\. We compare two variants in Table[3](https://arxiv.org/html/2605.07357#S5.T3):Search, which relies solely on external retrieval, andGraphReAct\+Search, which augmentsGraphReActwith this action\. Specifically, given the current context, we first prompt the LLM to generate a query entity, then retrieve up to the top\-66relevant Wikipedia pages based on textual similarity, and finally concatenate the retrieved passages into the instruction as additional context for the next reasoning step\.
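The Search action described above can be sketched as follows; a bag-of-words cosine ranking stands in for the paper's unspecified textual-similarity retriever, and `pages` (title → text) is an illustrative stand-in for the Wikipedia corpus:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query: str, pages: dict, k: int = 6) -> list:
    # Rank pages by similarity to the LLM-generated query entity and
    # keep the top-k titles (the experiments use up to k = 6).
    q = Counter(query.lower().split())
    ranked = sorted(pages,
                    key=lambda t: cosine(q, Counter(pages[t].lower().split())),
                    reverse=True)
    return ranked[:k]
```

The retrieved passages for the returned titles would then be concatenated into the instruction before the next reasoning step.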

Table 3: Analysis of the textual search action.

| Methods | Cora | Children | History | Sports |
|---|---|---|---|---|
| Search | 0.253 | 0.225 | 0.535 | 0.427 |
| GraphReAct+Search | 0.269 | 0.282 | 0.635 | 0.479 |
| GraphReAct | 0.273 | 0.294 | 0.645 | 0.483 |

As shown in Table [3](https://arxiv.org/html/2605.07357#S5.T3), Search alone performs substantially worse than graph-based methods, indicating that external textual evidence is insufficient for node classification in graph domains. Moreover, incorporating textual search into GraphReAct does not lead to consistent improvements, and even slightly degrades performance compared to the full model. This suggests that external information may introduce noise or misaligned signals that interfere with graph-aware reasoning. We further analyze the effect of the number of retrieved entities in the search process in Appendix [G](https://arxiv.org/html/2605.07357#A7). These results highlight that, unlike in NLP tasks, effective evidence for graph reasoning is primarily encoded within the graph itself, and directly applying text-based search actions is neither necessary nor beneficial.

### 5.4 Parameter Sensitivity Analysis

We study the sensitivity of three key hyperparameters: the number of inference steps, the topological retrieval size, and the semantic retrieval size.

![Refer to caption](https://arxiv.org/html/2605.07357v1/figures/round.png)Figure 3:Impact of inference steps inGraphReAct\.
![Refer to caption](https://arxiv.org/html/2605.07357v1/figures/number.png)Figure 4:Impact of topological and semantic retrieval size\.

Effect of inference steps. We analyze the impact of the number of reasoning steps K in GraphReAct, as shown in Fig. [3](https://arxiv.org/html/2605.07357#S5.F3). When K=1, the model reduces to single-step inference without graph-aware retrieval or context refinement, which is equivalent to TEA-GLM. When K=2, graph-aware retrieval is introduced in the first step, while context refinement is not yet involved. When K≥3, the full framework is enabled. We observe that increasing K from 1 to 2 yields a significant improvement, highlighting the effectiveness of graph-aware retrieval. Further increasing K brings additional gains, indicating the benefit of iterative context refinement, although the improvement gradually saturates. In our experiments, we set K=4.

Effect of retrieval size. We analyze the impact of the number of retrieved nodes, including the number of topological neighbors N and semantically similar nodes M, as shown in Fig. [4](https://arxiv.org/html/2605.07357#S5.F4). We observe that increasing both N and M leads to consistent performance improvements, indicating the benefit of incorporating richer structural and semantic evidence. The gain from enlarging N is generally more pronounced, suggesting that structural context serves as the primary signal, while semantic retrieval provides complementary information. The improvement gradually saturates as N and M increase, implying that most useful information can be captured with a small number of retrieved nodes. Due to the input length limitation of the LLM, we cap both N and M at 4, and set N=M=4 in our experiments.
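The two retrieval actions whose sizes N and M are tuned here can be sketched as follows; the adjacency-list and embedding-matrix representations and the function names are illustrative assumptions, while the default sizes follow the setup above:

```python
import numpy as np

def topological_retrieve(adj: dict, v: int, n: int = 4) -> list:
    # Top-N one-hop neighbors of the target node: local structural
    # dependencies around v.
    return adj.get(v, [])[:n]

def semantic_retrieve(emb: np.ndarray, v: int, m: int = 4) -> list:
    # Top-M nodes closest to v by cosine similarity in the representation
    # space: non-local but relevant evidence, excluding v itself.
    x = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = x @ x[v]
    sims[v] = -np.inf
    return np.argsort(-sims)[:m].tolist()
```

Because semantic retrieval ignores edges entirely, the two actions can surface disjoint evidence sets, which is consistent with the complementary gains observed for Variant 4 in the ablation.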

## 6 Conclusion

In this paper, we proposed GraphReAct, a reasoning-acting framework for graph learning. We introduced graph-based retrieval that integrates topological and semantic evidence, along with a context refinement mechanism for multi-step reasoning. By combining retrieval and refinement within a step-by-step process, GraphReAct constructs informative contexts for prediction. Extensive experiments demonstrate strong performance under zero-shot settings, highlighting the effectiveness of structured reasoning with graph-aware evidence acquisition for graph inference.

## References

- [1] (1984) Breadth-first search. In Catalogue of Artificial Intelligence Tools, p. 13.
- [2] R. Chen, T. Zhao, A. Jaiswal, N. Shah, and Z. Wang (2024) LLaGA: Large language and graph assistant. In ICML.
- [3] W. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, et al. (2023) Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. https://vicuna.lmsys.org (accessed 14 April 2023).
- [4] D. J. Cook and L. B. Holder (2006) Mining graph data. John Wiley & Sons.
- [5] G. Feng, B. Zhang, Y. Gu, H. Ye, D. He, and L. Wang (2023) Towards revealing the mystery behind chain of thought: A theoretical perspective. In NeurIPS, Vol. 36, pp. 70757–70798.
- [6] D. Fu, J. Huang, S. Lu, G. Dong, Y. Wang, K. He, and W. Xu (2025) PreAct: Prediction enhances agent's planning ability. In ACL, pp. 1–16.
- [7] W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In NeurIPS.
- [8] X. He, X. Bresson, T. Laurent, A. Perold, Y. LeCun, and B. Hooi (2024) Harnessing explanations: LLM-to-LM interpreter for enhanced text-attributed graph representation learning. In ICLR.
- [9] Y. He, Y. Sui, X. He, Y. Liu, Y. Sun, and B. Hooi (2025) UniGraph2: Learning a unified embedding space to bind multimodal graphs. In WWW, pp. 1759–1770.
- [10] W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec (2020) Open Graph Benchmark: Datasets for machine learning on graphs. In NeurIPS.
- [11] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In ICLR.
- [12] H. Liu, J. Feng, L. Kong, N. Liang, D. Tao, Y. Chen, and M. Zhang (2024) One for all: Towards training one graph model for all classification tasks. In ICLR.
- [13] Z. Liu, X. Yu, Y. Fang, and X. Zhang (2023) GraphPrompt: Unifying pre-training and downstream tasks for graph neural networks. In WWW, pp. 417–428.
- [14] S. Luan, C. Hua, Q. Lu, J. Zhu, M. Zhao, S. Zhang, X. Chang, and D. Precup (2022) Revisiting heterophily for graph neural networks. In NeurIPS, pp. 1362–1375.
- [15] Y. Ma, X. Liu, N. Shah, and J. Tang (2022) Is homophily a necessity for graph neural networks? In ICLR.
- [16] M. Shen, G. Zeng, Z. Qi, Z. Hong, Z. Chen, W. Lu, G. W. Wornell, S. Das, D. D. Cox, and C. Gan (2025) Satori: Reinforcement learning with chain-of-action-thought enhances LLM reasoning via autoregressive search. In ICML.
- [17] J. Tang, Y. Yang, W. Wei, L. Shi, L. Su, S. Cheng, D. Yin, and C. Huang (2024) GraphGPT: Graph instruction tuning for large language models. In SIGIR, pp. 491–500.
- [18] J. Tang, Y. Yang, W. Wei, L. Shi, L. Xia, D. Yin, and C. Huang (2024) HiGPT: Heterogeneous graph language model. In KDD, pp. 2842–2853.
- [19] H. Taud and J. Mas (2017) Multilayer perceptron (MLP). In Geomatic Approaches for Modeling Land Change Scenarios, pp. 451–455.
- [20] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2018) Graph attention networks. In ICLR.
- [21] P. Veličković, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm (2018) Deep graph infomax. In ICLR.
- [22] D. Wang, Y. Zuo, F. Li, and J. Wu (2024) LLMs as zero-shot graph learners: Alignment of GNN representations with LLM token embeddings. In NeurIPS, Vol. 37, pp. 5950–5973.
- [23] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou (2023) Self-consistency improves chain of thought reasoning in language models. In ICLR.
- [24] Z. Wang, Z. Zhang, N. V. Chawla, C. Zhang, and Y. Ye (2024) GFT: Graph foundation model with transferable tree vocabulary. In NeurIPS, Vol. 37, pp. 107403–107443.
- [25] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. (2022) Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, Vol. 35, pp. 24824–24837.
- [26] Z. Wen and Y. Fang (2023) Augmenting low-resource text classification with graph-grounded pre-training and prompting. In SIGIR.
- [27] Q. Wu, C. Yang, W. Zhao, Y. He, D. Wipf, and J. Yan (2023) DIFFormer: Scalable (graph) transformers induced by energy constrained diffusion. In ICLR.
- [28] Q. Wu, W. Zhao, Z. Li, D. P. Wipf, and J. Yan (2022) NodeFormer: A scalable graph structure learning transformer for node classification. In NeurIPS, Vol. 35, pp. 27387–27401.
- [29] F. Xia, K. Sun, S. Yu, A. Aziz, L. Wan, S. Pan, and H. Liu (2021) Graph learning: A survey. IEEE Transactions on Artificial Intelligence 2(2), pp. 109–127.
- [30] P. Xia, L. Zhang, and F. Li (2015) Learning similarity with cosine similarity ensemble. Information Sciences 307, pp. 39–52.
- [31] K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2019) How powerful are graph neural networks? In ICLR.
- [32] H. Yan, C. Li, R. Long, C. Yan, J. Zhao, W. Zhuang, J. Yin, P. Zhang, W. Han, H. Sun, et al. (2023) A comprehensive study on text-attributed graphs: Benchmarking and rethinking. In NeurIPS, Vol. 36, pp. 17238–17264.
- [33] C. Yang, Q. Wu, and J. Yan (2022) Geometric knowledge distillation: Topology compression for graph neural networks. In NeurIPS, Vol. 35, pp. 29761–29775.
- [34] S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan (2023) Tree of thoughts: Deliberate problem solving with large language models. In NeurIPS, Vol. 36, pp. 11809–11822.
- [35] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao (2022) ReAct: Synergizing reasoning and acting in language models. In ICLR.
- [36] Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, and Y. Shen (2020) Graph contrastive learning with augmentations. In NeurIPS, Vol. 33, pp. 5812–5823.
- [37] X. Yu, Z. Liu, Y. Fang, Z. Liu, S. Chen, and X. Zhang (2023) Generalized graph prompt: Toward a unification of pre-training and downstream tasks on graphs. IEEE TKDE 36(11), pp. 6237–6250.
- [38] X. Yu, C. Zhou, Z. Kuai, X. Zhang, and Y. Fang (2025) GCoT: Chain-of-thought prompt learning for graphs. arXiv preprint arXiv:2502.08092.
- [39] H. Yuan, Q. Sun, J. Shi, X. Fu, B. Hooi, J. Li, and P. S. Yu (2025) GRAVER: Generative graph vocabularies for robust graph foundation models fine-tuning. In NeurIPS.
- [40] S. Zhang, Y. Liu, Y. Sun, and N. Shah (2022) Graph-less neural networks: Teaching old MLPs new tricks via distillation. In ICLR.
- [41] Z. Zhang, A. Zhang, M. Li, and A. Smola (2023) Automatic chain of thought prompting in large language models. In ICLR.
- [42] D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, C. Cui, O. Bousquet, Q. Le, et al. (2023) Least-to-most prompting enables complex reasoning in large language models. In ICLR.

## Appendix A Algorithm

We summarize the overall procedure of GraphReAct in Algorithm [1](https://arxiv.org/html/2605.07357#alg1). Lines 2–4 initialize the node representation by encoding the target node and projecting it into the LLM embedding space, followed by constructing the initial instruction. In lines 6–8, the model performs the first reasoning step and invokes graph-based retrieval to obtain both topological and semantic summaries, which are used to build the initial context. Then, lines 10–13 implement multi-step reasoning via context refinement, where at each step the LLM generates a new thought based on the current context, and the context is updated through a refinement action that distills and reorganizes the accumulated information. Finally, in lines 14–15, the model produces the prediction based on the refined context after K steps.

## Appendix B Further Descriptions of Datasets

We summarize all datasets in Table [4](https://arxiv.org/html/2605.07357#A3.T4) and provide comprehensive descriptions of each below.

- •Arxiv\[[10](https://arxiv.org/html/2605.07357#bib.bib432)\]is a large\-scale citation network built from Computer Science papers on the arXiv preprint server\. The graph contains 169,343 paper nodes and 1,166,243 citation edges, with labels covering 40 arXiv CS sub\-categories\.
- •PubMed\[[8](https://arxiv.org/html/2605.07357#bib.bib504)\]consists of 19,717 diabetes\-related publications connected by 44,338 citation links\. Each node is labeled as one of three categories: experimentally induced diabetes, type 1 diabetes, or type 2 diabetes\.
- •Cora\[[26](https://arxiv.org/html/2605.07357#bib.bib193)\], formally known as the Cora Research Paper Classification Dataset, is an expanded citation graph for research paper classification\. It includes 25,120 papers and 91,140 citation edges, where the 70 node labels correspond to fine\-grained research categories\.
- •Computer\[[32](https://arxiv.org/html/2605.07357#bib.bib505)\]is an e\-commerce graph from the TAG benchmark, extracted from computer\-related products in Amazon\-Electronics\. It has 87,229 product nodes, 721,081 co\-viewing or co\-purchasing edges, and 10 third\-level product categories\.
- •Photo\[[32](https://arxiv.org/html/2605.07357#bib.bib505)\]is constructed from photo\-related products in Amazon\-Electronics\. The graph contains 48,362 products and 500,928 behavioral links, where an edge indicates that two products are frequently co\-viewed or co\-purchased\. The prediction labels are 12 third\-level product categories\.
- •Children\[[32](https://arxiv.org/html/2605.07357#bib.bib505)\]is an Amazon\-Books graph whose nodes are children’s book products\. It comprises 76,875 nodes and 1,554,578 co\-viewing or co\-purchasing edges, with 24 labels defined by third\-level book categories\.
- •History\[[32](https://arxiv.org/html/2605.07357#bib.bib505)\]is another Amazon\-Books graph, focusing on history\-related books\. It contains 41,551 book nodes, 358,574 product\-relation edges, and 12 third\-level category labels\.
- •Sports\[[32](https://arxiv.org/html/2605.07357#bib.bib505)\]is an Amazon\-Sports graph built from fitness\-related products\. It is the largest e\-commerce target dataset in our experiments, with 173,055 product nodes, 1,773,500 co\-viewing or co\-purchasing edges, and 13 fine\-grained product categories\.

**Algorithm 1** Graph-aware Reasoning and Acting

**Input:** graph $G=(\mathcal{V},\mathcal{E})$, target node $v$, node text $x^{\mathrm{text}}$, projection $\mathtt{Proj}(\cdot;\phi)$, frozen LLM, number of steps $K$.
**Output:** prediction $\hat{y}$.

1: **Initialization:**
2: $\mathbf{h}_{v}\leftarrow\mathtt{GraphEncoder}(G,v)$
3: $\mathbf{h}_{v}^{\mathrm{tok}}\leftarrow\mathtt{Proj}(\mathbf{h}_{v};\phi)$
4: Construct instruction $\mathcal{I}$ from $(x^{\mathrm{text}},\mathbf{h}_{v}^{\mathrm{tok}})$
5: **Graph-based retrieval:**
6: $\mathrm{Thought}^{1}\leftarrow\mathtt{LLM}(\mathcal{I},\mathcal{Q})$
7: $\{s^{\mathrm{top}},s^{\mathrm{sem}}\}\leftarrow\mathtt{Act}_{\mathrm{retrieve}}(G,v)$
8: $\mathcal{C}^{1}\leftarrow\mathtt{Init}(\mathrm{Thought}^{1},\{s^{\mathrm{top}},s^{\mathrm{sem}}\})$
9: **Context refinement:**
10: **for** $k=2$ **to** $K$ **do**
11: $\quad\mathrm{Thought}^{k+1}\leftarrow\mathtt{LLM}(\mathcal{I},\mathcal{C}^{k},\mathcal{Q})$
12: $\quad\mathrm{Observation}^{k+1}\leftarrow\mathtt{Act}_{\mathrm{refine}}(\mathcal{C}^{k},\mathrm{Thought}^{k+1})$
13: $\quad\mathcal{C}^{k+1}\leftarrow\mathtt{Update}(\mathcal{C}^{k},\mathrm{Thought}^{k+1},\mathrm{Observation}^{k+1})$
14: **Final prediction:**
15: $\hat{y}\leftarrow\mathtt{LLM}(\mathcal{I},\mathcal{C}^{K},\mathcal{Q})$
16: **return** $\hat{y}$
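The control flow of Algorithm 1 can be sketched as a plain loop; `llm`, `retrieve`, and `refine` are stand-in callables (assumptions, not the paper's interfaces) for the frozen LLM, the graph-based retrieval action, and the refinement action:

```python
def graph_react(llm, retrieve, refine, instruction, query, K=4):
    # Step 1: first thought plus graph-based retrieval builds the
    # initial context from topological and semantic summaries.
    thought = llm(instruction, None, query)
    context = [thought] + list(retrieve())
    # Steps 2..K: interleave new thoughts with context refinement,
    # replacing the accumulated context with a compact distilled form.
    for _ in range(2, K + 1):
        thought = llm(instruction, context, query)
        observation = refine(context, thought)
        context = [observation, thought]
    # Final prediction from the refined context after K steps.
    return llm(instruction, context, query)
```

With K=1 the loop body never runs and no retrieval or refinement output would be used, matching the paper's observation that K=1 reduces to single-step TEA-GLM-style inference.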

## Appendix C Further Descriptions of Baselines

In this section, we provide additional details about the baselines used in our experiments\.

\(1\) Non\-graph model\.

- •MLP\[[19](https://arxiv.org/html/2605.07357#bib.bib151)\]: A multilayer perceptron that predicts node labels based only on node features\. It does not explicitly model graph topology, and therefore serves as a structure\-agnostic baseline for evaluating whether graph structural information is necessary\.

\(2\) Supervised graph methods\.

- •GCN\[[11](https://arxiv.org/html/2605.07357#bib.bib167)\]: A representative graph neural network that aggregates and transforms information from local neighborhoods through graph convolution\. It captures structural information by recursively propagating node features over the graph\.
- •GraphSAGE\[[7](https://arxiv.org/html/2605.07357#bib.bib168)\]: An inductive graph representation learning method that samples and aggregates neighborhood information\. It learns node representations by combining target node features with sampled neighbor representations, making it suitable for large\-scale graphs\.
- •GAT\[[20](https://arxiv.org/html/2605.07357#bib.bib171)\]: A graph attention network that introduces attention weights into neighborhood aggregation\. Instead of treating all neighbors equally, GAT assigns different importance scores to neighboring nodes, allowing the model to selectively aggregate more informative structural evidence\.
- •NodeFormer\[[28](https://arxiv.org/html/2605.07357#bib.bib501)\]: A scalable graph Transformer for node classification\. It models node interactions through an efficient attention mechanism and is designed to handle large\-scale graph structure learning\.
- •DIFFormer\[[27](https://arxiv.org/html/2605.07357#bib.bib502)\]: A scalable graph Transformer induced by energy\-constrained diffusion\. It captures long\-range dependencies on graphs through a diffusion\-based formulation, improving the ability to model global structural information\.

\(3\) Self\-supervised graph methods\.

- •DGI\[[21](https://arxiv.org/html/2605.07357#bib.bib143)\]: A self\-supervised graph representation learning method based on mutual information maximization\. It learns node embeddings by contrasting local node representations with a global graph summary, without relying on task\-specific labels during representation learning\.

\(4\) Graph knowledge distillation\.

- •GKD\[[33](https://arxiv.org/html/2605.07357#bib.bib499)\]: A graph knowledge distillation framework that transfers structural knowledge from a teacher GNN trained on complete graph information to a student model\. It aims to preserve useful topological knowledge while reducing the dependency on full graph access\.
- •GLNN\[[40](https://arxiv.org/html/2605.07357#bib.bib500)\]: A graph\-less neural network framework that distills graph\-aware knowledge from GNNs into an MLP\-like architecture\. By transferring graph\-aware predictions into a structure\-free student model, GLNN reduces the reliance on graph connectivity during inference\.
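The distillation objective used by GLNN-style methods can be sketched as follows; the soft-label cross-entropy form is a common knowledge-distillation choice and an illustrative assumption here, not GLNN's exact objective:

```python
import numpy as np

def softmax(z):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_probs, eps=1e-12):
    # Cross-entropy of the structure-free student (e.g., an MLP) against
    # the teacher GNN's soft labels, averaged over nodes. Once trained,
    # the student needs no graph connectivity at inference time.
    p = softmax(student_logits)
    return float(-(teacher_probs * np.log(p + eps)).sum(axis=-1).mean())
```

Minimizing this loss pushes the student toward the teacher's full output distribution, which carries more information than hard labels alone.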

\(5\) Large language models\.

- •Vicuna\-7B\-v1\.5\[[3](https://arxiv.org/html/2605.07357#bib.bib503)\]: An instruction\-tuned open\-source large language model\. In our experiments, Vicuna directly performs prediction based on textual node information, without explicitly using graph structure\.
- •Vicuna\-7B\-SPT: A soft prompt tuning variant of Vicuna\-7B\-v1\.5\. It introduces learnable soft prompts to adapt the LLM to graph\-related prediction tasks while still relying mainly on textual information\.

\(6\) Graph–LLM methods\.

- **OFA** [[12](https://arxiv.org/html/2605.07357#bib.bib286)]: A unified graph learning framework designed to handle different graph classification tasks with a shared model. It improves transferability across tasks by formulating graph learning problems under a unified task interface.
- **GraphGPT** [[17](https://arxiv.org/html/2605.07357#bib.bib11)]: A graph instruction tuning framework that aligns graph representations with large language models. We report two variants following the TEA-GLM setting: GraphGPT-std, which uses the standard graph instruction tuning pipeline, and GraphGPT-cot, which further incorporates Chain-of-Thought-style instruction data generated by LLMs.
- **LLaGA** [[2](https://arxiv.org/html/2605.07357#bib.bib9)]: A graph–language assistant that connects graph structural information with LLMs for graph-related tasks. It uses graph-aware representations to support LLM-based prediction and reasoning.
- **TEA-GLM** [[22](https://arxiv.org/html/2605.07357#bib.bib1)]: A token-embedding-aligned graph language model that aligns GNN representations with the token embedding space of a frozen LLM. It maps graph representations into a fixed number of graph token embeddings and inserts them into instruction templates for zero-shot graph learning. Since GraphReAct is built upon the TEA-GLM graph–LLM interface, TEA-GLM serves as the most direct backbone baseline for evaluating the effectiveness of our graph-aware reasoning-acting mechanism.

Table 4: Summary of datasets.

| Domain | Dataset | #Nodes | #Edges | #Classes |
| --- | --- | --- | --- | --- |
| Citation | Arxiv | 169,343 | 1,166,243 | 40 |
| Citation | Pubmed | 19,717 | 44,338 | 3 |
| Citation | Cora | 25,120 | 91,140 | 70 |
| E-commerce | Ele-Computer | 87,229 | 721,081 | 10 |
| E-commerce | Ele-Photo | 48,362 | 500,928 | 12 |
| E-commerce | Book-Children | 76,875 | 1,554,578 | 24 |
| E-commerce | Book-History | 41,551 | 358,574 | 12 |
| E-commerce | Sports-Fitness | 173,055 | 1,773,500 | 13 |
## Appendix D Implementation Details

**Environment.** All experiments are conducted on Ubuntu 22.04 with a 25-vCPU Intel(R) Xeon(R) Platinum 8481C CPU and a single 48GB vGPU.

**Optimizer.** Adam is used as the optimizer for all experiments.

**Details of baselines.** For fair and consistent comparison, we directly adopt the reported results of all non-TEA-GLM baseline methods from TEA-GLM [[22](https://arxiv.org/html/2605.07357#bib.bib1)], including MLP, GCN, GraphSAGE, GAT, DGI, GKD, GLNN, NodeFormer, DIFFormer, Vicuna-7B-v1.5, Vicuna-7B-SPT, GraphGPT, LLaGA, and OFA. These baselines are evaluated under the same dataset setting and evaluation metrics as TEA-GLM.

For TEA\-GLM, we follow its recommended setting: raw node texts are encoded by a pretrained BERT model, GraphSAGE is used as the graph encoder, Vicuna\-7B\-v1\.5 is used as the frozen LLM backbone, and the number of GNN layers is set to 2\. The graph encoder is pretrained on the source dataset, and the linear projector maps graph representations into graph token embeddings for zero\-shot inference on unseen target datasets\.

**Implementation details of GraphReAct.** We adopt a 3-layer GraphSAGE encoder as the graph backbone to obtain node representations, which are then mapped into the LLM embedding space via a linear projection layer. For graph-aware acting, we retrieve a fixed number of nodes from both structural and semantic perspectives. Specifically, we select N=4 neighbors for topological retrieval using breadth-first traversal, and M=4 nodes for semantic retrieval based on embedding similarity. For multi-step reasoning, we set the number of inference steps to K=4. The prompt design used in our experiments is provided in Sect. [H](https://arxiv.org/html/2605.07357#A8).
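Concretely, the two graph-based retrieval actions described above can be sketched as follows. This is a minimal stdlib-only illustration, assuming an adjacency-list graph and a dictionary of dense node embeddings; the function names and data formats are hypothetical simplifications, not the released implementation.

```python
from collections import deque
import math

def topological_retrieval(adj, target, n=4):
    """Breadth-first traversal from the target node, collecting up to n
    neighbors (adj: dict mapping node id -> list of neighbor ids)."""
    visited, order = {target}, []
    queue = deque(adj.get(target, []))
    while queue and len(order) < n:
        node = queue.popleft()
        if node in visited:
            continue
        visited.add(node)
        order.append(node)
        queue.extend(adj.get(node, []))  # expand to the next hop if needed
    return order

def semantic_retrieval(embeddings, target, m=4):
    """Top-m nodes by cosine similarity to the target node's embedding,
    i.e. non-local evidence in the representation space."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0
    scores = [(cos(embeddings[target], vec), node)
              for node, vec in embeddings.items() if node != target]
    scores.sort(reverse=True)
    return [node for _, node in scores[:m]]
```

The node ids returned by either action would then be resolved to their titles and descriptions and inserted into the corresponding retrieval prompts of Sect. H.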

## Appendix E Supervised Performance

Table 5: Accuracy and Macro-F1 on training datasets.

| Model | Arxiv Acc | Arxiv F1 | Computer Acc | Computer F1 |
| --- | --- | --- | --- | --- |
| MLP | 0.546±0.004 | 0.295±0.007 | 0.420±0.006 | 0.267±0.005 |
| GCN | 0.545±0.005 | 0.317±0.006 | 0.424±0.012 | 0.386±0.014 |
| GraphSAGE | 0.556±0.006 | 0.315±0.008 | 0.534±0.037 | 0.347±0.036 |
| GAT | 0.561±0.003 | 0.339±0.005 | 0.609±0.035 | 0.598±0.039 |
| NodeFormer | 0.544±0.016 | 0.297±0.029 | 0.434±0.012 | 0.288±0.012 |
| DIFFormer | 0.616±0.025 | 0.356±0.024 | 0.629±0.012 | 0.467±0.022 |
| DGI | 0.342±0.024 | 0.336±0.011 | 0.594±0.004 | 0.452±0.008 |
| GKD | 0.393±0.085 | 0.164±0.029 | 0.351±0.031 | 0.155±0.016 |
| GLNN | 0.602±0.004 | 0.362±0.008 | 0.393±0.005 | 0.243±0.007 |
| Vicuna-7B-v1.5 | 0.347±0.000 | 0.164±0.001 | 0.372±0.010 | 0.304±0.002 |
| OFA | 0.682±0.006 | 0.495±0.006 | 0.753±0.004 | 0.687±0.006 |
| GraphGPT-std | 0.626 | 0.262 | — | — |
| GraphGPT-cot | 0.576 | 0.228 | — | — |
| LLaGA | 0.749±0.001 | 0.575±0.003 | 0.642±0.004 | 0.562±0.001 |
| TEA-GLM | 0.655±0.001 | 0.445±0.002 | 0.578±0.002 | 0.496±0.010 |
| GraphReAct | 0.738±0.011 | 0.645±0.012 | 0.687±0.007 | 0.513±0.015 |

We further evaluate GraphReAct under a supervised setting, where models are trained on the Arxiv and Computer datasets and evaluated on held-out test splits. The results are reported in Table [5](https://arxiv.org/html/2605.07357#A5.T5). We make the following observations. First, GraphReAct achieves competitive performance compared with existing baselines, while not consistently outperforming state-of-the-art graph–LLM methods in this setting. This is expected, as supervised methods can directly optimize task-specific objectives with labeled data, whereas GraphReAct is primarily designed for reasoning-based inference rather than end-to-end supervised training. Second, GraphReAct demonstrates relatively stronger performance in terms of Macro-F1, particularly on Arxiv, indicating improved class-level balance. This suggests that the reasoning-acting mechanism, which dynamically aggregates and refines evidence, can help mitigate bias toward dominant classes even in supervised scenarios. Note that the supervised setting is not the primary focus of this work. Instead, GraphReAct is designed to address the more challenging zero-shot graph learning scenario, where labeled data is unavailable and effective reasoning over structured context becomes critical. As demonstrated in Sect. [5.2](https://arxiv.org/html/2605.07357#S5.SS2), GraphReAct achieves significant improvements under zero-shot settings, highlighting the advantage of reasoning-acting synergy for cross-dataset generalization.

## Appendix F Macro-F1 Results

Table 6: Macro-F1 score of node classification. Cora and Pubmed are citation datasets; Children, History, Photo, and Sports are e-commerce datasets. The best method is bolded and the runner-up is italicized.

| Model | Cora | Pubmed | Children | History | Photo | Sports |
| --- | --- | --- | --- | --- | --- | --- |
| MLP | 0.009±0.004 | 0.246±0.042 | 0.007±0.007 | 0.023±0.008 | 0.041±0.023 | 0.019±0.005 |
| GCN | 0.007±0.001 | 0.187±0.021 | 0.006±0.004 | 0.024±0.013 | 0.034±0.007 | 0.017±0.009 |
| GraphSAGE | 0.007±0.003 | 0.257±0.084 | 0.005±0.003 | 0.029±0.024 | 0.020±0.011 | 0.021±0.004 |
| GAT | 0.006±0.001 | 0.259±0.065 | 0.063±0.067 | 0.159±0.117 | 0.036±0.035 | 0.091±0.090 |
| NodeFormer | 0.008±0.003 | 0.232±0.089 | 0.019±0.008 | 0.046±0.031 | 0.055±0.006 | 0.049±0.009 |
| DIFFormer | 0.007±0.002 | 0.187±0.007 | 0.002±0.002 | 0.050±0.019 | 0.069±0.010 | 0.045±0.007 |
| DGI | 0.004±0.002 | 0.213±0.127 | 0.012±0.004 | 0.038±0.015 | 0.045±0.015 | 0.018±0.005 |
| GKD | 0.004±0.001 | 0.247±0.039 | 0.028±0.003 | 0.060±0.008 | 0.049±0.015 | 0.050±0.008 |
| GLNN | 0.006±0.001 | 0.221±0.033 | 0.021±0.003 | 0.064±0.007 | 0.057±0.002 | 0.052±0.003 |
| Vicuna-7B-v1.5 | 0.109±0.002 | 0.629±0.024 | **0.279±0.002** | *0.349±0.003* | 0.383±0.001 | 0.410±0.002 |
| OFA | 0.091±0.013 | 0.287±0.059 | 0.017±0.010 | 0.026±0.007 | 0.103±0.007 | 0.043±0.021 |
| GraphGPT-std | 0.082 | 0.649 | — | — | — | — |
| GraphGPT-cot | **0.127** | 0.482 | — | — | — | — |
| LLaGA | 0.108±0.014 | *0.778±0.056* | 0.163±0.029 | 0.144±0.025 | 0.362±0.039 | *0.446±0.035* |
| TEA-GLM | 0.107±0.012 | **0.846±0.011** | 0.209±0.010 | 0.336±0.021 | *0.404±0.017* | 0.396±0.007 |
| GraphReAct | *0.116±0.016* | 0.759±0.017 | *0.216±0.008* | **0.351±0.022** | **0.409±0.011** | **0.464±0.009** |

We further report Macro-F1 scores for zero-shot node classification in Table [6](https://arxiv.org/html/2605.07357#A6.T6). Overall, GraphReAct achieves competitive or superior performance on most datasets, particularly in the e-commerce domain, where it consistently outperforms existing graph–LLM methods. Compared with accuracy, the improvements in Macro-F1 are more pronounced, indicating that GraphReAct provides better class-level balance under zero-shot settings. This can be attributed to the reasoning-acting mechanism, which dynamically aggregates and refines evidence from both structural and semantic sources, thereby reducing bias toward dominant classes. On citation datasets, GraphReAct remains competitive with strong baselines such as TEA-GLM, showing that the proposed framework maintains robust performance across different domains.
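To make the class-balance argument concrete, the following stdlib-only sketch contrasts accuracy with Macro-F1 on an imbalanced toy example (the labels are illustrative and unrelated to the benchmark datasets):

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1: every class contributes equally,
    so collapsing predictions onto a dominant class is penalized."""
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Imbalanced toy split: 8 examples of class "a", 2 of class "b".
y_true = ["a"] * 8 + ["b"] * 2
majority = ["a"] * 10             # always predicts the dominant class
balanced = ["a"] * 7 + ["b"] * 3  # misses one "a" but recovers both "b"

accuracy = lambda t, p: sum(x == y for x, y in zip(t, p)) / len(t)
```

Here the majority classifier reaches 0.8 accuracy but only about 0.44 Macro-F1 (the minority class scores zero), while the more balanced predictor scores higher on both metrics, which is the pattern the zero-shot results above reflect.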

![Refer to caption](https://arxiv.org/html/2605.07357v1/figures/search.png)

Figure 5: Number of Wiki entities for Search.
## Appendix G Effect of the Number of Retrieved Entities in Textual Search

We further analyze the impact of the number of retrieved Wikipedia entities S in the text-based Search action, as shown in Fig. [5](https://arxiv.org/html/2605.07357#A6.F5). We observe that increasing S brings only marginal improvements across all datasets. Performance improves slightly when S increases from 2 to 4, indicating that a small amount of external evidence provides limited complementary information. However, further increasing S does not yield consistent gains and quickly saturates, with performance remaining largely stable thereafter. This suggests that simply retrieving more external textual entities does not substantially enhance graph reasoning and may introduce redundant or noisy information. The results further confirm that external search plays a limited role compared to graph-aware evidence.

## Appendix H Prompt Design

Table 7: Prompt design for node classification on Children.

**Task Instruction**

Given a representation of a book: <Token 1>, with the following information:
Description: {description}
Title: {title}
Question: Which category does this book belong to? Please directly give the most likely answer from the following categories: {categories}

**Thought Generation**

USER: The task is to classify a children's book by its main topic and content type. Your goal is to identify what kind of story or subject the book presents to young readers (for example, adventure, humor, fantasy, education, or everyday life). Based on the book information above, generate one concise sentence as your current thought to help solve the task.
ASSISTANT:

**Topological retrieval**

USER: You are given a subgraph consisting of a target node and its neighboring nodes from a children's book e-commerce graph. Each neighbor is described by a title and a description. Your task is to infer the most likely category of the subgraph based ONLY on the textual information of the neighbors, and generate a one-sentence summary in the following fixed format: "This subgraph xxx, so the category might be xxx."
IMPORTANT RULES (must be followed strictly):
1. Text-based inference: Analyze the titles and descriptions of all neighbors as a whole. Identify the dominant themes, genres, or target age group, such as fantasy, animals, educational concepts, activities, or everyday life experiences. Give more weight to recurring elements shared across neighbors, like magical creatures, vehicle types, specific emotions, or learning topics.
2. Noise filtering: Ignore non-informative promotional content, including but not limited to: bestseller status, author biography details (e.g., "has written over one hundred books"), series accolades, review quotes, and generic publisher blurbs. Focus only on information that reflects the book's core story, subject matter, characters, or intended educational/entertainment purpose for children.
3. Category constraint: The final category MUST be chosen from the following list and cannot be invented or modified: {categories}
4. Output constraint: Output exactly ONE sentence. Use the fixed format exactly as specified. Do not add explanations, bullet points, or extra text. The phrase "This subgraph xxx" should concisely describe the overall thematic focus of the children's books in the subgraph.
Here are correct examples:
- "This subgraph features stories about fairies and magical adventures, so the category might be Fairy Tales, Folk Tales & Myths."
- "This subgraph contains books about trucks, trains, and airplanes, so the category might be Cars, Trains & Things That Go."
- "This subgraph explores concepts like feelings, friendship, and school life, so the category might be Growing Up & Facts of Life."
Here is the neighbor information: {nodes}
Please give your reply:

**Semantic retrieval**

USER: You are given a center node and some similar nodes, which are based on similar semantic node embeddings, from a children's book e-commerce graph. Each node is described by a title and a description. Your task is to infer the most likely category of the node set based ONLY on the textual information, and generate a one-sentence summary in the following fixed format: "This node set xxx, so the category might be xxx."
IMPORTANT RULES (must be followed strictly):
1. Text-based inference: Analyze the titles and descriptions of all nodes as a whole. Identify the dominant themes, genres, or target age group, such as fantasy, animals, educational concepts, activities, or everyday life experiences. Give more weight to recurring elements shared across the node set, like magical creatures, vehicle types, specific emotions, or learning topics.
2. Noise filtering: Ignore non-informative promotional content, including but not limited to: bestseller status, author biography details (e.g., "has written over one hundred books"), series accolades, review quotes, and generic publisher blurbs. Focus only on information that reflects the book's core story, subject matter, characters, or intended educational/entertainment purpose for children.
3. Category constraint: The final category MUST be chosen from the following list and cannot be invented or modified: {categories}
4. Output constraint: Output exactly ONE sentence. Use the fixed format exactly as specified. Do not add explanations, bullet points, or extra text. The phrase "This node set xxx" should concisely describe the overall thematic focus of the children's books in the set.
Here are correct examples:
- "This node set features stories about fairies and magical adventures, so the category might be Fairy Tales, Folk Tales & Myths."
- "This node set contains books about trucks, trains, and airplanes, so the category might be Cars, Trains & Things That Go."
- "This node set explores concepts like feelings, friendship, and school life, so the category might be Growing Up & Facts of Life."
Here is the node information: {nodes}
Please give your reply:

**Context refinement**

USER: Please summarize the key points from our discussion so far in one sentence. Focus on the most important information that will help answer the original question.
ASSISTANT:

**Textual Search**

USER: To better understand the children's book above, select ONE specific story theme, character type, literary genre, or educational concept that is central to the book and can be searched directly on Wikipedia. Avoid overly generic terms like 'story' or 'children'. Output only a short phrase (maximum 5 words). Do NOT explain.
ASSISTANT:

**Full prompt example**

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers step by step to the user's questions.
USER: Given a representation of a book: <Token 1> <Token 2> <Token 3> <Token 4> <Token 5>, with the following information:
Description: Daisy Meadows has written over one hundred books for children. Her RAINBOW MAGIC series is a New York Times bestseller!
Title: Amber: The Orange Fairy (Rainbow Magic: The Rainbow Fairies, No. 2).
Thought: The book is a fantasy story featuring a fairy character (Amber the Orange Fairy) as part of a magical rainbow-themed series aimed at young children.
In addition to the information above, the book is part of an e-commerce graph, where neighboring nodes represent children's books that are often related or similar. As a result, adjacent books may share the same or closely related categories. The following summary describes the content and category tendencies of the neighboring books, and can be used as contextual evidence to help answer the question.
Neighbor summary: This subgraph features books written by Daisy Meadows, with a focus on the RAINBOW MAGIC series, which is a New York Times bestseller. The category might be Literature & Fiction or Children's Books.
Here is another summary generated from nodes that are similar to the center node.
Node summary: This node set features humorous stories and books about everyday life experiences, so the category might be Humor.
Thought: The book might belong in the Children's Books category with a focus on fantasy and humorous stories.
Summary: The book "Amber: The Orange Fairy" by Daisy Meadows is a part of the RAINBOW MAGIC series, which is a New York Times bestseller. It falls under the category of Children's Books with a focus on fantasy and humorous stories.
Question: Which category does this book belong to? Please directly give the most likely answer from the following categories: "Literature & Fiction", "Animals", "Growing Up & Facts of Life", "Humor", "Cars, Trains & Things That Go", "Fairy Tales, Folk Tales & Myths", "Activities, Crafts & Games", "Science Fiction & Fantasy", "Classics", "Mysteries & Detectives", "Action & Adventure", "Geography & Cultures", "Education & Reference", "Arts, Music & Photography", "Holidays & Celebrations", "Science, Nature & How It Works", "Early Learning", "Biographies", "History", "Children's Cookbooks", "Religions", "Sports & Outdoors", "Comics & Graphic Novels", "Computers & Technology".
ASSISTANT:

In this section, we illustrate the prompt design used in our framework in Table 7, taking the Children dataset as a representative example. For each type of prompt, we describe its role in the reasoning-acting process and provide a concrete instance. For other datasets, the prompt structure remains the same, while dataset-specific descriptions are adjusted accordingly.
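The way the per-step outputs are assembled into the final classification prompt (cf. the full prompt example in Table 7) can be sketched as follows; the template strings, field names, and function name are hypothetical simplifications of the actual prompt files:

```python
# Hypothetical simplification of how per-step outputs are stitched into the
# final prompt; the real templates contain the full wording of Table 7.
TASK = ("Given a representation of a book: {tokens}, with the following "
        "information:\nDescription: {description}\nTitle: {title}")

def build_final_prompt(tokens, description, title, thought, neighbor_summary,
                       node_summary, refined_summary, categories):
    parts = [
        TASK.format(tokens=tokens, description=description, title=title),
        f"Thought: {thought}",
        f"Neighbor summary: {neighbor_summary}",   # from topological retrieval
        f"Node summary: {node_summary}",           # from semantic retrieval
        f"Summary: {refined_summary}",             # from context refinement
        ("Question: Which category does this book belong to? Please directly "
         "give the most likely answer from the following categories: "
         + ", ".join(categories)),
        "ASSISTANT:",
    ]
    return "\n".join(parts)
```

Each reasoning step contributes one line of context, so the prompt grows during retrieval and is compressed again once the refinement summary replaces the raw evidence.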

## Appendix I Limitations

Despite its effectiveness, GraphReAct has several limitations. First, the framework introduces additional computational overhead due to multi-step LLM inference. Second, its performance may depend on the availability and quality of node-associated text. Finally, the current design is constrained by the input length of the LLM, which limits the number of reasoning steps and the amount of retrieved evidence. We leave improving efficiency and scalability to future work.

## Appendix J Broader Impact

This work introduces a graph\-aware reasoning framework that integrates large language models with structured data, which may benefit a wide range of applications involving relational reasoning, such as recommendation systems, knowledge discovery, and scientific data analysis\. By enabling zero\-shot generalization across graphs, our approach may reduce the need for extensive labeled data and improve the accessibility of graph learning techniques in real\-world scenarios\. Potential risks are limited but may arise if the method is applied to sensitive domains, where incorrect predictions or biases inherited from pre\-trained language models could propagate through the reasoning process\. However, our framework does not introduce new data sources or amplify such biases beyond those already present in existing models\. Overall, we believe this work has a positive impact by advancing the integration of structured and unstructured knowledge for more flexible and generalizable machine learning systems\.

## NeurIPS Paper Checklist

1. 1\. Claims
2. Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
3. Answer:\[Yes\]
4. Justification: The abstract and introduction clearly state the main contributions of the paper, including the proposed reasoning\-acting framework, the design of graph\-aware retrieval and context refinement, and the focus on zero\-shot graph learning\. These claims are consistently supported by the methodological descriptions and experimental results presented in the paper, without overstating the scope or generality\.
5. Guidelines:
    - The answer [N/A] means that the abstract and introduction do not include the claims made in the paper.
    - The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A [No] or [N/A] answer to this question will not be perceived well by the reviewers.
    - The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.
    - It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.
6. 2\. Limitations
7. Question: Does the paper discuss the limitations of the work performed by the authors?
8. Answer:\[Yes\]
9. Justification: In Sect. [I](https://arxiv.org/html/2605.07357#A9), we discuss the limitations of GraphReAct.
10. Guidelines:
    - The answer [N/A] means that the paper has no limitation while the answer [No] means that the paper has limitations, but those are not discussed in the paper.
    - The authors are encouraged to create a separate "Limitations" section in their paper.
    - The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.
    - The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.
    - The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.
    - The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.
    - If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.
    - While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.
11. 3\. Theory assumptions and proofs
12. Question: For each theoretical result, does the paper provide the full set of assumptions and a complete \(and correct\) proof?
13. Answer:\[N/A\]
14. Justification: This paper does not include formal theoretical results such as theorems or proofs\. The contributions are primarily methodological and empirical, focusing on the design of a reasoning\-acting framework and its experimental evaluation\.
15. Guidelines:
    - The answer [N/A] means that the paper does not include theoretical results.
    - All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.
    - All assumptions should be clearly stated or referenced in the statement of any theorems.
    - The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.
    - Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.
    - Theorems and Lemmas that the proof relies upon should be properly referenced.
16. 4\. Experimental result reproducibility
17. Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper \(regardless of whether the code and data are provided or not\)?
18. Answer:\[Yes\]
19. Justification: We provide sufficient details for reproducing the experiments, including model architecture, training and inference procedures, and hyperparameter settings\. In addition, we release the source code to facilitate verification and replication of our results\.
20. Guidelines:
    - The answer [N/A] means that the paper does not include experiments.
    - If the paper includes experiments, a [No] answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.
    - If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
    - Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
    - While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example:
        1. (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.
        2. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.
        3. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).
        4. (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.
21. 5\. Open access to data and code
22. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
23. Answer:\[Yes\]
24. Justification: We provide an anonymized code repository with detailed instructions for reproducing the experiments, including environment setup, data preparation, and execution commands\. The link to the code is included in the abstract and supplemental material\.
25. Guidelines:
    - The answer [N/A] means that paper does not include experiments requiring code.
    - While we encourage the release of code and data, we understand that this might not be possible, so [No] is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
    - The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines ([https://neurips.cc/public/guides/CodeSubmissionPolicy](https://neurips.cc/public/guides/CodeSubmissionPolicy)) for more details.
    - The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
    - The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
    - At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
    - Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.
26. 6\. Experimental setting/details
27. Question: Does the paper specify all the training and test details \(e\.g\., data splits, hyperparameters, how they were chosen, type of optimizer\) necessary to understand the results?
28. Answer:\[Yes\]
29. Justification: We provide detailed experimental settings in the main paper and appendix, including dataset splits, model configurations, hyperparameters, and inference procedures\. Additional implementation details are also included in the released code to ensure clarity and reproducibility\.
30. Guidelines:
    - The answer [N/A] means that the paper does not include experiments.
    - The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
    - The full details can be provided either with the code, in appendix, or as supplemental material.
31. 7\. Experiment statistical significance
32. Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
33. Answer:\[Yes\]
34. Justification: We report the mean and standard deviation of performance over five runs with different random seeds\. The error bars reflect the variability across different random initializations and data splits, and are presented in both tables and figures\. We explicitly indicate that the reported uncertainty corresponds to standard deviation\.
35. Guidelines: - •The answer\[N/A\]means that the paper does not include experiments\. - •The authors should answer\[Yes\]if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper\. - •The factors of variability that the error bars are capturing should be clearly stated \(for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions\)\. - •The method for calculating the error bars should be explained \(closed form formula, call to a library function, bootstrap, etc\.\) - •The assumptions made should be given \(e\.g\., Normally distributed errors\)\. - •It should be clear whether the error bar is the standard deviation or the standard error of the mean\. - •It is OK to report 1\-sigma error bars, but one should state it\. The authors should preferably report a 2\-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified\. - •For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range \(e\.g\., negative error rates\)\. - •If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text\.
8. Experiments compute resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [Yes]

Justification: We provide the type of compute resources (GPU models) and memory specifications used in our experiments. While we do not explicitly report the execution time, the provided information is sufficient to reproduce the experimental setup and results.

Guidelines:
- The answer [N/A] means that the paper does not include experiments.
- The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage.
- The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
- The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).
9. Code of ethics

Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics?

Answer: [Yes]

Justification: The research presented in this paper fully complies with the NeurIPS Code of Ethics. We use only publicly available datasets, ensure proper citation of prior work, and maintain anonymity in the submission. No human subjects or sensitive data are involved.

Guidelines:
- The answer [N/A] means that the authors have not reviewed the NeurIPS Code of Ethics.
- If the authors answer [No], they should explain the special circumstances that require a deviation from the Code of Ethics.
- The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).
10. Broader impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [Yes]

Justification: In the appendix, we discuss the broader impacts of GraphReAct.

Guidelines:
- The answer [N/A] means that there is no societal impact of the work performed.
- If the authors answer [N/A] or [No], they should explain why their work has no societal impact or why the paper does not address societal impact.
- Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
- The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate Deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.
- The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
- If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).
11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pre-trained language models, image generators, or scraped datasets)?

Answer: [N/A]

Justification: This work does not introduce or release new datasets or models with a high risk of misuse. We use publicly available datasets and existing pre-trained models, and therefore no additional safeguards are required.

Guidelines:
- The answer [N/A] means that the paper poses no such risks.
- Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
- Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
- We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.
12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [Yes]

Justification: We properly cite all datasets and models used in this work and provide corresponding references and URLs. These datasets are publicly available and widely used in prior literature, and we follow their standard usage and licensing terms.

Guidelines:
- The answer [N/A] means that the paper does not use existing assets.
- The authors should cite the original paper that produced the code package or dataset.
- The authors should state which version of the asset is used and, if possible, include a URL.
- The name of the license (e.g., CC-BY 4.0) should be included for each asset.
- For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
- If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
- For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
- If this information is not available online, the authors are encouraged to reach out to the asset's creators.
13. New assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

Answer: [Yes]

Justification: We release the code for our method and provide detailed documentation, including a README file with instructions for setup, data preparation, and reproducing the experiments.

Guidelines:
- The answer [N/A] means that the paper does not release new assets.
- Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
- The paper should discuss whether and how consent was obtained from people whose asset is used.
- At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.
14. Crowdsourcing and research with human subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [N/A]

Justification: This work does not involve crowdsourcing or research with human subjects.

Guidelines:
- The answer [N/A] means that the paper does not involve crowdsourcing nor research with human subjects.
- Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
- According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.
15. Institutional review board (IRB) approvals or equivalent for research with human subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

Answer: [N/A]

Justification: The paper does not involve crowdsourcing nor research with human subjects.

Guidelines:
- The answer [N/A] means that the paper does not involve crowdsourcing nor research with human subjects.
- Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
- We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
- For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.
16. Declaration of LLM usage

Question: Does the paper describe the usage of LLMs if it is an important, original, or non-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does *not* impact the core methodology, scientific rigor, or originality of the research, declaration is not required.

Answer: [N/A]

Justification: The core method development in this research does not involve LLMs as any important, original, or non-standard components.

Guidelines:
- The answer [N/A] means that the core method development in this research does not involve LLMs as any important, original, or non-standard components.
- Please refer to our LLM policy in the NeurIPS handbook for what should or should not be described.
