GraphInfer-Bench: Benchmarking LLM's Inference Capability on Graphs
Summary
Introduces GraphInfer-Bench, a benchmark to evaluate whether LLMs can perform graph inference—producing open-ended answers about a node and its neighborhood that cannot be retrieved from a single node or path. Experiments show that even frontier LLMs lag behind plain GNNs on these tasks, revealing a capability gap.
# GraphInfer-Bench: Benchmarking LLM’s Inference Capability on Graphs
Source: [https://arxiv.org/html/2606.11562](https://arxiv.org/html/2606.11562)
Zhuoyi Peng¹, Jingzhou Jiang¹, Hanlin Gu², Lixin Fan², Yi Yang¹ \(¹The Hong Kong University of Science and Technology, ²Webank\)
###### Abstract
Graph analysis underlies many applications whose answers cannot be looked up in a single record or retrieved along a path: laundering rings, drug repurposing, user preference, and scientific theme are all inferred from a node together with its neighbourhood\. We introduce GraphInfer\-Bench, a benchmark for whether LLMs can perform this *graph inference*: producing an open\-ended answer that no single node supports and no path retrieves\. Existing graph\-QA protocols cannot test this capability: algorithm simulation, node classification, single\-node description, KG\-QA, and GraphRAG all admit answers retrievable from one node or along a path\. GraphInfer\-Bench defines five tasks along *Description* \(what a region is\) and *Comparison* \(how regions differ\), each constructed so the ground truth lives in no single node\. The release contains 42,000 samples across six real\-world graphs, produced automatically and screened by a four\-layer quality\-control protocol\. We evaluate four method families against the same tasks: graph\-token alignment models, zero\-shot frontier closed\-source LLMs, Graph2Text supervised fine\-tuning, and plain GNNs as a structural reference\. No method family closes the gap\. Graph\-token alignment partially handles description tasks \(relational, theme\) but collapses on comparison tasks\. Frontier LLMs lead on outlier detection and community partition among LLM\-based methods but lag on masked\-node prediction\. Graph2Text SFT is the strongest LLM\-based method on the description side yet falls behind frontier LLMs on comparison\. Across every task, plain GNNs match or beat the strongest LLM\-based row, with the largest margin on community detection\. GraphInfer\-Bench surfaces graph inference as an open capability gap rather than a property of any one architecture\.
Code: [https://github\.com/graphinfer/GraphInfer\-Bench](https://github.com/graphinfer/GraphInfer-Bench)\. Dataset: [https://huggingface\.co/datasets/graphinfer/graphinfer](https://huggingface.co/datasets/graphinfer/graphinfer)\.
## 1 Introduction
Graph analysis underlies many real\-world problems whose answers cannot be looked up in a single record or retrieved along a path\. A money\-laundering ring is identified by a pattern of transactions across many accounts \[[20](https://arxiv.org/html/2606.11562#bib.bib20)\]\. A drug repurposing hypothesis emerges from joint reasoning over drug\-gene\-disease relationships \[[10](https://arxiv.org/html/2606.11562#bib.bib10)\]\. A user’s preference is inferred from the structure of past interactions rather than from any individual purchase \[[32](https://arxiv.org/html/2606.11562#bib.bib32)\]\. A scientific theme is read off the joint citation pattern of many papers, not from any one abstract \[[21](https://arxiv.org/html/2606.11562#bib.bib21)\]\. None of these answers exists in any single node: each must be *inferred* from a node together with its neighbourhood\.
Graph inference\. We define *graph inference* as producing an open\-ended answer to a question about a node and its neighbourhood, where the answer \(i\) is undetermined by any single node’s content, \(ii\) is undetermined by any traversal value \(excluding KG\-QA\-style retrieval\), and \(iii\) requires jointly reading edges and node text rather than either modality alone\. GraphInfer\-Bench measures the capability, not any particular architecture\.
Graph inference is distinct from lookup and retrieval\. *Lookup* returns a node’s own attribute\. *Retrieval* returns a pre\-existing answer reached through the graph \(a KG\-path entity, a span from one node’s page\)\. *Inference* is neither: identifying a cluster’s unifying theme, the outlier that does not belong, or a coherent partition all require open\-ended language that exists in no single node and is not retrievable along any path\.
Existing graph\-QA evaluations do not test it\. The protocols in Table [1](https://arxiv.org/html/2606.11562#S1.T1) are lookup or retrieval, not inference\. Algorithm simulation \(NLGraph, GraphArena, and related\) is structural lookup over synthetic graphs\. Node classification, link prediction, and single\-node description \(LLaGA\) are single\-node tasks\. KG\-QA \(WebQSP, CWQ, GrailQA, MetaQA, KQA Pro\) is path\-traversal retrieval\. GraphRAG \(STaRK, CRAG, GRBench\) is corpus retrieval indexed by a graph\. GraphInfer\-Bench asks for themes, outliers, partitions, and masked content over six real\-world graphs, scored against deterministic structural ground truth\.
Table 1: Graph\-QA evaluations in current use, organised along the axes G \(graph\), Q \(question\), and A \(answer\)\. TAG: text\-attributed graph\. KG: knowledge graph\.

| Task | G: graph | Q: question | A: answer |
| --- | --- | --- | --- |
| *Structure Lookup: answer from graph structure alone* | | | |
| Algorithm simulation: NLGraph \[[30](https://arxiv.org/html/2606.11562#bib.bib30)\], GraCoRe \[[38](https://arxiv.org/html/2606.11562#bib.bib38)\], GraphArena \[[28](https://arxiv.org/html/2606.11562#bib.bib28)\], Talk\-Like\-a\-Graph \[[6](https://arxiv.org/html/2606.11562#bib.bib6)\], GraphWiz \[[3](https://arxiv.org/html/2606.11562#bib.bib3)\] | Synthetic, *e\.g\.,* ER / SBM random graph | Connectivity, shortest path, cycle, degree, topological sort, *e\.g\.,* “shortest path $v\_\{3\}\\to v\_\{7\}$?” | Number, yes/no, path, *e\.g\.,* 4 or \[v3,v5,v7\] |
| *Retrieval: answer retrieved from content stored in the graph* | | | |
| KG\-QA: WebQSP \[[37](https://arxiv.org/html/2606.11562#bib.bib37)\], CWQ \[[26](https://arxiv.org/html/2606.11562#bib.bib26)\], GrailQA \[[7](https://arxiv.org/html/2606.11562#bib.bib7)\], MetaQA \[[39](https://arxiv.org/html/2606.11562#bib.bib39)\], KQA Pro \[[2](https://arxiv.org/html/2606.11562#bib.bib2)\] | Real KG, *e\.g\.,* Freebase, Wikidata | Multi\-hop lookup, *e\.g\.,* “actors in films directed by Nolan?” | Entity / number, *e\.g\.,* \{DiCaprio, Bale\} |
| GraphRAG / retrieval\-on\-graph: STaRK \[[31](https://arxiv.org/html/2606.11562#bib.bib31)\], CRAG \[[35](https://arxiv.org/html/2606.11562#bib.bib35)\], GRBench \[[12](https://arxiv.org/html/2606.11562#bib.bib12)\] | Corpus \+ index, *e\.g\.,* STaRK\-PrimeKG | Retrieve then answer, *e\.g\.,* “side effects of drug $d$?” | Entity / Factoid / Span, *e\.g\.,* span from $d$’s page |
| *Weak Inference: a single node \(or endpoint pair\) suffices* | | | |
| Node classification: OFA \[[17](https://arxiv.org/html/2606.11562#bib.bib17)\], GLBench \[[15](https://arxiv.org/html/2606.11562#bib.bib15)\], GraphGPT \[[27](https://arxiv.org/html/2606.11562#bib.bib27)\] | Real TAG, *e\.g\.,* ogbn\-arxiv | Pick node label, *e\.g\.,* “category of this paper?” | Fixed label, *e\.g\.,* cs\.LG |
| Single\-node description: LLaGA \[[4](https://arxiv.org/html/2606.11562#bib.bib4)\] | Real TAG, *e\.g\.,* ogbn\-arxiv | “Describe this paper”, *e\.g\.,* “describe node $v$” | Free text on one node, *e\.g\.,* paraphrase of $v$’s title |
| Link prediction: OFA \[[17](https://arxiv.org/html/2606.11562#bib.bib17)\], LLaGA \[[4](https://arxiv.org/html/2606.11562#bib.bib4)\], GraphGPT \[[27](https://arxiv.org/html/2606.11562#bib.bib27)\] | Real TAG / KG, *e\.g\.,* Cora, FB15k | Edge $\(u,v\)$?, *e\.g\.,* “does $u$ cite $v$?” | Yes / no, *e\.g\.,* yes |
| *GraphInfer\-Bench: LLM Inference Over Graph* | | | |
| GraphInfer\-Bench \(ours\)\. Task 1: masked node prediction; Task 2: relational description; Task 3: theme summarisation; Task 4: outlier detection; Task 5: community detection | Real TAG \(6 domains\): academic citation, e\-commerce, clinical citation, encyclopedia, patent citation, Physics Q&A | Masked node / relation / theme / outlier / community, *e\.g\.,* “partition the 25 papers into research\-area communities” | Open\-ended language, *e\.g\.,* \{0,3,8\}: Information Theory; \{1,2,5,7\}: Numerical Analysis; \{4,6\}: Computational Engineering\. *Reasoning:* titles in \{0,3,8\} centre on coding and entropy bounds … |
#### Contributions\.
1. A benchmark dedicated to graph inference\. We give a method\-agnostic definition \(criteria \(i\)–\(iii\) above\) and build the first benchmark targeting this capability rather than any specific architecture, distinct from algorithm simulation, node classification, single\-node description, KG\-QA, and GraphRAG\.
2. A 42,000\-sample dataset across six domains and five tasks\. Six text\-attributed graphs span ogbn\-arxiv, PubMed, USPTO patents, ogbn\-products, WikiCS, and Physics SE\. Five tasks span two axes\. *Description*: T1 masked node prediction \(recover a held\-out node from its neighbourhood\), T2 relational description \(characterise the relation between two endpoints\), T3 theme summarisation \(name the unifying theme of an ego\-graph\)\. *Comparison*: T4 outlier detection \(identify the node that does not belong\), T5 community detection \(partition an ego\-graph into coherent groups\)\. Each sample passes a four\-layer quality gate \(scripted rules, dual 70B judges, human κ calibration, structural deduplication\) at low human\-annotation cost\.
3. Evaluation, results, and what they imply for closing the gap\. Under matched splits and a unified hard\-label plus SBERT\-cosine protocol, we evaluate four method families \(graph\-token alignment, zero\-shot frontier LLMs, Graph2Text SFT, plain GNNs as a structural reference\)\. No family closes the gap\. Plain GNNs match or beat the strongest LLM\-based row on every task, with the largest margin on community detection\. *The signal is in the graph\. Closing the gap is an objective and decoding problem, not a capacity problem\.*
## 2 Related Work
#### Existing graph\-QA benchmarks\.
We extend the references in Table [1](https://arxiv.org/html/2606.11562#S1.T1)\. *Algorithm simulation*: NLGraph \[[30](https://arxiv.org/html/2606.11562#bib.bib30)\], GraCoRe \[[38](https://arxiv.org/html/2606.11562#bib.bib38)\], GraphArena \[[28](https://arxiv.org/html/2606.11562#bib.bib28)\], Talk\-Like\-a\-Graph \[[6](https://arxiv.org/html/2606.11562#bib.bib6)\], GraphWiz \[[3](https://arxiv.org/html/2606.11562#bib.bib3)\], plus GraphInstruct \[[18](https://arxiv.org/html/2606.11562#bib.bib18)\]\. *KG\-QA*: WebQSP \[[37](https://arxiv.org/html/2606.11562#bib.bib37)\], CWQ \[[26](https://arxiv.org/html/2606.11562#bib.bib26)\], GrailQA \[[7](https://arxiv.org/html/2606.11562#bib.bib7)\], MetaQA \[[39](https://arxiv.org/html/2606.11562#bib.bib39)\], KQA Pro \[[2](https://arxiv.org/html/2606.11562#bib.bib2)\]\. *GraphRAG*: STaRK \[[31](https://arxiv.org/html/2606.11562#bib.bib31)\], CRAG \[[35](https://arxiv.org/html/2606.11562#bib.bib35)\], GRBench \[[12](https://arxiv.org/html/2606.11562#bib.bib12)\], plus the two GraphRAGBench variants \[[33](https://arxiv.org/html/2606.11562#bib.bib33),[34](https://arxiv.org/html/2606.11562#bib.bib34)\]\. *Node classification, single\-node description, link prediction* on real text\-attributed graphs: OFA \[[17](https://arxiv.org/html/2606.11562#bib.bib17)\], GLBench \[[15](https://arxiv.org/html/2606.11562#bib.bib15)\], GraphGPT \[[27](https://arxiv.org/html/2606.11562#bib.bib27)\], LLaGA \[[4](https://arxiv.org/html/2606.11562#bib.bib4)\], plus GPT4Graph \[[8](https://arxiv.org/html/2606.11562#bib.bib8)\], G\-Retriever \[[9](https://arxiv.org/html/2606.11562#bib.bib9)\], and the LLMs\-as\-Predictors / LLMs\-as\-Enhancers study of Chen et al\. \[[5](https://arxiv.org/html/2606.11562#bib.bib5)\]\. Unlike these benchmarks, GraphInfer\-Bench is, to our knowledge, the first benchmark that targets graph inference itself\.
#### Methods for LLM understanding of graphs\.
Two families dominate\. *Graph\-token alignment* trains a GNN encoder and projects its output into the LLM input space, freezing or lightly tuning the LLM\. LLaGA \[[4](https://arxiv.org/html/2606.11562#bib.bib4)\], GraphToken \[[23](https://arxiv.org/html/2606.11562#bib.bib23)\], GraphGPT \[[27](https://arxiv.org/html/2606.11562#bib.bib27)\], TEA\-GLM \[[29](https://arxiv.org/html/2606.11562#bib.bib29)\], RGLM \[[40](https://arxiv.org/html/2606.11562#bib.bib40)\], GOFA \[[13](https://arxiv.org/html/2606.11562#bib.bib13)\], and InstructGLM \[[36](https://arxiv.org/html/2606.11562#bib.bib36)\] differ in how the projector is trained \(PCA alignment, contrastive, reconstructive, instruction tuning\) but share the architectural commitment that structure enters the LLM through learned graph tokens\. *Graph2Text* serialises the neighbourhood directly as text and hands it to the LLM, with the LLM either prompted zero\-shot \(frontier closed\-source models\) or supervised fine\-tuned on graph\-to\-text targets \[[6](https://arxiv.org/html/2606.11562#bib.bib6),[8](https://arxiv.org/html/2606.11562#bib.bib8),[5](https://arxiv.org/html/2606.11562#bib.bib5)\]\. GraphInfer\-Bench evaluates both families on the same tasks alongside plain GNNs as a structural reference, so the comparison shows which approach to graph inference works on which task\.
## 3 Dataset and Task Description
Building on the gap identified above, GraphInfer\-Bench operationalises the inference test as five tasks over six text\-attributed graphs\. The sampler produces 66,000 candidates \(2,200 per task\-domain cell\)\. A four\-layer quality\-control pipeline filters defective references and deduplicates the observed graph, after which every cell is capped to 1,400 samples balanced on the gold label and split 1,000 train, 100 val, and 300 test\. The public release totals 42,000 samples\. Below we describe the raw data sources, the task taxonomy, and the quality\-control pipeline\. Per\-domain text, full task setups with example prompts, and verification details are in the appendix\.
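The counts in this paragraph can be cross\-checked with a few lines of arithmetic\. The sketch below uses only the figures quoted in the text \(five tasks, six domains, 2,200 candidates per cell, and a 1,000 / 100 / 300 split per cell\); the variable names are illustrative and are not taken from the released code\.

```python
# Sanity check of the sample counts quoted in the text.
# Variable names are illustrative; only the numbers come from the paper.
NUM_TASKS = 5
NUM_DOMAINS = 6
CANDIDATES_PER_CELL = 2_200
SPLIT_PER_CELL = {"train": 1_000, "val": 100, "test": 300}

num_cells = NUM_TASKS * NUM_DOMAINS                 # one cell per (task, domain)
total_candidates = num_cells * CANDIDATES_PER_CELL  # sampler output before QC
released_per_cell = sum(SPLIT_PER_CELL.values())    # cap applied after QC
total_released = num_cells * released_per_cell
released_by_split = {name: n * num_cells for name, n in SPLIT_PER_CELL.items()}

print(num_cells, total_candidates, released_per_cell, total_released)
print(released_by_split)
```

The totals agree with the 66,000 candidates, the 1,400\-sample cap, and the 42,000\-sample release stated above\.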
### 3\.1 Raw Data Sources
GraphInfer\-Bench is built on six public text\-attributed graphs that span the major graph families used in the GNN\-LLM literature \(Table [2](https://arxiv.org/html/2606.11562#S3.T2)\): citation networks \(ogbn\-arxiv \[[11](https://arxiv.org/html/2606.11562#bib.bib11)\], PubMed \[[24](https://arxiv.org/html/2606.11562#bib.bib24)\], USPTO \[[14](https://arxiv.org/html/2606.11562#bib.bib14)\]\), e\-commerce co\-purchase \(ogbn\-products \[[11](https://arxiv.org/html/2606.11562#bib.bib11)\]\), encyclopedia links \(WikiCS \[[19](https://arxiv.org/html/2606.11562#bib.bib19)\]\), and a Physics Q&A graph \(Physics SE \[[25](https://arxiv.org/html/2606.11562#bib.bib25)\]\)\. Each domain provides natural\-language node text \(titles, names, descriptions\) and a structurally distinct edge type\. To construct a \(graph, question, answer\) sample, we sample 2\-hop ego\-subgraphs centred on hub nodes \(in\-degree ≥ 3 with valid text\), with up to 10 neighbours per hop\. This size setting lets us compare graph\-token alignment and Graph2Text baselines under a single subgraph budget; larger ego\-graphs would saturate the Graph2Text context window once each node’s title is serialised\. Models receive node titles only\. Abstracts and dataset labels are withheld and reserved for ground truth\. Per\-domain descriptions appear in Appendix [I](https://arxiv.org/html/2606.11562#A9)\.
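The ego\-subgraph sampling described above can be sketched as follows\. This is a minimal illustration, not the authors’ sampler: it assumes edges are expanded in both directions, and it reads “up to 10 neighbours per hop” as a per\-node fanout cap, which the text does not confirm\. The function names `in_degree` and `sample_ego_subgraph` are hypothetical\.

```python
import random
from collections import defaultdict

def in_degree(edges, node):
    """Count edges pointing at `node`; used for the hub test (in-degree >= 3)."""
    return sum(1 for _, dst in edges if dst == node)

def sample_ego_subgraph(edges, centre, fanout=10, hops=2, seed=0):
    """Breadth-first 2-hop sample around `centre`.

    Assumptions (not stated in the paper): neighbours are expanded over
    undirected adjacency, and `fanout` caps the neighbours kept per node.
    """
    rng = random.Random(seed)
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    visited = {centre}
    order = [centre]          # node 0 of the sample is the ego centre
    frontier = [centre]
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            candidates = sorted(adj[node] - visited)
            if len(candidates) > fanout:
                candidates = rng.sample(candidates, fanout)
            for nb in candidates:
                visited.add(nb)
                order.append(nb)
                next_frontier.append(nb)
        frontier = next_frontier

    # Keep every original edge whose endpoints both survived sampling.
    sub_edges = [(u, v) for u, v in edges if u in visited and v in visited]
    return order, sub_edges

# Toy graph: node 0 has in-degree 3; node 5 is three hops away.
edges = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 4)]
nodes, sub_edges = sample_ego_subgraph(edges, centre=0)
print(in_degree(edges, 0) >= 3, nodes, sub_edges)
```

In the toy graph, node 5 lies three hops from the centre and is therefore excluded from the sample\.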
Table 2: GraphInfer\-Bench raw data sources used to construct the benchmark, *not* the final released dataset \(see Sec\. [3\.2](https://arxiv.org/html/2606.11562#S3.SS2) for the per\-cell sample counts\)\.
### 3\.2 Tasks
Figure 1 panels \(Graph Inference Tasks; Q: question, A: answer, R: reasoning\):

- Description \(describe node, edge, cluster\)
  - T1: Masked Node\. Q: topic of masked Node 0? A: “Computational Geometry\.” R: neighbours cover clustered planarity & simultaneous embedding\.
  - T2: Relational Sem\. Q: does Node 9 cite Node 12? A: “No\.” R: Node 9 cites point\-location works, not the working\-set tree paper\.
  - T3: Theme Summary\. Q: theme of this 13\-paper graph? A: “Distributed Computing\.” R: titles cite distributed streams, load balancing, parallelism\.
- Comparison \(compare nodes: outlier, clusters\)
  - T4: Outlier Detection\. Q: which paper is unrelated? A: Node 8: “outlier”\. R: Nodes 0–7 are differential\-privacy works, Node 8 isn’t\.
  - T5: Community Detection\. Q: cluster nodes by topic\. A: “Comm\. 0: Computer Vision”\. R: cluster groups segmentation & pose\-estimation papers\.

Figure 1: Taxonomy of the five graph inference tasks in GraphInfer\-Bench\. Description tasks \(T1 through T3\) ask about a single target node, edge, or local theme\. The answer is a single short string scored by per\-task hard accuracy together with SBERT\-F1 against a reasoning paragraph\. Comparison tasks \(T4 and T5\) require cross\-node reasoning over the full ego\-graph and produce structured outputs: a chosen node for T4, a partition for T5\.

GraphInfer\-Bench organises five tasks under two graph\-inference capabilities, summarised in Table [3](https://arxiv.org/html/2606.11562#S3.T3)\. Description \(T1 to T3\) asks the model what a region of the graph *is*\. Comparison \(T4 and T5\) asks how regions *differ*\. Every task is constructed so that single\-node shortcuts cannot solve it\. For T1 the hub’s title is masked\. For T2 the answer is split across pair endpoints\. Full no\-shortcut constructions appear in Appendix [J](https://arxiv.org/html/2606.11562#A10)\.
Every sample is a triple *\(input, answer, reasoning\)*\. The *reasoning* is a short, structured natural\-language justification that grounds the answer in specific node attributes and edges\. It is auto\-generated by an LLM during dataset construction \(gold reasoning is produced by the DeepSeek API as of 2026\-04\-01\) and passes the four\-layer quality\-control pipeline of Sec\. [3\.3](https://arxiv.org/html/2606.11562#S3.SS3) before release\. Models are trained and scored on both fields jointly: hard metrics on the *answer*, soft semantic similarity on the *reasoning* paragraph \(Sec\. [4\.2](https://arxiv.org/html/2606.11562#S4.SS2)\)\.
Table 3: The five tasks of GraphInfer\-Bench, summarised as Question, Answer, Reasoning, and a real\-world application\. Full no\-shortcut constructions and per\-domain templates are in Appendix [J](https://arxiv.org/html/2606.11562#A10)\.
### 3\.3 Data Quality Control
Every candidate passes through a four\-layer release gate\. Layers 1 to 3 audit the gold *reasoning* \(answer plus rationale\) under a shared five\-code vocabulary \(Fact, Consistency, Label\-Integrity, Logic, Evidence\) so scripted rules, LLM judges, and human annotators speak the same language\. Layer 4 audits the observed *graph* for split leakage\.
- L1, agent\-grounded regex filter\. A deterministic rule set distilled from an agentic LLM audit of the candidate pool\. Admits 98\.98%\.
- L2, dual 70B judges with all\_pass\. Llama\-3\.1\-70B\-AWQ and Qwen\-2\.5\-72B\-AWQ vote independently; a sample ships only if both mark it qualified\. Admits 94\.0%\.
- L3, human calibration of L2\. Two annotators score a stratified 300\-sample subset independently: each annotator labels every sample without seeing the other annotator’s verdict, so the two label streams are independent before κ is computed\. We achieve inter\-annotator Cohen’s κ = 0\.606 and L2\-vs\-human agreement of 95\.2%, both above the pre\-registered threshold\.
- L4, construction\-time deduplication\. Per\-task canonical identities prevent the same ego\-centre, edge, cluster, outlier pair, or sub\-graph from leaking across train, val, and test\.
The pipeline admits 59,681 / 66,000 candidates \(90\.4%\), which we balance and cap to 1,400 samples per \(domain, task\) for the released benchmark \(Tab\. [4](https://arxiv.org/html/2606.11562#S3.T4)\)\. Per\-layer rule sets, prompts, model identifiers, calibration protocol, and canonical identity definitions are in Appendix [L](https://arxiv.org/html/2606.11562#A12)\.
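Two of the quantities in the quality gate are simple to reproduce\. The sketch below shows an all\-pass vote of the kind L2 describes and the standard Cohen’s κ formula used in L3, plus the overall admission rate\. The function names and the toy annotator labels are hypothetical; the real judge prompts and calibration data are in the paper’s appendix and are not reproduced here\.

```python
from collections import Counter

def all_pass(verdicts):
    """L2-style gate: a sample ships only if every judge marks it qualified."""
    return all(verdicts)

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Toy labels (hypothetical, not the paper's 300-sample calibration set).
ann_a = ["ok", "ok", "bad", "ok", "bad", "ok"]
ann_b = ["ok", "bad", "bad", "ok", "bad", "ok"]
kappa = cohens_kappa(ann_a, ann_b)

# Overall admission rate from the counts quoted in the text.
admit_rate = 59_681 / 66_000

print(all_pass([True, True]), all_pass([True, False]))
print(round(kappa, 3), round(admit_rate * 100, 1))
```

On the toy labels the annotators agree on five of six items with chance agreement 0\.5, so κ is 2/3; the admission rate rounds to the 90\.4% reported above\.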
Table 4: Released sample counts per task per domain, totalling 42,000 samples \(30,000 train, 3,000 val, 9,000 test\)\.
## 4 Experiments
Table [5](https://arxiv.org/html/2606.11562#S4.T5) summarises the key findings of this section up front\. Each is supported by a paragraph in the relevant subsection\.
Table 5: Key findings of § [4](https://arxiv.org/html/2606.11562#S4), one per subsection\.
### 4\.1 Evaluated Models
We compare four families of baselines on GraphInfer\-Bench\. Hyperparameters, training schedules, prompt templates, and hardware are deferred to App\. [H](https://arxiv.org/html/2606.11562#A8)\.
Graph\-token alignment\. Seven alignment models that present the graph to the LLM as soft tokens or injected hidden states: LLaGA \[[4](https://arxiv.org/html/2606.11562#bib.bib4)\], GraphToken \[[23](https://arxiv.org/html/2606.11562#bib.bib23)\], GraphGPT \[[27](https://arxiv.org/html/2606.11562#bib.bib27)\], TEA\-GLM \[[29](https://arxiv.org/html/2606.11562#bib.bib29)\], RGLM \[[40](https://arxiv.org/html/2606.11562#bib.bib40)\], GOFA \[[13](https://arxiv.org/html/2606.11562#bib.bib13)\], and InstructGLM \[[36](https://arxiv.org/html/2606.11562#bib.bib36)\]\. We hold the node\-text encoder, the GNN message\-passing operator, and the LLM backbone fixed across these baselines so any cross\-baseline difference comes from the loss, the projector, or the schedule\.
Zero\-shot frontier LLM\. GPT\-5 \[[22](https://arxiv.org/html/2606.11562#bib.bib22)\] and Claude Opus 4\.7 \[[1](https://arxiv.org/html/2606.11562#bib.bib1)\]\. The subgraph is serialised as plain text \(an indexed node list and an edge list\) and the prompt provides only the canonical answer\-frame sentence with no closed\-set label list\.
Graph2Text SFT\. Three open\-weight LLMs QLoRA\-fine\-tuned on the per\-domain training split with the reference Answer\-plus\-Reasoning target: Vicuna\-7B, Qwen2\-7B, and Llama\-3\-8B\.
Plain GNN\. GCN, GAT, and GraphSAGE trained on the same training split\. These isolate how much structural signal is recoverable from graph topology alone under the same data budget\. Any graph\-LLM method that fails to clear this reference cannot claim its design adds anything over a purely structural solution\.
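The plain\-text serialisation used for the zero\-shot and Graph2Text rows \(an indexed node list and an edge list\) might look like the sketch below\. The exact wording of the authors’ prompt is deferred to their appendix, so the section labels, the arrow notation, and the function name `serialise_subgraph` are all assumptions made for illustration\.

```python
def serialise_subgraph(titles, edges, question):
    """Render a subgraph as an indexed node list followed by an edge list.

    The "Nodes:" / "Edges:" labels and the "->" notation are illustrative
    assumptions; the paper only states that both lists are given as text.
    """
    lines = ["Nodes:"]
    for idx, title in enumerate(titles):
        lines.append(f"Node {idx}: {title}")
    lines.append("Edges:")
    for src, dst in edges:
        lines.append(f"Node {src} -> Node {dst}")
    lines.append("")          # blank line before the task question
    lines.append(question)
    return "\n".join(lines)

prompt = serialise_subgraph(
    titles=["A", "B"],
    edges=[(1, 0)],
    question="Q?",
)
print(prompt)
```

Because only titles are serialised, the prompt length grows with the number of nodes, which is the context\-window concern raised for larger ego\-graphs\.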
### 4\.2 Evaluation Metrics
We use two complementary classes of metrics, both computed deterministically from the structured per\-line output with no LLM judge in the loop\. Hard metrics parse the answer block and compare it against the structural ground truth: label accuracy on T1, T2, and T3, per\-node Hit@1 on T4, and ARI and NMI on T5\. SBERT cosine on the predicted versus gold *Reasoning* paragraph captures the soft side\. We present all\-mpnet\-base\-v2 \(768d\) in the main table because it gives the largest spread between baselines\. The all\-MiniLM\-L6\-v2 and e5\-large\-v2 numbers and their ranking\-stability check are in the appendix\. We score the Reasoning paragraph rather than the full Answer because the Reasoning is the discriminative part\. The two views are complementary\. A model can produce plausible cluster text \(moderate SBERT\) yet miss the structural partition \(low NMI\), so we report hard and soft side by side throughout\.
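For readers who want to see what the T5 metrics and the soft score compute, here is a dependency\-free sketch of ARI, NMI, and cosine similarity\. It is not the authors’ evaluation code: the NMI normalisation \(arithmetic mean of the two entropies\) is an assumption, and the SBERT embedding step is omitted, so `cosine` is shown on plain vectors standing in for sentence embeddings\.

```python
import math
from collections import Counter

def _comb2(n):
    return n * (n - 1) // 2

def adjusted_rand_index(gold, pred):
    """ARI between two partitions given as per-node cluster labels."""
    total = _comb2(len(gold))
    if total == 0:
        return 1.0
    sum_cells = sum(_comb2(c) for c in Counter(zip(gold, pred)).values())
    sum_gold = sum(_comb2(c) for c in Counter(gold).values())
    sum_pred = sum(_comb2(c) for c in Counter(pred).values())
    expected = sum_gold * sum_pred / total
    max_index = (sum_gold + sum_pred) / 2
    if max_index == expected:
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def normalized_mutual_info(gold, pred):
    """NMI with arithmetic-mean normalisation (an assumed convention)."""
    n = len(gold)
    cg, cp = Counter(gold), Counter(pred)
    mi = sum((c / n) * math.log(c * n / (cg[g] * cp[p]))
             for (g, p), c in Counter(zip(gold, pred)).items())
    denom = (_entropy(gold) + _entropy(pred)) / 2
    return 1.0 if denom == 0 else mi / denom

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

gold = [0, 0, 1, 1, 2, 2]
same_partition = [1, 1, 0, 0, 2, 2]   # same grouping, different label names
mixed = [0, 1, 0, 1, 0, 1]            # splits every gold cluster
ari_same = adjusted_rand_index(gold, same_partition)
nmi_same = normalized_mutual_info(gold, same_partition)
ari_mixed = adjusted_rand_index(gold, mixed)
print(ari_same, round(nmi_same, 6), round(ari_mixed, 3))
```

Both partition metrics are invariant to how clusters are named, which is why a relabelled but identical grouping still scores 1, while a prediction that splits every gold cluster scores below zero on ARI\.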
### 4\.3 Main Results
Table 6: Main results, averaged across six domains\. n=300 test samples per \(domain, task\) cell\. SBERT columns report all\-mpnet\-base\-v2 cosine on the predicted Reasoning paragraph\. The plain\-GNN rows show n/a in every SBERT column because GNNs emit a label or a cluster assignment, not a Reasoning paragraph to score against\. Per\-domain and per\-encoder breakdowns are in the appendix\.

Table [6](https://arxiv.org/html/2606.11562#S4.T6) reports each family’s behaviour across the five tasks\. We walk through the four families in turn, then close with what the joint pattern says about graph inference\.
#### Graph\-token alignment\.
Aligned LLMs can describe a region of the graph but cannot compare regions\. InstructGLM is the family’s strongest model on Description \(T1 0\.539, T3 0\.800\)\. The whole family lands under 0\.11 on T4 and under 0\.07 on T5, two orders of magnitude below GraphSAGE\. The encoder is fine\. The bottleneck sits downstream of it\.
#### Frontier closed\-source LLMs \(zero\-shot\)\.
Zero\-shot frontier LLMs dominate single\-node Comparison but remain behind the GNN on multi\-node partitioning\. Claude Opus 4\.7 takes T4 0\.642, ahead of every other family including plain GraphSAGE \(0\.412\), and reaches T2 0\.992\. Open\-vocabulary Description and T5 multi\-node partitioning still trail SFT and the GNN respectively\.
#### Graph2Text SFT\.
Updating the LLM weights closes the Description gap, not the multi\-node Comparison gap\. QLoRA SFT on Llama\-3\-8B reaches T3 0\.766, within a few points of the GCN ceiling, and SFT\-Vicuna leads T1 across the whole table\. T4 lands between the graph\-token floor and the Claude ceiling\. T5 still trails GraphSAGE\.
#### Plain GNN \(structural reference\)\.
The structural signal lives in the subgraph and a 2\-layer GNN extracts it, with no notion of relation semantics or text\-grounded comparison\. GCN, GAT, and GraphSAGE win three of the five tasks \(T1, T3, T5\)\. T2 and T4 reward the language prior more than the topology, and the GNN trails Claude there\.
#### Joint takeaway\.
Graph inference is not one capability\. On Description, LLM\-based methods compete with the GNN\. On single\-node Comparison, the frontier LLM is the only method that clears the GNN\. On multi\-node Comparison, every LLM\-based method plateaus around 0\.3 NMI while a small GNN reaches 0\.46\. *The signal is in the graph\. Closing the remaining gap is an objective and decoding problem, not a capacity problem\.*
### 4\.4 Better graph token template to fill the gap
Adding explicit per\-node neighbour titles to the prompt recovers the description\-task signal that graph tokens alone fail to deliver, and 2\-hop context is what specifically unlocks the single\-node Comparison task\. We augment the LLaGA and GOFA prompts with h1 \(1\-hop neighbour titles\) and h2 \(1\- and 2\-hop neighbour titles\), keeping the projector and the Llama\-3\-8B backbone identical to the main\-table rows\. Both variants are LoRA\-fine\-tuned on the same training split \(shown below\)\.
Template with interleaved graph token and titles

LLaGA \(graph tokens only\):

```
Node 0 <gt_0>
...
Node N-1 <gt_{N-1}>
<task question>
```

h1 \(GT \+ 1\-hop titles\): each 1\-hop neighbour’s graph token is followed by its title on the same line:

```
Node 0 <gt_0>: <title 0>
Node 1 <gt_1>: <title 1> (1-hop neighbour of node 0)
Node 2 <gt_2>: <title 2> (1-hop neighbour of node 0)
...
<task question>
```

h2 \(GT \+ 1\+2\-hop titles\): same as h1, plus a “Node k <gt\_k\>: title” line for every 2\-hop neighbour of node 0\. Projector, Llama\-3\-8B backbone, and LoRA adapters are identical across variants; only title placement changes\.
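The three template variants differ only in which nodes get a title appended, which the following sketch makes explicit\. It is an illustration under assumptions: the function name `build_prompt` and the `hops` argument are hypothetical, the `<gt_i>` strings are placeholders for the projected graph tokens, and the handling of nodes beyond the titled hop \(token only, no title\) is inferred from the template text\.

```python
def build_prompt(hops, titles, question, variant="gt"):
    """Build a graph-token prompt in one of three variants.

    hops[i] is node i's hop distance from node 0 (0, 1 or 2).
    variant "gt" emits graph tokens only, "h1" adds titles up to 1 hop,
    "h2" adds titles up to 2 hops. "<gt_i>" is a textual placeholder for
    the learned graph token of node i.
    """
    max_title_hop = {"gt": -1, "h1": 1, "h2": 2}[variant]
    lines = []
    for idx, hop in enumerate(hops):
        line = f"Node {idx} <gt_{idx}>"
        if hop <= max_title_hop:
            line += f": {titles[idx]}"
            if hop == 1:
                line += " (1-hop neighbour of node 0)"
        lines.append(line)
    lines.append(question)
    return "\n".join(lines)

hops = [0, 1, 2]
titles = ["A", "B", "C"]
gt_prompt = build_prompt(hops, titles, "Q?", variant="gt")
h1_prompt = build_prompt(hops, titles, "Q?", variant="h1")
h2_prompt = build_prompt(hops, titles, "Q?", variant="h2")
print(h2_prompt)
```

Keeping one builder for all three variants mirrors the ablation’s design, in which everything except title placement is held fixed\.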
Table 7: Template ablation on LLaGA and GOFA\. h1 adds 1\-hop neighbour titles to the prompt\. h2 adds 1\- and 2\-hop\. GraphSAGE is the no\-LLM floor\. SFT \(Llama\-3\-8B\) is the full\-LLM ceiling\. n=300 per task\. Per\-domain breakdowns of the reference rows are in Tab\. [13](https://arxiv.org/html/2606.11562#A13.T13) \(GraphSAGE\), Tab\. [18](https://arxiv.org/html/2606.11562#A17.T18) \(SFT\), and Tab\. [14](https://arxiv.org/html/2606.11562#A15.T14) \(LLaGA, GOFA\)\.
#### 1\-hop unlocks Description, 2\-hop unlocks T4\.
On patents the GT\-only LLaGA collapses to T1 0\.063 and T3 0\.047\. Adding 1\-hop neighbour titles lifts T1 to 0\.497 and T3 to 0\.883\. The graph token alone under\-determines the answer\. One hop of explicit text fills the gap\. Adding the second hop adds little to T1 / T3 but lifts T4 specifically: LLaGA T4 H@1 jumps from 0\.10 to 0\.56 on arxiv \(\+46 percentage points\) and from 0\.07 to 0\.34 on patents \(\+27 percentage points\), and the GOFA rows repeat the same pattern under a different alignment recipe\. The h1 setting barely moves T4 over GT\-only, so the T4 gain is specifically a 2\-hop effect\. Outlier detection requires comparing each node against its neighbours’ neighbours, and 1\-hop context cannot supply that\.
#### T5 stays stubborn\.
NMI rises monotonically from LLaGA to h1 to h2 but stays well below the GNN reference on both domains\. Denser prompt context lifts every other task close to or above the GNN ceiling, yet does not move multi\-node partitioning, which a 2\-layer GNN extracts directly from the topology\. T5 remains a structural\-coverage problem that prompt augmentation alone cannot solve\.
### 4\.5 Does the Reasoning supervision help?
Stripping the Reasoning paragraph from the training target collapses graph\-token methods on T1, T3, and T4 but does not hurt Graph2Text SFT, indicating that supervision\-side scaffolding is what gets the alignment recipe to read the graph at all\. We retrain LLaGA, TEA\-GLM, and GOFA on ogbn\-arxiv and pubmed\-diabetes asking the model to emit only Answer: X, with everything else unchanged\.
Figure 2: Reasoning\-supervision ablation\. full is the main\-table recipe \(Answer plus Reasoning\)\. w/o strips the Reasoning paragraph\. n=300 per task\. Per\-domain numbers for the full bars are in Tab\. [14](https://arxiv.org/html/2606.11562#A15.T14) \(LLaGA, TEA\-GLM, GOFA\) and Tab\. [18](https://arxiv.org/html/2606.11562#A17.T18) \(SFT\)\.
#### Reasoning carries the Description signal\.
Hard accuracy on T1 \(masked node\) and T3 \(theme summarisation\) collapses by tens of percentage points on every graph\-token \(model, domain\) pair we test\. Without the rationale paragraph, the projector and LoRA jointly fit a shorter target distribution that captures the answer phrase as a generic surface form\. What gets lost is the per\-instance grounding to the graph, which emerges only when the model is forced to articulate *why* the label is correct\.
#### T4 collapses for graph\-token, lifts for SFT\.
For graph\-token baselines, removing Reasoning drives T4 H@1 to near zero\. Inspecting predictions, the model emits “Node k: not outlier” for every node, so it never names any outlier\. Graph2Text SFT goes the other way: T4 *improves* without Reasoning\. The full\-LLM gradient of SFT lets the model discriminate from graph\-text alone, and removing the free\-form rationale removes a competing gradient\. T2 \(yes/no\) also gains a few percentage points without Reasoning across all rows for the same reason\. The ablation is therefore an honest signal, not a uniform regression: it surfaces a real supervision\-vs\-decoding asymmetry between the two paradigms\.
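The T4 collapse described above is easy to see once the per\-line output is parsed\. The sketch below assumes one “Node k: outlier” or “Node k: not outlier” verdict per line and treats Hit@1 as 1 when the first node flagged as an outlier equals the gold node; both the line format and this reading of Hit@1 are assumptions, since the exact scoring script is not shown in the text\. The names `parse_outliers` and `hit_at_1` are hypothetical\.

```python
import re

# One verdict per line, e.g. "Node 8: outlier" or "Node 3: not outlier".
LINE_RE = re.compile(
    r"^\s*Node\s+(\d+)\s*:\s*(not\s+outlier|outlier)\s*$", re.IGNORECASE
)

def parse_outliers(output):
    """Return the node indices the model flagged as outliers, in order."""
    flagged = []
    for line in output.splitlines():
        match = LINE_RE.match(line)
        if match and match.group(2).lower() == "outlier":
            flagged.append(int(match.group(1)))
    return flagged

def hit_at_1(output, gold_node):
    """1 if the first flagged node is the gold outlier, else 0 (assumed reading)."""
    flagged = parse_outliers(output)
    return int(bool(flagged) and flagged[0] == gold_node)

# Degenerate output of the kind reported for graph-token models without Reasoning.
collapsed = "Node 0: not outlier\nNode 1: not outlier\nNode 2: not outlier"
# An output that names the gold outlier.
correct = "Node 0: not outlier\nNode 1: not outlier\nNode 2: outlier"

print(parse_outliers(collapsed), hit_at_1(collapsed, 2), hit_at_1(correct, 2))
```

A model that marks every node “not outlier” flags nothing, so it scores zero on every sample regardless of which node is the gold outlier\.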
### 4\.6 More findings
Three additional ablations are reported in the appendix and reinforce the same narrative\. Projector pretraining unlocks Description on the wide label space and is a no\-op on the narrow one \(App\. [C](https://arxiv.org/html/2606.11562#A3)\)\. Scaling SFT from 8B to 70B helps constrained\-output tasks and hurts open\-vocabulary description, indicating that capacity is not the bottleneck on Description \(App\. [D](https://arxiv.org/html/2606.11562#A4)\)\. The GNN encoder choice flips with label\-space size: SAGE wins on the 40\-class arxiv space, GCN slightly wins on 3\-class pubmed \(App\. [E](https://arxiv.org/html/2606.11562#A5)\)\.
## 5 Conclusion
GraphInfer\-Bench measures the capability of LLMs to read a node and its neighbours and produce an open\-ended conclusion that no single node implies, framed along two axes: *Description* \(verbalising what a region is, T1 to T3\) and *Comparison* \(distinguishing how regions differ, T4 and T5\)\. Five tasks over six structurally diverse domains and a four\-layer quality\-control pipeline yield a public release of 42,000 samples with deterministic ground truth at low human\-annotation cost, evaluated under four baseline families on the same splits \(graph\-token aligned LLMs, zero\-shot frontier LLMs, Graph2Text SFT, and plain GNNs\)\. Frontier LLMs and Graph2Text SFT close the description\-task gap, but no LLM\-based method recovers the multi\-node partition signal that plain GNNs extract, leaving GraphInfer\-Bench as the diagnostic tool to track whether graph\-LLM alignment and text\-serialised LLM approaches can close that remaining gap\.
## References
- Anthropic \[2026\] Anthropic\. Claude Opus 4\.7 System Card\. [https://www\.anthropic\.com/claude\-opus\-4\-7\-system\-card](https://www.anthropic.com/claude-opus-4-7-system-card), April 2026\.
- Cao et al\. \[2022\] Shulin Cao, Jiaxin Shi, Liangming Pan, Lunyiu Nie, Yutong Xiang, Lei Hou, Juanzi Li, Bin He, and Hanwang Zhang\. Kqa pro: A dataset with explicit compositional programs for complex question answering over knowledge base\. In *Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 6101–6119, 2022\.
- Chen et al\. \[2024a\] Nuo Chen, Yuhan Li, Jianheng Tang, and Jia Li\. Graphwiz: An instruction\-following language model for graph computational problems\. In *Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining*, pages 353–364, 2024a\.
- Chen et al\. \[2024b\] Runjin Chen, Tong Zhao, Ajay Kumar Jaiswal, Neil Shah, and Zhangyang Wang\. LLaGA: Large language and graph assistant\. In *Proceedings of the 41st International Conference on Machine Learning \(ICML\)*, pages 7809–7823, 2024b\.
- Chen et al\. \[2024c\] Zhikai Chen, Haitao Mao, Hang Li, Wei Jin, Hongzhi Wen, Xiaochi Wei, Shuaiqiang Wang, Dawei Yin, Wenqi Fan, Hui Liu, et al\. Exploring the potential of large language models \(llms\) in learning on graphs\. *ACM SIGKDD Explorations Newsletter*, 25\(2\):42–61, 2024c\.
- Fatemi et al\. \[2023\] Bahare Fatemi, Jonathan Halcrow, and Bryan Perozzi\. Talk like a graph: Encoding graphs for large language models\. *arXiv preprint arXiv:2310\.04560*, 2023\.
- Gu et al\. \[2021\] Yu Gu, Sue Kase, Michelle Vanni, Brian Sadler, Percy Liang, Xifeng Yan, and Yu Su\. Beyond iid: three levels of generalization for question answering on knowledge bases\. In *Proceedings of the web conference 2021*, pages 3477–3488, 2021\.
- Guo et al\. \[2023\] Jiayan Guo, Lun Du, Hengyu Liu, Mengyu Zhou, Xinyi He, and Shi Han\. Gpt4graph: Can large language models understand graph structured data? an empirical evaluation and benchmarking\. *arXiv preprint arXiv:2305\.15066*, 2023\.
- He et al\. \[2024\] Xiaoxin He, Yijun Tian, Yifei Sun, Nitesh V Chawla, Thomas Laurent, Yann LeCun, Xavier Bresson, and Bryan Hooi\. G\-retriever: Retrieval\-augmented generation for textual graph understanding and question answering\. *Advances in Neural Information Processing Systems*, 37:132876–132907, 2024\.
- Himmelstein et al\. \[2017\] Daniel Scott Himmelstein, Antoine Lizee, Christine Hessler, Leo Brueggeman, Sabrina L Chen, Dexter Hadley, Ari Green, Pouya Khankhanian, and Sergio E Baranzini\. Systematic integration of biomedical knowledge prioritizes drugs for repurposing\. *elife*, 6:e26726, 2017\.
- Hu et al\. \[2020\] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec\. Open graph benchmark: Datasets for machine learning on graphs\. *Advances in neural information processing systems*, 33:22118–22133, 2020\.
- Jin et al\. \[2024\] Bowen Jin, Chulin Xie, Jiawei Zhang, Kashob Kumar Roy, Yu Zhang, Zheng Li, Ruirui Li, Xianfeng Tang, Suhang Wang, Yu Meng, et al\. Graph chain\-of\-thought: Augmenting large language models by reasoning on graphs\. In *Findings of the Association for Computational Linguistics: ACL 2024*, pages 163–184, 2024\.
- Kong et al\. \[2024\] Lecheng Kong, Jiarui Feng, Hao Liu, Chengsong Huang, Jiaxin Huang, Yixin Chen, and Muhan Zhang\. Gofa: A generative one\-for\-all model for joint graph language modeling\. *arXiv preprint arXiv:2407\.09709*, 2024\.
- Leskovec et al\. \[2005\] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos\. Graphs over time: densification laws, shrinking diameters and possible explanations\. In *Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining*, pages 177–187, 2005\.
- Li et al\. \[2024\] Yuhan Li, Peisong Wang, Xiao Zhu, Aochuan Chen, Haiyun Jiang, Deng Cai, Victor W Chan, and Jia Li\. Glbench: A comprehensive benchmark for graph with large language models\. *Advances in Neural Information Processing Systems*, 37:42349–42368, 2024\.
- Lin et al\. \[2024\] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei\-Ming Chen, Wei\-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han\. Awq: Activation\-aware weight quantization for on\-device llm compression and acceleration\. *Proceedings of machine learning and systems*, 6:87–100, 2024\.
- Liu et al\. \[2023\] Hao Liu, Jiarui Feng, Lecheng Kong, Ningyue Liang, Dacheng Tao, Yixin Chen, and Muhan Zhang\. One for all: Towards training one graph model for all classification tasks\. *arXiv preprint arXiv:2310\.00149*, 2023\.
- Luo et al\. \[2024\] Zihan Luo, Xiran Song, Hong Huang, Jianxun Lian, Chenhao Zhang, Jinqi Jiang, Xing Xie, and Hai Jin\. Graphinstruct: Empowering large language models with graph understanding and reasoning capability\. *arXiv preprint arXiv:2403\.04483*, 2024\.
- Mernyei and Cangea \[2020\] Péter Mernyei and Cătălina Cangea\. Wiki\-cs: A wikipedia\-based benchmark for graph neural networks\. *arXiv preprint arXiv:2007\.02901*, 2020\.
- Motie and Raahemi \[2024\] Soroor Motie and Bijan Raahemi\. Financial fraud detection using graph neural networks: A systematic review\. *Expert Systems with Applications*, 240:122156, 2024\.
- Newman \[2012\] Mark EJ Newman\. Communities, modules and large\-scale structure in networks\. *Nature physics*, 8\(1\):25–31, 2012\.
- OpenAI \[2025\] OpenAI\. GPT\-5 system card\. *arXiv preprint arXiv:2601\.03267*, 2025\.
- Perozzi et al\. \[2024\] Bryan Perozzi, Bahare Fatemi, Dustin Zelle, Anton Tsitsulin, Mehran Kazemi, Rami Al\-Rfou, and Jonathan Halcrow\. Let your graph do the talking: Encoding structured data for llms\. *arXiv preprint arXiv:2402\.05862*, 2024\.
- Sen et al\. \[2008\] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi\-Rad\. Collective classification in network data\. *AI magazine*, 29\(3\):93–93, 2008\.
- Stack Exchange, Inc\. \[2024\] Stack Exchange, Inc\. Stack Exchange Data Dump\. Internet Archive, [https://archive\.org/details/stackexchange](https://archive.org/details/stackexchange), 2024\. Licensed under CC BY\-SA 4\.0; Physics Stack Exchange subset used in this work\.
- Talmor and Berant \[2018\] Alon Talmor and Jonathan Berant\. The web as a knowledge\-base for answering complex questions\. In *Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long Papers\)*, pages 641–651, 2018\.
- Tang et al\. \[2024a\] Jiabin Tang, Yuhao Yang, Wei Wei, Lei Shi, Lixin Su, Suqi Cheng, Dawei Yin, and Chao Huang\. Graphgpt: Graph instruction tuning for large language models\. In *Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval*, pages 491–500, 2024a\.
- Tang et al\. \[2024b\] Jianheng Tang, Qifan Zhang, Yuhan Li, Nuo Chen, and Jia Li\. Grapharena: Evaluating and exploring large language models on graph computation\. *arXiv preprint arXiv:2407\.00379*, 2024b\.
- Wang et al\. \[2024\] Duo Wang, Yuan Zuo, Fengzhi Li, and Junjie Wu\. Llms as zero\-shot graph learners: Alignment of gnn representations with llm token embeddings\. *Advances in neural information processing systems*, 37:5950–5973, 2024\.
- Wang et al\. \[2023\] Heng Wang, Shangbin Feng, Tianxing He, Zhaoxuan Tan, Xiaochuang Han, and Yulia Tsvetkov\. Can language models solve graph problems in natural language? *Advances in Neural Information Processing Systems*, 36:30840–30861, 2023\.
- Wu et al\. \[2024\] Shirley Wu, Shiyu Zhao, Michihiro Yasunaga, Kexin Huang, Kaidi Cao, Qian Huang, Vassilis N Ioannidis, Karthik Subbian, James Zou, and Jure Leskovec\. Stark: Benchmarking llm retrieval on textual and relational knowledge bases\. *Advances in Neural Information Processing Systems*, 37:127129–127153, 2024\.
- Wu et al\. \[2022\] Shiwen Wu, Fei Sun, Wentao Zhang, Xu Xie, and Bin Cui\. Graph neural networks in recommender systems: a survey\. *ACM computing surveys*, 55\(5\):1–37, 2022\.
- Xiang et al\. \[2025\] Zhishang Xiang, Chuanjie Wu, Qinggang Zhang, Shengyuan Chen, Zijin Hong, Xiao Huang, and Jinsong Su\. When to use graphs in rag: A comprehensive analysis for graph retrieval\-augmented generation\. *arXiv preprint arXiv:2506\.05690*, 2025\.
- Xiao et al\. \[2025\] Yilin Xiao, Junnan Dong, Chuang Zhou, Su Dong, Qian\-wen Zhang, Di Yin, Xing Sun, and Xiao Huang\. Graphrag\-bench: Challenging domain\-specific reasoning for evaluating graph retrieval\-augmented generation\. *arXiv preprint arXiv:2506\.02404*, 2025\.
- Yang et al\. \[2024\] Xiao Yang, Kai Sun, Hao Xin, Yushi Sun, Nikita Bhalla, Xiangsen Chen, Sajal Choudhary, Rongze D Gui, Ziran W Jiang, Ziyu Jiang, et al\. Crag\-comprehensive rag benchmark\. *Advances in Neural Information Processing Systems*, 37:10470–10490, 2024\.
- Ye et al\. \[2024\] Ruosong Ye, Caiqi Zhang, Runhui Wang, Shuyuan Xu, and Yongfeng Zhang\. Language is all a graph needs\. In *Findings of the association for computational linguistics: EACL 2024*, pages 1955–1973, 2024\.
- Yih et al\. \[2016\] Wen\-tau Yih, Matthew Richardson, Christopher Meek, Ming\-Wei Chang, and Jina Suh\. The value of semantic parse labeling for knowledge base question answering\. In *Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\)*, pages 201–206, 2016\.
- Yuan et al\. \[2025\] Zike Yuan, Ming Liu, Hui Wang, and Bing Qin\. Gracore: Benchmarking graph comprehension and complex reasoning in large language models\. In *Proceedings of the 31st International Conference on Computational Linguistics*, pages 7925–7948, 2025\.
- Zhang et al\. \[2018\] Yuyu Zhang, Hanjun Dai, Zornitsa Kozareva, Alexander Smola, and Le Song\. Variational reasoning for question answering with knowledge graph\. In *Proceedings of the AAAI conference on artificial intelligence*, volume 32, 2018\.
- Zhang et al\. \[2026\] Zhongjian Zhang, Xiao Wang, Mengmei Zhang, Jiarui Tan, and Chuan Shi\. Toward graph\-tokenizing large language models with reconstructive graph instruction tuning\. In *Proceedings of the ACM Web Conference 2026*, pages 430–441, 2026\.
## Appendix A Limitations
GraphInfer\-Bench currently spans six text\-attributed graph domains: academic citations, e\-commerce co\-purchase, biomedical citations, encyclopedia articles, patent citations, and a Q&A community\. Two domains we would have liked to include are absent: finance \(e\.g\., transaction or trading networks\) and biology beyond biomedical citations \(e\.g\., protein\-protein interaction or single\-cell graphs with rich textual annotations\)\. Both are difficult to obtain at scale: financial transaction graphs are typically proprietary or subject to strict redistribution constraints, and biology graphs with the per\-node free\-text descriptions our pipeline requires are scarce in the public domain\. Adding these two families is the most direct extension of the benchmark\.
## Appendix B Broader Impacts
#### Positive impacts\.
GraphInfer\-Bench standardises evaluation for graph\-LLM reasoning under a unified split, prompt, and scoring protocol, removing two confounds that have plagued the field \(non\-comparable domain choices and per\-paper metric variations\)\. This makes incremental progress easier to recognise and harder to overstate\. The dataset, the gold reasoning, and the audit annotations are released under open licences \(CC BY 4\.0 for data, MIT for code, App\.[F](https://arxiv.org/html/2606.11562#A6)\), so any group can reproduce the leaderboard or extend it\. By identifying the multi\-node partition gap \(§[4](https://arxiv.org/html/2606.11562#S4)\) and the description\-vs\-comparison asymmetry, the benchmark gives downstream research concrete failure modes to target rather than aggregate scores to chase\.
#### Negative impacts\.
The gold reasoning is generated by DeepSeek for the candidate pool, with Llama\-3\.1\-70B and Qwen\-2\.5\-72B used for Layer\-2 audit\. Any of these models may carry biases or blind spots that propagate into the gold target\. Layer 1 to Layer 3 of the quality\-control pipeline \(§[3\.3](https://arxiv.org/html/2606.11562#S3.SS3)\) catch many surface defects but do not certify substantive bias\. Any user training on GraphInfer\-Bench inherits those biases\. A second concern is dual\-use: a strong graph\-token model trained against this benchmark could be applied to sensitive social or financial graphs in ways that we cannot foresee, particularly if the model is deployed without further evaluation on the target domain\. Finally, training and judging large LLMs has a non\-trivial energy cost; we report compute resources \(App\.[L\.2](https://arxiv.org/html/2606.11562#A12.SS2), App\.[D](https://arxiv.org/html/2606.11562#A4)\) so practitioners can decide whether the benefit justifies the cost\.
## Appendix C Does projector pretraining help?
Projector pretraining unlocks Description on the larger label space, and is a no\-op on the smaller one\. Three of the seven graph\-token baselines \(TEA\-GLM, GOFA, RGLM\) include a projector or encoder pretraining pass before joint fine\-tuning\. We disable that step, keeping every other hyperparameter identical, on ogbn\-arxiv \(40 classes\) and pubmed\-diabetes \(3 classes\)\.
Figure 3: Projector and encoder pretraining ablation on TEA\-GLM, GOFA, and RGLM\. Solid bars are the full main\-table recipe; hatched bars skip the corresponding pretraining pass and train directly from raw SBERT\. n=300 per task, mean across 3 seeds\. The drop concentrates on T1 and T3 \(description tasks\), strongest on the 40\-class arxiv space\. Per\-domain numbers for the full rows are in Tab\.[14](https://arxiv.org/html/2606.11562#A15.T14)\.
#### Effect concentrates on Description in the wide label space\.
Disabling TEA\-GLM’s PCA alignment collapses the 40\-class arxiv space \(T1 0\.47 to 0\.06, T3 0\.70 to 0\.05\) and gives a smaller drop on the 3\-class pubmed space \(T1 0\.46 to 0\.36, T3 0\.47 to 0\.38\)\. GOFA shows the same shape on pubmed \(T1 0\.49 to 0\.28, T3 0\.44 to 0\.34\)\. T2, T4, and T5 are invariant across the ablation\. Pretraining is doing lexical disambiguation in the projector rather than adding structural capacity, and the benefit scales with how many fine\-grained labels the projector has to separate\. With §[4\.5](https://arxiv.org/html/2606.11562#S4.SS5), the description\-task signal in graph\-token methods comes from supervision\-side scaffolding \(Reasoning paragraphs and projector pretraining\), not from raw graph\-token capacity\.
## Appendix D Does scaling SFT to 70B help?
Scaling Graph2Text SFT from 8B to 70B helps constrained\-output tasks and hurts open\-vocabulary description\. We re\-run the SFT recipe on Llama\-3\.1\-70B\-Instruct\-AWQ \(∼10× the parameters\), keeping the same training data, prompts, and test split, on the two hardest domains \(ogbn\-arxiv and patents\)\.
Table 8: SFT scaling: 8B vs\. 70B on the two hardest domains\. n=300 per task, mean ± std across 3 seeds\. Per\-domain 8B reference numbers are in Tab\.[18](https://arxiv.org/html/2606.11562#A17.T18)\.
#### 70B helps where the output is constrained\.
T2 lifts on both domains \(arxiv 0\.85 to 0\.91, patents 0\.86 to 0\.91\)\. T4 lifts on arxiv \(0\.46 to 0\.50\)\. Both have constrained outputs \(yes/no, a single node ID\) and reward the larger language prior\.
#### 70B hurts on open\-vocabulary description\.
T1 and T3 drop on most cells \(arxiv T1 0\.47 to 0\.32, arxiv T3 0\.71 to 0\.49, patents T3 0\.59 to 0\.47\)\. SBERT\-Reason cosines sit within 0\.05 of the 8B row everywhere, so the graph\-reading quality is unchanged\. The accuracy loss is on the output side: AWQ\-INT4 quantisation widens the output distribution and the 70B language prior paraphrases the canonical label, both of which hurt tightly\-templated description targets but leave Yes/No or single node ID intact\. Capacity is not the bottleneck on Description\. Output discipline is\.
## Appendix E Does the GNN encoder choice matter?
The message\-passing operator is load\-bearing only when the projector has many fine\-grained labels to disambiguate, mirroring the pretraining ablation in the main paper\. The main table fixes GraphSAGE as the unified GNN choice across GraphToken, TEA\-GLM, RGLM, and GOFA so that any cross\-baseline differences live in the loss, pretrain, or projection head rather than the operator\. Here we re\-train GraphToken on ogbn\-arxiv and pubmed\-diabetes, swapping SAGEConv for GCNConv with everything else fixed\.
Table 9: GNN\-backbone ablation on GraphToken\. SAGE is the canonical main\-table operator\. GCN is the ablation\. n=300 per task, mean ± std across 3 seeds\. Per\-domain SBERT\-Reason cosines for the SAGE reference are in Tab\.[14](https://arxiv.org/html/2606.11562#A15.T14)\.
#### Encoder choice flips with the label space\.
Swapping SAGE to GCN on the 40\-class arxiv space collapses GraphToken’s T1 by 33 percentage points \(0\.38 to 0\.05\) and T3 by 63 percentage points \(0\.68 to 0\.04\), with parallel SBERT\-Reason drops \(T1 0\.63 to 0\.32, T3 0\.64 to 0\.27\)\. On 3\-class pubmed\-diabetes the picture inverts: GCN slightly outperforms SAGE on T1 \(0\.67 vs\. 0\.59\), T2 \(0\.89 vs\. 0\.85\), and T3 \(0\.65 vs\. 0\.62\)\. The operator’s inductive bias matters where the projector has many fine\-grained labels to separate, the same pattern as pretraining \(App\.[C](https://arxiv.org/html/2606.11562#A3)\)\. Cross\-baseline parity benefits from fixing the operator as the main table does, but the direction of any swap is domain\-dependent\.
## Appendix F License and Data Use
GraphInfer\-Bench releases its derived data \(ego\-graph samples, gold labels, gold reasoning, audit annotations\) under CC BY 4\.0 and the accompanying code under MIT\. Both permit commercial use with citation\. Source graphs retain their upstream licences, listed below\. Users must respect these when accessing raw graph structure beyond what we redistribute\.
- ogbn\-arxiv: OGB code MIT; arXiv title metadata is CC0 1\.0 Public Domain \(the only field we use\)\.
- ogbn\-products: OGB code MIT; underlying Amazon co\-purchase data from Bhatia et al\. 2016 \(Chiang et al\. 2019\), distributed for academic research use\.
- PubMed\-Diabetes: graph dataset \(Sen et al\. 2008\) in academic public release; abstract text is not held under copyright by NLM but may be subject to publisher copyrights\.
- WikiCS: dataset repository MIT \(Mernyei & Cangea 2020\); underlying Wikipedia article text CC BY\-SA 4\.0 \(CC BY\-SA 3\.0 for contributions prior to June 2023\)\.
- USPTO patents: not subject to U\.S\. copyright per 17 USC §105; USPTO reserves the right to assert copyright internationally; bulk\-data redistribution permitted via USPTO Open Data Portal\.
- Physics StackExchange: CC BY\-SA 4\.0 \(per Stack Exchange Network terms\)\.
Gold reasoning and Layer\-2 verdicts were produced by DeepSeek, Llama\-3\.1\-70B\-Instruct \(AWQ\), and Qwen\-2\.5\-72B\-Instruct \(AWQ\)\. Each provider permits redistribution as academic content, and we manually inspected samples for harmful content during the L1 to L3 release gate \(§[L](https://arxiv.org/html/2606.11562#A12)\)\. The benchmark contains no PII beyond the upstream public corpora\.
## Appendix G Datasheet for GraphInfer\-Bench
We follow the datasheet template of Gebru et al\. \([arXiv:1803\.09010](https://arxiv.org/abs/1803.09010)\)\. Where the answer already lives elsewhere in the paper, we cross\-reference rather than duplicate\.
### G\.1 Motivation
#### For what purpose was the dataset created?
To evaluate the capability of an LLM to read a node and its neighbourhood and produce an open\-ended natural\-language conclusion that no single node implies \(§[1](https://arxiv.org/html/2606.11562#S1)\)\. Existing graph\-QA protocols \(algorithm simulation, node classification, single\-node description, KG\-QA, GraphRAG\) cannot test this capability \(Tab\.[1](https://arxiv.org/html/2606.11562#S1.T1)\)\.
#### Who created the dataset and on behalf of which entity?
The authors\.
#### Who funded the creation of the dataset?
N/A\.
### G\.2 Composition
#### What do the instances represent?
Each instance is a 2\-hop ego\-graph from one of six real text\-attributed graphs \(Tab\.[2](https://arxiv.org/html/2606.11562#S3.T2)\), paired with one of five graph\-inference tasks \(Tab\.[3](https://arxiv.org/html/2606.11562#S3.T3)\) and a reference Answer plus Reasoning ground truth\.
#### How many instances are there in total?
42,000 released samples: 6 domains × 5 tasks × 1,400 samples per cell\. See Tab\.[4](https://arxiv.org/html/2606.11562#S3.T4)\.
#### Does the dataset contain all possible instances or is it a sample?
A sample\. The pre\-quality\-control candidate pool contains 66,000 candidates \(App\.[K](https://arxiv.org/html/2606.11562#A11)\); 42,000 ship after quality control and per\-cell capping\.
#### What data does each instance consist of?
An \(input, answer, reasoning\) triple\. Inputs include the ego\-graph node list, edge list, per\-node titles, and the task question; the answer is the deterministic gold target; the reasoning is a short rationale grounding the answer in specific nodes and edges \(§[3\.2](https://arxiv.org/html/2606.11562#S3.SS2)\)\.
#### Is there a label or target associated with each instance?
Yes\. T1, T3 emit a category label, T2 a yes/no, T4 a per\-node outlier label, T5 a partition assignment\.
#### Is any information missing from individual instances?
Yes, by design\. For T1 the hub node’s title is replaced with \[MASK\]\. Node abstracts and the dataset\-level class label of each node are deliberately withheld from model inputs so that the shortcut of looking up a single node’s metadata cannot solve the task\. See §[3\.2](https://arxiv.org/html/2606.11562#S3.SS2) and the per\-task prompt templates in App\.[J](https://arxiv.org/html/2606.11562#A10)\.
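The withholding rule can be sketched as a small input builder\. This is a minimal illustration under stated assumptions, not the released pipeline: the function name `build_t1_input`, the field names, the toy titles, the question wording, and the exact text layout are hypothetical\. Only the \[MASK\] replacement of the hub title and the omission of abstracts and class labels come from the description above\.

```python
def build_t1_input(nodes, edges, hub_id, question):
    """Serialise an ego-graph for T1 as an indexed node list plus an edge list.

    `nodes` maps node id -> dict with hypothetical keys "title", "abstract", "label".
    Only titles are emitted; abstracts and class labels are withheld.
    """
    lines = ["Nodes:"]
    for node_id in sorted(nodes):
        # The hub title is hidden so the answer cannot be read off one node.
        title = "[MASK]" if node_id == hub_id else nodes[node_id]["title"]
        lines.append(f"{node_id}: {title}")
    lines.append("Edges:")
    for src, dst in edges:
        lines.append(f"{src} -> {dst}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


# Toy data, invented for illustration only.
nodes = {
    0: {"title": "Graph attention networks", "abstract": "...", "label": "cs.LG"},
    1: {"title": "Semi-supervised classification with GCNs", "abstract": "...", "label": "cs.LG"},
    2: {"title": "Inductive representation learning", "abstract": "...", "label": "cs.SI"},
}
prompt = build_t1_input(
    nodes, [(1, 0), (2, 0)], hub_id=0, question="What is the masked node about?"
)
print(prompt)
```

The builder never touches the `abstract` or `label` fields, so a model reading the prompt has to rely on neighbour titles and edges\.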
#### Are there recommended train/val/test splits?
Yes\. Each \(domain, task\) cell holds 1,000 train, 100 val, and 300 test \(Tab\.[4](https://arxiv.org/html/2606.11562#S3.T4)\)\.
#### Are there any errors, sources of noise, or redundancies?
The Layer\-1 to Layer\-4 release gate admits 90\.4% of the candidate pool \(§[3\.3](https://arxiv.org/html/2606.11562#S3.SS3), App\.[L](https://arxiv.org/html/2606.11562#A12)\)\. Residual noise comes from the gold reasoning, which is generated by an LLM and audited for surface defects but not certified for substantive bias\.
#### Does the dataset relate to people?
No\. The source graphs are public corpora \(academic citations, e\-commerce co\-purchase, biomedical citations, encyclopedia, patents, physics Q&A\)\. The release contains no PII beyond what is already public in those corpora\.
### G\.3 Collection Process
#### How was the data acquired?
The six source graphs are public datasets retrieved from their upstream repositories under their respective licences \(App\.[F](https://arxiv.org/html/2606.11562#A6), App\.[I](https://arxiv.org/html/2606.11562#A9)\)\.
#### What mechanisms or procedures were used to collect the data?
Per domain we sample 2\-hop ego\-graphs centred on hub nodes with in\-degree ≥ 3 and valid node text, capping neighbours at ten per hop \(App\.[K](https://arxiv.org/html/2606.11562#A11)\)\. Each ego\-graph is paired with one of the five tasks under deterministic per\-task construction rules \(App\.[J](https://arxiv.org/html/2606.11562#A10)\)\.
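The sampling rule can be sketched in plain Python\. This is a hedged illustration, not the authors' sampler: the helper names `eligible_hubs` and `sample_ego_graph`, the use of in\-neighbours for expansion, and the reading of "ten per hop" as a cap per expanded node are all assumptions\. App\. K holds the actual procedure\.

```python
import random


def eligible_hubs(in_neighbours, titles, min_in_degree=3):
    """Hubs need in-degree >= min_in_degree and non-empty node text."""
    return [
        node for node, nbrs in in_neighbours.items()
        if len(nbrs) >= min_in_degree and titles.get(node, "").strip()
    ]


def sample_ego_graph(in_neighbours, hub, max_per_hop=10, seed=0):
    """Return (nodes, edges) of a 2-hop ego-graph around `hub`.

    Assumption: at most `max_per_hop` in-neighbours are kept for every
    node expanded at each hop.
    """
    rng = random.Random(seed)
    nodes, edges = {hub}, set()
    frontier = [hub]
    for _ in range(2):  # two hops
        next_frontier = []
        for node in frontier:
            nbrs = in_neighbours.get(node, [])
            kept = rng.sample(nbrs, min(max_per_hop, len(nbrs)))
            for nbr in kept:
                edges.add((nbr, node))  # directed edge nbr -> node
                if nbr not in nodes:
                    nodes.add(nbr)
                    next_frontier.append(nbr)
        frontier = next_frontier
    return nodes, edges


# Toy citation graph: node -> list of nodes pointing at it.
in_neighbours = {0: [1, 2, 3], 1: [4], 2: [], 3: [4, 5], 4: [], 5: []}
titles = {i: f"paper {i}" for i in range(6)}

hubs = eligible_hubs(in_neighbours, titles)
nodes, edges = sample_ego_graph(in_neighbours, hubs[0])
print(hubs, sorted(nodes), len(edges))
```

Seeding the generator per call keeps the sample reproducible, which matters when the same ego\-graph must be re\-derived for auditing\.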
#### If sampling, what was the sampling strategy?
Random sampling without replacement among eligible hub nodes per domain\. We oversample to 2,200 candidates per \(domain, task\) cell so that the cell still meets its 1,400\-sample quota after the quality\-control pipeline rejects roughly 10%\.
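The sample accounting in this datasheet can be checked with a few lines of arithmetic\. The sketch below only restates figures given in the datasheet; the constant names are made up for illustration, and applying the pool\-level 90\.4% admission rate uniformly to one cell is a simplifying assumption\.

```python
DOMAINS, TASKS = 6, 5
CANDIDATES_PER_CELL = 2_200
TRAIN, VAL, TEST = 1_000, 100, 300
ADMIT_RATE = 0.904  # Layer-1 to Layer-4 admission rate over the whole pool

cells = DOMAINS * TASKS                    # (domain, task) cells
quota = TRAIN + VAL + TEST                 # samples released per cell
pool = cells * CANDIDATES_PER_CELL         # pre-quality-control candidates
released = cells * quota                   # samples shipped after capping
# Assumption: the pool-level rate applies evenly to every cell.
expected_survivors = CANDIDATES_PER_CELL * ADMIT_RATE

print(cells, quota, pool, released, round(expected_survivors))
```

Under that assumption a cell keeps about 1,989 candidates, comfortably above the 1,400 quota, so the per\-cell cap rather than the audit decides the final size\.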
#### Over what timeframe was the data collected?
Sampling, gold\-reasoning generation, and the four\-layer audit were run between March and May 2026\.
#### Was any data excluded? Why?
Yes\. An earlier seventh candidate domain, string\-db, was dropped because its protein\-interaction node descriptions exceeded the prompt budget and produced a Layer\-2 admission rate below 14% in pilot runs, which would not meet the per\-cell quota\.
#### Were any ethical review processes conducted?
N/A\. No human\-subject data are collected\. The benchmark is built on already\-public corpora\.
### G\.4 Preprocessing, Cleaning, Labeling
#### Was any preprocessing or cleaning done?
Yes\. A four\-layer release gate \(§[3\.3](https://arxiv.org/html/2606.11562#S3.SS3), App\.[L](https://arxiv.org/html/2606.11562#A12)\) audits the gold reasoning under a shared C1–C5 vocabulary \(Layers 1–3\) and deduplicates the observed graph against split leakage \(Layer 4\)\.
#### Who generated the gold reasoning?
DeepSeek \(the API as of 2026\-04\-01, §[3\.2](https://arxiv.org/html/2606.11562#S3.SS2) footnote 2\)\. Layer\-2 audit verdicts used to filter the gold reasoning are produced by Llama\-3\.1\-70B\-Instruct \(AWQ\) and Qwen\-2\.5\-72B\-Instruct \(AWQ\); see App\.[L\.2](https://arxiv.org/html/2606.11562#A12.SS2)\.
#### Who performed the Layer\-3 human calibration?
Two annotators drawn from the author team\. No external annotators were recruited and no compensation was paid \(App\.[L\.3](https://arxiv.org/html/2606.11562#A12.SS3)\)\.
#### How were the SBERT acceptance thresholds calibrated?
The per\-domain thresholds in App\.[J](https://arxiv.org/html/2606.11562#A10) \(papers 0\.30, products 0\.25, articles 0\.25, patents 0\.25, questions 0\.20\) were tuned on the validation \(dev\) split\. The test split was not used in threshold selection\.
#### Was the raw data saved alongside the preprocessed data?
Yes\. The release ships the 66,000\-sample candidate pool, the Layer\-1 / Layer\-2 / Layer\-3 verdicts, and the 42,000\-sample released benchmark\.
### G\.5 Uses
#### Has the dataset been used for any tasks already?
Yes\. §[4](https://arxiv.org/html/2606.11562#S4) reports four baseline families \(graph\-token alignment, zero\-shot frontier LLMs, Graph2Text SFT, plain GNNs\) under matched splits and a unified scoring protocol\.
#### Is there a repository linking to papers or systems that use the dataset?
#### What \(other\) tasks could the dataset be used for?
Diagnostics for retrieval\-augmented LLM systems that ingest graph context, ablation testbeds for new graph\-token alignment objectives, probes for emergent multi\-node reasoning in larger LLMs\.
#### Are there tasks for which the dataset should NOT be used?
The dataset should not be used to train or fine\-tune models for direct deployment in finance or healthcare decision\-making: the gold reasoning is LLM\-generated and does not constitute financial or medical advice\. It should not be treated as a knowledge\-graph QA benchmark; the questions probe inference, not retrieval\. The gold reasoning paragraph itself should not be treated as ground\-truth fact about the underlying domain\.
### G\.6 Distribution
#### Will the dataset be distributed to third parties?
Yes\. GraphInfer\-Bench is released as a public benchmark\.
#### How will the dataset be distributed?
#### When will the dataset be distributed?
On paper acceptance\.
#### Under what license?
Derived data \(samples, gold labels, gold reasoning, audit annotations\) under CC BY 4\.0; code under MIT \(App\.[F](https://arxiv.org/html/2606.11562#A6)\)\.
#### Are there third\-party IP restrictions on the data?
Source graphs retain their upstream licences \(ODC\-BY 1\.0 for OGB, US public domain for PubMed\-Diabetes and USPTO, CC BY\-SA 3\.0 for WikiCS, CC BY\-SA 4\.0 for Physics StackExchange\)\. Users must respect these when accessing raw graph structure beyond what we redistribute \(App\.[F](https://arxiv.org/html/2606.11562#A6)\)\.
#### Do export controls or regulatory restrictions apply?
No\.
### G\.7 Maintenance
#### Who will be supporting, hosting, maintaining the dataset?
The authors, on a best\-effort basis\.
#### How can the curator be contacted?
GitHub issues at the code repository and dataset discussions on the Hugging Face dataset page\.
#### Is there an erratum?
A CHANGELOG\.md in the GitHub repository will track post\-release errata and corrections\.
#### Will the dataset be updated?
Yes\. A v2 release is planned that adds finance and biology domains \(App\.[A](https://arxiv.org/html/2606.11562#A1)\)\. Old versions will be tagged on Hugging Face and remain accessible\. Updates will be announced in the repository CHANGELOG\.md and on the Hugging Face dataset card\.
#### If others want to extend, augment, or contribute, is there a mechanism?
Yes\. External contributors can propose new domains, task variants, or stronger gold\-reasoning audits via GitHub pull requests\. Each proposal is reviewed under the same Layer\-1 to Layer\-3 quality control as the original release\.
## Appendix H Implementation Details
This section collects training and inference details for every baseline\. Code and per\-baseline launchers are released alongside the dataset\.
#### Common setup\.
Node\-text input features are SBERT embeddings from sentence\-transformers/all\-MiniLM\-L6\-v2 \(384d\), shared across every alignment baseline so cross\-baseline differences are not confounded by node\-text encoder choice\. The LLM backbone is Llama\-3\-8B\-Instruct for the graph\-token family and the SFT family unless the original paper specified a different model\. Per\-domain training splits are 1,000 samples per task \(5,000 total per domain\), val 100 per task, test 300 per task\.
#### Graph\-token alignment baselines\.
Stage 1 trains a 2\-layer SAGEConv encoder with hidden 768\. Stage 2 jointly trains the projector and a LoRA on the LLM\. We use the “rounds” schedule from the original LLaGA recipe: three rounds of \(one epoch projector\-only, one epoch projector\+LoRA\), with per\-round evaluation and best\-round\-wins\. LoRA rank 32 on q, k, v, o\. Optimizer AdamW, learning rate $5\times 10^{-4}$, linear warm\-up 3%, batch size 1 with gradient accumulation 8, max sequence length 2,048\. Eval uses vLLM with greedy decoding and max\_new\_tokens=384 \(T1 to T4\) or 2,500 \(T5\)\.
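The “rounds” schedule can be sketched as a short control loop\. This is an illustrative reading of the description above, not the released launcher: the callbacks `train_epoch` and `evaluate` are hypothetical stand\-ins for real training and evaluation code, the scores are invented, and breaking ties in favour of the earlier round is an assumption\.

```python
def run_rounds(train_epoch, evaluate, num_rounds=3):
    """Alternate projector-only and projector+LoRA epochs; keep the best round."""
    best_round, best_score = None, float("-inf")
    for round_idx in range(1, num_rounds + 1):
        train_epoch(trainable=("projector",))         # epoch 1: projector only
        train_epoch(trainable=("projector", "lora"))  # epoch 2: projector + LoRA
        score = evaluate()                            # per-round evaluation
        if score > best_score:                        # best-round-wins
            best_round, best_score = round_idx, score
    return best_round, best_score


# Stub callbacks so the sketch runs without a model.
log = []
scores = iter([0.41, 0.47, 0.45])  # made-up per-round scores

best_round, best_score = run_rounds(
    train_epoch=lambda trainable: log.append(trainable),
    evaluate=lambda: next(scores),
)
print(best_round, best_score, len(log))
```

In a real run the loop would also save a checkpoint whenever `best_round` changes, so the winning round can be restored for test\-time evaluation\.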
#### Zero\-shot frontier LLMs\.
GPT\-5 and Claude Opus 4\.7\. Subgraphs are serialised as a plain\-text indexed node list plus an edge list\. The prompt provides the canonical answer\-frame sentence with no closed\-set label list\. For T1 and T3 the rescorer projects the free\-form answer onto the per\-domain canonical label set with SBERT\-MiniLM\-L6 cosine before scoring\. For Opus 4\.7 we omit temperature \(the model uses its default\); for GPT\-5 we set reasoning\_effort=minimal and pass max\_completion\_tokens=2048\.
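The rescoring step for T1 and T3 can be sketched as a nearest\-label lookup under cosine similarity\. This is a minimal sketch under stated assumptions: taking the arg\-max label is one plausible reading of “projects onto”, the helper names are hypothetical, and the toy 3\-d vectors stand in for real 384\-d SBERT embeddings\.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def project_to_label(answer_vec, label_vecs):
    """Return the canonical label whose embedding is closest to the answer."""
    return max(label_vecs, key=lambda label: cosine(answer_vec, label_vecs[label]))


# Toy embeddings, invented for illustration; real ones come from the SBERT encoder.
label_vecs = {
    "cs.LG": [0.9, 0.1, 0.0],
    "cs.CV": [0.1, 0.9, 0.0],
    "cs.DB": [0.0, 0.1, 0.9],
}
answer_vec = [0.8, 0.3, 0.1]  # pretend embedding of a free-form answer

predicted = project_to_label(answer_vec, label_vecs)
print(predicted)
```

Projecting before scoring means a paraphrased but correct free\-form answer is still credited against the closed label set\.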
#### Graph2Text SFT\.
QLoRA \(NF4, double\-quant\) on Vicuna\-7B\-v1\.5\-16k, Qwen2\-7B\-Instruct, and Llama\-3\-8B\-Instruct\. Training set is the 5,000 samples per domain rendered into ShareGPT format with the answer\-plus\-reasoning target\. LoRA rank 16, alpha 32, dropout 0\.05, lr $2\times 10^{-4}$, batch size 1 with gradient accumulation 8, 2 epochs, warm\-up ratio 0\.03\. Eval uses vLLM with greedy decoding and max\_new\_tokens=384\.
#### Plain GNN reference\.
2\-layer GCN, GAT, GraphSAGE with hidden 768, dropout 0\.5, optimizer AdamW with lr $1\times 10^{-3}$, weight decay $5\times 10^{-4}$, early stopping \(patience 20\) on val accuracy\. Three seeds \(42, 43, 44\)\. For GCN and GAT we use a concat\_input=True variant that concatenates the raw SBERT input to the final\-layer GNN output before the head, which gave the best results among the variants we tested\.
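The concat\_input idea can be shown with a stripped\-down stand\-in\. This sketch is not the baseline implementation: it replaces the learned GCN or GAT layers with parameter\-free mean aggregation and omits weights, nonlinearities, dropout, and the classification head\. The function names are hypothetical; only the concatenation of raw input features to the final\-layer output follows the text\.

```python
def mean_aggregate(features, neighbours):
    """One parameter-free propagation step: average each node with its neighbours."""
    out = {}
    for node, x in features.items():
        group = [x] + [features[n] for n in neighbours.get(node, [])]
        out[node] = [sum(col) / len(group) for col in zip(*group)]
    return out


def encode(features, neighbours, num_layers=2, concat_input=True):
    """Propagate `num_layers` times, then optionally append the raw input."""
    hidden = features
    for _ in range(num_layers):
        hidden = mean_aggregate(hidden, neighbours)
    if concat_input:
        # The head would see both the propagated and the raw node features.
        return {node: hidden[node] + features[node] for node in features}
    return hidden


# Two connected nodes with 2-d toy features (real inputs are 384-d SBERT vectors).
features = {0: [1.0, 0.0], 1: [0.0, 1.0]}
neighbours = {0: [1], 1: [0]}

encoded = encode(features, neighbours)
print(encoded)
```

In the toy graph, propagation alone makes the two nodes indistinguishable, while the concatenated raw features keep them apart, which is one intuition for why such a skip connection can help\.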
#### Hardware\.
Graph\-token training and SFT\-8B run on single A100 \(40 GB\) or A800 \(80 GB\) GPUs\. SFT\-70B uses Llama\-3\.1\-70B\-Instruct\-AWQ\-INT4 on a single RTX PRO 6000 Blackwell \(97 GB\)\. Layer\-2 dual judges run two AWQ\-INT4 70B models in parallel on the same Blackwell card pool\. Per\-baseline training time is < 3 hours per \(domain, task\) cell\. The full benchmark and ablation suite consumed approximately 2,250 A100 GPU\-hours\.
## Appendix IPer\-Domain Raw Data
#### ogbn\-arxiv \(academic citation\)\.
169K CS papers with 40 subject\-area labels \[[11](https://arxiv.org/html/2606.11562#bib.bib11)\]\. Directed citation edges\. Rich abstracts make SBERT verification straightforward; this is the primary development domain\. License\. OGB graph \+ node features: ODC\-BY 1\.0; arXiv abstracts: per\-paper, predominantly CC\-BY 4\.0 \(and a smaller share of arXiv non\-exclusive licenses\)\.
#### ogbn\-products \(e\-commerce co\-purchase\)\.
2\.4M Amazon products with 47 category labels \[[11](https://arxiv.org/html/2606.11562#bib.bib11)\]\. Co\-purchase edges encode consumer behaviour, not topical similarity\. Product titles are shorter than paper abstracts, so SBERT thresholds are relaxed by $\sim$20%\. License\. OGB graph \+ node features: Amazon\-Berkeley Objects custom research license \(research use permitted; redistribution allowed within OGB\)\.
#### PubMed \(clinical citation\)\.
19,717 papers on diabetes research \[[24](https://arxiv.org/html/2606.11562#bib.bib24)\] with 44K citation edges and 3 disease\-type labels\. Structurally analogous to ogbn\-arxiv but in the biomedical domain, testing whether citation\-graph reasoning transfers across scientific disciplines\. License\. PubMed graph \(Sen et al\., 2008\): publicly released for research; PubMed abstracts: U\.S\. Government public\-domain \(NLM courtesy notice required when redistributed\)\.
#### WikiCS \(encyclopedia\)\.
10,893 Wikipedia articles on computer science topics connected by 276K hyperlink edges across 10 categories \[[19](https://arxiv.org/html/2606.11562#bib.bib19)\]\. Article text is rich and well\-structured, making SBERT verification straightforward\. License\. WikiCS graph \+ features \(Mernyei & Cangea, 2020\): MIT; underlying Wikipedia article text: CC\-BY\-SA 3\.0\.
#### USPTO \(patent citation\)\.
266K patents in the H04L \(transmission of digital information\) CPC class connected by 2\.0M citation edges across 19 CPC group codes \[[14](https://arxiv.org/html/2606.11562#bib.bib14)\]\. Patent titles follow a formal style distinct from academic writing, testing generalisation across text registers\. License\. USPTO patent grants and CPC classifications: U\.S\. Government public\-domain; the citation graph is released by USPTO as bulk data\.
#### Physics StackExchange \(Physics Q&A\)\.
234K physics questions connected by 105K related\-question links across 51 topic tags\. Conversational question titles are diverse even within the same tag, requiring relaxed coherence thresholds\. License\. Physics SE post titles, bodies, and comments: CC\-BY\-SA 4\.0 \(per Stack Exchange Network terms; attribution to the original author and a link back to the question are required when redistributing individual posts\)\.
## Appendix J Detailed Task Specifications
For each task we list the setup, the no\-shortcut construction, the ground\-truth source, and \(for T4/T5\) the structured output format\.
### J\.1 T1: Masked Node Prediction \(node level, description\)
Setup\. A subgraph centred on a hub node whose title is replaced with `[MASK]`\. The model must infer what the masked node is about from its neighbours’ titles and the link structure\.
No\-shortcut construction\. The hub’s title is suppressed; any correct answer must come from neighbourhood context\.
Ground truth\. The real abstract of the masked node \(retrieved from dataset metadata, no annotation\)\. Hard metric: SBERT\-Sim $\geq$ threshold against the abstract\.
Domain adaptation\. Across domains, the prompt vocabulary changes \(“paper”, “product”, “article”, “patent”, “question”\) and SBERT thresholds are calibrated to entity text length \(papers 0\.30, products 0\.25, articles 0\.25, patents 0\.25, questions 0\.20\)\.
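A minimal sketch of the T1 hard metric as described above, assuming the prediction and the abstract have already been embedded by some sentence encoder. The threshold values come from the text; the dictionary keys, function names, and the use of plain Python lists as embeddings are illustrative assumptions.

```python
import math

# Thresholds copied from the text; the keys are illustrative names.
SBERT_THRESHOLDS = {
    "paper": 0.30,
    "product": 0.25,
    "article": 0.25,
    "patent": 0.25,
    "question": 0.20,
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def t1_hard_hit(pred_emb, abstract_emb, entity):
    """True when the similarity clears the entity-specific threshold."""
    return cosine(pred_emb, abstract_emb) >= SBERT_THRESHOLDS[entity]
```

In practice the two vectors would come from the SBERT encoder; toy vectors are enough to exercise the thresholding logic.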
### J\.2 T2: Relational Description \(edge level, description\)
Setup\. Given an edge $(u,v)$ and $u$’s local subgraph, describe the relationship, namely what role $v$ plays in $u$’s context\. “They share a topic” is insufficient\.
No\-shortcut construction\. The role depends on both nodes’ structural positions; neither node’s title alone encodes it\.
Ground truth\. Subject label of $v$ \(hard anchor\); SBERT against the domain description of $v$’s category \(soft anchor\)\.
### J\.3 T3: Theme Summarisation \(graph level, description\)
Setup\. A cluster of same\-label nodes connected by edges within the cluster\. The model must describe the collective theme\.
No\-shortcut construction\. The theme is a collective property; no single node’s title states the area\-level theme\.
Ground truth\. Dominant label of the cluster \(hard anchor\); SBERT against a reference theme description \(soft anchor\)\.
### J\.4 T4: Outlier Detection \(graph level, comparison\)
Setup\. A topically coherent cluster plus one injected outlier from a different category\. The outlier shares edges with the cluster \(structurally embedded\) but its title embedding has cosine similarity $<0.40$ to the cluster centroid \(topically distant\)\. Node IDs are shuffled\. The model must output one line per node:
> Node 0: cluster member
> Node 1: outlier
> \.\.\.
> Reasoning: \[2\-3 sentence explanation\]
No\-shortcut construction\. The outlier is structurally embedded but topically distant; neither topology nor a single node’s content resolves the task\.
Ground truth\. Injected outlier’s node ID \(deterministic, zero annotation cost\)\. Hard metrics: per\-node precision, recall, F1 from structured lines\. Soft metric: SBERT of explanation against outlier’s hidden abstract\.
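The hard metrics for T4 can be illustrated with a short parser over the structured lines. This is a sketch under assumptions, not the benchmark's rescorer: the regex, the function names, and the choice to score only the outlier class are hypothetical. It accepts both the "cluster member" spelling used in this section and the "not outlier" spelling used in the prompt template.

```python
import re

# Hypothetical pattern for one structured line; case-insensitive on the tag.
LINE_RE = re.compile(
    r"Node\s+(\d+):\s*(outlier|cluster member|not outlier)", re.IGNORECASE
)

def parse_t4(text):
    """Map node id -> True if the first line for that node says 'outlier'."""
    labels = {}
    for node, tag in LINE_RE.findall(text):
        labels.setdefault(int(node), tag.lower() == "outlier")
    return labels

def outlier_prf(text, gold_outlier):
    """Precision, recall, F1 of the predicted outlier set against the gold id."""
    predicted = {n for n, is_out in parse_t4(text).items() if is_out}
    tp = 1 if gold_outlier in predicted else 0
    precision = tp / len(predicted) if predicted else 0.0
    recall = float(tp)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1
```

Because exactly one node is the gold outlier, recall is either 0 or 1, and precision drops only when the model marks more than one node as an outlier.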
### J\.5 T5: Community Detection \(graph level, comparison\)
Setup\. A shuffled graph of nodes from 2–3 distinct categories\. Node IDs are interleaved across communities so that position encodes no information\. The model must output:
> Node 0: Community 0
> Node 1: Community 1
> \.\.\.
> Reasoning:
> Community 0: \[description\]
> Community 1: \[description\]
No\-shortcut construction\. The partition is a collective property; no single node’s text reveals which other nodes belong to its group\.
Ground truth\. Subject\-label partition \(deterministic, zero annotation cost\)\. Hard metrics: ARI, NMI, purity, community\-count accuracy\. Soft metric: per\-community SBERT against the category description\.
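Two of the hard metrics, purity and NMI, can be sketched in plain Python to show how a predicted partition is compared against the gold one. The arithmetic\-mean normalisation of NMI is an assumption here (the text does not state which normalisation is used), and the function names are illustrative.

```python
import math
from collections import Counter

def purity(pred, gold):
    """Share of nodes covered by the majority gold label of their predicted community."""
    by_cluster = {}
    for p, g in zip(pred, gold):
        by_cluster.setdefault(p, Counter())[g] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(gold)

def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def nmi(pred, gold):
    """Mutual information divided by the mean of the two entropies.

    Returns 0.0 whenever either side is a single cluster.
    """
    n = len(gold)
    joint = Counter(zip(pred, gold))
    cp, cg = Counter(pred), Counter(gold)
    mi = sum(
        (c / n) * math.log(c * n / (cp[p] * cg[g])) for (p, g), c in joint.items()
    )
    denom = (entropy(pred) + entropy(gold)) / 2
    return mi / denom if denom > 0 else 0.0
```

A partition that places every node in one community scores an NMI of zero regardless of the gold labels, while community ids that are merely permuted still score 1.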
## Appendix K Sample Pool Generation
This section describes the pre\-quality\-control data pool\. The 66,000 candidate samples \(6 domains $\times$ 5 tasks $\times$ 2,200\) are the raw input to the four\-layer pipeline that filters them down to the released 42,000 \(Section [3\.3](https://arxiv.org/html/2606.11562#S3.SS3)\)\. At this stage no LLM judge, human annotator, or deduplicator has touched the samples\.
#### Domain coverage\.
The pool draws from the six text\-attributed graphs listed in Table [2](https://arxiv.org/html/2606.11562#S3.T2): ogbn\-arxiv, ogbn\-products, PubMed, WikiCS, USPTO, and Physics SE\.
#### Subgraph sampling\.
For each domain we extract 2\-hop ego\-graphs centred on randomly sampled hub nodes \(in\-degree $\geq 3$ with valid node text\), with up to 10 neighbours per hop\. The resulting ego\-graph has at most $1+10+10\cdot 10=111$ nodes\. The sampler enforces three properties:
- Per\-cell quota\. For each \(domain, task\) cell we continue sampling until 2,200 valid candidates accumulate, so the pool contains exactly $6\times 5\times 2{,}200=66{,}000$ samples regardless of underlying graph size\.
- Cross\-task consistency\. All five tasks for a given domain draw from the *same* ego\-graph pool of that domain; this lets us audit and cap on a per\-cell basis \(e\.g\. deduplicate ego\-graphs before splitting\) without reshuffling the cell–cell allocation\.
- No leakage of dataset labels\. Models receive node *titles* only\. Abstracts \(arxiv, PubMed\), product descriptions, and dataset class labels are withheld for ground\-truth construction; they are not visible in any prompt at any stage\.
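The 2\-hop sampling with a fan\-out of 10 can be sketched as a bounded breadth\-first expansion. This is an illustrative reading of the description, not the released sampler: the adjacency\-dictionary input, the function name, and the seeding are assumptions, and hub eligibility \(in\-degree and valid text\) is left to the caller.

```python
import random

def sample_ego_graph(adj, hub, fanout=10, hops=2, rng=None):
    """Expand from `hub`, adding up to `fanout` unseen neighbours per frontier node."""
    rng = rng or random.Random(0)
    nodes = {hub}
    frontier = [hub]
    for _ in range(hops):
        next_frontier = []
        for u in frontier:
            candidates = [v for v in adj.get(u, ()) if v not in nodes]
            for v in rng.sample(candidates, min(fanout, len(candidates))):
                if v not in nodes:  # guards against repeated adjacency entries
                    nodes.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return nodes
```

With the default arguments the result can never exceed 1 \+ 10 \+ 100 = 111 nodes, which matches the bound stated above; dense neighbourhoods with shared neighbours yield fewer.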
#### Per\-task ground\-truth construction\.
Each ego\-graph is then converted into one prompt per task using deterministic rules: T1 masks the ego\-centre’s title and the gold answer is its dataset category; T2 sub\-samples a connected and a disconnected node pair from the ego\-graph and asks for an edge\-or\-no\-edge verdict; T3 uses the dominant ego\-graph category as gold theme; T4 injects a hub from a different category as the outlier; T5 partitions the ego\-graph by node category and asks the model to recover the partition\. These rules are pure graph operations \(no model is involved at this stage\), so the gold answers are fully reproducible from the raw graph\.
#### Domain\-adaptive prompting\.
Task prompts use domain\-specific vocabulary via a `DOMAIN_TERMS` registry that maps each domain to its natural language:
> ogbn\-arxiv: paper, papers, citation graph, cites
> ogbn\-products: product, products, co\-purchase graph, co\-purchased with
> PubMed: paper, papers, clinical citation graph, cites
> USPTO: patent, patents, citation graph, cites
> WikiCS: article, articles, hyperlink graph, links to
> Physics SE: question, questions, related\-question graph, links to a related question
This rewrites surface vocabulary so the prompt reads naturally for the domain \(“this patent citation graph contains 7 patents”, not “this graph contains 7 papers”\) without changing task logic or gold answers\.
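A registry of this kind could look like the following sketch. The four\-tuple layout \(noun, plural noun, graph kind, edge verb\), the dictionary keys, and the header wording are assumptions for illustration; only the vocabulary entries themselves are taken from the listing above.

```python
# Each entry: (noun, noun_plural, graph_kind, edge_verb). The layout is illustrative.
DOMAIN_TERMS = {
    "ogbn-arxiv": ("paper", "papers", "citation graph", "cites"),
    "ogbn-products": ("product", "products", "co-purchase graph", "co-purchased with"),
    "PubMed": ("paper", "papers", "clinical citation graph", "cites"),
    "USPTO": ("patent", "patents", "citation graph", "cites"),
    "WikiCS": ("article", "articles", "hyperlink graph", "links to"),
    "Physics SE": (
        "question",
        "questions",
        "related-question graph",
        "links to a related question",
    ),
}

def graph_header(domain, n_nodes):
    """Render a one-line graph header in the vocabulary of `domain`."""
    _noun, noun_plural, graph_kind, _edge_verb = DOMAIN_TERMS[domain]
    return f"This {graph_kind} contains {n_nodes} {noun_plural}."
```

Swapping the domain key changes only the surface wording; the task logic and gold answers are untouched, which is the property the paragraph above relies on.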
#### Per\-task prompt templates\.
The five tasks share a single prompt skeleton: a graph header \(domain\-specific noun, edge type, and node listing\), a question parameterised by the domain registry, and an answer\-format block\. \{noun\}, \{noun\-plural\}, \{graph\-kind\}, \{edge\}, \{topic\}, and \{theme\} are filled from `DOMAIN_TERMS` above\. For each task we show the template once; the same template is used in all six domains\.
Prompt template: T1 – Masked Node Prediction
> <graph\>
> The \{graph\-kind\} contains \{N\} \{noun\-plural\}\. Node \{hub\} is the central \{noun\} and its title is masked\. Based on the graph structure and the titles of its \{deg\} neighbours, what is the masked \{noun\} most likely about?
> Provide your answer in the format:
> Answer: <one sentence stating the \{topic\}\>
> Reasoning: <2\-4 sentences explaining why\>

Prompt template: T2 – Relational Description
> <graph\>
> Consider Node \{a\} and Node \{b\} in this \{graph\-kind\}\. First, determine whether they have a \{edge\}\. Then explain the relationship \(or lack thereof\)\.
> Provide your answer in the format:
> Answer: <one sentence stating whether the \{edge\} exists\>
> Reasoning: <2\-3 sentences explaining why\>

Prompt template: T3 – Theme Summarisation
> <graph\>
> This \{graph\-kind\} contains \{N\} \{noun\-plural\} connected by \{E\} \{edge\}s\. What is this group of \{noun\-plural\} collectively about?
> Provide your answer in the format:
> Answer: <one sentence naming the common \{theme\}\>
> Reasoning: <2\-4 sentences explaining why\>

Prompt template: T4 – Outlier Detection
> <graph\>
> This \{graph\-kind\} contains \{N\} \{noun\-plural\} connected by \{E\} \{edge\}s\. Most \{noun\-plural\} share a coherent \{theme\}; exactly ONE \{noun\} is the semantic outlier \(its \{topic\} does not fit\)\. Identify the outlier and explain why\.
> Provide your answer in the format:
> Answer:
> Node 0: <outlier \| not outlier\>
> Node 1: <outlier \| not outlier\>
> … \(one line per node, exactly one must be ’outlier’\)
> Reasoning: <2\-3 sentences explaining why the outlier does not fit\>

Prompt template: T5 – Community Detection
> <graph\>
> This \{graph\-kind\} contains \{N\} \{noun\-plural\} from \{K\} distinct communities\. Partition the nodes into \{K\} communities labelled Community 0\.\.\{K\-1\}\. Describe each community\.
> Provide your answer in the format:
> Answer:
> Node 0: Community <k\>
> Node 1: Community <k\>
> … \(one line per node; every node assigned to exactly one community\)
> Reasoning:
> Community 0: <2\-3 sentence description of its \{theme\}\>
> Community 1: <2\-3 sentence description of its \{theme\}\>
> … \(one block per community\)

Figure 4: Per\-task prompt templates\. The five tasks share a single skeleton \(graph header \+ question \+ answer\-format block\); only the question and answer\-format change\. Domain\-specific tokens \(\{noun\}, \{edge\}, etc\.\) are filled from `DOMAIN_TERMS`\. The literal token <graph\> is replaced with a plain\-text node listing \(one line per node: `Node {i}: {title}`\) followed by an edge listing \(`Node {a} {edge-verb} Node {b}`\)\. Note that *no* prompt ever contains the dataset class label, the masked node’s own title, the outlier’s identity, or the gold partition\. Those live only in `ground_truth` and are never seen by the generator that drafts the placeholder reasoning, nor by any test\-time model\.
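The substitution of the graph token described in the caption can be sketched as follows. The function names and the optional `masked` argument are hypothetical; the line formats for nodes and edges follow the caption, and the mask string follows the T1 specification.

```python
def render_graph(titles, edges, edge_verb, masked=None):
    """Render the node listing followed by the edge listing.

    `masked` is an optional local node id whose title is hidden (T1).
    """
    lines = []
    for i, title in enumerate(titles):
        shown = "[MASK]" if i == masked else title
        lines.append(f"Node {i}: {shown}")
    for a, b in edges:
        lines.append(f"Node {a} {edge_verb} Node {b}")
    return "\n".join(lines)

def fill_prompt(template, titles, edges, edge_verb, masked=None):
    """Replace the literal <graph> token in a template with the rendered listing."""
    return template.replace("<graph>", render_graph(titles, edges, edge_verb, masked))
```

Only titles and edges enter the rendered text, so labels, abstracts, and gold answers cannot leak through this path as long as they are never passed in.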
## Appendix L Data Quality Control Details
The four layers split into two jobs\. Layers 1 to 3 audit the gold *reasoning* under a shared five\-code failure vocabulary \(C1 Fact, C2 Consistency, C3 Label\-Integrity, C4 Logic, C5 Evidence\) distilled from the agentic audit \(Section [3\.3](https://arxiv.org/html/2606.11562#S3.SS3)\), so scripted, model, and human verdicts are directly comparable\. Layer 1 catches mechanical defects with regex\. Layer 2 adds two open\-weight 70B judges for the semantic codes\. Layer 3 calibrates Layer 2 against humans\. Layer 4 audits the *observed graph*, deduplicating any sample whose canonical identity already appears in the release\.
### L\.1 Layer 1: Agent\-audit\-grounded rule\-based filtering
Layer 1 catches deterministic defects with no LLM and no GPU\. Each rule was added in response to a recurrent failure mode that the agentic audit identified on a held\-out diagnostic subset, then operationalised as a regex or parser so that it can be applied exhaustively to all 66,000 candidates\.
#### Codes Layer 1 covers\.
Layer 1 covers the mechanically\-detectable subset of the five\-code vocabulary:
- C1 \(Fact\)\. Every `Node N` reference in the gold reasoning must satisfy $0\leq N<|V|$\. We use a case\-sensitive regex \(`\bNode\s+\d+\b`\), so that lowercase mentions inside literal placeholder titles such as ‘Item \(node 821252\)’ \(common in the ogbn\-products subset where some products lack a textual title\) are not flagged as hallucinations\.
- C2 \(Consistency, mechanical only\)\. For T2 we extract the polarity of the answer’s edge claim \(a positive verb cluster such as *have a citation/edge/interaction* versus a negative cluster such as *no edge / do not have / are not connected*\) and compare with the ground\-truth flag `has_edge`\. For T4 we parse the unique `Node N: outlier` line and compare with the ground\-truth outlier id\. Semantic C2 violations \(e\.g\. a reasoning that concludes “no direct citation” while the answer asserts an edge\) require natural\-language understanding and are deferred to Layer 2\.
- C3 \(Label\-Integrity\)\. The answer string and the ground\-truth label must be well\-formed: no unbalanced quotes, no leading\-quote fragments such as ‘"Patio’ \(a CSV\-truncation artefact of ‘"Patio, Lawn & Garden"’\), no placeholder tokens \(`---`, `???`, `[MASKED]`\), and no degenerate labels shorter than three characters\.
- Format compliance \(T4, T5\)\. T4 must contain exactly one `Node N: outlier` line and exactly $|V|-1$ `Node N: not outlier` lines, with each local id $0,\ldots,|V|-1$ appearing on exactly one line\. T5 must contain one `Node N: Community X` line per node and the number of distinct communities must agree with the ground\-truth partition\.

C4 \(Logic\) and C5 \(Evidence\) are inherently semantic and live in Layer 2\.
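Three of the mechanical checks above can be sketched with the standard `re` module. The C1 pattern is the one quoted in the text with a capture group added; everything else \(function names, the exact line pattern for T4, the simplified C3 test\) is an illustrative assumption and not the released filter.

```python
import re

NODE_REF = re.compile(r"\bNode\s+(\d+)\b")  # case-sensitive on purpose
OUTLIER_LINE = re.compile(
    r"^Node\s+(\d+):\s*(outlier|not outlier)\s*$", re.MULTILINE
)
PLACEHOLDERS = {"---", "???", "[MASKED]"}

def c1_out_of_range(reasoning, num_nodes):
    """Node ids cited in the reasoning that fall outside 0..num_nodes-1."""
    return [int(n) for n in NODE_REF.findall(reasoning) if int(n) >= num_nodes]

def c3_label_ok(label):
    """Simplified label check: length, balanced double quotes, no placeholder."""
    return (
        len(label) >= 3
        and label.count('"') % 2 == 0
        and label not in PLACEHOLDERS
    )

def t4_format_ok(answer, num_nodes):
    """Exactly one 'outlier' line and every local id on exactly one line."""
    rows = OUTLIER_LINE.findall(answer)
    ids = sorted(int(n) for n, _ in rows)
    n_outliers = sum(1 for _, tag in rows if tag == "outlier")
    return ids == list(range(num_nodes)) and n_outliers == 1
```

Because the C1 pattern is case\-sensitive, a placeholder title such as "Item \(node 821252\)" is ignored while "Node 12" in a five\-node graph is reported.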
#### Cost and throughput\.
Layer 1 is $\sim$250 lines of Python regex that runs over the full 66,000\-sample pool in under five minutes on a CPU at zero cost\.
#### Results\.
Of 66,000 candidates, Layer 1 admits 65,328 \(98\.98%\) and rejects 672 \(Table [10](https://arxiv.org/html/2606.11562#A12.T10)\)\. The dominant rejection cause is C3 \(671 samples, 99\.9% of all rejections, concentrated on T1 and T3\) and traces to a single root cause: a CSV\-parsing bug in the upstream ogbn\-products label loader that truncates labels containing literal commas \(‘"Patio, Lawn & Garden"’ becomes ‘"Patio’\), producing an unbalanced leading quote in both the gold label and any answer that quotes it\. After the regex fix that suppresses lowercase node mentions inside literal placeholder titles such as ‘Item \(node 821252\)’, only two C1 cases survive across the full 66K release: genuine out\-of\-range reasoning references that do warrant exclusion\. T5 has zero Layer\-1 rejections after the fix, reflecting the fact that T5 templates are generated by a structured serialiser rather than free\-text reasoning\.
Table 10: Layer\-1 data\-quality\-control results on the 66,000\-sample candidate release \(5 tasks $\times$ 6 domains $\times$ 2,200\)\. C3 dominates rejections through a single upstream CSV\-truncation defect on ogbn\-products; the residual C1/C2/format failures are vanishingly rare \($\leq 2$ per task\)\.
### L\.2 Layer 2: Dual open\-source 70B judges
Layer 2 catches the semantic violations Layer 1 cannot see: ungrounded entity references that survive the C1 regex \(e\.g\. a hallucinated *title* matching no node\), reasoning–answer self\-contradictions \(C2\-semantic\), illogical inference steps \(C4\), and reasoning that cites the wrong neighbours for the queried claim \(C5\)\.
#### Judge models\.
We deploy two open\-weight 70B\-class judges, Llama\-3\.1\-70B\-Instruct and Qwen\-2\.5\-72B\-Instruct, both AWQ\-INT4 \[[16](https://arxiv.org/html/2606.11562#bib.bib16)\] and served via vLLM with the AWQ\-Marlin kernel, greedy decoding, and a 300\-token generation budget\. Each AWQ checkpoint fits in $\sim$40 GB, so a single RTX PRO 6000 Blackwell GPU \(97 GB\) hosts one judge instance and a fleet dispatches samples in parallel\. Open weights make the verdicts reproducible by any third party with comparable hardware\.
#### Prompt and aggregation\.
Both judges receive the identical system prompt enumerating C1–C5, plus a user message that supplies the gold answer, the gold reasoning, the ground\-truth structure, and the local graph context \(node titles truncated to 8,000 characters\)\. The judge emits a single\-line JSON verdict `{qualified, fail_flags, rationale}`\. We aggregate via `all_pass`: a sample is admitted only if both judges return `qualified=true`\. Figure [5](https://arxiv.org/html/2606.11562#A12.F5) reproduces the verbatim prompt template used for both judges \(no domain\- or task\-specific parameter\-tuning beyond the `{task_desc}` string that expands to Masked Node Prediction, Relational Semantics, etc\., as listed in Appendix [J](https://arxiv.org/html/2606.11562#A10)\)\.
System prompt \(identical for both judges\)
> You are a strict release\-quality auditor for the GraphInfer\-Bench benchmark\. You will judge exactly ONE sample\. Decide whether it is QUALIFIED for the public release\. Be conservative: when in doubt, FAIL\.
>
> A sample is QUALIFIED only if all five checks pass:
>
> C1 FACT: every node ID / title / protein / paper cited in the reasoning appears in the graph context above\. No fabricated neighbours\.
>
> C2 CONSISTENCY: the conclusion of the reasoning agrees with the answer\. Specifically, reject if the reasoning states “no edge exists”, “no direct citation”, or “no interaction” while the answer asserts the edge is present \(or vice versa\)\. Reject if the reasoning concludes one subject label and the answer states a different one\.
>
> C3 LABEL\_INTEGRITY: the answer string is well\-formed: no CSV\-truncation artefacts \(*e\.g\.*, `"Patio` is a fragment of `"Patio, Lawn & Garden"` and is invalid\), no stray quotes, no obviously wrong category names\.
>
> C4 LOGIC: each step of the reasoning follows from the previous step or from the graph context\. Reject unwarranted leaps and post\-hoc justifications that don’t engage with the visible neighbours\.
>
> C5 EVIDENCE: the reasoning uses the RIGHT neighbours / edges to support its claim\. For T2, the reasoning must justify the queried pair \(not a different edge\)\. For T4, the cited cluster signal must actually differentiate the outlier\. For T5, each community claim must cite nodes assigned to that community\.
>
> Output exactly one JSON object on a single line, nothing else:
> \{"qualified": true\|false, "fail\_flags": \[\.\.\.subset of "C1","C2","C3","C4","C5"\.\.\.\], "rationale": "\.\.\."\}

User message template \(per sample\)
> Domain: \{domain\}
> Task: \{task\} \(\{task\_desc\}\)
> Graph context \(node titles\): \-\-\- \{node\_titles\} \-\-\-
> Ground truth \(hidden from the model\): \-\-\- \{gt\} \-\-\-
> Gold answer \(will be released as the reference label\): \-\-\- \{answer\} \-\-\-
> Gold reasoning \(will be released as the reference rationale\): \-\-\- \{reasoning\} \-\-\-

Figure 5: Layer\-2 judge prompt\. Both Llama\-3\.1\-70B\-AWQ and Qwen\-2\.5\-72B\-AWQ receive an identical *system prompt* \(top, defining C1–C5\) and a per\-sample *user message* \(bottom\)\. Decoding is greedy \(temperature $=0$\); generation budget 300 tokens\.
#### Throughput\.
We dispatch the 65,328 Layer\-1\-passing samples across a 16\-GPU fleet \(4 servers $\times$ 1, 3, 5, 6 RTX PRO 6000 Blackwells\), with deterministic per\-judge sharding: $\mathrm{md5}(\text{sample\_id}) \bmod N_{\mathrm{judge}} = k$ assigns each sample to one of $N_{\mathrm{judge}}=8$ shards per judge\. Aggregate throughput peaks at $\sim$30 verdicts/s \($\sim$15/s per judge\) once vLLM prefix caching warms on the shared system prompt; the full 130,656\-call run completes in $\sim$1\.7 wall hours\.
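The sharding rule can be written in a few lines with `hashlib`. This is a sketch of the formula as stated; the sample\-id format in the usage line and the choice to interpret the full hex digest as an integer are assumptions.

```python
import hashlib

N_JUDGE_SHARDS = 8  # shards per judge, as stated in the text

def shard_of(sample_id, n_shards=N_JUDGE_SHARDS):
    """Deterministic shard index: md5(sample_id) mod n_shards."""
    digest = hashlib.md5(sample_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards

# Hypothetical sample id; any stable string identifier works the same way.
shard = shard_of("ogbn-arxiv/T4/000123")
```

Hashing the identifier rather than using its position in a file keeps the assignment stable across reruns and across machines, so a failed shard can be replayed in isolation.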
#### Results\.
Of the 65,328 Layer\-1\-passing samples, both judges return verdicts on 65,164; the missing 164 trace to transient parser errors and are dropped\. Llama qualifies 97\.9% of the samples it sees; Qwen qualifies 95\.7%\. Qwen is the systematically stricter judge\. The `all_pass` aggregation \(Table [11](https://arxiv.org/html/2606.11562#A12.T11)\) admits 62,010 samples \(95\.2% of both\-judged samples, 94\.9% of the L1\-passing set, 93\.9% of the master 66,000\) and drops the rest: 1,054 \(1\.6%\) on which both judges agree to reject and 2,100 \(3\.2%\) on which the two judges disagree\. We treat any single\-judge rejection as a rejection of the whole sample\.
#### Inter\-judge agreement\.
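On the 65,164 dual\-judged sample pairs \(6 domains, post string\-db drop\), raw agreement is 96\.8% and Cohen’s $\kappa=0.486$, which is moderate agreement on the conventional Landis\-Koch scale and well above chance\. The gap between the high raw agreement and the more modest $\kappa$ reflects the skewed base rate: both judges pass the large majority of samples, so much of the raw agreement is expected by chance\. Disagreements are highly asymmetric: 84% of them are Llama\-pass\-Qwen\-reject \(the stricter judge catching what the lenient one missed\), versus only 16% the other way\. This is the expected signature of two independent judges on a clean dataset where most failures are subtle\.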
#### Failure\-flag distribution \(6 domains\)\.
Llama flags 1,401 samples and Qwen flags 2,807\. A single rejection may carry multiple flags\. Both judges share the C2 $\succ$ C1 $\succ$ C4 $\succ$ C5 ordering\. Qwen: C2 2,452 \(87\.4%\), C3 879 \(31\.3%\), C1 815 \(29\.0%\), C4 616 \(21\.9%\), C5 567 \(20\.2%\)\. Llama: C2 742 \(53\.0%\), C1 444 \(31\.7%\), C4 249 \(17\.8%\), C5 112 \(8\.0%\)\. Qwen is calibrated tighter, especially on C2 self\-contradictions\.
Table 11: Layer\-2 data\-quality\-control results on the 65,328 Layer\-1\-passing samples \(5 tasks $\times$ 6 domains, computed after the string\-db drop\)\. `all_pass` admits a sample only if both Llama\-3\.1\-70B\-AWQ and Qwen\-2\.5\-72B\-AWQ return `qualified=true`; both\-reject and inter\-judge disagreement outcomes are dropped from the release\.

| Outcome | Count | % |
|---|---|---|
| Both qualify \(`all_pass`, ship\) | 62,010 | 95\.2% |
| Both reject \(drop\) | 1,054 | 1\.6% |
| agreement subtotal | 63,064 | 96\.8% |
| Llama\-pass / Qwen\-reject | 1,753 | 2\.7% |
| Qwen\-pass / Llama\-reject | 347 | 0\.5% |
| disagreement subtotal | 2,100 | 3\.2% |
| Cohen’s $\kappa$ \(Llama\-70B vs\. Qwen\-72B\) | 0\.486 | |
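The reported $\kappa$ can be recomputed from the four outcome counts in Table 11\. The sketch below uses the standard two\-rater, two\-category Cohen's kappa formula; the function and argument names are illustrative.

```python
def cohen_kappa_2x2(both_pass, a_pass_b_reject, b_pass_a_reject, both_reject):
    """Cohen's kappa for two raters with binary pass/reject verdicts."""
    n = both_pass + a_pass_b_reject + b_pass_a_reject + both_reject
    p_observed = (both_pass + both_reject) / n
    a_pass = (both_pass + a_pass_b_reject) / n  # rater A pass rate
    b_pass = (both_pass + b_pass_a_reject) / n  # rater B pass rate
    p_expected = a_pass * b_pass + (1 - a_pass) * (1 - b_pass)
    return (p_observed - p_expected) / (1 - p_expected)

# Counts from Table 11, with Llama as rater A and Qwen as rater B.
kappa = cohen_kappa_2x2(62_010, 1_753, 347, 1_054)
print(round(kappa, 3))  # 0.486
```

Observed agreement is about 0\.968 while chance agreement is about 0\.937, because both judges pass most samples; the narrow gap between the two is what pulls $\kappa$ well below the raw agreement rate.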
### L\.3 Layer 3: Human $\kappa$ calibration
Layer 3 certifies that Layer\-2 verdicts are trustworthy\. It does not adjudicate individual samples\. Two annotators independently rendered C1 to C5 verdicts on a 300\-sample stratified subset \(10 per task\-domain cell, sampled to match the L2 fail\-flag distribution\)\. We require inter\-annotator Cohen’s $\kappa\geq 0.6$ on binary ship/drop and L2\-vs\-human\-consensus agreement $\geq 95\%$ on the qualified/reject label\. Both thresholds are met \($\kappa=0.606$, agreement $=95.2\%$\) with zero L2 silent\-pass errors\. The L2 `all_pass` cohort therefore ships as\-is\.
### L\.4 Layer 4: Construction\-time leakage prevention
The first three layers police whether each individual sample is *correct*\. Layer 4 polices whether the release as a *set* is free of leakage: no single ego\-centre, edge, cluster, outlier pair, or sub\-graph may appear in more than one of the public train/val/test splits, and no sample is duplicated within a split\. Because we deduplicate *before* the random split is drawn, the leakage\-free property is a construction\-time guarantee, not a post\-hoc cleanup\.
#### Per\-task canonical identity\.
Identities are computed from *global* graph node ids \(the ids in the upstream graph, not the per\-ego\-graph local ids\), so two samples that re\-numbered the same nodes still collide:
- T1 \(Masked Node\): the masked ego\-centre’s global node id\.
- T2 \(Relational Semantics\): the unordered pair of global node ids of the two endpoints, $\{\mathsf{global}(\mathsf{node\_a\_id}),\ \mathsf{global}(\mathsf{node\_b\_id})\}$\.
- T3 \(Theme Summarisation\): the frozenset of all global node ids in the ego\-graph \(the cluster signature\)\. Additionally, any pair of T3 samples whose neighbour\-set Jaccard similarity is $\geq 0.5$ is treated as a near\-duplicate; the later\-drawn sample of each such pair is dropped\.
- T4 \(Outlier Detection\): the \(outlier global id, frozenset of cluster global ids\) tuple, so any outlier\-vs\-same\-cluster repeat is caught\.
- T5 \(Community Detection\): the frozenset of all global node ids in the ego\-graph \(the sub\-graph signature\)\.
For each \(domain, task\) cell we keep one sample per identity using a deterministic tie\-break \(lexicographically smallest sample id\)\.
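The identity rules and the tie\-break can be sketched as below. The sample dictionaries and their field names \(`id`, `hub`, `node_a`, `node_b`, `nodes`, `outlier`, `cluster`\) are hypothetical, and the near\-duplicate pass follows a literal reading of the T3 rule: a sample is dropped if it reaches the Jaccard threshold against any earlier\-drawn sample.

```python
def canonical_identity(task, sample):
    """Hashable identity built from global node ids (field names are illustrative)."""
    if task == "T1":
        return ("T1", sample["hub"])
    if task == "T2":
        return ("T2", frozenset((sample["node_a"], sample["node_b"])))
    if task in ("T3", "T5"):
        return (task, frozenset(sample["nodes"]))
    if task == "T4":
        return ("T4", sample["outlier"], frozenset(sample["cluster"]))
    raise ValueError(f"unknown task: {task}")

def dedup(task, samples):
    """Keep the lexicographically smallest sample id per identity."""
    kept = {}
    for s in sorted(samples, key=lambda s: s["id"]):
        kept.setdefault(canonical_identity(task, s), s)
    return list(kept.values())

def jaccard(a, b):
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def drop_near_duplicates(samples, threshold=0.5):
    """Drop a sample if it overlaps any earlier-drawn sample at or above threshold."""
    kept = []
    for i, s in enumerate(samples):
        if any(jaccard(s["nodes"], e["nodes"]) >= threshold for e in samples[:i]):
            continue
        kept.append(s)
    return kept
```

The pairwise pass is quadratic in the number of samples, which is acceptable for a sketch; a production version would likely index node sets before comparing.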
#### Effect on the release\.
Layer 4 drops 1,132 samples by exact identity collision and an additional 1,360 T3 near\-duplicates \(Jaccard $\geq 0.5$\), 2,492 in total, $4.0\%$ of the L2\-passing set \(Table [12](https://arxiv.org/html/2606.11562#A12.T12)\)\. The final public release contains 59,681 samples on which the train/val/test splits are then drawn uniformly at random, with split membership disjoint by construction\. Layer 4 hits hardest on physics\-se \(T3 alone loses 780 samples\) and ogbn\-arxiv T3 \(449 samples\), reflecting that small sub\-graph diameters in those domains make near\-isomorphic ego\-graphs more likely under random sampling\.
Table 12: Layer\-4 deduplication results, per task\. “Exact dup” counts canonical\-identity collisions; “T3 near\-dup” counts T3 sample pairs whose neighbour\-set Jaccard is $\geq 0.5$ \(only the later\-drawn member of each pair is dropped\)\. The kept column is the post\-L4 contribution per task; their sum is the public release set\.
#### Cap and per\-cell label balance\.
After dedup, every \(domain, task\) cell is capped to 1,400 samples and split 1,000 train / 100 val / 300 test\. To prevent the cap from amplifying any pre\-existing label skew, we sample *stratified on the gold balance label* \(T1: subject label of the masked node; T2: `has_edge` flag; T3: cluster dominant label; T4: cluster subject label; T5: uniform random, since it has no single gold label per sample\), and force exact split sizes by moving overflow into the next under\-quota split\. The resulting public release contains 42,000 samples \(30,000 train / 3,000 val / 9,000 test\)\. Figure [6](https://arxiv.org/html/2606.11562#A12.F6) shows the per\-cell label distribution as a $6\times 5$ grid of donut charts: each donut breaks the 1,400 samples of one \(domain, task\) cell into the top\-8 gold classes plus a grey *other* slice for the long tail\. Below each donut we annotate the number of distinct labels and the top\-1 share\. T2 lands at $\sim$50/50 \(has\-edge / no\-edge\) by construction; pubmed\-diabetes lands at $\sim$33% top\-1 \(3 disease classes\); wikics at $\sim$10% \(10 categories\); the larger fine\-grained taxonomies of ogbn\-arxiv, ogbn\-products, patents and physics\-se sit at 2–6% top\-1 with a long tail of small slices\. T5 has no per\-sample gold label, so its column is shown as a single uniform slice\.
Figure 6: Per\-cell gold\-label distribution after the L4 cap \(1,400 samples / cell\)\. Each donut is one \(domain, task\) cell; wedges are the top\-8 gold classes drawn in fixed order, and the grey *other* ring collects the long tail\. Captions below each donut report \(\#classes, top\-1 share\)\. The visual signature confirms that the stratified sampler kept every cell close to its uniform floor $1/K$ rather than concentrating mass in one class\.
## Appendix M Per\-Domain GNN Reference
Table [13](https://arxiv.org/html/2606.11562#A13.T13) reports the per\-cell mean $\pm$ std of three random seeds \(42, 43, 44\) for the GNN structural reference \(GCN, GAT, GraphSAGE\) trained on `data/splits_v4` \(1,000 train / 100 val / 300 test per cell\)\. The aggregate domain\-averaged numbers are the GNN block of Table [6](https://arxiv.org/html/2606.11562#S4.T6)\. T1, T2, and T3 report hard\-label accuracy, T4 reports Hit@1 of the outlier\-vs\-anchor margin, and T5 reports the Normalized Mutual Information of the predicted clustering against the gold community labels\.
Table 13: Per\-domain results of the GNN structural reference \(3 seeds\)\. All metrics are higher\-is\-better\. Each cell is mean $\pm$ std on the 300\-sample test split\.
## Appendix N Failure Case Analysis
To complement the aggregate metrics in Table [6](https://arxiv.org/html/2606.11562#S4.T6) and the per\-domain breakdown in Tables [14](https://arxiv.org/html/2606.11562#A15.T14)–[15](https://arxiv.org/html/2606.11562#A15.T15), we manually inspected $\sim$1,200 predictions per \(baseline, task\) cell and characterise the recurring failure modes below\. None of the patterns are caused by the rescorer\. Per\-baseline parser hit\-rate is 89 to 100% across all five tasks, so the remaining gap between baseline and oracle reflects *model* weaknesses, not parsing\.
### N\.1 T5 \(Community Detection\): Mode Collapse
All seven baselines collapse to a near\-degenerate partition: between 85% and 99% of nodes are assigned to a single community\. The remaining 1\-15% are typically scattered as singleton “Community 1” labels\. The extreme case is GOFA on ogbn\-arxiv, where 73/73 nodes are placed in Community 0\. NMI is consequently dominated by the chance\-level baseline \(NMI $\approx 0.02$ to $0.07$ across baselines\), even though the parser correctly recovers the predicted partition in $>99\%$ of samples\.
### N\.2 T4 \(Outlier ID\): Plausible\-but\-Wrong Convergence
On the ogbn\-arxiv sample shown in our spot\-check, six of seven baselines \(LLaGA, GOFA, GraphGPT, GraphToken, RGLM, TEA\-GLM\) all predict “Node 28: outlier” while the gold answer is Node 15\. Each baseline produces a distinct, well\-formed reasoning paragraph justifying its \(wrong\) choice, an indication that they pick a *plausible\-but\-wrong* outlier rather than failing to engage with the task\. InstructGLM is the exception, mode\-collapsing to “Node 1” in 664/1,050 \(63%\) of T4 predictions regardless of input\.
### N\.3 LLaGA T4 – Unfilled Template \(11%\)
A specific failure mode is observed only on LLaGA T4: in 11% of predictions the model emits the *template placeholder verbatim*, e\.g\. “Node 0: <outlier \| not outlier\> Node 1: <outlier \| not outlier\> …”\. The angle\-bracket pattern is a literal copy from the prompt, with no actual classification\. These predictions are treated as non\-answers by our rescorer \(counted in $n$ but excluded from the H@1 numerator\)\.
### N\.4 T1/T3: Semantically\-Related Wrong Labels
For T1 \(subject classification\) and T3 \(cluster theme\), the dominant failure mode is selecting a *semantically related but wrong* label, e\.g\. predicting “Computer Vision” for a paper labeled “Graphics”, or “Machine Learning” for “Neural and Evolutionary Computing”\. The labels are drawn from the same parent ontology, so SBERT\-F1 remains high \($>0.9$ for e5\-large in many cells\) even when hard accuracy is low\. We deliberately do *not* apply label\-hierarchy fuzzy matching: doing so would inflate scores by roughly 5 to 10 pp without a principled stopping criterion\. The SBERT\-F1 columns already capture this graceful\-degradation signal\.
### N\.5 InstructGLM T5 – Node Set Hallucination
For T5 community detection, InstructGLM consistently emits assignments for ∼500 nodes when the input ego\-graph contains 50\-100\. The extra nodes are syntactically valid \(“Node 511: Community 0”\) but correspond to no actual node in the graph\. Our rescorer takes the intersection of predicted and reference node sets before computing NMI, so the hallucinated nodes are ignored\. The underlying behavior suggests InstructGLM was trained on graphs of a different size distribution than our 50\-100 node ego\-graph test set\.
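The intersection step can be illustrated with a short sketch. The line format is taken from the quoted example (“Node 511: Community 0”); the helper names `parse_partition` and `align`, and the choice to keep the first assignment when a node is repeated, are assumptions rather than details of the actual rescorer.

```python
import re

# Assumed line format, based on the quoted example "Node 511: Community 0".
ASSIGNMENT = re.compile(r"Node\s+(\d+)\s*:\s*Community\s+(\d+)")


def parse_partition(text):
    """Map node id -> community id, keeping the first assignment per node."""
    partition = {}
    for node, community in ASSIGNMENT.findall(text):
        partition.setdefault(int(node), int(community))
    return partition


def align(pred, gold):
    """Restrict both partitions to the shared node set, in a fixed order."""
    shared = sorted(pred.keys() & gold.keys())
    return [pred[n] for n in shared], [gold[n] for n in shared]


prediction = "Node 0: Community 0\nNode 1: Community 1\nNode 511: Community 0"
reference = {0: 0, 1: 1, 2: 1}  # hypothetical gold partition

pred_labels, gold_labels = align(parse_partition(prediction), reference)
print(pred_labels, gold_labels)  # Node 511 (hallucinated) and Node 2 (unpredicted) drop out
```

Note that the intersection also drops reference nodes the model never assigned, so under this sketch a missing prediction is ignored rather than penalised.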
### N\.6 Inference\-Cost Issue: Infinite Generation
We observe that LLaGA, GOFA, GraphGPT, GraphToken, RGLM, and TEA\-GLM do not terminate cleanly after the Answer/Reasoning block\. In 80\-99% of predictions, the model emits an end\-of\-sequence token \(e\.g\. </s\>\), then resumes generation with a fresh “ASSISTANT: I agree with the classification…” prefix and writes several more thousand characters\. Median prediction length is 1\-15 KB across baselines \(with InstructGLM as the only exception at <1 KB median\)\. This does not affect rescoring – our parser extracts only the first valid Answer block – but does waste inference compute\. A conservative re\-evaluation with proper stop\_token\_ids would cut inference cost roughly 5\-10× without changing reported numbers\.
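As a rough illustration of why the runaway text leaves scores unchanged, the sketch below truncates a raw prediction at the first end-of-sequence marker after the fact. The `</s>` string comes from the text; the helper name `truncate_at_eos` and the sample prediction are hypothetical, and this post-hoc trimming only mimics what a stop token would enforce at generation time.

```python
def truncate_at_eos(text, eos="</s>"):
    """Keep everything before the first end-of-sequence marker."""
    head, _, _ = text.partition(eos)
    return head.rstrip()


# Hypothetical raw prediction showing the resume-after-EOS pattern.
raw = (
    "Answer: Node 15\n"
    "Reasoning: sparse citations.</s> "
    "ASSISTANT: I agree with the classification" + " and so on" * 200
)

trimmed = truncate_at_eos(raw)
print(trimmed)
print(f"characters kept: {len(trimmed)} of {len(raw)}")
```

Because the first Answer block sits entirely before the marker, a parser that reads only that block sees the same content with or without truncation; the saving is purely in generated tokens.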
### N\.7 Patents – Frozen\-LLM Output\-Format Lock\-in vs\. True 19\-way Difficulty
The patents domain isolates a sharp distinction between fine\-tuned and frozen\-LLM baselines that is invisible in any other domain\. The label space is 19 IPC subclasses under the H04L group, all formatted as “H04LN: <description\>” \(e\.g\., “H04L1: Error detection/correction in transmission”\)\. Training answers expose this exact format \(∼54 examples per subclass\)\.
Graph\-token baselines\. LLaGA, GraphToken, GraphGPT, RGLM, GOFA, and TEA\-GLM hover near 19\-way chance \(T1 hard accuracy ≈ 4–6%, vs\. chance = 5\.3%\)\. Inspecting their predictions shows they emit free\-form natural\-language descriptors \(“Cryptographic protocols”, “Data transmission systems”\) rather than the IPC code format\. Because these methods freeze the LLM and only fine\-tune a small projector, the LLM never adapts its output distribution to the domain\-specific label vocabulary; the generated text reflects pre\-training priors instead of the trained label space\.
SFT baselines\. SFT with Llama\-3\-8B, Qwen\-2\-7B, and Vicuna\-7B all reproduce the IPC format on \>99\.5% of patents predictions \(e\.g\., 288/288 for Llama\-3, 282/283 for Qwen\-2, 242/242 for Vicuna\)\. Their hard accuracy of 35–45% reflects the genuine difficulty of distinguishing 19 closely\-related H04L subclasses \(network security vs\. traffic control vs\. addressing vs\. multimedia\) from a small set of cited\-patent titles\.
Implication\. GraphInfer\-Bench’s patents cell separates *output\-format adaptation failure* \(frozen LLM can’t emit domain\-specific label codes\) from *classification difficulty* \(19 fine\-grained subclasses with overlapping semantics\)\. Both contribute to the gap between graph\-token baselines and SFT here; on natural\-language domains \(ogbn\-arxiv, pubmed\-diabetes\) only the latter applies\.
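The two quantities this subsection contrasts, format compliance and the 19-way chance level, can be checked with a small sketch. The regular expression assumes that the "N" in “H04LN: <description>” is an integer, which the single quoted example suggests but does not confirm; `format_compliance` and the sample predictions are hypothetical.

```python
import re

# Assumed label shape: "H04L<integer>: <non-empty description>".
IPC_LABEL = re.compile(r"^H04L\d+: \S.*$")


def format_compliance(predictions):
    """Fraction of predictions that follow the IPC label format."""
    matches = sum(bool(IPC_LABEL.match(p.strip())) for p in predictions)
    return matches / len(predictions)


predictions = [
    "H04L1: Error detection/correction in transmission",  # format reproduced
    "Cryptographic protocols",                            # free-form descriptor
    "Data transmission systems",                          # free-form descriptor
]
print(format_compliance(predictions))

# Uniform-guess accuracy over 19 subclasses, as a percentage.
chance_pct = round(100 / 19, 1)
print(chance_pct)
```

Measuring compliance separately from accuracy is what lets the two failure sources be told apart: a free-form descriptor fails the format check regardless of whether its meaning is close to the gold subclass.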
## Appendix O Per\-Domain Results: Graph\-Token Baselines
Per\-\(domain, task\) hard accuracy and 3\-SBERT triples \(MiniLM\-L6 / mpnet\-base / e5\-large\) for the seven graph\-token alignment baselines\. The aggregate row of Table [6](https://arxiv.org/html/2606.11562#S4.T6) is the domain\-averaged form of these cells\.
Table 14: Per\-domain results for every Graph\-Token baseline \(T1–T3\)\. SBERT score reported across MiniLM\-L6\-v2 \(M\), mpnet\-base\-v2 \(p\), e5\-large\-v2 \(E5\) jointly\. Each cell is mean ± std across 3 seeds\.
Table 15: Per\-domain results for every Graph\-Token baseline \(T4–T5\)\. SBERT score reported across MiniLM\-L6\-v2 \(M\), mpnet\-base\-v2 \(p\), e5\-large\-v2 \(E5\) jointly\. Each cell is mean ± std across 3 seeds\.
## Appendix P Per\-Domain Results: Zero\-shot LLMs
Per\-\(domain, task\) hard accuracy and 3\-SBERT triples for closed\-source zero\-shot LLMs \(Claude Opus 4\.7, GPT\-5\)\. Sampling: n=20 test samples per \(domain, task\) cell, fixed seed=0, 6 domains × 5 tasks\. Prompt is format\-only \(canonical sentence frame, no closed\-set label list shown\)\. For T1/T3, hard accuracy is computed after projecting each free\-form Answer onto the nearest canonical label using SBERT\-MiniLM cosine over the domain\-task label set\.
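The nearest-canonical-label projection can be sketched as an argmax over cosine similarities. To keep the example self-contained, a bag-of-words `toy_embed` function stands in for the SBERT-MiniLM encoder; this substitution, the helper names, and the three-label set are assumptions for illustration, and a real run would pass sentence embeddings instead.

```python
import math
from collections import Counter


def toy_embed(text):
    """Bag-of-words stand-in for a sentence encoder (NOT SBERT-MiniLM)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def project(answer, labels, embed=toy_embed):
    """Return the canonical label whose embedding is closest to the answer."""
    query = embed(answer)
    return max(labels, key=lambda label: cosine(query, embed(label)))


# Hypothetical closed label set for one (domain, task) cell.
labels = ["Computer Vision", "Graphics", "Machine Learning"]
answer = "machine learning methods for graphs"

projected = project(answer, labels)
print(projected)
print(projected == "Machine Learning")  # hard-accuracy check against a gold label
```

Hard accuracy is then an exact match between the projected label and the gold label, so a free-form answer is scored on the same closed label set as the fine-tuned baselines.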
Table 16: Per\-domain results for zero\-shot closed\-source LLMs \(T1–T3\), n=20 samples per \(domain, task\) cell, format\-only prompt with SBERT\-MiniLM nearest\-canonical\-label projection for T1/T3 \(Section [4\.5](https://arxiv.org/html/2606.11562#S4.SS5)\)\. SBERT score reported across MiniLM\-L6\-v2 \(M\), mpnet\-base\-v2 \(p\), e5\-large\-v2 \(E5\) jointly\.
Table 17: Per\-domain results for zero\-shot closed\-source LLMs \(T4–T5\), n=20 samples per \(domain, task\) cell\. SBERT score reported across MiniLM\-L6\-v2 \(M\), mpnet\-base\-v2 \(p\), e5\-large\-v2 \(E5\) jointly\.
## Appendix Q Per\-Domain Results: SFT \(Graph2Text\) Baselines
Per\-\(domain, task\) hard accuracy and 3\-SBERT triples for the three Graph2Text SFT baselines \(Vicuna\-7B, Qwen2\-7B, Llama\-3\-8B\), fine\-tuned end\-to\-end with QLoRA on the per\-domain training split\.
Table 18: Per\-domain results for SFT baselines \(T1–T3\)\. SBERT score across MiniLM\-L6\-v2 \(M\), mpnet\-base\-v2 \(p\), e5\-large\-v2 \(E5\) jointly\. Each cell is mean ± std across 3 seeds\.
Table 19: Per\-domain results for SFT baselines \(T4–T5\)\. SBERT score across MiniLM\-L6\-v2 \(M\), mpnet\-base\-v2 \(p\), e5\-large\-v2 \(E5\) jointly\. Each cell is mean ± std across 3 seeds\.