Graphs of Research: Citation Evolution Graphs as Supervision for Research Idea Generation

arXiv cs.CL Papers

Summary

Proposes Graphs of Research (GoR), a supervised fine-tuning method that uses citation evolution graphs as supervision for LLM-based research idea generation, achieving state-of-the-art results against gpt-4o-driven baselines.

arXiv:2605.14790v1 Announce Type: new Abstract: Research idea generation is the innovation-driving step of automated scientific research. Recently, large language models (LLMs) have shown potential for automating idea generation at scale. However, existing methods mainly elicit ideas from LLMs through static retrieval of relevant literature or complex prompt engineering, while discarding the structural relations among references. We propose Graphs of Research (GoR), a supervised fine-tuning method that extracts a 2-hop reference neighborhood for each seed paper, derives the relations among those references from citation position, frequency, predecessor links, and publication time, and organizes them into a paper-evolution directed acyclic graph (DAG). We construct an automated extraction pipeline that draws data from five major ML/NLP venues, comprising 498/50/50 train/validation/test seed papers and approximately 7,600 cited references. Qwen2.5-7B-Instruct-1M is fine-tuned on a structured-text prompt that includes the citation graph, edge signals, reference information, and task definition to predict the idea for the seed paper. Across head-to-head LLM-judge tournaments against gpt-4o-driven baselines, GoR-SFT achieves SOTA, demonstrating the effectiveness of citation-evolution graphs as a supervision signal for LLM-based idea generation. We hope that this lowers the barrier to using citation evolution graphs as supervision, accelerating automated scientific innovation.

Cached at: 05/15/26, 06:24 AM

# Citation Evolution Graphs as Supervision for Research Idea Generation
Source: [https://arxiv.org/html/2605.14790](https://arxiv.org/html/2605.14790)
Songyang Gao, The Hong Kong University of Science and Technology \(Guangzhou\), sgao068@connect\.hkust\-gz\.edu\.cn; Yinghui Xia, The Hong Kong University of Science and Technology \(Guangzhou\), yxia501@connect\.hkust\-gz\.edu\.cn; Siyi Liu, Tsinghua University, liusiyi25@mails\.tsinghua\.edu\.cn; Hui Xiong, The Hong Kong University of Science and Technology, xionghui@ust\.hk

###### Abstract

Research idea generation is the innovation\-driving step of automated scientific research\. Recently, large language models \(LLMs\) have shown potential for automating idea generation at scale\. However, existing methods mainly elicit ideas from LLMs through static retrieval of relevant literature or complex prompt engineering, while discarding the structural relations among references\. We propose *Graphs of Research* \(GoR\), a supervised fine\-tuning method that extracts a 2\-hop reference neighborhood for each seed paper, derives the relations among those references from citation position, frequency, predecessor links, and publication time, and organizes them into a paper\-evolution directed acyclic graph \(DAG\)\. We construct an automated extraction pipeline that draws data from five major ML/NLP venues, comprising 498/50/50 train/validation/test seed papers and approximately 7,600 cited references\. Qwen2\.5\-7B\-Instruct\-1M is fine\-tuned on a structured\-text prompt that includes the citation graph, edge signals, reference information, and task definition to predict the idea for the seed paper\. Across head\-to\-head LLM\-judge tournaments against gpt\-4o\-driven baselines, GoR\-SFT achieves SOTA, demonstrating the effectiveness of citation\-evolution graphs as a supervision signal for LLM\-based idea generation\. We hope that this lowers the barrier to using citation evolution graphs as supervision, accelerating automated scientific innovation\.

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2605.14790v1/motivation.png)Figure 1: Comparison of the existing paradigm of idea generation, our GoR, and the human ideation process inspiring our design\.

Automated scientific research increasingly relies on large language models \(LLMs\) to compose literature review, ideation, experimental validation, and paper writing into closed loops \[[20](https://arxiv.org/html/2605.14790#bib.bib7),[33](https://arxiv.org/html/2605.14790#bib.bib8),[28](https://arxiv.org/html/2605.14790#bib.bib6),[22](https://arxiv.org/html/2605.14790#bib.bib10)\]\. Within this loop, ideation is the innovation\-driving step\. It sets the originality and feasibility of every downstream artifact, and remains the one stage where automation cannot fall back on retrieval or rote execution\. With LLMs making research ideation tractable at scale \[[23](https://arxiv.org/html/2605.14790#bib.bib1),[16](https://arxiv.org/html/2605.14790#bib.bib3)\], a concrete question follows: *what input lets an LLM generate high\-quality, innovative research ideas?*

Existing methods inject information into LLMs to inspire idea generation through static retrieval, agent orchestration, or trained generators\. *Retrieval\-then\-generate* pipelines \[[30](https://arxiv.org/html/2605.14790#bib.bib4),[13](https://arxiv.org/html/2605.14790#bib.bib5),[16](https://arxiv.org/html/2605.14790#bib.bib3)\] inject neighboring papers as inspiration sources without any training, but rely on retrieval scoring that captures only topical similarity\. *Multi\-agent autonomy* frameworks \[[28](https://arxiv.org/html/2605.14790#bib.bib6),[20](https://arxiv.org/html/2605.14790#bib.bib7),[33](https://arxiv.org/html/2605.14790#bib.bib8),[25](https://arxiv.org/html/2605.14790#bib.bib9),[22](https://arxiv.org/html/2605.14790#bib.bib10),[6](https://arxiv.org/html/2605.14790#bib.bib11),[1](https://arxiv.org/html/2605.14790#bib.bib2)\] cover the full research lifecycle but reduce ideation itself to repeated prompting and filtering\. *Trained policies* \[[32](https://arxiv.org/html/2605.14790#bib.bib12)\] internalize reviewer preferences through cycle\-trained generation, yet supervise at the manuscript level rather than the idea level\. As shown in Figure [1](https://arxiv.org/html/2605.14790#S1.F1) \(left\), despite their architectural diversity, these systems consume references as a flat text bag, projecting away the structural relations that connect those references in the source literature\. In contrast, human researchers read references through structural cues such as section placement, year arithmetic, predecessor relations, and parallel\-work patterns, and synthesize these cues into the next idea\.

Inspired by this human ideation process, we incorporate these inter\-paper structural signals into the supervision pipeline\. We propose Graphs of Research \(GoR\), illustrated in Figure [1](https://arxiv.org/html/2605.14790#S1.F1) \(middle\), a structured\-text prompt format that serializes each paper’s citation subgraph along with edge features and predecessor relations\. Concretely, we extract a 2\-hop reference neighborhood for each paper, annotate each edge with eight features spanning position, provenance, influence, temporal, and structural signals, mark parallel or explicit predecessor relations among nodes, and serialize the annotated graph into a structured\-text prompt\. We then fine\-tune Qwen2\.5\-7B\-Instruct\-1M on this prompt with completion\-only cross\-entropy on the paper’s five\-field idea, calling the resulting model GoR\-SFT\. To isolate the effect of structural signals, we train a paired plain\-reference baseline \(Refs\-SFT\) on the same 498 training papers from NeurIPS, ICLR, CVPR, ICML, and ACL between 2020 and 2024, with the graph annotations stripped under matched hyperparameters and the structural blocks as the single experimental delta\.

We evaluate GoR\-SFT on a leak\-free test set drawn from 2025 papers at the same five venues using multi\-dimensional metrics, including a five\-dimension LLM\-judge tournament covering novelty, significance, feasibility, clarity, and effectiveness, surface metrics against the gold five\-field idea, and a 10\-metric human evaluation\. To evaluate effectiveness against existing methods, we compare GoR\-SFT with three published gpt\-4o\-driven idea generation baselines, ranking first on 31, 40, and 48 of the 50 seeds respectively \(Section [4\.3](https://arxiv.org/html/2605.14790#S4.SS3)\)\. Through controlled ablation against zero\-shot Qwen2\.5\-7B\-Instruct\-1M and the matched\-capacity Refs\-SFT baseline, we isolate SFT as the dominant driver and graph supervision as a focused additional signal that lifts Significance and Clarity \(Section [4\.4](https://arxiv.org/html/2605.14790#S4.SS4)\)\. Given the same graph\-format prompt, GoR\-SFT wins the head\-to\-head against the much larger gpt\-4o consuming the prompt zero\-shot, ranking first on 32 of 50 seeds, isolating supervision rather than scale as the active ingredient \(Section [4\.5](https://arxiv.org/html/2605.14790#S4.SS5)\)\. We further corroborate these automated rankings with a 10\-metric blinded human evaluation by 5 NLP and ML PhD raters, where GoR\-SFT wins 5 of 10 metrics including *Overall*\.

Our main contributions are summarized as follows:

- Graphs of Research framework\. We characterize citation graph structure as an underused SFT supervision signal in current LLM\-based ideation systems, and propose GoR, a method that returns these signals to the supervision pipeline by serializing the citation subgraph as structured text input for LLM fine\-tuning \(Section [3](https://arxiv.org/html/2605.14790#S3)\)\.
- Automated citation\-graph extraction pipeline\. We construct an automated pipeline that builds citation subgraphs, edge\-feature annotations, and structured five\-field idea targets for 498 training papers from NeurIPS, ICLR, CVPR, ICML, and ACL between 2020 and 2024, with a 50\-paper in\-domain validation set and a leak\-free 50\-seed test set drawn from 2025 papers \(Section [4\.2](https://arxiv.org/html/2605.14790#S4.SS2)\)\.
- Experimental validity\. Extensive experimental results demonstrate that GoR\-SFT improves the quality of generated ideas, confirming that injecting citation\-evolution graph structure into the supervision pipeline is a simple yet effective recipe for steering LLM\-based idea generation toward higher\-quality, more innovative outcomes \(Section [4\.3](https://arxiv.org/html/2605.14790#S4.SS3)\)\.

![Refer to caption](https://arxiv.org/html/2605.14790v1/framework.png)Figure 2: Our GoR framework\. *Top*: For each seed paper, we extract a citation subgraph, annotate it with eight edge features and parallel or explicit predecessor relations, and serialize the annotated graph into a structured\-text prompt \(§[3\.1](https://arxiv.org/html/2605.14790#S3.SS1)\)\. *Bottom*: We fine\-tune Qwen2\.5\-7B\-Instruct\-1M on the prompt with completion\-only cross\-entropy on the seed’s five\-field idea \(§[3\.2](https://arxiv.org/html/2605.14790#S3.SS2)\), and at inference, GoR\-SFT consumes a new citation graph and emits a new idea \(§[3\.3](https://arxiv.org/html/2605.14790#S3.SS3)\)\.
## 2 Related work

#### LLM\-based scientific research ideation\.

Scientific idea generation is a core step in automated research, setting the novelty and feasibility ceiling for downstream experimentation and writing\. Recent LLM\-based systems ground ideation in prior literature in two main ways\. Retrieval\-then\-generate methods use semantic neighbors or citation context as inspiration\[[30](https://arxiv.org/html/2605.14790#bib.bib4),[23](https://arxiv.org/html/2605.14790#bib.bib1)\]\. Agent systems extend ideation into broader research workflows such as experiment design, code generation, review, and manuscript writing\[[20](https://arxiv.org/html/2605.14790#bib.bib7),[33](https://arxiv.org/html/2605.14790#bib.bib8),[28](https://arxiv.org/html/2605.14790#bib.bib6),[25](https://arxiv.org/html/2605.14790#bib.bib9),[22](https://arxiv.org/html/2605.14790#bib.bib10),[6](https://arxiv.org/html/2605.14790#bib.bib11),[1](https://arxiv.org/html/2605.14790#bib.bib2)\]\. More targeted systems improve how prior work is exposed to the generator\. ResearchAgent uses academic\-graph and entity\-level context\[[1](https://arxiv.org/html/2605.14790#bib.bib2)\], CoI organizes papers into development chains\[[16](https://arxiv.org/html/2605.14790#bib.bib3)\], Nova expands knowledge acquisition through iterative planning and search\[[13](https://arxiv.org/html/2605.14790#bib.bib5)\], and FlowPIE couples literature exploration with test\-time idea evolution\[[31](https://arxiv.org/html/2605.14790#bib.bib13)\]\. These methods mainly optimize inference\-time retrieval, agent interaction, search, or review\. GoR instead asks whether citation\-evolution structure in prior literature can be converted into training\-time supervision for the generator itself\.

#### Graph\-augmented LLMs and citation networks\.

Graph structure has long been used to model scientific knowledge, from Literature\-Based Discovery to citation\-graph mining, document representation, and impact forecasting\[[26](https://arxiv.org/html/2605.14790#bib.bib22),[27](https://arxiv.org/html/2605.14790#bib.bib20),[24](https://arxiv.org/html/2605.14790#bib.bib19),[7](https://arxiv.org/html/2605.14790#bib.bib21)\]\. Recent graph\-augmented LLMs use graphs as retrieval substrates, external memory, or architectural inputs\. GraphRAG, LightRAG, and HippoRAG retrieve over graph\-structured knowledge\[[5](https://arxiv.org/html/2605.14790#bib.bib14),[9](https://arxiv.org/html/2605.14790#bib.bib15),[10](https://arxiv.org/html/2605.14790#bib.bib16)\], while GraphGPT and LLaGA inject graph information through encoders or projector layers\[[29](https://arxiv.org/html/2605.14790#bib.bib17),[3](https://arxiv.org/html/2605.14790#bib.bib18)\]\. Scientific ideation systems also use structured literature relations, including CoI chains, ResearchAgent’s academic graph, and FlowPIE’s test\-time literature graph\[[16](https://arxiv.org/html/2605.14790#bib.bib3),[1](https://arxiv.org/html/2605.14790#bib.bib2),[31](https://arxiv.org/html/2605.14790#bib.bib13)\]\. These works show that flat paper lists discard useful relational signals, but mostly use graphs for retrieval, search, memory, or architecture design\. GoR instead serializes a citation\-evolution graph as structured text, trains a vanilla LLM on this input, and isolates the contribution of graph structure by stripping only structural labels while preserving the same referenced papers\.

## 3 Method

In this paper, we model the human research\-ideation process by constructing a temporal\-evolution DAG over each paper’s references, providing the LLM with structural cues for innovative idea generation\. As illustrated in Figure[2](https://arxiv.org/html/2605.14790#S1.F2), GoR operates in three stages\. \(i\) We extract a citation subgraph for each seed paper and annotate it with eight edge features and parallel or explicit predecessor relations \(§[3\.1](https://arxiv.org/html/2605.14790#S3.SS1)\)\. \(ii\) We fine\-tune Qwen2\.5\-7B\-Instruct\-1M on the serialized subgraph with completion\-only cross\-entropy on the seed’s five\-field idea, calling the resulting model GoR\-SFT \(§[3\.2](https://arxiv.org/html/2605.14790#S3.SS2)\)\. \(iii\) At inference, GoR\-SFT consumes a new citation graph and emits a structured five\-field idea \(§[3\.3](https://arxiv.org/html/2605.14790#S3.SS3)\)\.

![Refer to caption](https://arxiv.org/html/2605.14790v1/subgraph.png)Figure 3: The pipeline turning a seed paper into an annotated citation DAG\. \(1\) *Data sources*: parse PDFs, extract five\-field ideas, fetch metadata, mine predecessor edges\. \(2\) *Citation subgraphs*: rank candidates by seed\-side evidence, then re\-rank with a sibling boost that rescues foundational refs, keeping the top $K \in [12, 30]$\. \(3\) *DAG construction*: connect surviving refs along the temporal cone backward from the seed, yielding explicit, parallel, and direct\-to\-seed edges\. \(4\) *Datasets output*: a temporally ordered subgraph anchored at the seed, ready for serialization as the SFT input\.

### 3\.1 Automated graph\-aware data construction

To construct the GoR training corpus, we build a citation DAG for each accepted conference paper, with the seed paper as the sink and its references as the other nodes\. An in\-subgraph scoring policy decides which references are retained, and we annotate every node and edge with structural features\. Figure [3](https://arxiv.org/html/2605.14790#S3.F3) illustrates the four stages of the pipeline\.

#### Data sources\.

For each seed paper we retrieve the PDF \(OpenReview when available, arXiv otherwise\), parse sections and references with GROBID \[[19](https://arxiv.org/html/2605.14790#bib.bib30)\], and fetch reference\-side metadata \(abstract, year, venue, citation count, isInfluential flag, in\-text contexts\) from the Semantic Scholar Graph API \[[14](https://arxiv.org/html/2605.14790#bib.bib31)\]\. Both the seed and each retrievable reference are extracted by an LLM into a shared five\-field idea schema that captures each paper’s core content\. The five fields are Problem, Existing Methods, Motivation, Proposed Method, and Experiment Plan, with the extractor prompt and a quality audit reported in Appendix [A](https://arxiv.org/html/2605.14790#A1)\. For each reference, we additionally intersect its own reference list with the seed’s reference set to recover in\-subgraph predecessor links used downstream by the scoring policy and the DAG\.
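The predecessor-mining step at the end of this paragraph is a set intersection. The following is a minimal sketch of it, assuming reference lists are already available as sets of paper ids; the function name `in_subgraph_predecessors` and the toy ids are hypothetical and do not come from the paper's pipeline.

```python
from typing import Dict, Set


def in_subgraph_predecessors(
    seed_refs: Set[str],
    ref_lists: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Map each seed reference r to the papers it cites that the seed also cites."""
    links: Dict[str, Set[str]] = {}
    for r in seed_refs:
        cited_by_r = ref_lists.get(r, set())  # empty if r's list was not retrievable
        # Intersect r's reference list with the seed's reference set;
        # drop r itself to guard against self-loops in noisy metadata.
        links[r] = (cited_by_r & seed_refs) - {r}
    return links


# Toy example (hypothetical ids): "elmo" is not cited by the seed, so it is dropped.
seed_refs = {"gpt1", "gpt2", "bert"}
ref_lists = {"gpt2": {"gpt1", "elmo"}, "bert": {"gpt1", "elmo"}, "gpt1": {"elmo"}}
links = in_subgraph_predecessors(seed_refs, ref_lists)
```

In this toy case `gpt1` ends up as a predecessor of both `gpt2` and `bert`, which is the kind of shared ancestor the later sibling-boost term is described as rewarding.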

#### Citation subgraph\.

Let $\mathcal{V}$ denote the universe of papers and $\mathcal{R}(v) \subseteq \mathcal{V}$ the explicit Semantic Scholar references of seed $v$\. We form a 2\-hop expansion

$$\mathcal{N}(v) \;=\; \mathcal{R}(v) \,\cup\, \bigl\{\, r' \bigm| r' \in \mathcal{R}(r),\; r \in \mathcal{R}(v) \,\bigr\}, \tag{1}$$

and enforce a strict temporal cone, retaining

$$G_v \;=\; \bigl\{\, u \in \mathcal{N}(v) \bigm| t(u) < t(v) \,\bigr\}, \tag{2}$$

where $t(\cdot)$ denotes publication time\. This construction is identical to a standard time\-split LBD setup modulo the per\-paper truncation introduced next\.
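Equations (1) and (2) can be sketched directly as set operations. This is an illustrative sketch only: it assumes publication time is available at year granularity, and that candidates with no known year are dropped; neither detail is stated in the text, and the function names are hypothetical.

```python
from typing import Dict, Set


def two_hop_neighborhood(seed: str, refs: Dict[str, Set[str]]) -> Set[str]:
    """N(v): the seed's references plus the references of those references (Eq. 1)."""
    first_hop = refs.get(seed, set())
    second_hop: Set[str] = set()
    for r in first_hop:
        second_hop |= refs.get(r, set())
    return first_hop | second_hop


def temporal_cone(seed: str, candidates: Set[str], year: Dict[str, int]) -> Set[str]:
    """G_v: keep only candidates published strictly before the seed (Eq. 2)."""
    # Assumption of this sketch: candidates without a known year are discarded.
    return {u for u in candidates if u in year and year[u] < year[seed]}


refs = {"seed": {"a", "b"}, "a": {"c"}, "b": {"d", "e"}}
year = {"seed": 2023, "a": 2021, "b": 2022, "c": 2019, "d": 2023}
neighborhood = two_hop_neighborhood("seed", refs)   # a, b, c, d, e
subgraph = temporal_cone("seed", neighborhood, year)  # a, b, c
```

Here `d` is excluded because it shares the seed's year (the inequality is strict), and `e` is excluded because its year is unknown.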

#### DAG construction\.

We prune $G_v$ to a budget $K = \min\bigl(30,\, \max(12,\, \lceil 0.15\,|\mathcal{R}(v)| \rceil)\bigr)$, chosen so the median serialized prompt fits inside the 16k\-token training context of Qwen2\.5\-7B\-Instruct\-1M\. Pruning runs in two passes\. Pass 1 ranks each candidate $r$ by

$$\mathrm{Score}_1(r) \;=\; 0.45\,S_{\mathrm{cite}}(r) + 0.40\,S_{\mathrm{sec}}(r) + 0.15\,S_{\mathrm{infl}}(r), \tag{3}$$

combining seed\-internal cite frequency, section weight, and the Semantic Scholar influential flag\. Pass 2 adds a sibling\-boost term

$$S_{\mathrm{sib}}(r) \;=\; \bigl|\{\, s \in \mathrm{TopK}_1 : r \in s.\mathrm{predecessors} \,\}\bigr| \,/\, K,$$

which surfaces canonical ancestors with low local cite count yet referenced by many already\-selected papers \(e\.g\. GPT\-1 within the GPT\-3 subgraph\)\. The final rank is

$$\mathrm{Score}_2(r) \;=\; 0.40\,S_{\mathrm{cite}} + 0.35\,S_{\mathrm{sec}} + 0.10\,S_{\mathrm{infl}} + 0.15\,S_{\mathrm{sib}}. \tag{4}$$

We then connect every pair $(u, v)$ with $u \in \mathrm{predecessors}(v)$ by an edge classified as explicit\_pred when $t(v) > t(u)$, parallel\_pred when $t(v) = t(u)$, or direct\_to\_seed for sources without an in\-subgraph predecessor\. 2\-cycles are broken by $S_{\mathrm{cite}}$ with paper\-id lexicographic order as a tie\-breaker\. Every retained node carries its title, year, venue, abstract, and five\-field idea\. Every edge carries eight features grouped into five categories, namely *role* \(which seed sections cite $u$\), *influence* \(Semantic Scholar’s flag plus subgraph\-local centrality\), *recency* \(year arithmetic\), *topology* \(hop distance\), and *provenance* \(parser\-side confidence\)\. The full feature list with sources and semantics appears in Appendix [B](https://arxiv.org/html/2605.14790#A2)\.
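The budget and the two-pass ranking can be sketched as follows. The weights and the budget formula are taken from the equations above; everything else is an assumption made for illustration: the component scores are taken as given values in [0, 1], pass 2 re-ranks all candidates rather than only those outside the pass-1 top K, ties are broken by paper id, and the names `budget` and `two_pass_rank` are hypothetical.

```python
import math
from typing import Dict, List, Set


def budget(num_refs: int) -> int:
    """K = min(30, max(12, ceil(0.15 * |R(v)|)))."""
    return min(30, max(12, math.ceil(0.15 * num_refs)))


def two_pass_rank(
    feats: Dict[str, Dict[str, float]],  # per candidate: "cite", "sec", "infl"
    predecessors: Dict[str, Set[str]],   # s -> in-subgraph predecessors of s
    k: int,
) -> List[str]:
    # Pass 1: seed-side evidence only (Eq. 3).
    score1 = {
        r: 0.45 * f["cite"] + 0.40 * f["sec"] + 0.15 * f["infl"]
        for r, f in feats.items()
    }
    top_k1 = sorted(score1, key=lambda r: (-score1[r], r))[:k]

    # Sibling boost: how many pass-1 survivors list r as a predecessor, over K.
    def s_sib(r: str) -> float:
        return sum(1 for s in top_k1 if r in predecessors.get(s, set())) / k

    # Pass 2: re-rank with the boost (Eq. 4).
    score2 = {
        r: 0.40 * f["cite"] + 0.35 * f["sec"] + 0.10 * f["infl"] + 0.15 * s_sib(r)
        for r, f in feats.items()
    }
    return sorted(score2, key=lambda r: (-score2[r], r))[:k]


# Toy example: "g" is rarely cited in the seed but is a predecessor of "a" and "b".
feats = {
    "a": {"cite": 1.0, "sec": 1.0, "infl": 1.0},
    "b": {"cite": 0.8, "sec": 0.8, "infl": 0.0},
    "c": {"cite": 0.5, "sec": 0.5, "infl": 0.0},
    "g": {"cite": 0.4, "sec": 0.5, "infl": 0.0},
}
kept = two_pass_rank(feats, {"a": {"g"}, "b": {"g"}}, k=3)
```

In the toy example, pass 1 alone would keep `a`, `b`, `c` (scores 1.00, 0.68, 0.425 against 0.38 for `g`); with the boost, `g` rises to 0.435 against 0.375 for `c` and displaces it. For the test-set median of 64 references, `budget(64)` gives 12, since 15% of 64 rounds up to 10 and the floor of 12 applies.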

### 3\.2 Graph\-conditioned SFT

We train Qwen2\.5\-7B\-Instruct\-1M \[[34](https://arxiv.org/html/2605.14790#bib.bib23)\] on a structured\-text serialization of the annotated subgraph $G_v$ from §[3\.1](https://arxiv.org/html/2605.14790#S3.SS1) to predict the seed paper’s five\-field idea, calling the resulting model GoR\-SFT\.

#### Prompt serialization\.

We serialize $G_v$ into a structured\-text prompt with one block per node \(\[EDGE\], \[PREDECESSORS\], \[IDEA\], \[ABSTRACT\]\) emitted in temporal order, followed by a TASK block requesting the seed’s five\-field idea in strict JSON\. The full template is in Appendix [D](https://arxiv.org/html/2605.14790#A4)\.
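To make the serialization concrete, here is a rough sketch of a serializer and of the structure-stripping used for a plain-reference input. The actual template lives in the appendix and is not reproduced here, so the layout below is an assumption: the `[NODE]` header line, the `key=value` edge formatting, the `[TASK]` tag spelling, and the dictionary field names are all hypothetical. Only the four block tags and the temporal ordering come from the text.

```python
from typing import Dict, List


def serialize(nodes: List[Dict], task: str) -> str:
    """Emit one group of blocks per node, oldest first, then the task block."""
    parts = []
    # Temporal order; ties broken by paper id so the output is deterministic.
    for n in sorted(nodes, key=lambda n: (n["year"], n["id"])):
        edge = ", ".join(f"{k}={v}" for k, v in sorted(n["edge"].items()))
        preds = ", ".join(sorted(n["predecessors"])) or "none"
        parts.append(
            f"[NODE] {n['id']} ({n['year']}) {n['title']}\n"
            f"[EDGE] {edge}\n"
            f"[PREDECESSORS] {preds}\n"
            f"[IDEA] {n['idea']}\n"
            f"[ABSTRACT] {n['abstract']}"
        )
    parts.append(f"[TASK] {task}")
    return "\n\n".join(parts)


def strip_structure(prompt: str) -> str:
    """Plain-reference variant: drop only the [EDGE] and [PREDECESSORS] lines."""
    kept = [
        line for line in prompt.split("\n")
        if not line.startswith(("[EDGE]", "[PREDECESSORS]"))
    ]
    return "\n".join(kept)


nodes = [
    {"id": "p2", "year": 2021, "title": "B", "edge": {"hop": 1},
     "predecessors": ["p1"], "idea": "i2", "abstract": "a2"},
    {"id": "p1", "year": 2019, "title": "A", "edge": {"hop": 2},
     "predecessors": [], "idea": "i1", "abstract": "a1"},
]
prompt = serialize(nodes, "Return the seed's five-field idea as strict JSON.")
plain = strip_structure(prompt)
```

Keeping the stripped variant as a pure function of the full prompt is one way to guarantee that both inputs list the same papers in the same order, so that the structural blocks are the only difference.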

#### Training objective\.

Let $x = \mathrm{serialize}(G_v)$ be the prompt and $y$ the gold five\-field JSON for seed $v$\. We optimize standard completion\-only cross\-entropy

$$\mathcal{L}(\theta) \;=\; -\sum_{(x,y)\in\mathcal{D}} \sum_{t=1}^{|y|} \log p_{\theta}\bigl(y_t \bigm| x,\, y_{<t}\bigr), \tag{5}$$

masking all prompt tokens via DataCollatorForCompletionOnlyLM so gradient flows only on the JSON answer\. The single deliberate delta between GoR\-SFT and the matched Refs\-SFT baseline is the presence of \[EDGE\] and \[PREDECESSORS\] blocks in $x$\. Both runs share the base model, optimizer, schedule, batch size, precision, and random seed\. Headline settings are 2 epochs of full fine\-tuning, effective batch size 8, learning rate $2 \times 10^{-5}$ on a cosine\-with\-min\-lr schedule, bfloat16 precision, and max\_seq\_len 16,384 on $4\times$ A800\-80G GPUs\. The full hyperparameter table is in Appendix [E](https://arxiv.org/html/2605.14790#A5)\. We perform NLL\-based checkpoint selection on a held\-out 50\-paper in\-domain validation set described in §[4\.2](https://arxiv.org/html/2605.14790#S4.SS2)\.
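The effect of prompt masking on Eq. (5) can be illustrated without any training framework. The sketch below is not the collator itself: it assumes per-token log-probabilities are already available, and uses the common convention of marking ignored label positions with -100. The helper names are hypothetical.

```python
import math
from typing import List

IGNORE_INDEX = -100  # conventional "skip this position" label value


def mask_prompt(input_ids: List[int], prompt_len: int) -> List[int]:
    """Labels: IGNORE_INDEX on prompt positions, the token id on completion positions."""
    return [IGNORE_INDEX] * prompt_len + input_ids[prompt_len:]


def completion_nll(token_logprobs: List[float], labels: List[int]) -> float:
    """Sum of -log p(y_t | x, y_<t) over unmasked positions, as in Eq. (5)."""
    return -sum(
        lp for lp, label in zip(token_logprobs, labels) if label != IGNORE_INDEX
    )


# Three prompt tokens followed by two completion tokens.
input_ids = [11, 12, 13, 21, 22]
labels = mask_prompt(input_ids, prompt_len=3)
# Log-probability the model assigned to each token given its prefix (toy values).
token_logprobs = [math.log(0.5)] * 3 + [math.log(0.25), math.log(0.5)]
loss = completion_nll(token_logprobs, labels)  # -(log 0.25 + log 0.5) = log 8
```

Only the last two positions contribute, so the prompt's own likelihood never enters the gradient. Note that Eq. (5) is written as a sum, whereas training libraries typically average over unmasked tokens; the two differ by a scale factor per batch.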

### 3\.3 Next idea generation

At inference, GoR\-SFT consumes a new seed’s citation subgraph and decodes a five\-field idea with vLLM \[[15](https://arxiv.org/html/2605.14790#bib.bib26)\], sampling $K = 10$ candidates per prompt at temperature 0\.9 and top\-$p$ 0\.95\. The LLM\-judge tournament uses one decoded idea per method per seed, while the surface metrics use oracle top\-1 selection over the $K$ candidates \(§[4\.2](https://arxiv.org/html/2605.14790#S4.SS2), results in §[4\.3](https://arxiv.org/html/2605.14790#S4.SS3)\)\.
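One plausible reading of "oracle top-1 selection" is best-of-K against the gold idea: score every sampled candidate against the reference and report the best one. The sketch below assumes that reading. The `jaccard` scorer is a deliberately simple stand-in chosen so the example runs without model downloads; it is not one of the paper's metrics, and both function names are hypothetical.

```python
from typing import Callable, List, Tuple


def oracle_top1(
    candidates: List[str],
    gold: str,
    score: Callable[[str, str], float],
) -> Tuple[int, float]:
    """Return (index, score) of the candidate scoring highest against the gold text."""
    scored = [(score(c, gold), i) for i, c in enumerate(candidates)]
    # Highest score wins; on ties, prefer the earliest candidate.
    best_score, best_idx = max(scored, key=lambda p: (p[0], -p[1]))
    return best_idx, best_score


def jaccard(a: str, b: str) -> float:
    """Token-set overlap, used here only as a placeholder similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


candidates = [
    "graph neural ranking",
    "citation graph supervision for idea generation",
    "prompt tuning",
]
best_idx, best_score = oracle_top1(candidates, "citation graph supervision", jaccard)
```

Because the selection uses the gold idea, the resulting number is an upper bound over the K samples rather than the quality of a single decode, which is why the tournament uses one idea per method instead.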

## 4 Experiments

### 4\.1 Research questions

We organize our empirical studies of the GoR framework around three research questions on LLM\-based idea generation\.

- RQ1: Input paradigm comparison\. How do different paradigms for presenting prior literature to an LLM affect idea generation quality across the five evaluation dimensions of novelty, significance, feasibility, clarity, and effectiveness?
- RQ2: Component contribution analysis\. Within the GoR pipeline at matched 7B capacity, what are the individual contributions of supervised fine\-tuning and the structural graph input to overall idea generation quality?
- RQ3: Supervision versus capacity\. Can a 7B model trained with paper\-evolution citation graphs as supervision match or exceed a much larger closed\-source LLM that consumes the same graph\-format input in zero\-shot mode?

### 4\.2 Experimental setup

#### Data\.

Following the construction pipeline described in Section [3\.1](https://arxiv.org/html/2605.14790#S3.SS1), we build the GoR dataset from accepted papers at five major NLP and ML venues: NeurIPS, ICLR, CVPR, ICML, and ACL\. The training pool contains 498 papers from 2020 to 2024, with the per\-year distribution in Appendix [C](https://arxiv.org/html/2605.14790#A3)\. We hold out a 50\-paper in\-domain validation set from the same year span and venue mix for NLL\-based checkpoint selection\. The test set comprises 50 in\-distribution seeds drawn from accepted 2025 papers at the same five venues, distributed as ICML \(13\), ACL \(12\), NeurIPS \(9\), ICLR \(9\), and CVPR \(7\), with a median of 64 references per seed \(range 24–151\)\. The strict temporal cone $t(\text{test}) > t(\text{train})$ ensures no test seed appears as a node or 2\-hop reference in any training subgraph\. We additionally verified the absence of leakage via title\-string and Semantic Scholar paper\-id overlap checks\.
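The overlap check in the last sentence can be sketched as two set lookups. This is an assumed implementation, not the authors' script: the title normalization (lowercasing and collapsing non-alphanumeric runs) and the `paper_id` and `title` field names are hypothetical choices.

```python
import re
from typing import Dict, Iterable, Set


def norm_title(title: str) -> str:
    """Lowercase and collapse every non-alphanumeric run to a single space."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def leaked_seeds(
    test_seeds: Iterable[Dict[str, str]],
    train_nodes: Iterable[Dict[str, str]],
) -> Set[str]:
    """Ids of test seeds that match any training-subgraph node by id or by title."""
    train_nodes = list(train_nodes)
    train_ids = {n["paper_id"] for n in train_nodes}
    train_titles = {norm_title(n["title"]) for n in train_nodes}
    return {
        s["paper_id"]
        for s in test_seeds
        if s["paper_id"] in train_ids or norm_title(s["title"]) in train_titles
    }


test_seeds = [
    {"paper_id": "t1", "title": "Graphs of Research!"},
    {"paper_id": "t2", "title": "An Unrelated Paper"},
]
train_nodes = [
    {"paper_id": "x9", "title": "graphs  of research"},  # same title, different id
    {"paper_id": "x3", "title": "Something Else"},
]
leaks = leaked_seeds(test_seeds, train_nodes)  # flags "t1" via the title match
```

For the check to cover the stated guarantee, `train_nodes` would need to include every node of every training subgraph, that is, the training seeds together with their retained 1-hop and 2-hop references. Checking titles in addition to ids catches preprint and camera-ready versions that carry different identifiers.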

#### Baselines\.

We compare our two GoR variants against three published works on LLM\-based research idea generation\. The two variants differ in how they consume the citation graph\. GoR\-SFT fine\-tunes a 7B Qwen base on the graph\-format prompt, while GoR\-Agent feeds the same graph\-format prompt to gpt\-4o at inference without supervision\. To ensure a fair comparison, all three baselines use gpt\-4o as the shared LLM backbone, run on the same 50\-seed test set, and produce the same five\-field idea schema, which minimizes judge preference toward more structured outputs\. We compare against the following baselines:

- Si baseline \[[23](https://arxiv.org/html/2605.14790#bib.bib1)\]: retrieval\-augmented over\-generate\-and\-rerank with pairwise comparison\.
- CoI\-Agent \[[16](https://arxiv.org/html/2605.14790#bib.bib3)\]: organizes prior work into a chronological chain of ideas and generates new ideas grounded in this evolution\.
- ResearchAgent \[[1](https://arxiv.org/html/2605.14790#bib.bib2)\]: multi\-agent reviewer\-critique loop with academic\-graph entity grounding\.

#### Models\.

Three LLMs serve different roles: Qwen2\.5\-7B\-Instruct\-1M \[[34](https://arxiv.org/html/2605.14790#bib.bib23)\] as our base to train GoR\-SFT, gpt\-4o as the shared backbone for the three baselines and our GoR\-Agent variant, and DeepSeek\-V3\.2\-Exp \[[4](https://arxiv.org/html/2605.14790#bib.bib25)\] as the fixed LLM judge\.

#### Evaluation protocols\.

To validate GoR, we conduct two automated evaluations \(T1 and T2\) and one human evaluation \(H\)\.

T1: LLM\-judge tournament\. For each test seed, the LLM judge scores all ordered pairs in an $N$\-way round\-robin along the five dimensions of novelty, significance, feasibility, clarity, and effectiveness\. Each pair receives 2/1/0 per dimension following the protocol of Li et al\. \[[16](https://arxiv.org/html/2605.14790#bib.bib3)\], with both presentation orderings\. We report mean Elo, per\-method rank\-1 count, and per\-dimension winrate\.

T2: surface metrics\. For each test seed, we sample $K = 10$ candidates at temperature 0\.9 and top\-$p$ 0\.95, then report three surface metrics against the gold five\-field idea: SPECTER2 \[[24](https://arxiv.org/html/2605.14790#bib.bib19)\] weighted top\-1 cosine \(wTop1\), Method\-field ROUGE\-L F1 \[[17](https://arxiv.org/html/2605.14790#bib.bib29)\] \(mROUGE\), and DeBERTa\-based BERTScore F1 \[[11](https://arxiv.org/html/2605.14790#bib.bib27),[36](https://arxiv.org/html/2605.14790#bib.bib28)\] \(BERT\-F1\), all higher\-is\-better\.

H: human evaluation\. Three NLP and ML PhD raters score the four systems on 10 metrics on a 1–10 Likert scale across a balanced 5\-seed blinded subset\. The first five dimensions align with T1, and the remaining five \(Excitement, Soundness, Originality, Reproducibility, Overall\) are drawn from prior human studies \[[1](https://arxiv.org/html/2605.14790#bib.bib2),[32](https://arxiv.org/html/2605.14790#bib.bib12)\]\. Full protocol and per\-dimension results appear in Appendix [F](https://arxiv.org/html/2605.14790#A6)\.
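The T1 point arithmetic can be worked through in a few lines. The sketch assumes that a method's per-seed score is the plain sum of its 2/1/0 points over the five dimensions and both presentation orderings. That reading is consistent with the denominators reported later (20 for a two-way comparison, 40 for a three-way one), but the exact aggregation behind the reported mean Elo is not spelled out in this section, so treat it as an assumption. Names are hypothetical.

```python
from typing import Dict, List, Tuple

DIMS = ["novelty", "significance", "feasibility", "clarity", "effectiveness"]
POINTS = {"win": 2, "tie": 1, "loss": 0}


def pair_points(verdicts: List[Dict[str, str]]) -> Tuple[int, int]:
    """Points for (A, B) on one seed.

    `verdicts` holds one dict per presentation ordering, mapping each dimension
    to "win", "tie", or "loss" from method A's point of view.
    """
    a = sum(POINTS[v[d]] for v in verdicts for d in DIMS)
    total = 2 * len(DIMS) * len(verdicts)  # 2 points per dimension per ordering
    return a, total - a


def dimension_winrate(outcomes: List[str]) -> Dict[str, float]:
    """Share of seeds that A wins, ties, or loses on a single dimension."""
    n = len(outcomes)
    return {k: outcomes.count(k) / n for k in ("win", "tie", "loss")}


a_first = {"novelty": "loss", "significance": "tie", "feasibility": "win",
           "clarity": "win", "effectiveness": "win"}
b_first = {"novelty": "loss", "significance": "tie", "feasibility": "win",
           "clarity": "win", "effectiveness": "tie"}
score_a, score_b = pair_points([a_first, b_first])  # 13 and 7, out of 20
```

Under this reading, each pair distributes exactly 20 points per seed, so in a three-way round-robin every method can earn at most 40 and the three scores sum to 60.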

![Refer to caption](https://arxiv.org/html/2605.14790v1/fig_exp_summary.png)Figure 4: \(A\) Mean Elo margins of GoR\-SFT and GoR\-Agent against the three published baselines on the test set\. \(B\) Per\-dimension winrates of GoR\-SFT against the three baselines\. \(C\) Mean Elo of the three matched 7B variants w/o SFT, w/o graph, and GoR\-SFT, with 95% bootstrap CIs over seeds\. \(D\) Per\-dimension winrates of GoR\-SFT versus GoR\-Agent on the same graph input\.

### 4\.3 Main results: head\-to\-head against published baseline agents

To answer RQ1, GoR\-SFT and GoR\-Agent each compete against three gpt\-4o\-driven baselines in pairwise LLM\-judge tournaments\. GoR\-Agent isolates the input paradigm under the shared gpt\-4o backbone, while GoR\-SFT tests whether SFT on the same graph prompt recovers further gains\.

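Table 1: Main results: GoR\-Agent and GoR\-SFT vs published baselines on the test set\. Bold marks the GoR variant’s wins\.

| Opponent | Nov W | Nov T | Nov L | Sig W | Sig T | Sig L | Feas W | Feas T | Feas L | Clar W | Clar T | Clar L | Eff W | Eff T | Eff L | Rank-1 | Mean Elo /20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *GoR-Agent: gpt-4o backbone with graph-format prompt (zero-shot)* | | | | | | | | | | | | | | | | | |
| Si baseline | 0.00 | 0.26 | 0.74 | 0.06 | 0.30 | 0.64 | 0.76 | 0.12 | 0.12 | 0.38 | 0.34 | 0.28 | 0.34 | 0.36 | 0.30 | **26** / 24 | 9.08 / 10.92 |
| CoI-Agent | 0.16 | 0.42 | 0.42 | 0.20 | 0.32 | 0.48 | 0.48 | 0.22 | 0.30 | 0.54 | 0.24 | 0.22 | 0.48 | 0.20 | 0.32 | **31** / 19 | **10.32** / 9.68 |
| ResearchAgent | 0.38 | 0.46 | 0.16 | 0.44 | 0.28 | 0.28 | 0.92 | 0.04 | 0.04 | 0.96 | 0.02 | 0.02 | 0.84 | 0.10 | 0.06 | **47** / 3 | **15.88** / 4.12 |
| *GoR-SFT: 7B base fine-tuned on the same graph supervision* | | | | | | | | | | | | | | | | | |
| Si baseline | 0.04 | 0.20 | 0.76 | 0.12 | 0.24 | 0.64 | 0.88 | 0.10 | 0.02 | 0.56 | 0.28 | 0.16 | 0.48 | 0.38 | 0.14 | **31** / 19 | **10.54** / 9.46 |
| CoI-Agent | 0.32 | 0.24 | 0.44 | 0.24 | 0.28 | 0.48 | 0.78 | 0.16 | 0.06 | 0.76 | 0.16 | 0.08 | 0.70 | 0.16 | 0.14 | **40** / 10 | **13.26** / 6.74 |
| ResearchAgent | 0.58 | 0.28 | 0.14 | 0.54 | 0.26 | 0.20 | 0.94 | 0.06 | 0.00 | 0.98 | 0.02 | 0.00 | 0.86 | 0.12 | 0.02 | **48** / 2 | **16.98** / 3.02 |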

#### Graph as a zero\-shot input alone is unstable\.

GoR\-Agent wins two of the three tournaments on mean Elo, with 10\.32 against CoI\-Agent and 15\.88 against ResearchAgent, but loses to Si baseline at 9\.08 vs 10\.92 \(top three rows of Table [1](https://arxiv.org/html/2605.14790#S4.T1), with the same margins visualized in Figure [4](https://arxiv.org/html/2605.14790#S4.F4)\(A\)\)\. The pattern is sharper at the dimension level\. GoR\-Agent leads on the three structural dimensions Feasibility, Clarity, and Effectiveness, where its per\-seed win rate spans 0\.34 to 0\.96 across the three opponents, but trails Si baseline on Novelty and Significance, with win rates of 0\.00 and 0\.06\. The asymmetry traces to prompt paradigm: Si baseline’s prompt explicitly encourages open\-ended divergent ideation over a flat reference bank, while the GoR\-Agent prompt constrains the model to derive ideas from the structured citation evolution graph\. The structural constraint thus trades raw novelty for grounded feasibility\. This pattern raises the question of whether prompt\-level injection suffices for the model to internalize the graph signal, or whether supervised fine\-tuning on the graph is required\. We address this with GoR\-SFT next\.

#### SFT on graph supervision yields consistent SOTA\.

When the graph signal is learned at training time, GoR\-SFT wins all three head\-to\-head tournaments on mean Elo, with 10\.54 against Si baseline, 13\.26 against CoI\-Agent, and 16\.98 against ResearchAgent, ranking first on 31, 40, and 48 of the 50 test seeds respectively \(bottom three rows of Table [1](https://arxiv.org/html/2605.14790#S4.T1), Figure [4](https://arxiv.org/html/2605.14790#S4.F4)\(A\)\)\. On Feasibility, Clarity, and Effectiveness, per\-seed win rate spans 0\.48 to 0\.98 \(Figure [4](https://arxiv.org/html/2605.14790#S4.F4)\(B\)\)\. On creative dimensions GoR\-SFT outperforms ResearchAgent and CoI\-Agent on Novelty \(0\.58, 0\.32\) and Significance \(0\.54\), while losing only to Si baseline on Novelty \(0\.04\) and Significance \(0\.12\)\.

### 4\.4 Ablation: structural input is the driver

To answer RQ2, we decompose GoR\-SFT’s gain under matched 7B capacity: how much comes from SFT over the zero\-shot base, and how much comes from the structural input on top of plain\-reference SFT? Table [2](https://arxiv.org/html/2605.14790#S4.T2) merges the 3\-way LLM\-judge tournament across three 7B\-Instruct ablation variants \(w/o SFT, w/o graph, GoR\-SFT\) with T2 surface metrics on the same 50\-seed test set\.

GoR-SFT achieves mean Elo 23.42 out of 40 in the 3-way tournament, exceeding plain-reference w/o graph at 22.46 and zero-shot w/o SFT at 14.12, and ranks first on 25 of the 50 test seeds compared to 22 for w/o graph and 3 for w/o SFT. The pattern is sharper at the dimension level. GoR-SFT wins 4 of 5 dimensions over w/o graph, with Significance reaching 0.730 against 0.625 and Clarity 0.775 against 0.705, while Feasibility is essentially tied at 0.640 against 0.645. The same ordering holds on the T2 surface metrics, where GoR-SFT wins all three of wTop1, mROUGE, and BERT-F1. The T2 margins are smaller because wTop1 saturates near 0.93 across SFT variants, while the LLM-judge tournament discriminates more sharply on idea content.

Table 2: Quantitative comparison of w/o SFT, w/o graph, and GoR-SFT. Best per column in bold.

Figure [4](https://arxiv.org/html/2605.14790#S4.F4)(C) summarizes the two contributions in mean Elo terms. The SFT contribution is large and statistically stable. GoR-SFT reaches mean Elo 23.42 against zero-shot Qwen’s 14.12, a margin of 9.30 with paired bootstrap 95% CI [5.32, 13.14] and one-sided Wilcoxon $p < 10^{-4}$. The graph contribution on top of plain-reference SFT is smaller. GoR-SFT reaches 23.42 against w/o graph’s 22.46, a margin of 0.96 that we interpret as evidence for a useful structural prior. At the dimension level the two effects have different signatures. The SFT-over-zero-shot effect lifts all five dimensions broadly, with per-dimension gains ranging from 0.125 on Feasibility to 0.255 on Effectiveness. The graph-over-plain effect concentrates on Significance and Clarity at gains of 0.105 and 0.070, with Feasibility essentially unchanged.
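
A paired bootstrap interval of the kind quoted above can be sketched as follows. This is a minimal illustration, assuming per-seed Elo scores are available as two aligned lists; the function name `paired_bootstrap_ci`, the resample count, the percentile method, and the toy numbers are all hypothetical and are not taken from the paper's evaluation pipeline.

```python
import random


def paired_bootstrap_ci(a, b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference mean(a - b).

    a, b: per-seed scores for two systems, aligned by test seed.
    Returns (observed_margin, ci_low, ci_high).
    """
    if len(a) != len(b) or not a:
        raise ValueError("paired samples must be non-empty and of equal length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    rng = random.Random(seed)
    # Resample the per-seed differences with replacement and record each mean.
    boot = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = boot[int(alpha / 2 * n_resamples)]
    hi = boot[max(int((1 - alpha / 2) * n_resamples) - 1, 0)]
    return sum(diffs) / n, lo, hi


if __name__ == "__main__":
    # Toy per-seed Elo values (hypothetical, not from the paper).
    sft = [24.0, 22.5, 25.0, 21.0, 23.5]
    base = [14.0, 15.5, 13.0, 16.0, 12.5]
    print(paired_bootstrap_ci(sft, base, n_resamples=2000))
```

Resampling the differences rather than the two score lists independently keeps the pairing by seed intact, which is what makes the interval a paired one.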

### 4.5 Supervision versus capacity

To answer RQ3, we conduct a head-to-head LLM-judge evaluation of GoR-SFT against GoR-Agent on the test set. Both systems consume the identical graph-format prompt at inference, isolating the contrast between using the citation graph as a prompt-only injection versus as a training-time supervision signal for idea generation. GoR-SFT wins the pair on mean Elo, 11.10 against 8.90, and ranks first on 32 of the 50 test seeds. The win is concentrated on the structural dimensions Feasibility, Clarity, and Effectiveness at 0.840, 0.850, and 0.780, while gpt-4o’s larger parametric memory retains Novelty and Significance at 0.740 each, as reported in Table [3](https://arxiv.org/html/2605.14790#S4.T3) and visualized in Figure [4](https://arxiv.org/html/2605.14790#S4.F4)(D). GoR-SFT additionally offers a Pareto improvement in the cost-quality tradeoff. The 7B model runs locally and completes the 50-seed test set in 2 to 3 minutes with no API spend, while GoR-Agent and the three gpt-4o-driven baselines all rely on the OpenAI API at roughly $0.004 per idea.

Table 3: Quantitative comparison of GoR-SFT and GoR-Agent on the test set under matched graph-format input. Best per column in bold. Cost is the per-idea inference cost in USD.

### 4.6 Case study: structure-aware idea generation on a citation subgraph

We close with a qualitative case study on the seed paper *AbGen* [[37](https://arxiv.org/html/2605.14790#bib.bib32)]. The retained subgraph (Fig. [5](https://arxiv.org/html/2605.14790#S4.F5)) spans 12 refs across multi-agent review, agent benchmarks, end-to-end agents, and multimodal scientific QA, leaving a gap at the intersection of iterative ablation refinement and long-range research planning that no retained ref addresses jointly.

Reading the four proposals (Table [8](https://arxiv.org/html/2605.14790#A7.T8) in Appendix [G](https://arxiv.org/html/2605.14790#A7)), GoR-SFT identifies the refinement-plus-planning gap and proposes a two-component framework anchored in MARG plus a planning module for long-range decisions. Si baseline proposes AGCLO, a free-floating three-stage multi-agent system unanchored to any subgraph ref. CoI-Agent proposes AbGen-XAI+, an extension of AbGen and ML-Bench from the chronological chain. ResearchAgent pivots to FDMME, a multimodal embedding pipeline grounded in the multimodal subset of the subgraph but largely orthogonal to the core ablation-design problem. The LLM-judge verdict reflects the grounding pattern: GoR-SFT wins 15-5 against Si baseline and 15-5 against CoI-Agent, and ties 10-10 against ResearchAgent.

Unlike cases where one paradigm sweeps all baselines, this seed spreads the verdict: GoR-SFT decisively beats the two baselines whose paradigm gives no clean grounding, while it ties ResearchAgent, whose entity-store paradigm coincidentally surfaces the multimodal subset of the subgraph. The case nevertheless supports the proposed mechanism. When a coherent structural signal exists in the subgraph, a fine-tuned model identifies it and produces a grounded proposal that two of the three gpt-4o-driven baseline paradigms cannot reproduce.

![Refer to caption](https://arxiv.org/html/2605.14790v1/case-study.png)

Figure 5: Citation subgraph for the *AbGen* seed paper.

## 5 Conclusion

We introduce GoR, a supervised fine\-tuning recipe that supervises a 7B LLM on citation evolution graphs, built by an automated graph\-aware data\-construction pipeline, rather than on a flat reference bag\. GoR\-SFT wins all three head\-to\-head tournaments against three published baselines, ranking first on 31, 40, and 48 of the 50 seeds respectively\. Ablation isolates SFT as the dominant driver, with graph supervision contributing focused additional gains on Significance and Clarity over plain\-reference SFT\. GoR\-SFT further takes 32 of 50 seeds head\-to\-head against a much larger gpt\-4o consuming the same graph, showing that a 7B open\-source base with graph supervision can surpass flagship LLMs on idea generation at near\-zero inference cost\. In a blinded 5\-rater human study, GoR\-SFT additionally wins 5 of 10 dimensions, independently corroborating the LLM\-judge ranking\. Citation\-graph structure is therefore a useful supervision signal that current LLM\-based ideation systems systematically overlook\. Looking forward, we plan to scale the automated extraction pipeline to build larger paper\-evolution graph datasets, equipping LLMs with a deeper grasp of scientific evolution and accelerating research innovation\. We will release the data, training, and evaluation pipeline to encourage extensions to other base LLMs and to hybrid creative\-plus\-graph prompts\.

## References

- [1] J. Baek, S. K. Jauhar, S. Cucerzan, and S. J. Hwang (2025). ResearchAgent: iterative research idea generation over scientific literature with large language models. In Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL). [Link](https://arxiv.org/abs/2404.07738)
- [2] J. S. Chan, N. Chowdhury, O. Jaffe, J. Aung, D. Sherburn, E. Mays, G. Starace, K. Liu, L. Maksin, T. Patwardhan, A. Madry, and L. Weng (2024). MLE-bench: evaluating machine learning agents on machine learning engineering. arXiv preprint arXiv:2410.07095. [Link](https://arxiv.org/abs/2410.07095)
- [3] R. Chen, T. Zhao, A. Jaiswal, N. Shah, and Z. Wang (2024). LLaGA: large language and graph assistant. In International Conference on Machine Learning (ICML). [Link](https://arxiv.org/abs/2402.08170)
- [4] DeepSeek-AI (2025). DeepSeek-V3.2-exp technical report. Model release; technical report at [https://github.com/deepseek-ai/DeepSeek-V3.2-Exp](https://github.com/deepseek-ai/DeepSeek-V3.2-Exp). Hugging Face / GitHub release.
- [5] D. Edge, H. Trinh, N. Cheng, J. Bradley, A. Chao, A. Mody, S. Truitt, and J. Larson (2024). From local to global: a GraphRAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130. [Link](https://arxiv.org/abs/2404.16130)
- [6] A. Ghafarollahi and M. J. Buehler (2025). SciAgents: automating scientific discovery through bioinspired multi-agent intelligent graph reasoning. arXiv preprint arXiv:2409.05556. [Link](https://arxiv.org/abs/2409.05556)
- [7] X. Gu and M. Krenn (2024). Forecasting high-impact research topics via citation knowledge graph embeddings. arXiv preprint arXiv:2402.08640. [Link](https://arxiv.org/abs/2402.08640)
- [8] X. Guan, L. L. Zhang, Y. Liu, N. Shang, Y. Sun, Y. Zhu, F. Yang, and M. Yang (2025). rStar-Math: small LLMs can master math reasoning with self-evolved deep thinking. arXiv preprint arXiv:2501.04519. [Link](https://arxiv.org/abs/2501.04519)
- [9] Z. Guo, L. Xia, Y. Yu, T. Ao, and C. Huang (2024). LightRAG: simple and fast retrieval-augmented generation. arXiv preprint arXiv:2410.05779. [Link](https://arxiv.org/abs/2410.05779)
- [10] B. J. Gutiérrez, Y. Shu, Y. Gu, M. Yasunaga, and Y. Su (2024). HippoRAG: neurobiologically inspired long-term memory for large language models. In Conference on Neural Information Processing Systems (NeurIPS). [Link](https://arxiv.org/abs/2405.14831)
- [11] P. He, X. Liu, J. Gao, and W. Chen (2021). DeBERTa: decoding-enhanced BERT with disentangled attention. In International Conference on Learning Representations (ICLR). [Link](https://arxiv.org/abs/2006.03654)
- [12] P. Hsu, Y. Dai, V. Kothapalli, Q. Song, S. Tang, S. Zhu, S. Shimizu, S. Sahni, H. Ning, and Y. Chen (2024). Liger-kernel: efficient triton kernels for LLM training. arXiv preprint arXiv:2410.10989. [Link](https://arxiv.org/abs/2410.10989)
- [13] X. Hu, H. Fu, J. Wang, Y. Wang, Z. Li, R. Xu, Y. Lu, Y. Jin, L. Pan, and Z. Lan (2024). Nova: an iterative planning and search approach to enhance novelty and diversity of LLM generated ideas. arXiv preprint arXiv:2410.14255. [Link](https://arxiv.org/abs/2410.14255)
- [14] R. Kinney, C. Anastasiades, R. Authur, I. Beltagy, J. Bragg, A. Buraczynski, I. Cachola, S. Candra, Y. Chandrasekhar, A. Cohan, M. Crawford, D. Downey, J. Dunkelberger, O. Etzioni, R. Evans, S. Feldman, J. Gorney, D. Graham, F. Hu, R. Huff, D. King, S. Kohlmeier, B. Kuehl, M. Langan, D. Lin, H. Liu, K. Lo, J. Lochner, K. MacMillan, T. Murray, C. Newell, S. Rao, S. Rohatgi, P. Sayre, Z. Shen, A. Singh, L. Soldaini, S. Subramanian, A. Tanaka, A. D. Wade, L. Wagner, L. L. Wang, C. Wilhelm, C. Wu, J. Yang, A. Zamarron, M. Van Zuylen, and D. S. Weld (2023). The Semantic Scholar open data platform. arXiv preprint arXiv:2301.10140. [Link](https://arxiv.org/abs/2301.10140)
- [15] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica (2023). Efficient memory management for large language model serving with PagedAttention. In ACM Symposium on Operating Systems Principles (SOSP). [Link](https://arxiv.org/abs/2309.06180)
- [16] L. Li, W. Xu, J. Guo, R. Zhao, X. Li, Y. Yuan, B. Zhang, Y. Jiang, Y. Xin, R. Dang, Y. Rong, D. Zhao, T. Xu, and L. Bing (2025). Chain of ideas: revolutionizing research in idea development with LLM agents. In International Conference on Learning Representations (ICLR). [Link](https://arxiv.org/abs/2410.13185)
- [17] C. Lin (2004). ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out (Workshop at ACL), pp. 74–81.
- [18] Z. Liu, Y. Wang, S. Vaidya, F. Ruehle, J. Halverson, M. Soljačić, T. Y. Hou, and M. Tegmark (2024). KAN: Kolmogorov-Arnold networks. arXiv preprint arXiv:2404.19756. [Link](https://arxiv.org/abs/2404.19756)
- [19] P. Lopez (2009). GROBID: combining automatic bibliographic data recognition and term extraction for scholarship publications. In Research and Advanced Technology for Digital Libraries (ECDL), pp. 473–474. Software: [https://github.com/kermitt2/grobid](https://github.com/kermitt2/grobid). [Document](https://dx.doi.org/10.1007/978-3-642-04346-8%5F62)
- [20] C. Lu, C. Lu, R. T. Lange, J. Foerster, J. Clune, and D. Ha (2024). The AI scientist: towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292. [Link](https://arxiv.org/abs/2408.06292)
- [21] S. Miserendino, M. Wang, T. Patwardhan, and J. Heidecke (2025). SWE-Lancer: can frontier LLMs earn $1 million from real-world freelance software engineering? arXiv preprint arXiv:2502.12115. [Link](https://arxiv.org/abs/2502.12115)
- [22] S. Schmidgall, Y. Su, Z. Wang, X. Sun, J. Wu, X. Yu, J. Liu, M. Moor, Z. Liu, and E. Barsoum (2025). Agent laboratory: using LLM agents as research assistants. arXiv preprint arXiv:2501.04227. [Link](https://arxiv.org/abs/2501.04227)
- [23] C. Si, D. Yang, and T. Hashimoto (2025). Can LLMs generate novel research ideas? A large-scale human study with 100+ NLP researchers. In International Conference on Learning Representations (ICLR). arXiv preprint arXiv:2409.04109, 2024. [Link](https://arxiv.org/abs/2409.04109)
- [24] A. Singh, M. D’Arcy, A. Cohan, D. Downey, and S. Feldman (2023). SciRepEval: a multi-format benchmark for scientific document representations. In Conference on Empirical Methods in Natural Language Processing (EMNLP). Introduces SPECTER2 embedding family. [Link](https://arxiv.org/abs/2211.13308)
- [25] H. Su, R. Chen, S. Tang, X. Zheng, J. Li, Z. Yin, W. Ouyang, and N. Dong (2024). Two heads are better than one: a multi-agent system has the potential to improve scientific idea generation. arXiv preprint arXiv:2410.09403. [Link](https://arxiv.org/abs/2410.09403)
- [26] D. R. Swanson (1986). Fish oil, Raynaud’s syndrome, and undiscovered public knowledge. Perspectives in Biology and Medicine 30(1), pp. 7–18.
- [27] J. Sybrandt, I. Tyagin, M. Shtutman, and I. Safro (2020). AGATHA: automatic graph mining and transformer based hypothesis generation approach. In ACM International Conference on Information and Knowledge Management (CIKM). [Link](https://arxiv.org/abs/2002.05635)
- [28] J. Tang, L. Xia, Z. Li, and C. Huang (2025). AI-researcher: autonomous scientific innovation. arXiv preprint arXiv:2505.18705. [Link](https://arxiv.org/abs/2505.18705)
- [29] J. Tang, Y. Yang, W. Wei, L. Shi, L. Su, S. Cheng, D. Yin, and C. Huang (2024). GraphGPT: graph instruction tuning for large language models. In ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). [Link](https://arxiv.org/abs/2310.13023)
- [30] Q. Wang, D. Downey, H. Ji, and T. Hope (2024). SciMON: scientific inspiration machines optimized for novelty. In Annual Meeting of the Association for Computational Linguistics (ACL). [Link](https://arxiv.org/abs/2305.14259)
- [31] Q. Wang, H. Wang, L. Chen, Z. Yang, G. Chen, H. Alinejad-Rokny, H. Li, Y. Lin, and M. Yang (2026). FlowPIE: test-time scientific idea evolution with flow-guided literature exploration. arXiv preprint arXiv:2603.29557. [Link](https://arxiv.org/abs/2603.29557)
- [32] Y. Weng, M. Zhu, G. Bao, H. Zhang, J. Wang, Y. Zhang, and L. Yang (2025). CycleResearcher: improving automated research via automated review. arXiv preprint arXiv:2411.00816. [Link](https://arxiv.org/abs/2411.00816)
- [33] Y. Yamada, R. T. Lange, C. Lu, S. Hu, C. Lu, J. Foerster, J. Clune, and D. Ha (2025). The AI scientist-v2: workshop-level automated scientific discovery via agentic tree search. arXiv preprint arXiv:2504.08066. [Link](https://arxiv.org/abs/2504.08066)
- [34] A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, et al. (2024). Qwen2.5 technical report. arXiv preprint arXiv:2412.15115. [Link](https://arxiv.org/abs/2412.15115)
- [35] Y. Yue, Z. Chen, R. Lu, A. Zhao, Z. Wang, S. Song, and G. Huang (2025). Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? arXiv preprint arXiv:2504.13837.
- [36] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi (2020). BERTScore: evaluating text generation with BERT. In International Conference on Learning Representations (ICLR). [Link](https://arxiv.org/abs/1904.09675)
- [37] Y. Zhao, W. Chen, Z. Xu, M. Patwardhan, C. Wang, Y. Liu, L. Vig, and A. Cohan (2025). AbGen: evaluating large language models in ablation study design and evaluation for scientific research. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL, Volume 1: Long Papers), pp. 12479–12491.
- [38] T. Y. Zhuo, M. C. Vu, J. Chim, H. Hu, W. Yu, R. Widyasari, I. N. B. Yusuf, H. Zhan, J. He, I. Paul, S. Brunner, C. Gong, T. Hoang, A. R. Zebaze, X. Hong, W. Li, J. Kaddour, M. Xu, Z. Zhang, P. Yadav, N. Jain, A. Gu, Z. Cheng, J. Liu, Q. Liu, Z. Wang, D. Lo, B. Hui, N. Muennighoff, D. Fried, X. Du, H. de Vries, and L. V. Werra (2025). BigCodeBench: benchmarking code generation with diverse function calls and complex instructions. In International Conference on Learning Representations (ICLR). [Link](https://arxiv.org/abs/2406.15877)

## Appendix

## Appendix A Five-field idea extraction pipeline

We extract a structured five-field idea (Problem, Existing Methods, Motivation, Proposed Method, Experiment Plan) from each paper using a prompted-LLM pipeline operating on the paper’s introduction and method sections, parsed via GROBID and OpenReview HTML. The extractor follows the schema convention of Baek et al. [[1](https://arxiv.org/html/2605.14790#bib.bib2)] and Li et al. [[16](https://arxiv.org/html/2605.14790#bib.bib3)] with a strict-JSON output constraint. The same five-field schema is used uniformly for the 498 training papers, the 50-paper in-domain validation set, and the 50-seed test set, so every paper in our pool has a comparable target representation regardless of its role.
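
A strict-JSON constraint of this kind is usually enforced by validating the extractor's raw output before it enters the dataset. The sketch below is one way to do that, assuming the JSON keys match the five field names listed above exactly; the key spelling, the function name `parse_idea`, and the rejection rules are assumptions for illustration, not the authors' implementation.

```python
import json

# Assumed key names, taken from the five fields named in the text.
FIELDS = (
    "Problem",
    "Existing Methods",
    "Motivation",
    "Proposed Method",
    "Experiment Plan",
)


def parse_idea(raw: str) -> dict:
    """Parse extractor output and enforce the five-field schema strictly."""
    obj = json.loads(raw)  # raises json.JSONDecodeError on non-JSON output
    if not isinstance(obj, dict):
        raise ValueError("expected a JSON object")
    missing = [f for f in FIELDS if f not in obj]
    extra = [k for k in obj if k not in FIELDS]
    if missing or extra:
        raise ValueError(f"schema mismatch: missing={missing} extra={extra}")
    for f in FIELDS:
        if not isinstance(obj[f], str) or not obj[f].strip():
            raise ValueError(f"field {f!r} must be a non-empty string")
    return obj
```

Rejecting extra keys as well as missing ones keeps every paper's target representation comparable, which is the property the uniform schema is meant to guarantee.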

## Appendix B Citation subgraph edge features

Table [4](https://arxiv.org/html/2605.14790#A2.T4) lists the eight edge features used by GoR. Every directed edge $u \to v$ in the citation DAG $G_v$ carries these features as labeled key-value pairs, which are emitted in the [EDGE] block of the serialized prompt (Appendix [D](https://arxiv.org/html/2605.14790#A4)). The features cover five categories. *Role* (cited_in_sections, section_weight) describes which seed sections cite the predecessor and how heavily they weight it. *Influence* (cite_count, is_influential_raw, cited_by_subgraph) combines a global citation signal from Semantic Scholar with subgraph-local centrality. *Recency* (delta_year) is plain year arithmetic between seed and predecessor. *Topology* (layer_depth) records hop distance in the 2-hop neighborhood, distinguishing direct references from references-of-references. *Provenance* (low_confidence) flags edges whose section attribution comes from a noisy parse and whose value should be discounted by a careful reader. The *w/o graph* ablation strips all eight features along with the [PREDECESSORS] block, leaving only paper title, year, venue, abstract, and the five-field idea.

Table 4: The eight edge features in $G_v$ used by GoR. Source indicates where each feature is derived. Semantics give the intended meaning a graph-aware reader should assign. The *w/o graph* ablation strips all eight features and the predecessor block, leaving only paper title, year, venue, abstract, and the five-field idea.
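
The eight features can be pictured as one record per edge that is flattened into key-value lines. The following is a minimal sketch, assuming the field order shown in the prompt template of Appendix D; the class name `EdgeFeatures`, the helper `serialize_edge`, and the way list and boolean values are rendered are hypothetical choices, not the released pipeline.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class EdgeFeatures:
    """One directed edge u -> v in the citation DAG (names follow the text)."""
    layer_depth: int               # Topology: hop distance in the 2-hop neighborhood
    cited_in_sections: List[str]   # Role: seed sections that cite the predecessor
    cite_count: int                # Influence
    section_weight: float          # Role
    delta_year: int                # Recency: year gap between seed and predecessor
    is_influential_raw: bool       # Influence
    low_confidence: bool           # Provenance: section attribution came from a noisy parse
    cited_by_subgraph: int         # Influence: subgraph-local centrality


def serialize_edge(edge: EdgeFeatures) -> str:
    """Emit the features as labeled key=value lines under an [EDGE] header."""
    lines = ["[EDGE]"]
    for f in fields(edge):
        # Value rendering via str() is an assumption made for this sketch.
        lines.append(f"{f.name}={getattr(edge, f.name)}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Section names and values below are made up for illustration.
    example = EdgeFeatures(1, ["introduction", "method"], 3, 0.8, 2, True, False, 4)
    print(serialize_edge(example))
```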

## Appendix C Training data year distribution

The 498\-paper training pool spans accepted papers at NeurIPS, ICLR, CVPR, ICML, and ACL between 2020 and 2024, with per\-year counts of 114 \(2020\), 102 \(2021\), 98 \(2022\), 60 \(2023\), and 124 \(2024\)\. The 50\-paper in\-domain validation set draws from the same year span and venue mix and is held out only for NLL\-based checkpoint selection\. The 50\-seed test set is drawn from accepted 2025 papers at the same five venues, distributed as ICML \(13\), ACL \(12\), NeurIPS \(9\), ICLR \(9\), and CVPR \(7\), and is verified leak\-free against the training pool by both title\-string and Semantic\-Scholar paper\-id overlap\.
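
A leak check by title string and paper id can be sketched as a pair of set lookups. The code below is illustrative only: the dictionary keys `title` and `paper_id`, the function names, and the title normalization rule are assumptions, since the text does not specify how titles are compared.

```python
import re


def normalize_title(title: str) -> str:
    """Lowercase and collapse punctuation/whitespace (assumed normalization)."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def find_leaks(train, test):
    """Return test papers whose title or paper id also appears in the training pool.

    Each paper is a dict with a 'title' key and an optional 'paper_id' key.
    """
    train_titles = {normalize_title(p["title"]) for p in train}
    train_ids = {p["paper_id"] for p in train if p.get("paper_id")}
    return [
        p for p in test
        if normalize_title(p["title"]) in train_titles
        or p.get("paper_id") in train_ids
    ]
```

An empty return value corresponds to the leak-free condition described above.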

#### Subgraph\-level statistics\.

The pipeline yields 498 fully annotated subgraphs. After temporal-cone filtering the median subgraph keeps 12 references (mean 12.7, max 30) and 30 edges (mean 32.8, max 111), distributed on average as 22.6 explicit, 5.6 parallel, and 4.6 direct-to-seed edges. As a running example used in the paper, the GPT-3 paper’s 146 references shrink to 22 after Pass 2 scoring and to 18 after temporal-cone filtering, producing 91 edges (55 explicit, 31 parallel, 5 direct-to-seed) and a 7,265-token serialized prompt (Appendix [D](https://arxiv.org/html/2605.14790#A4)).

## Appendix D Prompt serialization format

Each prompt has the schematic structure shown below, with tokens in angle brackets filled per paper\.

    # SEED META
    venue: <venue>  year: <year>

    # CITATION SUBGRAPH (<n> refs, temporally ordered by year)

    ## [1] <Title> (<year>, <venue>)  authors: <authors>
    [EDGE]
    layer_depth=<int>
    cited_in_sections=<list>
    cite_count=<int>
    section_weight=<float>
    delta_year=<int>
    is_influential_raw=<bool>
    low_confidence=<bool>
    cited_by_subgraph=<int>
    [PREDECESSORS]
    - ref_idx=<list>  delta_yr=<int>  edge_type=<parallel_pred|explicit_pred>
    [IDEA -- 5 fields]
    Problem: ...
    Existing Methods: ...
    Motivation: ...
    Proposed Method: ...
    Experiment Plan: ...
    [ABSTRACT] ...

    ## [2] ...
    ...

    # TASK
    Given the SEED META and CITATION SUBGRAPH above, predict the SEED paper’s
    five-field idea.

    # OUTPUT FORMAT (strict)
    {
      "Problem": "...",
      "Existing Methods": "...",
      "Motivation": "...",
      "Proposed Method": "...",
      "Experiment Plan": "..."
    }

The *w/o graph* ablation strips the [EDGE] and [PREDECESSORS] blocks and renames # CITATION SUBGRAPH to # REFERENCES. Every other field is preserved, so the only deliberate delta between the GoR-SFT input and the matched Refs-SFT input is the structural annotation.
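
The ablation transform described above amounts to a small line filter over the serialized prompt. This is a minimal sketch, assuming every block begins on its own line with a `[` marker or a `#` header and runs until the next such line; the function name `strip_graph_blocks` and that block-boundary rule are assumptions, not the authors' code.

```python
def strip_graph_blocks(prompt: str) -> str:
    """Drop [EDGE] and [PREDECESSORS] blocks and rename the subgraph header."""
    out, skipping = [], False
    for line in prompt.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") or stripped.startswith("#"):
            # A new block or header starts here; decide whether to drop it.
            skipping = stripped in ("[EDGE]", "[PREDECESSORS]")
            if skipping:
                continue
        elif skipping:
            # Body line of a dropped block.
            continue
        if stripped.startswith("# CITATION SUBGRAPH"):
            line = line.replace("# CITATION SUBGRAPH", "# REFERENCES", 1)
        out.append(line)
    return "\n".join(out)
```

Because the filter only removes lines, title, year, venue, abstract, and the five-field idea pass through unchanged, matching the stated single difference between the two inputs.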

## Appendix E Training hyperparameters

Table [5](https://arxiv.org/html/2605.14790#A5.T5) lists the hyperparameters shared by GoR-SFT and the matched Refs-SFT ablation. The two runs share the base model, optimizer, schedule, batch size, precision, fused-kernel set, and random seed. The single deliberate difference is the presence of the [EDGE] and [PREDECESSORS] blocks in the input prompt, isolating the contribution of structural annotation under matched 7B capacity. We use $4\times$ A800-80G GPUs in DeepSpeed ZeRO Stage 3 with full fine-tuning, and we report the NLL-best checkpoint on the held-out 50-paper in-domain validation set as the model used for downstream evaluation.

Table 5: Shared hyperparameters for both GoR-SFT and Refs-SFT. The only deliberate difference between the two is the presence of structural blocks in the prompt.

## Appendix F Human evaluation: protocol and results

We complement the LLM\-judge tournament with a small\-scale blinded human study comparing four systems \(GoR\-SFT, Si baseline, CoI\-Agent, ResearchAgent\) on a balanced 5\-seed subset of the 50\-seed test set\. The 5 seeds were drawn to span widely\-cited 2024\-2025 papers across the five training venues with varied head\-to\-head verdicts \(3 GoR\-favorable, 2 Si\-favorable, by mean Elo on T1\) and include MLE\-bench\[[2](https://arxiv.org/html/2605.14790#bib.bib34)\], KAN\[[18](https://arxiv.org/html/2605.14790#bib.bib35)\], rStar\-Math\[[8](https://arxiv.org/html/2605.14790#bib.bib36)\], BigCodeBench\[[38](https://arxiv.org/html/2605.14790#bib.bib37)\], and SWE\-Lancer\[[21](https://arxiv.org/html/2605.14790#bib.bib38)\]\.

#### Recruitment and anonymization\.

We recruited five PhD-track NLP and ML graduate students from the same research community as the authors. None are authors of this paper. Each rater received a private packet containing the 5 topic descriptions (seed paper title and abstract) and the four anonymized 5-field ideas per topic, yielding 1,000 idea-level scores in total. System labels (A, B, C, D) were independently shuffled per (rater, topic) pair to prevent label leakage across raters. Raters were instructed not to look up the actual seed papers during scoring.
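
The per-pair label shuffling can be sketched with a seeded random generator so that packets are reproducible. The function name `assign_labels`, the seed value, and the returned data layout are hypothetical; only the four system names and the A to D labels come from the text.

```python
import random

SYSTEMS = ["GoR-SFT", "Si baseline", "CoI-Agent", "ResearchAgent"]
LABELS = ["A", "B", "C", "D"]


def assign_labels(raters, topics, seed=0):
    """Return {(rater, topic): {label: system}} with an independent shuffle per pair."""
    rng = random.Random(seed)  # fixed seed so packets can be regenerated
    mapping = {}
    for rater in raters:
        for topic in topics:
            order = SYSTEMS[:]
            rng.shuffle(order)
            mapping[(rater, topic)] = dict(zip(LABELS, order))
    return mapping
```

Drawing a fresh permutation for every (rater, topic) pair means that label A in one packet carries no information about label A in any other packet.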

#### Scoring and metrics\.

Raters scored each (topic, system) pair on 10 metrics on a 1–10 integer Likert scale, with anchor descriptions for each metric (1–3, 4–6, 7–8, 9–10). The first five (Novelty, Significance, Feasibility, Clarity, Effectiveness) align with the T1 LLM-judge dimensions to enable cross-population comparison. The second five (Excitement, Soundness, Originality, Reproducibility, Overall) cover reviewer-side properties drawn from Baek et al. [[1](https://arxiv.org/html/2605.14790#bib.bib2)] and Weng et al. [[32](https://arxiv.org/html/2605.14790#bib.bib12)] and the NeurIPS reviewer guidelines.

#### Workload and compensation\.

Each rater submitted $5 \times 4 \times 10 = 200$ scores plus optional comments. Self-reported task time averaged 3 to 4 hours per rater. The study was an internal academic exercise within the authors’ research group, so no IRB protocol was required at our institution and no monetary compensation was offered.

#### Aggregation\.

For each (system, dimension) cell of Table [6](https://arxiv.org/html/2605.14790#A6.T6), we report the mean over $n = 25$ scores (5 seeds $\times$ 5 raters). For inter-rater agreement we report Krippendorff’s $\alpha$ on interval data per dimension, computed by treating each (seed, system) as an item and the five raters as observers. The packet generation script, shuffling seed, and aggregation script are released alongside the paper to enable replication.
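
For readers unfamiliar with the statistic, Krippendorff's alpha with the interval metric can be computed directly from its definition, one minus the ratio of observed to expected disagreement, using squared differences as the distance. The sketch below assumes complete data (every rater scores every item) and uses the standard coincidence-based formulation; the function name is hypothetical and this is not the released aggregation script.

```python
from itertools import permutations


def krippendorff_alpha_interval(ratings):
    """Krippendorff's alpha for interval data without missing values.

    ratings: list of items, each a list of numeric scores (one per rater).
    """
    if any(len(item) < 2 for item in ratings):
        raise ValueError("each item needs at least two ratings")
    values = [v for item in ratings for v in item]
    n_total = len(values)
    # Observed disagreement: squared differences between ratings of the same item.
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(item, 2)) / (len(item) - 1)
        for item in ratings
    ) / n_total
    # Expected disagreement: squared differences between all pooled ratings.
    d_e = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (
        n_total * (n_total - 1)
    )
    if d_e == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1.0 - d_o / d_e
```

Under the item definition given above, one dimension's input would be a list with one entry per (seed, system) pair, each entry holding that pair's scores from all raters.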

#### Results\.

Table [6](https://arxiv.org/html/2605.14790#A6.T6) reports the per-dimension means. The 10 metrics partition cleanly along an execution-versus-creativity axis. GoR-SFT wins the four execution-oriented dimensions Feasibility, Clarity, Soundness, and Reproducibility at 6.96, 7.48, 7.00, and 6.92, plus *Overall* at 6.56 against Si baseline’s 6.48, CoI-Agent’s 6.00, and ResearchAgent’s 4.84. The five creativity-oriented dimensions go entirely to Si baseline, which leads Novelty, Significance, Effectiveness, Excitement, and Originality. ResearchAgent ranks last on 7 of 10 dimensions, with Feasibility at 4.24 and Reproducibility at 4.04, well below the other three systems and consistent with the surface-level retrieval its entity-store paradigm rewards.

#### Inter\-rater agreement\.

Krippendorff’s $\alpha$ on interval data follows the same execution-versus-creativity split: moderate on the four execution dimensions ($\alpha$ from 0.56 to 0.62), fair on *Overall* ($\alpha = 0.40$), and slight to fair on the five creativity dimensions ($\alpha$ from 0.05 to 0.27). GoR-SFT’s wins fall on the dimensions where raters most reliably converge, making the execution-side advantage a clearly discriminated human-judgment signal, while the creativity-side gap to Si baseline sits within the noise floor of expert creativity scoring.

Table 6: Human evaluation on a 5\-seed blinded subset of the test set, scored by 5 PhD raters on a 1–10 Likert scale \(mean over $n=25$ per cell\)\. Best per column in bold\.

## Appendix G Case studies: full idea proposals

This appendix collects two qualitative case studies\. The first \(Case study \#2 below\) is a supplementary example beyond the body case in Section [4\.6](https://arxiv.org/html/2605.14790#S4.SS6) that illustrates a typical narrow\-loss outcome\. The second \(Table [8](https://arxiv.org/html/2605.14790#A7.T8)\) provides the full proposals for the body *AbGen* case in Section [4\.6](https://arxiv.org/html/2605.14790#S4.SS6)\.

#### Case study \#2: *Does Reinforcement Learning Really Incentivize Reasoning?*

A second seed paper from the test set, *Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?* \[[35](https://arxiv.org/html/2605.14790#bib.bib33)\], retains a subgraph spanning 12 refs across RL methods \(DeepSeek\-R1, DeepSeekMath, code\-r1, PPO, REINFORCE, the Sutton textbook\), recent base models \(Qwen2\.5\-1M, Llama 3, OpenAI o1\), and reasoning benchmarks \(Solving Quantitative Reasoning, OlympiadBench\)\. The frontier gap visible here is that no retained ref combines per\-step process supervision \(DeepSeekMath / o1\-style\) with an RLHF training pipeline that learns from human feedback rather than from a fixed reward\.

Reading the four proposals \(Table [7](https://arxiv.org/html/2605.14790#A7.T7)\), GoR\-SFT proposes HumanEvalRLHF, an RLHF framework with an RNN\-based policy gradient and a novel reward approximation scheme, grounded in the PPO and REINFORCE classics from the subgraph\. The Si baseline proposes RRC \(Reinforcement\-Reasoner Circuits\), a graph\-decomposed reasoning step\-reward scheme on GSM\-8K and MATH\. CoI\-Agent proposes a multi\-dimensional RL framework extending RLHF for reasoning\. ResearchAgent \(RA\) proposes ART\-FD, combining self\-validated token gradients on high\-entropy tokens, dynamic theorem creation, and multi\-level feedback loops\. The LLM\-judge verdict: GoR\-SFT wins 14\-6 against the Si baseline and 13\-7 against CoI\-Agent, but narrowly loses 9\-11 to ResearchAgent, primarily on the Feasibility and Clarity dimensions, where RA’s more elaborate multi\-component framework reads as more polished\.

This case complements the AbGen case in Section [4\.6](https://arxiv.org/html/2605.14790#S4.SS6) by illustrating a typical narrow\-loss outcome: GoR\-SFT’s grounding in the RL classics \(PPO, REINFORCE\) yields a coherent but more conservative proposal than RA’s multi\-component synthesis\. The two cases together cover both the decisive\-win and narrow\-loss ends of GoR\-SFT’s distribution against the three published baselines\.
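The per-seed verdicts quoted in this case study (14-6, 13-7, 9-11) each sum to 20, which matches the "5 dimensions, 2 orderings, 2-point scale" description in the table captions. The sketch below shows one way such a tally could work. The point split is an assumption made for illustration (2 points to the preferred proposal per judgment, 1 point each on a tie), and `pair_score` is a hypothetical helper, not the paper's judging code.

```python
def pair_score(judgments):
    """Tally one head-to-head verdict for a single seed.

    judgments: one entry per (dimension, ordering) comparison, each being
    "A", "B", or "tie". With 5 dimensions and 2 presentation orderings
    there are 10 entries, so the two totals sum to 20.
    ASSUMPTION: a judgment is worth 2 points, split 1-1 on a tie.
    """
    points = {"A": 0, "B": 0}
    for verdict in judgments:
        if verdict == "tie":
            points["A"] += 1
            points["B"] += 1
        elif verdict in points:
            points[verdict] += 2
        else:
            raise ValueError(f"unexpected verdict: {verdict!r}")
    return points["A"], points["B"]


# Seven judgments for A and three for B give a 14-6 verdict.
print(pair_score(["A"] * 7 + ["B"] * 3))
# An odd total such as 13-7 needs at least one tie under this assumption.
print(pair_score(["A"] * 6 + ["tie"] + ["B"] * 3))
```

Judging each dimension under both presentation orders is a common guard against position bias in LLM judges, since a proposal that wins only when shown first ends up with a split result instead of a full win.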

Table 7: Case study \#2 on the *Does RL Really Incentivize Reasoning?* seed: abridged proposed methods\. Pair\-Elo is the LLM\-judge tournament verdict on this single seed \(5 dimensions $\times$ 2 orderings $\times$ 2\-point scale, max 20\)\.

#### Case study \#1 \(body\): *AbGen*\.

Table [8](https://arxiv.org/html/2605.14790#A7.T8) below provides the full text of the four proposals summarized in Section [4\.6](https://arxiv.org/html/2605.14790#S4.SS6)\. Each cell reproduces the system’s *Proposed Method* field, lightly trimmed for table fit but preserving the proposal’s components and named\-reference grounding\.

Table 8: Case study on the *AbGen* seed: full proposed methods\. Subgraph nodes referenced are retained refs in Fig\. [5](https://arxiv.org/html/2605.14790#S4.F5) that the proposal cites by name\. Pair\-Elo is the LLM\-judge tournament verdict on this single seed \(5 dimensions $\times$ 2 orderings $\times$ 2\-point scale, max 20\)\.
