EvoSci: A Bio-Inspired Multi-Agent Framework for the Evolution of Scientific Discovery
Summary
EvoSci proposes a bio-inspired multi-agent framework that integrates evolutionary algorithms with knowledge graph modeling to iteratively generate, evaluate, and refine research ideas, achieving top performance in peer-review evaluations.
# EvoSci: A Bio-Inspired Multi-Agent Framework for the Evolution of Scientific Discovery
Source: [https://arxiv.org/html/2605.24018](https://arxiv.org/html/2605.24018)
Xiaoyu Xiong, Yuqi Ren†, Deyi Xiong† TJUNLP Lab, School of Computer Science and Technology, Tianjin University, China \{2025244184, ryq20, dyxiong\}@tju\.edu\.cn
###### Abstract
Large language models \(LLMs\) have shown strong potential in scientific discovery, yet existing methods still face substantial challenges in the design of research workflows and multi\-role collaboration mechanisms\. To mitigate these issues, we propose EvoSci, a multi\-agent scientific collaboration framework, which integrates bio\-inspired evolution with knowledge graph modeling\. To iteratively generate, evaluate, and refine research ideas, EvoSci incorporates multiple role\-based agents, including mentor, researcher, and reviewer\. By combining collaborative reasoning, shared memory, and evolutionary feedback, EvoSci significantly enhances the coherence and creativity of scientific exploration\. Experiments on real\-world research topics demonstrate that EvoSci significantly outperforms strong baselines in LLM\-based structured peer\-review and comparative ranking evaluations, achieving the highest overall peer\-review score \(ICLR 4\.90\) and top ranking \(Top\-10 = 54\)\. These results suggest its superiority in both scientific idea generation and continuous discovery\.
†Corresponding authors\.
## 1 Introduction
With breakthrough developments in knowledge representation Pan et al\. \([2024](https://arxiv.org/html/2605.24018#bib.bib23)\), logical reasoning Ke et al\. \([2025](https://arxiv.org/html/2605.24018#bib.bib14)\); Xu et al\. \([2025](https://arxiv.org/html/2605.24018#bib.bib35)\), and complex multimodal information integration Han et al\. \([2025](https://arxiv.org/html/2605.24018#bib.bib9)\), LLMs are gradually reshaping the paradigm of scientific research Buehler \([2024](https://arxiv.org/html/2605.24018#bib.bib2)\)\. Significant progress has already been made in traditional AI\-powered domains such as mathematical reasoning and theorem proving Trinh et al\. \([2024](https://arxiv.org/html/2605.24018#bib.bib30)\); Zhang and Xiong \([2025](https://arxiv.org/html/2605.24018#bib.bib39)\); Liu et al\. \([2025](https://arxiv.org/html/2605.24018#bib.bib19)\), automatic code generation Li et al\. \([2023](https://arxiv.org/html/2605.24018#bib.bib17)\); Ren et al\. \([2023](https://arxiv.org/html/2605.24018#bib.bib24)\); He et al\. \([2024](https://arxiv.org/html/2605.24018#bib.bib10)\); Yang et al\. \([2025](https://arxiv.org/html/2605.24018#bib.bib37)\), and complex data analysis Sui et al\. \([2024](https://arxiv.org/html/2605.24018#bib.bib27)\)\. Building on these successes, researchers are increasingly exploring the potential of LLMs across broader scientific workflows, including idea and hypothesis generation from large\-scale scientific corpora Kulkarni et al\. \([2025](https://arxiv.org/html/2605.24018#bib.bib16)\); Zhou et al\. \([2024b](https://arxiv.org/html/2605.24018#bib.bib44)\); Wang et al\. \([2024b](https://arxiv.org/html/2605.24018#bib.bib32)\); Yang et al\. \([2024](https://arxiv.org/html/2605.24018#bib.bib38)\); Wang et al\. \([2024a](https://arxiv.org/html/2605.24018#bib.bib31)\), experimental design Desai et al\. \([2025](https://arxiv.org/html/2605.24018#bib.bib4)\); Tian et al\. \([2021](https://arxiv.org/html/2605.24018#bib.bib28)\); Noh et al\. \([2024](https://arxiv.org/html/2605.24018#bib.bib22)\); Tom et al\. \([2024](https://arxiv.org/html/2605.24018#bib.bib29)\), and result interpretation Zheng et al\. \([2023](https://arxiv.org/html/2605.24018#bib.bib42)\); Charness et al\. \([2025](https://arxiv.org/html/2605.24018#bib.bib3)\)\.
However, scientific discovery is not a one\-shot solution but a gradual and evolving process, driven by the continual refinement of research problems and the accumulation of intermediate insights Elliott \([2012](https://arxiv.org/html/2605.24018#bib.bib6)\)\. It is also fundamentally collaborative, relying on the interplay of diverse roles and perspectives Milojević \([2014](https://arxiv.org/html/2605.24018#bib.bib21)\)\. Existing approaches nevertheless often reduce LLMs to static executors within rigid pipelines, overlooking their potential for long\-horizon inquiry and structured coordination\. This raises two central challenges: \(1\) how can LLMs be steered toward progressively deepening scientific problems, and \(2\) how can effective collaboration frameworks be developed to allow multiple agents to engage in sustained, dynamic exploration?
To address these challenges, we propose EvoSci, an Evolutionary Science framework driven by multiple collaborative agents for automatic research ideation\. Inspired by real\-world research teams and biological evolution, EvoSci models scientific discovery as a long\-horizon, iterative exploration, consisting of four stages: Problem Space Construction, Collaborative Research Execution, Research Idea Evaluation, and Bio\-Inspired Evolutionary Iteration\. EvoSci builds upon explicitly defined role\-based agents, including a mentor, a group of researchers, and a reviewer, each responsible for a distinct stage of the ideation process\. Beyond static, pre\-defined pipelines, EvoSci enables adaptive coordination through dynamic task decomposition, where the mentor agent reallocates subtasks based on intermediate feedback, while role\-aware assignment ensures that each agent’s actions remain aligned with its disciplinary background across multiple interaction rounds\. Furthermore, inspired by biological evolution, we iteratively update research ideas by aligning and recombining conceptual knowledge across different domains, thereby enhancing the novelty of scientific exploration\.
To validate the effectiveness and generality of EvoSci, we have conducted systematic experiments across diverse scientific scenarios\. Results show that EvoSci consistently generates more novel and impactful research ideas than strong baselines\. Equipped with DeepSeek\-v3, EvoSci achieves the highest overall peer\-review scores \(ICLR 4\.90 / NeurIPS 3\.95\), surpassing the next best baseline \(4\.68 / 3\.72\) by a large margin, and maintaining consistent advantages in terms of Elo\-based ranking metrics \(Avg Wins 4\.19, Top\-10 Count 47\)\. These results validate that EvoSci achieves superior overall research quality and more reliable relative performance compared to strong baselines\. In summary, the main contributions of our work are as follows:
- We conceptualize scientific discovery as a problem\-oriented process, in which research problems are dynamically generated and progressively refined through a multi\-agent collaboration loop\.
- We construct a heterogeneous multi\-agent framework that mirrors real\-world research laboratories, where diverse agents operate under role\-specific objectives\. The framework is grounded on datasets derived from real scientists, enabling more authentic simulation of collaborative scientific workflows\.
- We implement a multi\-round feedback loop with a bio\-inspired evolution mechanism \(selection, crossover, mutation\) to enable continuous and open\-ended scientific exploration\.
## 2 Related Work
Our work is related to both AI\-driven scientific discovery and multi\-agent systems\. Given space constraints, we briefly review these two topics, restricting the scope to LLM\-based work\.
### 2\.1 AI for Scientific Discovery
Recent advances in LLMs have enabled AI systems to participate more deeply in scientific discovery Zheng et al\. \([2025](https://arxiv.org/html/2605.24018#bib.bib41)\); Xiong et al\. \([2026](https://arxiv.org/html/2605.24018#bib.bib34)\)\. Early systems primarily assist with literature mining Smalheiser and Swanson \([1998](https://arxiv.org/html/2605.24018#bib.bib25)\); Hristovski et al\. \([2005](https://arxiv.org/html/2605.24018#bib.bib11)\) and experiment design King et al\. \([2009](https://arxiv.org/html/2605.24018#bib.bib15)\); Tian et al\. \([2021](https://arxiv.org/html/2605.24018#bib.bib28)\), while recent developments aim at more comprehensive support for the scientific process\. Notably, SciPIP employs three retrieval strategies \(semantic, entity, and co\-occurrence\) to enhance hypothesis generation Wang et al\. \([2024b](https://arxiv.org/html/2605.24018#bib.bib32)\), and the MOOSE framework introduces iterative feedback mechanisms to evolve hypotheses from large\-scale web corpora Yang et al\. \([2024](https://arxiv.org/html/2605.24018#bib.bib38)\)\. Building on these advances, SciAgents integrates multi\-agent reasoning with scientific knowledge graphs to autonomously generate, test, and refine hypotheses Ghafarollahi and Buehler \([2025](https://arxiv.org/html/2605.24018#bib.bib8)\), while CoScientist further extends such capabilities by autonomously planning and executing experimental procedures within chemical research workflows Jansen et al\. \([2025](https://arxiv.org/html/2605.24018#bib.bib13)\)\.
Recent systems have taken a more ambitious step toward end\-to\-end scientific discovery\. For example, AI Scientist Lu et al\. \([2024](https://arxiv.org/html/2605.24018#bib.bib20)\); Yamada et al\. \([2025](https://arxiv.org/html/2605.24018#bib.bib36)\) and CycleResearcher Weng et al\. \([2024](https://arxiv.org/html/2605.24018#bib.bib33)\) automate nearly the entire scientific workflow, from topic formulation and literature exploration to hypothesis generation, experiment simulation, paper writing, and even peer review\. However, existing approaches primarily focus on conducting a single round of research under a fixed initial topic, overlooking the evolutionary and cyclic nature of scientific discovery\. This gap highlights the need for frameworks that support iterative refinement and problem reformulation across successive research cycles\.
Figure 1: Overall workflow of the proposed EvoSci framework\. EvoSci begins with problem space construction from literature and domain knowledge, followed by collaborative research execution through role\-based agents and iterative evaluation with reviewer feedback\. The bio\-inspired evolutionary loop operates over multiple rounds, leveraging feedback to recombine, adapt, and refine research directions\.
### 2\.2 Collaboration in Multi\-Agent Systems
Recent work has increasingly turned to multi\-agent systems \(MAS\) as a way to mitigate the limitations of single LLMs, such as hallucination Huang et al\. \([2025](https://arxiv.org/html/2605.24018#bib.bib12)\), weak long\-horizon reasoning Ferrag et al\. \([2025](https://arxiv.org/html/2605.24018#bib.bib7)\), and stale knowledge Zhang et al\. \([2023](https://arxiv.org/html/2605.24018#bib.bib40)\)\. MAS build on the paradigm of agentic AI Durante et al\. \([2024](https://arxiv.org/html/2605.24018#bib.bib5)\), organizing multiple LLM\-based agents with specialized roles and shared goals\. Through coordinated planning, division of labor, and mutual evaluation, these systems aim to achieve more reliable and scalable cognitive performance than any individual model Zhou et al\. \([2024a](https://arxiv.org/html/2605.24018#bib.bib43)\); Li et al\. \([2025](https://arxiv.org/html/2605.24018#bib.bib18)\)\.
A growing line of studies applies MAS to scientific discovery, demonstrating their potential in iterative and multi\-step research workflows\. VIRSCI simulates a virtual team of scientists engaging in structured idea generation and evaluation Su et al\. \([2025](https://arxiv.org/html/2605.24018#bib.bib26)\), while ResearchAgent coordinates specialized agents for literature analysis, hypothesis generation, and experiment planning Baek et al\. \([2024](https://arxiv.org/html/2605.24018#bib.bib1)\)\. These works largely remain task\- or pipeline\-oriented, and fall short of providing an integrated framework that supports cyclic scientific evolution together with role\-aware, interdisciplinary collaboration\.
## 3 EvoSci
EvoSci models scientific discovery as an evolutionary, problem\-centric process, where research directions are iteratively explored, evaluated, and refined over long horizons\. Accordingly, EvoSci consists of four core components: \(1\) a problem space construction module guided by a mentor agent, \(2\) a collaborative research execution module led by a prime researcher agent, \(3\) an evaluation module that systematically assesses the quality, novelty, and feasibility of generated research ideas, and \(4\) a bio\-inspired evolutionary iteration module that enables iterative refinement of research directions\. Together, these components form a closed\-loop workflow for the iterative generation and refinement of interdisciplinary research ideas\. Figure [1](https://arxiv.org/html/2605.24018#S2.F1) illustrates the overall workflow of EvoSci\.
### 3\.1 Problem Space Construction
A core challenge in automating scientific idea generation lies in the construction of a structured, high\-quality problem space that supports exploration across disciplinary boundaries\. This phase aims to transform an initial research theme into a diverse collection of research problems that are both semantically grounded and structurally expandable\.
Data Preparation\. We construct a lightweight, multi\-level knowledge graph to organize scientific disciplines and their associated entities\. We begin with a predefined set of representative disciplines spanning major scientific fields \(e\.g\., Physics, Chemistry, Biology, Medicine, Economics\), each represented as a first\-layer node\.
For each discipline, we extract candidate entities from its Wikipedia page using the page summary and hyperlink structure\. An LLM\-based classifier assigns each entity a semantic type \(e\.g\., *Theory*, *Model*, *Material*, *Phenomenon*\) and estimates its relevance to the discipline\. Only relevant entities are retained and connected to their corresponding disciplines via *has\_entity* edges\.
To capture cross\-disciplinary connections, we compute cosine similarity between the embedding representations of entities and add a cross\-entity edge if the similarity exceeds a threshold, i\.e\., $s(e_i, e_j) = \cos(\mathrm{emb}(e_i), \mathrm{emb}(e_j)) > \tau$\.
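The thresholding rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the toy embedding vectors, the entity names, and the value `tau=0.8` are hypothetical, and the paper does not specify which embedding model or threshold it uses.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cross_entity_edges(embeddings, tau):
    """Return (entity_i, entity_j, similarity) for every pair with similarity > tau."""
    edges = []
    for (name_i, vec_i), (name_j, vec_j) in combinations(embeddings.items(), 2):
        sim = cosine(vec_i, vec_j)
        if sim > tau:
            edges.append((name_i, name_j, sim))
    return edges


# Hypothetical toy embeddings; real ones would come from an embedding model.
toy = {
    "entropy": [1.0, 0.0, 1.0],
    "information gain": [1.0, 0.1, 0.9],
    "enzyme": [0.0, 1.0, 0.0],
}
edges = cross_entity_edges(toy, tau=0.8)
```

Only the pair whose vectors point in nearly the same direction receives an edge; raising `tau` makes the graph sparser.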
Topic Analysis\. Given a core research topic $T$ and a set of target disciplines $\mathcal{D}_{\mathrm{target}} = \{D_1, \dots, D_n\}$, the system first grounds the topic by using an LLM\-based classifier to map $T$ to one or more core disciplines in the knowledge graph, ensuring that subsequent exploration is anchored in a clear scientific context\. To encourage cross\-disciplinary integration, we adopt a question\-answering architecture with domain expert agents instantiated from a dataset of real\-world scientists \(Appendix [A\.1](https://arxiv.org/html/2605.24018#A1.SS1)\)\. Each expert agent is equipped with literature retrieval and reading capabilities and engages in structured discussions with a mentor agent\. Through this process, the system identifies promising interdisciplinary directions and updates the knowledge graph with domain\-specific entities derived from their exploration trajectories, enabling iterative evolution to support downstream idea generation\.
Bio\-Inspired Entity Evolution\. After the topic analysis concludes, the system enters the problem generation stage by focusing on the intersection between the topic $T$ and a selected discipline $d \in \mathcal{D}_{\mathrm{target}}$\. For discipline $d$, $\mathcal{E}_d$ denotes the set of associated entities in the knowledge graph, which are clustered into semantic entity clusters $\mathcal{C}_d = \{\mathcal{C}_{d,1}, \ldots, \mathcal{C}_{d,n}\}$\. The most topic\-relevant cluster is then explicitly incorporated into the prompt of the mentor agent, which uses these entity clusters as contextual cues to generate scientific research problems:

$$Q_{\mathrm{problem}} = \langle\, T,\; d,\; \mathrm{Top}(\mathcal{C}_d;\, T) \,\rangle, \tag{1}$$

where $\mathrm{Top}(\mathcal{C}_d; T)$ denotes the single entity cluster in $\mathcal{C}_d$ that is most semantically relevant to the topic $T$\. After each exploration round, bio\-inspired evolutionary operations \(i\.e\., *Crossover*, *Variation*, *Inheritance*, and *Selection*\) are applied at the cluster level to the discipline entity space, as detailed in Section [3\.4](https://arxiv.org/html/2605.24018#S3.SS4)\.
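One way to read Eq. (1) is as an arg-max over clusters followed by packing a triple. The sketch below assumes that relevance is measured as cosine similarity between the topic embedding and each cluster centroid; the paper does not state how $\mathrm{Top}$ is computed, so this scoring rule, the helper names, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero, equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def top_cluster(clusters, entity_emb, topic_emb):
    """Return the id of the cluster whose centroid is most similar to the topic."""
    def relevance(cluster_id):
        members = [entity_emb[e] for e in clusters[cluster_id]]
        return cosine(centroid(members), topic_emb)
    return max(clusters, key=relevance)


def build_problem_query(topic, discipline, clusters, entity_emb, topic_emb):
    """Assemble Q_problem = <T, d, Top(C_d; T)> as a plain tuple."""
    best = top_cluster(clusters, entity_emb, topic_emb)
    return (topic, discipline, clusters[best])


# Hypothetical two-dimensional embeddings for illustration only.
entity_emb = {
    "protein folding": [1.0, 0.0],
    "enzyme kinetics": [0.9, 0.1],
    "population dynamics": [0.0, 1.0],
    "food web": [0.1, 0.9],
}
clusters = {
    "molecular": ["protein folding", "enzyme kinetics"],
    "ecological": ["population dynamics", "food web"],
}
query = build_problem_query("drug design", "Biology", clusters, entity_emb, [0.8, 0.2])
```

In the full system the resulting triple would be rendered into the mentor agent's prompt rather than kept as a tuple.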
Structured Problem Cluster Generation\. Driven by LLMs, the system generates a diverse set of candidate scientific problems $\mathcal{Q} = \{q_1, q_2, \dots\}$ by expanding along the evolving entity clusters in the knowledge graph\. Each problem $q_i$ is represented by a concise problem statement $\mathcal{P}(q_i)$, an explanatory description $\mathcal{D}(q_i)$, and a research guidance field $\mathcal{G}(q_i)$\.
The generated problems are subsequently grouped into problem clusters $\mathcal{P} = \{\mathcal{P}_1, \mathcal{P}_2, \dots\}$ according to their primary interdisciplinary foci, forming structured problem sets that emphasize cross\-disciplinary perspectives and conceptual novelty\.
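The three-field problem record and the grouping step can be pictured with a small data structure. This is only a sketch: the `focus` label used as the grouping key is a hypothetical stand-in for the "primary interdisciplinary focus", which in the paper is presumably determined by an LLM, and the example problems are invented.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    statement: str    # concise problem statement, P(q_i)
    description: str  # explanatory description, D(q_i)
    guidance: str     # research guidance field, G(q_i)
    focus: str        # hypothetical label for the primary interdisciplinary focus


def group_into_clusters(problems):
    """Group problems by their focus label, preserving input order within each group."""
    grouped = defaultdict(list)
    for problem in problems:
        grouped[problem.focus].append(problem)
    return dict(grouped)


problems = [
    Problem("Can diffusion models fold proteins?", "Generative priors for structure.",
            "Compare against physics-based baselines.", "ML x Biology"),
    Problem("Do scaling laws hold for enzyme design?", "Data/compute trade-offs.",
            "Start from public sequence corpora.", "ML x Biology"),
    Problem("Can market models explain citation dynamics?", "Attention as currency.",
            "Use bibliometric data.", "Economics x Scientometrics"),
]
problem_clusters = group_into_clusters(problems)
```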
### 3\.2 Collaborative Research Execution
Building on the structured problem clusters, the system proceeds to a multi\-agent research exploration phase\. A set of role\-specialized agents collaboratively investigates selected problem clusters through literature review, structured discussion, and iterative idea generation and refinement, simulating the workflow of real scientific teams\.
Problem Confirmation and Team Assembly\. From the candidate problem clusters $\mathcal{P}$, the system selects a target cluster $\mathcal{P}^{*}$ by jointly considering its relevance to the initial topic, interdisciplinary potential, and future extensibility\. A research team is then assembled to explore $\mathcal{P}^{*}$\. The prime researcher and assistant researchers are instantiated from a real\-world scientist dataset, where each agent is represented by anonymized metadata and semantic behavior embeddings\. The prime researcher is selected to align with the initial topic, while assistant researchers are chosen based on their relevance to $\mathcal{P}^{*}$, facilitating effective interdisciplinary collaboration\.
Task Decomposition and Team Collaboration\. Given a selected problem cluster $P^{*}$, the Prime Researcher organizes a collaborative research process by decomposing the exploration into a set of structured tasks $\mathcal{T} = \{t_1, t_2, \dots\}$\. These tasks correspond to key stages of scientific inquiry, including background investigation, problem analysis, idea generation, and iterative refinement\. We adopt the CrewAI framework to organize the agent team and establish a dynamic delegation mechanism based on a “lead\-and\-collaborate” interaction paradigm, which supports the following collaboration processes:
- Task Leading and Delegation: Each core task is initiated and led by the Prime Researcher, who analyzes the task context, decomposes it into subtasks, and assigns them to assistant agents based on semantic relevance and skill profiles\.
- Recursive Delegation and Collaboration: Assistant agents may further decompose assigned subtasks and reassign them when necessary, resulting in a recursive collaboration structure\.
- Phased Integration: At predefined stages of task execution, structured discussion rounds are conducted to aggregate intermediate results, align perspectives across agents, and integrate subtask outcomes\.
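The lead, delegate, and integrate pattern in the list above can be illustrated with a small recursive function. This sketch deliberately does not use the CrewAI API; the `Task` class, the `pick_assistant` routing rule, the `max_depth` guard, and the agent names are hypothetical, and string concatenation stands in for actual LLM calls.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    description: str
    subtasks: list = field(default_factory=list)  # nested Task objects


def execute(task, agent, pick_assistant, depth=0, max_depth=2):
    """Lead a task: delegate each subtask recursively, then integrate the results."""
    if not task.subtasks or depth >= max_depth:
        # Leaf task (or depth limit reached): the agent handles it directly.
        return [f"{agent}: {task.description}"]
    log = []
    for sub in task.subtasks:
        assistant = pick_assistant(sub)  # assumed relevance-based routing
        log.extend(execute(sub, assistant, pick_assistant, depth + 1, max_depth))
    # Phased integration: the leading agent aggregates subtask outcomes.
    log.append(f"{agent}: integrate {task.description}")
    return log


def pick_assistant(sub):
    """Toy routing rule standing in for semantic relevance and skill profiles."""
    return "bio_assistant" if "literature" in sub.description else "cs_assistant"


root = Task("idea generation", [
    Task("survey literature"),
    Task("draft hypotheses", [Task("check feasibility")]),
])
log = execute(root, "prime", pick_assistant)
```

The depth guard is one simple way to keep recursive delegation from growing without bound; the paper does not say how it limits recursion.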
Formally, for a task $t_k$, we define the research state and prompt as:

$$\mathcal{S}_k = \langle\, t_{\mathrm{description}},\; \bigcup_{i=1}^{k-1} \mathcal{R}_i,\; \mathcal{R}_k^{\mathrm{sub}},\; a_k \,\rangle, \tag{2}$$

$$Q_{\mathrm{research}} = \langle\, T,\; P^{*},\; \mathcal{S}_k \,\rangle, \tag{3}$$

where $t_{\mathrm{description}}$ denotes the description of the current task or subtask under exploration, $\bigcup_{i=1}^{k-1} \mathcal{R}_i$ represents the aggregated responses accumulated from previously completed tasks, $\mathcal{R}_k^{\mathrm{sub}}$ denotes the intermediate responses produced for subtasks under the current task $t_k$, and $a_k$ denotes the agent assigned to execute $t_k$\. The research prompt $Q_{\mathrm{research}}$ combines the initial topic $T$, the selected problem cluster $P^{*}$, and the current research state $\mathcal{S}_k$ to guide the task execution\. In practice, the accumulated responses in $\mathcal{S}_k$ are maintained through a hierarchical memory mechanism, including short\-term memory, long\-term memory, and entity memory, which enables compact context representation and more precise summarization across tasks\.
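Equations (2) and (3) describe a record and a prompt built from it. A minimal sketch follows, assuming the prompt is a plain-text rendering of the tuple; the field names, the prompt layout, and the example strings are hypothetical, and the hierarchical memory mechanism is not modeled here.

```python
from dataclasses import dataclass, field


@dataclass
class ResearchState:
    task_description: str                                   # t_description
    previous_responses: list = field(default_factory=list)  # union of R_1 .. R_{k-1}
    sub_responses: list = field(default_factory=list)       # R_k^sub
    agent: str = ""                                         # a_k


def build_research_prompt(topic, problem_cluster, state):
    """Render Q_research = <T, P*, S_k> as a plain-text prompt."""
    history = "\n".join(f"- {r}" for r in state.previous_responses) or "- (none)"
    partial = "\n".join(f"- {r}" for r in state.sub_responses) or "- (none)"
    return (
        f"Topic: {topic}\n"
        f"Problem cluster: {problem_cluster}\n"
        f"Assigned agent: {state.agent}\n"
        f"Current task: {state.task_description}\n"
        f"Results of earlier tasks:\n{history}\n"
        f"Subtask results so far:\n{partial}"
    )


state = ResearchState(
    task_description="idea generation",
    previous_responses=["survey found gap X"],
    agent="prime_researcher",
)
prompt = build_research_prompt("protein design", "ML x Biology", state)
```

Keeping earlier results and current subtask results in separate fields mirrors the split between $\bigcup_{i=1}^{k-1} \mathcal{R}_i$ and $\mathcal{R}_k^{\mathrm{sub}}$ in Eq. (2).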
Seed Idea Generation and Refinement\. Through domain investigation and collaborative discussion, the system generates diverse seed ideas, broadening the exploration space and increasing the likelihood of high\-quality discoveries\. These ideas are refined through multi\-agent collaboration, where redundant or low\-value ideas are removed and the remaining ideas are ranked by novelty, feasibility, and interdisciplinary value\.
### 3\.3 Research Idea Evaluation
To further refine generated ideas, a reviewer agent is engaged to simulate a peer\-review process\. Each generated idea in $\mathcal{I} = \{I_1, I_2, \dots\}$ is evaluated along multiple dimensions, including novelty, feasibility, validity, and scientific excitement\. In addition to quantitative scores, the reviewer agent provides structured feedback by identifying logical weaknesses and suggesting complementary experimental directions\.
Formally, the evaluation results can be expressed as:
$$E(\mathcal{I}) = \langle\, \mathrm{title},\; s,\; r,\; c,\; \delta \,\rangle, \tag{4}$$

where $\mathrm{title}$ denotes the title of the evaluated idea, $s = (s_{\mathrm{nov}}, s_{\mathrm{fea}}, s_{\mathrm{eff}}, s_{\mathrm{exc}}, s_{\mathrm{overall}})$ is a vector of scores assessing novelty, feasibility, expected effectiveness, scientific excitement, and overall quality, and $r = (r_{\mathrm{nov}}, r_{\mathrm{fea}}, r_{\mathrm{eff}}, r_{\mathrm{exc}}, r_{\mathrm{overall}})$ denotes the corresponding rationales\. $c$ denotes the reviewer’s confidence level, and $\delta$ denotes concrete suggestions for improving the idea\.
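The evaluation tuple in Eq. (4) maps naturally onto a record whose score and rationale vectors share the same five dimensions. The sketch below is an assumed representation, not the authors' schema: the class name, the dictionary layout, and the example values are hypothetical, and no score range is enforced because the paper does not state one here.

```python
from dataclasses import dataclass

DIMENSIONS = ("nov", "fea", "eff", "exc", "overall")


@dataclass
class Evaluation:
    title: str          # title of the evaluated idea
    scores: dict        # s: one score per dimension
    rationales: dict    # r: one rationale per dimension
    confidence: float   # c: reviewer confidence
    suggestions: list   # delta: concrete improvement suggestions

    def __post_init__(self):
        # Every dimension must have both a score and a rationale.
        for name, mapping in (("scores", self.scores), ("rationales", self.rationales)):
            missing = [d for d in DIMENSIONS if d not in mapping]
            if missing:
                raise ValueError(f"{name} is missing dimensions: {missing}")


review = Evaluation(
    title="Evolutionary prompts for enzyme design",
    scores={"nov": 5.0, "fea": 4.0, "eff": 4.5, "exc": 5.0, "overall": 4.5},
    rationales={d: f"rationale for {d}" for d in DIMENSIONS},
    confidence=0.8,
    suggestions=["Add a wet-lab validation plan."],
)
```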
### 3\.4 Bio\-Inspired Evolutionary Iteration
Scientific idea generation is inherently iterative rather than one\-shot\. To support long\-term and open\-ended discovery, EvoSci introduces a bio\-inspired evolutionary loop that operates on the entity layer of the knowledge graph and is guided by structured evaluation feedback\.
Entity\-Level Evolution\. Building on the multi\-level knowledge graph, the discipline layer remains static as a structural backbone, while the entity layer is treated as an evolving population\. For each discipline, entities are organized into semantic clusters, which serve as the basic units of evolution\. Across successive exploration rounds, these entity clusters are iteratively updated to refine cross\-disciplinary exploration\.
Concretely, the system applies a set of bio\-inspired evolutionary operations at the cluster level\.
- Crossover: Enables recombination by exchanging entities between different semantic clusters within the same discipline, producing novel concept combinations while preserving domain coherence\.
- Variation: Injects diversity by introducing new or low\-frequency entities into existing clusters, preventing premature convergence\.
- Selection: Filters entity clusters based on evaluation feedback from generated ideas, favoring clusters that exhibit higher novelty, feasibility, and relevance to the research topic\.
- Inheritance: Propagates high\-fitness entities and clusters into subsequent iterations, ensuring that valuable knowledge accumulates over time\.
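The four cluster-level operations can be sketched as small functions over lists of entity names. This is an illustrative reading rather than the authors' implementation: representing a cluster as a list, using a scalar `fitness` per cluster (assumed to be derived from reviewer scores), swapping `k` random entities, and all example names are assumptions.

```python
import random


def crossover(cluster_a, cluster_b, rng, k=1):
    """Swap k randomly chosen entities between two clusters of the same discipline."""
    a, b = list(cluster_a), list(cluster_b)
    k = min(k, len(a), len(b))
    for i, j in zip(rng.sample(range(len(a)), k), rng.sample(range(len(b)), k)):
        a[i], b[j] = b[j], a[i]
    return a, b


def variation(cluster, candidate_pool, rng, k=1):
    """Append up to k entities from the pool that the cluster does not yet contain."""
    fresh = [e for e in candidate_pool if e not in cluster]
    return list(cluster) + rng.sample(fresh, min(k, len(fresh)))


def selection(clusters, fitness, keep):
    """Keep the `keep` clusters with the highest fitness."""
    ranked = sorted(clusters, key=lambda name: fitness[name], reverse=True)
    return {name: clusters[name] for name in ranked[:keep]}


def inheritance(survivors, offspring):
    """Carry survivors unchanged into the next generation alongside the offspring."""
    next_generation = dict(offspring)
    next_generation.update(survivors)
    return next_generation


rng = random.Random(0)
clusters = {
    "thermo": ["entropy", "free energy"],
    "kinetics": ["catalysis", "diffusion"],
    "optics": ["refraction"],
}
fitness = {"thermo": 5.2, "kinetics": 4.8, "optics": 3.1}  # hypothetical feedback scores

survivors = selection(clusters, fitness, keep=2)
child_a, child_b = crossover(survivors["thermo"], survivors["kinetics"], rng)
mutated = variation(child_a, ["quantum tunneling", "entropy"], rng)
next_gen = inheritance(survivors, {"thermo+kinetics": mutated, "kinetics+thermo": child_b})
```

Crossover only rearranges existing entities, so the combined multiset of the two parents is preserved; variation is the only operator that introduces entities from outside.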
Evaluation\-Guided Loop\. After each exploration round, refined ideas and their evaluations are summarized by the Prime Researcher and passed to the Mentor Agent\. High\-fitness entities identified from successful ideas are re\-integrated into the knowledge graph, while low\-value entities are pruned\. The updated entity clusters then serve as seeds for reconstructing the next problem set, forming a closed evolutionary loop that balances exploration and consolidation\.
## 4 Experiments
We designed an experimental setup to examine EvoSci’s adaptability across diverse scientific domains\. Ten representative and challenging open research topics were selected from the task settings introduced in AI Scientist Lu et al\. \([2024](https://arxiv.org/html/2605.24018#bib.bib20)\)\. Detailed task settings and corresponding experimental prompts are provided in Appendices [B](https://arxiv.org/html/2605.24018#A2) and [F](https://arxiv.org/html/2605.24018#A6)\.
### 4\.1 Evaluation Methodologies
To comprehensively assess the effectiveness and creativity of our multi\-agent scientific system, we adopted both qualitative and quantitative evaluations, combining an expert\-simulated review mechanism with a tournament\-style idea ranking procedure\. We also conducted a further experiment to validate the effectiveness and stability of the proposed meta\-review mechanism \(see Appendix [E](https://arxiv.org/html/2605.24018#A5)\)\.
Multi\-Reviewer \+ Meta\-Reviewer Mechanism\. Inspired by the AI Scientist evaluation framework Lu et al\. \([2024](https://arxiv.org/html/2605.24018#bib.bib20)\), we designed a structured LLM\-driven peer\-review workflow that simulates academic conference reviewing\. Reviewer agents independently assess generated ideas using prompts aligned with ICLR and NeurIPS review templates, each incorporating a reflection mechanism to refine their evaluations\. A meta\-reviewer agent then aggregates individual reviews into a unified meta\-review, enabling interpretable, reproducible, and academically representative evaluation\.
Tournament\-Style Idea Ranking\. In addition, we implemented a comparative ranking procedure to evaluate idea quality\. All generated ideas were pooled, randomized, and initially assigned one point\. Ideas were paired and compared using the prompt: *“One of them is accepted by a top AI conference \(like ICLR or ACL\) and the other one is rejected\.”* The winner of each comparison received one point\. This process was repeated for five rounds, and the final ranking was determined by aggregated scores\. This tournament\-style evaluation provided a robust relative measure of idea quality, complementing the structured review mechanism\. Prior studies have also shown that such pairwise comparison, rather than absolute scoring, enables LLM\-based evaluations to better align with human expert judgments\.
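The scoring procedure described above can be sketched as follows. Several details are assumptions, because the text does not specify them: ideas are re-shuffled each round and then paired with neighbors of similar score, an unpaired idea in an odd-sized pool simply sits out the round, and `judge` is a hypothetical callable standing in for the LLM comparison prompt.

```python
import random


def tournament(ideas, judge, rounds=5, seed=0):
    """Run pairwise comparison rounds and return (ranking, scores)."""
    rng = random.Random(seed)
    scores = {idea: 1 for idea in ideas}  # every idea starts with one point
    for _ in range(rounds):
        order = list(ideas)
        rng.shuffle(order)  # randomize, then pair ideas with similar scores
        order.sort(key=lambda idea: scores[idea], reverse=True)  # stable sort
        for a, b in zip(order[0::2], order[1::2]):
            scores[judge(a, b)] += 1  # the winner of each comparison gains a point
    ranking = sorted(ideas, key=lambda idea: scores[idea], reverse=True)
    return ranking, scores


# Hypothetical stand-in for the LLM judge: prefer the idea with higher hidden quality.
quality = {"A": 4, "B": 3, "C": 2, "D": 1}


def judge(a, b):
    return a if quality[a] > quality[b] else b


ranking, scores = tournament(["A", "B", "C", "D"], judge)
```

With four ideas there are two comparisons per round, so ten points are awarded over five rounds on top of the four starting points.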
### 4\.2 Main Results
To systematically verify the overall effectiveness of EvoSci, we conducted experiments on the ten representative research topics described above\. Our system was compared against four baseline methods: SciPIP, AI Scientist, CoI\-Agent, and VirSci \(detailed descriptions are provided in Appendix [C](https://arxiv.org/html/2605.24018#A3)\)\. For fairness, each method was configured to generate 10 research ideas per topic, following comparable settings on iteration rounds, team size, and evaluation feedback\. The generated ideas were evaluated using two mechanisms: an expert\-simulated review and a tournament\-style ranking\. The aggregated review results are reported in Table [1](https://arxiv.org/html/2605.24018#S4.T1), while the tournament\-style ranking results are reported in Table [2](https://arxiv.org/html/2605.24018#S4.T2)\.
Table 1: Subjective evaluation scores of various agent models and methods\.

| Agent Model | Method | Novelty | Feasibility | Validity | Excitement | ICLR Overall | NeurIPS Overall |
|---|---|---|---|---|---|---|---|
| GPT-4o | AI Scientist | 4.31 | 6.49 | 4.87 | 4.11 | 4.33 | 3.19 |
| GPT-4o | SciPIP | 4.72 | 4.76 | 4.74 | 4.49 | 4.26 | 3.06 |
| GPT-4o | VirSci | 5.12 | 3.64 | 4.56 | 4.95 | 4.28 | 3.26 |
| GPT-4o | CoI-Agent | 4.52 | 4.77 | 4.79 | 4.33 | 4.20 | 3.21 |
| GPT-4o | EvoSci | 4.78 | 4.62 | 5.01 | 4.75 | 4.45 | 3.44 |
| DeepSeek-v3 | AI Scientist | 4.60 | 6.85 | 5.00 | 4.29 | 4.68 | 3.39 |
| DeepSeek-v3 | SciPIP | 4.42 | 5.80 | 4.85 | 4.13 | 4.34 | 3.02 |
| DeepSeek-v3 | VirSci | 5.48 | 3.95 | 4.88 | 5.11 | 4.50 | 3.69 |
| DeepSeek-v3 | CoI-Agent | 5.07 | 4.66 | 4.92 | 4.58 | 4.51 | 3.72 |
| DeepSeek-v3 | EvoSci | 5.71 | 4.68 | 5.25 | 5.15 | 4.90 | 3.95 |
| Qwen3-max | AI Scientist | 4.18 | 7.00 | 5.04 | 4.15 | 4.48 | 3.31 |
| Qwen3-max | SciPIP | 4.87 | 5.41 | 4.89 | 4.53 | 4.54 | 3.17 |
| Qwen3-max | VirSci | 5.37 | 4.01 | 4.90 | 5.00 | 4.35 | 3.57 |
| Qwen3-max | CoI-Agent | 4.64 | 4.38 | 4.76 | 4.40 | 4.19 | 3.62 |
| Qwen3-max | EvoSci | 5.14 | 4.98 | 5.20 | 4.89 | 4.72 | 3.81 |
Table 2: Average wins and top 10 counts for various agent models and methods\.

| Agent Model | Method | Avg Wins | Top 10 Count |
|---|---|---|---|
| GPT-4o | AI Scientist | 3.88 | 13 |
| GPT-4o | SciPIP | 2.70 | 7 |
| GPT-4o | VirSci | 4.07 | 52 |
| GPT-4o | CoI-Agent | 3.58 | 36 |
| GPT-4o | EvoSci | 4.27 | 54 |
| DeepSeek-v3 | AI Scientist | 2.83 | 8 |
| DeepSeek-v3 | SciPIP | 2.50 | 3 |
| DeepSeek-v3 | VirSci | 3.90 | 35 |
| DeepSeek-v3 | CoI-Agent | 4.08 | 37 |
| DeepSeek-v3 | EvoSci | 4.19 | 47 |
| Qwen3-max | AI Scientist | 2.92 | 7 |
| Qwen3-max | SciPIP | 2.39 | 1 |
| Qwen3-max | VirSci | 3.94 | 34 |
| Qwen3-max | CoI-Agent | 4.00 | 39 |
| Qwen3-max | EvoSci | 4.25 | 50 |
From Table [1](https://arxiv.org/html/2605.24018#S4.T1), we observe that EvoSci achieves the highest Validity, ICLR Overall, and NeurIPS Overall scores with every backbone model, and with DeepSeek\-v3 it also leads on Novelty and Excitement, indicating that the generated research ideas are not only credible but also engaging\. Baselines such as AI Scientist or VirSci exhibit isolated strengths on individual criteria \(AI Scientist on Feasibility; VirSci on Novelty and Excitement with GPT\-4o and Qwen3\-max\), but their overall performance remains uneven, whereas our framework provides more balanced scores across aspects\. On the two overall measures, the quoted comparisons correspond to relative gains of roughly 4–5% in ICLR Overall \(e\.g\., 4\.90 vs\. 4\.68 with DeepSeek\-v3, 4\.72 vs\. 4\.54 with Qwen3\-max\) and 15–17% in NeurIPS Overall \(e\.g\., 3\.95 vs\. 3\.39 with DeepSeek\-v3, 3\.81 vs\. 3\.31 with Qwen3\-max, both against AI Scientist\)\. These advantages are more pronounced when paired with stronger backbone models\.
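As a quick arithmetic check, the relative gain for each score pair quoted in this paragraph can be recomputed directly from the Table 1 values. The helper name and dictionary layout below are illustrative only.

```python
def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


# (EvoSci score, baseline score) pairs as quoted in the text and Table 1.
pairs = {
    ("DeepSeek-v3", "ICLR"): (4.90, 4.68),
    ("Qwen3-max", "ICLR"): (4.72, 4.54),
    ("DeepSeek-v3", "NeurIPS"): (3.95, 3.39),
    ("Qwen3-max", "NeurIPS"): (3.81, 3.31),
}
gains = {key: round(relative_gain(*scores), 1) for key, scores in pairs.items()}
```

For these pairs the computed gains are about 4.7% and 4.0% for ICLR Overall and about 16.5% and 15.1% for NeurIPS Overall.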
The complementary tournament\-style evaluation further corroborates these findings \(Table [2](https://arxiv.org/html/2605.24018#S4.T2)\)\. EvoSci obtains the highest Avg Wins across all backbones \(4\.27 with GPT\-4o, 4\.19 with DeepSeek\-v3, 4\.25 with Qwen3\-max\) and the largest Top 10 Count\. Together, these results demonstrate that our framework reliably produces more credible, exciting, and impactful research ideas than existing baselines\.
### 4\.3 Ablation Study
To further examine the contribution of individual components within the proposed framework, we designed three ablation experiments\.
Impact of Structured Problem Formulation\. To evaluate the effect of the problem formulation module, we compared two settings:
\(1\) *w/ Problem Guidance*: The full system, where the Mentor Agent actively interprets the initial keyword, retrieves relevant literature, constructs a structured problem space, and generates problem clusters to guide downstream research;
\(2\) *w/o Problem Guidance*: A simplified version using the raw prompt template from AI Scientist without problem construction, where the Prime Researcher directly explores the given topic\.
Each setting generates 10 research ideas for each of the ten benchmark topics \(100 ideas in total\), following the unified evaluation protocol\.
As shown in Table [3](https://arxiv.org/html/2605.24018#S4.T3), the system with explicit problem construction achieves higher scores across most dimensions, particularly in *novelty* and *excitement*, indicating that structured problem formulation effectively focuses the research direction while broadening the exploration boundary. This demonstrates its essential role in establishing semantic anchoring and path guidance for automated scientific discovery.

Table 3: The impact of problem formulation on research quality.

| Setting | Novelty | Feasibility | Validity | Excitement | ICLR Overall | NeurIPS Overall |
| --- | --- | --- | --- | --- | --- | --- |
| w/ Problem Guidance | 4.78 | 4.62 | 5.01 | 4.75 | 4.45 | 3.44 |
| w/o Problem Guidance | 4.22 | 4.75 | 4.96 | 4.51 | 4.22 | 3.28 |
Role of Multi-agent Collaboration. To evaluate the effect of multi-agent collaboration, we compare systems with different team sizes: a single-agent setting (*team_size = 1*), a standard collaborative setting (*team_size = 3*), and extended collaborative settings (*team_size = 5*, *team_size = 7*, and *team_size = 9*).

Figure 2: The impact of team size on research quality.

As shown in Figure [2](https://arxiv.org/html/2605.24018#S4.F2), increasing team size initially yields consistent improvements across novelty, validity, and overall quality, with performance peaking at team_size = 5. Beyond this point, however, we observe a clear degradation at team_size = 7 and 9. This pattern suggests that while moderate levels of agent diversity enhance idea generation and evaluation, excessively large teams incur substantial coordination overhead and conflicting reasoning trajectories. Consequently, the marginal utility of additional agents becomes negative, indicating that optimal research performance emerges from moderately sized rather than maximal teams.
Table 4: Evolution ablation results. We report topic-level average scores under NeurIPS-style and ICLR-style review templates, with and without the evolutionary module.

| Topic | NeurIPS (w/o Evo) | NeurIPS (w/ Evo) | ICLR (w/o Evo) | ICLR (w/ Evo) |
| --- | --- | --- | --- | --- |
| 2D Diffusion | 3.40 | 3.32 | 4.56 | 4.50 |
| Character-Level Language Modeling | 3.48 | 3.52 | 4.50 | 4.46 |
| Earthquake Prediction | 3.40 | 3.54 | 4.42 | 4.44 |
| Grokking | 3.26 | 3.22 | 4.16 | 4.08 |
| NanoGPT | 3.44 | 3.66 | 4.40 | 4.66 |
| Materials Adaptive Convolutional Equivariants | 3.30 | 3.16 | 4.20 | 4.10 |
| SEIR Infection Modeling | 3.38 | 3.36 | 4.38 | 4.42 |
| Sketch Generation with RNNs | 3.40 | 3.42 | 4.34 | 4.52 |
| Multi-Dataset CNN Optimization | 3.32 | 3.48 | 4.16 | 4.28 |
| TensoRF | 3.42 | 3.56 | 4.22 | 4.18 |
| Average | 3.38 | 3.424 | 4.334 | 4.364 |
Effect of Evolutionary Iteration\.To evaluate the effect of the evolutionary module, we compared two settings:
\(1\)*w/ Evo*: the full system with bio\-inspired evolutionary iteration enabled;
\(2\)*w/o Evo*: a simplified version in which the evolutionary module is disabled, while all other components and evaluation settings remain unchanged\.
As shown in Table [4](https://arxiv.org/html/2605.24018#S4.T4), enabling evolution raises the topic-level score for 6 of the 10 topics under the NeurIPS-style review and for 5 of the 10 topics under the ICLR-style template. Although not every individual topic increases, the cross-topic average rises under both templates. Aggregating across topics (see the Average row), enabling evolution yields an increase in overall performance under identical computational budgets and evaluation settings (NeurIPS: 3.380 → 3.424; ICLR: 4.334 → 4.364). Although the absolute gains are moderate, they are systematic rather than driven by isolated instances, indicating that improvements cannot be solely attributed to iterative feedback or memory accumulation.
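The aggregation behind the Average row can be checked with a short script. The sketch below transcribes the per-topic values from Table 4 and recomputes the column means together with the number of topics whose score increases when evolution is enabled. The names `TABLE_4` and `summarize` are illustrative and not part of EvoSci.

```python
# Per-topic scores transcribed from Table 4, in the order:
# (NeurIPS w/o Evo, NeurIPS w/ Evo, ICLR w/o Evo, ICLR w/ Evo)
TABLE_4 = {
    "2D Diffusion": (3.40, 3.32, 4.56, 4.50),
    "Character-Level Language Modeling": (3.48, 3.52, 4.50, 4.46),
    "Earthquake Prediction": (3.40, 3.54, 4.42, 4.44),
    "Grokking": (3.26, 3.22, 4.16, 4.08),
    "NanoGPT": (3.44, 3.66, 4.40, 4.66),
    "Materials Adaptive Convolutional Equivariants": (3.30, 3.16, 4.20, 4.10),
    "SEIR Infection Modeling": (3.38, 3.36, 4.38, 4.42),
    "Sketch Generation with RNNs": (3.40, 3.42, 4.34, 4.52),
    "Multi-Dataset CNN Optimization": (3.32, 3.48, 4.16, 4.28),
    "TensoRF": (3.42, 3.56, 4.22, 4.18),
}


def summarize(col_without, col_with):
    """Return (mean w/o Evo, mean w/ Evo, number of topics that improved)."""
    rows = list(TABLE_4.values())
    without = [row[col_without] for row in rows]
    with_evo = [row[col_with] for row in rows]
    improved = sum(w > wo for wo, w in zip(without, with_evo))
    mean_without = round(sum(without) / len(without), 3)
    mean_with = round(sum(with_evo) / len(with_evo), 3)
    return mean_without, mean_with, improved


neurips = summarize(0, 1)
iclr = summarize(2, 3)
print("NeurIPS (mean w/o, mean w/, topics improved):", neurips)
print("ICLR    (mean w/o, mean w/, topics improved):", iclr)
```

The recomputed means agree with the Average row of Table 4; the per-topic counts are 6 of 10 under the NeurIPS-style template and 5 of 10 under the ICLR-style template.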
Table 5: Overall comparison between w/o and w/ evolution.

| Template | Metric | w/o Evo | w/ Evo | Δ (w/ – w/o) |
| --- | --- | --- | --- | --- |
| NeurIPS | Mean | 3.380 | 3.424 | +0.044 |
| NeurIPS | Within-topic Std (avg over topics) | 0.146 | 0.176 | +0.030 |
| ICLR | Mean | 4.334 | 4.364 | +0.030 |
| ICLR | Within-topic Std (avg over topics) | 0.154 | 0.157 | +0.003 |
Table [5](https://arxiv.org/html/2605.24018#S4.T5) further summarizes the aggregated comparison. In addition to the consistent mean shifts, we examine within-topic variability across five independent runs per topic. Under the NeurIPS-style evaluation, enabling evolution leads to a moderate increase in within-topic standard deviation (0.146 → 0.176), suggesting that the evolutionary operators introduce stronger exploration dynamics rather than simply repeating deterministic refinement steps. Under the ICLR-style template, within-topic variability remains largely comparable (0.154 → 0.157). This suggests that different evaluation templates may exhibit varying sensitivity to quality differences, rather than indicating inconsistent behavior of the evolution model itself. Taken together, these ablations isolate two primary drivers of improvement, namely workflow-level role specialization and evolutionary search, and clarify how each component contributes to performance gains.
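A within-topic standard deviation averaged over topics, as reported in Table 5, could be computed along the following lines. This is a minimal sketch under stated assumptions: the run-level scores are hypothetical placeholders (the paper does not list them), and the sample standard deviation (n − 1 denominator) is assumed, since the text does not say which estimator was used.

```python
import statistics

# Hypothetical run-level scores, five independent runs per topic.
# These are placeholders for illustration, NOT values from the paper.
runs_by_topic = {
    "topic_a": [3.2, 3.4, 3.5, 3.3, 3.6],
    "topic_b": [3.0, 3.1, 3.3, 3.2, 3.4],
}


def avg_within_topic_std(runs):
    """Average, over topics, of the sample standard deviation across runs.

    Assumption: sample std (statistics.stdev); use statistics.pstdev
    instead if the population estimator is intended.
    """
    per_topic_std = [statistics.stdev(scores) for scores in runs.values()]
    return statistics.mean(per_topic_std)


print(round(avg_within_topic_std(runs_by_topic), 3))
```

Computing the spread per topic first and averaging afterwards keeps differences in topic difficulty from inflating the variability estimate.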
## 5 Analysis
To better understand the behavior of EvoSci beyond primary performance results, we conducted a set of additional *analytical experiments* that probed the system from complementary perspectives. More detailed experimental analyses and the corresponding quantitative results for each topic are reported in Appendix [D](https://arxiv.org/html/2605.24018#A4).
Evolution of Ideas under Iterative Evaluation. To analyze the effect of evaluation-guided iteration on idea evolution, we conducted a qualitative analysis on *Grokking* by tracking how generated ideas changed across feedback rounds. For each iteration, we extracted technical terms from the generated ideas and visualized their cumulative growth, together with changes in semantic organization, as shown in Figure [3](https://arxiv.org/html/2605.24018#A4.F3) (Appendix [D.1](https://arxiv.org/html/2605.24018#A4.SS1)), to characterize the evolution of conceptual structure over time.
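The cumulative term-growth curve described above can be sketched as follows. The example assumes that term extraction is a case-insensitive, whole-phrase match against a fixed terminology list; the term list, idea texts, and function names are hypothetical and only illustrate the bookkeeping, not the paper's actual extraction pipeline.

```python
import re

# Hypothetical terminology list and idea texts, for illustration only.
TERMS = ["weight decay", "phase transition", "delayed generalization", "representation"]

ideas_by_iteration = [
    ["We study weight decay schedules."],
    ["Phase transition in representation learning.", "Weight decay again."],
    ["Delayed generalization under sparse supervision."],
]


def extract_terms(text, terms):
    """Return the terms that occur in text as whole phrases (case-insensitive)."""
    lowered = text.lower()
    return {
        term
        for term in terms
        if re.search(r"\b" + re.escape(term.lower()) + r"\b", lowered)
    }


def cumulative_term_growth(iterations, terms):
    """For each iteration, count the distinct terms observed so far."""
    seen = set()
    growth = []
    for ideas in iterations:
        for idea in ideas:
            seen |= extract_terms(idea, terms)
        growth.append(len(seen))
    return growth


growth = cumulative_term_growth(ideas_by_iteration, TERMS)
print(growth)
```

Because the set of seen terms only grows, the resulting curve is non-decreasing; a plateau indicates iterations that introduce no new vocabulary.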
We observe that early iterations introduce diverse learning\-related concepts with limited internal organization, while later iterations progressively form clearer and more structured conceptual clusters\. Over time, ideas increasingly emphasize delayed generalization and representation reorganization, resulting in more coherent descriptions of grokking\-like behavior\.
Quality Gains beyond the Initial Exploration Stage. To examine whether idea quality improved beyond the initial exploratory phase, we analyzed the evolution of quality scores across iterative rounds. For each topic, EvoSci performed ten iterations and generated a total of 50 ideas. Ideas were grouped into batches of 10, and quality scores were computed at the group level to track changes over time, as shown in Table [6](https://arxiv.org/html/2605.24018#A4.T6) (Appendix [D.2](https://arxiv.org/html/2605.24018#A4.SS2)). We treated the first group as an exploratory baseline and identified the earliest subsequent iteration that exhibited noticeable improvement.
We observe that quality gains beyond the first group occurred in most topics, although the timing and affected metrics vary\. For topics with latent mechanisms or less explored conceptual structures, later groups show clear improvements in Novelty and/or Overall scores\. In contrast, engineering\-oriented or well\-studied topics exhibit limited or no improvement beyond early groups\. Across topics, improvements do not appear simultaneously across all metrics, with some groups showing gains in Novelty without corresponding increases in Overall scores, or vice versa\.
Exploration Dynamics Across and Within Iterations. To analyze exploration behavior beyond aggregate quality trends, we measured idea similarity from two perspectives: intra-round convergence and inter-round continuity. Within each iteration, we computed the average pairwise cosine similarity between sentence embeddings of ideas to quantify intra-round convergence. Across iterations, we computed inter-round similarity by measuring the cosine similarity between aggregated embeddings of ideas from consecutive rounds, as shown in Figure [4](https://arxiv.org/html/2605.24018#A4.F4) (Appendix [D.3](https://arxiv.org/html/2605.24018#A4.SS3)).
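The two similarity measures can be sketched in plain Python. The example assumes that the "aggregated embedding" of a round is the mean of its idea embeddings, which the text does not specify; the toy two-dimensional vectors stand in for real sentence embeddings, and all function names are illustrative.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def intra_round_similarity(embeddings):
    """Average pairwise cosine similarity within one round (needs >= 2 ideas)."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def mean_embedding(embeddings):
    """Aggregate a round by averaging its embeddings (assumed aggregation)."""
    count = len(embeddings)
    return [sum(column) / count for column in zip(*embeddings)]


def inter_round_similarity(rounds):
    """Cosine similarity between aggregated embeddings of consecutive rounds."""
    centroids = [mean_embedding(r) for r in rounds]
    return [cosine(a, b) for a, b in zip(centroids, centroids[1:])]


# Toy embeddings for two rounds with two ideas each (illustration only).
rounds = [
    [[1.0, 0.0], [0.0, 1.0]],  # divergent ideas
    [[1.0, 0.0], [1.0, 0.0]],  # identical ideas
]
intra = [intra_round_similarity(r) for r in rounds]
inter = inter_round_similarity(rounds)
print(intra, inter)
```

High intra-round values indicate that ideas within a round cluster together, while high inter-round values indicate that consecutive rounds stay in the same region of embedding space.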
We observe substantial variation in both intra\-round and inter\-round similarity patterns across topics\. Some topics exhibit consistently higher intra\-round similarity together with strong inter\-round continuity, indicating concentrated exploration and stable refinement across iterations\. Other topics show lower or more fluctuating intra\-round similarity and greater inter\-round variation, reflecting broader exploration and frequent shifts in focus\. Overall, intra\-round convergence and inter\-round continuity exhibit aligned trends across topics, with topics showing stronger within\-round concentration also tending to display higher continuity across iterations\.
Grounding the Evolutionary Dynamics. To examine whether the evolutionary process in EvoSci reflects concrete system behavior rather than a high-level analogy, we conducted an additional qualitative analysis on the *Grokking* topic by tracing how ideas evolved across iterative rounds. We find that the system maintains persistent conceptual variation over time, while evaluation feedback gradually favors mechanisms that are more directly relevant to grokking.
Concretely, the search moves from broad learning\-related exploration toward more specialized concepts such as delayed generalization, phase transitions, and representation reorganization, while still retaining multiple conceptual lineages in parallel\. This pattern suggests that the evolutionary process in EvoSci is not merely metaphorical, but is grounded in heritable variation, feedback\-guided selection, and population diversity\.
## 6 Conclusion
In this study, we have presented EvoSci, a multi\-agent, feedback\-driven, and bio\-inspired evolutionary framework for automated scientific discovery\. The framework conceptualizes scientific discovery as a problem\-oriented process, integrates heterogeneous research agents that emulate real\-world laboratory roles, and employs multi\-round feedback with evolutionary operations to support continuous and open\-ended exploration\. Extensive experiments across ten scientific domains show that EvoSci consistently outperforms strong baselines in idea validity, excitement, and overall quality\. The gains are balanced across evaluation dimensions and remain robust across backbone models, validating feedback\-guided evolutionary exploration for open\-ended scientific discovery\.
## Limitations
Experimental results across ten interdisciplinary research topics demonstrate that EvoSci generates more novel and insightful research ideas than existing systems, showing particular strength in innovation-oriented dimensions such as novelty, excitement, and overall peer-review evaluations. However, due to its broad cross-domain exploration, the framework sometimes produces ideas with lower practical feasibility, suggesting a trade-off between creativity and applicability.
Future work will focus on enhancing EvoSci’s ability to reason and operate across disciplines by improving interdisciplinary knowledge integration through structured knowledge representations, strengthening causal reasoning to increase scientific rigor and interpretability, and developing more open\-ended iterative mechanisms that enable long\-term, autonomous scientific discovery\. A key challenge in achieving such continuous evolution lies in establishing more objective and high\-quality evaluation mechanisms that allow LLM\-based agents to better assess their own reasoning and outputs, which is essential for truly effective self\-improvement and sustained innovation\.
## Acknowledgments
The present research was supported by the National Key Research and Development Program of China \(Grant No\. 2024YFE0203000\)\. We also acknowledge support from the State Key Laboratory of Tibetan Intelligence \(Grant No\. 2025\-ZJ\-J08\) and the Postdoctoral Fellowship Program of CPSF \(Grant No\. GZC20251075\)\. We thank the anonymous reviewers for their insightful comments\.
## References
- Baek et al. (2024) Jinheon Baek, Sujay Kumar Jauhar, Silviu Cucerzan, and Sung Ju Hwang. 2024. Researchagent: Iterative research idea generation over scientific literature with large language models. *arXiv preprint arXiv:2404.07738*.
- Buehler (2024) Markus J Buehler. 2024. Accelerating scientific discovery with generative knowledge extraction, graph-based representation, and multimodal intelligent graph reasoning. *Machine Learning: Science and Technology*, 5(3):035083.
- Charness et al. (2025) Gary Charness, Brian Jabarian, and John A. List. 2025. [The next generation of experimental research with LLMs](https://doi.org/10.1038/s41562-025-02137-1). *Nature Human Behaviour*, 9(5):833–835.
- Desai et al. (2025) Saaketh Desai, Sadhvikas Addamane, Jeffrey Y Tsao, Igal Brener, Laura P Swiler, Remi Dingreville, and Prasad P Iyer. 2025. Autoscilab: A self-driving laboratory for interpretable scientific discovery. In *Proceedings of the AAAI Conference on Artificial Intelligence*, volume 39, pages 146–154.
- Durante et al. (2024) Zane Durante, Qiuyuan Huang, Naoki Wake, Ran Gong, Jae Sung Park, Bidipta Sarkar, Rohan Taori, Yusuke Noda, Demetri Terzopoulos, Yejin Choi, and 1 others. 2024. Agent ai: Surveying the horizons of multimodal interaction. *arXiv preprint arXiv:2401.03568*.
- Elliott (2012) Kevin C. Elliott. 2012. [Epistemic and methodological iteration in scientific research](https://doi.org/10.1016/j.shpsa.2011.12.034). *Studies in History and Philosophy of Science Part A*, 43(2):376–382.
- Ferrag et al. (2025) Mohamed Amine Ferrag, Norbert Tihanyi, and Merouane Debbah. 2025. [Reasoning beyond limits: Advances and open problems for llms](https://arxiv.org/abs/2503.22732). *Preprint*, arXiv:2503.22732.
- Ghafarollahi and Buehler (2025) Alireza Ghafarollahi and Markus J Buehler. 2025. Sciagents: automating scientific discovery through bioinspired multi-agent intelligent graph reasoning. *Advanced Materials*, 37(22):2413523.
- Han et al. (2025) Longzhen Han, Awes Mubarak, Almas Baimagambetov, Nikolaos Polatidis, and Thar Baker. 2025. [A survey of generative categories and techniques in multimodal large language models](https://arxiv.org/abs/2506.10016). *Preprint*, arXiv:2506.10016.
- He et al. (2024) Xinyi He, Jiaru Zou, Yun Lin, Mengyu Zhou, Shi Han, Zejian Yuan, and Dongmei Zhang. 2024. [CoCoST: Automatic complex code generation with online searching and correctness testing](https://doi.org/10.18653/v1/2024.emnlp-main.1082). In *Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing*, pages 19433–19451, Miami, Florida, USA. Association for Computational Linguistics.
- Hristovski et al. (2005) Dimitar Hristovski, Borut Peterlin, Joyce A. Mitchell, and Susanne M. Humphrey. 2005. [Using literature-based discovery to identify disease candidate genes](https://doi.org/10.1016/j.ijmedinf.2004.04.024). *International Journal of Medical Informatics*, 74(2):289–298. MIE 2003.
- Huang et al. (2025) Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and 1 others. 2025. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. *ACM Transactions on Information Systems*, 43(2):1–55.
- Jansen et al. (2025) Peter Jansen, Oyvind Tafjord, Marissa Radensky, Pao Siangliulue, Tom Hope, Bhavana Dalvi Mishra, Bodhisattwa Prasad Majumder, Daniel S Weld, and Peter Clark. 2025. Codescientist: End-to-end semi-automated scientific discovery with code-based experimentation. *arXiv preprint arXiv:2503.22708*.
- Ke et al. (2025) Zixuan Ke, Fangkai Jiao, Yifei Ming, Xuan-Phi Nguyen, Austin Xu, Do Xuan Long, Minzhi Li, Chengwei Qin, Peifeng Wang, Silvio Savarese, and 1 others. 2025. A survey of frontiers in llm reasoning: Inference scaling, learning to reason, and agentic systems. *arXiv preprint arXiv:2504.09037*.
- King et al. (2009) Ross D. King, Jem Rowland, Stephen G. Oliver, Michael Young, Wayne Aubrey, Emma Byrne, Maria Liakata, Magdalena Markham, Pinar Pir, Larisa N. Soldatova, Andrew Sparkes, Kenneth E. Whelan, and Amanda Clare. 2009. [The automation of science](https://doi.org/10.1126/science.1165620). *Science*, 324(5923):85–89.
- Kulkarni et al. (2025) Adithya Kulkarni, Fatimah Alotaibi, Xinyue Zeng, Longfeng Wu, Tong Zeng, Barry Menglong Yao, Minqian Liu, Shuaicheng Zhang, Lifu Huang, and Dawei Zhou. 2025. [Scientific hypothesis generation and validation: Methods, datasets, and future directions](https://arxiv.org/abs/2505.04651). *Preprint*, arXiv:2505.04651.
- Li et al. (2023) Jia Li, Yongmin Li, Ge Li, Zhi Jin, Yiyang Hao, and Xing Hu. 2023. Skcoder: A sketch-based approach for automatic code generation. In *2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE)*, pages 2124–2135. IEEE.
- Li et al. (2025) Zhigen Li, Jianxiang Peng, Yanmeng Wang, Yong Cao, Tianhao Shen, Minghui Zhang, Linxi Su, Shang Wu, Yihang Wu, YuQian Wang, Ye Wang, Wei Hu, Jianfeng Li, Shaojun Wang, Jing Xiao, and Deyi Xiong. 2025. [ChatSOP: An SOP-guided MCTS planning framework for controllable LLM dialogue agents](https://doi.org/10.18653/v1/2025.acl-long.863). In *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pages 17637–17659, Vienna, Austria. Association for Computational Linguistics.
- Liu et al. (2025) Yan Liu, Minghui Zhang, Bojian Xiong, Yifan Xiao, Yinong Sun, Yating Mei, Longyu Zeng, Jingchao Yang, Yang Wang, and Deyi Xiong. 2025. [HighMATH: Evaluating math reasoning of large language models in breadth and depth](https://doi.org/10.18653/v1/2025.findings-emnlp.542). In *Findings of the Association for Computational Linguistics: EMNLP 2025*, pages 10241–10253, Suzhou, China. Association for Computational Linguistics.
- Lu et al. (2024) Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. 2024. The ai scientist: Towards fully automated open-ended scientific discovery. *arXiv preprint arXiv:2408.06292*.
- Milojević (2014) Staša Milojević. 2014. [Principles of scientific research team formation and evolution](https://doi.org/10.1073/pnas.1309723111). *Proceedings of the National Academy of Sciences*, 111(11):3984–3989.
- Noh et al. (2024) Juran Noh, Hieu A Doan, Heather Job, Lily A Robertson, Lu Zhang, Rajeev S Assary, Karl Mueller, Vijayakumar Murugesan, and Yangang Liang. 2024. An integrated high-throughput robotic platform and active learning approach for accelerated discovery of optimal electrolyte formulations. *Nature Communications*, 15(1):2757.
- Pan et al. (2024) Shirui Pan, Linhao Luo, Yufei Wang, Chen Chen, Jiapu Wang, and Xindong Wu. 2024. Unifying large language models and knowledge graphs: A roadmap. *IEEE Transactions on Knowledge and Data Engineering*, 36(7):3580–3599.
- Ren et al. (2023) Xiaoxue Ren, Xinyuan Ye, Dehai Zhao, Zhenchang Xing, and Xiaohu Yang. 2023. From misuse to mastery: Enhancing code generation with knowledge-driven ai chaining. In *2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE)*, pages 976–987. IEEE.
- Smalheiser and Swanson (1998) Neil R Smalheiser and Don R Swanson. 1998. [Using arrowsmith: a computer-assisted approach to formulating and assessing scientific hypotheses](https://doi.org/10.1016/S0169-2607(98)00033-9). *Computer Methods and Programs in Biomedicine*, 57(3):149–153.
- Su et al. (2025) Haoyang Su, Renqi Chen, Shixiang Tang, Zhenfei Yin, Xinzhe Zheng, Jinzhe Li, Biqing Qi, Qi Wu, Hui Li, Wanli Ouyang, Philip Torr, Bowen Zhou, and Nanqing Dong. 2025. [Many heads are better than one: Improved scientific idea generation by a LLM-based multi-agent system](https://doi.org/10.18653/v1/2025.acl-long.1368). In *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pages 28201–28240, Vienna, Austria. Association for Computational Linguistics.
- Sui et al. (2024) Yuan Sui, Mengyu Zhou, Mingjie Zhou, Shi Han, and Dongmei Zhang. 2024. Table meets llm: Can large language models understand structured table data? a benchmark and empirical study. In *Proceedings of the 17th ACM International Conference on Web Search and Data Mining*, pages 645–654.
- Tian et al. (2021) Yunsheng Tian, Mina Konaković Luković, Timothy Erps, Michael Foshey, and Wojciech Matusik. 2021. [Autooed: Automated optimal experiment design platform](https://arxiv.org/abs/2104.05959). *Preprint*, arXiv:2104.05959.
- Tom et al. (2024) Gary Tom, Stefan P Schmid, Sterling G Baird, Yang Cao, Kourosh Darvish, Han Hao, Stanley Lo, Sergio Pablo-García, Ella M Rajaonson, Marta Skreta, and 1 others. 2024. Self-driving laboratories for chemistry and materials science. *Chemical Reviews*, 124(16):9633–9732.
- Trinh et al. (2024) Trieu H Trinh, Yuhuai Wu, Quoc V Le, He He, and Thang Luong. 2024. Solving olympiad geometry without human demonstrations. *Nature*, 625(7995):476–482.
- Wang et al. (2024a) Qingyun Wang, Doug Downey, Heng Ji, and Tom Hope. 2024a. [Scimon: Scientific inspiration machines optimized for novelty](https://doi.org/10.18653/v1/2024.acl-long.18). In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pages 279–299. Association for Computational Linguistics.
- Wang et al. (2024b) Wenxiao Wang, Lihui Gu, Liye Zhang, Yunxiang Luo, Yi Dai, Chen Shen, Liang Xie, Binbin Lin, Xiaofei He, and Jieping Ye. 2024b. Scipip: An llm-based scientific paper idea proposer. *arXiv preprint arXiv:2410.23166*.
- Weng et al. (2024) Yixuan Weng, Minjun Zhu, Guangsheng Bao, Hongbo Zhang, Jindong Wang, Yue Zhang, and Linyi Yang. 2024. Cycleresearcher: Improving automated research via automated review. *arXiv preprint arXiv:2411.00816*.
- Xiong et al. (2026) Xiaoyu Xiong, Hao Wang, Keming Wu, Zhenfei Yang, Hanjie Zhao, Hongxiang Wang, Hao Liu, and Deyi Xiong. 2026. [Collaborative and autonomous ai for science and innovation: Practices, challenges, and future directions](https://doi.org/10.36227/techrxiv.177155935.57684125/v1). *TechRxiv*, 2026(0220).
- Xu et al. (2025) Fengli Xu, Qianyue Hao, Zefang Zong, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, and 1 others. 2025. Towards large reasoning models: A survey of reinforced reasoning with large language models. *arXiv preprint arXiv:2501.09686*.
- Yamada et al. (2025) Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. 2025. The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search. *arXiv preprint arXiv:2504.08066*.
- Yang et al. (2025) Lei Yang, Renren Jin, Ling Shi, Jianxiang Peng, Yue Chen, and Deyi Xiong. 2025. [Probench: Benchmarking large language models in competitive programming](https://api.semanticscholar.org/CorpusID:276725031). *ArXiv*, abs/2502.20868.
- Yang et al. (2024) Zonglin Yang, Xinya Du, Junxian Li, Jie Zheng, Soujanya Poria, and Erik Cambria. 2024. Large language models for automated open-domain scientific hypotheses discovery. In *Findings of the Association for Computational Linguistics ACL 2024*, pages 13545–13565.
- Zhang and Xiong (2025) Shaowei Zhang and Deyi Xiong. 2025. [Debate4MATH: Multi-agent debate for fine-grained reasoning in math](https://doi.org/10.18653/v1/2025.findings-acl.862). In *Findings of the Association for Computational Linguistics: ACL 2025*, pages 16810–16824, Vienna, Austria. Association for Computational Linguistics.
- Zhang et al. (2023) Zihan Zhang, Meng Fang, Ling Chen, Mohammad-Reza Namazi-Rad, and Jun Wang. 2023. How do large language models capture the ever-changing world knowledge? a review of recent advances. *arXiv preprint arXiv:2310.07343*.
- Zheng et al. (2025) Tianshi Zheng, Zheye Deng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Zihao Wang, and Yangqiu Song. 2025. From automation to autonomy: A survey on large language models in scientific discovery. *arXiv preprint arXiv:2505.13259*.
- Zheng et al. (2023) Yizhen Zheng, Huan Yee Koh, Jiaxin Ju, Anh TN Nguyen, Lauren T May, Geoffrey I Webb, and Shirui Pan. 2023. Large language models for scientific synthesis, inference and explanation. *arXiv preprint arXiv:2310.07984*.
- Zhou et al. (2024a) Hang Zhou, Yehui Tang, Haochen Qin, Yujie Yang, Renren Jin, Deyi Xiong, Kai Han, and Yunhe Wang. 2024a. [Star-agents: Automatic data optimization with llm agents for instruction tuning](https://doi.org/10.52202/079017-0149). In *Advances in Neural Information Processing Systems*, volume 37, pages 4575–4597. Curran Associates, Inc.
- Zhou et al. (2024b) Yangqiaoyu Zhou, Haokun Liu, Tejes Srivastava, Hongyuan Mei, and Chenhao Tan. 2024b. [Hypothesis generation with large language models](https://doi.org/10.18653/v1/2024.nlp4science-1.10). In *Proceedings of the 1st Workshop on NLP for Science (NLP4Science)*, pages 117–139, Miami, FL, USA. Association for Computational Linguistics.
## Appendix A Data Collection
### A.1 Real-World Scientist Dataset
The Digital Scientist dataset ([https://drive.google.com/drive/folders/1ZwWMBQ5oK-l4VuzMa60GbMND0g2EIxIu](https://drive.google.com/drive/folders/1ZwWMBQ5oK-l4VuzMa60GbMND0g2EIxIu)) used in this study was constructed by the VirSci team based on real-world scientist information from the AMiner Computer Science dataset ([https://www.aminer.cn/aminernetwork](https://www.aminer.cn/aminernetwork)), which was originally compiled by extracting researcher profiles from online academic databases. The AMiner Computer Science dataset contains information on 1,712,433 authors and 2,092,356 papers, covering the period from 1948 to 2014 and focusing on the field of computer science.
To ensure data quality, the VirSci team filtered out scientists who had published fewer than 50 papers or had fewer than 50 collaborators. Using the remaining data, they built the Digital Scientist dataset, which includes 156 representative scientists. The profile information of each scientist was embedded using the mxbai-embed-large model.
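The filtering rule stated above amounts to keeping an author only if both counts reach 50. A minimal sketch follows; the `Author` record, its field names, and the sample entries are hypothetical and do not reflect the actual AMiner schema or the VirSci preprocessing code.

```python
from dataclasses import dataclass


@dataclass
class Author:
    # Hypothetical record; field names are assumptions for illustration.
    name: str
    n_papers: int
    n_collaborators: int


def keep_author(author, min_papers=50, min_collaborators=50):
    """Keep authors with at least 50 papers AND at least 50 collaborators."""
    return (
        author.n_papers >= min_papers
        and author.n_collaborators >= min_collaborators
    )


# Made-up entries covering each side of the rule.
authors = [
    Author("A", n_papers=120, n_collaborators=75),  # kept
    Author("B", n_papers=49, n_collaborators=200),  # too few papers
    Author("C", n_papers=300, n_collaborators=12),  # too few collaborators
]
kept = [a.name for a in authors if keep_author(a)]
print(kept)
```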
All personal identity information has been anonymized, and the profiles only contain abstracted metadata for research simulation purposes\. An example of a digital scientist profile is shown below:
Digital Scientist:

Your name is Scientist0, you belong to the following affiliations ['Naval Research Laboratory', 'College of William and Mary', 'George Mason University'], you have researched on the following topics ['data cube', 'attack graph', 'data mining', 'access control', 'data owner', 'data protection', 'data item', 'data redundancy', 'data security', 'data structure'], you have published 372 papers, you have 4230 citations, and you have previously collaborated with these individuals ['Scientist78', 'Scientist105'].
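A profile of this shape could be rendered from structured metadata with a simple template. The sketch below is illustrative only: the dictionary keys and the `render_profile` helper are assumptions, and the actual VirSci templating code may differ. The field values are copied from the example profile above.

```python
def render_profile(p):
    """Render a digital-scientist prompt from a metadata dictionary.

    The keys used here are hypothetical; lists are formatted with
    Python's default list representation.
    """
    return (
        f"Your name is {p['name']}, "
        f"you belong to the following affiliations {p['affiliations']}, "
        f"you have researched on the following topics {p['topics']}, "
        f"you have published {p['n_papers']} papers, "
        f"you have {p['n_citations']} citations, "
        f"and you have previously collaborated with these individuals "
        f"{p['collaborators']}."
    )


profile = {
    "name": "Scientist0",
    "affiliations": [
        "Naval Research Laboratory",
        "College of William and Mary",
        "George Mason University",
    ],
    "topics": [
        "data cube", "attack graph", "data mining", "access control",
        "data owner", "data protection", "data item", "data redundancy",
        "data security", "data structure",
    ],
    "n_papers": 372,
    "n_citations": 4230,
    "collaborators": ["Scientist78", "Scientist105"],
}
rendered = render_profile(profile)
print(rendered)
```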
### A.2 Technical Terminology Dataset
The Technical Terminology dataset ([https://github.com/jiqizhixin/Artificial-Intelligence-Terminology-Database](https://github.com/jiqizhixin/Artificial-Intelligence-Terminology-Database)) used in this study was compiled to support the analysis of conceptual evolution and the measurement of scientific depth across iterations. It is based on the Artificial Intelligence Terminology Database, which systematically collects, organizes, and standardizes key technical terms from major subfields of AI, including machine learning, natural language processing, computer vision, and robotics.
Each term entry contains its English form, corresponding Chinese translation, and a short definition, ensuring cross\-lingual consistency and facilitating accurate concept extraction\. The dataset enables automated detection and tracking of emerging research topics and specialized vocabulary in scientific idea generation\.
## Appendix B Task Overview
### B.1 Task Settings
This appendix provides detailed descriptions of the experimental task settings used to evaluate EvoSci. Following AI Scientist (Lu et al., [2024](https://arxiv.org/html/2605.24018#bib.bib20)), we adopt ten representative and challenging open-ended research topics spanning machine learning, scientific modeling, and simulation. These tasks are designed to assess the system's adaptability across diverse scientific domains.
### B.2 Task Descriptions
The ten experimental tasks include:
- 2D Diffusion Modeling: Learning and analyzing diffusion processes in two-dimensional synthetic data.
- Character-Level Language Modeling: Training and evaluating character-based language models.
- Earthquake Prediction: Modeling seismic activity for temporal event prediction.
- Grokking: Investigating delayed generalization behavior in overparameterized networks.
- NanoGPT: Training and scaling lightweight transformer-based language models.
- Materials Adaptive Convolutional Equivariants: Modeling symmetry-aware representations for material science tasks.
- SEIR Infection Modeling: Simulating epidemiological dynamics using compartmental models.
- Sketch Generation with Recurrent Neural Networks: Generating hand-drawn sketches using RNN-based generative models.
- Multi-Dataset CNN Architecture Optimization: Optimizing small CNN architectures across multiple datasets.
- TensoRF: Learning neural radiance fields using tensor factorization.
### B.3 Initialization Prompts
For each task, EvoSci is initialized with a minimal topic prompt describing the research domain. For baseline methods, we incorporate the task descriptions used in AI Scientist (Lu et al., [2024](https://arxiv.org/html/2605.24018#bib.bib20)) into their prompts in a consistent and reasonable manner, without introducing additional task-specific heuristics or privileged information. This design ensures fair comparison across different methods.
## Appendix C Baseline Methods
To comprehensively evaluate the performance of our system, we compare it against four representative research\-agent frameworks built upon large language models\. For each baseline, we follow the official implementation or publicly described procedure to ensure a fair and consistent comparison\. Below, we summarize each method and its configuration used in our experiments\.
##### Baseline 1: SciPIP
SciPIP ([https://github.com/cheerss/SciPIP](https://github.com/cheerss/SciPIP)), proposed by Zhejiang University, is a research idea generation framework designed to enhance scientific creativity through improved literature retrieval and dual-path reasoning. The system constructs a literature repository enriched with semantic relations, entity links, and citation co-occurrence information, and employs multi-granularity retrieval algorithms to ensure comprehensive coverage of relevant works. During idea generation, SciPIP combines inference from retrieved literature with model-driven brainstorming to produce solutions balancing originality and feasibility. In our experiments, we retain SciPIP's original retrieval mechanism and do not introduce additional ArXiv alignment to maintain consistent comparison.
##### Baseline 2: AI Scientist
AI Scientist ([https://github.com/SakanaAI/AI-Scientist](https://github.com/SakanaAI/AI-Scientist)), developed by Sakana AI, is an automated research platform that aims to cover the end-to-end scientific workflow, including idea formulation, code generation, experiment execution, result analysis, and manuscript drafting. The framework leverages large language models and implements a multi-round reasoning pipeline that iteratively refines research plans. It further includes an automated reviewer module for assessing the quality of generated papers. In this study, we adopt the publicly available multi-step reasoning and research evaluation procedures as a strong baseline to compare scientific idea generation capabilities.
##### Baseline 3: VirSci
VirSci \([https://github\.com/RenqiChen/Virtual\-Scientists](https://github.com/RenqiChen/Virtual-Scientists)\), released by the Shanghai Artificial Intelligence Laboratory, is a multi\-agent scientific collaboration framework designed to emulate real\-world research team dynamics\. The system builds agent profiles from data of real scientists, assigning distinct domain backgrounds and reasoning characteristics to each agent\. Through cross\-disciplinary discussion, collaborative reasoning, and complementary expertise, the agents collectively generate diverse research insights\. The framework includes team construction, topic deliberation, idea generation, innovation assessment, and summary writing\. In our evaluation, we use its multi\-agent modeling paradigm and evaluation criteria as a baseline for collaborative scientific reasoning\.
##### Baseline 4: CoI\-Agent
CoI\-Agent \([https://github\.com/DAMO\-NLP\-SG/CoI\-Agent](https://github.com/DAMO-NLP-SG/CoI-Agent)\), introduced by researchers at the Chinese University of Hong Kong, is a research agent framework built upon the “Chain\-of\-Insight” \(CoI\) reasoning strategy\. The framework decomposes scientific thinking into a sequence of intermediate, verifiable insight steps covering task understanding, literature reasoning, gap identification, and preliminary solution design\. Each stage produces structured intermediate outputs that are subsequently evaluated for coherence and scientific contribution\. In our work, we implement the publicly available multi\-stage CoI reasoning paradigm as a baseline focusing on structured scientific insight formation\.
## Appendix D Additional Experiments and Analyses
### D\.1 Evolution of Ideas under Iterative Evaluation
To examine how research ideas evolved under iterative evaluation, we conducted a detailed analysis of ideas generated for the *Grokking* topic\. In total, 50 ideas were produced across ten consecutive iterations, with five ideas generated per iteration\. We analyzed the evolution of ideas by combining qualitative inspection of their thematic content with changes in conceptual focus across feedback rounds\.
Figure 3: Evolutionary trajectory of research ideas on the grokking topic\.
As shown in Figure [3](https://arxiv.org/html/2605.24018#A4.F3), the cumulative growth of technical terminology exhibits a stepwise consolidation over iterative rounds, reflecting changes in the conceptual focus of generated ideas \(for details of the technical terminology dataset, please refer to Appendix [A\.2](https://arxiv.org/html/2605.24018#A1.SS2)\)\. In the initial iterations \(Rounds 1–3\), ideas primarily explored general learning and cognitive mechanisms, with a strong emphasis on memory\-augmented architectures, information retention, and enhanced representational capacity\. These ideas reflected broad exploration of architectural and cognitive factors related to learning efficiency, without explicitly addressing grokking or phase\-transition behavior\.
In the middle iterations \(Rounds 4–6\), the thematic focus diversified\. While memory\-related mechanisms remained present, ideas increasingly incorporated higher\-level considerations such as decision\-making under uncertainty, temporal reasoning, and ethical or behavioral constraints\. This stage represented a transitional phase in which the system explored alternative framings of learning behavior before converging on a dominant explanatory direction\.
In later iterations \(Rounds 7–10\), the ideas increasingly centered on grokking\-specific phenomena\. Key concepts such as grokking, learning phase transitions, delayed generalization, and internal structural reorganization became prominent\. Ideas in this stage consistently framed grokking as a dynamic learning process involving qualitative changes in representation structure during training, rather than as a consequence of static optimization\. Compared to earlier stages, the ideas exhibited higher conceptual alignment and recurring explanatory patterns\.
Overall, the evolution of the 50 ideas followed a trajectory from broad architectural exploration, through thematic diversification, to focused conceptual consolidation around grokking as a phase\-transition\-driven learning phenomenon\. This analysis illustrates how evaluation\-guided iteration shaped not only the content of individual ideas but also the thematic structure of the idea space over time\.
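The cumulative terminology growth discussed above could be tracked along the following lines\. This is a minimal sketch, not the authors' implementation: the function name `cumulative_term_growth`, the substring\-matching rule, and the toy vocabulary are assumptions made for illustration, and the actual terminology dataset is the one described in Appendix A\.2\.

```python
def cumulative_term_growth(rounds, vocabulary):
    """Return, for each round, how many distinct vocabulary terms have appeared so far.

    rounds: list of rounds, each a list of idea texts (hypothetical input format).
    vocabulary: list of technical terms; matching is a simple case-insensitive
    substring test, which is an assumption of this sketch.
    """
    vocab = [term.lower() for term in vocabulary]
    seen = set()
    growth = []
    for ideas in rounds:
        for idea in ideas:
            text = idea.lower()
            seen.update(term for term in vocab if term in text)
        growth.append(len(seen))
    return growth


# Toy data, invented for illustration only.
rounds = [
    ["Memory-augmented architectures for information retention"],
    ["Temporal reasoning under uncertainty"],
    ["Grokking as delayed generalization", "Phase transition in grokking"],
]
vocab = ["memory-augmented", "temporal reasoning", "grokking",
         "delayed generalization", "phase transition"]

print(cumulative_term_growth(rounds, vocab))  # [1, 2, 5]
```

A flat stretch in the returned sequence followed by a jump would correspond to the stepwise consolidation pattern described for Figure 3\.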
### D\.2 Quality Gains beyond the Initial Exploration Stage
Scientific idea generation is inherently non\-monotonic: early stages often prioritize broad exploration and diversity, while higher\-quality ideas may only emerge after subsequent consolidation and refinement\. Accordingly, rather than assuming monotonic improvement across iterations, we examined whether EvoSci exhibited quality gains beyond the initial exploration stage\. For each topic, the system performed ten iterative rounds and generated 50 ideas in total\. Ideas were aggregated into fixed\-size groups of ten, and quality scores were computed at the group level to track the evolution of idea quality over time\.
Specifically, for each research topic, we treated the first idea group as an exploratory baseline and identified the earliest subsequent group that exhibited noticeable improvements\. Our analysis focused on three core evaluation metrics: *Novelty*, *ICLR Overall*, and *NeurIPS Overall*, which respectively capture conceptual originality and holistic research quality under different review standards\.
As shown in Table [6](https://arxiv.org/html/2605.24018#A4.T6), EvoSci demonstrated such quality gains in most topics, although the patterns were task\-dependent\. For topics characterized by latent mechanisms or underexplored conceptual structures \(e\.g\., *2D Diffusion Modeling*, *Neural Network Grokking*, and *NanoGPT*\), later groups yielded clear improvements in Novelty and/or Overall scores, indicating a transition from heuristic combinations to more coherent and theory\-driven ideas\. In contrast, for engineering\-oriented or well\-studied domains \(e\.g\., *Materials Adaptive Convolutional Equivariants* and *Multi\-Dataset CNN Optimization*\), improvements were limited or absent, suggesting early convergence of the idea space\.
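The group\-level procedure described above can be sketched as follows\. This is an illustrative sketch under stated assumptions: the function names are hypothetical, and `min_delta` is an assumed threshold for what counts as a "noticeable" improvement, which the text does not specify\.

```python
from statistics import mean


def group_means(scores, group_size=10):
    """Average per-idea scores over consecutive fixed-size groups."""
    return [mean(scores[i:i + group_size])
            for i in range(0, len(scores), group_size)]


def earliest_gain(scores, group_size=10, min_delta=0.1):
    """Find the first later group whose mean exceeds the first group's mean.

    Returns (group_index, baseline_mean, improved_mean) with 1-based group
    indices, or None if no later group improves by at least min_delta.
    min_delta is a hypothetical threshold, not a value from the paper.
    """
    means = group_means(scores, group_size)
    baseline = means[0]
    for index, value in enumerate(means[1:], start=2):
        if value - baseline >= min_delta:
            return index, baseline, value
    return None


# Toy scores for 50 ideas (five groups of ten), invented for illustration.
toy_scores = [4] * 20 + [5] * 30
print(group_means(toy_scores))    # [4, 4, 5, 5, 5]
print(earliest_gain(toy_scores))  # (3, 4, 5)
```

Running the same routine separately on Novelty, ICLR Overall, and NeurIPS Overall scores would make it visible when one metric improves without the others\.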
Notably, improvements did not necessarily occur simultaneously across all metrics\. In several cases, Novelty increased without corresponding gains in Overall scores, or vice versa, reflecting the non\-uniform nature of scientific progress\. These observations suggest that EvoSci’s evaluation\-guided, bio\-inspired evolutionary mechanism does not enforce uniform optimization, but instead enables selective refinement where the problem structure admits deeper exploration\.
Table 6: Quality gains beyond the initial exploration stage\. For each topic, we report the iteration that achieves the strongest improvement over the initial \(Round 1\) baseline, together with comparisons on core evaluation metrics\.

| Topic | Emergent Round | Novelty | ICLR Overall | NeurIPS Overall |
| --- | --- | --- | --- | --- |
| 2D Diffusion | 2 | 4\.90 → 5\.40 | 4\.40 → 4\.50 | 3\.10 → 3\.50 |
| Character\-Level Language Modeling | 2 | 4\.40 → 4\.70 | 4\.60 → 4\.50 | 3\.40 → 3\.90 |
| Earthquake Prediction | 3 | 4\.80 → 5\.10 | 4\.20 → 4\.60 | 3\.30 → 3\.80 |
| Grokking | 2 | 4\.80 → 5\.10 | 4\.20 → 4\.20 | 3\.30 → 3\.50 |
| NanoGPT | 3 | 5\.20 → 5\.40 | 4\.70 → 4\.80 | 3\.70 → 3\.80 |
| Materials Adaptive Convolutional Equivariants | 5 | 4\.40 → 4\.40 | 4\.30 → 4\.20 | 3\.10 → 3\.30 |
| SEIR Infection Modeling | 5 | 4\.70 → 5\.00 | 4\.70 → 4\.20 | 3\.60 → 3\.20 |
| Sketch Generation with RNNs | 5 | 5\.80 → 5\.10 | 4\.60 → 4\.80 | 3\.50 → 3\.70 |
| Multi\-Dataset CNN Optimization | 4 | 5\.20 → 5\.10 | 4\.50 → 4\.40 | 3\.70 → 3\.60 |
| TensoRF | 4 | 4\.40 → 5\.00 | 4\.20 → 4\.40 | 3\.40 → 3\.70 |

### D\.3 Exploration Dynamics Across and Within Iterations
To better understand the exploration dynamics of EvoSci beyond aggregate quality scores, we analyzed idea similarity from two complementary perspectives: *intra\-round convergence* and *inter\-round continuity*\.
##### Intra\-Round Convergence\.
We first examined how ideas generated within the same iteration evolved over time\. Each iteration consisted of five ideas, and we computed the average pairwise cosine similarity between their sentence embeddings\. This intra\-round similarity measured the degree of conceptual convergence within an iteration: lower values indicated more diverse exploration, while higher values suggested increasing focus around shared concepts\.
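The intra\-round measure described above, the average pairwise cosine similarity among the ideas of one iteration, can be written compactly\. The sketch below uses plain Python lists as stand\-ins for sentence embeddings; the embedding model itself is not specified here, and the function names are assumptions of this example\.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def intra_round_similarity(embeddings):
    """Mean cosine similarity over all unordered pairs of idea embeddings.

    Requires at least two embeddings; five ideas per round give ten pairs.
    """
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy 2-D "embeddings", invented for illustration: two identical ideas
# and one orthogonal idea give pairwise similarities 1, 0, and 0.
toy_round = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(round(intra_round_similarity(toy_round), 3))  # 0.333
```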
Figure 4: Intra\-round convergence across iterations, shown in two panels: \(a\) set 1 and \(b\) set 2\. Higher values indicate stronger conceptual concentration within each iteration, while lower values reflect greater diversity among concurrently generated ideas\.
Across topics, we observed substantial variation in intra\-round similarity patterns \(Figure [4](https://arxiv.org/html/2605.24018#A4.F4)\)\. Topics such as *Grokking*, *SEIR Infection Modeling*, and *2D Diffusion Modeling* exhibited relatively lower or more fluctuating intra\-round similarity across iterations\. These topics were often characterized by a stronger emphasis on conceptual understanding, theoretical interpretation, or alternative modeling perspectives, which encouraged the system to explore multiple distinct directions within the same iteration\.
In contrast, topics including *Earthquake Prediction*, *Materials Adaptive Convolutional Equivariants*, *Sketch Generation with RNNs*, and *TensoRF* showed consistently higher intra\-round similarity, and in some cases increasing similarity in later iterations\. These topics were more closely associated with concrete modeling pipelines or structured engineering objectives, where idea generation tended to concentrate on variations around shared architectures, representations, or experimental setups, leading to stronger within\-round consolidation\.
##### Inter\-Round Continuity\.
To complement the intra\-round perspective, we further analyzed how research directions evolved across consecutive iterations\. We computed inter\-round similarity by measuring the cosine similarity between the aggregated embeddings of ideas from adjacent rounds\. Higher inter\-round similarity reflected stable refinement across iterations, whereas lower similarity indicated directional shifts or renewed exploration\.
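A minimal sketch of the inter\-round measure follows\. The text states only that the embeddings of each round are aggregated; mean pooling \(a centroid\) is assumed here for illustration, and the function names are hypothetical\.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def centroid(embeddings):
    """Mean-pool a round's idea embeddings (assumed aggregation rule)."""
    count = len(embeddings)
    return [sum(column) / count for column in zip(*embeddings)]


def inter_round_similarity(rounds):
    """Cosine similarity between centroids of each pair of adjacent rounds.

    For R rounds this returns R - 1 values: round 1 vs 2, 2 vs 3, and so on.
    """
    centroids = [centroid(r) for r in rounds]
    return [cosine(a, b) for a, b in zip(centroids, centroids[1:])]


# Toy embeddings, invented for illustration: rounds 1 and 2 point the same
# way, round 3 shifts to an orthogonal direction.
toy_rounds = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
print(inter_round_similarity(toy_rounds))  # [1.0, 0.0]
```

A drop in this sequence, as between the second and third toy rounds, is what the text interprets as a directional shift or renewed exploration\.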
Figure 5: Inter\-round continuity across iterations, shown in two panels: \(a\) set 1 and \(b\) set 2\. Higher similarity indicates stable refinement between consecutive iterations, while lower values suggest shifts in exploration direction\.
The inter\-round results closely aligned with the intra\-round analysis \(Figure [5](https://arxiv.org/html/2605.24018#A4.F5)\)\. Topics that exhibited higher intra\-round convergence also tended to show stronger inter\-round continuity, indicating that successive iterations built upon similar conceptual foundations and refined related ideas over time\. In contrast, topics with lower intra\-round similarity often displayed greater inter\-round fluctuation, suggesting that the system continued to shift its focus across iterations rather than committing to a single dominant direction\.
Taken together, the intra\-round and inter\-round analyses indicated that EvoSci did not impose a uniform convergence pattern across all topics\. Instead, different topics exhibited distinct exploration dynamics: some favored progressive consolidation across and within iterations, while others maintained broader exploratory behavior throughout the process\. This adaptive behavior suggested that EvoSci flexibly balanced exploration and refinement in a topic\-dependent manner, rather than enforcing premature convergence or unstructured diversity\.
### D\.4 Grounding the Evolutionary Dynamics
To further examine whether the evolutionary process in EvoSci reflects concrete system behavior rather than a high\-level analogy, we conducted an additional qualitative analysis on the *Grokking* topic by tracing how idea trajectories evolved across 10 iterative rounds\. Our goal was to assess whether the system exhibits three core properties commonly associated with evolutionary dynamics: heritable variation, fitness\-guided selection, and maintained population diversity\.
##### Heritable variation\.
We observe that heritable variation in EvoSci manifests at the level of exploratory entities that intersect with the grokking topic\. In the early rounds, broad adaptive\-learning and training\-dynamics entities appear in ideas such as \(6\) *Integrated Cognitive AI Architectures with Enhanced Episodic Memory and Adaptive Operant Conditioning* and \(7\) *Enhancing Decision\-Making in Deep Reinforcement Learning through Episodic Memory and Cognitive Attention Mechanisms*\. Many other early branches do not persist into later rounds, indicating selective retention rather than uniform reuse\.
In the middle rounds, retained entities become increasingly specialized toward mechanisms that are more directly relevant to grokking, as seen in \(31\) *Investigating Cognitive Phase Transitions in Transformer Models through Hyperparameter Modulation* and \(33\) *Enhancing Phase Transitions in AI Systems through Innovative Curriculum Learning Strategies*, where grokking is explicitly framed as a phase\-transition phenomenon in training\. In the later rounds, these stabilized conceptual cores are further recombined with distinct methodological toolkits, including \(41\) *Enhanced Simulated Annealing for Modeling Phase Transitions in Cognitive Neural Networks* and \(47\) *Harnessing Entropy in Statistical Mechanics for “Grokked Tickets” in Neural Networks*\. Across distant rounds, the same grokking\-intersecting entity persists while its mechanistic instantiation changes under evaluation\-driven selection, which is consistent with heritable variation in an evolving hypothesis population\.
##### Fitness\-guided selection\.
The search process in EvoSci is shaped by evaluation feedback rather than by unconstrained topic drift\. At each round, generated ideas are assessed along multiple dimensions, including novelty, feasibility, expected effectiveness, and overall quality, and these signals influence which conceptual directions are retained for subsequent exploration\. This mechanism induces a structured fitness landscape over the evolving idea space\.
Empirically, we observe a clear directional shift in the grokking trajectory\. In the early rounds \(ideas 1–22\), exploration is dominated by broad memory\-augmentation and ethical\-AI entities, such as *Integrating Memory\-Augmented Neural Networks* and *Developing Ethical Frameworks for Responsible AI Memory Augmentation Integration*\. In the middle rounds \(ideas 23–34\), entities increasingly specialize toward grokking\-specific mechanisms, including temporal analysis in *Advanced Temporal Analysis of Grokking Patterns in AI Learning Curves*, phase\-transition framing in *Investigating Cognitive Phase Transitions in Transformer Models*, and curriculum\-induced shifts in *Enhancing Phase Transitions through Curriculum Learning*\. In the later rounds \(ideas 35 onward\), the search further migrates toward more formal mechanistic modeling and incorporates optimization\- and statistical\-physics\-inspired tools, as reflected in *Enhanced Simulated Annealing for Modeling Phase Transitions*, *Harnessing Entropy in Statistical Mechanics for “Grokked Tickets”*, and *Leveraging Network Topology for the Identification of Grokked Tickets*\. This gradual movement from broad exploration toward more grokking\-specific mechanisms suggests that evaluation feedback acts as a selective pressure over the conceptual search space\.
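To make the notion of evaluation feedback acting as selective pressure concrete, the sketch below shows one generic way multi\-dimensional scores could determine which ideas are retained\. It is purely illustrative: the weighted\-sum fitness, the top\-`keep` truncation rule, and all names and scores are assumptions of this example and are not taken from the EvoSci implementation, whose actual behavior is defined by its prompts and main\-text procedure\.

```python
def fitness(scores, weights):
    """Weighted sum of per-dimension review scores (assumed fitness form)."""
    return sum(weights[name] * scores[name] for name in weights)


def select_survivors(population, weights, keep=2):
    """Keep the `keep` highest-fitness ideas (simple truncation selection)."""
    ranked = sorted(population,
                    key=lambda idea: fitness(idea["scores"], weights),
                    reverse=True)
    return ranked[:keep]


# Hypothetical ideas and scores, invented for illustration only.
weights = {"novelty": 1, "feasibility": 1, "effectiveness": 1, "overall": 1}
population = [
    {"title": "A", "scores": {"novelty": 5, "feasibility": 3,
                              "effectiveness": 4, "overall": 4}},
    {"title": "B", "scores": {"novelty": 3, "feasibility": 3,
                              "effectiveness": 3, "overall": 3}},
    {"title": "C", "scores": {"novelty": 4, "feasibility": 5,
                              "effectiveness": 4, "overall": 5}},
]

survivors = select_survivors(population, weights, keep=2)
print([idea["title"] for idea in survivors])  # ['C', 'A']
```

Under such a rule, directions whose ideas repeatedly score poorly stop contributing to later rounds, which matches the qualitative pattern of early branches disappearing from the trajectory\.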
##### Population diversity\.
Although the search becomes increasingly concentrated around grokking\-relevant mechanisms, it does not collapse into a single explanatory path\. Instead, EvoSci maintains structured population diversity throughout iterative exploration\. In the early rounds, exploration spans multiple learning\-related directions, including episodic\-memory\-augmented training in \(7\) *Enhancing Decision\-Making in Deep Reinforcement Learning through Episodic Memory and Cognitive Attention Mechanisms* and adaptive learning architectures in \(11\) *Enhancing Task\-Specific Learning through Episodic Memory\-Driven Neural Adaptation*\. These branches reflect diverse hypotheses about training behavior and generalization, many of which do not persist under evaluation\.
As the search progresses, diversity narrows toward entities that are more directly relevant to grokking, but multiple explanatory basins remain active\. In the later rounds, these include phase\-transition\-based accounts such as \(31\) *Investigating Cognitive Phase Transitions in Transformer Models* and \(45\) *Investigating Phase Transition Analogies in Neural Network Grokking*, optimization\-oriented approaches such as \(41\) *Enhanced Simulated Annealing for Modeling Phase Transitions* and \(43\) *Enhanced Genetic Algorithm Techniques for Optimizing Neural Network Configurations in Grokking Tasks*, and statistical\-physics\- or topology\-based formulations such as \(47\) *Harnessing Entropy in Statistical Mechanics for “Grokked Tickets”* and \(50\) *Leveraging Network Topology for the Identification of Grokked Tickets*\. This contraction without collapse indicates that EvoSci maintains structured population diversity while narrowing its search toward higher\-fitness regions\. Taken together, these observations provide qualitative evidence that the evolutionary process in EvoSci is operationally grounded in persistent variation, feedback\-guided selection, and maintained diversity over time\.
## Appendix E Additional Validation of the Meta\-Review Mechanism
To further evaluate the stability of the proposed review mechanism, we conduct an additional controlled experiment on the NanoGPT topic\. Specifically, we sample 10 generated ideas and repeat the evaluation process 5 times under two settings: \(1\) Single Review, where each idea is assessed by a single reviewer, and \(2\) Meta Review, where multiple reviewer assessments are aggregated through a meta\-review procedure\. Table [7](https://arxiv.org/html/2605.24018#A5.T7) reports the consistency statistics under the two settings\. The average scores are highly similar \(3\.44 for Meta Review vs\. 3\.40 for Single Review\), suggesting that the meta\-review process does not systematically inflate the evaluation scores\. At the same time, the variance under Meta Review is substantially lower than that under Single Review \(0\.018 vs\. 0\.035\), and the score range is also narrower \(0\.3 vs\. 0\.5\)\. These results indicate that the meta\-review mechanism reduces evaluation variability while preserving similar central tendencies, leading to more stable and consistent assessments\.

| Evaluation Setting | Mean | Variance | Min | Max | Range |
| --- | --- | --- | --- | --- | --- |
| Meta Review | 3\.44 | 0\.018 | 3\.3 | 3\.6 | 0\.3 |
| Single Review | 3\.40 | 0\.035 | 3\.2 | 3\.7 | 0\.5 |

Table 7: Consistency comparison between Single Review and Meta Review evaluations\.
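The columns of Table 7 are standard summary statistics over repeated evaluations\. The sketch below shows how such a row could be computed; the input scores are hypothetical placeholders rather than the experiment's data, and the use of population variance \(`statistics.pvariance`\) is an assumption, since the text does not say whether population or sample variance was reported\.

```python
from statistics import mean, pvariance


def consistency_stats(scores):
    """Summarize repeated evaluation scores as mean, variance, min, max, range.

    Uses population variance; swap in statistics.variance for the
    sample (n - 1) version if that matches the intended definition.
    """
    lowest, highest = min(scores), max(scores)
    return {
        "mean": mean(scores),
        "variance": pvariance(scores),
        "min": lowest,
        "max": highest,
        "range": highest - lowest,
    }


# Hypothetical repeated scores, invented for illustration only.
single_review_runs = [3.0, 3.5, 4.0]
meta_review_runs = [3.4, 3.5, 3.6]

single = consistency_stats(single_review_runs)
meta = consistency_stats(meta_review_runs)
print(meta["variance"] < single["variance"])  # True
```

Comparing the two dictionaries side by side reproduces the kind of contrast the table draws: similar means with a smaller variance and range for the aggregated setting\.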
## Appendix F Prompt
### F\.1 Agent Roles Definition
We define a set of specialized agent roles for the proposed multi\-agent research framework, where the corresponding system prompts are illustrated in Figs\. [6](https://arxiv.org/html/2605.24018#A6.F6)–[9](https://arxiv.org/html/2605.24018#A6.F9)\.
Figure 6: System prompt for the Mentor agent\.
Figure 7: System prompt for the Prime Research Scientist agent\.
Figure 8: System prompt for the Assistant Research Scientist agent\.
Figure 9: System prompt for the Evaluator agent\.
### F\.2 Task Flow Definition
#### F\.2\.1 Topic Analysis
The prompt for the topic analysis task is illustrated in Fig\. [10](https://arxiv.org/html/2605.24018#A6.F10)\.
Figure 10: System prompt for the topic analysis task\.
#### F\.2\.2 Problem Cluster Generation
The prompt for the problem cluster generation task is illustrated in Fig\. [11](https://arxiv.org/html/2605.24018#A6.F11)\.
Figure 11: System prompt for the problem cluster generation task\.
#### F\.2\.3 Select Problem Cluster
The prompt for the problem cluster selection task is illustrated in Fig\. [12](https://arxiv.org/html/2605.24018#A6.F12)\.
Figure 12: System prompt for the problem cluster selection task\.
#### F\.2\.4 Background Investigation
The prompt for the background investigation task is illustrated in Fig\. [13](https://arxiv.org/html/2605.24018#A6.F13)\.
Figure 13: System prompt for the background investigation task\.
#### F\.2\.5 Problem Analysis
The prompt for the problem analysis task is illustrated in Fig\. [14](https://arxiv.org/html/2605.24018#A6.F14)\.
Figure 14: System prompt for the problem analysis task\.
#### F\.2\.6 Seed Idea Generation
The prompt for the seed idea generation task is illustrated in Fig\. [15](https://arxiv.org/html/2605.24018#A6.F15)\.
Figure 15: System prompt for the seed idea generation task\.
#### F\.2\.7 Idea Generation
The prompt for the idea generation task is illustrated in Fig\. [16](https://arxiv.org/html/2605.24018#A6.F16)\.
Figure 16: System prompt for the idea generation task\.
#### F\.2\.8 Evaluation
The prompt for the evaluation task is illustrated in Fig\. [17](https://arxiv.org/html/2605.24018#A6.F17)\.
Figure 17: System prompt for the evaluation task\.
#### F\.2\.9 Iterative Refinement
The prompt for the iterative refinement task is illustrated in Fig\. [18](https://arxiv.org/html/2605.24018#A6.F18)\.
Figure 18: System prompt for the iterative refinement task\.
#### F\.2\.10 Evaluation\-Guided Loop
The prompt for the evaluation\-guided loop task is illustrated in Fig\. [19](https://arxiv.org/html/2605.24018#A6.F19)\.
Figure 19: System prompt for the evaluation\-guided loop task\.
### F\.3 Evaluation Methodologies Definition
#### F\.3\.1 Multi\-Reviewer \+ Meta\-Reviewer Mechanism
The NeurIPS\-style reviewer prompt used for the multi\-reviewer and meta\-reviewer evaluation mechanism is illustrated in Fig\. [20](https://arxiv.org/html/2605.24018#A6.F20), and the ICLR\-style reviewer prompt is illustrated in Figs\. [21](https://arxiv.org/html/2605.24018#A6.F21)–[22](https://arxiv.org/html/2605.24018#A6.F22)\.
Figure 20: NeurIPS\-style LLM reviewer prompt used for idea evaluation\.
Figure 21: ICLR\-style LLM reviewer prompt used for idea evaluation\.
Figure 22: ICLR\-style LLM reviewer prompt used for idea evaluation \(continued\)\.
#### F\.3\.2 Tournament\-Style Idea Ranking
The prompt used for tournament\-style pairwise comparison and relative idea ranking is illustrated in Fig\. [23](https://arxiv.org/html/2605.24018#A6.F23)\.
Figure 23: Tournament\-style pairwise comparison prompt used for idea ranking\.