AutoSci: A Memory-Centric Agentic System for the Full Scientific Research Lifecycle
Summary
AutoSci is a memory-centric agentic system designed to automate the full scientific research lifecycle, from literature understanding to rebuttal, using LLM-based agents with persistent memory and self-evolution capabilities.
# AutoSci: A Memory-Centric Agentic System for the Full Scientific Research Lifecycle
Source: [https://arxiv.org/html/2605.31468](https://arxiv.org/html/2605.31468)
Weitong Qian\*,†, Beicheng Xu\*,†, Zhongao Xie\*,†, Bowen Fan†, Guozheng Tang†, Jiale Chen†, Xinzhe Wu†, Mingtian Yang†, Chenyang Di†, Jiajun Li†, Lingching Tung†, Peichao Lai†, Yifei Xia†, Ziyi Guo†, Yanwei Xu†, Yanzhao Qin†, Shaoduo Gan†, Xupeng Miao†, Bin Cui†,🖂
†Peking University, Beijing, China. \*Equal contribution. 🖂Corresponding author.
\*{weitong.qian, beichengxu, zaxie25}@stu.pku.edu.cn, 🖂bin.cui@pku.edu.cn
###### Abstract
Scientific research has traditionally been human-intensive, requiring researchers to coordinate literature, ideas, experiments, manuscripts, and review responses across long project cycles. The rise of LLM-based scientific agents creates an opportunity to automate this process. Such a system must support the full research lifecycle, maintain structured persistent memory across projects, and improve its own research procedures over time. However, existing systems satisfy these requirements only partially or not at all, leaving a gap for a unified automated scientific research system. To fill this gap, we present AutoSci, a memory-centric agentic system for the full scientific research lifecycle. AutoSci is organized around four modules. SciMem provides schema-governed research memory, separating Long-Term Knowledge Memory for reusable scientific knowledge from Active Research Memory for project-level artifacts such as ideas, experiments, manuscripts, and reviews. SciFlow executes a five-stage lifecycle from literature understanding to rebuttal through a harness that controls state, context, verification, feedback, and orchestration. SciDAG augments difficult skills with DAG-shaped multi-agent operators and reusable stage-specific templates. SciEvolve converts feedback signals from users, experiments, reviews, and external environments into versioned updates to SciMem organization, SciFlow skills, and SciDAG templates. Together, these modules make AutoSci a persistent research environment that can execute, remember, and evolve across research projects. The code repository is available at [https://github.com/skyllwt/AutoSci](https://github.com/skyllwt/AutoSci).
## 1 Introduction
Scientific research has long been a heavily human-driven process: researchers must manually track literature, formulate hypotheses, implement methods, run experiments, analyze evidence, write papers, and respond to reviews. This process is labor-intensive, especially when projects require broad literature coverage, experimentation, and careful coordination across many intermediate artifacts. The rise of large language models and multi-agent systems has begun to change this picture. When coupled with tools, code execution, external scientific resources, and coordinated workflows, these systems can automate the research lifecycle and support systematic exploration, validation, monitoring, and manuscript production (Huang et al., [2025](https://arxiv.org/html/2605.31468#bib.bib12); Chai et al., [2025](https://arxiv.org/html/2605.31468#bib.bib15); Zhang, [2026](https://arxiv.org/html/2605.31468#bib.bib16)).
One line of work focuses on particular scientific capabilities rather than a whole paper-production pipeline. The AI co-scientist (Gottweis et al., [2025](https://arxiv.org/html/2605.31468#bib.bib8)) targets scientist-in-the-loop hypothesis generation and biomedical validation; POPPER (Huang et al., [2025](https://arxiv.org/html/2605.31468#bib.bib12)) studies automated falsification of free-form hypotheses; AutoSciLab (Desai et al., [2024](https://arxiv.org/html/2605.31468#bib.bib13)) develops self-driving laboratory workflows; bilevel LLM–simulation optimization connects LLM reasoning with scientific simulation (Ma et al., [2024](https://arxiv.org/html/2605.31468#bib.bib14)); SciMaster/X-Master (Chai et al., [2025](https://arxiv.org/html/2605.31468#bib.bib15)) uses code and tool-augmented reasoning for scientific problem solving; and Deep Researcher Agent (Zhang, [2026](https://arxiv.org/html/2605.31468#bib.bib16)) emphasizes sustained experiment execution, monitoring, and reflection. These works demonstrate the value of LLM agents for individual scientific operations, but they do not by themselves define a complete research lifecycle.
More ambitious systems move beyond individual capabilities and target complete research workflows. The AI Scientist series (Lu et al., [2024](https://arxiv.org/html/2605.31468#bib.bib4); Yamada et al., [2025](https://arxiv.org/html/2605.31468#bib.bib5)) and AI-Researcher (Tang et al., [2025](https://arxiv.org/html/2605.31468#bib.bib6)) automate idea generation, experiment execution, paper writing, review, and later template-free agentic search; while Agent Laboratory (Schmidgall et al., [2025](https://arxiv.org/html/2605.31468#bib.bib7)) supports research workflows starting from a user-provided idea. Related full-loop systems further model research–review–refinement, goal-oriented cumulative findings, and memory-augmented discovery, as in CycleResearcher (Weng et al., [2024](https://arxiv.org/html/2605.31468#bib.bib9)), DeepScientist (Weng et al., [2025](https://arxiv.org/html/2605.31468#bib.bib10)), and EvoScientist (Lyu et al., [2026](https://arxiv.org/html/2605.31468#bib.bib11)). As these workflows become longer-running, research harness design becomes central, since scientific agents need durable state, tool contracts, review gates, and recoverable execution in addition to stronger models. Representative harness-oriented systems include ARIS (Yang et al., [2026](https://arxiv.org/html/2605.31468#bib.bib17)), NORA (Zhou et al., [2026](https://arxiv.org/html/2605.31468#bib.bib18)), and Deep Researcher Agent (Zhang, [2026](https://arxiv.org/html/2605.31468#bib.bib16)), which add persistent state, monitoring, provenance or claim checks, review gates, domain guardrails, and recoverable long-running execution. These systems move scientific agents toward end-to-end automation, but most remain organized around a single project or a paper-generation pipeline.
Table 1: Feature-level comparison of representative full-loop scientific-agent systems and research harnesses. Symbols denote full support (✓), partial or project-local support (∘), and features that are not a primary focus (–). Here, persistent memory refers to research memory that survives across complete research or paper-generation pipelines and can be reused by later pipelines, rather than state retained only within a single run. Full system evolution further requires modifying the scientific-agent system itself, such as its skills, workflow protocols, prompts, or artifact contracts; accumulating reusable textual experience alone is counted as partial support.
From a system perspective, an automated scientific research system should satisfy four requirements:
1) Full-lifecycle support. Since scientific research spans literature understanding, idea generation, experimental validation, manuscript writing, and rebuttal, an automated system should provide stage-specific skills and artifact handoffs throughout the entire process. Table [1](https://arxiv.org/html/2605.31468#S1.T1) therefore focuses on representative full-loop scientific agents and compares how well they satisfy the remaining system requirements.
2) Execution harness. Long-running research cannot rely on unconstrained conversations alone; it requires persistent state, controlled context, verification gates, feedback routing, and recoverable orchestration. However, some earlier systems mainly coordinate research with lighter runtime control.
3) Structured and persistent memory. Memory must be structured to make scientific information semantically interpretable, extensible, and organized by its dependencies rather than stored as undifferentiated text. However, existing systems mostly store summaries, logs, strategies, or artifacts rather than organizing scientific information as typed objects with explicit dependencies. Moreover, most prior systems retain memory only within a single research project or paper-generation pipeline, rather than preserving reusable cross-project experience for future workflows.
4) Self-evolution. An automated research system should not only accumulate textual experience, but also use user feedback and experimental outcomes to iteratively improve its own skills and workflows. Although EvoScientist distills prior experience into reusable textual memories, existing systems generally do not revise the agent system itself, such as its skills and workflow protocols.
Overall, the comparison shows that existing systems remain fragmented: they improve different parts of automated science, but do not yet form a unified research system that can execute, remember, and evolve across projects.
To address this gap, we introduce *AutoSci*, a memory-centric agentic system for the full scientific research lifecycle. AutoSci is built around four modules: *SciMem*, a structured persistent research memory that stores scientific knowledge, active project artifacts, and cross-project experience; *SciFlow*, a harness-based execution framework for the full scientific research lifecycle, coordinating literature, ideation, experimentation, writing, and rebuttal; *SciDAG*, a DAG-based multi-agent augmentation mechanism for stages that require broader search, debate, verification, or refinement; and *SciEvolve*, a full-system evolution layer that converts user feedback, experimental outcomes, and review signals into versioned updates to memory, skills, and orchestration templates. Figure [1](https://arxiv.org/html/2605.31468#S1.F1) provides an overview of the AutoSci architecture. Our main contributions are summarized as follows:
- We formulate automated scientific research as a long-lifecycle system problem that requires execution, memory, and evolution across projects, rather than isolated task automation.
- We design a memory-centric architecture in which structured persistent scientific memory serves as the shared substrate for research workflows, multi-agent execution, and self-improvement.
- We implement an end-to-end AutoSci system that integrates lifecycle skills, harnessed execution, DAG-based multi-agent augmentation, and versioned self-evolution.
- We conduct two end-to-end case studies in GPU kernel optimization and biomedical drug discovery, where AutoSci generates reviewable paper-level artifacts that receive automated ICLR-review scores of 6.3/10 and 5.8/10, respectively.
Figure 1: Overview of AutoSci.
## 2 System Overview
AutoSci targets long\-lifecycle scientific research: a system should not only complete one research project, but also accumulate knowledge, experimental experience, submission feedback, and execution strategies across many projects\. This goal requires the agent to behave less like a single\-session assistant and more like a persistent research environment that can interact with users, external resources, and experimental systems over time\. We follow four design principles:
- Environment interaction. AutoSci should interact with the full research environment, including user instructions, literature and codebases, and the experimental runtime.
- Structured persistent memory. AutoSci should maintain structured long-term scientific memory, so that papers, concepts, methods, experiments, reviews, and their relations can persist and be reused across complete research projects.
- Harnessed execution. AutoSci should run through an explicit harness to make the research lifecycle interruptible, reviewable, and reusable across sessions.
- Full-system evolution. AutoSci should go beyond accumulating reusable experience by turning recurring feedback into controlled updates to its own skills, protocols, and prompts.
Table 2: AutoSci system overview (v1.0.0, May 2026).
These principles lead to four modules. *SciMem* provides schema-governed research memory; *SciFlow* executes the full scientific lifecycle over that memory; *SciDAG* optionally augments difficult stages with DAG-shaped multi-agent operators; and *SciEvolve* converts traces and feedback into versioned system updates. Together, the modules form a closed loop in which research artifacts are produced, checked, stored, reused, and eventually used to improve the system itself. Table [2](https://arxiv.org/html/2605.31468#S2.T2) summarizes the implemented system scope before we describe each module in detail.
## 3 SciMem: Schema-Governed Research Memory
SciMem is designed for long-lifecycle scientific memory: memory should not disappear after a single experiment, paper, or research project, but should remain reusable across future projects. To support this goal, SciMem separates memory into two regions with different responsibilities. *Long-Term Knowledge Memory* preserves consolidated scientific knowledge that should accumulate across projects, while *Active Research Memory* tracks the fast-changing state of an ongoing research paper or experimental report. Below, we first introduce the two memory regions and then describe how SciMem grows and flows across them over time.
### 3.1 Long-Term Knowledge Memory
Long-Term Knowledge Memory preserves the scientific knowledge that AutoSci has accumulated from external sources and prior research cycles, so that later projects can reuse it without reconstructing the same context from scratch. The region is populated by literature ingestion skills from sources such as arXiv, Semantic Scholar, GitHub, and user-provided documents, and can be refined by consolidated experience from completed projects. The region is organized by a typed entity schema rather than by flat documents or vector chunks. Table [3](https://arxiv.org/html/2605.31468#S3.T3) summarizes the long-term entity types. *In implementation*, each entity is stored as a .md page.
Beyond defining entity types, the long-term schema also governs how these entities are connected. For example, Topic entities provide the coarsest organizing layer: Paper, Foundation, Concept, Method, and People entities can be placed within one or more topics. Paper entities act as evidence-bearing sources that introduce or critique Concept entities, apply or extend Method entities, and contribute to Foundation entities. Foundation entities provide stable background knowledge that grounds Concept and Method entities. These typed relations turn Long-Term Knowledge Memory from a set of structured pages into a traversable scientific knowledge graph. *In implementation*, typed links are stored as schema-constrained bidirectional cross-references between entity pages, making the graph navigable and mechanically checkable. Figure [2(a)](https://arxiv.org/html/2605.31468#S3.F2.sf1) visualizes the entity schema and typed connections of Long-Term Knowledge Memory.
Long-Term Knowledge Memory has two defining properties: 1) *semantic addressability*, which lets downstream skills retrieve typed scientific objects and relations directly; and 2) *incremental extensibility*, which lets new literature and validated findings be appended across research pipelines, making the memory a reusable scientific substrate rather than a project-local cache.
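To make the idea of schema-constrained bidirectional links concrete, the following is a minimal in-memory sketch in Python. The relation names (`introduces`, `grounds`, `in_topic`, and so on), the `ALLOWED_LINKS` table, and the `KnowledgeGraph` class are assumptions chosen for illustration; the paper describes the relations only in prose and stores entities as .md pages, which this sketch does not model.

```python
from dataclasses import dataclass, field

# Hypothetical link schema: (source type, relation) -> allowed target types.
ALLOWED_LINKS = {
    ("Paper", "introduces"): {"Concept"},
    ("Paper", "critiques"): {"Concept"},
    ("Paper", "applies"): {"Method"},
    ("Paper", "extends"): {"Method"},
    ("Paper", "contributes_to"): {"Foundation"},
    ("Foundation", "grounds"): {"Concept", "Method"},
}
# Entity types that can be placed within one or more topics.
for _t in ("Paper", "Foundation", "Concept", "Method", "People"):
    ALLOWED_LINKS[(_t, "in_topic")] = {"Topic"}


@dataclass
class Entity:
    name: str
    type: str
    links: set = field(default_factory=set)      # outgoing (relation, target)
    backlinks: set = field(default_factory=set)  # incoming (relation, source)


class KnowledgeGraph:
    def __init__(self):
        self.entities = {}

    def add(self, name, type_):
        self.entities[name] = Entity(name, type_)

    def link(self, src, relation, dst):
        s, d = self.entities[src], self.entities[dst]
        if d.type not in ALLOWED_LINKS.get((s.type, relation), set()):
            raise ValueError(f"{s.type} -{relation}-> {d.type} is not in the schema")
        s.links.add((relation, dst))
        d.backlinks.add((relation, src))  # keep the cross-reference bidirectional

    def dangling(self):
        """Forward links whose target lacks the matching back-reference."""
        return [(e.name, rel, dst)
                for e in self.entities.values()
                for rel, dst in e.links
                if (rel, e.name) not in self.entities[dst].backlinks]

    def neighbors(self, name, relation):
        """Typed retrieval: targets reachable from `name` via `relation`."""
        return sorted(dst for rel, dst in self.entities[name].links if rel == relation)
```

In this sketch, `neighbors` corresponds loosely to semantic addressability (retrieval by type and relation), while `dangling` is the kind of mechanical check that bidirectional links make possible.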
Table 3: Entity types in Long-Term Knowledge Memory.
(a) Long-Term Knowledge Memory.
(b) Active Research Memory.
Figure 2: Two memory regions in SciMem.
### 3.2 Active Research Memory
Active Research Memory is the project-level workspace for producing a research paper or experimental report from start to finish. It records the key active artifacts of the current project, including Idea, Experiment, Manuscript, and Review entities, as AutoSci advances the project. Each active entity carries an explicit lifecycle state. Idea entities move from proposed to testing, then to tested, validated, or failed; Experiment entities move from planned and running to completed or abandoned; Manuscript entities move through drafting, revised, submitted, and final version; and Review entities record received feedback, rebuttal drafting, revision, and final decision. These lifecycle states make Active Research Memory a structured progress map rather than a loose project folder. *In implementation*, each active entity is stored as a .md page with a schema-defined lifecycle state. Figure [2(b)](https://arxiv.org/html/2605.31468#S3.F2.sf2) visualizes the active entity schema and lifecycle structure of Active Research Memory.
Recording Active Research Memory lets AutoSci recover and audit the state of a research project without relying on chat history. At any moment, the system can identify which Idea entities are still viable, which Experiment entities produced evidence, and which Review concerns remain unresolved. Terminal active artifacts also become the bridge back to Long-Term Knowledge Memory as reusable knowledge for future projects.
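The lifecycle states above can be read as a small state machine per entity type. The sketch below is one possible encoding in Python: the state names are taken from the text, but the exact transition edges (for example, whether a planned experiment can be abandoned directly) and the helper names `advance`, `is_terminal`, and `open_items` are assumptions.

```python
# Allowed transitions per entity type. State names come from the text;
# the exact edges between them are an assumption.
TRANSITIONS = {
    "Idea": {
        "proposed": {"testing"},
        "testing": {"tested", "validated", "failed"},
    },
    "Experiment": {
        "planned": {"running", "abandoned"},
        "running": {"completed", "abandoned"},
    },
    "Manuscript": {
        "drafting": {"revised"},
        "revised": {"submitted"},
        "submitted": {"final version"},
    },
}


def advance(entity_type, current, new):
    """Return the new state, or raise if the schema does not allow the move."""
    allowed = TRANSITIONS[entity_type].get(current, set())
    if new not in allowed:
        raise ValueError(f"{entity_type}: {current} -> {new} is not allowed")
    return new


def is_terminal(entity_type, state):
    """A state with no outgoing transitions is terminal."""
    return not TRANSITIONS[entity_type].get(state)


def open_items(entities):
    """Names of entities in a non-terminal state, i.e. work that can be resumed."""
    return [e["name"] for e in entities if not is_terminal(e["type"], e["state"])]
```

A query such as `open_items` is what allows project state to be recovered from memory alone, without replaying a conversation.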
Figure 3: Memory growth and flow in SciMem.
### 3.3 Memory Growth and Flow
The previous sections define what SciMem stores. We next describe how the memory grows. SciMem expands through three complementary flow paths: aggregation within Long-Term Knowledge Memory, bidirectional flow between Long-Term Knowledge Memory and Active Research Memory, and temporal accumulation of cross-cycle experience, as illustrated in Figure [3](https://arxiv.org/html/2605.31468#S3.F3).
Long-term aggregation. Long-Term Knowledge Memory grows through internal aggregation. Newly ingested Paper entities do not remain isolated reading notes. Their key observations can update the domain-level understanding of a Topic; recurring definitions or mechanisms can refine a Concept; implementation details and empirical behavior can enrich a Method; and repeatedly supported background knowledge can strengthen a Foundation. This flow turns low-level source material into higher-level scientific memory, allowing long-term entities to become progressively more informative as more literature and project experience accumulate.
Cross-region flow. SciMem supports cross-region activation and consolidation. 1) Long-Term → Active. During activation, an active entity draws on related long-term memory: an Idea entity is grounded in Topic entities, prior evidence, Concept entities, and Method entities; an Experiment entity activates the Method entities and assumptions it relies on; and a Manuscript or Review entity activates the evidence it relies on. 2) Active → Long-Term. During consolidation, terminal active artifacts write back reusable scientific traces. Idea entities in the validated state, completed experimental findings, failed attempts, and unresolved limitations can update the corresponding long-term entries, especially the relevant Topic pages.
Cross-cycle accumulation. SciMem accumulates methodological memory across cycles. Reviewer concerns and rebuttal outcomes are retained as cross-cycle notes that can be consulted when later projects enter writing or rebuttal stages. As a result, SciMem grows not only in what AutoSci knows about a topic, but also in how AutoSci conducts future research, experiments, writing, and rebuttal.
Trust-guarded writes. All SciMem writes pass through Trust Guard before entering the usable graph, since memory errors can propagate to future projects. Trust Guard checks both *form validity* (schema fields, lifecycle states, link types, and bidirectional links) and *content validity* (evidence support and consistency with existing memory). Form checks are handled by deterministic linting, while content checks are handled by an independent reviewer agent. Each write is assigned Pass, Warn, or Block; blocked artifacts are quarantined until resolved.
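A two-layer write gate of this kind might be sketched as follows. The text does not say how check outcomes map to Pass, Warn, or Block, so the policy here (any form failure blocks, reviewer concerns warn) is an assumption, as are the field names in `REQUIRED_FIELDS` and the page layout. The reviewer agent is represented by a plain callable passed in by the caller.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "Pass"
    WARN = "Warn"
    BLOCK = "Block"


REQUIRED_FIELDS = {"name", "type", "state"}  # hypothetical schema fields


def lint_form(page):
    """Deterministic form checks: required fields and bidirectional links."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - set(page))]
    # page["links"] maps a target name to whether the target links back (assumed layout).
    for target, has_backlink in page.get("links", {}).items():
        if not has_backlink:
            problems.append(f"link to {target} has no back-reference")
    return problems


def trust_guard(page, review_content, quarantine):
    """Assumed policy: form failures block; reviewer concerns only warn."""
    if lint_form(page):
        quarantine.append(page)  # blocked artifacts wait here until resolved
        return Verdict.BLOCK
    concerns = review_content(page)  # stand-in for the independent reviewer agent
    return Verdict.WARN if concerns else Verdict.PASS
```

Running the cheap deterministic lint first means the reviewer callable is only invoked on pages that are already well-formed.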
## 4 SciFlow: Memory-Grounded Research Lifecycle
SciFlow is the lifecycle executor that runs AutoSci over a complete research project. Its goal is to make long-horizon research executable, resumable, and memory-grounded rather than a sequence of free-form agent conversations. To this end, SciFlow decomposes a project into five stages, *Literature*, *Ideation*, *Experiment*, *Writing*, and *Rebuttal*. Each stage is implemented as a harness-based skill contract. Figure [4](https://arxiv.org/html/2605.31468#S4.F4) illustrates how the five-stage skills and harness guarantees are organized.
Figure 4: SciFlow research lifecycle and harness organization.
### 4.1 Five-Stage Research Lifecycle
SciFlow follows the natural lifecycle of a scientific project: it first builds the knowledge base, then proposes candidate directions, turns selected ideas into experimental evidence, writes the evidence into a paper, and finally handles reviewer feedback after submission. *In implementation*, SciFlow is supported by more than 30 research skills spanning the five lifecycle stages.
Memory-grounded execution. The research lifecycle is memory-grounded because each stage is coupled to SciMem through explicit read and write operations. *Literature* writes external knowledge into long-term memory. *Ideation* reads long-term memory and writes Idea entities. *Experiment* reads selected ideas, then writes evidence-bearing Experiment entities. *Writing* reads provenance and evidence chains to produce Manuscript artifacts. *Rebuttal* reads the submitted manuscript, review records, and prior rebuttal lessons, then writes new Review records. This read-write loop makes SciFlow memory-grounded: stages communicate through SciMem rather than transient conversation, and the memory used by later stages is already enriched by earlier stages.
### 4.2 Harness Guarantees
The five\-stage lifecycle describes what scientific work is performed, while the SciFlow harness controls how this work is executed across stages\. The harness is the cross\-stage control layer around skills to make the lifecycle interruptible, reviewable, and reusable across sessions\.
- State. SciFlow records stage outputs, lifecycle states, links, and pipeline-level progress outside the transient LLM context, making projects resumable from a specified stage.
- Context. Before each skill runs, SciFlow equips it with a tailored SciMem view, providing the evidence and the prior failures or lessons needed for that skill without exposing the full memory graph.
- Verification. Trust Guard checks memory writes and high-stakes handoffs through schema/link validation and evidence-oriented review before downstream stages consume them.
- Feedback. Failures and critiques are treated as process signals: insufficient evidence can trigger /refine or self-evolution.
- Orchestration. The /research loop invokes stage skills, records progress, handles stopping points, and supports long-running experiments through non-blocking execution and monitoring.
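The State and Orchestration guarantees can be illustrated with a small checkpointing loop. This is a sketch under assumptions, not the AutoSci implementation: the JSON state file, the `run_pipeline` function, and the convention that each skill is a callable taking earlier outputs are all hypothetical, and stage outputs are assumed to be JSON-serializable.

```python
import json
from pathlib import Path

STAGES = ["literature", "ideation", "experiment", "writing", "rebuttal"]


def run_pipeline(state_path, skills, start_from=None):
    """Run stage skills in order, checkpointing after each completed stage."""
    path = Path(state_path)
    if path.exists():
        state = json.loads(path.read_text())
    else:
        state = {"completed": [], "outputs": {}}

    if start_from is not None:
        # Rerun from a chosen stage: discard that stage and everything after it.
        keep = STAGES[:STAGES.index(start_from)]
        state["completed"] = [s for s in state["completed"] if s in keep]
        state["outputs"] = {s: o for s, o in state["outputs"].items() if s in keep}

    for stage in STAGES:
        if stage in state["completed"]:
            continue  # already finished in an earlier session
        state["outputs"][stage] = skills[stage](state["outputs"])
        state["completed"].append(stage)
        path.write_text(json.dumps(state))  # progress lives outside the LLM context
    return state
```

Because the checkpoint is written after every stage, a failure in one stage leaves earlier results on disk, and a later call resumes at the first unfinished stage.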
## 5 SciDAG: DAG-Based Multi-Agent Augmentation
SciDAG is an optional augmentation for SciFlow. A selected skill can call a directed acyclic graph of reusable multi-agent operators as a tool to strengthen its execution. Given a stage task $z$, the SciMem-compiled context, and the artifact schema expected by SciFlow, SciDAG executes an operator graph $G=(V,E)$ and returns a final result that conforms to the same artifact contract, so downstream SciFlow stages remain unchanged.
Adaptive operator graph. SciDAG represents each tool call as an operator graph. Each node $v_i \in V$ instantiates an operator $o_i \in \mathcal{O}$ with a specialized sub-agent and produces an intermediate output from upstream node outputs. Directed edges specify information flow, while conditional edges call a router over the current execution state to decide whether to continue, retry, branch, prune, or stop. Thus, SciDAG is not a fixed multi-agent chain: it adapts execution according to intermediate quality, cost, and convergence signals.
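As a rough illustration of routed execution over an operator graph, the sketch below walks a DAG in topological order and lets a router decide what happens after each node. It covers only four of the five decisions named above (continue, retry, prune, stop) and omits branching. The `run_dag` function, the operator and router signatures, and the retry cap are assumptions made for this example.

```python
from graphlib import TopologicalSorter

MAX_RETRIES = 2  # hypothetical cap on retries per node


def run_dag(operators, edges, router, task):
    """operators: name -> callable(task, upstream_outputs); edges: (src, dst) pairs.

    router(name, result, attempt) returns "continue", "retry", "prune", or "stop".
    """
    preds = {name: set() for name in operators}
    for src, dst in edges:
        preds[dst].add(src)

    outputs = {}
    for name in TopologicalSorter(preds).static_order():
        upstream = {p: outputs[p] for p in preds[name] if p in outputs}
        if preds[name] and not upstream:
            continue  # every input branch was pruned, so skip this node
        for attempt in range(MAX_RETRIES + 1):
            result = operators[name](task, upstream)
            decision = router(name, result, attempt)
            if decision != "retry":
                break
        if decision == "prune":
            continue  # drop this branch; downstream nodes will not see it
        outputs[name] = result  # after the retry cap, the last result is kept
        if decision == "stop":
            break
    return outputs
```

For instance, with a `generate` node feeding two critic nodes that both feed a `merge` node, a router that prunes one critic leaves `merge` with a single upstream input rather than failing the whole graph.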
Evolving templates. To make such graphs reusable, SciDAG stores common operator graphs as stage-aware templates. For example, ideation templates emphasize diverse generation and debate, experimentation templates emphasize reliability checks, and writing templates emphasize evidence fidelity and refinement. The template repository stores reusable graphs together with lightweight metadata and past execution experience. For a new skill call, SciDAG retrieves a suitable template, executes it, and writes the resulting trace and feedback back to the repository. Appendix [A](https://arxiv.org/html/2605.31468#A1) lists the operator library and shows stage-specific templates for ideation, experimentation, and writing.
## 6 SciEvolve: Full-System Evolution
Figure 5: SciEvolve self-evolution loop.
SciEvolve implements the second part of AutoSci’s self-improvement mechanism. Beyond the reusable textual experience accumulated through SciMem, SciEvolve turns feedback signals from research practice into auditable updates to SciMem organization, SciFlow skills, and SciDAG orchestration templates. *In implementation*, the three evolution paths are exposed as separate skills: /dream for SciMem evolution, /forge for SciFlow evolution, and /morph for SciDAG evolution. Figure [5](https://arxiv.org/html/2605.31468#S6.F5) illustrates this signal-to-update loop.
Evolution signals. SciEvolve collects signals from three environments. The user environment provides instructions, corrections, and research preferences. The task environment provides stage outcomes, experimental evidence, and failure reasons. The open environment provides new papers, codebases, and venue expectations. These signals are first stored in a signal repository, where SciEvolve detects recurring patterns and uses them to trigger updates to the relevant system module.
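One simple way to picture recurring-pattern detection over a signal repository is a frequency count with a threshold. The sketch below assumes each signal is a record with `module` and `pattern` fields and that a pattern counts as recurring after three occurrences; both the record layout and the threshold are illustrative assumptions. Only the skill names /dream, /forge, and /morph come from the text.

```python
from collections import Counter

# Slash skills named in the text, keyed by the module they evolve.
EVOLUTION_SKILL = {"SciMem": "/dream", "SciFlow": "/forge", "SciDAG": "/morph"}


def recurring_updates(signals, min_count=3):
    """Return (skill, pattern) pairs for patterns seen at least `min_count` times."""
    counts = Counter((s["module"], s["pattern"]) for s in signals)
    return [(EVOLUTION_SKILL[module], pattern)
            for (module, pattern), n in counts.items() if n >= min_count]
```

A threshold like this keeps one-off complaints from changing the system, which matches the text's emphasis on recurring patterns as the trigger for updates.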
SciMem evolution. Memory evolution maintains the usefulness of SciMem as it grows. /dream periodically reviews recent traces and related memory neighborhoods. It can down-weight or archive stale entries, compress redundant material, consolidate related entities, and propose new associations across Concept, Method, Paper, Idea, and Experiment entities.
SciFlow evolution. Skill evolution treats SciFlow skills as versioned research protocols. A skill is not only a prompt, but a structured procedure that specifies inputs, required SciMem context, execution steps, checks, output artifacts, and handoff rules. After a research episode, SciEvolve analyzes repeated failure modes, user corrections, review warnings, unsupported claims, high-cost stages, and successful ad hoc repairs. When the evidence is stable enough, it proposes patches such as strengthening claim-evidence checks in writing skills, revising handoff requirements, or promoting a successful repair strategy into a reusable skill step.
SciDAG evolution. /morph uses SciDAG traces to improve multi-agent templates across executions. When an operator repeatedly underperforms, SciEvolve can revise its prompt, role, or tool configuration. When a graph shows stable failure or success patterns, SciEvolve can prune weak branches, add verification nodes, or specialize the template for a stage and problem type. Thus, SciEvolve improves SciDAG across executions.
## 7 Case Studies and Evaluation
### 7.1 Experimental Setup
We evaluate AutoSci through two end\-to\-end research case studies that span different scientific domains: GPU kernel optimization and biomedical drug discovery\. The goal is not to test an isolated skill, but to examine whether AutoSci can run a complete research cycle, including literature organization, idea generation, novelty checking, feasibility analysis, experiment design, execution, result interpretation, and paper\-oriented artifact production\. Both case studies use the same AutoSci system, including SciMem for persistent research memory, SciFlow for lifecycle execution, SciDAG for multi\-agent augmentation when needed, and SciEvolve for recording reusable feedback signals\.
Case Study 1: GPU Kernel Optimization. AutoSci explores iterative GPU operator optimization with Claude Code guided by performance feedback. The experiment is executed on a 4× NVIDIA A40 environment with Triton 3.2.0 and PyTorch 2.6.0+cu124 in the TritonBench workspace.
Case Study 2: Biomedical Drug Discovery. AutoSci explores structure-aware post-translational modification (PTM) modeling for degrader target nomination. The real executed blocks run on a single NVIDIA RTX 4060 using public DeepTernary v1.0.0 and PROTAC-STAN inference repositories, with additional Boltz-2-conditioned cross-checks on selected protein-of-interest cases.
For each case study, the user provides an initial research direction and a small set of relevant seed papers. We instantiate the AutoSci agents with Claude Code powered by Opus 4.7. The user then invokes the /research workflow, which first ingests the seed papers into SciMem, uses /discover to retrieve and ingest additional related papers, and constructs a structured Long-Term Knowledge Memory over papers, topics, concepts, methods, foundations, and researchers. After this memory-building stage, /research proceeds through ideation, novelty and feasibility screening, experiment design, experiment execution, result analysis, and manuscript-oriented artifact generation. Because these case studies are simulated research submissions rather than papers undergoing real external peer review, we do not evaluate the rebuttal stage.
### 7.2 Structured Memory Construction
We first examine whether AutoSci produces a structured and reusable Long-Term Knowledge Memory rather than a flat collection of notes. Figure [6](https://arxiv.org/html/2605.31468#S7.F6) shows an example memory graph constructed in the GPU kernel optimization case study. The graph contains typed entities such as topics, papers, concepts, methods, foundations, and researchers, with links recording how papers support concepts, how methods instantiate technical approaches, and how people connect to related research areas. This structure allows later skills to retrieve scientific context by entity type and relation, instead of relying only on unstructured keyword search. Appendix [B](https://arxiv.org/html/2605.31468#A2) provides concrete entity-page examples from this memory.
Figure 6: Example Long-Term Knowledge Memory built from the GPU kernel generation domain.
### 7.3 Idea Evolution and Selection
We next analyze the idea pipeline in the GPU kernel optimization case study. Given the user direction of iterative GPU operator optimization with Claude Code and performance feedback, AutoSci generates five candidate paths, performs novelty checking, refines the surviving ideas, evaluates feasibility under the project hardware budget, and selects one idea for full experimentation. Figure [7](https://arxiv.org/html/2605.31468#S7.F7) summarizes this process. Specifically, /ideate first proposes five candidate directions. After /novelty checking, candidate A is removed as a duplicate of timing-only feedback approaches, while B, C, D, and E remain for refinement. The refined candidates are then checked by /exp-pilot-run under the 4× A40 budget: B and C are eliminated because their pilot plans exceed the available cost envelope, D is deferred because its upstream Optimization-Rewind mining would consume the main-run budget, and E is selected for full experimentation. The final selected path is profiling-guided Claude Code agent optimization, represented as claude-code-agent-profiling-guided-gpu. The Stage-2 outcomes in the figure are a demonstrative reconstruction projected onto the actual hardware ceiling consumed by the selected idea, rather than a formal pre-experiment screen. Appendix [C](https://arxiv.org/html/2605.31468#A3) provides the corresponding pipeline for the biomedical drug discovery case.
Stage 0 \| /ideate – direction: iterative GPU operator optimization with Claude Code based on performance feedback

- A: lightweight timing\-only optimizer
- B: learned behavioral descriptors for kernel search
- C: parallel path explorer \(MAP\-Elites \+ agents\)
- D: experience\-aug\. iterative kernel refinement
- E: profiling\-guided Claude Code agent

Stage 1 \| /novelty – Semantic Scholar \+ WebSearch \+ wiki cross\-verification

- A: eliminated\. Duplicate approach: execution timing as the sole performance feedback signal
- B: refined\. CodeT5\+ encoder on MAP\-Elites traces; contrastive vs hand\-crafted 3\-D
- C: refined\. 30\-variant population \+ bottleneck\-specialised Verifier agents on a 3\-D behavioral grid
- D: refined\. Optimization\-Rewind mining → retrieval\-augmented refinement loop
- E: refined\. Claude Code tool\-use: nsys/ncu → JSON → single\-hypothesis targeted edits

Stage 2 \| /exp\-pilot\-run – feasibility pilot on the project hardware

Hardware: 4× NVIDIA A40 \(sm\_86, 696 GB/s, 149\.7 TFLOPS FP16 TC, 48 GB\) · Triton 3\.2\.0 / PyTorch 2\.6\.0\+cu124 · pilot budget ≈ 250 GPU\-hr

- B: eliminated \(cost\)\. MAP\-Elites pilot yields ∼120 triples / 8 GPU\-hr; contrastive needs ∼10 k samples ⇒ \> 250 GPU\-hr just to train the encoder
- C: eliminated \(cost\)\. 30\-variant pop\. × per\-variant ncu ≈ 6 GPU\-hr/op × 200 ops ⇒ 1\.2 k GPU\-hr; profiler overhead ∼3× A100 on A40
- D: deferred\. Optimization\-Rewind mining at 4–6 hr/op; a 200\-tuple bank costs ∼1 k GPU\-hr upfront – exhausts main\-run budget\. Off\-deadline
- E: pilot passed\. Pilot ops dequantize\_rowwise \+ kldiv\_compute: 5\-iter loop ∼30 min/op\. Full sweep ∼40 GPU\-hr – fits budget

Selected idea → claude\-code\-agent\-profiling\-guided\-gpu \(NeurIPS 2026 target; addresses *repair\-biased\-iterative\-refinement* & *correctness\-efficiency\-gap\-kernel\-generation*\)

Figure 7: Idea screening pipeline for the kernel optimization case\. AutoSci filters candidate directions through novelty checking and pilot experimentation to select one path\.
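The Stage\-2 cost reasoning can be checked with a few lines of arithmetic\. The sketch below uses only the approximate figures quoted in the pipeline \(pilot budget, triples per GPU\-hour, per\-operator cost, operator counts\)\. The helper name `encoder_training_cost`, the 5 hr/op midpoint for candidate D, and the choice to compare every candidate against the 250 GPU\-hr pilot budget are assumptions for illustration; this is not AutoSci's actual feasibility check\.

```python
PILOT_BUDGET_GPU_HR = 250  # approximate pilot budget quoted in Stage 2


def encoder_training_cost(samples_needed, triples_per_batch, gpu_hr_per_batch):
    """GPU-hours needed to collect enough samples at the observed pilot yield."""
    return samples_needed / triples_per_batch * gpu_hr_per_batch


estimates = {
    # B: ~10k samples needed at ~120 triples per 8 GPU-hr
    "B": encoder_training_cost(10_000, 120, 8),
    # C: ~6 GPU-hr/op across 200 ops
    "C": 6 * 200,
    # D: 200-tuple bank at 4-6 hr/op; 5 hr/op is an assumed midpoint
    "D": 5 * 200,
    # E: quoted full-sweep estimate
    "E": 40,
}

fits_budget = {name: cost <= PILOT_BUDGET_GPU_HR for name, cost in estimates.items()}

for name, cost in estimates.items():
    status = "fits" if fits_budget[name] else "exceeds"
    print(f"{name}: ~{cost:.0f} GPU-hr ({status} the {PILOT_BUDGET_GPU_HR} GPU-hr budget)")
```

Under these assumptions only E stays inside the envelope, which matches the outcome reported in the figure\. The figure treats D as deferred rather than eliminated, a distinction this simple threshold does not capture\.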
### 7\.4Experiment Execution and Analysis
After selecting the profiling\-guided Claude Code agent direction, AutoSci expands the idea into an executable experiment suite rather than a single benchmark run\. Figure [8](https://arxiv.org/html/2605.31468#S7.F8) summarizes the four experiment blocks\. The sensitivity analysis first fixes the experimental protocol using two reference operators and screens 184 operator prompts into 156 feasible operators\. The main experiment then runs 157 operators for five iterations on the 4× A40 environment; by iteration 5, all 157 generated kernels are executable and correct, with a geometric\-mean speedup of 1\.52× over matched baselines, or 1\.18× after excluding degenerate baselines\. The ablation experiment isolates the value of metric feedback by replaying two 60\-operator cohorts with a blind autotuning baseline; feedback contributes a 1\.58× gain on the high\-headroom cohort and a 1\.22× gain on the broader cohort\. Finally, the intermediate\-data analysis audits 628 iteration transitions and shows that most structural changes occur early, while later iterations increasingly become small autotuning adjustments\. The numerical results are reproduced from result\.md, claude\_ablation\_summary\.md, and claude\_iter\_changes\.md in the TritonBench workspace\. Appendix [D](https://arxiv.org/html/2605.31468#A4) provides the corresponding experiment\-suite view for the biomedical case, where AutoSci distinguishes executed validation blocks from pre\-registered follow\-up benchmarks after a negative result\.
Experiment suite for claude\-code\-agent\-profiling\-guided\-gpu, executed on 4× NVIDIA A40 \(sm\_86, 696 GB/s, 149\.7 TFLOPS FP16 TC, 48 GB\) · Triton 3\.2\.0 / PyTorch 2\.6\.0\+cu124 · TritonBench\-G workload

Sensitivity Analysis \(protocol validation\)

- Reference ops – dequantize\_rowwise, kldiv\_compute\. Lock the 5\-iter loop and per\-op eval\_harness\.py\.
- Feasibility triage – 184 op prompts → 156 feasible; 28 screened by workload mismatch or under\-specification\.
- *Outcome:* defined the iter1 → iter5 protocol used downstream; no metric or baseline access during iter1–5\.

Main Experiment \(full workload sweep\)

- Scope – 157 operators × 5 iterations on 4× A40, work\-stealing dispatcher; prompt = comp\_instru only\.
- Baseline matching – 101 valid matches; invalid matches screened after iter5\.
- *Result:* 157/157 exe\_acc = 1\.00 at iter5; geomean speedup 1\.52× \(1\.18× excl\. degenerate baselines\); 25 wins ≥ 1\.1×, 7 losses < 0\.9×\.

Ablation Experiments \(feedback vs blind\)

- Blind baseline – replay iter1 and widen @triton\.autotune without metric feedback\.
- Two cohorts – 60 ops × 5 iterations; high\-headroom and broad\-headroom bands\.
- *Metric\-feedback bonus:* 1\.58× on the high\-headroom cohort; 1\.22× on the broader cohort\.

Intermediate Data Analysis \(transition audit\)

- Transition classification – 157 ops × 4 transitions = 628 events; 14 edit categories\.
- Patterns – 96/157 ops add @autotune at iter1 → iter2; structural rewrites concentrate early; late iters = warps/stages tweaks\.
- *Stabilisation:* TRIVIAL transitions grow monotonically \(0 → 29 → 66 → 105\); iter5 is a near\-no\-op for ∼67% of ops\.

Ablation cohort bands are drawn from per\-op M5/M1\.

Figure 8: Experiment suite for the selected kernel optimization idea\. AutoSci organizes the selected idea into sensitivity analysis, main evaluation, ablation, and intermediate\-data analysis\.
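Two of the quantities above are simple to reproduce: the geometric\-mean aggregation used for speedups and the transition bookkeeping in the intermediate\-data analysis\. The sketch below is a worked example under stated assumptions\. The per\-operator speedups in `hypothetical_speedups` are made up to show the aggregation and are not the paper's measurements; the operator count, iteration count, and TRIVIAL\-transition counts are the ones quoted in the experiment suite\.

```python
import math


def geometric_mean(speedups):
    """Geometric mean of positive per-operator speedup ratios."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


# Made-up speedups, only to illustrate the aggregation.
hypothetical_speedups = [2.0, 0.5, 4.0, 1.0]
example_geomean = geometric_mean(hypothetical_speedups)

# Bookkeeping with the counts quoted in the experiment suite.
OPERATORS = 157
ITERATIONS = 5
TRIVIAL_PER_TRANSITION = [0, 29, 66, 105]  # iter1->2, iter2->3, iter3->4, iter4->5

total_transitions = OPERATORS * (ITERATIONS - 1)
near_noop_share = TRIVIAL_PER_TRANSITION[-1] / OPERATORS

print(f"example geomean: {example_geomean:.3f}")
print(f"audited transitions: {total_transitions}")
print(f"share of ops with a trivial final transition: {near_noop_share:.0%}")
```

A geometric mean is the usual choice for ratios such as speedups because a 2× gain and a 0\.5× loss cancel out, whereas an arithmetic mean would report a net gain\.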
### 7\.5Paper\-Level Evaluation
Table 4: Automated paper\-level review outcomes for the two AutoSci\-generated case\-study manuscripts\. Scores are reported by PaperReview\.ai under the ICLR target\-venue setting\.

For each case study, AutoSci runs the Literature, Ideation, Experiment, and Writing stages to produce a manuscript\-oriented artifact, taking 27\.3 hours for GPU kernel optimization and 22\.6 hours for biomedical drug discovery\. To evaluate whether AutoSci can produce complete paper\-level artifacts rather than isolated ideas or experiments, we submit each generated paper to PaperReview\.ai \([https://paperreview\.ai/](https://paperreview.ai/)\) with *ICLR* as the target venue\. The resulting reviews serve as a paper\-level review proxy, not as a replacement for formal peer review\. Table [4](https://arxiv.org/html/2605.31468#S7.T4) summarizes the reviews\.
The reviews suggest that AutoSci can produce reviewable paper\-level artifacts rather than only local experimental outputs\. The kernel case is evaluated as a relatively mature empirical paper, while the biomedical case is evaluated as a transparent but incomplete negative\-result paper\. Importantly, both reviews expose actionable weaknesses, including limited external validity, missing comparator runs, measurement\-noise concerns, and presentation gaps\. These review signals can be stored as submission\-stage feedback in SciMem to improve future research workflows\.
## 8Related Work
### 8\.1Agent Memory
Long\-horizon agents require memory beyond the fixed context window of an LLM\. Prior systems store and retrieve past interactions, user events, or episodic traces to support continuity over time, as in Generative Agents \(Park et al\., [2023](https://arxiv.org/html/2605.31468#bib.bib21)\), MemoryBank \(Zhong et al\., [2024](https://arxiv.org/html/2605.31468#bib.bib22)\), and MemGPT \(Packer et al\., [2023](https://arxiv.org/html/2605.31468#bib.bib23)\)\. These works show that memory is a core part of agent behavior rather than a passive cache, especially because longer context alone does not guarantee reliable use of relevant information \(Liu et al\., [2024](https://arxiv.org/html/2605.31468#bib.bib29)\)\. Recent work further moves from flat logs toward structured memory\. A\-MEM \(Xu et al\., [2025](https://arxiv.org/html/2605.31468#bib.bib24)\) links memory notes in an agentic network, AriGraph \(Anokhin et al\., [2024](https://arxiv.org/html/2605.31468#bib.bib25)\) and Zep \(Rasmussen et al\., [2025](https://arxiv.org/html/2605.31468#bib.bib26)\) build graph\-based world or temporal memories, and HippoRAG \(Jimenez Gutierrez et al\., [2024](https://arxiv.org/html/2605.31468#bib.bib27)\) combines knowledge graphs with retrieval\.
### 8\.2Agent Evolution
Agent evolution has mainly developed along two complementary directions\. The first improves an agent through accumulated experience while keeping the base model fixed\. Reflexion \(Shinn et al\., [2023](https://arxiv.org/html/2605.31468#bib.bib30)\) stores verbal feedback as lessons, ExpeL \(Zhao et al\., [2024](https://arxiv.org/html/2605.31468#bib.bib31)\) extracts reusable experiential knowledge, and Voyager \(Wang et al\., [2023](https://arxiv.org/html/2605.31468#bib.bib32)\) grows an executable skill library through exploration and feedback\. This form of evolution is non\-parametric: the agent becomes more capable by reusing memories, strategies, and skills from previous tasks\.
The second direction treats the agent system itself as an optimization target\. Promptbreeder \(Fernando et al\., [2023](https://arxiv.org/html/2605.31468#bib.bib33)\) evolves prompts, GPTSwarm \(Zhuge et al\., [2024](https://arxiv.org/html/2605.31468#bib.bib34)\) and AFlow \(Zhang et al\., [2024](https://arxiv.org/html/2605.31468#bib.bib35)\) search agent graphs or code\-represented workflows, and symbolic learning \(Zhou et al\., [2024](https://arxiv.org/html/2605.31468#bib.bib36)\) frames prompts, tools, and their composition as learnable symbolic parameters\. Recent self\-evolving systems further adapt memories, tool libraries, templates, or model behavior, as in SAGE \(Liang et al\., [2024](https://arxiv.org/html/2605.31468#bib.bib37)\), STELLA \(Jin et al\., [2025](https://arxiv.org/html/2605.31468#bib.bib38)\), and SEAL \(Zweiger et al\., [2025](https://arxiv.org/html/2605.31468#bib.bib39)\)\.
## 9Conclusion
In this paper, we presented AutoSci, a memory\-centric agentic system for the full scientific research lifecycle\. AutoSci combines schema\-governed scientific memory, a harnessed research workflow, DAG\-based multi\-agent augmentation, and auditable self\-evolution\. Together, these modules allow the system to conduct research across literature understanding, ideation, experimentation, writing, and submission feedback while preserving reusable knowledge and experience across projects\. Our case studies in GPU kernel optimization and biomedical drug discovery show that AutoSci can produce reviewable paper\-level artifacts from end\-to\-end research processes\.
AutoSci also has important limitations\. First, the current implementation is built as a skill package on top of general\-purpose coding and reasoning agents\. This makes the system easy to deploy and inspect, but it is not yet a science\-specialized agent foundation\. Future work can develop a research\-native agent substrate with models, tools, memory interfaces, and execution protocols designed specifically for scientific work\. Second, evaluation remains underdeveloped\. Existing benchmarks do not yet adequately measure the separate capabilities required by a full research system, including literature understanding, ideation, experimentation, and writing\. We therefore rely on end\-to\-end case studies and automated review as a paper\-level proxy\. A promising direction is to accumulate benchmark tasks from real user workflows, so that future versions of AutoSci can be evaluated both end\-to\-end and at the level of individual research skills\.
## References
- Anokhin et al\. \(2024\) AriGraph: learning knowledge graph world models with episodic memory for LLM agents\. arXiv preprint arXiv:2407\.04363\. Cited by: [§8\.1](https://arxiv.org/html/2605.31468#S8.SS1.p1.1)\.
- J\. Chai, S\. Tang, R\. Ye, Y\. Du, X\. Zhu, M\. Zhou, Y\. Wang, W\. E, and S\. Chen \(2025\)SciMaster: towards general\-purpose scientific AI agents, part I\. X\-master as foundation: can we lead on humanity’s last exam?\.arXiv preprint arXiv:2507\.05241\.Cited by:[§1](https://arxiv.org/html/2605.31468#S1.p1.1),[§1](https://arxiv.org/html/2605.31468#S1.p2.1)\.
- S\. Desai, S\. Addamane, J\. Y\. Tsao, I\. Brener, L\. P\. Swiler, R\. Dingreville, and P\. P\. Iyer \(2024\)AutoSciLab: a self\-driving laboratory for interpretable scientific discovery\.arXiv preprint arXiv:2412\.12347\.Cited by:[§1](https://arxiv.org/html/2605.31468#S1.p2.1)\.
- C\. Fernando, D\. Banarse, H\. Michalewski, S\. Osindero, and T\. Rocktaschel \(2023\)Promptbreeder: self\-referential self\-improvement via prompt evolution\.arXiv preprint arXiv:2309\.16797\.Cited by:[§8\.2](https://arxiv.org/html/2605.31468#S8.SS2.p2.1)\.
- J\. Gottweis, W\. Weng, A\. Daryin, T\. Tu, A\. Palepu, P\. Sirkovic, A\. Myaskovsky, F\. Weissenberger, K\. Rong, R\. Tanno,et al\.\(2025\)Towards an AI co\-scientist\.arXiv preprint arXiv:2502\.18864\.Cited by:[§1](https://arxiv.org/html/2605.31468#S1.p2.1)\.
- K\. Huang, Y\. Jin, R\. Li, M\. Y\. Li, E\. Candès, and J\. Leskovec \(2025\)Automated hypothesis validation with agentic sequential falsifications\.arXiv preprint arXiv:2502\.09858\.Cited by:[§1](https://arxiv.org/html/2605.31468#S1.p1.1),[§1](https://arxiv.org/html/2605.31468#S1.p2.1)\.
- B\. Jimenez Gutierrez, Y\. Shu, Y\. Gu, M\. Yasunaga, and Y\. Su \(2024\)HippoRAG: neurobiologically inspired long\-term memory for large language models\.Advances in Neural Information Processing Systems37,pp\. 59532–59569\.Cited by:[§8\.1](https://arxiv.org/html/2605.31468#S8.SS1.p1.1)\.
- R\. Jin, Z\. Zhang, M\. Wang, and L\. Cong \(2025\)STELLA: self\-evolving LLM agent for biomedical research\.arXiv preprint arXiv:2507\.02004\.Cited by:[§8\.2](https://arxiv.org/html/2605.31468#S8.SS2.p2.1)\.
- X\. Liang, Y\. He, Y\. Xia, X\. Song, J\. Wang, M\. Tao, L\. Sun, X\. Yuan, J\. Su, K\. Li, S\. Chen, and T\. Shi \(2024\)Self\-evolving agents with reflective and memory\-augmented abilities\.arXiv preprint arXiv:2409\.00872\.Cited by:[§8\.2](https://arxiv.org/html/2605.31468#S8.SS2.p2.1)\.
- N\. F\. Liu, K\. Lin, J\. Hewitt, A\. Paranjape, M\. Bevilacqua, F\. Petroni, and P\. Liang \(2024\)Lost in the middle: how language models use long contexts\.Transactions of the Association for Computational Linguistics12,pp\. 157–173\.Cited by:[§8\.1](https://arxiv.org/html/2605.31468#S8.SS1.p1.1)\.
- C\. Lu, C\. Lu, R\. T\. Lange, J\. Foerster, J\. Clune, and D\. Ha \(2024\)The AI scientist: towards fully automated open\-ended scientific discovery\.arXiv preprint arXiv:2408\.06292\.Cited by:[Table 1](https://arxiv.org/html/2605.31468#S1.T1.1.1.2.1.1.1.2.1),[§1](https://arxiv.org/html/2605.31468#S1.p3.1)\.
- Y\. Lyu, X\. Zhang, X\. Yi, Y\. Zhao, S\. Guo, W\. Hu, J\. Piotrowski, J\. Kaliski, J\. Urbani, Z\. Meng, L\. Zhou, and X\. Yan \(2026\)EvoScientist: towards multi\-agent evolving AI scientists for end\-to\-end scientific discovery\.arXiv preprint arXiv:2603\.08127\.Cited by:[Table 1](https://arxiv.org/html/2605.31468#S1.T1.7.7.4.1.1),[§1](https://arxiv.org/html/2605.31468#S1.p3.1)\.
- P\. Ma, T\. Wang, M\. Guo, Z\. Sun, J\. B\. Tenenbaum, D\. Rus, C\. Gan, and W\. Matusik \(2024\)LLM and simulation as bilevel optimizers: a new paradigm to advance physical scientific discovery\.arXiv preprint arXiv:2405\.09783\.External Links:[Link](https://arxiv.org/abs/2405.09783)Cited by:[§1](https://arxiv.org/html/2605.31468#S1.p2.1)\.
- C\. Packer, V\. Fang, S\. G\. Patil, K\. Lin, S\. Wooders, and J\. E\. Gonzalez \(2023\)MemGPT: towards LLMs as operating systems\.CoRRabs/2310\.08560\.External Links:[Document](https://dx.doi.org/10.48550/ARXIV.2310.08560)Cited by:[§8\.1](https://arxiv.org/html/2605.31468#S8.SS1.p1.1)\.
- J\. S\. Park, J\. O’Brien, C\. J\. Cai, M\. R\. Morris, P\. Liang, and M\. S\. Bernstein \(2023\)Generative agents: interactive simulacra of human behavior\.InProceedings of the 36th Annual ACM Symposium on User Interface Software and Technology,pp\. 1–22\.Cited by:[§8\.1](https://arxiv.org/html/2605.31468#S8.SS1.p1.1)\.
- P\. Rasmussen, P\. Paliychuk, T\. Beauvais, J\. Ryan, and D\. Chalef \(2025\)Zep: a temporal knowledge graph architecture for agent memory\.CoRRabs/2501\.13956\.External Links:[Document](https://dx.doi.org/10.48550/ARXIV.2501.13956)Cited by:[§8\.1](https://arxiv.org/html/2605.31468#S8.SS1.p1.1)\.
- S\. Schmidgall, Y\. Su, Z\. Wang, X\. Sun, J\. Wu, X\. Yu, J\. Liu, Z\. Liu, and E\. Barsoum \(2025\)Agent laboratory: using LLM agents as research assistants\.arXiv preprint arXiv:2501\.04227\.Cited by:[Table 1](https://arxiv.org/html/2605.31468#S1.T1.3.3.2.1.1.1.2.1),[§1](https://arxiv.org/html/2605.31468#S1.p3.1)\.
- N\. Shinn, F\. Cassano, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\)Reflexion: language agents with verbal reinforcement learning\.Advances in Neural Information Processing Systems36,pp\. 8634–8652\.Cited by:[§8\.2](https://arxiv.org/html/2605.31468#S8.SS2.p1.1)\.
- J\. Tang, L\. Xia, Z\. Li, and C\. Huang \(2025\)AI\-researcher: autonomous scientific innovation\.arXiv preprint arXiv:2505\.18705\.Cited by:[Table 1](https://arxiv.org/html/2605.31468#S1.T1.2.2.2.1.1),[§1](https://arxiv.org/html/2605.31468#S1.p3.1)\.
- G\. Wang, Y\. Xie, Y\. Jiang, A\. Mandlekar, C\. Xiao, Y\. Zhu, L\. Fan, and A\. Anandkumar \(2023\)Voyager: an open\-ended embodied agent with large language models\.arXiv preprint arXiv:2305\.16291\.Cited by:[§8\.2](https://arxiv.org/html/2605.31468#S8.SS2.p1.1)\.
- Y\. Weng, M\. Zhu, G\. Bao, H\. Zhang, J\. Wang, Y\. Zhang, and L\. Yang \(2024\)CycleResearcher: improving automated research via automated review\.arXiv preprint arXiv:2411\.00816\.Cited by:[Table 1](https://arxiv.org/html/2605.31468#S1.T1.4.4.2.1.1),[§1](https://arxiv.org/html/2605.31468#S1.p3.1)\.
- Y\. Weng, M\. Zhu, Q\. Xie, Q\. Sun, Z\. Lin, S\. Liu, and Y\. Zhang \(2025\)DeepScientist: advancing frontier\-pushing scientific findings progressively\.arXiv preprint arXiv:2509\.26603\.Cited by:[Table 1](https://arxiv.org/html/2605.31468#S1.T1.9.9.3.1.1),[§1](https://arxiv.org/html/2605.31468#S1.p3.1)\.
- W\. Xu, Z\. Liang, K\. Mei, H\. Gao, J\. Tan, and Y\. Zhang \(2025\)A\-MEM: agentic memory for LLM agents\.CoRRabs/2502\.12110\.External Links:[Document](https://dx.doi.org/10.48550/ARXIV.2502.12110)Cited by:[§8\.1](https://arxiv.org/html/2605.31468#S8.SS1.p1.1)\.
- Y\. Yamada, R\. T\. Lange, C\. Lu, S\. Hu, C\. Lu, J\. Foerster, J\. Clune, and D\. Ha \(2025\)The AI scientist\-v2: workshop\-level automated scientific discovery via agentic tree search\.arXiv preprint arXiv:2504\.08066\.Cited by:[Table 1](https://arxiv.org/html/2605.31468#S1.T1.1.1.2.1.1.1.2.1),[§1](https://arxiv.org/html/2605.31468#S1.p3.1)\.
- R\. Yang, Y\. Li, and S\. Li \(2026\)ARIS: autonomous research via adversarial multi\-agent collaboration\.arXiv preprint arXiv:2605\.03042\.Cited by:[Table 1](https://arxiv.org/html/2605.31468#S1.T1.11.11.3.1.1),[§1](https://arxiv.org/html/2605.31468#S1.p3.1)\.
- J\. Zhang, J\. Xiang, Z\. Yu, F\. Teng, X\. Chen, J\. Chen, M\. Zhuge, X\. Cheng, S\. Hong, J\. Wang,et al\.\(2024\)AFlow: automating agentic workflow generation\.arXiv preprint arXiv:2410\.10762\.Cited by:[§8\.2](https://arxiv.org/html/2605.31468#S8.SS2.p2.1)\.
- X\. Zhang \(2026\)Deep researcher agent: autonomous deep learning experiment framework\.arXiv preprint arXiv:2604\.05854\.External Links:[Link](https://arxiv.org/abs/2604.05854)Cited by:[Table 1](https://arxiv.org/html/2605.31468#S1.T1.15.15.3.1.1.1.2.1),[§1](https://arxiv.org/html/2605.31468#S1.p1.1),[§1](https://arxiv.org/html/2605.31468#S1.p2.1),[§1](https://arxiv.org/html/2605.31468#S1.p3.1)\.
- A\. Zhao, D\. Huang, Q\. Xu, M\. Lin, Y\. Liu, and G\. Huang \(2024\)ExpeL: LLM agents are experiential learners\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.38,pp\. 19632–19642\.Cited by:[§8\.2](https://arxiv.org/html/2605.31468#S8.SS2.p1.1)\.
- W\. Zhong, L\. Guo, Q\. Gao, H\. Ye, and Y\. Wang \(2024\)MemoryBank: enhancing large language models with long\-term memory\.InProceedings of the AAAI Conference on Artificial Intelligence,pp\. 19724–19731\.Cited by:[§8\.1](https://arxiv.org/html/2605.31468#S8.SS1.p1.1)\.
- B\. Zhou, Q\. Wu, X\. Huang, H\. Ning, D\. Li, and Z\. Zhang \(2026\)NORA: night owl research agent — autonomous AI research for geoscience, remote sensing, and GIScience\.Note:[https://github\.com/GRIND\-Lab\-Core/night\_owl\_research\_agent](https://github.com/GRIND-Lab-Core/night_owl_research_agent)External Links:2605\.02092,[Link](https://arxiv.org/abs/2605.02092)Cited by:[Table 1](https://arxiv.org/html/2605.31468#S1.T1.13.13.3.1.1),[§1](https://arxiv.org/html/2605.31468#S1.p3.1)\.
- W\. Zhou, Y\. Ou, S\. Ding, L\. Li, J\. Wu, T\. Wang, J\. Chen, S\. Wang, X\. Xu, N\. Zhang,et al\.\(2024\)Symbolic learning enables self\-evolving agents\.arXiv preprint arXiv:2406\.18532\.Cited by:[§8\.2](https://arxiv.org/html/2605.31468#S8.SS2.p2.1)\.
- M\. Zhuge, W\. Wang, L\. Kirsch, F\. Faccio, D\. Khizbullin, and J\. Schmidhuber \(2024\)GPTSwarm: language agents as optimizable graphs\.InProceedings of the 41st International Conference on Machine Learning,Cited by:[§8\.2](https://arxiv.org/html/2605.31468#S8.SS2.p2.1)\.
- A\. Zweiger, J\. Pari, H\. Guo, E\. Akyurek, Y\. Kim, and P\. Agrawal \(2025\)Self\-adapting language models\.arXiv preprint arXiv:2506\.10943\.Cited by:[§8\.2](https://arxiv.org/html/2605.31468#S8.SS2.p2.1)\.
## Appendix ASciDAG Operators and Templates
Table 5: Reusable operators in SciDAG\.

Figure 9: Stage\-specific SciDAG templates for ideation, experimentation, and writing\.

Table [5](https://arxiv.org/html/2605.31468#A1.T5) lists the reusable operators used by SciDAG, and Figure [9](https://arxiv.org/html/2605.31468#A1.F9) shows three stage\-specific DAG templates for ideation, experimentation, and writing\.
## Appendix BLong\-Term Knowledge Memory Entity Examples
The examples below show representative Long\-Term Knowledge Memory entity pages generated from the same memory used in Figure [6](https://arxiv.org/html/2605.31468#S7.F6)\. These examples illustrate how Topic, Paper, Foundation, Concept, Method, and People entities are stored as schema\-governed pages with frontmatter, body sections, and typed cross\-references\.
concept \|\| repair\-biased\-iterative\-refinement

Frontmatter \(selected\)

- title: Repair\-Biased Iterative Refinement
- aliases: \[repair\-vs\-optimization\-iteration, iterative\-refinement\-repair\-bias\]
- tags: \[llm\-code\-generation, gpu\-kernels, iterative\-refinement, agent\-systems\]
- maturity: emerging
- first\_introduced: 2026\-05
- key\_papers: kernelbenchx\-benchmark\-evaluating\-llm\-generated
- related\_concepts: iterative\-kernel\-refinement, correctness\-efficiency\-gap\-kernel\-generation
- …

\#\# Definition \(excerpt\)

The empirical observation that iterative refinement loops in LLM\-based kernel generation systems \(such as GEAK\) operate primarily as *repair* mechanisms rather than *optimization* mechanisms\. Across GEAK iterations, compile success rises from 52\.3% to 68\.8% and correctness from 18\.2% to 30\.7%, but average speedup falls from 1\.58× to 1\.44× …

\#\# Intuition \(excerpt\)

The underlying asymmetry is structural: repair responds to explicit, local error signals \(compile errors, shape mismatches\), while performance improvement requires plan\-level decisions about tiling, memory layout and kernel boundaries – decisions not recoverable from the feedback available in current iterative pipelines …

\#\# …

Concept entry\. A stable named idea synthesised from one or more papers\. Lives in wiki/concepts/; the related\_concepts and key\_papers fields drive most of the in\-graph edges to other concepts and to papers\.
topic \|\| llm\-kernel\-generation

Frontmatter \(selected\)

- title: LLM\-Based Kernel Generation
- tags: \[llm, kernel\-generation, gpu, code\-generation, compiler\]
- related\_topics: \[\] \(none yet\)
- key\_people: \[\] \(populated via reverse\-xref from people pages\)
- linked\_ideas: \[\] \(populated when an idea declares origin\_gaps pointing here\)
- …

\#\# Overview \(excerpt\)

LLM\-based kernel generation is the task of using language models to automatically produce high\-performance GPU or accelerator kernels from high\-level specifications\. It combines code generation with hardware\-aware optimization\.

\#\# SOTA tracker \(excerpt\)

| Method | Benchmark | Correctness / Speedup |
| --- | --- | --- |
| Frontier reasoning models | KernelBench | <20% match PyTorch baseline |
| kernelfoundry\-\.\.\. | KernelBench | 97% \(SYCL\), 2\.32× avg |
| AscendOptimizer | 101 AscendC ops | 1\.21× geo\-mean |

\#\# Open problems \(excerpt\)

- Fusion tasks: 72% failure rate across all methods \(KernelBenchX\)\.
- Correctness vs\. efficiency disconnect: 46\.6% of correct kernels are *slower* than baseline\.
- Cross\-hardware variance: speedup ratio reaches 21\.4× across NVIDIA GPUs …

\#\# …

Topic entry\. A research area that aggregates many concepts, methods and papers\. Lives in wiki/topics/\. The body sections \(Timeline, Seminal works, SOTA tracker, Open problems\) make a topic page the natural landing point for new readers and the place where cross\-paper synthesis is recorded\.
method \|\| optimization\-rewind

Frontmatter \(selected\)

- name: Optimization Rewind
- type: optimization
- tags: \[self\-supervised, kernel\-optimization, experience\-mining, de\-optimization, knowledge\-scarcity\]
- source\_papers: ascendoptimizer\-episodic\-agent\-ascend\-npu\-operator
- parent\_methods: \[\]
- child\_methods: \[\]
- code\_repo: https://github\.com/KernelHive
- date\_updated: 2026\-05\-24
- …

\#\# Mechanism \(excerpt\)

A self\-supervised technique for mining reusable optimization knowledge from existing high\-performance implementations\. It works by systematically removing recognizable optimization motifs from strong kernels and retaining only the removals that measurably degrade hardware performance …

\#\# Procedure \(excerpt\)

1\. Start from the strongest available implementation of an operator\.
2\. An inverse agent proposes semantically meaningful de\-optimizations \(removing pipelining, breaking vectorized paths, reintroducing synchronization, …\)\.
3\. Each de\-optimized candidate is compiled, correctness\-checked and profiled on real hardware\.
4\. Validated degradations are distilled into experience tuples \(Title, Bottleneck, Applicability, Effect, Diff\)\.
5\. …

\#\# Assumptions \(excerpt\)

Optimization motifs are *compositional* – they can be removed and added independently, so a single\-factor screen \($\Delta_m = \mathrm{Lat}(K_{A \setminus \{m\}}) - \mathrm{Lat}(K_A)$\) recovers each motif’s marginal contribution\. The profiling noise threshold $\tau_{\text{noise}}$ is taken to be stable across de\-optimizations …

\#\# …

Method entry\. A concrete system, algorithm or benchmark that realises one or more concepts\. Lives in wiki/methods/\. source\_papers pins it to its paper of record; parent\_methods / child\_methods capture method lineage when a system extends or specialises another\.
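The single\-factor screen quoted in the Optimization Rewind entry can be sketched in a few lines\. The function below computes the latency increase from removing each motif, $\Delta_m = \mathrm{Lat}(K_{A \setminus \{m\}}) - \mathrm{Lat}(K_A)$, and keeps the removals that exceed the noise threshold\. The function name, the motif names, the latencies, and the strict `delta > noise_threshold` retention rule are all assumptions for illustration; the source states only that removals which measurably degrade performance are retained\.

```python
def single_factor_screen(full_latency, latency_without, noise_threshold):
    """Return {motif: delta} for motifs whose removal degrades latency beyond the noise threshold.

    full_latency     -- latency of the fully optimized kernel, Lat(K_A)
    latency_without  -- {motif: latency with that one motif removed}, Lat(K_{A \\ {m}})
    noise_threshold  -- profiling noise threshold, tau_noise (same unit as the latencies)
    """
    contributions = {}
    for motif, latency in latency_without.items():
        delta = latency - full_latency  # marginal contribution of this motif
        if delta > noise_threshold:     # assumed retention rule
            contributions[motif] = delta
    return contributions


# Made-up latencies in microseconds; motif names are hypothetical.
full_latency_us = 1000
latency_without_us = {
    "software_pipelining": 1400,
    "vectorized_loads": 1250,
    "sync_elision": 1020,  # within noise, so it is dropped
}

validated = single_factor_screen(full_latency_us, latency_without_us, noise_threshold=50)
```

Each retained entry corresponds to one validated degradation, which the procedure above would then distil into an experience tuple\.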
paper \|\| kernelbench\-llms\-write\-efficient\-gpu\-kernels

Frontmatter \(selected\)

- title: KernelBench: Can LLMs Write Efficient GPU Kernels?
- arxiv: 2502\.10517
- year: 2025
- importance: 4
- source\_type: tex
- date\_added: 2025\-05\-24
- contribution\_type: \[benchmark, method\]
- datasets: \[KernelBench\]
- tldr: Introduces KernelBench, a benchmark of 250 PyTorch workloads showing frontier reasoning models match the PyTorch baseline on <20% of tasks\.
- …

\#\# Key idea \(excerpt\)

A standardised evaluation framework with 250 PyTorch workloads at three difficulty levels \(single ops, op sequences, full architectures\)\. The fast\_p metric – percentage of kernels both correct *and* faster than baseline by factor $p$ – captures correctness and performance in a single adjustable\-threshold measure …

\#\# Experiment & Results \(excerpt\)

One\-shot fast\_1 over PyTorch eager: Level 1: DeepSeek\-R1 12% \(others 3–10%\); Level 2: DeepSeek\-R1 36% \(others 0–24%\); Level 3: OpenAI o1 12% \(others 0–8%\)\. Iterative refinement \(10 turns\) lifts DeepSeek\-R1 Level 2 from 36% to 72% …

\#\# Limitations \(excerpt\)

Limited to NVIDIA GPUs; CUDA is low\-resource in pre\-training \(≈ 0\.07% of The Stack v1\.2\); models rarely use tensor cores / wmma; functional\-correctness errors are hard to fix via feedback; few\-shot examples *degrade* fast\_1 by encouraging aggressive but error\-prone optimisations …

\#\# …

Paper entry\. A first\-class citizen – each paper has its own page rather than only an entry in a bibliography\. Lives in wiki/papers/\. importance \(1–5\) drives sort order in the index; contribution\_type lets /discover and /survey retrieve “benchmark papers” vs\. “method papers” without re\-reading the bodies\.
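The fast\_p metric summarised in the paper entry can be written as a short function\. This is a minimal sketch based on the one\-line description above \(kernels that are both correct and faster than the baseline by factor $p$\)\. The `(correct, speedup)` tuple layout, the strict `speedup > p` comparison, and the sample results are assumptions, and this is not KernelBench's reference implementation\.

```python
def fast_p(results, p):
    """Fraction of kernels that are correct and more than p times faster than the baseline.

    results -- iterable of (correct: bool, speedup: float) pairs, one per task
    p       -- speedup threshold; p = 1 asks for "correct and faster than baseline"
    """
    results = list(results)
    if not results:
        return 0.0
    hits = sum(1 for correct, speedup in results if correct and speedup > p)
    return hits / len(results)


# Made-up results for four tasks.
sample_results = [
    (True, 1.5),   # correct and faster
    (True, 0.8),   # correct but slower
    (False, 2.0),  # fast but incorrect, so it never counts
    (True, 1.0),   # correct, exactly baseline speed
]

fast_0 = fast_p(sample_results, 0.0)  # correctness rate only
fast_1 = fast_p(sample_results, 1.0)  # correct and strictly faster than baseline
```

Raising `p` tightens the performance requirement while leaving the correctness requirement unchanged, which is what makes the metric an adjustable\-threshold measure\.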
people \|\| jun\-zhu

Frontmatter \(selected\)

- name: Jun Zhu
- affiliation: \(unset\)
- research\_areas: \[llm\-code\-generation, gpu\-kernels, systems\]
- type\.kind: researcher
- homepage: \(unset\)
- scholar: \(unset\)
- date\_updated: 2026\-05\-24
- …

\#\# Research areas \(excerpt\)

LLM\-based code generation, GPU kernel generation, systems …

\#\# Recent work \(excerpt\)

- \[\[kernelbenchx\-comprehensive\-benchmark\-evaluating\-llm\-generated\]\]
- …

\(populated by the papers → people reverse\-xref rule whenever a paper body wikilinks this author slug\.\)

\#\# …

People entry\. Lightweight by design – only the fields needed to attribute and disambiguate\. Lives in wiki/people/\. The Recent work section is populated automatically by the papers → people reverse\-xref rule, so the page stays in sync with the literature without manual edits\.
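The papers → people reverse\-xref rule can be illustrated with a small sketch: scan each paper body for `[[slug]]` wikilinks and, when the slug belongs to a people page, add the paper to that person's Recent work list\. The function name, the in\-memory dictionaries, and the sample paper body are assumptions for illustration; the source does not describe how AutoSci implements this rule\.

```python
import re
from collections import defaultdict

# Matches [[slug]] wikilinks; the slug may not contain brackets or a pipe.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)\]\]")


def build_recent_work(paper_bodies, people_slugs):
    """Map each person slug to the papers whose body wikilinks that person.

    paper_bodies -- {paper_slug: body text}
    people_slugs -- set of slugs that have a people page
    """
    recent = defaultdict(list)
    for paper_slug, body in paper_bodies.items():
        for target in WIKILINK.findall(body):
            # Ignore links to non-people pages and avoid duplicate entries.
            if target in people_slugs and paper_slug not in recent[target]:
                recent[target].append(paper_slug)
    return dict(recent)


# Hypothetical paper body; only the slugs come from the examples above.
paper_bodies = {
    "kernelbenchx-comprehensive-benchmark-evaluating-llm-generated": (
        "Authors: [[jun-zhu]]. Builds on [[operator-fusion]]. "
        "Contact author: [[jun-zhu]]."
    ),
}

recent_work = build_recent_work(paper_bodies, people_slugs={"jun-zhu"})
```

Deriving the list from paper bodies rather than storing it by hand is what keeps a people page in sync as new papers are added\.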
foundation \|\| operator\-fusion

Frontmatter \(selected\)

- title: Operator Fusion
- domain: compilers and high\-performance computing
- status: mainstream
- aliases: \[kernel fusion, op fusion, loop fusion\]
- first\_introduced: 1980s loop fusion; revived in TVM \(2018\) / XLA \(2017\)
- date\_updated: 2026\-05\-27
- source\_url: \(unset\)
- …

\#\# Definition \(excerpt\)

The program\-transformation technique of combining the bodies of two or more separately specified operators into a single executable kernel, so that intermediate values flowing between them are *not* materialised in main memory\. Semantics are preserved; only the schedule changes …

\#\# Intuition \(excerpt\)

GPU kernels are bandwidth\-bound: moving an element from DRAM to a register dwarfs the arithmetic on it\. Fusing y=f\(x\) and z=g\(y\) into z=g\(f\(x\)\) keeps y on\-chip, killing one DRAM round\-trip per element *and* one kernel\-launch overhead per call – pushing the kernel from bandwidth\-bound toward compute\-bound …

\#\# Relevance to active research \(excerpt\)

A load\-bearing assumption of every downstream concept in this wiki: \[\[gpu\-kernel\-generation\]\], \[\[iterative\-kernel\-refinement\]\], and the \[\[correctness\-efficiency\-gap\-kernel\-generation\]\] all sit on top of “good fusion is the primary lever for GPU performance\.” FlashAttention is a tour\-de\-force application; KernelBench Level 2 is essentially a fusion benchmark …

\#\# …

Foundation entry\. Textbook background that newer research builds on, separated from active\-research concepts/ so the rest of the wiki can wikilink to “well\-known prerequisites” without inflating the concept list\. Lives in wiki/foundations/\.
## Appendix CBiomedical Drug Discovery Idea Pipeline
Figure [10](https://arxiv.org/html/2605.31468#A3.F10) shows the idea screening and lifecycle pipeline for the biomedical drug discovery case study\. Unlike the kernel case in the main text, this case produces a negative result that becomes useful: the refuted premise and a deferred sibling idea jointly define the next feasible research direction\. Specifically, /ideate first proposes five candidate directions for structure\-aware PTM modelling\. After /novelty checking, A is removed because the PTM\-site disorder\-prediction subspace is already saturated, and B is removed because the required AF3 fine\-tuning is infeasible under the available constraints\. C, D, and E remain as refined candidates\. The /exp\-design stage then prioritizes these survivors: C and D are deferred, while E, PTM\-aware degrader target nomination, is selected for execution because its Phase\-0 floor test is cheap, fast, and feasible on the available RTX 4060 environment\. The selected idea is decomposed into two sub\-claims and evaluated with real operator runs\. The result refutes the load\-bearing premise under current PTM\-blind scorers, and the post\-mortem combines this negative evidence with the deferred PTM\-conditioning idea to regenerate a feasible follow\-up plan\.
Stage 0 \| /ideate – direction: structure\-aware PTM modelling for drug discovery

- A: PTM\-site disorder predictor \(pLDDT \+ IDR\)
- B: chirality\-aware AF3 diffusion noise schedule
- C: PTM\-resolved structural interactome \(ΔpDockQ\-per\-PTM\)
- D: PTM\-conditioned ensemble \(pair\-rep scaling adapter\)
- E: PTM\-aware degrader target nomination \(ΔpTernary\)

Stage 1 \| /novelty – Semantic Scholar \+ WebSearch \+ PubMed \+ wiki cross\-verification

- A: eliminated\. Subspace saturated: SAPP / PhosAF / GraphPhos / AstraPTM2 / DeepPCT / MTPrompt\-PTM \(≥5 2024–25 entries\)
- B: eliminated\. Not feasible: AF3 weights are non\-commercial; an external group cannot fine\-tune the diffusion head
- C: refined\. Restrict to PTM\-near\-interface edges; pre\-register 4\-case holdout \(14\-3\-3/Cdc25C, HIF1α/pVHL, PCNA\-K164, FOXO3a\)
- D: refined\. PTM adapter on a frozen Boltz\-2 backbone; train to ensemble populations; Boltz\-sample is the head\-to\-head baseline
- E: refined\. Per\-POI noise\-floor calibration as the load\-bearing fail\-fast gate; MD\-relaxed route decouples from a PTM\-Cfold

Stage 2 \| /exp\-design – composite\-score prioritisation of the three survivors

- C: deferred, composite 10\. Proteome\-scale AF2\-Multimer fold is off\-budget; AF3\-phospho collapse risk on the conformational cases
- D: deferred, composite 13\. Scoop risk: Boltz\-sample \(Jan 2026\) already does PTM\-blind pair\-rep scaling; needs an ensemble\-labelled train set
- E: selected, composite 16, priority 5\. Phase\-0 floor is cheap and kills\-or\-de\-risks fast; public weights, GREEN feasibility on an RTX 4060

Stage 3 \| /exp\-design → /exp\-run → /exp\-eval – selected idea decomposed into 2 sub\-claims and executed

Real blocks executed on 1× NVIDIA RTX 4060 \(8 GB\) · DeepTernary v1\.0\.0 \+ PROTAC\-STAN inference repos · pre\-registered blocks frozen, not executed

- Sub\-claim 1: noise\-floor\-calibrated ΔpTernary improves ranking
- Sub\-claim 2: MD\-relaxed phospho route ≈ native CCD\-PTM token \(decouples from a PTM\-Cfold, cf\. idea D\)

Real\-operator pilot ⇒ load\-bearing premise REFUTED \(bounded to current PTM\-blind scorers\)\. 15 POIs / 189 interface sites: phospho 14\.5%, alanine 15\.9%, Kme3 15\.7% clear the per\-POI floor vs 13\.4% chance \(p \> 0\.3; effect\-size CIs include 0; 0/69 survive BH\-FDR / Bonferroni\)\. Dose\-response control proves the operator is *non\-inert* → the bottleneck is the scorer’s dynamic range\.

Post\-mortem: the limit is the structure → score chain, not the readout or the threshold: a PTM\-*blind* scorer cannot be wrapped into a PTM\-selective one\. Null\-matching must use the identical operation as the signal\.

Idea D re\-enters \(deferred\): PTM\-conditioning supplies the missing scorer\.

Regenerated feasible plan \(new idea from two\): next\-gen idea = a *PTM\-sensitive* ternary scorer, the explicit prerequisite\. Deliverable = the negative bound \+ the frozen\-threshold benchmark a future scorer must clear\.
Figure 10: Idea screening and lifecycle pipeline for the biomedical case\. AutoSci filters candidate directions, executes selected paths, and converts negative evidence into a regenerated follow\-up plan\.
## Appendix D Biomedical Drug Discovery Experiment Suite
Figure [11](https://arxiv.org/html/2605.31468#A4.F11) shows the experiment suite for the selected biomedical idea, ptm\-aware\-degrader\-target\-nomination\. The suite separates executed experiments from frozen follow\-up benchmarks\. The executed blocks reproduce the base DeepTernary scorer, calibrate a per\-protein noise floor, and test whether PTM\-like interface perturbations produce signal beyond that floor\. The result is negative: the perturbations do not exceed chance\-level clearance under the current PTM\-blind scorer, and no site survives multiplicity correction\. Rather than discarding this outcome, AutoSci turns the negative bound into a pre\-registered benchmark that a future PTM\-sensitive scorer must clear\.
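The executed blocks described above rest on two statistical steps: a per\-protein noise floor and a multiplicity correction\. The sketch below illustrates both in plain Python\. It is not the authors' code: the floor rule \(k standard deviations of the null score changes, with a hypothetical default k=1\.5\), the function names, and the example numbers are all assumptions, since the source does not specify how the floor or the p\-values are computed\. Bonferroni and Benjamini\-Hochberg are implemented in their standard textbook form\.

```python
import statistics
from typing import List


def noise_floor(null_deltas: List[float], k: float = 1.5) -> float:
    """Hypothetical floor: k population standard deviations of the score
    changes produced by null (size-matched) perturbations for one protein."""
    return k * statistics.pstdev(null_deltas)


def clears_floor(delta: float, floor: float) -> bool:
    """A perturbation counts as signal only if its |score change| exceeds the floor."""
    return abs(delta) > floor


def bonferroni(pvals: List[float], alpha: float = 0.05) -> List[bool]:
    """Reject H0 for p-values at or below alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def benjamini_hochberg(pvals: List[float], alpha: float = 0.05) -> List[bool]:
    """Standard BH step-up: find the largest rank r with p_(r) <= r/m * alpha
    and reject the r smallest p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / m * alpha:
            cutoff = rank
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff:
            reject[idx] = True
    return reject


# Illustrative numbers only; they are not data from the paper.
floor = noise_floor([-0.1, 0.1, -0.1, 0.1])
print(clears_floor(0.2, floor), clears_floor(0.05, floor))
print(any(benjamini_hochberg([0.35, 0.6, 0.9])))
```

When every p\-value is large, as in the last call, neither procedure rejects anything, which is the shape of the negative result reported for this suite\.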
Experiment suite for ptm\-aware\-degrader\-target\-nomination: real blocks on 1× NVIDIA RTX 4060 \(8 GB\) · DeepTernary v1\.0\.0 \+ PROTAC\-STAN · solid = executed, dashed = frozen benchmark

Precondition / Baseline \(scorer reproduction\)

- DeepTernary reproduction – v1\.0\.0 on the 22\-complex unbound PROTAC test set\.
- Reproduction metric – DockQ top\-1 0\.397 vs authors’ 0\.429; Acceptable\-Rate 1\.0\.
- *Outcome:* base scorer reproduced, so downstream null results are not attributed to pipeline failure\.

Calibration / Sensitivity \(per\-POI noise floor\)

- Phase\-0 noise floor – N=200 size\-matched surface perturbations per tuple across 6 POIs\.
- Sensitivity ablation – perturbation count, displacement size, and chemistry\-aware vs geometric perturbations\.
- *Outcome:* per\-POI σ is 0\.037–0\.18; the null is POI\-specific and chemistry\-sensitive\.

Core Operator \(primary negative finding\)

- Real ΔpTernary operator – 15 POIs / 189 interface sites with phospho, alanine\-scan, and Kme3 perturbations\.
- Cross\-checks – PROTAC\-STAN control, Boltz\-2\-conditioned route on 2 POIs, and dose\-response Ala control\.
- *Result \(negative\):* 14\.5/15\.9/15\.7% clear vs 13\.4% chance; 0/69 survive FDR/Bonferroni\.

Pre\-registered Benchmark \(frozen follow\-up\)

- Validation track – calibrated ΔpTernary phospho\-PROTAC ranking; frozen top\-K AUC threshold\.
- Ablations and robustness – calibration, route comparison, scorer comparison, PTM type, and mutant tracks\.
- *Status:* six frozen\-threshold experiments await a qualifying PTM\-sensitive scorer\.

Edge labels: calibrated floor defines the null; negative bound defines the benchmark\.
Figure 11: Experiment suite for the biomedical drug discovery case\.