@snowboat84: https://x.com/snowboat84/status/2070656715515932930

X AI KOLs Timeline 06/26/26, 11:53 PM News

Summary

This article details the new paradigm of AI for Science (AI4S), from AI as an analysis tool to the transition to scientific agents, explaining autonomy levels, key cases, and future trends.

https://t.co/rcxSYIru6D

Original Article

Cached at: 06/27/26, 07:51 AM

AI for Science Detailed Introduction (Part 1): Paradigm and Landscape

Introduction

Artificial intelligence has been present in science for well over a decade, but for most of that time it did essentially one thing: read data, find patterns, and make predictions. AlphaFold is the apex of this path. After 2024, things changed. A batch of systems built around large models and capable of planning and acting on their own began to push AI toward the “proactive” side: proposing hypotheses, designing and executing experiments, writing papers, and iterating after failure. These systems are collectively called scientific agents, and the research paradigm they represent is known in the industry as AI for Science (hereinafter AI4S).

Around AI4S, the two most common judgments are both somewhat off. One is fooled by impressive demos into thinking “AI scientists” are nearly here. The other dismisses the field as another hype cycle. This article aims to sidestep both reactions and first map the landscape honestly: what it really is, what it looks like, and how far it has actually come. Who is paying for it and where the opportunities lie are left for the next installment.

The following six chapters proceed as follows. Chapter 1 discusses the paradigm itself, how AI moves from “reading science” to “doing science.” Chapter 2 clarifies concepts, what exactly counts as a scientific agent. Chapters 3 and 4 lay out existing systems using two frameworks: a vertical cut (six stages of the scientific workflow) and a horizontal cut (four layers of infrastructure). Chapter 5 scans the distribution of money and interest by discipline. Chapter 6 provides an honest assessment of maturity: where the capabilities are, where reliability stands, and where real discoveries are.

Chapter 1: From “AI Reading Science” to “AI Doing Science”

1.1 When AI designs usable molecules on its own

In 2024, a group of researchers from Stanford and other institutions built a multi-agent system called “Virtual Lab.” Several AI agents playing different roles (lead scientist, immunologist, computational biologist, critic) held “meetings” on how to respond to new variants of SARS-CoV-2. They proposed plans, questioned each other, converged on conclusions, and finally produced a batch of new nanobody candidates. These candidates were then synthesized and validated in a real wet lab, and some did bind to the target virus (Swanson et al., 2025, published in Nature).

Breaking this down reveals its significance. Proposing scientific hypotheses, designing validation paths, and producing new molecules that can be verified by experiment is normally the work of a trained team of human scientists. Here, a significant part of that work was done autonomously by AI. Around the same time, Lawrence Berkeley National Laboratory’s “A-Lab” demonstrated another form: an automated materials lab where robots autonomously synthesized and characterized new materials based on AI decisions (LBNL 2023, Nature). Earlier, Boiko et al. (2023, Nature) had already shown with Coscientist that an LLM-driven system could autonomously design and execute chemical experiments.

A chatbot tells you how to do an experiment. Here, it’s different: the AI actually did the experiment. The industry calls this emerging research paradigm AI for Science (AI4S). What this article aims to portray is the starting point of this transformation.

1.2 Paradigm shift: from analytical tool to active agent

To understand its novelty, we need to place it in the history of AI entering science. In the first phase, AI was an “upgraded calculator,” using machine learning for regression, classification, and dimensionality reduction, helping scientists handle data that was too much for human computation. The peak of this line was AlphaFold, which turned protein structure prediction from a hard problem into infrastructure. But its essence was still “reading”: given an amino acid sequence, it predicted a structure—a very strong predictor, but it didn’t decide “what to study next.”

The second phase, the subject of this article, is AI moving from “reading” to “doing.” The system does not just output predictions; it takes on the stages of the scientific method itself: planning, decision-making, using tools, operating equipment, interpreting results, and correcting direction. Multiple reviews use “autonomy levels” to characterize this leap, dividing the LLM’s role in scientific discovery into three levels: “Tool / Analyst / Scientist” (From Automation to Autonomy, EMNLP 2025), or “Assistant / Partner / Avatar” (Hitchhiker’s Guide to Scientific Agents, 2025). Regardless of the terminology, the core is the same: increased autonomy is the fundamental difference between this generation of systems and the previous one.

This shift matters because the bottleneck in science is often not “computing fast” but “thinking right and doing effectively.” A system that can ask good questions, design good experiments, and actually carry them out addresses the core constraint of research productivity, not just speeding up a computational step.

1.3 Why these two years: capability tipping point and “data exhaustion”

Why has this exploded in the last two years? Two forces converged. First, a capability tipping point. From 2023 to 2024, large models crossed two thresholds: long-chain reasoning and planning, and reliable tool use (search, code, databases, lab equipment interfaces). Once both were in place, “letting AI do a long series of tasks autonomously” went from a demo to a real possibility. In 2025, agentic systems appeared in rapid succession, and 2026 marks the start of the validation phase. The focus of competition has shifted from “can it be built?” to “can it work reliably in real environments?”

Second, the data-side logic, which is deeper and more relevant to capital and will be expanded in the next installment. Frontier large models are close to exhausting the high-quality text on the internet. Where will the next wave of high-value data come from? A growing consensus is: real-world scientific experiments. Having AI run experiments and generate physical and biological data that did not previously exist is seen by many as a key path to breaking the data bottleneck. This turns “AI doing science” from an academic ideal into a serious bet by industry and capital. This is only the hook; the next installment will elaborate on how it reshapes capital dynamics across the entire space.

Chapter 2: Conceptual Clarification: What Counts as a “Scientific Agent”

“Agent” is one of the most overused words since 2025, to the point where it carries almost no information. To make the subsequent landscape stand, we must pin down the concept: which systems count as scientific agents, which do not, and what are the criteria.

2.1 Three types of systems often confused

When current science-related AI systems are viewed side by side, at least three types are often conflated, yet their capability boundaries are completely different.

First, conversational AI (science Q&A assistant). You ask “How should I validate this hypothesis?” or “Explain the methods of this paper,” and it gives a textual answer. Its upper limit is “talk.” No matter how knowledgeable, it is an advisor that can explain things but will not act for you. Most general-purpose chat models that researchers use day to day fall into this category. They are useful, but they don’t change “who is doing the research.”

Second, traditional scientific machine learning (SciML). This is the mainstream form of AI entering science and the most fruitful line: using neural networks for property prediction, structure prediction, surrogate models, solving partial differential equations, etc. AlphaFold is its peak. Its characteristic is “strong prediction, zero autonomy”: it excels to the extreme on a well-defined task given by humans, but it does not decide research direction, does not plan multi-step processes, and does not use tools to pursue an open-ended goal. It is always the one being called; the subject that truly decides and uses tools is someone else.

Third, scientific agents. They add “autonomy” on top of the first two: they can decompose a relatively open goal into multiple steps, decide for each step what tools to use (retrieval, code execution, simulation, driving equipment), read results, and adjust next steps until the task is closed. The Virtual Lab, A-Lab, and Coscientist from section 1.1 all belong to this category.

The three are more like a layered relationship. Scientific agents often call conversational AI for reasoning and call SciML models for prediction internally. In other words, the agent is the “subject that uses tools,” while the first two are often the tools in its hands.

2.2 A dividing line: can it close the loop autonomously?

If you can only remember one criterion, it is this: Does it only output suggestions, or can it autonomously execute and close the loop?

A straightforward test: ask the system to “reset a database password.” A conversational AI will tell you the steps; a scientific agent will actually log into the system, execute the reset, and confirm success. If a tool still requires a human to “press the final button” for critical actions, it is closer to an assistant than an agent.

In the scientific context, this line manifests as: can the system, without step-by-step human intervention, complete the “plan, execute, observe, correct” cycle by itself? Works like Curie (Kon et al. 2025) emphasize “rigorous automated experiments” precisely because once this loop is automated, the biggest enemy becomes “errors midway that go uncorrected.” This also foreshadows the reliability issues to be discussed in Chapter 6.
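
The “plan, execute, observe, correct” cycle can be sketched in a few lines. This is a minimal illustration, not the design of Curie or any system named here: `run_closed_loop`, `toy_plan`, and `toy_execute` are hypothetical names, and the “experiment” is a toy function that stands in for a real tool call or instrument.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LoopResult:
    succeeded: bool
    history: List[str] = field(default_factory=list)


def run_closed_loop(
    plan: Callable[[List[str]], str],
    execute: Callable[[str], float],
    is_done: Callable[[float], bool],
    max_steps: int = 10,
) -> LoopResult:
    """Repeat plan -> execute -> observe -> correct until done or out of budget."""
    history: List[str] = []
    for _ in range(max_steps):
        action = plan(history)          # plan: choose the next action from what happened so far
        observation = execute(action)   # execute: run it (a tool call or instrument in practice)
        history.append(f"{action} -> {observation}")  # observe: record the outcome
        if is_done(observation):
            return LoopResult(True, history)
        # correct: the next call to plan() sees the updated history
    return LoopResult(False, history)


# Toy stand-ins (hypothetical): raise the temperature until the simulated yield is high enough.
def toy_plan(history: List[str]) -> str:
    return f"set_temperature:{60 + 10 * len(history)}"


def toy_execute(action: str) -> float:
    temperature = int(action.split(":")[1])
    return round(1.0 - abs(temperature - 80) / 100, 2)


result = run_closed_loop(toy_plan, toy_execute, lambda y: y >= 0.95)
print(result.succeeded, len(result.history))
```

The `max_steps` budget is the simplest guard against the failure mode described above: a loop that keeps acting on an uncorrected error.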

Also, it must be clear that autonomy is a continuum, not a binary switch. Completely human-independent “fully autonomous scientists” practically do not exist today. Real-world systems lie between “heavy human involvement” and “light human oversight.” This is exactly what the next section on “autonomy levels” aims to depict.

2.3 Levels of autonomy: Tool / Analyst / Scientist

Multiple reviews organize this field by levels. One of the clearer frameworks comes from the EMNLP 2025 survey “From Automation to Autonomy,” which divides the LLM’s role in scientific discovery into three levels:

Level 1: LLM as a Tool. Humans lead the research; AI helps with clearly specified subtasks: polishing text, generating a piece of code, doing a literature search. Decision-making power is entirely with humans.
Level 2: LLM as an Analyst. AI takes on more complete analysis tasks: autonomous data analysis, table/chart reasoning, statistical modeling, even proposing candidate models. Humans still set the problem and boundaries, but AI has some autonomous discretion within the boundaries.
Level 3: LLM as a Scientist. AI operates across multiple stages in long, autonomous chains: from hypothesis to experiment to conclusion. Humans retreat to “setting high-level goals plus oversight.” The cases in 1.1 and the end-to-end systems in 3.6 are approaching this level.

Another frequently cited three-part division (Hitchhiker’s Guide to Scientific Agents, 2025) uses “Assistant / Partner / Avatar” to express a similar progression. The survey by Ren et al. (2025, arXiv 2503.24047) takes a “mechanism-centric” perspective, focusing on internal mechanisms like planning and memory, and sharply points out: most current planning architectures are “task-specific,” far from a “general scientific planning ability” that can support open-ended discovery.

The value of these level frameworks for readers is: when you see a system claiming to be an “AI scientist,” the first thing is to judge which level it actually occupies. Many impressive demos are actually at Level 2, packaged as Level 3.

2.4 Similarities and differences with enterprise agents: same form, different essence

Scientific agents share the same “form” with today’s hot enterprise agents (customer service, sales, coding assistants); they are all systems that autonomously call tools and execute multi-step tasks. This is why Claude Code, customer service bots, and research assistants are all called “agents.” But they are different businesses, and the difference lies in the moat:

Enterprise agents have barriers mainly in engineering capability, system integration, and understanding of a specific business process. Their “right or wrong” usually has immediate, quantifiable feedback (ticket resolved, code passes tests, conversion improved).
Scientific agents have barriers in scientific judgment and domain know-how. Their “right or wrong” often has no immediate feedback. Whether a scientific conclusion is valid may require peer review, replication, and even years of validation to confirm. This makes the evaluation of scientific agents itself a difficult and critical problem (detailed in Chapters 4 and 6).

This difference has a direct corollary: in the scientific agent space, pure AI engineering capability is not enough to form a moat; people who understand the science hold a barrier that others cannot easily cross. This point will run through the entire series.

2.5 Our working definition

Based on the above, this article adopts the following working definition as the inclusion criterion for the subsequent landscape:

Scientific Agent: A system with an LLM or foundation model as its reasoning core, possessing autonomous planning capabilities, able to call external tools (retrieval, code execution, simulation, databases, experimental equipment, etc.) to advance a relatively open scientific goal, and able to adjust subsequent actions based on intermediate results. Its autonomy can range from the “analyst” to “scientist” levels.

Under this definition, pure chat Q&A and pure predictive SciML are not counted as main subjects, but they will be repeatedly mentioned as “internal organs” of scientific agents. In the next two chapters, we will systematically lay out existing work that fits this definition using two frameworks: vertical cut and horizontal cut.

Chapter 3: Vertical Cut of the Ecosystem: Six Stages of the Scientific Workflow

The most intuitive way to understand the AI4S landscape is to walk through the workflow of a scientist: read literature → propose hypotheses → design experiments → execute experiments → analyze data → write and review papers. For each stage, corresponding agents have emerged, with varying maturity and activity. This chapter reviews representative works by stage, and Chapter 4 complements with the horizontal infrastructure that underpins these stages.

3.1 Literature Retrieval and Knowledge Synthesis

This is the first stage to mature and be adopted in daily use by researchers. The reason is simple: low risk, immediate value—the overwhelming volume of papers is a real pain point for every researcher.

The benchmark in this stage is FutureHouse’s PaperQA / PaperQA2. Their 2024 work titled “Language Agents Achieve Superhuman Synthesis of Scientific Knowledge” (Skarlinski et al. 2024, arXiv 2409.13740) demonstrated a retrieval-augmented generation (RAG) agent for scientific Q&A. Rather than answering from memory, it searches literature in real time, locates evidence, and gives answers with citations; the authors claim it surpasses human experts at knowledge synthesis on several benchmarks. The “with citations, verifiable” point is especially critical because it directly counteracts the hallucination problem of large models, making outputs more trustworthy in scientific settings.
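
To make “retrieve evidence, then answer with citations” concrete, here is a toy sketch. It is not PaperQA’s implementation: the corpus, the keys `doc_a` to `doc_c`, and the word-overlap ranking are all made up for illustration, where a real system would use a search index, embeddings, and an LLM to write the answer.

```python
import re
from typing import Dict, List, Set, Tuple

# Hypothetical mini-corpus: citation key -> passage text (invented for this example)
CORPUS: Dict[str, str] = {
    "doc_a": "Nanobodies are small antibody fragments that bind viral spike proteins.",
    "doc_b": "Perovskite solar cells degrade under humidity and heat.",
    "doc_c": "Spike protein mutations can reduce antibody binding affinity.",
}


def tokenize(text: str) -> Set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question: str, k: int = 2) -> List[Tuple[str, str]]:
    """Rank passages by word overlap with the question; keep the top k that overlap at all."""
    q = tokenize(question)
    scored = [(len(q & tokenize(text)), key, text) for key, text in CORPUS.items()]
    scored = [s for s in scored if s[0] > 0]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(key, text) for _, key, text in scored[:k]]


def answer_with_citations(question: str) -> str:
    """Return only retrieved evidence, each sentence tagged with its source key."""
    evidence = retrieve(question)
    if not evidence:
        return "No supporting evidence found."
    return " ".join(f"{text} [{key}]" for key, text in evidence)


print(answer_with_citations("How do spike protein mutations affect antibody binding?"))
print(answer_with_citations("quantum gravity"))
```

The design point is the refusal branch: when nothing is retrieved, the function says so instead of producing an unsupported answer, which is what makes every returned sentence checkable against its source.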

Around “reading and searching,” there are other works with different emphases. LitLLM (Agarwal et al. 2024) targets literature survey generation. LitSearch (Ajith et al. 2024) targets literature retrieval. CiteME (Press et al. 2024) targets citation attribution—given a statement, find which paper it really comes from—important for combating “hallucinated citations.” ResearchArena (Kang & Xiong 2024) and SciLitLLM (Li et al. 2024) build literature survey workflows and scientific literature understanding capabilities respectively.

Also worth mentioning separately is AutoSurvey (Wang et al. 2024, NeurIPS), which lets an LLM automatically write survey articles. Such work demonstrates capability but also brings side effects: the flood of AI-generated surveys is already impacting the academic ecosystem (Chapter 6 and later in the series will return to this). This illustrates the two sides of the same ability: it can be both productivity and pollution.

Placing this stage in the overall picture, literature and knowledge synthesis is the stage closest to “reliable and usable” in AI4S. But it is essentially “assisted reading,” furthest from “autonomous research”—an entry point, not the destination.

3.2 Hypothesis Generation and Scientific Reasoning

If reading literature is “input,” then proposing valuable hypotheses is the core of scientific creativity and the touchstone for whether AI can truly “do science.”

The most substantial empirical work in this stage comes from Si et al. at Stanford (2024, arXiv 2409.04109). They conducted a large-scale controlled study where LLMs and human experts respectively generated research ideas, then had experts blindly evaluate them. The conclusion is interesting: Ideas generated by LLMs were rated higher in “novelty” than expert ideas, but weaker in “feasibility.” This finding almost defines the characteristics of AI hypothesis capability at the current stage: it can break out of human thinking patterns and propose novel combinations, but lacks judgment on “whether this path is actually practical.” Novel but infeasible is only half of creativity.

To compensate for the limitations of a single LLM, multi-agent approaches have been repeatedly tried. “Many Heads Are Better Than One” (Su et al. 2024) used multiple LLM agents collaborating to generate scientific ideas, letting different “minds” stimulate and filter each other. ResearchAgent (Baek et al. 2024) turned hypothesis generation into an iterative refinement cycle: generate, criticize, modify. The shared idea of these works is to use structured, multi-turn interaction to approximate the social process of “propose, question, polish” in human research. The Virtual Lab from 1.1 is a successful demonstration of this idea on a real problem, letting AIs “meet” and using a critic role specifically to find flaws.
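
The “generate, criticize, modify” cycle can be sketched as a small refinement loop. This is an illustration only, not the method of ResearchAgent or Virtual Lab: the `Idea` class, the integer scores, and the rule-based `critique` and `revise` functions are hypothetical stand-ins for LLM calls.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Idea:
    text: str
    novelty: int      # hypothetical score on a 0-10 scale
    feasibility: int  # hypothetical score on a 0-10 scale


def critique(idea: Idea, min_feasibility: int = 6) -> List[str]:
    """Stand-in critic: return a list of objections (an empty list means accepted)."""
    objections = []
    if idea.feasibility < min_feasibility:
        objections.append("experiment is not practical with available resources")
    return objections


def revise(idea: Idea, objections: List[str]) -> Idea:
    """Stand-in reviser: narrow the scope, trading a little novelty for feasibility."""
    return Idea(
        text=idea.text + " (narrowed scope)",
        novelty=max(0, idea.novelty - 1),
        feasibility=min(10, idea.feasibility + 2),
    )


def refine(idea: Idea, max_rounds: int = 5) -> Idea:
    """Alternate critique and revision until the critic has no objections."""
    for _ in range(max_rounds):
        objections = critique(idea)
        if not objections:
            break
        idea = revise(idea, objections)
    return idea


draft = Idea("Design one binder for every variant at once", novelty=9, feasibility=2)
final = refine(draft)
print(final)
```

In this toy run the draft starts novel but impractical, and two rounds of criticism bring feasibility up to the threshold at a small cost in novelty, which mirrors the trade-off reported by Si et al.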

So hypothesis generation is the most exciting and also the most unreliable stage in AI4S. It occasionally impresses, but the systematic bias of “novel but infeasible” means that “judging which hypothesis is worth pursuing” must remain firmly in human hands for the foreseeable future.

3.3 Experiment Design and Self-Driving Labs (Experiment Execution)

Putting “hands-on experiment execution” into an autonomous loop is the heaviest, most domain- and hardware-dependent, and most impactful stage in AI4S. It pulls AI from the screen into the physical world.

The foundational work is Coscientist (Boiko et al. 2023, Nature): an LLM-driven system that can autonomously search literature, plan chemical synthesis routes, and actually execute experiments through lab automation equipment. It proved the path of “language model + lab hardware” is feasible in chemistry.

The representative in materials science is Berkeley’s A-Lab (2023, Nature), a highly automated lab where robots autonomously synthesize and characterize candidate materials based on AI decisions, completing the “propose formula, synthesize, measure, update model” loop. In 17 days of continuous operation, it reportedly accomplished what would take humans months. It embodies the essence of self-driving labs: the coupling of cognitive intelligence (deciding what to do) with physical automation (doing it). However, A-Lab also reminds us not to rush to take demos as conclusions. Solid-state chemist Robert Palgrave and others later publicly argued that the characterization (XRD analysis) of those “new materials” was not sound and that none was convincingly shown to be a truly new phase; the Nature paper subsequently issued a correction. This is a live example of what Chapter 6 will discuss: “impressive demos, questionable real-world validity.”

The most convincing work in biology is still the Virtual Lab from 1.1 (Swanson et al., 2025 Nature): a complete loop from hypothesis to molecular design to wet lab validation, producing new molecules confirmed by experiments. Additionally, ChatMOF and other systems demonstrated autonomous prediction and generation for specific material classes (metal-organic frameworks). In bioinformatics, attempts with two-loop architectures (planning loop + implementation loop) for autonomous analysis have also appeared (e.g., Huang et al. 2025).

The key constraint in this stage is not “intelligence” but “reliability and cost of the closed loop”: real experiments are expensive, time-consuming, and cannot be “undone.” If AI makes an error in a long chain that goes uncorrected, the cost is real materials and time. This turns the “cascading failure” problem from an abstract probability into a tangible reality on the lab bench.
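
A quick calculation shows why long chains are fragile. Assume, purely for illustration, that each step succeeds independently with the same probability; the 95% figure below is an assumed number, not a measurement from any system discussed here.

```python
def chain_success_probability(per_step_success: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independent steps."""
    return per_step_success ** steps


for n in (5, 20, 50):
    print(n, round(chain_success_probability(0.95, n), 3))
# 5 0.774
# 20 0.358
# 50 0.077
```

Under these assumptions, a step that is right 95% of the time yields a 20-step experiment that finishes cleanly only about a third of the time, which is why detecting and correcting errors mid-chain matters more than raw per-step accuracy.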

3.4 Data Analysis, Code, and Paper Replication

The credibility of science rests on “reproducibility.” One core action of reproducibility is turning the methods in a paper into code that produces the same results. This stage is therefore both an analysis tool and a gatekeeper of credibility.

Paper2Code (Seo et al. 2025) and AutoP2C (Lin et al. 2025) use multi-stage LLM pipelines to automatically translate (machine learning) papers into runnable code repositories, essentially “reading methods + reconstructing implementation.” MLR-Copilot (Du 2024) targets autonomous machine learning research, linking the process from idea to experiment.

More ambitious is DeepMind’s AlphaEvolve (Novikov et al. 2025, arXiv 2506.13131). It orchestrates a group of Gemini models for algorithm and scientific discovery, using evolutionary code mutation plus evaluation feedback to iteratively find better solutions. Follow-up work further combines this with “deep research” for discovering scientific algorithms (DeepEvolve, arXiv 2510.06056). This line shows that AI can not only reproduce existing methods but also evolve new ones.
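
The “mutate, evaluate, keep the best” pattern can be shown with a toy evolutionary loop. This is not AlphaEvolve’s implementation: AlphaEvolve uses LLMs to mutate code, whereas this sketch mutates two numbers with Gaussian noise, and the target line `y = 3x + 1` is an arbitrary stand-in for an automated evaluator.

```python
import random
from typing import List, Tuple

Candidate = Tuple[float, float]  # (slope, intercept) of a line y = a*x + b


def score(candidate: Candidate) -> float:
    """Evaluation feedback: negative squared error against y = 3x + 1 (higher is better)."""
    a, b = candidate
    return -sum((a * x + b - (3 * x + 1)) ** 2 for x in range(5))


def mutate(candidate: Candidate, rng: random.Random) -> Candidate:
    """Propose a variant by adding small Gaussian noise to each parameter."""
    a, b = candidate
    return (a + rng.gauss(0, 0.3), b + rng.gauss(0, 0.3))


def evolve(generations: int = 200, population_size: int = 8, seed: int = 0) -> Candidate:
    rng = random.Random(seed)
    population: List[Candidate] = [(0.0, 0.0)] * population_size
    for _ in range(generations):
        children = [mutate(p, rng) for p in population]
        # Keep the best of parents plus children, so the top score never gets worse.
        population = sorted(population + children, key=score, reverse=True)[:population_size]
    return population[0]


best = evolve()
print(best, score(best))
```

The loop only works because `score` is cheap, automatic, and trustworthy; where no such evaluator exists, this style of search has nothing to climb.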

But “being able to reproduce” also exposes the difficulty of “reproducing.” When an AI system claims to have reproduced a paper, how do we verify it truly did, rather than generating superficially plausible results? This leads us to the evaluation and benchmarks emphasized in Chapters 4 and 5: without a credible judge, “automatic reproduction” itself cannot be trusted.

3.5 Paper Writing and Peer Review

At the end of the workflow are writing and review, which are also the most controversial stages.

On the writing side, the aforementioned AutoSurvey and the “paper generation” modules of various end-to-end systems can already produce well-formatted, readable draft papers. The problem: fluency does not equal correctness, and complete formatting does not equal real contribution.

Exploration on the review side is more sensitive. Generative Adversarial Reviews (Bougie & Watanabe 2024) and similar attempts let LLMs act as reviewers, using adversarial criticism to help improve papers. Some conferences (e.g., ICLR) have already incorporated AI review suggestions into their workflows. However, an independent evaluation of Sakana’s AI Scientist (2025, arXiv 2502.14297) provides a sobering finding: AI-generated reviews are often well-formatted but shallow, missing deep methodological flaws. This actually raises the importance of “second-level reviewers” (area chairs), because humans are needed to judge whether a piece of work has real value.

The tension in this stage condenses the entire tension of AI4S: AI approaches humans in “form” but still lags significantly in “substantive judgment.” Automation of writing and review is therefore deeply tied to research integrity issues (expanded in Chapter 6).

3.6 Unifying the Six Stages: End-to-End “AI Scientists”

When the above stages are linked into an automated pipeline, we get the most attention-grabbing and controversial type of system: the end-to-end “AI scientist,” corresponding to Level 3 in the autonomy framework.

The starting point is Sakana AI’s The AI Scientist (v1) (Lu et al. 2024, arXiv 2408.06292). It first demonstrated full end-to-end automation “from idea to paper,” but relied heavily on manually written code templates, and its exploration process was linear, limiting depth and adaptability. The AI Scientist-v2 (Yamada et al. 2025, arXiv 2504.08066) introduced “agentic tree search” and a dedicated experiment management agent, reducing template dependence and supporting multi-round iterative exploration. One of its generated papers was accepted at an ICLR workshop (and withdrawn before formal publication, per prior agreement), sparking heated discussion about AI authors and AI reviewers entering the process at the same time. Kosmos (Mitchener et al. 2025) follows a similar line, decoupling experiment management from templates.

There are many parallel explorations. Agent Laboratory (Schmidgall et al. 2025, arXiv 2501.04227) organizes LLM agents into a research assistant team covering literature to experiments to reports. AI-Researcher (Tang et al. 2025, arXiv 2505.18705) targets autonomous scientific innovation. Curie (Kon et al. 2025, arXiv 2502.16069) emphasizes experimental rigor and reproducibility. SciAgents (Ghafarollahi et al. 2024) uses multi-agent “graph reasoning” for automated discovery, and is particularly prominent in materials. PiFlow (2025, arXiv 2505.15047) introduces “principle-awareness,” constraining the discovery process by scientific principles rather than blind search. DeepScientist (Weng et al. 2025) focuses on progressive frontier discovery. Carl (Autoscience Institute 2025) was reported as one of the earliest systems to produce research that passed academic peer review, though AI Scientist-v2 has a competing claim to that “first.” Google’s AI Co-Scientist (2025) puts emphasis on hypothesis generation and human-AI collaboration. Additionally, multi-domain assistants like Denario, and astrophysics-specific ones like AI Cosmologist (automated cosmological statistical inference; see Chapter 5), are domain-specific variants of this category. Some have even begun envisioning dedicated publication ecosystems for AI scientist outputs, such as aiXiv (Zhang et al. 2025).

Putting these systems together yields a sobering judgment: They work in terms of “running the pipeline,” producing well-formatted papers and occasionally passing review, but the evidence for “producing real, important new science” remains thin. Most impressive results are either on controlled, verifiable narrow problems or still involve significant human intervention. Mistaking “end-to-end pipeline works” for “scientists are automated” is the most common cognitive bias today. Chapter 6 will use concrete evidence to calibrate this judgment.

The table below summarizes representative systems across the six stages for a holistic index:

Stage	Representative Systems
Literature Retrieval & Knowledge Synthesis	PaperQA/PaperQA2, LitLLM, LitSearch, CiteME, ResearchArena, SciLitLLM, AutoSurvey
Hypothesis Generation & Scientific Reasoning	Many Heads (Su et al.), ResearchAgent, Virtual Lab
Experiment Design & Self-Driving Labs	Coscientist, A-Lab, Virtual Lab, ChatMOF, Huang et al. 2025
Data Analysis, Code & Replication	Paper2Code, AutoP2C, MLR-Copilot, AlphaEvolve
Paper Writing & Peer Review	AutoSurvey, Generative Adversarial Reviews, various end-to-end modules
End-to-End AI Scientist	AI Scientist v1/v2, Kosmos, Agent Laboratory, AI-Researcher, Curie, SciAgents, PiFlow, DeepScientist, Carl, AI Co-Scientist, Denario, AI Cosmologist, aiXiv

Chapter 4: Horizontal Cut of the Ecosystem: Four Layers of Infrastructure

The six stages in Chapter 3 answer “what AI does at each step of the scientific process.” But they don’t answer another question: what underpins these stages? Any scientific agent, no matter which stage it works in, needs a reasoning foundation, an execution environment, a yardstick for evaluation, and a set of interfaces to the outside world. These four things cut across all stages, forming the infrastructure layer of AI4S—what this series repeatedly calls the “utilities.” This chapter expands on each layer.

4.1 Scientific Foundation Models (SciFM)

The first layer is the foundation. In the early years of scientific agents, most systems directly called general large models (GPT, Claude, Gemini, etc.) as the brain. But general models are not trained for science; they often have shortcomings in specialized symbols, units, and domain reasoning. So a clear trend has emerged: foundation models specifically trained for science (Scientific Foundation Models, SciFM).

The idea of SciFM is to train models on scientific literature, experimental data, and specialized modalities (molecular graphs, crystal structures, gene sequences, spectra, time-series observations, etc.), so that they have a more solid “prior” on scientific tasks than general models. Representative directions include: BioNeMo series for biology; multi-modal models for materials and chemistry (e.g., IBM’s FM4M covering molecular graphs, 3D atomic coordinates, electron density, etc.); time-series foundation models for climate and earth systems; and “deep potential” style models that integrate first-principles potential energy into molecular dynamics. This direction is important enough that it has spawned dedicated academic conferences (e.g., SciFM-themed conferences).

SciFM for scientific agents is like general large models for general agents: it determines the scientific literacy ceiling of the “brain.” There’s one judgment worth remembering: in scientific fields lacking data, embedding “hard knowledge” like physical constraints, conservation laws, and symmetries into the model (physics-informed approach) is often more effective than simply piling on data. This point will reappear in Chapter 5 for physics and climate directions, and is also where domain experts can contribute unique value.
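
A small worked example shows how embedding known physics can substitute for data. The setup is hypothetical and much simpler than a physics-informed neural network: with only two height measurements of a falling object, hard-coding the known gravity term leaves a model with two unknowns, which two points determine exactly, while a generic straight-line fit through the same points extrapolates poorly.

```python
G = 9.81  # m/s^2, known physical constant embedded in the model


def fit_free_fall(t1: float, y1: float, t2: float, y2: float):
    """Recover (y0, v0) from two measurements, assuming y(t) = y0 + v0*t - 0.5*G*t^2."""
    # Move the known gravity term to the data side; what remains is linear in t.
    z1 = y1 + 0.5 * G * t1 ** 2
    z2 = y2 + 0.5 * G * t2 ** 2
    v0 = (z2 - z1) / (t2 - t1)
    y0 = z1 - v0 * t1
    return y0, v0


def predict(y0: float, v0: float, t: float) -> float:
    return y0 + v0 * t - 0.5 * G * t ** 2


# Synthetic "measurements" generated from a known truth (y0 = 10 m, v0 = 2 m/s)
truth = (10.0, 2.0)
t1, t2 = 0.5, 1.0
y1, y2 = predict(*truth, t1), predict(*truth, t2)

y0_fit, v0_fit = fit_free_fall(t1, y1, t2, y2)
physics_prediction = predict(y0_fit, v0_fit, 1.4)

# Baseline without physics: a straight line through the same two points
slope = (y2 - y1) / (t2 - t1)
line_prediction = y2 + slope * (1.4 - t2)

true_value = predict(*truth, 1.4)
print(abs(physics_prediction - true_value), abs(line_prediction - true_value))
```

The physics-constrained model recovers the trajectory from two points, while the unconstrained line is off by more than a meter at `t = 1.4` in this synthetic setup.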

4.2 Self-Driving Labs as Infrastructure

The second layer is the execution environment. Section 3.3 introduced self-driving labs from the “experiment execution stage” perspective. From another angle, self-driving labs themselves are a layer of reusable infrastructure that standardizes the process of “turning AI decisions into real operations in the physical world” and turns it into a platform.

From an infrastructure perspective, self-driving labs solve the interface problem between cognition and physics: how does the AI brain translate “what should be measured next” into instructions a robot can execute, and how are measurement results fed back to the brain in structured form? Systems like Coscientist and A-Lab provided answers in their respective domains, but they are mostly vertical and specialized. A still-open opportunity is the “general orchestration layer for self-driving labs,” allowing different devices and different disciplines to share a common scheduling and learning framework for experimental loops. The core difficulty of this layer is experimental design in high-dimensional space (deciding the next experiment) and uncertainty management, which is exactly where people with statistics and physics backgrounds are strong.
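
“Deciding the next experiment” can be illustrated with the simplest possible rule: measure where you know least. The sketch below uses distance to the nearest measured condition as a crude proxy for uncertainty. That proxy is an assumption made for brevity; a real orchestration layer would more likely use a statistical surrogate model, and the function names here are hypothetical.

```python
from typing import List


def nearest_distance(x: float, measured: List[float]) -> float:
    """Distance from a candidate condition to the closest condition already measured."""
    return min(abs(x - m) for m in measured)


def next_experiment(candidates: List[float], measured: List[float]) -> float:
    """Pick the candidate farthest from any measured condition (max-uncertainty proxy)."""
    return max(candidates, key=lambda x: nearest_distance(x, measured))


# Hypothetical one-dimensional design space, e.g. a normalized dopant fraction
candidates = [0.0, 0.25, 0.5, 0.75, 1.0]
measured = [0.0, 0.25]

first_pick = next_experiment(candidates, measured)
second_pick = next_experiment(candidates, measured + [first_pick])
print(first_pick, second_pick)
# 1.0 0.5
```

In one dimension this is trivial. The difficulty named above comes from doing the same thing across dozens of coupled parameters, where exhaustive grids are impossible and each measurement is costly.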

Self-driving labs are the “heaviest” of the four layers: they require real equipment, space, and capital, making them the hardest for small teams to build from scratch (as the next installment will show, this is where capital is most concentrated and the barrier highest). But they are also the soul that distinguishes AI4S from “pure software AI”: without them, AI can only ever “read and think,” never truly “do.”

4.3 Evaluation, Benchmarks, and Reproducibility

The third layer is the yardstick for validation, and in this series’ judgment, the most underappreciated yet critical layer. The logic is straightforward: when a system claims “I can make scientific discoveries,” why should we believe it? Someone must design the exam, set the answer key, judge correctness, and verify reproducibility of results. Without this layer, the outputs of all previous stages cannot be trusted, and therefore cannot be deployed.
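The four duties named above (design the exam, set the answer key, judge correctness, verify reproducibility) map onto a very small harness. The sketch below is a toy illustration with hypothetical names (`Task`, `evaluate`, `lookup_agent`), not the design of any benchmark mentioned in this article; real scientific benchmarks need far richer grading than a numeric tolerance.

```python
import itertools
import math
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass(frozen=True)
class Task:
    prompt: str            # the exam question
    answer: float          # the answer key
    rel_tol: float = 0.01  # how close counts as correct

def is_correct(task: Task, output: float) -> bool:
    """Judge one output against the answer key."""
    return math.isclose(output, task.answer, rel_tol=task.rel_tol)

def evaluate(agent: Callable[[str], float], tasks: Sequence[Task],
             n_repeats: int = 3) -> dict:
    """Run every task several times; report accuracy and reproducibility."""
    solved = stable = 0
    for task in tasks:
        verdicts = [is_correct(task, agent(task.prompt)) for _ in range(n_repeats)]
        solved += all(verdicts)            # correct on every repeat
        stable += len(set(verdicts)) == 1  # same verdict on every repeat
    n = len(tasks)
    return {"accuracy": solved / n, "reproducibility": stable / n}

TASKS = [
    Task("speed of light in vacuum, m/s", 299_792_458.0),
    Task("Avogadro constant, 1/mol", 6.02214076e23),
]

# Two hypothetical agents: one deterministic, one right only every other call.
KEY = {t.prompt: t.answer for t in TASKS}

def lookup_agent(prompt: str) -> float:
    return KEY[prompt]

_calls = itertools.count()

def flaky_agent(prompt: str) -> float:
    return KEY[prompt] if next(_calls) % 2 == 0 else 0.0

report_lookup = evaluate(lookup_agent, TASKS)
report_flaky = evaluate(flaky_agent, TASKS)
```

Separating "was it right" from "was it right every time" is the design choice worth noting: an agent that is correct on only some runs scores zero on both measures here, which is the behavior a deployment decision should penalize.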

This layer is rapidly proliferating along “discipline × task type” lines, and most benchmarks are public and downloadable:

Cross-disciplinary / General Science: ScienceAgentBench (Chen et al. 2024) evaluates agents on data-driven discovery tasks taken from peer-reviewed papers and validated by domain experts. OpenAI’s PaperBench (Starace et al. 2025) evaluates AI’s ability to reproduce AI research. SciReplicate-Bench (Xiang et al. 2025) tests algorithm reproduction from papers. CORE-Bench tests whether agents can reproduce a paper’s results from its full code repository. AstaBench (2025) is a rigorous benchmark suite for scientific research. AAAR-1.0, DiscoveryWorld, and ScienceBoard approach the problem from the angles of research assistance, simulated discovery environments, and scientific desktop tasks respectively.
Biology / Chemistry: LAB-Bench (FutureHouse, arXiv 2407.10362) targets biology research workflows with over 2,400 tasks covering literature, databases, sequences, protocols, patents, and more. BixBench targets biomedicine.

Three features of this layer are worth remembering. First, it must be built by domain experts: setting problems, defining correct answers, and judging correctness all require genuine understanding of that science; pure AI teams cannot create effective scientific benchmarks. Second, it is far from saturated: with combinatorial explosion along “discipline × subfield × task type × real facility data,” existing benchmarks cover only a tiny fraction. Third, scores are generally very low: on multiple benchmarks, the strongest models are far from expert level, indicating that the problem is far from solved and the field is still at an early stage. These three points together make the evaluation layer the sharpest entry point for people with domain backgrounds. This judgment will be expanded in the next installment; for now, it is marked as a piece of the landscape. Chapter 5 will show that physics and astrophysics happen to be among the disciplines where this layer started relatively early.

4.4 Tools and Orchestration Layer

The fourth layer is the interface to the outside world. A scientific agent must be able to call tools to do real work: search engines, code execution sandboxes, scientific databases, simulation engines, experimental equipment APIs. Safely and reliably connecting these tools to the agent, and orchestrating the collaboration of multiple tools and multiple sub-agents, is the responsibility of the tools and orchestration layer.

This layer is already relatively mature in the general agent domain (various agent frameworks, tool protocols, memory and state management). In the scientific domain, additional domain-specific interfaces need to be handled, such as connecting a particular astronomy database, a specific simulation code, or a type of experimental equipment. A concrete example is using natural language to query scientific databases (e.g., text-to-SQL systems in astronomy, see Chapter 5). Its essence is “making the database a tool for the agent to use.”
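“Making the database a tool for the agent to use” can be sketched with Python’s built-in `sqlite3` module. Everything here is illustrative: the `objects` table and its rows are made up, `text_to_sql` is a hypothetical stand-in for a real text-to-SQL model, and the SELECT-only check is a deliberately naive guard rather than a real security boundary.

```python
import sqlite3

def make_catalog() -> sqlite3.Connection:
    """Build a tiny in-memory catalog with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE objects (name TEXT, distance_ly REAL, magnitude REAL)")
    conn.executemany("INSERT INTO objects VALUES (?, ?, ?)", [
        ("obj_a", 8.6, -1.5),
        ("obj_b", 25.0, 0.0),
        ("obj_c", 550.0, 0.4),
    ])
    return conn

def query_tool(conn: sqlite3.Connection, sql: str, max_rows: int = 100) -> dict:
    """Tool wrapper: run one read-only query and return a structured result."""
    if not sql.lstrip().lower().startswith("select"):
        return {"ok": False, "error": "only SELECT statements are allowed"}
    try:
        cur = conn.execute(sql)
        columns = [d[0] for d in cur.description]
        rows = cur.fetchmany(max_rows)  # cap the payload handed back to the agent
    except (sqlite3.Error, sqlite3.Warning) as exc:
        return {"ok": False, "error": str(exc)}  # errors come back as data
    return {"ok": True, "columns": columns, "rows": rows}

def text_to_sql(question: str) -> str:
    """Hypothetical stand-in for a text-to-SQL model: one canned translation."""
    canned = {
        "which objects are closer than 10 light-years?":
            "SELECT name, distance_ly FROM objects "
            "WHERE distance_ly < 10 ORDER BY distance_ly",
    }
    return canned[question.strip().lower()]

conn = make_catalog()
answer = query_tool(conn, text_to_sql("Which objects are closer than 10 light-years?"))
rejected = query_tool(conn, "DROP TABLE objects")
broken = query_tool(conn, "SELECT no_such_column FROM objects")
```

Returning errors as structured data rather than raising lets the agent read the message and retry with a corrected query, which is usually what a tool interface for an agent wants.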

From a reliability perspective, the tools and orchestration layer is also a high-frequency failure zone: tool call failures, inconsistent return formats, state loss in long chains—all can cause the entire task to collapse. This directly leads us to Chapter 6: the maturity of the infrastructure ultimately determines whether scientific agents can move from “demo” to “production.”
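The three failure modes listed above suggest three defensive habits: retry transient failures, validate the shape of every return value, and checkpoint completed steps so a long chain can resume. The sketch below is one minimal way to combine them, with hypothetical names throughout (`call_tool`, `run_chain`, `flaky_simulation`) and a made-up energy formula; production agent frameworks handle this with considerably more machinery.

```python
import time

class ToolError(Exception):
    """Raised when a tool call cannot be completed reliably."""

def call_tool(tool, args, required_keys, max_attempts=3, backoff_s=0.0):
    """Call a tool with retries and validate the shape of its result."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            result = tool(**args)
            if not isinstance(result, dict):
                raise ToolError(f"expected dict, got {type(result).__name__}")
            missing = set(required_keys) - set(result)
            if missing:
                raise ToolError(f"missing keys: {sorted(missing)}")
            return result
        except Exception as exc:  # any failure counts as one failed attempt
            last_error = exc
            if attempt < max_attempts:
                time.sleep(backoff_s * attempt)  # linear backoff between tries
    raise ToolError(f"failed after {max_attempts} attempts: {last_error}")

def run_chain(steps, state):
    """Run named steps in order, skipping any already recorded in state."""
    for name, tool, args, required_keys in steps:
        if name in state:
            continue  # checkpointed earlier: do not redo the work
        state[name] = call_tool(tool, args, required_keys)
    return state

# Hypothetical tool that times out twice before succeeding.
calls = {"n": 0}

def flaky_simulation(temperature_k):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated transient failure")
    return {"energy_ev": -1.5 * temperature_k / 300.0, "converged": True}

steps = [("simulate", flaky_simulation, {"temperature_k": 300.0},
          ["energy_ev", "converged"])]
state = run_chain(steps, {})
```

Because `state` is a plain dictionary keyed by step name, it could be written to disk between steps; rerunning `run_chain(steps, state)` then resumes where the chain stopped instead of starting over.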

The table below summarizes the four infrastructure layers:

Infrastructure Layer	Description and Key Systems
Scientific Foundation Models (SciFM)	Models trained specifically for science (e.g., BioNeMo, FM4M, deep potential, physics-informed models)
Self-Driving Labs	Physical lab infrastructure enabling autonomous experiments (e.g., Coscientist, A-Lab)
Evaluation, Benchmarks & Reproducibility	Yardsticks for assessing agents: ScienceAgentBench, PaperBench, LAB-Bench, AstaBench, etc.
Tools & Orchestration	Interfaces and frameworks for tool use and multi-step orchestration (general agent frameworks, domain-specific connectors)

Overlaying the “vertical cut” from Chapter 3 with the “horizontal cut” from this chapter gives the complete ecological coordinates of AI4S: each specific system can be located by “which stage it works in and which infrastructure layers it relies on.” With this coordinate system, in the next chapter we switch to a different axis—discipline—to see how money and interest are distributed.

Chapter 5: Disciplinary Landscape: Money, Interest, and Commercial Exits

The previous two chapters mapped the “technology landscape” using the axes of stages and infrastructure. But the development of AI4S is not uniform across disciplines: funding, talent, and attention are highly uneven. Behind this unevenness is a clear logic: the closer a discipline is to a “monetizable outcome,” the more money it attracts. This chapter scans by discipline, noting representative work, commercial exits, and appeal to capital for each. How capital actually operates (who invests, and how much) is left for the middle installment; this chapter only outlines the distribution and its logic.

5.1 Life Sciences and Health: Most money, because the exit is clearest

Life sciences are the most capital-intensive discipline in AI4S. The reason is straightforward: outcomes can become drugs, and drugs have clear, large payers—the ultimate commercial exit. From protein structure (AlphaFold’s legacy) to protein/enzyme design, antibody discovery, target identification, clinical data analysis, AI has a landing point in every step of life sciences.

Representative capabilities have been validated. The Virtual Lab from Section 1.1 designed nanobodies that were confirmed by experiments. In protein/enzyme design, deep partnerships with large pharma have emerged