AI scientists produce results without reasoning scientifically
Summary
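A large-scale study of more than 25,000 agent runs finds that LLM-based scientific agents ignore evidence in 68% of reasoning traces and revise beliefs in response to refutation in only 26%, indicating that they execute scientific workflows without exhibiting the epistemic patterns of scientific reasoning.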
Cached at: 04/23/26, 11:54 AM
Source: https://huggingface.co/papers/2604.18805
Abstract
Large language model-based scientific agents show consistent reasoning patterns that lack key epistemic features of scientific inquiry, regardless of task type and even when successful reasoning trajectories are supplied as context, indicating fundamental limitations in their ability to replicate genuine scientific reasoning.
Large language model (LLM)-based systems are increasingly deployed to conduct scientific research autonomously, yet whether their reasoning adheres to the epistemic norms that make scientific inquiry self-correcting is poorly understood. Here, we evaluate LLM-based scientific agents across eight domains, spanning workflow execution to hypothesis-driven inquiry, through more than 25,000 agent runs and two complementary lenses: (i) a systematic performance analysis that decomposes the contributions of the base model and the agent scaffold, and (ii) a behavioral analysis of the epistemological structure of agent reasoning. We observe that the base model is the primary determinant of both performance and behavior, accounting for 41.4% of explained variance versus 1.5% for the scaffold. Across all configurations, evidence is ignored in 68% of traces, refutation-driven belief revision occurs in 26%, and convergent multi-test evidence is rare. The same reasoning pattern appears whether the agent executes a computational workflow or conducts hypothesis-driven inquiry. They persist even when agents receive near-complete successful reasoning trajectories as context, and the resulting unreliability compounds across repeated trials in epistemically demanding domains. Thus, current LLM-based agents execute scientific workflows but do not exhibit the epistemic patterns that characterize scientific reasoning. Outcome-based evaluation cannot detect these failures, and scaffold engineering alone cannot repair them. Until reasoning itself becomes a training target, the scientific knowledge produced by such agents cannot be justified by the process that generated it.
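The abstract's remark that unreliability compounds across repeated trials can be illustrated with a small calculation. The sketch below assumes trials are independent and share a single per-trial success rate; that simplification, the function name `prob_all_trials_succeed`, and the example rates are illustrative assumptions and are not taken from the paper, which may use a different reliability measure.

```python
def prob_all_trials_succeed(per_trial_success: float, n_trials: int) -> float:
    """Probability that every one of n_trials succeeds.

    Assumes independent trials with an identical success rate, which is a
    simplification made for illustration only.
    """
    if not 0.0 <= per_trial_success <= 1.0:
        raise ValueError("per_trial_success must lie in [0, 1]")
    if n_trials < 0:
        raise ValueError("n_trials must be non-negative")
    return per_trial_success ** n_trials


if __name__ == "__main__":
    # Hypothetical per-trial success rates, not values reported in the paper.
    for rate in (0.9, 0.7, 0.5):
        cells = ", ".join(
            f"k={k}: {prob_all_trials_succeed(rate, k):.3f}" for k in (1, 3, 5, 10)
        )
        print(f"per-trial success {rate:.1f} -> {cells}")
```

Under these assumptions, even a fairly high single-trial success rate decays quickly when every repeated trial has to succeed, which is one way to read the claim that a single successful outcome says little about how reliable the process behind it is.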
Datasets citing this paper: 13 (four listed)
jablonkagroup/corral-traces • Updated about 23 hours ago • 80.9k • 497
jablonkagroup/corral-oss-trace-logprobs • Updated 1 day ago • 122k • 162
jablonkagroup/corral-environment-tasks • Updated about 23 hours ago • 909 • 114
jablonkagroup/corral_runs_reports • Updated about 23 hours ago • 609 • 112
Similar Articles
AI scientists produce results without reasoning scientifically [R]
A study of 25,000 AI scientist trials finds the agents ignore evidence 68% of the time and rarely revise hypotheses, showing popular scaffolding fixes don’t instill true scientific reasoning.
ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment
Introduces ForeSci, a temporally controlled benchmark for evaluating whether LLM agents can make forward-looking research judgments from historical evidence. It contains 500 tasks across four AI domains and shows that explicit evidence organization improves traceability but reveals a recurring evidence-decision decoupling.
@ProfBuehlerMIT: For science, AI sovereignty and physics-grounded reasoning are non-negotiable. But how can we teach a small LLM like Ge…
mistral.rs now natively supports Agent Skills, enabling locally-run small LLMs to perform complex agentic workflows for scientific tasks, with full control over models, data, and execution.
@dair_ai: Can an LLM agent actually build a model of an environment it cannot see? This work makes the question gradeable. An age…
A research paper proposes agentic automata learning to evaluate whether LLM agents can infer hidden world models through interaction, finding that performance drops sharply as task complexity increases and that reasoning models outperform non-reasoning ones but still struggle.
Simulate, Reason, Decide: Scientific Reasoning with LLMs for Simulation-Driven Decision Making
Researchers from the University of Michigan introduce MechSim, a mechanism-grounded neuro-symbolic reasoning framework that enables LLM agents to reason about the internal assumptions, dependencies, and execution behavior of scientific simulators rather than treating them as black boxes. The framework improves explanation quality and decision-making reliability across high-stakes domains like healthcare, finance, and public policy.