@THUTeamEureka: 1/3 Excited to open-source EurekAgent! A fully autonomous research system for metric-driven tasks, built with Claude Co…

X AI KOLs Timeline 06/15/26, 04:50 PM Papers

Summary

THU Team Eureka open-sources EurekAgent, an autonomous research system built with Claude Code that achieves state-of-the-art results on math, kernel engineering, and ML tasks through environment engineering.

1/3 Excited to open-source EurekAgent! 👾✨ A fully autonomous research system for metric-driven tasks, built with Claude Code. Define the problem and metric. Get breakthrough results. 📜 https://t.co/UQyFTr7aK8 💻 https://t.co/vWexAPltaT #AI #LLM #ClaudeCode #AutoResearch https://t.co/lrluZXZPaZ

Original Article

Cached at: 06/16/26, 01:37 PM

Agent Environment Engineering is All You Need for Autonomous Scientific Discovery

Source: https://arxiv.org/html/2606.13662

Amy Xin¹, Jiening Siow¹, Junjie Wang¹, Zijun Yao¹, Fanjin Zhang², Jian Song¹, Lei Hou¹, Juanzi Li¹

¹ Department of Computer Science and Technology, Tsinghua University
² School of Information, Renmin University of China

{xin-x25, xiaojn25}@mails.tsinghua.edu.cn

Abstract

LLM-based agents have shown increasing potential in automating scientific discovery. Given an optimizable metric and an execution environment, they can propose, validate, and iterate scientific solutions, and have produced results that outperform human-designed approaches. As model capabilities continue to improve, we argue that the bottleneck for autonomous scientific discovery is shifting from prescribing agent workflows to designing agent environments: the resources, constraints, and interfaces that shape agent behavior. We frame this as environment engineering: building environments that amplify productive behaviors, such as open-ended exploration, systematic artifact management, and inter-agent collaboration, while suppressing harmful behaviors, such as reward hacking and high-friction human oversight. We present EurekAgent, an environment-engineered agent system for metric-driven autonomous scientific discovery. EurekAgent engineers the environment along four dimensions: permissions engineering for bounded agent execution and isolated evaluation; artifact engineering for filesystem and Git-based collaboration; budget engineering for budget-aware exploration; and human-in-the-loop engineering for easy human supervision and intervention. EurekAgent sets new state-of-the-art results on multiple mathematics, kernel engineering, and machine learning tasks, including new state-of-the-art 26-circle packing results discovered with less than $11 in total API cost. We open-source our code and results (https://github.com/THU-Team-Eureka/EurekAgent), and call for environment engineering as a core research direction for developing reliable autonomous research agents.

Figure 1: EurekAgent score evolution progress on the 26-circle packing problem.

| | Circle Packing (↑) | Erdős' Min. Overlap (↓) | 1st Autocorr. Ineq. (↓) | TriMul (↓) | MLE-Bench (↑) |
|---|---|---|---|---|---|
| Domain | Mathematics | Mathematics | Mathematics | Kernel Eng. | Machine Learning |
| Prev. Best Human | ∼2.634 [5] | 0.380927 [7] | 1.509730 [18] | 2096.04 µs | N/A |
| Prev. Best AI | 2.635986 [27] | 0.380876 [30] | 1.502863 [30] | 2247.78 µs [30] | 71.43% [31] |
| EurekAgent | 2.635999 | 0.380870 | 1.502861 | 2005.03 µs | 85.71% |

Table 1: An overview of EurekAgent's performance on metric-driven research tasks across mathematics, kernel engineering, and machine learning. EurekAgent sets new state-of-the-art across all mathematics and kernel engineering tasks, and ranks first on the evaluated MLE-Bench subset. (↑) denotes higher is better, while (↓) denotes lower is better. Prev. Best Human and Prev. Best AI denote the best published human and AI results before EurekAgent.

## 1 Introduction

Large language models are increasingly transforming scientific discovery from manual trial-and-error to computational exploration: in domains where research progress can be measured by an optimizable metric, LLM-based agents can autonomously propose hypotheses, run experiments, observe feedback, and iterate solutions, reducing human effort in method tuning while greatly expanding the scale of exploration. LLM-based agents have already produced strong results in tasks across domains such as mathematics, algorithms, kernel engineering, and machine learning engineering (Jiang et al., 2025; Novikov et al., 2025; Lange et al., 2025; Wang et al., 2025; Yuksekgonul et al., 2026; Ouyang et al., 2025; Chan et al., 2025; Yang et al., 2025; Zhang et al., 2026; Li et al., 2025). We envision this as an emerging paradigm shift in scientific research: humans increasingly focus on selecting valuable directions, formulating meaningful metrics, and supervising validity, while agents execute large volumes of methodological exploration.

Most existing autonomous research systems realize this vision by prescribing research-specific agentic workflows. Evolutionary systems such as AlphaEvolve explicitly maintain populations of candidate programs and use evaluator feedback to guide mutation and selection (Novikov et al., 2025; Lange et al., 2025; Liu et al., 2026b). Machine learning systems such as AIDE organize exploration around solution trees, feedback loops, and role-specialized agents (Jiang et al., 2025; Yang et al., 2025). More recent systems introduce structured debate, periodic self-review, and self-learning modules (Liu et al., 2026a; Qu et al., 2026). While these designs can be effective, they also encode strong assumptions about how research should proceed. As general-purpose coding agents like Claude Code and Codex become stronger, recent evidence suggests that much of the useful capability may already reside in the base agent: given a clear research task and an optimizable metric, these agents can already discover new state-of-the-art scientific solutions (Liu et al., 2026c; Karpathy, 2026). On ResearchClawBench (Xu et al., 2026), a benchmark of 40 research tasks across 10 diverse domains, both Claude Code and Codex, used as standalone general-purpose agents, outperform all evaluated research-specific agent systems.

However, task performance alone does not make a reliable autonomous researcher. Scientific discovery requires rigor, reproducibility, and inspectability, yet agents may contaminate evaluations, manipulate artifacts, or fail to follow procedural constraints. Such reward-hacking and observability failures have already been reported in agentic research systems (Luo et al., 2025; Kokoromyti, 2026; Anthropic, 2026). Therefore, trusting agents without environmental constraints can lead to impressive but unreliable results.

These observations suggest that as general-purpose agents become more capable, the bottleneck for autonomous scientific discovery is shifting from prescribing agent behavior through detailed workflows to engineering the environments in which agents operate. We frame this as environment engineering. This echoes Gibson's theory of affordances in ecological psychology: an environment shapes the possibilities for action available to an actor, "either for good or ill" (Gibson, 1979). For scientific discovery, a well-engineered environment should suppress harmful affordances such as evaluation tampering and artifact manipulation, while amplifying productive affordances such as free exploration, accurate rewards, inter-agent coordination, and easy human supervision. The analogy is a capable PhD student: productivity comes not from minute-by-minute instructions, but from accountability, research autonomy, accurate feedback, peer collaboration, and mentor supervision.

We present EurekAgent, an agent system for autonomous scientific discovery that coordinates off-the-shelf CLI agents through four environment engineering dimensions: (1) permissions engineering, to expose useful capabilities and resources while preventing research-integrity violations; (2) artifact engineering, to structure solutions, logs, and evaluation results as shared progress memory; (3) budget engineering, to enable budget-aware exploration with runtime and compute boundaries; and (4) human-in-the-loop engineering, to support easy human supervision and intervention. Within this environment, the agent remains free to select its own research workflows and strategies.

We evaluate EurekAgent on metric-driven research tasks spanning mathematics, kernel engineering, and machine learning engineering. Using off-the-shelf CLI agents and environment-level design, EurekAgent achieves new state-of-the-art results across all mathematics and kernel engineering tasks, and ranks first on the evaluated MLE-Bench subset. Furthermore, with Claude Code as the CLI agent and GLM-5.1 as the base model, EurekAgent achieves new state-of-the-art results on the three mathematics tasks with an average API cost below $17; the 26-circle packing task has the lowest API cost, at $11. We call for environment engineering as a core research direction for building capable, efficient, and responsible autonomous research agents.

2 Related Work

2.1 Agents for Scientific Discovery

Autonomous research agents have attracted growing interest as a way to accelerate scientific discovery through large-scale computational exploration. Systems such as The AI Scientist aim to automate scientific research in an end-to-end manner, covering stages such as idea generation, experimentation, and paper writing (Lu et al., 2024). Within this broader vision, one especially concrete direction is scientific discovery with verifiable objectives and optimizable metrics, where agents autonomously explore and evolve solutions through evaluator feedback. In machine learning engineering, systems such as AIDE, R&D-Agent, AIBuildAI, MLE-STAR, and ML-Master formulate progress as iterative code development guided by validation scores (Jiang et al., 2025; Yang et al., 2025; Zhang et al., 2026; Nam et al., 2026; Zhu et al., 2026). In algorithmic and mathematical discovery, training-free solution evolution methods such as FunSearch, AlphaEvolve, ShinkaEvolve, EvoX, AdaEvolve, and OpenEvolve use LLMs to propose or mutate candidate programs under evaluator-guided selection (Romera-Paredes et al., 2024; Novikov et al., 2025; Lange et al., 2025; Liu et al., 2026b; Cemri et al., 2026; Sharma, 2025). More recently, test-time training systems such as ThetaEvolve and TTT-Discover further use the optimizable metric as a reward signal to update the model during exploration (Wang et al., 2025; Yuksekgonul et al., 2026). These systems demonstrate the power of evaluator-guided discovery, but they typically use fixed workflows to prescribe core agent behaviors such as proposal, mutation, selection, or reflection. EurekAgent instead uses strong general-purpose CLI agents as basic nodes, and focuses on engineering an environment that lets agents exercise their own capabilities reliably.

2.2 Agent Environments and Research Integrity

As agents become more autonomous, the surrounding environment becomes a central determinant of reliability. Some recent systems have begun to recognize the importance of environment reliability and introduce safeguards for specific failure modes. For example, MLE-STAR adds leakage checking for machine learning pipelines, and CORAL hides grader code behind an evaluation interface (Nam et al., 2026; Qu et al., 2026). At the same time, analyses of real reward-hacking incidents show that agents can exploit weak evaluation protocols, contaminate evidence, or violate procedural assumptions (Luo et al., 2025; Kokoromyti, 2026; Anthropic, 2026). Instruction-following failures in complex agentic settings further suggest that reliability cannot be delegated entirely to prompt engineering (Qi et al., 2025). Some existing work therefore explores environment design to avoid common failures, but these designs are usually introduced as task-specific safeguards. EurekAgent makes environment engineering the central design objective: it organizes permissions, artifacts, budgets, and human oversight as first-class mechanisms for supporting open-ended agent exploration while preserving evaluator integrity, traceability, and reproducibility.

Figure 2: Overview of EurekAgent. Given task inputs and budgets, EurekAgent executes a prepare stage followed by repeated propose and parallel implement stages, while the environment engineering layer provides secure evaluation, artifact memory, budget control, and human oversight.

3 EurekAgent

In this section, we present the system design of EurekAgent. Figure 2 summarizes the overall architecture. We first give an overview of the system loop (Section 3.1), then detail the environment engineering designs (Section 3.2).

3.1 System Overview

EurekAgent is an environment-engineered agent system for metric-driven research tasks. Given a problem description, a hidden evaluation script, a submission-format specification document, optional initial code, and time and API cost budgets, EurekAgent coordinates multiple sessions of off-the-shelf CLI agents to autonomously propose and iterate high-scoring solutions. Instead of prescribing a detailed research workflow, EurekAgent engineers an outer environment that organizes agent activity through a simple three-stage loop:

$$\textsc{Prepare}\rightarrow\left[\textsc{Propose}_{r}\rightarrow\{\textsc{Implement}_{r,p}\}_{p=1}^{P_{r}}\right]_{r=1}^{R},\quad P_{r}\leq P,$$

where $R$ is the maximum number of iteration rounds and $P$ is the maximum number of parallel implementation sessions per implement stage, both adjustable by the user. Each round consists of one proposal session followed by up to $P$ parallel implementation sessions. Across stages and rounds, the environment only handles outer-loop coordination: it initializes the workspace, transitions between stages, specifies each session's objective and required deliverables, exposes the tool and resource interfaces, records and ranks scored solutions, persists run and session state, and enforces time and cost budgets. Within these boundaries, CLI-agent sessions freely decide their own research strategies, experiment plans, implementation details, and refinement processes.
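The control flow of this loop can be illustrated with a minimal sketch. It is not the EurekAgent implementation: the functions `prepare`, `propose`, and `implement`, the `Submission` record, and the toy scoring rule are all hypothetical stand-ins for real CLI-agent sessions, and threads are used only to show the fan-out of up to P parallel sessions per round.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Submission:
    round_index: int
    hypothesis: str
    score: float


def prepare(task: str) -> str:
    # Hypothetical stand-in for the one-time preparation session.
    return f"preparation summary for {task}"


def propose(summary: str, history: list, round_index: int, max_parallel: int) -> list:
    # Hypothetical stand-in for a proposal session: returns P_r <= P hypotheses.
    return [f"r{round_index}-h{p}" for p in range(1, max_parallel + 1)]


def implement(round_index: int, hypothesis: str) -> Submission:
    # Hypothetical stand-in for an implementation session with a toy score.
    toy_score = round_index + 0.1 * int(hypothesis.rsplit("h", 1)[1])
    return Submission(round_index, hypothesis, toy_score)


def run(task: str, max_rounds: int, max_parallel: int) -> list:
    summary = prepare(task)                 # Prepare stage, executed once
    history = []                            # ranked submissions across rounds
    for r in range(1, max_rounds + 1):      # rounds r = 1..R
        hypotheses = propose(summary, history, r, max_parallel)[:max_parallel]
        # Fan-out: one implementation session per hypothesis, run in parallel.
        with ThreadPoolExecutor(max_workers=max_parallel) as pool:
            results = list(pool.map(lambda h: implement(r, h), hypotheses))
        history.extend(results)
        # Fan-in: rank everything so the next proposal sees the best first.
        history.sort(key=lambda s: s.score, reverse=True)
    return history


history = run("toy task", max_rounds=2, max_parallel=3)
print(history[0])
```

In this sketch the outer loop owns only coordination (stage order, parallelism, ranking), which mirrors the division of responsibility described above; what each session does internally is left entirely to the session.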

Prepare Stage.

Before iteration begins, EurekAgent launches a preparation agent session to establish a reliable runtime setup for subsequent solution iteration. The agent reads the problem description, the evaluator-facing submission requirements document, and optional initial code; tests the hidden evaluation service; and installs or validates required runtime dependencies. If the problem setup is ambiguous or broken, the agent can pause and request human clarification rather than allowing optimization to proceed from an unreliable setup. The stage ends by writing a preparation summary and a completion artifact, which become shared context for later proposal and implementation sessions. This stage is executed only once at the beginning of the research process; after that, EurekAgent executes $R$ rounds of propose–implement iteration.

Propose Stage.

At the beginning of each iteration round, EurekAgent launches a proposal agent session to generate diverse initial hypotheses for the next round of solution optimization. The session reads the task inputs, the preparation summary, and the ranked best solutions from previous rounds, if any. It may also inspect previous-round workspaces for implementation details and use web search or browsing tools to gather related literature or existing open-source solutions. It then writes a manifest containing up to $P$ candidate hypotheses and creates an implementation-ready description for each hypothesis. This stage acts as the fan-in step of EurekAgent: empirical evidence from earlier rounds, together with information from the internet, is distilled into a new set of promising, diverse, and independently executable research hypotheses.

Implement Stage.

The implement stage is the fan-out step of EurekAgent. For each proposed hypothesis, EurekAgent launches a separate implementation agent session in parallel and assigns it a separate workspace. Each session starts from its assigned hypothesis as an initial direction, but may iteratively refine, debug, or modify the solution according to feedback from the hidden evaluator. Sessions submit candidate solutions through the secure evaluation service, which records all evaluated submissions and maintains both intermediate results and the best valid result. After the parallel implement sessions complete or exhaust their budgets, EurekAgent automatically ranks all valid submissions and updates a ranked solution history file as shared context for the next round. This propose–implement loop combines broad parallel exploration with cross-round accumulation of empirical progress, and continues until the budget limits or stage completion conditions are reached.

3.2 Environment Engineering in EurekAgent

EurekAgent is designed through four environment engineering dimensions: (1) permissions engineering, (2) artifact engineering, (3) budget engineering, and (4) human-in-the-loop engineering. The goal is to grant agent sessions enough affordances to perform open-ended solution optimization, while making the research process reliable, inspectable, and resource-bounded.

Figure 3: The EurekAgent web monitor interface. The monitor provides a user-friendly overview of each run, including status logs, score evolution, per-round and global best approaches, and budget usage. It also records complete session transcripts in an organized view, allowing users to inspect full trajectories of every agent session.

#### Permissions Engineering.

Scientific discovery agents need broad capabilities, but unconstrained capability can compromise research integrity. EurekAgent imposes system-level permission boundaries to support productive exploration while preventing research-integrity violations. On the productive side, EurekAgent provides a freely configurable Python environment, workspace-level shell access, capable web search and browser tools, and full access to the same run's previous-round artifacts. This gives agents researcher-like access to tools, files, the internet, and prior experience to aid solution iteration. On the constraint side, EurekAgent uses run-level isolation and controller-owned interfaces to prevent common failure modes. Each run executes inside a Docker container with a mounted workspace, protecting files outside the run from accidental or adversarial modification. The hidden evaluator and any associated test data are kept outside the agent-visible workspace and exposed only through a secure grading service: agents can submit candidates and receive official scores, but cannot inspect or modify the evaluator itself. The authoritative result files generated by the hidden evaluator are automatically updated by the system, and hooks are implemented to block agent modification of these controller-owned files. EurekAgent also enforces same-round isolation among parallel implementation sessions: an implementation session may learn from previous rounds, but cannot inspect or copy from peer approaches in the same round, reducing premature collapse toward a single local direction. For GPU tasks, EurekAgent uses a default-deny policy: GPUs are invisible unless acquired through a provided GPU helper API, which records lock ownership and ensures that each physical GPU is held by at most one agent session at a time. Together, these mechanisms expose useful resources while removing high-risk affordances such as evaluator leakage, score tampering, uncontrolled GPU contention, and same-round solution copying.
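The default-deny GPU policy can be pictured with a small in-process sketch. The class `GpuLockRegistry` and its methods are hypothetical and are not the EurekAgent helper API; the paper does not describe how GPUs are hidden, so exporting the returned string as `CUDA_VISIBLE_DEVICES` is only one plausible mechanism and is an assumption here.

```python
import threading


class GpuLockRegistry:
    """Toy registry: each GPU id is held by at most one session at a time."""

    def __init__(self, gpu_ids):
        # None means the GPU is free; otherwise the value is the owner session.
        self._owners = {gpu_id: None for gpu_id in gpu_ids}
        self._mutex = threading.Lock()

    def acquire(self, session_id):
        """Give the session the first free GPU, or None if all are held."""
        with self._mutex:
            for gpu_id, owner in self._owners.items():
                if owner is None:
                    self._owners[gpu_id] = session_id
                    return gpu_id
        return None

    def release(self, session_id, gpu_id):
        """Release a GPU; only the recorded owner may do so."""
        with self._mutex:
            if self._owners.get(gpu_id) != session_id:
                raise PermissionError(f"{session_id} does not hold GPU {gpu_id}")
            self._owners[gpu_id] = None

    def visible_devices(self, session_id):
        """Default-deny view: an empty string unless the session holds a lock."""
        with self._mutex:
            held = [str(g) for g, owner in self._owners.items() if owner == session_id]
        return ",".join(held)


registry = GpuLockRegistry([0, 1])
gpu = registry.acquire("session-a")
print(gpu, repr(registry.visible_devices("session-a")), repr(registry.visible_devices("session-b")))
```

A real deployment would need the lock state to live outside the agent-writable workspace and to survive process restarts; the sketch only shows the ownership bookkeeping and the deny-by-default view.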

Artifact Engineering.

EurekAgent uses the filesystem coupled with Git history as shared long-term memory. The filesystem stores stage deliverables for cross-session communication, including preparation summaries, proposal manifests, hypotheses, solution code, evaluator feedback, and scored submissions. EurekAgent also maintains system-managed artifacts: web-search history is logged as a cache of explored internet information, and official scores are automatically recorded and ranked. The ranked historical solutions enable later agent sessions to quickly identify strong prior solutions and inspect their code, logs, and intermediate results when needed. All run artifacts are persisted under the run directory, providing the persistent substrate for traceability, interruption recovery, and resumability. Within each session, Git commits track solution evolution. We instruct agents to describe both the current standalone solution and what changed from the previous version in each commit message.
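As an illustration of a system-managed ranking artifact, the following sketch writes a ranked history file under a run directory. The file name `ranked_history.json`, the record fields, and the function `update_ranked_history` are assumptions made for this example; the paper does not specify the on-disk format.

```python
import json
import tempfile
from pathlib import Path


def update_ranked_history(run_dir, submissions, higher_is_better=True):
    """Rank valid submissions by score and persist them as a JSON artifact."""
    valid = [s for s in submissions if s["valid"]]
    ordered = sorted(valid, key=lambda s: s["score"], reverse=higher_is_better)
    # Copy each record and attach a 1-based rank without mutating the input.
    ranked = [dict(entry, rank=i) for i, entry in enumerate(ordered, start=1)]
    history_path = Path(run_dir) / "ranked_history.json"  # hypothetical name
    history_path.write_text(json.dumps(ranked, indent=2))
    return ranked


with tempfile.TemporaryDirectory() as run_dir:
    toy_submissions = [  # toy scores, not results from the paper
        {"id": "a", "score": 2.60, "valid": True},
        {"id": "b", "score": 2.70, "valid": False},
        {"id": "c", "score": 2.63, "valid": True},
    ]
    for entry in update_ranked_history(run_dir, toy_submissions):
        print(entry["rank"], entry["id"], entry["score"])
```

Invalid submissions are dropped before ranking, so a high score from a rejected candidate never appears in the context handed to later sessions.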

Budget Engineering.

Autonomous research agents can consume substantial time, compute, and API budget, so EurekAgent makes budget limits part of the environment settings. EurekAgent controls resources along two axes: wall-clock time and API cost. For time, users specify separate limits for proposal and implementation sessions, reflecting that hypothesis generation and long-running solution iteration require different time scales. Furthermore, EurekAgent makes agents time-aware through both active and passive mechanisms: (i) actively, agents can call a provided time-checking helper API to inspect elapsed and remaining time for the current stage; (ii) passively, when the deadline for a stage is approaching and required deliverables are still missing, EurekAgent injects a warning message asking the agent to stop exploration and generate the necessary artifacts. For API cost, EurekAgent tracks accumulated token usage across sessions, but does not expose token consumption information to the agent. When the cost limit is reached, the run is aborted and the current workspace is preserved as the final snapshot. Budget control also supports operational continuity for long-running research processes. EurekAgent persists each stage's session identifier, status, elapsed time, and effective budget, so an interrupted run can resume from the latest filesystem state under the remaining budget rather than restarting from scratch. Users may also revise the configured time limits, or grant explicit extra resume time when a stage has exhausted its budget before producing required artifacts. This makes budget engineering not only a stopping rule, but also an operational interface for controlled continuation.
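The two budget axes can be sketched as two small trackers. The names `StageBudget` and `CostBudget`, the `warn_fraction` threshold, and the injectable `clock` are assumptions for illustration; the paper does not state when the warning fires or how cost is computed from token usage.

```python
import time


class StageBudget:
    """Wall-clock budget for one stage, with an active check and a passive warning."""

    def __init__(self, time_limit_s, warn_fraction=0.9, clock=time.monotonic):
        self._clock = clock
        self._start = clock()
        self.time_limit_s = time_limit_s
        self.warn_fraction = warn_fraction  # hypothetical threshold

    def elapsed(self):
        return self._clock() - self._start

    def remaining(self):
        # Active mechanism: what a time-checking helper could report to the agent.
        return max(0.0, self.time_limit_s - self.elapsed())

    def should_warn(self, deliverables_done):
        # Passive mechanism: warn only if the deadline is near and work is missing.
        near_deadline = self.elapsed() >= self.warn_fraction * self.time_limit_s
        return near_deadline and not deliverables_done


class CostBudget:
    """Controller-side API cost tracker; never exposed to the agent."""

    def __init__(self, limit_usd):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def record(self, usd):
        """Add a charge and report whether the run should be aborted."""
        self.spent_usd += usd
        return self.spent_usd >= self.limit_usd


fake_now = [0.0]
stage = StageBudget(time_limit_s=100.0, clock=lambda: fake_now[0])
fake_now[0] = 95.0
print(stage.remaining(), stage.should_warn(deliverables_done=False))
```

Passing the clock in as a parameter keeps the tracker testable and also makes resumption straightforward: a resumed stage can be constructed with a limit equal to the remaining budget persisted from the interrupted run.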

Figure 4: The EurekAgent terminal UI. The terminal UI preserves a CLI-agent style view for inspecting live session outputs and communicating with agents through the bottom input box. Left: prepare-stage snapshot. Right: implement-stage snapshot with three parallel implementation sessions; users can preview all sessions or enter any session to inspect details and communicate.

Human-in-the-loop Engineering.

Although EurekAgent supports fully autonomous iteration, scientific discovery still benefits from human oversight. EurekAgent therefore provides two complementary interfaces. On one hand, a terminal UI (Figure 4) exposes per-approach progress, raw session outputs, and an input box for users to communicate with active sessions. On the other hand, a web monitor (Figure 3) provides higher-level views of the run, presenting a visualized score evolution with per-round and global best approaches. These interfaces preserve agent autonomy while keeping the process fully observable and allowing humans to redirect agent behavior when needed.

4 Experiments

We evaluate EurekAgent on three domains: mathematics, kernel engineering, and machine learning engineering. We focus on tasks with optimizable metrics, where progress can be measured by objective scores. All experiments use EurekAgent with Claude Code as the CLI agent and GLM-5.1 as the base LLM. Building on this setting, we configure the Web Search Prime MCP (https://docs.z.ai/devpack/mcp/search-mcp-server) to enable search engine capabilities, and the Playwright MCP (https://github.com/microsoft/playwright-mcp) to enable web browser navigation.

4.1 Mathematics

We evaluate EurekAgent on three mathematical optimization problems following prior work (Novikov et al., 2025; Wang et al., 2025; Yuksekgonul et al., 2026): (1) circle packing, where the objective is to place 26 non-overlapping circles inside a unit square and maximize the sum of their radii, using the OpenEvolve-style evaluator with a $10^{-6}$ tolerance for boundary and overlap checks; (2) Erdős' minimum overlap, where the objective is to minimize the limiting maximum overlap between two equal-size sets; and (3) the first autocorrelation inequality, where the objective is to find a nonnegative construction that certifies the tightest known upper bound on an autoconvolution constant. For circle packing, we compare against the previous best AI result reported under the same tolerance setting. All three problems have verifiable and optimizable objective functions, making them suitable for agentic solution iteration. We report EurekAgent hyperparameter settings in Appendix A.
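To make the circle packing objective concrete, here is a minimal validity-and-score check. It is a sketch under stated assumptions, not the OpenEvolve evaluator: the function `packing_score` is hypothetical, and the way the $10^{-6}$ tolerance is applied (as slack on both the boundary and the pairwise-distance checks) is an assumption about how such an evaluator might work.

```python
import math


def packing_score(circles, tol=1e-6):
    """Return the sum of radii if the packing is valid, else None.

    circles: list of (x, y, r) tuples inside the unit square [0, 1] x [0, 1].
    """
    for x, y, r in circles:
        if r < 0:
            return None
        # Boundary check: each circle must lie inside the square, up to tol.
        if x - r < -tol or x + r > 1 + tol or y - r < -tol or y + r > 1 + tol:
            return None
    for i in range(len(circles)):
        xi, yi, ri = circles[i]
        for j in range(i + 1, len(circles)):
            xj, yj, rj = circles[j]
            # Overlap check: centers must be at least ri + rj apart, up to tol.
            if math.hypot(xi - xj, yi - yj) < ri + rj - tol:
                return None
    return sum(r for _, _, r in circles)


# Four tangent circles of radius 0.25 tile the unit square without overlap.
grid = [(0.25, 0.25, 0.25), (0.75, 0.25, 0.25), (0.25, 0.75, 0.25), (0.75, 0.75, 0.25)]
print(packing_score(grid))
```

Because the differences between the scores in Table 1 appear only around the fifth decimal place, the tolerance convention matters, which is why results are compared only under the same tolerance setting.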

Table 2: Performance of EurekAgent on three mathematical optimization problems. Previous best AI results are from test-time training systems, while EurekAgent remains training-free with only environment engineering designs.

As shown in Table 2, EurekAgent establishes new state-of-the-art results on all three mathematics tasks. Notably, EurekAgent outperforms prior test-time training systems while remaining training-free, suggesting that environment engineering alone can unlock breakthrough results without updating the backbone model.

4.2 Kernel Engineering

We evaluate EurekAgent on the GPUMODE TriMul competition, which targets optimized implementations of triangular matrix multiplication. Submissions are evaluated by the geometric mean runtime across benchmark cases, where lower runtime is better. We evaluate on the A100 setting.

Because the official GPUMODE leaderboard has closed, we could not submit new solutions and obtain official scores. We therefore evaluate locally on an A100 GPU using the released TTT-Discover TriMul setting (Yuksekgonul et al., 2026), with only minimal format adaptation for EurekAgent submissions. For fair comparison, we download top leaderboard solution scripts from GPUMODE and regrade them under the same local protocol. All candidates are evaluated on the same A100 GPU with the original correctness tests, benchmark cases, scoring rule, and timing logic unchanged. We run three warmup rounds, followed by ten measured rounds with randomly shuffled candidate order to reduce order effects, and report both median and mean geometric-mean runtimes. For EurekAgent, we report the four best solutions discovered throughout a single system run; hyperparameter settings are listed in Appendix A.
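The regrading protocol described above can be sketched as follows. This is not the TTT-Discover or GPUMODE harness: the function `benchmark`, the `run_case` callback, and the toy candidates are hypothetical, and real timing on a GPU would require proper device synchronization that is out of scope here.

```python
import random
import statistics


def benchmark(candidates, cases, run_case, warmup_rounds=3, measured_rounds=10, seed=0):
    """Return per-candidate median and mean of per-round geometric-mean runtimes.

    candidates: dict mapping a name to a candidate object.
    run_case:   hypothetical callable (candidate, case) -> runtime in microseconds.
    """
    rng = random.Random(seed)
    names = list(candidates)
    per_round = {name: [] for name in names}

    # Warmup rounds: run everything, discard the timings.
    for _ in range(warmup_rounds):
        for name in names:
            for case in cases:
                run_case(candidates[name], case)

    # Measured rounds: shuffle candidate order each round to reduce order effects.
    for _ in range(measured_rounds):
        order = names[:]
        rng.shuffle(order)
        for name in order:
            runtimes = [run_case(candidates[name], case) for case in cases]
            per_round[name].append(statistics.geometric_mean(runtimes))

    return {
        name: {"median": statistics.median(values), "mean": statistics.fmean(values)}
        for name, values in per_round.items()
    }


# Toy stand-in: "runtime" is just a per-candidate factor times a per-case size.
toy_candidates = {"kernel_a": 2.0, "kernel_b": 8.0}
toy_cases = [1.0, 4.0]
print(benchmark(toy_candidates, toy_cases, run_case=lambda factor, size: factor * size))
```

Reporting both the median and the mean of the per-round geometric means gives a quick check for outlier rounds: if the two diverge noticeably, a few rounds were likely disturbed.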

Table 3: Performance of EurekAgent on the TriMul kernel engineering task.

Table 3 shows that EurekAgent discovers multiple solutions that outperform the top leaderboard submissions under the same local evaluator. The top four EurekAgent solutions all achieve median runtimes below 2031 µs, indicating stable high-quality optimization rather than a single lucky candidate. The best EurekAgent kernel improves over the strongest regraded leaderboard solution by about 4.3% and over TTT-Discover by about 10.8%.

4.3 Machine Learning Engineering
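The two percentages can be checked against the TriMul runtimes in Table 1, assuming that the Prev. Best Human entry (2096.04 µs) is the strongest regraded leaderboard solution and that the Prev. Best AI entry (2247.78 µs) is TTT-Discover. With relative improvement defined as the runtime reduction divided by the baseline runtime:

```latex
\[
\frac{2096.04 - 2005.03}{2096.04} = \frac{91.01}{2096.04} \approx 0.0434 \quad (\text{about } 4.3\%),
\qquad
\frac{2247.78 - 2005.03}{2247.78} = \frac{242.75}{2247.78} \approx 0.1080 \quad (\text{about } 10.8\%).
\]
```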

We evaluate EurekAgent on a curated subset of seven competitions from the MLE-Bench Lite split (Chan et al., 2025). MLE-Bench evaluates agents on real Kaggle-style machine learning competitions, where submissions are scored against held-out test sets and mapped to medal thresholds. To balance cost, diversity, and difficulty, we start from the 22 Lite competitions and use public MLE-Bench leaderboard results to estimate tractability. We divide tasks into Easy, Medium, and Hard tiers by aggregate prior-agent medal rate, then sample 2 Easy, 2 Medium, and 3 Hard competitions. The selected competitions span image, text, audio, and tabular prediction. Our selected tasks are detailed in Appendix B.
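A tiered sampling procedure of this kind could look like the sketch below. The paper does not give the tier boundaries, so splitting the ranked list into thirds is an assumption, as are the function name `sample_by_tier`, the task names, and the synthetic medal rates.

```python
import random


def sample_by_tier(medal_rates, quota, seed=0):
    """Split tasks into easy/medium/hard by prior medal rate, then sample a quota per tier.

    medal_rates: dict mapping task name to aggregate prior-agent medal rate.
    quota:       dict with the number of tasks to draw from each tier.
    """
    # Higher prior medal rate means the task is treated as easier.
    ranked = sorted(medal_rates, key=medal_rates.get, reverse=True)
    n = len(ranked)
    cut1, cut2 = n // 3, 2 * n // 3  # tertile cut points (an assumption)
    tiers = {"easy": ranked[:cut1], "medium": ranked[cut1:cut2], "hard": ranked[cut2:]}
    rng = random.Random(seed)
    return {tier: sorted(rng.sample(tiers[tier], quota[tier])) for tier in tiers}


# 22 synthetic tasks with evenly spaced medal rates, standing in for the Lite split.
synthetic_rates = {f"task{i:02d}": i / 21 for i in range(22)}
selection = sample_by_tier(synthetic_rates, {"easy": 2, "medium": 2, "hard": 3})
print(selection)
```

Fixing the random seed makes the selection reproducible, which matters when the subset itself is part of the reported experimental setup.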

We run EurekAgent once per competition and report the resulting medal rates. Following MLE-Bench's official 24-hour and single-GPU setting, we grant one GPU to each run and report EurekAgent hyperparameter settings in Appendix A. For baselines, we use the corresponding public MLE-Bench leaderboard results on the same tasks. When a baseline reports the aggregated scores of multiple runs, we report the upper end of its reported score range.

Table 4: Machine learning engineering results on our seven-task MLE-Bench Lite subset. For baselines with multiple runs, we report the upper end of the reported range.

As shown in Table 4, EurekAgent achieves the highest any-medal rate on the selected MLE-Bench subset, reaching 85.71% with a single run per task. It also attains the highest gold-medal rate among methods using non-commercial open models. All listed baselines use closed commercial models, while EurekAgent runs with the open-source LLM GLM-5.1, suggesting that environment-engineered autonomous iteration can be competitive even without relying on the strongest proprietary models.

5 Conclusion and Limitations
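Because the subset has seven tasks and each task is run once, the any-medal rate can only take values of the form $k/7$. The reported 85.71% therefore corresponds to medals on six of the seven competitions, and the 71.43% previous best in Table 1 corresponds to five of seven:

```latex
\[
\frac{6}{7} \approx 0.8571 = 85.71\%,
\qquad
\frac{5}{7} \approx 0.7143 = 71.43\%.
\]
```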

We presented EurekAgent, an environment-engineered system for autonomous scientific discovery on metric-driven research tasks. Rather than prescribing detailed research workflows, EurekAgent coordinates off-the-shelf CLI-agent sessions through a simple prepare-propose-implement loop, while shaping the surrounding environment for reliable evaluation, shared progress memory, resource boundaries, and human oversight. Using Claude Code as the CLI agent and GLM-5.1 as the base LLM, EurekAgent achieves new state-of-the-art results on all evaluated mathematics and kernel engineering tasks, and ranks first on our evaluated MLE-Bench Lite subset. These results suggest that, as general-purpose CLI agents become more capable, carefully engineered scientific-discovery environments can turn model capability into reliable scientific progress.

Looking forward, we view environment engineering as a central layer in the next generation of autonomous research systems. As agents become more capable, scientific progress will depend not only on model intelligence, but also on the environments that define reliable feedback, persistent memory, resource control, evaluator integrity, human oversight, and recoverable long-running operation. Our current experiments focus on metric-driven tasks with executable evaluators, but the same perspective becomes even more important as autonomous research moves toward broader and more open-ended scientific settings.

We open-source EurekAgent as an initial step in this direction and invite the community to build on, improve, and contribute to it. We will continue maintaining the project, extending it to richer research domains, and updating empirical results on its performance, capabilities, and boundaries. We hope EurekAgent can serve as a practical starting point for collective exploration of environment engineering as a foundation for reliable autonomous scientific discovery.

References

Claude Opus 4.7 system card. Note: https://www.anthropic.com/system-cards
M. Cemri, S. Agrawal, A. Gupta, S. Liu, A. Cheng, Q. Mang, A. Naren, L. E. Erdogan, K. Sen, M. Zaharia, et al. (2026) Adaevolve: adaptive llm driven zeroth-order optimization. arXiv preprint arXiv:2602.20133.
J. S. Chan, N. Chowdhury, O. Jaffe, J. Aung, D. Sherburn, E. Mays, G. Starace, K. Liu, L. Maksin, T. Patwardhan, et al. (2025) Mle-bench: evaluating machine learning agents on machine learning engineering. In International Conference on Learning Representations, Vol. 2025, pp. 50466–50494.
J. Chen, B. D. Mishra, J. Nam, R. Meng, T. Pfister, and J. Yoon (2026) MARS: modular agent with reflective search for automated ai research. arXiv preprint arXiv:2602.02660.
E. Friedman (2024) Erich’s Packing Center. Note: https://erich-friedman.github.io/packing/
J. J. Gibson (1979) The ecological approach to visual perception. Houghton Mifflin, Boston, MA.
J. K. Haugland (2016) The minimum overlap problem revisited. arXiv preprint arXiv:1609.08000.
Z. Jiang, D. Schmidt, D. Srikanth, D. Xu, I. Kaplan, D. Jacenko, and Y. Wu (2025) Aide: ai-driven exploration in the space of code. arXiv preprint arXiv:2502.13138.
A. Karpathy (2026) autoresearch: AI agents running research on single-GPU nanochat training automatically. Note: https://github.com/karpathy/autoresearch
N. Kokoromyti (2026) Anatomy of a reward hack: a real story from the latest GPU Mode NVFP4 competition. Note: https://www.gpumode.com/news/reward-hacking-nvfp4
R. T. Lange, Y. Imajuku, and E. Cetin (2025) Shinkaevolve: towards open-ended and sample-efficient program evolution. arXiv preprint arXiv:2509.19349.
A. Li, C. Wu, Z. Ge, Y. H. Chong, Z. Hou, L. Cao, C. Ju, J. Wu, H. Li, H. Zhang, S. Feng, M. Zhao, F. Qiu, R. Yang, M. Zhang, W. Zhu, Y. Sun, Q. Sun, S. Yan, D. Liu, D. Yin, and D. Shen (2025) The FM agent. arXiv:2510.26144.
J. Liu, S. Qiu, M. Li, B. Li, H. Ji, S. Han, X. Ye, P. Xia, Z. Dong, C. Zhang, et al. (2026a) AutoResearchClaw: self-reinforcing autonomous research with human-ai collaboration. arXiv preprint arXiv:2605.20025.
S. Liu, S. Agarwal, M. Maheswaran, M. Cemri, Z. Li, Q. Mang, A. Naren, E. Boneh, A. Cheng, M. Z. Pan, et al. (2026b) Evox: meta-evolution for automated discovery. arXiv preprint arXiv:2602.23413.
T. Liu, Y. Yang, X. Ye, and D. Chen (2026c) Can coding agents optimize algorithms autonomously?. Note: https://tengxiaoliu.github.io/autoevolver/
C. Lu, C. Lu, R. T. Lange, J. Foerster, J. Clune, and D. Ha (2024) The ai scientist: towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292.
Z. Luo, A. Kasirzadeh, and N. B. Shah (2025) The more you automate, the less you see: hidden pitfalls of ai scientist systems. arXiv preprint arXiv:2509.08713.
M. Matolcsi and C. Vinuesa (2010) Improved bounds on the supremum of autoconvolutions. Journal of mathematical analysis and applications 372(2), pp. 439–447.
J. Nam, J. Yoon, J. Chen, J. Shin, S. Arik, and T. Pfister (2026) Mle-star: machine learning engineering agent via search and targeted refinement. Advances in Neural Information Processing Systems 38, pp. 116692–116712.
A. Novikov, N. Vũ, M. Eisenberger, E. Dupont, P. Huang, A. Z. Wagner, S. Shirobokov, B. Kozlovskii, F. J. Ruiz, A. Mehrabian, et al. (2025) Alphaevolve: a coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131.
A. Ouyang, S. Guo, S. Arora, A. L. Zhang, W. Hu, C. Ré, and A. Mirhoseini (2025) Kernelbench: can llms write efficient gpu kernels?. arXiv preprint arXiv:2502.10517.
Y. Qi, H. Peng, X. Wang, A. Xin, Y. Liu, B. Xu, L. Hou, and J. Li (2025) Agentif: benchmarking instruction following of large language models in agentic scenarios. arXiv preprint arXiv:2505.16944.
A. Qu, H. Zheng, Z. Zhou, Y. Yan, Y. Tang, S. Y. Ong, F. Hong, K. Zhou, C. Jiang, M. Kong, et al. (2026) Coral: towards autonomous multi-agent evolution for open-ended discovery. arXiv preprint arXiv:2604.01658.
B. Romera-Paredes, M. Barekatain, A. Novikov, M. Balog, M. P. Kumar, E. Dupont, F. J. Ruiz, J. S. Ellenberg, P. Wang, O. Fawzi, et al. (2024) Mathematical discoveries from program search with large language models. Nature 625(7995), pp. 468–475.
A. Sharma (2025) OpenEvolve: an open-source evolutionary coding agent.
C. Wan, X. Dai, Z. Wang, M. Li, Y. Wang, Y. Mao, Y. Lan, and Z. Xiao (2025) Loongflow: directed evolutionary search via a cognitive plan-execute-summarize paradigm. arXiv preprint arXiv:2512.24077.
Y. Wang, S. Su, Z. Zeng, E. Xu, L. Ren, X. Yang, Z. Huang, X. He, L. Ma, B. Peng, et al. (2025) Thetaevolve: test-time learning on open problems. arXiv preprint arXiv:2511.23473.
W. Xu, S. Li, T. Ye, Q. Cao, Y. Chen, H. Gao, Y. Wang, Q. Li, K. Li, S. Xu, S. Chai, F. Yu, X. Zhao, Z. Zhao, W. Ma, Z. Guo, H. Zhou, H. Yin, L. Cheng, C. Hu, H. Li, L. Mi, X. Xie, Y. Zhou, R. Chen, Z. Zhou, X. Guo, Y. Zhou, X. He, S. Xu, X. Gu, J. Wu, M. Liu, C. Song, F. Ling, D. Zhou, S. Tang, Y. Li, M. Su, P. Ye, S. Sun, B. Wang, X. Yang, Z. Yin, T. Fu, G. Zhai, W. Ouyang, B. Zhang, L. Bai, and W. Zhang (2026) ResearchClawBench: a benchmark for end-to-end autonomous scientific research. arXiv:2606.07591.
X. Yang, X. Yang, S. Fang, B. Xian, Y. Li, J. Wang, M. Xu, H. Pan, X. Hong, W. Liu, et al. (2025) R&d-agent: automating data-driven ai solution building through llm-powered automated research, development, and evolution. arXiv e-prints, pp. arXiv–2505.
M. Yuksekgonul, D. Koceja, X. Li, F. Bianchi, J. McCaleb, X. Wang, J. Kautz, Y. Choi, J. Zou, C. Guestrin, et al. (2026) Learning to discover at test time. arXiv preprint arXiv:2601.16175.
R. Zhang, P. Qin, Q. Cao, L. Zhang, and P. Xie (2026) AIBuildAI: an ai agent for automatically building ai models. arXiv preprint arXiv:2604.14455.
X. Zhu, Y. Cai, Z. Liu, B. Zheng, C. Wang, R. Ye, J. Chen, H. Wang, W. Wang, Y. Zhang, et al. (2026) Toward ultra-long-horizon agentic science: cognitive accumulation for machine learning engineering. arXiv preprint arXiv:2601.10402.

Appendix A EurekAgent Hyperparameter Settings

Table 5 summarizes the EurekAgent hyperparameters used in our experiments. Here, R denotes the maximum number of propose–implement iteration rounds, and P denotes the maximum number of parallel implementation sessions spawned in each implement stage.

Table 5: EurekAgent hyperparameter settings used in our experiments.
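
As a rough illustration of how the two hyperparameters bound a run, the sketch below loops for at most R propose–implement rounds and launches at most P implementation sessions in parallel per round. All names (`LoopConfig`, `run_loop`, `toy_propose`, `toy_implement`) are hypothetical and not taken from the EurekAgent codebase; the toy proposal and scoring functions merely stand in for CLI-agent sessions and the evaluator, and the sketch assumes a higher score is better.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class LoopConfig:
    """Hypothetical container for the two hyperparameters described above."""
    max_rounds: int    # R: maximum number of propose-implement rounds
    max_parallel: int  # P: maximum parallel implementation sessions per round


def run_loop(
    config: LoopConfig,
    propose: Callable[[int, Optional[float]], List[str]],
    implement: Callable[[str], float],
) -> Tuple[Optional[str], Optional[float]]:
    """Run at most R rounds; each round implements at most P proposals in parallel.

    Assumes a higher score is better. Returns the best (proposal, score) seen.
    """
    best_idea: Optional[str] = None
    best_score: Optional[float] = None
    for round_idx in range(config.max_rounds):
        # Truncate to P so a round never spawns more than P sessions.
        ideas = propose(round_idx, best_score)[: config.max_parallel]
        if not ideas:
            break
        with ThreadPoolExecutor(max_workers=config.max_parallel) as pool:
            scores = list(pool.map(implement, ideas))
        for idea, score in zip(ideas, scores):
            if best_score is None or score > best_score:
                best_idea, best_score = idea, score
    return best_idea, best_score


# Toy stand-ins (assumptions): proposals are labels, the score is a fixed function.
def toy_propose(round_idx: int, best_score: Optional[float]) -> List[str]:
    return [f"r{round_idx}-idea{i}" for i in range(5)]


def toy_implement(idea: str) -> float:
    # "r2-idea3" -> 2 * 10 + 3 = 23.0
    round_part, idea_part = idea.split("-")
    return float(int(round_part[1:]) * 10 + int(idea_part[4:]))


if __name__ == "__main__":
    print(run_loop(LoopConfig(max_rounds=3, max_parallel=2), toy_propose, toy_implement))
```

With R = 3 and P = 2, the toy run evaluates six proposals in total (two per round), even though `toy_propose` offers five per round; raising P widens each round, while raising R lengthens the run.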

Appendix B Selected MLE-Bench Lite Competitions

We select seven MLE-Bench Lite competitions across three difficulty tiers, using aggregate medal rates of prior public leaderboard agents as a proxy for task difficulty:

• Easy (>40%): histopathologic-cancer-detection (57.0%) and plant-pathology-2020-fgvc7 (49.7%).
• Medium (15%–40%): aerial-cactus-identification (26.2%) and the-icml-2013-whale-redux (23.5%).
• Hard (<10%): jigsaw-toxic-comment (9.1%), dog-breed-identification (0.4%), and tabular-playground-may-2022 (0.4%).
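
The tier assignment can be checked mechanically. The sketch below encodes the thresholds and medal rates listed above and maps each competition to a tier. The function name `tier_for` is hypothetical, and treating the Medium range as inclusive at both ends is an assumption. Because the stated thresholds leave rates between 10% and 15% unassigned, the sketch returns `None` for that range rather than guessing.

```python
from typing import Dict, Optional

# Aggregate medal rates (%) of prior public leaderboard agents, as listed above.
MEDAL_RATES: Dict[str, float] = {
    "histopathologic-cancer-detection": 57.0,
    "plant-pathology-2020-fgvc7": 49.7,
    "aerial-cactus-identification": 26.2,
    "the-icml-2013-whale-redux": 23.5,
    "jigsaw-toxic-comment": 9.1,
    "dog-breed-identification": 0.4,
    "tabular-playground-may-2022": 0.4,
}


def tier_for(rate: float) -> Optional[str]:
    """Map an aggregate medal rate (%) to a difficulty tier."""
    if rate > 40.0:
        return "Easy"
    if 15.0 <= rate <= 40.0:  # inclusive bounds are an assumption
        return "Medium"
    if rate < 10.0:
        return "Hard"
    return None  # 10%-15% is not covered by the stated thresholds


TIERS = {name: tier_for(rate) for name, rate in MEDAL_RATES.items()}

if __name__ == "__main__":
    for name, tier in TIERS.items():
        print(f"{name}: {tier}")
```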