Remember, Don't Re-read: Stateful ReAct Agents for Token-Efficient Autonomous Experimentation

arXiv cs.LG Papers

Summary

This paper proposes replacing the stateless autoresearch pattern with a stateful ReAct agent using LangGraph, reducing per-iteration token costs from O(n) to O(1) and achieving 52-90% fewer tokens on hyperparameter tuning and code optimization benchmarks.

arXiv:2606.14945v1 Announce Type: new Abstract: The autoresearch pattern enables autonomous experimentation by having a large language model (LLM) iteratively modify code to optimize a target metric. Its stateless design, however, reconstructs experimental context from scratch at every iteration, incurring $O(n)$ token cost per iteration and $O(n^{2})$ total. This work reformulates the pattern as a stateful ReAct agent using LangGraph, where typed persistent state carries experimental history across iterations via a tool-calling interface. Two benchmarks are evaluated: hyperparameter tuning (15 iterations, small per-iteration observations) and code performance optimization (40 iterations, large per-iteration observations containing full source code and benchmark results). On hyperparameter tuning, the stateful agent consumes 90\% fewer tokens (2{,}492 vs.\ 24{,}465). On code optimization, the stateful agent consumes 52\% fewer tokens (627K vs.\ 1{,}275K) while achieving comparable optimization quality on both tasks. The token reduction is structural: the stateless agent re-reads the full history at $O(n)$ cost per iteration, while the stateful agent operates within a fixed-size conversation window at $O(1)$ cost. This paper describes the architecture in sufficient detail for practitioners to implement a stateful autoresearch agent for their own workflows.
Original Article
View Cached Full Text

Cached at: 06/16/26, 11:36 AM

# Remember, Don’t Re-read: Stateful ReAct Agents for Token-Efficient Autonomous Experimentation
Source: [https://arxiv.org/html/2606.14945](https://arxiv.org/html/2606.14945)
###### Abstract

The autoresearch pattern enables autonomous experimentation by having a large language model \(LLM\) iteratively modify code to optimize a target metric\. Its stateless design, however, reconstructs experimental context from scratch at every iteration, incurringO​\(n\)O\(n\)token cost per iteration andO​\(n2\)O\(n^\{2\}\)total\. This work reformulates the pattern as a stateful ReAct agent using LangGraph, where typed persistent state carries experimental history across iterations via a tool\-calling interface\. Two benchmarks are evaluated: hyperparameter tuning \(15 iterations, small per\-iteration observations\) and code performance optimization \(40 iterations, large per\-iteration observations containing full source code and benchmark results\)\. On hyperparameter tuning, the stateful agent consumes 90% fewer tokens \(2,492 vs\. 24,465\)\. On code optimization, the stateful agent consumes 52% fewer tokens \(627K vs\. 1,275K\) while achieving comparable optimization quality on both tasks\. The token reduction is structural: the stateless agent re\-reads the full history atO​\(n\)O\(n\)cost per iteration, while the stateful agent operates within a fixed\-size conversation window atO​\(1\)O\(1\)cost\. This paper describes the architecture in sufficient detail for practitioners to implement a stateful autoresearch agent for their own workflows\.

## 1Introduction

Machine learning research is fundamentally iterative: a researcher proposes a hypothesis, implements a code change, runs an experiment, analyzes results, and decides what to try next\. This cycle repeats dozens to hundreds of times before convergence\. While the computational cost of individual experiments has decreased, the human reasoning between experiments remains the primary bottleneck\.

Karpathy\[[1](https://arxiv.org/html/2606.14945#bib.bib1)\]showed that an LLM can close this loop autonomously\. The*autoresearch*pattern decomposes experimentation into three files: a frozen data pipeline \(prepare\.py\), a mutable training script \(train\.py\), and natural\-language research directives \(program\.md\)\. An LLM reads the directives, modifies the training script, executes the experiment, observes metrics, and repeats\. Karpathy in its initial demonstration, this system ran approximately 700 experiments over two days and discovered 20 independent training optimizations without human intervention\.

The elegance of this pattern lies in its simplicity: the LLM’s only interface with the ML system is code modification, and the only feedback signal is the target metric\. However, this simplicity introduces a structural inefficiency\. Each LLM invocation is*stateless*—the agent must reconstruct experimental context by re\-reading the full results history accumulated at program\.md file at every iteration\. As the number of experiments grows, the prompt grows linearly, consuming tokens to re\-transmit information the agent has already processed\. This is not a prompt engineering problem; Huang et al\.\[[10](https://arxiv.org/html/2606.14945#bib.bib10)\]show that LLMs cannot internally maintain transient state across interactions, making external state management an architectural necessity\.

In this work, the autoresearch pattern is reformulated as astateful ReAct agent\[[2](https://arxiv.org/html/2606.14945#bib.bib2)\]using LangGraph\. The agent interacts with the ML system through tool calls rather than monolithic prompts, and experimental history, strategic reasoning, and convergence tracking persist in a typed state graph across iterations\. The key structural advantage is that per\-iteration token cost drops fromO​\(n\)O\(n\)toO​\(1\)O\(1\), enabling arbitrarily long experiment sequences without prompt truncation or summarization\.

## 2Related Work

### 2\.1LLM Agents for ML Research

The autoresearch pattern belongs to a growing family of agentic AI systems that prioritize ease of implementation and flexibility over provably optimal search\. Unlike classical AutoML, which operates within predefined parameterized search spaces, LLM\-guided experimentation treats the search space itself as mutable—the agent can modify arbitrary code rather than selecting from a fixed set of configurations\.

Several systems explore this space\. MLAgentBench\[[6](https://arxiv.org/html/2606.14945#bib.bib6)\]provides a 13\-task benchmark for evaluating ML agents; the AI Scientist\[[4](https://arxiv.org/html/2606.14945#bib.bib4),[5](https://arxiv.org/html/2606.14945#bib.bib5)\]extends scope to full paper generation; AIDE\[[7](https://arxiv.org/html/2606.14945#bib.bib7)\]frames ML engineering as tree search over code variants; AgentHPO\[[8](https://arxiv.org/html/2606.14945#bib.bib8)\]applies LLM agents to hyperparameter optimization, matching human performance on 12 tasks; Agent Laboratory\[[9](https://arxiv.org/html/2606.14945#bib.bib9)\]demonstrates a multi\-agent pipeline at 84% lower cost\. The autoresearch pattern\[[1](https://arxiv.org/html/2606.14945#bib.bib1)\]takes the most minimal approach—a single LLM iteratively editing a training script—but its stateless design limits efficiency on longer experiment sequences\. The present work preserves this minimality while adding persistent state\.

### 2\.2Reasoning and Memory in LLM Agents

ReAct\[[2](https://arxiv.org/html/2606.14945#bib.bib2)\]interleaves reasoning traces with tool actions\. Reflexion\[[3](https://arxiv.org/html/2606.14945#bib.bib3)\]extends this with explicit self\-reflection and episodic memory\. Huang et al\.\[[10](https://arxiv.org/html/2606.14945#bib.bib10)\]show that LLMs cannot maintain transient state across interactions, motivating external memory\. MemGPT\[[11](https://arxiv.org/html/2606.14945#bib.bib11)\]addresses this with OS\-inspired memory hierarchies; Voyager\[[12](https://arxiv.org/html/2606.14945#bib.bib12)\]builds a skill library for embodied agents; CoALA\[[13](https://arxiv.org/html/2606.14945#bib.bib13)\]provides a taxonomy of agent memory types\. The present work instantiates these principles for iterative experimentation: the graph state serves as working memory, the experiment history as episodic memory, and the domain constraints as semantic memory, managed through typed state transitions rather than LLM\-driven paging\.

## 3Method

### 3\.1Architecture

The agent is implemented as a LangGraph state graph with three node types \(Figure[1](https://arxiv.org/html/2606.14945#S3.F1)\)\.

![Refer to caption](https://arxiv.org/html/2606.14945v1/x1.png)Figure 1:ReAct agent state graph\.Reasoninvokes the LLM; if it produces tool calls, control passes toToolsand returns\. Otherwise,Checkevaluates convergence\. If not converged, control loops back toReason\.REASONinvokes the LLM \(Claude Haiku 4\.5\) with the current state context\. The model generates reasoning traces and tool calls following the ReAct pattern\[[2](https://arxiv.org/html/2606.14945#bib.bib2)\]\.TOOLSexecutes tool calls deterministically and returns observations\.CHECKevaluates convergence criteria without LLM involvement\. Conditional edges route based on whether the LLM produced tool calls \(toTools\) or a terminal response \(toCheck\)\.

The graph is constructed usinglanggraph\.graph\.StateGraphwith conditional edges:

graph=StateGraph\(AgentState\)

graph\.add\_node\("reason",reason\_node\)

graph\.add\_node\("tools",ToolNode\(TOOLS\)\)

graph\.add\_node\("check",check\_node\)

graph\.set\_entry\_point\("reason"\)

graph\.add\_conditional\_edges\(

"reason",

lambdas:"tools"ifhas\_tool\_calls\(s\)

else"check",

\)

graph\.add\_edge\("tools","reason"\)

graph\.add\_conditional\_edges\(

"check",

lambdas:"end"ifs\["status"\]\!="running"

else"reason",

\)

### 3\.2State Schema

The agent state is defined as a typed dictionary with six fields:

classAgentState\(TypedDict\):

messages:list

experiment\_history:list

current\_best:dict

iteration:int

train\_py\_content:str

status:str

This state persists across all iterations\. Unlike the stateless design, where the LLM must reconstruct context from the results table at each step, the stateful agent carries forward its experimental history, current best result, and reasoning trace\. Themessagesfield is bounded by a sliding window of 20 messages, ensuring that per\-iteration input cost does not grow with experiment count\.

AnExperimentStatehelper class manages the history and provides a compact summary to the LLM at each step:

classExperimentState:

def\_\_init\_\_\(self\):

self\.history:list\[dict\]=\[\]

self\.best\_f1:float=0\.0

self\.best\_params:dict=\{\}

self\.best\_iteration:int=\-1

self\.iteration:int=0

self\.tried\_configs:set=set\(\)

self\.strategy\_notes:list\[str\]=\[\]

defsummary\(self\)\-\>str:

"""CompactstatesummaryfortheLLM\."""

lines=\[

f"Iteration:\{self\.iteration\}/\{MAX\}",

f"BestF1:\{self\.best\_f1\}",

\]

recent=self\.history\[\-5:\]

forhinrecent:

lines\.append\(

f"Iter\{h\[’iteration’\]\}:"

f"F1=\{h\[’f1’\]\},"

f"params=\{json\.dumps\(h\[’params’\]\)\}"

\)

return"\\n"\.join\(lines\)

Thesummary\(\)method is what makes theO​\(1\)O\(1\)cost possible: regardless of how many experiments have been run, the LLM receives only the most recent 5, plus aggregate statistics\. The full history remains accessible in state but does not enter the prompt\.

### 3\.3Tools and Guardrails

The agent accesses the ML system through two tools:get\_experiment\_history\(query past results\) andrun\_experiment\(train with specified hyperparameters, return metrics\)\. Following the autoresearch principle that the data pipeline must be immutable, the agent can modify only hyperparameters, not data loading or evaluation code\.

In the production variant \(see Section[5](https://arxiv.org/html/2606.14945#S5)\), additional tools are available:get\_current\_train\_pyto read the current training notebook,modify\_train\_pyto upload a modified version with guardrails \(must preserve evaluation and logging calls\),submit\_training\_jobto trigger a Databricks job and poll until completion, andquery\_resultsto run read\-only SQL against the results table\.

### 3\.4Convergence

The agent terminates when any of three conditions holds: \(1\) the target metric crosses a user\-defined threshold, \(2\) the iteration budget is exhausted, or \(3\) the LLM concludes that further improvement is unlikely\.

## 4Experiments

Two benchmark tasks are used to evaluate the stateful architecture across different observation sizes per iteration\. Both use Claude Haiku 4\.5 via the Databricks model serving endpoint \(databricks\-claude\-haiku\-4\-5\), 3 random seeds \(42, 123, 456\), and the same stateless\-vs\-stateful comparison structure\. Token counts are estimated via tiktoken \(cl100k\_base\)\. All experiments ran on Databricks serverless compute\.

### 4\.1Task 1: Hyperparameter Tuning

The first task optimizes XGBoost hyperparameters on the UCI Covertype dataset\[[14](https://arxiv.org/html/2606.14945#bib.bib14)\]from the University of California, Irvine Machine Learning Repository—a 7\-class forest cover classification task with 54 features, subsampled to 20,000 instances\. Each iteration produces a small observation: a JSON hyperparameter configuration \(∼\{\\sim\}200 tokens\) and three scalar metrics \(F1, precision, recall\)\. The budget is 15 iterations\.

This task represents thesmall\-observation regime: the stateless agent’s per\-iteration prompt grows by∼\{\\sim\}200 tokens per experiment, so the total cost grows as∑i=1n200​i=O​\(n2\)\\sum\_\{i=1\}^\{n\}200i=O\(n^\{2\}\)but with a small constant\.

#### Stateless runner\.

At each iteration, the runner constructs a fresh prompt containing the*complete*experiment history—every iteration’s parameters and metrics—plus the current best score\. This prompt is sent as a single message to the LLM; the response is parsed for a JSON hyperparameter configuration\.

#### Stateful runner\.

The stateful runner maintains anExperimentStateobject and a persistent conversation across all iterations\. At each step, the LLM receives only a compact summary \(current iteration, best score, last 5 experiments, strategy notes\) rather than the full history\. The message window is trimmed to the most recent 20 messages\.

Table 1:Task 1: Hyperparameter tuning results \(15 iterations, mean over 3 seeds\)\. The F1 difference is within noise; the token reduction is the primary finding\.![Refer to caption](https://arxiv.org/html/2606.14945v1/figure2.png)Figure 2:Task 1: Hyperparameter tuning on UCI Covertype \(15 iterations, 3 seeds\)\. \(a\) Per\-iteration input token cost: stateless grows linearly; stateful remains constant atO​\(1\)O\(1\)\. \(b\) Cumulative best macro F1 over iterations\. \(c\) Total token consumption: 9\.8×\\timesreduction\. \(d\) Wall\-clock time comparison\.In the small\-observation regime, the stateful agent achieves a9\.8×\\timestoken reduction\(Table[1](https://arxiv.org/html/2606.14945#S4.T1)\)\. Both agents reach comparable F1 scores; the observed difference \(0\.779 vs\. 0\.764\) is within seed variance and is not a claimed contribution\. The per\-iteration input token cost for the stateless agent grows linearly from∼\{\\sim\}700 at iteration 1 to∼\{\\sim\}1,900 at iteration 15 \(Figure[2](https://arxiv.org/html/2606.14945#S4.F2)a\), while the stateful agent remains flat at∼\{\\sim\}100 tokens per iteration\.

### 4\.2Task 2: Code Performance Optimization

The second task optimizes a deliberately inefficient Python data\-processing function \(100 lines, processing 10,000 employee records\) for execution speed\. The function contains bubble sorts, string concatenation in loops, manual dictionary accumulation, and other anti\-patterns that an LLM can systematically replace with Pythonic constructs \(collections\.Counter,sorted\(\), list comprehensions, f\-strings\)\. Each iteration produces a*large*observation: the complete modified source code \(∼\{\\sim\}1,500–3,000 tokens\) plus benchmark results \(median execution time, speedup, correctness check\)\. The budget is 40 iterations\.

This task represents thelarge\-observation regime: the stateless agent’s per\-iteration prompt grows by∼\{\\sim\}2,000–4,000 tokens per experiment\. Over 40 iterations, the cumulative history sent at iteration 40 exceeds 80,000 tokens—approaching context window limits for smaller models\.

#### Stateless runner\.

At each iteration, the runner constructs a fresh prompt containing the*complete*optimization history: every previous code version, its benchmark timing, speedup factor, and correctness status\. The prompt follows the structure:

Fulloptimizationhistory:

===Iteration0\(time=8\.14ms,speedup=1\.00x\)===

Code:

defprocess\_records\(records\):

filtered=\[\]

forrinrecords:

ifr\["active"\]==Trueandr\["years"\]\>=3:

\.\.\.

Benchmark:

Executiontime:8\.140ms\(medianof20runs\)

Baselinetime:8\.140ms

Speedup:1\.00x

===Iteration1\(time=4\.55ms,speedup=1\.79x\)===

Code:

defprocess\_records\(records\):

filtered=\[rforrinrecords

ifr\["active"\]andr\["years"\]\>=3\]

\.\.\.

Benchmark:

Executiontime:4\.550ms

Speedup:1\.79x

\.\.\.

\(all39previousiterations\)

Proposethenextoptimizedversion\.

By iteration 40, this history contains∼\{\\sim\}40 full code listings and exceeds 80,000 tokens of input per call\. The response is parsed for aCODE:block containing the modified function, which is executed in a sandboxedexec\(\)call\. The function must return the correct output structure \(verified against required keys\) before its timing is recorded\.

#### Stateful runner\.

The stateful runner maintains aCodeStateobject and a persistent conversation \(messageslist\) across all iterations\. TheCodeStatetracks:

classCodeState:

def\_\_init\_\_\(self\):

self\.history=\[\]

self\.best\_time\_ms=float\("inf"\)

self\.best\_code=""

self\.best\_iteration=\-1

self\.iteration=0

self\.strategy\_notes=\[\]

self\.failed\_approaches=\[\]

At each step, the LLM receives a compact summary plus only the current best code and the latest benchmark result—not the full history:

Iteration:25/40

Baseline:8\.140ms

Best:4\.275ms\(1\.90x,iter6\)

Recent:

Iter21:4\.516ms\(1\.80x\)\[OK\]

Iter22:4\.290ms\(1\.90x\)\[OK\]\*BEST\*

Iter23:4\.713ms\(1\.73x\)\[OK\]

Iter24:4\.633ms\(1\.76x\)\[OK\]

Iter25:4\.688ms\(1\.74x\)\[OK\]

Strategies:useCounterfortags;replace

bubblesortwithsorted\(\);joinforstrings

Failed:iter7\(missingkeysinoutput\)

Currentbestcode:

defprocess\_records\(records\):

\.\.\.

Proposeanoptimizedversion\.

The LLM responds with both aSTRATEGY:line \(a one\-sentence hypothesis about the proposed optimization\) and aCODE:block\. The strategy note is stored instate\.strategy\_notes; if the code fails correctness checks, the approach is recorded instate\.failed\_approachesand the agent reverts to the best known code\. After benchmarking, the result is fed back as a short message \(e\.g\., “Result: 4\.290ms \(1\.90x\)\. NEW BEST\!”\), and the conversation continues\. The message window is trimmed to the most recent 20 messages to prevent context overflow\.

Table 2:Task 2: Code optimization results \(40 iterations, mean over 3 seeds\)\. Both conditions achieve comparable speedup; the stateful agent uses half the tokens\.![Refer to caption](https://arxiv.org/html/2606.14945v1/figure3.png)Figure 3:Task 2: Code performance optimization \(40 iterations, 3 seeds\)\. \(a\) Per\-iteration input token cost: the stateless prompt grows linearly as code listings accumulate, while the stateful agent plateaus once the 20\-message window saturates \(carrying only the current best code and recent results\)\. \(b\) Best speedup over the unoptimized baseline; both conditions converge to∼\{\\sim\}1\.9×\\times\. \(c\) Total token consumption: 2\.0×\\timesreduction \(1,275,309 vs\. 626,702\)\. \(d\) Wall\-clock time comparison\.In the large\-observation regime, the stateful agent achieves a2\.0×\\timestoken reduction\(Table[2](https://arxiv.org/html/2606.14945#S4.T2); Figure[3](https://arxiv.org/html/2606.14945#S4.F3)\)\. The ratio is lower than the 9\.8×\\timesseen in Task 1 because the stateful agent must still send the current best code \(∼\{\\sim\}1,500 tokens\) at every iteration—a fixed cost that does not shrink with state management\. As shown in Figure[3](https://arxiv.org/html/2606.14945#S4.F3)a, the stateful agent’s per\-iteration input cost rises during early iterations and then plateaus once the message window saturates, whereas the stateless cost continues to climb linearly through iteration 40\. Both agents achieve comparable optimization quality, reaching∼\{\\sim\}1\.9×\\timesspeedup over the deliberately unoptimized baseline by replacing bubble sorts withsorted\(\), string concatenation withstr\.join\(\), and manual loops withcollections\.defaultdictand list comprehensions\.

### 4\.3Comparing the Two Regimes

Table 3:Token reduction ratio as a function of observation size per iteration\. The stateful advantage is largest when per\-iteration observations are small relative to the state summary\.The two tasks illustrate how the token savings depend on the ratio of observation size to state summary size \(Table[3](https://arxiv.org/html/2606.14945#S4.T3)\)\. When observations are small \(a few hundred tokens of hyperparameters and metrics\), the stateful agent’s fixed\-size summary is dramatically smaller than the accumulated history, yielding a 9\.8×\\timesreduction in just 15 iterations\. When observations are large \(full source code at∼\{\\sim\}3,000 tokens each\), the stateful agent still sends the current code every iteration, so the fixed cost is higher and the ratio is 2\.0×\\timesat 40 iterations\. In both cases, the stateless cost grows asO​\(n2\)O\(n^\{2\}\)total while the stateful cost grows asO​\(n\)O\(n\), so the ratio continues to widen with additional iterations\.

## 5Production Agent Architecture

The benchmark runners demonstrate the core stateless\-vs\-stateful distinction, but the full production agent \(used for internal ML research\) extends the pattern with richer tooling and integration with Databricks\. This section describes the production architecture as a reference for practitioners\.

### 5\.1Tools

The production agent exposes five tools to the LLM via LangGraph’sToolNode:

1. 1\.get\_experiment\_history\(\): Queries a Delta table for the 20 most recent experiment results, returning metrics, parameters, and timestamps via the Databricks SQL Statement API\.
2. 2\.get\_current\_train\_py\(\): Exports the current training notebook from the Databricks workspace as source code, so the agent can inspect what it is about to modify\.
3. 3\.modify\_train\_py\(new\_content, experiment\_name, description\): Uploads a modified training notebook with guardrails—the tool rejects submissions that remove theevaluate\_model\(\),log\_result\(\), orsave\_model\_to\_mlflow\(\)calls, preventing the agent from breaking the evaluation pipeline\.
4. 4\.submit\_training\_job\(\): Triggers a Databricks job via the Jobs API and polls until completion \(60\-second intervals, 1\-hour timeout\)\. On success, fetches and returns the latest result row from the results table\.
5. 5\.query\_results\(sql\): Executes read\-only SQL against Databricks tables, allowing the agent to compare experiments, inspect data distributions, or check for overfitting between validation and test splits\.

### 5\.2System Prompt

The system prompt defines the agent’s domain context: the target metric, what can and cannot be changed in the training code, and the rules of engagement \(one experiment per iteration, unique experiment names, mandatory evaluation calls\)\. This prompt is fixed across all iterations and does not grow with history\.

### 5\.3Graph Nodes

Thereason\_nodebinds the tools to the LLM viaChatAnthropic\.bind\_tools\(TOOLS\)and invokes the model with the full system prompt plus the accumulated message history\. Thecheck\_nodeincrements the iteration counter, updatescurrent\_bestif the latest experiment improved the target metric, and setsstatusto"converged"or"max\_iterations"as appropriate\. Theend\_nodeappends a summary message to the conversation\.

## 6Analysis

### 6\.1Scaling Properties

The token reduction is not merely a cost optimization\. At Haiku pricing \($0\.80/MTok input, $4/MTok output\), the per\-run cost difference on Task 1 is modest\. However, Task 2 demonstrates that the gap becomes material with larger observations and longer sequences: 1\.25M tokens for stateless versus 627K for stateful represents a meaningful cost difference at scale—hundreds of experiments per day across multiple projects\.

More importantly, theO​\(n2\)O\(n^\{2\}\)total cost of stateless execution versusO​\(n\)O\(n\)for the stateful agent means the ratio continues to widen with experiment count\. Extrapolating from Task 2: at 100 iterations with 3,000\-token observations, the stateless agent would consume∼\{\\sim\}6M input tokens versus∼\{\\sim\}1\.5M for stateful—a 4×\\timesratio\. At 200 iterations, the ratio would exceed 7×\\times\. TheO​\(1\)O\(1\)per\-iteration property also means the stateful agent can run arbitrarily long experiment sequences without prompt truncation or summarization, preserving decision\-making capacity regardless of history length\.

Table 4:Comparison of architectural properties between the stateless autoresearch baseline and the stateful LangGraph agent\.
### 6\.2Limitations

The evaluation has several limitations\. First, two tasks are evaluated \(hyperparameter tuning and code optimization\); a more comprehensive study would include additional domains such as prompt optimization, scientific formula discovery, and hardware design generation, where the autoresearch pattern has also been applied\[[1](https://arxiv.org/html/2606.14945#bib.bib1)\]\. Second, the wall\-clock overhead of the ReAct pattern \(3\.2×\\timeson Task 1\) may be prohibitive for tasks with very fast training loops, though the multi\-turn conversation overhead is amortized over longer experiments\. Third, a single LLM \(Haiku\) is used; larger models with stronger in\-context learning may partially compensate for the lack of persistent state\. Finally, 3 seeds provide limited statistical power; the optimization quality differences between conditions are within noise and should not be interpreted as quality advantages\.

## 7Conclusion

This work shows that reformulating the autoresearch pattern\[[1](https://arxiv.org/html/2606.14945#bib.bib1)\]as a stateful ReAct agent yields a structural efficiency gain:O​\(1\)O\(1\)per\-iteration token cost versusO​\(n\)O\(n\)\. On two benchmark tasks spanning different observation sizes, the stateful agent reduces total token consumption by 9\.8×\\times\(small observations, 15 iterations\) and 2\.0×\\times\(large observations, 40 iterations\) while achieving comparable optimization quality\. The advantage grows with experiment sequence length and is independent of the underlying task\. Future work should extend to longer experiment horizons and additional domains where the quadratic token cost of stateless execution becomes the dominant bottleneck\.

## References

- \[1\]A\. Karpathy\.autoresearch\.GitHub, March 2026\.[https://github\.com/karpathy/autoresearch](https://github.com/karpathy/autoresearch)\.
- \[2\]S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao\.ReAct: Synergizing Reasoning and Acting in Language Models\.In*ICLR*, 2023\.
- \[3\]N\. Shinn, F\. Cassano, A\. Gopinath, K\. Narasimhan, and S\. Yao\.Reflexion: Language Agents with Verbal Reinforcement Learning\.In*NeurIPS*, 2023\.
- \[4\]C\. Lu, C\. Lu, R\. T\. Lange, J\. Foerster, J\. Clune, and D\. Ha\.The AI Scientist: Towards Fully Automated Open\-Ended Scientific Discovery\.arXiv:2408\.06292, 2024\.
- \[5\]C\. Lu et al\.The AI Scientist\-v2: Workshop\-Level Automated Scientific Discovery via Agentic Tree Search\.*ICLR Workshop*, 2025\.
- \[6\]Q\. Huang, J\. Vora, P\. Liang, and J\. Leskovec\.MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation\.In*ICML*, 2024\.
- \[7\]Z\. Jiang, Y\. Song, J\. A\. Kroger, and J\. Li\.AIDE: AI\-Driven Exploration in the Space of Code\.arXiv:2502\.13138, 2025\.
- \[8\]S\. Liu, C\. Gao, and Y\. Li\.Large Language Model Agent for Hyper\-Parameter Optimization\.arXiv:2402\.01881, 2024\.
- \[9\]S\. Schmidgall, Y\. Zhuang, J\. Liu, and A\. Barbu\.Agent Laboratory: Using LLM Agents as Research Assistants\.In*EMNLP Findings*, 2025\.
- \[10\]J\. Huang et al\.On the Failure of Latent State Persistence in Large Language Models\.arXiv:2505\.10571, 2025\.
- \[11\]C\. Packer, S\. Wooders, K\. Lin, V\. Fang, S\. G\. Patil, I\. Stoica, and J\. E\. Gonzalez\.MemGPT: Towards LLMs as Operating Systems\.arXiv:2310\.08560, 2023\.
- \[12\]G\. Wang, Y\. Xie, Y\. Jiang, A\. Mandlekar, C\. Xiao, Y\. Zhu, L\. Fan, and A\. Anandkumar\.Voyager: An Open\-Ended Embodied Agent with Large Language Models\.arXiv:2305\.16291, 2023\.
- \[13\]T\. R\. Sumers, S\. Yao, K\. Narasimhan, and T\. L\. Griffiths\.Cognitive Architectures for Language Agents\.*TMLR*, 2024\.
- \[14\]J\. A\. Blackard and D\. J\. Dean\.Comparative Accuracies of Artificial Neural Networks and Discriminant Analysis in Predicting Forest Cover Types from Cartographic Variables\.*Computers and Electronics in Agriculture*, 24\(3\):131–151, 1999\.

Similar Articles