Hierarchical Experimentalist Agents

arXiv cs.AI 06/30/26, 04:00 AM Papers
hierarchical experimentalist agents llm active-experimentation skill-learning continual-learning
Summary
Introduces Hierarchical Experimentalist Agents (HExA), an in-context, experiment-centric self-improvement framework that enables LLM agents to design experiments, learn reusable skills, and answer queries in novel domains, achieving significant improvements over baselines on the Interphyre physics simulation benchmark.
arXiv:2606.29315v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used to take actions in the real world and support human decision-making, yet most agents rely on parametric knowledge, fixed post-training data, retrieval, or search. This paradigm breaks down in novel domains and for sophisticated queries that cannot be answered from prior knowledge alone. Knowing the laws of physics, for instance, does not by itself enable LLMs to answer queries or complete long-horizon tasks in a complex physical system. To address this, we introduce Hierarchical Experimentalist Agents (HExA), an in-context self-improvement framework to learn from active experimentation. HExA iteratively designs and refines query-relevant experiments, learns a reusable library of composable skills from experience, and integrates experimental evidence to answer queries or take actions. HExA is training-free, compatible with any black-box model, and does not require external supervision, oracles, or offline data. To evaluate active experimentation, we introduce Interphyre, a tool-calling benchmark built on the PHYRE 2D procedural physics environment, where agents propose interventions and test hypotheses through simulation APIs. Experiments show that current LLM agents struggle in these settings, especially on the hardest levels of Interphyre. Claude Sonnet 4.6 achieves only 2% success, while HExA improves the same model to up to 77% success. HExA also improves open-weight models and outperforms agentic baselines such as ReAct and Reflexion. Moreover, using only skills learned from easier levels and transferred without active experimentation, HExA achieves 44% success, demonstrating the reusability and generalization of its learned skills. Overall, HExA shows that learning through active experimentation can help agents discover useful knowledge, acquire reusable skills, and make efficient progress on novel long-horizon tasks.
# 1 Introduction
Source: [https://arxiv.org/html/2606.29315](https://arxiv.org/html/2606.29315)
Hierarchical Experimentalist Agents

Abhranil Chandra, Sankaran Vaidyanathan, Utsav Dhanuka, Varun Gandhi, Scott Niekum

University of Massachusetts Amherst

[HExA Website](https://general-exp-3-continual-learning-agent.github.io/HeXA/) | [HExA Code](https://github.com/General-Exp-3-Continual-Learning-Agent/HeXA-Hierarchical-Experimentalist-Agents) | [Interphyre Website](https://sankaranv.com/interphyre/) | [Interphyre Code](https://github.com/sankaranv/interphyre)

Abstract

Large language models (LLMs) are increasingly used to take actions in the real world and augment human decision-making. However, most systems rely on parametric knowledge acquired by imitation or are augmented via post-training with fixed data, retrieval, or search. This paradigm breaks down in novel domains and on sophisticated queries that cannot be answered from prior knowledge alone. Knowing the laws of physics, for instance, does not by itself enable LLMs to answer queries or complete long-horizon tasks in a complex physical system. To solve such novel problems generally, agents should have the fundamental ability of active experimentation: exploring and gathering targeted query-specific data or general principles about the unseen environment, and acquiring new reusable skills by learning from these diverse interactions and experiences. We thus introduce Hierarchical Experimentalist Agents (HExA), an in-context, experiment-centric self-improvement framework that (1) iteratively designs and refines query-relevant experiments; (2) incrementally learns from experience a library of reusable and composable skills that accelerate experimentation within and across tasks; and (3) integrates the experimental data to effectively take actions or answer queries. HExA is entirely in-context and training-free, compatible with any black-box model, and does not rely on external supervision, oracles, or offline data.
To evaluate agents on active experimentation tasks, we introduce Interphyre, which builds on the PHYRE 2D procedural physics simulation environment with tool-calling and intervention APIs that agents can use to propose and test hypotheses. Our experiments show that current LLM agents still struggle in these settings. On the hardest levels of Interphyre, Claude Sonnet 4.6 achieves only 2% success, while HExA improves the same model to a success rate of up to 77%. We also show similar improvements across all experiments with open-weight models and over other agentic baselines such as ReAct and Reflexion. Using only skills learned on easier levels and transferred without any active experimentation, the HExA agent still achieves 44% success, demonstrating the reusability and generalization of HExA's learned skills. HExA demonstrates that learning via experimentation enables agents to make progress on novel tasks effectively and efficiently, by helping them discover new knowledge and acquire reusable skills.

Correspondence: {abhranilchan,sniekum}@cs.umass.edu

Large Language Models \(LLMs\) and LLM Agents have demonstrated remarkable capabilities across a wide range of knowledge work and agentic tasks, from code generation to mathematical reasoning to multi\-step planning\(Gao et al\.,[2025](https://arxiv.org/html/2606.29315#bib.bib18); Jimenez et al\.,[2024](https://arxiv.org/html/2606.29315#bib.bib25); Yao et al\.,[2023b](https://arxiv.org/html/2606.29315#bib.bib68); Schick et al\.,[2023](https://arxiv.org/html/2606.29315#bib.bib45)\)\. The dominant paradigm for training and deploying these models assumes that parametric knowledge—the information encoded during imitative pretraining over massive collections of human\-generated data, augmented with post\-training, and further scaled with additional test time compute and retrieval augmented generation—is sufficient to address any query at inference time\(Anthropic,[2026](https://arxiv.org/html/2606.29315#bib.bib6); OpenAI,[2025](https://arxiv.org/html/2606.29315#bib.bib40); Team et al\.,[2024](https://arxiv.org/html/2606.29315#bib.bib54); Team,[2026](https://arxiv.org/html/2606.29315#bib.bib56); Gao et al\.,[2024](https://arxiv.org/html/2606.29315#bib.bib19)\)\. This assumption holds reasonably well in domains densely represented in training corpora and domains where reasoning with prior knowledge or via in\-context learning over zero\-shot or few\-shot user prompts is sufficient\(Dong et al\.,[2024](https://arxiv.org/html/2606.29315#bib.bib15); Mei et al\.,[2025](https://arxiv.org/html/2606.29315#bib.bib38); Dou et al\.,[2026](https://arxiv.org/html/2606.29315#bib.bib16)\)\. However, this recipe breaks down in two ways\. First, an agent may face a novel environment whose dynamics, constraints, or solution strategies were never seen during training, or are poorly understood and unknown even to humans and thus cannot be recalled from any training corpus\. 
Second, even when the relevant general knowledge is encoded in the model’s parameters, applying it to a specific instance can demand far more than recall and reasoning: knowing the laws of physics, for example, does not by itself solve an experimental physics problem\(Silver & Sutton,[2025](https://arxiv.org/html/2606.29315#bib.bib50); Ying et al\.,[2025](https://arxiv.org/html/2606.29315#bib.bib69)\)\.

In real-world settings, given the vast space of novel tasks and queries, an agent's success likely cannot rely on pretrained parametric knowledge alone. Instead, we argue that agents should, in principle, learn like scientists through active experimentation: exploring, gathering information, proposing falsifiable hypotheses, testing them through interaction and experiments, and reasoning over what the environment confirms or refutes (Spelke & Kinzler, [2007](https://arxiv.org/html/2606.29315#bib.bib51); Team et al., [2023](https://arxiv.org/html/2606.29315#bib.bib53), [2021](https://arxiv.org/html/2606.29315#bib.bib55)). This is precisely the information that parametric recall, chains of thought, and static retrieval struggle to supply. Experimentation, however, is costly: each instance may take many such cycles to solve, and an agent typically faces a whole family of related instances rather than one in isolation. An agent that re-explores from scratch every time fails to take advantage of this structure. This motivates not just experimentation but also learning from it: distilling reusable skills from past experiments and transferring them to related tasks to amortize the cost of experimentation. These observations lead to the central question of our work:

*How can LLM agents efficiently reason and act in complex, novel domains via in\-context experimentation, skill learning, and transfer?*

![Refer to caption](https://arxiv.org/html/2606.29315v1/figures/HExA.png)

Figure 1: Overview of the HExA framework on Interphyre physics puzzles. (a) Baseline ReAct: each seed is solved independently with no cross-seed learning. Each seed consists of alternating `Get-State-Info` calls (scene layout, gap analysis) and `Simulate-Action` calls (place red ball at $(x,y,r)$), terminating in success (✓) or failure ($\times$). (b) HExA: seeds are grouped into a *meta-episode*. After each seed, an evolver agent distils the trajectory into a *Skill Bank* (physics principles + common mistakes); the evolved bank is injected into subsequent seeds' system prompts. HExA is run for 25 iterations over 50 seeds on each source level to produce a fully-evolved level-specific Skill Bank. (c) Cross-level transfer to unseen levels: each source level (`down_to_earth`, `two_body_problem`, `pass_the_parcel`) runs the HExA loop from (b) independently (arrows indicate this reuse) to build its Skill Bank. The resulting banks are then synthesised by the *evolver* into a single cross-level Skill Bank for a harder target level (`catapult`), with *no target-level trajectories required*.

We thus propose Hierarchical Experimentalist Agents (HExA) (Figure [1](https://arxiv.org/html/2606.29315#S1.F1)), a novel in-context reinforcement learning framework that extends in-context learning along three directions: (1) the agent actively interacts with a simulator as a tool, proposing and testing hypotheses and gathering query-relevant information through deliberate experimentation rather than passive observation; (2) the agent reflects on batches of successful and failed experimental observations to distill reusable skills (reward-labeled strategies with feedback on when to use them, what they help with, and when to avoid them) into a persistent external skill bank; (3) skills from the bank are retrieved and injected into the agent's context at the start of each new episode to improve experimentation efficiency, test new hypotheses, and guide further exploration. This loop amortizes the cost of exploration across problem instances. Our approach can be viewed as hierarchical because HExA uses high-level skills that abstract over many low-level skills and simulator actions (specific placements and parameter settings). Since each skill is distilled while the agent already has access to previously learned ones, later skills implicitly build on earlier ones. Beyond skills for individual instances, HExA also constructs higher-order skills that capture how to learn from interaction itself, supporting consistent progress across new seeds and configurations and generalization to structurally novel, harder levels without any target-level experimentation.

Recent benchmarks such as CL-Bench (Dou et al., [2026](https://arxiv.org/html/2606.29315#bib.bib16)) evaluate, to some extent, the in-context learning ability of frontier LLMs, but rely only on static factual analysis of data unseen during pretraining. By contrast, we formalize and instantiate HExA for a broader setting: any task where solving an instance requires active experimentation to gather targeted, instance-specific evidence, rather than recall and reasoning alone. This covers two cases: tasks where the model already holds the relevant general knowledge but must experiment to apply it precisely, and genuinely unfamiliar domains where the model must learn from scratch through interaction in unseen open-ended environments (Team et al., [2021](https://arxiv.org/html/2606.29315#bib.bib55); Hughes et al., [2024](https://arxiv.org/html/2606.29315#bib.bib23)). Physical reasoning is a natural testbed, since models often have the relevant principles yet cannot apply them at the required precision without experimentation and interaction. To study this setting concretely, we introduce Interphyre (Section [4.1](https://arxiv.org/html/2606.29315#S4.SS1) and Appendix [G](https://arxiv.org/html/2606.29315#A7)), an extensible 2D physics reasoning benchmark built on top of PHYRE (Bakhtin et al., [2019](https://arxiv.org/html/2606.29315#bib.bib8)), which adds support for programmatic tool-call and intervention APIs: scene inspection, partial simulation, contact logging, and counterfactual snapshot/restore. This lets us both evaluate agents' capabilities and teach them to ask query-relevant questions, hypothesize and experiment in unseen domains, and consolidate reusable skills to solve novel tasks efficiently, all in-context.

In summary, our main contributions are:

1. We formalize an experiment-centric in-context learning problem setting to evaluate and improve generalist LLM agents, in which solving a task requires active experimentation with an environment to gather query-specific evidence, either to complement parametric knowledge or to tackle entirely novel, unseen domains. We also introduce Interphyre, a benchmark to better evaluate and train agents for this problem setting.
2. We propose Hierarchical Experimentalist Agents (HExA), a training-free, in-context RL framework in which an LLM jointly designs experiments, distills hierarchical and composable skills from interaction experience, and integrates experimental evidence to solve tasks, all without parameter updates or external supervision.
3. We demonstrate that HExA improves both frontier black-box and smaller LLMs, achieving in-distribution accuracy gains of up to +75% on the hardest Interphyre level, where the Claude Sonnet 4.6 base model is stuck at 2%. In addition, the skill banks learned on easier levels transfer to unseen levels without any target-level experimentation, yielding gains of up to +36% from transferred hierarchical skills alone.

## 2 Related Works

Learning from Interaction. LLM agents improve decision-making via *parametric fine-tuning*, such as reinforcement learning (Ouyang et al., [2022](https://arxiv.org/html/2606.29315#bib.bib41); Rafailov et al., [2023](https://arxiv.org/html/2606.29315#bib.bib44); Guo et al., [2025](https://arxiv.org/html/2606.29315#bib.bib21)), or *non-parametric adaptation*. While parametric updates yield strong performance, they often incur high computational costs and risk catastrophic forgetting or over-specialization (Ziegler et al., [2020](https://arxiv.org/html/2606.29315#bib.bib72); Shi et al., [2025](https://arxiv.org/html/2606.29315#bib.bib47); Luo et al., [2025](https://arxiv.org/html/2606.29315#bib.bib34)), driving the shift toward in-context agents as a more efficient alternative. HExA, our training-free framework, directly targets this challenge by letting LLM agents self-improve via in-context experimentation and the self-evolution of reusable hierarchical skills.

Skills for LMs. Skills are natural-language modules that capture reusable procedural knowledge for augmenting LMs at inference time (Anthropic, [2025a](https://arxiv.org/html/2606.29315#bib.bib4), [b](https://arxiv.org/html/2606.29315#bib.bib5)), and skill augmentation has been validated across agent tasks (Si et al., [2023](https://arxiv.org/html/2606.29315#bib.bib49)), including coding (Ma et al., [2026a](https://arxiv.org/html/2606.29315#bib.bib35)) and web navigation (Wang et al., [2026c](https://arxiv.org/html/2606.29315#bib.bib62)). Earlier skill libraries were built largely through human annotation (Li et al., [2026a](https://arxiv.org/html/2606.29315#bib.bib28), [b](https://arxiv.org/html/2606.29315#bib.bib31); Liang et al., [2026](https://arxiv.org/html/2606.29315#bib.bib32); Wang et al., [2026b](https://arxiv.org/html/2606.29315#bib.bib60)), which is effective but hard to scale in context-learning settings where contexts are long, technical, and domain-specific. Recent work has therefore moved toward automated skill construction. AutoSkill (Yang et al., [2026](https://arxiv.org/html/2606.29315#bib.bib65)) extracts reusable behaviors from interaction traces as lifelong, versioned artifacts, while AutoRefine (Qiu et al., [2026](https://arxiv.org/html/2606.29315#bib.bib43)) turns agent trajectories into reusable expertise. CoEvoSkills (Zhang et al., [2026](https://arxiv.org/html/2606.29315#bib.bib70)) builds multi-file skill packages through generator–verifier co-evolution; EvoSkill (Alzubi et al., [2026](https://arxiv.org/html/2606.29315#bib.bib3)) performs failure-driven refinement into structured skill folders; and SkillX (Wang et al., [2026a](https://arxiv.org/html/2606.29315#bib.bib59)) distills agent trajectories into a hierarchical skill knowledge base refined through execution feedback.
However, these methods typically rely on external feedback signals, such as execution feedback, ground-truth comparison, or task-completion rewards, to evaluate and improve skill quality (Ma et al., [2026b](https://arxiv.org/html/2606.29315#bib.bib36)), which are unavailable in context-learning scenarios without automatic feedback. A separate line of work internalizes skills through weight updates: SKILL0 (Lu et al., [2026](https://arxiv.org/html/2606.29315#bib.bib33)) uses in-context reinforcement learning (RL) to absorb skills, while SkillRL (Xia et al., [2026](https://arxiv.org/html/2606.29315#bib.bib63)) builds a skill bank via RL-guided distillation from stronger teacher trajectories. These approaches require parameter access, limiting their use with closed-source models and sacrificing the interpretability of natural-language skill documents. We therefore propose HExA as a self-improving framework that discovers and evolves skills through in-context active experimentation.

## 3 Hierarchical Experimentalist Agents: Learning via Experimentation

![Refer to caption](https://arxiv.org/html/2606.29315v1/figures/skill_hexa_catapult.png)
(a) Example on `catapult` with its skill bank.

![Refer to caption](https://arxiv.org/html/2606.29315v1/figures/Feedback_Loop.png)
(b) HExA actor–evolver–retriever loop.

Figure 2: HExA learns reusable skills through an actor–evolver–retriever loop. (a) An example on the `catapult` level illustrates the kind of physics reasoning skills HExA distills. (b) In each round, the actor generates reward-tagged trajectories, the evolver updates the skill bank, and the retriever injects the most relevant skills back into the actor for the next attempt.

HExA is a training-free, in-context reinforcement learning framework in which an LLM agent learns to solve unfamiliar tasks through active experimentation. Rather than updating model weights, HExA improves across episodes by distilling interaction experience into a persistent external *skill bank* (a structured collection of natural-language strategies and mistakes, defined below), which is re-injected into the agent's context at the start of each new episode. This makes HExA compatible with any tool-augmented LLM, including closed-source models, and it requires no offline data, oracle supervision, or external teacher, only the environment's own interaction feedback. Unlike standard in-context RL, it requires no initial domain-specific training on reward-labeled data; it learns to self-improve from scratch on each new task family. We use the `catapult` level from Interphyre as a running example throughout the paper, as shown in Figure [4](https://arxiv.org/html/2606.29315#S4.F4), with the skill bank in Figure [2(a)](https://arxiv.org/html/2606.29315#S3.F2.sf1). The agent has to choose where to place a red ball, $(x,y)$, and its radius $r$, so that the ball falls and strikes a pivoting arm, launching a green ball into a distant basket. The full environment descriptions are in Appendix [G](https://arxiv.org/html/2606.29315#A7).

##### Problem Setup\.

We consider task families $\ell$ whose instances $\{(\ell, s_j)\}$ each fix a specific environment configuration: geometry, parameters, and initial conditions. For each family $\ell$ we have a finite set of instances (or seeds) $\mathcal{S}_\ell = \{s_1, \ldots, s_{|\mathcal{S}_\ell|}\}$, and episodes iterate through $\mathcal{S}_\ell$ in a fixed order. For example, in `catapult`, $s_j$ sets the arm pose, deflector layout, and ceiling-blocker position. The agent has at most $T$ turns to solve $s_j$. We index turns by $t$ ($t = 0, 1, \ldots, T-1$) and denote the entire experimentation trajectory for $s_j$ as $\tau_j$.

The agent interacts through a tool-call interface $\mathcal{F}$ provided by the environment. Interphyre supplies a set of shared tools available on every level (scene inspection, full and partial simulation, and contact logging) as well as some general level-specific analysis tools (shown in Appendix [B](https://arxiv.org/html/2606.29315#A2) and Figure [10](https://arxiv.org/html/2606.29315#A1.F10)). For example, `catapult` has access to tools such as `describe_scene_geometry`, `predict_first_contact(x,y,r)`, `simulate_with_trace(x,y,r,object_names,stop_step)`, and `finish(x,y,r)`.

At turn $t$, the agent observes the history $h_t = (a_0, o_0, a_1, o_1, \ldots, a_{t-1}, o_{t-1})$ and selects the next action under the actor policy $\pi_{\mathrm{act}}$. For instance, in `catapult` a typical action is $a_t = \texttt{simulate\_with\_trace(0.5, 0.4, 1.5, ["green\_ball"])}$, returning a set of contact events and a kinematic summary (e.g., peak height, peak speed, and net displacement). A binary predicate $\textsc{Goal}_\ell$ scores the completed simulation, where $\textsc{Sim}(\tau)$ executes trajectory $\tau$ in the environment and returns its final state. The episode ends after at most $T$ turns or a terminal action, with success on seed $s_j$ defined as

$$y_j(\pi_{\mathrm{act}}; \ell) = \mathbb{I}\!\left[\textsc{Goal}_\ell\!\left(\textsc{Sim}(\tau_j)\right) = \text{true}\right], \qquad \tau_j \sim \pi_{\mathrm{act}}(\cdot \mid \ell, s_j). \tag{1}$$
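The setup above amounts to a bounded tool-calling loop followed by a binary check. The sketch below is only illustrative: `ToyEnv`, `run_episode`, and `sweep_policy` are hypothetical names, the toy goal (a placement window around `target_x`) is an assumption, and none of this is the Interphyre API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Action = Tuple[str, tuple]          # (tool name, arguments)
History = List[Tuple[Action, str]]  # h_t = (a_0, o_0, ..., a_{t-1}, o_{t-1})


@dataclass
class ToyEnv:
    """Hypothetical stand-in for a simulator exposing a tool-call interface."""
    target_x: float = 0.3
    final_action: Optional[tuple] = None

    def call(self, action: Action) -> str:
        name, args = action
        if name == "finish":
            self.final_action = args  # terminal action commits a placement
            return "committed"
        if name == "simulate":
            return "hit" if abs(args[0] - self.target_x) < 0.05 else "miss"
        return "unknown tool"

    def goal(self) -> bool:
        """Binary goal predicate evaluated on the committed placement."""
        return (self.final_action is not None
                and abs(self.final_action[0] - self.target_x) < 0.05)


def run_episode(env: ToyEnv, policy: Callable[[History], Action],
                max_turns: int) -> Tuple[History, int]:
    """Roll out at most T turns; return the trajectory and y_j in {0, 1}."""
    history: History = []
    for _ in range(max_turns):
        action = policy(history)
        history.append((action, env.call(action)))
        if action[0] == "finish":
            break
    return history, int(env.goal())


def sweep_policy(history: History) -> Action:
    """Toy actor: sweep x until a simulation reports a hit, then commit."""
    if history and history[-1][1] == "hit":
        return ("finish", history[-1][0][1])
    x = round(0.1 * (len(history) + 1), 2)
    return ("simulate", (x, 0.9, 1.5))


trajectory, y = run_episode(ToyEnv(), sweep_policy, max_turns=10)
```

With a budget of 10 turns the toy actor commits on its fourth turn; with a budget of 2 it runs out of turns before committing, so the indicator is 0.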
Input: actor $\pi_{\mathrm{act}}$, evolver $\pi_{\mathrm{ev}}$, level $\ell$, seeds of that level $\mathcal{S}_\ell$, rounds $R$, episodes per round $x$

1. Optional warm start (first round only): $\mathcal{K}_0 \leftarrow \pi_{\mathrm{ev}}(\emptyset, \mathcal{T}_0)$ or $\mathcal{K}_0 \leftarrow \emptyset$
2. for $n = 1, \ldots, R$ do
   1. $\mathcal{G}_n \leftarrow \textsc{Retriever}(\mathcal{K}_{n-1}, \ell, M, N)$ (top-$M$ skills + top-$N$ mistakes)
   2. for $i = 1 + x(n-1), \ldots, xn$ do
      1. $\tau_i \leftarrow \pi_{\mathrm{act}}(\cdot \mid \ell, s_i, \mathcal{G}_n)$ (actor prompt: task + tools + skills)
      2. $r(\tau_i) \leftarrow \textsc{Reward}(\tau_i)$ (outcome $\times$ efficiency; see Appendix [A.3.1](https://arxiv.org/html/2606.29315#A1.SS3.SSS1))
   3. $\mathcal{K}_n \leftarrow \pi_{\mathrm{ev}}(\mathcal{K}_{n-1}, \mathcal{T}^{(n)})$ (evolver prompt: bank + trajectories + rewards)
3. return $\mathcal{K}_R$

Algorithm 1: HExA
##### The HExA Loop.

As shown in Figures [2(b)](https://arxiv.org/html/2606.29315#S3.F2.sf2) and [9](https://arxiv.org/html/2606.29315#A1.F9), each round of HExA runs as follows: a *retriever* selects the most relevant entries from the current skill bank $\mathcal{K}$ and injects them into the *actor*'s context; the actor runs $x$ episodes under that skill context, each scored by a reward; and an *evolver* reads the resulting batch of trajectories and updates $\mathcal{K}$. On the first round, $\mathcal{K}$ may start empty or be warm-started from an offline batch of base-actor trajectories; every subsequent round is identical. Algorithm [1](https://arxiv.org/html/2606.29315#algorithm1) gives the full loop; initialization and update strategies are ablated in Section [4](https://arxiv.org/html/2606.29315#S4).

HExA couples three components: an *actor* $\pi_{\mathrm{act}}$ (any tool-augmented LLM) that reasons about which experimental hypotheses to test and performs experiments via exploration, interventions, and interactions with the environment to generate trajectories; an *evolver* $\pi_{\mathrm{ev}}$ (an LLM call to the same model with an evolver prompt) that reads batches of actor trajectories and distills them into an external *skill bank* $\mathcal{K}$; and a *retriever* that selects the most relevant entries from $\mathcal{K}$ and injects them into the actor's context before each episode. We now describe each component in terms of the prompt it receives, what it produces, and why the step is necessary.
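The control flow of one HExA run can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released implementation: the actor, reward, evolver, and retriever are passed in as plain callables, the `toy_*` stubs stand in for LLM calls, and the cold start (empty bank) is chosen for simplicity.

```python
from typing import Callable, Dict, List

Bank = List[Dict]        # skill bank K as a list of records
Trajectory = Dict        # one episode; gains a "reward" field once scored


def hexa_loop(
    seeds: List[int],
    actor: Callable[[int, Bank], Trajectory],
    reward_fn: Callable[[Trajectory], float],
    evolver: Callable[[Bank, List[Trajectory]], Bank],
    retriever: Callable[[Bank], Bank],
    episodes_per_round: int,
) -> Bank:
    """Run R = len(seeds) // x rounds and return the final bank K_R."""
    bank: Bank = []                                  # cold start (assumption)
    num_rounds = len(seeds) // episodes_per_round
    for n in range(num_rounds):
        context = retriever(bank)                    # G_n from K_{n-1}
        start = n * episodes_per_round
        batch = []
        for seed in seeds[start:start + episodes_per_round]:
            traj = actor(seed, context)              # skill-conditioned episode
            traj["reward"] = reward_fn(traj)
            batch.append(traj)
        bank = evolver(bank, batch)                  # K_n from K_{n-1}, T^(n)
    return bank


# Toy stubs standing in for LLM calls (purely hypothetical behavior).
def toy_actor(seed: int, context: Bank) -> Trajectory:
    return {"seed": seed, "success": seed % 2 == 0, "skills_seen": len(context)}


def toy_reward(traj: Trajectory) -> float:
    return 1.0 if traj["success"] else -1.0


def toy_evolver(bank: Bank, batch: List[Trajectory]) -> Bank:
    new = [{"title": f"lesson from seed {t['seed']}", "reward": t["reward"]}
           for t in batch if t["reward"] > 0]
    return bank + new


def toy_retriever(bank: Bank) -> Bank:
    return sorted(bank, key=lambda s: s["reward"], reverse=True)[:2]


final_bank = hexa_loop(list(range(6)), toy_actor, toy_reward,
                       toy_evolver, toy_retriever, episodes_per_round=2)
```

Keeping the four roles as injected callables mirrors the claim that the loop changes only the textual context, so any tool-augmented model could be slotted in as the actor or evolver.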

##### Actor prompt and skill\-augmented episode\.

At the start of each episode, the actor receives a prompt containing the task description, the available tools $\mathcal{F}$, and a *skill context* $\mathcal{G}$ retrieved from the current skill bank:

$$\mathcal{G} = \textsc{Retriever}(\mathcal{K}, \ell, M, N), \qquad \tau_j \sim \pi_{\mathrm{act}}(\cdot \mid \ell, s_j, \mathcal{G}). \tag{2}$$

Equation [1](https://arxiv.org/html/2606.29315#S3.E1) defines the task without method-specific assumptions; any agent, including skill-free baselines such as Direct (the base model's zero-shot interactions) or ReAct, can be evaluated against it. HExA augments the actor with skill context $\mathcal{G}$ as in Eq. [2](https://arxiv.org/html/2606.29315#S3.E2), replacing the base policy $\pi_{\mathrm{act}}(\cdot \mid \ell, s_j)$ with the skill-conditioned version. The skill context $\mathcal{G}$ consists of the top-$M$ skills and top-$N$ mistake records from the bank, ranked using the skill reward labels defined below, prepended as a fixed-size prefix to the actor's system prompt alongside the standard task description and tool documentation. On the first round, $\mathcal{K}$ may be empty or warm-started (see below); from the second round onward, it contains skills distilled from prior rounds. The injection mechanism is deliberately simple (it changes only the textual context, not the agent's architecture or decoding), so HExA works with any tool-augmented LLM. The actor is free to use or disregard individual skills based on its own reasoning about the current instance.
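As a rough illustration of the injection step, the following Python sketch selects the top-$M$ skills and top-$N$ mistakes and prepends them to the prompt. The record fields, the use of a single `reward` key to rank both skills and mistakes, and the prompt layout are assumptions made for this example; the paper's actual prompt templates are in its appendix.

```python
from typing import Dict, List


def retrieve(bank: Dict[str, List[dict]], m: int, n: int) -> Dict[str, List[dict]]:
    """Return the top-m skills and top-n mistakes, ranked by reward label."""
    def top(items: List[dict], k: int) -> List[dict]:
        return sorted(items, key=lambda e: e["reward"], reverse=True)[:k]

    return {"skills": top(bank["skills"], m), "mistakes": top(bank["mistakes"], n)}


def build_system_prompt(task: str, tools: str, context: Dict[str, List[dict]]) -> str:
    """Prepend the skill context as a prefix to the task and tool text."""
    lines = ["## Skills"]
    lines += [f"- {s['title']}: {s['principle']}" for s in context["skills"]]
    lines.append("## Mistakes to avoid")
    lines += [f"- {mk['description']} -> {mk['correction']}"
              for mk in context["mistakes"]]
    return "\n".join(lines + ["", task, tools])


# Hypothetical bank contents, for illustration only.
demo_bank = {
    "skills": [
        {"title": "A", "principle": "pa", "reward": 0.2},
        {"title": "B", "principle": "pb", "reward": 0.9},
        {"title": "C", "principle": "pc", "reward": 0.5},
    ],
    "mistakes": [
        {"description": "d1", "correction": "c1", "reward": 0.1},
        {"description": "d2", "correction": "c2", "reward": 0.7},
    ],
}
demo_context = retrieve(demo_bank, m=2, n=1)
demo_prompt = build_system_prompt("TASK", "TOOLS", demo_context)
```

Because the prefix has at most $M + N$ entries, its size stays bounded no matter how large the bank grows.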

##### Trajectory reward\.

Each trajectory $\tau$ receives a scalar reward $r(\tau) \in [-1, +1]$ reflecting both *outcome* (success or failure) and *efficiency* (how quickly the outcome was reached, as a fraction of the budget $T$). Fast, decisive successes earn higher rewards; failures are penalized less when the agent explored extensively before failing, since such trajectories yield richer material for the evolver than one that abandoned the episode after a single attempt. We use discrete reward bins rather than a continuous function because the evolver is an LLM: it can reliably act on qualitative categories (fast success, slow success, exploratory failure, early exit) (Viswanathan et al., [2026](https://arxiv.org/html/2606.29315#bib.bib58); Gunjal et al., [2025](https://arxiv.org/html/2606.29315#bib.bib20)), but cannot meaningfully distinguish, say, $r = 0.73$ from $r = 0.71$. The exact reward function is given in Appendix [A.3.1](https://arxiv.org/html/2606.29315#A1.SS3.SSS1).
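To make the binning idea concrete, here is a small Python sketch that maps outcome and budget usage onto the four qualitative categories named above. The bin values (+1.0, +0.5, -0.5, -1.0) and the 0.5 budget threshold are placeholders chosen for illustration; the exact function is the one in the paper's Appendix A.3.1, which is not reproduced here.

```python
from typing import Tuple


def binned_reward(success: bool, turns_used: int, budget: int,
                  threshold: float = 0.5) -> Tuple[str, float]:
    """Map (outcome, efficiency) to a qualitative bin and a reward in [-1, +1].

    Bin values and the threshold are illustrative assumptions.
    """
    if budget <= 0 or not 0 <= turns_used <= budget:
        raise ValueError("turns_used must lie in [0, budget] with budget > 0")
    used_fraction = turns_used / budget
    if success:
        # Faster successes earn the higher bin.
        if used_fraction <= threshold:
            return ("fast success", 1.0)
        return ("slow success", 0.5)
    # Failures after extensive exploration are penalized less than early exits.
    if used_fraction > threshold:
        return ("exploratory failure", -0.5)
    return ("early exit", -1.0)


examples = [
    binned_reward(True, 6, 20),
    binned_reward(True, 18, 20),
    binned_reward(False, 20, 20),
    binned_reward(False, 1, 20),
]
```

Returning the category label alongside the number reflects the motivation given above: the label is what an LLM evolver can act on reliably.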

##### Evolver prompt and skill distillation\.

After each round $n$ ($n = 1, \ldots, R$, where $R = \lvert\mathcal{S}_\ell\rvert / x$) of $x$ trajectories, the evolver receives a prompt containing the current skill bank $\mathcal{K}_{n-1}$, the new batch $\mathcal{T}^{(n)}$ (each trajectory rendered as its `(thought, tool_call, observation)` sequence tagged with its reward), and a structured-output instruction. The instruction directs the evolver to use rewards to evaluate trajectory quality and to *contrast* high-reward behaviors against low-reward ones, so as to learn from the observations what distinguishes success from failure. From this, the evolver produces two types of distilled knowledge.

*Strategy skills* are extracted by contrasting high-reward against low-reward trajectories, yielding a set of new skills. Each skill is a structured record defined by a *title*, a *principle* (the insight gained from testing the proposed hypothesis via experimentation), a *when-to-apply* condition, an optional demonstrative *example*, the *source seed* it was derived from, and a *reward score* defined below. For example, the round-14 `catapult` skill `cat_ev_1_001` (Appendix [E.4](https://arxiv.org/html/2606.29315#A5.SS4)) has the title “$x \approx 0.5$ is the primary launch sweet spot,” the principle “placing the red ball at $x = 0.5$ gives a consistent rightward launch; deviate to $x = 0.3$ only on a ceiling hit,” and the applicability “always as the first placement attempt.”

*Mistake records* are extracted from failures alone, yielding records $\mu = (\delta, \rho, \alpha)$ pairing a description $\delta$, a root cause $\rho$, and a corrective action $\alpha$. For example: $\delta$ “the agent fixates on one launch point and micro-tunes $x/y/r$,” $\rho$ “the arm is the obvious mechanism, so failures are met with small perturbations,” $\alpha$ “after two failures within $x \pm 0.2$, move to a different $x$ zone.” The evolver also recovers *partial skills* from otherwise-failed trajectories: reasoning steps that were individually correct even though the episode failed overall. For example, a trajectory that correctly identifies the catapult arm as the target mechanism but then micro-tunes placement in the wrong direction yields a partial skill encoding the correct mechanism identification, even though the solve failed. This is a learning signal unavailable from successful trajectories alone, since successes do not reveal which intermediate steps were necessary versus incidental. Skills are extracted while the agent already holds previously learned ones, so later skills can build on earlier ones. Whether this compounds into a hierarchy of increasingly abstract principles is an empirical question we investigate in Section [4](https://arxiv.org/html/2606.29315#S4). Full evolver prompt templates are in Appendix [A.3.2](https://arxiv.org/html/2606.29315#A1.SS3.SSS2).
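One plausible way to hold the two record types in code is a pair of small dataclasses. The sketch below is an assumption about representation only: the class and field names are hypothetical, the source seed and reward score are left unset because the text above does not state them for this skill, and `render` is an illustrative helper rather than the paper's prompt format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StrategySkill:
    title: str
    principle: str
    when_to_apply: str
    example: Optional[str] = None        # optional demonstrative example
    source_seed: Optional[int] = None    # not stated in the text; left unset
    reward_score: Optional[float] = None


@dataclass(frozen=True)
class MistakeRecord:
    description: str   # delta
    root_cause: str    # rho
    correction: str    # alpha


def render(skill: StrategySkill) -> str:
    """Render a skill as plain text suitable for a prompt prefix."""
    parts = [f"Skill: {skill.title}",
             f"Principle: {skill.principle}",
             f"Use when: {skill.when_to_apply}"]
    if skill.example:
        parts.append(f"Example: {skill.example}")
    return "\n".join(parts)


# Populated from the catapult examples quoted in the text.
launch_skill = StrategySkill(
    title="x ~ 0.5 is the primary launch sweet spot",
    principle=("placing the red ball at x = 0.5 gives a consistent rightward "
               "launch; deviate to x = 0.3 only on a ceiling hit"),
    when_to_apply="always as the first placement attempt",
)
fixation = MistakeRecord(
    description="the agent fixates on one launch point and micro-tunes x/y/r",
    root_cause=("the arm is the obvious mechanism, so failures are met with "
                "small perturbations"),
    correction="after two failures within x +/- 0.2, move to a different x zone",
)
rendered = render(launch_skill)
```

Keeping the records as plain text fields preserves the interpretability that natural-language skill documents are meant to offer.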

*Abridged evolved skill bank used on catapult seed 45.* Before attempting seed 45, HExA receives a skill bank distilled from earlier catapult episodes. The entries below show how it turns the observed trajectory into a structured decision.

- *Skill 1 (Default): Start from a stable launch geometry.* Place the red ball near $x = 0.5$ with sufficiently large radius. This contact point provides a stable lever arm and usually produces a strong rightward launch. *Use when:* Use as the initial launch hypothesis for a new seed; deviate when the trajectory hits the ceiling or violates placement constraints.
- *Skill 2 (Diagnose): Recognize the radius plateau.* At fixed $x = 0.5$, increasing the radius beyond approximately $r = 1.5$ may not increase range because the arm reaches its rotational limit. More mass therefore does not necessarily produce a better trajectory. *Use when:* A large-radius launch still fails; stop assuming that additional force alone will solve the problem. For example, if $r = 1.5$ fails at $x = 0.5$, change the contact geometry rather than continuing to increase $r$.
- *Skill 3 (Correct): Flatten a ceiling-blocked trajectory.* If a smaller radius falls short while a larger radius strikes the ceiling, the bottleneck is launch angle rather than energy. Shift $x$ toward $0.1$–$0.3$ to change the arm-contact geometry and flatten the launch arc. *Use when:* After two radius variations at the same $x$ produce one short trajectory and one ceiling collision, stop tuning $r$ and explore a different $x$ region.
- *Avoid: Local parameter fixation.* The agent repeatedly micro-tunes $(x, y, r)$ around the same unsuccessful launch point because the catapult arm appears to be the correct mechanism. *Correction:* After repeated failures in the same neighborhood, change the relevant search dimension or move to a qualitatively different placement region.

*How the bank guides seed 45.* HExA first tests the learned default launch and observes a ceiling overshoot. The bank identifies this as a geometry problem rather than a lack-of-force problem, so the agent shifts the drop point from the $x \approx 0.5$ region to $x = 0.3$ instead of continuing to increase the radius. It then succeeds with $(x, y, r) = (0.3, 0.9, 1.5)$ in six interaction iterations (Figure [4](https://arxiv.org/html/2606.29315#S4.F4)).

$$\underbrace{\text{default launch}}_{\text{prior skill}} \;\longrightarrow\; \underbrace{\text{ceiling hit}}_{\text{observation}} \;\longrightarrow\; \underbrace{\text{shift contact geometry}}_{\text{correction}} \;\longrightarrow\; \underbrace{\text{success}}_{\text{6 iterations}}$$

##### Skill bank and retrieval\.

Distilled knowledge is stored in the skill bank $\mathcal{K}$ with two collections, both indexed by task family: a set of *skills* (reusable strategies and principles) and a set of *mistakes* (recurring failure modes to avoid). Each skill carries a *reward score* $r_k$, computed from the trajectories that demonstrated or used it, as a function of the mean reward of these source trajectories ($\bar{r}_{\mathrm{src}}$). We show an example of this reward label in Section [4.3](https://arxiv.org/html/2606.29315#S4.SS3), with the individual trajectory rewards given in Appendix [A.3.1](https://arxiv.org/html/2606.29315#A1.SS3.SSS1). This reward label is used for two purposes: ranking skills during retrieval (top-$M$ by $r_k$), and guiding the evolver in deciding which skills to retain or prune when the bank is at capacity. Skills derived from fast, efficient successful trajectories carry higher scores; skills from partial reasoning within failed trajectories carry lower scores but may still encode valuable learnings and discoveries.

The bank is capped at $M_{\max}$ skills and $N_{\max}$ mistakes per task family to prevent the injected context from growing without bound and to force the evolver to prioritize the most informative knowledge. Full bank structure and retrieval details are in Appendix [A](https://arxiv.org/html/2606.29315#A1).
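The retrieval and capacity rules above can be sketched as follows. This is a simplified stand-in under stated assumptions: in the paper the evolver LLM decides what to retain or prune with the reward score as guidance, whereas this sketch prunes purely by score. The names `ScoredSkill`, `SkillBank`, `add`, and `retrieve` are hypothetical, and only the skill collection is modeled (mistakes would be capped at $N_{\max}$ analogously).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ScoredSkill:
    title: str
    reward_score: float  # r_k


@dataclass
class SkillBank:
    """Per-family skill store with a hard capacity (hypothetical sketch)."""
    m_max: int  # M_max: maximum number of skills kept per task family
    skills: Dict[str, List[ScoredSkill]] = field(default_factory=dict)

    def add(self, family: str, skill: ScoredSkill) -> None:
        entries = self.skills.setdefault(family, [])
        entries.append(skill)
        # Keep entries sorted by descending r_k, then drop the overflow.
        # Score-only pruning is an assumption standing in for the evolver.
        entries.sort(key=lambda s: s.reward_score, reverse=True)
        del entries[self.m_max:]

    def retrieve(self, family: str, m: int) -> List[ScoredSkill]:
        """Return the top-m skills by r_k for one task family."""
        return self.skills.get(family, [])[:m]


bank = SkillBank(m_max=2)
bank.add("catapult", ScoredSkill("a", 0.9))
bank.add("catapult", ScoredSkill("b", 0.3))
bank.add("catapult", ScoredSkill("c", 0.6))  # pushes out "b", the lowest score
top_titles = [s.title for s in bank.retrieve("catapult", 5)]
```

With a capacity of two, the lowest-scored entry is dropped as soon as a third skill arrives, so the injected context stays bounded regardless of how many rounds have run.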

The abridged skill bank above shows the information available to HExA *before* it attempts catapult seed 45. Unlike raw trajectory memory, the bank organizes evidence from earlier episodes into a default hypothesis, diagnostic conditions, corrective strategies, and mistakes to avoid. Figure [4](https://arxiv.org/html/2606.29315#S4.F4) subsequently shows how these skills guide the successful HExA trajectory. We can also see how HExA composes skills drawn from individual experimental and interaction observations to form hierarchical, reusable general skills.

##### Cross\-task skill transfer\.

The same mechanism supports transfer across task families. Low-level details such as source-task coordinates or object identities are generally not directly reusable, whereas the mechanism-level abstractions distilled above (such as contact geometry controlling impulse direction, moment arm controlling torque, or collision order controlling downstream motion) can remain valid under a different scene configuration. Given evolved source banks $\mathcal{K}_{\ell_1}, \ldots, \mathcal{K}_{\ell_S}$ and only a textual description of an unseen target family $\ell^{*}$, the evolver synthesizes a target bank without observing any target-task trajectories. It (i) selects source skills whose underlying physical or search principles are structurally relevant to $\ell^{*}$, (ii) removes source-specific parameter and entity bindings, (iii) re-grounds each principle in the entities, constraints, and tools of $\ell^{*}$, and (iv) assigns a transfer score according to the directness of the structural correspondence and the degree of corroboration across source banks.

This procedure transfers abstractions rather than memorized solutions: the evolver is prohibited from inventing target-specific coordinates and must express each transferred skill as a mechanism and an applicability condition. For example, evidence about lever-mediated impulse transfer from several source levels can be re-grounded in the pivoting arm and launch dynamics of catapult, even though no successful catapult placement has been observed. The resulting bank $\mathcal{K}_{\ell^{*}}$ is injected through the same retriever, enabling zero-shot cross-task transfer, or used as the initialization $\mathcal{K}_0$ for a subsequent within-task HExA run. Full transfer prompts are in Appendix [I.5](https://arxiv.org/html/2606.29315#A9.SS5); results are in Section [4](https://arxiv.org/html/2606.29315#S4).

## 4 Experiments and Results on Interphyre

In this section, we provide details of Interphyre and evaluate different LLMs on experiment-centric tasks using Interphyre levels. Our benchmark is designed with the following key properties:

1. (i) tasks whose solution is unknown or cannot be read off a static description and must be discovered by interacting with the environment;
2. (ii) a programmatic interface that lets the agent inspect the scene, test interventions, and observe intermediate dynamics;
3. (iii) a controllable difficulty range with procedural generation.

![Refer to caption](https://arxiv.org/html/2606.29315v1/x1.png)

Figure 3: Interphyre’s snapshot/restore API. This branches a shared mid-trajectory state into a factual rollout and perturbed alternatives (Section [G.4.1](https://arxiv.org/html/2606.29315#A7.SS4.SSS1)). Each row is one seed of the catapult level with a different oracle solution strategy. The leftmost column shows the initial scene for reference; the teal curve is the green ball’s trajectory, shown up to the branch point in the *Branch point* column and continuing through the outcome in each subsequent column. Top (seed 8, deflector strategy): removing the deflector causes failure; shrinking the red action ball does not. Bottom (seed 5, direct-launch strategy): shrinking the red ball causes failure; removing the deflector does not. The same interventions have opposite causal relevance across seeds, showing that causal relevance is a property of an object relative to the active strategy, not of the object in isolation. See Section [G.4.3](https://arxiv.org/html/2606.29315#A7.SS4.SSS3) for the branching protocol.

### 4.1 Interphyre: A Testbed for Learning through Experimentation

Unlike benchmarks that primarily ask whether an agent can solve a task or answer a query, our setting requires an environment in which the agent can deliberately interact and gather evidence: inspecting the current instance, testing interventions, observing intermediate dynamics, diagnosing failures, and reusing what it learned across future seeds and levels in an experimentation loop. Interphyre provides this in a controlled 2D physics domain with continuous actions and long-horizon causal structure. Each task requires placing a red ball at $(x, y, r)$, after which the simulator evaluates whether the level-specific success predicate is achieved.

Interphyre builds on the level designs in the PHYRE benchmark (Bakhtin et al., [2019](https://arxiv.org/html/2606.29315#bib.bib8)) by introducing APIs for level editing, test-time interventions on environment parameters, and logging traces from the simulation. These APIs can be used to customize the benchmark for evaluating agents on unseen levels, or provided to an LLM for active experimentation. In particular, our experiments use the Interphyre API to expose a structured tool-calling interface for experimentation via interventional scene inspection, partial simulation, full rollouts, contact logs, and level-specific geometric analysis, for example as shown in Figure [3](https://arxiv.org/html/2606.29315#S4.F3). However, successful actions often depend on contact chains, lever mechanics, collision timing, and strategy-dependent dynamics. Even strong LLMs struggle to anticipate such interactions reliably during inference, particularly when given only the scene description. Using these intervention APIs and the traces recovered from Interphyre runs, HExA enables the LLM agent to reason over its proposed hypotheses and the corresponding experimental observations, rather than merely guessing solutions from parametric prior knowledge. This lets the agent use the information acquired through exploration and interaction to improve its performance in-context. Interphyre therefore serves as both a benchmark and a diagnostic environment for experimentation-based methods such as HExA. We provide full details of the environment design, curriculum, tool API, observation modes, and prompts in Appendix [G](https://arxiv.org/html/2606.29315#A7) and [H](https://arxiv.org/html/2606.29315#A8).
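The snapshot/restore branching pattern from Figure 3 can be illustrated on a toy simulator. Everything in this sketch is hypothetical: `ToySim`, `snapshot`, `restore`, and `branch` are illustrative names and a one-dimensional stand-in for the physics, not the Interphyre API, whose actual tool signatures are documented in the appendices.

```python
import copy
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class State:
    x: float    # position
    v: float    # velocity per step
    t: int = 0  # elapsed steps


class ToySim:
    """Deterministic 1D toy simulator with snapshot/restore (hypothetical)."""

    def __init__(self, x: float, v: float) -> None:
        self.state = State(x=x, v=v)

    def step(self, n: int = 1) -> None:
        for _ in range(n):
            self.state.x += self.state.v
            self.state.t += 1

    def snapshot(self) -> State:
        return copy.copy(self.state)

    def restore(self, snap: State) -> None:
        # Copy again so the stored snapshot is never mutated by later steps.
        self.state = copy.copy(snap)


def branch(sim: ToySim, snap: State, horizon: int,
           intervention: Optional[Callable[[State], None]] = None) -> float:
    """Roll out from a shared mid-trajectory state, optionally perturbed."""
    sim.restore(snap)
    if intervention is not None:
        intervention(sim.state)
    sim.step(horizon)
    return sim.state.x


def stop_ball(state: State) -> None:
    state.v = 0.0  # the perturbation: remove all velocity at the branch point


sim = ToySim(x=0.0, v=1.0)
sim.step(3)                        # shared prefix up to the branch point
branch_point = sim.snapshot()
factual = branch(sim, branch_point, horizon=2)
perturbed = branch(sim, branch_point, horizon=2, intervention=stop_ball)
factual_again = branch(sim, branch_point, horizon=2)
```

Because every branch restarts from the same stored state, any difference between the factual and perturbed outcomes is attributable to the intervention alone, which is the property the counterfactual comparison in Figure 3 relies on.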

### 4.2 Experimental Setup, Evaluations, and Baselines

Table 1: Base LLM agent’s (Direct baseline) solve rate (%) across eight Interphyre levels. † Pass the Parcel and Catapult evaluated over 50 seeds; all other levels evaluated over 100 seeds.

We evaluate five LLMs, spanning frontier black-box and small open models (Claude Sonnet 4.6, Qwen-2.5-{7B, 14B} Instruct, and GPT-OSS-{20B, 120B}), together with a random agent (a heuristic policy that samples uniformly from the valid placement region, with no interaction with the environment). Each agent is capped at 25 interaction turns per trajectory on Interphyre levels. We compare against three baselines: *ReAct* (Yao et al., [2023a](https://arxiv.org/html/2606.29315#bib.bib67)), which runs one Thought-Action-Observation trajectory per seed; *Reflexion* (Shinn et al., [2023](https://arxiv.org/html/2606.29315#bib.bib48)), which adds verbal self-reflection; and *Direct*, which prompts the model to solve the task in a single shot from the scene description alone, with no tools, no environment interaction, and no skills, measuring the base models’ parametric capability. With the smaller models, we also include RL training via *GRPO* (Shao et al., [2024](https://arxiv.org/html/2606.29315#bib.bib46)), which fine-tunes Qwen-2.5-3B with a binary $\{0, 1\}$ reward. It is the only method that updates model weights, and so offers a direct comparison against our training-free framework. All methods on a given level are evaluated on the same 50 seeds, so comparisons are paired on identical instances (full details in Appendix [C](https://arxiv.org/html/2606.29315#A3)).

In Table [1](https://arxiv.org/html/2606.29315#S4.T1) above and Table [9](https://arxiv.org/html/2606.29315#A4.T9) in Appendix [D](https://arxiv.org/html/2606.29315#A4), we organize the eight levels into two difficulty tiers based on base LLM performance (Direct) and on performance with ReAct as the agent baseline: Tier 1 (the first six levels), and Tier 2 (pass\_the\_parcel at $0\%$ and catapult at $2\%$, with every open-weight model scoring $0\%$ on the latter). The two hard Tier 2 levels, where purely parametric reasoning is insufficient, are the primary testbed for HExA. Throughout, we report two metrics: *solve rate* (% of seeds satisfying $\textsc{Goal}_{\ell}$, Eq. [1](https://arxiv.org/html/2606.29315#S3.E1)) for task success, and average turns per seed (with unsolved seeds counted at the 25-turn maximum) for efficiency.
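The two reported metrics can be computed from per-seed outcomes as in the sketch below. The input format, a list of `(solved, turns_used)` pairs, and the function names are assumptions made for illustration; the interpretation that an unsolved seed contributes the full 25-turn budget follows the capping described above.

```python
from typing import List, Tuple

Outcome = Tuple[bool, int]  # (solved, turns_used) for one seed; hypothetical format


def solve_rate(outcomes: List[Outcome]) -> float:
    """Percentage of seeds whose goal predicate was satisfied."""
    return 100.0 * sum(1 for solved, _ in outcomes if solved) / len(outcomes)


def average_turns(outcomes: List[Outcome], max_turns: int = 25) -> float:
    """Mean turns per seed, counting every unsolved seed at the turn cap."""
    turns = [min(used, max_turns) if solved else max_turns
             for solved, used in outcomes]
    return sum(turns) / len(turns)


# Four illustrative seeds (made-up values): two solved, two unsolved.
demo = [(True, 6), (False, 25), (True, 10), (False, 25)]
demo_rate = solve_rate(demo)      # 50.0
demo_turns = average_turns(demo)  # (6 + 25 + 10 + 25) / 4 = 16.5
```

Counting failures at the cap means the efficiency metric cannot be improved by giving up early, so a lower average reflects faster successful solves.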

### 4.3 Instantiation of HExA: Skill Initialization, Update, and Reward Labels

The skill bank in HExA has three orthogonal axes of instantiation (Algorithm [1](https://arxiv.org/html/2606.29315#algorithm1)): how it is *initialized* before experimentation, how it is *updated* between rounds, and whether skills carry interaction reward labels. We test three initializations: *Offline*, *Offline-to-Online* (*Off2On*), and *Online*. Both Offline and Off2On warm-start the bank from $N_0$ base-actor trajectories collected with no skill context, distilled into a fixed skill bank $\mathcal{K}_0 = \textsc{Evolver}(\emptyset, \mathcal{T}_0)$; Offline then freezes the bank, whereas Off2On keeps updating it after new experiments. Online instead starts empty ($\mathcal{K}_0 = \emptyset$) and bootstraps skills purely from the agent’s own interactions. For updates, we use two strategies: *Evolving* and *Iterative Replacement*. In Evolving, we pass the evolver both the current bank and the new round’s trajectories, and it adds, merges, revises, or prunes skills in light of the new evidence ($\mathcal{K}_n = \textsc{Evolver}(\mathcal{K}_{n-1}, \mathcal{T}^{(n)})$). *Iterative Replacement* instead rebuilds the bank from the current rollouts alone ($\mathcal{K}_n = \textsc{Evolver}(\emptyset, \mathcal{T}^{(n)})$); the rationale is that in the in-context RL loop the actor has already ingested prior skills, so the new skills carry their influence plus the novelty of the latest rollouts. We thus have five variants of skill bank initialization and update: Off2On or Online initialization $\times$ Evolving or Iterative Replacement update, plus the frozen Offline bank ($\mathcal{K}_n = \mathcal{K}_{n-1}$). We ablate these settings in Table [4](https://arxiv.org/html/2606.29315#S4.T4); *Off2On + Evolution* performs the best.
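The initialization and update variants above differ only in what is passed to the evolver at each round, which the following sketch makes explicit. It is a schematic under stated assumptions: the evolver is an LLM in the paper, while `toy_evolver` here is a set-union stand-in; skills and trajectories are plain strings; and the per-round trajectories are supplied up front rather than generated by an actor conditioned on the previous bank. `run_rounds` and its argument names are hypothetical.

```python
from typing import Callable, FrozenSet, List, Sequence

Bank = FrozenSet[str]
Evolver = Callable[[Bank, Sequence[str]], Bank]
EMPTY: Bank = frozenset()


def run_rounds(evolver: Evolver, rounds: Sequence[Sequence[str]], init: str,
               update: str = "evolving",
               warm_start: Sequence[str] = ()) -> List[Bank]:
    """Return [K_0, K_1, ..., K_n] for one initialization/update variant."""
    if init not in {"offline", "off2on", "online"}:
        raise ValueError(f"unknown init: {init}")
    if update not in {"evolving", "replacement"}:
        raise ValueError(f"unknown update: {update}")

    # Online starts empty; Offline and Off2On distill K_0 from T_0.
    bank = EMPTY if init == "online" else evolver(EMPTY, warm_start)
    history = [bank]
    for trajectories in rounds:
        if init == "offline":
            pass                                  # frozen: K_n = K_{n-1}
        elif update == "evolving":
            bank = evolver(bank, trajectories)    # K_n = Evolver(K_{n-1}, T^(n))
        else:
            bank = evolver(EMPTY, trajectories)   # K_n = Evolver(empty, T^(n))
        history.append(bank)
    return history


def toy_evolver(bank: Bank, trajectories: Sequence[str]) -> Bank:
    """Stand-in evolver: every trajectory label becomes a 'skill'."""
    return frozenset(bank) | frozenset(trajectories)


rounds = [["a"], ["b"]]
off2on_evolving = run_rounds(toy_evolver, rounds, "off2on", "evolving", ["w"])
online_replacement = run_rounds(toy_evolver, rounds, "online", "replacement")
offline_frozen = run_rounds(toy_evolver, rounds, "offline", warm_start=["w"])
```

In this toy setting the Evolving history grows monotonically while Iterative Replacement retains only the latest round; a real evolver can also merge, revise, or prune, so Evolving need not be purely additive.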

HExA can also ingest additional experimental evidence, such as trajectory rewards that are a function of outcome reward and efficiency (Appendix [A.3.1](https://arxiv.org/html/2606.29315#A1.SS3.SSS1)), to further guide skill evolution and retrieval. The skill reward label $r_k$ is in turn a function of the rewards of the individual trajectories that invoked the skill,

$$r_k = \operatorname{clamp}\!\left(\frac{\bar{r}_{\mathrm{src}} + 1}{2},\, 0.1,\, 1.0\right), \tag{3}$$

where $\bar{r}_{\mathrm{src}}$ is the mean reward of the source trajectories. Depending on the experimental setting, users can define any trajectory reward and skill reward; HExA as a framework is agnostic to these choices. We also tried skill evolution without rewards, in which the evolver LLM’s prior knowledge alone guides the skill bank update and the environment-grounded rewards are ignored. Our ablations in Table [11](https://arxiv.org/html/2606.29315#A4.T11) show that reward-labeled skill banks perform the best. Thus, our default experimental setting for HExA is Off2On + Evolution with reward-labeled skills.

ReAct (25 iterations, FAILURE) ![Refer to caption](https://arxiv.org/html/2606.29315v1/figures/sim_frames_seed45/react_step0.png)![Refer to caption](https://arxiv.org/html/2606.29315v1/figures/sim_frames_seed45/react_step30.png)![Refer to caption](https://arxiv.org/html/2606.29315v1/figures/sim_frames_seed45/react_step57.png)![Refer to caption](https://arxiv.org/html/2606.29315v1/figures/sim_frames_seed45/react_step70.png)![Refer to caption](https://arxiv.org/html/2606.29315v1/figures/sim_frames_seed45/react_step91.png)![Refer to caption](https://arxiv.org/html/2606.29315v1/figures/sim_frames_seed45/react_step151.png)![Refer to caption](https://arxiv.org/html/2606.29315v1/figures/sim_frames_seed45/react_step193.png)

HExA (6 iterations, SUCCESS) ![Refer to caption](https://arxiv.org/html/2606.29315v1/figures/sim_frames_seed45/hexa_step0.png)![Refer to caption](https://arxiv.org/html/2606.29315v1/figures/sim_frames_seed45/hexa_step25.png)![Refer to caption](https://arxiv.org/html/2606.29315v1/figures/sim_frames_seed45/hexa_step57.png)![Refer to caption](https://arxiv.org/html/2606.29315v1/figures/sim_frames_seed45/hexa_step74.png)![Refer to caption](https://arxiv.org/html/2606.29315v1/figures/sim_frames_seed45/hexa_step119.png)![Refer to caption](https://arxiv.org/html/2606.29315v1/figures/sim_frames_seed45/hexa_step140.png)![Refer to caption](https://arxiv.org/html/2606.29315v1/figures/sim_frames_seed45/hexa_step172.png)

Figure 4: How the evolved skill bank of HExA guides the agent to solve catapult. Top: ReAct, which has no cross-episode memory, exhausts its 25-iteration budget while repeatedly exploring ineffective placements. Bottom: HExA begins with the evolved bank summarized above. After observing that its initial launch strikes the ceiling, it applies the bank’s diagnosis that the limiting factor is launch geometry rather than insufficient force. It therefore shifts the drop point toward $x = 0.3$ instead of further increasing the radius, and succeeds with $(x, y, r) = (0.3, 0.9, 1.5)$ in six interaction iterations. Frames are sampled from the final simulations; full prompts, bank entries, and trajectories are provided in Appendix [E.4](https://arxiv.org/html/2606.29315#A5.SS4).
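Equation (3) can be sketched directly in code. The affine map $(\bar{r}_{\mathrm{src}} + 1)/2$ suggests trajectory rewards in $[-1, 1]$, but that range is an assumption of this sketch (the trajectory reward itself is defined in Appendix A.3.1), and the function name `skill_reward` is hypothetical.

```python
from typing import Sequence


def skill_reward(source_rewards: Sequence[float],
                 lo: float = 0.1, hi: float = 1.0) -> float:
    """r_k = clamp((mean(source_rewards) + 1) / 2, lo, hi), as in Eq. (3)."""
    mean_reward = sum(source_rewards) / len(source_rewards)
    return min(hi, max(lo, (mean_reward + 1.0) / 2.0))


# Worked values, assuming trajectory rewards lie in [-1, 1]:
best = skill_reward([1.0, 1.0])     # (1 + 1) / 2 = 1.0
neutral = skill_reward([0.0])       # (0 + 1) / 2 = 0.5
worst = skill_reward([-1.0])        # (-1 + 1) / 2 = 0.0, clamped up to 0.1
mixed = skill_reward([-1.0, 0.0])   # mean -0.5 -> 0.25
```

The lower clamp at 0.1 means that even a skill recovered from uniformly failed trajectories keeps a small positive score, consistent with the earlier remark that partial skills from failures carry lower scores but may still encode useful discoveries.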

| Method | Acc. (%) | Avg Iter |
| --- | --- | --- |
| Direct (no tools) | 2.0 | 1.0 |
| ReAct | 8.0 | 22.9 |
| Reflexion ($K = 2$) | 21.3 ± 2.5 | 21.2 ± 0.7 |
| HExA (no reward) | 50.7 ± 9.4 | 16.5 ± 2.7 |
| HExA (Off2On Evol.) | **67.3 ± 9.3** | 14.4 ± 1.8 |
![Refer to caption](https://arxiv.org/html/2606.29315v1/x2.png)

Figure 5: Reward-guided skill accumulation makes HExA much stronger on catapult, the hardest level. This experiment tests whether an agent can improve by turning past attempts into reusable, reward-guided skills. On catapult, Claude Sonnet solves only $8.0\%$ of seeds with a standard ReAct loop, while HExA reaches $67.3\%$ and uses fewer iterations per solve. Left: final solve rate and average iterations. Right: cumulative success over $50$ seeds; curves show the mean over $3$ runs with shaded $\pm 1$ std. Removing the reward signal drops HExA to $50.7\%$, and Reflexion reaches only $21.3\%$, showing that the main gain comes from scoring and reusing skills, not merely from extra interaction or verbal self-reflection. Full protocol in Appendix [C](https://arxiv.org/html/2606.29315#A3).

### 4.4 Results, Analysis, and Ablations

Table 2: HExA on open-weight solvers, 50 training/evaluation seeds per cell (max 25 iterations). Qwen-2.5 runs use the Online-Evolving configuration ($k = 3$ seeds per round, 17 rounds); GPT-OSS-120B uses Off2On Evolving. Baselines are the same agent with no skill injection. Per-seed cumulative trajectories in Figure [13](https://arxiv.org/html/2606.29315#A4.F13).

##### Do active experimentation and skill creation help agents solve complex tasks?

On levels where a single ReAct trajectory fails, we hypothesize that distilling interaction traces into an evolving skill bank should raise solve rates over the base model and over non-experimental agent baselines that accumulate no reusable skills (Direct, ReAct, Reflexion). We test this with the frontier model Claude Sonnet, as well as open-weight models such as Qwen-2.5-3B and 7B and GPT-OSS-120B, against the same non-experimental baselines. On the harder levels, where even the base Claude Sonnet model could barely make progress, HExA learns consistently via in-context experimentation: on catapult it reaches $67.3 \pm 9.3\%$ against $2.0\%$ (Direct), $8.0\%$ (ReAct), and $21.3 \pm 2.5\%$ (Reflexion) (Figure [5](https://arxiv.org/html/2606.29315#S4.F5) and Table [5](https://arxiv.org/html/2606.29315#S4.F5)), and on pass\_the\_parcel, which requires discovering the ramp-and-basket mechanism rather than a direct collision, it reaches $60.0\%$ against $24.0\%$, $16.0\%$, and $0.0\%$ respectively (Table [12](https://arxiv.org/html/2606.29315#A4.T12)). In Figure [4](https://arxiv.org/html/2606.29315#S4.F4) we show a representative catapult seed solved in a handful of iterations by reusing distilled placement skills while ReAct exhausts its budget. The same pattern holds for weaker open models: HExA improves every open-weight model we tested on every level (Table [2](https://arxiv.org/html/2606.29315#S4.T2)), raising Qwen-2.5-3B from $8.0 \to 24.0\%$ and Qwen-2.5-7B from $62.0 \to 72.0\%$ on down\_to\_earth, from $6.0 \to 14.0\%$ and $18.0 \to 34.0\%$ on two\_body\_problem, and GPT-OSS-120B from $0.0 \to 54.0\%$ on catapult. The improvement therefore comes from distilling past interactions into reusable strategies rather than from extra retries, and it also lets the agent solve tasks more efficiently, in fewer turns (full sweep in Figure [13](https://arxiv.org/html/2606.29315#A4.F13) and Appendix [D](https://arxiv.org/html/2606.29315#A4.SS0.SSS0.Px1)).

![Refer to caption](https://arxiv.org/html/2606.29315v1/x3.png) (a) Catapult
![Refer to caption](https://arxiv.org/html/2606.29315v1/x4.png) (b) Pass the Parcel

Figure 6: Cumulative average turns per seed for every HExA variant against the baseline (Claude Sonnet); lower is more efficient. On both levels the Off2On Evolving configuration converges to the lowest cost per seed, while the iterative and pure-online variants remain higher, indicating that HExA’s gains come from guiding search more efficiently with accumulated experience rather than from spending more computation per instance.

##### Can HExA leverage learned skills for more efficient exploration and experimentation?

In our task settings, simply spending more compute on interactions does not ensure higher task success. The agent benefits from active skill acquisition and reuse (Anthropic, [2025a](https://arxiv.org/html/2606.29315#bib.bib4)), which yields more successes and reaches them more efficiently. These compact skills in turn make learning on these complex tasks more efficient, requiring fewer interactions. We record the average number of iterations (turns) per seed as an efficiency proxy, and track the cumulative success and cumulative iteration curves across our experiments. HExA lowers the average iterations from $22.9$ (ReAct) and $21.2 \pm 0.7$ (Reflexion) to $14.4 \pm 1.8$, roughly $37\%$ fewer than ReAct, so each future episode becomes easier to tackle using the accumulated skills (Figure [5](https://arxiv.org/html/2606.29315#S4.F5) and Figure [6](https://arxiv.org/html/2606.29315#S4.F6)). HExA thus delivers substantially larger accuracy gains at equivalent inference compute. Similar efficiency gains also hold on pass\_the\_parcel. Reflexion, which adds verbal self-reflection but keeps no persistent skill bank to retrieve and reuse from, takes more trials than HExA. Self-reflection between trials therefore helps far less than converting past experience into an explicit, reusable skill bank, and this holds even when the reward labels are removed from the bank. The evolving skill bank in HExA thus supports structured exploration and amortizes the cost of future experimentation, improving both performance and efficiency.
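The "roughly 37% fewer" figure is the relative reduction in average iterations with respect to ReAct; the same calculation against Reflexion gives a somewhat smaller reduction. A short check using the averages reported above (the helper name `relative_reduction` is introduced here for illustration):

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


# Average iterations per seed on catapult, as reported in the text.
REACT_ITERS = 22.9
REFLEXION_ITERS = 21.2
HEXA_ITERS = 14.4

vs_react = round(100 * relative_reduction(REACT_ITERS, HEXA_ITERS), 1)
vs_reflexion = round(100 * relative_reduction(REFLEXION_ITERS, HEXA_ITERS), 1)
# vs_react is 37.1 and vs_reflexion is 32.1 (percent fewer iterations)
```

These are point estimates from the reported means and ignore the run-to-run spread ($\pm 1.8$ for HExA, $\pm 0.7$ for Reflexion).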

##### Which ingredients drive improvements in HExA?

Using the instantiation axes defined in Section[4\.3](https://arxiv.org/html/2606.29315#S4.SS3), we study how these aid learning from experimentation\.

*How do reward labels help learning via experimentation?* We run HExA with and without the reward labels in the skill bank. In the ablations without reward labels, every successful and failed trajectory, and every skill formed from it, is treated equally, and the LLM evolver implicitly updates the skill bank using its existing knowledge. On Qwen-2.5-7B the unguided, reward-free variant reaches $64\%$ versus $72\%$ with rewards on down\_to\_earth, and $26\%$ versus $34\%$ on two\_body\_problem (Table [4](https://arxiv.org/html/2606.29315#S4.T4)). On catapult, removing rewards drops Claude Sonnet from $67.3 \pm 9.3\%$ to $50.7 \pm 9.4\%$ (Figure [5](https://arxiv.org/html/2606.29315#S4.F5)). The reward labels thus provide environment feedback grounded in the LLM actor’s actual explorations and interactions. This feedback helps assign credit for effective and efficient skill acquisition, and it better guides the LLM evolver in composing, retaining, or pruning skills.

Table 3: Reward-signal ablation on Qwen-2.5-7B, 50 seeds per cell. Full HExA uses reward-guided, two-pass distillation; the no-reward variant treats successful trajectories equally.

Table 4: HExA variant accuracy (%) on the two hard levels with Claude Sonnet; Cat. = catapult; PtP = pass\_the\_parcel. Off2On Evolving ($x = 3$), referred to simply as HExA, is best on both.

*Does skill bank initialization matter?* With the update rule fixed, initializing the skill bank from an offline bank warm-starts learning, but the initialization matters less as more rounds of interaction accumulate (Section [4.3](https://arxiv.org/html/2606.29315#S4.SS3)). As Table [4](https://arxiv.org/html/2606.29315#S4.T4) shows, Off2On reaches $76\%$ and $60\%$ on catapult and pass\_the\_parcel versus $44\%$ and $58\%$ for a pure Online start. The larger gap on catapult arises because a cold start wastes early rounds rediscovering the basic dynamics of the level; the warm-start advantage shrinks once enough online experimentation has been done.

*What helps skill update more, evolving over the existing bank or rebuilding from scratch?* We ablate the two skill update variants, Evolving and Iterative Replacement (Section [4.3](https://arxiv.org/html/2606.29315#S4.SS3)), with the same initialization. Evolving the previous skill bank with new experimentation trajectories across seeds beats Iterative Replacement, where we rebuild the skill bank from scratch every round using new trajectories from the updated LLM actor. On pass\_the\_parcel the Evolving variant reaches $60\%$ accuracy versus $48\%$ for rebuilding from scratch (Table [4](https://arxiv.org/html/2606.29315#S4.T4)). Similarly, on catapult the Evolving update achieves $76\%$ versus $56\%$ for the iterative variant (Table [4](https://arxiv.org/html/2606.29315#S4.T4)). We also show the cumulative-accuracy curves for these ablations in Appendix Figure [11](https://arxiv.org/html/2606.29315#A4.F11). Gradually evolving the existing skill bank with evidence from new experimentation rounds builds a better bank, since re-distilling from scratch discards accumulated skills and amplifies overfitting. A key risk in any iterative skill-update loop is skill overfitting: the bank may accumulate increasingly narrow strategies that overfit to the particular instances seen so far, while discarding general principles that would transfer to new instances. The Evolving update mitigates this risk. Because trajectories in round $n$ are generated while conditioned on $\mathcal{G}_n = \textsc{Retriever}(\mathcal{K}_{n-1}, \ell, M, N)$, they explicitly reflect the influence of previously learned skills. The evolutionary update $\mathcal{K}_n = \pi_{\mathrm{ev}}(\mathcal{K}_{n-1}, \mathcal{T}^{(n)})$ can therefore refine an earlier rule, identify its boundary conditions, merge it with corroborating evidence, or replace it when contradicted by subsequent experience.

##### Does HExA learn hierarchical skills that build on one another, and do those skills transfer to unseen levels?

HExA accumulates skills that compose into increasingly abstract principles, because each round’s trajectories are generated while the actor is conditioned on previously retrieved skills, letting a new skill refine, bound, merge with, or overturn an earlier one.

![Refer to caption](https://arxiv.org/html/2606.29315v1/x5.png)

Figure 7: Cross-level skill transfer with no target trajectories. Each target is solved using only skills synthesized from source-level banks; labels report target success and matched ReAct baselines. Multi-source transfer to catapult yields the largest gain (+36 pp).

Qualitatively, the evolved catapult bank shows this hierarchical nature (Appendix [E.4](https://arxiv.org/html/2606.29315#A5.SS4)): experiments at $x=0.5$ where $r=1.0$ fell short and $r=1.5$ struck the ceiling are not stored as separate parameter settings but consolidated into a higher-level rule, "when radius tuning at fixed $x$ alternates between falling short and hitting the ceiling, stop tuning $r$ and shift $x$ toward $0.1$–$0.3$ to flatten the launch arc", which compresses the many $(x, y, r)$ trials into a physical principle plus a policy for when to abandon local search. The example in Figure [4](https://arxiv.org/html/2606.29315#S4.F4) applies exactly these hierarchical skills (default launch → diagnosed ceiling failure → corrective geometry shift) to succeed in just six iterations.

If the skill banks encode not only level-specific recipes but also abstract principles, those principles should transfer to unseen levels and help experimentation there as well. We therefore test transfer by synthesizing a target skill bank from easier-level banks plus only a textual description of the unseen harder target, and injecting it with no further evolution; no target-level trajectories are used at any stage. Combining evolved banks from three source levels (down\_to\_earth, two\_body\_problem, pass\_the\_parcel) improves Claude Sonnet on catapult from a 2.0% solve rate with the base model and 8.0% with active level interaction via ReAct to 44.0% with HExA on the transferred skill bank (Figure [7](https://arxiv.org/html/2606.29315#S4.F7) and Table [13](https://arxiv.org/html/2606.29315#A4.T13)). Transfer also helps with a single source bank and a smaller actor model like Qwen 7B: transfer on a structurally similar pair (down\_to\_earth → falling\_into\_place) yields a gain of +12% in success rate, and the dissimilar pair (down\_to\_earth → two\_body\_problem) still gains +16%. These transfers across dissimilar levels, without any online interaction in the held-out test levels, indicate that the evolver extracts generalizable abstract principles (momentum transfer, contact geometry, directional impulse) rather than narrow level-specific heuristics. Thus HExA's hierarchical skills and evolution methods help acquire reusable, transferable knowledge that generalizes even to unseen complex domains through exploration and experimentation.
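To make the consolidated catapult rule concrete, the sketch below encodes it as a small predicate over past trials. It is an illustration under stated assumptions rather than the skill format HExA stores (skills in the paper are natural-language text): the `Trial` record, the outcome labels `"short"` and `"ceiling"`, and the choice to clamp $x$ to the nearest end of the $0.1$–$0.3$ range are all hypothetical. "Alternates" is simplified here to "both failure modes have been observed at this $x$".

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trial:
    x: float       # horizontal placement of the launched object
    r: float       # radius tried at that placement
    outcome: str   # hypothetical labels: "short", "ceiling", or "success"


def should_abandon_radius_search(trials: List[Trial], x: float, tol: float = 1e-6) -> bool:
    """True when radius tuning at a fixed x has produced both failure modes and no success."""
    outcomes = {t.outcome for t in trials if abs(t.x - x) <= tol}
    return {"short", "ceiling"} <= outcomes and "success" not in outcomes


def next_x(trials: List[Trial], x: float, x_range: Tuple[float, float] = (0.1, 0.3)) -> float:
    """Keep x unless the rule fires; otherwise move x to the nearest point in x_range."""
    if should_abandon_radius_search(trials, x):
        lo, hi = x_range
        return min(max(x, lo), hi)
    return x


# The two experiments quoted in the text: r = 1.0 fell short, r = 1.5 hit the ceiling.
trials = [
    Trial(x=0.5, r=1.0, outcome="short"),
    Trial(x=0.5, r=1.5, outcome="ceiling"),
]
shifted = next_x(trials, x=0.5)
```

The point of the sketch is the compression: two parameter-specific failures become one condition for when to stop local search over $r$, plus a direction in which to move $x$.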

![Refer to caption](https://arxiv.org/html/2606.29315v1/x6.png)

Figure 8: In-context skill evolution learns faster than GRPO at the same interaction budget. We compare HExA with GRPO fine-tuning on Qwen-2.5-3B using 50 training seeds. The green star marks HExA's success rate after the same 50-seed budget (Down to Earth: 24%, Two Body Problem: 14%), while the dashed line marks the matched-budget GRPO checkpoint. In this low-data regime, HExA outperforms both GRPO variants, suggesting that reusable in-context skills provide faster early adaptation; with many more updates, GRPO can eventually close the gap through direct reward optimization.

##### How does performance of HExA compare to RL fine-tuning with weight updates?

Finally, we ask how learning via in-context experimentation with skill evolution compares to training the model with RL in terms of sample efficiency and performance. We compare against GRPO (Shao et al., [2024](https://arxiv.org/html/2606.29315#bib.bib46)) fine-tuning of Qwen-2.5-3B on the same levels, matched on the number of model updates and on 50 unique environment seeds. At this budget GRPO reaches 20% on down\_to\_earth and 6% on two\_body\_problem, versus HExA's 24% and 14% (Figure [8](https://arxiv.org/html/2606.29315#S4.F8)). The gap reflects the core advantage of in-context skill evolution: strategies distilled from early trajectories are immediately available to later episodes through context, whereas gradient-based methods must accumulate signal over many rollouts before the weights encode equivalent knowledge. In harder domains where early successes are rare, HExA can therefore still learn from failure and partial-success trajectories through guided experimentation. With many more updates, GRPO can eventually close the gap through direct optimization on the environment reward. Combining in-context learning from experimentation, to bootstrap initial progress on complex domains, with later stages of gradient-based RL post-training could therefore be a robust learning method. We present more details of these ablations in Appendix [D](https://arxiv.org/html/2606.29315#A4.SS0.SSS0.Px2).
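One way to see why gradient-based learning struggles when early successes are rare is to look at the group-relative advantage used by GRPO (Shao et al., 2024), which normalizes each rollout's reward by the mean and standard deviation of its group. The sketch below is a minimal illustration: the function name, the `eps` constant, and the use of the population standard deviation are assumptions, and the reward values are made up; the paper's actual GRPO configuration is described in its Appendix C.4 and is not reproduced here.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group in which every rollout fails: all advantages are zero, so this
# group contributes no learning signal to the policy-gradient update.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# A group with a single success: only now do the advantages become
# informative, positive for the success and negative for the failures.
one_success = group_relative_advantages([0.0, 0.0, 0.0, 1.0])
```

By contrast, a failed trajectory can still be read by an in-context evolver and turned into a textual skill, which is the asymmetry the paragraph above describes.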

## 5 Conclusion, Limitations, & Future Works

We introduced HExA, a training-free in-context RL framework for LLM agents to learn via experimentation. A single LLM actively designs experiments and gathers query-relevant trajectories, distills hierarchical skills from this interaction experience and its reward feedback, and reuses those skills across tasks, all without parameter updates, offline data, or external supervision. We also present Interphyre, a physical reasoning benchmark that enables learning via experimentation through tool-call APIs that let the agent inspect the scene, test interventions, and observe intermediate dynamics. On Interphyre, HExA significantly outperforms the base models, other agentic baselines like ReAct and Reflexion, and gradient-based GRPO under matched interaction budgets, achieving strong gains on the hardest levels while reducing per-seed iteration cost. Its evolved skill banks transfer zero-shot to unseen levels, helping make progress without any target-level interaction, and the same mechanism improves both frontier and open-weight models. The matched-budget comparison with GRPO shows that a strategy HExA discovers becomes usable by the next episode immediately through context, whereas a gradient-based learner must first encode it into weights over many rollouts; in-context skill evolution therefore amortizes exploration more efficiently than weight updates when interaction data is scarce. This also points to a practical recipe, namely using HExA to bootstrap capabilities in a novel domain before any downstream post-training.

We want to highlight that HExA is not tied to any particular environment: it is a general experiment-centric in-context self-improvement algorithm that operates only on interaction trajectories, scalar rewards, and natural-language skills, none of which assume a particular domain, so in principle it applies to any domain or environment where learning requires active experimentation. However, our evaluation is currently confined to 2D physics through the Interphyre benchmark. Demonstrating HExA on other experimentation-centric domains, such as scientific data analysis or interactive coding, is therefore a valuable direction for future work. HExA also depends on an LLM-based evolver for contrastive skill consolidation, so skill quality is bounded to some extent by the evolver's reasoning capability, and its current binary success metric with an efficiency term may not extend directly to domains without concise success criteria. Finally, HExA pays a per-round overhead for evolver calls in addition to actor calls; although this yields strong sample efficiency in low-data regimes, it remains an open question whether its asymptotic accuracy ceiling matches that of gradient-based RL methods such as GRPO at much larger interaction budgets. Future work should therefore study cross-level and cross-domain skill transfer and the framework's ability to meta-learn in out-of-distribution domains, improve the structure and scalable reusability of skill banks, and explore hybrid methods that use HExA and its learned skills to bootstrap exploration before refining policies through parametric RL updates of the agent, consolidating the gains learned in novel complex domains.

## Acknowledgments

This work has taken place in the Safe, Correct, and Aligned Learning and Robotics Lab \(SCALAR\) at The University of Massachusetts Amherst\. SCALAR research is supported in part by the NSF \(IIS\-2437426\), the Long\-Term Future Fund, and Open Philanthropy\. Scott Niekum holds concurrent appointments as an Associate Professor at the University of Massachusetts Amherst and as an Amazon Scholar\. This paper describes work performed at the University of Massachusetts Amherst and was funded by a gift from Amazon\.

## References

- Ahmed et al. (2021) Ossama Ahmed, Frederik Träuble, Anirudh Goyal, et al. CausalWorld: A robotic manipulation benchmark for causal structure and transfer learning. In *International Conference on Learning Representations (ICLR)*, 2021. URL [https://arxiv.org/abs/2010.04296](https://arxiv.org/abs/2010.04296).
- Allen et al. (2020) Kelsey R. Allen, Kevin A. Smith, and Joshua B. Tenenbaum. Rapid trial-and-error learning with simulation supports flexible tool use and physical reasoning. *Proceedings of the National Academy of Sciences*, 117(47):29302–29310, 2020. URL [https://arxiv.org/abs/1907.09620](https://arxiv.org/abs/1907.09620).
- Alzubi et al. (2026) Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, and Tu Vu. Evoskill: Automated skill discovery for multi-agent systems. *arXiv preprint arXiv:2603.02766*, 2026.
- Anthropic (2025a) Anthropic. Equipping agents for the real world with agent skills. 2025a. URL [https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills](https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills).
- Anthropic (2025b) Anthropic. Introduction to agent skills. 2025b. URL [https://anthropic.skilljar.com/introduction-to-agent-skills](https://anthropic.skilljar.com/introduction-to-agent-skills).
- Anthropic (2026) Anthropic. System card: Claude opus 4.6. 2026. URL [https://www-cdn.anthropic.com/0dd865075ad3132672ee0ab40b05a53f14cf5288.pdf](https://www-cdn.anthropic.com/0dd865075ad3132672ee0ab40b05a53f14cf5288.pdf).
- Ates et al. (2022) Tayfun Ates, Muhammed Ali Atesoglu, Cansu Yigit, et al. CRAFT: A benchmark for causal reasoning about forces and interactions. In *Findings of the Association for Computational Linguistics: ACL 2022*, 2022. URL [https://aclanthology.org/2022.findings-acl.205/](https://aclanthology.org/2022.findings-acl.205/).
- Bakhtin et al. (2019) Anton Bakhtin, Laurens van der Maaten, Justin Johnson, Laura Gustafson, and Ross Girshick. PHYRE: A new benchmark for physical reasoning. In *Advances in Neural Information Processing Systems (NeurIPS)*, 2019. URL [https://arxiv.org/abs/1908.05656](https://arxiv.org/abs/1908.05656).
- Baradel et al. (2020) Fabien Baradel, Natalia Neverova, Julien Mille, Greg Mori, and Christian Wolf. CoPhy: Counterfactual learning of physical dynamics. In *International Conference on Learning Representations (ICLR)*, 2020. URL [https://arxiv.org/abs/1909.12000](https://arxiv.org/abs/1909.12000).
- Bogdan et al. (2025) Paul C. Bogdan, Uzay Macar, Neel Nanda, and Arthur Conmy. Thought anchors: Which LLM reasoning steps matter?, 2025. URL [https://arxiv.org/abs/2506.19143](https://arxiv.org/abs/2506.19143).
- Chen et al. (2025) Yuxuan Chen, Piotr Piękos, Mateusz Ostaszewski, Firas Laakom, and Jürgen Schmidhuber. PhysGym. In *Advances in Neural Information Processing Systems (NeurIPS), Datasets & Benchmarks Track*, 2025. URL [https://arxiv.org/abs/2507.15550](https://arxiv.org/abs/2507.15550).
- Cherian et al. (2024) Anoop Cherian, Radu Corcodel, Siddarth Jain, and Diego Romeres. LLMPhy: Complex physical reasoning using large language models and world models, 2024. URL [https://arxiv.org/abs/2411.08027](https://arxiv.org/abs/2411.08027).
- Chow et al. (2025) Wei Chow, Jiajun Mao, Bowen Li, Daniel Seita, Vitor Guizilini, and Yue Wang. PhysBench: Benchmarking and enhancing vision-language models for physical world understanding. In *International Conference on Learning Representations (ICLR)*, 2025. URL [https://arxiv.org/abs/2501.16411](https://arxiv.org/abs/2501.16411).
- Demircan et al. (2025) Can Demircan, Tankred Saanum, Akshay K. Jagadish, Marcel Binz, and Eric Schulz. Sparse autoencoders reveal temporal difference learning in large language models. In *International Conference on Learning Representations (ICLR)*, 2025. URL [https://arxiv.org/abs/2410.01280](https://arxiv.org/abs/2410.01280).
- Dong et al. (2024) Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Baobao Chang, et al. A survey on in-context learning. In *Proceedings of the 2024 conference on empirical methods in natural language processing*, pp. 1107–1128, 2024.
- Dou et al. (2026) Shihan Dou, Ming Zhang, Zhangyue Yin, Chenhao Huang, Yujiong Shen, Junzhe Wang, Jiayi Chen, Yuchen Ni, Junjie Ye, Cheng Zhang, Huaibing Xie, Jianglu Hu, Shaolei Wang, Weichao Wang, Yanling Xiao, Yiting Liu, Zenan Xu, Zhen Guo, Pluto Zhou, Tao Gui, Zuxuan Wu, Xipeng Qiu, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang, Di Wang, and Shunyu Yao. CL-Bench: A benchmark for context learning. *arXiv preprint arXiv:2602.03587*, 2026. URL [https://arxiv.org/abs/2602.03587](https://arxiv.org/abs/2602.03587).
- Foley et al. (2018) John Foley, Emma Tosch, Kaleigh Clary, and David Jensen. Toybox: Better atari environments for testing reinforcement learning agents, 2018. URL [https://arxiv.org/abs/1812.02850](https://arxiv.org/abs/1812.02850). NeurIPS 2018 Workshop on Systems for ML.
- Gao et al. (2025) Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, Zhengyang Tang, Benyou Wang, Daoguang Zan, Shanghaoran Quan, Ge Zhang, Lei Sha, Yichang Zhang, Xuancheng Ren, Tianyu Liu, and Baobao Chang. Omni-MATH: A universal olympiad level mathematic benchmark for large language models. In *The Thirteenth International Conference on Learning Representations*, 2025. URL [https://openreview.net/forum?id=yaqPf0KAlN](https://openreview.net/forum?id=yaqPf0KAlN).
- Gao et al. (2024) Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey, 2024. URL [https://arxiv.org/abs/2312.10997](https://arxiv.org/abs/2312.10997).
- Gunjal et al. (2025) Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains. *arXiv preprint arXiv:2507.17746*, 2025.
- Guo et al. (2025) Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Honghui Ding, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jingchang Chen, Jingyang Yuan, Jinhao Tu, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaichao You, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingxu Zhou, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X. Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, and Zhen Zhang. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. *Nature*, 645(8081):633–638, September 2025. ISSN 1476-4687. doi: 10.1038/s41586-025-09422-z. URL [http://dx.doi.org/10.1038/s41586-025-09422-z](http://dx.doi.org/10.1038/s41586-025-09422-z).
- He et al. (2024) Zhengfu He, Xuyang Ge, Qiong Tang, et al. Dictionary learning improves patch-free circuit discovery in mechanistic interpretability: A case study on othello-GPT, 2024. URL [https://arxiv.org/abs/2402.12201](https://arxiv.org/abs/2402.12201).
- Hughes et al. (2024) Edward Hughes, Michael Dennis, Jack Parker-Holder, Feryal Behbahani, Aditi Mavalankar, Yuge Shi, Tom Schaul, and Tim Rocktaschel. Open-endedness is essential for artificial superhuman intelligence. *arXiv preprint arXiv:2406.04268*, 2024.
- Jiang et al. (2025) Dongfu Jiang, Yi Lu, Zhuofeng Li, Zhiheng Lyu, Ping Nie, Haozhe Wang, Alex Su, Hui Chen, Kai Zou, Chao Du, Tianyu Pang, and Wenhu Chen. Verltool: Towards holistic agentic reinforcement learning with tool use, 2025. URL [https://arxiv.org/abs/2509.01055](https://arxiv.org/abs/2509.01055).
- Jimenez et al. (2024) Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world github issues? In *The Twelfth International Conference on Learning Representations*, 2024. URL [https://openreview.net/forum?id=VTF8yNQM66](https://openreview.net/forum?id=VTF8yNQM66).
- Karvonen (2024) Adam Karvonen. Emergent world models and latent variable estimation in chess-playing language models. In *Conference on Language Modeling (COLM)*, 2024. URL [https://arxiv.org/abs/2403.15498](https://arxiv.org/abs/2403.15498).
- Laskin et al. (2022) Michael Laskin, Luyu Wang, Junhyuk Oh, Emilio Parisotto, Stephen Spencer, Richie Steigerwald, DJ Strouse, Steven Hansen, Angelos Filos, Ethan Brooks, et al. In-context reinforcement learning with algorithm distillation. *arXiv preprint arXiv:2210.14215*, 2022.
- Li et al. (2026a) Hao Li, Chunjiang Mu, Jianhao Chen, Siyue Ren, Zhiyao Cui, Yiqun Zhang, Lei Bai, and Shuyue Hu. Organizing, orchestrating, and benchmarking agent skills at ecosystem scale, 2026a. URL [https://arxiv.org/abs/2603.02176](https://arxiv.org/abs/2603.02176).
- Li et al. (2023) Kenneth Li, Aspen K. Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Emergent world representations: Exploring a sequence model trained on a synthetic task. In *International Conference on Learning Representations (ICLR)*, 2023. URL [https://arxiv.org/abs/2210.13382](https://arxiv.org/abs/2210.13382).
- Li et al. (2024) Shiqian Li, Kewen Wu, Chi Zhang, and Yixin Zhu. I-PHYRE: Interactive physical reasoning. In *International Conference on Learning Representations (ICLR)*, 2024. URL [https://arxiv.org/abs/2312.03009](https://arxiv.org/abs/2312.03009).
- Li et al. (2026b) Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, Shuyi Wang, Binxu Li, Qunhong Zeng, Di Wang, Xuandong Zhao, Yuanli Wang, Roey Ben Chaim, Zonglin Di, Yipeng Gao, Junwei He, Yizhuo He, Liqiang Jing, Luyang Kong, Xin Lan, Jiachen Li, Songlin Li, Yijiang Li, Yueqian Lin, Xinyi Liu, Xuanqing Liu, Haoran Lyu, Ze Ma, Bowei Wang, Runhui Wang, Tianyu Wang, Wengao Ye, Yue Zhang, Hanwen Xing, Yiqi Xue, Steven Dillmann, and Han chung Lee. Skillsbench: Benchmarking how well agent skills work across diverse tasks, 2026b. URL [https://arxiv.org/abs/2602.12670](https://arxiv.org/abs/2602.12670).
- Liang et al. (2026) Yuan Liang, Ruobin Zhong, Haoming Xu, Chen Jiang, Yi Zhong, Runnan Fang, Jia-Chen Gu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, Xin Xu, Tongtong Wu, Kun Wang, Yang Liu, Zhen Bi, Jungang Lou, Yuchen Eleanor Jiang, Hangcheng Zhu, Gang Yu, Haiwen Hong, Longtao Huang, Hui Xue, Chenxi Wang, Yijun Wang, Zifei Shan, Xi Chen, Zhaopeng Tu, Feiyu Xiong, Xin Xie, Peng Zhang, Zhengke Gui, Lei Liang, Jun Zhou, Chiyu Wu, Jin Shang, Yu Gong, Junyu Lin, Changliang Xu, Hongjie Deng, Wen Zhang, Keyan Ding, Qiang Zhang, Fei Huang, Ningyu Zhang, Jeff Z. Pan, Guilin Qi, Haofen Wang, and Huajun Chen. Skillnet: Create, evaluate, and connect ai skills, 2026. URL [https://arxiv.org/abs/2603.04448](https://arxiv.org/abs/2603.04448).
- Lu et al. (2026) Zhengxi Lu, Zhiyuan Yao, Jinyang Wu, Chengcheng Han, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, and Yongliang Shen. SKILL0: In-context agentic reinforcement learning for skill internalization. *arXiv preprint arXiv:2604.02268*, 2026. URL [https://arxiv.org/abs/2604.02268](https://arxiv.org/abs/2604.02268).
- Luo et al. (2025) Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning, 2025. URL [https://arxiv.org/abs/2308.08747](https://arxiv.org/abs/2308.08747).
- Ma et al. (2026a) Yingwei Ma, Yue Liu, Xinlong Yang, Yanhao Li, Kelin Fu, Yibo Miao, Yuchong Xie, Zhexu Wang, and Shing-Chi Cheung. Scaling coding agents via atomic skills, 2026a. URL [https://arxiv.org/abs/2604.05013](https://arxiv.org/abs/2604.05013).
- Ma et al. (2026b) Ziyu Ma, Shidong Yang, Yuxiang Ji, Xucong Wang, Yong Wang, Yiming Hu, Tongwen Huang, and Xiangxiang Chu. Skillclaw: Let skills evolve collectively with agentic evolver, 2026b. URL [https://arxiv.org/abs/2604.08377](https://arxiv.org/abs/2604.08377).
- Matthews et al. (2025) Michael Matthews, Michael Beukman, Chris Lu, and Jakob Foerster. Kinetix: Investigating the training of general agents through open-ended physics-based control tasks. In *International Conference on Learning Representations (ICLR)*, 2025. URL [https://arxiv.org/abs/2410.23208](https://arxiv.org/abs/2410.23208). Oral presentation.
- Mei et al. (2025) Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, Chenlin Zhou, Jiayi Mao, Tianze Xia, Jiafeng Guo, and Shenghua Liu. A survey of context engineering for large language models, 2025. URL [https://arxiv.org/abs/2507.13334](https://arxiv.org/abs/2507.13334).
- Moeini et al. (2025) Amir Moeini, Jiuqi Wang, Jacob Beck, Ethan Blaser, Shimon Whiteson, Rohan Chandra, and Shangtong Zhang. A survey of in-context reinforcement learning. *arXiv preprint arXiv:2502.07978*, 2025.
- OpenAI (2025) OpenAI. Gpt-5 technical report. 2025. URL [https://cdn.openai.com/gpt-5-system-card.pdf](https://cdn.openai.com/gpt-5-system-card.pdf).
- Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), *Advances in Neural Information Processing Systems*, volume 35, pp. 27730–27744. Curran Associates, Inc., 2022.
- Pearl (2009) Judea Pearl. *Causality: Models, Reasoning, and Inference*. Cambridge University Press, Cambridge, 2nd edition, 2009.
- Qiu et al. (2026) Libin Qiu, Zhirong Gao, Junfu Chen, Yuhang Ye, Weizhi Huang, Xiaobo Xue, Wenkai Qiu, and Shuo Tang. Autorefine: From trajectories to reusable expertise for continual llm agent refinement. *arXiv preprint arXiv:2601.22758*, 2026.
- Rafailov et al. (2023) Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: your language model is secretly a reward model. In *Proceedings of the 37th International Conference on Neural Information Processing Systems*, NIPS ’23, Red Hook, NY, USA, 2023. Curran Associates Inc.
- Schick et al. (2023) Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In *Advances in Neural Information Processing Systems (NeurIPS)*, 2023.
- Shao et al. (2024) Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. *arXiv preprint arXiv:2402.03300*, 2024.
- Shi et al. (2025) Haizhou Shi, Zihao Xu, Hengyi Wang, Weiyi Qin, Wenyuan Wang, Yibin Wang, Zifeng Wang, Sayna Ebrahimi, and Hao Wang. Continual learning of large language models: A comprehensive survey. *ACM Computing Surveys*, 58(5):1–42, 2025.
- Shinn et al. (2023) Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. *Advances in Neural Information Processing Systems*, 36:8634–8652, 2023.
- Si et al. (2023) Shuzheng Si, Wentao Ma, Haoyu Gao, Yuchuan Wu, Ting-En Lin, Yinpei Dai, Hangyu Li, Rui Yan, Fei Huang, and Yongbin Li. SpokenWOZ: A large-scale speech-text benchmark for spoken task-oriented dialogue agents. In *Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track*, 2023. URL [https://openreview.net/forum?id=viktK3nO5b](https://openreview.net/forum?id=viktK3nO5b).
- Silver & Sutton (2025) David Silver and Richard S Sutton. Welcome to the era of experience. *Google AI*, 1:11, 2025.
- Spelke & Kinzler (2007) Elizabeth S Spelke and Katherine D Kinzler. Core knowledge. *Developmental science*, 10(1):89–96, 2007.
- Spies et al. (2025) Alex F. Spies, William Edwards, Michael I. Ivanitskiy, et al. Transformers use causal world models in maze-solving tasks. In *International Conference on Learning Representations (ICLR)*, 2025. URL [https://arxiv.org/abs/2412.11867](https://arxiv.org/abs/2412.11867).
- Team et al. (2023) Adaptive Agent Team, Jakob Bauer, Kate Baumli, Satinder Baveja, Feryal Behbahani, Avishkar Bhoopchand, Nathalie Bradley-Schmieg, Michael Chang, Natalie Clay, Adrian Collister, et al. Human-timescale adaptation in an open-ended task space. *arXiv preprint arXiv:2301.07608*, 2023.
- Team et al. (2024) Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. *arXiv preprint arXiv:2312.11805*, 2024.
- Team et al. (2021) Open Ended Learning Team, Adam Stooke, Anuj Mahajan, Catarina Barros, Charlie Deck, Jakob Bauer, Jakub Sygnowski, Maja Trebacz, Max Jaderberg, Michael Mathieu, Nat McAleese, Nathalie Bradley-Schmieg, Nathaniel Wong, Nicolas Porcel, Roberta Raileanu, Steph Hughes-Fitt, Valentin Dalibard, and Wojciech Marian Czarnecki. Open-ended learning leads to generally capable agents, 2021. URL [https://arxiv.org/abs/2107.12808](https://arxiv.org/abs/2107.12808).
- Team (2026) Qwen Team. Qwen3.5: Towards native multimodal agents. *URL: https://qwen.ai/blog*, 2026.
- Towers et al. (2024) Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U. Balis, Gianluca De Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Markus Krimmel, Arjun KG, et al. Gymnasium: A standard interface for reinforcement learning environments. *arXiv preprint arXiv:2407.17032*, 2024. URL [https://arxiv.org/abs/2407.17032](https://arxiv.org/abs/2407.17032).
- Viswanathan et al. (2026) Vijay Viswanathan, Shiqi Wang, Devamanyu Hazarika, Chirag Nagpal, Tongshuang Wu, Graham Neubig, and Yuning Mao. Discretizing reward models, 2026. URL [https://arxiv.org/abs/2606.21795](https://arxiv.org/abs/2606.21795).
- Wang et al. (2026a) Chenxi Wang, Zhuoyun Yu, Xin Xie, Wuguannan Yao, Runnan Fang, Shuofei Qiao, Kexin Cao, Guozhou Zheng, Xiang Qi, Peng Zhang, and Shumin Deng. SkillX: Automatically constructing skill knowledge bases for agents. *arXiv preprint arXiv:2604.04804*, 2026a. URL [https://arxiv.org/abs/2604.04804](https://arxiv.org/abs/2604.04804).
- Wang et al. (2026b) Jiayu Wang, Yifei Ming, Zixuan Ke, Shafiq Joty, Aws Albarghouthi, and Frederic Sala. Skillorchestra: Learning to route agents via skill transfer, 2026b. URL [https://arxiv.org/abs/2602.19672](https://arxiv.org/abs/2602.19672).
- Wang et al. (2023) Yi R. Wang, Jiafei Duan, Dieter Fox, and Siddhartha Srinivasa. NEWTON: Are large language models capable of physical reasoning? In *Findings of the Association for Computational Linguistics: EMNLP 2023*, 2023. URL [https://arxiv.org/abs/2310.07018](https://arxiv.org/abs/2310.07018).
- Wang et al. (2026c) Zhaoyang Wang, Qianhui Wu, Xuchao Zhang, Chaoyun Zhang, Wenlin Yao, Fazle Elahi Faisal, Baolin Peng, Si Qin, Suman Nath, Qingwei Lin, Chetan Bansal, Dongmei Zhang, Saravan Rajmohan, Jianfeng Gao, and Huaxiu Yao. Webxskill: Skill learning for autonomous web agents, 2026c. URL [https://arxiv.org/abs/2604.13318](https://arxiv.org/abs/2604.13318).
- Xia et al. (2026) Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, Zeyu Zheng, Cihang Xie, and Huaxiu Yao. Skillrl: Evolving agents via recursive skill-augmented reinforcement learning, 2026. URL [https://arxiv.org/abs/2602.08234](https://arxiv.org/abs/2602.08234).
- Xu et al. (2026) Xinrun Xu, Pi Bu, Ye Wang, Börje F Karlsson, Ziming Wang, Tengtao Song, Qi Zhu, Jun Song, Zhiming Ding, and Bo Zheng. Deepphy: Benchmarking agentic vlms on physical reasoning. In *Proceedings of the AAAI Conference on Artificial Intelligence*, volume 40, pp. 34160–34168, 2026.
- Yang et al. (2026) Yutao Yang, Junsong Li, Qianjun Pan, Bihao Zhan, Yuxuan Cai, Lin Du, Jie Zhou, Kai Chen, Qin Chen, Xin Li, Bo Zhang, and Liang He. AutoSkill: Experience-driven lifelong learning via skill self-evolution. *arXiv preprint arXiv:2603.01145*, 2026. URL [https://arxiv.org/abs/2603.01145](https://arxiv.org/abs/2603.01145).
- Yao et al. (2022) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In *The eleventh international conference on learning representations*, 2022.
- Yao et al. (2023a) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2023a. URL [https://arxiv.org/abs/2210.03629](https://arxiv.org/abs/2210.03629).
- Yao et al. (2023b) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In *The Eleventh International Conference on Learning Representations*, 2023b. URL [https://openreview.net/forum?id=WE\_vluYUL-X](https://openreview.net/forum?id=WE_vluYUL-X).
- Ying et al. (2025) Lance Ying, Katherine M Collins, Prafull Sharma, Cedric Colas, Kaiya Ivy Zhao, Adrian Weller, Zenna Tavares, Phillip Isola, Samuel J Gershman, Jacob D Andreas, et al. Assessing adaptive world models in machines with novel games. *arXiv preprint arXiv:2507.12821*, 2025.
- Zhang et al. (2026) Hanrong Zhang, Shicheng Fan, Henry Peng Zou, Yankai Chen, Zhenting Wang, Jiayu Zhou, Chengze Li, Wei-Chieh Huang, Yifei Yao, Kening Zheng, Xue Liu, Xiaoxiao Li, and Philip S. Yu. Coevoskills: Self-evolving agent skills via co-evolutionary verification, 2026. URL [https://arxiv.org/abs/2604.01687](https://arxiv.org/abs/2604.01687).
- Zheng et al. (2024) Yanming Zheng, Mingyu Yan, Jingzhou Chen, et al. ContPhy: Continuum physical concept learning and reasoning from videos. In *International Conference on Machine Learning (ICML)*, 2024. URL [https://proceedings.mlr.press/v235/zheng24l.html](https://proceedings.mlr.press/v235/zheng24l.html).
- Ziegler et al. (2020) Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences, 2020. URL [https://arxiv.org/abs/1909.08593](https://arxiv.org/abs/1909.08593).

###### Appendix Contents

1. [1 Introduction](https://arxiv.org/html/2606.29315#S1)
2. [2 Related Works](https://arxiv.org/html/2606.29315#S2)
3. [3 Hierarchical Experimentalist Agents: Learning via Experimentation](https://arxiv.org/html/2606.29315#S3)
4. [4 Experiments and Results on Interphyre](https://arxiv.org/html/2606.29315#S4)
    1. [4.1 Interphyre: A Testbed for Learning through Experimentation](https://arxiv.org/html/2606.29315#S4.SS1)
    2. [4.2 Experimental Setup, Evaluations, and Baselines](https://arxiv.org/html/2606.29315#S4.SS2)
    3. [4.3 Instantiation of HExA – Skill Initialization, Update, and Reward Labels](https://arxiv.org/html/2606.29315#S4.SS3)
    4. [4.4 Results, Analysis, and Ablations](https://arxiv.org/html/2606.29315#S4.SS4)
5. [5 Conclusion, Limitations, & Future Works](https://arxiv.org/html/2606.29315#S5)
6. [References](https://arxiv.org/html/2606.29315#bib)
7. [A HExA: More Methodological Details](https://arxiv.org/html/2606.29315#A1)
    1. [A.1 Overview](https://arxiv.org/html/2606.29315#A1.SS1)
    2. [A.2 Problem Formulation](https://arxiv.org/html/2606.29315#A1.SS2)
    3. [A.3 The HExA Framework](https://arxiv.org/html/2606.29315#A1.SS3)
    4. [A.4 Cross-Task Skill Transfer](https://arxiv.org/html/2606.29315#A1.SS4)
    5. [A.5 Design Principles](https://arxiv.org/html/2606.29315#A1.SS5)
8. [B Tool Descriptions](https://arxiv.org/html/2606.29315#A2)
    1. [B.1 Shared Tools](https://arxiv.org/html/2606.29315#A2.SS1)
    2. [B.2 Tools for Levels](https://arxiv.org/html/2606.29315#A2.SS2)
9. [C Baseline Implementation Details](https://arxiv.org/html/2606.29315#A3)
    1. [C.1 ReAct: Inner Loop](https://arxiv.org/html/2606.29315#A3.SS1)
    2. [C.2 Reflexion: $K=2$ Trial Wrapper](https://arxiv.org/html/2606.29315#A3.SS2)
    3. [C.3 Direct: One-Shot, No-Tool Baseline](https://arxiv.org/html/2606.29315#A3.SS3)
    4. [C.4 GRPO: Weight-Update Baseline](https://arxiv.org/html/2606.29315#A3.SS4)
    5. [C.5 Seed Ranges per (method, level) Cell](https://arxiv.org/html/2606.29315#A3.SS5)
10. [D Additional Details of Experiments and Results](https://arxiv.org/html/2606.29315#A4)
11. [E Working Example: ReAct vs. HExA on Level Catapult of InterPhyre](https://arxiv.org/html/2606.29315#A5)
    1. [E.1 Sparsified simulation frames (catapult seed 45)](https://arxiv.org/html/2606.29315#A5.SS1)
    2. [E.2 ReAct system prompt (catapult seed 45)](https://arxiv.org/html/2606.29315#A5.SS2)
    3. [E.3 ReAct trajectory (catapult seed 45, 25 iterations, FAILURE)](https://arxiv.org/html/2606.29315#A5.SS3)
    4. [E.4 Round-14 evolving skill bank used by HExA](https://arxiv.org/html/2606.29315#A5.SS4)
    5. [E.5 HExA system prompt (catapult seed 45)](https://arxiv.org/html/2606.29315#A5.SS5)
    6. [E.6 HExA trajectory (catapult seed 45, 6 iterations, SUCCESS)](https://arxiv.org/html/2606.29315#A5.SS6)
12. [F More Background and Preliminaries](https://arxiv.org/html/2606.29315#A6)
13. [G Interphyre: Environment and Benchmark Details](https://arxiv.org/html/2606.29315#A7)
    1. [G.1 Introduction](https://arxiv.org/html/2606.29315#A7.SS1)
    2. [G.2 Related Work](https://arxiv.org/html/2606.29315#A7.SS2)
    3. [G.3 Design](https://arxiv.org/html/2606.29315#A7.SS3)
    4. [G.4 Intervention API](https://arxiv.org/html/2606.29315#A7.SS4)
    5. [G.5 Conclusion](https://arxiv.org/html/2606.29315#A7.SS5)
14. [H ReAct System Prompts](https://arxiv.org/html/2606.29315#A8)
    1. [H.1 Down to Earth](https://arxiv.org/html/2606.29315#A8.SS1)
    2. [H.2 Two Body Problem](https://arxiv.org/html/2606.29315#A8.SS2)
    3. [H.3 Pass the Parcel](https://arxiv.org/html/2606.29315#A8.SS3)
    4. [H.4 Catapult](https://arxiv.org/html/2606.29315#A8.SS4)
    5. [H.5 Falling Into Place](https://arxiv.org/html/2606.29315#A8.SS5)
    6. [H.6 Basket Case](https://arxiv.org/html/2606.29315#A8.SS6)
    7. [H.7 Cliffhanger](https://arxiv.org/html/2606.29315#A8.SS7)
    8. [H.8 Tipping Point](https://arxiv.org/html/2606.29315#A8.SS8)
15. [I Evolver Prompts for HExA and Cross-Level Transfer](https://arxiv.org/html/2606.29315#A9)
    1. [I.1 Pass 1 Contrastive Skill Distillation](https://arxiv.org/html/2606.29315#A9.SS1)
    2. [I.2 Pass 2 Mistake & Partial-Skill Extraction](https://arxiv.org/html/2606.29315#A9.SS2)
    3. [I.3 Skill-Bank Evolution](https://arxiv.org/html/2606.29315#A9.SS3)
    4. [I.4 Catapult-Specific Variants (Claude)](https://arxiv.org/html/2606.29315#A9.SS4)
    5. [I.5 Skill Transfer Prompts](https://arxiv.org/html/2606.29315#A9.SS5)
    6. [I.6 Cross-Level Synthesis on Qwen-7B](https://arxiv.org/html/2606.29315#A9.SS6)

## Appendix A HExA: More Methodological Details

### A.1 Overview

Many agentic tasks require an LLM to go beyond its parametric knowledge. In some, the agent must interact with an environment to gather the instance-specific information that complements what it already knows; in others, it must experiment in an unseen domain to reason, act, and complete a novel task or query that lies beyond its existing knowledge and capabilities. An intuitive paradigm for learning in such settings is inference-time, in-context skill augmentation: distilling reusable strategies and common pitfalls from *interaction experience and active experimentation* into compact natural-language *skills* that are placed in the agent’s context at the start of each new episode, amortizing the cost of exploration across newer tasks or domains. Turning this idea into a working self-improving framework, however, faces three difficulties: (1) the agent has no access to oracle solutions, expert demonstrations, or optimal offline data from the task family; (2) there are no human- or oracle-provided skills to start from, and no external reward model or verifier beyond the interaction feedback in the unseen domain to bootstrap learning; and (3) the skill library must evolve autonomously while remaining compact and bounded, so that it fits within the agent’s context window and still captures the most useful knowledge.

In our paper, we introduce HExA (Hierarchical Experimentalist Agents), shown in Figure [1](https://arxiv.org/html/2606.29315#S1.F1), a training-free, in-context reinforcement learning framework that autonomously discovers, evolves, and transfers reusable skills without parameter updates, external feedback, or human annotation. At its core, HExA operates a two-agent loop over a sequence of rounds ([A.3](https://arxiv.org/html/2606.29315#A1.SS3)): an *actor* $\pi_{\mathrm{act}}$ actively experiments with the environment through a tool-call interface, generating trajectories of successes and failures; an *evolver* $\pi_{\mathrm{ev}}$ reads batches of these trajectories, contrasts high-reward against low-reward episodes, and distills the result into an evolving external *skill bank* $\mathcal{K}$ via a two-phase contrastive and failure-analytic distillation process ([A.3.2](https://arxiv.org/html/2606.29315#A1.SS3.SSS2)). A reward-weighted *retriever* ([A.3.4](https://arxiv.org/html/2606.29315#A1.SS3.SSS4)) selects the most relevant skills and injects them into the actor’s context before each subsequent episode, so that the cost of exploration is amortized across problem instances rather than paid from scratch each time.

A key risk in any iterative skill\-evolution loop is*skill drift*: the evolver may accumulate increasingly narrow strategies that overfit to the particular instances seen so far, while discarding general principles that would transfer to new instances\. To address this, HExA supports multiple initialisation and update regimes \([A\.3\.5](https://arxiv.org/html/2606.29315#A1.SS3.SSS5)\) that control the balance between retaining proven skills and incorporating new evidence, ranging from a frozen offline bank to a fully online evolution strategy\. Beyond within\-task learning, HExA further enables*cross\-task skill transfer*\([A\.4](https://arxiv.org/html/2606.29315#A1.SS4)\): skill banks evolved on one or more source task families are re\-grounded by the evolver onto a structurally related target task, enabling zero\-shot transfer with no target\-task interaction\. In this way, HExA autonomously constructs task\-specific skills that can be deployed with any frozen LLM at inference time to enable in\-context self\-improvement without weight updates\.

### A.2 Problem Formulation

We consider a task family $\ell$ consisting of a set of task instances $\{(\ell, s_j)\}$, where each instance is identified by an index $s_j$ (e.g., a random seed) that determines the specific configuration of the environment (geometry, parameters, initial conditions). The agent interacts with each instance through a tool-call interface exposing a set of tools $\mathcal{F}$, and the environment defines a binary success predicate $\textsc{Goal}_{\ell}$ that evaluates the outcome of a completed simulation.

At each iteration $t$ of an episode, the agent observes the conversation history $h_t = (o_0, a_1, o_1, \ldots, a_{t-1}, o_{t-1})$, where $o_i$ is an observation returned by a tool call and $a_i \in \mathcal{F}$ is a tool invocation, and selects the next action according to its policy $\pi$. The episode terminates after at most $T$ iterations or when the agent issues a terminal action. We define a success indicator $y_j$ for task instance $s_j$ as:

$$
y_j(\pi;\,\ell,\,\mathcal{G}) \;=\; \mathbb{I}\bigl[\textsc{Goal}_{\ell}\!\left(\textsc{Sim}(\tau_j)\right) = \text{true}\bigr], \qquad \tau_j \sim \pi(\cdot \mid \ell,\, s_j,\, \mathcal{G}), \tag{4}
$$

where $\tau_j$ is the trajectory generated by $\pi$, $\textsc{Sim}$ runs the environment simulation to completion, and $\mathcal{G}$ is an optional skill context.

Our goal is to maximize the solve rate across instances of $\ell$ without any parameter updates to $\pi$. To this end, we introduce a natural-language *skill bank* $\mathcal{K}$, a structured collection of reusable strategies and common-mistake records that is prepended to the agent’s system prompt at the start of each episode:

$$
\tau_j \sim \pi(\cdot \mid \ell,\, s_j,\, \mathcal{G}), \qquad \mathcal{G} = \textsc{Retriever}(\mathcal{K},\, \ell,\, M,\, N), \tag{5}
$$

where $\textsc{Retriever}$ selects the top-$M$ skills by reward and the top-$N$ mistakes from $\mathcal{K}$. The skill bank $\mathcal{K}$ is constructed and evolved entirely through HExA’s actor-evolver loop (Section [A.3](https://arxiv.org/html/2606.29315#A1.SS3)), with no human annotation, external teacher, or access to oracle solutions. Each trajectory is additionally assigned a scalar reward $r(\tau_j) \in [-1, +1]$ that reflects both outcome and efficiency (Eq. [6](https://arxiv.org/html/2606.29315#A1.E6)), providing the evolver with a richer learning signal than the binary success indicator alone.

### A.3 The HExA Framework

![Refer to caption](https://arxiv.org/html/2606.29315v1/figures/Feedback_Loop.png)

Figure 9: The HExA actor-evolver loop. The actor receives distilled skills in its system prompt and runs on physics puzzles, producing trajectories annotated with rewards (successes ✓, failures ✗). The Skill Evolver analyzes these trajectories and curates the evolving Skill Bank: skills (max $M$) are merged, pruned, or promoted by reward signal, and common mistakes (max $N$) are updated accordingly. Each skill carries a generation index tracking its lineage across evolution rounds (Eq. [7](https://arxiv.org/html/2606.29315#A1.E7)). The updated bank is injected back into the actor’s system prompt for the next round, closing the loop.

HExA instantiates the skill-augmented policy of Eq. [5](https://arxiv.org/html/2606.29315#A1.E5) through a two-agent architecture operating over a sequence of *rounds*. An actor $\pi_{\mathrm{act}}$ (any tool-augmented LLM) generates trajectories by interacting with the environment, while an evolver $\pi_{\mathrm{ev}}$ (which may be the same model) reads batches of actor trajectories and distills them into the skill bank $\mathcal{K}$. At the start of each new episode, the Retriever selects the most relevant skills from $\mathcal{K}$ and injects them into the actor’s context, so that each subsequent trajectory benefits from the accumulated experience of all prior ones.

This loop can be viewed as an instantiation of in-context reinforcement learning (Laskin et al., [2022](https://arxiv.org/html/2606.29315#bib.bib27); Moeini et al., [2025](https://arxiv.org/html/2606.29315#bib.bib39)): the agent’s policy improves over rounds through context augmentation rather than weight updates, and the skill bank serves as a compressed, curated form of the cross-episode context that classical in-context RL methods maintain in raw form. Unlike standard in-context RL, however, HExA does not require pretraining on a task distribution; it operates on a single frozen model and learns from scratch on each task family.

#### A.3.1 Trajectory Reward

Not all trajectories are equally informative. A fast, decisive solve reveals a high-reward strategy; a failure that explored extensively yields richer material for the evolver than one that abandoned the episode after a single attempt. To communicate this signal, we assign a scalar reward $r(\tau) \in [-1, +1]$ to each trajectory $\tau$, reflecting both *outcome* (success or failure) and *efficiency* (iteration count $t$ at termination):

$$
r(\tau) \;=\; \begin{cases}
+1.00 & \text{success},\; t \leq 3 \\
+0.75 & \text{success},\; t \leq 7 \\
+0.50 & \text{success},\; t \leq 15 \\
+0.25 & \text{success},\; t \leq T \\[3pt]
-0.50 & \text{failure},\; t \geq 10 \\
-0.75 & \text{failure},\; t < 10
\end{cases} \tag{6}
$$

The success cases are read top to bottom, so each applies only once the earlier thresholds are exceeded. The asymmetry in failure penalties is deliberate: an agent that explores extensively before failing produces richer interaction records from which the evolver can extract partial skills and diagnose mistakes.
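The piecewise rule of Eq. 6 can be written as a small function. This is a minimal sketch rather than the authors' implementation: the function name `trajectory_reward` and its argument names are illustrative, and the iteration cap `T` is passed in explicitly because its value is not fixed in this section.

```python
def trajectory_reward(success: bool, t: int, T: int) -> float:
    """Scalar reward in [-1, +1] following the piecewise rule of Eq. 6.

    success: whether the episode reached the goal.
    t:       iteration count at termination (1 <= t <= T).
    T:       per-episode iteration cap (value is an assumption of the caller).
    """
    if not 1 <= t <= T:
        raise ValueError("t must satisfy 1 <= t <= T")
    if success:
        # Success tiers are checked from fastest to slowest solve.
        if t <= 3:
            return 1.00
        if t <= 7:
            return 0.75
        if t <= 15:
            return 0.50
        return 0.25
    # Failures that explored for at least 10 iterations are penalised less.
    return -0.50 if t >= 10 else -0.75
```

For instance, with a hypothetical cap of `T = 25`, a success at iteration 6 maps to `0.75`, while a failure that stopped at iteration 4 maps to `-0.75`.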

**Algorithm 2** HExA

**Input:** base agent $\pi_{\text{act}}$, evolver $\pi_{\text{ev}}$, level $\ell$, regime $\mathcal{R} \in \{\text{Offline}, \text{Off2On}, \text{Online}\}$, update strategy $\mathcal{U} \in \{\text{Evolution}, \text{Iterative}, \text{None}\}$, rounds $R$, trajectories per round $x$, initial batch size $N_0$, retrieval sizes $M, N$

**Output:** final skill bank $\mathcal{K}_R$ and per-round trajectories $\{\mathcal{T}^{(n)}\}_{n=1}^{R}$

- 1: *// Stage 1: Initialise skill bank $\mathcal{K}_0$*
- 2: **if** $\mathcal{R} \in \{\text{Offline}, \text{Off2On}\}$ **then**
    - 3: Collect batch $\mathcal{T}_0$ of $N_0$ trajectories from $\pi_{\text{act}}$ on $\ell$
    - 4: $\mathcal{K}_0 \leftarrow \text{Evolver}(\emptyset, \mathcal{T}_0)$ *// teacher-driven distillation*
- 5: **else**
    - 6: $\mathcal{K}_0 \leftarrow \emptyset$ *// cold start (Online)*
- 7: *// Stage 2: Iterative rounds of generation, distillation, and update*
- 8: **for** $n = 1, 2, \dots, R$ **do**
    - 9: $\mathcal{G}_n \leftarrow \text{Retriever}(\mathcal{K}_{n-1}, \ell, M, N)$ *// retrieve top-$M$ skills, top-$N$ mistakes*
    - 10: $\mathcal{T}^{(n)} \leftarrow \emptyset$
    - 11: **for** $i = 1, \dots, x$ **do**
        - 12: $\tau_i \leftarrow \pi_{\text{act}}(\ell \mid \mathcal{G}_n)$ *// run skill-augmented agent*
        - 13: $r(\tau_i) \leftarrow \textsc{Reward}(\tau_i)$ *// Eq. [6](https://arxiv.org/html/2606.29315#A1.E6)*
        - 14: $\mathcal{T}^{(n)} \leftarrow \mathcal{T}^{(n)} \cup \{(\tau_i, r(\tau_i))\}$
    - 15: **if** $\mathcal{R} = \text{Offline}$ **then**
        - 16: $\mathcal{K}_n \leftarrow \mathcal{K}_{n-1}$ *// frozen bank*
    - 17: **else if** $\mathcal{U} = \text{Evolution}$ **then**
        - 18: $\mathcal{K}_n \leftarrow \text{Evolver}(\mathcal{K}_{n-1}, \mathcal{T}^{(n)})$ *// merge, revise, prune*
    - 19: **else if** $\mathcal{U} = \text{Iterative}$ **then**
        - 20: $\mathcal{K}_n \leftarrow \text{Evolver}(\emptyset, \mathcal{T}^{(n)})$ *// re-distil from new trajectories only*
- 21: **return** $\mathcal{K}_R,\ \{\mathcal{T}^{(n)}\}_{n=1}^{R}$
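The control flow of Algorithm 2 can be sketched in a few lines of Python. This is an illustrative skeleton under stated assumptions, not the authors' code: the actor, evolver, retriever, and reward are passed in as plain callables with hypothetical signatures, trajectories and skills are represented as strings, and the warm-start batch is reward-annotated in the same way as later batches.

```python
from typing import Callable, List, Tuple

Trajectory = str                      # placeholder for a full tool-call transcript
Skill = str                           # placeholder for a natural-language skill
Scored = Tuple[Trajectory, float]     # (trajectory, reward) pair


def run_hexa(
    actor: Callable[[List[Skill]], Trajectory],
    evolver: Callable[[List[Skill], List[Scored]], List[Skill]],
    retriever: Callable[[List[Skill]], List[Skill]],
    reward: Callable[[Trajectory], float],
    regime: str,        # "offline", "off2on", or "online"
    update: str,        # "evolution", "iterative", or "none"
    rounds: int,
    per_round: int,
    n_init: int,
) -> Tuple[List[Skill], List[List[Scored]]]:
    def rollout(context: List[Skill]) -> Scored:
        tau = actor(context)
        return tau, reward(tau)

    # Stage 1: warm start from skill-free trajectories, or cold start.
    bank: List[Skill] = []
    if regime in ("offline", "off2on"):
        warm = [rollout([]) for _ in range(n_init)]
        bank = evolver([], warm)

    # Stage 2: generate, distil, update.
    history: List[List[Scored]] = []
    for _ in range(rounds):
        context = retriever(bank)
        batch = [rollout(context) for _ in range(per_round)]
        history.append(batch)
        if regime == "offline":
            continue                       # frozen bank
        if update == "evolution":
            bank = evolver(bank, batch)    # merge, revise, prune
        elif update == "iterative":
            bank = evolver([], batch)      # re-distil from new trajectories only
    return bank, history
```

Keeping the four components as injected callables mirrors the black-box framing in the text: any tool-augmented LLM can stand behind `actor` or `evolver` without changing the loop.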
#### A.3.2 Skill Distillation from Experimentation on Self-Proposed Hypotheses

At each round $n$, the evolver $\pi_{\mathrm{ev}}$ receives the batch of trajectories $\mathcal{T}^{(n)} = \mathcal{T}^{(n),+} \cup \mathcal{T}^{(n),-}$ generated by the actor, where $\mathcal{T}^{(n),+}$ and $\mathcal{T}^{(n),-}$ denote the successful and failed subsets respectively, each annotated with its scalar reward. The evolver performs a two-phase distillation to extract complementary forms of knowledge.

##### Phase 1: Contrastive skill extraction.

Given the full batch $\mathcal{T}^{(n)}$ with reward annotations, the evolver contrasts high-reward trajectories against low-reward ones to identify the strategies and insights that distinguish success from failure. This produces 4–6 *strategy skills*, each consisting of a short title, a 2–3 sentence natural-language principle describing the underlying mechanism, and an applicability predicate specifying when the skill is relevant. The evolver is instructed to weight its analysis by reward: skills derived from fast, efficient solves are treated as more reliable than those from trajectories that barely succeeded.

##### Phase 2: Mistake and partial-skill extraction.

Focusing on the failure subset $\mathcal{T}^{(n),-}$, the evolver produces two complementary outputs. *Common mistakes* are structured records $\mu = (\delta, \rho, \alpha)$, where $\delta$ describes the error, $\rho$ identifies its root cause (the broken causal belief that led to it), and $\alpha$ specifies a concrete corrective strategy. *Partial skills* capture correct reasoning steps that appeared *within* otherwise failed trajectories: an agent may have identified a valid strategy or correctly diagnosed a mechanism before making an error elsewhere. These partial successes cannot be discovered from successful trajectories alone and are a distinctive source of learning signal in HExA.
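The two record types produced by the evolver can be pictured as simple data structures. The sketch below is only one plausible encoding: the class and field names (`Skill`, `Mistake`, `applies_when`, `partial`, and so on) are hypothetical, chosen to mirror the title / principle / applicability triple of Phase 1 and the $(\delta, \rho, \alpha)$ triple of Phase 2.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Skill:
    title: str              # short title
    principle: str          # 2-3 sentence description of the mechanism
    applies_when: str       # applicability predicate, in natural language
    partial: bool = False   # True if recovered from a failed trajectory


@dataclass(frozen=True)
class Mistake:
    error: str        # delta: what went wrong
    root_cause: str   # rho: the broken causal belief behind it
    correction: str   # alpha: concrete corrective strategy


def split_batch(batch: List[Tuple[str, bool, float]]):
    """Partition (trajectory, success, reward) triples into T+ and T-."""
    successes = [item for item in batch if item[1]]
    failures = [item for item in batch if not item[1]]
    return successes, failures


def to_json(record) -> str:
    """Serialise a Skill or Mistake as a JSON object."""
    return json.dumps(asdict(record), sort_keys=True)
```

Phase 1 would consume the whole batch, while Phase 2 would consume only the second element returned by `split_batch`.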

#### A.3.3 Skill Bank

Distilled knowledge is stored in the skill bank $\mathcal{K}$ in two collections, both indexed by task family: a set of *skills* (reusable strategies and principles) and a set of *mistakes* (records of recurring failure modes to avoid). Each skill carries a reward label, computed from the interaction rewards of the trajectories that demonstrated it:

$$
r_k \;=\; \operatorname{clamp}\!\left(\frac{\bar{r}_{\mathrm{src}} + 1}{2},\; 0.1,\; 1.0\right), \tag{7}
$$

where $\bar{r}_{\mathrm{src}}$ is the mean reward of the source trajectories from which the skill was extracted. Skills derived from fast, efficient solves carry higher rewards; skills surfaced from partial reasoning within failed trajectories carry lower rewards but may still encode valuable discoveries. The skill bank is hard-capped at $M_{\max}$ skills and $N_{\max}$ mistakes per task family to prevent the injected context from growing without bound and to force the evolver to prioritize the most informative knowledge during evolution rounds. Under this cap, the reward labels and the merging of skills into more general, hierarchical ones let the bank keep improving even though its size is fixed.
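As a worked instance of Eq. 7, the reward label is an affine rescaling of the mean source reward from $[-1, +1]$ to $[0, 1]$, followed by clamping to $[0.1, 1.0]$. The helper below is a minimal sketch; the name `skill_reward_label` is illustrative.

```python
from typing import Sequence


def skill_reward_label(source_rewards: Sequence[float]) -> float:
    """Reward label r_k of Eq. 7 from the rewards of a skill's source trajectories."""
    if not source_rewards:
        raise ValueError("a skill needs at least one source trajectory")
    mean = sum(source_rewards) / len(source_rewards)
    rescaled = (mean + 1.0) / 2.0          # map [-1, +1] onto [0, 1]
    return min(1.0, max(0.1, rescaled))    # clamp to [0.1, 1.0]
```

A skill backed by two fast solves with rewards `1.0` and `0.75` gets the label `0.9375`; a partial skill recovered from a single failure with reward `-0.75` gets `0.125`, still above the floor of `0.1`.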

#### A.3.4 Skill Retrieval and Injection

At the start of each episode, the Retriever selects the top-$M$ skills from $\mathcal{K}$ sorted by reward $r_k$ and the top-$N$ mistakes for the relevant task family, and injects them into the actor’s system prompt as the structured skill context $\mathcal{G}$ defined in Eq. [5](https://arxiv.org/html/2606.29315#A1.E5). The actor receives this block alongside the standard task description and tool documentation, and is free to use or disregard individual skills based on its own reasoning about the current instance. This injection mechanism is deliberately simple: it adds a fixed-size prefix to the prompt rather than modifying the agent’s architecture or decoding procedure, making HExA compatible with any tool-augmented LLM, including closed-source API-served models.
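A reward-sorted retriever of this kind is short to write down. The sketch below assumes a hypothetical bank layout (a dict keyed by task family, holding lists of skill and mistake dicts) and a hypothetical prompt format; the text does not specify how mistakes are ranked, so they are kept in bank order here.

```python
from typing import Dict, List


def retrieve(bank: Dict[str, Dict[str, List[dict]]], level: str, m: int, n: int) -> str:
    """Build the skill context G: top-m skills by reward, first n mistakes."""
    family = bank.get(level, {"skills": [], "mistakes": []})
    # sorted() is stable, so equally rewarded skills keep their bank order.
    skills = sorted(family["skills"], key=lambda s: s["reward"], reverse=True)[:m]
    mistakes = family["mistakes"][:n]   # assumption: bank order is the ranking
    lines = ["## Skills"]
    lines += [f"- [{s['reward']:.2f}] {s['title']}: {s['principle']}" for s in skills]
    lines.append("## Common mistakes")
    lines += [f"- {mk['error']} -> {mk['correction']}" for mk in mistakes]
    return "\n".join(lines)
```

The returned string would be prepended to the actor's system prompt; because `m` and `n` are fixed, the prefix stays bounded regardless of how many rounds have run.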

#### A.3.5 Initialisation, Update Regimes, and Full Algorithm

The loop described above admits several configurations that vary along two orthogonal axes: how the skill bank is *initialised* before evaluation begins, and how it is *updated* between rounds. Algorithm [2](https://arxiv.org/html/2606.29315#algorithm2) summarizes the full procedure; we now describe the design axes it parameterizes.

##### Initialisation.

In the Offline and Offline-to-Online (Off2On) regimes, a warm-start batch of $N_0$ trajectories is collected from the base actor (without any skill context) and distilled into an initial bank $\mathcal{K}_0$ via $\text{Evolver}(\emptyset, \mathcal{T}_0)$ (Algorithm [2](https://arxiv.org/html/2606.29315#algorithm2), lines 3–4). In the Pure Online regime, the bank starts empty ($\mathcal{K}_0 = \emptyset$) and the agent must bootstrap skills entirely from its own online interactions (line 6).

##### Update strategy.

Given trajectories from round $n$, the bank can be updated in two ways. Under Evolution, the evolver receives both the current bank and the new trajectories, producing $\mathcal{K}_n = \text{Evolver}(\mathcal{K}_{n-1}, \mathcal{T}^{(n)})$ by merging, revising, or pruning existing skills in light of new evidence (line 17). Under Iterative Replacement, the evolver distills $\mathcal{K}_n = \text{Evolver}(\emptyset, \mathcal{T}^{(n)})$ purely from the new trajectories, discarding all prior skills (line 19). In the Offline regime, the bank is frozen after initialisation ($\mathcal{K}_n = \mathcal{K}_{n-1}$, line 15).

Crossing these axes yields five concrete variants: Offline (static bank), Off2On with Evolution, Off2On with Iterative Replacement, Pure Online with Evolution, and Pure Online with Iterative Replacement. We compare these empirically in Section [4](https://arxiv.org/html/2606.29315#S4).
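The count of five (rather than six) follows because the Offline regime freezes the bank and therefore admits only the "None" update, while the other two regimes each pair with both update strategies. A short enumeration makes this explicit; the function name `hexa_variants` is illustrative.

```python
from itertools import product
from typing import List, Tuple


def hexa_variants() -> List[Tuple[str, str]]:
    """Enumerate the (regime, update strategy) pairs described in the text."""
    variants = [("Offline", "None")]  # frozen bank: no update strategy applies
    variants += list(product(("Off2On", "Online"), ("Evolution", "Iterative")))
    return variants
```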

Empirically, the Off2On + Evolution variant works best across all settings, and it is the variant we generally refer to as HExA in the rest of the paper.

### A.4 Cross-Task Skill Transfer

The skill banks produced by HExA encode task\-family\-specific strategies, but the underlying principles are often more general: momentum conservation, lever mechanics, collision geometry, and similar structural insights recur across task families that share physical or procedural primitives\. HExA exploits this by synthesizing a*cross\-task skill bank*for a target task family from the evolved banks of one or more source families, with no target\-task trajectories required\.

Given source banks $\mathcal{K}_{\ell_1}, \ldots, \mathcal{K}_{\ell_S}$ and a textual description of the target task family $\ell^{*}$, the evolver receives all source skills together with the target description and is prompted to:

1. identify which source skills encode principles that are structurally relevant to the target (based on shared physical or procedural primitives),
2. re-ground each selected principle in the entities and mechanics of the target scene, and
3. calibrate reward based on how directly the principle transfers (skills corroborated by multiple source banks receive higher reward).

The synthesised bank $\mathcal{K}_{\ell^{*}}$ is then injected into the actor’s context via the same Retriever mechanism as within-task skills. This enables zero-shot transfer: the actor attempts the target task with the benefit of cross-task skills but without having seen any target-task trajectories.

Importantly, cross-task transfer and within-task skill evolution are complementary. The synthesised bank can serve as the initialisation $\mathcal{K}_0$ for a subsequent within-task HExA run, combining the benefits of transferred knowledge with task-specific refinement.

### A.5 Design Principles

Several design choices in HExA merit explicit discussion\.

##### Domain agnosticism\.

Nothing in the framework above assumes a particular domain\. The actor interacts with the environment through a generic tool\-call interface; the evolver reads trajectory text and produces structured JSON\. The only domain\-specific components are the tool suite exposed by the environment and the system prompt describing the task, both of which are inputs to HExA rather than parts of its architecture\. We instantiate and evaluate HExA on physics puzzles in §[4](https://arxiv.org/html/2606.29315#S4), but the same framework applies to any setting where an agent must learn from sequential interactions with an environment\.

##### Reward-function agnosticism.

As our experiments show, HExA benefits most when rewards come from environment interactions and are shaped to the user’s needs. The evolver can nonetheless guide skill creation and learning without such rewards, which makes HExA applicable in domains where no environment reward is available. When rewards or additional feedback do exist, any reward function can be plugged in to further guide skill acquisition, whether specified by users or designed by the LLM evolver itself.

##### No weight updates\.

HExA is entirely in\-context: neither the actor nor the evolver undergoes any parameter updates\. All adaptation occurs through the evolving skill bank, which modifies the actor’s behavior by changing the textual context it conditions on\. This makes HExA compatible with closed\-source, API\-served models and avoids the computational cost and catastrophic\-forgetting risks of fine\-tuning\.

##### Hierarchical skill structure\.

The skills in $\mathcal{K}$ are hierarchical in two senses. First, they abstract over low-level actions (specific tool calls and parameter values) into high-level strategy descriptions, compressing lengthy trajectories into concise principles. Second, skills are learned iteratively while the agent has access to previously distilled skills, so later skills implicitly build on earlier ones. The evolver may revise a skill in light of new evidence or merge two complementary skills into a more general one, producing a form of meta-learning: the agent learns not just strategies but *how to learn strategies* from interactions.

![Refer to caption](https://arxiv.org/html/2606.29315v1/figures/Agent_Interphyre_Tools.png)

Figure 10: Tool interface between the HExA agent and the Interphyre environment. The agent issues two classes of tool calls. State Information Tools (left) query the scene without advancing the simulation, returning object positions, gap widths, relative ball positions, and level-specific geometry. Simulation Tools (right) place the red ball at $(x, y, r)$ and run the physics engine, returning SUCCESS/FAILURE with final object positions (simulate\_action) or intermediate state at a chosen timestep (simulate\_partial). The finish action terminates the episode. Level-specific state tools (compute\_gap\_analysis, compute\_relative\_positions, get\_ramp\_center, etc.) are available only on their respective levels; shared tools are available on all levels. Only selected tools are shown here; full tool signatures and usage instructions are provided in Appendix [B](https://arxiv.org/html/2606.29315#A2).

## Appendix B Tool Descriptions

This appendix provides complete descriptions of all tools available to the ReAct agent. Tools are partitioned into *shared tools* (available on every level) and *level-specific tools* (exposed only on the corresponding puzzle).

### B.1 Shared Tools

##### get\_level\_state\(\)\.

Returns a structured text description of the current puzzle scene: the name, position, radius (for balls), dimensions and angle (for bars), and dynamic flag of every object; the world bounds ($x, y \in [-5, 5]$); and the level-specific success condition.

##### simulate\_action\(x, y, radius\)\.

Places the red ball at $(x, y)$ with the given radius in $[0.1, 2.0]$ and runs the full physics simulation (up to 2,000 steps at 60 Hz) to completion. Returns the outcome (SUCCESS/FAILURE), total step count, and final positions and velocities of all objects. Before launching the simulation the toolkit performs *pre-simulation validation*: if the placement violates world bounds, overlaps an existing ball, or intersects a platform, the call returns a detailed error specifying the minimum adjustment required rather than consuming a simulation attempt.
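The pre-simulation validation step can be illustrated with a simplified check. This sketch is not the toolkit's code: the function name `validate_placement`, the error strings, and the choice to require the whole ball (centre plus radius) to lie inside the $[-5, 5]$ world are assumptions, and the platform-intersection test is omitted for brevity.

```python
import math
from typing import Dict, List, Tuple

WORLD_LIMIT = 5.0                  # world bounds are x, y in [-5, 5]
RADIUS_MIN, RADIUS_MAX = 0.1, 2.0  # allowed red-ball radius


def validate_placement(
    x: float, y: float, radius: float,
    balls: Dict[str, Tuple[float, float, float]],  # name -> (x, y, radius)
) -> List[str]:
    """Return a list of validation errors; an empty list means the placement is valid."""
    errors: List[str] = []
    if not RADIUS_MIN <= radius <= RADIUS_MAX:
        errors.append(f"radius {radius} outside [{RADIUS_MIN}, {RADIUS_MAX}]")
    # Assumption: the ball must fit entirely inside the world box.
    if abs(x) + radius > WORLD_LIMIT or abs(y) + radius > WORLD_LIMIT:
        errors.append("placement violates world bounds")
    for name, (bx, by, br) in balls.items():
        gap = math.hypot(x - bx, y - by) - (radius + br)
        if gap < 0:
            # Report the minimum separation needed to clear the overlap.
            errors.append(f"overlaps {name}: move at least {-gap:.3f} away")
    return errors
```

Returning errors instead of running the simulation matches the behaviour described above, where an invalid placement does not consume a simulation attempt.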

##### simulate\_partial\(x, y, radius, stop\_step\)\.

Identical to simulate\_action but halts after stop\_step physics steps (or earlier if the success condition is satisfied). Returns object positions and velocities at the chosen checkpoint, enabling mid-simulation diagnostics without committing to a full run.

##### get\_contact\_log\(\)\.

Returns timestamped collision events from the most recent simulation: a list of (step, object\_A, object\_B) entries (capped at 20; a notice is appended if more exist), or "No contact events recorded" if no contacts occurred. Must be called after simulate\_action or simulate\_partial.

##### finish\(x, y, radius\)\.

Submits the agent’s final answer: places the red ball at $(x, y, r)$ and ends the episode. The episode is scored as a success if the physics engine confirms the goal condition.

### B.2 Tools for Levels

#### B.2.1 Down to Earth: compute\_gap\_analysis()

Analyses the gaps on each side of the platform that separates the green ball from the floor. Returns:

- platform span (left and right $x$-coordinates of the platform edges);
- green ball diameter;
- left gap width (distance from left wall to platform left edge) and right gap width (distance from platform right edge to right wall);
- YES/NO for whether the green ball can fit through each gap;
- a recommendation indicating which side has the larger viable gap, or a WARNING if neither gap is wide enough.
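The quantities listed above reduce to a few subtractions. The sketch below is a hypothetical re-creation, not the tool's source: it assumes the walls sit at the world bounds $x = \pm 5$, treats a gap as viable only if it is strictly wider than the ball diameter, and uses an illustrative function name and return layout.

```python
from typing import Dict

WORLD_MIN, WORLD_MAX = -5.0, 5.0  # assumption: walls coincide with the world bounds


def gap_analysis(platform_left: float, platform_right: float, ball_diameter: float) -> Dict:
    """Compute left/right gap widths, fit flags, and a side recommendation."""
    gaps = {
        "LEFT": platform_left - WORLD_MIN,     # left wall to platform left edge
        "RIGHT": WORLD_MAX - platform_right,   # platform right edge to right wall
    }
    fits = {side: width > ball_diameter for side, width in gaps.items()}
    viable = {side: width for side, width in gaps.items() if fits[side]}
    if viable:
        recommendation = max(viable, key=viable.get)  # larger viable gap
    else:
        recommendation = "WARNING: neither gap is wide enough"
    return {"gaps": gaps, "fits": fits, "recommendation": recommendation}
```

For a platform spanning $x \in [-3, 2]$ and a ball of diameter 1.5, the gaps are 2.0 on the left and 3.0 on the right, both viable, so the recommendation is the right side.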

#### B.2.2 Two Body Problem: compute\_relative\_positions()

Analyses the spatial relationship between the green and blue balls. Returns:

- position and radius of each ball;
- horizontal separation $\Delta x$, vertical separation $\Delta y$, and centre-to-centre distance;
- minimum contact distance (sum of radii);
- relative direction of the blue ball from the green ball (LEFT/RIGHT).

This information guides placement of the red ball to push the green ball into contact with the blue ball\.
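The geometry returned by this tool is elementary and can be reproduced as follows. This is an illustrative sketch with a hypothetical function name and return layout; the sign convention (separations measured from the green ball to the blue ball) is an assumption.

```python
import math
from typing import Dict, Tuple

Ball = Tuple[float, float, float]  # (x, y, radius)


def relative_positions(green: Ball, blue: Ball) -> Dict:
    """Separation, contact distance, and direction of the blue ball from the green ball."""
    gx, gy, gr = green
    bx, by, br = blue
    dx, dy = bx - gx, by - gy
    return {
        "dx": dx,
        "dy": dy,
        "distance": math.hypot(dx, dy),    # centre-to-centre distance
        "contact_distance": gr + br,       # the balls touch at this distance
        "direction": "RIGHT" if dx > 0 else "LEFT",
    }
```

The difference `distance - contact_distance` is the surface-to-surface gap the red ball's push has to close.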

#### B.2.3 Catapult: Four Analysis Tools

The Catapult level exposes four complementary tools to handle its multi\-body lever mechanics\.

##### describe\_scene\_geometry\(\)\.

Returns a strategy\-neutral geometric inventory of the scene with no prescriptive advice: all balls \(position, radius, dynamic flag\), all bars \(centre, angle, length, dynamic flag\), and all baskets \(position, dynamic flag\); plus the key pairwise distance \(green–blue\) and the success condition\. Intended as a first\-pass survey before committing to a simulation strategy\. No arguments\.

##### simulate\_with\_trace\(x, y, radius, object\_names, stop\_step\)\.

Places the red ball and runs a full \(or truncated\) simulation, then returns a per\-object kinematic\-extrema summary for every name inobject\_names\(e\.g\.\["green\_ball", "catapult\_arm"\]\)\. The summary per object includes:peak\_y\(maximum height reached\),min\_y\(minimum height\),v\_max\(peak speed\),Δpos\\Delta\\mathrm\{pos\}\(net displacement\), and, for bodies that rotate appreciably, peak angular speedωmax\\omega\_\{\\max\}and angle range\[θmin,θmax\]\[\\theta\_\{\\min\},\\theta\_\{\\max\}\]\. The most relevant contact events involving the red ball or any traced object are also returned \(capped at 15\)\. Used to verify lever mechanics and confirm that the green ball reaches the basket\.

##### trace\_green\_ball\(x, y, radius\)\.

A lighter trajectory probe than simulate\_with\_trace: only the green ball is sampled\. Places the red ball, runs the simulation up to a 600\-step ceiling, and returns the green ball’s $(x, y)$ waypoints at fixed 30\-step intervals together with a start/end/peak summary \(start position, end position, $\Delta$, peak $y$, peak speed\)\. Stops early once the green ball settles \(consecutive samples with $|\Delta\mathrm{pos}| < 0.10$\) or once the success condition fires\. Use this when the only quantity of interest is *where* the green ball travels rather than full per\-object kinematics or contact events\.
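The sampling and early-stop rule can be sketched as follows. The constants come from the description above; `position_at` and `succeeded_at` are hypothetical stand-ins for the simulator, and reading "settles" as a single pair of consecutive samples closer than the threshold is an assumption.

```python
import math

SAMPLE_EVERY = 30   # steps between waypoints (from the text)
MAX_STEPS = 600     # step ceiling (from the text)
SETTLE_EPS = 0.10   # settle threshold on |delta pos| (from the text)


def sample_waypoints(position_at, succeeded_at=lambda step: False):
    """Collect green-ball waypoints until it settles, succeeds, or hits the ceiling.

    position_at(step) -> (x, y) and succeeded_at(step) -> bool are hypothetical
    callbacks standing in for the physics simulator.
    """
    waypoints = [position_at(0)]
    for step in range(SAMPLE_EVERY, MAX_STEPS + 1, SAMPLE_EVERY):
        waypoints.append(position_at(step))
        (px, py), (cx, cy) = waypoints[-2], waypoints[-1]
        settled = math.hypot(cx - px, cy - py) < SETTLE_EPS
        if settled or succeeded_at(step):
            break
    return waypoints


# A ball that moves right for 90 steps and then rests.
moving = sample_waypoints(lambda s: (min(s / 30, 3.0), 0.0))
```

At most 21 waypoints are returned (the initial position plus one per 30-step interval up to step 600), which bounds the size of the observation.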

##### predict\_first\_contact\(x, y, radius\)\.

A cheap pre\-simulation check \(at most 90 physics steps, $\approx 1.5$ s of simulated time\) that identifies the first object the red ball contacts after release\. Returns: placement validity, the name of the first object hit, the step at which contact occurs, approach speed, approximate contact point, and surface normal\. Used to verify that the red ball reaches the intended catapult arm before committing to a full simulation\.

#### B\.2\.4 Falling Into Place: compute\_intercept\_setup\(\)

Computes the intercept geometry for the timed\-interception puzzle\. Returns:

- •green ball position and which platform it rests on \(LEFT/RIGHT\);
- •blue jar position and travel direction the green ball must take;
- •the platform edge the green ball must cross and the gap centre $x$\-coordinate;
- •estimated time until the jar falls to platform height \(in seconds and in 60 Hz physics steps\)\.
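The paper does not say how the fall-time estimate is computed. One plausible way, shown here purely as an assumption, is the constant-acceleration formula $t = \sqrt{2h/g}$ for a body released from rest, converted to steps at the 60 Hz physics rate. The function name and the gravity value in the usage line are hypothetical.

```python
import math

PHYSICS_HZ = 60  # physics rate stated in the text


def fall_time(drop_height: float, gravity: float) -> tuple[float, int]:
    """Return (seconds, physics steps) for a body released from rest to fall drop_height.

    Assumes free fall with constant gravity and no contacts; the real tool
    may use a different estimate.
    """
    seconds = math.sqrt(2.0 * drop_height / gravity)
    return seconds, math.ceil(seconds * PHYSICS_HZ)


# Hypothetical numbers: a 4.9-unit drop under gravity of 9.8 units/s^2.
seconds, steps = fall_time(4.9, 9.8)
```

Reporting the estimate in both seconds and steps lets the agent pass it directly to step-indexed tools such as simulate\_partial.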

#### B\.2\.5 Basket Case: compute\_basket\_analysis\(\)

Analyses the spatial relationship between the green ball and the basket that the agent must avoid\. Returns:

- •green ball position and radius;
- •basket position and scale, including opening half\-width;
- •purple ground position \(if present\);
- •horizontal distance from the green ball to the basket centre;
- •recommended push direction \(LEFT/RIGHT\) to deflect the green ball away from the basket\.


#### B\.2\.6 Pass the Parcel: get\_ramp\_center\(\)

Analyses the ramp\-and\-basket mechanism in the Pass the Parcel puzzle\. Returns the centre coordinates $(x, y)$ of the ramp, together with inferred geometry including the top\-basket position, bottom\-basket position, platform position, ramp angle, and ramp bounds\. This information guides placement of the red ball to roll it onto the ramp and interact with downstream mechanisms\.

#### B\.2\.7 Cliffhanger: compute\_cliffhanger\_analysis\(\)

Analyses the cliffhanger geometry: a vertical green bar standing on a black platform with a ceiling above and the purple ground below\. Returns:

- •green bar centre, length, thickness, and angle, plus the $(x, y)$ coordinates of its bottom point \(resting on the platform\) and top point \(opposite end\);
- •black platform extents \(left and right $x$, centre $y$, top\-surface $y$, thickness\);
- •ceiling $y$ and purple\-ground $y$ where present;
- •the bar’s bottom\-point distance to the platform’s left and right edges;
- •closer platform edge \(LEFT/RIGHT\) — i\.e\. the edge the bar must fall past — and the width of the falling gap on that side \(between the platform edge and the world wall\);
- •the implied tip direction the bar’s top end must rotate \(LEFT/RIGHT\) for the bar’s centre of mass to cross the closer edge\.

#### B\.2\.8 Tipping Point: compute\_tipping\_point\_analysis\(\)

Analyses the tipping\-point geometry: a vertical green bar pinned at its base by a small gray basket, with a static purple wall flush against either the left or right side of the box\. Returns:

- •green bar centre, length, thickness, and angle, plus the $(x, y)$ coordinates of the free top end and the basket\-pinned bottom end;
- •gray basket centre \(when present\);
- •purple wall $x$ and its top/bottom $y$;
- •the wall’s side relative to the bar \(LEFT/RIGHT\);
- •horizontal distance from the bar centre to the wall;
- •approximate angle the bar must tip through, treated as a rigid stick pivoting at its base, to reach the wall \($\arcsin(\Delta x / L)$ when $\Delta x < L$, else N/A\);
- •suggested tip direction \(LEFT/RIGHT\)\.
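The tip-angle entry is a one-line formula. A minimal sketch, assuming $\Delta x$ is the horizontal distance to the wall and $L$ the bar length as in the list above; the function name and the choice of degrees are illustrative.

```python
import math


def tip_angle_deg(dx: float, bar_length: float):
    """Angle (degrees) a rigid bar pivoting at its base tips through to reach a wall.

    Returns None for the N/A case, where the wall is at least one bar length away.
    """
    dx = abs(dx)
    if dx >= bar_length:
        return None
    return math.degrees(math.asin(dx / bar_length))


angle = tip_angle_deg(1.0, 2.0)  # wall half a bar length away
```

For a wall half a bar length away the bar tips through about 30 degrees; as $\Delta x$ approaches $L$ the required angle approaches 90 degrees, i.e. the bar must fall nearly flat to touch the wall.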

## Appendix C Baseline Implementation Details

This appendix documents the three baselines summarized in Section [4\.2](https://arxiv.org/html/2606.29315#S4.SS2): the per\-step ReAct loop, the Reflexion wrapper \($K{=}2$ trials\), and the GRPO fine\-tune\. We run open\-weight models with vLLM and Hugging Face on up to 4 A100 GPUs \(80 GB VRAM each\), and the Claude Sonnet model through the Anthropic API\. Together with the per\-level system prompts \(Appendix [H](https://arxiv.org/html/2606.29315#A8)\) and tool descriptions \(Appendix [B](https://arxiv.org/html/2606.29315#A2)\), the contents below are sufficient to reproduce every baseline number reported in Section [4](https://arxiv.org/html/2606.29315#S4)\.

### C\.1 ReAct: Inner Loop

Every method we evaluate runs the same ReAct loop on top of the Interphyre tool API\. HExA adds a skill block to the system prompt; Reflexion adds a reflection block; both leave the inner loop otherwise unchanged\.

##### Per\-step protocol\.

At each iteration the agent emits exactly one block of three lines:

Thought: <free\-form reasoning over the conversation history\>

Action: <tool name, drawn from the level’s API\>

Action Input: <JSON arguments, or empty for nullary tools\>

The harness parses these three fields, dispatches the tool against the simulator, and appends the returned text as

Observation: <tool output\>

The agent then produces the next Thought/Action/Action Input block, conditioned on the full conversation so far\.
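A minimal sketch of how a harness might parse one such block is given below. It is not the released parser: the function name `parse_step` is hypothetical, the field labels are taken from the protocol above, and for brevity only the first line of each field is kept, so multi-line thoughts are cut short.

```python
import json


def parse_step(reply: str) -> dict:
    """Extract the first Thought / Action / Action Input triple from a model reply."""
    fields = {"Thought": "", "Action": None, "Action Input": ""}
    for line in reply.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        # Keep only the first occurrence of each field; later ones are ignored.
        if sep and key in fields and not fields[key]:
            fields[key] = value.strip()
    if not fields["Action"]:
        raise ValueError("reply has no Action line")
    raw_args = fields["Action Input"]
    return {
        "thought": fields["Thought"],
        "action": fields["Action"],
        "args": json.loads(raw_args) if raw_args else {},  # nullary tools: empty input
    }


reply = (
    "Thought: Drop a heavy ball on the right end of the arm.\n"
    "Action: simulate_action\n"
    'Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6}\n'
    "Action: finish"
)
step = parse_step(reply)
```

Taking the first Action line and ignoring the rest enforces the one-tool-call-per-turn rule even when the model emits several actions in one reply.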

##### Termination, attempts, and scoring\.

A trial ends as soon as one of three things happens\. \(1\) The agent emits Action: finish with arguments $(x, y, r)$; the harness places the red ball and the simulator runs to completion\. \(2\) The agent emits Action: simulate\_action with arguments $(x, y, r)$ and the simulator confirms the goal predicate \(the named target pair stays in contact for at least 3 seconds, i\.e\. 180 physics steps at 60 Hz\)\. \(3\) The iteration count reaches the cap of 25 without either of the above\. Each simulate\_action call counts as a placement attempt; an agent may make up to roughly 24 attempts within a single trial\.
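The goal predicate can be read as a streak check over per-step contact flags. The sketch below assumes the 3 seconds must be continuous, which the text does not state explicitly; the function name and input format are hypothetical.

```python
PHYSICS_HZ = 60
REQUIRED_STEPS = 3 * PHYSICS_HZ  # 3 seconds of contact = 180 physics steps


def goal_reached(in_contact) -> bool:
    """Return True once the target pair has touched for REQUIRED_STEPS steps in a row.

    in_contact: iterable of booleans, one per physics step (assumed format).
    """
    streak = 0
    for touching in in_contact:
        streak = streak + 1 if touching else 0  # any separation resets the count
        if streak >= REQUIRED_STEPS:
            return True
    return False


bounced = [True] * 179 + [False] + [True] * 179  # touches twice, never long enough
```

Under this reading a ball that bounces off the target and returns does not accumulate credit across the two contacts.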

##### Decoding\.

All baselines share decoding: temperature 0\.3, max\-new\-tokens 700 per agent turn, and a single tool call per turn \(parsed greedily from the first Action line in the model’s reply\)\.

### C\.2 Reflexion: $K{=}2$ Trial Wrapper

Reflexion wraps the same single\-trial ReAct agent \(above\) in $K{=}2$ trials per seed; each trial runs for 12 iterations\. Trial 1 runs as a standard ReAct loop\. If trial 1 fails, the harness calls a separate Claude Sonnet instance to write a verbal reflection, prepends the reflection to the system prompt of trial 2, and runs trial 2 from a fresh 25\-iteration budget\. The episode is scored a success if either trial succeeds; iteration count and wall\-clock time aggregate across the two trials\.

##### Reflection model\.

The reflection step uses Claude Sonnet 4\.6 \(matching the actor for our catapult run\), invoked with a single non\-streaming completion\. The reflection prompt is held fixed across seeds and levels\.

##### Reflection prompt\.

The reflector receives the full ReAct trajectory of the failed trial \(Thought, Action, Observation per step\) plus the final observation\. Long observations are truncated to the first 300 and last 300 characters with an ellipsis to keep the prompt under context\. The prompt template is:

SYSTEM:

You are analyzing a failed attempt at a 2D physics puzzle\.

You will receive the full Thought/Action/Observation trajectory from a single failed attempt by a ReAct\-style agent, followed by the final outcome and any prior reflections accumulated on this same task\.

Your job: produce a short reflection \(<= 5 sentences\) that the agent will read before its next attempt on the SAME task\. Cover:

\(a\) which strategy/approach the agent pursued in this attempt;

\(b\) the specific kinematic, geometric, or procedural reason it failed \(cite concrete coordinates, distances, or contact events from the observations\);

\(c\) one concrete different action or strategy to try next \-\- be specific \(object names, approximate \(x, y, radius\), expected mechanism\)\.

Hard rules:

\- Do NOT repeat lessons that already appear in the prior reflections list\.

\- Do NOT re\-state the goal or the puzzle rules\.

\- Output ONLY the reflection text\. No preamble, no headers, no markdown\.

USER:

Outcome: FAILURE\.

Final observation:

\{final\_observation\}

Trajectory:

\-\-\- Step 1 \-\-\-

Thought: \{thought\}

Action: \{action\}

Observation: \{observation\_truncated\}

\.\.\. \(one block per step\) \.\.\.

\#\# Prior reflections \(do not repeat these\)

1\. \{prior\_reflection\_1\}

\.\.\.

Write the reflection now\. Be specific and concise \(<= 5 sentences\)\.
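The observation truncation applied before filling \{observation\_truncated\} can be sketched as below. The 300-character head and tail come from the description of the reflection prompt; the function name and the exact ellipsis marker are assumptions.

```python
def truncate_observation(text: str, head: int = 300, tail: int = 300) -> str:
    """Keep the first `head` and last `tail` characters of a long observation."""
    if len(text) <= head + tail:
        return text  # short observations pass through unchanged
    return text[:head] + "..." + text[-tail:]  # marker string is an assumption


long_obs = "a" * 500 + "b" * 500
short = truncate_observation(long_obs)
```

Keeping both ends preserves the tool's opening status line and its final outcome, which are usually the parts a reflection needs to cite.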

##### Trial 2 conditioning\.

The returned reflection text is appended to the level’s system prompt under a heading \#\# Reflexion memory before trial 2 starts\. The actor sees this block exactly once at trial 2’s first turn, in addition to the unchanged level system prompt and tool list\. Trial 2 then proceeds as a fresh ReAct loop with the standard 25\-iteration cap\.

##### Empty / failed reflections\.

If trial 1 already succeeds, no reflection is generated and trial 2 is skipped \(we still report the seed as a success but the reported iteration count is just trial 1’s\)\. If the reflection model returns an empty string \(CLI error or empty output\), trial 2 runs without a memory block, which makes it equivalent to a second ReAct trial from scratch\.
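The control flow of the two-trial wrapper, including both fallback cases, can be summarised in a short sketch. `run_trial` and `reflect` are hypothetical stand-ins for the ReAct harness and the reflection model, and the return format is illustrative only.

```python
def run_reflexion(run_trial, reflect, system_prompt: str) -> dict:
    """Two-trial Reflexion wrapper around a single-trial agent.

    run_trial(prompt) -> (success, iterations, trajectory) and
    reflect(trajectory) -> str are hypothetical callables.
    """
    success, iters, trajectory = run_trial(system_prompt)
    if success:
        # Trial 2 is skipped; only trial 1's iterations are reported.
        return {"success": True, "iterations": iters, "trials": 1}

    reflection = reflect(trajectory)
    prompt2 = system_prompt
    if reflection.strip():
        prompt2 += "\n\n## Reflexion memory\n" + reflection
    # With an empty reflection, trial 2 is just a second ReAct trial from scratch.
    success2, iters2, _ = run_trial(prompt2)
    return {"success": success2, "iterations": iters + iters2, "trials": 2}


# Stub harness: trial 1 fails after 25 iterations, trial 2 succeeds after 6.
calls = []


def fake_trial(prompt):
    calls.append(prompt)
    return (len(calls) == 2, 25 if len(calls) == 1 else 6, ["step"])


outcome = run_reflexion(fake_trial, lambda traj: "Aim further right.", "SYS")
```

Because iterations are summed across trials, a Reflexion success on the second trial is always reported as more expensive than a single ReAct trial.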

### C\.3 Direct: One\-Shot, No\-Tool Baseline

Direct removes the ReAct loop entirely\. The agent never emits a Thought/Action/Action Input block, never calls a simulator\-write tool, and never observes the consequence of a placement before committing\. It serves as a lower\-bound test for how much of each level is solvable purely from the static scene description\.

Table 5: Direct baseline solve rate \(%\) across eight Interphyre levels\. Direct peeks at the level state once and makes a single model call to finish\(\), with no environment interaction\. Underlined levels \(Pass the Parcel, Catapult\) form the harder tier, based on Claude Sonnet performance\. † Partial run \(Pass the Parcel: 50 seeds; Catapult: user\-reported\); all other cells 100 seeds\.

##### Per\-seed protocol\.

The harness performs exactly two operations per seed:

1. A single scripted call to get\_level\_state via the same Interphyre toolkit used by ReAct\. The returned object table \(positions, sizes, dynamic flags, and any level\-specific geometry\) is captured verbatim\.
2. One non\-streaming Claude Sonnet 4\.6 call\. The conversation has a level\-specific system prompt and a single user message that embeds the scene table from step 1 and demands the final $(x, y, r)$ placement as JSON\. The harness parses the JSON, runs simulate\_action once, and records success/failure\.

The model is given no read tools, no probe tools \(simulate\_partial, get\_contact\_log\), and no retries\. There is one inspection of the scene and one committed placement\.

##### Actor model and decoding\.

Actor model: Claude Sonnet 4\.6 invoked via the claude \-p CLI with \-\-max\-turns 1 and \-\-no\-session\-persistence, matching the actor used by the ReAct and Reflexion baselines on the same level so the comparison isolates the harness, not the model\. Temperature 0\.3 and max\-new\-tokens 700, identical to ReAct\.

##### Answer prompt \(no\-CoT variant\)\.

We use the no\-CoT variant of the answer block: the model emits a single Action/Action Input pair and no preceding reasoning\. This shape mirrors the per\-iteration output of ReAct, which keeps Sonnet’s response routine and avoids triggering extended\-thinking trajectories that would otherwise inflate latency by an order of magnitude\. The exact answer block appended to the user message is:

Submit your FINAL red ball placement now using the finish action\.

Output EXACTLY this format and NOTHING else \(no Thought line, no preamble,

no commentary\):

Action: finish

Action Input: \{"x": <X\>, "y": <Y\>, "radius": <R\>\}

Replace <X\>, <Y\>, <R\> with your chosen floating\-point numbers \(subject to

the placement constraints above\)\.

##### System prompt\.

The system prompt is self\-contained and does not reuse the ReAct level prompts, so the baseline cannot accidentally inherit ReAct framing\. It states the world dimensions, gravity direction, placement constraints, and a per\-level *elements* block enumerating the dynamic and static objects with one\-line descriptions\. The full per\-level blurbs are in the released code \(react\_agent/run\_direct\_answer\_claude\.py\)\.

##### Termination and scoring\.

A seed terminates after the single simulate\_action call\. Success is the same simulator predicate used by every other method \(target pair in contact for $\geq 3$ s\)\. If the model’s response cannot be parsed into a valid $(x, y, r)$ tuple, the seed is recorded as a failure with no simulator call \(these are extremely rare: the no\-CoT format leaves only one syntactic shape to emit\)\.
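A strict parser for the two-line answer format might look like the sketch below. It is an illustration rather than the released code: the function name is hypothetical, and treating anything other than exactly two non-empty lines as unparseable is an assumption about how strictly the harness reads the reply.

```python
import json


def parse_finish(reply: str):
    """Return (x, y, radius), or None when the reply is not a well-formed finish answer."""
    lines = [ln.strip() for ln in reply.strip().splitlines() if ln.strip()]
    if len(lines) != 2 or lines[0].replace(" ", "") != "Action:finish":
        return None
    label, _, payload = lines[1].partition(":")
    if label.replace(" ", "") != "ActionInput":
        return None
    try:
        args = json.loads(payload)
        return float(args["x"]), float(args["y"]), float(args["radius"])
    except (ValueError, KeyError, TypeError):
        return None  # recorded as a failure with no simulator call


placement = parse_finish('Action: finish\nAction Input: {"x": 0.3, "y": 0.9, "radius": 1.5}')
```

Returning None instead of raising lets the harness record the seed as a failure without ever touching the simulator, matching the scoring rule above.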

### C\.4 GRPO: Weight\-Update Baseline

We use Group Relative Policy Optimization \(Shao et al\., [2024](https://arxiv.org/html/2606.29315#bib.bib46)\) to fine\-tune Qwen\-2\.5\-3B\-Instruct on the same two levels reported in Section [D](https://arxiv.org/html/2606.29315#A4.SS0.SSS0.Px1) \(down\_to\_earth, two\_body\_problem\) using the VERL\-TOOL framework \(Jiang et al\., [2025](https://arxiv.org/html/2606.29315#bib.bib24)\)\. GRPO is included purely as a sample\-efficiency comparison for Section [D](https://arxiv.org/html/2606.29315#A4.SS0.SSS0.Px2); it is the only method we evaluate that updates model weights\.

##### Setup\.

Per\-step batch size 1, four rollouts per gradient step \($n{=}4$\), binary $\{0,1\}$ environment reward \(the simulator’s success predicate, identical to the scoring rule used by every other method\)\. One *epoch* corresponds to a full pass over the 50 training seeds, i\.e\. 50 gradient updates\. We train for 10 epochs \(500 total gradient steps\)\.
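To make the setup concrete, the sketch below computes group-relative advantages for one group of $n{=}4$ binary rewards and the step counts implied by the numbers above. The normalisation shown (reward minus group mean, divided by group standard deviation) is the usual GRPO form; using the population standard deviation and this `eps` value are assumptions, and the VERL\-TOOL implementation may differ in such details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise each rollout's reward by the mean and std of its own group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


one_success = group_relative_advantages([1, 0, 0, 0])
all_fail = group_relative_advantages([0, 0, 0, 0])  # every advantage is zero

SEEDS, EPOCHS, ROLLOUTS_PER_STEP = 50, 10, 4
total_updates = SEEDS * EPOCHS                      # 500, as stated in the text
total_rollouts = total_updates * ROLLOUTS_PER_STEP  # derived: rollouts over the full run
```

With a binary reward, a group in which all four rollouts fail (or all succeed) yields zero advantages and therefore no learning signal for that seed, which is one reason early epochs on hard levels can make little progress.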

##### Hyperparameters\.

All runs share the configuration in Table [6](https://arxiv.org/html/2606.29315#A3.T6); the only difference between the non\-skilled and static\-skilled variants is max\_prompt\_length \(2048 vs\. 3072\), which is increased to accommodate the skill\-bank prefix \($\approx 2200$ tokens vs\. $\approx 1400$ tokens without skills\)\.

Table 6: GRPO hyperparameters shared across all four runs\.
##### Prompt variants\.

We evaluate two prompt variants\. The *non\-skilled* variant uses the level’s full system prompt \(Appendix [H](https://arxiv.org/html/2606.29315#A8)\) without any skill block; the *static\-skilled* variant prepends the skill bank that HExA ends with after its final training pass, baked in at data\-generation time so the GRPO trainer sees a fixed augmented prompt across all rollouts\. The static\-skilled variant therefore tests whether the skill bank is also useful as a frozen prefix to gradient\-based RL\.

##### Results\.

Table [7](https://arxiv.org/html/2606.29315#A3.T7) reports greedy val success rate \(mean@1, 50 val seeds\) at each epoch boundary\. Figure [8](https://arxiv.org/html/2606.29315#S4.F8) visualizes the same data on a log\-scaled x\-axis to emphasize the low\-data region\. Three observations follow\.

Table 7: GRPO val success rate \(greedy mean@1, 50 val seeds\) per epoch\. Epoch 0 is the pre\-training baseline \(no gradient updates\)\. non\-skilled DTE, non\-skilled TBP, static\-skilled DTE, static\-skilled TBP\.
##### Compute matching\.

The figure in Section [D](https://arxiv.org/html/2606.29315#A4.SS0.SSS0.Px2) marks the matched\-sample\-budget point at 50 unique seeds \(one epoch\)\. At this point, GRPO has performed 50 gradient updates and produced $50 \times 4 = 200$ rollouts; HExA has generated 51 trajectories total \(5 warm\-start \+ $17 \times 3$ training\)\. We compare on *unique seeds seen*, not on rollouts or gradient steps, since counting rollouts or gradient steps would artificially advantage the gradient\-based method\.

### C\.5 Seed Ranges per \(method, level\) Cell

For full reproducibility, Table [8](https://arxiv.org/html/2606.29315#A3.T8) lists the integer seed range used by each \(method, level\) cell reported in Section [4](https://arxiv.org/html/2606.29315#S4)\. Seeds are passed directly to the simulator’s level generator and determine the randomised positions, masses, and dimensions of the level’s objects \(see Section [4\.1](https://arxiv.org/html/2606.29315#S4.SS1)\)\. All cells use exactly 50 seeds\.

Table 8: Seed range used by each \(method, level\) cell\. All within\-level cells share the evaluation pool 6–55, so comparisons across ReAct, Reflexion, and HExA on a given level are paired on identical seeds\.

| Method | Level | Seed range |
| --- | --- | --- |
| ReAct baseline | every level | 6–55 |
| Reflexion \($K{=}2$\) | catapult | 6–55 |
| HExA \(within\) | every level | 6–55 \(warm\-start 1–5 held out\) |
| HExA contrastive Qwen\-7B | DTE, TBP | 6–55 |
| Cross\-level $\to$ catapult | catapult | 6–55 |
| Cross\-level $\to$ FIP / TBP | falling\_into\_place, two\_body\_problem | 6–55 |
| GRPO training Qwen\-2\.5\-3B | DTE, TBP | 1–50 |

The seed scheme has two properties worth flagging\.

##### Within\-level: paired comparison\.

Every method on a given level is evaluated on the same 50 seeds, so the ReAct, Reflexion, and HExA numbers in Tables [5](https://arxiv.org/html/2606.29315#S4.F5), [12](https://arxiv.org/html/2606.29315#A4.T12), [10](https://arxiv.org/html/2606.29315#A4.T10), and [11](https://arxiv.org/html/2606.29315#A4.T11) are paired contrasts on identical instances rather than independent samples\. HExA additionally uses seeds 1–5 as a warm\-start batch to construct the offline bank $\mathcal{K}_0$; these seeds are not part of the evaluation pool and never appear in any reported solve\-rate cell\.

##### Round\-by\-round: bank has not yet been distilled on the attempted seed\.

Within HExA’s online rounds, the agent attempts a fresh batch of seeds \($k{=}3$ or $k{=}5$ per round\) using the current bank, and only after that round is the evolver allowed to update the bank from those trajectories\. Every reported HExA attempt is therefore made with a bank that has not yet been distilled on the seed being attempted, even though the bank is trained from earlier seeds on the same level\.
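The attempt-then-update ordering can be made explicit in a short sketch. `attempt` and `evolve` are hypothetical stand-ins for the agent and the evolver, and the bank is modelled here simply as the set of seeds it has been distilled from, which is enough to check the no-leak property.

```python
def online_rounds(eval_seeds, k, attempt, evolve, bank):
    """Attempt seeds in batches of k; update the bank only after each batch is scored.

    attempt(seed, bank) -> trajectory and evolve(bank, trajectories) -> new bank
    are hypothetical callables.
    """
    results = []
    for start in range(0, len(eval_seeds), k):
        batch = eval_seeds[start:start + k]
        trajectories = [attempt(seed, bank) for seed in batch]  # bank frozen within the round
        results.extend(trajectories)
        bank = evolve(bank, trajectories)  # distil only after the round's attempts
    return results, bank


warm_start = list(range(1, 6))   # seeds 1-5, used to build the initial bank
eval_pool = list(range(6, 56))   # seeds 6-55, the evaluation pool

results, final_bank = online_rounds(
    eval_pool,
    3,
    attempt=lambda seed, bank: (seed, frozenset(bank)),       # record the bank seen
    evolve=lambda bank, trajs: bank | {s for s, _ in trajs},  # add the round's seeds
    bank=set(warm_start),
)
leaks = [seed for seed, snapshot in results if seed in snapshot]
```

In this toy run `leaks` is empty: no seed is ever attempted with a bank that already contains it, which is the property the paragraph above describes.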

##### Cross\-level: same\-pool comparison against target baseline\.

For cross\-level transfer the synthesised target bank is evaluated on the same 50 seeds as the target\-level ReAct baseline, so the cross\-level numbers in Table [13](https://arxiv.org/html/2606.29315#A4.T13) are paired with the corresponding baseline numbers on identical instances\. The synthesis itself never sees a target\-level trajectory at any seed, so the comparison remains strictly zero\-shot at the trajectory level\.

## Appendix D Additional Details of Experiments and Results

Table 9: ReAct baseline solve rate \(%\) across eight Interphyre levels\. Underlined levels \(Pass the Parcel, Catapult\) form the harder tier, based on Claude Sonnet performance\. † Evaluated on 50 seeds; ‡ 25 seeds; all others 100 seeds\. — = not evaluated\.

![Refer to caption](https://arxiv.org/html/2606.29315v1/x7.png) \(a\) Catapult
![Refer to caption](https://arxiv.org/html/2606.29315v1/x8.png) \(b\) Pass the Parcel

Figure 11: Cumulative solve rate as seeds are evaluated, for every HExA variant against the strongest agent\-side baseline \(Claude Sonnet\)\. Each curve is the running mean over the seeds seen so far\. On both levels the Off2On Evolving configuration finishes highest \(76% on catapult and 60% on pass\_the\_parcel\) and stays consistently above the other variants and baselines throughout the sweep, so the ranking is not an artifact of a few lucky seeds\.

![Refer to caption](https://arxiv.org/html/2606.29315v1/x9.png) \(a\) Catapult
![Refer to caption](https://arxiv.org/html/2606.29315v1/x10.png) \(b\) Pass the Parcel

Figure 12: Cumulative average turns per seed for every HExA variant against the baseline \(Claude Sonnet\); lower is more efficient\. On both levels the Off2On Evolving configuration converges to the lowest cost per seed, while the iterative and pure\-online variants remain higher, indicating that HExA’s gains come from guiding search more efficiently with accumulated experience rather than from spending more computation per instance\.

##### The gains extend to open\-weight solvers

The gains from HExA are not limited to a frontier\-model agent: the same skill\-evolution mechanism also improves smaller open\-weight agents\. Table [10](https://arxiv.org/html/2606.29315#A4.T10) shows improvements for every open\-weight model\-level pair tested\.

Table 10: HExA on open\-weight solvers, 50 training/evaluation seeds per cell\. Qwen\-2\.5 runs use Online\-Evolving configuration \($k{=}3$ seeds per round, 17 rounds\)\. GPT\-OSS\-120B uses the same Off2On Evolving configuration\. Baselines are the same agent without any skill injection\. Per\-seed cumulative trajectories are unchanged from the per\-seed metric reported in Figure [13](https://arxiv.org/html/2606.29315#A4.F13)\.

| Level | Method | Acc\. \(%\) | Avg Iters \(max 25\) | $\Delta$ vs\. baseline |
| --- | --- | --- | --- | --- |
| Down to Earth | Qwen\-2\.5\-3B ReAct baseline | 8\.0 | 23\.8 | – |
| | Qwen\-2\.5\-3B HExA | 24\.0 | 18\.6 | \+16\.0 |
| | Qwen\-2\.5\-7B ReAct baseline | 62\.0 | 12\.5 | – |
| | Qwen\-2\.5\-7B HExA | 72\.0 | 12\.6 | \+10\.0 |
| Two Body Problem | Qwen\-2\.5\-3B ReAct baseline | 6\.0 | 24\.0 | – |
| | Qwen\-2\.5\-3B HExA | 14\.0 | 21\.1 | \+8\.0 |
| | Qwen\-2\.5\-7B ReAct baseline | 18\.0 | 22\.2 | – |
| | Qwen\-2\.5\-7B HExA | 34\.0 | 18\.9 | \+16\.0 |
| Catapult | GPT\-OSS\-120B ReAct baseline | 0\.0 | 25\.0 | – |
| | GPT\-OSS\-120B HExA | 54\.0 | 16\.2 | \+54\.0 |

For Qwen\-2\.5\-3B, HExA improves down\_to\_earth from 8\.0% to 24\.0% and two\_body\_problem from 6\.0% to 14\.0%\. For Qwen\-2\.5\-7B, it improves down\_to\_earth from 62\.0% to 72\.0% and two\_body\_problem from 18\.0% to 34\.0%\. The largest open\-weight gain occurs on catapult: GPT\-OSS\-120B scores 0\.0% under the ReAct baseline but reaches 54\.0% with HExA\. These results show that the method is not merely exploiting Claude Sonnet’s stronger prior knowledge; the skill bank provides a general in\-context adaptation mechanism that can lift weaker agents as well\.

Table 11: Reward\-signal ablation on Qwen\-2\.5\-7B, 50 seeds per cell\. We compare full HExA distillation \(reward\-weighted, two\-pass\) against a contrastive\-only variant that drops the reward from Eq\. [6](https://arxiv.org/html/2606.29315#A1.E6)\. The ReAct baseline is the same Qwen\-2\.5\-7B agent with no skill bank\.

##### In\-context RL method HExA vs\. Gradient\-Based LLM RL method GRPO

Figure [8](https://arxiv.org/html/2606.29315#S4.F8) shows validation accuracy across 6 GRPO epochs for two variants: one where the agent has access to no skills, and one where a skill bank previously evolved by HExA is baked into the prompt at data\-generation time\. With sufficient training, GRPO’s direct optimisation on the exact environment reward dominates: down\_to\_earth reaches 100% by epoch 5 \(GRPO\) and epoch 6 \(GRPO w\. evolved skills\), while two\_body\_problem converges to 100% by epoch 4 for GRPO w\. evolved skills and plateaus around 40–50% for GRPO\.

![Refer to caption](https://arxiv.org/html/2606.29315v1/x11.png)

Figure 13: HExA shows consistent improvement on small open LLMs as well\.

To situate HExA in the broader RL landscape, we train the same Qwen\-2\.5\-3B model on the same two levels with Group Relative Policy Optimization \(GRPO\) \(Shao et al\., [2024](https://arxiv.org/html/2606.29315#bib.bib46)\), using batch size 1, $n{=}4$ rollouts per step, and a binary $\{0,1\}$ reward from the environment\. With 50 training seeds and batch size 1, one epoch of GRPO corresponds to 50 gradient updates, the same number of unique puzzle instances seen by HExA in a single pass\. Figure [8](https://arxiv.org/html/2606.29315#S4.F8) shows val accuracy across 6 epochs for two variants: GRPO and GRPO w\. skills \(evolved skill bank baked into the prompt at data\-generation time\)\. At the matched sample budget \(50 unique seeds, vertical dashed line\), non\-skilled GRPO reaches 20% on down\_to\_earth and 6% on two\_body\_problem, compared to HExA’s 24% and 14% respectively\. This gap is consistent with the view that in\-context methods are more sample\-efficient in new domains: skills and strategies distilled from early trajectories are immediately available to subsequent episodes via context injection, whereas GRPO must accumulate gradient signal over many rollouts before policy weights encode the same knowledge\.

However, given sufficient training, GRPO’s exact environment reward eventually dominates: down\_to\_earth reaches 100% by epoch 5 \(GRPO\) and epoch 6 \(GRPO w\. skills\), while two\_body\_problem peaks at 56% and plateaus around 40–50% for GRPO, and converges to 100% by epoch 4 for GRPO w\. skills\. The static\-skilled variant additionally exhibits a pronounced epoch\-1 dip \(6% on DTE, 0% on TBP\) before recovering, consistent with the model needing several epochs to re\-adapt its generation format to the longer skill\-augmented prompt \($\sim 2200$ vs\. $\sim 1400$ tokens\) before it can leverage the injected skills\. Together, these results indicate that HExA’s in\-context skill evolution is preferable when the sample and interaction \(experimentation\) budgets are small, while gradient\-based RL with exact reward remains a strong option for further learning after an in\-context warm start\.

Table 12: HExA configuration comparison on pass\_the\_parcel \(Claude Sonnet, 50 seeds for every method, sorted by accuracy\)\. The five configurations cross two design axes: initialisation \(*Offline* / Off\-to\-Online \(*Off2On*\) / pure *Online*\) and per\-round update \(*Static* / *Iterative* re\-distillation / *Evolving* merge\)\. The headline configuration, Off2On Evolving with $x{=}3$ seeds per round, is the one we refer to simply as HExA elsewhere in the paper\. Per\-seed cumulative trajectories are shown in Figure [11](https://arxiv.org/html/2606.29315#A4.F11)\. The Direct baseline \(one scripted scene read \+ one Claude call, no simulator feedback\) is reported below ReAct for reference; its *Avg Iter* column reports the single committed model call per seed\. Reflexion \($K{=}2$ trials with verbal self\-reflection between trials\) is reported below ReAct; the placeholder cells \(–\) will be filled in once the full 50\-seed run completes\.

| Init | Update | Acc\. \(%\) | Succ\. | Fail | Avg Iter |
| --- | --- | --- | --- | --- | --- |
| Off2On | Evolving \($x{=}3$\) \(HExA\) | 60\.0 | 30 | 20 | 17\.3 |
| Online | Evolving \($x{=}3$\) | 58\.0 | 29 | 21 | 16\.4 |
| Online | Iterative | 56\.0 | 28 | 22 | 18\.7 |
| Off2On | Iterative | 48\.0 | 24 | 26 | 17\.0 |
| Offline | Static | 48\.0 | 24 | 26 | 18\.4 |
| ReAct baseline | | 24\.0 | 12 | 38 | 21\.3 |
| Reflexion \($K{=}2$ trials\) | | 16 | 8 | 42 | 22\.4 |
| Direct \(one\-shot, no tools\) | | 0\.0 | 0 | 50 | 1\.0 |

Table 13: Cross\-level skill transfer results\. For each *target* level we report \(i\) the source skill banks consumed by $\mathcal{M}_{\theta}$, with their per\-bank skill counts; \(ii\) the size of the synthesised target bank \($n_k$ skills / $n_\mu$ mistakes\); \(iii\) the agent that solves the target with *only* the synthesised bank injected; \(iv\) the resulting target success rate; and \(v\) the matched ReAct baseline\. No target\-level trajectories are seen at any stage of synthesis\.

## Appendix E Working Example: ReAct vs\. HExA on Level Catapult of Interphyre

This appendix walks through a single Interphyre seed, catapult seed 45, on which the ReAct baseline fails and HExA succeeds, with both agents using Claude Sonnet 4\.6 as the actor\. We reproduce, in order: the system prompt and full Thought/Action/Observation transcript for ReAct \(failure, 25 iterations\); the round\-14 evolving skill bank that the HExA agent has access to; and the system prompt and full transcript for HExA \(success in 6 iterations, final placement $(x{=}0.3,\, y{=}0.9,\, r{=}1.5)$\)\. The two prompts differ only in the LEARNED PHYSICS SKILLS block injected by the evolver $\rho$ ahead of the user prompt; the tool list, format instructions, and kickoff message are identical, isolating the contribution of the skill bank\.

### E\.1 Sparsified simulation frames \(catapult seed 45\)

Figure [4](https://arxiv.org/html/2606.29315#S4.F4) shows seven sparsely\-sampled frames from each agent’s final accepted simulation, in left\-to\-right time order\. The top row is ReAct: after 25 iterations the agent never produces a placement that brings the green ball into contact with the blue ball\. The bottom row is HExA’s 6\-iteration solve at $(x{=}0.3,\, y{=}0.9,\, r{=}1.5)$: the heavy red ball lands on the right end of the catapult arm, the lever flings the green ball over the ceiling\-blocker, and the green ball settles into the basket against the blue ball\.

### E\.2 ReAct system prompt \(catapult seed 45\)

============================================================

SYSTEM PROMPT

============================================================

You are an expert physics reasoning agent solving a 2D physics puzzle\. You have access

to a physics simulator and can test your ideas before submitting a final answer\.

\*\*Puzzle: Catapult\*\*

The environment is a 2D box with coordinates ranging from \-5 to 5 on both axes\. Gravity

pulls objects downward\.

\*\*Key Elements \(factual \-\-\- no implied approach\):\*\*

\- \*\*Green Ball:\*\* A small dynamic ball sitting on the LEFT end of a gray bar\.

\- \*\*Gray Bar \(Catapult Arm\):\*\* A dynamic lever resting on a gray ball \(pivot\)\. The green

ball sits on its left end\.

\- \*\*Gray Ball \(Pivot\):\*\* A dynamic ball acting as the fulcrum\. It sits on the left black

platform\.

\- \*\*Black Ball \(Ceiling Blocker\):\*\* A static ball near the top of the scene\.

\- \*\*Black Platform \(Left\):\*\* A static horizontal platform on the left side\.

\- \*\*Black Ledge \(Right\):\*\* A static \(possibly angled\) platform on the right side\.

\- \*\*Basket \(Gray\):\*\* A dynamic basket sitting on the right ledge\.

\- \*\*Blue Ball \(Target\):\*\* A dynamic ball inside the basket\.

Use more radius for better energy \(r \> 1\)

\*\*The Goal:\*\*

Place ONE Red Ball somewhere in the box so that, once the simulation runs, the green

ball contacts the blue ball for at least 3 seconds\. The success condition is ONLY the

green\-blue contact \-\-\- how you achieve it is your choice\.

\*\*Placement Constraints:\*\*

\- The red ball must be completely inside the box: \-5 \+ radius <= x <= 5 \- radius, \-5 \+

radius <= y <= 5 \- radius\.

\- The red ball must NOT overlap with any existing object at t=0\.

\- 0\.1 <= radius <= 2\.0

You have access to the following tools to interact with the physics simulation:

1\. describe\_scene\_geometry

Description: Return strategy\-neutral geometry: every ball \(position, radius, dynamic flag\), every bar \(position, angle, length, dynamic flag\), every basket \(position, dynamic flag\), and the key distance \(green <\-\> blue\)\. No prescriptive advice; you interpret the layout to form a strategy\.

Arguments: None

Usage: Action: describe\_scene\_geometry

1\. get\_level\_state

Description: Get the current level layout including all object positions, sizes, and properties\.

Arguments: None

Usage: Action: get\_level\_state

2\. simulate\_action

Description: Place a red ball at \(x, y\) with the given radius and run the full physics simulation to completion\. Returns whether the goal was achieved, final positions of all objects, and total simulation steps\. If the placement is invalid \(out of bounds or overlaps\), returns a detailed error with how far to move the ball\.

Arguments: x \(float\), y \(float\), radius \(float\)

Usage: Action: simulate\_action

Action Input: \{"x": 0\.5, "y": 4\.0, "radius": 0\.6\}

3\. get\_contact\_log

Description: After running a simulation, returns the contact events: which objects touched and when\.

Arguments: None

Usage: Action: get\_contact\_log

4\. simulate\_partial

Description: Place a red ball and run the simulation only up to the specified step\. Returns object positions and velocities at that point\. Useful for observing mid\-simulation dynamics\.

Arguments: x \(float\), y \(float\), radius \(float\), stop\_step \(int\)

Usage: Action: simulate\_partial

Action Input: \{"x": 0\.5, "y": 4\.0, "radius": 0\.6, "stop\_step": 50\}

6\. trace\_green\_ball

Description: Lightweight trajectory probe \-\-\- only the green ball is sampled\. Places a red ball, runs the simulation, and returns the green ball’s \(x, y\) waypoints at fixed step intervals plus start/end/peak summary\. Stops early once the green ball comes to rest \(capped at ~600 steps\)\. Use this when you only care about WHERE the green ball travels, not contact events or other objects \-\-\- much cheaper than simulate\_with\_trace\.

Arguments: x \(float\), y \(float\), radius \(float\)

Usage: Action: trace\_green\_ball

Action Input: \{"x": 1\.2, "y": 3\.5, "radius": 0\.6\}

7\. predict\_first\_contact

Description: Cheap pre\-simulation check \(<= 90 physics steps, ~1\.5 s of sim time\)\. Runs just long enough to find the FIRST object the red ball touches after it is released, and reports: placement validity, the other object’s name, the step of impact, approach speed, approximate contact point, and surface normal\. Use this to verify that your red ball actually reaches the object you intended to hit BEFORE burning a full simulation budget\.

Arguments: x \(float\), y \(float\), radius \(float\)

Usage: Action: predict\_first\_contact

Action Input: \{"x": 1\.2, "y": 3\.5, "radius": 0\.6\}

9\. finish

Description: Submit your final answer\. Use this when you are confident in your solution\.

Arguments: x \(float\), y \(float\), radius \(float\)

Usage: Action: finish

Action Input: \{"x": 0\.5, "y": 4\.0, "radius": 0\.6\}

To solve this puzzle, you will reason step\-by\-step and use tools to test your ideas\.

At each step, you MUST follow this exact format:

Thought: <your reasoning about what to do next\>

Action: <tool name\>

Action Input: <JSON arguments, or leave blank for tools with no arguments\>

After you take an action, you will receive:

Observation: <result from the tool\>

Then you continue with another Thought/Action cycle\.

When you are confident in your answer, use the "finish" tool to submit it\.

Important rules:

\- Always start with a Thought before taking an Action\.

\- Only call ONE tool per step\.

\- Parse observation results carefully before your next thought\.

\- You can simulate multiple different actions to compare results\.

\- Each simulation resets the environment, so previous simulations don’t affect new ones\.
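The Thought / Action / Action Input format above is simple enough to parse with a regular expression. The sketch below is an illustration only, not the harness used in the paper: `parse_step` and `STEP_RE` are hypothetical names, and the parser assumes exactly one tool call per step, as the rules require.

```python
import json
import re

# Matches one step: a Thought, a tool name, and optional JSON arguments.
# "Action\s*Input" tolerates the label with or without the space.
STEP_RE = re.compile(
    r"Thought:\s*(?P<thought>.*?)\s*Action:\s*(?P<action>\w+)\s*"
    r"(?:Action\s*Input:\s*(?P<args>\{.*\})?)?\s*$",
    re.DOTALL,
)


def parse_step(text):
    """Split one model turn into (thought, tool_name, arguments)."""
    match = STEP_RE.search(text.strip())
    if match is None:
        # Covers a missing Thought, which the rules forbid.
        raise ValueError("step does not follow the Thought/Action format")
    raw_args = match.group("args")
    # Tools without arguments leave Action Input blank or omit it.
    args = json.loads(raw_args) if raw_args else {}
    return match.group("thought"), match.group("action"), args
```

A turn such as `Thought: ...` followed by `Action: describe_scene_geometry` and no arguments yields an empty argument dictionary.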

============================================================

USER PROMPT

============================================================

Solve the Catapult puzzle\. You have 25 iterations to solve it\. The success condition is: green\_ball must contact blue\_ball\. Use tools effectively and think about alternate approaches if one does not work\.

\- \*\*ALWAYS call predict\_first\_contact first\*\* \-\-\- it is cheap and tells you if your red ball hits the intended object\.

\- Only if predict confirms the right contact: call trace\_green\_ball to track green ball\.
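The two rules at the end of the user prompt describe a cheap-check-then-trace gate. A minimal sketch of that control flow is given below. It assumes the two tools are exposed as Python callables that return dictionaries with `placement_valid` and `first_contact_object` keys; in the benchmark the tools return text, so this interface, the function name `probe_then_trace`, and the default `intended` object are all assumptions made for illustration.

```python
def probe_then_trace(candidate, predict_first_contact, trace_green_ball,
                     intended="gray_platform"):
    """Run the cheap contact check first; trace only if it hits `intended`.

    `candidate` is a dict with x, y, radius. Both tool arguments are
    hypothetical callables standing in for the real simulation tools.
    """
    report = predict_first_contact(**candidate)
    if not report.get("placement_valid", False):
        return {"stage": "predict", "reason": "invalid placement", "trace": None}
    if report.get("first_contact_object") != intended:
        return {"stage": "predict", "reason": "wrong first contact", "trace": None}
    # Only now spend the more expensive trajectory probe.
    return {"stage": "trace", "reason": "ok", "trace": trace_green_ball(**candidate)}
```

The point of the gate is budget: a placement that first touches the ledge or the bottom wall is rejected after one cheap call instead of a full trace.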

### E\.3 ReAct trajectory \(catapult seed 45, 25 iterations, FAILURE\)

\-\-\- Step 1 \-\-\-

Thought:

I need to first understand the scene layout before planning my approach\.

Action:

describe\_scene\_geometry\(\{\}\)

Observation:

=== Scene Geometry ===

World bounds: x in \[\-5, 5\], y in \[\-5, 5\]

Balls:

green\_ball: pos=\(\-3\.06, \-0\.57\), r=0\.20, dynamic=True

blue\_ball: pos=\(3\.50, \-2\.39\), r=0\.44, dynamic=True

black\_ball: pos=\(\-2\.28, 4\.60\), r=0\.40, dynamic=False

gray\_ball: pos=\(\-1\.14, \-1\.67\), r=0\.70, dynamic=True

Bars:

ledge: pos=\(3\.50, \-3\.23\), angle=6\.2deg, len=3\.02, dynamic=False

black\_platform: pos=\(\-1\.74, \-2\.47\), angle=0\.0deg, len=3\.00, dynamic=False

gray\_platform: pos=\(\-1\.14, \-0\.87\), angle=0\.0deg, len=4\.25, dynamic=True

Baskets:

basket: pos=\(3\.50, \-3\.03\), dynamic=True

Key distances:

green <\-\> blue: 6\.81

Success condition: green\_ball <\-\> blue\_ball for 3\.0s

==================================================

\-\-\- Step 2 \-\-\-

Thought:

Let me analyze the scene\. The catapult arm \(gray\_platform\) is a dynamic bar at \(\-1\.14, \-0\.87\) with the green ball sitting on its left end at \(\-3\.06, \-0\.57\)\. The gray ball pivot is at \(\-1\.14, \-1\.67\)\. The blue ball is in the basket at \(3\.50, \-2\.39\)\.

To launch the green ball toward the blue ball, I need to drop a red ball on the RIGHT end of the catapult arm to push it down and launch the green ball upward and to the right\. The right end of the catapult arm would be around \(1\.0, \-0\.87\) since the arm is 4\.25 long centered at \(\-1\.14, \-0\.87\)\. Let me drop a red ball above the right end of the catapult arm\.

Action:

predict\_first\_contact\(\{'x': 1\.5, 'y': 2\.0, 'radius': 0\.5\}\)

Observation:

=== Predicted First Contact ===

placement\_valid: True

red\_ball drop: x=1\.500, y=2\.000, radius=0\.500

first\_contact\_object: ledge

contact\_step: 62

approach\_speed: 9\.055

contact\_point \(approx midpoint\): \(2\.487, \-3\.269\)

contact\_normal \(red\-\>other inverted\): \(\-0\.999, \-0\.034\)

\-\-\- Step 3 \-\-\-

Thought:

The red ball at \(1\.5, 2\.0\) fell to the ledge, not the catapult arm\. I need to aim for the right end of the catapult arm\. The catapult arm is centered at \(\-1\.14, \-0\.87\) with length 4\.25, so its right end is around \(\-1\.14\+4\.25/2, \-0\.87\) = \(0\.99, \-0\.87\)\. I should drop the red ball above the right end of the arm, around x=1\.0, y=1\.0\.

Action:

predict\_first\_contact\(\{'x': 1\.0, 'y': 1\.0, 'radius': 0\.4\}\)

Observation:

=== Predicted First Contact ===

placement\_valid: True

red\_ball drop: x=1\.000, y=1\.000, radius=0\.400

first\_contact\_object: gray\_platform

contact\_step: 28

approach\_speed: 2\.006

contact\_point \(approx midpoint\): \(\-0\.074, \-0\.470\)

contact\_normal \(red\-\>other inverted\): \(0\.940, 0\.342\)

\-\-\- Step 4 \-\-\-

Thought:

The red ball at \(1\.0, 1\.0\) hits the gray\_platform \(catapult arm\)\. Let me run a full simulation with trace to see if this launches the green ball toward the blue ball\.

Action:

simulate\_with\_trace\(\{'x': 1\.0, 'y': 1\.0, 'radius': 0\.4, 'object\_names': \['green\_ball', 'blue\_ball', 'catapult\_arm'\]\}\)

Observation:

=== Simulation Result \(steps: 2000\) ===

Success: False

Note: unknown object\_names: catapult\_arm

Contacts \(red\_ball \+ traced objects\):

step ~0: gray\_platform <\-\> green\_ball

step ~0: basket <\-\> blue\_ball

step ~8: basket <\-\> blue\_ball

step ~13: basket <\-\> blue\_ball

step ~27: gray\_platform <\-\> red\_ball

step ~32: gray\_platform <\-\> red\_ball

step ~51: basket <\-\> blue\_ball

step ~82: bottom\_wall <\-\> red\_ball

step ~105: bottom\_wall <\-\> red\_ball

step ~110: bottom\_wall <\-\> red\_ball

step ~114: black\_platform <\-\> green\_ball

step ~132: black\_platform <\-\> green\_ball

step ~190: black\_platform <\-\> green\_ball

step ~229: bottom\_wall <\-\> green\_ball

step ~247: bottom\_wall <\-\> green\_ball

\.\.\. \+3 more

Object summary \(key kinematic extrema\):

green\_ball: y\_peak=0\.67, y\_min=\-4\.78, v\_max=7\.27, Deltapos=\(\-1\.72, \-4\.20\)

omega\_peak=4\.69 rad/s, theta in \[\-0\.7deg, \+887\.4deg\]

blue\_ball: y\_peak=\-2\.40, y\_min=\-2\.52, v\_max=1\.31, Deltapos=\(\-0\.25, \-0\.12\)

omega\_peak=1\.37 rad/s, theta in \[\+0\.0deg, \+33\.1deg\]

\-\-\- Step 5 \-\-\-

Thought:

The green ball only went up to y=0\.67 and fell back down without reaching the blue ball at \(3\.50, \-2\.39\)\. I need the green ball to fly further right\. I’ll try dropping a larger red ball from higher above the right end of the catapult arm to give it more momentum\.

Action:

predict\_first\_contact\(\{'x': 1\.0, 'y': 3\.0, 'radius': 0\.8\}\)

Observation:

=== Predicted First Contact ===

placement\_valid: True

red\_ball drop: x=1\.000, y=3\.000, radius=0\.800

first\_contact\_object: gray\_platform

contact\_step: 42

approach\_speed: 5\.428

contact\_point \(approx midpoint\): \(\-0\.098, \-0\.148\)

contact\_normal \(red\-\>other inverted\): \(0\.841, 0\.541\)

\-\-\- Step 6 \-\-\-

Thought:

The red ball hits the gray\_platform with higher speed\. Let me simulate this to see if the green ball gets launched further toward the blue ball\.

Action:

simulate\_with\_trace\(\{'x': 1\.0, 'y': 3\.0, 'radius': 0\.8, 'object\_names': \['green\_ball', 'blue\_ball'\]\}\)

Observation:

=== Simulation Result \(steps: 2000\) ===

Success: False

Contacts \(red\_ball \+ traced objects\):

step ~0: gray\_platform <\-\> green\_ball

step ~0: basket <\-\> blue\_ball

step ~8: basket <\-\> blue\_ball

step ~13: basket <\-\> blue\_ball

step ~41: gray\_platform <\-\> red\_ball

step ~44: gray\_platform <\-\> red\_ball

step ~45: gray\_platform <\-\> green\_ball

step ~51: basket <\-\> blue\_ball

step ~70: ledge <\-\> red\_ball

step ~83: bottom\_wall <\-\> red\_ball

step ~101: bottom\_wall <\-\> red\_ball

step ~102: left\_wall <\-\> green\_ball

step ~155: left\_wall <\-\> red\_ball

step ~161: green\_ball <\-\> red\_ball

step ~167: left\_wall <\-\> green\_ball

\.\.\. \+16 more

Object summary \(key kinematic extrema\):

green\_ball: y\_peak=2\.34, y\_min=\-4\.79, v\_max=10\.22, Deltapos=\(\+2\.41, \-4\.21\)

omega\_peak=10\.75 rad/s, theta in \[\-321\.0deg, \+975\.4deg\]

blue\_ball: y\_peak=\-2\.40, y\_min=\-2\.52, v\_max=1\.31, Deltapos=\(\-0\.25, \-0\.12\)

omega\_peak=1\.37 rad/s, theta in \[\+0\.0deg, \+33\.1deg\]

\-\-\- Step 7 \-\-\-

Thought:

The green ball peaked at y=2\.34 and moved right by 2\.41 units but didn’t reach blue\_ball at \(3\.50, \-2\.39\)\. I need more force and a better angle\. Let me try a larger ball dropped from higher to hit the far right end of the catapult arm harder, ideally giving the green ball more rightward velocity\.

Action:

predict\_first\_contact\(\{'x': 0\.9, 'y': 4\.5, 'radius': 1\.0\}\)

Observation:

INVALID placement:

\- Top edge \(y\+radius=5\.5000\) exceeds ceiling \(5\)\.

placement\_valid=False; no contact check performed\.

\-\-\- Step 8 \-\-\-

Thought:

The radius is too large for that height\. Let me try dropping from y=3\.5 with radius 1\.0, which should fit within bounds\.

Action:

predict\_first\_contact\(\{'x': 0\.9, 'y': 3\.5, 'radius': 1\.0\}\)

Observation:

=== Predicted First Contact ===

placement\_valid: True

red\_ball drop: x=0\.900, y=3\.500, radius=1\.000

first\_contact\_object: gray\_platform

contact\_step: 44

approach\_speed: 6\.152

contact\_point \(approx midpoint\): \(\-0\.155, \-0\.018\)

contact\_normal \(red\-\>other inverted\): \(0\.784, 0\.621\)

\-\-\- Step 9 \-\-\-

Thought:

Good, the red ball hits the gray\_platform with speed 6\.152\. Let me simulate this to see if the green ball gets launched further toward the blue ball\.

Action:

simulate\_with\_trace\(\{'x': 0\.9, 'y': 3\.5, 'radius': 1\.0, 'object\_names': \['green\_ball', 'blue\_ball'\]\}\)

Observation:

=== Simulation Result \(steps: 2000\) ===

Success: False

Contacts \(red\_ball \+ traced objects\):

step ~0: gray\_platform <\-\> green\_ball

step ~0: basket <\-\> blue\_ball

step ~8: basket <\-\> blue\_ball

step ~13: basket <\-\> blue\_ball

step ~43: gray\_platform <\-\> red\_ball

step ~45: gray\_platform <\-\> red\_ball

step ~46: gray\_platform <\-\> green\_ball

step ~51: basket <\-\> blue\_ball

step ~72: ledge <\-\> red\_ball

step ~80: bottom\_wall <\-\> red\_ball

step ~91: left\_wall <\-\> green\_ball

step ~100: bottom\_wall <\-\> red\_ball

step ~159: left\_wall <\-\> red\_ball

step ~160: gray\_ball <\-\> green\_ball

step ~166: left\_wall <\-\> green\_ball

\.\.\. \+8 more

Object summary \(key kinematic extrema\):

green\_ball: y\_peak=3\.29, y\_min=\-1\.99, v\_max=9\.19, Deltapos=\(\-1\.72, \-1\.41\)

omega\_peak=14\.31 rad/s, theta in \[\-0\.7deg, \+1148\.2deg\]

blue\_ball: y\_peak=\-2\.40, y\_min=\-2\.52, v\_max=1\.31, Deltapos=\(\-0\.25, \-0\.12\)

omega\_peak=1\.37 rad/s, theta in \[\+0\.0deg, \+33\.1deg\]

\-\-\- Step 10 \-\-\-

Thought:

The green ball keeps going left/up rather than right toward the blue ball at \(3\.50, \-2\.39\)\. I need to hit the far right end of the catapult arm \(around x=0\.985\) to get more rightward torque\. Let me try a smaller ball placed further right and close above the arm level to hit the arm’s tip\.

Action:

predict\_first\_contact\(\{'x': 1\.5, 'y': 0\.5, 'radius': 0\.3\}\)

Observation:

=== Predicted First Contact ===

placement\_valid: True

red\_ball drop: x=1\.500, y=0\.500, radius=0\.300

first\_contact\_object: bottom\_wall

contact\_step: 62

approach\_speed: 2\.025

contact\_point \(approx midpoint\): \(0\.750, \-4\.825\)

contact\_normal \(red\-\>other inverted\): \(0\.975, 0\.221\)

\-\-\- Step 11 \-\-\-

Thought:

The red ball at \(1\.5, 0\.5\) missed the arm entirely\. The arm’s right end is around x=0\.985\. Previous hits showed contact far from the right tip\. Let me try placing closer to the arm surface just above the right end tip to get a better catapult effect\.

Action:

predict\_first\_contact\(\{'x': 0\.85, 'y': 0\.2, 'radius': 0\.3\}\)

Observation:

=== Predicted First Contact ===

placement\_valid: True

red\_ball drop: x=0\.850, y=0\.200, radius=0\.300

first\_contact\_object: gray\_platform

contact\_step: 20

approach\_speed: 0\.390

contact\_point \(approx midpoint\): \(\-0\.146, \-0\.594\)

contact\_normal \(red\-\>other inverted\): \(0\.965, 0\.262\)

\-\-\- Step 12 \-\-\-

Thought:

I need to hit the far right end of the catapult arm \(right tip at ~x=0\.985\) to maximize the catapult lever effect\. Previous hits were landing near the center \(\-0\.1\)\. Let me try placing directly above the right tip\.

Action:

predict\_first\_contact\(\{'x': 0\.85, 'y': \-0\.3, 'radius': 0\.2\}\)

Observation:

=== Predicted First Contact ===

placement\_valid: True

red\_ball drop: x=0\.850, y=\-0\.300, radius=0\.200

first\_contact\_object: gray\_platform

contact\_step: 13

approach\_speed: 0\.350

contact\_point \(approx midpoint\): \(\-0\.144, \-0\.677\)

contact\_normal \(red\-\>other inverted\): \(0\.983, 0\.185\)

\-\-\- Step 13 \-\-\-

Thought:

The catapult is launching the green ball but it keeps going left instead of right\. Let me try a different strategy \-\-\- knock the blue ball out of the basket and have it roll toward the green ball\. I’ll drop a red ball from above the basket area\.

Action:

predict\_first\_contact\(\{'x': 3\.5, 'y': 0\.0, 'radius': 0\.4\}\)

Observation:

=== Predicted First Contact ===

placement\_valid: True

red\_ball drop: x=3\.500, y=0\.000, radius=0\.400

first\_contact\_object: blue\_ball

contact\_step: 35

approach\_speed: 0\.632

contact\_point \(approx midpoint\): \(3\.442, \-2\.100\)

contact\_normal \(red\-\>other inverted\): \(0\.143, 0\.990\)

\-\-\- Step 14 \-\-\-

Thought:

The red ball hits the blue\_ball directly\. Let me simulate this and trace both blue and green balls to see if knocking the blue ball out of the basket causes it to roll toward the green ball\.

Action:

simulate\_with\_trace\(\{'x': 3\.5, 'y': 0\.0, 'radius': 0\.4, 'object\_names': \['green\_ball', 'blue\_ball'\]\}\)

Observation:

=== Simulation Result \(steps: 2000\) ===

Success: False

Contacts \(red\_ball \+ traced objects\):

step ~0: gray\_platform <\-\> green\_ball

step ~0: basket <\-\> blue\_ball

step ~8: basket <\-\> blue\_ball

step ~13: basket <\-\> blue\_ball

step ~34: blue\_ball <\-\> red\_ball

step ~42: blue\_ball <\-\> red\_ball

step ~48: basket <\-\> blue\_ball

step ~52: basket <\-\> red\_ball

step ~98: bottom\_wall <\-\> green\_ball

step ~118: bottom\_wall <\-\> green\_ball

step ~141: gray\_platform <\-\> green\_ball

step ~142: left\_wall <\-\> green\_ball

step ~147: gray\_platform <\-\> green\_ball

Object summary \(key kinematic extrema\):

green\_ball: y\_peak=\-0\.57, y\_min=\-4\.78, v\_max=8\.04, Deltapos=\(\-1\.72, \-4\.21\)

omega\_peak=9\.95 rad/s, theta in \[\-0\.7deg, \+591\.9deg\]

blue\_ball: y\_peak=\-2\.40, y\_min=\-2\.52, v\_max=1\.69, Deltapos=\(\-0\.26, \-0\.12\)

omega\_peak=1\.58 rad/s, theta in \[\+0\.0deg, \+32\.8deg\]

\-\-\- Step 15 \-\-\-

Thought:

The catapult launches the green ball upward/left, but I need it to reach the blue ball at \(3\.50, \-2\.39\)\. Let me try dropping a maximum\-size ball from as high as possible directly on the catapult arm’s right tip \(x~=0\.985\) to maximize launch force and rightward arc\.

Action:

predict\_first\_contact\(\{'x': 0\.95, 'y': 3\.8, 'radius': 1\.0\}\)

Observation:

=== Predicted First Contact ===

placement\_valid: True

red\_ball drop: x=0\.950, y=3\.800, radius=1\.000

first\_contact\_object: gray\_platform

contact\_step: 46

approach\_speed: 6\.415

contact\_point \(approx midpoint\): \(\-0\.136, 0\.007\)

contact\_normal \(red\-\>other inverted\): \(0\.784, 0\.621\)

\-\-\- Step 16 \-\-\-

Thought:

The best rightward launch so far was with r=0\.8 from \(1\.0, 3\.0\), giving green Deltapos=\(\+2\.41 right\)\. I need more force to get the green ball to travel 6\.56 units right to reach blue at \(3\.50, \-2\.39\)\. Let me try r=0\.8 from higher up \(y=4\.0\) to get more speed while still hitting near the arm’s right tip\.

Action:

predict\_first\_contact\(\{'x': 1\.0, 'y': 4\.0, 'radius': 0\.8\}\)

Observation:

=== Predicted First Contact ===

placement\_valid: True

red\_ball drop: x=1\.000, y=4\.000, radius=0\.800

first\_contact\_object: gray\_platform

contact\_step: 49

approach\_speed: 6\.294

contact\_point \(approx midpoint\): \(\-0\.119, \-0\.082\)

contact\_normal \(red\-\>other inverted\): \(0\.825, 0\.565\)

\-\-\- Step 17 \-\-\-

Thought:

The green ball has been going upward\-left rather than reaching the blue ball at \(3\.50, \-2\.39\)\. Let me simulate the high\-speed r=0\.9 hit on the catapult arm to see if more force gets the green ball further right\.

Action:

simulate\_with\_trace\(\{'x': 0\.95, 'y': 3\.8, 'radius': 0\.9, 'object\_names': \['green\_ball', 'blue\_ball'\]\}\)

Observation:

=== Simulation Result \(steps: 2000\) ===

Success: False

Contacts \(red\_ball \+ traced objects\):

step ~0: gray\_platform <\-\> green\_ball

step ~0: basket <\-\> blue\_ball

step ~8: basket <\-\> blue\_ball

step ~13: basket <\-\> blue\_ball

step ~46: gray\_platform <\-\> red\_ball

step ~51: basket <\-\> blue\_ball

step ~72: ledge <\-\> red\_ball

step ~78: left\_wall <\-\> green\_ball

step ~90: bottom\_wall <\-\> red\_ball

step ~106: bottom\_wall <\-\> red\_ball

step ~142: left\_wall <\-\> red\_ball

step ~152: green\_ball <\-\> red\_ball

step ~159: bottom\_wall <\-\> red\_ball

step ~164: bottom\_wall <\-\> red\_ball

step ~180: left\_wall <\-\> green\_ball

\.\.\. \+15 more

Object summary \(key kinematic extrema\):

green\_ball: y\_peak=1\.89, y\_min=\-4\.78, v\_max=9\.57, Deltapos=\(\+2\.05, \-4\.21\)

omega\_peak=12\.02 rad/s, theta in \[\-0\.7deg, \+1320\.1deg\]

blue\_ball: y\_peak=\-2\.40, y\_min=\-2\.52, v\_max=1\.31, Deltapos=\(\-0\.25, \-0\.12\)

omega\_peak=1\.37 rad/s, theta in \[\+0\.0deg, \+33\.1deg\]

\-\-\- Step 18 \-\-\-

Thought:

The green ball has been launching leftward or not reaching the blue ball\. Let me try a much larger ball \(r=2\.0\) on the right end of the catapult arm for maximum force to see if the green ball can travel the required ~6\.5 units to the right\.

Action:

predict\_first\_contact\(\{'x': 0\.9, 'y': 2\.8, 'radius': 2\.0\}\)

Observation:

=== Predicted First Contact ===

placement\_valid: True

red\_ball drop: x=0\.900, y=2\.800, radius=2\.000

first\_contact\_object: gray\_platform

contact\_step: 35

approach\_speed: 5\.267

contact\_point \(approx midpoint\): \(\-0\.138, 0\.116\)

contact\_normal \(red\-\>other inverted\): \(0\.730, 0\.684\)

\-\-\- Step 19 \-\-\-

Thought:

Let me simulate the large r=2\.0 ball hitting the catapult arm to see if maximum mass provides enough force to launch the green ball to reach the blue ball\.

Action:

simulate\_with\_trace\(\{'x': 0\.9, 'y': 2\.8, 'radius': 2\.0, 'object\_names': \['green\_ball', 'blue\_ball'\]\}\)

Observation:

=== Simulation Result \(steps: 2000\) ===

Success: False

Contacts \(red\_ball \+ traced objects\):

step ~0: gray\_platform <\-\> green\_ball

step ~0: basket <\-\> blue\_ball

step ~8: basket <\-\> blue\_ball

step ~13: basket <\-\> blue\_ball

step ~34: gray\_platform <\-\> red\_ball

step ~42: gray\_platform <\-\> red\_ball

step ~48: basket <\-\> red\_ball

step ~48: basket <\-\> blue\_ball

step ~52: basket <\-\> blue\_ball

step ~58: black\_platform <\-\> red\_ball

step ~62: basket <\-\> red\_ball

step ~75: ledge <\-\> red\_ball

step ~86: ledge <\-\> red\_ball

step ~90: top\_wall <\-\> green\_ball

step ~94: basket <\-\> blue\_ball

\.\.\. \+23 more

Object summary \(key kinematic extrema\):

green\_ball: y\_peak=4\.78, y\_min=\-4\.78, v\_max=11\.52, Deltapos=\(\-1\.72, \-4\.21\)

omega\_peak=12\.52 rad/s, theta in \[\-0\.7deg, \+2042\.6deg\]

blue\_ball: y\_peak=\-2\.25, y\_min=\-2\.52, v\_max=1\.53, Deltapos=\(\-0\.03, \-0\.10\)

omega\_peak=1\.94 rad/s, theta in \[\-10\.8deg, \+51\.9deg\]

\-\-\- Step 20 \-\-\-

Thought:

The green ball keeps launching left and not reaching the blue ball\. Let me try a very different approach \-\-\- drop a precisely\-aimed, small red ball from very high above the arm’s right tip \(x~=0\.985\) to hit it with maximum speed at the tip for a stronger rightward catapult\.

Action:

predict\_first\_contact\(\{'x': 0\.985, 'y': 4\.6, 'radius': 0\.2\}\)

Observation:

=== Predicted First Contact ===

placement\_valid: True

red\_ball drop: x=0\.985, y=4\.600, radius=0\.200

first\_contact\_object: bottom\_wall

contact\_step: 83

approach\_speed: 2\.711

contact\_point \(approx midpoint\): \(0\.493, \-4\.878\)

contact\_normal \(red\-\>other inverted\): \(0\.973, 0\.232\)

\-\-\- Step 21 \-\-\-

Thought:

The small ball missed the arm entirely\. Let me try a completely different approach \-\-\- drop the red ball on the basket to knock the basket \(with blue ball inside\) leftward, so the blue ball rolls toward the green ball\. The basket is dynamic\.

Action:

predict\_first\_contact\(\{'x': 3\.5, 'y': 3\.0, 'radius': 0\.5\}\)

Observation:

=== Predicted First Contact ===

placement\_valid: True

red\_ball drop: x=3\.500, y=3\.000, radius=0\.500

first\_contact\_object: basket

contact\_step: 51

approach\_speed: 7\.452

contact\_point \(approx midpoint\): \(3\.493, \-1\.859\)

contact\_normal \(red\-\>other inverted\): \(\-0\.003, 1\.000\)

\-\-\- Step 22 \-\-\-

Thought:

The red ball hits the basket directly with speed 7\.452\. Let me simulate to see if the basket \(dynamic\) gets knocked leftward, potentially bringing the blue ball toward the green ball\.

Action:

simulate\_with\_trace\(\{'x': 3\.5, 'y': 3\.0, 'radius': 0\.5, 'object\_names': \['green\_ball', 'blue\_ball', 'basket'\]\}\)

Observation:

=== Simulation Result \(steps: 2000\) ===

Success: False

Contacts \(red\_ball \+ traced objects\):

step ~0: gray\_platform <\-\> green\_ball

step ~0: basket <\-\> blue\_ball

step ~7: basket <\-\> ledge

step ~8: basket <\-\> blue\_ball

step ~13: basket <\-\> blue\_ball

step ~50: basket <\-\> red\_ball

step ~50: basket <\-\> blue\_ball

step ~58: blue\_ball <\-\> red\_ball

step ~64: basket <\-\> blue\_ball

step ~66: basket <\-\> ledge

step ~78: basket <\-\> red\_ball

step ~81: blue\_ball <\-\> red\_ball

step ~98: bottom\_wall <\-\> green\_ball

step ~118: bottom\_wall <\-\> green\_ball

step ~120: basket <\-\> blue\_ball

\.\.\. \+5 more

Object summary \(key kinematic extrema\):

green\_ball: y\_peak=\-0\.57, y\_min=\-4\.78, v\_max=8\.04, Deltapos=\(\-1\.72, \-4\.21\)

omega\_peak=9\.95 rad/s, theta in \[\-0\.7deg, \+591\.9deg\]

blue\_ball: y\_peak=\-2\.40, y\_min=\-2\.52, v\_max=1\.36, Deltapos=\(\-0\.25, \-0\.12\)

omega\_peak=1\.34 rad/s, theta in \[\+0\.0deg, \+34\.2deg\]

basket: y\_peak=\-3\.04, y\_min=\-3\.12, v\_max=2\.17, Deltapos=\(\-0\.01, \-0\.08\)

omega\_peak=2\.01 rad/s, theta in \[\+4\.8deg, \+9\.4deg\]

\-\-\- Step 23 \-\-\-

Thought:

The r=0\.8 case sent the green ball rightward \(\+2\.41\) with v\_max=10\.22, suggesting it may have traveled further right before bouncing\. I need to try a moderate\-force hit that sends the green ball in a shallower arc to land in the basket at \(3\.50, \-3\.03\) where the blue ball is\. Let me try r=0\.4 from a moderate height\.

Action:

simulate\_with\_trace\(\{'x': 1\.0, 'y': 2\.0, 'radius': 0\.4, 'object\_names': \['green\_ball', 'blue\_ball'\]\}\)

Observation:

=== Simulation Result \(steps: 2000\) ===

Success: False

Contacts \(red\_ball \+ traced objects\):

step ~0: gray\_platform <\-\> green\_ball

step ~0: basket <\-\> blue\_ball

step ~8: basket <\-\> blue\_ball

step ~13: basket <\-\> blue\_ball

step ~36: gray\_platform <\-\> red\_ball

step ~45: gray\_platform <\-\> red\_ball

step ~51: basket <\-\> blue\_ball

step ~72: ledge <\-\> red\_ball

step ~89: bottom\_wall <\-\> red\_ball

step ~105: bottom\_wall <\-\> red\_ball

step ~108: bottom\_wall <\-\> red\_ball

step ~128: bottom\_wall <\-\> green\_ball

step ~132: left\_wall <\-\> green\_ball

step ~151: bottom\_wall <\-\> green\_ball

step ~155: bottom\_wall <\-\> green\_ball

\.\.\. \+6 more

Object summary \(key kinematic extrema\):

green\_ball: y\_peak=\-0\.05, y\_min=\-4\.78, v\_max=9\.58, Deltapos=\(\-1\.71, \-4\.21\)

omega\_peak=6\.06 rad/s, theta in \[\-0\.7deg, \+498\.3deg\]

blue\_ball: y\_peak=\-2\.40, y\_min=\-2\.52, v\_max=1\.31, Deltapos=\(\-0\.25, \-0\.12\)

omega\_peak=1\.37 rad/s, theta in \[\+0\.0deg, \+33\.1deg\]

\-\-\- Step 24 \-\-\-

Thought:

Let me try a partial simulation with r=0\.8 at \(1\.0, 3\.0\) to track the green ball’s trajectory and see if it passes near the basket/blue ball region around step 100\-150\.

Action:

simulate\_partial\(\{'x': 1\.0, 'y': 3\.0, 'radius': 0\.8, 'stop\_step': 120\}\)

Observation:

Simulation state at step 120:

green\_ball: pos=\(\-4\.689, 1\.626\), vel=\(0\.311, \-3\.514\)

red\_ball: pos=\(\-1\.833, \-4\.175\), vel=\(\-3\.941, 0\.000\)

blue\_ball: pos=\(3\.246, \-2\.518\), vel=\(0\.000, 0\.000\)

ledge: pos=\(3\.500, \-3\.235\), vel=\(0\.000, 0\.000\)

basket: pos=\(3\.493, \-3\.119\), vel=\(0\.000, 0\.000\)

black\_ball: pos=\(\-2\.281, 4\.600\), vel=\(0\.000, 0\.000\)

black\_platform: pos=\(\-1\.736, \-2\.473\), vel=\(0\.000, 0\.000\)

gray\_ball: pos=\(\-2\.219, \-1\.668\), vel=\(\-0\.876, 0\.000\)

gray\_platform: pos=\(\-0\.997, \-0\.228\), vel=\(0\.600, 0\.011\)

\-\-\- Step 25 \-\-\-

Thought:

I need to understand the green ball’s initial velocity direction right after catapult launch\. Let me check a partial sim at step 50 with the r=0\.8 trigger to see the green ball’s exact state right after launch\.

Action:

simulate\_partial\(\{'x': 1\.0, 'y': 3\.0, 'radius': 0\.8, 'stop\_step': 50\}\)

Observation:

Simulation state at step 50:

green\_ball: pos=\(\-3\.433, \-0\.337\), vel=\(\-1\.778, 6\.657\)

red\_ball: pos=\(1\.084, \-0\.229\), vel=\(0\.532, \-6\.311\)

blue\_ball: pos=\(3\.258, \-2\.519\), vel=\(\-0\.584, \-0\.064\)

ledge: pos=\(3\.500, \-3\.235\), vel=\(0\.000, 0\.000\)

basket: pos=\(3\.499, \-3\.119\), vel=\(\-0\.000, 0\.000\)

black\_ball: pos=\(\-2\.281, 4\.600\), vel=\(0\.000, 0\.000\)

black\_platform: pos=\(\-1\.736, \-2\.473\), vel=\(0\.000, 0\.000\)

gray\_ball: pos=\(\-1\.221, \-1\.669\), vel=\(\-0\.713, 0\.000\)

gray\_platform: pos=\(\-1\.352, \-0\.853\), vel=\(\-1\.090, 0\.517\)
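The trajectory above never closes the horizontal gap between the green and blue balls. The short script below tabulates the green ball's net horizontal displacement (the x component of `Deltapos`) from the eight full simulations and compares the best one with the required travel. All numbers are copied from the observations above; the variable names are illustrative, and note that `Deltapos` is a net start-to-end displacement, so it can understate how far the ball traveled mid-flight.

```python
# Initial x positions from the Step 1 scene description.
GREEN_X0, BLUE_X0 = -3.06, 3.50

# (step, x, y, radius, green_ball net delta-x) for each simulate_with_trace call.
RUNS = [
    (4, 1.0, 1.0, 0.4, -1.72),
    (6, 1.0, 3.0, 0.8, +2.41),
    (9, 0.9, 3.5, 1.0, -1.72),
    (14, 3.5, 0.0, 0.4, -1.72),
    (17, 0.95, 3.8, 0.9, +2.05),
    (19, 0.9, 2.8, 2.0, -1.72),
    (22, 3.5, 3.0, 0.5, -1.72),
    (23, 1.0, 2.0, 0.4, -1.71),
]

required = BLUE_X0 - GREEN_X0                 # horizontal travel needed
best = max(RUNS, key=lambda run: run[4])      # most rightward net displacement
shortfall = required - best[4]
rightward = [run[0] for run in RUNS if run[4] > 0]

print(f"required travel: {required:.2f}")
print(f"best run: step {best[0]} with delta-x {best[4]:+.2f}")
print(f"shortfall: {shortfall:.2f}")
print(f"steps with net rightward motion: {rightward}")
```

Only two of the eight simulations end with the green ball to the right of its start, and the best of them covers well under half of the required distance.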

### E\.4 Round\-14 evolving skill bank used by HExA

\# CATAPULT SKILLS

\[cat\_ev\_6\_000\]

Title:

Use Large Radius \(\>=1\.5\) for Sufficient Launch Energy

generation: 10 confidence: 0\.90

source seeds: \[44\]

Principle:

The catapult arm is a lever \-\-\- the red ball’s mass \(proportional to r^3\) determines angular momentum imparted to the arm\. Below r~=1\.2, the arm rotates too slowly to launch the green ball across the ~7\-unit gap to the basket\. r=1\.5 is the minimum reliable threshold; r=2\.0 adds negligible additional range at x=0\.5 due to arm rotation saturation\. Only reduce radius when forced by overlap constraints, accepting reduced range\.

When to apply:

Always use r=1\.5 as the baseline\. Scale down only when overlap constraints require it\.

Example:

r=1\.5 at x=0\.5, y=0\.4 is the canonical primary placement \(Seed 44\)\.

\[cat\_ev\_1\_001\]

Title:

x~=0\.5 Is the Primary Catapult Launch Sweet Spot

generation: 10 confidence: 0\.91

source seeds: \[44\]

Principle:

Placing the red ball at x=0\.5 on the catapult arm produces a consistent rightward launch arc\. The contact point is on the right side of the pivot, creating sufficient lever arm for strong rotation while staying in the geometrically stable zone\. Only deviate to x=0\.3 when a ceiling hit is observed\.

When to apply:

Always as the first placement attempt\. Deviate only when ceiling hits or overlap constraints require it\.

Example:

x=0\.5, y=0\.4, r=1\.5 is the canonical primary \(Seed 44\)\.

\[cat\_ev\_7\_002\]

Title:

y=0\.4 Is Default Drop Height; Scale Up to y=0\.9 When Arm Sits Higher

generation: 10 confidence: 0\.90

source seeds: \[44\]

Principle:

When the arm y<=\-1\.5, y=0\.4 gives adequate fall distance for the red ball to build approach speed\. When the arm sits higher \(y\>\-1\.5\), increase to y=0\.9\. Critically, y=0\.4 also serves as the safe ceiling\-escape fallback height \-\-\- even when using x=0\.3 for arc flattening, y=0\.4 avoids gray\_ball pivot overlap without sacrificing arc geometry\.

When to apply:

Set y=0\.4 by default\. Use y=0\.9 when arm is high\. Use y=0\.4 even during ceiling\-escape x\-shifts to avoid gray\_ball overlap\.

Example:

Seed 44 arm at y=\-1\.59: y=0\.4 default; ceiling\-escape fallback x=0\.3, y=0\.4 also used y=0\.4\.

\[cat\_ev\_7\_003\]

Title:

predict\_first\_contact Is Essential for Variable\-Arm Seeds

generation: 10 confidence: 0\.88

source seeds: \[44\]

Principle:

The catapult arm’s y\-position varies across seeds, making the overlap boundary unpredictable\. predict\_first\_contact verifies both placement validity and that the red ball contacts gray\_platform \(not another object\), confirming the catapult mechanism activates\. It also catches gray\_ball pivot overlaps before expensive simulation runs\.

When to apply:

Before every simulate\_with\_trace call\. If placement\_valid=False, adjust position before simulating\.

Example:

Seed 44: predict\_first\_contact caught x=0\.3, y=\-0\.3, r=2\.0 overlapping gray\_ball before simulation was wasted\.

\[cat\_ev\_7\_004\]

Title:

Ceiling Blocker Lethality Depends on Its x\-Position and Arc Angle

generation: 10 confidence: 0\.87

source seeds: \[44\]

Principle:

The static black ball near y~=4\.6 varies in x\-position across seeds \(x~=\-3\.85 to x~=\-1\.14\)\. When the ceiling blocker is at x\>\-2\.5, primary placement causes a ceiling hit\. When the blocker is at x<\-3\.5 \(far left\), x=0\.3 launches allow the green ball to arc up and even bounce off the top wall before descending into the basket \-\-\- the blocker is off to the side of the arc\.

When to apply:

Check ceiling blocker x in describe\_scene\_geometry\. If x\>\-2\.5, ceiling escape is needed\. If x<\-3\.5, x=0\.3 launches can reach the top wall and still succeed\.

Example:

Seed 44: black\_ball at x=\-3\.63 \-\-\- green ball bounced off top\_wall \(step 72\) then landed in basket\.
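The two thresholds in this skill amount to a three-way check on the blocker's x-position. A small sketch follows; the function name and the returned labels are hypothetical, and the skill text says nothing about blockers between x = -3.5 and x = -2.5, so that band is reported as unspecified rather than guessed.

```python
def ceiling_blocker_advice(blocker_x):
    """Map the static black ball's x-position to the skill's recommendation."""
    if blocker_x > -2.5:
        # Primary placement is expected to hit the blocker.
        return "ceiling_escape_needed"
    if blocker_x < -3.5:
        # Blocker is off to the side of the arc; top-wall bounces can still succeed.
        return "top_wall_bounce_ok"
    # The skill gives no guidance for -3.5 <= x <= -2.5.
    return "unspecified"


print(ceiling_blocker_advice(-3.63))  # Seed 44 blocker position from the example
print(ceiling_blocker_advice(-2.28))  # black_ball position in the E.3 scene
```

Applied to the seed 45 scene in E.3, where `black_ball` sits at x = -2.28, the rule would call for a ceiling escape.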

\[cat\_ev\_7\_005\]

Title:

Radius Plateau at x=0\.5: r Beyond 1\.5 Yields No Additional Range

generation: 10 confidence: 0\.79

source seeds: \[42\]

Principle:

At x~=0\.5, the catapult arm’s rotation saturates \-\-\- r=1\.5 and r=2\.0 produce identical green ball landing positions\. Increasing mass beyond 1\.5 at this x cannot convert to additional angular momentum because the arm hits its geometric rotation limit\. Intermediate radii \(r=1\.1\-\-1\.4\) also fail to bridge ceiling\-range tradeoffs when the root cause is launch angle\.

When to apply:

When r=1\.5 fails at x=0\.5, do NOT try r=2\.0\. Change x\-position instead\.

Example:

Seed 42: agent tried r=1\.1, 1\.2, 1\.3, 1\.4, 2\.0 at x=0\.5 \-\-\- none resolved the ceiling\-range tradeoff\.

\[cat\_ev\_4\_005\]

Title:

x\-Position Fine\-Tunes Landing Range; Bifurcation Zones at x<=\-0\.3 and x=0\.7\-\-1\.5

generation: 10 confidence: 0\.82

source seeds: \[42, 44\]

Principle:

Small x shifts \(0\.1\-\-0\.2 units\) cause large changes in green ball landing position\. The stable launch zone is x~=0\.0 to 0\.5\. Positions x<=\-0\.3 and x=0\.7\-\-1\.5 are bifurcation zones where trajectory is chaotic and small changes produce unpredictable outcomes\. Within x=0\.1\-\-0\.5, landing is predictable and monotonically tunable\.

When to apply:

When adjusting x to change landing range, stay within x=0\.0\-\-0\.5\. Immediately reset to x=0\.2\-\-0\.4 if any attempt lands outside this band\.

Example:

x=0\.3 vs x=0\.5 both produce reliable arcs \(Seed 44\); x=0\.9 produces chaotic bounces \(Seed 42\)\.
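The zone boundaries in this skill can be written as a simple classifier over the red ball's x-position. This is a sketch with hypothetical names; the skill leaves some intervals undescribed (for instance between 0.5 and 0.7, or above 1.5), and the code labels those as unclassified instead of assigning them to either zone.

```python
def launch_zone(x):
    """Classify a red-ball x-position using the zones stated in the skill."""
    if x <= -0.3 or 0.7 <= x <= 1.5:
        return "bifurcation"   # chaotic: small shifts give unpredictable landings
    if 0.0 <= x <= 0.5:
        return "stable"        # predictable launch zone
    return "unclassified"      # not covered by the skill text


for x in (0.3, 0.5, 0.9, 1.0, 0.6):
    print(x, launch_zone(x))
```

Under this rule, the x = 0.85 to 1.0 placements that the ReAct agent kept returning to in E.3 all fall in a bifurcation zone.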

\[cat\_ev\_8\_007\]

Title:

Three\-Tier Fallback Sequence for Failed Primary Placement

generation: 10 confidence: 0\.93

source seeds: \[44\]

Principle:

When primary \(x=0\.5, y=0\.4, r=1\.5\) fails: \(1\) ceiling hit \-\> immediately try x=0\.3, y=0\.4, r=1\.5 \(gray\-ball\-safe first choice\); if still ceiling hit, try x=0\.3, y=\-0\.3, r=2\.0 but verify no overlap first\. \(2\) short range \-\> increase y to 0\.9 or shift x to 0\.4\-\-0\.45\. \(3\) persistent failure after 2\+ radius variations at same x \-\> stop tuning, shift x to 0\.2\-\-0\.3\.

When to apply:

Immediately when simulate\_with\_trace shows ceiling hit or short range\. Do not micro\-tune the failed placement\.

Example:

Seed 44: primary hit ceiling \-\> x=0\.3, y=0\.4, r=1\.5 succeeded \(y=\-0\.3 options were blocked by gray\_ball\)\.
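The three tiers read naturally as a lookup from failure type to an ordered list of next placements. The sketch below is one possible encoding, not HExA's implementation: the function name, the failure labels, and the `verify_overlap_first` flag are hypothetical. Where the skill gives a range (x = 0.4 to 0.45, x = 0.2 to 0.3), the code uses the range endpoints and keeps the other primary values, which is an assumption. Checking tier 3 before the other tiers is also an assumption.

```python
PRIMARY = {"x": 0.5, "y": 0.4, "radius": 1.5}


def fallback_candidates(failure, radius_variations_at_same_x=0):
    """Return placements to try next, in order, after the primary fails."""
    if radius_variations_at_same_x >= 2:
        # Tier 3: stop tuning radius, shift x into 0.2-0.3 (endpoints assumed).
        return [dict(PRIMARY, x=0.2), dict(PRIMARY, x=0.3)]
    if failure == "ceiling_hit":
        # Tier 1: gray-ball-safe choice first, then the low, large placement.
        return [
            dict(PRIMARY, x=0.3),
            {"x": 0.3, "y": -0.3, "radius": 2.0, "verify_overlap_first": True},
        ]
    if failure == "short_range":
        # Tier 2: raise the drop height, or nudge x toward 0.4-0.45.
        return [dict(PRIMARY, y=0.9), dict(PRIMARY, x=0.4), dict(PRIMARY, x=0.45)]
    raise ValueError(f"unknown failure type: {failure!r}")
```

For a ceiling hit, the first candidate is x = 0.3, y = 0.4, r = 1.5, which is the placement the skill reports as succeeding on Seed 44.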

\[cat\_ev\_10\_008\]

Title:

x=0\.3 Is the Ceiling\-Escape x\-Position; y=0\.4 Is the Gray\-Ball\-Safe Default Height

generation: 10 confidence: 0\.93

source seeds: \[44\]

Principle:

Shifting x from 0\.5 to 0\.3 flattens the green ball’s launch arc by changing the arm contact point \-\-\- this x\-shift is the primary ceiling\-escape mechanism, not lowering y\. When x=0\.3, y=\-0\.3, r=2\.0 is invalid due to gray\_ball pivot overlap \(the pivot at r~=0\.70 can be within 2\.7 units of large red balls at low y\), use x=0\.3, y=0\.4, r=1\.5 \-\-\- y=0\.4 avoids the overlap zone while preserving the arc\-flattening from the x\-shift\.

When to apply:

When primary \(x=0\.5\) causes ceiling hit, try x=0\.3, y=0\.4, r=1\.5 first\. Only attempt y=\-0\.3 if the gray\_ball pivot is positioned far from x=0\.3\.

Example:

Seed 44: x=0\.3, y=\-0\.3, r=2\.0 overlapped gray\_ball; x=0\.3, y=0\.4, r=1\.5 succeeded with green ball bouncing off top wall into basket\.

\[cat\_ev\_11\_009\]

Title:

When Ceiling\-Range Tradeoff Is Unsolvable by Radius Tuning, Shift x to 0\.1\-\-0\.3

generation: 11 confidence: 0\.50

source seeds: \[42, 44\]

Principle:

When r=1\.5 hits the ceiling and r=1\.0 falls short \(Delta x gap \> 1\.0 unit\), interpolating intermediate radii \(r=1\.1\-\-1\.4\) does not resolve the conflict \-\-\- they also hit the ceiling or remain short\. This occurs because ceiling clearance is determined by launch angle \(set by x\-position\), not energy level\. Shifting x left to 0\.1\-\-0\.3 changes the contact geometry and flattens the arc, opening a new ceiling\-safe launch window at r=1\.5 without radius compromise\.

When to apply:

After 2 radius variations at x=0\.5 both fail \(one ceiling hit, one short\), stop tuning radius and immediately shift x to 0\.3\.

Example:

Seed 42: tried r=1\.0, 1\.1, 1\.2, 1\.3, 1\.4 at x=0\.5 \-\-\- all failed\. Should have shifted to x=0\.3, y=0\.4, r=1\.5 after 2nd failure\.

\# COMMON MISTAKES \(negative skills\)

\[cat\_err\_000\] generation: 10

Description:

Agent fixates on a single launch mechanism \(catapult arm\) and micro\-tunes x/y/radius around the same narrow region without escaping the local solution space\.

Why it happens:

The catapult arm is the most obvious mechanism\. After 2\-3 failures, agents try small perturbations instead of qualitatively different placements\.

How to avoid:

After 2 failures within the same x \+/\- 0\.2 region, move to a completely different x zone \(shift from x=0\.5 to x=0\.3\) rather than continuing to fine\-tune\.

\[cat\_evo\_001\] generation: 10

Description:

Agent increases radius at fixed x=0\.5 expecting more range, but arm rotation is saturated and result is identical\.

Why it happens:

Linear energy intuition: more mass = more energy = more range\. But lever rotation saturates before additional mass converts to green ball velocity at x=0\.5\.

How to avoid:

When r=1\.5 fails at x=0\.5, never try r=2\.0 at the same x\. Change x\-position to 0\.3 instead\.

\[cat\_err\_002\] generation: 10

Description:

Agent adjusts y in the wrong direction after ceiling hits: increases y to 2\.0\+ \(higher drop\) instead of shifting x to 0\.3 or using y=0\.4 at x=0\.3\.

Why it happens:

Higher drop = more kinetic energy at impact\. But more energy produces a higher\-arcing launch, worsening the ceiling hit\.

How to avoid:

After a ceiling hit, shift x left to 0\.3 \(keeping y=0\.4\) to flatten the arc\. Never increase y above 1\.5 as a response to ceiling hits\.

\[cat\_err\_003\] generation: 10

Description:

Agent binary\-searches in the x=\-0\.3 to \-0\.5 or x=0\.8\-\-1\.5 bifurcation zones where green ball trajectory is chaotic\.

Why it happens:

Small x adjustments feel like fine\-tuning, but in bifurcation zones they produce wildly different outcomes that appear to need more fine\-tuning\.

How to avoid:

If any attempt lands in x<=\-0\.3 or x=0\.7\-\-1\.5, immediately reset to the stable zone x=0\.2\-\-0\.4\.

\[cat\_err\_004\] generation: 10

Description:

Agent places deflector balls near the basket \(x~=3\.5\-\-4\.5\) hoping to push the blue ball toward the green ball’s landing zone\.

Why it happens:

When the green ball consistently lands short, moving the target \(blue ball\) seems logical\. But basket physics are unpredictable and deflectors rarely produce sustained green\-blue contact\.

How to avoid:

Fix the green ball trajectory to reach the blue ball by adjusting catapult x\-position\. Do not attempt to relocate the blue ball via deflectors\.

### E\.5 HExA system prompt \(catapult seed 45\)

You are an expert physics reasoning agent solving a 2D physics puzzle\. You have access to a physics simulator and can test your ideas before submitting a final answer\.

\*\*Puzzle: Catapult\*\*

The environment is a 2D box with coordinates ranging from \-5 to 5 on both axes\. Gravity pulls objects downward\.

\*\*Key Elements \(factual \-\-\- no implied approach\):\*\*

\- \*\*Green Ball:\*\* A small dynamic ball sitting on the LEFT end of a gray bar\.

\- \*\*Gray Bar \(Catapult Arm\):\*\* A dynamic lever resting on a gray ball \(pivot\)\. The green ball sits on its left end\.

\- \*\*Gray Ball \(Pivot\):\*\* A dynamic ball acting as the fulcrum\. It sits on the left black platform\.

\- \*\*Black Ball \(Ceiling Blocker\):\*\* A static ball near the top of the scene\.

\- \*\*Black Platform \(Left\):\*\* A static horizontal platform on the left side\.

\- \*\*Black Ledge \(Right\):\*\* A static \(possibly angled\) platform on the right side\.

\- \*\*Basket \(Gray\):\*\* A dynamic basket sitting on the right ledge\.

\- \*\*Blue Ball \(Target\):\*\* A dynamic ball inside the basket\.

Use more radius for better energy \(r\>1\)

\*\*The Goal:\*\*

Place ONE Red Ball somewhere in the box so that, once the simulation runs, the green ball contacts the blue ball for at least 3 seconds\. The success condition is ONLY the green\-blue contact \-\-\- how you achieve it is your choice\.

\*\*Placement Constraints:\*\*

\- The red ball must be completely inside the box: \-5 \+ radius <= x <= 5 \- radius, \-5 \+ radius <= y <= 5 \- radius\.

\- The red ball must NOT overlap with any existing object at t=0\.

\- 0\.1 <= radius <= 2\.0

You have access to the following tools to interact with the physics simulation:

1\. describe\_scene\_geometry

Description: Return strategy\-neutral geometry: every ball \(position, radius, dynamic flag\), every bar \(position, angle, length, dynamic flag\), every basket \(position, dynamic flag\), and the key distance \(green <\-\> blue\)\. No prescriptive advice; you interpret the layout to form a strategy\.

Arguments: None

Usage: Action: describe\_scene\_geometry

1\. get\_level\_state

Description: Get the current level layout including all object positions, sizes, and properties\.

Arguments: None

Usage: Action: get\_level\_state

2\. simulate\_action

Description: Place a red ball at \(x, y\) with the given radius and run the full physics simulation to completion\. Returns whether the goal was achieved, final positions of all objects, and total simulation steps\. If the placement is invalid \(out of bounds or overlaps\), returns a detailed error with how far to move the ball\.

Arguments: x \(float\), y \(float\), radius \(float\)

Usage: Action: simulate\_action

Action Input: \{"x": 0\.5, "y": 4\.0, "radius": 0\.6\}

3\. get\_contact\_log

Description: After running a simulation, returns the contact events: which objects touched and when\.

Arguments: None

Usage: Action: get\_contact\_log

4\. simulate\_partial

Description: Place a red ball and run the simulation only up to the specified step\. Returns object positions and velocities at that point\. Useful for observing mid\-simulation dynamics\.

Arguments: x \(float\), y \(float\), radius \(float\), stop\_step \(int\)

Usage: Action: simulate\_partial

Action Input: \{"x": 0\.5, "y": 4\.0, "radius": 0\.6, "stop\_step": 50\}

6\. trace\_green\_ball

Description: Lightweight trajectory probe \-\-\- only the green ball is sampled\. Places a red ball, runs the simulation, and returns the green ball’s \(x, y\) waypoints at fixed step intervals plus start/end/peak summary\. Stops early once the green ball comes to rest \(capped at ~600 steps\)\. Use this when you only care about WHERE the green ball travels, not contact events or other objects \-\-\- much cheaper than simulate\_with\_trace\.

Arguments: x \(float\), y \(float\), radius \(float\)

Usage: Action: trace\_green\_ball

Action Input: \{"x": 1\.2, "y": 3\.5, "radius": 0\.6\}

7\. predict\_first\_contact

Description: Cheap pre\-simulation check \(<=90 physics steps, ~1\.5 s of sim time\)\. Runs just long enough to find the FIRST object the red ball touches after it is released, and reports: placement validity, the other object’s name, the step of impact, approach speed, approximate contact point, and surface normal\. Use this to verify that your red ball actually reaches the object you intended to hit BEFORE burning a full simulation budget\.

Arguments: x \(float\), y \(float\), radius \(float\)

Usage: Action: predict\_first\_contact

Action Input: \{"x": 1\.2, "y": 3\.5, "radius": 0\.6\}

9\. finish

Description: Submit your final answer\. Use this when you are confident in your solution\.

Arguments: x \(float\), y \(float\), radius \(float\)

Usage: Action: finish

Action Input: \{"x": 0\.5, "y": 4\.0, "radius": 0\.6\}

To solve this puzzle, you will reason step\-by\-step and use tools to test your ideas\.

At each step, you MUST follow this exact format:

Thought: <your reasoning about what to do next\>

Action: <tool name\>

Action Input: <JSON arguments, or leave blank for tools with no arguments\>

After you take an action, you will receive:

Observation: <result from the tool\>

Then you continue with another Thought/Action cycle\.

When you are confident in your answer, use the "finish" tool to submit it\.

Important rules:

\- Always start with a Thought before taking an Action\.

\- Only call ONE tool per step\.

\- Parse observation results carefully before your next thought\.

\- You can simulate multiple different actions to compare results\.

\- Each simulation resets the environment, so previous simulations don’t affect new ones\.

=== LEARNED PHYSICS SKILLS ===

\#\# Catapult\-Specific Skills

1\. \[Strategy: Three\-Tier Fallback Sequence for Failed Primary Placement\] When primary \(x=0\.5, y=0\.4, r=1\.5\) fails: \(1\) ceiling hit \-\> immediately try x=0\.3, y=0\.4, r=1\.5 \(gray\-ball\-safe first choice\); if still ceiling hit, try x=0\.3, y=\-0\.3, r=2\.0 but verify no overlap first\. \(2\) short range \-\> increase y to 0\.9 or shift x to 0\.4\-\-0\.45\. \(3\) persistent failure after 2\+ radius variations at same x \-\> stop tuning, shift x to 0\.2\-\-0\.3\.

Apply when: Immediately when simulate\_with\_trace shows ceiling hit or short range\. Do not micro\-tune the failed placement\.

Example: Seed 44: primary hit ceiling \-\> x=0\.3, y=0\.4, r=1\.5 succeeded \(y=\-0\.3 options were blocked by gray\_ball\)\.

2\. \[Strategy: x=0\.3 Is the Ceiling\-Escape x\-Position; y=0\.4 Is the Gray\-Ball\-Safe Default Height\] Shifting x from 0\.5 to 0\.3 flattens the green ball’s launch arc by changing the arm contact point \-\-\- this x\-shift is the primary ceiling\-escape mechanism, not lowering y\. When x=0\.3, y=\-0\.3, r=2\.0 is invalid due to gray\_ball pivot overlap \(the pivot at r~=0\.70 can be within 2\.7 units of large red balls at low y\), use x=0\.3, y=0\.4, r=1\.5 \-\-\- y=0\.4 avoids the overlap zone while preserving the arc\-flattening from the x\-shift\.

Apply when: When primary \(x=0\.5\) causes ceiling hit, try x=0\.3, y=0\.4, r=1\.5 first\. Only attempt y=\-0\.3 if the gray\_ball pivot is positioned far from x=0\.3\.

Example: Seed 44: x=0\.3, y=\-0\.3, r=2\.0 overlapped gray\_ball; x=0\.3, y=0\.4, r=1\.5 succeeded with green ball bouncing off top wall into basket\.

3\. \[Strategy: x~=0\.5 Is the Primary Catapult Launch Sweet Spot\] Placing the red ball at x=0\.5 on the catapult arm produces a consistent rightward launch arc\. The contact point is on the right side of the pivot, creating sufficient lever arm for strong rotation while staying in the geometrically stable zone\. Only deviate to x=0\.3 when a ceiling hit is observed\.

Apply when: Always as the first placement attempt\. Deviate only when ceiling hits or overlap constraints require it\.

Example: x=0\.5, y=0\.4, r=1\.5 is the canonical primary \(Seed 44\)\.

4\. \[Strategy: Use Large Radius \(\>=1\.5\) for Sufficient Launch Energy\] The catapult arm is a lever \-\-\- the red ball’s mass \(proportional to r^3\) determines angular momentum imparted to the arm\. Below r~=1\.2, the arm rotates too slowly to launch the green ball across the ~7\-unit gap to the basket\. r=1\.5 is the minimum reliable threshold; r=2\.0 adds negligible additional range at x=0\.5 due to arm rotation saturation\. Only reduce radius when forced by overlap constraints, accepting reduced range\.

Apply when: Always use r=1\.5 as the baseline\. Scale down only when overlap constraints require it\.

Example: r=1\.5 at x=0\.5, y=0\.4 is the canonical primary placement \(Seed 44\)\.

5\. \[Strategy: y=0\.4 Is Default Drop Height; Scale Up to y=0\.9 When Arm Sits Higher\] When the arm y<=\-1\.5, y=0\.4 gives adequate fall distance for the red ball to build approach speed\. When the arm sits higher \(y\>\-1\.5\), increase to y=0\.9\. Critically, y=0\.4 also serves as the safe ceiling\-escape fallback height \-\-\- even when using x=0\.3 for arc flattening, y=0\.4 avoids gray\_ball pivot overlap without sacrificing arc geometry\.

Apply when: Set y=0\.4 by default\. Use y=0\.9 when arm is high\. Use y=0\.4 even during ceiling\-escape x\-shifts to avoid gray\_ball overlap\.

Example: Seed 44 arm at y=\-1\.59: y=0\.4 default; ceiling\-escape fallback x=0\.3, y=0\.4 also used y=0\.4\.

6\. \[Strategy: predict\_first\_contact Is Essential for Variable\-Arm Seeds\] The catapult arm’s y\-position varies across seeds, making the overlap boundary unpredictable\. predict\_first\_contact verifies both placement validity and that the red ball contacts gray\_platform \(not another object\), confirming the catapult mechanism activates\. It also catches gray\_ball pivot overlaps before expensive simulation runs\.

Apply when: Before every simulate\_with\_trace call\. If placement\_valid=False, adjust position before simulating\.

Example: Seed 44: predict\_first\_contact caught x=0\.3, y=\-0\.3, r=2\.0 overlapping gray\_ball before simulation was wasted\.

7\. \[Strategy: Ceiling Blocker Lethality Depends on Its x\-Position and Arc Angle\] The static black ball near y~=4\.6 varies in x\-position across seeds \(x~=\-3\.85 to x~=\-1\.14\)\. When the ceiling blocker is at x\>\-2\.5, primary placement causes a ceiling hit\. When the blocker is at x<\-3\.5 \(far left\), x=0\.3 launches allow the green ball to arc up and even bounce off the top wall before descending into the basket \-\-\- the blocker is off to the side of the arc\.

Apply when: Check ceiling blocker x in describe\_scene\_geometry\. If x\>\-2\.5, ceiling escape is needed\. If x<\-3\.5, x=0\.3 launches can reach the top wall and still succeed\.

Example: Seed 44: black\_ball at x=\-3\.63 \-\-\- green ball bounced off top\_wall \(step 72\) then landed in basket\.

8\. \[Strategy: x\-Position Fine\-Tunes Landing Range; Bifurcation Zones at x<=\-0\.3 and x=0\.7\-\-1\.5\] Small x shifts \(0\.1\-\-0\.2 units\) cause large changes in green ball landing position\. The stable launch zone is x~=0\.0 to 0\.5\. Positions x<=\-0\.3 and x=0\.7\-\-1\.5 are bifurcation zones where trajectory is chaotic and small changes produce unpredictable outcomes\. Within x=0\.1\-\-0\.5, landing is predictable and monotonically tunable\.

Apply when: When adjusting x to change landing range, stay within x=0\.0\-\-0\.5\. Immediately reset to x=0\.2\-\-0\.4 if any attempt lands outside this band\.

Example: x=0\.3 vs x=0\.5 both produce reliable arcs \(Seed 44\); x=0\.9 produces chaotic bounces \(Seed 42\)\.

9\. \[Strategy: Radius Plateau at x=0\.5: r Beyond 1\.5 Yields No Additional Range\] At x~=0\.5, the catapult arm’s rotation saturates \-\-\- r=1\.5 and r=2\.0 produce identical green ball landing positions\. Increasing mass beyond 1\.5 at this x cannot convert to additional angular momentum because the arm hits its geometric rotation limit\. Intermediate radii \(r=1\.1\-\-1\.4\) also fail to bridge ceiling\-range tradeoffs when the root cause is launch angle\.

Apply when: When r=1\.5 fails at x=0\.5, do NOT try r=2\.0\. Change x\-position instead\.

Example: Seed 42: agent tried r=1\.1, 1\.2, 1\.3, 1\.4, 2\.0 at x=0\.5 \-\-\- none resolved the ceiling\-range tradeoff\.

10\. \[Strategy: When Ceiling\-Range Tradeoff Is Unsolvable by Radius Tuning, Shift x to 0\.1\-\-0\.3\] When r=1\.5 hits the ceiling and r=1\.0 falls short \(Delta x gap \>1\.0 unit\), interpolating intermediate radii \(r=1\.1\-\-1\.4\) does not resolve the conflict \-\-\- they also hit the ceiling or remain short\. This occurs because ceiling clearance is determined by launch angle \(set by x\-position\), not energy level\. Shifting x left to 0\.1\-\-0\.3 changes the contact geometry and flattens the arc, opening a new ceiling\-safe launch window at r=1\.5 without radius compromise\.

Apply when: After 2 radius variations at x=0\.5 both fail \(one ceiling hit, one short\), stop tuning radius and immediately shift x to 0\.3\.

Example: Seed 42: tried r=1\.0, 1\.1, 1\.2, 1\.3, 1\.4 at x=0\.5 \-\-\- all failed\. Should have shifted to x=0\.3, y=0\.4, r=1\.5 after 2nd failure\.

\#\# Common Mistakes to Avoid

1\. \*\*Agent fixates on a single launch mechanism \(catapult arm\) and micro\-tunes x/y/radius around the same narrow region without escaping the local solution space\.\*\*

Why it happens: The catapult arm is the most obvious mechanism\. After 2\-3 failures, agents try small perturbations instead of qualitatively different placements\.

How to avoid: After 2 failures within the same x \+/\- 0\.2 region, move to a completely different x zone \(shift from x=0\.5 to x=0\.3\) rather than continuing to fine\-tune\.

2\. \*\*Agent increases radius at fixed x=0\.5 expecting more range, but arm rotation is saturated and result is identical\.\*\*

Why it happens: Linear energy intuition: more mass = more energy = more range\. But lever rotation saturates before additional mass converts to green ball velocity at x=0\.5\.

How to avoid: When r=1\.5 fails at x=0\.5, never try r=2\.0 at the same x\. Change x\-position to 0\.3 instead\.

3\. \*\*Agent adjusts y in the wrong direction after ceiling hits: increases y to 2\.0\+ \(higher drop\) instead of shifting x to 0\.3 or using y=0\.4 at x=0\.3\.\*\*

Why it happens: Higher drop = more kinetic energy at impact\. But more energy produces a higher\-arcing launch, worsening the ceiling hit\.

How to avoid: After a ceiling hit, shift x left to 0\.3 \(keeping y=0\.4\) to flatten the arc\. Never increase y above 1\.5 as a response to ceiling hits\.

4\. \*\*Agent binary\-searches in the x=\-0\.3 to \-0\.5 or x=0\.8\-\-1\.5 bifurcation zones where green ball trajectory is chaotic\.\*\*

Why it happens: Small x adjustments feel like fine\-tuning, but in bifurcation zones they produce wildly different outcomes that appear to need more fine\-tuning\.

How to avoid: If any attempt lands in x<=\-0\.3 or x=0\.7\-\-1\.5, immediately reset to the stable zone x=0\.2\-\-0\.4\.

5\. \*\*Agent places deflector balls near the basket \(x~=3\.5\-\-4\.5\) hoping to push the blue ball toward the green ball’s landing zone\.\*\*

Why it happens: When the green ball consistently lands short, moving the target \(blue ball\) seems logical\. But basket physics are unpredictable and deflectors rarely produce sustained green\-blue contact\.

How to avoid: Fix the green ball trajectory to reach the blue ball by adjusting catapult x\-position\. Do not attempt to relocate the blue ball via deflectors\.

=== END SKILLS ===

============================================================

USER PROMPT

============================================================

Solve the Catapult puzzle\. You have 12 iterations to solve it\. The success condition is: green\_ball must contact blue\_ball\. Use tools effectively and think about alternate approaches if one does not work\.

\- \*\*ALWAYS call predict\_first\_contact first\*\* \-\-\- it is cheap and tells you if your red ball hits the intended object\.

\- Only if predict confirms the right contact: call trace\_green\_ball to track green ball\.

IMPORTANT \-\-\- After each FAILED simulation, before trying a new placement, you MUST:

1\. Review the LEARNED PHYSICS SKILLS listed in the system prompt above\.

2\. Identify which skill applies to the failure you just observed\.

3\. Explicitly state which skill you are applying and why in your next Thought\.

### E\.6 HExA trajectory \(catapult seed 45, 6 iterations, SUCCESS\)

\-\-\- Step 1 \-\-\-

Thought:

Let me start by describing the scene geometry to understand the layout before attempting any placement\.

Action:

describe\_scene\_geometry\(\{\}\)

Observation:

=== Scene Geometry ===

World bounds: x in \[\-5, 5\], y in \[\-5, 5\]

Balls:

green\_ball: pos=\(\-3\.06, \-0\.57\), r=0\.20, dynamic=True

blue\_ball: pos=\(3\.50, \-2\.39\), r=0\.44, dynamic=True

black\_ball: pos=\(\-2\.28, 4\.60\), r=0\.40, dynamic=False

gray\_ball: pos=\(\-1\.14, \-1\.67\), r=0\.70, dynamic=True

Bars:

ledge: pos=\(3\.50, \-3\.23\), angle=6\.2deg, len=3\.02, dynamic=False

black\_platform: pos=\(\-1\.74, \-2\.47\), angle=0\.0deg, len=3\.00, dynamic=False

gray\_platform: pos=\(\-1\.14, \-0\.87\), angle=0\.0deg, len=4\.25, dynamic=True

Baskets:

basket: pos=\(3\.50, \-3\.03\), dynamic=True

Key distances:

green <\-\> blue: 6\.81

Success condition: green\_ball <\-\> blue\_ball for 3\.0 s

==================================================

\-\-\- Step 2 \-\-\-

Thought:

The scene shows the catapult arm \(gray\_platform\) at y=\-0\.87, which is \>\-1\.5, so per Skill \#5 I should use y=0\.9 instead of y=0\.4\. Per Skills \#3 and \#4, the primary placement is x=0\.5, y=0\.9, r=1\.5\. Let me verify with predict\_first\_contact first\.

Action:

predict\_first\_contact\(\{’x’: 0\.5, ’y’: 0\.9, ’radius’: 1\.5\}\)

Observation:

===PredictedFirstContact===

placement\_valid:True

red\_balldrop:x=0\.500,y=0\.900,radius=1\.500

first\_contact\_object:gray\_platform

contact\_step:10

approach\_speed:1\.488

contact\_point\(approxmidpoint\):\(\-0\.319,\-0\.056\)

contact\_normal\(red\-\>otherinverted\):\(0\.710,0\.704\)

\-\-\- Step 3 \-\-\-

Thought:

The placement is valid and the red ball hits the gray\_platform \(catapult arm\) as intended\. Now let me run a full simulation with trace to see if the green ball reaches the blue ball\.

Action:

simulate\_with\_trace\(\{’x’: 0\.5, ’y’: 0\.9, ’radius’: 1\.5, ’object\_names’: \[’green\_ball’, ’blue\_ball’\]\}\)

Observation:

===SimulationResult\(steps:2000\)===

Success:False

Contacts\(red\_ball\+tracedobjects\):

step~0:gray\_platform<\-\>green\_ball

step~0:basket<\-\>blue\_ball

step~8:basket<\-\>blue\_ball

step~9:gray\_platform<\-\>red\_ball

step~13:basket<\-\>blue\_ball

step~41:basket<\-\>red\_ball

step~42:gray\_platform<\-\>red\_ball

step~43:basket<\-\>blue\_ball

step~44:basket<\-\>red\_ball

step~49:basket<\-\>blue\_ball

step~54:basket<\-\>red\_ball

step~57:basket<\-\>red\_ball

step~67:top\_wall<\-\>green\_ball

step~71:basket<\-\>red\_ball

step~79:ledge<\-\>red\_ball

\.\.\.\+14more

Objectsummary\(keykinematicextrema\):

green\_ball:y\_peak=4\.78,y\_min=\-0\.91,v\_max=10\.78,Deltapos=\(\+5\.91,\-0\.34\)

omega\_peak=16\.07rad/s,thetain\[\-0\.4deg,\+1542\.0deg\]

blue\_ball:y\_peak=\-2\.33,y\_min=\-2\.51,v\_max=2\.01,Deltapos=\(\+0\.46,\-0\.04\)

omega\_peak=2\.03rad/s,thetain\[\-5\.8deg,\+41\.1deg\]

\-\-\- Step 4 \-\-\-

Thought:

The green ball hit the top\_wall \(ceiling hit at step 67, y\_peak=4\.78\) and the success was False\. Per Skill \#3, a ceiling hit means I should shift x from 0\.5 to 0\.3 to flatten the arc\. Per Skill \#5, the arm is at y=\-0\.87 \(\>\-1\.5\) so I keep y=0\.9\. Let me verify x=0\.3, y=0\.9, r=1\.5 with predict\_first\_contact first\.

Action:

predict\_first\_contact\(\{’x’: 0\.3, ’y’: 0\.9, ’radius’: 1\.5\}\)

Observation:

===PredictedFirstContact===

placement\_valid:True

red\_balldrop:x=0\.300,y=0\.900,radius=1\.500

first\_contact\_object:gray\_platform

contact\_step:10

approach\_speed:1\.463

contact\_point\(approxmidpoint\):\(\-0\.419,\-0\.056\)

contact\_normal\(red\-\>otherinverted\):\(0\.663,0\.749\)

\-\-\- Step 5 \-\-\-

Thought:

Valid placement, hits gray\_platform as intended\. Per Skill \#2, shifting x to 0\.3 flattens the arc to avoid the ceiling hit\. Let me run the full simulation\.

Action:

simulate\_with\_trace\(\{’x’: 0\.3, ’y’: 0\.9, ’radius’: 1\.5, ’object\_names’: \[’green\_ball’, ’blue\_ball’\]\}\)

Observation:

===SimulationResult\(steps:343\)===

Success:True

Contacts\(red\_ball\+tracedobjects\):

step~0:gray\_platform<\-\>green\_ball

step~0:basket<\-\>blue\_ball

step~8:basket<\-\>blue\_ball

step~9:gray\_platform<\-\>red\_ball

step~13:basket<\-\>blue\_ball

step~46:basket<\-\>red\_ball

step~47:basket<\-\>blue\_ball

step~50:gray\_platform<\-\>red\_ball

step~56:top\_wall<\-\>green\_ball

step~61:basket<\-\>red\_ball

step~67:basket<\-\>red\_ball

step~82:ledge<\-\>red\_ball

step~86:basket<\-\>blue\_ball

step~93:ledge<\-\>red\_ball

step~95:basket<\-\>blue\_ball

\.\.\.\+12more

Objectsummary\(keykinematicextrema\):

green\_ball:y\_peak=4\.79,y\_min=\-2\.21,v\_max=11\.09,Deltapos=\(\+7\.60,\-1\.63\)

omega\_peak=26\.11rad/s,thetain\[\-0\.4deg,\+1387\.3deg\]

blue\_ball:y\_peak=\-2\.25,y\_min=\-2\.52,v\_max=2\.93,Deltapos=\(\+0\.44,\-0\.04\)

omega\_peak=2\.57rad/s,thetain\[\+0\.0deg,\+67\.6deg\]

\-\-\- Step 6 \-\-\-

Thought:

The simulation succeeded with x=0\.3, y=0\.9, r=1\.5\. I’ll submit this as the final answer\.

Action:

finish\(\{’x’: 0\.3, ’y’: 0\.9, ’radius’: 1\.5\}\)

Observation:

SUCCESS\! The green ball contacted the blue ball in the basket\.

Total simulation steps: 343

Reward: 1\.0

Final object positions:

green\_ball:pos=\(4\.536,\-2\.207\),vel=\(0\.000,0\.000\)

red\_ball:pos=\(1\.420,\-1\.904\),vel=\(0\.001,0\.001\)

blue\_ball:pos=\(3\.945,\-2\.440\),vel=\(0\.000,0\.000\)

ledge:pos=\(3\.500,\-3\.235\),vel=\(0\.000,0\.000\)

basket:pos=\(4\.191,\-3\.041\),vel=\(0\.000,0\.000\)

black\_ball:pos=\(\-2\.281,4\.600\),vel=\(0\.000,0\.000\)

black\_platform:pos=\(\-1\.736,\-2\.473\),vel=\(0\.000,0\.000\)

gray\_ball:pos=\(\-4\.275,\-4\.277\),vel=\(0\.000,0\.000\)

gray\_platform:pos=\(\-0\.303,\-1\.809\),vel=\(\-0\.006,\-0\.001\)

##### How the bank drove the solve\.

On reading the scene, the agent observes the catapult arm sitting at $y \approx -0.87$ \(higher than typical\) and applies skill cat\_ev\_7\_002 to scale the drop height from the default $y = 0.4$ to $y = 0.9$, combined with the canonical $(x = 0.5,\, r = 1.5)$ from cat\_ev\_1\_001 and cat\_ev\_6\_000 on the first attempt\. That trial fails with a ceiling hit; instead of micro\-tuning, the agent invokes cat\_ev\_10\_008, which prescribes the exact fallback $x = 0.3$ to flatten the arc, yielding success on the next simulation\.
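The two decisions described above can be written as a small rule table. The following is a minimal sketch, not the HExA implementation: `Placement`, `primary_placement`, and `fallback_after_ceiling_hit` are hypothetical names, and the thresholds are copied from the skill text in the listing \(Skills \#2 to \#5\). Keeping $y$ unchanged in the fallback mirrors what the agent did in Step 4 of the trajectory, whereas the skill text itself names $y = 0.4$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """A candidate red-ball placement (hypothetical helper type)."""
    x: float
    y: float
    radius: float


def primary_placement(arm_y: float) -> Placement:
    """First attempt: x=0.5, r=1.5; drop height 0.9 if the arm sits above y=-1.5, else 0.4."""
    y = 0.9 if arm_y > -1.5 else 0.4
    return Placement(x=0.5, y=y, radius=1.5)


def fallback_after_ceiling_hit(previous: Placement) -> Placement:
    """Ceiling-escape fallback: shift x to 0.3, keep y and radius (as in the trajectory)."""
    return Placement(x=0.3, y=previous.y, radius=previous.radius)


# Seed 45: the scene geometry reports the arm (gray_platform) at y = -0.87.
first_attempt = primary_placement(arm_y=-0.87)
second_attempt = fallback_after_ceiling_hit(first_attempt)
print(first_attempt)
print(second_attempt)
```

With the seed\-45 arm height, the sketch reproduces the two placements the agent simulated in Steps 3 and 5\.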

## Appendix F More Background and Preliminaries

##### Tool\-Augmented Large Language Model \(LLM\) Agents\.

To interact with complex environments, LLMs are frequently augmented with external tools and programmatic execution capabilities\. This allows them to interleave reasoning with acting, enabling them to gather information, execute APIs, or manipulate simulators directly \(Yao et al\., [2022](https://arxiv.org/html/2606.29315#bib.bib66); Schick et al\., [2023](https://arxiv.org/html/2606.29315#bib.bib45)\)\. By structuring interactions as a sequence of thoughts, actions, and observations, tool\-augmented agents transcend the limitations of static parametric knowledge and can adapt their strategies based on real\-time environmental feedback\.

##### Reinforcement Learning and MDPs\.

We model the interaction between an agent and its environment as a Markov Decision Process \(MDP\), defined by the tuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, p, r, \gamma, p_0 \rangle$\. Here, $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $p(s' \mid s, a)$ is the transition dynamics, $r(s, a)$ is the reward function, $p_0(s)$ is the initial state distribution, and $\gamma \in [0, 1)$ is the discount factor\. The initial state is drawn as $S_0 \sim p_0$\. At time $t$, the agent observes state $S_t$, selects an action $A_t \sim \pi(\cdot \mid S_t)$ according to its policy $\pi$, receives a reward $R_{t+1}$, and transitions to $S_{t+1} \sim p(\cdot \mid S_t, A_t)$\. This generates a trajectory $\tau_t = (S_0, A_0, R_1, \dots, S_{t-1}, A_{t-1}, R_t)$\.

The core tasks in RL are *policy evaluation* \(estimating the expected return $v^{\pi}(s) = \mathbb{E}\left[\sum_{i=1}^{\infty} \gamma^{i-1} R_{t+i} \mid S_t = s\right]$\) and *control* \(finding an optimal policy to maximize $J(\pi) = \sum_{s} p_0(s)\, v^{\pi}(s)$\)\. By iterating between evaluating a policy and improving it, RL algorithms optimize a parameterized policy $\pi_\theta$\. This optimization can occur via *online pretraining* \(interacting directly with the environment\) or *offline pretraining* \(learning from a static dataset of prior interactions $\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}$\)\.
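As a worked instance of the return inside $v^{\pi}(s)$, the sketch below computes $\sum_{i \ge 1} \gamma^{i-1} R_{t+i}$ for a finite reward sequence and averages it over sampled episodes, which is the simplest Monte Carlo form of policy evaluation\. The function names and the example rewards are illustrative assumptions, not taken from the paper\.

```python
def discounted_return(rewards, gamma):
    """Return sum_i gamma**(i-1) * R_{t+i}, with rewards = [R_{t+1}, R_{t+2}, ...]."""
    total = 0.0
    # Accumulate from the last reward backwards: G = R + gamma * G_next.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def monte_carlo_value(episodes, gamma):
    """Estimate v^pi(s) as the mean discounted return over episodes that start in s."""
    returns = [discounted_return(rewards, gamma) for rewards in episodes]
    return sum(returns) / len(returns)


# Illustrative data: two episodes from the same start state, reward only at the end.
single = discounted_return([0.0, 0.0, 1.0], gamma=0.5)    # 0.5**2 * 1.0
estimate = monte_carlo_value([[0.0, 0.0, 1.0], [1.0]], gamma=0.5)
print(single, estimate)
```

The backward accumulation avoids computing powers of $\gamma$ explicitly and matches the recursive form $G_t = R_{t+1} + \gamma G_{t+1}$\.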

Traditional methods for optimizing LLM policies rely on gradient\-based parameter updates\. For instance, Group Relative Policy Optimization \(GRPO\) avoids training a separate critic by using intra\-group relative rewards to optimize the policy\. For a given query $x$, the model samples $G$ responses $\{y^{(1)}, \dots, y^{(G)}\}$, which are scored to obtain rewards $\{R_1, \dots, R_G\}$\. GRPO computes normalized advantages and updates the policy with a clipped objective:

$$J_{\text{GRPO}}(\theta) = \mathbb{E}_{x, \{y^{(i)}\}} \left[ \frac{1}{G} \sum_{i=1}^{G} \min\left( r_i A_i,\ \text{clip}(r_i, 1-\epsilon, 1+\epsilon)\, A_i \right) - \beta D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}}) \right] \tag{8}$$
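The text does not define $r_i$ and $A_i$ in Eq\. \(8\)\. The sketch below assumes the usual GRPO reading: $r_i$ is the probability ratio between the current and the sampling policy for response $y^{(i)}$ \(not the reward function $r$\), and $A_i$ is the group\-normalized reward $(R_i - \text{mean})/\text{std}$\. It evaluates only the clipped term for given ratios and omits the KL penalty; the function names, the population standard deviation, and the small `eps` stabilizer are assumptions\.

```python
import statistics


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: A_i = (R_i - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations differ here
    return [(reward - mean) / (std + eps) for reward in rewards]


def clipped_surrogate(ratios, advantages, clip_eps=0.2):
    """Average of min(r_i * A_i, clip(r_i, 1 - eps, 1 + eps) * A_i) over the group."""
    terms = []
    for ratio, advantage in zip(ratios, advantages):
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(ratio * advantage, clipped_ratio * advantage))
    return sum(terms) / len(terms)


# Illustrative group of G = 4 responses with binary rewards.
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
# A ratio of 1.5 with positive advantage is clipped to 1.2; a ratio of 1.0 is untouched.
objective = clipped_surrogate([1.5, 1.0], [1.0, -1.0])
print(advantages, objective)
```

Because advantages are centered within the group, no learned critic is needed to provide a baseline, which is the property the paragraph above attributes to GRPO\.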

##### In\-Context Reinforcement Learning \(ICRL\)\.

To circumvent the need for continuous parameter updates, In\-Context Reinforcement Learning \(ICRL\) conditions the policy on both the current state $S_t$ and an evolving history or context $C_t$\. Here, the action $A_t$ is sampled according to $\pi_\theta(\cdot \mid S_t, C_t)$\. The pre\-trained weights $\theta$ remain frozen, yet the policy can achieve high rewards in test environments that differ from those seen during pre\-training\. This generalization arises because the forward pass of the neural network effectively implements an RL algorithm that learns from the context $C_t$\. The performance of $\pi_\theta$ generally improves with the length and quality of $C_t$, a phenomenon referred to as *in\-context improvement* \(Moeini et al\., [2025](https://arxiv.org/html/2606.29315#bib.bib39)\)\.
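The interface $\pi_\theta(\cdot \mid S_t, C_t)$ can be illustrated with a toy bandit in which nothing but the context changes between steps\. This is only a sketch of the idea: the hand\-written rule in `frozen_policy` stands in for whatever algorithm a pre\-trained network implements in its forward pass, the state is trivial, and all names and reward values are hypothetical\.

```python
def frozen_policy(context, n_arms):
    """Pick an arm from the context alone; there are no parameters to update."""
    totals = [0.0] * n_arms
    counts = [0] * n_arms
    for arm, reward in context:
        totals[arm] += reward
        counts[arm] += 1
    # Try every arm once before exploiting.
    for arm in range(n_arms):
        if counts[arm] == 0:
            return arm
    means = [totals[arm] / counts[arm] for arm in range(n_arms)]
    return max(range(n_arms), key=means.__getitem__)


def run_episode(arm_rewards, steps):
    """Roll out the fixed policy; only the context C_t grows over time."""
    context = []
    for _ in range(steps):
        arm = frozen_policy(context, len(arm_rewards))
        context.append((arm, arm_rewards[arm]))
    return context


# Deterministic toy rewards: arm 1 is better, which the policy learns from context.
history = run_episode(arm_rewards=[0.2, 0.9], steps=6)
print([arm for arm, _ in history])
```

Behavior improves as the context lengthens even though `frozen_policy` itself never changes, which is the sense of in\-context improvement used above\.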

## Appendix G Interphyre: Environment and Benchmark Details

This appendix provides the full design specification, intervention API, and limitations of our Interphyre environment and benchmark\.

### G\.1 Introduction

The board\-reset primitive that unlocked world\-model probing in chess and Othello has no equivalent in continuous physics\. Those results relied on models trained on task\-domain sequences and on the ability to specify any board state symbolically; physical state is continuous and contact\-dependent, and counterfactual access requires rewinding a running simulation rather than resetting a board\. We introduce Interphyre, a 2D physics environment whose snapshot/restore API captures complete simulation state at any event\-triggered branch point and branches it into paired counterfactual trajectories that diverge under controlled perturbations\. In a preliminary probing study, linear probes on Qwen3\-8B residual\-stream activations predict counterfactual outcomes above chance across three levels, but only when conditioned on oracle solution strategy\. On the catapult level, the same object removal is causally necessary under one strategy and irrelevant under another; pooling across strategies averages over this reversal\. The same backend supports RL generalization research: physics parameters are independent experimental axes, so a policy trained under standard conditions can be tested under altered physics without changing scene structure\. Interphyre’s 25\-level curriculum and open Python API are released to support interpretability work on physics\-reasoning models and RL generalization research\.

Existing 2D physics benchmarks \(PHYRE \(Bakhtin et al\., [2019](https://arxiv.org/html/2606.29315#bib.bib8)\), I\-PHYRE \(Li et al\., [2024](https://arxiv.org/html/2606.29315#bib.bib30)\), Kinetix \(Matthews et al\., [2025](https://arxiv.org/html/2606.29315#bib.bib37)\)\) were designed to answer one question: can the agent solve the puzzle? Two research communities now need more from physics environments\. Mechanistic interpretability of physics\-reasoning LLMs requires mid\-trajectory intervention and paired counterfactual rollouts from a shared physical state; without these, model activations cannot be paired with causal physical ground truth\. RL generalization and behavioral analysis require independent control over physics parameters and white\-box access to simulation state; without these, a researcher cannot test whether a learned strategy transfers across physics regimes or which scene features it relies on\. No existing benchmark meets either requirement\.

World\-model discovery in board games relies on a property physics does not share: discrete, fully observable ground\-truth states that make probe targets unambiguous and interventions exact\. Language models encode Othello boards \(Li et al\., [2023](https://arxiv.org/html/2606.29315#bib.bib29)\) and chess positions \(Karvonen, [2024](https://arxiv.org/html/2606.29315#bib.bib26)\) as structured world models; those results used models trained on task\-domain sequences, where board state was the latent variable the training objective required the model to track\. For physics\-reasoning LLMs trained on general text, the analogous question is whether chain\-of\-thought reasoning produces structured physical representations, and answering it requires infrastructure those experiments did not need\. Physical state is continuous and contact\-dependent; the probe target is the outcome of a specific contact event, not a fixed board position, and counterfactual access requires rewinding a running simulation rather than resetting a board\. The infrastructure that made board\-game probing tractable does not transfer\.

The RL gap has a specific shape\. Testing whether a learned strategy generalizes across physics variations, or identifying which scene features a policy relies on, requires varying physics parameters while holding scene topology fixed\. Kinetix varies both jointly \(Matthews et al\., [2025](https://arxiv.org/html/2606.29315#bib.bib37)\); PHYRE and I\-PHYRE expose no parametric control at all\. Without independent axes, a generalization result cannot be attributed to physics transfer rather than to a change in scene structure\.

![Refer to caption](https://arxiv.org/html/2606.29315v1/x12.png)Figure 14: Interphyre’s snapshot/restore API branches a shared mid\-trajectory state into a factual rollout and perturbed alternatives \(Section [G\.4\.1](https://arxiv.org/html/2606.29315#A7.SS4.SSS1)\)\. Each row is one seed of the catapult level with a different oracle solution strategy\. The leftmost column shows the initial scene for reference; the teal curve is the green ball’s trajectory, shown up to the branch point in the *Branch point* column and continuing through the outcome in each subsequent column\. Top \(seed 8, deflector strategy\): removing the deflector causes failure; shrinking the red action ball does not\. Bottom \(seed 5, direct\-launch strategy\): shrinking the red ball causes failure; removing the deflector does not\. The same interventions have opposite causal relevance across seeds, showing that causal relevance is a property of an object relative to the active strategy, not of the object in isolation\. See Section [G\.4\.3](https://arxiv.org/html/2606.29315#A7.SS4.SSS3) for the branching protocol\.

We introduce Interphyre, an extensible, intervention\-aware 2D physics environment that addresses both gaps through three capabilities:

1. An intervention API with snapshot/restore semantics\. Any running simulation can be snapshotted at an arbitrary point \(on a physics event, a contact trigger, or a fixed timestep\) and restored into two or more branches that diverge under controlled perturbations \(impulses, forces, parameter changes, object additions or removals\)\. Paired counterfactual evaluation follows: the same agent, on the same scene, under matched and perturbed physical conditions\. No existing physical reasoning benchmark provides this programmatic primitive\.
2. Extensible level authoring as Python code\. Levels are Python classes built from a composable object API, with tunable physics parameters \(gravity, friction, restitution, density\) and user\-defined success conditions\. Like Toybox’s reimplementation of Atari games \(Foley et al\., [2018](https://arxiv.org/html/2606.29315#bib.bib17)\), Interphyre reimplements the 2D physics puzzle paradigm established by PHYRE, extending it with white\-box access, parametric control, and intervention infrastructure\. A curriculum of 25 named levels, each parameterized by a random seed, ships with 250,000 pre\-validated task instances; an oracle verification system certifies new seeds on demand, so the curriculum is editable and extensible rather than fixed\.
3. LLM\-native interfaces and interpretability\-ready data generation\. The simulator and intervention API are exposed as a tool\-call surface that an LLM agent can invoke directly\. The same interfaces support standalone generation of paired \(factual, counterfactual\) trajectory data for downstream interpretability and offline reinforcement learning pipelines\.

A preliminary probing study on Qwen3\-8B demonstrates the infrastructure across three levels\. Linear probes on residual\-stream activations predict counterfactual outcomes at snapshot\-defined branch points\. Figure [14](https://arxiv.org/html/2606.29315#A7.F14) shows the key result: the same structural intervention has opposite causal relevance across seeds that use different solution strategies, a confound any probing study not conditioned on strategy will average over\.

### G\.2 Related Work

##### Physical reasoning benchmarks\.

PHYRE \(Bakhtin et al\., [2019](https://arxiv.org/html/2606.29315#bib.bib8)\) introduced 2D physics puzzles for evaluating sample\-efficient learning, measuring agents by AUCCESS across within\-template and cross\-template generalization splits\. Virtual Tools \(Allen et al\., [2020](https://arxiv.org/html/2606.29315#bib.bib2)\) developed a concurrent benchmark with similar Box2D\-style mechanics, grounded in a Bayesian “sample, simulate, update” cognitive model\. I\-PHYRE \(Li et al\., [2024](https://arxiv.org/html/2606.29315#bib.bib30)\) extended this to temporally sequenced interventions, where agents eliminate obstacles at precise moments as simulations unfold\. I\-PHYRE’s interventions are agent actions within a single trajectory; the environment produces one modified rollout, and paired counterfactual analysis requires two from a shared prefix\. Kinetix \(Matthews et al\., [2025](https://arxiv.org/html/2606.29315#bib.bib37)\) takes a different direction: it is a JAX\-accelerated physics engine supporting open\-ended procedural generation for training general RL agents\. Interphyre provides a designed curriculum whose solution structure is known, which is necessary for controlled interpretability comparisons\. DeepPHY \(Xu et al\., [2026](https://arxiv.org/html/2606.29315#bib.bib64)\) unifies several of these environments under a single evaluation framework for agentic VLMs, but inherits the structural limitations of each backend \(no parametric control, no paired counterfactuals, no activation\-level instrumentation\)\.

##### Counterfactual and causal evaluation environments\.

CoPhy \(Baradel et al\., [2020](https://arxiv.org/html/2606.29315#bib.bib9)\) introduced counterfactual learning of physical dynamics, requiring models to predict outcomes after do\-interventions on object positions in 3D scenes\. CausalWorld \(Ahmed et al\., [2021](https://arxiv.org/html/2606.29315#bib.bib1)\) exposed causal variables \(masses, sizes, friction, gravity\) for do\-interventions in robotic manipulation, demonstrating the value of interventional evaluation for transfer learning\. CRAFT \(Ates et al\., [2022](https://arxiv.org/html/2606.29315#bib.bib7)\) tested causal reasoning about forces and interactions through video question answering over Box2D scenes, including counterfactual question types\. ContPhy \(Zheng et al\., [2024](https://arxiv.org/html/2606.29315#bib.bib71)\) extended counterfactual physics questions to continuum materials\. Each introduces counterfactual or interventional structure into physics evaluation; none implements the programmatic snapshot/restore primitive that generates paired rollouts from a bit\-identical physical prefix, which Interphyre’s paired\-data generation requires\.

##### LLM\-based physical reasoning\.

LLMs underperform humans on physical attribute reasoning \(Wang et al\., [2023](https://arxiv.org/html/2606.29315#bib.bib61)\) and struggle with physical dynamics even when strong on common\-sense tasks \(Chow et al\., [2025](https://arxiv.org/html/2606.29315#bib.bib13)\)\. These systems evaluate behavior, not representations\. PhysGym \(Chen et al\., [2025](https://arxiv.org/html/2606.29315#bib.bib11)\) uses LLM agents to actively probe interactive physics simulations and formulate hypotheses, the closest existing paradigm to Interphyre, though PhysGym targets discovery of physical equations rather than intuitive task\-solving\. LLMPhy \(Cherian et al\., [2024](https://arxiv.org/html/2606.29315#bib.bib12)\) demonstrated a simulator\-in\-the\-loop framework where LLMs iteratively estimate physical parameters via code generation\. None provides the pairing mechanistic interpretability requires: model activations matched to counterfactual physical outcomes from a shared simulation state\.

##### Interpretability on controlled reasoning tasks\.

Recent work has discovered world models in Othello \(Li et al\., [2023](https://arxiv.org/html/2606.29315#bib.bib29)\), chess \(Karvonen, [2024](https://arxiv.org/html/2606.29315#bib.bib26)\), and mazes \(Spies et al\., [2025](https://arxiv.org/html/2606.29315#bib.bib52)\), and circuit structure in board games \(He et al\., [2024](https://arxiv.org/html/2606.29315#bib.bib22)\) and in\-context RL \(Demircan et al\., [2025](https://arxiv.org/html/2606.29315#bib.bib14)\)\. All these results rely on discrete, fully observable ground\-truth states\. Physics reasoning does not have this property\. Physical state is continuous and high\-dimensional; the quantity a probe should predict depends on which contact event matters for the task, not on a fixed board position\. Counterfactual access requires rewinding a running simulation to a precise mid\-trajectory event, not resetting a board to a saved position\. Thought Anchors \(Bogdan et al\., [2025](https://arxiv.org/html/2606.29315#bib.bib10)\) identify important reasoning steps via black\-box resampling of output traces; Interphyre supports a different analysis, pairing internal activations with causal physical ground truth obtained by intervening on the simulation directly rather than by resampling output traces\.

### G\.3 Design

Four design decisions give Interphyre its research capabilities: levels are Python code \([G\.3\.1](https://arxiv.org/html/2606.29315#A7.SS3.SSS1)\), physics parameters are first\-class inputs \([G\.3\.2](https://arxiv.org/html/2606.29315#A7.SS3.SSS2)\), the curriculum is certified by an oracle \([G\.3\.3](https://arxiv.org/html/2606.29315#A7.SS3.SSS3)\), and both RL and LLM interfaces share one simulator backend \([G\.3\.4](https://arxiv.org/html/2606.29315#A7.SS3.SSS4)\)\. Figure [15](https://arxiv.org/html/2606.29315#A7.F15) shows eight of the 25 canonical levels, spanning gravity\-driven drops, collision chains, balancing tasks, and aperture navigation\.

![Refer to caption](https://arxiv.org/html/2606.29315v1/x13.png)

Figure 15: Eight of the 25 canonical levels in the Interphyre curriculum, shown at initial state\. The curriculum is hand\-authored to span a range of physical phenomena: `down_to_earth` \(gravity\-driven drop past an obstruction\), `catapult` \(impulse transfer\), `basket_case` \(container avoidance\), `marble_race` \(rolling dynamics along a track\), `seesaw` \(lever balance\), `the_cradle` \(pendulum\), `keyhole` \(aperture navigation\), and `tipping_point` \(unstable equilibrium\)\. Each level is a Python class; each class is instantiable under a researcher\-specified seed and physics configuration\.

#### G\.3\.1 Levels as Python Code

Encoding levels as Python functions rather than data files means the geometry, action slots, and success conditions are ordinary code that any researcher can fork, modify, and register without rebuilding the package; Listing [1](https://arxiv.org/html/2606.29315#listing1) shows a minimal example\.

    from Interphyre import Level
    from Interphyre.levels import register_level
    from Interphyre.objects import Ball, Bar


    def success_condition(engine):
        # Success requires sustained contact between the green ball and the
        # ground; the third argument is the required contact duration.
        return engine.is_in_contact_for_duration(
            "green_ball",
            "purple_ground",
            3.0,
        )


    @register_level
    def build_level(seed=None, variant=0, scene=None) -> Level:
        # Static floor near the bottom of the scene.
        ground = Bar(
            left=-5,
            right=5,
            y=-4.9,
            thickness=0.2,
            color="purple",
            dynamic=False,
        )
        # Target object: a dynamic ball that starts near the top.
        green = Ball(
            x=0,
            y=4.0,
            radius=0.5,
            color="green",
            dynamic=True,
        )
        # Action object: the ball the agent places.
        red = Ball(
            x=0,
            y=0.0,
            radius=0.4,
            color="red",
            dynamic=True,
        )
        return Level(
            name="falling",
            objects={
                "green_ball": green,
                "red_ball": red,
                "purple_ground": ground,
            },
            action_objects=["red_ball"],
            success_condition=success_condition,
        )

Listing 1: A minimal custom level\. The geometry, the action slot, and the success condition are all ordinary Python; registering the builder makes it usable by the same evaluation pipeline as the bundled curriculum, with no package rebuild\.

A researcher studying how behavior depends on scene topology writes a new level rather than waiting for a fixed benchmark to include the condition\.

#### G\.3\.2 Physics Parameters as First\-Class Inputs

Physics parameters \(gravity, friction, restitution, density, solver step size\) are fields on a `SimulationConfig` object passed to the environment constructor, not constants compiled into the engine\. The same level, seed, and action can be evaluated under any physically plausible parameter regime by swapping one configuration for another\. Figure [16](https://arxiv.org/html/2606.29315#A7.F16) shows the consequence: the same oracle action succeeds under default physics and fails in three distinct ways under reduced gravity, zero friction, and high restitution, each failure requiring a different corrective action\.

![Refer to caption](https://arxiv.org/html/2606.29315v1/x14.png)

Figure 16: The `catapult` level \(seed 8\) under four physics configurations, each set by a single `SimulationConfig` field, with the same oracle action placed in every panel\. Ghost trails show the green ball’s in\-flight trajectory\. Default physics yields success\. Reduced gravity preserves the qualitative mechanism but extends the arc, producing a near miss: the ball clears the basket opening rather than dropping in\. Without friction the arm imparts less tangential force and the ball falls short; a higher\-energy placement would compensate\. With high restitution the elastic collision over\-drives the arm and the ball overshoots; a softer placement would keep the launch on target\. Holding scene and seed fixed, each parameter change shifts the required action\.

Unlike Kinetix, which varies physics parameters jointly with level topology \(Matthews et al\., [2025](https://arxiv.org/html/2606.29315#bib.bib37)\), Interphyre keeps scene structure and parameter regime independent; a generalization study holds the level and seed fixed while sweeping gravity or friction\.
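The idea of sweeping one physics parameter while holding the scene and action fixed can be illustrated with a toy model. The sketch below is not Interphyre code: `ToySimulationConfig`, `launch_range`, and `hits_target` are hypothetical names, and the physics is an ideal projectile rather than a Box2D simulation. It only shows the pattern of swapping one configuration field and re-evaluating the same action.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToySimulationConfig:
    """Hypothetical stand-in for a physics configuration object."""
    gravity: float = 9.81  # magnitude of downward acceleration


def launch_range(speed: float, angle_deg: float, cfg: ToySimulationConfig) -> float:
    """Horizontal range of an ideal projectile launched from flat ground."""
    angle = math.radians(angle_deg)
    return speed ** 2 * math.sin(2 * angle) / cfg.gravity


def hits_target(speed, angle_deg, cfg, target_x=10.0, tolerance=0.5) -> bool:
    """Toy success condition: land within `tolerance` of `target_x`."""
    return abs(launch_range(speed, angle_deg, cfg) - target_x) <= tolerance


default = ToySimulationConfig()
# One fixed action, tuned to land at x = 10 under default gravity.
action = (math.sqrt(10.0 * default.gravity), 45.0)

for g in (9.81, 4.9, 1.62):
    cfg = replace(default, gravity=g)  # change exactly one field
    print(g, round(launch_range(*action, cfg), 2), hits_target(*action, cfg))
```

Because the configuration is an immutable value, each regime is a separate object and the default is never mutated, which keeps the comparison across regimes clean.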

#### G\.3\.3 Curriculum and Oracle Verification

The curriculum ships with 25 canonical levels, each paired with 10,000 pre\-validated seeds, for 250,000 pre\-certified task instances\. Every seed in this distribution is guaranteed to admit at least one solving action placement, verified by an exhaustive oracle search over a discretized action grid with a bounded simulation budget per candidate rather than by stochastic sampling\. The verification is decoupled from the environment runtime: a custom level or an extended seed range can be certified offline and the resulting seed list distributed alongside the environment code\. Solvability is therefore a guarantee of the curriculum rather than of any particular bundled snapshot\.

Oracle verification ensures that, up to the resolution of the action grid, every evaluation failure is attributable to the agent\. PHYRE’s stochastic\-validity heuristic \(Bakhtin et al\., [2019](https://arxiv.org/html/2606.29315#bib.bib8)\) admits unsolvable instances, a confound controlled interpretability studies cannot afford; Kinetix trades solvability guarantees for scale, reasonable for general RL pretraining but not for paired\-counterfactual evaluation\.
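As a rough illustration of exhaustive verification over a discretized action grid, the sketch below certifies a seed by trying every grid action against a simulator callback with a bounded step budget. Everything here is assumed for illustration: `certify_seed`, `frange`, `toy_simulate`, and the `(x, y, radius)` action layout are hypothetical and do not describe Interphyre's actual oracle.

```python
from itertools import product
from typing import Callable, List, Optional, Sequence, Tuple

Action = Tuple[float, float, float]  # assumed layout: (x, y, radius)


def frange(lo: float, hi: float, n: int) -> List[float]:
    """Return n evenly spaced values from lo to hi, inclusive."""
    if n == 1:
        return [lo]
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def certify_seed(
    simulate: Callable[[Action, int], bool],
    xs: Sequence[float],
    ys: Sequence[float],
    radii: Sequence[float],
    max_steps: int = 1000,
) -> Optional[Action]:
    """Return the first grid action that solves the task, or None.

    Every grid point is tried in a fixed order, so a None result means
    no action on this grid solves the task within the step budget.
    """
    for action in product(xs, ys, radii):
        if simulate(action, max_steps):
            return action
    return None


def toy_simulate(action: Action, max_steps: int) -> bool:
    """Toy task: succeed if the ball starts above y = 0 and covers x = 2."""
    x, y, radius = action
    return y > 0.0 and abs(x - 2.0) <= radius


grid = frange(-4.0, 4.0, 9)
print(certify_seed(toy_simulate, grid, grid, [0.5]))
```

The guarantee is only as fine as the grid: a seed whose sole solutions fall between grid points would be rejected, which matches the "up to the resolution of the action grid" caveat in the text.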

#### G\.3\.4 Agent Interfaces and Observation Modes

Two interface surfaces share one simulator backend\. The gymnasium\-compatible interface \(Towers et al\., [2024](https://arxiv.org/html/2606.29315#bib.bib57)\) exposes the environment to standard RL training and evaluation code without modification: action spaces, step semantics, and reset behavior follow the conventions of the wider gymnasium ecosystem\. The tool\-call interface exposes the same environment as a set of named functions \(`inspect_scene`, `place_action`, `advance_simulation`, `query_success`\) that an LLM can invoke directly, so an LLM reasoner operating under a tool\-use harness can drive the simulator in the same turn structure it would use for any other external tool\.

Three observation modes expose the same physics to different experimental designs: `physics_state` returns a symbolic vector of positions, velocities, and contacts; `image` renders a frame in either RGB or a discrete\-color palette \(discrete colors remove the texture and shading confounds that bias VLM evaluation \(Chow et al\., [2025](https://arxiv.org/html/2606.29315#bib.bib13)\)\); `both` returns both together for multimodal agents\. Because all three run the same simulation, observation modality is a controlled variable, not a design constraint\.

### G\.4 Intervention API

Paired counterfactual analysis requires two trajectories that share a bit\-identical physical prefix but diverge under exactly one controlled perturbation\. Forward rollout cannot produce these: without the ability to hold physical history constant while changing a single variable mid\-trajectory, interpretability studies that depend on counterfactual contrast have no substrate\. Interphyre supplies this primitive via snapshot/restore, which makes the simulation*causally interrogable*\.

#### G\.4\.1 Snapshot/Restore and Counterfactual Branching

The environment is deterministic under a fixed seed: replaying the same action sequence on the same level, seed, and physics configuration yields a bit\-identical trajectory, which is the property that makes a comparison genuinely paired\.

The central primitive is the ability to capture an immutable snapshot of complete physics state at any point during simulation and restore it into two or more independent branches\. A snapshot captures the complete Box2D world state: positions, velocities, fixture properties, solver configuration, active contacts, and contact\-duration accumulators; the accumulators are required because many levels define success as sustained contact, and restoring position without restoring contact timing would break the pairing\.

After `restore`, the simulation evolves identically to the original trajectory unless the researcher introduces a perturbation\. This is what makes a comparison *paired*: the two branches share a bit\-identical physical prefix, not merely an approximate distributional match\. The branching protocol is: snapshot at a physically meaningful event, restore into branch A with no modification, restore into branch B with a controlled perturbation, and compare the divergent outcomes\.

In causal terms \(Pearl, [2009](https://arxiv.org/html/2606.29315#bib.bib42)\), the branching protocol produces a matched counterfactual pair: the shared physical prefix is the conditioning history, the perturbation is the treatment whose effect is being measured, and the divergent suffix yields the factual and counterfactual outcomes\. What makes the comparison interpretable as evidence of causal effect rather than distributional difference is the bit\-identical prefix: the only thing that differs between the two branches is the perturbation itself\.
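The branching protocol can be made concrete with a toy deterministic simulator. The sketch below is a self-contained illustration, not Interphyre's implementation: `ToySim`, `State`, and `apply_impulse` are hypothetical names, and the "physics" is a single ball falling onto a floor. It shows the three properties the text relies on: an immutable snapshot, a restored branch that reproduces the original trajectory exactly, and a contact-duration accumulator that is part of the saved state.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class State:
    step: int
    y: float
    vy: float
    contact_steps: int  # accumulator for sustained floor contact


class ToySim:
    DT = 1.0 / 60.0
    GRAVITY = -9.81

    def __init__(self, y0: float):
        self.state = State(step=0, y=y0, vy=0.0, contact_steps=0)

    def snapshot(self) -> State:
        return self.state  # State is immutable, so sharing it is safe

    def restore(self, snap: State) -> None:
        self.state = snap

    def apply_impulse(self, dv: float) -> None:
        self.state = replace(self.state, vy=self.state.vy + dv)

    def advance(self, n: int) -> None:
        for _ in range(n):
            s = self.state
            vy = s.vy + self.GRAVITY * self.DT
            y = s.y + vy * self.DT
            if y <= 0.0:  # resting on the floor
                y, vy, contact = 0.0, 0.0, s.contact_steps + 1
            else:
                contact = 0
            self.state = State(s.step + 1, y, vy, contact)


sim = ToySim(y0=1.0)
sim.advance(20)
snap = sim.snapshot()          # branch point, ball still airborne
sim.advance(100)
original = sim.state

sim.restore(snap)              # branch A: no modification
sim.advance(100)
factual = sim.state

sim.restore(snap)              # branch B: one controlled perturbation
sim.apply_impulse(5.0)
sim.advance(100)
counterfactual = sim.state

print(factual == original, factual.contact_steps, counterfactual.contact_steps)
```

Branch A matches the original run field for field, so any difference in branch B (here, fewer accumulated contact steps because the upward impulse delays landing) is attributable to the impulse alone.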

#### G\.4\.2 Event\-Driven Branch Points

The usefulness of snapshot/restore depends on *where* to branch\. Raw step indices are fragile: the step at which a ball contacts a platform varies across seeds, coupling the experimental design to seed\-specific timing\. Interphyre’s declarative trigger language decouples branch points from timing: the researcher specifies a physical event, and the engine fires automatically when the condition is met\. Table [14](https://arxiv.org/html/2606.29315#A7.T14) lists the core factories\.

Table 14: Core trigger factories\. Each factory returns a trigger that the engine evaluates per step; the researcher specifies *what* physical event gates the intervention, not *which step* it corresponds to\.

Triggers compose: `on_sequence` fires an ordered list of triggers in sequence, and `on_any` fires when any child does\.
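To illustrate how event triggers decouple a branch point from seed-specific timing, here is a minimal sketch of composable triggers over a list of per-step states. It is an assumption-laden toy: `in_sequence`, `any_of`, and `first_fire` are hypothetical helpers written for this example (deliberately not named after the real factories), and a "state" is just a dictionary holding a height.

```python
from typing import Callable, Dict, Iterable, Optional

State = Dict[str, float]
Trigger = Callable[[State], bool]  # evaluated once per simulation step


def any_of(*children: Trigger) -> Trigger:
    """Fire on any step where at least one child fires."""
    def trigger(state: State) -> bool:
        results = [child(state) for child in children]  # evaluate all children
        return any(results)
    return trigger


def in_sequence(*children: Trigger) -> Trigger:
    """Fire once every child has fired, in the given order."""
    index = 0

    def trigger(state: State) -> bool:
        nonlocal index
        if index < len(children) and children[index](state):
            index += 1  # at most one child advances per step
        return index == len(children)
    return trigger


def first_fire(states: Iterable[State], trigger: Trigger) -> Optional[int]:
    """Return the first step index at which the trigger fires, if any."""
    for step, state in enumerate(states):
        if trigger(state):
            return step
    return None


def above_two(state: State) -> bool:
    return state["y"] > 2.0


def on_ground(state: State) -> bool:
    return state["y"] <= 0.0


# Two toy "seeds" in which the same events happen at different steps.
seed_a = [{"y": y} for y in (1.0, 3.0, 1.5, 0.0, 0.0)]
seed_b = [{"y": y} for y in (0.0, 1.0, 2.5, 3.0, 1.0, 0.0)]

for states in (seed_a, seed_b):
    # A fresh trigger per run, because in_sequence keeps internal progress.
    print(first_fire(states, in_sequence(above_two, on_ground)))
```

The same trigger specification resolves to a different step index in each toy seed, which is the point of specifying the event rather than the step. In `seed_b` the ball starts on the ground, but the sequence trigger ignores that contact because the first event has not yet occurred.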

#### G\.4\.3 Worked Example: Strategy\-Dependent Causal Structure

Figure [14](https://arxiv.org/html/2606.29315#A7.F14) runs this protocol on two seeds of the `catapult` level whose oracle solutions use different strategies: in seed 8, the green projectile bounces off a black deflector ball to reach the basket; in seed 5, it flies directly in\.

Both seeds succeed in the factual branch\. The two structural interventions, however, have opposite effects\. Removing the black deflector ball causes seed 8 to fail \(the deflector was necessary\) and leaves seed 5 unaffected \(the direct path does not require it\)\. Shrinking the red action ball to radius $r = 0.4$ changes the launch dynamics for seed 5 \(failure\) while leaving seed 8’s deflector arc intact \(success\)\. No single intervention explains both outcomes\. Each identifies the causal role of one object relative to one strategy\.

Causal relevance, these results show, is not a property of an object in isolation\. The same `remove_object` call is causally necessary for one strategy and causally irrelevant for the other\. A probing study that pools across strategies averages over these opposite effects; if the two strategies occur at similar rates across seeds, the unconditional probe recovers neither\.

    def run_branch(seed, action, intervene=None):
        env = InterphyreEnv("catapult", seed=seed, enable_interventions=True)
        env.reset()
        env.place_action(action)
        # Snapshot at the gating event: the red ball contacts the catapult arm.
        snap, _ = env.run_until(
            on_contact("red_ball", "gray_platform")
        )
        env.restore(snap)
        # Advance 30 steps to the branch point, where the green ball is airborne.
        env.step_physics(30)
        # Apply the structural intervention, if any, then run to the outcome.
        if intervene:
            intervene(env)
        return env.run_until(on_success(), max_steps=450)


    for seed in [8, 5]:
        action = oracle_action(seed)
        run_branch(seed, action)  # factual branch
        run_branch(
            seed, action,
            lambda e: e.remove_object("black_ball")
        )
        run_branch(
            seed, action,
            lambda e: shrink_red_ball(e, r=0.4)
        )

Figure 17: Branching protocol for Figure [14](https://arxiv.org/html/2606.29315#A7.F14) \(catapult level, seeds 8 and 5\)\. Each call snapshots at the moment the red ball contacts the catapult arm, advances 30 steps to the branch point, where the green ball is airborne, and then applies the structural intervention\. Seed 8: removing the black deflector causes failure; shrinking the red ball does not\. Seed 5: shrinking the red ball causes failure; removing the deflector does not\. The object’s causal role depends on which strategy the seed uses, not on the object itself\.

Figure [17](https://arxiv.org/html/2606.29315#A7.F17) shows the protocol skeleton\. The snapshot is taken at the physical event that gates the experiment: the moment the red action ball contacts the catapult arm\. The branch point is 30 steps later, once the green ball is airborne and the two seeds’ trajectories have diverged\. From that point, three branches diverge under different structural modifications\. Section [G\.4\.4](https://arxiv.org/html/2606.29315#A7.SS4.SSS4) formalizes the split between kinematic and structural perturbations\.
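The pooling argument in this section can be checked with simple arithmetic on the outcomes reported for Figure 14. The sketch below encodes only those four counterfactual outcomes (1 for success, 0 for failure); the helper names and the equal weighting of the two seeds are assumptions made for illustration.

```python
# Outcomes as described for Figure 14: 1 = success, 0 = failure.
outcomes = {
    8: {"strategy": "deflector", "factual": 1,
        "remove_deflector": 0, "shrink_red_ball": 1},
    5: {"strategy": "direct_launch", "factual": 1,
        "remove_deflector": 1, "shrink_red_ball": 0},
}


def effect(seed: int, intervention: str) -> int:
    """Drop in success caused by the intervention on one seed."""
    row = outcomes[seed]
    return row["factual"] - row[intervention]


def pooled_effect(intervention: str) -> float:
    """Average effect across seeds, ignoring which strategy each uses."""
    return sum(effect(s, intervention) for s in outcomes) / len(outcomes)


for name in ("remove_deflector", "shrink_red_ball"):
    per_strategy = {outcomes[s]["strategy"]: effect(s, name) for s in outcomes}
    print(name, per_strategy, "pooled:", pooled_effect(name))
```

Conditioned on strategy, each intervention is either fully decisive or irrelevant. Pooled over the two seeds, both interventions show the same intermediate effect, so the pooled number no longer distinguishes which object matters for which strategy.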

#### G\.4\.4 Perturbation Primitives as Experimental Treatments

Counterfactual branching requires not just rewinding but *perturbing*, and the space of perturbations determines the space of experimental questions\. Kinematic perturbations \(impulse, continuous force, velocity override, position override, and freeze\) modify the state of existing objects\. These support experiments asking whether an outcome depends on a specific kinematic variable \(e\.g\., does a 5% change in post\-contact velocity flip the result?\)\. Structural perturbations \(adding and removing objects\) modify scene composition\. These support experiments asking whether an outcome depends on the presence of a particular object \(e\.g\., does removing a supporting surface change the agent’s prediction?\)\. The kinematic/structural distinction is the experimental design axis that makes causal findings interpretable: the reversal in Figure [14](https://arxiv.org/html/2606.29315#A7.F14) is visible only when intervention type is held constant and seed varied\.

### G\.5 Conclusion

Conditioning on oracle solution strategy is necessary to recover probing signal in physics\-reasoning models: pooling across strategies averages over opposite causal effects\. Linear probes on Qwen3\-8B residual\-stream activations predict counterfactual outcomes above chance across three levels, but only when indexed to the oracle strategy for each seed\. That result suggests probing without strategy conditioning conflates representations from different solution behaviors and understates what models encode about physical causality\. The same conditioning requirement applies to all three questions motivating the platform: probing how representations encode physical structure, whether reasoning steps commit to correct physics, and whether those commitments are causally related to task success all require probe targets conditioned on the oracle strategy for each seed\. Interphyre’s snapshot/restore API makes this tractable: the two branches share a bit\-identical physical prefix, so the probe’s target is the perturbation’s downstream effect rather than a distributional shift in scene statistics\.

For RL research, the independent parameter axis is the primary contribution\. Fixing scene topology while varying physics parameters isolates physics transfer from structural variation, a confound Kinetix cannot separate\. The hand\-authored curriculum provides concrete tasks with known solution structure, a substrate for curriculum learning and hypothesis\-driven behavioral tests; a researcher can ask whether a strategy learned under one physics regime transfers to another, or whether it depends on properties specific to the training regime\.

Two limitations apply to the current results\. The probing demonstration uses one model family on three levels; whether the strategy\-conditioning finding holds across model scale, architecture, and a broader curriculum is an open empirical question\. The 25\-level curriculum is hand\-authored and does not scale indefinitely without additional authoring effort, though the oracle verification system applies to any new levels without modification to the runtime\.

## Appendix H ReAct System Prompts

This appendix shows the system prompts given to the ReAct agent for the 8 Interphyre levels introduced in Section [4](https://arxiv.org/html/2606.29315#S4) and used in our experiments\. Each prompt contains a factual scene description, the success condition, the placement constraints, the full tool list \(*including* the level\-specific analysis tool\), and the standard ReAct format instructions\. We do not give the agent any privileged information: no strategy tips, no directional or geometry hints, no radius guidance, and no physics rules\.

### H\.1 Down to Earth

You are an expert physics reasoning agent solving a 2D physics puzzle\. You have access to a physics simulator and can test your ideas before submitting a final answer\.

\*\*Puzzle: Down to Earth\*\*

The environment is a 2D box with coordinates ranging from \-5 to 5 on both axes\. Gravity pulls objects downward\.

\*\*Key Elements:\*\*

\- \*\*Green Ball \(Target\):\*\* A dynamic ball that will fall due to gravity\.

\- \*\*Black High Platform:\*\* A static horizontal platform below the green ball\. Without intervention, the green ball lands on this platform and stays there\.

\- \*\*Purple Ground:\*\* The floor at the very bottom of the box \(y ~ \-5\)\.

\*\*The Goal:\*\*

Introduce a Red Ball into the scene so that the Green Ball is knocked off the platform and touches the Purple Ground for at least 3 seconds\.

\*\*Placement Constraints:\*\*

\- The red ball must be completely inside the box: \-5 \+ radius <= x <= 5 \- radius, \-5 \+ radius <= y <= 5 \- radius\.

\- The red ball must NOT overlap with the green ball at t=0: distance between centers \> sum of radii\.

\- The red ball must NOT overlap with the black platform at t=0\.

You have access to the following tools to interact with the physics simulation:

1\. get\_level\_state

Description: Get the current level layout including all object positions, sizes, and properties\.

Arguments: None

Usage: Action: get\_level\_state

2\. simulate\_action

Description: Place a red ball at \(x, y\) with the given radius and run the full physics simulation to completion\. Returns whether the goal was achieved, final positions of all objects, and total simulation steps\. If the placement is invalid \(out of bounds or overlaps\), returns a detailed error with how far to move the ball\.

Arguments: x \(float\), y \(float\), radius \(float\)

Usage: Action: simulate\_action

Action Input: \{"x": 0\.5, "y": 4\.0, "radius": 0\.6\}

3\. get\_contact\_log

Description: After running a simulation, returns the contact events: which objects touched and when\.

Arguments: None

Usage: Action: get\_contact\_log

4\. simulate\_partial

Description: Place a red ball and run the simulation only up to the specified step\. Returns object positions and velocities at that point\. Useful for observing mid\-simulation dynamics\.

Arguments: x \(float\), y \(float\), radius \(float\), stop\_step \(int\)

Usage: Action: simulate\_partial

Action Input: \{"x": 0\.5, "y": 4\.0, "radius": 0\.6, "stop\_step": 50\}

5\. compute\_gap\_analysis

Description: Analyze the gaps on each side of the platform\.

Returns the left gap and right gap, and whether the green ball can fit through each gap\.

Arguments: None

Usage: Action: compute\_gap\_analysis

6\. finish

Description: Submit your final answer\. Use this when you are confident in your solution\.

Arguments: x \(float\), y \(float\), radius \(float\)

Usage: Action: finish

Action Input: \{"x": 0\.5, "y": 4\.0, "radius": 0\.6\}

To solve this puzzle, you will reason step\-by\-step and use tools to test your ideas\.

At each step, you MUST follow this exact format:

Thought: <your reasoning about what to do next\>

Action: <tool name\>

Action Input: <JSON arguments, or leave blank for tools with no arguments\>

After you take an action, you will receive:

Observation: <result from the tool\>

Then you continue with another Thought/Action cycle\.

When you are confident in your answer, use the "finish" tool to submit it\.

Important rules:

\- Always start with a Thought before taking an Action\.

\- Only call ONE tool per step\.

\- Parse observation results carefully before your next thought\.

\- You can simulate multiple different actions to compare results\.

\- Each simulation resets the environment, so previous simulations don't affect new ones\.
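The Thought / Action / Action Input format specified in the prompt above implies that the harness must parse each model turn before dispatching a tool call. The following is a minimal sketch of such a parser, written for illustration only: `Step`, `STEP_RE`, and `parse_step` are hypothetical names, and this is not the harness used in the experiments.

```python
import json
import re
from typing import Any, Dict, NamedTuple


class Step(NamedTuple):
    thought: str
    action: str
    action_input: Dict[str, Any]


# One turn: a Thought, an Action (tool name), and an optional JSON object.
STEP_RE = re.compile(
    r"Thought:\s*(?P<thought>.*?)\s*"
    r"Action:\s*(?P<action>\w+)\s*"
    r"(?:Action Input:\s*(?P<input>\{.*\})?)?\s*$",
    re.DOTALL,
)


def parse_step(text: str) -> Step:
    """Parse one model turn; raise ValueError if the format is violated."""
    match = STEP_RE.search(text.strip())
    if match is None:
        raise ValueError("turn does not follow the Thought/Action format")
    raw = match.group("input")
    arguments = json.loads(raw) if raw else {}  # blank input means no arguments
    return Step(match.group("thought"), match.group("action"), arguments)


turn = (
    "Thought: Try dropping the ball to the right of the green ball.\n"
    "Action: simulate_action\n"
    'Action Input: {"x": 0.5, "y": 4.0, "radius": 0.6}'
)
print(parse_step(turn))
print(parse_step("Thought: Inspect the scene first.\nAction: get_level_state"))
```

Rejecting malformed turns with an explicit error, rather than guessing at the intended tool, keeps a format failure distinguishable from a physics failure when results are analyzed.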

### H\.2 Two Body Problem

You are an expert physics reasoning agent solving a 2D physics puzzle\. You have access to a physics simulator and can test your ideas before submitting a final answer\.

\*\*Puzzle: Two Body Problem\*\*

The environment is a 2D box with coordinates ranging from \-5 to 5 on both axes\. Gravity pulls objects downward\.

\*\*Key Elements:\*\*

\- \*\*Green Ball:\*\* A dynamic ball\.

\- \*\*Blue Ball:\*\* A dynamic ball, separated horizontally from the green ball\.

\- Both balls fall under gravity from rest\.

\*\*The Goal:\*\*

Place ONE Red Ball at t=0 so that the Green Ball collides with the Blue Ball and stays in contact\.

\*\*Placement Constraints:\*\*

\- The red ball must be completely inside the box: \-5 \+ radius <= x <= 5 \- radius, \-5 \+ radius <= y <= 5 \- radius\.

\- The red ball must NOT overlap with the green ball or the blue ball at t=0\.

\- 0\.1 <= radius <= 1\.5

You have access to the following tools to interact with the physics simulation:

1\. get\_level\_state

Description: Get the current level layout including all object positions, sizes, and properties\.

Arguments: None

Usage: Action: get\_level\_state

2\. simulate\_action

Description: Place a red ball at \(x, y\) with the given radius and run the full physics simulation to completion\. Returns whether the goal was achieved, final positions of all objects, and total simulation steps\. If the placement is invalid \(out of bounds or overlaps\), returns a detailed error with how far to move the ball\.

Arguments: x \(float\), y \(float\), radius \(float\)

Usage: Action: simulate\_action

Action Input: \{"x": 0\.5, "y": 4\.0, "radius": 0\.6\}

3\. get\_contact\_log

Description: After running a simulation, returns the contact events: which objects touched and when\.

Arguments: None

Usage: Action: get\_contact\_log

4\. simulate\_partial

Description: Place a red ball and run the simulation only up to the specified step\. Returns object positions and velocities at that point\. Useful for observing mid\-simulation dynamics\.

Arguments: x \(float\), y \(float\), radius \(float\), stop\_step \(int\)

Usage: Action: simulate\_partial

Action Input: \{"x": 0\.5, "y": 4\.0, "radius": 0\.6, "stop\_step": 50\}

5\. compute\_relative\_positions

Description: Analyze the positions of the green and blue balls\.

Returns their coordinates, distance, on which side the blue ball is

relative to green, and recommended red ball placement direction\.

Arguments: None

Usage: Action: compute\_relative\_positions

6\. finish

Description: Submit your final answer\. Use this when you are confident in your solution\.

Arguments: x \(float\), y \(float\), radius \(float\)

Usage: Action: finish

Action Input: \{"x": 0\.5, "y": 4\.0, "radius": 0\.6\}

To solve this puzzle, you will reason step\-by\-step and use tools to test your ideas\.

At each step, you MUST follow this exact format:

Thought: <your reasoning about what to do next\>

Action: <tool name\>

Action Input: <JSON arguments, or leave blank for tools with no arguments\>

After you take an action, you will receive:

Observation: <result from the tool\>

Then you continue with another Thought/Action cycle\.

When you are confident in your answer, use the "finish" tool to submit it\.

Important rules:

\- Always start with a Thought before taking an Action\.

\- Only call ONE tool per step\.

\- Parse observation results carefully before your next thought\.

\- You can simulate multiple different actions to compare results\.

\- Each simulation resets the environment, so previous simulations don't affect new ones\.

### H\.3 Pass the Parcel

Youareanexpertphysicsreasoningagentsolvinga2Dphysicspuzzle\.Youhaveaccesstoaphysicssimulatorandcantestyourideasbeforesubmittingafinalanswer\.

\*\*Puzzle:PassTheParcel\*\*

Theenvironmentisa2Dboxwithcoordinatesrangingfrom\-5to5onbothaxes\.Gravitypullsobjectsdownward\.

\*\*KeyElements:\*\*

\-\*\*TopBasket\(Gray,inverted\):\*\*AdynamicbasketsittingontheblackplatformwithitsopeningfacingDOWNWARD\.Ittrapsthegreenballunderneathit\.

\-\*\*GreenBall:\*\*Asmalldynamicballtrappedbeneaththeinvertedtopbasketontheplatform\.

\-\*\*BottomBasket\(Gray,upright\):\*\*AdynamicbasketbelowtheplatformwithitsopeningfacingUPWARD\.Itholdstheblueball\.

\-\*\*BlueBall:\*\*Adynamicballsittinginsidethebottombasket\.Thisisthetarget\-\-\-thegreenballmusttouchit\.

\-\*\*BlackPlatform:\*\*Astatichorizontalbar\.Thetopbasketand\(initially\)redballsitonit\.

\-\*\*Ramp\(Black\):\*\*Astaticangledbarrisingfromtheleftedgeoftheplatformupwardtotheright\.Usefulforrollingtheredballdownontothetopbasket\.

\*\*TheGoal:\*\*

PlaceONERedBallsothattheGreenBallcontactstheBlueBallforatleast3seconds\.

\*\*PlacementConstraints:\*\*

\-Theredballmustbecompletelyinsidethebox:\-5\+radius<=x<=5\-radius,\-5\+radius<=y<=5\-radius\.

\-TheredballmustNOToverlapwithexistingobjectsatt=0\.

\-0\.1<=radius<=2\.0

You have access to the following tools to interact with the physics simulation:

1\. get\_level\_state

Description: Get the current level layout including all object positions, sizes, and properties\.

Arguments: None

Usage: Action: get\_level\_state

2\. simulate\_action

Description: Place a red ball at \(x, y\) with the given radius and run the full physics simulation to completion\. Returns whether the goal was achieved, final positions of all objects, and total simulation steps\. If the placement is invalid \(out of bounds or overlaps\), returns a detailed error with how far to move the ball\.

Arguments: x \(float\), y \(float\), radius \(float\)

Usage: Action: simulate\_action

Action Input: \{"x":0\.5,"y":4\.0,"radius":0\.6\}

3\. get\_contact\_log

Description: After running a simulation, returns the contact events: which objects touched and when\.

Arguments: None

Usage: Action: get\_contact\_log

4\. simulate\_partial

Description: Place a red ball and run the simulation only up to the specified step\. Returns object positions and velocities at that point\. Useful for observing mid\-simulation dynamics\.

Arguments: x \(float\), y \(float\), radius \(float\), stop\_step \(int\)

Usage: Action: simulate\_partial

Action Input: \{"x":0\.5,"y":4\.0,"radius":0\.6,"stop\_step":50\}

5\. get\_ramp\_center

Description: Analyze the pass\_the\_parcel setup\. Returns the center of the ramp\.

Arguments: None

Usage: Action: get\_ramp\_center

6\. finish

Description: Submit your final answer\. Use this when you are confident in your solution\.

Arguments: x \(float\), y \(float\), radius \(float\)

Usage: Action: finish

Action Input: \{"x":0\.5,"y":4\.0,"radius":0\.6\}

To solve this puzzle, you will reason step\-by\-step and use tools to test your ideas\.

At each step, you MUST follow this exact format:

Thought: <your reasoning about what to do next\>

Action: <tool name\>

Action Input: <JSON arguments, or leave blank for tools with no arguments\>

After you take an action, you will receive:

Observation: <result from the tool\>

Then you continue with another Thought/Action cycle\.

When you are confident in your answer, use the "finish" tool to submit it\.

Important rules:

\- Always start with a Thought before taking an Action\.

\- Only call ONE tool per step\.

\- Parse observation results carefully before your next thought\.

\- You can simulate multiple different actions to compare results\.

\- Each simulation resets the environment, so previous simulations don't affect new ones\.
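The box and radius constraints stated in these prompts (\-5 + radius <= x <= 5 \- radius, likewise for y, and 0.1 <= radius <= 2.0) can be checked before spending a simulation call. The sketch below is a hypothetical helper, not part of the benchmark's API; it does not cover the no-overlap constraint, which depends on the level layout returned by the simulator.

```python
def placement_in_bounds(x, y, radius, half_width=5.0, r_min=0.1, r_max=2.0):
    """Check only the box and radius constraints from the prompt.

    Overlap with existing objects at t=0 is NOT checked here; that requires
    the level state and is reported by the simulator itself.
    """
    if not (r_min <= radius <= r_max):
        return False
    low, high = -half_width + radius, half_width - radius
    return low <= x <= high and low <= y <= high
```

For instance, the example action used throughout the prompts, `{"x": 0.5, "y": 4.0, "radius": 0.6}`, satisfies these bounds because 4.0 <= 5 \- 0.6.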

### H\.4 Catapult

You are an expert physics reasoning agent solving a 2D physics puzzle\. You have access to a physics simulator and can test your ideas before submitting a final answer\.

\*\*Puzzle: Catapult\*\*

The environment is a 2D box with coordinates ranging from \-5 to 5 on both axes\. Gravity pulls objects downward\.

\*\*Key Elements \(factual \-\-\- no implied approach\):\*\*

\- \*\*Green Ball:\*\* A small dynamic ball sitting on the LEFT end of a gray bar\.

\- \*\*Gray Bar \(Catapult Arm\):\*\* A dynamic lever resting on a gray ball \(pivot\)\. The green ball sits on its left end\.

\- \*\*Gray Ball \(Pivot\):\*\* A dynamic ball acting as the fulcrum\. It sits on the left black platform\.

\- \*\*Black Ball \(Ceiling Blocker\):\*\* A static ball near the top of the scene\.

\- \*\*Black Platform \(Left\):\*\* A static horizontal platform on the left side\.

\- \*\*Black Ledge \(Right\):\*\* A static \(possibly angled\) platform on the right side\.

\- \*\*Basket \(Gray\):\*\* A dynamic basket sitting on the right ledge\.

\- \*\*Blue Ball \(Target\):\*\* A dynamic ball inside the basket\.

\*\*The Goal:\*\*

Place ONE Red Ball somewhere in the box so that, once the simulation runs, the green ball contacts the blue ball for at least 3 seconds\. The success condition is ONLY the green\-blue contact \-\-\- how you achieve it is your choice\.

\*\*Placement Constraints:\*\*

\- The red ball must be completely inside the box: \-5 \+ radius <= x <= 5 \- radius, \-5 \+ radius <= y <= 5 \- radius\.

\- The red ball must NOT overlap with any existing object at t=0\.

\- 0\.1 <= radius <= 2\.0

You have access to the following tools to interact with the physics simulation:

1\. describe\_scene\_geometry

Description: Return strategy\-neutral geometry: every ball \(position, radius, dynamic flag\), every bar \(position, angle, length, dynamic flag\), every basket \(position, dynamic flag\), and the key distance \(green <\-\> blue\)\. No prescriptive advice; you interpret the layout to form a strategy\.

Arguments: None

Usage: Action: describe\_scene\_geometry

1\. get\_level\_state

Description: Get the current level layout including all object positions, sizes, and properties\.

Arguments: None

Usage: Action: get\_level\_state

2\. simulate\_action

Description: Place a red ball at \(x, y\) with the given radius and run the full physics simulation to completion\. Returns whether the goal was achieved, final positions of all objects, and total simulation steps\. If the placement is invalid \(out of bounds or overlaps\), returns a detailed error with how far to move the ball\.

Arguments: x \(float\), y \(float\), radius \(float\)

Usage: Action: simulate\_action

Action Input: \{"x":0\.5,"y":4\.0,"radius":0\.6\}

3\. get\_contact\_log

Description: After running a simulation, returns the contact events: which objects touched and when\.

Arguments: None

Usage: Action: get\_contact\_log

4\. simulate\_partial

Description: Place a red ball and run the simulation only up to the specified step\. Returns object positions and velocities at that point\. Useful for observing mid\-simulation dynamics\.

Arguments: x \(float\), y \(float\), radius \(float\), stop\_step \(int\)

Usage: Action: simulate\_partial

Action Input: \{"x":0\.5,"y":4\.0,"radius":0\.6,"stop\_step":50\}

6\. trace\_green\_ball

Description: Lightweight trajectory probe \-\-\- only the green ball is sampled\. Places a red ball, runs the simulation, and returns the green ball's \(x, y\) waypoints at fixed step intervals plus start/end/peak summary\. Stops early once the green ball comes to rest \(capped at ~600 steps\)\. Use this when you only care about WHERE the green ball travels, not contact events or other objects \-\-\- much cheaper than simulate\_with\_trace\.

Arguments: x \(float\), y \(float\), radius \(float\)

Usage: Action: trace\_green\_ball

Action Input: \{"x":1\.2,"y":3\.5,"radius":0\.6\}

7\. predict\_first\_contact

Description: Cheap pre\-simulation check \(<=90 physics steps, ~1\.5s of sim time\)\. Runs just long enough to find the FIRST object the red ball touches after it is released, and reports: placement validity, the other object's name, the step of impact, approach speed, approximate contact point, and surface normal\. Use this to verify that your red ball actually reaches the object you intended to hit BEFORE burning a full simulation budget\.

Arguments: x \(float\), y \(float\), radius \(float\)

Usage: Action: predict\_first\_contact

Action Input: \{"x":1\.2,"y":3\.5,"radius":0\.6\}

8\. simulate\_with\_trace

Description: Place a red ball and run the simulation\. Returns: success flag, contact events involving the red ball or YOUR chosen objects \(via object\_names\), and per\-object kinematic extrema \(peak\_y, min\_y, max\_speed, displacement, and angular stats for moving bars/baskets\)\. You choose which objects to trace \-\-\- e\.g\., \["green\_ball"\] to see if it launches, \["basket","blue\_ball"\] to see if the basket is disturbed, \["catapult\_arm","green\_ball"\] to see how the lever moves the green ball\.

Arguments: x \(float\), y \(float\), radius \(float\), object\_names \(list of strings\), n\_samples \(int, optional, unused\), stop\_step \(int, optional, default = run to completion\)

Usage: Action: simulate\_with\_trace

Action Input: \{"x":1\.2,"y":3\.5,"radius":0\.6,"object\_names":\["green\_ball","catapult\_arm"\]\}

9\. finish

Description: Submit your final answer\. Use this when you are confident in your solution\.

Arguments: x \(float\), y \(float\), radius \(float\)

Usage: Action: finish

Action Input: \{"x":0\.5,"y":4\.0,"radius":0\.6\}

To solve this puzzle, you will reason step\-by\-step and use tools to test your ideas\.

At each step, you MUST follow this exact format:

Thought: <your reasoning about what to do next\>

Action: <tool name\>

Action Input: <JSON arguments, or leave blank for tools with no arguments\>

After you take an action, you will receive:

Observation: <result from the tool\>

Then you continue with another Thought/Action cycle\.

When you are confident in your answer, use the "finish" tool to submit it\.

Important rules:

\- Always start with a Thought before taking an Action\.

\- Only call ONE tool per step\.

\- Parse observation results carefully before your next thought\.

\- You can simulate multiple different actions to compare results\.

\- Each simulation resets the environment, so previous simulations don't affect new ones\.

### H\.5 Falling Into Place

You are an expert physics reasoning agent solving a 2D physics puzzle\. You have access to a physics simulator and can test your ideas before submitting a final answer\.

\*\*Puzzle: Falling Into Place\*\*

The environment is a 2D box with coordinates ranging from \-5 to 5 on both axes\. Gravity pulls objects downward\.

\*\*Key Elements:\*\*

\- \*\*Green Ball:\*\* A dynamic ball sitting on one of the two black platforms \(left or right side\)\.

\- \*\*Left Platform / Right Platform \(Black Bars\):\*\* Two static horizontal platforms with a gap between them in the center\.

\- \*\*Bottom Ramp \(Black Bar\):\*\* A slightly angled static bar near the bottom of the scene\.

\- \*\*Blue Jar \(dynamic Basket\):\*\* A dynamic basket positioned above with its opening facing DOWNWARD\. It falls due to gravity\.

\*\*The Goal:\*\*

Place ONE Red Ball so that the Green Ball touches the Blue Jar for at least 3 seconds\.

\*\*Placement Constraints:\*\*

\- The red ball must be completely inside the box: \-5 \+ radius <= x <= 5 \- radius, \-5 \+ radius <= y <= 5 \- radius\.

\- The red ball must NOT overlap with any existing object at t=0\.

\- 0\.1 <= radius <= 2\.0

You have access to the following tools to interact with the physics simulation:

1\. get\_level\_state

Description: Get the current level layout including all object positions, sizes, and properties\.

Arguments: None

Usage: Action: get\_level\_state

2\. simulate\_action

Description: Place a red ball at \(x, y\) with the given radius and run the full physics simulation to completion\. Returns whether the goal was achieved, final positions of all objects, and total simulation steps\. If the placement is invalid \(out of bounds or overlaps\), returns a detailed error with how far to move the ball\.

Arguments: x \(float\), y \(float\), radius \(float\)

Usage: Action: simulate\_action

Action Input: \{"x":0\.5,"y":4\.0,"radius":0\.6\}

3\. get\_contact\_log

Description: After running a simulation, returns the contact events: which objects touched and when\.

Arguments: None

Usage: Action: get\_contact\_log

4\. simulate\_partial

Description: Place a red ball and run the simulation only up to the specified step\. Returns object positions and velocities at that point\. Useful for observing mid\-simulation dynamics\.

Arguments: x \(float\), y \(float\), radius \(float\), stop\_step \(int\)

Usage: Action: simulate\_partial

Action Input: \{"x":0\.5,"y":4\.0,"radius":0\.6,"stop\_step":50\}

5\. compute\_intercept\_setup

Description: Computes intercept geometry for the falling\_into\_place level\. Returns which platform the green ball is on, which direction it must travel to reach the jar, the platform edge it must cross, the gap center, and the estimated time before the jar reaches platform height\.

Arguments: None

Usage: Action: compute\_intercept\_setup

6\. finish

Description: Submit your final answer\. Use this when you are confident in your solution\.

Arguments: x \(float\), y \(float\), radius \(float\)

Usage: Action: finish

Action Input: \{"x":0\.5,"y":4\.0,"radius":0\.6\}

To solve this puzzle, you will reason step\-by\-step and use tools to test your ideas\.

At each step, you MUST follow this exact format:

Thought: <your reasoning about what to do next\>

Action: <tool name\>

Action Input: <JSON arguments, or leave blank for tools with no arguments\>

After you take an action, you will receive:

Observation: <result from the tool\>

Then you continue with another Thought/Action cycle\.

When you are confident in your answer, use the "finish" tool to submit it\.

Important rules:

\- Always start with a Thought before taking an Action\.

\- Only call ONE tool per step\.

\- Parse observation results carefully before your next thought\.

\- You can simulate multiple different actions to compare results\.

\- Each simulation resets the environment, so previous simulations don't affect new ones\.

### H\.6 Basket Case

You are an expert physics reasoning agent solving a 2D physics puzzle\. You have access to a physics simulator and can test your ideas before submitting a final answer\.

\*\*Puzzle: Basket Case\*\*

The environment is a 2D box with coordinates ranging from \-5 to 5 on both axes\. Gravity pulls objects downward\.

\*\*Key Elements:\*\*

\- \*\*Green Ball:\*\* A dynamic ball positioned high above the basket\. It falls straight down due to gravity\.

\- \*\*Basket \(Gray\):\*\* A dynamic container sitting near the purple ground\. Its opening faces UPWARD\. Without intervention, the green ball falls directly into it and gets trapped\.

\- \*\*Purple Ground:\*\* A static bar at the very bottom of the scene \(y ~ \-5\)\. This is the target surface\.

\*\*The Goal:\*\*

Place ONE Red Ball so that the Green Ball touches the Purple Ground for at least 3 seconds\. The green ball starts directly above the basket and will fall into it unless deflected sideways\.

\*\*Placement Constraints:\*\*

\- The red ball must be completely inside the box: \-5 \+ radius <= x <= 5 \- radius, \-5 \+ radius <= y <= 5 \- radius\.

\- The red ball must NOT overlap with the green ball or basket at t=0\.

\- 0\.1 <= radius <= 2\.0

You have access to the following tools to interact with the physics simulation:

1\. get\_level\_state

Description: Get the current level layout including all object positions, sizes, and properties\.

Arguments: None

Usage: Action: get\_level\_state

2\. simulate\_action

Description: Place a red ball at \(x, y\) with the given radius and run the full physics simulation to completion\. Returns whether the goal was achieved, final positions of all objects, and total simulation steps\. If the placement is invalid \(out of bounds or overlaps\), returns a detailed error with how far to move the ball\.

Arguments: x \(float\), y \(float\), radius \(float\)

Usage: Action: simulate\_action

Action Input: \{"x":0\.5,"y":4\.0,"radius":0\.6\}

3\. get\_contact\_log

Description: After running a simulation, returns the contact events: which objects touched and when\.

Arguments: None

Usage: Action: get\_contact\_log

4\. simulate\_partial

Description: Place a red ball and run the simulation only up to the specified step\. Returns object positions and velocities at that point\. Useful for observing mid\-simulation dynamics\.

Arguments: x \(float\), y \(float\), radius \(float\), stop\_step \(int\)

Usage: Action: simulate\_partial

Action Input: \{"x":0\.5,"y":4\.0,"radius":0\.6,"stop\_step":50\}

5\. compute\_basket\_analysis

Description: Analyze the basket case setup\. Returns the green ball position, basket position and scale, purple ground position, and recommended push direction to deflect the green ball away from the basket\.

Arguments: None

Usage: Action: compute\_basket\_analysis

6\. finish

Description: Submit your final answer\. Use this when you are confident in your solution\.

Arguments: x \(float\), y \(float\), radius \(float\)

Usage: Action: finish

Action Input: \{"x":0\.5,"y":4\.0,"radius":0\.6\}

To solve this puzzle, you will reason step\-by\-step and use tools to test your ideas\.

At each step, you MUST follow this exact format:

Thought: <your reasoning about what to do next\>

Action: <tool name\>

Action Input: <JSON arguments, or leave blank for tools with no arguments\>

After you take an action, you will receive:

Observation: <result from the tool\>

Then you continue with another Thought/Action cycle\.

When you are confident in your answer, use the "finish" tool to submit it\.

Important rules:

\- Always start with a Thought before taking an Action\.

\- Only call ONE tool per step\.

\- Parse observation results carefully before your next thought\.

\- You can simulate multiple different actions to compare results\.

\- Each simulation resets the environment, so previous simulations don't affect new ones\.

### H\.7 Cliffhanger

You are an expert physics reasoning agent solving a 2D physics puzzle\. You have access to a physics simulator and can test your ideas before submitting a final answer\.

\*\*Puzzle: Cliffhanger\*\*

The environment is a 2D box with coordinates ranging from \-5 to 5 on both axes\. Gravity pulls objects downward\.

\*\*Key Elements \(factual \-\-\- no implied approach\):\*\*

\- \*\*Green Bar:\*\* A dynamic, vertical bar \(length 2\.0\-\-3\.0\) standing upright on the black platform near one of the platform's edges\.

\- \*\*Black Platform:\*\* A static horizontal bar \(length 4\.0\-\-6\.0\) at variable height y in \[\-3, 0\]; the green bar stands on top of it\.

\- \*\*Ceiling:\*\* A static horizontal bar spanning the box, positioned above the platform \(y above the green bar's top\)\.

\- \*\*Purple Ground:\*\* The static floor at the bottom of the box \(y ~= \-5\)\.

\*\*The Goal:\*\*

Place ONE Red Ball somewhere in the box so that, once the simulation runs, the green bar contacts the purple ground for at least 3 seconds\. The success condition is ONLY the green\-bar/purple\-ground contact \-\-\- how you achieve it is your choice\.

\*\*Placement Constraints:\*\*

\- The red ball must be completely inside the box: \-5 \+ radius <= x <= 5 \- radius, \-5 \+ radius <= y <= 5 \- radius\.

\- The red ball must NOT overlap with any existing object at t=0\.

\- 0\.1 <= radius <= 2\.0

You have access to the following tools to interact with the physics simulation:

1\. get\_level\_state

Description: Get the current level layout including all object positions, sizes, and properties\.

Arguments: None

Usage: Action: get\_level\_state

2\. simulate\_action

Description: Place a red ball at \(x, y\) with the given radius and run the full physics simulation to completion\. Returns whether the goal was achieved, final positions of all objects, and total simulation steps\. If the placement is invalid \(out of bounds or overlaps\), returns a detailed error with how far to move the ball\.

Arguments: x \(float\), y \(float\), radius \(float\)

Usage: Action: simulate\_action

Action Input: \{"x":0\.5,"y":4\.0,"radius":0\.6\}

3\. get\_contact\_log

Description: After running a simulation, returns the contact events: which objects touched and when\.

Arguments: None

Usage: Action: get\_contact\_log

4\. simulate\_partial

Description: Place a red ball and run the simulation only up to the specified step\. Returns object positions and velocities at that point\. Useful for observing mid\-simulation dynamics\.

Arguments: x \(float\), y \(float\), radius \(float\), stop\_step \(int\)

Usage: Action: simulate\_partial

Action Input: \{"x":0\.5,"y":4\.0,"radius":0\.6,"stop\_step":50\}

5\. compute\_cliffhanger\_analysis

Description: Analyse the cliffhanger geometry\. Returns the green bar's centre, length, and the \(x, y\) coordinates of its bottom point \(resting on the platform\) and top point \(opposite end\); the platform's left/right extents and top\-surface y; the ceiling y and purple\-ground y; the bar's distance to each platform edge; which edge is closer \(LEFT or RIGHT\) \-\-\- i\.e\. the edge the bar must fall past \-\-\- and the width of the falling gap on that side\.

Arguments: None

Usage: Action: compute\_cliffhanger\_analysis

6\. finish

Description: Submit your final answer\. Use this when you are confident in your solution\.

Arguments: x \(float\), y \(float\), radius \(float\)

Usage: Action: finish

Action Input: \{"x":0\.5,"y":4\.0,"radius":0\.6\}

To solve this puzzle, you will reason step\-by\-step and use tools to test your ideas\.

At each step, you MUST follow this exact format:

Thought: <your reasoning about what to do next\>

Action: <tool name\>

Action Input: <JSON arguments, or leave blank for tools with no arguments\>

After you take an action, you will receive:

Observation: <result from the tool\>

Then you continue with another Thought/Action cycle\.

When you are confident in your answer, use the "finish" tool to submit it\.

Important rules:

\- Always start with a Thought before taking an Action\.

\- Only call ONE tool per step\.

\- Parse observation results carefully before your next thought\.

\- You can simulate multiple different actions to compare results\.

\- Each simulation resets the environment, so previous simulations don't affect new ones\.

### H\.8 Tipping Point

You are an expert physics reasoning agent solving a 2D physics puzzle\.

Puzzle: Tipping Point

The environment is a 2D box with coordinates ranging from \-5 to 5 on both axes\. Gravity pulls objects downward\.

Key Elements \(factual\):

\- Green Bar: dynamic vertical bar \(length 2\.0\-\-5\.0\) resting upright in a small gray basket on the ground\.

\- Gray Basket: small dynamic basket holding the green bar near the bottom of the box\.

\- Purple Wall: static vertical bar flush against the LEFT \(x ~= \-4\.9\) or RIGHT \(x ~= 4\.9\) side of the box; spans top to bottom\.

\- Box bounds: x in \[\-5, 5\], y in \[\-5, 5\]\.

Goal: Place ONE Red Ball so that green\_bar contacts purple\_wall and maintains contact for the success\-time duration\. The path to that outcome is yours to design\.

Placement Constraints:

\- Red ball must be inside the box: \-5 \+ radius <= x <= 5 \- radius, \-5 \+ radius <= y <= 5 \- radius\.

\- No overlap with existing objects at t=0\.

\- 0\.1 <= radius <= 2\.0

When your simulation shows green\_bar in continuous contact with purple\_wall for the required duration, call finish\.

You have access to the following tools to interact with the physics simulation:

1\. get\_level\_state

Description: Get the current level layout including all object positions, sizes, and properties\.

Arguments: None

Usage: Action: get\_level\_state

2\. simulate\_action

Description: Place a red ball at \(x, y\) with the given radius and run the full physics simulation to completion\. Returns whether the goal was achieved, final positions of all objects, and total simulation steps\. If the placement is invalid \(out of bounds or overlaps\), returns a detailed error with how far to move the ball\.

Arguments: x \(float\), y \(float\), radius \(float\)

Usage: Action: simulate\_action

Action Input: \{"x":0\.5,"y":4\.0,"radius":0\.6\}

3\. get\_contact\_log

Description: After running a simulation, returns the contact events: which objects touched and when\.

Arguments: None

Usage: Action: get\_contact\_log

4\. simulate\_partial

Description: Place a red ball and run the simulation only up to the specified step\. Returns object positions and velocities at that point\. Useful for observing mid\-simulation dynamics\.

Arguments: x \(float\), y \(float\), radius \(float\), stop\_step \(int\)

Usage: Action: simulate\_partial

Action Input: \{"x":0\.5,"y":4\.0,"radius":0\.6,"stop\_step":50\}

5\. compute\_tipping\_point\_analysis

Description: Analyse tipping\_point geometry\. Returns the green bar's centre, length, angle, and the \(x, y\) coordinates of its top and bottom endpoints; the basket's centre and floor; the purple wall's x position and its top/bottom y; the purple wall's side relative to the green bar \(LEFT or RIGHT\); the horizontal distance from the bar's centre to the wall; the angle the bar must tip through to touch the wall \(assuming it pivots near the basket\); and the suggested tip direction \(LEFT or RIGHT\)\.

Arguments: None

Usage: Action: compute\_tipping\_point\_analysis

6\. finish

Description: Submit your final answer\. Use this when you are confident in your solution\.

Arguments: x \(float\), y \(float\), radius \(float\)

Usage: Action: finish

Action Input: \{"x":0\.5,"y":4\.0,"radius":0\.6\}

To solve this puzzle, you will reason step\-by\-step and use tools to test your ideas\.

At each step, you MUST follow this exact format:

Thought: <your reasoning about what to do next\>

Action: <tool name\>

Action Input: <JSON arguments, or leave blank for tools with no arguments\>

After you take an action, you will receive:

Observation: <result from the tool\>

Then you continue with another Thought/Action cycle\.

When you are confident in your answer, use the "finish" tool to submit it\.

Important rules:

\- Always start with a Thought before taking an Action\.

\- Only call ONE tool per step\.

\- Parse observation results carefully before your next thought\.

\- You can simulate multiple different actions to compare results\.

\- Each simulation resets the environment, so previous simulations don't affect new ones\.

## Appendix I Evolver Prompts for HExA and Cross\-Level Transfer

This appendix reproduces, verbatim, the prompts presented to the evolver that drive every skill\-distillation result in this paper\. Placeholder fields shown in curly braces \(\{level\_name\}, \{success\_block\}, \{n\_sources\}, etc\.\) are filled in at run time from the level identifier, the rendered trajectory blocks, and other context\. Double braces \{\{\.\.\.\}\} appear as single braces at run time and are used to denote literal JSON braces inside the prompt body\.
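The brace convention described above matches the behaviour of Python's `str.format`, where `{name}` is substituted and `{{`/`}}` render as literal braces. The snippet below is a minimal sketch under the assumption that the prompts are rendered this way; the paper does not state which templating mechanism is used, and the shortened `TEMPLATE` string is illustrative rather than one of the actual prompts.

```python
# Assumed rendering mechanism: Python str.format (not confirmed by the paper).
TEMPLATE = (
    "**Level: {level_name}**\n"
    "Output a JSON array:\n"
    '[{{"title": "...", "source_seeds": [1, 5, 16]}}]'
)

# {level_name} is filled in; the doubled braces collapse to single JSON braces.
rendered = TEMPLATE.format(level_name="catapult")
print(rendered)
```

Writing the JSON skeleton with single braces instead would make `str.format` treat `"title"` as a placeholder name and raise a `KeyError`, which is why the doubling is needed.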

### I\.1 Pass 1 Contrastive Skill Distillation

Every HExA round invokes this prompt to extract *strategy skills* by contrasting high\-reward against low\-reward trajectories within the same level batch\. The reward function from Eq\. [6](https://arxiv.org/html/2606.29315#A1.E6) is communicated to the teacher as a per\-trajectory Reward annotation embedded in the rendered success and failure blocks\.

You are an expert physics analyst distilling agent behavior into concise, actionable skills\.

\*\*Environment\*\*: 2D physics simulation \(Box2D\)\. Gravity = \-9\.8 m/s^2\. World bounds \[\-5, 5\] on both axes\. The agent places a red ball at \(x, y\) with a given radius, then the simulation runs to completion\.

\*\*Level: \{level\_name\}\*\*

\{level\_description\}

Below are SUCCESSFUL and FAILED trajectories from the same level\. Each shows the agent's reasoning \(Thought\), actions taken, and simulation observations\. Each trajectory has a \*\*Reward\*\* score:

\- Successes: \+1\.0 \(solved fast, 1\-3 iters\) to \+0\.25 \(solved slowly, 16\-25 iters\)

\- Failures: \-0\.5 \(tried all iterations\) to \-0\.75 \(gave up early\)

\*\*Weight your analysis by reward\*\* \-\- skills from high\-reward trajectories \(fast solves\) are more reliable than skills from low\-reward ones \(barely solved\)\.

=== SUCCESSFUL TRAJECTORIES \(\{n\_successes\}\) ===

\{success\_block\}

=== FAILED TRAJECTORIES \(\{n\_failures\}\) ===

\{failure\_block\}

\-\-\-

Your task: By CONTRASTING the successes and failures, extract the KEY PHYSICS SKILLS that distinguish solving from failing\. Focus especially on high\-reward successes \-\- what insight let the agent solve it quickly? What did the failed agents miss?

For each skill, provide:

\- \*\*title\*\*: Short name \(3\-7 words\)

\- \*\*principle\*\*: The physics insight \(2\-3 sentences\)\. What mechanism was exploited? Why does it work?

\- \*\*when\_to\_apply\*\*: Specific trigger condition \(1 sentence\)

\- \*\*source\_seeds\*\*: List of seed numbers from the trajectories above that this skill was primarily derived from\. Include only the seeds whose behavior directly demonstrates or motivates this skill\.

Output a JSON array:

\`\`\`json

\[

\{\{

"title":"\.\.\.",

"principle":"\.\.\.",

"when\_to\_apply":"\.\.\.",

"source\_seeds":\[1,5,16\]

\}\}

\]

\`\`\`

Extract 4\-6 skills\. Each should capture a DISTINCT insight from the success/failure contrast\. Avoid redundancy\.

### I\.2 Pass 2 Mistake & Partial\-Skill Extraction

Run on the failure trajectories $\mathcal{T}_{\ell}^{(n),-}$ alone\. Produces \(i\) structured *mistake* records describing recurring failure modes, and \(ii\) *partial skills* extracted from individual correct steps inside otherwise\-failed trajectories\.

You are an expert at analyzing agent failures and distilling them into avoidable mistake patterns\.

\*\*Environment\*\*: 2D physics simulation \(Box2D\)\. Gravity = \-9\.8 m/s^2\. World bounds \[\-5, 5\]\.

\*\*Level: \{level\_name\}\*\*

\{level\_description\}

Below are FAILED trajectories\. Each shows the agent's reasoning, actions, and simulation results\.

\{failure\_block\}

\-\-\-

Your task has TWO parts:

\*\*Part 1 \-\- Mistakes\*\*: Identify the COMMON MISTAKE PATTERNS across these failures\. For each mistake, analyze:

1\. What exactly the agent did wrong

2\. WHY the agent made this error \(what broken causal belief led to it\)

3\. A concrete actionable fix

\*\*Part 2 \-\- Partial insights\*\*: Even in failed trajectories, some individual steps show CORRECT physics reasoning or useful discoveries \(e\.g\., the agent found a valid placement region but then abandoned it, or correctly identified a mechanism but applied it with wrong parameters\)\. Extract 1\-2 skills from these "good steps within bad trajectories"\. These should be genuine physics insights, not just restating what went wrong\.

Format as JSON object with two arrays:

\`\`\`json

\{\{

"mistakes":\[

\{\{

"description":"What the mistake is \(1 sentence\)",

"why\_it\_happens":"The broken belief or reasoning error that causes this \(1 sentence\)",

"how\_to\_avoid":"Concrete actionable fix \-\- what to do instead \(1\-2 sentences\)"

\}\}

\],

"partial\_skills":\[

\{\{

"title":"Short name \(3\-7 words\)",

"principle":"The physics insight from the failed trajectory \(2\-3 sentences\)",

"when\_to\_apply":"Specific trigger condition \(1 sentence\)",

"source\_seeds":\[5,11\]

\}\}

\]

\}\}

\`\`\`

Extract 3\-5 mistakes and 1\-2 partial skills\. For mistakes, group similar failures into one and focus on ROOT CAUSES\. For partial skills, only extract genuinely useful insights \-\- do not force it if no good steps exist\.
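Both distillation prompts ask the model to answer inside a fenced `json` block, so the caller has to recover the payload from free-form text. The following is a hedged sketch of one way to do that; `extract_json` is a hypothetical helper, and the fallback of parsing the whole response when no fence is found is an assumption, not documented behaviour of the HExA code.

```python
import json
import re

# Built programmatically so this listing does not contain a nested fence.
FENCE = "`" * 3


def extract_json(response):
    """Parse the first fenced JSON payload in a model response.

    Falls back to parsing the entire response if no fence is present.
    """
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, response, re.DOTALL)
    payload = match.group(1) if match else response
    return json.loads(payload)
```

For the Pass 2 prompt the result would be a dictionary with `mistakes` and `partial_skills` keys; for Pass 1 it would be a list of skill records.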

### I\.3 Skill\-Bank Evolution

HExA applies skill evolution in each round $n \geq 1$, taking the current bank $\mathcal{K}_{\ell}^{(n)}$ and the new trajectory batch $\mathcal{T}_{\ell}^{(n)}$ as inputs and producing a new bank $\mathcal{K}_{\ell}^{(n+1)}$ bounded by the skill limit\. The skill evolver determines which previously learned skills should be kept \(with their confidence values maintained\), which should be discarded, and which new skills should be acquired \(with confidence values calculated using Eq\. [7](https://arxiv.org/html/2606.29315#A1.E7)\)\.

You are a physics teacher evolving a skill bank for puzzle\-solving agents\.

LEVEL: \{level\_name\}

LEVEL DESCRIPTION: \{level\_description\}

CURRENT SKILL BANK \(from previous rounds\):

\{existing\_skills\_block\}

NEW TRAJECTORIES \(from the latest round\):

Successes: \{n\_successes\}

Failures: \{n\_failures\}

\{new\_trajectories\_block\}

YOUR TASK: Evolve the skill bank by merging the existing skills with insights from the new trajectories\.

RULES:

1\. Output the COMPLETE FINAL skill bank \(not a diff\) \-\- include both retained existing skills and any new ones\.

2\. Hard constraints:

\- Maximum \{max\_skills\} total skills for this level

\- Maximum \{max\_mistakes\} total mistakes for this level

3\. For each skill you include:

\- If it's a RETAINED skill from the existing bank: set "is\_new": false

\- If it's a NEW skill extracted from the new trajectories: set "is\_new": true

\- Include "source\_seeds" listing seed numbers where this skill was observed \(required for confidence calibration\)

\- Include "confidence": a float in \[0\.1, 1\.0\] representing your confidence in this skill

4\. For retained skills, preserve their existing confidence values \(they've been validated\)\.

5\. For new skills, estimate confidence based on:

\- Success rate among source trajectories \(high success = high confidence\)

\- Universality \(applies to multiple seed conditions = higher confidence\)

\- Clarity and actionability of the principle

6\. Do not include duplicate skills\. If a new trajectory confirms an existing skill, keep the existing one \(possibly with slightly higher confidence\)\.

7\. Remove skills that are:

\- Redundant or subsumed by other skills\.

\- Contradicted by the new trajectories

\- Too specific or rarely applicable

\- Low confidence \(<0\.3\) and not directly observed in new trajectories

8\. Do not remove mistakes unless the new trajectories show they're no longer common\.

OUTPUT JSON OBJECT:

\{\{

"skills":\[

\{\{

"title":"<short name of skill\>",

"principle":"<2\-3 sentence physics insight\>",

"when\_to\_apply":"<condition for applicability\>",

"example":"<optional concrete coordinate example\>",

"source\_seeds":\[<seed numbers\>\],

"confidence":<float in \[0\.1,1\.0\]\>,

"is\_new":<true\|false\>

\}\}

\.\.\.

\],

"mistakes":\[

\{\{

"description":"<what the mistake is\>",

"why\_it\_happens":"<why agents make this mistake\>",

"how\_to\_avoid":"<actionable fix\>",

"is\_new":<true\|false\>

\}\}

\.\.\.

\],

"removed\_skill\_titles":\["<title of removed skill 1\>",\.\.\.\],

"reasoning":"<brief explanation of key changes: what was removed, what was added, why\>"

\}\}

Be concise but precise\. Focus on physics insights that directly help puzzle\-solving\.
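Because the evolution prompt imposes hard constraints on its output (a maximum number of skills and mistakes, a confidence in [0.1, 1.0], a boolean `is_new`, and a `source_seeds` list), a caller may want to check the returned bank before accepting it. The sketch below illustrates such a check; `validate_bank` and its error messages are hypothetical and are not described in the paper.

```python
import json


def validate_bank(raw_json, max_skills, max_mistakes):
    """Return a list of problems found in an evolved bank; empty means it passes."""
    bank = json.loads(raw_json)
    skills = bank.get("skills", [])
    mistakes = bank.get("mistakes", [])
    problems = []

    # Rule 2 of the prompt: hard limits on bank size.
    if len(skills) > max_skills:
        problems.append(f"{len(skills)} skills exceeds the limit of {max_skills}")
    if len(mistakes) > max_mistakes:
        problems.append(f"{len(mistakes)} mistakes exceeds the limit of {max_mistakes}")

    # Rule 3 of the prompt: per-skill fields.
    for skill in skills:
        title = skill.get("title", "<untitled>")
        conf = skill.get("confidence")
        is_number = isinstance(conf, (int, float)) and not isinstance(conf, bool)
        if not is_number or not 0.1 <= conf <= 1.0:
            problems.append(f"{title}: confidence must be a float in [0.1, 1.0]")
        if not isinstance(skill.get("is_new"), bool):
            problems.append(f"{title}: is_new must be true or false")
        if not skill.get("source_seeds"):
            problems.append(f"{title}: source_seeds is missing or empty")
    return problems
```

A non-empty result could be used to re-prompt the evolver or to fall back to the previous bank; which of these HExA does, if either, is not specified in this appendix.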

### I\.4 Catapult\-Specific Variants \(Claude\)

Here we provide an example of Skill Evolution prompts specific to the catapult level\.

##### Shared catapult scene description\.

\*\*Scene \(factual \-\- no approach implied\):\*\*

\- Green Ball: small dynamic ball on the LEFT end of a gray bar\.

\- Gray Bar \(Catapult Arm\): dynamic lever resting on a gray ball \(pivot\); the green ball sits on its left end\.

\- Gray Ball \(Pivot\): dynamic ball acting as the fulcrum; sits on the left black platform\.

\- Black Ball \(Ceiling Blocker\): static ball near the top of the scene\.

\- Black Platform \(Left\): static horizontal platform on the left side\.

\- Black Ledge \(Right\): static \(possibly angled\) platform on the right side\.

\- Basket \(Gray\): dynamic basket sitting on the right ledge\.

\- Blue Ball \(Target\): dynamic ball inside the basket\.

\*\*Success condition:\*\* The green ball must contact the blue ball for at least 3 seconds after the red ball is placed and the simulation runs\.

\*\*Placement constraints:\*\*

\- The red ball must be completely inside the box: \-5 \+ r <= x <= 5 \- r, \-5 \+ r <= y <= 5 \- r\.

\- The red ball must NOT overlap with any existing object at t=0\.

\- 0\.1 <= radius <= 2\.0\.

##### Pass 1 contrastive \(catapult\)\.

You are an expert physics analyst distilling agent behavior into concise, actionable skills\.

\*\*Environment\*\*: 2D physics simulation \(Box2D\)\. Gravity = \-9\.8 m/s^2\. World bounds \[\-5, 5\] on both axes\. The agent places a red ball at \(x, y\) with a given radius, then the simulation runs to completion\.

\*\*Level: \{level\_name\}\*\*

\[catapult scene description injected here \-\- see above\]

Below are SUCCESSFUL and FAILED trajectories from this level\. Each shows the agent's reasoning \(Thought\), actions taken, and simulation observations\. Each trajectory has a \*\*Reward\*\* score:

\- Successes: \+1\.0 \(solved fast, 1\-3 iters\), \+0\.75 \(4\-7 iters\), \+0\.5 \(8\-15 iters\), \+0\.25 \(solved slowly, 16\-25 iters\)

\- Failures: \-0\.5 \(tried all 25 iters\), \-0\.75 \(gave up early, <10 iters\)

\*\*Weight your analysis by reward\*\* \-\- skills extracted from high\-reward trajectories \(fast solves\) are more reliable than those from low\-reward ones\.

=== SUCCESSFUL TRAJECTORIES \(\{n\_successes\}\) ===

\{success\_block\}

=== FAILED TRAJECTORIES \(\{n\_failures\}\) ===

\{failure\_block\}

\-\-\-

Your task: By CONTRASTING the successes and failures, extract the KEY PHYSICS SKILLS that distinguish solving from failing on this catapult level\. Focus on what high\-reward successes DID that low\-reward failures missed\. Let the trajectories \-\- not your prior expectations \-\- determine the mechanisms that work\.

For each skill, provide:

\- \*\*title\*\*: Short name \(3\-7 words\)

\- \*\*principle\*\*: The physics insight \(2\-3 sentences\)\. What mechanism was exploited? Why does it work?

\- \*\*when\_to\_apply\*\*: Specific trigger condition \(1 sentence\)

\- \*\*source\_seeds\*\*: Seed numbers from the trajectories above that this skill was primarily derived from\.

Output a JSON array:

\`\`\`json

\[

\{\{

"title":"\.\.\.",

"principle":"\.\.\.",

"when\_to\_apply":"\.\.\.",

"source\_seeds":\[1,5,16\]

\}\}

\]

\`\`\`

Extract 4\-6 skills\. Each should capture a DISTINCT insight from the success/failure contrast\. Avoid redundancy\.
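The reward buckets quoted in the Pass 1 prompt can be written as a small lookup\. This is a sketch under stated assumptions rather than the authors' code: `trajectory_reward` is a hypothetical name, and the prompt only defines failure rewards for runs that used all 25 iterations or stopped before 10, so the treatment of failures that stop after 10 to 24 iterations is a guess\.

```python
MAX_ITERS = 25  # iteration budget stated in the prompt


def trajectory_reward(solved: bool, iters: int) -> float:
    """Map a trajectory outcome to the reward shown to the skill extractor."""
    if not 1 <= iters <= MAX_ITERS:
        raise ValueError(f"iters must be in [1, {MAX_ITERS}], got {iters}")
    if solved:
        if iters <= 3:
            return 1.0    # solved fast
        if iters <= 7:
            return 0.75
        if iters <= 15:
            return 0.5
        return 0.25       # solved slowly (16-25 iters)
    if iters < 10:
        return -0.75      # gave up early
    # ASSUMPTION: the prompt only names the 25-iteration case here;
    # failures that stop after 10-24 iterations are scored the same way.
    return -0.5
```

Giving up early is penalised more heavily than exhausting the budget, which matches the prompt's ordering of the two failure rewards\.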

##### Pass 2 mistakes & partial skills \(catapult\)\.

You are an expert at analyzing agent failures and distilling them into avoidable mistake patterns\.

\*\*Environment\*\*: 2D physics simulation \(Box2D\)\. Gravity = \-9\.8 m/s^2\. World bounds \[\-5, 5\]\.

\*\*Level: \{level\_name\}\*\*

\[catapult scene description injected here \-\- see above\]

Below are FAILED trajectories\. Each shows the agent's reasoning, actions, and simulation results\.

\{failure\_block\}

\-\-\-

Your task has TWO parts:

\*\*Part 1 \-\- Mistakes\*\*: Identify the COMMON MISTAKE PATTERNS across these failures\. For each mistake, analyze:

1\. What exactly the agent did wrong

2\. WHY the agent made this error \(what broken causal belief led to it\)

3\. A concrete actionable fix

\*\*Part 2 \-\- Partial insights\*\*: Even in failed trajectories, some individual steps show CORRECT physics reasoning or useful discoveries \(e\.g\., the agent found a promising placement region but then abandoned it, or correctly identified a mechanism but applied it with wrong parameters\)\. Extract 1\-2 skills from these "good steps within bad trajectories"\. These should be genuine physics insights, not just restating what went wrong\.

Format as a JSON object with two arrays:

\`\`\`json

\{\{

"mistakes":\[

\{\{

"description":"What the mistake is \(1 sentence\)",

"why\_it\_happens":"The broken belief or reasoning error that causes this \(1 sentence\)",

"how\_to\_avoid":"Concrete actionable fix \-\- what to do instead \(1\-2 sentences\)"

\}\}

\],

"partial\_skills":\[

\{\{

"title":"Short name \(3\-7 words\)",

"principle":"The physics insight from the failed trajectory \(2\-3 sentences\)",

"when\_to\_apply":"Specific trigger condition \(1 sentence\)",

"source\_seeds":\[5,11\]

\}\}

\]

\}\}

\`\`\`

Extract 3\-5 mistakes and 1\-2 partial skills\. For mistakes, group similar failures into one and focus on ROOT CAUSES\. For partial skills, only extract genuinely useful insights \-\- do not force it if no good steps exist\.
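The doubled braces around the JSON examples suggest, although the paper does not state it, that these templates are filled with Python's `str.format`, where `{{` and `}}` render as literal braces\. The sketch below assumes that, using an abbreviated stand\-in for the Pass 2 template; `render` and `parse_reply` are hypothetical helper names\.

```python
import json
import re

FENCE = "`" * 3  # built at runtime so this example's own fence stays intact

# Abbreviated, hypothetical stand-in for the Pass 2 template.
TEMPLATE = "\n".join([
    "**Level: {level_name}**",
    "{failure_block}",
    "Format as a JSON object with two arrays:",
    FENCE + "json",
    '{{"mistakes": [], "partial_skills": []}}',
    FENCE,
])

FENCED_JSON = re.compile(FENCE + r"json\s*(.*?)" + FENCE, re.DOTALL)


def render(level_name, failure_block):
    """Fill the placeholders; doubled braces become literal JSON braces."""
    return TEMPLATE.format(level_name=level_name, failure_block=failure_block)


def parse_reply(reply):
    """Extract (mistakes, partial_skills) from a reply with a fenced JSON block."""
    match = FENCED_JSON.search(reply)
    payload = match.group(1) if match else reply  # fall back to raw JSON
    data = json.loads(payload)
    return data.get("mistakes", []), data.get("partial_skills", [])
```

Braces inside the substituted trajectory text are safe, because `str.format` does not re\-scan inserted values\.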

##### Skill\-bank evolution \(catapult\)\.

You are a physics teacher evolving a skill bank for a catapult\-puzzle\-solving agent\.

\*\*Environment\*\*: 2D physics simulation \(Box2D\)\. Gravity = \-9\.8 m/s^2\. World bounds \[\-5, 5\]\.

\*\*Level: \{level\_name\}\*\*

\[catapult scene description injected here \-\- see above\]

CURRENT SKILL BANK \(from previous rounds\):

\{existing\_skills\_block\}

NEW TRAJECTORIES \(from the latest round\):

Successes: \{n\_successes\}

Failures: \{n\_failures\}

\{new\_trajectories\_block\}

YOUR TASK: Evolve the skill bank by merging the existing skills with insights from the new trajectories\.

RULES:

1\. Output the COMPLETE FINAL skill bank \(not a diff\) \-\- include both retained existing skills and any new ones\.

2\. Hard constraints:

\- Maximum \{max\_skills\} total skills for this level\.

\- Maximum \{max\_mistakes\} total mistakes for this level\.

3\. For each skill you include:

\- If it's a RETAINED skill from the existing bank: set "is\_new": false\.

\- If it's a NEW skill extracted from the new trajectories: set "is\_new": true\.

\- Include "source\_seeds" listing seed numbers where this skill was observed \(required for confidence calibration\)\.

\- Include "confidence": a float in \[0\.1, 1\.0\] representing your confidence in this skill\.

4\. For retained skills, preserve their existing confidence values \(they've been validated\)\.

5\. For new skills, estimate confidence based on:

\- Success rate among source trajectories \(high success = high confidence\)\.

\- Universality \(applies across multiple seed conditions = higher confidence\)\.

\- Clarity and actionability of the principle\.

6\. Do not include duplicate skills\. If a new trajectory confirms an existing skill, keep the existing one \(optionally raising its confidence slightly\)\.

7\. Remove skills that are:

\- Redundant or subsumed by another skill\.

\- Contradicted by the new trajectories\.

\- Too specific or rarely applicable\.

\- Low confidence \(<0\.3\) AND not observed in the new trajectories\.

8\. Do not remove mistakes unless the new trajectories show they're no longer common\.

OUTPUT JSON OBJECT:

\{\{

"skills":\[

\{\{

"title":"<short name of skill\>",

"principle":"<2\-3 sentence physics insight\>",

"when\_to\_apply":"<condition for applicability\>",

"example":"<optional concrete coordinate example\>",

"source\_seeds":\[<seed numbers\>\],

"confidence":<float in \[0\.1, 1\.0\]\>,

"is\_new":<true\|false\>

\}\}

\],

"mistakes":\[

\{\{

"description":"<what the mistake is\>",

"why\_it\_happens":"<why agents make this mistake\>",

"how\_to\_avoid":"<actionable fix\>",

"is\_new":<true\|false\>

\}\}

\],

"removed\_skill\_titles":\["<title of removed skill 1\>",\.\.\.\],

"reasoning":"<brief explanation of key changes: what was removed, what was added, why\>"

\}\}

Be concise but precise\. Focus on physics insights grounded in the trajectories\.
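In the prompt above the model itself performs the merge, but two of its rules are mechanical enough to re\-check afterwards: the `max_skills` cap \(rule 2\) and the removal of skills with confidence below 0\.3 that were not observed in the new round \(rule 7\)\. The following is a minimal sketch of such a post\-check; `prune_skill_bank` is a hypothetical name, and dropping the lowest\-confidence skills when the cap is exceeded is an assumption, since the prompt leaves that choice to the model\.

```python
def prune_skill_bank(skills, observed_titles, max_skills, min_confidence=0.3):
    """Apply the cap and the stale-skill rule to a list of skill dicts.

    skills: dicts with at least "title" and "confidence".
    observed_titles: titles of skills seen again in the newest trajectories.
    Returns (kept_skills, removed_skill_titles).
    """
    kept, removed = [], []
    for skill in skills:
        stale = (skill["confidence"] < min_confidence
                 and skill["title"] not in observed_titles)
        (removed if stale else kept).append(skill)
    # ASSUMPTION: when the bank exceeds the cap, keep the most confident skills.
    kept.sort(key=lambda s: s["confidence"], reverse=True)
    removed.extend(kept[max_skills:])
    return kept[:max_skills], [s["title"] for s in removed]
```

The second return value mirrors the `removed_skill_titles` field of the output object\.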

### I\.5 Skill Transfer Prompts

We use this prompt with Claude Sonnet 4\.6 to transfer skills from easier levels to a harder one: the model is given the skill banks of three source levels \(down\_to\_earth, two\_body\_problem, pass\_the\_parcel\) and is prompted to synthesize reusable skills for catapult\. The prompt spells out eight hard constraints: mandatory source citations, no invented coordinates, target\-level entities only, no platitudes, a mandatory transfer rationale, confidence calibration, no redundancy, and the same constraints applied to mistakes\. Together they ensure that each synthesized skill can be traced back to its source skill IDs\.

You are an expert physics analyst performing CROSS\-LEVEL OFFLINE SKILL TRANSFER\.

You will be given expert skill banks from \{n\_sources\} SOURCE physics levels\.

Each source bank was distilled and evolved over many rounds of an agent solving

that level\. You will also be given a factual description of a TARGET level

that the agent has NEVER attempted yet \-\- no target\-level trajectories,

no target\-level skills, no target\-level success/failure data exist\.

Your task: synthesise a TARGET skill bank by extracting transferable physics

PRINCIPLES from the source banks, re\-grounding them in the target scene, and

predicting which principles the target level will reward\.

==========================================================================

\*\*Environment \(shared across all levels\)\*\*: 2D physics simulation \(Box2D\)\.

Gravity = \-9\.8 m/s^2\. World bounds \[\-5, 5\] on both axes\. The agent places ONE

red ball at \(x, y\) with a chosen radius \(0\.1 <= r <= 2\.0\) at t=0; the simulation

then runs to completion\. Mass scales as r^3; momentum and impact force scale

with both mass and the ball's velocity at first contact\.

==========================================================================

=== SOURCE LEVELS ===

\{source\_blocks\}

=== TARGET LEVEL: \{target\_level\} ===

\{target\_block\}

\{structural\_hint\_block\}

=== TASK ===

Produce a SKILL BANK for the target level\. The bank must contain:

\- 6 to 10 SKILLS \(physics principles you predict will help on the target\)\.

\- 2 to 4 MISTAKES \(anti\-patterns transferable from source failures\)\.

=== HARD CONSTRAINTS ===

1\. \*\*Every skill MUST cite source skills\.\*\* For each output skill, include

\`source\_skills\`: a non\-empty list of objects of the form

\`\{\{"source\_level":"<level\>","skill\_id":"<id\>"\}\}\`, listing the source\-bank

entries that motivated this skill\. If you cannot cite at least one source

skill, do NOT emit the skill \-\- synthesise something else instead\.

2\. \*\*No invented coordinates\.\*\* You have NEVER seen the target level solved\.

Do NOT predict specific \(x, y, r\) values that "work" for the target\. The

\`example\` field, if used, must describe a \*qualitative\* placement

\(e\.g\. "place near the end of the lever opposite the green ball, with

r large enough to dominate the lever's mass"\) \-\- never specific numbers\.

3\. \*\*Use target\-level entities\.\*\* Phrase each skill in terms of the entities

that exist in the TARGET scene description above\. Do NOT mention source\-

level\-only entities \(e\.g\. "ramp", "inverted basket"\) in the principle or

when\_to\_apply, even if those entities motivated the transfer\. The

\`transfer\_rationale\` field is where you may name source entities to

explain the analogy\.

4\. \*\*No platitudes\.\*\* Skills like "use physics intuition", "consider gravity",

"be careful" are forbidden\. Each skill must name a SPECIFIC mechanism

\(e\.g\. "torque about a pivot scales with moment arm x force"; "the line of

centres at first contact determines the post\-collision velocity direction"\)\.

5\. \*\*Each skill must include a \`transfer\_rationale\`\*\* \(1\-2 sentences\) explaining

what physics primitive bridges the source observation to the target

prediction\. This is the audit trail for your synthesis\.

6\. \*\*Confidence calibration\.\*\* Weight a skill's confidence by:

\- How many source banks corroborate the underlying primitive \(more = higher\)\.

\- How directly the primitive applies to the target scene \(direct = higher\)\.

\- How load\-bearing the source skill was \(high source confidence = higher\)\.

Confidence is a float in \[0\.1, 1\.0\]\. Assign 0\.7\+ only when the primitive

appears in \>= 2 source banks AND the target scene clearly invokes it\.

7\. \*\*Avoid redundancy\.\*\* Each skill must capture a DISTINCT mechanism\. Do not

emit two skills that are paraphrases of the same idea\.

8\. \*\*Mistakes\*\*: derive each from source\-bank mistakes or from the failure

modes those mistakes imply\. Apply the same hard constraints \(cite sources,

no invented coordinates, target\-side entities, specific mechanism\)\.

=== OUTPUT FORMAT \(single JSON object\) ===

\`\`\`json

\{\{

"skills":\[

\{\{

"title":"<3\-7 word name\>",

"principle":"<2\-3 sentence physics insight, in target\-level terms\>",

"when\_to\_apply":"<specific condition, in target\-level terms\>",

"example":"<qualitative placement description, NO numbers \-\- or omit\>",

"confidence":<float in \[0\.1, 1\.0\]\>,

"source\_skills":\[

\{\{"source\_level":"<level\>","skill\_id":"<id\>"\}\}

\],

"transfer\_rationale":"<1\-2 sentences naming the bridging physics primitive\>"

\}\}

\],

"mistakes":\[

\{\{

"description":"<1 sentence in target\-level terms\>",

"why\_it\_happens":"<1 sentence root cause\>",

"how\_to\_avoid":"<1\-2 sentence concrete fix\>",

"source\_skills":\[

\{\{"source\_level":"<level\>","skill\_id":"<id\>"\}\}

\],

"transfer\_rationale":"<1\-2 sentences\>"

\}\}

\],

"synthesis\_reasoning":"<2\-4 sentences: which source levels contributed most heavily, which primitives you treated as load\-bearing, which source skills you deliberately did NOT transfer and why\>"

\}\}

\`\`\`

Be precise and specific\. Do not pad\. Do not produce extra fields\. Output ONLY

the JSON object inside a single fenced \`\`\`json block\.
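Several of the hard constraints in this prompt can be verified on the returned object without a second model call\. The sketch below is one possible checker, not part of HExA as described: `validate_transfer_bank` is a hypothetical name, "no numbers" is approximated by rejecting any digit in `example`, and "appears in \>= 2 source banks" is approximated by counting the distinct `source_level` values a skill cites\.

```python
import re


def validate_transfer_bank(bank):
    """Return a list of human-readable violations (empty if none found)."""
    errors = []
    skills = bank.get("skills", [])
    mistakes = bank.get("mistakes", [])
    if not 6 <= len(skills) <= 10:
        errors.append(f"expected 6-10 skills, got {len(skills)}")
    if not 2 <= len(mistakes) <= 4:
        errors.append(f"expected 2-4 mistakes, got {len(mistakes)}")
    for i, skill in enumerate(skills):
        cited = skill.get("source_skills") or []
        levels = {c.get("source_level") for c in cited}
        confidence = skill.get("confidence", 0.0)
        if not cited:
            errors.append(f"skill {i}: no source citation")          # constraint 1
        if re.search(r"\d", skill.get("example") or ""):
            errors.append(f"skill {i}: example contains numbers")    # constraint 2
        if not skill.get("transfer_rationale"):
            errors.append(f"skill {i}: missing transfer_rationale")  # constraint 5
        if not 0.1 <= confidence <= 1.0:
            errors.append(f"skill {i}: confidence outside [0.1, 1.0]")
        elif confidence >= 0.7 and len(levels) < 2:
            # ASSUMPTION: distinct cited levels stand in for "source banks".
            errors.append(f"skill {i}: confidence >= 0.7 needs >= 2 source levels")
    for i, mistake in enumerate(mistakes):
        if not mistake.get("source_skills"):
            errors.append(f"mistake {i}: no source citation")        # constraint 8
    return errors
```

Constraints that need judgement, such as platitudes, redundancy, and whether the target scene really invokes a primitive, are left out of this check\.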

### I\.6 Cross\-Level Synthesis on Qwen\-7B

This prompt is used for the Falling Into Place and Two Body Problem cross\-level rows of Table [13](https://arxiv.org/html/2606.29315#A4.T13), where the teacher is Qwen 2\.5 7B\. The slim variant drops the audit fields \(source\_skills, transfer\_rationale\), reduces the eight hard constraints to five rules, and tightens the form constraints on principle / when\_to\_apply / example to prevent the platitude\-and\-hallucination failure mode that the long prompt produces on smaller models \(e\.g\. inventing a lever on a level with no pivot\)\. The output budget is reduced to 4–6 skills and 1–3 mistakes\.

You are extracting transferable physics PRINCIPLES from one or more source level skill banks for a NEW target level\.

ENVIRONMENT: 2D physics \(Box2D\), gravity = \-9\.8 m/s^2, bounds \[\-5, 5\]\. Agent places ONE red ball \(x, y, radius 0\.1\-2\.0\) at t=0; simulation runs to completion\.

=== SOURCE LEVELS ===

\{source\_blocks\}

=== TARGET LEVEL: \{target\_level\} ===

\{target\_block\}

=== TASK ===

Output a skill bank for the target level\. Produce 4\-6 skills and 1\-3 mistakes\.

RULES \(FOLLOW STRICTLY\):

1\. SKILLS ARE PRINCIPLES, NOT COMMANDS\. Each skill states a physics observation, mechanism, or relationship the agent can REASON FROM\. Skills MUST NOT tell the agent what to do\. Each field has a strict FORM:

\- \`principle\`: starts with a condition or relationship \("When X holds, Y occurs", "X scales with Y", "A and B differ in Z"\)\. NEVER an imperative verb\. FORBIDDEN openings: place/position/put/drop/move/set/use/adjust\.

\- \`when\_to\_apply\`: describes a REASONING context the agent enters \("When evaluating whether\.\.\.", "When comparing alternatives that\.\.\."\)\. NEVER a target outcome to achieve \("when you need to tip\.\.\.", "when knocking the bar off\.\.\."\)\.

\- \`example\`: contrasts two physical scenarios or illustrates the principle abstractly\. NEVER a placement, never a specific scene action\.

2\. Each skill names a SPECIFIC mechanism\. FORBIDDEN platitudes: "use physics", "place strategically", "consider gravity", "adjust as needed", "be careful"\.

3\. Use ONLY entities present in the TARGET scene above\. Do not mention source\-only entities anywhere\. If the target scene has no pivot, do NOT write skills about levers or torque\-around\-a\-pivot\.

4\. NO coordinates and NO placements anywhere in any field\.

5\. Mistakes describe MISCONCEPTIONS the agent might form \(wrong mental models\), NOT actions to avoid\. \`how\_to\_avoid\` is a CORRECTED UNDERSTANDING, not a placement instruction\.

OUTPUT FORMAT \(single JSON object, fenced \`\`\`json block, no extra fields\):

\`\`\`json

\{\{

"skills":\[

\{\{"title":"\.\.\.","principle":"\.\.\.","when\_to\_apply":"\.\.\.","example":"\.\.\.","confidence":0\.X\}\}

\],

"mistakes":\[

\{\{"description":"<misconception, NOT an action\>","why\_it\_happens":"\.\.\.","how\_to\_avoid":"<corrected understanding, NOT a placement command\>"\}\}

\],

"synthesis\_reasoning":"<1\-2 sentences: which source primitive transferred\>"

\}\}

\`\`\`

Be concise\. Output ONLY the JSON object\.
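The form rules of the slim prompt are strict enough that a small model's output can be screened with string checks before it is used\. The following is an illustrative sketch only: `slim_skill_violations` is a hypothetical name, the forbidden word lists are copied from rules 1 and 2, and rule 4 \("NO coordinates"\) is approximated crudely by rejecting any digit\.

```python
import re

FORBIDDEN_OPENINGS = ("place", "position", "put", "drop",
                      "move", "set", "use", "adjust")
FORBIDDEN_PLATITUDES = ("use physics", "place strategically",
                        "consider gravity", "adjust as needed", "be careful")


def slim_skill_violations(skill):
    """Return the names of the slim-prompt form rules that a skill breaks."""
    violations = []
    words = skill.get("principle", "").strip().lower().split()
    first_word = words[0].strip(".,:;") if words else ""
    if first_word in FORBIDDEN_OPENINGS:
        violations.append("imperative_opening")   # rule 1
    text = " ".join(
        str(skill.get(field, ""))
        for field in ("principle", "when_to_apply", "example")
    ).lower()
    # Word boundaries avoid false hits such as "because physics".
    if any(re.search(r"\b" + re.escape(p) + r"\b", text)
           for p in FORBIDDEN_PLATITUDES):
        violations.append("platitude")            # rule 2
    # ASSUMPTION: any digit is treated as a coordinate or placement value.
    if re.search(r"\d", text):
        violations.append("numeric_content")      # rule 4
    return violations
```

Rule 3 \(target\-only entities\) is not covered, because it needs the entity list of the target scene\.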