Why Solve It Twice? Hierarchical Accumulation of Skills for Transfer-Efficient ML Engineering

arXiv cs.AI Papers

Summary

HASTE introduces a hierarchical multi-agent system for ML engineering that organizes cross-competition knowledge into three scope tiers. It reaches a 77.3% medal rate on MLE-Bench Lite, needs 52% fewer refinement iterations in warm-start runs, and shows in a fixed-inventory ablation that scoped (tiered) loading outperforms flat loading.

arXiv:2606.30911v1 Announce Type: new Abstract: ML engineering agents waste compute rediscovering known techniques because every competition is a cold start. We present HASTE, a hierarchical multi-agent system that organizes cross-competition knowledge into three scope tiers (global, domain, and competition-specific), each coupled to a matching agent level. An orchestrator coordinates domain specialists and promotes learning between tiers via LLM-driven abstraction. A controlled ablation provides evidence for scoped loading: holding a 159-skill inventory constant across 8 competitions, tiered loading achieves a 100% medal rate while flat loading reaches only 62.5%, the same medal rate as loading no skills, and consumes 2x the output tokens. On the full MLE-Bench Lite benchmark (22 Kaggle competitions), HASTE reaches a medal rate of 77.3% using Claude Sonnet 4.6 at 12h per competition. In a cold-start run, the system begins with no accumulated skills. In warm-start runs, it reloads skills learned from earlier competitions, using only global and domain-level skills for transfer across competitions. Warm starts use 52% fewer refinement iterations, and the fraction of proposed changes kept by the agent rises from 42% at low inventory to 85% once 50+ skills are available. These results suggest that better knowledge organization can partly substitute for model strength and compute budget in ML-engineering agents.

Cached at: 07/01/26, 05:36 AM

# Why Solve It Twice? Hierarchical Accumulation of Skills for Transfer-Efficient ML Engineering
Source: [https://arxiv.org/html/2606.30911](https://arxiv.org/html/2606.30911)
###### Abstract

ML engineering agents waste compute rediscovering known techniques because every competition is a cold start\. We present HASTE, a hierarchical multi\-agent system that organizes cross\-competition knowledge into three scope tiers \(global, domain, and competition\-specific\), each coupled to a matching agent level\. An orchestrator coordinates domain specialists and promotes learning between tiers via LLM\-driven abstraction\. A controlled ablation provides evidence for scoped loading: holding a 159\-skill inventory constant across 8 competitions, tiered loading achieves a 100% medal rate while flat loading reaches only 62\.5%, the same medal rate as loading no skills, and consumes $2\times$ the output tokens\. On the full MLE\-Bench Lite benchmark \(22 Kaggle competitions\), HASTE reaches a medal rate of 77\.3% using Claude Sonnet 4\.6 at 12h per competition\. In a cold\-start run, the system begins with no accumulated skills\. In warm\-start runs, it reloads skills learned from earlier competitions, using only global and domain\-level skills for transfer across competitions\. Warm starts use 52% fewer refinement iterations, and the fraction of proposed changes kept by the agent rises from 42% at low inventory to 85% once 50\+ skills are available\. These results suggest that better knowledge organization can partly substitute for model strength and compute budget in ML\-engineering agents\.

ML engineering, agents, knowledge transfer, hierarchical memory, MLE\-Bench

## 1 Introduction

Why solve the same problem twice? Current ML engineering agents do exactly this\. MLE\-Bench evaluates agents on 75 Kaggle competitions independently \(Chan et al\., [2025](https://arxiv.org/html/2606.30911#bib.bib1)\), and agents treat them independently too, resetting all state between tasks\. Techniques that proved effective in one competition must be rediscovered from scratch in the next similar one\. This redundancy is the cost of treating every competition as a cold start: many top\-performing agents rely on frontier models, longer budgets, or both to compensate for repeated exploration\.

Recent work has explored cross\-task transfer \(Grosnit et al\., [2024](https://arxiv.org/html/2606.30911#bib.bib22); Zhang et al\., [2024](https://arxiv.org/html/2606.30911#bib.bib23); Zhao et al\., [2024](https://arxiv.org/html/2606.30911#bib.bib20); Wang et al\., [2024](https://arxiv.org/html/2606.30911#bib.bib19); Hu et al\., [2024](https://arxiv.org/html/2606.30911#bib.bib26)\), but it stores knowledge in flat pools or by memory type\. Agents with flat memory still lack the *organization* needed to load the right knowledge for the right task: everything goes into one context window, diluting the signal\. Transfer helps only when the agent can select the right prior for the current task\. The organization of accumulated knowledge therefore determines whether the agent spends its limited budget on useful exploration or on re\-deriving known facts\. We provide direct evidence that this distinction is consequential\. In our controlled ablation, flat loading performs no better than loading no skills, whereas scoped hierarchical loading medals on every task\.

HASTE \(Hierarchical Accumulation of Skills for Transfer\-Efficient ML Engineering\) organizes accumulated skills into three scope tiers: global skills that apply across ML tasks, domain skills for tabular, vision, NLP, or audio tasks, and competition skills that remain tied to one dataset\. An orchestrator coordinates domain specialists for tabular, vision, and NLP, each loading only the skills relevant to its scope\. Between competitions, the orchestrator promotes learning upward via LLM\-driven abstraction\. This scoped loading means each agent sees only what is relevant, keeping prompts focused and iterations productive\. The loaded skills act as a structured prior over the next code change\. That prior keeps the per\-task search narrow enough for a linear refinement loop to work: after two failed improvements the loop moves from broad exploration to optimization or fine\-tuning, and any change that hurts validation score is reverted before the next step\.

Across a multi\-phase MLE\-Bench Lite campaign, HASTE reaches the top public performance band using a non\-frontier model and a 12h budget\. Skill accumulation improves medal rate, reduces the number of refinement iterations needed, and raises the fraction of proposed changes that are worth keeping\. A fixed\-inventory ablation shows that accumulation needs scoped organization: tiered loading beats both flat and empty loading, while flat matches empty at much higher token cost\.

#### Contributions\.

The paper contributes a 3\-tier hierarchical skill store with LLM\-driven promotion between global, domain, and competition tiers, coupled to an orchestrator and matching domain specialists\. It evaluates scoped loading directly with a fixed\-inventory ablation, showing that hierarchy matters beyond skill volume: the same 159 skills help when loaded by scope but not when dumped into one flat prompt\. To our knowledge, HASTE is the first MLE\-Bench agent to organize reusable skills by cross\-competition scope and evaluate scoped loading against flat and empty loading under a fixed inventory\. The empirical study shows that this organization improves efficiency on MLE\-Bench Lite, allowing a non\-frontier model under a 12h budget to reach the same public performance band as stronger\-model systems: a cold run with no accumulated skills achieves 40\.9% medal rate, while reloading global and domain skills lifts the same system to 77\.3%, flipping 8 of 13 previously failed competitions to medal\. The study characterizes the resulting 159\-entry plain\-text skill inventory across 3 tiers and 4 domains\.

## 2 Related Work

#### MLE\-Bench agents and search strategies\.

A first family of MLE\-Bench agents advances the *search* axis\. AIDE \(Jiang et al\., [2025](https://arxiv.org/html/2606.30911#bib.bib2)\) runs greedy tree search; MLE\-STAR \(Nam et al\., [2025](https://arxiv.org/html/2606.30911#bib.bib3)\) performs ablation\-guided targeted refinement with web\-retrieved priors; AIRA\-Dojo \(Toledo et al\., [2025](https://arxiv.org/html/2606.30911#bib.bib4)\) formalizes MLE agents as search\-policy $\times$ operator\-set; R&D\-Agent \(Yang et al\., [2025](https://arxiv.org/html/2606.30911#bib.bib5)\) separates a Researcher from a Developer\. Population\-based evolutionary search appears in LoongFlow \(Wan et al\., [2025](https://arxiv.org/html/2606.30911#bib.bib7)\) and FM Agent \(Li et al\., [2025](https://arxiv.org/html/2606.30911#bib.bib9)\), whereas budget\-aware or graph\-augmented Monte Carlo search appears in MARS \(Chen et al\., [2026](https://arxiv.org/html/2606.30911#bib.bib6)\) and ML\-Master \(Liu et al\., [2025](https://arxiv.org/html/2606.30911#bib.bib8)\)\. HASTE contributes to the *knowledge* axis, which is orthogonal to search strategy and can be combined with any of these frameworks\.

#### Cross\-task knowledge transfer in LLM agents\.

Several systems accumulate experience across tasks but store it as a flat pool\. Voyager \(Wang et al\., [2024](https://arxiv.org/html/2606.30911#bib.bib19)\) maintains a code skill library indexed by embedding similarity in Minecraft; ExpeL \(Zhao et al\., [2024](https://arxiv.org/html/2606.30911#bib.bib20)\) extracts natural\-language insights via ADD/EDIT/UPVOTE/DOWNVOTE on a flat vector store; Reflexion \(Shinn et al\., [2023](https://arxiv.org/html/2606.30911#bib.bib21)\) pioneered verbal self\-reflection within a single task; ICAL \(Sarch et al\., [2024](https://arxiv.org/html/2606.30911#bib.bib27)\) extends reflection to vision\-language agents with a four\-component knowledge structure\. For data science specifically, Agent K \(Grosnit et al\., [2024](https://arxiv.org/html/2606.30911#bib.bib22)\) maintains persistent intrinsic state summarizing past episodes across competitions within an experiential\-learning formalism; MLCopilot \(Zhang et al\., [2024](https://arxiv.org/html/2606.30911#bib.bib23)\) retrieves related benchmarks via text embeddings and distilled knowledge; DS\-Agent \(Guo et al\., [2024](https://arxiv.org/html/2606.30911#bib.bib24)\) combines embedding\-ranked human\-insight cases with iterative case\-based reasoning; MLZero \(Fang et al\., [2025](https://arxiv.org/html/2606.30911#bib.bib25)\) separates semantic library knowledge from episodic execution traces; ADAS \(Hu et al\., [2024](https://arxiv.org/html/2606.30911#bib.bib26)\) evolves agents via Meta Agent Search over an archive of coded designs\. SkillRL \(Xia et al\., [2026](https://arxiv.org/html/2606.30911#bib.bib28)\) reports a 25% performance collapse when distilled skills are replaced with raw trajectories, motivating aggressive compression\.

Table 1: Cross\-task knowledge mechanisms in ML engineering agents\. ✓ = present, ✗ = absent, $\sim$ = partial\.

Table [1](https://arxiv.org/html/2606.30911#S2.T1) compares these systems by organization, cross\-task scope, hierarchy, promotion, and MLE\-Bench evaluation; HASTE is the only one that organizes reusable knowledge by *scope of applicability*\.

#### Hierarchical organization in agent systems\.

Hierarchy is well\-motivated outside ML engineering\. The options framework \(Sutton et al\., [1999](https://arxiv.org/html/2606.30911#bib.bib29)\) introduced temporally extended actions in RL, building on feudal RL \(Dayan and Hinton, [1993](https://arxiv.org/html/2606.30911#bib.bib31)\) where a manager sets sub\-goals for a worker; feudal networks \(Vezhnevets et al\., [2017](https://arxiv.org/html/2606.30911#bib.bib30)\) formalized this with different temporal resolutions\. For LLM agents, GITM \(Zhu et al\., [2023](https://arxiv.org/html/2606.30911#bib.bib34)\) uses an explicit three\-tier decomposition for Minecraft\. It is the closest published precedent for a 3\-tier organization, although its tiers are within\-task layers rather than cross\-task scope layers\. CoALA \(Sumers et al\., [2024](https://arxiv.org/html/2606.30911#bib.bib32)\) classifies agent memory by type, and Talebirad et al\. \([2026](https://arxiv.org/html/2606.30911#bib.bib33)\) formalize how bounded\-capacity agents benefit from multi\-level knowledge organization\. HASTE brings this principle to ML engineering with empirical validation\. Its 3\-tier skill store and orchestrator\-specialist hierarchy mirror a feudal architecture in which managers at different levels decide what knowledge a subordinate executor should see, while the stored knowledge itself is organized by *scope of applicability*\.

#### Automated pipeline search and meta\-learning\.

Earlier AutoML systems reduce the cost of building a pipeline for each new dataset by searching within predefined model and hyperparameter spaces\. Predefined\-space systems such as AutoGluon \(Erickson et al\., [2020](https://arxiv.org/html/2606.30911#bib.bib14)\), TPOT \(Olson and Moore, [2016](https://arxiv.org/html/2606.30911#bib.bib15)\), Auto\-WEKA \(Thornton et al\., [2013](https://arxiv.org/html/2606.30911#bib.bib16)\), and Auto\-Sklearn \(Feurer et al\., [2015](https://arxiv.org/html/2606.30911#bib.bib17), [2022](https://arxiv.org/html/2606.30911#bib.bib18)\) search hand\-engineered configuration grids\. Meta\-learning extensions \(e\.g\., Auto\-Sklearn warm\-starting\) accumulate cross\-dataset experience as numerical meta\-features and predict configuration vectors on new tasks\. These systems are effective within their predefined space but cannot incorporate qualitative insights \(“tokenization choice X breaks on dataset Y because of Z”\)\. AgentHPO \(Liu et al\., [2024](https://arxiv.org/html/2606.30911#bib.bib10)\) tunes hyperparameters; CAAFE \(Hollmann et al\., [2023](https://arxiv.org/html/2606.30911#bib.bib11)\) generates feature\-engineering code; EvoPrompting \(Chen et al\., [2023](https://arxiv.org/html/2606.30911#bib.bib13)\) searches architectures; AutoML\-GPT \(Zhang et al\., [2023](https://arxiv.org/html/2606.30911#bib.bib12)\) orchestrates training from model and data cards\. Across this group, accumulated priors remain flat or implicit; HASTE splits them into applicability tiers\. HASTE keeps the same goal of reducing per\-task search cost, but the search takes place in unbounded Python code rather than a fixed configuration grid\. It makes that larger space tractable by reusing plain\-text lessons extracted from prior runs and loading only the lessons whose scope matches the current task\.

## 3 Method

\[Figure 1 diagram\. Orchestrator \(classify, schedule, promote\) and one Specialist per domain \(Tabular, NLP, Vision\)\. Specialist pipeline: Task Profiler \(metadata, CV strategy\) → Prototype Screen \(3 diverse models\) → Adaptive Refinement \(explore → optimize → fine\-tune\) → Ensemble \(rank\-average top\-3\) → Produce Learnings\. 3\-tier skill hierarchy: Global \(5 entries, loaded by all specialists\); Domain \(Tabular 19, NLP 12, Vision 15, matching specialist only\); Competition \(21 directories, per task\)\. Arrows: assign competitions, load relevant skills, save learnings, promote between rounds\.\]

Figure 1: HASTE architecture\. The Orchestrator assigns competitions to domain Specialists\. Each Specialist loads relevant skills, executes the pipeline \(profile → prototype → refine → ensemble\), and produces learnings\. Between rounds, the Orchestrator promotes generalizable learnings upward through the hierarchy via LLM\-driven abstraction\.

Figure [1](https://arxiv.org/html/2606.30911#S3.F1) shows the two hierarchies that co\-evolve during a multi\-competition run\. The *skill hierarchy* has global, domain, and competition tiers\. The *agent hierarchy* has an orchestrator and three domain specialists for tabular, NLP, and vision tasks\. Scoped context loading connects them: each agent receives only the skill tiers matched to its scope\. The search space remains unbounded Python code; we restrict only the *distribution* over that space using accumulated structured priors\.

### 3\.1 Skill Hierarchy

The skill store is a plain\-text filesystem of markdown files with YAML frontmatter, organized into three tiers by *scope of applicability*\. Global has 5 entries and is loaded by every specialist\. Domain has 12 NLP, 19 tabular, and 15 vision entries, loaded only by the matching specialist\. Competition has 108 entries across 21 directories, loaded only on re\-runs of that competition\.
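The scoping rule above can be illustrated with a minimal sketch. The directory layout (`global/`, `domain/<name>/`, `competition/<name>/`) and the function name `load_scoped_skills` are assumptions made for illustration; the paper states only that the store is a filesystem of markdown files organized into three tiers.

```python
from pathlib import Path


def load_scoped_skills(root, domain, competition=None):
    """Concatenate the skill files whose tier matches the caller's scope.

    Assumed (hypothetical) layout: root/global, root/domain/<domain>,
    root/competition/<competition>, each holding markdown files.
    """
    root = Path(root)
    dirs = [root / "global", root / "domain" / domain]
    if competition is not None:
        # Competition-tier skills are only relevant on re-runs of that task.
        dirs.append(root / "competition" / competition)

    chunks = []
    for directory in dirs:
        if directory.is_dir():
            for path in sorted(directory.glob("*.md")):
                chunks.append(path.read_text(encoding="utf-8"))
    return "\n\n".join(chunks)
```

Under this layout, an NLP specialist on a first attempt would call `load_scoped_skills(root, "nlp")` and never see vision or tabular entries.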

Within each tier, HASTE distinguishes three kinds of entries, written into different prompt slots at different stages\. Technique entries record what worked or failed and feed proposal prompts, for example “target encoding helps high\-cardinality categoricals in tree models\.” Commitment priors record which design choices have high cross\-task variance and feed the prototype screen, telling the agent which choices require evidence before commitment\. Refinement hints record which knobs to tune per model family and feed the optimizing and fine\-tuning stages\. Other systems collapse these into one “lessons” bag; we separate them because they answer different questions and enter the loop at different points\.

Loading is deliberately simple: read the relevant directories and concatenate them\. No embedding index is needed at the current scale of roughly 159 entries, with 10 to 60 loaded per agent\. A character cap of 2000 in the prototype prompt and 4000 in refinement limits dilution if the inventory grows\. The decision to avoid embedding\-only retrieval follows recent results on the theoretical limits of single\-vector embeddings for multi\-field conditional retrieval \(Weller et al\., [2026](https://arxiv.org/html/2606.30911#bib.bib35)\)\. Most skills come from an LLM reflection step at the end of each competition \(Figure [5](https://arxiv.org/html/2606.30911#A2.F5)\), in the spirit of Reflexion \(Shinn et al\., [2023](https://arxiv.org/html/2606.30911#bib.bib21)\) and ExpeL \(Zhao et al\., [2024](https://arxiv.org/html/2606.30911#bib.bib20)\), with paired success and failure analysis so that failure modes are recorded explicitly\. Two skill types are additionally extracted *algorithmically* by a Prior Extractor\. Commitment priors come from score variance across prototype model options; high variance implies a decision\-relevant choice\. Refinement hints come from per\-knob deltas across refinement history, identifying which changes had the highest acceptance rate\. These structured priors are mined from logs; to our knowledge, prior text\-memory agents rely primarily on LLM summarization for this kind of memory\.
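As a rough illustration of the two algorithmic extractions, the sketch below computes a variance-based commitment prior and acceptance-rate refinement hints. The function names, the log format, and the variance threshold are hypothetical; the paper does not specify how the Prior Extractor represents its inputs or where it sets its cut-off.

```python
from collections import defaultdict
from statistics import pvariance


def commitment_prior(prototype_scores, threshold):
    """Flag a design choice as decision-relevant when prototype scores vary widely.

    prototype_scores maps a model option to its validation score.
    threshold is an illustrative cut-off, not a value from the paper.
    """
    variance = pvariance(list(prototype_scores.values()))
    return {"variance": variance, "needs_evidence": variance > threshold}


def refinement_hints(history):
    """Rank knobs by acceptance rate.

    history is a list of (knob, accepted) pairs taken from refinement logs.
    """
    counts = defaultdict(lambda: [0, 0])  # knob -> [accepted, attempted]
    for knob, accepted in history:
        counts[knob][0] += int(accepted)
        counts[knob][1] += 1
    rates = {knob: acc / tried for knob, (acc, tried) in counts.items()}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)
```

For example, `refinement_hints([("lr", True), ("lr", False), ("depth", True)])` ranks `depth` (accepted once in one attempt) ahead of `lr` (accepted once in two attempts).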

### 3\.2 Agent Hierarchy and Execution Loop

#### Orchestrator\.

The orchestrator delegates experiment execution\. Its responsibilities are: \(i\) *domain classification*, assigning each competition to a domain via manual tags for 100\+ MLE\-Bench competitions with a heuristic fallback; \(ii\) *round scheduling*, picking seeds first \(one per domain for cold start\) and then assigning remaining competitions to specialists; and \(iii\) *skill promotion*, evaluating new learnings via an LLM after each round \(Figure [6](https://arxiv.org/html/2606.30911#A2.F6)\)\. Promotion decides for each learning: skip \(already covered\), competition \(too specific\), domain \(abstract and promote up one tier\), global \(universally useful\), or conflict \(contradicts an existing skill\)\. When two learnings conflict, both are kept and annotated with conditions\. For example, one skill records that ensembling helps when correlation is below 0\.95, but hurts when a weaker member pulls down a stronger one\. The store grows through *abstraction* and stays interpretable plain text on the filesystem\. The top\-level multi\-competition loop is given in Algorithm [1](https://arxiv.org/html/2606.30911#alg1) \(Appendix [A\.1](https://arxiv.org/html/2606.30911#A1.SS1)\)\.
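The five promotion outcomes can be pictured as a small routing step applied after the LLM has judged a learning. This is a sketch under stated assumptions: the in-memory dictionary store, the function name `apply_decision`, and the choice to file conflicting learnings in the domain tier are all illustrative, and the LLM judgment itself is taken as an input rather than modeled.

```python
from enum import Enum


class Decision(Enum):
    SKIP = "skip"                # already covered by an existing skill
    COMPETITION = "competition"  # too specific to transfer
    DOMAIN = "domain"            # abstracted and promoted one tier
    GLOBAL = "global"            # universally useful
    CONFLICT = "conflict"        # contradicts an existing skill


def apply_decision(store, learning, decision, domain, competition):
    """Route one learning into a dict-based store keyed by tier."""
    if decision is Decision.SKIP:
        return store
    targets = {
        Decision.COMPETITION: ("competition", competition),
        Decision.DOMAIN: ("domain", domain),
        Decision.GLOBAL: ("global",),
        # Assumption: conflicting learnings live beside the skill they contradict.
        Decision.CONFLICT: ("domain", domain),
    }
    text = learning
    if decision is Decision.CONFLICT:
        # Both skills are kept; the new one is marked so conditions can be added.
        text = f"{learning} (conflicts with an existing skill; annotate conditions)"
    store.setdefault(targets[decision], []).append(text)
    return store
```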

#### Specialist\.

Given a competition $t$ and the scope\-loaded skills supplied by the orchestrator, the specialist runs a five\-stage per\-competition pipeline\. Appendix [A\.2](https://arxiv.org/html/2606.30911#A1.SS2) gives the full pseudocode in Algorithm [2](https://arxiv.org/html/2606.30911#alg2)\. \(1\) Task profiling: parse metadata, profile the dataset, select a CV strategy, and probe GPU/CPU/RAM; no model execution\. \(2\) Prototype screen \(Figure [3](https://arxiv.org/html/2606.30911#A2.F3)\): the LLM proposes three fundamentally diverse approaches and executes each on a single validation fold; up to a $2.7\times$ score spread between best and worst justifies this hedge against committing to a suboptimal foundation\. \(3\) Adaptive refinement on the prototype winner with budget $N=20$ and on the runner\-up with $N=6$ via AdaptiveRefine \(Appendix [A\.3](https://arxiv.org/html/2606.30911#A1.SS3)\), a linear loop across three tiers \(Exploring, Optimizing, Fine\-tuning\) with auto\-escalation on two consecutive non\-improvements, stagnation exit at Fine\-tuning, and revert\-on\-regression on every step \(Figure [4](https://arxiv.org/html/2606.30911#A2.F4)\); the runner\-up branch is a second hedge, and in 3 of our 25 main\-benchmark runs, the runner\-up’s refined score beat the winner’s\. \(4\) Ensemble: rank\-average the top\-3 checkpoints across both branches; accept the ensemble only if it beats the best single member\. \(5\) Produce learnings \(Figure [5](https://arxiv.org/html/2606.30911#A2.F5)\): the LLM reflects on the full experiment history and emits 2 to 5 plain\-text learnings, each with a proposed tier, which flow back to Algorithm [1](https://arxiv.org/html/2606.30911#alg1) for promotion\. A 6\-mode failure taxonomy guides refinement diagnosis as prompt content: Underfitting, Overfitting, Feature\_Gap, Noise\_Ceiling, Distribution\_Mismatch, and Diminishing\_Returns\. Tiered history compression keeps the prompt under 20 lines even at 50\+ iterations\.
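Stage (4) combines rank averaging with an acceptance guard, which the following sketch illustrates in plain Python. The function names are hypothetical, ties are broken by sample index for simplicity, and the paper does not say how HASTE handles ties or metric direction.

```python
def rank_average(predictions):
    """Average each sample's rank across models.

    predictions is a list of equal-length score lists, one per checkpoint.
    Ties are broken by sample index here, a simplification.
    """
    n_models = len(predictions)
    n_samples = len(predictions[0])
    averaged = [0.0] * n_samples
    for preds in predictions:
        order = sorted(range(n_samples), key=lambda i: preds[i])
        for rank, idx in enumerate(order):
            averaged[idx] += rank / n_models
    return averaged


def choose_final(ensemble_score, member_scores, higher_is_better=True):
    """Accept the ensemble only if it beats the best single member."""
    if higher_is_better:
        beats = ensemble_score > max(member_scores)
    else:
        beats = ensemble_score < min(member_scores)
    return "ensemble" if beats else "best_single"


print(rank_average([[0.1, 0.9, 0.5], [0.2, 0.8, 0.3]]))  # [0.0, 2.0, 1.0]
```

Rank averaging discards score scale, so checkpoints with differently calibrated outputs contribute equally; the guard in `choose_final` covers the case, noted in the promotion example, where a weaker member pulls down a stronger one.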

This linear refinement loop is intentionally simpler than the trees, evolutionary populations, or MCTS used by other systems \(Jiang et al\., [2025](https://arxiv.org/html/2606.30911#bib.bib2); Li et al\., [2025](https://arxiv.org/html/2606.30911#bib.bib9); Chen et al\., [2026](https://arxiv.org/html/2606.30911#bib.bib6); Liu et al\., [2025](https://arxiv.org/html/2606.30911#bib.bib8); Wan et al\., [2025](https://arxiv.org/html/2606.30911#bib.bib7)\)\. We found linear refinement sufficient in this setting; the priors loaded by the orchestrator plausibly collapse the branching factor enough for linear search to match tree or evolutionary alternatives, but a controlled comparison at fixed knowledge condition is future work\. The ablation in §[4\.5](https://arxiv.org/html/2606.30911#S4.SS5) shows the priors carry substantive weight at fixed search strategy: performance drops sharply when no skills are loaded\.
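A minimal sketch of such a linear loop, with escalation after two consecutive non-improvements and revert-on-regression, might look as follows. The `propose` callback, the return tuple, and the reuse of the same two-miss rule as the stagnation exit are assumptions; the paper's AdaptiveRefine pseudocode is in its Appendix A.3 and may differ.

```python
STAGES = ["exploring", "optimizing", "fine_tuning"]


def adaptive_refine(initial_score, propose, budget=20, patience=2):
    """Linear refinement with auto-escalation and revert-on-regression.

    propose(stage, best_score) is a hypothetical callback that applies one
    candidate change and returns its validation score (higher is better).
    """
    best = initial_score
    stage_idx = 0
    misses = 0  # consecutive non-improving steps in the current stage
    kept = 0    # number of changes kept
    for _ in range(budget):
        score = propose(STAGES[stage_idx], best)
        if score > best:
            best, misses, kept = score, 0, kept + 1
            continue
        # Non-improving change: it is reverted, so `best` is unchanged.
        misses += 1
        if misses >= patience:
            if stage_idx == len(STAGES) - 1:
                break  # stagnation exit at the final stage
            stage_idx, misses = stage_idx + 1, 0
    return best, STAGES[stage_idx], kept
```

Because every rejected change is reverted, the loop keeps exactly one incumbent solution, which is what distinguishes it from the tree and population searches cited above.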

#### Executor\.

The executor is a pluggable backend; the agent has Read, Write, Edit, Bash, Glob, and Grep tools, a 3\-attempt retry per step with diagnostic feedback between attempts, and validates submissions against sample\_submission\.csv \(column names, row count, no NaN, dtype\) before scoring\. For cost efficiency, all experiments in this paper use the CLI backend \(claude \-p subprocess\) billed via a fixed Claude Code subscription\.
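The four submission checks can be sketched with the standard-library `csv` module. The function name, the message wording, and the rule for inferring a numeric column from the sample file are assumptions for illustration; the paper names the checks but not their implementation.

```python
import csv
import math


def _is_number(text):
    """True if text parses as a float that is not NaN."""
    try:
        return not math.isnan(float(text))
    except ValueError:
        return False


def validate_submission(submission_path, sample_path):
    """Return a list of problems; an empty list means the file passed."""
    with open(sample_path, newline="") as f:
        sample = list(csv.reader(f))
    with open(submission_path, newline="") as f:
        sub = list(csv.reader(f))
    header = sample[0]
    if not sub or sub[0] != header:
        return ["column names differ from sample"]

    problems = []
    if len(sub) != len(sample):
        problems.append("row count differs from sample")
    # Assumption: a column is numeric if every sample value parses as a float.
    numeric = {j for j in range(len(header))
               if all(_is_number(row[j]) for row in sample[1:])}
    for line_no, row in enumerate(sub[1:], start=2):
        if len(row) != len(header):
            problems.append(f"line {line_no}: wrong number of fields")
            continue
        for j, value in enumerate(row):
            text = value.strip()
            if text == "" or text.lower() == "nan":
                problems.append(f"line {line_no}: missing value in {header[j]}")
            elif j in numeric and not _is_number(text):
                problems.append(f"line {line_no}: non-numeric value in {header[j]}")
    return problems
```

Running such a check before scoring turns a malformed file into diagnostic feedback for the next retry attempt rather than a failed grade.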

## 4 Experiments

### 4\.1 Setup

We evaluate on MLE\-Bench Lite \(Chan et al\., [2025](https://arxiv.org/html/2606.30911#bib.bib1)\), the 22\-competition subset used for cross\-agent comparison\. Each competition runs Claude Sonnet 4\.6 via the CLI backend on a SLURM node with 24 CPUs, 128 GB RAM, and 1 NVIDIA L40S 48 GB GPU\. The wall\-clock budget is 12h, half the dominant 24h budget on the leaderboard, with 20 iterations on the main benchmark and 11 on the ablation\. The search space is unbounded Python code; evaluation uses the official Kaggle metric per competition, scored against held\-out test sets via the MLE\-Bench grader\. No test labels are used in the iteration loop\. Appendix [B](https://arxiv.org/html/2606.30911#A2) includes representative prompts, and Appendix [E](https://arxiv.org/html/2606.30911#A5) shows representative skills\.

### 4\.2 Main Results

Table 2: MLE\-Bench Lite reference points\. Reference numbers from the public MLE\-Bench leaderboard \([https://github\.com/openai/mle\-bench](https://github.com/openai/mle-bench), accessed June 2026\)\.

Across the multi\-phase accumulation campaign, HASTE’s medal rate on MLE\-Bench Lite is 77\.3% \(17 of 22\), with 10 gold, 2 silver, and 5 bronze\. The above\-median rate is 86\.4%\. The per\-competition table is in Appendix [C](https://arxiv.org/html/2606.30911#A3)\.

Table [2](https://arxiv.org/html/2606.30911#S4.T2) places HASTE among public MLE\-Bench Lite results\. At 77\.3%, HASTE reaches the same performance band as leading public MLE\-Bench Lite agents while using a non\-frontier model and a 12h budget\. Two features distinguish it from the rest of the top\-tier agents\. HASTE is the only non\-frontier\-model agent at or above 77%: every other agent in this band uses Gemini\-3\-Pro\-Preview, Claude Opus 4\.6, or a model ensemble\. HASTE is also one of only two top\-band agents running at 12h; all others use the standard 24h budget\. We note that public leaderboard numbers carry per\-task statistical noise \(paper\-reported SD $\approx 4.4$\) and that a multi\-seed confidence interval is still pending; differences of less than 5 points on Lite fall within that noise, and the headline 77\.3% is a campaign result rather than a multi\-seed estimate \(§[5\.3](https://arxiv.org/html/2606.30911#S5.SS3)\)\.

A cold\-start single pass of HASTE achieves only 40\.9%, which would place it below every agent in Table [2](https://arxiv.org/html/2606.30911#S4.T2); tiered skill accumulation adds 36\.4 percentage points to the same system, lifting it into the leading public performance band\.

### 4\.3 Evidence for Skill Transfer

Skill accumulation lifts medal rate from 40\.9% \(9 medals\) in a single\-pass cold run to 77\.3% \(17 medals\) when later phases reload accumulated skills\. A cold run starts without prior competition experience\. A warm run starts with skills learned earlier in the campaign\. For the warm transfer evaluation, HASTE reloads global and domain skills only\. Competition\-specific skills are not loaded, so the result measures transfer from reusable knowledge rather than leakage from a previous attempt on the same dataset\.

The improvement shows up in three ways\. First, warm\-start runs need fewer refinement iterations to reach their best score: 7\.8 on average compared with 16\.3 for cold\-start runs, a 52% reduction\. Second, the agent keeps more of its own proposed edits as the skill inventory grows\. We measure hit rate as the fraction of attempted changes that improve validation score and are kept rather than reverted\. When the inventory has 0 to 15 skills, the hit rate is 42%; once 50\+ skills are available, it rises to 85%\. Third, of the 13 competitions that failed the cold run, 8 flipped to medal in the warm run, while the 9 competitions that already medaled cold were skipped in the rerun phase \(Appendix [D](https://arxiv.org/html/2606.30911#A4)\)\. The store grew from 5 entries at cold start to 72 by phase 12, and each promotion step used LLM abstraction to keep the entries readable as plain text\.

### 4\.4 Phase 0, Refinement, and What the Agent Learned

Refinement beats the prototype winner in 92% of runs \(23 of 25\), with an average gain of \+0\.045 in the competition’s native metric; in 3 competitions, the runner\-up’s refined score beat the winner’s, and the two cases where refinement underperformed both occurred in early phases with fewer skills\. The final 159\-entry skill store breaks down as 5 global, 15 vision, 12 NLP, 19 tabular, and 108 competition entries, all traceable to specific runs and including both successes and failures\. For example, the global skill “ensembling a strong model with a weaker model can degrade performance” was learned from dogs\-vs\-cats and later prevented the same mistake in two vision competitions; “DeBERTa\-v3\-base is the strongest text\-classification start under a 12h budget” was learned from detecting\-insults and applied in jigsaw\-toxic and spooky\-author\. Appendix [E](https://arxiv.org/html/2606.30911#A5) lists representative skills with source competitions and estimated iteration savings\.

### 4\.5 Ablation: Tiered vs\. Flat vs\. Empty Skill Loading

We test the loading function directly while holding skill *inventory* constant, varying only skill *organization*\. We run 8 competitions spanning NLP, vision, tabular, and audio domains under three conditions, with the same model \(Claude Sonnet 4\.6\), pipeline, 11\-iteration budget, and 159\-skill inventory frozen at one point in the benchmark campaign\. The conditions differ only in *which* skills enter context: Tiered loads global, domain\-matched, and competition\-specific skills only \(10 to 60 entries per competition\); Flat loads all 159 skills \($\sim$145K characters\), the “dump everything into the prompt” approach; Empty loads no skills\.

![Refer to caption](https://arxiv.org/html/2606.30911v1/x1.png)

Figure 2: Controlled ablation in Section [4\.5](https://arxiv.org/html/2606.30911#S4.SS5)\. All conditions use the same 159\-skill inventory, model, pipeline, and budget\. Tiered loading medals on all 8 competitions; flat and empty both medal on 5 of 8\.

The difference between loading functions is large despite the fixed inventory\. Tiered loading achieves 100% medal rate \(6 gold, 1 silver, 1 bronze\), flat 62\.5% \(4 gold, 1 silver, 3 no\-medal\), and empty 62\.5% \(5 gold, 3 no\-medal\); mean test scores follow the same ordering \($0.949 > 0.910 > 0.893$\)\. Figure [2](https://arxiv.org/html/2606.30911#S4.F2) shows the per\-competition breakdown; the full table is in Appendix [G](https://arxiv.org/html/2606.30911#A7)\.

Because the ablation is small, we treat the statistical tests as supporting evidence rather than a definitive test\. With $N=8$ and a single seed, formal tests are underpowered, but the qualitative pattern holds across measures\. A paired bootstrap on per\-competition score differences \(tiered minus flat\) gives a 95% CI of $[+0.001, +0.093]$ around a mean of $+0.040$, which just excludes zero\. The corresponding one\-sided Wilcoxon signed\-rank tests yield $p=0.11$ for tiered $>$ flat and $p=0.08$ for tiered $>$ empty \(both paired across the same 8 competitions\), and Fisher’s exact test on medal counts gives $p=0.10$ for tiered \(8/8\) vs\. flat or empty \(5/8\)\. The tests fall short of standard significance at $N=8$; we report them transparently and emphasize the consistent direction of effect across all three measures\. Multi\-seed replication is the planned next step \(§[5\.3](https://arxiv.org/html/2606.30911#S5.SS3)\)\.
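Two of these tests are easy to reproduce in outline. The sketch below computes a one-sided Fisher exact p-value from the reported medal counts and defines a generic paired bootstrap; the function names, the resample count, and the percentile method are assumptions, and no per-competition scores are included because the main text does not list them.

```python
import random
from math import comb


def fisher_one_sided(a_success, a_total, b_success, b_total):
    """P(group A gets at least a_success successes) under the hypergeometric null."""
    successes = a_success + b_success
    upper = min(a_total, successes)
    tail = sum(comb(a_total, k) * comb(b_total, successes - k)
               for k in range(a_success, upper + 1))
    return tail / comb(a_total + b_total, successes)


def paired_bootstrap_ci(diffs, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile CI for the mean of paired differences (illustrative settings)."""
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Medal counts from the ablation: tiered 8 of 8, flat (or empty) 5 of 8.
print(fisher_one_sided(8, 8, 5, 8))  # 0.1
```

The Fisher value matches the reported $p=0.10$, which suggests the paper's figure is the one-sided tail; the bootstrap interval would need the per-competition differences from the paper's Appendix G.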

The flat condition is especially informative because it tests whether more knowledge is enough\. Flat skill loading leaves the medal rate unchanged relative to starting from scratch\. Flat consumes 3\.78M output tokens to empty’s 1\.86M \(tiered: 2\.27M\), doubling empty’s token cost for the same medal rate and a modestly higher mean score\. The *tokens per medal* metric \(total output tokens divided by medals won\) makes the efficiency gap concrete: tiered spends 284K tokens per medal, flat 756K, and empty 371K \(Figure [2](https://arxiv.org/html/2606.30911#S4.F2)b\)\. Tiered is $2.7\times$ more token\-efficient than flat per medal won\. Flat runs the most experiments \(75 vs\. tiered’s 60 and empty’s 65\) with a slightly higher execution success rate \(87% vs\. 83%\), so the extra attempts appear poorly directed: more compute without medal\-rate gains\. Full resource breakdown is in Appendix [F](https://arxiv.org/html/2606.30911#A6)\.
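The tokens-per-medal arithmetic can be checked directly from the rounded totals quoted above. Variable names are illustrative; because the inputs are rounded to three significant figures, the empty condition comes out at 372K here rather than the reported 371K, presumably a rounding difference in the underlying totals.

```python
# Output tokens and medals per condition, as reported in the text.
runs = {
    "tiered": (2_270_000, 8),
    "flat": (3_780_000, 5),
    "empty": (1_860_000, 5),
}

tokens_per_medal = {name: tokens / medals for name, (tokens, medals) in runs.items()}
ratio = tokens_per_medal["flat"] / tokens_per_medal["tiered"]

for name, value in tokens_per_medal.items():
    print(f"{name}: {value / 1000:.0f}K tokens per medal")
print(f"flat / tiered = {ratio:.1f}x")
```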

The aggregate result hides where the difference is largest\. The gap is concentrated in harder or niche tasks, where competition\-specific skills provide a known\-good starting point and domain skills guide model selection\. On mlsp\-2013\-birds \(audio\), tiered scores 0\.964 \(gold\) vs\. flat 0\.860 and empty 0\.832 \(neither medals\); on random\-acts\-of\-pizza, tiered scores 0\.798 \(silver\) vs\. flat 0\.599 and empty 0\.481\. The 4 easier competitions \(detecting\-insults, histopathologic, tabular\-playground, and plant\-pathology\) show smaller differences, and all three conditions medal on them\.

From the logs, three mechanisms appear to limit flat loading\. *Signal dilution*: relevant skills are buried among domain and competition\-specific skills for unrelated competitions\. *Context budget displacement*: at 145K characters, the flat skill dump crowds out the agent’s own reasoning and code analysis\. *Overconfident model selection*: in the flat\-jigsaw rerun, the agent repeatedly attempted DeBERTa\-v3\-large \(an aggressive NLP recommendation that triggered OOM\) while the empty agent used simpler models and scored higher \(0\.985 vs\. 0\.981\)\.

#### Scope of the ablation\.

The flat condition couples three axes: organization, skill volume, and prompt\-length budget \($\sim$145K vs\. $\sim$25K characters\)\. Because scoping is the mechanism by which the right 30 skills enter a 25K\-character budget, a flat\-but\-character\-capped condition that loads $N$ random skills would test random subset selection, a different question\. The supported claim is that, on this inventory, scoped loading outperforms both the flat full\-load and empty\-context baselines through the three mechanisms above; any broader claim about hierarchy reduces here to its role in enabling scoped loading at a controlled budget\. The 159\-skill inventory was itself accumulated under hierarchical promotion, so this ablation tests inventory transfer under different loading functions; a fully flat pipeline end\-to\-end remains a separate condition\.

## 5 Discussion

### 5\.1 Why Hierarchy Makes Agents Faster

The efficiency gain depends on *scoped loading* in addition to accumulation, and the ablation supports this as the main mechanism\. A flat skill store sends every specialist unrelated advice, including domain tricks for other modalities and competition\-specific quirks from unrelated tasks\. Tier scoping keeps the context small and matched to the current agent; the resource table shows the cost of failing to scope, with flat loading spending substantially more tokens without improving on empty’s medal rate\.
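The contrast between scoped and flat loading can be sketched with a small three\-tier store\. This is a hypothetical data structure written for illustration \(the class, its fields, and the placeholder skill strings are assumptions, since the paper does not publish its storage format\); `load_tiered` mirrors the union of global, domain, and competition skills used on line 6 of Algorithm 1\.

```python
from dataclasses import dataclass, field


@dataclass
class SkillStore:
    """Hypothetical three-tier skill store (illustrative only)."""

    global_skills: list = field(default_factory=list)
    domain_skills: dict = field(default_factory=dict)       # domain -> skills
    competition_skills: dict = field(default_factory=dict)  # competition -> skills

    def load_tiered(self, competition, domain):
        """Scoped loading: global + matching domain + matching competition."""
        return (
            list(self.global_skills)
            + list(self.domain_skills.get(domain, []))
            + list(self.competition_skills.get(competition, []))
        )

    def load_flat(self):
        """Flat loading: every skill in the store, regardless of scope."""
        skills = list(self.global_skills)
        for group in self.domain_skills.values():
            skills.extend(group)
        for group in self.competition_skills.values():
            skills.extend(group)
        return skills


# Placeholder skill identifiers, not entries from the real inventory.
store = SkillStore(
    global_skills=["g1"],
    domain_skills={"nlp": ["n1", "n2"], "vision": ["v1"]},
    competition_skills={"jigsaw-toxic": ["j1"], "aptos2019": ["a1"]},
)
scoped = store.load_tiered("jigsaw-toxic", "nlp")
everything = store.load_flat()
print(scoped)      # only skills in scope for an NLP specialist on jigsaw-toxic
print(everything)  # includes vision and aptos2019 skills as well
```

The scoped context here contains four of the six stored skills; the flat context also carries the vision and aptos2019 entries that an NLP specialist cannot use\.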

### 5\.2 Per\-Domain Performance and Token Usage

Per\-domain medal rates on the full benchmark are uneven: vision 80% \(8/10\), NLP 100% \(6/6\), tabular 40% \(2/5\), and the single audio competition \(mlsp\-2013\-birds\) reached gold\. Remaining failures such as taxi\-fare and dog\-breed likely need approaches still absent from the skill store\. Token usage scales sub\-linearly with skill inventory: warm\-start runs need fewer refinement iterations, so per\-competition tokens drop as the store grows; per\-condition totals are in Appendix [F](https://arxiv.org/html/2606.30911#A6)\.

### 5\.3 Limitations

Single\-seed evaluation is the main limitation\. This choice matches the efficiency positioning of the work, but multi\-seed replication at the full 75\-competition benchmark is the priority for follow\-up runs with additional compute\. The §[4\.5](https://arxiv.org/html/2606.30911#S4.SS5) ablation tests scoped loading while holding inventory fixed\. Other engineering components, including the prototype screen, runner\-up branch, rank\-average ensemble, failure taxonomy, auto\-escalation, and revert\-on\-regression, remain for an analogous component\-wise study\.

## 6 Conclusion

These results suggest that knowledge organization can partly substitute for model strength and compute budget in ML\-engineering agents: HASTE reaches competitive MLE\-Bench Lite performance with a non\-frontier model under a shorter budget, and the fixed\-inventory ablation points to scoped loading as the source of the gain over skill volume alone\. Multi\-seed replication, held\-out evaluation, full 75\-competition runs, and retrieval\-augmented loading at scale are next\.

## Impact Statement

This work aims to improve the efficiency of ML engineering agents through reusable knowledge organization\. Its broader impacts are those of ML engineering automation generally; we do not foresee additional direct societal risks from the hierarchy mechanism itself\.

## References

- J\. S\. Chan, N\. Chowdhury, O\. Jaffe, et al\. \(2025\) MLE\-bench: evaluating machine learning agents on machine learning engineering\. In ICLR, External Links: 2410\.07095, [Link](https://arxiv.org/abs/2410.07095) Cited by: [§1](https://arxiv.org/html/2606.30911#S1.p1.1), [§4\.1](https://arxiv.org/html/2606.30911#S4.SS1.p1.1)\.
- A\. Chen, D\. M\. Dohan, and D\. R\. So \(2023\) EvoPrompting: language models for code\-level neural architecture search\. In NeurIPS, External Links: 2302\.14838, [Link](https://arxiv.org/abs/2302.14838) Cited by: [§2](https://arxiv.org/html/2606.30911#S2.SS0.SSS0.Px4.p1.1)\.
- J\. Chen, B\. D\. Mishra, J\. Nam, R\. Meng, T\. Pfister, and J\. Yoon \(2026\) MARS: modular agent with reflective search for automated AI research\. External Links: 2602\.02660, [Link](https://arxiv.org/abs/2602.02660) Cited by: [§2](https://arxiv.org/html/2606.30911#S2.SS0.SSS0.Px1.p1.1), [§3\.2](https://arxiv.org/html/2606.30911#S3.SS2.SSS0.Px2.p2.1)\.
- P\. Dayan and G\. E\. Hinton \(1993\) Feudal reinforcement learning\. In NeurIPS, External Links: [Link](https://proceedings.neurips.cc/paper/1992/hash/d14220ee66aeec73c49038385428ec4c-Abstract.html) Cited by: [§2](https://arxiv.org/html/2606.30911#S2.SS0.SSS0.Px3.p1.1)\.
- N\. Erickson, J\. Mueller, A\. Shirkov, H\. Zhang, P\. Larroy, M\. Li, and A\. Smola \(2020\) AutoGluon\-tabular: robust and accurate AutoML for structured data\. External Links: 2003\.06505, [Link](https://arxiv.org/abs/2003.06505) Cited by: [§2](https://arxiv.org/html/2606.30911#S2.SS0.SSS0.Px4.p1.1)\.
- H\. Fang, B\. Han, N\. Erickson, et al\. \(2025\) MLZero: a multi\-agent system for end\-to\-end machine learning automation\. External Links: 2505\.13941, [Link](https://arxiv.org/abs/2505.13941) Cited by: [§2](https://arxiv.org/html/2606.30911#S2.SS0.SSS0.Px2.p1.1)\.
- M\. Feurer, K\. Eggensperger, S\. Falkner, M\. Lindauer, and F\. Hutter \(2022\) Auto\-Sklearn 2\.0: hands\-free AutoML via meta\-learning\. JMLR 23, pp\. 1–61\. External Links: [Link](https://www.jmlr.org/papers/v23/21-0992.html) Cited by: [§2](https://arxiv.org/html/2606.30911#S2.SS0.SSS0.Px4.p1.1)\.
- M\. Feurer, A\. Klein, K\. Eggensperger, J\. Springenberg, M\. Blum, and F\. Hutter \(2015\) Efficient and robust automated machine learning\. In NeurIPS, External Links: [Link](https://proceedings.neurips.cc/paper/2015/hash/11d0e6287202fced83f79975ec59a3a6-Abstract.html) Cited by: [§2](https://arxiv.org/html/2606.30911#S2.SS0.SSS0.Px4.p1.1)\.
- A\. Grosnit, A\. Maraval, J\. Doran, et al\. \(2024\) Kolb\-based experiential learning for generalist agents with human\-level kaggle data science performance\. External Links: 2411\.03562, [Link](https://arxiv.org/abs/2411.03562) Cited by: [§1](https://arxiv.org/html/2606.30911#S1.p2.1), [§2](https://arxiv.org/html/2606.30911#S2.SS0.SSS0.Px2.p1.1)\.
- S\. Guo, C\. Deng, Y\. Wen, H\. Chen, Y\. Chang, and J\. Wang \(2024\) DS\-agent: automated data science by empowering large language models with case\-based reasoning\. In ICML, External Links: 2402\.17453, [Link](https://arxiv.org/abs/2402.17453) Cited by: [§2](https://arxiv.org/html/2606.30911#S2.SS0.SSS0.Px2.p1.1)\.
- N\. Hollmann, S\. Müller, and F\. Hutter \(2023\) Large language models for automated data science: introducing CAAFE for context\-aware automated feature engineering\. In NeurIPS, External Links: 2305\.03403, [Link](https://arxiv.org/abs/2305.03403) Cited by: [§2](https://arxiv.org/html/2606.30911#S2.SS0.SSS0.Px4.p1.1)\.
- S\. Hu, C\. Lu, and J\. Clune \(2024\) Automated design of agentic systems\. In ICLR, External Links: 2408\.08435, [Link](https://arxiv.org/abs/2408.08435) Cited by: [§1](https://arxiv.org/html/2606.30911#S1.p2.1), [§2](https://arxiv.org/html/2606.30911#S2.SS0.SSS0.Px2.p1.1)\.
- Z\. Jiang, D\. Schmidt, D\. Srikanth, D\. Xu, I\. Kaplan, D\. Jacenko, and Y\. Wu \(2025\) AIDE: AI\-driven exploration in the space of code\. External Links: 2502\.13138, [Link](https://arxiv.org/abs/2502.13138) Cited by: [§2](https://arxiv.org/html/2606.30911#S2.SS0.SSS0.Px1.p1.1), [§3\.2](https://arxiv.org/html/2606.30911#S3.SS2.SSS0.Px2.p2.1)\.
- A\. Li, C\. Wu, Z\. Ge, et al\. \(2025\) The FM agent\. External Links: 2510\.26144, [Link](https://arxiv.org/abs/2510.26144) Cited by: [§2](https://arxiv.org/html/2606.30911#S2.SS0.SSS0.Px1.p1.1), [§3\.2](https://arxiv.org/html/2606.30911#S3.SS2.SSS0.Px2.p2.1)\.
- S\. Liu, C\. Gao, and Y\. Li \(2024\) Large language model agent for hyper\-parameter optimization\. In CPAL, External Links: 2402\.01881, [Link](https://arxiv.org/abs/2402.01881) Cited by: [§2](https://arxiv.org/html/2606.30911#S2.SS0.SSS0.Px4.p1.1)\.
- Z\. Liu, Y\. Cai, X\. Zhu, et al\. \(2025\) ML\-master: towards AI\-for\-AI via integration of exploration and reasoning\. External Links: 2506\.16499, [Link](https://arxiv.org/abs/2506.16499) Cited by: [§2](https://arxiv.org/html/2606.30911#S2.SS0.SSS0.Px1.p1.1), [§3\.2](https://arxiv.org/html/2606.30911#S3.SS2.SSS0.Px2.p2.1)\.
- J\. Nam, J\. Yoon, J\. Chen, J\. Shin, S\. O\. Arik, and T\. Pfister \(2025\) MLE\-STAR: machine learning engineering agent via search and targeted refinement\. External Links: 2506\.15692, [Link](https://arxiv.org/abs/2506.15692) Cited by: [§2](https://arxiv.org/html/2606.30911#S2.SS0.SSS0.Px1.p1.1)\.
- R\. S\. Olson and J\. H\. Moore \(2016\) TPOT: a tree\-based pipeline optimization tool for automating machine learning\. In Workshop on Automatic Machine Learning at ICML, pp\. 66–74\. External Links: [Link](https://proceedings.mlr.press/v64/olson_tpot_2016.html) Cited by: [§2](https://arxiv.org/html/2606.30911#S2.SS0.SSS0.Px4.p1.1)\.
- G\. Sarch, L\. Jang, M\. J\. Tarr, W\. W\. Cohen, K\. Marino, and K\. Fragkiadaki \(2024\) VLM agents generate their own memories: distilling experience into embodied programs of thought\. In NeurIPS, External Links: 2406\.14596, [Link](https://arxiv.org/abs/2406.14596) Cited by: [§2](https://arxiv.org/html/2606.30911#S2.SS0.SSS0.Px2.p1.1)\.
- N\. Shinn, F\. Cassano, E\. Berman, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\) Reflexion: language agents with verbal reinforcement learning\. In NeurIPS, External Links: 2303\.11366, [Link](https://arxiv.org/abs/2303.11366) Cited by: [§2](https://arxiv.org/html/2606.30911#S2.SS0.SSS0.Px2.p1.1), [§3\.1](https://arxiv.org/html/2606.30911#S3.SS1.p3.1)\.
- T\. R\. Sumers, S\. Yao, K\. Narasimhan, and T\. L\. Griffiths \(2024\) Cognitive architectures for language agents\. Transactions on Machine Learning Research\. External Links: [Link](https://openreview.net/forum?id=1i6ZCvflQJ) Cited by: [§2](https://arxiv.org/html/2606.30911#S2.SS0.SSS0.Px3.p1.1)\.
- R\. S\. Sutton, D\. Precup, and S\. Singh \(1999\) Between MDPs and semi\-MDPs: a framework for temporal abstraction in reinforcement learning\. Artificial Intelligence 112, pp\. 181–211\. External Links: [Link](https://doi.org/10.1016/S0004-3702(99)00052-1) Cited by: [§2](https://arxiv.org/html/2606.30911#S2.SS0.SSS0.Px3.p1.1)\.
- Y\. Talebirad, A\. Parsaee, C\. Y\. Szepesvari, A\. Nadiri, and O\. Zaiane \(2026\) Toward a theory of hierarchical memory for language agents\. In ICLR 2026 Workshop on Memory for LLM\-Based Agentic Systems \(MemAgents\), External Links: 2603\.21564, [Link](https://arxiv.org/abs/2603.21564) Cited by: [§2](https://arxiv.org/html/2606.30911#S2.SS0.SSS0.Px3.p1.1)\.
- C\. Thornton, F\. Hutter, H\. H\. Hoos, and K\. Leyton\-Brown \(2013\) Auto\-WEKA: combined selection and hyperparameter optimization of classification algorithms\. In KDD, pp\. 847–855\. External Links: [Link](https://doi.org/10.1145/2487575.2487629) Cited by: [§2](https://arxiv.org/html/2606.30911#S2.SS0.SSS0.Px4.p1.1)\.
- E\. Toledo, K\. Hambardzumyan, M\. Josifoski, et al\. \(2025\) AI research agents for machine learning: search, exploration, and generalization in MLE\-bench\. External Links: 2507\.02554, [Link](https://arxiv.org/abs/2507.02554) Cited by: [§2](https://arxiv.org/html/2606.30911#S2.SS0.SSS0.Px1.p1.1)\.
- A\. S\. Vezhnevets, S\. Osindero, T\. Schaul, N\. Heess, M\. Jaderberg, D\. Silver, and K\. Kavukcuoglu \(2017\) FeUdal networks for hierarchical reinforcement learning\. In ICML, External Links: 1703\.01161, [Link](https://arxiv.org/abs/1703.01161) Cited by: [§2](https://arxiv.org/html/2606.30911#S2.SS0.SSS0.Px3.p1.1)\.
- C\. Wan, X\. Dai, Z\. Wang, M\. Li, Y\. Wang, Y\. Mao, Y\. Lan, and Z\. Xiao \(2025\) LoongFlow: directed evolutionary search via a cognitive plan\-execute\-summarize paradigm\. External Links: 2512\.24077, [Link](https://arxiv.org/abs/2512.24077) Cited by: [§2](https://arxiv.org/html/2606.30911#S2.SS0.SSS0.Px1.p1.1), [§3\.2](https://arxiv.org/html/2606.30911#S3.SS2.SSS0.Px2.p2.1)\.
- G\. Wang, Y\. Xie, Y\. Jiang, A\. Mandlekar, C\. Xiao, Y\. Zhu, L\. Fan, and A\. Anandkumar \(2024\) Voyager: an open\-ended embodied agent with large language models\. TMLR\. External Links: [Link](https://openreview.net/forum?id=ehfRiF0R3a), 2305\.16291 Cited by: [§1](https://arxiv.org/html/2606.30911#S1.p2.1), [§2](https://arxiv.org/html/2606.30911#S2.SS0.SSS0.Px2.p1.1)\.
- O\. Weller, M\. Boratko, I\. Naim, and J\. Lee \(2026\) On the theoretical limitations of embedding\-based retrieval\. In ICLR, External Links: 2508\.21038, [Link](https://arxiv.org/abs/2508.21038) Cited by: [§3\.1](https://arxiv.org/html/2606.30911#S3.SS1.p3.1)\.
- P\. Xia, J\. Chen, H\. Wang, et al\. \(2026\) SkillRL: evolving agents via recursive skill\-augmented reinforcement learning\. External Links: 2602\.08234, [Link](https://arxiv.org/abs/2602.08234) Cited by: [§2](https://arxiv.org/html/2606.30911#S2.SS0.SSS0.Px2.p1.1)\.
- X\. Yang, X\. Yang, S\. Fang, et al\. \(2025\) R&D\-Agent: an LLM\-agent framework towards autonomous data science\. External Links: 2505\.14738, [Link](https://arxiv.org/abs/2505.14738) Cited by: [§2](https://arxiv.org/html/2606.30911#S2.SS0.SSS0.Px1.p1.1)\.
- L\. Zhang, Y\. Zhang, K\. Ren, D\. Li, and Y\. Yang \(2024\) MLCopilot: unleashing the power of large language models in solving machine learning tasks\. In EACL, External Links: 2304\.14979, [Link](https://arxiv.org/abs/2304.14979) Cited by: [§1](https://arxiv.org/html/2606.30911#S1.p2.1), [§2](https://arxiv.org/html/2606.30911#S2.SS0.SSS0.Px2.p1.1)\.
- S\. Zhang, C\. Gong, L\. Wu, X\. Liu, and M\. Zhou \(2023\) AutoML\-GPT: automatic machine learning with GPT\. External Links: 2305\.02499, [Link](https://arxiv.org/abs/2305.02499) Cited by: [§2](https://arxiv.org/html/2606.30911#S2.SS0.SSS0.Px4.p1.1)\.
- A\. Zhao, D\. Huang, Q\. Xu, M\. Lin, Y\. Liu, and G\. Huang \(2024\) ExpeL: llm agents are experiential learners\. In AAAI, External Links: [Link](https://doi.org/10.1609/aaai.v38i17.29936), 2308\.10144 Cited by: [§1](https://arxiv.org/html/2606.30911#S1.p2.1), [§2](https://arxiv.org/html/2606.30911#S2.SS0.SSS0.Px2.p1.1), [§3\.1](https://arxiv.org/html/2606.30911#S3.SS1.p3.1)\.
- X\. Zhu, Y\. Chen, H\. Tian, et al\. \(2023\) Ghost in the minecraft: generally capable agents for open\-world environments via large language models with text\-based knowledge and memory\. External Links: 2305\.17144, [Link](https://arxiv.org/abs/2305.17144) Cited by: [§2](https://arxiv.org/html/2606.30911#S2.SS0.SSS0.Px3.p1.1)\.

## Appendix A Algorithms

### A\.1 Orchestrator

Algorithm [1](https://arxiv.org/html/2606.30911#alg1) gives the top\-level multi\-competition loop described in §[3](https://arxiv.org/html/2606.30911#S3)\.

Algorithm 1 HASTE orchestrator: multi\-competition skill accumulation with LLM\-driven promotion\.

Require: Competitions $\mathcal{T}$, domain partition $\mathcal{D}$, initial skill store $\mathcal{S}$ \(possibly $\emptyset$\)\.

1: Tag each $t \in \mathcal{T}$ with domain $d(t) \in \mathcal{D}$\.

2: Pick one seed per domain; let $\{R_1, R_2, \dots\}$ be the resulting rounds\.

3: for round $R_r$ in $R_1, R_2, \dots$ do

4: $\mathcal{L}_r \leftarrow \emptyset$ \{new learnings this round\}

5: for competition $t \in R_r$, in parallel where possible do

6: $\Lambda(t) \leftarrow \mathcal{S}^{G} \cup \mathcal{S}^{D}_{d(t)} \cup \mathcal{S}^{C}_{t}$ \{tiered loading\}

7: $(\hat{y}_t,\ \ell_t) \leftarrow \textsc{Specialist}(t,\ \Lambda(t))$ \{Alg\. [2](https://arxiv.org/html/2606.30911#alg2)\}

8: $\mathcal{S}^{C}_{t} \leftarrow \mathcal{S}^{C}_{t} \cup \ell_t$; $\mathcal{L}_r \leftarrow \mathcal{L}_r \cup \ell_t$

9: end for

10: for learning $\ell \in \mathcal{L}_r$ do

11: $\textit{decision} \leftarrow \textsc{PromoteLLM}(\ell, \mathcal{S})$

12: if $\textit{decision} = \textsc{global}$ then

13: $\mathcal{S}^{G} \leftarrow \mathcal{S}^{G} \cup \{\textsc{Abstract}(\ell)\}$

14: else if $\textit{decision} = \textsc{domain}$ then

15: $\mathcal{S}^{D}_{d(t)} \leftarrow \mathcal{S}^{D}_{d(t)} \cup \{\textsc{Abstract}(\ell)\}$

16: else if $\textit{decision} = \textsc{conflict}$ then

17: keep both, annotate with conditions

18: else

19: leave in $\mathcal{S}^{C}_{t}$ or skip

20: end if

21: end for

22: end for

23: return $\{\hat{y}_t\}_{t \in \mathcal{T}}$, $\mathcal{S}$
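The promotion branch on lines 12 to 20 of Algorithm 1 can be illustrated with a small routing function\. This is a sketch under stated assumptions: the store is a plain dictionary, `abstract` is a stand\-in for the LLM\-driven abstraction step, and the promotion decision is passed in as a string rather than produced by a model\. None of these names come from the paper's code\.

```python
def abstract(text):
    """Placeholder for LLM-driven abstraction; here it only tags the text."""
    return f"[abstracted] {text}"


def route_learning(store, learning, decision, abstract_fn=abstract):
    """Apply one promotion decision to the store (hypothetical helper).

    The learning is assumed to be in its competition tier already,
    as on line 8 of Algorithm 1.
    """
    if decision == "global":
        store["global"].append(abstract_fn(learning["text"]))
    elif decision == "domain":
        domain_list = store["domain"].setdefault(learning["domain"], [])
        domain_list.append(abstract_fn(learning["text"]))
    elif decision == "conflict":
        # Keep both skills; record that conditions must be annotated.
        learning["note"] = "conflict: annotate conditions under which each holds"
    # Any other decision leaves the learning in its competition tier (or skips it).
    return store


learning_a = {"text": "check for training bugs at chance-level scores",
              "domain": "nlp", "competition": "detecting-insults"}
learning_b = {"text": "start from a base-size encoder",
              "domain": "nlp", "competition": "detecting-insults"}
learning_c = {"text": "dataset-specific quirk",
              "domain": "nlp", "competition": "detecting-insults"}

store = {
    "global": [],
    "domain": {},
    "competition": {"detecting-insults": [learning_a, learning_b, learning_c]},
}
route_learning(store, learning_a, "global")
route_learning(store, learning_b, "domain")
route_learning(store, learning_c, "competition")
print(store["global"])
print(store["domain"])
```

Only the abstracted text moves up a tier; the original learning stays attached to its source competition, which matches the way line 8 stores every learning in the competition tier before promotion runs\.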

### A\.2 Specialist Pipeline

Algorithm [2](https://arxiv.org/html/2606.30911#alg2) formalizes the five\-stage per\-competition pipeline described in §[3\.2](https://arxiv.org/html/2606.30911#S3.SS2)\.

Algorithm 2 HASTE specialist: per\-competition pipeline\.

Require: Competition $t$, loaded skills $\Lambda(t)$\.

1: $\textit{profile} \leftarrow \textsc{TaskProfiler}(t)$ \{metadata, CV strategy, resource probe\}

2: $\{m_1, m_2, m_3\} \leftarrow \textsc{PrototypeScreen}(t, \textit{profile}, \Lambda(t))$; $s_i \leftarrow \textsc{Run}(m_i)$

3: $\textit{win}, \textit{ru} \leftarrow$ top\-2 prototypes by score

4: $\textit{win}^{\star} \leftarrow \textsc{AdaptiveRefine}(\textit{win}, \Lambda(t), N{=}20)$ \{Alg\. [3](https://arxiv.org/html/2606.30911#alg3)\}

5: $\textit{ru}^{\star} \leftarrow \textsc{AdaptiveRefine}(\textit{ru}, \Lambda(t), N{=}6)$

6: $\textit{ens} \leftarrow \textsc{RankAverage}(\text{top-3 checkpoints from } \textit{win}^{\star} \cup \textit{ru}^{\star})$

7: $\hat{y} \leftarrow \arg\max_{c \in \{\textit{ens}, \textit{win}^{\star}, \textit{ru}^{\star}\}} \text{score}(c)$ \{accept ensemble only if it wins\}

8: $\ell \leftarrow \textsc{ProduceLearnings}(\text{history})$

9: return $\hat{y}, \ell$
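Line 6 of Algorithm 2 combines checkpoints by rank averaging\. The paper does not spell out the computation, so the sketch below shows the common form of the technique as an assumption: each model's predictions are converted to ranks \(ties share their mean rank\), and the ranks are averaged per sample\. The function names and the toy predictions are illustrative\.

```python
def to_ranks(scores):
    """Return 1-based ranks of scores; tied values share their mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def rank_average(prediction_sets):
    """Average the per-model ranks of each sample across models."""
    ranked = [to_ranks(p) for p in prediction_sets]
    n_samples = len(prediction_sets[0])
    return [sum(r[i] for r in ranked) / len(ranked) for i in range(n_samples)]


# Toy predictions from two hypothetical checkpoints on three samples.
model_a = [0.9, 0.1, 0.5]
model_b = [0.6, 0.2, 0.7]
ensemble = rank_average([model_a, model_b])
print(ensemble)  # [2.5, 1.0, 2.5]
```

Because only ranks are averaged, the ensemble is insensitive to differences in score calibration between checkpoints, which suits rank\-based metrics such as AUC\. Line 7 then keeps the ensemble only if it scores higher than both refined candidates\.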

### A\.3 Adaptive Refinement Loop

Algorithm [3](https://arxiv.org/html/2606.30911#alg3) gives the inner refinement loop invoked twice by the specialist, on lines 4 and 5 of Algorithm [2](https://arxiv.org/html/2606.30911#alg2): once on the prototype winner with budget $N{=}20$, once on the runner\-up with budget $N{=}6$\. The three tiers, Exploring, Optimizing, and FineTune, progress under auto\-escalation\. Two consecutive non\-improvements advance the tier; stagnation at FineTune exits the loop; revert\-on\-regression keeps a bad change from carrying into the next step\.

Algorithm 3 AdaptiveRefine: linear refinement loop with auto\-escalation and revert\-on\-regression\.

1: procedure $\textsc{AdaptiveRefine}(m, \Lambda, N)$

2: $\textit{tier} \leftarrow \textsc{Exploring}$; $\textit{best} \leftarrow m$; $c \leftarrow 0$

3: for $i = 1$ to $N$ do

4: $p \leftarrow \textsc{LLMProposal}(\textit{best}, \textit{tier}, \text{history}, \Lambda)$

5: $s \leftarrow \textsc{Run}(p)$

6: if $s$ improves on $\text{score}(\textit{best})$ then

7: $\textit{best} \leftarrow p$; $c \leftarrow 0$

8: else

9: revert $p$; $c \leftarrow c + 1$

10: end if

11: if $c \geq 2$ then

12: if $\textit{tier} = \textsc{FineTune}$ then break $\triangleright$ stagnation exit

13: else advance tier; $c \leftarrow 0$ $\triangleright$ auto\-escalate

14: end if

15: end if

16: end for

17: return best

18: end procedure
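Algorithm 3 translates almost line for line into a short loop\. The sketch below is an illustration under assumptions: `propose` and `run` are caller\-supplied stand\-ins for the LLM proposal and the experiment run, the history and loaded skills are omitted, the starting score is passed in explicitly, and a higher score is assumed to be better unless stated otherwise\.

```python
TIERS = ["Exploring", "Optimizing", "FineTune"]


def adaptive_refine(model, score, propose, run, budget, higher_is_better=True):
    """Refinement loop with auto-escalation and revert-on-regression (sketch)."""
    tier_idx = 0
    best, best_score = model, score
    stalls = 0  # consecutive non-improvements, the counter c in Algorithm 3
    for _ in range(budget):
        candidate = propose(best, TIERS[tier_idx])
        s = run(candidate)
        improved = s > best_score if higher_is_better else s < best_score
        if improved:
            best, best_score, stalls = candidate, s, 0
        else:
            stalls += 1  # revert: the candidate is discarded, best is unchanged
        if stalls >= 2:
            if TIERS[tier_idx] == "FineTune":
                break  # stagnation exit
            tier_idx += 1  # auto-escalate to the next tier
            stalls = 0
    return best, best_score


# Toy stand-ins: models are integers, and scores stop improving after 3.
tiers_seen = []


def toy_propose(best, tier):
    tiers_seen.append(tier)
    return best + 1


def toy_run(candidate):
    return candidate if candidate <= 3 else 0


best, best_score = adaptive_refine(0, 0, toy_propose, toy_run, budget=20)
print(best, best_score, len(tiers_seen))  # 3 3 9
```

In the toy run the loop keeps three improvements, then stalls twice in each tier and exits at FineTune after 9 of the 20 budgeted iterations, which is the early\-exit behavior that makes warm starts cheaper when good proposals arrive early\.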

## Appendix B Prompts

We include the four key prompts that drive the system\.

### B\.1 Prototype Screen

The prototype screen prompt \(Figure [3](https://arxiv.org/html/2606.30911#A2.F3)\) is used at the start of each competition\. It injects the resource probe and any accumulated skills, and asks the LLM for three diverse model proposals\.

Prototype Screen Prompt \(abbreviated\)

You are an ML engineer starting a Kaggle competition\.
Competition: \{competition\_id\}
Domain: \{domain\} Metric: \{metric\} \(\{direction\}\)
Data: \{rows\} rows, \{features\} features
Target: \{target\_type\}
\[Resource budget: GPU model, VRAM, CPU cores, RAM\]
\[Accumulated skills from past competitions, if any\]
Phase 0 is a diversity screen across model families \-\-\- the winner will be refined later\.
Propose 3 DIVERSE model approaches to try\. Each should be a fundamentally different strategy \(e\.g\. gradient boosting vs\. neural net vs\. linear model, or different preprocessing philosophies\)\.
Return JSON:
\{‘‘models’’: \[\{‘‘name’’: \.\.\., ‘‘description’’: \.\.\.,
‘‘change\_specification’’: \.\.\.\}, \.\.\.\]\}

Figure 3: Abbreviated prototype screen prompt\. The full prompt injects measured GPU/CPU/RAM from a resource probe and up to 2000 characters of accumulated skills\. The LLM returns three model specifications, each executed with 1\-fold validation on full data\.

### B\.2 Refinement Proposal

The refinement proposal prompt \(Figure [4](https://arxiv.org/html/2606.30911#A2.F4)\) is the per\-iteration call inside AdaptiveRefine\. It carries the current tier, compressed history, and the loaded skills, and uses the six\-mode failure taxonomy to structure the next change\.

Refinement Proposal Prompt \(abbreviated\)

You are an ML engineer working on a Kaggle competition\.
Competition: \{competition\_id\}
Metric: \{metric\} \(\{direction\}\)
Data: \{rows\} rows, \{features\} features
Current best score: \{best\_score\}
Phase: \{tier\_label\}
\{tier\_guidance\}
\[Resource budget, experiment history, accumulated skills,
current code \(first 3000 chars\)\]
\#\# Failure Mode Diagnosis
First, diagnose which failure mode best describes
the current state:
1\. UNDERFITTING \-\-\- train score near baseline, small gap
2\. OVERFITTING \-\-\- large train\-val gap, high train score
3\. FEATURE\_GAP \-\-\- top features dominate, score plateaus
4\. NOISE\_CEILING \-\-\- high CV variance, score fluctuates
5\. DISTRIBUTION\_MISMATCH \-\-\- train\-val\-test disagreement
6\. DIMINISHING\_RETURNS \-\-\- each iter improves <0\.1%
Mention which mode applies and which past skill \(if any\) influenced your decision\.
Propose ONE atomic change to improve the score\.
Return JSON: \{‘‘plan’’: \.\.\., ‘‘change\_specification’’: \.\.\.,
‘‘decision’’: ‘‘CONTINUE’’ \| ‘‘NEXT\_TIER’’ \| ‘‘STOP’’\}

Figure 4: Abbreviated refinement proposal prompt\. The tier label rotates through Exploring, Optimizing, and Fine\-tuning\. The six\-mode failure taxonomy gives the LLM structured diagnostic guidance\. The decision field allows the LLM to self\-escalate tiers or terminate early\.

### B\.3 Learning Production

The learning production prompt \(Figure [5](https://arxiv.org/html/2606.30911#A2.F5)\) closes each competition\. The specialist reflects on its full experiment history and emits two to five plain\-text learnings, each with a proposed tier\.

Learning Production Prompt \(abbreviated\)

You just finished working on ‘‘\{competition\_id\}’’\.
Domain: \{domain\} Metric: \{metric\} \(\{direction\}\)
Dataset: \{rows\} rows, \{features\} features
\#\# Your Experiments and Results
\[Full experiment history with scores, deltas, kept/reverted status\]
Write learnings that could help in future competitions\.
Distinguish SUCCESS and FAILURE learnings:
\- What worked \-\-\- ‘‘this technique improved the score because\.\.\.’’
\- What failed \-\-\- ‘‘this technique hurt the score because\.\.\.’’
Failure learnings are equally valuable\. Knowing what NOT to do saves future compute\.
For failure learnings, note:
\- WHY it failed \(overfitting? wrong model family?\)
\- WHEN it might fail again \(conditions to watch for\)
Classify each learning:
\- ‘‘competition’’ \-\-\- specific to this task only
\- ‘‘domain’’ \-\-\- relevant to similar \{domain\} tasks
\- ‘‘global’’ \-\-\- relevant to any ML task
Return JSON: \{‘‘learnings’’: \[\{‘‘title’’: \.\.\.,
‘‘body’’: \.\.\., ‘‘proposed\_tier’’: \.\.\.\}, \.\.\.\]\}
Write 2\-\-5 learnings\. Focus on actionable, non\-obvious insights\.

Figure 5: The learning production prompt\. Each specialist reflects on its full experiment history after completing a competition\. Learnings are saved to the competition tier and later evaluated for promotion by the orchestrator\.

### B\.4 Skill Promotion

The skill promotion prompt \(Figure [6](https://arxiv.org/html/2606.30911#A2.F6)\) runs between rounds\. The orchestrator evaluates each new learning against existing skills and decides skip, competition, domain, global, or conflict\.

Skill Promotion Prompt \(abbreviated\)

You are reviewing ML learnings produced by domain agents after completing competitions\. Your job is to decide which learnings should be promoted up the skill hierarchy\.
\[All existing global and domain skills, for deduplication\]
\[New learnings from all domain agents this round\]
\#\# Promotion Rules
For each learning, decide:
1\. ‘‘skip’’ \-\-\- already covered or too obvious
2\. ‘‘competition’’ \-\-\- too specific \(dataset quirks, row indices\)
3\. ‘‘domain’’ \-\-\- generalizable to similar tasks\. Abstract it\.
4\. ‘‘global’’ \-\-\- universally useful across all ML tasks
5\. ‘‘conflict’’ \-\-\- contradicts an existing learning\. Note conditions under which each holds\.
\#\# Quality Standards
\- Be selective\. Promote AT MOST 50% of learnings\.
\- Abstractions MUST NOT mention specific competition
names, dataset names, or exact score values\.
Bad: ‘‘On aerial\-cactus, AUC reached 0\.9997’’
Good: ‘‘When AUC is near ceiling \(\>0\.999\),
further refinement yields diminishing returns’’
\- For ‘‘conflict’’: provide both the conflicting skill
ID and a condition annotation\.

Figure 6: The skill promotion prompt\. The orchestrator evaluates all new learnings against existing skills after each round\. Abstractions strip competition\-specific details to produce reusable domain or global knowledge\.

## Appendix C Per\-Competition Main Benchmark Results

Table [3](https://arxiv.org/html/2606.30911#A3.T3) reports per\-competition results on MLE\-Bench Lite\.

Table 3: Per\-competition results on MLE\-Bench Lite \(22 competitions\)\. — indicates the value was not recorded for that run\.

## Appendix D Cold Run vs\. Warm Run

Table [4](https://arxiv.org/html/2606.30911#A4.T4) compares cold\-run and warm\-run outcomes across all 22 MLE\-Bench Lite competitions\. Nine competitions already medaled on the cold run \(with few or no accumulated skills\) and were skipped in the rerun phase\. Of the 13 that failed the cold run, 8 flipped to medal on the warm run with accumulated global and domain skills, accounting for the lift from 40\.9% to 77\.3% medal rate\. The remaining 5 failed both attempts\.

Table 4: Cold run vs\. warm run across all 22 MLE\-Bench Lite competitions\. Cold = first attempt \(few or no accumulated skills\); Warm = later attempt with accumulated skills\. Nine competitions already medaled cold and were not re\-attempted\. Of the 13 that failed cold, 8 flipped to medal on the warm run\.

| Competition | Cold skills | Cold medal | Warm skills | Warm medal | Outcome |
| --- | --- | --- | --- | --- | --- |
| *Already medaled on cold run \(not re\-attempted\)* | | | | | |
| denoising\-dirty\-documents | 5 | G | — | — | already medaled |
| detecting\-insults | 5 | G | — | — | already medaled |
| dogs\-vs\-cats | 37 | B | — | — | already medaled |
| histopathologic | 14 | G | — | — | already medaled |
| plant\-pathology | 14 | G | — | — | already medaled |
| random\-acts\-of\-pizza | 5 | S | — | — | already medaled |
| spooky\-author | 5 | S | — | — | already medaled |
| tabular\-playground\-dec | 21 | G | — | — | already medaled |
| whale\-challenge | 14 | G | — | — | already medaled |
| *Failed cold, flipped to medal on warm run* | | | | | |
| aerial\-cactus | 5 | — | 50 | G | flipped |
| aptos2019 | 34 | — | 55 | G | flipped |
| jigsaw\-toxic | 37 | — | 59 | B | flipped |
| mlsp\-2013\-birds | 32 | — | 60 | G | flipped |
| nomad2018 | 14 | — | 60 | G | flipped |
| siim\-isic\-melanoma | 50 | — | 71 | B | flipped |
| text\-norm\-english | 50 | — | 65 | B | flipped |
| text\-norm\-russian | 50 | — | 72 | B | flipped |
| *Failed both cold and warm runs* | | | | | |
| dog\-breed | 37 | — | 62 | — | no medal |
| leaf\-classification | 29 | — | 54 | — | no medal |
| nyc\-taxi\-fare | 59 | — | 68 | — | no medal |
| ranzcr\-catheter | 37 | — | 60 | — | no medal |
| tabular\-playground\-may | 21 | — | 60 | — | no medal |
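The headline medal rates follow directly from the three group sizes in Table 4, and they agree with the per\-domain counts given in §5\.2\. The snippet below is only a consistency check on numbers stated in the text; the variable names are illustrative\.

```python
# Group sizes from Table 4.
cold_medals = 9       # medaled on the cold run, not re-attempted
flipped_on_warm = 8   # failed cold, medaled warm
failed_both = 5
total = cold_medals + flipped_on_warm + failed_both

cold_rate = cold_medals / total
final_rate = (cold_medals + flipped_on_warm) / total

# Per-domain (medals, competitions) as reported in Section 5.2.
per_domain = {"vision": (8, 10), "nlp": (6, 6), "tabular": (2, 5), "audio": (1, 1)}
domain_medals = sum(m for m, _ in per_domain.values())
domain_total = sum(n for _, n in per_domain.values())

print(f"{cold_rate:.1%} -> {final_rate:.1%}")  # 40.9% -> 77.3%
print(domain_medals, domain_total)              # 17 22
```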
## Appendix E Representative Skills

Table [5](https://arxiv.org/html/2606.30911#A5.T5) shows representative entries from the skill hierarchy with their source competition and an estimate of how many refinement iterations they saved on later, similar tasks\. The full skill inventory contains 5 global, 15 vision, 12 NLP, 19 tabular, and 108 competition entries\.

Table 5: Representative skills from the 3\-tier hierarchy\. Each entry is a plain\-text file traced to the experiment that produced it\. Iteration savings are estimated from cold\-start vs\. skill\-loaded runs of the same competition\.

| Tier | Type | Skill \(abbreviated\) | Source | Iters saved |
| --- | --- | --- | --- | --- |
| Global | Tech\. | Ensembling a strong model with a weaker one can degrade performance | dogs\-vs\-cats | $\sim$2 each |
| Global | Tech\. | Larger architecture does not fix wrong problem formulation \(ordinal vs\. classification\) | aptos2019 | $\sim$1–2 |
| Global | Tech\. | Chance\-level scores usually indicate a training bug, not a model limitation | detecting\-insults | $\sim$1–2 |
| Domain \(NLP\) | Tech\. | DeBERTa\-v3\-base is the strongest starting point for text classification under 12h | detecting\-insults | $\sim$2–3 |
| Domain \(Vis\.\) | Tech\. | ConvNeXt\-Large \+ 10\-fold stratified CV is a strong default for small\-dataset image classification | plant\-pathology | $\sim$1–2 |
| Domain \(Tab\.\) | Tech\. | Log\-transforming the target can hurt RMSE on right\-skewed regression targets | NYC\-taxi\-fare | $\sim$1 |
| Domain \(Vis\.\) | Refine | Vision refinement hints: 3/3 changes kept \(100% hit rate\)\. Try first: switch to ConvNeXt, add TTA | aerial\-cactus | guided |
| Domain \(NLP\) | Refine | NLP refinement hints: 5/12 changes kept \(42%\)\. Try first: fix LR, add FGM adversarial training | detecting\-insults | guided |
| Comp\. | Tech\. | Crystal descriptors \(SOAP, Ewald, PRDF\) cause regression on small crystal datasets \(<5K samples\) | nomad2018 | $\sim$2 |
## Appendix F Ablation Resource Breakdown

Table [6](https://arxiv.org/html/2606.30911#A6.T6) reports the full resource usage of the three ablation conditions: medal rate, mean test score, output token consumption, wall time, experiments attempted \(with success rate\), and skills loaded\. The headline finding \(flat doubles empty’s token cost for the same medal rate\) is in §[4\.5](https://arxiv.org/html/2606.30911#S4.SS5); this table provides the supporting detail\.

Table 6: Resource usage under three skill\-loading conditions\. Flat uses $2\times$ the output tokens of Empty without improving medal rate\.
## Appendix G Ablation Per\-Competition Table

Table [7](https://arxiv.org/html/2606.30911#A7.T7) gives the full per\-competition scores and medals for the controlled ablation \(Figure [2](https://arxiv.org/html/2606.30911#S4.F2) in the main text\)\.

Table 7: Controlled ablation: scores and medals \(G/S/B/—\) under three skill\-loading conditions \(8 competitions, 1 seed\)\. Tiered: domain\-scoped; Flat: all 159 skills; Empty: none\. Best score per competition in **bold**\.

| Competition | Domain | Tiered | Flat | Empty |
| --- | --- | --- | --- | --- |
| detecting\-insults | NLP | 0\.953 G | **0\.958** G | 0\.956 G |
| jigsaw\-toxic | NLP | **0\.987** B | 0\.981 — | 0\.985 — |
| random\-acts\-of\-pizza | NLP | **0\.798** S | 0\.599 — | 0\.481 — |
| histopathologic | Vision | 0\.997 G | **0\.998** G | **0\.998** G |
| plant\-pathology | Vision | 0\.996 G | **0\.999** G | 0\.994 G |
| aptos2019 | Vision | **0\.937** G | 0\.921 S | 0\.934 G |
| tabular\-playground | Tabular | **0\.963** G | **0\.963** G | 0\.962 G |
| mlsp\-2013\-birds | Audio | **0\.964** G | 0\.860 — | 0\.832 — |
| Medal rate | | 100% \(8/8\) | 62\.5% \(5/8\) | 62\.5% \(5/8\) |
| Mean score | | 0\.949 | 0\.910 | 0\.893 |
