optimize_anything: A Universal API for Optimizing any Text Parameter
Summary
This paper presents optimize_anything, a universal LLM-based optimization system for text artifacts that achieves state-of-the-art results across diverse tasks including agent architecture discovery, scheduling, CUDA kernel generation, and packing, demonstrating general-purpose text optimization.
# optimize_anything: A Universal API for Optimizing any Text Parameter

Source: [https://arxiv.org/html/2605.19633](https://arxiv.org/html/2605.19633)

Donghyun Lee (UC Berkeley, USA, lukedhlee@berkeley.edu), Shangyin Tan (UC Berkeley, USA, shangyin@berkeley.edu), Wenjie Ma (UC Berkeley, USA, windsey@berkeley.edu), Karim Elmaaroufi (UC Berkeley, USA, elmaaroufi@berkeley.edu), Rohit Sandadi (UC Berkeley, USA, rohitsandadi@berkeley.edu), Sanjit A. Seshia (UC Berkeley, USA, sseshia@eecs.berkeley.edu), Koushik Sen (UC Berkeley, USA, ksen@cs.berkeley.edu), Dan Klein (UC Berkeley, USA, klein@berkeley.edu), Ion Stoica (UC Berkeley, USA, istoica@cs.berkeley.edu), Joseph E. Gonzalez (UC Berkeley, USA, jegonzal@eecs.berkeley.edu), Omar Khattab (MIT, USA, okhattab@mit.edu), Alexandros G. Dimakis (UC Berkeley, USA, alexdimakis@berkeley.edu), and Matei Zaharia (UC Berkeley, USA, matei@berkeley.edu) (2026)

###### Abstract

Can a single LLM-based optimization system match specialized tools across fundamentally different domains? We show that when optimization problems are formulated as improving a text artifact evaluated by a scoring function, a single AI-based optimization system (supporting single-task search, multi-task search with cross-problem transfer, and generalization to unseen inputs) achieves state-of-the-art results across six diverse tasks. Our system discovers agent architectures that nearly triple Gemini Flash's ARC-AGI accuracy (32.5% → 89.5%), finds scheduling algorithms that cut cloud costs by 40%, generates CUDA kernels of which 87% match or beat PyTorch, and outperforms AlphaEvolve's reported circle packing solution (n=26). Ablations across three domains reveal that actionable side information yields faster convergence and substantially higher final scores than score-only feedback, and that multi-task search outperforms independent optimization given equivalent per-problem budget through cross-task transfer, with benefits scaling with the number of related tasks. Together, we show for the first time that text optimization with LLM-based search is a general-purpose problem-solving paradigm, unifying tasks traditionally requiring domain-specific algorithms under a single framework. We open-source `optimize_anything` with support for multiple backends as part of the GEPA project at [https://github.com/gepa-ai/gepa](https://github.com/gepa-ai/gepa).
Keywords: LLM optimization, text artifact optimization, evolutionary search, prompt engineering, agentic systems, Pareto optimization

Conference: ACM Conference on AI and Agentic Systems (CAIS '26), May 26–29, 2026, San Jose, CA, USA. Copyright: CC. DOI: 10.1145/3786335.3813167. ISBN: 979-8-4007-2415-2/2026/05. CCS concepts: Computing methodologies → Natural language processing; Computing methodologies → Neural networks; Computing methodologies → Artificial intelligence.

Figure 1. The `optimize_anything` loop: a text artifact $x$ is passed to an evaluator $f(x)$, which returns a score plus diagnostic feedback (SI); an LLM proposer consumes both to produce an improved artifact. The same API instantiates across domains: code optimization, prompt tuning, agent architecture search, and policy discovery. System diagram showing the `optimize_anything` loop. A string artifact is evaluated, producing scores and SI feedback, which feeds into an LLM proposer that generates improved candidates. Example instantiations shown for code, prompts, agents, and policies.

## 1. Introduction

Large language models can serve as effective optimizers when paired with automated evaluation. FunSearch (Romera-Paredes et al., [2024](https://arxiv.org/html/2605.19633#bib.bib23)) evolves Python functions to discover mathematical constructions that surpass known bounds. AlphaEvolve (Novikov et al., [2025](https://arxiv.org/html/2605.19633#bib.bib19)) extends the idea to broader code optimization, improving a 56-year-old matrix multiplication bound and designing scheduling heuristics for Google's data centers, but it operates exclusively on code artifacts, in single-task mode (one problem at a time).
GEPA (Agrawal et al., [2026b](https://arxiv.org/html/2605.19633#bib.bib4)) achieves state-of-the-art prompt optimization with generalization to unseen inputs, but is limited to prompts; MIPROv2 (Opsahl-Ong et al., [2024](https://arxiv.org/html/2605.19633#bib.bib20)) similarly targets prompt and few-shot selection. Despite strong results within their artifact types, no existing system has been applied to agent architectures, numeric optimization, or image generation, and no single system has demonstrated effectiveness across fundamentally different domains simultaneously.

We observe that a wide range of problems can be formulated as optimizing a text artifact. Whether the artifact is a CUDA kernel, a cloud scheduling policy, an agent architecture, a Scalable Vector Graphics (SVG) image, or a system prompt, the structure is the same: serialize the artifact as a string, evaluate it, and let an LLM propose improvements based on diagnostic feedback. This observation suggests that a much simpler interface and a uniform algorithm are possible.

We present `optimize_anything` (initially released as Agrawal et al. ([2026a](https://arxiv.org/html/2605.19633#bib.bib3))), a declarative API that implements this insight. The user provides a seed artifact (or, in seedless mode, just a natural-language objective), an evaluator that returns a score and optional diagnostic feedback, and optionally a dataset. The system handles prompt construction, reflection, candidate selection, and search strategy. This declarative design, inspired by the principle of *programming, not prompting* from DSPy (Khattab et al., [2023](https://arxiv.org/html/2605.19633#bib.bib13)), means the same API call works whether one is optimizing an LLM prompt, an agent architecture, or an image.

Our contributions are as follows:
(1) A single LLM-based text optimization system matches or surpasses domain-specific tools across six fundamentally different domains. We are the first to show that a single system (our proposed `optimize_anything`) can optimize code, prompts, agent architectures, numerical configurations, and images, achieving state-of-the-art results in each. Our system discovers agent architectures that nearly triple ARC-AGI accuracy (32.5% → 89.5%), finds scheduling algorithms that cut cloud costs by 40%, generates CUDA kernels of which 87% match or beat PyTorch baselines, creates custom solver code that matches or outperforms Optuna in numerical optimization, and outperforms AlphaEvolve's solution on circle packing. This establishes LLM-based text optimization as a general-purpose problem-solving paradigm, not limited to code or prompts.

(2) Three optimization modes (single-task, multi-task, and generalization) unified under one interface, including the first multi-task mode. Existing LLM-evolution systems each support exactly one mode. AlphaEvolve (Novikov et al., [2025](https://arxiv.org/html/2605.19633#bib.bib19)), OpenEvolve (Sharma, [2025](https://arxiv.org/html/2605.19633#bib.bib25)), and ShinkaEvolve (Lange et al., [2025](https://arxiv.org/html/2605.19633#bib.bib14)) operate in single-task mode: optimizing one code artifact for one problem at a time. GEPA (Agrawal et al., [2026b](https://arxiv.org/html/2605.19633#bib.bib4)) and MIPROv2 (Opsahl-Ong et al., [2024](https://arxiv.org/html/2605.19633#bib.bib20)) operate in generalization mode: optimizing a prompt to perform well on unseen inputs, but only for prompts.
No prior system supports *multi-task search*, where solving a batch of related problems together enables cross-transfer of discovered optimization patterns. `optimize_anything` unifies all three modes under one interface: multi-task search on CUDA kernels outperforms independent single-task optimization given equivalent per-problem budget (§[5.8](https://arxiv.org/html/2605.19633#S5.SS8)), and generalization extends beyond prompts to agent architectures (§[5.3](https://arxiv.org/html/2605.19633#S5.SS3)) and scheduling policies (§[5.2](https://arxiv.org/html/2605.19633#S5.SS2)). All optimization modes are expressed through the same `optimize_anything` API.

(3) Side information as a first-class evaluator contract. Prior frameworks support diagnostic feedback through ad hoc, framework-specific mechanisms. `optimize_anything` elevates it to a uniform API contract: any diagnostic (stack traces, profiler data, rendered images, structured error reports) flows to the proposer through one interface. Ablations across three domains (prompt optimization, circle packing, and CUDA kernels) show that actionable side information yields 4-6× faster convergence and substantially higher final performance versus score-only feedback (§[5.9](https://arxiv.org/html/2605.19633#S5.SS9)).

We achieve these results by extending the Pareto-based search of Agrawal et al. ([2026b](https://arxiv.org/html/2605.19633#bib.bib4)) (originally studied only for prompt optimization) to arbitrary text artifacts, adding single-task and multi-task modes. Candidates are selected based on per-example or per-metric Pareto dominance rather than aggregate scores, preserving complementary strengths across iterations. Table [2](https://arxiv.org/html/2605.19633#S4.T2) provides a detailed comparison.
We evaluate `optimize_anything` across six primary domains spanning all three optimization modes (Table [1](https://arxiv.org/html/2605.19633#S3.T1)), with two additional domains (blackbox mathematical optimization and 3D modeling) in the appendix as preliminary demonstrations. Key results include: (i) evolved agent architectures nearly triple Gemini Flash's ARC-AGI accuracy (32.5% → 89.5%); (ii) discovered cloud scheduling algorithms cut costs by up to 40%; (iii) 87% of generated CUDA kernels match or beat PyTorch baselines from KernelBench, with multi-task mode outperforming dedicated single-task optimization; (iv) prompt optimization improves GPT-4.1-mini's AIME-2025 accuracy from 46.67% to 60.00%; and (v) our circle packing solution outperforms AlphaEvolve's published one, confirmed by a controlled rerun against OpenEvolve under matched conditions. Ablations across three domains show that actionable side information yields 4-6× faster convergence and substantially higher final performance versus score-only feedback, and that multi-task search benefits scale with the number of related tasks.

## 2. Related Work

#### LLM-based program evolution.

AlphaEvolve (Novikov et al., [2025](https://arxiv.org/html/2605.19633#bib.bib19)) pioneered the LLM-evolution paradigm, using Gemini models with island-based MAP-Elites (Mouret and Clune, [2015](https://arxiv.org/html/2605.19633#bib.bib18)) to discover algorithms for Google's infrastructure. OpenEvolve (Sharma, [2025](https://arxiv.org/html/2605.19633#bib.bib25)) provides an open-source reimplementation with model-agnostic support. ShinkaEvolve (Lange et al., [2025](https://arxiv.org/html/2605.19633#bib.bib14)) extends the paradigm with novelty-based rejection sampling for sample efficiency and adaptive LLM ensemble selection for diversity.
FunSearch (Romera-Paredes et al., [2024](https://arxiv.org/html/2605.19633#bib.bib23)) applies evolutionary LLM search to mathematical discovery. EvoPrompting (Chen et al., [2023](https://arxiv.org/html/2605.19633#bib.bib6)) evolves code for neural architecture search. All operate exclusively in single-task mode and expose framework-specific abstractions (island topologies, prompt samplers, evolve-block markers). `optimize_anything` strips the interface to its declarative essence, adds multi-task and generalization modes, and elevates diagnostic feedback to a first-class API concept.

#### Prompt optimization.

GEPA (Agrawal et al., [2026b](https://arxiv.org/html/2605.19633#bib.bib4)) combines reflective mutation with a Pareto-based search technique for prompt optimization, outperforming both MIPROv2 (Opsahl-Ong et al., [2024](https://arxiv.org/html/2605.19633#bib.bib20)) and GRPO (Shao et al., [2024](https://arxiv.org/html/2605.19633#bib.bib24)). `optimize_anything` supports GEPA's evolutionary search algorithm as one of the optimization backends, extending it beyond prompts to arbitrary text artifacts. Other prompt optimization methods include OPRO (Yang et al., [2024](https://arxiv.org/html/2605.19633#bib.bib28)), APE (Zhou et al., [2023](https://arxiv.org/html/2605.19633#bib.bib31)), ProTeGi (Pryzant et al., [2023](https://arxiv.org/html/2605.19633#bib.bib22)), and PromptBreeder (Fernando et al., [2023](https://arxiv.org/html/2605.19633#bib.bib9)). TextGrad (Yuksekgonul et al., [2024](https://arxiv.org/html/2605.19633#bib.bib29)) uses LLM-generated "gradients" for text optimization.

#### LLM self-improvement and reflection.

Reflexion (Shinn et al., [2023](https://arxiv.org/html/2605.19633#bib.bib26)) uses verbal reinforcement for agent self-correction. Self-Refine (Madaan et al., [2023](https://arxiv.org/html/2605.19633#bib.bib16)) applies iterative self-feedback.
Evolution through Large Models (Lehman et al., [2022](https://arxiv.org/html/2605.19633#bib.bib15)) explores LLMs as mutation operators. `optimize_anything`'s SI mechanism generalizes these ideas by making diagnostic feedback a declarative evaluator contract rather than a hardcoded self-critique.

#### Agent architecture search.

ADAS (Hu et al., [2024](https://arxiv.org/html/2605.19633#bib.bib12)) and AFlow (Zhang et al., [2025](https://arxiv.org/html/2605.19633#bib.bib30)) search over agent architectures. `optimize_anything`'s generalization mode subsumes these as special cases: the artifact is the agent code, the evaluator runs it on tasks, and the system evolves both architecture and prompts jointly.

## 3. The `optimize_anything` API

### 3.1. Core Interface

At its simplest, `optimize_anything` requires a seed artifact and an evaluator. The evaluator takes a candidate string and returns a score (higher is better) alongside an optional Side Information (SI) dictionary containing diagnostic feedback the proposer reads during reflection:

```python
import optimize_anything as oa


def evaluate(candidate: str) -> tuple[float, dict]:
    # execute_code is a user-supplied helper (not shown) that runs the candidate.
    result = execute_code(candidate)
    return result.score, {
        "Error": result.stderr,
        "Output": result.stdout,
        "Runtime": f"{result.time_ms:.1f}ms",
    }


result = oa.optimize_anything(
    seed_candidate="<your artifact>",
    evaluator=evaluate,
)
```

SI can include open-ended text, structured data, multiple sub-scores, or images (via `oa.Image`) for vision-capable LLMs (VLMs). The full `optimize_anything` signature is:

```python
def optimize_anything(
    seed_candidate=None,
    evaluator=...,
    dataset=None,
    valset=None,
    objective=None,
    background=None,
    config=None,
) -> OptimizationResult:
    ...
```

Specifically, `optimize_anything` doesn't require mutation prompts, task-specific templates, island configurations, or `EVOLVE-BLOCK` markers (all common in prior frameworks).
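To make the evaluator contract concrete, the following is a minimal, self-contained sketch of an evaluator that returns a score together with an SI dictionary. It does not call the library. The toy task (the candidate must define a `solve()` function that returns 42) is a hypothetical stand-in chosen purely for illustration; only the `(score, dict)` return shape and the three SI keys follow the text above.

```python
import contextlib
import io
import time
import traceback


def evaluate(candidate: str) -> tuple[float, dict]:
    """Run a candidate Python snippet and return (score, side information).

    Hypothetical toy task: the snippet must define solve() returning 42.
    """
    stdout = io.StringIO()
    error = ""
    score = 0.0
    start = time.perf_counter()
    try:
        namespace: dict = {}
        with contextlib.redirect_stdout(stdout):
            exec(candidate, namespace)  # execute the text artifact
            answer = namespace["solve"]()
        score = 1.0 if answer == 42 else 0.0
    except Exception:
        # The full traceback is the kind of diagnostic a proposer can act on.
        error = traceback.format_exc()
    elapsed_ms = (time.perf_counter() - start) * 1000
    return score, {
        "Error": error,
        "Output": stdout.getvalue(),
        "Runtime": f"{elapsed_ms:.1f}ms",
    }
```

A failing candidate still yields a useful return value: the score drops to 0.0 while `"Error"` carries the traceback, which is the kind of signal a score-only evaluator would discard.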
The user declares the *what* (artifact, evaluator, domain knowledge), and `optimize_anything`, through its optimization backends, handles the *execution*.

#### Seedless mode.

In domains where providing even a starting artifact is difficult, or where writing even a bad seed requires domain expertise (e.g., 3D modeling), the user can provide a natural-language `objective` argument in place of the `seed_candidate` argument, and the LLM bootstraps the first candidate from scratch. Seedless mode makes the system accessible to users who can *specify* what they want but not *implement* it. Appendix [C](https://arxiv.org/html/2605.19633#A3) demonstrates it on a 3D modeling task.

### 3.2. Three Optimization Modes

Which mode is active depends solely on whether `dataset` and `valset` are provided:

#### Single-Task Search.

No dataset. The candidate *is* the solution; the evaluator scores it directly. This is the mode that AlphaEvolve and OpenEvolve operate in. Example: in circle packing (§[5.6](https://arxiv.org/html/2605.19633#S5.SS6)), the artifact is the packing algorithm and the evaluator returns the packing score plus geometric diagnostics.

#### Multi-Task Search.

A `dataset` of related tasks is provided; insights from solving one help solve the others. Example: in CUDA kernel generation (§[5.5](https://arxiv.org/html/2605.19633#S5.SS5)), each task is a PyTorch operation to accelerate. Multi-task mode discovers optimization patterns that transfer across problems, converging faster and solving more problems than single-task runs (§[5.8](https://arxiv.org/html/2605.19633#S5.SS8)). No prior LLM-evolution framework supports this mode. Architecturally, the Pareto frontier is shared across tasks for cross-transfer during proposal, but at output time each task independently selects its own best candidate from the frontier.
This means multi-task search produces $N$ specialized artifacts (one per task) that have benefited from shared optimization context: patterns discovered while optimizing task $e_i$ are available as parents when proposing for task $e_j$, but each artifact can specialize to its task.

#### Generalization.

Both `dataset` and `valset` are provided; the optimized artifact must perform well on unseen examples. This is the mode that GEPA's prompt optimization (Agrawal et al., [2026b](https://arxiv.org/html/2605.19633#bib.bib4)) operates in; `optimize_anything` generalizes the pattern to any text artifact. Example: in agent architecture discovery (§[5.3](https://arxiv.org/html/2605.19633#S5.SS3)), the artifact is the entire agent, and it must generalize to unseen ARC-AGI puzzles. The key distinction is that multi-task search yields $N$ specialized artifacts while generalization yields one globally generalized artifact.

Table 1. Summary of experimental results across six domains. "Mode" indicates which optimization paradigm is used: S = single-task search, M = multi-task search, G = generalization. All results use `optimize_anything` with the indicated proposer LLM.

## 4. Method

`optimize_anything` is backend-agnostic and can be used with various optimization algorithms. The default optimization backend in `optimize_anything` is currently built atop GEPA (Agrawal et al., [2026b](https://arxiv.org/html/2605.19633#bib.bib4)), extending it and managing additional information; GEPA is an algorithm originally studied primarily in the context of prompt optimization and code search. The system overview is shown in Figure [1](https://arxiv.org/html/2605.19633#S0.F1).
While `optimize_anything`'s primary contribution is a unified interface, several concrete algorithmic modifications were necessary to generalize from prompts to arbitrary text artifacts: (1) new frontier types for single-task and multi-task search with distinct selection semantics (GEPA's Pareto-frontier selection relied on evaluation across multiple data points, whereas single-task search admits only one); (2) a refiner step that catches common LLM generation artifacts (malformed code blocks, import errors, syntax issues) before evaluation, essential for code and agent artifacts where minor formatting errors cause complete evaluation failure; (3) content-addressed evaluation caching to avoid redundant expensive rollouts; (4) SI as a first-class typed primitive enabling domain-portable proposer logic and multimodal feedback; and (5) an adapter layer between various optimization backends and the unified interface. We describe the two mechanisms that underpin its effectiveness and contrast `optimize_anything` with prior frameworks.

Table 2. Comparison of `optimize_anything` with prior LLM-based optimization frameworks across code evolution, prompt optimization, and agent architecture search systems. Only `optimize_anything` supports all three modes and provides diagnostic feedback as a first-class API concept.

### 4.1. Problem Formulation

We formalize the text optimization problem as follows. Let $\mathcal{X}$ denote the space of text artifacts (strings). An evaluator $f:\mathcal{X}\times(\mathcal{E}\cup\{\bot\})\to\mathbb{R}\times\mathcal{I}$ maps an artifact $x\in\mathcal{X}$ and an (optional) example $e\in\mathcal{E}\cup\{\bot\}$ to a score $s(x,e)\in\mathbb{R}$ and actionable side information $\iota(x,e)\in\mathcal{I}$, i.e., $f(x,e)=(s(x,e),\iota(x,e))$.
The three modes correspond to:

- Single-task search: $\mathcal{E}=\emptyset$; maximize $s(x)$ directly. The artifact *is* the solution (e.g., a packing algorithm).
- Multi-task search: Given a dataset $\mathcal{D}=\{e_1,\ldots,e_n\}$ of related problems, find an artifact $x\in\mathcal{X}$ (e.g., a kernel-generation prompt) maximizing $\frac{1}{n}\sum_{i=1}^{n}s(x,e_i)$. Cross-transfer arises because the Pareto frontier preserves patterns that work across problems.
- Generalization: Given a training set $\mathcal{D}_{\text{train}}$ and a validation set $\mathcal{D}_{\text{val}}=\{e^{\text{val}}_1,\ldots,e^{\text{val}}_k\}$, find an artifact $x\in\mathcal{X}$ maximizing $\frac{1}{k}\sum_{j=1}^{k}s\left(x,e^{\text{val}}_j\right)$. Search uses feedback from $\mathcal{D}_{\text{train}}$, while $\mathcal{D}_{\text{val}}$ measures generalization to unseen examples. This generalizes classical machine learning: the artifact may be a prompt, an agent, or a policy.

### 4.2. Side Information (SI)

Widely used numerical optimization methods such as gradient descent reduce all diagnostic context to a single scalar. The optimizer knows *that* a candidate failed, but not *why*. For example, one cannot show a Bayesian optimizer a stack trace. LLM-evolution frameworks changed this by feeding execution results into LLM proposers. When an LLM reads a compiler error, diagnoses a logic bug, and proposes a targeted fix, the process is closer to an engineer iterating on a prototype than to blind evolution.

`optimize_anything` leans into this by making diagnostic feedback a first-class part of the evaluator contract.
The evaluator returns both a score and a `side_info` dictionary containing any diagnostic the evaluator can produce:

- Text: compiler errors, runtime exceptions, profiler summaries, natural-language critiques.
- Structured data: per-test-case results, sub-scores for multiple objectives, execution traces.
- Images: rendered SVGs, 3D model screenshots, or chart visualizations, enabling VLM proposers to *see* what they are improving.

SI is the text-optimization analogue of the gradient. Where gradients tell a numerical optimizer which direction to move, SI can tell the LLM proposer *why* a candidate failed and *how* to fix it. During a dedicated reflection step, the proposer reasons over this signal to diagnose failures and propose targeted improvements. Prior frameworks expose feedback through framework-specific mechanisms; SI provides a uniform interface that makes it trivial to surface any diagnostic. The key design choice is that SI is *opt-in but zero-friction*: evaluators that return only a score work fine, and existing `print()` statements can be captured automatically via `capture_stdio=True`.

### 4.3. Pareto-Based Search

Even when optimizing a single objective, evaluating candidates across multiple examples or metrics produces richer signal than a scalar aggregate. The naive approach collapses that signal into one average score and always selects the top candidate. This stalls fast: averaging hides which aspects are strong and which are weak, and the proposer tries to improve everything at once.

`optimize_anything` does two things differently. First, it tracks scores per task (from `dataset`) or per metric (from sub-scores in SI) individually and maintains a Pareto frontier: any candidate that is the best at *something* survives, even if its average is suboptimal. Second, each reflection step shows the proposer a minibatch of just 2–3 examples instead of all of them, enabling focused, targeted improvements on that subset.
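The "best at something survives" rule can be illustrated with a small sketch. This is an illustration under stated assumptions, not the GEPA implementation: the candidate names, the score table, and the helper names `best_per_objective` and `frontier` are all made up for the example.

```python
def best_per_objective(scores: dict[str, list[float]]) -> dict[int, set[str]]:
    """Map each objective index j to the candidates achieving the top score on j."""
    n_objectives = len(next(iter(scores.values())))
    best = {}
    for j in range(n_objectives):
        top = max(s[j] for s in scores.values())
        best[j] = {name for name, s in scores.items() if s[j] == top}
    return best


def frontier(scores: dict[str, list[float]]) -> set[str]:
    """Keep every candidate that is the best on at least one objective."""
    survivors: set[str] = set()
    for names in best_per_objective(scores).values():
        survivors |= names
    return survivors


# Hypothetical per-task scores for four candidates on three tasks.
scores = {
    "A": [0.9, 0.2, 0.4],  # best on task 0, mediocre average
    "B": [0.3, 0.8, 0.4],  # best on task 1, mediocre average
    "C": [0.6, 0.6, 0.6],  # best average, best on task 2
    "D": [0.5, 0.5, 0.5],  # never the best at anything
}

kept = frontier(scores)
top_by_mean = max(scores, key=lambda name: sum(scores[name]) / len(scores[name]))
```

Selecting by average alone would keep only `C`; the per-objective rule also retains `A` and `B`, whose strengths on individual tasks would otherwise be lost, and drops `D`.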
Over iterations, the frontier accumulates complementary strengths. Candidates that excel at different tasks are preserved and their strategies recombined. This mechanism also powers multi-task search: when optimizing across related problems, the frontier preserves candidates that excel on different tasks, and strategies discovered for one problem transfer to others (§[5.8](https://arxiv.org/html/2605.19633#S5.SS8)).

#### Candidate selection.

In GEPA (Agrawal et al., [2026b](https://arxiv.org/html/2605.19633#bib.bib4)), the current default optimization backend, candidates are selected for mutation in proportion to how often they appear on the Pareto front. Let $J$ index the objectives used to form the Pareto scores (e.g., per-example tasks, per-metric scores, or both). Each candidate $\Phi$ induces a score $s_j(\Phi)$ for every $j\in J$. Let $\mathcal{P}$ denote the set of Pareto-nondominated candidates under these objectives. For each objective $j\in J$, let $\mathcal{B}[j]$ be the set of candidates in $\mathcal{P}$ that achieve the best score on $j$. We sample candidates with probability proportional to $|\{j\in J:\Phi\in\mathcal{B}[j]\}|$, focusing exploration on broadly effective solutions.

#### Reflection and mutation.

Given a selected candidate $\Phi$ and a minibatch $\mathcal{M}$ of examples, the system executes $\Phi$ on $\mathcal{M}$, collects scores and SI, and presents them to the proposer LLM in a structured reflection prompt. The proposer diagnoses failures using the SI and produces an updated artifact $\Phi'$. If $\Phi'$ improves on the minibatch, it is fully evaluated and added to the candidate pool.

## 5. Experiments

We evaluate `optimize_anything` across six domains spanning all three optimization modes. For each, we describe the artifact, evaluator, SI design, and results.
We then present ablation studies on multi-task search (§[5.8](https://arxiv.org/html/2605.19633#S5.SS8)), SI (§[5.9](https://arxiv.org/html/2605.19633#S5.SS9)), and proposer sensitivity and cost (§[5.10](https://arxiv.org/html/2605.19633#S5.SS10)), followed by an analysis of the optimization mechanisms (§[6](https://arxiv.org/html/2605.19633#S6)). Optimized solutions are presented in Appendix [J](https://arxiv.org/html/2605.19633#A10).

### 5.1. Coding Agent Skills (Generalization)

Setup. Skills are natural-language instructions and best practices for working with a specific codebase (blog post: Tan et al., [2026](https://arxiv.org/html/2605.19633#bib.bib27)). The evaluator runs a coding agent on repository tasks and scores whether it resolves them; the optimized skills must generalize to unseen tasks. We optimize skills for the Bleve search library and evaluate transfer to Claude Code with both Haiku 4.5 and Sonnet 4.5.

SI design. The evaluator returns task descriptions, agent traces (tool calls, code edits, errors), test outcomes, and resolution time.

Results. Optimized skills boost Haiku 4.5's pass rate from 79.3% to 98.3% and Sonnet 4.5's from 94.8% to 100%, while cutting resolution time by 47% (Figure [2](https://arxiv.org/html/2605.19633#S5.F2)). Critically, skills discovered for one model transfer effectively to another without reoptimization, demonstrating the generalization mode's ability to learn model-agnostic repository knowledge.

Figure 2. Claude Code on the Bleve repository. Optimized skills boost pass rates to near-perfect while reducing resolve time by 47%. Skills transfer across models without reoptimization. Bar chart showing pass rates: Haiku 4.5 79.3% (173s), Haiku 4.5 + Skills 98.3% (142s), Sonnet 4.5 94.8% (285s), Sonnet 4.5 + Skills 100% (169s).
### 5.2. Cloud Scheduling Algorithms (Generalization)

Setup. We optimize two cloud infrastructure algorithms from the ADRS benchmark (Cheng et al., [2025](https://arxiv.org/html/2605.19633#bib.bib7)). CloudCast discovers broadcast routing strategies for multi-cloud data transfer, minimizing data egress cost. Can't Be Late learns scheduling policies deciding when to use cheap preemptible `SPOT` instances versus reliable `ON_DEMAND` instances to meet deadlines. Both use generalization mode with training/validation splits over infrastructure scenarios.

SI design. For CloudCast: per-partition routing decisions, edge utilizations, cost breakdowns. For Can't Be Late: spot-availability patterns, instance-usage timelines, segment counts (`SPOT` vs. `ON_DEMAND` vs. restarts).

Results. CloudCast achieves 40.2% cost savings over Dijkstra routing (Figure [3(a)](https://arxiv.org/html/2605.19633#S5.F3.sf1)), evolving from a baseline shortest-path algorithm to a provider-aware Steiner tree approach that jointly optimizes for egress cost and transfer latency. Can't Be Late achieves 7.8% cost savings (Figure [3(b)](https://arxiv.org/html/2605.19633#S5.F3.sf2)), evolving a simple deadline-check heuristic into an adaptive strategy with state tracking for spot-unavailability patterns, break-even switching cost analysis, and graduated decision thresholds based on slack ratio. Both results top the ADRS leaderboard (`optimize_anything`: 96.6 aggregate score vs. 92.9 for OpenEvolve, 72.0 for ShinkaEvolve). The evolved artifacts are qualitatively different from their seeds: CloudCast discovers provider-aware Steiner tree routing (absent from the Dijkstra seed), while Can't Be Late learns persistent spot-unavailability tracking and overhead-aware switching costs (absent from the greedy seed).

(a) CloudCast: 40.2% cost savings. (b) Can't Be Late: 7.8% savings.

Figure 3. Optimization trajectories for cloud scheduling.
Both use generalization mode with train/val splits over infrastructure scenarios. Two line charts showing optimization trajectories. CloudCast reaches 40.2% test savings. Can't Be Late reaches 7.8% test savings.

### 5.3. ARC-AGI Agent Architecture (Generalization)

Setup. Rather than optimizing a prompt, we optimize the *entire agent system*: code, sub-agent architecture, control flow, helper functions, and prompts are all treated as a single text artifact, building on an earlier proof-of-concept with GEPAAdapter (Agrawal, [2025](https://arxiv.org/html/2605.19633#bib.bib2)). The optimization objective is for the artifact to generalize to unseen ARC-AGI (Chollet, [2019](https://arxiv.org/html/2605.19633#bib.bib8)) puzzles.

SI design. Training/test grid examples, per-puzzle scores, internal model outputs, LLM costs, error tracebacks, and code execution results.

Results. Using Gemini 3 Flash as both the proposer and the underlying agent model, `optimize_anything` starts from a naive 10-line agent seed (one LLM call) and iteratively develops it into a 300+ line system consisting of 4 components along with fallbacks. The test accuracy improves from 32.5% to 89.5%, a 57 percentage point gain (Figure [4](https://arxiv.org/html/2605.19633#S5.F4)). The optimized architecture implements a 4-stage pipeline: (1) rule induction via pattern analysis, (2) code generation with `exec()`-based verification, (3) iterative debugging with up to 2 fix attempts, and (4) structured fallback from code-first to direct LLM prediction. This represents a qualitative leap: the system discovers architectural patterns (verify-then-fallback, iterative refinement) that typically require manual engineering iterations.

Figure 4. ARC-AGI agent architecture evolution with Gemini 3 Flash. Validation accuracy reaches 93.5%; test accuracy improves from 32.5% to 89.5%. Line chart showing validation accuracy improving from about 56% to 93.5% over metric calls.
Base test 32.5%, best test 89.5%.

### 5.4. AIME Prompt Optimization (Generalization)

Setup. We optimize a system prompt for GPT-4.1-mini on AIME (American Invitational Mathematics Examination) competition problems. Training uses AIME 2022–2024; testing uses AIME 2025.

SI design. The evaluator returns each problem statement, the model’s reasoning chain, extracted answer, ground truth, and a correct/incorrect flag.

Results. Prompt optimization improves GPT-4.1-mini from 46.67% to 60.00% on AIME 2025 (Figure [5](https://arxiv.org/html/2605.19633#S5.F5)), a 13.3 percentage point gain from changing only the system prompt. This outperforms MIPROv2 (Opsahl-Ong et al., [2024](https://arxiv.org/html/2605.19633#bib.bib20)), which reaches 51.33% on the same benchmark. The optimized prompt (Appendix [I](https://arxiv.org/html/2605.19633#A9)) evolves from a single generic sentence into a structured 6-rule reasoning framework. This result matches the performance gains reported by Agrawal et al. ([2026b](https://arxiv.org/html/2605.19633#bib.bib4)), demonstrating that exposing a prompt optimization algorithm through a general interface does not hurt performance on prompt optimization.

Figure 5. AIME prompt optimization for GPT-4.1-mini. Validation score improves from 46.67% to 57.78%; test score reaches 60.00%. Line chart showing validation score improving over 350 metric calls. Test accuracy reaches 60% from 46.67% baseline.

### 5.5. CUDA Kernel Generation (Multi-Task Search)

Setup. We generate CUDA kernels for 31 reference PyTorch operations from KernelBench (Ouyang et al., [2025](https://arxiv.org/html/2605.19633#bib.bib21)), evaluated on a V100 32GB GPU. The 31 problems span diverse operations: matrix multiplications, convolutions, reductions, element-wise ops, and normalization layers.
Under the hood, `optimize_anything` evolves the prompt that drives kernel generation; in multi-task mode, insights discovered for one problem (e.g., how to handle memory coalescing) transfer to others automatically through the shared Pareto frontier.

SI design. The evaluator compiles the generated kernel, runs correctness tests (max absolute error vs. PyTorch reference), and benchmarks wall-clock time. SI includes: (i) NVCC compiler errors with line numbers, (ii) correctness test failures with actual vs. expected outputs, (iii) relevant CUDA documentation snippets, and (iv) speedup ratio vs. the PyTorch baseline.

Results. 87% of generated kernels match or beat the PyTorch baseline performance; 48% achieve 10%+ speedups, and 25% achieve 20%+ speedups (Figure [6](https://arxiv.org/html/2605.19633#S5.F6)). The evolved kernels employ techniques such as float4 vectorization, two-pass algorithms (compute statistics, then normalize), warp shuffle reductions, and shared memory tiling. Multi-task mode’s advantages are analyzed in §[5.8](https://arxiv.org/html/2605.19633#S5.SS8).

Figure 6. KernelBench results (GPT-5 as proposer). $\text{Fast}_p(s)$: fraction of kernels achieving speedup $\geq s$. 87% match baseline; 25% are 20%+ faster. Line chart showing $\text{Fast}_p$ at various speedup thresholds: $\text{Fast}_p(0)$ = 100%, $\text{Fast}_p(1.0)$ = 87%, $\text{Fast}_p(1.1)$ = 48%, $\text{Fast}_p(1.2)$ = 25%.

### 5.6. Circle Packing (Single-Task Search)

Setup. The task is to pack $n = 26$ circles within a unit square while maximizing the sum of their radii. `optimize_anything` optimizes the packing algorithm code; the evaluator executes the proposed packing code and returns the score plus geometric diagnostics.

SI design. Circle positions, radii, constraint violations, overlap distances, boundary violations, and a rendered visualization of the packing.
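
To make the "score plus geometric diagnostics" contract for circle packing concrete, here is a minimal Python sketch of such an evaluator. It is an illustration under stated assumptions, not the evaluator used in the experiments: the function name `evaluate_packing`, the `(x, y, r)` input format, the dictionary keys, and the choice to score an invalid packing as 0 are all hypothetical, and the rendered visualization is omitted.

```python
import math
from itertools import combinations


def evaluate_packing(circles, tol=1e-9):
    """Score a packing of circles in the unit square and collect diagnostics.

    circles: list of (x, y, r) tuples (hypothetical input format).
    Returns (score, side_info); the score is the sum of radii if the packing
    is valid and 0.0 otherwise (an assumption made for this sketch).
    """
    boundary_violations = []
    overlaps = []

    for i, (x, y, r) in enumerate(circles):
        # Positive excess: how far the circle leaves [0, 1]^2 (or r is negative).
        excess = max(r - x, r - y, x + r - 1.0, y + r - 1.0, -r)
        if excess > tol:
            boundary_violations.append({"circle": i, "excess": excess})

    for (i, a), (j, b) in combinations(enumerate(circles), 2):
        # Gap between the two circle boundaries; negative means they overlap.
        gap = math.hypot(a[0] - b[0], a[1] - b[1]) - (a[2] + b[2])
        if gap < -tol:
            overlaps.append({"pair": (i, j), "overlap": -gap})

    valid = not boundary_violations and not overlaps
    score = sum(r for _, _, r in circles) if valid else 0.0
    side_info = {
        "valid": valid,
        "positions": [(x, y) for x, y, _ in circles],
        "radii": [r for _, _, r in circles],
        "boundary_violations": boundary_violations,
        "overlaps": overlaps,
    }
    return score, side_info
```

Returning the violating circle indices and overlap distances, rather than only a zero score, is what would let a proposer see which circles to move or shrink.
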
Results. `optimize_anything` reaches a score of 2.63598+, outperforming the reported solutions of AlphaEvolve, OpenEvolve, and ShinkaEvolve (Figure [7](https://arxiv.org/html/2605.19633#S5.F7)). The optimized algorithm is a bilevel optimizer: an LP over radii with dual-variable gradients for L-BFGS-B center optimization, augmented by CMA-ES exploration and diverse seeding strategies.

#### Controlled comparison with OpenEvolve.

To address concerns about comparing against published rather than reproduced results, we ran OpenEvolve (an open-source reimplementation of AlphaEvolve) under matched conditions using the same proposer LLM (GPT-5.1). As shown in Table [3](https://arxiv.org/html/2605.19633#S5.T3), `optimize_anything` achieved a higher score (2.63598) in just 63 evaluations (costing approximately \$3.18), while OpenEvolve did not match this score even with over three times the evaluation budget (200 iterations, costing \$6.85, reaching only 2.6307).

Table 3. Controlled comparison of `optimize_anything` vs. OpenEvolve on circle packing ($n = 26$), both using GPT-5.1 as proposer.

Figure 7. Circle packing ($n = 26$). `optimize_anything` outperforms the solutions of AlphaEvolve, ShinkaEvolve, and OpenEvolve, reaching a higher score with fewer evaluations. Line chart comparing four methods on circle packing. `optimize_anything` reaches the highest score, around 2.636, with fewer metric calls than the alternatives.

### 5.7. Image Generation (Multi-Task Search)

Setup. We generate SVG code and CAD models (via `build123d`) for four image goals (Table [10](https://arxiv.org/html/2605.19633#A8.T10) in Appendix [H](https://arxiv.org/html/2605.19633#A8)). The evaluator renders the image and queries a VLM to rate individual visual aspects on a 0–100 scale; each evaluator call scores one aspect, making this a natural multi-task search over the Pareto frontier of visual properties.
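
As a rough illustration of searching over a Pareto frontier of per-aspect scores, the Python sketch below keeps the candidates that attain the best score on at least one aspect and samples a parent in proportion to how many aspects it leads. This follows one reading of the "frontier frequency" sampling described for the default backend in Appendix D; the function names, the score layout, and the omission of any further dominance pruning are assumptions of this sketch, not the GEPA implementation.

```python
import random
from collections import Counter


def frontier_frequencies(scores):
    """Count, per candidate, the aspects on which it ties for the best score.

    scores: dict mapping candidate name -> list of per-aspect scores
    (hypothetical layout; all lists must have the same length).
    Candidates that lead on no aspect do not appear in the result.
    """
    n_aspects = len(next(iter(scores.values())))
    freq = Counter()
    for t in range(n_aspects):
        best = max(s[t] for s in scores.values())
        for name, s in scores.items():
            if s[t] == best:
                freq[name] += 1
    return freq


def pareto_select(scores, rng):
    """Sample one frontier candidate with probability proportional to its frequency."""
    freq = frontier_frequencies(scores)
    names = sorted(freq)  # fixed order so results are reproducible for a seeded rng
    weights = [freq[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


if __name__ == "__main__":
    # Three candidate images scored on three visual aspects (0-100 scale).
    aspect_scores = {
        "a": [90, 40, 70],
        "b": [60, 80, 70],
        "c": [50, 30, 20],
    }
    print(dict(frontier_frequencies(aspect_scores)))
    print(pareto_select(aspect_scores, random.Random(0)))
```

In this toy example candidate `c` leads on no aspect, so it is never selected as a parent, while `a` and `b` each specialize in different aspects and both remain available.
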
Results. Five human evaluators unanimously preferred `optimize_anything`-optimized images over zero-shot baselines across all goals. Quantitatively, the “pelican riding a bicycle” task achieves a VLM score of 0.726 vs. 0.330 for the zero-shot baseline (a 2.2× improvement). Qualitative comparisons are shown in Appendix Figure [11](https://arxiv.org/html/2605.19633#A10.F11).

### 5.8. Ablation: Multi-Task vs. Single-Task Search

We re-optimize the 10 best multi-task problems from scratch in single-task mode with equivalent per-problem budget. Figure [8](https://arxiv.org/html/2605.19633#S5.F8) shows that multi-task mode consistently outperforms single-task across all speedup thresholds, with the gap widening at higher thresholds (at $\text{Fast}_p(1.2)$, single-task plateaus early while multi-task continues improving).

Figure 8. Single-task vs. multi-task mode on 10 selected KernelBench problems. Multi-task (blue) consistently outperforms single-task (red) at all speedup thresholds, converging faster and solving more problems. Three line charts comparing single-task vs. multi-task (batch) mode at the F(1.0), F(1.1), and F(1.2) thresholds. Multi-task solid lines are consistently above single-task dashed lines.

The mechanism is cross-transfer via the Pareto frontier: optimization patterns discovered for one kernel (e.g., vectorized memory access, warp-level reductions) are preserved on the frontier and inform proposals for other kernels. In single-task mode, each problem must independently discover these patterns.

#### Scaling with number of tasks.

Multi-task benefits scale with the number of related tasks: MT20 (20 problems) outperforms MT10 (10 problems), which outperforms single-task, with gains most pronounced at moderate speedup thresholds (Tables [6](https://arxiv.org/html/2605.19633#A5.T6)–[7](https://arxiv.org/html/2605.19633#A5.T7) in Appendix [E](https://arxiv.org/html/2605.19633#A5)).
Frontier size does not bottleneck scaling, as candidates are sampled by frontier frequency (e.g., ARC-AGI used 200 tasks effectively).

### 5.9. Ablation: Side Information

To isolate the contribution of SI, we compare `optimize_anything` with and without actionable side information (sub-scores) on prompt optimization for the Facility Support Analysis dataset. In the “with SI” condition, the evaluator returns per-aspect sub-scores alongside the aggregate score. In the “without SI” condition, only the aggregate score is returned.

Figure [9](https://arxiv.org/html/2605.19633#S5.F9) shows two effects. First, SI accelerates convergence: the “with SI” condition reaches a validation score of 0.80 within 100 rollouts, while the score-only condition requires approximately 600 rollouts to reach the same level. Second, SI improves final performance: the test score with SI is 86.32 versus 82.5 without.

Figure 9. Ablation: prompt optimization with vs. without SI on the Facility Support Analysis dataset. SI accelerates convergence (left) and improves final test performance (right): 86.32 vs. 82.5. Left: validation score curves showing with-subscores (blue) converging faster than without (red). Right: bar chart showing final test scores 86.32 vs 82.5.

Sub-scores let the proposer identify which aspects are strong vs. weak and target revisions accordingly, rather than receiving only an aggregate signal.

#### Cross-domain SI ablation.

SI vs. score-only ablations on circle packing and CUDA kernels (Table [4](https://arxiv.org/html/2605.19633#S5.T4)) confirm generalization: SI achieves the optimal circle packing solution (score-only reaches 94%), and enables 2.5–5× more kernels to exceed speedup thresholds. SI reveals *which* failure mode to address next; without it, the proposer can only observe that the score changed.

Table 4. SI vs. score-only ablation across three domains.
SI provides substantial gains in all domains, confirming generalization beyond prompt optimization.

### 5.10. Proposer Sensitivity and Optimization Cost

Comparing GPT-5.1 against the cheaper GPT-5-nano reveals a clear cost-performance tradeoff (Table [8](https://arxiv.org/html/2605.19633#A7.T8) in Appendix [G](https://arxiv.org/html/2605.19633#A7)): the nano model reduces costs by over 90% on Circle Packing while still improving substantially over the seed, but consistently underperforms the larger model on final quality. Total optimization costs range from \$1 (Numerical Blackbox) to \$144.70 (ARC-AGI), with reflection cost minimal and total spend dominated by the evaluator (Table [9](https://arxiv.org/html/2605.19633#A7.T9) in Appendix [G](https://arxiv.org/html/2605.19633#A7)).

## 6. Why the Framework Works: Optimization Trajectory Analysis

Beyond final scores, trajectory analysis on circle packing reveals three key mechanisms driving `optimize_anything`’s effectiveness (detailed in Appendix [F](https://arxiv.org/html/2605.19633#A6)):

(1) SI enables targeted algorithmic shifts: SI reveals *which* failure mode to address next (e.g., collapsed radii → switch to LP; poor centers → switch to SLP), enabling directed rather than blind mutations.

(2) Multi-module Pareto leapfrogging: the code artifact and refiner prompt are both tracked on the Pareto front; each module’s advances become the foundation for the other’s next improvement, creating a productive coordination dynamic absent from single-artifact systems.

(3) Pareto diversity prevents premature convergence: the front retains candidates from multiple algorithmic families (greedy, LP, SLP, bilevel L-BFGS, CMA-ES), ensuring structurally diverse parents for proposals.

These mechanisms operate identically across domains because they arise from the `evaluate(candidate) → (score, side_info)` contract.

## 7. Discussion

#### When does multi-task search help?
Our experiments reveal that cross-task transfer is most beneficial when problems share underlying optimization patterns but differ in their specifics. CUDA kernel generation exemplifies this: memory coalescing, vectorized access patterns, and warp-level reductions are strategies that apply across operations but manifest differently for each kernel. Multi-task mode discovers these patterns once and transfers them, while single-task mode must rediscover them independently for each problem (Tables [6](https://arxiv.org/html/2605.19633#A5.T6)–[7](https://arxiv.org/html/2605.19633#A5.T7)).

#### When does multi-task search hurt?

Multi-task search can degrade performance when tasks lack shared transferable structure. We quantify this on circle packing, where optimizing different values of $N$ jointly introduces noise rather than useful cross-transfer:

Table 5. Multi-task search on circle packing. Unlike CUDA kernels, circle packing problems for different $N$ are fundamentally independent, and multi-task search introduces noise.

Circle packing problems for different $N$ are fundamentally independent: optimal configurations change unpredictably with $N$, with no transferable structure (Graham and Lubachevsky, [1996](https://arxiv.org/html/2605.19633#bib.bib11); Galiev and Lisafina, [2013](https://arxiv.org/html/2605.19633#bib.bib10)). In general, multi-task search helps when tasks share underlying patterns (e.g., CUDA kernels on the same hardware) and hurts when they are fundamentally independent.

#### The role of SI across domains.

While the SI ablation (Table [4](https://arxiv.org/html/2605.19633#S5.T4)) confirms SI’s value, the mechanism differs by domain: for code (CUDA, circle packing), SI surfaces compiler errors and runtime diagnostics pinpointing failures; for agents (ARC-AGI), per-puzzle traces reveal which components fail; for cloud scheduling, SI exposes temporal decision structure.
In each case, SI converts a scalar signal into actionable diagnostics.

#### Artifacts optimized by `optimize_anything`.

The optimized artifacts range from structured prompts (AIME) and agent architectures (ARC-AGI) to 900+ line bilevel algorithms (circle packing). The system discovers qualitatively novel strategies, including multi-stage pipelines (ARC-AGI), provider-aware Steiner trees (CloudCast), and break-even cost analysis (Can’t Be Late), that arise from the interaction between LLM reasoning and diagnostic feedback.

## 8. Limitations

`optimize_anything` inherits limitations from LLM-based optimization. (1) The quality of proposals depends on the proposer LLM’s capabilities; weaker models produce weaker candidates, as confirmed by our proposer sensitivity analysis (Table [8](https://arxiv.org/html/2605.19633#A7.T8)). (2) Evaluation cost can be high when the evaluator involves expensive operations (e.g., \$144 for ARC-AGI, Table [9](https://arxiv.org/html/2605.19633#A7.T9)); however, LLM-based optimization is highly sample-efficient and therefore calls the evaluator less often. (3) The system assumes the artifact is representable as text; optimization of continuous parameters or binary artifacts requires a text-based proxy. (4) While multi-task search provides cross-transfer benefits on related problems, the degree of benefit depends on how related the problems are; for example, circle packing degrades in multi-task mode (Table [5](https://arxiv.org/html/2605.19633#S7.T5)). (5) Designing effective SI still requires domain expertise; while evaluators returning only a score work, the demonstrated gains come from expert-designed SI (compiler errors, profiler traces, VLM scoring rubrics).

That said, `optimize_anything` trades *optimization* expertise for *domain* expertise.
The user, most often a domain expert, need not configure backends, tune algorithmic hyperparameters, or engineer prompting strategies; they only need to surface the diagnostics they already understand.

## 9. Conclusion

`optimize_anything` demonstrates that a simple declarative interface (seed artifact, evaluator, and optional dataset) is sufficient to match or outperform purpose-built tools across diverse domains. The key ideas are (1) three unified optimization modes under one API, (2) Side Information as a first-class evaluator contract, and (3) Pareto-based search across metrics and examples. The API is backend-agnostic; as new optimization strategies emerge, they plug in without changing user code. `optimize_anything` is open-sourced with multiple backends as part of the GEPA project: [https://github.com/gepa-ai/gepa](https://github.com/gepa-ai/gepa).

###### Acknowledgements.

This research is supported in part by gifts from Accenture, Amazon, AMD, Anyscale, Broadcom, Google, IBM, Intel, Intesa Sanpaolo, Lambda, Lightspeed, Mibura, NVIDIA, Samsung SDS, SAP, by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research through the X-STACK: Programming Environments for Scientific Computing program (DESC0021982), and the Defense Advanced Research Projects Agency (DARPA) under Agreement No. HR00112590134. Lakshya A Agrawal is supported by a Laude Slingshot grant provided by the Laude Institute and an Amazon AI PhD Fellowship.

## References

- Agrawal (2025) Lakshya A Agrawal. 2025. ARC-AGI Agent Architecture Optimization with GEPAAdapter. [https://github.com/gepa-ai/gepa/blob/ebe0cd71/src/gepa/examples/dspy_full_program_evolution/arc_agi.ipynb](https://github.com/gepa-ai/gepa/blob/ebe0cd71/src/gepa/examples/dspy_full_program_evolution/arc_agi.ipynb). Committed September 1, 2025.
Readable version: [https://gepa-ai.github.io/gepa/tutorials/arc_agi/](https://gepa-ai.github.io/gepa/tutorials/arc_agi/).
- Agrawal et al. (2026a) Lakshya A Agrawal, Donghyun Lee, Shangyin Tan, Wenjie Ma, Karim Elmaaroufi, Rohit Sandadi, Sanjit A. Seshia, Koushik Sen, Dan Klein, Ion Stoica, Joseph E. Gonzalez, Omar Khattab, Alexandros G. Dimakis, and Matei Zaharia. 2026a. Introducing `optimize_anything`: A Unified Text Optimization API. [https://gepa-ai.github.io/gepa/blog/2026/02/18/introducing-optimize-anything/](https://gepa-ai.github.io/gepa/blog/2026/02/18/introducing-optimize-anything/). Blog post, February 18, 2026.
- Agrawal et al. (2026b) Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab. 2026b. GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning. In *International Conference on Learning Representations (ICLR)*.
- Akiba et al. (2019) Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. Optuna: A Next-generation Hyperparameter Optimization Framework. arXiv:1907.10902 \[cs.LG\] [https://arxiv.org/abs/1907.10902](https://arxiv.org/abs/1907.10902)
- Chen et al. (2023) Angelica Chen, David Dohan, and David So. 2023. EvoPrompting: Language Models for Code-Level Neural Architecture Search. In *Advances in Neural Information Processing Systems (NeurIPS)*.
- Cheng et al. (2025) Audrey Cheng, Shu Liu, Melissa Pan, Zhifei Li, Bowen Wang, Alex Krentsel, Tian Xia, Mert Cemri, Jongseok Park, Shuo Yang, Jeff Chen, Lakshya Agrawal, Aditya Desai, Jiarong Xing, Koushik Sen, Matei Zaharia, and Ion Stoica.
2025. Barbarians at the Gate: How AI is Upending Systems Research. arXiv:2510.06189 \[cs.AI\] [https://arxiv.org/abs/2510.06189](https://arxiv.org/abs/2510.06189)
- Chollet (2019) François Chollet. 2019. On the Measure of Intelligence. *arXiv preprint arXiv:1911.01547* (2019).
- Fernando et al. (2023) Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktäschel. 2023. Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution. arXiv:2309.16797 \[cs.CL\] [https://arxiv.org/abs/2309.16797](https://arxiv.org/abs/2309.16797)
- Galiev and Lisafina (2013) Shamil I Galiev and Maria S Lisafina. 2013. Linear models for the approximate solution of the problem of packing equal circles into a given domain. *European Journal of Operational Research* 230, 3 (2013), 505–514.
- Graham and Lubachevsky (1996) Ronald L Graham and Boris D Lubachevsky. 1996. Dense packings of equal disks in an equilateral triangle: from 22 to 34 and beyond. *The Electronic Journal of Combinatorics* 2 (1996).
- Hu et al. (2024) Shengran Hu, Cong Lu, and Jeff Clune. 2024. Automated Design of Agentic Systems. In *arXiv preprint arXiv:2408.08435*.
- Khattab et al. (2023) Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. 2023. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. arXiv:2310.03714 \[cs.CL\] [https://arxiv.org/abs/2310.03714](https://arxiv.org/abs/2310.03714)
- Lange et al. (2025) Robert Tjarko Lange, Yuki Imajuku, and Edoardo Cetin. 2025. ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution. arXiv:2509.19349 \[cs.CL\] [https://arxiv.org/abs/2509.19349](https://arxiv.org/abs/2509.19349)
- Lehman et al. (2022) Joel Lehman, Jonathan Gordon, Shawn Jain, Kamal Ndousse, Cathy Yeh, and Kenneth O. Stanley.
2022. Evolution through Large Models. arXiv:2206.08896 \[cs.NE\] [https://arxiv.org/abs/2206.08896](https://arxiv.org/abs/2206.08896)
- Madaan et al. (2023) Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2023. Self-Refine: Iterative Refinement with Self-Feedback. *Advances in Neural Information Processing Systems (NeurIPS)* (2023).
- McCourt (2016) Michael McCourt. 2016. Optimization Test Functions. [https://github.com/sigopt/evalset](https://github.com/sigopt/evalset).
- Mouret and Clune (2015) Jean-Baptiste Mouret and Jeff Clune. 2015. Illuminating search spaces by mapping elites. arXiv:1504.04909 \[cs.AI\] [https://arxiv.org/abs/1504.04909](https://arxiv.org/abs/1504.04909)
- Novikov et al. (2025) Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. 2025. AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv:2506.13131 \[cs.AI\] [https://arxiv.org/abs/2506.13131](https://arxiv.org/abs/2506.13131)
- Opsahl-Ong et al. (2024) Krista Opsahl-Ong, Michael J Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, and Omar Khattab. 2024. Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs. arXiv:2406.11695 \[cs.CL\] [https://arxiv.org/abs/2406.11695](https://arxiv.org/abs/2406.11695)
- Ouyang et al. (2025) Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, and Azalia Mirhoseini.
2025. KernelBench: Can LLMs Write Efficient GPU Kernels? arXiv:2502.10517 \[cs.LG\] [https://arxiv.org/abs/2502.10517](https://arxiv.org/abs/2502.10517)
- Pryzant et al. (2023) Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. 2023. Automatic Prompt Optimization with “Gradient Descent” and Beam Search. In *Empirical Methods in Natural Language Processing (EMNLP)*.
- Romera-Paredes et al. (2024) Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, and Alhussein Fawzi. 2024. Mathematical discoveries from program search with large language models. *Nature* 625, 7995 (2024), 468–475. [doi:10.1038/s41586-023-06924-6](https://doi.org/10.1038/s41586-023-06924-6)
- Shao et al. (2024) Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300 \[cs.CL\] [https://arxiv.org/abs/2402.03300](https://arxiv.org/abs/2402.03300)
- Sharma (2025) Asankhaya Sharma. 2025. *OpenEvolve: an open-source evolutionary coding agent*. [https://github.com/algorithmicsuperintelligence/openevolve](https://github.com/algorithmicsuperintelligence/openevolve)
- Shinn et al. (2023) Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv:2303.11366 \[cs.AI\] [https://arxiv.org/abs/2303.11366](https://arxiv.org/abs/2303.11366)
- Tan et al. (2026) Shangyin Tan, Lakshya A Agrawal, Rohit Sandadi, Dan Klein, Koushik Sen, Alexandros G. Dimakis, and Matei Zaharia.
2026. Automatically Learning Skills for Coding Agents. [https://gepa-ai.github.io/gepa/blog/2026/02/18/automatically-learning-skills-for-coding-agents/](https://gepa-ai.github.io/gepa/blog/2026/02/18/automatically-learning-skills-for-coding-agents/). Blog post, February 18, 2026.
- Yang et al. (2024) Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, and Xinyun Chen. 2024. Large Language Models as Optimizers. arXiv:2309.03409 \[cs.LG\] [https://arxiv.org/abs/2309.03409](https://arxiv.org/abs/2309.03409)
- Yuksekgonul et al. (2024) Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. 2024. TextGrad: Automatic “Differentiation” via Text. arXiv:2406.07496 \[cs.CL\] [https://arxiv.org/abs/2406.07496](https://arxiv.org/abs/2406.07496)
- Zhang et al. (2025) Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, and Chenglin Wu. 2025. AFlow: Automating Agentic Workflow Generation. arXiv:2410.10762 \[cs.AI\] [https://arxiv.org/abs/2410.10762](https://arxiv.org/abs/2410.10762)
- Zhou et al. (2023) Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2023. Large Language Models Are Human-Level Prompt Engineers. arXiv:2211.01910 \[cs.LG\] [https://arxiv.org/abs/2211.01910](https://arxiv.org/abs/2211.01910)

## Appendix A. Use of Generative AI

The authors made use of Generative AI technologies including ChatGPT, Gemini, Claude and Cursor to generate sections of this Work, including text, tables, graphs, code, etc. The experiment design and details were explicitly the authors’ original ideas.
## Appendix B. Blackbox Mathematical Optimization

We additionally evaluate `optimize_anything` in single-task search mode on blackbox mathematical optimization, using the 56-problem EvalSet benchmark (McCourt, [2016](https://arxiv.org/html/2605.19633#bib.bib17)) against Optuna (Akiba et al., [2019](https://arxiv.org/html/2605.19633#bib.bib5)). Rather than tuning parameters within a fixed algorithm, `optimize_anything` optimizes the solver code itself, discovering bespoke algorithms for each problem. With a budget of 8,000 evaluations per problem, `optimize_anything` ties Optuna on 40 problems, wins 7, and loses 9. On 10 selected problems where Optuna struggles with lower budgets (2,000 evaluations), `optimize_anything` finds better solutions on 7 out of 10. The mechanism: Optuna’s fixed TPE-CMA-ES pipeline fails in predictable, structural ways (e.g., TPE’s per-dimension sampling converges to trap basins; CMA-ES assumes smooth unimodal landscapes). `optimize_anything` tailors the solver to each problem, discovering L-BFGS-B for boundary optima and multi-start search for deceptive traps.

## Appendix C. Seedless Mode: 3D Unicorn

Every main experiment starts from a seed artifact. Seedless mode (`seed_candidate=None`) instead provides only a natural-language objective and lets the LLM bootstrap the first candidate. We demonstrate this on a 3D modeling task: generating a Python script (build123d + pyrender) that produces a 3D unicorn. The evaluator renders multi-view PNGs and asks a VLM to score them, passing images back as SI. Starting from no code, `optimize_anything` iteratively refines geometry, proportions, and anatomical detail, producing a recognizable 3D unicorn that improves substantially over the zero-shot baseline.
## Appendix D. Detailed Algorithm

Algorithm 1. `optimize_anything`: Core optimization loop

Require: Artifact $\Phi_0$, evaluator $f$, dataset $\mathcal{D}$, budget $B$; minibatch size $b$, Pareto set size $n$

1. Initialize candidates $\mathcal{P} \leftarrow [\Phi_0]$
2. Evaluate $\Phi_0$ on $\mathcal{D}$; record per-example scores $S$
3. while budget $B$ not exhausted do
4. $k \leftarrow$ ParetoSelect($\mathcal{P}, S$) {Select based on frontier}
5. $\mathcal{M} \leftarrow$ minibatch of size $b$ from $\mathcal{D}$
6. Execute $\Phi_k$ on $\mathcal{M}$; collect scores and SI
7. $\Phi' \leftarrow$ Reflect($\Phi_k$, scores, SI) {LLM proposes fix}
8. if $\Phi'$ improves on $\mathcal{M}$ then
9. Evaluate $\Phi'$ on full $\mathcal{D}$
10. $\mathcal{P} \leftarrow \mathcal{P} \cup \{\Phi'\}$
11. Update $S$; prune dominated candidates
12. end if
13. end while
14. return $\Phi^{*} \in \mathcal{P}$ maximizing average score

Algorithm [1](https://arxiv.org/html/2605.19633#alg1) presents the core loop. For single-task search, the “dataset” is a singleton and per-example tracking reduces to per-metric tracking. For multi-task search, each dataset element is an independent problem. For generalization, scores on $\mathcal{D}$ guide search while a held-out `valset` measures generalization. The ParetoSelect subroutine is the candidate selection algorithm used in the default optimization backend, GEPA (Agrawal et al., [2026b](https://arxiv.org/html/2605.19633#bib.bib4)), which identifies non-dominated candidates and samples proportionally to their frontier frequency.

## Appendix E. Multi-Task Scaling Tables

Table 6. Multi-task scaling on 10 KernelBench problems. $f_{1.x}$: fraction of kernels achieving $\geq x$% speedup over the PyTorch baseline.

Table 7. Single-task vs. MT20 on 20 randomly sampled KernelBench problems.
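
The threshold metric reported in these tables and in Figure 6 can be sketched in a few lines of Python. This is a minimal illustration, assuming each kernel is summarized as a hypothetical `(is_correct, speedup)` pair where `speedup` is the ratio of baseline time to kernel time; whether incorrect kernels count against the fraction is an assumption here, and the function name `fast_p` is chosen only to mirror the notation.

```python
def fast_p(results, threshold):
    """Fraction of kernels that are correct and reach speedup >= threshold.

    results: list of (is_correct, speedup) pairs (hypothetical format).
    threshold: speedup ratio, e.g. 1.1 for "at least 10% faster than baseline".
    """
    if not results:
        return 0.0
    hits = sum(1 for is_correct, speedup in results if is_correct and speedup >= threshold)
    return hits / len(results)


if __name__ == "__main__":
    # Four toy kernels: two faster than baseline, one slower, one incorrect.
    toy = [(True, 1.25), (True, 1.05), (True, 0.9), (False, 2.0)]
    for s in (0.0, 1.0, 1.1, 1.2):
        print(f"Fast_p({s}) = {fast_p(toy, s):.2f}")
```

With a threshold of 0 the metric reduces to the fraction of correct kernels, which is why it is a natural leftmost point on the curves.
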
## Appendix F. Optimization Trajectory Analysis: Full Details

#### Mechanism 1: SI enables targeted algorithmic shifts.

SI works because it reveals *which* failure mode to address next, not merely that performance changed. In circle packing, SI-driven reflection produces a characteristic pattern: collapsed radii → switch to LP; poor center placement → switch to SLP; local saturation → switch to bilevel L-BFGS. Without SI, the proposer can only observe that the score changed, not why, and resorts to undirected mutations. The cross-domain SI ablation (Table [4](https://arxiv.org/html/2605.19633#S5.T4)) confirms this mechanism generalizes: on KernelBench with multi-task search, SI enables 40% of kernels to exceed 1.1× speedup vs. 0% with score-only feedback.

#### Mechanism 2: Multi-module Pareto leapfrogging.

`optimize_anything` optimizes both the code artifact and a refiner prompt, both tracked on the shared Pareto front. In circle packing, this creates a productive leapfrogging dynamic: the refiner discovers LP-based optimization while the code module is still a weak heuristic (code=0.98, refiner=1.93). The code module then absorbs the LP approach, catching up (→ 2.61). The refiner pushes further with SLP (→ 2.63). The code module absorbs SLP and reaches the world record. Each module’s advances become the foundation for the other’s next improvement, a coordination mechanism absent from single-artifact systems like AlphaEvolve. Even broken code mutations (score=0.0) are recovered by the refiner and retained on the front, acting as a safety net that preserves exploration.

#### Mechanism 3: Pareto diversity prevents premature convergence.

At convergence, the Pareto front retains candidates from multiple algorithmic families simultaneously (greedy, LP, SLP, bilevel L-BFGS, CMA-ES) across quality dimensions (max score, mean score, EMA stability, improvement rate).
This ensures the proposer has access to structurally diverse parents when generating new candidates, rather than being locked into refining a single approach. The preservation of diverse strategies is what enables the algorithmic shifts described above: even when LP dominates on raw score, greedy and CMA-ES candidates survive on stability metrics and can seed novel hybrid approaches.

## Appendix G Proposer Sensitivity and Optimization Cost

Table 8. Proposer LLM sensitivity. GPT-5-nano reduces cost significantly but underperforms GPT-5.1 on final achieved performance. Both models improve substantially over the seed.

Table 9. Total optimization cost per experiment. Reflection cost is minimal; total spend is dominated by the evaluator.

## Appendix H Image Generation Details

Table 10. Image generation goals. A VLM evaluator scores one visual aspect per call; multi-task search explores the Pareto frontier of visual properties.

For SVG tasks, the evaluator renders the image and queries a VLM for feedback. For each goal, we define several natural-language properties, each of which asks a VLM to rate on a scale of 0 to 100 how well the image aligns with that aspect. During each evaluator call, the VLM rates one aspect (not all at once), making this a natural multi-task search over the Pareto frontier. In the CAD setting, since we are dealing with 3D objects, the evaluator takes 3 equidistant screenshots and asks the VLM to provide feedback using those images.

## Appendix I Optimized AIME Prompt

Optimized Prompt for AIME

Solve the math problem carefully and thoroughly. Your goal is to produce a correct, well-structured solution that leads unambiguously to the requested final result. Follow these rules:

1. Restate the problem briefly in your own words.
2. Set up notation and equations cleanly before manipulating them.
   - Define variables explicitly.
   - State all constraints (e.g., integrality, ranges, geometric conditions) before using them.
3. Show clear, logically ordered reasoning.
   - Justify each important algebraic or geometric step.
   - When you split into cases, state why each case is necessary and what assumptions define it.
   - If you invoke a known theorem (e.g., Ptolemy, Power of a Point, similarity, Vieta), name it and show exactly how it applies in this context.
4. Handle dead ends correctly.
   - If you realize a line of reasoning leads to a contradiction or dead end, explicitly say so.
   - Then restart from the last correct point; do not guess or hand-wave.
5. Keep the reasoning focused and minimal while still being rigorous.
   - Avoid unnecessary numerical approximations if an exact approach is available.
   - Do not approximate exact values unless the problem explicitly asks for a decimal.
   - Prefer algebraic or structural arguments over trial-and-error or random guessing.
   - You may test candidate values only after deriving strong constraints that sharply limit the possibilities.
6. At the end, clearly isolate the answer:
   - Provide the final answer as a single number or expression on its own line.
   - Do not include any extra words, symbols, or explanation on that final line.

## Appendix J Discovered solutions

We present excerpts of the final optimized artifacts discovered by `optimize_anything` for each domain.

### J.1. Coding Agent Skills: Bleve Repository

The following is the optimized `SKILL.MD` excerpt discovered by `optimize_anything` for the Bleve search library:

Optimized Bleve Skills (excerpt)

4) Run tests early and iterate from failures (tests are the bug report)
   - Start broad when feasible: `cd /testbed && go test ./...` (or project equivalent).
   - Narrow quickly:
     - package: `go test ./path/to/pkg`
     - single test: `go test ./path/to/pkg -run TestName -count=1` (add -v only if needed)
   - For panics: follow the stack trace top frame in repo code first.
   - For mismatches: use “expected vs got” to locate the producing function and invariants.

...

7) Make minimal, reviewable changes and verify continuously
   - Change one behavior at a time; rerun the smallest reproducing test after each change.
   - Add focused unit tests when coverage is missing; keep them in the same package and table-driven where sensible (include short words + accented/Unicode edge cases).
   - Avoid scratch main.go files in repo root.

### J.2. ARC-AGI Agent Architecture

The optimized agent grew from a 10-line seed to a 300+ line system implementing a 4-stage pipeline: rule induction via pattern analysis, code generation with `exec()`-based verification, iterative debugging with up to 2 fix attempts, and structured fallback from code-first to direct LLM prediction.

Figure 10. Architecture of the optimized ARC-AGI agent. The system discovers a 4-stage pipeline with verify-then-fallback logic, starting from a naive single-call seed.

### J.3. CloudCast Routing Algorithm

The optimized CloudCast algorithm (178 lines) discovers provider-aware Steiner tree routing with egress cost optimization, a qualitative departure from the Dijkstra seed. We show the main `search_algorithm` function; the full artifact is available in the supplementary material.

Optimized CloudCast Algorithm (excerpt)

    def search_algorithm(src, dsts, G, num_partitions):
        """Optimized Broadcast Routing Algorithm v3.

        Key Optimizations:
        1. Provider-Aware Weighting: biases pathfinding towards
           intra-provider links to minimize egress.
        2. Pareto-Frontier Candidate Selection: Explicitly keeps
           candidates that offer distinct cost/time tradeoffs.
        3. Diverse Steiner Strategies: Includes MST-like
           approximations for cost and bottleneck-widest
           paths for throughput.
        4. Robust Greedy Allocation: Accurately models bandwidth
           contention across partitions.
        """
        EST_DATA_VOL_GB = 300.0
        EST_INSTANCE_COST_PER_HR = 10.0
        PARTITION_VOL_GB = EST_DATA_VOL_GB / max(1, num_partitions)
        alphas = [0.0, 1e-5, 0.001, 0.01, 0.05, 0.1, 0.5, 2.0]
        bw_thresholds = [0.0, 0.5, 5.0, 20.0]
        strategies = ['prim', 'prim', 'furthest', 'random']

### J.4. Can't Be Late Scheduling Policy

The optimized scheduling policy (110 lines) starts from a simple deadline-check heuristic and discovers three key behaviors absent from the seed: (1) break-even switching cost analysis that avoids costly SPOT $\to$ ON_DEMAND transitions when remaining work is small, (2) persistent spot-unavailability tracking via a counter that detects when SPOT is unlikely to return, and (3) graduated decision thresholds based on slack ratio that become increasingly aggressive as the deadline approaches. We show the core `_step` method; the full artifact includes `reset()` and additional edge-case guards.

Optimized Can't Be Late Policy (excerpt)

    from sky_spot.strategies.strategy import Strategy
    from sky_spot.utils import ClusterType


    class EvolveSingleRegionStrategy(Strategy):
        def __init__(self, args):
            super().__init__(args)
            self.spot_unavailable_count = 0
            self.consecutive_short_spot_windows = 0

        def _step(self, last_cluster_type, has_spot) -> ClusterType:
            remaining_task_time = self.task_duration - sum(self.task_done_time)
            remaining_time = self.deadline - self.env.elapsed_seconds
            slack = remaining_time - remaining_task_time - self.restart_overhead

            if not has_spot:
                self.spot_unavailable_count += 1
            else:
                self.spot_unavailable_count = 0

            if remaining_task_time + self.restart_overhead >= remaining_time - 0.5:
                return ClusterType.ON_DEMAND

            slack_ratio = slack / max(remaining_task_time, 1e-6)

            if has_spot:
                if last_cluster_type == ClusterType.ON_DEMAND:
                    switch_cost = self.restart_overhead * 1.0
                    savings_per_hour = 0.7
                    break_even = switch_cost / savings_per_hour
                    if remaining_task_time < break_even * 1.5:
                        return ClusterType.ON_DEMAND
                    if slack < self.restart_overhead * 3:
                        return ClusterType.ON_DEMAND
                return ClusterType.SPOT
            else:
                if last_cluster_type == ClusterType.ON_DEMAND:
                    return ClusterType.ON_DEMAND
                if slack_ratio < 0.1:
                    return ClusterType.ON_DEMAND
                if slack_ratio < 0.25 and self.spot_unavailable_count > 10:
                    return ClusterType.ON_DEMAND
                if slack_ratio < 0.4 and self.spot_unavailable_count > 20:
                    return ClusterType.ON_DEMAND
                return ClusterType.NONE

### J.5. CUDA Kernel: LayerNorm

We show the best individual kernel discovered for LayerNorm, which achieves a 3.32$\times$ speedup over the PyTorch baseline. The kernel employs three key techniques absent from the naive implementation: (1) `float4` vectorization that loads four values per memory transaction, cutting memory overhead by $\sim$4$\times$; (2) a two-pass algorithm (compute statistics, then normalize) that lets the GPU optimize each phase independently; and (3) warp shuffle reductions (`__shfl_down_sync`) for direct register-to-register partial sum accumulation, bypassing slower shared memory paths. This kernel was discovered in multi-task mode, where optimization patterns transfer across the 31 KernelBench problems via the shared Pareto frontier.

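The warp shuffle reduction named in technique (3) can be hard to picture from prose alone, so here is a pure-Python emulation of the access pattern. It is a sketch for intuition only: `shfl_down` and `warp_sum` are illustrative Python helpers written for this note, not CUDA API calls and not part of the discovered kernel, and the emulation assumes a full 32-lane mask. At each step every lane adds the value held by the lane `offset` positions above it, and `offset` halves until lane 0 holds the total.

```python
WARP_SIZE = 32  # lanes per warp on current NVIDIA GPUs


def shfl_down(values, offset):
    """Emulate __shfl_down_sync with a full mask: lane i reads lane i + offset.

    Lanes whose source would fall outside the warp get their own value back.
    """
    n = len(values)
    return [values[i + offset] if i + offset < n else values[i] for i in range(n)]


def warp_sum(values):
    """Tree reduction over one warp; lane 0 ends up holding the total."""
    lanes = list(values)
    offset = len(lanes) // 2
    while offset > 0:
        shifted = shfl_down(lanes, offset)
        lanes = [a + b for a, b in zip(lanes, shifted)]
        offset >>= 1
    return lanes


lanes = warp_sum(range(WARP_SIZE))
print(lanes[0])  # 496, the sum of 0..31
```

Only lane 0 is guaranteed to hold the full sum after the loop; the upper lanes contain partial results. The reduction finishes in five steps for a 32-lane warp, and each step moves data directly between registers rather than through shared memory.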
Optimized LayerNorm CUDA Kernel (excerpt)

    __inline__ __device__ float warp_sum(float v) {
        unsigned mask = 0xffffffffu;
        for (int offset = KB_WARP_SIZE / 2; offset > 0; offset >>= 1)
            v += __shfl_down_sync(mask, v, offset);
        return v;
    }

    __global__ void rowwise_stats_kernel(
        const float* __restrict__ x, float* __restrict__ mean,
        float* __restrict__ inv_std, int64_t B, int64_t M, float eps) {
        int64_t row = blockIdx.x;
        if (row >= B) return;
        const float* row_ptr = x + row * M;
        float thread_sum = 0.0f, thread_sumsq = 0.0f;
        const float4* row_v4 = reinterpret_cast<const float4*>(row_ptr);
        for (int64_t j = threadIdx.x; j < (M >> 2); j += blockDim.x) {
            float4 v = row_v4[j];
            thread_sum += (v.x + v.y + v.z + v.w);
            thread_sumsq += (v.x * v.x + v.y * v.y + v.z * v.z + v.w * v.w);
        }
        thread_sum = warp_sum(thread_sum);
        thread_sumsq = warp_sum(thread_sumsq);
    }

    __global__ void layernorm_affine_kernel(
        const float* __restrict__ x, const float* __restrict__ weight,
        const float* __restrict__ bias, const float* __restrict__ mean,
        const float* __restrict__ inv_std, float* __restrict__ y,
        int64_t B, int64_t M) {
        int64_t row = blockIdx.x;
        float m = mean[row], inv = inv_std[row];
        const float4* x_v4 = reinterpret_cast<const float4*>(x + row * M);
        float4* y_v4 = reinterpret_cast<float4*>(y + row * M);
        for (int64_t j = threadIdx.x; j < (M >> 2); j += blockDim.x) {
            float4 xv = x_v4[j], wv = w_v4[j], bv = b_v4[j];
            y_v4[j] = {((xv.x - m) * inv) * wv.x + bv.x, ((xv.y - m) * inv) * wv.y + bv.y,
                       ((xv.z - m) * inv) * wv.z + bv.z, ((xv.w - m) * inv) * wv.w + bv.w};
        }
    }

### J.6. Circle Packing Algorithm

The evolved circle packing algorithm (480+ lines) is a bilevel optimizer that jointly optimizes circle centers and radii for $n=26$ circles in a unit square.
Starting from a simple greedy packing seed, the system discovers a multi-stage architecture: (1) an LP over radii with dual-variable sensitivities that provide exact gradients for center optimization, (2) L-BFGS-B over centers using these LP-derived gradients, (3) block SLP trust-region boosts targeting the worst-performing circles, (4) CMA-ES global exploration with automatic restarts, and (5) aggressive relocation of smallest circles to edges and corners. The algorithm also employs six diverse seeding strategies (hexagonal, uniform, edge-ring, farthest-point, corner-spokes, and edge-biased hex) to avoid local optima. We show the main entry point and key optimization components.

Evolved Circle Packing Algorithm (excerpt)

    def main(timeout, current_best_solution):
        """Bilevel L-BFGS with exact LP sensitivities +
        SLP block boosts + CMA/Evolution fallback"""
        n = 26

        def solve_radii_lp(centers, need_duals=False):
            res = linprog(c_obj, A_ub=A_ub, b_ub=b_ub, ...)
            return r, success, {'dual': res.ineqlin.marginals}

        def gradient_from_duals(centers, dual_vec):
            return g

        def lbfgs_bilevel(centers_init, max_iters=300):
            def f_and_g(flat):
                r, _, info = solve_radii_lp(centers, need_duals=True)
                g = gradient_from_duals(centers, info['dual'])
                return -score, -g.reshape(-1)
            minimize(f_and_g, method='L-BFGS-B', bounds=bounds)

        def block_slp_boost(centers, rounds=4, k=10, delta=0.18):

Figure 11. Qualitative comparison between zero-shot generations (left) and `optimize_anything` candidates (right) across four example tasks. Optimization consistently improves many visual aspects including composition, structure, detail, and overall visual quality.

## Appendix K Demonstration

A 4-minute demo video and accompanying artifacts are available at [https://drive.google.com/drive/folders/1mfd8xny_YRri5UYwTxKoBs3CJ_cpxpMr](https://drive.google.com/drive/folders/1mfd8xny_YRri5UYwTxKoBs3CJ_cpxpMr).
The demo showcases the generality of `optimize_anything` through two end-to-end scenarios: evolving ARC-AGI agents and optimizing circle packing algorithms.

#### Scenario 1: Evolving ARC-AGI agents.

Starting from a naive 10-line agent (a single LLM call), `optimize_anything` iteratively evolves it into a 300+ line multi-stage pipeline with sub-agents, code generation, iterative debugging, and structured fallback logic. SI (per-puzzle execution traces, error tracebacks, and model outputs) drives targeted architectural improvements. The final agent reaches 89.5% accuracy on ARC-AGI (Chollet, [2019](https://arxiv.org/html/2605.19633#bib.bib8)) test puzzles using Gemini 3 Flash as both proposer and agent model (§[5.3](https://arxiv.org/html/2605.19633#S5.SS3)).

#### Scenario 2: Optimizing circle packing.

We demonstrate single-task search on packing $n=26$ circles in a unit square to maximize the sum of radii. `optimize_anything` evolves a simple greedy packing seed into a 480+ line bilevel optimizer using LP-derived gradients and CMA-ES exploration, outperforming the solution reported for AlphaEvolve (Novikov et al., [2025](https://arxiv.org/html/2605.19633#bib.bib19)) (§[5.6](https://arxiv.org/html/2605.19633#S5.SS6)). The demo visualizes how the system discovers novel algorithmic components not present in the seed.

#### Live demonstration.

The demo runs both scenarios through Jupyter notebooks, allowing observation of optimization trajectories, inspection of intermediate candidates, and exploration of how diagnostic feedback drives improvements.

## Appendix L Artifact Availability

`optimize_anything` is open-sourced as part of the GEPA project. The source code is available at [https://github.com/gepa-ai/gepa](https://github.com/gepa-ai/gepa). A tutorial-style introduction is available in the accompanying blog post (Agrawal et al., [2026a](https://arxiv.org/html/2605.19633#bib.bib3)).
The complete reproduction artifact accompanying this paper is publicly available at [https://github.com/gepa-ai/optimize-anything-artifact](https://github.com/gepa-ai/optimize-anything-artifact) under the `acm_cais_artifact_evaluation/` directory. Each evaluation domain has its own subdirectory under `domains/` with runnable `optimize_anything` code, a `README.md` mapping the folder to the relevant section of this paper, and the saved `GEPAState` checkpoint from the paper run. See the top-level `README.md` for the reproduction guide.

#### Hardware notes.

Most domains run on a single CPU host with API access to the proposer and refiner LLMs (the paper used GPT-5/5.1, Gemini 3 Flash, and Claude Opus 4.6 depending on domain; exact identifiers are documented per domain). The KernelBench domain requires an NVIDIA V100 32GB GPU with CUDA 12.1+.