APEX: Automated Prompt Engineering eXpert with Dynamic Data Selection

arXiv cs.CL Papers

Summary

APEX introduces a dynamic data selection strategy for automatic prompt optimization, stratifying datasets into easy, hard, and mixed tiers to improve data efficiency, achieving significant performance gains over initial prompts on multiple benchmarks.

arXiv:2606.11459v1 Announce Type: new Abstract: Large Language Models are highly sensitive to prompt formulation, necessitating automatic prompt optimization to unlock their full potential. While evolutionary algorithms have emerged as the dominant paradigm, they suffer from a critical bottleneck: data efficiency. Current methods treat the development dataset as a static benchmark, wasting significant compute budget on uninformative data. In this work, we introduce APEX (Automated Prompt Engineering eXpert), a novel framework that optimizes the data usage alongside the prompt search. APEX dynamically stratifies the dataset into Easy, Hard, and Mixed tiers based on the optimization lineage. By prioritizing the Mixed tier, which identifies the data where the LLM has mixed performance, we identify two high-leverage subsets: the addressable frontier for generating informative mutations and the rank-sensitive frontier for distinguishing candidate quality. We evaluate APEX across three diverse benchmarks: IFBench, SimpleQA Verified, and FACTS Grounding. Under a fixed budget of 5,000 evaluation calls, due to its data efficiency, APEX outperforms the initial prompt by an average of 11.2% on Gemini 2.5 Flash and 6.8% on Gemma 3 27B, demonstrating that a data-centric approach is key to efficient and effective prompt optimization.


# Automated Prompt Engineering eXpert with Dynamic Data Selection
Source: [https://arxiv.org/html/2606.11459](https://arxiv.org/html/2606.11459)
Fei Wang (Google, feiwangnlp@google.com), Si Si (Google, sisidaisy@google.com), Cho-Jui Hsieh (UCLA, chohsieh@cs.ucla.edu), Inderjit S. Dhillon (Google, isd@google.com)

###### Abstract

Large Language Models are highly sensitive to prompt formulation, necessitating automatic prompt optimization to unlock their full potential. While evolutionary algorithms have emerged as the dominant paradigm, they suffer from a critical bottleneck: data efficiency. Current methods treat the development dataset as a static benchmark, wasting significant compute budget on uninformative data. In this work, we introduce APEX (Automated Prompt Engineering eXpert), a novel framework that optimizes the data usage alongside the prompt search. APEX dynamically stratifies the dataset into Easy, Hard, and Mixed tiers based on the optimization lineage. By prioritizing the Mixed tier, which identifies the data where the LLM has mixed performance, we identify two high-leverage subsets: the addressable frontier for generating informative mutations and the rank-sensitive frontier for distinguishing candidate quality. We evaluate APEX across three diverse benchmarks: IFBench, SimpleQA Verified, and FACTS Grounding. Under a fixed budget of 5,000 evaluation calls, due to its data efficiency, APEX outperforms the initial prompt by an average of 11.2% on Gemini 2.5 Flash and 6.8% on Gemma 3 27B, demonstrating that a data-centric approach is key to efficient and effective prompt optimization.

## 1 Introduction

Large Language Models (LLMs) have demonstrated remarkable capabilities across instruction following, fact retrieval, and complex reasoning. However, their performance remains notoriously sensitive to prompt wording (Kojima et al., [2022](https://arxiv.org/html/2606.11459#bib.bib34)). A minor change can often mean the difference between a correct answer and a hallucination. This fragility has given rise to automatic prompt optimization, a field dedicated to treating prompts not as static strings, but as learnable parameters that can be optimized (Zhou et al., [2023b](https://arxiv.org/html/2606.11459#bib.bib13); Pryzant et al., [2023](https://arxiv.org/html/2606.11459#bib.bib2); Deng et al., [2022](https://arxiv.org/html/2606.11459#bib.bib12); Hsieh et al., [2024](https://arxiv.org/html/2606.11459#bib.bib18)).

Currently, genetic and evolutionary algorithms are the dominant paradigm in prompt optimization, exemplified by methods like APO (Pryzant et al., [2023](https://arxiv.org/html/2606.11459#bib.bib2)) and GEPA (Agrawal et al., [2025](https://arxiv.org/html/2606.11459#bib.bib1)). These frameworks typically operate through an iterative two-stage process: (1) Mutation, where the optimizer analyzes error cases from the current prompt to generate new candidate prompts, and (2) Selection, where these candidates are evaluated and ranked by their performance to identify the better ones.

However, this paradigm faces a critical and often overlooked bottleneck: data efficiency. The effectiveness of both stages depends heavily on the quality of the data used. The mutation stage relies on identifying informative failure cases that point towards fixable errors, while the selection stage requires a rigorous evaluation to accurately rank candidates. Current methods typically resort to naive data strategies: random sampling for mutation, and either random sampling or full-set evaluation for selection. This creates a fundamental dilemma: random sampling leads to rank instability, where superior prompts are discarded due to variance, while full-set evaluation depletes the compute budget rapidly, limiting the search to only a few generations.

We argue that the root cause of this inefficiency is the treatment of the dataset as a static benchmark. In reality, the utility of any given data point is not fixed; it changes dynamically throughout the optimization process. As prompts evolve and improve, examples that were once informative discriminators may converge into consistently Easy cases (solved by all candidates), while others persist as Hard (intractable noise). Continuing to mutate based on unsolvable errors, or evaluating candidates on examples that have lost their discriminatory power, wastes a significant portion of the compute budget on signals that provide zero improvement information.

To address this, we introduce APEX (Automated Prompt Engineering eXpert). APEX shifts the focus from improving the search algorithm to improving the data usage. By maintaining a lineage of prompt performance, APEX dynamically categorizes data points into Easy, Hard, and Mixed tiers. Our core insight is that the informative mutation and selection signals primarily lie in the Mixed tier, where the LLM has mixed performance on recent prompt variants. This dynamic tier comprises two critical frontiers: the addressable frontier, which contains volatile errors suitable for mutation, and the rank-sensitive frontier, which contains ambiguous cases that differentiate candidates. By concentrating compute on these high-leverage frontiers, APEX maximizes the optimization signal per evaluation call.

In summary, our contributions are three-fold. First, we identify the data efficiency bottleneck in current genetic prompt optimization methods and attribute it to the static treatment of dynamic data signals. Second, we propose APEX, a novel algorithm that enhances the prompt optimization process by stratifying data into dynamic tiers and targeting the “Addressable” and “Rank-Sensitive” frontiers. Third, we achieve performance gains across three diverse benchmarks (IFBench, SimpleQA Verified, FACTS Grounding). Under a fixed budget of 5,000 evaluation calls, APEX improves over the initial prompt by an average of 11.2% on Gemini 2.5 Flash and 6.8% on Gemma 3 27B, establishing a new standard for data-efficient prompt optimization.

## 2 Related Work

### 2.1 Automatic Prompt Optimization

Early research explored diverse search strategies for prompt engineering, ranging from gradient-based methods (Shin et al., [2020](https://arxiv.org/html/2606.11459#bib.bib11); Deng et al., [2022](https://arxiv.org/html/2606.11459#bib.bib12); Chen et al., [2024a](https://arxiv.org/html/2606.11459#bib.bib19)) and edit-based heuristics (Prasad et al., [2023](https://arxiv.org/html/2606.11459#bib.bib10)) to instruction induction (Honovich et al., [2023](https://arxiv.org/html/2606.11459#bib.bib9)). They have paved the way for LLM-driven optimization frameworks, which formalize prompt optimization by leveraging LLMs to generate candidate instructions (Zhou et al., [2023b](https://arxiv.org/html/2606.11459#bib.bib13); Yang et al., [2024](https://arxiv.org/html/2606.11459#bib.bib14)).

As the field matured, genetic (or evolutionary) algorithms utilizing natural language feedback emerged as the most effective paradigm. A representative method is APO (Pryzant et al., [2023](https://arxiv.org/html/2606.11459#bib.bib2)), which iteratively refines prompts by treating error feedback as a signal for mutation. Parallel work has explored other directions to improve the genetic approach by integrating history guidance, hybrid search strategies, and carefully curated meta-prompts (Hsieh et al., [2024](https://arxiv.org/html/2606.11459#bib.bib18); Fernando et al., [2024](https://arxiv.org/html/2606.11459#bib.bib15); Wang et al., [2024b](https://arxiv.org/html/2606.11459#bib.bib16); Ye et al., [2024](https://arxiv.org/html/2606.11459#bib.bib17); Guo et al., [2024](https://arxiv.org/html/2606.11459#bib.bib28)). Notably, AELP (Hsieh et al., [2024](https://arxiv.org/html/2606.11459#bib.bib18)) is designed for long prompts, utilizing history guidance and limiting the mutation scope to ensure stability. Recent research has also expanded the scope of prompt optimization to agentic workflows (Agrawal et al., [2025](https://arxiv.org/html/2606.11459#bib.bib1); Opsahl-Ong et al., [2024](https://arxiv.org/html/2606.11459#bib.bib30)), in-context learning (Wu et al., [2024](https://arxiv.org/html/2606.11459#bib.bib21); Wan et al., [2024](https://arxiv.org/html/2606.11459#bib.bib8)), and multimodal scenarios (Liu et al., [2024](https://arxiv.org/html/2606.11459#bib.bib32); Wan et al., [2025](https://arxiv.org/html/2606.11459#bib.bib31)). Among them, GEPA (Agrawal et al., [2025](https://arxiv.org/html/2606.11459#bib.bib1)) employs a Genetic-Pareto framework to optimize agents towards a multi-objective frontier.

Despite these advancements, the field has largely overlooked the data-centric perspective of prompt optimization. Most existing methods treat the dataset as a static benchmark, ignoring the varying informational value of different examples. APEX addresses this fundamental inefficiency by improving the data selection process itself. By dynamically tracking the rank-sensitivity and solvability of datapoints, APEX directs compute towards the examples that carry the most optimization signal.

### 2.2 Data Selection for LLM Alignment

The importance of data quality over quantity for LLM alignment was highlighted by the LIMA hypothesis (Zhou et al., [2023a](https://arxiv.org/html/2606.11459#bib.bib22)), which provided empirical evidence that a small, curated set of examples is sufficient for alignment. This insight catalyzed a wave of automated data selection methods. AlpaGasus (Chen et al., [2024b](https://arxiv.org/html/2606.11459#bib.bib23)) filters low-quality data to improve fine-tuning efficiency, while LESS (Xia et al., [2024](https://arxiv.org/html/2606.11459#bib.bib24)) selects data based on gradient similarity. Additionally, influence-based approaches (Choe et al., [2025](https://arxiv.org/html/2606.11459#bib.bib33)) utilize influence functions to identify which training examples most affect downstream capabilities. More recent work has shifted towards dynamic and active data curation, such as Data Advisor (Wang et al., [2024a](https://arxiv.org/html/2606.11459#bib.bib25)), which dynamically curates data to improve safety alignment.

Parallel to these training-time advances, data selection has recently garnered attention in prompt optimization. Notable approaches, such as those by Diao et al. ([2024](https://arxiv.org/html/2606.11459#bib.bib26)) and Dong et al. ([2025](https://arxiv.org/html/2606.11459#bib.bib29)), employ active learning or combine data similarity and model confidence heuristics to curate examples. However, the application scope of these methods is limited by restrictive dependencies, such as human-in-the-loop annotation, reliance on internal model signals (e.g., logits), or restriction to classification tasks. These requirements render them unsuitable for general generative tasks under fully automated, black-box API settings, and not directly comparable to our work. APEX extends the principles of identifying informative data and dynamic curation to prompt optimization, offering a fully automated, black-box solution.

![Refer to caption](https://arxiv.org/html/2606.11459v1/figure/APEX.png)

Figure 1: The APEX framework overview. The optimization process begins by tracking prompt lineage (left), where historical performance across ancestors (e.g., $p_2$) and children (e.g., $p_6$) is analyzed over a defined lookback window. In the Data Dynamics phase (center), datapoints are dynamically categorized into tiers (Easy, Hard, and Mixed) based on pass/fail trajectories. The Mixed set identifies informative datapoints where model performance is inconsistent, signaling high potential for optimization. These tiers inform the Mutation and Selection stages (right): the former leverages mutation history to explore unvisited prompt space, while the latter employs stratified sampling to efficiently rank candidate prompts. Notably, both stages prioritize the Mixed set to focus optimization on high-signal datapoints. Evaluation is incremental, skipping previously evaluated datapoints to maximize efficiency.

## 3 Bottleneck of Prompt Optimization

### 3.1 Preliminary: Prompt Optimization

Prompts serve as the primary interface for interacting with LLMs at the inference stage. For a given user intent, there exists an expansive space of potential prompt variants, many of which yield significantly different performance outcomes. The goal of prompt optimization is to systematically identify an optimal prompt $P^{*}$ that maximizes a predefined performance metric on a target LLM:

$$P^{*} = \arg\max_{P} \mathbb{E}_{x \sim \mathcal{D}^{*}}\left[f(x, \text{LLM}(P, x))\right], \tag{1}$$

where $\mathcal{D}^{*}$ represents the oracle data distribution and $f(\cdot)$ is an evaluation function (e.g., string-matching accuracy or model-based grading) that scores the LLM's response relative to input $x$.

To simplify notation across the optimization trajectory, we define $s(P, x) \in \{0, 1\}$ as the binary evaluation outcome of prompt $P$ on datapoint $x$, where only a perfect score yields a pass ($1$) and any partial credit is treated as a failure ($0$).

In practice, both the LLM and the evaluation function $f$ are treated as black boxes where gradients are unavailable. Consequently, derivative-free genetic algorithms, such as APO (Pryzant et al., [2023](https://arxiv.org/html/2606.11459#bib.bib2)) and GEPA (Agrawal et al., [2025](https://arxiv.org/html/2606.11459#bib.bib1)), have emerged as leading methods. These methods typically iterate through two primary stages on a collected dataset $\mathcal{D}$:

**Mutation:** To explore the prompt space, a meta-optimizer $\text{LLM}_{\text{meta}}$ (typically the target LLM itself) generates a new prompt $P_{\text{new}}$ conditioned on an existing prompt $P_{\text{curr}}$ and a batch of its error cases $\mathcal{E}_{\text{curr}} \subset \{x \in \mathcal{D} \mid s(P_{\text{curr}}, x) = 0\}$:

$$P_{\text{new}} \sim \text{LLM}_{\text{meta}}(P_{\text{curr}}, \mathcal{E}_{\text{curr}}). \tag{2}$$

**Selection:** The new candidate prompt $P_{\text{new}}$ is evaluated alongside the current population $\mathcal{P}_t$. The subsequent generation $\mathcal{P}_{t+1}$ is constructed by selecting the $k$ highest-performing prompts based on their aggregate scores on $\mathcal{D}$:

$$\mathcal{P}_{t+1} = \text{Top-}k\left(\mathcal{P}_t \cup \{P_{\text{new}}\} \mid \mathcal{D}\right). \tag{3}$$

### 3.2 Problem: Data Efficiency

Despite their success, data efficiency remains a critical but overlooked bottleneck in both the mutation and selection stages, hindering the scalability of existing frameworks. We formalize the global optimization state as a sparse score matrix $\mathbf{S} \in \{0, 1, \emptyset\}^{|\mathcal{P}| \times |\mathcal{D}|}$, where each entry $S_{j,i}$ represents the performance of prompt $P_j$ on datapoint $x_i \in \mathcal{D}$:

$$S_{j,i} = s(P_j, x_i) \in \{0, 1\}, \quad (\text{or } \emptyset \text{ if not applicable}). \tag{4}$$
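
A minimal sketch of how such a sparse score matrix could be stored, assuming integer prompt and datapoint indices. The class name `ScoreMatrix` and its methods are hypothetical and not part of APEX; a missing entry stands in for $\emptyset$, and the binarization follows the "perfect score only" rule defined above.

```python
from typing import Dict, Optional


class ScoreMatrix:
    """Sparse store of binary outcomes S[j, i]; a missing entry plays the role of the empty value."""

    def __init__(self) -> None:
        # Hypothetical layout: prompt index -> {datapoint index -> 0 or 1}.
        self._scores: Dict[int, Dict[int, int]] = {}

    def record(self, prompt_id: int, datapoint_id: int,
               raw_score: float, max_score: float = 1.0) -> None:
        # Only a perfect score counts as a pass; partial credit is a failure.
        outcome = 1 if raw_score >= max_score else 0
        self._scores.setdefault(prompt_id, {})[datapoint_id] = outcome

    def get(self, prompt_id: int, datapoint_id: int) -> Optional[int]:
        # None means the prompt was never evaluated on this datapoint.
        return self._scores.get(prompt_id, {}).get(datapoint_id)

    def density(self, num_prompts: int, num_datapoints: int) -> float:
        # Fraction of the |P| x |D| grid that has actually been evaluated.
        filled = sum(len(row) for row in self._scores.values())
        return filled / (num_prompts * num_datapoints)
```

Storing only evaluated entries keeps memory proportional to the number of evaluation calls actually spent rather than to $|\mathcal{P}| \times |\mathcal{D}|$.
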
#### Inefficiency in Mutation\.

The mutation distribution produced by $\text{LLM}_{\text{meta}}$ depends heavily on $\mathcal{E}_{\text{curr}}$. Current methods typically rely on random sampling of error cases, which fails to account for the optimization trajectory, rendering the mutation process a source of instability rather than monotonic improvement.

Optimization performance is often governed by a hierarchy of “fixability,” where certain errors serve as logical precursors to others. For instance, if an error batch $\mathcal{E}_b$ contains errors that are only addressable after foundational issues in $\mathcal{E}_a$ are resolved, the trajectory must follow the sequence $\mathcal{E}_a \rightarrow \mathcal{E}_b$. Reversing this sequence or sampling randomly leads to high-variance updates and “forgetting,” as $\text{LLM}_{\text{meta}}$ attempts to solve high-level failures without the necessary prompt refinements established in earlier stages.

This lack of trajectory awareness also frequently leads to a trap in which $\mathcal{E}_{\text{curr}}$ becomes dominated by “impossible” cases that are fundamentally beyond the capacity of the target LLM. Attempting to optimize for these points misleads $\text{LLM}_{\text{meta}}$, stalling progress. Notably, the optimal data for mutation is non-static; it changes dynamically along the optimization process as different error types become addressable at different stages of prompt maturity.

#### Inefficiency in Selection\.

The selection stage represents the primary computational overhead of the optimization loop, often accounting for more than 90% of total costs.¹ This creates a fundamental trade-off between selection accuracy and the number of iterations $T$ possible within a fixed budget. While comprehensive evaluation ensures precise ranking, it is computationally prohibitive. Conversely, stochastically sampling a subset introduces estimation noise that leads to rank inversion, where suboptimal prompts are mistakenly retained.

¹ Generating a mutation often requires fewer than three API calls, whereas the selection phase requires evaluating the new prompt across the entire dataset or multiple candidates on subsets, frequently requiring hundreds of API calls, as reported in APO and GEPA.

Crucially, absolute performance estimates are often redundant. For selection, the priority is the relative rank of candidates. Because the population consists of parents, children, and siblings that share significant structural or semantic similarities, their performance is identical for a large portion of $x \in \mathcal{D}$. The true bottleneck is the failure to isolate discriminative data:

$$\mathcal{D}_{\text{disc}} = \{x_i \in \mathcal{D} \mid \exists P_a, P_b \in \mathcal{P}_t : s(P_a, x_i) \neq s(P_b, x_i)\}. \tag{5}$$

Existing frameworks waste their budget evaluating $x \notin \mathcal{D}_{\text{disc}}$, where all candidates perform identically. Evaluating these points provides zero information for selection, effectively reducing the number of meaningful updates possible within a fixed compute budget.
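
The discriminative set in Eq. (5) can be illustrated with a short sketch. The function name `discriminative_points` and the dictionary layout are assumptions made for illustration; unevaluated entries are represented by `None` (or a missing key) and are ignored when checking for disagreement.

```python
from typing import Dict, Hashable, List, Optional


def discriminative_points(
    scores: Dict[Hashable, Dict[Hashable, Optional[int]]],
    datapoints: List[Hashable],
) -> List[Hashable]:
    """Return datapoints on which at least two prompts in the population disagree.

    `scores` maps prompt -> {datapoint -> 0, 1, or None}.
    """
    disc = []
    for x in datapoints:
        observed = {row.get(x) for row in scores.values()}
        observed.discard(None)   # skip prompts that were not evaluated on x
        if len(observed) > 1:    # both a pass and a failure were observed
            disc.append(x)
    return disc


population = {
    "P_a": {"x1": 1, "x2": 0, "x3": 1},
    "P_b": {"x1": 1, "x2": 1, "x3": None},
}
print(discriminative_points(population, ["x1", "x2", "x3"]))  # ['x2']
```

In this toy population only `x2` separates the two prompts, so it is the only point whose evaluation can change their relative rank.
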

## 4 Method: APEX

The APEX framework (Figure [1](https://arxiv.org/html/2606.11459#S2.F1)) addresses the core bottlenecks of data inefficiency by dynamically re-evaluating the utility of each datapoint in the optimization loop (Algorithm [1](https://arxiv.org/html/2606.11459#algorithm1)). Instead of treating the development set as a static benchmark, APEX categorizes data based on its historical interaction with the prompt lineage to guide both mutation and selection.

**Input:** initial prompt $P_0$, iterations $T$, budget $N$, batch size $m$, initial anchor $\alpha_0$, lookback $k$

**Output:** optimized prompt $P^{*}$

- $P_{\text{curr}} \leftarrow P_0;\quad \alpha \leftarrow \alpha_0$
- $\mathcal{H} \leftarrow \{\text{EvaluateFull}(P_0, \mathcal{D})\}$
- **for** $t = 1 \dots T$ **do**
  - *// 1. Dynamic Data Stratification*
  - $\mathcal{B} \leftarrow \text{Stratify}(\mathcal{D}, \mathcal{H}, k)$ *// Alg. [2](https://arxiv.org/html/2606.11459#algorithm2)*
  - *// 2. Trajectory-Guided Mutation*
  - $\mathcal{E} \leftarrow \text{Sample } m \text{ unused from } \mathcal{B}_{M,0} \cup \mathcal{B}_{H,0}$
  - $C \leftarrow \text{LLM}_{\text{meta}}(\text{Critique}(P_{\text{curr}}, \mathcal{E}))$
  - $P_{\text{new}} \leftarrow \text{LLM}_{\text{meta}}(\text{Mutate}(P_{\text{curr}}, C))$
  - *// 3. Rank-Sensitive Selection*
  - $\mathcal{D}_{\text{req}} \leftarrow \mathcal{B}_{M,\emptyset};\quad R \leftarrow N - |\mathcal{D}_{\text{req}}|$ *// Remaining budget*
  - $\rho_{\text{mix}} \leftarrow \text{PassRate}(\mathcal{B}_{M});\quad \rho_{\text{all}} \leftarrow \text{PassRate}(\mathcal{D})$
  - $k_{\text{pos}} \leftarrow \lfloor \min(\alpha, \rho_{\text{mix}}, \rho_{\text{all}}) \cdot R \rfloor;\quad k_{\text{neg}} \leftarrow R - k_{\text{pos}}$
  - $\mathcal{D}_{\text{pos}} \leftarrow \text{Sample } k_{\text{pos}} \text{ prioritizing } \mathcal{B}_{M,1} \text{ then } \mathcal{B}_{E,\emptyset}$
  - $\mathcal{D}_{\text{neg}} \leftarrow \text{Sample } k_{\text{neg}} \text{ prioritizing } \mathcal{B}_{M,0} \text{ then } \mathcal{B}_{H,\emptyset}$
  - $\mathcal{D}_{\text{eval}} \leftarrow \mathcal{D}_{\text{req}} \cup \mathcal{D}_{\text{pos}} \cup \mathcal{D}_{\text{neg}}$
  - Evaluate $P_{\text{new}}$ and $P_{\text{curr}}$ on $\mathcal{D}_{\text{eval}}$
  - **if** $P_{\text{new}}$ is better than $P_{\text{curr}}$ on $\mathcal{D}_{\text{eval}}$ **then**
    - $P_{\text{curr}} \leftarrow P_{\text{new}};\quad \alpha \leftarrow \alpha + \beta$ *// Anneal anchor*
  - Update $\mathcal{H}$ with the evaluation of $P_{\text{new}}$ and $P_{\text{curr}}$ on $\mathcal{D}_{\text{eval}}$
- **return** $P_{\text{curr}}$

Algorithm 1: APEX (Mutation & Selection)


**Input:** dataset $\mathcal{D}$, history $\mathcal{H}$, lookback $k$

**Output:** granular data buckets $\mathcal{B}_{t,s}$

- Initialize empty buckets $\mathcal{B}_{t,s}$
- **for** each $x_i \in \mathcal{D}$ **do**
  - *// Retrieve the last $k$ non-empty outcomes*
  - $\mathcal{R}_i \leftarrow \text{Last } k \text{ valid scores } s(P, x_i) \text{ from } \mathcal{H}$
  - *// Determine semantic tier*
  - **if** $\text{Set}(\mathcal{R}_i) \equiv \{1\}$ **then** $t \leftarrow \text{E (Easy)}$
  - **else if** $\text{Set}(\mathcal{R}_i) \equiv \{0\}$ **then** $t \leftarrow \text{H (Hard)}$
  - **else** $t \leftarrow \text{M (Mixed)}$
  - *// Look up cached status (yields $\emptyset$ if unevaluated)*
  - $s \leftarrow \text{Retrieve } s(P_{\text{curr}}, x_i) \text{ from } \mathcal{H}$
  - $\mathcal{B}_{t,s} \leftarrow \mathcal{B}_{t,s} \cup \{x_i\}$
- **return** $\mathcal{B}$

Algorithm 2: Dynamic Data Stratification

### 4.1 Data Dynamics for Prompt Optimization

The core philosophy of APEX is to move beyond static data by characterizing how individual datapoints influence the optimization trajectory. We categorize the interaction between the prompt lineage and the dataset into three distinct dynamics: Easy, Hard, and Mixed (Algorithm [2](https://arxiv.org/html/2606.11459#algorithm2)).

To accurately capture these dynamics without overfitting to stale data, we treat the evaluation history $\mathcal{H}$ as an ordered sequence of prompts. For any datapoint $x_i \in \mathcal{D}$, let $\mathcal{H}^{(i)}_{\text{valid}}$ be the sub-sequence of historical prompts that have actually evaluated $x_i$ (i.e., $s(P, x_i) \neq \emptyset$). We define its local history $\mathcal{R}_i$ as the set of outcomes from the $k$ most recent prompts in $\mathcal{H}^{(i)}_{\text{valid}}$:

$$\mathcal{R}_i = \left\{ s(P, x_i) \mid P \in \text{last}_k\big(\mathcal{H}^{(i)}_{\text{valid}}\big) \right\}. \tag{6}$$

The lookback window $k$ acts as a dynamic slice on the sparse score matrix $\mathbf{S}$, ensuring the optimizer always has a consistent sample size of relevant, recent behavior rather than outdated signals.

We partition the dataset $\mathcal{D}$ into three semantic tiers based on the variance of $\mathcal{R}_i$:

$$\text{Tier}(i) = \begin{cases} \text{E (Easy)} & \text{if } \text{Set}(\mathcal{R}_i) \equiv \{1\} \\ \text{H (Hard)} & \text{if } \text{Set}(\mathcal{R}_i) \equiv \{0\} \\ \text{M (Mixed)} & \text{if } \text{Set}(\mathcal{R}_i) \equiv \{0, 1\} \end{cases} \tag{7}$$

- Easy: Points consistently solved by the lineage. Re-evaluating them provides minimal signal.
- Hard: Points consistently failed. These represent data that are currently intractable.
- Mixed: Points exhibiting volatility. These represent the rank-sensitive frontier for evaluation and the most probable targets for improvement.

By intersecting these historical tiers $T \in \{E, H, M\}$ with the evaluation outcome $s \in \{1, 0, \emptyset\}$ under the current prompt $P_{\text{curr}}$, we partition the dataset $\mathcal{D}$ into nine disjoint subsets, denoted as $\mathcal{B}_{T,s}$. This notation provides a granular view of the data state. For example, $\mathcal{B}_{M,0} \subseteq \mathcal{D}$ represents the subset of historically Mixed instances that are currently failing, while $\mathcal{B}_{E,\emptyset}$ represents Easy instances that were skipped in the current pass.
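
The stratification described above can be sketched as follows. This is an illustrative reading of Algorithm 2, not the authors' implementation: the history format (an ordered list of `(prompt_id, outcomes)` pairs), the function name `stratify`, and the use of `None` for $\emptyset$ are all assumptions. It also assumes $k \geq 1$ and that every datapoint has at least one recorded outcome, which holds when the initial prompt is evaluated on the full dataset.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Optional, Set, Tuple

# Hypothetical history format: ordered list of (prompt_id, {datapoint: 0 or 1}).
History = List[Tuple[str, Dict[Hashable, int]]]
Bucket = Tuple[str, Optional[int]]  # (tier, status under the current prompt)


def stratify(
    dataset: Iterable[Hashable],
    history: History,
    current_prompt: str,
    k: int,
) -> Dict[Bucket, Set[Hashable]]:
    """Assign every datapoint to exactly one bucket B[tier, status]."""
    # Cached outcomes of the current prompt; a missing key means "not evaluated".
    current: Dict[Hashable, int] = {}
    for prompt_id, outcomes in history:
        if prompt_id == current_prompt:
            current.update(outcomes)

    buckets: Dict[Bucket, Set[Hashable]] = defaultdict(set)
    for x in dataset:
        # Outcomes from the k most recent prompts that actually evaluated x.
        valid = [outcomes[x] for _, outcomes in history if x in outcomes]
        recent = set(valid[-k:])
        if recent == {1}:
            tier = "E"   # consistently solved
        elif recent == {0}:
            tier = "H"   # consistently failed
        else:
            tier = "M"   # both outcomes seen in the window
        buckets[(tier, current.get(x))].add(x)
    return buckets


history = [
    ("P0", {"a": 1, "b": 0, "c": 0, "d": 1}),  # full evaluation of the initial prompt
    ("P1", {"a": 1, "b": 0, "c": 1}),          # partial evaluation of a child
]
buckets = stratify(["a", "b", "c", "d"], history, current_prompt="P1", k=2)
```

Here `c` lands in the Mixed tier with status 1 (it flipped from fail to pass), while `d` lands in `("E", None)` because the current prompt skipped it.
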

Table 1: Main Results. Accuracy (%) ± Std Dev and absolute gain (Δ) over the initial prompt. APEX consistently outperforms baselines.

| Method | IFBench | SimpleQA Verified | FACTS Grounding | Average |
| --- | --- | --- | --- | --- |
| **Gemini 2.5 Flash** | | | | |
| Initial Prompt | 38.5 ± 1.2 (—) | 23.6 ± 0.6 (—) | 85.8 ± 1.1 (—) | 49.3 (—) |
| GEPA | 41.2 ± 0.9 (+2.7) | 28.8 ± 0.8 (+5.2) | 93.5 ± 0.0 (+7.7) | 54.5 (+5.2) |
| APO ($\lvert D \rvert$) | 43.7 ± 1.2 (+5.2) | 27.4 ± 2.6 (+3.8) | 89.7 ± 1.9 (+3.9) | 53.6 (+4.3) |
| APO ($\lvert D \rvert / 2$) | 43.5 ± 1.9 (+5.0) | 25.0 ± 3.5 (+1.4) | 89.9 ± 0.1 (+4.1) | 52.8 (+3.5) |
| APEX (Ours) | 52.3 ± 1.4 (+13.8) | 35.0 ± 2.3 (+11.4) | 94.1 ± 0.4 (+8.3) | 60.5 (+11.2) |
| **Gemma 3 27B** | | | | |
| Initial Prompt | 33.4 ± 0.7 (—) | 9.4 ± 0.5 (—) | 80.7 ± 1.6 (—) | 41.2 (—) |
| GEPA | 34.1 ± 0.5 (+0.7) | 9.4 ± 0.5 (+0.0) | 91.7 ± 0.5 (+11.0) | 45.1 (+3.9) |
| APO ($\lvert D \rvert$) | 35.7 ± 2.4 (+2.3) | 11.5 ± 0.1 (+2.1) | 88.5 ± 0.7 (+7.8) | 45.2 (+4.0) |
| APO ($\lvert D \rvert / 2$) | 36.1 ± 0.5 (+2.7) | 8.4 ± 1.4 (-1.0) | 85.7 ± 0.9 (+5.0) | 43.4 (+2.2) |
| APEX (Ours) | 39.3 ± 2.0 (+5.9) | 11.5 ± 1.2 (+2.1) | 93.3 ± 0.1 (+12.6) | 48.0 (+6.8) |

### 4.2 Trajectory-Guided Mutation

To resolve the stochastic instability of mutation, APEX enforces a trajectory-guided strategy. Standard optimization methods uniformly sample errors from the entire failure set, frequently retrieving “hard” cases ($\mathcal{B}_{H,0}$) that the current prompt logic cannot handle, leading to hallucinated corrections.

In contrast, APEX targets the addressable frontier by constructing the mutation error batch primarily from the Mixed-Fail bucket ($\mathcal{B}_{M,0}$). These “soft failures” are datapoints where the model has demonstrated capacity in the recent lineage but regressed under the current variation. Correcting these regressions stabilizes the trajectory.

Furthermore, to maximize coverage, we maintain a usage history $\mathcal{U}$ that prevents the optimizer from repeatedly overfitting to the same recurring failures. Errors for mutation are drawn exclusively from unvisited failures:

$$e \in \left\{ x_i \mid x_i \in (\mathcal{B}_{M,0} \cup \mathcal{B}_{H,0}),\ x_i \notin \mathcal{U} \right\}. \tag{8}$$

Once this pool is exhausted, the usage history is reset, guaranteeing broad exploration of the error surface.
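
One way the usage-history mechanism could look in practice is sketched below. The function name `sample_mutation_batch` is hypothetical, and the strict "Mixed-Fail first, then Hard-Fail" ordering is an assumption about how "primarily" is realized; Eq. (8) itself only requires that sampled errors be unused.

```python
import random
from typing import Hashable, List, Sequence, Set


def sample_mutation_batch(
    mixed_fail: Sequence[Hashable],
    hard_fail: Sequence[Hashable],
    used: Set[Hashable],
    m: int,
    rng: random.Random,
) -> List[Hashable]:
    """Draw up to m unused failures, Mixed-Fail before Hard-Fail; updates `used` in place."""
    pool_mixed = [x for x in mixed_fail if x not in used]
    pool_hard = [x for x in hard_fail if x not in used]
    if not pool_mixed and not pool_hard:
        # Every failure has been visited: reset the usage history and start over.
        used.clear()
        pool_mixed, pool_hard = list(mixed_fail), list(hard_fail)
    rng.shuffle(pool_mixed)
    rng.shuffle(pool_hard)
    batch = (pool_mixed + pool_hard)[:m]
    used.update(batch)
    return batch


used: Set[Hashable] = set()
rng = random.Random(0)
first = sample_mutation_batch(["a", "b"], ["c"], used, m=2, rng=rng)   # both Mixed-Fail points
second = sample_mutation_batch(["a", "b"], ["c"], used, m=2, rng=rng)  # only "c" is left
third = sample_mutation_batch(["a", "b"], ["c"], used, m=2, rng=rng)   # pool exhausted, history reset
```

Passing an explicit `random.Random` instance keeps the sampling reproducible across runs.
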

### 4.3 Rank-Sensitive Evaluation

The selection stage reduces evaluation cost by focusing on the rank-sensitive subset, which contains datapoints where the new prompt is most likely to diverge from its parent. We construct the evaluation dataset $\mathcal{D}_{\text{eval}}$ using a tiered sampling strategy to fill the budget $N$. First, we identify volatile unknowns: data points that are historically mixed but were skipped during the previous evaluation. These provide the highest information gain and form the required baseline: $\mathcal{D}_{\text{req}} = \mathcal{B}_{M,\emptyset}$. To fill the remaining evaluation budget $R = N - |\mathcal{D}_{\text{req}}|$, we employ stratified sampling to balance stability (anchors) and error correction. We sample a positive set $\mathcal{D}_{\text{pos}}$ (prioritizing $\mathcal{B}_{M,1}$ to catch regressions, followed by $\mathcal{B}_{E,\emptyset}$) and a negative set $\mathcal{D}_{\text{neg}}$ (prioritizing $\mathcal{B}_{M,0}$ to confirm fixes).

The size of these sets is governed by ananchor ratioαt\\alpha\_\{t\}, which dictates the proportion of the budget dedicated to verifying positive stability\. This ratio follows an annealing schedule, growing when a prompt update is successful:

$$\alpha_{t+1} = \alpha_t + \beta \cdot \mathbb{I}(P_{\text{new}} \succ P_{\text{curr}}). \tag{9}$$

As the prompt improves, $\alpha$ increases, effectively “locking in” mastered logic by dedicating more evaluation budget to preventing regressions. To ensure the anchor size is calibrated to the model’s actual competence, we clamp $\alpha_t$ against the pass ratios of the mixed tier ($\rho_{\text{mix}}$) and the global dataset ($\rho_{\text{all}}$):

$$k_{\text{pos}} = \lfloor \min(\alpha_t, \rho_{\text{mix}}, \rho_{\text{all}}) \cdot R \rfloor, \quad k_{\text{neg}} = R - k_{\text{pos}}. \tag{10}$$

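
To make Eqs. (9) and (10) concrete, the short sketch below implements the anchor-ratio update and the budget split. The function names and the pass ratios and remaining budgets in the worked example are hypothetical; only the formulas themselves come from the text.

```python
import math


def update_anchor_ratio(alpha, beta, improved):
    """Eq. (9): grow alpha by beta only when the new prompt beats the current one."""
    return alpha + beta if improved else alpha


def split_budget(alpha, rho_mix, rho_all, remaining):
    """Eq. (10): clamp alpha by both pass ratios, then split the remaining budget R."""
    k_pos = math.floor(min(alpha, rho_mix, rho_all) * remaining)
    return k_pos, remaining - k_pos


# Illustrative values only: R and the pass ratios are made up for the example.
print(split_budget(0.2, 0.5, 0.6, 90))    # alpha is the binding term -> (18, 72)
print(split_budget(0.5, 0.25, 0.6, 80))   # rho_mix is the binding term -> (20, 60)
```

The second call shows the effect of the clamp: even with a large anchor ratio, the positive set cannot exceed the share of mixed-tier points the model currently passes.
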
Finally, to minimize computational overhead, APEX employs an incremental evaluation scheme. If an outcome $s(P_{\text{curr}}, x_i)$ for a sampled point already exists in the history $\mathcal{H}$, we retrieve it directly from memory, executing full inference only for the new candidate.
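
One way to picture incremental evaluation is a cache keyed by prompt and datapoint, as in the sketch below. The class `IncrementalEvaluator`, its `score_fn` callback, and the use of mean score for comparison are assumptions made for illustration; the paper only states that stored outcomes are reused.

```python
class IncrementalEvaluator:
    """Caches outcomes s(P, x_i) so repeated lookups cost no evaluation calls."""

    def __init__(self, score_fn):
        self.score_fn = score_fn  # hypothetical callback: (prompt, datapoint) -> score
        self.history = {}         # H: (prompt, datapoint id) -> cached outcome
        self.calls = 0            # number of real evaluation calls spent so far

    def score(self, prompt, datapoint_id, datapoint):
        key = (prompt, datapoint_id)
        if key not in self.history:
            self.history[key] = self.score_fn(prompt, datapoint)
            self.calls += 1
        return self.history[key]

    def compare(self, p_curr, p_new, eval_set):
        """Return mean scores of both prompts on eval_set (id -> datapoint)."""
        curr = [self.score(p_curr, i, x) for i, x in eval_set.items()]
        new = [self.score(p_new, i, x) for i, x in eval_set.items()]
        return sum(curr) / len(curr), sum(new) / len(new)
```

When a candidate is accepted and becomes the current prompt, its scores on previously sampled points are already in the cache, so the next comparison spends budget mainly on the new candidate.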

## 5 Experiments

![Refer to caption](https://arxiv.org/html/2606.11459v1/figure/budget.png) Figure 2: Comparison of test accuracy versus budget (i.e., number of evaluation calls) on IFBench with Gemini 2.5 Flash. The performance margin between APEX and baselines becomes larger as budget increases.

![Refer to caption](https://arxiv.org/html/2606.11459v1/figure/ablation.png) Figure 3: Ablation study of data selection strategies. Comparison of test accuracy across different selection methods using Gemini-2.5-Flash on IFBench.

### 5.1 Experimental Setting

#### General Setup\.

We employ Gemini-2.5-Flash (Comanici et al., [2025](https://arxiv.org/html/2606.11459#bib.bib3)) and Gemma-3-27B (Team Gemma et al., [2025](https://arxiv.org/html/2606.11459#bib.bib4)) as our underlying LLMs. For each experiment, we use the same model for both the meta-optimizer ($\text{LLM}_{\text{meta}}$) and the target model (LLM). To enforce a fair comparison across methods with varying sample efficiency, we restrict all optimizers to a fixed global budget of 5,000 calls to the evaluation function.

#### Datasets\.

We evaluate APEX on three distinct tasks representing instruction following, parametric factuality, and grounded generation.

- IFBench (Pyatkin et al., [2025](https://arxiv.org/html/2606.11459#bib.bib5)): A benchmark for precise instruction following on verifiable constraints. The evaluation function verifies adherence to constraints using code-based verifiers. Following the protocol of Agrawal et al. ([2025](https://arxiv.org/html/2606.11459#bib.bib1)), we sample 500 data points from the IF-RLVR split for optimization and reserve 294 examples from IFBench for testing. Notably, the constraints in the development set and the test set have no overlap.
- SimpleQA Verified (Haas et al., [2025](https://arxiv.org/html/2606.11459#bib.bib6)): Assesses short-form factuality based on the model’s parametric knowledge across diverse topics. The evaluation compares the model’s prediction against the gold-standard answer. We randomly partition the data into 700 examples for optimization and 300 for testing.
- FACTS Grounding (Jacovi et al., [2025](https://arxiv.org/html/2606.11459#bib.bib7)): Evaluates the model’s ability to generate long-form responses with respect to a provided context document. The evaluation assesses both response quality and grounding quality using a multi-phase LLM-based judge system. We randomly partition the data into 660 examples for optimization and 200 for testing.

#### Baselines\.

We compare APEX against the following leading prompt optimization frameworks.

- APO (Pryzant et al., [2023](https://arxiv.org/html/2606.11459#bib.bib2)): A genetic-based prompt optimization method that has remained the best-performing method since its release (Wan et al., [2024](https://arxiv.org/html/2606.11459#bib.bib8)); code: [https://github.com/microsoft/LMOps/tree/main/prompt_optimization](https://github.com/microsoft/LMOps/tree/main/prompt_optimization). It uses reflective mutation based on textual feedback and a sampling-based selection mechanism with a controllable budget per iteration. Because APO allows for a controllable evaluation budget, we include two baselines with per-iteration budgets set to the size of the development set ($|D|$) and half that size ($|D|/2$).
- GEPA (Agrawal et al., [2025](https://arxiv.org/html/2606.11459#bib.bib1)): A genetic framework designed for agentic workflows; code: [https://github.com/gepa-ai/gepa](https://github.com/gepa-ai/gepa). It evolves prompts via reflective mutation and employs a Pareto-based selection strategy on the full development set.

#### Implementation Details\.

For APEX, we employ a lineage lookback window of $k=5$ to construct the history for data categorization. The mutation batch size is set to $m=5$. For the rank-sensitive evaluation, we set the evaluation budget per iteration to $T=100$. The anchor ratio is initialized at $\alpha=0.2$ and is incremented by $\beta=0.03$ when a better prompt is found, gradually stabilizing the evaluation set as the prompt converges.
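
To illustrate how the lookback window $k$ shapes data categorization, the sketch below assigns a datapoint to a bucket from its recent outcomes. The precise tier definitions are not restated in this section, so the rule used here is an assumption: Easy means every observed outcome in the window is a pass, Hard means every observed outcome is a fail, and anything else (including unobserved points) is Mixed. The function name `assign_bucket` is hypothetical.

```python
def assign_bucket(lineage_outcomes, current_outcome, k=5):
    """Return (tier, status) for one datapoint.

    lineage_outcomes: outcomes under the prompts on the current lineage,
        oldest first; 1 = pass, 0 = fail, None = not evaluated.
    current_outcome: outcome under the current prompt, or None if skipped.
    """
    # Only the k most recent lineage entries are considered.
    window = [o for o in lineage_outcomes[-k:] if o is not None]
    if window and all(o == 1 for o in window):
        tier = "E"  # Easy: passed every time it was evaluated in the window
    elif window and all(o == 0 for o in window):
        tier = "H"  # Hard: failed every time it was evaluated in the window
    else:
        tier = "M"  # Mixed (assumption: unobserved datapoints also land here)
    return tier, current_outcome
```

With outcomes `[0, 1, 1, 1, 1, 1]`, a window of 5 drops the old failure and yields Easy, whereas a window of 10 keeps it and yields Mixed, which is one way stale signals can re-enter when the window is widened.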

### 5.2 Main Results

Table [1](https://arxiv.org/html/2606.11459#S4.T1) presents the comparative performance of APEX against baseline methods. APEX consistently outperforms baselines across both LLMs and all three benchmarks, achieving the highest average accuracy. On Gemini 2.5 Flash, the method secures a mean accuracy of 60.5% (+11.2% over the initial prompt), clearly surpassing the strongest baseline, GEPA, which reaches 54.5%. This observation is consistent on Gemma 3 27B, where APEX maintains its lead. Notably, this performance gap is particularly pronounced on complex instruction-following tasks and when optimizing stronger LLMs, highlighting the method’s scalability. Broadly, these results validate that the bottleneck in prompt optimization is not merely the search algorithm but the availability of effective optimization signals. While baselines struggle with the trade-off between signal cost and signal noise, APEX demonstrates that actively shaping the data distribution allows for directed and stable improvement.

A deeper analysis of the experimental results isolates two distinct failure modes in existing prompt selection strategies. First, methods reliant on full-set evaluation (GEPA) suffer from limited search depth. While the evaluation signal is reliable, the prohibitive computational cost restricts the depth of optimization trajectories. Second, random subsampling strategies (APO) lack evaluation stability. The reliance on small random sets introduces high variance, failing to reliably distinguish between superior and inferior prompts. This instability is evident in that aggressive subsampling, which trades fewer evaluation samples for more iterations (APO $|D|/2$), may lead to performance regression compared to the conservative baseline (APO $|D|$). Furthermore, these evaluation issues exacerbate the inherent limitations of random mutation. GEPA truncates the search before the process can stabilize, while APO exhibits performance regression by corrupting the selection signal with noise. Consequently, baselines fail to achieve stable and monotonic improvement. APEX effectively navigates these trade-offs through its dynamic data selection strategy. By prioritizing the Addressable Frontier for mutation and the Rank-Sensitive Frontier for selection, the method achieves better performance across multiple scenarios.

Beyond aggregate metrics, we observe that the magnitude of improvement is strongly correlated with the interplay between task characteristics and LLM capability. Prompt optimization yields the most significant returns in a “promising region” where the model possesses latent knowledge but lacks the instruction alignment to express it. This is evident in the contrast between easy tasks like FACTS Grounding, where strong base capabilities allow for substantial refinement, and difficult scenarios like Gemma 3 27B on SimpleQA Verified, where fundamental knowledge deficits impose a hard ceiling on improvements regardless of the optimization strategy.

Table 2: Independent contributions of APEX’s mutation and selection components on IFBench with Gemini-2.5-Flash.

| Trajectory-Guided Mutation | Rank-Sensitive Selection | Score |
| :---: | :---: | :---: |
| ✓ | ✓ | 52.3 |
| ✗ | ✓ | 50.2 |
| ✓ | ✗ | 48.3 |
| ✗ | ✗ | 42.9 |

Table 3: Impact of sampling from different data tiers on IFBench with Gemini-2.5-Flash.

| Method | Score |
| :--- | :---: |
| APEX | 52.3 |
| – Random on hard and mixed | 47.3 |
| – Random on hard | 30.3 |
| – Random on all data | 42.9 |

### 5.3 Analysis

To provide an in\-depth understanding of the prompt optimization methods, we conduct a fine\-grained analysis on IFBench with Gemini 2\.5 Flash\.

#### Performance across Budgets\.

Figure 2 illustrates the test accuracy achieved across different evaluation budgets. GEPA exhibits competitive early performance but makes little progress thereafter, a result of insufficient exploration constrained by high per-iteration costs. Conversely, the aggressive subsampling of APO ($|\mathcal{D}|/2$) results in instability. In contrast, APEX maintains a sustained improvement that becomes increasingly pronounced as the budget scales, validating the critical importance of data efficiency.

#### Impact of Data Selection Strategy\.

Figure [3](https://arxiv.org/html/2606.11459#S5.F3) dissects the contribution of different data selection mechanisms to the final performance. The results highlight a clear hierarchy of efficacy. Random Selection provides a baseline improvement over the initial prompt. Stratified Sampling balances the dataset based on initial prompt performance, marginally improving upon random selection. Explicitly enforcing Diversity based on constraint types provides further gains, as it introduces auxiliary guidance information. However, these methods still fall significantly short of APEX variants. The comparison between APEX (Full History) and the complete APEX validates the hypothesis of our dynamic frontier strategy, demonstrating that removing outdated signals to target the dynamic frontiers delivers a substantial performance margin.

#### Ablation on Mutation and Selection Methods\.

We conduct an ablation study to isolate and quantify the individual contributions of APEX’s prompt mutation and selection mechanisms. As detailed in Table 2, replacing both components with random baselines results in a severe performance drop. Introducing APEX’s selection strategy independently yields a greater improvement than applying the mutation strategy alone. Crucially, the combination of both components achieves the peak overall score, highlighting a strong synergistic effect between trajectory-guided mutation and rank-sensitive selection.

#### Necessity of Prioritizing Mixed Data\.

Table [3](https://arxiv.org/html/2606.11459#S5.T3) highlights the critical role of mixed data during optimization. Sampling exclusively from the hard tier severely degrades performance, causing the prompt to overfit to narrow edge cases at the expense of general accuracy. In contrast, incorporating mixed data alongside hard examples substantially outperforms uniform random sampling across all data. This confirms that prioritizing mixed-tier examples provides a vital grounding signal, preventing catastrophic overfitting and ensuring the prompt maintains broad generalization.

![Refer to caption](https://arxiv.org/html/2606.11459v1/x1.png) Figure 4: Qualitative analysis of APEX prompt evolution. The left panel tracks the optimization trajectory, highlighting when key instructional strategies were discovered. The right panel shows the final prompt, color-coded to match these milestones. Together, they demonstrate how APEX iteratively builds high-performance prompts by accumulating improvements over time.

#### Qualitative Example\.

The evolution of the APEX prompt in Figure [4](https://arxiv.org/html/2606.11459#S5.F4) demonstrates a clear shift from simple instruction-giving to establishing a rigid meta-cognitive scaffolding for the model. Early iterations focus on input analysis, explicitly teaching the model what a constraint is. The middle phase introduces process control, forcing a “plan and verify” mechanism that separates the reasoning process from the final generation. The final and most critical optimization creates a strict output architecture. By mandating a specific response structure (Constraints → Plan → Execution → Answer), the prompt acts as a forcing function.

## 6 Conclusion

In this work, we identified the static treatment of data as the fundamental efficiency bottleneck in automated prompt optimization. Existing evolutionary methods waste improved model capabilities on fixed benchmarks, treating informative and uninformative data equally. We addressed this by proposing APEX, a data-centric framework that transforms the development set from a static ruler into a dynamic pool. By stratifying data into dynamic tiers, APEX focuses the optimization budget on the addressable frontier for prompt mutation and the rank-sensitive frontier for prompt selection. Our empirical results across instruction following, factuality, and grounding tasks demonstrate that APEX achieves better performance under the same compute budget. Beyond prompt engineering, these findings suggest a broader principle for black-box optimization: as candidates evolve, the data used to optimize them must evolve with them. We expect APEX to inspire future work that further explores the intersection of dynamic data curation and agent optimization.

## References

- L. A. Agrawal, S. Tan, D. Soylu, N. Ziems, R. Khare, K. Opsahl-Ong, A. Singhvi, H. Shandilya, M. J. Ryan, M. Jiang, et al. (2025) Gepa: reflective prompt evolution can outperform reinforcement learning. arXiv preprint arXiv:2507.19457.
- InstructZero: efficient instruction optimization for black-box large language models. In International Conference on Machine Learning, pp. 6503–6518.
- L. Chen, S. Li, J. Yan, H. Wang, K. Gunaratna, V. Yadav, Z. Tang, V. Srinivasan, T. Zhou, H. Huang, et al. (2024b) Alpagasus: training a better alpaca model with fewer data. In The Twelfth International Conference on Learning Representations (ICLR 2024).
- S. K. Choe, H. Ahn, J. Bae, K. Zhao, Y. Chung, A. Pratapa, W. Neiswanger, E. Strubell, T. Mitamura, J. Schneider, E. Hovy, R. B. Grosse, and E. P. Xing (2025) What is your data worth to GPT? LLM-scale data valuation with influence functions. In The Thirty-ninth Annual Conference on Neural Information Processing Systems. External Links: [Link](https://openreview.net/forum?id=zPKeJAEo27)
- G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen, et al. (2025) Gemini 2.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.
- M. Deng, J. Wang, C. Hsieh, Y. Wang, H. Guo, T. Shu, M. Song, E. Xing, and Z. Hu (2022) Rlprompt: optimizing discrete text prompts with reinforcement learning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 3369–3391.
- S. Diao, P. Wang, Y. Lin, R. Pan, X. Liu, and T. Zhang (2024) Active prompting with chain-of-thought for large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1330–1350.
- X. Dong, S. Wang, D. Lin, and A. Hassan (2025) Model performance-guided evaluation data selection for effective prompt optimization. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 2844–2859.
- C. Fernando, D. Banarse, H. Michalewski, S. Osindero, and T. Rocktäschel (2024) Promptbreeder: self-referential self-improvement via prompt evolution. In Proceedings of the 41st International Conference on Machine Learning, pp. 13481–13544.
- Q. Guo, R. Wang, J. Guo, B. Li, K. Song, X. Tan, G. Liu, J. Bian, and Y. Yang (2024) Connecting large language models with evolutionary algorithms yields powerful prompt optimizers. In The Twelfth International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=ZG3RaNIsO8)
- L. Haas, G. Yona, G. D’Antonio, S. Goldshtein, and D. Das (2025) Simpleqa verified: a reliable factuality benchmark to measure parametric knowledge. arXiv preprint arXiv:2509.07968.
- O. Honovich, U. Shaham, S. Bowman, and O. Levy (2023) Instruction induction: from few examples to natural language task descriptions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1935–1952.
- C. Hsieh, S. Si, F. Yu, and I. Dhillon (2024) Automatic engineering of long prompts. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 10672–10685.
- A. Jacovi, A. Wang, C. Alberti, C. Tao, J. Lipovetz, K. Olszewska, L. Haas, M. Liu, N. Keating, A. Bloniarz, et al. (2025) The facts grounding leaderboard: benchmarking llms’ ability to ground responses to long-form input. arXiv preprint arXiv:2501.03200.
- O. Khattab, A. Singhvi, P. Maheshwari, Z. Zhang, K. Santhanam, S. Vardhamanan, S. Haq, A. Sharma, T. T. Joshi, H. Moazam, H. Miller, M. Zaharia, and C. Potts (2024) DSPy: compiling declarative language model calls into self-improving pipelines.
- T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa (2022) Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems 35, pp. 22199–22213.
- S. Liu, S. Yu, Z. Lin, D. Pathak, and D. Ramanan (2024) Language models as black-box optimizers for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12687–12697.
- A. Novikov, N. Vũ, M. Eisenberger, E. Dupont, P. Huang, A. Z. Wagner, S. Shirobokov, B. Kozlovskii, F. J. Ruiz, A. Mehrabian, et al. (2025) Alphaevolve: a coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131.
- K. Opsahl-Ong, M. Ryan, J. Purtell, D. Broman, C. Potts, M. Zaharia, and O. Khattab (2024) Optimizing instructions and demonstrations for multi-stage language model programs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 9340–9366.
- A. Prasad, P. Hase, X. Zhou, and M. Bansal (2023) Grips: gradient-free, edit-based instruction search for prompting large language models. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp. 3845–3864.
- R. Pryzant, D. Iter, J. Li, Y. Lee, C. Zhu, and M. Zeng (2023) Automatic prompt optimization with “gradient descent” and beam search. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 7957–7968.
- V. Pyatkin, S. Malik, V. Graf, H. Ivison, S. Huang, P. Dasigi, N. Lambert, and H. Hajishirzi (2025) Generalizing verifiable instruction following. arXiv preprint arXiv:2507.02833.
- Z. Shi and F. Morstatter (2024) The diffusion of causal language in social networks. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 18, pp. 1422–1435.
- T. Shin, Y. Razeghi, R. L. Logan IV, E. Wallace, and S. Singh (2020) Autoprompt: eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980.
- C. Snell, J. Lee, K. Xu, and A. Kumar (2025) Scaling llm test-time compute optimally can be more effective than scaling parameters for reasoning. In International Conference on Learning Representations, Vol. 2025, pp. 10131–10165.
- Team Gemma, A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ramé, M. Rivière, et al. (2025) Gemma 3 technical report. arXiv preprint arXiv:2503.19786.
- X. Wan, R. Sun, H. Nakhost, and S. Arik (2024) Teach better or show smarter? on instructions and exemplars in automatic prompt optimization. Advances in Neural Information Processing Systems 37, pp. 58174–58244.
- X. Wan, H. Zhou, R. Sun, H. Nakhost, K. Jiang, R. Sinha, and S. Ö. Arık (2025) Maestro: self-improving text-to-image generation via agent orchestration. arXiv preprint arXiv:2509.10704.
- F. Wang, N. Mehrabi, P. Goyal, R. Gupta, K. Chang, and A. Galstyan (2024a) Data advisor: dynamic data curation for safety alignment of large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 8089–8100.
- X. Wang, C. Li, Z. Wang, F. Bai, H. Luo, J. Zhang, N. Jojic, E. Xing, and Z. Hu (2024b) PromptAgent: strategic planning with language models enables expert-level prompt optimization. In The Twelfth International Conference on Learning Representations.
- Z. Wu, X. Lin, Z. Dai, W. Hu, Y. Shu, S. Ng, P. Jaillet, and B. K. H. Low (2024) Prompt optimization with ease? efficient ordering-aware automated selection of exemplars. In Advances in Neural Information Processing Systems, Vol. 37, pp. 122706–122740.
- M. Xia, S. Malladi, S. Gururangan, S. Arora, and D. Chen (2024) LESS: selecting influential data for targeted instruction tuning. In International Conference on Machine Learning, pp. 54104–54132.
- C. Yang, X. Wang, Y. Lu, H. Liu, Q. V. Le, D. Zhou, and X. Chen (2024) Large language models as optimizers. In The Twelfth International Conference on Learning Representations.
- Q. Ye, M. Ahmed, R. Pryzant, and F. Khani (2024) Prompt engineering a prompt engineer. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 355–385.
- M. Yuksekgonul, F. Bianchi, J. Boen, S. Liu, P. Lu, Z. Huang, C. Guestrin, and J. Zou (2025) Optimizing generative ai by backpropagating language model feedback. Nature 639 (8055), pp. 609–616.
- C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, et al. (2023a) Lima: less is more for alignment. In Advances in Neural Information Processing Systems, Vol. 36, pp. 55006–55021.
- Y. Zhou, A. I. Muresanu, Z. Han, K. Paster, S. Pitis, H. Chan, and J. Ba (2023b) Large language models are human-level prompt engineers. In The Eleventh International Conference on Learning Representations.

## Appendix A Analysis

#### Impact of the Lookback Window\.

A sensitivity analysis demonstrates APEX’s robustness across various lookback configurations. As detailed in Table [4](https://arxiv.org/html/2606.11459#A1.T4), the default setting of a lookback window of 5 achieves a score of 52.3. Narrowing the window to 3 restricts the historical context, yielding a lower score of 50.3. Conversely, expanding the window to 10 degrades performance slightly to 50.6, as the broader scope begins to reintroduce the outdated, stale signals that our dynamic frontier strategy aims to eliminate. This confirms that a moderately sized lookback window provides the ideal balance between context retention and recency bias.

Table 4: Sensitivity of the lookback window in APEX on IFBench with Gemini 2.5 Flash.

| Setting | Score |
| :--- | :---: |
| lookback = 3 | 50.3 |
| lookback = 5 (default) | 52.3 |
| lookback = 10 | 50.6 |

#### Token Cost\.

Figure [5](https://arxiv.org/html/2606.11459#A1.F5) illustrates the trade-off between test accuracy and token consumption for API calls. The results demonstrate that APEX establishes a clear Pareto frontier, achieving higher test accuracy than the baseline methods.

![Refer to caption](https://arxiv.org/html/2606.11459v1/figure/token.png) Figure 5: Test accuracy versus token consumption on IFBench with Gemini 2.5 Flash.

#### Comparison of Prompt Length\.

Table [5](https://arxiv.org/html/2606.11459#A1.T5) demonstrates that APEX’s superior performance is not a byproduct of generating longer prompts. While baseline methods like GEPA tend to continuously accumulate instructions, resulting in highly inflated prompt lengths, APEX achieves the highest overall accuracy with the most concise prompt. This confirms that APEX does not derive its performance gains from simply introducing more text. Instead, its optimization process successfully distills the instructions to their most critical and precise components, providing highly effective model guidance without redundant bloat.

Table 5: Prompt lengths optimized by different methods on IFBench with Gemini 2.5 Flash.

| Method | Prompt Length |
| :--- | :---: |
| APEX | 257 |
| APO ($|D|$) | 275 |
| APO ($|D|/2$) | 342 |
| GEPA | 662 |

## Appendix B Initial Prompts

Respond to the given query\. Enclose the final response in <answer\> tags to distinguish it from the rest of your output\.

Query: \{query\}

Listing 1: Initial Prompt for IFBench

Respond to the given query\. Your final answer must be in <answer\> tags\.

Query: \{query\}

Listing 2: Initial Prompt for SimpleQA Verified

Answer the question based on the given context\.

\{user\_request\}

\{context\_document\}

Listing 3: Initial Prompt for FACTS Grounding

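
For illustration, the snippet below shows one way such a template could be filled and the tagged answer recovered from a model response. It is a sketch under assumptions: the template string is Listing 1 with whitespace restored (the exact spacing is a guess), and the helper names `render` and `extract_answer`, along with the choice to keep the last `<answer>` block, are hypothetical.

```python
import re

# Listing 1 with whitespace restored; exact spacing is an assumption.
IFBENCH_TEMPLATE = (
    "Respond to the given query. Enclose the final response in <answer> tags "
    "to distinguish it from the rest of your output.\n\nQuery: {query}"
)

# Non-greedy match so that each <answer>...</answer> block is captured separately.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def render(template, **fields):
    """Fill the named placeholders of a prompt template."""
    return template.format(**fields)


def extract_answer(output):
    """Return the text of the last <answer> block, or None if no block is present."""
    matches = ANSWER_RE.findall(output)
    return matches[-1].strip() if matches else None
```

Returning `None` for untagged output lets the caller decide whether a missing tag counts as a failure for the datapoint.
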
## Appendix C APEX Prompt Templates

You are an expert prompt engineer\. Your sole function is to analyze a faulty prompt and recommend the single most impactful, generalizable change for the next optimization iteration\. Adhere strictly to the following process\.

\#\#\# The Prompt Under Analysis

The prompt being evaluated\.

<current\_prompt\>

\{prompt\}

</current\_prompt\>

\-\-\-

\#\#\# Failure Case Analysis

Specific examples where the prompt produced a suboptimal output\.

<failure\_cases\>

\{error\_cases\}

</failure\_cases\>

\-\-\-

\#\#\# Instructions

\*\*Step 1: Diagnose the Root Cause\*\*

Analyze the \*\*Failure Case Analysis\*\* to identify the primary, underlying reasons for most of the errors\. Classify the root cause into one of two major types:

\* \*\*Type 1: Weak Decision Boundaries \(Defining the "What"\)\*\*

\* \*\*Ambiguity & Vague Definitions:\*\* Terms, tone, or success criteria are open to interpretation\.

\* \*\*Constraint Loopholes:\*\* Missing exclusionary constraints allow unwanted behaviors\.

\* \*\*Input Confusion:\*\* The prompt lacks delimiters \(e\.g\., XML tags, quotes\), causing the model to confuse user input with instructions\.

\* \*\*Missing Context/Grounding:\*\* The prompt fails to explicitly bind the model to provided source material \(leading to hallucinations\)\.

\* \*\*Type 2: Missing Process Instructions \(Defining the "How"\)\*\*

\* \*\*Cognitive Overload:\*\* The task is too complex for a single instruction and requires explicit decomposition \(breaking the problem into distinct sub\-tasks\)\.

\* \*\*Implicit Logic:\*\* The prompt assumes the model knows the specific algorithm required to transform input to output\.

\*\*Step 2: Formulate the Recommendation\*\*

Based on your diagnosis in Step 1, construct the recommendation according to these principles:

\* \*\*1\. Apply the Correct Remediation Strategy:\*\*

\* \*If Type 1 \(Boundary Issues\):\*

\* \*\*Operationalize Definitions:\*\* Change subjective adjectives to objective metrics\.

\* \*\*Enforce Delimiters:\*\* Recommend wrapping input data in explicit tags \(e\.g\., <input\>\.\.\.</input\>\) to separate it from instructions\. Note that the placeholders and the corresponding input content are non\-editable\.

\* \*\*Strengthen Constraints:\*\* Add negative constraints or "Grounding" instructions \(e\.g\., "Answer only using the provided text"\)\.

\* \*If Type 2 \(Process Issues\):\*

\* \*\*Request Rationale:\*\* Require the model to "show its work" or think step\-by\-step before answering\. It is allowed to generate intermediate outputs before the final answer if a clear output format is specified for answer extraction\.

\* \*\*Decompose the Task:\*\* Break the prompt into sequential, modular steps or sub\-prompts\.

\* \*\*Few\-Shot Prompting:\*\* If the logic is abstract, strictly require "Input \-\> Output" examples to demonstrate the pattern\.

\* \*\*2\. Provide Generalizable Principles:\*\*

\* Your recommendation must address the root cause, not just the specific failure examples\.

\* \*\*Crucially, do not quote or directly reference the provided \`<failure\_cases\>\`\.\*\* Your feedback must be independent of the specific content of the examples\.

\* If an example is absolutely essential to illustrate your point, \*\*you must invent a new, concise, and clear one\*\* that demonstrates the principle effectively\.

\*\*Step 3: Construct the Feedback Object\*\*

Translate your diagnosis into a structured directive for the Editor\. You must use these exact fields:

\* \*\*Locator:\*\* Quote the exact text, section header, or placeholder in the \`<current\_prompt\>\` where the fix should be applied\.

\* \*\*Diagnosis:\*\* Explain the specific weakness identified in Step 1 \(e\.g\., "Type 1 Ambiguity: The adjective 'short' is subjective\."\)\.

\* \*\*Instruction:\*\* The specific action for the Editor to take \(e\.g\., "Replace 'short' with 'maximum 50 words'\."\)\.

\-\-\-

\#\#\# Required Output Format

You may provide explanatory text or rationale first \(e\.g\., "Analysis: \.\.\."\)\.

Your final response must be enclosed in \`<actionable\_feedback\>\` tags\.

Inside these tags, you must strictly follow this format:

<actionable\_feedback\>

\*\*Locator:\*\* \[Quote text or Header in the \`<current\_prompt\>\`\]

\*\*Diagnosis:\*\* \[Brief Type 1/Type 2 explanation\]

\*\*Instruction:\*\* \[Precise editing instruction\]

</actionable\_feedback\>

Listing 4: APEX Critique Prompt

You are an Adaptive Prompt Editor\. Your goal is to rewrite a prompt based on targeted structured feedback\.

\#\#\# Input Data

\*\*1\. Original Prompt:\*\*

<current\_prompt\>

\{prompt\}

</current\_prompt\>

\*\*2\. Critical Feedback:\*\*

<feedback\>

\{feedback\}

</feedback\>

\*\(This feedback contains a Locator, a Diagnosis, and an Instruction\.\)\*

\-\-\-

\#\#\# Execution Protocol

\*\*1\. Analysis Phase\*\*

Before rewriting, explicitly plan your edit based on the input structure:

\* \*\*Target:\*\* Find the specific text cited in the feedback’s \*\*Locator\*\*\.

\* \*\*Context:\*\* Read the \*\*Diagnosis\*\* to understand the intent \(this ensures you don’t fix the grammar but miss the point\)\.

\* \*\*Constraint Verification:\*\* Ensure your planned edit does not accidentally remove critical constraints, negative instructions, or variable placeholders \(e\.g\., ‘\{\{variable\}\}‘\)\.

\*\*2\. Revision Phase\*\*

Rewrite the prompt using the following \*\*"Logic vs\. Syntax"\*\* rules:

\* \*\*Variable Lockdown \(Strict\):\*\* Treat all placeholders \(e\.g\., ‘\{\{variable\}\}‘\) as immutable constants\. Do not introduce any new placeholders, and do not add, remove, rename, or reformat existing ones\. The content represented by the placeholder is non\-editable\.

\* \*\*Logic Lock \(Strict\):\*\* Do not remove or alter instructions, constraints, or steps that are \*not\* targeted by the \*\*Locator\*\*\.

\* \*\*Contextual Integration \(Flexible\):\*\* You \*are\* permitted to adjust the wording, transitions, and grammar of the sentences surrounding the \*\*Locator\*\* to ensure the new changes blend naturally\. The final result should read as a unified document\.

\-\-\-

\#\#\# Output Format

\*\*Part 1: Edit Strategy\*\*

Provide a comprehensive, step\-by\-step plan\. You must include:

\* \*\*The specific text/section you will modify\.\*\*

\* \*\*How you will rephrase it to satisfy the Instruction\.\*\*

\* \*\*How you will ensure surrounding transitions remain smooth\.\*\*

\* \*\*Confirmation that specific variables/constraints are preserved\.\*\*

\*\*Part 2: Revised Prompt\*\*

Output the full text of the revised prompt strictly enclosed within ‘<new\_instruction\>‘ tags\.

Listing 5: APEX Mutation Prompt

\#\#\# Failure Example \(Score: \{score\}\)

\*\*1\. Input Context:\*\*

<input\>

\{query\}

</input\>

\*\*2\. Actual Model Output:\*\*

<actual\_output\>

\{response\}

</actual\_output\>

\*\*3\. Evaluation Feedback \(Why this failed\):\*\*

<critique\>

\{feedback\}

</critique\>

Listing 6: APEX Error Case Template
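The three templates above imply a small amount of glue code: failure cases are rendered into the error case template, the critic's reply is parsed for its Locator, Diagnosis, and Instruction fields, and the editor's reply is checked before the revised prompt is accepted. The Python sketch below illustrates one way this could look. It is not the authors' implementation. All function names are hypothetical, the placeholder pattern assumes the double-brace syntax mentioned in the mutation prompt, and comparing placeholders as a multiset is only one possible reading of the Variable Lockdown rule.

```python
import re
from collections import Counter

# Hypothetical copy of the error case template, with spacing restored.
ERROR_CASE_TEMPLATE = """### Failure Example (Score: {score})
**1. Input Context:**
<input>
{query}
</input>
**2. Actual Model Output:**
<actual_output>
{response}
</actual_output>
**3. Evaluation Feedback (Why this failed):**
<critique>
{feedback}
</critique>"""

FIELDS = ("Locator", "Diagnosis", "Instruction")
# Assumption: placeholders look like {{name}}, as in the mutation prompt's example.
PLACEHOLDER = re.compile(r"\{\{\s*\w+\s*\}\}")
# A field value runs until the next field label or the end of the block.
FIELD_PATTERN = re.compile(
    r"\*\*(Locator|Diagnosis|Instruction):\*\*\s*(.*?)"
    r"(?=\n\s*\*\*(?:Locator|Diagnosis|Instruction):\*\*|\Z)",
    flags=re.DOTALL,
)


def render_error_case(score, query, response, feedback):
    """Fill the four slots of the error case template."""
    return ERROR_CASE_TEMPLATE.format(
        score=score, query=query, response=response, feedback=feedback
    )


def extract_tag(text, tag):
    """Return the body of the last <tag>...</tag> block, or None if absent."""
    name = re.escape(tag)
    matches = re.findall(rf"<{name}>(.*?)</{name}>", text, flags=re.DOTALL)
    return matches[-1].strip() if matches else None


def parse_feedback(critic_output):
    """Parse the Locator/Diagnosis/Instruction fields from the critic's reply."""
    body = extract_tag(critic_output, "actionable_feedback")
    if body is None:
        raise ValueError("missing <actionable_feedback> block")
    parsed = {name: value.strip() for name, value in FIELD_PATTERN.findall(body)}
    missing = [field for field in FIELDS if field not in parsed]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return parsed


def placeholders_preserved(old_prompt, new_prompt):
    """True if both prompts contain the same multiset of {{...}} placeholders."""
    old = Counter(PLACEHOLDER.findall(old_prompt))
    new = Counter(PLACEHOLDER.findall(new_prompt))
    return old == new


def accept_mutation(old_prompt, editor_output):
    """Return the revised prompt, or None if it is missing or alters placeholders."""
    new_prompt = extract_tag(editor_output, "new_instruction")
    if new_prompt is None or not placeholders_preserved(old_prompt, new_prompt):
        return None
    return new_prompt


if __name__ == "__main__":
    print(render_error_case(0.0, "What is 2 + 2?", "5", "Incorrect arithmetic."))
    reply = "<new_instruction>Answer {{question}} in at most 50 words.</new_instruction>"
    print(accept_mutation("Answer {{question}} briefly.", reply))
```

Rejecting a mutation that changes the placeholder set, rather than trying to repair it, keeps the check simple. Whether a rejected candidate should count against the evaluation budget is a design decision this sketch does not address.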
## Appendix D Limitation

While APEX improves the data efficiency of prompt optimization, it shares some fundamental limitations inherent to automated prompt engineering\. First, the framework heavily relies on the availability of a representative dataset and a reliable, programmatic evaluation function, which can be challenging to obtain for highly subjective or open\-ended tasks\. Second, APEX’s effectiveness is bounded by the target model’s underlying capacities\. It assumes the model possesses a reasonable baseline capability to perform the task when properly instructed, and it cannot overcome a fundamental lack of model knowledge or reasoning\. Finally, our current evaluation primarily focuses on text\-based LLMs\. Future work could extend APEX to more diverse and complex scenarios, such as agentic workflows and multimodal tasks\.

## Appendix E Future Work

The APEX framework opens up several promising dimensions for subsequent research\. First, APEX can serve as a versatile, plug\-and\-play core optimization layer within established LLM programming frameworks \[Yuksekgonul et al\.,[2025](https://arxiv.org/html/2606.11459#bib.bib35), Khattab et al\.,[2024](https://arxiv.org/html/2606.11459#bib.bib36)\]\. Second, APEX’s underlying mechanism can be leveraged to improve inference\-time scaling techniques \[Novikov et al\.,[2025](https://arxiv.org/html/2606.11459#bib.bib38), Snell et al\.,[2025](https://arxiv.org/html/2606.11459#bib.bib37)\] by dynamically guiding test\-time compute allocation and refining intermediate search paths\. Finally, future work can extend APEX beyond static, text\-based scenarios to multimodal tasks, complex agentic workflows, and highly interactive environments\. For example, APEX could be utilized to optimize contextual text rephrasing to shape intricate downstream behaviors, such as prompting knowledge sharing via causal language in social networks \[Shi and Morstatter,[2024](https://arxiv.org/html/2606.11459#bib.bib39)\]\.

More broadly, our findings highlight a fundamental, yet underexplored principle for black\-box LLM optimization\. As candidate solutions evolve, the target data distribution dynamically co\-evolves\. Successfully modeling, tracking, and stabilizing this shifting landscape represents a crucial frontier for the robust optimization of black\-box LLMs\.

## Appendix F Impact Statement

This work significantly improves accessibility to high\-performance LLMs by automating complex prompt engineering\. By removing the bottleneck of manual expertise, APEX allows a broader audience, including non\-experts and smaller organizations, to achieve state\-of\-the\-art results previously reserved for specialized teams\. Furthermore, this efficiency drives a critical shift in computational resource allocation\. By demonstrating that the performance gap between optimized prompts and expensive parameter updates is rapidly narrowing, APEX encourages a move towards more sustainable, energy\-efficient model adaptation strategies, fundamentally reducing the computational overhead of AI deployment\. However, we acknowledge that automated prompt optimization is a dual\-use capability\. The same mechanisms that efficiently navigate the prompt space could be misused\. APEX operates as an orchestration layer built on top of LLMs, which serve as the final execution engines\. As LLMs continue to grow safer, they inherently ensure that optimization tools like APEX remain securely bounded, regardless of how a prompt is engineered\.
