Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts

arXiv cs.AI 05/20/26, 04:00 AM Papers
Summary
Introduces ReElicit, a Bayesian optimization framework that uses LLMs to elicit and adapt feature spaces for optimizing system prompts under aggregate scalar feedback, achieving strong performance across ten benchmark tasks.
arXiv:2605.19093v1 Announce Type: new
Cached at: 05/20/26, 08:27 AM
# Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts
Source: [https://arxiv.org/html/2605.19093](https://arxiv.org/html/2605.19093)
Zhiyuan Jerry Lin \(Meta, zylin@meta\.com\), Benjamin Letham \(Meta, bletham@meta\.com\), Samuel Dooley \(Meta, dooley@meta\.com\), Maximilian Balandat \(Meta, balandat@meta\.com\), Eytan Bakshy \(Meta, ebakshy@meta\.com\)

###### Abstract

System prompts are a central control mechanism in modern AI systems, shaping behavior across conversations, tasks, and user populations\. Yet they are difficult to tune when feedback is available only as aggregate metrics rather than per\-example labels, failures, or critiques\. We study this aggregate feedback setting as sample\-constrained black\-box optimization over discrete, variable\-length text\. We introduce ReElicit, a Bayesian optimization framework based on *embedding by elicitation*\. Given a task description, previously evaluated prompts, and scalar scores, an LLM elicits a compact, interpretable feature space and maps prompts into it\. A Gaussian process surrogate is fit in this space, and an acquisition function then selects target feature vectors, which the LLM realizes and refines into deployable system prompts\. Re\-eliciting the feature space as new evaluations arrive lets the representation adapt to the observed prompt\-score history\. We evaluate the setting using offline benchmark accuracy as a controlled aggregate proxy: the optimizer observes one scalar score per prompt and no per\-example labels, errors, or critiques\. Across ten system prompt optimization tasks with a total budget of 30 evaluations, ReElicit achieves the strongest aggregate performance profile among representative aggregate\-only prompt\-optimization baselines\. These results suggest that LLMs can serve as adaptive semantic representation builders, not only prompt generators, for Bayesian optimization over natural\-language artifacts\.

## 1 Introduction

The *system prompt* is a central control mechanism in modern AI systems\. It shapes response style, guardrails, and operational policies that persist across conversations and tasks\. As a result, small prompt changes can affect many downstream interactions\. Despite this importance, system prompts are still often crafted manually using developer intuition and limited offline evaluation\.

Recent work on automatic prompt optimization \(APO\) seeks to tune prompts automatically to maximize a target objective \(Ramnath et al\., [2025](https://arxiv.org/html/2605.19093#bib.bib3)\)\. Many APO methods assume granular task feedback: a candidate prompt is evaluated on labeled examples, and the optimizer can inspect per\-example successes, failures, traces, or critiques\. That interface is powerful, but it differs from many deployment\-facing system prompt optimization settings where outcomes are delayed, population\-level, or meaningful only after repeated interactions\. Examples include long\-horizon task\-completion rates, safety\-incident rates, user satisfaction, retention, and escalation rates\. Although such metrics aggregate many individual interactions, the optimizer may observe only a top\-line scalar score for each deployed prompt variant, one that cannot be decomposed into individual interactions\.

In this regime, prompt optimization is no longer a supervised prompt\-revision problem over labeled examples\. It is a sample\-constrained black\-box optimization problem over natural language\. Directly asking an LLM to propose better prompts is a natural baseline, but such search does not explicitly model uncertainty or provide a principled exploration–exploitation tradeoff\.

Bayesian optimization \(BO\) is a natural tool for expensive scalar\-feedback objectives\. It fits a probabilistic surrogate to past evaluations and uses an acquisition function to balance exploration and exploitation\. BO is widely used for hyperparameter tuning and machine\-learning system design \(Shahriari et al\., [2015](https://arxiv.org/html/2605.19093#bib.bib26); Balandat et al\., [2020](https://arxiv.org/html/2605.19093#bib.bib1)\), A/B testing \(Olson et al\., [2025](https://arxiv.org/html/2605.19093#bib.bib24); Letham et al\., [2019](https://arxiv.org/html/2605.19093#bib.bib25)\), and other aggregate feedback LLM settings such as data mix optimization \(Yen et al\., [2025](https://arxiv.org/html/2605.19093#bib.bib23)\)\.

The obstacle in our setting is representation and realization\. BO typically operates in a fixed low\-dimensional Euclidean domain, whereas system prompts are discrete, variable\-length, and semantically structured natural\-language objects\. Structured\-input BO can use hand\-crafted kernels over discrete objects \(Oh et al\., [2019](https://arxiv.org/html/2605.19093#bib.bib27); Moss et al\., [2020](https://arxiv.org/html/2605.19093#bib.bib28); Griffiths et al\., [2023](https://arxiv.org/html/2605.19093#bib.bib29)\) or learned latent representations \(Gómez\-Bombarelli et al\., [2018](https://arxiv.org/html/2605.19093#bib.bib31); Deshwal and Doppa, [2021](https://arxiv.org/html/2605.19093#bib.bib30); Maus et al\., [2022](https://arxiv.org/html/2605.19093#bib.bib7)\), but these tools do not directly give a full prompt\-optimization loop\. Kernel methods still require searching over text by enumeration or sampling, while learned latent spaces usually require auxiliary data or task\-specific encoder–decoder training\. We therefore need a compact space that supports surrogate modeling and acquisition optimization, together with a way to map optimized points in that space back to deployable system prompts\.

Our approach is to use the LLM itself as a semantic representation builder\. Given a task description, evaluated prompts, and scalar scores, the LLM proposes a small set of performance\-relevant feature axes and maps prompts into $[0,1]^{d_t}$\. The premise is not that prompts are intrinsically simple, but that the task\-relevant variation observed under a specific objective may concentrate along a few semantically meaningful axes\. Useful axes may capture answer\-format control, calibrated uncertainty, explicit reasoning structure, evidence use, or task\-specific distinctions such as numerical consistency, ambiguity resolution, symbolic validity, or pragmatic intent\. These are not surface features such as length or token overlap; they are semantic directions along which prompts can differ\.

This representation gives BO a compact continuous space for surrogate modeling and acquisition optimization\. The LLM then realizes BO\-selected feature targets as deployable prompts and refines them using feature\-gap feedback\. Re\-eliciting the feature space as new evaluations arrive allows the representation to adapt to evidence about which prompt properties distinguish high\- and low\-performing candidates\.

Our key contributions are:

- We cast aggregate feedback system prompt tuning as a black\-box optimization problem, distinct from prompt optimization settings that rely on per\-example labels, error traces, or textual critiques\.
- We introduce ReElicit, a Bayesian optimization framework based on *embedding by elicitation*: an LLM elicits a semantic feature space from prompt\-score history, BO selects target feature vectors in this space, and the LLM realizes and refines those targets as natural language system prompts\.
- We provide a reachability analysis showing how representation error affects optimization in an elicited embedding: under an oracle smooth semantic embedding assumption, near\-optimality for the approximated objective transfers to a bounded true prompt\-quality gap\.
- We evaluate ReElicit on ten benchmark system prompt optimization tasks under a shared 30\-evaluation budget and an aggregate\-only feedback interface\. ReElicit achieves the strongest overall performance profile among representative aggregate\-only APO baselines, and our diagnostics and ablations analyze feature stability, surrogate fit, refinement quality, and component contributions\.

## 2 Related Work

##### Automatic prompt optimization\.

Automatic prompt optimization searches directly over natural\-language instructions \(Ramnath et al\., [2025](https://arxiv.org/html/2605.19093#bib.bib3)\)\. Many APO methods use instance\-level feedback, such as per\-example labels, textual critiques, error traces, or reflections on successes and failures; examples include ProTeGi, TextGrad, and GEPA \(Pryzant et al\., [2023](https://arxiv.org/html/2605.19093#bib.bib12); Yuksekgonul et al\., [2024](https://arxiv.org/html/2605.19093#bib.bib11); Agrawal et al\., [2025](https://arxiv.org/html/2605.19093#bib.bib21)\)\. Other methods are closer to aggregate black\-box search: APE samples prompts from an LLM \(Zhou et al\., [2022](https://arxiv.org/html/2605.19093#bib.bib9)\), OPRO conditions generation on previous solutions and scores \(Yang et al\., [2023](https://arxiv.org/html/2605.19093#bib.bib10)\), PromptBreeder uses evolutionary mutation and recombination \(Fernando et al\., [2023](https://arxiv.org/html/2605.19093#bib.bib13)\), and label\-free prompt optimizers reduce dependence on labeled instance\-level feedback \(Wu et al\., [2025](https://arxiv.org/html/2605.19093#bib.bib15)\)\.

##### Bayesian optimization over structured and embedded spaces\.

BO is most straightforward in low\-dimensional continuous domains, while system prompts are discrete, variable\-length, and semantically structured\. Prior work addresses this challenge through lower\-dimensional embeddings \(Wang et al\., [2016](https://arxiv.org/html/2605.19093#bib.bib2); Letham et al\., [2020](https://arxiv.org/html/2605.19093#bib.bib4)\), kernels for structured objects such as strings or graphs \(Oh et al\., [2019](https://arxiv.org/html/2605.19093#bib.bib27); Moss et al\., [2020](https://arxiv.org/html/2605.19093#bib.bib28); Griffiths et al\., [2023](https://arxiv.org/html/2605.19093#bib.bib29)\), and learned latent spaces or deep kernels \(Gómez\-Bombarelli et al\., [2018](https://arxiv.org/html/2605.19093#bib.bib31); Deshwal and Doppa, [2021](https://arxiv.org/html/2605.19093#bib.bib30); Maus et al\., [2022](https://arxiv.org/html/2605.19093#bib.bib7), [2023](https://arxiv.org/html/2605.19093#bib.bib8); Moss et al\., [2025](https://arxiv.org/html/2605.19093#bib.bib16); Wilson et al\., [2016](https://arxiv.org/html/2605.19093#bib.bib6)\)\. These methods motivate our use of a compact representation, but they do not directly provide a full prompt\-optimization loop\. Structured kernels still require searching over text, typically by sampling or enumeration, and learned latent spaces usually require auxiliary data or a trained encoder–decoder\. One might consider performing BO over off\-the\-shelf dense text embeddings\. However, this is impractical in the small\-data setting: fitting a surrogate in thousands of dimensions on only a handful of observations yields uninformative posteriors\. Furthermore, even if dimensionality reduction \(e\.g\., PCA\) were applied, decoding an optimized continuous latent vector back into deployable discrete text would require auxiliary trained decoders\. In addition, such dimensionality reduction would pick out the most important *general* latent features, whereas what we want are the features most relevant *specifically for system prompt performance* in the target application\. ReElicit bypasses both the curse of dimensionality and the inverse\-mapping problem by allowing the LLM to construct a low\-dimensional, interpretable semantic space where BO targets can be realized natively as deployable prompts via text generation\.

##### Bayesian and surrogate\-based prompt optimization\.

Several methods combine BO or surrogate modeling with prompt or instruction search\. InstructZero optimizes soft prompts for an instruction generator \(Chen et al\., [2023](https://arxiv.org/html/2605.19093#bib.bib17)\); BOInG uses BO in relaxed or generator\-mediated instruction spaces \(Sabbatella et al\., [2024](https://arxiv.org/html/2605.19093#bib.bib18)\); MIPRO uses a Bayesian surrogate to search over instructions and demonstrations for LM programs \(Opsahl\-Ong et al\., [2024](https://arxiv.org/html/2605.19093#bib.bib14)\); HbBoPs combines a structural\-aware deep\-kernel GP with Hyperband for prompt selection \(Schneider et al\., [2024](https://arxiv.org/html/2605.19093#bib.bib32)\); and BOPRO performs BO over fixed embeddings of language\-based solutions \(Agarwal et al\., [2025](https://arxiv.org/html/2605.19093#bib.bib22)\), with related work on prompt and code\-generation search \(Ballew et al\., [2025](https://arxiv.org/html/2605.19093#bib.bib19); Tomar et al\., [2025](https://arxiv.org/html/2605.19093#bib.bib20)\)\. These methods are closely related but target different interfaces, including soft prompts, finite prompt/program configurations, few\-shot demonstrations, candidate pools, or fixed embedding spaces\. ReElicit targets deployable hard system prompts under prompt\-level scalar feedback, using dynamic elicitation to build the BO representation during optimization\.

## 3 Method

### 3\.1 Problem Setting

We consider black\-box optimization over system prompts\. Let $f:\mathcal{X}\to\mathbb{R}$ denote an objective that maps a prompt $x\in\mathcal{X}$ to a scalar score $y=f(x)$\. In deployment, this score may be a delayed aggregate metric such as task\-completion rate, safety\-incident rate, or user satisfaction\. In our experiments, $f(x)$ is benchmark accuracy on a fixed evaluation set, exposed to the optimizer only as a single prompt\-level scalar\. The optimizer does not observe per\-example labels, individual failures, answer traces, or textual critiques\.

The goal is to find a high\-scoring prompt within a highly constrained evaluation budget\. Motivated by the typical setting of conducting multiple long\-running online experiments in parallel, we use batched optimization with batch size $q$\. Let $T$ denote the total number of evaluated batches, including the initial seed batch\. The initial dataset $\mathcal{D}_0=\{(x_i,y_i)\}_{i=1}^{q}$ contains $q$ evaluated seed prompts and corresponds to iteration $t=0$\. For each optimization round $t=1,\ldots,T-1$, the optimizer uses the current history $\mathcal{D}_{t-1}$ to propose a new batch of $q$ prompts, evaluates them, and updates

$$\mathcal{D}_t=\mathcal{D}_{t-1}\cup\{(x_{t,j}^{\mathrm{new}},y_{t,j}^{\mathrm{new}})\}_{j=1}^{q}.$$

After round $t$, the dataset contains $q(t+1)$ evaluated prompts\. The total evaluation budget is $N=qT$\.
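The bookkeeping above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: `run_budgeted_loop`, `toy_objective`, and `toy_proposer` are hypothetical names, and the toy objective merely stands in for an expensive aggregate metric. The values $q=5$ and $T=6$ match the budget used later in the experiments.

```python
def run_budgeted_loop(f, seed_prompts, propose_batch, T):
    """Evaluate T batches of q prompts each; the seed batch counts as batch 0."""
    q = len(seed_prompts)
    history = [(x, f(x)) for x in seed_prompts]      # D_0, iteration t = 0
    sizes = [len(history)]
    for t in range(1, T):                            # rounds t = 1, ..., T-1
        batch = propose_batch(history, q)            # only (prompt, score) pairs are visible
        history.extend((x, f(x)) for x in batch)
        sizes.append(len(history))                   # equals q * (t + 1)
    best_prompt, best_score = max(history, key=lambda pair: pair[1])
    return best_prompt, best_score, sizes


# Hypothetical stand-ins: a cheap scalar objective and a trivial proposer.
def toy_objective(prompt):
    return len(prompt) % 7


def toy_proposer(history, q):
    best = max(history, key=lambda pair: pair[1])[0]
    return [best + " step" * (j + 1) for j in range(q)]


seeds = ["Be concise.", "Think step by step.", "Answer with a number.",
         "Be careful.", "Explain briefly."]
best_prompt, best_score, sizes = run_budgeted_loop(toy_objective, seeds, toy_proposer, T=6)
print(sizes)
```

The proposer only ever receives prompt-level scalars, which is the information restriction that defines the aggregate feedback setting.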

### 3\.2 ReElicit

Input: $\mathcal{D}_0=\{(x_i,y_i)\}_{i=1}^{q}$, total evaluated batches $T$, batch size $q$, acquisition function $\alpha$, elicitation rounds $K$, realization budget $M$, tolerance $\tau$

Output: $x^{*}=\arg\max_{(x,y)\in\mathcal{D}_{T-1}} y$

- $\mathcal{F}_0\leftarrow\varnothing$
- for $t=1,\ldots,T-1$ do
    - Let $X_{t-1},Y_{t-1}$ be the prompts and scores in $\mathcal{D}_{t-1}$
    - for $k=1,\ldots,K$ do
        - $\mathcal{F}_t^{(k)}\leftarrow\mathrm{DefineFeatures}(\mathcal{D}_{t-1},\mathcal{F}_{t-1})$
        - $Z_t^{(k)}\leftarrow\mathrm{ExtractFeatures}(X_{t-1},\mathcal{F}_t^{(k)})$
        - $s_t^{(k)}\leftarrow\mathrm{CV}(Z_t^{(k)},Y_{t-1})$
    - if $t>1$ then
        - Add incumbent $\mathcal{F}_{t-1}$ as an additional candidate by re\-extracting it on $X_{t-1}$ and scoring it by CV
    - Select $\mathcal{F}_t$ with lowest CV error, and let $Z_t$ be the corresponding embeddings
    - Fit GP surrogate $\mathcal{M}_t$ on $(Z_t,Y_{t-1})$
    - $\{z_{t,1}^{\mathrm{new}},\ldots,z_{t,q}^{\mathrm{new}}\}\leftarrow\arg\max_{z_1,\ldots,z_q\in[0,1]^{d_t}}\alpha(z_1,\ldots,z_q\mid\mathcal{M}_t)$
    - for $j=1,\ldots,q$ do
        - $x_{t,j}^{\mathrm{new}}\leftarrow\mathrm{GenerateWithRefinement}(z_{t,j}^{\mathrm{new}},\mathcal{F}_t,\mathcal{D}_{t-1},M,\tau)$
    - Evaluate $y_{t,j}^{\mathrm{new}}=f(x_{t,j}^{\mathrm{new}})$ for each $j$
    - $\mathcal{D}_t\leftarrow\mathcal{D}_{t-1}\cup\{(x_{t,j}^{\mathrm{new}},y_{t,j}^{\mathrm{new}})\}_{j=1}^{q}$
- return $x^{*}=\arg\max_{(x,y)\in\mathcal{D}_{T-1}} y$

Algorithm 1: ReElicit main loop\. Expanded subroutines and prompts are in Appendix [C\.1\.2](https://arxiv.org/html/2605.19093#A3.SS1.SSS2)\.

ReElicit performs BO in an LLM\-elicited feature space rather than directly over text\. At round $t$, the method constructs an embedding $g_t:\mathcal{X}\to[0,1]^{d_t}$ from a set of natural\-language feature descriptions $\mathcal{F}_t$\. Each feature describes a task\-relevant semantic property of a prompt, and the LLM maps prompts to numerical coordinates along these axes\.

Algorithm [1](https://arxiv.org/html/2605.19093#algorithm1) summarizes the main optimization loop\. Feature definition and feature extraction use different information\. DefineFeatures sees the task context and prompt\-score history, allowing performance evidence to inform which semantic axes are proposed\. ExtractFeatures sees prompts and feature definitions, but not scores, so the resulting coordinates are based on prompt content rather than direct outcome leakage\. The selected representation is the candidate feature set with the lowest cross\-validation \(CV\) error for predicting observed scores; when $t>1$, the previous feature set is re\-extracted on the enlarged history and included among the candidates as an incumbent\.
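The selection step can be illustrated with a small sketch. It rests on several assumptions not stated in this section: leave-one-out CV is used, the predictor is a kernel-weighted average rather than the paper's surrogate, and `toy_extract` (a keyword indicator) stands in for the LLM-based ExtractFeatures call. All function names and the data are hypothetical.

```python
import math


def loo_cv_mse(Z, y, bandwidth=0.3):
    """Leave-one-out MSE of a kernel-weighted average predictor (stand-in model)."""
    n = len(y)
    total = 0.0
    for i in range(n):
        num = den = 0.0
        for j in range(n):
            if j == i:
                continue
            d2 = sum((a - b) ** 2 for a, b in zip(Z[i], Z[j]))
            w = math.exp(-d2 / (2 * bandwidth ** 2))
            num += w * y[j]
            den += w
        pred = num / den if den > 0 else sum(y) / n
        total += (pred - y[i]) ** 2
    return total / n


def select_feature_set(candidates, prompts, scores, extract):
    """Return (name, embeddings, cv_error) for the candidate with the lowest CV error."""
    best = None
    for name, feature_defs in candidates.items():
        Z = [extract(p, feature_defs) for p in prompts]   # extraction never sees scores
        err = loo_cv_mse(Z, scores)                        # scores enter only here
        if best is None or err < best[2]:
            best = (name, Z, err)
    return best


# Hypothetical extractor: each "feature" is a keyword indicator in {0, 1}.
def toy_extract(prompt, feature_defs):
    return [1.0 if kw in prompt.lower() else 0.0 for kw in feature_defs]


prompts = ["Think step by step.", "Answer briefly.", "Think step by step and verify.",
           "Be polite.", "Verify each step."]
scores = [0.70, 0.40, 0.75, 0.42, 0.72]
candidates = {"reasoning": ["step", "verify"], "tone": ["polite", "briefly"]}
name, Z, err = select_feature_set(candidates, prompts, scores, toy_extract)
print(name)
```

An incumbent feature set from the previous round would simply be one more entry in `candidates`, re-extracted on the enlarged history and scored in the same way.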

Given the selected embedding, ReElicit fits a Gaussian process surrogate to the embedded observations and uses a batch acquisition function to select target feature vectors\. These vectors are not prompts: they are desired semantic coordinates in the elicited representation\. The LLM realizes each target as a system prompt and refines it using feature\-gap feedback, because the LLM\-induced inverse map from continuous feature coordinates to text is lossy\. The resulting prompts are evaluated with $f$ and appended to the optimization history\.
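A rough sketch of the realize-and-refine step follows. The exact subroutine is in the paper's appendix and is not reproduced here; this version assumes that the realization budget $M$ caps the number of attempts, that $\tau$ is compared against the largest absolute per-feature gap, and that the signed gaps are passed back as feedback. `realize` and `extract` are hypothetical stand-ins for the two LLM calls.

```python
def generate_with_refinement(z_target, realize, extract, max_attempts, tol):
    """Realize a target feature vector as text, refining until the feature gap is small.

    realize(z_target, feedback) -> prompt string      (hypothetical LLM call)
    extract(prompt) -> list of floats in [0, 1]       (hypothetical LLM call)
    """
    best_prompt, best_gap = None, float("inf")
    feedback = None
    for _ in range(max_attempts):
        prompt = realize(z_target, feedback)
        z_actual = extract(prompt)
        gaps = [t - a for t, a in zip(z_target, z_actual)]
        gap = max(abs(g) for g in gaps)
        if gap < best_gap:
            best_prompt, best_gap = prompt, gap
        if gap <= tol:
            break
        feedback = gaps          # signed per-feature gaps guide the next revision
    return best_prompt, best_gap


# Toy stand-ins: a single feature, "verbosity" = min(1, word_count / 20).
def toy_extract(prompt):
    return [min(1.0, len(prompt.split()) / 20)]


class ToyRealizer:
    def __init__(self):
        self.words = 4                                   # deliberately poor first guess

    def __call__(self, z_target, feedback):
        if feedback is not None:
            self.words += round(feedback[0] * 20)        # close the reported gap
        return " ".join(["word"] * max(1, self.words))


prompt, gap = generate_with_refinement([0.5], ToyRealizer(), toy_extract,
                                       max_attempts=3, tol=0.05)
print(len(prompt.split()), gap)
```

Keeping the best attempt seen so far means the function still returns a usable prompt when no attempt falls within the tolerance.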

## 4 Theoretical Analysis

Our theoretical analysis studies the *reachability* question underlying ReElicit: when does a point that is near\-optimal in an elicited, low\-dimensional representation correspond to a good prompt in the original text space? This question is analogous to reachability analyses in random\-embedding BO \(Wang et al\., [2016](https://arxiv.org/html/2605.19093#bib.bib2)\), but differs in that the embedding here is nonlinear, semantic, and constructed by an LLM from the observed prompt\-score history\.

We formalize this by comparing the elicited embedding to an oracle semantic embedding under which the objective is smooth\. If the elicited embedding preserves the performance\-relevant geometry of this oracle representation, then points that are good in the elicited representation are also good for the true objective\. The resulting bound identifies the role of representation error in ReElicit: the better the elicited embedding approximates the oracle embedding, the tighter the connection between optimization in feature space and optimization over prompts\.

Let $\mathcal{X}$ be a finite prompt universe\. We assume there exists an injective oracle embedding $g^{*}:\mathcal{X}\to\mathcal{Z}\subset\mathbb{R}^{d}$ that, combined with a stationary base latent kernel $k_Z(z,z')$, produces an oracle kernel $k^{*}(x,x')=k_Z(g^{*}(x),g^{*}(x'))$\.

###### Assumption 1\.

The true objective function $f:\mathcal{X}\to\mathbb{R}$ resides in the RKHS of the oracle kernel, $f\in\mathcal{H}_{k^{*}}$, with bounded norm $\|f\|_{k^{*}}\leq B$\.

###### Assumption 2\.

The RKHS latent feature map $\phi:\mathcal{Z}\to\mathcal{H}_Z$ is Lipschitz continuous with respect to the latent distance: $\|\phi(z)-\phi(z')\|_{\mathcal{H}_Z}\leq L\|z-z'\|$ for all $z,z'\in\mathcal{Z}$\.

We model $f$ through the elicited embedding $g_t$\. Because $g_t$ may differ from the oracle embedding $g^{*}$, the objective representable through $g_t$ may only approximate the true objective\. For the analysis, let $f_t$ denote the best uniform approximation to $f$ over $\mathcal{X}$ among functions representable through $g_t$ with RKHS norm at most $B$; the full definition is given in Appendix [A](https://arxiv.org/html/2605.19093#A1)\. The quality of this approximation depends on the mismatch between $g_t$ and $g^{*}$\.

###### Assumption 3\.

At iteration $t$, the error between the elicited embedding $g_t(x)$ and the oracle embedding $g^{*}(x)$ is bounded by $\eta_t\geq 0$, such that $\|g^{*}(x)-g_t(x)\|\leq\eta_t$ for all $x\in\mathcal{X}$\.

This assumption abstracts away the variable\-dimensional feature sets used in the implementation and analyzes a fixed latent space in which the elicited and oracle embeddings can be compared\.

###### Definition 1\.

Let $x^{*}$ be an optimum of $f$: $f(x^{*})=\max_{x\in\mathcal{X}}f(x)$\. The $\epsilon$\-suboptimal set of $f$ is $S_{\epsilon}(f)=\{x:f(x^{*})-f(x)\leq\epsilon\}$\.

We take Assumptions [1](https://arxiv.org/html/2605.19093#Thmassumption1)–[3](https://arxiv.org/html/2605.19093#Thmassumption3) throughout the analysis\. The main result bounds the true optimality gap of any point that is nearly optimal under the approximated objective $f_t$\.

###### Theorem 1\.

Suppose $x_t$ is $\delta$\-suboptimal with respect to $f_t$, i\.e\., $\max_{x\in\mathcal{X}}f_t(x)-f_t(x_t)\leq\delta$\. Then $x_t$ is $\epsilon$\-suboptimal for the true objective with $\epsilon=\delta+2BL\eta_t$, i\.e\.,

$$f(x^{*})-f(x_t)\leq\delta+2BL\eta_t.$$

Several insights follow from this result\. The term $\delta$ captures the optimization error within the approximated objective induced by the elicited embedding\. The term $2BL\eta_t$ bounds the representation error: the price paid for optimizing through $g_t$ rather than the oracle embedding $g^{*}$\. The RKHS norm $B$ and Lipschitz constant $L$ show that this price is smaller when the objective varies smoothly in the latent semantic space\. This aligns with the regime where aggregate prompt metrics are expected to be amenable to BO: small semantic changes in a system prompt can produce gradual changes in average downstream performance\.
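One way to see where the constant $2BL\eta_t$ comes from is the sketch below. It is a reconstruction under Assumptions 1–3, not the appendix proof. The element $h$ and the auxiliary function $\tilde f_t$ are names introduced here, and the first line assumes the standard representation of a function in the RKHS of a kernel composed with an embedding.

```latex
\begin{align*}
f(x) &= \langle h, \phi(g^{*}(x)) \rangle_{\mathcal{H}_Z}, \quad \|h\|_{\mathcal{H}_Z} \le B
  && \text{(Assumption 1; $h$ introduced here)} \\
\tilde f_t(x) &:= \langle h, \phi(g_t(x)) \rangle_{\mathcal{H}_Z}
  && \text{(representable through $g_t$, norm at most $B$)} \\
|f(x) - \tilde f_t(x)| &\le \|h\|_{\mathcal{H}_Z}\, \|\phi(g^{*}(x)) - \phi(g_t(x))\|_{\mathcal{H}_Z}
  \le B L \eta_t
  && \text{(Cauchy--Schwarz, Assumptions 2 and 3)} \\
\|f - f_t\|_{\infty} &\le \|f - \tilde f_t\|_{\infty} \le B L \eta_t
  && \text{($f_t$ is the best uniform approximation)} \\
f(x^{*}) - f(x_t) &= \underbrace{f(x^{*}) - f_t(x^{*})}_{\le BL\eta_t}
  + \underbrace{f_t(x^{*}) - f_t(x_t)}_{\le \delta}
  + \underbrace{f_t(x_t) - f(x_t)}_{\le BL\eta_t}
  \le \delta + 2BL\eta_t .
\end{align*}
```

The middle term is at most $\delta$ because $f_t(x^{*})\leq\max_{x\in\mathcal{X}}f_t(x)$, so the representation error is paid twice: once at the true optimum and once at the selected point.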

Re\-elicitation can be viewed as a mechanism for reducing $\eta_t$ as data accumulate\. Each new batch gives the LLM additional evidence about which semantic properties distinguish high\- and low\-performing prompts\. When this evidence leads to an embedding closer to the oracle performance\-relevant representation, the bound tightens\. Thus, the theorem identifies a concrete pathway by which dynamic elicitation can improve BO over prompts: by reducing representation error, it tightens the guarantee connecting feature\-space optima to true prompt quality\.

## 5 Experiments

### 5\.1 Setup

We evaluate ReElicit on ten system prompt optimization tasks under the aggregate feedback interface of Section [3\.1](https://arxiv.org/html/2605.19093#S3.SS1), studied in a controlled offline setting\. Our experiments instantiate the objective $f(x)$ using benchmark accuracy on fixed evaluation sets, but the optimizer observes only a single scalar score for each prompt\. It does not observe per\-example labels, individual failures, answer traces, or textual critiques\. This protocol isolates the methodological question we target: how can one perform sample\-efficient search over system prompts when only prompt\-level aggregate feedback is available?

##### Baselines\.

We compare against aggregate\-only adaptations of four representative hard\-prompt search methods: APE\-style sampling \(Zhou et al\., [2022](https://arxiv.org/html/2605.19093#bib.bib9)\), which samples history\-free prompts; OPRO \(Yang et al\., [2023](https://arxiv.org/html/2605.19093#bib.bib10)\), which conditions generation on score\-sorted prompt histories; PromptBreeder \(Fernando et al\., [2023](https://arxiv.org/html/2605.19093#bib.bib13)\), which mutates and recombines prompts from a fitness\-sorted population; and TextGrad\-style refinement \(Yuksekgonul et al\., [2024](https://arxiv.org/html/2605.19093#bib.bib11)\), which critiques the scalar prompt\-score trajectory and proposes variants\. These baselines test whether embedding by elicitation and BO\-guided realization improve over common LLM\-guided hard\-prompt search when all methods receive the same prompt\-level scalar feedback\. Closely related BO\-based prompt optimizers are discussed in Section [2](https://arxiv.org/html/2605.19093#S2); many target soft prompts, finite instruction–demonstration configurations, candidate pools, fixed embeddings, or multi\-fidelity validation subsets rather than deployable hard system prompts\. All baselines share the task context, initial evaluated dataset $\mathcal{D}_0$, scalar\-score history, and target\-model evaluation budget with ReElicit\. Full aggregate\-only baseline adaptations and prompts are given in Appendix [C\.1\.3](https://arxiv.org/html/2605.19093#A3.SS1.SSS3)\.

##### Benchmarks and protocol\.

The tasks are drawn from GSM8K \(Cobbe et al\., [2021](https://arxiv.org/html/2605.19093#bib.bib33)\), MMLU \(Hendrycks et al\., [2021](https://arxiv.org/html/2605.19093#bib.bib34)\), and BIG\-Bench Hard \(BBH\) \(Suzgun et al\., [2022](https://arxiv.org/html/2605.19093#bib.bib35)\)\. GSM8K and MMLU use fixed 500\-question subsamples; the remaining eight tasks are BBH tasks with 250 examples each\. Appendix [C\.2](https://arxiv.org/html/2605.19093#A3.SS2) lists the task descriptions and optimizer\-facing task contexts\. Each run has a budget of $N=30$ target\-model evaluations: $q=5$ prompts per batch across $T=6$ evaluated batches, where the first batch is the shared seed dataset $\mathcal{D}_0$ generated by Algorithm [2](https://arxiv.org/html/2605.19093#algorithm2) and the remaining five batches are optimization rounds\. Unless otherwise noted, we report means and 95% confidence intervals over 30 independent seeds\.

##### LLM use and evaluation\.

LLMs are core experimental components\. The optimizer LLM is Llama 3\.3 70B Instruct, and the target LLM is Llama 3\.1 8B Instruct\. In ReElicit, the optimizer LLM elicits feature axes, extracts prompt coordinates, realizes BO\-selected feature targets as prompts, and refines feature gaps; in the baselines, the same optimizer LLM generates candidates or critiques according to the corresponding baseline\. The target LLM is queried only through the evaluation function $f$: each evaluation runs the target model with system prompt $x$ on the fixed benchmark subset using zero\-shot greedy decoding and returns aggregate accuracy\. We use the smaller target model to avoid task saturation and keep evaluation tractable in terms of wall time and token usage\.

Our primary budget is the number of target\-model evaluations, since each evaluation runs the target model over hundreds of benchmark examples\. ReElicit uses additional optimizer\-side LLM calls for elicitation, extraction, realization, and refinement, so the experiments evaluate sample efficiency in target evaluations rather than total LLM\-call efficiency\. In practice, the cost of optimization is typically dwarfed by the cost of evaluating $f$, which often requires large\-scale online A/B testing across large user populations\. Shared history subsampling, optimizer\-side temperatures, hyperparameters, information\-access controls, and exact prompts are reported in Appendices [C\.3](https://arxiv.org/html/2605.19093#A3.SS3) and [D](https://arxiv.org/html/2605.19093#A4)\.

### 5\.2 Main Result

Table [1](https://arxiv.org/html/2605.19093#S5.T1) reports the final best score achieved by each method under the shared 30\-evaluation budget\. For each seed and method, we take the best prompt found by the end of optimization and report the mean and 95% confidence interval across seeds\. ReElicit has the strongest overall performance profile: it is either the numerically best method or statistically indistinguishable from the best method on all ten tasks\. The absolute margins vary across tasks, which is expected in this small\-budget setting, but the aggregate comparison in Table [2](https://arxiv.org/html/2605.19093#S5.T2) shows that the effect is consistent across runs\. Averaged over task\-seed pairs, ReElicit matches or exceeds each of the APE, OPRO, PromptBreeder, and TextGrad baselines more often than any other method does\. This suggests that the elicited representation and BO\-guided realization loop improve robustness across tasks rather than producing gains on only a single benchmark\.
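The reporting convention can be sketched as follows. The section does not say how the 95% intervals are computed, so the normal-approximation half-width of 1.96 standard errors used here is an assumption, and the score trajectories are made-up numbers for illustration only.

```python
import math
import statistics


def mean_ci95(values):
    """Mean and normal-approximation 95% half-width (1.96 * standard error)."""
    m = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return m, 1.96 * sem


# Hypothetical score trajectories for three seeds of one method on one task.
runs = [
    [0.41, 0.47, 0.52, 0.50],
    [0.39, 0.44, 0.61, 0.58],
    [0.45, 0.43, 0.49, 0.70],
]
final_best = [max(run) for run in runs]     # best prompt score found by the end of each run
mean, half_width = mean_ci95(final_best)
print(f"{mean:.3f} +/- {half_width:.3f}")
```

With only a few seeds a t-based interval would be wider; the paper's 30 seeds make the two choices closer.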

Table 1: Main performance comparison\. Entries are final best scores after 30 prompt evaluations, reported as mean $\pm$ 95% confidence interval over seeds\. Bold cells indicate the numerical best method and methods not significantly different from the best\. ReElicit is statistically better or tied for best on all tasks and has the strongest aggregate pairwise win\-or\-tie profile in Table [2](https://arxiv.org/html/2605.19093#S5.T2)\.

Table 2: Pairwise win\-or\-tie rate over final optimization results\. Each cell is the fraction of task\-seed pairs where the row method matches or exceeds the column method in final best score\. Ties count for both methods, so opposite\-direction entries need not sum to one\. The final column averages each row’s off\-diagonal entries\. ReElicit has the highest average win\-or\-tie rate, indicating the most consistent aggregate performance across baselines\.
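The win-or-tie statistic described in the Table 2 caption can be computed as in the sketch below. The method names and scores are placeholders, not results from the paper, and `win_or_tie_rates` is a hypothetical helper.

```python
def win_or_tie_rates(final_scores):
    """final_scores[method][(task, seed)] -> final best score for that run."""
    methods = list(final_scores)
    rates = {}
    for row in methods:
        rates[row] = {}
        for col in methods:
            if row == col:
                continue
            keys = final_scores[row].keys() & final_scores[col].keys()
            wins = sum(final_scores[row][k] >= final_scores[col][k] for k in keys)
            rates[row][col] = wins / len(keys)
        # Average of the off-diagonal entries computed so far for this row.
        rates[row]["avg"] = sum(rates[row].values()) / (len(methods) - 1)
    return rates


# Placeholder data: two methods, three task-seed pairs, one exact tie.
scores = {
    "method_a": {("task1", 0): 0.80, ("task1", 1): 0.70, ("task2", 0): 0.50},
    "method_b": {("task1", 0): 0.80, ("task1", 1): 0.60, ("task2", 0): 0.60},
}
rates = win_or_tie_rates(scores)
print(rates["method_a"]["method_b"], rates["method_b"]["method_a"])
```

Because the tie on the first pair counts for both methods, the two opposite-direction entries here sum to more than one, as the caption notes.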
### 5\.3Component Analysis

We next ask whether the elicited representation has the properties needed for BO over prompts: repeatable feature extraction, conservative adaptation across rounds, low dimensionality, predictive surrogate fit, and actionable targets for prompt generation\.

##### Feature generation and extraction\.

A useful dynamic representation should be consistent under repeated extraction, but still able to adapt as new prompt\-score evidence arrives\. We measure consistency by re\-extracting feature values three times under the same selected feature definitions, and measure adaptation using cross\-iteration linear centered kernel alignment \(CKA\) \(Kornblith et al\., [2019](https://arxiv.org/html/2605.19093#bib.bib5)\) on prompts shared across adjacent histories\.

![Refer to caption](https://arxiv.org/html/2605.19093v1/x1.png) \(a\) Per\-feature std
![Refer to caption](https://arxiv.org/html/2605.19093v1/x2.png) \(b\) Cross\-iteration linear CKA

Figure 1: Feature stability and conservative adaptation\. \(a\) Repeated extraction is highly stable: fewer than 4% of feature values have standard deviation above 0\.05, and 0\.7% exceed 0\.1\. \(b\) Cross\-iteration CKA has mean 0\.81, indicating high but imperfect alignment between successive selected representations\.

Figure [1](https://arxiv.org/html/2605.19093#S5.F1) shows that extraction noise is small and that re\-elicited representations adapt conservatively\. CKA values are high but not concentrated at one, suggesting that the representation preserves enough geometry for BO to accumulate evidence while still changing as additional prompt\-score pairs reveal new semantic distinctions\. Appendix Table [4](https://arxiv.org/html/2605.19093#A2.T4) gives a qualitative example from one of the evaluated tasks: selected features evolve across iterations while cross\-validation MSE decreases\.
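
For readers unfamiliar with linear CKA, the sketch below computes it in the Gram\-matrix form of Kornblith et al\. \(2019\) for two feature matrices whose rows are the same prompts\. It is a plain\-Python illustration under assumptions: the function names and the toy feature values are hypothetical, and the paper's own implementation may differ\.

```python
import math


def centered_gram(F):
    """Gram matrix of the rows of F after centering each feature column."""
    n, d = len(F), len(F[0])
    means = [sum(row[j] for row in F) / n for j in range(d)]
    C = [[row[j] - means[j] for j in range(d)] for row in F]
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in C] for ri in C]


def frob_inner(A, B):
    """Frobenius inner product of two equally sized matrices."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def linear_cka(F1, F2):
    """Linear CKA between two representations of the same prompts.

    F1 and F2 have one row per shared prompt; the number of feature
    columns may differ between them.
    """
    K, L = centered_gram(F1), centered_gram(F2)
    return frob_inner(K, L) / math.sqrt(frob_inner(K, K) * frob_inner(L, L))


# Made-up feature values for four shared prompts.
F_prev = [[0.1, 0.9], [0.4, 0.5], [0.8, 0.2], [0.6, 0.7]]
F_same = [[2 * b, 2 * a] for a, b in F_prev]      # columns swapped and rescaled
F_next = [[a, b, a * b] for a, b in F_prev]       # an extra, third feature
cka_same = linear_cka(F_prev, F_same)
cka_next = linear_cka(F_prev, F_next)
```

Linear CKA is invariant to column permutations and isotropic rescaling, so `cka_same` equals one up to rounding, while adding a feature generally yields a value below one\. Because only Gram matrices are compared, the two representations may have different dimensionality\.
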

##### Predictive features and actionable targets\.

The BO loop needs the elicited feature space to be useful in two senses\. First, the representation must make prompt scores predictable from very small histories: during optimization, the GP surrogate is fit with only 5 to 25 observed prompt\-score pairs\. Second, BO\-selected feature targets must be realizable as natural\-language prompts in a way that matters for downstream improvement\.

![Refer to caption](https://arxiv.org/html/2605.19093v1/x3.png) \(a\) GP surrogate fit
![Refer to caption](https://arxiv.org/html/2605.19093v1/x4.png) \(b\) Gap vs\. improvement probability

Figure 2: Predictiveness and actionability of the elicited feature space\. \(a\) Dynamic features yield lower GP cross\-validation MSE than static initial features\. Both are significantly better than a mean\-predictor baseline\. \(b\) Smaller final $\ell_2$ gaps to BO targets are associated with higher improvement probability\.

Figure [2](https://arxiv.org/html/2605.19093#S5.F2) shows that the elicited representation is useful both for modeling and for generation\. Dynamic elicited features reduce GP cross\-validation MSE relative to static initial features and a mean\-predictor baseline, supporting the value of re\-eliciting features as new prompt\-score evidence accumulates\. Separately, prompts whose final extracted features are closer to the BO\-selected target are more likely to improve the best\-so\-far score in the next evaluation\. This supports feature\-gap refinement as a useful mechanism for translating BO targets back into deployable prompts\. Appendix [B](https://arxiv.org/html/2605.19093#A2) reports additional diagnostics and analyses\.
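
To make the surrogate\-fit diagnostic concrete, the sketch below compares leave\-one\-out MSE of a small GP posterior mean against a mean\-predictor baseline\. Everything here is an assumption for illustration: the RBF kernel, the fixed lengthscale and noise level, the plain\-Python solver, and the toy data are not taken from the paper, which may use a different kernel, hyperparameter fitting, and cross\-validation scheme\.

```python
import math


def rbf(a, b, lengthscale=0.5):
    """Squared-exponential kernel between two feature vectors."""
    sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq / (2 * lengthscale ** 2))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def gp_mean(X, y, x_new, noise=1e-2):
    """GP posterior mean at x_new, using the training average as prior mean."""
    mu = sum(y) / len(y)
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    alpha = solve(K, [v - mu for v in y])
    return mu + sum(rbf(x_new, xi) * ai for xi, ai in zip(X, alpha))


def mean_predictor(X, y, x_new):
    """Baseline that ignores features and predicts the training average."""
    return sum(y) / len(y)


def loo_mse(X, y, predict):
    """Leave-one-out mean squared error of a predict(X, y, x_new) function."""
    errs = []
    for i in range(len(X)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        errs.append((predict(X_tr, y_tr, X[i]) - y[i]) ** 2)
    return sum(errs) / len(errs)


# Toy history: 10 prompts, one hypothetical feature in [0, 1], score rising with it.
X = [[i / 9] for i in range(10)]
y = [0.4 + 0.3 * x[0] for x in X]
gp_mse = loo_mse(X, y, gp_mean)
base_mse = loo_mse(X, y, mean_predictor)
```

When the features carry signal about the score, as in this toy history, the GP's leave\-one\-out error falls below the mean\-predictor baseline; uninformative features would not show this gap\.
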

### 5\.4Ablation Study

We ablate four design choices\. No Refinement accepts the initial realization of each BO target without feature\-gap refinement\. No BO replaces acquisition optimization with uniform sampling in the elicited feature space\. Static Features freezes the first selected feature set and removes dynamic re\-elicitation\. Independent Extraction rates one prompt at a time instead of extracting features jointly in batches\.

Figure [3](https://arxiv.org/html/2605.19093#S5.F3) summarizes convergence across the ten benchmarks, and Appendix Table [3](https://arxiv.org/html/2605.19093#A2.T3) reports paired final\-score differences\. The clearest performance contributions come from feature\-gap refinement and BO target selection: removing refinement or replacing BO with random feature\-space sampling produces the largest drops in some of the best\-so\-far optimization curves as well as in final scores\. These results support the two main algorithmic roles of the loop: realizing BO\-selected semantic targets as prompts, and using uncertainty\-aware acquisition rather than unguided feature\-space sampling\. Freezing the feature set also decreases optimization performance, but the effect is milder than removing refinement or replacing BO with random feature\-space sampling\. This suggests that even the initial dataset often lets the LLM elicit semantically meaningful features that support effective optimization\. Although static features are less predictive than their dynamically re\-elicited counterparts, as shown in Figure [2\(a\)](https://arxiv.org/html/2605.19093#S5.F2.sf1), GPs fitted on static features still achieve low cross\-validation MSE relative to the baseline\.

Independent Extraction isolates an implementation choice rather than a core algorithmic component\. Its similar final performance suggests that joint extraction preserves optimization quality while reducing feature\-extraction cost\. With extraction batch size $b=10$, extracting features for $n$ prompts requires $\lceil n/b \rceil$ LLM calls per feature set rather than $n$ calls; at the largest history size in our experiments, this is 3 calls rather than 25 per feature set\.
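
The call count above is simple ceiling arithmetic\. A short Python check, with the hypothetical helper name `extraction_calls` introduced only for illustration:

```python
import math


def extraction_calls(n_prompts, batch_size=10):
    """LLM calls needed to extract one feature set for n_prompts prompts."""
    return math.ceil(n_prompts / batch_size)


joint_calls = extraction_calls(25, batch_size=10)        # batched extraction
independent_calls = extraction_calls(25, batch_size=1)   # one prompt per call
```

With 25 prompts in the history, batched extraction needs 3 calls per feature set and independent extraction needs 25, matching the figures quoted in the text\.
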

![Refer to caption](https://arxiv.org/html/2605.19093v1/x5.png) Figure 3: Ablation convergence curves\. Removing feature\-gap refinement or replacing BO with random feature\-space sampling produces the clearest degradation\. Static features and independent extraction yield smaller differences, which are quantified in Appendix Table [3](https://arxiv.org/html/2605.19093#A2.T3)\.

## 6Discussion

ReElicit treats system\-prompt tuning as black\-box optimization with a learned semantic coordinate system\. Prompts are discrete text objects, but the search directions that matter are semantic, task\-dependent, and unavailable in advance\. ReElicit uses the optimizer LLM to elicit compact feature axes from prompt\-score history, BO to select target coordinates, and the LLM again to realize and feature\-gap\-refine those targets into deployable prompts\. This couples LLM prior knowledge and text generation with BO’s uncertainty\-aware search under scarce aggregate evaluations\.

Across ten tasks, ReElicit has the strongest aggregate profile: the highest pairwise win\-or\-tie rate and performance that is numerically best or not significantly worse than the numerical best on every task\. Diagnostics support the representation view: elicited spaces are stable under repeated extraction but adapt across rounds, improve GP surrogate fit, and yield BO targets whose realized feature gaps predict improvement\. Ablations show the largest contributions from BO target selection and feature\-gap refinement, while dynamic re\-elicitation improves the surrogate’s predictive structure\.

The main limitations concern evaluation scope\. For reproducibility and controlled comparison, we instantiate aggregate feedback using offline benchmark accuracy on GSM8K, MMLU, and BBH\. This is standard in prompt\-optimization work and lets us compare methods under identical scalar\-feedback budgets, but it is only a proxy for motivating deployment objectives such as user satisfaction, retention, or safety\-incident rate\. Testing those objectives would require live systems, changing user populations, and less standardized baselines\.

Related APO/BO methods discussed in Section [2](https://arxiv.org/html/2605.19093#S2) often optimize different objects or assume different interfaces, such as soft prompts, instruction\-demonstration program spaces, candidate pools, fixed embeddings, validation subsets, or instance\-level feedback\. Adapting them would change what information the optimizer receives or what search space is being optimized, so our empirical claim is an apples\-to\-apples comparison within the aggregate\-only hard\-prompt setting rather than a universal prompt\-optimization leaderboard\. ReElicit also uses additional optimizer\-side LLM calls\. Therefore, our efficiency claim is target\-evaluation efficiency, most relevant when each evaluation is a costly aggregate measurement\.

More broadly, embedding by elicitation suggests a recipe for optimizing language\-describable structured artifacts\. Whenever candidates can be specified in text, constraints can be expressed textually, and evaluation is an expensive scalar measurement, an LLM can construct an interpretable adaptive search space while BO performs sample\-efficient exploration within it\. In principle, this pattern could apply beyond system prompts to other modalities \(such as images, audio, or video by using multi\-modal models\) and other applications such as agentic tool\-use instructions or evaluation rubrics\.

##### Broader impacts\.

System prompt optimization can improve reliability and controllability by making prompt tuning more systematic under limited feedback\. The main risks are objective\-driven: like other black\-box optimizers, the method could optimize harmful goals or over\-optimize a poorly specified metric in ways that weaken unmeasured safety, fairness, privacy, or robustness constraints\. Deployment\-facing use should therefore treat metric design as part of the safety problem, with human review, held\-out evaluation, safety\-relevant slices, robustness checks, and explicit tests that gains on the target metric do not come from violating constraints omitted from the aggregate objective\. Our experiments are offline benchmark evaluations and do not deploy optimized prompts to users\.

## References

- Searching for optimal solutions with LLMs via Bayesian optimization\. In The Thirteenth International Conference on Learning Representations\. Cited by: [§2](https://arxiv.org/html/2605.19093#S2.SS0.SSS0.Px3.p1.1)\.
- L\. A\. Agrawal, S\. Tan, D\. Soylu, N\. Ziems, R\. Khare, K\. Opsahl\-Ong, A\. Singhvi, H\. Shandilya, M\. J\. Ryan, M\. Jiang, et al\. \(2025\) GEPA: reflective prompt evolution can outperform reinforcement learning\. arXiv preprint arXiv:2507\.19457\. Cited by: [§2](https://arxiv.org/html/2605.19093#S2.SS0.SSS0.Px1.p1.1)\.
- M\. Balandat, B\. Karrer, D\. R\. Jiang, S\. Daulton, B\. Letham, A\. G\. Wilson, and E\. Bakshy \(2020\) BoTorch: A Framework for Efficient Monte\-Carlo Bayesian Optimization\. In Advances in Neural Information Processing Systems 33\. External Links: [Link](http://arxiv.org/abs/1910.06403)\. Cited by: [§1](https://arxiv.org/html/2605.19093#S1.p4.1)\.
- A\. Ballew, J\. Wang, and S\. Ren \(2025\) LLM based Bayesian optimization for prompt search\. arXiv preprint arXiv:2510\.04384\. Cited by: [§2](https://arxiv.org/html/2605.19093#S2.SS0.SSS0.Px3.p1.1)\.
- L\. Chen, J\. Chen, T\. Goldstein, H\. Huang, and T\. Zhou \(2023\) InstructZero: efficient instruction optimization for black\-box large language models\. arXiv preprint arXiv:2306\.03082\. Cited by: [§2](https://arxiv.org/html/2605.19093#S2.SS0.SSS0.Px3.p1.1)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman \(2021\) Training verifiers to solve math word problems\. External Links: 2110\.14168\. Cited by: [§5\.1](https://arxiv.org/html/2605.19093#S5.SS1.SSS0.Px2.p1.4)\.
- A\. Deshwal and J\. Doppa \(2021\) Combining latent space and structured kernels for Bayesian optimization over combinatorial spaces\. Advances in Neural Information Processing Systems 34, pp\. 8185–8200\. Cited by: [§1](https://arxiv.org/html/2605.19093#S1.p5.1), [§2](https://arxiv.org/html/2605.19093#S2.SS0.SSS0.Px2.p1.1)\.
- C\. Fernando, D\. Banarse, H\. Michalewski, S\. Osindero, and T\. Rocktäschel \(2023\) Promptbreeder: self\-referential self\-improvement via prompt evolution\. arXiv preprint arXiv:2309\.16797\. Cited by: [§2](https://arxiv.org/html/2605.19093#S2.SS0.SSS0.Px1.p1.1), [§5\.1](https://arxiv.org/html/2605.19093#S5.SS1.SSS0.Px1.p1.1)\.
- R\. Gómez\-Bombarelli, J\. N\. Wei, D\. Duvenaud, J\. M\. Hernández\-Lobato, B\. Sánchez\-Lengeling, D\. Sheberla, J\. Aguilera\-Iparraguirre, T\. D\. Hirzel, R\. P\. Adams, and A\. Aspuru\-Guzik \(2018\) Automatic chemical design using a data\-driven continuous representation of molecules\. ACS Central Science 4\(2\), pp\. 268–276\. Cited by: [§1](https://arxiv.org/html/2605.19093#S1.p5.1), [§2](https://arxiv.org/html/2605.19093#S2.SS0.SSS0.Px2.p1.1)\.
- R\. Griffiths, L\. Klarner, H\. Moss, A\. Ravuri, S\. Truong, Y\. Du, S\. Stanton, G\. Tom, B\. Rankovic, A\. Jamasb, et al\. \(2023\) GAUCHE: a library for Gaussian processes in chemistry\. Advances in Neural Information Processing Systems 36, pp\. 76923–76946\. Cited by: [§1](https://arxiv.org/html/2605.19093#S1.p5.1), [§2](https://arxiv.org/html/2605.19093#S2.SS0.SSS0.Px2.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2021\) Measuring massive multitask language understanding\. Proceedings of the International Conference on Learning Representations \(ICLR\)\. Cited by: [§5\.1](https://arxiv.org/html/2605.19093#S5.SS1.SSS0.Px2.p1.4)\.
- S\. Kornblith, M\. Norouzi, H\. Lee, and G\. Hinton \(2019\) Similarity of neural network representations revisited\. In International Conference on Machine Learning, pp\. 3519–3529\. Cited by: [§5\.3](https://arxiv.org/html/2605.19093#S5.SS3.SSS0.Px1.p1.1)\.
- B\. Letham, R\. Calandra, A\. Rai, and E\. Bakshy \(2020\) Re\-examining linear embeddings for high\-dimensional Bayesian optimization\. Advances in Neural Information Processing Systems 33, pp\. 1546–1558\. Cited by: [§2](https://arxiv.org/html/2605.19093#S2.SS0.SSS0.Px2.p1.1)\.
- B\. Letham, B\. Karrer, G\. Ottoni, and E\. Bakshy \(2019\) Constrained Bayesian optimization with noisy experiments\. Bayesian Analysis\. Cited by: [§1](https://arxiv.org/html/2605.19093#S1.p4.1)\.
- N\. Maus, H\. Jones, J\. Moore, M\. J\. Kusner, J\. Bradshaw, and J\. Gardner \(2022\) Local latent space Bayesian optimization over structured inputs\. Advances in Neural Information Processing Systems 35, pp\. 34505–34518\. Cited by: [§1](https://arxiv.org/html/2605.19093#S1.p5.1), [§2](https://arxiv.org/html/2605.19093#S2.SS0.SSS0.Px2.p1.1)\.
- N\. Maus, Z\. J\. Lin, M\. Balandat, and E\. Bakshy \(2023\) Joint composite latent space Bayesian optimization\. arXiv preprint arXiv:2311\.02213\. Cited by: [§2](https://arxiv.org/html/2605.19093#S2.SS0.SSS0.Px2.p1.1)\.
- H\. B\. Moss, S\. W\. Ober, and T\. Diethe \(2025\) Return of the latent space cowboys: re\-thinking the use of VAEs for Bayesian optimisation of structured spaces\. arXiv preprint arXiv:2507\.03910\. Cited by: [§2](https://arxiv.org/html/2605.19093#S2.SS0.SSS0.Px2.p1.1)\.
- H\. Moss, D\. Leslie, D\. Beck, J\. Gonzalez, and P\. Rayson \(2020\) BOSS: Bayesian optimization over string spaces\. Advances in Neural Information Processing Systems 33, pp\. 15476–15486\. Cited by: [§1](https://arxiv.org/html/2605.19093#S1.p5.1), [§2](https://arxiv.org/html/2605.19093#S2.SS0.SSS0.Px2.p1.1)\.
- C\. Oh, J\. Tomczak, E\. Gavves, and M\. Welling \(2019\) Combinatorial Bayesian optimization using the graph Cartesian product\. Advances in Neural Information Processing Systems 32\. Cited by: [§1](https://arxiv.org/html/2605.19093#S1.p5.1), [§2](https://arxiv.org/html/2605.19093#S2.SS0.SSS0.Px2.p1.1)\.
- M\. Olson, E\. Santorella, L\. C\. Tiao, S\. Cakmak, M\. Garrard, S\. Daulton, Z\. J\. Lin, S\. Ament, B\. Beckerman, E\. Onofrey, et al\. \(2025\) Ax: a platform for adaptive experimentation\. In AutoML 2025 ABCD Track\. Cited by: [§1](https://arxiv.org/html/2605.19093#S1.p4.1)\.
- K\. Opsahl\-Ong, M\. J\. Ryan, J\. Purtell, D\. Broman, C\. Potts, M\. Zaharia, and O\. Khattab \(2024\) Optimizing instructions and demonstrations for multi\-stage language model programs\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp\. 9340–9366\. Cited by: [§2](https://arxiv.org/html/2605.19093#S2.SS0.SSS0.Px3.p1.1)\.
- R\. Pryzant, D\. Iter, J\. Li, Y\. Lee, C\. Zhu, and M\. Zeng \(2023\) Automatic prompt optimization with “gradient descent” and beam search\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp\. 7957–7968\. Cited by: [§2](https://arxiv.org/html/2605.19093#S2.SS0.SSS0.Px1.p1.1)\.
- K\. Ramnath, K\. Zhou, S\. Guan, S\. S\. Mishra, X\. Qi, Z\. Shen, S\. Wang, S\. Woo, S\. Jeoung, Y\. Wang, et al\. \(2025\) A systematic survey of automatic prompt optimization techniques\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp\. 33066–33098\. Cited by: [§1](https://arxiv.org/html/2605.19093#S1.p2.1), [§2](https://arxiv.org/html/2605.19093#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Sabbatella, F\. Archetti, A\. Ponti, I\. Giordani, and A\. Candelieri \(2024\) Bayesian optimization for instruction generation\. Applied Sciences 14\(24\), pp\. 11865\. Cited by: [§2](https://arxiv.org/html/2605.19093#S2.SS0.SSS0.Px3.p1.1)\.
- L\. Schneider, M\. Wistuba, A\. Klein, J\. Golebiowski, G\. Zappella, and F\. A\. Merra \(2024\) Hyperband\-based Bayesian optimization for black\-box prompt selection\. arXiv preprint arXiv:2412\.07820\. Cited by: [§2](https://arxiv.org/html/2605.19093#S2.SS0.SSS0.Px3.p1.1)\.
- B\. Shahriari, K\. Swersky, Z\. Wang, R\. P\. Adams, and N\. De Freitas \(2015\) Taking the human out of the loop: a review of Bayesian optimization\. Proceedings of the IEEE 104\(1\), pp\. 148–175\. Cited by: [§1](https://arxiv.org/html/2605.19093#S1.p4.1)\.
- M\. Suzgun, N\. Scales, N\. Schärli, S\. Gehrmann, Y\. Tay, H\. W\. Chung, A\. Chowdhery, Q\. V\. Le, E\. H\. Chi, D\. Zhou, and J\. Wei \(2022\) Challenging BIG\-Bench tasks and whether chain\-of\-thought can solve them\. arXiv preprint arXiv:2210\.09261\. Cited by: [§5\.1](https://arxiv.org/html/2605.19093#S5.SS1.SSS0.Px2.p1.4)\.
- S\. Tomar, A\. Deshwal, E\. Villalovoz, M\. Fazzini, H\. Cai, and J\. R\. Doppa \(2025\) An exploratory study of Bayesian prompt optimization for test\-driven code generation with large language models\. arXiv preprint arXiv:2512\.15076\. Cited by: [§2](https://arxiv.org/html/2605.19093#S2.SS0.SSS0.Px3.p1.1)\.
- Z\. Wang, F\. Hutter, M\. Zoghi, D\. Matheson, and N\. De Freitas \(2016\) Bayesian optimization in a billion dimensions via random embeddings\. Journal of Artificial Intelligence Research 55, pp\. 361–387\. Cited by: [§2](https://arxiv.org/html/2605.19093#S2.SS0.SSS0.Px2.p1.1), [§4](https://arxiv.org/html/2605.19093#S4.p1.1)\.
- A\. G\. Wilson, Z\. Hu, R\. Salakhutdinov, and E\. P\. Xing \(2016\) Deep kernel learning\. In Artificial Intelligence and Statistics, pp\. 370–378\. Cited by: [§2](https://arxiv.org/html/2605.19093#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Wu, S\. Verma, J\. Lee, F\. Xiong, P\. Zhang, A\. Awadelkarim, X\. Chen, Y\. Yuan, and S\. Hill \(2025\) LLM prompt duel optimizer: efficient label\-free prompt optimization\. arXiv preprint arXiv:2510\.13907\. Cited by: [§2](https://arxiv.org/html/2605.19093#S2.SS0.SSS0.Px1.p1.1)\.
- C\. Yang, X\. Wang, Y\. Lu, H\. Liu, Q\. V\. Le, D\. Zhou, and X\. Chen \(2023\) Large language models as optimizers\. In The Twelfth International Conference on Learning Representations\. Cited by: [§2](https://arxiv.org/html/2605.19093#S2.SS0.SSS0.Px1.p1.1), [§5\.1](https://arxiv.org/html/2605.19093#S5.SS1.SSS0.Px1.p1.1)\.
- T\. Yen, A\. W\. T\. Siah, H\. Chen, T\. Peng, D\. Guetta, and H\. Namkoong \(2025\) Data mixture optimization: a multi\-fidelity multi\-scale Bayesian framework\. arXiv preprint arXiv:2503\.21023\. Cited by: [§1](https://arxiv.org/html/2605.19093#S1.p4.1)\.
- M\. Yuksekgonul, F\. Bianchi, J\. Boen, S\. Liu, Z\. Huang, C\. Guestrin, and J\. Zou \(2024\) TextGrad: automatic “differentiation” via text\. arXiv preprint arXiv:2406\.07496\. Cited by: [§2](https://arxiv.org/html/2605.19093#S2.SS0.SSS0.Px1.p1.1), [§5\.1](https://arxiv.org/html/2605.19093#S5.SS1.SSS0.Px1.p1.1)\.
- Y\. Zhou, A\. I\. Muresanu, Z\. Han, K\. Paster, S\. Pitis, H\. Chan, and J\. Ba \(2022\) Large language models are human\-level prompt engineers\. In The Eleventh International Conference on Learning Representations\. Cited by: [§2](https://arxiv.org/html/2605.19093#S2.SS0.SSS0.Px1.p1.1), [§5\.1](https://arxiv.org/html/2605.19093#S5.SS1.SSS0.Px1.p1.1)\.

## Appendix

## Appendix AProof of theoretical result

The theoretical analysis begins from Assumption [1](https://arxiv.org/html/2605.19093#Thmassumption1), that the true objective function $f$ resides in the RKHS of the oracle kernel, $\mathcal{H}_{k^*}$\. Because the oracle kernel is a pullback kernel, its RKHS $\mathcal{H}_{k^*}$ is the space of functions induced by the composition $f = h \circ g^*$, where $h \in \mathcal{H}_Z$\. For any $f \in \mathcal{H}_{k^*}$, the norm is defined as $\|f\|_{k^*} = \inf\{\|h\|_{\mathcal{H}_Z} : f = h \circ g^*\}$\. For $\|f\|_{k^*} \leq B$, there exists a weight vector $w^* \in \mathcal{H}_Z$ with $\|w^*\|_{\mathcal{H}_Z} = \|f\|_{k^*} \leq B$ such that $f(x) = \langle w^*, \phi(g^*(x)) \rangle_{\mathcal{H}_Z}$, where $\phi$ is the feature map associated with $k_Z$\. The latent space $\mathcal{Z}$ is compact and contains the images of $g^*$ and all elicited embeddings under consideration\.

The oracle embedding $g^*$ is, of course, unknown, and we model and optimize within the elicited embedding $g_t$\. In general, the embedding $g_t$ will not be able to fully represent $f$\. We define the approximation error $\mathcal{E}_t$ as the infimum of the uniform distance between $f$ and any function representable in the elicited embedding with RKHS norm at most $B$:

$$\mathcal{E}_t = \inf_{w \in \mathcal{H}_Z,\ \|w\|_{\mathcal{H}_Z} \leq B}\ \sup_{x \in \mathcal{X}} \left| f(x) - \langle w, \phi(g_t(x)) \rangle \right|.$$

The norm constraint matches the complexity of the oracle representation, ensuring the bound reflects approximation error rather than overfitting\. We define the approximated function $f_t$ as a minimizer of this quantity, and thus a function whose error equals $\mathcal{E}_t$\. We now bound the representation error between the true objective function and the approximated function on the embedding\.

###### Lemma 1\.

For all $x \in \mathcal{X}$, the pointwise difference between the true objective function and the approximated function on the embedding is bounded as

$$|f(x) - f_t(x)| \leq B L \eta_t.$$

###### Proof\.

Consider the following structural surrogate for $f$ that applies the oracle weights $w^*$ to the elicited embedding:

$$\bar{f}_t(x) = \langle w^*, \phi(g_t(x)) \rangle_{\mathcal{H}_Z}.$$

Let $w_t' \in \mathcal{H}_Z$ be the weight vector corresponding to $f_t$\. By the definition of $f_t$ as the best available approximation, we have that

$$\begin{aligned}
\mathcal{E}_t &= \sup_{x \in \mathcal{X}} \left| f(x) - \langle w_t', \phi(g_t(x)) \rangle_{\mathcal{H}_Z} \right| \\
&\leq \sup_{x \in \mathcal{X}} \left| f(x) - \langle w^*, \phi(g_t(x)) \rangle_{\mathcal{H}_Z} \right|.
\end{aligned}$$

Thus, by the linearity of the inner product and the Cauchy\-Schwarz inequality,

$$\begin{aligned}
\sup_{x \in \mathcal{X}} |f(x) - f_t(x)| &\leq \sup_{x \in \mathcal{X}} \left| f(x) - \langle w^*, \phi(g_t(x)) \rangle_{\mathcal{H}_Z} \right| \\
&= \sup_{x \in \mathcal{X}} \Big| \langle w^*, \phi(g^*(x)) \rangle_{\mathcal{H}_Z} - \langle w^*, \phi(g_t(x)) \rangle_{\mathcal{H}_Z} \Big| \\
&\leq \sup_{x \in \mathcal{X}} \|w^*\|_{\mathcal{H}_Z} \, \|\phi(g^*(x)) - \phi(g_t(x))\|_{\mathcal{H}_Z}.
\end{aligned}$$

By Assumption [1](https://arxiv.org/html/2605.19093#Thmassumption1), we have that $\|w^*\|_{\mathcal{H}_Z} \leq B$\. Applying Assumption [2](https://arxiv.org/html/2605.19093#Thmassumption2) bounds the feature distance by the latent space distance, which is in turn bounded according to Assumption [3](https://arxiv.org/html/2605.19093#Thmassumption3):

$$\|\phi(g^*(x)) - \phi(g_t(x))\|_{\mathcal{H}_Z} \leq L \|g^*(x) - g_t(x)\| \leq L \eta_t.$$
Substituting this bound back into the Cauchy\-Schwarz inequality completes the proof\. ∎

We next prove the main theorem, which is that every point in the suboptimal set for the approximated function $f_t$ is in a corresponding suboptimal set for $f$\.

###### Proof of Theorem [1](https://arxiv.org/html/2605.19093#Thmtheorem1)\.

We decompose the optimality gap of $x_t$ with respect to $f$ using the approximated function $f_t$:

$$f(x^*) - f(x_t) = \big(f(x^*) - f_t(x^*)\big) + \big(f_t(x^*) - f_t(x_t)\big) + \big(f_t(x_t) - f(x_t)\big).$$

By Lemma [1](https://arxiv.org/html/2605.19093#Thmlemma1), the first and third terms are each bounded by $BL\eta_t$\. For the middle term, we are considering the optimum of $f$ evaluated by $f_t$, which is bounded by the optimal value of $f_t$ itself:

$$f_t(x^*) - f_t(x_t) \leq \max_{x} f_t(x) - f_t(x_t) \leq \delta.$$

Summing these three upper bounds produces the desired result\. ∎
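
Written out, the final summation step looks as follows\. This is a sketch that only combines the three bounds stated in the proof; the resulting constant $\delta + 2BL\eta_t$ is inferred from those bounds rather than copied from the statement of Theorem 1, which appears in the main text\.

```latex
\begin{aligned}
f(x^*) - f(x_t)
  &= \underbrace{f(x^*) - f_t(x^*)}_{\leq\, BL\eta_t}
   + \underbrace{f_t(x^*) - f_t(x_t)}_{\leq\, \delta}
   + \underbrace{f_t(x_t) - f(x_t)}_{\leq\, BL\eta_t} \\
  &\leq \delta + 2BL\eta_t .
\end{aligned}
```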

## Appendix BAdditional Results

### B\.1Selected Feature Dimensionality

The main text focuses on whether elicited features improve surrogate fit\. For completeness, Figure [4](https://arxiv.org/html/2605.19093#A2.F4) reports the selected feature dimensionality over optimization rounds\. The selected spaces typically contain two to three features, with a mild increasing trend as more prompt\-score evidence becomes available\. This supports the intended use of a compact representation that grows modestly in complexity for GP modeling under small evaluation budgets\. Despite the increase in embedding dimensionality, the features become more predictive of observed scores over iterations, as shown in Figure [2\(a\)](https://arxiv.org/html/2605.19093#S5.F2.sf1)\.

![Refer to caption](https://arxiv.org/html/2605.19093v1/x6.png) Figure 4: Selected feature dimensionality over optimization rounds, reported as mean ± 95% CI across tasks and seeds\. The elicited feature spaces typically contain two to three semantic dimensions\.
### B\.2Refinement Gap Trajectory

Figure [5](https://arxiv.org/html/2605.19093#A2.F5) reports the best realized\-target $\ell_2$ gap over refinement steps\. This diagnostic checks that feature\-gap feedback can effectively move generated prompts toward BO\-selected feature targets\. As shown in Figure [2\(b\)](https://arxiv.org/html/2605.19093#S5.F2.sf2) in the main text, a smaller $\ell_2$ gap is associated with a higher probability of observing a score improvement in the next iteration\.

![Refer to caption](https://arxiv.org/html/2605.19093v1/x7.png) Figure 5: Best $\ell_2$ gap between the generated prompt’s extracted features and the BO\-selected target over refinement steps\. The trajectory shows that the refinement loop moves generated prompts closer to target feature vectors under the allotted refinement budget\.
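
The diagnostic in this subsection is a running minimum of $\ell_2$ distances\. A minimal Python sketch, assuming each refinement step yields one extracted feature vector; the function names and numeric values are hypothetical and chosen only for illustration:

```python
import math


def l2_gap(features, target):
    """Euclidean distance between extracted features and the BO target."""
    return math.sqrt(sum((f - t) ** 2 for f, t in zip(features, target)))


def best_gap_trajectory(step_features, target):
    """Best (smallest) gap seen so far after each refinement step."""
    best, trajectory = math.inf, []
    for feats in step_features:
        best = min(best, l2_gap(feats, target))
        trajectory.append(best)
    return trajectory


# Made-up target and extracted features over three refinement steps.
target = [0.8, 0.2]
steps = [[0.5, 0.6], [0.9, 0.1], [0.7, 0.4]]
trajectory = best_gap_trajectory(steps, target)
```

By construction the trajectory is non\-increasing, so a flat segment means a refinement step did not move the prompt closer to the target than an earlier candidate\.
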
### B\.3Ablation Final\-Score Statistics

Figure [3](https://arxiv.org/html/2605.19093#S5.F3) shows the full convergence trajectories for the ablation variants\. To quantify the final\-budget comparison, Table [3](https://arxiv.org/html/2605.19093#A2.T3) reports paired final\-score differences at $N=30$ evaluations\. For each task and seed, we compute

$$\Delta_{m,t,s} = \mathrm{score}_{m,t,s} - \mathrm{score}_{\mathrm{ReElicit},t,s},$$

where $m$ is an ablation variant, $t$ a task, and $s$ a seed\. Negative values therefore favor ReElicit\. We pool the paired differences across task\-seed pairs and test whether the mean paired difference is zero\.
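
The pooled paired comparison can be sketched in a few lines of Python\. The helper name `paired_summary`, the data layout, and the scores below are made up for illustration\. The sketch stops at the $t$ statistic; obtaining a $p$\-value or confidence interval additionally requires the Student $t$ distribution, for example via `scipy.stats.ttest_rel`\.

```python
import math


def paired_summary(ablation, reelicit):
    """Mean paired difference (ablation - ReElicit), t statistic, and dof.

    Both arguments map (task, seed) -> final best score; only shared keys
    are used, so every difference is a true task-seed pair.
    """
    keys = sorted(ablation.keys() & reelicit.keys())
    diffs = [ablation[k] - reelicit[k] for k in keys]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)   # sample variance
    t_stat = mean / math.sqrt(var / n)
    return mean, t_stat, n - 1


# Made-up final scores for three task-seed pairs.
ablation_scores = {("a", 0): 0.70, ("a", 1): 0.72, ("b", 0): 0.60}
reelicit_scores = {("a", 0): 0.75, ("a", 1): 0.74, ("b", 0): 0.66}
mean_diff, t_stat, dof = paired_summary(ablation_scores, reelicit_scores)
```

With the sign convention of the text, a negative `mean_diff` favors ReElicit, as it does in this toy example\.
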

Table 3: Pooled paired final\-score differences between each ablation and ReElicit across all task\-seed pairs\. Each entry reports $\Delta = \mathrm{ablation} - \mathrm{ReElicit}$, so negative values favor ReElicit\. The ± term denotes a 95% confidence interval for the mean paired difference, and $p$\-values are from two\-sided paired $t$\-tests\.

The results support two main conclusions\. First, removing feature\-gap refinement and replacing BO with random feature\-space sampling produce the largest and statistically significant drops, indicating that both target realization and acquisition\-guided search contribute to final performance\. Second, freezing the feature set and extracting prompts independently yield smaller negative point estimates that are not statistically significant under this pooled test\. Thus, the ablation evidence for dynamic re\-elicitation should be interpreted together with the surrogate\-fit diagnostic in Figure [2\(a\)](https://arxiv.org/html/2605.19093#S5.F2.sf1), while the independent\-extraction result supports joint extraction as an efficiency improvement that does not appear to harm final performance\.

Because this table pools heterogeneous benchmark tasks, the paired test should be interpreted as an aggregate diagnostic rather than a claim that every ablation degrades performance on every task\.

### B\.4Feature Evolution Case Study

Table [4](https://arxiv.org/html/2605.19093#A2.T4) gives a qualitative example from one of the evaluated tasks \(Snarks\): selected features evolve across iterations while cross\-validation MSE decreases\.

| Iter | MSE | Feature | Description |
| --- | --- | --- | --- |
| 1 | 0\.0612 | explicitness\_of\_sarcasm\_cues | This feature measures the degree to which the prompt explicitly mentions cues for identifying sarcasm, such as ‘ironic or mocking’, ‘tone, context, and figurative language’, or ‘exaggeration, understatement, and idiomatic expressions’\. A value of 0 represents a prompt that does not mention any such cues, while a value of 1 represents a prompt that explicitly mentions multiple relevant cues\. This feature causally affects performance because explicitly mentioning these cues guides the AI to focus on the most relevant aspects of the input statements for sarcasm detection\. |
| | | focus\_on\_pragmatic\_analysis | This feature measures the extent to which the prompt emphasizes the importance of pragmatic analysis, such as considering ‘the speaker’s intent and the audience’s potential interpretation’ or ‘assessing their semantic meaning and pragmatics’\. A value of 0 indicates a prompt that does not emphasize pragmatic analysis, while a value of 1 represents a prompt that clearly directs the AI to consider the pragmatic aspects of the statements\. This feature causally affects performance because pragmatic analysis is crucial for distinguishing between literal and sarcastic meanings\. |
| 3 | 0\.0199 | explicitness\_of\_sarcasm\_cues | This feature measures the degree to which the prompt explicitly mentions specific cues for identifying sarcasm, such as exaggeration, understatement, idiomatic expressions, irony, or mocking tone\. A value of 0 represents a prompt that does not mention any specific cues, while a value of 1 represents a prompt that mentions multiple relevant cues\. This feature causally affects performance because explicit cues guide the AI to focus on the most relevant aspects of the input statements for sarcasm detection\. |
| | | simplicity\_of\_task\_description | This feature measures the extent to which the prompt provides a simple and direct description of the task, without requiring additional steps or complex analysis\. A value of 0 indicates a prompt that requires the AI to perform additional tasks or analysis, while a value of 1 represents a prompt that clearly and directly describes the task\. This feature causally affects performance because simplicity reduces ambiguity and guides the AI’s analysis\. |
| | | focus\_on\_literal\_vs\_figurative\_meaning | This feature measures the degree to which the prompt emphasizes the distinction between literal and figurative meanings in identifying sarcasm\. A value of 0 represents a prompt that does not emphasize this distinction, while a value of 1 represents a prompt that clearly directs the AI to consider the contrast between literal and figurative meanings\. This feature causally affects performance because understanding this distinction is crucial for identifying sarcasm\. |
| 5 | 0\.0119 | emphasis\_on\_literal\_figurative\_contrast | This feature measures the degree to which the prompt emphasizes understanding the contrast between literal and figurative meanings in identifying sarcasm\. A value of 0 represents a prompt that does not emphasize this distinction, while a value of 1 represents a prompt that clearly directs the AI to consider the contrast between literal and figurative meanings\. This feature causally affects performance because understanding this distinction is crucial for identifying sarcasm\. |
| | | cue\_explicitness\_and\_brevity | This feature measures the degree to which the prompt explicitly mentions specific cues for identifying sarcasm \(such as exaggeration, understatement, idiomatic expressions, irony, or mocking tone\) in a concise manner\. A value of 0 represents a prompt that either does not mention any specific cues or does so in a verbose or indirect way, while a value of 1 represents a prompt that mentions relevant cues clearly and briefly\. This feature causally affects performance because explicit and concise cues guide the AI to focus on the most relevant aspects of the input statements for sarcasm detection without introducing unnecessary complexity\. |
| | | avoidance\_of\_overcontextualization | This feature measures the degree to which the prompt avoids requiring the AI to heavily consider the broader context, speaker’s intent, or audience interpretation beyond the explicit language and tone used\. A value of 0 represents a prompt that emphasizes the importance of these contextual factors, while a value of 1 represents a prompt that focuses primarily on the statements themselves and the linguistic cues they contain\. This feature causally affects performance because overcontextualization can lead to unnecessary complexity and ambiguity in sarcasm detection\. |

Table 4: Case study: features the LLM proposes at iterations 1, 3, and 5 on Snarks\. Dimensionality grows from $d=2$ at iteration 1 to $d=3$ at iterations 3 and 5\. GP LOO\-CV MSE decreases each iteration as we continue to iterate and regenerate the features\.

## Appendix C Implementation Details

### C\.1 Algorithms

#### C\.1\.1 Initial Dataset Generation

Input: Task context $c$, batch size $q$, seed

Output: $\mathcal{D}_{0}=\{(x_{i},y_{i})\}_{i=1}^{q}$

- $\{x_{1},\ldots,x_{q}\}\leftarrow\text{LLM}(c,q)$ \# $\mathcal{D}_{0}$ generation prompt: generate $q$ diverse system prompts
- for $i=1,\ldots,q$ do
    - $y_{i}\leftarrow f(x_{i})$ \# evaluate sequentially
- $\mathcal{D}_{0}\leftarrow\{(x_{i},y_{i})\}_{i=1}^{q}$;
- return $\mathcal{D}_{0}$;

Algorithm 2 Initial Dataset $\mathcal{D}_{0}$ Generation

#### C\.1\.2 ReElicit Algorithm

Algorithm [3](https://arxiv.org/html/2605.19093#algorithm3) describes the full ReElicit algorithm\.

Input: $\mathcal{D}_{0}=\{(x_{0,j},y_{0,j})\}_{j=1}^{q}$, total evaluated batches $T$, elicitation rounds $K$, batch size $q$, extraction batch size $b$, realization budget $M\geq 1$, tolerance $\tau$

Output: $x^{*}=\arg\max_{(x,y)\in\mathcal{D}_{T-1}}y$

- $\mathcal{F}_{0}\leftarrow\varnothing$;
- for $t=1,\ldots,T-1$ do
    - Let $X_{t-1},Y_{t-1}$ be the prompts and scores in $\mathcal{D}_{t-1}$;
    - \# Phase 1: Elicit features and build embedding $g_{t}$
    - parallel for $k=1,\ldots,K$
        - $\mathcal{F}_{t}^{(k)}\leftarrow\text{DefineFeatures}(\mathcal{D}_{t-1},\mathcal{F}_{t-1})$; \# DefineFeatures prompt
        - $Z_{t}^{(k)}\leftarrow\text{ExtractFeatures}(X_{t-1},\mathcal{F}_{t}^{(k)},b)$; \# ExtractFeatures prompt; $z_{i}=g_{\mathcal{F}_{t}^{(k)}}(x_{i})$; scores are not shown
        - $\text{score}_{t}^{(k)}\leftarrow\text{CV}(Z_{t}^{(k)},Y_{t-1})$;
    - $\mathcal{K}_{t}\leftarrow\{1,\ldots,K\}$;
    - if $t>1$ then \# Re\-score the incumbent feature set on the current history
        - $\mathcal{F}_{t}^{(K+1)}\leftarrow\mathcal{F}_{t-1}$;
        - $Z_{t}^{(K+1)}\leftarrow\text{ExtractFeatures}(X_{t-1},\mathcal{F}_{t-1},b)$;
        - $\text{score}_{t}^{(K+1)}\leftarrow\text{CV}(Z_{t}^{(K+1)},Y_{t-1})$;
        - $\mathcal{K}_{t}\leftarrow\mathcal{K}_{t}\cup\{K+1\}$;
    - $k^{*}\leftarrow\arg\min_{k\in\mathcal{K}_{t}}\text{score}_{t}^{(k)}$;
    - $\mathcal{F}_{t}\leftarrow\mathcal{F}_{t}^{(k^{*})}$, $Z_{t}\leftarrow Z_{t}^{(k^{*})}$;
    - \# Phase 2: Bayesian optimization in embedding space
    - Fit GP $\mathcal{M}_{t}$ on $(Z_{t},Y_{t-1})$;
    - $\{z_{t,1}^{\mathrm{new}},\ldots,z_{t,q}^{\mathrm{new}}\}\leftarrow\arg\max_{z_{1},\ldots,z_{q}\in[0,1]^{d_{t}}}\alpha(z_{1},\ldots,z_{q}\mid\mathcal{M}_{t})$; \# BO selects feature targets, not text prompts
    - \# Phase 3: Realize feature targets as prompts
    - parallel for $j=1,\ldots,q$
        - $M_{\mathrm{init}}\leftarrow\max(1,\lfloor M/2\rfloor)$; $M_{\mathrm{refine}}\leftarrow M-M_{\mathrm{init}}$;
        - \# 3a: Parallel initial generation
        - parallel for $p=1,\ldots,M_{\mathrm{init}}$
            - $\tilde{\mathcal{D}}_{p}\leftarrow\text{StratifiedSubsample}(\mathcal{D}_{t-1},n_{\max})$;
            - $x^{(p)}\leftarrow\text{InitialGenerate}(\tilde{\mathcal{D}}_{p},\mathcal{F}_{t},z_{t,j}^{\mathrm{new}})$;
            - $z^{(p)}\leftarrow\text{ExtractFeatures}(\{x^{(p)}\},\mathcal{F}_{t},b{=}1)$;
        - $p^{*}\leftarrow\arg\min_{p}\|z_{t,j}^{\mathrm{new}}-z^{(p)}\|_{2}$;
        - $x_{\mathrm{best}},z_{\mathrm{best}}\leftarrow x^{(p^{*})},z^{(p^{*})}$;
        - \# 3b: Sequential refinement by feature\-gap reduction
        - for $i=1,\ldots,M_{\mathrm{refine}}$ do
            - if $\|z_{t,j}^{\mathrm{new}}-z_{\mathrm{best}}\|_{2}\leq\tau$ then break;
            - $\Delta_{\ell}\leftarrow(z_{t,j}^{\mathrm{new}})_{\ell}-(z_{\mathrm{best}})_{\ell}$ for $\ell=1,\ldots,d_{t}$;
            - Sort features by $|\Delta_{\ell}|$ descending;
            - $x_{\mathrm{new}}\leftarrow\text{FeatureGuidedRefine}(x_{\mathrm{best}},\mathcal{F}_{t},z_{t,j}^{\mathrm{new}},z_{\mathrm{best}})$;
            - $z_{\mathrm{new}}\leftarrow\text{ExtractFeatures}(\{x_{\mathrm{new}}\},\mathcal{F}_{t},b{=}1)$;
            - if $\|z_{t,j}^{\mathrm{new}}-z_{\mathrm{new}}\|_{2}<\|z_{t,j}^{\mathrm{new}}-z_{\mathrm{best}}\|_{2}$ then
                - $x_{\mathrm{best}},z_{\mathrm{best}}\leftarrow x_{\mathrm{new}},z_{\mathrm{new}}$;
        - $x_{t,j}^{\mathrm{new}}\leftarrow x_{\mathrm{best}}$;
    - \# Phase 4: Evaluate and update
    - $y_{t,j}^{\mathrm{new}}\leftarrow f(x_{t,j}^{\mathrm{new}})$ for each $j$;
    - $\mathcal{D}_{t}\leftarrow\mathcal{D}_{t-1}\cup\{(x_{t,j}^{\mathrm{new}},y_{t,j}^{\mathrm{new}})\}_{j=1}^{q}$;

Algorithm 3 ReElicit\. Expanded version of Algorithm [1](https://arxiv.org/html/2605.19093#algorithm1)\.

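
Phase 3 of Algorithm 3 can be sketched in plain Python for a single target vector. This is a minimal illustration under stated assumptions, not the authors' implementation: the callables `generate`, `refine`, and `extract` are hypothetical stand-ins for the InitialGenerate, FeatureGuidedRefine, and ExtractFeatures LLM calls, the feature definitions and the stratified history subsample are assumed to be captured inside them, and the initial candidates are produced sequentially rather than in parallel.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def l2(a: Vector, b: Vector) -> float:
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def realize_target(
    z_target: Vector,
    generate: Callable[[Vector], str],             # hypothetical InitialGenerate call
    refine: Callable[[str, Vector, Vector], str],  # hypothetical FeatureGuidedRefine call
    extract: Callable[[str], Vector],              # hypothetical ExtractFeatures call (b = 1)
    budget: int,                                   # realization budget M >= 1
    tol: float,                                    # tolerance tau
):
    """Turn one BO-selected feature target into a prompt (Phase 3 sketch)."""
    m_init = max(1, budget // 2)
    m_refine = budget - m_init

    # 3a: generate m_init candidates and keep the one closest to the target.
    embedded = []
    for _ in range(m_init):
        x = generate(z_target)
        embedded.append((x, extract(x)))
    x_best, z_best = min(embedded, key=lambda pair: l2(z_target, pair[1]))

    # 3b: refine sequentially; accept a revision only if it shrinks the gap.
    for _ in range(m_refine):
        if l2(z_target, z_best) <= tol:
            break
        x_new = refine(x_best, z_target, z_best)
        z_new = extract(x_new)
        if l2(z_target, z_new) < l2(z_target, z_best):
            x_best, z_best = x_new, z_new
    return x_best, z_best
```

Because a refinement is accepted only when it strictly reduces the distance to the target, the returned prompt is never farther from the target than the best initial candidate.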
#### C\.1\.3 Baselines

##### APE\-style guided sampling

\(Algorithm [4](https://arxiv.org/html/2605.19093#algorithm4), Section [D\.2\.1](https://arxiv.org/html/2605.19093#A4.SS2.SSS1)\)\. This is a history\-free candidate\-generation baseline that captures the core idea of APE: candidate generation plus evaluation\. At each iteration, the optimizer LLM generates $q$ diverse system prompts from the task description alone, without using any optimization history\. This is a reasonable simplification of APE, though not a full reproduction of the original search protocol\.

##### OPRO

\(Algorithm [5](https://arxiv.org/html/2605.19093#algorithm5), Section [D\.2\.2](https://arxiv.org/html/2605.19093#A4.SS2.SSS2)\)\. The main iterative baseline\. At each iteration, OPRO presents a stratified subsample of the $(x,y)$ history sorted worst\-to\-best \(placing the best prompts last for recency bias\) and asks the optimizer LLM to generate $q$ improvements\. OPRO is the closest native fit among the evaluated baselines because it uses solution\-score histories to generate improved candidates\.

##### PromptBreeder

\(Algorithm [6](https://arxiv.org/html/2605.19093#algorithm6), Sections [D\.2\.3](https://arxiv.org/html/2605.19093#A4.SS2.SSS3) and [D\.2\.4](https://arxiv.org/html/2605.19093#A4.SS2.SSS4)\)\. A population\-based prompt evolution baseline inspired by PromptBreeder\. It maintains a fitness\-sorted population \(default size 20\)\. Each iteration generates $q$ offspring: $q-1$ via mutation \(one of three operators: improve clarity, make reasoning explicit, increase conciseness\) and 1 via recombination of two random parents\. All $q$ LLM calls run concurrently\. This is a natural match for the black\-box setting because evolutionary methods require only fitness values\.
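
The offspring schedule described above can be sketched as follows. This is an illustrative sketch only: `plan_offspring` and its job tuples are hypothetical names, the operator labels are taken from the mutation prompt in Section D.2.3, and the LLM calls that would consume each job are omitted. It assumes the population holds at least two prompts so that a recombination pair can be drawn.

```python
import random

# Operator labels as listed in the PromptBreeder mutation prompt (Section D.2.3).
MUTATIONS = ("rewrite_clearer", "explicit_reasoning", "concise_constraints")


def plan_offspring(history, q, p_max=20, rng=None):
    """Plan q offspring jobs from a list of (prompt, score) pairs.

    Returns q - 1 mutation jobs followed by one recombination job.
    Assumes at least two prompts survive the top-p_max cut.
    """
    rng = rng or random.Random()
    ranked = sorted(history, key=lambda xy: xy[1], reverse=True)
    population = [prompt for prompt, _ in ranked[:p_max]]

    jobs = []
    for j in range(1, q + 1):
        if j < q:
            # Mutation: one random parent, one random operator.
            jobs.append(("mutate", rng.choice(population), rng.choice(MUTATIONS)))
        else:
            # Recombination: two distinct random parents.
            parent_1, parent_2 = rng.sample(population, 2)
            jobs.append(("recombine", parent_1, parent_2))
    return jobs
```

Since the jobs are independent of each other, they could be dispatched concurrently, which matches the statement that all $q$ LLM calls run at the same time.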

##### TextGrad\-style black\-box refinement

\(Algorithm [7](https://arxiv.org/html/2605.19093#algorithm7), Section [D\.2\.5](https://arxiv.org/html/2605.19093#A4.SS2.SSS5)\)\. A critique\-then\-improve baseline inspired by TextGrad\. It presents a stratified trajectory sample plus the current best prompt, then follows a 3\-step chain: \(1\) analyze trajectory patterns, \(2\) critique the best prompt, \(3\) generate $q$ improved variants addressing different critique aspects\. This adaptation requires the largest interface change because original TextGrad benefits from richer structured feedback, such as instance\-level errors or textual gradients\. We therefore label it “TextGrad\-style” to be transparent about the aggregate\-only simplification\.

##### Shared infrastructure\.

The stratified subsampling utility is shared by ReElicit \(DefineFeatures, GenerateWithRefinement\), OPRO, and TextGrad, ensuring that these methods see the same in\-context distribution\. When the history exceeds $n_{\max}$, it draws the top 25% by score, the bottom 25%, and a random sample of the middle 50%\. OPRO and TextGrad sort the subsample ascending by score \(best last\) for recency bias\.

Input: Initial evaluated dataset $\mathcal{D}_{0}$, total evaluated batches $T$, task context $c$, batch size $q$

Output: Best prompt $x^{*}$

- for $t=1,\ldots,T-1$ do
    - $\{x_{t,1}^{\mathrm{new}},\ldots,x_{t,q}^{\mathrm{new}}\}\leftarrow\text{LLM}(c,q)$ \# APE prompt; history is not used
    - Evaluate $y_{t,j}^{\mathrm{new}}\leftarrow f(x_{t,j}^{\mathrm{new}})$ for $j=1,\ldots,q$;
    - $\mathcal{D}_{t}\leftarrow\mathcal{D}_{t-1}\cup\{(x_{t,j}^{\mathrm{new}},y_{t,j}^{\mathrm{new}})\}_{j=1}^{q}$;
- return $x^{*}=\arg\max_{(x,y)\in\mathcal{D}_{T-1}}y$;

Algorithm 4 APE\-style Guided Sampling

Input: Initial evaluated dataset $\mathcal{D}_{0}$, total evaluated batches $T$, task context $c$, batch size $q$, context cap $n_{\max}$

Output: Best prompt $x^{*}$

- for $t=1,\ldots,T-1$ do
    - $\tilde{\mathcal{D}}_{t-1}\leftarrow\text{StratifiedSubsample}(\mathcal{D}_{t-1},n_{\max})$;
    - Sort $\tilde{\mathcal{D}}_{t-1}$ by score ascending \(worst $\to$ best\);
    - $\{x_{t,1}^{\mathrm{new}},\ldots,x_{t,q}^{\mathrm{new}}\}\leftarrow\text{LLM}(c,\tilde{\mathcal{D}}_{t-1},q)$ \# OPRO prompt
    - Evaluate $y_{t,j}^{\mathrm{new}}\leftarrow f(x_{t,j}^{\mathrm{new}})$ for $j=1,\ldots,q$;
    - $\mathcal{D}_{t}\leftarrow\mathcal{D}_{t-1}\cup\{(x_{t,j}^{\mathrm{new}},y_{t,j}^{\mathrm{new}})\}_{j=1}^{q}$;
- return $x^{*}=\arg\max_{(x,y)\in\mathcal{D}_{T-1}}y$;

Algorithm 5 OPRO: History\-Conditioned Prompt Generation

Input: Initial evaluated dataset $\mathcal{D}_{0}$, total evaluated batches $T$, task context $c$, batch size $q$, population size $P_{\max}$

Output: Best prompt $x^{*}$

- for $t=1,\ldots,T-1$ do
    - Let $\mathcal{P}_{t-1}$ be the top\-$P_{\max}$ entries of $\mathcal{D}_{t-1}$ sorted by score descending;
    - parallel for $j=1,\ldots,q$
        - if $j<q$ then
            - $x_{\mathrm{parent}}\leftarrow$ random choice from $\mathcal{P}_{t-1}$;
            - $m\leftarrow$ random choice from \{clarity, reasoning, conciseness\};
            - $x_{t,j}^{\mathrm{new}}\leftarrow\text{LLM}(c,x_{\mathrm{parent}},m)$ \# PromptBreeder mutation prompt
        - else
            - $x_{p_{1}},x_{p_{2}}\leftarrow$ random pair from $\mathcal{P}_{t-1}$;
            - $x_{t,j}^{\mathrm{new}}\leftarrow\text{LLM}(c,x_{p_{1}},x_{p_{2}})$ \# PromptBreeder recombination prompt
    - Evaluate $y_{t,j}^{\mathrm{new}}\leftarrow f(x_{t,j}^{\mathrm{new}})$ for $j=1,\ldots,q$;
    - $\mathcal{D}_{t}\leftarrow\mathcal{D}_{t-1}\cup\{(x_{t,j}^{\mathrm{new}},y_{t,j}^{\mathrm{new}})\}_{j=1}^{q}$;
- return $x^{*}=\arg\max_{(x,y)\in\mathcal{D}_{T-1}}y$;

Algorithm 6 PromptBreeder: Population\-Based Prompt Evolution

Input: Initial evaluated dataset $\mathcal{D}_{0}$, total evaluated batches $T$, task context $c$, batch size $q$, context cap $n_{\max}$

Output: Best prompt $x^{*}$

- for $t=1,\ldots,T-1$ do
    - $\tilde{\mathcal{D}}_{t-1}\leftarrow\text{StratifiedSubsample}(\mathcal{D}_{t-1},n_{\max})$;
    - Sort $\tilde{\mathcal{D}}_{t-1}$ by score ascending \(worst $\to$ best\);
    - $(x_{\mathrm{best}},y_{\mathrm{best}})\leftarrow\arg\max_{(x,y)\in\mathcal{D}_{t-1}}y$ \# from full history
    - $\{x_{t,1}^{\mathrm{new}},\ldots,x_{t,q}^{\mathrm{new}}\}\leftarrow\text{LLM}(c,\tilde{\mathcal{D}}_{t-1},x_{\mathrm{best}},y_{\mathrm{best}},q)$ \# TextGrad\-style prompt: analyze, critique, generate variants
    - Evaluate $y_{t,j}^{\mathrm{new}}\leftarrow f(x_{t,j}^{\mathrm{new}})$ for $j=1,\ldots,q$;
    - $\mathcal{D}_{t}\leftarrow\mathcal{D}_{t-1}\cup\{(x_{t,j}^{\mathrm{new}},y_{t,j}^{\mathrm{new}})\}_{j=1}^{q}$;
- return $x^{*}=\arg\max_{(x,y)\in\mathcal{D}_{T-1}}y$;

Algorithm 7 TextGrad\-style Black\-Box Refinement

### C\.2 Benchmarks

Table [5](https://arxiv.org/html/2605.19093#A3.T5) describes each task, the number of questions used, and the task context supplied to the LLMs\.

Table 5: Full task context strings\. Each string is injected into all optimizer and baseline prompts as the \{task\_context\} variable\. GSM8K and MMLU use fixed 500\-question subsamples; the remaining eight tasks are from BIG\-Bench Hard \(BBH, 250 examples each\)\.

### C\.3 Evaluation Protocol and Implementation Details

#### C\.3\.1 Models and benchmark evaluation\.

We use Llama 3\.3 70B Instruct as the optimizer LLM and Llama 3\.1 8B Instruct as the target LLM; both have 128K\-token context windows\. The optimizer LLM performs feature elicitation, feature extraction, prompt generation, refinement, and baseline candidate generation\. The target LLM is the model whose system prompt is optimized and is accessed only through the evaluation function $f$\. We use the smaller target model to avoid task saturation and reduce evaluation wall\-clock time\.

Each evaluation $f(x)$ runs the target LLM with system prompt $x$ on a fixed task subset using the lm\-eval framework\. All target\-model evaluations are zero\-shot, use greedy decoding with temperature: 0\.0 and do\_sample: false, and inject the candidate system prompt $x$ through the model wrapper\. Optimizer\-side LLM calls use temperature 0\.7 unless otherwise specified\.

Table 6: Evaluation templates and answer extraction rules, grouped by output format\.

#### C\.3\.2 BO surrogate and feature\-set selection\.

For ReElicit, the surrogate model is a BoTorch SingleTaskGP with the default Matérn 5/2 kernel with ARD, a Normalize\(d=d\) input transform, and a Standardize\(m=1\) outcome transform\. We fit the GP by maximum likelihood using fit\_gpytorch\_mll with ExactMarginalLogLikelihood\. The acquisition function is qLogNoisyExpectedImprovement with X\_baseline set to the training inputs, optimized over $[0,1]^{d}$ using optimize\_acqf with 20 restarts and 512 raw samples\. The GP is refit from scratch at each BO iteration on the current prompt\-score history\.

At each iteration, $K$ independently elicited feature sets compete via held\-out prediction error\. We use leave\-one\-out cross\-validation when the number of data points is below 10 and 10\-fold cross\-validation otherwise\. The selection metric is mean squared error on held\-out points\. When $t>1$, the incumbent feature set from the previous iteration is also included as an additional candidate by re\-extracting it on the current enlarged history and scoring it by the same cross\-validation procedure\. This allows a previously predictive representation to persist, but does not assume that feature quality monotonically improves\. As a diagnostic, we compare GP cross\-validation MSE against a constant predictor that outputs $\bar{y}_{\mathrm{train}}$ on held\-out points\.
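
The selection rule can be illustrated with a small, dependency-free sketch. Several pieces are assumptions made for illustration: the function names are hypothetical, folds are assigned by striding because the fold assignment is not specified, and a 1-nearest-neighbor predictor stands in for the GP surrogate so that the example runs without BoTorch. Only the rule itself (leave-one-out below 10 points, 10-fold otherwise, lowest held-out MSE wins, constant predictor as a diagnostic) comes from the text.

```python
from statistics import fmean


def cv_folds(n):
    """Held-out index lists: leave-one-out when n < 10, otherwise 10 folds."""
    n_folds = n if n < 10 else 10
    # Strided fold assignment is an assumption; the source does not specify it.
    return [list(range(i, n, n_folds)) for i in range(n_folds)]


def cv_mse(Z, y, fit_predict):
    """Mean squared error of fit_predict on held-out points."""
    n = len(y)
    sq_errors = []
    for held_out in cv_folds(n):
        held = set(held_out)
        train = [i for i in range(n) if i not in held]
        preds = fit_predict([Z[i] for i in train], [y[i] for i in train],
                            [Z[i] for i in held_out])
        sq_errors.extend((p - y[i]) ** 2 for p, i in zip(preds, held_out))
    return fmean(sq_errors)


def constant_predictor(Z_train, y_train, Z_test):
    """Diagnostic baseline: always predict the training mean."""
    return [fmean(y_train)] * len(Z_test)


def nearest_neighbor(Z_train, y_train, Z_test):
    """Toy stand-in for the GP surrogate (NOT the model used in the paper)."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [y_train[min(range(len(Z_train)), key=lambda i: sq_dist(Z_train[i], z))]
            for z in Z_test]


def select_feature_set(candidates, y, fit_predict):
    """candidates maps a label to its embedding matrix; lowest CV MSE wins."""
    scores = {name: cv_mse(Z, y, fit_predict) for name, Z in candidates.items()}
    return min(scores, key=scores.get), scores


# Toy usage: two candidate embeddings of the same four prompts.
y = [0.1, 0.2, 0.8, 0.9]
candidates = {
    "elicited_1": [[0.0], [0.1], [0.9], [1.0]],  # tracks the scores
    "incumbent": [[0.0], [1.0], [0.1], [0.9]],   # scrambled relative to scores
}
best, scores = select_feature_set(candidates, y, nearest_neighbor)
baseline = cv_mse(candidates["elicited_1"], y, constant_predictor)
```

In this toy case the embedding that tracks the scores is selected and beats the constant-predictor baseline, while the scrambled embedding does worse than the baseline, which is the kind of signal the diagnostic is meant to surface.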

##### In\-context history subsampling\.

When an optimizer\-side LLM call conditions on prompt\-score history and the history exceeds $n_{\max}$, we use a stratified subsample: top 25% by score, bottom 25% by score, and a random sample from the middle 50% to fill the remaining slots, with at least one example from the top and bottom groups\. This utility is shared by ReElicit, OPRO, and TextGrad\-style refinement, ensuring that these methods see the same in\-context distribution when history is subsampled\.
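
One possible reading of this rule is sketched below. The interpretation is an assumption: a quarter of the $n_{\max}$ slots (at least one) goes to the highest-scoring prompts, a quarter (at least one) to the lowest-scoring prompts, and the remaining slots are filled by a uniform random draw from everything in between. The function name and signature are hypothetical.

```python
import random


def stratified_subsample(history, n_max, rng=None):
    """Subsample a list of (prompt, score) pairs down to n_max entries.

    Assumed reading of the rule: n_max // 4 slots (at least 1) for the top
    scorers, the same for the bottom scorers, and the rest drawn at random
    from the remaining middle of the ranking.
    """
    if n_max < 2:
        raise ValueError("n_max must be at least 2 to keep a top and a bottom example")
    if len(history) <= n_max:
        return list(history)  # nothing to subsample

    rng = rng or random.Random()
    ranked = sorted(history, key=lambda xy: xy[1], reverse=True)
    n_top = max(1, n_max // 4)
    n_bottom = max(1, n_max // 4)

    top = ranked[:n_top]
    bottom = ranked[-n_bottom:]
    middle = ranked[n_top:-n_bottom]
    n_middle = n_max - n_top - n_bottom
    return top + rng.sample(middle, min(n_middle, len(middle))) + bottom


def sort_best_last(subsample):
    """Ascending by score, as OPRO and the TextGrad-style baseline present it."""
    return sorted(subsample, key=lambda xy: xy[1])
```

Keeping both extremes in every subsample gives the optimizer LLM contrastive examples even when the context cap is small.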

#### C\.3\.3 Default hyperparameters\.

Table [7](https://arxiv.org/html/2605.19093#A3.T7) lists the hyperparameters used in the experiments\.

Table 7: Default hyperparameters\.

#### C\.3\.4 Ablation variants\.

Table [8](https://arxiv.org/html/2605.19093#A3.T8) describes the variants tested in the ablation study\.

Table 8: ReElicit ablation variants\. Each variant changes one design choice\.

#### C\.3\.5 Information access control\.

The optimizer LLM is used by several subroutines with different information access\. This separation is important because feature definitions may depend on prompt\-score history, but extracted feature values should be based on prompt content rather than direct access to outcomes\.

Table 9: Information available to optimizer\-LLM subroutines\. \*When $t>1$, DefineFeatures also sees the incumbent feature set $\mathcal{F}_{t-1}$\.

ExtractFeatures never sees evaluation scores\. Thus, although scores are used to choose which feature set is most predictive, the coordinates assigned to prompts are extracted from prompt content under the selected feature definitions rather than copied or inferred directly from outcomes\.

### C\.4 Analysis Details

For completeness, we define the linear CKA used in Figure [1](https://arxiv.org/html/2605.19093#S5.F1)\. Given two representations $Z\in\mathbb{R}^{n\times d_{Z}}$ and $Z^{\prime}\in\mathbb{R}^{n\times d_{Z^{\prime}}}$ evaluated on the same $n$ prompts, let $K=ZZ^{\top}$ and $L=Z^{\prime}Z^{\prime\top}$ be their linear Gram matrices\. Linear CKA is

$$\mathrm{CKA}(K,L)=\frac{\langle HKH,HLH\rangle_{F}}{\|HKH\|_{F}\,\|HLH\|_{F}},\quad H=I_{n}-\frac{1}{n}\mathbf{1}\mathbf{1}^{\top}.$$

Here $\|\cdot\|_{F}$ and $\langle\cdot,\cdot\rangle_{F}$ denote the Frobenius norm and inner product\.
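
The definition can be checked numerically with a short pure-Python sketch (matrices are plain lists of rows; the function names are illustrative, not from the paper). It uses the identity $HKH=(HZ)(HZ)^{\top}$, so centering the Gram matrix amounts to centering the columns of $Z$. The sketch assumes each representation has nonzero variance, since the ratio is undefined otherwise.

```python
import math


def centered_gram(Z):
    """Return H Z Z^T H for an n x d matrix Z given as a list of rows."""
    n, d = len(Z), len(Z[0])
    col_means = [sum(row[c] for row in Z) / n for c in range(d)]
    Zc = [[row[c] - col_means[c] for c in range(d)] for row in Z]  # H Z
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in Zc] for r1 in Zc]


def frobenius_inner(A, B):
    """Frobenius inner product <A, B>_F of two equally sized matrices."""
    return sum(a * b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))


def linear_cka(Z1, Z2):
    """Linear CKA between two representations of the same n items."""
    K, L = centered_gram(Z1), centered_gram(Z2)
    return frobenius_inner(K, L) / math.sqrt(
        frobenius_inner(K, K) * frobenius_inner(L, L)
    )
```

Two properties follow from the formula and serve as sanity checks: a representation compared with itself gives 1, and the value is unchanged by permuting feature columns or rescaling all features by a common factor. Feature spaces of different dimensionality can be compared because both Gram matrices are $n\times n$.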

### C\.5 Existing assets and licenses\.

We use public benchmark datasets GSM8K, MMLU, and BIG\-Bench Hard, and cite their original papers in the main text\. GSM8K, MMLU, and the BIG\-Bench Hard repository are distributed under MIT licenses\. We use the lm\-evaluation\-harness and BoTorch software packages, both under MIT licenses\. The optimizer and target models are Llama 3\.3 70B Instruct and Llama 3\.1 8B Instruct, used under their respective Meta Llama Community License Agreements and Acceptable Use Policies\. We do not redistribute model weights or modified benchmark datasets\.

## Appendix D Prompts

### D\.1 ReElicit Prompts

#### D\.1\.1 DefineFeatures

This is the only prompt where the optimizer LLM sees evaluation scores $y_{i}$\.

DefineFeatures

You are an expert at analyzing text objects and identifying patterns that predict performance\.

These text objects are system prompts for an AI assistant performing the following task:

\{task\_context\}

Define numerical features that capture what makes a prompt effective FOR THIS SPECIFIC TASK\. Focus on properties that have a CAUSAL relationship to the AI’s ability to solve this type of problem — properties that, if changed in a prompt, would directly affect performance\. Avoid features that merely correlate with score or describe surface\-level text properties without a causal mechanism\.

Before defining features, closely inspect the data:

1. What specific patterns or properties are present in the TOP\-performing text objects but absent from the BOTTOM ones?
2. What do the BOTTOM\-performing text objects have that the TOP ones don’t?
3. List 2–3 concrete observations about these differences\.

Then define features that capture these observed causal differences\.

Requirements for each feature:

- name: A short identifier\.
- description: Explain what the feature measures for this specific task, what 0 and 1 represent \(anchor semantics\), and why it causally affects performance\.
- All feature values must be in \[0, 1\]\.
- Each feature MUST be INDEPENDENT of the others\. If you can predict one feature’s value from the others, they are redundant — keep only the more causally relevant one\.
- Before finalizing, verify: for each pair of features, can you imagine a text that is high on one and low on the other? If not, they are not independent — drop one\.

Choose the number of features based on the available data\. With small datasets \(10 or fewer examples\), prefer fewer features \(1–2\) — a single well\-chosen feature is better than several noisy ones\. As the dataset grows and patterns become clearer, additional features can capture richer structure that was not visible with less data\. Every feature must earn its place by capturing genuinely independent variation\.

Respond with a JSON array of objects, each with ‘name’ and ‘description’ fields\. Example:

```
[
  {
    "name": "example_feature_name",
    "description": "What this feature measures for the specific task, what 0 and 1 represent, and why it causally affects performance."
  }
]
```

*\[When $t>1$ and incumbent features $\mathcal{F}_{t-1}$ are provided, the following is prepended to the input\. Incumbent features are shown in shuffled order to avoid position bias\.\]*

The following features are currently in use for this task:

— \{f\.name\}: \{f\.description\} *\[for each incumbent feature\]*

Analyze the data to identify patterns or properties NOT captured by these features\. You may keep features that are clearly the most predictive, but your proposed set MUST differ from the current set by at least one meaningful change: add a genuinely new feature, remove a feature, or substantively redefine one\. Renaming or trivially rephrasing an existing feature does NOT count as a change\.

*\[Then always:\]*

Here are \{n\} text objects with their performance scores \(higher is better\), sorted by performance tier:

\{examples\_text\} *\[formatted with TOP/BOTTOM tier labels\]*

You have \{n\} text objects available\. Be mindful of this sample size — fewer high\-quality, genuinely distinct features are far better than many overlapping ones\.

First list your observations about what distinguishes TOP from BOTTOM performers, then define features\. Return ONLY the JSON array\.

#### D\.1\.2 ExtractFeatures

Does not include evaluation scores $y_{i}$, which prevents information leakage\.

ExtractFeatures

You are an expert at analyzing text and rating it on specific features\. For each text object, assign a value in \[0, 1\] for each feature based on the feature description\.

These text objects are system prompts for an AI assistant performing the following task:

\{task\_context\}

Rate each text object considering how the features relate to this specific task\.

Be consistent: similar texts should get similar scores\. Use the full range of \[0, 1\] – don’t cluster all values near the middle\.

Respond with a JSON object keyed by text object ID, where each value is an object mapping feature names to numeric values\.

Example:

```
{
  "0": {"feature_a": 0.75, "feature_b": 0.30},
  "1": {"feature_a": 0.45, "feature_b": 0.80}
}
```

Features to rate:

— \{f\.name\}: \{f\.description\} *\[for each feature\]*

Text objects to rate:

— Text Object ID: “\{tid\}” —

\{content\} *\[for each text in batch\]*

Rate each text object on each feature\. Values must be numbers in \[0, 1\]\. Return ONLY the JSON object\.
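
A reply in the JSON shape requested by this prompt could be turned into feature vectors along the following lines. This is a hedged sketch: `parse_feature_ratings` is a hypothetical helper, and rejecting out-of-range values with an error (rather than clipping or retrying) is an assumption, since the source does not say how malformed replies are handled.

```python
import json


def parse_feature_ratings(raw, text_ids, feature_names):
    """Parse an ExtractFeatures-style JSON reply into ordered feature vectors.

    raw is the model reply (a JSON object keyed by text object ID).
    Returns a dict mapping each ID to a list ordered like feature_names.
    Raises KeyError for missing IDs or features, ValueError for bad values.
    """
    data = json.loads(raw)
    vectors = {}
    for tid in text_ids:
        ratings = data[str(tid)]
        vector = []
        for name in feature_names:
            value = float(ratings[name])
            # Assumption: values outside [0, 1] are rejected, not clipped.
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"feature {name!r} of text {tid!r} is out of range: {value}")
            vector.append(value)
        vectors[tid] = vector
    return vectors


# Usage with the example reply shown in the prompt above.
reply = '{"0": {"feature_a": 0.75, "feature_b": 0.30}, "1": {"feature_a": 0.45, "feature_b": 0.80}}'
vectors = parse_feature_ratings(reply, ["0", "1"], ["feature_a", "feature_b"])
```

Ordering each vector by a fixed `feature_names` list keeps the columns of the embedding matrix aligned across prompts regardless of the key order in the reply.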

#### D\.1\.3 Initial Generation

Used in Phase A of GenerateWithRefinement \(Algorithm [3](https://arxiv.org/html/2605.19093#algorithm3)\)\.

Initial Generation

You are an expert prompt engineer\. Generate a system prompt for an AI assistant performing the following task:

\{task\_context\}

The prompt should match specific target feature values\.

You will be given:

1. Feature definitions with their semantics\.
2. Example prompts labeled \[TOP\] or \[BOTTOM\] by performance, with their feature values and scores\.
3. A target feature vector to aim for\.

Study what makes the TOP\-scoring examples effective and what makes the BOTTOM\-scoring examples less effective\. Learn from the best examples — understand the patterns and approaches that lead to high performance\.

The target feature vector indicates a promising direction to explore\. Generate a NEW system prompt that combines the successful patterns from the TOP examples while matching the target feature values\.

Output ONLY the generated prompt text, with no additional commentary or formatting\.

Feature definitions:

— \{f\.name\}: \{f\.description\} *\[for each feature\]*

Example prompts \(sorted by performance, with tier labels\):

\{examples\_text\} *\[TOP/BOTTOM labels, features, and scores\]*

Target feature vector:

\{target\_text\} *\[JSON dict, e\.g\. \{"conciseness": 0\.85, "step\_guidance": 0\.70\}\]*

Generate a system prompt that combines the best patterns from the TOP examples while matching the target features\. Output ONLY the prompt text\.

#### D\.1\.4 FeatureGuidedRefine

Used in Phase B of GenerateWithRefinement \(Algorithm [3](https://arxiv.org/html/2605.19093#algorithm3)\)\.

FeatureGuidedRefine

You are an expert prompt engineer\. Modify the given system prompt to better match target feature values\.

The system prompt is for an AI assistant performing the following task:

\{task\_context\}

Consider what text patterns in the reference examples correspond to the desired feature values, and what specific phrases in the current prompt are causing the gaps\.

Rules:

- Focus on the LARGEST gaps first \(they are listed in order of priority\)\.
- MODIFY the existing text — do not rewrite from scratch\.
- PRESERVE aspects that are already well\-aligned with their targets\.
- Output ONLY the modified prompt text, with no additional commentary or formatting\.

*\[When reference examples are provided:\]*

Reference examples \(sorted by performance\):

\{examples\_text\} *\[TOP/BOTTOM labels, features, and scores\]*

Use the TOP examples as reference for the style and patterns that correspond to the desired feature values\.

*\[Then always:\]*

Current system prompt:

\{text\}

Feature gap analysis \(sorted by gap magnitude, largest first\):

\{gap\_text\} *\[JSON array; format shown below\]*

Modify the system prompt to reduce the largest feature gaps\. Output ONLY the modified prompt text\.

*\[Gap analysis format:\]*

```
[
  {
    "feature_name": "step_by_step_guidance",
    "definition": "How explicitly the prompt instructs...",
    "target": 0.85,
    "current": 0.3,
    "gap": 0.55,
    "direction": "increase"
  },
  ...
]
```
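
A gap analysis in the format above could be assembled as sketched below. The helper name, the rounding of the gap to four decimals, and the label used when target and current values coincide are assumptions; only the field names and the largest-first ordering come from the prompt.

```python
import json


def build_gap_analysis(features, target, current):
    """Build the gap list for the refinement prompt, largest gap first.

    features: list of {"name": ..., "description": ...} dicts.
    target, current: dicts mapping feature name to a value in [0, 1].
    """
    gaps = []
    for feature in features:
        name = feature["name"]
        delta = target[name] - current[name]
        gaps.append({
            "feature_name": name,
            "definition": feature["description"],
            "target": target[name],
            "current": current[name],
            # Rounding avoids float noise such as 0.5499999999999999.
            "gap": round(abs(delta), 4),
            # Assumption: a zero gap is labeled "decrease"; the source shows
            # only the "increase" case.
            "direction": "increase" if delta > 0 else "decrease",
        })
    gaps.sort(key=lambda g: g["gap"], reverse=True)
    return gaps


# Usage with hypothetical feature definitions and values.
features = [
    {"name": "conciseness", "description": "How brief the prompt is..."},
    {"name": "step_by_step_guidance", "description": "How explicitly the prompt instructs..."},
]
gaps = build_gap_analysis(
    features,
    target={"conciseness": 0.4, "step_by_step_guidance": 0.85},
    current={"conciseness": 0.5, "step_by_step_guidance": 0.3},
)
gap_text = json.dumps(gaps, indent=2)  # would fill the {gap_text} slot
```

Sorting by absolute gap mirrors the "Sort features by $|\Delta_{\ell}|$ descending" step of Algorithm 3, so the prompt's instruction to focus on the largest gaps first matches the order of the list.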

### D\.2 Baseline Prompts

#### D\.2\.1 APE\-style Guided Sampling

History is not used\. Used by Algorithm [4](https://arxiv.org/html/2605.19093#algorithm4)\.

APE\-style Guided Sampling

You are an expert prompt engineer\. Generate diverse system prompts for an AI assistant to help it perform well on a specific task\.

Task description:

\{task\_context\}

Generate exactly \{q\} diverse system prompts\. Each should take a different approach \(e\.g\., step\-by\-step reasoning, concise instructions, structured format, direct commands, role\-playing, etc\.\)\.

Return a JSON array of \{q\} strings, where each string is a complete system prompt\.

#### D\.2\.2 OPRO

History is stratified\-subsampled and sorted worst\-to\-best\. Used by Algorithm [5](https://arxiv.org/html/2605.19093#algorithm5)\.

OPRO

You are an expert prompt optimizer\. Analyze previous system prompts and their performance scores, then generate improved prompts\.

Task description:

\{task\_context\}

Here are previous system prompts and their scores \(higher is better\), sorted from worst to best:

\{history\_text\}

*\[Each entry formatted as:\]* \-\-\- Prompt \(Score: \{score\}\) \-\-\- \{text\}

Analyze what makes the higher\-scoring prompts better\. Then generate exactly \{q\} new system prompts that should score even higher\.

Return a JSON array of \{q\} strings\.
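
The history block for this prompt could be rendered as in the sketch below. The function name, the three-decimal score formatting, and the line break between the entry header and the prompt text are assumptions, since the extracted template does not preserve whitespace; the worst-to-best ordering and the header wording come from the prompt itself.

```python
def format_history(history, score_digits=3):
    """Render (prompt, score) pairs as OPRO-style entries, best prompt last."""
    ordered = sorted(history, key=lambda xy: xy[1])  # ascending score
    entries = [
        f"--- Prompt (Score: {score:.{score_digits}f}) ---\n{prompt}"
        for prompt, score in ordered
    ]
    return "\n\n".join(entries)


# Usage with two hypothetical prompts.
history_text = format_history([
    ("Think step by step before answering.", 0.9),
    ("Answer the question.", 0.1),
])
```

Placing the highest-scoring prompt last puts it closest to the generation instruction, which is the recency effect the baseline description relies on.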

#### D\.2\.3 PromptBreeder Mutation

One of three mutation instructions is selected at random per offspring\. Used by Algorithm [6](https://arxiv.org/html/2605.19093#algorithm6)\.

**PromptBreeder Mutation**

You are an expert prompt engineer\. Modify system prompts to improve their effectiveness\.

Task description: \{task\_context\}

Instruction: \{instruction\}

Original system prompt: \{parent\_prompt\}

Output ONLY the modified system prompt, no commentary\.

*The \{instruction\} variable is one of three mutation types \(selected uniformly at random\):*

1. rewrite\_clearer: “Rewrite the following system prompt to be clearer and more precise\. Keep the core instructions but improve clarity\.”
2. explicit\_reasoning: “Modify the following system prompt to make reasoning steps more explicit\. Add instructions for step\-by\-step thinking\.”
3. concise\_constraints: “Make the following system prompt more concise\. Remove redundancy while preserving all important constraints\.”
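The uniform choice among the three mutation types can be sketched as below\. The instruction strings are quoted from the list above; the function name `build_mutation_prompt`, the explicit `rng` argument, and the line breaks in the template are assumptions made for illustration\.

```python
import random

# Instruction texts quoted from the three mutation types listed above.
MUTATION_INSTRUCTIONS = {
    "rewrite_clearer": (
        "Rewrite the following system prompt to be clearer and more precise. "
        "Keep the core instructions but improve clarity."
    ),
    "explicit_reasoning": (
        "Modify the following system prompt to make reasoning steps more explicit. "
        "Add instructions for step-by-step thinking."
    ),
    "concise_constraints": (
        "Make the following system prompt more concise. "
        "Remove redundancy while preserving all important constraints."
    ),
}

# Assumption: the fields are separated by blank lines; the appendix gives
# only the wording, not the exact whitespace.
MUTATION_TEMPLATE = (
    "You are an expert prompt engineer. "
    "Modify system prompts to improve their effectiveness.\n\n"
    "Task description: {task_context}\n\n"
    "Instruction: {instruction}\n\n"
    "Original system prompt: {parent_prompt}\n\n"
    "Output ONLY the modified system prompt, no commentary."
)


def build_mutation_prompt(task_context, parent_prompt, rng):
    """Pick one mutation type uniformly at random and fill the template."""
    name = rng.choice(list(MUTATION_INSTRUCTIONS))
    prompt = MUTATION_TEMPLATE.format(
        task_context=task_context,
        instruction=MUTATION_INSTRUCTIONS[name],
        parent_prompt=parent_prompt,
    )
    return name, prompt


name, prompt = build_mutation_prompt(
    "Grade-school math word problems",
    "You are a careful math tutor.",
    random.Random(0),
)
```

Passing a seeded `random.Random` instance keeps the per\-offspring choice reproducible across runs\.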

#### D\.2\.4 PromptBreeder Recombination

Used for the last offspring \($j=q$\) in Algorithm [6](https://arxiv.org/html/2605.19093#algorithm6)\.

**PromptBreeder Recombination**

You are an expert prompt engineer\. Combine the best aspects of two system prompts into a single improved prompt\.

Task description: \{task\_context\}

Parent prompt 1: \{parent1\}

Parent prompt 2: \{parent2\}

Create a new system prompt that combines the best aspects of both parents\. Output ONLY the new prompt, no commentary\.

#### D\.2\.5 TextGrad\-style Black\-Box Refinement

Explicit 3\-step chain\-of\-thought before generating variants\. Used by Algorithm [7](https://arxiv.org/html/2605.19093#algorithm7)\. \{best\_prompt\} and \{best\_score\} are the global best from the full history, not just the subsampled trajectory\.

**TextGrad\-style Black\-Box Refinement**

You are an expert prompt optimizer\. Given a trajectory of system prompts and their performance scores, analyze what makes some perform better than others, then critique the current best prompt and generate improved variants\.

Task description: \{task\_context\}

Trajectory of prior prompts and their scores \(higher is better, sorted worst\-to\-best\): \{history\_text\}

Current best prompt \(score: \{best\_score\}\): \{best\_prompt\}

Step 1: briefly analyze the trajectory – what patterns separate high\-scoring prompts from low\-scoring ones?

Step 2: critique the current best prompt: what could be improved to get a higher score?

Step 3: generate exactly \{q\} improved variants based on your analysis and critique\. Each variant should address a different aspect of the critique\.

Return a JSON array of \{q\} strings, where each string is a complete improved system prompt\.

#### D\.2\.6 Initial Dataset $\mathcal{D}_0$ Generation

Generated once per \(task, seed\) pair and shared across all methods\. Used by Algorithm [2](https://arxiv.org/html/2605.19093#algorithm2)\.

**Initial Dataset $\mathcal{D}_0$ Generation**

You are an expert prompt engineer\. Generate diverse system prompts for an AI assistant\.

Task description: \{task\_context\}

Generate exactly \{q\} diverse system prompts that would help an AI assistant perform well on this task\. Each prompt should take a different approach \(e\.g\., step\-by\-step reasoning, concise instructions, structured format, etc\.\)\.

Return a JSON array of \{q\} strings, where each string is a complete system prompt\.

*Note: This prompt is nearly identical to the APE prompt \(Section [D\.2\.1](https://arxiv.org/html/2605.19093#A4.SS2.SSS1)\) with minor differences: \(1\) the APE prompt says “to help it perform well on a specific task” while this one says “for an AI assistant”; \(2\) the APE prompt includes “direct commands, role\-playing” among approach suggestions, which is omitted here; \(3\) this prompt adds “that would help an AI assistant perform well on this task” after “diverse system prompts” and uses “Each prompt should” instead of “Each should\.”*