Predicting Performance of Symbolic and Prompt Programs with Examples

arXiv cs.LG Papers

Summary

This paper studies performance prediction for symbolic (e.g., Python) and prompt programs using a Bayesian coin-flip model, finding that symbolic programs have all-or-nothing performance while prompt programs have diffuse priors, and introduces RAP (Retrieved Approximate Prior) for performance prediction.

arXiv:2605.21515v1 Announce Type: new Abstract: LLM prompting is widely used for naturally stated tasks, yet it is unreliable it may succeed on a few test cases but fail at deployment time. We study performance prediction: given a program, either symbolic (e.g. Python) or a prompt executed on an LLM, and a few in-domain examples, predict its performance on unseen tasks from the same domain. We use a simple coin-flip model, treating each pass/fail program execution as a Bernoulli random variable, whose success probability is the programs unknown performance. In this model, performance depends entirely on: 1) the observed execution outcomes on test cases, and 2) a prior over performances. We compile empirical performance priors from a corpus of diverse programs and tasks, and find that performance for symbolic programs (e.g., Python) are all or nothing, while prompt programs have a diffuse prior with many nearly-correct programs. This difference explains why a few passing tests can certify symbolic programs but not prompt programs. Building on this insight, we develop RAP (Retrieved Approximate Prior), which retrieves similar tasks and prompt programs from an existing corpus to construct a proxy prior, which is then used to predict performance. We show RAP achieves solid performances.
Original Article
View Cached Full Text

Cached at: 05/22/26, 08:47 AM

# Predicting Performance of Symbolic and Prompt Programs with Examples
Source: [https://arxiv.org/html/2605.21515](https://arxiv.org/html/2605.21515)
\\UseRawInputEncoding

Keya HuMassachusetts Institute of Technology, USAShuzhi LiuNanayang Technological University, SingaporeTao WuNanayang Technological University, SingaporeKevin EllisCornell University, USAYewen PuNanayang Technological University, Singapore

###### Abstract

LLM prompting is widely used for naturally stated tasks, yet it is unreliable — it may succeed on a few test cases but fail at deployment time\. We studyperformance prediction: given a program — either symbolic \(e\.g\. Python\) or a prompt executed on an LLM, and a few in\-domain examples, predict its performance on unseen tasks from the same domain\. We use a simple coin\-flip model, treating each pass/fail program execution as a Bernoulli random variable, whose success probabilityθ\\thetais the program’s unknown performance\. In this model, performance depends entirely on: 1\) the observed execution outcomes on test cases, and 2\) a prior over performances\. We compileempirical performance priorsfrom a corpus of diverse programs and tasks, and find that performance for symbolic programs \(e\.g\., Python\) are “all or nothing,” while prompt programs have a diffuse prior with many nearly\-correct programs\. This difference explains why a few passing tests can certify symbolic programs but not prompt programs\. Building on this insight, we develop RAP \(Retrieved Approximate Prior\), which retrieves similar tasks and prompt programs from an existing corpus to construct a proxy prior, which is then used to predict performance\. We show RAP achieves solid performances\.

## 1Introduction

If a program passes a few representative test cases, one can reasonably trust it to remain correct on future instances of the same task\. The same is not true for LLM prompting—prompts may appear correct during development yet fail unexpectedly at deployment\(Zhouet al\.,[2025](https://arxiv.org/html/2605.21515#bib.bib2)\)\. Why do a few examples certify symbolic programs \(e\.g\., Python code\) but not prompt programs \(natural\-language instructions executed by an LLM\)? We study this question through a Bayesian lens, modeling test outcomes as independent coin tosses and using them to infer a program’s performance\.

Prior work shows that prompting is brittle\(Mizrahiet al\.,[2024](https://arxiv.org/html/2605.21515#bib.bib3)\)— small wording or style changes can cause large performance drops, and that even strong LLMs can fail on simple tasks\(Zhouet al\.,[2024](https://arxiv.org/html/2605.21515#bib.bib1)\)\. Yet prompting remains widely used because it is general, simple, and often “good enough” for non\-critical settings where a performance of 80% is still valuable\. This motivates*performance prediction*: given a prompt program and a few test cases, infer a distribution over its true success rate so one can decide whether a prompt is deployable and with what confidence \(e\.g\., “likely∼0\.8±0\.1\\sim 0\.8\\pm 0\.1once deployed”\)\.

This work is the first to study performance prediction for both*symbolic programs*\(s​psp\) and*prompt programs*\(p​ppp\)\. We define a symbolic program as one whose deployed behavior is executed by a symbolic interpreter \(e\.g\., Python, regex, or another formal language\), whereas a prompt program is executed by an LLM at deployment time\. The key distinction iswhat is delegated at deployment: if an LLM is used only to*write*Python code and the deployed artifact is that code, we categorize it ass​psp; in contrast, if the deployed artifact is a natural\-language prompt that must be run by an LLM to produce outputs for new inputs at deployment, we categorize it asp​ppp\.[Figure1](https://arxiv.org/html/2605.21515#S1.F1)illustrates a pattern\-matching task that can be solved as eithers​psporp​ppp, and in both cases we measure performanceθ\\thetaby evaluating the solution on instances sampled from the task\.

We adopt a simple performance model that treats each evaluation outcome as a Bernoulli coin toss\. Under this model, given an observation summarized byO=\(a,b\)O=\(a,b\)\(withaapasses andbbfails on some test\-cases\), the posterior distribution over performance isP​\(θ\|O\)∝P​\(θ\)​P​\(O\|θ\)P\(\\theta\|O\)\\propto P\(\\theta\)P\(O\|\\theta\)\. While the likelihoodP​\(O\|θ\)∝θa​\(1−θ\)bP\(O\|\\theta\)\\propto\\theta^\{a\}\(1\-\\theta\)^\{b\}can be easily computed, the prior of performanceP​\(θ\)P\(\\theta\)remains unknown\. To resolve this, we compiled empirical performance priors fors​pspandp​pppfrom a diverse set of tasks and programs\. These empirical priors reveal a stark difference:s​pspexhibits an “all\-or\-nothing” prior concentrated near 0\.0 and 1\.0, whereasp​ppphas a more diffuse prior with substantial mass on intermediate, nearly\-correct performance \([Figure2](https://arxiv.org/html/2605.21515#S2.F2)\)\. Consequently, passing a few test\-cases can certify as​psp’s performance, while leaving the performance ofp​pppuncertain \([Figure3](https://arxiv.org/html/2605.21515#S2.F3)\)\. Thus, our simple performance model cangive quantificationto the intuition that symbolic code tends to be well certified by few test cases, while prompt programs do not\.

![Refer to caption](https://arxiv.org/html/2605.21515v1/images/intro_sp_pp.png)Figure 1:Two ways of solving a phone\-number validation task\. A symbolic program \(s​psp\) uses an explicit pattern, whereas a prompt program \(p​ppp\) delegates prediction to an LLM\. In both cases, performanceθ\\thetais the fraction of task instances solved correctly\. Note we consider the choice of LLM and parameters such temperature*as a part of*the prompt program\.Building on this insight, we posit that in order to accurately predict performance ofp​ppps, we need to construct a better prior\. We developRAP\(*Retrieval Approximated Prior*\)\. Given a prompt program, a few in\-domain test cases, and an*arbitrary*corpus of tasks and prompts, RAP retrieves similar tasks and prompt programs from the corpus, and constructs an approximate prior for performance prediction\. We show RAP outperforms baseline performance predictors and has several desirable properties: it converges to the in\-domain posterior as the number of test cases grows, improves with larger corpora, and is robust to irrelevant information in the corpus\.

ContributionsWe make three equally weighted contributions:

1. 1\.We formalize*performance prediction*as Bayesian inference under a simple coin\-flip model, inferring a program’s performance from observed execution outcomesP​\(θ∣O\)P\(\\theta\\mid O\)\.
2. 2\.We compile empirical priors fors​pspandp​pppfrom a diverse set of tasks and programs, and show thats​pspis “all\-or\-nothing,” whilep​pppis diffuse and nearly\-correct\-heavy\. This explains why a few tests can certifys​pspbut notp​ppp\.
3. 3\.We introduceRAP\(*Retrieval Approximated Prior*\), a retrieval\-based method that builds an approximate domain\-specific prior from a corpus and outperforms strong baselines\.

## 2Predicting Performance from Examples

This section formalizes the problem of*performance prediction from examples*\. We first define what we mean by a program and its performance on a task, and then define the objective of predicting such performance and how to evaluate prediction quality\.

### 2\.1Definitions

tasksA taskTTis a distribution of instances\(x,y\)∼T\(x,y\)\\sim Twith inputxxand oracle\-labeled outputyy\. For instance, in the task of sorting, an input may be any random array of integers, and the output is the same array of integers ordered from smallest to largest\. In this work, we assume sampling of instances is iid\.

programsA programffis a mapping from input to outputf:x→yf:x\\rightarrow y\. Given a taskTT, one writes a program by sampling fromf∼P​\(f\|T\)f\\sim P\(f\|T\)\. In practice, one may either draw a sample by asking a person to write a program for a task, or by prompting an LLM to write a program for a task\.

correctnessGiven an instance\(x,y\)\(x,y\), a program’s correctness on this instance is an indicator𝐙=𝟙​\(f​\(x\)=y\)\\mathbf\{Z\}=\\mathbbm\{1\}\(f\(x\)=y\)

performanceGiven a programffand a taskTT, the function’s \(true\) performanceθf∗\\theta\_\{f\}^\{\*\}is the expected success rate offfproducing the correct outputyyon inputxxfor\(x,y\)∼T\(x,y\)\\sim T

θf∗​\(f,T\)=𝔼\(x,y\)∼T​\[𝟙​\(f​\(x\)=y\)\]\.\\theta\_\{f\}^\{\*\}\(f,T\)=\\mathbb\{E\}\_\{\(x,y\)\\sim T\}\[\\mathbbm\{1\}\(f\(x\)=y\)\]\.\(1\)In this paper, we simply denote performance withθf\\theta\_\{f\}, implicitly assuming every program is written*for some task*\.

observationAn observation is a set ofKKindicators

𝐙1​…​K,𝐙i=𝟙​\(f​\(x\)=y\),\(xi,yi\)∼T\\mathbf\{Z\}\_\{1\\dots K\},\\mathbf\{Z\}\_\{i\}=\\mathbbm\{1\}\(f\(x\)=y\),\(x\_\{i\},y\_\{i\}\)\\sim T\(2\)We can summarize the outcome of the observation by the number of successesaaand the number of failuresbb

O=\(a,b\),a=∑iZi,b=∑i\(1−Zi\)O=\(a,b\),a=\\sum\_\{i\}Z\_\{i\},b=\\sum\_\{i\}\(1\-Z\_\{i\}\)\(3\)
Table 1:Benchmark domains used in this study, covering a diverse range of task structures, input modalities, and symbolic solver representations\. This diversity enables a systematic evaluation of performance prediction across heterogeneous program types\. Across all domains, we evaluate a total of 700 symbolic programs and 700 prompt programs, with approximately 100 evaluation instances per task\.
### 2\.2Predicting Performance from Examples

Given observed outcomesOOof running a programffon a set of sampled instances, we wish to infer a program’s true performance:

P​\(θf\|O\)P\(\\theta\_\{f\}\|O\)\(4\)
This posterior distribution can be evaluated in two ways:

probability density at the ground truthWe plug in the ground\-truth valueθf∗\\theta\_\{f\}^\{\*\}into the posterior distribution and obtain its probability density value,P​\(θf=θf∗\|O\)P\(\\theta\_\{f\}=\\theta\_\{f\}^\{\*\}\|O\)\. A poor posterior would have a low score, while a good posterior would have a high score, with the limit approaching infinity \(a Dirac\-delta posterior\)\.

absolute errorWe take the mean of the posterior as the prediction, and take the absolute difference between the prediction value and the ground\-truth value\|θf^−θf∗\|\|\\hat\{\\theta\_\{f\}\}\-\\theta\_\{f\}^\{\*\}\|\.

### 2\.3A Simple Coin\-flip Model

In this work, we take the simplest model of estimatingθf\\theta\_\{f\}by treating each observation𝐙i\\mathbf\{Z\}\_\{i\}as a Bernoulli random variable:

𝐙i∼B​e​r​n​o​u​l​l​i​\(θf\)\\mathbf\{Z\}\_\{i\}\\sim Bernoulli\(\\theta\_\{f\}\)\(5\)The likelihood function for observationO=\(a,b\)O=\(a,b\)follows the binomial distribution \(written without the constant\):

P​\(O\|θf\)∝θfa​\(1−θf\)bP\(O\|\\theta\_\{f\}\)\\propto\\theta\_\{f\}^\{a\}\(1\-\\theta\_\{f\}\)^\{b\}\(6\)
Thus, the posterior of a program’s performance under observation can be written via Bayes’ rule:

P​\(θf\|O\)∝P​\(θf\)​θfa​\(1−θf\)bP\(\\theta\_\{f\}\|O\)\\propto P\(\\theta\_\{f\}\)\\theta\_\{f\}^\{a\}\(1\-\\theta\_\{f\}\)^\{b\}\(7\)
As we can see, theperformance priorP​\(θf\)P\(\\theta\_\{f\}\)plays a pivotal role in determining the posterior distributionP​\(θf\|O\)P\(\\theta\_\{f\}\|O\)\. In the next section we will explain how to empirically estimate the performance prior for both symbolic programs and prompt programs\.

![Refer to caption](https://arxiv.org/html/2605.21515v1/images/sp_vs_pp/sp/prior_induction_overall.png)\(a\)Performance prior for symbolic programs
![Refer to caption](https://arxiv.org/html/2605.21515v1/images/sp_vs_pp/pp/prior_transduction_overall.png)\(b\)Performance prior for prompt programs

Figure 2:Empirical performance priors fors​pspandp​ppp, constructed from 700 sampled programs each\. Symbolic programs have a sharply bimodal prior, with most programs either perfect or completely incorrect\. Prompt programs have a more diffuse prior, with substantial mass over intermediate performances\.![Refer to caption](https://arxiv.org/html/2605.21515v1/images/sp_vs_pp/sp/posterior_induction_overall.png)\(a\)Posterior fors​pspafter successes on 3 test cases
![Refer to caption](https://arxiv.org/html/2605.21515v1/images/sp_vs_pp/pp/posterior_transduction_overall.png)\(b\)Posterior forp​pppafter successes on 3 test cases

Figure 3:Posterior performance distributions for symbolic programsP​\(θs​p∣\[T,T,T\]\)P\(\\theta\_\{sp\}\\mid\[T,T,T\]\)and prompt programsP​\(θp​p∣\[T,T,T\]\)P\(\\theta\_\{pp\}\\mid\[T,T,T\]\)after observing successes on 3 test cases\. The likelihood update,θ3\\theta^\{3\}, is the same for boths​pspandp​ppp; the difference in the posterior is entirely due to their different performance priors\. The posterior fors​pspsupports a near\-100% performance prediction, whereas the posterior forp​pppdoes not provide a confident performance guarantee\.

## 3The Performance Prior

The performance prior can be written as:

P​\(θ\)=PrT∼P​\(T\),f∼P​\(f∣T\)⁡\[θ=θf\]\.P\(\\theta\)=\\Pr\_\{T\\sim P\(T\),\\,f\\sim P\(f\\mid T\)\}\\left\[\\theta=\\theta\_\{f\}\\right\]\.\(8\)In words, we first sample a taskT∼P​\(T\)T\\sim P\(T\), and from this task sample a programf∼P​\(f\|T\)f\\sim P\(f\|T\)\(either from a human or from a LLM\), then evaluating its performance on the task\. We do this many times, and we would obtain a large list of performance values\[θf1,θf2,…\]\[\\theta\_\{f\_\{1\}\},\\theta\_\{f\_\{2\}\},\\dots\], which together forms the performance prior\. Ideally, we would like a “perfect” performance prior derived from “all tasks that we have ever written a program for”, which is unrealistic\. Instead, we approximate the performance prior for boths​pspandp​pppempirically\.

### 3\.1Empirical Performance Prior

#### sampling tasksP​\(T\)P\(T\)

We chose a diverse set of programming domains — from inductive reasoning to code generation from language, then randomly sample tasks from these domains\. The domains and numbers of tasks chosen is shown in Table[1](https://arxiv.org/html/2605.21515#S2.T1)\. Crucially, tasks from these domains maybe reasonably solved by boths​pspandp​ppp\. For instance, in the Regex domain, one can either solve the task of writing a regular expression \(s​psp\) that matches a phone number pattern, or by asking a LLM directly \(p​ppp\) to detect if a string is a phone number\. In total, we sampled 70 tasks\.

#### sampling programsP​\(f\|T\)P\(f\|T\)

For each sampled task, we sample a number ofs​pspby prompting a LLM to generate code from the task, and we sample a number ofp​pppby prompting a LLM to generate a prompt that can output the answer directly\. Ideally we could also hire human programmers to write programs and the prompts for a more naturalistic set of programs\. The sampled tasks and the sampled programs together forms a corpusℳ\\mathcal\{M\}, and from this corpusℳ\\mathcal\{M\}we can build the empirical performance priorP​\(θf\)P\(\\theta\_\{f\}\)\. For this work, We sampled 10 programs for eithers​pspandp​pppper task, for a total of 700s​pspand 700p​ppp\.

#### sampling instancesP​\(\(x,y\)\|T\)P\(\(x,y\)\|T\)to estimateθf\\theta\_\{f\}

Each task comes equipped with an instance sampler,P​\(\(x,y\)\|T\)P\(\(x,y\)\|T\), which we can query for a large number of inputxxand oracle labeled outputyy\(e\.g\. a large number of randomly generated strings, and whether the string is a valid phone number\)\. We approximate the true performance of a programffby taking its average performance on the sampled instances\.

θ~f:=1N​∑i=1N𝟏​\[f​\(xi\)=yi\],\(xi,yi\)∼p​\(\(x,y\)∣T\)\.\\tilde\{\\theta\}\_\{f\}\\;:=\\;\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}\\mathbf\{1\}\\\!\\big\[f\(x\_\{i\}\)=y\_\{i\}\\big\],\\qquad\(x\_\{i\},y\_\{i\}\)\\sim p\(\(x,y\)\\mid T\)\.\(9\)For this part, we sampleN=100N=100instances per task\. In total, we performed 70000 function calls to executes​psp, and 70000 LLM API\-calls to executep​ppp\.

#### empirical performance prior fors​pspandp​ppp

The approximated performances of the 700s​pspandp​pppare collated together to form the empirical performance prior for symbolic programsP​\(θs​p\)P\(\\theta\_\{sp\}\)and prompt programsP​\(θp​p\)P\(\\theta\_\{pp\}\)\(Figure[2](https://arxiv.org/html/2605.21515#S2.F2)\)\. The most striking difference is thats​psphas a sharp, bi\-modal distribution where a program is either all correct \(1\.0\) or completely incorrect \(0\.0\), whereas the prior forp​pppis much more diffused, with a large number of “near perfect” programs\.

![Refer to caption](https://arxiv.org/html/2605.21515v1/images/posterior_score_curve.png)\(a\)Probability density of ground truth performance under posterior for symbolicP​\(θs​p=θs​p∗∣O\)P\(\\theta\_\{sp\}=\\theta\_\{sp\}^\{\*\}\\mid O\)and promptP​\(θp​p=θp​p∗∣O\)P\(\\theta\_\{pp\}=\\theta\_\{pp\}^\{\*\}\\mid O\)programs\. Predicting performance of symbolic programs is better\.
![Refer to caption](https://arxiv.org/html/2605.21515v1/images/posterior_error_curve.png)\(b\)Absolute difference error between predicted performance and ground\-truth for symbolic\|θ^s​p−θs​p∗\|\|\\hat\{\\theta\}\_\{sp\}\-\\theta\_\{sp\}^\{\*\}\|and prompt\|θ^p​p−θp​p∗\|\|\\hat\{\\theta\}\_\{pp\}\-\\theta\_\{pp\}^\{\*\}\|programs\. Predicting performance of symbolic programs achieves a lower error\.

Figure 4:Evaluation of performance prediction as a function of observed examples\. Over both metrics, it was easier to predict the performance for symbolic programs\.

### 3\.2Using Prior for Performance Prediction

From the empirical performance priors, we can answer our original question: “Why observing a small number of successes guarantees good performance for symbolic programs but not prompt programs”\. In Figure[3](https://arxiv.org/html/2605.21515#S2.F3)we show the posterior performance distribution for boths​pspandp​pppafter observing successes on 3 test cases\. As we can see, from the posterior fors​psp, the mass is nearly concentrated on perfect performing programs, while the same cannot be said forp​ppp— it has many “near perfect” programs simply getting lucky\.

The above observation holds true in general\. By sampling a programff\(either symbolic or prompt\) from our empirical distribution, and trying to estimate its true performance from observationsP​\(θf\|O\)P\(\\theta\_\{f\}\|O\), we get the following results in[Figure4](https://arxiv.org/html/2605.21515#S3.F4)\. As we can see, it is much more difficult to predict the true performance forp​pppcompared tos​psp

#### conclusion

We conclude that the difference in performance prediction can be largely attributed to the distinct shapes of thes​pspandp​pppperformance priors\.The prior for symbolic programs affords us a confident prediction from few examples, while the prior for prompt programs does not\.

## 4RAP: Retrieved Approximate Prior

To construct a more informative empirical prior for performance prediction, we introduce*Retrieved Approximate Prior \(RAP\)*\. RAP takes as input \(1\) a target prompt programp​ptargetpp\_\{\\text\{target\}\}\(2\) a set of observed examples from the target domain𝒟=\{x1,…​xk\}\\mathcal\{D\}=\\\{x\_\{1\},\\dots x\_\{k\}\\\}, and \(3\) an*arbitrary*corpusℳ=\(ℳt​a​s​k​s,ℳp​p​s\)\\mathcal\{M\}=\(\\mathcal\{M\}\_\{tasks\},\\mathcal\{M\}\_\{pps\}\)of tasks and prompt programs\. RAP first constructs an informative priorP​\(θp​ptarget∣p​ptarget,𝒟,ℳ\)P\(\\theta\_\{pp\_\{\\text\{target\}\}\}\\mid pp\_\{\\text\{target\}\},\\mathcal\{D\},\\mathcal\{M\}\)by retrieving related tasks and prompt programs from the corpus \([Figure5](https://arxiv.org/html/2605.21515#S4.F5)\)\. RAP then executesp​ptargetpp\_\{\\text\{target\}\}on𝒟\\mathcal\{D\}to get the pass/fail outcomeOO\. Finally, RAP performs Bayesian update on the retrieved prior withOOto produce the posterior that predicts performance:

PRAP​\(θp​ptarget∣p​ptarget,𝒟,ℳ\)∝P​\(O∣θp​ptarget\)​P​\(θp​ptarget∣p​ptarget,𝒟,ℳ\)\.P\_\{\\text\{RAP\}\}\(\\theta\_\{pp\_\{\\text\{target\}\}\}\\mid pp\_\{\\text\{target\}\},\\mathcal\{D\},\\mathcal\{M\}\)\\propto P\(O\\mid\\theta\_\{pp\_\{\\text\{target\}\}\}\)P\(\\theta\_\{pp\_\{\\text\{target\}\}\}\\mid pp\_\{\\text\{target\}\},\\mathcal\{D\},\\mathcal\{M\}\)\.\(10\)
![Refer to caption](https://arxiv.org/html/2605.21515v1/x1.png)Figure 5:The overview of constructing prior\. RAP retrieves tasks similar to the observed examples and prompt programs similar to the target prompt program\. The retrieved programs are evaluated on the retrieved tasks to obtain empirical success rates, shown as blue points in the lower\-right panel\. These empirical performances are then smoothed into a mixture of Beta distributions, which serves as the task\-specific performance prior\.### 4\.1Retrieve Similar Tasks

Given an observed task set𝒟\\mathcal\{D\}, we retrieve an extended task set𝒟′\\mathcal\{D\}^\{\\prime\}\. We do so by retrieving the top\-n most similar tasks inℳt​a​s​k​s\\mathcal\{M\}\_\{tasks\}for eachx′∈𝒟x^\{\\prime\}\\in\\mathcal\{D\}, where similarity is measured using a textual embedding functione​\(⋅\)e\(\\cdot\)\. These retrieved tasks serve as a proxy for target domain\.

𝒟′=⋃x′∈𝒟Top​\-​n⁡\(\{x∈ℳt​a​s​k​s\},cos⁡\(e​\(x\),e​\(x′\)\)\),\\mathcal\{D\}^\{\\prime\}=\\bigcup\_\{x^\{\\prime\}\\in\\mathcal\{D\}\}\\operatorname\{Top\\text\{\-\}n\}\\Big\(\\\{x\\in\\mathcal\{M\}\_\{tasks\}\\\},\\;\\cos\\\!\\big\(e\(x\),e\(x^\{\\prime\}\)\\big\)\\Big\),\(11\)

### 4\.2Retrieve Similar Prompt Programs

RAP retrieves a top\-K set of prompt programs that are similar to thep​ptargetpp\_\{\\text\{target\}\}in terms of*behavior*— the number of tasks in𝒟′\\mathcal\{D\}^\{\\prime\}where both programs agree \(both correct or incorrect\),s𝒟′​\(p​ptarget,p​pi\)s\_\{\\mathcal\{D\}^\{\\prime\}\}\(pp\_\{\\text\{target\}\},pp\_\{i\}\)\.

𝒫𝒫′\\displaystyle\\mathcal\{PP\}\\prime:=Top​\-​K⁡\(\{p​pi∈ℳp​p​s\},s𝒟′​\(p​ptarget,p​pi\)\),\\displaystyle=\\operatorname\{Top\\text\{\-\}K\}\\Big\(\\\{pp\_\{i\}\\in\\mathcal\{M\}\_\{pps\}\\\},\\;s\_\{\\mathcal\{D\}^\{\\prime\}\}\(pp\_\{\\text\{target\}\},pp\_\{i\}\)\\Big\),\(12\)

### 4\.3Constructing Prior and Posterior Update

After retrieving similar tasks𝒟′\\mathcal\{D\}^\{\\prime\}and prompt programs𝒫​𝒫′\\mathcal\{PP\}^\{\\prime\}, we construct the distribution of the unknown success rateθp​ptarget\\theta\_\{pp\_\{\\text\{target\}\}\}as a mixture of Beta distribution\.

Letaia\_\{i\}andbib\_\{i\}denote the numbers of successes and failures ofp​pipp\_\{i\}on𝒟′\\mathcal\{D\}^\{\\prime\}\. We setαi=ai\+1\\alpha\_\{i\}=a\_\{i\}\+1andβi=bi\+1\\beta\_\{i\}=b\_\{i\}\+1\. Thus,p​pipp\_\{i\}induces the componentBeta​\(θ∣αi,βi\)\.\\mathrm\{Beta\}\(\\theta\\mid\\alpha\_\{i\},\\beta\_\{i\}\)\.Aggregating the evidence from the top\-KKretrieved programs yields the retrieved prior as a mixture of these components:

P​\(θ\)=∑i=1K1K​Beta​\(θ∣αi,βi\)\.P\(\\theta\)=\\sum\_\{i=1\}^\{K\}\\frac\{1\}\{K\}\\,\\mathrm\{Beta\}\(\\theta\\mid\\alpha\_\{i\},\\beta\_\{i\}\)\.\(13\)
Given observed outcomesOO, letaaandbbdenote the numbers of successes and failures\. Since each mixture component is a Beta prior over the same Bernoulli success rateθ\\theta, the observed outcomes update each component by conjugacy to produce the updated posterior distribution:

P​\(θ∣O,ℳ\)=∑i=1K1K​Beta​\(θ∣αi\+a,βi\+b\),P\(\\theta\\mid O,\\mathcal\{M\}\)=\\sum\_\{i=1\}^\{K\}\\frac\{1\}\{K\}\\,\\mathrm\{Beta\}\(\\theta\\mid\\alpha\_\{i\}\+a,\\beta\_\{i\}\+b\),\(14\)
#### Component\-wise Prior Strength Adjustment\.

It is possible that the top\-K retrievedp​pipp\_\{i\}behave drastically differently thanp​ptargetpp\_\{\\text\{target\}\}on the target domain\. To resolve this, RAP adjust the strength of each Beta component’s parameters based on performance similarity on in\-domain tasks\.

LetPi​\(θ\)=Beta​\(θ∣αi,βi\)P\_\{i\}\(\\theta\)=\\mathrm\{Beta\}\(\\theta\\mid\\alpha\_\{i\},\\beta\_\{i\}\)denote the prior component induced by retrieved programp​pipp\_\{i\}\. LetPobs​\(θ\)=Beta​\(θ∣a\+1,b\+1\)P\_\{\\text\{obs\}\}\(\\theta\)=\\mathrm\{Beta\}\(\\theta\\mid a\+1,b\+1\)denote the Beta distribution induced directly from in\-domain examples without the corpus\. We compute a positive scaling factor based on a normalized Earth Mover’s Distance \(EMD\):

λi=1−EMD​\(Pi,Pobs\),λi∈\[0,1\]\.\\lambda\_\{i\}=1\-\\mathrm\{EMD\}\\\!\\left\(P\_\{i\},P\_\{\\text\{obs\}\}\\right\),\\quad\\lambda\_\{i\}\\in\[0,1\]\.\(15\)
We then multiply this scaling factorλi\\lambda\_\{i\}to the Beta distribution’s parameters\. In addition, we also cap the concentration value to ensureλi​αi\+λi​βi<cmax\\lambda\_\{i\}\\alpha\_\{i\}\+\\lambda\_\{i\}\\beta\_\{i\}<c\_\{\\max\}, where we setcmax=40c\_\{\\max\}=40\. Finally:

PR​A​P​\(θ∣O\)=∑i=1K1K​Beta​\(θ∣αiCi\+a,βiCi\+b\),Where​Ci=αi\+βimin⁡\(λi​\(αi\+βi\),cm​a​x\)\.P\_\{RAP\}\(\\theta\\mid O\)=\\sum\_\{i=1\}^\{K\}\\frac\{1\}\{K\}\\,\\mathrm\{Beta\}\(\\theta\\mid\\frac\{\\alpha\_\{i\}\}\{C\_\{i\}\}\+a,\\frac\{\\beta\_\{i\}\}\{C\_\{i\}\}\+b\),~~~\\text\{Where \}C\_\{i\}=\\frac\{\\alpha\_\{i\}\+\\beta\_\{i\}\}\{\\min\(\\lambda\_\{i\}\(\\alpha\_\{i\}\+\\beta\_\{i\}\),c\_\{max\}\)\}\.\(16\)
Note that as the number of in\-domain observations increases, the likelihood term —aaandbb— dominates the posterior update, ensuring convergence toward the empirical posterior induced purely by observed in\-domain data\.

![Refer to caption](https://arxiv.org/html/2605.21515v1/posterior_sequence.png)Figure 6:Our method constructs a more informative and adaptable prior than the baselines\. Compared to*all\-corpus*, whose initial prior is poorly calibrated for the target domain, and*no\-corpus*, where the first few observations can excessively distort the posterior, our approach provides a better\-calibrated starting point that can be effectively updated as new observations are incorporated\. The vertical line marks the posterior mean used as the performance prediction; closer alignment with the ground\-truth performance indicates better calibration\.![Refer to caption](https://arxiv.org/html/2605.21515v1/images/muscle/in_domain/posterior_score_curve.png)

![Refer to caption](https://arxiv.org/html/2605.21515v1/images/muscle/in_domain/posterior_error_plot_single.png)

\(a\)in\-domain
![Refer to caption](https://arxiv.org/html/2605.21515v1/images/muscle/out_domain/posterior_score_curve.png)

![Refer to caption](https://arxiv.org/html/2605.21515v1/images/muscle/out_domain/posterior_error_plot_single.png)

\(b\)out\-of\-domain

Figure 7:RAP out performs baselines in both in\-domain and out\-of\-domain settings\. In the in\-domain setting, the corpus contains tasks from the target domain, enabling RAP to construct a more informative prior, which leads to higher performances\.

## 5Experiment

We validate the following:

1. 1\.RAP outperforms baselines that either do not use a corpus \(no\-corpus\) or naively use the entire corpus \(all\-corpus\), while effectively leveraging in\-domain tasks and remaining robust to irrelevant out\-of\-domain tasks in the corpus \([Figure7](https://arxiv.org/html/2605.21515#S4.F7)\)\.
2. 2\.Performance of RAP increases as the sizes of corpus increases \([Figure10](https://arxiv.org/html/2605.21515#A2.F10)\)
3. 3\.RAP converges gracefully to the in\-domain posterior when given a number of in\-domain examples \([Figure6](https://arxiv.org/html/2605.21515#S4.F6)\)
4. 4\.RAP is robust to hyperparameter choices \(Appendix[SectionB\.1](https://arxiv.org/html/2605.21515#A2.SS1)\)\.

### 5\.1Methodology

#### Benchmarks

We chose benchmarks from domains in Appendix[SectionA\.4](https://arxiv.org/html/2605.21515#A1.SS4)[Table2](https://arxiv.org/html/2605.21515#A1.T2), covering the range of tasks that people generally use prompt program for — knowledge and reasoning based multiple choice, general Python code generation, and embodied agent\. These domains are*different*than the domains considered in Section 3 in the following way — rather writing a different program per task, the goal here is to write a generalp​pppthat can solve many different tasks\. For instance, write a single prompt that can generate multiple Python solutions for the MBPP benchmark\. No symbolic program will work for the kind of tasks considered in this section\.

#### Methods

In this paper we consider three methods\.

- •no\-corpus: Uses a fixedBeta​\(1,1\)\\mathrm\{Beta\}\(1,1\)prior, which is uniform and independent of the corpus\.
- •all\-corpus: Uses the performance prior estimated from the entire corpus\.
- •RAP \(ours\): As described in Section[4](https://arxiv.org/html/2605.21515#S4)\.

#### Evaluation

The prediction algorithm is presented with a prompt programp​pppand few in\-domain tasksT1,…,TKT\_\{1\},\\dots,T\_\{K\}from a*target domain*, and gives a posterior distribution on the performance ofp​pppon future tasks from the target domain\.

We report two metrics: the probability density of the posterior at the ground\-truth performance \(SC\), and the absolute error \(AE\)\. They are described in[Section2](https://arxiv.org/html/2605.21515#S2)\. We consider two settings, in\-domain and out\-of\-domain\. In the in\-domain setting, the corpusℳ\\mathcal\{M\}contain tasks from the target domain\. For instance, if we wish to predict performance of ap​pppin themathdomain, we will put some other tasks from themathdomain in the corpus\. In the out\-of\-domain setting, no tasks from the same domain are included in the corpus\. For both settings we perform leave\-one\-out evaluation, where we single out each domain \(e\.g\.math,MBPP, …\) as the target domain\.

We evaluate 60 prompt programs under different settings, varying the language model, prompt, and execution mechanism111Note that we consider a prompt program*inclusive*of its prompt, its LLM, its execution mechanisms, etc\. We use performance measured on 200 held\-out tasks to approximate the ground\-truth performance\. Additional details are provided in the appendix\.

### 5\.2Results

#### Qualitative Result

We first qualitatively compare our algorithm with the baselines in[Figure6](https://arxiv.org/html/2605.21515#S4.F6)\. Our method produces a more informative yet flexible prior, which leads to more reasonable posterior updates than the baselines\. Our algorithm also approaches the no\-prior baseline, as the likelihood term dominates for a large number of examples\.

#### Quantitative Result

Results in[Figure7](https://arxiv.org/html/2605.21515#S4.F7)compare the performance in in\-domain and out\-of\-domain settings\. RAP performs best in both settings\. Further, using in\-domain examples consistently yields better performance, especially when only a limited number of examples are available\.

#### Ablation Study

Appendix[B](https://arxiv.org/html/2605.21515#A2)shows that RAP remains robust across retrieval hyperparameters, prior concentration choices, corpus sizes, and different LLM backbones\.

## 6Related Works

Reliability of symbolic and prompt programs\.Symbolic programs are amenable to*mechanical*correctness checks\. Reliability can be increased via a simple propose\-and\-reject loop—generate multiples​pspcandidates and accept only those that satisfy the checker\(Aluret al\.,[2013](https://arxiv.org/html/2605.21515#bib.bib17); Solar\-Lezama,[2008](https://arxiv.org/html/2605.21515#bib.bib19)\)\. Prompt programs are difficult to certify because “execution” is stochastic and implicit–correctness typically cannot be decided by a deterministic verifier\. As a result, reliability is often improved with sampling\-and\-selection heuristics \(e\.g\., self\-consistency\(Wanget al\.,[2022](https://arxiv.org/html/2605.21515#bib.bib20)\)/ best\-of\-NN\(Wanget al\.,[2024](https://arxiv.org/html/2605.21515#bib.bib21); Ouyanget al\.,[2022](https://arxiv.org/html/2605.21515#bib.bib22)\)\) and with empirical robustness evaluation\(Srivastavaet al\.,[2023](https://arxiv.org/html/2605.21515#bib.bib23); Lianget al\.,[2022](https://arxiv.org/html/2605.21515#bib.bib24)\)\.

Performance prediction\.One line of work studies*sensitivity*: small paraphrastic or formatting changes can cause substantial variance of performance\(Razaviet al\.,[2025](https://arxiv.org/html/2605.21515#bib.bib28)\)\. A second line focuses on*prompt optimization*, improving accuracy through more sample\-efficient optimization and selection strategies\(Agrawalet al\.,[2025](https://arxiv.org/html/2605.21515#bib.bib29); Shiet al\.,[2024](https://arxiv.org/html/2605.21515#bib.bib30)\)\. Broader*performance prediction*line of works aim to rank relative strengths of different LLMs using a smaller benchmark\(Wanget al\.,[2025](https://arxiv.org/html/2605.21515#bib.bib25); Yeet al\.,[2023](https://arxiv.org/html/2605.21515#bib.bib26)\), and to anticipate when an LLM may fail\(Pacchiardiet al\.,[2025](https://arxiv.org/html/2605.21515#bib.bib27)\)\. Our work is the first to take a simple coin\-toss model, and clearly quantifies the difference betweens​pspandp​pppusing this model\.

## 7Conclusion, Limitations, and Future Works

We study program performance prediction by: 1\) Framing prediction as posterior inferenceP​\(θf∣O\)P\(\\theta\_\{f\}\\mid O\), 2\) Show that the empirical priors differ sharply between symbolic \(s​psp\) and prompt \(p​ppp\) programs, and 3\) Introduce RAP, which predictsp​pppperformance from few examples by approximating an in\-domain prior using an arbitrary corpus\. We consider RAP a first line of attack on the problem of performance prediction, and it is far from perfect\. Chiefly, the top\-K retrievals of relevant tasks andp​ppps would be expensive as the sizes of the corpus grows\. In future work, we may explore actively querying the right \(task,p​ppp\) pair to execute, and by building a low\-ranking approximation of the corpus ofℳ=t​a​s​k​s×p​p​s\\mathcal\{M\}=tasks\\times pps\.

## References

- L\. A\. Agrawal, S\. Tan, D\. Soylu, N\. Ziems, R\. Khare, K\. Opsahl\-Ong, A\. Singhvi, H\. Shandilya, M\. J\. Ryan, M\. Jiang,et al\.\(2025\)Gepa: reflective prompt evolution can outperform reinforcement learning\.arXiv preprint arXiv:2507\.19457\.Cited by:[§6](https://arxiv.org/html/2605.21515#S6.p2.2)\.
- R\. Alur, R\. Bodik, G\. Juniwal, M\. M\. Martin, M\. Raghothaman, S\. A\. Seshia, R\. Singh, A\. Solar\-Lezama, E\. Torlak, and A\. Udupa \(2013\)Syntax\-guided synthesis\.In2013 Formal Methods in Computer\-Aided Design,pp\. 1–8\.Cited by:[§6](https://arxiv.org/html/2605.21515#S6.p1.2)\.
- M\. Chen, J\. Tworek, H\. Jun, Q\. Yuan, H\. P\. de Oliveira Pinto, J\. Kaplan, H\. Edwards, Y\. Burda, N\. Joseph, G\. Brockman, A\. Ray, R\. Puri, G\. Krueger, M\. Petrov, H\. Khlaaf, G\. Sastry, P\. Mishkin, B\. Chan, S\. Gray, N\. Ryder, M\. Pavlov, A\. Power, L\. Kaiser, M\. Bavarian, C\. Winter, P\. Tillet, F\. P\. Such, D\. Cummings, M\. Plappert, F\. Chantzis, E\. Barnes, A\. Herbert\-Voss, W\. H\. Guss, A\. Nichol, A\. Paino, N\. Tezak, J\. Tang, I\. Babuschkin, S\. Balaji, S\. Jain, W\. Saunders, C\. Hesse, A\. N\. Carr, J\. Leike, J\. Achiam, V\. Misra, E\. Morikawa, A\. Radford, M\. Knight, M\. Brundage, M\. Murati, K\. Mayer, P\. Welinder, B\. McGrew, D\. Amodei, S\. McCandlish, I\. Sutskever, and W\. Zaremba \(2021\)Evaluating large language models trained on code\.arXiv preprint arXiv:2107\.03374\.Cited by:[Table 1](https://arxiv.org/html/2605.21515#S2.T1.4.3.2.1)\.
- F\. Chollet \(2019\)On the measure of intelligence\.arXiv preprint arXiv:1911\.01547\.Cited by:[Table 1](https://arxiv.org/html/2605.21515#S2.T1.4.4.3.1)\.
- P\. Liang, R\. Bommasani, T\. Lee, D\. Tsipras, D\. Soylu, M\. Yasunaga, Y\. Zhang, D\. Narayanan, Y\. Wu, A\. Kumar,et al\.\(2022\)Holistic evaluation of language models\.arXiv preprint arXiv:2211\.09110\.Cited by:[§6](https://arxiv.org/html/2605.21515#S6.p1.2)\.
- M\. Mizrahi, G\. Kaplan, D\. Malkin, R\. Dror, D\. Shahaf, and G\. Stanovsky \(2024\)State of what art? a call for multi\-prompt llm evaluation\.External Links:2401\.00595,[Link](https://arxiv.org/abs/2401.00595)Cited by:[§1](https://arxiv.org/html/2605.21515#S1.p2.1)\.
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray,et al\.\(2022\)Training language models to follow instructions with human feedback\.Advances in neural information processing systems35,pp\. 27730–27744\.Cited by:[§6](https://arxiv.org/html/2605.21515#S6.p1.2)\.
- L\. Pacchiardi, K\. Voudouris, B\. Slater, F\. Martínez\-Plumed, J\. Hernández\-Orallo, L\. Zhou, and W\. Schellaert \(2025\)PredictaBoard: benchmarking llm score predictability\.arXiv preprint arXiv:2502\.14445\.Cited by:[§6](https://arxiv.org/html/2605.21515#S6.p2.2)\.
- A\. Razavi, M\. Soltangheis, N\. Arabzadeh, S\. Salamat, M\. Zihayat, and E\. Bagheri \(2025\)Benchmarking prompt sensitivity in large language models\.InEuropean Conference on Information Retrieval,pp\. 303–313\.Cited by:[§6](https://arxiv.org/html/2605.21515#S6.p2.2)\.
- J\. S\. Rule, S\. T\. Piantadosi, A\. Cropper, K\. Ellis, M\. Nye, and J\. B\. Tenenbaum \(2024\)Symbolic metaprogram search improves learning efficiency and explains rule learning in humans\.Nature Communications15\(1\),pp\. 6847\.Cited by:[Table 1](https://arxiv.org/html/2605.21515#S2.T1.4.5.4.1)\.
- C\. Shi, K\. Yang, Z\. Chen, J\. Li, J\. Yang, and C\. Shen \(2024\)Efficient prompt optimization through the lens of best arm identification\.Advances in Neural Information Processing Systems37,pp\. 99646–99685\.Cited by:[§6](https://arxiv.org/html/2605.21515#S6.p2.2)\.
- A\. Solar\-Lezama \(2008\)Program synthesis by sketching\.University of California, Berkeley\.Cited by:[§6](https://arxiv.org/html/2605.21515#S6.p1.2)\.
- A\. Srivastava, A\. Rastogi, A\. Rao, A\. A\. M\. Shoeb, A\. Abid, A\. Fisch, A\. R\. Brown, A\. Santoro, A\. Gupta, A\. Garriga\-Alonso,et al\.\(2023\)Beyond the imitation game: quantifying and extrapolating the capabilities of language models\.Transactions on machine learning research\.Cited by:[§6](https://arxiv.org/html/2605.21515#S6.p1.2)\.
- P\. Wang, L\. Li, Z\. Shao, R\. Xu, D\. Dai, Y\. Li, D\. Chen, Y\. Wu, and Z\. Sui \(2024\)Math\-shepherd: verify and reinforce llms step\-by\-step without human annotations\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 9426–9439\.Cited by:[§6](https://arxiv.org/html/2605.21515#S6.p1.2)\.
- S\. Wang, C\. Wang, W\. Fu, Y\. Min, M\. Feng, I\. Guan, X\. Hu, C\. He, C\. Wang, K\. Yang,et al\.\(2025\)Rethinking llm evaluation: can we evaluate llms with 200x less data?\.arXiv preprint arXiv:2510\.10457\.Cited by:[§6](https://arxiv.org/html/2605.21515#S6.p2.2)\.
- X\. Wang, J\. Wei, D\. Schuurmans, Q\. Le, E\. Chi, S\. Narang, A\. Chowdhery, and D\. Zhou \(2022\)Self\-consistency improves chain of thought reasoning in language models\.arXiv preprint arXiv:2203\.11171\.Cited by:[§6](https://arxiv.org/html/2605.21515#S6.p1.2)\.
- Q\. Ye, H\. Fu, X\. Ren, and R\. Jia \(2023\)How predictable are large language model capabilities? a case study on big\-bench\.InFindings of the Association for Computational Linguistics: EMNLP 2023,pp\. 7493–7517\.Cited by:[§6](https://arxiv.org/html/2605.21515#S6.p2.2)\.
- X\. Ye, Q\. Chen, I\. Dillig, and G\. Durrett \(2020\)Benchmarking multimodal regex synthesis with complex structures\.External Links:2005\.00663,[Link](https://arxiv.org/abs/2005.00663)Cited by:[Table 1](https://arxiv.org/html/2605.21515#S2.T1.4.2.1.1)\.
- L\. Zhou, P\. A\. Moreno\-Casares, F\. Martínez\-Plumed, J\. Burden, R\. Burnell, L\. Cheke, C\. Ferri, A\. Marcoci, B\. Mehrbakhsh, Y\. Moros\-Daval, S\. Ó\. heigeartaigh, D\. Rutar, W\. Schellaert, K\. Voudouris, and J\. Hernández\-Orallo \(2025\)Predictable artificial intelligence\.External Links:2310\.06167,[Link](https://arxiv.org/abs/2310.06167)Cited by:[§1](https://arxiv.org/html/2605.21515#S1.p1.1)\.
- L\. Zhou, W\. Schellaert, F\. Martínez\-Plumed, Y\. Moros\-Daval, C\. Ferri, and J\. Hernández\-Orallo \(2024\)Larger and more instructable language models become less reliable\.Nature634\(8032\),pp\. 61–68\.External Links:[Document](https://dx.doi.org/10.1038/s41586-024-07930-y),[Link](https://www.nature.com/articles/s41586-024-07930-y)Cited by:[§1](https://arxiv.org/html/2605.21515#S1.p2.1)\.

## Appendix AExperiments Details

### A\.1Hyperparameters Setting

RAP involves three hyperparameters in our experiments: the number of retrieved tasks, set ton=100n=100; the number of retrieved prompt programs, set toK=5K=5; and the maximum prior concentration, set tocmax=40c\_\{\\max\}=40\.

### A\.2Computational Cost

Constructing the full prior from scratch for all experiments costs approximately $300 in API usage, including the evaluation of prompt programs over the task corpus and the observed example sets\.

### A\.3The formula for All\-Corpus

For the all\-corpus baseline, every prompt programp​pm∈𝒫pp\_\{m\}\\in\\mathcal\{P\}is evaluated on all corpus tasks\. This yieldsαm\\alpha\_\{m\}successes andβm\\beta\_\{m\}failures for each prompt program\. Given the target prompt program’s observed outcomes on the in\-domain examples, letssandffdenote its numbers of successes and failures, respectively\. The all\-corpus posterior is then defined as:

pall\-corpus​\(θ\)=1\|𝒫\|​∑p​pm∈𝒫Beta​\(θ∣αm\+s,βm\+f\)\.p\_\{\\text\{all\-corpus\}\}\(\\theta\)=\\frac\{1\}\{\|\\mathcal\{P\}\|\}\\sum\_\{pp\_\{m\}\\in\\mathcal\{P\}\}\\mathrm\{Beta\}\\left\(\\theta\\mid\\alpha\_\{m\}\+s,\\ \\beta\_\{m\}\+f\\right\)\.\(17\)
We also refer to this baseline as*all\-prior*, since it uses all prompt programs in the corpus to construct the prior before incorporating the target program’s observed outcomes\.

### A\.4Benchmark Details

Our benchmarks cover three broad classes of prompt\-programming tasks: knowledge\-intensive question answering, code generation, and embodied agentic tasks\. Specifically, we use six MMLU\-PRO domains, two code\-generation benchmarks, and two ALFWorld task types\. These benchmarks provide a diverse testbed for evaluating whether RAP can construct useful task\-specific priors across different forms of prompt program execution\.

Table 2:Benchmark domains used in our experiments\. We may separate domains from MMLU\_PRO and Alfworld\. For most domains, three datasets are provided: a test set \(200 tasks\), an example set \(128 tasks\), and a question pool \(400 tasks\)\. More details can be found in appendix\.
### A\.5Part 1: Comparison Between Symbolic Programs and Prompt Programs

In this part, we investigate a diverse set of tasks spanning multiple domains\. Some tasks do not naturally provide more than 100 instances; for these cases, we augment or complete the task instances to ensure a sufficiently large evaluation set \(approximately 100\+ instances\) so that performance estimates are statistically reliable\.

For inference on tasks specified by few\-shot examples, we use GPT\-5\-mini as the underlying language model to ensure strong and stable prompt program performance\. For all other settings, we use GPT\-4o\-mini or other commonly adopted language model configurations\. Table[3](https://arxiv.org/html/2605.21515#A1.T3)summarizes the task specifications and model choices used in this part\.

Table 3:Benchmark tasks used in Part 1, including task specification format and transduction models\.
### A\.6Part 2: The details of pp only experiments

In the second part of the experiments, we evaluate 60 prompt programs, including 24 from multiple\-choice question answering, 24 from code generation, and 12 from embodied AI tasks\. We consider two language models, GPT\-4o\-mini and GPT\-5\-mini, and evaluate two prompting strategies: chain\-of\-thought \(CoT\), which explicitly elicits intermediate reasoning before producing an answer, and a vanilla setting, which does not allow the model to output reasoning content\.

For each domain, we use the official checker to verify correctness\. Most reported success rates are consistent with the official records\.

Task construction proceeds as follows\. The test set is used to obtain ground\-truth performance, the example set is sampled to represent observed tasks, and the question pool is used to construct the task corpus\. For the HumanEval domain, due to the limited number of tasks, the dataset is used exclusively as the question pool\. For agentic domains, the size of the question pool is reduced because executing multi\-turn environments is computationally expensive\.

We construct prompt programs using commonly adopted prompt\-design paradigms for each benchmark category\. In particular, we instantiate task\-specific instruction templates for multiple\-choice question answering, code generation, and embodied agentic tasks, while ignoring fixed formatting details imposed by the prompt\-programming Dspy framework\. Representative examples are shown below\.

#### MMLU\-Pro\.

For multiple\-choice question answering, the prompt encourages careful reasoning over the question and answer options:

Analyze the multiple\-choice question carefully\. Reason step by step, identify the key concept, evaluate each option, eliminate weaker choices, and select the single best answer\. Return only the final answer in the required format\.

#### MBPP\.

For code\-generation tasks, the prompt instructs the model to produce a correct and readable Python function:

Write a clean, efficient Python function that solves the problem and passes all tests\. Read the description, constraints, examples, and assertions carefully, handle edge cases, and ensure correctness and readability\.

#### ALFWorld\.

For embodied agentic tasks, the prompt specifies the task context and asks the model to select the next action:

You are an embodied household agent in an interactive environment\. Your goal is to complete the given task description by selecting the best next action based on the task description, current observation, previous trajectory, and possible actions\.

## Appendix BAblation Study

### B\.1Hyperparameter Ablation Study

#### Retrieval Hyperparameters\.

We ablate the two retrieval hyperparameters used by RAP: the number of retrieved prompt programsKKand the number of retrieved tasksnn\. IncreasingKKcan provide more diverse prior components, but may also introduce irrelevant priors when the retrieved programs are not well aligned with the target program\. In contrast, increasingnngenerally improves performance by providing more task\-level evidence for constructing a stable retrieval context\.

![Refer to caption](https://arxiv.org/html/2605.21515v1/ablation_study_for_k.png)\(a\)Number of retrieved prompt programsKK\.
![Refer to caption](https://arxiv.org/html/2605.21515v1/ablation_study_for_n.png)\(b\)Number of retrieved tasksnn\.

Figure 8:Hyperparameter ablation for RAP\. Left: increasingKKcan improve prior diversity but may also introduce irrelevant prior components\. Right: increasingnngenerally improves prediction accuracy by stabilizing the retrieval context\.
#### Effect of Prior Concentration Cap\.

We also ablate the maximum prior concentrationcmaxc\_\{\\max\}, which controls the maximum strength assigned to each retrieved Beta prior component after similarity\-based rescaling\. A smallcmaxc\_\{\\max\}makes the retrieved prior easier to update with observed examples, while a largecmaxc\_\{\\max\}allows the retrieved prior to have stronger influence but may make the posterior overconfident when retrieval is imperfect\.

![Refer to caption](https://arxiv.org/html/2605.21515v1/ablation_study_c_max.png)Figure 9:Ablation over the maximum prior concentrationcmaxc\_\{\\max\}\. Moderate values provide a good balance between using retrieved prior information and remaining adaptable to observed in\-domain examples\.

### B\.2Effects of Corpus Size

Intuitively, the larger the corpus size, the more likely our algorithm will be able to retrieve similar tasks andp​ppps\. We generate a sequence of corpora with increasing sizes and evaluate RAP’s performance\. Results are shown in[Figure10](https://arxiv.org/html/2605.21515#A2.F10)\. In both the in\-domain and out\-of\-domain settings, bigger corpus yields better results\.

![Refer to caption](https://arxiv.org/html/2605.21515v1/images/corpus_size_vs_score/in_domain/posterior_score_curve_corpus_only.png)\(a\)Posterior score curve \(in\-domain\)
![Refer to caption](https://arxiv.org/html/2605.21515v1/images/corpus_size_vs_score/out_domain/posterior_score_curve_corpus_only.png)\(b\)Posterior score curve \(out\-of\-domain\)

Figure 10:RAP benefits from a larger corpus size, where a larger corpus boosts performance in the in\-domain setting \(a\) and does not harm performance in out\-of\-domain setting \(b\)
### B\.3Effects of LLM Capabilities

In our experiments, we find that different language model settings substantially influence prediction performance\. We investigate this phenomenon, with results shown in[Figure11](https://arxiv.org/html/2605.21515#A2.F11)

![Refer to caption](https://arxiv.org/html/2605.21515v1/images/4o-mini_vs_5-mini/4o-mini/distribution/gt_perf_gpt4o-mini.png)

![Refer to caption](https://arxiv.org/html/2605.21515v1/images/4o-mini_vs_5-mini/4o-mini/error/posterior_error_plot_single_4o-mini.png)

![Refer to caption](https://arxiv.org/html/2605.21515v1/images/4o-mini_vs_5-mini/4o-mini/score/posterior_score_curve_4o-mini.png)

\(a\)4o\-mini
![Refer to caption](https://arxiv.org/html/2605.21515v1/images/4o-mini_vs_5-mini/5-mini/distribution/gt_perf_gpt5-mini.png)

![Refer to caption](https://arxiv.org/html/2605.21515v1/images/4o-mini_vs_5-mini/5-mini/error/posterior_error_plot_single_5-mini.png)

![Refer to caption](https://arxiv.org/html/2605.21515v1/images/4o-mini_vs_5-mini/5-mini/score/posterior_score_curve_5-mini.png)

\(b\)5\-mini

Figure 11:First row: the performance prior forp​pppif using a weaker model \(GPT\-4o\-mini\) versus a stronger model \(GPT\-5\-mini\)\. Stronger model has a prior that’s more similar to the prior for symbolic programs\. Consequently, theno corpusbaseline becomes relatively stronger when used on stronger models\. RAP still performs best in all settings\.\. We find that RAP performs comparatively better for weaker LLMs, where the performance prior is more diffused, and worse for stronger LLMs, whose performance prior is similar to that ofs​psp\.

### B\.4Effects of Text Embedding

As RAP will always retrieve tasks that are textually similar, it is good to understand if textual similarity implies performance prior similarity\.[Figure12](https://arxiv.org/html/2605.21515#A2.F12)shows the correlation between textual similarity and performance prior similarity across different domains\. As we can see, there exists certain pairs of domains \(e\.g\. chemistry\-engineering\) which are very similar in the textual space \(y\-axis\) but are dissimilar in the prior \(x\-axis\)\. This is the “risk area” of our algorithm, where the retrieved tasks do not approximate the prior on the target domain\. RAP mitigate this issue with prior strength normalization\.

![Refer to caption](https://arxiv.org/html/2605.21515v1/x2.png)Figure 12:Relationship between textual similarity \(of pairs of domains\) and performance prior similarity \(of pairs of domains\)\. Retrieval based on textual similarity does not guarantee similarity in performance prior, for instance, the pair chemistry\-engineering\. The upper\-left corner is the*risk area*where RAP might retrieve tasks which induces a prior dis\-similar to the target domain\. RAP mitigate this issue with prior strength normalization\.

Similar Articles

Localizing Prompt Ambiguity in Large Language Models with Probe-Targeted Attribution

arXiv cs.CL

Introduces PRIG, a gradient attribution method that localizes prompt ambiguity in large language models by training a linear probe to distinguish clear from ambiguous prompts and attributing the probe score to token representations in the residual stream, achieving strong performance on synthetic and human-written benchmarks.