Reflective Prompt Tuning through Language Model Function-Calling

arXiv cs.CL 05/22/26, 04:00 AM Papers
prompt-optimization function-calling llm calibration reasoning automated-prompting
Summary
Introduces Reflective Prompt Tuning (RPT), a framework that uses LLM function-calling to iteratively diagnose and revise prompts based on systematic error patterns, improving reasoning task performance and calibration.
arXiv:2605.21781v1 Announce Type: new Abstract: Large language models (LLMs) have become increasingly capable of following instructions and complex reasoning, making prompting a flexible interface for adapting models without parameter updates. Yet prompt design remains labor-intensive and highly sensitive to formatting, phrasing, and instruction order, motivating automated prompt optimization methods that reduce manual effort while preserving inference-time flexibility. However, existing methods often search over prompt candidates or use fixed critique-refine pipelines driven by individual examples or small batches, limiting their ability to capture systematic error patterns and make targeted edits grounded in failure history. We propose Reflective Prompt Tuning (RPT), a framework that uses LLM function calling to simulate the iterative workflow of human prompt engineers. An LLM optimizer calls a diagnostic function that evaluates the target model over an entire optimization set, summarizes recurring failure modes, and returns a structured diagnostic report. The optimizer uses this report, together with an accumulated memory of prior reports, to revise the prompt for the next iteration. RPT further supports confidence-aware optimization by using calibration signals in diagnostic feedback and final prompt selection. Across three reasoning tasks, RPT improves over initial prompts by up to 12.9 points, remains competitive with state of the art, and improves confidence calibration. Our analyses show that RPT is especially effective on multi-hop and mathematical reasoning, producing targeted prompt revisions that align with diagnosed failure patterns and lead to gains in task performance and calibration.
Cached at: 05/22/26, 08:43 AM
# Reflective Prompt Tuning through Language Model Function-Calling
Source: [https://arxiv.org/html/2605.21781](https://arxiv.org/html/2605.21781)
Farima Fatahi Bayat, Moin Aminnaseri, Pouya Pezeshkpour, Estevam Hruschka, Megagon Labs, {farima, moin, pouya, estevam}@megagon.ai

###### Abstract

Large language models (LLMs) have become increasingly capable of following instructions and complex reasoning, making prompting a flexible interface for adapting models without parameter updates. Yet prompt design remains labor-intensive and highly sensitive to formatting, phrasing, and instruction order, motivating automated prompt optimization methods that reduce manual effort while preserving inference-time flexibility. However, existing methods often search over prompt candidates or use fixed critique-refine pipelines driven by individual examples or small batches, limiting their ability to capture systematic error patterns and make targeted edits grounded in failure history. We propose Reflective Prompt Tuning (RPT), a framework that uses LLM function calling to simulate the iterative workflow of human prompt engineers. An LLM optimizer calls a diagnostic function that evaluates the target model over an entire optimization set, summarizes recurring failure modes, and returns a structured diagnostic report. The optimizer uses this report, together with an accumulated memory of prior reports, to revise the prompt for the next iteration. RPT further supports confidence-aware optimization by using calibration signals in diagnostic feedback and final prompt selection. Across three reasoning tasks, RPT improves over initial prompts by up to 12.9 points, remains competitive with state of the art, and improves confidence calibration. Our analyses show that RPT is especially effective on multi-hop and mathematical reasoning, producing targeted prompt revisions that align with diagnosed failure patterns and lead to gains in task performance and calibration. We release our code at: [https://github.com/megagonlabs/RPT](https://github.com/megagonlabs/RPT).


## 1 Introduction

Large language models (LLMs) have become increasingly adept at following instructions and performing complex reasoning, making contextual prompting the dominant mechanism for adapting model behavior to downstream tasks (Lou et al., [2024](https://arxiv.org/html/2605.21781#bib.bib36); Wei et al., [2022](https://arxiv.org/html/2605.21781#bib.bib40); Kojima et al., [2022](https://arxiv.org/html/2605.21781#bib.bib39)). Prompts let users specify objectives, constraints, and output formats without modifying model parameters, enabling rapid adaptation across applications (Sahoo et al., [2025](https://arxiv.org/html/2605.21781#bib.bib37); Schulhoff et al., [2025](https://arxiv.org/html/2605.21781#bib.bib38)).

Despite this flexibility, prompt design remains a major bottleneck. Crafting effective prompts is often a manual and iterative process that relies on trial and error and, in some cases, requires substantial expertise (Zamfirescu-Pereira et al., [2023](https://arxiv.org/html/2605.21781#bib.bib41); Knoth et al., [2024](https://arxiv.org/html/2605.21781#bib.bib42)). Moreover, LLMs exhibit unpredictable sensitivity to seemingly minor choices such as formatting, phrasing, and instruction ordering, so prompt effectiveness may not generalize reliably across settings (Zhuo et al., [2024](https://arxiv.org/html/2605.21781#bib.bib43); Sclar et al., [2024](https://arxiv.org/html/2605.21781#bib.bib44)). These challenges have motivated automated prompt optimization methods that aim to reduce manual prompt-engineering effort by automatically searching for, selecting, or revising prompts based on task objectives (Ramnath et al., [2025](https://arxiv.org/html/2605.21781#bib.bib45)).

The current state of the art increasingly uses textual feedback to guide prompt optimization (Shinn et al., [2023](https://arxiv.org/html/2605.21781#bib.bib25); Yuksekgonul et al., [2024](https://arxiv.org/html/2605.21781#bib.bib24); Agrawal et al., [2025](https://arxiv.org/html/2605.21781#bib.bib23)). In this paradigm, an optimizer inspects signals such as execution traces, reasoning steps, or evaluator feedback, and proposes prompt revisions. However, existing methods have several limitations. First, many follow fixed context-updating pipelines. For example, ACE (Zhang et al., [2026](https://arxiv.org/html/2605.21781#bib.bib32)) updates an auxiliary playbook of reusable strategies inserted into a fixed prompt template. While this can improve stability, it limits the optimizer’s ability to make arbitrary prompt-level revisions. Second, updates in each iteration are often driven by individual examples (Zhang et al., [2026](https://arxiv.org/html/2605.21781#bib.bib32)) or minibatch subsets (Opsahl-Ong et al., [2024a](https://arxiv.org/html/2605.21781#bib.bib28); Agrawal et al., [2025](https://arxiv.org/html/2605.21781#bib.bib23); Yuksekgonul et al., [2024](https://arxiv.org/html/2605.21781#bib.bib24)), making optimization sensitive to local rather than recurring failures. Third, most methods lack explicit memory over prior diagnostic reports and prompt revisions, limiting credit assignment across iterations. Finally, prompt selection is typically driven by task performance alone, leaving broader reliability properties outside the optimization criterion. Although GEPA (Agrawal et al., [2025](https://arxiv.org/html/2605.21781#bib.bib23)) incorporates auxiliary evaluation signals, its prompt selection remains primarily task-performance driven.

To address these limitations, we propose Reflective Prompt Tuning (RPT), a framework that leverages LLMs’ function-calling capabilities to mimic the iterative workflow of human prompt engineers. Modern LLMs can call external functions, inspect structured outputs, and reason over feedback from those calls to guide subsequent decisions. RPT builds on these capabilities by using an LLM as an active prompt optimizer that inspects model behavior and revises the prompt through an explicit diagnostic function. Starting from a seed prompt, the optimizer iteratively calls the diagnostic function to evaluate the target model and return a structured diagnostic report. This function collects behavioral traces, critiques incorrect responses by diagnosing their failure modes, clusters these diagnoses to identify recurring failure patterns, and summarizes where the current prompt breaks down. The optimizer conditions on this report together with an accumulated memory of prior reports and prompt revisions, enabling it to reason about persistent failures and previous refinement attempts rather than treating each update in isolation. RPT further supports confidence-aware optimization by incorporating calibration diagnostics into both the feedback shown to the optimizer and the development-set criterion used to select the final prompt.

We evaluate RPT on three reasoning tasks spanning multi-hop reasoning over textual evidence with HotPotQA (Yang et al., [2018](https://arxiv.org/html/2605.21781#bib.bib46)), mathematical reasoning with LiveBench-Math (White et al., [2025](https://arxiv.org/html/2605.21781#bib.bib47)), and domain-specific numerical reasoning with Formula (Wang et al., [2025](https://arxiv.org/html/2605.21781#bib.bib48)). Using GPT-4.1 as the target model, we compare RPT against state-of-the-art automated prompt-optimization baselines, including ACE (Zhang et al., [2026](https://arxiv.org/html/2605.21781#bib.bib32)), GEPA (Agrawal et al., [2025](https://arxiv.org/html/2605.21781#bib.bib23)), and MIPRO (Opsahl-Ong et al., [2024a](https://arxiv.org/html/2605.21781#bib.bib28)). Across tasks, RPT consistently improves over initial prompts, achieving gains of up to +12.9 points on HotPotQA, +12.4 points on LiveBench-Math, and +11.7 points on Formula, while remaining competitive with state-of-the-art baselines. Our confidence-aware experiments further show that incorporating calibration signals into both diagnostic feedback and final prompt selection improves calibration alongside task performance. Finally, analysis of optimization traces shows that RPT produces targeted prompt revisions aligned with diagnosed failure modes, offering insight into why and how prompts are revised across iterations. Together, these results suggest that tool-calling LLMs can enable scalable and interpretable prompt optimization.

![Refer to caption](https://arxiv.org/html/2605.21781v1/x1.png)

Figure 1: Overview of Reflective Prompt Tuning (RPT). At each iteration, the optimizer calls a diagnostic function that evaluates the current prompt on $D_{\mathrm{train}}$, critiques failures, clusters recurring failure modes, and returns a structured report. The optimizer uses this report and prior reports to generate the next prompt.

## 2 Reflective Prompt Tuning (RPT)

We present Reflective Prompt Tuning (RPT), a diagnosis-driven prompt optimization framework. RPT automates the iterative workflow of prompt engineers: run a prompt, inspect outputs, identify recurring failures, revise the prompt, and repeat. Recent advances in LLM function calling and reasoning over tool outputs enable LLMs to serve as prompt optimizers (Schick et al., [2023](https://arxiv.org/html/2605.21781#bib.bib34); Gou et al., [2024](https://arxiv.org/html/2605.21781#bib.bib35); Yuksekgonul et al., [2024](https://arxiv.org/html/2605.21781#bib.bib24)). We first formulate prompt optimization as selecting a prompt that improves task performance and confidence calibration (Section [2.1](https://arxiv.org/html/2605.21781#S2.SS1)). We then describe how RPT constructs diagnostic feedback and reflectively revises prompts based on diagnosed failures (Section [2.2](https://arxiv.org/html/2605.21781#S2.SS2)). All prompts used in RPT are in Appendix [7](https://arxiv.org/html/2605.21781#S7).

### 2.1 Problem Statement

Let $f_\theta$ be the target model and $p_t$ the prompt at optimization iteration $t$. Given input $x$, the model produces

$$f_\theta(x; p_t) = (r, \hat{y}, c), \tag{1}$$

where $r$ is the reasoning trace, $\hat{y}$ is the final answer, and $c$ is the reported confidence. We assume an optimization set $D_{\mathrm{train}}$, a development set $D_{\mathrm{dev}}$, and a held-out test set $D_{\mathrm{test}}$. The goal is to generate candidate prompts $\{p_0, \ldots, p_T\}$ and select a final prompt $p^*$ using development-set performance. Let

$$\mathcal{O}(p; D) = \{\mu_1(p; D), \ldots, \mu_n(p; D)\} \tag{2}$$

denote the set of evaluation metrics for prompt $p$ on dataset $D$, including task performance metrics and confidence calibration error. We use a scalar selection function $\Phi$ to combine these metrics and select the final prompt:

$$p^* = \arg\max_{p_t \in \{p_0, \ldots, p_T\}} \Phi\left(\mathcal{O}(p_t; D_{\mathrm{dev}})\right) \tag{3}$$

Appendix [7.8](https://arxiv.org/html/2605.21781#S7.SS8) further shows that prompts often grow during optimization, but longer prompts do not necessarily yield better development performance, motivating development-set selection. In the confidence-aware setting, $\Phi$ jointly accounts for task performance and calibration by rewarding higher task scores while penalizing miscalibration, for example through a negative Brier-score term. The selected prompt $p^*$ is then evaluated on the held-out test set $D_{\mathrm{test}}$.
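As a minimal sketch of the selection rule in Eq. (3), the snippet below scores each candidate on development metrics and keeps the arg max. The linear form of `phi`, the weight `lam`, and the toy metric values are assumptions made for illustration; the text only states that the confidence-aware objective rewards task score and penalizes miscalibration.

```python
from dataclasses import dataclass


@dataclass
class DevMetrics:
    task_score: float  # higher is better
    brier: float       # lower is better


def phi(m: DevMetrics, lam: float = 1.0) -> float:
    """Hypothetical scalar selection function: task score minus a Brier penalty."""
    return m.task_score - lam * m.brier


def select_prompt(candidates: dict[str, DevMetrics], lam: float = 1.0) -> str:
    """Return the candidate whose development metrics maximize phi."""
    return max(candidates, key=lambda name: phi(candidates[name], lam))


# Toy development-set metrics for three candidate prompts (made-up numbers).
candidates = {
    "p0": DevMetrics(task_score=0.60, brier=0.30),
    "p1": DevMetrics(task_score=0.70, brier=0.35),
    "p2": DevMetrics(task_score=0.69, brier=0.20),
}

best_confidence_aware = select_prompt(candidates, lam=1.0)
best_task_only = select_prompt(candidates, lam=0.0)
```

With `lam=0.0` the rule reduces to task-only selection and picks `p1`; with the Brier penalty switched on, the slightly less accurate but better calibrated `p2` wins.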

### 2.2 Methodology Overview

We formulate RPT as a two-stage textual update process, illustrated in Figure [1](https://arxiv.org/html/2605.21781#S1.F1). First, RPT constructs response-level feedback (Section [2.2.1](https://arxiv.org/html/2605.21781#S2.SS2.SSS1)): given the current prompt $p_t$, the diagnostic function evaluates target-model outputs on $D_{\mathrm{train}}$, critiques incorrect responses, identifies recurring failure modes, and summarizes them with aggregate metrics into a diagnostic report $\mathcal{R}_t$. Second, RPT translates this report into a prompt-level revision (Section [2.2.2](https://arxiv.org/html/2605.21781#S2.SS2.SSS2)): conditioned on $p_t$, $\mathcal{R}_t$, and a memory of prior reports, the optimizer infers likely prompt shortcomings and produces the next prompt $p_{t+1}$.

#### 2.2.1 Constructing Diagnostic Feedback

The diagnostic function connects target-model behavior to the optimizer LLM. Given the current prompt $p_t$, the optimizer invokes this function to evaluate the target model on the full optimization set $D_{\mathrm{train}}$ and return a structured diagnostic report $\mathcal{R}_t$. The report captures not only *how well* the prompt performs, but also how target-model outputs fail and which failures recur across the dataset.
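To make the function-calling interface concrete, the sketch below shows how a diagnostic function might be described to an optimizer and how an emitted call might be routed to Python. The tool name `run_diagnostics`, its single `prompt` argument, the stub report, and the call envelope are all hypothetical; function-calling APIs generally accept a JSON-Schema-style parameter description, but the exact wrapper differs by provider and is not specified here.

```python
import json

# Hypothetical tool description in JSON Schema style (not taken from the paper).
DIAGNOSTIC_TOOL = {
    "name": "run_diagnostics",
    "description": "Evaluate a prompt on the full optimization set and "
                   "return a structured diagnostic report.",
    "parameters": {
        "type": "object",
        "properties": {
            "prompt": {"type": "string", "description": "Candidate prompt p_t"},
        },
        "required": ["prompt"],
    },
}


def run_diagnostics(prompt: str) -> dict:
    """Stub standing in for the real diagnostic function."""
    return {"prompt": prompt, "metrics": {}, "failure_topics": []}


def dispatch(call: dict) -> str:
    """Route a model-emitted call to the Python function; return JSON text."""
    if call["name"] != DIAGNOSTIC_TOOL["name"]:
        raise ValueError(f"unknown tool: {call['name']}")
    args = json.loads(call["arguments"])  # arguments arrive as a JSON string
    return json.dumps(run_diagnostics(**args))


# Simulated call, shaped the way an optimizer LLM might emit it.
result = dispatch({
    "name": "run_diagnostics",
    "arguments": json.dumps({"prompt": "Answer concisely."}),
})
```

The returned JSON string is what would be handed back to the optimizer as the tool result for the next turn.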

##### Behavior collection and scoring\.

The diagnostic function first runs the target model $f_\theta$ with prompt $p_t$ on each example $(x_i, y_i) \in D_{\mathrm{train}}$. For each example, it records the reasoning trace $r_i$, final answer $\hat{y}_i$, and reported confidence $c_i$. It then computes task-specific performance metrics, along with average confidence and Brier score for calibration. These metrics capture overall prompt quality, but do not explain the causes of failures.
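The calibration metrics named above can be illustrated with a short sketch. It uses the standard Brier score for binary outcomes, the mean squared difference between reported confidence and 0/1 correctness; the function names and the toy values are illustrative and not taken from the released code.

```python
def brier_score(confidences: list[float], correct: list[bool]) -> float:
    """Mean squared gap between reported confidence and 0/1 correctness."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one example is required")
    total = sum((c - float(y)) ** 2 for c, y in zip(confidences, correct))
    return total / len(confidences)


def aggregate_metrics(confidences: list[float], correct: list[bool]) -> dict[str, float]:
    """Aggregate metrics of the kind a diagnostic report might carry."""
    n = len(correct)
    return {
        "accuracy": sum(correct) / n,
        "avg_confidence": sum(confidences) / n,
        "brier": brier_score(confidences, correct),
    }


# Four toy responses: two correct, two incorrect (one of them overconfident).
metrics = aggregate_metrics([0.9, 0.8, 0.6, 0.3], [True, False, True, False])
```

On these toy values the accuracy is 0.5, the average confidence is 0.65, and the Brier score is 0.225, most of which comes from the overconfident wrong answer at 0.8.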

##### Failure detection and critique\.

Next, the function identifies failed examples using the task-specific evaluator:

$$\mathcal{I}_t = \{(x_i, y_i, \hat{y}_i, r_i, c_i) \mid i \text{ is incorrect}\} \tag{4}$$

Then, for each failed example $i \in \mathcal{I}_t$, a critique LLM generates concise response-level diagnoses of how the target-model output fails with respect to the expected answer $y_i$ and the evaluation criteria. Since RPT elicits confidence as part of the target-model output, the critique also assesses whether the reported confidence $c_i$ is appropriate based on the response’s correctness and quality. These diagnoses capture local issues such as incorrect reasoning, unsupported evidence use, formatting errors, or overconfident incorrect answers.

Each failed instance may yield up to three diagnoses to improve coverage and reduce sensitivity to individual critiques. Let the resulting pool of sample-level failure diagnoses be:

$$\mathcal{Z}_t = \{z_{i,j} \mid i \in \mathcal{I}_t,\; j \le 3\} \tag{5}$$
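The construction of the failure set and the diagnosis pool in Eqs. (4) and (5) can be sketched as follows. The `Trace` record, the exact-match evaluator, and the canned `stub_critique` are stand-ins chosen for illustration; in RPT the critique comes from an LLM and the evaluator is task-specific.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trace:
    example_id: int
    gold: str          # y_i
    answer: str        # predicted answer
    reasoning: str     # r_i
    confidence: float  # c_i


def collect_diagnoses(
    traces: list[Trace],
    is_correct: Callable[[Trace], bool],
    critique: Callable[[Trace], list[str]],
    max_per_example: int = 3,
) -> tuple[list[Trace], list[tuple[int, int, str]]]:
    """Build the failed set I_t and the diagnosis pool Z_t of (i, j, z_ij)."""
    failed = [t for t in traces if not is_correct(t)]
    pool = []
    for t in failed:
        # Keep at most three diagnoses per failed example.
        for j, z in enumerate(critique(t)[:max_per_example], start=1):
            pool.append((t.example_id, j, z))
    return failed, pool


def exact_match(t: Trace) -> bool:
    return t.answer.strip().lower() == t.gold.strip().lower()


def stub_critique(t: Trace) -> list[str]:
    # Canned output standing in for a critique LLM call.
    return ["wrong entity selected", "second hop skipped",
            "overconfident given the error", "fourth diagnosis, dropped by the cap"]


traces = [
    Trace(0, "Paris", "Paris", "...", 0.90),
    Trace(1, "1998", "2001", "...", 0.95),
]
failed, pool = collect_diagnoses(traces, exact_match, stub_critique)
```

Only the second trace fails, so the pool holds three diagnoses for example 1 and the fourth canned critique is discarded by the cap.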

##### Identifying recurring failure modes\.

The diagnoses in $\mathcal{Z}_t$ provide local feedback about individual failures, but prompt revision benefits from identifying patterns that recur across the optimization set. To convert response-level critiques into dataset-level diagnostic feedback, RPT applies ClusterFusion (Xu et al., [2025](https://arxiv.org/html/2605.21781#bib.bib8)) to $\mathcal{Z}_t$, grouping semantically similar diagnoses into recurring failure topics:

$$\mathcal{C}_t = \{(a_k, d_k, S_k)\}_{k=1}^{K}, \tag{6}$$

where $a_k$ is a short topic label, $d_k$ describes the failure mode, and $S_k$ contains representative examples. This aggregation compresses local critiques into a compact summary of systematic target-model failures, helping the optimizer infer prompt-level shortcomings and propose targeted revisions. The number of topics $K$ controls the summary granularity (details on $K$ selection in Appendix [7.5](https://arxiv.org/html/2605.21781#S7.SS5)).

##### Diagnostic report generation\.

The diagnostic function returns a structured report

$$\mathcal{R}_t = \left(p_t, \mathcal{O}(p_t; D_{\mathrm{train}}), \mathcal{C}'_t\right), \tag{7}$$

where $p_t$ is the current prompt, $\mathcal{O}(p_t; D_{\mathrm{train}})$ contains aggregate metrics, and $\mathcal{C}'_t \subseteq \mathcal{C}_t$ denotes the retained subset of clustered failure topics, with representative examples and summaries. We retain a subset $\mathcal{C}'_t$ to keep the report focused on prominent recurring patterns; details are provided in Appendix [7.5](https://arxiv.org/html/2605.21781#S7.SS5). Together, these components turn feedback from scalar scoring into structured diagnosis.
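One way to picture the report in Eq. (7) is as a small record with three fields. In the sketch below, the field names, the per-topic `size` count, and the rule "keep the `top_n` largest topics" are assumptions for illustration only; the actual retention criterion is described in the appendix and is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class FailureTopic:
    label: str           # a_k: short topic label
    description: str     # d_k: description of the failure mode
    examples: list[str]  # S_k: representative examples
    size: int            # number of diagnoses in the cluster (assumed available)


@dataclass
class DiagnosticReport:
    prompt: str                  # p_t
    metrics: dict[str, float]    # aggregate metrics on the optimization set
    topics: list[FailureTopic]   # retained subset of clustered topics


def build_report(prompt: str, metrics: dict[str, float],
                 topics: list[FailureTopic], top_n: int = 2) -> DiagnosticReport:
    """Keep only the most prominent topics (hypothetical retention rule)."""
    retained = sorted(topics, key=lambda t: t.size, reverse=True)[:top_n]
    return DiagnosticReport(prompt, metrics, retained)


topics = [
    FailureTopic("format", "answer not in required form", ["ex-3"], size=4),
    FailureTopic("multi-hop", "second hop skipped", ["ex-7", "ex-9"], size=11),
    FailureTopic("rounding", "premature rounding", ["ex-2"], size=1),
]
report = build_report("Answer the question.", {"accuracy": 0.62, "brier": 0.21}, topics)
```

Here the report keeps the "multi-hop" and "format" topics and drops the single-instance "rounding" topic, which keeps the text handed to the optimizer short.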

History is maintained in an external memory outside the diagnostic function. At iteration $t$, the optimizer receives the current report $\mathcal{R}_t$ together with prior reports $\mathcal{M}_{<t}$. After the iteration, the current report is appended for future use:

$$\mathcal{M}_{<t+1} = \mathrm{Append}(\mathcal{M}_{<t}, \mathcal{R}_t). \tag{8}$$

This lets the optimizer reason over the optimization trajectory rather than only the current report. In practice, memory grows linearly with the iteration budget $T$, but remains manageable because each report stores only aggregate metrics and a filtered set of recurring failure clusters.

#### 2.2.2 Reflective Prompt Revision with Memory

Given the current diagnostic report $\mathcal{R}_t$ and the external memory of prior reports $\mathcal{M}_{<t}$, the optimizer identifies which recurring response-level failures indicate shortcomings of the current prompt and generates a revision. Formally,

$$p_{t+1} = \mathrm{LLM}_{\mathrm{opt}}(p_t, \mathcal{R}_t, \mathcal{M}_{<t}). \tag{9}$$

The optimizer treats diagnostic reports as evidence for revision: it inspects aggregate metrics, recurring failure topics, representative examples, and previous prompt changes.

The external memory $\mathcal{M}$ helps address the credit-assignment challenge in prompt optimization (Opsahl-Ong et al., [2024b](https://arxiv.org/html/2605.21781#bib.bib22); Yuksekgonul et al., [2024](https://arxiv.org/html/2605.21781#bib.bib24)). A prompt edit may improve some metrics while worsening others, a failure may require several revisions to resolve, and repeated failures may indicate ineffective prior edits. By conditioning on prior reports, the optimizer can track persistent failures, previous revision attempts, and performance changes over time. Thus, RPT treats history as memory over the optimization trajectory rather than treating each update as an independent proposal.
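The interplay between diagnosis, revision, and memory can be summarized in a short loop. This is a sketch under stated assumptions: `diagnose` and `revise` are placeholder callables standing in for the diagnostic function and the optimizer LLM, and the toy versions below are deterministic stubs rather than model calls.

```python
from typing import Callable

Report = dict  # stand-in for a structured diagnostic report


def optimize(
    seed_prompt: str,
    diagnose: Callable[[str], Report],
    revise: Callable[[str, Report, list[Report]], str],
    iterations: int,
) -> tuple[list[str], list[Report]]:
    """Run the diagnose-then-revise loop, keeping every candidate prompt."""
    memory: list[Report] = []          # prior reports, oldest first
    prompts = [seed_prompt]            # candidates p_0, ..., p_T
    prompt = seed_prompt
    for _ in range(iterations):
        report = diagnose(prompt)                       # current report
        prompt = revise(prompt, report, list(memory))   # sees prior reports only
        memory.append(report)                           # append after revising
        prompts.append(prompt)
    return prompts, memory


def toy_diagnose(prompt: str) -> Report:
    failures = [] if "units" in prompt else ["answers omit units"]
    return {"prompt": prompt, "failures": failures}


def toy_revise(prompt: str, report: Report, memory: list[Report]) -> str:
    # Add a rule only when the current report shows a failure.
    return prompt + " Always state units." if report["failures"] else prompt


prompts, memory = optimize("Solve the problem.", toy_diagnose, toy_revise, iterations=2)
```

The loop returns all candidates so that final selection can be done separately on the development set, and the revision step receives a copy of the memory so that it cannot mutate the stored history.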

Table 1: Benchmark results for different prompt optimizers evaluated on GPT-4.1 as the target model. Columns report task-specific metrics: accuracy for HotPotQA and Formula, and task score for LiveBench-Math. Final denotes performance of the optimized prompt. Best final scores within each optimizer and dataset are shown in bold. RPT consistently improves over initial prompts and remains competitive with state-of-the-art baselines.

## 3 Experimental Setup

##### Tasks and Datasets\.

We optimize and evaluate prompts on three reasoning tasks: multi-hop reasoning over textual evidence (HotPotQA; Yang et al. ([2018](https://arxiv.org/html/2605.21781#bib.bib46))), mathematical reasoning (LiveBench-Math; White et al. ([2025](https://arxiv.org/html/2605.21781#bib.bib47))), and domain-specific numerical reasoning (Formula; Wang et al. ([2025](https://arxiv.org/html/2605.21781#bib.bib48))). Additional dataset statistics and details can be found in Appendix [7.4](https://arxiv.org/html/2605.21781#S7.SS4).

##### Target model and optimizer LLMs\.

We use GPT-4.1 (OpenAI, [2025b](https://arxiv.org/html/2605.21781#bib.bib50)) as the target model for RPT and all baselines. As optimizer LLMs, we instantiate RPT with function-calling frontier models from two families and at different scales: GPT-5 and GPT-5-mini (OpenAI, [2025a](https://arxiv.org/html/2605.21781#bib.bib49)), and Gemini-3.1-Pro (GoogleAI, [2026a](https://arxiv.org/html/2605.21781#bib.bib52)) and Gemini-3.1-Flash-Lite (GoogleAI, [2026b](https://arxiv.org/html/2605.21781#bib.bib51)).

##### Baselines\.

We compare RPT against three state-of-the-art automated prompt-optimization baselines: Agentic Context Engineering (ACE; Zhang et al. ([2026](https://arxiv.org/html/2605.21781#bib.bib32))), GEPA (Agrawal et al., [2025](https://arxiv.org/html/2605.21781#bib.bib23)), and MIPRO (Opsahl-Ong et al., [2024b](https://arxiv.org/html/2605.21781#bib.bib22)). Additional baseline and implementation details are provided in Appendix [7.5](https://arxiv.org/html/2605.21781#S7.SS5).

##### Evaluation\.

We report task-specific performance metrics: accuracy for HotPotQA and Formula, and task score for LiveBench-Math (following LiveBench, task score is averaged across four math tasks: [https://github.com/LiveBench/LiveBench/tree/main/livebench/process_results/math](https://github.com/LiveBench/LiveBench/tree/main/livebench/process_results/math)). For calibration, we report Brier score using the model’s verbalized confidence (Xiong et al., [2024](https://arxiv.org/html/2605.21781#bib.bib53)).

Table 2: Confidence-aware optimization results. GEPA-C denotes GEPA with confidence feedback. Each cell reports initial → final performance and Brier; higher performance and lower Brier are better.

## 4 Results and Analyses

We evaluate RPT from three perspectives. First, we compare RPT-optimized prompts against seed prompts and state-of-the-art baselines, while studying the effect of optimizer LLM size (Section [4.1](https://arxiv.org/html/2605.21781#S4.SS1)). Second, we examine whether confidence-aware optimization improves calibration without sacrificing task performance (Section [4.2](https://arxiv.org/html/2605.21781#S4.SS2)). Finally, we analyze optimization traces to study persistent failures, diagnosis–patch alignment, and associations with subsequent performance gains (Section [4.3](https://arxiv.org/html/2605.21781#S4.SS3)).

### 4.1 RPT Is Competitive with SOTA Baselines

Table [1](https://arxiv.org/html/2605.21781#S2.T1) reports the task performance of prompts optimized by RPT and the baseline prompt optimizers described in Section [3](https://arxiv.org/html/2605.21781#S3). For each task and method, we report the performance of the initial prompt and the performance of the optimized prompt selected via development-set performance. (For Formula, we use the initial prompt from ACE; for HotPotQA, we adapt this template to the QA setting; and for LiveBench-Math, we adapt the initial prompt from GEPA.)

##### Observation 1: RPT is strongest on tasks with recurring reasoning failures.

Across optimizer LLMs, RPT achieves the best final performance on LiveBench-Math for every optimizer setting, improving over the initial prompt by up to +12.4 points. On HotPotQA, RPT is also competitive: it achieves the best final performance with GPT-5 and remains close to the strongest baseline under other instantiations. GEPA and MIPRO perform competitively on HotPotQA, but provide smaller gains on LiveBench-Math; their lower initial scores also suggest that implementation-specific choices affect their absolute performance. Formula shows a different pattern: ACE consistently achieves the best final performance, while RPT is competitive mainly when paired with GPT-5. More broadly, RPT appears well-suited to tasks where recurring failures can be diagnosed and translated into targeted prompt revisions. However, it may be less advantageous for domain-specific computation, where localized instance-level updates or predefined prompt structures may be more effective.

##### Observation 2: RPT benefits from stronger optimizer LLMs.

Optimizer choice has a clear impact on RPT’s performance. Compared to GPT-5-mini, using GPT-5 increases RPT’s Aggregate score from 68.5 to 74.3, with gains across all three tasks. Within the Gemini family, Gemini-3.1-Pro similarly improves over Gemini-3.1-Flash-Lite, increasing Aggregate from 67.7 to 70.1. This pattern is expected because RPT places a demanding burden on the optimizer: it must perform credit assignment over diagnostic feedback and prior prompt revisions, identify unresolved failures, and translate recurring failure modes into targeted prompt edits. Compared with the baselines, RPT achieves the best aggregate performance with GPT-5 and is nearly tied with ACE under Gemini-3.1-Pro, while ACE remains stronger with smaller optimizer LLMs. GEPA and MIPRO generally trail in aggregate performance, partly due to lower initial prompt performance on LiveBench-Math.

### 4\.2Confidence Signals Improve Calibration

We next ask whether confidence-aware prompt optimization can improve both task performance and calibration. This matters because verbalized confidence is often used as a proxy for answer reliability in abstention, routing, human review, and risk-sensitive deployment (Wen et al., [2025](https://arxiv.org/html/2605.21781#bib.bib57); Chuang et al., [2025](https://arxiv.org/html/2605.21781#bib.bib56); dela Cruz et al., [2025](https://arxiv.org/html/2605.21781#bib.bib54); Wang et al., [2026](https://arxiv.org/html/2605.21781#bib.bib55)). ACE and MIPRO do not directly expose calibration diagnostics to the optimizer without substantial modification, while GEPA can use them as auxiliary feedback. In contrast, RPT incorporates calibration into both diagnostic feedback and final prompt selection.

Table [2](https://arxiv.org/html/2605.21781#S3.T2) compares RPT with confidence-aware GEPA. GEPA shows that calibration feedback can help: on HotPotQA, it improves both task performance and Brier score across optimizer LLMs. However, gains are more limited on LiveBench-Math and Formula. With GPT-5-mini as optimizer, confidence feedback yields no gain on LiveBench-Math and slightly hurts Formula performance, suggesting that it may distract a less capable optimizer.

RPT more consistently improves both task performance and calibration. Although prompt optimization cannot access internal uncertainty estimates or logits, our results show that calibration can improve when treated as a first-class optimization signal. By incorporating calibration into both the diagnostic loop and prompt-selection objective, RPT better aligns self-reported confidence with empirical correctness while also improving task performance.

![Refer to caption](https://arxiv.org/html/2605.21781v1/x2.png)

Figure 2: Failure-to-patch alignment across datasets. Each heatmap reports $P(\text{patch topic} \mid \text{failure topic})$, showing which prompt revisions tend to follow each diagnosed failure type.

### 4.3 What Does RPT Learn from Diagnostics?

Beyond final task performance, RPT produces structured optimization traces at each iteration. We analyze these traces to understand how RPT improves prompts over time, focusing on the GPT-5 optimizer since it performs best in our experiments.

Across tasks, we collect failure diagnoses from each iteration and derive prompt-update instances by using GPT-4.1 to extract atomic differences between consecutive prompts, $p_t$ and $p_{t+1}$. We then apply ClusterFusion, as described in Section [2](https://arxiv.org/html/2605.21781#S2), to group diagnoses and prompt updates into 10 *failure* topics and 10 *patch* topics, respectively. To relate topics to performance, we compute next-iteration metric changes by comparing metrics under $p_t$ with those after evaluating $p_{t+1}$. Thus, positive $\Delta$ task score and negative $\Delta$ Brier indicate improvement. Because this analysis relies on optimization traces, we interpret results as associations rather than causal effects.

#### 4.3.1 Does RPT Produce Targeted Revisions?

We next examine whether RPT performs targeted credit assignment from diagnosed failures to prompt revisions. For each failure topic $F_i$ and patch topic $P_j$, we compute $P(P_j \mid F_i)$ as the fraction of transitions containing $P_j$ among those containing $F_i$. Failure topics diagnosed at iteration $t$ are assigned to transition $t \rightarrow t+1$, and patch topics are extracted from the corresponding prompt update (topic presence is binary within each transition). This measures whether specific failures systematically lead to specific prompt edits, which would indicate targeted credit assignment rather than generic prompt rewriting.
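The conditional frequency defined above is straightforward to compute from a list of transitions. The sketch below assumes each transition is given as a pair of sets (failure topics diagnosed at iteration t, patch topics in the update to the next prompt); the topic names and the three toy transitions are invented for illustration.

```python
from collections import defaultdict


def patch_given_failure(
    transitions: list[tuple[set[str], set[str]]],
) -> dict[tuple[str, str], float]:
    """Estimate P(patch | failure) as a fraction of transitions.

    Using sets makes topic presence binary within each transition.
    """
    n_failure = defaultdict(int)  # transitions containing failure topic F
    n_joint = defaultdict(int)    # transitions containing both F and patch P
    for failures, patches in transitions:
        for f in failures:
            n_failure[f] += 1
            for p in patches:
                n_joint[(f, p)] += 1
    return {(f, p): count / n_failure[f] for (f, p), count in n_joint.items()}


transitions = [
    ({"multi-hop"}, {"relation handling"}),
    ({"multi-hop", "format"}, {"relation handling", "span minimality"}),
    ({"format"}, {"span minimality"}),
]
alignment = patch_given_failure(transitions)
```

In this toy trace, "relation handling" follows every "multi-hop" failure (probability 1.0) but only half of the "format" failures, which is the kind of asymmetry the heatmaps are meant to surface. Pairs that never co-occur are simply absent from the result and can be read as zero.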

Figure [2](https://arxiv.org/html/2605.21781#S4.F2) shows that the specificity of this failure-to-patch mapping varies across tasks. On HotPotQA, several answer-control patches, such as span minimality, canonical-form preference, and answer granularity matching, appear across many failure types, reflecting the benchmark’s sensitivity to exact answer form. However, multi-hop reasoning failures more often trigger relation- and query-handling patches, suggesting meaningful failure-specific credit assignment beyond generic answer-format control (optimized HotPotQA prompt in Appendix [7.9](https://arxiv.org/html/2605.21781#S7.SS9)). On LiveBench-Math, the alignment concentrates around verification-oriented patches, including stepwise protocols, arithmetic checks, output validation, and notation or invariant handling. This indicates that the optimizer maps diverse mathematical failures to structured reasoning and checking mechanisms.

Formula exhibits a broader pattern: many distinct failure topics lead to similar domain\-level safeguards rather than sharply different patches\. This suggests that RPT identifies relevant domain controls, but performs less fine\-grained credit assignment on this task than on HotPotQA or LiveBench\-Math\. This weaker failure\-specificity may partly explain why RPT yields smaller gains on Formula and falls behind ACE \(Table [1](https://arxiv.org/html/2605.21781#S2.T1)\)\.

#### 4\.3\.2 Do Prompt Patches Predict Gains?

![Refer to caption](https://arxiv.org/html/2605.21781v1/x3.png)

Figure 3: Patch topics and next\-iteration metric changes\. Each cell reports the average $\Delta$ task score or $\Delta$ Brier on $D_{\mathrm{train}}$ after a prompt update containing that patch; higher $\Delta$ task score and lower $\Delta$ Brier indicate improvement\.

We next examine whether prompt updates are followed by improvements in task performance or calibration\. For each patch topic present at iteration $t$, we compute the average change in task score and Brier score after evaluating the revised prompt $p_{t+1}$ on $D_{\mathrm{train}}$\. This analysis identifies which edits tend to precede better task performance or calibration\.
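The per\-topic averaging described above can be sketched as follows\. The record layout \(`patches`, `before`, `after`\), the function name `mean_delta_by_patch`, and the toy numbers are assumptions made for illustration; they are not taken from the paper's traces\.

```python
from collections import defaultdict
from statistics import mean

def mean_delta_by_patch(updates):
    """Average next-iteration metric changes for each patch topic.

    `updates` is a list of dicts with keys:
      "patches": set of patch topics in the update p_t -> p_{t+1}
      "before":  (task_score, brier) measured under p_t
      "after":   (task_score, brier) measured under p_{t+1}
    """
    deltas = defaultdict(list)
    for u in updates:
        d_score = u["after"][0] - u["before"][0]
        d_brier = u["after"][1] - u["before"][1]
        # An update credits every patch topic it contains.
        for topic in u["patches"]:
            deltas[topic].append((d_score, d_brier))
    return {
        topic: (mean(d[0] for d in ds), mean(d[1] for d in ds))
        for topic, ds in deltas.items()
    }

# Hypothetical toy trace with two consecutive updates.
updates = [
    {"patches": {"arithmetic checks"},
     "before": (0.50, 0.25), "after": (0.75, 0.125)},
    {"patches": {"arithmetic checks", "output validation"},
     "before": (0.75, 0.125), "after": (0.75, 0.25)},
]
table = mean_delta_by_patch(updates)
print(table["arithmetic checks"])  # (mean delta score, mean delta Brier)
```

Since several patch topics can appear in the same update, each cell reflects co\-occurrence with a metric change, not an isolated effect of that patch\.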

Figure [3](https://arxiv.org/html/2605.21781#S4.F3) shows that useful patches differ by task but often share a common structure: they impose concrete controls on the model’s reasoning or output\. On HotPotQA, the strongest gains are associated with relation and multi\-hop handling, pre\-answer verification, confidence calibration, and answer granularity matching\. On LiveBench\-Math, gains are associated with step\-by\-step solution protocols, output validation, arithmetic checks, and confidence calibration\. On Formula, the clearest improvements come from unit, scale, and format handling, while precision/rounding, power\-consistency checks, and calibration\-related patches yield smaller gains\. Overall, these results suggest that RPT’s prompt revisions are often useful as well as failure\-specific: patches that introduce verification steps, answer\-form constraints, arithmetic checks, or unit\-handling rules tend to improve task score while reducing Brier\. Formula is more mixed, with some specialized domain safeguards showing weak or negative short\-term associations, likely because they are introduced for harder or more persistent domain\-specific failures\. Appendix [7\.7](https://arxiv.org/html/2605.21781#S7.SS7) further shows that the most actionable diagnoses are concrete failures that can be translated into explicit behavioral constraints, while Appendix [7\.6](https://arxiv.org/html/2605.21781#S7.SS6) shows that the most persistent failures are task\-specific reasoning errors\.

## 5 Related Work

Prompting offers a flexible way to adapt LLMs to downstream tasks without parameter updates\. However, prompt design remains labor\-intensive and sensitive to formatting, phrasing, demonstrations, and instruction order \(Lu et al\., [2022](https://arxiv.org/html/2605.21781#bib.bib13); Sclar et al\., [2023](https://arxiv.org/html/2605.21781#bib.bib14)\), motivating automated prompt optimization methods that reduce manual effort \(Shin et al\., [2020](https://arxiv.org/html/2605.21781#bib.bib15); Yuksekgonul et al\., [2024](https://arxiv.org/html/2605.21781#bib.bib24); Opsahl\-Ong et al\., [2024a](https://arxiv.org/html/2605.21781#bib.bib28); Agrawal et al\., [2025](https://arxiv.org/html/2605.21781#bib.bib23)\)\.

##### Automated prompt optimization\.

A growing body of work uses optimization procedures, and increasingly LLMs themselves, to propose, revise, or select prompts\. AutoPrompt searches for discrete trigger tokens \(Shin et al\., [2020](https://arxiv.org/html/2605.21781#bib.bib15)\), while APE and OPRO use LLMs to generate natural\-language prompt candidates from task examples or prior candidate\-score pairs \(Zhou et al\., [2023](https://arxiv.org/html/2605.21781#bib.bib29); Yang et al\., [2024](https://arxiv.org/html/2605.21781#bib.bib17)\)\. Other methods edit prompts using textual gradients or evolutionary search \(Pryzant et al\., [2023](https://arxiv.org/html/2605.21781#bib.bib18); Guo et al\., [2024](https://arxiv.org/html/2605.21781#bib.bib30)\), and recent systems extend prompt optimization to modular LLM programs by searching over instructions and demonstrations \(Khattab et al\., [2024](https://arxiv.org/html/2605.21781#bib.bib21); Opsahl\-Ong et al\., [2024b](https://arxiv.org/html/2605.21781#bib.bib22)\)\. RPT also uses LLMs for prompt optimization, but differs by leveraging function calling to simulate the iterative workflow of human prompt engineers: evaluating the current prompt, diagnosing systematic failures, and using structured diagnostic feedback to guide revisions, making prompt revision explicitly diagnosis\-driven\.

##### Reflective optimization methods\.

Recent methods use rich textual feedback to guide prompt or program optimization\. TextGrad backpropagates natural\-language feedback through computation graphs \(Yuksekgonul et al\., [2024](https://arxiv.org/html/2605.21781#bib.bib24)\), while GEPA uses execution and evaluation traces as reflective feedback for prompt proposals \(Agrawal et al\., [2025](https://arxiv.org/html/2605.21781#bib.bib23)\)\. MIPRO addresses credit assignment in multi\-stage LLM programs with program\- and data\-aware proposal strategies and Bayesian search \(Opsahl\-Ong et al\., [2024b](https://arxiv.org/html/2605.21781#bib.bib22)\)\. RPT builds on reflective feedback, but centers each iteration on a diagnostic function that evaluates the current prompt over the full optimization split and returns aggregate metrics and recurring failures\. The optimizer uses this report together with prior reports to guide the next update\. While GEPA can optionally use confidence and calibration as auxiliary signals, RPT incorporates them directly into both the diagnostic report and the final prompt\-selection criterion\.

##### Memory and adaptive context\.

Recent methods improve LLM behavior by accumulating feedback or reusable strategies over time\. Reflexion stores verbal reflections from past trials \(Shinn et al\., [2023](https://arxiv.org/html/2605.21781#bib.bib25)\), while Agent\-Pro evolves agent policies through reflection on interactive experience \(Zhang et al\., [2024](https://arxiv.org/html/2605.21781#bib.bib26)\)\. Dynamic Cheatsheet and ACE build external playbooks of lessons or strategies to improve later inference or context construction \(Krause et al\., [2019](https://arxiv.org/html/2605.21781#bib.bib31); Zhang et al\., [2026](https://arxiv.org/html/2605.21781#bib.bib32)\)\. RPT uses memory at the prompt\-optimization level: conditioning on prior reports and prompt revisions helps the optimizer reason over past refinements, avoid repetitive edits, and improve credit assignment\.

## Conclusion

We introduced Reflective Prompt Tuning \(RPT\), a diagnosis\-driven framework that uses LLM function calling to optimize prompts through structured feedback and memory over prior revisions\. Across three reasoning tasks, RPT improves over seed prompts and remains competitive with state of the art, especially on multi\-hop and mathematical reasoning\. We also show that confidence\-aware optimization improves calibration alongside task performance, and that RPT produces prompt revisions aligned with diagnosed failures\. These results highlight function\-calling LLMs as a promising approach to scalable and interpretable prompt tuning\.

## Limitations

Our study has several limitations\. First, we evaluate RPT on three reasoning tasks: multi\-hop question answering, mathematical reasoning, and domain\-specific numerical reasoning\. While these tasks cover different forms of reasoning, they do not capture the full range of prompt\-optimization settings, such as open\-ended generation, coding, dialogue, tool\-using agents, or long\-horizon interactive tasks\. In addition, our experiments use GPT\-4\.1 as the target model and frontier proprietary LLMs as optimizers\. The effectiveness of RPT may differ for smaller open\-source models, weaker optimizer LMs, or settings where function calling is unavailable\.

Second, RPT is more computationally expensive than prompt optimizers that use individual examples or small minibatches\. Each iteration evaluates the target model on the full optimization set, critiques failed examples, clusters diagnoses, and conditions the optimizer on prior reports\. Although the diagnostic reports are compressed and the memory remains manageable under our iteration budgets, scaling RPT to much larger datasets or longer optimization trajectories may require more aggressive sampling, report compression, or retrieval over memory\.

Finally, RPT improves prompts but cannot guarantee that prompting alone can resolve all failures\. Some persistent errors, especially deeper mathematical reasoning failures or domain\-specific convention errors, may require complementary interventions such as better tools, external validators, retrieval, fine\-tuning, or changes to the target model itself\. Similarly, our confidence\-aware setting relies on verbalized confidence, which is only a black\-box proxy for uncertainty\. Although RPT improves calibration in our experiments, self\-reported confidence may remain sensitive to prompting and should be validated carefully before being used in high\-stakes downstream decisions\.

## References

- L\. A\. Agrawal, S\. Tan, D\. Soylu, N\. Ziems, R\. Khare, K\. Opsahl\-Ong, A\. Singhvi, H\. Shandilya, M\. J\. Ryan, M\. Jiang, C\. Potts, K\. Sen, A\. G\. Dimakis, I\. Stoica, D\. Klein, M\. Zaharia, and O\. Khattab \(2025\) GEPA: reflective prompt evolution can outperform reinforcement learning\. arXiv preprint arXiv:2507\.19457\.
- Learning to route llms with confidence tokens\. External Links: 2410\.13284, [Link](https://arxiv.org/abs/2410.13284)
- J\. A\. dela Cruz, I\. Hendrickx, and M\. Larson \(2025\) Evaluating large language models for confidence\-based check set selection\. In Findings of the Association for Computational Linguistics: ACL 2025, W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\), Vienna, Austria, pp\. 16249–16265\. External Links: [Link](https://aclanthology.org/2025.findings-acl.836/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.836), ISBN 979\-8\-89176\-256\-5
- GoogleAI \(2026a\) Best for complex tasks and bringing creative concepts to life\. Note: https://deepmind\.google/models/gemini/pro/
- GoogleAI \(2026b\) Gemini 3\.1 flash\-lite: built for intelligence at scale\. Note: https://blog\.google/innovation\-and\-ai/models\-and\-research/gemini\-models/gemini\-3\-1\-flash\-lite/
- Z\. Gou, Z\. Shao, Y\. Gong, Y\. Shen, Y\. Yang, M\. Huang, N\. Duan, and W\. Chen \(2024\) ToRA: a tool\-integrated reasoning agent for mathematical problem solving\. In The Twelfth International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=Ep0TtjVoap)
- Q\. Guo, R\. Wang, J\. Guo, B\. Li, K\. Song, X\. Tan, G\. Liu, J\. Bian, and Y\. Yang \(2024\) Connecting large language models with evolutionary algorithms yields powerful prompt optimizers\. In The Twelfth International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=ZG3RaNIsO8)
- O\. Khattab, A\. Singhvi, P\. Maheshwari, Z\. Zhang, K\. Santhanam, S\. Vardhamanan, S\. Haq, A\. Sharma, T\. T\. Joshi, H\. Moazam, H\. Miller, M\. Zaharia, and C\. Potts \(2024\) DSPy: compiling declarative language model calls into self\-improving pipelines\. In International Conference on Learning Representations\.
- N\. Knoth, A\. Tolzin, A\. Janson, and J\. Leimeister \(2024\) AI literacy and its implications for prompt engineering strategies\. Comput\. Educ\. Artif\. Intell\. 6, pp\. 100225\. External Links: [Link](https://api.semanticscholar.org/CorpusId:269273689)
- T\. Kojima, S\. S\. Gu, M\. Reid, Y\. Matsuo, and Y\. Iwasawa \(2022\) Large language models are zero\-shot reasoners\. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA\. External Links: ISBN 9781713871088
- B\. Krause, E\. Kahembwe, I\. Murray, and S\. Renals \(2019\) Dynamic evaluation of transformer language models\. External Links: 1904\.08378, [Link](https://arxiv.org/abs/1904.08378)
- R\. Lou, K\. Zhang, and W\. Yin \(2024\) Large language model instruction following: a survey of progresses and challenges\. Computational Linguistics 50\(3\), pp\. 1053–1095\. External Links: ISSN 0891\-2017, [Document](https://dx.doi.org/10.1162/coli%5Fa%5F00523), [Link](https://doi.org/10.1162/coli_a_00523), https://direct\.mit\.edu/coli/article\-pdf/50/3/1053/2470911/coli\_a\_00523\.pdf
- Y\. Lu, M\. Bartolo, A\. Moore, S\. Riedel, and P\. Stenetorp \(2022\) Fantastically ordered prompts and where to find them: overcoming few\-shot prompt order sensitivity\. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, pp\. 8086–8098\.
- OpenAI \(2025a\) GPT\-5 system card\. Note: https://cdn\.openai\.com/gpt\-5\-system\-card\.pdf Version: 2025\-08\-13
- OpenAI \(2025b\) Introducing gpt\-4\.1 in the api\. Note: https://openai\.com/index/gpt\-4\-1/ Version: 2025\-04\-14
- K\. Opsahl\-Ong, M\. J\. Ryan, J\. Purtell, D\. Broman, C\. Potts, M\. Zaharia, and O\. Khattab \(2024a\) Optimizing instructions and demonstrations for multi\-stage language model programs\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\), Miami, Florida, USA, pp\. 9340–9366\. External Links: [Link](https://aclanthology.org/2024.emnlp-main.525/), [Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.525)
- K\. Opsahl\-Ong, M\. J\. Ryan, J\. Purtell, D\. Broman, C\. Potts, M\. Zaharia, and O\. Khattab \(2024b\) Optimizing instructions and demonstrations for multi\-stage language model programs\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, Florida, USA, pp\. 9340–9366\. External Links: [Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.525)
- R\. Pryzant, D\. Iter, J\. Li, Y\. Lee, C\. Zhu, and M\. Zeng \(2023\) Automatic prompt optimization with “gradient descent” and beam search\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, pp\. 7957–7968\. External Links: [Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.494)
- K\. Ramnath, K\. Zhou, S\. Guan, S\. S\. Mishra, X\. Qi, Z\. Shen, S\. Wang, S\. Woo, S\. Jeoung, Y\. Wang, H\. Wang, H\. Ding, Y\. Lu, Z\. Xu, Y\. Zhou, B\. Srinivasan, Q\. Yan, Y\. Chen, H\. Ding, P\. Xu, and L\. L\. Cheong \(2025\) A systematic survey of automatic prompt optimization techniques\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\), Suzhou, China, pp\. 33078–33110\. External Links: [Link](https://aclanthology.org/2025.emnlp-main.1681/), [Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1681), ISBN 979\-8\-89176\-332\-6
- P\. Sahoo, A\. K\. Singh, S\. Saha, V\. Jain, S\. Mondal, and A\. Chadha \(2025\) A systematic survey of prompt engineering in large language models: techniques and applications\. External Links: 2402\.07927, [Link](https://arxiv.org/abs/2402.07927)
- T\. Schick, J\. Dwivedi\-Yu, R\. Dessí, R\. Raileanu, M\. Lomeli, E\. Hambro, L\. Zettlemoyer, N\. Cancedda, and T\. Scialom \(2023\) Toolformer: language models can teach themselves to use tools\. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA\.
- S\. Schulhoff, M\. Ilie, N\. Balepur, K\. Kahadze, A\. Liu, C\. Si, Y\. Li, A\. Gupta, H\. Han, S\. Schulhoff, P\. S\. Dulepet, S\. Vidyadhara, D\. Ki, S\. Agrawal, C\. Pham, G\. Kroiz, F\. Li, H\. Tao, A\. Srivastava, H\. D\. Costa, S\. Gupta, M\. L\. Rogers, I\. Goncearenco, G\. Sarli, I\. Galynker, D\. Peskoff, M\. Carpuat, J\. White, S\. Anadkat, A\. Hoyle, and P\. Resnik \(2025\) The prompt report: a systematic survey of prompt engineering techniques\. External Links: 2406\.06608, [Link](https://arxiv.org/abs/2406.06608)
- M\. Sclar, Y\. Choi, Y\. Tsvetkov, and A\. Suhr \(2023\) Quantifying language models’ sensitivity to spurious features in prompt design or: how i learned to start worrying about prompt formatting\. arXiv preprint arXiv:2310\.11324\.
- M\. Sclar, Y\. Choi, Y\. Tsvetkov, and A\. Suhr \(2024\) Quantifying language models’ sensitivity to spurious features in prompt design or: how i learned to start worrying about prompt formatting\. Twelfth International Conference on Learning Representations, abs/2310\.11324\. External Links: [Link](https://arxiv.org/pdf/2310.11324.pdf)
- T\. Shin, Y\. Razeghi, R\. L\. Logan IV, E\. Wallace, and S\. Singh \(2020\) AutoPrompt: eliciting knowledge from language models with automatically generated prompts\. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pp\. 4222–4235\.
- N\. Shinn, F\. Cassano, A\. Gopinath, K\. R\. Narasimhan, and S\. Yao \(2023\) Reflexion: language agents with verbal reinforcement learning\. In Advances in Neural Information Processing Systems\.
- D\. Wang, J\. Patel, D\. Zha, S\. Y\. Yang, and X\. Liu \(2025\) FinLoRA: benchmarking lora methods for fine\-tuning llms on financial datasets\. External Links: 2505\.19819, [Link](https://arxiv.org/abs/2505.19819)
- J\. Wang, Y\. Zhou, S\. Devic, and D\. Fu \(2026\) Are llm decisions faithful to verbal confidence?\. External Links: 2601\.07767, [Link](https://arxiv.org/abs/2601.07767)
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. H\. Chi, Q\. V\. Le, and D\. Zhou \(2022\) Chain\-of\-thought prompting elicits reasoning in large language models\. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA\. External Links: ISBN 9781713871088
- B\. Wen, J\. Yao, S\. Feng, C\. Xu, Y\. Tsvetkov, B\. Howe, and L\. L\. Wang \(2025\) Know your limits: a survey of abstention in large language models\. External Links: 2407\.18418, [Link](https://arxiv.org/abs/2407.18418)
- C\. White, S\. Dooley, M\. Roberts, A\. Pal, B\. Feuer, S\. Jain, R\. Shwartz\-Ziv, N\. Jain, K\. Saifullah, S\. Dey, Shubh\-Agrawal, S\. S\. Sandha, S\. V\. Naidu, C\. Hegde, Y\. LeCun, T\. Goldstein, W\. Neiswanger, and M\. Goldblum \(2025\) LiveBench: a challenging, contamination\-free LLM benchmark\. In The Thirteenth International Conference on Learning Representations\.
- M\. Xiong, Z\. Hu, X\. Lu, Y\. Li, J\. Fu, J\. He, and B\. Hooi \(2024\) Can LLMs express their uncertainty? an empirical evaluation of confidence elicitation in LLMs\. In The Twelfth International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=gjeQKFxFpZ)
- Y\. Xu, Y\. Yuan, V\. Viswanathan, and G\. Neubig \(2025\) ClusterFusion: hybrid clustering with embedding guidance and llm adaptation\. arXiv preprint arXiv:2512\.04350\.
- C\. Yang, X\. Wang, Y\. Lu, H\. Liu, Q\. V\. Le, D\. Zhou, and X\. Chen \(2024\) Large language models as optimizers\. In International Conference on Learning Representations\.
- Z\. Yang, P\. Qi, S\. Zhang, Y\. Bengio, W\. Cohen, R\. Salakhutdinov, and C\. D\. Manning \(2018\) HotpotQA: a dataset for diverse, explainable multi\-hop question answering\. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, E\. Riloff, D\. Chiang, J\. Hockenmaier, and J\. Tsujii \(Eds\.\), Brussels, Belgium, pp\. 2369–2380\. External Links: [Link](https://aclanthology.org/D18-1259/), [Document](https://dx.doi.org/10.18653/v1/D18-1259)
- M\. Yuksekgonul, F\. Bianchi, J\. Boen, S\. Liu, Z\. Huang, C\. Guestrin, and J\. Zou \(2024\) TextGrad: automatic "differentiation" via text\. External Links: 2406\.07496, [Link](https://arxiv.org/abs/2406.07496)
- J\.D\. Zamfirescu\-Pereira, R\. Y\. Wong, B\. Hartmann, and Q\. Yang \(2023\) Why johnny can’t prompt: how non\-ai experts try \(and fail\) to design llm prompts\. Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems\. External Links: [Link](http://dl.acm.org/citation.cfm?id=3581388)
- Q\. Zhang, C\. Hu, S\. Upasani, B\. Ma, F\. Hong, V\. Kamanuru, J\. Rainton, C\. Wu, M\. Ji, H\. Li, U\. Thakker, J\. Zou, and K\. Olukotun \(2026\) Agentic context engineering: evolving contexts for self\-improving language models\. External Links: [Link](https://arxiv.org/abs/2510.04618)
- W\. Zhang, K\. Tang, H\. Wu, M\. Wang, Y\. Shen, G\. Hou, Z\. Tan, P\. Li, Y\. Zhuang, and W\. Lu \(2024\) Agent\-pro: learning to evolve via policy\-level reflection and optimization\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics\.
- Y\. Zhou, A\. I\. Muresanu, Z\. Han, K\. Paster, S\. Pitis, H\. Chan, and J\. Ba \(2023\) Large language models are human\-level prompt engineers\. External Links: 2211\.01910, [Link](https://arxiv.org/abs/2211.01910)
- J\. Zhuo, S\. Zhang, X\. Fang, H\. Duan, D\. Lin, and K\. Chen \(2024\) ProSA: assessing and understanding the prompt sensitivity of llms\. In Conference on Empirical Methods in Natural Language Processing\. External Links: [Link](https://api.semanticscholar.org/CorpusId:273375563)

## 6 Appendix

## 7 RPT Prompts

This section lists the seed prompts, the optimizer prompt \(shared across all datasets\), and the dataset\-specific critic prompts used in RPT\.

### 7\.1 Seed Prompts

Formula Seed Prompt

System message: You are an analysis expert tasked with answering questions using your knowledge\.

User instruction:

- Show your reasoning step\-by\-step
- Be concise but thorough in your analysis
- Double\-check your calculations and logic before providing the final answer

Your output should be a JSON with fields:

- reasoning: your chain of thought / reasoning / thinking process, detailed analysis and calculations
- answer: your concise final answer\.
- confidence: a number in \[0,1\] representing your confidence in the final answer\.

HotpotQA Seed Prompt

System message: You are tasked with answering questions using only the provided context\.

User instruction:

- Reason step by step using only the provided context\.
- Be concise but thorough in your justification\.
- Before answering, verify that your answer is supported by the context\.

Your output should be a JSON with fields:

- justification: a context\-grounded explanation of how you reached the answer\.
- answer: your concise final answer\.
- confidence: a number in \[0,1\] representing your confidence in the final answer\.

LiveBench\-Math Seed Prompt

System message: Solve the math problem step by step and give the final answer in exactly the format requested by the question\.

User instruction:

Your output should be a JSON with fields:

- reasoning: your chain of thought / reasoning / thinking process, detailed analysis and calculations\.
- answer: final answer in exactly the format requested by the question\.
- confidence: a number in \[0,1\] representing your confidence in the final answer\.

Output only valid JSON that matches the required schema\.

### 7\.2 Critic Prompts

Formula Critic Prompt

You are a strict evaluation critic for formula\-construction failures\. You are given ONE QA trace with:

- question
- gold answer
- predicted answer
- model confidence
- model reasoning

Your job is to diagnose why the model produced the wrong answer\.

Instructions:

1. Produce 1\-3 failure\_modes with:
   - label: 2\-6 words, consistent across similar errors
   - definition: Comprehensive explanation of the failure mode
   - why: brief, self\-contained explanation for THIS example, e\.g\. “The question asked for the city’s location relative to Rome, but the model returned the city name instead\.”
   - basis: cite what in trace/reasoning shows this
2. Focus on actionable failure modes\.
3. If you cannot identify a clear failure mode, return an empty list\.
4. Output ONLY valid JSON matching the schema\.

HotpotQA Critic Prompt

You are a strict evaluation critic for QA failures\. You are given ONE QA trace:

- question
- context \(titles \+ snippets\)
- gold answer
- predicted answer
- model confidence
- model reasoning

Your goal is to diagnose WHY the target model produced the wrong answer\.

Instructions:

1. Produce 1\-3 failure\_modes with:
   - label: 2\-6 words, consistent across similar errors
   - definition: Comprehensive explanation of the failure mode
   - why: brief explanation for THIS example
   - basis: cite what in the trace/reasoning shows this
2. Make labels concrete and clusterable:
   - Prefer labels like “wrong bridge entity” over long sentences\.
   - Do not include entity names, dates, or example\-specific details in labels\.
3. If you cannot identify a clear failure mode, return an empty list\.
4. Output ONLY valid JSON matching the schema \(no extra text\)\.

LiveBench\-Math Critic Prompt

You are a strict evaluation critic for math failures\. You are given one failed model attempt with:

- task metadata
- question
- gold answer
- predicted answer
- model confidence
- model reasoning

Your goal is to diagnose why the model failed\.

Instructions:

1. Produce 1\-3 failure\_modes with:
   - label: 2\-6 words, consistent across similar errors
   - definition: Comprehensive explanation of the failure mode
   - why: brief, self\-contained explanation for THIS example
   - basis: cite what in the trace/reasoning shows this
2. Make labels concrete and clusterable\.
3. If you cannot identify a clear failure mode, return an empty list\.
4. Output only valid JSON matching the schema\.

### 7\.3 Shared Optimizer Prompt

Optimizer Prompt

You are the Reflective Prompt Tuning \(RPT\) controller\. Your goal is to iteratively improve a PromptProgram for the target task\.

At each iteration you must:

1. Call evaluate\_prompt exactly once on the CURRENT PromptProgram\.
2. Read the returned evaluation report with insights\.
3. Output either a PATCH or STOP\.

Optimization target:

- Primary: improve task performance on the training split\.
- Secondary: improve calibration \(lower Brier / reduce overconfidence\) without hurting task performance\.

Decision guidance:

- When current\_summary is provided, use it as the primary decision signal, especially current\_summary\.metrics and any deltas vs previous/best\.
- Use history only to detect trajectory, regressions, and previously ineffective edits\.

Patch constraints:

- Edits should be targeted to the failure modes, and designed to address their underlying issues with concrete guidance\.
- Avoid vague or generic guidance that only restates the failure; specify concrete checks, comparisons, extractions, or verifications\.
- Prefer revising, merging, deleting, or reorganizing existing instructions over adding new broad rules\.
- Keep the output contract stable \(JSON schema and required fields\)\.
- Avoid redundant or conflicting rules; consolidate instructions when possible\.

Stop condition:

- Output STOP if training\-set performance has plateaued or further edits are unlikely to help\.

Hard rule:

- Do NOT propose a PATCH or STOP decision before calling evaluate\_prompt and receiving its result\.

### 7\.4 Optimization Datasets

##### HotPotQA\.

HotPotQA \(Yang et al\., [2018](https://arxiv.org/html/2605.21781#bib.bib46)\) is a multi\-hop question answering benchmark in which the model is given a question and supporting passages, and must combine evidence across passages to produce the answer\. We use it as a task for multi\-hop reasoning over textual evidence, requiring models to identify relevant information, connect evidence across hops, and extract a concise answer\.

##### LiveBench\-Math\.

LiveBench\-Math \(White et al\., [2025](https://arxiv.org/html/2605.21781#bib.bib47)\) is the math category of LiveBench, a benchmark designed to reduce contamination through regularly updated questions and automatic scoring against objective ground\-truth answers\. We use LiveBench\-Math to evaluate mathematical reasoning, including problem decomposition, intermediate computation, and final\-answer generation\. Specifically, we retrieve the math questions from the 2024\-08\-31 LiveBench release \(most recent to date\)\. Following GEPA \(Agrawal et al\., [2025](https://arxiv.org/html/2605.21781#bib.bib23)\), we evenly split the resulting set of 368 questions \(shuffled with Python random seed 0\) into train, development, and test sets\.
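A seeded three\-way split of this kind might look like the sketch below\. It is an assumption\-laden illustration: the helper name `three_way_split`, the use of `random.Random(seed)`, and the handling of the remainder \(368 is not divisible by 3\) are choices made here, so the resulting partition need not match the one used in the experiments, whose sizes are reported in Table 3\.

```python
import random

def three_way_split(items, seed=0):
    """Shuffle with a fixed seed and cut into three near-equal parts."""
    items = list(items)
    random.Random(seed).shuffle(items)  # deterministic for a given seed
    n = len(items)
    # Cut points; how the remainder is distributed is an assumption.
    first_cut, second_cut = n // 3, 2 * n // 3
    return items[:first_cut], items[first_cut:second_cut], items[second_cut:]

# Question indices stand in for the 368 retrieved questions.
train_set, dev_set, test_set = three_way_split(range(368))
print(len(train_set), len(dev_set), len(test_set))
```

Fixing the seed makes the partition reproducible across runs, which matters when the development split is reused for prompt selection\.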

##### Formula\.

Formula \(Wang et al\., [2025](https://arxiv.org/html/2605.21781#bib.bib48)\) is a financial reasoning benchmark built around the eXtensible Business Reporting Language \(XBRL\)\. Following ACE \(Zhang et al\., [2026](https://arxiv.org/html/2605.21781#bib.bib32)\), we use Formula as a domain\-specific numerical reasoning task\. It requires models to apply financial concepts and perform computations over structured financial data\. Dataset statistics are reported in Table [3](https://arxiv.org/html/2605.21781#S7.T3)\.

### 7\.5 Additional Experimental Details

Table 3: Datasets used for prompt optimization, prompt selection, and final evaluation\.

![Refer to caption](https://arxiv.org/html/2605.21781v1/x4.png)

Figure 4: Average persistence of failure topics across optimization iterations for HotPotQA, LiveBench\-Math, and Formula\. Persistence is measured as average run length: the number of consecutive iterations for which a failure topic remains active\.

##### RPT configuration\.

Each RPT optimizer is instructed to call a single diagnostic function at the beginning of each iteration\. The diagnostic function evaluates the current prompt on the optimization split, collects the target model’s structured outputs, critiques incorrect responses, clusters failure diagnoses with ClusterFusion \(Xu et al\., [2025](https://arxiv.org/html/2605.21781#bib.bib8)\), and summarizes recurring failure modes together with aggregate metrics\. The optimizer then uses this diagnostic report, along with the memory of prior reports, to revise the prompt or stop if performance has plateaued or the iteration budget has been reached\.

##### Clustering model\.

For failure\-mode clustering, we use ClusterFusion with GPT\-4\.1 \(OpenAI, [2025b](https://arxiv.org/html/2605.21781#bib.bib50)\) as the clustering model across all RPT instantiations\. Based on empirical tuning, we set the number of clusters to $K=10$ for HotPotQA and LiveBench\-Math, and $K=20$ for Formula, which has a larger optimization set\. We keep $K$ fixed across optimizer LLMs for each dataset\. To keep the diagnostic report focused on prominent recurring patterns, we include only clusters whose size exceeds 10% of the total diagnosis pool\.
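The 10% size filter can be sketched in a few lines of Python\. This is only an illustration under stated assumptions: the clustering step itself \(ClusterFusion\) is not shown, each diagnosis is assumed to carry a single cluster label, and `prominent_clusters` and `min_share` are hypothetical names\.

```python
from collections import Counter


def prominent_clusters(labels, min_share=0.10):
    """Keep clusters whose size strictly exceeds min_share of all diagnoses."""
    counts = Counter(labels)
    total = len(labels)
    return {cluster: n for cluster, n in counts.items() if n > min_share * total}


# Toy diagnosis pool: 10 diagnoses assigned to three clusters.
labels = ["span"] * 6 + ["alias"] * 3 + ["format"] * 1
print(prominent_clusters(labels))  # {'span': 6, 'alias': 3}
```

The comparison is strict because the text says the cluster size "exceeds" the threshold; a cluster holding exactly 10% of the pool is dropped in this sketch\.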

##### Baseline configuration\.

ACE, GEPA, and MIPRO are run with the same target model and task splits as RPT\. For each baseline, we follow the settings in the original paper or released implementation when available\.

For MIPRO, we use the official DSPy implementation of MIPROv2 \(https://dspy\.ai/api/optimizers/MIPROv2/\), which jointly optimizes instructions and few\-shot demonstrations, with auto="heavy" to maximize optimization performance\. For GEPA, we use the released implementation and adapt the AIME configuration, scaling the optimization budget according to the relative train/development set sizes in our tasks\. For ACE, we use the released repository for Formula and adapt its instructions to the other datasets\. We keep the default setting of one epoch for all tasks \(repository: https://github\.com/ace\-agent/ace\)\.

For GEPA, we additionally evaluate a confidence\-aware variant in which confidence and calibration diagnostics are provided as auxiliary side information to the reflection prompt, while prompt selection remains driven primarily by task performance\.

### 7\.6 Failure\-Mode Persistence

We measure the persistence of each failure topic by its average run length, defined as the number of consecutive iterations in which the topic remains active\. Longer runs indicate failures that persist despite prompt revisions, while shorter runs suggest failures that are more transient or easier to address\.
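The run\-length measure can be made concrete with a small Python sketch\. It assumes each failure topic is represented by one boolean per iteration \(active or not\); the function name `average_run_length` and that representation are illustrative choices, not taken from the paper\.

```python
def average_run_length(active):
    """Mean length of maximal runs of consecutive active iterations."""
    runs, current = [], 0
    for flag in active:
        if flag:
            current += 1          # extend the current run
        elif current:
            runs.append(current)  # a run just ended
            current = 0
    if current:
        runs.append(current)      # close a run that reaches the last iteration
    return sum(runs) / len(runs) if runs else 0.0


# Topic active in iterations 0-1, 3, and 6-8: runs of length 2, 1, and 3.
timeline = [True, True, False, True, False, False, True, True, True]
print(average_run_length(timeline))  # 2.0
```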

Figure [4](https://arxiv.org/html/2605.21781#S7.F4) shows that persistent failures are task\-specific\. On HotPotQA, the longest\-lived issues involve span extraction, surface\-form errors, granularity or answer\-type mismatches, and multi\-hop reasoning\. On LiveBench\-Math, persistent failures center on arithmetic and algebraic computation, semantic misalignment, and misuse of mathematical definitions or conventions\. On Formula, they involve arithmetic calculation, metric definition and formula selection, timing conventions, and domain constraints\. Overall, the most persistent failures are not generic formatting errors, but deeper task\-specific reasoning failures\. This suggests that some failures require repeated refinement, and others may reflect limitations that are difficult to overcome through prompting alone\.

### 7\.7 Actionability of Diagnosed Failure Modes

![Refer to caption](https://arxiv.org/html/2605.21781v1/x5.png)

Figure 5: Average next\-iteration metric changes associated with each diagnosed failure topic for HotPotQA, LiveBench\-Math, and Formula\. For each failure topic present at iteration $t$, we report the mean change in task score and Brier score from $t$ to $t+1$\. Higher $\Delta$ task score and lower $\Delta$ Brier indicate improvement\.

We analyze which diagnosed failure modes are most actionable for the optimizer\. For each failure topic active at iteration $t$, we compute the average change in task score and Brier score after evaluating the revised prompt at iteration $t+1$\. This analysis is associative rather than causal, but indicates which diagnosed failures tend to precede useful prompt updates\.
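One way to compute these per\-topic averages is sketched below in Python\. The record layout \(a list of per\-iteration dictionaries with `topics`, `task_score`, and `brier` keys\) and the function name `topic_deltas` are assumptions made for illustration; the paper does not specify its data structures\.

```python
from collections import defaultdict


def topic_deltas(iterations):
    """Mean next-iteration change in (task score, Brier) per active topic."""
    # topic -> [sum of task deltas, sum of Brier deltas, count]
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for cur, nxt in zip(iterations, iterations[1:]):
        d_task = nxt["task_score"] - cur["task_score"]
        d_brier = nxt["brier"] - cur["brier"]
        # Credit the change from t to t+1 to every topic active at t.
        for topic in cur["topics"]:
            acc = sums[topic]
            acc[0] += d_task
            acc[1] += d_brier
            acc[2] += 1
    return {t: (a[0] / a[2], a[1] / a[2]) for t, a in sums.items()}


# Toy trajectory with three iterations (values are made up).
trajectory = [
    {"topics": {"span"}, "task_score": 0.5, "brier": 0.25},
    {"topics": {"span", "hop"}, "task_score": 0.75, "brier": 0.125},
    {"topics": set(), "task_score": 0.75, "brier": 0.25},
]
print(topic_deltas(trajectory))
```

Because several topics are usually active at the same iteration, each shares the same observed change, which is one reason the analysis is associative rather than causal\.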

Figure [5](https://arxiv.org/html/2605.21781#S7.F5) shows that actionable diagnoses vary by task\. On HotPotQA, failures involving multi\-hop reasoning, question\-cue interpretation, span extraction, and granularity mismatches are followed by some of the largest task\-score gains and Brier reductions, suggesting that concrete answer\-selection errors can often be translated into effective prompt edits\. On LiveBench\-Math, diagnoses such as combinatorial errors, notation confusion, logical\-flow failures, and arithmetic or theorem\-application errors are generally followed by task\-score gains, though persistent mathematical failures often require multiple revisions\.

On Formula, failure topics are less discriminative: most are followed by small task\-score gains and Brier reductions, likely because domain\-specific failures often co\-occur\. Overall, RPT is most effective when diagnoses can be converted into explicit behavioral constraints, such as answer\-span control, verification steps, or unit and format handling\. More complex mathematical or domain\-convention failures remain useful to diagnose, but may require complementary interventions beyond prompt revision\.

### 7\.8 Prompt Length and Development Performance

We analyze how prompt length changes during RPT optimization and how these changes relate to development\-set performance\. Figure [6](https://arxiv.org/html/2605.21781#S7.F6) plots the number of prompt tokens and the corresponding development task score across optimization iterations for HotPotQA, LiveBench\-Math, and Formula\.

Across all tasks, prompt length tends to increase as the optimizer incorporates additional constraints, checks, and task\-specific instructions\. However, development performance does not increase monotonically with prompt length\. On HotPotQA, the largest gain occurs early, after which performance remains relatively stable while prompt length continues to grow\. On LiveBench\-Math, dev performance improves overall but fluctuates substantially, with later, longer prompts not always outperforming earlier shorter ones\. On Formula, performance jumps after early revisions and then largely plateaus, despite continued prompt growth\.

These trends suggest that longer prompts are not inherently better\. Instead, useful prompt growth comes from adding targeted constraints, while later edits may introduce redundancy or task\-specific overfitting\. This further motivates selecting the final prompt by development\-set performance, rather than simply using the last prompt produced by the optimization loop\.
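Development\-set selection amounts to taking the best\-scoring iteration instead of the last one\. The Python sketch below illustrates this; breaking ties by lower development Brier score is an assumption made here to reflect the confidence\-aware setting, and the exact selection rule, the tuple layout, and the name `select_prompt` are not taken from the paper\.

```python
def select_prompt(candidates):
    """Pick the candidate with the highest dev score; ties go to lower Brier.

    candidates: list of (prompt, dev_score, dev_brier), one per iteration.
    """
    return max(candidates, key=lambda c: (c[1], -c[2]))


# Made-up scores: the last prompt is the longest but not the best on dev.
candidates = [
    ("prompt_v0", 0.60, 0.30),
    ("prompt_v1", 0.72, 0.25),
    ("prompt_v2", 0.72, 0.20),
    ("prompt_v3", 0.70, 0.10),
]
print(select_prompt(candidates)[0])  # prompt_v2
```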

![Refer to caption](https://arxiv.org/html/2605.21781v1/x6.png)

Figure 6: Prompt length and development\-set task score across RPT optimization iterations for HotPotQA, LiveBench\-Math, and Formula\. Prompt length generally increases over time, while development performance improves early and then plateaus or fluctuates, motivating development\-set prompt selection rather than selecting the final iteration by default\.

### 7\.9 Example Prompt Revision

Below, we show an example prompt revision produced by RPT for HotPotQA with GPT\-5 as the optimizer\. The initial prompt contains only high\-level instructions to reason from context, verify support, and produce a JSON output\. The selected prompt is more targeted: it adds explicit guidance for question parsing, supporting\-span extraction, answer\-type and granularity matching, multi\-hop reasoning, temporal or comparative constraints, and confidence calibration\.

This example illustrates how RPT converts diagnostic feedback into concrete prompt edits\. The optimized prompt targets recurring HotPotQA failures by adding explicit controls for minimal span extraction, surface\-form matching, multi\-hop relation tracing, and confidence calibration under ambiguity\.

Original Prompt \- HotpotQA

System Message\. You are tasked with answering questions using only the provided context\.

User Instruction\.

• Reason step by step using only the provided context\.
• Be concise but thorough in your justification\.
• Before answering, verify that your answer is supported by the context\.

Your output should be a JSON with fields:

• justification: a context\-grounded explanation of how you reached the answer\.
• answer: your concise final answer\.
• confidence: a number in \[0,1\] representing your confidence in the final answer\.

Optimized Prompt \- HotpotQA

System Message\. You are tasked with answering questions using only the provided context\.

User Instruction\.

Instructions:

• Use only the provided context\. Parse the question to identify: \(a\) the subject, \(b\) the relation/attribute requested, \(c\) any constraints \(time/comparative qualifiers such as based on, before/after, first/last/only\), and \(d\) the required answer type and granularity/format, including whether a unit or administrative/institutional qualifier is explicitly requested\.
• Find the specific supporting span\(s\) in the context\. When multiple surface forms/aliases exist, select the exact contiguous span in the context that best matches the question wording; if none match, use the canonical full form as written in context, preserving capitalization, diacritics, and suffixes/post\-nominals \(e\.g\., Jr\., Sr\., III\)\. If the question asks for a unit or an administrative/institutional qualifier, select the span that includes it; if not, and both prefixed and bare forms appear, prefer the bare form\.
• Answer extraction: copy the minimal contiguous span verbatim from the context \(casing and numerals\)\. Do not add extra words, qualifiers, punctuation, or units unless explicitly requested\. Match the question’s granularity\. For numbers, output digits only; if the context has modifiers like “about/circa/approximately 100”, return just the bare value “100”\. If a full date is given but the question asks for a year, return only the 4\-digit year\. If the question requests a value\-with\-unit, copy the exact number\+unit as one contiguous span \(e\.g\., “5 kilometers”\)\. For place/organization names, include prefixes like “City of”, “County of”, “Province of”, “University of”, or “Department of” only if \(a\) the question targets that unit, or \(b\) the context consistently uses that prefixed form without an alternative bare form\. Trim leading/trailing articles or punctuation unless they are part of the official name\.
• Yes/No questions: answer exactly yes or no in lowercase with no punctuation\.
• Relation/compositional queries: follow all required hops and return the final requested attribute/value, not an intermediate entity or related item\. Verify subject–object direction \(e\.g\., “X acquired Y” vs “X was acquired by Y”\)\. Multi\-hop checklist: \(1\) find the bridge entity tied to the subject, \(2\) locate the next hop that yields the requested attribute, \(3\) ensure the final answer span appears verbatim in the context\.
• Comparative/temporal qualifiers: when prompts include first/last/only/earliest/latest/highest/lowest or a time restriction, \(1\) collect all candidate spans linked to the subject, \(2\) apply the specified ordering/constraint to choose the single correct candidate, \(3\) return only that final span\.
• Disambiguation: if multiple candidate entities/mentions exist \(including same\-named entities\), choose the one whose qualifiers \(role, location, dates, affiliation\) satisfy the question’s constraints; ignore distractors\.
• Before answering, run a final span checklist: \(a\) type and administrative level match the request, \(b\) temporal/comparative constraint satisfied, \(c\) surface form matches the question when available, else canonical full form in context, \(d\) span is minimal, contiguous, and free of extraneous punctuation\.
• Justification: briefly cite the key supporting phrase\(s\) and reasoning in ≤3 sentences\.
• Confidence calibration: set confidence based on evidence and ambiguity\.
  – 0\.85 if exactly one unambiguous supporting span directly answers the question with no alias/format ambiguity\.
  – ∼0\.55 if supported but there are multiple aliases/forms, qualifier/unit risks, or minor ambiguity \(cap confidence here in such cases\)\.
  – ∼0\.35 if partially supported or requires weaker inferences across snippets\.
  – ≤0\.25 if the support is tenuous or potentially contradicted\.

Output JSON fields:

• justification: a context\-grounded explanation of how you reached the answer\.
• answer: the bare minimal span only \(no quotes or extra words\)\.
• confidence: a number in \[0,1\] representing your confidence in the final answer\.
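To show how the JSON output contract and the stated confidence feed into calibration measurement, the Python sketch below parses one model output and computes a Brier score, that is, the mean squared difference between stated confidence and the 0/1 correctness outcome\. The helper names `parse_output` and `brier_score` are hypothetical, and how correctness is judged \(for example, exact match against a gold answer\) is left outside the sketch\.

```python
import json

REQUIRED_FIELDS = {"justification", "answer", "confidence"}


def parse_output(raw):
    """Parse a model response and check the three required JSON fields."""
    obj = json.loads(raw)
    if not isinstance(obj, dict):
        raise ValueError("output must be a JSON object")
    missing = REQUIRED_FIELDS - set(obj)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    confidence = float(obj["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    return obj["answer"], confidence


def brier_score(confidences, correct):
    """Mean squared gap between confidence and 0/1 correctness."""
    pairs = list(zip(confidences, correct))
    return sum((c - float(y)) ** 2 for c, y in pairs) / len(pairs)


raw = '{"justification": "The context states it.", "answer": "yes", "confidence": 0.85}'
print(parse_output(raw))                          # ('yes', 0.85)
print(brier_score([1.0, 0.5], [True, False]))     # 0.125
```

A lower Brier score means the stated confidences track correctness more closely, which is the direction the secondary optimization target asks for\.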