Why Prompt Optimization Works, and Why It Sometimes Doesn't: A Causal-Inspired Edit-Level Analysis

arXiv cs.CL 05/27/26, 04:00 AM Papers
prompt-optimization causal-inference llm nlp task-analysis generalization automated-prompting
Summary
This paper conducts a causal-inspired analysis of automated prompt optimization across frameworks, LLMs, and tasks, identifying that specific edit types (e.g., complexity-increasing, meta-instructional) have systematic negative or positive effects depending on task characteristics, explaining generalization failures.
arXiv:2605.26655v1 Announce Type: new Abstract: Automated prompt optimization methods (e.g., DSpy, TextGrad) can substantially improve the performance of large language model (LLM), however, their generalization ability across different tasks remains underperformed. In practice, the superiority of the optimized prompt on one benchmark often fails to transfer to another, and this limitation persists even when switching across different LLM backbones. To investigate the underexplored sources of heterogeneity in prompt performance, we conduct a causal inference-inspired observational analysis of optimized prompts across a diverse set of optimization frameworks, LLM backbones, and NLP benchmarks. To achieve the goal, we build upon the propensity-adjusted associational analysis together with multiple complementary representations of prompt edits, where the consistent task-conditioned edits patterns are identified. We find that complexity-increasing and meta-instructional edits are negatively associated with mathematical and multi-hop reasoning performance, whereas step-by-step and meta-cognitive edits improve logical and sequential reasoning tasks. These effects are robust across cognitive-load annotations, surface-level text features, and edit-motif analyses, and can generalize across optimization frameworks. Overall, these results indicate that prompt optimization failures arise from systematic interactions between edit families and task characteristics rather than random optimization artifacts, providing feature-level characterization of optimizer behavior and motivating future task-conditioned optimizer design.
Original Article
View Cached Full Text
Cached at: 05/27/26, 09:08 AM
# Why Prompt Optimization Works, and Why It Sometimes Doesn’t — A Causal-Inspired Edit-Level Analysis
Source: [https://arxiv.org/html/2605.26655](https://arxiv.org/html/2605.26655)
Shuzhi Gong1, Hechuan Wen2, 1The University of Melbourne, Melbourne, Victoria, Australia 2The University of Queensland, Brisbane, Queensland, Australia shuzhi@unimelb\.edu\.au h\.wen@uq\.edu\.au

###### Abstract

Automated prompt optimization methods \(e\.g\., DSpy, TextGrad\) can substantially improve the performance of large language model \(LLM\), however, their generalization ability across different tasks remains underperformed\. In practice, the superiority of the optimized prompt on one benchmark often fails to transfer to another, and this limitation persists even when switching across different LLM backbones\. To investigate the underexplored sources of heterogeneity in prompt performance, we conduct a causal inference\-inspired observational analysis of optimized prompts across a diverse set of optimization frameworks, LLM backbones, and NLP benchmarks\. To achieve the goal, we build upon the propensity\-adjusted associational analysis together with multiple complementary representations of prompt edits, where the consistent task\-conditioned edits patterns are identified\. We find that complexity\-increasing and meta\-instructional edits are negatively associated with mathematical and multi\-hop reasoning performance, whereas step\-by\-step and meta\-cognitive edits improve logical and sequential reasoning tasks\. These effects are robust across cognitive\-load annotations, surface\-level text features, and edit\-motif analyses, and can generalize across optimization frameworks\. Overall, these results indicate that prompt optimization failures arise from systematic interactions between edit families and task characteristics rather than random optimization artifacts, providing feature\-level characterization of optimizer behavior and motivating future task\-conditioned optimizer design\.

Why Prompt Optimization Works, and Why It Sometimes Doesn’t — A Causal\-Inspired Edit\-Level Analysis

Shuzhi Gong1, Hechuan Wen2††thanks:Corresponding author\.,1The University of Melbourne, Melbourne, Victoria, Australia2The University of Queensland, Brisbane, Queensland, Australiashuzhi@unimelb\.edu\.auh\.wen@uq\.edu\.au

## 1Introduction

Prompt optimization has emerged as an increasingly practical alternative to parameter\-efficient fine\-tuning for large language models \(LLMs\)Khattab et al\. \([2024](https://arxiv.org/html/2605.26655#bib.bib24)\); Opsahl\-Ong et al\. \([2024a](https://arxiv.org/html/2605.26655#bib.bib34)\); Yuksekgonul et al\. \([2024](https://arxiv.org/html/2605.26655#bib.bib55)\); Agrawal et al\. \([2026](https://arxiv.org/html/2605.26655#bib.bib2)\)\. Instead of updating model weights, recent frameworks such as TextGradYuksekgonul et al\. \([2024](https://arxiv.org/html/2605.26655#bib.bib55)\), GEPAAgrawal et al\. \([2026](https://arxiv.org/html/2605.26655#bib.bib2)\)automatically search the prompt space to improve downstream task performance\. These approaches have become particularly attractive for modern LLM systems because they are lightweight, modular, and directly compatible with LLM and agentic applicationsOpsahl\-Ong et al\. \([2024b](https://arxiv.org/html/2605.26655#bib.bib35)\);[Mihindukulasooriya et al\.](https://arxiv.org/html/2605.26655#bib.bib32); Susnjak \([2026](https://arxiv.org/html/2605.26655#bib.bib45)\)and retrieval\-augmented workflowsCâmara et al\. \([2026](https://arxiv.org/html/2605.26655#bib.bib4)\); Gong et al\. \([2026](https://arxiv.org/html/2605.26655#bib.bib19)\)\.

Despite their empirical success, prompt optimizers often exhibit unstable behavior across tasks and model backbonesZhang et al\. \([2026](https://arxiv.org/html/2605.26655#bib.bib57)\); Fu et al\. \([2026](https://arxiv.org/html/2605.26655#bib.bib14)\); Singhal et al\. \([2026](https://arxiv.org/html/2605.26655#bib.bib44)\)\. In practice, prompt revisions that improve one benchmark frequently fail to transfer to another, and optimizers that perform well on logical or sequential reasoning tasks may substantially degrade performance on mathematical or multi\-hop reasoning benchmarks\. Importantly, this pattern appears consistently in our experiments across multiple optimization frameworks and across diverse LLM backbones, including GPT\-5\.2, GPT\-4oAchiam et al\. \([2023](https://arxiv.org/html/2605.26655#bib.bib1)\), Qwen3\-32BYang et al\. \([2025](https://arxiv.org/html/2605.26655#bib.bib53)\), and DeepseekLiu et al\. \([2024](https://arxiv.org/html/2605.26655#bib.bib27)\)models\. Such instability raises a fundamental question:what kinds of prompt edits are modern optimizers actually learning to apply, andwhy do some edit patterns help certain task types while harming others?

Existing work has largely studied prompt optimization at the aggregate benchmark levelWan et al\. \([2024](https://arxiv.org/html/2605.26655#bib.bib49)\)\. Prior studies analyze optimizer success ratesPryzant et al\. \([2023](https://arxiv.org/html/2605.26655#bib.bib37)\), optimization dynamicsYuksekgonul et al\. \([2024](https://arxiv.org/html/2605.26655#bib.bib55)\); Yang et al\. \([2024](https://arxiv.org/html/2605.26655#bib.bib54)\), or embedding\-level optimization directionsLester et al\. \([2021](https://arxiv.org/html/2605.26655#bib.bib25)\); Li and Liang \([2021](https://arxiv.org/html/2605.26655#bib.bib26)\), but provide limited insight into the edit\-level behaviors underlying optimizer failures\. As a result, current evaluations often reveal*whether*optimization succeeds, but not*which prompt modifications*systematically contribute to improvement or degradation across different task settings\. Yet understanding these edit\-level behaviors is important for both diagnosis and optimizer design: two optimizers may achieve similar average gains while relying on fundamentally different editing strategies, and seemingly beneficial prompt modifications may interact differently with different reasoning tasks\.

![Refer to caption](https://arxiv.org/html/2605.26655v1/x1.png)Figure 1:Overview of our multi\-view probing framework for prompt optimizer behavior\.In this work, we investigate prompt optimizer behavior through an observational multi\-view analysis of optimizer\-induced prompt edits, as illustrated in Figure[1](https://arxiv.org/html/2605.26655#S1.F1)\. Rather than treating optimized prompts as indivisible artifacts, we analyze consecutive prompt revisions within real optimization trajectories and examine how different edit families are associated with downstream performance changes across task groups\.

To reduce dependence on any single representation of prompt edits, we probe optimizer behavior through three complementary views: \(1\) GPT\-4o\-annotated cognitive and instructional features, \(2\) deterministic surface\-level text statistics, and \(3\) literal text\-diff motifs extracted from consecutive prompt revisions\. These complementary representations are subsequently integrated into a propensity\-adjusted associational analysis framework to characterize heterogeneous optimizer behaviors across reasoning tasks\.

Methodologically, our analysis adopts a causal inference\-inspired observational framework\. We use propensity\-adjusted associational estimationRosenbaum and Rubin \([1983](https://arxiv.org/html/2605.26655#bib.bib40)\)to reduce measured selection bias arising from differences in prior prompt states, while explicitly avoiding strong causal claims\. Because the analysis involves many simultaneous feature\-task comparisons, we distinguish between statistically robust findings that survive false\-discovery\-rate \(FDR\) correction and exploratory directional patterns that provide corroborative but non\-confirmatory evidence\. Throughout the paper, we therefore organize results using a unified two\-tier evidence hierarchy\.Tier 1refers to associations surviving Benjamini–Hochberg false discovery rate correction\.Tier 2refers to directionally consistent corroborative patterns reproduced across multiple representations or frameworks but not surviving full multiple\-testing correction\.

Across around 20 thousand real world prompt optimization’s \(prompt,question,correctness\) tuples, we observe consistent edit\-level heterogeneity across task groups\. In particular, complexity\-increasing and meta\-instructional edits tend to be negatively associated with mathematical and multi\-hop reasoning performance, whereas metacognitive and step\-structured edits are positively associated with logical and sequential reasoning tasks\. Several of these associations remain significant after false\-discovery\-rate correction, while others appear consistently across multiple independent prompt representations\.

Our main contributions are summarized as follows:

- •We present an observational edit\-level analysis of prompt optimizer behavior across multiple optimization frameworks, LLM backbones, and reasoning task groups\.
- •We identify statistically robust heterogeneous associations between prompt optimizer\-induced edits and downstream task performance\.
- •We provide evidence that prompt optimization failures are not purely random, but are systematically associated with interactions between edits and benchmark\-specific characteristics, motivating future task\-conditioned optimizer design\.

## 2Related Work

#### Automated prompt optimization\.

Automated prompt optimization treats the prompt as a learnable variable to maximize task performance\. Early work learns soft prompts or searches discrete prompt tokens/templates for few\-shot adaptation and probing\(Li and Liang,[2021](https://arxiv.org/html/2605.26655#bib.bib26); Lester et al\.,[2021](https://arxiv.org/html/2605.26655#bib.bib25); Shin et al\.,[2020](https://arxiv.org/html/2605.26655#bib.bib43); Gao et al\.,[2021](https://arxiv.org/html/2605.26655#bib.bib15)\)\. Later black\-box methods optimize natural\-language instructions directly, including instruction generation and ranking\(Zhou et al\.,[2022](https://arxiv.org/html/2605.26655#bib.bib59)\), gradient\-free edit search\(Prasad et al\.,[2023](https://arxiv.org/html/2605.26655#bib.bib36)\), reinforcement\-learning\-based prompt editing\(Deng et al\.,[2022](https://arxiv.org/html/2605.26655#bib.bib10); Zhang et al\.,[2022](https://arxiv.org/html/2605.26655#bib.bib56)\), and textual\-gradient revision\(Pryzant et al\.,[2023](https://arxiv.org/html/2605.26655#bib.bib37); Yuksekgonul et al\.,[2024](https://arxiv.org/html/2605.26655#bib.bib55)\)\. More recent systems use LLMs as optimizers or search operators, through natural\-language proposal selection\(Yang et al\.,[2024](https://arxiv.org/html/2605.26655#bib.bib54)\), evolutionary/self\-referential mutation\(Guo et al\.,[2024](https://arxiv.org/html/2605.26655#bib.bib20); Fernando et al\.,[2023](https://arxiv.org/html/2605.26655#bib.bib13); Agrawal et al\.,[2026](https://arxiv.org/html/2605.26655#bib.bib2)\), and program\-level prompt compilation\(Khattab et al\.,[2024](https://arxiv.org/html/2605.26655#bib.bib24); Opsahl\-Ong et al\.,[2024a](https://arxiv.org/html/2605.26655#bib.bib34)\)\. Despite their different optimization mechanisms, these methods converge to a common empirical failure mode in our setting: consistent performance degradation on math and multi\-hop tasks\. This convergence motivates a feature\-level diagnosis of prompt changes rather than a method\-specific analysis\.

#### Prompt sensitivity and format effects\.

A complementary line of work studies how non\-semantic prompt properties affect LLM performance\.Zhao et al\. \([2021](https://arxiv.org/html/2605.26655#bib.bib58)\)shows that few\-shot calibration can substantially reduce format\-induced bias, whileLu et al\. \([2022](https://arxiv.org/html/2605.26655#bib.bib29)\)finds that demonstration order can change accuracy by up to 30 points\. Related studies further show that in\-context learning depends on label space, input distribution, and sequence format more than exact input\-label mappings\(Min et al\.,[2022](https://arxiv.org/html/2605.26655#bib.bib33)\), that models can be insensitive to instruction semantics\(Webson and Pavlick,[2022](https://arxiv.org/html/2605.26655#bib.bib51)\), that answer surface forms distort likelihood\-based scoring\(Holtzman et al\.,[2021](https://arxiv.org/html/2605.26655#bib.bib21)\), and that minor formatting choices can induce large performance swings\(Sclar et al\.,[2024](https://arxiv.org/html/2605.26655#bib.bib42)\)\. Together, these findings demonstrate that prompt surface properties matter independently of semantic content\. We extend this perspective from isolated format variation to optimizer trajectories, examining consecutive optimizer\-induced edits and their task\-type\-conditioned performance effects\. The Coin Flip paper\(Zhang et al\.,[2026](https://arxiv.org/html/2605.26655#bib.bib57)\)uses ANOVA\-based analysis to identify run\-level conditions under which optimization succeeds; we complement this with feature\-level and edit\-level causal analysis across task types\.

#### Causal inference for NLP\.

Causal inference has been increasingly used in NLP for debiasing, model behavior analysis, and text\-based causal estimation\(Feder et al\.,[2022](https://arxiv.org/html/2605.26655#bib.bib12)\)\. Recent LLM\-era work further studies LLMs both as objects of causal analysis and as tools for causal discovery or effect estimation\(Ma,[2025](https://arxiv.org/html/2605.26655#bib.bib30); Liu et al\.,[2025](https://arxiv.org/html/2605.26655#bib.bib28)\)\. For example, recent studies use LLMs to estimate causal effects from unstructured text\(Dhawan et al\.,[2024](https://arxiv.org/html/2605.26655#bib.bib11)\), build causal graphs and perform counterfactual inference from natural language\(Gendron et al\.,[2024](https://arxiv.org/html/2605.26655#bib.bib17)\), generate or match counterfactual texts for faithful model explanation\(Gat et al\.,[2023](https://arxiv.org/html/2605.26655#bib.bib16); Wang et al\.,[2024](https://arxiv.org/html/2605.26655#bib.bib50)\), and evaluate LLMs’ formal causal and counterfactual reasoning abilities\(Jin et al\.,[2024](https://arxiv.org/html/2605.26655#bib.bib22); Maasch et al\.,[2025](https://arxiv.org/html/2605.26655#bib.bib31); Chen et al\.,[2026b](https://arxiv.org/html/2605.26655#bib.bib6)\)\. The heterogeneous treatment effect literature\(Wager and Athey,[2018](https://arxiv.org/html/2605.26655#bib.bib48)\)provides the statistical framework for conditional causal effects across subpopulations, while the double machine learning framework\(Chernozhukov et al\.,[2018](https://arxiv.org/html/2605.26655#bib.bib7)\)enables nuisance\-adjusted effect estimation with flexible machine learning models\. We adapt inverse probability of treatment weighting \(IPTW\)\(Robins et al\.,[2000](https://arxiv.org/html/2605.26655#bib.bib39)\)to prompt optimization traces, where the treatment is an optimizer\-induced change in prompt features and the confounder is the prior prompt state\. This framing extends causal\-effect estimation tools to optimizer behavior analysis, while our evidence\-tier design distinguishes FDR\-controlled associations from exploratory corroboration\.

#### Closest prior work\.

CPO\(Chen et al\.,[2026a](https://arxiv.org/html/2605.26655#bib.bib5)\)applies double machine learning to whole\-prompt embeddings to estimate optimization effects, whereas our focus is diagnostic and feature\-level: we analyze how interpretable prompt edits are associated with gains or losses across task groups\. While the CausalNLP survey\(Feder et al\.,[2022](https://arxiv.org/html/2605.26655#bib.bib12)\)reviews text\-as\-treatment methods, we instantiate this perspective at the optimizer edit\-motif level\. Recent work on embedding\-based causal inference\(Dawoud and El\-Shamy,[2026](https://arxiv.org/html/2605.26655#bib.bib9)\)shows dense representations can reduce selection bias; in contrast, we emphasize interpretable surface features that support direct comparison across annotation\-based and annotation\-free views\. The Coin Flip paper\(Zhang et al\.,[2026](https://arxiv.org/html/2605.26655#bib.bib57)\)studies optimizer success at the run level via ANOVA, whereas we perform edit\-level IPTW\-adjusted analysis to identify which specific edit types are associated with optimizer success or failure\.

#### Cognitive load in instructional design\.

Our annotation scheme draws on cognitive load theory, which distinguishes intrinsic load from task complexity, extraneous load from irrelevant or poorly structured information, and germane load from task\-relevant cognitive processing\. We use GPT\-4o to annotate these constructs at the prompt level and validate them against deterministic text proxies \(§[3\.2](https://arxiv.org/html/2605.26655#S3.SS2)\), following partial\-validity assessment practices in computational social science\.

## 3Data and Annotation

### 3\.1Pairwise Prompt Comparison Dataset

We collected optimization logs from 3 frameworks, i\.e\., DSPyKhattab et al\. \([2024](https://arxiv.org/html/2605.26655#bib.bib24)\)\(with MiPROv2 optimizerOpsahl\-Ong et al\. \([2024a](https://arxiv.org/html/2605.26655#bib.bib34)\)\), TextGradYuksekgonul et al\. \([2024](https://arxiv.org/html/2605.26655#bib.bib55)\), and GEPAAgrawal et al\. \([2026](https://arxiv.org/html/2605.26655#bib.bib2)\); 5 LLM backbones, i\.e\., GPT\-5\.2, GPT\-4o, Qwen3\-32BYang et al\. \([2025](https://arxiv.org/html/2605.26655#bib.bib53)\), Deepseek\-v3, and Deepseek\-R1Liu et al\. \([2024](https://arxiv.org/html/2605.26655#bib.bib27)\); and 11 NLP benchmarks spanning 5 reasoning task categories:

1

Commonsense reasoning \(including Commonsense QATalmor et al\. \([2019](https://arxiv.org/html/2605.26655#bib.bib47)\), causal judgment, disambiguation QA datasetsSuzgun et al\. \([2023](https://arxiv.org/html/2605.26655#bib.bib46)\), with a mean base accuracy 0\.70\), 2Mathematical reasoning \(including GSM8KCobbe et al\. \([2021](https://arxiv.org/html/2605.26655#bib.bib8)\), MultiArith datasetsRoy and Roth \([2015](https://arxiv.org/html/2605.26655#bib.bib41)\), with a mean base accuracy 0\.97\), 3Logical reasoning \(including boolean expressions, coinflipSuzgun et al\. \([2023](https://arxiv.org/html/2605.26655#bib.bib46)\)datasets, with a mean base accuracy 0\.81\), 4Sequential reasoning \(including last lettersSuzgun et al\. \([2023](https://arxiv.org/html/2605.26655#bib.bib46)\), ListOpsKeysers et al\. \([2019](https://arxiv.org/html/2605.26655#bib.bib23)\)datasets, with a mean base accuracy 0\.75\), 5Multi\-hop reasoning \(including Strategy QAGeva et al\. \([2021](https://arxiv.org/html/2605.26655#bib.bib18)\), date understandingSuzgun et al\. \([2023](https://arxiv.org/html/2605.26655#bib.bib46)\)datasets, with a mean base accuracy 0\.75\)\.

Especially, Math benchmarks exhibit particularly high initial performance,raising the possibility of a ceiling effect that we further examine in §[4](https://arxiv.org/html/2605.26655#S4)\.

For each consecutive optimization step, we record the performance gain, defined asΔacc=acc\(p2\)−acc\(p1\)\\Delta\\mathrm\{acc\}=\\mathrm\{acc\}\(p\_\{2\}\)\-\\mathrm\{acc\}\(p\_\{1\}\), on a fixed evaluation set\.p1p\_\{1\}andp2p\_\{2\}are the former and updated prompts, and the prompt optimization applies the edit onp1p\_\{1\}to producep2p\_\{2\}\. This process yields 2,095 pairwise comparisons from DSPy for the main analysis, and an additional 17,708 comparisons from TextGrad and GEPA for cross\-framework replication\.

### 3\.2Prompt Feature Annotation

Each prompt is annotated with 12 features derived from cognitive load theory and instructional design, each scored on a 1–10 scale by GPT\-4o\.

Table 1:The 12 GPT\-4o\-annotated cognitive\-load features\.#### Construct validity\.

We correlate these annotations against 13 text\-derived features \(§[5\.1](https://arxiv.org/html/2605.26655#S5.SS1)\) using Spearman rank correlation\. 7 of 19 expected correlations are validated \(\|ρ\|\>0\.10\|\\rho\|\>0\.10,p<0\.10p<0\.10\): Metacognition↔\\leftrightarrowmeta\-cognitive word density \(ρ=\+0\.691\\rho=\+0\.691,p<0\.001p<0\.001\); Intrinsic\_load↔\\leftrightarrowword count \(ρ=\+0\.566\\rho=\+0\.566,p<0\.001p<0\.001\); Structural\_logic↔\\leftrightarrownumbered list presence \(ρ=\+0\.457\\rho=\+0\.457,p<0\.001p<0\.001\)\. Annotation validity is partial but sufficient for coarse\-grained analysis, and we address the remaining validity concern through annotation\-free replication in §[5](https://arxiv.org/html/2605.26655#S5)\.

## 4Confounding\-Adjusted Associational Analysis

### 4\.1Estimand and Assumptions

We estimate Inverse Probability of Treatment WeightingRosenbaum and Rubin \([1983](https://arxiv.org/html/2605.26655#bib.bib40)\)\(*IPTW\)\-adjusted conditional mean gain differences*\(ACMGD\) for optimizer\-selected edit regimes\. ACMGD is therefore interpreted as an IPTW\-adjusted observational association rather than as an interventional causal estimand\. For a given feature indicatorTT, we set treatment=1=1when the revision fromp1p\_\{1\}top2p\_\{2\}increases that feature, and0otherwise\. Within each dataset/task group, the estimand compares expected performance gain between edit regimes with and withoutTT, under three standard assumptions:

- •Consistency: The observed outcome equals the potential outcome under the observed treatment\.
- •Positivity: Each unit has positive probability of both treatment values given covariates\.
- •Conditional exchangeability: Given*pre\-edit covariates*\(promptp1p\_\{1\}state: length, demo count, headroom, framework, backbone, dataset, optimization step\), treatment assignment is conditionally independent of untreated potential outcomes\.

#### Plausible unmeasured confounders\.

Conditional exchangeability is unlikely to hold exactly in this setting\. Several optimizer\-internal variables remain unmeasured: \(1\)*search trajectory history*may confound the estimates, because optimizers tend to insert meta\-instructions after observing prior failures, leaving “meta\-instruction inserted” correlated with “prior prompt was struggling”; \(2\)*selection pressure*may arise because high\-performing prompts at later optimization steps are more likely to receive additional complex features; \(3\)*LLM\-specific instruction following*may matter because backbones vary in how they process instruction\-level directives, and backbone is partially conditioned on rather than fully controlled\. IPTW adjusts for measured pre\-edit state but cannot eliminate these residual dependencies\. Accordingly, the results should be interpreted as*adjusted associational contrasts*rather than causal effects\.

#### Treatment bundling\.

Prompt optimizers often modify multiple features simultaneously, so treatments are*edit bundles*rather than isolated feature changes\. Although the propensity model conditions only on pre\-edit covariates, residual bundle heterogeneity remains a limitation\. Thus, the analysis provides*observational associational evidence rather than causal identification*\.

### 4\.2Method: IPTW within Task Types

For each featureffand task groupτ\\tau, we proceed as follows\. First, we define treatment indicator asT=𝟏\[f\-change\>0\]T=\\mathbf\{1\}\[f\\text\{\-change\}\>0\]\. Second, we fit a logistic propensity modelP\(T=1∣X\)P\(T\{=\}1\\mid X\)using only pre\-edit covariates\. Third, we compute stabilized IPTW weights,

wi=TiP¯\(T=1\)e\(Xi\)\+\(1−Ti\)P¯\(T=0\)1−e\(Xi\),w\_\{i\}=\\frac\{T\_\{i\}\\,\\bar\{P\}\(T\{=\}1\)\}\{e\(X\_\{i\}\)\}\+\\frac\{\(1\-T\_\{i\}\)\\,\\bar\{P\}\(T\{=\}0\)\}\{1\-e\(X\_\{i\}\)\},with weights capped at 10 to limit the influence of extreme propensity scores\.

Fourth, we estimate the stabilized\-IPTW weighted average conditional mean gain difference \(ACMGD\),

ACMGD^f,τSIPW=∑iwiTiYi∑iwiTi−∑iwi\(1−Ti\)Yi∑iwi\(1−Ti\),\\widehat\{\\mathrm\{ACMGD\}\}\_\{f,\\tau\}^\{\\mathrm\{SIPW\}\}=\\frac\{\\sum\_\{i\}w\_\{i\}T\_\{i\}Y\_\{i\}\}\{\\sum\_\{i\}w\_\{i\}T\_\{i\}\}\-\\frac\{\\sum\_\{i\}w\_\{i\}\(1\-T\_\{i\}\)Y\_\{i\}\}\{\\sum\_\{i\}w\_\{i\}\(1\-T\_\{i\}\)\},whereYiY\_\{i\}denotes the observed performance gain for revision pairii\.

Fifth, we calculate block\-bootstrap standard errors using 500 resamples, with blocks defined by dataset×\\timesbackbone stratum to account for within\-run and within\-dataset correlation\. All reported p\-values and BH\-corrected significance stars are based on these block\-bootstrap uncertainties\. BH\-FDR correction is applied simultaneously across all 60 \(feature⊗\\otimestask\-group\) tests\.

The average confounding magnitude across all combinations is\|SIPW−naive\|=0\.018\|\\text\{SIPW\}\-\\text\{naive\}\|=0\.018, confirming that selection bias is present in naive estimates\.

### 4\.3Results

We organize results into two tiers:Tier 1 \(Confirmatory\)contains BH\-FDR corrected findings \(q<0\.05q<0\.05\), whereasTier 2 \(Exploratory\)contains uncorrected directional patterns that corroborate Tier 1 but do not independently support inference\.

#### Tier 1: Confirmatory findings \(BH\-FDR corrected\)\.

Table[2](https://arxiv.org/html/2605.26655#S4.T2)reports IPTW\-adjusted ACMGD estimates by feature and dataset/task group\. Across 60 simultaneous tests, 11 reach uncorrected significance \(p<0\.05p<0\.05\) and2 survive Benjamini–Hochberg FDR correction\(q<0\.05q<0\.05\):

- •Extraneous\_load⊗\\otimessequential reasoning: ACMGD=−0\.060★=\-0\.060^\{\\bigstar\}111★\\bigstar= BH\-FDRq<0\.05q\{<\}0\.05;∗= uncorrectedp<0\.05p\{<\}0\.05\. These significance markers are used consistently throughout the paper\. More details are in Appendix[A](https://arxiv.org/html/2605.26655#A1)\.\. Increases in extraneous load are significantly associated with performance degradation in the sequential task group \(LOO\-stable across 5/5 splits\)\. Here, leave\-one\-out \(LOO\) stability refers to re\-estimating the sign and direction of an association after iteratively excluding one dataset at a time, providing a robustness check against single\-dataset\-driven effects\.
- •Metacognition⊗\\otimessequential reasoning: ACMGD=\+0\.062★=\+0\.062^\{\\bigstar\}\. Metacognitive prompting is significantly associated with performance gains in the sequential task group \(LOO\-stable across 2/3 splits; one flip in disambiguation\_qa\)\.

Per\-dataset robustness details for the Tier 1 findings are provided in Appendix[G](https://arxiv.org/html/2605.26655#A7)\.

![Refer to caption](https://arxiv.org/html/2605.26655v1/x2.png)Figure 2:IPTW\-adjusted ACMGD heatmap \(10 features×\\times5 dataset/task groups, 60 tests\)\. Blue = positive association; red = negative\. Math and multihop task groups show predominantly negative associations with complexity\-increasing features; this directional pattern is exploratory \(uncorrected\)\.Table 2:IPTW\-adjusted ACMGD by feature and task group\. CS=commonsense, Math=Mathematical, Logic=logical, Seq=sequential, MH=multihop, Spread = max−\-min\.
#### Tier 2: Exploratory directional patterns\.

Beyond the two BH\-corrected effects, several uncorrected patterns appear directionally consistent and corroborate Tier 1: the Extraneous\_load sign reversal \(commonsense\+0\.032∗\+0\.032^\{\*\}vs\. sequential−0\.060★\-0\.060^\{\\bigstar\}\) is stable in 5/5 LOO splits\. Demos \(math\+0\.017\+0\.017vs\. sequential−0\.044∗\-0\.044^\{\*\}\) is directionally consistent with the surface feature analysis in §[5](https://arxiv.org/html/2605.26655#S5)\. In addition, math task groups show predominantly negative point estimates for complexity\-increasing features across 9/12 features, a pattern consistent with the motif analysis but uncorrected\. These results are reported as*exploratory hypothesis\-generating observations only*and do not independently support inferential claims\.

### 4\.4Ceiling Effects

We examine whether dataset/task\-group heterogeneity can be explained by ceiling effects alone\. Headroom\. defined as1−base\_accuracy1\-\\text\{base\\\_accuracy\}, captures the remaining potential for improvement\. The association between headroom and performance gain is weak and not statistically significant, with Spearmanρ\(headroom,Δacc\)=0\.047\\rho\(\\text\{headroom\},\\Delta\\mathrm\{acc\}\)=0\.047\(p=0\.053p=0\.053\)\. In addition, the partialR2R^\{2\}of task group conditional on headroom is small but nonzero,R2\(task group∣headroom\)=0\.003R^\{2\}\(\\text\{task group\}\\mid\\text\{headroom\}\)=0\.003, indicating that task\-group membership explains residual variation beyond headroom\. These results suggest that ceiling effects are only a minor contributor and that dataset/task\-group moderation is not reducible to baseline performance differences alone\.

## 5Annotation\-Free Corroboration

To assess whether the BH\-corrected findings in §[4](https://arxiv.org/html/2605.26655#S4)depend on GPT\-4o annotation choices, we re\-examine the same prompt pairs using annotation\-free text representations\. These analyses produce Tier 2 corroborating: they align with Tier 1 findings but are not independently confirmatory\.

### 5\.1Surface Complexity Features

We extract 14 deterministic text features from raw instruction text, including word count, demonstration count extracted from the JSON structure, step\-word density based on terms such as “step,” “first,” and “then,” type\-token ratio, compression ratio, sentence count, and related surface statistics\. We then apply the same IPTW\-adjusted associational analysis on DSPy pairs \(N=2,095N=2\{,\}095\)\.

Table 3:Surface feature IPTW\-adjusted ACMGD\. CS=commonsense, Math=Mathematical, Logic=logical, Seq=sequential, MH=multihop, Spread = max−\-min\.All four major LLM\-annotated directional patterns are reproduced by text\-native features \(Table[3](https://arxiv.org/html/2605.26655#S5.T3)\)\. Specifically, n\_demos algins with Demos \(math positive, sequential negative\); type\_token\_ratio algins with Clarity \(math negative\); step\_words algins with Structural\_logic \(logical/sequential positive, multihop negative\); word\_count algins with Extraneous\_load \(sequential positive, multihop negative\)\.

This convergence reduces the concern that the observed patterns are artifacts of GPT\-4o annotation choices\. If the sign reversals were driven solely by GPT\-4o annotation choices, deterministic surface proxies would be unlikely to reproduce the same directional patterns\. Although this comparison does not fully rule out annotation bias, becauseboth representations are derived from the same underlying prompts,it provides annotation\-independent corroboration\. Figure[3](https://arxiv.org/html/2605.26655#S5.F3)visualizes this multi\-view agreement for five key feature comparisons\.

#### Cross\-framework directional corroboration \(TextGrad, GEPA,N=17,708N=17\{,\}708\)\.

We apply the same surface\-feature extraction to TextGrad and GEPA pairs as a directional corroboration check\. Because TextGrad and GEPA uses a different optimization framework and prompt structure, this analysis is a*partial replication of directional patterns only*, rather than a full replication of the IPTW adjustment protocol\. The results show consistent directional pattern: word\_count \(sequential\+0\.017∗\+0\.017^\{\*\}, multihop−0\.014∗\-0\.014^\{\*\}\); sentence\_count \(commonsense\+0\.027∗\+0\.027^\{\*\}vs\. sequential−0\.021∗\-0\.021^\{\*\}vs\. multihop−0\.018∗\-0\.018^\{\*\}\); reasoning\_words \(sequential\+0\.039∗\+0\.039^\{\*\}vs\. commonsense−0\.014∗\-0\.014^\{\*\}\)\. The n\_demos feature is less interpretable for TextGrad \(which does not use structured few\-shot demos in the same format\)\. We therefore interpret these results as cross\-framework directional corroboration, instead of independent replication\.

![Refer to caption](https://arxiv.org/html/2605.26655v1/x3.png)Figure 3:Multi\-view convergence of sign reversals across three representations \(LLM annotation, surface text features, edit motifs\) for five key feature comparisons\. Blue dots = higher\-benefit task type; red dots = lower\-benefit task type\. The consistent sign pattern across representations reduces concern that results are annotation artifacts\.

### 5\.2Edit Motif Effects

We compute text\-diff for each consecutive prompt pair, extract inserted word spans of at least≥4\\geq 4tokens, and classify them into four pre\-specified motif categories using regex patterns defined prior to analysis:chain\_of\_thought\(“step by step”, “think through”, “reasoning”, etc\.\),meta\_instruction\(“make sure to”, “do not”, “ensure”, “remember to”, etc\.\),step\_by\_step\(“step 1”, “first…then”, numbered instruction lists\), andclarity\_constraint\(“concisely”, “briefly”, “simple”, “avoid”, etc\.\)\.

We then estimate IPTW\-adjusted ACMGD for each motif×\\timesdataset/task\-group combination using pre\-treatment covariates, and apply BH\-FDR correction across all 15 motif×\\timesgroup tests simultaneously\.

Table 4:Edit motif insertion ACMGD by dataset/task group\. CS=commonsense, Math=Mathematical, Logic=logical, Seq=sequential, MH=multihop, Spread = max−\-min\.Motif co\-occurrence: chain\_of\_thought & meta\_instruction in 24\.0% of pairs\. Results organized by Tier 1 \(starred\) vs\. Tier 2 \(unstated\)\.![Refer to caption](https://arxiv.org/html/2605.26655v1/x4.png)Figure 4:Edit motif insertion ACMGD by dataset/task group \(4 motifs×\\times5 groups, 15 tests\)\. Meta\-instruction is significantly associated with worse math performance; clarity\-constraint is significantly associated with worse logical performance\.#### Tier 1: BH\-corrected motif findings\.

Meta\-instructioninsertion is significantly associated with lower performance in the math dataset group \(ACMGD=−0\.103★=\-0\.103^\{\\bigstar\}, BH\-corrected\)\.*Per\-dataset caveat*: This effect is partially concentrated in MultiArith \(naive CATE=−0\.277=\-0\.277,n=64n=64\) vs\. GSM8K \(−0\.019\-0\.019,n=14n=14\); the aggregated math estimate should be interpreted with this per\-dataset heterogeneity in mind \(see Appendix[F](https://arxiv.org/html/2605.26655#A6)\)\. Meta\-instruction is also directionally negative in multihop \(−0\.044∗\-0\.044^\{\*\}, uncorrected, Tier 2\)\.

Clarity\-constraintinsertion is significantly associated with lower performance in the logical dataset group \(ACMGD=−0\.083★=\-0\.083^\{\\bigstar\}, BH\-corrected\)\. Instructions to “be concise/brief/simple” may suppress step\-by\-step reasoning on boolean and coin\-flip tasks\.

#### Tier 2: Exploratory directional pattern\.

Chain\-of\-thoughtshows a preserved directional pattern across representations: negative in math groups \(motif:−0\.048\-0\.048; surface step\_words:−0\.044\-0\.044; LLM Structural\_logic negative\), positive in sequential/last\_letters \(motif:\+0\.045∗\+0\.045^\{\*\}; surface step\_words:\+0\.075∗\+0\.075^\{\*\}; LLM Structural\_logic positive\)\. This pattern is consistent but uncorrected across all representations and should be treated as Tier 2 \(exploratory corroboration\)\.*Dataset caveat*: “sequential” is a single\-dataset group \(last\_letters\); this is dataset\-level evidence, not multi\-dataset task\-type evidence\.

#### Motif validity\.

We manually audited 50 classified\-positive spans per key motif to assess precision\. Formeta\_instruction, representative inserted spans include:

- •“Ensure that both outputs clearly explain how the final solution was reached based on the arithmetic operations involved\.”
- •“Make sure to articulate the reasoning process clearly, even if it doesn’t require detailed breakdowns\.”

These patterns are consistently genuine meta\-instructional directives added by the optimizer\. Forchain\_of\_thought, the most common false positive is the JSON key"reasoning": "\.\.\."appearing in structured few\-shot demo fields, i\.e\., the regex matches the word “reasoning” but the pattern appears in a non\-instruction field not visible to the model as a directive\. This false positive inflates the treatment indicator for chain\_of\_thought without representing an actual user\-visible instruction change\. We estimate meta\_instruction precision at≳90%\{\\gtrsim\}90\\%based on this audit \(50 random positive spans, stratified by dataset; labeled independently by one author; ambiguous cases counted as false positives\)\. Chain\_of\_thought precision is lower and the pattern should be treated as noisier\. Motif precision was not formally estimated at scale and the audit involved one labeler; results remain preliminary\.

## 6Conclusion

We present a multi\-view observational analysis of edit\-level heterogeneity in prompt optimizer behavior\. Using three complementary representations of 2,095 pairwise prompt comparisons, i\.e\., GPT\-4o\-annotated cognitive\-load features, deterministic surface text statistics, and literal text\-diff motifs, we find four BH\-FDR\-corrected associations: Extraneous\_load⊗\\otimessequential \(−0\.060★\-0\.060^\{\\bigstar\}\), Metacognition⊗\\otimessequential \(\+0\.062★\+0\.062^\{\\bigstar\}\), meta\-instruction⊗\\otimesmath \(−0\.103★\-0\.103^\{\\bigstar\}\), and clarity\-constraint⊗\\otimeslogical \(−0\.083★\-0\.083^\{\\bigstar\}\)\. All four involve at least two benchmark datasets per task group\. Beyond these confirmatory findings, directionally consistent corroborating evidence appears in surface text features and in 17,708 TextGrad/GEPA pairs, and 83% of headline sign patterns survive LOO exclusion\.

These patterns suggest that edit\-family\-level analysis, especially when conditioned on benchmark\-specific task demands, offers a more informative diagnosis than aggregate success or failure metrics alone\. The pre\-optimization question “does my target benchmark resemble groups where meta\-instruction and complexity\-increasing edits are negatively associated with gains?” has practical diagnostic value even under the observational limitations of this study, and points toward edit\-conditioned optimizer design as a fruitful direction for interventional follow\-up\.

#### Implications\.

For practitioners, the ACMGD tables in this paper serve as a preliminary diagnostic: before running an expensive optimizer, consider whether the target task/dataset resembles groups \(math\-like, multihop\-like\) where meta\-instruction and complexity\-increasing edits show negative associations with gains\. This does not guarantee failure, but the BH\-corrected associations are consistent across frameworks, representations, and LOO folds\. The consistent negative association between meta\-instruction insertion and math\-type benchmarks suggests that future prompt optimizers may benefit from task\-conditioned edit control, particularly by discouraging unnecessary complexity\-increasing or meta\-instructional edits during optimization\.

#### Reproducibility\.

All analysis code and pairwise comparison dataset will be released upon publication\. The TextGrad processing pipeline is compatible with TextGrad public official repository\.

## Limitations

Our analysis is observational rather than interventional\. Although IPTW reduces measured selection bias using pre\-edit prompt states, residual confounding may remain due to optimizer trajectory history, backbone\-specific instruction following, and bundled prompt edits that modify multiple features simultaneously\. Accordingly, the reported ACMGD estimates should be interpreted as adjusted associational contrasts rather than causal effects\.

In addition, several findings remain benchmark\-sensitive\. Some task\-group effects are partially concentrated in specific datasets \(e\.g\., MultiArith within math\), and cross\-framework corroboration is directional rather than a fully controlled replication because different optimization frameworks use different prompt structures\. Finally, motif extraction relies on pre\-specified regex patterns and therefore introduces unavoidable labeling noise, particularly for chain\-of\-thought\-related edits\.

## References

- Achiam et al\. \(2023\)Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, and 1 others\. 2023\.Gpt\-4 technical report\.*arXiv preprint arXiv:2303\.08774*\.
- Agrawal et al\. \(2026\)Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl\-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, and 1 others\. 2026\.Gepa: Reflective prompt evolution can outperform reinforcement learning\.*International Conference on Learning Representations*\.
- Benjamini and Hochberg \(1995\)Yoav Benjamini and Yosef Hochberg\. 1995\.Controlling the false discovery rate: a practical and powerful approach to multiple testing\.*Journal of the Royal statistical society: series B \(Methodological\)*, 57\(1\):289–300\.
- Câmara et al\. \(2026\)Arthur Câmara, Vincent Slot, and Jakub Zavrel\. 2026\.Self\-optimizing multi\-agent systems for deep research\.*arXiv preprint arXiv:2604\.02988*\.
- Chen et al\. \(2026a\)Wei Chen, Yanbin Fang, Shuran Fu, Fasheng Xu, and Xuan Wei\. 2026a\.Optimizing prompts for large language models: A causal approach\.*arXiv preprint arXiv:2602\.01711*\.
- Chen et al\. \(2026b\)Yuefei Chen, Vivek K Singh, Jing Ma, and Ruixiang Tang\. 2026b\.Counterbench: Evaluating and improving counterfactual reasoning in large language models\.In*Proceedings of the AAAI Conference on Artificial Intelligence*, volume 40, pages 30350–30358\.
- Chernozhukov et al\. \(2018\)Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins\. 2018\.Double/debiased machine learning for treatment and structural parameters\.
- Cobbe et al\. \(2021\)Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, and 1 others\. 2021\.Training verifiers to solve math word problems\.*arXiv preprint arXiv:2110\.14168*\.
- Dawoud and El\-Shamy \(2026\)Ahmed Dawoud and Osama El\-Shamy\. 2026\.Reading between the lines: Deconfounding causal estimates using text embeddings and deep learning\.*arXiv preprint arXiv:2601\.01511*\.
- Deng et al\. \(2022\)Mingkai Deng, Jianyu Wang, Cheng\-Ping Hsieh, Yihan Wang, Han Guo, Tianmin Shu, Meng Song, Eric Xing, and Zhiting Hu\. 2022\.Rlprompt: Optimizing discrete text prompts with reinforcement learning\.In*Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing*, pages 3369–3391\.
- Dhawan et al\. \(2024\)Nikita Dhawan, Leonardo Cotta, Karen Ullrich, Rahul G Krishnan, and Chris J Maddison\. 2024\.End\-to\-end causal effect estimation from unstructured natural language data\.*Advances in Neural Information Processing Systems*, 37:77165–77199\.
- Feder et al\. \(2022\)Amir Feder, Katherine A Keith, Emaad Manzoor, Reid Pryzant, Dhanya Sridhar, Zach Wood\-Doughty, Jacob Eisenstein, Justin Grimmer, Roi Reichart, Margaret E Roberts, and 1 others\. 2022\.Causal inference in natural language processing: Estimation, prediction, interpretation and beyond\.*Transactions of the Association for Computational Linguistics*, 10:1138–1158\.
- Fernando et al\. \(2023\)Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktäschel\. 2023\.Promptbreeder: Self\-referential self\-improvement via prompt evolution\.*arXiv preprint arXiv:2309\.16797*\.
- Fu et al\. \(2026\)Lucheng Fu, Ye Yu, Yiyang Wang, Yiqiao Jin, Haibo Jin, B Aditya Prakash, and Haohan Wang\. 2026\.Textreg: Mitigating prompt distributional overfitting via regularized text\-space optimization\.*arXiv preprint arXiv:2605\.21318*\.
- Gao et al\. \(2021\)Tianyu Gao, Adam Fisch, and Danqi Chen\. 2021\.Making pre\-trained language models better few\-shot learners\.In*Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\)*, pages 3816–3830\.
- Gat et al\. \(2023\)Yair Gat, Nitay Calderon, Amir Feder, Alexander Chapanin, Amit Sharma, and Roi Reichart\. 2023\.Faithful explanations of black\-box nlp models using llm\-generated counterfactuals\.*arXiv preprint arXiv:2310\.00603*\.
- Gendron et al\. \(2024\)Gaël Gendron, Jože M Rožanec, Michael Witbrock, and Gillian Dobbie\. 2024\.Counterfactual causal inference in natural language with large language models\.*arXiv preprint arXiv:2410\.06392*\.
- Geva et al\. \(2021\)Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant\. 2021\.Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies\.*Transactions of the Association for Computational Linguistics*, 9:346–361\.
- Gong et al\. \(2026\)Shuzhi Gong, Richard O Sinnott, Jianzhong Qi, Cecile Paris, Preslav Nakov, and Zhuohan Xie\. 2026\.Multi\-sourced, multi\-agent evidence retrieval for fact\-checking\.*arXiv preprint arXiv:2603\.00267*\.
- Guo et al\. \(2024\)Qingyan Guo, Rui Wang, Junliang Guo, Bei Li, Kaitao Song, Xu Tan, Guoqing Liu, Jiang Bian, and Yujiu Yang\. 2024\.Connecting large language models with evolutionary algorithms yields powerful prompt optimizers\.In*International Conference on Learning Representations*, volume 2024, pages 34133–34156\.
- Holtzman et al\. \(2021\)Ari Holtzman, Peter West, Vered Shwartz, Yejin Choi, and Luke Zettlemoyer\. 2021\.Surface form competition: Why the highest probability answer isn’t always right\.In*Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing*, pages 7038–7051\.
- Jin et al\. \(2024\)Zhijing Jin, Jiarui Liu, Zhiheng Lyu, Mrinmaya Sachan, Rada Mihalcea, Mona Diab, David Ha, and 1 others\. 2024\.Can large language models infer causation from correlation?In*International Conference on Learning Representations*, volume 2024, pages 28663–28679\.
- Keysers et al\. \(2019\)Daniel Keysers, Nathanael Schärli, Nathan Scales, Hylke Buisman, Daniel Furrer, Sergii Kashubin, Nikola Momchev, Danila Sinopalnikov, Lukasz Stafiniak, Tibor Tihon, and 1 others\. 2019\.Measuring compositional generalization: A comprehensive method on realistic data\.*arXiv preprint arXiv:1912\.09713*\.
- Khattab et al\. \(2024\)Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Saiful Haq, Ashutosh Sharma, Thomas Joshi, Hanna Moazam, Heather Miller, and 1 others\. 2024\.Dspy: compiling declarative language model calls into state\-of\-the\-art pipelines\.In*International Conference on Learning Representations*, volume 2024, pages 54928–54958\.
- Lester et al\. \(2021\)Brian Lester, Rami Al\-Rfou, and Noah Constant\. 2021\.The power of scale for parameter\-efficient prompt tuning\.In*Proceedings of the 2021 conference on empirical methods in natural language processing*, pages 3045–3059\.
- Li and Liang \(2021\)Xiang Lisa Li and Percy Liang\. 2021\.Prefix\-tuning: Optimizing continuous prompts for generation\.In*Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\)*, pages 4582–4597\.
- Liu et al\. \(2024\)Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, and 1 others\. 2024\.Deepseek\-v3 technical report\.*arXiv preprint arXiv:2412\.19437*\.
- Liu et al\. \(2025\)Xiaoyu Liu, Paiheng Xu, Junda Wu, Jiaxin Yuan, Yifan Yang, Yuhang Zhou, Fuxiao Liu, Tianrui Guan, Haoliang Wang, Tong Yu, and 1 others\. 2025\.Large language models and causal inference in collaboration: A comprehensive survey\.*Findings of the Association for Computational Linguistics: NAACL 2025*, pages 7668–7684\.
- Lu et al\. \(2022\)Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp\. 2022\.Fantastically ordered prompts and where to find them: Overcoming few\-shot prompt order sensitivity\.In*Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 8086–8098\.
- Ma \(2025\)Jing Ma\. 2025\.Causal inference with large language model: A survey\.*Findings of the Association for Computational Linguistics: NAACL 2025*, pages 5886–5898\.
- Maasch et al\. \(2025\)Jacqueline RMA Maasch, Alihan Hüyük, Xinnuo Xu, Aditya V Nori, and Javier Gonzalez\. 2025\.Compositional causal reasoning evaluation in language models\.*arXiv preprint arXiv:2503\.04556*\.
- \(32\)Nandana Mihindukulasooriya, Niharika S D’Souza, Faisal Chowdhury, and Horst Samulowitz\.Automatic prompt optimization for knowledge graph construction: Insights from an empirical study\.*Proceedings of the VLDB Endowment\. ISSN*, 2150:8097\.
- Min et al\. \(2022\)Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer\. 2022\.Rethinking the role of demonstrations: What makes in\-context learning work?In*Proceedings of the 2022 conference on empirical methods in natural language processing*, pages 11048–11064\.
- Opsahl\-Ong et al\. \(2024a\)Krista Opsahl\-Ong, Michael J Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, and Omar Khattab\. 2024a\.Optimizing instructions and demonstrations for multi\-stage language model programs\.In*Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing*, pages 9340–9366\.
- Opsahl\-Ong et al\. \(2024b\)Krista Opsahl\-Ong, Michael J Ryan, Josh Purtell, David Broman, Christopher Potts, Matei Zaharia, and Omar Khattab\. 2024b\.Optimizing instructions and demonstrations for multi\-stage language model programs\.In*Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing*, pages 9340–9366\.
- Prasad et al\. \(2023\)Archiki Prasad, Peter Hase, Xiang Zhou, and Mohit Bansal\. 2023\.Grips: Gradient\-free, edit\-based instruction search for prompting large language models\.In*Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics*, pages 3845–3864\.
- Pryzant et al\. \(2023\)Reid Pryzant, Dan Iter, Jerry Li, Yin Lee, Chenguang Zhu, and Michael Zeng\. 2023\.Automatic prompt optimization with “gradient descent” and beam search\.In*Proceedings of the 2023 conference on empirical methods in natural language processing*, pages 7957–7968\.
- Reimers and Gurevych \(2019\)Nils Reimers and Iryna Gurevych\. 2019\.Sentence\-bert: Sentence embeddings using siamese bert\-networks\.In*Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing \(EMNLP\-IJCNLP\)*, pages 3982–3992\.
- Robins et al\. \(2000\)James M Robins, Miguel Angel Hernan, and Babette Brumback\. 2000\.Marginal structural models and causal inference in epidemiology\.
- Rosenbaum and Rubin \(1983\)Paul R Rosenbaum and Donald B Rubin\. 1983\.The central role of the propensity score in observational studies for causal effects\.*Biometrika*, 70\(1\):41–55\.
- Roy and Roth \(2015\)Subhro Roy and Dan Roth\. 2015\.Solving general arithmetic word problems\.In*Proceedings of the 2015 conference on empirical methods in natural language processing*, pages 1743–1752\.
- Sclar et al\. \(2024\)Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr\. 2024\.Quantifying language models’ sensitivity to spurious features in prompt design or: How i learned to start worrying about prompt formatting\.In*International Conference on Learning Representations*, volume 2024, pages 25055–25083\.
- Shin et al\. \(2020\)Taylor Shin, Yasaman Razeghi, Robert L Logan Iv, Eric Wallace, and Sameer Singh\. 2020\.Autoprompt: Eliciting knowledge from language models with automatically generated prompts\.In*Proceedings of the 2020 conference on empirical methods in natural language processing \(EMNLP\)*, pages 4222–4235\.
- Singhal et al\. \(2026\)Rahul Singhal, Pradyumna Tambwekar, and Karime Maamari\. 2026\.Prefpo: Pairwise preference prompt optimization\.*arXiv preprint arXiv:2603\.19311*\.
- Susnjak \(2026\)Teo Susnjak\. 2026\.A reproducible optimisation protocol for calibrating prompt\-based large language model workflows in evidence synthesis\.*arXiv preprint arXiv:2605\.06937*\.
- Suzgun et al\. \(2023\)Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, and 1 others\. 2023\.Challenging big\-bench tasks and whether chain\-of\-thought can solve them\.In*Findings of the Association for Computational Linguistics: ACL 2023*, pages 13003–13051\.
- Talmor et al\. \(2019\)Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant\. 2019\.Commonsenseqa: A question answering challenge targeting commonsense knowledge\.In*Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\)*, pages 4149–4158\.
- Wager and Athey \(2018\)Stefan Wager and Susan Athey\. 2018\.Estimation and inference of heterogeneous treatment effects using random forests\.*Journal of the American Statistical Association*, 113\(523\):1228–1242\.
- Wan et al\. \(2024\)Xingchen Wan, Ruoxi Sun, Hootan Nakhost, and Sercan Ö Arık\. 2024\.Teach better or show smarter? on instructions and exemplars in automatic prompt optimization\.*Advances in Neural Information Processing Systems*, 37:58174–58244\.
- Wang et al\. \(2024\)Ziao Wang, Xiaofeng Zhang, and Hongwei Du\. 2024\.Beyond what if: Advancing counterfactual text generation with structural causal modeling\.In*IJCAI*, pages 6522–6530\.
- Webson and Pavlick \(2022\)Albert Webson and Ellie Pavlick\. 2022\.Do prompt\-based models really understand the meaning of their prompts?In*Proceedings of the 2022 conference of the north american chapter of the association for computational linguistics: Human language technologies*, pages 2300–2344\.
- Wei et al\. \(2022\)Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, and 1 others\. 2022\.Chain\-of\-thought prompting elicits reasoning in large language models\.*Advances in neural information processing systems*, 35:24824–24837\.
- Yang et al\. \(2025\)An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, and 1 others\. 2025\.Qwen3 technical report\.*arXiv preprint arXiv:2505\.09388*\.
- Yang et al\. \(2024\)Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen\. 2024\.Large language models as optimizers\.In*International Conference on Learning Representations*, volume 2024, pages 12028–12068\.
- Yuksekgonul et al\. \(2024\)Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou\. 2024\.Textgrad: Automatic" differentiation" via text\.*arXiv preprint arXiv:2406\.07496*\.
- Zhang et al\. \(2022\)Tianjun Zhang, Xuezhi Wang, Denny Zhou, Dale Schuurmans, and Joseph E Gonzalez\. 2022\.Tempera: Test\-time prompting via reinforcement learning\.*arXiv preprint arXiv:2211\.11890*\.
- Zhang et al\. \(2026\)Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, and Peiyang He\. 2026\.Prompt optimization is a coin flip: Diagnosing when it helps in compound ai systems\.*arXiv preprint arXiv:2604\.14585*\.
- Zhao et al\. \(2021\)Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh\. 2021\.Calibrate before use: Improving few\-shot performance of language models\.In*International conference on machine learning*, pages 12697–12706\. Pmlr\.
- Zhou et al\. \(2022\)Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba\. 2022\.Large language models are human\-level prompt engineers\.In*The eleventh international conference on learning representations*\.

## Appendix AStatistical Significance \(Uncorrectedp\-values, BH\-FDR Correction\)

Our analysis involves multiple simultaneous hypothesis tests across prompt features, edit motifs, and dataset/task groups\. In such settings, interpreting raw statistical significance without correction can substantially inflate the probability of false discoveries\. This appendix clarifies the distinction between uncorrectedp\-values and Benjamini–Hochberg false discovery rate \(BH\-FDR\) corrected results used throughout the paper\.

### A\.1Uncorrectedp\-values

For each feature–task\-group combination, we estimate an IPTW\-adjusted associational effect \(ACMGD\) and compute a corresponding hypothesis test:

H0:ACMGD=0,H\_\{0\}:\\mathrm\{ACMGD\}=0,\(1\)where the null hypothesis assumes no adjusted association between the optimizer\-induced edit regime and performance gain\.

The reported uncorrectedp\-value measures the probability of observing an effect at least as extreme as the estimated one under the null hypothesis:

p=Pr⁡\(\|T\|≥\|Tobs\|∣H0\),p=\\Pr\\left\(\|T\|\\geq\|T\_\{\\mathrm\{obs\}\}\|\\mid H\_\{0\}\\right\),\(2\)whereTTdenotes the test statistic computed using block\-bootstrap uncertainty estimates\.

Throughout the paper, an uncorrected significance marker

indicates that the corresponding association is statistically significant when considered as an isolated test\.

However, because our analysis evaluates many hypotheses simultaneously \(60 feature×\\timestask\-group tests in Table 2 and 15 motif×\\timesgroup tests in Table 4\), interpreting uncorrectedp\-values alone would inflate the probability of false positives due to the multiple comparisons problem\.

### A\.2Multiple Comparisons Problem

Supposemmindependent null hypotheses are tested at significance levelα=0\.05\\alpha=0\.05\. Even if all null hypotheses are true, the expected number of false positives is:

For example, withm=60m=60tests, approximately

false discoveries are expected purely by chance under naive thresholding\.

Consequently, some apparently significant associations may arise from stochastic variation rather than systematic optimizer behavior\.

### A\.3Benjamini–Hochberg False Discovery Rate Correction

To mitigate inflated false positives, we apply the Benjamini–Hochberg \(BH\) false discovery rate correction\(Benjamini and Hochberg,[1995](https://arxiv.org/html/2605.26655#bib.bib3)\)across each family of simultaneous tests\.

Unlike family\-wise error rate procedures \(e\.g\., Bonferroni correction\), which control the probability of*any*false positive, BH\-FDR controls the expected proportion of false discoveries among all rejected hypotheses:

FDR=𝔼\[\#false discoveries\#total discoveries\]\.\\mathrm\{FDR\}=\\mathbb\{E\}\\left\[\\frac\{\\\#\\text\{ false discoveries\}\}\{\\\#\\text\{ total discoveries\}\}\\right\]\.\(4\)
This criterion is substantially less conservative and is therefore widely used in exploratory high\-dimensional analyses where moderate discovery power is desirable\.

### A\.4BH Procedure

Givenmmhypothesis tests with orderedp\-values

p\(1\)≤p\(2\)≤⋯≤p\(m\),p\_\{\(1\)\}\\leq p\_\{\(2\)\}\\leq\\cdots\\leq p\_\{\(m\)\},\(5\)the BH procedure compares each orderedp\-value against the adaptive threshold:

p\(i\)≤imα,p\_\{\(i\)\}\\leq\\frac\{i\}\{m\}\\alpha,\(6\)where:

- •iiis the rank of the orderedp\-value,
- •mmis the total number of simultaneous tests,
- •α\\alphais the target false discovery rate \(0\.05 in this paper\)\.

The largest index satisfying the inequality is identified, and all hypotheses with smaller\-rankedp\-values are declared significant after correction\.

The resulting BH\-adjusted significance level is reported as aq\-value throughout the paper\.

### A\.5Interpretation in This Work

We distinguish between two evidence tiers:

#### Tier 1 \(Confirmatory\)\.

Associations surviving BH\-FDR correction:

are treated as statistically robust findings\. These effects remain significant even after accounting for the full multiple\-testing burden\.

#### Tier 2 \(Exploratory\)\.

Associations satisfying:

but not surviving BH correction are reported as directional or corroborative patterns only\. These findings may still reflect meaningful optimizer behavior, but they carry a substantially higher false discovery risk and therefore do not independently support strong inferential claims\.

Our analysis evaluates a large number of simultaneous hypotheses across prompt features, edit motifs, task groups, and representation spaces\. In such exploratory settings, relying solely on uncorrectedp\-values would substantially increase the probability of reporting false\-positive associations that arise purely from random variation\. BH\-FDR correction provides a principled compromise between statistical rigor and discovery sensitivity: rather than controlling the probability of any false positive, it controls the expected proportion of false discoveries among all reported significant findings\. This is particularly suitable for our setting because the goal of the analysis is not to establish a small number of tightly controlled confirmatory effects, but to characterize heterogeneous optimizer behaviors across multiple complementary views\. Accordingly, we treat BH\-corrected findings \(q<0\.05q<0\.05\) as confirmatory evidence, while uncorrected findings \(p<0\.05p<0\.05only\) are explicitly framed as exploratory and hypothesis\-generating observations\. This distinction reduces the risk of over\-interpreting noisy associations while still preserving sufficient statistical power to identify potentially meaningful optimizer behavior patterns\.

## Appendix BConditional Independence Structure \(Exploratory\)

We pool prompt observations per task type \(NNranging from 30 to 92\) and apply:

- •PC algorithm\(constraint\-based, Fisher\-Z test,α=0\.05\\alpha=0\.05\) on feature vectors
- •DirectLiNGAM\(non\-Gaussian directed acyclic graph estimation\)

#### Results\.

Average cross\-task Jaccard similarity=0\.249=0\.249—task types share only∼\\sim25% of their inferred graph edges, confirming structural heterogeneity\. Pairwise similarity ranges from 0\.053 \(commonsense vs\. math\) to 0\.667 \(logical vs\. sequential\)\. Math tasks have the most distinctive structure\. These results are reported as exploratory analysis of conditional independence patterns; no mechanism claims are made\.

## Appendix CCausal Receptivity Score \(Descriptive\)

We construct a per\-task\-type receptivity score as the weighted sum of BH\-significant and directionally consistent CATE estimates \(positive sign = optimizer\-compatible\)\. Spearmanρ=−0\.866\\rho=\-0\.866between receptivity and optimizer success rate at the task\-type level \(n=5n=5,p=0\.058p=0\.058\)\. This correlation is descriptive and based on five task\-type aggregates; it does not constitute a validated predictive diagnostic\. LOO\-CV classifier AUC=0\.260=0\.260\(below chance\) confirms that the receptivity score does not generalize to held\-out task types at current sample sizes\. We report it as a descriptive summary only\.

## Appendix DSBERT Edit\-Direction Sensitivity Analysis

We compute SBERT\(Reimers and Gurevych,[2019](https://arxiv.org/html/2605.26655#bib.bib38)\)embedding differencesΔ𝐞=𝐞\(p2\)−𝐞\(p1\)\\Delta\\mathbf\{e\}=\\mathbf\{e\}\(p\_\{2\}\)\-\\mathbf\{e\}\(p\_\{1\}\)for each prompt pair, then project onto the first 20 principal components via PCA\. For each component, we compute SIPW\-CATE stratified by task type\.

Results: 10/20 PCA components show sign reversals \(spread\>0\.05\>0\.05\) across at least two task types\. PC7 shows the clearest pattern: commonsense\+0\.061∗\+0\.061^\{\*\}, math−0\.098∗\-0\.098^\{\*\}, sequential\+0\.084∗\+0\.084^\{\*\}\(spread=0\.181=0\.181\)\. This embedding\-level evidence is consistent with the annotation\-free robustness findings in §[5](https://arxiv.org/html/2605.26655#S5)and provides an additional representation\-independent corroboration\.

## Appendix EFull LOO Stability Table

Table[5](https://arxiv.org/html/2605.26655#A5.T5)reports all 18 LOO splits covering six headline sign reversals\. A split is “stable” \(✓\) if the sign reversal direction matches the full\-data baseline\.

Table 5:Full LOO stability table\. “comm\.” = commonsense, “seq\.” = sequential\. Three failures involve date\_understanding or disambiguation\_qa\.
## Appendix FPer\-Dataset Motif Effect Breakdown

Table[6](https://arxiv.org/html/2605.26655#A6.T6)reports naive \(unweighted\) CATE for chain\_of\_thought and meta\_instruction insertion by dataset\.

Table 6:Naive CATE for chain\_of\_thought and meta\_instruction insertion by dataset\. The strongest per\-dataset effects are for MultiArith \(both motifs\)\. The math task\-type result is partially driven by MultiArith; task\-type conclusions should be interpreted with this heterogeneity in mind\.
## Appendix GPer\-Dataset Robustness for Tier 1 Findings

Table[7](https://arxiv.org/html/2605.26655#A7.T7)reports per\-dataset estimates for all four BH\-corrected Tier 1 associations\. For each finding, we report the IPTW\-adjusted ACMGD, treated/control counts, and sign consistency across datasets within the task group\.

FindingFeature/MotifGroupDatasetACMGDnTn\_\{T\}nCn\_\{C\}Sign stable?\(a\) GPT\-featureExtraneous\_loadsequentiallast\_letters−0\.060★\-0\.060^\{\\bigstar\}412381✓\(only dataset\)\(b\) GPT\-featureMetacognitionsequentiallast\_letters\+0\.062★\+0\.062^\{\\bigstar\}398395✓\(only dataset\)\(c\) Motifmeta\_instructionmathGSM8K−0\.019\-0\.01914189✓\(negative\)mathMultiArith−0\.277\-0\.27764175✓\(negative, dominant\)\(d\) Motifclarity\_constraintlogicalboolean\_expr\.−0\.099\-0\.099112565✓\(negative\)logicalcoin\_flip−0\.071\-0\.07197521✓\(negative\)Table 7:Per\-dataset breakdown of all four Tier 1 BH\-corrected associations \(★\\bigstar= BH\-correctedq<0\.05q<0\.05from full\-data analysis\)\.nTn\_\{T\}= treated,nCn\_\{C\}= control\. Sequential results are from last\_letters alone; math motif results show MultiArith heterogeneity; logical motif results are consistent across both logical datasets\.
## Appendix HPrompt Feature Extraction

Below we provide the prompt template used inPrompt Feature Extraction\.

Prompt: Prompt Feature Extraction\{ "role": "user", "content": "What are the measurable and improveable textual features of the instructions generated above \{sample instruction\}, for solving the ask of solving problems? Make sure these features are independent of each other and not confounded\. Give the answers directly without preparatory statements\. " \}, \{ "role": "assistant", "content": str\(\(features\), \}, \{ "role": "user", "content": "According to the order of the factors: \{features\}", score the \{instructions\} with 1 to 10\. The final result must be a string of scores separated by commas\. Give the answer directly without preparatory statements\." \}

## Appendix IDiscussion

### I\.1Evidence Hierarchy

We use a three\-tier structure to prevent misreading:

Tier 1\. Four BH\-corrected associations across two analysis families\.From the GPT\-4o feature analysis: \(a\) Extraneous\_load/sequential: ACMGD =−0\.060★\-0\.060^\{\\bigstar\}, 100% LOO\-stable; \(b\) Metacognition/sequential: ACMGD =\+0\.062★\+0\.062^\{\\bigstar\}, 67% LOO\-stable\. From the motif analysis: \(c\) meta\_instruction⊗\\otimesmath: ACMGD =−0\.103★\-0\.103^\{\\bigstar\}, partially concentrated in MultiArith; \(d\) clarity\_constraint⊗\\otimeslogical: ACMGD =−0\.083★\-0\.083^\{\\bigstar\}\. The sequential effects involve last\_letters only; the math effect involves two datasets with heterogeneous within\-group estimates\. These are statistical associations, not causal effects\.

Tier 2\. Corroborated but exploratory\.Math\-vs\-sequential directional reversals in surface features \(n\_demos, step\_words\); chain\-of\-thought direction pattern consistent across representations; TextGrad directional corroboration\. These are consistent with Tier 1 but do not independently support inference\.

Optimizer\-design suggestions \(task\-conditioned edit priors, complexity penalties, validation gates\); causal mechanisms for the meta\-instruction effect; generalization to unseen task types\. Grounded in the observed associations but require controlled validation before implementation\.

### I\.2Observational Edit\-Effect Heterogeneity

Across three representations of prompt edits, i\.e\., annotated cognitive\-load features, deterministic surface statistics, and literal edit motifs, a consistent directional pattern emerges: optimizer\-induced edits that add meta\-instruction language \(“make sure to”, “do not”\) and structural scaffolding \(chain\-of\-thought framing\) are*negatively*associated with math performance \(BH\-corrected for meta\-instruction\), while logical and sequential task groups tend to benefit from step\-by\-step and structural features\.

#### What this paper establishes\.

The associations between edit types and dataset/task\-group performance are directionally consistent across three annotation\-independent representations, survive IPTW adjustment, and are robust to leave\-one\-dataset\-out exclusion for the BH\-corrected effects\. This consistency is compatible with an optimizer\-task mismatch account, though residual confounding cannot be ruled out \(see §[4\.1](https://arxiv.org/html/2605.26655#S4.SS1)for specific unmeasured confounders\)\.

#### What this paper does not establish\.

Whether these associations are causal\. Conditional exchangeability is not fully defensible; treatments are bundles; interventional validation at scale was not possible\. These results characterize optimizer behavior patterns rather than definitively explaining failure\.

### I\.3The Meta\-Instruction Effect in Math

The strongest FDR\-corrected finding is the negative association between meta\-instruction insertions and math dataset/task\-group performance \(ACMGD=−0\.103★=\-0\.103^\{\\bigstar\}, partially concentrated in MultiArith, see Appendix[F](https://arxiv.org/html/2605.26655#A6)\)\. This is notable because meta\-instructions \(“make sure to show your work”, “do not skip steps”, “you must include a final answer”\) might seem helpful for math\. One hypothesis: models fine\-tuned on math instruction\-following may have already internalized these guidelines; redundant meta\-instruction may increase prompt complexity without benefit\. A competing explanation consistent with the unmeasured confounders in §[4\.1](https://arxiv.org/html/2605.26655#S4.SS1): optimizers may tend to insert meta\-instructions specifically when prior prompts are struggling, so the association partly reflects selection into struggling trajectories rather than a direct effect\. Targeted interventional validation would be required to distinguish these accounts\.

The chain\-of\-thought directional reversal \(negative for math groups, positive for sequential⊗\\otimeslast\_letters\) is consistent across representations \(Tier 2\) but does not survive FDR correction\. The difference from the well\-established finding that CoT helps math\(Wei et al\.,[2022](https://arxiv.org/html/2605.26655#bib.bib52)\)likely reflects the distinction between*optimizer\-generated CoT meta\-instructions*in system prompts \(vague “reason step by step” phrases\) vs\.*user\-designed structured CoT*with explicit answer markers\.

#### Sanity check: Structured CoT intervention\.

To verify that the chain\-of\-thought motif finding is not simply an artifact of optimizer trajectories containing CoT\-style content without actual instruction\-level changes, we ran a minimal controlled sanity check using Qwen2\-VL\-7B\-Instruct on 25 GSM8K and 25 last\_letters questions: base prompt vs\. base \+ structured CoT with explicit “\#\#\#\# answer” markers\. Results: GSM8K base=28%=28\\%→\\toCoT=60%=60\\%\(Δ=\+32%\\Delta=\+32\\%, McNemarχ2=4\.90\\chi^\{2\}=4\.90,p=0\.027p=0\.027\); last\_letters base=0%=0\\%→\\toCoT=4%=4\\%\(Δ=\+4%\\Delta=\+4\\%, NS\)\. Explicit structured CoT helps math, consistent withWei et al\. \([2022](https://arxiv.org/html/2605.26655#bib.bib52)\)—confirming that our chain\-of\-thought motif estimate \(−0\.048\-0\.048for math groups, uncorrected\) does not reflect a general CoT\-hurts\-math phenomenon\.*Caution*: This sanity check does not validate or invalidate the observational motif estimate; it shows only that the motif labels do not correspond to clean user\-designed structured interventions\. The experiment used one model and 25 questions per task; generalization is not established\.

### I\.4Benchmark Sensitivity

Table[8](https://arxiv.org/html/2605.26655#A9.T8)summarizes leave\-one\-dataset\-out stability for six headline sign reversals across 18 LOO splits\.15/18 \(83%\)preserve the sign reversal from the full\-data baseline, exceeding our 80% success criterion\. The two BH\-corrected effects \(Extraneous\_load and Demos\) are 100% stable\. The word\_count multihop reversal appears driven by date\_understanding and should be treated with caution\.

Table 8:Leave\-one\-dataset\-out \(LOO\) stability of headline sign reversals\. Full LOO table in Appendix[E](https://arxiv.org/html/2605.26655#A5)\.
### I\.5Implications

#### For practitioners \(Tier 3 — hypothesis\-generating\)\.

The ACMGD tables in this paper serve as a preliminary diagnostic: before running an expensive optimizer, consider whether the target task/dataset resembles groups \(math\-like, multihop\-like\) where meta\-instruction and complexity\-increasing edits show negative associations with gains\. This does not guarantee failure, but the BH\-corrected associations are consistent across frameworks, representations, and LOO folds\. These suggestions are*Tier 3 hypotheses*derived from observational associations; implementation without controlled validation would be premature\.

#### Concrete optimizer\-design hypotheses\.

\(1\) Task\-conditioned edit priors: Optimizers could down\-weight meta\-instruction insertion for arithmetic or multi\-step reasoning tasks, which are implementable as a regex\-based edit filter or a classifier over proposed edits\.\(2\) Complexity\-penalized search: The BH\-corrected meta\_instruction⊗\\otimesmath association \(−0\.103★\-0\.103^\{\\bigstar\}\) suggests complexity\-penalizing objectives may be worth piloting for math\-type benchmarks\.\(3\) Validation gates: Before accepting an optimizer\-proposed edit, evaluate on a held\-out subset classified by task group, which is motivated by but not validated in this paper\.

These are mechanism\-level hypotheses, not established causal rules\.
Why Prompt Optimization Works, and Why It Sometimes Doesn't: A Causal-Inspired Edit-Level Analysis

Similar Articles

Self-Supervised Prompt Optimization

One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation

Structured Prompt Optimization Meets Reinforcement Learning for Global and Local Interpretability over Complex Text

PromptAudit: Auditing Prompt Sensitivity in LLM-Based Vulnerability Detection

Probing the Prompt KV Cache: Where It Becomes Dispensable

Submit Feedback

Similar Articles

Self-Supervised Prompt Optimization
One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation
Structured Prompt Optimization Meets Reinforcement Learning for Global and Local Interpretability over Complex Text
PromptAudit: Auditing Prompt Sensitivity in LLM-Based Vulnerability Detection
Probing the Prompt KV Cache: Where It Becomes Dispensable