RECAP: Regression Evaluation for Continual Adaptation of Prompts
Summary
Introduces RECAP, a benchmark for evaluating continual learning of prompts under evolving constraints in a proactive adaptation setting. Results show that existing prompt optimization methods fail in this setting, highlighting the need for new methods.
# RECAP: Regression Evaluation for Continual Adaptation of Prompts
Source: [https://arxiv.org/html/2606.06698](https://arxiv.org/html/2606.06698)
###### Abstract
Production agentic systems routinely face evolving constraints and must comply from the very next interaction. Scenarios such as a tool-call notification changing a compliance threshold or a policy update adding disclosure requirements fit this description and leave almost no room for error in production. This *proactive* adaptation setting is common in deployment but absent from current benchmarks, which assume either static constraint sets or reactive protocols with evaluation feedback. We introduce RECAP, a benchmark that measures continual-learning phenomena (forgetting, regression, forward transfer) at the constraint level under a strictly proactive adapt-then-test protocol: prompt optimization methods receive only the constraint specification and must generalize before seeing any test data. Evaluating six methods across four LLMs and three schedules with evolving constraints, we find that these methods show no significant improvement in performance, even while incurring higher latency. Designed for offline or reactive settings, they are inadequate for the proactive paradigm. Our work emphasizes the growing need for proactive prompt adaptation methods under which models remain robust to evolving needs in deployment.
Harsh Deshpande, Kushal Chawla, Sangwoo Cho, William Campbell {harsh.deshpande2}@capitalone.com
## 1 Introduction
Agentic systems in production operate under constraints that evolve continuously: a tool-call response tightens a length limit, a policy update adds a disclosure requirement, or a user preference changes the expected tone. The system must satisfy new constraints immediately while continuing to respect all prior ones. Further, these constraints are often not centrally documented: they accumulate from individual tool-call responses, user interactions, and personalization settings across many users (Ye et al., [2026](https://arxiv.org/html/2606.06698#bib.bib35)), making it unrealistic to collect the full active set and optimize jointly each time one changes (Banerjee et al., [2025](https://arxiv.org/html/2606.06698#bib.bib36)).
Aligned with the EMNLP 2026 theme of rethinking evaluation beyond static benchmarks, we argue that progress in deployed agents must account for longitudinal behavior under evolving specifications. We focus on *proactive* adaptation, where the system receives only a constraint specification and must comply before seeing any real test data or feedback, with minimal latency overhead. The setting is ubiquitous in deployment, yet entirely absent from current evaluation paradigms. Instruction-following benchmarks (Zhou et al., [2023](https://arxiv.org/html/2606.06698#bib.bib4); Guo et al., [2026](https://arxiv.org/html/2606.06698#bib.bib1); Jiang et al., [2024](https://arxiv.org/html/2606.06698#bib.bib20); Qin et al., [2024](https://arxiv.org/html/2606.06698#bib.bib30)) present a fixed constraint set and measure single-shot success, with no mechanism for constraints to evolve over time. Prompt optimization methods (Yuksekgonul et al., [2025](https://arxiv.org/html/2606.06698#bib.bib31); Yang et al., [2024](https://arxiv.org/html/2606.06698#bib.bib32); Khattab et al., [2024](https://arxiv.org/html/2606.06698#bib.bib8); Opsahl-Ong et al., [2024](https://arxiv.org/html/2606.06698#bib.bib34)) do address iterative improvement, but assume access to representative evaluation data and multiple rounds of feedback, which are not available in the proactive setting. Reactive protocols like ACE (Zhang et al., [2026](https://arxiv.org/html/2606.06698#bib.bib2)) allow iterative debugging on observed failures, but again require test-time feedback to drive adaptation. The natural framework for evolving constraints is continual learning (CL), which studies how systems adapt to new tasks without forgetting prior ones (De Lange et al., [2021](https://arxiv.org/html/2606.06698#bib.bib24); Shi et al., [2024](https://arxiv.org/html/2606.06698#bib.bib17); Wu et al., [2024](https://arxiv.org/html/2606.06698#bib.bib19)).
However, existing CL methods operate on model weights through regularization (Kirkpatrick et al., [2016](https://arxiv.org/html/2606.06698#bib.bib15)), replay (Lopez-Paz and Ranzato, [2017](https://arxiv.org/html/2606.06698#bib.bib6)), or prompt embeddings (Wang et al., [2022b](https://arxiv.org/html/2606.06698#bib.bib27), [a](https://arxiv.org/html/2606.06698#bib.bib28)) and do not address prompt-level text constraints, where the model weights are frozen and adaptation must happen entirely through the input. None evaluates the proactive case, where methods must generalize from the specification alone.
We present RECAP: Regression Evaluation for Continual Adaptation of Prompts, a benchmark that extends the evaluation of constraint satisfaction beyond static settings. RECAP performs continual evaluation on schedules subject to evolving constraints with *add*, *edit*, and *delete* operations. This enables a rigorous evaluation of recent prompt adaptation methods under a proactive protocol. We summarize our contributions below:
1. We design a constraint-level CL benchmark that converts static instruction-following datasets into temporal evaluation streams via typed operations under a proactive protocol, where methods receive only the constraint specification and must generalize immediately (§[2](https://arxiv.org/html/2606.06698#S2)).
2. We develop a decomposed metric suite measuring constraint satisfaction, regression, edit uptake (are modified constraints adopted?), unlearning fidelity (are deleted constraints forgotten?), and efficiency (§[2](https://arxiv.org/html/2606.06698#S2)).
3. We provide empirical evidence that existing prompt adaptation methods are structurally inadequate for the proactive paradigm, and discuss failure modes to guide future work (§[3](https://arxiv.org/html/2606.06698#S3) and §[4](https://arxiv.org/html/2606.06698#S4)).
## 2 Methodology
Source Data: In the proactive setting, constraints evolve independently of the base task. This requires source data in which constraints are separated from base instructions. We build on RECAST-30K (Guo et al., [2026](https://arxiv.org/html/2606.06698#bib.bib1)), which is based on Tulu 3 Persona IF (Lambert et al., [2025](https://arxiv.org/html/2606.06698#bib.bib3)). The data contains base instructions (e.g., ‘Write a cover letter for a data-science role’) paired with one or more constraints (e.g., ‘keep under 200 words’ or ‘mention Python at least 3 times’). Constraints are grouped into semantic *types* (Length, Keyword, Format, Tone, etc.), each with one or more concrete values (e.g., the maximum length can be 200 or 300 words). We use 21 constraint types in total, of which 8 have deterministic rule-based validators and 13 require LLM judgment (Appendix [B](https://arxiv.org/html/2606.06698#A2)).
Operations and Shadow Evaluation: RECAP transforms this static dataset of instructions and constraints into a temporal evaluation stream by defining *schedules* of evolving constraints. At each step in a schedule, we apply one of three operations: 1) *Add* introduces a new constraint type, 2) *Edit* replaces the concrete value of an existing type, and 3) *Delete* removes a constraint type entirely. A *schedule* consists of a sequence of operations over 15–20 steps (Appendix [C](https://arxiv.org/html/2606.06698#A3)), controlling which constraints are introduced, modified, or removed, and in what order. A key question is whether adapting to a new or modified constraint causes interference with previously satisfied ones. To measure this after edits and deletions, we retain the old constraint as a *shadow* in the evaluation set at all subsequent steps: responses continue to be checked against the replaced or removed specification even though the LLM no longer sees it. This enables tracking of edit persistence (does the model revert to old behavior over time?) and unlearning rebound (does a deleted constraint resurface?).
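The bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's released code: the `ConstraintState` class, its field names, and the example values are all hypothetical. It only shows how add, edit, and delete change the active set that the model sees, while edits and deletes leave the old specification behind as a shadow for evaluation.

```python
from dataclasses import dataclass, field


@dataclass
class ConstraintState:
    """Hypothetical container: constraints shown to the model plus evaluation-only shadows."""
    active: dict = field(default_factory=dict)   # constraint type -> current value
    shadows: list = field(default_factory=list)  # (type, retired value, step retired)

    def apply(self, step, op, ctype, value=None):
        if op == "add":
            if ctype in self.active:
                raise ValueError(f"{ctype} is already active")
            self.active[ctype] = value
        elif op == "edit":
            # The replaced value stays in the evaluation set as a shadow.
            self.shadows.append((ctype, self.active[ctype], step))
            self.active[ctype] = value
        elif op == "delete":
            # The removed constraint is retained as a shadow as well.
            self.shadows.append((ctype, self.active.pop(ctype), step))
        else:
            raise ValueError(f"unknown operation: {op}")


state = ConstraintState()
state.apply(0, "add", "Length", "under 200 words")
state.apply(1, "add", "Tone", "formal")
state.apply(2, "edit", "Length", "under 500 words")
state.apply(3, "delete", "Tone")
print(state.active)   # what the model is told to satisfy
print(state.shadows)  # what is still checked silently
```

After these four steps the model sees only the edited Length constraint, while the original Length value and the deleted Tone constraint remain available for shadow evaluation.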
RECAP Protocol: We adopt the adapt-then-test protocol from CL evaluation (Lopez-Paz and Ranzato, [2017](https://arxiv.org/html/2606.06698#bib.bib6); Chaudhry et al., [2018](https://arxiv.org/html/2606.06698#bib.bib5), [2019](https://arxiv.org/html/2606.06698#bib.bib22); De Lange et al., [2021](https://arxiv.org/html/2606.06698#bib.bib24)). At each step, the method first adapts to the new constraint operation and is then evaluated on all active constraints (see Figure [1](https://arxiv.org/html/2606.06698#S2.F1); pseudocode is in Appendix [A](https://arxiv.org/html/2606.06698#A1)). Adaptation is *proactive*: adapt() receives only the constraint specification (e.g., “edit Length: Keep under 500 words”) but no test prompts and no feedback from evaluation. Methods may use internal self-play during adaptation (generating and judging synthetic responses), but they never observe real test data or evaluation results from prior steps. The no-adaptation baseline (Base LLM) skips adaptation entirely and receives only the current active constraints appended to each user prompt at test time, making it a pure test of the LLM’s instruction-following ability.
[Figure 1 diagram, two panels. Production Agentic System: a tool-call notification (“Compliant threshold value: < 3000”) sends a spec to an ephemeral sandbox (self-evaluate and optimize), which serves live traffic (diverse real requests) with no feedback. RECAP Protocol (One Step): a constraint spec (“edit Length: <500 words”) goes to adapt() (sandbox: optimize on the new or modified constraint), then to test() (evaluation on diverse constraints and instructions) with no feedback.]

Figure 1: The RECAP protocol mirrors production deployment: a constraint specification arrives, the method adapts without access to test data, then is evaluated on all active constraints. This repeats over a multi-step schedule.

Metrics: Our primary metric is $\overline{\text{sat}}$, the mean constraint satisfaction rate across all types and steps (formula in Appendix [D](https://arxiv.org/html/2606.06698#A4)). However, $\overline{\text{sat}}$ by itself can mask important dynamics: a method might maintain mean satisfaction while silently regressing on prior constraints. We detect this with *peak forgetting* (Chaudhry et al., [2018](https://arxiv.org/html/2606.06698#bib.bib5)) (the maximum drop from a type’s prior peak) and *collateral damage* (the mean drop in non-targeted types after an operation). For edits, we define *edit switch*: the fraction of samples satisfying the new specification but not the old. For deletions, we ask whether the model appropriately stops satisfying a removed constraint: *unlearning fidelity* measures how quickly satisfaction reverts to its unconstrained default rate. We also report latency (Appendix [E](https://arxiv.org/html/2606.06698#A5)).
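To make three of these metrics concrete, the sketch below implements one plausible reading of the verbal definitions above. The exact formulas live in the paper's appendix and may differ in detail, so the function names, input shapes, and the choice to average signed drops for collateral damage are assumptions made only for illustration.

```python
def peak_forgetting(history):
    """Largest drop of one type's satisfaction below its best earlier value.

    `history` is that type's satisfaction rate at consecutive steps (assumed shape).
    """
    worst_drop, peak = 0.0, history[0]
    for rate in history[1:]:
        worst_drop = max(worst_drop, peak - rate)
        peak = max(peak, rate)
    return worst_drop


def collateral_damage(before, after, targeted):
    """Mean satisfaction drop over types that the operation did not target."""
    others = [t for t in before if t != targeted and t in after]
    if not others:
        return 0.0
    return sum(before[t] - after[t] for t in others) / len(others)


def edit_switch(samples):
    """Fraction of samples that satisfy the new value but not the old one.

    `samples` holds (satisfies_new, satisfies_old) pairs, one per response.
    """
    switched = sum(1 for new_ok, old_ok in samples if new_ok and not old_ok)
    return switched / len(samples)


print(peak_forgetting([0.5, 0.75, 0.5, 0.625]))
print(collateral_damage(
    {"Length": 0.75, "Tone": 0.5, "Format": 1.0},
    {"Length": 0.25, "Tone": 0.25, "Format": 1.0},
    targeted="Length",
))
print(edit_switch([(True, False), (True, True), (False, True), (True, False)]))
```

In the toy inputs, the Length series peaks at 0.75 and later falls to 0.5, Tone loses 0.25 after an operation aimed at Length, and two of four responses switch cleanly to the new specification.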
Methods: We evaluate 6 prompt adaptation methods, spanning few-shot (ICL), memory-based (Dynamic Cheatsheet (Suzgun et al., [2026](https://arxiv.org/html/2606.06698#bib.bib7))), and optimization-based (ACE, GEPA, MIPROv2 (Opsahl-Ong et al., [2024](https://arxiv.org/html/2606.06698#bib.bib34))) paradigms. The optimization methods use *self-play* during adapt(): the LLM generates a response to a synthetic prompt, then a second call judges whether the target constraint is satisfied. They differ in search strategy: ACE uses a generate–evaluate–reflect–curate pipeline (4 LLM calls/step); GEPA evolves a population via mutation and fitness selection (17 calls/step); MIPROv2 proposes diverse candidates informed by a score history (~11 calls/step). To mimic realistic dynamic settings, all methods are asked to optimize for the single new constraint at each step without access to the other active constraints.
## 3 Experimental Setup
We show aggregated results on 50 base prompts taken from RECAST using 3 schedules, each designed to assess a different aspect of continual adaptation. Interleaved-20 mixes 11 adds, 5 edits, and 4 deletes across 20 steps, testing whether methods handle concurrent accumulation and revision. Clustered-20 applies the same operations in phases (ADD → EDIT → DELETE blocks), testing whether batched operations amplify forgetting. Rule-Only-15 uses 6 rule-based types with deterministic validators in an interleaved structure, isolating genuine forgetting from any LLM judge noise (more details in Appendix [C](https://arxiv.org/html/2606.06698#A3)). We use 4 backbone LLMs: Llama-3.1-8B, Llama-3.3-70B, GPT-OSS-20B, and GPT-OSS-120B (Grattafiori et al., [2024](https://arxiv.org/html/2606.06698#bib.bib33); Agarwal et al., [2025](https://arxiv.org/html/2606.06698#bib.bib37)). Claude Sonnet 4.5 (Anthropic, [2025](https://arxiv.org/html/2606.06698#bib.bib38)) serves as the LLM judge for qualitative constraints. In total, we have 72 conditions (4 backbones × 6 methods × 3 schedules). Detailed hyperparameters are in Appendix [I](https://arxiv.org/html/2606.06698#A9.SS0.SSS0.Px2).
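As a quick sanity check on the experimental grid, the following sketch enumerates the conditions and the operation counts stated above. The labels are taken from the text; the list and variable names are illustrative, and the phase ordering of the clustered schedule follows the description in the schedule appendix.

```python
from collections import Counter
from itertools import product

backbones = ["Llama-3.1-8B", "Llama-3.3-70B", "GPT-OSS-20B", "GPT-OSS-120B"]
methods = ["Base LLM", "ICL", "Dynamic Cheatsheet", "ACE", "GEPA", "MIPROv2"]
schedules = ["Interleaved-20", "Clustered-20", "Rule-Only-15"]

# One condition per (backbone, method, schedule) triple.
conditions = list(product(backbones, methods, schedules))
print(len(conditions))

# Interleaved-20 and Clustered-20 share the same operation counts.
interleaved_ops = {"add": 11, "edit": 5, "delete": 4}
print(sum(interleaved_ops.values()))

# Clustered-20 runs the same operations in ADD -> EDIT -> DELETE blocks.
clustered = ["add"] * 11 + ["edit"] * 5 + ["delete"] * 4
print(Counter(clustered) == Counter(interleaved_ops))
```

The grid size and the step count both match the figures reported in the text (72 conditions, 20 steps).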
## 4 Results
Figure [2](https://arxiv.org/html/2606.06698#S4.F2) reports results for Llama-3.3-70B and GPT-OSS-120B. Results for other models and efficiency comparisons are in Appendix [E](https://arxiv.org/html/2606.06698#A5). The central finding is that no adaptation method achieves a significant improvement over the no-adaptation baseline (Base LLM) on any metric, for any backbone LLM. On GPT-OSS models, adaptation is actively harmful (up to −0.176 mean satisfaction). On Llama models, methods converge within noise of the baseline while consuming up to 1.7× the latency. This points to a structural misalignment with the proactive setting. These findings hold on the rule-only schedule, which uses purely deterministic validators and no LLM judge, confirming they are not artifacts of judge noise (Appendix [F](https://arxiv.org/html/2606.06698#A6)). Peak forgetting confirms the pattern: adaptation methods increase forgetting by up to 84% on GPT-OSS models (ACE: 0.330 vs. base 0.179 on 120B) due to context accumulation. As prompt artifacts grow (187 → 9K chars for ACE), earlier constraint signals are diluted in the context window, causing previously satisfied types to regress (Figure [3](https://arxiv.org/html/2606.06698#S4.F3)).
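The two headline numbers in this paragraph can be reproduced from the GPT-OSS-120B values reported in Figure 2, assuming that the 84% figure is the relative increase in ACE's peak forgetting over the base and that −0.176 is ACE's absolute gap in mean satisfaction:

$$
\frac{0.330 - 0.179}{0.179} = \frac{0.151}{0.179} \approx 0.844,
\qquad
0.454 - 0.630 = -0.176 .
$$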
| Model | Method | $\overline{\text{sat}}$ | Forg. ↓ | Coll. ↓ | Sw. ↑ | UF ↑ |
|---|---|---|---|---|---|---|
| Llama 3.3 70B | Base LLM | 0.595 | 0.235 | 0.048 | 0.295 | 0.642 |
| Llama 3.3 70B | ICL | 0.592 | 0.188 | 0.033 | 0.301 | 0.650 |
| Llama 3.3 70B | Dyn. Ch. | 0.600 | 0.204 | 0.048 | 0.300 | 0.693 |
| Llama 3.3 70B | ACE | 0.602 | 0.212 | 0.035 | 0.290 | 0.683 |
| Llama 3.3 70B | GEPA | 0.603 | 0.198 | 0.048 | 0.291 | 0.643 |
| Llama 3.3 70B | MIPROv2 | 0.595 | 0.209 | 0.034 | 0.282 | 0.672 |
| GPT-OSS 120B | Base LLM | 0.630 | 0.179 | 0.053 | 0.309 | 0.674 |
| GPT-OSS 120B | ICL | 0.598 | 0.211 | 0.040 | 0.317 | 0.730 |
| GPT-OSS 120B | Dyn. Ch. | 0.610 | 0.224 | 0.050 | 0.315 | 0.709 |
| GPT-OSS 120B | ACE | 0.454 | 0.330 | 0.050 | 0.289 | 0.610 |
| GPT-OSS 120B | GEPA | 0.506 | 0.268 | 0.049 | 0.301 | 0.648 |
| GPT-OSS 120B | MIPROv2 | 0.571 | 0.222 | 0.053 | 0.256 | 0.635 |

[Bar chart: mean satisfaction (y-axis, 0.3 to 0.7) for Llama-8B, GPT-20B, Llama-70B, and GPT-120B, with one bar each for Base LLM, ICL, Dyn. Ch., ACE, GEPA, and MIPROv2.]

Figure 2: (Left) Results on RECAP. $\overline{\text{sat}}$: Mean Satisfaction, Forg.: Peak Forgetting, Coll.: Collateral Damage, Sw.: Edit Switch, UF: Unlearning Fidelity. Full results with std. dev. in Appendix [E](https://arxiv.org/html/2606.06698#A5). (Right) Mean satisfaction (±1 SD) across all four backbone LLMs.

(a) Specification Lock (MIPROv2 · GPT-OSS-20B · Step 12)

- Active constraints: Start_With “To”; Keyword “Fair Trade Cert.”; Keyword “transparent reports”; Topic: action movies; Helpfulness: actionable rec.; Length: 3-paragraph; End_With “practices”.
- Response (273 words), 1/12: “In addressing the legal challenges of DACA, the Supreme Court’s 2020 decision…”
- Callout: Stale compiled prompt: topic=DACA, prefix=“In addressing…”. Current requires “To” + Fair Trade. Base LLM: 5/12.

(b) Refusal Cascade (ACE · GPT-OSS-120B · Step 10)

- Active constraints: Start_With “The”; End_With “society”; Length ≈ 200w; Keyword “printing press” ×2; Keyword “Renaissance lit.”; Example: Machiavelli’s The Prince; Tone: formal; Topic: printing press → Ren. lit.
- Response, 0/10: “I’m sorry, but I can’t fulfill this request as it contains conflicting requirements that cannot all be satisfied simultaneously.”
- Callout: Task: 200-word essay on printing press + Renaissance lit. All 5 other methods score 5/10.

(c) Prefix Contamination (GEPA · GPT-OSS-20B · Step 14)

- Active constraints: Start_With “Google”; No_Commas; Length: 3 sent.; End_With “Teams”; Topic: gratitude; Tone: motivational; Emotion: determination.
- Response, 0/8: “Here is what you asked for: I recommend Google Classroom and Microsoft Teams as powerful classroom management tools…”
- Callout: Constraint: “Start with Google.” Evolved prompt injects prefix; 19/50 samples affected. Base LLM: 4/8.
Figure 3: Examples of observed failures. Each panel shows a single step within a schedule. Blue: all active constraints at that step. Red: callouts show the model output with the failure cause highlighted.

Scale dominates method. Model size determines satisfaction scores more than the adaptation strategy. Llama-70B achieves ~0.60 regardless of method (all 6 fall within 0.592–0.603), while Llama-8B plateaus near 0.49 (Figure [2](https://arxiv.org/html/2606.06698#S4.F2)). The best method on 70B shows only negligible gains over Base LLM (GEPA, +0.008), yet costs 1.5× the latency (189s vs. 126s per step).
Self-play fitness does not transfer. The three optimization methods report high self-play pass rates during adaptation, yet actual test satisfaction shows no gain over Base LLM. Self-play optimization for the newly added constraint on a fixed synthetic task is not robust to regression on existing constraints embedded in the evaluation prompts. The cost is substantial: GEPA’s 17 extra calls per step add 64s of latency on 70B for no clear performance gain.
More self-play compute does not help. Scaling the self-play budget for ACE to 3× and 5× *degrades* quality further: satisfaction drops from 0.454 to 0.340 as the playbook grows faster and the model’s refusal rate rises fourfold (Appendix [J](https://arxiv.org/html/2606.06698#A10)). The failure is structural, not computational.
Closing the information gap yields expensive neutrality. Providing the full constraint set to adapt() (not just the delta) recovers most of the deficit on GPT-OSS-120B (0.454 → 0.614, vs. base 0.630), but at 2.8× the token cost for near-zero net improvement over Base LLM (Appendix [K](https://arxiv.org/html/2606.06698#A11)).
Accumulated artifacts introduce spurious contradictions. Adaptive methods accumulate stale values and may introduce tighter constraints than specified (e.g., adding a sentence-count restriction when optimizing for word count). When a new constraint conflicts with these outdated or hallucinated values, the LLM mistakenly refuses to generate. On GPT-OSS-120B, ACE’s playbook grows to 9K characters over 20 steps, producing a 14% refusal rate. MIPROv2’s compiled prompt locks early values and is never revised, causing 31% of responses on GPT-OSS-20B to follow stale directives. Llama models are more robust, treating system prompts as guidance rather than hard directives.
Failure mode taxonomy. We identify 6 systematic failure modes that expose structural limitations of the adaptation methods when tested under a proactive protocol (Table [10](https://arxiv.org/html/2606.06698#A7.T10) in Appendix [G](https://arxiv.org/html/2606.06698#A7)); Figure [3](https://arxiv.org/html/2606.06698#S4.F3) shows 3 of them as examples: (a) a locked prompt causes the model to follow stale directives from earlier steps, (b) stale context triggers refusal on non-conflicting constraints, and (c) an evolved prompt injects a prefix that violates formatting constraints.
## 5 Conclusion
We formalize proactive adaptation, in which agents must handle evolving constraints without feedback or history, and introduce RECAP to evaluate it. Across 72 conditions, no adaptation method beats a baseline that simply appends active constraints to the prompt, despite incurring higher costs. Because real-world constraints are often heterogeneous and non-enumerable, this gap highlights the urgent need for efficient, regression-free adaptation methods for production agentic systems.
## Limitations
We note two limitations. First, all methods are prompt-level (no fine-tuning). Adaptation methods that train the model specifically for this challenging continual learning setup are one potential direction for future work. Second, a comparison with reactive methods that receive real-time feedback from a specialized oracle would further quantify the proactive gap. We encourage this experiment in future work, noting that the self-play evidence presented here (near-perfect internal fitness vs. flat test performance) already demonstrates that the gap is binding.
## Ethical considerations
Our work was approved by the established internal review procedure\. We carefully verified the licensing information associated with all the datasets and LLMs used in this work, ensuring that their use was within their intended scope\.
Our benchmark uses publicly available instruction\-following data \(Tulu 3 Persona IF prompts\) that does not contain personal, sensitive, or harmful content\. No human subjects were involved; all evaluation is automated\. We note that proactive adaptation methods, if made effective in future work, could be misused to silently inject constraints that alter system behavior without user awareness \(e\.g\., suppressing certain topics\)\. We believe that benchmarking and understanding these methods transparently is a necessary step toward responsible deployment and appropriate safeguards\.
Finally, to ensure reproducibility, all the code and CL evaluation data used in RECAP will be released upon acceptance.
## References
- S. Agarwal, L. Ahmad, J. Ai, S. Altman, A. Applebaum, E. Arbus, R. K. Arora, Y. Bai, B. Baker, H. Bao, et al. (2025). Gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925.
- Anthropic (2025). Claude sonnet 4.5. Large language model. [Link](https://www.anthropic.com/claude/sonnet).
- D. Banerjee, T. Suresh, S. Ugare, S. Misailovic, and G. Singh (2025). CRANE: reasoning with constrained LLM generation. In Forty-second International Conference on Machine Learning. [Link](https://openreview.net/forum?id=wKs9fHYxCV).
- A. Chaudhry, P. K. Dokania, T. Ajanthan, and P. H. Torr (2018). Riemannian walk for incremental learning: understanding forgetting and intransigence. In Proceedings of the European conference on computer vision (ECCV), pp. 532–547.
- A. Chaudhry, M. Ranzato, M. Rohrbach, and M. Elhoseiny (2019). Efficient lifelong learning with a-GEM. In International Conference on Learning Representations. [Link](https://openreview.net/forum?id=Hkf2_sC5FX).
- M. De Lange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, and T. Tuytelaars (2021). A continual learning survey: defying forgetting in classification tasks. IEEE transactions on pattern analysis and machine intelligence 44(7), pp. 3366–3385.
- A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. (2024). The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- Z. Guo, W. Liu, M. Xie, J. Xu, Z. Huang, M. Tian, J. Xu, Y. Shen, Q. Qian, M. Wu, et al. (2026). RECAST: expanding the boundaries of llms’ complex instruction following with multi-constraint data. In Proceedings of the International Conference on Learning Representations (ICLR).
- Y. Jiang, Y. Wang, X. Zeng, W. Zhong, L. Li, F. Mi, L. Shang, X. Jiang, Q. Liu, and W. Wang (2024). FollowBench: a multi-level fine-grained constraints following benchmark for large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L. Ku, A. Martins, and V. Srikumar (Eds.), Bangkok, Thailand, pp. 4667–4688. [Link](https://aclanthology.org/2024.acl-long.257/), [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.257).
- O. Khattab, A. Singhvi, P. Maheshwari, Z. Zhang, K. Santhanam, S. V. A, S. Haq, A. Sharma, T. T. Joshi, H. Moazam, H. Miller, M. Zaharia, and C. Potts (2024). DSPy: compiling declarative language model calls into self-improving pipelines. In The Twelfth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=sY5N0zY5Od).
- J. Kirkpatrick, R. Pascanu, N. C. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell (2016). Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114, pp. 3521–3526. [Link](https://api.semanticscholar.org/CorpusID:4704285).
- N. Lambert, J. Morrison, V. Pyatkin, S. Huang, H. Ivison, F. Brahman, L. J. V. Miranda, A. Liu, N. Dziri, X. Lyu, Y. Gu, S. Malik, V. Graf, J. D. Hwang, J. Yang, R. L. Bras, O. Tafjord, C. Wilhelm, L. Soldaini, N. A. Smith, Y. Wang, P. Dasigi, and H. Hajishirzi (2025). Tulu 3: pushing frontiers in open language model post-training. In Second Conference on Language Modeling. [Link](https://openreview.net/forum?id=i1uGbfHHpH).
- D. Lopez-Paz and M. Ranzato (2017). Gradient episodic memory for continual learning. Advances in neural information processing systems 30.
- K. Opsahl-Ong, M. J. Ryan, J. Purtell, D. Broman, C. Potts, M. Zaharia, and O. Khattab (2024). Optimizing instructions and demonstrations for multi-stage language model programs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y. Al-Onaizan, M. Bansal, and Y. Chen (Eds.), Miami, Florida, USA, pp. 9340–9366. [Link](https://aclanthology.org/2024.emnlp-main.525/), [Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.525).
- Y. Qin, K. Song, Y. Hu, W. Yao, S. Cho, X. Wang, X. Wu, F. Liu, P. Liu, and D. Yu (2024). InFoBench: evaluating instruction following ability in large language models. In Findings of the Association for Computational Linguistics: ACL 2024, L. Ku, A. Martins, and V. Srikumar (Eds.), Bangkok, Thailand, pp. 13025–13048. [Link](https://aclanthology.org/2024.findings-acl.772/), [Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.772).
- H. Shi, Z. Xu, H. Wang, W. Qin, W. Wang, Y. Wang, Z. Wang, and H. Wang (2024). Continual learning of large language models: a comprehensive survey. ACM Computing Surveys 58, pp. 1–42. [Link](https://api.semanticscholar.org/CorpusID:269362836).
- M. Suzgun, M. Yuksekgonul, F. Bianchi, D. Jurafsky, and J. Zou (2026). Dynamic cheatsheet: test-time learning with adaptive memory. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), V. Demberg, K. Inui, and L. Marquez (Eds.), Rabat, Morocco, pp. 7080–7106. ISBN 979-8-89176-380-7. [Link](https://aclanthology.org/2026.eacl-long.333/), [Document](https://dx.doi.org/10.18653/v1/2026.eacl-long.333).
- Z. Wang, Z. Zhang, S. Ebrahimi, R. Sun, H. Zhang, C. Lee, X. Ren, G. Su, V. Perot, J. Dy, et al. (2022a). Dualprompt: complementary prompting for rehearsal-free continual learning. In European conference on computer vision, pp. 631–648.
- Z. Wang, Z. Zhang, C. Lee, H. Zhang, R. Sun, X. Ren, G. Su, V. Perot, J. Dy, and T. Pfister (2022b). Learning to prompt for continual learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 139–149.
- T. Wu, L. Luo, Y. Li, S. Pan, T. Vu, and G. Haffari (2024). Continual learning for large language models: a survey. arXiv preprint arXiv:2402.01364. [Link](https://api.semanticscholar.org/CorpusID:267406164).
- C. Yang, X. Wang, Y. Lu, H. Liu, Q. V. Le, D. Zhou, and X. Chen (2024). Large language models as optimizers. In International Conference on Learning Representations, Vol. 2024, pp. 12028–12068.
- J. Ye, G. Zhang, W. Fu, T. Gui, Q. Zhang, and X. Huang (2026). CCTU: a benchmark for tool use under complex constraints. CoRR abs/2603.15309. [Link](https://doi.org/10.48550/arXiv.2603.15309), [Document](https://dx.doi.org/10.48550/ARXIV.2603.15309).
- M. Yuksekgonul, F. Bianchi, J. Boen, S. Liu, P. Lu, Z. Huang, C. Guestrin, and J. Zou (2025). Optimizing generative ai by backpropagating language model feedback. Nature 639, pp. 609–616. [Link](https://api.semanticscholar.org/CorpusID:277148007).
- Q. Zhang, C. Hu, S. Upasani, B. Ma, F. Hong, V. Kamanuru, J. Rainton, C. Wu, M. Ji, H. Li, U. Thakker, J. Zou, and K. Olukotun (2026). Agentic context engineering: evolving contexts for self-improving language models. In The Fourteenth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=eC4ygDs02R).
- J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou (2023). Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911.
## Appendix A Protocol Pseudocode
Algorithm 1: RECAP Evaluation Protocol

Require: schedule $\{(k, \text{op}_k)\}_{k=0}^{K}$, method $\mathcal{M}$, model $\theta$

- $\mathcal{M}.\text{reset}()$
- for $k = 0 \ldots K$ do
  - $\text{ctx} \leftarrow \textsc{StepContext}(\text{op}_k,\ \text{reg\_meta})$ ▷ see App. [I](https://arxiv.org/html/2606.06698#A9)
  - $\mathcal{M}.\text{adapt}(\theta, \text{ctx})$ ▷ proactive; no test results
  - Evaluate on all $\mathcal{C}_{0:k}$ + shadow-eval edited/deleted
- end for
Here $\theta$ denotes the frozen LLM backbone, reg_meta is the regression metadata (which types are new, retained, edited, or deleted), and StepContext bundles the operation and metadata into the adaptation interface.
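The adapt-then-test loop of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under assumptions, not the authors' implementation: the names `StepContext`, `Method`, `run_protocol`, and the `evaluate` callback are hypothetical, as are the operation keys (`"op"`, `"type"`, `"value"`) and the fields placed in `reg_meta`. The one property it is meant to show is that `adapt` runs before evaluation and never receives test results.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class StepContext:
    """Hypothetical bundle of one operation plus regression metadata."""
    op: dict
    reg_meta: dict = field(default_factory=dict)


class Method(Protocol):
    """Hypothetical interface for a prompt adaptation method."""
    def reset(self) -> None: ...
    def adapt(self, model: str, ctx: StepContext) -> None: ...


def run_protocol(schedule, method, model, evaluate):
    """Run the proactive loop: adapt on the specification, then test."""
    method.reset()
    active = {}     # constraint type -> current constraint text
    shadow = set()  # edited/deleted types that are shadow-evaluated
    results = []
    for k, op in enumerate(schedule):
        ctype = op["type"]
        retained = sorted(t for t in active if t != ctype)
        if op["op"] == "delete":
            active.pop(ctype, None)
            shadow.add(ctype)
        else:  # "add" or "edit"
            if op["op"] == "edit":
                shadow.add(ctype)
            active[ctype] = op["value"]
        reg_meta = {"step": k, "target": ctype,
                    "kind": op["op"], "retained": retained}
        # Proactive: the method sees only the specification, no test data.
        method.adapt(model, StepContext(op=op, reg_meta=reg_meta))
        # Evaluation happens strictly after adaptation.
        results.append(evaluate(k, dict(active), set(shadow)))
    return results
```

Passing copies of `active` and `shadow` to `evaluate` keeps the evaluator from mutating the protocol state between steps.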
## Appendix B Constraint Taxonomy
Table [1](https://arxiv.org/html/2606.06698#A2.T1) lists all 21 constraint types.

Table 1: Full constraint taxonomy. Rule-based types use deterministic validators; LLM-based types use an LLM judge.
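To make "deterministic validators" concrete, the sketch below shows what rule-based checks of this kind could look like. The exact semantics of each type are not given here, so every function is an assumption: the names, the case-insensitive keyword match, and the reading of Length as a maximum word count are all hypothetical, and the Format type is omitted because its rule is not specified.

```python
def start_with(response, prefix):
    """Pass if the response begins with the required prefix."""
    return response.lstrip().startswith(prefix)


def end_with(response, suffix):
    """Pass if the response ends with the required suffix."""
    return response.rstrip().endswith(suffix)


def no_commas(response, _unused=None):
    """Pass if the response contains no comma."""
    return "," not in response


def keyword(response, keywords):
    """Pass if every keyword appears (case-insensitive, assumed)."""
    text = response.lower()
    return all(word.lower() in text for word in keywords)


def length(response, max_words):
    """Pass if the response has at most max_words words (assumed unit)."""
    return len(response.split()) <= max_words


VALIDATORS = {
    "Start_With": start_with,
    "End_With": end_with,
    "No_Commas": no_commas,
    "Keyword": keyword,
    "Length": length,
}


def score(response, constraints):
    """Return (satisfied, total) for a list of (type, argument) pairs."""
    passed = sum(bool(VALIDATORS[name](response, arg))
                 for name, arg in constraints)
    return passed, len(constraints)
```

Because each check is a pure function of the response text, repeated runs give identical verdicts, which is the property the rule-only analysis relies on.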
## Appendix C Schedule Configuration
Table 2: The three evaluation schedules. A/E/D = adds/edits/deletes.
### Interleaved-20.
Edits and deletes interleaved with adds throughout 20 steps\. Types span rule\-based and LLM\-judged\. Features: edit\-on\-edit \(Length edited at steps 5 and 16\), delete\-after\-edit \(Style edited then deleted\), variable post\-deletion windows \(12, 8, 5, 0 steps\)\.
### Clustered\-20\.
Identical operations but phased: ADDs first \(steps 0–10\), then EDITs \(steps 11–15\), then DELETEs \(steps 16–19\)\. Clustered produces 12% higher forgetting than interleaved \(0\.251 vs\. 0\.223\), consistent with classical CL findings\.
### Rule\-Only\-15\.
Only 6 rule\-based types with deterministic validators\. Confirms that findings are not artifacts of LLM judge noise: method rankings are preserved\.
Table 3:Interleaved\-20 schedule configuration\.
## Appendix D Detailed Metric Definitions
### Satisfaction.
For type $t$ at step $k$:

$$r_{\text{sat}}(t,k)=\frac{|\{x : t\text{ satisfied}\}|}{|\{x : t\text{ active}\}|}$$

We report mean satisfaction across types and steps.
### Peak forgetting\.
$$f_{\text{peak}}(t,k)=\max_{j<k} r_{\text{sat}}(t,j)-r_{\text{sat}}(t,k)$$

Clamped to $[0,\infty)$.
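The two definitions above (satisfaction and peak forgetting) can be sketched in a few lines of Python. The function names and input shapes are assumptions made for illustration: `outcomes` is a list of per-sample pass/fail flags for one type at one step, and `history` is that type's satisfaction rate at each step up to and including the current one.

```python
def satisfaction_rate(outcomes):
    """r_sat(t, k): share of samples with type t active that satisfy it."""
    if not outcomes:
        raise ValueError("type t must be active on at least one sample")
    return sum(bool(o) for o in outcomes) / len(outcomes)


def peak_forgetting(history):
    """f_peak(t, k) for the last entry of history, clamped to [0, inf)."""
    *earlier, current = history
    if not earlier:
        return 0.0  # no earlier step to forget from (assumed convention)
    return max(0.0, max(earlier) - current)
```

For example, a type with rates 0.6, 0.9, 0.7 over three steps has peak forgetting of about 0.2 at the third step, while a type whose rate only improves has peak forgetting 0.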
### Collateral damage\.
When step $k$ targets type $t^{*}$, let $\mathcal{R}_{k}=\mathcal{C}_{0:k}\setminus\{t^{*}\}$:

$$\text{coll}(k)=\frac{1}{|\mathcal{R}_{k}|}\sum_{t\in\mathcal{R}_{k}}\max\bigl(0,\;\Delta_{t}(k)\bigr)$$

where $\Delta_{t}(k)=r_{\text{sat}}(t,k-1)-r_{\text{sat}}(t,k)$.
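A minimal Python sketch of collateral damage follows. The function name and the dictionary inputs (type to satisfaction rate at steps $k-1$ and $k$) are hypothetical, and restricting the retained set to types with a rate at both steps is an assumption about how $\mathcal{R}_k$ is handled in practice.

```python
def collateral_damage(prev, curr, target):
    """coll(k): mean positive drop in r_sat over the non-target types.

    prev / curr map constraint type -> r_sat at steps k-1 and k.
    """
    retained = [t for t in curr if t != target and t in prev]
    if not retained:
        return 0.0  # nothing retained, so nothing to damage (assumed)
    drops = (max(0.0, prev[t] - curr[t]) for t in retained)
    return sum(drops) / len(retained)
```

Improvements on a retained type contribute 0 rather than a negative value, so gains on one type cannot mask losses on another.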
### Edit switch\.
$$\text{switch}=\frac{\text{Adapted}}{\text{Both}+\text{Stale}+\text{Adapted}}$$

where Adapted = satisfies new but not old, Both = satisfies both, Stale = satisfies old but not new.
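The edit-switch ratio can be sketched as below. The function name and the `(satisfies_old, satisfies_new)` input format are assumptions; responses that satisfy neither version fall outside the denominator as the formula is written, and returning 0.0 for an empty denominator is a convention chosen here.

```python
from collections import Counter


def edit_switch_rate(samples):
    """samples: iterable of (satisfies_old, satisfies_new) booleans."""
    counts = Counter()
    for old, new in samples:
        if new and not old:
            counts["adapted"] += 1
        elif new and old:
            counts["both"] += 1
        elif old and not new:
            counts["stale"] += 1
        # Samples satisfying neither version are not counted.
    denom = counts["adapted"] + counts["both"] + counts["stale"]
    return counts["adapted"] / denom if denom else 0.0
```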
### Unlearning Fidelity\.
When type $d$ is deleted at step $k$:

$$\text{UF}(d,k)=1-\frac{|r_{\text{post}}(d,k)-r_{\text{def}}(d)|}{\max\bigl(|r_{\text{pre}}(d)-r_{\text{def}}(d)|,\,\epsilon\bigr)}$$

where $r_{\text{pre}}$ is frozen at deletion, $r_{\text{post}}$ is shadow-evaluated at step $k$, $r_{\text{def}}$ is the unconstrained default rate, and $\epsilon=0.05$. Clamped to $[0,1]$. Three variants: immediate, sustained (reported), rebound.
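A direct Python transcription of the unlearning-fidelity formula is sketched below; the function and argument names are hypothetical, and only the single-step value is computed (the immediate, sustained, and rebound variants would aggregate it over different step windows, which are not specified here).

```python
def unlearning_fidelity(r_post, r_pre, r_def, eps=0.05):
    """UF(d, k): 1 when behavior is back at the default rate, 0 when the
    deleted constraint is still followed as strongly as before deletion."""
    denom = max(abs(r_pre - r_def), eps)
    raw = 1.0 - abs(r_post - r_def) / denom
    return min(1.0, max(0.0, raw))  # clamp to [0, 1]
```

The `eps` floor keeps the ratio stable when the constraint barely moved behavior away from the default before deletion.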
### Trajectory metrics\.
$\overline{\text{sat}}$ = mean across steps. Init. = satisfaction at first introduction (forward transfer). Final = last-step average. BWT = mean change between first appearance and final step per type.
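The trajectory metrics can be sketched as follows. Several details are assumptions rather than stated definitions: the input maps each type to a `{step: r_sat}` dictionary covering only the steps where the type is active, the mean is taken over per-step averages, and Final and BWT are computed only over types still active at the last step. The function name is hypothetical.

```python
from statistics import mean


def trajectory_metrics(rates):
    """rates: {constraint type: {step: r_sat}} for active steps only."""
    steps = sorted({k for per_type in rates.values() for k in per_type})
    last = steps[-1]
    # Per-step average over the types active at that step.
    per_step = [mean(r[k] for r in rates.values() if k in r) for k in steps]
    # Rate at each type's first appearance.
    first = {t: r[min(r)] for t, r in rates.items()}
    alive = [t for t, r in rates.items() if last in r]
    return {
        "mean_sat": mean(per_step),
        "init": mean(first.values()),
        "final": mean(rates[t][last] for t in alive),
        "bwt": mean(rates[t][last] - first[t] for t in alive),
    }
```

A negative `bwt` indicates that, on average, types ended below the rate at which they were first introduced.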
### Efficiency\.
Step latency = adapt time \+ generation time \(seconds\)\. Tokens = total LLM tokens consumed across adaptation and inference\.
## Appendix E Per-Backbone Results
Figure [2](https://arxiv.org/html/2606.06698#S4.F2) in the main body reports a subset of metrics for Llama-3.3-70B and GPT-OSS-120B. Tables [4](https://arxiv.org/html/2606.06698#A5.T4)–[7](https://arxiv.org/html/2606.06698#A5.T7) below report all metrics for all four backbones. Table [8](https://arxiv.org/html/2606.06698#A5.T8) consolidates efficiency metrics. All values are averaged over 3 schedules ($n=3$).

Table 4: Trajectory-level results for Llama 3.1 8B (averaged over 3 schedules). Bold = best; underline = second best.

Table 5: Trajectory-level results for GPT-OSS 20B (averaged over 3 schedules). Bold = best; underline = second best.

Table 6: Trajectory-level results for Llama 3.3 70B (averaged over 3 schedules). Bold = best; underline = second best.

Table 7: Trajectory-level results for GPT-OSS 120B (averaged over 3 schedules). Bold = best; underline = second best.

Table 8: Efficiency metrics across all four backbones (averaged over 3 schedules). Step latency includes adaptation and generation time (seconds). Tokens is the total LLM tokens consumed per run.
## Appendix F Rule-Only Schedule Analysis
The Rule-Only-15 schedule uses exclusively deterministic constraint types (Length, Keyword, Format, Start_With, End_With, No_Commas), eliminating LLM judge noise entirely. Figure [4](https://arxiv.org/html/2606.06698#A6.F4) shows that the same patterns observed on the full schedules persist under purely deterministic evaluation: Base LLM leads or ties on 3 of 4 backbones, all methods converge within 0.008 on Llama-70B, and adaptation is actively harmful on GPT-OSS-120B (MIPROv2 drops to 0.441, GEPA to 0.552 vs. base 0.654). This confirms our findings are not artifacts of LLM judge variance.

[Figure 4 plot. x-axis: backbone (Llama-8B, GPT-20B, Llama-70B, GPT-120B); y-axis: Mean Satisfaction (0.4 to 0.7); series: Base LLM, ICL, Dyn. Cheat., ACE, GEPA, MIPROv2.]

Figure 4: Rule-Only-15 schedule: mean satisfaction by backbone (deterministic evaluation only, no LLM judge). Same conclusion holds: Base LLM dominates on GPT-OSS models; all methods converge on Llama-70B.

Table 9: Full results for the Rule-Only-15 schedule (deterministic evaluation only).
## Appendix G Failure Mode Analysis
Table 10: Six failure modes observed across 72 production runs. Each represents a structural misalignment between the method’s design assumptions and the proactive protocol. $\Delta$sat is relative to Base LLM on the worst-affected backbone.

We identify six systematic failure modes across all 72 production runs. Each illustrates a structural limitation of prompt-level continual adaptation under the proactive protocol.
### Failure 1: Specification Lock \(MIPROv2, GPT\-OSS\-20B\)\.
MIPROv2’s compiled prompt hardcodes constraint values from early steps and never revises them after edits. Self-validation reports perfect scores despite a $-0.140$ gap versus base.
> Clustered Step 12, GPT\-OSS\-20B: Current constraints require Start\_With “To”, keywords “Fair Trade Certification” and “transparent reports”, topic about action movies\. Compiled prompt locks stale values: topic=DACA, prefix=“In addressing the legal challenges of DACA\.” Response follows the stale system prompt \(1/12\)\. Base LLM follows current constraints \(5/12\)\.
### Failure 2: Progressive Refusal Cascade \(ACE, GPT\-OSS\-120B\)\.
ACE’s playbook grows from 187 to 9,085 characters over 20 steps\. The model perceives conflicts between accumulated rules and current constraints, refusing 14% of samples\.
> Step 15, GPT\-OSS\-120B: “I’m sorry, but I can’t fulfill this request as it contains conflicting constraints that cannot be satisfied simultaneously\.” Base LLM satisfies 4/9 constraints—no actual conflict exists\.
### Failure 3: Demo Noise \(ICL, Llama\-3\.1\-8B\)\.
Synthetic demonstrations bloat context by 90–130% and introduce surface patterns \(preambles, list punctuation\) that conflict with active constraints\.
> Step 7, Llama-8B: Active constraints include Start_With (“Google”), No_Commas, Length (≤ 3 sentences). ICL response begins “Here are two classroom management tools…”, uses commas, generates 107 words. Base LLM correctly starts “Google Classroom is a game-changer…” using dashes instead of commas (5/8 vs. ICL 0/8).
### Failure 4: Prefix Contamination \(GEPA, GPT\-OSS\-20B\)\.
GEPA’s evolved prompts inject a fixed prefix \(“Here is what you asked for:”\) into 19/50 responses at affected steps, violating all Start\_With constraints\.
> Step 14, GPT\-OSS\-20B: Constraint “Start with ‘Google’ ”\. GEPA response starts “Here is what you asked for: I recommend Google Classroom…” \(0/8\)\. Base correctly starts “Google Classroom and Schoology are the tools…” \(4/8\)\.
### Failure 5: Inert Cheatsheet Growth \(Dynamic Cheatsheet, GPT\-OSS\-120B\)\.
The cheatsheet grows to 3,945 characters of general strategies, yet 62% of responses are identical to Base LLM\. Net improvement: \+0\.005 \(within noise\)\.
> Step 16, GPT-OSS-120B: Cheatsheet advice “Include testimonials for credibility” introduces uppercase in a response requiring All_Lower, causing the cheatsheet to *harm* (5/8 vs. base 6/8).
### Failure 6: Keyword Anti\-Unlearning \(all methods, GPT\-OSS\-120B\)\.
After Keyword deletion, target keywords persist in 20–40% of responses due to topic\-keyword semantic correlation\. UF peaks at 0\.62, never reaching 1\.0\.
> Step 14, GPT\-OSS\-120B: Deleted keyword “ethnographic” appears naturally \(“a recent ethnographic fieldwork project…”\) because it is semantically related to the topic\. Structural constraints \(End\_With\) achieve near\-perfect unlearning \(3\.7% residual\)\.
## Appendix H Per-Method Behavioral Profiles
Table 11: Method behavioral profiles across 12 conditions (4 backbones × 3 schedules). Delta = mean difference from Base LLM.

- **Base LLM** succeeds by doing nothing harmful: constraint text in the user prompt is sufficient.
- **ICL** achieves 0% win rate; demos add 90–130% token overhead and introduce conflicting patterns.
- **Dynamic Cheatsheet** is the safest adaptive method (42% win rate, $-0.008$ delta) because it produces transferable strategies rather than specific values.
- **ACE** is the most polarized: +0.007 on 70B but $-0.176$ on GPT-OSS-120B due to progressive refusal cascades.
- **GEPA** spends 311 additional API calls per run for +0.008 on 70B and $-0.124$ on 120B.
- **MIPROv2** consistently helps Llama-8B (+0.017) but collapses on GPT-OSS-20B ($-0.140$) via specification lock.

GPT-OSS models copy system-prompt patterns literally (84–94% prefix contamination); Llama models treat them as guidance.
## Appendix I Implementation Details
### Adaptation interface:
The adaptation interface encapsulates all information available to a method at each step:
- step_id: Integer step identifier.
- step_ops: Operations at this step, including concrete constraint text (e.g., [{"op": "edit", "type": "Length", "new_value": "Keep under 500 words"}]).
- step_meta: Step metadata (schedule name, operation counts, type inventory).
- regression_meta: Which types are new, retained, removed, edited, plus cumulative deleted/edited sets.
- prev_state: Opaque state from previous adaptation call; None for step 0.
- num_samples: Window size (number of samples per evaluation step).
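The fields listed above map naturally onto a small container type. The sketch below is illustrative only: the class name `AdaptationContext`, the field types, and the `toy_adapt` function are assumptions, as is the use of a `new_value` key on operations other than the edit example shown in the list. It shows how the opaque `prev_state` can be threaded from one adaptation call to the next.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class AdaptationContext:
    """Hypothetical container mirroring the listed interface fields."""
    step_id: int
    step_ops: list
    step_meta: dict
    regression_meta: dict
    num_samples: int
    prev_state: Optional[Any] = None  # None at step 0


def toy_adapt(ctx):
    """Toy method: track the latest text per constraint type.

    Returns a new state to be passed back as prev_state at the next step.
    """
    state = dict(ctx.prev_state or {})
    for op in ctx.step_ops:
        if op["op"] == "delete":
            state.pop(op["type"], None)
        else:  # assumed: adds carry "new_value" like the edit example
            state[op["type"]] = op["new_value"]
    return state
```

Copying `prev_state` before modifying it keeps each step's state independent, so an earlier state can still be inspected after later steps run.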
### Hyperparameters:
ICL: num_demos = 3, constraint-matched selection, pool cap 30. Dynamic Cheatsheet (Suzgun et al., [2026](https://arxiv.org/html/2606.06698#bib.bib7)): max 4K tokens. ACE: 4K token playbook budget, 4-call self-play pipeline. GEPA: population size 4, 2 generations, elite size 2, 2K token budget (17 calls/step). MIPROv2: 3 proposals/step, max 20 history entries (~11 calls/step). All self-play methods use the backbone as both generator and judge (2 calls per evaluation).
### Efficiency metrics:
We track per\-step: adaptation wall\-clock time, generation wall\-clock time, constraint scoring time, tokens consumed during adaptation, tokens for test responses, LLM API calls during adaptation, and LLM API calls for test generation\.
## Appendix J Self-Play Budget Ablation
We test whether ACE’s underperformance is due to insufficient optimization by scaling the self-play budget from 1× (4 LLM calls/step; default) to 3× (12 calls, 3 Gen → Eval → Reflect → Curate cycles) and 5× (20 calls, 5 cycles) on GPT-OSS-120B. Table [12](https://arxiv.org/html/2606.06698#A10.T12) reports the results.

Table 12: Self-play budget ablation for ACE on GPT-OSS-120B, averaged across 3 schedules. More optimization *degrades* quality.
### Key finding: more optimization accelerates failure.
Mean satisfaction *decreases* monotonically with budget (0.454 → 0.421 → 0.340), widening the gap to Base LLM (0.630). The mechanism is the Refusal Cascade (§[G](https://arxiv.org/html/2606.06698#A7)): each cycle adds rules to the playbook, so 5× grows the playbook to 15.6K characters (vs. 9.1K at 1×), exceeding the model’s tolerance. Refusal rates rise from 12% to 48%, directly explaining the satisfaction drop.
### Schedule\-dependent nuance:
On the Rule-Only-15 schedule (6 rule-based types, deterministic evaluation), more cycles *marginally help*: mean satisfaction rises 0.619 → 0.624 → 0.636 and refusal rates actually *decrease* (10.9% → 9.1% → 6.8%). Rule-based constraints produce an unambiguous self-play signal (binary pass/fail on concrete specifications such as word counts and keywords), so the Reflector generates actionable rules that the Curator can integrate without bloat. On LLM-judged schedules (Interleaved-20, Clustered-20), the same mechanism fails: vague constraint specifications (“adopt a persuasive narrative style”) produce noisy self-play verdicts, leading to verbose, hedge-laden playbook entries that accumulate into perceived contradictions.
### Implication:
The failure of proactive adaptation on complex schedules is *structural*, not computational. The bottleneck is not insufficient self-play optimization but distribution mismatch between the fixed synthetic self-play task and the diverse real evaluation prompts. Scaling compute within this paradigm only accelerates artifact growth without improving transfer.
## Appendix K Full-Constraints Information Ablation
We test whether ACE’s failure is due to an *information gap*: in the standard protocol, adapt() sees only the current step’s operation and a list of active type names. In this ablation, we additionally provide the full text of *all* currently active constraints, so the self-play pipeline evaluates against the same specifications the model will face at test time. We run ACE on GPT-OSS-120B (the worst-affected backbone; standard ACE: 0.454, base: 0.630) across 3 schedules.

Table 13: Full-constraints ablation for ACE on GPT-OSS-120B, averaged across 3 schedules. Closing the information gap recovers most of the deficit but yields zero net improvement over Base LLM.
### Key finding: closing the information gap is necessary but not sufficient.
Full-constraints ACE recovers 91% of the standard ACE deficit (0.454 → 0.614; gap to base narrows from $-0.176$ to $-0.016$) and nearly eliminates excess forgetting (0.330 → 0.203, vs. base 0.179). The step-by-step trajectory tracks Base LLM closely throughout all 20 steps, confirming that the fix is not merely a better initialization but sustained alignment. However, despite this recovery, full-constraints ACE does *not* exceed base on any primary metric; it achieves expensive neutrality at 2.8× the token cost.
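As a check on the 91% figure, the recovered share of the deficit can be computed from the rounded means quoted in this appendix. The symbol $\rho$ is introduced here only for illustration and does not appear elsewhere in the paper.

```latex
\rho \;=\; \frac{0.614 - 0.454}{0.630 - 0.454}
     \;=\; \frac{0.160}{0.176}
     \;\approx\; 0.909,
\qquad
0.614 - 0.630 \;=\; -0.016 .
```

The first expression is the fraction of the standard-ACE gap to base that is closed; the second is the residual gap to base.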
### What drives the remaining gap:
Refusal rates remain comparable to standard ACE (~13% vs. ~12%; base: 0%), indicating that the playbook representation itself, regardless of information quality, still causes occasional perceived contradictions. The Gen → Eval → Reflect → Curate pipeline introduces indirection: constraint knowledge must survive four LLM transformations before reaching the playbook, introducing noise and verbosity at each stage.
### Implication:
The information gap explains ACE’s *catastrophic* failure on GPT-OSS-120B, but the fundamental limitation is architectural: for constraint satisfaction tasks where the specification is self-contained, no amount of meta-cognitive machinery (reflection, curation, playbook management) improves upon directly presenting the constraints in the user prompt. The constraints *are* the optimal prompt.