Tag
Introduces SPO, a stochastic search framework for automatic prompt optimization with three strategies, including SAGE, an agent-guided multi-agent pipeline. It is evaluated on benchmarks and deployed on a mental-health chatbot, showing improved retention through continuous optimization.
Microsoft introduces SkillOpt, a method that trains an agent's skill documentation like a neural network, using epochs, batches, learning rates, and validation sets for optimization, without modifying model weights. It achieves top results across multiple benchmarks and can be transferred across models and tools.
Introduces Prompt Optimizer, an open-source tool that helps users optimize, test, and reuse prompts. It supports multi-platform deployment and turns prompts from one-time inputs into assets that can be called repeatedly.
FAPO is a framework for fully autonomous prompt optimization of multi-step LLM pipelines, combining prompt editing and structural changes. It outperforms the GEPA baseline in 15 of 18 comparisons, with gains up to +33.8 pp on security tasks.
The paper proposes GTBP, a graph-based back-propagation framework for context adaptation in multi-LLM agentic systems, which improves prompt optimization with theoretical convergence guarantees and outperforms existing methods on benchmarks.
APEX introduces a dynamic data selection strategy for automatic prompt optimization, stratifying datasets into easy, hard, and mixed tiers to improve data efficiency, achieving significant performance gains over initial prompts on multiple benchmarks.
LEVI is an open-source AlphaEvolve-like system that runs locally on Qwen3-30B, offering code and prompt optimization with up to 35x cost reduction and better performance than existing frameworks.
Introduces RECAP, a benchmark for evaluating continual learning of prompts under evolving constraints in a proactive adaptation setting. Results show that existing prompt optimization methods fail in this setting, highlighting the need for new methods.
CRAFT is a Pareto-front prompt optimizer that jointly optimizes for accuracy and token cost, avoiding the 'scalarization collapse' of weighted-sum approaches by maintaining a diverse population of prompts across the accuracy-cost trade-off frontier using NSGA-II and budget-aware validation.
SePO (Self-Evolving Prompt Optimization) proposes a self-referential prompt agent that optimizes both task agents' system prompts and its own system prompt through an evolutionary search, outperforming Manual-CoT, TextGrad, and MetaSPO across five benchmarks including AIME'25, ARC-AGI-1, and GPQA.
Proposes Demo2Reward, a test-time prompt optimization technique for VLM reward models using a few expert demonstrations, significantly reducing false positives and improving policy learning in robotics without additional model training.
This paper proposes learning assessment skills for LLMs to automate rubric construction for scoring tasks, achieving performance comparable to expert-written rubrics without requiring human-written examples.
Introduces eXTC, a text classifier with three progressive stages: structured prompt optimization to learn a natural-language rulebook, reasoning distillation into a compact LM, and reinforcement learning to expand reasoning, achieving strong performance and interpretability.
This paper conducts a causal-inspired analysis of automated prompt optimization across frameworks, LLMs, and tasks. It finds that specific edit types (e.g., complexity-increasing, meta-instructional) have systematically positive or negative effects depending on task characteristics, which helps explain generalization failures.
SPEAR is a code-augmented agentic prompt optimizer that uses a Python sandbox for structural error analysis, achieving state-of-the-art performance on multiple LLM evaluation suites including industrial judge tasks, BBH, and GSM8K.
Microsoft Research introduces SkillOpt, a method that treats agent skill documents as trainable external state, using an optimizer model to make bounded edits validated by a held-out set. The approach achieves best or tied results across 52 evaluation cells and improves accuracy by over 23 points on GPT-5.5, with zero extra inference cost and transferable skills.
This paper identifies two failure modes in multi-objective prompt optimization for LLM judges using textual gradients: gradient dilution during optimization and instruction interference during inference, showing that joint gradient processing loses criterion-specific information.
Introduces Reflective Prompt Tuning (RPT), a framework that uses LLM function-calling to iteratively diagnose and revise prompts based on systematic error patterns, improving reasoning task performance and calibration.
CANTANTE is an open-source framework that solves the credit assignment problem in multi-agent systems by converting system-level rewards into per-agent update signals, outperforming DSPy-based baselines on coding and math reasoning benchmarks.
CANTANTE introduces a contrastive credit attribution method to optimize multi-agent LLM systems by decomposing global rewards into per-agent signals, enabling automated prompt tuning. It outperforms baselines on programming, math, and retrieval benchmarks, achieving up to +18.9 points improvement without increased inference cost.
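Several entries above turn on a trade-off between accuracy and token cost, most directly the CRAFT entry, which keeps a population of prompts along the accuracy-cost frontier. As a rough illustration of what "Pareto front" means in that setting, the sketch below filters a set of candidate prompts down to the non-dominated ones. It is not CRAFT's implementation and does not reproduce NSGA-II or budget-aware validation; the `Candidate` type, the function names, and the scores are all hypothetical values made up for the example.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """Hypothetical record for one prompt and its measured scores."""
    prompt: str
    accuracy: float   # higher is better
    token_cost: int   # lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b on both objectives and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.token_cost <= b.token_cost
    strictly_better = a.accuracy > b.accuracy or a.token_cost < b.token_cost
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that no other candidate dominates (input order preserved)."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Made-up scores, purely for illustration.
candidates = [
    Candidate("short", 0.70, 40),
    Candidate("medium", 0.80, 120),
    Candidate("long", 0.82, 400),
    Candidate("bloated", 0.78, 450),  # dominated by "medium"
    Candidate("weak", 0.65, 60),      # dominated by "short"
]

front = pareto_front(candidates)
for c in front:
    print(f"{c.prompt}: accuracy={c.accuracy}, token_cost={c.token_cost}")
```

A weighted sum of the two objectives would return a single winner for a given weight, whereas the filter above keeps every prompt that represents a distinct trade-off, which is the property the CRAFT summary contrasts with scalarization.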