When Gradients Collide: Failure Modes of Multi-Objective Prompt Optimization for LLM Judges
Summary
This paper identifies two failure modes in multi-objective prompt optimization for LLM judges using textual gradients: gradient dilution during optimization and instruction interference during inference, showing that joint gradient processing loses criterion-specific information.
Source: https://huggingface.co/papers/2605.26046
Title: When Gradients Collide: Failure Modes of Multi-Objective Prompt Optimization for LLM Judges
Authors: Parth Darshan (IIT Jodhpur), Abhishek Divekar (Amazon)
Blogpost: https://textgrad-failure-modes.github.io
Codebase: https://github.com/adivekar-utexas/when-gradients-collide
Introduction
LLM judges increasingly score text along multiple criteria at once. TextGrad can optimize a prompt for one criterion, but its “gradients” are natural-language edit suggestions, not numerical vectors. They cannot be projected, averaged, or constrained the way PCGrad or MGDA operate on vector gradients. This paper asks what happens when textual gradients are forced into the multi-objective setting. We find two separable failure modes: during optimization, jointly generated gradients lose criterion-specific information; during inference, individually optimized instructions interfere when packed into a single judge prompt.
We evaluate on SummEval, which provides expert annotations for four separable summary-evaluation criteria: fluency, relevance, coherence, and consistency. Each optimization step has three stages where the criteria can interact: the loss LLM, the gradient LLM, and the optimizer LLM. We encode each mode with three letters: S means the stage processes each criterion separately; C means the stage processes all four criteria jointly.
The four multi-objective modes are: SSS (all stages separate), SSC (loss and gradient separate, optimizer combined), SCC (only loss separate, gradient and optimizer combined), and CCC (all stages combined). We also include a Single-Task baseline where each criterion receives its own independent optimization run. This baseline is not a deployable one-prompt judge, but it measures the ceiling we would hope to approach if multi-objective coupling caused no damage. All experiments use N=3 independent runs per configuration over 12 optimization steps.
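The three-letter encoding can be made concrete with a small sketch. The names `STAGES`, `MODES`, and `decode_mode` are illustrative assumptions and are not taken from the paper's codebase.

```python
# Illustrative decoding of the three-letter mode names; not the authors' code.
STAGES = ("loss", "gradient", "optimizer")
MODES = ("SSS", "SSC", "SCC", "CCC")


def decode_mode(mode: str) -> dict[str, str]:
    """Map each stage to 'separate' (S) or 'combined' (C)."""
    if len(mode) != len(STAGES) or set(mode) - {"S", "C"}:
        raise ValueError(f"unrecognized mode: {mode!r}")
    names = {"S": "separate", "C": "combined"}
    return {stage: names[letter] for stage, letter in zip(STAGES, mode)}


for mode in MODES:
    print(mode, decode_mode(mode))
```

Read this way, the gradient stage is combined only in SCC and CCC, and the optimizer stage is combined in every mode except SSS.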
Failure Mode 1: Gradient Dilution
The first failure happens during optimization. We measure each textual gradient for gradient specificity: how targeted its improvement suggestions are to a single criterion (scored 1–10 by an LLM evaluator). When the gradient LLM processes each task separately (modes Single, SSS, SSC), gradients are sharply focused, scoring a mean of 9.0 (±0.3). But when it must reconcile feedback from all four criteria in one call (modes SCC, CCC), specificity drops to 3.7 (±0.5), a 59% reduction with no overlap between the per-task and cross-task distributions.
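The 59% figure follows directly from the two reported means. A quick check, using only the numbers stated above (the variable names are ours):

```python
# Means reported in the text; the ± spreads are not needed for this check.
per_task_mean = 9.0    # gradient LLM called once per criterion (Single, SSS, SSC)
cross_task_mean = 3.7  # gradient LLM called once for all four criteria (SCC, CCC)

relative_drop = (per_task_mean - cross_task_mean) / per_task_mean
print(f"{relative_drop:.1%}")  # prints 58.9%
```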
The per-criterion breakdown reveals uneven dilution. Consistency is the most diluted: SCC scores 2.6 and CCC scores 2.4. Coherence retains more focus: SCC scores 4.8 and CCC scores 5.1. Joint gradients do not merely become uniformly worse; they become uneven, preserving generic writing-quality feedback while losing the criterion whose rubric is easiest to confuse with other dimensions.
This finding extends the rule-dilution hypothesis of CARO from the within-criterion to the cross-criterion setting. CARO shows that aggregating heterogeneous error modes in a single optimization step degrades rubric accuracy; we observe the analogous effect when multiple task gradients are combined in a single gradient call, degrading the per-task optimization signal.
Failure Mode 2: Instruction Interference
Gradient dilution explains why the cross-task modes fail. But why do the per-task modes (SSS, SSC) also stagnate, when their gradients are sharp and their edits faithful? The answer lives at inference time, not optimization time.
We run an oracle experiment: for each criterion, we pick the single best instruction across all single-task runs, the one with the highest held-out Spearman for that task, then combine the four oracle-optimal instructions into one prompt. Even these individually-best instructions degrade when combined, falling from 0.305 to 0.220 average Spearman (−0.085), strictly worse than the generic baseline (0.284).
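The selection step of the oracle experiment might be sketched as follows. This is a minimal illustration under stated assumptions: the record layout, the helper names `pick_oracle_instructions` and `combine`, and every instruction string and score in `runs` are hypothetical placeholders, not the paper's data or code.

```python
# Hypothetical records, one per (criterion, run); scores are placeholders.
runs = [
    {"criterion": "fluency", "instruction": "fluency rubric, run 1", "spearman": 0.31},
    {"criterion": "fluency", "instruction": "fluency rubric, run 2", "spearman": 0.35},
    {"criterion": "relevance", "instruction": "relevance rubric, run 1", "spearman": 0.29},
    {"criterion": "relevance", "instruction": "relevance rubric, run 2", "spearman": 0.27},
]


def pick_oracle_instructions(runs):
    """Keep, per criterion, the instruction with the highest held-out Spearman."""
    best = {}
    for run in runs:
        incumbent = best.get(run["criterion"])
        if incumbent is None or run["spearman"] > incumbent["spearman"]:
            best[run["criterion"]] = run
    return {criterion: run["instruction"] for criterion, run in best.items()}


def combine(instructions):
    """Pack the per-criterion instructions into one judge prompt."""
    return "\n\n".join(f"{criterion}: {text}" for criterion, text in instructions.items())


oracle = pick_oracle_instructions(runs)
combined_prompt = combine(oracle)
print(combined_prompt)
```

The combined prompt would then be scored against the same held-out annotations as the single-task prompts; the reported drop is 0.305 − 0.220 = 0.085 average Spearman.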
The mechanism is instruction-length asymmetry. Optimization over-specifies some criteria (the fluency rubric expands to approximately 800 tokens with detailed scoring anchors) while leaving others under-specified (the relevance instruction remains at approximately 4 tokens of the initial prompt). Packed into a single prompt, verbose instructions receive disproportionate attention relative to brief ones at inference time. Individually good rubrics can hurt when combined, so interference cannot be fixed by better per-task optimization alone.
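One way to see the asymmetry is to compute each criterion's share of the packed prompt. The sketch below counts whitespace-separated words as a crude stand-in for model tokens, which is an assumption; `length_shares` is a hypothetical helper, and the two instructions are dummies sized to match the approximate lengths quoted above.

```python
def length_shares(instructions: dict[str, str]) -> dict[str, float]:
    """Fraction of the packed prompt's length contributed by each criterion."""
    # Whitespace word counts are only a rough proxy for real tokenizer counts.
    counts = {criterion: len(text.split()) for criterion, text in instructions.items()}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("all instructions are empty")
    return {criterion: n / total for criterion, n in counts.items()}


# Dummy instructions sized like the two extremes described in the text;
# the other two criteria are omitted for brevity.
instructions = {
    "fluency": " ".join(["anchor"] * 800),
    "relevance": "Rate the summary's relevance.",
}
shares = length_shares(instructions)
print({criterion: round(share, 3) for criterion, share in shares.items()})
```

With these sizes the fluency rubric accounts for more than 99% of the instruction text, which is the imbalance the paragraph above describes.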
This result extends a finding from RRD, which shows that naive rubric construction degrades GPT-4o preference-judgment accuracy by 13 points on JudgeBench. RRD shows that bad rubrics hurt; our oracle experiment shows the stronger claim that even rubrics that are individually good can hurt once they are combined.
What This Means for Custom LLM Judges
For practitioners customizing judges to domain-specific criteria, these results indicate that architectural changes are required before the multi-objective setting can work reliably. Addressing either failure mode alone is insufficient.
For gradient dilution: conflict-aware gradient resolution adapted from numerical multi-task learning (PCGrad, CAGrad) could address dilution if textual gradients can be meaningfully embedded and projected. A specificity-aware router could fall back to per-task gradient calls when multi-task specificity drops below a threshold, capturing CCC’s hypervolume gains without losing task focus.
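A specificity-aware router of the kind suggested here could look roughly like the sketch below. It is not an implementation from the paper: the function `route_gradients`, the callables it takes, and the `THRESHOLD` value are all hypothetical, and the lambdas are stubs standing in for LLM calls.

```python
from typing import Callable, Dict, Tuple

THRESHOLD = 7.0  # hypothetical cut-off on the 1-10 specificity scale


def route_gradients(
    feedback: Dict[str, str],
    joint_gradient: Callable[[Dict[str, str]], str],
    per_task_gradient: Callable[[str, str], str],
    score_specificity: Callable[[str], float],
    threshold: float = THRESHOLD,
) -> Tuple[str, Dict[str, str]]:
    """Try one joint gradient call; fall back to per-task calls if it is too generic."""
    joint = joint_gradient(feedback)
    if score_specificity(joint) >= threshold:
        # The joint gradient is focused enough: reuse it for every criterion.
        return "joint", {criterion: joint for criterion in feedback}
    # Otherwise pay for one gradient call per criterion.
    return "per_task", {c: per_task_gradient(c, text) for c, text in feedback.items()}


# Stubs standing in for LLM calls.
feedback = {"fluency": "awkward phrasing missed", "consistency": "hallucination missed"}
route, gradients = route_gradients(
    feedback,
    joint_gradient=lambda fb: "improve overall writing quality",
    per_task_gradient=lambda c, text: f"for {c}: address '{text}'",
    score_specificity=lambda gradient: 3.7,  # the cross-task mean reported earlier
)
print(route, gradients)
```

The threshold would need tuning in practice; any value between the reported cross-task mean (3.7) and per-task mean (9.0) separates the two regimes in this toy setup.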
For instruction interference: issuing a separate judge call per criterion eliminates interference but multiplies inference cost. Length-aware instruction synthesis that normalizes rubric length during optimization could prevent verbose rubrics from dominating the attention budget. Next-token attention masking that exposes only the relevant criterion's instruction while each output field is generated could remove interference without the extra inference cost of separate calls.
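The simplest of these mitigations, one judge call per criterion, is sketched below to make the cost trade-off explicit. The helper `judge_separately`, the prompt layout, and the `stub_judge` function are assumptions for illustration; a real setup would replace the stub with an actual LLM call.

```python
def judge_separately(summary, instructions, call_judge):
    """Score one summary with one judge call per criterion."""
    scores = {}
    for criterion, instruction in instructions.items():
        # Each prompt contains only its own criterion's instruction.
        prompt = f"{instruction}\n\nSummary:\n{summary}"
        scores[criterion] = call_judge(prompt)
    return scores


calls = []


def stub_judge(prompt):
    """Placeholder for an LLM judge; records the prompt and returns a dummy score."""
    calls.append(prompt)
    return 3


criteria = ("fluency", "relevance", "coherence", "consistency")
instructions = {c: f"Rate the {c} of the summary." for c in criteria}
scores = judge_separately("A short summary.", instructions, stub_judge)
print(len(calls), scores)
```

No prompt ever contains another criterion's rubric, so there is nothing to interfere, but scoring one summary takes four calls instead of one.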
The diagnostics we provide (gradient specificity and feedback adherence) give a way to measure both failure modes, so future work can evaluate mitigations against the same yardstick.