A Unified Framework for Gradient Aggregation in Multi-Objective Optimization

arXiv cs.LG 06/01/26, 04:00 AM Papers
Summary
This paper presents a unified theoretical framework for gradient aggregation in multi-objective optimization, establishing convergence rates to Pareto stationarity. The authors introduce a sufficient alignment condition and demonstrate its application to existing and new algorithms, such as capped MGDA.
arXiv:2605.30452v1 Announce Type: new Abstract: Many machine learning problems involve multiple inherent trade-offs that are best addressed by gradient-based multi-objective optimization (MOO) algorithms. Existing methods are often proposed with various motivations, analyzed case by case, and differ algorithmically in how the component gradients are aggregated at each step. In this work, we develop a unifying framework for gradient aggregation in MOO, establishing (optimal) rates of convergence to Pareto stationarity, the standard measure of performance in MOO. Central to our analysis is a sufficient alignment condition, from which we derive a theorem showing that non-conflicting directions, when chosen within the convex hull of gradients, form a fundamental sufficient condition for convergence. We further show that feasibility can be ensured through projection onto the dual cone, broadening the scope of methods that admit convergence guarantees. In parallel, we present a primal optimization perspective of gradient aggregation that encompasses established algorithms, clarifies their theoretical relationships, and enables the design of new variants. As an illustration, we introduce capped MGDA, derived from a CVaR-based formulation, and demonstrate its robustness in adversarial federated learning. Finally, we validate our theory through experiments on synthetic problems and practical benchmarks.
Cached at: 06/01/26, 09:24 AM
# A Unified Framework for Gradient Aggregation in Multi-Objective Optimization
Source: [https://arxiv.org/html/2605.30452](https://arxiv.org/html/2605.30452)
Zeou Hu; Kelvin Ho, Computer Science, The Chinese University of Hong Kong; Yaoliang Yu, Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada, and Vector Institute

###### Abstract

Many machine learning problems involve multiple inherent trade-offs that are best addressed by gradient-based multi-objective optimization (MOO) algorithms. Existing methods are often proposed with various motivations, analyzed case by case, and differ algorithmically in how the component gradients are aggregated at each step. In this work, we develop a unifying framework for gradient aggregation in MOO, establishing (optimal) rates of convergence to Pareto stationarity, the standard measure of performance in MOO. Central to our analysis is a sufficient alignment condition, from which we derive a theorem showing that non-conflicting directions, when chosen within the convex hull of gradients, form a fundamental sufficient condition for convergence. We further show that feasibility can be ensured through projection onto the dual cone, broadening the scope of methods that admit convergence guarantees. In parallel, we present a primal optimization perspective of gradient aggregation that encompasses established algorithms, clarifies their theoretical relationships, and enables the design of new variants. As an illustration, we introduce capped MGDA, derived from a CVaR-based formulation, and demonstrate its robustness in adversarial federated learning. Finally, we validate our theory through experiments on synthetic problems and practical benchmarks.

## 1 Introduction

Many problems in machine learning are inherently multi-objective, requiring a balance between multiple, often competing, performance criteria. This tension is evident across diverse applications, from ensuring fairness alongside accuracy in classification systems, to balancing heterogeneous clients' performances in federated learning (FL), and to jointly mastering different tasks with a shared model in multi-task learning (MTL). To address such challenges in modern deep learning, gradient-based MOO methods have become indispensable, offering scalability to high-dimensional models and seamless integration with existing training pipelines. In these methods, the key algorithmic challenge is to determine, at each iteration, an effective update direction $\mathbf{d}$, synthesized from the component gradients, that can guide learning across competing objectives.

Recent work on gradient aggregation in MOO spans a range of algorithms (e.g., MGDA [Desideri12], Nash-MTL [NavonSAMKCF22], FairGrad [BanJi24], and UPGrad [quinton2024jacobian], among others), each proposing a particular rule for constructing $\mathbf{d}$ from component gradients. These methods were developed under disparate motivations and, when available, their convergence analyses are established case by case, tied to method-specific assumptions and proofs. As a result, while these methods have provided valuable insights, there is still no general framework to explain what properties of an update direction ensure convergence to Pareto stationarity or how these different methods are connected. This gap highlights the need for a unifying theory that clarifies the conditions for convergence and offers a principled basis for designing new aggregation schemes.

In this work, we develop a general theoretical framework for gradient-based MOO. Our first main result ([Theorem 1](https://arxiv.org/html/2605.30452#Thmtheorem1) and [Corollary 1](https://arxiv.org/html/2605.30452#Thmcorollary1)) establishes a broad *alignment condition* ([A](https://arxiv.org/html/2605.30452#S4.Ex1)) on $\mathbf{d}_t$ that guarantees convergence to Pareto stationarity. Its versatility makes [Corollary 1](https://arxiv.org/html/2605.30452#Thmcorollary1) the cornerstone of our analysis. Building on it, we derive [Theorem 2](https://arxiv.org/html/2605.30452#Thmtheorem2), which specializes condition ([A](https://arxiv.org/html/2605.30452#S4.Ex1)) to the convex hull and non-conflicting requirements, thereby explaining the success of prominent non-conflicting aggregation rules. We further show that feasibility can be restored by projection onto the dual cone, leading to [Corollary 2](https://arxiv.org/html/2605.30452#Thmcorollary2). In parallel, we introduce the (primal) optimization subproblem perspective of gradient aggregation ([12](https://arxiv.org/html/2605.30452#S4.E12)), and establish sufficient conditions under which the resulting aggregation is a conic combination of gradients, ensuring convergence ([Theorem 4](https://arxiv.org/html/2605.30452#Thmtheorem4)). This perspective subsumes existing formulations and provides a principled recipe for designing new ones. Together, these results form a coherent unifying framework that simplifies theoretical analysis, clarifies prior work, and opens new design possibilities. We summarize our contributions below:

- We establish a general alignment criterion for gradient-based MOO, yielding a broadly applicable template for analyzing convergence.
- We uncover the theoretical importance of non-conflicting directions in MOO as a fundamental condition leading to convergence.
- We study the primal optimization subproblem formulation of gradient aggregation, providing sufficient conditions for the resulting aggregation to be in the conic hull and converge, subsuming several existing methods (e.g., LS, MGDA, Nash-MTL) and clarifying their relationships.
- We design and analyze a novel method, capped MGDA, derived from a CVaR-based primal formulation, illustrating our framework's ability to generate new aggregations.
- We validate our theoretical results through experiments on synthetic and fairness benchmarks, and on adversarial federated learning, demonstrating the robustness of capped MGDA.

[Figure 1 diagram omitted. Labels recovered from the original: Theorem 1 (non-convex); Theorem 3 (convex); Corollary 1, condition (A); Angle Constraint; Theorem 2, convex hull & non-conflicting; Corollary 2; Theorem 4, primal-dual; projection to dual cone; convex hull; $m=1$. Methods shown: MGDA, Nash-MTL\*, UPGrad\*, DualProj\*, PCGrad\*, UPGrad, DualProj, Greedy-DCP, Power mean: FairGrad, Convex risk measure: CVaR.]

Figure 1: An overview of the relationships between our key theoretical results. Our pivotal result, [Theorem 1](https://arxiv.org/html/2605.30452#Thmtheorem1) and [Corollary 1](https://arxiv.org/html/2605.30452#Thmcorollary1), establishes a broadly applicable convergence guarantee that requires condition ([A](https://arxiv.org/html/2605.30452#S4.Ex1)). From this result, we then derive [Theorem 2](https://arxiv.org/html/2605.30452#Thmtheorem2), [Corollary 2](https://arxiv.org/html/2605.30452#Thmcorollary2), and [Theorem 4](https://arxiv.org/html/2605.30452#Thmtheorem4), which provide more insightful and readily verifiable criteria for a broad class of MOO aggregation methods.
## 2 Related Works

Multi-objective optimization (MOO) and Pareto solutions have been extensively studied, with classical approaches such as evolutionary algorithms [DebPAM02]. However, modern ML problems are large-scale and differentiable, making gradient-based methods more appropriate. Therefore, in this work we focus on gradient-based MOO, in particular, multi-objective gradient aggregation.

Gradient-based MOO. Gradient-based MOO optimizes multiple objectives using gradient information. A foundational algorithm in this regard is MGDA [Mukai80,FliegeSvaiter00,Desideri12], which computes a *non-conflicting* direction by solving for the minimum-norm element in the convex hull of gradients. [FliegeVV19] provided a detailed convergence analysis of MGDA. Subsequent works (e.g., [Fliege2009newton,MontonenKM18,tanabe2019proximal,AssunccaoFP21,tanabe2023accelerated]) have also extended classical single-objective methods to the multi-objective setting. Another important line of research studies stochastic variants of MGDA [MercierPD18,LiuVicente21,ZhouZJZGZ22,FernandoSLCMC23,ChenFYC23,XiaoBJ23], motivated by their practical relevance in machine learning, particularly for mini-batch training of deep neural networks.

Multi-task Learning (MTL) and Multi-Objective Gradient Aggregation (MOGA). MTL aims to train a single model that performs well across multiple tasks. [SenerKoltun18] first cast MTL as a multi-objective optimization problem and applied MGDA to address it. Since then, a rich line of research in MTL has proposed general-purpose multi-objective gradient aggregation methods, focusing on novel gradient aggregation schemes to mitigate task conflicts. Examples include PCGrad [YuKGLHF20], which projects each gradient onto the normal plane of others; CAGrad [LiuLJSL21], which balances average and worst-case objectives by constraining the search region; and Nash-MTL [NavonSAMKCF22], which frames MTL as a bargaining game. Other approaches include IMTL-G [LiuLKXCYLZ21] and FairGrad [BanJi24], among others.

Non-conflicting direction and the dual cone. The notion of a non-conflicting update direction has appeared in various MOO-related works [Desideri12,YuKGLHF20,LiuLJSL21], though often without sufficient formalization or emphasis. Recent studies clarified that this criterion corresponds to a dual cone constraint over the gradients $\{\mathbf{g}_k\}$, which can be explicitly enforced to guarantee conflict-free updates [HwangLim2024,quinton2024jacobian]. While [quinton2024jacobian] acknowledge the relevance of non-conflicting directions and propose projection onto the dual cone to ensure them, they do not investigate its theoretical significance. In contrast, our work rigorously establishes the non-conflicting property, together with membership in the convex hull of gradients, as a unifying sufficient condition for convergence to Pareto stationarity (see [Theorem 2](https://arxiv.org/html/2605.30452#Thmtheorem2)). We show that being non-conflicting is not merely a preference but a fundamental condition for convergence guarantees, something not recognized in prior work.

## 3 Preliminaries

This section reviews the concepts of Pareto optimality, Pareto stationarity, a measure for quantifying the latter, and two key cones associated with the Jacobian matrix in multi\-objective optimization\.

### 3.1 Multi-Objective Optimization (MOO)

In mathematical terms, a Multi\-Objective Optimization \(MOO\) problem can be written as:

$$\min_{\mathbf{w}\in\mathbb{R}^{d}}\ \mathbf{f}(\mathbf{w}), \quad \text{where } \mathbf{f}(\mathbf{w}) := \bigl(f_{1}(\mathbf{w}),\ldots,f_{m}(\mathbf{w})\bigr), \tag{1}$$

and the minimum is defined w.r.t. the *partial* ordering:

$$\mathbf{f}(\mathbf{w})\leq\mathbf{f}(\mathbf{z}) \iff \forall i=1,\ldots,m,\ f_{i}(\mathbf{w})\leq f_{i}(\mathbf{z}). \tag{2}$$

Unlike single-objective optimization, with multiple objectives it is possible that

$$\mathbf{f}(\mathbf{w})\not\leq\mathbf{f}(\mathbf{z}) \ \text{ and } \ \mathbf{f}(\mathbf{z})\not\leq\mathbf{f}(\mathbf{w}), \tag{3}$$

in which case we say $\mathbf{w}$ and $\mathbf{z}$ are not comparable. As a result, a MOO problem typically admits a set of optimal solutions (a.k.a. *Pareto optimal* solutions), whose objective values form the *Pareto front*.
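The partial ordering in (2) and the incomparability in (3) can be made concrete with a small sketch. The function names `leq` and `comparable` and the sample objective vectors are illustrative assumptions, not part of the paper.

```python
def leq(fw, fz):
    """Componentwise order from (2): fw <= fz iff every entry of fw is <= fz's."""
    return all(a <= b for a, b in zip(fw, fz))


def comparable(fw, fz):
    """Two objective vectors are comparable if one is <= the other."""
    return leq(fw, fz) or leq(fz, fw)


# Hypothetical objective values for three candidate points.
f_a = (1.0, 2.0)
f_b = (2.0, 1.0)   # better than f_a on objective 2, worse on objective 1
f_c = (2.0, 3.0)   # no better than f_a on either objective

print(comparable(f_a, f_b))  # False: neither vector is <= the other, as in (3)
print(leq(f_a, f_c))         # True: f_a is at least as good on every objective
```

With more than one objective, pairs like `f_a` and `f_b` are the typical case, which is why a set of Pareto optimal solutions replaces the single minimizer of the scalar setting.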

### 3.2 Pareto Optimality and Pareto Stationarity

###### Definition 1 (Pareto Optimality).

We call $\mathbf{w}^{*}$ a *Pareto optimal* solution of ([1](https://arxiv.org/html/2605.30452#S3.E1)) if its objective value $\mathbf{f}(\mathbf{w}^{*})$ is a minimum element w.r.t. the partial ordering in ([2](https://arxiv.org/html/2605.30452#S3.E2)); equivalently,

$$\forall\mathbf{w},\ \mathbf{f}(\mathbf{w})\leq\mathbf{f}(\mathbf{w}^{*})\implies\mathbf{f}(\mathbf{w})=\mathbf{f}(\mathbf{w}^{*}). \tag{4}$$

In other words, it is not possible to improve *any* component objective in $\mathbf{f}(\mathbf{w}^{*})$ without compromising *some* other objective. Similarly, we call $\mathbf{w}^{*}$ *weakly* Pareto optimal if it is not possible to improve *all* objectives in $\mathbf{f}(\mathbf{w}^{*})$, i.e., there does not exist $\mathbf{w}$ such that $\mathbf{f}(\mathbf{w})<\mathbf{f}(\mathbf{w}^{*})$.

Next, we recall the concept of*Pareto Stationarity*\(also referred to as*Pareto Criticality*\), which is the first\-order necessary condition for Pareto optimality\.

###### Definition 2 (Pareto Stationarity).

We call $\mathbf{w}^{*}$ Pareto stationary (PS) iff

$$\mathbf{0}\in\operatorname{conv}\{\nabla f_{1}(\mathbf{w}^{*}),\cdots,\nabla f_{m}(\mathbf{w}^{*})\}, \tag{5}$$

i.e., there exists some $\boldsymbol{\lambda}\in\Delta$ (the probability simplex) such that $\sum_{i=1}^{m}\lambda_{i}\nabla f_{i}(\mathbf{w}^{*})=\mathbf{0}$.

The relevance of Pareto stationarity is captured in the following lemma:

###### Lemma 1 (e.g., [Mukai80], Thm 1).

Any Pareto optimal solution is Pareto stationary\. Conversely, if all functions are convex \(resp\., strictly convex\), then any Pareto stationary solution is weakly Pareto optimal \(resp\., Pareto optimal\)\.

Measure of Pareto Stationarity. To quantify the degree of Pareto stationarity, we recall the following metric (e.g., [Mukai80,ChenFYC23,ZhangXJZ24]):

$$\gamma(\mathbf{w})=\gamma_{\mathbf{f}}(\mathbf{w}) := \min_{\boldsymbol{\lambda}\in\Delta}\,\|J_{\mathbf{f}}(\mathbf{w})\boldsymbol{\lambda}\|, \quad \text{where } J_{\mathbf{f}}(\mathbf{w}) := [\nabla f_{1}(\mathbf{w}),\ldots,\nabla f_{m}(\mathbf{w})]. \tag{6}$$

Clearly, $\gamma(\mathbf{w})=0$ iff $\mathbf{w}$ is Pareto stationary. When $m=1$ (single-objective), $\gamma(\mathbf{w})=\|\nabla f(\mathbf{w})\|$ is the standard gradient norm widely used in analyzing gradient descent for nonconvex functions. This measure $\gamma(\mathbf{w})$ is continuous (assuming $\mathbf{f}$ is continuously differentiable). Therefore, when $\mathbf{w}_{t}\to\mathbf{w}_{*}$ and $\gamma(\mathbf{w}_{t})\to 0$, we immediately know that the limit $\mathbf{w}_{*}$ must be Pareto stationary since $\gamma(\mathbf{w}_{*})=0$.
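For two objectives, the minimization in (6) is over a line segment and has a closed form, which gives a quick way to evaluate $\gamma$ by hand. The following is a minimal sketch for the case $m=2$ only; the helper names and the sample gradients are assumptions made for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def min_norm_two(g1, g2):
    """Return (lam, d) with d = lam*g1 + (1-lam)*g2 of minimum norm, lam in [0, 1]."""
    diff = [a - b for a, b in zip(g1, g2)]          # g1 - g2
    denom = dot(diff, diff)
    if denom == 0.0:
        lam = 0.5                                   # g1 == g2: any weight works
    else:
        # Unconstrained minimizer of ||g2 + lam*(g1 - g2)||^2, clipped to [0, 1].
        lam = min(1.0, max(0.0, -dot(g2, diff) / denom))
    d = [lam * a + (1.0 - lam) * b for a, b in zip(g1, g2)]
    return lam, d


def gamma_two(g1, g2):
    """Pareto-stationarity measure (6) for m = 2."""
    _, d = min_norm_two(g1, g2)
    return math.sqrt(dot(d, d))


print(gamma_two((1.0, 0.0), (0.0, 1.0)))    # orthogonal gradients: sqrt(0.5)
print(gamma_two((1.0, 0.0), (-1.0, 0.0)))   # opposing gradients: 0.0, Pareto stationary
```

For general $m$ the same quantity requires a small quadratic program over the simplex; the closed form above does not extend beyond two gradients.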

We introduce two cones in $\mathbb{R}^{d}$ that are related to a matrix $J\in\mathbb{R}^{d\times m}$:

$$\operatorname{cone} J := \{\mathbf{d}:\mathbf{d}=J\boldsymbol{\mu},\ \boldsymbol{\mu}\geq\mathbf{0}\}, \qquad \operatorname{cone}^{*} J := \{\mathbf{d}:J^{\top}\mathbf{d}\geq\mathbf{0}\}. \tag{7}$$

Setting $J$ to be the (transposed) Jacobian $J_{\mathbf{f}}(\mathbf{w})$ at each iteration, the two cones represent two natural conditions on the update direction $\mathbf{d}$:

- $\operatorname{cone} J_{\mathbf{f}}(\mathbf{w})$ consists of directions that are conic combinations of the component gradients;
- $\operatorname{cone}^{*} J_{\mathbf{f}}(\mathbf{w})$ consists of directions that are *non-conflicting* with each component gradient.

We note that a direction $\mathbf{d}=J\boldsymbol{\mu}\in\operatorname{cone} J$ with $\boldsymbol{\mu}\neq\mathbf{0}$ can be normalized (dividing by $\|\boldsymbol{\mu}\|_{1}$) to lie in $\operatorname{conv} J$, and the normalization constant can be absorbed into the step size.
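Both cone conditions in (7) are cheap to check numerically. The sketch below tests dual-cone membership and rescales conic weights onto the simplex; the function names and sample gradients are hypothetical, and gradients are stored as the columns of $J$.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def combine(grads, weights):
    """Return J @ weights, where the gradients are the columns of J."""
    dim = len(grads[0])
    return [sum(w * g[i] for w, g in zip(weights, grads)) for i in range(dim)]


def is_non_conflicting(grads, d, tol=1e-12):
    """Dual-cone test from (7): J^T d >= 0, i.e. <g_k, d> >= 0 for every k."""
    return all(dot(g, d) >= -tol for g in grads)


def normalize_to_hull(mu):
    """Rescale conic weights mu >= 0 (not all zero) onto the simplex."""
    total = sum(mu)
    return [m / total for m in mu], total


grads = [(1.0, 0.0), (0.0, 1.0)]       # hypothetical component gradients
mu = [2.0, 6.0]                         # conic weights
lam, scale = normalize_to_hull(mu)      # lam = [0.25, 0.75], scale = 8.0
d = combine(grads, lam)                 # direction in the convex hull
print(d, is_non_conflicting(grads, d), is_non_conflicting(grads, (1.0, -1.0)))
```

Here `scale` is the factor that would be absorbed into the step size: `combine(grads, mu)` equals `scale` times `d`.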

## 4 A Unified Framework for MOO

In this section, we consider the following general update step for MOO:

$$\mathbf{w}_{t+1}=\mathbf{w}_{t}-\eta_{t}\mathbf{d}_{t}, \tag{8}$$

where $\eta_{t}>0$ is the step size and $\mathbf{d}_{t}$ is the update direction. We will first present a general theorem for analyzing the progress of the above update. Then, we derive immediate consequences of our framework, illustrate the construction of the update direction, and detail some examples.

### 4.1 What directions lead to convergence

We recall that a function $F$ is $L$-smooth if its gradient $\nabla F$ is $L$-Lipschitz continuous, a widely adopted assumption in the analysis of gradient-based methods [Nesterov18].

Our first result is a slight generalization of the well-known result on feasible directions (e.g., [Zoutendijk76]).

###### Theorem 1 (Sufficient Alignment Condition).

Suppose there exists an $L$-smooth function $F:\mathbb{R}^{d}\to\mathbb{R}_{+}$ such that the directions $\mathbf{d}_{t}$ satisfy:

$$\langle\mathbf{d}_{t},\nabla F(\mathbf{w}_{t})\rangle\geq c_{t}\Gamma_{t}\|\mathbf{d}_{t}\|, \quad \text{with } c_{t}\geq 0. \tag{9}$$

With suitably chosen step size $\eta_{t}$ (so that ([56](https://arxiv.org/html/2605.30452#A2.E56)) in [Appendix B](https://arxiv.org/html/2605.30452#A2) holds; for instance, when $\eta_{t}=\tfrac{c_{t}\Gamma_{t}}{L\|\mathbf{d}_{t}\|}$), if $c_{t}\geq c>0$, then $\sum_{t}\Gamma_{t}^{2}\leq\frac{2LF(\mathbf{w}_{0})}{c^{2}}$. In particular, $\min_{t\leq T}\Gamma_{t}\leq\sqrt{\frac{2LF(\mathbf{w}_{0})}{c^{2}T}}$ and $\lim_{t\to\infty}\Gamma_{t}=0$.
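The bounds in Theorem 1 can be traced to the standard descent lemma for $L$-smooth functions. The following is a sketch under the example step size $\eta_{t}=c_{t}\Gamma_{t}/(L\|\mathbf{d}_{t}\|)$ named in the statement, assuming $\mathbf{d}_{t}\neq\mathbf{0}$; it is reconstructed from the statement and is not the paper's own proof, which is given in its Appendix B.

```latex
\begin{aligned}
F(\mathbf{w}_{t+1})
&\le F(\mathbf{w}_t) - \eta_t \langle \nabla F(\mathbf{w}_t), \mathbf{d}_t \rangle
   + \tfrac{L}{2}\,\eta_t^2 \|\mathbf{d}_t\|^2
&& \text{($L$-smoothness and update (8))} \\
&\le F(\mathbf{w}_t) - \eta_t\, c_t \Gamma_t \|\mathbf{d}_t\|
   + \tfrac{L}{2}\,\eta_t^2 \|\mathbf{d}_t\|^2
&& \text{(condition (9))} \\
&= F(\mathbf{w}_t) - \frac{c_t^2 \Gamma_t^2}{2L}
&& \text{(substituting } \eta_t \text{)}.
\end{aligned}
```

Summing over $t=0,\ldots,T-1$ and using $F\geq 0$ gives $\sum_{t}c_{t}^{2}\Gamma_{t}^{2}\leq 2L\bigl(F(\mathbf{w}_{0})-F(\mathbf{w}_{T})\bigr)\leq 2LF(\mathbf{w}_{0})$. With $c_{t}\geq c>0$ this yields the summability bound, and bounding the smallest of the $T$ terms by their average yields the $\min_{t\leq T}\Gamma_{t}$ bound.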

Despite the simplicity of its proof, [Theorem 1](https://arxiv.org/html/2605.30452#Thmtheorem1) is surprisingly general: there is little restriction on how the direction $\mathbf{d}_{t}$ or the quantity of interest $\Gamma_{t}$ is chosen. In the single-objective setting, letting $\Gamma_{t}=\|\nabla F(\mathbf{w}_{t})\|$ we reduce to the well-known angle constraint in the method of feasible directions (e.g., [Zoutendijk76]):

$$\frac{\langle\mathbf{d}_{t},\nabla F(\mathbf{w}_{t})\rangle}{\|\mathbf{d}_{t}\|\cdot\|\nabla F(\mathbf{w}_{t})\|}\geq c_{t}>0. \tag{10}$$

In our multi-objective setting, the function $F$ serves as a (proof) surrogate: we use it to prove the convergence of $\Gamma_{t}:=\gamma(\mathbf{w}_{t})$, the measure of Pareto stationarity. Often we can simply choose $F$ to be a linear combination (or even just the sum) of the component functions $f_{k}$ in MOO. However, we emphasize that this does not mean that MOO algorithms simply minimize $F$. Indeed, the guarantee on $\Gamma_{t}$ in [Theorem 1](https://arxiv.org/html/2605.30452#Thmtheorem1) may have little to do with $F$.

Most importantly, upon setting $\Gamma_{t}=\|\mathbf{d}_{t}\|$, condition ([9](https://arxiv.org/html/2605.30452#S4.E9)) simplifies to

$$\langle\mathbf{d}_{t},\nabla F(\mathbf{w}_{t})\rangle\geq c_{t}\|\mathbf{d}_{t}\|^{2}\geq 0, \tag{A}$$

an easily verifiable *alignment condition* (A) that is used throughout our subsequent theoretical results.

In particular, [Theorem 1](https://arxiv.org/html/2605.30452#Thmtheorem1) gives conditions under which $\mathbf{d}_{t}\to\mathbf{0}$. To relate this guarantee on $\mathbf{d}_{t}$ back to the measure of Pareto stationarity (i.e., $\gamma(\mathbf{w}_{t})$ in ([6](https://arxiv.org/html/2605.30452#S3.E6))), we need only restrict $\mathbf{d}_{t}$ to the convex hull of the component gradients, so that $\|\mathbf{d}_{t}\|\geq\gamma(\mathbf{w}_{t})$ holds trivially, since $\gamma(\mathbf{w}_{t})$ is by definition the smallest norm attained in that hull. We summarize this observation in a corollary since it highlights the convenience of searching for the direction $\mathbf{d}_{t}$ in the convex hull of component gradients:

###### Corollary 1.

Suppose $\mathbf{d}_{t}\in\operatorname{conv}(J_{\mathbf{f}}(\mathbf{w}_{t}))$ and condition ([A](https://arxiv.org/html/2605.30452#S4.Ex1)) holds with $c_{t}\geq c>0$. Then, with (constant) step size $\eta_{t}=\tfrac{c}{L}$, we have $\min_{t\leq T}\gamma(\mathbf{w}_{t})\leq\sqrt{\frac{2LF(\mathbf{w}_{0})}{c^{2}T}}$.

In other words, we approach Pareto stationarity at the rate of $O(1/\sqrt{t})$, which in general is optimal (even for the single-objective case). We note that a larger choice of $F$ makes it easier to satisfy the condition ([A](https://arxiv.org/html/2605.30452#S4.Ex1)), but this also increases the constants $L$ and $F(\mathbf{w}_{0})$ in the rate of convergence.

Based on [Corollary 1](https://arxiv.org/html/2605.30452#Thmcorollary1) we now present our second result, a surprisingly simple and yet effective sufficient condition for convergence to Pareto stationarity.

###### Theorem 2 (Convergence of Non-Conflicting Directions).

If the direction $\mathbf{d}_{t}\in\operatorname{conv}(J_{\mathbf{f}}(\mathbf{w}_{t}))$ and $\mathbf{d}_{t}\in\operatorname{cone}^{*}(J_{\mathbf{f}}(\mathbf{w}_{t}))$ (i.e., non-conflicting), then condition ([A](https://arxiv.org/html/2605.30452#S4.Ex1)) and hence [Corollary 1](https://arxiv.org/html/2605.30452#Thmcorollary1) hold with $c_{t}\equiv 1$ and $F=\sum_{k}f_{k}$.
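Theorem 2 and the bound of Corollary 1 can be checked numerically on a toy problem. The sketch below is an illustration under assumptions that are not from the paper: two quadratic objectives, the two-gradient min-norm (MGDA) direction as $\mathbf{d}_{t}$, and $F=f_{1}+f_{2}$, for which $L=2$ and $c=1$.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mgda_two(g1, g2):
    """Min-norm point of the segment [g1, g2] (two-objective MGDA direction)."""
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    lam = 0.5 if denom == 0.0 else min(1.0, max(0.0, -dot(g2, diff) / denom))
    return [lam * a + (1.0 - lam) * b for a, b in zip(g1, g2)]


# Toy objectives (assumed): f1 = 0.5*||w - a||^2 and f2 = 0.5*||w - b||^2.
a, b = (1.0, 0.0), (-1.0, 0.0)
w = [0.0, 2.0]
L, c, T = 2.0, 1.0, 20           # F = f1 + f2 has Hessian 2*I, so L = 2
eta = c / L                       # constant step size from Corollary 1
F0 = 0.5 * sum((wi - ai) ** 2 for wi, ai in zip(w, a)) \
    + 0.5 * sum((wi - bi) ** 2 for wi, bi in zip(w, b))

gammas, aligned = [], []
for _ in range(T):
    g1 = [wi - ai for wi, ai in zip(w, a)]
    g2 = [wi - bi for wi, bi in zip(w, b)]
    d = mgda_two(g1, g2)
    grad_F = [x + y for x, y in zip(g1, g2)]
    # Condition (A) with c_t = 1: <d, grad F> >= ||d||^2.
    aligned.append(dot(d, grad_F) >= dot(d, d) - 1e-12)
    gammas.append(math.sqrt(dot(d, d)))   # the min-norm d_t has norm gamma(w_t)
    w = [wi - eta * di for wi, di in zip(w, d)]

bound = math.sqrt(2.0 * L * F0 / (c ** 2 * T))
print(min(gammas) <= bound, all(aligned))   # True True
```

On this problem $\gamma(\mathbf{w}_{t})$ halves at every step, so it falls far below the worst-case $O(1/\sqrt{T})$ bound; the bound is a guarantee, not a prediction of typical behavior.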

Quite remarkably, the convex hull and the non-conflicting conditions are easy to check (and construct, as we shall see), and together they already imply the (optimal) $O(1/\sqrt{t})$ rate of convergence to Pareto stationarity. Interestingly, we can obtain a non-conflicting direction through projection:

###### Proposition 1.

Let $\mathbf{q}\in\operatorname{cone}(J)$, i.e., $\mathbf{q}=J\boldsymbol{\mu}$ for some $\boldsymbol{\mu}\geq 0$. Then, $\mathbf{d}:=\mathrm{P}_{\operatorname{cone}^{*}(J)}(\mathbf{q})\in\operatorname{cone}^{*}(J)\cap\operatorname{cone}(J)$. In particular, $\mathbf{d}=J\boldsymbol{\nu}$ for some $\boldsymbol{\nu}\geq 0$ such that $\|\boldsymbol{\nu}\|_{1}\geq\|\boldsymbol{\mu}\|_{1}$.

Thus, surprisingly but conveniently, from an algorithmic point of view, it suffices to choose a (pre-)direction $\mathbf{q}_{t}$ from the convex hull of the component gradients:

###### Corollary 2.

Let $\mathbf{d}_{t}=\mathrm{P}_{\operatorname{cone}^{*}(J_{t})}(\mathbf{q}_{t})$ where $\mathbf{q}_{t}\in\operatorname{conv}(J_{t})$ and $J_{t}:=J_{\mathbf{f}}(\mathbf{w}_{t})$. Then ([A](https://arxiv.org/html/2605.30452#S4.Ex1)) and hence [Corollary 1](https://arxiv.org/html/2605.30452#Thmcorollary1) hold with $c_{t}\geq 1$.
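One way to compute the projection used in Proposition 1 and Corollary 2 is through the Moreau decomposition: the projection of $\mathbf{q}$ onto $\operatorname{cone}^{*}(J)$ equals $\mathbf{q}+J\boldsymbol{\nu}$, where $\boldsymbol{\nu}\geq 0$ minimizes $\|\mathbf{q}+J\boldsymbol{\nu}\|^{2}$. The sketch below solves that nonnegative least-squares problem by cyclic coordinate descent. The solver, the function name `project_dual_cone`, and the sample gradients are choices made here for illustration, not the paper's procedure.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_dual_cone(grads, q, sweeps=200):
    """Project q onto {d : <g_k, d> >= 0 for all k}.

    Uses d = q + sum_k nu_k g_k, where nu >= 0 minimizes the squared norm of
    that sum; the minimization is done by cyclic coordinate descent.
    """
    nu = [0.0] * len(grads)
    d = list(q)                       # running value of q + J nu
    for _ in range(sweeps):
        for k, g in enumerate(grads):
            gg = dot(g, g)
            if gg == 0.0:
                continue              # a zero gradient imposes no constraint
            new = max(0.0, nu[k] - dot(g, d) / gg)   # exact 1-D minimizer
            step = new - nu[k]
            d = [di + step * gi for di, gi in zip(d, g)]
            nu[k] = new
    return d, nu


grads = [(1.0, 0.0), (-2.0, 1.0)]     # hypothetical conflicting gradients
mu = (0.5, 0.5)                        # pre-direction weights in the simplex
q = [sum(m * g[i] for m, g in zip(mu, grads)) for i in range(2)]
d, nu = project_dual_cone(grads, q)
print(q, d, [m + n for m, n in zip(mu, nu)])
```

In this example $\mathbf{q}=(-0.5,\,0.5)$ conflicts with the first gradient, the projection is $\mathbf{d}=(0,\,0.5)$, and the combined weights $\boldsymbol{\mu}+\boldsymbol{\nu}=(1.0,\,0.5)$ are nonnegative with a larger $\ell_{1}$ norm than $\boldsymbol{\mu}$, in line with Proposition 1.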

Furthermore, we observe that condition ([A](https://arxiv.org/html/2605.30452#S4.Ex1)) is stable under convex combinations, namely that if each direction $\mathbf{d}_{j}$ satisfies ([A](https://arxiv.org/html/2605.30452#S4.Ex1)), then so does any of their convex combinations. UPGrad [quinton2024jacobian] is an example method that takes the average of these projected directions. See also [Section A.2.2](https://arxiv.org/html/2605.30452#A1.SS2.SSS2) for another novel variant whose convergence follows directly from [Corollary 2](https://arxiv.org/html/2605.30452#Thmcorollary2).

Convex case. When the component functions $f_{k}$ are convex, by slightly strengthening the non-conflicting property of the direction $\mathbf{d}_{t}$, we can establish an $O(\frac{1}{t})$ convergence rate in terms of the (aggregated) function value. See [Appendix B](https://arxiv.org/html/2605.30452#A2) for detailed proof and discussion.

###### Theorem 3 (Convergence under Monotone Descent).

Suppose each objective $f_{k}$ is $L$-smooth, convex and bounded from below. Choose $\mathbf{d}_{t}\in\operatorname{conv}(J_{\mathbf{f}}(\mathbf{w}_{t}))$ (and step size $\eta_{t}\equiv\eta\leq\tfrac{1}{L}$) such that the function values $\{\mathbf{f}(\mathbf{w}_{t})\}$ monotonically decrease. Then, there exists $\boldsymbol{\lambda}\in\Delta$ such that the iterates $\mathbf{w}_{t}$ defined in ([8](https://arxiv.org/html/2605.30452#S4.E8)) satisfy: for any $\mathbf{w}$,

$$\langle\boldsymbol{\lambda},\mathbf{f}(\mathbf{w}_{t})\rangle-\langle\boldsymbol{\lambda},\mathbf{f}(\mathbf{w})\rangle\leq\tfrac{1}{2\eta t}\|\mathbf{w}_{0}-\mathbf{w}\|^{2}. \tag{11}$$

In particular, choosing $\mathbf{w}_{*}=\operatorname*{argmin}_{\mathbf{w}}\langle\boldsymbol{\lambda},\mathbf{f}(\mathbf{w})\rangle$, which is weakly Pareto optimal under convexity and Pareto optimal under strict convexity, we conclude that the iterates $\mathbf{w}_{t}$ converge at a rate of $O(1/t)$ (in terms of function value). This result generalizes that of [FliegeVV19], from the particular method MGDA to any direction $\mathbf{d}_{t}\in\operatorname{conv}(J_{\mathbf{f}}(\mathbf{w}_{t}))$ that lies in the interior of $\operatorname{cone}^{*}(J_{\mathbf{f}}(\mathbf{w}_{t}))$ (which guarantees descent). Needless to say, when $m=1$ (single-objective), [Theorem 3](https://arxiv.org/html/2605.30452#Thmtheorem3) reduces to the well-known result of gradient descent.

In the next subsection, we present another way to construct the direction $\mathbf{d}_{t}$, followed by some examples that recover existing algorithms and uncover new variants.

### 4.2 Constructing the update direction in the conic hull

We now show how to construct the update (pre-)direction in the conic hull of the component gradients so that ([A](https://arxiv.org/html/2605.30452#S4.Ex1)) holds directly. Our construction is based on the subproblem

$$\mathbf{q}_{t}=\operatorname*{argmin}_{\mathbf{q}}\;s(J_{t}^{\top}\mathbf{q})+r(\|\mathbf{q}\|), \quad \text{where } J_{t}:=J_{\mathbf{f}}(\mathbf{w}_{t})=[\nabla f_{1}(\mathbf{w}_{t}),\ldots,\nabla f_{m}(\mathbf{w}_{t})], \tag{12}$$

with $s:\mathbb{R}^{m}\to\mathbb{R}$ and $r:\mathbb{R}_{+}\to\mathbb{R}$. (Footnote 1: For many choices of $s$ and $r$, subproblem ([12](https://arxiv.org/html/2605.30452#S4.E12)) can be solved in closed form or relatively easily; see [Table 1](https://arxiv.org/html/2605.30452#S4.T1). More generally, if subproblem ([12](https://arxiv.org/html/2605.30452#S4.E12)) is solved inexactly with an alignment gap $\epsilon_{t}$, our result in ([57](https://arxiv.org/html/2605.30452#A2.E57)) is slightly weakened by an error of order $\sum_{t}\eta_{t}\epsilon_{t}$.)

When $s$ is convex and $r$ is increasing convex, using the Fenchel conjugate $s^{*}(\boldsymbol{\lambda}):=\max_{\mathbf{w}}\langle\boldsymbol{\lambda},\mathbf{w}\rangle-s(\mathbf{w})$, we derive the dual of ([12](https://arxiv.org/html/2605.30452#S4.E12)):

$$\begin{aligned}
\min_{\mathbf{q}}\;s(J_{t}^{\top}\mathbf{q})+r(\|\mathbf{q}\|)
&=\min_{\mathbf{q}}\;\max_{\boldsymbol{\lambda}}\;\langle-\boldsymbol{\lambda},J_{t}^{\top}\mathbf{q}\rangle-s^{*}(-\boldsymbol{\lambda})+r(\|\mathbf{q}\|)\\
&=\max_{\boldsymbol{\lambda}}\;\min_{\mathbf{q}}\;\langle J_{t}\boldsymbol{\lambda},-\mathbf{q}\rangle-s^{*}(-\boldsymbol{\lambda})+r(\|\mathbf{q}\|)\\
&=-\min_{\boldsymbol{\lambda}}\;s^{*}(-\boldsymbol{\lambda})+r^{*}(\|J_{t}\boldsymbol{\lambda}\|),
\end{aligned} \tag{13}$$

where $\mathbf{q}=\beta J_{t}\boldsymbol{\lambda}$ and $\beta=\frac{\nabla r^{*}(\|J_{t}\boldsymbol{\lambda}\|)}{\|J_{t}\boldsymbol{\lambda}\|}\geq 0$. When $s$ is also decreasing, we have $-\boldsymbol{\lambda}\leq\mathbf{0}$ and hence $\mathbf{q}_{t}\in\operatorname{cone}(J_{t})$.

However, the direction $\mathbf{q}_{t}$ constructed above need not be non-conflicting (an example will follow). Instead, we can directly establish the condition ([A](https://arxiv.org/html/2605.30452#S4.Ex1)). For simplicity, assume $r(\|\mathbf{q}\|)\geq\tfrac{1}{2}\|\mathbf{q}\|^{2}$ and $s(\mathbf{0})=r(0)=0$. From the optimality of $\mathbf{q}_{t}$ in ([12](https://arxiv.org/html/2605.30452#S4.E12)):

$$0=s(\mathbf{0})+r(\|\mathbf{0}\|)\geq s(J_{t}^{\top}\mathbf{q}_{t})+\tfrac{1}{2}\|\mathbf{q}_{t}\|^{2}, \quad \text{i.e.,}\;-s(J_{t}^{\top}\mathbf{q}_{t})\geq\tfrac{1}{2}\|\mathbf{q}_{t}\|^{2}. \tag{14}$$

Then, to establish ([A](https://arxiv.org/html/2605.30452#S4.Ex1)), we apply the convexity of $s$:

$$\begin{aligned}
-s(J_{t}^{\top}\mathbf{q}_{t})
&\leq-s(\mathbf{0})+\langle-\nabla s(\mathbf{0}),J_{t}^{\top}\mathbf{q}_{t}\rangle\\
&=\langle-J_{t}\nabla s(\mathbf{0}),\mathbf{q}_{t}\rangle\\
&=\langle\nabla F(\mathbf{w}_{t}),\mathbf{q}_{t}\rangle,
\end{aligned} \tag{15}$$

upon choosing $F(\mathbf{w})=\langle-\nabla s(\mathbf{0}),\mathbf{f}(\mathbf{w})\rangle$. Combining (14) and (15) gives $\langle\nabla F(\mathbf{w}_{t}),\mathbf{q}_{t}\rangle\geq\tfrac{1}{2}\|\mathbf{q}_{t}\|^{2}$, so we have obtained ([A](https://arxiv.org/html/2605.30452#S4.Ex1)) for $\mathbf{q}_{t}$ with $c_{t}=\tfrac{1}{2}$.

We summarize the above discussion in the next theorem:

###### Theorem 4 (Subproblem-Based Construction of Convergent Directions).

Suppose $s$ is decreasing convex and $r(\rho) \geq \tfrac{1}{2}\rho^2$ is increasing convex in the optimization subproblem ([12](https://arxiv.org/html/2605.30452#S4.E12)), with $s(\mathbf{0}) = r(0) = 0$. Then, (I) the solution $\mathbf{q}_t$ to ([12](https://arxiv.org/html/2605.30452#S4.E12)) lies in $\mathrm{cone}(J_t)$ with its normalized direction $\mathbf{d}_t$ lying in $\mathrm{conv}(J_t)$, and (II) $\mathbf{d}_t$ satisfies condition ([A](https://arxiv.org/html/2605.30452#S4.Ex1)) with $c_t = \tfrac{1}{2}\beta_t\|\bm{\lambda}_t\|_1$ and $F = \left\langle -\nabla s(\mathbf{0}), \mathbf{f}\right\rangle$.

Therefore, if we can show $c_t \geq c$ for some positive constant $c$, then [Corollary 1](https://arxiv.org/html/2605.30452#Thmcorollary1) holds and the $O(1/\sqrt{t})$ rate of convergence to Pareto stationarity follows.
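As a sanity check of the derivation in (13) to (15), it may help to work through the simplest instance: uniform linear scalarization with a quadratic regularizer. This worked example is an illustration added here, using only the definitions above and writing $\mathbf{g}_k = \nabla f_k(\mathbf{w}_t)$ for the columns of $J_t$. Take

$$
s(\mathbf{x}) = -\tfrac{1}{m}\mathbf{1}^\top\mathbf{x}, \qquad r(\rho) = \tfrac{1}{2}\rho^2
\quad\Longrightarrow\quad
s^*(-\bm{\lambda}) = \begin{cases} 0, & \bm{\lambda} = \tfrac{1}{m}\mathbf{1} \\ \infty, & \text{otherwise} \end{cases},
\qquad r^*(\rho) = \tfrac{1}{2}\rho^2 .
$$

The dual in (13) then forces $\bm{\lambda}_t = \tfrac{1}{m}\mathbf{1}$, and $\beta_t = \nabla r^*(\rho)/\rho = 1$, so $\mathbf{q}_t = J_t\bm{\lambda}_t = \tfrac{1}{m}\sum_k \mathbf{g}_k$. Substituting,

$$
-s(J_t^\top\mathbf{q}_t) = \left\langle \tfrac{1}{m}J_t\mathbf{1}, \mathbf{q}_t\right\rangle = \|\mathbf{q}_t\|^2 \geq \tfrac{1}{2}\|\mathbf{q}_t\|^2,
$$

so (14) holds with room to spare. Because $s$ is linear, (15) holds with equality, with $-\nabla s(\mathbf{0}) = \tfrac{1}{m}\mathbf{1}$ and hence $F = \tfrac{1}{m}\sum_k f_k$. In the notation of Theorem 4, $\|\bm{\lambda}_t\|_1 = 1$ and $c_t = \tfrac{1}{2}\beta_t\|\bm{\lambda}_t\|_1 = \tfrac{1}{2}$.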

Table 1: Summary of MOO gradient aggregation methods (non-exhaustive). $s(\mathbf{x})$, $r$, and $\mathbf{q}_t$ follow ([12](https://arxiv.org/html/2605.30452#S4.E12)). Row colors match [Figure 1](https://arxiv.org/html/2605.30452#S1.F1), indicating the most specific theorem or corollary applicable to ensure convergence for each method, although more general ones may also apply. Details are in Appendix [A](https://arxiv.org/html/2605.30452#A1).

### 4.3 Examples: Old and New

In this section, we illustrate how the subproblem formulation ([12](https://arxiv.org/html/2605.30452#S4.E12)) unifies a broad class of existing gradient aggregation methods, and even leads to the discovery of new ones (e.g., Capped MGDA) that are easy to implement and come with automatic convergence guarantees, thanks to [Theorem 4](https://arxiv.org/html/2605.30452#Thmtheorem4).

#### 4.3.1 Power mean

We are now ready to present some examples. Let us first consider the *power mean*:

$$
s(\mathbf{x}) = -\Big(\tfrac{1}{m}\sum\nolimits_{k} x_k^{p}\Big)^{1/p}, \quad\text{where}\quad p \leq 1. \tag{16}
$$

Note that a similar formulation inspired by $\alpha$-fairness appeared in [BanJi24] [see Appendix [A.1.4](https://arxiv.org/html/2605.30452#A1.SS1.SSS4) for details].

Restricted to $\mathbb{R}_+^m$, $s$ is decreasing and convex, with Fenchel conjugate

$$
s^*(-\bm{\lambda}) = \begin{cases} 0, & \text{if } \left(\frac{1}{m}\sum_k \lambda_k^{q}\right)^{1/q} \geq 1,\ \bm{\lambda} \geq \mathbf{0} \\ \infty, & \text{otherwise} \end{cases}, \tag{17}
$$

where $q := \frac{p}{p-1}$ is conjugate to $p$. According to the reverse Hölder's inequality, the optimal $\lambda_k \propto x_k^{p/q}$.

Setting $p$ differently allows us to recover some existing MOO gradient aggregation algorithms:

- $p = 1$: this amounts to linear scalarization with uniform weights, i.e., $s(\mathbf{x}) = -\tfrac{1}{m}\sum_k x_k$.
- $p \to 0$ (the limiting case): this corresponds² to Nash-MTL [NavonSAMKCF22], where $s(\mathbf{x}) = -(\prod_k x_k)^{1/m}$ is the geometric mean and ([15](https://arxiv.org/html/2605.30452#S4.E15)) reduces simply to the arithmetic-geometric mean inequality (with $F = \tfrac{1}{m}\sum_k f_k$).
- $p \to -\infty$: this corresponds to MGDA [Desideri12, Mukai80, FliegeSvaiter00], where $s(\mathbf{x}) = -\min_k x_k$.
- $p = -1$: this has been explored by FairGrad [BanJi24] and PIVRG [QinWY25].

² More precisely, Nash-MTL applied the log-transform to obtain $s(\mathbf{x}) = -\sum_k \log x_k$. It also changed the regularizer $r$ to a constraint. It can be shown that these transformations are immaterial, while our choice of $s(\mathbf{x}) = -(\prod_k x_k)^{1/m}$ has the slight advantage of being defined at $\mathbf{x} = \mathbf{0}$, rendering [Theorem 4](https://arxiv.org/html/2605.30452#Thmtheorem4) directly applicable. Alternatively, convergence of Nash-MTL also follows from [Theorem 2](https://arxiv.org/html/2605.30452#Thmtheorem2); see [Section A.1.3](https://arxiv.org/html/2605.30452#A1.SS1.SSS3) for more discussion.
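The special cases listed above can be checked numerically. The following is a minimal sketch of (16), assuming strictly positive inputs; the function name `power_mean_s` and the test vector are illustrative choices, not taken from the paper.

```python
import math


def power_mean_s(x, p):
    """Return s(x) = -M_p(x), the negated power mean of strictly positive entries."""
    m = len(x)
    if p == 0:  # limiting case p -> 0: geometric mean
        return -math.exp(sum(math.log(v) for v in x) / m)
    if p == -math.inf:  # limiting case p -> -infinity: minimum
        return -min(x)
    return -((sum(v ** p for v in x) / m) ** (1.0 / p))


x = [1.0, 2.0, 4.0]
for p in (1, 0, -1, -math.inf):
    # p = 1: arithmetic mean, p = 0: geometric mean,
    # p = -1: harmonic mean, p = -inf: minimum (all negated)
    print(p, power_mean_s(x, p))
```

Since the power mean is nondecreasing in $p$, the printed values of $s$ increase as $p$ decreases, with the MGDA case $p \to -\infty$ depending only on the smallest entry.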

Convergence of power-mean-based directions. With $r(\rho) = \tfrac{1}{2}\rho^2$, we have $\beta_t \equiv 1$ in [Theorem 4](https://arxiv.org/html/2605.30452#Thmtheorem4). Since $q \leq 1$, we know $\|\bm{\lambda}\|_1 \geq \left(\frac{1}{m}\sum_k \lambda_k^{q}\right)^{1/q} \geq 1$ and hence $c_t \geq \tfrac{1}{2}$ in [Theorem 4](https://arxiv.org/html/2605.30452#Thmtheorem4). Applying [Corollary 1](https://arxiv.org/html/2605.30452#Thmcorollary1), we at once obtain the $O(1/\sqrt{t})$ rate of convergence to Pareto stationarity.

#### 4.3.2 Convex risk measures

Next, we choose $s$ from the family of *convex risk measures* [FollmerSchied02], namely that $s$ is convex, decreasing and translation invariant:

$$
\forall c \in \mathbb{R}, \quad s(\mathbf{x} + c) = s(\mathbf{x}) - c. \tag{18}
$$

The last two conditions ensure that the domain of the conjugate function

$$
\begin{aligned}
s^*(-\bm{\lambda}) &= \max_{\mathbf{x}, c}\; \left\langle -\bm{\lambda}, \mathbf{x} + c\right\rangle - s(\mathbf{x} + c) \\
&= \max_{\mathbf{x}, c}\; \left\langle -\bm{\lambda}, \mathbf{x}\right\rangle - s(\mathbf{x}) + c\bigl(1 - \left\langle \bm{\lambda}, \mathbf{1}\right\rangle\bigr)
\end{aligned}
\tag{19}
$$

is restricted so that $\bm{\lambda} \in \Delta$ (the simplex). The power mean ([16](https://arxiv.org/html/2605.30452#S4.E16)) with $p = 1$ and $p = -\infty$ are convex risk measures, while other values of $p$ are not.

Capped MGDA via CVaR. Another widely-used convex risk measure is the *Conditional Value-at-Risk* (CVaR) [RockafellarUryasev00]:

$$
s(\mathbf{x}) := \mathrm{CVaR}_{\epsilon}(\mathbf{x}) = \min_{\alpha} \Bigl\{ \alpha + \frac{1}{\epsilon m}\sum_{k=1}^{m} \max\{0, -x_k - \alpha\} \Bigr\}, \tag{20}
$$

which amounts to averaging the tails of $-\mathbf{x} = -(x_1, \ldots, x_m)$, i.e., entries that are larger than the $(1-\epsilon)$ quantile, a.k.a. value-at-risk (VaR). In particular, for $\epsilon \leq \tfrac{1}{m}$, CVaR coincides with MGDA, whereas for $\epsilon \geq 1 - \tfrac{1}{m}$, CVaR reduces to Linear Scalarization. Other values of $\epsilon$ provide different interpolations between these two extreme cases. With a proper choice of $\epsilon$, we can control the influence of extreme values in $\mathbf{x}$. It is easy to derive the Fenchel conjugate:

$$
s^*(-\bm{\lambda}) = \mathrm{CVaR}_{\epsilon}^*(-\bm{\lambda}) = \begin{cases} 0, & \text{if } \bm{\lambda} \in \Delta \text{ and } \bm{\lambda} \leq C \\ \infty, & \text{otherwise} \end{cases}, \tag{21}
$$

which is similar to that of MGDA: the only difference is the cap constraint $\bm{\lambda} \leq C := \tfrac{1}{\epsilon m}$, which limits the contribution of each component gradient. Thus, the implementation of CVaR (which we refer to as *Capped MGDA*) closely mirrors that of MGDA, with the additional cap constraint imposed when solving the MGDA dual quadratic program; see [Section A.2.1](https://arxiv.org/html/2605.30452#A1.SS2.SSS1) for a detailed derivation of the dual. To our knowledge, CVaR, or more generally, convex-risk-measure-based directions have not been explored before in MOO.
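To make the link between (20) and (21) concrete, the sketch below evaluates CVaR in two ways: directly from the minimization over $\alpha$ in (20), and from the biconjugate form $\max\{\langle -\bm{\lambda}, \mathbf{x}\rangle : \bm{\lambda} \in \Delta,\ \bm{\lambda} \leq C\}$ implied by (21). The function names are hypothetical, and the code assumes $0 < \epsilon \leq 1$, in which case the piecewise-linear objective in (20) attains its minimum at some breakpoint $\alpha = -x_k$.

```python
def cvar_primal(x, eps):
    """CVaR_eps(x) via (20): minimize over alpha, checking only breakpoints alpha = -x_k."""
    m = len(x)
    losses = [-v for v in x]

    def objective(alpha):
        return alpha + sum(max(0.0, l - alpha) for l in losses) / (eps * m)

    return min(objective(alpha) for alpha in losses)


def cvar_dual(x, eps):
    """max of <-lam, x> over the capped simplex {lam in simplex, lam_k <= C}, C = 1/(eps*m)."""
    cap = 1.0 / (eps * len(x))
    remaining, value = 1.0, 0.0
    # Greedy fill: put as much weight as the cap allows on the largest losses first.
    for loss in sorted((-v for v in x), reverse=True):
        weight = min(cap, remaining)
        value += weight * loss
        remaining -= weight
    return value


x = [1.0, 2.0, 3.0, 4.0]
for eps in (0.25, 0.5, 1.0):
    print(eps, cvar_primal(x, eps), cvar_dual(x, eps))
```

With $m = 4$, the choice $\epsilon = \tfrac{1}{4}$ gives $C = 1$ and returns $\max_k(-x_k)$, the MGDA choice of $s$, while $\epsilon = 1$ gives $C = \tfrac{1}{4}$ and returns the uniform average of $-\mathbf{x}$.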

Convergence of convex-risk-measure-based directions. With $r(\rho) = \tfrac{1}{2}\rho^2$, we have $\beta_t \equiv 1$ in [Theorem 4](https://arxiv.org/html/2605.30452#Thmtheorem4). Since $\bm{\lambda} \in \Delta$, we have $c_t = \tfrac{1}{2}$ in [Theorem 4](https://arxiv.org/html/2605.30452#Thmtheorem4). Applying [Corollary 1](https://arxiv.org/html/2605.30452#Thmcorollary1), we immediately obtain the $O(1/\sqrt{t})$ rate of convergence to Pareto stationarity for all such directions.
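The cap-constrained dual quadratic program behind Capped MGDA can be sketched with projected gradient descent. This is an illustrative solver under our own assumptions, not the implementation used in the paper: it minimizes $\tfrac{1}{2}\|J\bm{\lambda}\|^2$ over $\{\bm{\lambda} \in \Delta,\ \bm{\lambda} \leq C\}$, projects onto the capped simplex by bisection, and uses the step size $1/\mathrm{trace}(J^\top J)$. All function names are hypothetical, and the gradients are assumed not to be all zero.

```python
import math


def project_capped_simplex(v, cap):
    """Euclidean projection onto {lam: sum(lam) = 1, 0 <= lam_k <= cap}, by bisection on the shift."""
    lo, hi = min(v) - cap, max(v)
    for _ in range(100):
        tau = 0.5 * (lo + hi)
        total = sum(min(max(a - tau, 0.0), cap) for a in v)
        if total > 1.0:
            lo = tau
        else:
            hi = tau
    tau = 0.5 * (lo + hi)
    return [min(max(a - tau, 0.0), cap) for a in v]


def capped_mgda(grads, cap=1.0, iters=2000):
    """Return (lam, d) with d = sum_k lam_k g_k; cap = 1 recovers plain MGDA."""
    m, dim = len(grads), len(grads[0])
    assert cap * m >= 1.0, "the capped simplex is empty unless cap >= 1/m"
    gram = [[sum(a * b for a, b in zip(gi, gj)) for gj in grads] for gi in grads]
    step = 1.0 / sum(gram[k][k] for k in range(m))  # 1/trace <= 1/L for this quadratic
    lam = [1.0 / m] * m  # feasible start
    for _ in range(iters):
        grad = [sum(gram[i][j] * lam[j] for j in range(m)) for i in range(m)]
        lam = project_capped_simplex([l - step * g for l, g in zip(lam, grad)], cap)
    d = [sum(lam[k] * grads[k][i] for k in range(m)) for i in range(dim)]
    return lam, d


# Two opposing gradients (e.g. a client and a flipped copy) plus one unrelated gradient.
grads = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]
for cap in (1.0, 0.4):
    lam, d = capped_mgda(grads, cap)
    print(cap, [round(l, 4) for l in lam], round(math.hypot(*d), 4))
```

In this toy case the uncapped solution puts weight $\tfrac{1}{2}$ on each of the two opposing gradients and the direction collapses to zero, whereas the cap $C = 0.4$ forces some weight onto the third gradient and yields a nonzero direction.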

## 5 Experiments

We conduct experiments on both synthetic problems and realistic benchmarks to study existing non-conflicting gradient aggregators, both individually and under mixed aggregator scheduling (MAS), as well as the newly proposed *Capped MGDA*. These experiments examine convergence behavior or robustness under adversarial conditions, serving to validate our theoretical findings rather than to rank methods³. Further details and discussions are provided in [Appendix C](https://arxiv.org/html/2605.30452#A3).

³ Nonconvex MOO usually yields incomparable Pareto stationary solutions, and even in convex settings, Pareto optimal solutions are generally not directly comparable.

### 5.1 Non-conflicting gradient aggregators

For non-conflicting gradient aggregators, we conduct experiments on synthetic problems (VLMOP2 and Omnitest) and on a realistic fairness classification benchmark. In line with our theoretical findings ([Theorem 2](https://arxiv.org/html/2605.30452#Thmtheorem2)), we examine convergence using the measure of Pareto stationarity $\gamma(\mathbf{w})$.

Methods. We evaluate four non-conflicting aggregation schemes [quinton2024jacobian]: MGDA, DualProj, UPGrad, and Nash-MTL. For each, except MGDA (whose update direction already lies in the convex hull), we also include a normalized variant (denoted with a star) where $\mathbf{d}$ is rescaled to lie in the convex hull of gradients. In addition, we consider *Mixed Aggregator Scheduling (MAS)*, which alternates among non-conflicting methods according to a prescribed schedule (see [Algorithm 1](https://arxiv.org/html/2605.30452#algorithm1), [Appendix A](https://arxiv.org/html/2605.30452#A1)). Different non-conflicting aggregators have complementary properties: for instance, MGDA may stall at suboptimal Pareto-stationary points [HuYu2025], whereas UPGrad can continue making progress; Nash-MTL, though more expensive, provides scale-invariant conflict resolution. Simple schedules such as uniform random selection at each iteration (Rand), or round-robin every $n$ iterations (RR($n$)) serve as natural baselines for mixing these methods.
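The MAS loop and the two baseline schedules can be sketched in a few lines. The following is a hypothetical, two-objective illustration rather than the experimental code: `mgda_two` is the closed-form min-norm combination of two gradients, `upgrad_two` uses the fact that for $m = 2$ projecting one gradient onto the dual cone only requires removing its conflicting component along the other, and the toy objectives and all function names are our own assumptions.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mgda_two(g1, g2):
    """Min-norm point in the convex hull of two gradients (closed form)."""
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    lam = 0.5 if denom == 0 else min(1.0, max(0.0, -dot(diff, g2) / denom))
    return [lam * a + (1.0 - lam) * b for a, b in zip(g1, g2)]


def upgrad_two(g1, g2):
    """Average of the two gradients after projecting each onto the dual cone (m = 2 only)."""
    ip = dot(g1, g2)
    if ip >= 0:  # no conflict: both gradients already lie in the dual cone
        p1, p2 = g1, g2
    else:  # remove from each gradient its component along the other
        p1 = [a - ip / dot(g2, g2) * b for a, b in zip(g1, g2)]
        p2 = [b - ip / dot(g1, g1) * a for a, b in zip(g1, g2)]
    return [0.5 * (a + b) for a, b in zip(p1, p2)]


def round_robin(n, pool_size):
    """RR(n): move to the next aggregator every n iterations."""
    return lambda t: (t // n) % pool_size


def uniform_random(pool_size, seed=0):
    """Rand: pick an aggregator uniformly at random at each iteration."""
    rng = random.Random(seed)
    return lambda t: rng.randrange(pool_size)


def mas_descent(grad_fns, pool, schedule, w0, lr, steps):
    w = list(w0)
    for t in range(steps):
        grads = [g(w) for g in grad_fns]   # compute gradients
        aggregator = pool[schedule(t)]     # select an aggregator based on the schedule
        d = aggregator(*grads)             # aggregate gradients
        w = [wi - lr * di for wi, di in zip(w, d)]  # update
    return w


# Toy problem (hypothetical): f1(w) = 0.5*||w||^2 and f2(w) = 0.5*||w - (1, 0)||^2,
# whose Pareto-stationary points form the segment between (0, 0) and (1, 0).
grad_fns = [lambda w: list(w), lambda w: [w[0] - 1.0, w[1]]]
pool = [mgda_two, upgrad_two]
print(mas_descent(grad_fns, pool, round_robin(5, len(pool)), [3.0, 0.0], lr=0.1, steps=300))
print(mas_descent(grad_fns, pool, uniform_random(len(pool)), [3.0, 0.0], lr=0.1, steps=300))
```

Because the schedule is just a function from the iteration index to a pool index, other schedules can be swapped in without touching the descent loop.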

Synthetic problems setup. We evaluate on two common synthetic MOO benchmarks: VLMOP2 [vlmop2paper] and Omnitest [Omnitest], each with five random seeds.

![Refer to caption](https://arxiv.org/html/2605.30452v1/x1.png)

![Refer to caption](https://arxiv.org/html/2605.30452v1/x2.png)

![Refer to caption](https://arxiv.org/html/2605.30452v1/x3.png)

Figure 2: Dynamics of non-conflicting aggregators on VLMOP2. Left: objective $f_1$; Middle: objective $f_2$; Right: Pareto stationarity measure $\gamma(\mathbf{w}_t)$.

Results. [Figure 2](https://arxiv.org/html/2605.30452#S5.F2) reports the optimization dynamics of four non-conflicting aggregation methods on VLMOP2, using both the original directions and their normalized counterparts (denoted by '\*'). Since normalization to the convex hull removes scale differences and places all methods on a comparable footing, these normalized variants are the ones most indicative of their intrinsic convergence behavior, and indeed they exhibit similar asymptotic rates. The unnormalized variants appear to converge faster, but only because their larger $\|\mathbf{d}_t\|$ means effectively using a larger step size. This is especially visible for Nash-MTL, whose update norm stays constant even when close to stationarity.

In all cases, the Pareto-stationarity measure $\gamma(\mathbf{w}_t)$ (see ([6](https://arxiv.org/html/2605.30452#S3.E6))) consistently converges to zero, aligning with the guarantee in [Theorem 2](https://arxiv.org/html/2605.30452#Thmtheorem2). [Figure 3](https://arxiv.org/html/2605.30452#S5.F3) further demonstrates the dynamics of our mixed-aggregator scheduling (MAS) scheme using random and round-robin schedules. The trajectories confirm that switching among different non-conflicting methods within a single optimization run still ensures convergence to Pareto-stationarity, with end-phase asymptotics comparable to individual methods. This provides empirical support for the validity of MAS as suggested by our theory.

![Refer to caption](https://arxiv.org/html/2605.30452v1/x4.png)

![Refer to caption](https://arxiv.org/html/2605.30452v1/x5.png)

![Refer to caption](https://arxiv.org/html/2605.30452v1/x6.png)

Figure 3: Dynamics of mixed aggregator scheduling on Omnitest. Left: objective $f_1$; Middle: objective $f_2$; Right: Pareto stationarity measure $\gamma(\mathbf{w}_t)$.

Fairness classification setup. We follow LibMOON [Zhang2024libmoon] for evaluating fairness classification on the Adult dataset, where a $4$-layer MLP is trained to predict the income level. The objectives are binary cross-entropy (utility) and smoothed relaxations of Difference of Equalized Odds⁴ (fairness) with $Y = 1$ (DEO1) and $Y = 0$ (DEO2). Details are provided in [Appendix C](https://arxiv.org/html/2605.30452#A3).

⁴ Equalized odds with $Y = 1$ (used in DEO1) is also referred to as equal opportunity [Hardt2016equality].

Results. [Table 2](https://arxiv.org/html/2605.30452#S5.T2) reports the fairness results for the aforementioned methods⁵. MAS achieves intermediate performance: it outperforms some methods in fairness and accuracy, but falls short of others. This positions MAS as a natural baseline: its use of all aggregators within a single trial provides a representative "average" level of performance against which other methods can be compared.

⁵ Nash-MTL is excluded due to instability; see the discussion in [Appendix C](https://arxiv.org/html/2605.30452#A3).

Table 2: Fairness classification on Adult dataset. Results averaged over 4 random seeds.

### 5.2 Capped MGDA

Setup. We evaluate capped MGDA in a federated learning setting on the CIFAR-10 dataset, with $m = 10$ clients. Each client holds distinct non-i.i.d. data partitions, and the objectives $f_i$ correspond to their individual prediction utilities. We consider an adversarial scenario where, during gradient aggregation, a malicious attacker contributes a flipped gradient (with some noise), opposite to one client's direction. The goal is to compare the robustness and effectiveness of capped MGDA against MGDA in the presence of such adversarial gradients. See further details in [Appendix C](https://arxiv.org/html/2605.30452#A3).

Results. As shown in [Figure 4](https://arxiv.org/html/2605.30452#S5.F4) (Top Left), capped MGDA achieves substantially higher per-client test accuracies than MGDA in the adversarial FL setting. This highlights MGDA's vulnerability to adversarial gradients. The Top Right panel further confirms that this gap is specific to the adversarial scenario: in the standard (no-attack) FL setting, MGDA and capped MGDA perform comparably, indicating that capping does not degrade performance when no adversary is present.

MGDA's vulnerability arises from its min-norm update: when opposite gradients are present, MGDA assigns large weights (near $\tfrac{1}{2}$) to them, resulting in a much smaller update direction $\mathbf{d}$, as seen in [Figure 4](https://arxiv.org/html/2605.30452#S5.F4) (Bottom Left, orange curve). This leads to ineffective progress under adversarial attacks. By limiting each gradient's maximum contribution, capped MGDA avoids these extreme allocations and yields larger, more meaningful update directions. Finally, [Figure 4](https://arxiv.org/html/2605.30452#S5.F4) (Bottom Right) shows that capped MGDA is, in general, *not* a non-conflicting aggregator: smaller values of $C$ lead to the curves shifting further below zero, indicating more severe gradient conflicts.

![Refer to caption](https://arxiv.org/html/2605.30452v1/x7.png)

![Refer to caption](https://arxiv.org/html/2605.30452v1/x8.png)

![Refer to caption](https://arxiv.org/html/2605.30452v1/x9.png)

![Refer to caption](https://arxiv.org/html/2605.30452v1/x10.png)

Figure 4: Capped MGDA vs. MGDA in adversarial federated learning on CIFAR-10 (1000 epochs). Top: Per-client test accuracies (clients 0–9) for MGDA and Capped-MGDA under both adversarial and standard Federated Learning (FL) settings. Bottom Left: Norm of the update direction $\mathbf{d}_t$ throughout adversarial FL training. Bottom Right: Non-conflictingness of the update direction, measured by $\min_k \langle \mathbf{d}_t, \mathbf{g}_k\rangle$ (positive values indicate non-conflicting updates), during adversarial FL training.

## 6 Conclusion

We present a unifying framework for gradient aggregation in multi-objective optimization. Central to our framework is an alignment condition that simplifies convergence analysis in MOO, leading to theorems with simpler and more intuitive conditions and, to our knowledge, the first comprehensive answer to the question of which directions guarantee convergence in MOO. We further summarize and clarify a wide range of existing aggregation methods, showing that ensuring a non-conflicting direction is sufficient for convergence (Theorem 2). Inspired by Theorem 2, we propose the Mixed Aggregator Scheduling (MAS) strategy and demonstrate that mixing different aggregators within a single training run yields a valid and practical algorithm. In Theorem 4, we present a subproblem-based construction of convergent directions that applies even when the non-conflicting direction property of Theorem 2 does not hold; this framework covers many existing methods and also enables the design of new ones, notably capped MGDA.

Limitations and future work include tightening rates under stronger curvature \(e\.g\., PL/strong convexity\), extending the current analysis to stochastic and constrained/non\-smooth settings, and exploring the Mixed Aggregator Scheduling strategy more comprehensively\.


## References

A Unified Framework for Gradient Aggregation in Multi\-Objective Optimization \(Supplementary Material\)

## Appendix A More on gradient aggregations

Input: objectives $\mathbf{f}$, pool of aggregation schemes $\{\mathcal{A}_i\}$, initializer $\mathbf{w}_0$, learning rate $\eta_t$.

for $t = 0, 1, \ldots, T-1$ do

- $J_{\mathbf{f}}(\mathbf{w}_t) \leftarrow [\nabla f_1(\mathbf{w}_t), \ldots, \nabla f_m(\mathbf{w}_t)]$ // compute gradients
- Choose $\mathcal{A}_j$ from the pool // select an aggregator based on a schedule
- $\mathbf{d}_t \leftarrow \mathcal{A}_j(J_{\mathbf{f}}(\mathbf{w}_t))$ // aggregate gradients using $\mathcal{A}_j$
- $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - \eta_t \cdot \mathbf{d}_t$ // update

Output: final weights $\mathbf{w}_T$

Algorithm 1: Multi-Objective Descent Algorithm with Mixed Aggregator Scheduling

### A.1 Existing aggregators

Here, we review commonly used multi-objective gradient aggregation methods from the literature and present their optimization subproblem formulations under ([12](https://arxiv.org/html/2605.30452#S4.E12)) (with the corresponding choices of $s$ and $r$), along with their algorithmic formulations as implemented in practice. We also provide discussions and insights on methods that do not neatly fit into our framework (e.g., PCGrad [YuKGLHF20] and IMTL-G [LiuLKXCYLZ21]).

#### A.1.1 (Uniform) Linear scalarization

The primal optimization subproblem formulation is

$$
\mathop{\mathrm{argmin}}_{\mathbf{d}}\; -\frac{1}{m}\sum_{k=1}^{m} \langle \mathbf{d}, \mathbf{g}_k\rangle + \tfrac{1}{2}\|\mathbf{d}\|^2, \tag{22}
$$

which corresponds to choosing $s(\mathbf{x}) = -\tfrac{1}{m}\mathbf{1}^\top\mathbf{x}$ and $r(\|\cdot\|) = \tfrac{1}{2}\|\cdot\|^2$ in ([12](https://arxiv.org/html/2605.30452#S4.E12)).

The resulting gradient aggregation rule (used in practice) is

$$
\mathbf{d} = \frac{1}{m}\sum_{k=1}^{m} \mathbf{g}_k. \tag{23}
$$

Linear scalarization is not necessarily a non-conflicting aggregator (it is easy to construct cases where the averaged gradient $\mathbf{d}$ has a negative inner product with some component gradient). Nevertheless, its convergence to Pareto stationarity follows directly by taking $F = \tfrac{1}{m}\sum_i f_i$ and $c_t = 1$ in [Theorem 1](https://arxiv.org/html/2605.30452#Thmtheorem1). Moreover, the scalarization weights need not be fixed at $\tfrac{1}{m}$; any conic scalarization preserves this convergence property, and it is common to tune the weights for better performance.
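A two-gradient case makes the conflict explicit. The sketch below is our own toy example with hypothetical names: the averaged direction has a negative inner product with the first gradient, yet a positive inner product with the gradient of the averaged objective.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def linear_scalarization(grads):
    """Uniform average of the component gradients, as in (23)."""
    m = len(grads)
    return [sum(g[i] for g in grads) / m for i in range(len(grads[0]))]


g1, g2 = [1.0, 0.0], [-3.0, 1.0]
d = linear_scalarization([g1, g2])
print(d)                              # averaged direction
print([dot(d, g) for g in (g1, g2)])  # a negative entry signals a conflict
print(dot(d, d))                      # equals <d, gradient of the averaged objective>
```

Here the larger gradient dominates the average, so a step along $-\mathbf{d}$ increases the first objective to first order while still decreasing the average.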

##### Convergence guarantees of linear scalarization.

The aggregation rule in ([23](https://arxiv.org/html/2605.30452#A1.E23)) corresponds precisely to performing gradient descent on the scalarized objective $F = \tfrac{1}{m}\sum_i f_i$. Consequently, its convergence behavior follows that of standard gradient descent in *smooth* single-objective optimization, with a rate of $O(1/\sqrt{t})$ for non-convex objectives and $O(1/t)$ for convex objectives. Our framework, when applicable⁶, recovers these same (optimal) rates, as established in the main paper.

⁶ In the non-convex case, Theorem 1 applies directly; however, in the convex case, gradient descent is not necessarily non-conflicting, and thus Theorem 3 cannot be invoked directly.

#### A.1.2 Multiple Gradient Descent Algorithm (MGDA)

The primal optimization subproblem formulation of MGDA [Mukai80, FliegeSvaiter00, Desideri12] is

$$
\mathop{\mathrm{argmin}}_{\mathbf{d}}\; \max_{k}\, -\left\langle \mathbf{d}, \mathbf{g}_k\right\rangle + \tfrac{1}{2}\|\mathbf{d}\|^2, \tag{24}
$$

which corresponds to choosing $s(\mathbf{x}) = \max_k(-x_k)$ and $r(\|\cdot\|) = \tfrac{1}{2}\|\cdot\|^2$ in ([12](https://arxiv.org/html/2605.30452#S4.E12)).

The resulting gradient aggregation rule (used in practice) is

$$
\mathbf{d} = J\bm{\lambda}_*, \quad\text{where}\quad \bm{\lambda}_* = \mathop{\mathrm{argmin}}_{\bm{\lambda}\in\Delta} \|J\bm{\lambda}\|^2. \tag{25}
$$

MGDA is automatically a non-conflicting aggregator, which is easy to see from the primal perspective (see, e.g., [FliegeSvaiter00]). Its update $\mathbf{d}_t$ is also in the convex hull of gradients by definition of the dual problem. Thus, its convergence is immediate from [Theorem 2](https://arxiv.org/html/2605.30452#Thmtheorem2).

##### Convergence guarantees of MGDA\.

To the best of our knowledge, the most complete complexity analysis of MGDA to date is provided by [FliegeVV19]. Our framework extends this analysis to a broader class of gradient aggregation schemes. When specialized to MGDA, our general results (Theorems 2 and 3) apply without requiring additional assumptions and recover the same convergence rates as those established by [FliegeVV19] for both non-convex and convex settings.

#### A.1.3 Nash Bargaining Multi-Task Learning (Nash-MTL)

The primal optimization subproblem formulation of Nash-MTL [NavonSAMKCF22] is

$$
\mathop{\mathrm{argmin}}_{\|\mathbf{d}\|\leq\epsilon}\; -\sum_{k} \log\left\langle \mathbf{d}, \mathbf{g}_k\right\rangle, \tag{26}
$$

which corresponds to choosing $s(\mathbf{x}) = -\sum_k \log x_k$ and $r(\|\cdot\|) = \iota_{\mathcal{B}_\epsilon}(\cdot) := \begin{cases} 0, & \|\mathbf{d}\| \leq \epsilon \\ +\infty, & \text{otherwise} \end{cases}$.

Note that this optimization formulation is equivalent to the choice $s(\mathbf{x}) = -(\prod_k x_k)^{1/m}$ presented in the main paper, as can be seen by moving the negative sign out and taking the $\log$. Also, the hard ball constraint $\|\mathbf{d}\| \leq \epsilon$ in ([26](https://arxiv.org/html/2605.30452#A1.E26)) can be equivalently replaced with the regularization $\frac{1}{2}\|\mathbf{d}\|^2$, which results in the same aggregation rule as in (Eq), up to constant scaling.

The resulting gradient aggregation rule (used in practice) is

$$
\mathbf{d} = J\bm{\lambda}_*, \quad\text{where}\quad J^\top J\bm{\lambda}_* = \frac{\mathbf{1}}{\bm{\lambda}_*}. \tag{Eq}
$$

Nash-MTL is a non-conflicting aggregator, which can be seen directly from the primal perspective. In the official implementation, $\mathbf{d}$ is clipped to satisfy $\|\mathbf{d}\| = \epsilon$ for some fixed $\epsilon$ (default $1$). Our experiments show that applying convex-hull regularization (the variant denoted Nash-MTL\*), which rescales $\bm{\lambda}_*$ to lie in $\Delta$, greatly stabilizes and smooths training. Thus, convex-hull regularization not only ensures convergence via [Theorem 2](https://arxiv.org/html/2605.30452#Thmtheorem2) but is also empirically preferable. Moreover, our convergence-analysis framework removes the need for the linearly independent gradients assumption required in the original work.
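One way to solve the fixed-point condition (Eq) numerically is to observe that it is the stationarity condition of the convex function $\phi(\bm{\lambda}) = \tfrac{1}{2}\bm{\lambda}^\top J^\top J\bm{\lambda} - \sum_k \log\lambda_k$ over $\bm{\lambda} > \mathbf{0}$ and to apply a damped Newton iteration. The sketch below is an illustration under our own assumptions, and is not claimed to match the solver in the official Nash-MTL implementation: it assumes $J^\top J$ is positive definite (linearly independent gradients) so that a minimizer exists, and all names are hypothetical.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))) / aug[r][r]
    return x


def nash_mtl_weights(grads, iters=50):
    """Return (lam, d) with (J^T J lam)_k = 1 / lam_k and d = sum_k lam_k g_k."""
    m, dim = len(grads), len(grads[0])
    gram = [[sum(a * b for a, b in zip(gi, gj)) for gj in grads] for gi in grads]

    def phi(lam):
        quad = sum(lam[i] * gram[i][j] * lam[j] for i in range(m) for j in range(m))
        return 0.5 * quad - sum(math.log(l) for l in lam)

    lam = [1.0] * m
    for _ in range(iters):
        resid = [sum(gram[i][j] * lam[j] for j in range(m)) - 1.0 / lam[i] for i in range(m)]
        hess = [[gram[i][j] + (1.0 / lam[i] ** 2 if i == j else 0.0) for j in range(m)]
                for i in range(m)]
        step = solve_linear(hess, [-r for r in resid])
        base, t = phi(lam), 1.0
        while t > 1e-12:  # backtrack to keep lam positive and phi (nearly) non-increasing
            cand = [l + t * s for l, s in zip(lam, step)]
            if min(cand) > 0 and phi(cand) <= base + 1e-12:
                lam = cand
                break
            t *= 0.5
    d = [sum(lam[k] * grads[k][i] for k in range(m)) for i in range(dim)]
    return lam, d


lam, d = nash_mtl_weights([[2.0, 0.0], [0.0, 0.5]])
print(lam, d)  # for orthogonal gradients, lam_k = 1 / ||g_k||
```

At a solution, $\langle \mathbf{d}, \mathbf{g}_k\rangle = (J^\top J\bm{\lambda}_*)_k = 1/\lambda_{*,k} > 0$ for every $k$, which is the non-conflicting property in coordinates.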

##### Convergence guarantees of Nash\-MTL\.

For the *non-convex* case, [NavonSAMKCF22] established convergence under three main assumptions: (1) linear independence of the gradients $\{\mathbf{g}^{(t)}_k\}$ at each iterate $\mathbf{w}^{(t)}$ and at the limit, (2) Lipschitz smooth and lower-bounded objectives $f_k$, and (3) bounded sub-level sets. Under these conditions, they showed that the sequence $\{\mathbf{w}^{(t)}\}_{t=1}^{\infty}$ admits a subsequence converging to a Pareto-stationary point. In contrast, under our framework (e.g., Theorem 2), assumptions (1) and (3) are no longer required. Instead, we establish an explicit convergence rate of $O(1/\sqrt{t})$ with respect to the degree of Pareto stationarity $\gamma(\mathbf{w}_t)$, using only assumption (2).

For the *convex* case, both [NavonSAMKCF22] and our analysis (Theorem 3) rely on the same standard assumptions. While [NavonSAMKCF22] proves convergence of $\mathbf{w}_t$ to a weakly Pareto-optimal solution, our framework provides a rate of $O(1/t)$ in terms of the function value gap to optimality.

#### A.1.4 Fair Resource Allocation in MTL (FairGrad)

To provide an $\alpha$-fair framework for MTL (and MOO in general), FairGrad [BanJi24] proposes the following subproblem to optimize:

$$
\mathop{\mathrm{argmin}}_{\|\mathbf{d}\|\leq\epsilon}\; -\sum_{k} \frac{\left(\left\langle \mathbf{d}, \mathbf{g}_k\right\rangle\right)^{1-\alpha}}{1-\alpha}, \quad\text{s.t.}\quad \left\langle \mathbf{d}, \mathbf{g}_k\right\rangle \geq 0, \ \forall k, \tag{27}
$$

which has the resulting gradient aggregation rule:

$$
\mathbf{d} = J\bm{\lambda}_*, \quad\text{where}\quad J^\top J\bm{\lambda}_* = \bm{\lambda}_*^{-1/\alpha}. \tag{eq2}
$$

Similar to our power-mean-based formulation, FairGrad can recover Linear Scalarization, Nash-MTL, harmonic mean ($p = -1$), and MGDA when $\alpha \to 0, 1, 2, \infty$, respectively. Note that $-\tfrac{x^{1-\alpha}}{1-\alpha}$ is decreasing and convex for $x > 0$ and $\alpha \geq 0$, so [Theorem 4](https://arxiv.org/html/2605.30452#Thmtheorem4) applies as long as the constraint $\mathbf{d} \in B_\epsilon$ is replaced by a quadratic penalty term $\tfrac{1}{2}\|\mathbf{d}\|^2$, though its implementation may be cumbersome in practice.
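The role of $\alpha$ is easiest to see in a special case where (eq2) can be solved by hand. The following worked example is our own illustration and assumes mutually orthogonal gradients with norms $n_k = \|\mathbf{g}_k\| > 0$, so that $J^\top J = \mathrm{diag}(n_1^2, \ldots, n_m^2)$ and (eq2) decouples across coordinates:

$$
n_k^2\,\lambda_k = \lambda_k^{-1/\alpha}
\quad\Longleftrightarrow\quad
\lambda_k^{1 + 1/\alpha} = n_k^{-2}
\quad\Longleftrightarrow\quad
\lambda_k = n_k^{-2\alpha/(\alpha+1)} .
$$

Reading off the limits:

$$
\alpha \to 0:\ \lambda_k \to 1, \qquad
\alpha = 1:\ \lambda_k = n_k^{-1}, \qquad
\alpha = 2:\ \lambda_k = n_k^{-4/3}, \qquad
\alpha \to \infty:\ \lambda_k \to n_k^{-2} .
$$

The first case gives uniform weights (linear scalarization). The second satisfies $n_k^2\lambda_k = 1/\lambda_k$, which is the Nash-MTL condition. The last is proportional to the minimizer of $\sum_k \lambda_k^2 n_k^2$ over the simplex, i.e., the MGDA weights for orthogonal gradients. In every case the weights shrink as the gradient norm grows, and larger $\alpha$ makes this down-weighting stronger.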

##### Convergence guarantees of FairGrad\.

The original FairGrad work provides a convergence analysis in the non-convex setting. Similar to the proof of Nash-MTL, it requires the following assumptions: (1) linear independence of the gradients at each iterate, (2) Lipschitz smooth and lower-bounded objectives $f_k$, and (3) bounded sub-level sets. Under these assumptions, the authors establish subsequence convergence.

Our framework (i.e., Theorem 2) applies because FairGrad produces non-conflicting directions, and the convex-hull requirement can be satisfied by rescaling $\mathbf{d}_t$ and absorbing the resulting constant into the step size $\eta_t$. We note that this effectively introduces a dynamic step size. The original FairGrad proof also accommodates a dynamic step size, which further justifies our modification. Under our framework, assumptions (1) and (3) are no longer required. Relying only on assumption (2), we can obtain an explicit convergence rate of $O(1/\sqrt{t})$ with respect to the degree of Pareto stationarity $\gamma(\mathbf{w}_t)$, albeit without guaranteeing pointwise subsequence convergence of $\mathbf{w}_t$.

#### A.1.5 Performance-Informed Variance Reduction Gradient aggregation (PIVRG)

PIVRG [QinWY25] proposes to minimize the (weighted) mean of the inverse utilities as the subproblem:

$$
\mathop{\mathrm{argmin}}_{\|\mathbf{d}\|\leq\epsilon}\; \frac{1}{m}\sum_{k=1}^{m} \frac{\omega_k}{\left\langle \mathbf{d}, \mathbf{g}_k\right\rangle}, \tag{28}
$$

where $\omega_k$ are dynamic coefficients incorporating performance-level information.

For the gradient aggregation rule used in practice, they solve for:

$$
\mathbf{d} = J\bm{\lambda}_*, \quad\text{where}\quad J^\top J\bm{\lambda}_* = \Big(\frac{\bm{\omega}}{\bm{\lambda}_*}\Big)^{\tfrac{1}{2}}. \tag{29}
$$

We note that PIVRG is conceptually very close to FairGrad (with $\alpha = 2$), though PIVRG introduces a novel design of the weighting coefficients $\bm{\omega}$ to achieve improved variance reduction.

##### Convergence guarantees of PIVRG\.

The theoretical assumptions and convergence results of PIVRG largely mirror those of FairGrad, and our general framework applies in a similar fashion\. To avoid redundancy, we refer the reader to the FairGrad section for a detailed discussion and comparison of the convergence guarantees\.

#### A.1.6 Unconflicting Projection of Gradients (UPGrad)

The primal optimization subproblem formulation of UPGrad [quinton2024jacobian] is

$$\mathop{\mathrm{argmin}}_{\mathbf{d}}\ -\frac{1}{m}\sum_{k=1}^{m}\langle\mathbf{d},\mathbf{p}_k\rangle+\frac{1}{2}\|\mathbf{d}\|^2, \tag{30}$$

$$\text{where}\quad\mathbf{p}_k:=\mathrm{P}_{\mathrm{cone}^*(J)}(\mathbf{g}_k),\ J\bm{\alpha}_k:=\mathbf{p}_k, \tag{31}$$

which corresponds to choosing $s(\mathbf{x})=-(\frac{1}{m}\sum_k\bm{\alpha}_k)^{\top}\mathbf{x}$, $r(\|\cdot\|)=\frac{1}{2}\|\cdot\|^2$ in ([12](https://arxiv.org/html/2605.30452#S4.E12)).

The resulting gradient aggregation rule (used in practice) is

$$\mathbf{d}=\frac{1}{m}\sum_k\mathbf{p}_k=J\Big(\frac{1}{m}\sum_{k=1}^{m}\bm{\alpha}_k\Big), \tag{32}$$

which first projects each gradient onto the dual cone $\{\mathbf{d}:J^{\top}\mathbf{d}\geq\mathbf{0}\}$ and then averages them.

(I) UPGrad is a non-conflicting aggregator since it first projects all gradients onto the dual cone; the dual cone is convex, so the average of the projections remains in it.

(II) UPGrad is closely related to PCGrad [YuKGLHF20], and in fact coincides with it when $m=2$. We discuss this connection in detail in the PCGrad section.
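The aggregation rule (32) can be sketched numerically as follows. This is a minimal illustration, not the TorchJD implementation: the helper names `project_dual_cone` and `upgrad`, the projected-gradient solver, its iteration count, and the two example gradients are all assumptions made here. The projection uses the identity $\mathrm{P}_{\mathrm{cone}^*(J)}(\mathbf{q})=\mathbf{q}+J\bm{\alpha}$ with $\bm{\alpha}\geq\mathbf{0}$, where rows of `G` hold the gradients (so `G.T` plays the role of $J$).

```python
import numpy as np

def project_dual_cone(q, G, iters=5000):
    """Project q onto {d : G d >= 0}; rows of G are the gradients.

    Assumed solver: projected gradient descent on the dual variable a >= 0
    of min_a 0.5 * ||G^T a + q||^2, then return q + G^T a.
    """
    gram = G @ G.T
    step = 1.0 / max(np.linalg.eigvalsh(gram)[-1], 1e-12)  # 1 / largest eigenvalue
    a = np.zeros(G.shape[0])
    for _ in range(iters):
        a = np.maximum(0.0, a - step * (gram @ a + G @ q))
    return q + G.T @ a

def upgrad(G):
    """Average of the dual-cone projections of the individual gradients, eq. (32)."""
    return np.mean([project_dual_cone(g, G) for g in G], axis=0)

# Two conflicting gradients (illustrative values, not from the paper).
G = np.array([[1.0, 0.0], [-0.5, 1.0]])
d = upgrad(G)
print(G @ d)  # inner products <g_k, d>; non-negative up to solver error
```

For larger problems a dedicated QP or non-negative least-squares solver would normally replace the simple loop used here.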

##### Convergence guarantees of UPGrad\.

For the *non-convex* setting, [quinton2024jacobian] (Appendix B.4) established an $O(1/\sqrt{t})$ convergence rate under the same assumptions of Lipschitz smoothness and lower boundedness of each $f_i$ as those required by our framework (e.g., Corollary 2). Although the rate is identical, their analysis relies on a method-specific proof, whereas our framework provides a more general and conceptually unified derivation.

For the *convex* setting, both [quinton2024jacobian] and our analysis establish an $O(1/t)$ convergence rate in terms of the function value gap, under the same assumptions of Lipschitz smoothness and convexity. While the rate is the same, our upper bound is slightly tighter. Additionally, under the extra assumptions of (1) a bounded Pareto front and (2) bounded coefficients $\bm{\lambda}_t$, [quinton2024jacobian] used a method-specific proof to show the convergence of $\mathbf{f}(\mathbf{w}_t)$ to $\mathbf{f}^*$. This result can also be obtained within our framework, through a direct application of Theorem 3 and upper bounding $\bm{\lambda}$.

#### A.1.7 DualProj

The primal optimization subproblem formulation of DualProj [lopez2017gradient] is

$$\mathop{\mathrm{argmin}}_{\mathbf{d}}\ -\sum_{k=1}^{m}\alpha_k\langle\mathbf{d},\mathbf{g}_k\rangle+\frac{1}{2}\|\mathbf{d}\|^2, \tag{33}$$

$$\text{where}\quad J\bm{\alpha}:=\mathrm{P}_{\mathrm{cone}^*(J)}\Big(\frac{1}{m}\sum_k\mathbf{g}_k\Big). \tag{34}$$

The resulting gradient aggregation rule (used in practice) is

$$\mathbf{d}=J\bm{\alpha} \tag{35}$$

DualProj is a non-conflicting aggregator since it projects a convex combination of gradients onto the dual cone, see [Proposition 1](https://arxiv.org/html/2605.30452#Thmproposition1).

##### Convergence guarantees of DualProj\.

The original work of [lopez2017gradient] does not appear to provide a formal convergence analysis. Within our framework, [Corollary 2](https://arxiv.org/html/2605.30452#Thmcorollary2) establishes an $O(1/\sqrt{t})$ convergence rate for DualProj in terms of the Pareto stationarity measure $\gamma(\mathbf{w}_t)$ for the non-convex setting. For the convex setting, we can apply Theorem 3 and establish an $O(1/t)$ rate.

#### A.1.8 Conflict-Averse Gradient descent (CAGrad)

The primal optimization subproblem formulation of CAGrad [LiuLJSL21] is

$$\mathop{\mathrm{argmin}}_{\mathbf{d}}\ \max_k-\langle\mathbf{d},\mathbf{g}_k\rangle,\quad\text{s.t.}\ \|\mathbf{d}-\mathbf{g}_0\|\leq c\|\mathbf{g}_0\|, \tag{36}$$

where $\mathbf{g}_0:=\frac{1}{m}\sum_k\mathbf{g}_k$, and $c$ is a pre-specified hyper-parameter that controls the radius of the ball constraint centered around the average gradient $\mathbf{g}_0$.

The resulting gradient aggregation rule (used in practice) is

$$\mathbf{d}=\mathbf{g}_0+\epsilon\,\mathbf{g}_{\bm{\lambda}_*} \tag{37}$$

where $\epsilon:=c\|\mathbf{g}_0\|$, $\mathbf{g}_{\bm{\lambda}_*}:=J\bm{\lambda}_*$, and $\bm{\lambda}_*$ is the solution to

$$\mathop{\mathrm{argmin}}_{\bm{\lambda}\in\Delta}\ \langle\mathbf{g}_0,J\bm{\lambda}\rangle+\epsilon\|J\bm{\lambda}\|. \tag{38}$$

CAGrad is a direction-oriented variant of MGDA. Due to the ball constraint, it is no longer a non-conflicting aggregator when $c<1$; however, it is non-conflicting when $c\geq 1$. In that case $\mathbf{d}=\mathbf{0}$ is a feasible point of (36) with objective value $0$, so the minimizer satisfies $\max_k-\langle\mathbf{d},\mathbf{g}_k\rangle\leq 0$, i.e. $\langle\mathbf{d},\mathbf{g}_k\rangle\geq 0$ for all $k$.

##### Convergence guarantees of CAGrad\.

The convergence behavior of CAGrad depends on the choice of the hyper-parameter $c$. The original work of [LiuLJSL21] considers the non-convex setting, and:

(1) For $c\geq 1$, the authors only showed that the fixed point of CAGrad is Pareto stationary. In contrast, by noting that CAGrad is *non-conflicting* in this case, our framework (Theorem 2) can be directly applied to *strengthen* the guarantee to an $O(1/\sqrt{t})$ convergence rate measured by $\gamma(\mathbf{w}_t)$.

(2) For $0\leq c<1$, [LiuLJSL21] established an $O(1/\sqrt{t})$ rate to stationarity of the averaged objective $\frac{1}{m}\sum_{i=1}^{m}f_i$, and thus to Pareto stationarity of $\mathbf{f}$. We point out that this convergence guarantee is a consequence of the ball constraint rather than of the objective, and that we can directly apply the angle constraint ([10](https://arxiv.org/html/2605.30452#S4.E10)) in Theorem 1 (with $F=\frac{1}{m}\sum_{i=1}^{m}f_i$) to obtain the same result.

In the convex setting, the original work does not provide extra convergence guarantees. However, when $c\geq 1$, CAGrad remains non-conflicting, and therefore our framework (Theorem 3) applies and establishes a rate of $O(1/t)$.

#### A.1.9 Projecting Conflicting Gradients (PCGrad)

PCGrad [YuKGLHF20] aims to fix each gradient $\mathbf{g}_k$ by initializing $\mathbf{g}_k^{\mathrm{PC}}\leftarrow\mathbf{g}_k$, and then iteratively projecting it onto the normal planes of the other gradients, for one pass over $[m]$ only:

$$\text{for}\ i\in[m],\quad\mathbf{g}_k^{\mathrm{PC}}\leftarrow\mathbf{g}_k^{\mathrm{PC}}+\frac{(-\mathbf{g}_k^{\mathrm{PC}}\cdot\mathbf{g}_i)_+}{\|\mathbf{g}_i\|^2}\mathbf{g}_i \tag{39}$$

and finally gather and average the $\mathbf{g}_k^{\mathrm{PC}}$:

$$\mathbf{d}=\frac{1}{m}\sum_{k=1}^{m}\mathbf{g}_k^{\mathrm{PC}}. \tag{40}$$
We highlight several important points:

- When the number of objectives is $m=2$, PCGrad is equivalent to UPGrad, since in this case projection to the 'dual cone' coincides with projection to the 'normal plane of the other gradient'. Notably, $m=2$ is also the only setting in which the PCGrad paper provides theoretical guarantees. Thus, for two-objective optimization, PCGrad can be regarded as UPGrad, the latter being studied more extensively in this paper.
- When $m\geq 3$, there is no guarantee that the resulting $\mathbf{g}_k^{\mathrm{PC}}$ (and thus $\mathbf{d}$) is a non-conflicting direction, because PCGrad performs only a single pass of iterative projections rather than continuing until $\mathbf{g}_k^{\mathrm{PC}}$ lies in the dual cone. For a counter-example, let
  $$\mathbf{g}_1=(1,\sqrt{3},z),\quad\mathbf{g}_2=(-2,0,z),\quad\mathbf{g}_3=(1,-\sqrt{3},z), \tag{41}$$
  where $z$ is a small positive constant (e.g., $z=0.1$). Applying the standard PCGrad procedure in order and averaging the adjusted gradients, the resulting aggregated direction $\mathbf{d}$ has a negative inner product with $\mathbf{g}_1$.
- PCGrad can be modified to repeatedly project until $\mathbf{g}_k^{\mathrm{PC}}$ lies in the dual cone, and we name this new variant *PCGrad+* (see [Algorithm 2](https://arxiv.org/html/2605.30452#algorithm2)). In this case, PCGrad+ again resembles UPGrad, except that UPGrad performs a one-step projection directly onto the dual cone $C^*$, whereas PCGrad+ repeatedly performs alternating projections onto the half-spaces $H_i=\{\mathbf{z}:\langle\mathbf{g}_i,\mathbf{z}\rangle\geq 0\}$, whose intersection is the dual cone, i.e. $C^*=\bigcap_{i=1}^{m}H_i$. PCGrad+ is also sensitive to the ordering of objectives, which is arguably an undesirable property.
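The counter-example (41) can be checked numerically with a short sketch. It follows eq. (39)-(40) literally, visiting $i=1,\dots,m$ in index order; the function name `pcgrad` is introduced here for illustration, and the released PCGrad code (which, as far as we are aware, shuffles the task order) may behave differently.

```python
import numpy as np

def pcgrad(G):
    """Single-pass PCGrad as in eq. (39)-(40); rows of G are the gradients."""
    projected = []
    for g_k in G:
        g = g_k.copy()
        for g_i in G:  # one pass over [m], in index order
            # (-<g, g_i>)_+ / ||g_i||^2 is nonzero only when g conflicts with g_i
            coef = max(0.0, -(g @ g_i)) / (g_i @ g_i)
            g = g + coef * g_i
        projected.append(g)
    return np.mean(projected, axis=0)

z = 0.1
G = np.array([[1.0, np.sqrt(3.0), z],
              [-2.0, 0.0, z],
              [1.0, -np.sqrt(3.0), z]])
d = pcgrad(G)
print(G @ d)  # the first entry, <g_1, d>, comes out negative
```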

Input: Gradients $\mathbf{g}_k:=\nabla f_k(\mathbf{w})$ of each objective $f_k$.

1. for $k\in[m]$ do
   1. $\mathbf{g}_k^{\mathrm{PC}}\leftarrow\mathbf{g}_k$ // Initialize projected gradient
   2. repeat // key distinction from PCGrad
      1. for $i\in[m]$ do: $\mathbf{g}_k^{\mathrm{PC}}\leftarrow\mathbf{g}_k^{\mathrm{PC}}+\dfrac{(-\,\mathbf{g}_k^{\mathrm{PC}}\cdot\mathbf{g}_i)_+}{\|\mathbf{g}_i\|^2}\,\mathbf{g}_i$ // Resolve conflicts via projection
   3. until $\mathbf{g}_k^{\mathrm{PC}}$ is non-conflicting with all $\mathbf{g}_i$

Output: Aggregated direction $\mathbf{d}\leftarrow\tfrac{1}{m}\sum_{k=1}^{m}\mathbf{g}_k^{\mathrm{PC}}$.

Algorithm 2: PCGrad+ Aggregation

##### Convergence guarantees of PCGrad(+).

To the best of our knowledge, [YuKGLHF20] provided a convergence analysis of PCGrad under reasonable assumptions *only* for the case of two objectives ($m=2$). As noted above, when $m=2$, PCGrad coincides with UPGrad, and therefore our framework for UPGrad applies directly to PCGrad in this special case.

When $m\geq 3$, however, the output of PCGrad aggregation (1) depends on the order of the input objectives and (2) is not guaranteed to be non-conflicting, both of which pose challenges for a general convergence analysis (if attainable at all).

To address this, we introduce a modified variant, termed *PCGrad+*, which repeatedly performs the gradient-fixing projections until the resulting direction $\mathbf{d}_t$ lies within the dual cone. For PCGrad+, our general results, Theorem 2 (non-convex) and Theorem 3 (convex), can be readily applied to establish convergence for arbitrary $m\geq 3$.
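A minimal sketch of PCGrad+ (Algorithm 2) is given below. The tolerance `tol` and the cap `max_sweeps` are assumptions added here, since cyclic projections onto half-spaces need not terminate after finitely many sweeps in exact arithmetic; the function name is likewise illustrative. The example reuses the three gradients of (41) with $z=0.1$.

```python
import numpy as np

def pcgrad_plus(G, tol=1e-10, max_sweeps=10_000):
    """Repeat the PCGrad sweep until each g_k^PC is (numerically) non-conflicting.

    Rows of G are the gradients. tol and max_sweeps are assumed safeguards.
    """
    projected = []
    for g_k in G:
        g = g_k.copy()
        for _ in range(max_sweeps):
            if (G @ g).min() >= -tol:   # 'until' condition of Algorithm 2
                break
            for g_i in G:               # one sweep of half-space projections
                coef = max(0.0, -(g @ g_i)) / (g_i @ g_i)
                g = g + coef * g_i
        projected.append(g)
    return np.mean(projected, axis=0)

z = 0.1
G = np.array([[1.0, np.sqrt(3.0), z],
              [-2.0, 0.0, z],
              [1.0, -np.sqrt(3.0), z]])
d = pcgrad_plus(G)
print(G @ d)  # all entries non-negative up to the tolerance
```

In contrast to the single-pass rule, the aggregated direction here stays in the dual cone for this example, at the cost of a data-dependent number of sweeps.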

#### A.1.10 Impartial Multi-task Learning Gradient (IMTL-G)

Let $\mathbf{n}=[\|\mathbf{g}_1\|,\ldots,\|\mathbf{g}_m\|]^{\top}$. IMTL-G [LiuLKXCYLZ21] aggregates the gradients as

$$\mathbf{d}=J\bm{\lambda},\quad\text{where}\quad\bm{\lambda}=(J^{\top}J)^{\dagger}\mathbf{n}, \tag{42}$$

and then applies a (possibly negative) rescaling to enforce $\sum_i\lambda_i=1$ to get the normalized update $\tilde{\mathbf{d}}$.

Interestingly, a recent work [liu2025config] on Physics-Informed Neural Networks is closely related to IMTL-G. However, it does not enforce the constraint $\sum_i\lambda_i=1$, and instead introduces a different form of normalization.

It's worth noting that:

(I) The update direction $\mathbf{d}$ is *not necessarily* non-conflicting unless the gradients are normalized. As a counterexample, consider the following Jacobian $J$ (whose row vectors are the gradients), where $\mathbf{d}$ turns out to be conflicting with all gradients:

$$J=\begin{pmatrix}8.660254&-5&0.0\\-8.660254&-5&0.0\\0.0&-0.99498744&0.1\end{pmatrix} \tag{43}$$

(II) $\mathbf{d}$ is *not necessarily* in the cone either. As a counterexample, consider the same matrix $J$, but with the first two rows divided by $10$.
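Both counterexamples can be reproduced with a few lines. The sketch below is one reading of the text, under the assumption that the claims refer to the update after rescaling to $\sum_i\lambda_i=1$; `imtl_g` is a name introduced here, and rows of `G` are the gradients, so `G @ G.T` plays the role of $J^{\top}J$.

```python
import numpy as np

def imtl_g(G):
    """IMTL-G weights per eq. (42), then rescaled so that the weights sum to 1."""
    n = np.linalg.norm(G, axis=1)        # vector of gradient norms
    lam = np.linalg.pinv(G @ G.T) @ n
    lam = lam / lam.sum()                # the sum may be negative (sign flip)
    return lam, G.T @ lam

G = np.array([[8.660254, -5.0, 0.0],
              [-8.660254, -5.0, 0.0],
              [0.0, -0.99498744, 0.1]])

# Case (I): the rescaled direction conflicts with every gradient.
lam, d = imtl_g(G)
print(G @ d)

# Case (II): first two rows divided by 10; a weight is negative,
# so the direction leaves the cone spanned by the gradients.
G2 = G.copy()
G2[:2] /= 10.0
lam2, d2 = imtl_g(G2)
print(lam2, G2 @ d2)
```

In case (I) the weights before rescaling sum to a negative number, which is what flips the direction; in case (II) all gradients have unit norm and the sum is positive, but the third weight remains negative.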

##### Convergence guarantees of IMTL\-G\.

The original work of [LiuLKXCYLZ21] does not appear to include a formal convergence analysis. Due to the irregularities and drawbacks discussed above, IMTL-G is also difficult to incorporate into our theoretical framework, since its update direction $\mathbf{d}$ does not necessarily lie within either the cone or the dual cone.

#### A.1.11 Random Gradient Weighting (RGW)

Random Gradient Weighting [lin2021reasonable] uses random weighting to aggregate the gradients:

$$\mathbf{d}=J\cdot\mathrm{softmax}(\bm{\xi}),\quad\text{where}\quad\bm{\xi}\sim\mathcal{N}(\mathbf{0},I) \tag{44}$$

We can formulate a primal optimization subproblem:

$$\mathop{\mathrm{argmin}}_{\mathbf{d}}\ -\sum_{k=1}^{m}\xi_k\langle\mathbf{d},\mathbf{g}_k\rangle+\frac{1}{2}\|\mathbf{d}\|^2. \tag{45}$$

##### Convergence guarantees of RGW.

Unlike all aforementioned methods, RGW admits no fixed point (except for the degenerate case where all $\mathbf{g}_k=\mathbf{0}$), and therefore cannot terminate or converge at any non-trivial Pareto stationary point. This poses an inherent challenge to establishing convergence guarantees for RGW. The original work provides only an upper bound on the function value gap, which can be unbounded unless all gradients are assumed to be bounded. Moreover, the bound does not vanish as $t\to\infty$. In summary, we find it difficult to establish a theoretical convergence guarantee for this method without imposing very strong, if not unrealistic, assumptions.

### A.2 New Aggregations

This section provides additional details on newly proposed gradient aggregation methods: \(i\) Capped MGDA, which is presented in the main paper; and \(ii\) Greedy Aggregation with Dual Cone Projection \(Greedy\-DCP\), which we introduce here\.

#### A.2.1 Capped MGDA

Here we derive the dual formulation of Capped MGDA from the CVaR primal formulation:

$$\min_{\mathbf{d},\alpha}\ \alpha+C\sum_{k=1}^{m}\max\left\{0,\langle-\mathbf{d},\mathbf{g}_k\rangle-\alpha\right\}+\frac{1}{2}\|\mathbf{d}\|^2, \tag{primal}$$

###### Proof.

First, as a standard optimization technique, the above is equivalent to

$$\min_{\mathbf{d},\alpha}\max_{0\leq\lambda_k\leq C}\ \alpha+\sum_{k=1}^{m}\lambda_k(\langle-\mathbf{d},\mathbf{g}_k\rangle-\alpha)+\frac{1}{2}\|\mathbf{d}\|^2 \tag{46}$$

The dual problem is formed by switching the min and max (strong duality holds since the objective is convex in $\mathbf{d},\alpha$, and linear in $\bm{\lambda}$, which is defined on a compact domain), and then simplifying:

$$\max_{0\leq\lambda_k\leq C}\min_{\mathbf{d},\alpha}\ \alpha+\sum_{k=1}^{m}\lambda_k(\langle-\mathbf{d},\mathbf{g}_k\rangle-\alpha)+\frac{1}{2}\|\mathbf{d}\|^2 \tag{47}$$

$$\max_{0\leq\lambda_k\leq C}\min_{\mathbf{d},\alpha}\ \alpha\left(1-\sum_{k=1}^{m}\lambda_k\right)-\left\langle\mathbf{d},\sum_{k=1}^{m}\lambda_k\mathbf{g}_k\right\rangle+\frac{1}{2}\|\mathbf{d}\|^2 \tag{48}$$

$$\max_{\lambda_k\leq C,\,\bm{\lambda}\in\Delta}\min_{\mathbf{d}}\ -\left\langle\mathbf{d},\sum_{k=1}^{m}\lambda_k\mathbf{g}_k\right\rangle+\frac{1}{2}\|\mathbf{d}\|^2 \tag{49}$$

The step from (48) to (49) uses that the inner minimization over $\alpha$ is unbounded below unless $\sum_{k=1}^{m}\lambda_k=1$, which restricts $\bm{\lambda}$ to the simplex $\Delta$. For the last line, the inner minimization is achieved when $\mathbf{d}=\sum_{k=1}^{m}\lambda_k\mathbf{g}_k$.

Substituting this in gives the inner value $-\frac{1}{2}\|\sum_{k=1}^{m}\lambda_k\mathbf{g}_k\|^2$; moving the negative sign out, we reach the final dual problem:

$$\min_{\bm{\lambda}\leq C,\,\bm{\lambda}\in\Delta}\Big\|\sum_{k=1}^{m}\lambda_k\mathbf{g}_k\Big\|^2, \tag{50}$$

Equivalently, in Jacobian notation:

$$\min_{\bm{\lambda}\leq C,\,\bm{\lambda}\in\Delta}\|J\bm{\lambda}\|^2,\ \text{and}\ \mathbf{d}=J\bm{\lambda}. \tag{51}$$

∎

We call this new aggregation method *Capped MGDA*, since its dual problem closely resembles MGDA, with an additional cap constraint on the coefficients, i.e. $\bm{\lambda}\leq C$.

While one may view Capped MGDA as a natural extension of the dual formulation of MGDA, we emphasize that its primal CVaR formulation not only offers an intuitive interpretation of the method's objective, but also greatly facilitates the convergence analysis, which would presumably be difficult to establish from the seemingly simpler dual formulation.
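The dual problem (51) is a small quadratic program over the capped simplex. One way to solve it, sketched below, is Frank-Wolfe with exact line search; this solver choice, the function names, and the stopping tolerance are assumptions made for illustration and are not taken from the paper. Rows of `G` are the gradients, and the cap must satisfy $C\geq 1/m$ for the feasible set to be non-empty.

```python
import numpy as np

def capped_simplex_lmo(grad, cap):
    """argmin of <grad, s> over {s in simplex, s <= cap}: fill the smallest
    gradient coordinates first, up to the cap, until the mass 1 is used up."""
    s = np.zeros_like(grad)
    remaining = 1.0
    for k in np.argsort(grad):
        s[k] = min(cap, remaining)
        remaining -= s[k]
        if remaining <= 0.0:
            break
    return s

def capped_mgda(G, cap, iters=1000, tol=1e-12):
    """Frank-Wolfe sketch for min ||G^T lam||^2 s.t. lam in simplex, lam <= cap."""
    m = G.shape[0]
    if cap * m < 1.0:
        raise ValueError("infeasible: need cap >= 1/m")
    gram = G @ G.T
    lam = np.full(m, 1.0 / m)                 # feasible starting point
    for _ in range(iters):
        grad = 2.0 * gram @ lam
        delta = capped_simplex_lmo(grad, cap) - lam
        curvature = delta @ gram @ delta
        if -(grad @ delta) <= tol or curvature <= 0.0:
            break                             # Frank-Wolfe gap is (numerically) zero
        step = min(1.0, -(lam @ gram @ delta) / curvature)  # exact line search
        lam = lam + step * delta
    return lam, G.T @ lam

G = np.array([[2.0, 0.0], [0.0, 1.0]])
lam_mgda, _ = capped_mgda(G, cap=1.0)   # cap = 1 recovers plain MGDA
lam_cap, d_cap = capped_mgda(G, cap=0.6)
print(lam_mgda, lam_cap)
```

With `cap=1.0` the constraint is inactive and the weights are those of MGDA; lowering the cap limits how much weight any single gradient can receive.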

#### A.2.2 Greedy Aggregation with Dual Cone Projection (Greedy-DCP)

This method is a *non-conflicting* aggregator that, similar to UPGrad, relies on projecting gradients onto the dual cone as the first step. We first project each gradient $\mathbf{g}_k$ onto the dual cone $\mathrm{cone}^*(J):=\{\mathbf{d}:J^{\top}\mathbf{d}\geq\mathbf{0}\}$, and denote the result by $\mathbf{p}_k$. The greedy aggregation is then formulated as

$$\mathop{\mathrm{argmin}}_{\mathbf{d}}\min_k\Big(-\langle\mathbf{p}_k,\mathbf{d}\rangle+\tfrac{1}{2}\|\mathbf{d}\|^2\Big) \tag{52}$$

By switching the order of minimization, we obtain the algorithmic update:

$$\mathbf{d}=\mathbf{p}_i,\quad\text{where }i=\mathop{\mathrm{argmax}}_k\|\mathbf{p}_k\|. \tag{53}$$

The convergence of Greedy-DCP follows directly from [Corollary 2](https://arxiv.org/html/2605.30452#Thmcorollary2) established in the main paper.
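The passage from (52) to (53) can be spelled out by completing the square for each fixed $k$:

$$
\begin{aligned}
\min_{\mathbf{d}}\min_k\Big(-\langle\mathbf{p}_k,\mathbf{d}\rangle+\tfrac{1}{2}\|\mathbf{d}\|^2\Big)
&=\min_k\min_{\mathbf{d}}\Big(\tfrac{1}{2}\|\mathbf{d}-\mathbf{p}_k\|^2-\tfrac{1}{2}\|\mathbf{p}_k\|^2\Big)\\
&=\min_k\Big(-\tfrac{1}{2}\|\mathbf{p}_k\|^2\Big).
\end{aligned}
$$

The inner minimum is attained at $\mathbf{d}=\mathbf{p}_k$, and the outer minimum selects the index with the largest $\|\mathbf{p}_k\|$, which is exactly the update (53).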

## Appendix B Proofs omitted from the main text

See [Theorem 1](https://arxiv.org/html/2605.30452#Thmtheorem1).

###### Proof.

Applying the descent lemma we have

$$F(\mathbf{w}_{t+1})\leq F(\mathbf{w}_t)-\eta_t\langle\nabla F(\mathbf{w}_t),\mathbf{d}_t\rangle+\tfrac{L}{2}\|\eta_t\mathbf{d}_t\|^2 \tag{54}$$

$$\phantom{F(\mathbf{w}_{t+1})}\leq F(\mathbf{w}_t)-c_t\Gamma_t\|\eta_t\mathbf{d}_t\|+\tfrac{L}{2}\|\eta_t\mathbf{d}_t\|^2 \tag{55}$$

Optimizing $\eta_t=\tfrac{c_t\Gamma_t}{L\|\mathbf{d}_t\|}$, we obtain

$$F(\mathbf{w}_{t+1})\leq F(\mathbf{w}_t)-\frac{c_t^2\Gamma_t^2}{2L}. \tag{56}$$

Telescoping and noting that $F\geq 0$:

$$\sum_t c_t^2\Gamma_t^2\leq 2LF(\mathbf{w}_0), \tag{57}$$

whence follows both claims. ∎
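For completeness, the step-size choice between (55) and (56) is a one-dimensional quadratic minimization. Writing $s:=\|\eta_t\mathbf{d}_t\|\geq 0$ (a symbol introduced here only for this calculation) and assuming $c_t\Gamma_t\geq 0$, the right-hand side of (55) is $F(\mathbf{w}_t)+\varphi(s)$ with

$$
\varphi(s)=-c_t\Gamma_t\,s+\tfrac{L}{2}s^2,\qquad
\varphi'(s)=0\iff s=\frac{c_t\Gamma_t}{L},\qquad
\varphi\Big(\frac{c_t\Gamma_t}{L}\Big)=-\frac{c_t^2\Gamma_t^2}{2L},
$$

so that $\eta_t=s/\|\mathbf{d}_t\|=\tfrac{c_t\Gamma_t}{L\|\mathbf{d}_t\|}$ yields exactly the decrease stated in (56).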

[Optionally] For a weaker conclusion under a more relaxed assumption, the same proof also shows that

- if $\sum_t c_t^2=\infty$, then $\liminf_{t\to\infty}\Gamma_t=0$,

namely, there exists a subsequence of $\Gamma_t$ that converges to 0.

See [Theorem 2](https://arxiv.org/html/2605.30452#Thmtheorem2).

###### Proof.

We omit the index $t$ in the following to simplify the notation.

Let $F=\sum_k f_k$ and $\mathbf{d}=J_{\mathbf{f}}(\mathbf{w})\bm{\lambda}$ for some $\bm{\lambda}\in\Delta$. We directly verify ([A](https://arxiv.org/html/2605.30452#S4.Ex1)):

$$\langle\mathbf{d},\nabla F(\mathbf{w})\rangle=\sum_k\langle\mathbf{d},\nabla f_k(\mathbf{w})\rangle\geq\sum_k\lambda_k\langle\mathbf{d},\nabla f_k(\mathbf{w})\rangle=\Big\langle\mathbf{d},\sum_k\lambda_k\nabla f_k(\mathbf{w})\Big\rangle=\|\mathbf{d}\|^2, \tag{58}$$

where the inequality is due to $\mathbf{d}\in\mathrm{cone}^*(J_{\mathbf{f}}(\mathbf{w}))$, so that $\langle\mathbf{d},\nabla f_k(\mathbf{w})\rangle\geq 0$, and $\lambda_k\in[0,1]$. ∎

See [Proposition 1](https://arxiv.org/html/2605.30452#Thmproposition1).

###### Proof.

For any (closed) convex cone $K$, we recall Moreau's celebrated decomposition [Moreau62]:

$$\mathbf{d}=\mathrm{P}_K(\mathbf{q}),\ \mathbf{d}^*=-\mathrm{P}_{K^*}(-\mathbf{q})\iff\mathbf{d}\perp\mathbf{d}^*,\ \mathbf{d}+\mathbf{d}^*=\mathbf{q},\ \mathbf{d}\in K,\ \mathbf{d}^*\in-K^*. \tag{59}$$

Thus, with $K=\mathrm{cone}^*(J)$ and hence $K^*=\mathrm{cone}(J)$, we have

$$\mathbf{d}=\mathbf{q}+\mathrm{P}_{\mathrm{cone}(J)}(-\mathbf{q})=J\bm{\mu}+J\bm{\alpha},\quad\text{where}\quad\bm{\alpha}\geq\mathbf{0}. \tag{60}$$

It follows that we can set $\bm{\nu}=\bm{\mu}+\bm{\alpha}$, and the proof is complete. ∎
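The decomposition (59)-(60) can be checked numerically on a small example. In the sketch below, `cone_coefficients`, the projected-gradient solver, and the two gradients are illustrative assumptions; rows of `G` are the gradients and $\mathbf{q}=J\bm{\mu}$ is their uniform average, as in DualProj.

```python
import numpy as np

def cone_coefficients(v, G, iters=5000):
    """Coefficients a >= 0 such that G^T a is the projection of v onto cone(J).

    Assumed solver: projected gradient descent on 0.5 * ||G^T a - v||^2.
    """
    gram = G @ G.T
    step = 1.0 / max(np.linalg.eigvalsh(gram)[-1], 1e-12)
    a = np.zeros(G.shape[0])
    for _ in range(iters):
        a = np.maximum(0.0, a - step * (gram @ a - G @ v))
    return a

G = np.array([[1.0, 0.0], [-3.0, 1.0]])   # illustrative conflicting gradients
m = G.shape[0]
mu = np.full(m, 1.0 / m)
q = G.T @ mu                              # q = J mu, the average gradient

alpha = cone_coefficients(-q, G)          # P_cone(J)(-q) = J alpha
d = q + G.T @ alpha                       # projection of q onto the dual cone, eq. (60)
d_star = q - d                            # the component in -cone(J)
nu = mu + alpha                           # d = J nu

print(d, d @ d_star, G @ d)
```

The printed inner product `d @ d_star` is zero up to solver error and `G @ d` is non-negative, matching the orthogonality and feasibility conditions in (59).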

See [Theorem 3](https://arxiv.org/html/2605.30452#Thmtheorem3).

###### Proof.

From $L$-smoothness, we have

$$f_k(\mathbf{w}_{t+1})\leq f_k(\mathbf{w}_t)-\eta\langle\mathbf{d}_t,\nabla f_k(\mathbf{w}_t)\rangle+\frac{L\eta^2}{2}\|\mathbf{d}_t\|^2. \tag{61}$$

Since $f_k$ is convex, for all $\mathbf{w}$:

$$f_k(\mathbf{w}_{t+1})\leq f_k(\mathbf{w})+\langle\mathbf{w}_t-\mathbf{w},\nabla f_k(\mathbf{w}_t)\rangle-\eta\langle\mathbf{d}_t,\nabla f_k(\mathbf{w}_t)\rangle+\frac{L\eta^2}{2}\|\mathbf{d}_t\|^2. \tag{62}$$

Rearranging and simplifying:

$$f_k(\mathbf{w}_{t+1})-f_k(\mathbf{w})\leq\langle\mathbf{w}_t-\mathbf{w}-\eta\mathbf{d}_t,\nabla f_k(\mathbf{w}_t)\rangle+\frac{L\eta^2}{2}\|\mathbf{d}_t\|^2. \tag{63}$$

Taking the inner product with $\bm{\lambda}_t$ on both sides (recall $\mathbf{d}_t=\sum_k\lambda_{t,k}\nabla f_k(\mathbf{w}_t)$ with $\bm{\lambda}_t\in\Delta$):

$$\langle\bm{\lambda}_t,\mathbf{f}(\mathbf{w}_{t+1})-\mathbf{f}(\mathbf{w})\rangle\leq\langle\mathbf{w}_t-\mathbf{w}-\eta\mathbf{d}_t,\mathbf{d}_t\rangle+\tfrac{L\eta^2}{2}\|\mathbf{d}_t\|^2=\langle\mathbf{w}_t-\mathbf{w},\mathbf{d}_t\rangle+\big(\tfrac{L\eta^2}{2}-\eta\big)\|\mathbf{d}_t\|^2. \tag{64}$$

As long as $\eta\leq\frac{1}{L}$, we have $\tfrac{L\eta^2}{2}-\eta\leq-\tfrac{\eta}{2}$, so the above implies

$$\langle\bm{\lambda}_t,\mathbf{f}(\mathbf{w}_{t+1})-\mathbf{f}(\mathbf{w})\rangle\leq\frac{1}{2\eta}\big(\|\mathbf{w}_t-\mathbf{w}\|^2-\|\mathbf{w}_t-\mathbf{w}-\eta\mathbf{d}_t\|^2\big) \tag{65}$$

$$\phantom{\langle\bm{\lambda}_t,\mathbf{f}(\mathbf{w}_{t+1})-\mathbf{f}(\mathbf{w})\rangle}=\frac{1}{2\eta}\big(\|\mathbf{w}_t-\mathbf{w}\|^2-\|\mathbf{w}_{t+1}-\mathbf{w}\|^2\big). \tag{66}$$

Telescoping, we arrive at:

$$\frac{1}{T}\sum_{t=0}^{T-1}\langle\bm{\lambda}_t,\mathbf{f}(\mathbf{w}_{t+1})-\mathbf{f}(\mathbf{w})\rangle\leq\frac{\|\mathbf{w}_0-\mathbf{w}\|^2}{2\eta T} \tag{67}$$

Since $\mathbf{f}(\mathbf{w}_t)$ monotonically decreases and $\bm{\lambda}_t\geq\mathbf{0}$, we further lower bound the left-hand side:

$$\frac{1}{T}\sum_{t=0}^{T-1}\langle\bm{\lambda}_t,\mathbf{f}(\mathbf{w}_T)-\mathbf{f}(\mathbf{w})\rangle\leq\frac{\|\mathbf{w}_0-\mathbf{w}\|^2}{2\eta T} \tag{68}$$

Denoting $\bm{\lambda}:=\frac{1}{T}\sum_{t=0}^{T-1}\bm{\lambda}_t$, we conclude that for all $\mathbf{w}$:

$$\langle\bm{\lambda},\mathbf{f}(\mathbf{w}_T)\rangle-\langle\bm{\lambda},\mathbf{f}(\mathbf{w})\rangle\leq\frac{\|\mathbf{w}_0-\mathbf{w}\|^2}{2\eta T}. \tag{69}$$

The proof is now complete. ∎

To exploit ([69](https://arxiv.org/html/2605.30452#A2.E69)) as tightly as possible, we simply take $\mathbf{w}=\mathbf{w}_*=\mathop{\mathrm{argmin}}_{\mathbf{w}}\langle\bm{\lambda},\mathbf{f}(\mathbf{w})\rangle$, which is weakly Pareto optimal for convex $\mathbf{f}$ (and Pareto optimal for strictly convex $\mathbf{f}$). Thus, the iterate $\mathbf{w}_t$ converges at rate $O(1/t)$ in terms of the $\bm{\lambda}$-averaged function value.

Furthermore, note that $\bm{\lambda}$ is in the simplex, a compact set. Thus, it possesses a convergent subsequence with limit $\bm{\lambda}_\star$. Assuming that the corresponding iterates are bounded, passing to the limit in inequality ([69](https://arxiv.org/html/2605.30452#A2.E69)) shows that every accumulation point $\mathbf{w}_\star$ is weakly Pareto optimal, since $\mathbf{w}_\star$ minimizes the scalarized objective $\langle\bm{\lambda}_\star,\mathbf{f}\rangle$. Additionally, when $\bm{\lambda}_\star\in\operatorname{ri}(\Delta)$, $\mathbf{w}_\star$ is actually Pareto optimal.

##### Remark\.

For non-conflicting directions, monotonicity is guaranteed provided the descent direction $\mathbf{d}$ lies in the interior of the dual cone $\mathrm{cone}^*(J_{\mathbf{f}}(\mathbf{w}))$ (as in MGDA, UPGrad, or Nash-MTL) and the step size $\eta$ is chosen appropriately. In degenerate cases where $\mathbf{d}$ lies on the boundary of $\mathrm{cone}^*(J_{\mathbf{f}}(\mathbf{w}))$, additional mechanisms such as line search or perturbation can be adopted to maintain monotonicity.

To see this, applying the $L$-smoothness inequality to every $f_k$ gives:

$$f_k(\mathbf{w}_t-\eta_t\mathbf{d}_t)\leq f_k(\mathbf{w}_t)-\eta_t\langle\nabla f_k(\mathbf{w}_t),\mathbf{d}_t\rangle+\frac{L}{2}\eta_t^2\|\mathbf{d}_t\|^2,\quad\forall k \tag{70}$$

Using the strengthened non-conflicting condition,

$$\langle\nabla f_k(\mathbf{w}_t),\mathbf{d}_t\rangle>0,\quad\forall k \tag{71}$$

and choosing

$$0<\eta_t\leq\frac{\min_k\langle\nabla f_k(\mathbf{w}_t),\mathbf{d}_t\rangle}{L\|\mathbf{d}_t\|^2} \tag{72}$$

yields $\mathbf{f}(\mathbf{w}_t-\eta_t\mathbf{d}_t)\leq\mathbf{f}(\mathbf{w}_t)$, since the quadratic term in (70) is then at most half of the linear decrease for every $k$.
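The step-size rule (72) can be illustrated on a toy problem. Everything specific in the sketch below is an assumption made for illustration: the quadratic objectives $f_k(\mathbf{w})=\tfrac{1}{2}\|\mathbf{w}-\mathbf{a}_k\|^2$ (which are $1$-smooth), the anchor points in `A`, the starting point, the use of the average gradient as direction, and the helper name `safe_step`.

```python
import numpy as np

def safe_step(G, d, L):
    """Largest step size allowed by eq. (72); rows of G are the gradients.

    Requires the strengthened condition (71): <g_k, d> > 0 for every k.
    """
    inner = G @ d
    if inner.min() <= 0.0:
        raise ValueError("d must lie in the interior of the dual cone")
    return inner.min() / (L * (d @ d))

# Toy objectives f_k(w) = 0.5 * ||w - a_k||^2, so grad f_k(w) = w - a_k and L = 1.
A = np.array([[1.0, 0.0], [-1.0, 0.5]])

def f(w):
    return 0.5 * np.sum((w - A) ** 2, axis=1)

w = np.array([0.3, 2.0])
G = w - A                      # gradients at w, one per row
d = G.mean(axis=0)             # here the average gradient happens to satisfy (71)
eta = safe_step(G, d, L=1.0)
w_next = w - eta * d
print(f(w), f(w_next))         # every objective decreases
```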

## Appendix C Experiment details

### C.1 Non-conflicting gradient aggregators

#### C.1.1 Methods

Existing aggregators. Whenever possible, we stick to the official implementations of all methods, and otherwise use the TorchJD repository [quinton2024jacobian] as a reference. For Nash-MTL [NavonSAMKCF22], we adopt the official implementation's default, which always clips the aggregated update direction to satisfy $\|\mathbf{d}_t\|=1$. All examined methods are run in their deterministic, full-batch form, without momentum. For the normalized variants (e.g., Nash-MTL\*, UPGrad\*, DualProj\*), we keep the original implementations and simply re-scale $\mathbf{d}_t$ to lie in the convex hull of gradients $\{\mathbf{g}_k\}$ by normalizing the weighting coefficients, $\bm{\lambda}_t\leftarrow\frac{\bm{\lambda}_t}{\|\bm{\lambda}_t\|_1}$, ensuring that $\bm{\lambda}_t\in\Delta$.

Mixed Aggregator Scheduling (MAS). For MAS, we consider two schedulings: (1) Uniform random selection ('Rand'), where at each iteration one aggregator is chosen uniformly at random from the pool $\{\text{MGDA, Nash-MTL*, UPGrad*, DualProj*}\}$; (2) Round-robin every $n$ iterations ('RR($n$)'), where each aggregator in the pool is applied for $n$ consecutive iterations before switching to the next.
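The two schedules can be sketched as generators that yield the name of the aggregator to use at each iteration. The generator names, the `seed` argument, and the use of plain strings as stand-ins for aggregator objects are assumptions for illustration; the experiments' actual code is not shown in this appendix.

```python
import random
from itertools import islice

# Pool of aggregators named in the text; strings stand in for the real objects.
POOL = ["MGDA", "Nash-MTL*", "UPGrad*", "DualProj*"]

def rand_schedule(pool, seed=0):
    """'Rand': pick one aggregator uniformly at random at every iteration."""
    rng = random.Random(seed)
    while True:
        yield rng.choice(pool)

def round_robin_schedule(pool, n):
    """'RR(n)': apply each aggregator for n consecutive iterations, then switch."""
    t = 0
    while True:
        yield pool[(t // n) % len(pool)]
        t += 1

print(list(islice(round_robin_schedule(POOL, 2), 9)))
print(list(islice(rand_schedule(POOL), 5)))
```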

Misc. The learning rate $\eta$ used for the synthetic problems is $0.001$, while for the fairness classification benchmark it is $0.005$.

#### C\.1\.2Synthetic problem details

Here we provide explicit definitions for VLMOP2 and Omnitest objectives used in our paper\.

VLMOP2\[vlmop2paper\]\.

min𝐱∈ℝn\\displaystyle\\min\_\{\\mathbf\{x\}\\in\\mathbb\{R\}^\{n\}\}\\quadf1\(𝐱\)=1−exp⁡\(−∑i=1n\(xi−1n\)2\),\\displaystyle f\_\{1\}\(\\mathbf\{x\}\)=1\-\\exp\\\!\\left\(\-\\sum\_\{i=1\}^\{n\}\\left\(x\_\{i\}\-\\tfrac\{1\}\{\\sqrt\{n\}\}\\right\)^\{2\}\\right\),\(73\)f2\(𝐱\)=1−exp⁡\(−∑i=1n\(xi\+1n\)2\),\\displaystyle f\_\{2\}\(\\mathbf\{x\}\)=1\-\\exp\\\!\\left\(\-\\sum\_\{i=1\}^\{n\}\\left\(x\_\{i\}\+\\tfrac\{1\}\{\\sqrt\{n\}\}\\right\)^\{2\}\\right\),\(74\)s\.t\.−2≤xi≤2,i=1,…,n\.\\displaystyle\-2\\leq x\_\{i\}\\leq 2,\\quad i=1,\\dots,n\.\(75\)
Omnitest \[Omnitest\]\.

$$
\begin{aligned}
\min_{\mathbf{x}\in\mathbb{R}^{n}} \quad & f_{1}(\mathbf{x}) = \sum_{i=1}^{n} \sin(\pi x_{i}), && \text{(76)} \\
& f_{2}(\mathbf{x}) = \sum_{i=1}^{n} \cos(\pi x_{i}), && \text{(77)} \\
\text{s.t.} \quad & 0 \leq x_{i} \leq 6, \quad i = 1, \dots, n. && \text{(78)}
\end{aligned}
$$
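The Omnitest objectives in (76) and (77) are separable, so their gradients are available coordinate-wise: $\partial f_1/\partial x_i = \pi\cos(\pi x_i)$ and $\partial f_2/\partial x_i = -\pi\sin(\pi x_i)$. A minimal sketch with hypothetical function names:

```python
import math


def omnitest(x):
    """Return (f1, f2) for Omnitest at the point x (a list of floats)."""
    f1 = sum(math.sin(math.pi * xi) for xi in x)
    f2 = sum(math.cos(math.pi * xi) for xi in x)
    return f1, f2


def omnitest_grads(x):
    """Return the analytic gradients (g1, g2) of f1 and f2 at x."""
    g1 = [math.pi * math.cos(math.pi * xi) for xi in x]
    g2 = [-math.pi * math.sin(math.pi * xi) for xi in x]
    return g1, g2


x_demo = [0.5, 1.5]
f_demo = omnitest(x_demo)
g1_demo, g2_demo = omnitest_grads(x_demo)
```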
For both problems, although $\mathbf{x}$ is formally constrained, we initialize it in the interior and ensure that the entire trajectory, including the Pareto solution it converges to, remains in the interior\. This allows us to empirically treat them as unconstrained MOO problems and apply all the gradient aggregation methods considered in this paper\.

#### C\.1\.3 Fairness classification

Setup\. For the fairness classification task on the Adult dataset, we follow the LibMOON benchmark \[Zhang2024libmoon\] for both dataset preprocessing and model architecture\. Specifically, we use the function libmoon\.util\.mtl\.get\_dataset\("adult"\) to generate the train, validation, and test splits\. For the model, we adopt LibMOON’s M4fair\_model architecture: a fully connected neural network consisting of three hidden layers of dimension $256$ each, with ReLU activations\. The binary classification task is to predict whether the annual income is greater than $50K\.

We use the Difference of Equalized Odds \(DEO\) as our fairness metric\. Following \[Hardt2016equality\], we define

$$
\begin{aligned}
\text{DEO1} \;&=\; \big|\Pr\{\widehat{Y}=1 \mid A=0, Y=1\} - \Pr\{\widehat{Y}=1 \mid A=1, Y=1\}\big|, && \text{(79)} \\
\text{DEO2} \;&=\; \big|\Pr\{\widehat{Y}=1 \mid A=0, Y=0\} - \Pr\{\widehat{Y}=1 \mid A=1, Y=0\}\big|, && \text{(80)}
\end{aligned}
$$

where $\widehat{Y}$ denotes the model’s predicted label, $A$ the sensitive attribute \(e\.g\., gender\), and $Y$ the ground\-truth label\. To make these metrics differentiable, we apply a $\tanh$ relaxation that replaces the indicator function $\mathbf{1}\{p_{i} \geq 0.5\}$, yielding smooth surrogate losses\. Thus, the three objectives are: $f_{1}$, the binary cross\-entropy loss; $f_{2}$, the relaxed DEO1; and $f_{3}$, the relaxed DEO2\.
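To make (79) and (80) concrete, the sketch below computes DEO1 and DEO2 from predicted probabilities, once with the hard indicator and once with a smooth stand-in. The exact form of the $\tanh$ relaxation is not given here, so the surrogate $\tfrac{1}{2}\big(1 + \tanh(\tau\,(p - 0.5))\big)$ and the sharpness parameter `tau` are assumptions made for illustration, not the relaxation used by the paper or by LibMOON. All function names and the toy data are hypothetical.

```python
import math


def hard_indicator(p):
    """The non-differentiable prediction rule 1{p >= 0.5}."""
    return 1.0 if p >= 0.5 else 0.0


def tanh_surrogate(p, tau=5.0):
    """ASSUMED smooth stand-in for the indicator; tau controls sharpness."""
    return 0.5 * (1.0 + math.tanh(tau * (p - 0.5)))


def deo(probs, a, y, label, score=hard_indicator):
    """|Pr(Yhat=1 | A=0, Y=label) - Pr(Yhat=1 | A=1, Y=label)| with a pluggable score."""
    rates = []
    for group in (0, 1):
        picked = [score(p) for p, ai, yi in zip(probs, a, y)
                  if ai == group and yi == label]
        if not picked:
            raise ValueError("empty subgroup; the rate is undefined")
        rates.append(sum(picked) / len(picked))
    return abs(rates[0] - rates[1])


# Toy data: predicted probabilities, sensitive attribute A, and true label Y.
probs = [0.9, 0.2, 0.8, 0.7, 0.1, 0.6]
a = [0, 0, 1, 1, 0, 1]
y = [1, 1, 1, 1, 0, 0]

deo1_hard = deo(probs, a, y, label=1)                        # DEO1, eq. (79)
deo2_hard = deo(probs, a, y, label=0)                        # DEO2, eq. (80)
deo1_soft = deo(probs, a, y, label=1, score=tanh_surrogate)  # relaxed DEO1
```

The soft variant varies smoothly with each probability, so a gradient with respect to the model outputs exists everywhere except where the two group rates coincide.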

More results\. Here we provide additional experimental results for fairness classification; see [Figure 5](https://arxiv.org/html/2605.30452#A3.F5) and [Figure 6](https://arxiv.org/html/2605.30452#A3.F6)\.

We observe that the Pareto stationarity measure $\gamma(\mathbf{w}_{t})$ converges to $0$ for all methods except Nash\-MTL \(without normalization\), which suffers from overshooting because $\|\mathbf{d}\|$ is fixed at $1$, leading to instability near Pareto stationarity\. Applying convex\-hull normalization to Nash\-MTL yields smoother convergence, and a similar but less pronounced effect is also observed for UPGrad\.
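The overshooting mechanism can be seen in a deliberately simplified setting. The toy below is an assumption-laden illustration, not one of the paper's experiments: it uses a single one-dimensional objective $f(w) = \tfrac{1}{2}w^{2}$ and compares a step along a direction rescaled to unit norm with a step along the raw gradient, both with $\eta = 0.005$. With a unit-norm direction every update moves exactly $\eta$, so the iterate cannot settle closer than about $\eta/2$ to the stationary point.

```python
def run(direction_fn, w0=0.0125, eta=0.005, steps=2000):
    """Iterate w <- w - eta * direction_fn(w) and return the final iterate."""
    w = w0
    for _ in range(steps):
        w = w - eta * direction_fn(w)
    return w


def unit_direction(w):
    """Gradient of 0.5 * w**2 rescaled to norm 1 (its sign in one dimension)."""
    return 1.0 if w > 0 else -1.0


def raw_gradient(w):
    """Unnormalized gradient of 0.5 * w**2, which shrinks near the optimum."""
    return w


w_unit = run(unit_direction)  # keeps bouncing around 0 at a distance of about eta / 2
w_grad = run(raw_gradient)    # contracts geometrically towards 0
```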

![Refer to caption](https://arxiv.org/html/2605.30452v1/figs/fairness/3objectives/seed100/Cross_entropy_loss.png)

![Refer to caption](https://arxiv.org/html/2605.30452v1/figs/fairness/3objectives/seed100/DEO1.png)

![Refer to caption](https://arxiv.org/html/2605.30452v1/figs/fairness/3objectives/seed100/DEO2.png)

![Refer to caption](https://arxiv.org/html/2605.30452v1/figs/fairness/3objectives/seed100/Measure_of_Pareto_stationarity.png)

Figure 5: Non\-conflicting aggregators on the LibMOON fairness classification benchmark\. Nash\-MTL is unstable, while Nash\-MTL\* \(the normalized variant\) is smooth and stable\.

![Refer to caption](https://arxiv.org/html/2605.30452v1/x11.png)

![Refer to caption](https://arxiv.org/html/2605.30452v1/x12.png)

![Refer to caption](https://arxiv.org/html/2605.30452v1/x13.png)

![Refer to caption](https://arxiv.org/html/2605.30452v1/x14.png)

Figure 6: Mixed aggregator scheduling on the fairness classification benchmark\.

### C\.2 Capped MGDA experiment details

Data\. We conduct experiments on the CIFAR\-10 dataset in the adversarial federated learning setting\. To create a non\-i\.i\.d\. partition, we follow the sampling procedure of \[HuSZY22\]\. Specifically, we first sort all data points by class and then split them consecutively into 250 shards of 200 images each, where each shard contains images from a single class\. Each client is randomly assigned 10 distinct shards, resulting in 2000 instances per client\. This produces heterogeneous data distributions across clients, as each client has access to different subsets of class labels\. For example, Client 0’s data includes class labels $\{0,3,4,5,6,7,9\}$ while lacking labels $\{1,2,8\}$\.
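The sharding procedure can be sketched as follows. Several details are assumptions made for illustration: the number of clients is taken to be 10, shards are drawn without replacement so that no shard is shared between clients, the seed is arbitrary, and synthetic labels stand in for the CIFAR-10 training labels (50,000 items, 5,000 per class). The function name `shard_partition` is hypothetical.

```python
import random


def shard_partition(labels, num_shards=250, shard_size=200,
                    num_clients=10, shards_per_client=10, seed=0):
    """Sort indices by class, cut them into shards, and deal shards to clients."""
    if num_shards * shard_size != len(labels):
        raise ValueError("shards must cover the dataset exactly")
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    shards = [order[s * shard_size:(s + 1) * shard_size] for s in range(num_shards)]
    rng = random.Random(seed)
    # ASSUMPTION: shards are sampled without replacement across all clients.
    chosen = rng.sample(range(num_shards), num_clients * shards_per_client)
    clients = []
    for c in range(num_clients):
        ids = chosen[c * shards_per_client:(c + 1) * shards_per_client]
        clients.append([i for s in ids for i in shards[s]])
    return clients


# Synthetic stand-in for the CIFAR-10 training labels.
labels = [i % 10 for i in range(50000)]
clients = shard_partition(labels)
classes_per_client = [sorted({labels[i] for i in idx}) for idx in clients]
```

Because each shard holds a single class, a client with 10 shards sees at most 10 classes and typically fewer, which is the source of the label heterogeneity described above.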

Model\. We use the same lightweight CNN for MGDA, Capped\-MGDA, and the centralized\-training baseline; see [Table 3](https://arxiv.org/html/2605.30452#A3.T3) for the configuration\.

Table 3: Network architecture for CIFAR\-10 experiments with CNN\.

Adversarial FL setting\. In the adversarial federated learning setup, during each epoch’s gradient aggregation, a single malicious attacker injects a crafted gradient that opposes one randomly selected participant’s gradient\. Concretely, letting $k$ be sampled uniformly from the clients, the injected gradient is $\mathbf{g}_{\mathrm{adv}} = -\mathbf{g}_{k} + \epsilon\,\mathcal{N}(0, I)$, with noise scale $\epsilon = 0.01$\.
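The attack above amounts to negating one client's gradient and adding small isotropic Gaussian noise. A minimal sketch, with gradients as flat Python lists and a hypothetical function name `adversarial_gradient`; the toy gradients and the seed are illustrative only.

```python
import random


def adversarial_gradient(client_grads, eps=0.01, rng=None):
    """Pick a client k uniformly and return (k, -g_k + eps * standard normal noise)."""
    rng = rng or random.Random()
    k = rng.randrange(len(client_grads))
    g_adv = [-gi + eps * rng.gauss(0.0, 1.0) for gi in client_grads[k]]
    return k, g_adv


# Toy client gradients (assumed values for illustration).
client_grads = [[1.0, -2.0, 0.5], [0.3, 0.3, 0.3], [-1.0, 0.0, 2.0]]
k, g_adv = adversarial_gradient(client_grads, eps=0.01, rng=random.Random(0))

# The injected gradient nearly cancels the victim's gradient.
residual = [a + b for a, b in zip(g_adv, client_grads[k])]
```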

Misc\. We use a learning rate of $\eta = 0.01$, train for $1000$ epochs, and adopt full\-batch training to ensure determinism\.

#### C\.2\.1 Additional Results on Capped MGDA and the Adversarial Federated Learning Setting

In this subsection, we provide additional empirical results to further illustrate the behavior of Capped MGDA and other methods under adversarial federated learning\. These results complement the main experiments in Section 5\.2 by offering finer\-grained insights as well as comparisons with related variants such as MGDA with clipping\.

[Table 4](https://arxiv.org/html/2605.30452#A3.T4) reports the final global test accuracy across a broader set of aggregation methods\. Empirically, we observe that MGDA performs poorly under adversarial gradients, whereas projection\-based methods such as UPGrad\* and DualProj\* remain substantially more robust, with Nash\-MTL\* showing moderate performance\. Notably, Capped MGDA with a small cap \(i\.e\., $C = 0.1$\) achieves accuracy comparable to the strongest baselines, indicating that restricting the influence of the adversarial client substantially improves robustness\.

[Figure 7](https://arxiv.org/html/2605.30452#A3.F7) compares Capped MGDA with a naive “MGDA \+ coefficient\-clipping” variant using the same threshold\. The performance gap highlights that simply clipping the MGDA coefficients is insufficient; the constrained optimization formulation underlying Capped MGDA is non\-trivial and essential for producing effective descent directions\.
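A small worked example suggests why solving the constrained problem differs from clipping afterwards. The sketch below rests on assumptions that should be checked against the main text: the cap is modelled as an upper bound $\lambda_k \leq C$ on the simplex weights of the min-norm problem, the solver is plain projected gradient descent rather than the authors' method, and the naive variant is taken to clip the MGDA weights without renormalizing. All function names are hypothetical.

```python
def project_capped_simplex(v, cap, iters=100):
    """Euclidean projection onto {lam : sum(lam) = 1, 0 <= lam_k <= cap} by bisection."""
    if cap * len(v) < 1.0:
        raise ValueError("cap too small: need cap * len(v) >= 1")
    lo, hi = min(v) - cap, max(v)  # the clipped sum is >= 1 at lo and 0 at hi
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        total = sum(min(max(x - tau, 0.0), cap) for x in v)
        if total > 1.0:
            lo = tau
        else:
            hi = tau
    tau = 0.5 * (lo + hi)
    return [min(max(x - tau, 0.0), cap) for x in v]


def min_norm_weights(grads, cap=1.0, steps=500):
    """Minimize 0.5 * ||sum_k lam_k g_k||^2 over the capped simplex (cap=1.0 gives MGDA)."""
    m = len(grads)
    gram = [[sum(x * y for x, y in zip(gi, gj)) for gj in grads] for gi in grads]
    step = 1.0 / max(sum(gram[k][k] for k in range(m)), 1e-12)  # 1 / trace <= 1 / L
    lam = [1.0 / m] * m
    for _ in range(steps):
        grad = [sum(gram[i][j] * lam[j] for j in range(m)) for i in range(m)]
        lam = project_capped_simplex([l - step * g for l, g in zip(lam, grad)], cap)
    return lam


def combine(lam, grads):
    """Return sum_k lam[k] * grads[k]."""
    return [sum(l * g[i] for l, g in zip(lam, grads)) for i in range(len(grads[0]))]


# Two honest gradients plus an adversarial one that opposes the first.
grads = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

lam_mgda = min_norm_weights(grads, cap=1.0)
lam_capped = min_norm_weights(grads, cap=0.4)
lam_clipped = [min(l, 0.4) for l in lam_mgda]  # ASSUMED naive post-hoc clipping

d_mgda = combine(lam_mgda, grads)        # cancels to (almost) zero
d_capped = combine(lam_capped, grads)    # keeps a component along the second gradient
d_clipped = combine(lam_clipped, grads)  # still cancels
```

In this toy case MGDA puts half of its weight on the adversarial gradient and the update vanishes. Clipping those weights afterwards leaves the cancellation in place, whereas the capped problem forces weight onto the remaining honest gradient and returns a nonzero direction.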

[Figure 8](https://arxiv.org/html/2605.30452#A3.F8) visualizes average client accuracy during training for a range of cap values $C$\. We observe a consistent trend: smaller $C$ leads to higher robustness and improved accuracy\. This aligns with the intuition that a tighter cap limits the adversarial client’s ability to distort the aggregated gradient\. As $C$ decreases, the aggregated direction becomes more stable across clients, leading to smoother training dynamics and better overall performance\.

Table 4: Comparison of global test accuracy across methods under the adversarial federated learning setting\. All values are the mean accuracy over $10$ clients after $1000$ training epochs\.

![Refer to caption](https://arxiv.org/html/2605.30452v1/x15.png)

Figure 7: Capped MGDA vs\. ‘MGDA \+ coefficient clipping’\. The latter is less effective at the same threshold of $0.2$\.

![Refer to caption](https://arxiv.org/html/2605.30452v1/x16.png)

Figure 8: Average client accuracy vs\. epoch\. Accuracy improves clearly as $C$ decreases, indicating greater robustness from limiting the impact of the adversarial gradient\.