Accelerating Multi-Objective Bayesian Optimisation via Predictive-Gradient Catalysts
Summary
This paper introduces a general acceleration mechanism for multi-objective Bayesian optimisation that uses Gaussian process predictive gradients as auxiliary signals to augment existing acquisition functions, enabling faster convergence to the global Pareto set under limited evaluation budgets.
View Cached Full Text
Cached at: 06/08/26, 09:19 AM
# Accelerating Multi‑Objective Bayesian Optimisation via Predictive‑Gradient Catalysts
Source: [https://arxiv.org/html/2606.06984](https://arxiv.org/html/2606.06984)
11institutetext:Loughborough University, Loughborough, LE11 3TU, U\.K\.
11email:A\.Rahat@lboro\.ac\.uk22institutetext:University of Exeter, Exeter EX4 4QD, U\.K\.
22email:\{T\.Chugh,J\.E\.Fieldsend\}@exeter\.ac\.uk33institutetext:The University of Manchester, Manchester M15 6PB, U\.K\.
33email:richard\.allmendinger@manchester\.ac\.uk###### Abstract
This paper presents a general acceleration mechanism for multi\-objective Bayesian optimisation \(MOBO\) that leverages Gaussian process predictive gradients as auxiliary signals\. Rather than replacing existing Pareto\-compliant acquisition functions, the proposed approach augments them with local stationarity information derived from surrogate\-derived gradients, enabling faster convergence toward the global Pareto set under limited evaluation budgets\. Two catalyst instantiations are investigated: an adaptive Multiple\-Gradient Descent Algorithm\-Based Catalyst \(MGDA\) and a predefined\-weight variant that enables focused exploration when budgets are tight\. Experiments on the DTLZ benchmark suite \(using 2 objectives and 10 decision variables\) show that predictive gradient catalysis can deliver significant acceleration compared to other acquisition functions \(EHVI, AugTch, tMPoI, SAF\) when surrogates are accurate, particularly for stationary problems\.111Supplementary material, code, and data are available at[Zenodo: 19649281](https://doi.org/10.5281/zenodo.19649281)\.
## 1Introduction
Bayesian optimisation \(BO\) is a widely used framework for optimising expensive, noisy, or black‑box objective functions\. By combining probabilistic surrogate models, most commonly Gaussian processes, with acquisition functions that balance exploration and exploitation, BO can identify high\-quality solutions using relatively few expensive evaluations\. This has led to successful applications in machine learning, engineering design, and simulation‑based optimisation\[[23](https://arxiv.org/html/2606.06984#bib.bib32),[10](https://arxiv.org/html/2606.06984#bib.bib33)\]\.
Recent work has explored the use of gradient information to accelerate BO\. Early approaches focus on incorporating*observed*gradients, obtained via adjoint methods or automatic differentiation, into surrogate models and acquisition functions\. In particular, Wu et al\.\[[27](https://arxiv.org/html/2606.06984#bib.bib31)\]showed that using exact gradients within a one‑step look\-ahead knowledge‑gradient framework can substantially reduce sample complexity\. More recently, Perrin and Le Riche proposed an acceleration strategy that exploits*predictive*gradients of Gaussian process surrogates to bias acquisition toward locally optimal regions\[[18](https://arxiv.org/html/2606.06984#bib.bib13)\]\. Although effective, this approach is limited to single\-objective settings and is tightly coupled to a specific acquisition function\.
However, many real\-world problems are inherently multi\-objective, requiring the optimisation of several competing criteria and yielding a set of trade‑off solutions rather than a single optimum\. In such problems, small design changes can significantly shift how performance is distributed across objectives, making efficient exploration of the Pareto set particularly challenging when evaluations are expensive\. Although gradient\-based stationarity notions play a central role in deterministic multi‑objective optimisation, their use within multi\-objective Bayesian optimisation \(MOBO\) has remained largely unexplored\.
This paper addresses this gap by introducing a general acceleration mechanism for MOBO that leverages Gaussian process*predictive gradients*as auxiliary, catalyst\-like signals\. Rather than replacing existing acquisition functions, the proposed framework augments them with local stationarity information derived from surrogate gradients, enabling faster convergence toward the Pareto set under limited evaluation budgets\.
The maincontributionsof this paper are:
- •A catalytic framework for MOBO that exploits GP predictive gradients to accelerate convergence without altering underlying acquisition functions\.
- •Two catalyst instantiations grounded in Pareto stationarity theory: an adaptive Multiple\-Gradient Descent Algorithm \(MGDA\)\-based strategy and a predefined\-weight approach for focused trade\-off exploration\.
- •A principled integration of catalytic signals with standard Pareto\-compliant acquisition functions via augmented Tchebycheff scalarisation\.
- •An extensive evaluation on the DTLZ benchmark suite showing consistent acceleration for \(approximately\) stationary problems\.
The remainder of the paper is organised as follows\. Section[2](https://arxiv.org/html/2606.06984#S2)reviews background material on multi\-objective optimisation, Gaussian processes, and MOBO\. Section[3](https://arxiv.org/html/2606.06984#S3)presents the proposed catalytic framework\. Experimental settings and results are discussed in Sections[4](https://arxiv.org/html/2606.06984#S4)and[5](https://arxiv.org/html/2606.06984#S5), followed by conclusions in Section[6](https://arxiv.org/html/2606.06984#S6)\.
## 2Background
We now set out the background material underpinning this work\.
### 2\.1Multi\-Objective Optimisation and Pareto Stationarity
Let𝐱∈ℝn\\mathbf\{x\}\\in\\mathbb\{R\}^\{n\}be annn\-dimensional decision vector\. Without loss of generality, an unconstrained multi\-objective optimisation problem can be defined as
min𝐱∈𝒳𝐟\(𝐱\),\\displaystyle\\min\_\{\\mathbf\{x\}\\in\\mathcal\{X\}\}\\,\\mathbf\{f\}\(\\mathbf\{x\}\),\(1\)where𝐟\(𝐱\)=\(f1\(𝐱\),…,fM\(𝐱\)\)⊤\\mathbf\{f\}\(\\mathbf\{x\}\)=\(f\_\{1\}\(\\mathbf\{x\}\),\\dots,f\_\{M\}\(\\mathbf\{x\}\)\)^\{\\top\}is anMM\-dimensional objective vector,fi\(⋅\)f\_\{i\}\(\\cdot\)denotes theiith objective, and𝒳⊆ℝn\\mathcal\{X\}\\subseteq\\mathbb\{R\}^\{n\}is the feasible space\.
Solving a multi\-objective problem yields a set of optimal trade\-off solutions rather than a single optimal point\. A solution is considered desirable if no other feasible solution can improve one objective without worsening at least one other\. This leads to the standard concept of Pareto optimality\. A vector𝐱′∈𝒳\\mathbf\{x\}^\{\\prime\}\\in\\mathcal\{X\}is Pareto optimal if there exists no other vector𝐱′′∈𝒳\\mathbf\{x\}^\{\\prime\\prime\}\\in\\mathcal\{X\}that dominates it; that is,
- •𝐱′′\\mathbf\{x\}^\{\\prime\\prime\}is at least as good as𝐱′\\mathbf\{x\}^\{\\prime\}in all objectives, and
- •𝐱′′\\mathbf\{x\}^\{\\prime\\prime\}is strictly better in at least one objective\.
Formally, the Pareto optimal set𝒫∗\\mathcal\{P\}^\{\*\}is defined as
𝒫∗=\{𝐱′∈𝒳\|\\displaystyle\\mathcal\{P\}^\{\*\}=\\Big\\\{\\mathbf\{x\}^\{\\prime\}\\in\\mathcal\{X\}~\\Big\|~∄𝐱′′∈𝒳:\(∀i∈\{1,…,M\},fi\(𝐱′′\)≤fi\(𝐱′\)\)\\displaystyle\\nexists\\mathbf\{x\}^\{\\prime\\prime\}\\in\\mathcal\{X\}:\\left\(\\forall i\\in\\\{1,\\dots,M\\\},\\ f\_\{i\}\(\\mathbf\{x\}^\{\\prime\\prime\}\)\\leq f\_\{i\}\(\\mathbf\{x\}^\{\\prime\}\)\\right\)∧\(∃j∈\{1,…,M\},fj\(𝐱′′\)<fj\(𝐱′\)\)\}\.\\displaystyle\\qquad\\qquad\\qquad\\land~\\left\(\\exists j\\in\\\{1,\\ldots,M\\\},\\ f\_\{j\}\(\\mathbf\{x\}^\{\\prime\\prime\}\)<f\_\{j\}\(\\mathbf\{x\}^\{\\prime\}\)\\right\)\\Big\\\}\.\(2\)
The Pareto frontℱ∗\\mathcal\{F\}^\{\*\}is the image of the Pareto set in the objective space, i\.e\.,ℱ∗=\{𝐟\(𝐱\):𝐱∈𝒫∗\}\.\\mathcal\{F\}^\{\*\}=\\left\\\{\\mathbf\{f\}\(\\mathbf\{x\}\):\\mathbf\{x\}\\in\\mathcal\{P\}^\{\*\}\\right\\\}\.
The necessary conditions for Pareto stationarity\-meaning that any sufficiently small perturbation in the decision space around a stationary point𝐱\\mathbf\{x\}yields an equivalent or dominated solution\-are given for unconstrained multi\-objective optimisation problems in\[[17](https://arxiv.org/html/2606.06984#bib.bib2)\]as:
𝜸⋅∇𝐟\(𝐱\)=∑i=1Mγi∇fi\(𝐱\)=0,\\displaystyle\\bm\{\\gamma\}\\cdot\\nabla\\mathbf\{f\}\(\\mathbf\{x\}\)=\\sum\_\{i=1\}^\{M\}\\gamma\_\{i\}\\nabla f\_\{i\}\(\\mathbf\{x\}\)=0,\(3\)where𝜸=\(γ1,…,γM\)⊤\\bm\{\\gamma\}=\(\\gamma\_\{1\},\\dots,\\gamma\_\{M\}\)^\{\\top\}denotes a vector of convex combination coefficients withγi∈\[0,1\]\\gamma\_\{i\}\\in\[0,1\]and∑iγi=1\\sum\_\{i\}\\gamma\_\{i\}=1\. These conditions, commonly referred to as the Fritz–John \(FJ\) conditions, imply that at a Pareto stationary point, either all objectives are simultaneously minimised or their improvement directions are in conflict such that the weighted sum of their gradients cancels out, leaving no feasible descent direction that improves all objectives simultaneously\.
Importantly, the FJ conditions are*necessary but not sufficient*for global Pareto optimality\. Sufficiency holds only under additional assumptions; see, for instance, Censor\[[2](https://arxiv.org/html/2606.06984#bib.bib3)\]\. Consequently, the condition may also be satisfied at points that are locally efficient yet globally dominated\. Hence, the FJ condition alone guarantees only*local*Pareto efficiency rather than global optimality\.
### 2\.2Gaussian Process Regression
Gaussian Process \(GP\) models provide a flexible Bayesian framework for regression, offering closed\-form posterior predictive distributions that quantify both expected objective values and the associated epistemic uncertainty\. The predictive mean represents the model’s current best estimate of the objective, while the predictive variance reflects the local scarcity of observations\. This dual representation naturally underpins the exploration–exploitation trade\-off, making GPs a powerful choice for surrogate\-assisted optimisation\[[6](https://arxiv.org/html/2606.06984#bib.bib8)\]\.
For each objectiveii, we consider a dataset𝒟i=\{\(𝐱t,yt=fi\(𝐱t\)\)\}t=1T\\mathcal\{D\}\_\{i\}=\\\{\(\\mathbf\{x\}^\{t\},y^\{t\}=f\_\{i\}\(\\mathbf\{x\}^\{t\}\)\)\\\}\_\{t=1\}^\{T\}of sizeTT\. A GP equipped with hyperparameters𝜽i∗\\bm\{\\theta\}\_\{i\}^\{\*\}yields the Gaussian predictive distribution
f^i\(𝐱\)∼p\(fi\(𝐱\)∣𝒟i,𝜽i∗\)=𝒩\(μi\(𝐱\),σi2\(𝐱\)\)\.\\displaystyle\\hat\{f\}\_\{i\}\(\\mathbf\{x\}\)\\sim p\(f\_\{i\}\(\\mathbf\{x\}\)\\mid\\mathcal\{D\}\_\{i\},\\bm\{\\theta\}\_\{i\}^\{\*\}\)=\\mathcal\{N\}\\\!\\left\(\\mu\_\{i\}\(\\mathbf\{x\}\),\\,\\sigma\_\{i\}^\{2\}\(\\mathbf\{x\}\)\\right\)\.\(4\)The predictive moments\[[20](https://arxiv.org/html/2606.06984#bib.bib9)\]are given by
μi\(𝐱\)\\displaystyle\\mu\_\{i\}\(\\mathbf\{x\}\)=κ\(𝐱,X\)K−1𝐲i,\\displaystyle=\\kappa\(\\mathbf\{x\},X\)K^\{\-1\}\\mathbf\{y\}\_\{i\},\(5\)σi2\(𝐱\)\\displaystyle\\sigma\_\{i\}^\{2\}\(\\mathbf\{x\}\)=κ\(𝐱,𝐱\)−κ\(𝐱,X\)K−1κ\(X,𝐱\),\\displaystyle=\\kappa\(\\mathbf\{x\},\\mathbf\{x\}\)\-\\kappa\(\\mathbf\{x\},X\)K^\{\-1\}\\kappa\(X,\\mathbf\{x\}\),\(6\)whereX∈ℝT×nX\\in\\mathbb\{R\}^\{T\\times n\}contains the input locations,𝐲i∈ℝT\\mathbf\{y\}\_\{i\}\\in\\mathbb\{R\}^\{T\}the corresponding objective evaluations, andKKis the covariance matrix induced by the kernelκ\(⋅,⋅\|𝜽i∗\)\\kappa\(\\cdot,\\cdot\\,\|\\,\\bm\{\\theta\}\_\{i\}^\{\*\}\)\. The vectorκ\(𝐱,X\)\\kappa\(\\mathbf\{x\},X\)denotes the covariances between the query point𝐱\\mathbf\{x\}and all observations\.
The kernel plays a central role in encoding smoothness and structural assumptions about the underlying objective\. In this work, we employ the Matérn\-5/25/2kernel, which has been recommended for real\-world optimisation tasks\[[24](https://arxiv.org/html/2606.06984#bib.bib11)\]\. Hyperparameters𝜽i∗\\bm\{\\theta\}\_\{i\}^\{\*\}are inferred via maximum likelihood using the L\-BFGS optimiser\[[15](https://arxiv.org/html/2606.06984#bib.bib35)\]with ten random restarts; see\[[11](https://arxiv.org/html/2606.06984#bib.bib12)\]for details\.
Because the predictive distribution is Gaussian, the expected gradient offif\_\{i\}is simply the gradient of its predictive mean\. Differentiating yields\[[25](https://arxiv.org/html/2606.06984#bib.bib10)\]:
𝔼\[∇f^i\(𝐱\)\]=\(∂μi\(𝐱\)∂xj\|j=1,…,n\)⊤=\(∂κ\(𝐱,X\)∂xjK−1𝐲i\|j=1,…,n\)⊤\.\\displaystyle\\mathbb\{E\}\\left\[\\nabla\\hat\{f\}\_\{i\}\(\\mathbf\{x\}\)\\right\]=\\left\(\\frac\{\\partial\\mu\_\{i\}\(\\mathbf\{x\}\)\}\{\\partial x\_\{j\}\}\\;\\bigg\|\\;j=1,\\dots,n\\right\)^\{\\top\}=\\left\(\\frac\{\\partial\\kappa\(\\mathbf\{x\},X\)\}\{\\partial x\_\{j\}\}K^\{\-1\}\\mathbf\{y\}\_\{i\}\\;\\bigg\|\\;j=1,\\dots,n\\right\)^\{\\top\}\.\(7\)
Figure[1](https://arxiv.org/html/2606.06984#S2.F1)visualises these quantities for a GP model of the Sphere function, illustrating how the GP predictive gradient correctly identifies the stationary point at the true optimum with a small dataset of size five\.
Figure 1:Illustration of the predictive mean, uncertainty, and expected gradient of a GP model trained on the Sphere functionfi\(x\)=x2f\_\{i\}\(x\)=x^\{2\}\. The predictive mean \(blue dashed line\) and uncertainty band \(light blue\) are shown alongside the expected gradient \(solid black line\), which correctly vanishes at the true minimum \(green dashed line\)\. The horizontal red dashed line indicates the zero‑gradient target,𝔼\[∇fi\(𝐱\)\]=0\\mathbb\{E\}\[\\nabla f\_\{i\}\(\\mathbf\{x\}\)\]=0\. Training data are indicated by grey crosses\.Although we assume noise\-free observations in this example, GP regression readily accommodates noisy data through an additive Gaussian noise term\. For instance, with homoscedastic noise of varianceσnoise2\\sigma\_\{noise\}^\{2\}, the predictive variance becomesσi2\(𝐱\)\+σnoise2\\sigma\_\{i\}^\{2\}\(\\mathbf\{x\}\)\+\\sigma\_\{noise\}^\{2\}\.
A standard strategy for constructing multi\-objective surrogate models is to assume conditional independence between objectives and to model each objective with an individual GP\. Under this assumption, the joint predictive model factorises across objectives, yielding:
𝐟^\(𝐱\)∣𝓓∼∏i=1Mp\(fi\(𝐱\)\)=∏i=1M𝒩\(μi\(𝐱\),σi2\(𝐱\)\),\\displaystyle\\hat\{\\mathbf\{f\}\}\(\\mathbf\{x\}\)\\mid\\bm\{\\mathcal\{D\}\}\\;\\sim\\;\\prod\_\{i=1\}^\{M\}p\\\!\\left\(f\_\{i\}\(\\mathbf\{x\}\)\\right\)=\\prod\_\{i=1\}^\{M\}\\mathcal\{N\}\\\!\\left\(\\mu\_\{i\}\(\\mathbf\{x\}\),\\,\\sigma\_\{i\}^\{2\}\(\\mathbf\{x\}\)\\right\),\(8\)where𝓓=\{X,Y\}\\bm\{\\mathcal\{D\}\}=\\\{X,Y\\\}denotes the training dataset, consisting of the input matrixXXand the corresponding objective evaluation matrixY=\(𝐲1,…,𝐲M\)⊤Y=\(\\mathbf\{y\}\_\{1\},\\dots,\\mathbf\{y\}\_\{M\}\)^\{\\top\}\.
We adopt this independent multi\-surrogate formulation as the foundation of our Bayesian multi\-objective optimisation framework\. This choice is supported by prior studies showing that independent per\-objective GPs generally outperform mono\-surrogate approaches, in which a scalarised version of the multi\-objective problem is modelled using a single GP\[[19](https://arxiv.org/html/2606.06984#bib.bib17),[4](https://arxiv.org/html/2606.06984#bib.bib18),[3](https://arxiv.org/html/2606.06984#bib.bib34)\]\.
### 2\.3Multi\-Objective Bayesian Optimisation
Multi\-objective Bayesian Optimisation \(MOBO\) has seen significant growth in the past two decades, with early influential contributions appearing around 2006\[[9](https://arxiv.org/html/2606.06984#bib.bib14),[14](https://arxiv.org/html/2606.06984#bib.bib15),[13](https://arxiv.org/html/2606.06984#bib.bib16)\]\. This rise has been driven by the need to efficiently approximate the Pareto set of expensive black\-box problems\.
The general MOBO workflow is conceptually straightforward: an initial dataset is generated using a space\-filling design such as Latin Hypercube Sampling\[[16](https://arxiv.org/html/2606.06984#bib.bib19)\], producing a set of candidate solutionsXXand their corresponding objective evaluationsYY\. A mono or multi\-surrogate GP model is then trained on this dataset, yielding a multivariate predictive distribution over the objectives\. An acquisition function is employed to select the next candidate solution expected to improve the current approximation of the Pareto set, trading off exploitation \(via the predictive mean\) and exploration \(via predictive uncertainty\)\. The selected solution is evaluated using the true expensive objectives, added to the dataset, and the surrogate is retrained\. This iterative procedure continues until the evaluation budget is exhausted\. A high\-level outline of the method is provided in the[supplementary material](https://doi.org/10.5281/zenodo.19649281)\.
Acquisition functions in this setting are typically constructed from a Pareto\-compliant utility indicator, meaning that any newly identified non\-dominated solution yields a strictly better indicator value relative to the current approximation of the Pareto set𝒫~\\tilde\{\\mathcal\{P\}\}\. Given such an indicator, the standard BO approach computes its expected improvement by integrating the utility over the predictive distribution in the objective space, unless the function already accounts for the predictive distribution\. The aim of such approaches is to yield an acquisition value that balances both the predicted objective performance and the associated epistemic uncertainty\.
In this study, we employ a set of widely used acquisition functions\-namely, Expected Hypervolume Improvement \(EHVI\)\[[9](https://arxiv.org/html/2606.06984#bib.bib14)\], Augmented Tchebycheff decomposition \(AugTch\)\[[4](https://arxiv.org/html/2606.06984#bib.bib18)\], the Transformed Minimum Probability of Dominance \(tMPoI\)\[[19](https://arxiv.org/html/2606.06984#bib.bib17)\], and the Summary Attainment Front distance \(SAF\)\[[1](https://arxiv.org/html/2606.06984#bib.bib25),[26](https://arxiv.org/html/2606.06984#bib.bib24)\]– as baseline acquisition functions for acceleration\. Detailed descriptions of these acquisition functions are provided in the[supplementary material](https://doi.org/10.5281/zenodo.19649281)\.
## 3Proposed Catalytic Framework
The overarching goal of this work is to identify the global Pareto set as efficiently as possible under a limited budget of expensive function evaluations\. In particular, we investigate whether predictive gradients obtained from multivariate GP surrogate models can be used to enhance existing Pareto\-compliant acquisition functions by acting as a local catalyst that focuses the search\.
According to the FJ optimality conditions for Pareto stationarity \(see Equation \([3](https://arxiv.org/html/2606.06984#S2.E3)\)\), points in the decision space where a suitable convex combination of objective gradients vanishes correspond to locally or globally Pareto‑stationary solutions\. When employing a multivariate GP surrogate, these conditions can be approximated in expectation through the predictive gradients of the surrogate model,
𝜸⋅∇𝐟^\(𝐱\)=𝜸⋅\(𝔼\[∇f^i\(𝐱\)\]\|i=1,…,M\)\.\\displaystyle\\bm\{\\gamma\}\\cdot\\nabla\\hat\{\\mathbf\{f\}\}\(\\mathbf\{x\}\)=\\bm\{\\gamma\}\\cdot\\bigl\(\\mathbb\{E\}\[\\nabla\\hat\{f\}\_\{i\}\(\\mathbf\{x\}\)\]\\;\\big\|\\;i=1,\\dots,M\\bigr\)\.\(9\)
Since the FJ conditions are necessary but not sufficient for Pareto optimality, they do not distinguish between local and global Pareto sets\. Nevertheless, vanishing predictive gradients are highly informative within local neighbourhoods of the decision space\. This motivates their use as a*catalyst*for Pareto‑compliant acquisition functions: locally inefficient points can be filtered out, while solutions satisfying∥𝜸⋅∇𝐟^\(𝐱⋆\)∥=0\\lVert\\bm\{\\gamma\}\\cdot\\nabla\\hat\{\\mathbf\{f\}\}\(\\mathbf\{x\}^\{\\star\}\)\\rVert=0are identified as locally Pareto‑efficient\.
By contrast, Pareto‑compliant acquisition functions, given sufficiently accurate surrogate models, can discriminate between local and global Pareto sets\. Our framework therefore, combines both criteria: the acquisition function provides global guidance, while predictive\-gradient information focuses the search locally\.
We formalise this as a bi‑objective decision problem,
maxg1\(𝐱\)\\displaystyle\\max\\;g\_\{1\}\(\\mathbf\{x\}\)≡maxα\(𝐱\),\\displaystyle\\equiv\\max\\;\\alpha\(\\mathbf\{x\}\),\(10\)ming2\(𝐱\)\\displaystyle\\min\\;g\_\{2\}\(\\mathbf\{x\}\)≡min∥𝜸⋅∇𝐟^\(𝐱\)∥,\\displaystyle\\equiv\\min\\;\\lVert\\bm\{\\gamma\}\\cdot\\nabla\\hat\{\\mathbf\{f\}\}\(\\mathbf\{x\}\)\\rVert,\(11\)whereα\(𝐱\)\\alpha\(\\mathbf\{x\}\)denotes a Pareto‑compliant acquisition function\. Assuming equal importance of global and local criteria, we combine them using an augmented Tchebycheff scalarisation\[[14](https://arxiv.org/html/2606.06984#bib.bib15)\],
αCAT\(𝐱\)=max\(g1\(𝐱\)2,−g2\(𝐱\)2\)−ρ\(g1\(𝐱\)2−g2\(𝐱\)2\),\\displaystyle\\alpha\_\{\\mathrm\{CAT\}\}\(\\mathbf\{x\}\)=\\max\\\!\\left\(\\frac\{g\_\{1\}\(\\mathbf\{x\}\)\}\{2\},\-\\frac\{g\_\{2\}\(\\mathbf\{x\}\)\}\{2\}\\right\)\-\\rho\\\!\\left\(\\frac\{g\_\{1\}\(\\mathbf\{x\}\)\}\{2\}\-\\frac\{g\_\{2\}\(\\mathbf\{x\}\)\}\{2\}\\right\),\(12\)where the negation converts the minimisation ofg2g\_\{2\}into a maximisation problem\. This formulation penalises candidate solutions that are inferior with respect to either global or local considerations—ρ\\rhois a tiny positive penalty parameter which effectively enables discrimination between dominating and equivalent points under the two objectives\. An adaptation of weights betweeng1g\_\{1\}andg2g\_\{2\}can be explored in the future\.
### 3\.1Multiple‑Gradient Descent Algorithm\-Based Catalyst \(MGDA\)
In the first strategy, the weight vector𝜸\\bm\{\\gamma\}is determined adaptively using the Multiple‑Gradient Descent Algorithm \(MGDA\)\[[22](https://arxiv.org/html/2606.06984#bib.bib43),[8](https://arxiv.org/html/2606.06984#bib.bib4)\]\. MGDA identifies, at a given point𝐱\\mathbf\{x\}, the convex combination of objective gradients with minimum Euclidean norm,
𝜸⋆=argminγi≥0,∑iγi=1‖∑i=1Mγi∇fi\(𝐱\)‖2\.\\displaystyle\\bm\{\\gamma\}^\{\\star\}=\\arg\\min\_\{\\gamma\_\{i\}\\geq 0,\\;\\sum\_\{i\}\\gamma\_\{i\}=1\}\\left\\lVert\\sum\_\{i=1\}^\{M\}\\gamma\_\{i\}\\nabla f\_\{i\}\(\\mathbf\{x\}\)\\right\\rVert^\{2\}\.\(13\)
This optimisation admits a closed‑form solution,𝜸⋆=\(𝐇−1𝟏\)/\(𝟏⊤𝐇−1𝟏\),\\bm\{\\gamma\}^\{\\star\}=\(\\mathbf\{H\}^\{\-1\}\\mathbf\{1\}\)/\(\\mathbf\{1\}^\{\\top\}\\mathbf\{H\}^\{\-1\}\\mathbf\{1\}\),where𝐇=𝐆⊤𝐆\\mathbf\{H\}=\\mathbf\{G\}^\{\\top\}\\mathbf\{G\}is the Gram matrix of the Jacobian𝐆:=∇𝐟\(𝐱\)\\mathbf\{G\}:=\\nabla\\mathbf\{f\}\(\\mathbf\{x\}\)\. The resulting direction satisfies the FJ stationarity conditions and adapts optimally to the local objective geometry\.
### 3\.2Predefined Weight Vectors \(W\)
In the second strategy, we discretise the objective space𝐟\(𝐱\)\\mathbf\{f\}\(\\mathbf\{x\}\)using predefined weight vectors as in\[[5](https://arxiv.org/html/2606.06984#bib.bib1)\], after applying min\-max normalisation computed from the observed objective valuesYY\. This approach restricts the search to a coarse set of directions and can improve robustness and focus under a limited evaluation budget\.
A limitation of this strategy is that predefined weight vectors are not guaranteed to satisfy the FJ conditions, as the chosen directions may not align with the locally optimal convex combination of gradients\. Nevertheless, for expensive optimisation problems, this restriction can be advantageous by prioritising targeted exploration of promising trade\-off regions rather than exploring all directions simultaneously\.
Figure[2](https://arxiv.org/html/2606.06984#S3.F2)illustrates the function landscapes induced by the proposed catalytic strategies for an illustrative bi‑objective problem\.
\(a\)MGDA \(true landscape\)
\(b\)MGDA \(GP approximation\)
\(c\)Predefined weights𝜸=\(0\.9,0\.1\)⊤\\bm\{\\gamma\}=\(0\.9,0\.1\)^\{\\top\}\(true landscape\)
\(d\)Predefined weights𝜸=\(0\.9,0\.1\)⊤\\bm\{\\gamma\}=\(0\.9,0\.1\)^\{\\top\}\(GP approximation\)
Figure 2:Gradient‑magnitude landscapes for MGDA \(top\) and a predefined weight vector𝜸=\(0\.9,0\.1\)⊤\\bm\{\\gamma\}=\(0\.9,0\.1\)^\{\\top\}\(bottom\) on the bi‑objective problem𝐟\(𝐱\)=\(x14−2x12\+2x22\+1,\(x1\+0\.5\)2\+\(x2−2\)2\)\\mathbf\{f\}\(\\mathbf\{x\}\)=\\bigl\(x\_\{1\}^\{4\}\-2x\_\{1\}^\{2\}\+2x\_\{2\}^\{2\}\+1,\\,\(x\_\{1\}\+0\.5\)^\{2\}\+\(x\_\{2\}\-2\)^\{2\}\\bigr\)\. Red ellipses mark the global Pareto set, while \(near‑\)zero regions \(purple\) outside this set indicate local Pareto‑stationary regions, which are not guaranteed to satisfy FJ conditions for predefined weights\. MGDA captures stationarity along a complete Pareto set, whereas the predefined weighting highlights a specific trade‑off region\. GP surrogates trained with only 20 initial samples \(black crosses\) closely approximate the true landscapes\.
### 3\.3Limitations
The proposed catalytic framework relies on the accuracy of surrogate predictive gradients, which may be unreliable in regions with sparse data or poor model fit\. Moreover, the MGDA\-based variant can dilute the search effort under strict evaluation budgets, while the predefined weight approach trades FJ stationarity guarantees for increased robustness\. Despite these limitations, combining local predictive\-gradient information with global Pareto\-compliant acquisition functions offers a practical balance between exploration and exploitation in expensive multi‑objective optimisation, particularly as predictive gradients are smooth and enable additional neighbourhood‑level exploitation\.
## 4Experimental Settings
We evaluate all proposed methods on the DTLZ benchmark suite\[[7](https://arxiv.org/html/2606.06984#bib.bib22)\], considering all seven problems \(DTLZ1–DTLZ7\) with two objectives and a ten\-dimensional decision space\. These problems are widely used to assess convergence, diversity, scalability, and robustness in multi\-objective optimisation\.
Each run starts from an initial design of sizeI=10nI=10nand is performed under a total evaluation budget ofB=30nB=30n\. For each problem–method pair, we conduct 11 independent optimisation runs\. To enable paired statistical comparisons, all methods share identical initial designs within each run, ensuring that observed performance differences are attributable to the optimisation strategy rather than stochastic sampling effects\.
Acquisition functions are optimised using Bi\-POP CMA\-ES\[[12](https://arxiv.org/html/2606.06984#bib.bib42)\]with a budget of5000n5000nevaluations; W uses10M10Mpredefined weight vectors using\[[5](https://arxiv.org/html/2606.06984#bib.bib1)\]\. For AugTch, two parameters are set as\|𝝀\|=10M=20\|\\bm\{\\lambda\}\|=10M=20andρ=0\.05\\rho=0\.05\. Performance is measured using the final hypervolume indicator with respect to predefined reference points; further details are provided in the[supplementary material](https://doi.org/10.5281/zenodo.19649281)\. Statistical significance is assessed via one\-tailed Wilcoxon signed\-rank tests evaluating directional improvement across paired runs, with Bonferroni correction yielding a corrected significance level ofα=0\.05/4\\alpha=0\.05/4\.
To assess surrogate model quality independently of optimisation outcomes, we apply Pareto\-front\-only leave\-one\-out cross\-validation \(PF\-LOOCV\)\. In MOBO, the key utility of a surrogate is its ability to identify potentially non\-dominated points—predictive performance elsewhere is not necessarily reflective of optimisation quality\. Accuracy is quantified using the standardised log loss \(SLL\)\[[20](https://arxiv.org/html/2606.06984#bib.bib9)\], defined relative to a naïve Gaussian baseline predicting the empirical mean\. An SLL of zero corresponds to baseline performance, negative values indicate improvement, and positive values indicate degradation\. We report the median of the median SLL across validation folds together with the median absolute deviation \(MAD\) across runs, summarising both accuracy and variability in surrogate performance\.
## 5Results and Analysis
We begin by analysing the pairwise significance map shown in Figure[3](https://arxiv.org/html/2606.06984#S5.F3), which summarises the effect of acceleration strategies across all problems\. The experiments cover the seven DTLZ problems and four baseline acquisition functions: EHVI, AugTch, tMPoI, and SAF\. Two classes of acceleration mechanisms \(referred to as*catalysts*\) are considered: MGDA\-based acceleration and predefined weighted\-gradient acceleration \(W\), yielding a total of eight algorithmic variants\.
Figure 3:Pairwise significance map summarising the acceleration performance across all benchmark problems\. Within each method group the baseline \(B\) is compared to the catalyst versions and the catalysts to each other\. Specifically the three columns from left to right correspond to Bonferroni\-corrected \(α=0\.05/4\\alpha=0\.05/4\) one\-tailed Wilcoxon signed\-rank tests for \(i\) B→\\rightarrowMGDA, \(ii\) B→\\rightarrowW, and \(iii\) MGDA→\\rightarrowW\. Green denotes a significant improvement of the challenger \(right of the arrow\) over the reference method \(left of the arrow\); Red indicates significant degradation\. Grey marks the equivalence region, where either directional hypothesis is rejected, and the methods cannot be distinguished statistically\.Across the seven problems and eight catalyst–baseline pairings, this results in 56 comparisons\. Additionally, we show the 28 comparisons between the catalyst\-pairs across each problem and acquisition function combination\. In Figure[3](https://arxiv.org/html/2606.06984#S5.F3), green cells indicate that acceleration significantly outperformed the corresponding baseline, red cells indicate degradation relative to the baseline, and grey cells denote no statistically significant difference\. Of the 56 cases, 18 are green, 16 are red, and 22 are grey, excluding the comparisons between the catalyst variants MGDA and W\. Thus, in approximately 32% of cases acceleration leads to statistically significant improvements, while in a further 40% of cases the accelerated and baseline methods behave equivalently\. These results indicate that using a catalyst is generally beneficial, as only 28% of cases result in degradation\.
This motivates an examination of whether the quality of the surrogate model helps explain when acceleration is beneficial\. To this end, Tables[1](https://arxiv.org/html/2606.06984#S5.T1)–[4](https://arxiv.org/html/2606.06984#S5.T4)report, for each problem, method, and objective, the median of medians of the SLLs, together with its MAD\.
Table 1:Median\-of\-median\-SLLs and MAD for objectivesf1f\_\{1\}andf2f\_\{2\}for problems DTLZ1, DTLZ2\.Table 2:Details as for Table[1](https://arxiv.org/html/2606.06984#S5.T1), DTLZ3 and DTLZ4 problems\.Table 3:Details as for Table[1](https://arxiv.org/html/2606.06984#S5.T1), DTLZ5 and DTLZ6 problems\.Table 4:Details as for Table[1](https://arxiv.org/html/2606.06984#S5.T1), DTLZ7 problem\.##### DTLZ1 and DTLZ3\.
For DTLZ1, Figure[3](https://arxiv.org/html/2606.06984#S5.F3)shows all grey cells, indicating that acceleration does not substantially affect performance\. This aligns with the surrogate quality metrics: for both objectives, median SLL values are only modestly negative, while MAD values are sufficiently large to span both negative and positive values\. This indicates inconsistent and only moderately informative surrogate models\. A similar pattern is observed for DTLZ3, where surrogate quality is comparable, with median SLLs near zero and large MADs frequently extending into positive values\. Consequently, most comparisons remain grey, suggesting that poor or highly variable surrogate quality limits the effectiveness of acceleration\.
##### DTLZ2\.
In contrast, DTLZ2 shows widespread green cells in Figure[3](https://arxiv.org/html/2606.06984#S5.F3), indicating a consistent improvement under acceleration\. This coincides with excellent surrogate quality: median SLL values are strongly negative \(often approaching−10\-10\), MAD values are small, and both objectives are predicted reliably across runs\. These results indicate that high\-quality surrogates can effectively support accelerated acquisition by accurately identifying Pareto\-relevant regions\.
##### DTLZ4\.
Results for DTLZ4 are mixed\. In several cases, most notably SAF combined with MGDA acceleration, the quality of the surrogate model for the first objective is extremely poor, with median SLL values exceeding\+30\+30\. Such values indicate models that are effectively uninformative\. In these cases, acceleration is detrimental, as reflected by the red cells in Figure[3](https://arxiv.org/html/2606.06984#S5.F3)\. More broadly, DTLZ4 is characterised by large MADs for the first objective, leading to unstable surrogate performance across runs and limiting the reliability of acceleration\.
##### DTLZ5\.
For DTLZ5, the quality of the surrogate model is consistently strong in both objectives\. Median SLL values are typically around−10\-10, with small MADs\. Correspondingly, acceleration improves performance in most cases, as shown in Figure[3](https://arxiv.org/html/2606.06984#S5.F3)\. The only notable exception occurs when SAF is combined with MGDA, where the baseline already performs competitively\. In general, robust and stable surrogate models enable effective acceleration for this problem\.
##### DTLZ6\.
In DTLZ6, surrogate quality is moderate but inconsistent\. Median SLL values tend to cluster around−2\-2, while MADs are often of comparable magnitude\. In some cases, particularly for the second objective, median SLL values become positive\. This instability aligns with the predominance of red cells in Figure[3](https://arxiv.org/html/2606.06984#S5.F3), indicating that acceleration often degrades performance\. Notably, only tMPoI combined with predefined weights yields consistent improvement, suggesting that careful choice of acceleration strategy can partially mitigate imperfect surrogate quality\.
##### DTLZ7\.
Finally, DTLZ7 presents an asymmetric situation\. The first objective is modelled extremely well, with strongly negative median SLL values and small MADs\. In contrast, the second objective model often exhibits larger MADs and occasional positive median SLLs\. As a result, acceleration yields modest improvements relative to DTLZ6, where both objective surrogates were poor—particularly for the weighted variants of EHVI, tMPoI, and SAF\.
##### Stationarity analysis\.
To further interpret these results, we examine the stationarity assumption that underpins the stationary kernels used in the surrogate models\. Although all DTLZ problems are continuous and differentiable, stationarity additionally requires that local perturbations in the decision space induce similar variations in the objective space, irrespective of location\.
To test this assumption, we sample 1000 random ten\-dimensional hyper\-boxes in the decision space, each spanning 5% of the domain per dimension and populated with 1000 Latin\-hypercube samples\. For each box, we compute the local output variance for both objectives\. The resulting distributions, shown in Figure[4](https://arxiv.org/html/2606.06984#S5.F4), reveal pronounced differences across problems\.
DTLZ2 and DTLZ5 exhibit uniformly low local variances across the domain, remaining below a variance threshold of10−210^\{\-2\}and indicating strong stationarity, which we attribute as a key reason for the superior performance of the catalytic framework\. In contrast, DTLZ1 and DTLZ3 show substantial variation in local variance, confirming pronounced non\-stationarity, and therefore, the catalytic framework offers limited benefit\. DTLZ4 displays frequent outliers beyond the threshold—particularly for the first objective—consistent with its mixed performance under acceleration\. DTLZ6 similarly exhibits intermittent large local variances for both objectives, resulting in poor performance of acceleration strategies\. Finally, DTLZ7 shows near\-stationary behaviour in the first objective but borderline or non\-stationary behaviour in the second, explaining its intermediate performance\.
Figure 4:Local variance of objectivesf1f\_\{1\}\(top\) andf2f\_\{2\}\(bottom\), computed from10001000randomly placed local neighbourhoods in a1010‑dimensional decision space and their corresponding objective values\. Each neighbourhood spans5%5\\%of the domain per dimension and contains10001000Latin–hypercube samples\. The dashed horizontal line indicates variance=10−2=10^\{\-2\}, which may be regarded as a small variance threshold suggestive of approximate stationarity over the domain\. DTLZ2 and DTLZ5 exhibit the most stationary behaviour, while DTLZ1 and DTLZ3 show clear non‑stationarity\. For DTLZ4, several outliers exceed the threshold, indicating locally sharp variations despite otherwise smooth behaviour\. In DTLZ7, the first objective appears largely stationary, whereas the second objective lies closer to the boundary of the threshold\. DTLZ6 exhibits regions with pronounced jumps in variance relative to the rest of the domain\.Overall, when a problem is stationary and consistent with GP kernel assumptions, surrogate\-driven acceleration is almost always beneficial: W is always superior to the baseline, while MGDA is outperformed by or equivalent to the baseline only with SAF \(in DTLZ5 and DTLZ2 respectively\)\. It should be noted that only two of the seven problems tested exhibit stationary behaviour\.
Finally, we note an important distinction between the two catalytic approaches, MGDA and W\. Across the 28 pairwise comparisons between these variants, only four cases favour MGDA, while seven cases result in statistical equivalence \(see Figure[3](https://arxiv.org/html/2606.06984#S5.F3)\)\. In the remaining comparisons, W is the superior performer, indicating a clear overall advantage\.
From a conceptual perspective, MGDA can be interpreted as allowing an effectively infinite number of descent directions, with an optimal direction identified locally for each solution\. In contrast, the W variant operates on a coarse, finite set of predefined directions\. This reduced granularity may be beneficial in the context of expensive optimisation problems, because of the limited budget on the number of function evaluations, where focusing the search on a limited subset of promising regions can be more effective than attempting to finely approximate the entire global Pareto set\.
The experiments are limited to the DTLZ benchmark suite with two objectives and ten decision variables\. The framework should nonetheless scale well to higher\-dimensional settings, as predictive gradients are analytically derived from independent GP models and the MGDA solver operates on the Gram matrix of objective gradients, keeping computational overhead modest, even for tens of objectives\. Whether the stationarity observations generalise to higher\-dimensional problems remains an open question for future work\.
## 6Conclusion
This paper introduced a predictive‑gradient catalytic framework for accelerating multi\-objective Bayesian optimisation under limited evaluation budgets\. By exploiting surrogate predictive gradients as auxiliary signals, the proposed approach augments standard Pareto\-compliant acquisition functions with local stationarity information, without altering their underlying structure\. Two catalyst instantiations were investigated: an adaptive MGDA\-based strategy grounded in Pareto stationarity theory, and a predefined\-weight variant that enables robust, focused exploration when budgets are tight\. Experiments on the DTLZ benchmark suite show that predictive‑gradient catalysis can deliver significant acceleration when surrogate models are accurate, particularly for stationary problems\. These results highlight surrogate\-derived gradients as a promising mechanism for improving efficiency in expensive multi\-objective optimisation and motivate extensions to constrained and noisy settings\. Future works include testing on high dimensional problems, both in objective and decision spaces and adapting the weights \(e\.g\. taking a similar approach to\[[21](https://arxiv.org/html/2606.06984#bib.bib41)\]\) for the acquisition function and predictive gradients in Equation[12](https://arxiv.org/html/2606.06984#S3.E12)\.
\{credits\}
#### 6\.0\.1\\discintname
The authors have no competing interests to declare that are relevant to the content of this article\.
#### 6\.0\.2Acknowledgements
We thank the organisers of the Dagstuhl Seminar*Multi\-objective Optimization: Theory, Methods and Applications*\([Seminar 23361](https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/23361)\) for creating a stimulating environment that influenced this work\. We are particularly grateful to Kaisa Miettinen for insightful discussions and feedback, and to Tea Tus̆ar, Dimo Brockhoff, and Pascal Kerschke for valuable conversations\. Rahat acknowledges support from the Engineering and Physical Sciences Research Council \[grant number EP/W01226X/1\]\.
## References
- \[1\]D\. C\. Bautista\(2009\)A Sequential Design for Approximating the Pareto Front Using the Expected Pareto Improvement Function\.The Ohio State University\.Cited by:[§2\.3](https://arxiv.org/html/2606.06984#S2.SS3.p4.1)\.
- \[2\]Y\. Censor\(1977\)Pareto Optimality in Multiobjective Problems\.Applied Mathematics and Optimization4\(1\),pp\. 41–59\.Cited by:[§2\.1](https://arxiv.org/html/2606.06984#S2.SS1.p6.1)\.
- \[3\]T\. Chugh\(2020\)Scalarizing Functions in Bayesian Multiobjective Optimization\.In2020 IEEE Congress on Evolutionary Computation \(CEC\),pp\. 1–8\.Cited by:[§2\.2](https://arxiv.org/html/2606.06984#S2.SS2.p8.1)\.
- \[4\]T\. Chugh\(2022\)Mono\-Surrogate vs Multi\-Surrogate in Multi\-Objective Bayesian Optimisation\.InProceedings of the Genetic and Evolutionary Computation Conference Companion,pp\. 2143–2151\.Cited by:[§2\.2](https://arxiv.org/html/2606.06984#S2.SS2.p8.1),[§2\.3](https://arxiv.org/html/2606.06984#S2.SS3.p4.1)\.
- \[5\]I\. Das and J\. E\. Dennis\(1998\)Normal\-Boundary Intersection: A New Method for Generating the Pareto Surface in Nonlinear Multicriteria Optimization Problems\.SIAM Journal on Optimization8\(3\),pp\. 631–657\.Cited by:[§3\.2](https://arxiv.org/html/2606.06984#S3.SS2.p1.2),[§4](https://arxiv.org/html/2606.06984#S4.p3.5)\.
- \[6\]G\. De Ath, R\. M\. Everson, A\. A\. Rahat, and J\. E\. Fieldsend\(2021\)Greed is Good: Exploration and Exploitation Trade\-offs in Bayesian Optimisation\.ACM Transactions on Evolutionary Learning and Optimization1\(1\),pp\. 1–22\.Cited by:[§2\.2](https://arxiv.org/html/2606.06984#S2.SS2.p1.1)\.
- \[7\]K\. Deb, L\. Thiele, M\. Laumanns, and E\. Zitzler\(2005\)Scalable Test Problems for Evolutionary Multiobjective Optimization\.InEvolutionary Multiobjective Optimization: Theoretical Advances and Applications,A\. Abraham, L\. Jain, and R\. Goldberg \(Eds\.\),pp\. 105–145\.Cited by:[§4](https://arxiv.org/html/2606.06984#S4.p1.1)\.
- \[8\]J\. Désidéri\(2012\)Multiple\-Gradient Descent Algorithm for Multiobjective Optimization\.InEuropean Congress on Computational Methods in Applied Sciences and Engineering \(ECCOMAS 2012\),Cited by:[§3\.1](https://arxiv.org/html/2606.06984#S3.SS1.p1.2)\.
- \[9\]M\. T\. Emmerich, K\. C\. Giannakoglou, and B\. Naujoks\(2006\)Single\- and Multiobjective Evolutionary Optimization Assisted by Gaussian Random Field Metamodels\.IEEE Transactions on Evolutionary Computation10\(4\),pp\. 421–439\.Cited by:[§2\.3](https://arxiv.org/html/2606.06984#S2.SS3.p1.1),[§2\.3](https://arxiv.org/html/2606.06984#S2.SS3.p4.1)\.
- \[10\]R\. Garnett\(2023\)Bayesian Optimization\.Cambridge University Press\.Cited by:[§1](https://arxiv.org/html/2606.06984#S1.p1.1)\.
- \[11\]GPy\(since 2012\)GPy: A Gaussian Process Framework in Python\.Note:[http://github\.com/SheffieldML/GPy](http://github.com/SheffieldML/GPy)Cited by:[§2\.2](https://arxiv.org/html/2606.06984#S2.SS2.p3.2)\.
- \[12\]N\. Hansen\(2009\)Benchmarking a BI\-Population CMA\-ES on the BBOB\-2009 Function Testbed\.InProceedings of the 11th Annual Conference Companion on Genetic and Evolutionary Computation Conference: Late Breaking Papers,pp\. 2389–2396\.Cited by:[§4](https://arxiv.org/html/2606.06984#S4.p3.5)\.
- \[13\]A\. J\. Keane\(2006\)Statistical Improvement Criteria for Use in Multiobjective Design Optimization\.AIAA Journal44\(4\),pp\. 879–891\.Cited by:[§2\.3](https://arxiv.org/html/2606.06984#S2.SS3.p1.1)\.
- \[14\]J\. Knowles\(2006\)ParEGO: A Hybrid Algorithm with On\-Line Landscape Approximation for Expensive Multiobjective Optimization Problems\.IEEE Transactions on Evolutionary Computation10\(1\),pp\. 50–66\.Cited by:[§2\.3](https://arxiv.org/html/2606.06984#S2.SS3.p1.1),[§3](https://arxiv.org/html/2606.06984#S3.p5.1)\.
- \[15\]D\. C\. Liu and J\. Nocedal\(1989\)On the Limited Memory BFGS Method for Large Scale Optimization\.Mathematical Programming45\(1\),pp\. 503–528\.Cited by:[§2\.2](https://arxiv.org/html/2606.06984#S2.SS2.p3.2)\.
- \[16\]M\. D\. McKay\(1992\)Latin Hypercube Sampling as a Tool in Uncertainty Analysis of Computer Models\.InProceedings of the 24th Conference on Winter Simulation,pp\. 557–564\.Cited by:[§2\.3](https://arxiv.org/html/2606.06984#S2.SS3.p2.2)\.
- \[17\]K\. Miettinen\(1999\)Nonlinear Multiobjective Optimization\.Vol\.12,Springer Science & Business Media\.Cited by:[§2\.1](https://arxiv.org/html/2606.06984#S2.SS1.p5.1)\.
- \[18\]G\. Perrinet al\.\(2024\)Bayesian Optimization with Derivatives Acceleration\.Transactions on Machine Learning Research\.Cited by:[§1](https://arxiv.org/html/2606.06984#S1.p2.1)\.
- \[19\]A\. A\. Rahat, R\. M\. Everson, and J\. E\. Fieldsend\(2017\)Alternative Infill Strategies for Expensive Multi\-Objective Optimisation\.InProceedings of the Genetic and Evolutionary Computation Conference,pp\. 873–880\.Cited by:[§2\.2](https://arxiv.org/html/2606.06984#S2.SS2.p8.1),[§2\.3](https://arxiv.org/html/2606.06984#S2.SS3.p4.1)\.
- \[20\]C\.E\. Rasmussen and C\.K\.I\. Williams\(2006\)Gaussian Processes for Machine Learning\.MIT Press\.Cited by:[§2\.2](https://arxiv.org/html/2606.06984#S2.SS2.p2.11),[§4](https://arxiv.org/html/2606.06984#S4.p4.1)\.
- \[21\]B\. S\. Saini, H\. K\. Singh, B\. Shavazipour, and K\. Miettinen\(2025\)An Efficient Iterative Approach for Uniformly Representing Pareto Fronts\.InEvolutionary Multi\-Criterion Optimization,H\. Singh, T\. Ray, J\. Knowles, X\. Li, J\. Branke, B\. Wang, and A\. Oyama \(Eds\.\),Singapore,pp\. 241–256\.Cited by:[§6](https://arxiv.org/html/2606.06984#S6.p1.1)\.
- \[22\]S\. Schäffler, R\. Schultz, and K\. Weinzierl\(2002\)Stochastic Method for the Solution of Unconstrained Vector Optimization Problems\.Journal of Optimization Theory and Applications114\(1\),pp\. 209–222\.Cited by:[§3\.1](https://arxiv.org/html/2606.06984#S3.SS1.p1.2)\.
- \[23\]B\. Shahriari, K\. Swersky, Z\. Wang, R\. P\. Adams, and N\. De Freitas\(2015\)Taking the Human out of the Loop: A Review of Bayesian Optimization\.Proceedings of the IEEE104\(1\),pp\. 148–175\.Cited by:[§1](https://arxiv.org/html/2606.06984#S1.p1.1)\.
- \[24\]J\. Snoek, H\. Larochelle, and R\. P\. Adams\(2012\)Practical Bayesian Optimization of Machine Learning Algorithms\.Advances in Neural Information Processing Systems25\.Cited by:[§2\.2](https://arxiv.org/html/2606.06984#S2.SS2.p3.2)\.
- \[25\]E\. Solak, R\. Murray\-Smith, W\.E\. Leithead, D\. Leith, and C\.E\. Rasmussen\(2003\)Derivative Observations in Gaussian Process Models\.InAdvances in Neural Information Processing Systems,Cited by:[§2\.2](https://arxiv.org/html/2606.06984#S2.SS2.p4.1)\.
- \[26\]J\. Svenson and T\. Santner\(2016\)Multiobjective Optimization of Expensive\-to\-Evaluate Deterministic Computer Simulator Models\.Computational Statistics & Data Analysis94,pp\. 250–264\.Cited by:[§2\.3](https://arxiv.org/html/2606.06984#S2.SS3.p4.1)\.
- \[27\]J\. Wu, M\. Poloczek, A\. G\. Wilson, and P\. Frazier\(2017\)Bayesian Optimization with Gradients\.Advances in Neural Information Processing Systems30\.Cited by:[§1](https://arxiv.org/html/2606.06984#S1.p2.1)\.Similar Articles
A Unified Framework for Gradient Aggregation in Multi-Objective Optimization
This paper presents a unified theoretical framework for gradient aggregation in multi-objective optimization, establishing convergence rates to Pareto stationarity. The authors introduce a sufficient alignment condition and demonstrate its application to existing and new algorithms, such as capped MGDA.
Efficient Conditioning Why Pseudo Observation Batch Bayesian Optimization Works When It Does not
This paper provides a unified theoretical framework for pseudo observation batch Bayesian optimization, proving that Gaussian processes produce distinct batch points and that common methods like Constant Liar and Kriging Believer are instances of a single conditioning mechanism. It introduces the Structural Diversity Diagnostic (SDD) for testing surrogate compatibility and validates predictions across multiple benchmark functions and hyperparameter tuning.
Parallel Adaptive Multi-Objective Evolutionary Learning of Discretized Bayesian Network Classifiers for Clinical Data
This paper introduces a parallelization strategy and adaptive steering mechanism for the Baymex algorithm to efficiently learn discretized Bayesian network classifiers for clinical data, achieving speedups over 54x on a 16-core CPU and comparable or better predictive performance than traditional models while maintaining explainability.
Non-Myopic Active Feature Acquisition via Pathwise Policy Gradients
This paper introduces NM-PPG, a non-myopic active feature acquisition method using pathwise policy gradients to optimize sequential feature selection in costly prediction scenarios.
Gradient Extrapolation-Based Policy Optimization
The article introduces Gradient Extrapolation-Based Policy Optimization (GXPO), a method that approximates multi-step lookahead in RL training for LLMs using only three backward passes. It demonstrates improved reasoning performance on math benchmarks over standard GRPO while maintaining fixed active-phase costs.