Gaussian Process Latent Factor Regression for Low-Data, High-Dimensional Output Problems

arXiv cs.LG 06/08/26, 04:00 AM Papers
gaussian-process latent-factor regression low-data high-dimensional exoplanet-climate emulator
Summary
Proposes Gaussian process latent factor regression (GPLFR) for low-data, high-dimensional output problems, demonstrating it with a spatially resolved emulator of global climate models for rocky exoplanets.
arXiv:2606.06576v1 Announce Type: new Abstract: In the sciences, regression tasks often require predicting high-dimensional outputs from few training examples. Multi-output Gaussian processes excel in low-data regimes but typically struggle with high-dimensional outputs. Compress-then-predict pipelines such as PCA-GP (principal component analysis plus Gaussian process regression) handle high dimensionality, but rely on bases optimized for reconstruction rather than prediction. To address this gap, we propose a model that represents each output as a linear-Gaussian decoding of a low-dimensional latent state drawn from a Gaussian process prior. By analytically marginalizing the decoder weights, we couple compression and prediction in a single objective that scales to high-dimensional outputs. We refer to this model as Gaussian process latent factor regression (GPLFR). We demonstrate GPLFR by building the first spatially resolved emulator of global climate models for rocky exoplanets.
# Gaussian Process Latent Factor Regression for Low-Data, High-Dimensional Output Problems
Source: [https://arxiv.org/html/2606.06576](https://arxiv.org/html/2606.06576)
Eric T\. Wolf \(University of Colorado Boulder\), Mei Ting Mak \(University of Oxford\), N\. J\. Mayne \(University of Exeter\), Miles Cranmer \(University of Cambridge\)

###### Abstract

In the sciences, regression tasks often require predicting high\-dimensional outputs from few training examples\. Multi\-output Gaussian processes excel in low\-data regimes but typically struggle with high\-dimensional outputs\. Compress\-then\-predict pipelines such as PCA\-GP \(principal component analysis plus Gaussian process regression\) handle high dimensionality, but rely on bases optimized for reconstruction rather than prediction\. To address this gap, we propose a model that represents each output as a linear\-Gaussian decoding of a low\-dimensional latent state drawn from a Gaussian process prior\. By analytically marginalizing the decoder weights, we couple compression and prediction in a single objective that scales to high\-dimensional outputs\. We refer to this model as Gaussian process latent factor regression \(GPLFR\)\. We demonstrate GPLFR by building the first spatially resolved emulator of global climate models for rocky exoplanets\.

## 1 Introduction

Multi\-output regression can be framed around two broad modeling questions: how do the entries of the output $\mathbf{y}$ relate to one another, and how does the input\-space location $\mathbf{x}$ dictate similarity between examples? With limited data, these two questions compete for modeling capacity\. When outputs are heterogeneously or sparsely observed, sharing information across outputs is essential, and modeling capacity should flow toward output covariance\. When outputs are high\-dimensional and structured, a rich output covariance becomes poorly identified, whereas there is more information to constrain a compressed latent representation: each training example provides many output dimensions as repeated “views” of the same underlying latent state\. Here, capacity should flow toward learning the similarity structure over $\mathbf{x}$\.

Most multi\-output GPs \(MOGPs; “co\-kriging” in geostatistics\) prioritize sharing information across outputs by learning an explicit coupling structure, while using a shared input kernel to control generalization across $\mathbf{x}$\. For example, the popular *linear model of coregionalization* \(LMC\) style MOGPs encode output covariance through linear mixing of shared latent GPs \(see Alvarez et al\., [2012](https://arxiv.org/html/2606.06576#bib.bib1) for a review\)\. Latent\-variable MOGPs \(LV\-MOGPs\) extend this idea by defining similarity between *outputs* via a kernel on per\-output latent embeddings rather than directly parameterizing a coregionalization matrix \[Dai et al\., [2017](https://arxiv.org/html/2606.06576#bib.bib7)\]\. Other approaches in this lineage include higher\-order Gaussian process regression \[Zhe et al\., [2019](https://arxiv.org/html/2606.06576#bib.bib36)\], which models output correlations through learned latent coordinate features in a tensorized output space with Kronecker inference, and Gaussian process regression networks \[Wilson et al\., [2011](https://arxiv.org/html/2606.06576#bib.bib30), Li et al\., [2020](https://arxiv.org/html/2606.06576#bib.bib19)\], which generalize LMC by making the decoder weights input\-dependent GP functions\. While recent work extends these methods to higher output dimensionalities $D_y$ \(e\.g\., Bruinsma et al\., [2020](https://arxiv.org/html/2606.06576#bib.bib4) extend an LMC variant to $D_y \sim 10^{3}$ and Jiang et al\., [2025](https://arxiv.org/html/2606.06576#bib.bib14) extend LV\-MOGPs to $D_y \sim 10^{3}$–$10^{4}$\), they remain ill\-suited to problems where the bottleneck is learning input\-space structure rather than enriching output coupling\.

The standard alternative approach is to decouple the problem into two stages: first learn a basis from outputs alone \(usually via PCA\), then regress the resulting coefficients against $\mathbf{x}$ \[Hutchings et al\., [2025](https://arxiv.org/html/2606.06576#bib.bib12), Holden et al\., [2015](https://arxiv.org/html/2606.06576#bib.bib11), Higdon et al\., [2008](https://arxiv.org/html/2606.06576#bib.bib10), Rougier, [2008](https://arxiv.org/html/2606.06576#bib.bib25)\]\. While this compress\-then\-predict approach handles high $D_y$, its basis is optimized for output reconstruction rather than predictability from the inputs\. In this work, we avoid this mismatch by learning compression and regression jointly in a way that scales to high $D_y$\. \(Footnote 1: We discuss adjacent task\-aware representation\-learning methods in Appendix [A](https://arxiv.org/html/2606.06576#A1)\.\) Concretely, our proposed model – Gaussian process latent factor regression \(GPLFR\) – infers latents that are simultaneously constrained by a GP prior over $\mathbf{x}$ and reconstruction of $\mathbf{y}$ through a linear decoder\.

Figure 1: Probabilistic graphical model of GPLFR\. Shaded nodes are observed\. Node labels: $\boldsymbol{\ell}_q$, $\eta_q$, $z_i^{(q)}$, $\mathbf{x}_i$, $\mathbf{y}_i$, $\mathbf{B}$, $\mathbf{W}$, $\sigma$\. Indices: $q = 1, \dots, D_z$; $i = 1, \dots, N$\. Variables: $\mathbf{x}_i \in \mathbb{R}^{D_x}$: input; $\mathbf{y}_i \in \mathbb{R}^{D_y}$: output; $\boldsymbol{\ell}_q \in \mathbb{R}_+^{D_x}$: kernel lengthscales; $\eta_q \in \mathbb{R}_+$: kernel amplitude; $z_i^{(q)} \in \mathbb{R}$: latent variable; $\mathbf{W} \in \mathbb{R}^{D_y \times D_z}$: decoder weights; $\sigma \in \mathbb{R}_+$: observation noise; $\mathbf{B} \in \mathbb{R}^{D_y \times D_y}$: output coregionalization matrix\.

Although GPLFR’s practical motivation is to provide an end\-to\-end alternative to compress\-then\-predict pipelines \(principally PCA\-GP\), its mathematical structure is most transparently understood through its relationship with the LMC\. We develop these connections in Sections [2](https://arxiv.org/html/2606.06576#S2) and [3](https://arxiv.org/html/2606.06576#S3), then compare GPLFR to PCA\-GP and other baselines on a synthetic benchmark and two scientific emulation tasks: one on biomedical optics and one on the exoplanet climate problem that originally motivated this work \(Section [4](https://arxiv.org/html/2606.06576#S4)\)\. Code is available at [https://github\.com/edstevenson/GPLFR](https://github.com/edstevenson/GPLFR)\.

## 2 GPLFR and the Linear Model of Coregionalization \(LMC\)

We use the LMC as a unifying lens because GPLFR and LMC can be derived from the same linear\-Gaussian latent\-factor model, but correspond to different marginalizations\. In particular, marginalizing the latent factors yields an LMC prior over outputs, whereas marginalizing the decoder weights yields GPLFR’s collapsed likelihood used for joint representation learning and regression\. This perspective also clarifies a modeling choice emphasized in the introduction: GPLFR typically keeps any explicit output coupling \(i\.e\., any output coupling on top of that induced by the latent factors\) simple and instead concentrates modeling capacity on learning a useful similarity structure over $\mathbf{x}$\.

Figure [1](https://arxiv.org/html/2606.06576#S1.F1) shows the probabilistic graphical model underlying GPLFR\. A full model specification is given in Appendix [B](https://arxiv.org/html/2606.06576#A2)\.

##### Notation\.

We consider regression from $D_x$\-dimensional inputs to $D_y$\-dimensional structured outputs with $N$ training examples\. Let $\mathbf{X} \in \mathbb{R}^{N \times D_x}$ denote the training inputs and $\mathbf{Y} \in \mathbb{R}^{N \times D_y}$ the corresponding outputs\. GPLFR introduces $D_z \ll D_y$ latent variables per example, collected as $\mathbf{Z} \in \mathbb{R}^{N \times D_z}$, and a linear decoder $\mathbf{W} \in \mathbb{R}^{D_y \times D_z}$\. We write $\mathbf{K}(\mathbf{X}, \mathbf{X}) \in \mathbb{R}^{N \times N}$ for kernel matrices and $\sigma^{2}$ for the observation noise variance\.

### 2\.1 The LMC

The LMC is a very general multi\-output GP class that factorizes the output covariance as a sum of Kronecker products, each pairing an input kernel $\mathbf{K}_q \in \mathbb{R}^{N \times N}$ with a *coregionalization* matrix $\mathbf{B}_q \in \mathbb{R}^{D_y \times D_y}$:

$$\mathrm{Cov}(\mathrm{vec}(\mathbf{Y})) = \sum_{q=1}^{Q} \mathbf{B}_q \otimes \mathbf{K}_q + \sigma^{2} \mathbf{I}_{ND_y}, \tag{1}$$

where $\mathrm{vec}(\cdot)$ stacks columns\. One way to derive this is from a linear mixing of latent GP functions: for each component $q$, draw $D_q$ independent latent functions from a shared kernel $k_q$ and mix them into outputs with a matrix $\mathbf{A}_q \in \mathbb{R}^{D_y \times D_q}$\. Then $\mathbf{B}_q = \mathbf{A}_q \mathbf{A}_q^{\top} \succeq 0$ and has rank $\leq D_q$\. The *intrinsic coregionalized model* \(ICM\) is the special case where $\mathbf{K}_q = \mathbf{K}$ for all $q$:

$$\mathrm{Cov}(\mathrm{vec}(\mathbf{Y})) = \mathbf{B} \otimes \mathbf{K} + \sigma^{2} \mathbf{I}_{ND_y}. \tag{2}$$
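As a concrete illustration of Eqs\. \(1\) and \(2\), the following minimal NumPy sketch assembles the dense LMC covariance for a toy problem\. The helper names `rbf_kernel` and `lmc_covariance`, the toy sizes, and the lengthscales are illustrative assumptions, not anything specified in the paper, and the dense $ND_y \times ND_y$ matrix is only practical for very small problems\.

```python
import numpy as np

def rbf_kernel(X, lengthscale=1.0, amplitude=1.0):
    """Squared-exponential kernel matrix on the rows of X (illustrative choice)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return amplitude**2 * np.exp(-0.5 * sq / lengthscale**2)

def lmc_covariance(A_list, K_list, sigma):
    """Cov(vec(Y)) = sum_q (A_q A_q^T) kron K_q + sigma^2 I, with column-stacking vec."""
    Dy = A_list[0].shape[0]
    N = K_list[0].shape[0]
    C = sigma**2 * np.eye(N * Dy)
    for A, K in zip(A_list, K_list):
        C += np.kron(A @ A.T, K)  # B_q = A_q A_q^T has rank at most D_q
    return C

# Toy example: N = 5 inputs, Dy = 3 outputs, Q = 2 components with D_1 = 1, D_2 = 2.
rng = np.random.default_rng(0)
N, Dy = 5, 3
X = rng.normal(size=(N, 2))
K_list = [rbf_kernel(X, lengthscale=0.5), rbf_kernel(X, lengthscale=2.0)]
A_list = [rng.normal(size=(Dy, 1)), rng.normal(size=(Dy, 2))]
C = lmc_covariance(A_list, K_list, sigma=0.1)
```

Because $\mathrm{vec}$ stacks columns, the covariance between output $d$ at example $i$ and output $d'$ at example $j$ sits at row $dN + i$, column $d'N + j$ \(zero\-indexed\)\. Passing a single component recovers the ICM form of Eq\. \(2\)\.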

### 2\.2 GPLFR and LMC as Two Marginalizations of the Same Joint Model

We can see the relationship between GPLFR and LMC by starting with their shared assumed data\-generating process\. Partition the latent space into $Q$ groups with dimensionalities $\{D_q\}_{q=1}^{Q}$ such that $\sum_q D_q = D_z$\. Let $\mathbf{Z}_q \in \mathbb{R}^{N \times D_q}$ and $\mathbf{W}_q \in \mathbb{R}^{D_y \times D_q}$, and define $\mathbf{Z} = \begin{bmatrix}\mathbf{Z}_1 & \dots & \mathbf{Z}_Q\end{bmatrix}$ and $\mathbf{W} = \begin{bmatrix}\mathbf{W}_1 & \dots & \mathbf{W}_Q\end{bmatrix}$\. The data\-generating process draws the latent components from independent GP priors over $\mathbf{X}$ and then maps them through a linear\-Gaussian decoder:

$$\begin{aligned}
\textbf{Latent GP priors:}\quad & \mathrm{vec}(\mathbf{Z}_q) \mid \mathbf{X} \sim \mathcal{N}\!\left(\mathbf{0}, \mathbf{I}_{D_q} \otimes \mathbf{K}_q\right) \quad \text{for } q = 1, \dots, Q, \\
\textbf{Decoder:}\quad & \mathbf{Y} = \sum_{q=1}^{Q} \mathbf{Z}_q \mathbf{W}_q^{\top} + \mathbf{E}, \quad E_{ij} \sim \mathcal{N}(0, \sigma^{2}).
\end{aligned}$$

If we marginalize out $\mathbf{Z}$, then we get

$$\begin{aligned}
p(\mathrm{vec}(\mathbf{Y}) \mid \{\mathbf{W}_q\}_q, \sigma, \mathbf{X}) &= \int p(\mathrm{vec}(\mathbf{Y}) \mid \{\mathbf{Z}_q\}_q, \{\mathbf{W}_q\}_q, \sigma) \prod_q p(\mathbf{Z}_q \mid \mathbf{X}) \, d\mathbf{Z} \\
&= \mathcal{N}\!\left(\mathrm{vec}(\mathbf{Y}); 0, \mathbf{C}_{\text{LMC}}\right),
\end{aligned}$$

with

$$\mathbf{C}_{\text{LMC}} = \left[\sum_q (\mathbf{W}_q \mathbf{W}_q^{\top}) \otimes \mathbf{K}_q\right] + \sigma^{2} \mathbf{I}_{ND_y}. \tag{3}$$

This is exactly an LMC GP \([1](https://arxiv.org/html/2606.06576#S2.E1)\) with $\mathbf{B}_q = \mathbf{W}_q \mathbf{W}_q^{\top}$\.
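The step from the generative model to Eq\. \(3\) can be made explicit with the standard identity $\mathrm{vec}(\mathbf{A}\mathbf{X}\mathbf{B}) = (\mathbf{B}^{\top} \otimes \mathbf{A})\,\mathrm{vec}(\mathbf{X})$ for column\-stacking $\mathrm{vec}$\. The following worked derivation is a sketch added for clarity; it uses only quantities already defined in this section:

```latex
\begin{aligned}
\mathrm{vec}(\mathbf{Z}_q \mathbf{W}_q^{\top})
  &= (\mathbf{W}_q \otimes \mathbf{I}_N)\,\mathrm{vec}(\mathbf{Z}_q), \\
\mathrm{Cov}\big(\mathrm{vec}(\mathbf{Z}_q \mathbf{W}_q^{\top})\big)
  &= (\mathbf{W}_q \otimes \mathbf{I}_N)\,
     (\mathbf{I}_{D_q} \otimes \mathbf{K}_q)\,
     (\mathbf{W}_q^{\top} \otimes \mathbf{I}_N) \\
  &= (\mathbf{W}_q \mathbf{W}_q^{\top}) \otimes \mathbf{K}_q .
\end{aligned}
```

Summing over the independent groups $q$ and adding the independent noise covariance $\sigma^{2}\mathbf{I}_{ND_y}$ gives $\mathbf{C}_{\text{LMC}}$\.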

In GPLFR we instead marginalize out $\mathbf{W}$\. To do this we set independent matrix\-normal priors on the component decoders: $\mathbf{W}_q \sim \mathcal{MN}(0, \mathbf{B}, \mathbf{I}_{D_q})$ for each $q$\. Then

$$\begin{aligned}
p(\mathrm{vec}(\mathbf{Y}) \mid \{\mathbf{Z}_q\}_q, \sigma) &= \int p(\mathrm{vec}(\mathbf{Y}) \mid \{\mathbf{Z}_q\}_q, \{\mathbf{W}_q\}_q, \sigma) \, \prod_q p(\mathbf{W}_q) \, d\mathbf{W} \\
&= \mathcal{N}\!\left(\mathrm{vec}(\mathbf{Y}); 0, \mathbf{C}\right),
\end{aligned}$$

with

$$\mathbf{C} = \mathbf{B} \otimes \left[\sum_q \mathbf{Z}_q \mathbf{Z}_q^{\top}\right] + \sigma^{2} \mathbf{I}_{ND_y}. \tag{4}$$

\(Footnote 2: This can also be interpreted as an ICM covariance \([2](https://arxiv.org/html/2606.06576#S2.E2)\) conditional on $\mathbf{Z}$, where $\mathbf{B}$ is the coregionalization matrix and the effective input\-side kernel under point\-estimation is $\sum_q \mathbf{Z}_q \mathbf{Z}_q^{\top}$\.\)

With this perspective, GPLFR and LMC are seen as different marginalizations of the same underlying factorization\. This is a similar idea to the *primal* and *dual* views of probabilistic PCA used to motivate GPLVMs in Lawrence \[[2005](https://arxiv.org/html/2606.06576#bib.bib18)\]\. However, note that while the primal and dual views are *equivalent* \(they recover the same marginal model\) in the case of probabilistic PCA, the regression context of GPLFR/LMC \($\mathbf{Z}$ tied to $\mathbf{X}$ by a GP prior\) immediately separates them into different model classes\. \(Footnote 3: Excepting degenerate cases, e\.g\., if we restrict GPLFR’s latent representation to be deterministic feature maps of the inputs $\mathbf{Z}_q = \boldsymbol{\Phi}_q(\mathbf{X})$ then we get ordinary LMC priors with input kernels $k_q(\mathbf{x}, \mathbf{x}'; \boldsymbol{\Phi}) = \boldsymbol{\phi}_q(\mathbf{x})^{\top} \mathbf{T}_q \boldsymbol{\phi}_q(\mathbf{x}')$\.\)
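To see why the collapsed covariance in Eq\. \(4\) scales to large $D_y$, consider the special case $\mathbf{B} = \mathbf{I}$\. Then $\mathbf{C} = \mathbf{I}_{D_y} \otimes (\mathbf{Z}\mathbf{Z}^{\top} + \sigma^{2}\mathbf{I}_N)$, so the density factorizes over the columns of $\mathbf{Y}$, and the Woodbury identity reduces the linear algebra to a $D_z \times D_z$ system\. The sketch below is an illustration under that assumption; the function names are hypothetical and this section does not say whether the authors’ code follows this exact route\.

```python
import numpy as np

def collapsed_loglik(Y, Z, sigma):
    """log N(vec(Y); 0, I_Dy kron (Z Z^T) + sigma^2 I), assuming B = I, via Woodbury."""
    N, Dy = Y.shape
    Dz = Z.shape[1]
    M = Z.T @ Z + sigma**2 * np.eye(Dz)          # small Dz x Dz matrix
    L = np.linalg.cholesky(M)
    A = np.linalg.solve(L, Z.T @ Y)              # L^{-1} Z^T Y, shape (Dz, Dy)
    # tr(Y^T S^{-1} Y) with S = Z Z^T + sigma^2 I_N
    quad = (np.sum(Y**2) - np.sum(A**2)) / sigma**2
    # log|S| from the matrix determinant lemma
    logdet_S = (N - Dz) * np.log(sigma**2) + 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (Dy * logdet_S + quad + N * Dy * np.log(2.0 * np.pi))

def dense_loglik(Y, Z, sigma):
    """Same quantity from the full (N*Dy) x (N*Dy) covariance; for checking only."""
    N, Dy = Y.shape
    C = np.kron(np.eye(Dy), Z @ Z.T) + sigma**2 * np.eye(N * Dy)
    y = Y.reshape(-1, order="F")                 # column-stacking vec
    _, logdet = np.linalg.slogdet(C)
    quad = y @ np.linalg.solve(C, y)
    return -0.5 * (logdet + quad + N * Dy * np.log(2.0 * np.pi))

rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 4))
Z = rng.normal(size=(6, 2))
ll_fast = collapsed_loglik(Y, Z, sigma=0.3)
ll_dense = dense_loglik(Y, Z, sigma=0.3)
```

The fast version never forms an $N \times N$ or $ND_y \times ND_y$ matrix, which is the practical point of collapsing the decoder weights when $D_z$ is small\.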

## 3 GPLFR: Regularization and Connection to PCA\-GP

### 3\.1 Output Coregionalization and Likelihood Tempering

In the high\-dimensional, low\-data regime that GPLFR targets, estimating a rich coregionalization matrix $\mathbf{B}$ is statistically unreliable\. We are therefore restricted to simple parameterizations of $\mathbf{B}$, relying on the latent geometry to capture most output correlations\. In practice this means setting $\mathbf{B} = \mathbf{I}$ \(as we do throughout this paper\), unless the output structure admits a clear low\-dimensional parameterization \(as in the exoplanet climate experiment\)\. Any such simplification is a form of model misspecification when the true output correlations contain structured variation not already captured by the latent geometry or $\mathbf{B}$’s low\-dimensional parameterization\. To prevent the likelihood from overstating the information content of each output dimension, we temper it with an inverse\-temperature $\beta \in (0, 1]$; details and justification in Appendix [B\.4](https://arxiv.org/html/2606.06576#A2.SS4)\.

### 3\.2 GPLFR and PCA\-GP

From an application perspective, GPLFR is most naturally compared to PCA\-GP – the standard compress\-then\-predict pipeline for high\-dimensional outputs with limited data \(recapped in Appendix [C](https://arxiv.org/html/2606.06576#A3)\)\. Both methods assume the outputs admit an accurate low\-rank representation and fit input\-to\-latent mappings with GP priors, but they differ in whether that representation is learned *independently of* the inputs or *jointly with* the regression task\.

PCA chooses a basis to retain the maximal output covariance $\mathrm{Cov}(\mathbf{y})$, essentially treating all variance as signal\. However, for prediction, we want the basis that maximally captures the *predictable* component of covariance $\mathrm{Cov}(\mathbb{E}[\mathbf{y} \mid \mathbf{x}])$\. The total covariance decomposes as:

$$\mathrm{Cov}(\mathbf{y}) = \underbrace{\mathrm{Cov}(\mathbb{E}[\mathbf{y} \mid \mathbf{x}])}_{\text{predictable}} \; + \; \underbrace{\mathbb{E}[\mathrm{Cov}(\mathbf{y} \mid \mathbf{x})]}_{\text{unpredictable}}.$$

When the unpredictable term is structureless \(isotropic white noise\), PCA is asymptotically robust as the noise merely inflates the eigenvalues uniformly\. However, when the unpredictable variation is *structured* \(correlated across output dimensions\), it concentrates variance in specific directions independent of $\mathbf{x}$\. In this regime, PCA can misallocate capacity to high\-variance but low\-predictability directions\.
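The decomposition and the PCA failure mode can be checked numerically on a toy example\. In the sketch below, every quantity \(the signal direction `w_sig`, the nuisance direction `w_nuis`, the sine signal, and the variances\) is an assumption chosen for illustration, not taken from the paper: a predictable signal of variance about 0\.5 lives in one output direction, while input\-independent nuisance of variance 4 lives in another\.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20000
x = rng.uniform(-1.0, 1.0, size=N)

w_sig = np.array([1.0, 0.0, 0.0, 0.0])    # direction carrying E[y | x]
w_nuis = np.array([0.0, 1.0, 0.0, 0.0])   # direction carrying structured nuisance

signal = np.outer(np.sin(np.pi * x), w_sig)          # E[y | x], variance ~0.5
nuisance = np.outer(2.0 * rng.normal(size=N), w_nuis)  # independent of x, variance ~4
Y = signal + nuisance

cov_total = np.cov(Y, rowvar=False)
cov_pred = np.cov(signal, rowvar=False)        # Cov(E[y | x])
cov_unpred = np.cov(Y - signal, rowvar=False)  # E[Cov(y | x)]

# Leading principal direction of the total covariance.
eigvals, eigvecs = np.linalg.eigh(cov_total)
top_pc = eigvecs[:, -1]
```

With these settings the two covariance terms add up to the total \(up to sampling error\), and the leading principal component aligns with `w_nuis` rather than `w_sig`: a one\-component PCA basis would spend its entire capacity on variation that no regressor on $\mathbf{x}$ can predict\.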

By contrast, GPLFR’s latent representation is tied to $\mathbf{x}$ by GP priors, so directions that are not coherent with $\mathbf{x}$ are penalized, biasing the representation toward predictability\. Additionally, by marginalizing decoder weights analytically, GPLFR avoids committing to a single point\-estimated basis, which may be beneficial at small $N$ where a PCA basis is poorly determined\.

The trade\-off is that this joint objective creates a more challenging optimization landscape\. The difficulty arises chiefly from the coupling between latents and kernel hyperparameters: the kernel hyperparameters must describe the latent structure, yet that structure is itself inferred conditional on the current kernels\. We mitigate this in practice with two regularizers: a small latent noise $\lambda$ added to each latent GP covariance, which relaxes the requirement that the learned latent scores be explained exactly by smooth GP functions of the inputs; and the likelihood tempering $\beta$ introduced above \(details in Appendix [B\.4](https://arxiv.org/html/2606.06576#A2.SS4)\)\. We explore the PCA\-GP–GPLFR trade\-off and how it depends on the characteristics of the data in Section [4\.1](https://arxiv.org/html/2606.06576#S4.SS1)\.
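The exact objective is specified in the paper’s Appendix B\.4, which is not reproduced here\. As a rough sketch only, one plausible reading of the two regularizers is a MAP objective of the form $\beta \log p(\mathbf{Y} \mid \mathbf{Z}, \sigma) + \sum_q \log \mathcal{N}(\mathbf{z}^{(q)}; \mathbf{0}, \mathbf{K}_q + \lambda \mathbf{I})$ with $\mathbf{B} = \mathbf{I}$ and one ARD RBF kernel per latent dimension\. That form, the omission of hyperparameter priors, and all function names below are assumptions made for illustration\.

```python
import numpy as np

def ard_rbf(X, lengthscales, amplitude):
    """ARD RBF kernel matrix with one lengthscale per input dimension."""
    Xs = X / lengthscales
    sq = np.sum((Xs[:, None, :] - Xs[None, :, :]) ** 2, axis=-1)
    return amplitude**2 * np.exp(-0.5 * sq)

def gaussian_logpdf_zero_mean(v, S):
    """log N(v; 0, S) via a Cholesky factorization of S."""
    L = np.linalg.cholesky(S)
    a = np.linalg.solve(L, v)
    return -0.5 * (a @ a) - np.sum(np.log(np.diag(L))) - 0.5 * len(v) * np.log(2.0 * np.pi)

def tempered_map_objective(Y, Z, X, lengthscales, amplitudes, sigma, beta, lam):
    """Assumed form: beta * collapsed log-likelihood + GP log-prior on each latent column."""
    N, Dy = Y.shape
    Dz = Z.shape[1]
    S = Z @ Z.T + sigma**2 * np.eye(N)            # per-output covariance when B = I
    loglik = sum(gaussian_logpdf_zero_mean(Y[:, d], S) for d in range(Dy))
    logprior = 0.0
    for q in range(Dz):
        K_q = ard_rbf(X, lengthscales[q], amplitudes[q]) + lam * np.eye(N)  # latent noise
        logprior += gaussian_logpdf_zero_mean(Z[:, q], K_q)
    return beta * loglik + logprior

rng = np.random.default_rng(0)
N, Dx, Dy, Dz = 8, 2, 5, 2
X = rng.normal(size=(N, Dx))
Y = rng.normal(size=(N, Dy))
Z = rng.normal(size=(N, Dz))
ell = np.ones((Dz, Dx))
eta = np.ones(Dz)
args = dict(lengthscales=ell, amplitudes=eta, sigma=0.3, lam=1e-2)
obj_full = tempered_map_objective(Y, Z, X, beta=1.0, **args)
obj_half = tempered_map_objective(Y, Z, X, beta=0.5, **args)
obj_prior_only = tempered_map_objective(Y, Z, X, beta=0.0, **args)  # beta = 0 only as a check
```

Under this reading, lowering $\beta$ shifts weight from reconstructing $\mathbf{Y}$ toward the GP prior on the latents, while $\lambda$ lets the latent scores deviate slightly from a smooth function of the inputs\. In practice one would maximize such an objective over $\mathbf{Z}$ and the hyperparameters with a gradient\-based optimizer\.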

## 4 Experiments

We compare GPLFR to baselines on three problems: a synthetic benchmark exploring the PCA\-GP–GPLFR trade\-off under structured output noise \(Section [4\.1](https://arxiv.org/html/2606.06576#S4.SS1)\); a biomedical optics emulation task in a PCA\-friendly regime \(Section [4\.2](https://arxiv.org/html/2606.06576#S4.SS2)\); and a challenging exoplanet climate emulation task \(Section [4\.3](https://arxiv.org/html/2606.06576#S4.SS3)\)\. \(Footnote 4: Although Section 2 uses the LMC as a mathematical reference for GPLFR, we do not benchmark against it directly\. LMC targets a complementary regime – enriching output covariance rather than input\-space structure – and does not scale to the output dimensionalities considered here without substantial approximations\.\)

### 4\.1 Synthetic Benchmark

#### 4\.1\.1 Set\-up

##### Motivation\.

This synthetic benchmark is designed to capture a common structure in real high\-dimensional regression problems, where outputs consist of a low\-dimensional “signal” component which is predictable from inputs, and residual “nuisance” variation that is *not* predictable from inputs – separating these two sources of variation is a key challenge for any model\. We model the nuisance component as a random field correlated across output dimensions\. This is a good stand\-in for many real settings – e\.g\., images, spatial fields, spectra – where measurement effects and unresolved variability induce correlated residuals across output dimensions\. Such correlated residual structure can expose PCA\-GP’s failure mode of allocating capacity to high\-variance but low\-predictability directions\. Through the GP prior over latents, GPLFR can instead bias the learned representation toward coherence in input space \(where coherence is defined by the choice of kernel family\), naturally prioritizing predictable structure\.

##### Data generation\.

We generate outputs $\mathbf{y} \in \mathbb{R}^{D_y}$ from inputs $\mathbf{x} \in \mathbb{R}^{D_x}$ as

$$\mathbf{y} = \mathbf{W}_{\text{sig}} \mathbf{z}_{\text{sig}}(\mathbf{x}) + \mathbf{y}_{\text{nuis}} + \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(0, \sigma_{\epsilon}^{2} \mathbf{I}_{D_y}).$$

Outputs live on a 2D grid with $D_y = HW$ locations\. The signal component consists of $\mathbf{z}_{\text{sig}}(\mathbf{x}) \in \mathbb{R}^{D_{\text{sig}}}$, decoded through localized squared\-exponential basis functions \(columns of $\mathbf{W}_{\text{sig}}$\) with randomly chosen centers and scales\. The nuisance component $\mathbf{y}_{\text{nuis}} \sim \mathcal{N}(0, \boldsymbol{\Sigma}_{\text{nuis}})$ is a spatially correlated random field independent of $\mathbf{x}$, alongside white noise $\boldsymbol{\epsilon}$\. Full details are in Appendix [D\.1\.1](https://arxiv.org/html/2606.06576#A4.SS1.SSS1)\.
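A generator with this overall structure might look like the following sketch\. The paper’s exact recipe is in its Appendix D\.1\.1 and is not reproduced here, so several pieces are assumptions made purely for illustration: the sinusoidal latent functions, the ranges for bump centers and scales, the squared\-exponential nuisance covariance with lengthscale 3 grid cells, and the name `make_synthetic`\.

```python
import numpy as np

def make_synthetic(N, H=16, W=16, Dx=3, Dsig=6,
                   var_sig=1.0, var_nuis=1.0, var_eps=1e-4, seed=0):
    """Toy data: low-rank signal + spatially correlated nuisance + white noise."""
    rng = np.random.default_rng(seed)
    Dy = H * W
    rows, cols = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    grid = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)  # (Dy, 2)

    # Localized squared-exponential bumps as columns of W_sig (assumed ranges).
    centers = rng.uniform([0.0, 0.0], [H - 1.0, W - 1.0], size=(Dsig, 2))
    scales = rng.uniform(1.5, 4.0, size=Dsig)
    d2 = np.sum((grid[:, None, :] - centers[None, :, :]) ** 2, axis=-1)
    W_sig = np.exp(-0.5 * d2 / scales**2)                                # (Dy, Dsig)

    # Smooth latent functions of x (assumed form: random sinusoids).
    X = rng.uniform(-1.0, 1.0, size=(N, Dx))
    freqs = rng.normal(size=(Dx, Dsig))
    phases = rng.uniform(0.0, 2.0 * np.pi, size=Dsig)
    Z_sig = np.sin(X @ freqs + phases)
    Y_sig = Z_sig @ W_sig.T
    Y_sig = Y_sig * np.sqrt(var_sig / Y_sig.var())   # rescale to the target signal variance

    # Spatially correlated nuisance field, independent of X.
    g2 = np.sum((grid[:, None, :] - grid[None, :, :]) ** 2, axis=-1)
    corr = np.exp(-0.5 * g2 / 3.0**2) + 1e-6 * np.eye(Dy)  # jitter for a stable Cholesky
    L = np.linalg.cholesky(corr)
    Y_nuis = np.sqrt(var_nuis) * rng.normal(size=(N, Dy)) @ L.T

    eps = np.sqrt(var_eps) * rng.normal(size=(N, Dy))
    return X, Y_sig + Y_nuis + eps, Y_sig

X, Y, Y_sig = make_synthetic(N=50)
```

The function returns the noise\-free signal `Y_sig` alongside the observations, since the true conditional mean is what signal\-recovery metrics are computed against\.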

##### Metrics\.

Since we know the data\-generating process, we can evaluate predictions against the true conditional mean $\mathbf{y}_{\text{sig}}(\mathbf{x}) \equiv \mathbb{E}[\mathbf{y} \mid \mathbf{x}] = \mathbf{W}_{\text{sig}} \, \mathbf{z}_{\text{sig}}(\mathbf{x})$, isolating signal recovery from nuisance variation\. We report

$$\mathrm{RMSE}_{\text{sig}} = \sqrt{\frac{1}{N_{\text{test}} \, D_y} \sum_{i=1}^{N_{\text{test}}} \| \hat{\mathbf{y}}_i - \mathbf{y}_{\text{sig},i} \|_2^{2}}. \tag{5}$$

An oracle with access to the true signal would achieve $\mathrm{RMSE}_{\text{sig}} = 0$\.
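Eq\. \(5\) pools squared errors over all test examples and output dimensions before taking the root, which is the root of the mean squared entry of the error matrix\. A minimal sketch \(the function name `rmse_sig` and the toy arrays are illustrative\):

```python
import numpy as np

def rmse_sig(Y_hat, Y_sig):
    """Pooled RMSE against the true signal; both arrays have shape (N_test, Dy)."""
    return float(np.sqrt(np.mean((Y_hat - Y_sig) ** 2)))

rng = np.random.default_rng(0)
Y_sig = rng.normal(size=(10, 256))
perfect = rmse_sig(Y_sig, Y_sig)          # oracle prediction
offset = rmse_sig(Y_sig + 0.5, Y_sig)     # constant error of 0.5 in every entry
```

Here `perfect` is zero and `offset` equals the constant error of 0\.5, up to floating\-point rounding\.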

##### Models\.

We compare GPLFR against PCA\-GP and a training\-set\-mean baseline\. Both models use stationary ARD RBF kernels with per\-latent lengthscales\. For GPLFR, we fit the model by MAP estimation using Adam\. We set the coregionalization matrix to the identity, $\mathbf{B} = \mathbf{I}$, so all output covariance modeling enters via the shared latents\. For PCA\-GP, we fit the independent per\-score GPs by maximizing the marginal likelihood using L\-BFGS\-B\. Details are in Appendix [D\.1\.2](https://arxiv.org/html/2606.06576#A4.SS1.SSS2)\.
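For readers unfamiliar with the compress\-then\-predict baseline, the following is a deliberately simplified PCA\-GP sketch\. It departs from the set\-up described above in ways that should be read as assumptions: it uses one isotropic RBF kernel with fixed hyperparameters shared across all scores, rather than per\-score ARD kernels fitted by marginal\-likelihood maximization, and the names `rbf` and `pca_gp_predict` are hypothetical\.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    """Isotropic RBF kernel between the rows of A and the rows of B."""
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq / lengthscale**2)

def pca_gp_predict(X_train, Y_train, X_test, n_components, lengthscale=1.0, noise=1e-2):
    # Stage 1: PCA basis from the outputs alone.
    mean = Y_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(Y_train - mean, full_matrices=False)
    basis = Vt[:n_components].T                       # (Dy, k)
    scores = (Y_train - mean) @ basis                 # (N, k)
    scale = scores.std(axis=0) + 1e-12                # standardize each score

    # Stage 2: GP posterior mean for each score (fixed, shared hyperparameters).
    K = rbf(X_train, X_train, lengthscale) + noise * np.eye(len(X_train))
    K_star = rbf(X_test, X_train, lengthscale)
    pred_scores = K_star @ np.linalg.solve(K, scores / scale) * scale

    # Decode predicted scores back to output space.
    return mean + pred_scores @ basis.T

rng = np.random.default_rng(0)
X_train = rng.uniform(-1.0, 1.0, size=(40, 1))
X_test = rng.uniform(-1.0, 1.0, size=(20, 1))
mix = rng.normal(size=(2, 10))

def truth(X):
    return np.hstack([np.sin(2.0 * X), np.cos(2.0 * X)]) @ mix

Y_train, Y_test = truth(X_train), truth(X_test)
Y_pred = pca_gp_predict(X_train, Y_train, X_test, n_components=2,
                        lengthscale=0.5, noise=1e-4)
rmse_model = np.sqrt(np.mean((Y_pred - Y_test) ** 2))
rmse_mean = np.sqrt(np.mean((Y_train.mean(axis=0) - Y_test) ** 2))
```

Note that the basis in stage 1 is computed without reference to `X_train`; this is the decoupling that GPLFR is designed to avoid\.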

#### 4\.1\.2 Results

![Refer to caption](https://arxiv.org/html/2606.06576v1/x1.png)

Figure 2: *Synthetic benchmark:* Learning curves for GPLFR and PCA\-GP, each with six latent dimensions / principal components \(matching the true signal rank\)\. Bold lines show medians over five dataset seeds; faint lines show individual seeds\.

![Refer to caption](https://arxiv.org/html/2606.06576v1/x2.png)

Figure 3: *Synthetic benchmark:* Effect of latent dimensionality / number of principal components on signal prediction with $N = 800$ examples\. The true signal rank is $D_{\text{sig}} = 6$\. Bold lines show medians over five dataset seeds; faint lines show individual seeds\.

![Refer to caption](https://arxiv.org/html/2606.06576v1/x3.png)

Figure 4: *Synthetic benchmark:* Effect on model performance of increasing the amount of unpredictable, spatially\-correlated “nuisance” variation $\sigma^{2}_{\text{nuis}}$ in the datasets\. $(\sigma^{2}_{\text{sig}}, \sigma^{2}_{\epsilon}) = (1, 0)$ for all datasets\. Both models have six latent dimensions and are trained with $N = 800$ examples\. Bold lines show medians over five dataset seeds; faint lines show individual seeds\.

For Figures [2](https://arxiv.org/html/2606.06576#S4.F2) and [3](https://arxiv.org/html/2606.06576#S4.F3), we use the data\-generation settings $D_y = 16^{2} = 256$, $D_x = 3$, $D_{\text{sig}} = 6$, and variances $(\sigma_{\text{sig}}^{2}, \sigma_{\text{nuis}}^{2}, \sigma_{\epsilon}^{2}) = (1, 1, 10^{-4})$, making a hard\-to\-predict dataset\.

##### Sample efficiency \(Figure [2](https://arxiv.org/html/2606.06576#S4.F2)\)\.

Both models’ RMSEs decrease roughly logarithmically with training set size $N$\. GPLFR outperforms PCA\-GP across all $N$, with an advantage equivalent to roughly $4\times$ the data efficiency\.

##### Representation efficiency \(Figure [3](https://arxiv.org/html/2606.06576#S4.F3)\)\.

GPLFR reaches its best performance \(approximately\) at $D_z = 6$, matching the true signal rank, before plateauing and exhibiting mild overfitting at very high $D_z$\. PCA\-GP improves more slowly with $D_z$ and never matches GPLFR’s performance; it hits its lowest error at $D_z \approx 30$, and tends to overfit at higher $D_z$\. These discrepancies are consistent with GPLFR more effectively prioritizing signal in its latent representation, presenting the GP regressors with higher signal\-to\-noise regression tasks\.

Since the true signal is known for this synthetic problem, we can directly test how much signal, and how much nuisance, each model’s learned subspace contains\. We project the test signal onto each model’s output subspace \(PCA basis for PCA\-GP; decoder posterior\-mean column\-space for GPLFR\) and measure the fraction of energy retained, and analogously for the nuisance component \(details in Appendix [D\.1\.3](https://arxiv.org/html/2606.06576#A4.SS1.SSS3)\)\. At the true signal rank $D_z = 6$, GPLFR captures $77\%$ of signal energy versus $54\%$ for PCA\-GP, while absorbing less nuisance \($25\%$ versus $38\%$\)\. Figure [D\.1](https://arxiv.org/html/2606.06576#A4.F1) shows the full dependence on $D_z$\.
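One natural way to compute such a retained\-energy fraction is to orthonormalize the subspace basis and compare squared Frobenius norms before and after projection\. The paper’s exact procedure is in its Appendix D\.1\.3 and is not reproduced here, so the sketch below \(including the name `energy_fraction` and the absence of any centering step\) is an assumption\.

```python
import numpy as np

def energy_fraction(basis, F):
    """Fraction of the energy of the rows of F (N, Dy) lying in span(basis), basis (Dy, k)."""
    Q, _ = np.linalg.qr(basis)          # orthonormal basis for the same column space
    coeffs = F @ Q                      # projection coefficients, shape (N, k)
    return float(np.sum(coeffs**2) / np.sum(F**2))

rng = np.random.default_rng(0)
Dy = 8
Q_full, _ = np.linalg.qr(rng.normal(size=(Dy, Dy)))

# A non-orthonormal basis spanning the first three orthonormal directions.
basis = Q_full[:, :3] @ rng.normal(size=(3, 3))

F_inside = rng.normal(size=(5, 3)) @ Q_full[:, :3].T    # lies entirely in the subspace
F_outside = rng.normal(size=(5, 2)) @ Q_full[:, 3:5].T  # lies entirely outside it
frac_inside = energy_fraction(basis, F_inside)
frac_outside = energy_fraction(basis, F_outside)
```

Applied to the test signal and to the nuisance field separately, the same function would give the kind of “signal captured” and “nuisance absorbed” fractions quoted above\.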

#### 4\.1\.3 The Effect of Varying the Amount of Nuisance Variation

The results in Figures [2](https://arxiv.org/html/2606.06576#S4.F2) and [3](https://arxiv.org/html/2606.06576#S4.F3) show clearly the failure mode of PCA\-GP and benefit of GPLFR with nuisance\-heavy data\. However, the coupled GPLFR model is harder to optimize than the decoupled PCA\-GP model\. This suggests a practical boundary: at what point does the flexibility of GPLFR outweigh the stability of PCA\-GP? The answer will clearly depend on the particular features of the dataset – the kind of nuisance and signal variation, how well kernel assumptions can bias the model toward capturing signal variation etc\. – but as a first pass we can consider the effect of changing the nuisance variance across several instances of the synthetic problem\. Specifically, in Figure [4](https://arxiv.org/html/2606.06576#S4.F4), we vary $\sigma^{2}_{\text{nuis}}$ while fixing $(\sigma^{2}_{\text{sig}}, \sigma^{2}_{\epsilon}) = (1, 0)$; all other data\-generating settings \(including randomly generated ones\) are the same as in Figures [2](https://arxiv.org/html/2606.06576#S4.F2) and [3](https://arxiv.org/html/2606.06576#S4.F3)\.

GPLFR significantly outperforms PCA\-GP for $\sigma^{2}_{\text{nuis}}$ between around $10^{-5}$ and 1\. We expect a large fraction of real\-world problems would fall in this signal\-to\-nuisance ratio range\. The models converge around $\sigma^{2}_{\text{nuis}} = 10^{-6}$ and are then roughly equal down to $10^{-8}$\. In the noiseless limit $\sigma^{2}_{\text{nuis}} = 0$, where PCA’s assumption that all variance is relevant to prediction is exactly correct, PCA\-GP slightly outperforms \(standard\) GPLFR \(we discuss this further in Appendix [D\.1\.4](https://arxiv.org/html/2606.06576#A4.SS1.SSS4)\)\. When nuisance overwhelms signal at $\sigma^{2}_{\text{nuis}} = 10$, both models overfit, underperforming the training mean baseline\.

### 4\.2 Biomedical Optics: Emulating PyXOpto

#### 4\.2\.1 Set\-up

##### Problem description\.

We evaluate GPLFR on a subset of *MCDataset* \[Bürmen et al\., [2022](https://arxiv.org/html/2606.06576#bib.bib5)\]\. This is a dataset from biomedical optics consisting of *PyXOpto* simulations of light propagation through biological tissues\. Monte Carlo light\-transport models like PyXOpto underpin much of non\-invasive optical sensing, where one tries to infer tissue properties from measured reflectance curves for diagnosis or monitoring\. The task is to predict a $D_y = 500$ dimensional radial reflectance profile $\mathbf{y}$ from three optical tissue properties $(\mu_a, \mu_s', g)$\. While $g$ is physically continuous, in the dataset it is evaluated at only $S = 5$ values\. We therefore treat each $g$ value as a discrete input indexed by $s = 1, \dots, S$, while treating $\mathbf{x} = (\mu_a, \mu_s')$ as a continuous input\. Monte Carlo light\-transport models like PyXOpto are stochastic simulators, but noise is typically close to uncorrelated across radii, making this a relatively PCA\-friendly regime, and thus a test of whether GPLFR offers any benefit when PCA’s implicit assumptions are roughly met\. Further details are in Appendix [D\.2\.1](https://arxiv.org/html/2606.06576#A4.SS2.SSS1)\.

##### Metrics\.

We report sample\-wise RMSE computed in $\log_{10}$ space on the reflectance curves,

$$\mathrm{RMSE} = \frac{1}{N_{\text{test}}} \sum_{i=1}^{N_{\text{test}}} \frac{1}{\sqrt{D_y}} \big\| \log_{10} \hat{\mathbf{y}}_i - \log_{10} \mathbf{y}_i \big\|_2. \tag{6}$$
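Unlike a pooled RMSE, Eq\. \(6\) first computes an RMSE per test curve and then averages those values across curves\. A minimal sketch \(the function name `log10_rmse` and the toy reflectance values are illustrative assumptions\):

```python
import numpy as np

def log10_rmse(Y_hat, Y_true):
    """Mean over test samples of the per-sample RMSE in log10 space; inputs must be positive."""
    diff = np.log10(Y_hat) - np.log10(Y_true)
    per_sample = np.linalg.norm(diff, axis=1) / np.sqrt(diff.shape[1])
    return float(np.mean(per_sample))

rng = np.random.default_rng(0)
Y_true = rng.uniform(1e-6, 1e-2, size=(4, 500))
Y_hat = 10.0 * Y_true                      # every entry off by exactly one decade
score = log10_rmse(Y_hat, Y_true)
```

Since every entry is off by one decade, `score` equals 1 up to rounding\. Working in $\log_{10}$ space means errors are measured in decades, so relative errors at small reflectance values count as much as those at large values\.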

##### Models\.

We compare GPLFR against PCA\-ICM and a PCA\-MLP baseline\. To handle the discrete nature of $g$, we equip the GP models \(GPLFR and PCA\-ICM\) with a coregionalization kernel over the task index $s$: $k\big((\mathbf{x}, s), (\mathbf{x}', s')\big) = k_x(\mathbf{x}, \mathbf{x}') \, B^{\text{in}}_{ss'}$, where $k_x(\cdot, \cdot)$ is an ARD RBF kernel on the standardized continuous inputs and $\mathbf{B}^{\text{in}} \in \mathbb{R}^{S \times S}$ is a learned task correlation matrix\. This is an intrinsic coregionalized model \(ICM\) as defined in Section [2\.1](https://arxiv.org/html/2606.06576#S2.SS1), but over inputs rather than outputs\. We restrict $\mathbf{B}^{\text{in}}$ to a correlation matrix rather than a covariance matrix for both models to improve identifiability in this low\-data regime \(both models performed worse with an unrestricted covariance matrix: GPLFR slightly, PCA\-ICM substantially\)\. GPLFR is fit by MAP estimation and PCA\-ICM is fit by maximizing the marginal likelihood, both using Adam\. The MLP in PCA\-MLP has two hidden layers and is trained to minimize MSE on PCA scores\. We also compare to an “oracle” PCA\-projection baseline, which reconstructs the test outputs using the *true* test PCA scores \(isolating reconstruction error from regression error\)\. This establishes a performance ceiling for the PCA\-based models\. Hyperparameters and architecture details are in Appendix [D\.2\.2](https://arxiv.org/html/2606.06576#A4.SS2.SSS2)\.
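The product kernel over a continuous input and a discrete task index can be sketched as follows\. How the paper parameterizes the correlation matrix is not stated in this section; the row\-normalized lower\-triangular factor used here is one common choice and is an assumption, as are the function names and toy values\.

```python
import numpy as np

def correlation_from_factor(L_raw):
    """Map an unconstrained square matrix to a valid correlation matrix."""
    L = np.tril(L_raw)
    L = L / np.linalg.norm(L, axis=1, keepdims=True)   # unit-norm rows give a unit diagonal
    return L @ L.T

def icm_input_kernel(X, s, lengthscales, B_in, amplitude=1.0):
    """k((x, s), (x', s')) = k_x(x, x') * B_in[s, s'] with an ARD RBF k_x."""
    Xs = X / lengthscales
    sq = np.sum((Xs[:, None, :] - Xs[None, :, :]) ** 2, axis=-1)
    K_x = amplitude**2 * np.exp(-0.5 * sq)
    return K_x * B_in[np.ix_(s, s)]                    # elementwise product

rng = np.random.default_rng(0)
S = 5                                    # number of discrete g values
X = rng.normal(size=(12, 2))             # standardized (mu_a, mu_s') pairs (toy values)
s = rng.integers(0, S, size=12)          # task index of each example
B_in = correlation_from_factor(rng.normal(size=(S, S)))
K = icm_input_kernel(X, s, lengthscales=np.array([0.7, 1.3]), B_in=B_in)
```

The elementwise product of two positive semidefinite matrices is itself positive semidefinite \(Schur product theorem\), so `K` is a valid kernel matrix\. Fixing the diagonal of `B_in` to one removes the scale ambiguity between the task matrix and the kernel amplitude, which is consistent with the identifiability motivation given above\.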

![Refer to caption](https://arxiv.org/html/2606.06576v1/x4.png)

Figure 5: *PyXOpto emulation:* Learning curves for GPLFR and baselines\. GPLFR, PCA\-ICM and PCA\-MLP use six latent dimensions / principal components\. Bold lines show medians over five dataset seeds; faint lines show individual seeds\.

![Refer to caption](https://arxiv.org/html/2606.06576v1/x5.png)

Figure 6: *PyXOpto emulation:* Effect of latent dimensionality / number of principal components with $N=200$ examples\. Bold lines show medians over five dataset seeds; faint lines show individual seeds\.

#### 4\.2\.2 Results

##### Sample efficiency \(Figure [5](https://arxiv.org/html/2606.06576#S4.F5)\)\.

Both GP\-based models converge to near the oracle benchmark at $N=800$, confirming that there is limited structured output noise in this dataset\. GPLFR shows the best sample efficiency\.

##### Representation efficiency \(Figure [6](https://arxiv.org/html/2606.06576#S4.F6)\)\.

With $N=200$, all models plateau around $D_{z}=6$, with GPLFR consistently maintaining a lower error floor\.

These results suggest that even when the “PCA assumption” holds \(signal variance $\gg$ noise variance\), GPLFR can still offer greater sample and representation efficiency, due to its flexibility to rotate the representation within the signal subspace to maximize GP kernel coherence\.

### 4\.3 Emulating Exoplanet Climate Models

#### 4\.3\.1 Set\-up

##### Scientific motivation\.

The main goal of exoplanet science is to discover alien life\. A critical subgoal is to characterize the climates of*potentially habitable*exoplanets – by which we mean planets that are Earth\-like in a loose sense, spanning, for example, completely ice\-covered “snowball” worlds to steamy “moist greenhouse” worlds\.

The gold\-standard tools for modeling these climates are global climate models \(GCMs; the acronym can be read as *global climate model* or *general circulation model*, and both mean the same thing\), which numerically solve the equations of atmospheric fluid dynamics coupled to parameterizations that capture non\-dynamical processes \(e\.g\., radiation, microphysics\) and sub\-grid\-scale dynamics \(e\.g\., turbulence, convection\)\. However, a single GCM simulation typically costs $\sim 10^{4}$–$10^{6}$ core\-hours, precluding their use in large parameter sweeps, uncertainty quantification, or data\-driven inference\. An emulator that delivers near\-instant, spatially resolved climate predictions would open up these tasks to the exoplanet community, which as of now generally must fall back to 1D models\.

![Refer to caption](https://arxiv.org/html/2606.06576v1/x6.png)

Figure 7: *Exoclimate emulation:* Problem overview\. Output fields are defined on a $32\times 64$ latitude–longitude grid\. 3D variables contain 10 pressure levels, with surface temperature recorded as a separate variable\. There are 53 fields in total, giving $\sim 10^{5}$ output dimensions before spherical\-harmonic truncation\.
##### Problem description\.

We assemble a library of existing exoplanet GCM simulations from the literature, supplemented by bespoke runs chosen to improve input\-space coverage; the combined dataset spans five GCMs and is highly non\-uniform \(Table [D\.5](https://arxiv.org/html/2606.06576#A4.T5)\)\. The task is to predict the time\-mean 3D atmospheric state from $D_{x}=8$ continuous planet properties $\mathbf{x}$ and a discrete GCM label $s\in\{1,\dots,S\}$, with $S=5$ \(Figure [7](https://arxiv.org/html/2606.06576#S4.F7)\)\. We use the term *variable* to refer to a single physical quantity that may span multiple vertical levels, while *field* refers to a single slice of a variable at one vertical level\. We represent each field on the sphere via a truncated spherical harmonic expansion and concatenate the resulting coefficients across fields, giving a single output vector $\mathbf{y}\in\mathbb{R}^{D_{y}}$ with $D_{y}\approx 3\times 10^{4}$\. Truncation provides a natural compact representation for smooth fields while filtering out high\-frequency noise\. Further preprocessing details are in Appendix [D\.3\.2](https://arxiv.org/html/2606.06576#A4.SS3.SSS2)\.

Evaluation targets two high\-fidelity GCMs within a core physical domain \(Table [D\.4](https://arxiv.org/html/2606.06576#A4.T4)\), giving us just 307 examples, 80 of which we reserve for the test set\. To bolster the small training set, we supplement it with simulations from three auxiliary GCMs and from slightly outside the core domain, giving 1419 additional examples\.

GCMs differ in their vertical grids and in which variables they output, creating structured patterns of missing data across examples\. They also embody different physics and numerics, producing systematic inter\-model disagreement even for identical planets\. Even within a single GCM, differences in simulation setup \(e\.g\., parameterization choices, time\-averaging procedure, initial state\) – which are numerous, inconsistently documented, and too sparsely sampled to include as inputs – introduce structured variability not predictable from $\mathbf{x}$ alone\.

##### Models\.

We compare five models: GPLFR and PPCA\-ICM are probabilistic models that operate in spectral space and produce ensemble predictions for uncertainty quantification; PPCA\-MLP, $k$\-nearest neighbours \(kNN\), and the training mean serve as deterministic baselines\.

GPLFR fits independent ARD Matérn\-$\tfrac{5}{2}$ ICM GPs over $(\mathbf{x},s)$ with an identity decoder \($\mathbf{B}=\mathbf{I}$\)\. Missing fields are masked in the collapsed likelihood\. Details in Appendix [D\.3\.3](https://arxiv.org/html/2606.06576#A4.SS3.SSS3)\.

PPCA\-ICM extracts latent scores with probabilistic PCA, then regresses them against $(\mathbf{x},s)$ using independent ICM GPs as in GPLFR\. Missing fields are omitted from the PPCA likelihood\. Details in Appendix [D\.3\.4](https://arxiv.org/html/2606.06576#A4.SS3.SSS4)\.

PPCA\-MLP replaces PPCA\-ICM’s ICM GP with a two\-layer MLP\. kNN and the training mean operate directly in spatial space\. Details in Appendix [D\.3\.5](https://arxiv.org/html/2606.06576#A4.SS3.SSS5)\.

Table 1: *Exoclimate emulation:* RMSE\. Bold indicates the lowest \(best\) value\. 3D variables are shown averaged over pressure levels; per\-level results are reported in Table [D\.9](https://arxiv.org/html/2606.06576#A4.T9)\.

Table 2: *Exoclimate emulation:* Energy scores\. Bold indicates lower \(better\) values\. 3D variables are shown averaged over pressure levels; per\-level results are reported in Table [D\.8](https://arxiv.org/html/2606.06576#A4.T8)\.
##### Metrics\.

We evaluate predictions on spatial fields after full inverse preprocessing\. For deterministic evaluation, we report the area\-weighted RMSE:

$$\mathrm{RMSE}(\hat{\mathbf{y}},\mathbf{y})=\|\hat{\mathbf{y}}-\mathbf{y}\|_{G},\quad\|\mathbf{e}\|_{G}=\sqrt{\mathbf{e}^{\top}\mathbf{G}\mathbf{e}},$$

where $\mathbf{G}$ contains Gauss–Legendre latitude weights, and $\mathbf{y}$ and $\hat{\mathbf{y}}$ are the true and predicted fields\. For probabilistic evaluation, we report the energy score of the predictive distribution $p$ against a ground\-truth field $\mathbf{y}$:

$$\mathrm{ES}(p,\mathbf{y})=\mathbb{E}_{\mathbf{y}'\sim p}\|\mathbf{y}'-\mathbf{y}\|_{G}-\tfrac{1}{2}\mathbb{E}_{\mathbf{y}',\mathbf{y}''\sim p}\|\mathbf{y}'-\mathbf{y}''\|_{G},$$

estimated from posterior predictive samples\. Both metrics are computed per\-field and averaged over test examples\. \(The energy score can be seen as a multivariate generalization of the continuous ranked probability score \(CRPS\)\. The first term penalizes mean inaccuracy and the second penalizes overconfidence; lower is better\.\)
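
Both metrics can be estimated from an ensemble with a few lines of NumPy\. The sketch below is illustrative only: it assumes $\mathbf{G}$ is diagonal with Gauss–Legendre latitude weights normalized to sum to one over the grid, and it averages the second term over distinct sample pairs; neither choice is stated in the text, and all function names are hypothetical\.

```python
import numpy as np

def g_norm(e, g):
    """Weighted norm ||e||_G = sqrt(e^T G e) for diagonal G with entries g."""
    return np.sqrt(np.sum(g * e**2, axis=-1))

def area_weighted_rmse(y_hat, y, g):
    """Area-weighted RMSE of one predicted field against the truth."""
    return g_norm(y_hat - y, g)

def energy_score(samples, y, g):
    """Monte Carlo estimate of ES(p, y) from samples of shape (M, D)."""
    m = samples.shape[0]
    term1 = g_norm(samples - y, g).mean()
    pair = g_norm(samples[:, None, :] - samples[None, :, :], g)   # (M, M)
    term2 = pair.sum() / (m * (m - 1))    # mean over distinct pairs
    return term1 - 0.5 * term2

# Weights for a 32 x 64 latitude-longitude grid, flattened in (lat, lon) order.
n_lat, n_lon = 32, 64
_, w_lat = np.polynomial.legendre.leggauss(n_lat)
g = np.repeat(w_lat / (w_lat.sum() * n_lon), n_lon)

# Toy check (hypothetical data): a collapsed ensemble offset by 2 everywhere.
rng = np.random.default_rng(0)
y = rng.normal(size=n_lat * n_lon)
samples = np.tile(y + 2.0, (8, 1))
es = energy_score(samples, y, g)
rmse = area_weighted_rmse(y + 2.0, y, g)
```

For an ensemble with zero spread the second term vanishes, so the energy score reduces to the weighted error of the single prediction\.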

#### 4\.3\.2 Results

GPLFR achieves the lowest RMSE and energy score across all variables \(Tables [1](https://arxiv.org/html/2606.06576#S4.T1) and [2](https://arxiv.org/html/2606.06576#S4.T2)\), with RMSE improvements over PPCA\-ICM ranging from 8% \(cloud fraction\) to 34% \(absorbed shortwave radiation\)\. Energy score improvements are similar\. GPLFR also appears stronger on uncertainty calibration and spatial pattern prediction \(Appendix [D\.3\.9](https://arxiv.org/html/2606.06576#A4.SS3.SSS9)\)\. Example spatial predictions are shown in Appendix [D\.3\.8](https://arxiv.org/html/2606.06576#A4.SS3.SSS8)\.

We note the contrast with the PyXOpto experiment\. There, GPLFR and PCA\-ICM were much closer in performance, consistent with the simulator noise being nearly uncorrelated across output dimensions – a regime where PCA’s implicit assumption that all variance is relevant to prediction is approximately correct\. The large performance gap here suggests that, as in the synthetic benchmark \(Section [4\.1](https://arxiv.org/html/2606.06576#S4.SS1)\), PPCA\-ICM is allocating modeling capacity to less predictable directions than GPLFR – plausibly due to structured, input\-independent variation arising from unmodeled differences in simulation setup\.

## 5 Limitations and Future Work

##### Optimization landscape\.

Unlike the sequential estimation in PCA\-GP, GPLFR requires jointly optimizing latents and kernel hyperparameters\. This introduces a strong coupling: the kernel hyperparameters determine the prior induced on the latent space, yet the latent variables are identifiable only conditional on that kernel\. Consequently, the objective is highly multimodal\. In practice one can stabilize fitting with likelihood tempering $\beta$ and latent noise $\lambda$ \(Section [3](https://arxiv.org/html/2606.06576#S3)\), but these do increase the hyperparameter tuning burden; better heuristics for choosing these parameters would reduce this cost\.

##### Structured residuals\.

Our current implementation defaults to an identity coregionalization matrix \($\mathbf{B}=\mathbf{I}$\), relying on likelihood tempering to mitigate the resulting misspecification when residuals are correlated\. Future work could explore low\-dimensional parameterizations of $\mathbf{B}$, such as Kronecker products for gridded data or Toeplitz matrices for time\-series\.

##### Inference and uncertainty quantification\.

Here we focus on fitting GPLFR via MAP estimation\. Distributional approximations to the full posterior, such as variational inference or Hamiltonian Monte Carlo \(HMC\), are an avenue for future work, and may be particularly beneficial in very low\-data \(or high\-compute\) settings\. A promising middle ground is a “partially Bayesian” approximation: fixing the latents at their MAP estimates while performing Bayesian inference \(e\.g\., via HMC\) over the remaining parameters, avoiding the cost and identifiability challenge of sampling many latents\. This is reasonable in the low\-data, high\-dimensional\-output regime because the posterior over latents, which is informed by many output dimensions per example, is typically much more concentrated than the posterior over the remaining parameters, which are only weakly constrained by the few available examples\.

## 6 Conclusion

We have presented GPLFR, a probabilistic model that couples representation learning and regression by learning latents under a GP prior jointly with output reconstruction through a collapsed linear\-Gaussian decoder\. This biases the learned representation towards structure that is *predictable* from the inputs, which is especially useful in problems with structured output noise\. On the synthetic benchmark, GPLFR achieves the same error as PCA\-GP with roughly $4\times$ fewer training data; on PyXOpto emulation, it shows consistent sample efficiency gains even in a PCA\-friendly regime with limited structured output noise\. The main practical cost relative to PCA\-GP is harder optimization, which can be mitigated with regularization\. Beyond these benchmarks, we used GPLFR to build the first spatially resolved emulator of global climate models for rocky exoplanets, where it strongly outperformed alternative methods\.

###### Acknowledgements\.

Edward Stevenson is supported by the Science and Technology Facilities Council \(STFC\) Centre for Doctoral Training in Data Intensive Science at the University of Cambridge\. Miles Cranmer is grateful for support from the Isaac Newton Trust and the AI2050 program at Schmidt Sciences\. Mei Ting Mak acknowledges support from the Croucher Postdoctoral Fellowship, funded by the Croucher Foundation\. The GCM results are produced using Met Office Software and the Monsoon3 system, a collaborative facility supplied under the Joint Weather and Climate Research Programme, a strategic partnership between the Met Office and the Natural Environment Research Council in the UK\. Eric Wolf acknowledges funding from the Consortium on Habitability and Atmospheres of M\-dwarf Planets team and the Virtual Planetary Laboratory, supported by NASA grant numbers 80NSSC21K0905, 80NSSC23K1399, 80NSSC23K1398 and 80NSSC18K0829 respectively\. Nathan Mayne acknowledges support from a UK Research and Innovation \(UKRI\) Future Leaders Fellowship MR/T040866/1, and partly from the Leverhulme Trust through a research project grant RPG\-2020\-82 alongside a Science and Technology Facilities Council \(STFC\) Consolidated Grant ST/R000395/1\. This work used the Dawn AI service, part of the UK AI Research Resource \(AIRR\), operated by the University of Cambridge Research Computing Service \([www\.hpc\.cam\.ac\.uk/d\-w\-n](https://arxiv.org/html/2606.06576v1/www.hpc.cam.ac.uk/d-w-n)\) and supported by UK Research and Innovation, with Intel and Dell Technologies as technology partners\.

## References

- Alvarez et al\. \[2012\]Mauricio A\. Alvarez, Lorenzo Rosasco, and Neil D\. Lawrence\.Kernels for Vector\-Valued Functions: A Review, April 2012\.
- Bair et al\. \[2006\]Eric Bair, Trevor Hastie, Debashis Paul, and Robert Tibshirani\.Prediction by Supervised Principal Components\.*Journal of the American Statistical Association*, 101\(473\):119–137, March 2006\.ISSN 0162\-1459\.[10\.1198/016214505000000628](https://arxiv.org/doi.org/10.1198/016214505000000628)\.
- Bissiri et al\. \[2016\]Pier Giovanni Bissiri, Chris Holmes, and Stephen Walker\.A General Framework for Updating Belief Distributions\.*Journal of the Royal Statistical Society Series B: Statistical Methodology*, 78\(5\):1103–1130, November 2016\.ISSN 1369\-7412, 1467\-9868\.[10\.1111/rssb\.12158](https://arxiv.org/doi.org/10.1111/rssb.12158)\.
- Bruinsma et al\. \[2020\]Wessel P\. Bruinsma, Eric Perim, Will Tebbutt, J\. Scott Hosking, Arno Solin, and Richard E\. Turner\.Scalable Exact Inference in Multi\-Output Gaussian Processes, July 2020\.
- Bürmen et al\. \[2022\]Miran Bürmen, Franjo Pernuš, and Peter Naglič\.MCDataset: A public reference dataset of Monte Carlo simulated quantities for multilayered and voxelated tissues computed by massively parallel PyXOpto Python package\.*Journal of Biomedical Optics*, 27\(8\):083012, April 2022\.ISSN 1083\-3668, 1560\-2281\.[10\.1117/1\.JBO\.27\.8\.083012](https://arxiv.org/doi.org/10.1117/1.JBO.27.8.083012)\.
- Calandra et al\. \[2016\]Roberto Calandra, Jan Peters, Carl Edward Rasmussen, and Marc Peter Deisenroth\.Manifold Gaussian Processes for Regression, April 2016\.
- Dai et al\. \[2017\]Zhenwen Dai, Mauricio Álvarez, and Neil Lawrence\.Efficient Modeling of Latent Information in Supervised Learning using Gaussian Processes\.*Advances in Neural Information Processing Systems*, 30, 2017\.
- Hammond et al\. \[2025\]Tobi Hammond, Thaddeus D\. Komacek, Ravi K\. Kopparapu, Thomas J\. Fauchez, Avi M\. Mandell, Eric T\. Wolf, Vincent Kofman, Stephen R\. Kane, Ted M\. Johnson, Anmol Desai, Giada Arney, and Jaime S\. Crouse\.The Climates and Thermal Emission Spectra of Prime Nearby Temperate Rocky Exoplanet Targets\.*The Astrophysical Journal*, 984\(2\):181, May 2025\.ISSN 0004\-637X\.[10\.3847/1538\-4357/adc73b](https://arxiv.org/doi.org/10.3847/1538-4357/adc73b)\.
- Haqq\-Misra et al\. \[2022\]Jacob Haqq\-Misra, Eric T\. Wolf, Thomas J\. Fauchez, Aomawa L\. Shields, and Ravi K\. Kopparapu\.The Sparse Atmospheric Model Sampling Analysis \(SAMOSA\) Intercomparison: Motivations and Protocol Version 1\.0: A CUISINES Model Intercomparison Project\.*The Planetary Science Journal*, 3\(11\):260, November 2022\.ISSN 2632\-3338\.[10\.3847/PSJ/ac9479](https://arxiv.org/doi.org/10.3847/PSJ/ac9479)\.
- Higdon et al\. \[2008\]Dave Higdon, James Gattiker, Brian Williams, and Maria Rightley\.Computer Model Calibration Using High\-Dimensional Output\.*Journal of the American Statistical Association*, 103\(482\):570–583, June 2008\.ISSN 0162\-1459\.[10\.1198/016214507000000888](https://arxiv.org/doi.org/10.1198/016214507000000888)\.
- Holden et al\. \[2015\]Philip B\. Holden, Neil R\. Edwards, Paul H\. Garthwaite, and Richard D\. Wilkinson\.Emulation and interpretation of high\-dimensional climate model outputs\.*Journal of Applied Statistics*, 42\(9\):2038–2055, September 2015\.ISSN 0266\-4763\.[10\.1080/02664763\.2015\.1016412](https://arxiv.org/doi.org/10.1080/02664763.2015.1016412)\.
- Hutchings et al\. \[2025\]Grant Hutchings, Derek Bingham, Kellin Rumsey, and Earl Lawrence\.Fast Emulation, Modular Calibration, and Active Learning for Simulators with Functional Response, October 2025\.
- Izenman \[1975\]Alan Julian Izenman\.Reduced\-rank regression for the multivariate linear model\.*Journal of Multivariate Analysis*, 5\(2\):248–264, June 1975\.ISSN 0047\-259X\.[10\.1016/0047\-259X\(75\)90042\-1](https://arxiv.org/doi.org/10.1016/0047-259X(75)90042-1)\.
- Jiang et al\. \[2025\]Xiaoyu Jiang, Sokratia Georgaka, Magnus Rattray, and Mauricio A\. Álvarez\.Scalable Multi\-Output Gaussian Processes with Stochastic Variational Inference, June 2025\.
- Komacek and Abbot \[2019\]Thaddeus D\. Komacek and Dorian S\. Abbot\.The atmospheric circulation and climate of terrestrial planets orbiting Sun\-like and M\-dwarf stars over a broad range of planetary parameters\.*The Astrophysical Journal*, 871\(2\):245, February 2019\.ISSN 0004\-637X, 1538\-4357\.[10\.3847/1538\-4357/aafb33](https://arxiv.org/doi.org/10.3847/1538-4357/aafb33)\.
- kumar Kopparapu et al\. \[2016\]Ravi kumar Kopparapu, Eric T\. Wolf, Jacob Haqq\-Misra, Jun Yang, James F\. Kasting, Victoria Meadows, Ryan Terrien, and Suvrath Mahadevan\.THE INNER EDGE OF THE HABITABLE ZONE FOR SYNCHRONOUSLY ROTATING PLANETS AROUND LOW\-MASS STARS USING GENERAL CIRCULATION MODELS\.*The Astrophysical Journal*, 819\(1\):84, March 2016\.ISSN 0004\-637X\.[10\.3847/0004\-637X/819/1/84](https://arxiv.org/doi.org/10.3847/0004-637X/819/1/84)\.
- kumar Kopparapu et al\. \[2017\]Ravi kumar Kopparapu, Eric T\. Wolf, Giada Arney, Natasha E\. Batalha, Jacob Haqq\-Misra, Simon L\. Grimm, and Kevin Heng\.Habitable Moist Atmospheres on Terrestrial Planets near the Inner Edge of the Habitable Zone around M Dwarfs\.*The Astrophysical Journal*, 845\(1\):5, August 2017\.ISSN 0004\-637X\.[10\.3847/1538\-4357/aa7cf9](https://arxiv.org/doi.org/10.3847/1538-4357/aa7cf9)\.
- Lawrence \[2005\]Neil Lawrence\.Probabilistic Non\-linear Principal Component Analysis with Gaussian Process Latent Variable Models\.*Journal of Machine Learning Research*, 6\(60\):1783–1816, 2005\.ISSN 1533\-7928\.
- Li et al\. \[2020\]Shibo Li, Wei Xing, Robert M\. Kirby, and Shandian Zhe\.Scalable Gaussian Process Regression Networks\.In*Twenty\-Ninth International Joint Conference on Artificial Intelligence*, volume 3, pages 2456–2462, July 2020\.[10\.24963/ijcai\.2020/340](https://arxiv.org/doi.org/10.24963/ijcai.2020/340)\.
- Macdonald et al\. \[2025\]Evelyn Macdonald, Kristen Menou, Christopher Lee, and Adiv Paradise\.Climate Transition to Temperate Nightside at High Atmosphere Mass\.*The Astrophysical Journal*, 981\(1\):3, February 2025\.ISSN 0004\-637X\.[10\.3847/1538\-4357/adb0cb](https://arxiv.org/doi.org/10.3847/1538-4357/adb0cb)\.
- Mak et al\. \[2024\]Mei Ting Mak, Denis Sergeev, Nathan Mayne, Nahum Banks, Jake Eager\-Nash, James Manners, Giada Arney, Eric Hebrard, and Krisztian Kohary\.3D simulations of TRAPPIST\-1e with varying CO2, CH4 and haze profiles\.*Monthly Notices of the Royal Astronomical Society*, 529\(4\):3971–3987, March 2024\.ISSN 0035\-8711, 1365\-2966\.[10\.1093/mnras/stae741](https://arxiv.org/doi.org/10.1093/mnras/stae741)\.
- Paradise et al\. \[2021\]Adiv Paradise, Bo Lin Fan, Kristen Menou, and Christopher Lee\.Climate Diversity in the Solar\-Like Habitable Zone due to Varying Background Gas Pressure\.*Icarus*, 358:114301, April 2021\.ISSN 00191035\.[10\.1016/j\.icarus\.2020\.114301](https://arxiv.org/doi.org/10.1016/j.icarus.2020.114301)\.
- Paradise et al\. \[2022a\]Adiv Paradise, Evelyn Macdonald, Kristen Menou, Christopher Lee, and Bo Lin Fan\.ExoPlaSim: Extending the Planet Simulator for Exoplanets\.*Monthly Notices of the Royal Astronomical Society*, 511\(3\):3272–3303, February 2022a\.ISSN 0035\-8711, 1365\-2966\.[10\.1093/mnras/stac172](https://arxiv.org/doi.org/10.1093/mnras/stac172)\.
- Paradise et al\. \[2022b\]Adiv Paradise, Kristen Menou, Christopher Lee, and Bo Lin Fan\.Fundamental challenges to remote sensing of exo\-earths\.*Monthly Notices of the Royal Astronomical Society*, 512\(3\):3616–3626, May 2022b\.ISSN 0035\-8711\.[10\.1093/mnras/stac724](https://arxiv.org/doi.org/10.1093/mnras/stac724)\.
- Rougier \[2008\]Jonathan Rougier\.Efficient Emulators for Multivariate Deterministic Functions\.*Journal of Computational and Graphical Statistics*, 17\(4\):827–843, December 2008\.ISSN 1061\-8600\.[10\.1198/106186008X384032](https://arxiv.org/doi.org/10.1198/106186008X384032)\.
- Sergeev et al\. \[2022\]Denis E\. Sergeev, Thomas J\. Fauchez, Martin Turbet, Ian A\. Boutle, Kostas Tsigaridis, Michael J\. Way, Eric T\. Wolf, Shawn D\. Domagal\-Goldman, François Forget, Jacob Haqq\-Misra, Ravi K\. Kopparapu, F\. Hugo Lambert, James Manners, and Nathan J\. Mayne\.The TRAPPIST\-1 Habitable Atmosphere Intercomparison \(THAI\)\. II\. Moist Cases\-The Two Waterworlds\.*The Planetary Science Journal*, 3:212, September 2022\.ISSN 2632\-3338\.[10\.3847/PSJ/ac6cf2](https://arxiv.org/doi.org/10.3847/PSJ/ac6cf2)\.
- Stevenson et al\. \[in review\]Edward T\. W\. Stevenson, Mei Ting Mak, Eric T\. Wolf, Denis E\. Sergeev, Tobi Hammond, N\. J\. Mayne, and Miles Cranmer\.ThousandWorlds: A benchmark for climate emulation of potentially habitable exoplanets\.Submitted to the Fortieth Annual Conference on Neural Information Processing Systems \(NeurIPS 2026\), Evaluations and Datasets Track, in review\.
- Suissa et al\. \[2020\]Gabrielle Suissa, Eric T\. Wolf, Ravi kumar Kopparapu, Geronimo L\. Villanueva, Thomas Fauchez, Avi M\. Mandell, Giada Arney, Emily A\. Gilbert, Joshua E\. Schlieder, Thomas Barclay, Elisa V\. Quintana, Eric Lopez, Joseph E\. Rodriguez, and Andrew Vanderburg\.The First Habitable\-zone Earth\-sized Planet from TESS\. III\. Climate States and Characterization Prospects for TOI\-700 d\.*The Astronomical Journal*, 160\(3\):118, August 2020\.ISSN 1538\-3881\.[10\.3847/1538\-3881/aba4b4](https://arxiv.org/doi.org/10.3847/1538-3881/aba4b4)\.
- Teh et al\. \[2005\]Yee Whye Teh, Matthias Seeger, and Michael I\. Jordan\.Semiparametric latent factor models\.In*International Workshop on Artificial Intelligence and Statistics*, pages 333–340\. PMLR, January 2005\.
- Wilson et al\. \[2011\]Andrew Gordon Wilson, David A\. Knowles, and Zoubin Ghahramani\.Gaussian Process Regression Networks, October 2011\.
- Wold et al\. \[2001\]Svante Wold, Michael Sjöström, and Lennart Eriksson\.PLS\-regression: A basic tool of chemometrics\.*Chemometrics and Intelligent Laboratory Systems*, 58\(2\):109–130, October 2001\.ISSN 0169\-7439\.[10\.1016/S0169\-7439\(01\)00155\-1](https://arxiv.org/doi.org/10.1016/S0169-7439(01)00155-1)\.
- Wolf et al\. \[2019\]E\. T\. Wolf, R\. K\. Kopparapu, and J\. Haqq\-Misra\.Simulated Phase\-dependent Spectra of Terrestrial Aquaplanets in M Dwarf Systems\.*The Astrophysical Journal*, 877\(1\):35, May 2019\.ISSN 0004\-637X\.[10\.3847/1538\-4357/ab184a](https://arxiv.org/doi.org/10.3847/1538-4357/ab184a)\.
- Wolf \[2017\]Eric T\. Wolf\.Assessing the Habitability of the TRAPPIST\-1 System Using a 3D Climate Model\.*The Astrophysical Journal Letters*, 839\(1\):L1, April 2017\.ISSN 2041\-8205\.[10\.3847/2041\-8213/aa693a](https://arxiv.org/doi.org/10.3847/2041-8213/aa693a)\.
- Wolf et al\. \[2025\]Eric T\. Wolf, Edward W\. Schwieterman, Jacob Haqq\-Misra, Thomas J\. Fauchez, Sandra T\. Bastelberger, Michaela Leung, Sarah Peacock, Geronimo L\. Villanueva, and Ravi K\. Kopparapu\.Chemistry, Climate, and Transmission Spectra of TRAPPIST\-1 e Explored with a Multimodel Sparse Sampled Ensemble\.*The Planetary Science Journal*, 6\(10\):231, October 2025\.ISSN 2632\-3338\.[10\.3847/PSJ/ae031e](https://arxiv.org/doi.org/10.3847/PSJ/ae031e)\.
- Woodward et al\. \[In preparation\]Hannah Woodward et al\.\[title in preparation\]\.In preparation\.
- Zhe et al\. \[2019\]Shandian Zhe, Wei Xing, and Robert M\. Kirby\.Scalable High\-Order Gaussian Process Regression\.In*Proceedings of the Twenty\-Second International Conference on Artificial Intelligence and Statistics*, pages 2611–2620\. PMLR, April 2019\.

Gaussian Process Latent Factor Regression for Low\-Data, High\-Dimensional Output Problems \(Supplementary Material\)

## Appendix A Adjacent Task\-Aware Representation Learning Methods

Compress\-then\-predict pipelines learn a basis optimized for output reconstruction, not predictability from the inputs – a well\-known limitation\. Several alternatives exist, but none are satisfactory in our setting\. Supervised PCA \[Bair et al\., [2006](https://arxiv.org/html/2606.06576#bib.bib2)\] screens output dimensions by their relevance to the inputs before compression; however, when nuisance variation is distributed across output dimensions rather than confined to a subset – as is typical of structured outputs like spatial fields – screening individual dimensions cannot separate signal from nuisance\. What is needed is a change of the basis itself, informed by the inputs\. Partial least squares \[Wold et al\., [2001](https://arxiv.org/html/2606.06576#bib.bib31)\] and reduced\-rank regression \[Izenman, [1975](https://arxiv.org/html/2606.06576#bib.bib13)\] achieve this, but assume a linear relationship between inputs and outputs\. On the input side, Manifold GPs \[Calandra et al\., [2016](https://arxiv.org/html/2606.06576#bib.bib6)\] jointly learn a nonlinear input transformation with GP regression, demonstrating the benefit of task\-aware representation learning – but for scalar or low\-dimensional outputs, not high\-dimensional output compression\. End\-to\-end neural approaches learn nonlinear input\-aware representations of both inputs and outputs but are typically data\-hungry and lack principled uncertainty quantification\.

## Appendix B GPLFR: Model Specification and Inference

This section complements Section [2](https://arxiv.org/html/2606.06576#S2) by writing the GPLFR probabilistic model explicitly and stating the objective used for fitting and prediction\. We present the most flexible special case: per\-latent kernels with separate lengthscales and amplitudes \(i\.e\., $D_{q}=1\ \forall q$\); restrictions to various lengthscale or amplitude groupings are straightforward to implement and can be useful in weakly identified settings\.

### B\.1 Notation and Data Layout

##### Inputs\.

An input is $\mathbf{x}\in\mathbb{R}^{D_{x}}$, with training inputs $\mathbf{X}=\begin{bmatrix}\mathbf{x}_{1}^{\top}\\ \vdots\\ \mathbf{x}_{N}^{\top}\end{bmatrix}\in\mathbb{R}^{N\times D_{x}}$\.

##### Outputs\.

An output is $\mathbf{y}\in\mathbb{R}^{D_{y}}$, with training outputs $\mathbf{Y}=\begin{bmatrix}\mathbf{y}_{1}^{\top}\\ \vdots\\ \mathbf{y}_{N}^{\top}\end{bmatrix}\in\mathbb{R}^{N\times D_{y}}$\. We assume centered outputs\.

##### Latents and decoder\.

Latent variables for the training set are collected in $\mathbf{Z}\in\mathbb{R}^{N\times D_{z}}$ and decoder weights in $\mathbf{W}\in\mathbb{R}^{D_{y}\times D_{z}}$\. We view both as concatenations of $Q$ blocks: $\mathbf{Z}=\begin{bmatrix}\mathbf{Z}_{1}&\dots&\mathbf{Z}_{Q}\end{bmatrix}$ and $\mathbf{W}=\begin{bmatrix}\mathbf{W}_{1}&\dots&\mathbf{W}_{Q}\end{bmatrix}$, where $\mathbf{Z}_{q}\in\mathbb{R}^{N\times D_{q}}$, $\mathbf{W}_{q}\in\mathbb{R}^{D_{y}\times D_{q}}$, and $\sum_{q}D_{q}=D_{z}$\. The per\-latent special case used in this paper has $D_{q}=1$ for all $q$, so we index by $q=1,\dots,D_{z}$ and let $\mathbf{z}^{(q)}\in\mathbb{R}^{N}$ and $\mathbf{w}^{(q)}\in\mathbb{R}^{D_{y}}$ denote the $q$\-th latent variable \(evaluated at the training inputs\) and its corresponding decoder column\. \(The alternative marginalization, as per Section [2\.2](https://arxiv.org/html/2606.06576#S2.SS2), for the per\-latent case yields the semiparametric latent factor model \(SLFM\) of Teh et al\. \[[2005](https://arxiv.org/html/2606.06576#bib.bib29)\], with $\mathbf{C}_{\text{LMC}}=\sum_{q=1}^{D_{z}}(\mathbf{w}^{(q)}\mathbf{w}^{(q)\top})\otimes\mathbf{K}_{q}+\sigma^{2}\mathbf{I}_{ND_{y}}$\.\)

##### Parameters and hyperparameters\.

We refer to quantities inferred during model fitting as *parameters*, and fixed design choices made during model selection as *hyperparameters* \(omitted from conditioning notation\)\. Within *parameters*, we distinguish between the latents $\mathbf{Z}$ and the remaining “global” parameters\.

### B\.2 Generative Model

GPLFR couples a GP “encoder” prior over latents with a linear\-Gaussian “decoder” from latents to outputs\.

##### Encoder\.

For each latent dimension $q\in\{1,\dots,D_{z}\}$ we introduce a latent function $z^{(q)}(\cdot):\mathbb{R}^{D_{x}}\to\mathbb{R}$ with an independent zero\-mean GP prior $z^{(q)}(\cdot)\sim\mathcal{GP}(0,k_{q}(\cdot,\cdot))$\. On the training inputs this implies

$$\mathbb{R}^{N}\ni\mathbf{z}^{(q)}\mid\mathbf{X},\boldsymbol{\ell}_{q},\eta_{q}\sim\mathcal{N}(0,\mathbf{K}_{q}),\quad K_{q,ij}=k_{q}(\mathbf{x}_{i},\mathbf{x}_{j};\boldsymbol{\ell}_{q},\eta_{q}).$$

Each kernel is parameterized by ARD lengthscales $\boldsymbol{\ell}_{q}\in\mathbb{R}^{D_{x}}$ and an amplitude $\eta_{q}$\. In practice, we use a regularized kernel $\mathbf{K}_{q}\leftarrow\mathbf{K}_{q}+\lambda\mathbf{I}_{N}$, where the latent noise $\lambda$ is a hyperparameter, to help stabilize optimization\. Collecting across $q$ gives

$$p(\mathbf{Z}\mid\mathbf{X},\{\boldsymbol{\ell}_{q},\eta_{q}\}_{q})=\prod_{q=1}^{D_{z}}p(\mathbf{z}^{(q)}\mid\mathbf{X},\boldsymbol{\ell}_{q},\eta_{q}).$$
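
The encoder prior is a sum of independent Gaussian log densities, one per latent dimension, and can be evaluated as in the sketch below\. This is an illustration under stated assumptions rather than the authors' code: it uses an ARD RBF kernel, treats $\eta_{q}$ as a standard deviation \(so the kernel is scaled by $\eta_{q}^{2}$\), and all function names and toy values are hypothetical\.

```python
import numpy as np

def ard_rbf(X, lengthscales, amplitude):
    """ARD RBF Gram matrix; amplitude is treated as a standard deviation."""
    A = X / lengthscales
    sq = (A**2).sum(1)[:, None] + (A**2).sum(1)[None, :] - 2.0 * A @ A.T
    return amplitude**2 * np.exp(-0.5 * np.maximum(sq, 0.0))

def gp_latent_log_prior(Z, X, lengthscales, amplitudes, lam):
    """log p(Z | X) = sum_q log N(z^(q); 0, K_q + lam * I_N)."""
    n, d_z = Z.shape
    total = 0.0
    for q in range(d_z):
        K = ard_rbf(X, lengthscales[q], amplitudes[q]) + lam * np.eye(n)
        L = np.linalg.cholesky(K)
        alpha = np.linalg.solve(L, Z[:, q])          # L^{-1} z^(q)
        logdet = 2.0 * np.log(np.diag(L)).sum()
        total += -0.5 * (alpha @ alpha + logdet + n * np.log(2.0 * np.pi))
    return total

# Toy usage: N = 5 examples, D_x = 3 inputs, D_z = 2 latents (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
Z = rng.normal(size=(5, 2))
ell = np.array([[1.0, 2.0, 0.5], [0.7, 1.5, 1.0]])   # one ARD vector per latent
eta = np.array([1.0, 0.8])
lam = 1e-2
logp = gp_latent_log_prior(Z, X, ell, eta, lam)
```

The added $\lambda\mathbf{I}_{N}$ term also keeps the Cholesky factorization well conditioned when training inputs lie close together\.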

##### Decoder\.

We use a linear\-Gaussian decoder

$$\mathbf{y}\mid\mathbf{z},\mathbf{W},\sigma\sim\mathcal{N}(\mathbf{W}\mathbf{z},\sigma^{2}\mathbf{I}_{D_{y}})$$

with weight prior

$$\mathbf{W}\sim\mathcal{MN}\big(0,\mathbf{B},\mathbf{I}_{D_{z}}\big).$$

Here $\mathbf{B}\in\mathbb{R}^{D_{y}\times D_{y}}$ is an output\-space row covariance\. In the high\-dimensional output regime we default to $\mathbf{B}=\mathbf{I}_{D_{y}}$; structured choices for $\mathbf{B}$ are also possible\.

### B\.3Collapsed Decoder Likelihood

For one output dimension $j$ (a column of $\mathbf{Y}$),

$$\mathbf{y}^{(j)}\mid\mathbf{Z},\mathbf{w}^{(j)},\sigma\sim\mathcal{N}(\mathbf{Z}\mathbf{w}^{(j)},\sigma^{2}\mathbf{I}_{N}).$$

Marginalizing out $\mathbf{W}\sim\mathcal{MN}\big(0,\mathbf{B},\mathbf{I}_{D_{z}}\big)$ yields

$$\mathrm{vec}(\mathbf{Y})\mid\mathbf{Z},\mathbf{B},\sigma\sim\mathcal{N}(0,\mathbf{C}),\quad\mathbf{C}=\mathbf{B}\otimes\big(\mathbf{Z}\mathbf{Z}^{\top}\big)+\sigma^{2}\mathbf{I}_{ND_{y}}.$$

This is the prior predictive distribution of $\mathbf{Y}$ conditional on $\mathbf{Z}$.

In the independence case $\mathbf{B}=\mathbf{I}_{D_{y}}$, which we use throughout this paper, all columns are conditionally independent given $\mathbf{Z}$, so this reduces to

$$p(\mathbf{Y}\mid\mathbf{Z},\sigma)=\prod_{j=1}^{D_{y}}\mathcal{N}\big(\mathbf{y}^{(j)};0,\mathbf{C}_{N}\big),\quad\mathbf{C}_{N}=\mathbf{Z}\mathbf{Z}^{\top}+\sigma^{2}\mathbf{I}_{N}.$$
##### Efficient evaluation of the collapsed likelihood\.

The collapsed likelihood requires $\mathbf{C}^{-1}$ and $\log\det\mathbf{C}$. We evaluate these via Woodbury identities for diagonal plus low-rank structure. For the $\mathbf{B}=\mathbf{I}_{D_{y}}$ case, by the matrix inversion lemma (Woodbury identity) and matrix determinant lemma,

$$\mathbf{C}_{N}^{-1}=\frac{1}{\sigma^{2}}\big(\mathbf{I}_{N}-\mathbf{Z}\mathbf{D}^{-1}\mathbf{Z}^{\top}\big),\quad\log\det\mathbf{C}_{N}=(N-D_{z})\log\sigma^{2}+\log\det\mathbf{D},$$

where $\mathbf{D}=\sigma^{2}\mathbf{I}_{D_{z}}+\mathbf{Z}^{\top}\mathbf{Z}$. The per-column quadratic form is

$$\mathbf{y}^{(j)\top}\mathbf{C}_{N}^{-1}\mathbf{y}^{(j)}=\frac{1}{\sigma^{2}}\left(\|\mathbf{y}^{(j)}\|_{2}^{2}-(\mathbf{Z}^{\top}\mathbf{y}^{(j)})^{\top}\mathbf{D}^{-1}(\mathbf{Z}^{\top}\mathbf{y}^{(j)})\right).$$

Summing over $j$, the computational complexity is $O(ND_{z}D_{y}+D_{z}^{2}D_{y}+ND_{z}^{2}+D_{z}^{3})$, which is typically dominated by the $O(ND_{z}D_{y})$ term. The same approach applies for general $\mathbf{B}$ using Kronecker structure.
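The identities above can be checked numerically. The following is a sketch for the $\mathbf{B}=\mathbf{I}_{D_{y}}$ case only; `collapsed_loglik` and `naive_loglik` are hypothetical names, and the naive version exists purely as a reference that forms the $N\times N$ covariance explicitly.

```python
import numpy as np

def collapsed_loglik(Y, Z, sigma):
    """log p(Y | Z, sigma) for B = I, using the Woodbury and determinant lemmas."""
    N, Dy = Y.shape
    Dz = Z.shape[1]
    D = sigma**2 * np.eye(Dz) + Z.T @ Z            # D_z x D_z
    L = np.linalg.cholesky(D)                      # D = L L^T
    V = np.linalg.solve(L, Z.T @ Y)                # L^{-1} Z^T Y, so sum(V^2) = sum_j (Z^T y_j)^T D^{-1} (Z^T y_j)
    quad = (np.sum(Y**2) - np.sum(V**2)) / sigma**2
    logdet = (N - Dz) * np.log(sigma**2) + 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (quad + Dy * logdet + N * Dy * np.log(2.0 * np.pi))

def naive_loglik(Y, Z, sigma):
    """Reference implementation that builds C_N = Z Z^T + sigma^2 I directly."""
    N, Dy = Y.shape
    C = Z @ Z.T + sigma**2 * np.eye(N)
    _, logdet = np.linalg.slogdet(C)
    quad = np.sum(Y * np.linalg.solve(C, Y))
    return -0.5 * (quad + Dy * logdet + N * Dy * np.log(2.0 * np.pi))
```

The collapsed version never forms an $N\times N$ matrix; its dominant cost is the product `Z.T @ Y`.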

### B\.4Posterior and MAP Estimation

Let $\boldsymbol{\phi}\equiv\{\mathbf{Z},\mathbf{B},\sigma,\{\boldsymbol{\ell}_{q},\eta_{q}\}_{q}\}$ collect the parameters and $\mathcal{D}\equiv\{\mathbf{X},\mathbf{Y}\}$ the dataset. The (tempered) posterior is

$$p(\boldsymbol{\phi}\mid\mathcal{D})\propto p(\mathbf{Y}\mid\mathbf{Z},\mathbf{B},\sigma)^{\beta}\,p(\mathbf{Z}\mid\mathbf{X},\{\boldsymbol{\ell}_{q},\eta_{q}\}_{q})\,p(\{\boldsymbol{\ell}_{q},\eta_{q}\}_{q})\,p(\sigma^{2})\,p(\mathbf{B}).\tag{7}$$

We approximate this posterior as $p(\boldsymbol{\phi}\mid\mathcal{D})\approx\delta(\boldsymbol{\phi}-\boldsymbol{\phi}^{\star})$ where

$$\boldsymbol{\phi}^{\star}=\arg\max_{\boldsymbol{\phi}}\log p(\boldsymbol{\phi}\mid\mathcal{D}).$$

The priors $p(\{\boldsymbol{\ell}_{q},\eta_{q}\}_{q})$, $p(\sigma^{2})$, $p(\mathbf{B})$ act as regularizers under MAP.
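To make the structure of the MAP objective concrete, here is a small sketch of the tempered log-posterior for $\mathbf{B}=\mathbf{I}_{D_{y}}$. It is an illustration under assumptions: the hyperparameter priors are collapsed into a placeholder scalar `log_prior` (their actual forms are not reproduced here), the kernel matrices `Ks` are taken as given, and the covariance is formed naively rather than with the Woodbury shortcut.

```python
import numpy as np

def gauss_logpdf_zero_mean(V, C):
    """Sum over the columns v of V of log N(v; 0, C)."""
    n, m = V.shape
    L = np.linalg.cholesky(C)
    A = np.linalg.solve(L, V)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (np.sum(A**2) + m * logdet + m * n * np.log(2.0 * np.pi))

def tempered_log_posterior(Y, Z, Ks, sigma, beta, log_prior=0.0):
    """beta * log p(Y | Z, sigma) + sum_q log N(z^(q); 0, K_q) + log_prior, up to a constant."""
    N = Y.shape[0]
    C_N = Z @ Z.T + sigma**2 * np.eye(N)
    loglik = gauss_logpdf_zero_mean(Y, C_N)
    log_gp = sum(gauss_logpdf_zero_mean(Z[:, [q]], Ks[q]) for q in range(Z.shape[1]))
    return beta * loglik + log_gp + log_prior
```

Setting `beta` below one scales down only the likelihood term, so the GP prior on the latents carries relatively more weight.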

##### Likelihood tempering\.

We use a tempered likelihood with inverse-temperature $\beta\in(0,1]$ to mitigate decoder misspecification. This misspecification is most acute when outputs exhibit strong correlations that are not well approximated by the low-rank latent-factor structure; for example, with $\mathbf{B}=\mathbf{I}$, GPLFR assumes no residual covariance beyond that induced by the shared latents. The untempered likelihood can therefore overstate how informative each output dimension is when (i) the correlated residual structure is high-rank, or (ii) correlations have a large residual component that is weakly tied to $\mathbf{x}$ (e.g. input-independent nuisance fields, or correlation patterns that vary across $\mathbf{x}$ in ways a global linear decoder cannot capture). Tempering in response to model misspecification has a decision-theoretic justification as a generalized Bayesian update [Bissiri et al., [2016](https://arxiv.org/html/2606.06576#bib.bib3)]. Practically, in GPLFR an untempered likelihood can overwhelm the GP prior, leading to a brittle latent geometry.

##### Optimization\.

We jointly optimize the (tempered) log-posterior ([7](https://arxiv.org/html/2606.06576#A2.E7)) via Adam, using separate learning rates for the latents $\mathbf{Z}$ (which are local to datapoints) and the global parameters (in general: $\{\boldsymbol{\ell}_{q},\eta_{q}\}_{q},\sigma,\mathbf{B}$), since the effective gradient scale differs between the two groups.

### B\.5Prediction

For a test input $\mathbf{x}_{*}$, the posterior predictive distribution is

$$\begin{aligned}p(\mathbf{y}_{*}\mid\mathbf{x}_{*},\mathcal{D})&=\int p(\mathbf{y}_{*}\mid\mathbf{x}_{*},\mathbf{Y},\boldsymbol{\phi})\,p(\boldsymbol{\phi}\mid\mathcal{D})\,d\boldsymbol{\phi}\\&\approx p(\mathbf{y}_{*}\mid\mathbf{x}_{*},\mathbf{Y},\boldsymbol{\phi}^{\star})\qquad\text{(MAP approximation).}\end{aligned}$$

We describe sampling from the MAP-approximated predictive distribution in two steps: encoder then decoder (we omit the $\star$ superscript for brevity).

##### 1\. Encoder\.

Sample the test latent from the posterior predictive over latents $p(\mathbf{z}_{*}\mid\mathbf{Z},\mathbf{X},\{\boldsymbol{\ell}_{q},\eta_{q}\}_{q})$. For dimension $q$,

$$z_{*}^{(q)}\mid\mathbf{z}^{(q)},\mathbf{X},\boldsymbol{\ell}_{q},\eta_{q}\sim\mathcal{N}\Big(\mathbf{k}_{*,q}^{\top}\mathbf{K}_{q}^{-1}\mathbf{z}^{(q)},\ k_{**,q}-\mathbf{k}_{*,q}^{\top}\mathbf{K}_{q}^{-1}\mathbf{k}_{*,q}\Big),$$

where

$$K_{q,ij}=k_{q}\big(\mathbf{x}_{i},\mathbf{x}_{j};\boldsymbol{\ell}_{q},\eta_{q}\big),\quad\mathbf{k}_{*,q}=k_{q}\big(\mathbf{x}_{*},\mathbf{X};\boldsymbol{\ell}_{q},\eta_{q}\big),\quad k_{**,q}=k_{q}\big(\mathbf{x}_{*},\mathbf{x}_{*};\boldsymbol{\ell}_{q},\eta_{q}\big).$$

Stacking these over $q$ gives

$$\mathbf{z}_{*}\mid\mathbf{Z},\mathbf{X},\{\boldsymbol{\ell}_{q},\eta_{q}\}_{q}\sim\mathcal{N}\big(\boldsymbol{\mu}_{z_{*}},\;\mathrm{diag}(\boldsymbol{\sigma}^{2}_{z_{*}})\big).$$
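The per-latent conditional is the standard GP predictive equation. A minimal sketch for one latent dimension follows; the function name `latent_predictive` is hypothetical, and the kernel quantities are assumed to be precomputed, with `K` already including any latent-noise term.

```python
import numpy as np

def latent_predictive(K, k_star, k_ss, z):
    """Mean and variance of z_*^(q) given the training latents z^(q).

    K      : (N, N) training kernel matrix for latent q
    k_star : (N,)   cross-covariances k_q(x_*, X)
    k_ss   : float  prior variance k_q(x_*, x_*)
    z      : (N,)   training latents for dimension q
    """
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, z))   # K^{-1} z
    v = np.linalg.solve(L, k_star)                        # L^{-1} k_*
    mean = k_star @ alpha
    var = k_ss - v @ v                                    # k_** - k_*^T K^{-1} k_*
    return mean, var
```

Calling this once per latent dimension and stacking the results gives the mean vector and diagonal variance of the stacked distribution over $\mathbf{z}_{*}$.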

##### 2\. Decoder\.

Sample the test output given the test latent from the latent-conditional posterior predictive $p(\mathbf{y}_{*}\mid\mathbf{z}_{*},\mathbf{Y},\mathbf{Z},\mathbf{B},\sigma)$. After marginalizing $\mathbf{W}$ and conditioning on $\mathbf{z}_{*}$, the joint Gaussian over training outputs and the test output is

$$\begin{bmatrix}\mathrm{vec}(\mathbf{Y})\\ \mathbf{y}_{*}\end{bmatrix}\Bigm|\;\mathbf{z}_{*},\mathbf{Z},\mathbf{B},\sigma\sim\mathcal{N}\left(\mathbf{0},\begin{bmatrix}\mathbf{C}_{YY}&\mathbf{C}_{Y*}\\ \mathbf{C}_{*Y}&\mathbf{C}_{**}\end{bmatrix}\right),$$

with

$$\mathbf{C}_{YY}=\mathbf{B}\otimes(\mathbf{Z}\mathbf{Z}^{\top})+\sigma^{2}\mathbf{I}_{ND_{y}},\quad\mathbf{C}_{Y*}=\mathbf{B}\otimes(\mathbf{Z}\mathbf{z}_{*}),\quad\mathbf{C}_{**}=\mathbf{B}\otimes(\mathbf{z}_{*}^{\top}\mathbf{z}_{*})+\sigma^{2}\mathbf{I}_{D_{y}}.$$

Conditioning on observed $\mathbf{Y}$ gives

$$\mathbf{y}_{*}\mid\mathbf{z}_{*},\mathbf{Y},\mathbf{Z},\mathbf{B},\sigma\sim\mathcal{N}(\boldsymbol{\mu}_{y_{*}},\boldsymbol{\Sigma}_{y_{*}}),\quad\boldsymbol{\mu}_{y_{*}}=\mathbf{C}_{*Y}\mathbf{C}_{YY}^{-1}\mathrm{vec}(\mathbf{Y}),\quad\boldsymbol{\Sigma}_{y_{*}}=\mathbf{C}_{**}-\mathbf{C}_{*Y}\mathbf{C}_{YY}^{-1}\mathbf{C}_{Y*}.$$

When $\mathbf{B}=\mathbf{I}_{D_{y}}$, $\mathbf{C}_{YY}$ is block-diagonal and this reduces to independent per-$j$ conditioning with $N\times N$ covariance $\mathbf{C}_{N}=\mathbf{Z}\mathbf{Z}^{\top}+\sigma^{2}\mathbf{I}_{N}$.
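For $\mathbf{B}=\mathbf{I}_{D_{y}}$, the per-output conditioning can be written with $D_{z}\times D_{z}$ solves only. Applying the Woodbury identity to $\mathbf{C}_{N}$ gives mean $\mathbf{z}_{*}^{\top}\mathbf{D}^{-1}\mathbf{Z}^{\top}\mathbf{y}^{(j)}$ and variance $\sigma^{2}(1+\mathbf{z}_{*}^{\top}\mathbf{D}^{-1}\mathbf{z}_{*})$ with $\mathbf{D}=\sigma^{2}\mathbf{I}_{D_{z}}+\mathbf{Z}^{\top}\mathbf{Z}$. This rearrangement is our own derivation from the equations above rather than a formula stated in the text, and `decoder_predictive` is a hypothetical name.

```python
import numpy as np

def decoder_predictive(Y, Z, z_star, sigma):
    """Predictive mean (D_y,) and shared scalar variance of y_* given z_*, for B = I."""
    Dz = Z.shape[1]
    D = sigma**2 * np.eye(Dz) + Z.T @ Z
    mean = z_star @ np.linalg.solve(D, Z.T @ Y)                   # z_*^T D^{-1} Z^T Y
    var = sigma**2 * (1.0 + z_star @ np.linalg.solve(D, z_star))  # same for every output j
    return mean, var
```

With $\mathbf{B}=\mathbf{I}_{D_{y}}$ the predictive covariance is `var` times the identity, so a single scalar is enough.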

##### Predictive mean and covariance\.

With samples $\{\mathbf{y}_{*}^{[m]}\}_{m=1}^{M}$ from the MAP predictive,

$$\begin{aligned}\mathbb{E}(\mathbf{y}_{*}\mid\mathcal{D})&\approx\hat{\mathbf{y}}_{*}\equiv\frac{1}{M}\sum_{m}\mathbf{y}_{*}^{[m]},\\ \mathrm{Cov}(\mathbf{y}_{*}\mid\mathcal{D})&\approx\hat{\boldsymbol{\Sigma}}_{*}\equiv\frac{1}{M-1}\sum_{m}\big(\mathbf{y}_{*}^{[m]}-\hat{\mathbf{y}}_{*}\big)\big(\mathbf{y}_{*}^{[m]}-\hat{\mathbf{y}}_{*}\big)^{\top}.\end{aligned}$$

In practice, when uncertainty is not required, we compute $\hat{\mathbf{y}}_{*}$ using the analytic predictive means of the encoder and decoder conditionals to avoid Monte Carlo noise and reduce compute.
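The two-step sampler and the Monte Carlo estimates can be sketched as follows for $\mathbf{B}=\mathbf{I}_{D_{y}}$. The encoder moments `mu_z` and `var_z` are assumed to be precomputed, `sample_predictive` is a hypothetical name, and the decoder moments use the Woodbury form $\mathbf{z}_{*}^{\top}\mathbf{D}^{-1}\mathbf{Z}^{\top}\mathbf{Y}$ and $\sigma^{2}(1+\mathbf{z}_{*}^{\top}\mathbf{D}^{-1}\mathbf{z}_{*})$, which is our rearrangement of the conditioning equations.

```python
import numpy as np

def sample_predictive(Y, Z, mu_z, var_z, sigma, M=2000, seed=0):
    """Monte Carlo estimate of the predictive mean and covariance of y_*."""
    rng = np.random.default_rng(seed)
    Dz, Dy = Z.shape[1], Y.shape[1]
    D = sigma**2 * np.eye(Dz) + Z.T @ Z
    W_mean = np.linalg.solve(D, Z.T @ Y)           # D^{-1} Z^T Y, shape (D_z, D_y)
    D_inv = np.linalg.inv(D)
    # Step 1 (encoder): draw z_* ~ N(mu_z, diag(var_z)), one row per sample.
    z = mu_z + np.sqrt(var_z) * rng.standard_normal((M, Dz))
    # Step 2 (decoder): draw y_* | z_* with isotropic variance per sample.
    means = z @ W_mean
    var = sigma**2 * (1.0 + np.einsum("md,de,me->m", z, D_inv, z))
    y = means + np.sqrt(var)[:, None] * rng.standard_normal((M, Dy))
    y_hat = y.mean(axis=0)
    Sigma_hat = np.cov(y, rowvar=False)            # np.cov uses the 1/(M-1) normalization
    return y_hat, Sigma_hat
```

When only a point prediction is needed, `mu_z @ W_mean` is the analytic shortcut that avoids sampling altogether.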

## Appendix CPCA\-GP

Gaussian process regression on PCA scores \(PCA\-GP, including the extensions below\) is the key baseline in our experiments, since it is the standard compress\-then\-predict model for low\-data, high\-dimensional outputs, and is structurally similar\. Here we briefly review PCA\-GP and state the base configuration used in our experiments\.

PCA-GP is a two-stage compress-then-predict pipeline. First, we fit PCA on the (centered) training outputs to obtain a $D_{z}$-dimensional basis $\boldsymbol{\Phi}\in\mathbb{R}^{D_{z}\times D_{y}}$, with training scores $\mathbf{w}_{i}=\boldsymbol{\Phi}\mathbf{y}_{i}$. Second, we regress each score dimension independently against the inputs with a GP. For each $q\in\{1,\dots,D_{z}\}$ we model

$$w_{i,q}=f_{q}(\mathbf{x}_{i})+\xi_{i,q},\quad f_{q}(\cdot)\sim\mathcal{GP}(0,k_{q}(\cdot,\cdot)),\quad\xi_{i,q}\sim\mathcal{N}(0,\sigma_{\xi,q}^{2}).$$

As with GPLFR, it is a modeling choice how strongly to tie kernel parameters across components, including the score noise (nugget) term. We found a shared score noise variance to be a robust default and fix it across experiments, while the degree of sharing for kernel lengthscales and amplitudes is chosen on a per-experiment basis. Given a test input $\mathbf{x}_{*}$, we form $\hat{\mathbf{w}}_{*}$ by GP prediction in each score dimension and predict the output as $\hat{\mathbf{y}}_{*}=\boldsymbol{\Phi}^{\top}\hat{\mathbf{w}}_{*}$. Predictive uncertainty is obtained by combining the independent per-score GP predictive variances through the linear reconstruction.
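A compact sketch of the two-stage pipeline is given below. It is simplified relative to the configuration described here: hyperparameters are fixed rather than fit by marginal likelihood, a single isotropic unit-amplitude RBF kernel is shared across all scores, and the training mean is added back after reconstruction. The names `rbf` and `pca_gp_predict` are hypothetical.

```python
import numpy as np

def rbf(A, B, ell=1.0):
    """Isotropic unit-amplitude RBF kernel between the rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / ell**2)

def pca_gp_predict(X, Y, X_star, Dz, ell=1.0, noise=1e-2):
    y_mean = Y.mean(axis=0)
    Yc = Y - y_mean
    # Stage 1: PCA basis Phi (D_z x D_y, orthonormal rows) and scores w_i = Phi y_i.
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    Phi = Vt[:Dz]
    W = Yc @ Phi.T
    # Stage 2: independent GP regression on each score (shared kernel in this sketch).
    K = rbf(X, X, ell) + noise * np.eye(len(X))
    Ks = rbf(X_star, X, ell)
    W_hat = Ks @ np.linalg.solve(K, W)
    var_w = 1.0 - np.einsum("ij,ji->i", Ks, np.linalg.solve(K, Ks.T))
    var_w = np.maximum(var_w, 0.0)      # clip tiny negative values caused by round-off
    # Linear reconstruction of the mean, and of the independent score variances.
    Y_hat = W_hat @ Phi + y_mean
    var_y = var_w[:, None] * (Phi**2).sum(axis=0)[None, :]
    return Y_hat, var_y
```

The basis `Phi` is chosen from the outputs alone, before any regression on the inputs takes place.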

## Appendix DExperiment Details

### D\.1Synthetic Benchmark

#### D\.1\.1Data Generating Process

We generate outputs $\mathbf{y}\in\mathbb{R}^{D_{y}}$ from inputs $\mathbf{x}\in\mathbb{R}^{D_{x}}$ as

$$\mathbf{y}=\mathbf{W}_{\text{sig}}\mathbf{z}_{\text{sig}}(\mathbf{x})+\mathbf{y}_{\text{nuis}}+\boldsymbol{\epsilon},\quad\boldsymbol{\epsilon}\sim\mathcal{N}(0,\sigma_{\epsilon}^{2}\mathbf{I}_{D_{y}}).$$

##### Predictable variation\.

Choose $D_{\text{sig}}$ signal latents and draw each as a function of $\mathbf{x}$:

$$z_{\text{sig}}^{(q)}(\cdot)\sim\mathcal{GP}(0,\,\sigma_{\text{sig}}^{2}k(\cdot,\cdot))\quad\text{for }q=1,\dots,D_{\text{sig}}.$$

For training inputs $\{\mathbf{x}_{i}\}_{i=1}^{N}$, this implies

$$\mathbf{z}^{(q)}_{\text{sig}}\equiv\big(z_{\text{sig}}^{(q)}(\mathbf{x}_{1}),\dots,z_{\text{sig}}^{(q)}(\mathbf{x}_{N})\big)^{\top}\sim\mathcal{N}(0,\,\sigma_{\text{sig}}^{2}\mathbf{K}),\quad K_{ij}=k(\mathbf{x}_{i},\mathbf{x}_{j}).$$

Stacking across $q$ gives $\mathbf{Z}_{\text{sig}}\in\mathbb{R}^{N\times D_{\text{sig}}}$, where the $q$-th column is $\mathbf{z}^{(q)}_{\text{sig}}$.

##### Output structure\.

Outputs live on a 2D grid with $D_{y}=HW$ locations. We index output dimensions by $j\in\{1,\dots,D_{y}\}$ and associate each $j$ with a coordinate vector $\mathbf{r}_{j}=(u_{j},v_{j})^{\top}$, where $u=0,\ldots,H-1$ and $v=0,\ldots,W-1$. The matrix $\mathbf{W}_{\text{sig}}\in\mathbb{R}^{D_{y}\times D_{\text{sig}}}$ maps latents to the output grid. Each column $q$ is a localized squared-exponential blob with centre $(\bar{u}_{q},\bar{v}_{q})$ and scale $s_{q}$:

$$[\mathbf{W}_{\text{sig}}]_{j,q}=\exp\left(-\frac{(u_{j}-\bar{u}_{q})^{2}+(v_{j}-\bar{v}_{q})^{2}}{2s_{q}^{2}}\right),\quad q=1,\dots,D_{\text{sig}}.$$

##### Nuisance variation\.

For each sample $i$, draw an independent random field over the output grid:

$$\mathbf{y}_{\text{nuis},i}\sim\mathcal{N}(0,\boldsymbol{\Sigma}_{\text{nuis}}),$$

where $\boldsymbol{\Sigma}_{\text{nuis}}$ is a spatial covariance across output dimensions. We parameterize $\boldsymbol{\Sigma}_{\text{nuis}}$ by an RBF on grid coordinates

$$\Sigma_{\text{nuis},jj'}=\sigma^{2}_{\text{nuis}}\exp\left(-\frac{\|\mathbf{r}_{j}-\mathbf{r}_{j'}\|_{2}^{2}}{2\ell_{\text{nuis}}^{2}}\right).$$

Stacking nuisance rows gives

$$\mathbf{Y}_{\text{nuis}}=\begin{bmatrix}\mathbf{y}_{\text{nuis},1}^{\top}\\ \vdots\\ \mathbf{y}_{\text{nuis},N}^{\top}\end{bmatrix}\in\mathbb{R}^{N\times D_{y}}.$$

In expectation, nuisance contributes only diagonal energy in example space: $\mathbb{E}[\mathbf{Y}_{\text{nuis}}\mathbf{Y}_{\text{nuis}}^{\top}]=\mathrm{tr}(\boldsymbol{\Sigma}_{\text{nuis}})\mathbf{I}_{N}$, where $\mathrm{tr}(\boldsymbol{\Sigma}_{\text{nuis}})=D_{y}\sigma_{\text{nuis}}^{2}$ for our RBF parameterization.

##### Inputs\.

Drawn as $\mathbf{x}_{i}\sim\mathcal{N}(0,\mathbf{I}_{D_{x}})$ and split randomly into training, validation, and test sets.

##### Settings used in main text\.

$D_{y}=16^{2}=256$, $D_{x}=3$, $D_{\text{sig}}=6$; $k(\cdot,\cdot)$ was an RBF kernel with lengthscales $\ell_{q}\sim\mathcal{U}(1,3)$, shared across input dimensions; $\ell_{\text{nuis}}=2$; $s_{q}\sim\mathcal{U}(1,2)$, $\bar{u}_{q}\sim\mathcal{U}(0,H-1)$, $\bar{v}_{q}\sim\mathcal{U}(0,W-1)$. For Figures [2](https://arxiv.org/html/2606.06576#S4.F2) and [3](https://arxiv.org/html/2606.06576#S4.F3): $(\sigma_{\text{sig}},\sigma_{\text{nuis}},\sigma_{\epsilon})=(1,1,0.01)$. For Figure [4](https://arxiv.org/html/2606.06576#S4.F4): $(\sigma_{\text{sig}},\sigma_{\epsilon})=(1,0)$ and $\sigma_{\text{nuis}}$ is varied.
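The data-generating process can be sketched end to end as below. This is an illustrative reconstruction from the description, not the authors' generator: `make_synthetic` and `rbf_gram` are hypothetical names, the `jitter` term is added only for numerical stability, and the random seed handling is our own.

```python
import numpy as np

def rbf_gram(A, B, ell):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / ell**2)

def make_synthetic(N=200, H=16, W=16, Dx=3, Dsig=6, s_sig=1.0, s_nuis=1.0,
                   s_eps=0.01, ell_nuis=2.0, jitter=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    Dy = H * W
    X = rng.standard_normal((N, Dx))
    # Predictable variation: one GP draw per signal latent, lengthscale ~ U(1, 3).
    Z = np.empty((N, Dsig))
    for q in range(Dsig):
        K = s_sig**2 * rbf_gram(X, X, rng.uniform(1.0, 3.0)) + jitter * np.eye(N)
        Z[:, q] = np.linalg.cholesky(K) @ rng.standard_normal(N)
    # Output structure: each decoder column is a Gaussian blob on the H x W grid.
    u, v = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    r = np.stack([u.ravel(), v.ravel()], axis=1).astype(float)       # (D_y, 2)
    centres = np.column_stack([rng.uniform(0, H - 1, Dsig), rng.uniform(0, W - 1, Dsig)])
    scales = rng.uniform(1.0, 2.0, Dsig)
    d2 = ((r[:, None, :] - centres[None, :, :]) ** 2).sum(-1)        # (D_y, D_sig)
    W_sig = np.exp(-0.5 * d2 / scales**2)
    # Nuisance variation: an independent spatially correlated field per example.
    Sigma = s_nuis**2 * rbf_gram(r, r, ell_nuis) + jitter * np.eye(Dy)
    Y_nuis = rng.standard_normal((N, Dy)) @ np.linalg.cholesky(Sigma).T
    Y_sig = Z @ W_sig.T
    Y = Y_sig + Y_nuis + s_eps * rng.standard_normal((N, Dy))
    return X, Y, Y_sig, Y_nuis
```

Returning `Y_sig` and `Y_nuis` separately is what allows signal-only metrics to be computed later.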

#### D\.1\.2Model Settings and Hyperparameters

##### Splits\.

Fixed split with sizes 1600 (train-pool), 500 (validation), 500 (test). Training subsets in Figure [2](https://arxiv.org/html/2606.06576#S4.F2) come from a nested $N\in\{50,100,200,400,800\}$ sweep across sampling seeds 0–4.

##### Preprocessing\.

For each $(\text{seed},N)$ we fit z-score standardizers on the training subset. We standardize per-dimension for both $\mathbf{X}$ and $\mathbf{Y}$.

##### Tuning procedure\.

Hyperparameters for both GPLFR and PCA-GP are selected by median $\mathrm{RMSE}_{\text{sig}}$ on a fixed 500-example validation set across the five dataset seeds of the 400-example training set and then held fixed across all shown runs.

##### Early stopping\.

All models use early stopping based on the *observed* error $\mathrm{RMSE}_{\text{obs}}=\sqrt{\frac{1}{N_{\text{val}}\,D_{y}}\sum_{i=1}^{N_{\text{val}}}\|\hat{\mathbf{y}}_{i}-\mathbf{y}_{i}\|_{2}^{2}}$ on the validation set, since in practice the true signal is not available for model selection.
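The preprocessing and the early-stopping metric are simple enough to sketch directly. The helper names are hypothetical, and the guard for constant columns is our own addition rather than something described in the text.

```python
import numpy as np

def fit_standardizer(A):
    """Per-dimension z-score statistics, fit on the training subset only."""
    mu, sd = A.mean(axis=0), A.std(axis=0)
    return mu, np.where(sd > 0, sd, 1.0)   # assumption: leave constant columns unscaled

def rmse_obs(Y_hat, Y):
    """sqrt( (1 / (N_val * D_y)) * sum_i ||y_hat_i - y_i||_2^2 )."""
    return np.sqrt(np.mean((Y_hat - Y) ** 2))
```

The same `(mu, sd)` pair would then be applied unchanged to the validation and test sets.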

##### GPLFR\.

*Kernel.* We use per-latent stationary ARD RBF kernels ($D_{q}=1\ \forall q$) with separate lengthscales and a single shared amplitude fixed to one. GPLFR was relatively insensitive to the amplitude grouping: fixing to one usually performed best, with a shared learned amplitude only marginally worse, and per-latent amplitudes again slightly worse. This is consistent with GPLFR's amplitudes being weakly identified with latent scaling.

*Optimization\.*We fit parameters via MAP estimation using Adam\.

*Priors\.*We use the priors in Table[D\.1](https://arxiv.org/html/2606.06576#A4.T1)\(on the standardized data\)\.

*Hyperparameters\.*Tuned values are listed in Table[D\.2](https://arxiv.org/html/2606.06576#A4.T2)\.

##### PCA\-GP\.

*Kernel\.*As with GPLFR, we use per\-latent ARD RBF kernels with separate lengthscales\. We learn a shared latent noise \(nugget\)\. On amplitude grouping, per\-latent performed best, with a single shared or fixed amplitude significantly hurting performance\.

*Optimization\.*We maximize the marginal likelihood using L\-BFGS\-B\. For the larger latent dimensions in Figure[3](https://arxiv.org/html/2606.06576#S4.F3)we increased the number of L\-BFGS\-B steps from 500 to 1500 \(keeping early stopping\) to ensure convergence; a further increase to 3000 steps for the largest dimensions did not change the result\.

*Hyperparameters\.*Tuned values are listed in Table[D\.2](https://arxiv.org/html/2606.06576#A4.T2)\.

Table D.1: GPLFR Priors for all experiments.

| Model | Hyperparameter | Selected | Candidates |
| --- | --- | --- | --- |
| GPLFR | Lengthscale grouping | per-latent | [per-latent, shared] |
| GPLFR | Amplitude grouping | fixed to 1 | [per-latent, shared, fixed to 1] |
| GPLFR | Inverse-temperature ($\beta$) | 0.1 | [0.03, 0.1, 0.3] |
| GPLFR | Latent noise ($\lambda$) | $10^{-5}$ | [$10^{-5}$, $10^{-3}$, $10^{-1}$] |
| GPLFR | Latent learning rate | 0.01 | [0.001, 0.003, 0.01, 0.03] |
| GPLFR | Global learning rate | 0.003 | [0.001, 0.003, 0.01, 0.03] |
| PCA-GP | Lengthscale grouping | per-latent | [per-latent, shared] |
| PCA-GP | Amplitude grouping | per-latent | [per-latent, shared, fixed to 1] |

Table D.2: *Synthetic benchmark:* Hyperparameters.

#### D\.1\.3Signal versus nuisance capture

Let $\mathbf{Y}_{\text{sig,test}}$ and $\mathbf{Y}_{\text{nuis,test}}$ denote the held-out signal and nuisance components of the test outputs (known from the data-generating process). For a learned output subspace with orthogonal projector $\mathbf{P}_{\hat{\mathcal{S}}}$, we define the signal and nuisance capture scores

$$C_{\text{sig}}(\hat{\mathcal{S}})=\frac{\|\mathbf{Y}_{\text{sig,test}}\,\mathbf{P}_{\hat{\mathcal{S}}}\|_{F}^{2}}{\|\mathbf{Y}_{\text{sig,test}}\|_{F}^{2}},\qquad C_{\text{nuis}}(\hat{\mathcal{S}})=\frac{\|\mathbf{Y}_{\text{nuis,test}}\,\mathbf{P}_{\hat{\mathcal{S}}}\|_{F}^{2}}{\|\mathbf{Y}_{\text{nuis,test}}\|_{F}^{2}},$$

i.e., the fractions of signal and nuisance energy retained after projection onto $\hat{\mathcal{S}}$. Both scores rise with $D_{z}$ as the subspace expands, but GPLFR consistently captures more signal and less nuisance than PCA-GP.
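A capture score reduces to a projection followed by a ratio of squared Frobenius norms. In the sketch below, `capture` is a hypothetical helper and `basis` is assumed to be a full-column-rank $D_{y}\times D_{z}$ matrix whose columns span the learned subspace (for example, the transposed PCA basis or a learned decoder).

```python
import numpy as np

def capture(Y_part, basis):
    """Fraction of ||Y_part||_F^2 retained after projecting rows onto span(basis)."""
    Q, _ = np.linalg.qr(basis)       # orthonormal columns spanning the same subspace
    # With P = Q Q^T, ||Y P||_F^2 equals ||Y Q||_F^2 because Q has orthonormal columns.
    return np.sum((Y_part @ Q) ** 2) / np.sum(Y_part ** 2)
```

Passing the held-out signal component gives $C_{\text{sig}}$, and passing the nuisance component gives $C_{\text{nuis}}$.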

![Refer to caption](https://arxiv.org/html/2606.06576v1/x7.png)

Figure D.1: *Synthetic benchmark:* Signal capture $C_{\text{sig}}$ (solid, left axis) and nuisance capture $C_{\text{nuis}}$ (dashed, right axis) versus latent dimensionality $D_{z}$, with $N=800$. GPLFR captures more signal and less nuisance than PCA-GP at every $D_{z}$. Bold lines show medians over five dataset seeds; faint lines show individual seeds.

#### D\.1\.4GPLFR and PCA\-GP for Noiseless Data

In the idealized case where the training outputs are exactly rank-$D_{z}$ ($\sigma^{2}_{\text{nuis}}=\sigma^{2}_{\epsilon}=0$) and the model fits them exactly (i.e., $\mathbf{Y}=\mathbf{Z}\mathbf{W}^{\top}$), GPLFR and PCA necessarily span the same $D_{z}$-dimensional output subspace (in practice, regularization and imperfect optimization mean this holds only approximately). However, even in this limit, PCA-GP and GPLFR are not equivalent, because they treat the representation's degrees of freedom within the subspace differently. These degrees of freedom are the invertible transformations $(\mathbf{Z},\mathbf{W})\to(\mathbf{Z}\mathbf{R},\mathbf{W}\mathbf{R}^{-\top})$, which leave $\mathbf{Z}\mathbf{W}^{\top}$ unchanged. Since standard GP priors are not rotationally invariant, different bases within the same subspace constitute different regression tasks.
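The point about bases can be illustrated with a toy numerical check: mixing the latents with an invertible $\mathbf{R}$ leaves the reconstruction untouched but changes the GP log-prior whenever the per-latent kernels differ. Everything in this sketch (sizes, lengthscales, the 45-degree rotation, the jitter) is an arbitrary choice made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 15
x = np.linspace(-2.0, 2.0, N)[:, None]

def rbf(ell):
    """1D RBF Gram matrix on x with a small jitter for numerical stability."""
    return np.exp(-0.5 * (x - x.T) ** 2 / ell**2) + 1e-6 * np.eye(N)

def log_gp_prior(Z, Ks):
    """sum_q log N(z^(q); 0, K_q) for independent per-latent GP priors."""
    total = 0.0
    for q, K in enumerate(Ks):
        L = np.linalg.cholesky(K)
        a = np.linalg.solve(L, Z[:, q])
        total += -0.5 * a @ a - np.sum(np.log(np.diag(L))) - 0.5 * N * np.log(2 * np.pi)
    return total

Ks = [rbf(0.5), rbf(2.0)]                       # a rough latent and a smooth latent
Z = np.column_stack([np.linalg.cholesky(K) @ rng.standard_normal(N) for K in Ks])
W = rng.standard_normal((4, 2))

theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
Z_rot, W_rot = Z @ R, W @ np.linalg.inv(R).T    # (Z, W) -> (Z R, W R^{-T})

same_outputs = np.allclose(Z @ W.T, Z_rot @ W_rot.T)
prior_gap = log_gp_prior(Z_rot, Ks) - log_gp_prior(Z, Ks)
```

Here `same_outputs` holds while `prior_gap` is strongly negative, because the rotation pushes rough variation into the latent governed by the smooth kernel.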

With random initialization, GPLFR may converge to a locally stable factorization that reconstructs well but yields latent coordinates less well matched to the per-latent GP priors, leading to worse generalization. PCA provides an orthogonal, variance-ordered basis that avoids such mixing, making it easier for per-latent GP hyperparameters to settle into a coherent configuration. We can exploit PCA's good starting basis in GPLFR by initializing the latent coordinates to PCA scores. Subsequent joint optimization can then make modest, input-informed adjustments that improve kernel coherence relative to a frozen PCA basis. (With PCA initialization, we found it important to allow GPLFR to learn at least a shared GP amplitude $\eta$; otherwise, global rescaling is forced into the latents themselves, which can push the solution out of the PCA-initialized basin.) All three effects are visible at $N=800$, $D_{z}=6$: randomly initialized GPLFR achieved an $\text{RMSE}_{\text{sig}}$ of $0.0117^{+0.0032}_{-0.0007}$, PCA-GP did better with $0.0091^{+0.0005}_{-0.0006}$, and PCA-initialized GPLFR slightly better still with $0.0080^{+0.0011}_{-0.0003}$ (medians $\pm$ 25–75th percentiles over 5 seeds).

We note that PCA initialization can*harm*GPLFR at higher nuisance levels, presumably because it biases the latents toward a basin near the PCA solution, which generalizes poorly in those regimes\.

### D\.2Biomedical Optics: Emulating PyXOpto

#### D\.2\.1PyXOpto simulation details

The simulations in our subset correspond to a single-layer, semi-infinite homogeneous medium illuminated by a Gaussian beam (full-width-half-maximum $100\,\mu\mathrm{m}$). $\mu_{a}$ and $\mu_{s}^{\prime}$ are the absorption and reduced scattering coefficients, which are varied over a $21\times 21$ grid with $\mu_{a}\in[0,5]\,\mathrm{cm}^{-1}$ and $\mu_{s}^{\prime}\in[5,35]\,\mathrm{cm}^{-1}$. Scattering is modeled using a Henyey–Greenstein phase function with shape defined by the scattering anisotropy parameter $g\in\{0.1,0.3,0.5,0.7,0.9\}$, which we treat as a discrete input indexed by $s\in\{1,\dots,5\}$. The reflectance is collected by a radial detector $r\in[0,5]\,\mathrm{mm}$ with $D_{y}=500$ concentric bins.

![Refer to caption](https://arxiv.org/html/2606.06576v1/x8.png)

Figure D.2: *PyXOpto emulation:* Example reflectance curves. The spike at $5\,\mathrm{mm}$ is a binning effect.

#### D\.2\.2Model Settings and Hyperparameters

##### Splits\.

There are 2205 examples available in total. We use a fixed split with sizes 1205 (train-pool), 500 (validation), 500 (test) with an equal number of examples per task ($g$ value) in each partition. For Figure [5](https://arxiv.org/html/2606.06576#S4.F5), we draw balanced training subsets with $N/S$ samples per task for nested $N\in\{50,100,200,400,800\}$ across sampling seeds 0–4.

##### Preprocessing\.

For each $(\text{seed},N)$ we fit standardizers on the training subset. Inputs are z-scored per-dimension; outputs are log-transformed, $y\leftarrow\log_{10}y$, and then z-scored per-dimension.
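A small sketch of the output transform and its inverse is shown below, assuming strictly positive outputs with non-constant columns. The name `fit_output_transform` is hypothetical.

```python
import numpy as np

def fit_output_transform(Y_train):
    """Fit log10 + per-dimension z-score on training outputs; return forward and inverse maps."""
    logY = np.log10(Y_train)
    mu, sd = logY.mean(axis=0), logY.std(axis=0)

    def forward(Y):
        return (np.log10(Y) - mu) / sd

    def inverse(T):
        return 10.0 ** (T * sd + mu)

    return forward, inverse
```

Predictions made in the transformed space are mapped back with `inverse` before being compared in physical units.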

##### Tuning procedure\.

Hyperparameters for all models are selected by median $\mathrm{RMSE}$ on a fixed 500-example validation set across five dataset seeds of the 200-example training set and then held fixed across subsequent runs.

##### Early stopping\.

All models use early stopping based on RMSE on the validation set\.

##### GPLFR\.

*Kernel.* We use one GP per latent dimension ($D_{q}=1$) with ARD RBF kernels. Each latent dimension has separate lengthscales and a separate amplitude; we found that sharing a single amplitude across latents performed marginally worse.

*Coregionalization.* The $g$-coregionalization matrix $\mathbf{B}^{\text{in}}$ is parameterized via its Cholesky factor $\mathbf{L}$, such that $\mathbf{B}^{\text{in}}=\mathbf{L}\mathbf{L}^{\top}$. We place an $\mathrm{LKJCholesky}(1)$ prior on $\mathbf{L}$, which ensures $\mathbf{B}^{\text{in}}$ is a correlation matrix.

*Optimization\.*We fit parameters via MAP estimation using Adam\.

*Hyperparameters\.*Tuned values are listed in Table[D\.3](https://arxiv.org/html/2606.06576#A4.T3)\.

##### PCA\-ICM\.

*Kernel\.*As with GPLFR, we use per\-score ARD RBF kernels with separate lengthscales and amplitudes\.

*Coregionalization.* We parameterize the $g$-coregionalization matrix as a correlation matrix $\mathbf{B}^{\text{in}}=\mathrm{corr}(\mathbf{L}\mathbf{L}^{\top})$, where $\mathbf{L}$ is a lower-triangular matrix of free parameters.
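One way to read $\mathrm{corr}(\mathbf{L}\mathbf{L}^{\top})$ is as the covariance $\mathbf{L}\mathbf{L}^{\top}$ rescaled to unit diagonal. The sketch below follows that reading, which is an assumption about the exact normalization; `corr_from_factor` is a hypothetical name, and every row of the lower triangle is assumed to be nonzero.

```python
import numpy as np

def corr_from_factor(L_free):
    """Map a matrix of free parameters to a correlation matrix via corr(L L^T)."""
    L = np.tril(L_free)                 # keep only the lower triangle
    S = L @ L.T                         # positive semi-definite covariance
    d = np.sqrt(np.diag(S))
    return S / np.outer(d, d)           # rescale to unit diagonal
```

Entries above the diagonal of `L_free` are ignored, so an optimizer can treat the whole array as unconstrained.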

*Optimization\.*We maximize the marginal likelihood using Adam \(L\-BFGS proved unstable for the correlation matrix parameters\)\.

*Priors\.*We use the same priors as for the synthetic benchmark \(Table[D\.1](https://arxiv.org/html/2606.06576#A4.T1)\)\.

*Hyperparameters\.*Tuned values are listed in Table[D\.3](https://arxiv.org/html/2606.06576#A4.T3)\.

##### PCA\-MLP\.

*Architecture.* The MLP consists of two hidden layers with shared width. The input is the concatenation $[\mathbf{x};\mathbf{e}_{s}]$, where $\mathbf{e}_{s}$ is the one-hot encoding of task index $s$.

*Training\.*We minimize MSE on the PCA scores using the AdamW optimizer\.

*Hyperparameters\.*We tuned the hidden width, learning rate, and weight decay, as detailed in Table[D\.3](https://arxiv.org/html/2606.06576#A4.T3)\.

| Model | Hyperparameter | Selected | Candidates |
| --- | --- | --- | --- |
| GPLFR | Lengthscale grouping | per-latent | [per-latent, shared] |
| GPLFR | Amplitude grouping | per-latent | [per-latent, shared] |
| GPLFR | Inverse-temperature ($\beta$) | 0.3 | [0.03, 0.1, 0.3] |
| GPLFR | Latent noise ($\lambda$) | $10^{-5}$ | [$10^{-5}$, $10^{-3}$, $10^{-1}$] |
| GPLFR | Latent learning rate | 0.01 | [0.001, 0.003, 0.01, 0.03] |
| GPLFR | Global learning rate | 0.003 | [0.001, 0.003, 0.01, 0.03] |
| PCA-ICM | Lengthscale grouping | per-latent | [per-latent, shared] |
| PCA-ICM | Amplitude grouping | per-latent | [per-latent, shared] |
| PCA-ICM | Learning rate | 0.01 | [$3\cdot 10^{-4}$, 0.001, 0.003, 0.01, 0.03] |
| PCA-MLP | Hidden widths | [256, 256] | [[32, 32], [64, 64], [128, 128], [256, 256]] |
| PCA-MLP | Activation | SiLU | [ReLU, SiLU] |
| PCA-MLP | Learning rate | 0.001 | [$10^{-4}$, $3\cdot 10^{-4}$, 0.001, 0.003, 0.01] |
| PCA-MLP | Weight decay | 0.001 | [$10^{-4}$, $3\cdot 10^{-4}$, 0.001, 0.003, 0.01, 0.03] |

Table D.3: *PyXOpto emulation:* Hyperparameters.

### D\.3Emulating Exoplanet Climate Models

#### D\.3\.1Dataset Details

##### Planets\.

We focus specifically on*tidally\-locked aquaplanets*– ocean\-covered rocky planets in or near the habitable zone, with one hemisphere permanently facing the host star\. These are the most plentiful subclass of rocky exoplanets modelled in the literature\. We focus evaluation on planets satisfying the physical constraints in Table[D\.4](https://arxiv.org/html/2606.06576#A4.T4)\.

Table D\.4:*Exoclimate emulation:*Physical constraints defining the core domain\.
##### GCMs\.

Our data spans simulations from five GCMs\. We designate ExoCAM and the UM as the*target*GCMs, evaluated at test time, since they are high\-fidelity models with relatively plentiful simulations\. ExoPlaSim, LFRic, and an earlier version of ExoCAM \(pre\-2022\) serve as*auxiliary*GCMs, contributing training data only\. ExoPlaSim provides the majority of the training set but is lower fidelity than the rest\. The 2022 update to ExoCAM provided an improved radiation scheme that reduced a CO2\-atmosphere bias in the pre\-2022 setup\.

##### Data sources\.

Our data is mainly sourced from existing datasets in the exoplanet science literature. The resulting combined dataset is highly non-uniform, with clusters around community-favorite planets such as TRAPPIST-1e and Proxima Centauri b, and each constituent study tending to vary only a few parameters. To mitigate this, we ran 357 bespoke simulations chosen to fill gaps in the input space using a weighted coverage design. A summary of the dataset is given in Table [D.5](https://arxiv.org/html/2606.06576#A4.T5).

##### Train–test split\.

Train–test splitting is based solely on the focus set examples (target GCM $\cap$ core domain). All remaining examples (auxiliary GCMs or outside the core domain) are included in the training set to improve predictions via cross-model transfer and broader input-space coverage. Out-of-domain simulations typically violate the core domain constraints in only one or two input dimensions, so they can remain informative for anchoring the response surface near the domain boundaries. To prevent leakage, duplicated planets (same continuous parameters, different GCM) are never split across train and test.

| GCMs | Within core domain | Additional | Data sources |
|---|---|---|---|
| **Target GCMs** | | | |
| UM | 220 | 31 | Mak et al. [[2024](https://arxiv.org/html/2606.06576#bib.bib21)], Sergeev et al. [[2022](https://arxiv.org/html/2606.06576#bib.bib26)], Stevenson et al. [[in review](https://arxiv.org/html/2606.06576#bib.bib27)] |
| ExoCAM | 87 | 7 | Hammond et al. [[2025](https://arxiv.org/html/2606.06576#bib.bib8)], Haqq-Misra et al. [[2022](https://arxiv.org/html/2606.06576#bib.bib9)], Sergeev et al. [[2022](https://arxiv.org/html/2606.06576#bib.bib26)], Wolf et al. [[2025](https://arxiv.org/html/2606.06576#bib.bib34)], Woodward et al. [[In preparation](https://arxiv.org/html/2606.06576#bib.bib35)], Stevenson et al. [[in review](https://arxiv.org/html/2606.06576#bib.bib27)] |
| **Auxiliary GCMs** | | | |
| ExoCAM pre-2022 | 113 | 47 | Komacek and Abbot [[2019](https://arxiv.org/html/2606.06576#bib.bib15)], kumar Kopparapu et al. [[2016](https://arxiv.org/html/2606.06576#bib.bib16), [2017](https://arxiv.org/html/2606.06576#bib.bib17)], Wolf et al. [[2019](https://arxiv.org/html/2606.06576#bib.bib32)], Wolf [[2017](https://arxiv.org/html/2606.06576#bib.bib33)], Suissa et al. [[2020](https://arxiv.org/html/2606.06576#bib.bib28)] |
| LFRic | 14 | 5 | Haqq-Misra et al. [[2022](https://arxiv.org/html/2606.06576#bib.bib9)], Stevenson et al. [[in review](https://arxiv.org/html/2606.06576#bib.bib27)] |
| ExoPlaSim | 426 | 776 | Macdonald et al. [[2025](https://arxiv.org/html/2606.06576#bib.bib20)], Paradise et al. [[2021](https://arxiv.org/html/2606.06576#bib.bib22), [2022a](https://arxiv.org/html/2606.06576#bib.bib23), [2022b](https://arxiv.org/html/2606.06576#bib.bib24)], Stevenson et al. [[in review](https://arxiv.org/html/2606.06576#bib.bib27)] |

Table D.5: *Exoclimate emulation:* GCM dataset summary. *Target* GCMs are evaluated at test time; *auxiliary* GCMs contribute training data only. *Within core domain* counts simulations satisfying the physical constraints in Table [D.4](https://arxiv.org/html/2606.06576#A4.T4); *additional* counts simulations outside these constraints that are included in training only. Train–test splitting is based solely on the 307 *focus set* examples (target GCM $\cap$ core domain); the remaining 1419 examples are included to improve predictions via cross-model and cross-domain transfer.

#### D.3.2 Preprocessing

##### Grid\.

Each GCM produces output on its own native grid. We regrid all fields onto a common $32\times 64$ Gaussian latitude–longitude grid (with 10 pressure levels for 3D variables): horizontal interpolation is bilinear, vertical interpolation is in log-pressure, and any field with partially missing values is treated as fully unobserved. This grid's nodes support an exact spherical harmonic transform up to total wavenumber 21 (a "T21" grid).

The 10 pressure levels are defined as relative isobars $\sigma_k = (P_k - P_{\text{top}})/(f_{\text{bottom}} P_0 - P_{\text{top}})$, where $P_0$ is the input surface pressure, $P_{\text{top}} = 10\,\text{mbar}$, and $f_{\text{bottom}} = 0.95$ lifts the lowest level above near-surface pressure fluctuations. The $\sigma_k$ are spaced between 0 and 1 according to a fourth-order polynomial that slightly increases resolution near the top and bottom of the atmosphere.
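As a minimal sketch of the relative-isobar definition, the snippet below converts between absolute pressure and $\sigma$ for a given surface pressure. The function names are hypothetical, and the $\sigma$ values in the example are placeholders: the fourth-order polynomial spacing is not specified here, so it is not reproduced.

```python
P_TOP_MBAR = 10.0  # P_top from the text
F_BOTTOM = 0.95    # f_bottom from the text


def sigma_from_pressure(p_mbar, p0_mbar):
    """Relative isobar sigma = (P - P_top) / (f_bottom * P0 - P_top)."""
    return (p_mbar - P_TOP_MBAR) / (F_BOTTOM * p0_mbar - P_TOP_MBAR)


def pressure_from_sigma(sigma, p0_mbar):
    """Invert the definition above to recover absolute pressure in mbar."""
    return P_TOP_MBAR + sigma * (F_BOTTOM * p0_mbar - P_TOP_MBAR)


# Placeholder sigma values (NOT the paper's polynomial spacing).
example_sigmas = [0.0, 0.5, 1.0]
p0 = 1000.0  # hypothetical surface pressure of 1000 mbar
example_pressures = [pressure_from_sigma(s, p0) for s in example_sigmas]
```

Under these assumptions, $\sigma = 0$ maps to $P_{\text{top}}$ and $\sigma = 1$ maps to $f_{\text{bottom}} P_0$, so the same $\sigma$ levels track the atmosphere's depth as surface pressure varies between planets.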

##### Inputs\.

Rotation period and surface pressure are log-transformed. Gas volume fractions are transformed via $x \leftarrow \operatorname{asinh}(x/s)$, where $s$ is a fixed pivot per species (CO2: $10^{-6}$; CH4: $10^{-8}$), chosen so that the transformation is approximately logarithmic at climatically significant concentrations and linear near zero. All inputs are then z-scored over the training set.
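The two regimes of the asinh transform can be checked numerically. The sketch below uses the pivots quoted in the text; the function name and the two test concentrations are illustrative assumptions.

```python
import math

# Pivots per species, as given in the text.
PIVOTS = {"CO2": 1e-6, "CH4": 1e-8}


def transform_fraction(x, species):
    """Map a gas volume fraction x >= 0 to asinh(x / s)."""
    return math.asinh(x / PIVOTS[species])


# Well below the pivot, asinh(u) ~ u, so the transform is linear in x.
small = transform_fraction(1e-9, "CO2")  # u = 1e-3
# Well above the pivot, asinh(u) ~ log(2u), so it is logarithmic in x.
large = transform_fraction(1e-2, "CO2")  # u = 1e4
```

Unlike a plain logarithm, the transform stays finite at $x = 0$, which matters when some simulations contain none of a given gas.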

##### Outputs\.

Specific humidity is log-transformed and cloud fraction is smoothed-logit-transformed: $c \mapsto \operatorname{logit}((c+\varepsilon)/(1+2\varepsilon))$ with $\varepsilon = 2\times 10^{-3}$ (selected via a small sweep); predictions are clamped back to $[0,1]$ after inversion. Each horizontal field is then expanded in the T21 spherical harmonic basis. The resulting coefficient vectors are centered and scaled per-field: for field $k$, we compute the training-set mean $\bar{\mathbf{a}}^{(k)}$ and the root-mean anomaly energy

$$\sigma^{(k)} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left\|\mathbf{a}_i^{(k)} - \bar{\mathbf{a}}^{(k)}\right\|_2^2}$$

and set $\mathbf{y}_i^{(k)} = (\mathbf{a}_i^{(k)} - \bar{\mathbf{a}}^{(k)})/\sigma^{(k)}$. Dividing by $\sigma^{(k)}$ equalizes the total anomaly variance across fields. Predictions are denormalized by inverting these steps before evaluation.
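The output transforms above can be sketched in a few lines. This is an illustrative pure-Python version, not the authors' code: the function names are hypothetical, and coefficient vectors are represented as plain lists of floats.

```python
import math

EPS = 2e-3  # epsilon from the text


def smoothed_logit(c, eps=EPS):
    """Map cloud fraction c in [0, 1] to logit((c + eps) / (1 + 2 * eps))."""
    p = (c + eps) / (1.0 + 2.0 * eps)
    return math.log(p / (1.0 - p))


def inverse_smoothed_logit(t, eps=EPS):
    """Invert the transform and clamp the result back to [0, 1]."""
    p = 1.0 / (1.0 + math.exp(-t))
    c = p * (1.0 + 2.0 * eps) - eps
    return min(1.0, max(0.0, c))


def normalize_field(coeffs):
    """Center coefficient vectors and divide by the root-mean anomaly energy.

    coeffs: list of N coefficient vectors (lists of floats) for one field.
    Returns (normalized vectors, mean vector, sigma).
    """
    n, dim = len(coeffs), len(coeffs[0])
    mean = [sum(a[j] for a in coeffs) / n for j in range(dim)]
    anomalies = [[a[j] - mean[j] for j in range(dim)] for a in coeffs]
    sigma = math.sqrt(sum(v * v for row in anomalies for v in row) / n)
    return [[v / sigma for v in row] for row in anomalies], mean, sigma
```

The smoothing keeps the logit finite at $c = 0$ and $c = 1$, and the clamp is needed because the inverse can land slightly outside $[0, 1]$ for extreme predictions. After `normalize_field`, the mean squared norm of the vectors is 1 for every field, which is the sense in which total anomaly variance is equalized.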

##### Equatorial symmetry\.

All planets in our dataset have symmetric forcing about the equator, so their time-mean climates should be equatorially symmetric in all fields except meridional wind $v$, which is antisymmetric. In practice, some simulations exhibit residual asymmetry due to finite time-averaging windows and, in some cases, spontaneous symmetry breaking or amplified numerical asymmetries. We treat these as artifacts (or at least beyond the emulator's modeling scope) and enforce symmetry by zeroing the complementary spherical harmonic coefficients: since $Y_l^m(\pi-\theta,\phi) = (-1)^{l+m}\,Y_l^m(\theta,\phi)$, symmetry corresponds to retaining only $l+m$ even coefficients and antisymmetry to retaining only $l+m$ odd. In spatial terms, this is equivalent to averaging the two hemispheres (for symmetric fields) or taking half their difference (for $v$).
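The parity rule can be expressed as a mask over $(l, m)$ index pairs. The sketch below is a minimal illustration with hypothetical function names; it assumes coefficients are stored in a mapping keyed by $(l, m)$ with $0 \le m \le l \le 21$, which is one common layout for real-valued fields and not necessarily the one used by the authors.

```python
L_MAX = 21  # T21 truncation from the text


def parity_mask(antisymmetric=False, l_max=L_MAX):
    """Return {(l, m): keep} for 0 <= m <= l <= l_max.

    Symmetric fields keep l + m even; antisymmetric fields (here, the
    meridional wind v) keep l + m odd.
    """
    wanted = 1 if antisymmetric else 0
    return {
        (l, m): (l + m) % 2 == wanted
        for l in range(l_max + 1)
        for m in range(l + 1)
    }


def enforce_symmetry(coeffs, antisymmetric=False):
    """Zero the complementary coefficients in a {(l, m): value} mapping."""
    mask = parity_mask(antisymmetric)
    return {lm: (val if mask[lm] else 0.0) for lm, val in coeffs.items()}


sym_kept = sum(parity_mask(False).values())
anti_kept = sum(parity_mask(True).values())
```

With this indexing, symmetric fields retain 132 of the 253 index pairs and $v$ retains the other 121, so the masking also roughly halves the number of output dimensions per field.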

#### D.3.3 GPLFR Configuration

##### Output\-space mean function\.

A stationary, zero-mean GP encoder can capture global trends within the training domain, but out-of-domain predictions revert toward the prior mean. Following standard emulator practice, we separate a low-capacity parametric trend from the GPLFR residual model by fitting and subtracting a linear mean function in output space before the GP is trained. Concretely, we regress each output dimension on the standardized inputs plus a one-hot encoding of GCM identity via ridge regression with a small penalty ($\lambda_{\text{ridge}} = 10^{-3}$) for conditioning, and replace the outputs with the residuals. At prediction time, the linear trend is added back to posterior predictive samples.

##### Input kernel\.

We use an ICM kernel $k((\mathbf{x},s),(\mathbf{x}',s')) = k_x(\mathbf{x},\mathbf{x}')\,B^{\text{in}}_{ss'}$. The continuous component $k_x$ is an ARD Matérn-$\tfrac{5}{2}$ kernel on the standardized planet properties (the kernel family was selected from RBF, Matérn-$\tfrac{3}{2}$, and Matérn-$\tfrac{5}{2}$ independently per model class during cross-validation; see Appendix [D.3.6](https://arxiv.org/html/2606.06576#A4.SS3.SSS6)). The coregionalization matrix is decomposed as $\mathbf{B}^{\text{in}} = \mathrm{diag}(\mathbf{r})\,\mathbf{R}\,\mathrm{diag}(\mathbf{r})$, where $\mathbf{R}\in\mathbb{R}^{S\times S}$ is a correlation matrix with an $\mathrm{LKJCholesky}(1)$ prior and $\mathbf{r}\in\mathbb{R}^{S}_{>0}$ are per-GCM latent amplitude scales, constrained to have unit geometric mean to remove a global scale redundancy with $k_x$. This separates two interpretable quantities: $\mathbf{R}$ captures how similarly two GCMs respond to changes in planet properties, and $r_s$ captures the overall sensitivity of GCM $s$ relative to the others.
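To make the kernel structure concrete, here is a small pure-Python sketch of the product kernel and the correlation-plus-scales parameterization. It is illustrative only: the function names are hypothetical, $k_x$ is given unit amplitude for simplicity, and the prior on $\mathbf{R}$ is not modelled.

```python
import math


def matern52_ard(x, x2, lengthscales):
    """Unit-amplitude ARD Matern-5/2 kernel between two input vectors."""
    d = math.sqrt(
        sum(((a - b) / ell) ** 2 for a, b, ell in zip(x, x2, lengthscales))
    )
    s = math.sqrt(5.0) * d
    return (1.0 + s + s * s / 3.0) * math.exp(-s)


def unit_geometric_mean(r):
    """Rescale positive amplitudes so that their geometric mean is 1."""
    log_mean = sum(math.log(v) for v in r) / len(r)
    return [v / math.exp(log_mean) for v in r]


def coregionalization(r, R):
    """B = diag(r) R diag(r) for amplitude scales r and correlation matrix R."""
    n = len(r)
    return [[r[i] * R[i][j] * r[j] for j in range(n)] for i in range(n)]


def icm_kernel(x, s, x2, s2, lengthscales, B):
    """k((x, s), (x', s')) = k_x(x, x') * B[s][s']."""
    return matern52_ard(x, x2, lengthscales) * B[s][s2]


# Hypothetical two-GCM example: raw scales (4, 1) rescale to (2, 0.5).
r = unit_geometric_mean([4.0, 1.0])
B = coregionalization(r, [[1.0, 0.8], [0.8, 1.0]])
```

In this toy example the diagonal of `B` holds $r_s^2$ (each GCM's own variance scale), while the off-diagonal entry $r_0 r_1 R_{01}$ controls how much information is shared between the two GCMs at the same planet properties.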

##### Missing data\.

Different GCMs use different vertical grids, not all simulations extend to the same altitude, and some studies do not output all variables we emulate. Additionally, GCMs that use height-based rather than pressure-based vertical coordinates can have pressure fluctuations that leave our lowest or highest isobars only partially observed across the spatial grid; to avoid introducing bias, we treat such partially missing fields as fully unobserved. After interpolation onto the canonical 10-level pressure grid, this leaves a structured pattern of missingness: some examples lack data at upper or lower pressure levels, and others lack entire fields. We handle this within the collapsed decoder likelihood by restricting each output dimension's contribution to the examples where it is observed, assuming missingness is independent of the unobserved values given the inputs. Output dimensions sharing the same set of observed examples are grouped so that the Woodbury inversion is computed once per missingness pattern, adding a negligible $O(P D_z^3)$ cost where $P$ is the number of distinct patterns.
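The grouping step can be sketched as follows. This only illustrates the bookkeeping (which output dimensions share a pattern), not the collapsed likelihood or the Woodbury inversion itself; the function name and the boolean-mask representation are assumptions.

```python
from collections import defaultdict


def group_by_missingness(observed):
    """Group output dimensions that share the same set of observed examples.

    observed[i][d] is True when output dimension d is observed for example i.
    Returns {pattern: [dims]}, where pattern is a tuple of example indices.
    """
    n_examples, n_dims = len(observed), len(observed[0])
    groups = defaultdict(list)
    for d in range(n_dims):
        pattern = tuple(i for i in range(n_examples) if observed[i][d])
        groups[pattern].append(d)
    return dict(groups)


# Toy mask: 3 examples, 4 output dimensions; example 2 lacks dims 2 and 3.
toy_mask = [
    [True, True, True, True],
    [True, True, True, True],
    [True, True, False, False],
]
toy_groups = group_by_missingness(toy_mask)
```

Because whole fields or pressure levels go missing together, the number of distinct patterns $P$ stays small relative to the number of output dimensions, which is why the per-pattern cost is negligible.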

##### Model fitting\.

All parameters are fit jointly via MAP estimation using Adam\.

#### D.3.4 PPCA-ICM

PPCA\-ICM is a compress\-then\-predict baseline analogous to PCA\-GP, but one that \(i\) replaces PCA with probabilistic PCA \(PPCA\), and \(ii\) replaces independent per\-score GPs with a multi\-task GP using an intrinsic coregionalization model \(ICM\) over GCM labels\. It operates in the same normalized spectral\-coefficient space and uses the same equatorial symmetry masking, missing\-field conventions, and output\-space mean function as GPLFR\.

##### Stage 1: PPCA compression\.

Each example's observed coefficient vectors are modelled as a linear function of a shared latent $\mathbf{z}_i\in\mathbb{R}^{D_z}$ with field-specific loadings and means and a shared isotropic noise variance $\sigma^2$. Missing fields are handled by omitting their likelihood terms. Parameters are fit by maximum likelihood via 50 iterations of expectation-maximization, which also yields posterior score estimates $\mathbf{z}_i$ for each training example.

##### Stage 2: ICM\-GP regression\.

The PPCA scores are regressed against inputs $(\mathbf{x},s)$ using independent GPs with an ICM kernel: $k((\mathbf{x},s),(\mathbf{x}',s')) = k_x(\mathbf{x},\mathbf{x}')\,B^{\text{in}}_{ss'}$, where $k_x$ is an ARD Matérn-$\tfrac{5}{2}$ kernel on the continuous planet properties (selected independently from the same kernel-family candidates as GPLFR) and $\mathbf{B}^{\text{in}}\in\mathbb{R}^{S\times S}$ is a coregionalization matrix across GCMs, using the same correlation-plus-scales parameterization as GPLFR. We compared a shared-kernel regime (all scores drawn from a GP with common lengthscales, amplitudes, and $\mathbf{B}^{\text{in}}$) against a per-component regime (separate lengthscales and amplitudes per score, with shared $\mathbf{B}^{\text{in}}$); the shared-kernel regime performed better and is used throughout. Kernel hyperparameters are fit by maximizing the GP marginal likelihood using Adam with learning rate $10^{-3}$.

##### Prediction\.

Ensemble members are generated by sampling latent scores from the GP predictive distribution, decoding each through the PPCA loadings and means, and adding PPCA noise\. The resulting spectral coefficients are mapped back to spatial fields via the same inverse preprocessing as GPLFR\.

#### D.3.5 Deterministic Baselines

##### PPCA\-MLP\.

PPCA-MLP shares PPCA-ICM's compression stage (Section [D.3.4](https://arxiv.org/html/2606.06576#A4.SS3.SSS4), Stage 1), using the same PPCA scores as regression targets. The MLP consists of two hidden layers with shared width, taking as input the concatenation $[\mathbf{x};\,\mathbf{e}_s]$ where $\mathbf{e}_s$ is the one-hot encoding of the GCM label. We minimize MSE on the PPCA scores using AdamW. Predictions are decoded through the PPCA loadings and means without noise sampling.

##### kNN\.

For each test input, we find the $k$ nearest training examples in standardized continuous input space using Euclidean distance. GCM identity is encoded as a scaled one-hot vector appended to the input, so that different GCMs contribute an additional distance penalty $\lambda\sqrt{2}$, encouraging same-GCM neighbours while still permitting cross-GCM matches. The prediction is the uniform average of the neighbours' spatial output fields, computed after variable-specific transforms (log for humidity, smoothed logit for cloud fraction, identity for other variables) and mapped back to physical units; missing fields are averaged only over neighbours where they are observed. For fair comparison, the same spectral truncation and equatorial symmetry enforcement used by GPLFR are applied to predictions and test targets at evaluation.
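A minimal sketch of the kNN baseline's distance and averaging steps is given below. Function names and the toy data are hypothetical, and the sketch omits the output transforms and the missing-field handling described above.

```python
import math


def augmented(x, gcm_index, n_gcms, lam):
    """Append a one-hot GCM encoding scaled by lam to the standardized inputs."""
    one_hot = [lam if j == gcm_index else 0.0 for j in range(n_gcms)]
    return list(x) + one_hot


def knn_predict(query, train_inputs, train_outputs, k):
    """Uniformly average the outputs of the k nearest training examples."""
    order = sorted(
        range(len(train_inputs)),
        key=lambda i: math.dist(query, train_inputs[i]),
    )
    nearest = order[:k]
    dim = len(train_outputs[0])
    return [sum(train_outputs[i][j] for i in nearest) / k for j in range(dim)]


# Two identical planets simulated by different GCMs are lam * sqrt(2) apart.
lam = 0.5
gap = math.dist(augmented([0.0], 0, 2, lam), augmented([0.0], 1, 2, lam))

train_x = [
    augmented([0.0], 0, 2, lam),
    augmented([0.1], 1, 2, lam),
    augmented([5.0], 0, 2, lam),
]
train_y = [[1.0], [3.0], [100.0]]
pred = knn_predict(augmented([0.0], 0, 2, lam), train_x, train_y, k=2)
```

In the toy data, the nearby cross-GCM example is preferred over the distant same-GCM one, which illustrates how $\lambda$ trades off simulator identity against closeness in planet properties.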

##### Training mean\.

The prediction for every test input is the global per\-field mean over the entire training set, computed after the same variable\-specific transforms and mapped back to physical units\. We use a global rather than per\-GCM mean because GCMs cover different regions of input space, so per\-GCM means would conflate simulator differences with input\-space sampling bias\.

#### D.3.6 Training and model selection

Table D.6: *Exoclimate emulation:* Hyperparameters.

##### Cross-validation and hyperparameter selection.

Hyperparameters for GPLFR, PPCA-ICM, and PPCA-MLP are selected via 3-fold cross-validation. Only focus set examples are assigned to validation folds; all other training examples (those from auxiliary GCMs and/or outside the core domain) are included in every fold's training partition. To prevent leakage, duplicated planets (same continuous parameters, different GCM) are grouped and assigned to folds as a unit. For each candidate setting, we train on each fold and evaluate RMSE in normalized spectral-coefficient space on its held-out set. We select the setting with the best mean performance across folds, then refit once on the full training set. Both GPLFR and PPCA-ICM performed well out to the highest latent dimensionality tested ($D_z = 200$), with the chosen $D_z = 150$ lying in the performance plateau of each. PPCA-MLP peaked around $D_z = 50$ and tended to overfit at higher $D_z$. The kNN hyperparameters are selected via 5-fold CV on the training set. The selected hyperparameters are shown in Table [D.6](https://arxiv.org/html/2606.06576#A4.T6).
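The fold construction can be sketched as below. This is an assumption-laden illustration: the dictionary keys `"group"` and `"focus"` are hypothetical, groups are assigned round-robin rather than randomly, and the text does not say how a group mixing focus and non-focus examples is treated, so the sketch simply keeps every non-focus example in training.

```python
def assign_folds(examples, n_folds=3):
    """Assign focus-set groups to validation folds as whole units.

    examples: list of dicts with keys "group" (shared by duplicated planets)
    and "focus" (True for target GCM within the core domain).
    Returns {fold: (train_indices, val_indices)}.
    """
    focus_groups = sorted({e["group"] for e in examples if e["focus"]})
    fold_of_group = {g: i % n_folds for i, g in enumerate(focus_groups)}
    folds = {}
    for fold in range(n_folds):
        val = [
            i
            for i, e in enumerate(examples)
            if e["focus"] and fold_of_group[e["group"]] == fold
        ]
        held_out = set(val)
        train = [i for i in range(len(examples)) if i not in held_out]
        folds[fold] = (train, val)
    return folds


toy_examples = [
    {"group": "a", "focus": True},   # planet "a" in one target GCM
    {"group": "a", "focus": True},   # same planet in the other target GCM
    {"group": "b", "focus": True},
    {"group": "c", "focus": True},
    {"group": "d", "focus": False},  # auxiliary or out-of-domain example
]
toy_folds = assign_folds(toy_examples)
```

In the toy data, the two copies of planet "a" always land on the same side of the split, and the non-focus example appears in every training partition.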

##### Early stopping\.

Both GPLFR and PPCA\-ICM use early stopping based on validation RMSE in normalized spectral\-coefficient space during cross\-validation\. When refitting on the full training set for the final model, we train for the median best\-validation step across folds\.

#### D.3.7 Training time

Table [D.7](https://arxiv.org/html/2606.06576#A4.T7) reports approximate wall-clock fitting times for the models in the exoclimate experiment in Section [4.3](https://arxiv.org/html/2606.06576#S4.SS3) (not counting hyperparameter tuning time). All runs used a single NVIDIA H100 PCIe GPU.

Table D.7: *Exoclimate emulation:* Approximate wall-clock fitting times.

(a)

![Refer to caption](https://arxiv.org/html/2606.06576v1/x9.png)

\(b\)

![Refer to caption](https://arxiv.org/html/2606.06576v1/x10.png)

Figure D.3: *Exoclimate emulation:* Air temperature maps with superimposed wind vectors at relative isobar $\sigma_3 \approx 0.72$ (this would be around the mid-troposphere on Earth) for two test planets. In (a), both models capture the broad spatial structure; GPLFR is near-identical to the GCM output, while PPCA-ICM overestimates the day–night temperature contrast. In (b), a harder case, GPLFR still recovers the temperature and wind structure, including the cold vortices in the western hemisphere, while PPCA-ICM severely overestimates how cold they are and fails to reproduce the vortex circulation.

(a)

![Refer to caption](https://arxiv.org/html/2606.06576v1/x11.png)

\(b\)

![Refer to caption](https://arxiv.org/html/2606.06576v1/x12.png)

Figure D.4: *Exoclimate emulation:* Dayside and nightside vertical profiles of area-weighted mean air temperature and specific humidity for the same two test planets as Figure [D.3](https://arxiv.org/html/2606.06576#A4.F3). In (a), both models track the GCM profiles well, with noticeable deviations only at the top two levels. In (b), the harder case, GPLFR closely tracks the temperature profiles, while PPCA-ICM deviates significantly. Both models struggle more with specific humidity on this planet, but GPLFR is still notably closer than PPCA-ICM.

#### D.3.8 Example Predictions

Figures [D.3](https://arxiv.org/html/2606.06576#A4.F3) and [D.4](https://arxiv.org/html/2606.06576#A4.F4) compare GPLFR and PPCA-ICM against GCM ground truth for a small subset of the full model output on two test planets: one where both models perform well and one where PPCA-ICM degrades noticeably. Figure [D.3](https://arxiv.org/html/2606.06576#A4.F3) shows spatial maps of temperature and wind, and Figure [D.4](https://arxiv.org/html/2606.06576#A4.F4) shows vertical profiles of temperature and humidity.

#### D.3.9 Additional Results

Here we expand the main-text results (Tables [2](https://arxiv.org/html/2606.06576#S4.T2) and [1](https://arxiv.org/html/2606.06576#S4.T1)) by reporting per-level breakdowns (Tables [D.8](https://arxiv.org/html/2606.06576#A4.T8) and [D.9](https://arxiv.org/html/2606.06576#A4.T9)) and two additional metrics (Tables [D.10](https://arxiv.org/html/2606.06576#A4.T10) and [D.11](https://arxiv.org/html/2606.06576#A4.T11)).

##### Anomaly correlation coefficient \(ACC\)\.

ACC measures agreement in spatial structure between prediction $\hat{\mathbf{y}}$ and truth $\mathbf{y}$, ignoring differences in global mean and amplitude. For a single field,

$$\text{ACC}(\hat{\mathbf{y}},\mathbf{y}) = \frac{\langle\hat{\mathbf{y}}^{\circ},\mathbf{y}^{\circ}\rangle_G}{\|\hat{\mathbf{y}}^{\circ}\|_G\,\|\mathbf{y}^{\circ}\|_G},\quad \mathbf{u}^{\circ}\equiv\mathbf{u}-\bar{u}\mathbf{1},\quad \bar{u}=\langle\mathbf{1},\mathbf{u}\rangle_G/\langle\mathbf{1},\mathbf{1}\rangle_G,$$

where $\langle\mathbf{u},\mathbf{v}\rangle_G\equiv\mathbf{u}^{\top}\mathbf{G}\mathbf{v}$ is the area-weighted inner product. $\text{ACC}=1$ indicates perfect spatial correlation; for a random prediction, $\mathbb{E}[\text{ACC}]=0$.
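The ACC definition translates directly into code. The sketch below assumes that $\mathbf{G}$ is diagonal, so it can be represented by a list of per-cell area weights; that representation and the function names are assumptions made for illustration.

```python
import math


def weighted_inner(u, v, w):
    """Area-weighted inner product <u, v>_G with diagonal weights w."""
    return sum(wi * ui * vi for wi, ui, vi in zip(w, u, v))


def acc(pred, truth, w):
    """Anomaly correlation coefficient between two flattened fields."""
    ones = [1.0] * len(w)
    total = weighted_inner(ones, ones, w)

    def anomaly(u):
        # Subtract the area-weighted mean of the field.
        mean = weighted_inner(ones, u, w) / total
        return [ui - mean for ui in u]

    p, t = anomaly(pred), anomaly(truth)
    norm = math.sqrt(weighted_inner(p, p, w) * weighted_inner(t, t, w))
    return weighted_inner(p, t, w) / norm


# Toy field with three cells and hypothetical area weights.
truth = [1.0, 2.0, 4.0]
weights = [0.2, 0.5, 0.3]
shifted_scaled = [2.0 * t + 3.0 for t in truth]
```

Because the mean is removed and both anomalies are normalized, `acc(shifted_scaled, truth, weights)` is 1 up to rounding: a prediction that is off by a constant offset or a positive gain is not penalized, which is why ACC is reported alongside RMSE and not in place of it.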

GPLFR achieves the highest mean ACC across all variables (Table [D.10](https://arxiv.org/html/2606.06576#A4.T10)). Its weakest result is for cloud fraction at mid–upper pressure levels. There, kNN beats both GPLFR and PPCA-ICM, likely because cloud fraction has relatively high power at high spatial frequencies, some of which is removed by the spectral truncation used for these models. All models achieve near-perfect ACC for absorbed shortwave radiation, reflecting the strong dependence of this field on substellar point geometry (which is consistent across examples).

##### Spread–skill ratio \(SSR\)\.

SSR is a first-order diagnostic of ensemble calibration. We define it as

$$\text{SSR}=\sqrt{\frac{\sum_i \text{Spread}_i^2}{\sum_i \text{MSE}_i}},$$

where for each test example $i$,

$$\text{Spread}_i^2=\frac{1}{M-1}\sum_{m=1}^{M}\left\|\mathbf{y}^{[m]}_i-\hat{\mathbf{y}}_i\right\|_G^2,\qquad \text{MSE}_i=\left\|\hat{\mathbf{y}}_i-\mathbf{y}_i\right\|_G^2,$$

and $\hat{\mathbf{y}}_i$ is the ensemble mean. A well-calibrated ensemble has $\text{SSR}\approx 1$; values below 1 indicate overconfidence (too little spread) and values above 1 indicate underconfidence.
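A direct implementation of the SSR definition is sketched below, again assuming diagonal area weights for the $G$-norm. The function names and the one-cell toy ensemble are hypothetical.

```python
import math


def sq_norm(u, w):
    """Squared area-weighted norm ||u||_G^2 with diagonal weights w."""
    return sum(wi * ui * ui for wi, ui in zip(w, u))


def spread_skill_ratio(ensembles, truths, w):
    """SSR = sqrt(sum_i Spread_i^2 / sum_i MSE_i).

    ensembles[i] is a list of M member fields for test example i;
    truths[i] is the matching true field.
    """
    spread_sum, mse_sum = 0.0, 0.0
    for members, truth in zip(ensembles, truths):
        m = len(members)
        mean = [sum(col) / m for col in zip(*members)]
        spread_sum += sum(
            sq_norm([a - b for a, b in zip(member, mean)], w)
            for member in members
        ) / (m - 1)
        mse_sum += sq_norm([a - b for a, b in zip(mean, truth)], w)
    return math.sqrt(spread_sum / mse_sum)


# One test example, two ensemble members, a single grid cell.
toy_ensemble = [[[0.0], [2.0]]]
ssr_under = spread_skill_ratio(toy_ensemble, [[2.0]], [1.0])  # truth near mean
ssr_over = spread_skill_ratio(toy_ensemble, [[3.0]], [1.0])   # truth far away
```

In the toy case the spread is fixed at $\text{Spread}^2 = 2$, so moving the truth from 2 to 3 raises the MSE from 1 to 4 and takes the SSR from $\sqrt{2}$ (underconfident) to $\sqrt{0.5}$ (overconfident). Note that the sums are pooled over test examples before the ratio is taken, so examples with large errors dominate the diagnostic.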

GPLFR is closer to $\text{SSR}=1$ than PPCA-ICM for 47 out of 53 fields (Table [D.11](https://arxiv.org/html/2606.06576#A4.T11)). Both models tend toward overconfidence.

Table D.8: *Exoclimate emulation:* Energy score. Bold indicates lower (better) values. Rows 0–9 index pressure levels from surface to top for 3D variables; the individual levels are shown in gray.

Table D.9: *Exoclimate emulation:* RMSE. Bold indicates lower (better) values. Rows 0–9 index pressure levels from surface to top for 3D variables; the individual levels are shown in gray.

Table D.10: *Exoclimate emulation:* Anomaly correlation coefficient (ACC). Bold indicates higher (better) values. Rows 0–9 index pressure levels from surface to top for 3D variables; the individual levels are shown in gray.

Table D.11: *Exoclimate emulation:* Spread–skill ratio (SSR). Bold indicates values closer to 1 (better calibrated spread). Rows 0–9 index pressure levels from surface to top for 3D variables; the individual levels are shown in gray.