Structured Noise Adaptation for Sequential Bayesian Filtering with Embedded Latent Transfer Operators

arXiv cs.LG Papers

Summary

This paper introduces a structured parameterization for noise models in ELTO-based Kalman filters, enabling dynamic adaptation to non-stationary processes and improving state estimation performance in noisy, time-varying environments.

arXiv:2606.14195v1 Announce Type: new Abstract: Kalman filters based on the Embedded Latent Transfer Operators (ELTO) emerge as novel statistical tools for sequential state estimation. However, a critical limitation stems from their use of simplified noise models, which fail to dynamically adapt to non-stationary processes. To address this limitation, we introduce an ELTO-based Bayesian filtering approach with a new structured parameterization for the filter's noise model. This parameterization enables structured noise adaptation, which couples the data-driven learning of an optimal time-invariant noise model with dynamic parameter adaptation that responds to changes in dynamics within non-stationary processes. Empirical results show that our structured noise adaptation improves the filter's dynamic state estimation performance in noisy, time-varying environments.

Cached at: 06/15/26, 09:12 AM

# Structured Noise Adaptation for Sequential Bayesian Filtering with Embedded Latent Transfer Operators
Source: [https://arxiv.org/html/2606.14195](https://arxiv.org/html/2606.14195)
Naichang Ke (naichang.ke@ist.osaka-u.ac.jp), The University of Osaka; Pongpisit Thanasutives (pongpisit.thanasutives@riken.jp), RIKEN Center for Advanced Intelligence Project (AIP); Yoshinobu Kawahara (kawahara@ist.osaka-u.ac.jp), The University of Osaka & RIKEN Center for Advanced Intelligence Project (AIP)

###### Abstract

Kalman filters based on the Embedded Latent Transfer Operators (ELTO) have emerged as novel statistical tools for sequential state estimation. However, a critical limitation stems from their use of simplified noise models, which fail to dynamically adapt to non-stationary processes. To address this limitation, we introduce an ELTO-based Bayesian filtering approach with a new structured parameterization for the filter's noise model. This parameterization enables structured noise adaptation, which couples the data-driven learning of an optimal time-invariant noise model with dynamic parameter adaptation that responds to changes in dynamics within non-stationary processes. Empirical results show that our structured noise adaptation improves the filter's dynamic state estimation performance in noisy, time-varying environments.

## 1 Introduction

Sequential state estimation is the process of continuously updating an estimate of a system's state over time based on incoming observations. It serves as a foundation for many real-world applications, including robotics, navigation, and financial modeling (Gebhardt et al., [2019](https://arxiv.org/html/2606.14195#bib.bib7); Greenberg et al., [2023](https://arxiv.org/html/2606.14195#bib.bib37); DeMiguel et al., [2024](https://arxiv.org/html/2606.14195#bib.bib44)). A widely used approach to state estimation is Kalman filtering (Kalman, [1960](https://arxiv.org/html/2606.14195#bib.bib3)), which performs sequential Bayesian state estimation by iteratively propagating a latent state and its uncertainty forward in time (Fukumizu et al., [2013](https://arxiv.org/html/2606.14195#bib.bib40)) and then updating them with new observations. Achieving high state estimation performance is, however, often hindered by a fundamental problem: specifying a noise model that remains robust under the non-stationarity of real-world processes (Masreliez and Martin, [2003](https://arxiv.org/html/2606.14195#bib.bib56)). A common approach to this problem is to down-weight measurement outliers at the update step, which merely treats the symptom of corrupted data (Duran-Martin et al., [2024](https://arxiv.org/html/2606.14195#bib.bib53); Wang et al., [2018](https://arxiv.org/html/2606.14195#bib.bib54); Agamennoni et al., [2012](https://arxiv.org/html/2606.14195#bib.bib55)). In contrast, we address the problem by improving the robustness of the filter's noise model when learning the underlying system dynamics from noisy data.

Our proposed method builds on the work of Gebhardt et al. ([2019](https://arxiv.org/html/2606.14195#bib.bib7)), which introduces a computationally efficient state estimation technique called the kernel Kalman rule (KKR). Similar to the kernel Bayes' rule (KBR) (Fukumizu et al., [2013](https://arxiv.org/html/2606.14195#bib.bib40)), KKR, when combined with the kernel sum rule, uses conditional embedding operators (e.g., the transfer operator) to formulate kernel Kalman filtering (KKF) in the reproducing kernel Hilbert space (RKHS). By performing state estimation in the RKHS, KKF avoids the explicit parametric assumptions required in the original state space.

Given high-dimensional observations, Ke et al. ([2025](https://arxiv.org/html/2606.14195#bib.bib39)) have recently proposed a spectral learning algorithm, derived from stochastic realization theory (Katayama, [2005](https://arxiv.org/html/2606.14195#bib.bib38)), to approximate a data-driven representation of the transfer operator governing the evolution of the embedded latent state in an RKHS. The resulting transfer operator is known as the Embedded Latent Transfer Operator (ELTO). Because the spectral learning algorithm enables data-driven modeling of nonlinear stochastic processes using the ELTO, sequential state estimation (ELTO-KF) can directly combine these identified operators with the Bayesian inference procedure of the KKR. Nonetheless, the ELTO-KF approach inherits a critical limitation: the reliance on fixed noise covariances that fail to adapt to non-stationary processes.

Optimizing the noise models of Kalman filters, i.e., the covariance matrices of process and measurement noise, to ensure robustness against changes in the dynamics of non-stationary processes is an important challenge, because suboptimal noise covariance matrices are known to severely degrade filtering performance (Greenberg et al., [2023](https://arxiv.org/html/2606.14195#bib.bib37)). One noteworthy model-based solution is the adaptive extended Kalman filter (AEKF), which dynamically tunes the noise matrices using the filter's residual and innovation (Wang, [1999](https://arxiv.org/html/2606.14195#bib.bib36); Akhlaghi et al., [2017](https://arxiv.org/html/2606.14195#bib.bib34)). Another class of solutions is data-driven: it treats the parameters of the noise models as learnable and uses numerical optimization to obtain the best time-invariant parameters in terms of fitting performance (Greenberg et al., [2023](https://arxiv.org/html/2606.14195#bib.bib37); Becker et al., [2019](https://arxiv.org/html/2606.14195#bib.bib1)). Both strategies highlight the important role of the noise model. However, to facilitate the learning of complex system dynamics in high-dimensional contexts, a research gap remains in constructing a unified approach that benefits from both parameter learning strategies to ultimately improve the filtering performance.

In this paper, we improve the ELTO-KF by incorporating a new structured parameterization for the filter's noise model, proposing a novel ELTO-based adaptive Kalman filtering method, called ELTO-AKF. This parameterization enables structured noise adaptation, coupling the data-driven learning of the filter's optimal time-invariant noise model with dynamic parameter adaptation to enhance dynamic state estimation performance in non-stationary processes. Our proposed method incorporates adaptive estimation of the noise covariance matrices into the data-driven optimization using the filter's residual and innovation. This integration provides robust covariance representations for noisy non-stationary processes. Our structured parameterization of the noise covariance matrices reduces computational complexity compared to full-rank matrices, providing a tractable structure for dynamic parameter adaptation. The proposed data-driven approach enables the structured noise adaptation, which integrates data-driven parameter optimization with dynamic parameter adaptation, thereby addressing the aforementioned research gap. We demonstrate that ELTO-AKF is effective across a broad range of challenging scenarios, including non-stationary LiDAR trajectories, high-dimensional Lorenz systems, and a downstream denoising task for the sparse identification of partial differential equations (PDEs).

Our contributions are summarized as follows:

- We introduce a new structured parameterization for tractable representation learning of noise covariance matrices and propose the ELTO-AKF method, in which the parameterization is implemented.
- We demonstrate that structured noise adaptation improves the filter's dynamic state estimation performance in non-stationary processes by coupling the learning of a robust, time-invariant noise model via derivative-free optimization with dynamic parameter adaptation.
- We demonstrate the practical utility of ELTO-AKF as a robust filter for denoising scientific data in non-stationary processes. By effectively reducing noise in the observed states, our method improves the accuracy of downstream data-driven equation discovery, such as identifying governing PDEs.

## 2 Background

![Refer to caption](https://arxiv.org/html/2606.14195v1/x1.png)

Figure 1: State-space representation of original and latent variables. $\mathcal{T}$ and $\mathcal{O}$ respectively represent the transfer and observable operators for the latent state embedded in the RKHS by $\mu$.

### 2.1 State Space Models

The state space model (SSM) is a mathematical framework for describing a dynamical system, consisting of a latent state process $\{\mathbf{x}(t) \in \mathbb{X}\}$ and an observation process $\{\mathbf{y}(t) \in \mathbb{Y}\}$. The system evolution is governed by a state transition density $p_{tr}$, which is defined for any measurable set $\mathbb{A}$ by $Pr(\mathbf{x}(t+1) \in \mathbb{A} | \mathbf{x}(t) = \bm{x}) = \int_{\mathbb{A}} p_{tr}(\bm{z} | \bm{x}) \, d\bm{z}$, and an observation density $p_{ob}$, which links the latent state to the observations/measurements and is defined by $Pr(\mathbf{y}(t) \in \mathbb{A}' | \mathbf{x}(t) = \bm{x}) = \int_{\mathbb{A}'} p_{ob}(\bm{y} | \bm{x}) \, d\bm{y}$. Note that $\mathbb{A}'$ is another measurable set.

#### SSMs in Reproducing Kernel Hilbert Spaces.

Since Hilbert-space distribution embeddings allow us to represent arbitrary probability distributions nonparametrically and perform inference entirely in this space, we embed the state densities into RKHSs. Given a measurable positive definite kernel $k$ on $\mathbb{X}$ (e.g., $\sup_{\bm{x} \in \mathbb{X}} k(\bm{x}, \bm{x}') < \infty$) and its corresponding feature map $\psi$, the probability density of the state $\mathbf{x}(t)$ is represented by a kernel mean embedding $\mu_{P_{\mathbf{x}(t)}}$ into the RKHS, denoted as $\mathbb{H}$. This embedding is defined as the mapping $\mu: \mathbb{M}_{+}(\mathbb{X}) \rightarrow \mathbb{H}, \; P \mapsto \int \psi(\bm{x}) \, dP(\bm{x})$ on the measure space $\mathbb{M}_{+}(\mathbb{X})$ for any probability measure $P$ on $\mathbb{X}$. Within this embedded Hilbert space, the abstract transition and observation can be described by covariance operators. Let $(\bm{x}, \bm{y})$ be a random variable taking values in $\mathbb{X} \times \mathbb{Y}$ with a marginal distribution $P_{\mathbf{x}}$ and a joint distribution $P_{\mathbf{xy}}$, and let $\mathbb{H}$ and $\mathbb{G}$ be their corresponding RKHSs with feature maps $\psi$ and $\phi$. Following Baker ([1973](https://arxiv.org/html/2606.14195#bib.bib57)), the covariance operator $\mathcal{C}_{\mathbf{x}}: \mathbb{H} \rightarrow \mathbb{H}$ and the cross-covariance operator $\mathcal{C}_{\mathbf{y}\mathbf{x}}: \mathbb{H} \rightarrow \mathbb{G}$ are defined as:

$$
\mathcal{C}_{\mathbf{x}} := \int \psi(\bm{x}) \otimes \psi(\bm{x}) \, dP_{\mathbf{x}}(\bm{x}), \qquad
\mathcal{C}_{\mathbf{y}\mathbf{x}} := \int \phi(\bm{y}) \otimes \psi(\bm{x}) \, dP_{\mathbf{xy}}(\bm{x}, \bm{y}). \tag{1}
$$

The conditional distribution is represented by the conditional embedding operator $\mathcal{C}_{\mathbf{y}|\mathbf{x}} := \mathcal{C}_{\mathbf{y}\mathbf{x}} \mathcal{C}_{\mathbf{x}}^{-1}$ (Song et al., [2009](https://arxiv.org/html/2606.14195#bib.bib47)).
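To make the kernel mean embedding concrete, the following minimal sketch evaluates an empirical embedding $\hat{\mu}_P(\bm{x}') = \frac{1}{n} \sum_i k(\bm{x}_i, \bm{x}')$ at a few query points. The RBF kernel, the bandwidth, the Gaussian sample distributions, and the helper names `rbf_gram` and `mean_embedding_at` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def rbf_gram(A, B, bandwidth=1.0):
    """RBF kernel matrix with entries k(a_i, b_j); rows of A and B are samples."""
    sq_dist = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth**2))

def mean_embedding_at(samples, queries, bandwidth=1.0):
    """Evaluate the empirical embedding mu_hat(x') = (1/n) sum_i k(x_i, x')."""
    return rbf_gram(samples, queries, bandwidth).mean(axis=0)

rng = np.random.default_rng(0)
samples_p = rng.normal(loc=0.0, size=(500, 1))  # draws from a distribution P
samples_q = rng.normal(loc=3.0, size=(500, 1))  # draws from a shifted distribution Q
queries = np.array([[0.0], [3.0]])

mu_p = mean_embedding_at(samples_p, queries)
mu_q = mean_embedding_at(samples_q, queries)
print("embedding of P at 0 and 3:", mu_p)
print("embedding of Q at 0 and 3:", mu_q)
```

The two embeddings take clearly different values at the query points, which illustrates how a distribution can be represented and compared without writing down its density.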

#### Data-driven Identification of System Operators via Spectral Learning.

Ke et al. ([2025](https://arxiv.org/html/2606.14195#bib.bib39)) proposed a spectral learning algorithm to estimate matrix representations of system operators, namely the Embedded Latent Transfer Operator (ELTO) and the Embedded Observable Operator (EOO), which govern the evolution of the embedded latent state in the RKHS as well as its relationship to the observation. The latent state process $\{\mathbf{x}(t)\}$ is constructed in a data-driven way based on stochastic realization theory (Katayama, [2005](https://arxiv.org/html/2606.14195#bib.bib38)). Consequently, the spectral learning enables data-driven identification of the system operators, $\mathcal{T} := \mathcal{C}_{\mathbf{x}(t+1)|\mathbf{x}(t)}$ and $\mathcal{O} := \mathcal{C}_{\mathbf{y}(t)|\mathbf{x}(t)}$, from the observation process $\{\mathbf{y}(t)\}$. It is shown that a data-driven approximation of system operators is more flexible and can be used to improve the performance of traditional model-based (adaptive) Kalman filters. (Footnote 1: A detailed matrix-based derivation of ELTO is given in Algorithm [3](https://arxiv.org/html/2606.14195#alg3), Appendix [A](https://arxiv.org/html/2606.14195#A1). We also compare the performance against model-based AEKF baselines in Table [1](https://arxiv.org/html/2606.14195#S4.T1).) Both operators ($\mathcal{T}$ and $\mathcal{O}$ in Figure [1](https://arxiv.org/html/2606.14195#S2.F1)) are necessary for the kernel Kalman filtering described next.

### 2.2 Kernel Kalman Filtering

Sequential state estimation, i.e., Kalman filtering, is the primary task for state-space models. The fundamental goal is to sequentially infer the full probability distribution of the latent state from a history of observations. Traditional Kalman filters (left of Figure [2](https://arxiv.org/html/2606.14195#S2.F2)) require explicit parametric models of the system's complex dynamics; however, it is difficult to specify the prior and likelihood probability densities explicitly to complete the Bayesian inference step. Therefore, Bayesian state estimation directly in the observation space is cumbersome. To bypass these parametric constraints, the Kernel Kalman Rule (KKR) (Gebhardt et al., [2019](https://arxiv.org/html/2606.14195#bib.bib7)) is adopted. This method leverages the covariance operators established in Section [2.1](https://arxiv.org/html/2606.14195#S2.SS1) to perform Bayesian updates directly in the RKHS (right of Figure [2](https://arxiv.org/html/2606.14195#S2.F2)), thereby avoiding the need for explicit density specifications (Fukumizu et al., [2013](https://arxiv.org/html/2606.14195#bib.bib40)).

![Refer to caption](https://arxiv.org/html/2606.14195v1/x2.png)

Figure 2: Comparison of Kalman filtering in the original space and in an RKHS. Through the kernel embedding $\mu$, the state vector $\mathbf{x}(t)$ and its covariance $\mathbf{P}(t)$ are lifted to the RKHS mean $\bm{m}_t$ and covariance operator $\bm{S}_t$, with $\bullet^{-}$ and $\bullet^{+}$ denoting the prior and posterior. We utilize linear matrices $F$ and $H$ (instead of nonlinear transitions $p_{tr}$ and $p_{ob}$) to provide better intuition.

#### Kernel Kalman Rule.

Given a training dataset $\{(\tilde{\bm{x}}_i, \bm{x}_i, \bm{y}_i)\}_{i=1}^{N}$, Gebhardt et al. ([2019](https://arxiv.org/html/2606.14195#bib.bib7)) formulated the KKR to compute the empirical transfer and observable operators. Here, $\tilde{\bm{x}}_i$ represents the preceding state (corresponding to $\mathbf{x}(t)$ in the SSM) and $\bm{x}_i$ denotes the current state (corresponding to $\mathbf{x}(t+1)$ in the SSM) with its associated measurement $\bm{y}_i$. (Footnote 2: The data-driven construction of the state pairs $(\tilde{\bm{x}}_i, \bm{x}_i)$ is performed using the spectral learning algorithm described in Appendix [A](https://arxiv.org/html/2606.14195#A1).) Let $k_x(\cdot, \cdot)$ and $k_y(\cdot, \cdot)$ be the kernel functions for the state and observation spaces, associated with feature maps $\psi(\cdot)$ and $\phi(\cdot)$, respectively. The feature matrices are defined as:

$$
\bm{\Psi} = [\psi(\bm{x}_1), \dots, \psi(\bm{x}_N)], \quad \bm{\Phi} = [\phi(\bm{y}_1), \dots, \phi(\bm{y}_N)]. \tag{2}
$$

Subsequently, the kernel matrices (Gram matrices) can be computed as:

$$
\bm{G}_{x} = \bm{\Psi}^{\top} \bm{\Psi}, \quad
\bm{G}_{yx} = \bm{\Phi}^{\top} \bm{\Psi}, \quad
\bm{G}_{\tilde{x}} = \bm{\Psi}_{1}^{\top} \bm{\Psi}_{1}, \quad
\bm{G}_{\tilde{x}x} = \bm{\Psi}_{1}^{\top} \bm{\Psi}_{2}, \quad
\bm{G}_{y} = \bm{\Phi}^{\top} \bm{\Phi}, \tag{3}
$$

where $\bm{\Psi}_{1} := \bm{\Psi}_{:,1:N-1}$ and $\bm{\Psi}_{2} := \bm{\Psi}_{:,2:N}$. Based on these kernel matrices, the transition matrix $\bm{T}$ and observation matrix $\bm{O}$ are derived as:

$$
\bm{T} = (\bm{G}_{\tilde{x}} + \epsilon_{t} \bm{I})^{-1} \bm{G}_{\tilde{x}x}, \quad
\bm{O} = (\bm{G}_{x} + \epsilon_{o} \bm{I})^{-1} \bm{G}_{yx}^{\top}, \tag{4}
$$

where $\epsilon_{t}, \epsilon_{o} > 0$. $\bm{T}$ and $\bm{O}$ serve as the empirical representations of the conditional embedding operators $\mathcal{C}_{\mathbf{x}|\tilde{\mathbf{x}}}$ and $\mathcal{C}_{\mathbf{y}|\mathbf{x}}$, which govern the evolution of the transition and observation densities in the RKHS.
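As a rough illustration of Equations 3 and 4, the sketch below builds the shifted Gram matrices from a state sequence and solves the regularized linear system for $\bm{T}$. It assumes an RBF state kernel, stores samples as rows (the paper stacks features as columns), and uses a synthetic random-walk sequence in place of the spectrally learned states; the function names and default values are hypothetical. The observation matrix $\bm{O}$ is omitted because it additionally depends on how the cross Gram matrix $\bm{G}_{yx}$ is evaluated.

```python
import numpy as np

def rbf_gram(A, B, bandwidth=1.0):
    """RBF kernel matrix with entries k(a_i, b_j); rows of A and B are samples."""
    sq_dist = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth**2))

def transition_matrix(X, eps_t=1e-3, bandwidth=1.0):
    """T = (G_xtilde + eps_t I)^{-1} G_xtilde_x from a state sequence X (rows = time)."""
    X1, X2 = X[:-1], X[1:]                  # preceding and current states
    G_xt = rbf_gram(X1, X1, bandwidth)      # Gram matrix of preceding states
    G_xtx = rbf_gram(X1, X2, bandwidth)     # cross Gram: preceding vs. current
    # Solve the regularized system instead of forming an explicit inverse.
    T = np.linalg.solve(G_xt + eps_t * np.eye(len(X1)), G_xtx)
    return T, G_xt, G_xtx

rng = np.random.default_rng(0)
X = np.cumsum(rng.normal(scale=0.1, size=(50, 2)), axis=0)  # toy state sequence
eps_t = 1e-3
T, G_xt, G_xtx = transition_matrix(X, eps_t=eps_t)
print("T has shape", T.shape)
```

Using `np.linalg.solve` rather than an explicit inverse is a common numerical choice; the regularizer `eps_t` keeps the system well posed when the Gram matrix is nearly singular.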

#### Empirical Kernel Kalman Filters

Sequential estimation is performed by recursively calculating the posterior state through iterative prediction and update steps. To implement these steps in the RKHS using the finite-dimensional operator representations derived above, the mean map $\hat{\mu}_{\mathbf{x}(t)}$ and the covariance operator $\hat{\mathcal{C}}_{\mathbf{x}(t)}$ are tracked through a weight vector $\bm{m}_t \in \mathbb{R}^{N}$ and a weight matrix $\bm{S}_t \in \mathbb{R}^{N \times N}$ as follows:

$$
\hat{\mu}_{\mathbf{x}(t)}^{-} = \bm{\Psi} \bm{m}_{t}^{-} \quad \text{and} \quad \hat{\mathcal{C}}_{\mathbf{x}(t)}^{-} = \bm{\Psi} \bm{S}_{t}^{-} \bm{\Psi}^{\top}, \tag{5}
$$

where the a priori and a posteriori belief states are denoted as $\bullet^{-}$ and $\bullet^{+}$, respectively. The operators are updated in the RKHS through linear operations on these weights. The prediction step corresponds to:

$$
\begin{aligned}
\hat{\mu}_{\mathbf{x}(t)}^{-} = \hat{\mathcal{C}}_{\mathbf{x}|\tilde{\mathbf{x}}} \hat{\mu}_{\mathbf{x}(t-1)}^{+} = \bm{\Psi} \bm{T} \bm{m}_{t-1}^{+} \quad &\Leftrightarrow \quad \bm{m}_{t}^{-} = \bm{T} \bm{m}_{t-1}^{+}, \\
\hat{\mathcal{C}}_{\mathbf{x}(t)}^{-} = \hat{\mathcal{C}}_{\mathbf{x}|\tilde{\mathbf{x}}} \hat{\mathcal{C}}_{\mathbf{x}(t-1)}^{+} \hat{\mathcal{C}}_{\mathbf{x}|\tilde{\mathbf{x}}}^{\top} + \mathcal{C}_{V} \quad &\Leftrightarrow \quad \bm{S}_{t}^{-} = \bm{T} \bm{S}_{t-1}^{+} \bm{T}^{\top} + \bm{Q}.
\end{aligned} \tag{6}
$$

Here, $\mathcal{C}_{V}$ and $\bm{Q}$ represent the process noise covariance in operator and matrix form, respectively. The update step, involving the Kalman gain, is similarly formulated as:

$$
\begin{aligned}
\hat{\mu}_{\mathbf{x}(t)}^{+} &= \hat{\mu}_{\mathbf{x}(t)}^{-} + \mathcal{G}_{t} \left( \phi(\bm{y}_{t}) - \mathcal{C}_{\mathbf{y}|\mathbf{x}} \hat{\mu}_{\mathbf{x}(t)}^{-} \right), \\
\hat{\mathcal{C}}_{\mathbf{x}(t)}^{+} &= \hat{\mathcal{C}}_{\mathbf{x}(t)}^{-} - \mathcal{G}_{t} \mathcal{C}_{\mathbf{y}|\mathbf{x}} \hat{\mathcal{C}}_{\mathbf{x}(t)}^{-}, \\
\mathcal{G}_{t} &= \hat{\mathcal{C}}_{\mathbf{x}(t)}^{-} \mathcal{C}_{\mathbf{y}|\mathbf{x}}^{\top} \left( \mathcal{C}_{\mathbf{y}|\mathbf{x}} \hat{\mathcal{C}}_{\mathbf{x}(t)}^{-} \mathcal{C}_{\mathbf{y}|\mathbf{x}}^{\top} + \mathcal{C}_{W} \right)^{-1}.
\end{aligned} \tag{7}
$$

$\mathcal{G}_{t}$ is the Kalman gain operator. $\mathcal{C}_{W}$ denotes the measurement noise covariance in operator form, whereas $\bm{R}$ (introduced later) denotes its matrix form. We provide the explicit finite-dimensional update equations in Section [3.2](https://arxiv.org/html/2606.14195#S3.SS2) (Equations [11](https://arxiv.org/html/2606.14195#S3.E11)-[13](https://arxiv.org/html/2606.14195#S3.E13)), where they are integrated with our proposed structured noise parameterization.

## 3 Structured Noise Model Adaptation for Sequential Bayesian Filtering

The key limitation of existing Kalman filters is their reliance on dense, unstructured covariance models, which make the direct optimization ill-posed for non-stationary processes. In this section, we incorporate structured noise model adaptation into the data-driven optimization of the KKF, proposing the ELTO-AKF. Within our proposed method, KKF formulates the sequential filtering process given the data-driven system operators obtained from the spectral learning. Constructed from a new sparse structure parameterization detailed in Section [3.1](https://arxiv.org/html/2606.14195#S3.SS1), the structured noise model couples the data-driven optimization for tracking global noise statistics in noisy environments (Section [3.2](https://arxiv.org/html/2606.14195#S3.SS2)) with dynamic parameter adaptation for enhancing robustness in non-stationary environments (Section [3.3](https://arxiv.org/html/2606.14195#S3.SS3)).

### 3.1 Sparse Structure Parameterization

We introduce the scalar-block (SB) structure to facilitate the tractable learning of noise covariance matrices.

Definition 1. Let a matrix $\bm{A} \in \mathbb{R}^{N \times N}$ be partitioned into $k \times k$ sub-blocks. We define a scalar-block matrix $\bm{A}^{S}_{\theta}$ as a matrix where each sub-block is a scaled identity matrix parameterized by a scalar $\theta_{ij} \in \mathbb{R}$:

$$
\bm{A}^{S}_{\theta} = \begin{pmatrix}
\theta_{11} \cdot \bm{I} & \theta_{12} \cdot \bm{I} & \dots & \theta_{1k} \cdot \bm{I} \\
\theta_{21} \cdot \bm{I} & \theta_{22} \cdot \bm{I} & \dots & \theta_{2k} \cdot \bm{I} \\
\vdots & \vdots & \ddots & \vdots \\
\theta_{k1} \cdot \bm{I} & \theta_{k2} \cdot \bm{I} & \dots & \theta_{kk} \cdot \bm{I}
\end{pmatrix}. \tag{8}
$$

A significant challenge is to enforce the SB structure while guaranteeing the symmetric positive-definite (SPD) property required of a covariance matrix. To address this challenge, we adopt the Cholesky parameterization to transform a constrained covariance estimation into an unconstrained problem (Pinheiro and Bates, [1996](https://arxiv.org/html/2606.14195#bib.bib46)). The technique represents the SB matrix via the factorization $\bm{A}^{S}_{\theta} = \bm{L} \bm{L}^{\top}$, where $\bm{L}$ is a block lower-triangular matrix whose sub-blocks are scaled identity matrices. Equivalently, under this Cholesky factorization, each covariance block satisfies $\theta_{ij} = \sum_{r=1}^{\min(i,j)} l_{ir} l_{jr}$. The SPD property of this SB structure is proved in the following.

Proposition 1. Let the factor matrix $\bm{L} \in \mathbb{R}^{N \times N}$ be a block lower-triangular matrix where each sub-block is $\bm{L}_{ij} = l_{ij} \cdot \bm{I}$. If all diagonal scalars $l_{ii}$ are non-zero, then the matrix $\bm{A}$ constructed as $\bm{A} = \bm{L} \bm{L}^{\top}$ is a symmetric positive-definite (SPD), scalar-block matrix.

Proof. Symmetry and the scalar-block structure follow directly from the construction $\bm{A} = \bm{L} \bm{L}^{\top}$ and the properties of block matrix multiplication. For positive definiteness, we consider the quadratic form $\bm{x}^{\top} \bm{A} \bm{x} = \|\bm{L}^{\top} \bm{x}\|_{2}^{2}$. The condition $l_{ii} \neq 0$ for all $i$ ensures that the block lower-triangular matrix $\bm{L}$ is invertible. Since $\bm{L}$ is invertible, $\bm{L}^{\top} \bm{x} \neq 0$ for any non-zero vector $\bm{x}$, which guarantees that the norm $\|\bm{L}^{\top} \bm{x}\|_{2}^{2}$ is positive. Thus, $\bm{A}$ is positive-definite. $\square$

The SB parameterization, defined by the set of learnable scalars $\{\theta_{ij}\}$, provides an expressive yet computationally tractable representation of the noise covariance matrices, serving as a crucial component for establishing the structured noise adaptation of our ELTO-AKF.
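The construction in Definition 1 and Proposition 1 can be checked numerically. The sketch below is a minimal illustration, assuming equally sized blocks so that the factor can be written as a Kronecker product of a $k \times k$ lower-triangular scalar matrix with an identity block; the function name `scalar_block_spd` and the random values are hypothetical.

```python
import numpy as np

def scalar_block_spd(l, block_size):
    """Return A = L L^T, where block (i, j) of L is l[i, j] * I (lower-triangular l)."""
    l = np.tril(np.asarray(l, dtype=float))
    if np.any(np.diag(l) == 0.0):
        raise ValueError("diagonal scalars l_ii must be non-zero")
    # Kronecker product turns every scalar l_ij into the block l_ij * I.
    L = np.kron(l, np.eye(block_size))
    return L @ L.T

k, block_size = 3, 4                         # 3 x 3 blocks of size 4, so N = 12
rng = np.random.default_rng(0)
l = np.tril(rng.normal(size=(k, k)))
l[np.diag_indices(k)] = 1.0 + np.abs(rng.normal(size=k))  # keep l_ii away from zero

A = scalar_block_spd(l, block_size)
theta = l @ l.T                              # theta_ij = sum_r l_ir * l_jr

print("symmetric:", np.allclose(A, A.T))
print("smallest eigenvalue:", np.linalg.eigvalsh(A).min())
print("scalar-block:", np.allclose(A, np.kron(theta, np.eye(block_size))))
```

Under this assumption the matrix is described by the $k(k+1)/2$ scalars in the lower triangle of `l`, compared with $N(N+1)/2$ free entries for a general symmetric $N \times N$ matrix.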

### 3.2 Data-driven Optimization of Kernel Kalman Filters

![Refer to caption](https://arxiv.org/html/2606.14195v1/x3.png)

Figure 3: Schematic of the ELTO-AKF approach with dynamic parameter adaptation.

We detail the matrix-based implementation of our ELTO-AKF and formulate the optimization objective for learning an optimal, time-invariant noise model. The relationship between the abstract operators in Equations [2.2](https://arxiv.org/html/2606.14195#S2.Ex2)-[7](https://arxiv.org/html/2606.14195#S2.E7) and their concrete matrices is grounded in the KKF, where operators and state embeddings are represented as linear combinations of the feature mappings.

We implement the filtering process using the kernel Kalman rule (Gebhardt et al., [2019](https://arxiv.org/html/2606.14195#bib.bib7)), employing the system matrices $\bm{T}$ and $\bm{O}$ derived empirically from Gram matrices via the spectral learning algorithm detailed in Appendix [A](https://arxiv.org/html/2606.14195#A1). ELTO-based filters (Ke et al., [2025](https://arxiv.org/html/2606.14195#bib.bib39)) originally restrict noise covariances to static, identity-based matrices (i.e., $\bm{Q} = \epsilon_{q} \bm{I}$ and $\bm{R} = \epsilon_{r} \bm{I}$). This non-adaptive design is insufficient for capturing time-varying dynamics, leading to suboptimal Kalman gains $\bm{G}_{t}$ and degraded filtering performance in non-stationary environments. To address this problem, we parameterize the noise covariance matrices as learnable, structured matrices (i.e., $\bm{Q}_{\theta}$ and $\bm{R}_{\theta}$) using the SB structure.

The sequential filtering process proceeds for $t = 1, \dots, T$ as follows:

Prediction:

$$
\begin{aligned}
\bm{m}_{t}^{-} &= \bm{T} \bm{m}_{t-1}^{+}, && (9) \\
\bm{S}_{t}^{-} &= \bm{T} \bm{S}_{t-1}^{+} \bm{T}^{\top} + \bm{Q}_{\theta}. && (10)
\end{aligned}
$$

Update:

$$
\begin{aligned}
\bm{G}_{t} &= \bm{S}_{t}^{-} \bm{O}^{\top} \left( \bm{G}_{y} \bm{O} \bm{S}_{t}^{-} \bm{O}^{\top} + \bm{R}_{\theta} \right)^{-1}, && (11) \\
\bm{m}_{t}^{+} &= \bm{m}_{t}^{-} + \bm{G}_{t} \left( \bm{o}(\bm{y}_{t}) - \bm{G}_{y} \bm{O} \bm{m}_{t}^{-} \right), && (12) \\
\bm{S}_{t}^{+} &= \left( \bm{I} - \bm{G}_{t} \bm{G}_{y} \bm{O} \right) \bm{S}_{t}^{-}. && (13)
\end{aligned}
$$

Here, we define the feature map $\bm{o}(\bm{y}_{t})$ from the subsequence $\bm{Y}_{1:N}$ of the training observation $\bm{Y} = [\bm{y}_{1}, \dots, \bm{y}_{T}]$, where $N = T - 2h + 2$ and $h$ is the window size. This map is realized via the kernel trick as a vector of kernel evaluations: $\bm{o}(\bm{y}_{t}) := [k_{y}(\bm{y}_{1}, \bm{y}_{t}), \dots, k_{y}(\bm{y}_{N}, \bm{y}_{t})]^{\top}$. After obtaining the posterior state estimate ($\bm{m}_{t}^{+}, \bm{S}_{t}^{+}$), the pre-image step maps the estimate back into the original observation space. The following pre-image estimates $\hat{\bm{\eta}}_{t}$ and $\hat{\bm{\Sigma}}_{t}$ can be compared with the measurement (Gebhardt et al., [2019](https://arxiv.org/html/2606.14195#bib.bib7)):

$$
\begin{aligned}
\hat{\bm{\eta}}_{t} &= \bm{Y}_{1:N} \bm{O} \bm{m}_{t}^{+}, && (14) \\
\hat{\bm{\Sigma}}_{t} &= \bm{Y}_{1:N} \bm{O} \bm{S}_{t}^{+} \bm{O}^{\top} \bm{Y}_{1:N}^{\top}. && (15)
\end{aligned}
$$
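The recursion in Equations 9 to 15 amounts to a few matrix products per time step. The following sketch transcribes one prediction, update, and pre-image step, assuming all matrices are already available as NumPy arrays and that the trailing factor in Equation 15 is the transpose of the same $N$ training columns so that dimensions agree. The toy matrices (a linear-kernel Gram matrix, identity-based $\bm{T}$, $\bm{O}$, $\bm{Q}$, $\bm{R}$) and the names `kkf_step` and `pre_image` are illustrative assumptions only.

```python
import numpy as np

def kkf_step(m_post, S_post, o_t, T, O, G_y, Q, R):
    """One prediction and update step on the weight vector m and weight matrix S."""
    # Prediction (Eqs. 9-10)
    m_prior = T @ m_post
    S_prior = T @ S_post @ T.T + Q
    # Update (Eqs. 11-13); the gain solves gain @ innov_cov = S_prior @ O.T
    innov_cov = G_y @ O @ S_prior @ O.T + R
    gain = np.linalg.solve(innov_cov.T, (S_prior @ O.T).T).T
    m_new = m_prior + gain @ (o_t - G_y @ O @ m_prior)
    S_new = (np.eye(len(m_prior)) - gain @ G_y @ O) @ S_prior
    return m_new, S_new

def pre_image(Y_train, O, m_post, S_post):
    """Map the posterior weights back to observation space (Eqs. 14-15)."""
    B = Y_train @ O                      # Y_train has shape (d, N)
    return B @ m_post, B @ S_post @ B.T

rng = np.random.default_rng(0)
N, d = 5, 2
Y_train = rng.normal(size=(d, N))
G_y = Y_train.T @ Y_train                # toy Gram matrix (linear kernel)
T, O = 0.9 * np.eye(N), np.eye(N)        # placeholder system matrices
Q, R = 0.01 * np.eye(N), 0.1 * np.eye(N)
m0, S0 = np.full(N, 1.0 / N), np.eye(N) / N
o_t = G_y[:, 0]                          # kernel evaluations of y_1 against training data

m1, S1 = kkf_step(m0, S0, o_t, T, O, G_y, Q, R)
eta, Sigma = pre_image(Y_train, O, m1, S1)
print("pre-image mean:", eta)
```

When the measurement noise `R` is made very large, the gain shrinks towards zero and the posterior weights stay close to the prediction `T @ m0`, which matches the usual Kalman filter intuition.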
The sequence of estimates\{𝜼^t\}t=1T\\\{\\hat\{\\bm\{\\eta\}\}\_\{t\}\\\}\_\{t=1\}^\{T\}depends on the noise parametersθ\\thetathrough the matrices𝑸θ\\bm\{Q\}\_\{\\theta\}and𝑹θ\\bm\{R\}\_\{\\theta\}, which influence the Kalman gain𝑮t\\bm\{G\}\_\{t\}and covariance propagation\. Since filter performance is impacted by these matrices and standard estimation relies on potentially violated assumptions\(Wang,[1999](https://arxiv.org/html/2606.14195#bib.bib36)\), directly minimizingθ\\thetato minimize the end\-to\-end estimation error offers a data\-driven approach to finding an optimal time\-invariant model with the best overall robustness for a given distribution\(Greenberget al\.,[2023](https://arxiv.org/html/2606.14195#bib.bib37)\)\. Our objective is the mean squared error \(MSE\):

θ∗=argmin𝜃​1T​∑t=1T‖𝜼^t−𝒚tval‖2\.\\theta^\{\*\}=\\underset\{\\theta\}\{\\text\{argmin\}\}\\frac\{1\}\{T\}\\sum\_\{t=1\}^\{T\}\|\|\\hat\{\\bm\{\\eta\}\}\_\{t\}\-\\bm\{y\}\_\{t\}^\{\\text\{val\}\}\|\|^\{2\}\.\(16\)Here,𝒀val=\[𝒚1val,…,𝒚Tval\]\\bm\{Y\}^\{\\text\{val\}\}=\[\\bm\{y\}^\{\\text\{val\}\}\_\{1\},\\dots,\\bm\{y\}^\{\\text\{val\}\}\_\{T\}\]is a split of the full observation data different from training data𝒀\\bm\{Y\}that is used for estimating the system operators from Algorithm[3](https://arxiv.org/html/2606.14195#alg3)\.𝒀val\\bm\{Y\}^\{\\text\{val\}\}lets us evaluate the generalizability of𝑻\\bm\{T\}and𝑶\\bm\{O\}properly, and also find the optimalθ∗\\theta^\{\*\}of the noise covariance matrices for the filter\.

Following the monotonicity property of the RBF kernel proven in Appendix [B](https://arxiv.org/html/2606.14195#A2), minimizing the squared error in observation space (Equation [16](https://arxiv.org/html/2606.14195#S3.E16)) shares the same minimizer as minimizing the filter's innovation residual $\bm{o}(\bm{y}_{t})-\bm{G}_{y}\bm{O}\bm{m}_{t}^{-}$ in the RKHS (cf. Equation [12](https://arxiv.org/html/2606.14195#S3.E12)), providing a principled (rather than purely heuristic) optimization target. Unlike traditional parameter estimation in SSMs, which, for example, relies on maximum likelihood estimation via particle methods (Kantas et al., [2015](https://arxiv.org/html/2606.14195#bib.bib64)), our RKHS formulation bypasses explicit probability density functions. To numerically solve this optimization objective, we therefore employ CMA-ES (Hansen, [2006](https://arxiv.org/html/2606.14195#bib.bib11)), a derivative-free evolution strategy well suited for minimizing non-convex functions with a small number of parameters. Algorithm [1](https://arxiv.org/html/2606.14195#alg1) finds the optimal noise model for the observation distribution, ensuring robust performance in noisy, non-stationary environments. The code is available online at [https://github.com/keeenda/ELTO-AKF](https://github.com/keeenda/ELTO-AKF).
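To make the shape of the objective in Equation (16) concrete, the sketch below evaluates an MSE objective over two noise parameters for a toy scalar random-walk Kalman filter. Several things here are assumptions for illustration only: the scalar filter stands in for the ELTO-based filter, the parameters are log-variances, the filter reads a noisier copy of the validation targets so that the optimum is non-trivial, and a coarse grid search stands in for CMA-ES.

```python
import itertools
import math
import random

def run_scalar_filter(inputs, q, r):
    """Random-walk Kalman filter on a scalar series; returns posterior means."""
    m, s = inputs[0], 1.0
    out = []
    for y in inputs:
        s_prior = s + q                  # prediction with identity transition
        gain = s_prior / (s_prior + r)   # scalar Kalman gain
        m = m + gain * (y - m)           # posterior mean
        s = (1.0 - gain) * s_prior       # posterior variance
        out.append(m)
    return out

def mse_objective(theta, inputs, targets):
    """MSE in the style of Eq. (16); theta = (log q, log r) keeps variances positive."""
    q, r = math.exp(theta[0]), math.exp(theta[1])
    estimates = run_scalar_filter(inputs, q, r)
    return sum((e - t) ** 2 for e, t in zip(estimates, targets)) / len(targets)

random.seed(0)
targets = [math.sin(0.05 * t) for t in range(400)]        # hypothetical validation series
inputs = [y + random.gauss(0.0, 0.5) for y in targets]    # noisier copy fed to the filter

# Coarse grid over log-variances; a stand-in for the CMA-ES search, not CMA-ES itself.
grid = [-6.0, -4.0, -2.0, 0.0, 2.0]
theta_star = min(itertools.product(grid, grid),
                 key=lambda th: mse_objective(th, inputs, targets))
```

Because the objective is only evaluated, never differentiated, any derivative-free optimizer can be plugged in at the last step in place of the grid.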

Algorithm 1: ELTO-based Kalman Filtering with Structured Noise Model Adaptation

1: Input: training data $\bm{Y}=[\bm{y}_{1},\dots,\bm{y}_{T}]$, validation data $\bm{Y}^{\text{val}}=[\bm{y}^{\text{val}}_{1},\dots,\bm{y}^{\text{val}}_{T}]$, window size $w$, memory factors $(\alpha,\beta)$, and randomly initialized parameters $\epsilon_{q},\epsilon_{r}$ or $\epsilon_{l},\epsilon_{m}$.

2: // Initialize ELTO

3: Compute transfer and observable matrices $\bm{T}$ and $\bm{O}$ following Algorithm [3](https://arxiv.org/html/2606.14195#alg3) using $\bm{Y}$.

4: Initialize mean $\bm{m}_{0}$ and variance $\bm{S}_{0}$.

5: // Initialize noise covariance matrices following Equation [8](https://arxiv.org/html/2606.14195#S3.E8)

6: if $\alpha=\beta=1$ then

7: Parameterize $\bm{Q}_{0}$ and $\bm{R}_{0}$ with SB structure using $\theta_{ij}=\epsilon_{q}$ or $\epsilon_{r}$, respectively.

8: else

9: Parameterize $\bm{L}_{0},\bm{M}_{0}$ with SB structure using $\theta_{ij}=\epsilon_{l}$ or $\epsilon_{m}$, respectively, for $i\geq j$.

10: $\bm{L}\leftarrow\bm{L}_{0}$, $\bm{M}\leftarrow\bm{M}_{0}$

11: $\bm{Q}\leftarrow\bm{L}\bm{L}^{\top}$, $\bm{R}\leftarrow\bm{M}\bm{M}^{\top}$

12: end if

13: // Derivative-free optimization by CMA-ES

14: Compute the pre-image estimates $[\hat{\bm{\eta}}_{1},\dots,\hat{\bm{\eta}}_{T}],[\hat{\bm{\Sigma}}_{1},\dots,\hat{\bm{\Sigma}}_{T}]$ and the corresponding $\theta^{*}$ following Algorithm [2](https://arxiv.org/html/2606.14195#alg2) using $\bm{Y}^{\text{val}}$.

15: Output: $\theta^{*}$, $[\hat{\bm{\eta}}_{1},\dots,\hat{\bm{\eta}}_{T}]$ and $[\hat{\bm{\Sigma}}_{1},\dots,\hat{\bm{\Sigma}}_{T}]$

### 3.3 Dynamic Parameter Adaptation

We detail the dynamic parameter adaptation for non-stationary environments where noise levels or dynamic characteristics of the system change abruptly over time, as shown in Figure [4(c)](https://arxiv.org/html/2606.14195#S4.F4.sf3). In such scenarios, fixed covariance matrices fail to capture sudden dynamic shifts. We detect these changes by tracking pre-fit and post-fit errors: the innovation and the residual. A spike in innovation indicates a shift in the system dynamics, and a large posterior residual suggests that the updated state still struggles to reconcile with (potentially unreliable) incoming sensor data. Specifically, we employ the analytical update rules of the adaptive extended Kalman filter (AEKF) (Akhlaghi et al., [2017](https://arxiv.org/html/2606.14195#bib.bib34)) to properly handle non-stationary processes. The innovation and residual are estimated as follows:

$$
\begin{aligned}
\text{Operator form} \quad &\Leftrightarrow \quad \text{Matrix form}\\
\phi(\bm{y}_{t})-\mathcal{C}_{\mathbf{y}|\mathbf{x}}\hat{\mu}_{\mathbf{x}(t)}^{-} \quad &\Leftrightarrow \quad d_{t}=\bm{o}(\bm{y}_{t})-\bm{G}_{y}\bm{O}\bm{m}_{t}^{-},\\
\phi(\bm{y}_{t})-\mathcal{C}_{\mathbf{y}|\mathbf{x}}\hat{\mu}_{\mathbf{x}(t)}^{+} \quad &\Leftrightarrow \quad \epsilon_{t}=\bm{o}(\bm{y}_{t})-\bm{G}_{y}\bm{O}\bm{m}_{t}^{+}. && \text{(17)}
\end{aligned}
$$

The AEKF update uses memory factors ($\alpha,\beta$) to dynamically adapt the estimates of $\bm{Q}$ and $\bm{R}$. For example, a larger memory factor puts more weight on the previous state, which helps the filter recover the overall trend of the system dynamics and makes it less sensitive to abrupt changes, as shown in Figure [4(c)](https://arxiv.org/html/2606.14195#S4.F4.sf3). The update rules are:

$$
\begin{aligned}
\bm{R}_{t} &= \alpha\bm{R}_{t-1}+(1-\alpha)(\epsilon_{t}\epsilon_{t}^{\top}+\bm{G}_{y}\bm{O}\bm{S}_{t}^{+}\bm{O}^{\top}),\\
\bm{Q}_{t} &= \beta\bm{Q}_{t-1}+(1-\beta)(\bm{G}_{t}d_{t}d_{t}^{\top}\bm{G}_{t}^{\top}). && \text{(18)}
\end{aligned}
$$

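A direct transcription of Equation (18) may help fix the roles of the two memory factors. In this sketch the function name, the argument `obs_cov` (a placeholder for the term $\bm{G}_{y}\bm{O}\bm{S}_{t}^{+}\bm{O}^{\top}$), and the random demo values are assumptions.

```python
import numpy as np

def aekf_noise_update(Q_prev, R_prev, G_t, d_t, eps_t, obs_cov, alpha, beta):
    """Exponentially weighted covariance update of Eq. (18)."""
    # Residual-based estimate of the observation noise covariance.
    R_t = alpha * R_prev + (1.0 - alpha) * (np.outer(eps_t, eps_t) + obs_cov)
    # Innovation-based estimate of the process noise covariance.
    g = G_t @ d_t
    Q_t = beta * Q_prev + (1.0 - beta) * np.outer(g, g)
    return Q_t, R_t

rng = np.random.default_rng(0)
n, N = 3, 4                                   # hypothetical latent and feature sizes
Q_prev, R_prev = 0.1 * np.eye(n), 0.2 * np.eye(N)
G_t = rng.normal(size=(n, N))
d_t, eps_t = rng.normal(size=N), rng.normal(size=N)
obs_cov = 0.05 * np.eye(N)                    # placeholder for G_y O S_t^+ O^T

Q_t, R_t = aekf_noise_update(Q_prev, R_prev, G_t, d_t, eps_t, obs_cov,
                             alpha=0.9, beta=0.9)
```

With `alpha = beta = 1` the update returns the previous matrices unchanged, which corresponds to the time-invariant noise model.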
However, directly updating covariance matrices faces hurdles with positive-definiteness and high dimensionality. As shown in Figure [3](https://arxiv.org/html/2606.14195#S3.F3), our ELTO-AKF approach circumvents these issues by operating on the Cholesky factors ($\bm{L}$ and $\bm{M}$), leveraging the advantages of the SB parameterization. This approach guarantees that $\bm{Q}$ and $\bm{R}$ are SPD matrices and maintains their tractability. Without the SPD guarantee for the noise covariance matrices, the performance of traditional Kalman filtering can degrade significantly (Greenberg et al., [2023](https://arxiv.org/html/2606.14195#bib.bib37)).

The adaptive process begins with initial Cholesky factors $\bm{L}_{0}$ and $\bm{M}_{0}$, defining $\bm{Q}_{0}=\bm{L}_{0}\bm{L}_{0}^{\top}$ and $\bm{R}_{0}=\bm{M}_{0}\bm{M}_{0}^{\top}$. For the periodic updates, we employ a sparse projection operator $\text{SP}(\cdot)$, defined in two stages. First, we enforce the SB structure on the dense estimate $\bm{A}$ (representing either $\hat{\bm{Q}}_{k}$ or $\hat{\bm{R}}_{k}$) by averaging its block traces: $\theta_{ij}=\frac{\text{tr}(\bm{A}_{ij})}{\text{rank}(\bm{A}_{ij})}$, yielding an intermediate matrix $\tilde{\bm{A}}$. Second, to guarantee the SPD property required for Cholesky factorization, we perform the rectification $\bm{A}^{S}=\bm{V}\tilde{\bm{\Lambda}}\bm{V}^{\top}$, where $\tilde{\bm{A}}=\bm{V}\bm{\Lambda}\bm{V}^{\top}$ is the eigen-decomposition and $\tilde{\bm{\Lambda}}=\mathrm{diag}\bigl(\max(\lambda_{i},\xi)\bigr)_{i=1}^{N}$ with $\xi>0$ denoting a minimum stability threshold. The resulting positive-definite matrix $\bm{A}^{S}$ is decomposed into Cholesky factors, which are then updated using the memory terms, as detailed in Algorithm [2](https://arxiv.org/html/2606.14195#alg2).
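The two-stage projection can be sketched as below. This is an illustration under stated assumptions, not the authors' code: the SB structure is taken to mean that every block is a scalar multiple of the identity, the blocks are assumed square and equally sized, the block size is used in place of $\text{rank}(\bm{A}_{ij})$, and the name `sparse_projection` and the demo matrix are hypothetical.

```python
import numpy as np

def sparse_projection(A, k, xi=1e-6):
    """SP(.) sketch: scalar-block averaging followed by eigenvalue clipping."""
    n = A.shape[0]
    b = n // k                                    # block size (assumes k divides n)
    A_tilde = np.zeros((n, n))
    for i in range(k):
        for j in range(k):
            block = A[i * b:(i + 1) * b, j * b:(j + 1) * b]
            theta_ij = np.trace(block) / b        # block size stands in for rank(A_ij)
            A_tilde[i * b:(i + 1) * b, j * b:(j + 1) * b] = theta_ij * np.eye(b)
    A_tilde = 0.5 * (A_tilde + A_tilde.T)         # enforce exact symmetry before eigh
    lam, V = np.linalg.eigh(A_tilde)
    return (V * np.maximum(lam, xi)) @ V.T        # V diag(max(lam_i, xi)) V^T

rng = np.random.default_rng(1)
B = rng.normal(size=(6, 6))
A = 0.5 * (B + B.T)                               # symmetric but indefinite demo input
A_S = sparse_projection(A, k=2, xi=1e-3)
L = np.linalg.cholesky(A_S)                       # succeeds because A_S is SPD
```

Under the identity-block assumption, clipping the eigenvalues leaves the block structure intact, so the projected matrix is still described by the $k \times k$ scalars $\theta_{ij}$.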

Algorithm 2: Kernel Kalman Filtering with Dynamic Parameter Adaptation

1: Input: $\bm{T}$, $\bm{O}$, $\bm{m}_{0}$, $\bm{M}$, $\bm{L}$, $\bm{S}_{0}$, $\bm{Q}_{\theta}$, $\bm{R}_{\theta}$ from Algorithm [1](https://arxiv.org/html/2606.14195#alg1), $\bm{Y}^{\text{val}}=[\bm{y}^{\text{val}}_{1},\dots,\bm{y}^{\text{val}}_{T}]$, window size $w$, and memory factors $(\alpha,\beta)$.

2: for $t=1,\dots,T$ do

3: // Prediction

4: $\bm{m}_{t}^{-}\leftarrow\bm{T}\bm{m}_{t-1}^{+}$, $\bm{S}_{t}^{-}\leftarrow\bm{T}\bm{S}_{t-1}^{+}\bm{T}^{\top}+\bm{Q}_{\theta}$

5: // Update

6: $\bm{G}_{t}\leftarrow\bm{S}_{t}^{-}\bm{O}^{\top}(\bm{G}_{y}\bm{O}\bm{S}_{t}^{-}\bm{O}^{\top}+\bm{R}_{\theta})^{-1}$ $\triangleright$ Kalman gain

7: $d_{t}\leftarrow\bm{o}(\bm{y}_{t})-\bm{G}_{y}\bm{O}\bm{m}_{t}^{-}$ $\triangleright$ Innovation

8: $\bm{m}_{t}^{+}\leftarrow\bm{m}_{t}^{-}+\bm{G}_{t}d_{t}$, $\bm{S}_{t}^{+}\leftarrow(\bm{I}-\bm{G}_{t}\bm{G}_{y}\bm{O})\bm{S}_{t}^{-}$

9: $\epsilon_{t}\leftarrow\bm{o}(\bm{y}_{t})-\bm{G}_{y}\bm{O}\bm{m}_{t}^{+}$ $\triangleright$ Residual

10: // Dynamic Parameter Adaptation

11: $k\leftarrow\lfloor t/w\rfloor,\quad r\leftarrow t\bmod w$

12: if $r=0$ and not $\alpha=\beta=1$ then $\triangleright$ In this condition, $\bm{Q}$, $\bm{R}$ are time-variant. $\theta$ is abbreviated.

13: $\hat{\bm{R}}_{k}\leftarrow\frac{1}{w}\sum_{i=(k-1)w+1}^{kw}\epsilon_{i}\epsilon_{i}^{\top}+\bm{G}_{y}\bm{O}\bm{S}_{kw}^{+}\bm{O}^{\top}$, $\hat{\bm{Q}}_{k}\leftarrow\frac{1}{w}\sum_{i=(k-1)w+1}^{kw}\bm{G}_{i}d_{i}d_{i}^{\top}\bm{G}_{i}^{\top}$

14: $\hat{\bm{R}}_{k}^{S}\leftarrow\text{SP}(\hat{\bm{R}}_{k})$, $\hat{\bm{Q}}_{k}^{S}\leftarrow\text{SP}(\hat{\bm{Q}}_{k})$ $\triangleright$ Sparse projection into SB structure

15: $\hat{\bm{M}}_{k}\leftarrow\text{Decompose}(\hat{\bm{R}}_{k}^{S})$, $\hat{\bm{L}}_{k}\leftarrow\text{Decompose}(\hat{\bm{Q}}_{k}^{S})$ $\triangleright$ Cholesky decomposition

16: $\bm{M}\leftarrow\alpha\bm{M}+(1-\alpha)\hat{\bm{M}}_{k}$, $\bm{L}\leftarrow\beta\bm{L}+(1-\beta)\hat{\bm{L}}_{k}$

17: $\bm{R}\leftarrow\bm{M}\bm{M}^{\top}$, $\bm{Q}\leftarrow\bm{L}\bm{L}^{\top}$

18: end if

19: // Projection back to the original space

20: $\hat{\bm{\eta}}_{t},\hat{\bm{\Sigma}}_{t}\leftarrow\text{pre\_image}(\bm{m}_{t}^{+},\bm{S}_{t}^{+})$ $\triangleright$ $\hat{\bm{\eta}}_{t}$ is a short-hand notation for $\hat{\bm{\eta}}_{t}(\theta)$.

21: end for

22: $\theta^{*}=\underset{\theta}{\text{argmin}}\ \frac{1}{T}\sum_{t=1}^{T}\lVert\hat{\bm{\eta}}_{t}-\bm{y}_{t}^{\text{val}}\rVert^{2}$ $\triangleright$ During the test phase, we perform inference using $\theta^{*}$.

23: Output: $\theta^{*}$, $[\hat{\bm{\eta}}_{1},\dots,\hat{\bm{\eta}}_{T}]$ and $[\hat{\bm{\Sigma}}_{1},\dots,\hat{\bm{\Sigma}}_{T}]$
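The windowed schedule and the factor blending in lines 11 to 17 of Algorithm 2 can be illustrated in isolation. The sketch below only tracks the observation-noise factor $\bm{M}$; the random matrix standing in for the projected window estimate, the sizes, and the name `blend_factor` are assumptions.

```python
import numpy as np

def blend_factor(F, F_hat, memory):
    """Convex blend of a running Cholesky factor with a fresh window estimate."""
    return memory * F + (1.0 - memory) * F_hat

w, T, alpha = 5, 20, 0.9
rng = np.random.default_rng(0)
M = np.linalg.cholesky(0.5 * np.eye(3))        # running factor, R = M M^T
update_steps = []

for t in range(1, T + 1):
    k, r = divmod(t, w)                        # k = floor(t / w), r = t mod w
    if r == 0 and alpha != 1.0:
        B = rng.normal(size=(3, 3))
        R_hat_S = B @ B.T + 1e-3 * np.eye(3)   # stand-in for SP(R_hat_k), SPD by construction
        M_hat = np.linalg.cholesky(R_hat_S)
        M = blend_factor(M, M_hat, alpha)      # line 16 of Algorithm 2
        R = M @ M.T                            # line 17 of Algorithm 2
        update_steps.append(t)
```

Both factors are lower triangular with positive diagonals, so their convex blend has the same property and the rebuilt covariance stays SPD. This is the structural reason for blending factors rather than covariances.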

## 4 Numerical Experiments

### 4.1 Denoising the Nonlinear Pendulum System

To evaluate the performance of our proposed method in a standard filtering context, we first examined its denoising ability on a simulated pendulum task. The experimental setup follows the one described in (Gebhardt et al., [2019](https://arxiv.org/html/2606.14195#bib.bib7)), based on a simulated single pendulum system (the simulation was performed using the code available at [https://github.com/gregorgebhardt/pyKKR](https://github.com/gregorgebhardt/pyKKR)). For each simulation run, the initial angle $q_{0}$ and angular velocity $\dot{q}_{0}$ were uniformly sampled from the ranges $[-0.25\pi,0.25\pi]$ and $[-2\pi,2\pi]$ rad/s, respectively. The dynamics were simulated at 10,000 Hz subject to a normally distributed process noise with standard deviation $n_{p}$. Observations of the joint positions were collected at 10 Hz, and corrupted by an additive Gaussian noise with standard deviation $n_{o}$ centered on the true angle $q_{t}$. For our evaluations, we defined three specific noise regimes: default setting ($n_{p}=0.1,n_{o}=0.01$), high process noise ($n_{p}=0.2,n_{o}=0.1$), and high observation noise ($n_{p}=0.1,n_{o}=0.1$).

The generated data were partitioned into a training sequence of 1500 and a test sequence of 300 time steps. The spectral learning architecture was trained using the Adam optimizer with a learning rate of $10^{-3}$ and configured with a window size of 5. For our proposed structured noise model, which used a diagonal parameterization with $k_{q}=k_{r}=5$ blocks, the optimal noise variances were tuned using CMA-ES.

We benchmark ELTO-AKF against four fundamentally different families of baselines. Constant Velocity AEKF (CV) and linear AEKF (Linear) rely on analytical priors or assumed linear system dynamics. Sampling-based and neural-aided Kalman filters (Hu et al., [2025](https://arxiv.org/html/2606.14195#bib.bib41); Loo et al., [2024](https://arxiv.org/html/2606.14195#bib.bib43)), such as the Sigma-Point Kalman Filter (SPKF) and CKFNet, avoid explicit Jacobians via spatial sampling and often integrate recurrent networks to adapt to unmodeled dynamics. The Kernel Kalman Rule (KKR) (Gebhardt et al., [2019](https://arxiv.org/html/2606.14195#bib.bib7)) and its efficient subsampled approximation (SubKKR) perform non-parametric updates on RKHSs while leaving the latent state $\bm{x}$ model-based. Finally, operator-based ELTO-KF learns from observations but uses fixed identity-based covariance matrices ($\bm{Q}=\epsilon_{q}\bm{I},\bm{R}=\epsilon_{r}\bm{I}$).

Table [1(a)](https://arxiv.org/html/2606.14195#S4.T1.st1) reports the mean squared error (MSE) averaged over 5 trials. Under default noise, the model-based AEKF (CV) achieves the lowest error since its analytical priors align well with simple dynamics. However, as noise increases, AEKF and SPKF errors rise considerably due to fixed priors and sparse sampling. While neural-aided architectures like CKFNet show promise in standard settings, they often encounter numerical divergence under extreme noise conditions, as unconstrained neural predictions struggle to preserve the symmetric positive-definite (SPD) property of covariance matrices. Similarly, supervised KKR/SubKKR methods struggle with complex noise distributions. In contrast, the data-driven ELTO-AKF is competitive in the default setting and outperforms baselines under high process and high observation noise. The Scalar-Block parameterization and its low-dimensional Cholesky updates effectively overcome static noise limitations and structurally guarantee stability in dynamic environments. A conceptual comparison of these filtering methods across key properties is summarized in Table [1(b)](https://arxiv.org/html/2606.14195#S4.T1.st2), showing that our data-driven ELTO-AKF is the only method designed to be adaptive for non-stationary processes while guaranteeing the SPD structure.

Table 1: Comparison of sequential state estimation methods.

(a) Denoising performance (MSE $\times 10^{-3}$) on the pendulum dataset under various noise conditions. CV and Linear are model-based AEKFs; SubKKR, KKR, and SPKF are kernel- or sampling-based; ELTO-KF and ELTO-AKF are operator-based.

| Noise Setting | CV | Linear | SubKKR | KKR | SPKF | ELTO-KF | ELTO-AKF |
|---|---|---|---|---|---|---|---|
| Default | 0.0835 | 0.1321 | 2.7574 | 1.1185 | 1.6099 | 0.1010 | 0.1012 |
| High process noise | 0.1167 | 0.3079 | 3.9321 | 3.4061 | 3.473 | 0.2325 | 0.1123 |
| High observation noise | 9.181 | 10.309 | 13.192 | 15.797 | 12.218 | 11.278 | 8.881 |

(b) Conceptual comparison of filtering methodologies.

| Method | Data-driven operators | RKHS inference | Bypasses explicit PDF | Adaptable to non-stationarity | Guaranteed SPD structure |
|---|---|---|---|---|---|
| AEKF (CV/Linear) | × | × | × | ✓ | × |
| CKFNet | × | × | × | ✓ | × |
| SPKF | × | × | × | × | × |
| KKR/SubKKR | × | ✓ | ✓ | × | × |
| ELTO-KF | ✓ | ✓ | ✓ | × | × |
| ELTO-AKF (Ours) | ✓ | ✓ | ✓ | ✓ | ✓ |

### 4.2 Tracking Non-Stationary LiDAR Trajectories

We conducted a filtering experiment on a single, long synthetic trajectory to evaluate our method's ability to handle changing dynamics. Inspired by tracking problems in (Greenberg et al., [2023](https://arxiv.org/html/2606.14195#bib.bib37)), our experimental setup is modified to test the model's ability to handle changing dynamics within a continuous data stream of LiDAR measurements (the simulation was performed using code adapted from [https://github.com/ido90/Optimized-Kalman-Filter](https://github.com/ido90/Optimized-Kalman-Filter)). Each trajectory consists of multiple segments of variable length, with each segment defined by constant radial and tangential accelerations sampled from a normal distribution. This process generates a path with alternating periods of straight-line motion and coordinated turns, creating the piecewise non-stationary dynamics. The base simulation uses initial positions in [-50, 50] per axis, an initial velocity magnitude in [1, 5], and radial and tangential acceleration standard deviations of 0.1 and 0.5. The generated trajectory provides the true path and corresponding noisy observations, which are then standardized. For a comprehensive evaluation, we created distinct datasets by systematically varying two key factors: noise levels (categorized as default, high, and very high) and trajectory segment lengths (categorized as default, small, and large). The specific parameter settings for each group are as follows: Noise Levels (in terms of `noise_r`, `noise_t`): Default (1, 0.5), High (2, 1.5), and Very High (5, 2.5); Segment Lengths (in terms of `n_intervals`, `int_len`): Default ((25, 30), (20, 25)), Small ((50, 60), (10, 12)), and Large ((10, 15), (40, 50)).

The first 50% of the trajectory is used for training, and the other half is used for testing. To evaluate the model under different practical constraints, we designed two testing regimes in Table [2](https://arxiv.org/html/2606.14195#S4.T2). In the Full Optimization setting, we search for the optimal $\epsilon_{q}$ and $\epsilon_{r}$ for the initial noise covariance matrices ($\bm{Q}_{0}$ and $\bm{R}_{0}$) using CMA-ES, while keeping $\alpha=\beta=1$ fixed throughout the filtering process. In the Pure Adaptation setting, we ablate the optimization phase and rely purely on dynamic parameter adaptation (refer to Algorithm [2](https://arxiv.org/html/2606.14195#alg2)), starting from naive initializations with memory factors $\alpha=\beta=0.9$. Note that kernel parameters (e.g., lengthscales) remain fixed across both settings. The filtering performance for these settings, along with evaluations under distribution shifts, is presented in Tables [2](https://arxiv.org/html/2606.14195#S4.T2) and [3](https://arxiv.org/html/2606.14195#S4.T3).

Table 2: Dynamic state estimation performance (MSE $\times 10^{-2}$) on piecewise non-stationary LiDAR trajectories. We compare the baseline ELTO-KF with our ELTO-AKF under two settings: full optimization and pure adaptation. In the former setting, the initial noise covariance matrices are tuned (using CMA-ES) with $\alpha=\beta=1$. In the latter, they are tuned purely through dynamic parameter adaptation ($\alpha=\beta=0.9$), without using CMA-ES.

| Parameter Group | Setting | Full Optimization: ELTO-KF | Full Optimization: ELTO-AKF ($\alpha=\beta=1$) | Pure Adaptation: ELTO-KF | Pure Adaptation: ELTO-AKF ($\alpha=\beta=0.9$) |
|---|---|---|---|---|---|
| Noise Levels | Default | 8.2825 | 1.9466 | 20.7132 | 8.8112 |
| Noise Levels | High | 12.5629 | 5.0497 | 33.9798 | 11.2288 |
| Noise Levels | Very High | 24.5529 | 13.0486 | 81.5785 | 35.0613 |
| Segment Lengths | Small | 9.2305 | 3.9511 | 22.2544 | 6.5787 |
| Segment Lengths | Large | 12.9478 | 7.9840 | 26.2269 | 11.9133 |

Table 3: Robustness evaluation (MSE $\times 10^{-1}$) against distribution shifts. Models are evaluated on unseen noise levels during testing that differ from their training data. ELTO-AKF consistently demonstrates superior generalization under these mismatched conditions compared to the ELTO-KF baseline.

| Train noise level | Test noise level | ELTO-KF | ELTO-AKF |
|---|---|---|---|
| Default | High | 3.5604 | 1.0789 |
| Default | Very High | 7.2018 | 3.9934 |
| High | Very High | 8.1969 | 6.5971 |
| High | Default | 4.9959 | 3.1255 |
| Very High | Default | 4.9159 | 2.2983 |
| Very High | High | 3.6932 | 1.3865 |

As shown in Table [2](https://arxiv.org/html/2606.14195#S4.T2), in the Full Optimization setting where parameters are finely tuned, ELTO-AKF significantly outperforms the standard ELTO-KF across all noise levels and trajectory lengths, validating the training efficacy of our proposed structure. Furthermore, in the Pure Adaptation setting, which tests the model without initial hyperparameter optimization, ELTO-AKF still maintains robust performance through dynamic parameter adaptation. Table [3](https://arxiv.org/html/2606.14195#S4.T3) demonstrates that when faced with unseen noise and distribution shifts, ELTO-AKF also achieves a lower error compared to the baseline. Further ablation studies (provided in Appendix [C](https://arxiv.org/html/2606.14195#A3)) confirm that the proposed structure is crucial for maintaining numerical stability.

### 4.3 Scaling to High-Dimensional Lorenz-96 Systems

To evaluate the scalability and computational stability of our proposed method under varying dimensions, we conducted experiments using the standard Lorenz-96 chaotic system (Lorenz, [1996](https://arxiv.org/html/2606.14195#bib.bib58)) (forcing constant $F=8.0$) across dimensions $D\in\{5,10,100,500,1000\}$. Trajectories of 1000 steps were generated via RK4 integration (Press, [2007](https://arxiv.org/html/2606.14195#bib.bib59)) ($\Delta t=0.01$) and injected with varying levels of Gaussian observation noise ($\sigma_{\text{obs}}\in\{0.01,0.1,0.5\}$). The training was configured with 200 epochs, a batch size of 50, and a window $w=5$. For ELTO configuration, the baseline was optimized via 50 CMA-ES iterations. For our ELTO-AKF, the number of scalar blocks for the structured noise parameterization was dynamically scaled based on the system dimension, yielding $k\in\{2,5,10\}$ for the respective dimensions.
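For reference, the data-generation step described above can be sketched with the standard Lorenz-96 equations $\dot{x}_{i}=(x_{i+1}-x_{i-2})x_{i-1}-x_{i}+F$ (cyclic indices) and a classical RK4 step. The initial condition (a small perturbation of the fixed point $x_{i}=F$), the seed, and the function names are assumptions, not details taken from the paper.

```python
import numpy as np

def lorenz96_rhs(x, F=8.0):
    """Lorenz-96 right-hand side with cyclic indexing via np.roll."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

def rk4_step(x, dt, F=8.0):
    """One classical fourth-order Runge-Kutta step."""
    k1 = lorenz96_rhs(x, F)
    k2 = lorenz96_rhs(x + 0.5 * dt * k1, F)
    k3 = lorenz96_rhs(x + 0.5 * dt * k2, F)
    k4 = lorenz96_rhs(x + dt * k3, F)
    return x + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def simulate(D, steps=1000, dt=0.01, F=8.0, sigma_obs=0.1, seed=0):
    """Return the clean trajectory and its noisy observations, each (steps, D)."""
    rng = np.random.default_rng(seed)
    x = F * np.ones(D)
    x[0] += 0.01                       # hypothetical perturbation off the fixed point
    traj = np.empty((steps, D))
    for t in range(steps):
        x = rk4_step(x, dt, F)
        traj[t] = x
    obs = traj + sigma_obs * rng.normal(size=traj.shape)
    return traj, obs

traj, obs = simulate(D=5, sigma_obs=0.1)
```

The same `simulate` call with `D` set to 10, 100, 500, or 1000 covers the dimensions listed above; only the array sizes change.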

As summarized in Table [4](https://arxiv.org/html/2606.14195#S4.T4), ELTO-AKF performs competitively with the baseline at low noise and outperforms it at moderate-to-high noise across most dimensions. More importantly, in several high-dimensional and high-noise settings (e.g., $D=500$, $\sigma_{\text{obs}}=0.1$ and $D=1000$, $\sigma_{\text{obs}}=0.5$), the ELTO-KF baseline diverges due to a loss of the SPD property, whereas the structured parameterization of ELTO-AKF inherently guarantees SPD and remains numerically stable. This indicates that, while the unstructured baseline can sometimes recover when initialization happens to remain well-conditioned, only the structured parameterization provides reliable scalability.

Table 4: Performance comparison (MSE) on the Lorenz-96 system. $D$ denotes the system dimension, and $\sigma_{obs}$ is the standard deviation of the observation noise. "N/A" indicates that the ELTO-KF baseline failed to converge in that configuration due to a loss of the SPD property; ELTO-AKF remained stable in every configuration.

| $D$ | ELTO-KF ($\sigma_{obs}=0.01$) | ELTO-AKF ($\sigma_{obs}=0.01$) | ELTO-KF ($\sigma_{obs}=0.1$) | ELTO-AKF ($\sigma_{obs}=0.1$) | ELTO-KF ($\sigma_{obs}=0.5$) | ELTO-AKF ($\sigma_{obs}=0.5$) |
|---|---|---|---|---|---|---|
| 5 | 0.0415 | 0.0451 | 0.2304 | 0.0304 | 0.2076 | 0.2069 |
| 10 | 0.2243 | 0.1672 | 0.2368 | 0.1257 | 0.2814 | 0.3021 |
| 100 | 0.5032 | 0.5186 | 0.9489 | 0.5265 | 0.9890 | 0.5782 |
| 500 | 0.8760 | 0.8446 | N/A | 1.2722 | 0.9159 | 0.9252 |
| 1000 | 1.7682 | 1.6721 | 1.8227 | 1.6836 | N/A | 1.7384 |

### 4.4 Downstream Application: Data-Driven PDE Discovery

Our ELTO-AKF approach is well-suited for denoising noisy state variables using learned noise covariance matrices, improving the quality of the observational data. The improved data quality can, in turn, support downstream tasks of data-driven partial differential equation (PDE) discovery (Rudy et al., [2017](https://arxiv.org/html/2606.14195#bib.bib48)), yielding more accurate recovery of governing PDEs. We aim to identify sparse governing PDEs parameterized in the following form: $u_{t}=\mathcal{N}(u,u_{x},u_{xx},\dots)$, where $\mathcal{N}$ denotes a hidden nonlinear operator involving the state variable $u$ and its spatial derivatives (e.g., $u_{x},u_{xx},\dots$). We assume $\mathcal{N}$ is expressed as a linear combination of a few active terms in the (sparse) governing equation. To identify the underlying PDE, we solve a sparse or best-subset regression problem constructed over a library of (independent) candidate terms, with the temporal derivative $u_{t}$ serving as the vector response for selecting the most relevant terms. Note that we only have access to noisy values of $u(x,t)$ on a discretized spatiotemporal domain. By applying numerical differentiation or weak-form computation (Reinbold et al., [2020](https://arxiv.org/html/2606.14195#bib.bib51)), we can construct an overcomplete candidate library, setting up a system of equations for the regression problem. The best model (optimally balancing between approximation error and model complexity) that is most likely to represent the actual governing PDE is selected using information criteria (Thanasutives et al., [2024](https://arxiv.org/html/2606.14195#bib.bib49); [2025](https://arxiv.org/html/2606.14195#bib.bib50)).

We aim to demonstrate that refining the observed states using our ELTO-AKF improves the accuracy of identified PDE coefficients. The parametric Burgers' equation is used as an example. The equation reads $u_{t}=\begin{bmatrix}uu_{x}&u_{xx}\end{bmatrix}\bm{\xi}$, where $\bm{\xi}$ is the true PDE coefficient vector, containing the kinematic viscosity (or diffusion coefficient) $\vartheta$. We study $\bm{\xi}=\begin{bmatrix}-1&\vartheta\end{bmatrix}^{T}$, where $\vartheta=0.1$ or $\frac{0.01}{\pi}$. In the latter case, the viscosity is so small that shock waves develop in the PDE solution, and hence, abrupt changes occur in the system dynamics. To generate the noisy $u(x,t)$, the clean state variable is perturbed with Gaussian noise drawn from $\frac{\epsilon\sigma}{100}\mathcal{N}(0,1)$, where $\sigma$ denotes the standard deviation of the clean state variable. The noise level is controlled by the parameter $\epsilon$.

Our noise covariances are trained to denoise by mapping doubly noisy data to (noisy) observed data. Here, the doubly noisy $\tilde{u}(x,t)=u(x,t)+\mathcal{N}(0,\tilde{\sigma}^{2})$ refers to noisy observations further perturbed with synthetic Gaussian noise, whose standard deviation $\tilde{\sigma}$ is estimated using a robust wavelet-based estimator (Donoho and Johnstone, [1994](https://arxiv.org/html/2606.14195#bib.bib52)). A key practical advantage of using $\tilde{u}$ is that it does not need the clean state variable for training and validation. During the test phase, after obtaining the learned noise covariance matrices, we use them to denoise the observations. The resulting noise-reduced data are passed through the PDE discovery method to estimate a sparse vector $\bm{\hat{\xi}}$ of the PDE coefficients, for which we calculate the average of its absolute percentage errors, each defined by $\mathcal{E}_{j}(\bm{\xi},\bm{\hat{\xi}})=\left|\frac{\xi_{j}-\hat{\xi}_{j}}{\xi_{j}}\right|$. Another evaluation metric we consider, which is less susceptible to errors in estimating small coefficients, is the L1 relative error: $\frac{\lVert\bm{\xi}-\bm{\hat{\xi}}\rVert_{1}}{\lVert\bm{\xi}\rVert_{1}}$.
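The construction of doubly noisy data can be sketched as follows. The noise level is estimated with the median-absolute-deviation rule on finest-scale wavelet detail coefficients, $\tilde{\sigma}=\mathrm{median}(|d|)/0.6745$. The use of a Haar wavelet along the spatial axis, the smooth test field (not a Burgers solution), and the function names are assumptions made for this illustration.

```python
import numpy as np

def estimate_sigma_haar(u):
    """MAD-based noise estimate from finest-scale Haar detail coefficients."""
    u = np.asarray(u, dtype=float)
    n = u.shape[-1] - (u.shape[-1] % 2)            # drop a trailing sample if odd
    detail = (u[..., 1:n:2] - u[..., 0:n:2]) / np.sqrt(2.0)
    return np.median(np.abs(detail)) / 0.6745      # 0.6745 is the MAD of a unit Gaussian

def make_doubly_noisy(u_obs, rng):
    """Add synthetic Gaussian noise at the estimated level to the observations."""
    sigma_tilde = estimate_sigma_haar(u_obs)
    u_tilde = u_obs + rng.normal(scale=sigma_tilde, size=u_obs.shape)
    return u_tilde, sigma_tilde

rng = np.random.default_rng(0)
x = np.linspace(-8.0, 8.0, 256)
t = np.linspace(0.0, 1.0, 64)
u_clean = np.exp(-(x[None, :] - t[:, None]) ** 2)  # hypothetical smooth field u(x, t)

eps = 50                                           # noise-level parameter from the text
noise_std = eps * u_clean.std() / 100.0            # (eps * sigma / 100) scaling
u_obs = u_clean + noise_std * rng.normal(size=u_clean.shape)

u_tilde, sigma_tilde = make_doubly_noisy(u_obs, rng)
```

Only `u_obs` enters the estimator, which reflects the point made above: no clean state variable is needed to build the training pairs.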

Table 5:Comparison of PDE discovery results with and without denoising the noisy state variables\. The mean squared error \(MSE\) implies the L2\-distance between the data used for PDE discovery and the clean state\-variable data\. Better scores are shown inbold\. The ELTO\-AKF algorithm is run withα=β=1\\alpha=\\beta=1, and the correct viscosity is0\.10\.1\.DenoisingMSE \(×10−3\\times 10^\{\-3\}\)Identified PDEMAPEL1 Relative Error \(×10−2\\times 10^\{\-2\}\)N/A8\.088\.08−0\.959286​u​ux\+0\.076843​ux​x\-0\.959286uu\_\{x\}\+0\.076843u\_\{xx\}13\.6113\.615\.815\.81ELTO\-AKF2\.48\\mathbf\{2\.48\}−0\.940588​u​ux\+0\.098496​ux​x\-0\.940588uu\_\{x\}\+0\.098496u\_\{xx\}3\.72\\mathbf\{3\.72\}5\.54\\mathbf\{5\.54\}![Refer to caption](https://arxiv.org/html/2606.14195v1/fig/denoising_visualization.png)\(a\)Burgers’ equation with no shock waves is simulated using a Gaussian initial condition\.
![Refer to caption](https://arxiv.org/html/2606.14195v1/fig/shock_visualization.png)\(b\)Burgers’ equation with shock waves is simulated using a periodic initial condition\.
![Refer to caption](https://arxiv.org/html/2606.14195v1/x4.png)\(c\)Abrupt change in system dynamics governed by the Burgers’ equation caused by shock formation\.

Figure 4:The state\-variable data are depicted under three conditions: the clean data, the noisy \(ϵ=50\\epsilon=50\) observed data, and the denoised data produced by the ELTO\-AKF, which shows the smallest MSE to the clean data\.Table 6:Comparison of denoising methods based on the ELTO\-KF and our proposed ELTO\-AKF with varying memory factors\. Here, the correct viscosity is0\.01π\\frac\{0\.01\}\{\\pi\}\. Note that the identified PDE is incorrect when no denoising is applied; therefore, its evaluation metrics are unavailable \(N/A\)\.DenoisingMSE \(×10−2\\times 10^\{\-2\}\)Identified PDEMAPEL1 Relative Error \(×10−1\\times 10^\{\-1\}\)N/AN/A−0\.464365​u​ux\+0\.000163​u2​ux​x\-0\.464365uu\_\{x\}\+0\.000163u^\{2\}u\_\{xx\}N/AN/AELTO\-KF2\.702\.70−0\.677635​u​ux\+0\.006934​ux​x\-0\.677635uu\_\{x\}\+0\.006934u\_\{xx\}75\.033\.25ELTO\-AKF\(α=β=1\\alpha=\\beta=1\)2\.762\.76−0\.742664​u​ux\+0\.003891​ux​x\-0\.742664uu\_\{x\}\+0\.003891u\_\{xx\}23\.99\\mathbf\{23\.99\}2\.572\.57ELTO\-AKF\(α=β=0\.7\\alpha=\\beta=0\.7\)2\.19\\mathbf\{2\.19\}−0\.787012​u​ux\+0\.004537​ux​x\-0\.787012uu\_\{x\}\+0\.004537u\_\{xx\}31\.9231\.922\.14\\mathbf\{2\.14\}To empirically demonstrate the effectiveness of our ELTO\-AKF in denoising, we first conduct experiments on discovering the canonical Burgers’ equation with viscosityϑ=0\.1\\vartheta=0\.1under a high noise level ofϵ=50\\epsilon=50\. Table[5](https://arxiv.org/html/2606.14195#S4.T5)shows that the denoised state\-variable data resemble the clean data \(as illustrated in Figure[4](https://arxiv.org/html/2606.14195#S4.F4)\) more closely than the noisy observed data, as evidenced by the decreased MSE\. Therefore, a higher\-quality PDE—closer to the true governing form, as indicated by the low MAPE \(around44%\) and the small L1 relative error—is identified\. 
Then, we experiment on the viscous Burgers’ equation exhibiting shock waves, where the PDE solution becomes discontinuous in space after a certain time, to reveal the full ELTO\-AKF’s capability in adapting to local dynamics by adjusting the memory factors\. As shown in Table[6](https://arxiv.org/html/2606.14195#S4.T6), the identified governing equation is incorrect if no denoising is applied and remains of low quality when denoising is performed using the ELTO\-KF approach\. The ELTO\-AKF with adaptation to local dynamics achieves the best performance in terms of the MSE and the L1 relative error, whereas the ELTO\-AKF with full memory attains the lowest MAPE\. Therefore, whether the inclusion of dynamic parameter adaptation yields better performance depends on the dataset, and currently no definitive conclusion can be drawn\.
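The MAPE column of Table 6 can be sanity-checked with a few lines of arithmetic. The sketch below assumes that MAPE is the mean absolute percentage error over the two coefficients of a reference PDE $u_t=-uu_x+\frac{0.01}{\pi}u_{xx}$; the coefficient $-1$ on $uu_x$, the dictionary keys, and the function name `mape` are assumptions made for illustration, not definitions taken from the paper.

```python
import math

# Assumed reference coefficients: u_t = -u*u_x + (0.01/pi)*u_xx
TRUE_COEFFS = {"u*u_x": -1.0, "u_xx": 0.01 / math.pi}

def mape(identified, reference):
    """Mean absolute percentage error over the reference terms (in percent)."""
    errors = [abs(identified[t] - reference[t]) / abs(reference[t]) for t in reference]
    return 100.0 * sum(errors) / len(errors)

# Identified coefficients copied from Table 6
table6 = {
    "ELTO-KF": {"u*u_x": -0.677635, "u_xx": 0.006934},
    "ELTO-AKF (alpha=beta=1)": {"u*u_x": -0.742664, "u_xx": 0.003891},
    "ELTO-AKF (alpha=beta=0.7)": {"u*u_x": -0.787012, "u_xx": 0.004537},
}

scores = {name: mape(coeffs, TRUE_COEFFS) for name, coeffs in table6.items()}
for name, score in scores.items():
    print(f"{name}: MAPE = {score:.2f}%")
```

Under these assumptions the computed values agree with the reported MAPE column to within about 0.01, which suggests that the diffusion coefficient dominates the error for the ELTO-KF row.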

### 4.5 Downstream Application: Real-World Ecological Time Series

Table 7: Filtering performance. We measure the MSE in recovering the original observations from doubly noisy data generated with varying levels of synthetic Gaussian noise.

| Noise level ($\epsilon$) | ELTO-KF | ELTO-AKF |
|---|---|---|
| 1 | 0.0126 | 0.0120 |
| 10 | 0.0569 | 0.0419 |
| 100 | 1.3995 | 1.2602 |

Table 8: Discovery of the Lotka–Volterra dynamics in the lynx–hare dataset. We compare the discovery process with and without ELTO-AKF smoothing as a denoising step prior to trajectory-matching refinement. Lower MAPE indicates closer agreement with the reference dynamics: $\dot{x}_{1}=-0.84x_{1}+0.026x_{1}x_{2}$, and $\dot{x}_{2}=0.55x_{2}-0.028x_{1}x_{2}$, for which the expected coefficients are statistically derived (Howard, [2009](https://arxiv.org/html/2606.14195#bib.bib65)).

| Method | Identified ODEs | MAPE |
|---|---|---|
| w/o ELTO-AKF | $\dot{x}_{1}=-0.843x_{1}+0.0266x_{1}x_{2},\quad\dot{x}_{2}=0.547x_{2}-0.0281x_{1}x_{2}$ | 0.907 |
| w/ ELTO-AKF | $\dot{x}_{1}=-0.830x_{1}+0.0261x_{1}x_{2},\quad\dot{x}_{2}=0.554x_{2}-0.0282x_{1}x_{2}$ | **0.772** |

We validated ELTO-AKF using the Canadian lynx and snowshoe hare records (Elton and Nicholson, [1942](https://arxiv.org/html/2606.14195#bib.bib63)). These records exhibit non-stationary Lotka-Volterra dynamics (Lotka, [1925](https://arxiv.org/html/2606.14195#bib.bib61); Volterra, [1926](https://arxiv.org/html/2606.14195#bib.bib62)) and are known to contain outliers and phase mismatches. Lacking ground truth data on the state variables, we followed the procedure in Section [4.4](https://arxiv.org/html/2606.14195#S4.SS4) and treated the original observations as validation targets ($\bm{Y}^{\text{val}}$).
We injected varying levels of additional synthetic noise, $\frac{\epsilon}{100}\mathcal{N}(0,1)$, into the input data and then optimized the filter to recover the original observations from these doubly noisy inputs, evaluating the robustness of the data-driven Kalman filters. Table [7](https://arxiv.org/html/2606.14195#S4.T7) demonstrates that ELTO-AKF consistently outperforms ELTO-KF across all noise levels, confirming its ability to isolate the complex dynamics inherent in real-world measurements. Consequently, these filters show strong potential for further denoising the original observations without relying on ground truth supervision.
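The noise-injection step described above can be sketched as follows. This is a minimal illustration that applies the expression $\frac{\epsilon}{100}\mathcal{N}(0,1)$ literally and independently to every entry of the data; whether the paper additionally rescales the noise by the signal amplitude is not stated here, so that choice, the toy sinusoidal data, and the function name `add_synthetic_noise` are assumptions.

```python
import numpy as np

def add_synthetic_noise(Y, eps, rng):
    """Add i.i.d. Gaussian noise with standard deviation eps/100 to every entry of Y."""
    return Y + (eps / 100.0) * rng.standard_normal(Y.shape)

rng = np.random.default_rng(0)

# Hypothetical stand-in for the original observations (200 time steps, 2 variables)
t = np.linspace(0.0, 4.0 * np.pi, 200)
Y = np.column_stack([np.sin(t), np.cos(t)])

# Doubly noisy inputs for the three noise levels used in Table 7
noisy = {eps: add_synthetic_noise(Y, eps, rng) for eps in (1, 10, 100)}

# MSE between each noisy input and the original observations
mse = {eps: float(np.mean((noisy[eps] - Y) ** 2)) for eps in noisy}
for eps, value in mse.items():
    print(f"eps={eps}: MSE of the unfiltered input = {value:.4f}")
```

The unfiltered MSE grows roughly as $(\epsilon/100)^2$, which gives a baseline against which a filter's recovery error can be compared.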

![Refer to caption](https://arxiv.org/html/2606.14195v1/x5.png)

Figure 5: Recovered Lotka–Volterra models from the lynx–hare data. The trajectory-fit solution and the smoothed observations closely match the simulated solution from the reference coefficients.

We also demonstrate that ELTO-AKF with its memory factors ablated can be optimized to smooth observations, thereby facilitating the downstream discovery of the governing system of ODEs. The original 20-point series is upsampled to 200 evenly spaced points using cubic splines to form the input observations $\bm{Y}$, and we apply a sparse-regression discovery algorithm (Rudy et al., [2017](https://arxiv.org/html/2606.14195#bib.bib48)) to recover the underlying Lotka-Volterra dynamics. The discovery process proceeds in three steps: (i) sparse regression over a degree-2 polynomial library identifies the active terms; (ii) iterative Kalman smoothing produces a denoised trajectory; and (iii) a trajectory-matching refinement re-optimizes the coefficient values by integrating the candidate ODE forward in time and minimizing the squared error against the denoised trajectory, thereby bringing the recovered coefficients closer to the true dynamics. Table [8](https://arxiv.org/html/2606.14195#S4.T8) reports the results of an ablation study comparing the full discovery process against a variant that skips the smoothing and runs the trajectory-matching refinement directly on the upsampled observations. Without the trajectory-matching refinement, the sparse-regression step alone yields a MAPE of 5.41 in both conditions, which is far from the reference coefficients; however, the refinement step, when fed the smoothed trajectory, lowers the MAPE to 0.772. This indicates that ELTO-AKF smoothing yields a clear downstream benefit by providing the trajectory-matching step with a cleaner integration target, as shown in Figure [5](https://arxiv.org/html/2606.14195#S4.F5).
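Step (i) of the discovery process can be illustrated on clean synthetic data. The sketch below simulates the reference Lotka–Volterra system with a hand-written RK4 integrator and recovers the active terms of a degree-2 polynomial library by sequentially thresholded least squares. The thresholded least-squares routine is a simple stand-in for the sparse-regression algorithm of Rudy et al., and the initial condition, step size, threshold, and exact (noise-free) derivatives are assumptions made for illustration only.

```python
import numpy as np

def lotka_volterra(x):
    """Reference dynamics from Table 8."""
    x1, x2 = x
    return np.array([-0.84 * x1 + 0.026 * x1 * x2,
                     0.55 * x2 - 0.028 * x1 * x2])

def rk4(f, x0, dt, steps):
    """Classical fourth-order Runge-Kutta integration of an autonomous ODE."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(steps):
        x = xs[-1]
        k1 = f(x)
        k2 = f(x + 0.5 * dt * k1)
        k3 = f(x + 0.5 * dt * k2)
        k4 = f(x + dt * k3)
        xs.append(x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4))
    return np.array(xs)

def library(X):
    """Degree-2 polynomial library: [1, x1, x2, x1^2, x1*x2, x2^2]."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones_like(x1), x1, x2, x1**2, x1 * x2, x2**2])

def stlsq(Theta, dX, threshold=1e-3, iters=10):
    """Sequentially thresholded least squares (assumed stand-in for sparse regression)."""
    Xi = np.linalg.lstsq(Theta, dX, rcond=None)[0]
    for _ in range(iters):
        small = np.abs(Xi) < threshold
        Xi[small] = 0.0
        for j in range(dX.shape[1]):
            active = ~small[:, j]
            if active.any():
                # Refit only the terms that survived the threshold
                Xi[active, j] = np.linalg.lstsq(Theta[:, active], dX[:, j], rcond=None)[0]
    return Xi

X = rk4(lotka_volterra, [4.0, 30.0], dt=0.1, steps=199)  # 200 points, hypothetical start
dX = np.array([lotka_volterra(x) for x in X])            # exact derivatives (no noise)
Xi = stlsq(library(X), dX)                               # rows: library terms, columns: equations
print(np.round(Xi, 4))
```

On noisy, upsampled observations the derivatives would have to be estimated numerically, which is where the smoothing and trajectory-matching steps described above become relevant.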

## 5 Conclusions

#### Summary\.

We introduced the ELTO\-AKF, which addresses the limitations of the standard ELTO\-KF in learning non\-stationary dynamics through the novel structured noise adaptation\. The computationally tractable SB parameterization is proposed, guaranteeing the symmetric positive\-definite property of the noise covariance matrices by updating their Cholesky factors during the dynamic parameter adaptation\. This structured parameterization grants our proposed method the capability of structured noise adaptation: learning globally robust, time\-invariant noise models through optimization, while dynamically tracking time\-varying system dynamics using filter residuals\. Empirical results in dynamic state estimation reveal that the ELTO\-AKF significantly outperforms baseline methods in challenging non\-stationary processes\. Furthermore, when applied as a denoising method for data\-driven PDE discovery, our ELTO\-AKF method reduces the identification error for complex systems governed by the viscous Burgers’ equation, whose solutions exhibit shock waves\.

#### Limitations and Future Work\.

While ELTO-AKF demonstrates significant robustness and circumvents the need for noiseless latent states by optimizing over noisy validation observations, we acknowledge certain limitations in the current parameter fitting procedure. In SSMs, principled gradient-based approaches, such as Maximum Marginal Likelihood Estimation (MMLE) via particle methods, provide an elegant framework for parameter learning without ground truth data. However, because our filtering inference operates entirely via distribution embeddings within the RKHS, it inherently bypasses explicit probability density functions. Consequently, the current RKHS formulation lacks an explicit observation likelihood or a tractable pre-image approximation, rendering likelihood-based methods like MMLE not directly applicable and encouraging other optimization techniques (e.g., derivative-free CMA-ES). Furthermore, our dynamic tracking mechanism relies on predefined memory factors $(\alpha,\beta)$ to govern adaptation rates. Therefore, formulating a structured likelihood approximation within the RKHS to enable principled, gradient-based optimization and exploring fully adaptive strategies for the memory factors remain important directions for future research.

#### Broader Impact Statement\.

This work advances sequential Bayesian filtering by introducing a structured parameterization for learning noise covariance matrices\. We hope our approach to designing noise\-robust covariance matrices benefits other researchers working on non\-stationary filtering\. While our evaluations focus on domains like LiDAR tracking, robust state estimation is fundamental to general navigation\. Therefore, we acknowledge the dual\-use potential of this technology, particularly in surveillance applications\.

## References

- Approximate inference in state\-space models with heavy\-tailed noise\.IEEE Transactions on Signal Processing60\(10\),pp\. 5024–5037\.Cited by:[§1](https://arxiv.org/html/2606.14195#S1.p1.1)\.
- S\. Akhlaghi, N\. Zhou, and Z\. Huang \(2017\)Adaptive adjustment of noise covariance in kalman filter for dynamic state estimation\.In2017 IEEE power & energy society general meeting,pp\. 1–5\.Cited by:[§1](https://arxiv.org/html/2606.14195#S1.p4.1),[§3\.3](https://arxiv.org/html/2606.14195#S3.SS3.p1.1)\.
- C\. R\. Baker \(1973\)Joint measures and cross\-covariance operators\.Transactions of the American Mathematical Society186,pp\. 273–289\.Cited by:[§2\.1](https://arxiv.org/html/2606.14195#S2.SS1.SSS0.Px1.p1.21)\.
- P\. Becker, H\. Pandya, G\. Gebhardt, C\. Zhao, C\.J\. Taylor, and G\. Neumann \(2019\)Recurrent kalman networks: factorized inference in high\-dimensional deep feature spaces\.InProc\. of the 36th Int’l Conf\. on Machine Learning,Vol\.PMLR 97,pp\. 544–552\.Cited by:[§1](https://arxiv.org/html/2606.14195#S1.p4.1)\.
- V\. DeMiguel, A\. Martín\-utrera, and R\. Uppal \(2024\)A multifactor perspective on volatility\-managed portfolios\.The Journal of Finance79\(6\),pp\. 3859–3891\.Cited by:[§1](https://arxiv.org/html/2606.14195#S1.p1.1)\.
- D\. L\. Donoho and I\. M\. Johnstone \(1994\)Ideal spatial adaptation by wavelet shrinkage\.Biometrika81\(3\),pp\. 425–455\.External Links:ISSN 0006\-3444,[Document](https://dx.doi.org/10.1093/biomet/81.3.425),[Link](https://doi.org/10.1093/biomet/81.3.425),https://academic\.oup\.com/biomet/article\-pdf/81/3/425/26079146/81\.3\.425\.pdfCited by:[§4\.4](https://arxiv.org/html/2606.14195#S4.SS4.p3.6)\.
- G\. Duran\-Martin, M\. Altamirano, A\. Y\. Shestopaloff, L\. Sánchez\-Betancourt, J\. Knoblauch, M\. Jones, F\. Briol, and K\. Murphy \(2024\)Outlier\-robust kalman filtering through generalised bayes\.arXiv preprint arXiv:2405\.05646\.Cited by:[§1](https://arxiv.org/html/2606.14195#S1.p1.1)\.
- C\. Elton and M\. Nicholson \(1942\)The ten\-year cycle in numbers of the lynx in canada\.The Journal of Animal Ecology,pp\. 215–244\.Cited by:[§4\.5](https://arxiv.org/html/2606.14195#S4.SS5.p1.2)\.
- K\. Fukumizu, L\. Song, and A\. Gretton \(2013\)Kernel bayes’ rule: bayesian inference with positive definite kernels\.Journal of Machine Learning Research14\(118\),pp\. 3753–3783\.External Links:[Link](http://jmlr.org/papers/v14/fukumizu13a.html)Cited by:[§1](https://arxiv.org/html/2606.14195#S1.p1.1),[§1](https://arxiv.org/html/2606.14195#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.14195#S2.SS2.p1.1)\.
- G\.H\.W\. Gebhardt, A\. Kupcsik, and G\. Neumann \(2019\)The kernel kalman rule: Efficient nonparametric inference by recursive least\-squares and subspace projections\.Machine Learning108,pp\. 2113–2157\.Cited by:[Appendix A](https://arxiv.org/html/2606.14195#A1.p4.7),[§1](https://arxiv.org/html/2606.14195#S1.p1.1),[§1](https://arxiv.org/html/2606.14195#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.14195#S2.SS2.SSS0.Px1.p1.10),[§2\.2](https://arxiv.org/html/2606.14195#S2.SS2.p1.1),[§3\.2](https://arxiv.org/html/2606.14195#S3.SS2.p2.7),[§3\.2](https://arxiv.org/html/2606.14195#S3.SS2.p4.9),[§4\.1](https://arxiv.org/html/2606.14195#S4.SS1.p1.10),[§4\.1](https://arxiv.org/html/2606.14195#S4.SS1.p3.2)\.
- I\. Greenberg, N\. Yannay, and S\. Mannor \(2023\)Optimization or architecture: how to hack kalman filtering\.Advances in Neural Information Processing Systems36,pp\. 50482–50505\.Cited by:[§1](https://arxiv.org/html/2606.14195#S1.p1.1),[§1](https://arxiv.org/html/2606.14195#S1.p4.1),[§3\.2](https://arxiv.org/html/2606.14195#S3.SS2.p5.6),[§3\.3](https://arxiv.org/html/2606.14195#S3.SS3.p3.4),[§4\.2](https://arxiv.org/html/2606.14195#S4.SS2.p1.1)\.
- N\. Hansen \(2006\)The CMA evolution strategy: A comparing review\.Towards a new evolutionary computation: Advances in the estimation of distribution algorithms,pp\. 75–102\.Cited by:[§3\.2](https://arxiv.org/html/2606.14195#S3.SS2.p6.1)\.
- P\. Howard \(2009\)Modeling basics\.Lecture Notes for Math442\.Cited by:[Table 8](https://arxiv.org/html/2606.14195#S4.T8),[Table 8](https://arxiv.org/html/2606.14195#S4.T8.4.2)\.
- J\. Hu, H\. Zhao, and Y\. Peng \(2025\)CKFNet: neural network aided cubature kalman filtering\.IEEE Signal Processing Letters\.Cited by:[§4\.1](https://arxiv.org/html/2606.14195#S4.SS1.p3.2)\.
- R\. E\. Kalman \(1960\)A new approach to linear filtering and prediction problems\.Cited by:[§1](https://arxiv.org/html/2606.14195#S1.p1.1)\.
- N\. Kantas, A\. Doucet, S\. S\. Singh, J\. Maciejowski, and N\. Chopin \(2015\)On particle methods for parameter estimation in state\-space models\.Cited by:[§3\.2](https://arxiv.org/html/2606.14195#S3.SS2.p6.1)\.
- T\. Katayama \(2005\)Subspace methods for system identification\.Book,Springer\.Cited by:[§1](https://arxiv.org/html/2606.14195#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.14195#S2.SS1.SSS0.Px2.p1.6)\.
- N\. Ke, R\. Tanaka, and Y\. Kawahara \(2025\)Learning stochastic nonlinear dynamics with embedded latent transfer operators\.InProceedings of The 28th International Conference on Artificial Intelligence and Statistics,Y\. Li, S\. Mandt, S\. Agrawal, and E\. Khan \(Eds\.\),Proceedings of Machine Learning Research, Vol\.258,pp\. 4861–4869\.External Links:[Link](https://proceedings.mlr.press/v258/ke25a.html)Cited by:[Appendix A](https://arxiv.org/html/2606.14195#A1.p1.2),[§1](https://arxiv.org/html/2606.14195#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.14195#S2.SS1.SSS0.Px2.p1.6),[§3\.2](https://arxiv.org/html/2606.14195#S3.SS2.p2.7)\.
- J\. Y\. Loo, Z\. Y\. Ding, V\. M\. Baskaran, S\. G\. Nurzaman, and C\. P\. Tan \(2024\)Sigma\-point kalman filter with nonlinear unknown input estimation via optimization and data\-driven approach for dynamic systems\.IEEE Transactions on Systems, Man, and Cybernetics: Systems54\(10\),pp\. 6068–6081\.Cited by:[§4\.1](https://arxiv.org/html/2606.14195#S4.SS1.p3.2)\.
- E\. N\. Lorenz \(1996\)Predictability: a problem partly solved\.InProc\. Seminar on predictability,Vol\.1,pp\. 1–18\.Cited by:[§4\.3](https://arxiv.org/html/2606.14195#S4.SS3.p1.6)\.
- A\. J\. Lotka \(1925\)Elements of physical biology\.Williams & Wilkins\.Cited by:[§4\.5](https://arxiv.org/html/2606.14195#S4.SS5.p1.2)\.
- C\. Masreliez and R\. Martin \(2003\)Robust bayesian estimation for the linear model and robustifying the kalman filter\.IEEE transactions on Automatic Control22\(3\),pp\. 361–371\.Cited by:[§1](https://arxiv.org/html/2606.14195#S1.p1.1)\.
- J\. M\. Phillips and S\. Venkatasubramanian \(2011\)A gentle introduction to the kernel distance\.arXiv preprint arXiv:1103\.1625\.Cited by:[Appendix B](https://arxiv.org/html/2606.14195#A2.p1.2)\.
- J\. C\. Pinheiro and D\. M\. Bates \(1996\)Unconstrained parametrizations for variance\-covariance matrices\.Statistics and computing6\(3\),pp\. 289–296\.Cited by:[§3\.1](https://arxiv.org/html/2606.14195#S3.SS1.p3.3)\.
- W\. H\. Press \(2007\)Numerical recipes 3rd edition: the art of scientific computing\.Cambridge university press\.Cited by:[§4\.3](https://arxiv.org/html/2606.14195#S4.SS3.p1.6)\.
- P\. A\. K\. Reinbold, D\. R\. Gurevich, and R\. O\. Grigoriev \(2020\)Using noisy or incomplete data to discover models of spatiotemporal dynamics\.Phys\. Rev\. E101,pp\. 010203\.External Links:[Document](https://dx.doi.org/10.1103/PhysRevE.101.010203),[Link](https://link.aps.org/doi/10.1103/PhysRevE.101.010203)Cited by:[§4\.4](https://arxiv.org/html/2606.14195#S4.SS4.p1.7)\.
- S\. H\. Rudy, S\. L\. Brunton, J\. L\. Proctor, and J\. N\. Kutz \(2017\)Data\-driven discovery of partial differential equations\.Science Advances3\(4\),pp\. e1602614\.Cited by:[§4\.4](https://arxiv.org/html/2606.14195#S4.SS4.p1.7),[§4\.5](https://arxiv.org/html/2606.14195#S4.SS5.p2.3)\.
- B\. Schölkopf \(2000\)The kernel trick for distances\.Advances in neural information processing systems13\.Cited by:[Appendix B](https://arxiv.org/html/2606.14195#A2.p1.2)\.
- L\. Song, J\. Huang, A\. Smola, and K\. Fukumizu \(2009\)Hilbert space embeddings of conditional distributions with applications to dynamical systems\.InProceedings of the 26th annual international conference on machine learning,pp\. 961–968\.Cited by:[§2\.1](https://arxiv.org/html/2606.14195#S2.SS1.SSS0.Px1.p1.22)\.
- P\. Thanasutives, Y\. Kawahara, and K\. Fukui \(2025\)Bayesian model selection for variable\-coefficient partial differential equation discovery\.Results in Engineering27,pp\. 106930\.External Links:ISSN 2590\-1230,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.rineng.2025.106930),[Link](https://www.sciencedirect.com/science/article/pii/S2590123025029883)Cited by:[§4\.4](https://arxiv.org/html/2606.14195#S4.SS4.p1.7)\.
- P\. Thanasutives, T\. Morita, M\. Numao, and K\. Fukui \(2024\)Adaptive uncertainty\-penalized model selection for data\-driven pde discovery\.IEEE Access12,pp\. 13165–13182\.External Links:[Document](https://dx.doi.org/10.1109/ACCESS.2024.3354819)Cited by:[§4\.4](https://arxiv.org/html/2606.14195#S4.SS4.p1.7)\.
- V\. Volterra \(1926\)Variazioni e fluttuazioni del numero d’individui in specie animali conviventi\.Società anonima tipografica" Leonardo da Vinci"\.Cited by:[§4\.5](https://arxiv.org/html/2606.14195#S4.SS5.p1.2)\.
- H\. Wang, H\. Li, J\. Fang, and H\. Wang \(2018\)Robust gaussian kalman filter with outlier detection\.IEEE Signal Processing Letters25\(8\),pp\. 1236–1240\.Cited by:[§1](https://arxiv.org/html/2606.14195#S1.p1.1)\.
- J\. Wang \(1999\)Stochastic modeling for real\-time kinematic gps/glonass positioning\.Navigation46\(4\),pp\. 297–305\.Cited by:[§1](https://arxiv.org/html/2606.14195#S1.p4.1),[§3\.2](https://arxiv.org/html/2606.14195#S3.SS2.p5.6)\.

## Appendix A Matrix-based Spectral Learning Algorithm for Operator Identification

This section summarizes the spectral learning algorithm proposed by Ke et al. ([2025](https://arxiv.org/html/2606.14195#bib.bib39)) for estimating the Embedded Latent Transfer Operators (ELTO). This algorithm provides a constructive method for deriving the latent state estimates directly from the observation history, which enables the computation of the empirical matrices $\bm{T}$ and $\bm{O}$ used in the main text.

The empirical estimation process begins with a finite observation sequence $\bm{Y}=[\bm{y}_{1},\dots,\bm{y}_{T}]$ and a window size $h$. Vectors for the past and future, $\bm{y}_{p,n}$ and $\bm{y}_{f,n}$, are constructed for each time step $n=1,\dots,N$, where $N:=T-2h+2$ is the total number of windows:

$$\bm{y}_{p,n}:=[\bm{y}_{n+h-2},\dots,\bm{y}_{n-1}]^{\top},\quad\bm{y}_{f,n}:=[\bm{y}_{n+h-1},\dots,\bm{y}_{2h-2+n}]^{\top}. \tag{19}$$

These vectors are then mapped into the RKHS. For example, for the past window, this results in a feature matrix:

$$\Phi_{p,n}:=[\phi_{y}(\bm{y}_{n+h-2}),\dots,\phi_{y}(\bm{y}_{n-1})]. \tag{20}$$
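The index bookkeeping of Equation 19 is easy to get wrong, so a small sketch may help. It assumes the observations are stored as rows $\bm{y}_{0},\dots,\bm{y}_{T}$ (zero-based, so $T+1$ rows), which is the indexing under which the last future window ends exactly at $\bm{y}_{T}$; the function name `build_windows` is hypothetical.

```python
import numpy as np

def build_windows(Y, h):
    """Build past/future windows following Eq. (19).

    Y has shape (T + 1, d) with rows y_0, ..., y_T (assumed zero-based indexing).
    Returns arrays of shape (N, h, d) with N = T - 2h + 2.
    """
    T = Y.shape[0] - 1
    N = T - 2 * h + 2
    past, future = [], []
    for n in range(1, N + 1):
        # y_{p,n} = [y_{n+h-2}, ..., y_{n-1}]: most recent observation first
        past.append(Y[n - 1:n + h - 1][::-1])
        # y_{f,n} = [y_{n+h-1}, ..., y_{2h-2+n}]: chronological order
        future.append(Y[n + h - 1:n + 2 * h - 1])
    return np.stack(past), np.stack(future)

# Toy sequence y_t = t for t = 0..10 (T = 10), window size h = 3
Y = np.arange(11, dtype=float).reshape(-1, 1)
past, future = build_windows(Y, h=3)
print(past[0].ravel(), future[0].ravel())    # first window pair
print(past[-1].ravel(), future[-1].ravel())  # last window pair
```

With $T=10$ and $h=3$ this produces $N=6$ window pairs, and every past window is immediately followed in time by its future window.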
The empirical covariance matrices are then computed from the features of these windows. Let $\Phi_{p}=[\Phi_{p,1},\dots,\Phi_{p,N}]$ and $\Phi_{f}=[\Phi_{f,1},\dots,\Phi_{f,N}]$ be the feature matrices for the entire sequence of past and future windows, and let $\bm{Q}_{N}:=\bm{I}_{N}-\frac{1}{N}\mathbf{1}_{N}\mathbf{1}_{N}^{\top}$ be a centering matrix. The covariance matrices are given by:

$$\bm{C}_{pp}:=\frac{1}{N}\Phi_{p}\bm{Q}_{N}\Phi_{p}^{\top},\quad\bm{C}_{ff}:=\frac{1}{N}\Phi_{f}\bm{Q}_{N}\Phi_{f}^{\top},\quad\bm{C}_{fp}:=\frac{1}{N}\Phi_{f}\bm{Q}_{N}\Phi_{p}^{\top}. \tag{21}$$

After computing the Cholesky factorizations $\bm{C}_{ff}=\bm{L}\bm{L}^{\top}$ and $\bm{C}_{pp}=\bm{M}\bm{M}^{\top}$, a singular-value decomposition (SVD) of the normalized cross-covariance is performed:

$$\bm{L}^{-1}\bm{C}_{fp}(\bm{M}^{-1})^{\top}\approx\hat{\bm{U}}\hat{\bm{S}}\hat{\bm{V}}^{\top}. \tag{22}$$

This yields a projection matrix $\bm{B}:=\hat{\bm{S}}^{1/2}\hat{\bm{V}}^{\top}\bm{M}^{-1}$. To construct the latent state, a feature matrix $\Phi_{S}=[\phi_{y}(\bm{y}_{e_{1}}),\dots,\phi_{y}(\bm{y}_{e_{|S|}})]$ is formed from a subset of observations $S\subseteq\{\bm{y}_{0},\dots,\bm{y}_{T}\}$. A vector of weights, $w$, is then learned by optimizing a loss function related to the canonical correlations. The latent state vector is then constructed as:

$$\bm{x}_{n}=\bm{B}\Phi_{p,n}^{\top}\Phi_{S}w, \tag{23}$$

where $\Phi_{S}w$ represents a projection into the RKHS learned via the spectral method to maximize canonical correlations.
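Equations 21 and 22 amount to a canonical correlation analysis between past and future features. The sketch below reproduces the centering, Cholesky, and SVD steps with explicit finite-dimensional feature matrices, which is an assumption (the RKHS features in the text are handled implicitly through kernels). The small `ridge` term that keeps the Cholesky factorizations well defined, the truncation `rank`, and the function name are likewise illustrative choices.

```python
import numpy as np

def cca_projection(Phi_p, Phi_f, rank, ridge=1e-6):
    """Centered covariances, Cholesky factors, and SVD as in Eqs. (21)-(22).

    Phi_p, Phi_f: explicit feature matrices of shape (D, N) (assumed finite-dimensional).
    Returns the projection B = S^{1/2} V^T M^{-1} and all singular values.
    """
    N = Phi_p.shape[1]
    Q = np.eye(N) - np.ones((N, N)) / N                 # centering matrix Q_N
    C_pp = Phi_p @ Q @ Phi_p.T / N
    C_ff = Phi_f @ Q @ Phi_f.T / N
    C_fp = Phi_f @ Q @ Phi_p.T / N
    # Lower-triangular Cholesky factors; the ridge is an added safeguard
    M = np.linalg.cholesky(C_pp + ridge * np.eye(C_pp.shape[0]))
    L = np.linalg.cholesky(C_ff + ridge * np.eye(C_ff.shape[0]))
    A = np.linalg.solve(L, C_fp)                        # L^{-1} C_fp
    A = np.linalg.solve(M, A.T).T                       # L^{-1} C_fp M^{-T}
    _, s, Vt = np.linalg.svd(A)
    # B = S^{1/2} V^T M^{-1}, computed with a solve instead of an explicit inverse
    VtMinv = np.linalg.solve(M.T, Vt[:rank].T).T
    B = np.sqrt(s[:rank])[:, None] * VtMinv
    return B, s

rng = np.random.default_rng(0)
Phi_p = rng.standard_normal((4, 50))
# Future features correlated with the past through a random linear map plus noise
Phi_f = rng.standard_normal((4, 4)) @ Phi_p + 0.1 * rng.standard_normal((4, 50))
B, s = cca_projection(Phi_p, Phi_f, rank=2)
print("canonical correlations:", np.round(s, 3))
```

The singular values are the (regularized) canonical correlations and therefore lie in $[0,1]$; retaining the leading ones fixes the dimension of the latent state.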

With the latent state sequence $\{\bm{x}_{n}\}$ obtained, the system's dynamics are modeled. Using the feature matrices defined previously, the embedded operators $\mathcal{T}_{e}$ and $\mathcal{O}_{e}$ are empirically estimated as regularized operators:

$$\hat{\mathcal{T}}_{e}=\bm{\Psi}_{2}\bm{\Psi}_{1}^{\top}(\bm{\Psi}_{1}\bm{\Psi}_{1}^{\top}+\epsilon_{t}\mathcal{I})^{-1}, \tag{24}$$

$$\hat{\mathcal{O}}_{e}=\bm{\Phi}\bm{\Psi}^{\top}(\bm{\Psi}\bm{\Psi}^{\top}+\epsilon_{o}\mathcal{I})^{-1}, \tag{25}$$

where $\bm{\Psi}_{1}:=\bm{\Psi}_{:,1:N-1}$, $\bm{\Psi}_{2}:=\bm{\Psi}_{:,2:N}$, and $\epsilon_{t},\epsilon_{o}>0$. For the matrix-based implementation of the Kalman filter used in the main text (Section [3.2](https://arxiv.org/html/2606.14195#S3.SS2)), we require the $N\times N$ matrix representations of these operators, which are derived by applying the kernel trick. Following Gebhardt et al. ([2019](https://arxiv.org/html/2606.14195#bib.bib7)), we first define the necessary Gram matrices:

$$\bm{G}_{\tilde{x}}=\bm{\Psi}_{1}^{\top}\bm{\Psi}_{1},\quad\bm{G}_{\tilde{x}x}=\bm{\Psi}_{1}^{\top}\bm{\Psi}_{2},\quad\bm{G}_{x}=\bm{\Psi}^{\top}\bm{\Psi},\quad\bm{G}_{yx}=\bm{\Phi}^{\top}\bm{\Psi}. \tag{26}$$

This allows the operators in Equations [24](https://arxiv.org/html/2606.14195#A1.E24) and [25](https://arxiv.org/html/2606.14195#A1.E25) to be re-expressed as the $N\times N$ matrices $\bm{T}$ and $\bm{O}$:

$$\bm{T}=(\bm{G}_{\tilde{x}}+\epsilon_{t}\bm{I}_{N})^{-1}\bm{G}_{\tilde{x}x}, \tag{27}$$

$$\bm{O}=(\bm{G}_{x}+\epsilon_{o}\bm{I}_{N})^{-1}\bm{G}_{yx}^{\top}. \tag{28}$$

These matrices $\bm{T}$ and $\bm{O}$, along with the noise matrices $\bm{Q}$ and $\bm{R}$, form the basis of the Kalman filtering process described in Section [3.2](https://arxiv.org/html/2606.14195#S3.SS2). The initialization of sequential state estimation with ELTOs is then given in Algorithm [3](https://arxiv.org/html/2606.14195#alg3).

Algorithm 3: Learning system dynamics with the Embedded Latent Transfer Operator (ELTO)

1. **Input:** $\bm{Y}=[\bm{y}_{1},\dots,\bm{y}_{T}]$, kernel function $k(\cdot,\cdot)$, and regularization parameters $\epsilon_{t}$, $\epsilon_{o}$.
2. Compute $\bm{\Phi},\bm{\Psi}$ following Equation [2](https://arxiv.org/html/2606.14195#S2.E2) using $\bm{Y}$.
3. Compute $\bm{T}=(\bm{G}_{\tilde{x}}+\epsilon_{t}\bm{I}_{N})^{-1}\bm{G}_{\tilde{x}x}$ and $\bm{O}=(\bm{G}_{x}+\epsilon_{o}\bm{I}_{N})^{-1}\bm{G}_{yx}^{\top}$ following Equation [26](https://arxiv.org/html/2606.14195#A1.E26).
4. Sample $N$ basis vectors $\bm{u}_{1},\dots,\bm{u}_{N}$ from a multivariate Uniform distribution $\mathcal{U}(0,1)$.
5. Define the kernel matrix $\bm{K}_{0}$ where $(\bm{K}_{0})_{i,j}=k_{x}(\bm{u}_{i},\bm{x}_{j})$ for $i=1,\dots,N$ and $j=1,\dots,N$.
6. Compute the initial state embedding $\bm{C}_{0}=(\bm{G}_{x}+\epsilon_{o}\bm{I}_{N})^{-1}\bm{K}_{0}$.
7. Compute mean $\bm{m}_{0}$ and variance $\bm{S}_{0}$ over the columns of $\bm{C}_{0}$.
8. **Output:** $\bm{T},\bm{O},\bm{m}_{0}$, and $\bm{S}_{0}$.
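Steps 4 to 7 of Algorithm 3 can be sketched as follows. The sketch makes several assumptions that the algorithm does not spell out: an RBF kernel with scale $\gamma=1/d$ is used for $k_{x}$, $\bm{G}_{x}$ is evaluated as the kernel Gram matrix of the latent states, and the "variance over the columns" is read as the sample covariance of the columns of $\bm{C}_{0}$. The helper names are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """RBF kernel matrix k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2) for rows of A and B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def initial_embedding(X, eps_o, rng):
    """Initial mean and covariance of the embedding weights (steps 4-7 of Algorithm 3).

    X: (N, d) array whose rows are the latent states x_1, ..., x_N.
    """
    N, d = X.shape
    gamma = 1.0 / d                                      # assumed kernel scale
    U = rng.uniform(0.0, 1.0, size=(N, d))               # step 4: basis vectors u_i
    K0 = rbf_kernel(U, X, gamma)                         # step 5: (K0)_{ij} = k_x(u_i, x_j)
    G_x = rbf_kernel(X, X, gamma)                        # Gram matrix of the latent states
    C0 = np.linalg.solve(G_x + eps_o * np.eye(N), K0)    # step 6
    m0 = C0.mean(axis=1)                                 # step 7: mean over columns
    S0 = np.cov(C0)                                      # step 7: covariance over columns (assumed)
    return m0, S0

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 2))                         # hypothetical latent states
m0, S0 = initial_embedding(X, eps_o=1e-3, rng=rng)
print(m0.shape, S0.shape)
```

Both outputs live in the $N$-dimensional weight space, matching the $N\times N$ matrices $\bm{T}$ and $\bm{O}$ used by the filter.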

## Appendix B Proof of Distance Monotonicity in RKHS

The following proof validates a general and crucial property of the radial basis function (RBF) kernel. We demonstrate that for any two arbitrary points $\mathbf{y},\mathbf{y}^{\prime}\in\mathbb{Y}$, the RBF kernel $k(\mathbf{y},\mathbf{y}^{\prime})=\exp(-\gamma\|\mathbf{y}-\mathbf{y}^{\prime}\|_{\mathbb{Y}}^{2})$, which is a shift-invariant kernel, guarantees a monotonic relationship between the input-space distance and the feature-space distance. This specific connection is a foundational result in kernel methods (Schölkopf, [2000](https://arxiv.org/html/2606.14195#bib.bib4)), forming the basis of what is often termed the kernel distance (Phillips and Venkatasubramanian, [2011](https://arxiv.org/html/2606.14195#bib.bib5)).

Proposition 2. For any two arbitrary points $\mathbf{y},\mathbf{y}^{\prime}\in\mathbb{Y}$ from the observation space, let the feature map $\phi:\mathbb{Y}\to\mathbb{H}$ be induced by the RBF kernel ($k(\mathbf{y},\mathbf{y}^{\prime})=\exp\left(-\gamma\|\mathbf{y}-\mathbf{y}^{\prime}\|_{\mathbb{Y}}^{2}\right),\ \gamma>0$). The squared Euclidean distance in the feature space, $\|\phi(\mathbf{y})-\phi(\mathbf{y}^{\prime})\|_{\mathbb{H}}^{2}$, is a monotonically increasing function of the squared Euclidean distance in the observation space, $\|\mathbf{y}-\mathbf{y}^{\prime}\|_{\mathbb{Y}}^{2}$.

Proof. We define the squared Euclidean distance in the Reproducing Kernel Hilbert Space (RKHS) $\mathbb{H}$ using the inner product $\langle\cdot,\cdot\rangle_{\mathbb{H}}$:

$$\|\phi(\mathbf{y})-\phi(\mathbf{y}^{\prime})\|_{\mathbb{H}}^{2}=\langle\phi(\mathbf{y})-\phi(\mathbf{y}^{\prime}),\phi(\mathbf{y})-\phi(\mathbf{y}^{\prime})\rangle_{\mathbb{H}} \tag{29}$$

$$=\langle\phi(\mathbf{y}),\phi(\mathbf{y})\rangle_{\mathbb{H}}+\langle\phi(\mathbf{y}^{\prime}),\phi(\mathbf{y}^{\prime})\rangle_{\mathbb{H}}-2\langle\phi(\mathbf{y}),\phi(\mathbf{y}^{\prime})\rangle_{\mathbb{H}}. \tag{30}$$

By the kernel trick, the inner product in the feature space can be computed by the kernel function in the input space:

$$\langle\phi(\mathbf{y}),\phi(\mathbf{y}^{\prime})\rangle_{\mathbb{H}}=k(\mathbf{y},\mathbf{y}^{\prime}). \tag{31}$$

Substituting this into Equation [30](https://arxiv.org/html/2606.14195#A2.E30), we get:

$$\|\phi(\mathbf{y})-\phi(\mathbf{y}^{\prime})\|_{\mathbb{H}}^{2}=k(\mathbf{y},\mathbf{y})+k(\mathbf{y}^{\prime},\mathbf{y}^{\prime})-2k(\mathbf{y},\mathbf{y}^{\prime}). \tag{32}$$

Specifically, the RBF kernel is shift-invariant (stationary), and $k(\mathbf{y},\mathbf{y})=k(\mathbf{y}^{\prime},\mathbf{y}^{\prime})=\exp(0)=1$:

$$\|\phi(\mathbf{y})-\phi(\mathbf{y}^{\prime})\|_{\mathbb{H}}^{2}=2\left(1-\exp(-\gamma\|\mathbf{y}-\mathbf{y}^{\prime}\|_{\mathbb{Y}}^{2})\right). \tag{33}$$

Write $s:=\|\mathbf{y}-\mathbf{y}^{\prime}\|_{\mathbb{Y}}^{2}\geq 0$. Since $\gamma>0$, the map $s\mapsto 2\left(1-\exp(-\gamma s)\right)$ has derivative $2\gamma\exp(-\gamma s)>0$, so $\|\phi(\mathbf{y})-\phi(\mathbf{y}^{\prime})\|_{\mathbb{H}}^{2}$ increases as $s$ increases. This directly proves that the feature-space squared distance is a monotonically increasing function of the input-space squared distance, confirming the property. $\square$

The significance of this general proof is that it provides the formal justification for kernel-based optimization frameworks. By proving this monotonic property, we confirm that minimizing the squared residual in the RKHS (i.e., $\|\phi(\mathbf{y})-\phi(\mathbf{y}^{\prime})\|_{\mathbb{H}}^{2}$) is equivalent to minimizing the squared residual in the original observation space (i.e., $\|\mathbf{y}-\mathbf{y}^{\prime}\|_{\mathbb{Y}}^{2}$). This ensures that an optimal solution found in the RKHS also corresponds to an optimal solution in the observation space (i.e., Equation [16](https://arxiv.org/html/2606.14195#S3.E16)).
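The closed form in Equation 33 can be checked numerically. The following sketch evaluates the feature-space squared distance for a grid of input-space squared distances and confirms that it is increasing; the value $\gamma=0.5$ and the one-dimensional test points are arbitrary choices for illustration.

```python
import numpy as np

def rkhs_sq_dist(y, y_prime, gamma):
    """Squared RKHS distance under the RBF kernel: k(y,y) + k(y',y') - 2 k(y,y')."""
    d2 = float(np.sum((np.asarray(y, dtype=float) - np.asarray(y_prime, dtype=float)) ** 2))
    k = np.exp(-gamma * d2)
    return 2.0 * (1.0 - k)          # uses k(y, y) = k(y', y') = 1

gamma = 0.5
input_sq = np.linspace(0.0, 9.0, 50)            # squared input-space distances
feature_sq = np.array([rkhs_sq_dist([0.0], [np.sqrt(s)], gamma) for s in input_sq])

print("strictly increasing:", bool(np.all(np.diff(feature_sq) > 0)))
print("largest feature-space squared distance:", feature_sq.max())
```

The feature-space distance is increasing but bounded above by 2, so large input-space residuals are compressed; the ordering of residuals is nevertheless preserved, which is what the argument above relies on.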

## Appendix C Extended Ablation and Analysis

### C.1 Core Architectural Ablation

#### Necessity of the Scalar-Block (SB) Structure.

To evaluate the SB parameterization, we compare ELTO-AKF against an ablation baseline lacking the SB structure. The full model (Algorithm [2](https://arxiv.org/html/2606.14195#alg2)) projects noise estimates onto the SB space and updates its low-dimensional Cholesky factors ($\bm{M}_{k},\bm{L}_{k}$) to guarantee positive definiteness. The ablation skips this projection, updating the dense $\bm{R}$ and $\bm{Q}$ matrices directly to simulate a standard AEKF. Both models are evaluated on non-stationary trajectories (Section [4.2](https://arxiv.org/html/2606.14195#S4.SS2)) with $\alpha=\beta=0.9$ and initialized from the same data-driven $\bm{Q}_{0},\bm{R}_{0}$ (detailed in footnote [7](https://arxiv.org/html/2606.14195#footnote7)).

Table 9: MSE ($\times 10^{-2}$) of structured vs. unstructured adaptive noise models on non-stationary trajectories ($\alpha=\beta=0.9$).

| Noise Level | ELTO-AKF with SB structure | ELTO-AKF without SB structure |
|---|---|---|
| Default | 8.8112 | 26.7737 |
| High | 11.2288 | 20.1178 |
| Very High | 35.0613 | 44.8016 |

As shown in Table [9](https://arxiv.org/html/2606.14195#A3.T9), ELTO-AKF outperforms the unstructured ablation. Direct updates to dense matrices in the ablation model fail to preserve the symmetric positive-definite (SPD) property, triggering numerical instability and elevated tracking errors. Furthermore, the robustness against varying and extreme noise conditions is provided in Table [3](https://arxiv.org/html/2606.14195#S4.T3) in the main text.
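The role of Cholesky factors in preserving the SPD property can be illustrated independently of the SB structure. The sketch below uses a generic log-Cholesky parameterization in the spirit of Pinheiro and Bates (1996): any unconstrained parameter vector maps to an SPD matrix, whereas editing dense entries directly can leave the SPD cone. This is not the paper's SB parameterization or its adaptation rule, and the dense edit shown is a deliberately simple hypothetical example.

```python
import numpy as np

def spd_from_unconstrained(theta, n):
    """Map an unconstrained vector of length n(n+1)/2 to an SPD matrix via L @ L.T."""
    L = np.zeros((n, n))
    L[np.tril_indices(n)] = theta
    L[np.diag_indices(n)] = np.exp(np.diag(L))   # strictly positive diagonal
    return L @ L.T

rng = np.random.default_rng(0)
n = 4
theta = rng.standard_normal(n * (n + 1) // 2)

# Any update of theta, however large, still yields an SPD matrix
R_before = spd_from_unconstrained(theta, n)
R_after = spd_from_unconstrained(theta + rng.standard_normal(theta.size), n)
min_eigs = [np.linalg.eigvalsh(R).min() for R in (R_before, R_after)]

# Hypothetical dense edit: inflate one off-diagonal pair while keeping symmetry
R_dense = R_before.copy()
R_dense[0, 1] = R_dense[1, 0] = 2.0 * np.sqrt(R_before[0, 0] * R_before[1, 1])
dense_min_eig = np.linalg.eigvalsh(R_dense).min()

print("Cholesky-parameterized minimum eigenvalues:", min_eigs)
print("dense-edit minimum eigenvalue:", dense_min_eig)
```

The dense edit produces a negative leading $2\times 2$ minor and hence a negative eigenvalue, while the factor-based update cannot do so by construction.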

### C.2 Hyperparameter Sensitivity

#### Memory Factors ($\alpha, \beta$).

Table [10](https://arxiv.org/html/2606.14195#A3.T10) shows the sensitivity to the adaptation rates under both matched (train noise equals test noise) and mismatched conditions.

Table 10: MSE under varying adaptation rates ($\alpha=\beta$). Results are grouped by matched (MSE $\times 10^{-2}$) and mismatched (MSE $\times 10^{-1}$) noise conditions.

| Train noise | Test noise | $\alpha=\beta=0.9$ | $\alpha=\beta=0.7$ | $\alpha=\beta=0.5$ | $\alpha=\beta=0.3$ |
| --- | --- | --- | --- | --- | --- |
| **Matched Conditions (Train = Test)** | | | | | |
| Default | Default | 8.8112 | 10.7419 | 10.9886 | 29.4613 |
| High | High | 11.2288 | 11.4485 | 13.0108 | 14.2895 |
| Very High | Very High | 35.0613 | 35.8572 | 41.6618 | 41.0659 |
| **Mismatched Conditions (Train $\neq$ Test)** | | | | | |
| Default | High | 4.3350 | 4.4122 | 5.2780 | 6.0676 |
| Default | Very High | 4.9642 | 6.0275 | 6.6339 | 7.1362 |

#### Scalar-Block Size ($k$).

Table [11](https://arxiv.org/html/2606.14195#A3.T11) shows the impact of the scalar-block number $k$. Increasing $k$ can improve expressiveness, and $k=10$ achieves the lowest MSE in this experiment. However, the relationship is not strictly monotonic, likely because larger parameter spaces can make CMA-ES optimization more difficult. We use $k=5$ as a practical default balancing accuracy and optimization cost.

Table 11: Sensitivity to scalar-block number ($k$).

| Model | Scalar-block number ($k$) | MSE |
| --- | --- | --- |
| ELTO-KF | - | 0.1600 |
| ELTO-AKF | 2 | 0.1043 |
| ELTO-AKF | 5 | 0.1133 |
| ELTO-AKF | 10 | 0.0694 |

#### Base ELTO Parameters ($h$ and kernel functions).

Tables [12](https://arxiv.org/html/2606.14195#A3.T12) and [13](https://arxiv.org/html/2606.14195#A3.T13) evaluate the foundational operator hyperparameters. We evaluate our method using three standard kernels: an L2-distance Laplacian, Matérn ($\nu=3/2$), and RBF. The scale parameter $\gamma$ for each is uniformly defined as $1/d$, with $d$ being the number of features. Regarding the window size, ELTO-AKF maintains stable performance across different $h$ values, consistently outperforming the non-adaptive ELTO-KF baseline. As for the kernel functions, ELTO-AKF substantially improves over ELTO-KF for the Matérn and RBF kernels, while the Laplacian kernel is less suitable in this setting. These results suggest that structured adaptation is not solely tied to the RBF kernel, although kernel choice remains important. Ultimately, we select $h=5$ and the RBF kernel to balance temporal smoothing and nonlinear representation.

Table 12: Sensitivity to historical window size ($h$).

| Window size ($h$) | ELTO-KF MSE | ELTO-AKF MSE |
| --- | --- | --- |
| 5 | 0.2426 | 0.1113 |
| 10 | 0.2412 | 0.1296 |
| 20 | 0.2415 | 0.1310 |

Table 13: Sensitivity to different kernel functions.

| Kernel | ELTO-KF MSE | ELTO-AKF MSE |
| --- | --- | --- |
| Laplacian | 0.2422 | 0.3168 |
| Matérn | 0.2418 | 0.1544 |
| RBF | 0.2494 | 0.1513 |
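
For reference, the three kernel families can be written out as below. This is a sketch under stated assumptions: the exact functional forms, and in particular how $\gamma$ enters the Matérn kernel, are not given in this appendix, so the code assumes the common textbook definitions with $\gamma=1/d$ multiplying the L2 distance (Laplacian, Matérn) or its square (RBF). All function names are illustrative.

```python
import numpy as np

def l2_distance(x, y):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(y, dtype=float)))

def laplacian_l2(x, y, gamma):
    """Laplacian kernel on the L2 distance: exp(-gamma * ||x - y||_2) (assumed form)."""
    return float(np.exp(-gamma * l2_distance(x, y)))

def matern32(x, y, gamma):
    """Matern kernel with nu = 3/2: (1 + s) * exp(-s), s = sqrt(3) * gamma * ||x - y||_2.

    Using gamma as an inverse length scale is an assumption of this sketch.
    """
    s = np.sqrt(3.0) * gamma * l2_distance(x, y)
    return float((1.0 + s) * np.exp(-s))

def rbf(x, y, gamma):
    """RBF kernel: exp(-gamma * ||x - y||_2^2)."""
    return float(np.exp(-gamma * l2_distance(x, y) ** 2))

x = np.zeros(4)
y = np.ones(4)
gamma = 1.0 / x.size  # gamma = 1/d, here with d = 4 features

kernels = {"laplacian": laplacian_l2, "matern32": matern32, "rbf": rbf}
values = {name: k(x, y, gamma) for name, k in kernels.items()}
```

All three kernels equal 1 at zero distance and decay as the distance grows; they differ in how quickly they decay and in their smoothness near zero.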

### C.3 Computational Complexity and Empirical Runtime

Table [14](https://arxiv.org/html/2606.14195#A3.T14) evaluates the computational cost of ELTO-AKF, which comprises operator training, CMA-ES optimization, and inference. Operator construction involves Gram matrix computation and inversion, which scale polynomially with the number of training samples $N$; this step is computed offline and is independent of the scalar-block size $k$. During inference, the state and covariance updates require matrix inversions. As $k$ increases, CMA-ES optimization time grows because estimating the symmetric matrices $\bm{Q}_{\theta}$ and $\bm{R}_{\theta}$ expands the search space to $\mathcal{O}(k^{2})$. Thus, while a larger $k$ increases the parameter optimization cost, operator training and inference times remain essentially unchanged across $k$. Note that the high variance in the baseline ELTO-KF's optimization time stems from frequent SVD failures caused by its lack of structural SPD guarantees during the unconstrained search.

Table 14: Computational complexity analysis of ELTO-AKF.

| Stage | ELTO-KF | ELTO-AKF ($k=2$) | ELTO-AKF ($k=5$) | ELTO-AKF ($k=10$) |
| --- | --- | --- | --- | --- |
| Operator Training (s) | 3.82 ± 0.18 | 3.78 ± 0.02 | 3.80 ± 0.15 | 3.81 ± 0.03 |
| CMA-ES Optimization (s) | 137.36 ± 45.27 | 282.91 ± 0.96 | 395.58 ± 1.11 | 507.92 ± 0.71 |
| Inference (s) | 1.40 ± 0.01 | 2.778 ± 0.007 | 2.782 ± 0.006 | 2.773 ± 0.008 |
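
The quadratic growth of the search space can be made concrete with a short count. Assuming, as a simplification not confirmed by the text, that $\bm{Q}_{\theta}$ and $\bm{R}_{\theta}$ are each a $k\times k$ symmetric matrix represented by its lower triangle, each contributes $k(k+1)/2$ free parameters. The function name is hypothetical.

```python
def n_free_params(k):
    """Free parameters of one k x k symmetric matrix (lower triangle incl. diagonal).

    Assumes a k x k symmetric parameterization; the paper's exact count may differ.
    """
    return k * (k + 1) // 2

# Count for the scalar-block sizes reported in Table 14.
counts = {k: 2 * n_free_params(k) for k in (2, 5, 10)}  # Q and R together

for k, total in counts.items():
    print(f"k={k}: {n_free_params(k)} per matrix, {total} for Q and R together")
```

Under this assumption the count rises from 6 at $k=2$ to 110 at $k=10$, which is in line with the longer CMA-ES optimization times reported for larger $k$.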
