Learning dynamical systems from noisy data with Weak-form Kernel Ridge Regression
Summary
Introduces Weak-form Kernel Ridge Regression (WKRR) for learning dynamical systems from noisy measurements, combining a weak formulation with kernel ridge regression to filter noise and improve accuracy. The method outperforms baseline methods on chaotic benchmarks up to 64 dimensions and 15,000-dimensional real-world fluid data.
View Cached Full Text
Cached at: 07/02/26, 05:36 AM
# Learning dynamical systems from noisy data with Weak-form Kernel Ridge Regression
Source: [https://arxiv.org/html/2607.00257](https://arxiv.org/html/2607.00257)
Max Kreider Department of Mathematics The Pennsylvania State University, University Park, PA 16802, USA mbk6295@psu\.edu &John Harlim Department of Mathematics, Institute for Computational and Data Sciences The Pennsylvania State University, University Park, PA 16802, USA jharlim@psu\.edu &Daning Huang Department of Aerospace Engineering The Pennsylvania State University, University Park, PA 16802, USA daning@psu\.edu
###### Abstract
Accurate prediction of complex dynamical systems from noisy measurements remains a significant challenge in scientific computing\. Kernel ridge regression learning strategies are often effective when applied to clean data, but have limited success with noisy data\. Recent work has observed that a weak formulation can act to filter noisy data, and different learning strategies have achieved increased noise robustness with a weak\-form framework\. In this manuscript, we give an overview of the filtering mechanism behind the weak formulation and provide a bias\-variance error decomposition\. Using these insights, we combine a weak formulation with a kernel learning strategy to propose Weak\-form Kernel Ridge Regression \(WKRR\) for learning dynamical systems\. The proposed framework is simple to implement, effective for both clean and noisy data, and outperforms several baseline methods\. We demonstrate the performance of WKRR on chaotic benchmark systems in up to 64 dimensions, as well as 15,000\-dimensional real\-world fluid data\.
## 1Introduction
Many problems in scientific computing and engineering disciplines involve modeling and prediction of dynamical systems, with applications including weather\[[17](https://arxiv.org/html/2607.00257#bib.bib76),[37](https://arxiv.org/html/2607.00257#bib.bib74),[57](https://arxiv.org/html/2607.00257#bib.bib75)\], environmental and ecological science\[[29](https://arxiv.org/html/2607.00257#bib.bib72),[49](https://arxiv.org/html/2607.00257#bib.bib70),[64](https://arxiv.org/html/2607.00257#bib.bib73),[82](https://arxiv.org/html/2607.00257#bib.bib71)\], biology\[[28](https://arxiv.org/html/2607.00257#bib.bib67),[60](https://arxiv.org/html/2607.00257#bib.bib69),[77](https://arxiv.org/html/2607.00257#bib.bib68)\], fluid dynamics\[[1](https://arxiv.org/html/2607.00257#bib.bib64),[23](https://arxiv.org/html/2607.00257#bib.bib65),[47](https://arxiv.org/html/2607.00257#bib.bib66),[62](https://arxiv.org/html/2607.00257#bib.bib96)\], finance\[[14](https://arxiv.org/html/2607.00257#bib.bib63),[73](https://arxiv.org/html/2607.00257#bib.bib62)\], and traffic\[[5](https://arxiv.org/html/2607.00257#bib.bib60),[6](https://arxiv.org/html/2607.00257#bib.bib61),[78](https://arxiv.org/html/2607.00257#bib.bib59)\]\. However, many physically relevant problems remain challenging due to high\-dimensionality, complex or chaotic dynamics, lack of known underlying dynamics, and noisy or low\-fidelity observational data\.
Purely data\-driven methods have emerged as a strong option for learning dynamical systems\[[11](https://arxiv.org/html/2607.00257#bib.bib55),[27](https://arxiv.org/html/2607.00257#bib.bib58),[58](https://arxiv.org/html/2607.00257#bib.bib54),[74](https://arxiv.org/html/2607.00257#bib.bib56),[75](https://arxiv.org/html/2607.00257#bib.bib57)\]\. Such methods circumvent the need to form dynamical equations and typically seek to represent unknown dynamics with a high\-fidelity reduced\-order model\. A variety of popular approaches have proven to be competitive in this context\. Dynamic mode decomposition \(DMD\) and variants leverage a Koopman framework that lifts finite\-dimensional nonlinear data to an infinite\-dimensional linear representation, often leading to a simplified low\-dimensional surrogate model\[[21](https://arxiv.org/html/2607.00257#bib.bib52),[20](https://arxiv.org/html/2607.00257#bib.bib50),[41](https://arxiv.org/html/2607.00257#bib.bib53),[43](https://arxiv.org/html/2607.00257#bib.bib51),[55](https://arxiv.org/html/2607.00257#bib.bib49),[75](https://arxiv.org/html/2607.00257#bib.bib57)\]\. Sparse identification of nonlinear dynamical systems \(SINDy\) and variants discover dynamical equations from data by choosing a suitable, often sparse, linear combination of dictionary functions\[[12](https://arxiv.org/html/2607.00257#bib.bib47),[13](https://arxiv.org/html/2607.00257#bib.bib48),[38](https://arxiv.org/html/2607.00257#bib.bib46),[85](https://arxiv.org/html/2607.00257#bib.bib45)\]\. Neural ordinary differential equations \(NODEs\) learning underlying dynamics by training a neural network to represent a continuous\-time vector field\[[15](https://arxiv.org/html/2607.00257#bib.bib43),[33](https://arxiv.org/html/2607.00257#bib.bib41),[44](https://arxiv.org/html/2607.00257#bib.bib79),[59](https://arxiv.org/html/2607.00257#bib.bib42),[83](https://arxiv.org/html/2607.00257#bib.bib238),[86](https://arxiv.org/html/2607.00257#bib.bib40)\]\. Various machine learning techniques such as Long Short\-Term Memory\[[35](https://arxiv.org/html/2607.00257#bib.bib81),[45](https://arxiv.org/html/2607.00257#bib.bib38),[71](https://arxiv.org/html/2607.00257#bib.bib148),[84](https://arxiv.org/html/2607.00257#bib.bib39)\], reservoir computing\[[26](https://arxiv.org/html/2607.00257#bib.bib34),[56](https://arxiv.org/html/2607.00257#bib.bib36),[66](https://arxiv.org/html/2607.00257#bib.bib35),[79](https://arxiv.org/html/2607.00257#bib.bib37)\], and autoencoders\[[25](https://arxiv.org/html/2607.00257#bib.bib33)\]have also gained prominence\. Kernel\-based approaches such as kernel ridge regression \(KRR\) mitigate the curse of dimensionality with the so\-called “kernel trick”\[[2](https://arxiv.org/html/2607.00257#bib.bib30),[4](https://arxiv.org/html/2607.00257#bib.bib31),[24](https://arxiv.org/html/2607.00257#bib.bib32),[65](https://arxiv.org/html/2607.00257#bib.bib77),[72](https://arxiv.org/html/2607.00257#bib.bib29)\]\. KRR is especially attractive because it does not require a dictionary of functions, and is straightforward to implement\. Recent work has shown that KRR outperforms multiple baseline methods in data\-driven dynamical system learning and forecasting problems over a wide range of data sets\[[65](https://arxiv.org/html/2607.00257#bib.bib77)\]\. While these approaches typically perform well when applied to clean data, their performance often degrades significantly in the presence of measurement or observational noise, especially when the underlying dynamics are chaotic\.
Several methods have been proposed to mitigate lack of robustness in the presence of noise, including data assimilation and filtering methods\[[16](https://arxiv.org/html/2607.00257#bib.bib27),[31](https://arxiv.org/html/2607.00257#bib.bib97),[32](https://arxiv.org/html/2607.00257#bib.bib28)\], and Gaussian process approaches\[[30](https://arxiv.org/html/2607.00257#bib.bib26),[80](https://arxiv.org/html/2607.00257#bib.bib25),[81](https://arxiv.org/html/2607.00257#bib.bib24)\]\. The approach developed in\[[31](https://arxiv.org/html/2607.00257#bib.bib97)\], coined RAFDA, has proven to be competitive for low\-dimensional systems, but does not scale well to high\-dimensions\. A different approach is to combine a learning strategy with a weak formulation, which involves integrating data residuals over a family of test functions\. Classical \(strong\) approaches enforce pointwise consistency between given observational data, and typically perform poorly for noisy data because pointwise errors are magnified by erroneous fluctuations\. In contrast, weak approaches relax pointwise loss strategies in favor of orthogonality constraints between residuals and test functions\. Loss functions involving weak formulations have been observed to perform more robustly in the presence of noise, and have been incorporated successfully into both the SINDy framework\[[8](https://arxiv.org/html/2607.00257#bib.bib22),[51](https://arxiv.org/html/2607.00257#bib.bib80),[53](https://arxiv.org/html/2607.00257#bib.bib21),[54](https://arxiv.org/html/2607.00257#bib.bib23)\]and the NODE framework\[[44](https://arxiv.org/html/2607.00257#bib.bib79)\]\. Both\[[44](https://arxiv.org/html/2607.00257#bib.bib79)\]and\[[51](https://arxiv.org/html/2607.00257#bib.bib80)\]point out that a weak formulation acts as a filter for noisy data, suggesting a mechanism for its observed noise robustness\.
In the present manuscript, we will build on these previous insights to show explicitly that the weak formulation acts to filter noisy data\. We will consider a family of test functions that arise from uniform translations of a generating function, and will interpret the resulting weak formulation as an orthogonal projection procedure which filters by projecting noisy data onto the subspace spanned by these test functions\. Moreover, we provide a brief bias\-variance error decomposition of signal filtering via the weak form\. In particular, we provide an exact expression for the variance, i\.e\., the error that arises due to noise corruption\. This interpretation connects to well\-known results in information theory and signal processing\[[70](https://arxiv.org/html/2607.00257#bib.bib20),[68](https://arxiv.org/html/2607.00257#bib.bib15),[67](https://arxiv.org/html/2607.00257#bib.bib18),[69](https://arxiv.org/html/2607.00257#bib.bib19),[76](https://arxiv.org/html/2607.00257#bib.bib16)\]\. Shannon’s sampling theorem, which states that an ideal band\-limited function can be perfectly reconstructed with suitably sampled data, may be interpreted as an orthogonal projection onto a subspace of band\-limited functions\[[63](https://arxiv.org/html/2607.00257#bib.bib17)\]\. Subsequent work has extended this orthogonal projection analysis to classes of functions arising from integer shifts of generating functions\[[70](https://arxiv.org/html/2607.00257#bib.bib20)\]\. While the error analysis that we provide is not novel, we include it to justify the use of a weak formulation and to clarify the mechanism by which it provides noise robustness\.
Motivated by the simplicity and recent success of kernel\-based learning methods and the noise robustness of weak formulations, we propose Weak\-form Kernel Ridge Regression \(WKRR\) as a noise robust, data\-driven learning framework\. We remark that WKRR does not require a suite of dictionary functions, in contrast to Weak SINDy approaches\[[8](https://arxiv.org/html/2607.00257#bib.bib22),[51](https://arxiv.org/html/2607.00257#bib.bib80),[53](https://arxiv.org/html/2607.00257#bib.bib21),[54](https://arxiv.org/html/2607.00257#bib.bib23)\]that critically need the underlying functions or vector fields to be spanned by the dictionary functions\. Moreover, WKRR does not require extensive hyperparameter tuning as many machine learning frameworks, such as NODE, require\[[44](https://arxiv.org/html/2607.00257#bib.bib79)\]\. We will show numerically that the forecasting horizon of WKRR is similar to that of a strong KRR formulation, but much less computationally expensive\. We will also demonstrate the effectiveness of WKRR with different choices of kernel function, including the standard Gaussian kernel and the Diffusion Maps \(DM\) kernel, which has recently received attention for its excellent performance across a wide range of chaotic and experimental datasets\[[65](https://arxiv.org/html/2607.00257#bib.bib77)\]\. The success of WKRR with various kernel functions broadens the method’s applicability and suggests that the practitioner may select kernel functions that favor speed or accuracy as the situation warrants\.
The remainder of the manuscript is organized as follows\. In §[2](https://arxiv.org/html/2607.00257#S2), we review the classical KRR method and the Gaussian and DM kernel functions\. In §[3](https://arxiv.org/html/2607.00257#S3), we review the weak formulation, explain how it acts as a filter, and provide a brief bias\-variance error decomposition for the filtering procedure\. The main contribution of this work is provided in §[4](https://arxiv.org/html/2607.00257#S4), where we propose WKRR as a noise\-robust learning method for dynamical systems\. We propose a validation procedure to select appropriate model hyperparameters, and summarize the steps needed to implement WKRR in practice\. We apply WKRR to several numerical examples in §[5](https://arxiv.org/html/2607.00257#S5), including chaotic baseline systems in up to 64 dimensions, and real\-world turbulent fluid data made available by\[[62](https://arxiv.org/html/2607.00257#bib.bib96)\]as part of a Community Challenge\. We conclude with a brief discussion in §[6](https://arxiv.org/html/2607.00257#S6)\.
## 2Kernel Ridge Regression Review
In this section, we review the classical kernel ridge regression \(KRR\) framework for learning solution operators of dynamical systems\. We will refer to this approach as the “strong approach” throughout the manuscript\.
### 2\.1Kernel Ridge Regression
Suppose we are given noisy data𝐮\(ti\)≡𝐮i=\(ui\(1\),…,ui\(n\)\)∈ℝn\\mathbf\{u\}\(t\_\{i\}\)\\equiv\\mathbf\{u\}\_\{i\}=\(u\_\{i\}^\{\(1\)\},\\dots,u\_\{i\}^\{\(n\)\}\)\\in\\mathbb\{R\}^\{n\},i=1,…,Ni=1,\\dots,N, sampled at timesti=\(i−1\)⋅Δtt\_\{i\}=\(i\-1\)\\cdot\\Delta t\. We assume that the data is of the form
𝐮i=𝐱i\+σ𝝃i,\\mathbf\{u\}\_\{i\}=\\mathbf\{x\}\_\{i\}\+\\sigma\\boldsymbol\{\\xi\}\_\{i\},\(1\)where𝐱i\\mathbf\{x\}\_\{i\}denotes clean data generated by an autonomous dynamical system of the form𝐱′=𝐟\(𝐱\)\\mathbf\{x\}^\{\\prime\}=\\mathbf\{f\}\(\\mathbf\{x\}\), and𝝃i\\boldsymbol\{\\xi\}\_\{i\}is i\.i\.d\. Gaussian with zero mean and covariance matrix𝚺=diag\(η12,…,ηn2\)\\boldsymbol\{\\Sigma\}=\\operatorname\{diag\}\(\\eta\_\{1\}^\{2\},\\dots,\\eta\_\{n\}^\{2\}\)\. We defineηℓ\\eta\_\{\\ell\}to be the root\-mean\-square \(RMS\) of theℓ\\ellth component of the clean signal
ηℓ=1N∑i=1N\|xi\(ℓ\)\|2\.\\eta\_\{\\ell\}=\\sqrt\{\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}\\left\|x\_\{i\}^\{\(\\ell\)\}\\right\|^\{2\}\}\.\(2\)The parameterσ≥0\\sigma\\geq 0represents the signal\-to\-noise ratio and will be varied in our numerical experiments\.
In this section, we assume thatσ=0\\sigma=0so that the given data is not corrupted by noise\. Letℳ⊂ℝn\\mathcal\{M\}\\subset\\mathbb\{R\}^\{n\}denote the forward invariant set of the underlying system\. Define the flow map𝚽τ=\(Φτ\(1\),…,Φτ\(n\)\):ℝn→ℳ\\boldsymbol\{\\Phi\}\_\{\\tau\}=\(\\Phi\_\{\\tau\}^\{\(1\)\},\\dots,\\Phi\_\{\\tau\}^\{\(n\)\}\):\\mathbb\{R\}^\{n\}\\to\\mathcal\{M\}, which advances the system state over a time incrementτ\>0\\tau\>0, i\.e\.,𝚽τ\(𝐱\(t\)\)=𝐱\(t\+τ\)\\boldsymbol\{\\Phi\}\_\{\\tau\}\(\\mathbf\{x\}\(t\)\)=\\mathbf\{x\}\(t\+\\tau\)\.
We will use KRR to learn solution operators with either adirect\-connectionor askip\-connectionscheme\. Both frameworks are useful in practice; direct\-connection often outperforms skip\-connection when the underlying dynamics are stiff\. Otherwise, the skip\-connection scheme typically yields superior performance\[[65](https://arxiv.org/html/2607.00257#bib.bib77)\]\. We employ both frameworks in our numerical examples\. We now describe each learning framework in turn\.
#### 2\.1\.1Direct\-Connection Scheme
Letk\(⋅,⋅;ϵ\):ℝn×ℝn→ℝk\(\\cdot,\\cdot;\\epsilon\):\\mathbb\{R\}^\{n\}\\times\\mathbb\{R\}^\{n\}\\to\\mathbb\{R\}denote a scalar\-valued kernel function, whereϵ\>0\\epsilon\>0denotes a tunable bandwidth parameter\. The key idea of the direct\-connection scheme is to approximate each component of the flow map directly,
Φτ\(ℓ\)\(𝐱\)≈∑i=1Nk\(𝐱,𝐮i;ϵ\)αi\(ℓ\),\\Phi^\{\(\\ell\)\}\_\{\\tau\}\(\\mathbf\{x\}\)\\approx\\sum\_\{i=1\}^\{N\}k\(\\mathbf\{x\},\\mathbf\{u\}\_\{i\};\\epsilon\)\\alpha\_\{i\}^\{\(\\ell\)\},\(3\)for reconstruction coefficients\{αi\(ℓ\)\}ℓ=1,…,n\\\{\\alpha\_\{i\}^\{\(\\ell\)\}\\\}\_\{\\ell=1,\\ldots,n\}to be determined\. To proceed, we collect the available data in snapshot pairsS=\(𝐗,𝐘\)S=\(\\mathbf\{X\},\\mathbf\{Y\}\)\. The input matrix𝐗∈ℝ\(N−1\)×n\\mathbf\{X\}\\in\\mathbb\{R\}^\{\(N\-1\)\\times n\}is defined by
𝐗=\[𝐮1𝐮2…𝐮N−1\]⊤,\\mathbf\{X\}=\\begin\{bmatrix\}\\mathbf\{u\}\_\{1\}&\\mathbf\{u\}\_\{2\}&\\dots&\\mathbf\{u\}\_\{N\-1\}\\end\{bmatrix\}^\{\\top\},\(4\)while the output matrix𝐘∈ℝ\(N−1\)×n\\mathbf\{Y\}\\in\\mathbb\{R\}^\{\(N\-1\)\\times n\}is defined by
𝐘=\[𝐮2𝐮3…𝐮N\]⊤\.\\mathbf\{Y\}=\\begin\{bmatrix\}\\mathbf\{u\}\_\{2\}&\\mathbf\{u\}\_\{3\}&\\dots&\\mathbf\{u\}\_\{N\}\\end\{bmatrix\}^\{\\top\}\.\(5\)
LetK\(⋅,𝐗;ϵ\)=\[k\(⋅,𝐮1;ϵ\),…,k\(⋅,𝐮N−1;ϵ\)\]:ℝn→ℝN−1K\(\\cdot,\\mathbf\{X\};\\epsilon\)=\[k\(\\cdot,\\mathbf\{u\}\_\{1\};\\epsilon\),\\dots,k\(\\cdot,\\mathbf\{u\}\_\{N\-1\};\\epsilon\)\]:\\mathbb\{R\}^\{n\}\\to\\mathbb\{R\}^\{N\-1\}be the kernel function evaluated over the input data𝐗\\mathbf\{X\}\. In this notation, our goal is to model the flow map
𝐮i\+1=𝚽Δt\(𝐮i\)≈K\(𝐮i,𝐗;ϵ\)𝜶,\\mathbf\{u\}\_\{i\+1\}=\\boldsymbol\{\\Phi\}\_\{\\Delta t\}\(\\mathbf\{u\}\_\{i\}\)\\approx K\(\\mathbf\{u\}\_\{i\},\\mathbf\{X\};\\epsilon\)\\boldsymbol\{\\alpha\},\(6\)where𝜶∈ℝ\(N−1\)×n\\boldsymbol\{\\alpha\}\\in\\mathbb\{R\}^\{\(N\-1\)\\times n\}collects the reconstruction coefficients\. With the goal of expressing \([6](https://arxiv.org/html/2607.00257#S2.E6)\) in matrix notation, define the Gram matrix𝐊\(ϵ\)∈ℝ\(N−1\)×\(N−1\)\\mathbf\{K\}\(\\epsilon\)\\in\\mathbb\{R\}^\{\(N\-1\)\\times\(N\-1\)\}whose\(i,j\)\(i,j\)entry isk\(𝐮i,𝐮j;ϵ\)k\(\\mathbf\{u\}\_\{i\},\\mathbf\{u\}\_\{j\};\\epsilon\)fori,j=1,…,N−1i,j=1,\\dots,N\-1\. We then arrive at a linear least squares problem which enforces pointwise consistency between the snapshot data pairs
𝐘=𝐊\(ϵ\)𝜶\.\\mathbf\{Y\}=\\mathbf\{K\}\(\\epsilon\)\\boldsymbol\{\\alpha\}\.\(7\)We remark that \([7](https://arxiv.org/html/2607.00257#S2.E7)\) is often ill\-conditioned\. In practice, a solution𝜶^\\hat\{\\boldsymbol\{\\alpha\}\}is determined by solving the related KRR problem
𝜶^=argmin𝜶\(‖𝐊\(ϵ\)𝜶−𝐘‖F2\+λ𝜶⊤𝐊\(ϵ\)𝜶\),\\hat\{\\boldsymbol\{\\alpha\}\}=\\arg\\min\_\{\\boldsymbol\{\\alpha\}\}\\left\(\\\|\\mathbf\{K\}\(\\epsilon\)\\boldsymbol\{\\alpha\}\-\\mathbf\{Y\}\\\|\_\{F\}^\{2\}\+\\lambda\\boldsymbol\{\\alpha\}^\{\\top\}\\mathbf\{K\}\(\\epsilon\)\\boldsymbol\{\\alpha\}\\right\),\(8\)whereλ\\lambdais a user\-specified regularization parameter\. The problem \([8](https://arxiv.org/html/2607.00257#S2.E8)\) admits a unique solution
𝜶^=\(𝐊\(ϵ\)\+λ𝐈\)−1𝐘,\\hat\{\\boldsymbol\{\\alpha\}\}=\\left\(\\mathbf\{K\}\(\\epsilon\)\+\\lambda\\mathbf\{I\}\\right\)^\{\-1\}\\mathbf\{Y\},\(9\)where𝐈\\mathbf\{I\}denotes the\(N−1\)×\(N−1\)\(N\-1\)\\times\(N\-1\)identity matrix\. Once𝜶^\\hat\{\\boldsymbol\{\\alpha\}\}has been determined, out\-of\-sample model forecasting can be achieved via
𝐮~i\+1=K\(𝐮~i,𝐗;ϵ\)𝜶^,\\tilde\{\\mathbf\{u\}\}\_\{i\+1\}=K\(\\tilde\{\\mathbf\{u\}\}\_\{i\},\\mathbf\{X\};\\epsilon\)\\hat\{\\boldsymbol\{\\alpha\}\},\(10\)where𝐮~i\\tilde\{\\mathbf\{u\}\}\_\{i\}is the kernel approximation to the given data\.
#### 2\.1\.2Skip\-Connection Scheme
Instead of directly modeling the flow map, a skip\-connection scheme models each component of the residual map
Φτ\(ℓ\)\(𝐱\)−𝐱\(ℓ\)≈∑i=1Nk\(𝐱,𝐮i;ϵ\)αi\(ℓ\)\.\\Phi^\{\(\\ell\)\}\_\{\\tau\}\(\\mathbf\{x\}\)\-\\mathbf\{x\}^\{\(\\ell\)\}\\approx\\sum\_\{i=1\}^\{N\}k\(\\mathbf\{x\},\\mathbf\{u\}\_\{i\};\\epsilon\)\\alpha\_\{i\}^\{\(\\ell\)\}\.\(11\)We define𝐗\\mathbf\{X\}according to \([4](https://arxiv.org/html/2607.00257#S2.E4)\) as before, but now define the output matrix to be
𝐘=\[Δ𝐮1Δ𝐮2…Δ𝐮N−1\]⊤,Δ𝐮i=𝐮i\+1−𝐮i\.\\mathbf\{Y\}=\\begin\{bmatrix\}\\Delta\\mathbf\{u\}\_\{1\}&\\Delta\\mathbf\{u\}\_\{2\}&\\dots&\\Delta\\mathbf\{u\}\_\{N\-1\}\\end\{bmatrix\}^\{\\top\},\\quad\\Delta\\mathbf\{u\}\_\{i\}=\\mathbf\{u\}\_\{i\+1\}\-\\mathbf\{u\}\_\{i\}\.\(12\)In this setup, our goal is to model
𝐮i\+1−𝐮i=𝚽Δt\(𝐮i\)−𝐮i≈K\(𝐮i,𝐗;ϵ\)𝜶\.\\mathbf\{u\}\_\{i\+1\}\-\\mathbf\{u\}\_\{i\}=\\boldsymbol\{\\Phi\}\_\{\\Delta t\}\(\\mathbf\{u\}\_\{i\}\)\-\\mathbf\{u\}\_\{i\}\\approx K\(\\mathbf\{u\}\_\{i\},\\mathbf\{X\};\\epsilon\)\\boldsymbol\{\\alpha\}\.\(13\)In the same notation as above, we arrive at a least squares problem \([7](https://arxiv.org/html/2607.00257#S2.E7)\), but with𝐘\\mathbf\{Y\}now defined according to \([12](https://arxiv.org/html/2607.00257#S2.E12)\)\. Solving for the coefficients𝜶^\\hat\{\\boldsymbol\{\\alpha\}\}proceeds as above, and forecasting is performed according to
𝐮~i\+1=𝐮~i\+K\(𝐮~i,𝐗;ϵ\)𝜶^\.\\tilde\{\\mathbf\{u\}\}\_\{i\+1\}=\\tilde\{\\mathbf\{u\}\}\_\{i\}\+K\(\\tilde\{\\mathbf\{u\}\}\_\{i\},\\mathbf\{X\};\\epsilon\)\\hat\{\\boldsymbol\{\\alpha\}\}\.\(14\)
Notice that both the direct\- and skip\-connection schemes require the user to specify a kernel function, a bandwidth parameterϵ\\epsilon, and a regularization parameterλ\\lambda\. We discuss appropriate choices for the kernel function below\. We defer the discussion of the tunable parametersϵ\\epsilonandλ\\lambdato §[4\.2](https://arxiv.org/html/2607.00257#S4.SS2), where we extend a previously developed validation strategy\[[65](https://arxiv.org/html/2607.00257#bib.bib77)\]to handle noisy data using a weak formulation\.
### 2\.2Choice of Kernel Function
The choice of kernel function in the KRR approach described above is crucial to the success of the method\. While any kernel function can be used, a specific choice of kernel may be more, or less, appropriate for a given problem\.
Suppose we are given a dataset𝐔=\{𝐮1,…,𝐮N\}\\mathbf\{U\}=\\\{\\mathbf\{u\}\_\{1\},\\dots,\\mathbf\{u\}\_\{N\}\\\}consisting of observations sampled from the forward invariant setℳ⊂ℝn\\mathcal\{M\}\\subset\\mathbb\{R\}^\{n\}\. Unless otherwise stated, in this work we will consider a standard Gaussian kernelkRBF:ℳ×ℳ→ℝk\_\{\\text\{RBF\}\}:\\mathcal\{M\}\\times\\mathcal\{M\}\\to\\mathbb\{R\},
kRBF\(𝐱,𝐲;ϵ\)=exp\(−‖𝐱−𝐲‖224ϵ\),k\_\{\\text\{RBF\}\}\(\\mathbf\{x\},\\mathbf\{y\};\\epsilon\)=\\exp\\left\(\-\\frac\{\\\|\\mathbf\{x\}\-\\mathbf\{y\}\\\|\_\{2\}^\{2\}\}\{4\\epsilon\}\\right\),\(15\)for any𝐱,𝐲∈ℳ\\mathbf\{x\},\\mathbf\{y\}\\in\\mathcal\{M\}\. The parameterϵ\>0\\epsilon\>0is a scalar\-valued bandwidth parameter that should be determined by the practitioner\. We describe an approach to chooseϵ\\epsilonin §[4\.2](https://arxiv.org/html/2607.00257#S4.SS2)where we extend the KRR approach to noisy data\.
We also consider the Diffusion Maps \(DM\) kernel, a data\-driven kernel function based on the Diffusion Maps algorithm\[[18](https://arxiv.org/html/2607.00257#bib.bib119)\]that has recently been employed with success over a wide range of datasets\[[65](https://arxiv.org/html/2607.00257#bib.bib77)\]\. In particular, recent work has shown that the DM kernel can exhibit superior forecasting performance compared to the standard Gaussian kernel when applied to clean data, especially when the dimension of the invariant setℳ\\mathcal\{M\}is much lower than its ambient dimension,nn\[[65](https://arxiv.org/html/2607.00257#bib.bib77)\]\.
We will compare the Gaussian and DM kernels over two baseline examples in §[5](https://arxiv.org/html/2607.00257#S5)when only noisy data is available\. We will show numerically that both kernels have nearly identical performance for a low\-dimensional chaotic system, but that the DM kernel exhibits superior performance on a high\-dimensional chaotic system with low intrinsic dimension\. These results suggest that systems with low intrinsic dimension or special geometry may benefit from the DM kernel, even when observational data is corrupted by noise\. Otherwise, the Gaussian kernel may be more appropriate due to lower computational complexity and ease of implementation\.
We now describe the numerical implementation of the DM kernel\. Following\[[65](https://arxiv.org/html/2607.00257#bib.bib77)\], we begin with the standard Gaussian RBF kernel \([15](https://arxiv.org/html/2607.00257#S2.E15)\)\. We then normalize \([15](https://arxiv.org/html/2607.00257#S2.E15)\) in two stages to arrive at the DM kernel,kDM,N:ℳ×ℳ→ℝk\_\{\\text\{DM\},N\}:\\mathcal\{M\}\\times\\mathcal\{M\}\\to\\mathbb\{R\}, by first writing
ks,N\(𝐱,𝐲;ϵ\)=kRBF\(𝐱,𝐲;ϵ\)sN\(𝐱;ϵ\)sN\(𝐲;ϵ\),sN\(𝐱;ϵ\)=1N∑j=1NkRBF\(𝐱,𝐮j;ϵ\),k\_\{s,N\}\(\\mathbf\{x\},\\mathbf\{y\};\\epsilon\)=\\frac\{k\_\{\\text\{RBF\}\}\(\\mathbf\{x\},\\mathbf\{y\};\\epsilon\)\}\{s\_\{N\}\(\\mathbf\{x\};\\epsilon\)s\_\{N\}\(\\mathbf\{y\};\\epsilon\)\},\\qquad s\_\{N\}\(\\mathbf\{x\};\\epsilon\)=\\frac\{1\}\{N\}\\sum\_\{j=1\}^\{N\}k\_\{\\text\{RBF\}\}\(\\mathbf\{x\},\\mathbf\{u\}\_\{j\};\\epsilon\),\(16\)and then writing
kDM,N\(𝐱,𝐲;ϵ\)=ks,N\(𝐱,𝐲;ϵ\)qN\(𝐱;ϵ\)qN\(𝐲;ϵ\),qN\(𝐱;ϵ\)=1N∑j=1Nks,N\(𝐱,𝐮j;ϵ\),k\_\{\\text\{DM\},N\}\(\\mathbf\{x\},\\mathbf\{y\};\\epsilon\)=\\frac\{\{k\}\_\{s,N\}\(\\mathbf\{x\},\\mathbf\{y\};\\epsilon\)\}\{\\sqrt\{\{q\}\_\{N\}\(\\mathbf\{x\};\\epsilon\)\{q\}\_\{N\}\(\\mathbf\{y\};\\epsilon\)\}\},\\qquad\{q\}\_\{N\}\(\\mathbf\{x\};\\epsilon\)=\\frac\{1\}\{N\}\\sum\_\{j=1\}^\{N\}k\_\{s,N\}\(\\mathbf\{x\},\\mathbf\{u\}\_\{j\};\\epsilon\),\(17\)for any𝐱,𝐲∈ℳ\\mathbf\{x\},\\mathbf\{y\}\\in\\mathcal\{M\}\. Note that the DM kernel is not the same as the DM matrix\[[18](https://arxiv.org/html/2607.00257#bib.bib119)\],
PN\(𝐱,𝐲;ϵ\)=ks,N\(𝐱,𝐲;ϵ\)∑j=1Nks,N\(𝐱,𝐮j;ϵ\),P\_\{N\}\(\\mathbf\{x\},\\mathbf\{y\};\\epsilon\)=\\frac\{k\_\{s,N\}\(\\mathbf\{x\},\\mathbf\{y\};\\epsilon\)\}\{\\sum\_\{j=1\}^\{N\}k\_\{s,N\}\(\\mathbf\{x\},\\mathbf\{u\}\_\{j\};\\epsilon\)\},\(18\)that approximates a Markov transition kernel of a reversible Markov chain onℳ\\mathcal\{M\}\. Observe that \([18](https://arxiv.org/html/2607.00257#S2.E18)\) defines a Markov matrix because the normalizing denominator term here is a summation, not an average as in \([17](https://arxiv.org/html/2607.00257#S2.E17)\)\.
One can verify that
NqN\(𝐱;ϵ\)PN\(𝐱,𝐲;ϵ\)NqN\(𝐲;ϵ\)=ks,N\(𝐱,𝐲;ϵ\)NqN\(𝐱;ϵ\)NqN\(𝐲;ϵ\)=1NkDM,N\(𝐱,𝐲;ϵ\),\\sqrt\{N\{q\}\_\{N\}\(\\mathbf\{x\};\\epsilon\)\}\\frac\{P\_\{N\}\(\\mathbf\{x\},\\mathbf\{y\};\\epsilon\)\}\{\\sqrt\{N\{q\}\_\{N\}\(\\mathbf\{y\};\\epsilon\)\}\}=\\frac\{\{k\}\_\{s,N\}\(\\mathbf\{x\},\\mathbf\{y\};\\epsilon\)\}\{\\sqrt\{N\{q\}\_\{N\}\(\\mathbf\{x\};\\epsilon\)N\{q\}\_\{N\}\(\\mathbf\{y\};\\epsilon\)\}\}=\\frac\{1\}\{N\}k\_\{\\text\{DM\},N\}\(\\mathbf\{x\},\\mathbf\{y\};\\epsilon\),\(19\)which suggests that the Gram matrix𝐏\(ϵ\)\\mathbf\{P\}\(\\epsilon\)corresponding toPNP\_\{N\}is diagonally conjugate to the normalized Gram matrix𝐊\(ϵ\)/N\\mathbf\{K\}\(\\epsilon\)/Ncorresponding tokDM,N/Nk\_\{\\text\{DM\},N\}/N\. In particular, the eigenvalues of𝐏\(ϵ\)\\mathbf\{P\}\(\\epsilon\)and𝐊\(ϵ\)/N\\mathbf\{K\}\(\\epsilon\)/Nare the same\. We refer to\[[34](https://arxiv.org/html/2607.00257#bib.bib82)\]for a theoretical discussion of the DM kernel in conjunction with KRR\.
## 3Filtering Noise Through a Weak Formulation
The key learning mechanism underlying classical KRR is to enforce pointwise consistency across snapshot pairs of data\. While this strong formulation is often appropriate for clean data, its performance often noticeably worsens in the presence of noise because the regression may fit erroneous fluctuations instead of underlying dynamics\.
The weak formulation is a popular approach to address this lack of noise robustness\. Weak approaches integrate noisy data over a family of test functions, relaxing pointwise consistency in favor of orthogonality constraints\. While such approaches have recently been observed to offer increased robustness to noise\[[8](https://arxiv.org/html/2607.00257#bib.bib22),[44](https://arxiv.org/html/2607.00257#bib.bib79),[51](https://arxiv.org/html/2607.00257#bib.bib80),[52](https://arxiv.org/html/2607.00257#bib.bib78),[53](https://arxiv.org/html/2607.00257#bib.bib21),[54](https://arxiv.org/html/2607.00257#bib.bib23)\], the connection to filtering was recently pointed out in\[[44](https://arxiv.org/html/2607.00257#bib.bib79)\]and\[[51](https://arxiv.org/html/2607.00257#bib.bib80)\]\. Building on this intuition, the purpose of this section is to provide a brief overview of the filtering mechanism behind the weak formulation\.
### 3\.1Problem Statement and Notation
A classical solution to the differential equation𝐱′\(t\)=𝐟\(𝐱\(t\)\)\\mathbf\{x\}^\{\\prime\}\(t\)=\\mathbf\{f\}\(\\mathbf\{x\}\(t\)\)is a solution𝐱\(t\)\\mathbf\{x\}\(t\)which satisfies
𝐑\(𝐱\(t\)\)≡𝐱′\(t\)−𝐟\(𝐱\(t\)\)=𝟎\.\\mathbf\{R\}\(\\mathbf\{x\}\(t\)\)\\equiv\\mathbf\{x\}^\{\\prime\}\(t\)\-\\mathbf\{f\}\(\\mathbf\{x\}\(t\)\)=\\boldsymbol\{0\}\.\(20\)In contrast, a weak formulation involves integrating the residual over smooth test functionsφ\(t\)\\varphi\(t\)
∫ℝdt𝐑\(𝐱\(t\)\)φ\(t\)=𝟎,\\int\_\{\\mathbb\{R\}\}\\text\{d\}t\\;\\mathbf\{R\}\(\\mathbf\{x\}\(t\)\)\\varphi\(t\)=\\boldsymbol\{0\},\(21\)which may be interpreted as anL2L^\{2\}inner product enforcing the orthogonality of the residual with the test functions\. This integral formulation, with appropriate choice of compactly supported test functions, has been observed to increase robustness to noise\[[8](https://arxiv.org/html/2607.00257#bib.bib22),[44](https://arxiv.org/html/2607.00257#bib.bib79),[51](https://arxiv.org/html/2607.00257#bib.bib80),[52](https://arxiv.org/html/2607.00257#bib.bib78),[53](https://arxiv.org/html/2607.00257#bib.bib21),[54](https://arxiv.org/html/2607.00257#bib.bib23)\]\.
In the following, we will study weak formulations applied to an arbitrary noisy signal𝐮\(t\)\\mathbf\{u\}\(t\),
∫ℝdt𝐮\(t\)φj\(t\)=𝟎,\\int\_\{\\mathbb\{R\}\}\\text\{d\}t\\;\\mathbf\{u\}\(t\)\\varphi\_\{j\}\(t\)=\\boldsymbol\{0\},\(22\)for an appropriate family of test functions\{φj\}\\\{\\varphi\_\{j\}\\\}\. While we will ultimately employ the weak formulation in conjunction with KRR, the analysis in this section does not depend on a specific learning method\.
Letφ\(t\)\\varphi\(t\)be a smooth test function with compact support on\[0,L\]\[0,L\], and defineφj\(t\)=φ\(t−jh\)\\varphi\_\{j\}\(t\)=\\varphi\(t\-jh\)to be a family of uniformly translated test functions with compact support on\[Lj,Lj\+1\]\[L\_\{j\},L\_\{j\+1\}\]\. The parameterhhdetermines the distance between the adjacent test functions, and together withLL, describes the extent to which adjacent test functions overlap\. For a dataset𝐮i≡𝐮\(ti\)\\mathbf\{u\}\_\{i\}\\equiv\\mathbf\{u\}\(t\_\{i\}\)of finite length and fixedhh, the indicesjjfor which the support ofφj\(t\)\\varphi\_\{j\}\(t\)lies in the sampling time is restricted\. If the signal consists ofNNpoints sampled uniformly with stepΔt\\Delta t, we adhere to the convention thatj∈\[1,2,…,k∗\]j\\in\[1,2,\\dots,k^\{\*\}\], where
k∗=⌊NΔt−Lh⌋\.k^\{\*\}=\\left\\lfloor\\frac\{N\\Delta t\-L\}\{h\}\\right\\rfloor\.\(23\)
In practice, the integral \([22](https://arxiv.org/html/2607.00257#S3.E22)\) must be computed with quadrature over the available data
cj\(ℓ\)=⟨φj\(t\),u\(ℓ\)\(t\)⟩=∫LjLj\+1dtφj\(t\)u\(ℓ\)\(t\)≈∑i=1Nwiφj\(ti\)ui\(ℓ\),ℓ=1,…,n,j=1,…,k∗,c\_\{j\}^\{\(\\ell\)\}=\\left\\langle\\varphi\_\{j\}\(t\),u^\{\(\\ell\)\}\(t\)\\right\\rangle=\\int\_\{L\_\{j\}\}^\{L\_\{j\+1\}\}\\text\{d\}t\\;\\varphi\_\{j\}\(t\)u^\{\(\\ell\)\}\(t\)\\approx\\sum\_\{i=1\}^\{N\}w\_\{i\}\\varphi\_\{j\}\(t\_\{i\}\)u\_\{i\}^\{\(\\ell\)\},\\quad\\ell=1,\\dots,n,\\quad j=1,\\dots,k^\{\*\},\(24\)wherewiw\_\{i\}are appropriate quadrature weights\. We will find it convenient to express these quadrature approximations as a matrix product\. To that end, collect the translated test functions into the rows of a matrix𝚿∈ℝk∗×N\\boldsymbol\{\\Psi\}\\in\\mathbb\{R\}^\{k^\{\*\}\\times N\}whose\(j,i\)\(j,i\)entry isφj\(ti\)\\varphi\_\{j\}\(t\_\{i\}\)\. Note that𝚿\\boldsymbol\{\\Psi\}is often sparse due to the compact support of the test functions\. Further define𝐖=diag\(w1,…,wN\)∈ℝN×N\\mathbf\{W\}=\\operatorname\{diag\}\(w\_\{1\},\\dots,w\_\{N\}\)\\in\\mathbb\{R\}^\{N\\times N\}to be a diagonal matrix of quadrature weights\. We assume that𝐖\\mathbf\{W\}is invertible throughout the manuscript\. Note that𝚿𝐖\\boldsymbol\{\\Psi\}\\mathbf\{W\}is the matrix whose rows are test functions scaled by the quadrature weights\.
With this notation, we may discretize \([24](https://arxiv.org/html/2607.00257#S3.E24)\) as a linear system
𝐂=𝚿𝐖𝐔,\\mathbf\{C\}=\\boldsymbol\{\\Psi\}\\mathbf\{W\}\\mathbf\{U\},\(25\)where𝐂∈ℝk∗×n\\mathbf\{C\}\\in\\mathbb\{R\}^\{k^\{\*\}\\times n\}is the matrix of inner product coefficientscj\(ℓ\)c\_\{j\}^\{\(\\ell\)\}, and𝐔=\[𝐮1,…,𝐮N\]⊤∈ℝN×n\\mathbf\{U\}=\[\\mathbf\{u\}\_\{1\},\\dots,\\mathbf\{u\}\_\{N\}\]^\{\\top\}\\in\\mathbb\{R\}^\{N\\times n\}collects the data at available sample times\. Notice that in the typical case thatk∗<Nk^\{\*\}<N, the coefficients𝐂\\mathbf\{C\}represent a compressed representation of the original signal𝐔\\mathbf\{U\}\.
### 3\.2The Weak Formulation as a Filter
To see the weak formulation as a filter, we consider the process of reconstructing a signal from a set of coefficients𝐂\\mathbf\{C\}\. Let𝐮^\(t\)=\(u^\(1\)\(t\),…,u^\(n\)\(t\)\)\{\\hat\{\\mathbf\{u\}\}\}\(t\)=\(\\hat\{u\}^\{\(1\)\}\(t\),\\dots,\\hat\{u\}^\{\(n\)\}\(t\)\)denote the reconstructed signal, which we now express as a linear combination of the available test functions
u^\(ℓ\)\(t\)=∑j=1k∗aj\(ℓ\)φj\(t\),ℓ=1,…,n,\\hat\{u\}^\{\(\\ell\)\}\(t\)=\\sum\_\{j=1\}^\{k^\{\*\}\}a\_\{j\}^\{\(\\ell\)\}\\varphi\_\{j\}\(t\),\\quad\\ell=1,\\dots,n,\(26\)whereaj\(ℓ\)a\_\{j\}^\{\(\\ell\)\}are unknown coefficients to be determined\.
In the event that theφj\(t\)\\varphi\_\{j\}\(t\)are orthonormal, taking inner products on both sides of \([26](https://arxiv.org/html/2607.00257#S3.E26)\) is sufficient to isolate the coefficients\. Without orthogonality, we may still recover the coefficients by requiring that the inner product of the reconstruction with test functionsφq\(t\)\\varphi\_\{q\}\(t\)agree with the given coefficients,
𝐂q,ℓ=⟨u^\(ℓ\)\(t\),φq\(t\)⟩=∑j=1k∗aj\(ℓ\)⟨φj\(t\),φq\(t\)⟩\.\\mathbf\{C\}\_\{q,\\ell\}=\\langle\\hat\{u\}^\{\(\\ell\)\}\(t\),\\varphi\_\{q\}\(t\)\\rangle=\\sum\_\{j=1\}^\{k^\{\*\}\}a\_\{j\}^\{\(\\ell\)\}\\langle\\varphi\_\{j\}\(t\),\\varphi\_\{q\}\(t\)\\rangle\.\(27\)
We will find it useful to express \([27](https://arxiv.org/html/2607.00257#S3.E27)\) as a matrix equation\. Let𝐀∈ℝk∗×n\\mathbf\{A\}\\in\\mathbb\{R\}^\{k^\{\*\}\\times n\}be the collection of unknown coefficients whose\(j,ℓ\)\(j,\\ell\)entry isαj\(ℓ\)\\alpha\_\{j\}^\{\(\\ell\)\}, and let𝐆=𝚿𝐖𝚿⊤∈ℝk∗×k∗\\mathbf\{G\}=\\boldsymbol\{\\Psi\}\\mathbf\{W\}\\boldsymbol\{\\Psi\}^\{\\top\}\\in\\mathbb\{R\}^\{k^\{\*\}\\times k^\{\*\}\}be the matrix whose\(q,j\)\(q,j\)entry is a quadrature approximation of the inner product⟨φq\(t\),φj\(t\)⟩\\langle\\varphi\_\{q\}\(t\),\\varphi\_\{j\}\(t\)\\rangle\. Under the assumption that the test functions are linearly independent and overlapping,𝐆\\mathbf\{G\}is invertible\. We can then express \([27](https://arxiv.org/html/2607.00257#S3.E27)\) as the linear system𝐂=𝐆𝐀\\mathbf\{C\}=\\mathbf\{G\}\\mathbf\{A\}\. It follows that
𝐀=𝐆−1𝐂\.\\mathbf\{A\}=\\mathbf\{G\}^\{\-1\}\\mathbf\{C\}\.\(28\)
Let𝐔^=\[𝐮^1,…,𝐮^N\]⊤∈ℝN×n\\hat\{\\mathbf\{U\}\}=\[\\hat\{\\mathbf\{u\}\}\_\{1\},\\dots,\\hat\{\\mathbf\{u\}\}\_\{N\}\]^\{\\top\}\\in\\mathbb\{R\}^\{N\\times n\}denote the matrix collecting the reconstructed signal at the available discrete sample times\. We can now express \([26](https://arxiv.org/html/2607.00257#S3.E26)\) in matrix notation to arrive at an expression for𝐔^\\hat\{\\mathbf\{U\}\},
𝐔^=𝚿⊤𝐀=𝚿⊤𝐆−1𝐂=𝚿⊤𝐆−1𝚿𝐖𝐔=𝚿⊤\(𝚿𝐖𝚿⊤\)−1𝚿𝐖𝐔,\\hat\{\\mathbf\{U\}\}=\\boldsymbol\{\\Psi\}^\{\\top\}\\mathbf\{A\}=\\boldsymbol\{\\Psi\}^\{\\top\}\\mathbf\{G\}^\{\-1\}\\mathbf\{C\}=\\boldsymbol\{\\Psi\}^\{\\top\}\\mathbf\{G\}^\{\-1\}\\boldsymbol\{\\Psi\}\\mathbf\{W\}\\mathbf\{U\}=\\boldsymbol\{\\Psi\}^\{\\top\}\(\\boldsymbol\{\\Psi\}\\mathbf\{W\}\\boldsymbol\{\\Psi\}^\{\\top\}\)^\{\-1\}\\boldsymbol\{\\Psi\}\\mathbf\{W\}\\mathbf\{U\},\(29\)where the second equality follows from \([28](https://arxiv.org/html/2607.00257#S3.E28)\) and the third follows from \([25](https://arxiv.org/html/2607.00257#S3.E25)\)\. Define𝐏≡𝚿⊤\(𝚿𝐖𝚿⊤\)−1𝚿𝐖∈ℝN×N\\mathbf\{P\}\\equiv\\boldsymbol\{\\Psi\}^\{\\top\}\(\\boldsymbol\{\\Psi\}\\mathbf\{W\}\\boldsymbol\{\\Psi\}^\{\\top\}\)^\{\-1\}\\boldsymbol\{\\Psi\}\\mathbf\{W\}\\in\\mathbb\{R\}^\{N\\times N\}\. Note that𝐏\\mathbf\{P\}is a projection matrix because𝐏2=𝐏\\mathbf\{P\}^\{2\}=\\mathbf\{P\}\. Moreover, we see that
⟨𝐏𝐮,𝐯⟩𝐖:=\(𝐏𝐮\)⊤𝐖𝐯=𝐮⊤𝐏⊤𝐖𝐯=𝐮⊤𝐖𝚿⊤𝐆−1𝚿𝐖𝐯=𝐮⊤𝐖\(𝐏𝐯\)=⟨𝐮,𝐏𝐯⟩𝐖,\\langle\\mathbf\{P\}\\mathbf\{u\},\\mathbf\{v\}\\rangle\_\{\\mathbf\{W\}\}:=\(\\mathbf\{P\}\\mathbf\{u\}\)^\{\\top\}\\mathbf\{W\}\\mathbf\{v\}=\\mathbf\{u\}^\{\\top\}\\mathbf\{P\}^\{\\top\}\\mathbf\{W\}\\mathbf\{v\}=\\mathbf\{u\}^\{\\top\}\\mathbf\{W\}\\boldsymbol\{\\Psi\}^\{\\top\}\\mathbf\{G\}^\{\-1\}\\boldsymbol\{\\Psi\}\\mathbf\{W\}\\mathbf\{v\}=\\mathbf\{u\}^\{\\top\}\\mathbf\{W\}\(\\mathbf\{P\}\\mathbf\{v\}\)=\\langle\\mathbf\{u\},\\mathbf\{P\}\\mathbf\{v\}\\rangle\_\{\\mathbf\{W\}\},\(30\)for any𝐮,𝐯∈ℝN\\mathbf\{u\},\\mathbf\{v\}\\in\\mathbb\{R\}^\{N\}, which implies that𝐏\\mathbf\{P\}is self\-adjoint with respect to the inner\-product weighted by the quadrature weight matrix𝐖\\mathbf\{W\}\. In particular, we have
𝐏⊤𝐖=𝐖𝐏\.\\mathbf\{P\}^\{\\top\}\\mathbf\{W\}=\\mathbf\{W\}\\mathbf\{P\}\.\(31\)These observations imply that𝐏\\mathbf\{P\}is an orthogonal projection under the inner product‖𝐏‖W2=tr\(𝐏⊤𝐖𝐏\)\\\|\\mathbf\{P\}\\\|\_\{W\}^\{2\}=\\text\{tr\}\(\\mathbf\{P\}^\{\\top\}\\mathbf\{W\}\\mathbf\{P\}\)\.
The formulation above suggests that the weak\-form reconstruction procedure can be interpreted as an orthogonal projection onto the span of a family of test functions,
𝐔^=𝐏𝐔,\\hat\{\\mathbf\{U\}\}=\\mathbf\{P\}\\mathbf\{U\},\(32\)which allows weak formulations to be interpreted in light of classical results in information theory and signal processing, such as Shannon’s sampling theorem and related results which use the language of orthogonal projection to study filtering\[[63](https://arxiv.org/html/2607.00257#bib.bib17),[70](https://arxiv.org/html/2607.00257#bib.bib20),[68](https://arxiv.org/html/2607.00257#bib.bib15)\]\.
#### 3\.2\.1Choice of Test Function
In this section, we consider a specific choice of test function that was found to be successful in the literature\[[51](https://arxiv.org/html/2607.00257#bib.bib80)\],
φ\(t\)=\{Cp\(t−a\)p\(b−t\)p,a<t<b0,otherwise,Cp=\(2L\)2p,\\varphi\(t\)=\\begin\{cases\}C\_\{p\}\(t\-a\)^\{p\}\(b\-t\)^\{p\},&a<t<b\\\\ 0,&\\text\{otherwise\}\\end\{cases\},\\qquad C\_\{p\}=\\left\(\\frac\{2\}\{L\}\\right\)^\{2p\},\(33\)whereL=b−aL=b\-ais the support length andp∈ℤ\+p\\in\\mathbb\{Z\}^\{\+\}is a parameter which specifies the degree of the polynomial\. Note thatφ\(t\)\\varphi\(t\)is smooth and compactly supported\. We will use \([33](https://arxiv.org/html/2607.00257#S3.E33)\) in all of our numerical examples in §[5](https://arxiv.org/html/2607.00257#S5)\.
Recall that we construct the matrix𝚿𝐖\\boldsymbol\{\\Psi\}\\mathbf\{W\}by generating a family of linearly independent test functions,φj\(t\)=φ\(t−jh\)\\varphi\_\{j\}\(t\)=\\varphi\(t\-jh\), whereh\>0h\>0dictates the overlap between test functions\. Consequently, the polynomial test functions depend on three key parameters: \(i\) the polynomial degreepp, \(ii\) the support lengthLL, and \(iii\) the overlaphh\. Appropriate values for the triple\(p,L,h\)\(p,L,h\)may vary significantly across datasets and noise levels\. In practice, an appropriate choice of these parameter values should balance speed and reconstruction accuracy\.
There are many approaches to empirically measure the accuracy of the weak\-form reconstruction\. If clean or high\-quality reference data is available, one could select parameters such that the reconstruction of the noisy data most closely aligns with the clean reference data under a suitable metric, e\.g\., RMSE\. If no clean reference data is available, a common approach is to define a score metric that can be evaluated from only the observed data\. Such metrics commonly measure whiteness, variance ratio, or smoothness ratio\[[9](https://arxiv.org/html/2607.00257#bib.bib11),[39](https://arxiv.org/html/2607.00257#bib.bib10),[44](https://arxiv.org/html/2607.00257#bib.bib79),[46](https://arxiv.org/html/2607.00257#bib.bib12),[50](https://arxiv.org/html/2607.00257#bib.bib9),[61](https://arxiv.org/html/2607.00257#bib.bib8)\]\. We refer to\[[44](https://arxiv.org/html/2607.00257#bib.bib79)\]for a discussion and numerical implementation of this approach\. Meanwhile, the number of test functionsk∗k^\{\*\}is inversely proportional to the overlap parameterhh\. We will show in §[4](https://arxiv.org/html/2607.00257#S4)that takinghhlarge has the potential to greatly reduce computational complexity requirements\.
In this work, we select polynomial parameters that are observed to filter given noisy data across a range of error metrics\. As a qualitative example, we consider here the Lorenz\-63 system, discussed in more detail in §[5](https://arxiv.org/html/2607.00257#S5), with 10% noise corruption\. Figure[1](https://arxiv.org/html/2607.00257#S3.F1)visualizes a typical filtering result for thexx\-component with\(p,L,h\)=\(5,0\.5,0\.1\)\(p,L,h\)=\(5,\\,0\.5,\\,0\.1\)\. We remark that filtering with appropriately chosen polynomials can provide competitive or superior results to standard approaches, such as wavelet\-based filtering\. We refer to Appendix[A](https://arxiv.org/html/2607.00257#A1)for a quantitative comparison of polynomial and wavelet filtering across a range of error metrics, noise levels, and model systems\. The success of the filtering procedure justifies this choice of test function and explains the noise robustness of the weak formulation\.

Figure 1:An illustration that appropriately chosen test functions can filter noisy data\. A noisy signal \(blue dots\) is reconstructed \(dashed red\) via \([32](https://arxiv.org/html/2607.00257#S3.E32)\) using the test function \([33](https://arxiv.org/html/2607.00257#S3.E33)\) and compared to the ground truth \(black\)\.
### 3\.3Bias\-Variance Decomposition
To analyze the error in the general reconstruction \([32](https://arxiv.org/html/2607.00257#S3.E32)\), let𝓔=𝐔^−𝐔clean\\boldsymbol\{\\mathcal\{E\}\}=\\hat\{\\mathbf\{U\}\}\-\\mathbf\{U\}\_\{\\text\{clean\}\}denote the difference between the underlying clean data and the reconstruction of the noisy signal\. It follows from the decomposition
𝐔^=𝐏𝐔=𝐏\(𝐔clean\+σ𝝃\),\\hat\{\\mathbf\{U\}\}=\\mathbf\{P\}\\mathbf\{U\}=\\mathbf\{P\}\(\\mathbf\{U\}\_\{\\text\{clean\}\}\+\\sigma\\boldsymbol\{\\xi\}\),\(34\)where𝝃=\[𝝃1,…,𝝃N\]⊤∈ℝN×n\\boldsymbol\{\\xi\}=\[\\boldsymbol\{\\xi\}\_\{1\},\\dots,\\boldsymbol\{\\xi\}\_\{N\}\]^\{\\top\}\\in\\mathbb\{R\}^\{N\\times n\}is the collection of noise vectors, that we may write
𝔼\[‖𝓔‖W2\]=∑ℓ=1nBℓ2\+Vℓ2,Bℓ2=‖\(𝐏−𝐈\)𝐔clean\(ℓ\)‖W2,Vℓ2=σ2ηℓ2tr\(𝐖𝐏\),\\mathbb\{E\}\\left\[\\\|\\boldsymbol\{\\mathcal\{E\}\}\\\|\_\{W\}^\{2\}\\right\]=\\sum\_\{\\ell=1\}^\{n\}B\_\{\\ell\}^\{2\}\+V\_\{\\ell\}^\{2\},\\quad\\quad B\_\{\\ell\}^\{2\}=\\\|\(\\mathbf\{P\}\-\\mathbf\{I\}\)\\mathbf\{U\}\_\{\\text\{clean\}\}^\{\(\\ell\)\}\\\|\_\{W\}^\{2\},\\quad V\_\{\\ell\}^\{2\}=\\sigma^\{2\}\\eta\_\{\\ell\}^\{2\}\\text\{tr\}\(\\mathbf\{W\}\\mathbf\{P\}\),\(35\)which gives a bias\-variance decomposition of the reconstruction error\. The termsBℓ2B\_\{\\ell\}^\{2\}correspond to the “bias” of the reconstruction, which describe the fidelity with which a clean signal can be reconstructed\. It measures how well the clean signal can be approximated in the span of the test functions\. The termsVℓ2V\_\{\\ell\}^\{2\}correspond to the “variance” of the signal, which describes additional error arising from noise corruption\.
The variance term admits significant simplification in the case of uniform quadrature,𝐖=c𝐈\\mathbf\{W\}=c\\mathbf\{I\}forc∈ℝc\\in\\mathbb\{R\},
tr\(𝐖𝐏\)=c⋅tr\(𝐏\)=c⋅rank\(𝐏\)=ck∗,\\text\{tr\}\(\\mathbf\{W\}\\mathbf\{P\}\)=c\\cdot\\text\{tr\}\(\\mathbf\{P\}\)=c\\cdot\\text\{rank\}\(\\mathbf\{P\}\)=ck^\{\*\},\(36\)which holds because𝐏\\mathbf\{P\}is a projection matrix\. Equation \([36](https://arxiv.org/html/2607.00257#S3.E36)\) implies that filtering error due to noise is governed by the number of test functions,k∗k^\{\*\}, and is independent of the form of the test function\. Notice that the bias term is similarly scaled byccin this case, so that a specific choice of this constant does not artificially inflate either the bias or variance error relative to the other\.
In contrast, the bias term generically depends on the choice of test function and is generally difficult to analyze\. Performing computations in Fourier space has the potential to simplify the form of the bias\. One can show that in the ideal case of periodic data and uniform quadrature,𝐖=𝐈\\mathbf\{W\}=\\mathbf\{I\}, the bias and variance have the form
Bℓ2=‖\(𝐇−𝐈\)ℱ\[𝐔clean\(ℓ\)\]‖F2,Vℓ2=σ2ηℓ2k∗\.B\_\{\\ell\}^\{2\}=\\left\\\|\(\\mathbf\{H\}\-\\mathbf\{I\}\)\\mathcal\{F\}\\left\[\\mathbf\{U\}\_\{\\text\{clean\}\}^\{\(\\ell\)\}\\right\]\\right\\\|\_\{F\}^\{2\},\\quad V\_\{\\ell\}^\{2\}=\\sigma^\{2\}\\eta\_\{\\ell\}^\{2\}k^\{\*\}\.\(37\)wherek∗=NΔt/hk^\{\*\}=N\\Delta t/h\(due to the periodicity of the data\) andℱ\[⋅\]\\mathcal\{F\}\[\\cdot\]denotes the Fourier transform\. The matrix𝐇∈ℝN×N\\mathbf\{H\}\\in\\mathbb\{R\}^\{N\\times N\}is a Fourier representation of𝐏\\mathbf\{P\},
𝐇mn=\{φ^mφ^n¯∑ℓ=0h−1\|φ^r\+ℓk∗\|2,m=n\(modk∗\),0,otherwise,\\mathbf\{H\}\_\{mn\}=\\begin\{cases\}\\frac\{\\hat\{\\varphi\}\_\{m\}\\overline\{\\hat\{\\varphi\}\_\{n\}\}\}\{\\sum\_\{\\ell=0\}^\{h\-1\}\|\\hat\{\\varphi\}\_\{r\+\\ell k^\{\*\}\}\|^\{2\}\},&m=n\\;\(\\text\{mod \}k^\{\*\}\),\\\\ 0,&\\text\{otherwise\},\\end\{cases\}\(38\)wherer=mmodk∗r=m\\text\{ mod \}k^\{\*\}andφ^m\\hat\{\\varphi\}\_\{m\}is themmth component of the Fourier transform of the untranslated test function\. Note that𝐇\\mathbf\{H\}is sparse with nonzero banded diagonal entries\. The entries on the off\-diagonal corresponding to aliasing artifacts\. We remark that \([38](https://arxiv.org/html/2607.00257#S3.E38)\) holds for a general test function\. However, further simplification for a specific test function is often difficult or impossible\. We provide a brief derivation of \([38](https://arxiv.org/html/2607.00257#S3.E38)\) in Appendix[B](https://arxiv.org/html/2607.00257#A2)\. Finally, we remark that approximation theory may be used to bound the bias under certain conditions in terms of the sampling step of the data\. We refer to\[[3](https://arxiv.org/html/2607.00257#bib.bib14),[7](https://arxiv.org/html/2607.00257#bib.bib13),[70](https://arxiv.org/html/2607.00257#bib.bib20),[68](https://arxiv.org/html/2607.00257#bib.bib15)\]for further theoretical discussion of the bias reconstruction error\.
In the remainder of this section, we numerically study the bias and variance in \([37](https://arxiv.org/html/2607.00257#S3.E37)\) with the specific choice of test function \([33](https://arxiv.org/html/2607.00257#S3.E33)\)\. We assume the ideal case of periodic data and uniform quadrature,𝐖=𝐈\\mathbf\{W\}=\\mathbf\{I\}\. Similarly to\[[51](https://arxiv.org/html/2607.00257#bib.bib80)\], we observe that in certain parameter regimes the bias term may act as an ideal low\-pass filter and is well approximated by the tail of the Fourier modes of the underlying clean signal,
Bℓ2≈∑m\>k∗\|ℱ\[𝐔clean\(ℓ\)\]\|2\.B\_\{\\ell\}^\{2\}\\approx\\sum\_\{m\>k^\{\*\}\}\\left\|\\mathcal\{F\}\[\\mathbf\{U\}\_\{\\text\{clean\}\}^\{\(\\ell\)\}\]\\right\|^\{2\}\.\(39\)To perform numerical experiments, we consider a noisy dataset,𝐔=\[𝐮1,…,𝐮N\]\\mathbf\{U\}=\[\\mathbf\{u\}\_\{1\},\\dots,\\mathbf\{u\}\_\{N\}\], and its clean counterpart,𝐔clean\\mathbf\{U\}\_\{\\text\{clean\}\}, that have been generated from the Lorenz\-63 system \([51](https://arxiv.org/html/2607.00257#S5.E51)\) \(see §[5](https://arxiv.org/html/2607.00257#S5)\)\. We takeN=1000N=1000, uniform timestepΔt=0\.01\\Delta t=0\.01, andσ=0\.1\\sigma=0\.1\. To proceed, we numerically observe the total error,
𝔼\[‖𝓔‖F2\]=𝔼\[‖𝐏𝐔−𝐔clean‖F2\],\\mathbb\{E\}\\left\[\\\|\\boldsymbol\{\\mathcal\{E\}\}\\\|\_\{F\}^\{2\}\\right\]=\\mathbb\{E\}\\left\[\\\|\\mathbf\{P\}\\mathbf\{U\}\-\\mathbf\{U\}\_\{\\text\{clean\}\}\\\|\_\{F\}^\{2\}\\right\],\(40\)and the bias error
𝔼\[‖𝓔bias‖F2\]=𝔼\[‖𝐏𝐔clean−𝐔clean‖F2\]\.\\mathbb\{E\}\\left\[\\\|\\boldsymbol\{\\mathcal\{E\}\}\_\{\\text\{bias\}\}\\\|\_\{F\}^\{2\}\\right\]=\\mathbb\{E\}\\left\[\\\|\\mathbf\{P\}\\mathbf\{U\}\_\{\\text\{clean\}\}\-\\mathbf\{U\}\_\{\\text\{clean\}\}\\\|\_\{F\}^\{2\}\\right\]\.\(41\)The observed variance error is then computed according to
𝔼\[‖𝓔var‖F2\]=𝔼\[‖𝓔‖F2−‖𝓔bias‖F2\]\.\\mathbb\{E\}\\left\[\\\|\\boldsymbol\{\\mathcal\{E\}\}\_\{\\text\{var\}\}\\\|\_\{F\}^\{2\}\\right\]=\\mathbb\{E\}\\left\[\\\|\\boldsymbol\{\\mathcal\{E\}\}\\\|\_\{F\}^\{2\}\-\\\|\\boldsymbol\{\\mathcal\{E\}\}\_\{\\text\{bias\}\}\\\|\_\{F\}^\{2\}\\right\]\.\(42\)Above, the expectation is empirically approximated over 100 identical trials corrupted with independent noise\.
Results are reported in Figure[2](https://arxiv.org/html/2607.00257#S3.F2)for different test\-function parameters\. We demonstrate that the form ofVℓ2V\_\{\\ell\}^\{2\}in \([37](https://arxiv.org/html/2607.00257#S3.E37)\) captures the scaling of the observed variance, and observe that the bias may be well\-approximated by \([39](https://arxiv.org/html/2607.00257#S3.E39)\) for largeLLand smallhh\.

\(a\)p=12p=12andL=150L=150

\(b\)p=4p=4andL=120L=120

\(c\)p=10p=10andL=70L=70
Figure 2:Top:Plots of the observed variance \([42](https://arxiv.org/html/2607.00257#S3.E42)\) as a function ofhhwith the theoretical expression forVℓ2=k∗V\_\{\\ell\}^\{2\}=k^\{\*\}from \([35](https://arxiv.org/html/2607.00257#S3.E35)\)\. In all cases, there is excellent agreement\.Bottom:Plots of the observed bias \([41](https://arxiv.org/html/2607.00257#S3.E41)\) as a function ofhhwith the approximation \([39](https://arxiv.org/html/2607.00257#S3.E39)\)\. There is good agreement for largeLLand smallhh\. Ashhincreases, the approximation worsens\.
## 4Proposed Approach: Weak\-form Kernel Ridge Regression
In this section, we propose Weak\-form Kernel Ridge Regression \(WKRR\) as a noise robust extension of the strong KRR approach\. We will refer to this framework as the “weak” approach throughout the manuscript\.
### 4\.1Weak\-form Kernel Ridge Regression
To understand WKRR, we follow the formulation of the strong case as in §[2\.1](https://arxiv.org/html/2607.00257#S2.SS1)up to the formulation of the least squares problem \([7](https://arxiv.org/html/2607.00257#S2.E7)\), which enforces pointwise consistency across the given data\. In the context of a weak formulation, we will interpret the least squares problem as a residual,𝐄=𝐘−𝐊\(ϵ\)𝜶\\mathbf\{E\}=\\mathbf\{Y\}\-\\mathbf\{K\}\(\\epsilon\)\\boldsymbol\{\\alpha\}with𝐘\\mathbf\{Y\}and𝐊\(ϵ\)\\mathbf\{K\}\(\\epsilon\)now depending on noisy data, and select𝜶\\boldsymbol\{\\alpha\}to enforce orthogonality of the residual𝐄\\mathbf\{E\}against a family of test functions\.
As in §[3](https://arxiv.org/html/2607.00257#S3), letφj\(t\)\\varphi\_\{j\}\(t\)denote a family of smooth test functions, and let𝚿𝐖∈ℝk∗×\(N−1\)\\boldsymbol\{\\Psi\}\\mathbf\{W\}\\in\\mathbb\{R\}^\{k^\{\*\}\\times\(N\-1\)\}be the matrix whose rows are the test functions scaled by quadrature weights\. Notice here that𝚿𝐖\\boldsymbol\{\\Psi\}\\mathbf\{W\}hasN−1N\-1rows to be consistent with the snapshot dataSS\. The core idea behind WKRR is to enforce the constraint𝚿𝐖𝐄=𝟎\\boldsymbol\{\\Psi\}\\mathbf\{W\}\\mathbf\{E\}=\\boldsymbol\{0\}\. In a literal sense, this procedure approximates the integral of the residual over test functions by a quadrature approximation\. In the sense described in §[3](https://arxiv.org/html/2607.00257#S3), this procedure amounts to projecting the noisy data onto the span of the test functions, which we interpret as filtering the noisy data\. Accordingly, we consider the modified least squares problem
𝐘𝐖=𝐊𝐖\(ϵ\)𝜶,\\mathbf\{Y\}\_\{\\mathbf\{W\}\}=\\mathbf\{K\}\_\{\\mathbf\{W\}\}\(\\epsilon\)\\boldsymbol\{\\alpha\},\(43\)where𝐘𝐖=𝚿𝐖𝐘∈ℝk∗×n\\mathbf\{Y\}\_\{\\mathbf\{W\}\}=\\boldsymbol\{\\Psi\}\\mathbf\{W\}\\mathbf\{Y\}\\in\\mathbb\{R\}^\{k^\{\*\}\\times n\}and𝐊𝐖\(ϵ\)=𝚿𝐖𝐊\(ϵ\)∈ℝk∗×\(N−1\)\\mathbf\{K\}\_\{\\mathbf\{W\}\}\(\\epsilon\)=\\boldsymbol\{\\Psi\}\\mathbf\{W\}\\mathbf\{K\}\(\\epsilon\)\\in\\mathbb\{R\}^\{k^\{\*\}\\times\(N\-1\)\}\.
Naively, classical KRR regularization would involve the termλ𝜶⊤𝐊𝐖\(ϵ\)𝜶\\lambda\\boldsymbol\{\\alpha\}^\{\\top\}\\mathbf\{K\}\_\{\\mathbf\{W\}\}\(\\epsilon\)\\boldsymbol\{\\alpha\}; however, this expression is not defined because the weak\-form kernel matrix is no longer square\. Instead, we regularize in the weak coefficient space in terms of the matrix𝐊~𝐖\(ϵ\)=𝚿𝐖𝐊\(ϵ\)\(𝚿𝐖\)⊤∈ℝk∗×k∗\\widetilde\{\\mathbf\{K\}\}\_\{\\mathbf\{W\}\}\(\\epsilon\)=\\boldsymbol\{\\Psi\}\\mathbf\{W\}\\mathbf\{K\}\(\\epsilon\)\(\\boldsymbol\{\\Psi\}\\mathbf\{W\}\)^\{\\top\}\\in\\mathbb\{R\}^\{k^\{\*\}\\times k^\{\*\}\}, leading to the problem
𝜶^𝐖=argmin𝜶𝐖\(‖𝐊~𝐖\(ϵ\)𝜶𝐖−𝐘𝐖‖F2\+λ𝜶𝐖⊤𝐊~𝐖\(ϵ\)𝜶𝐖\)\.\\hat\{\\boldsymbol\{\\alpha\}\}\_\{\\mathbf\{W\}\}=\\arg\\min\_\{\\boldsymbol\{\\alpha\}\_\{\\mathbf\{W\}\}\}\\left\(\\\|\\widetilde\{\\mathbf\{K\}\}\_\{\\mathbf\{W\}\}\(\\epsilon\)\\boldsymbol\{\\alpha\}\_\{\\mathbf\{W\}\}\-\\mathbf\{Y\}\_\{\\mathbf\{W\}\}\\\|\_\{F\}^\{2\}\+\\lambda\\boldsymbol\{\\alpha\}\_\{\\mathbf\{W\}\}^\{\\top\}\\widetilde\{\\mathbf\{K\}\}\_\{\\mathbf\{W\}\}\(\\epsilon\)\\boldsymbol\{\\alpha\}\_\{\\mathbf\{W\}\}\\right\)\.\(44\)Note that𝜶𝐖∈ℝk∗×n\\boldsymbol\{\\alpha\}\_\{\\mathbf\{W\}\}\\in\\mathbb\{R\}^\{k^\{\*\}\\times n\}is a weak\-form representation of the coefficients𝜶∈ℝ\(N−1\)×n\\boldsymbol\{\\alpha\}\\in\\mathbb\{R\}^\{\(N\-1\)\\times n\}in the original coordinates\. To obtain a solution in the original coordinates, we study \([44](https://arxiv.org/html/2607.00257#S4.E44)\) with the substitution𝜶=\(𝚿𝐖\)⊤𝜶𝐖\\boldsymbol\{\\alpha\}=\(\\boldsymbol\{\\Psi\}\\mathbf\{W\}\)^\{\\top\}\\boldsymbol\{\\alpha\}\_\{\\mathbf\{W\}\}, which leads to the equivalent problem
𝜶^=argmin𝜶\(‖𝐊𝐖\(ϵ\)𝜶−𝐘𝐖‖F2\+λ𝜶⊤𝐊\(ϵ\)𝜶\),\\hat\{\\boldsymbol\{\\alpha\}\}=\\arg\\min\_\{\\boldsymbol\{\\alpha\}\}\\left\(\\\|\\mathbf\{K\}\_\{\\mathbf\{W\}\}\(\\epsilon\)\\boldsymbol\{\\alpha\}\-\\mathbf\{Y\}\_\{\\mathbf\{W\}\}\\\|\_\{F\}^\{2\}\+\\lambda\\boldsymbol\{\\alpha\}^\{\\top\}\\mathbf\{K\}\(\\epsilon\)\\boldsymbol\{\\alpha\}\\right\),\(45\)with unique solution given by
𝜶^=\(𝚿𝐖\)⊤\(𝐊~𝐖\(ϵ\)\+λ𝐈\)−1𝐘𝐖\.\\hat\{\\boldsymbol\{\\alpha\}\}=\(\\boldsymbol\{\\Psi\}\\mathbf\{W\}\)^\{\\top\}\(\\widetilde\{\\mathbf\{K\}\}\_\{\\mathbf\{W\}\}\(\\epsilon\)\+\\lambda\\mathbf\{I\}\)^\{\-1\}\\mathbf\{Y\}\_\{\\mathbf\{W\}\}\.\(46\)Once𝜶^\\hat\{\\boldsymbol\{\\alpha\}\}is available, forecasting is performed via \([10](https://arxiv.org/html/2607.00257#S2.E10)\) or \([14](https://arxiv.org/html/2607.00257#S2.E14)\)\. We emphasize that forecasting is performed in the original coordinates, not in the weak\-coefficient space\.
Note that the weak least squares formulation arises from the strong formulation left\-multiplied by the matrix𝚿𝐖\\boldsymbol\{\\Psi\}\\mathbf\{W\}\. Although this procedure appears very simple, it grants a surprising number of benefits\. In addition to filtering, we will find that performing KRR in the weak\-coefficient space reduces computational complexity and may increase model robustness compared to a strong implementation\.
We remark that classically, a weak formulation involves integration\-by\-parts when integrating both sides of a differential equation over a family of test functions\. Here, when learning discrete\-time solution operators, we do not integrate by parts\. Nevertheless, we continue to use the term “weak” due to the fact that these approaches enforce the differential equations in the integral or weak sense\. We note that the approach described here can easily be extended to learn the vector field of a dynamical system, and emphasize that the interpretation of the weak form as a filter remains unaltered in a continuous\-time context\.
The success of the WKRR approach hinges on the choice of test function used to construct the matrix𝚿𝐖\\boldsymbol\{\\Psi\}\\mathbf\{W\}, and the hyperparametersϵ\\epsilonandλ\\lambda\. In this work, we consider only the test function \([33](https://arxiv.org/html/2607.00257#S3.E33)\) described in §[3\.2\.1](https://arxiv.org/html/2607.00257#S3.SS2.SSS1)\. Developing a systematic or optimal framework for choosing test function parameters\(p,L,h\)\(p,L,h\)is beyond the scope of the present work\. Instead, our objective is to demonstrate the robustness of the WKRR approach across a representative range of parameter values\. In the numerical experiments \(see §[5](https://arxiv.org/html/2607.00257#S5)\), we select fixed representative values of\(p,L,h\)\(p,L,h\)and demonstrate their success in the WKRR framework\. Additional diagnostics are provided in Appendix[A](https://arxiv.org/html/2607.00257#A1)\. The following section describes a validation procedure to choose the parametersϵ\\epsilonandλ\\lambda, which extends the methodology implemented in\[[65](https://arxiv.org/html/2607.00257#bib.bib77)\]\.
### 4\.2Validation
The success of WKRR depends on appropriate selection of the bandwidth and regularization parameters\(ϵ,λ\)\(\\epsilon,\\lambda\)\. In this section, we propose a validation procedure that we observe to be successful across a range of chaotic, high\-dimensional, and experimental datasets; see §[5](https://arxiv.org/html/2607.00257#S5)\.
Typically, one reserves a portion of given observation data for use in validation\. This validation data is not seen during training, and is used as a form of ground truth to select appropriate model hyperparameters\. In this section we let𝐮train,i\\mathbf\{u\}\_\{\\text\{train\},i\}fori=1,…,Ni=1,\\dots,Nand𝐮val,i\\mathbf\{u\}\_\{\\text\{val\},i\}fori=1,…,NVi=1,\\dots,N\_\{V\}denote noisy training and validation data, respectively\. While training and validation can be performed on the raw data, in practice we observe that filtering prior to training and validation yields marginally better performance\. We refer to §[4\.3](https://arxiv.org/html/2607.00257#S4.SS3)for a more detailed discussion of our pre\-processing procedure\. We denote𝐮^train,i\\hat\{\\mathbf\{u\}\}\_\{\\text\{train\},i\}and𝐮^val,i\\hat\{\\mathbf\{u\}\}\_\{\\text\{val\},i\}as filtered training and validation data, respectively\. In the present discussion, we assume training and validation are performed on the filtered data\. In our numerical experiments in §[5](https://arxiv.org/html/2607.00257#S5), we consider training on both noisy and filtered data\.
We assume that the matrix𝚿𝐖\\boldsymbol\{\\Psi\}\\mathbf\{W\}is available\. For a fixed\(ϵ,λ\)\(\\epsilon,\\lambda\)pair, let𝐊\(ϵ\)\\mathbf\{K\}\(\\epsilon\)be the kernel Gram matrix formed over the filtered training data𝐮^train,i\\hat\{\\mathbf\{u\}\}\_\{\\text\{train\},i\}, and let𝜶^\\hat\{\\boldsymbol\{\\alpha\}\}be computed according to \([46](https://arxiv.org/html/2607.00257#S4.E46)\)\. Denote𝐱~i\\tilde\{\\mathbf\{x\}\}\_\{i\}as the corresponding kernel forecast over the filtered validation data𝐮^val,i\\hat\{\\mathbf\{u\}\}\_\{\\text\{val\},i\}via either \([10](https://arxiv.org/html/2607.00257#S2.E10)\) or \([14](https://arxiv.org/html/2607.00257#S2.E14)\)\. We assume that𝐱~1=𝐮^val,1\\tilde\{\\mathbf\{x\}\}\_\{1\}=\\hat\{\\mathbf\{u\}\}\_\{\\text\{val\},1\}, i\.e\., that the forecast is initialized with the first step of known validation data\. We now review two error metrics that we will use for validation in the numerical examples considered in this manuscript\.
1. 1\.Valid Prediction Time \(VPT\):We consider the VPT metric for chaotic dynamical systems VPT=Λ⋅tℐ,ℐ=max\{ℑ∣Ei≤γfor all1≤i≤ℑ\},\\text\{VPT\}=\\Lambda\\cdot t\_\{\\mathcal\{I\}\},\\quad\\mathcal\{I\}=\\max\\\{\\mathfrak\{I\}\\mid E\_\{i\}\\leq\\gamma\\text\{ for all \}1\\leq i\\leq\\mathfrak\{I\}\\\},\(47\)whereΛ\\Lambdais the maximal Lyapunov exponent of the system,γ\\gammais a user\-defined threshold, andEiE\_\{i\}is a normalized error metric Ei=1n∑ℓ=1n\(x~i\(ℓ\)−u^val,i\(ℓ\)ζℓ\)2,E\_\{i\}=\\sqrt\{\\frac\{1\}\{n\}\\sum\_\{\\ell=1\}^\{n\}\\left\(\\frac\{\\tilde\{x\}\_\{i\}^\{\(\\ell\)\}\-\\hat\{u\}\_\{\\text\{val\},i\}^\{\(\\ell\)\}\}\{\\zeta\_\{\\ell\}\}\\right\)^\{2\}\},\(48\)whereζℓ\\zeta\_\{\\ell\}is the standard deviation of theℓ\\ellth component of the filtered validation data\. In words, VPT measures how long the normalized error between the kernel predication and the validation data stays below a given thresholdγ\\gamma\. The VPT metric is ideal for chaotic systems because the forecast is expected to diverge after a short time\. Normalization by the Lyapunov exponentΛ\\Lambdaensures the metric is unitless and enables fair comparison with other chaotic systems with faster or slower dynamics\.
2. 2\.Normalized Mean Square Error \(NMSE\):For all other dynamical systems considered in this work, we use the NMSE metric Ei=∑ℓ=1n\|x~i\(ℓ\)−u^val,i\(ℓ\)\|2∑ℓ=1n\|u^val,i\(ℓ\)\|2\.E\_\{i\}=\\frac\{\\sum\_\{\\ell=1\}^\{n\}\|\\tilde\{x\}\_\{i\}^\{\(\\ell\)\}\-\\hat\{u\}\_\{\\text\{val\},i\}^\{\(\\ell\)\}\|^\{2\}\}\{\\sum\_\{\\ell=1\}^\{n\}\|\\hat\{u\}\_\{\\text\{val\},i\}^\{\(\\ell\)\}\|^\{2\}\}\.\(49\)
Our goal now is to select appropriate hyperparameters\(ϵ,λ\)\(\\epsilon,\\lambda\)for use in the weak KRR problem \([45](https://arxiv.org/html/2607.00257#S4.E45)\)\. While systematic grid searches are possible, they are often computationally expensive\. Instead, to save computation time, we follow a data\-driven heuristic method, outlined in Appendix[C](https://arxiv.org/html/2607.00257#A3), to compute a reference bandwidthϵ∗\\epsilon^\{\*\}and a reference regularization parameterλ∗\\lambda^\{\*\}\. These reference values are used to confine the parameter search to a gridH=\[ϵ∗Δϵ,ϵ∗/Δϵ\]×\[λ∗,λ∗/Δλ\]H=\[\\epsilon^\{\*\}\\Delta\\epsilon,\\epsilon^\{\*\}/\\Delta\\epsilon\]\\times\[\\lambda^\{\*\},\\lambda^\{\*\}/\\Delta\\lambda\], whereΔϵ=10−4\\Delta\\epsilon=10^\{\-4\}andΔλ=10−5\\Delta\\lambda=10^\{\-5\}\.
Apart from the grid, the length of data over which forecasts are performed also plays a crucial role\. In the case of clean data, it is often sufficient to validate over a few long trajectories that match or exceed the expected forecast horizon, as in\[[65](https://arxiv.org/html/2607.00257#bib.bib77)\]\. However, such a procedure may yield unstable results when the available data is noisy; predictions may diverge much more rapidly due to fluctuations, causing the model to fit noise instead of underlying dynamics\. To address this concern, here we propose to validate over many short segments, and select parameters that optimize average error over all segments\. To reduce computational complexity, we first select parameters from a coarse gridHcH\_\{c\}\. We then zoom in on the optimal parameter choice from the coarse grid, and form a second refined gridHfH\_\{f\}\. Validation over all\(ϵ,λ\)∈Hf\(\\epsilon,\\lambda\)\\in H\_\{f\}yields an optimal parameter pair, which is subsequently used to form the final model\.
We now provide implementation details for the proposed validation procedure\.
Coarse validation:
1. 1\.Form the coarse grid:Select reference parameters\(ϵ∗,λ∗\)\(\\epsilon^\{\*\},\\lambda^\{\*\}\), form the coarse gridHc=\[ϵ∗Δϵ,ϵ∗/Δϵ\]×\[λ∗,λ∗/Δλ\]H\_\{c\}=\[\\epsilon^\{\*\}\\Delta\\epsilon,\\epsilon^\{\*\}/\\Delta\\epsilon\]\\times\[\\lambda^\{\*\},\\lambda^\{\*\}/\\Delta\\lambda\], and divideHcH\_\{c\}into anNmesh×NmeshN\_\{\\text\{mesh\}\}\\times N\_\{\\text\{mesh\}\}mesh with uniformly spaced gridpoints in log\-space\.
2. 2\.Divide validation data:Divide the given validation data intoNcN\_\{c\}\(possibly overlapping\) segments of lengthνc\\nu\_\{c\}, with initial conditions chosen uniformly and randomly from the full validation dataset\.
3. 3\.Learn and forecast:For each\(ϵ,λ\)\(\\epsilon,\\lambda\)pair inHcH\_\{c\}, learn the kernel solution operator via equation \([46](https://arxiv.org/html/2607.00257#S4.E46)\), forecast over theNcN\_\{c\}validation segments, and compute the average error over each segment\.
4. 4\.Select optimal coarse parameters:Select the hyperparameter pair\(ϵc,λc\)\(\\epsilon\_\{c\},\\lambda\_\{c\}\)that results in the best error over theNmesh2N\_\{\\text\{mesh\}\}^\{2\}measurements\.
Fine validation:
1. 1\.Form the fine grid:Define a zoom\-in factorFFand a fine gridHf=\[ϵc/F,ϵc⋅F\]×\[λc/F,λc⋅F\]H\_\{f\}=\[\\epsilon\_\{c\}/F,\\epsilon\_\{c\}\\cdot F\]\\times\[\\lambda\_\{c\}/F,\\lambda\_\{c\}\\cdot F\]\. BreakHfH\_\{f\}into anNmesh×NmeshN\_\{\\text\{mesh\}\}\\times N\_\{\\text\{mesh\}\}mesh with uniformly spaced gridpoints in log\-space\.
2. 2\.Divide validation data:To prevent overfitting, divide the given validation data intoNfN\_\{f\}\(possibly overlapping\) segments of lengthνf\\nu\_\{f\}, with initial conditions chosen uniformly and randomly from the full validation dataset\. This partition is independent from the previous partition in step 2\.
3. 3\.Repeat steps 3\-4 above using the fine meshHfH\_\{f\}to extract final hyperparameters\(ϵf,λf\)\(\\epsilon\_\{f\},\\lambda\_\{f\}\)\.
Unless otherwise stated, we takeNmesh=20N\_\{\\text\{mesh\}\}=20,Nc=20N\_\{c\}=20,Nf=30N\_\{f\}=30,F=103/4F=10^\{3/4\}, andν≡νc=νf\\nu\\equiv\\nu\_\{c\}=\\nu\_\{f\}\. The lengthsνc\\nu\_\{c\}andνf\\nu\_\{f\}depend on the available data and should be chosen on a problem specific basis\.
Notice that in the above procedure, the strong formulation requires inverting anN×NN\\times Nmatrix for each parameter pair, which is computationally expensive whenNNis large\. In contrast, WKRR involves inverting ak∗×k∗k^\{\*\}\\times k^\{\*\}matrix, which greatly reduces computational cost in the typical case thatk∗≪Nk^\{\*\}\\ll N\.
### 4\.3Signal Pre\-Processing
Application of a weak formulation applied to KRR can be interpreted as filtering prior to regression\. As we will show numerically in §[5](https://arxiv.org/html/2607.00257#S5), some form of filtering is essential prior to model training in the presence of noisy data, whether the filtering is accomplished via explicit pre\-processing or implicitly via a weak formulation\. In our numerical experiments, we compare three approaches: \(i\) a strong formulation trained on pre\-processed \(filtered\) data, \(ii\) a weak formulation trained on noisy data \(but implicitly filtered via projection\), and \(iii\) a weak formulation trained on pre\-processed data \(and further filtered implicitly via projection\)\. In all cases, we validate on filtered data for the sake of consistency with the training data\.
Because the weak formulation acts to filter data, it is natural to compare the proposed WKRR approach to a strong approach applied to filtered data\. We now explain that WKRR grants several benefits beyond just filtering\.
- •Accuracy:We will show numerically in §[5](https://arxiv.org/html/2607.00257#S5)across several examples that WKRR exhibits similar or superior forecasting performance over strong methods when trained on identical filtered data\.
- •Speed:WKRR leads to a compressed representation of observational data in the typical case of fewer test functions than samples,k∗<Nk^\{\*\}<N\. Consequently, training and validation involve inverting matrices of sizek∗×k∗k^\{\*\}\\times k^\{\*\}, instead ofN×NN\\times Nrequired for the strong formulation\.
- •Validation robustness:In our numerical experiments, several strong models performed poorly during testing due to non\-optimal validation\. In contrast, WKRR models trained on identical data did not produce such outliers\. This observation suggests that the weak\-coefficient representation of the data may robustify the validation process, making specific hyperparameter choices less critical for success\.
These observations lead us to incorporate data pre\-processing in the WKRR pipeline\. We remark that increased performance due to filtering followed by a weak formulation has been reported in the literature for other learning methodologies and is not unique to WKRR\[[53](https://arxiv.org/html/2607.00257#bib.bib21)\]\. However, we emphasize that while pre\-processing often enhances WKRR forecasting performance, WKRR without pre\-processing still performs robustly across a range of examples\.
### 4\.4Summary
We summarize the implementation of the proposed WKRR approach in Algorithm[1](https://arxiv.org/html/2607.00257#alg1)\. We assume that noisy observational data𝐮i\\mathbf\{u\}\_\{i\}of the form \([1](https://arxiv.org/html/2607.00257#S2.E1)\) is given, which may be divided into training data𝐮train\\mathbf\{u\}\_\{\\text\{train\}\}and validation data𝐮val\\mathbf\{u\}\_\{\\text\{val\}\}\.
Algorithm 1WKRR implementation1:Training data𝐮train\\mathbf\{u\}\_\{\\text\{train\}\}, and validation data𝐮val\\mathbf\{u\}\_\{\\text\{val\}\}\.
2:Form the test function matrix
𝚿𝐖\\boldsymbol\{\\Psi\}\\mathbf\{W\}\. See §[3\.2\.1](https://arxiv.org/html/2607.00257#S3.SS2.SSS1)for discussion concerning test function selection\.
3:Pre\-process to obtain filtered data
𝐮^train\\hat\{\\mathbf\{u\}\}\_\{\\text\{train\}\}and
𝐮^val\\hat\{\\mathbf\{u\}\}\_\{\\text\{val\}\}according to \([32](https://arxiv.org/html/2607.00257#S3.E32)\) or any other suitable method\.
4:Use
𝐮^train\\hat\{\\mathbf\{u\}\}\_\{\\text\{train\}\}to form
𝐗\\mathbf\{X\}according to \([4](https://arxiv.org/html/2607.00257#S2.E4)\) and
𝐘\\mathbf\{Y\}according to \([5](https://arxiv.org/html/2607.00257#S2.E5)\) or \([12](https://arxiv.org/html/2607.00257#S2.E12)\)\.
5:Using \([15](https://arxiv.org/html/2607.00257#S2.E15)\) or any appropriate kernel function, such as the DM kernel in \([17](https://arxiv.org/html/2607.00257#S2.E17)\), and
𝐗\\mathbf\{X\}, form the gram matrix
𝐊\(ϵ\)\\mathbf\{K\}\(\\epsilon\)\.
6:Validate over
𝐮^val\\hat\{\\mathbf\{u\}\}\_\{\\text\{val\}\}, using
𝐘𝐖=𝚿𝐖𝐘\\mathbf\{Y\}\_\{\\mathbf\{W\}\}=\\boldsymbol\{\\Psi\}\\mathbf\{W\}\\mathbf\{Y\}and
𝐊~𝐖=𝚿𝐖𝐊\(ϵ\)\(𝚿𝐖\)⊤\\widetilde\{\\mathbf\{K\}\}\_\{\\mathbf\{W\}\}=\\boldsymbol\{\\Psi\}\\mathbf\{W\}\\mathbf\{K\}\(\\epsilon\)\(\\boldsymbol\{\\Psi\}\\mathbf\{W\}\)^\{\\top\}, according to the procedure outlined in §[4\.2](https://arxiv.org/html/2607.00257#S4.SS2)to compute the optimal parameter pair
\(ϵf,λf\)\(\\epsilon\_\{f\},\\lambda\_\{f\}\)\.
7:Compute
𝜶^\\hat\{\\boldsymbol\{\\alpha\}\}via \([46](https://arxiv.org/html/2607.00257#S4.E46)\) using
\(ϵf,λf\)\(\\epsilon\_\{f\},\\lambda\_\{f\}\)\.
8:Perform out\-of\-sample forecasting via \([10](https://arxiv.org/html/2607.00257#S2.E10)\) or \([14](https://arxiv.org/html/2607.00257#S2.E14)\)\.
9:Optimal parameters\(ϵf,λf\)\(\\epsilon\_\{f\},\\lambda\_\{f\}\)and coefficients𝜶^\\hat\{\\boldsymbol\{\\alpha\}\}for forecasting via \([10](https://arxiv.org/html/2607.00257#S2.E10)\) or \([14](https://arxiv.org/html/2607.00257#S2.E14)\)\.
## 5Numerical Study
In this section, we apply the proposed WKRR method to two baseline systems: the 3\-dimensional Lorenz\-63 \(L63\) system, and a 64\-dimensional discretization of the Kuramoto\-Sivashinsky \(KS\) system\. We compare the performance of the standard Gaussian kernel \([15](https://arxiv.org/html/2607.00257#S2.E15)\) with the DM kernel \([17](https://arxiv.org/html/2607.00257#S2.E17)\)\. We also consider 15,000\-dimensional experimental fluid data that corresponds to turbulent cavity flow \(see Challenge 2\.1 in\[[62](https://arxiv.org/html/2607.00257#bib.bib96)\]\)\. When feasible, we compare WKRR with competitive baseline methods, such as strong KRR\[[65](https://arxiv.org/html/2607.00257#bib.bib77)\], RAFDA\[[31](https://arxiv.org/html/2607.00257#bib.bib97)\], and LSTM\[[35](https://arxiv.org/html/2607.00257#bib.bib81)\]\.
For the two baseline systems, we specify the signal\-to\-noise ratio of the given data \([1](https://arxiv.org/html/2607.00257#S2.E1)\) by selecting intensity
σ∈\{0\.01,0\.05,0\.1,0\.2\},\\sigma\\in\\\{0\.01,\\,0\.05,\\,0\.1,\\,0\.2\\\},\(50\)which corresponds to 1%, 5%, 10%, or 20% noise corruption\.
### 5\.1Lorenz\-63
We first consider the Lorenz\-63 \(L63\) dynamics, which are governed by the following equations\[[48](https://arxiv.org/html/2607.00257#bib.bib6)\]
x′=ρ1\(y−x\),y′=x\(ρ2−z\)−y,z′=xy−ρ3z\.\\begin\{split\}x^\{\\prime\}&=\\rho\_\{1\}\(y\-x\),\\\\ y^\{\\prime\}&=x\(\\rho\_\{2\}\-z\)\-y,\\\\ z^\{\\prime\}&=xy\-\\rho\_\{3\}z\.\\end\{split\}\(51\)Here,ρ1=10\\rho\_\{1\}=10,ρ2=28\\rho\_\{2\}=28, andρ3=8/3\\rho\_\{3\}=8/3, and \([51](https://arxiv.org/html/2607.00257#S5.E51)\) admits chaotic dynamics with Lyapunov exponentΛ≈0\.91\\Lambda\\approx 0\.91\[[10](https://arxiv.org/html/2607.00257#bib.bib5),[44](https://arxiv.org/html/2607.00257#bib.bib79),[65](https://arxiv.org/html/2607.00257#bib.bib77)\]\.
For data generation, we randomly generate an initial condition from the standard normal distribution\. We integrate using MATLAB’sode113syntax with relative error tolerance10−1010^\{\-10\}, absolute error tolerance10−1210^\{\-12\}, and timestepΔt=0\.01\\Delta t=0\.01to generate a trajectory consisting ofN=68,000N=68,000steps\. Then, we corrupt the signal with a fixed intensityσ\\sigmafrom \([50](https://arxiv.org/html/2607.00257#S5.E50)\)\. The first10,00010,000steps of the trajectory are pruned to eliminate transients and the resulting segment is divided into 2500 steps of noisy training data, 5500 steps of noisy validation data, and 50,000 of clean testing data\. We save only the clean testing data as ground truth to gauge predictive performance\. We repeat this procedure 100 times independently, which results in 100 segments consisting of training, validation, and testing data\.
We then pre\-process the training and validation data separately by either: \(i\) leaving it unfiltered, \(ii\) filtering with wavelets, or \(iii\) filtering with the polynomial test functions\{φj\(t\):=φ\(t−jh\)\}\\\{\\varphi\_\{j\}\(t\):=\\varphi\(t\-jh\)\\\}as defined in \([33](https://arxiv.org/html/2607.00257#S3.E33)\)\. To filter with wavelets, we use MATLAB’swdenoisesyntax with asym12wavelet\. To filter with polynomials, we reconstruct the noisy signal via \([32](https://arxiv.org/html/2607.00257#S3.E32)\) using test functions generated from \([33](https://arxiv.org/html/2607.00257#S3.E33)\)\. The polynomial parameters\(p,L,h\)\(p,L,h\)are fixed for each noise intensity\. Appendix[A](https://arxiv.org/html/2607.00257#A1)contains diagnostics which empirically show that the selected parameters act as effective filters, justifying their use in this step\. After filtering, we prune the first and last 250 steps to avoid boundary artifacts arising from the polynomial reconstruction\. For each noise intensity and each filtering procedure, this process results in 100 segments consisting of 2000 steps of filtering training data, 5000 steps of filtered validation data, and 50,000 steps of clean testing data\.
We train, validate, and test a distinct WKRR model over each of the 100 trajectories using askip\-connectionscheme \(see §[2](https://arxiv.org/html/2607.00257#S2)\)\. We show results obtained from both the Gaussian kernel \([15](https://arxiv.org/html/2607.00257#S2.E15)\) and the DM kernel \([17](https://arxiv.org/html/2607.00257#S2.E17)\)\. For each model, validation is performed according to the process described in §[4\.2](https://arxiv.org/html/2607.00257#S4.SS2)using the VPT error metric with thresholdγ=0\.3\\gamma=0\.3, and hyperparametersNc=20N\_\{c\}=20,Nf=30N\_\{f\}=30, andν=200\\nu=200\. The reference parameters\(ϵ∗,λ∗\)\(\\epsilon^\{\*\},\\lambda^\{\*\}\)are computed according to the heuristic described in Appendix[C](https://arxiv.org/html/2607.00257#A3)\. Typical validation landscapes are shown in Appendix[D](https://arxiv.org/html/2607.00257#A4)\.
For testing, we randomly select 500 integers in the interval\[1,47,000\]\[1,\\,47,000\]to be used across all models\. We use these indices as starting points to extract 500 segments of length 3000 from each of the 100 testing trajectories\. For each model, we forecast over each of these 500 segments\. This process results in a mean VPT for each of the 100 models which are computed using 50,000 total VPTs\. We show a typical forecasting result obtained with the Gaussian kernel at 5% noise in Figure[3](https://arxiv.org/html/2607.00257#S5.F3)\.


Figure 3:Left:A comparison between ground truth \(black\) and a typical WKRR forecast \(dashed purple\) for the L63 system under 5% noise corruption\. The VPT, marked with a green line, is approximately2\.182\.18\.Right:The butterfly attractor formed from the ground truth \(black\) and the WKRR reconstruction \(dashed purple\) over 5000 timesteps\.We now compare the following frameworks across noise intensities: \(i\) strong KRR trained and validated on unfiltered data, \(ii\) RAFDA trained on unfiltered data, \(iii\) strong KRR trained and validated on wavelet filtered data, \(iv\) strong KRR trained and validated on polynomial filtered data, \(v\) proposed WKRR trained on unfiltered data and validated on polynomial filtered data, \(vi\) proposed WKRR trained and validated on polynomial filtered data, \(vii\) proposed WKRR with a DM kernel trained on unfiltered data and validated on polynomial filtered data, and \(viii\) proposed WKRR with a DM kernel trained and validated on polynomial filtered data\. Note that RAFDA does not receive filtered data due to its predictive mechanism\. Note that knowledge of the noise levelσ\\sigmais required for RAFDA, but is not needed for either weak or strong KRR\. The case of strong KRR applied to unfiltered training and validation data is included as a baseline\. Each framework is tested over 100 models with identical data\.

\(a\) 1% Noise

\(b\) 5% Noise

\(c\) 10% Noise

\(d\) 20% Noise
Figure 4:Empirical VPT densities for the L63 system \([51](https://arxiv.org/html/2607.00257#S5.E51)\) under various noise intensities\. Median VPT is marked as a horizontal black line\. “Strong” and “Weak” denote classical KRR and proposed WKRR frameworks, and parentheses indicate filtering applied to the training data, where \(n/a\) denotes unfiltered data\. Validation data is filtered as the training data for strong formulations, while polynomials are used for the weak formulations\. The bottom table provides statistics across the 100 mean VPTs and selected test function parameters\.Figure[4](https://arxiv.org/html/2607.00257#S5.F4)reports the empirical mean VPT density for each of the above frameworks in the top four panels, and reports the corresponding statistics and test function parameters in the table below\. Across all noise levels, WKRR trained and validated on filtered data \(Gaussian kernel \- dark purple, DM kernel \- dark red\) consistently yields the best performance, while WKRR trained on unfiltered data \(light purple and light red, respectively\) is competitive or better than the strong approaches\. Notice that several VPT density tails for strong KRR applied to filtered data \(e\.g\., Figure[4](https://arxiv.org/html/2607.00257#S5.F4), panel b\) are quite long, indicating several models which produced VPTs well below the mean\. In contrast, WKRR did not produce such outliers in any of our numerical experiments\. This observation indicates that model validation may be more stable using a weak formulation\. RAFDA exhibits strong performance for small noise, but its predictive performance worsens as noise increases\.
We observe that the Gaussian kernel and DM kernel give nearly identical results for the L63 system, which has an ambient dimension of three and an intrinsic dimension of roughly 2\.06 on the buttefly attractor\[[42](https://arxiv.org/html/2607.00257#bib.bib4)\]\. This result suggests that the RBF kernel may be appropriate for low\-dimensional systems, or for systems with ambient and intrinsic dimension roughly equal\.
We remark that VPT measures short\-term predictive capability, but does not capture long\-term behavior\. To study long\-term forecasting fidelity, we train WKRR and RAFDA on identical data and forecast over a segment of length 40,000\. We bin the forecasts and the ground truth in a histogram, which are shown in Figure[5](https://arxiv.org/html/2607.00257#S5.F5)for each noise intensity\. Both WKRR and RAFDA closely track the long\-time marginal densities at small noise\. However, while WKRR recovers most qualitative features of marginal densities at high noise, the performance of RAFDA deteriorates\.

\(a\) 1% Noise

\(b\) 5% Noise

\(c\) 10% Noise

\(d\) 20% Noise
Figure 5:Invariant measure comparison for the learned Lorenz\-63 system \([51](https://arxiv.org/html/2607.00257#S5.E51)\) under varying noise intensities comparing ground truth \(black\) to the proposed WKRR approach \(purple\) and RAFDA \(green\)\.
### 5\.2Kuramoto\-Sivashinsky
We now apply WKRR to the Kuramoto\-Sivashinsky \(KS\) system
ut\+uux\+uxx\+uxxxx=0,x∈\[0,J\],t≥0,u\_\{t\}\+uu\_\{x\}\+u\_\{xx\}\+u\_\{xxxx\}=0,\\quad x\\in\[0,J\],\\quad t\\geq 0,\(52\)with periodic boundary conditionsu\(t,0\)=u\(t,J\)u\(t,0\)=u\(t,J\)\. Following\[[65](https://arxiv.org/html/2607.00257#bib.bib77)\], we takeJ=22J=22and discretize \([52](https://arxiv.org/html/2607.00257#S5.E52)\) in space over a uniform grid consisting of 64 points\. Accordingly, we view the KS system as a 64\-dimensional ordinary differential equation\. In this setup, the KS system admits chaotic dynamics with Lyapunov exponentΛ≈0\.043\\Lambda\\approx 0\.043and has Kaplan\-Yorke dimension approximately5\.25\.2\[[22](https://arxiv.org/html/2607.00257#bib.bib3)\], much smaller than the ambient 64\-dimensional space\.
To generate data we selectu\(0,x\)=sin\(16πx/J\)u\(0,x\)=\\sin\(16\\pi x/J\)as an initial condition, and integrate in time using the Exponential Time Differencing with RK4 \(ETDRK4\) method\[[40](https://arxiv.org/html/2607.00257#bib.bib2),[65](https://arxiv.org/html/2607.00257#bib.bib77)\]with stepsizeΔt=0\.01\\Delta t=0\.01forN=3\.05⋅107N=3\.05\\cdot 10^\{7\}steps\. The data is then downsampled by a factor of 10, and the first 50,000 points are pruned to avoid transient behavior\. We then divide the remainder into 50 consecutive segments of 60,000 points\. Each of these 50 segments consists of 4500 steps of training data, 5500 steps of validation data, and 50,000 steps of testing data\. This data is corrupted with independent noise as in the L63 case\.
The pre\-processing step is performed identically to the L63 case, and results in 50 segments consisting of 4000 steps of filtered training data, 5000 steps of filtered validation data, and 50,000 steps of clean testing data\.
To validate and test, we employ adirect\-connectionscheme \(see §[2](https://arxiv.org/html/2607.00257#S2)\)\. Otherwise, the validation process is similar to the L63 case\. We validate according to the procedure described in §[4\.2](https://arxiv.org/html/2607.00257#S4.SS2)using the VPT error metric \([48](https://arxiv.org/html/2607.00257#S4.E48)\) withγ=0\.5\\gamma=0\.5, and hyperparametersNc=20N\_\{c\}=20,Nf=30N\_\{f\}=30, andν=200\\nu=200\. The reference parameters\(ϵ∗,λ∗\)\(\\epsilon^\{\*\},\\lambda^\{\*\}\)were computed according to the method in Appendix[C](https://arxiv.org/html/2607.00257#A3)\. We scaleλ∗\\lambda^\{\*\}by a factor of 1000 for 5%, 10%, and 20% noise to promote more stable validation\. Typical validation landscapes are shown in Appendix[D](https://arxiv.org/html/2607.00257#A4)\. Testing was performed as in the L63 case, resulting in 50 mean VPTs across 50 WKRR models\. Figure[6](https://arxiv.org/html/2607.00257#S5.F6)illustrates a typical forecast under 5% noise with Gaussian kernel\.



Figure 6:Ground truth data \(right\) and a typical WKRR forecast \(middle\) with 5% noise corruption\. The error \(right\) is defined to beEi=\|𝐮i−𝐮~i\|/max\|𝐮i\|E\_\{i\}=\|\\mathbf\{u\}\_\{i\}\-\\tilde\{\\mathbf\{u\}\}\_\{i\}\|/\\max\|\\mathbf\{u\}\_\{i\}\|, where𝐮i\\mathbf\{u\}\_\{i\}is the ground truth and𝐮~i\\tilde\{\\mathbf\{u\}\}\_\{i\}is the WKRR prediction\. The VPT is approximately0\.650\.65and is marked with a green line\.In our numerical experiments, we consider the same frameworks as for the L63 case, with the exception of RAFDA which was omitted due to prohibitively large computational time\.
Figure[7](https://arxiv.org/html/2607.00257#S5.F7)depicts violin plots of the mean VPT densities across each of the frameworks and noise intensities\. The table below provides the corresponding statistics and test function parameters\. In all cases, WKRR with a Gaussian kernel applied to filtered data \(dark purple\) achieved comparable performance to the strong approaches\. However, WKRR with a DM kernel applied to filtered data \(dark red\), achieved superior performance across noise levels\. This result suggests that the DM kernel may be more appropriate for this problem\. We suspect that the DM kernel exhibits superior performance because the flow map of the KS system is rather smooth, relating to a recent finding\[[34](https://arxiv.org/html/2607.00257#bib.bib82)\]that suggests the DM kernel is superior to the Gaussian kernel in identifying smooth labels, especially those spanned by lower\-order eigenfunctions of the Laplace\-Beltrami operator on a submanifold ofℝn\\mathbb\{R\}^\{n\}\. For completeness, we provide results which show the superior performance of WKRR over KRR using the DM kernel in Appendix[E](https://arxiv.org/html/2607.00257#A5)\.

\(a\) 1% Noise

\(b\) 5% Noise

\(c\) 10% Noise

\(d\) 20% Noise
Figure 7:Empirical VPT densities for the KS system \([52](https://arxiv.org/html/2607.00257#S5.E52)\) under various noise intensities\. Median VPT is marked as a horizontal black line\. “Strong” and “Weak” denote classical KRR and proposed WKRR frameworks, and parentheses indicate filtering applied to the training data, where \(n/a\) denotes unfiltered data\. Validation data is filtered as the training data for strong formulations, while polynomials are used for the weak formulations\. The bottom table provides statistics across the 50 mean VPTs and selected test function parameters\.
### 5\.3Validation Length Study
The validation procedure forms a substantial part of the WKRR framework\. In this section, we perform a sensitivity analysis with respect to the validation segment lengthν\\nu\. Our goal is to demonstrate that the proposed method is robust across several validation strategies and does not require specific parameters to achieve good performance\.
Our previous experiments averagedNc=20N\_\{c\}=20trajectories consisting ofν=200\\nu=200points over a coarse parameter grid, and subsequently averagedNf=30N\_\{f\}=30trajectories consisting ofν=200\\nu=200points over a fine parameter grid\. This modeling choice provided consistent forecasting performance across noise levels for both the L63 and KS systems\.
Here, we compare this baseline validation strategy to four others: \(i\)Nc=80N\_\{c\}=80,Nf=120N\_\{f\}=120, andν=50\\nu=50; \(ii\)Nc=40N\_\{c\}=40,Nf=60N\_\{f\}=60, andν=100\\nu=100; \(iii\)Nc=10N\_\{c\}=10,Nf=15N\_\{f\}=15, andν=400\\nu=400; and \(iv\)Nc=8N\_\{c\}=8,NF=12N\_\{F\}=12, andν=500\\nu=500\. Otherwise, the experimental procedure is identical to the baseline cases\. These frameworks were considered to provide validation segments whose lengths are both larger and smaller than the expected model forecast\. To enact a direct comparison, we defineνLyap=ν⋅Δt⋅Λ\\nu\_\{\\text\{Lyap\}\}=\\nu\\cdot\\Delta t\\cdot\\Lambda, which has VPT units and may be directly compared to mean VPT\. Here, we only show results with the Gaussian kernel since the same conclusion is also valid for the DM kernel\.
Figure[8](https://arxiv.org/html/2607.00257#S5.F8)\(a\) reports the mean and standard deviation of the mean VPTs at 1% noise \(left\) and 20% noise \(right\) over 100 models for the L63 system, while panel \(b\) reports these statistics over 50 models for the KS system\. We remark that the mean VPT remains stable across the validation strategies, highlighting the robustness of the proposed validation procedure\. We emphasize that comparable VPTs were achieved from validation segments which are both larger and smaller than the expected testing VPT\. While validating on segments whose length exceeds the expected forecast horizon is a typical approach to support appropriate generalization, our results suggest that WKRR still achieves comparable generalization even if it uses shorter validation segments\.

\(a\) Lorenz\-63: left: 1% noise, right: 20% noise

\(b\) Kuramoto\-Sivashinsky: left: 1% noise, right: 20% noise
LengthνLyap\\nu\_\{\\text\{Lyap\}\}Mean VPT \(1%\)Mean VPT \(20%\)0\.463\.25±\\pm0\.471\.06±\\pm0\.180\.913\.40±\\pm0\.451\.11±\\pm0\.171\.823\.48±\\pm0\.391\.09±\\pm0\.193\.643\.52±\\pm0\.391\.10±\\pm0\.204\.553\.46±\\pm0\.391\.11±\\pm0\.22
LengthνLyap\\nu\_\{\\text\{Lyap\}\}Mean VPT \(1%\)Mean VPT \(20%\)0\.220\.79±\\pm0\.110\.36±\\pm0\.060\.430\.79±\\pm0\.110\.40±\\pm0\.050\.860\.80±\\pm0\.110\.39±\\pm0\.061\.720\.79±\\pm0\.110\.39±\\pm0\.052\.150\.78±\\pm0\.110\.39±\\pm0\.05
Figure 8:Mean VPT statistics as a function of validation lengthν\\nufor\(a\)the L63 system and\(b\)the KS system at two noise levels\. The tables present the quantitative values\. The first column lists the validation length in Lyapunov time,νLyap=ν⋅Δt⋅Λ\\nu\_\{\\text\{Lyap\}\}=\\nu\\cdot\\Delta t\\cdot\\Lambda\.
### 5\.4Clean Data Performance
In this section, we use WKRR and classical strong KRR to study the L63 system \([51](https://arxiv.org/html/2607.00257#S5.E51)\) and the KS system \([52](https://arxiv.org/html/2607.00257#S5.E52)\) as above, but with clean\(σ=0\)\(\\sigma=0\)training and validation data\. The training and validation data are not pre\-processed, but are otherwise constructed identically to the above cases\.
The validation scheme used here differs from that described in §[4\.2](https://arxiv.org/html/2607.00257#S4.SS2)only in that we now utilizeNc=Nf=3N\_\{c\}=N\_\{f\}=3\(possibly overlapping\) validation segments consisting ofν=1500\\nu=1500points, following the setup in\[[65](https://arxiv.org/html/2607.00257#bib.bib77)\]\. Validating over longer segments improves performance because the expected forecast horizon under clean data is significantly longer than the noisy cases\. For WKRR, we scale the reference regularization parameterλ∗\\lambda^\{\*\}by a factor of10−610^\{\-6\}to promote stable validation\.
Figure[9](https://arxiv.org/html/2607.00257#S5.F9)reports the empirical VPT density over 100 models for the L63 system, and 50 models for the KS system, respectively\. The tables below provide corresponding quantitative statistics\. We remark that WKRR loses only a small amount of accuracy compared to the strong KRR approach, indicating its robustness across both clean and noisy data\. We remark that WKRR with a DM kernel exhibits superior performance for both systems\. Previous work showed that classic strong KRR with a DM kernel often exhibits superior performance compared to a Gaussian kernel over clean data\[[65](https://arxiv.org/html/2607.00257#bib.bib77)\]\. The present results suggest the robustness of the DM kernel with a weak formulation\.
One might argue that selecting appropriate test functions for WKRR make the method difficult to use in practice\. However, the selection is straightforward when clean data is available\. One procedure is as follows\. For a given family of test functions, reconstruct a portion of the given training data via \([32](https://arxiv.org/html/2607.00257#S3.E32)\)\. Then, compare the available clean data to its reconstruction using a suitable error metric, and select the best\-performing test function\.

\(a\) Lorenz\-63 system

\(b\) Kuramoto\-Sivashinsky system
Figure 9:Empirical VPT densities over clean data for\(a\)the L63 system \([51](https://arxiv.org/html/2607.00257#S5.E51)\) and\(b\)the KS system \([52](https://arxiv.org/html/2607.00257#S5.E52)\)\. Median VPT is marked as a horizontal black line\. “Strong” and “Weak” denote classical KRR and proposed WKRR frameworks\. The bottom tables provide selected test function parameters and statistics across 100 mean VPTs for the L63 system and 50 mean VPTs for the KS system, respectively\.
### 5\.5Experimental Data: Fluid Dynamics
We now apply WKRR with both the Gaussian and DM kernel to experimental fluid data, made available by\[[62](https://arxiv.org/html/2607.00257#bib.bib96)\]as a Community Challenge\. The data consists of streamwise velocity𝐮\\mathbf\{u\}and wall\-normal velocity𝐯\\mathbf\{v\}measured on cavity flow\. The dataset includes measurement noise as well as fluctuations due to the inherent turbulence in the flow\. Furthermore, the flow is strongly convective and heavily depends on the time\-varying inlet boundary condition on the left, that is time\-varying and not precisely known due to the experimental setup\. Due to the unknown inlet boundary condition during prediction, it is challenging to achieve long forecasting horizon in this setting\. We refer to Problem 2\.1 in\[[62](https://arxiv.org/html/2607.00257#bib.bib96)\]for details of the experimental procedure and data collection process\.
We set
𝐔=\[𝐮𝐯\]∈ℝN×n,\\mathbf\{U\}=\\begin\{bmatrix\}\\mathbf\{u\}&\\mathbf\{v\}\\end\{bmatrix\}\\in\\mathbb\{R\}^\{N\\times n\},\(53\)wheren=2⋅7854=15708n=2\\cdot 7854=15708is the spatial dimension andNNdenotes the number of available data timeslices\. We are givenNtrain=12,800N\_\{\\text\{train\}\}=12,800samples as training data, and 32 segments consisting ofNseg=70N\_\{\\text\{seg\}\}=70of validation data\. The original challenge as described in\[[62](https://arxiv.org/html/2607.00257#bib.bib96)\]is to forecasting 30 timesteps past the validation data for each of the segments\. However, because this data is not available to us, we instead break apart the 32 segments intoNval=40N\_\{\\text\{val\}\}=40for model validation, andNtest=30N\_\{\\text\{test\}\}=30for model testing\. This procedure allows us to quantitatively analyze the performance of WKRR and compare it to other baseline methods\. To fix notation, let𝐔train=\[𝐮train𝐯train\]∈ℝ12800×15708\\mathbf\{U\}\_\{\\text\{train\}\}=\[\\mathbf\{u\}\_\{\\text\{train\}\}\\;\\mathbf\{v\}\_\{\\text\{train\}\}\]\\in\\mathbb\{R\}^\{12800\\times 15708\}denote the given training data, let𝐔val\(q\)=\[𝐮val𝐯val\]∈ℝ40×15708\\mathbf\{U\}^\{\(q\)\}\_\{\\text\{val\}\}=\[\\mathbf\{u\}\_\{\\text\{val\}\}\\;\\mathbf\{v\}\_\{\\text\{val\}\}\]\\in\\mathbb\{R\}^\{40\\times 15708\}denote the validation data, and let𝐔test\(q\)=\[𝐮test𝐯test\]∈ℝ30×15708\\mathbf\{U\}^\{\(q\)\}\_\{\\text\{test\}\}=\[\\mathbf\{u\}\_\{\\text\{test\}\}\\;\\mathbf\{v\}\_\{\\text\{test\}\}\]\\in\\mathbb\{R\}^\{30\\times 15708\}denote the testing data forq=1,…,32q=1,\\dots,32\.
We do not directly apply WKRR to the given data, because its size makes computations prohibitively expensive\. Instead, we adopt the following pre\-processing procedure\. Setμu,μv∈ℝ1×7854\\mu\_\{u\},\\mu\_\{v\}\\in\\mathbb\{R\}^\{1\\times 7854\}to be the time\-average of𝐮train\\mathbf\{u\}\_\{\\text\{train\}\}and𝐯train\\mathbf\{v\}\_\{\\text\{train\}\}, respectively\. Similarly setσu,σv∈ℝ1×7854\\sigma\_\{u\},\\sigma\_\{v\}\\in\\mathbb\{R\}^\{1\\times 7854\}to be the standard deviation of the training data\. Then for both training and validation data𝐮,𝐯\\mathbf\{u\},\\mathbf\{v\}, we normalize
𝐔^=\[𝐮−μuσu𝐯−μvσv\],\\hat\{\\mathbf\{U\}\}=\\begin\{bmatrix\}\\frac\{\\mathbf\{u\}\-\\mu\_\{u\}\}\{\\sigma\_\{u\}\}&\\frac\{\\mathbf\{v\}\-\\mu\_\{v\}\}\{\\sigma\_\{v\}\}\\end\{bmatrix\},\(54\)where division is interpreted component\-wise\. We then implement a POD procedure by taking a truncated SVD
𝐔^=𝐔r𝐒r𝐕r⊤,\\hat\{\\mathbf\{U\}\}=\\mathbf\{U\}\_\{r\}\\mathbf\{S\}\_\{r\}\\mathbf\{V\}\_\{r\}^\{\\top\},\(55\)whererrmodes are retained\. The normalized data is then projected onto the POD modes
𝐔~=𝐔^𝐕r=\[𝐮~𝐯~\]\.\\tilde\{\\mathbf\{U\}\}=\\hat\{\\mathbf\{U\}\}\\mathbf\{V\}\_\{r\}=\\begin\{bmatrix\}\\tilde\{\\mathbf\{u\}\}&\\tilde\{\\mathbf\{v\}\}\\end\{bmatrix\}\.\(56\)Finally, we normalize per mode by writing
𝐔†=\[𝐮~−𝔲uςu𝐯~−𝔲vςv\],\\mathbf\{U\}^\{\\dagger\}=\\begin\{bmatrix\}\\frac\{\\tilde\{\\mathbf\{u\}\}\-\\mathfrak\{u\}\_\{u\}\}\{\\varsigma\_\{u\}\}&\\frac\{\\tilde\{\\mathbf\{v\}\}\-\\mathfrak\{u\}\_\{v\}\}\{\\varsigma\_\{v\}\}\\end\{bmatrix\},\(57\)where𝔲u,𝔲v∈ℝ1×r\\mathfrak\{u\}\_\{u\},\\mathfrak\{u\}\_\{v\}\\in\\mathbb\{R\}^\{1\\times r\}andςu,ςv∈ℝ1×r\\varsigma\_\{u\},\\varsigma\_\{v\}\\in\\mathbb\{R\}^\{1\\times r\}are the mean and standard deviation of the POD modes that arise from the training data\. We emphasize that the mean and standard deviations are computed only from the training data, although these values are used to normalize both the training and validation data\.
Polynomial parameters are chosen according to the following procedure\. We select a segment of training data consisting of 1000 timesteps\. The data is normalized according to the above procedure\. We then filter and reconstruct the normalized signals over a grid of parameter triples\(p,L,h\)\(p,L,h\)\. The reconstruction is the unnormalized per\-mode, projected onto original coordinates by the action ofVr⊤V\_\{r\}^\{\\top\}, and multiplied by the standard deviation\. The result is a mean\-subtracted signal\. We then compare this signal to the ground truth \(also mean\-subtracted\) using the NSME metric \([49](https://arxiv.org/html/2607.00257#S4.E49)\), and select the parameter pair which minimizes the error\.
We then apply WKRR to𝐔train†\\mathbf\{U\}^\{\\dagger\}\_\{\\text\{train\}\}\. Both the training and validation data is normalized according to the above procedure\. Validation is performed according to the procedure described in §[4\.2](https://arxiv.org/html/2607.00257#S4.SS2), but with the following modifications\. The coarse grid is initialized toHc=\[102,104\]×\[10−16,10−13\]H\_\{c\}=\[10^\{2\},10^\{4\}\]\\times\[10^\{\-16\},10^\{\-13\}\], and NMSE \(computed over the mode\-normalized coefficients\) is averaged overNc=3N\_\{c\}=3andNf=5N\_\{f\}=5segments of lengthν=15\\nu=15for the coarse and fine grids, respectively\. To alleviate computational time, we validate on only one of the 32 validation segments\.111Several validation segments were randomly chosen and forecasting results did not change significantly, indicating that the particular choice of validation segment does not greatly alter the outcome\.The resulting model is tested on all 32 testing segments\. This testing data is unnormalized to mean\-subtracted physical coordinates and compared to the mean\-subtracted ground truth using the NMSE metric\. This metric is consistent with that considered in the Community Challenge paper\[[62](https://arxiv.org/html/2607.00257#bib.bib96)\]\.
We compare the performance of WKRR with LSTM, used as a baseline in the Community Challenge paper\[[62](https://arxiv.org/html/2607.00257#bib.bib96)\]\. The LSTM implementation, provided by\[[62](https://arxiv.org/html/2607.00257#bib.bib96)\], was modified to use the same normalized data as WKRR\. The baseline results in\[[62](https://arxiv.org/html/2607.00257#bib.bib96)\]using LSTM were computed withr=25r=25, noting that increasingrrmay destabilize the training of the LSTM networks\. In the following, we will compare WKRR and LSTM with bothr=25r=25andr=100r=100\.
Figure[10](https://arxiv.org/html/2607.00257#S5.F10)shows a typical forecast for each method withr=25r=25over 30 timesteps\. We observe that WKRR with the Gaussian and DM kernel gives visually indistinguishable results, and so we report only results using the Gaussian kernel for this case\. The top block of four rows correspond to𝐮\\mathbf\{u\}data, while the last block of four rows corresponds to𝐯\\mathbf\{v\}data\. In each of the two blocks, we compare the ground truth \(first row\) with ther=25r=25POD representation \(second row\), the WKRR forecast \(third row\), and the LSTM forecast \(fourth row\) over one of the 32 testing segments\. The error plots in Figure[11](https://arxiv.org/html/2607.00257#S5.F11)report the NMSE for each of the 32 trials \(light gray\), the average error over all 32 trials \(solid colored line\), and the corresponding standard deviation \(light shaded band\)\. The dashed colored lines correspond to the trajectory represented in the two blocks\.
Both methods produce predictions that reasonably reproduce the POD representation of the data\. Our results indicate that both methods are competitive in this regime\.
Figure 10:A visualization of the data and typical forecasts\. The first four rows correspond to𝐮\\mathbf\{u\}data, and the bottom four rows correspond to𝐯\\mathbf\{v\}data\. From top to bottom, the rows in each block correspond to: \(i\) ground truth, \(ii\) POD representation of the data withr=25r=25, \(iii\) WKRR prediction, and \(iv\) LSTM prediction\.


Figure 11:NMSE plots for WKRR \(left\), LSTM \(middle\), and a comparison of the two \(right\)\. Light gray lines correspond to each of the 32 trials, dashed colored lines correspond to the trajectory chosen for visualize in the top panel, solid colored lines correspond to mean NMSE over all 32 trials, and the colored shaded region corresponds to the standard deviation over all 32 trials\.We now repeat the forecasting procedure takingr=100r=100POD modes\. In this case, we report both the Gaussian and DM kernel results\. Figures[12](https://arxiv.org/html/2607.00257#S5.F12)and[13](https://arxiv.org/html/2607.00257#S5.F13)report the results in the same style as above\. The Gaussian kernel results are shown in purple, while the DM kernel results are shown in red\. We observe that WKRR with both kernels exhibits noticeably more consistent qualitative agreement and smaller mean error over short\-term horizons\. This improved performance for higher\-dimensional data highlights a key advantage of WKRR\.
The performance of WKRR with the Gaussian and DM kernel differs in this regime\. Over longer horizons, WKRR with the DM kernel has noticeably smaller quantitative error\. However, this metric may be misleading\. Qualitatively, forecasts with the DM kernel tend to produce overly smooth predictions which may fail to capture the long\-term statistical behavior of the dynamics\. In contrast, while predictions with the Gaussian kernel produce larger quantitative error, the forecasts appear to be qualitatively more consistent with the underlying dynamics over long time periods\. Over short time periods, the two kernels perform similarly, with a slight edge to the DM kernel\.
Figure 12:Typical data and forecast\. The first five rows correspond to𝐮\\mathbf\{u\}data, and the bottom five rows correspond to𝐯\\mathbf\{v\}data\. From top to bottom, the rows in each block correspond to: \(i\) ground truth, \(ii\) POD representation of the data withr=100r=100, \(iii\) WKRR with a Gaussian kernel, \(iv\) WKRR with a DM kernel, and \(v\) LSTM prediction\.


Figure 13:NMSE plots for WKRR \(left\), LSTM \(middle\), and a comparison of the two \(right\)\. Light gray lines correspond to each of the 32 trials, dashed colored lines correspond to the trajectory chosen for visualize in the top panel, solid colored lines correspond to mean NMSE over all 32 trials, and colored regions correspond to standard deviation over all 32 trials\. We report results for WKRR with a Gaussian kernel \(purple\) and with a DM kernel \(red\)\.
## 6Discussion
In this paper, we propose Weak\-form Kernel Ridge Regression \(WKRR\) as a data\-driven, noise\-robust learning framework\. The proposed approach is computationally cheaper than classical strong\-form KRR, and demonstrates competitive performance in the presence of noise over a range of chaotic, high\-dimensional, and experimental data\-sets\. While selecting appropriate model hyperparameters via validation is often a difficult task that may be sensitive to modeling assumptions, we perform sensitivity studies to demonstrate that WKRR can achieve robust performance with multiple strategies across several baseline systems and noise levels\. Furthermore, we show that with appropriately chosen test functions, WKRR applied to clean data greatly reduces computational complexity with only marginal loss in predictive performance\. This observation positions WKRR as a flexible framework across a range of datasets when the underlying noise level is unknown\. Finally, the success of kernel\-based learning frameworks depends strongly on the choice of kernel function\. We demonstrate that WKRR achieves competitive performance with the standard Gaussian kernel over a range of baseline and experimental data, highlighting the method’s simple implementation\. We also consider the Diffusion Maps \(DM\) kernel\[[18](https://arxiv.org/html/2607.00257#bib.bib119),[65](https://arxiv.org/html/2607.00257#bib.bib77)\]as an alternative choice, and show that the DM kernel can lead to increased predictive performance for systems whose invariant set dimension is much lower than the ambient dimension, even with noisy observational data\. The success of WKRR with multiple kernels broadens the applicability of the method and demonstrates that its noise robustness is not tied to a specific choice of kernel function\. Moreover, these findings suggest that the practitioner has a modeling choice: a Gaussian kernel reduces runtime and appears to be effective for low\-dimensional systems, while a DM kernel may lead to increased forecast horizon for systems with complicated geometry\. We emphasize that the proposed WKRR approach has advantages over existing learning methods which have incorporated a weak formulation\[[8](https://arxiv.org/html/2607.00257#bib.bib22),[44](https://arxiv.org/html/2607.00257#bib.bib79),[51](https://arxiv.org/html/2607.00257#bib.bib80),[53](https://arxiv.org/html/2607.00257#bib.bib21),[54](https://arxiv.org/html/2607.00257#bib.bib23)\]because it \(i\) does not require a choice of dictionary functions that span the target vector field, and \(ii\) does not require precise parameter tuning common to many machine learning architectures\.
Despite the success of WKRR, several open questions remain\. We observed that model validation using WKRR often produced fewer outliers with poor forecast horizons than strong KRR over pre\-processed data\. We hypothesize that validating over weak\-form coefficients may lead to improved validation landscapes than using data in the original coordinates\. It would be fruitful to pursue this observation, which may lead to more efficient validation strategies\.
Selecting appropriate test functions is a challenge common to all weak\-form learning approaches\. The bias\-variance decomposition in §[3\.3](https://arxiv.org/html/2607.00257#S3.SS3)suggests that using a smaller family of test functions reduces error due to noise\. In contrast, one generally expects that using a larger family of test functions reduces error due to signal reconstruction, although this analysis is complicated and typically depends on the specific choice of test function\. Developing a systematic approach to balance reconstruction and noise error constitutes a challenging yet worthwhile goal\.
We demonstrated numerically that training over noisy observational data is significantly more effective with WKRR than with classical KRR\. While we observed that pre\-processing training data often leads to marginally improved forecast horizons, the role of pre\-processing in training and validation is not yet well\-understood\. Incorporating multiple bandwidth parameters, as opposed to a single scalar considered in this work, may also influence training and validation\. Establishing theoretical convergence results for WKRR in these contexts would strengthen the proposed framework\.
The present manuscript assumes that the given data captures all states of an underlying dynamical system\. However, in practice, given data may be only partially observed\. Additionally, many physical systems may depend on parameters or external non\-autonomous forcings\. Extending WKRR to handle partially observed data, time\-dependent data, or data which depends on parameters would broaden the applicability of WKRR\.
## Acknowledgment
This work is partially supported under the NSF grants DMS\-2505605 and CMMI\-2340266, and the ICDS Penn State seed grant\.
## Data Availability
## Appendix ATest Function Parameters
In this section, we provide additional diagnostics for polynomial test functions as filters\. We also compare with wavelet pre\-processing as a standard baseline filter\.
In our numerical experiments in §[5](https://arxiv.org/html/2607.00257#S5), we apply WKRR to both the L63 system \([51](https://arxiv.org/html/2607.00257#S5.E51)\) and the KS system \([52](https://arxiv.org/html/2607.00257#S5.E52)\)\. In both cases, the training and validation data were corrupted with various noise intensities\. In Figures[4](https://arxiv.org/html/2607.00257#S5.F4)and[7](https://arxiv.org/html/2607.00257#S5.F7), we demonstrated that both strong and weak forms applied to polynomial filtered data resulted in higher mean VPT than the unfiltered case for a specific set of polynomial parameters\(p,L,h\)\(p,L,h\)\. Here, we numerically demonstrate that the weak formulation with these parameters effectively filters the data, explaining the mechanism underlying the success of WKRR in these scenarios\.
For each of the 100 L63 models or 50 KS models, we explicitly reconstruct the noise\-corrupted training data with polynomial test functions via \([32](https://arxiv.org/html/2607.00257#S3.E32)\)\. We also filter this training data with wavelets using MATLAB’swdenoisesyntax with asym12wavelet\. In both cases, we prune the first and last 250 data points to avoid boundary artifacts\. In this section, we denote the pruned ground truth data as𝐮i=\[ui\(1\),…,ui\(n\)\]⊤\\mathbf\{u\}\_\{i\}=\[u\_\{i\}^\{\(1\)\},\\dots,u\_\{i\}^\{\(n\)\}\]^\{\\top\}and the pruned filtered data as𝐮^i=\[u^i\(1\),…,u^i\(n\)\]⊤\\hat\{\\mathbf\{u\}\}\_\{i\}=\[\\hat\{u\}\_\{i\}^\{\(1\)\},\\dots,\\hat\{u\}\_\{i\}^\{\(n\)\}\]^\{\\top\}fori=1,…,𝔑i=1,\\dots,\\mathfrak\{N\}\. We measure the filter performance using three error metrics:
ℰRMSE=1n∑ℓ=1n1𝔑∑i=1𝔑\|ui\(ℓ\)−u^i\(ℓ\)\|2,ℰθ=1𝔑∑i=1𝔑arccos\(𝐮i⋅𝐮^i‖𝐮i‖2‖𝐮^i‖2\),ℰLSD=1n∑ℓ=1n1M∑m=1M\|10log10𝒫\[𝐮\(ℓ\)\]m−10log10𝒫\[𝐮^\(ℓ\)\]m\|2,\\begin\{split\}\\mathcal\{E\}\_\{RMSE\}&=\\frac\{1\}\{n\}\\sum\_\{\\ell=1\}^\{n\}\\sqrt\{\\frac\{1\}\{\\mathfrak\{N\}\}\\sum\_\{i=1\}^\{\\mathfrak\{N\}\}\\left\|u\_\{i\}^\{\(\\ell\)\}\-\\hat\{u\}\_\{i\}^\{\(\\ell\)\}\\right\|^\{2\}\},\\\\ \\mathcal\{E\}\_\{\\theta\}&=\\frac\{1\}\{\\mathfrak\{N\}\}\\sum\_\{i=1\}^\{\\mathfrak\{N\}\}\\arccos\\left\(\\frac\{\\mathbf\{u\}\_\{i\}\\cdot\\hat\{\\mathbf\{u\}\}\_\{i\}\}\{\\\|\\mathbf\{u\}\_\{i\}\\\|\_\{2\}\\\|\\hat\{\\mathbf\{u\}\}\_\{i\}\\\|\_\{2\}\}\\right\),\\\\ \\mathcal\{E\}\_\{LSD\}&=\\frac\{1\}\{n\}\\sum\_\{\\ell=1\}^\{n\}\\sqrt\{\\frac\{1\}\{M\}\\sum\_\{m=1\}^\{M\}\\left\|10\\log\_\{10\}\\mathcal\{P\}\[\\mathbf\{u\}^\{\(\\ell\)\}\]\_\{m\}\-10\\log\_\{10\}\\mathcal\{P\}\[\\hat\{\\mathbf\{u\}\}^\{\(\\ell\)\}\]\_\{m\}\\right\|^\{2\}\},\\end\{split\}\(58\)where𝒫\[⋅\]m\\mathcal\{P\}\[\\cdot\]\_\{m\}is themmth component of the power spectral density of the signal, assumed to be of lengthMM\.
The metricℰRMSE\\mathcal\{E\}\_\{RMSE\}computes RMSE over time, and averages these errors over spatial dimensions\. It provides a standard measure of pointwise error\. The metricℰθ\\mathcal\{E\}\_\{\\theta\}measures the angle between the truth and reconstruction at a fixed sample time, and averages the result over all available sample times\. Loosely speaking, it provides a measure of correctness of direction\. The metricℰLSD\\mathcal\{E\}\_\{LSD\}computes the log\-spectra distance between the components of the true and reconstructed signals, and averages these errors over spatial dimension\. It provides a standard measure of spectral difference between two signals\.
We report the mean and standard deviation of these error metrics averaged over 100 trials for the L63 system in Figure[14](https://arxiv.org/html/2607.00257#A1.F14)and averaged over 50 trials for the KS system in Figure[15](https://arxiv.org/html/2607.00257#A1.F15)\. In all cases, the data𝐮^\\hat\{\\mathbf\{u\}\}is filtered with wavelets \(blue\), polynomial test functions \(orange\), or left unfiltered \(gray\)\. Across noise levels, both wavelets and polynomials give comparable results in terms of RMSE and angle differences\. The polynomial filter produces filtered data with smaller LSD errors\. The resulting error metrics are significantly improved relative to the unfiltered baseline, indicating that both wavelets and polynomials effectively filter the data\. In particular, the success of the polynomials in filtering the data justifies the\(p,L,h\)\(p,L,h\)parameter selection used in our numerical experiments\.









\(a\) 1% noise

\(b\) 5% noise

\(c\) 10% noise

\(d\) 20% noise
Figure 14:A comparison of the filtering properties of wavelets \(blue\) and polynomial test functions \(orange\) with an unfiltered baseline \(gray\) for the L63 system over three error metrics:Top row:RMSEℰRMSE\\mathcal\{E\}\_\{RMSE\},Middle row:Angleℰθ\\mathcal\{E\}\_\{\\theta\}, andBottom row:Log\-spectral distanceℰLSD\\mathcal\{E\}\_\{LSD\}\([58](https://arxiv.org/html/2607.00257#A1.E58)\)\.








\(a\) 1% noise

\(b\) 5% noise

\(c\) 10% noise

\(d\) 20% noise
Figure 15:A comparison of the filtering properties of wavelets \(blue\) and polynomial test functions \(orange\) with an unfiltered baseline \(gray\) for the KS system over three error metrics:Top row:RMSEℰRMSE\\mathcal\{E\}\_\{RMSE\},Middle row:Angleℰθ\\mathcal\{E\}\_\{\\theta\}, andBottom row:Log\-spectral distanceℰLSD\\mathcal\{E\}\_\{LSD\}\([58](https://arxiv.org/html/2607.00257#A1.E58)\)\.
## Appendix BH Matrix
Suppose that data𝐔=\[𝐮1,…,𝐮N\]⊤∈ℝN×n\\mathbf\{U\}=\[\\mathbf\{u\}\_\{1\},\\dots,\\mathbf\{u\}\_\{N\}\]^\{\\top\}\\in\\mathbb\{R\}^\{N\\times n\}, with corresponding clean data𝐔clean\\mathbf\{U\}\_\{\\text\{clean\}\}, is given\. We assume that the underlying clean data is periodic, which implies thatk∗=N/\(h/Δt\)k^\{\*\}=N/\(h/\\Delta t\)\. We also assume that𝐖=𝐈\\mathbf\{W\}=\\mathbf\{I\}\.
Our main goal in this section is to show that the general expression for the bias given in \([35](https://arxiv.org/html/2607.00257#S3.E35)\), under the above ideal assumptions, can be expressed in the form
Bℓ2=‖\(𝐏−𝐈\)𝐔clean\(ℓ\)‖F2=‖\(𝐇−𝐈\)ℱ\[𝐔clean\(ℓ\)\]‖F2,B\_\{\\ell\}^\{2\}=\\\|\(\\mathbf\{P\}\-\\mathbf\{I\}\)\\mathbf\{U\}\_\{\\text\{clean\}\}^\{\(\\ell\)\}\\\|\_\{F\}^\{2\}=\\left\\\|\(\\mathbf\{H\}\-\\mathbf\{I\}\)\\mathcal\{F\}\\left\[\\mathbf\{U\}\_\{\\text\{clean\}\}^\{\(\\ell\)\}\\right\]\\right\\\|\_\{F\}^\{2\},\(59\)in terms of the Fourier transform of theℓ\\ellth component of the true underlying signal, and the matrix𝐇∈ℝN×N\\mathbf\{H\}\\in\\mathbb\{R\}^\{N\\times N\}as defined in \([38](https://arxiv.org/html/2607.00257#S3.E38)\)\. This approach is inspired by the approach taken in\[[70](https://arxiv.org/html/2607.00257#bib.bib20)\], which performs computations in continuous time\.
We first observe that when𝐖=𝐈\\mathbf\{W\}=\\mathbf\{I\}, we have that𝐏=𝚿⊤\(𝚿𝚿⊤\)−1𝚿\\mathbf\{P\}=\\boldsymbol\{\\Psi\}^\{\\top\}\(\\boldsymbol\{\\Psi\}\\boldsymbol\{\\Psi\}^\{\\top\}\)^\{\-1\}\\boldsymbol\{\\Psi\}\. Notice that the entries of the term in parenthesis are quadrature approximations of the integral
\[\(𝚿𝚿⊤\)\]ij=∫ℝdtφi\(t\)φj\(t\)=∫ℝdτφ\(τ\)φ\(τ−\(j−i\)h\),\\begin\{split\}\[\(\\boldsymbol\{\\Psi\}\\boldsymbol\{\\Psi\}^\{\\top\}\)\]\_\{ij\}&=\\int\_\{\\mathbb\{R\}\}\\text\{d\}t\\;\\varphi\_\{i\}\(t\)\\varphi\_\{j\}\(t\)=\\int\_\{\\mathbb\{R\}\}\\text\{d\}\\tau\\;\\varphi\(\\tau\)\\varphi\(\\tau\-\(j\-i\)h\),\\end\{split\}\(60\)which shows this matrix is Toeplitz, and leveraging periodicity, circulant\.
This observation is important because the discrete Fourier Transform \(DFT\) diagonalizes circulant matrices\. Let𝐅∈ℝN×N\\mathbf\{F\}\\in\\mathbb\{R\}^\{N\\times N\}with entries𝐅mn=1/Nexp\(−2πmni/N\)\\mathbf\{F\}\_\{mn\}=1/\\sqrt\{N\}\\exp\(\-2\\pi mni/N\)\. Notice that𝐅∗𝐅=𝐈N×N\\mathbf\{F\}^\{\*\}\\mathbf\{F\}=\\mathbf\{I\}\_\{N\\times N\}\. Consider the frequency domain representation of the test functions𝚿^=𝐅𝚿⊤∈ℂN×k∗\\hat\{\\boldsymbol\{\\Psi\}\}=\\mathbf\{F\}\\boldsymbol\{\\Psi\}^\{\\top\}\\in\\mathbb\{C\}^\{N\\times k^\{\*\}\}\. Leveraging periodicity and straightforward manipulations, one can show that the\(m,j\)\(m,j\)entry of this object has the form
𝚿^mj=∑p=0N−11Nexp\(−2πmpi/N\)φ\[p−jh\]=exp\(−2πmji/k∗\)φ^m,\\begin\{split\}\\hat\{\\boldsymbol\{\\Psi\}\}\_\{mj\}&=\\sum\_\{p=0\}^\{N\-1\}\\frac\{1\}\{\\sqrt\{N\}\}\\exp\(\-2\\pi mpi/N\)\\varphi\[p\-jh\]=\\exp\(\-2\\pi mji/k^\{\*\}\)\\hat\{\\varphi\}\_\{m\},\\end\{split\}\(61\)whereφ^m\\hat\{\\varphi\}\_\{m\}is themmth component of the DFT applied to the mother polynomial function\. Let𝐄∈ℂN×k∗\\mathbf\{E\}\\in\\mathbb\{C\}^\{N\\times k^\{\*\}\}be the matrix whose\(m,j\)\(m,j\)entries areexp\(−2πmji/k∗\)\\exp\(\-2\\pi mji/k^\{\*\}\), and let𝐃∈ℂN×N\\mathbf\{D\}\\in\\mathbb\{C\}^\{N\\times N\}be the diagonal matrix whosemmth entries isφ^m\\hat\{\\varphi\}\_\{m\}\. Then, we have
𝚿^=𝐅𝚿⊤=𝐃𝐄\.\\hat\{\\boldsymbol\{\\Psi\}\}=\\mathbf\{F\}\\boldsymbol\{\\Psi\}^\{\\top\}=\\mathbf\{D\}\\mathbf\{E\}\.\(62\)Moreover, note that the columns of𝐄\\mathbf\{E\}arek∗k^\{\*\}\-periodic, so the matrix𝐄\\mathbf\{E\}is composed ofhhblocks of sizek∗×k∗k^\{\*\}\\times k^\{\*\}stacked on top of each other\. Each of these blocks is thek∗×k∗k^\{\*\}\\times k^\{\*\}DFT matrixk∗𝐅k∗×k∗\\sqrt\{k^\{\*\}\}\\mathbf\{F\}\_\{k^\{\*\}\\times k^\{\*\}\}\. Letting𝐉\\mathbf\{J\}be the block identity matrix of sizeN×k∗N\\times k^\{\*\}, we can write
𝚿^=𝐅𝚿⊤=k∗𝐃𝐉𝐅k∗×k∗,\\hat\{\\boldsymbol\{\\Psi\}\}=\\mathbf\{F\}\\boldsymbol\{\\Psi\}^\{\\top\}=\\sqrt\{k^\{\*\}\}\\mathbf\{D\}\\mathbf\{J\}\\mathbf\{F\}\_\{k^\{\*\}\\times k^\{\*\}\},\(63\)
Now that this setup is complete, we can perform several substitutions to recover the𝐇\\mathbf\{H\}matrix above\. Consider
\(𝐏−𝐈\)=\(𝚿⊤\(𝚿𝚿⊤\)−1𝚿−𝐈\)=\(𝚿⊤\(𝚿𝐅∗𝐅𝚿⊤\)−1𝚿−𝐅∗𝐅\)=\(𝚿⊤\(k∗𝐅k∗×k∗∗\[𝐉∗\|𝐃\|2𝐉\]𝐅k∗×k∗\)−1𝚿−𝐅∗𝐅\)\.\\begin\{split\}\(\\mathbf\{P\}\-\\mathbf\{I\}\)&=\(\\boldsymbol\{\\Psi\}^\{\\top\}\(\\boldsymbol\{\\Psi\}\\boldsymbol\{\\Psi\}^\{\\top\}\)^\{\-1\}\\boldsymbol\{\\Psi\}\-\\mathbf\{I\}\)\\\\ &=\(\\boldsymbol\{\\Psi\}^\{\\top\}\(\\boldsymbol\{\\Psi\}\\mathbf\{F\}^\{\*\}\\mathbf\{F\}\\boldsymbol\{\\Psi\}^\{\\top\}\)^\{\-1\}\\boldsymbol\{\\Psi\}\-\\mathbf\{F\}^\{\*\}\\mathbf\{F\}\)\\\\ &=\(\\boldsymbol\{\\Psi\}^\{\\top\}\(k^\{\*\}\\mathbf\{F\}\_\{k^\{\*\}\\times k^\{\*\}\}^\{\*\}\[\\mathbf\{J\}^\{\*\}\|\\mathbf\{D\}\|^\{2\}\\mathbf\{J\}\]\\mathbf\{F\}\_\{k^\{\*\}\\times k^\{\*\}\}\)^\{\-1\}\\boldsymbol\{\\Psi\}\-\\mathbf\{F\}^\{\*\}\\mathbf\{F\}\)\.\\end\{split\}\(64\)Let𝐒=𝐉∗\|𝐃\|2𝐉∈ℝk∗×k∗\\mathbf\{S\}=\\mathbf\{J\}^\{\*\}\|\\mathbf\{D\}\|^\{2\}\\mathbf\{J\}\\in\\mathbb\{R\}^\{k^\{\*\}\\times k^\{\*\}\}\. Themmth entry of the diagonal matrix\|𝐃\|2\|\\mathbf\{D\}\|^\{2\}is\|ϕ^m\|2\|\\hat\{\\phi\}\_\{m\}\|^\{2\}\. The action of𝐉\\mathbf\{J\}on either side is to sum up the “blocks”, so𝐒\\mathbf\{S\}is a diagonal matrix whosemmth entry is∑ℓ=0h−1\|ϕ^m\+ℓk∗\|2\\sum\_\{\\ell=0\}^\{h\-1\}\|\\hat\{\\phi\}\_\{m\+\\ell k^\{\*\}\}\|^\{2\}\. Continuing, we have
\(𝐏−𝐈\)=1k∗\(𝚿⊤𝐅k∗×k∗∗𝐒−1𝐅k∗×k∗𝚿−𝐅∗𝐅\)=1k∗\(𝐅∗𝚿^𝐅k∗×k∗∗𝐒−1𝐅k∗×k∗𝚿^∗𝐅−𝐅∗𝐅\)=1k∗𝐅∗\(\[k∗𝐃𝐉𝐅k∗×k∗𝐅k∗×k∗∗\]𝐒−1\[k∗𝐃𝐉𝐅k∗×k∗𝐅k∗×k∗∗\]∗−𝐈\)𝐅=𝐅∗\(\(𝐃𝐉\)𝐒−1\(𝐉∗𝐃∗\)−𝐈\)𝐅\.\\begin\{split\}\(\\mathbf\{P\}\-\\mathbf\{I\}\)&=\\frac\{1\}\{k^\{\*\}\}\(\\boldsymbol\{\\Psi\}^\{\\top\}\\mathbf\{F\}^\{\*\}\_\{k^\{\*\}\\times k^\{\*\}\}\\mathbf\{S\}^\{\-1\}\\mathbf\{F\}\_\{k^\{\*\}\\times k^\{\*\}\}\\boldsymbol\{\\Psi\}\-\\mathbf\{F\}^\{\*\}\\mathbf\{F\}\)\\\\ &=\\frac\{1\}\{k^\{\*\}\}\(\\mathbf\{F\}^\{\*\}\\hat\{\\boldsymbol\{\\Psi\}\}\\mathbf\{F\}^\{\*\}\_\{k^\{\*\}\\times k^\{\*\}\}\\mathbf\{S\}^\{\-1\}\\mathbf\{F\}\_\{k^\{\*\}\\times k^\{\*\}\}\\hat\{\\boldsymbol\{\\Psi\}\}^\{\*\}\\mathbf\{F\}\-\\mathbf\{F\}^\{\*\}\\mathbf\{F\}\)\\\\ &=\\frac\{1\}\{k^\{\*\}\}\\mathbf\{F\}^\{\*\}\\left\(\[\\sqrt\{k^\{\*\}\}\\mathbf\{D\}\\mathbf\{J\}\\mathbf\{F\}\_\{k^\{\*\}\\times k^\{\*\}\}\\mathbf\{F\}^\{\*\}\_\{k^\{\*\}\\times k^\{\*\}\}\]\\mathbf\{S\}^\{\-1\}\[\\sqrt\{k^\{\*\}\}\\mathbf\{D\}\\mathbf\{J\}\\mathbf\{F\}\_\{k^\{\*\}\\times k^\{\*\}\}\\mathbf\{F\}\_\{k^\{\*\}\\times k^\{\*\}\}^\{\*\}\]^\{\*\}\-\\mathbf\{I\}\\right\)\\mathbf\{F\}\\\\ &=\\mathbf\{F\}^\{\*\}\\left\(\(\\mathbf\{D\}\\mathbf\{J\}\)\\mathbf\{S\}^\{\-1\}\(\\mathbf\{J\}^\{\*\}\\mathbf\{D\}^\{\*\}\)\-\\mathbf\{I\}\\right\)\\mathbf\{F\}\.\\end\{split\}\(65\)Define𝐇=\(𝐃𝐉\)𝐒−1\(𝐉∗𝐃∗\)∈ℝN×N\\mathbf\{H\}=\(\\mathbf\{D\}\\mathbf\{J\}\)\\mathbf\{S\}^\{\-1\}\(\\mathbf\{J\}^\{\*\}\\mathbf\{D\}^\{\*\}\)\\in\\mathbb\{R\}^\{N\\times N\}\. One can show, using the block structure of𝐉\\mathbf\{J\}, that the entries of𝐇\\mathbf\{H\}are of the form
𝐇mn=\{φ^mφ^n¯∑ℓ=0h−1\|φ^r\+ℓk∗\|2,m=n\(modk∗\),0,otherwise\.\\mathbf\{H\}\_\{mn\}=\\begin\{cases\}\\frac\{\\hat\{\\varphi\}\_\{m\}\\overline\{\\hat\{\\varphi\}\_\{n\}\}\}\{\\sum\_\{\\ell=0\}^\{h\-1\}\|\\hat\{\\varphi\}\_\{r\+\\ell k^\{\*\}\}\|^\{2\}\},&m=n\\;\(\\text\{mod \}k^\{\*\}\),\\\\ 0,&\\text\{otherwise\}\.\\end\{cases\}\(66\)wherer=mmodk∗r=m\\text\{ mod \}k^\{\*\}\. This computation recovers the form \([59](https://arxiv.org/html/2607.00257#A2.E59)\)\.
## Appendix CValidation Heuristic
In this section, we review a validation heuristic to compute reference bandwidth and regularization parameters\(ϵ∗,λ∗\)\(\\epsilon^\{\*\},\\lambda^\{\*\}\)which was utilized in\[[65](https://arxiv.org/html/2607.00257#bib.bib77)\]for strong KRR applied to clean data\. Here, we modify it slightly to improve robustness in the case of noisy data\. We emphasize that the procedure described below is a heuristic which has been observed to provide reasonable results over a range of examples\[[36](https://arxiv.org/html/2607.00257#bib.bib106),[65](https://arxiv.org/html/2607.00257#bib.bib77)\]\. While it often provides a good starting point, it should not be interpreted as an optimal approach that is appropriate for all problems\.
Suppose that training data𝐮i\\mathbf\{u\}\_\{i\}fori=1,…,Ni=1,\\dots,Nof the form \([1](https://arxiv.org/html/2607.00257#S2.E1)\) is given\. This data could be filtered, but is not required to be\. LetL∗L\_\{\*\}be the maximum pairwiseL2L\_\{2\}distance between these data points\. Let
ρ\(𝐱,𝐲;η\)=exp\(−∥𝐱−𝐲∥22/\(L∗2η\)\\rho\(\\mathbf\{x\},\\mathbf\{y\};\\eta\)=\\exp\(\-\\\|\\mathbf\{x\}\-\\mathbf\{y\}\\\|\_\{2\}^\{2\}/\(L\_\{\*\}^\{2\}\\eta\)\(67\)be a standard Gaussian RBF kernel with bandwidthη\\eta\. Further define
S\(η\)=1N2∑i,j=1Nρ\(𝐮i,𝐮j;η\)\.S\(\\eta\)=\\frac\{1\}\{N^\{2\}\}\\sum\_\{i,j=1\}^\{N\}\\rho\(\\mathbf\{u\}\_\{i\},\\mathbf\{u\}\_\{j\};\\eta\)\.\(68\)Following\[[19](https://arxiv.org/html/2607.00257#bib.bib7),[65](https://arxiv.org/html/2607.00257#bib.bib77)\], this fact leads one to consider
V\(η\)=2ηS\(η\)dS\(η\)dη,V\(\\eta\)=2\\frac\{\\eta\}\{S\(\\eta\)\}\\frac\{\\text\{d\}S\(\\eta\)\}\{\\text\{d\}\\eta\},\(69\)and to consider the problem
η∗=argmaxηV\(η\)\.\\eta^\{\*\}=\\arg\\max\_\{\\eta\}V\(\\eta\)\.\(70\)For clean data,V\(η\)V\(\\eta\)is often uni\-modal, and an appropriate choice ofη∗\\eta^\{\*\}is unambiguous\. For noisy data, we numerically observe thatV\(η\)V\(\\eta\)may admit multiple peaks with similar values\. In practice, ifηr∗\\eta^\{\*\}\_\{r\}denotes the arguments at which the local maxima occur, we recommend selectingη∗=maxrηr∗\\eta^\{\*\}=\\max\_\{r\}\\eta^\{\*\}\_\{r\}\. The goal of this procedure is to avoid selecting an extremely small bandwidth, which could lead to poor generalization\.
Onceη∗\\eta^\{\*\}has been computed, we use it to define a reference bandwidthϵ∗\\epsilon^\{\*\}via the following formula
ϵ∗=250L∗2η∗\.\\epsilon^\{\*\}=250L\_\{\*\}^\{2\}\\eta^\{\*\}\.\(71\)
To select the reference regularization parameterλ∗\\lambda^\{\*\}, letk\(𝐱,𝐲;ϵ\)=exp\(−‖𝐱−𝐲‖22/ϵ\)k\(\\mathbf\{x\},\\mathbf\{y\};\\epsilon\)=\\exp\(\-\\\|\\mathbf\{x\}\-\\mathbf\{y\}\\\|\_\{2\}^\{2\}/\\epsilon\)be the Gaussian RBF kernel, and let𝐊RBF\(ϵ\)\\mathbf\{K\}\_\{\\text\{RBF\}\}\(\\epsilon\)be the corresponding Gram matrix whose\(i,j\)\(i,j\)entry isk\(𝐮i,𝐮j;ϵ\)k\(\\mathbf\{u\}\_\{i\},\\mathbf\{u\}\_\{j\};\\epsilon\)\. We chooseλ∗\\lambda^\{\*\}to be the minimum eigenvalue of𝐊RBF\(ϵ\)\\mathbf\{K\}\_\{\\text\{RBF\}\}\(\\epsilon\)\.
## Appendix DValidation Landscapes
In this section, we depict typical WKRR validation landscapes across noise levels for the L63 system \([51](https://arxiv.org/html/2607.00257#S5.E51)\) in Figure[16](https://arxiv.org/html/2607.00257#A4.F16), for the KS system \([52](https://arxiv.org/html/2607.00257#S5.E52)\) in Figure[17](https://arxiv.org/html/2607.00257#A4.F17), and for the experimental fluid data in Figure[18](https://arxiv.org/html/2607.00257#A4.F18)\.
Note that in the case of the experimental fluid data, over\-regularizing may improve quantitative performance in terms of the error metric at the cost of qualitative fidelity, i\.e\., over\-regularizing often causes WKRR to rapidly converge to a mean flow that does not capture important qualitative features of the data\.

\(a\) 1% Noise

\(b\) 5% Noise

\(c\) 10% Noise

\(d\) 20% Noise
Figure 16:Typical coarse \(left sub\-columns\) and fine \(right sub\-columns\) validation landscapes for the L63 system \([51](https://arxiv.org/html/2607.00257#S5.E51)\) at various noise intensities\. Color denotes VPT\. The pink dots denote the chosen parameter pair\.
\(a\) 1% Noise

\(b\) 5% Noise

\(c\) 10% Noise

\(d\) 20% Noise
Figure 17:Typical coarse \(left sub\-columns\) and fine \(right sub\-columns\) validation landscapes for the KS system \([52](https://arxiv.org/html/2607.00257#S5.E52)\) at various noise intensities\. Color denotes VPT\. The pink dots denote the chosen parameter pair\.

Figure 18:Typical coarse \(left\) and fine \(right\) validation landscapes for the Community Challenge fluid data withr=25r=25POD modes retained\. Error values are thresholded at 10 to better visualize the landscape\.
## Appendix EKS Diffusion Maps Kernel Data
Here, we repeat the experimental procedure for the KS system described in §[5\.2](https://arxiv.org/html/2607.00257#S5.SS2)with the DM kernel \([17](https://arxiv.org/html/2607.00257#S2.E17)\) instead of the Gaussian kernel \([15](https://arxiv.org/html/2607.00257#S2.E15)\)\. The results are shown in Table[1](https://arxiv.org/html/2607.00257#A5.T1), and demonstrate the superior performance of WKRR over the strong formulation across all noise levels\.
Table 1:VPT statistics for the KS system \([52](https://arxiv.org/html/2607.00257#S5.E52)\) under various noise intensities using the DM kernel\. “Strong” and “Weak” denote classical KRR and proposed WKRR frameworks, and parentheses indicate filtering applied to the training data, where \(n/a\) denotes unfiltered data\. Validation data is filtered as the training data for strong formulations, while polynomials are used for the weak formulations\.
## References
- \[1\]\(2020\)Exploration and prediction of fluid dynamical systems using auto\-encoder technology\.Physics of Fluids32\(6\)\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p1.1)\.
- \[2\]A\. M\. Ahmed, E\. Sharma, S\. J\. J\. Jui, R\. C\. Deo, T\. Nguyen\-Huy, and M\. Ali\(2022\)Kernel ridge regression hybrid method for wheat yield prediction with satellite\-derived predictors\.Remote Sensing14\(5\),pp\. 1136\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p2.1)\.
- \[3\]A\. Aldroubi, M\. Unser, and A\. Aldroubi\(1994\)Sampling procedures in function spaces and asymptotic equivalence with Shannon’s sampling theory\.Numerical functional analysis and optimization15\(1\-2\),pp\. 1–21\.Cited by:[§3\.3](https://arxiv.org/html/2607.00257#S3.SS3.p3.9)\.
- \[4\]M\. Ali, R\. Prasad, Y\. Xiang, and Z\. M\. Yaseen\(2020\)Complete ensemble empirical mode decomposition hybridized with random forest and kernel ridge regression model for monthly rainfall forecasts\.Journal of Hydrology584,pp\. 124647\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p2.1)\.
- \[5\]C\. Antoniou, H\. N\. Koutsopoulos, and G\. Yannis\(2013\)Dynamic data\-driven local traffic state estimation and prediction\.Transportation Research Part C: Emerging Technologies34,pp\. 89–107\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p1.1)\.
- \[6\]A\. M\. Avila and I\. Mezić\(2020\)Data\-driven analysis and forecasting of highway traffic dynamics\.Nature communications11\(1\),pp\. 2090\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p1.1)\.
- \[7\]T\. Blu and M\. Unser\(1999\)Approximation error for quasi\-interpolators and \(multi\-\) wavelet expansions\.Applied and Computational Harmonic Analysis6\(2\),pp\. 219–251\.Cited by:[§3\.3](https://arxiv.org/html/2607.00257#S3.SS3.p3.9)\.
- \[8\]D\. M\. Bortz, D\. A\. Messenger, and A\. Tran\(2024\)Weak form\-based data\-driven modeling: computationally efficient and noise robust equation learning and parameter inference\.InHandbook of Numerical Analysis,Vol\.25,pp\. 53–82\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p3.1),[§1](https://arxiv.org/html/2607.00257#S1.p5.1),[§3\.1](https://arxiv.org/html/2607.00257#S3.SS1.p1.4),[§3](https://arxiv.org/html/2607.00257#S3.p2.1),[§6](https://arxiv.org/html/2607.00257#S6.p1.1)\.
- \[9\]G\. E\. Box, G\. M\. Jenkins, G\. C\. Reinsel, and G\. M\. Ljung\(2015\)Time series analysis: forecasting and control\.John Wiley & Sons\.Cited by:[§3\.2\.1](https://arxiv.org/html/2607.00257#S3.SS2.SSS1.p3.3)\.
- \[10\]E\. L\. Brugnago, J\. A\. Gallas, and M\. W\. Beims\(2020\)Predicting regime changes and durations in Lorenz’s atmospheric convection model\.Chaos: An Interdisciplinary Journal of Nonlinear Science30\(10\)\.Cited by:[§5\.1](https://arxiv.org/html/2607.00257#S5.SS1.p1.4)\.
- \[11\]S\. L\. Brunton and J\. N\. Kutz\(2022\)Data\-driven science and engineering: machine learning, dynamical systems, and control\.Cambridge University Press\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p2.1)\.
- \[12\]S\. L\. Brunton, J\. L\. Proctor, and J\. N\. Kutz\(2016\)Discovering governing equations from data by sparse identification of nonlinear dynamical systems\.Proceedings of the national academy of sciences113\(15\),pp\. 3932–3937\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p2.1)\.
- \[13\]S\. L\. Brunton, J\. L\. Proctor, and J\. N\. Kutz\(2016\)Sparse identification of nonlinear dynamics with control \(SINDYc\)\.IFAC\-PapersOnLine49\(18\),pp\. 710–715\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p2.1)\.
- \[14\]O\. Castillo and P\. Melin\(1995\)An intelligent system for financial time series prediction combining dynamical systems theory, fractal theory, and statistical methods\.InProceedings of 1995 Conference on Computational Intelligence for Financial Engineering \(CIFEr\),pp\. 151–155\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p1.1)\.
- \[15\]R\. T\. Chen, Y\. Rubanova, J\. Bettencourt, and D\. K\. Duvenaud\(2018\)Neural ordinary differential equations\.Advances in neural information processing systems31\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p2.1)\.
- \[16\]S\. Cheng, C\. Quilodrán\-Casas, S\. Ouala, A\. Farchi, C\. Liu, P\. Tandeo, R\. Fablet, D\. Lucor, B\. Iooss, J\. Brajard,et al\.\(2023\)Machine learning with data assimilation and uncertainty quantification for dynamical systems: a review\.IEEE/CAA Journal of Automatica Sinica10\(6\),pp\. 1361–1387\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p3.1)\.
- \[17\]H\. M\. Christensen and J\. Berner\(2019\)From reliable weather forecasts to skilful climate response: a dynamical systems approach\.Quarterly Journal of the Royal Meteorological Society145\(720\),pp\. 1052–1069\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p1.1)\.
- \[18\]R\. R\. Coifman and S\. Lafon\(2006\)Diffusion maps\.Applied and computational harmonic analysis21\(1\),pp\. 5–30\.Cited by:[§2\.2](https://arxiv.org/html/2607.00257#S2.SS2.p3.2),[§2\.2](https://arxiv.org/html/2607.00257#S2.SS2.p5.2),[§6](https://arxiv.org/html/2607.00257#S6.p1.1)\.
- \[19\]R\. R\. Coifman, Y\. Shkolnisky, F\. J\. Sigworth, and A\. Singer\(2008\)Graph Laplacian tomography from unknown random projections\.IEEE Transactions on Image Processing17\(10\),pp\. 1891–1899\.Cited by:[Appendix C](https://arxiv.org/html/2607.00257#A3.p2.11)\.
- \[20\]M\. J\. Colbrook, L\. J\. Ayton, and M\. Szőke\(2023\)Residual dynamic mode decomposition: robust and verified Koopmanism\.Journal of Fluid Mechanics955,pp\. A21\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p2.1)\.
- \[21\]M\. J\. Colbrook\(2023\)The mpedmd algorithm for data\-driven computations of measure\-preserving dynamical systems\.SIAM Journal on Numerical Analysis61\(3\),pp\. 1585–1608\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p2.1)\.
- \[22\]R\. A\. Edson, J\. E\. Bunder, T\. W\. Mattner, and A\. J\. Roberts\(2019\)Lyapunov exponents of the Kuramoto–Sivashinsky PDE\.The ANZIAM Journal61\(3\),pp\. 270–285\.Cited by:[§5\.2](https://arxiv.org/html/2607.00257#S5.SS2.p1.4)\.
- \[23\]O\. Erge and E\. Van Oort\(2022\)Combining physics\-based and data\-driven modeling in well construction: hybrid fluid dynamics modeling\.Journal of Natural Gas Science and Engineering97,pp\. 104348\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p1.1)\.
- \[24\]P\. Exterkate, P\. J\. Groenen, C\. Heij, and D\. van Dijk\(2016\)Nonlinear forecasting with many predictors using kernel ridge regression\.International Journal of Forecasting32\(3\),pp\. 736–753\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p2.1)\.
- \[25\]D\. Floryan and M\. D\. Graham\(2022\)Data\-driven discovery of intrinsic dynamics\.Nature Machine Intelligence4\(12\),pp\. 1113–1120\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p2.1)\.
- \[26\]D\. J\. Gauthier, E\. Bollt, A\. Griffith, and W\. A\. Barbosa\(2021\)Next generation reservoir computing\.Nature communications12\(1\),pp\. 5564\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p2.1)\.
- \[27\]A\. Ghadami and B\. I\. Epureanu\(2022\)Data\-driven prediction in dynamical systems: recent developments\.Philosophical transactions\. Series A, Mathematical, physical, and engineering sciences380\(2229\),pp\. 20210213\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p2.1)\.
- \[28\]W\. Gilpin, Y\. Huang, and D\. B\. Forger\(2020\)Learning dynamics from large biological data sets: machine learning meets systems biology\.Current Opinion in Systems Biology22,pp\. 1–7\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p1.1)\.
- \[29\]À\. Giménez\-Romero\(2024\)Theoretical and data\-driven models in ecology\.Ph\.D\. Thesis,University of the Balearic Islands \(UIB\); Institute for Cross\-Disciplinary Physics and Complex Systems\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p1.1)\.
- \[30\]A\. Girard, C\. Rasmussen, J\. Q\. Candela, and R\. Murray\-Smith\(2002\)Gaussian process priors with uncertain inputs application to multiple\-step ahead time series forecasting\.Advances in neural information processing systems15\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p3.1)\.
- \[31\]G\. A\. Gottwald and S\. Reich\(2021\)Combining machine learning and data assimilation to forecast dynamical systems from noisy partial observations\.Chaos: An Interdisciplinary Journal of Nonlinear Science31\(10\)\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p3.1),[§5](https://arxiv.org/html/2607.00257#S5.p1.1)\.
- \[32\]G\. A\. Gottwald and S\. Reich\(2021\)Supervised learning from noisy observations: combining machine\-learning techniques with data assimilation\.Physica D: Nonlinear Phenomena423,pp\. 132911\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p3.1)\.
- \[33\]P\. Goyal and P\. Benner\(2022\)Neural ODEs with irregular and noisy data\.arXiv preprint arXiv:2205\.09479\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p2.1)\.
- \[34\]J\. Harlim, D\. Huang, J\. Song, and A\. Townsend\(2026\)Diffusion maps kernel ridge regression\.arXiv preprint, in preparation\.Cited by:[§2\.2](https://arxiv.org/html/2607.00257#S2.SS2.p6.6),[§5\.2](https://arxiv.org/html/2607.00257#S5.SS2.p6.1)\.
- \[35\]S\. Hochreiter and J\. Schmidhuber\(1997\)Long short\-term memory\.Neural computation9\(8\),pp\. 1735–1780\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p2.1),[§5](https://arxiv.org/html/2607.00257#S5.p1.1)\.
- \[36\]D\. Huang, H\. He, J\. Harlim, and Y\. Li\(2025\)Learning vector fields of differential equations on manifolds with geometrically constrained operator\-valued kernels\.InThe Thirteenth International Conference on Learning Representations,Cited by:[Appendix C](https://arxiv.org/html/2607.00257#A3.p1.1)\.
- \[37\]A\. J\. Hussain, P\. Liatsis, M\. Khalaf, H\. Tawfik, and H\. Al\-Asker\(2018\)A dynamic neural network architecture with immunology inspired optimization for weather data forecasting\.Big data research14,pp\. 81–92\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p1.1)\.
- \[38\]K\. Kaheman, J\. N\. Kutz, and S\. L\. Brunton\(2020\)SINDy\-PI: a robust algorithm for parallel implicit sparse identification of nonlinear dynamics\.Proceedings\. Mathematical, physical, and engineering sciences476\(2242\),pp\. 20200279\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p2.1)\.
- \[39\]R\. E\. Kalman\(1960\)A new approach to linear filtering and prediction problems\.Transactions of the ASME–Journal of Basic Engineering\.Cited by:[§3\.2\.1](https://arxiv.org/html/2607.00257#S3.SS2.SSS1.p3.3)\.
- \[40\]A\. Kassam and L\. N\. Trefethen\(2005\)Fourth\-order time\-stepping for stiff PDEs\.SIAM Journal on Scientific Computing26\(4\),pp\. 1214–1233\.Cited by:[§5\.2](https://arxiv.org/html/2607.00257#S5.SS2.p2.3)\.
- \[41\]J\. N\. Kutz, S\. L\. Brunton, B\. W\. Brunton, and J\. L\. Proctor\(2016\)Dynamic mode decomposition: data\-driven modeling of complex systems\.SIAM\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p2.1)\.
- \[42\]N\. Kuznetsov, T\. Mokaev, O\. Kuznetsova, and E\. Kudryashova\(2020\)The Lorenz system: hidden boundary of practical stability and the Lyapunov dimension\.Nonlinear Dyn102,pp\. 713–732\.Cited by:[§5\.1](https://arxiv.org/html/2607.00257#S5.SS1.p8.1)\.
- \[43\]Q\. Li, F\. Dietrich, E\. M\. Bollt, and I\. G\. Kevrekidis\(2017\)Extended dynamic mode decomposition with dictionary learning: a data\-driven adaptive spectral decomposition of the Koopman operator\.Chaos: An Interdisciplinary Journal of Nonlinear Science27\(10\)\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p2.1)\.
- \[44\]X\. Li, J\. Harlim, D\. Chakraborty, and R\. Maulik\(2025\)A weak penalty neural ODE for learning chaotic dynamics from noisy time series\.arXiv preprint arXiv:2511\.06609\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p2.1),[§1](https://arxiv.org/html/2607.00257#S1.p3.1),[§1](https://arxiv.org/html/2607.00257#S1.p5.1),[§3\.1](https://arxiv.org/html/2607.00257#S3.SS1.p1.4),[§3\.2\.1](https://arxiv.org/html/2607.00257#S3.SS2.SSS1.p3.3),[§3](https://arxiv.org/html/2607.00257#S3.p2.1),[§5\.1](https://arxiv.org/html/2607.00257#S5.SS1.p1.4),[§6](https://arxiv.org/html/2607.00257#S6.p1.1)\.
- \[45\]B\. Lindemann, T\. Müller, H\. Vietz, N\. Jazdi, and M\. Weyrich\(2021\)A survey on long short\-term memory networks for time series prediction\.Procedia Cirp99,pp\. 650–655\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p2.1)\.
- \[46\]G\. M\. Ljung and G\. E\. Box\(1978\)On a measure of lack of fit in time series models\.Biometrika65\(2\),pp\. 297–303\.Cited by:[§3\.2\.1](https://arxiv.org/html/2607.00257#S3.SS2.SSS1.p3.3)\.
- \[47\]Y\. Long, X\. She, and S\. Mukhopadhyay\(2018\)Hybridnet: integrating model\-based and data\-driven learning to predict evolution of dynamical systems\.InConference on robot learning,pp\. 551–560\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p1.1)\.
- \[48\]E\. N\. Lorenz\(2017\)Deterministic nonperiodic flow 1\.InUniversality in Chaos, 2nd edition,pp\. 367–378\.Cited by:[§5\.1](https://arxiv.org/html/2607.00257#S5.SS1.p1.5)\.
- \[49\]Y\. Luo, K\. Ogle, C\. Tucker, S\. Fei, C\. Gao, S\. LaDeau, J\. S\. Clark, and D\. S\. Schimel\(2011\)Ecological forecasting and data assimilation in a data\-rich era\.Ecological Applications21\(5\),pp\. 1429–1442\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p1.1)\.
- \[50\]P\. S\. Maybeck\(1982\)Stochastic models, estimation, and control\.Vol\.3,Academic press\.Cited by:[§3\.2\.1](https://arxiv.org/html/2607.00257#S3.SS2.SSS1.p3.3)\.
- \[51\]D\. A\. Messenger and D\. M\. Bortz\(2021\)Weak SINDy for partial differential equations\.Journal of Computational Physics443,pp\. 110525\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p3.1),[§1](https://arxiv.org/html/2607.00257#S1.p5.1),[§3\.1](https://arxiv.org/html/2607.00257#S3.SS1.p1.4),[§3\.2\.1](https://arxiv.org/html/2607.00257#S3.SS2.SSS1.p1.4),[§3\.3](https://arxiv.org/html/2607.00257#S3.SS3.p4.1),[§3](https://arxiv.org/html/2607.00257#S3.p2.1),[§6](https://arxiv.org/html/2607.00257#S6.p1.1)\.
- \[52\]D\. A\. Messenger and D\. M\. Bortz\(2021\)Weak SINDy: galerkin\-based data\-driven model selection\.Multiscale Modeling & Simulation19\(3\),pp\. 1474–1497\.Cited by:[§3\.1](https://arxiv.org/html/2607.00257#S3.SS1.p1.4),[§3](https://arxiv.org/html/2607.00257#S3.p2.1)\.
- \[53\]D\. A\. Messenger and D\. M\. Bortz\(2025\)Asymptotic consistency of the WSINDy algorithm in the limit of continuum data\.IMA Journal of Numerical Analysis45\(6\),pp\. 3264–3312\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p3.1),[§1](https://arxiv.org/html/2607.00257#S1.p5.1),[§3\.1](https://arxiv.org/html/2607.00257#S3.SS1.p1.4),[§3](https://arxiv.org/html/2607.00257#S3.p2.1),[§4\.3](https://arxiv.org/html/2607.00257#S4.SS3.p3.1),[§6](https://arxiv.org/html/2607.00257#S6.p1.1)\.
- \[54\]D\. A\. Messenger, A\. Tran, V\. Dukic, and D\. M\. Bortz\(2024\)The weak form is stronger than you think\.arXiv preprint arXiv:2409\.06751\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p3.1),[§1](https://arxiv.org/html/2607.00257#S1.p5.1),[§3\.1](https://arxiv.org/html/2607.00257#S3.SS1.p1.4),[§3](https://arxiv.org/html/2607.00257#S3.p2.1),[§6](https://arxiv.org/html/2607.00257#S6.p1.1)\.
- \[55\]I\. Mezić\(2022\)On numerical approximations of the Koopman operator\.Mathematics10\(7\),pp\. 1180\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p2.1)\.
- \[56\]K\. Nakajima and I\. Fischer\(2021\)Reservoir computing\.Springer\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p2.1)\.
- \[57\]E\. D\. Nino\-Ruiz and F\. J\. Acevedo García\(2021\)Data\-driven methods for weather forecast\.InInternational Conference on Computational Science,pp\. 326–336\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p1.1)\.
- \[58\]J\. S\. North, C\. K\. Wikle, and E\. M\. Schliep\(2023\)A review of data\-driven discovery for dynamic systems\.International Statistical Review91\(3\),pp\. 464–492\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p2.1)\.
- \[59\]Y\. Oh, S\. Kam, J\. Lee, D\. Lim, S\. Kim, and A\. Bui\(2025\)Comprehensive review of neural differential equations for time series analysis\.arXiv preprint arXiv:2502\.09885\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p2.1)\.
- \[60\]B\. Prokop and L\. Gelens\(2025\)Data\-driven discovery of dynamical models in biology\.arXiv preprint arXiv:2509\.06735\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p1.1)\.
- \[61\]A\. Savitzky and M\. J\. Golay\(1964\)Smoothing and differentiation of data by simplified least squares procedures\.\.Analytical chemistry36\(8\),pp\. 1627–1639\.Cited by:[§3\.2\.1](https://arxiv.org/html/2607.00257#S3.SS2.SSS1.p3.3)\.
- \[62\]O\. T\. Schmidt, A\. Towne, A\. Lozano\-Duran, S\. Dawson, and R\. Vinuesa\(2026\)Data\-driven reduced\-complexity modeling of fluid flows: a community challenge\.arXiv preprint arXiv:2601\.06183\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p1.1),[§1](https://arxiv.org/html/2607.00257#S1.p6.1),[§5\.5](https://arxiv.org/html/2607.00257#S5.SS5.p1.2),[§5\.5](https://arxiv.org/html/2607.00257#S5.SS5.p2.10),[§5\.5](https://arxiv.org/html/2607.00257#S5.SS5.p5.5),[§5\.5](https://arxiv.org/html/2607.00257#S5.SS5.p6.4),[§5](https://arxiv.org/html/2607.00257#S5.p1.1)\.
- \[63\]C\. E\. Shannon\(1949\)Communication in the presence of noise\.Proceedings of the IRE37\(1\),pp\. 10–21\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p4.1),[§3\.2](https://arxiv.org/html/2607.00257#S3.SS2.p5.2)\.
- \[64\]J\. Song, B\. Xiang, X\. Wang, L\. Wu, and C\. Chang\(2014\)Application of dynamic data driven application system in environmental science\.Environmental Reviews22\(3\),pp\. 287–297\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p1.1)\.
- \[65\]J\. Song, D\. Huang, and J\. Harlim\(2025\)Learning solution operator of dynamical systems with diffusion maps kernel ridge regression\.arXiv preprint arXiv:2512\.17203\.Cited by:[Appendix C](https://arxiv.org/html/2607.00257#A3.p1.1),[Appendix C](https://arxiv.org/html/2607.00257#A3.p2.11),[§1](https://arxiv.org/html/2607.00257#S1.p2.1),[§1](https://arxiv.org/html/2607.00257#S1.p5.1),[§2\.1\.2](https://arxiv.org/html/2607.00257#S2.SS1.SSS2.p2.4),[§2\.1](https://arxiv.org/html/2607.00257#S2.SS1.p3.1),[§2\.2](https://arxiv.org/html/2607.00257#S2.SS2.p3.2),[§2\.2](https://arxiv.org/html/2607.00257#S2.SS2.p5.1),[§4\.1](https://arxiv.org/html/2607.00257#S4.SS1.p6.7),[§4\.2](https://arxiv.org/html/2607.00257#S4.SS2.p6.3),[§5\.1](https://arxiv.org/html/2607.00257#S5.SS1.p1.4),[§5\.2](https://arxiv.org/html/2607.00257#S5.SS2.p1.4),[§5\.2](https://arxiv.org/html/2607.00257#S5.SS2.p2.3),[§5\.4](https://arxiv.org/html/2607.00257#S5.SS4.p2.4),[§5\.4](https://arxiv.org/html/2607.00257#S5.SS4.p3.1),[§5](https://arxiv.org/html/2607.00257#S5.p1.1),[§6](https://arxiv.org/html/2607.00257#S6.p1.1)\.
- \[66\]G\. Tanaka, T\. Yamane, J\. B\. Héroux, R\. Nakane, N\. Kanazawa, S\. Takeda, H\. Numata, D\. Nakano, and A\. Hirose\(2019\)Recent advances in physical reservoir computing: a review\.Neural Networks115,pp\. 100–123\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p2.1)\.
- \[67\]M\. Unser, A\. Aldroubi, and M\. Eden\(2002\)Polynomial spline signal approximations: filter design and asymptotic equivalence with Shannon’s sampling theorem\.IEEE Transactions on Information Theory38\(1\),pp\. 95–103\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p4.1)\.
- \[68\]M\. Unser and A\. Aldroubi\(2002\)A general sampling theory for nonideal acquisition devices\.IEEE Transactions on Signal Processing42\(11\),pp\. 2915–2925\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p4.1),[§3\.2](https://arxiv.org/html/2607.00257#S3.SS2.p5.2),[§3\.3](https://arxiv.org/html/2607.00257#S3.SS3.p3.9)\.
- \[69\]M\. Unser and J\. Zerubia\(2002\)A generalized sampling theory without band\-limiting constraints\.IEEE transactions on circuits and systems II: analog and digital signal processing45\(8\),pp\. 959–969\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p4.1)\.
- \[70\]M\. Unser\(2002\)Sampling\-50 years after Shannon\.Proceedings of the IEEE88\(4\),pp\. 569–587\.Cited by:[Appendix B](https://arxiv.org/html/2607.00257#A2.p2.2),[§1](https://arxiv.org/html/2607.00257#S1.p4.1),[§3\.2](https://arxiv.org/html/2607.00257#S3.SS2.p5.2),[§3\.3](https://arxiv.org/html/2607.00257#S3.SS3.p3.9)\.
- \[71\]P\. R\. Vlachas, W\. Byeon, Z\. Y\. Wan, T\. P\. Sapsis, and P\. Koumoutsakos\(2018\)Data\-driven forecasting of high\-dimensional chaotic systems with long short\-term memory networks\.Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences474\(2213\),pp\. 20170844\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p2.1)\.
- \[72\]V\. Vovk\(2013\)Kernel ridge regression\.InEmpirical inference: Festschrift in honor of vladimir n\. vapnik,pp\. 105–116\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p2.1)\.
- \[73\]S\. Waheed, M\. Qayyum, O\. Khan, and G\. Chambashi\(2026\)Data\-driven neural modeling and chaos control in fractional\-order financial dynamical systems\.AIP Advances16\(1\)\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p1.1)\.
- \[74\]J\. Wang, A\. Hertzmann, and D\. J\. Fleet\(2005\)Gaussian process dynamical models\.Advances in neural information processing systems18\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p2.1)\.
- \[75\]M\. O\. Williams, I\. G\. Kevrekidis, and C\. W\. Rowley\(2015\)A data–driven approximation of the Koopman operator: extending dynamic mode decomposition\.Journal of Nonlinear Science25\(6\),pp\. 1307–1346\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p2.1)\.
- \[76\]A\. D\. Wyner and S\. Shamai\(1998\)Introduction to ‘communication in the presence of noise’ by CE Shannon\.Proc\. IEEE86\(2\),pp\. 442–446\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p4.1)\.
- \[77\]J\. Xing\(2022\)Reconstructing data\-driven governing equations for cell phenotypic transitions: integration of data science and systems biology\.Physical Biology19\(6\),pp\. 061001\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p1.1)\.
- \[78\]F\. Xu, Y\. Lin, J\. Huang, D\. Wu, H\. Shi, J\. Song, and Y\. Li\(2016\)Big data driven mobile traffic understanding and forecasting: a time series approach\.IEEE transactions on services computing9\(5\),pp\. 796–805\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p1.1)\.
- \[79\]M\. Yan, C\. Huang, P\. Bienstman, P\. Tino, W\. Lin, and J\. Sun\(2024\)Emerging opportunities and challenges for the future of reservoir computing\.Nature Communications15\(1\),pp\. 2056\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p2.1)\.
- \[80\]W\. Yan, H\. Qiu, and Y\. Xue\(2009\)Gaussian process for long\-term time\-series forecasting\.In2009 international joint conference on neural networks,pp\. 3420–3427\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p3.1)\.
- \[81\]S\. Yang, S\. W\. Wong, and S\. Kou\(2021\)Inference of dynamic systems from noisy and sparse data via manifold\-constrained gaussian processes\.Proceedings of the National Academy of Sciences118\(15\),pp\. e2020397118\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p3.1)\.
- \[82\]H\. Ye, R\. J\. Beamish, S\. M\. Glaser, S\. C\. Grant, C\. Hsieh, L\. J\. Richards, J\. T\. Schnute, and G\. Sugihara\(2015\)Equation\-free mechanistic ecosystem forecasting using empirical dynamic modeling\.Proceedings of the National Academy of Sciences112\(13\),pp\. E1569–E1576\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p1.1)\.
- \[83\]Y\. Yu, D\. Huang, S\. Park, and H\. Pangborn\(2026\)Learning networked dynamical system models with weak form and graph neural networks\.Journal of Guidance Control and Dynamics\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2407.16779)Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p2.1)\.
- \[84\]Y\. Yu, X\. Si, C\. Hu, and J\. Zhang\(2019\)A review of recurrent neural networks: LSTM cells and network architectures\.Neural computation31\(7\),pp\. 1235–1270\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p2.1)\.
- \[85\]L\. Zhang and H\. Schaeffer\(2019\)On the convergence of the SINDy algorithm\.Multiscale Modeling & Simulation17\(3\),pp\. 948–972\.Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p2.1)\.
- \[86\]H\. Zhao, Y\. Wang, H\. Qi, Z\. Huang, H\. Zhao, L\. Sha, and H\. Shao\(2025\)Accelerating neural ODEs: a variational formulation\-based approach\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2607.00257#S1.p2.1)\.Similar Articles
A Bayesian Filtering Approach for Learning Lagrangian Dynamics from Noisy Measurements
This paper presents a Bayesian filtering approach to learn Lagrangian dynamics from partial, noisy measurements by parameterizing kinetic and potential energies with neural networks and jointly estimating states and parameters via maximum likelihood.
Perturbative methods for non-parametric instrumental variable
Introduces a perturbative approach for nonparametric instrumental variable estimation that extends kernel ridge methods with higher-order corrections, achieving up to 99% reduction in prediction error in high-dimensional settings.
Automated Kernel Discovery Towards Understanding High-dimensional Bayesian Optimization
The paper introduces Kernel Discovery, an LLM-driven evolutionary framework for high-dimensional Bayesian optimization that searches a broader kernel space and achieves state-of-the-art results on benchmarks.
Learning Dynamical Systems from Multiple Sparse Datasets: A Hierarchical Bayesian Modeling Approach
Proposes a hierarchical Bayesian framework for meta-learning in dynamical systems from multiple sparse, noisy datasets, using gradient-based MCMC with an embedded ODE solver for efficient posterior inference of shared and dataset-specific parameters.
Learning to Distributedly Estimate under Partially Known Dynamics: A Covariance-Agnostic Neural Kalman Consensus Filter
This paper presents CA-NKCF, a novel distributed latent state estimator combining partial domain knowledge with deep neural networks, achieving robust performance without noise statistics knowledge, outperforming traditional filters in linear, chaotic, and wireless tracking environments.