Learning to Distributedly Estimate under Partially Known Dynamics: A Covariance-Agnostic Neural Kalman Consensus Filter
Summary
This paper presents CA-NKCF, a novel distributed latent state estimator combining partial domain knowledge with deep neural networks, achieving robust performance without noise statistics knowledge, outperforming traditional filters in linear, chaotic, and wireless tracking environments.
# Learning to Distributedly Estimate under Partially Known Dynamics: A Covariance-Agnostic Neural Kalman Consensus Filter
Source: [https://arxiv.org/html/2606.28441](https://arxiv.org/html/2606.28441)
George Stamatelis, Kyriakos Stylianopoulos, and George C. Alexandropoulos
The authors are with the Department of Informatics and Telecommunications, National and Kapodistrian University of Athens, Panepistimiopolis Ilissia, 16122 Athens, Greece (e-mails: {georgestamat, kstylianop, alexandg}@di.uoa.gr). This work has been supported by the SNS JU project 6G-DISAC under the European Union’s Horizon Europe research and innovation programme under Grant Agreement No 101139130. G. Stamatelis was supported by the Hellenic Foundation for Research and Innovation (HFRI) under the 5th Call for HFRI PhD Fellowships (Fellowship Number: 21080).
###### Abstract
Online latent state estimation constitutes a fundamental challenge within the artificial intelligence field, serving as a foundational tool for diverse applications, including sequential decision making, anomaly and change\-point detection\. In this paper, a novel online distributed sensing framework, where agents collaborate and exchange information to perform latent state estimation, is presented\. The proposed estimator combines available partial domain knowledge with the representation capabilities of deep neural networks\. In particular, the designed sensing framework incorporates prior estimates, optimized consensus weights, and Kalman\-like recursive updates to perform decentralized inference, without relying on knowledge of noise statistics\. Extensive experiments on linear, chaotic \(Lorenz\), and practical wireless tracking environments reveal that the proposed Covariance\-Agnostic Neural Kalman Consensus Filter \(CA\-NKCF\) outperforms traditional distributed Kalman and particle filters as well as purely model\-free deep neural networks, exhibiting robustness even when the underlying motion and observation models are misspecified\. It is also demonstrated that CA\-NKCF’s performance advantage remains stable across varying noise levels, random communication topologies, latent state dimensions, and observation clutter densities induced by scattering objects in wireless systems\.
## I Introduction
Decentralized decision making under uncertainty is a cornerstone of Multi-Agent (MA) Artificial Intelligence (AI) [[55](https://arxiv.org/html/2606.28441#bib.bib80), [52](https://arxiv.org/html/2606.28441#bib.bib81)]. A key prerequisite for efficient decentralized decisions is accurate online estimation (a.k.a. filtering) of latent state representations. One possible application is in robotics, where a group of robots explores the surrounding space, processes observations using different sensors, and exchanges messages in order to track moving targets; the latent state could be the position or velocity of the targets [[22](https://arxiv.org/html/2606.28441#bib.bib30), [35](https://arxiv.org/html/2606.28441#bib.bib29), [5](https://arxiv.org/html/2606.28441#bib.bib31)]. Another possible application is in large-scale sensor networks tasked with monitoring a physical phenomenon, e.g., a spreading fire or a gas leak [[44](https://arxiv.org/html/2606.28441#bib.bib82), [4](https://arxiv.org/html/2606.28441#bib.bib83)]. Due to bandwidth limitations, sending all sensing data to a central node, access point, or Base Station (BS) may be infeasible or too time consuming for safety applications. Alternatively, each sensor may exchange local readings with nearby peers to track the underlying system state.
In the centralized, single-sensor domain, the most established estimator is the Kalman Filter (KF) [[26](https://arxiv.org/html/2606.28441#bib.bib28)] and its extensions, such as the Extended KF (EKF) [[22](https://arxiv.org/html/2606.28441#bib.bib30)] and the Unscented KF (UKF) [[62](https://arxiv.org/html/2606.28441#bib.bib42)]. The KF is the optimal estimator for linear systems with zero-mean Gaussian noise. In fact, despite being developed over 50 years ago, it finds numerous engineering applications as well as emerging AI applications [[60](https://arxiv.org/html/2606.28441#bib.bib40), [59](https://arxiv.org/html/2606.28441#bib.bib41)]. However, for realistic nonlinear systems, KF extensions are suboptimal. Furthermore, one key limitation of such Model-Based (MB) approaches is that they require precise specification of the underlying system dynamics. This requirement renders them unsuitable for systems where the state and observation generating functions are only approximations of the true dynamics. Additionally, in some high-dimensional systems, the noise can be complicated and nonstationary, and hence infeasible to model accurately [[54](https://arxiv.org/html/2606.28441#bib.bib32)].
On the other hand, purely Model-Free (MF) approaches, such as end-to-end Recurrent Neural Networks (RNNs) [[23](https://arxiv.org/html/2606.28441#bib.bib57), [9](https://arxiv.org/html/2606.28441#bib.bib54)], do not utilize any domain knowledge, usually sacrificing performance. To that end, KFs aided by artificial Neural Networks (NNs) [[54](https://arxiv.org/html/2606.28441#bib.bib32)] have emerged as a powerful synergy, combining the representational power of deep NNs with the available domain knowledge provided either by application experts or by offline estimation methods (e.g., [[20](https://arxiv.org/html/2606.28441#bib.bib35), [15](https://arxiv.org/html/2606.28441#bib.bib36)]). Besides the potential for improved performance, these hybrid schemes offer improved interpretability over purely MF estimation mechanisms, a feature that is beneficial for safety-critical monitoring applications. NN-aided KFs with exact noise knowledge are discussed, for example, in [[50](https://arxiv.org/html/2606.28441#bib.bib44), [19](https://arxiv.org/html/2606.28441#bib.bib43), [11](https://arxiv.org/html/2606.28441#bib.bib33)]. For instance, the work in [[50](https://arxiv.org/html/2606.28441#bib.bib44)] utilizes a two-step approach, where a conventional estimation algorithm is first applied, and then NNs perform error correction. A recent group of works focuses on the more generally applicable case of partial domain knowledge, designing estimation filters without covariance information [[48](https://arxiv.org/html/2606.28441#bib.bib45), [6](https://arxiv.org/html/2606.28441#bib.bib46), [56](https://arxiv.org/html/2606.28441#bib.bib47)]. In these works, a prior of the state is computed according to the KF, and then specific features are extracted and used by RNNs to estimate the Kalman Gain (KG) without the availability of the noise covariance matrices. That NN-based KG is then used in the posterior filtering operation.
In this paper, we focus on MA estimation systems and extend the aforedescribed NN\-based, covariance\-agnostic idea to distributed filtering\. This distributed latent state estimation goal is actually more challenging due to the lack of global information and the induced communication costs\. In particular, the agents/nodes performing estimation need to reach consensus \(i\.e\., similar estimates\) and not just minimize their local errors; to this end, consensus algorithms constitute an entire research sub\-field with many intricacies\[[12](https://arxiv.org/html/2606.28441#bib.bib84),[3](https://arxiv.org/html/2606.28441#bib.bib85)\]\. In fact, even for linear distributed systems, optimal filters do not exist or are computationally prohibitive, forcing a reliance on heuristic consensus\-based Kalman updates\[[42](https://arxiv.org/html/2606.28441#bib.bib4)\]\. Additionally, in the MA Reinforcement Learning \(MARL\) community, it is well known that MA optimization is more challenging due to the inherent nonstationarity of the learning process\[[32](https://arxiv.org/html/2606.28441#bib.bib48),[38](https://arxiv.org/html/2606.28441#bib.bib49)\]\. More specifically, as one agent updates its estimator model, it alters the distribution of the transmitted messages to neighboring nodes\. Consequently, the environment dynamics of all other agents change, and previously learned behaviors must constantly adapt\. In the following, we summarize the contributions of this paper:
- A novel distributed estimator, termed the Covariance-Agnostic Neural Kalman Consensus Filter (CA-NKCF), is presented, which fuses principles from MARL with model-informed NN-based filtering and a novel consensus mechanism. A primary advantage of CA-NKCF is its lightweight communication structure: unlike advanced information-weighted filters [[40](https://arxiv.org/html/2606.28441#bib.bib74), [27](https://arxiv.org/html/2606.28441#bib.bib5)] that require exchanging full covariance or information matrices, the proposed approach requires only the exchange of state priors, substantially reducing the inter-agent/-node communication bandwidth. In addition, to overcome the inherent nonstationarity of MA optimization, the proposed NN filtering modules and consensus weights are jointly optimized offline using a central dataset. Inspired by the Centralized Learning, Decentralized Execution (CLDE) paradigm [[32](https://arxiv.org/html/2606.28441#bib.bib48), [38](https://arxiv.org/html/2606.28441#bib.bib49)], the considered training allows the framework to discover effective collaborative behaviors, and promotes extreme scalability by sharing copies of identical optimized NNs across nodes.
- We provide strong mathematical intuition regarding the stability of the proposed NN-based consensus mechanism. It is shown that, for each estimation node, the posterior estimate is a convex combination of its local prior and the priors of its neighboring nodes. This indicates that each local posterior is confined to the multi-dimensional convex hull of all priors, a fact that prevents catastrophic erroneous updates.
- We present extensive numerical experiments that showcase the superiority of the proposed CA-NKCF approach over a wide range of distributed estimators, including both MB and purely data-driven approaches, in both linear and nonlinear systems. Furthermore, a rigorous ablation study is used to demonstrate the critical importance of a unified loss function, a central training procedure, and the joint optimization of both the NN parameters and the consensus weights.
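The convex-hull argument in the second contribution can be illustrated with a minimal Python sketch that fuses a node's local prior with its neighbors' priors using nonnegative weights that sum to one. The softmax parameterization, the function name `convex_fusion`, and the toy numbers are assumptions introduced here for illustration only; they are not taken from the CA-NKCF design. The final check verifies only a necessary consequence of the claim, namely that the fused estimate stays inside the coordinate-wise bounding box of the priors.

```python
import numpy as np

def convex_fusion(priors, logits):
    """Fuse priors of shape (k, s) with softmax weights built from logits of shape (k,).

    Hypothetical helper: the softmax is one assumed way to obtain convex weights.
    """
    w = np.exp(logits - np.max(logits))  # shift by the maximum for numerical stability
    w = w / w.sum()                      # nonnegative weights that sum to one
    return w @ priors, w

# Toy example: the local prior followed by two neighbor priors (illustrative values).
priors = np.array([[0.0, 0.0], [2.0, 4.0], [4.0, 2.0]])
fused, w = convex_fusion(priors, np.array([0.3, -1.2, 2.0]))

# A convex combination cannot leave the coordinate-wise range spanned by the priors.
inside_box = bool(np.all(fused >= priors.min(axis=0)) and np.all(fused <= priors.max(axis=0)))
```

Whatever the logits are, the fused estimate cannot move farther from the priors than the priors are from each other, which is the intuition behind the claim that a single bad update cannot be catastrophic.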
The remainder of this paper is organized as follows\. A literature review of MB and distributed filtering schemes is provided in Section II, whereas the mathematical background of KFs and related algorithms is included in Section III, along with the data\-driven problem formulation under investigation\. Our novel CA\-NKCF approach is presented in Section IV, and test performance is experimentally assessed in Section V\. Finally, Section VI includes the paper’s concluding remarks\.
Notation: Lower-case bold letters refer to vectors (e.g., $\mathbf{x}$), upper-case bold letters to matrices (e.g., $\mathbf{X}$), and calligraphic letters indicate sets (e.g., $\mathcal{X}$). $\mathcal{N}(\mu,\sigma)$ represents a Gaussian distribution with mean $\mu$ and variance $\sigma$, whereas $\mathbb{E}[\cdot]$ denotes expectation. The notation $\mathbf{I}_{d}$ stands for the $d\times d$ ($d\geq 2$) identity matrix, and $|\cdot|$ denotes the cardinality operator.
## II Related Work
### II-A Conventional Distributed Filtering
Decentralized implementations of the celebrated KF algorithms have been studied by the control community for over two decades, initially for ideal fully-connected networks [[46](https://arxiv.org/html/2606.28441#bib.bib58)]. The foundations for scalable decentralized KF with consensus algorithms for systems with sparse and possibly time-varying topologies were laid by Olfati-Saber with the seminal Kalman Consensus Filter (KCF) [[42](https://arxiv.org/html/2606.28441#bib.bib4)]. This algorithm combines local KF-like updates with a consensus imposed on the weighted sensor disagreement. Although this filter has not been proven to be optimal, due to the heuristic variance-dependent choice of the consensus weights, it remains very popular to date due to its satisfactory performance, simplicity, and numerical stability.
Improvements that either run multiple consensus steps on the information vectors [[27](https://arxiv.org/html/2606.28441#bib.bib5)] or compute the optimal value of consensus weights [[13](https://arxiv.org/html/2606.28441#bib.bib3), [29](https://arxiv.org/html/2606.28441#bib.bib6)] have been proposed, at the expense, however, of much higher computational costs, practicality limitations, and stability risks. Computing optimal values as proposed in [[13](https://arxiv.org/html/2606.28441#bib.bib3), [29](https://arxiv.org/html/2606.28441#bib.bib6)] requires maintaining and performing consensus on covariance matrices, and then performing multiple matrix inversions on massive block matrices for observation correlations, hindering numerical stability. All in all, even though these papers produce elegant mathematical optimality results, Olfati-Saber’s KCF remains very popular due to its easy implementation and stability. Additionally, progress has been made in combining distributed filters with practical engineering challenges, e.g., non-Gaussian observations [[24](https://arxiv.org/html/2606.28441#bib.bib7)], limited sensor range [[36](https://arxiv.org/html/2606.28441#bib.bib10)], communication costs [[37](https://arxiv.org/html/2606.28441#bib.bib11), [18](https://arxiv.org/html/2606.28441#bib.bib9)], as well as privacy risks [[39](https://arxiv.org/html/2606.28441#bib.bib8)].
It is noted that, while KF-based approaches are the most widely utilized estimators, an alternative family of methods known as Particle Filters (PFs) can be used in nonlinear systems instead. These filters constitute non-parametric sequential Monte Carlo estimators that have been used for single-agent [[8](https://arxiv.org/html/2606.28441#bib.bib50), [58](https://arxiv.org/html/2606.28441#bib.bib51), [14](https://arxiv.org/html/2606.28441#bib.bib52)] as well as distributed MA [[10](https://arxiv.org/html/2606.28441#bib.bib53)] systems. However, these approaches carry much greater computational overhead and are highly sensitive to the number of particles simulated.
### II-B Model-Based Neural Filtering
There are two main approaches for online hidden state estimation in discrete-time Dynamical Systems (DSs) using MB NNs: i) external architectures; and ii) NNs embedded in the KF logic. Prior works on i) either utilize NNs to extract features from high-dimensional observations, which are then combined with known state updates [[11](https://arxiv.org/html/2606.28441#bib.bib33)], or utilize RNNs to perform error correction on traditional filters [[50](https://arxiv.org/html/2606.28441#bib.bib44)]. Exact specification of the state evolution’s mean and covariance is required by these approaches. Most works on ii) are based on the KalmanNet framework [[48](https://arxiv.org/html/2606.28441#bib.bib45)]. This work first proposed utilizing supervised Gated Recurrent Units (GRUs) [[9](https://arxiv.org/html/2606.28441#bib.bib54)] to estimate the KG without the need for covariance matrix knowledge. The proposed hybrid model was shown to outperform both MB KFs and PFs as well as fully MF RNNs. Since then, various extensions of the original KalmanNet framework for different types of DSs have been proposed. For instance, [[6](https://arxiv.org/html/2606.28441#bib.bib46)] combines the KalmanNet framework with deep convolutional feature extractors for tracking problems that profit from visual observations from cameras. Very recently, an extension of KalmanNet with an additional GRU, termed MJFNet [[56](https://arxiv.org/html/2606.28441#bib.bib47)], was designed to filter trajectories with switching behavior. One alternative framework in this line of research is the fully unsupervised Data-Driven Nonlinear State Estimation (DANSE) model [[19](https://arxiv.org/html/2606.28441#bib.bib43)], which does not require any specification of the latent state evolution model. However, it is limited to linear observations with fully known Gaussian noise.
MB deep learning techniques for latent state estimation combine domain knowledge of traditional algorithms with the expressiveness and generalization capabilities of modern NNs to improve performance. MB deep learning has also been proposed for other tasks like smoothing, i.e., offline/noncausal state estimation [[50](https://arxiv.org/html/2606.28441#bib.bib44), [49](https://arxiv.org/html/2606.28441#bib.bib38), [34](https://arxiv.org/html/2606.28441#bib.bib59)]. Besides discriminative learning problems, NNs can also be used to estimate the underlying dynamics of high-dimensional nonlinear DSs, which is a form of generative learning [[33](https://arxiv.org/html/2606.28441#bib.bib34), [20](https://arxiv.org/html/2606.28441#bib.bib35), [15](https://arxiv.org/html/2606.28441#bib.bib36), [16](https://arxiv.org/html/2606.28441#bib.bib37)]. Note that discriminative and generative approaches can be combined, i.e., by first learning the dynamics offline with a generative algorithm and then training a discriminative NN-based KF. In that case, the learned dynamics constitute the state and observation generating recursions utilized by the filter. It is, however, highlighted that all aforementioned works are limited to centralized single-node systems, where a single processing center infers hidden state information using all available observations.
It is finally noted for completeness that, apart from estimation of time\-varying processes, MB deep learning\[[53](https://arxiv.org/html/2606.28441#bib.bib56)\]has numerous interesting applications, e\.g\., medical imaging\[[2](https://arxiv.org/html/2606.28441#bib.bib19)\], near\-field localization\[[17](https://arxiv.org/html/2606.28441#bib.bib20)\], as well as dimensionality reduction\[[7](https://arxiv.org/html/2606.28441#bib.bib21)\]\. An important subcategory of MB deep learning is physics\-informed NNs, where physical laws are directly integrated into the NN operation, e\.g\.,\[[45](https://arxiv.org/html/2606.28441#bib.bib23),[64](https://arxiv.org/html/2606.28441#bib.bib24),[57](https://arxiv.org/html/2606.28441#bib.bib25),[65](https://arxiv.org/html/2606.28441#bib.bib27)\]\.
## III Preliminaries
### III-A Centralized Filters
In the centralized setting (single node/sensor/agent), discrete-time DSs (indexed by time $t$) are described as follows:
$$\mathbf{x}_{t+1} = f(\mathbf{x}_{t}) + \mathbf{w}_{t} \in \mathbb{R}^{s}, \tag{1a}$$
$$\mathbf{z}_{t} = h(\mathbf{x}_{t}) + \mathbf{v}_{t} \in \mathbb{R}^{o}, \tag{1b}$$
where $f(\cdot)$ and $h(\cdot)$ are termed as the transition and observation functions, respectively, whereas $\mathbf{w}_{t}$ and $\mathbf{v}_{t}$ are noise vectors. MB estimators typically assume that $\mathbf{w}_{t} \sim \mathcal{N}(0, \mathbf{Q})$ and $\mathbf{v}_{t} \sim \mathcal{N}(0, \mathbf{R})$, where $\mathbf{Q} \in \mathbb{R}^{s \times s}$ and $\mathbf{R} \in \mathbb{R}^{o \times o}$.
A simple yet fundamental category of DSs is the linear DS, where the functions $f(\cdot)$ and $h(\cdot)$ are respectively given by the matrices $\mathbf{F}$ and $\mathbf{H}$, yielding the state-space equations:
$$\mathbf{x}_{t+1} = \mathbf{F}\mathbf{x}_{t} + \mathbf{w}_{t} \in \mathbb{R}^{s}, \tag{2a}$$
$$\mathbf{z}_{t} = \mathbf{H}\mathbf{x}_{t} + \mathbf{v}_{t} \in \mathbb{R}^{o}. \tag{2b}$$
We are concerned with filtering, i.e., estimating the current latent variable $\mathbf{x}_{t}$, leveraging the past and present observations $\mathbf{z}_{1:t} \triangleq [\mathbf{z}_{1}, \mathbf{z}_{2}, \ldots, \mathbf{z}_{t}]$. The prior and posterior estimates at the time instance $t$ are respectively denoted as follows:
$$\hat{\mathbf{x}}_{t|t-1} \triangleq \mathbb{E}[\mathbf{x}_{t} | \mathbf{z}_{1:t-1}], \quad \hat{\mathbf{x}}_{t|t} \triangleq \mathbb{E}[\mathbf{x}_{t} | \mathbf{z}_{1:t}]. \tag{3}$$
Consequently, the prior and posterior errors are denoted as $\boldsymbol{\eta}_{t|t-1}$ and $\boldsymbol{\eta}_{t|t}$, and the respective error covariance matrices are defined as $\mathbf{P}_{t} \triangleq \mathbb{E}[\boldsymbol{\eta}_{t|t-1}\boldsymbol{\eta}_{t|t-1}^{T}]$ and $\mathbf{M}_{t} \triangleq \mathbb{E}[\boldsymbol{\eta}_{t|t}\boldsymbol{\eta}_{t|t}^{T}]$. The KF for linear DSs is given by the following recursion:
$$\mathbf{K}_{t} = \mathbf{P}_{t}\mathbf{H}^{T}(\mathbf{R} + \mathbf{H}\mathbf{P}_{t}\mathbf{H}^{T})^{-1}, \tag{4a}$$
$$\hat{\mathbf{x}}_{t|t} = \hat{\mathbf{x}}_{t|t-1} + \mathbf{K}_{t}(\mathbf{z}_{t} - \mathbf{H}\hat{\mathbf{x}}_{t|t-1}), \tag{4b}$$
$$\mathbf{M}_{t} = \mathbf{P}_{t} - \mathbf{P}_{t}\mathbf{H}^{T}(\mathbf{R} + \mathbf{H}\mathbf{P}_{t}\mathbf{H}^{T})^{-1}\mathbf{H}\mathbf{P}_{t}, \tag{4c}$$
$$\mathbf{P}_{t+1} = \mathbf{F}\mathbf{M}_{t}\mathbf{F}^{T} + \mathbf{Q}, \tag{4d}$$
$$\hat{\mathbf{x}}_{t+1|t} = \mathbf{F}\hat{\mathbf{x}}_{t|t}. \tag{4e}$$
The variable $\mathbf{K}_{t}$ in expression ([4a](https://arxiv.org/html/2606.28441#S3.E4.1)) is known as the KG. The KalmanNet framework [[48](https://arxiv.org/html/2606.28441#bib.bib45), [6](https://arxiv.org/html/2606.28441#bib.bib46), [56](https://arxiv.org/html/2606.28441#bib.bib47)] performs the same a priori estimate for $\hat{\mathbf{x}}_{t|t-1}$ as the KF, but utilizes a GRU with parameters $\theta$ in order to estimate $\mathbf{K}_{t}$. The approximate KG $\mathbf{K}_{t,\theta}$ is then plugged into expression ([4b](https://arxiv.org/html/2606.28441#S3.E4.2)). It is noted for completeness that the EKF is a popular yet suboptimal approach for filtering nonlinear DSs, where the state updates become as follows:
$$\hat{\mathbf{x}}_{t|t} = \hat{\mathbf{x}}_{t|t-1} + \mathbf{K}_{t}(\mathbf{z}_{t} - h(\hat{\mathbf{x}}_{t|t-1})), \quad \hat{\mathbf{x}}_{t+1|t} = f(\hat{\mathbf{x}}_{t|t}). \tag{5}$$
For the covariance updates and the KG computation, matrices $\mathbf{F}$ and $\mathbf{H}$ are replaced by the following Jacobians, which are evaluated at the state estimates:
$$\tilde{\mathbf{F}}_{t} = \mathbf{J}_{f}(\hat{\mathbf{x}}_{t-1|t-1}), \quad \tilde{\mathbf{H}}_{t} = \mathbf{J}_{h}(\hat{\mathbf{x}}_{t|t-1}). \tag{6}$$
It needs to be noted that the KalmanNet algorithm [[48](https://arxiv.org/html/2606.28441#bib.bib45)] supports the use of EKF-like state predictions.
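For concreteness, one iteration of the linear KF recursion in (4) can be sketched in NumPy as follows. This is a minimal illustration under the linear-Gaussian model in (2); the function name `kf_step` and its argument order are choices made here rather than part of the original text, and a production implementation would typically solve a linear system instead of forming the explicit inverse.

```python
import numpy as np

def kf_step(x_prior, P_prior, z, F, H, Q, R):
    """One iteration of recursion (4): correct with z_t, then predict for t+1.

    Hypothetical helper; x_prior and P_prior play the roles of
    x_hat_{t|t-1} and P_t, respectively.
    """
    S = R + H @ P_prior @ H.T                 # innovation covariance
    K = P_prior @ H.T @ np.linalg.inv(S)      # Kalman gain, (4a)
    x_post = x_prior + K @ (z - H @ x_prior)  # posterior estimate, (4b)
    M = P_prior - K @ H @ P_prior             # posterior covariance, (4c)
    P_next = F @ M @ F.T + Q                  # next prior covariance, (4d)
    x_next = F @ x_post                       # next prior estimate, (4e)
    return x_post, M, x_next, P_next
```

A KalmanNet-style filter, as described above, would keep the lines implementing (4b) and (4e) but replace the computation of `K` with the output of a learned recurrent model, so that `Q`, `R`, and the covariance recursion are no longer needed.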
### III-B The Kalman Consensus Filter (KCF)
Consider an environment with $N$ sensors (indexed with $i = 1, 2, \ldots, N$) which collect distinct, possibly overlapping, measurements according to the following model:
$$\mathbf{z}_{i,t} = \mathbf{H}_{i}\mathbf{x}_{t} + \mathbf{v}_{i,t} \in \mathbb{R}^{o_{i}}, \tag{7}$$
where the covariance matrix of each $i$-th observation noise is represented as $\mathbf{R}_{i} \in \mathbb{R}^{o_{i} \times o_{i}}$. It is assumed that each sensor node $i$ can exchange messages with a set of neighbors $\mathcal{N}_{i,t}$; furthermore, we write $\mathcal{J}_{i,t} \triangleq \mathcal{N}_{i,t} \cup \{i\}$. The nodes are essentially located in a graph, where edges indicate sensor communication (i.e., message exchange). We assume that the graph topology can evolve with time, and is not controlled by the nodes. Algorithms that combine state estimation models with intelligent topology design are a very interesting potential research direction, but lie outside the scope of this work.
The seminal work in [[42](https://arxiv.org/html/2606.28441#bib.bib4)] proposed the KCF, according to which each sensor node $i$ broadcasts its local prior state estimate, and the Kalman posterior update is combined with a consensus update rule based on the prior estimates. The update recursion of KCF is defined as follows:
$$\mathbf{K}_{i,t} = \mathbf{P}_{i,t}\mathbf{H}_{i}^{T}(\mathbf{R}_{i} + \mathbf{H}_{i}\mathbf{P}_{i,t}\mathbf{H}_{i}^{T})^{-1}, \tag{8a}$$
$$\hat{\mathbf{x}}_{i,t|t} = \hat{\mathbf{x}}_{i,t|t-1} + \mathbf{K}_{i,t}(\mathbf{z}_{i,t} - \mathbf{H}_{i}\hat{\mathbf{x}}_{i,t|t-1}) + \mathbf{C}_{i,t}\sum_{j \in \mathcal{N}_{i,t}}(\hat{\mathbf{x}}_{j,t|t-1} - \hat{\mathbf{x}}_{i,t|t-1}), \tag{8b}$$
$$\mathbf{A}_{i,t} = \mathbf{I} - \mathbf{K}_{i,t}\mathbf{H}_{i}, \tag{8c}$$
$$\mathbf{M}_{i,t} = \mathbf{A}_{i,t}\mathbf{P}_{i,t}\mathbf{A}_{i,t}^{T} + \mathbf{K}_{i,t}\mathbf{R}_{i}\mathbf{K}_{i,t}^{T}, \tag{8d}$$
$$\mathbf{P}_{i,t+1} = \mathbf{F}\mathbf{M}_{i,t}\mathbf{F}^{T} + \mathbf{Q}, \tag{8e}$$
$$\hat{\mathbf{x}}_{i,t+1|t} = \mathbf{F}\hat{\mathbf{x}}_{i,t|t}, \tag{8f}$$
where a typical heuristic choice for the consensus weight is:
$$\mathbf{C}_{i,t} \triangleq \epsilon\frac{\mathbf{P}_{i,t}}{1 + \|\mathbf{P}_{i,t}\|_{\rm F}} \in \mathbb{R}^{s \times s}, \tag{9}$$
with $\epsilon$ being an appropriately chosen hyperparameter.
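The KCF recursion in (8) with the heuristic consensus weight in (9) can be sketched as below. This is a minimal NumPy illustration that loops over all nodes in a single process for readability, whereas in a deployment each node would run only its own iteration of the loop. The function name `kcf_step`, the list-based data layout, and the default value of `eps` are assumptions made here.

```python
import numpy as np

def kcf_step(x_priors, P_priors, zs, Hs, Rs, F, Q, neighbors, eps=0.1):
    """One KCF iteration (8) for every node; neighbors[i] lists the indices in N_{i,t}.

    Hypothetical helper; all list arguments are indexed by node.
    """
    x_posts, x_next, P_next = [], [], []
    for i in range(len(x_priors)):
        x_i, P_i, H_i, R_i = x_priors[i], P_priors[i], Hs[i], Rs[i]
        K = P_i @ H_i.T @ np.linalg.inv(R_i + H_i @ P_i @ H_i.T)   # (8a)
        C = eps * P_i / (1.0 + np.linalg.norm(P_i, "fro"))         # (9)
        # Sum of prior disagreements with the neighbors (uses priors only).
        disagreement = sum((x_priors[j] - x_i for j in neighbors[i]),
                           np.zeros_like(x_i))
        x_post = x_i + K @ (zs[i] - H_i @ x_i) + C @ disagreement  # (8b)
        A = np.eye(len(x_i)) - K @ H_i                             # (8c)
        M = A @ P_i @ A.T + K @ R_i @ K.T                          # (8d)
        x_posts.append(x_post)
        P_next.append(F @ M @ F.T + Q)                             # (8e)
        x_next.append(F @ x_post)                                  # (8f)
    return x_posts, x_next, P_next
```

Note that only the prior estimates `x_priors[j]` cross node boundaries in this sketch, which matches the message content described for the baseline KCF.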
It is noted that, while the baseline recursion in ([8](https://arxiv.org/html/2606.28441#S3.E8)) relies on a heuristic consensus matrix and the exchange of only prior state estimates, a rigorously derived variant in [[40](https://arxiv.org/html/2606.28441#bib.bib74), Algorithm 3] offers improved performance. This advanced implementation employs an information-weighted formulation where nodes broadcast their local information vector ($\mathbf{u}_{i,t}$) and information matrix ($\mathbf{U}_{i,t}$) alongside their state predictions. By aggregating this data from the neighborhood $\mathcal{J}_{i,t}$, the distributed update reduces estimate disagreement across the MA network and bypasses the traditional KG matrix calculation. The complete recursion for this formulation is defined as:
$$\mathbf{u}_{j,t} = \mathbf{H}_{j}^{T}\mathbf{R}_{j}^{-1}\mathbf{z}_{j,t} \quad \forall j \in \mathcal{J}_{i,t}, \tag{10a}$$
$$\mathbf{U}_{j,t} = \mathbf{H}_{j}^{T}\mathbf{R}_{j}^{-1}\mathbf{H}_{j} \quad \forall j \in \mathcal{J}_{i,t}, \tag{10b}$$
$$\mathbf{y}_{i,t} = \sum_{j \in \mathcal{J}_{i,t}}\mathbf{u}_{j,t}, \tag{10c}$$
$$\mathbf{S}_{i,t} = \sum_{j \in \mathcal{J}_{i,t}}\mathbf{U}_{j,t}, \tag{10d}$$
$$\mathbf{M}_{i,t} = \left(\mathbf{P}_{i,t|t-1}^{-1} + \mathbf{S}_{i,t}\right)^{-1}, \tag{10e}$$
$$\hat{\mathbf{x}}_{i,t|t} = \hat{\mathbf{x}}_{i,t|t-1} + \mathbf{M}_{i,t}\left(\mathbf{y}_{i,t} - \mathbf{S}_{i,t}\hat{\mathbf{x}}_{i,t|t-1}\right) + \epsilon\mathbf{M}_{i,t}\sum_{j \in \mathcal{N}_{i,t}}\left(\hat{\mathbf{x}}_{j,t|t-1} - \hat{\mathbf{x}}_{i,t|t-1}\right), \tag{10f}$$
$$\mathbf{P}_{i,t+1|t} = \mathbf{F}\mathbf{M}_{i,t}\mathbf{F}^{T} + \mathbf{Q}, \tag{10g}$$
$$\hat{\mathbf{x}}_{i,t+1|t} = \mathbf{F}\hat{\mathbf{x}}_{i,t|t}. \tag{10h}$$
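The information-weighted recursion in (10) can be sketched for a single node as follows. This is a minimal NumPy illustration: the function name `info_kcf_step` and the data layout are assumptions made here, and the local computation of every neighbor's $\mathbf{u}_{j,t}$ and $\mathbf{U}_{j,t}$ stands in for the messages that the neighbors would broadcast in practice.

```python
import numpy as np

def info_kcf_step(i, x_priors, P_prior_i, zs, Hs, Rs, F, Q, neighbors, eps=0.1):
    """Recursion (10) at node i; neighbors[i] lists N_{i,t}, and J_{i,t} adds i itself.

    Hypothetical helper; P_prior_i plays the role of P_{i,t|t-1}.
    """
    closed = list(neighbors[i]) + [i]                                # J_{i,t}
    R_invs = {j: np.linalg.inv(Rs[j]) for j in closed}
    y = sum(Hs[j].T @ R_invs[j] @ zs[j] for j in closed)             # (10a), (10c)
    S = sum(Hs[j].T @ R_invs[j] @ Hs[j] for j in closed)             # (10b), (10d)
    M = np.linalg.inv(np.linalg.inv(P_prior_i) + S)                  # (10e)
    x_i = x_priors[i]
    disagreement = sum((x_priors[j] - x_i for j in neighbors[i]),
                       np.zeros_like(x_i))
    x_post = x_i + M @ (y - S @ x_i) + eps * (M @ disagreement)      # (10f)
    P_next = F @ M @ F.T + Q                                         # (10g)
    x_next = F @ x_post                                              # (10h)
    return x_post, x_next, P_next
```

Compared with the baseline KCF, each message here carries an $s \times s$ matrix in addition to vectors, which is the communication overhead that the covariance-agnostic approach of this paper aims to avoid.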
### III-C Estimation Problem Formulation
In this work, we wish to design a data-driven distributed filtering algorithm, such that each sensor node $i$ can reliably estimate the true state $\mathbf{x}_{t}$, based on past observations and received messages. Formally, node $i$ employs an NN with parameters $\theta_{i}$ in order to estimate the latent state vector as:
$$\hat{\mathbf{x}}_{i,t|t} = \mathbb{E}[\mathbf{x}_{t} | \mathbf{z}_{i,1:t}; \theta_{i}]. \tag{11}$$
In particular, the core objective is to minimize the average Mean Squared Error (MSE) loss; in mathematical terms:
$$\min_{\theta_{1}, \theta_{2}, \ldots, \theta_{N}} \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\left[\left\|\hat{\mathbf{x}}_{i,t|t} - \mathbf{x}_{t}\right\|^{2}\right]. \tag{12}$$
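An empirical counterpart of the objective in (12) can be computed as in the following minimal NumPy sketch. Replacing the expectation with an average over the time steps of one trajectory is an assumption made here for illustration, as are the function name `average_mse` and the array layout.

```python
import numpy as np

def average_mse(x_hat, x_true):
    """Empirical counterpart of (12) for a single trajectory.

    x_hat:  array of shape (N, T, s) with the posterior estimates of each node.
    x_true: array of shape (T, s) with the ground-truth latent states.
    """
    # Squared Euclidean error per node and time step, shape (N, T).
    sq_err = np.sum((x_hat - x_true[None, :, :]) ** 2, axis=-1)
    # Average over nodes (the 1/N sum) and over time (in place of the expectation).
    return float(sq_err.mean())
```

Because the loss averages over all nodes, a single scalar couples the estimates of the whole network, which is what makes joint optimization of the per-node parameters meaningful.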
Following the reasoning of established works [[48](https://arxiv.org/html/2606.28441#bib.bib45), [49](https://arxiv.org/html/2606.28441#bib.bib38)], we proceed by adopting the following realistic assumptions:
- The noise distribution is unknown. In practical systems, noise distributions are often complex and nonstationary, and thus hard to estimate, necessitating filters that do not rely on the exact specification of noise statistics.
- Approximations of the transition function $f(\cdot)$ in ([1a](https://arxiv.org/html/2606.28441#S3.E1.1)) and the observation functions $h_{i}(\cdot)$ $\forall i$ in the general version of ([7](https://arxiv.org/html/2606.28441#S3.E7)), i.e., $\mathbf{z}_{i,t} = h_{i}(\mathbf{x}_{t}) + \mathbf{v}_{i,t}$, are available, either estimated offline or provided by an application expert. However, in the results section, we will investigate the effectiveness of our scheme when these approximations are incorrect.
Furthermore, in order to train our NN models, we will assume the availability of a large labeled dataset with hidden states and sensor\-wise observations, defining the following set:
𝒟≜\{𝐱1:t\(d\),𝐳1,1:t\(d\),𝐳2,1:t\(d\),…,𝐳N,1:t\(d\)\}d=1D\.\\mathcal\{D\}\\triangleq\\\{\\mathbf\{x\}\_\{1:t\}^\{\(d\)\},\\mathbf\{z\}^\{\(d\)\}\_\{1,1:t\},\\mathbf\{z\}^\{\(d\)\}\_\{2,1:t\},\\ldots,\\mathbf\{z\}\_\{N,1:t\}^\{\(d\)\}\\\}\_\{d=1\}^\{D\}\.\(13\)Finally, the design of our distributed estimator, described in the sequel, is influenced by two critical practical requirements:
- R1 Strict Real-Time Causality: In highly dynamic systems, the state transitions from $\mathbf{x}_{t}$ to $\mathbf{x}_{t+1}$ occur rapidly. To prevent predictions from becoming obsolete, the estimator's forward inference time must remain well below the interval between consecutive state updates. This hard latency constraint precludes the use of computationally heavy architectures (e.g., massive transformers\[[61](https://arxiv.org/html/2606.28441#bib.bib55)\] or graph NNs\[[57](https://arxiv.org/html/2606.28441#bib.bib25)\]), high-capacity PFs\[[10](https://arxiv.org/html/2606.28441#bib.bib53)\], or iterative multi-step consensus protocols.
- R2 Massive Scalability: The designed framework needs to be readily deployable in large-scale Internet of Things (IoT) applications comprising a vast number of sensor nodes. This necessitates an architecture that inherently minimizes memory footprint and training complexity, a requirement that directly motivates the Parameter Sharing (PS) strategy introduced in the sequel.
## IV Proposed Distributed Estimation Method
Having established the necessary mathematical background and presented the hybrid machine learning setup, we now present the proposed CA-NKCF. We adopt a PS method in which all $N$ sensor nodes are equipped with the same NN parameters $\theta$ (i.e., $\theta_{i}=\theta$ $\forall i$); this choice is made for scalability. In modern sensor network applications, the number of sensors can be very large, so that initializing and optimizing $N$ separate RNNs carries prohibitive computational and memory costs. By adopting PS in our multi-node setup, only one RNN needs to be trained, which reduces both the training cost and the memory footprint. Notably, PS is often used for large-scale MARL\[[43](https://arxiv.org/html/2606.28441#bib.bib61),[30](https://arxiv.org/html/2606.28441#bib.bib62),[28](https://arxiv.org/html/2606.28441#bib.bib63)\] for the same reason. Furthermore, in distributed estimation, in addition to minimizing the total error, the nodes must reach similar state estimates. Since the NN parameters are shared, we expect the disagreement to be minimal unless the local inputs differ significantly. In the next section, we will demonstrate that CA-NKCF achieves better consensus than MB baselines.
Algorithm 1 Proposed CA-NKCF at each $i$-th Node
Input: Observation $\mathbf{z}_{i,t}$, shared parameters $\theta$ and $\bm{\gamma}$, neighbors $\mathcal{N}_{i,t}$, and previous GRU hidden state $\mathbf{h}_{i,t-1}$.
Output: Posterior $\hat{\mathbf{x}}_{i,t|t}$ and new hidden state $\mathbf{h}_{i,t}$.
Local Prediction:
1: Compute the local prior $\hat{\mathbf{x}}_{i,t|t-1}=f(\hat{\mathbf{x}}_{i,t-1|t-1})$.
2: Broadcast $\hat{\mathbf{x}}_{i,t|t-1}$ to neighbors $j\in\mathcal{N}_{i,t}$.
Neural KG Estimation:
3: Compute the RNN features as in expression ([14](https://arxiv.org/html/2606.28441#S4.E14)).
4: Set $\mathbf{K}_{i,t;\theta},\mathbf{h}_{i,t}=\mathsf{FP}_{\theta}(\bm{\phi}_{i,t};\mathbf{h}_{i,t-1})$.
Consensus-Based Update:
5: Collect all prior estimates $\hat{\mathbf{x}}_{j,t|t-1}$ $\forall j\in\mathcal{N}_{i,t}$.
6: Compute the consensus term: $\mathbf{u}_{i,t}^{\text{cons}}=\frac{1}{|\mathcal{N}_{i,t}|}\sum_{j\in\mathcal{N}_{i,t}}\left(\hat{\mathbf{x}}_{j,t|t-1}-\hat{\mathbf{x}}_{i,t|t-1}\right)$.
7: Perform the update: $\hat{\mathbf{x}}_{i,t|t}=\hat{\mathbf{x}}_{i,t|t-1}+\mathbf{K}_{i,t;\theta}\Delta\hat{\mathbf{z}}_{i,t}+\sigma(\bm{\gamma})\mathbf{u}_{i,t}^{\text{cons}}$.
8: return $\hat{\mathbf{x}}_{i,t|t}$ and $\mathbf{h}_{i,t}$.
To remain consistent with prior works, we realize the NN parametrized by $\theta$ as a GRU, but other models, such as Long Short Term Memory (LSTM) models\[[23](https://arxiv.org/html/2606.28441#bib.bib57)\] or transformers\[[61](https://arxiv.org/html/2606.28441#bib.bib55)\], can be used instead. At each time instance $t$ and for each sensor node $i$, this NN is provided with KF-specific input features $\bm{\phi}_{i,t}$, which it combines with its most recent internal hidden state $\mathbf{h}_{i,t-1}$ to estimate the local KG $\mathbf{K}_{i,t;\theta}$.
Figure 1: Visualization of the proposed CA-NKCF framework for distributed estimation with $N$ sensor nodes.
### IV-A Covariance-Agnostic Neural Kalman Consensus Filtering
At each time instance $t$, each sensor node $i$ performs the following steps, collectively presented in Algorithm [1](https://arxiv.org/html/2606.28441#alg1) and visualized in Fig. [1](https://arxiv.org/html/2606.28441#S4.F1). First, the prior state estimate $\hat{\mathbf{x}}_{i,t|t-1}=f(\hat{\mathbf{x}}_{i,t-1|t-1})$ is computed and transmitted to the node's neighbors. Then, the RNN input features are computed as follows:
$$\bm{\phi}_{i,t}\triangleq\left[\Delta\hat{\mathbf{z}}_{i,t},\Delta\bar{\mathbf{z}}_{i,t},\Delta\hat{\mathbf{x}}_{i,t}\right]\in\mathbb{R}^{(2o_{i}+s)\times 1}, \tag{14}$$
including the following vectors:
$$\Delta\hat{\mathbf{z}}_{i,t}\triangleq\mathbf{z}_{i,t}-h_{i}(\hat{\mathbf{x}}_{i,t|t-1})\in\mathbb{R}^{o_{i}\times 1}, \tag{15a}$$
$$\Delta\bar{\mathbf{z}}_{i,t}\triangleq\mathbf{z}_{i,t}-\mathbf{z}_{i,t-1}\in\mathbb{R}^{o_{i}\times 1}, \tag{15b}$$
$$\Delta\hat{\mathbf{x}}_{i,t}\triangleq\hat{\mathbf{x}}_{i,t-1|t-1}-\hat{\mathbf{x}}_{i,t-1|t-2}\in\mathbb{R}^{s\times 1}. \tag{15c}$$
This feature combination is selected due to its demonstrated effectiveness in prior single-sensor works\[[48](https://arxiv.org/html/2606.28441#bib.bib45),[56](https://arxiv.org/html/2606.28441#bib.bib47)\]. The KG is estimated using the GRU as $\mathbf{K}_{i,t;\theta}=\mathsf{FP}_{\theta}(\bm{\phi}_{i,t};\mathbf{h}_{i,t-1})$, where $\mathbf{h}_{i,t-1}$ is the previous hidden state of the GRU and $\mathsf{FP}_{v}(\cdot;\mathbf{h})$ denotes the forward-pass operation of an RNN with parameters $v$ and hidden state $\mathbf{h}$. For each node $i$, a separate hidden state $\mathbf{h}_{i,t}$ is maintained and updated over time. Using a consensus weight $\gamma_{\xi}$ for each $\xi$-th state component (with $\xi\in\{1,2,\cdots,s\}$), all collected in the consensus vector $\bm{\gamma}$, the posterior estimate is:
$$\begin{aligned}\hat{\mathbf{x}}_{i,t|t}=\;&\hat{\mathbf{x}}_{i,t|t-1}+\mathbf{K}_{i,t;\theta}\left(\mathbf{z}_{i,t}-h_{i}(\hat{\mathbf{x}}_{i,t|t-1})\right)\\&+\frac{\sigma(\bm{\gamma})}{|\mathcal{N}_{i}|}\sum_{j\in\mathcal{N}_{i}}\left(\hat{\mathbf{x}}_{j,t|t-1}-\hat{\mathbf{x}}_{i,t|t-1}\right),\end{aligned} \tag{16}$$
where $\sigma(\cdot)$ is an element-wise sigmoid function that constrains each consensus weight to $(0,1)$ to prevent numerical issues. Note that, in this expression, instead of utilizing a large consensus matrix $\mathbf{C}_{i,t}\in\mathbb{R}^{s\times s}$ as in\[[42](https://arxiv.org/html/2606.28441#bib.bib4)\], we employ a single learnable consensus weight $\gamma_{\xi}$ for each $\xi$-th state component, which models each node's trust in its own information versus the information provided by its peers regarding that specific component. We argue that the complexity of handling nonlinearities, correlations, and elaborate dynamics is sufficiently managed by the proposed neural KG estimation and, hence, that decoupled dimension-wise consensus is sufficient; this will become evident in the numerical investigations presented later on in Section [V](https://arxiv.org/html/2606.28441#S5). Consequently, the optimization process only needs to learn the general informative value and reliability of the shared peer estimates for each individual state variable.
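The feature construction in (15a)-(15c) and the posterior update in (16) can be sketched for a single node as follows. This is a minimal illustration in plain Python and not the authors' implementation: the gain matrix is passed in as an argument instead of being produced by a GRU, the function names `build_features` and `ca_nkcf_update` are hypothetical, and returning a zero consensus term for a node without neighbors is an assumption, since the text does not specify that case.

```python
import math


def sigmoid(v):
    """Scalar logistic function, applied per state component."""
    return 1.0 / (1.0 + math.exp(-v))


def build_features(z_t, z_prev, h_of_prior, post_prev, prior_prev):
    """Stack the difference vectors of (15a)-(15c) into phi_{i,t}."""
    innovation = [a - b for a, b in zip(z_t, h_of_prior)]        # (15a)
    obs_diff = [a - b for a, b in zip(z_t, z_prev)]              # (15b)
    state_diff = [a - b for a, b in zip(post_prev, prior_prev)]  # (15c)
    return innovation + obs_diff + state_diff


def ca_nkcf_update(prior_i, neighbor_priors, gain, innovation, gamma):
    """Posterior of (16) for one node.

    prior_i:          local prior, length s
    neighbor_priors:  list of neighbor priors, each of length s
    gain:             s x o_i matrix as a list of rows (stand-in for the GRU output)
    innovation:       z_{i,t} - h_i(prior), length o_i
    gamma:            unconstrained consensus weights, length s
    """
    s = len(prior_i)
    # Kalman-like correction: gain times innovation.
    correction = [sum(k * e for k, e in zip(row, innovation)) for row in gain]
    # Average disagreement with the neighbors (assumed zero when isolated).
    cons = [0.0] * s
    if neighbor_priors:
        for xi in range(s):
            diffs = (p[xi] - prior_i[xi] for p in neighbor_priors)
            cons[xi] = sum(diffs) / len(neighbor_priors)
    return [
        prior_i[xi] + correction[xi] + sigmoid(gamma[xi]) * cons[xi]
        for xi in range(s)
    ]
```

For instance, with a zero gain, a local prior of `[1.0, 0.0]`, one neighbor prior of `[3.0, 0.0]`, and `gamma = [0.0, 0.0]` (so that the sigmoid equals one half), the update moves the first component halfway toward the neighbor and returns `[2.0, 0.0]`.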
### IV-B Mathematical Intuition Regarding Stability
Providing a rigorous stability proof for a distributed recurrent estimator is a difficult task that falls beyond the scope of this paper. However, we herein provide strong mathematical intuition as to why we expect the consensus mechanism within the proposed CA-NKCF framework to be stable in practice. Focusing on the consensus part of the proposed update rule in ([16](https://arxiv.org/html/2606.28441#S4.E16)), i.e., setting the KG correction term aside, for a single dimension $\xi$, the following is deduced:
$$\begin{aligned}\hat{\mathbf{x}}_{i,t|t}[\xi]&=\hat{\mathbf{x}}_{i,t|t-1}[\xi]+\frac{\sigma(\gamma_{\xi})}{|\mathcal{N}_{i}|}\sum_{j\in\mathcal{N}_{i}}\left(\hat{\mathbf{x}}_{j,t|t-1}[\xi]-\hat{\mathbf{x}}_{i,t|t-1}[\xi]\right)\\&=\underbrace{(1-\sigma(\gamma_{\xi}))}_{\text{self-weight}}\hat{\mathbf{x}}_{i,t|t-1}[\xi]+\sum_{j\in\mathcal{N}_{i}}\underbrace{\frac{\sigma(\gamma_{\xi})}{|\mathcal{N}_{i}|}}_{\text{neighbor-weight}}\hat{\mathbf{x}}_{j,t|t-1}[\xi].\end{aligned} \tag{17}$$
This decomposition implies the following significant properties of the proposed latent state estimation recursion:
1. Non-negativity: The sigmoid activation in ([17](https://arxiv.org/html/2606.28441#S4.E17)) guarantees that $\sigma(\gamma_{\xi})\in(0,1)$; hence, the local self-weight and the individual neighbor weights incorporated into the update rule are strictly positive. Strict positivity is a well-established requirement for the stability of dynamic consensus systems\[[47](https://arxiv.org/html/2606.28441#bib.bib77)\].
2. Convexity: The latter weights sum to unity, since $(1-\sigma(\gamma_{\xi}))+|\mathcal{N}_{i}|\cdot\frac{\sigma(\gamma_{\xi})}{|\mathcal{N}_{i}|}=1$, implying that the posterior estimate in ([17](https://arxiv.org/html/2606.28441#S4.E17)) is a convex combination of the local priors. Consequently, the state estimates are perpetually bounded within the multi-dimensional convex hull of the network's current "beliefs"\[[41](https://arxiv.org/html/2606.28441#bib.bib75)\], precluding the compounding over-corrections and catastrophic divergence often observed in unconstrained end-to-end MA models.
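The two properties above can be checked numerically for one state dimension. The following is a small sketch under the same simplification as (17), namely that only the consensus part of the update is considered; the helper names `consensus_weights` and `consensus_only_step` are illustrative and do not come from the paper.

```python
import math


def consensus_weights(gamma_xi, num_neighbors):
    """Self-weight and per-neighbor weights of (17) for one dimension."""
    s = 1.0 / (1.0 + math.exp(-gamma_xi))  # sigmoid, always in (0, 1)
    return 1.0 - s, [s / num_neighbors] * num_neighbors


def consensus_only_step(own_prior, neighbor_priors, gamma_xi):
    """Convex combination of the local prior and the neighbor priors."""
    w_self, w_nb = consensus_weights(gamma_xi, len(neighbor_priors))
    return w_self * own_prior + sum(w * p for w, p in zip(w_nb, neighbor_priors))


# Example with scalar priors: the output stays between the smallest and
# the largest prior, whatever real value gamma_xi takes.
priors = [0.5, 4.0, 1.0]
out = consensus_only_step(-2.0, priors, gamma_xi=1.3)
```

Because every weight is positive and the weights sum to one, `out` cannot leave the interval spanned by the local prior and the neighbor priors, which is the scalar version of the convex hull argument.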
It is finally noted that, beyond the proposed consensus step, we expect the local Kalman-like tracking to remain stable, as recurrent KG estimation using GRUs has been thoroughly studied and validated in the single-agent domain. Hence, while we cannot rigorously prove global stability and convergence for MA systems, due to the inherent complexity of GRUs, the combination of bounded consensus and reliable local filtering provides a strong intuitive basis for expecting them. This intuition is further corroborated by the numerical experiments provided later on in Section [V](https://arxiv.org/html/2606.28441#S5).
### IV-C Training Optimization
The loss function over the dataset $\mathcal{D}$ in ([13](https://arxiv.org/html/2606.28441#S3.E13)) corresponding to each $i$-th sensor node was chosen as follows:
$$\mathcal{L}_{i}\triangleq\sum_{d=1}^{D}\sum_{t=1}^{T}\left\|\hat{\mathbf{x}}^{(d)}_{i,t|t}-\mathbf{x}_{t}^{(d)}\right\|^{2}, \tag{18}$$
and the total, centralized loss function was set as the average of the local losses. In this paper, we treat the learnable NN weights $\theta$ and the consensus weights $\bm{\gamma}$ as a unified learnable parameter set, formulating the following optimization objective for the considered NN-based distributed estimation framework:
$$\min_{\theta,\bm{\gamma}}\mathcal{L}\triangleq\frac{1}{N}\sum_{i=1}^{N}\mathcal{L}_{i}. \tag{19}$$
The proposed model was trained on the mini-batch version of $\mathcal{L}$. For each trajectory in the batch, the neural filtering operation from $t=0$ up to $t=T$ was conducted sequentially, and the local losses at each time instance $t$ were added to compute the final loss. For moderate horizons (e.g., up to $T=100$), the hidden state vectors $\mathbf{h}_{1,t},\mathbf{h}_{2,t},\ldots,\mathbf{h}_{N,t}$ were initialized at the beginning of the trajectory, and the gradients were computed using Back Propagation Through Time (BPTT). For longer horizons, the trajectories can be split into shorter segments of length $T_{\rm trunc}$ to apply truncated BPTT. To handle datasets with trajectories of different lengths, the trajectories can be padded to the maximum length. The considered training process on a single trajectory is summarized in Algorithm [2](https://arxiv.org/html/2606.28441#alg2).
Algorithm 2 CA-NKCF Optimization on a Single Trajectory
Input: Trajectory data $\mathbf{x}_{1:T},\mathbf{z}_{1,1:T},\mathbf{z}_{2,1:T},\ldots,\mathbf{z}_{N,1:T}$ and neighbors $\{\mathcal{N}_{i,t}\}$.
Parameter: Shared parameters $\theta$ and $\bm{\gamma}$.
Output: Optimized $\theta$ and $\bm{\gamma}$.
Initialization
1: Initialize GRU hidden states as $\mathbf{h}_{i,0}=\mathbf{0}$ $\forall i$.
2: Initialize state estimates $\hat{\mathbf{x}}_{i,0|0}$.
3: Set $\mathcal{L}_{\text{traj}}=0$.
4: for $t=1$ to $T$ do
Distributed Filtering Step
5: for $i=1$ to $N$ do
6: Call CA-NKCF (Algorithm [1](https://arxiv.org/html/2606.28441#alg1)): $\hat{\mathbf{x}}_{i,t|t},\mathbf{h}_{i,t}\leftarrow\text{CA-NKCF}(\mathbf{z}_{i,t},\theta,\bm{\gamma},\mathcal{N}_{i,t},\mathbf{h}_{i,t-1})$.
7: end for
Loss Accumulation
8: Compute the instantaneous loss: $\mathcal{L}_{t}=\frac{1}{N}\sum_{i=1}^{N}\|\hat{\mathbf{x}}_{i,t|t}-\mathbf{x}_{t}\|^{2}$.
9: Perform the update $\mathcal{L}_{\text{traj}}=\mathcal{L}_{\text{traj}}+\mathcal{L}_{t}$.
10: end for
Parameter Optimization
11: Compute $\nabla_{\theta,\bm{\gamma}}\mathcal{L}_{\text{traj}}$ using BPTT.
12: Update $\theta$ and $\bm{\gamma}$ using Adam\[[31](https://arxiv.org/html/2606.28441#bib.bib60)\].
13: return $\theta,\bm{\gamma}$
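The loss accumulation of steps 8 and 9 and the segmentation used for truncated BPTT can be sketched without any deep learning framework. The snippet below is only an illustration: the function names are hypothetical, the gradient computation and the Adam update are left to whichever automatic differentiation library is used, and the choice of half-open `[start, end)` segments with a shorter final segment is an assumption.

```python
def split_into_segments(horizon, t_trunc):
    """Return [start, end) index pairs that cover time steps 0..horizon-1."""
    return [(s, min(s + t_trunc, horizon)) for s in range(0, horizon, t_trunc)]


def trajectory_loss(estimates, truth):
    """Accumulate the instantaneous losses of Algorithm 2 (steps 8-9).

    estimates[t][i] is the posterior of node i at step t (a list of floats),
    truth[t] is the true state at step t.
    """
    total = 0.0
    for est_t, x_t in zip(estimates, truth):
        num_nodes = len(est_t)
        squared_errors = (
            sum((a - b) ** 2 for a, b in zip(x_hat, x_t)) for x_hat in est_t
        )
        total += sum(squared_errors) / num_nodes  # average over the nodes
    return total


# A horizon of 50 with segments of length 20 yields three segments,
# the last one being shorter.
segments = split_into_segments(50, 20)
```

In a truncated BPTT setup, the hidden states are typically carried from one segment to the next while the gradient flow is cut at the segment boundary; whether the authors follow exactly this convention is not stated in the text.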
## V Numerical Results and Discussion
Figure 2: Sampled oscillator states for $N=4$ sensors.
Figure 3: MSE performance versus the node connection probability for the linear scenario: (a) Average Sensor MSE; (b) Worst Sensor MSE; (c) Consensus Disagreement.
In this section, diverse numerical investigations are presented to validate the effectiveness of the proposed CA-NKCF framework, by comparing it with traditional MB filters and MF NNs. We have simulated three scenarios: a linear example with harmonic oscillators, a nonlinear chaotic experiment based on the Lorenz attractor, and a practical wireless tracking application. For all simulations, the node/sensor/agent graph was random and time-varying, with the goal of verifying that our model does not overfit to specific favorable topologies. For each time instance $t$, the existence of a communication link between nodes $i$ and $j$ was determined by an independent Bernoulli trial with probability $p_{c}$ (thus determining the sets of neighboring nodes $\mathcal{N}_{i,t}$ $\forall i$).¹
¹ As emphasized in Remark 1, this dynamic, random topology generation is considered to showcase that the performance gains of the proposed CA-NKCF framework stem entirely from its robust consensus mechanism, preventing the GRU from benefiting from the memorization of static communication patterns that may not be present in real-world deployments.
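The random, time-varying topology described above can be generated with one Bernoulli trial per node pair and time instance. The sketch below assumes that links are symmetric (if $i$ hears $j$, then $j$ hears $i$) and that nodes have no self-loops; neither detail is spelled out in the text, and the function name `sample_neighbors` is illustrative.

```python
import random


def sample_neighbors(num_nodes, p_c, rng):
    """Draw one undirected random graph and return the neighbor sets N_{i,t}."""
    neighbors = [set() for _ in range(num_nodes)]
    for i in range(num_nodes):
        for j in range(i + 1, num_nodes):
            # Independent Bernoulli trial with success probability p_c.
            if rng.random() < p_c:
                neighbors[i].add(j)
                neighbors[j].add(i)
    return neighbors


# A fresh graph is drawn at every time instance of the horizon.
rng = random.Random(0)
topology = [sample_neighbors(4, 0.4, rng) for _ in range(50)]
```

Note that, for small $p_{c}$, some nodes will have an empty neighbor set at some time instances, so any consensus implementation built on top of this generator has to handle isolated nodes.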
All reported results in the sequel were obtained by averaging across $10$ random seeds; each instance was trained on $5\times 10^{4}$ trajectories and tested on another $10^{4}$. Training consisted of $100$ epochs performed with a learning rate of $5\times 10^{-4}$. For the linear example, the CA-NKCF was realized with a GRU with $2$ hidden layers of $64$ units. For the nonlinear experiment, the hidden size was increased to $256$, and an additional pre-processing feed-forward module with a hidden dimension of $128$ and a ReLU activation was used before the GRU. The output of the GRU was transformed to the appropriate KG dimensions with an additional linear layer. For the MF benchmark, GRUs with a similar structure trained on the same dataset were considered. Since these GRUs did not have access to domain knowledge, their hidden sizes and number of hidden layers were doubled. For the MB benchmarks, the KCF given by ([10](https://arxiv.org/html/2606.28441#S3.E10)) was utilized for linear systems, and the Extended KCF (EKCF), Unscented KCF (UKCF), and Distributed PF (DPF) with $200$ particles were examined for nonlinear systems.
### V-A Linear Scenario
An environment with $N$ decoupled one-dimensional harmonic oscillators (as many as the nodes) has been simulated. In particular, the state $\mathbf{x}_{t}\in\mathbb{R}^{2N}$ was a vector stacking the position and velocity of all oscillators. Each node $i$ observed only one oscillator, meaning that estimation of the entire state $\mathbf{x}_{t}\in\mathbb{R}^{2N}$ relied on successful consensus, and not just on developing powerful local KG estimators. The state evolution was dictated by the block-diagonal matrix $\mathbf{F}\triangleq\text{diag}(\mathbf{F}_{1},\mathbf{F}_{2},\ldots,\mathbf{F}_{N})$, where each sub-block implemented a rotation with a distinct frequency, i.e., $\forall i$:
$$\mathbf{F}_{i}=\begin{bmatrix}\cos(\omega_{i}\Delta_{t})&-\sin(\omega_{i}\Delta_{t})\\ \sin(\omega_{i}\Delta_{t})&\cos(\omega_{i}\Delta_{t})\end{bmatrix}. \tag{20}$$
We have examined systems with $N=4k$, $k\in\mathbb{N}$, oscillators, setting $\Delta_{t}=0.1$ sec. The first quarter of the oscillators had a frequency of $0.5$ rad/sec, the second quarter $1$ rad/sec, the third $1.5$ rad/sec, and the final quarter $2$ rad/sec. The process noise in ([1](https://arxiv.org/html/2606.28441#S3.E1)) was chosen as zero-mean Gaussian with $\mathbf{Q}=0.05\mathbf{I}_{2N}$. Each measurement matrix $\mathbf{H}_{i}\in\mathbb{R}^{2\times 2N}$ extracted only the component of the complete state corresponding to the $i$-th oscillator (its position and velocity), and all observation noise covariance matrices were set as $\mathbf{R}_{i}=0.1\mathbf{I}_{2}$. Finally, the horizon was fixed to $T=50$. Examples of state evolution for the first and last oscillator for $N=4$ are depicted in Fig. [2](https://arxiv.org/html/2606.28441#S5.F2).
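The oscillator environment described above can be simulated in a few lines. The sketch below is a hedged illustration rather than the authors' simulator: it assumes that the state is stacked oscillator by oscillator as `[pos_1, vel_1, pos_2, vel_2, ...]`, which matches the block-diagonal structure of $\mathbf{F}$, and the function names are hypothetical.

```python
import math
import random


def rotation_block(omega, dt):
    """2x2 sub-block F_i of (20) for angular frequency omega."""
    c, s = math.cos(omega * dt), math.sin(omega * dt)
    return [[c, -s], [s, c]]


def oscillator_frequencies(num_oscillators):
    """One quarter each at 0.5, 1, 1.5 and 2 rad/sec (needs N = 4k)."""
    if num_oscillators <= 0 or num_oscillators % 4 != 0:
        raise ValueError("the number of oscillators must be a positive multiple of 4")
    quarter = num_oscillators // 4
    return [f for f in (0.5, 1.0, 1.5, 2.0) for _ in range(quarter)]


def step(state, freqs, dt, q_var, rng):
    """Advance the stacked state by one step with Gaussian process noise."""
    std = math.sqrt(q_var)
    nxt = []
    for k, omega in enumerate(freqs):
        block = rotation_block(omega, dt)
        pos, vel = state[2 * k], state[2 * k + 1]
        nxt.append(block[0][0] * pos + block[0][1] * vel + rng.gauss(0.0, std))
        nxt.append(block[1][0] * pos + block[1][1] * vel + rng.gauss(0.0, std))
    return nxt


# Simulate T = 50 steps for N = 4 with Q = 0.05 I and Delta_t = 0.1 sec.
rng = random.Random(0)
freqs = oscillator_frequencies(4)
state = [1.0, 0.0] * 4
trajectory = []
for _ in range(50):
    state = step(state, freqs, 0.1, 0.05, rng)
    trajectory.append(state)
```

Since each sub-block is a pure rotation, the noise-free dynamics preserve the norm of every position-velocity pair; all growth in the simulated amplitudes therefore comes from the process noise.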
First, we considered a small system with $N=4$ nodes, varying $p_{c}$ from $0.1$ to $0.4$. As is evident from Fig. [3(a)](https://arxiv.org/html/2606.28441#S5.F3.sf1), the proposed method significantly outperforms all benchmarks. For further intuition, we have trained local KalmanNets for each node, both with shared (i.e., PS) and individual parameters. These methods perform worse than the linear KCF, validating the paramount importance of consensus for this partially observable task. Interestingly, this showcases that utilizing powerful NNs to perform KG estimation is not enough for distributed systems; the nodes must also perform intelligent consensus. We also tracked the worst-case MSE (out of all nodes) for each episode as well as the average disagreement in Figs. [3(b)](https://arxiv.org/html/2606.28441#S5.F3.sf2) and [3(c)](https://arxiv.org/html/2606.28441#S5.F3.sf3), respectively, which collectively verify that our approach also performs well on these robustness metrics without being directly trained to do so. Finally, in Fig. [4](https://arxiv.org/html/2606.28441#S5.F4), we fixed $p_{c}=0.4$ and varied $N$ up to $32$ nodes. As is apparent from the presented results, the proposed CA-NKCF scales well with $N$, consistently outperforming the baselines. Besides superior performance, our filter also achieves very good robustness to initialization conditions, as inferred from its narrow shaded region. In contrast, the shaded region of the GRU in Fig. [4](https://arxiv.org/html/2606.28441#S5.F4) is significantly larger.
TABLE I: Ablation analysis for the proposed CA-NKCF in the linear scenario.
Figure 4: Average MSE versus the number of nodes $N$ for the linear scenario.
### Ablation Analysis
We have verified the importance of each component of the proposed CA-NKCF by examining the following three ablation benchmarks:
1. To examine whether the KG computation is a necessary step, we have trained a GRU that processes the most recent observation, the prior $\hat{\mathbf{x}}_{i,t|t-1}$, and the average prior of all peers in the set $\mathcal{N}_{i,t}$, with the goal of directly mapping them to $\hat{\mathbf{x}}_{i,t|t}$.
2. One important feature of our method is the joint optimization of $\bm{\gamma}$. To investigate whether joint optimization hinders stability, we have trained the same models, but with a fixed $\bm{\gamma}$, whose value was determined by an element-wise grid search on $[0,1]$. The training process was repeated for each considered value of $\bm{\gamma}$.
3. For the final ablation test, we have investigated the importance of end-to-end training with the consensus loop present during training, and of the centralized loss function. To this end, we have trained $N$ independent local KalmanNets (one for each node), and added the consensus term to the recursion only during testing. Again, the value of the weight was determined by grid search on $[0,1]$.
For fairness, all three GRUs above were designed to have a similar structure to the GRU of CA-NKCF. As depicted in Table [I](https://arxiv.org/html/2606.28441#S5.T1), all $3$ components are necessary. The Kalman-like recursion is superior to just processing priors, optimizing $\bm{\gamma}$ along with $\theta$ improves training, and the network must learn to cooperate during training by incorporating the consensus step in the training code; consensus on deployment alone is not sufficient.

Figure 5:The considered state trajectory for the Lorenz attractor scenario\.
Figure 6:Average MSE versus the observation noise level for the Lorenz attractor scenario\.
### V-B Lorenz Attractor
Next, we performed experiments with a polynomial $3$-dimensional DS inspired by the Lorenz attractor. Such DSs are often utilized to test the effectiveness of sequential estimators due to their inherent complexity. The latent state of the Lorenz attractor evolves according to the following expression:
$$\mathbf{x}_{t+1}=\mathbf{F}(\mathbf{x}_{t})\mathbf{x}_{t}+\mathbf{w}_{t}\in\mathbb{R}^{3}, \tag{21a}$$
$$\mathbf{F}(\mathbf{x}_{t})\triangleq\exp\left(\begin{bmatrix}-10&10&0\\ 28&-1&\mathbf{x}_{t-1}\\ 0&\mathbf{x}_{t}&-\frac{8}{3}\end{bmatrix}\Delta_{t}\right)\in\mathbb{R}^{3\times 3}. \tag{21b}$$
We have generated trajectories using the $5$-th order ($J=5$) Taylor series expansion of $\mathbf{F}(\mathbf{x}_{t})$ for $\Delta_{t}=0.02$, setting the covariance of the state process noise in ([1](https://arxiv.org/html/2606.28441#S3.E1)) to $\mathbf{Q}=0.1\mathbf{I}_{3}$. An example simulated trajectory is provided in Fig. [5](https://arxiv.org/html/2606.28441#S5.F5). To investigate our filter's robustness to incorrect model knowledge, we trained two variants: one assuming $J=5$ (correct dynamics) and one assuming $J=2$ (incorrect dynamics). A system with $N=3$ sensor nodes was considered, with the graph connection probability set to $p_{c}=0.4$. The first node received a noisy estimate of the first two state coordinates, the second node received a noisy estimate of the last two, and the third node received an estimate of the first and third state coordinates after they had been converted to polar coordinates. The observation noise was additive zero-mean Gaussian with covariance $\mathbf{R}_{i}=\sigma_{\text{Lor}}\mathbf{I}_{2}$ for each $i$-th node.
To ensure that the neural filter generalizes across different temporal lengths rather than overfitting to a fixed sequence length, the horizon $T$ for each trajectory was selected uniformly at random from the set $\{1250,1500,1750,2000\}$.²
² During training, trajectories were split into shorter segments of length $T_{\rm trunc}=20$ to apply truncated BPTT.
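The order-$J$ Taylor approximation of the matrix exponential in (21b), and the effect of lowering $J$ from $5$ to $2$, can be illustrated as follows. The sketch takes the matrix inside the exponential as an argument, so it makes no assumption about the state-dependent entries; the helper names and the small rotation generator used in the example are purely illustrative.

```python
import math


def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [
        [sum(a[r][k] * b[k][c] for k in range(n)) for c in range(n)]
        for r in range(n)
    ]


def taylor_expm(a, order):
    """Truncated series I + A + A^2/2! + ... + A^order/order!."""
    n = len(a)
    result = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    term = [row[:] for row in result]  # current term A^j / j!, starts at I
    for j in range(1, order + 1):
        term = [[v / j for v in row] for row in mat_mul(term, a)]
        result = [
            [result[r][c] + term[r][c] for c in range(n)] for r in range(n)
        ]
    return result


# Example: the exponential of a rotation generator is a rotation matrix,
# so the truncation error of each order can be measured exactly.
angle = 0.3
generator = [[0.0, -angle], [angle, 0.0]]
err_low = abs(taylor_expm(generator, 2)[1][0] - math.sin(angle))
err_high = abs(taylor_expm(generator, 5)[1][0] - math.sin(angle))
```

Here `err_high` is several orders of magnitude smaller than `err_low`, which mirrors how a filter built with $J=2$ operates under a coarser, mismatched transition model than the $J=5$ data generator.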
Figure [6](https://arxiv.org/html/2606.28441#S5.F6) depicts the average MSE performance with the observation noise strength $\sigma_{\text{Lor}}$ ranging from $-30$ dB to $0$ dB. The DPF benchmark displayed poor performance with very large errors, and has been omitted from the plot to maintain presentation quality. As shown in the figure, CA-NKCF achieves a significant performance gain, yielding up to $50\%$ error reduction relative to the MF GRU benchmark, even when the former assumes incorrect system dynamics.
### Sensitivity Study
Training machine learning models on sequential data with a nonstationary structure can be an unstable process, highly affected by hyperparameter choices\[[21](https://arxiv.org/html/2606.28441#bib.bib86)\]. Having established the effectiveness of the proposed CA-NKCF framework in the Lorenz attractor scenario, we have also used those chaotic trajectories to verify its robustness to moderate hyperparameter changes. More specifically, we have tested the following:
- •Learning rate: The training and testing procedures were repeated for learning rates ranging from $5\times 10^{-5}$ to $10^{-3}$.
- •Gradient clipping: The clip threshold was varied from $0.5$ to $2.5$.
- •Truncation length ($T_{\rm trunc}$): It was varied from $15$ to $50$.
The Coefficients of Variation (CoVar) on the test set, considering observation noise of $\sigma_{\text{Lor}}=0$ dB, are depicted in Fig. [7](https://arxiv.org/html/2606.28441#S5.F7). As is evident, the proposed algorithm is robust to moderate hyperparameter changes, as the CoVars are substantially smaller than $1$; this implies that the standard deviation of the test set score is much smaller than its mean. The learning rate causes some fluctuation in the final result, particularly for large values, whereas the other two parameters have a negligible effect.
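For reference, the coefficient of variation of a set of test scores is the ratio of their standard deviation to their mean. The snippet below is a small sketch of this computation; the use of the population standard deviation is an assumption, since the text does not state which estimator was employed, and the example scores are made-up values for illustration only.

```python
import statistics


def coefficient_of_variation(scores):
    """Standard deviation of the scores divided by their mean."""
    mean = statistics.fmean(scores)
    if mean == 0:
        raise ValueError("the coefficient of variation is undefined for a zero mean")
    return statistics.pstdev(scores) / abs(mean)


# Hypothetical test MSEs obtained for different learning rates.
example_scores = [0.21, 0.20, 0.22, 0.19]
cov = coefficient_of_variation(example_scores)
```

A value well below $1$, as in this toy example, indicates that the spread of the scores across hyperparameter settings is small relative to their typical magnitude.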
Figure 7: Sensitivity study results for the Lorenz attractor scenario considering observation noise of $0$ dB.
### V-C Wireless Tracking Application
To demonstrate the applicability of the proposed estimator in real-world tasks, we have designed a practical wireless application inspired by the current 5th Generation (5G) New Radio (NR) telecommunications standard\[[1](https://arxiv.org/html/2606.28441#bib.bib64)\]. A wireless User Equipment (UE) tracking system\[[63](https://arxiv.org/html/2606.28441#bib.bib65)\] was considered, where the UE can move along two axes with a fixed acceleration $(\alpha_{x},\alpha_{y})$. The UE state is the $4$-dimensional vector $\mathbf{x}_{t}=[p_{x,t},p_{y,t},v_{x,t},v_{y,t}]$, whose entries are dictated by:
$$p_{x,t+1}=p_{x,t}+v_{x,t}\Delta_{t}+\frac{1}{2}\alpha_{x}\Delta_{t}^{2}+w_{p_{x},t}, \tag{22a}$$
$$v_{x,t+1}=v_{x,t}+\alpha_{x}\Delta_{t}+w_{v_{x},t}, \tag{22b}$$
$$p_{y,t+1}=p_{y,t}+v_{y,t}\Delta_{t}+\frac{1}{2}\alpha_{y}\Delta_{t}^{2}+w_{p_{y},t}, \tag{22c}$$
$$v_{y,t+1}=v_{y,t}+\alpha_{y}\Delta_{t}+w_{v_{y},t}. \tag{22d}$$
All state noise variables (the rightmost $w$ terms) were assumed zero-mean Gaussian with a variance of $0.1$, the time interval was set to $\Delta_{t}=0.05$ sec, and the acceleration pair was set to $(0.5,0.5)$ m/sec$^{2}$. The single-antenna UE was assumed to transmit a beacon signal at every time instance. Each sensor node corresponded to a BS equipped with $N_{\rm A}=8$ antenna elements, which was tasked to estimate the channel coefficients, i.e., the transfer function of the signal propagation system between itself and the UE. For each $i$-th BS, this channel coefficient vector is represented by $\mathbf{g}_{i,t}(\mathbf{p}_{t})$, where $\mathbf{p}_{t}\triangleq[p_{x,t},p_{y,t}]^{\top}$ is the current UE 2D coordinate vector, and it can be estimated³ using existing channel estimation methods\[[25](https://arxiv.org/html/2606.28441#bib.bib26)\].
³ Notice that the channel measurements are determined only by the current UE position vector. Therefore, the filter needs to incorporate past measurements through its GRU memory to estimate the velocity.
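The constant-acceleration motion model in (22a)-(22d) can be sketched as a single transition function. The code below is a minimal illustration with a hypothetical function name; it draws an independent Gaussian noise sample for each of the four state entries, which is how the noise description above is read here.

```python
import math
import random


def ue_step(state, dt, accel, noise_var, rng):
    """One transition of (22a)-(22d) for the state [p_x, p_y, v_x, v_y]."""
    p_x, p_y, v_x, v_y = state
    a_x, a_y = accel
    std = math.sqrt(noise_var)
    return [
        p_x + v_x * dt + 0.5 * a_x * dt ** 2 + rng.gauss(0.0, std),  # (22a)
        p_y + v_y * dt + 0.5 * a_y * dt ** 2 + rng.gauss(0.0, std),  # (22c)
        v_x + a_x * dt + rng.gauss(0.0, std),                        # (22b)
        v_y + a_y * dt + rng.gauss(0.0, std),                        # (22d)
    ]


# Simulate T = 50 steps with the parameter values quoted in the text.
rng = random.Random(0)
state = [0.0, 0.0, 1.0, 1.0]  # hypothetical initial position and velocity
track = []
for _ in range(50):
    state = ue_step(state, dt=0.05, accel=(0.5, 0.5), noise_var=0.1, rng=rng)
    track.append(state)
```

Only the two position entries of the state enter the channel measurements, so a filter operating on this model has to infer the velocity entries from the evolution of the position over time.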
In our simulations, we have assumed the presence of noise during the channel estimation process, with a Signal-to-Noise Ratio (SNR) of $20$ dB, leading to an erroneous observation of $\mathbf{g}_{i,t}(\mathbf{p}_{t})$ denoted by $\hat{\mathbf{g}}_{i,t}(\mathbf{p}_{t})$. Besides the UE and the BSs, static objects known as scatterers were considered present in the scene. Those are, in general, passive environmental features (e.g., buildings, bridges, trees) that reflect the signal, creating multi-path conditions that adversely affect the estimation of the UE state. Note that, to ensure resolvability of the UE position from the considered multi-BS system, the total number of reception antennas must exceed the number of scatterers and, additionally, those antennas need to be spatially distributed, motivating the adoption of KCF-based techniques. Details of the simulated channel model are provided in the Appendix.
In Fig. [8](https://arxiv.org/html/2606.28441#S5.F8), we have fixed $T=50$ and $p_{c}=0.4$, and varied the number of scatterers from $K=20$ to $50$ to evaluate the performance of the proposed CA-NKCF in rich scattering conditions. It is evident that the proposed UE position estimator outperforms all baselines by a pronounced margin. In addition, consistent with previous observations, the DPF benchmark exhibits particularly weak performance. While the UKCF serves as the most competitive baseline, CA-NKCF consistently surpasses it. Notably, even though our model's performance exhibits slight variance depending on the random initialization seed, its worst-performing seed still outperforms the UKCF across all evaluated scattering densities. This experiment verifies that our consensus-based distributed estimation framework can achieve good performance in real-world systems. Finally, after repeating the sensitivity study of Fig. [7](https://arxiv.org/html/2606.28441#S5.F7) for this wireless tracking application, we verified the stability of our MA algorithm in practical applications; the results for $K=40$ scatterers are depicted in Fig. [9](https://arxiv.org/html/2606.28441#S5.F9).
Figure 8: Average MSE versus the number of scatterers for the wireless UE tracking scenario.

Figure 9: Sensitivity study results for the wireless UE tracking scenario considering $40$ scatterers.
### V-D Forward Inference Time Comparison
While the preceding evaluations demonstrate the superior estimation capabilities of our hybrid scheme, which combines Kalman-like priors with data-driven NNs, over traditional latent state estimation algorithms, it is crucial to verify that the integration of GRUs in CA-NKCF does not incur additional latency. Table [II](https://arxiv.org/html/2606.28441#S5.T2) details the forward inference times of all simulated estimation methods, confirming that our approach avoids computational bottlenecks. All evaluations were conducted on a 64-bit Linux workstation equipped with an 11th Generation Intel Core i7-11700KF processor (8 cores, 16 threads) operating at a base frequency of 3.60 GHz, alongside an NVIDIA GeForce RTX 3080 GPU and 32 GB of system memory. As shown in this table, the proposed CA-NKCF achieves faster execution times than standard MB estimators; this is primarily attributed to the fact that it bypasses the need for costly matrix inversions. Crucially, this efficiency ensures that CA-NKCF satisfies the strict real-time causality constraint required for online tracking, as its inference time remains significantly smaller than the system's sampling period. In contrast, scaling particle-based baselines pushes their execution time beyond the physical state update interval, rendering their predictions obsolete.
TABLE II: Comparison of forward inference time in seconds between the proposed CA-NKCF and conventional filters.
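The real-time condition stated in this subsection, namely that one forward pass must finish within the sampling period, can be checked with a plain wall-clock measurement. The Python sketch below is illustrative only: `mean_step_time`, `meets_real_time`, the placeholder `dummy_step`, and the 0.1 s sampling period are assumptions made for the example and are not taken from the paper's code or settings.

```python
import time


def mean_step_time(step_fn, inputs, warmup=10):
    """Average wall-clock seconds per call of step_fn over the given inputs."""
    for x in inputs[:warmup]:
        step_fn(x)  # warm-up calls are excluded from the measurement
    start = time.perf_counter()
    for x in inputs:
        step_fn(x)
    return (time.perf_counter() - start) / len(inputs)


def meets_real_time(step_seconds, sampling_period):
    """True if one filter update finishes before the next observation arrives."""
    return step_seconds < sampling_period


def dummy_step(observation):
    """Hypothetical stand-in for one forward pass of an estimator."""
    return sum(v * v for v in observation)


observations = [[0.1 * n] * 16 for n in range(200)]
step_seconds = mean_step_time(dummy_step, observations)
sampling_period = 0.1  # hypothetical value in seconds, not from the paper
print(meets_real_time(step_seconds, sampling_period))
```

If the step runs on a GPU, the device would have to be synchronized before reading the clock, since kernel launches are asynchronous.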
## VI Conclusion and Future Work
In this paper, we presented CA-NKCF, a powerful domain-informed deep learning algorithm for online latent state estimation in decentralized systems. The proposed estimation framework combines the representational power of NNs, the temporal modeling abilities of GRUs, and the elegant mathematical structure of KFs with an optimized novel lightweight consensus mechanism. Strong mathematical intuition demonstrating that the learned consensus updates promote stable estimation was provided. Crucially, the proposed framework improves upon the communication costs of traditional information-based filters by requiring only the exchange of state priors, while completely bypassing the need for computationally expensive matrix inversions. Detailed numerical investigations on physics-inspired trajectories and on a realistic wireless system under extreme scattering conditions showcased the superiority of our approach over both traditional MB and MF recurrent estimators. It was also demonstrated that the proposed hybrid architecture, combining Kalman-like priors with data-driven NNs, remains remarkably robust to reasonable hyperparameter changes.
For future work, we plan to extend CA-NKCF to track the state of more complicated latent processes, e.g., high-order autoregressive models and switching Markov models. Furthermore, we intend to develop distributed filters that are robust to Byzantine attacks and freeloaders. Mechanisms that adaptively optimize the agent network topology to improve distributed inference, leveraging bandit algorithms or reinforcement learning, constitute another interesting research direction.
## Appendix A Wireless Channel Model
We have considered a wireless system operating at the carrier frequency $f_0=28$ GHz, comprising a single-antenna UE broadcasting a beacon signal (i.e., fixed and known) at every time instance $t$, which is received by a set of $N$ BSs playing the role of distributed sensor nodes. Each BS was assumed equipped with a Uniform Linear Array (ULA) of $N_{\rm A}=8$ antenna elements. The ULAs of all BSs were aligned with the $y$ axis of the coordinate plane, with each BS's adjacent antenna elements spaced at distances of $\lambda/2$, where $\lambda=c/f_0$ represents the wavelength and $c=3\times 10^{8}$ m/sec is the speed of light. Since, at every time-slot, only a single transmission took place, we assumed the UE to remain quasi-static, therefore Doppler effects were ignored. The wireless environment additionally contained $K$ scattering objects that partially absorbed and reflected the impinging signals; the positions and properties of all scatterers remained fixed within each UE trajectory. As a result, the channel in the far field between the UE and each $i$-th BS ($i=1,2,\ldots,N$) can be expressed, similar to [[51](https://arxiv.org/html/2606.28441#bib.bib67)], as follows:
$$
\mathbf{g}_{i,t}(\mathbf{p}_t) \triangleq \mathbf{g}_{i,t}^{\rm D} + \sum_{k=1}^{K} \mathbf{g}_{i,k,t} \in \mathbb{C}^{N_{\rm A} \times 1}, \tag{23}
$$

where $\mathbf{g}_{i,t}^{\rm D}$ is the direct link from the UE to the $i$-th BS and $\mathbf{g}_{i,k,t}$ is the link corresponding to the path reflected by the $k$-th scatterer (modeled as a point). The direct link $\mathbf{g}_{i,t}^{\rm D}$ is defined as:

$$
\mathbf{g}_{i,t}^{\rm D} \triangleq L(\mathbf{p}_t, \mathbf{p}^{\rm BS}_i) \exp\left(-\jmath \frac{2\pi}{\lambda} \left\| \mathbf{p}_t - \mathbf{p}^{\rm BS}_i \right\|_2\right) \mathbf{a}(\theta^{\rm D}_{i,t}), \tag{24}
$$

where $\mathbf{p}^{\rm BS}_i$ is the position of the $i$-th BS (corresponding to its left-most ULA antenna element) and $L(\mathbf{p}_t, \mathbf{p}^{\rm BS}_i)$ represents signal attenuation due to pathloss:

$$
L\left(\mathbf{p}_t, \mathbf{p}^{\rm BS}_i\right) = \left(\frac{\lambda}{4\pi \left\| \mathbf{p}_t - \mathbf{p}^{\rm BS}_i \right\|_2}\right)^{2}. \tag{25}
$$

In (24), the exponential factor models the distance-dependent phase shift, and $\mathbf{a}(\theta^{\rm D}_{i,t})$ is the ULA steering vector, which depends on the angle $\theta^{\rm D}_{i,t}$ between the UE antenna and the first antenna element of the $i$-th BS. This vector models the incremental phase shifts incurred by the antenna spacing, as follows:

$$
\mathbf{a}(\theta^{\rm D}_{i,t}) \triangleq \begin{bmatrix} 1, e^{-\jmath \pi \sin(\theta^{\rm D}_{i,t})}, \dots, e^{-\jmath \pi (N_{\rm A}-1) \sin(\theta^{\rm D}_{i,t})} \end{bmatrix}^{T}. \tag{26}
$$

Moreover, each vector $\mathbf{g}_{i,k,t}$ in ([23](https://arxiv.org/html/2606.28441#A1.E23)) can be expressed as:

$$
\begin{aligned}
\mathbf{g}_{i,k,t} = {} & \Gamma_k L(\mathbf{p}_t, \mathbf{p}^{\rm SC}_k) L(\mathbf{p}^{\rm SC}_k, \mathbf{p}^{\rm BS}_i) \\
& \times \exp\left(-\jmath \frac{2\pi}{\lambda} \left( \left\| \mathbf{p}_t - \mathbf{p}^{\rm SC}_k \right\|_2 + \left\| \mathbf{p}^{\rm SC}_k - \mathbf{p}^{\rm BS}_i \right\|_2 \right)\right) \mathbf{a}(\theta^{\rm SC}_{i,k}),
\end{aligned} \tag{27}
$$

where $\Gamma_k$ is the $k$-th scatterer reflection coefficient, modeled as a complex random value with uniform amplitude and phase, $\mathbf{p}^{\rm SC}_k$ is its position, and $\theta^{\rm SC}_{i,k}$ denotes the angle between the $k$-th scatterer and the $i$-th BS.
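The channel construction in (23)-(27) can be illustrated with a short, standard-library Python sketch. Several details here are assumptions for the sake of a runnable example rather than the authors' implementation: the function names, the convention that angles are measured from the $x$ axis (broadside of a ULA laid along the $y$ axis), the choice of a reflection amplitude drawn uniformly from $[0,1]$, the phase of a reflected path accumulating over the total UE-scatterer-BS distance, and the example UE position.

```python
import cmath
import math
import random

C = 3e8        # speed of light in m/s, as in the text
F0 = 28e9      # carrier frequency in Hz
LAM = C / F0   # wavelength, about 10.7 mm
N_A = 8        # ULA elements per BS


def pathloss(p, q):
    """Attenuation of (25): (lambda / (4 * pi * distance)) ** 2."""
    return (LAM / (4 * math.pi * math.dist(p, q))) ** 2


def steering(theta):
    """ULA steering vector of (26) for half-wavelength element spacing."""
    return [cmath.exp(-1j * math.pi * n * math.sin(theta)) for n in range(N_A)]


def angle(src, bs):
    """Assumed convention: angle of src seen from bs, measured from the x axis."""
    return math.atan2(src[1] - bs[1], src[0] - bs[0])


def channel(p_ue, p_bs, scatterers):
    """Noise-free channel of (23): direct path plus one path per scatterer.

    scatterers is a list of (position, reflection_coefficient) pairs.
    """
    d = math.dist(p_ue, p_bs)
    gain = pathloss(p_ue, p_bs) * cmath.exp(-2j * math.pi * d / LAM)
    g = [gain * a_n for a_n in steering(angle(p_ue, p_bs))]
    for p_sc, gamma in scatterers:
        # Assumption: the phase accumulates over the whole reflected path.
        d_sc = math.dist(p_ue, p_sc) + math.dist(p_sc, p_bs)
        gain = (gamma * pathloss(p_ue, p_sc) * pathloss(p_sc, p_bs)
                * cmath.exp(-2j * math.pi * d_sc / LAM))
        a = steering(angle(p_sc, p_bs))
        g = [g_n + gain * a_n for g_n, a_n in zip(g, a)]
    return g


rng = random.Random(0)
bs_positions = [(10.0, 10.0), (90.0, 10.0), (50.0, 95.0)]
# Scatterers drawn inside the bounding box of the BSs, with an assumed
# uniform amplitude in [0, 1] and uniform phase in [0, 2*pi).
scatterers = [((rng.uniform(10, 90), rng.uniform(10, 95)),
               cmath.rect(rng.uniform(0, 1), rng.uniform(0, 2 * math.pi)))
              for _ in range(20)]
g = channel((40.0, 50.0), bs_positions[0], scatterers)
print(len(g))
```

With an empty scatterer list, every entry of the returned vector has magnitude equal to the direct-path pathloss, which gives a quick sanity check of the steering-vector and phase terms.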
To emulate the channel estimation process at the fixed SNR level of $20$ dB, we have added white Gaussian noise to the actual channel vector as follows:

$$
\hat{\mathbf{g}}_{i,t}(\mathbf{p}_t) = \mathbf{g}_{i,t}(\mathbf{p}_t) + \mathbf{v}_{i,t}, \tag{28}
$$

where $\mathbf{v}_{i,t}$ was sampled from the complex normal distribution $\mathcal{CN}(\mathbf{0}_{N_{\rm A}}, \sigma_{\rm CE}^{2} \mathbf{I}_{N_{\rm A}})$ with $\sigma_{\rm CE} = \|\mathbf{g}_{i,t}(\mathbf{p}_t)\|_{\rm F} / (10\sqrt{N_{\rm A}})$. Therefore, following ([23](https://arxiv.org/html/2606.28441#A1.E23)), the observation vector at each $i$-th BS sensor can be expressed as follows:

$$
\mathbf{z}_{i,t} = [\mathfrak{Re}(\hat{\mathbf{g}}_{i,t}(\mathbf{p}_t)), \mathfrak{Im}(\hat{\mathbf{g}}_{i,t}(\mathbf{p}_t))]^{T} \in \mathbb{R}^{2N_{\rm A} \times 1}, \tag{29}
$$

where $\mathfrak{Re}(\cdot)$ and $\mathfrak{Im}(\cdot)$ represent the real and imaginary components of the channel estimation vector $\hat{\mathbf{g}}_{i,t}(\mathbf{p}_t)$, which were appended together to form $\mathbf{z}_{i,t}$. For our simulations, we have used $N=3$ BSs, whose positions $\mathbf{p}^{\rm BS}_i$ were fixed at $(10,10)$ m, $(90,10)$ m, and $(50,95)$ m across all UE trajectories. At the beginning of every trajectory, the scatterer positions $\mathbf{p}^{\rm SC}_k$ were re-sampled uniformly within the bounded 2D box indicated by the coordinates of the $N$ BSs.
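The noise scaling in (28) follows from the SNR definition: with $\sigma_{\rm CE} = \|\mathbf{g}\|_{\rm F}/(10\sqrt{N_{\rm A}})$, the ratio $\|\mathbf{g}\|_{\rm F}^{2}/(N_{\rm A}\sigma_{\rm CE}^{2})$ equals $100$, i.e., $20$ dB. The standard-library Python sketch below illustrates this step and the stacking in (29); the function name `noisy_observation`, the `snr_db` parameter, and the unit-modulus placeholder channel are assumptions made for the example.

```python
import math
import random


def noisy_observation(g, rng, snr_db=20.0):
    """Add CN(0, sigma^2 I) noise to channel g, then stack Re/Im parts as in (29)."""
    n_a = len(g)
    norm = math.sqrt(sum(abs(x) ** 2 for x in g))
    # For 20 dB this reduces to norm / (10 * sqrt(N_A)).
    sigma = norm / (10 ** (snr_db / 20) * math.sqrt(n_a))
    # A circular complex Gaussian of variance sigma^2 has real and imaginary
    # parts with variance sigma^2 / 2 each.
    s = sigma / math.sqrt(2)
    g_hat = [x + complex(rng.gauss(0, s), rng.gauss(0, s)) for x in g]
    z = [x.real for x in g_hat] + [x.imag for x in g_hat]
    return z, sigma


rng = random.Random(0)
# Placeholder channel with unit-modulus entries, used only for illustration.
g = [complex(math.cos(0.3 * n), math.sin(0.3 * n)) for n in range(8)]
z, sigma = noisy_observation(g, rng)
print(len(z), round(sigma, 6))
```

The returned vector has $2N_{\rm A}$ real entries, matching the observation dimension in (29).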
## References
- [1] (2020) NR; Physical channels and modulation. Technical Specification TS 38.211, 3rd Generation Partnership Project (3GPP). Cited by: [§V-C](https://arxiv.org/html/2606.28441#S5.SS3.p1.3).
- [2] H. K. Aggarwal et al. (2019) MoDL: model-based deep learning architecture for inverse problems. IEEE Trans. Medical Imag. 38(2), pp. 394–405. Cited by: [§II-B](https://arxiv.org/html/2606.28441#S2.SS2.p3.1).
- [3] A. Amirkhani and A. H. Barshooi (2022-06) Consensus in multi-agent systems: a review. Artif. Intell. Rev. 55(5), pp. 3897–3935. External Links: ISSN 0269-2821, [Link](https://doi.org/10.1007/s10462-021-10097-x), [Document](https://dx.doi.org/10.1007/s10462-021-10097-x). Cited by: [§I](https://arxiv.org/html/2606.28441#S1.p4.1).
- [4] V. P. Anand et al. (2026) Simultaneous distributed acoustic and temperature sensing for robust leakage detection in gas pipelines. J. Lightwave Techn. 44(7), pp. 2849–2857. External Links: [Document](https://dx.doi.org/10.1109/JLT.2026.3671630). Cited by: [§I](https://arxiv.org/html/2606.28441#S1.p1.1.1.1.2).
- [5] F. Auger et al. (2013) Industrial applications of the Kalman filter: a review. IEEE Trans. Ind. Electron. 60(12), pp. 5458–5471. External Links: [Document](https://dx.doi.org/10.1109/TIE.2012.2236994). Cited by: [§I](https://arxiv.org/html/2606.28441#S1.p1.1.1.1).
- [6] I. Buchnik et al. (2024) Latent-KalmanNet: learned Kalman filtering for tracking from high-dimensional signals. IEEE Trans. Signal Process. 72, pp. 352–367. External Links: [Document](https://dx.doi.org/10.1109/TSP.2023.3344360). Cited by: [§I](https://arxiv.org/html/2606.28441#S1.p3.1), [§II-B](https://arxiv.org/html/2606.28441#S2.SS2.p1.1), [§III-A](https://arxiv.org/html/2606.28441#S3.SS1.p2.16).
- [7] H. Cai et al. (2021) Learned robust PCA: a scalable deep unfolding approach for high-dimensional outlier detection. In Proc. NeurIPS, Virtual. Cited by: [§II-B](https://arxiv.org/html/2606.28441#S2.SS2.p3.1).
- [8] C. M. Carvalho et al. (2010) Particle learning and smoothing. Statistical Science 25(1), pp. 88–106. Cited by: [§II-A](https://arxiv.org/html/2606.28441#S2.SS1.p3.1).
- [9] J. Chung et al. (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint: 1412.3555. Cited by: [§I](https://arxiv.org/html/2606.28441#S1.p3.1), [§II-B](https://arxiv.org/html/2606.28441#S2.SS2.p1.1).
- [10] M. Coates (2004) Distributed particle filters for sensor networks. In Proc. ACM IPSN, Berkeley, California, USA. External Links: [Document](https://dx.doi.org/10.1145/984622.984637). Cited by: [§II-A](https://arxiv.org/html/2606.28441#S2.SS1.p3.1), [item R1](https://arxiv.org/html/2606.28441#S3.I2.ix1.p1.2).
- [11] H. Coskun et al. (2017) Long short-term memory Kalman filters: recurrent neural estimators for pose regularization. In Proc. IEEE ICCV, Venice, Italy. External Links: [Document](https://dx.doi.org/10.1109/ICCV.2017.589). Cited by: [§I](https://arxiv.org/html/2606.28441#S1.p3.1), [§II-B](https://arxiv.org/html/2606.28441#S2.SS2.p1.1).
- [12] T. Cui et al. (2026) Distributed weighted average consensus fusion based on ADMM under measurement uncertainty. Signal Process. 241, pp. 110380. External Links: ISSN 0165-1684, [Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.sigpro.2025.110380), [Link](https://www.sciencedirect.com/science/article/pii/S0165168425004967). Cited by: [§I](https://arxiv.org/html/2606.28441#S1.p4.1).
- [13] R. Deshmukh et al. (2017) Optimal discrete-time Kalman consensus filter. In Proc. ACC, Seattle, Washington, USA. External Links: [Document](https://dx.doi.org/10.23919/ACC.2017.7963859). Cited by: [§II-A](https://arxiv.org/html/2606.28441#S2.SS1.p2.1).
- [14] P. Doshi and P. J. Gmytrasiewicz (2005) A particle filtering based approach to approximating interactive POMDPs. In Proc. AAAI, Pittsburgh, Pennsylvania, USA. Cited by: [§II-A](https://arxiv.org/html/2606.28441#S2.SS1.p3.1).
- [15] M. Fraccaro et al. (2016) Sequential neural models with stochastic layers. In Proc. NeurIPS, Barcelona, Spain. Cited by: [§I](https://arxiv.org/html/2606.28441#S1.p3.1), [§II-B](https://arxiv.org/html/2606.28441#S2.SS2.p2.1).
- [16] M. Fraccaro et al. (2017) A disentangled recognition and nonlinear dynamics model for unsupervised learning. In Proc. NeurIPS, Long Beach, California, USA. Cited by: [§II-B](https://arxiv.org/html/2606.28441#S2.SS2.p2.1).
- [17] A. Gast et al. (2025) DCD-MUSIC: deep-learning-aided cascaded differentiable MUSIC algorithm for near-field localization of multiple sources. In Proc. IEEE ICASSP, Hyderabad, India. Cited by: [§II-B](https://arxiv.org/html/2606.28441#S2.SS2.p3.1).
- [18] D. Ghion and M. Zorzi (2022) Distributed Kalman filtering with event-triggered communication: a robust approach. In Proc. Medit. Conf. Control Autom., Athens, Greece. External Links: [Document](https://dx.doi.org/10.1109/MED54222.2022.9837137). Cited by: [§II-A](https://arxiv.org/html/2606.28441#S2.SS1.p2.1).
- [19] A. Ghosh et al. (2024) DANSE: data-driven non-linear state estimation of model-free process in unsupervised learning setup. IEEE Trans. Signal Process. 72, pp. 1824–1838. External Links: [Document](https://dx.doi.org/10.1109/TSP.2024.3383277). Cited by: [§I](https://arxiv.org/html/2606.28441#S1.p3.1), [§II-B](https://arxiv.org/html/2606.28441#S2.SS2.p1.1).
- [20] A. A. Gorji and M. B. Menhaj (2008) Identification of nonlinear state space models using an MLP network trained by the EM algorithm. In Proc. IEEE IJCNN, Hong Kong, China. External Links: [Document](https://dx.doi.org/10.1109/IJCNN.2008.4633766). Cited by: [§I](https://arxiv.org/html/2606.28441#S1.p3.1), [§II-B](https://arxiv.org/html/2606.28441#S2.SS2.p2.1).
- [21] K. Greff et al. (2017) LSTM: a search space odyssey. IEEE Trans. Neural Netw. Learn. Syst. 28(10), pp. 2222–2232. External Links: [Document](https://dx.doi.org/10.1109/TNNLS.2016.2582924). Cited by: [§V](https://arxiv.org/html/2606.28441#S5.SSx2.p1.3).
- [22] M. Gruber (1967) An approach to target tracking. Technical Report, MIT Lincoln Laboratory, Lexington, MA. Cited by: [§I](https://arxiv.org/html/2606.28441#S1.p1.1.1.1), [§I](https://arxiv.org/html/2606.28441#S1.p2.1).
- [23] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9(8), pp. 1735–1780. External Links: [Document](https://dx.doi.org/10.1162/neco.1997.9.8.1735). Cited by: [§I](https://arxiv.org/html/2606.28441#S1.p3.1), [§IV](https://arxiv.org/html/2606.28441#S4.p2.7).
- [24] C. Hu and B. Chen (2022) An efficient distributed Kalman filter over sensor networks with maximum correntropy criterion. IEEE Trans. Signal Inf. Process. Networks 8, pp. 433–444. External Links: [Document](https://dx.doi.org/10.1109/TSIPN.2022.3175363). Cited by: [§II-A](https://arxiv.org/html/2606.28441#S2.SS1.p2.1).
- [25] L. Italiano et al. (2025) A tutorial on 5G positioning. IEEE Commun. Surveys & Tuts. 27(3), pp. 1488–1535. Cited by: [§V-C](https://arxiv.org/html/2606.28441#S5.SS3.p1.15.3).
- [26] R. E. Kalman (1960-03) A new approach to linear filtering and prediction problems. J. Basic Engineering 82(1), pp. 35–45. External Links: ISSN 0021-9223, [Document](https://dx.doi.org/10.1115/1.3662552), [Link](https://doi.org/10.1115/1.3662552), https://asmedigitalcollection.asme.org/fluidsengineering/article-pdf/82/1/35/5518977/35_1.pdf. Cited by: [§I](https://arxiv.org/html/2606.28441#S1.p2.1).
- [27] A. T. Kamal et al. (2012) Information weighted consensus. In Proc. IEEE CDC, Maui, Hawaii, USA. External Links: [Document](https://dx.doi.org/10.1109/CDC.2012.6426886). Cited by: [1st item](https://arxiv.org/html/2606.28441#S1.I1.i1.p1.1.1), [§II-A](https://arxiv.org/html/2606.28441#S2.SS1.p2.1).
- [28] M. Kaushik et al. (2019) Parameter sharing reinforcement learning architecture for multi agent driving. In Proc. AIR, ACM International Conference Proceedings Series, Chennai, India. Cited by: [§IV](https://arxiv.org/html/2606.28441#S4.p1.5).
- [29] S. Khan et al. (2023) Optimal Kalman filter with information-weighted consensus. IEEE Trans. Autom. Control 68(9), pp. 5624–5629. External Links: [Document](https://dx.doi.org/10.1109/TAC.2022.3220528). Cited by: [§II-A](https://arxiv.org/html/2606.28441#S2.SS1.p2.1).
- [30] W. Kim and Y. Sung (2023) Parameter sharing with network pruning for scalable multi-agent deep reinforcement learning. In Proc. AAMAS, London, United Kingdom. Cited by: [§IV](https://arxiv.org/html/2606.28441#S4.p1.5).
- [31] D. P. Kingma and J. Ba (2017) Adam: a method for stochastic optimization. arXiv preprint: 1412.6980. External Links: [Link](https://arxiv.org/abs/1412.6980). Cited by: [12](https://arxiv.org/html/2606.28441#alg2.l12.2).
- [32] L. Kraemer and B. Banerjee (2016) Multi-agent reinforcement learning as a rehearsal for decentralized planning. Neurocomputing 190, pp. 82–94. External Links: ISSN 0925-2312, [Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.neucom.2016.01.031), [Link](https://www.sciencedirect.com/science/article/pii/S0925231216000783). Cited by: [1st item](https://arxiv.org/html/2606.28441#S1.I1.i1.p1.1.1.1), [§I](https://arxiv.org/html/2606.28441#S1.p4.1).
- [33] R. G. Krishnan et al. (2015) Deep Kalman filters. arXiv preprint: 1511.05121. External Links: [Link](https://arxiv.org/abs/1511.05121). Cited by: [§II-B](https://arxiv.org/html/2606.28441#S2.SS2.p2.1).
- [34] R. G. Krishnan et al. (2017) Structured inference networks for nonlinear state space models. In Proc. AAAI, San Francisco, California, USA. Cited by: [§II-B](https://arxiv.org/html/2606.28441#S2.SS2.p2.1).
- [35] R. E. Larson et al. (1967) Application of the extended Kalman filter to ballistic trajectory estimation. Stanford Research Institute, Tech. Rep. Cited by: [§I](https://arxiv.org/html/2606.28441#S1.p1.1.1.1).
- [36] B. Lian et al. (2022) Distributed Kalman consensus filter for estimation with moving targets. IEEE Trans. Cybern. 52(6), pp. 5242–5254. External Links: [Document](https://dx.doi.org/10.1109/TCYB.2020.3029007). Cited by: [§II-A](https://arxiv.org/html/2606.28441#S2.SS1.p2.1).
- [37] Q. Liu et al. (2024) Distributed Kalman filtering under two-bitrate periodic coding strategies. IEEE Trans. Autom. Control 69(12), pp. 8633–8646. External Links: [Document](https://dx.doi.org/10.1109/TAC.2024.3413009). Cited by: [§II-A](https://arxiv.org/html/2606.28441#S2.SS1.p2.1).
- [38] R. Lowe et al. (2017) Multi-agent actor-critic for mixed cooperative-competitive environments. In Proc. NeurIPS, Long Beach, California, USA. Cited by: [1st item](https://arxiv.org/html/2606.28441#S1.I1.i1.p1.1.1.1), [§I](https://arxiv.org/html/2606.28441#S1.p4.1).
- [39] A. Moradi et al. (2022) Privacy-preserving distributed Kalman filtering. IEEE Trans. Signal Process. 70, pp. 3074–3089. External Links: [Document](https://dx.doi.org/10.1109/TSP.2022.3182590). Cited by: [§II-A](https://arxiv.org/html/2606.28441#S2.SS1.p2.1).
- [40] R. Olfati-Saber (2007) Distributed Kalman filtering for sensor networks. In Proc. IEEE CDC, New Orleans, Louisiana, USA. External Links: [Document](https://dx.doi.org/10.1109/CDC.2007.4434303). Cited by: [1st item](https://arxiv.org/html/2606.28441#S1.I1.i1.p1.1.1), [§III-B](https://arxiv.org/html/2606.28441#S3.SS2.p3.3).
- [41] R. Olfati-Saber et al. (2007) Consensus and cooperation in networked multi-agent systems. Proc. IEEE 95(1), pp. 215–233. External Links: [Document](https://dx.doi.org/10.1109/JPROC.2006.887293). Cited by: [item 2](https://arxiv.org/html/2606.28441#S4.I1.i2.p1.1).
- [42] R. Olfati-Saber (2009) Kalman-consensus filter: optimality, stability, and performance. In Proc. IEEE CDC, Shanghai, China. External Links: [Document](https://dx.doi.org/10.1109/CDC.2009.5399678). Cited by: [§I](https://arxiv.org/html/2606.28441#S1.p4.1), [§II-A](https://arxiv.org/html/2606.28441#S2.SS1.p1.1), [§III-B](https://arxiv.org/html/2606.28441#S3.SS2.p2.1), [§IV-A](https://arxiv.org/html/2606.28441#S4.SS1.p1.19), [Remark 2](https://arxiv.org/html/2606.28441#Thmremark2.p1.5.5), [Remark 3](https://arxiv.org/html/2606.28441#Thmremark3.p1.6.6).
- [43] OpenAI (2019) Dota 2 with large scale deep reinforcement learning. arXiv preprint: 1912.06680. Cited by: [§IV](https://arxiv.org/html/2606.28441#S4.p1.5).
- [44] H. X. Pham et al. (2017) A distributed control framework for a team of unmanned aerial vehicles for dynamic wildfire tracking. In Proc. IEEE/RSJ IROS, Vancouver, Canada. External Links: [Document](https://dx.doi.org/10.1109/IROS.2017.8206579). Cited by: [§I](https://arxiv.org/html/2606.28441#S1.p1.1.1.1.2).
- [45] M. Raissi et al. (2019) Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Physics 378, pp. 686–707. External Links: ISSN 0021-9991, [Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.jcp.2018.10.045), [Link](https://www.sciencedirect.com/science/article/pii/S0021999118307125). Cited by: [§II-B](https://arxiv.org/html/2606.28441#S2.SS2.p3.1).
- [46] B. S. Y. Rao et al. (1993) A fully decentralized multi-sensor system for tracking and surveillance. Intern. J. Robotics Research 12(1), pp. 20–44. External Links: [Document](https://dx.doi.org/10.1177/027836499301200102). Cited by: [§II-A](https://arxiv.org/html/2606.28441#S2.SS1.p1.1).
- [47] W. Ren and R. W. Beard (2005) Consensus seeking in multiagent systems under dynamically changing interaction topologies. IEEE Trans. Autom. Control 50(5), pp. 655–661. External Links: [Document](https://dx.doi.org/10.1109/TAC.2005.846556). Cited by: [item 1](https://arxiv.org/html/2606.28441#S4.I1.i1.p1.1).
- [48] G. Revach et al. (2022) KalmanNet: neural network aided Kalman filtering for partially known dynamics. IEEE Trans. Signal Process. 70, pp. 1532–1547. External Links: [Document](https://dx.doi.org/10.1109/TSP.2022.3158588). Cited by: [§I](https://arxiv.org/html/2606.28441#S1.p3.1), [§II-B](https://arxiv.org/html/2606.28441#S2.SS2.p1.1), [§III-A](https://arxiv.org/html/2606.28441#S3.SS1.p2.16), [§III-A](https://arxiv.org/html/2606.28441#S3.SS1.p3.3), [§III-C](https://arxiv.org/html/2606.28441#S3.SS3.p2.1), [§IV-A](https://arxiv.org/html/2606.28441#S4.SS1.p1.14).
- [49] G. Revach et al. (2023) RTSNet: learning to smooth in partially known state-space models. IEEE Trans. Signal Process. 71, pp. 4441–4456. External Links: [Document](https://dx.doi.org/10.1109/TSP.2023.3329964). Cited by: [§II-B](https://arxiv.org/html/2606.28441#S2.SS2.p2.1), [§III-C](https://arxiv.org/html/2606.28441#S3.SS3.p2.1).
- [50] V. G. Satorras et al. (2019) Combining generative and discriminative models for hybrid inference. In Proc. NeurIPS, Vancouver, Canada. Cited by: [§I](https://arxiv.org/html/2606.28441#S1.p3.1), [§II-B](https://arxiv.org/html/2606.28441#S2.SS2.p1.1), [§II-B](https://arxiv.org/html/2606.28441#S2.SS2.p2.1).
- [51] A. M. Sayeed (2002) Deconstructing multiantenna fading channels. IEEE Trans. Signal Process. 50(10), pp. 2563–2579. Cited by: [Appendix A](https://arxiv.org/html/2606.28441#A1.p1.11).
- [52] C. Shi et al. (2026) IMAS2: joint agent selection and information-theoretic coordinated perception in Dec-POMDPs. In Proc. AAMAS, Paphos, Cyprus. Cited by: [§I](https://arxiv.org/html/2606.28441#S1.p1.1.1).
- [53] N. Shlezinger et al. (2023) Model-based deep learning. Proc. IEEE 111(5), pp. 465–499. External Links: [Document](https://dx.doi.org/10.1109/JPROC.2023.3247480). Cited by: [§II-B](https://arxiv.org/html/2606.28441#S2.SS2.p3.1).
- [54] N. Shlezinger et al. (2025) Artificial intelligence-aided Kalman filters: AI-augmented designs for Kalman-type algorithms. IEEE Signal Process. Mag., pp. 2–26. External Links: [Document](https://dx.doi.org/10.1109/MSP.2025.3569395). Cited by: [§I](https://arxiv.org/html/2606.28441#S1.p2.1), [§I](https://arxiv.org/html/2606.28441#S1.p3.1).
- [55] R. Solomon et al. (2026) LumiMAS: a comprehensive framework for real-time monitoring and enhanced observability in multi-agent systems. In Proc. AAMAS, Paphos, Cyprus. Cited by: [§I](https://arxiv.org/html/2606.28441#S1.p1.1.1).
- [56] G. Stamatelis and G. C. Alexandropoulos (2026) Filtering Markov jump systems with partially known dynamics: a model-based deep learning approach. IEEE Trans. Signal Process. (Early Access). Cited by: [§I](https://arxiv.org/html/2606.28441#S1.p3.1), [§II-B](https://arxiv.org/html/2606.28441#S2.SS2.p1.1), [§III-A](https://arxiv.org/html/2606.28441#S3.SS1.p2.16), [§IV-A](https://arxiv.org/html/2606.28441#S4.SS1.p1.14).
- [57] K. Stylianopoulos et al. (2025) Graph-CNNs for RF imaging: learning the electric field integral equations. In Proc. EUSIPCO, Palermo, Italy. Cited by: [§II-B](https://arxiv.org/html/2606.28441#S2.SS2.p3.1), [item R1](https://arxiv.org/html/2606.28441#S3.I2.ix1.p1.2).
- [58] I. Urteaga et al. (2016) Sequential Monte Carlo methods under model uncertainty. In Proc. IEEE SSP, Palma de Mallorca, Spain. Cited by: [§II-A](https://arxiv.org/html/2606.28441#S2.SS1.p3.1).
- [59] V. Vahidpour et al. (2019) Partial diffusion Kalman filtering for distributed state estimation in multiagent networks. IEEE Trans. Neural Netw. Learn. Syst. 30(12), pp. 3839–3846. External Links: [Document](https://dx.doi.org/10.1109/TNNLS.2019.2899052). Cited by: [§I](https://arxiv.org/html/2606.28441#S1.p2.1).
- [60] D. Valencia et al. (2025) CTD4 – a deep continuous distributional actor-critic agent with a Kalman fusion of multiple critics. In Proc. AAAI, Philadelphia, Pennsylvania, USA. Cited by: [§I](https://arxiv.org/html/2606.28441#S1.p2.1).
- [61] A. Vaswani et al. (2017) Attention is all you need. arXiv preprint: 1706.03762. Cited by: [item R1](https://arxiv.org/html/2606.28441#S3.I2.ix1.p1.2), [§IV](https://arxiv.org/html/2606.28441#S4.p2.7).
- [62] E. A. Wan and R. Van Der Merwe (2000) The unscented Kalman filter for nonlinear estimation. In Proc. IEEE AS-SPCC, Alberta, Canada. Cited by: [§I](https://arxiv.org/html/2606.28441#S1.p2.1).
- [63] H. Wymeersch and G. Seco-Granados (2022) Radio localization and sensing – Part I: fundamentals. IEEE Commun. Lett. 26(12), pp. 2816–2820. Cited by: [§V-C](https://arxiv.org/html/2606.28441#S5.SS3.p1.3).
- [64] C. Zhou et al. (2025) Dual-balancing for physics-informed neural networks. In Proc. IJCAI, Montreal, Canada. Cited by: [§II-B](https://arxiv.org/html/2606.28441#S2.SS2.p3.1).
- [65] R. Zhu et al. (2025) U-PINet: physics-informed hierarchical learning for radar cross section prediction via 3D electromagnetic scattering reconstruction. arXiv preprint: 2508.03774. Cited by: [§II-B](https://arxiv.org/html/2606.28441#S2.SS2.p3.1).