Time-Varying Deep State Space Models for Sequences with Switching Dynamics
Summary
The paper proposes a class of time-varying deep state-space models where dynamics are learned via a basis function expansion, enabling adaptive modeling of switching systems. The approach outperforms time-invariant counterparts on synthetic switching data and a speech denoising task.
# Time-Varying Deep State Space Models for Sequences with Switching Dynamics
Source: [https://arxiv.org/html/2605.15311](https://arxiv.org/html/2605.15311)
Sanja Karilanova, Subhrakanti Dey, Ayça Özçelikkale
Department of Electrical Engineering, Uppsala University, Sweden (e-mail: {Sanja.Karilanova, Subhrakanti.Dey, Ayca.Ozcelikkale}@angstrom.uu.se).
###### Abstract
The identification and modeling of time-varying systems is a fundamental challenge in signal processing and system identification. To address this challenge, we propose a class of time-varying state-space model (SSM) based neural networks in which the neurons' states are governed by time-varying dynamics. The proposed model provides learnable time-varying dynamics through a dictionary of basis functions, where each basis function evolves differently over time. We evaluate the proposed approach on both synthetic data from switching systems and a speech denoising task where real audio is corrupted with noise that has switching dynamics. The results show that the proposed time-varying model consistently outperforms its time-invariant counterparts while maintaining comparable computational complexity. Our investigations also reveal which aspects of the time-varying dynamics of the data most need to be captured by the proposed time-varying models, how the additional freedom provided by time-varying basis functions should be allocated across model components, and to what extent larger models can compensate for time-invariant limitations.
###### keywords:
System identification, Time\-varying, Deep learning, Switching State\-Space Models, Deep State\-Space Models
S. Karilanova acknowledges the support of the Center for Interdisciplinary Mathematics (CIM), Uppsala University. A. Özçelikkale acknowledges the support from the Swedish Research Council through grant agreement no. 2024-05194. The computations were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council through grant agreement no. 2022-06725.
## 1 Introduction
Figure 1: Left: Example network architecture with an input layer with a single channel, two hidden layers with SSM neurons, and an output layer of two channels. Middle: An SSM neuron with time-varying dynamics. Right: Illustration of the proposed basis function expansion for a single time-varying element of the SSM state transition matrix.
Many real-world processes are inherently time-varying, meaning that their underlying dynamics evolve over time. This time-varying behavior appears in various domains, such as financial markets, weather forecasting, energy consumption, and neural activity analysis. A central challenge in system identification is therefore to build models that are flexible enough to capture such evolving dynamics Hua et al. ([2023](https://arxiv.org/html/2605.15311#bib.bib35)); Liu et al. ([2022](https://arxiv.org/html/2605.15311#bib.bib18)).
State Space Models (SSMs) provide a principled and interpretable framework for system identification. In their classical form with linear state transitions, often referred to as Linear Dynamical Systems (LDS) Kalman ([1963](https://arxiv.org/html/2605.15311#bib.bib48)), they assume fixed time-invariant transition dynamics.
A common extension of LDSs for handling time-varying data is the Switching Linear Dynamical System (SLDS) Sun and Ge ([2005](https://arxiv.org/html/2605.15311#bib.bib43)), where the dynamics transition between multiple LDS modes. While effective in capturing regime changes, learning SLDSs remains challenging, as it requires estimating the number of modes, the model order of each mode, and the corresponding parameters Paoletti et al. ([2007](https://arxiv.org/html/2605.15311#bib.bib10)).
Another extension of LDSs, distinct from SLDSs, is the deep SSM. These models stack LDS blocks into layers with nonlinear transformations. They have recently achieved performance comparable to transformers on long-sequence modeling tasks Gu et al. ([2022b](https://arxiv.org/html/2605.15311#bib.bib26)); Smith et al. ([2023](https://arxiv.org/html/2605.15311#bib.bib22)), and have also been applied to nonlinear system identification Gedon et al. ([2021](https://arxiv.org/html/2605.15311#bib.bib54)). Extensions to deep SSMs include input-dependent mechanisms Gu and Dao ([2024](https://arxiv.org/html/2605.15311#bib.bib52)) and formulations that are invariant to the input timescale Gu et al. ([2020](https://arxiv.org/html/2605.15311#bib.bib53)).
System identification research has long explored time-varying and non-stationary behavior through methods such as adaptive filtering, basis function expansions Tsatsanis and Giannakis ([1993](https://arxiv.org/html/2605.15311#bib.bib38)); Grenier ([1983](https://arxiv.org/html/2605.15311#bib.bib47)); Niedzwiecki ([1988](https://arxiv.org/html/2605.15311#bib.bib45)); Zou et al. ([2003](https://arxiv.org/html/2605.15311#bib.bib44)), and change point detection van den Burg and Williams ([2020](https://arxiv.org/html/2605.15311#bib.bib20)). More recently, deep learning has also become a powerful tool for system identification Ljung et al. ([2020](https://arxiv.org/html/2605.15311#bib.bib55)), including approaches based on time-varying neural networks, such as dynamic neural networks Hua et al. ([2023](https://arxiv.org/html/2605.15311#bib.bib35)) and non-stationary transformer architectures Liu et al. ([2022](https://arxiv.org/html/2605.15311#bib.bib18)).
Our work brings the time-varying basis function expansion methodology into contemporary deep SSM architectures and illustrates its performance on system identification tasks. Unlike explicit modeling through SLDSs, our framework requires no explicit specification of switching modes and instead models smooth, continuously time-varying dynamics, enabling efficient training through standard backpropagation through time (BPTT). In contrast to modeling time variation in deep SSMs through input-dependence, our framework explicitly models time-dependent dynamics independent of the input. Figure [1](https://arxiv.org/html/2605.15311#S1.F1) provides a visualization of our proposed framework.
The key contributions of this paper are as follows:
- We propose a novel time-varying deep SSM framework using learnable basis functions to parameterize state transition, input, and output matrices over time.
- We provide experimental validation on both synthetic data and real-world audio in denoising scenarios with non-stationary noise dynamics.
- We explore the trade-offs in the proposed model architecture, including the effects of different parts of the SSM dynamics and of basis function allocation, and compare it with its time-invariant counterpart.
Overall, our results show that the proposed time\-varying model consistently outperforms its time\-invariant counterparts under switching dynamics data while retaining comparable computational complexity\.
## 2 Methods
### 2.1 Preliminaries
#### 2.1.1 Time-invariant SSM model:
A time-invariant linear discrete-time SSM is given by Ljung ([1987](https://arxiv.org/html/2605.15311#bib.bib46))
$$\begin{aligned} \bm{x}[t] &= \bm{A}\bm{x}[t-1] + \bm{B}\bm{u}[t-1] && \text{(1a)} \\ \bm{y}[t] &= \bm{C}\bm{x}[t], && \text{(1b)} \end{aligned}$$
with possibly learnable matrix parameters $\bm{A}\in\mathbb{R}^{n\times n}$, $\bm{B}\in\mathbb{R}^{n\times n_{in}}$, $\bm{C}\in\mathbb{R}^{n_{out}\times n}$, $\bm{D}\in\mathbb{R}^{n_{out}\times n_{in}}$, input $\bm{u}[t]\in\mathbb{R}^{n_{in}\times 1}$, state variable $\bm{x}[t]\in\mathbb{R}^{n\times 1}$, and output $\bm{y}[t]\in\mathbb{R}^{n_{out}\times 1}$.
We refer to the SSM in ([1](https://arxiv.org/html/2605.15311#S2.E1)) as a neuron in the context of an SSM based neural network. Multiple neurons with such SSM dynamics may be used to form an SSM layer, see Figure [1](https://arxiv.org/html/2605.15311#S1.F1). Layers with SSM dynamics are combined with possibly non-linear activation function layers and mixing layers to form a deep SSM network Gu et al. ([2022b](https://arxiv.org/html/2605.15311#bib.bib26)); Smith et al. ([2023](https://arxiv.org/html/2605.15311#bib.bib22)).
#### 2.1.2 Basis function expansion:
Basis function expansion is a way of representing a function, $f(t)$, as a linear combination of simpler functions, $\phi^{(k)}(\cdot)$, called basis functions, as follows:
$$f(t)=\sum_{k=1}^{K}\alpha^{(k)}\times\phi^{(k)}(t), \tag{2}$$
where $\alpha^{(k)}\in\mathbb{R}$ are the coefficients. Basis functions may take various forms, such as Fourier series, Gaussian functions, polynomial functions, rational orthogonal functions, and radial basis functions. For example, we can have $\phi^{(k)}(t)=\mathcal{N}(t|\mu_{k},\sigma_{k}^{2})$ for a dictionary of Gaussian functions, with $\mathcal{N}(\cdot)$ representing the shape of a Gaussian function, and $\phi^{(k)}(t)=\sin(w_{k}t+\phi_{k})$ for a dictionary of sine functions, where $\mu_{k},\sigma_{k},w_{k},\phi_{k}$ are fixed pre-defined scalars (in the sine case, the scalar $\phi_{k}$ is a phase offset, not a basis function).
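The expansion in (2) can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the helper names `gaussian_basis` and `expand`, the unit-amplitude (unnormalized) Gaussian shape, and all numeric values are assumptions chosen for the example.

```python
import math

def gaussian_basis(mu, sigma):
    """Unnormalized Gaussian bump with peak amplitude 1 (illustrative choice)."""
    return lambda t: math.exp(-0.5 * ((t - mu) / sigma) ** 2)

def expand(alphas, basis, t):
    """Evaluate f(t) = sum_k alpha^(k) * phi^(k)(t), as in Eq. (2)."""
    return sum(a * phi(t) for a, phi in zip(alphas, basis))

# Hypothetical dictionary: one constant function and two Gaussian bumps.
basis = [lambda t: 1.0,
         gaussian_basis(mu=2.0, sigma=1.0),
         gaussian_basis(mu=6.0, sigma=1.0)]
alphas = [0.5, 1.0, -1.0]

# At t = 2 the first bump is at its peak and the second is almost zero.
f_at_2 = expand(alphas, basis, 2.0)
```

Only the coefficients `alphas` would be learned; the basis functions themselves stay fixed, which keeps the expansion linear in the trainable quantities.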
### 2.2 Proposed time-varying SSM model
We propose the following time-varying SSM model
$$\begin{aligned} \bm{x}[t] &= \bm{A}[t]\bm{x}[t-1] + \bm{B}[t]\bm{u}[t-1] && \text{(3a)} \\ \bm{y}[t] &= \bm{C}[t]\bm{x}[t], && \text{(3b)} \end{aligned}$$
where $\bm{A}[t],\bm{B}[t],\bm{C}[t]$ have the same dimensions as in the time-invariant model in ([1](https://arxiv.org/html/2605.15311#S2.E1)). However, each element of these matrices is now a linear combination of basis functions. In particular, for $\bm{A}[t]$, we have
$$[\bm{A}[t]]_{ij}=a_{i,j}[t]=\sum_{k=1}^{K_{A}}a_{i,j}^{(k)}\times\phi_{A,i,j}^{(k)}[t] \tag{4}$$
where $[\bm{A}[t]]_{ij}=a_{i,j}[t]$ denotes the element in the $i$th row and $j$th column of the matrix $\bm{A}[t]$. An illustration of the expansion in ([4](https://arxiv.org/html/2605.15311#S2.E4)) is provided in Figure [1](https://arxiv.org/html/2605.15311#S1.F1). We note that the number of basis functions $K_{A}$ is independent of any other dimensions in the model. Here we present the general case where the basis function $\phi_{A,i,j}^{(k)}$ may differ across SSM matrices $(A,B,C)$, indices $(i,j)$, and expansion terms $(k)$. In practice, this heterogeneity can be reduced by using a smaller shared dictionary of basis functions.
Similar to $\bm{A}[t]$, we have the following:
$$\begin{aligned} [\bm{B}[t]]_{ij} &= b_{i,j}[t]=\sum_{k=1}^{K_{B}}b_{i,j}^{(k)}\times\phi_{B,i,j}^{(k)}[t] && \text{(5)} \\ [\bm{C}[t]]_{ij} &= c_{i,j}[t]=\sum_{k=1}^{K_{C}}c_{i,j}^{(k)}\times\phi_{C,i,j}^{(k)}[t]. && \text{(6)} \end{aligned}$$
We note that $K_{A}$, $K_{B}$, $K_{C}$ can be chosen independently of each other. Hence, any subset of $\bm{A}[t],\bm{B}[t],\bm{C}[t]$ can be kept time-invariant, as in the time-invariant model ([1](https://arxiv.org/html/2605.15311#S2.E1)), independently of the others.
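The recursion in (3) with the expansions (4) to (6) can be sketched as follows for a single neuron. This is an illustrative sketch under simplifying assumptions that are not stated in the paper: a diagonal $\bm{A}[t]$, scalar input and output, a zero initial state, and one dictionary `phis` shared by all matrix elements. The names `mix` and `tv_ssm_forward` are hypothetical.

```python
def mix(coeffs, phis, t):
    """Element value at time t: sum_k coeff^(k) * phi^(k)[t] (Eqs. (4)-(6))."""
    return sum(c * phi(t) for c, phi in zip(coeffs, phis))

def tv_ssm_forward(u, a_coef, b_coef, c_coef, phis):
    """Run Eq. (3) for one neuron with diagonal A[t] and scalar input/output.

    u      : list of T input samples
    a_coef : n lists of coefficients, one list per diagonal entry of A[t]
    b_coef : n lists of coefficients, one list per entry of B[t]
    c_coef : n lists of coefficients, one list per entry of C[t]
    phis   : basis functions shared by all elements (simplifying assumption)
    """
    n = len(a_coef)
    x = [0.0] * n                      # assumed initial state x[0] = 0
    y = []
    for t in range(1, len(u)):
        # Eq. (3a): x[t] = A[t] x[t-1] + B[t] u[t-1]
        x = [mix(a_coef[i], phis, t) * x[i] + mix(b_coef[i], phis, t) * u[t - 1]
             for i in range(n)]
        # Eq. (3b): y[t] = C[t] x[t]
        y.append(sum(mix(c_coef[i], phis, t) * x[i] for i in range(n)))
    return y

# With a single constant basis function the model reduces to the
# time-invariant SSM in Eq. (1): here a = 0.5, b = 1, c = 1.
const = lambda t: 1.0
y = tv_ssm_forward([1.0, 0.0, 0.0, 0.0], [[0.5]], [[1.0]], [[1.0]], [const])
```

The impulse input in the usage lines produces the geometric decay of a first-order time-invariant system, which is a quick sanity check that the time-varying code contains the time-invariant model as a special case.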
### 2.3 Stability
A central question in deep SSM development is how to ensure stability of the SSM dynamics. The stability of the dynamics in ([3](https://arxiv.org/html/2605.15311#S2.E3)) is controlled by $\bm{A}[t]$. For simplicity of presentation, we now assume that $\bm{A}[t]$ is a diagonal matrix and that the basis functions satisfy $|\phi_{A,i,j}^{(k)}[t]|\leq 1$. To ensure stability, the modulus of the diagonal entries (eigenvalues) of $\bm{A}[t]$ must be strictly smaller than $1$, i.e., $|a_{i,i}[t]|<1$. Using the triangle inequality, we have
$$\begin{aligned} |a_{i,i}[t]| &= \left|\sum_{k=1}^{K_{A}}a_{i,i}^{(k)}\times\phi_{A,i,i}^{(k)}[t]\right| && \text{(7a)} \\ &\leq \sum_{k=1}^{K_{A}}\left|a_{i,i}^{(k)}\times\phi_{A,i,i}^{(k)}[t]\right| \leq \sum_{k=1}^{K_{A}}\left|a_{i,i}^{(k)}\right|. && \text{(7b)} \end{aligned}$$
Hence, if $\sum_{k=1}^{K_{A}}|a_{i,i}^{(k)}|<1$ is guaranteed, then $|a_{i,i}[t]|<1$, which implies stability of the dynamics governed by $\bm{A}[t]$. During training, we enforce the stability condition, $\sum_{k=1}^{K_{A}}|a_{i,i}^{(k)}|<1$, by checking it once per forward pass rather than at each time step. If this condition is violated, i.e., $\sum_{k=1}^{K_{A}}|a_{i,i}^{(k)}|=c\geq 1$, then we apply a scaling strategy to enforce the constraint by redefining the coefficients as $\widehat{a}_{i,i}^{(k)}=\frac{1}{c+\epsilon}a_{i,i}^{(k)}$ with $\epsilon>0$. With this scaling, stability is ensured, since
$$\sum_{k=1}^{K_{A}}\left|\widehat{a}_{i,i}^{(k)}\right|=\frac{1}{c+\epsilon}\sum_{k=1}^{K_{A}}\left|a_{i,i}^{(k)}\right|=\frac{c}{c+\epsilon}<1. \tag{8}$$
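The rescaling step can be sketched as below. This is a hedged illustration of the rule described above, not the authors' code: the function name `enforce_stability`, the default `eps` value, and the example coefficients are assumptions.

```python
def enforce_stability(a_coef, eps=1e-6):
    """Rescale each state's coefficients so that sum_k |a^(k)| < 1.

    a_coef : list of per-state coefficient lists for the diagonal of A[t]
    eps    : small positive constant (the value here is an assumption)
    """
    scaled = []
    for coeffs in a_coef:
        c = sum(abs(a) for a in coeffs)
        if c >= 1.0:
            # Condition violated: divide by (c + eps) so the new sum is c / (c + eps) < 1.
            coeffs = [a / (c + eps) for a in coeffs]
        scaled.append(list(coeffs))
    return scaled

# First state violates the bound (0.9 + 0.6 = 1.5); the second already satisfies it.
a_hat = enforce_stability([[0.9, -0.6], [0.2, 0.3]])
```

Because the check uses only the coefficients and not the basis function values, it runs once per forward pass regardless of the sequence length, which matches the procedure described in the text.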
### 2.4 Parameter count
In ([1](https://arxiv.org/html/2605.15311#S2.E1)), each element of $\bm{A}$, $\bm{B}$, $\bm{C}$ is a single scalar learnable parameter. In contrast, each element of $\bm{A}[t]$, $\bm{B}[t]$ and $\bm{C}[t]$ in ([3](https://arxiv.org/html/2605.15311#S2.E3)) is a linear combination of $K_{A}$, $K_{B}$ and $K_{C}$ basis functions, respectively, where each coefficient is trainable, so each element corresponds to $K_{A}$, $K_{B}$ and $K_{C}$ learnable parameters. Hence, the proposed model scales the number of trainable parameters associated with $\bm{A}$, $\bm{B}$ and $\bm{C}$ in each neuron by factors of $K_{A}$, $K_{B}$ and $K_{C}$, respectively.
Table 1: Trainable parameters in the baseline (time-invariant SSM) and the scaling factor for the proposed time-varying SSM.
Table [1](https://arxiv.org/html/2605.15311#S2.T1) summarizes the parameter counts and scaling factors, where $h$ denotes the number of neurons in a layer governed by either ([1](https://arxiv.org/html/2605.15311#S2.E1)) or ([3](https://arxiv.org/html/2605.15311#S2.E3)); $\bm{W}$ denotes the weight matrix for the mixing layer between an SSM layer with $h$ neurons each with $n_{out}$ outputs and $n_{in}$ inputs; $\bm{C}_{bias}$ is a learnable output bias added to ([1b](https://arxiv.org/html/2605.15311#S2.E1.2)) and ([3b](https://arxiv.org/html/2605.15311#S2.E3.2)).
In many cases, the neural network has a large number of SSM neurons with a low dimensional state space, i.e., $n\ll h$; see, e.g., the neuromorphic computing benchmark of Yik et al. ([2025](https://arxiv.org/html/2605.15311#bib.bib16)). Hence, under moderate dictionary sizes, the additional number of learnable parameters brought by the proposed time-varying model is expected to be low compared to the total model size.
For $n_{out}=n_{in}=1$ and diagonal $\bm{A}$, the total number of learnable parameters per neuron is $p_{invary}=3n_{invar}+1$ in the time-invariant case and $p_{vary}=n_{vary}(K_{A}+K_{B}+K_{C})+1$ in the time-varying case, where $n_{invar}$ and $n_{vary}$ are the state dimensions of the neuron in the time-invariant and time-varying case, respectively. As shown in Section [4.2](https://arxiv.org/html/2605.15311#S4.SS2), the time-varying model can outperform the time-invariant one even when $p_{vary}=p_{invary}$.
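The two per-neuron counts can be checked with a short calculation. The function names below are hypothetical; the numeric values are the ones quoted in the caption of Table 3 (state dimensions 16 and 4, with four basis functions per matrix).

```python
def params_invariant(n_invar):
    """Diagonal A plus vectors B and C (n_in = n_out = 1), plus one output bias."""
    return 3 * n_invar + 1

def params_varying(n_vary, k_a, k_b, k_c):
    """Each of the n_vary entries of A, B and C carries K_A, K_B or K_C coefficients."""
    return n_vary * (k_a + k_b + k_c) + 1

# Configuration from the Table 3 caption: n_invar = 16, n_vary = 4, K_A = K_B = K_C = 4.
p_invar = params_invariant(16)
p_vary = params_varying(4, 4, 4, 4)
```

Both counts come out equal for this configuration, so a time-varying neuron with a quarter of the state dimension uses the same parameter budget as its time-invariant counterpart.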
### 2.5 Inference complexity
The learnable coefficients $a_{i,j}^{(k)},b_{i,j}^{(k)},c_{i,j}^{(k)}$ are determined during training based on the training data. For the inference phase, i.e., when the model performs a forward pass to evaluate on test data, the matrices $\bm{A}[t],\bm{B}[t],\bm{C}[t]$ can be pre-computed using these coefficients. Hence, during inference, the number of Multiply-Accumulate (MAC) operations is the same for the time-invariant model in ([1](https://arxiv.org/html/2605.15311#S2.E1)) and the proposed time-varying model ([3](https://arxiv.org/html/2605.15311#S2.E3)) if the network architecture is otherwise the same.
Hence, even if the number of trainable parameters of the time-varying model is higher than that of the time-invariant model for fixed $n,n_{in},n_{out}$, the inference MAC complexity of the two models is the same. Nevertheless, under this strategy of pre-computing $\bm{A}[t],\bm{B}[t],\bm{C}[t]$, the space complexity of the time-varying model is higher, scaling with the sequence length $T$, due to the storage of time-dependent matrix parameters.
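The pre-computation trade-off can be illustrated for one matrix element. This sketch is an assumption about how tabulation could be done, not a description of the authors' implementation; `precompute`, the two example basis functions, and the coefficient values are hypothetical.

```python
import math

def precompute(coeffs, phis, T):
    """Tabulate one time-varying element for t = 0..T-1 before inference."""
    return [sum(c * phi(t) for c, phi in zip(coeffs, phis)) for t in range(T)]

T = 128
# Hypothetical dictionary: a constant and one Gaussian bump centered at t = 64.
phis = [lambda t: 1.0,
        lambda t: math.exp(-0.5 * ((t - 64.0) / 10.0) ** 2)]

# Two learned coefficients become a table of T values.
a_table = precompute([0.3, 0.4], phis, T)
```

After tabulation, each inference step reads `a_table[t]` and performs the same multiply-accumulate as a time-invariant model would, at the cost of storing `T` values per element instead of one.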
## 3 Experimental set-up
#### 3.0.1 Model initialization and training
For all experiments in Section [4](https://arxiv.org/html/2605.15311#S4), we use a network with an input layer, one or two hidden layers, and an output layer. The input and output layers correspond to the data channels and the task-specific output channels, respectively. Each hidden layer contains $h$ SSM neurons with either identity or Gaussian Error Linear Unit (GELU) activation. The trainable parameters in the networks are the following: $\bm{A}_{\text{coeff}},\bm{B}_{\text{coeff}},\bm{C}_{\text{coeff}}$ for each SSM neuron, the weight connections between layers $\bm{W}$, and the normalization layers' mean and variance scale. We initialize $\bm{B}_{\text{coeff}}$ as ones, $\bm{C}_{\text{coeff}}\sim U[0,1)$, $\bm{C}_{bias}$ as zeros, and $\bm{A}_{\text{coeff}}$ using the real part of the S4D-Lin initialization Gu et al. ([2022a](https://arxiv.org/html/2605.15311#bib.bib27)), expanded across the $K_{A}$ dimension by setting $a_{i,j}^{(k)}=a_{i,j}/K_{A}$. We initialize $\bm{W}$ uniformly, use batch normalization between layers, and train with BPTT and AdamW. All models are trained on an NVIDIA Tesla T4 GPU (16 GB RAM). Further details are in Appendix [B](https://arxiv.org/html/2605.15311#A2).
#### 3.0.2 Choice of basis functions
For the basis function expansion of the elements of $\bm{A}[t],\bm{B}[t],\bm{C}[t]$, we use a dictionary of Gaussian functions and an additional constant function. Specifically, if $K$ is the number of basis functions and $T$ is the sequence length, then per element we have $1$ constant function and $K-1$ Gaussian-shaped functions with amplitude $1$, mean drawn from $U(0,T)$, and standard deviation drawn from $U\left(\frac{T}{5(K-1)+1},\frac{T}{(K-1)/3+1}\right)$.
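A dictionary of this form could be sampled as in the sketch below. The function name `make_dictionary`, the use of Python's `random.Random` for the uniform draws, and the seed are assumptions made for illustration; only the ranges for the mean and standard deviation come from the description above.

```python
import math
import random

def make_dictionary(K, T, rng):
    """One constant function plus K-1 unit-amplitude Gaussian bumps."""
    phis = [lambda t: 1.0]
    lo = T / (5 * (K - 1) + 1)     # lower bound of the standard deviation range
    hi = T / ((K - 1) / 3 + 1)     # upper bound of the standard deviation range
    for _ in range(K - 1):
        mu = rng.uniform(0.0, T)
        sigma = rng.uniform(lo, hi)
        # Default arguments freeze mu and sigma for this particular bump.
        phis.append(lambda t, mu=mu, sigma=sigma:
                    math.exp(-0.5 * ((t - mu) / sigma) ** 2))
    return phis

# Example: K = 4 basis functions over a sequence of length T = 128,
# giving standard deviations between 128/16 = 8 and 128/2 = 64.
phis = make_dictionary(K=4, T=128, rng=random.Random(0))
```

Every function in the returned list is bounded by one in magnitude, which is the assumption used in the stability argument of Section 2.3.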
#### 3.0.3 Performance metrics for evaluation
The Mean-Square-Error (MSE) $\in[0,\infty)$ is calculated between the vectors $\bm{s}$ and $\widehat{\bm{s}}$ per time step, where MSE $=0$ implies a perfect match, i.e., $\bm{s}=\widehat{\bm{s}}$. Signal-to-noise ratio (SNR) and scale-invariant signal-to-noise ratio (SI-SNR) are used as defined in Roux et al. ([2019](https://arxiv.org/html/2605.15311#bib.bib19)). Higher dB values indicate cleaner speech. While task-dependent, SI-SNR values $<0$ dB imply that the estimate is worse than the noise, $0$ to $5$ dB imply low intelligibility of speech, $5$ to $15$ dB imply decent enhancement, and $15$ to $20$ dB imply very good enhancement Subakan et al. ([2022](https://arxiv.org/html/2605.15311#bib.bib42)); Wichern et al. ([2019](https://arxiv.org/html/2605.15311#bib.bib41)). All performance values reported use the testing subset of the datasets and are averages over multiple model initializations.
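For concreteness, the two dB metrics can be sketched as follows, using the common definitions in which SI-SNR first projects the estimate onto the reference signal. The function names and the toy signals are assumptions, and the sketch does not guard against a zero error (perfect estimate).

```python
import math

def snr_db(s, s_hat):
    """SNR: reference energy over the energy of the error s - s_hat."""
    signal = sum(a * a for a in s)
    error = sum((a - b) ** 2 for a, b in zip(s, s_hat))
    return 10.0 * math.log10(signal / error)

def si_snr_db(s, s_hat):
    """SI-SNR: rescale the reference by alpha = <s_hat, s> / ||s||^2 first."""
    alpha = sum(a * b for a, b in zip(s, s_hat)) / sum(a * a for a in s)
    target = [alpha * a for a in s]
    signal = sum(a * a for a in target)
    error = sum((a - b) ** 2 for a, b in zip(target, s_hat))
    return 10.0 * math.log10(signal / error)

# Toy example: the estimate is the reference scaled by 2 plus a small error.
s = [1.0, 0.0, -1.0, 0.0]
s_hat = [2.0, 0.2, -2.0, -0.2]
plain = snr_db(s, s_hat)          # penalizes the wrong scale
invariant = si_snr_db(s, s_hat)   # ignores the scale, sees only the small error
```

In this toy case the plain SNR is negative because of the factor-of-two scale mismatch, while the scale-invariant version is high, which shows why SI-SNR is preferred when the output gain of a denoiser is not of interest.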
#### 3.0.4 Trainable parameter count fairness
In a set of experiments in Section [4.2](https://arxiv.org/html/2605.15311#S4.SS2), we match the number of trainable model parameters between time-varying and time-invariant models. Following Section [2.4](https://arxiv.org/html/2605.15311#S2.SS4), we keep the layer architecture the same and, for each SSM neuron, we set $p_{invary}=p_{vary}$, i.e., $n_{invar}=\frac{n_{vary}}{3}(K_{A}+K_{B}+K_{C})$.
#### 3.0.5 Dataset 1: Four mode system
We generate a dataset from an SLDS with $4$ operating modes, with a fixed switching order and equal mode durations. Inputs to the SLDS are linear combinations of two sinusoidal sequences with $N=128$ time steps, phase $\phi\sim U[0,2\pi]$, and frequency $\omega=\frac{2\pi l}{N}$, where $l\sim U[0,N/2]$. The input then passes through the four-mode SLDS and creates the output $y[t]$. We consider variations where each of ($\bm{A}$, $\bm{B}$, $\bm{C}$) independently switches or remains fixed. The dataset consists of $2000$ input-output pairs ($80\%$ train, $20\%$ test), and the task is to reproduce the SLDS output given the input. Further details are provided in Appendix [A](https://arxiv.org/html/2605.15311#A1).
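One input-output pair of this kind could be generated as in the sketch below. Several details are assumptions and may differ from the paper's Appendix A: the state is scalar, the two sinusoids are summed with equal weights, `l` is drawn as a continuous value, the initial state is zero, and the per-mode `(a, b, c)` values are invented for illustration.

```python
import math
import random

N = 128  # time steps per sequence

def make_input(rng):
    """Sum of two sinusoids with random phase and frequency 2*pi*l/N."""
    u = [0.0] * N
    for _ in range(2):
        phase = rng.uniform(0.0, 2.0 * math.pi)
        l = rng.uniform(0.0, N / 2)            # continuous draw is an assumption
        omega = 2.0 * math.pi * l / N
        u = [ui + math.sin(omega * t + phase) for t, ui in enumerate(u)]
    return u

def slds_output(u, modes):
    """Scalar-state SLDS with equal mode durations and a fixed mode order."""
    seg = len(u) // len(modes)                 # 32 steps per mode for N = 128
    x, y = 0.0, []
    for t in range(1, len(u)):
        a, b, c = modes[min(t // seg, len(modes) - 1)]
        x = a * x + b * u[t - 1]
        y.append(c * x)
    return y

# Hypothetical (a, b, c) values for the four modes.
modes = [(0.9, 1.0, 1.0), (0.5, 0.5, 2.0), (-0.7, 1.5, 0.5), (0.2, 2.0, 1.0)]
rng = random.Random(0)
u = make_input(rng)
y = slds_output(u, modes)
```

Keeping some entries of the tuples identical across modes reproduces the data variations in which only a subset of the three matrices switches.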
#### 3.0.6 Dataset 2: Distorted speech
Figure 2: Visualization of the speech distortion set-up.
As a real-world case study, we consider audio denoising with time-varying, switched noise dynamics. We use the "surrounding" keyword subset of the MSWC dataset Mazumder et al. ([2024](https://arxiv.org/html/2605.15311#bib.bib15)), consisting of $500$ train, $100$ val, and $100$ test samples. Each sample is $1$ s long and is represented with $48000$ discrete-time points. These samples represent clean speech. We corrupt the clean speech with the four-mode SLDS noise from Section [3.0.5](https://arxiv.org/html/2605.15311#S3.SS0.SSS5). Figure [2](https://arxiv.org/html/2605.15311#S3.F2) provides a visualization of the dataset creation. Each noise cycle spans $128$ points, giving $375$ repetitions over the $48000$-point speech signal, mimicking recurring noise patterns such as those of rotating machinery. The initial SNR is set to $5$ dB. See Appendix [B.2](https://arxiv.org/html/2605.15311#A2.SS2) for more details.
The model takes the noise as input and learns to estimate the distorted speech; clean speech is then recovered by subtraction\. This mirrors active noise cancellation scenarios where the noise signal is accessible but clean speech is not\.
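The corruption step can be sketched as tiling one noise cycle over the clip and scaling it to a target SNR. This is an assumed procedure, since the exact scaling used in the paper is described only in its appendix; `tile_noise`, `mix_at_snr`, and the synthetic sine signals standing in for speech and noise are hypothetical.

```python
import math

def tile_noise(cycle, total_len):
    """Repeat one noise cycle until it covers the whole speech signal."""
    reps = total_len // len(cycle)
    return cycle * reps

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that 10*log10(P_speech / P_noise) equals snr_db."""
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(v * v for v in noise) / len(noise)
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    scaled = [gain * v for v in noise]
    return [s + v for s, v in zip(speech, scaled)], scaled

# Synthetic stand-ins for a 1 s clean-speech clip and one 128-point noise cycle.
speech = [math.sin(2.0 * math.pi * 440.0 * t / 48000) for t in range(48000)]
cycle = [math.sin(2.0 * math.pi * 3.0 * t / 128) for t in range(128)]
noise = tile_noise(cycle, len(speech))         # 48000 / 128 = 375 repetitions
noisy, scaled_noise = mix_at_snr(speech, noise, snr_db=5.0)
```

Given an estimate of `scaled_noise`, the clean speech is recovered by subtracting that estimate from `noisy`, which is the subtraction step described in the text.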
## 4 Results
Table 2: MSE for different Data (columns) and Model (rows) configurations. For the model parameters ($\bm{A}_{M},\bm{B}_{M},\bm{C}_{M}$) and data parameters ($\bm{A}_{D},\bm{B}_{D},\bm{C}_{D}$), 'o' represents time-varying/switched, while 'x' represents time-invariant/fixed, respectively. The reported values are averages rounded to one decimal place. Lower values of MSE are better.
### 4.1 Four mode system
Figure 3: Example sample from the four mode system task.
We refer to the data-generating SSM matrices as $\bm{A}_{D},\bm{B}_{D},\bm{C}_{D}$ and the network model's SSM matrices as $\bm{A}_{M},\bm{B}_{M},\bm{C}_{M}$.
Figure [3](https://arxiv.org/html/2605.15311#S4.F3) shows a sine input passed through an SLDS with switching $\bm{A}_{D},\bm{B}_{D},\bm{C}_{D}$. The time-varying model captures mode changes well, while the time-invariant model learns an average across modes.
Table [2](https://arxiv.org/html/2605.15311#S4.T2) examines combinations of switched and fixed $\bm{A}_{D},\bm{B}_{D},\bm{C}_{D}$ and time-varying and time-invariant model parameters $\bm{A}_{M},\bm{B}_{M},\bm{C}_{M}$. When none of $\bm{A}_{D},\bm{B}_{D},\bm{C}_{D}$ is switched (first column), all models achieve MSE $=0$. This is consistent with the fact that there are no time-varying dynamics in the data. The model with all of $\bm{A}_{M},\bm{B}_{M},\bm{C}_{M}$ time-varying (last row) matches or outperforms all other models across every data configuration. This is consistent with the fact that this is the most flexible model. When $\bm{A}_{M}$ is time-varying alongside at least one of $\bm{B}_{M}$ or $\bm{C}_{M}$ (oxo or oox), performance is similar to the fully time-varying case, suggesting that $\bm{B}_{M}$ and $\bm{C}_{M}$ need not both be time-varying simultaneously. When only $\bm{A}_{M}$ is time-varying (oxx), the model performs worse than when only $\bm{B}_{M}$ or $\bm{C}_{M}$ is time-varying (xox or xxo). This suggests that a time-varying $\bm{A}_{M}$ cannot compensate for switched $\bm{B}_{D}$ and/or $\bm{C}_{D}$, since the row 'oxx' has higher MSE values (at the MSE level $10^{-1}$) for all columns except its own 'oxx' column. On the other hand, the rows 'xox' for time-varying $\bm{B}_{M}$ and 'xxo' for time-varying $\bm{C}_{M}$ obtain relatively low values (at the MSE level $10^{-2}$) for all columns.
### 4.2 Distorted speech
We now discuss the distorted speech results, where 'time-varying' and 'time-invariant' refer to all of $\bm{A}_{M},\bm{B}_{M},\bm{C}_{M}$ being time-varying or time-invariant, respectively.
#### 4.2.1 Layer depth and activation function:
Table 3: SI-SNR for the speech denoising task for varying $h$ and fixed $n_{invar}=16$, $n_{vary}=4$, $K_{A}=K_{B}=K_{C}=4$.
Table [3](https://arxiv.org/html/2605.15311#S4.T3) compares time-varying and time-invariant models across architectures, with equal parameter counts in each row. The time-varying model consistently outperforms the time-invariant one, reaching up to $19.6$ dB vs. $14.2$ dB SI-SNR from a $5$ dB starting point. All time-invariant models plateau at $7.8\pm 0$ dB, consistent with learning an average across modes as visualized in Figure [4](https://arxiv.org/html/2605.15311#S4.F4).
With identity activation, adding layers or neurons yields no significant gains for either model. With GELU, two hidden layers consistently match or exceed one layer, suggesting that depth is beneficial only with nonlinear activation. Under one hidden layer, GELU improves the time-invariant model across all $h$ but offers no consistent gain for the time-varying model.
#### 4.2.2 Number of parameters:
Figure 4: Example sample from the speech-distortion task. The plots show a zoomed-in segment of the $1$ s speech sequence.
Table 4: SI-SNR for the speech denoising task for $h=512$ and varying $p$ such that $n_{invar}=Kn_{vary}$ and $K=K_{A}=K_{B}=K_{C}$.
With fixed $h$ and increasing parameters per neuron $p$ (Table [4](https://arxiv.org/html/2605.15311#S4.T4)), the time-varying model improves while the time-invariant model is unaffected. This suggests that additional parameters enhance the time-varying model's ability to capture non-stationary dynamics, but cannot compensate for the time-invariant model's lack of inherent flexibility.
#### 4.2.3 Basis functions allocation:
Table [5](https://arxiv.org/html/2605.15311#S4.T5) examines how to allocate a fixed basis function budget among $\bm{A}_{\text{coeff}},\bm{B}_{\text{coeff}},\bm{C}_{\text{coeff}}$. Allocating the entire budget to $K_{A}$ performs significantly worse than other configurations, while allocating it fully to $K_{B}$ or $K_{C}$ yields the best performance, consistent with Section [4.1](https://arxiv.org/html/2605.15311#S4.SS1). Equal splits across two or three matrices perform similarly to each other.
Table 5: SI-SNR for speech denoising with varying basis function allocation over neuron matrix parameters, with $n_{vary}=4$ and $h=512$.
## 5 Conclusions
We have proposed time\-varying SSM networks parameterized by a dictionary of basis functions with learnable coefficients\. Our investigations illustrate that the proposed models can learn and adapt to periodically switched dynamics, whereas time\-invariant SSMs cannot capture such switching behavior\. Extensions to non\-stationary settings with non\-periodic behavior, as well as exploring different families of basis functions, are important directions for future work\. Benchmarking against both simpler regression methods and more sophisticated architectures such as non\-stationary transformers would further clarify the trade\-offs of the proposed approach\.
## References
- D. Gedon, N. Wahlström, T. B. Schön, and L. Ljung (2021) Deep state space models for nonlinear system identification. IFAC-PapersOnLine 54(7), pp. 481–486. 19th IFAC Symposium on System Identification SYSID 2021. ISSN 2405-8963.
- Y. Grenier (1983) Time-dependent ARMA modeling of nonstationary signals. IEEE Trans. on Acoustics, Speech, and Signal Processing 31(4), pp. 899–911.
- A. Gu, T. Dao, S. Ermon, A. Rudra, and C. Ré (2020) HiPPO: recurrent memory with optimal polynomial projections. NeurIPS 33, pp. 1474–1487.
- A. Gu and T. Dao (2024) Mamba: linear-time sequence modeling with selective state spaces. In ICLR.
- A. Gu, K. Goel, A. Gupta, and C. Ré (2022a) On the parameterization and initialization of diagonal state space models. NeurIPS 35, pp. 35971–35983.
- A. Gu, K. Goel, and C. Ré (2022b) Efficiently modeling long sequences with structured state spaces. In ICLR.
- C. Hua, X. Cao, Q. Xu, B. Liao, and S. Li (2023) Dynamic neural network models for time-varying problem solving: a survey on model structures. IEEE Access 11, pp. 65991–66008.
- R. E. Kalman (1963) Mathematical description of linear dynamical systems. J. of the Society for Industrial and Applied Mathematics Series A Control 1(2), pp. 152–192.
- Y. Liu, H. Wu, J. Wang, and M. Long (2022) Non-stationary transformers: exploring the stationarity in time series forecasting. In NeurIPS.
- L. Ljung, C. Andersson, K. Tiels, and T. B. Schön (2020) Deep learning and system identification. IFAC-PapersOnLine 53(2), pp. 1175–1181. 21st IFAC World Congress. ISSN 2405-8963.
- L. Ljung (1987) System identification: theory for the user. Prentice Hall PTR.
- M. Mazumder, S. Chitlangia, C. Banbury, Y. Kang, J. M. Ciro, K. Achorn, D. Galvez, M. Sabini, P. Mattson, D. Kanter, et al. (2024) Multilingual spoken words corpus. In NeurIPS.
- M. Niedzwiecki (1988) Functional series modeling approach to identification of nonstationary stochastic systems. IEEE Trans. on Aut. Control 33(10), pp. 955–961.
- S. Paoletti, A. Lj. Juloski, G. Ferrari-Trecate, and R. Vidal (2007) Identification of hybrid systems a tutorial. European J. of Control 13(2), pp. 242–260. ISSN 0947-3580.
- J. L. Roux, S. Wisdom, H. Erdogan, and J. R. Hershey (2019) SDR – half-baked or well done?. In ICASSP, pp. 626–630.
- J. T. H. Smith, A. Warrington, and S. Linderman (2023) Simplified state space layers for sequence modeling. In ICLR.
- C. Subakan, M. Ravanelli, S. Cornell, and F. Grondin (2022) REAL-M: towards speech separation on real mixtures. In ICASSP, pp. 6862–6866.
- Z. Sun and S. S. Ge (2005) Switched linear systems: control and design. Springer.
- M. K. Tsatsanis and G. B. Giannakis (1993) Time-varying system identification and model validation using wavelets. IEEE Trans. on Signal Processing 41(12), pp. 3512–3523.
- G. J. J. van den Burg and C. K. I. Williams (2020) An evaluation of change point detection algorithms. arXiv:2003.06222.
- G\. Wichern, J\. Antognini, M\. Flynn, L\. R\. Zhu, E\. McQuinn, D\. Crow, E\. Manilow, and J\. L\. Roux \(2019\)Wham\!: extending speech separation to noisy environments\.arXiv:1907\.01160\.Cited by:[§3\.0\.3](https://arxiv.org/html/2605.15311#S3.SS0.SSS3.p1.9)\.
- J\. Yik and et\. al\. \(2025\)The neurobench framework for benchmarking neuromorphic computing algorithms and systems\.Nature Communications16,pp\. 1545\.External Links:ISSN 2041\-1723Cited by:[§2\.4](https://arxiv.org/html/2605.15311#S2.SS4.p3.1)\.
- R\. Zou, H\. Wang, and K\. H\. Chon \(2003\)A robust time\-varying identification algorithm using basis functions\.Annals of Biomedical Engineering31,pp\. 840–853\.Cited by:[§1](https://arxiv.org/html/2605.15311#S1.p5.1)\.
## Appendix A Four mode SLDS
### A.1 Model Definition
We construct an SLDS with four modes, each a linear SSM with $n=4$, $n_{in}=1$, $n_{out}=1$:

$$
\begin{aligned}
\text{System 1: } \bm{A}_1 &= \phantom{-}\mathrm{Diag}(0.9, 0.8, 0.9, 0.8), \quad \bm{B}_1 = \phantom{-}[0.9, 0.8, 0.9, 0.8]^T, \quad \bm{C}_1 = \phantom{-}[0.1, 0.2, 0.1, 0.2]. \\
\text{System 2: } \bm{A}_2 &= -\mathrm{Diag}(0.1, 0.2, 0.1, 0.2), \quad \bm{B}_2 = -[0.9, 0.8, 0.9, 0.8]^T, \quad \bm{C}_2 = -[0.5, 0.7, 0.7, 0.5]. \\
\text{System 3: } \bm{A}_3 &= -\mathrm{Diag}(0.9, 0.8, 0.9, 0.8), \quad \bm{B}_3 = -[0.1, 0.2, 0.1, 0.2]^T, \quad \bm{C}_3 = -[0.1, 0.2, 0.1, 0.2]. \\
\text{System 4: } \bm{A}_4 &= \phantom{-}\mathrm{Diag}(0.1, 0.2, 0.1, 0.2), \quad \bm{B}_4 = \phantom{-}[0.1, 0.2, 0.1, 0.2]^T, \quad \bm{C}_4 = \phantom{-}[0.9, 0.8, 0.9, 0.8].
\end{aligned}
$$

The system transitions sequentially $(1 \rightarrow 2 \rightarrow 3 \rightarrow 4)$ with equal mode durations.
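As an illustration, the four-mode system above can be simulated in a few lines. This is a minimal sketch, not the authors' data generator: the recurrence convention $x_t = \bm{A}_m x_{t-1} + \bm{B}_m u_t$, $y_t = \bm{C}_m x_t$, the zero initial state, the Gaussian input, and the helper names `mode_at` and `simulate_slds` are all assumptions made here for illustration.

```python
import random

# Mode parameters copied from Appendix A.1; each diagonal A is stored as a list.
A = [
    [0.9, 0.8, 0.9, 0.8],
    [-0.1, -0.2, -0.1, -0.2],
    [-0.9, -0.8, -0.9, -0.8],
    [0.1, 0.2, 0.1, 0.2],
]
B = [
    [0.9, 0.8, 0.9, 0.8],
    [-0.9, -0.8, -0.9, -0.8],
    [-0.1, -0.2, -0.1, -0.2],
    [0.1, 0.2, 0.1, 0.2],
]
C = [
    [0.1, 0.2, 0.1, 0.2],
    [-0.5, -0.7, -0.7, -0.5],
    [-0.1, -0.2, -0.1, -0.2],
    [0.9, 0.8, 0.9, 0.8],
]


def mode_at(t, seq_len, num_modes=4):
    """Zero-based index of the active mode at step t (equal mode durations)."""
    return min(t * num_modes // seq_len, num_modes - 1)


def simulate_slds(u):
    """Simulate the switching system on a scalar input sequence u.

    Assumed convention: x_t = A_m x_{t-1} + B_m u_t, y_t = C_m x_t,
    with a zero initial state and modes visited in the order 1, 2, 3, 4.
    """
    seq_len = len(u)
    x = [0.0] * 4  # assumed zero initial state
    y = []
    for t, u_t in enumerate(u):
        m = mode_at(t, seq_len)
        # A_m is diagonal, so the state update is elementwise.
        x = [a * x_i + b * u_t for a, x_i, b in zip(A[m], x, B[m])]
        y.append(sum(c * x_i for c, x_i in zip(C[m], x)))
    return y


# Hypothetical usage: a white Gaussian input of length 400 (100 steps per mode).
rng = random.Random(0)
u = [rng.gauss(0.0, 1.0) for _ in range(400)]
y = simulate_slds(u)
```

Because every $\bm{A}_i$ is diagonal with entries of magnitude below one, each mode is stable on its own, and the update reduces to an elementwise product.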
### A.2 Dataset creation
Each of $(\bm{A}_D, \bm{B}_D, \bm{C}_D)$ can switch or remain fixed independently (Table [2](https://arxiv.org/html/2605.15311#S4.T2)). In the ‘ooo’ case, all three matrices switch and mode $i$ uses $(\bm{A}_i, \bm{B}_i, \bm{C}_i)$. In the ‘xxx’ case, each matrix is fixed to one of the four mode matrices; there are $4^3=64$ such combinations, over which we report the mean. In partial cases such as ‘oxx’, only $\bm{A}_D$ switches while $\bm{B}_D, \bm{C}_D$ are fixed to one of $16$ possible $(\bm{B}_j, \bm{C}_k)$ pairs. Note that partially switching cases are not subsets of ‘ooo’, since they can combine matrices from different modes, such as $(\bm{A}_1, \bm{B}_2, \bm{C}_3)$; the fully switching system is therefore not necessarily the hardest to predict.
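The counts above can be checked by enumerating the configurations directly. The sketch below is an illustration under stated assumptions: the three letters of a pattern are taken to refer to $\bm{A}_D$, $\bm{B}_D$, $\bm{C}_D$ in that order (only the meaning of ‘oxx’ is given explicitly in the text), and the function name `configurations` is hypothetical.

```python
from itertools import product

MODES = (1, 2, 3, 4)


def configurations(pattern):
    """Enumerate matrix-index schedules for a pattern such as 'ooo', 'oxx', 'xxx'.

    'o' means the matrix switches with the mode, 'x' means it stays fixed.
    Assumed letter order: (A, B, C). Each configuration is a list of four
    (a, b, c) index triples, one per mode, naming which A_a, B_b, C_c is used.
    """
    fixed_slots = [i for i, flag in enumerate(pattern) if flag == "x"]
    configs = []
    # One configuration per choice of mode index for every fixed matrix.
    for choice in product(MODES, repeat=len(fixed_slots)):
        fixed = dict(zip(fixed_slots, choice))
        configs.append(
            [tuple(fixed.get(i, mode) for i in range(3)) for mode in MODES]
        )
    return configs


counts = {p: len(configurations(p)) for p in ("ooo", "oxx", "xxx")}
```

Here `counts` evaluates to one configuration for ‘ooo’, 16 for ‘oxx’, and 64 for ‘xxx’, matching the numbers in the text. Most ‘oxx’ schedules contain triples with mismatched indices, which never occur in the ‘ooo’ schedule.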
## Appendix B Model and training details
### B.1 Four mode system
We use a single hidden layer with $h=16$ neurons, one input and one output channel. The time-varying model uses $n_{vary}=32$, $K_A=K_B=K_C=16$. The time-invariant model uses $n_{invar}=32$ and $K=1$ by default. Models are trained with MSE loss for $200$ epochs, batch size $64$, with a linear warmup ($5\%$ of epochs) followed by cosine learning rate decay.
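The learning rate schedule described above can be sketched as follows. The peak and minimum learning rates are not given in this appendix, so `peak_lr` and `min_lr` are placeholder values, and updating the rate once per epoch rather than once per step is likewise an assumption.

```python
import math


def learning_rate(epoch, total_epochs=200, warmup_frac=0.05,
                  peak_lr=1e-3, min_lr=0.0):
    """Linear warmup followed by cosine decay, evaluated once per epoch.

    peak_lr and min_lr are hypothetical placeholders, not values from the paper.
    """
    warmup_epochs = max(1, round(warmup_frac * total_epochs))
    if epoch < warmup_epochs:
        # Ramp linearly so that the last warmup epoch reaches peak_lr.
        return peak_lr * (epoch + 1) / warmup_epochs
    # Fraction of the decay phase completed, starting at 0 right after warmup.
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [learning_rate(e) for e in range(200)]
```

With $200$ epochs and a $5\%$ warmup, the first $10$ epochs ramp up linearly and the remaining $190$ follow the cosine curve.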
### B.2 Distorted speech
Architecture choices ($L$, $h$, $n$, $K$) depend on the task and are specified in the corresponding sections and tables. Hyperparameter optimization (HPO) uses $200$ random-search trials on a single-layer network with $h=512$ and identity activation, with $n_{vary}=4$, $K=4$ for the time-varying model and $n_{invar}=16$ for the time-invariant model, selecting the best configuration by validation performance (Table [6](https://arxiv.org/html/2605.15311#A2.T6)). Models are trained with MSE loss for one epoch, with the $48000$-step signal divided into $128$-step segments ($\approx 2.7$ ms), mimicking an online/adaptive-filtering setting. All metrics are averaged over $10$ runs.
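The segmentation arithmetic can be made concrete with a short sketch. It assumes consecutive, non-overlapping segments and a 48 kHz sampling rate; the rate is not stated in this appendix but is consistent with the quoted segment duration. The function name `split_into_segments` and the all-zero placeholder signal are hypothetical.

```python
def split_into_segments(signal, segment_len=128):
    """Split a sequence into consecutive, non-overlapping segments."""
    if len(signal) % segment_len != 0:
        raise ValueError("signal length must be a multiple of segment_len")
    return [signal[i:i + segment_len] for i in range(0, len(signal), segment_len)]


signal = [0.0] * 48000  # placeholder for one distorted-speech signal
segments = split_into_segments(signal)

sample_rate_hz = 48000  # assumed sampling rate
segment_ms = 1000.0 * 128 / sample_rate_hz  # duration of one segment in ms
```

Under these assumptions each signal yields $48000 / 128 = 375$ segments of roughly $2.67$ ms each, so in the one-epoch setting the model sees every segment exactly once.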
Table 6: HPO ranges and chosen values for the distorted-speech task.